How can I create or design a table to query data from various file formats in S3 effectively?

292 Asked by CsabaToth in AWS , Asked on Mar 5, 2024

I am a data engineer and I am currently engaged in a particular task that is related to setting up a data lake with Amazon Athena to query data stored in various file formats such as CSV, and JSON parquet in Amazon S3. Explain to me how can I create or design a table to query data from various file formats in S3 effectively.

Answered by Deepak Mistry

In the context of AWS, you can create a table in Athena to query data from various file sources such as CSV, JSON, and parquet files stored in S3 by using the following approach:-

Identification of file format

You should begin by identifying the file format of the data stored in S3. You should try to determine whether the data is in CSV, JSON, or parquet.

Define table schemas

You should define table schemas by using the “CREATE TABLE” statement in Athena. You would need to specify the names of columns and data types according to the structure of the data files.

Example: Creating a table for CSV files

CREATE EXTERNAL TABLE IF NOT EXISTS my_csv_table (

    Column1_name data_type,

    Column2_name data_type,

    …

)

ROW FORMAT DELIMITED

FIELDS TERMINATED BY ‘,’

LOCATION ‘s3://your-bucket/path/to/csv-files/’;

Example: Creating a table for JSON files

CREATE EXTERNAL TABLE IF NOT EXISTS my_json_table (

    Column1_name data_type,

    Column2_name data_type,

    …

)

ROW FORMAT SERDE ‘org.openx.data.jsonserde.JsonSerDe’

WITH SERDEPROPERTIES (

    ‘serialization.format’ = ‘1’

)

LOCATION ‘s3://your-bucket/path/to/json-files/’;

Example: Creating a table for Parquet files

CREATE EXTERNAL TABLE IF NOT EXISTS my_parquet_table (

    Column1_name data_type,

    Column2_name data_type,

    …

)

STORED AS PARQUET

LOCATION ‘s3://your-bucket/path/to/parquet-files/’;

How can I create or design a table to query data from various file formats in S3 effectively?

Your Answer