Querying Data in S3 Using Amazon Athena and Python
Table of Contents
- Introduction
- What is Athena?
- Querying Data in S3 with Amazon Athena
  - Prepare Your Data in S3
  - Set Up an Athena Table
  - Run SQL Queries
  - Analyze Query Results
- Querying Data in S3 with Python
  - Install Boto3
  - Set Up AWS Credentials
  - Initialize a Session and Create a Database
  - Create a Table with DDL
  - Query the Data
- Conclusion
Introduction
Amazon S3 is a popular storage service for data lakes, and Amazon Athena is a powerful tool for querying data stored in S3 using SQL. This blog will guide you through the process of querying data in S3 using Athena and demonstrate how to perform the same queries using Python and Boto3.
What is Athena?
Amazon Athena is a powerful serverless query engine that enables you to run SQL queries on data stored in S3. Athena is versatile and capable of handling unstructured, semi-structured, and structured data formats such as CSV, JSON, and Parquet. This makes it an ideal tool for performing ad-hoc queries on data that isn’t in a traditional database, as well as for data exploration and analysis.
Athena charges based on the amount of data scanned per query, and there are no upfront costs or infrastructure to set up. However, due to the cost structure, extensive querying on large datasets can become expensive. Therefore, Athena is particularly well-suited for occasional or ad-hoc querying.
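To put the pricing model in concrete terms (rates vary by region, so treat this as an illustration): at the commonly cited rate of $5 per terabyte scanned, a query that scans 100 GB of data costs about $0.50. Compressed, columnar formats such as Parquet reduce the amount of data scanned per query, and therefore the cost.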
Athena can be utilized in various scenarios, including:
- Log Analysis: Querying log files stored in S3 to identify errors.
- Data Quality Checks: Investigating data in S3 to detect quality issues.
- Intermediate Data Queries: Accessing data extracted from other sources and stored temporarily in S3 before transformation or database loading.
- Archival Data Access: Retrieving and querying archival data stored in S3.
Querying Data in S3 with Amazon Athena
Amazon Athena is an interactive query service that allows you to analyze data directly in S3 using standard SQL. It is serverless, meaning you don’t need to manage any infrastructure, and you only pay for the queries you run.
Steps to Query Data in S3 Using Athena:
- Prepare Your Data in S3:
Ensure your data is stored in S3 in a structured format such as CSV, JSON, Parquet, or ORC. (If you need sample data to follow along, see the upload sketch just after this list.)
- Set Up an Athena Table:
Use the Athena console or AWS CLI to create a table that references your data in S3. Define the schema of your data, specifying the column names and data types. Example SQL to create a table:

```sql
CREATE EXTERNAL TABLE IF NOT EXISTS my_table (
  column1 STRING,
  column2 INT,
  column3 DOUBLE
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
  'separatorChar' = ',',
  'quoteChar' = '"'
)
LOCATION 's3://your-bucket/path/to/data/';
```
- Run SQL Queries:
Execute SQL queries on your data using the Athena console or AWS CLI. For example, to select all records from the table:

```sql
SELECT * FROM my_table;
```
- Analyze Query Results:
View and analyze the results directly in the Athena console or download them for further processing.
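If you want a small dataset to experiment with, here is a minimal sketch of uploading a local CSV file with boto3 (the file name data.csv, the bucket your-bucket, and the key prefix are placeholders; match them to the LOCATION clause of your table):

```python
import boto3

# Upload a local CSV to the S3 prefix that the Athena table's LOCATION points at.
# The bucket name, key prefix, and file name below are placeholders.
s3 = boto3.client('s3')
s3.upload_file('data.csv', 'your-bucket', 'path/to/data/data.csv')
```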
Querying Data in S3 with Python
To query data in S3 using Athena through Python, you can use the boto3 library, which is the AWS SDK for Python. This allows you to programmatically execute Athena queries and retrieve results.
Steps to Query Data in S3 Using Python:
- Install Boto3:
Ensure you have boto3 installed in your Python environment. If not, you can install it using pip:

```bash
pip install boto3
```
- Set Up AWS Credentials:
Configure your AWS credentials using the AWS CLI or by setting environment variables (see the sketch below).
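Note that the snippets in this post pass access keys directly to boto3.Session for clarity; in practice it is safer to let boto3 resolve credentials on its own from environment variables, the shared credentials file written by aws configure, or an attached IAM role. A minimal sketch of that approach (the region name is a placeholder):

```python
import boto3

# With credentials configured via `aws configure` or the environment variables
# AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY, boto3 finds them automatically,
# so no secrets need to appear in the code.
session = boto3.Session()
athena = session.client('athena', region_name='us-east-1')  # placeholder region
```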
- Initialize a Session and Create a Database:
First, create a session with AWS using boto3 and set up a database in Athena if it doesn’t already exist.

```python
import boto3
import time

# Initialize a session using Amazon Athena
session = boto3.Session(
    aws_access_key_id='YOUR_ACCESS_KEY',
    aws_secret_access_key='YOUR_SECRET_KEY',
    region_name='YOUR_REGION'
)
athena = session.client('athena')

# Define the query to create a database
create_database_query = """
CREATE DATABASE IF NOT EXISTS my_database
"""

# Execute the query to create the database
response = athena.start_query_execution(
    QueryString=create_database_query,
    ResultConfiguration={'OutputLocation': 's3://your-bucket/query-results/'}
)

# Wait for the query to complete
query_execution_id = response['QueryExecutionId']
status = 'RUNNING'
while status in ['RUNNING', 'QUEUED']:
    response = athena.get_query_execution(QueryExecutionId=query_execution_id)
    status = response['QueryExecution']['Status']['State']
    if status in ['FAILED', 'CANCELLED']:
        raise Exception(f'Query failed or was cancelled: {response}')
    time.sleep(5)

print("Database created successfully.")
```
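The polling loop above reappears for every query in this post. As one possible refactor (the wait_for_query helper below is my own naming, not part of boto3), it can be wrapped in a reusable function:

```python
import time

def wait_for_query(athena, query_execution_id, poll_seconds=5):
    """Block until an Athena query finishes; raise if it fails or is cancelled."""
    while True:
        response = athena.get_query_execution(QueryExecutionId=query_execution_id)
        state = response['QueryExecution']['Status']['State']
        if state == 'SUCCEEDED':
            return response
        if state in ['FAILED', 'CANCELLED']:
            raise Exception(f'Query failed or was cancelled: {response}')
        time.sleep(poll_seconds)
```

Each "wait for the query to complete" block below could then be replaced with a single call such as wait_for_query(athena, response['QueryExecutionId']).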
- Create a Table with DDL:
Next, create a table within the database.

```python
# Define the query to create a table
create_table_query = """
CREATE EXTERNAL TABLE IF NOT EXISTS my_database.my_table (
    id INT,
    name STRING,
    age INT,
    city STRING
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
WITH SERDEPROPERTIES ('serialization.format' = '1')
LOCATION 's3://your-bucket/path/to/data/';
"""

# Execute the query to create the table
response = athena.start_query_execution(
    QueryString=create_table_query,
    QueryExecutionContext={'Database': 'my_database'},
    ResultConfiguration={'OutputLocation': 's3://your-bucket/query-results/'}
)

# Wait for the query to complete
query_execution_id = response['QueryExecutionId']
status = 'RUNNING'
while status in ['RUNNING', 'QUEUED']:
    response = athena.get_query_execution(QueryExecutionId=query_execution_id)
    status = response['QueryExecution']['Status']['State']
    if status in ['FAILED', 'CANCELLED']:
        raise Exception(f'Query failed or was cancelled: {response}')
    time.sleep(5)

print("Table created successfully.")
```
- Query the Data:
Now, you can query the table you just created.

```python
# Define the query to select data from the table
select_query = "SELECT * FROM my_database.my_table"

# Execute the query to select data
response = athena.start_query_execution(
    QueryString=select_query,
    QueryExecutionContext={'Database': 'my_database'},
    ResultConfiguration={'OutputLocation': 's3://your-bucket/query-results/'}
)

# Get the query execution ID
query_execution_id = response['QueryExecutionId']

# Wait for the query to complete
status = 'RUNNING'
while status in ['RUNNING', 'QUEUED']:
    response = athena.get_query_execution(QueryExecutionId=query_execution_id)
    status = response['QueryExecution']['Status']['State']
    if status in ['FAILED', 'CANCELLED']:
        raise Exception(f'Query failed or was cancelled: {response}')
    time.sleep(5)

# Fetch the results
result_response = athena.get_query_results(QueryExecutionId=query_execution_id)
rows = result_response['ResultSet']['Rows']

# Process and print the results
for row in rows:
    print(row)

print("Data queried successfully.")
```
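Two practical notes on reading results: for a SELECT query, the first row returned by get_query_results holds the column headers rather than data, and a single call returns at most 1,000 rows. For larger result sets, here is a sketch using boto3's built-in paginator:

```python
# Page through all result rows; the paginator follows NextToken automatically.
paginator = athena.get_paginator('get_query_results')
header_skipped = False
for page in paginator.paginate(QueryExecutionId=query_execution_id):
    for row in page['ResultSet']['Rows']:
        if not header_skipped:
            header_skipped = True  # the first row contains the column names
            continue
        # Each cell is a dict like {'VarCharValue': '...'}
        print([cell.get('VarCharValue') for cell in row['Data']])
```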
Conclusion
Amazon Athena provides a powerful and flexible way to query data stored in S3 using SQL, while Python and boto3 offer a programmatic interface for executing those queries and processing the results. By combining Athena and Python, you can seamlessly automate and integrate data analysis tasks into your applications and workflows.