In this blog, you will learn how to set up boto3 and automate your data engineering workflow using Amazon Athena.
- What is Amazon Athena?
- With Amazon Athena, you can query data instantly, get results in seconds, and pay only for the queries you run.
- It makes it easy to analyze data in Amazon S3 using SQL.
- You simply point Athena at your data in Amazon S3, define the schema, and start querying with standard SQL.
- S3 file system
- S3 allows users to easily manage data and access control.
- You can use it to store and protect any amount of data.
- You can store CSV and Parquet files in S3 and query them directly with Amazon Athena.
- Boto3 installation
- Type cmd in the Windows search bar and open the command prompt.
- Enter the "pip install boto3" command.
- Open a Jupyter notebook and test the installation with the "import boto3" command, as in the sketch below.
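- To confirm that both the installation and your AWS credentials work, you can print the library version and list your S3 buckets. A minimal sketch (it assumes credentials are already configured, e.g. via aws configure):

import boto3

print(boto3.__version__)  # confirms the import works and shows the installed version

# Listing S3 buckets is a quick end-to-end check of your AWS credentials.
for bucket in boto3.resource('s3').buckets.all():
    print(bucket.name)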
- Creating Athena tables using boto3
- Create a new directory in the S3 bucket and place your raw files in it, as sketched below.
- e.g. bucket_name/new_directory/Raw_input_files.csv
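- A minimal boto3 sketch of that upload (the bucket, directory, and file names below are the placeholders from the example path above):

import boto3

s3 = boto3.resource('s3')

# Upload the local raw file under the new "directory" (an S3 key prefix).
s3.Bucket('bucket_name').upload_file(
    'Raw_input_files.csv',                # local file
    'new_directory/Raw_input_files.csv')  # destination key in the bucket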
- Before you run your first query, you need to set up a location in S3 for the query results; you can create the bucket with boto3, as sketched below.
- Example: s3://query-results-bucket/folder/
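- If the results bucket does not exist yet, a minimal sketch of creating it with boto3 (outside us-east-1 you must also pass a CreateBucketConfiguration with your region):

import boto3

s3_client = boto3.client('s3')

# Bucket that will hold Athena's query results.
# Outside us-east-1, add:
#   CreateBucketConfiguration={'LocationConstraint': '<your-region>'}
s3_client.create_bucket(Bucket='query-results-bucket')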
- Run the code below to create a table in Athena using boto3.
import boto3  # Python library to interface with S3 and Athena

s3 = boto3.resource('s3')        # S3 resource
client = boto3.client('athena')  # Athena client

database = 'database_name'  # database name
s3_output = 's3://query-results-bucket/output_folder/'  # query results location

# DDL for an external table over the raw CSV files in S3.
query = """
CREATE EXTERNAL TABLE database_name.table1 (
    `ID` int,
    `Name` string,
    `Address` string)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LOCATION 's3://query-results-bucket/input_folder/';
"""

response = client.start_query_execution(
    QueryString=query,
    QueryExecutionContext={'Database': database},
    ResultConfiguration={'OutputLocation': s3_output}
)
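
- start_query_execution only submits the query and returns a QueryExecutionId; the query itself runs asynchronously. A minimal sketch of waiting for it to finish and then reading the output (it reuses the client and response variables above; note that a CREATE TABLE statement returns an empty result set, so the row loop mainly matters for SELECT queries):

import time

execution_id = response['QueryExecutionId']

# Poll until the query leaves the QUEUED/RUNNING states.
while True:
    status = client.get_query_execution(QueryExecutionId=execution_id)
    state = status['QueryExecution']['Status']['State']
    if state in ('SUCCEEDED', 'FAILED', 'CANCELLED'):
        break
    time.sleep(1)

if state == 'SUCCEEDED':
    # The first row of the result set holds the column headers.
    results = client.get_query_results(QueryExecutionId=execution_id)
    for row in results['ResultSet']['Rows']:
        print([col.get('VarCharValue') for col in row['Data']])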