Athena
💡 Definition
Amazon Athena is an interactive query service that makes it easy to analyze data stored in Amazon S3 using standard SQL. Athena is serverless, so there is no infrastructure to manage, and you pay only for the queries that you run.
🔑 Key Concepts
- Serverless: No servers to provision or manage.
- Direct S3 Querying: You don't need to load data into a database first; you query files (CSV, JSON, Parquet) sitting directly in S3.
- Standard SQL: Supports standard SQL syntax; the query engine is built on the open-source Presto and Trino projects.
- Integration with Glue: Uses the AWS Glue Data Catalog to store metadata (schema) about your data.
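To make "query files in place" concrete, here is a minimal sketch that builds the kind of `CREATE EXTERNAL TABLE` statement Athena accepts. The helper name `build_create_table_ddl`, the table `web_logs`, its columns, and the bucket `example-bucket` are all hypothetical, chosen only for illustration. The statement registers schema metadata in the catalog and points at an S3 prefix; it does not copy or load any data.

```python
def build_create_table_ddl(table, columns, s3_location):
    """Return a CREATE EXTERNAL TABLE statement for Parquet files in S3.

    `columns` is a list of (name, type) pairs. The statement only registers
    metadata that points at the files; the data stays where it is in S3.
    """
    column_lines = ",\n".join(
        f"  `{name}` {col_type}" for name, col_type in columns
    )
    return (
        f"CREATE EXTERNAL TABLE IF NOT EXISTS {table} (\n"
        f"{column_lines}\n"
        f")\n"
        f"STORED AS PARQUET\n"
        f"LOCATION '{s3_location}'"
    )


# Hypothetical table, columns, and bucket, for illustration only.
ddl = build_create_table_ddl(
    "web_logs",
    [("request_time", "string"), ("status_code", "int"), ("bytes_sent", "bigint")],
    "s3://example-bucket/web-logs/",
)
print(ddl)
```

Running a statement like this in Athena creates the table definition in the Glue Data Catalog, after which the files under that prefix can be queried with ordinary `SELECT` statements.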
⚙️ How it Works
- Define Schema: Use the Glue Data Catalog to define the table structure of your S3 data.
- Write SQL: Write a standard SQL query in the Athena console.
- Run: Athena scans the files in S3 and returns the results.
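The same steps can also be driven from code instead of the console. The sketch below assumes a boto3 Athena client is passed in as `client`; the function name `run_athena_query` and the database and bucket names in the usage comment are hypothetical. It starts a query, polls until Athena reports a final state, and returns the first page of results.

```python
import time


def run_athena_query(client, sql, database, output_location, poll_seconds=1.0):
    """Start a query, wait for it to finish, and return rows as lists of strings.

    `client` is expected to behave like boto3.client("athena").
    Only the first page of results is fetched in this sketch.
    """
    started = client.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={"Database": database},
        # Athena writes query results to this S3 location.
        ResultConfiguration={"OutputLocation": output_location},
    )
    query_id = started["QueryExecutionId"]

    # Poll until the query reaches a final state.
    while True:
        status = client.get_query_execution(QueryExecutionId=query_id)
        state = status["QueryExecution"]["Status"]["State"]
        if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
            break
        time.sleep(poll_seconds)

    if state != "SUCCEEDED":
        raise RuntimeError(f"Query {query_id} ended in state {state}")

    results = client.get_query_results(QueryExecutionId=query_id)
    return [
        [cell.get("VarCharValue") for cell in row["Data"]]
        for row in results["ResultSet"]["Rows"]
    ]


# Usage sketch (requires boto3 and AWS credentials; names are hypothetical):
#   import boto3
#   rows = run_athena_query(
#       boto3.client("athena"),
#       "SELECT status_code, COUNT(*) FROM web_logs GROUP BY status_code",
#       "example_db",
#       "s3://example-bucket/athena-results/",
#   )
```

For a `SELECT` query, the first returned row typically holds the column names. Passing the client in as a parameter keeps the function easy to try out with a stub before pointing it at a real account.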
🎯 Use Cases
- Ad-hoc Analysis: Quickly checking logs or data files in S3 without setting up a warehouse.
- Log Analysis: Querying CloudTrail, CloudFront, or VPC Flow Logs stored in S3.
💰 Pricing Model
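- Per Query: Charged for the amount of data each query scans (about $5 per TB; the exact rate can vary by Region).
- Cost Tip: Compressing data (for example, GZIP) and using columnar formats (for example, Parquet) means each query reads fewer bytes, which reduces costs significantly.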
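A quick worked example shows why scanning less data matters. This sketch makes several assumptions: the $5 per TB rate from above, binary units (1 TB = 1024^4 bytes), and the per-query rounding described in AWS's published pricing (bytes rounded up to the nearest MB, with a 10 MB minimum), all of which should be checked against the current pricing page. The function name `estimate_query_cost` and the 5% figure are hypothetical.

```python
import math

MB = 1024 ** 2
TB = 1024 ** 4  # Assumption: binary units.


def estimate_query_cost(bytes_scanned, price_per_tb=5.0, minimum_mb=10):
    """Estimate the cost in USD of one query from the bytes it scans.

    Assumes bytes are rounded up to the next MB and that each query is
    billed for at least `minimum_mb` megabytes.
    """
    billed_mb = max(math.ceil(bytes_scanned / MB), minimum_mb)
    return billed_mb * MB / TB * price_per_tb


# Scanning a full 1 TB of raw CSV.
raw_csv_cost = estimate_query_cost(1 * TB)

# Hypothetical: compressed, columnar data lets the same query read
# only about 5% of the bytes.
optimized_cost = estimate_query_cost(TB // 20)

print(f"raw CSV:   ${raw_csv_cost:.2f}")
print(f"optimized: ${optimized_cost:.2f}")
```

The actual savings depend on how well the data compresses and how many columns a query touches, so the 5% here is only a placeholder.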
📝 Exam Tips (CLF-C02)
- Serverless SQL queries on S3.
- Pay per query (data scanned).
- Great for ad-hoc analysis.