Glue

aws/analytics aws/serverless aws/service

💡 Definition

AWS Glue is a serverless data integration service that makes it easy to discover, prepare, and combine data for analytics, machine learning, and application development. It is primarily an ETL (Extract, Transform, Load) service.

🔑 Key Concepts

ETL: Extracts data from sources, transforms it (cleans, normalizes), and loads it into destinations.
Serverless: No infrastructure to provision or manage; Glue allocates compute when a job or crawler runs and releases it afterward.
Data Catalog: A central repository to store structural and operational metadata for all your data assets.
Crawlers: Automatically discover data schema and populate the Data Catalog.
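
To make the ETL concept above concrete, here is a minimal sketch of the three stages in plain Python. This is not Glue code: the sample CSV, the `sales` table, and the in-memory SQLite destination are all assumptions chosen so the example runs anywhere. In Glue, the same stages would run as a managed Spark job against sources such as S3.

```python
import csv
import io
import sqlite3

# Hypothetical raw input with messy whitespace, casing, and a missing value.
RAW_CSV = """name,amount
 Alice ,10.5
bob,
CAROL,7
"""


def extract(text):
    """Extract: read raw rows from the source (here, a CSV string)."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: drop rows with no amount, normalize names, cast types."""
    cleaned = []
    for row in rows:
        amount = (row["amount"] or "").strip()
        if not amount:
            continue  # clean: skip incomplete records
        cleaned.append((row["name"].strip().title(), float(amount)))
    return cleaned


def load(rows, conn):
    """Load: write the cleaned rows into the destination table."""
    conn.execute("CREATE TABLE IF NOT EXISTS sales (name TEXT, amount REAL)")
    conn.executemany("INSERT INTO sales VALUES (?, ?)", rows)
    conn.commit()


conn = sqlite3.connect(":memory:")
load(transform(extract(RAW_CSV)), conn)
result = conn.execute("SELECT name, amount FROM sales ORDER BY name").fetchall()
print(result)  # [('Alice', 10.5), ('Carol', 7.0)]
```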

⚙️ How it Works

Crawler: Scans your data (e.g., in S3) and creates table definitions in the Data Catalog.
Job: You write a Spark script in Python (PySpark) or Scala, or use the visual editor, to define the transformations.
Trigger: Starts the job on a schedule, on demand, or in response to an event, so that data moves to a destination (e.g., Redshift).
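
The three steps above map onto three Glue API calls. The sketch below builds the arguments for boto3's `create_crawler`, `create_job`, and `create_trigger`. Every name, the role ARN, the S3 paths, and the cron expression are placeholders invented for illustration, and real jobs usually need further settings (worker type, Glue version, connections) that are omitted here.

```python
# Sketch only: all names, ARNs, and S3 paths are hypothetical placeholders.


def crawler_params(name, role_arn, database, s3_path):
    """Step 1: a crawler that scans an S3 path and writes tables to the Data Catalog."""
    return {
        "Name": name,
        "Role": role_arn,
        "DatabaseName": database,
        "Targets": {"S3Targets": [{"Path": s3_path}]},
    }


def job_params(name, role_arn, script_location):
    """Step 2: a Spark ETL job whose script is stored in S3."""
    return {
        "Name": name,
        "Role": role_arn,
        "Command": {
            "Name": "glueetl",  # Spark ETL job type
            "ScriptLocation": script_location,
            "PythonVersion": "3",
        },
    }


def trigger_params(name, job_name, cron):
    """Step 3: a scheduled trigger that starts the job."""
    return {
        "Name": name,
        "Type": "SCHEDULED",
        "Schedule": cron,
        "Actions": [{"JobName": job_name}],
        "StartOnCreation": True,
    }


def deploy(glue):
    """Create all three resources; `glue` is expected to be boto3.client("glue")."""
    role = "arn:aws:iam::123456789012:role/ExampleGlueRole"  # placeholder
    glue.create_crawler(
        **crawler_params("sales-crawler", role, "sales_db", "s3://example-bucket/raw/sales/")
    )
    glue.create_job(
        **job_params("sales-to-redshift", role, "s3://example-bucket/scripts/sales_etl.py")
    )
    glue.create_trigger(
        **trigger_params("nightly-sales", "sales-to-redshift", "cron(0 2 * * ? *)")
    )
```

To run it against a real account, you would call `deploy(boto3.client("glue"))` with credentials and an IAM role that Glue can assume. Keeping the parameter builders separate from the client calls makes them easy to inspect without touching AWS.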

🎯 Use Cases

Data Lake creation: Preparing data for analysis in S3.
Data Warehouse loading: Loading cleaned data into Redshift.
Cataloging: Organizing metadata for use by Athena or EMR.

💰 Pricing Model

DPU-hours: Charged per DPU-hour, billed by the second, for the Data Processing Units used to run your ETL jobs and crawlers.
Catalog Storage: Small monthly fee for storing metadata objects beyond the free tier.
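
As a rough worked example of the DPU-hour model, the sketch below estimates the cost of one job run. The rate of $0.44 per DPU-hour and the 60-second billing minimum are assumptions for illustration only; actual rates and minimums vary by Region, job type, and Glue version, so check the current pricing page.

```python
def glue_job_cost(dpus, runtime_seconds, rate_per_dpu_hour, minimum_seconds=60):
    """Estimate one run's cost: DPUs x billed hours x hourly rate.

    minimum_seconds is an assumed billing floor, not an official figure.
    """
    billed_seconds = max(runtime_seconds, minimum_seconds)
    return dpus * (billed_seconds / 3600) * rate_per_dpu_hour


# 10 DPUs for 15 minutes at an assumed $0.44 per DPU-hour:
# 10 * 0.25 h * 0.44 = 1.10
cost = glue_job_cost(dpus=10, runtime_seconds=15 * 60, rate_per_dpu_hour=0.44)
print(f"${cost:.2f}")  # $1.10
```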

📝 Exam Tips (CLF-C02)

Keyword: "ETL" (Extract, Transform, Load).
Keyword: "Data Catalog".
Serverless data preparation.

See Also: Athena, Redshift, EMR