Glue
💡 Definition
AWS Glue is a serverless data integration service that makes it easy to discover, prepare, and combine data for analytics, machine learning, and application development. It is primarily an ETL (Extract, Transform, Load) service.
🔑 Key Concepts
- ETL: Extracts data from sources, transforms it (cleans, normalizes), and loads it into destinations.
- Serverless: No infrastructure to provision or manage; Glue allocates the compute for each job run.
- Data Catalog: A central repository to store structural and operational metadata for all your data assets.
- Crawlers: Automatically discover data schema and populate the Data Catalog.
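To make the ETL idea concrete, here is a minimal plain-Python sketch of the three stages. It does not use Glue at all; the sample records, field names, and function names are invented for illustration, and a string stands in for a real destination such as S3 or Redshift.

```python
import csv
import io
import json

# Hypothetical raw input: inconsistent casing, stray whitespace, one bad row.
RAW_CSV = """id,name,country
1, Alice ,us
2,BOB,Us
,missing id,de
"""


def extract(text):
    """Extract: read rows from a CSV source into dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: drop rows without an id, then clean and normalize fields."""
    cleaned = []
    for row in rows:
        if not row["id"].strip():
            continue  # cleaning: skip records that are missing the key
        cleaned.append({
            "id": int(row["id"]),
            "name": row["name"].strip().title(),       # normalize casing
            "country": row["country"].strip().upper(),  # normalize country code
        })
    return cleaned


def load(rows):
    """Load: serialize as JSON Lines (stand-in for writing to a destination)."""
    return "\n".join(json.dumps(r) for r in rows)


if __name__ == "__main__":
    print(load(transform(extract(RAW_CSV))))
```

In Glue, the same three stages run inside a managed job rather than on your own machine.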
⚙️ How it Works
- Crawler: Scans your data (e.g., in S3) and creates table definitions in the Data Catalog.
- Job: You write a script (Python/Scala) or use the visual editor to define transformations.
- Trigger: Starts the job on a schedule, on demand, or in response to an event; the job then writes the transformed data to a destination (e.g., Redshift).
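The crawler, job, and trigger steps above can be sketched with the boto3 Glue client calls `create_crawler`, `create_job`, and `create_trigger`. This is a rough outline, not a production setup: the role ARN, resource names, S3 paths, and the nightly schedule are all placeholders assumed for illustration, and the ETL script itself is assumed to already exist in S3.

```python
# All names, ARNs, and S3 paths below are placeholders, not real resources.
ROLE_ARN = "arn:aws:iam::123456789012:role/ExampleGlueRole"


def crawler_params():
    """Crawler: scan an S3 prefix and write table definitions to the Data Catalog."""
    return {
        "Name": "example-crawler",
        "Role": ROLE_ARN,
        "DatabaseName": "example_db",
        "Targets": {"S3Targets": [{"Path": "s3://example-bucket/raw/"}]},
    }


def job_params():
    """Job: a Spark ETL job ("glueetl") that runs a Python script stored in S3."""
    return {
        "Name": "example-etl-job",
        "Role": ROLE_ARN,
        "Command": {
            "Name": "glueetl",
            "ScriptLocation": "s3://example-bucket/scripts/etl.py",
            "PythonVersion": "3",
        },
    }


def trigger_params():
    """Trigger: start the job every day at 02:00 UTC."""
    return {
        "Name": "example-nightly-trigger",
        "Type": "SCHEDULED",
        "Schedule": "cron(0 2 * * ? *)",
        "Actions": [{"JobName": job_params()["Name"]}],
        "StartOnCreation": True,
    }


def deploy(glue):
    """Create the three resources using a boto3 Glue client passed in by the caller."""
    glue.create_crawler(**crawler_params())
    glue.create_job(**job_params())
    glue.create_trigger(**trigger_params())
```

With boto3 installed and AWS credentials configured, this would be called as `deploy(boto3.client("glue"))`. Keeping the parameters in plain functions makes them easy to inspect without touching an AWS account.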
🎯 Use Cases
- Data Lake creation: Preparing data for analysis in S3.
- Data Warehouse loading: Loading cleaned data into Redshift.
- Cataloging: Organizing metadata for use by Athena or EMR.
💰 Pricing Model
- DPU-hours: You pay an hourly rate per Data Processing Unit (DPU), billed per second, for the time your ETL jobs and crawlers run.
- Catalog Storage: Small monthly fee for stored metadata objects and catalog requests, with a free tier covering the first million of each.
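The DPU-hours charge above can be estimated with simple arithmetic: DPUs × runtime in hours × hourly rate. The sketch below assumes an illustrative rate of $0.44 per DPU-hour and a 60-second minimum billable duration; both are assumptions that vary by region and job type, so check the current AWS pricing page for real figures. The function name and parameters are hypothetical.

```python
def glue_job_cost(dpus, runtime_seconds, price_per_dpu_hour, min_billable_seconds=60):
    """Estimate the cost of one job run from DPU-hours.

    min_billable_seconds is an assumed minimum billing duration;
    actual minimums depend on the Glue version and job type.
    """
    billable_seconds = max(runtime_seconds, min_billable_seconds)
    dpu_hours = dpus * billable_seconds / 3600
    return dpu_hours * price_per_dpu_hour


if __name__ == "__main__":
    # 10 DPUs for 15 minutes = 2.5 DPU-hours at the assumed $0.44 rate.
    cost = glue_job_cost(dpus=10, runtime_seconds=15 * 60, price_per_dpu_hour=0.44)
    print(f"Estimated cost: ${cost:.2f}")
```

Under these assumptions, a 15-minute run on 10 DPUs uses 2.5 DPU-hours and costs about $1.10.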
📝 Exam Tips (CLF-C02)
- Keyword: "ETL" (Extract, Transform, Load).
- Keyword: "Data Catalog".
- Keyword: "Serverless" data preparation or data integration.