EMR (Elastic MapReduce)
💡 Definition
Amazon EMR is a managed cloud big data platform for processing vast amounts of data with open-source tools such as Apache Hadoop, Apache Spark, Apache Hive, Apache HBase, Apache Flink, Apache Hudi, and Presto.
🔑 Key Concepts
- Managed Hadoop/Spark: AWS manages the setup and configuration of the clusters.
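- Cluster: A collection of EC2 instances: one primary node (formerly called the master node) that coordinates the cluster, core nodes that run tasks and store data in HDFS, and optional task nodes that only run tasks.
- Big Data Processing: Designed for large-scale data processing workloads such as machine learning and scientific simulation.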
- Transient Clusters: Spin up a cluster, process data, and shut it down to save money.
⚙️ How it Works
- Launch Cluster: Select applications (e.g., Spark) and the hardware configuration (instance types and counts).
- Process: Submit steps (jobs) to the cluster.
- Output: Results are typically written to Amazon S3, so they persist after the cluster terminates.
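
The launch, process, and output flow above can be sketched with boto3, the AWS SDK for Python. This is a minimal sketch, not a production configuration: the release label, instance types, node counts, bucket names, and script path are all illustrative assumptions, and the two IAM role names are the EMR defaults, which must already exist in the account. The request is built as a plain dictionary so it can be inspected without touching AWS.

```python
def build_transient_cluster_request(name, log_uri, script_uri, output_uri):
    """Build keyword arguments for the EMR client's run_job_flow call.

    All concrete values (release label, instance types, counts) are
    illustrative assumptions, not recommendations.
    """
    return {
        "Name": name,
        "ReleaseLabel": "emr-6.15.0",            # assumed EMR release
        "Applications": [{"Name": "Spark"}],     # launch: pick applications
        "LogUri": log_uri,
        "Instances": {
            "InstanceGroups": [
                # The API still uses the role value "MASTER" for the primary node.
                {"Name": "Primary", "InstanceRole": "MASTER", "Market": "ON_DEMAND",
                 "InstanceType": "m5.xlarge", "InstanceCount": 1},
                {"Name": "Core", "InstanceRole": "CORE", "Market": "ON_DEMAND",
                 "InstanceType": "m5.xlarge", "InstanceCount": 2},
            ],
            # False makes the cluster transient: it terminates after the last step.
            "KeepJobFlowAliveWhenNoSteps": False,
        },
        "Steps": [
            {
                "Name": "Spark job",             # process: one submitted step
                "ActionOnFailure": "TERMINATE_CLUSTER",
                "HadoopJarStep": {
                    "Jar": "command-runner.jar",
                    # The script is assumed to take the S3 output path as its argument.
                    "Args": ["spark-submit", "--deploy-mode", "cluster",
                             script_uri, output_uri],
                },
            }
        ],
        "JobFlowRole": "EMR_EC2_DefaultRole",    # default EC2 instance profile
        "ServiceRole": "EMR_DefaultRole",        # default EMR service role
    }


def launch_cluster(request, region="us-east-1"):
    """Submit the request to EMR and return the new cluster ID."""
    import boto3  # imported here so the builder works without boto3 installed

    emr = boto3.client("emr", region_name=region)
    return emr.run_job_flow(**request)["JobFlowId"]


# Hypothetical bucket and script names, for illustration only.
request = build_transient_cluster_request(
    name="nightly-log-analysis",
    log_uri="s3://example-bucket/emr-logs/",
    script_uri="s3://example-bucket/scripts/analyze_logs.py",
    output_uri="s3://example-bucket/output/",
)
```

Calling `launch_cluster(request)` would start a billable cluster, so it is left out of the sketch. Because `KeepJobFlowAliveWhenNoSteps` is `False`, the cluster shuts itself down once the step completes.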
🎯 Use Cases
- Machine Learning: Training models on large datasets.
- Scientific Simulation: Large-scale scientific workloads, e.g., genomic data processing.
- Log Analysis: Processing petabytes of web logs.
💰 Pricing Model
- Instance Hours: You pay for the underlying EC2 instances.
- EMR Fee: An additional per-instance charge for the EMR service, billed on top of the EC2 price.
- Cost Tip: Use Spot Instances for task nodes to cut costs significantly; task nodes do not store HDFS data, so a Spot interruption does not lose data.
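
The pricing model above boils down to simple arithmetic: each node costs its EC2 rate plus the EMR fee for every hour it runs. The sketch below compares an all On-Demand cluster with one that uses Spot for task nodes. Every rate and node count here is a made-up placeholder chosen for easy math, not an actual AWS price; real rates vary by instance type and Region.

```python
def estimate_cluster_cost(node_groups, hours):
    """Sum (EC2 rate + EMR fee) * node count * hours over all node groups."""
    total = 0.0
    for group in node_groups:
        hourly_per_node = group["ec2_rate"] + group["emr_fee"]
        total += group["count"] * hourly_per_node * hours
    return total


# Hypothetical hourly rates in USD (placeholders, not real prices).
ON_DEMAND_RATE = 0.20
SPOT_RATE = 0.06
EMR_FEE = 0.05  # the EMR fee is added on top of the EC2 rate

all_on_demand = [
    {"name": "primary", "count": 1, "ec2_rate": ON_DEMAND_RATE, "emr_fee": EMR_FEE},
    {"name": "core", "count": 2, "ec2_rate": ON_DEMAND_RATE, "emr_fee": EMR_FEE},
    {"name": "task", "count": 4, "ec2_rate": ON_DEMAND_RATE, "emr_fee": EMR_FEE},
]

# Same cluster, but the task nodes run on Spot capacity.
spot_task_nodes = [
    {"name": "primary", "count": 1, "ec2_rate": ON_DEMAND_RATE, "emr_fee": EMR_FEE},
    {"name": "core", "count": 2, "ec2_rate": ON_DEMAND_RATE, "emr_fee": EMR_FEE},
    {"name": "task", "count": 4, "ec2_rate": SPOT_RATE, "emr_fee": EMR_FEE},
]

HOURS = 3  # a transient cluster that runs for three hours
on_demand_cost = estimate_cluster_cost(all_on_demand, HOURS)
spot_cost = estimate_cluster_cost(spot_task_nodes, HOURS)

print(f"All On-Demand:   ${on_demand_cost:.2f}")
print(f"Spot task nodes: ${spot_cost:.2f}")
```

With these placeholder numbers, only the EC2 portion of the task nodes' cost drops; the EMR fee stays the same for every node.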
📝 Exam Tips (CLF-C02)
- Keywords: "Big Data", "Hadoop", "Spark".
- Processes vast amounts of data.
- Managed cluster platform.