This article is about choosing between the various compute options of master, core and task nodes on EMR clusters based on the workload profile
Instance Fleets vs Instance Groups
Instance Groups are limited to one instance type and one purchase type (reserved, demand, spot, etc) per group
Instance Fleets support multiple instances and multiple payments types and hence offer more flexibility and possible efficiency in matching the workload demands to compute resources available.
What Instance Type Should You Use?
There are several ways to add EC2 instances to a cluster, which depend on whether you use the instance groups configuration or the instance fleets configuration for the cluster.
Instance Groups
- AUTOMATIC: Set up automatic scaling in Amazon EMR for an instance group, adding and removing instances automatically based on the value of an Amazon CloudWatch metric that you specify. For more information, see Scaling Cluster Resources.
- MANUAL GROUP UPDATE: Manually add instances of the same type to existing core and task instance groups.
- MANUAL ADD GROUPS: Manually add a task instance group, which can use a different instance type.
Instance Fleets
- Add a single task instance fleet.
- Change the target capacity for On-Demand and Spot Instances for existing core and task instance fleets.
Node Types
The master node type, which assigns tasks, doesn’t require an EC2 instance with much processing power. Consider using an m4.xlarge, or m5.xlarge.
EC2 instances for the core node type, which process tasks and store data in HDFS, need both processing power and storage capacity. m5.xlarge is typical
EC2 instances for the task node type, which don’t store data, need only processing power. m5.xlarge is typical
Core Nodes on Spot Instances
Core nodes process data and store information using HDFS. Terminating a core instance risks data loss. For this reason, you should only run core nodes on Spot Instances when partial HDFS data loss is tolerable.
When you launch the core instance group as Spot Instances, Amazon EMR waits until it can provision all of the requested core instances before launching the instance group. In other words, if you request six Amazon EC2 instances, and only five are available at or below your maximum Spot price, the instance group won’t launch
Task Nodes on Spot Instances
The task nodes process data but do not hold persistent data in HDFS. If they terminate because the Spot price has risen above your maximum Spot price, no data is lost and the effect on your cluster is minimal.
When you launch one or more task instance groups as Spot Instances, Amazon EMR provisions as many task nodes as it can, using your maximum Spot price. This means that if you request a task instance group with six nodes, and only five Spot Instances are available at or below your maximum Spot price, Amazon EMR launches the instance group with five nodes, adding the sixth later if possible.
Instance Configurations for Application Scenarios
The following table is a quick reference to node type purchasing options and configurations that are usually appropriate for various application scenarios.
Application Scenario | Master Node | Core Nodes | Task Nodes |
---|---|---|---|
Long-Running Clusters and Data Warehouses |
On-Demand | On-Demand or instance-fleet mix | Spot or instance-fleet mix |
Cost-Driven Workloads | Spot | Spot | Spot |
Data-Critical Workloads | On-Demand | On-Demand | Spot or instance-fleet mix |
Application Testing | Spot | Spot | Spot |