4. Partitioning and Sorting


1. Partitioners

1.1. Overview

1.1.1. Every MR job has a Partitioner

1.1.2. Determines which Reducer receives each record

1.1.3. Attempts an even distribution of map output across Reducers, although records with the same key always go to the same Reducer

1.1.4. Default is HashPartitioner. Uses the key's hashCode() method along with modulus by the number of reduce tasks

1.1.5. Data Skew: a small number of Reducers process a disproportionately large number of records. Understand what the data looks like to avoid it
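
The hash-and-modulus rule and its exposure to skew can be sketched in plain Java. This is a standalone illustration that does not depend on Hadoop; the class name, method names, and sample keys are hypothetical. It masks the sign bit before taking the modulus so that a negative hashCode() cannot produce a negative partition number.

```java
import java.util.Arrays;
import java.util.List;

// Standalone sketch: mirrors the hash-and-modulus idea without depending on Hadoop.
class HashPartitionSketch {

    // Mask off the sign bit so a negative hashCode() cannot yield a negative partition.
    static int partitionFor(Object key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    // Count how many records land in each partition.
    static int[] recordsPerPartition(List<String> keys, int numReduceTasks) {
        int[] counts = new int[numReduceTasks];
        for (String key : keys) {
            counts[partitionFor(key, numReduceTasks)]++;
        }
        return counts;
    }

    public static void main(String[] args) {
        // Hypothetical skewed input: one key dominates the map output.
        List<String> keys = Arrays.asList("us", "us", "us", "us", "us", "us", "de", "fr");
        System.out.println(Arrays.toString(recordsPerPartition(keys, 3)));
    }
}
```

Because every "us" record hashes to the same partition, one Reducer receives at least six of the eight records no matter how many Reducers are configured. Adding Reducers does not fix skew that comes from the key distribution itself.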

1.2. How it works

1.2.1. Default Partitioner

1.2.1.1. org.apache.hadoop.mapreduce.Partitioner is the parent class of HashPartitioner

1.2.1.2. Records with the same key go to the same partition

1.2.1.3. getPartition() is invoked for each <K, V> pair output by the Mapper

1.2.1.4. Both key and value can be used in the partitioning logic, though the value is often ignored

1.2.1.5. The int return value must be in the range 0 to numReduceTasks - 1, which the modulus guarantees

1.2.1.6. Define a custom partitioner when hashCode() produces an uneven distribution

1.2.2. Custom Partitioner

1.2.2.1. Extends Partitioner

1.2.2.2. Implement the getPartition() method so that it returns an int from 0 to numReduceTasks - 1

1.2.2.3. Input parameter types match the class generics

1.2.2.4. Use Job's setPartitionerClass() to configure it, typically in the driver's run() method
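
A minimal sketch of the custom partitioner shape, written in plain Java so that it runs without Hadoop on the classpath. PartitionerSketch is a hypothetical stand-in for the getPartition() contract, not the Hadoop class itself, and the first-letter routing rule is only an illustrative assumption. In a real job the class would extend org.apache.hadoop.mapreduce.Partitioner<K, V> and be registered with the Job's setPartitionerClass().

```java
// Hypothetical stand-in for the getPartition() contract; NOT the Hadoop class.
abstract class PartitionerSketch<K, V> {
    abstract int getPartition(K key, V value, int numPartitions);
}

// Illustrative rule: route records by the first letter of the key, ignoring case.
// The generic parameters (String, Integer) match the getPartition() parameter types.
class FirstLetterPartitioner extends PartitionerSketch<String, Integer> {
    @Override
    int getPartition(String key, Integer value, int numPartitions) {
        if (key.isEmpty()) {
            return 0; // send empty keys to the first partition
        }
        char first = Character.toLowerCase(key.charAt(0));
        // A char is never negative, so the result is always 0..numPartitions - 1.
        return first % numPartitions;
    }
}
```

The value parameter is ignored here, which is the common case. It could be consulted when the key alone does not spread records evenly.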

1.2.3. TotalOrderPartitioner

1.2.3.1. Lives in the org.apache.hadoop.mapreduce.lib.partition package

1.2.3.2. Output is sorted across all Reducers, not just within each one

1.2.3.3. A partition file defines how keys are split across partitions

1.2.3.4. The partition file is generated using the InputSampler class
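
The idea behind range partitioning can be sketched as a lookup against sorted split points. This is a plain-Java illustration, not the TotalOrderPartitioner implementation; the class name and the split points are hypothetical and stand in for the contents of a partition file.

```java
import java.util.Arrays;

// Standalone sketch of range partitioning against sorted split points.
class RangePartitionSketch {

    // splitPoints must be sorted; n split points define n + 1 partitions.
    // Keys below the first split point go to partition 0, and so on.
    static int partitionFor(String key, String[] splitPoints) {
        int pos = Arrays.binarySearch(splitPoints, key);
        // binarySearch returns -(insertionPoint) - 1 when the key is absent.
        return pos >= 0 ? pos + 1 : -pos - 1;
    }

    public static void main(String[] args) {
        String[] splitPoints = {"h", "p"}; // hypothetical sampled boundaries
        for (String key : new String[] {"apple", "kiwi", "zebra"}) {
            System.out.println(key + " -> partition " + partitionFor(key, splitPoints));
        }
    }
}
```

Every key in partition i sorts before every key in partition i + 1, so concatenating the Reducers' sorted outputs in partition order yields one globally sorted result. How evenly the records spread depends on how well the split points reflect the real key distribution, which is why they are derived from a sample of the input.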

2. Sorting

2.1. Two key tasks

2.1.1. Keys are sorted in natural order. The key type is WritableComparable, which forces compareTo() to be defined

2.1.2. Equal keys are grouped together. This is a configurable component: the grouping comparator decides key equality

2.2. Secondary sort

2.2.1. Move part of value into key

2.2.2. Sorting work is done during Shuffle/Sort

1. Write a custom key class that contains the secondary key (CustomKey).

2. Write a custom grouping comparator.

3. Write a custom partitioner that ensures grouped keys are sent to the same Reducer.
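
The three pieces of a secondary sort can be simulated in plain Java to show how they fit together. This is a sketch under stated assumptions, not Hadoop code: the station and reading fields are hypothetical, and sortAndGroup() only imitates what a single Reducer would see after Shuffle/Sort. In a real job the key would implement WritableComparable and the components would be registered through the Job's setSortComparatorClass(), setGroupingComparatorClass(), and setPartitionerClass().

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class SecondarySortSketch {

    // Step 1: composite key = natural key plus the part of the value to sort on.
    static final class CustomKey {
        final String station; // natural key (hypothetical field)
        final int reading;    // secondary key, moved out of the value

        CustomKey(String station, int reading) {
            this.station = station;
            this.reading = reading;
        }
    }

    // Sort order: natural key first, then secondary key.
    static final Comparator<CustomKey> SORT =
            Comparator.comparing((CustomKey k) -> k.station).thenComparingInt(k -> k.reading);

    // Step 2: the grouping comparator looks at the natural key only.
    static final Comparator<CustomKey> GROUP =
            Comparator.comparing((CustomKey k) -> k.station);

    // Step 3: partition on the natural key only, so a whole group reaches one Reducer.
    static int partitionFor(CustomKey key, int numReduceTasks) {
        return (key.station.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    // Imitates one Reducer's view: keys sorted, then split into groups
    // wherever the grouping comparator reports a change.
    static Map<String, List<Integer>> sortAndGroup(List<CustomKey> keys) {
        List<CustomKey> sorted = new ArrayList<>(keys);
        sorted.sort(SORT);
        Map<String, List<Integer>> groups = new LinkedHashMap<>();
        CustomKey previous = null;
        for (CustomKey key : sorted) {
            if (previous == null || GROUP.compare(previous, key) != 0) {
                groups.put(key.station, new ArrayList<>());
            }
            groups.get(key.station).add(key.reading);
            previous = key;
        }
        return groups;
    }
}
```

If the partitioner hashed the whole composite key instead of the natural key, records for one station could be scattered across Reducers and the grouping step would never see them together.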