3. Map Aggregation

Laten we beginnen. Het is Gratis
of registreren met je e-mailadres
3. Map Aggregation Door Mind Map: 3. Map Aggregation

1. Combiners

1.1. Special type of Reducer

1.1.1. Combiner extends class Reducer

1.1.2. Invoked arbitrary number of times (1...many) MR framework invokes Invoked just prior to spill

1.1.3. Combiner must deserialize Mapper <K, V>

1.1.4. Only combines data on one node Not output from multiple Mappers

1.1.5. Combiner logic should be transparent I.e. Combiner should be able to be removed w/out altering the outcome of Reducer.

1.1.6. Combiner input and output types must be the same I.e. must match output of Mapper

1.1.7. Reducer can be used as a Combiner in some cases. Example: if the Reducer computes a commutative and associative function like summation or max value.

1.1.8. Combiner always runs at least once per Mapper. Always at least 1 spill file

1.2. Reduce-side combining

1.2.1. Purpose of Combiner is decrease network traffic between Mapper and Reducer.

1.2.2. Reducer can use Combiner to improve file I/O. Should make NO difference to program logic.

1.3. Combiner Example

1.3.1. Extends org.apache.hadoop.mapreduce.Reducer class

1.3.2. Combiner <K, V> output must match Mapper output which matches Reducer input.

1.3.3. Multiple invocations won't affect algorithm

1.3.4. WordCountCombiner Input <K, V> = output <K, V> Therefore not all Reducers can be Combiners. setCombinerClass in the run() method to configure Combiner Example Sometimes you can use the Reducer:

2. In-map aggregation

2.1. Also called local aggregation

2.2. Mapper stores records in memory

2.3. Won't work for large # records

2.4. Good performance gains when it works

2.4.1. Doesn't have overhead of Combiner (JVM)

2.5. Example: TopResultMapper

2.5.1. ArrayList eventually holds every word Won't work if split too big Doesn't hold dupes - freq ++

2.5.2. Clever trick: ArrayList put into PriorityQueue sorted by frequency. NOTE: Only finds top 10 words per InputSplit - not entire Hadoop cluster.

3. Counters

3.1. Pre-defined

3.1.1. Job Counters

3.1.2. FilesSystemCounters

3.1.3. Many in the Map-Reduce Framework group

3.2. User-defined

3.2.1. enum-based incrementing

3.2.2. string group and counter name