Hadoop Pluggable Sort

Introduction

The pluggable sort capability allows replacing the built-in sort logic with alternate implementations. Example use cases are replacing the sort logic with custom algorithms that enable hash aggregation or limit-N queries.

IMPORTANT: The pluggable sort capability is experimental and unstable. This means the provided APIs may change and break compatibility in future versions of Hadoop.

Implementing a Custom Sort

A custom sort implementation requires an org.apache.hadoop.mapred.MapOutputCollector implementation class running in the Mapper tasks and (optionally, depending on the sort implementation) an org.apache.hadoop.mapred.ShuffleConsumerPlugin implementation class running in the Reducer tasks.
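
The following is a minimal, non-functional sketch of what a MapOutputCollector implementation looks like. The class and package names are hypothetical, the real buffering, sorting, and spill logic is omitted, and the init/collect/flush/close method signatures shown here reflect the current (unstable) interface, so verify them against the Hadoop version in use.

  package com.example.sort;  // hypothetical package

  import java.io.IOException;

  import org.apache.hadoop.mapred.JobConf;
  import org.apache.hadoop.mapred.MapOutputCollector;

  // Illustrative skeleton only: the actual work of buffering, partitioning,
  // sorting (or aggregating) and writing map output is omitted.
  public class CustomSortCollector<K, V> implements MapOutputCollector<K, V> {

    private JobConf conf;

    @Override
    public void init(MapOutputCollector.Context context)
        throws IOException, ClassNotFoundException {
      // The context gives access to the map task, its JobConf and the reporter.
      conf = context.getJobConf();
      // Set up buffers, comparators, serializers, etc. here.
    }

    @Override
    public void collect(K key, V value, int partition)
        throws IOException, InterruptedException {
      // Called once per map output record; buffer or aggregate it here.
    }

    @Override
    public void flush()
        throws IOException, InterruptedException, ClassNotFoundException {
      // Called at the end of the map task: write the final map output
      // (e.g. sorted or hash-aggregated) so the Reducer side can fetch it.
    }

    @Override
    public void close() throws IOException, InterruptedException {
      // Release buffers and any other resources.
    }
  }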

The default implementations provided by Hadoop can be used as references:

  • org.apache.hadoop.mapred.MapTask$MapOutputBuffer
  • org.apache.hadoop.mapreduce.task.reduce.Shuffle

Configuration

All the pluggable components run in the job tasks. This means they can be configured on a per-job basis.

Job Configuration Properties

  • mapreduce.job.reduce.shuffle.consumer.plugin.class. Default: org.apache.hadoop.mapreduce.task.reduce.Shuffle. The ShuffleConsumerPlugin implementation to use.
  • mapreduce.job.map.output.collector.class. Default: org.apache.hadoop.mapred.MapTask$MapOutputBuffer. The MapOutputCollector implementation to use.
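
As a sketch of per-job configuration, the snippet below sets both properties on a job using the standard Configuration and Job APIs. The implementation class names are hypothetical and must be available on the task classpath (for example, shipped in the job jar).

  import java.io.IOException;

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.mapreduce.Job;

  public class CustomSortJob {
    public static void main(String[] args) throws IOException {
      Configuration conf = new Configuration();

      // Hypothetical custom implementations; they must be on the task classpath.
      conf.set("mapreduce.job.map.output.collector.class",
          "com.example.sort.CustomSortCollector");
      conf.set("mapreduce.job.reduce.shuffle.consumer.plugin.class",
          "com.example.sort.CustomShuffleConsumerPlugin");

      Job job = Job.getInstance(conf, "custom-sort-example");
      // ... set mapper, reducer, input/output paths as usual, then submit.
    }
  }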

These properties can also be set in mapred-site.xml to change the default values for all jobs.
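
For example, a mapred-site.xml entry along these lines sets cluster-wide defaults. The values shown are the shipped defaults; replace them with custom implementation classes as needed.

  <configuration>
    <property>
      <name>mapreduce.job.map.output.collector.class</name>
      <value>org.apache.hadoop.mapred.MapTask$MapOutputBuffer</value>
    </property>
    <property>
      <name>mapreduce.job.reduce.shuffle.consumer.plugin.class</name>
      <value>org.apache.hadoop.mapreduce.task.reduce.Shuffle</value>
    </property>
  </configuration>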