datafu.pig.bags
Class DistinctBy

java.lang.Object
  extended by org.apache.pig.EvalFunc<org.apache.pig.data.DataBag>
      extended by datafu.pig.bags.DistinctBy

public class DistinctBy
extends org.apache.pig.EvalFunc<org.apache.pig.data.DataBag>

Returns the distinct tuples of a bag, where distinctness is determined by a given set of field positions. The input and output schemas are identical. For each distinct combination of values in those fields, the first tuple encountered is kept. The operation preserves order: if both A and B appear in the output and A appears before B in the input, then A appears before B in the output. Example:

 define DistinctBy datafu.pig.bags.DistinctBy('0');
 
 -- input:
 -- ({(a,1),(a,1),(b,2),(b,22),(c,3),(d,4)})
 input = LOAD 'input' AS (B: bag {T: tuple(alpha:CHARARRAY, numeric:INT)});
 
 output = FOREACH input GENERATE DistinctBy(B);
 
 -- output:
 -- ({(a,1),(b,2),(c,3),(d,4)})
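
The first-occurrence, order-preserving behavior described above can be illustrated outside of Pig. The following is a minimal sketch in plain Java, not the DataFu implementation: the class DistinctBySketch and the method distinctBy are hypothetical names, bags are modeled as lists of lists, and key comparison relies on List.equals, which may differ from how the actual UDF compares fields.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical illustration only; this is not part of DataFu or Pig.
class DistinctBySketch {

    // Keeps the first row seen for each distinct combination of the values
    // at the given positions, preserving the input order of the kept rows.
    static List<List<Object>> distinctBy(List<List<Object>> bag, int... positions) {
        Set<List<Object>> seenKeys = new HashSet<>();
        List<List<Object>> result = new ArrayList<>();
        for (List<Object> row : bag) {
            // Build the key from the selected field positions.
            List<Object> key = new ArrayList<>();
            for (int p : positions) {
                key.add(row.get(p));
            }
            // Set.add returns true only the first time a key is seen.
            if (seenKeys.add(key)) {
                result.add(row);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<List<Object>> bag = Arrays.asList(
            Arrays.<Object>asList("a", 1),
            Arrays.<Object>asList("a", 1),
            Arrays.<Object>asList("b", 2),
            Arrays.<Object>asList("b", 22),
            Arrays.<Object>asList("c", 3),
            Arrays.<Object>asList("d", 4));

        // Distinct by field position 0, mirroring DistinctBy('0') above.
        System.out.println(distinctBy(bag, 0));
        // prints [[a, 1], [b, 2], [c, 3], [d, 4]]
    }
}
```

In this sketch (b,22) is dropped because (b,2) was seen first with the same value at position 0. Passing positions 0 and 1 together would keep both tuples, since their combined keys differ.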
  
 


Field Summary
 
Fields inherited from class org.apache.pig.EvalFunc
log, pigLogger, reporter, returnType
 
Constructor Summary
DistinctBy(java.lang.String... fields)
           
 
Method Summary
 org.apache.pig.data.DataBag exec(org.apache.pig.data.Tuple input)
           
 org.apache.pig.impl.logicalLayer.schema.Schema outputSchema(org.apache.pig.impl.logicalLayer.schema.Schema input)
           
 
Methods inherited from class org.apache.pig.EvalFunc
finish, getArgToFuncMapping, getCacheFiles, getLogger, getPigLogger, getReporter, getReturnType, getSchemaName, isAsynchronous, progress, setPigLogger, setReporter, warn
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Constructor Detail

DistinctBy

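public DistinctBy(java.lang.String... fields)
Parameters:
fields - the positions of the fields that determine distinctness, passed as strings (for example '0')

Method Detail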

exec

public org.apache.pig.data.DataBag exec(org.apache.pig.data.Tuple input)
                                 throws java.io.IOException
Specified by:
exec in class org.apache.pig.EvalFunc<org.apache.pig.data.DataBag>
Throws:
java.io.IOException

outputSchema

public org.apache.pig.impl.logicalLayer.schema.Schema outputSchema(org.apache.pig.impl.logicalLayer.schema.Schema input)
Overrides:
outputSchema in class org.apache.pig.EvalFunc<org.apache.pig.data.DataBag>


Matthew Hayes, Sam Shah