datafu.pig.bags
Class DistinctBy
java.lang.Object
org.apache.pig.EvalFunc<org.apache.pig.data.DataBag>
datafu.pig.bags.DistinctBy
public class DistinctBy
extends org.apache.pig.EvalFunc<org.apache.pig.data.DataBag>
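Returns the distinct tuples in a bag, where distinctness is determined by a given set of field positions.
The field positions are passed to the constructor as strings; in the example below, '0' selects the first field of each tuple.
The input and output schemas are identical.
For each distinct combination of values in these fields, only the first tuple containing it is kept.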
This operation is order preserving. If both A and B appear in the output,
and A appears before B in the input, then A will appear before B in the output.
Example:
define DistinctBy datafu.pig.bags.DistinctBy('0');
-- input:
-- ({(a,1),(a,1),(b,2),(b,22),(c,3),(d,4)})
input = LOAD 'input' AS (B: bag {T: tuple(alpha:CHARARRAY, numeric:INT)});
output = FOREACH input GENERATE DistinctBy(B);
-- output:
-- ({(a,1),(b,2),(c,3),(d,4)})
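The semantics described above (the first tuple per key combination is kept, and input order is preserved) can be sketched in plain Java. This is only an illustration under stated assumptions, not DataFu's actual implementation: the class name DistinctBySketch and the method distinctBy are hypothetical, a tuple is modeled as a List<Object> and a bag as a List of tuples, and no Pig classes are used.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical illustration: a tuple is a List<Object>, a bag is a List of tuples.
final class DistinctBySketch {

    // Keeps the first tuple seen for each distinct combination of the values
    // at the given zero-based positions, preserving the input order.
    static List<List<Object>> distinctBy(List<List<Object>> bag, int... positions) {
        Set<List<Object>> seenKeys = new HashSet<>();
        List<List<Object>> result = new ArrayList<>();
        for (List<Object> tuple : bag) {
            // Build the key from the selected fields only.
            List<Object> key = new ArrayList<>(positions.length);
            for (int p : positions) {
                key.add(tuple.get(p));
            }
            // Set.add returns false when an equal key was already recorded.
            if (seenKeys.add(key)) {
                result.add(tuple);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Same data as the Pig example above.
        List<List<Object>> bag = Arrays.asList(
            Arrays.<Object>asList("a", 1),
            Arrays.<Object>asList("a", 1),
            Arrays.<Object>asList("b", 2),
            Arrays.<Object>asList("b", 22),
            Arrays.<Object>asList("c", 3),
            Arrays.<Object>asList("d", 4));
        // Prints [[a, 1], [b, 2], [c, 3], [d, 4]]
        System.out.println(distinctBy(bag, 0));
    }
}
```

Because results are appended in iteration order and a tuple is added only the first time its key is seen, (b,2) is kept and (b,22) is dropped, matching the Pig output above.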
Fields inherited from class org.apache.pig.EvalFunc
log, pigLogger, reporter, returnType
Constructor Summary
DistinctBy(java.lang.String... fields)
Method Summary
org.apache.pig.data.DataBag exec(org.apache.pig.data.Tuple input)
org.apache.pig.impl.logicalLayer.schema.Schema outputSchema(org.apache.pig.impl.logicalLayer.schema.Schema input)
Methods inherited from class org.apache.pig.EvalFunc
finish, getArgToFuncMapping, getCacheFiles, getLogger, getPigLogger, getReporter, getReturnType, getSchemaName, isAsynchronous, progress, setPigLogger, setReporter, warn
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
DistinctBy
public DistinctBy(java.lang.String... fields)
exec
public org.apache.pig.data.DataBag exec(org.apache.pig.data.Tuple input)
throws java.io.IOException
- Specified by: exec in class org.apache.pig.EvalFunc<org.apache.pig.data.DataBag>
- Throws: java.io.IOException
outputSchema
public org.apache.pig.impl.logicalLayer.schema.Schema outputSchema(org.apache.pig.impl.logicalLayer.schema.Schema input)
- Overrides: outputSchema in class org.apache.pig.EvalFunc<org.apache.pig.data.DataBag>
Matthew Hayes, Sam Shah