datafu.pig.linkanalysis
Class PageRank

java.lang.Object
  extended by org.apache.pig.EvalFunc<org.apache.pig.data.DataBag>
      extended by datafu.pig.linkanalysis.PageRank
All Implemented Interfaces:
org.apache.pig.Accumulator<org.apache.pig.data.DataBag>

public class PageRank
extends org.apache.pig.EvalFunc<org.apache.pig.data.DataBag>
implements org.apache.pig.Accumulator<org.apache.pig.data.DataBag>

A UDF that implements PageRank. Each graph is held in memory while the algorithm runs, with edges optionally spilled to disk to conserve memory. This makes it possible to distribute the execution of PageRank across a large number of reasonably sized graphs. It does not distribute the execution of PageRank on a single graph. Each graph is identified by an integer-valued topic ID.

Example:

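 -- Bind the short name used below to the UDF class (no-argument constructor).
 DEFINE PageRank datafu.pig.linkanalysis.PageRank();

 topic_edges = LOAD 'input_edges' as (topic:INT,source:INT,dest:INT,weight:DOUBLE);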
 
 topic_edges_grouped = GROUP topic_edges by (topic, source) ;
 topic_edges_grouped = FOREACH topic_edges_grouped GENERATE
    group.topic as topic,
    group.source as source,
    topic_edges.(dest,weight) as edges;
 
 topic_edges_grouped_by_topic = GROUP topic_edges_grouped BY topic; 
 
 topic_ranks = FOREACH topic_edges_grouped_by_topic GENERATE
    group as topic,
    FLATTEN(PageRank(topic_edges_grouped.(source,edges))) as (source,rank);

 topic_ranks = FOREACH topic_ranks GENERATE
    topic, source, rank;
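
To make concrete what is computed for one topic's graph, the following is a minimal in-memory sketch of weighted PageRank in Java. It is not DataFu's implementation: the class name PageRankSketch, the Edge type, the damping factor alpha, the fixed iteration count, and the even redistribution of rank from nodes without outgoing edges are all assumptions chosen for illustration. The input map mirrors the (source, edges) shape passed to the UDF in the example above.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Illustrative in-memory weighted PageRank for a single graph (hypothetical, not DataFu code). */
public class PageRankSketch {

    /** A weighted, directed edge; mirrors the (dest, weight) tuples in the Pig example. */
    public static final class Edge {
        final int dest;
        final double weight;

        public Edge(int dest, double weight) {
            this.dest = dest;
            this.weight = weight;
        }
    }

    /**
     * Runs a fixed number of power iterations over the graph.
     * alpha (damping factor) and iterations are assumed parameters for this sketch.
     */
    public static Map<Integer, Double> rank(Map<Integer, List<Edge>> graph,
                                            double alpha, int iterations) {
        // Collect every node that appears as a source or as a destination.
        Map<Integer, Double> ranks = new HashMap<>();
        for (Map.Entry<Integer, List<Edge>> entry : graph.entrySet()) {
            ranks.put(entry.getKey(), 0.0);
            for (Edge edge : entry.getValue()) {
                ranks.put(edge.dest, 0.0);
            }
        }
        final int n = ranks.size();
        if (n == 0) {
            return ranks;
        }
        // Start from a uniform distribution.
        ranks.replaceAll((node, r) -> 1.0 / n);

        for (int it = 0; it < iterations; it++) {
            Map<Integer, Double> next = new HashMap<>();
            for (Integer node : ranks.keySet()) {
                next.put(node, 0.0);
            }
            double dangling = 0.0; // rank held by nodes with no usable outgoing edges
            for (Map.Entry<Integer, Double> entry : ranks.entrySet()) {
                List<Edge> out = graph.get(entry.getKey());
                double total = 0.0;
                if (out != null) {
                    for (Edge edge : out) {
                        total += edge.weight;
                    }
                }
                if (total <= 0.0) {
                    dangling += entry.getValue();
                    continue;
                }
                // Split this node's rank among its neighbours in proportion to edge weight.
                for (Edge edge : out) {
                    next.merge(edge.dest, entry.getValue() * edge.weight / total, Double::sum);
                }
            }
            // Teleport term plus an even share of the dangling rank.
            final double base = (1.0 - alpha) / n + alpha * dangling / n;
            next.replaceAll((node, r) -> base + alpha * r);
            ranks = next;
        }
        return ranks;
    }

    public static void main(String[] args) {
        Map<Integer, List<Edge>> graph = new HashMap<>();
        List<Edge> fromOne = new ArrayList<>();
        fromOne.add(new Edge(2, 1.0));
        fromOne.add(new Edge(3, 2.0));
        graph.put(1, fromOne);
        List<Edge> fromTwo = new ArrayList<>();
        fromTwo.add(new Edge(1, 1.0));
        graph.put(2, fromTwo);
        // Node 3 has no outgoing edges, so its rank is redistributed evenly.
        rank(graph, 0.85, 30).forEach((node, r) -> System.out.println(node + "\t" + r));
    }
}
```

Because the whole graph and its rank vector live in ordinary maps, memory use grows with the number of nodes and edges in a single graph, which is why the UDF is suited to many moderately sized graphs rather than one very large one.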
 
 
 


Field Summary
 
Fields inherited from class org.apache.pig.EvalFunc
log, pigLogger, reporter, returnType
 
Constructor Summary
PageRank()
           
PageRank(java.lang.String... parameters)
           
 
Method Summary
 void accumulate(org.apache.pig.data.Tuple t)
           
 void cleanup()
           
 org.apache.pig.data.DataBag exec(org.apache.pig.data.Tuple input)
           
 org.apache.pig.data.DataBag getValue()
           
 org.apache.pig.impl.logicalLayer.schema.Schema outputSchema(org.apache.pig.impl.logicalLayer.schema.Schema input)
           
 
Methods inherited from class org.apache.pig.EvalFunc
finish, getArgToFuncMapping, getCacheFiles, getLogger, getPigLogger, getReporter, getReturnType, getSchemaName, isAsynchronous, progress, setPigLogger, setReporter, warn
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Constructor Detail

PageRank

public PageRank()

PageRank

public PageRank(java.lang.String... parameters)

Method Detail

accumulate

public void accumulate(org.apache.pig.data.Tuple t)
                throws java.io.IOException
Specified by:
accumulate in interface org.apache.pig.Accumulator<org.apache.pig.data.DataBag>
Throws:
java.io.IOException

getValue

public org.apache.pig.data.DataBag getValue()
Specified by:
getValue in interface org.apache.pig.Accumulator<org.apache.pig.data.DataBag>

cleanup

public void cleanup()
Specified by:
cleanup in interface org.apache.pig.Accumulator<org.apache.pig.data.DataBag>

exec

public org.apache.pig.data.DataBag exec(org.apache.pig.data.Tuple input)
                                 throws java.io.IOException
Specified by:
exec in class org.apache.pig.EvalFunc<org.apache.pig.data.DataBag>
Throws:
java.io.IOException

outputSchema

public org.apache.pig.impl.logicalLayer.schema.Schema outputSchema(org.apache.pig.impl.logicalLayer.schema.Schema input)
Overrides:
outputSchema in class org.apache.pig.EvalFunc<org.apache.pig.data.DataBag>


Matthew Hayes, Sam Shah