org.apache.pig
Interface LoadFunc

All Known Subinterfaces:
ReversibleLoadStoreFunc
All Known Implementing Classes:
BinaryStorage, BinStorage, HBaseStorage, PigStorage, RandomSampleLoader, TextLoader

public interface LoadFunc

This interface is used to implement functions that parse records from a dataset. It also includes functions that cast raw byte data into various datatypes. The casts are exposed on the interface because loaders should, whenever possible, delay casting of datatypes until the last possible moment (i.e. not cast on load). Other sections of the code therefore need to be able to call back to the loader to perform the cast.
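The deferred-cast idea can be illustrated with a minimal sketch. It assumes the raw bytes hold UTF-8 text; TextCaster and LazyCastDemo are hypothetical names used only for illustration and are not Pig classes. Only the method names bytesToInteger and bytesToDouble mirror this interface.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;

class LazyCastDemo {
    public static void main(String[] args) throws IOException {
        // The "loaded" record keeps every field as raw bytes; nothing is cast on load.
        byte[][] record = {
            "alice".getBytes(StandardCharsets.UTF_8),
            "42".getBytes(StandardCharsets.UTF_8)
        };
        TextCaster caster = new TextCaster();
        // The cast happens only when a consumer actually needs a typed value.
        Integer age = caster.bytesToInteger(record[1]);
        System.out.println(age + 1);
    }
}

/** Hypothetical caster for UTF-8 text data (an assumption, not a Pig class). */
class TextCaster {
    Integer bytesToInteger(byte[] b) throws IOException {
        String s = new String(b, StandardCharsets.UTF_8).trim();
        try {
            return Integer.valueOf(s);
        } catch (NumberFormatException e) {
            // Mirror the interface contract: report a failed cast as an IOException.
            throw new IOException("Cannot cast '" + s + "' to integer", e);
        }
    }

    Double bytesToDouble(byte[] b) throws IOException {
        String s = new String(b, StandardCharsets.UTF_8).trim();
        try {
            return Double.valueOf(s);
        } catch (NumberFormatException e) {
            throw new IOException("Cannot cast '" + s + "' to double", e);
        }
    }
}
```

In this sketch the name field is never cast at all, which is the saving that delayed casting is meant to provide.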


Method Summary
 void bindTo(String fileName, BufferedPositionedInputStream is, long offset, long end)
          Specifies a portion of an InputStream to read tuples.
 DataBag bytesToBag(byte[] b)
          Cast data from bytes to bag value.
 String bytesToCharArray(byte[] b)
          Cast data from bytes to chararray value.
 Double bytesToDouble(byte[] b)
          Cast data from bytes to double value.
 Float bytesToFloat(byte[] b)
          Cast data from bytes to float value.
 Integer bytesToInteger(byte[] b)
          Cast data from bytes to integer value.
 Long bytesToLong(byte[] b)
          Cast data from bytes to long value.
 Map<Object,Object> bytesToMap(byte[] b)
          Cast data from bytes to map value.
 Tuple bytesToTuple(byte[] b)
          Cast data from bytes to tuple value.
 Schema determineSchema(String fileName, ExecType execType, DataStorage storage)
          Find the schema from the loader.
 void fieldsToRead(Schema schema)
          Indicate to the loader fields that will be needed.
 Tuple getNext()
          Retrieves the next tuple to be processed.
 

Method Detail

bindTo

void bindTo(String fileName,
            BufferedPositionedInputStream is,
            long offset,
            long end)
            throws IOException
Specifies a portion of an InputStream from which to read tuples. Because the starting and ending offsets may not fall on record boundaries, it is up to the implementor to determine the actual starting and ending offsets in such a way that an arbitrarily sliced file is processed in its entirety.

A common way of handling slices in the middle of records is to start at the given offset and, if the offset is not zero, skip to the end of the first record (which may be a partial record) before reading tuples. Reading continues until a tuple has been read that ends at an offset past the ending offset.

The load function should not do any buffering on the input stream. Buffering will cause the offsets returned by is.getPos() to be unreliable.
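The slice-handling approach described above can be sketched as follows. This is a self-contained illustration under stated assumptions, not the Pig implementation: records are newline-terminated lines, a String stands in for a Tuple, and CountingStream is a hypothetical stand-in for a stream that can report its position. SliceLineReader and SliceDemo are likewise hypothetical names.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

class SliceDemo {
    /** Reads every record that belongs to the slice [offset, end] of data. */
    static List<String> readSlice(byte[] data, long offset, long end) {
        try {
            InputStream raw =
                new ByteArrayInputStream(data, (int) offset, data.length - (int) offset);
            SliceLineReader reader = new SliceLineReader();
            reader.bindTo(new CountingStream(raw, offset), offset, end);
            List<String> records = new ArrayList<>();
            for (String r = reader.getNext(); r != null; r = reader.getNext()) {
                records.add(r);
            }
            return records;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] data = "aa\nbbb\ncc\ndddd\n".getBytes(StandardCharsets.UTF_8);
        System.out.println(readSlice(data, 0, 6));           // first slice
        System.out.println(readSlice(data, 6, data.length)); // second slice
    }
}

/** Hypothetical position-reporting stream; reads one byte at a time, no buffering. */
class CountingStream {
    private final InputStream in;
    private long pos;

    /** Wraps a stream that is already positioned at startPos. */
    CountingStream(InputStream in, long startPos) {
        this.in = in;
        this.pos = startPos;
    }

    int read() throws IOException {
        int c = in.read();
        if (c != -1) pos++;
        return c;
    }

    long getPos() { return pos; }
}

/** Hypothetical loader for newline-terminated records within one slice. */
class SliceLineReader {
    private CountingStream is;
    private long end;

    void bindTo(CountingStream is, long offset, long end) throws IOException {
        this.is = is;
        this.end = end;
        // A non-zero offset may land mid-record: discard up to the next record
        // boundary. The reader of the previous slice picks that record up.
        if (offset != 0) getNext();
    }

    String getNext() throws IOException {
        // Stop once a record has been read that ends past the ending offset.
        if (is == null || is.getPos() > end) return null;
        ByteArrayOutputStream record = new ByteArrayOutputStream();
        int c;
        while ((c = is.read()) != -1 && c != '\n') record.write(c);
        if (c == -1 && record.size() == 0) return null; // end of input
        return new String(record.toByteArray(), StandardCharsets.UTF_8);
    }
}
```

With the sample data, the first slice yields [aa, bbb] and the second yields [cc, dddd], so each record is read exactly once even though offset 6 is not chosen with record boundaries in mind. The sketch accumulates only the bytes of the current record and never reads ahead on the input, so the reported position stays reliable.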

Parameters:
fileName - the name of the file to be read
is - the stream representing the file to be processed, which can also report its position.
offset - the offset at which to start reading tuples.
end - the ending offset for reading.
Throws:
IOException

getNext

Tuple getNext()
              throws IOException
Retrieves the next tuple to be processed.

Returns:
the next tuple to be processed or null if there are no more tuples to be processed.
Throws:
IOException

bytesToInteger

Integer bytesToInteger(byte[] b)
                       throws IOException
Cast data from bytes to integer value.

Parameters:
b - byte array to be cast.
Returns:
Integer value.
Throws:
IOException - if the value cannot be cast.

bytesToLong

Long bytesToLong(byte[] b)
                 throws IOException
Cast data from bytes to long value.

Parameters:
b - byte array to be cast.
Returns:
Long value.
Throws:
IOException - if the value cannot be cast.

bytesToFloat

Float bytesToFloat(byte[] b)
                   throws IOException
Cast data from bytes to float value.

Parameters:
b - byte array to be cast.
Returns:
Float value.
Throws:
IOException - if the value cannot be cast.

bytesToDouble

Double bytesToDouble(byte[] b)
                     throws IOException
Cast data from bytes to double value.

Parameters:
b - byte array to be cast.
Returns:
Double value.
Throws:
IOException - if the value cannot be cast.

bytesToCharArray

String bytesToCharArray(byte[] b)
                        throws IOException
Cast data from bytes to chararray value.

Parameters:
b - byte array to be cast.
Returns:
String value.
Throws:
IOException - if the value cannot be cast.

bytesToMap

Map<Object,Object> bytesToMap(byte[] b)
                              throws IOException
Cast data from bytes to map value.

Parameters:
b - byte array to be cast.
Returns:
Map value.
Throws:
IOException - if the value cannot be cast.

bytesToTuple

Tuple bytesToTuple(byte[] b)
                   throws IOException
Cast data from bytes to tuple value.

Parameters:
b - byte array to be cast.
Returns:
Tuple value.
Throws:
IOException - if the value cannot be cast.

bytesToBag

DataBag bytesToBag(byte[] b)
                   throws IOException
Cast data from bytes to bag value.

Parameters:
b - byte array to be cast.
Returns:
Bag value.
Throws:
IOException - if the value cannot be cast.

fieldsToRead

void fieldsToRead(Schema schema)
Indicate to the loader which fields will be needed. This can be useful for loaders that access data stored in a columnar format, where indicating the columns to be accessed ahead of time will save scans. If the loader function cannot make use of this information, it is free to ignore it.

Parameters:
schema - Schema indicating which columns will be needed.
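
How a columnar loader might use this hint can be sketched as below. The sketch makes several assumptions: a list of field names stands in for a Pig Schema, a List of strings stands in for a Tuple, and ColumnarLoader and ColumnarDemo are hypothetical names used only for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class ColumnarDemo {
    public static void main(String[] args) {
        Map<String, List<String>> columns = new LinkedHashMap<>();
        columns.put("name", Arrays.asList("alice", "bob"));
        columns.put("age", Arrays.asList("42", "17"));
        columns.put("city", Arrays.asList("Oslo", "Lima"));

        ColumnarLoader loader = new ColumnarLoader(columns);
        loader.fieldsToRead(Arrays.asList("age")); // only this column is touched
        for (List<String> t = loader.getNext(); t != null; t = loader.getNext()) {
            System.out.println(t);
        }
    }
}

/** Hypothetical loader over data stored column by column (not a Pig class). */
class ColumnarLoader {
    private final Map<String, List<String>> columns; // column name -> values
    private final int rowCount;
    private List<String> needed; // null means no hint was given: read all columns
    private int row = 0;

    ColumnarLoader(Map<String, List<String>> columns) {
        this.columns = columns;
        this.rowCount = columns.isEmpty() ? 0 : columns.values().iterator().next().size();
    }

    /** Records the hint; a loader that cannot use it could simply ignore it. */
    void fieldsToRead(List<String> fieldNames) {
        this.needed = fieldNames;
    }

    /** Returns the next row restricted to the needed columns, or null at the end. */
    List<String> getNext() {
        if (row >= rowCount) return null;
        Collection<String> names = (needed != null) ? needed : columns.keySet();
        List<String> tuple = new ArrayList<>();
        for (String name : names) {
            tuple.add(columns.get(name).get(row));
        }
        row++;
        return tuple;
    }
}
```

In this sketch the name and city columns are never accessed once the hint names only age, which is the kind of saving the method is meant to enable.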

determineSchema

Schema determineSchema(String fileName,
                       ExecType execType,
                       DataStorage storage)
                       throws IOException
Find the schema from the loader. This function is called at parse time (not run time) to see if the loader can provide a schema for the data. The loader may be able to do this if the data is self-describing (e.g. JSON). If the loader cannot determine the schema, it can return null.

LoadFunc implementations that need to open the input fileName can use FileLocalizer.open(String fileName, ExecType execType, DataStorage storage) to get an InputStream with which to initialize their loader implementation. They can then read the input data from this stream to discover the schema. Note: this works only when fileName represents a file on the local file system or the Hadoop file system.

Parameters:
fileName - name of the file to be read (this will be the same as the filename in the load statement of the script)
execType - execution mode of the Pig script, one of ExecType.LOCAL or ExecType.MAPREDUCE
storage - the DataStorage object corresponding to the execType
Returns:
a Schema describing the data if possible, or null otherwise.
Throws:
IOException
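
Schema discovery from self-describing data can be sketched as below. The input format is purely an assumption for illustration: the first line, if it starts with "#", lists tab-separated field names. A list of names stands in for a Pig Schema, and HeaderSchemaSketch is a hypothetical class. In a real loader the InputStream would be obtained as described for determineSchema above.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;

class HeaderSchemaSketch {
    /**
     * Returns the field names from a "#"-prefixed, tab-separated header line,
     * or null if the input has no such header (schema cannot be determined).
     */
    static List<String> determineFieldNames(InputStream in) throws IOException {
        BufferedReader reader =
            new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8));
        String first = reader.readLine();
        if (first == null || !first.startsWith("#")) {
            return null; // not self-describing: let the caller fall back
        }
        return Arrays.asList(first.substring(1).split("\t"));
    }

    public static void main(String[] args) throws IOException {
        byte[] withHeader = "#name\tage\nalice\t42\n".getBytes(StandardCharsets.UTF_8);
        byte[] noHeader = "alice\t42\n".getBytes(StandardCharsets.UTF_8);
        System.out.println(determineFieldNames(new ByteArrayInputStream(withHeader)));
        System.out.println(determineFieldNames(new ByteArrayInputStream(noHeader)));
    }
}
```

Returning null when no header is found matches the contract above: the loader reports that it cannot determine a schema rather than guessing one.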


Copyright © ${year} The Apache Software Foundation