

Oozie Workflow Action Extensions

This document specifies action extensions to the Oozie Workflow system.

AE.1 Hive Action

The hive action starts a Hive job.

The workflow job will wait until the Hive job completes before continuing to the next action.

To run the Hive job, you have to configure the hive action with the job-tracker, name-node, conf-dir, and Hive script elements, as well as any necessary parameters and configuration.

A hive action can be configured to create or delete HDFS directories before starting the Hive job. This may be necessary if the Hive script directs Hive to write output to a specific directory.

You can specify a directory containing Hive configuration files by using the conf-dir element. This directory should contain valid hive-default.xml and hive-log4j.properties files. It can also contain a hive-site.xml file and a hive-exec-log4j.properties file. Any other files placed in this directory will be ignored by the hive action.

You can also specify Hive configuration properties inline by using the configuration element; you can optionally parameterize (templatize) these property values by using EL expressions. Property values specified in the configuration element override values for the same properties specified in the hive-default.xml or hive-site.xml files located in the conf-dir.

Note that Hadoop mapred.job.tracker and fs.default.name properties must not be present in the inline configuration.
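
As a minimal sketch of inline configuration, the fragment below sets one Hive job property from an EL expression. The compressMapOutput workflow parameter is an assumption made for illustration; the host names, conf-dir, and script name are borrowed from the example later in this section. The mapred.job.tracker and fs.default.name properties are deliberately absent.

```xml
<hive xmlns="uri:oozie:hive-action:0.1">
    <job-tracker>foo:9001</job-tracker>
    <name-node>bar:9000</name-node>
    <conf-dir>conf/</conf-dir>
    <configuration>
        <!-- Overrides any value set in hive-default.xml or hive-site.xml under conf/ -->
        <property>
            <name>mapred.compress.map.output</name>
            <!-- compressMapOutput is a hypothetical workflow job parameter -->
            <value>${compressMapOutput}</value>
        </property>
    </configuration>
    <script>myscript.q</script>
</hive>
```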

As with Hadoop map-reduce jobs, it is possible to add files and archives in order to make them available to the Hive job. Refer to the "Adding Files and Archives for the Job" section for more information about this feature.
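
For illustration only, a hive action that ships one file and one archive with the job might look like the sketch below. Both paths are hypothetical, and the element order follows the schema in Appendix A (script first, then file, then archive).

```xml
<hive xmlns="uri:oozie:hive-action:0.1">
    <job-tracker>foo:9001</job-tracker>
    <name-node>bar:9000</name-node>
    <script>myscript.q</script>
    <!-- Hypothetical paths: a jar and an archive that the script is assumed to need -->
    <file>lib/my-udfs.jar</file>
    <archive>data/lookup-tables.zip</archive>
</hive>
```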

Syntax:

<workflow-app name="[WF-DEF-NAME]" xmlns="uri:oozie:workflow:0.1">
    ...
    <action name="[NODE-NAME]">
        <hive xmlns="uri:oozie:hive-action:0.1">
            <job-tracker>[JOB-TRACKER]</job-tracker>
            <name-node>[NAME-NODE]</name-node>
            <prepare>
               <delete path="[PATH]"/>
               ...
               <mkdir path="[PATH]"/>
               ...
            </prepare>
            <conf-dir>[HIVE CONFIGURATION DIRECTORY]</conf-dir>
            <configuration>
                <property>
                    <name>[PROPERTY-NAME]</name>
                    <value>[PROPERTY-VALUE]</value>
                </property>
                ...
            </configuration>
            <script>[HIVE-SCRIPT]</script>
            <param>[PARAM-VALUE]</param>
            ...
            <param>[PARAM-VALUE]</param>
            <file>[FILE-PATH]</file>
            ...
            <archive>[FILE-PATH]</archive>
            ...
        </hive>
        <ok to="[NODE-NAME]"/>
        <error to="[NODE-NAME]"/>
    </action>
    ...
</workflow-app>

The prepare element, if present, indicates a list of paths to delete or create before starting the job.
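
A minimal prepare block, sketched under the assumption that the job writes to ${jobOutput} and needs a scratch directory, could look like this. The stagingDir parameter is hypothetical. Per the schema in Appendix A, all delete elements come before any mkdir elements.

```xml
<prepare>
    <!-- Remove output left behind by a previous run -->
    <delete path="${jobOutput}"/>
    <!-- stagingDir is a hypothetical workflow job parameter -->
    <mkdir path="${stagingDir}"/>
</prepare>
```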

The conf-dir element, if present, points to an HDFS directory that must contain valid hive-default.xml and hive-log4j.properties configuration files. This directory can optionally contain hive-site.xml and hive-exec-log4j.properties files. The Oozie Hive action will ignore any other files found in this directory.

The configuration element, if present, contains configuration properties that are passed to the Hive job. These property values may be used to override property values that were previously set by configuration files located in the conf-dir directory.

The script element must contain the path of the Hive script to execute. The Hive script can be templatized with variables of the form ${VARIABLE}. The values of these variables can then be specified using param elements.

The param elements, if present, contain parameters to be passed to the Hive script.
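
The sketch below shows how a param value might line up with a script variable. It assumes the NAME=VALUE form used in this document's example; the script body quoted in the comment is hypothetical.

```xml
<hive xmlns="uri:oozie:hive-action:0.1">
    <job-tracker>foo:9001</job-tracker>
    <name-node>bar:9000</name-node>
    <!-- Assumed content of myscript.q (hypothetical):
         INSERT OVERWRITE DIRECTORY '${OutputDir}' SELECT * FROM some_table; -->
    <script>myscript.q</script>
    <!-- Supplies the value for the OutputDir variable referenced by the script -->
    <param>OutputDir=${jobOutput}</param>
</hive>
```

Here ${jobOutput} is an EL expression resolved in the workflow definition, while ${OutputDir} is a script variable filled in from the param element.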

All the above elements can be parameterized (templatized) using EL expressions.

Example:

<workflow-app name="sample-wf" xmlns="uri:oozie:workflow:0.1">
    ...
    <action name="myfirsthivejob">
        <hive xmlns="uri:oozie:hive-action:0.1">
            <job-tracker>foo:9001</job-tracker>
            <name-node>bar:9000</name-node>
            <prepare>
                <delete path="${jobOutput}"/>
            </prepare>
            <conf-dir>conf/</conf-dir>
            <configuration>
                <property>
                    <name>mapred.compress.map.output</name>
                    <value>true</value>
                </property>
            </configuration>
            <script>myscript.q</script>
            <param>InputDir=/home/tucu/input-data</param>
            <param>OutputDir=${jobOutput}</param>
        </hive>
        <ok to="myotherjob"/>
        <error to="errorcleanup"/>
    </action>
    ...
</workflow-app>

AE.2 Sqoop Action

The sqoop action starts a Sqoop job.

The workflow job will wait until the Sqoop job completes before continuing to the next action.

In order for Oozie to run the Sqoop job, you must configure the sqoop action with the job-tracker, name-node, conf-dir, and Sqoop command elements.

A sqoop action can be configured to perform HDFS file/directory cleanup or creation before starting the Sqoop job. This may be necessary if the Sqoop command directs Sqoop to write output to a directory that may already exist.

A directory containing Sqoop configuration files may be specified using the conf-dir element. This directory may contain sqoop-default.xml and sqoop-site.xml configuration files. If Sqoop's hive-import feature is being used, this directory should also contain the Hive configuration files described earlier in the Hive Action section (AE.1). Any other files placed in this directory will be ignored by the sqoop action.

You can also specify Sqoop configuration properties inline by using the configuration element; you can optionally parameterize (templatize) these property values by using EL expressions. Property values specified in the configuration element override property values defined in files located in the conf-dir.

Note that Hadoop mapred.job.tracker and fs.default.name properties must not be present in the inline configuration.

Syntax:

<workflow-app name="[WF-DEF-NAME]" xmlns="uri:oozie:workflow:0.1">
    ...
    <action name="[NODE-NAME]">
        <sqoop xmlns="uri:oozie:sqoop-action:0.1">
            <job-tracker>[JOB-TRACKER]</job-tracker>
            <name-node>[NAME-NODE]</name-node>
            <prepare>
               <delete path="[PATH]"/>
               ...
               <mkdir path="[PATH]"/>
               ...
            </prepare>
            <conf-dir>[SQOOP CONFIGURATION DIRECTORY]</conf-dir>
            <configuration>
                <property>
                    <name>[PROPERTY-NAME]</name>
                    <value>[PROPERTY-VALUE]</value>
                </property>
                ...
            </configuration>
            <command>[SQOOP-COMMAND]</command>
        </sqoop>
        <ok to="[NODE-NAME]"/>
        <error to="[NODE-NAME]"/>
    </action>
    ...
</workflow-app>

The prepare element, if present, indicates a list of paths to delete or create before starting the job.

The conf-dir element, if present, gives the path of an HDFS directory that may contain sqoop-default.xml and sqoop-site.xml files. If Sqoop is being used to import tables into Hive using Sqoop's hive-import feature, then the contents of this directory must also satisfy the conditions outlined in the Hive Action section (AE.1).
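
As a sketch of the hive-import case, the action below assumes that conf/ holds the Hive configuration files alongside any Sqoop ones. The jdbcUrl and tableName parameters are hypothetical and stand in for connection details that depend on your database.

```xml
<sqoop xmlns="uri:oozie:sqoop-action:0.1">
    <job-tracker>foo:9001</job-tracker>
    <name-node>bar:9000</name-node>
    <!-- conf/ is assumed to hold hive-default.xml and hive-log4j.properties,
         plus optional sqoop-default.xml and sqoop-site.xml -->
    <conf-dir>conf/</conf-dir>
    <!-- jdbcUrl and tableName are hypothetical workflow job parameters -->
    <command>import --connect ${jdbcUrl} --table ${tableName} --hive-import</command>
</sqoop>
```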

The configuration element, if present, contains configuration properties that are passed to the Sqoop job. These property values may be used to override property values that were previously set by configuration files located in the conf-dir directory.

The command element must contain the Sqoop command line arguments that Oozie should execute. Oozie performs parameter substitution on the value of this element and then passes the result verbatim as arguments to the Sqoop command. Consult the Sqoop documentation for a complete list of valid Sqoop commands.
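
A sketch of parameter substitution in the command element follows. The jdbcUrl and tableName parameters are hypothetical; sqoopOutput is reused in the prepare block so that the directory Sqoop writes to is cleared first.

```xml
<sqoop xmlns="uri:oozie:sqoop-action:0.1">
    <job-tracker>foo:9001</job-tracker>
    <name-node>bar:9000</name-node>
    <prepare>
        <!-- Clear the target directory so the import does not find existing output -->
        <delete path="${sqoopOutput}"/>
    </prepare>
    <!-- jdbcUrl and tableName are hypothetical; sqoopOutput matches the prepare path -->
    <command>import --connect ${jdbcUrl} --table ${tableName} --target-dir ${sqoopOutput}</command>
</sqoop>
```

After substitution, the text of the command element is what Sqoop receives as its arguments, so every ${...} reference must resolve to a value that is valid on the Sqoop command line.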

All the above elements can be parameterized (templatized) using EL expressions.

Example:

<workflow-app name="sample-wf" xmlns="uri:oozie:workflow:0.1">
    ...
    <action name="myfirstsqoopjob">
        <sqoop xmlns="uri:oozie:sqoop-action:0.1">
            <job-tracker>foo:9001</job-tracker>
            <name-node>bar:9000</name-node>
            <prepare>
                <delete path="${sqoopOutput}"/>
            </prepare>
            <conf-dir>conf/</conf-dir>
            <command>import --connect jdbc:mysql://....</command>
        </sqoop>
        <ok to="myotherjob"/>
        <error to="errorcleanup"/>
    </action>
    ...
</workflow-app>

AE Appendixes

AE.A Appendix A, Hive XML-Schema

<?xml version="1.0" encoding="UTF-8"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
           xmlns:hive="uri:oozie:hive-action:0.1" elementFormDefault="qualified"
           targetNamespace="uri:oozie:hive-action:0.1">
    <xs:element name="hive" type="hive:ACTION"/>
    <xs:complexType name="ACTION">
        <xs:sequence>
            <xs:element name="job-tracker" type="xs:string" minOccurs="1" maxOccurs="1"/>
            <xs:element name="name-node" type="xs:string" minOccurs="1" maxOccurs="1"/>
            <xs:element name="prepare" type="hive:PREPARE" minOccurs="0" maxOccurs="1"/>
            <xs:element name="conf-dir" type="xs:string" minOccurs="0" maxOccurs="1"/>
            <xs:element name="configuration" type="hive:CONFIGURATION" minOccurs="0" maxOccurs="1"/>
            <xs:element name="script" type="xs:string" minOccurs="1" maxOccurs="1"/>
            <xs:element name="param" type="xs:string" minOccurs="0" maxOccurs="unbounded"/>
            <xs:element name="file" type="xs:string" minOccurs="0" maxOccurs="unbounded"/>
            <xs:element name="archive" type="xs:string" minOccurs="0" maxOccurs="unbounded"/>
        </xs:sequence>
    </xs:complexType>
    <xs:complexType name="CONFIGURATION">
        <xs:sequence>
            <xs:element name="property" minOccurs="1" maxOccurs="unbounded">
                <xs:complexType>
                    <xs:sequence>
                        <xs:element name="name" minOccurs="1" maxOccurs="1" type="xs:string"/>
                        <xs:element name="value" minOccurs="1" maxOccurs="1" type="xs:string"/>
                        <xs:element name="description" minOccurs="0" maxOccurs="1" type="xs:string"/>
                    </xs:sequence>
                </xs:complexType>
            </xs:element>
        </xs:sequence>
    </xs:complexType>
    <xs:complexType name="PREPARE">
        <xs:sequence>
            <xs:element name="delete" type="hive:DELETE" minOccurs="0" maxOccurs="unbounded"/>
            <xs:element name="mkdir" type="hive:MKDIR" minOccurs="0" maxOccurs="unbounded"/>
        </xs:sequence>
    </xs:complexType>
    <xs:complexType name="DELETE">
        <xs:attribute name="path" type="xs:string" use="required"/>
    </xs:complexType>
    <xs:complexType name="MKDIR">
        <xs:attribute name="path" type="xs:string" use="required"/>
    </xs:complexType>
</xs:schema>

AE.B Appendix B, Sqoop XML-Schema

<?xml version="1.0" encoding="UTF-8"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
           xmlns:sqoop="uri:oozie:sqoop-action:0.1" elementFormDefault="qualified"
           targetNamespace="uri:oozie:sqoop-action:0.1">
    <xs:element name="sqoop" type="sqoop:ACTION"/>
    <xs:complexType name="ACTION">
        <xs:sequence>
            <xs:element name="job-tracker" type="xs:string" minOccurs="1" maxOccurs="1"/>
            <xs:element name="name-node" type="xs:string" minOccurs="1" maxOccurs="1"/>
            <xs:element name="prepare" type="sqoop:PREPARE" minOccurs="0" maxOccurs="1"/>
            <xs:element name="conf-dir" type="xs:string" minOccurs="0" maxOccurs="1"/>
            <xs:element name="configuration" type="sqoop:CONFIGURATION" minOccurs="0" maxOccurs="1"/>
            <xs:element name="command" type="xs:string" minOccurs="1" maxOccurs="1"/>
        </xs:sequence>
    </xs:complexType>
    <xs:complexType name="CONFIGURATION">
        <xs:sequence>
            <xs:element name="property" minOccurs="1" maxOccurs="unbounded">
                <xs:complexType>
                    <xs:sequence>
                        <xs:element name="name" minOccurs="1" maxOccurs="1" type="xs:string"/>
                        <xs:element name="value" minOccurs="1" maxOccurs="1" type="xs:string"/>
                        <xs:element name="description" minOccurs="0" maxOccurs="1" type="xs:string"/>
                    </xs:sequence>
                </xs:complexType>
            </xs:element>
        </xs:sequence>
    </xs:complexType>
    <xs:complexType name="PREPARE">
        <xs:sequence>
            <xs:element name="delete" type="sqoop:DELETE" minOccurs="0" maxOccurs="unbounded"/>
            <xs:element name="mkdir" type="sqoop:MKDIR" minOccurs="0" maxOccurs="unbounded"/>
        </xs:sequence>
    </xs:complexType>
    <xs:complexType name="DELETE">
        <xs:attribute name="path" type="xs:string" use="required"/>
    </xs:complexType>
    <xs:complexType name="MKDIR">
        <xs:attribute name="path" type="xs:string" use="required"/>
    </xs:complexType>
</xs:schema>
