::Go back to Oozie Documentation Index::

Oozie SLA Monitoring

Overview

Critical jobs can have certain SLA requirements associated with them. This SLA can be in terms of time i.e. a maximum allowed time limit associated with when the job should start, by when should it end, and its duration of run. Oozie workflows and coordinators allow defining such SLA limits in the application definition xml.

With the addition of SLA Monitoring, Oozie can now actively monitor the state of these SLA-sensitive jobs and send out notifications for SLA mets and misses.

In versions earlier than 4.x, this was a passive feature where users needed to query the Oozie client SLA API to fetch the records regarding job status changes, and use their own custom calculation engine to compute whether SLA was met or missed, based on initial definition of time limits.

Oozie now also has a SLA tab in the Oozie UI, where users can query for SLA information and have a summarized view of how their jobs fared against their SLAs.

Oozie Server Configuration

Refer to Notifications Configuration for configuring Oozie server to track SLA for jobs and send notifications.

SLA Tracking

Oozie allows tracking SLA for meeting the following criteria:

  • Start time
  • End time
  • Job Duration

Event Status

Corresponding to each of these 3 criteria, your jobs are processed for whether Met or Miss i.e.

  • START_MET, START_MISS
  • END_MET, END_MISS
  • DURATION_MET, DURATION_MISS

SLA Status

Expected end-time is the most important criterion for majority of users while deciding overall SLA Met or Miss. Hence the "SLA Status"_ for a job will transition through these four stages

  • Not_Started <-- Job not yet begun
  • In_Process <-- Job started and is running, and SLAs are being tracked
  • Met <-- caused by an END_MET
  • Miss <-- caused by an END_MISS

In addition to overshooting expected end-time, and END_MISS (and so an eventual SLA MISS) also occurs when the job does not end successfully e.g. goes to error state - Failed/Killed/Error/Timedout.

Configuring SLA in Applications

To make your jobs trackable for SLA, you simply need to add the tag to your workflow application definition. If you were already using the existing SLA schema in your workflows (Schema xmlns:sla="uri:oozie:sla:0.1"), you don't need to do anything extra to receive SLA notifications via JMS messages. This new SLA monitoring framework is backward-compatible - no need to change application XML for now and you can continue to fetch old records via the command line API . However, usage of old schema and API is deprecated and we strongly recommend using new schema.

  • New SLA schema is 'uri:oozie:sla:0.2'
  • In order to use new SLA schema, you will need to upgrade your workflow/coordinator schema to 0.5 i.e. 'uri:oozie:workflow:0.5'

SLA Definition in Workflow

Example:

<workflow-app name="test-wf-job-sla"
              xmlns="uri:oozie:workflow:0.5"
              xmlns:sla="uri:oozie:sla:0.2">
    <start to="grouper"/>
    <action name="grouper">
        <map-reduce>
            <job-tracker>jt</job-tracker>
            <name-node>nn</name-node>
            <configuration>
                <property>
                    <name>mapred.input.dir</name>
                    <value>input</value>
                </property>
                <property>
                    <name>mapred.output.dir</name>
                    <value>output</value>
                </property>
            </configuration>
        </map-reduce>
        <ok to="end"/>
        <error to="end"/>
    </action>
    <end name="end"/>
    <sla:info>
        <sla:nominal-time>${nominal_time}</sla:nominal-time>
        <sla:should-start>${10 * MINUTES}</sla:should-start>
        <sla:should-end>${30 * MINUTES}</sla:should-end>
        <sla:max-duration>${30 * MINUTES}</sla:max-duration>
        <sla:alert-events>start_miss,end_miss,duration_miss</sla:alert-events>
        <sla:alert-contact>joe@example.com</sla:alert-contact>
    </sla:info>
</workflow-app>

For the list of tags usable under , refer to Schemas Appendix . This new schema is much more compact and meaningful, getting rid of redundant and unused tags.

  • nominal-time : As the name suggests, this is the time relative to which your jobs' SLAs will be calculated. Generally since Oozie workflows are aligned with synchronous data dependencies, this nominal time can be parameterized to be passed the value of your coordinator nominal time. Nominal time is also required in case of independent workflows and you can specify the time in which you expect the workflow to be run if you don't have a synchronous dataset associated with it.
  • should-start : Relative to nominal-time this is the amount of time (along with time-unit - MINUTES, HOURS, DAYS) within which your job should start running to meet SLA. This is optional.
  • should-end : Relative to nominal-time this is the amount of time (along with time-unit - MINUTES, HOURS, DAYS) within which your job should finish to meet SLA.
  • max-duration : This is the maximum amount of time (along with time-unit - MINUTES, HOURS, DAYS) your job is expected to run. This is optional.
  • alert-events : Specify the types of events for which Email alerts should be sent. Allowable values in this comma-separated list are start_miss, end_miss and duration_miss. *_met events can generally be deemed low priority and hence email alerting for these is not necessary. However, note that this setting is only for alerts via email alerts and not via JMS messages, where all events send out notifications, and user can filter them using desired selectors. This is optional and only applicable when alert-contact is configured.
  • alert-contact : Specify a comma separated list of email addresses where you wish your alerts to be sent. This is optional and need not be configured if you just want to view your job SLA history in the UI and do not want to receive email alerts.

NOTE: All tags can be parameterized.

Same schema can be applied to and embedded under Workflow-Action as well as Coordinator-Action XML.

SLA Definition in Workflow Action

<workflow-app name="test-wf-action-sla" xmlns="uri:oozie:workflow:0.5" xmlns:sla="uri:oozie:sla:0.2">
    <start to="grouper"/>
    <action name="grouper">
        ...
        <ok to="end"/>
        <error to="end"/>
        <sla:info>
            <sla:nominal-time>${nominal_time}</sla:nominal-time>
            <sla:should-start>${10 * MINUTES}</sla:should-start>
        ...
        </sla:info>
    </action>
    <end name="end"/>
</workflow-app>

SLA Definition in Coordinator Action

<coordinator-app name="test-coord-sla" frequency="${coord:days(1)}" freq_timeunit="DAY"
    end_of_duration="NONE" start="2013-06-20T08:01Z" end="2013-12-01T08:01Z"
    timezone="America/Los_Angeles" xmlns="uri:oozie:coordinator:0.4" xmlns:sla="uri:oozie:sla:0.2">
    <action>
        <workflow>
            <app-path>${wfAppPath}</app-path>
        </workflow>
        <sla:info>
            <sla:nominal-time>${nominal_time}</sla:nominal-time>
            ...
        </sla:info>
    </action>
</coordinator-app>

Accessing SLA Information

SLA information is accessible via the following ways:

  • Through the SLA tab of the Oozie Web UI.
  • JMS messages sent to a configured JMS provider for instantaneous tracking.
  • RESTful API to query for SLA summary.
  • As an Instrumentation.Counter entry that is accessible via RESTful API and reflects to the number of all SLA tracked external
entities. Name of this counter is sla-calculator.sla-map .

For JMS Notifications, you have to have a message broker in place, on which Oozie publishes messages and you can hook on a subscriber to receive those messages. For more info on setting up and consuming JMS messages, refer JMS Notifications documentation.

In the REST API, the following filters can be applied while fetching SLA information:

  • app_name - Application name
  • id - id of the workflow job, workflow action or coordinator action
  • parent_id - Parent id of the workflow job, workflow action or coordinator action
  • nominal_start and nominal_end - Start and End range for nominal time of the workflow or coordinator.
  • event_status - event status such as START_MET/START_MISS/DURATION_MET/DURATION_MISS/END_MET/END_MISS/ALL

Multiple event_status can be specified with comma separation. When multiple statuses are specified, they are considered as OR. For example, event_status=START_MET,END_MISS list the coordinator actions where event status is either START_MET OR END_MISS.

If event_status is not present, the event status field values are not returned as part of JSON response, and no filtering is performed on that same field.

If event_status is set to ALL, the event status field values are returned as part of JSON response, and no filtering is performed on that same field. Note that it's not possible to mix the filter value ALL and any of START_MET/START_MISS/DURATION_MET/DURATION_MISS/END_MET/END_MISS.

When timezone query parameter is specified, the expected and actual start/end time returned is formatted. If not specified, the number of milliseconds that have elapsed since January 1, 1970 00:00:00.000 GMT is returned.

The examples below demonstrate the use of REST API and explains the JSON response.

Scenario 1: Workflow Job Start_Miss

Request:

GET <oozie-host>:<port>/oozie/v2/sla?timezone=GMT&filter=nominal_start=2013-06-18T00:01Z;nominal_end=2013-06-23T00:01Z;app_name=my-sla-app

JSON Response

{    id : "000056-1238791320234-oozie-joe-W"
    parentId : "000001-1238791320234-oozie-joe-C@8"
    appType : "WORKFLOW_JOB"
    msgType : "SLA"
    appName : "my-sla-app"
    slaStatus : "IN_PROCESS"
    jobStatus : "RUNNING"
    user: "joe"
    nominalTime: "2013-16-22T05:00Z"
    expectedStartTime: "2013-16-22T05:10Z" <-- (should start by this time)
    actualStartTime: "2013-16-22T05:30Z" <-- (20 min late relative to expected start)
    expectedEndTime: "2013-16-22T05:40Z" <-- (should end by this time)
    actualEndTime: null
    expectedDuration: 900000 <-- (expected duration in milliseconds)
    actualDuration: 120000 <-- (actual duration in milliseconds)
    notificationMessage: "My Job has encountered an SLA event!"
    upstreamApps: "dependent-app-1, dependent-app-2"
}

Scenario 2: Workflow Action End_Miss

Request:

GET <oozie-host>:<port>/oozie/v2/sla?timezone=GMT&filter=parent_id=000056-1238791320234-oozie-joe-W

JSON Response

{    id : "000056-1238791320234-oozie-joe-W@map-reduce-action"
    parentId : "000056-1238791320234-oozie-joe-W"
    appType : "WORKFLOW_ACTION"
    msgType : "SLA"
    appName : "map-reduce-action"
    slaStatus : "MISS"
    jobStatus : "SUCCEEDED"
    user: "joe"
    nominalTime: "2013-16-22T05:00Z"
    expectedStartTime: "2013-16-22T05:10Z"
    actualStartTime: "2013-16-22T05:05Z"
    expectedEndTime: "2013-16-22T05:40Z" <-- (should end by this time)
    actualEndTime: "2013-16-22T06:00Z" <-- (20 min late relative to expected end)
    expectedDuration: 3600000 <-- (expected duration in milliseconds)
    actualDuration: 3300000 <-- (actual duration in milliseconds)
    notificationMessage: "My Job has encountered an SLA event!"
    upstreamApps: "dependent-app-1, dependent-app-2"
}

Scenario 3: Coordinator Action Duration_Miss

Request:

GET <oozie-host>:<port>/oozie/v2/sla?timezone=GMT&filter=id=000001-1238791320234-oozie-joe-C

JSON Response

{    id : "000001-1238791320234-oozie-joe-C@2"
    parentId : "000001-1238791320234-oozie-joe-C"
    appType : "COORDINATOR_ACTION"
    msgType : "SLA"
    appName : "my-coord-app"
    slaStatus : "MET"
    jobStatus : "SUCCEEDED"
    user: "joe"
    nominalTime: "2013-16-22T05:00Z"
    expectedStartTime: "2013-16-22T05:10Z"
    actualStartTime: "2013-16-22T05:05Z"
    expectedEndTime: "2013-16-22T05:40Z"
    actualEndTime: "2013-16-22T05:30Z"
    expectedDuration: 900000 <-- (expected duration in milliseconds)
    actualDuration: 1500000 <- (actual duration in milliseconds)
    notificationMessage: "My Job has encountered an SLA event!"
    upstreamApps: "dependent-app-1, dependent-app-2"
}

Scenario #3 is particularly interesting because it is an overall "MET" because it met its expected End-time, but it is "Duration_Miss" because the actual run (between actual start and actual end) exceeded expected duration.

Scenario 4: All Coordinator actions of a coordinator application using ALL event status filter

Request:

GET <oozie-host>:<port>/oozie/v2/sla?timezone=GMT&filter=app_name=my-coord-app;event_status=ALL

JSON Response

{
    id : "000001-1238791320234-oozie-joe-C@1"
    parentId : "000001-1238791320234-oozie-joe-C"
    appType : "COORDINATOR_ACTION"
    msgType : "SLA"
    appName : "my-coord-app"
    slaStatus : "MET"
    eventStatus : "START_MET,DURATION_MISS,END_MISS"
    user: "joe"
    nominalTime: "2014-01-10T12:00Z"
    expectedStartTime: "2014-01-10T12:00Z"
    actualStartTime: "2014-01-10T11:59Z"
    startDelay: -1
    expectedEndTime: "2014-01-10T13:00Z"
    actualEndTime: "2014-01-10T13:05Z"
    endDelay: 5
    expectedDuration: 3600000 <-- (expected duration in milliseconds)
    actualDuration: 3960000 <-- (actual duration in milliseconds)
    durationDelay: 6 <-- (duration delay in minutes)
}
{
    id : "000001-1238791320234-oozie-joe-C@2"
    parentId : "000001-1238791320234-oozie-joe-C"
    appType : "COORDINATOR_ACTION"
    msgType : "SLA"
    appName : "my-coord-app"
    slaStatus : "MET"
    eventStatus : "START_MISS,DURATION_MET,END_MISS"
    user: "joe"
    nominalTime: "2014-01-11T12:00Z"
    expectedStartTime: "2014-01-11T12:00Z"
    actualStartTime: "2014-01-11T12:05Z"
    startDelay: 5
    expectedEndTime: "2014-01-11T13:00Z"
    actualEndTime: "2014-01-11T13:01Z"
    endDelay: 1
    expectedDuration: 3600000 <-- (expected duration in milliseconds)
    actualDuration: 3360000 <-- (actual duration in milliseconds)
    durationDelay: -4 <-- (duration delay in minutes)
}

Scenario #4 (All Coordinator actions of a single application) is to get SLA information of all coordinator actions of an application in one call.

startDelay/durationDelay/endDelay values returned indicate how much delay compared to expected time (positive values in case of MISS, and negative values in case of MET).

Note that adding the event_status filter the ALL value results in the field eventStatus to be included in the JSON response, without filtering on that same field.

Scenario 5: All Coordinator actions of a coordinator application using START_MET event status filter

Request:

GET <oozie-host>:<port>/oozie/v2/sla?timezone=GMT&filter=app_name=my-coord-app;event_status=START_MET

JSON Response

{
    id : "000001-1238791320234-oozie-joe-C@1"
    parentId : "000001-1238791320234-oozie-joe-C"
    appType : "COORDINATOR_ACTION"
    msgType : "SLA"
    appName : "my-coord-app"
    slaStatus : "MET"
    eventStatus : "START_MET,DURATION_MISS,END_MISS"
    user: "joe"
    nominalTime: "2014-01-10T12:00Z"
    expectedStartTime: "2014-01-10T12:00Z"
    actualStartTime: "2014-01-10T11:59Z"
    startDelay: -1
    expectedEndTime: "2014-01-10T13:00Z"
    actualEndTime: "2014-01-10T13:05Z"
    endDelay: 5
    expectedDuration: 3600000 <-- (expected duration in milliseconds)
    actualDuration: 3960000 <-- (actual duration in milliseconds)
    durationDelay: 6 <-- (duration delay in minutes)
}

Scenario #5 (All Coordinator actions of a single application) is to get SLA information of all coordinator actions of an application in one call based on an additional event status filter.

startDelay/durationDelay/endDelay values returned indicate how much delay compared to expected time (positive values in case of MISS, and negative values in case of MET).

Note that adding the event_status filter the START_MET value results in the field eventStatus to be included in the JSON response, applying the filtering on that same field.

Sample Email Alert

Subject: OOZIE - SLA END_MISS (AppName=wf-sla-job, JobID=0000004-130610225200680-oozie-oozi-W)Status:
  SLA Status - END_MISS
  Job Status - RUNNING
  Notification Message - Missed SLA for Data Pipeline job
Job Details:
  App Name - wf-sla-job
  App Type - WORKFLOW_JOB
  User - strat_ci
  Job ID - 0000004-130610225200680-oozie-oozi-W
  Job URL - http://host.domain.com:4080/oozie//?job=0000004-130610225200680-oozie-oozi-W
  Parent Job ID - N/A
  Parent Job URL - N/A
  Upstream Apps - wf-sla-up-app
SLA Details:
  Nominal Time - Mon Jun 10 23:33:00 UTC 2013
  Expected Start Time - Mon Jun 10 23:35:00 UTC 2013
  Actual Start Time - Mon Jun 10 23:34:04 UTC 2013
  Expected End Time - Mon Jun 10 23:38:00 UTC 2013
  Expected Duration (in mins) - 5
  Actual Duration (in mins) - -1

In-memory SLA entries and database content

There are special circumstances when the in-memory SLACalcStatus entries can exist without the workflow or coordinator job or action instances in database. For example:

  • SLA tracked database content may already have been deleted, and SLA_SUMMARY entry is not present anymore in database
  • SLA tracked database content and SLA_SUMMARY entry aren't yet present in database

By the time SLAService scheduled job will be running, SLA map contents are checked. When the SLA_SUMMARY entry for the in-memory SLA entry is missing, a counter is increased. When this counter reaches the server-wide preconfigured value =oozie.sla.service.SLAService.maximum.retry.count= (by default 3 ), in-memory SLA entry will get purged.

Known issues

There are two known issues when you define SLA for a workflow action.

  • If there are decision nodes and SLA is defined for a workflow action not in the execution path because of the decision node, you will still get an SLA_MISS notification.
  • If you have dangling action nodes in your workflow definition and SLA is defined for it, you will still get an SLA_MISS notification.

::Go back to Oozie Documentation Index::