
Gobblin Deployment

Yinan Li edited this page Feb 11, 2015 · 47 revisions
  • Author: Yinan
  • Reviewer: Sahil

One important feature of Gobblin is that it can be deployed and run in different settings on different platforms. Currently, Gobblin supports a standalone mode that runs on a single machine and a Hadoop MapReduce mode that runs on a Hadoop cluster (both Hadoop 1.x and Hadoop 2.x are supported). This page summarizes these deployment modes. Because it helps to understand how Gobblin is structured in each mode, the page also describes the architecture of each one.

Standalone Architecture

The following diagram illustrates the Gobblin standalone architecture. In the standalone mode, a Gobblin instance runs in a single JVM and tasks run in a thread pool, the size of which is configurable. The standalone mode is good for light-weight data sources such as small databases. The standalone mode is also the default mode for trying and testing Gobblin.

Gobblin Image

In the standalone deployment, the JobScheduler runs as a daemon process that schedules and runs jobs using the so-called JobLaunchers. The JobScheduler maintains a thread pool in which a new JobLauncher is started for each job run. Gobblin ships with two types of JobLaunchers, namely, the LocalJobLauncher and MRJobLauncher for launching and running Gobblin jobs on a single machine and on Hadoop MapReduce, respectively. Which JobLauncher to use can be configured on a per-job basis, which means the JobScheduler can schedule and run jobs in different deployment modes. This section will focus on the LocalJobLauncher for launching and running Gobblin jobs on a single machine. The MRJobLauncher will be covered in a later section on the architecture of Gobblin on Hadoop MapReduce.

Each LocalJobLauncher starts and manages a few components for executing the tasks of a Gobblin job. Specifically, a TaskExecutor is responsible for executing tasks in a thread pool, whose size is configurable on a per-job basis. A LocalTaskStateTracker is responsible for keeping track of the state of running tasks, and in particular for updating the task metrics. The LocalJobLauncher follows the steps below to launch and run a Gobblin job:

  1. Starting the TaskExecutor and LocalTaskStateTracker.
  2. Creating an instance of the Source class specified in the job configuration and getting the list of WorkUnits to do.
  3. Creating a task for each WorkUnit in the list, registering the task with the LocalTaskStateTracker, and submitting the task to the TaskExecutor to run.
  4. Waiting for all the submitted tasks to finish.
  5. Upon completion of all the submitted tasks, collecting task states and persisting them to the state store, and publishing the extracted data.
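
The five steps above can be sketched with a plain thread pool. This is only an illustration in Python, not Gobblin's Java code: `get_work_units`, `run_task`, the dictionary used as a tracker, and the `"done"` state value are all hypothetical stand-ins for the Source, the tasks, the LocalTaskStateTracker, and the real task states.

```python
from concurrent.futures import ThreadPoolExecutor, wait


def get_work_units():
    """Hypothetical stand-in for the Source returning its list of WorkUnits."""
    return [{"id": i, "records": 10 * (i + 1)} for i in range(4)]


def run_task(work_unit):
    """Hypothetical stand-in for one task: pretend to pull the records."""
    return {"id": work_unit["id"], "state": "done",
            "records_pulled": work_unit["records"]}


def launch_job(pool_size=2):
    tracker = {}      # plays the role of the LocalTaskStateTracker
    state_store = []  # plays the role of the persisted state store
    # Step 1: start the executor, a thread pool of configurable size.
    with ThreadPoolExecutor(max_workers=pool_size) as executor:
        # Step 2: ask the source for the work units.
        work_units = get_work_units()
        # Step 3: create one task per work unit, register it, submit it.
        for work_unit in work_units:
            tracker[work_unit["id"]] = executor.submit(run_task, work_unit)
        # Step 4: wait for every submitted task to finish.
        wait(tracker.values())
    # Step 5: collect the task states and persist them.
    for task_id in sorted(tracker):
        state_store.append(tracker[task_id].result())
    return state_store
```

Calling `launch_job(pool_size=2)` runs the four sample work units on two threads and returns one state record per task.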

Standalone Deployment

Before Gobblin can be deployed, it needs to be built and packaged. Refer to the Getting Started Guide on how to build Gobblin.

In the standalone mode, the JobScheduler, upon startup, picks up job configuration files from a user-defined directory and schedules the jobs to run. An environment variable named GOBBLIN_JOB_CONFIG_DIR must be set to point to the directory where job configuration files are stored. Note that this job configuration directory is different from gobblin-dist/conf, which stores the Gobblin system configuration files. System configuration files hold deployment-specific properties that apply to all jobs, whereas job configuration files hold job-specific properties such as the Source and Converter classes used.

The JobScheduler is backed by a Quartz scheduler and it supports cron-based triggers using the configuration property job.schedule for defining the cron schedule. Please refer to this tutorial for more information on how to use and configure a cron-based trigger.
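
As a rough aid for reading a `job.schedule` value, the sketch below splits a Quartz cron expression into its named fields. Quartz expressions have six mandatory fields (seconds first) and an optional seventh for the year. The helper `describe_quartz_cron` and the sample schedule are illustrative assumptions, not part of Gobblin.

```python
QUARTZ_FIELDS = ["seconds", "minutes", "hours", "day_of_month",
                 "month", "day_of_week", "year"]


def describe_quartz_cron(expression):
    """Map each whitespace-separated cron field to its Quartz field name."""
    parts = expression.split()
    if len(parts) not in (6, 7):
        raise ValueError("Quartz cron expressions have 6 or 7 fields")
    return dict(zip(QUARTZ_FIELDS, parts))


# A sample property line as it might appear in a job configuration file.
# This schedule fires at second 0 of every 30th minute.
line = "job.schedule=0 0/30 * * * ?"
key, _, value = line.partition("=")
fields = describe_quartz_cron(value)
```

Here `fields["minutes"]` is `"0/30"` and `fields["day_of_week"]` is `"?"`, the Quartz placeholder for "no specific value".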

Gobblin needs a working directory at runtime, which is defined using an environment variable GOBBLIN_WORK_DIR. Once started, Gobblin will create some subdirectories under the root working directory, as follows:

    task-staging\ # Staging area where data pulled by individual tasks lands
    task-output\  # Output area where data pulled by individual tasks lands
    job-output\   # Final output area of data pulled by jobs
    state-store\  # Persisted job/task state store
    metrics\      # Metrics store (in the form of metric log files), one subdirectory per job.

Before starting the Gobblin standalone daemon, make sure the environment variable JAVA_HOME is properly set. Below is a summary of the environment variables that need to be set for standalone deployment.

  • GOBBLIN_JOB_CONFIG_DIR: this variable defines the directory where job configuration files are stored.
  • GOBBLIN_WORK_DIR: this variable defines the working directory for Gobblin to operate.
  • JAVA_HOME: this variable defines the path to the home directory of the Java Runtime Environment (JRE) used to run the daemon process.
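
Before starting the daemon, it can help to confirm that these variables are set and to see where the working subdirectories will appear. The following is a small pre-flight sketch under that assumption. The helper names are hypothetical, and the script only reports paths; it does not create anything.

```python
import os

REQUIRED_VARS = ["GOBBLIN_JOB_CONFIG_DIR", "GOBBLIN_WORK_DIR", "JAVA_HOME"]
WORK_SUBDIRS = ["task-staging", "task-output", "job-output",
                "state-store", "metrics"]


def missing_variables(env):
    """Return the required variables that are unset or empty in env."""
    return [name for name in REQUIRED_VARS if not env.get(name)]


def expected_work_dirs(env):
    """List the subdirectories expected under the root working directory."""
    root = env["GOBBLIN_WORK_DIR"]
    return [os.path.join(root, sub) for sub in WORK_SUBDIRS]
```

For example, `missing_variables(os.environ)` returns an empty list when the environment is ready, and `expected_work_dirs(os.environ)` lists the five subdirectories described above.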

Gobblin ships with a script for starting and stopping the standalone Gobblin daemon on a single node, found at bin/

To start the Gobblin standalone daemon, run the following command:

bin/ start

After the Gobblin standalone daemon is started, the logs can be found under gobblin-dist/logs. Gobblin uses SLF4J and the slf4j-log4j12 binding for logging. The log4j configuration can be found at gobblin-dist/conf/log4j-standalone.xml.

By default, the Gobblin standalone daemon uses the following JVM settings. Change the settings in gobblin-dist/conf/ if necessary for your deployment.

-Xmx2g -Xms1g
-XX:+UseConcMarkSweepGC -XX:+UseParNewGC
-XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+PrintTenuringDistribution
-XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=<gobblin log dir>

To restart the Gobblin standalone daemon, run the following command:

bin/ restart

To stop the running Gobblin standalone daemon, run the following command:

bin/ stop

The script also supports checking the status of the running daemon process using the bin/ status command.

Hadoop MapReduce Architecture

The diagram below shows the architecture of Gobblin on Hadoop MapReduce. As the diagram shows, a Gobblin job runs as a mapper-only MapReduce job that runs the tasks of the Gobblin job in the mappers. The basic idea is to use the mappers purely as containers for running Gobblin tasks. This design also makes it easier to integrate with YARN. Unlike in the standalone mode, task retries are not handled by Gobblin itself in the Hadoop MapReduce mode. Instead, Gobblin relies on the task retry mechanism of Hadoop MapReduce.

Gobblin Image

In this mode, a MRJobLauncher is used to launch and run a Gobblin job on Hadoop MapReduce, following the steps below:

  1. Creating an instance of the Source class specified in the job configuration and getting the list of WorkUnits to do.
  2. Serializing each WorkUnit into a file on HDFS that will be read later by a mapper.
  3. Creating a file that lists the paths of the files storing serialized WorkUnits.
  4. Creating and configuring a mapper-only Hadoop MapReduce job that takes the file created in step 3 as input.
  5. Starting the MapReduce job to run on the cluster of choice and waiting for it to finish.
  6. Upon completion of the MapReduce job, collecting task states and persisting them to the state store, and publishing the extracted data.
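
Steps 2 and 3 above amount to an indirection: the job input is a small file of paths, and each path points to one serialized WorkUnit. The sketch below illustrates that layout on a local directory. JSON and the file names are assumptions made for readability and are not Gobblin's actual serialization format or naming; a real job writes these files to HDFS.

```python
import json
import os
import tempfile


def prepare_job_input(work_units, job_dir):
    """Write one file per work unit plus an input file listing their paths."""
    paths = []
    for i, work_unit in enumerate(work_units):
        # Step 2: serialize each work unit into its own file.
        path = os.path.join(job_dir, f"workunit_{i}.json")
        with open(path, "w") as f:
            json.dump(work_unit, f)
        paths.append(path)
    # Step 3: the job input is a text file with one work unit path per line.
    input_file = os.path.join(job_dir, "job_input.txt")
    with open(input_file, "w") as f:
        f.write("\n".join(paths) + "\n")
    return input_file


def read_work_units(input_file):
    """What a mapper would do: follow each path and deserialize the file."""
    with open(input_file) as f:
        for line in f:
            with open(line.strip()) as work_unit_file:
                yield json.load(work_unit_file)
```

Passing the input file to `read_work_units` yields the original work units back in order, which is the round trip a mapper relies on.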

A mapper in a Gobblin MapReduce job runs one or more tasks, depending on the number of WorkUnits to do and the (optional) maximum number of mappers specified in the job configuration. If no maximum number of mappers is specified, each WorkUnit corresponds to one task that is executed by one mapper, and each mapper runs only one task. If a maximum number of mappers is specified and there are more WorkUnits than mappers allowed, each mapper may handle more than one WorkUnit. There is also a special type of WorkUnit named MultiWorkUnit that groups multiple WorkUnits to be executed together in one batch in a single mapper.
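
The effect of the optional mapper limit can be illustrated as follows. This is a sketch only: the text does not say how Gobblin decides which WorkUnits share a mapper, so the round-robin grouping used here is an assumption, and `assign_to_mappers` is a hypothetical name. Each inner list plays a role similar to a MultiWorkUnit.

```python
def assign_to_mappers(work_units, max_mappers=None):
    """Group work units so that at most max_mappers groups are produced."""
    if max_mappers is None or len(work_units) <= max_mappers:
        # No limit, or limit not reached: one work unit per mapper.
        return [[work_unit] for work_unit in work_units]
    groups = [[] for _ in range(max_mappers)]
    for i, work_unit in enumerate(work_units):
        # Round-robin placement is an assumption made for this sketch.
        groups[i % max_mappers].append(work_unit)
    return groups
```

With five work units and no limit, this produces five single-item groups; with a limit of two, it produces two groups holding three and two work units.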

A mapper in a Gobblin MapReduce job follows the steps below to run the tasks assigned to it:

  1. Starting the TaskExecutor that is responsible for executing tasks in a configurable-size thread pool and the MRTaskStateTracker that is responsible for keeping track of the state of running tasks in the mapper.
  2. Reading the next input record that is the path to the file storing a serialized WorkUnit.
  3. Deserializing the WorkUnit, creating a task from the WorkUnit, registering the task with the MRTaskStateTracker, and submitting the task to the TaskExecutor to run. If the input is a MultiWorkUnit, a task is created for each WorkUnit wrapped in the MultiWorkUnit.
  4. Waiting for all the submitted tasks to finish.
  5. Upon completion of all the submitted tasks, writing out the state of each task into a file that will be read by the MRJobLauncher when collecting task states.
  6. Going back to step 2 and reading the next input record if available.
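
The mapper loop above can be sketched as follows. For brevity the input records here are already-deserialized work units rather than file paths, a Python list stands in for a MultiWorkUnit, and the returned list stands in for the task state files read later by the MRJobLauncher. All names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def run_task(work_unit):
    """Hypothetical stand-in for executing one task."""
    return {"id": work_unit["id"], "state": "done"}


def run_mapper(input_records, pool_size=2):
    """Process records one by one; a list record models a MultiWorkUnit."""
    written_states = []
    # Step 1: start the executor (the tracker is omitted in this sketch).
    with ThreadPoolExecutor(max_workers=pool_size) as executor:
        # Steps 2 and 6: read input records until none are left.
        for record in input_records:
            # Step 3: a MultiWorkUnit yields one task per wrapped work unit.
            work_units = record if isinstance(record, list) else [record]
            futures = [executor.submit(run_task, wu) for wu in work_units]
            # Steps 4 and 5: wait for the tasks, then write out their states.
            for future in futures:
                written_states.append(future.result())
    return written_states
```

For instance, `run_mapper([{"id": 1}, [{"id": 2}, {"id": 3}]])` runs three tasks: one for the single work unit and two for the grouped record.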

Hadoop MapReduce Deployment

Out of the box, Gobblin supports running individual Gobblin jobs on Hadoop MapReduce, using the script bin/. It is assumed that you already have Hadoop (both MapReduce and HDFS) set up and running somewhere. Before launching any Gobblin jobs on Hadoop MapReduce, check the Gobblin system configuration file located at conf/ for the property fs.uri, which defines the file system URI used. The default value is file:///, which points to the local file system. Change it to the right value for your Hadoop/HDFS setup. For example, if you have HDFS set up somewhere on port 9000, then set the property as follows:

fs.uri=hdfs://<namenode host name>:9000/
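
A quick way to sanity-check an `fs.uri` value before launching a job is to parse it and look at the scheme. The sketch below uses Python's standard `urllib.parse`. The helper name and the host name `namenode.example.com` are illustrative assumptions.

```python
from urllib.parse import urlparse


def describe_fs_uri(fs_uri):
    """Report which kind of file system a given fs.uri value points to."""
    parsed = urlparse(fs_uri)
    if parsed.scheme == "file":
        return "local file system"
    if parsed.scheme == "hdfs":
        return f"HDFS at {parsed.hostname}:{parsed.port}"
    return f"other file system ({parsed.scheme})"
```

The default `file:///` is reported as the local file system, while `hdfs://namenode.example.com:9000/` is reported as HDFS on that host and port.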

All job data and persisted job/task states will be written to the specified file system. Before launching any jobs, make sure the environment variable HADOOP_BIN_DIR is set to point to the bin directory under the Hadoop installation directory, and the environment variable GOBBLIN_WORK_DIR is set to point to the working directory of Gobblin. Note that the Gobblin working directory will be created on the file system specified above. Below is a summary of the environment variables that need to be set for deployment on Hadoop MapReduce:

  • GOBBLIN_WORK_DIR: this variable defines the working directory for Gobblin to operate.
  • HADOOP_BIN_DIR: this variable defines the path to the bin directory under the Hadoop installation directory.

To launch a Gobblin job on Hadoop MapReduce, run the following command. The logs are located under the gobblin-dist/logs directory.

bin/ <job tracker URL> <file system URL> <job configuration file>

For example, if you have Hadoop and HDFS set up somewhere on ports 9001 and 9000, respectively, then the command should look like:

bin/ <job tracker host name>:9001 hdfs://<namenode host name>:9000/ <job configuration file>

With this setup, the minimum set of jars Gobblin needs to run the job is added to the Hadoop DistributedCache for use in the mappers. If a job needs additional jars for task execution (in the mappers), those jars can also be included by using the following property in the job configuration file:

job.jars=<comma-separated list of jars the job depends on>