Use the Python Client Library to call Dataproc APIs

Estimated completion time:

Overview

This Cloud Shell walkthrough leads you through the steps to use the Google Cloud Client Libraries for Python to programmatically interact with Dataproc.

As you follow this walkthrough, you run Python code that calls Dataproc gRPC APIs to (see the sketch after this list):

  • create a Dataproc cluster
  • submit a small PySpark word sort job to run on the cluster
  • get job status
  • tear down the cluster after job completion
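The following is a minimal sketch of what those gRPC calls look like with the google-cloud-dataproc client library. The project, region, cluster, and bucket names are placeholders, and the cluster sizing is an illustrative assumption; the actual submit_job_to_cluster.py adds argument parsing, status reporting, and output download.

```python
# Sketch of the Dataproc gRPC calls made in this walkthrough.
# PROJECT_ID, REGION, CLUSTER_NAME, and BUCKET are placeholders, and the
# cluster config below is an assumption for illustration only.
from google.cloud import dataproc_v1

PROJECT_ID = "your-project-id"
REGION = "us-central1"
CLUSTER_NAME = "new-cluster-name"
BUCKET = "your-bucket-name"

endpoint = {"api_endpoint": f"{REGION}-dataproc.googleapis.com:443"}
cluster_client = dataproc_v1.ClusterControllerClient(client_options=endpoint)
job_client = dataproc_v1.JobControllerClient(client_options=endpoint)

# Create a cluster and block until the long-running operation completes.
cluster = {
    "project_id": PROJECT_ID,
    "cluster_name": CLUSTER_NAME,
    "config": {
        "master_config": {"num_instances": 1, "machine_type_uri": "n1-standard-2"},
        "worker_config": {"num_instances": 2, "machine_type_uri": "n1-standard-2"},
    },
}
cluster_client.create_cluster(
    request={"project_id": PROJECT_ID, "region": REGION, "cluster": cluster}
).result()

# Submit the PySpark job and wait for it to reach a terminal state.
job = {
    "placement": {"cluster_name": CLUSTER_NAME},
    "pyspark_job": {"main_python_file_uri": f"gs://{BUCKET}/pyspark_sort.py"},
}
job_client.submit_job_as_operation(
    request={"project_id": PROJECT_ID, "region": REGION, "job": job}
).result()

# Tear down the cluster after the job completes.
cluster_client.delete_cluster(
    request={"project_id": PROJECT_ID, "region": REGION, "cluster_name": CLUSTER_NAME}
).result()
```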

Using the walkthrough

The submit_job_to_cluster.py file used in this walkthrough is opened in the Cloud Shell editor when you launch the walkthrough. You can view the code as you follow the walkthrough steps.

For more information: See Dataproc → Use the Python Client Library for an explanation of how the code works.

To reload this walkthrough: Run the following command from the ~/python-docs-samples/dataproc directory in Cloud Shell:

cloudshell launch-tutorial python-api-walkthrough.md

To copy and run commands: Click the "Paste in Cloud Shell" button on the side of a code box, then press Enter to run the command.

Prerequisites (1)

  1. Create or select a Google Cloud Platform project to use for this tutorial.

  2. Click the link below to enable the Dataproc, Compute Engine, and Cloud Storage APIs in a separate GCP console tab in your browser.

    Note: After you select your project and enable the APIs, return to this tutorial by clicking on the Cloud Shell tab in your browser.

Prerequisites (2)

  1. This walkthrough uploads a PySpark file (pyspark_sort.py) to a Cloud Storage bucket in your project.

    • You can use an existing Cloud Storage bucket in your project.

        OR

    • To create a new bucket, run the following command. Your bucket name must be unique.
    gsutil mb -p {{project-id}} gs://your-bucket-name
  2. Set environment variables.

    • Set the name of your bucket.
    BUCKET=your-bucket-name

Prerequisites (3)

  1. Set up a Python virtual environment in Cloud Shell.

    • Create the virtual environment.
    virtualenv ENV
    • Activate the virtual environment.
    source ENV/bin/activate
  2. Install library dependencies in Cloud Shell.

    pip install -r requirements.txt

Create a cluster and submit a job

  1. Set a name for your new cluster.

    CLUSTER=new-cluster-name
  2. Set a zone where your new cluster will be located. You can change the "us-central1-a" zone that is pre-set in the following command.

    ZONE=us-central1-a
  3. Run submit_job_to_cluster.py with the --create_new_cluster flag to create a new cluster and submit the pyspark_sort.py job to the cluster (one way to wait for job completion is sketched after this step).

    python submit_job_to_cluster.py \
    --project_id={{project-id}} \
    --cluster_name=$CLUSTER \
    --zone=$ZONE \
    --gcs_bucket=$BUCKET \
    --create_new_cluster
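While the command runs, the script waits for the submitted job to reach a terminal state. A minimal sketch of one way to do that with the client library is shown below; it polls get_job, assumes the same regional endpoint as the earlier sketch, and is not necessarily how submit_job_to_cluster.py implements the wait.

```python
import time

from google.cloud import dataproc_v1


def wait_for_job(job_client, project_id, region, job_id, poll_seconds=5):
    """Poll a Dataproc job until it reaches a terminal state (sketch)."""
    terminal_states = {
        dataproc_v1.JobStatus.State.DONE,
        dataproc_v1.JobStatus.State.ERROR,
        dataproc_v1.JobStatus.State.CANCELLED,
    }
    while True:
        job = job_client.get_job(
            request={"project_id": project_id, "region": region, "job_id": job_id}
        )
        if job.status.state in terminal_states:
            return job
        time.sleep(poll_seconds)
```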

Job Output

Job output in Cloud Shell shows cluster creation, job submission, job completion, and then tear-down of the cluster.

 ...
 Creating cluster...
 Cluster created.
 Uploading pyspark file to Cloud Storage.
 new-cluster-name - RUNNING
 Submitted job ID ...
 Waiting for job to finish...
 Job finished.
 Downloading output file
 .....
 ['Hello,', 'dog', 'elephant', 'panther', 'world!']
 ...
 Tearing down cluster
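The sorted word list in the output is produced by the pyspark_sort.py job. A minimal sketch of a word-sort job along those lines is shown below; the pyspark_sort.py file in the sample repository is the authoritative version and may differ.

```python
# Sketch of a small PySpark word-sort job similar to pyspark_sort.py.
import pyspark

sc = pyspark.SparkContext()
rdd = sc.parallelize(["Hello,", "world!", "dog", "elephant", "panther"])
print(sorted(rdd.collect()))
```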

Congratulations on Completing the Walkthrough!


Next Steps:

  • View job details from the Console. View job details by selecting the PySpark job from the Dataproc → Jobs page in the Google Cloud Platform Console.

  • Delete resources used in the walkthrough. The submit_job_to_cluster.py script deletes the cluster that it created for this walkthrough.

    If you created a bucket to use for this walkthrough, you can run the following command to delete the Cloud Storage bucket (the bucket must be empty).

    gsutil rb gs://$BUCKET

    You can run the following command to delete the bucket and all objects within it. Note: the deleted objects cannot be recovered.

    gsutil rm -r gs://$BUCKET
  • For more information. See the Dataproc documentation for API reference and product feature information.