Commit d864ffa · Update python-api-walkthrough.md (#398)

Co-authored-by: meredithslota <[email protected]>
Co-authored-by: Lo Ferris <[email protected]>

3 people authored Apr 14, 2022 · 1 parent cca6c6a
Changed file: dataproc/snippets/python-api-walkthrough.md (74 additions, 92 deletions)
As you follow this walkthrough, you run Python code that calls
[Dataproc gRPC APIs](https://cloud.google.com/dataproc/docs/reference/rpc/)
to:

* Create a Dataproc cluster
* Submit a PySpark word sort job to the cluster
* Delete the cluster after job completion

## Using the walkthrough

an explanation of how the code works.

cloudshell launch-tutorial python-api-walkthrough.md

**To copy and run commands**: Click the "Paste in Cloud Shell" button
**To copy and run commands**: Click the "Copy to Cloud Shell" button
(<walkthrough-cloud-shell-icon></walkthrough-cloud-shell-icon>)
on the side of a code box, then press `Enter` to run the command.

## Prerequisites (1)

<walkthrough-watcher-constant key="project_id" value="<project_id>"></walkthrough-watcher-constant>

1. Create or select a Google Cloud project to use for this
   tutorial.
    * <walkthrough-project-setup billing="true"></walkthrough-project-setup>

1. Enable the Dataproc, Compute Engine, and Cloud Storage APIs in your
   project.

    ```bash
    gcloud services enable dataproc.googleapis.com \
      compute.googleapis.com \
      storage-component.googleapis.com \
      --project={{project_id}}
    ```

## Prerequisites (2)

1. This walkthrough uploads a PySpark file (`pyspark_sort.py`) to a
   [Cloud Storage bucket](https://cloud.google.com/storage/docs/key-terms#buckets) in
   your project.
    * You can use the [Cloud Storage browser page](https://console.cloud.google.com/storage/browser)
      in Google Cloud Console to view existing buckets in your project.

      **OR**

    * To create a new bucket, run the following command (a Python
      alternative is sketched after this list). Your bucket name must be unique.

      ```bash
      gsutil mb -p {{project_id}} gs://your-bucket-name
      ```

2. Set environment variables.
    * Set the name of your bucket.

      ```bash
      BUCKET=your-bucket-name
      ```
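If you prefer to stay in Python, here is a minimal sketch of creating the
bucket with the `google-cloud-storage` client library instead of `gsutil`.
The project ID and bucket name are placeholders, and this snippet is not
part of the walkthrough's own code.

```python
# Sketch (not part of the walkthrough's code): create the bucket with
# the google-cloud-storage client instead of gsutil.
from google.cloud import storage

project_id = "your-project-id"    # placeholder: your Google Cloud project
bucket_name = "your-bucket-name"  # placeholder: must be globally unique

client = storage.Client(project=project_id)

# create_bucket raises a Conflict error if the name is already taken.
bucket = client.create_bucket(bucket_name)
print(f"Created bucket {bucket.name}")
```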

## Prerequisites (3)

1. Set up a Python
   [virtual environment](https://virtualenv.readthedocs.org/en/latest/).

    * Create the virtual environment.

      ```bash
      virtualenv ENV
      ```

    * Activate the virtual environment.

      ```bash
      source ENV/bin/activate
      ```

1. Install library dependencies.

    ```bash
    pip install -r requirements.txt
    ```

## Create a cluster and submit a job

1. Set a name for your new cluster.

    ```bash
    CLUSTER=new-cluster-name
    ```

1. Set a [region](https://cloud.google.com/compute/docs/regions-zones/#available)
   where your new cluster will be located. You can change the pre-set
   "us-central1" region before you copy and run the following command.

    ```bash
    REGION=us-central1
    ```

1. Run `submit_job_to_cluster.py` to create a new cluster and run the
   `pyspark_sort.py` job on the cluster. A sketch of the gRPC calls such a
   script makes follows this list.

    ```bash
    python submit_job_to_cluster.py \
        --project_id={{project_id}} \
        --cluster_name=$CLUSTER \
        --region=$REGION \
        --gcs_bucket=$BUCKET
    ```
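The following is a rough sketch, using the `google-cloud-dataproc` client
library, of the gRPC calls a script like `submit_job_to_cluster.py` makes.
The cluster configuration values are illustrative assumptions, not the
script's actual settings.

```python
# Sketch of the Dataproc gRPC calls a script like submit_job_to_cluster.py
# makes. Cluster config values are illustrative assumptions.
from google.cloud import dataproc_v1

project_id = "your-project-id"  # placeholder
region = "us-central1"
cluster_name = "new-cluster-name"
bucket = "your-bucket-name"

endpoint = {"api_endpoint": f"{region}-dataproc.googleapis.com:443"}

# 1. Create the cluster; create_cluster returns a long-running operation.
cluster_client = dataproc_v1.ClusterControllerClient(client_options=endpoint)
cluster = {
    "project_id": project_id,
    "cluster_name": cluster_name,
    "config": {
        "master_config": {"num_instances": 1, "machine_type_uri": "n1-standard-2"},
        "worker_config": {"num_instances": 2, "machine_type_uri": "n1-standard-2"},
    },
}
cluster_client.create_cluster(
    request={"project_id": project_id, "region": region, "cluster": cluster}
).result()

# 2. Submit the PySpark job and block until it finishes.
job_client = dataproc_v1.JobControllerClient(client_options=endpoint)
job = {
    "placement": {"cluster_name": cluster_name},
    "pyspark_job": {"main_python_file_uri": f"gs://{bucket}/pyspark_sort.py"},
}
job_client.submit_job_as_operation(
    request={"project_id": project_id, "region": region, "job": job}
).result()

# 3. Delete the cluster after the job completes.
cluster_client.delete_cluster(
    request={"project_id": project_id, "region": region, "cluster_name": cluster_name}
).result()
```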

## Job Output

Job output displayed in the Cloud Shell terminal shows cluster creation,
job completion, sorted job output, and then deletion of the cluster.

```
Cluster created successfully: cluster-name.
...
Job finished successfully.
...
['Hello,', 'dog', 'elephant', 'panther', 'world!']
...
Cluster cluster-name successfully deleted.
```
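The sorted list in the output comes from the uploaded `pyspark_sort.py`. A
minimal word-sort job along the following lines would produce it; this is a
sketch consistent with the output above, not necessarily the repository
file verbatim.

```python
# Sketch of a minimal PySpark word-sort job consistent with the output
# above (not necessarily the repository file verbatim).
import pyspark

sc = pyspark.SparkContext()
rdd = sc.parallelize(["Hello,", "world!", "dog", "elephant", "panther"])

# collect() pulls the words back to the driver; sorted() orders them
# lexicographically, so 'Hello,' (capital H) sorts before the lowercase words.
print(sorted(rdd.collect()))
```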

## Congratulations on completing the walkthrough!
<walkthrough-conclusion-trophy></walkthrough-conclusion-trophy>

---

### Next Steps:

* **View job details in the Cloud console.** View job details by selecting the
  PySpark job name on the Dataproc
  [Jobs page](https://console.cloud.google.com/dataproc/jobs)
  in the Cloud console.

* **Delete resources used in the walkthrough.**
  The `submit_job_to_cluster.py` code deletes the cluster that it created for this
  walkthrough.

  If you created a Cloud Storage bucket to use for this walkthrough,
  you can run the following command to delete the bucket (the bucket must be empty).

  ```bash
  gsutil rb gs://$BUCKET
  ```

  Alternatively, you can run the following command to **delete the bucket and all
  objects within it. Note: the deleted objects cannot be recovered.**

  ```bash
  gsutil rm -r gs://$BUCKET
  ```

  A Python alternative to these `gsutil` commands is sketched after this list.


* **For more information.** See the [Dataproc documentation](https://cloud.google.com/dataproc/docs/)
for API reference and product feature information.
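As with bucket creation earlier, cleanup can also be done from Python. Here
is a hedged sketch using the `google-cloud-storage` client; note that the
`force` flag only works for buckets containing 256 or fewer objects.

```python
# Sketch (not part of the walkthrough's code): delete the bucket and its
# objects with the google-cloud-storage client instead of gsutil.
from google.cloud import storage

client = storage.Client(project="your-project-id")  # placeholder project
bucket = client.bucket("your-bucket-name")          # placeholder bucket

# force=True deletes the bucket's objects first; it raises an error for
# buckets with more than 256 objects. Deleted objects cannot be recovered.
bucket.delete(force=True)
```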
