This repository deploys a Spark cluster in several modes, together with a data science platform based on Anaconda and the Jupyter Notebook stack.
Changelog from
- Uses OpenJDK
- A single file that does not depend on previous role variables to function
- Proxy config for HTTP and HTTPS proxies (i.e. for those using certificates). Place cert files (*.crt) in
- Hadoop compiled separately from Spark, providing greater flexibility
- Preconfigured Jupyter Notebook file with an auto-generated config that launches the SparkContext object appropriately
- H2O Flow configuration (selectable, because it is not compatible with loading files via S3a)
- Automatically starts the Spark cluster, the JupyterLab instance, and optionally H2O Flow when the headnode boots, and shuts them down gracefully. Useful for scheduled boot and shutdown in the cloud (e.g. AWS, Google Compute Cloud, Azure)
You will need a driver machine with Ansible installed and a clone of the current repository:
- If you are running on cloud (public/private network)
- Install Ansible on the head node
- The designated headnodes will need an SSH key named 'cluster_key'
- All nodes will need public SSH key ''
- All nodes need the user ec2-user with sudo permissions
- Host inventory is manually specified and located at inventory/hosts
- Variables for installation are located at
- Sparkling water can be turned on or off via
Software versions and URLs are specified in the variables file
curl -O
sudo rpm -i epel-release-latest-7.noarch.rpm
sudo yum update -y
sudo yum install -y ansible
To allow variables to be overridden from the host inventory, add the following settings to the [defaults] section of your ~/.ansible.cfg file:
host_key_checking = False
hash_behaviour = merge
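As a minimal sketch, the two settings can be written into a fresh configuration file from the shell. Both keys belong to Ansible's [defaults] section. The helper name write_ansible_cfg is hypothetical and not part of this repository, and the function deliberately refuses to overwrite an existing file.

```shell
# Hypothetical helper (not part of this repository): create an Ansible
# configuration file containing the two settings shown above.
write_ansible_cfg() {
  cfg="$1"
  # Do not clobber an existing configuration; merge by hand instead.
  if [ -e "$cfg" ]; then
    echo "$cfg already exists; add the settings to its [defaults] section manually" >&2
    return 1
  fi
  cat > "$cfg" <<'EOF'
[defaults]
host_key_checking = False
hash_behaviour = merge
EOF
}

# Usage (uncomment to write the file in your home directory):
# write_ansible_cfg "$HOME/.ansible.cfg"
```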
- RHEL 7.x
- Ansible 2.6.3
Ansible uses 'host inventory' files to define the cluster configuration, the nodes, and the groups of nodes that serve a given purpose (e.g. master node).
Below is a sample host inventory definition:
Testh0 ansible_host=<IP> ansible_host_private=<IP> ansible_host_id=1
Testc0 ansible_host=<IP> ansible_host_private=<IP> ansible_host_id=3
Testc1 ansible_host=<IP> ansible_host_private=<IP> ansible_host_id=4
Testc2 ansible_host=<IP> ansible_host_private=<IP> ansible_host_id=5
Some specific configurations are:
- : install/update Java 8
- install_temp_dir=/tmp/ansible-install : temporary folder used for install files
- install_dir=/opt : where packages are installed (e.g. Spark)
Note: ansible_host_id is only used when deploying a "Spark Standalone" cluster.
Note: Ambari currently supports only Python 2.x.
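The sample inventory gives every host its own ansible_host_id. Assuming the ids are meant to be unique (the source does not state this explicitly), a quick pre-flight check could look like the sketch below. The function name check_inventory is hypothetical and not part of this repository.

```shell
# Hypothetical pre-flight check (not part of this repository).
# Assumption: every host line that sets ansible_host= should also carry
# an ansible_host_id, and no two hosts should share the same id.
check_inventory() {
  awk '
    # Skip commented-out lines.
    /^[ \t]*#/ { next }
    # Skip anything that is not a host definition (blank lines, [group] headers).
    !/ansible_host=/ { next }
    {
      id = ""
      # Scan the key=value pairs that follow the host name.
      for (i = 2; i <= NF; i++) {
        split($i, kv, "=")
        if (kv[1] == "ansible_host_id") id = kv[2]
      }
      if (id == "") {
        printf "%s: missing ansible_host_id\n", $1
        bad = 1
        next
      }
      if (id in seen) {
        printf "%s: ansible_host_id %s already used by %s\n", $1, id, seen[id]
        bad = 1
      } else {
        seen[id] = $1
      }
    }
    # Exit status 0 when no problem was found, 1 otherwise.
    END { exit bad }
  ' "$1"
}

# Usage:
# check_inventory inventory/hosts && echo "inventory looks consistent"
```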
- Common: Deploys Java and common dependencies
ansible-playbook --verbose <deployment playbook.yml> -i <hosts inventory>
ansible-playbook -i inventory/hosts -c paramiko setup-ds-platform.yml
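Before a full deployment it can help to confirm that every node is reachable and that the playbook parses. The sketch below chains those checks in front of the deployment command shown above. The function names run and deploy and the DRY_RUN variable are illustrative and not part of this repository; the ping module and the --syntax-check flag are standard Ansible features.

```shell
# Illustrative wrapper (not part of this repository).
# With DRY_RUN=1 the commands are printed instead of executed.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

deploy() {
  inventory="$1"
  playbook="$2"
  # 1. Check that every host in the inventory answers over SSH.
  run ansible all -i "$inventory" -m ping &&
  # 2. Parse the playbook without executing any task.
  run ansible-playbook -i "$inventory" --syntax-check "$playbook" &&
  # 3. Run the actual deployment.
  run ansible-playbook -i "$inventory" -c paramiko "$playbook"
}

# Usage:
# deploy inventory/hosts setup-ds-platform.yml
```

Because the steps are joined with &&, a failed connectivity or syntax check stops the run before any change is made to the nodes.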
In this scenario, a Standalone Spark cluster will be deployed with a few optional components.
- proxy_config: Configures the proxy network settings
- cluster_network: Configures networking for the cluster
- Common: Deploys Java and common dependencies
- hadoop: Deploys Hadoop in Standalone mode using slave nodes as data nodes
- Spark: Deploys Spark in Standalone mode using slave nodes as workers
- Spark-Cluster-Admin: Utility scripts for managing the Spark cluster
- ElasticSearch: Deploys ElasticSearch nodes on all slave nodes
- Zookeeper: Deploys Zookeeper on all nodes (required by Kafka)
- Kafka: Deploys Kafka nodes on all slave nodes
- Anaconda: Deploys Anaconda Python
- Sparkling Water: Deploys H2O and Sparkling Water for Spark
- Spark Start Stop: Configures the cluster to start Hadoop, Spark, Anaconda, and optionally Sparkling Water
The Ambari role will install MySQL Community Edition, which is available under the GPL license.
The Notebook role will install R, which is available under GPL-2 | GPL-3.
By deploying these packages via the Ansible utility scripts in this project, you are accepting the license terms for these components.