This project uses Java with the MapReduce framework to perform KMeans clustering on advertising performance data, grouping keyphrases by performance metrics such as bid amount, impressions, clicks, and ad rank. The goal is to uncover patterns in advertising strategies and offer insights that can guide advertisers toward more effective campaigns.
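Conceptually, each KMeans iteration maps to one MapReduce pass: the map phase assigns every record to its nearest centroid, and the reduce phase averages the points assigned to each centroid to produce the updated values. Below is a minimal sketch of the map-side assignment step; the class name, hard-coded centroids, and field handling are illustrative assumptions, not the project's actual source:

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Illustrative sketch: emits (nearest-centroid-id, point) for every record.
public class KMeansAssignMapper extends Mapper<LongWritable, Text, IntWritable, Text> {

    private final List<double[]> centroids = new ArrayList<>();

    @Override
    protected void setup(Context context) {
        // A real job would load the current centroids here (e.g. from the
        // distributed cache); hard-coded values keep the sketch short.
        centroids.add(new double[] {0.5, 100, 10, 1});
        centroids.add(new double[] {2.0, 1000, 50, 3});
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Tab-separated fields: day, account, rank, keyphrase, bid, impressions, clicks
        String[] f = value.toString().split("\t");
        if (f.length < 7) return; // skip malformed records

        double[] point = {
            Double.parseDouble(f[4]), // bid
            Double.parseDouble(f[5]), // impressions
            Double.parseDouble(f[6]), // clicks
            Double.parseDouble(f[2])  // rank
        };

        // Find the nearest centroid by squared Euclidean distance.
        int nearest = 0;
        double best = Double.MAX_VALUE;
        for (int i = 0; i < centroids.size(); i++) {
            double d = squaredDistance(point, centroids.get(i));
            if (d < best) { best = d; nearest = i; }
        }
        context.write(new IntWritable(nearest),
                new Text(point[0] + "," + point[1] + "," + point[2] + "," + point[3]));
    }

    private static double squaredDistance(double[] a, double[] b) {
        double sum = 0;
        for (int i = 0; i < a.length; i++) {
            double diff = a[i] - b[i];
            sum += diff * diff;
        }
        return sum;
    }
}

The matching reducer would average all points received for a given centroid ID and write out the updated centroid line.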
Input Format
The input dataset should be tab-separated, with the following fields:
- Day of the data record
- Anonymized account ID of the advertiser
- Rank of the advertisement
- Anonymized keyphrase (a list of anonymized keywords)
- Average bid for the keyphrase
- Number of impressions (times the ad was shown)
- Number of clicks (times users interacted with the ad)
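For illustration, one record might look like the following (all values are made up; fields are tab-separated, and the keyphrase field is a space-separated list of anonymized keyword IDs):
1	a1b2c3d4	2	4031 7690 12550	0.75	1500	42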
The centroids file supplied to the job should contain the initial centroid values, one centroid per line, with values comma-separated:
bid,impressions,clicks,rank
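For example, a file seeding the job with three centroids might contain (illustrative values):
0.50,100,10,1.0
1.25,800,35,2.5
3.00,5000,120,4.0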
Output Format
The MapReduce job outputs recalculated centroid values after processing the dataset. Each line in the output file represents one centroid with its updated values, formatted as follows:
centroid_id avg_bid,impressions,clicks,rank
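A run seeded with the three centroids above could therefore produce output along these lines (illustrative values):
0	0.62,143.8,12.4,1.2
1	1.31,912.5,41.0,2.7
2	2.87,4876.2,108.9,4.1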
- Prerequisites
- Local Installation
- Docker Deployment
- Running the Application
- AWS Deployment
- Cleanup
- Contributing
- Acknowledgement
- Contact
- License
Prerequisites
Before you begin, ensure you have the following installed:
- Java JDK 11
- Apache Maven
- Apache Hadoop 3.x
- Apache Spark 3.x
- Docker (optional for Docker deployment)
- AWS CLI (configured for AWS deployment)
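You can check which of these are already available with:
java -version
mvn -version
hadoop version
spark-submit --version
docker --version
aws --version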
Local Installation
Install basic system tools:
- macOS:
brew install wget curl vim make # tzdata is generally not required for macOS, as timezone handling is built into the OS
- Ubuntu:
sudo apt-get update
sudo apt-get install -y --no-install-recommends apt-utils wget curl vim make
sudo apt-get install -y tzdata
Install Java 11:
- macOS:
brew install openjdk@11
- Ubuntu:
sudo apt update
sudo apt install -y openjdk-11-jdk
- Add to your .bashrc or .zshrc file:
export JAVA_HOME=$(dirname $(dirname $(readlink -f $(which java))))
# For macOS, adjust JAVA_HOME accordingly: export JAVA_HOME=/usr/local/opt/openjdk@11
Install Maven:
- macOS and Ubuntu:
brew install maven    # macOS
sudo apt install maven    # Ubuntu
Install the AWS CLI:
- macOS:
brew install awscli
- Ubuntu:
sudo apt-get install -y awscli
Install Scala via Coursier:
- Common for both OS:
curl -fLo cs.gz https://github.com/coursier/coursier/releases/latest/download/cs-x86_64-pc-linux.gz
gunzip cs.gz
chmod +x cs
./cs setup -y
./cs install scala:2.12.17 scalac:2.12.17
# Note: the artifact above is the Linux binary; on macOS, download cs-x86_64-apple-darwin.gz instead
Install Hadoop:
- Common for both OS:
wget https://downloads.apache.org/hadoop/common/hadoop-3.3.5/hadoop-3.3.5.tar.gz
sudo tar -xzf hadoop-3.3.5.tar.gz -C /usr/local
sudo mv /usr/local/hadoop-3.3.5 /usr/local/hadoop
Install Spark:
- Common for both OS:
wget https://archive.apache.org/dist/spark/spark-3.3.2/spark-3.3.2-bin-without-hadoop.tgz
sudo tar -xzf spark-3.3.2-bin-without-hadoop.tgz -C /usr/local
sudo mv /usr/local/spark-3.3.2-bin-without-hadoop /usr/local/spark
- Add the following lines to your shell configuration file (.bashrc, .zshrc, etc.):
export HADOOP_HOME=/usr/local/hadoop
export SPARK_HOME=/usr/local/spark
export SCALA_HOME=$HOME/.local/share/coursier/bin
export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin:$SPARK_HOME/bin
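After reloading your shell configuration, verify that the tools resolve:
source ~/.bashrc    # or ~/.zshrc
hadoop version
spark-submit --version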
Docker Deployment
Build the Docker image:
- For ARM64 architecture:
docker build -t kmeans-project .
- For AMD64 architecture (adjust Dockerfile as needed):
docker build -f DockerfileAMD -t kmeans-project .
Start a container from the image:
docker run -it --name kmeans-container kmeans-project
Open a shell in the running container:
docker exec -it kmeans-container bash
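Once inside the container, you can sanity-check the environment (assuming the image installs the toolchain described above):
java -version
hadoop version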
Running the Application
Ensure your Makefile is set up to handle everything from compilation to cleanup:
- Compile the project:
make jar
- Run KMeans locally or within Docker:
make run-kmeans
- Clean up generated output files:
make clean-local-output
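For reference, make run-kmeans typically wraps a plain hadoop jar invocation along these lines; the jar name, driver class, and paths below are illustrative and depend on your Makefile:
hadoop jar target/kmeans-project-1.0.jar com.example.kmeans.KMeansDriver input/dataset.tsv input/centroids.txt output/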
AWS Deployment
- Configure your AWS CLI and ensure your credentials are set up:
# Make sure to add your AWS credentials in the following locations:
~/.aws/config
~/.aws/credentials
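Both files use the standard INI layout; replace the placeholders with your own values (the region shown is just an example):
# ~/.aws/credentials
[default]
aws_access_key_id = <YOUR_ACCESS_KEY_ID>
aws_secret_access_key = <YOUR_SECRET_ACCESS_KEY>

# ~/.aws/config
[default]
region = us-east-1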
- Create a bucket on S3:
make make-bucket
- Upload the dataset to S3 Bucket:
make upload-input-aws
- Upload the app jar to S3 Bucket:
make upload-app-aws
- Deploy the application on AWS EMR:
make aws
- Download results from AWS S3 after execution:
make download-output-aws
Cleanup
- Local cleanup:
make clean-local-output
- AWS cleanup (to avoid unnecessary charges):
make delete-output-aws
aws emr terminate-clusters --cluster-ids <cluster-id>
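If you don't have the cluster ID handy, list your active clusters first:
aws emr list-clusters --active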
Contributing
Contributions that enhance the project are welcome; please create a dedicated branch for your changes.
Acknowledgement
- Yahoo! for providing the Search Marketing Advertiser Bid-Impression-Click dataset (A4 - Yahoo Data Targeting User Modeling, Version 1.0, hosted on AWS, 3.7 GB).
- Apache Hadoop and Apache Maven communities for their open-source software.
Contact
Parag Ghorpade - GitHub Profile
Feel free to reach out with any questions or to contribute to the project.
License
Distributed under the MIT License. See LICENSE for more information.