
Used-AWS-EMR-to-process-big-data-using-Spark-and-Hadoop

Utilized AWS EMR to process a large dataset, filtering out the specific records I needed with PySpark code; the resulting data was stored in an S3 bucket in CSV format.

AWS EMR Configuration: I set up the AWS EMR environment, configuring the EC2 instances and the Spark engine for good performance and resource utilization.
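For reference, a minimal sketch of one way to create such a cluster from the AWS CLI is shown below; the cluster name, release label, instance type, key pair, and log bucket are placeholders, not the exact settings used in this project.

```bash
# Sketch: create a small EMR cluster with Spark installed.
# All names and values here are illustrative placeholders.
aws emr create-cluster \
  --name "spark-filter-cluster" \
  --release-label emr-6.10.0 \
  --applications Name=Spark \
  --instance-type m5.xlarge \
  --instance-count 3 \
  --use-default-roles \
  --ec2-attributes KeyName=my-key-pair \
  --log-uri s3://my-log-bucket/emr-logs/
```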

Dataset Filtering: Leveraging Spark's distributed computing capabilities, I designed and executed the data filtering operations on the dataset.

PySpark Development: I used the PySpark API to develop the filtering code, leveraging PySpark's built-in library ecosystem and DataFrame functions for efficient, scalable data processing.
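The project's actual script is in the repository, but a minimal PySpark sketch of this read-filter-write pattern looks like the following; the bucket paths, column name, and filter condition are assumptions for illustration only.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Create a Spark session; on EMR, cluster resources come from the
# spark-submit command and the cluster configuration.
spark = SparkSession.builder.appName("FilterDataset").getOrCreate()

# Read the raw dataset from S3 (bucket and path are placeholders).
df = spark.read.csv("s3://my-input-bucket/raw-data/",
                    header=True, inferSchema=True)

# Keep only the rows matching the required condition; the column
# name and value here are illustrative, not the project's actual filter.
filtered = df.filter(F.col("status") == "active")

# Write the filtered result back to S3 in CSV format, as described above.
filtered.write.mode("overwrite").csv("s3://my-output-bucket/filtered/",
                                     header=True)

spark.stop()
```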

Spark Submit Command: To execute the Spark code on the AWS EMR cluster, I submitted the application with the spark-submit command and monitored it through the Spark web UI.
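A typical invocation from the EMR primary node looks roughly like this; the script name is a placeholder, and on EMR spark-submit runs against YARN by default.

```bash
# Submit the PySpark job to the cluster; filter_dataset.py is a
# placeholder for the project's actual script.
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  filter_dataset.py
```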

I ran into a number of difficulties with the EMR configuration during setup, but worked through them with help from Stack Overflow and ChatGPT.
