Utilized AWS EMR to process a large dataset, filtering out the specific records I required using PySpark code; the resulting data was stored in an S3 bucket in CSV format.
AWS EMR Configuration: I set up the AWS EMR environment, configuring the EC2 instances and the Spark engine for optimal performance and resource utilization.
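A minimal sketch of how such a cluster could be provisioned programmatically with boto3; the cluster name, region, instance types, instance count, log bucket, and IAM roles shown here are hypothetical placeholders, not the actual configuration used.

```python
import boto3

# Region, names, and sizes below are illustrative assumptions.
emr = boto3.client("emr", region_name="us-east-1")

response = emr.run_job_flow(
    Name="data-filtering-cluster",          # hypothetical cluster name
    ReleaseLabel="emr-6.15.0",              # example EMR release
    Applications=[{"Name": "Spark"}],       # install the Spark engine
    Instances={
        "MasterInstanceType": "m5.xlarge",  # example EC2 instance types
        "SlaveInstanceType": "m5.xlarge",
        "InstanceCount": 3,                 # 1 master + 2 core nodes
        "KeepJobFlowAliveWhenNoSteps": True,
        "TerminationProtected": False,
    },
    LogUri="s3://my-log-bucket/emr-logs/",  # hypothetical log bucket
    JobFlowRole="EMR_EC2_DefaultRole",      # default EMR instance profile
    ServiceRole="EMR_DefaultRole",          # default EMR service role
)
print("Cluster ID:", response["JobFlowId"])
```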
Dataset Filtering: Leveraging Spark's distributed computing capabilities, I designed and executed the data filtering operations on the dataset (see the PySpark sketch after the next item).
PySpark Development: I used the PySpark API to develop the filtering code, drawing on PySpark's built-in library ecosystem and functions to achieve efficient, scalable data processing.
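A minimal sketch of such a filtering job, assuming a CSV input; the S3 paths, column names, and filter condition are hypothetical stand-ins for the actual requirements.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("dataset-filter").getOrCreate()

# Read the raw dataset from S3 (path and header option are assumptions).
df = spark.read.option("header", "true").csv("s3://my-input-bucket/raw/")

# Apply the filtering criteria; this condition is illustrative only.
filtered = df.filter(
    (F.col("status") == "active") & (F.col("amount").cast("double") > 100)
)

# Write the filtered result back to S3 in CSV format, as described above.
filtered.write.mode("overwrite").option("header", "true").csv(
    "s3://my-output-bucket/filtered/"
)

spark.stop()
```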
Spark Submit Command: To execute the Spark application on the AWS EMR cluster, I used the spark-submit command for submission, and monitored execution through the Spark web UI.
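One way to issue that spark-submit programmatically is as an EMR step via boto3's command-runner.jar, sketched below; the cluster ID and script path are hypothetical, and the step is equivalent to running spark-submit --deploy-mode cluster on the master node.

```python
import boto3

emr = boto3.client("emr", region_name="us-east-1")  # region is an assumption

emr.add_job_flow_steps(
    JobFlowId="j-XXXXXXXXXXXXX",              # hypothetical cluster ID
    Steps=[
        {
            "Name": "filter-dataset",
            "ActionOnFailure": "CONTINUE",
            "HadoopJarStep": {
                "Jar": "command-runner.jar",  # EMR's generic step runner
                "Args": [
                    "spark-submit",
                    "--deploy-mode", "cluster",
                    "s3://my-scripts-bucket/filter_job.py",  # hypothetical path
                ],
            },
        }
    ],
)
```

While the step runs, its progress can be followed from the Spark web UI linked on the cluster's application history page.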
Troubleshooting: I encountered several configuration difficulties while setting up AWS EMR, and resolved them with help from Stack Overflow and ChatGPT.