A project comparing different techniques for finding the top K words in a very large file, i.e. different approaches to processing Big Data.
In recent years, data has become abundant due to the rapid growth of the internet, and it is consequently becoming harder to read, store and process large datasets. Input size is one of the key factors in how efficiently a program runs: the larger the input, the worse the performance. Handling huge volumes of data is the central problem of Big Data. Traditional database and data-processing systems exist, but datasets are now so large that managing them efficiently with those systems is difficult. Beyond input size, several other factors affect how a program executes: the data structures used, the memory available and the algorithm chosen. This project analyses how these factors affect performance on three datasets of different sizes, i.e. three input sizes.
The main objective of this project was to find the top K words in an input file, where K is a given integer: count how many times each word occurs in the text file, then print the K most frequent words. Three text files were used as input, each of a different size: 400MB, 8GB and 32GB. First, standard Python techniques were used; then MapReduce and Hive were applied to showcase the improvement in performance on the same task.
Python cases (sketches of Cases 4 and 5 follow the list):
Case 1: Read the entire file into memory and count the top K words with a loop.
Case 2: Read the entire file into memory and find the top K words with Python's Counter.
Case 3: Read the file line by line and count the top K words with a loop.
Case 4: Read the file line by line and find the top K words with Python's Counter.
Case 5: Read the file in chunks and process them in parallel to find the top K words.
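As an illustration, here is a minimal sketch of Case 4, assuming whitespace-delimited words; the file name and K below are placeholder values:

```python
from collections import Counter

def top_k_words(path, k):
    """Case 4: stream the file line by line and tally words with Counter."""
    counts = Counter()
    with open(path, encoding="utf-8") as f:
        for line in f:
            counts.update(line.split())  # simple whitespace tokenization (an assumption)
    return counts.most_common(k)  # list of (word, frequency), most frequent first

if __name__ == "__main__":
    for word, freq in top_k_words("input.txt", 10):  # placeholder file and K
        print(word, freq)
```

And a sketch of Case 5, using multiprocessing to count fixed-size batches of lines in parallel and then merging the partial counts; the chunk size and worker count are illustrative:

```python
from collections import Counter
from itertools import islice
from multiprocessing import Pool

def count_chunk(lines):
    """Count word frequencies in one chunk of lines."""
    counts = Counter()
    for line in lines:
        counts.update(line.split())
    return counts

def read_chunks(path, lines_per_chunk=100_000):
    """Yield the file as lists of lines, lines_per_chunk at a time."""
    with open(path, encoding="utf-8") as f:
        while True:
            chunk = list(islice(f, lines_per_chunk))
            if not chunk:
                break
            yield chunk

def top_k_parallel(path, k, workers=4):
    """Case 5: count chunks in parallel, merge partial counts, take top K."""
    totals = Counter()
    with Pool(workers) as pool:
        for partial in pool.imap_unordered(count_chunk, read_chunks(path)):
            totals.update(partial)
    return totals.most_common(k)

if __name__ == "__main__":
    print(top_k_parallel("input.txt", 10))  # placeholder file and K
```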
MapReduce cases (a Hadoop Streaming sketch of subcase A follows the list):
Case 1: 1 reducer
Case 2: Many reducers (96)
Subcases:
Case A: Mapper & Reducer
Case B: Mapper, Reducer & Combiner
Case C: Mapper, Reducer, Combiner with Partitioner
Case D: Mapper, Reducer & Combiner using compression of the input text file
Case 3: Varying the number of reducers
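To make the structure concrete, below is a minimal Hadoop Streaming sketch of subcase A (Mapper & Reducer) in Python, assuming whitespace-delimited words; file names are illustrative:

```python
#!/usr/bin/env python3
# mapper.py: emit "word<TAB>1" for every word read from stdin.
import sys

for line in sys.stdin:
    for word in line.split():
        print(f"{word}\t1")
```

```python
#!/usr/bin/env python3
# reducer.py: sum counts per word; Hadoop delivers mapper output sorted by key,
# so all lines for a given word arrive consecutively.
import sys

current_word, current_count = None, 0
for line in sys.stdin:
    word, count = line.rsplit("\t", 1)
    if word == current_word:
        current_count += int(count)
    else:
        if current_word is not None:
            print(f"{current_word}\t{current_count}")
        current_word, current_count = word, int(count)
if current_word is not None:
    print(f"{current_word}\t{current_count}")
```

A job along these lines can be launched with, for example, `hadoop jar hadoop-streaming.jar -input <in> -output <out> -mapper mapper.py -reducer reducer.py -file mapper.py -file reducer.py`. For subcase B the same reducer can be reused as a combiner via `-combiner` (summing counts is associative), and Cases 1, 2 and 3 vary `-numReduceTasks`. Each reducer only emits per-word totals, so a final pass (e.g. sorting by count and keeping the first K) yields the top K words.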