A project comparing different techniques for finding the top K words in a very large file, i.e. different approaches to processing Big Data.
In recent years, data has become abundant due to the rapid growth of the internet, and it is consequently becoming harder to read, store and process large datasets. Input size is one of the key factors in how efficiently a program runs: the larger the input, the worse the performance. Handling huge volumes of data is the central problem of Big Data. Traditional database and data-processing systems exist, but datasets are now so large that managing them efficiently with those systems is difficult. Beyond input size, several other factors affect how a program executes: the data structures used, the memory available and the algorithm chosen. This project analyses how these factors affect performance on three datasets of different sizes, i.e. three input sizes.
The main objective of this project was to find the top K words in an input file, where K is a given integer: count how many times each word occurs in the text file, then print the K most frequent words. Three text files were used as input, each of a different size: 400MB, 8GB and 32GB. First, standard Python techniques were used; then MapReduce and Hive were applied to showcase the improvement in performance on the same task.
Python cases (sketches of Cases 4 and 5 follow the list):
Case 1: Read the entire file into memory and count the top K words with a loop.
Case 2: Read the entire file into memory and find the top K words with Python's Counter.
Case 3: Read the file line by line and count the top K words with a loop.
Case 4: Read the file line by line and find the top K words with Python's Counter.
Case 5: Read the file in chunks and process them in parallel to find the top K words.
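As an illustration, here is a minimal sketch of Case 4, assuming whitespace-delimited words; the file name and K below are placeholder values:

```python
from collections import Counter

def top_k_words(path, k):
    """Case 4: stream the file line by line and tally words with Counter."""
    counts = Counter()
    with open(path, encoding="utf-8") as f:
        for line in f:
            counts.update(line.split())  # simple whitespace tokenization (an assumption)
    return counts.most_common(k)  # list of (word, frequency), most frequent first

if __name__ == "__main__":
    for word, freq in top_k_words("input.txt", 10):  # placeholder file and K
        print(word, freq)
```

And a sketch of Case 5, using multiprocessing to count fixed-size batches of lines in parallel and then merging the partial counts; the chunk size and worker count are illustrative:

```python
from collections import Counter
from itertools import islice
from multiprocessing import Pool

def count_chunk(lines):
    """Count word frequencies in one chunk of lines."""
    counts = Counter()
    for line in lines:
        counts.update(line.split())
    return counts

def read_chunks(path, lines_per_chunk=100_000):
    """Yield the file as lists of lines, lines_per_chunk at a time."""
    with open(path, encoding="utf-8") as f:
        while True:
            chunk = list(islice(f, lines_per_chunk))
            if not chunk:
                break
            yield chunk

def top_k_parallel(path, k, workers=4):
    """Case 5: count chunks in parallel, merge partial counts, take top K."""
    totals = Counter()
    with Pool(workers) as pool:
        for partial in pool.imap_unordered(count_chunk, read_chunks(path)):
            totals.update(partial)
    return totals.most_common(k)

if __name__ == "__main__":
    print(top_k_parallel("input.txt", 10))  # placeholder file and K
```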
MapReduce cases (a Hadoop Streaming sketch of subcase A follows the list):
Case 1: 1 reducer
Case 2: Many reducers (96)
Subcases:
Case A: Mapper & Reducer
Case B: Mapper, Reducer & Combiner
Case C: Mapper, Reducer, Combiner with Partitioner
Case D: Mapper, Reducer & Combiner using compression of the input text file
Case 3: Varying the number of reducers
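To make the structure concrete, below is a minimal Hadoop Streaming sketch of subcase A (Mapper & Reducer) in Python, assuming whitespace-delimited words; file names are illustrative:

```python
#!/usr/bin/env python3
# mapper.py: emit "word<TAB>1" for every word read from stdin.
import sys

for line in sys.stdin:
    for word in line.split():
        print(f"{word}\t1")
```

```python
#!/usr/bin/env python3
# reducer.py: sum counts per word; Hadoop delivers mapper output sorted by key,
# so all lines for a given word arrive consecutively.
import sys

current_word, current_count = None, 0
for line in sys.stdin:
    word, count = line.rsplit("\t", 1)
    if word == current_word:
        current_count += int(count)
    else:
        if current_word is not None:
            print(f"{current_word}\t{current_count}")
        current_word, current_count = word, int(count)
if current_word is not None:
    print(f"{current_word}\t{current_count}")
```

A job along these lines can be launched with, for example, `hadoop jar hadoop-streaming.jar -input <in> -output <out> -mapper mapper.py -reducer reducer.py -file mapper.py -file reducer.py`. For subcase B the same reducer can be reused as a combiner via `-combiner` (summing counts is associative), and Cases 1, 2 and 3 vary `-numReduceTasks`. Each reducer only emits per-word totals, so a final pass (e.g. sorting by count and keeping the first K) yields the top K words.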