Compute TF-IDF using Python with Hadoop Streaming.
TF-IDF (term frequency-inverse document frequency) is a statistical measure of how important a word is to a document within a collection of documents.
We will use the formula

w_t,d = (tf_t,d / n_d) × log(N / df_t)

where tf_t,d is the number of times term t occurs in document d, n_d is the total number of terms in document d, N is the total number of documents in the collection, and df_t is the number of documents that contain term t.
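As a quick sanity check of the formula, here is a small plain-Python sketch of the weight computation, independent of the MapReduce pipeline (the function name tfidf_weight and the use of the natural logarithm are our own choices for illustration):

import math

def tfidf_weight(tf_td, n_d, N, df_t):
    # w_t,d = (tf_t,d / n_d) * log(N / df_t)
    return (float(tf_td) / n_d) * math.log(float(N) / df_t)

# A term that appears 3 times in a 100-word document and occurs in
# 10 of the 1000 documents in the collection gets a weight of ~0.138.
print(tfidf_weight(3, 100, 1000, 10))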
We will first copy input data to HDFS.
$ hadoop fs -mkdir -p /your_hdfs_path/input
$ hadoop fs -copyFromLocal input/* /your_hdfs_path/input
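Before launching the job, it can be useful to confirm the files landed where the streaming job expects them:

$ hadoop fs -ls /your_hdfs_path/input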
Execute the first step, the word count, with Hadoop Streaming:
$ hadoop jar /usr/hdp/current/hadoop-mapreduce-client/hadoop-streaming.jar \
-input /your_hdfs_path/input/* \
-output /your_hdfs_path/results/step_1 \
-file 1_wordcount_mapper.py \
-file 1_wordcount_reducer.py \
-mapper 1_wordcount_mapper.py \
-reducer 1_wordcount_reducer.py

The -file options ship the mapper and reducer scripts to the task nodes so the streaming job can find and execute them.
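The two scripts referenced above are not listed in this section; what follows is a minimal sketch of what they might look like, assuming the first step counts how often each word occurs in each document (the tf_t,d term of the formula), that each input file is one document, and that the mapper reads the document name from the mapreduce_map_input_file environment variable set by Hadoop Streaming (older releases use map_input_file). The "word@docname" composite key is our own choice so the default shuffle groups records correctly; your actual 1_wordcount_mapper.py and 1_wordcount_reducer.py may differ.

#!/usr/bin/env python
# 1_wordcount_mapper.py (sketch): emit one "word@docname<TAB>1" line per token.
import os
import sys

# Hadoop Streaming exposes the current split's file path as an environment
# variable; fall back to the older name if the new one is absent.
path = os.environ.get("mapreduce_map_input_file") or \
       os.environ.get("map_input_file", "unknown_doc")
doc = os.path.basename(path)

for line in sys.stdin:
    for word in line.strip().lower().split():
        print("%s@%s\t1" % (word, doc))

#!/usr/bin/env python
# 1_wordcount_reducer.py (sketch): input arrives sorted by key, so identical
# "word@docname" keys are consecutive and can be summed in a single pass.
import sys

current_key, count = None, 0
for line in sys.stdin:
    key, value = line.rstrip("\n").split("\t", 1)
    if key == current_key:
        count += int(value)
    else:
        if current_key is not None:
            print("%s\t%d" % (current_key, count))
        current_key, count = key, int(value)

if current_key is not None:
    print("%s\t%d" % (current_key, count))

Both scripts need to be executable on the task nodes; if they are not picked up as executables on your cluster, invoking them as -mapper "python 1_wordcount_mapper.py" (and likewise for the reducer) is a common workaround.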