The WordCount example (WordCount.java) uses MapReduce and Accumulo to compute word counts for a set of documents. This is accomplished using a map-only MapReduce job and an Accumulo table with combiners.
To run this example, create a directory in HDFS containing text files. You can use the Accumulo README for data:
$ hdfs dfs -mkdir /wc
$ hdfs dfs -copyFromLocal /path/to/accumulo/README.md /wc/README.md
Verify that the file was created:
$ hdfs dfs -ls /wc
Next, run the WordCount MapReduce job with your HDFS input directory (the job creates its output table itself):
$ ./bin/runmr mapreduce.WordCount -i /wc
WordCount.java creates an Accumulo table named examples.wordcount with a SummingCombiner iterator attached to it. It then runs a map-only MapReduce job that reads the text files in the specified HDFS directory and writes word counts to the Accumulo table.
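To see why no reduce phase is needed, the following is a minimal, self-contained sketch of the idea. It does not use the Accumulo or Hadoop APIs. The class name WordCountSketch and its methods are hypothetical, splitting on whitespace is an assumption about how words are tokenized, and an in-memory sorted map stands in for the table. Each word is written with a value of 1, and values written under the same key are summed, which is the role the SummingCombiner plays on the real table.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration only: simulates a map-only job writing to a
// table whose values are summed per key. Not the real WordCount.java.
class WordCountSketch {

    // "Map" step: emit (word, 1) for every whitespace-separated token.
    // Each emitted pair stands in for one mutation written to the table.
    static void map(String line, Map<String, Long> table) {
        for (String word : line.trim().split("\\s+")) {
            if (word.isEmpty()) {
                continue; // skip blank lines
            }
            // "Combiner" step: add the new value to any existing value for
            // this key. In Accumulo the summing is done by the iterator at
            // scan and compaction time, not at write time as it is here.
            table.merge(word, 1L, Long::sum);
        }
    }

    // Runs the map step over all lines and returns the resulting "table".
    static Map<String, Long> count(String... lines) {
        // TreeMap keeps keys sorted, loosely like rows in an Accumulo table.
        Map<String, Long> table = new TreeMap<>();
        for (String line : lines) {
            map(line, table);
        }
        return table;
    }

    public static void main(String[] args) {
        Map<String, Long> table = count("the quick fox", "the lazy dog and the fox");
        table.forEach((word, n) -> System.out.println(word + " " + n));
    }
}
```

Because the summation happens on the table side, mappers can write independently and in any order, which is why the job can skip the shuffle and reduce phases.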
After the MapReduce job completes, query the Accumulo table to see word counts.
$ accumulo shell
username@instance> table examples.wordcount
username@instance examples.wordcount> scan -b the
the count:20080906 [] 75
their count:20080906 [] 2
them count:20080906 [] 1
then count:20080906 [] 1
...
When the WordCount MapReduce job was run above, the client properties were serialized
into the MapReduce configuration. This is insecure if the properties contain sensitive
information such as passwords. A more secure option is to store accumulo-client.properties
in HDFS and run the job with the -d option, which configures the MapReduce job
to obtain the client properties from HDFS:
$ hdfs dfs -mkdir /user
$ hdfs dfs -mkdir /user/myuser
$ hdfs dfs -copyFromLocal /path/to/accumulo/conf/accumulo-client.properties /user/myuser/
$ ./bin/runmr mapreduce.WordCount -i /wc -t examples.wordcount2 -d /user/myuser/accumulo-client.properties
After the MapReduce job completes, query the examples.wordcount2 table. The results should
be the same as before:
$ accumulo shell
username@instance> table examples.wordcount2
username@instance examples.wordcount2> scan -b the
the count:20080906 [] 75
their count:20080906 [] 2
...