<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="utf-8">
<meta http-equiv="X-UA-Compatible" content="IE=edge">
<meta name="viewport" content="width=device-width, initial-scale=1">
<!-- The above 3 meta tags *must* come first in the head; any other head content must come *after* these tags -->
<meta name="description" content="Course homepage for CS 489 Big Data Infrastructure (Winter 2017) at the University of Waterloo">
<meta name="author" content="Jimmy Lin">
<title>Big Data Infrastructure</title>
<!-- Bootstrap -->
<link href="css/bootstrap.min.css" rel="stylesheet">
<!-- IE10 viewport hack for Surface/desktop Windows 8 bug -->
<link href="css/ie10-viewport-bug-workaround.css" rel="stylesheet">
<style>
body {
padding-top: 60px; /* 60px to make the container go all the way to the bottom of the topbar */
}
</style>
<!-- HTML5 shim and Respond.js for IE8 support of HTML5 elements and media queries -->
<!-- WARNING: Respond.js doesn't work if you view the page via file:// -->
<!--[if lt IE 9]>
<script src="https://oss.maxcdn.com/html5shiv/3.7.2/html5shiv.min.js"></script>
<script src="https://oss.maxcdn.com/respond/1.4.2/respond.min.js"></script>
<![endif]-->
</head>
<body>
<nav class="navbar navbar-inverse navbar-fixed-top">
<div class="container">
<div class="navbar-header">
<button type="button" class="navbar-toggle collapsed" data-toggle="collapse" data-target="#navbar" aria-expanded="false" aria-controls="navbar">
<span class="sr-only">Toggle navigation</span>
<span class="icon-bar"></span>
<span class="icon-bar"></span>
<span class="icon-bar"></span>
</button>
</div>
<div id="navbar" class="collapse navbar-collapse">
<ul class="nav navbar-nav">
<li><a href="index.html">Overview</a></li>
<li><a href="organization.html">Organization</a></li>
<li><a href="syllabus.html">Syllabus</a></li>
<li class="active"><a href="assignments.html">Assignments</a></li>
<li><a href="software.html">Software</a></li>
</ul>
</div><!--/.nav-collapse -->
</div>
</nav>
<div class="container">
<div class="page-header">
<div style="float: right"><img src="images/waterloo_logo.png"/></div>
<h1>Assignments <small>CS 489/698 Big Data Infrastructure (Winter 2017)</small></h1>
</div>
<div class="subnav">
<ul class="nav nav-pills">
<li><a href="assignment0.html">0</a></li>
<li><a href="assignment1.html">1</a></li>
<li><a href="assignment2.html">2</a></li>
<li><a href="assignment3.html">3</a></li>
<li><a href="assignment4.html">4</a></li>
<li><a href="assignment5.html">5</a></li>
<li><a href="assignment6.html">6</a></li>
<li><a href="assignment7.html">7</a></li>
<li><a href="project.html">Final Project</a></li>
</ul>
</div>
<section style="padding-top:0px">
<div>
<h3>Assignment 0: Warmup <small>due 1:00pm January 12</small></h3>
<p>The purpose of this assignment is to serve as a warmup exercise and
a practice "dry run" for the submission procedures of subsequent
assignments. You'll have to write a bit of code but this assignment is
mostly about the "mechanics" of setting up your Hadoop development
environment. In addition to running Hadoop locally in either the Linux
student CS environment or on your own machine, you'll also try running
jobs on the Altiscale cluster.</p>
<p>The general setup is as follows: you will complete your assignments
and check everything into a private GitHub repo. Shortly after the
assignment deadline, we'll pull your repo for marking.</p>
<p>I'm assuming you already have
a <a href="http://github.com/">GitHub</a> account. If not, create one
as soon as possible. Once you've signed up for an account, go and
<a href="https://education.github.com/discount_requests/new">request
an educational account</a>. This will allow you to create private
repos for free. Please do this as soon as possible since there may be
delays in the request verification process.</p>
<h4 style="padding-top: 10px">Setting up Hadoop and Spark</h4>
<p>Hadoop and Spark are already installed in
the <code>linux.student.cs.uwaterloo.ca</code> environment (you just
need to do some simple config). Alternatively, you may wish to install
everything locally on your own machine. For both, see
the <a href="software.html">software page</a> for more details.</p>
<p>Bespin is a library that contains reference implementations of "big
data" algorithms in MapReduce and Spark. We'll be using it throughout
this course. Go and run
the <a href="https://github.com/lintool/bespin">Word Count in
MapReduce and Spark</a> example as shown in the Bespin README (clone
and build the repo, download the data files, run word count in both
MapReduce and Spark, and verify the output). Assuming you are
using <code>linux.student.cs.uwaterloo.ca</code> (or if you have
properly set up your local environment), this task should be as simple
as copying and pasting commands from the Bespin README.</p>
<p>When running Hadoop, you might get the following warning:</p>
<pre>
Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
</pre>
<p>It's okay: no need to worry.</p>
<h4 style="padding-top: 10px">Time to write some code!</h4>
<p>Create a <b>private</b> repo
called <code>bigdata2017w</code>. I'm assuming that you're
already familiar with Git and GitHub, but just in case, here
is <a href="https://help.github.com/articles/create-a-repo">how you
create a repo on GitHub</a>. For "Who has access to this repository?",
make sure you click "Only the people I specify". If you've
successfully gotten an educational account (per above), you should be
able to create private repos for free. If you're not already familiar
with Git, there are plenty of good tutorials online: do a simple web
search and find one you like.</p>
<p>What you're going to do now is to copy the MapReduce word count
example into your own private repo. Start with
<a href="assignments/pom.xml">this <code>pom.xml</code></a>: copy it
into your <code>bigdata2017w</code> repo.</p>
<p>Next, copy:</p>
<ul>
<li><code>bespin/src/main/java/io/bespin/java/mapreduce/wordcount/WordCount.java</code> over to</li>
<li><code>bigdata2017w/src/main/java/ca/uwaterloo/cs/bigdata2017w/assignment0/WordCount.java</code>.</li>
</ul>
<p>Open up this new version of <code>WordCount.java</code> using a
text editor (or your IDE of choice) and change the Java package
to <code>ca.uwaterloo.cs.bigdata2017w.assignment0</code>.</p>
<p>Now, in the <code>bigdata2017w/</code> base directory, you should
be able to run Maven to build your package:</p>
<pre>
$ mvn clean package
</pre>
<p>Once the build succeeds, you should be able to run the word count
demo program in your own repository:</p>
<pre>
$ hadoop jar target/bigdata2017w-0.1.0-SNAPSHOT.jar \
ca.uwaterloo.cs.bigdata2017w.assignment0.WordCount -input data/Shakespeare.txt -output wc
</pre>
<p>You should be running this in the Linux student CS environment or
on your own machine. Note that you'll need to copy over the
Shakespeare collection in <code>data/</code>. The output should be
identical to what the same program produces in Bespin; the difference
here is that the code is now in a repository under your control.</p>
<p>Let's make a simple modification to word count: I would like to
know the distribution (counts) of all words that follow the word
"perfect". That is, for the phrase "perfect <i>x</i>", I want to know
how many times each word appears as the <i>x</i>, where <i>x</i> is
any non-zero-length word. To reduce noise, I am not interested
in <i>x</i>'s that appear only once.</p>
<p>Create a program called <code>PerfectX</code> in the
package <code>ca.uwaterloo.cs.bigdata2017w.assignment0</code> that
implements the specifications above.</p>
<p>To be clear, use the same definition of "word"
from <code>WordCount</code>, as follows:</p>
<pre>
String w = itr.nextToken().toLowerCase().replaceAll("(^[^a-z]+|[^a-z]+$)", "");
</pre>
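<p>To see what this definition does to individual tokens, here is a
minimal Python sketch of the same normalization. It is an illustration
only and not part of Bespin; the function name <code>normalize</code>
is hypothetical. It lowercases a whitespace-delimited token and strips
leading and trailing runs of characters outside <code>a-z</code>,
leaving interior characters alone:</p>

```python
import re

# Same pattern as the Java snippet above: a leading or trailing run
# of characters that are not lowercase ASCII letters.
_EDGE_JUNK = re.compile(r"(^[^a-z]+|[^a-z]+$)")


def normalize(token):
    """Lowercase a whitespace-delimited token and trim its non-letter edges."""
    return _EDGE_JUNK.sub("", token.lower())


if __name__ == "__main__":
    # Hypothetical sample tokens, chosen to exercise each case.
    for raw in ["Perfect,", "--'Tis!", "don't", "1599"]:
        print(repr(raw), "->", repr(normalize(raw)))
```

<p>Note that a token such as <code>1599</code> normalizes to the empty
string, which is why the specification restricts <i>x</i> to
non-zero-length words.</p>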
<p>We should be able to run your program as follows:</p>
<pre>
$ hadoop jar target/bigdata2017w-0.1.0-SNAPSHOT.jar \
ca.uwaterloo.cs.bigdata2017w.assignment0.PerfectX \
-input data/Shakespeare.txt -output cs489-2017w-lintool-a0-shakespeare
</pre>
<p>You shouldn't need to write more than a couple of lines of code
(beyond changing class names and other boilerplate). We'll go over the
Hadoop API in more detail in class, but the changes should be
straightforward.</p>
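<p>As a sanity check on the specification (not a Hadoop solution), the
following plain-Python sketch computes the expected answer for a small
in-memory input. It assumes that pairs are formed within a single
line, that tokens normalizing to the empty string are dropped before
pairing, and that Python's whitespace splitting is close enough to the
Java tokenizer for this purpose. The function
names <code>tokenize</code> and <code>perfect_x_counts</code> and the
sample lines are made up for illustration.</p>

```python
import re
from collections import Counter

# Word definition taken from the assignment text, transliterated to Python.
EDGE_JUNK = re.compile(r"(^[^a-z]+|[^a-z]+$)")


def tokenize(line):
    """Split on whitespace, normalize each token, drop empty results (assumption)."""
    words = (EDGE_JUNK.sub("", t.lower()) for t in line.split())
    return [w for w in words if w]


def perfect_x_counts(lines):
    """Count each word x that directly follows "perfect"; keep only counts above 1."""
    counts = Counter()
    for line in lines:
        words = tokenize(line)
        # Pair every word with its successor on the same line.
        for prev, cur in zip(words, words[1:]):
            if prev == "perfect":
                counts[cur] += 1
    return {x: n for x, n in counts.items() if n > 1}


if __name__ == "__main__":
    # Hypothetical three-line input, not taken from the Shakespeare collection.
    sample = [
        "A perfect day, a Perfect day!",
        "perfect love and perfect peace",
        "Perfect day.",
    ]
    print(perfect_x_counts(sample))  # prints {'day': 3}
```

<p>In this toy input, "love" and "peace" each follow "perfect" only
once, so they are filtered out.</p>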
<p>Answer the following questions:</p>
<p><b>Question 1.</b> In the Shakespeare collection, what is the most
frequent <i>x</i> and how many times does it appear? (Answer this
question with command-line tools.)</p>
<p>You can run the above instructions using
<a href="assignments/check_assignment0_public_linux.py"><code>check_assignment0_public_linux.py</code></a> as follows:</p>
<pre>
$ wget http://lintool.github.io/bigdata-2017w/assignments/check_assignment0_public_linux.py
$ chmod +x check_assignment0_public_linux.py
$ ./check_assignment0_public_linux.py lintool
</pre>
<p>We'll be using exactly this script to check your assignment in the
Linux Student CS environment. <b>Important:</b> Make sure that your
code runs there even if you do development on your own machine.</p>
<h4 style="padding-top: 10px">Using the Altiscale Cluster</h4>
<p>The <a href="software.html">software page</a> has details on
getting started with the Altiscale cluster. Register your account and
follow instructions to set up ssh into the "workspace". Make sure
you've properly set up the proxy to view the cluster Resource Manager
(RM) webapp
at <a href="http://rm-ia.s3s.altiscale.com:8088/cluster/"><code>http://rm-ia.s3s.altiscale.com:8088/cluster/</code></a>.
Getting access to the RM webapp is important—you'll need it to
track your job status and for debugging purposes.</p>
<p>Once you've ssh'ed into the workspace, check out Bespin and run
word count:</p>
<pre>
$ hadoop jar target/bespin-0.1.0-SNAPSHOT.jar io.bespin.java.mapreduce.wordcount.WordCount \
-input /shared/cs489/data/enwiki-20161220-sentences-0.1sample.txt -output wc-jmr-combiner
</pre>
<p>Note that we're running word count over a larger collection here: a
10% sample of English Wikipedia totaling 1.6 GB (here's a chance to
exercise your newly-acquired HDFS skills to confirm for yourself).</p>
<p><b>Question 2.</b> Run word count on the Altiscale cluster and make
sure you can access the Resource Manager webapp. What is your
application id? It looks something like
<code>application_XXXXXXXXXXXXX_XXXX</code> and can be found in the
Resource Manager webapp. If you ran word count multiple times, any id
will do.</p>
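<p>If you want to double-check that you copied the id in full before
putting it in your answers, a small Python check against the pattern
above might look like the following. It assumes, following the
placeholder, a 13-digit first field and a second field of at least
four digits; the function name and the example ids in the comments are
made up.</p>

```python
import re

# Shape taken from the placeholder application_XXXXXXXXXXXXX_XXXX (assumed digits).
APP_ID = re.compile(r"application_\d{13}_\d{4,}")


def looks_like_application_id(text):
    """Return True if the whole string has the expected application id shape."""
    return APP_ID.fullmatch(text.strip()) is not None


if __name__ == "__main__":
    # Hypothetical ids: the first is well formed, the second is truncated.
    print(looks_like_application_id("application_1483228800000_0042"))
    print(looks_like_application_id("application_14832288_0042"))
```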
<p><b>Question 3.</b> For this word count job, how many mappers ran in
parallel?</p>
<p><b>Question 4.</b> From the word count program, how many times does
"waterloo" appear in the sample Wikipedia collection?</p>
<p>Now switch into your own <code>bigdata2017w/</code> repo and run
your <code>PerfectX</code> program on the sample Wikipedia data:</p>
<pre>
$ hadoop jar target/bigdata2017w-0.1.0-SNAPSHOT.jar \
ca.uwaterloo.cs.bigdata2017w.assignment0.PerfectX \
-input /shared/cs489/data/enwiki-20161220-sentences-0.1sample.txt \
-output cs489-2017w-lintool-a0-wiki
</pre>
<p><b>Question 5.</b> In the sample Wikipedia collection, what are the
10 most frequent <i>x</i>'s and how many times does each appear?
(Answer this question with command-line tools.)</p>
<p>Note that the Altiscale cluster is a shared resource, and how fast
your jobs complete will depend on how busy it is. You're advised to
begin the assignment early so as to avoid long job queues. "I wasn't able
to complete the assignment because there were too many jobs running on
the cluster" will not be accepted as an excuse if your assignment is
late.</p>
<p>You can run the above instructions using
<a href="assignments/check_assignment0_public_altiscale.py"><code>check_assignment0_public_altiscale.py</code></a> as follows:</p>
<pre>
$ wget http://lintool.github.io/bigdata-2017w/assignments/check_assignment0_public_altiscale.py
$ chmod +x check_assignment0_public_altiscale.py
$ ./check_assignment0_public_altiscale.py lintool
</pre>
<p>We'll be using exactly this script to check your
assignment on the Altiscale cluster.</p>
<h4 style="padding-top: 10px">Turning in the Assignment</h4>
<p>At this point, you should have a GitHub
repo <code>bigdata2017w/</code> and inside the repo, you should have
the word count program copied over from Bespin and the new
perfect <i>x</i> count implementation, along with
your <code>pom.xml</code>. Commit these files. Next, create a file
called <code>assignment0.md</code>
inside <code>bigdata2017w/</code>. In that file, put your answers to
the above questions (1 to 5). Use the Markdown annotation format:
here's
a <a href="http://daringfireball.net/projects/markdown/basics">simple
guide</a>.</p>
<p><b>Note:</b> there is no need to commit <code>data/</code>
or <code>target/</code> (or any results that you may have generated),
so your repo should be very compact — it should only have four
files: two Java source files, <code>pom.xml</code>,
and <code>assignment0.md</code>. You can add a <code>.gitignore</code>
file if you wish.</p>
<p>For this and all subsequent assignments, make sure everything is on
the master branch. Push your repo to GitHub. You can verify that it's
there by logging into your GitHub account in a web browser: your
assignment should be viewable in the web interface.</p>
<p>This and subsequent assignments contain two parts, one that can be
completed locally, and another that requires the Altiscale
cluster. For the first, make sure that your code runs in the Linux
Student CS environment (even if you do development on your own
machine), which is where we will be doing the marking. "But it runs on
my laptop!" will not be accepted as an excuse if we can't get your
code to run.</p>
<p>Almost there! Add the
user <a href="https://github.com/teachtool">teachtool</a> as a
collaborator to your repo so that we can access it (under settings in
the main web interface on your repo). Note: do <b>not</b> add my
primary GitHub
account <a href="https://github.com/lintool">lintool</a> as a
collaborator.</p>
<p>Finally, you need to tell us your GitHub account so we can link it
to you. Submit your information <a href="https://goo.gl/forms/9D7q1T20IRm18PHo1">here</a>.</p>
<p>And that's it!</p>
<p>To give you an idea of how we'll be marking this and future
assignments—we will clone your repo and use the above check
scripts:</p>
<ul>
<li><a href="assignments/check_assignment0_public_linux.py"><code>check_assignment0_public_linux.py</code></a>
in the Linux Student CS environment.</li>
<li><a href="assignments/check_assignment0_public_altiscale.py"><code>check_assignment0_public_altiscale.py</code></a> on the Altiscale cluster.</li>
</ul>
<p>We'll make sure the data files are in the right place, and once the
code completes, we will verify the output. It is highly recommended that
you run these check scripts: if they don't work for you, they won't work
for us either.</p>
<h4 style="padding-top: 10px">Grading</h4>
<p>This assignment is worth a total of 20 points, broken down as
follows:</p>
<ul>
<li>The questions above are worth a total of 10 points.</li>
<li>Getting your code to compile and successfully run is worth
another 8 points (4 points each for <code>PerfectX</code> in the
Linux student CS environment and on Altiscale). We will make a
minimal effort to fix <i>trivial</i> issues with your code (e.g., a
typo)—and deduct points—but <b>will not</b> spend time
debugging your code. It is your responsibility to make sure your
code runs: we have taken care to specify exactly how we will run
your code—if anything is unclear, it is your responsibility to
seek clarification. In order to get a perfect score of 8 for this
portion of the grade, we should be able to run the two public check
scripts above successfully without any errors.</li>
<li>Another 2 points are allotted to our verification of the output of your
program in ways that we will not tell you. We're giving you the
"public" versions of the check scripts; we'll run a "private"
version to examine your output further (i.e., think blind test
cases).</li>
</ul>
<p style="padding-top: 20px"><a href="#">Back to top</a></p>
</div>
</section>
<p style="padding-top:100px"></p>
</div><!-- /.container -->
<!-- jQuery (necessary for Bootstrap's JavaScript plugins) -->
<script src="https://ajax.googleapis.com/ajax/libs/jquery/1.11.3/jquery.min.js"></script>
<!-- Include all compiled plugins (below), or include individual files as needed -->
<script src="js/bootstrap.min.js"></script>
<!-- IE10 viewport hack for Surface/desktop Windows 8 bug -->
<script src="js/ie10-viewport-bug-workaround.js"></script>
</body>
</html>