<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="utf-8">
<title>Data-Intensive Computing with MapReduce</title>
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<meta name="description" content="">
<meta name="author" content="">
<!-- Le styles -->
<link href="assets/css/bootstrap.css" rel="stylesheet">
<style>
body {
padding-top: 60px; /* 60px to make the container go all the way to the bottom of the topbar */
}
</style>
<link href="assets/css/bootstrap-responsive.css" rel="stylesheet">
<!-- HTML5 shim, for IE6-8 support of HTML5 elements -->
<!--[if lt IE 9]>
<script src="http://html5shim.googlecode.com/svn/trunk/html5.js"></script>
<![endif]-->
<!-- Fav and touch icons -->
<!--link rel="apple-touch-icon-precomposed" sizes="144x144" href="assets/ico/apple-touch-icon-144-precomposed.png">
<link rel="apple-touch-icon-precomposed" sizes="114x114" href="assets/ico/apple-touch-icon-114-precomposed.png">
<link rel="apple-touch-icon-precomposed" sizes="72x72" href="assets/ico/apple-touch-icon-72-precomposed.png">
<link rel="apple-touch-icon-precomposed" href="assets/ico/apple-touch-icon-57-precomposed.png">
<link rel="shortcut icon" href="assets/ico/favicon.png"-->
</head>
<body>
<div class="navbar navbar-inverse navbar-fixed-top">
<div class="navbar-inner">
<div class="container">
<a class="btn btn-navbar" data-toggle="collapse" data-target=".nav-collapse">
<span class="icon-bar"></span>
<span class="icon-bar"></span>
<span class="icon-bar"></span>
</a>
<div class="nav-collapse collapse">
<ul class="nav">
<li><a href="index.html">Home</a></li>
<li><a href="overview.html">Overview</a></li>
<li><a href="syllabus.html">Syllabus</a></li>
<li class="active"><a href="assignments.html">Assignments</a></li>
</ul>
</div>
</div>
</div>
</div>
<div class="container">
<div class="page-header">
<h1>Assignments <small>Data-Intensive Computing with MapReduce (Spring 2013)</small></h1>
</div>
<div class="subnav">
<ul class="nav nav-pills">
<li><a href="#assignment0">0</a></li>
<li><a href="#assignment1">1</a></li>
<li><a href="#assignment2">2</a></li>
<li><a href="#assignment3">3</a></li>
<li><a href="#assignment4">4</a></li>
<li><a href="#assignment5">5</a></li>
<li><a href="#assignment6">6</a></li>
<li><a href="#finalproject">Final Project</a></li>
</ul>
</div>
<section id="assignment0" style="padding-top:35px">
<div>
<h3>Assignment 0: Prelude <small>due 6:00pm January 24</small></h3>
<p>Complete
the <a href="http://lintool.github.com/Cloud9/docs/word-count.html">word
count tutorial</a> in Cloud<sup>9</sup>, which is a Hadoop toolkit we're
going to use throughout the course. The tutorial will take you
through setting up Hadoop on your local machine and running Hadoop on
the virtual machine. It'll also begin familiarizing you with
GitHub.</p>
<p><b>Note:</b> This assignment is not explicitly graded, except as
part of Assignment 1.</p>
<p style="padding-top: 20px"><a href="#">Back to top</a></p>
</div>
</section>
<section id="assignment1" style="padding-top:35px">
<div>
<h3>Assignment 1: Warmup <small>due 6:00pm January 31</small></h3>
<p>Make sure you've completed
the <a href="http://lintool.github.com/Cloud9/docs/word-count.html">word
count tutorial</a> in Cloud<sup>9</sup>.</p>
<p>Sign up for a <a href="http://github.com/">GitHub</a> account. It is
very important that you do so as soon as possible, because GitHub is
the mechanism by which you will submit assignments. Once you've signed
up for an account, go to this page
to <a href="https://github.com/edu">request an educational
account</a>.</p>
<p>Next, create a <b>private</b> repo
called <code>MapReduce-assignments</code>. Here
is <a href="https://help.github.com/articles/create-a-repo">how you
create a repo on GitHub</a>. For "Who has access to this repository?",
make sure you click "Only the people I specify". If you've
successfully gotten an educational account (per above), you should be
able to create private repos for free. Take some time to learn about
git if you've never used it before. There are plenty of good tutorials
online: do a simple web search and find one you like. If you've used
svn before, many of the concepts will be familiar, except that git
is far more powerful.</p>
<p>After you've learned about git, set aside the repo for now; you'll
come back to it later.</p>
<p>In the single-node virtual cluster in the word count tutorial, you
should have run the word count demo with five reducers:</p>
<pre>
etc/hadoop-cluster.sh edu.umd.cloud9.example.simple.DemoWordCount \
-input bible+shakes.nopunc.gz -output wc -numReducers 5
</pre>
<p>Answer the following questions:</p>
<p><b>Question 1.</b> What is the first term
in <code>part-r-00000</code> and how many times does it appear?</p>
<p><b>Question 2.</b> What is the third to last term
in <code>part-r-00004</code> and how many times does it appear?</p>
<p><b>Question 3.</b> How many unique terms are there? (Hint: read the
counter values)</p>
<p>Let's do a little bit of cleanup of the words. Modify the word
count demo so that only words consisting entirely of letters are
counted. To be more specific, the word must match the following Java
regular expression:</p>
<pre>
word.matches("[A-Za-z]+")
</pre>
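<p>For concreteness, here's a minimal sketch of where this check might
go in the mapper (illustrative only; the exact variable names depend on
the version of the demo you're working from):</p>
<pre>
// Inside the mapper's map() method, after tokenizing the line:
StringTokenizer itr = new StringTokenizer(value.toString());
while (itr.hasMoreTokens()) {
  String w = itr.nextToken();
  // Only count words consisting entirely of letters.
  if (w.matches("[A-Za-z]+")) {
    word.set(w);
    context.write(word, one);
  }
}
</pre>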
<p>Now run word count again, also with five reducers. Answer the
following questions:</p>
<p><b>Question 4.</b> What is the first term
in <code>part-r-00000</code> and how many times does it appear?</p>
<p><b>Question 5.</b> What is the third to last term
in <code>part-r-00004</code> and how many times does it appear?</p>
<p><b>Question 6.</b> How many unique terms are there?</p>
<h4 style="padding-top: 10px">Turning in the Assignment</h4>
<p>Please follow these instructions carefully!</p>
<p>Per above, you should have a private GitHub repo called
<code>MapReduce-assignments</code>. Inside the repo, create a
directory called <code>assignment1</code>; in that directory, create a
file called <code>assignment1.md</code>. In that file, put your
answers to the above questions 1 through 6. Use Markdown
markup: here's
a <a href="http://daringfireball.net/projects/markdown/basics">simple
guide</a>. Here's an <a href="http://markable.in/editor/">online
editor</a> that's also helpful.</p>
<p>Create a directory called <code>assignment1/src/</code> and
put your source code in there (i.e., the modified word count
demo).</p>
<p>Next, in the directory <code>assignment1/</code>, create a shell
script called <code>run-assignment1.sh</code>, which when executed
will run the code that answers questions 4 through 6 (i.e., modified
word count). When I check out your repo in my copy of the virtual
machine, I should be able to execute that script to run your code.</p>
<p>You can assume, just like in the word count tutorial, that the
input file is already in HDFS
as <code>bible+shakes.nopunc.gz</code>. Your script should put the
word count output in a directory whose name is the same as your GitHub
username. Please don't name the output directory <code>output/</code>
or something generic.</p>
<p>How you structure your repo to make the script work is up to
you. My recommendation would be to check compiled jars into the
repo and then have <code>run-assignment1.sh</code> execute the
appropriate Hadoop command, but working out these details is part of the
assignment.</p>
<table><tr><td valign="top"><span class="label label-warning">Warning</span></td>
<td style="padding-left: 10px">Make sure your assignment <b>does
not</b> depend on any files or paths outside your repo. If you have a
hard-coded absolute path in your run script, for example, it will
probably break, since the same locations may not exist on my
machine. The only exception is the <code>bible+shakes.nopunc.gz</code>
data, which you can expect to be in HDFS.</td></tr></table>
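<p>As a purely illustrative sketch (the jar name and class name here
are hypothetical placeholders; working out the real details is part of
the assignment), <code>run-assignment1.sh</code> might look something
like this:</p>
<pre>
#!/bin/bash
# Run the modified word count (questions 4 through 6) on the copy of
# bible+shakes.nopunc.gz already in HDFS; write the output to a
# directory named after your GitHub username.
hadoop jar assignment1.jar MyWordCount \
  -input bible+shakes.nopunc.gz -output YOURGITHUBUSERNAME -numReducers 5
</pre>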
<p>In summary, there are three deliverables to this homework:</p>
<ul>
<li><code>MapReduce-assignments/assignment1/assignment1.md</code>: actual answers to questions.</li>
<li><code>MapReduce-assignments/assignment1/src/</code>: source code goes into this directory.</li>
<li><code>MapReduce-assignments/assignment1/run-assignment1.sh</code>: run script.</li>
</ul>
<p>Make sure you've committed your code and pushed your repo back to
origin. You can verify that it's there by logging into your GitHub
account, and your assignment should be viewable in the web
interface.</p>
<p>Almost there! Add the
user <a href="https://github.com/teachtool">teachtool</a> as a
collaborator to your repo so that I can check it out (under Settings
in the main web interface, top right corner of your repo). Note:
do <b>not</b> add my primary GitHub
account <a href="https://github.com/lintool">lintool</a> as a
collaborator.</p>
<p>Finally, send me an email to [email protected] with the subject
line "MapReduce Assignment #1". In the body of the email message, tell
me what your GitHub username is so that I can link your repo to
you. Also, in your email please tell me how long you spent doing the
assignment, including everything (installing the VM, learning about
git, working through the tutorial, etc.).</p>
<h4 style="padding-top: 10px">Grading</h4>
<p>The purpose of this assignment is to familiarize you with the
Hadoop development environment. You'll get a "pass" if you've
successfully completed the assignment. I expect everyone to get a
"pass".</p>
<p style="padding-top: 20px"><a href="#">Back to top</a></p>
</div>
</section>
<section id="assignment2" style="padding-top:35px">
<div>
<h3>Assignment 2: Counting <small>due 6:00pm February 14</small></h3>
<h4>Setting Up Your Development Environment</h4>
<p>First, just so everyone's VM is sync'ed up, update all packages
via:</p>
<pre>
sudo yum update
</pre>
<p>If it's the first time you've done this after downloading the VM
image, it might take a bit, so grab a cup of coffee.</p>
<p>After the VM is updated, clone
the <a href="https://github.com/lintool/MapReduce-course-2013s">Git
repo for the course</a>. If you have the repo already cloned,
make sure you do a pull to get the latest updates. When in
doubt, type <code>git log</code> in your console to pull up the most
recent commits, and it should match the latest commit
<a href="https://github.com/lintool/MapReduce-course-2013s">here</a>.</p>
<p>You'll find a
directory named <code><a href="https://github.com/lintool/MapReduce-course-2013s/blob/master/assignments-stub/">assignments-stub/</a></code>, which provides a template
for your assignment. Copy the contents of the directory into your own assignments
private repo,
under <code>MapReduce-assignments/assignment2/</code>.
Note that the source directory contains (normally invisible) dot-files,
e.g., <code>.gitignore</code>; remember to copy these as well.
Go ahead and
commit the contents so that you can revert to this point easily. Go
into <code>MapReduce-assignments/assignment2/</code>: you should be
able to type <code>ant</code> and successfully build the project.</p>
<p>Since we're going to be working with this basic repository
structure for subsequent assignments, you should familiarize yourself
with the setup. Let's first take a tour: <a href="http://ant.apache.org/">Ant</a> is a build
system, and through <a href="http://ant.apache.org/ivy/">Ivy</a>, it
downloads all dependent jars and places them
in <code>lib/</code>. That is, all the jars in <code>lib/</code> are
automatically placed there—you shouldn't ever need to worry
about copying jars there directly. Also, <code>lib/</code> should <b>not</b>
be placed under version control.</p>
<p>How does Ivy know what dependencies to pull in? This is specified
in <code><a href="https://github.com/lintool/MapReduce-course-2013s/blob/master/assignments-stub/ivy/ivy.xml">ivy/ivy.xml</a></code>,
in this line:</p>
<pre>
<dependency org="edu.umd" name="cloud9" rev="1.4.10" conf="*->*,!sources,!javadoc"/>
</pre>
<p>Ivy automatically finds and downloads Cloud<sup>9</sup> and
transitively pulls its dependencies also. Add to this file if you want
to use any external libraries.</p>
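<p>For example, if you wanted to pull in Guava (purely as an
illustration; it's not required for this assignment), you would add a
line like this next to the existing dependency:</p>
<pre>
<dependency org="com.google.guava" name="guava" rev="14.0.1" conf="*->*,!sources,!javadoc"/>
</pre>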
<p>Source code is kept in <code>src/</code>: main code goes
into <code><a href="https://github.com/lintool/MapReduce-course-2013s/blob/master/assignments-stub/src/main">src/main/</a></code>,
JUnit tests go
into <code><a href="https://github.com/lintool/MapReduce-course-2013s/blob/master/assignments-stub/src/test">src/test/</a></code>. There
are source code stubs to get you started. If you use Eclipse as your
IDE, you should be able to directly import the project.</p>
<p>After Ant successfully completes the build, the packaged jar is
created in <code>dist/</code>. Note that <code>dist/</code>
should <b>not</b> be placed under version control since it is built
automatically.</p>
<p>For your convenience, <code>ant</code> generates four run scripts
in <code>etc/</code>:</p>
<p>Use <code>run.sh</code> to run any normal Java class with
a <code>main</code>, e.g.:</p>
<pre>
etc/run.sh HelloWorld
</pre>
<p>Use <code>junit.sh</code> to run a specific JUnit test, e.g.:</p>
<pre>
etc/junit.sh SampleTest
</pre>
<p>Use <code>hadoop-local.sh</code> to run a Hadoop job in local
(standalone) mode, e.g.:</p>
<pre>
etc/hadoop-local.sh WordCount -input bible+shakes.nopunc.gz -output wc
</pre>
<p>Use <code>hadoop-cluster.sh</code> to run a Hadoop job in the VM in
pseudo-distributed mode, e.g.:</p>
<pre>
etc/hadoop-cluster.sh WordCount -input bible+shakes.nopunc.gz -output wc -numReducers 5
</pre>
<p>Ant provides a couple of other useful features. To run all test
cases:</p>
<pre>
ant test
</pre>
<p>If you're getting an error along the lines of "the class
org.apache.tools.ant.taskdefs.optional.junit.JUnitTask was not found"
or "java.lang.ClassNotFoundException:
org.apache.tools.ant.taskdefs.optional.TraXLiaison", do:</p>
<pre>
sudo yum install ant-junit
sudo yum install ant-trax
</pre>
<p>To generate Javadoc:</p>
<pre>
ant javadoc
</pre>
<p>The API docs will be deposited in <code>docs/api/</code>.</p>
<h4 style="padding-top: 10px">The Assignment</h4>
<p>This assignment begins with an optional <i>but recommended</i>
component: complete
the <a href="http://lintool.github.com/Cloud9/docs/exercises/bigrams.html">bigram
counts exercise</a> in Cloud<sup>9</sup>. The solution is already
checked into the repo, so it won't be graded. Even if you decide not to
write code for the exercise, take some time to sketch out what the
solution would look like. The exercises are designed to help you
learn: jumping directly to the solution defeats this purpose.</p>
<p>In this assignment you'll be
computing <a href="http://en.wikipedia.org/wiki/Pointwise_mutual_information">pointwise
mutual information</a>, which is a function of two events <i>x</i>
and <i>y</i>:</p>
<p><img width="200" src="assets/images/PMI.png"/></p>
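<p>In text form, the formula shown above is the standard PMI
definition (using log base 10, per the note below):</p>
<pre>
PMI(x, y) = log10( P(x, y) / ( P(x) * P(y) ) )
</pre>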
<p>The larger the magnitude of PMI for <i>x</i> and <i>y</i> is,
the more information you know about the probability of seeing <i>y</i>
having just seen <i>x</i> (and vice-versa, since PMI is
symmetrical). If seeing <i>x</i> gives you no information about seeing
<i>y</i>, then <i>x</i> and <i>y</i> are independent and the PMI is
zero.</p>
<p>Write a program that computes the PMI of words in the
sample <code>bible+shakes.nopunc.gz</code> corpus. To be more
specific, the event we're after is <i>x</i> occurring on a line in the
file or <i>x</i> and <i>y</i> co-occurring on a line. That is, if a
line contains A, A, B; then there are <i>not</i> two instances of A
and B appearing together, only one. To reduce the number of spurious
pairs, we are only interested in pairs of words that co-occur in ten
or more lines. Use the same definition of "word" as in the word count
demo: whatever Java's <code>StringTokenizer</code> gives.</p>
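<p>One detail worth a sketch: since multiple occurrences of a word on
the same line count only once, you'll want to deduplicate the tokens on
each line before emitting anything. A minimal illustration, assuming the
same tokenization as the word count demo:</p>
<pre>
// Collect the distinct words on a line; duplicates count only once.
Set<String> words = new HashSet<String>();
StringTokenizer itr = new StringTokenizer(line);
while (itr.hasMoreTokens()) {
  words.add(itr.nextToken());
}
// Events are defined over this set: x occurring on the line, or
// the pair (x, y) co-occurring on the line.
</pre>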
<p>You will build two versions of the program:</p>
<ol>
<li>A "pairs" implementation. The implementation must use
combiners. Name this implementation <code>PairsPMI</code>.</li>
<li>A "stripes" implementation. The implementation must use
combiners. <code>StripesPMI</code>.</li>
</ol>
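<p>As a rough sketch of the pairs approach (not the required structure;
per the hints below, your actual solution will likely need more than one
MapReduce job to compute the marginal counts), the mapper might emit one
key-value pair per co-occurring word pair, iterating over the
deduplicated set of words from the sketch above:</p>
<pre>
// Pairs mapper sketch: emit ((x, y), 1) for each distinct pair of
// words co-occurring on a line. PairOfStrings is one of the pair
// types in Cloud9; a combiner can then sum up the partial counts.
for (String x : words) {
  for (String y : words) {
    if (!x.equals(y)) {
      context.write(new PairOfStrings(x, y), new IntWritable(1));
    }
  }
}
</pre>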
<p>If you feel compelled (for extra credit), you are welcome to try
out the "in-mapper combining" technique for both implementations.</p>
<p>Since PMI is symmetrical, PMI(x, y) = PMI(y, x). However, it's
actually easier in your implementation to compute both values, so
don't worry about duplicates. Also, use <code>TextOutputFormat</code>
so the results of your program are human readable.</p>
<p><b>Note:</b> just so everyone's answer is consistent, please use
log base 10.</p>
<p>Answer the following questions:</p>
<p><b>Question 0.</b> <i>Briefly</i> describe in prose your solution,
both the pairs and stripes implementation. For example: how many
MapReduce jobs? What are the input records? What are the intermediate
key-value pairs? What are the final output records? A paragraph for
each implementation is about the expected length.</p>
<p><b>Question 1.</b> What is the running time of the complete pairs
implementation (in your VM)? What is the running time of the complete
stripes implementation (in your VM)?</p>
<p><b>Question 2.</b> Now disable all combiners. What is the running
time of the complete pairs implementation now? What is the running
time of the complete stripes implementation?</p>
<p><b>Question 3.</b> How many distinct PMI pairs did you extract?</p>
<p><b>Question 4.</b> What's the pair (x, y) with the highest PMI?
Write a sentence or two to explain what it is and why it has such a
high PMI.</p>
<p><b>Question 5.</b> What are the three words that have the highest
PMI with "cloud" and "love"? And what are the PMI values?</p>
<p>Note that you can compute the answers to questions 3 to 5 however
you wish: a helper Java program, a Python script, command-line
manipulation, etc.</p>
<h4 style="padding-top: 10px">Turning in the Assignment</h4>
<p>Please follow these instructions carefully!</p>
<p>Make sure your repo has the following items:</p>
<ul>
<li>Similar to your first assignment, the answers to the questions go
in <code>MapReduce-assignments/assignment2/assignment2.md</code>.</li>
<li>The pairs implementation should be
in <code>MapReduce-assignments/assignment2/src/main/PairsPMI.java</code>.</li>
<li>The stripes implementation should be
in <code>MapReduce-assignments/assignment2/src/main/StripesPMI.java</code>.</li>
<li>Of course, your repo may contain other Java code, which goes in
<code>MapReduce-assignments/assignment2/src/main/</code>.</li>
</ul>
<p>When grading, I will perform a clean clone of your repo in my
VM, <code>cd MapReduce-assignments/assignment2/</code> and
type <code>ant</code> to build. Your code should build
successfully.</p>
<p>Next, I'll type (exactly) the following command to run the pairs
implementation (in the VM):</p>
<pre>
etc/hadoop-cluster.sh PairsPMI -input bible+shakes.nopunc.gz -output YOURNAME-pairs -numReducers 5
</pre>
<p>You can assume that <code>bible+shakes.nopunc.gz</code> is already
in HDFS but otherwise there is nothing else on HDFS. The final output
should appear in a directory called <code>YOURNAME-pairs</code>. The part
files in that directory should be human readable.</p>
<p>Similarly, I'll type the following command to run the stripes
implementation (in the VM):</p>
<pre>
etc/hadoop-cluster.sh StripesPMI -input bible+shakes.nopunc.gz -output YOURNAME-stripes -numReducers 5
</pre>
<p>As in the pairs case, you can assume
that <code>bible+shakes.nopunc.gz</code> is already in HDFS but
otherwise there is nothing else on HDFS. The final output should
appear in a directory called <code>YOURNAME-stripes</code>. The part
files in that directory should be human readable.</p>
<p>Before you consider the assignment "complete", I would recommend
that you verify everything above works by performing a clean clone of
your repo and going through the steps above.</p>
<p>One final suggestion: sometimes Ivy gets into a weird state due to
multiple interacting repositories. Just to make sure I can pull in all
dependencies, remove the Ivy cache with <code>rm -r
~/.ivy2/cache</code> and make sure the build still works. Ivy should
re-download all dependent jars from their original sources.</p>
<p>When you've done everything, commit to your repo and remember to
push back to origin. You should be able to see your edits in the web
interface. That's it! There's no need to send me anything—I
already know your username from the first assignment. Note that
everything should be committed and pushed to origin before the
deadline (before class on February 14).</p>
<h4 style="padding-top: 10px">Hints</h4>
<ul>
<li>Did you take a look at the <a href="http://lintool.github.com/Cloud9/docs/exercises/bigrams.html">bigram
counts exercise</a>?</li>
<li>Your solution may require more than one MapReduce job.</li>
<li>Recall the techniques from lecture for loading in "side data".</li>
<li>Look in <code>edu.umd.cloud9.example.cooccur</code> for a reference implementation of the pairs and stripes techniques.</li>
<li>Note that you have access to everything that's in Cloud<sup>9</sup>, for example, there are many useful <code>Writable</code> types in <code>edu.umd.cloud9.io</code>.</li>
</ul>
<h4 style="padding-top: 10px">Grading</h4>
<p>The entire assignment is worth 35 points:
<ul>
<li>Each of the questions 1 to 5 is worth 2 points, for a total of 10
points.</li>
<li>The pairs implementation is worth 10 points and the stripes
implementation is worth 10 points. The purpose of question 0 is to
help me understand your implementation.</li>
<li>Getting your code to run is worth 5
points. That is, to earn all five points, I should be able to run your
code (building and running), following exactly the procedure
above. Therefore, if all the answers are correct and the
implementation seems correct, but I cannot get your code to build and
run inside my VM, you will only get a score of 30/35.</li>
</ul>
<p style="padding-top: 20px"><a href="#">Back to top</a></p>
</div>
</section>
<section id="assignment3" style="padding-top:35px">
<div>
<h3>Assignment 3: Inverted Indexing <small>due 6:00pm February 28</small></h3>
<h4>Setting Up Your Development Environment</h4>
<p>Before you begin, update all packages in the VM:</p>
<pre>
sudo yum update
</pre>
<p>In your private repo,
create <code>MapReduce-assignments/assignment3/</code>. Pull the
<a href="https://github.com/lintool/MapReduce-course-2013s">Git repo
for the course</a> to make sure you have the latest updates. Copy
the contents of <code>assignments-stub/</code>
into <code>MapReduce-assignments/assignment3/</code>. Make sure you have
the latest version of the files: <b>do not</b> just copy files from
<code>MapReduce-assignments/assignment2/</code> because you may not
get the latest code updates.</p>
<h4 style="padding-top: 10px">The Assignment</h4>
<p>This assignment begins with an optional <i>but recommended</i>
component: complete
the <a href="http://lintool.github.com/Cloud9/docs/exercises/indexing.html">inverted
indexing exercise</a>
and <a href="http://lintool.github.com/Cloud9/docs/exercises/retrieval.html">boolean
retrieval exercise</a> in Cloud<sup>9</sup>. The solution is already
checked into the repo, so it won't be graded. However, the rest of the
assignment builds from there. Even if you decide not to write code for
those two exercises, take some time to sketch out what the solution
would look like. The exercises are designed to help you learn: jumping
directly to the solution defeats the purpose.</p>
<p>Starting from the inverted indexing baseline, modify the indexer
code in the two following ways:</p>
<p><b>1. Index Compression.</b> The index should be compressed using
VInts: see <code>org.apache.hadoop.io.WritableUtils</code>. You should
also use gap-compression techniques as appropriate.</p>
<p><b>2. Scalability.</b> The baseline indexer implementation
currently buffers and sorts postings in the reducer, which as we
discussed in class is not a scalable solution. Address this
scalability bottleneck using techniques we discussed in class and in
the textbook.</p>
<p><b>Note:</b> The major scalability issue is
buffering <i>uncompressed</i> postings in memory. In your solution,
you'll still end up buffering each postings list, but
in <i>compressed</i> form (raw bytes, no additional object
overhead). This is fine because if you use the right compression
technique, the postings lists are quite small. As a data point, on a
collection of 50 million web pages, 2GB heap is more than enough for a
full <i>positional</i> index (and in this assignment you're not asked
to store positional information in your postings).</p>
<p>To go into a bit more detail: in the reference implementation, the
final value type is <code>PairOfWritables<IntWritable,
ArrayListWritable<PairOfInts>></code>. The most obvious idea
is to change that into something
like <code>PairOfWritables<VIntWritable,
ArrayListWritable<PairOfVInts>></code>. This does not work!
The reason is that you will still be materializing each posting, i.e.,
all <code>PairOfVInts</code> objects in memory. This translates into a
Java object for every posting, which is wasteful in terms of memory
usage and will exhaust memory pretty quickly as you scale. In other
words, you're <i>still</i> buffering objects—just inside
the <code>ArrayListWritable</code>.</p>
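<p>Instead, one workable approach (a sketch, not the only acceptable
solution) is to serialize each posting into raw bytes as it arrives in
the reducer, using VInts with gap compression, so that the buffered
postings list is just a growing byte array:</p>
<pre>
// Reducer-side sketch: buffer postings as compressed raw bytes.
// Assumes postings arrive sorted by docno (e.g., via secondary sort).
ByteArrayOutputStream bytes = new ByteArrayOutputStream();
DataOutputStream out = new DataOutputStream(bytes);
int prevDocno = 0;
int df = 0;
for (PairOfInts posting : postings) {
  // Gap compression: store the difference from the previous docno.
  WritableUtils.writeVInt(out, posting.getLeftElement() - prevDocno);
  WritableUtils.writeVInt(out, posting.getRightElement()); // term freq
  prevDocno = posting.getLeftElement();
  df++;
}
// Emit df followed by the compressed postings, e.g., as a BytesWritable.
</pre>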
<p>This new indexer should be
named <code>BuildInvertedIndexCompressed</code>.</p>
<p>Modify <code><a href="https://github.com/lintool/Cloud9/blob/master/src/dist/edu/umd/cloud9/example/ir/LookupPostings.java">LookupPostings</a></code>
so that it works with the new compressed indexes. Name this new
class <code>LookupPostingsCompressed</code> in your private repo under
<code>MapReduce-assignments/assignment3/src/main/</code>. This new
class should give <i>exactly</i> the same output as the old
version.</p>
<p>Modify <code><a href="https://github.com/lintool/Cloud9/blob/master/src/dist/edu/umd/cloud9/example/ir/BooleanRetrieval.java">BooleanRetrieval</a></code>
so that it works with the new compressed indexes. Name this new
class <code>BooleanRetrievalCompressed</code> in your private repo under
<code>MapReduce-assignments/assignment3/src/main/</code>. This new
class should give <i>exactly</i> the same output as the old
version.</p>
<p>The single question to answer is:</p>
<p><b>Question 1.</b> What is the size of your compressed index? (I
need this value just in case I can't get your code to compile and
run.)</p>
<h4 style="padding-top: 10px">Turning in the Assignment</h4>
<p>Make sure your repo has the following items:</p>
<ul>
<li>The answer to the question goes
in <code>MapReduce-assignments/assignment3/assignment3.md</code>.</li>
<li><code>MapReduce-assignments/assignment3/src/main/BuildInvertedIndexCompressed.java</code>.</li>
<li><code>MapReduce-assignments/assignment3/src/main/LookupPostingsCompressed.java</code>.</li>
<li><code>MapReduce-assignments/assignment3/src/main/BooleanRetrievalCompressed.java</code>.</li>
<li><code>MapReduce-assignments/assignment3/LookupPostingsCompressed.out</code>: the output of <code>LookupPostingsCompressed</code>.</li>
<li><code>MapReduce-assignments/assignment3/BooleanRetrievalCompressed.out</code>: the output of <code>BooleanRetrievalCompressed</code>.</li>
<li>Of course, your repo may contain other Java code, which goes in
<code>MapReduce-assignments/assignment3/src/main/</code>.</li>
</ul>
<p>When grading, I will perform a clean clone of your repo in my
VM, <code>cd MapReduce-assignments/assignment3/</code> and
type <code>ant</code> to build. Your code should build
successfully.</p>
<p>Next, I'll type (exactly) the following command to run the indexer
(in the VM):</p>
<pre>
etc/hadoop-cluster.sh BuildInvertedIndexCompressed -input bible+shakes.nopunc -output YOURNAME-index -numReducers 1
</pre>
<p>You can assume that <code>bible+shakes.nopunc</code> is already
in HDFS but otherwise there is nothing else on HDFS. The final output
should appear in a directory called <code>YOURNAME-index</code> on HDFS. There
is no need for the index to be human readable; in fact, it shouldn't
be.</p>
<p>I will then issue the following commands:</p>
<pre>
etc/hadoop-cluster.sh LookupPostingsCompressed -index YOURNAME-index -collection bible+shakes.nopunc > lookup.out
etc/hadoop-cluster.sh BooleanRetrievalCompressed -index YOURNAME-index -collection bible+shakes.nopunc > retrieval.out
</pre>
<p><b>Note:</b> The above two classes should read the index and
collection directly from HDFS. If you look at the solutions to the
inverted indexing exercise and boolean retrieval exercise, you'll see
that the equivalent classes can also read data directly from HDFS.</p>
<p>The file <code>lookup.out</code> should be identical to what you
checked in as <code>LookupPostingsCompressed.out</code> and the file
<code>retrieval.out</code> should be identical to what you checked in
as <code>BooleanRetrievalCompressed.out</code>. I will use the
command <code>diff</code> to verify. The purpose of having you store the
two <code>.out</code> files is that, in case I can't get your code to run, I
still have the output to examine.</p>
<p>Note that the output of your new classes should match the old
versions in Cloud<sup>9</sup> <i>exactly</i>. I will
use <code>diff</code> to verify this.</p>
<p>Before you consider the assignment "complete", I would recommend
that you verify everything above works by performing a clean clone of
your repo and going through the steps above. When you've done
everything, commit to your repo and remember to push back to
origin. You should be able to see your edits in the web
interface. That's it! There's no need to send me anything—I
already know your username from before. Note that everything should be
committed and pushed to origin before the deadline (before class on
February 28).</p>
<h4 style="padding-top: 10px">Grading</h4>
<p>The entire assignment is worth 35 points:
<ul>
<li>The implementation of index compression is worth 10 points.</li>
<li>The implementation of the scalable algorithm is worth 10
points.</li>
<li>The implementation of <code>LookupPostingsCompressed</code> is
worth 5 points.</li>
<li>The implementation of <code>BooleanRetrievalCompressed</code> is
worth 5 points.</li>
<li>Getting your code to run is worth 5 points. That is, to earn all
five points, I should be able to run your code (building and running),
following exactly the procedure above. Therefore, if all the answers
are correct and the implementation seems correct, but I cannot get
your code to build and run inside my VM, you will only get a score of
30/35.</li>
</ul>
<p style="padding-top: 20px"><a href="#">Back to top</a></p>
</div>
</section>
<section id="assignment4" style="padding-top:35px">
<div>
<h3>Assignment 4: Graphs <small>due 6:00pm March 14</small></h3>
<h4>Setting Up Your Development Environment</h4>
<p>Before you begin, update all packages in the VM:</p>
<pre>
sudo yum update
</pre>
<p>In your private repo,
create <code>MapReduce-assignments/assignment4/</code>. Pull the
<a href="https://github.com/lintool/MapReduce-course-2013s">Git repo
for the course</a> to make sure you have the latest updates. Copy
the contents of <code>assignments-stub/</code>
into <code>MapReduce-assignments/assignment4/</code>. Make sure you have
the latest version of the files: <b>do not</b> just copy files from
<code>MapReduce-assignments/assignment3/</code> because you may not
get the latest code updates.</p>
<h4 style="padding-top: 10px">The Assignment</h4>
<p>Begin this assignment by taking the time to understand
the <a href="http://lintool.github.com/Cloud9/docs/exercises/pagerank.html">PageRank
reference implementation</a> in Cloud<sup>9</sup>. There is no need to
try the exercise from scratch, but study the code carefully to
understand exactly how it works.</p>
<p>For this assignment, you are going to implement multiple-source
personalized PageRank. As we discussed in class, personalized PageRank
is different from ordinary PageRank in a few respects:</p>
<ul>
<li>There is the notion of a <i>source</i> node, which is what we're
computing the personalization with respect to.</li>
<li>When initializing PageRank, instead of a uniform distribution
across all nodes, the source node gets a mass of one and every other
node gets a mass of zero.</li>
<li>Whenever the model makes a random jump, the random jump is
always back to the source node; this is unlike in ordinary PageRank,
where there is an equal probability of jumping to any node.</li>
<li>All mass lost at dangling nodes is put back into the source
node; this is unlike ordinary PageRank, where the missing mass is
evenly distributed across all nodes. (A sketch of the resulting update
rule appears after this list.)</li>
</ul>
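<p>Putting these differences together, here's a sketch of the per-node
update for a single source (with random jump factor <code>ALPHA</code>,
typically 0.15); the multi-source version simply applies this to each
array position independently:</p>
<pre>
// Sketch of the personalized PageRank update for node n.
// incomingMass = mass received from in-links this iteration;
// missingMass  = total mass lost at dangling nodes this iteration.
float jump = (n == source) ? ALPHA : 0.0f;
float link = (1.0f - ALPHA)
    * (incomingMass + ((n == source) ? missingMass : 0.0f));
float pagerank = jump + link;
</pre>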
<p>Here are some publications about personalized PageRank if you're
interested. They're just provided for background; neither is necessary
for completing the assignment.</p>
<ul>
<li>Daniel Fogaras, Balazs Racz, Karoly Csalogany, and Tamas Sarlos. (2005) <a href="material/Fogaras_etal_2005.pdf">Towards Scaling Fully Personalized PageRank: Algorithms, Lower Bounds, and Experiments.</a> Internet Mathematics, 2(3):333-358.</li>
<li>Bahman Bahmani, Abdur Chowdhury, and Ashish Goel. (2010) <a href="material/Bahmani_etal_VLDB2010.pdf">Fast Incremental and Personalized PageRank.</a> Proceedings of the 36th International Conference on Very Large Data Bases (VLDB 2010).</li>
</ul>
<p>Your implementation is going to run multiple personalized PageRank
computations in parallel, one with respect to each source. The user is
going to specify the sources on the command line. This means that each
PageRank node object (i.e., <code>Writable</code>) is going to contain
an array of PageRank values.</p>
<p>Here's how the implementation is going to work; it largely follows
the reference implementation in the exercise above. It's your
responsibility to make your implementation work with respect to the
command-line invocations specified below.</p>
<p>First, the user is going to convert the adjacency list into
PageRank node records:</p>
<pre>
etc/hadoop-cluster.sh BuildPersonalizedPageRankRecords -input sample-large.txt \
-output YOURNAME-PageRankRecords -numNodes 1458 -sources 9627181,9370233,10207721
</pre>
<p>Note that we're going to use the "large" graph from the exercise
linked above. The <code>-sources</code> option specifies the source
nodes for the personalized PageRank computations. In this case, we're
running three computations in parallel, with respect to node
ids 9627181, 9370233, and 10207721. You can expect the option value to
be in the form of a comma-separated list, and that all node ids
actually exist in the graph. The list of source nodes may be
arbitrarily long, but for practical purposes I won't test your code
with more than a few.</p>
<p>Since we're running three personalized PageRank computations in
parallel, each PageRank node is going to hold an array of three
values, the personalized PageRank values with respect to the first
source, second source, and third source. You can expect the array
positions to correspond exactly to the position of the node id in the
source string.</p>
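<p>One way to represent this (a sketch; you're free to structure your
<code>Writable</code> however you like) is to have the node object
serialize an array of values instead of a single one:</p>
<pre>
// Sketch of serializing one PageRank value per source.
@Override
public void write(DataOutput out) throws IOException {
  out.writeInt(pageranks.length);
  for (float p : pageranks) {
    out.writeFloat(p);
  }
}

@Override
public void readFields(DataInput in) throws IOException {
  pageranks = new float[in.readInt()];
  for (int i = 0; i < pageranks.length; i++) {
    pageranks[i] = in.readFloat();
  }
}
</pre>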
<p>Next, the user is going to partition the graph and get ready to
iterate:</p>
<pre>
hadoop fs -mkdir YOURNAME-PageRank
etc/hadoop-cluster.sh PartitionGraph -input YOURNAME-PageRankRecords \
-output YOURNAME-PageRank/iter0000 -numPartitions 5 -numNodes 1458
</pre>
<p>This will be standard hash partitioning.</p>
<p>After setting everything up, the user will iterate multi-source
personalized PageRank:</p>
<pre>
etc/hadoop-cluster.sh RunPersonalizedPageRankBasic -base YOURNAME-PageRank \
-numNodes 1458 -start 0 -end 20 -sources 9627181,9370233,10207721
</pre>
<p>Note that the sources are passed in from the command-line
again. Here, we're running twenty iterations.</p>
<p>Finally, the user runs a program to extract the top ten personalized
PageRank values, with respect to each source.</p>
<pre>
etc/hadoop-cluster.sh ExtractTopPersonalizedPageRankNodes -input YOURNAME-PageRank/iter0020 \
-top 10 -sources 9627181,9370233,10207721
</pre>
<p>The output should look something like this (printed to stdout):</p>
<pre>
Source: 9627181
0.43721 9627181
0.10006 8618855
0.09015 8980023
0.07705 12135350
0.07432 9562469
0.07432 10027417
0.01749 9547235
0.01607 9880043
0.01402 8070517
0.01310 11122341
Source: 9370233
0.42118 9370233
0.08627 11325345
0.08378 11778650
0.07160 10952022
0.07160 10767725
0.07160 8744402
0.03259 10611368
0.01716 12182886
0.01467 12541014
0.01467 11377835
Source: 10207721
0.38494 10207721
0.07981 11775232
0.07664 12787320
0.06565 12876259
0.06543 8642164
0.06543 10541592
0.02224 8669492
0.01963 10940674
0.01911 10867785
0.01815 9619639
</pre>
<h4 style="padding-top: 10px">Additional Specifications</h4>
<p>To make the final output easier to read, in the
class <code>ExtractTopPersonalizedPageRankNodes</code>, use the
following format to print each (personalized PageRank value, node id)
pair:</p>
<pre>
String.format("%.5f %d", pagerank, nodeid)
</pre>
<p>This will generate the final results in the same format as
above. Also note: print actual probabilities, not log
probabilities—although during the actual PageRank computation
keeping values as log probabilities is better.</p>
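<p>For example, if you keep masses as natural-log probabilities during
the computation, the conversion back at output time is just an
exponentiation (a sketch, assuming a log-space <code>logMass</code>
value):</p>
<pre>
// Convert a natural-log probability back to an actual probability
// before printing.
System.out.println(String.format("%.5f %d", Math.exp(logMass), nodeid));
</pre>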
<p>The final class <code>ExtractTopPersonalizedPageRankNodes</code>
does not need to be a MapReduce job (but it does need to read from
HDFS). Obviously, the other classes need to run MapReduce jobs.</p>
<p>The reference implementation of PageRank in Cloud<sup>9</sup> has
many options: you can either use in-mapper combining or
ordinary combiners. In your implementation, choose one or the
other. You do not need to implement both options. Also, the reference
implementation has an option to either use range partitioning or hash
partitioning: you only need to implement hash partitioning. You can
start with the reference implementation and remove code that you don't
need (see #2 below).</p>
<h4 style="padding-top: 10px">Hints and Suggestion</h4>
<p>To help you out, there's a small helper program in
Cloud<sup>9</sup> that computes personalized PageRank using a
sequential algorithm. Use it to check your answers:</p>
<pre>
etc/run.sh edu.umd.cloud9.example.pagerank.SequentialPersonalizedPageRank -input sample-large.txt -source 9627181
</pre>
<p>The values from your implementation should be pretty close to the
output of the above program, but might differ a bit due to convergence
issues. After 20 iterations, the output of the MapReduce
implementation should match to at least the fourth decimal place.</p>
<p>This is a complex assignment. I would suggest breaking the
implementation into the following steps:</p>
<ol>
<li>First, copy the reference PageRank implementation into your own