<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html PUBLIC
"-//W3C//DTD XHTML 1.0 Strict//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xml:lang="en" xmlns="http://www.w3.org/1999/xhtml">
<head>
<title>Assignment #2: Language Models</title>
</head>
<body>
<!--<h1>Assignment #2: Language Models</h1>-->
<h2>Objectives</h2>
<p>The objectives of this assignment are to:</p>
<ul>
<li>Write a program to find <i>n</i>-gram statistics
</li>
<li>Compute the probability of a sentence</li>
<li>Know what a language model is</li>
<li>Experiment with word completion, word prediction, and sentence segmentation</li>
<li>Write a short report of 1 to 2 pages on the assignment</li>
</ul>
<h2>Organization and location</h2>
<p>The third lab session (lab 2) will take place on</p>
<ol>
<li>Group 1, September 14, 2021, 13:15 to 15:00, in the Beta room,
<br/>
Discord link: https://discord.gg/83wWpF7
<br/>
</li>
<li>Group 2, September 14, 2021, 13:15 to 15:00, in the Gamma room,
<br/>
Discord link: https://discord.gg/83wWpF7
<br/>
</li>
<li>Group 3, September 14, 2021, 15:15 to 17:00, in the Gamma room,
<br/>
Discord link: https://discord.gg/83wWpF7
<br/>
</li>
<li>Group 4, September 15, 2021, 13:15 to 15:00, in the Alpha room,
<br/>
Discord link: https://discord.gg/83wWpF7
<br/>
</li>
<li>Group 5, September 15, 2021, 13:15 to 15:00, in the Varg room,
<br/>
Discord link: https://discord.gg/83wWpF7
<br/>
</li>
<li>Group 6, September 15, 2021, 15:15 to 17:00, in the Alpha room,
<br/>
Discord link: https://discord.gg/83wWpF7
<br/>
</li>
<li>Group 7, September 15, 2021, 15:15 to 17:00, in the Varg room.
<br/>
Discord link: https://discord.gg/83wWpF7
<br/>
</li>
</ol>
<p>There may be last-minute changes. Please always check the official times here:
<a
href="https://cloud.timeedit.net/lu/web/lth1/ri1X50gQ6560YfQQ15Z5771Y0Zy7007335Y67Q565.html">
https://cloud.timeedit.net/lu/web/lth1/ri1Q5006.html
</a>
</p>
<p>You can work alone or collaborate with another student.</p>
<p>Each group will have to:</p>
<ul>
<li>Write a Python program.</li>
<li>Check the results and comment on them briefly.</li>
</ul>
<h2>Content of the lab</h2>
<p>
The text of the lab is in the language models notebook available here
<a href="https://github.com/pnugues/edan20/tree/master/notebooks">
https://github.com/pnugues/edan20/tree/master/notebooks
</a>
</p>
<h2>Turning in the assignment</h2>
<p>You are now done with the program. To complete this assignment, you will:</p>
<ol>
<li>Write a short individual report on your program,</li>
<li>Execute the Jupyter notebook by Peter Norvig here: <a
href="http://nbviewer.jupyter.org/url/norvig.com/ipython/How%20to%20Do%20Things%20with%20Words.ipynb">
https://nbviewer.jupyter.org/url/norvig.com/ipython/How to Do Things with Words.ipynb</a>. Just run all
the cells and be sure that you understand the code. You will find the data here: <a
href="http://norvig.com/ngrams/">http://norvig.com/ngrams/</a>.
</li>
<li>In your report, after the description of your program, you will describe one experiment with Norvig's
    notebook and a long string of words you will create yourself or copy from a text you like. You will
    remove all the punctuation and white space from this string and set it in lowercase letters. You will
    just add a cell at the end of Sect. 7 in Norvig's notebook, where you will use your string and run the
    notebook cell with the <tt>segment()</tt> and <tt>segment2()</tt> functions (a normalization sketch
    follows this list). You will comment on the segmentation results you obtain with the unigram and
    bigram models.
</li>
</ol>
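<p>As an illustration, here is a minimal Python sketch of the normalization step above; the sample string
    and variable names are only an example, not part of the assignment:</p>
<pre>
import re

text = 'Det var en gång en katt som hette Nils.'
# Remove everything that is not a word character (punctuation, white space)
# and set the result in lowercase before passing it to segment()
normalized = re.sub(r'\W+', '', text).lower()
print(normalized)  # detvarengångenkattsomhettenils
</pre>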
<p>You will submit your report as well as your notebook (for archiving purposes) to Canvas:
<a href="https://canvas.education.lu.se/">https://canvas.education.lu.se/</a>. To write your report, you can
either:
</p>
<ol>
<li>Write your text directly in Canvas, or</li>
<li>Use LaTeX and Overleaf (<a href="https://www.overleaf.com">https://www.overleaf.com</a>). This will
    probably help you structure your text. You will
    then upload a PDF file to Canvas.
</li>
</ol>
<p>The submission deadline is September 24, 2021.</p>
<!--
<h2>Organization and location</h2>
<p>The second lab session will take place on</p>
<ul>
<li>Group 1: Tuesday, September 17 from 10:15 to 12:00 in the Alpha room</li>
<li>Group 2: Tuesday, September 17 from 13:15 to 15:00 in the Alpha room</li>
<li>Group 3: Wednesday, September 18 from 13:15 to 15:00 in the Val room</li>
<li>Group 4: Wednesday, September 18 from 13:15 to 15:00 in the Falk room</li>
<li>Group 5: Wednesday, September 18 from 15:15 to 17:00 in the Val room</li>
<li>Group 6: Wednesday, September 18 from 15:15 to 17:00 in the Falk room</li>
</ul>
<p>There may be last-minute changes. Please always check the official times here:
<a href="https://cloud.timeedit.net/lu/web/lth1/ri14566340000YQQ45Z5577007y5Y3713gQ5g5X6Y55ZQ076.html">
https://cloud.timeedit.net/lu/web/lth1/ri1Q5006.html
</a>
</p>
<p>You can work alone or collaborate with another student:</p>
<ul>
<li>Each group will have to write Python programs to count unigrams, bigrams, and trigrams in a corpus of
approximately one million words and to determine the probability of a sentence.
</li>
<li>You can test your regular expressions using the <a href="https://regex101.com/">regex101.com</a> site
</li>
<li>Each student will have to write a short report of one to two pages and comment briefly on the results.
    In your report, you must produce the tabulated results of your analysis as described below.
</li>
</ul>
<h2>Programming</h2>
<h3>Collecting a corpus</h3>
<ol>
<li>Retrieve a corpus of novels by Selma Lagerlöf from this URL:
<a href="https://github.com/pnugues/ilppp/blob/master/programs/corpus/Selma.txt">
<tt>https://github.com/pnugues/ilppp/blob/master/programs/corpus/Selma.txt</tt>
</a>. The text of these novels was extracted
from <a href="https://litteraturbanken.se/forfattare/LagerlofS/titlar">Lagerlöf arkivet</a> at
<a href="https://litteraturbanken.se/">Litteraturbanken</a>.
</li>
<li>Alternatively, you can collect a corpus of at least 750,000 words. You will check the number of words using the Unix
command <tt>wc -w</tt>.
</li>
<li>Run the <a href="https://github.com/pnugues/ilppp/tree/master/programs/ch02/python">concordance
program
</a> to print the lines containing a specific word, for instance <i>Nils</i> (a minimal sketch follows this list).
</li>
<li>Run the <a href="https://github.com/pnugues/ilppp/tree/master/programs/ch05/python">tokenization
program
</a> on your corpus and count the words using the Unix <tt>sort</tt> and <tt>uniq</tt> commands.
</li>
</ol>
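<p>As an illustration, here is a minimal line-based concordance sketch in the spirit of step 3; it is not
    the course program, and the file name is only an example:</p>
<pre>
# Print every line of the corpus that contains a given word
word = 'Nils'
with open('Selma.txt', encoding='utf-8') as f:
    for line in f:
        if word in line:
            print(line.rstrip())
</pre>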
<h3>Normalizing a corpus</h3>
<ol>
<li>Write a program to insert <tt><s></tt> and <tt></s></tt> tags to delimit sentences; a sketch
    follows this list. You can start from the tokenization program and modify it. Use a simple
    heuristic such as: a sentence starts with a capital letter and ends with a period. Estimate
    roughly the accuracy of your program.
</li>
<li>Modify your program to remove the punctuation signs and set all the text in lowercase letters.</li>
<li>The result should be a normalized text without punctuation signs where all the sentences are delimited
with <tt><s></tt> and <tt></s></tt> tags.
</li>
<li>The last five lines of the text should look like this:
<pre>
<s> hon hade fått större kärlek av sina föräldrar än någon annan han visste och sådan kärlek måste vändas i välsignelse </s>
<s> då prästen sade detta kom alla människor att se bort mot klara gulla och de förundrade sig över vad de såg </s>
<s> prästens ord tycktes redan ha gått i uppfyllelse </s>
<s> där stod klara fina gulleborg ifrån skrolycka hon som var uppkallad efter själva solen vid sina föräldrars grav och lyste som en förklarad </s>
<s> hon var likaså vacker som den söndagen då hon gick till kyrkan i den röda klänningen om inte vackrare </s>
</pre>
</li>
</ol>
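<p>Here is a minimal Python sketch of such a normalizer under the heuristic above; the regular
    expressions and the file name are only an example, not a reference solution:</p>
<pre>
import re

with open('Selma.txt', encoding='utf-8') as f:
    text = f.read()

# Heuristic: a sentence starts with a capital letter and ends with a period
sentences = re.findall(r'[A-ZÅÄÖ].*?\.', text, re.DOTALL)
normalized = []
for sentence in sentences:
    # Remove the punctuation, set the text in lowercase, and retokenize
    words = re.sub(r'[^\w\s]', '', sentence).lower().split()
    normalized.append('<s> ' + ' '.join(words) + ' </s>')
print('\n'.join(normalized[-5:]))  # the last five lines
</pre>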
<h3>Counting unigrams and bigrams</h3>
<ol>
<li>Read and try programs to compute the frequency of unigrams and bigrams of the training set: [<a
    href="https://github.com/pnugues/ilppp/tree/master/programs/ch05/python">Program folder</a>]. A
    counting sketch follows this list.
</li>
<li>What is the possible number of bigrams and what is their actual number in the corpus? Explain why
    there is such a difference. What would be the possible number of 4-grams?
</li>
<li>Propose a solution to cope with bigrams unseen in the corpus. This topic will be discussed during the
lab session.
</li>
</ol>
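<p>Here is a minimal counting sketch, assuming a normalized corpus file as produced in the previous
    section; the file name is only an example:</p>
<pre>
from collections import Counter

# One sentence per line, delimited with <s> and </s> tags
with open('Selma_normalized.txt', encoding='utf-8') as f:
    words = f.read().split()

unigrams = Counter(words)
bigrams = Counter(zip(words, words[1:]))

# With V word types, the possible number of bigrams is V * V,
# while only a small fraction of them occurs in a real corpus
V = len(unigrams)
print(V, V * V, len(bigrams))
</pre>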
<h3>Computing the likelihood of a sentence</h3>
<ol>
<li>Write a program to compute a sentence's probability using unigrams; a sketch follows this list.
    You may find the dictionaries that we saw in the mutual information program useful: [<a
    href="https://github.com/pnugues/ilppp/tree/master/programs/ch05/python">Program
    folder</a>].
</li>
<li>Write a program to compute the sentence probability using bigrams.</li>
<li>Select five sentences in your test set and run your programs on them.</li>
<li>Tabulate your results as in the examples below with the sentence <i>Det var en gång en katt som hette
Nils</i>:
<pre>
Unigram model
=====================================================
wi       C(wi)    #words     P(wi)
=====================================================
det      21108    1041631    0.0202643738521607
var      12090    1041631    0.01160679741674355
en       13514    1041631    0.01297388422579589
gång     1332     1041631    0.001278763784871994
en       13514    1041631    0.01297388422579589
katt     16       1041631    1.5360525944408337e-05
som      16288    1041631    0.015637015411407686
hette    97       1041631    9.312318853797554e-05
nils     87       1041631    8.352285982272032e-05
</s>     59047    1041631    0.056687060964967444
=====================================================
Prob. unigrams:       5.361459667285409e-27
Geometric mean prob.: 0.0023600885848765307
Entropy rate:         8.726943273141258
Perplexity:           423.71290908655254

Bigram model
=====================================================
wi       wi+1     Ci,i+1   C(i)     P(wi+1|wi)
=====================================================
<s>      det      5672     59047    0.09605907158704083
det      var      3839     21108    0.1818741709304529
var      en       712      12090    0.058891645988420185
en       gång     706      13514    0.052242119283705785
gång     en       20       1332     0.015015015015015015
en       katt     6        13514    0.0004439840165754033
katt     som      2        16       0.125
som      hette    45       16288    0.002762770137524558
hette    nils     0        97       0.0 *backoff: 8.352285982272032e-05
nils     </s>     2        87       0.022988505747126436
=====================================================
Prob. bigrams:        2.376007803503683e-19
Geometric mean prob.: 0.013727289294133601
Entropy rate:         6.186809422848149
Perplexity:           72.84759420254609
</pre>
</li>
</ol>
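<p>Here is a minimal sketch of the unigram computation, assuming the <tt>unigrams</tt> counter from the
    previous section; the bigram version replaces the relative frequencies with the conditional
    probabilities <tt>bigrams[(wi, wj)] / unigrams[wi]</tt>:</p>
<pre>
import math

# The test sentence, normalized and closed with the </s> tag
sentence = 'det var en gång en katt som hette nils </s>'.split()
N = sum(unigrams.values())  # corpus size: 1041631 in the table above

prob = 1.0
for word in sentence:
    prob *= unigrams[word] / N  # assumes every word occurs in the corpus

# Geometric mean probability, entropy rate, and perplexity as in the table
geo_mean = prob ** (1 / len(sentence))
entropy_rate = -math.log2(prob) / len(sentence)
perplexity = 2 ** entropy_rate
print(prob, geo_mean, entropy_rate, perplexity)
</pre>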
<h2>Reading</h2>
<p>As an application of n-grams, execute the Jupyter notebook by Peter Norvig <a
href="http://nbviewer.jupyter.org/url/norvig.com/ipython/How%20to%20Do%20Things%20with%20Words.ipynb">
here</a>. Just run all the cells and be sure that you understand the code.
You will find the data <a href="http://norvig.com/ngrams/">here</a>.</p>
<p>In your report, you will also describe one experiment with a long string of words
    you will create yourself or copy from a text you like. You will remove all the punctuation and
    white space from this string and set it in lowercase letters.</p>
<p>You will just add a cell at the end of Sect. 7 in Norvig's notebook, where you will use your string and
run the notebook cell with the <tt>segment()</tt> and <tt>segment2()</tt> functions. </p>
<p>You will comment on the segmentation results you obtain with the unigram and bigram models.
</p>
<h2>Complement</h2>
<p>As a complement, you can read a paper by <a
href="http://researcher.watson.ibm.com/researcher/view.php?person=us-kwchurch">Church
</a> and Hanks, <a href="http://www.aclweb.org/anthology/J/J90/J90-1003.pdf">Word Association Norms, Mutual
Information, and Lexicography</a>, Computational Linguistics, 16(1):22-29, 1990, as well as another one on
backoff by Brants et al. (2007) <a href="http://www.aclweb.org/anthology/D07-1090.pdf">Large language models
in machine translation</a>.
</p>-->
</body>
</html>