%\documentclass[mathserif]{beamer}
\documentclass[handout]{beamer}
%\usetheme{Goettingen}
%\usetheme{Warsaw}
\usetheme{Singapore}
%\usetheme{Frankfurt}
%\usetheme{Copenhagen}
%\usetheme{Szeged}
%\usetheme{Montpellier}
%\usetheme{CambridgeUS}
%\usecolortheme{}
%\setbeamercovered{transparent}
\usepackage[english, activeacute]{babel}
\usepackage[utf8]{inputenc}
\usepackage{amsmath, amssymb}
\usepackage{dsfont}
\usepackage{graphics}
\usepackage{cases}
\usepackage{graphicx}
\usepackage{pgf}
\usepackage{epsfig}
\usepackage{amssymb}
\usepackage{multirow}
\usepackage{amstext}
\usepackage[ruled,vlined,lined]{algorithm2e}
\usepackage{amsmath}
\usepackage{epic}
\usepackage{epsfig}
\usepackage{fontenc}
\usepackage{framed,color}
\usepackage{palatino, url, multicol}
%\algsetup{indent=2em}
\newcommand{\factorial}{\ensuremath{\mbox{\sc Factorial}}}
\newcommand{\BIGOP}[1]{\mathop{\mathchoice%
{\raise-0.22em\hbox{\huge $#1$}}%
{\raise-0.05em\hbox{\Large $#1$}}{\hbox{\large $#1$}}{#1}}}
\newcommand{\bigtimes}{\BIGOP{\times}}
\vspace{-0.5cm}
\title{Natural Language Processing \\ Recurrent Neural Networks}
\vspace{-0.5cm}
\author[Felipe Bravo Márquez]{\footnotesize
%\author{\footnotesize
\textcolor[rgb]{0.00,0.00,1.00}{Felipe Bravo-Marquez}}
\date{\today}
\begin{document}
\begin{frame}
\titlepage
\end{frame}
\begin{frame}{Recurrent Neural Networks}
\begin{scriptsize}
\begin{itemize}
\item While representations derived from convolutional networks offer some sensitivity to word order, their order sensitivity is restricted to mostly local patterns, and disregards the order of patterns that are far apart in the sequence.
\item Recurrent neural networks (RNNs) allow representing arbitrarily sized sequential inputs in fixed-size vectors, while paying attention to the structured properties of the inputs \cite{goldberg2016primer}.
\item RNNs, particularly ones with gated architectures such as the LSTM and the GRU, are very powerful at capturing statistical regularities in sequential inputs.
\end{itemize}
\end{scriptsize}
\end{frame}
\begin{frame}{The RNN Abstraction}
\begin{scriptsize}
\begin{itemize}
\item We use $\vec{x}_{i:j}$ to denote the sequence of vectors $\vec{x}_i, \dots, \vec{x}_j$.
\item At a high level, the RNN is a function that takes as input an arbitrarily long ordered sequence of $n$ $d_{in}$-dimensional vectors $\vec{x}_{1:n}=\vec{x}_1,\vec{x}_2, \dots, \vec{x}_n$ ($\vec{x}_i \in \mathcal{R}^{d_{in}}$) and returns as output a single $d_{out}$-dimensional vector $\vec{y}_n \in \mathcal{R}^{d_{out}}$:
\begin{equation}
\begin{split}
\vec{y}_n & = RNN(\vec{x}_{1:n}) \\
\vec{x}_i \in \mathcal{R}^{d_{in}} & \quad \vec{y}_n \in \mathcal{R}^{d_{out}}
\end{split}
\end{equation}
\item This implicitly defines an output vector $\vec{y}_i$ for each prefix $\vec{x}_{1:i}$ of the sequence $\vec{x}_{1:n}$.
\item We denote by $RNN^{*}$ the function returning this sequence:
\begin{equation}
\begin{split}
\vec{y}_{1:n} & = RNN^{*}(\vec{x}_{1:n}) \\
\vec{y}_i & = RNN(\vec{x}_{1:i}) \\
\vec{x}_i \in \mathcal{R}^{d_{in}} & \quad \vec{y}_i \in \mathcal{R}^{d_{out}}
\end{split}
\end{equation}
\end{itemize}
\end{scriptsize}
\end{frame}
\begin{frame}{The RNN Abstraction}
\begin{scriptsize}
\begin{itemize}
\item The output vector $\vec{y}_n$ is then used for further prediction.
\item For example, a model for predicting the conditional probability of an event $e$ given the sequence $\vec{x}_{1:n}$ can be defined as the $j$-th element in the output vector resulting from the softmax operation over a linear transformation of the RNN encoding:
\begin{displaymath}
p(e = j|\vec{x}_{1:n}) = \text{softmax}(RNN(\vec{x}_{1:n})\cdot W +\vec{b})_{[j]}
\end{displaymath}
\item The RNN function provides a framework for conditioning on the entire history without resorting to the Markov assumption which is traditionally used for modeling sequences.
\end{itemize}
\end{scriptsize}
\end{frame}
\begin{frame}{The RNN Abstraction}
\begin{scriptsize}
\begin{itemize}
\item The RNN is defined recursively, by means of a function $R$ taking as input a state vector $\vec{s}_{i-1}$ and an input vector $\vec{x}_{i}$ and returning a new state vector $\vec{s}_i$.
\item The state vector $\vec{s}_i$ is then mapped to an output vector $\vec{y}_i$ using a simple deterministic function $O(\cdot)$.
\item The base of the recursion is an initial state vector, $\vec{s}_{0}$ , which is also an input to the RNN.
\item For brevity, we often omit the initial state vector $\vec{s}_{0}$, or assume it is the zero vector.
\item When constructing an RNN, much like when constructing a feed-forward network, one has to specify the dimension of the inputs $\vec{x}_i$ as well as the dimensions of the outputs $\vec{y}_i$.
\end{itemize}
\end{scriptsize}
\end{frame}
\begin{frame}{The RNN Abstraction}
\begin{scriptsize}
\begin{equation}
\begin{split}
RNN^{*}(\vec{x}_{1:n};\vec{s}_0) & = \vec{y}_{1:n} \\
\vec{y}_i & = O(\vec{s}_i) \\
\vec{s}_i & = R(\vec{s}_{i-1},\vec{x}_i) \\
\vec{x}_i \in \mathcal{R}^{d_{in}}, & \quad \vec{y}_i \in \mathcal{R}^{d_{out}}, \quad \vec{s}_i \in \mathcal{R}^{f(d_{out})}
\end{split}
\end{equation}
\begin{itemize}
\item The functions $R$ and $O$ are the same across the sequence positions.
\item The RNN keeps track of the states of computation through the state vector $\vec{s}_i$ that is kept and being passed across invocations of $R$.
\end{itemize}
\end{scriptsize}
\end{frame}
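\begin{frame}[fragile]{The RNN Abstraction: A Sketch in Code}
\begin{scriptsize}
A minimal Python sketch of this abstraction (illustrative only; the functions \texttt{R} and \texttt{O} are placeholders that a concrete RNN instantiation must define):
\begin{verbatim}
def rnn_star(R, O, x_seq, s0):
    """Return the outputs y_1, ..., y_n for the inputs x_1, ..., x_n."""
    s = s0
    ys = []
    for x in x_seq:      # the same R and O are applied at every position
        s = R(s, x)      # new state from the previous state and the input
        ys.append(O(s))  # the output is a deterministic function of the state
    return ys

# RNN(x_{1:n}) is simply the last element of rnn_star(R, O, x_seq, s0)
\end{verbatim}
\end{scriptsize}
\end{frame}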
\begin{frame}{The RNN Abstraction}
\begin{scriptsize}
\begin{figure}[h]
\includegraphics[scale = 0.4]{pics/RNN.png}
\end{figure}
\begin{itemize}
\item This presentation follows the recursive definition, and is correct for arbitrarily long sequences.
\end{itemize}
\end{scriptsize}
\end{frame}
\begin{frame}{The RNN Abstraction}
\begin{scriptsize}
\begin{itemize}
\item For a finite sized input sequence (and all input sequences we deal with are finite) one can unroll the recursion.
\begin{figure}[h]
\includegraphics[scale = 0.35]{pics/RNN-unrolled.png}
\end{figure}
\item The parameters $\theta$ highlight the fact that the same parameters are shared across all time steps.
\item Different instantiations of $R$ and $O$ will result in different network structures.
\end{itemize}
\end{scriptsize}
\end{frame}
\begin{frame}{The RNN Abstraction}
\begin{scriptsize}
\begin{itemize}
\item We note that the value of $\vec{s}_i$ (and hence $\vec{y}_i$) is based on the entire input $\vec{x}_1,\dots, \vec{x}_i$.
\item For example, by expanding the recursion for $i = 4$ we get:
\begin{figure}[h]
\includegraphics[scale = 0.35]{pics/RNN-recursion.png}
\end{figure}
\item Thus, $\vec{s}_n$ and $\vec{y}_n$ can be thought of as encoding the entire input sequence.
\item The job of the network training is to set the parameters of $R$ and $O$ such that the state conveys useful information for the task we are trying to solve.
\end{itemize}
\end{scriptsize}
\end{frame}
\begin{frame}{Elman Network or Simple-RNN}
\begin{scriptsize}
\begin{itemize}
\item After describing the RNN abstraction, we are now in place to discuss the simplest instantiations of it.
\item Recall that we are interested in a recursive function $\vec{s}_i = R(\vec{x}_i, \vec{s}_{i-1})$ such that $\vec{s}_i$ encodes the sequence $\vec{x}_{1:n}$.
\item The simplest RNN formulation is known as an Elman Network or Simple-RNN (S-RNN).
\end{itemize}
\end{scriptsize}
\end{frame}
\begin{frame}{Elman Network or Simple-RNN}
\begin{scriptsize}
\begin{equation}
\begin{split}
\vec{s}_i & = R_{SRNN}(\vec{x}_{i},\vec{s}_{i-1}) = g(\vec{s}_{i-1}W^{s}+\vec{x}_{i}W^{x}+\vec{b}) \\
\vec{y}_i & = O_{SRNN}(\vec{s}_i) = \vec{s}_i \\
\vec{s}_i, \vec{y}_i \in \mathcal{R}^{d_{s}}, & \quad \vec{x}_i \in \mathcal{R}^{d_{x}}, \quad W^{x} \in \mathcal{R}^{d_{x}\times d_{s}}, \quad W^{s} \in \mathcal{R}^{d_{s}\times d_{s}}, \vec{b} \in \mathcal{R}^{d_{s}}
\end{split}
\end{equation}
\begin{itemize}
\item The state $\vec{s}_i$ and the input $\vec{x}_i$ are each linearly transformed.
\item The results are added (together with a bias term) and then passed through a nonlinear activation function $g$ (commonly tanh or ReLU).
\item The Simple RNN provides strong results for sequence tagging as well as language modeling.
\end{itemize}
\end{scriptsize}
\end{frame}
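\begin{frame}[fragile]{Elman Network or Simple-RNN: Code Sketch}
\begin{scriptsize}
A NumPy sketch of the S-RNN update above (a toy illustration; the dimensions and the random weight values are arbitrary):
\begin{verbatim}
import numpy as np

d_x, d_s = 4, 3
rng = np.random.default_rng(0)
W_x = rng.normal(size=(d_x, d_s))    # input-to-state weights
W_s = rng.normal(size=(d_s, d_s))    # state-to-state weights
b   = np.zeros(d_s)                  # bias

def R_srnn(s_prev, x):
    # s_i = g(s_{i-1} W^s + x_i W^x + b), with g = tanh
    return np.tanh(s_prev @ W_s + x @ W_x + b)

s = np.zeros(d_s)                    # s_0: the zero vector
for x in rng.normal(size=(5, d_x)):  # a sequence of 5 input vectors
    s = R_srnn(s, x)
y_n = s                              # O(s_n) = s_n encodes the sequence
\end{verbatim}
\end{scriptsize}
\end{frame}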
\begin{frame}{Elman Network or Simple-RNN}
\begin{figure}[h]
\includegraphics[scale = 0.55]{pics/elman.pdf}
\end{figure}
\end{frame}
\begin{frame}{RNN Training}
\begin{scriptsize}
\begin{itemize}
\item An unrolled RNN is just a very deep neural network.
\item The same parameters are shared across many parts of the computation.
\item Additional input is added at various layers.
\item To train an RNN, we create the unrolled computation graph for a given input sequence and add a loss node to the unrolled graph.
\item We then use the backward (backpropagation) algorithm to compute the gradients with respect to that loss.
\end{itemize}
\end{scriptsize}
\end{frame}
\begin{frame}{RNN Training}
\begin{scriptsize}
\begin{itemize}
\item This procedure is referred to in the RNN literature as backpropagation through time (BPTT).
\item The RNN does not do much on its own, but serves as a trainable component in a larger network.
\item The final prediction and loss computation are performed by that larger network, and the error is back-propagated through the RNN.
\item This way, the RNN learns to encode properties of the input sequences that are
useful for the further prediction task.
\item The supervision signal is not applied to the RNN directly, but through the larger network.
\end{itemize}
\end{scriptsize}
\end{frame}
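\begin{frame}[fragile]{RNN Training: Code Sketch}
\begin{scriptsize}
A hedged PyTorch sketch of the procedure: unroll the RNN over a batch of sequences, attach a loss through a larger network, and let autograd backpropagate through time (all sizes and data below are dummy placeholders):
\begin{verbatim}
import torch
import torch.nn as nn

rnn  = nn.RNN(input_size=4, hidden_size=8, batch_first=True)
head = nn.Linear(8, 2)               # the "larger network" on top of the RNN
opt  = torch.optim.SGD(list(rnn.parameters()) + list(head.parameters()),
                       lr=0.1)

x = torch.randn(16, 10, 4)           # 16 dummy sequences of length 10
y = torch.randint(0, 2, (16,))       # dummy labels

out, h_n = rnn(x)                    # unrolled computation graph
logits = head(out[:, -1, :])         # prediction from the final state
loss = nn.functional.cross_entropy(logits, y)
loss.backward()                      # backpropagation through time (BPTT)
opt.step()
\end{verbatim}
\end{scriptsize}
\end{frame}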
\begin{frame}{RNN usage-patterns: Acceptor}
\begin{scriptsize}
\begin{itemize}
\item A common RNN usage-pattern is the acceptor.
\item This pattern is used for text classification.
\item The supervision signal is only based on the final output vector $\vec{y}_n$.
\item The RNN's output vector $\vec{y}_n$ is fed into a fully connected layer or an MLP, which produces a prediction.
\item The error gradients are then backpropagated through the
rest of the sequence.
\item The loss can take any familiar form: cross entropy, hinge, etc.
\end{itemize}
\end{scriptsize}
\end{frame}
\begin{frame}{RNN usage-patterns: Acceptor}
\begin{scriptsize}
\begin{figure}[h]
\includegraphics[scale = 0.3]{pics/acceptor.png}
\caption{Acceptor RNN training graph.}
\end{figure}
\begin{itemize}
\item Example 1: an RNN that reads the characters of a word one by one and then uses the final state to predict the part-of-speech of that word.
\item Example 2: an RNN that reads in a sentence and, based on the final state, decides whether it conveys positive or negative sentiment.
\end{itemize}
\end{scriptsize}
\end{frame}
\begin{frame}{RNN usage-patterns: Transducer}
\begin{scriptsize}
\begin{itemize}
\item Another option is to treat the RNN as a transducer, producing an output $\hat{t}_i$ for each input it reads in.
\item This pattern is very handy for sequence labeling tasks (e.g., POS tagging, NER).
\item We compute a local loss signal (e.g., cross-entropy) $L_{\text{local}}(\hat{t}_{i},{t}_{i})$ for each of the outputs $\hat{t}_{i}$ based on a true label $t_i$.
\item The loss for the unrolled sequence is then $L(\hat{t}_{1:n},{t}_{1:n}) = \sum_{i} L_{\text{local}}(\hat{t}_{i},{t}_{i})$, or another combination such as an average or a weighted average.
\end{itemize}
\end{scriptsize}
\end{frame}
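\begin{frame}[fragile]{RNN usage-patterns: Transducer (Code Sketch)}
\begin{scriptsize}
A PyTorch sketch of the transducer loss under these assumptions: one output per position, a local cross-entropy loss per position, combined by a sum (the tag count, dimensions, and data are made up):
\begin{verbatim}
import torch
import torch.nn as nn

n_tags, d_in, d_h = 5, 4, 8
rnn    = nn.RNN(d_in, d_h, batch_first=True)
tagger = nn.Linear(d_h, n_tags)

x    = torch.randn(1, 7, d_in)              # one sentence of 7 words
gold = torch.randint(0, n_tags, (1, 7))     # a gold tag t_i per position

out, _ = rnn(x)                             # one output vector per position
logits = tagger(out)                        # shape (1, 7, n_tags)
# sum of the local losses L_local(t_hat_i, t_i)
loss = nn.functional.cross_entropy(
    logits.view(-1, n_tags), gold.view(-1), reduction='sum')
\end{verbatim}
\end{scriptsize}
\end{frame}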
\begin{frame}{RNN usage-patterns: Transducer}
\begin{scriptsize}
\begin{figure}[h]
\includegraphics[scale = 0.25]{pics/transducer.png}
\caption{Transducer RNN training graph.}
\end{figure}
\end{scriptsize}
\end{frame}
\begin{frame}{RNN usage-patterns: Transducer}
\begin{scriptsize}
\begin{itemize}
\item Example 1: a sequence tagger (e.g., NER, POS), in which we take $\vec{x}_{1:n}$ to be feature representations for the $n$ words of a sentence, and predict the tag assignment $\hat{t}_i$ of word $i$ based on words $1:i$.
\item Example 2: language modeling, in which the sequence of words $x_{1:i}$ is used to predict a distribution over the $(i+1)$th word.
\item RNN-based language models have been shown to provide vastly better perplexities than traditional language models.
\item Using RNNs as transducers allows us to relax the Markov assumption that is traditionally taken in language models and HMM taggers, and condition on the entire prediction history.
\end{itemize}
\end{scriptsize}
\end{frame}
\begin{frame}{Bidirectional RNNs (biRNN)}
\begin{scriptsize}
\begin{itemize}
\item A useful elaboration of an RNN is a bidirectional RNN (also commonly referred to as biRNN).
\item Consider the task of sequence tagging over a sentence.
\item An RNN allows us to compute a function of the $i$-th word $x_i$ based on the
past words $x_{1:i}$ up to and including it.
\item However, the following words $x_{i+1:n}$ may also be useful for prediction.
\item The biRNN allows us to look arbitrarily far at both the past and the future within the sequence.
\end{itemize}
\end{scriptsize}
\end{frame}
\begin{frame}{Bidirectional RNNs (biRNN)}
\begin{scriptsize}
\begin{itemize}
\item Consider an input sequence $\vec{x}_{1:n}$ .
\item The biRNN works by maintaining two separate states, $s_{i}^{f}$ and $s_{i}^{b}$ for each input position $i$.
\item The forward state $s_{i}^{f}$ is based on $\vec{x}_1, \vec{x}_2, \dots ,\vec{x}_i$, while the backward state $s_{i}^{b}$ is based on $\vec{x}_n, \vec{x}_{n-1}, \dots ,\vec{x}_i$.
\item The forward and backward states are generated
by two different RNNs.
\item The first RNN $(R^f, O^f)$ is fed the input sequence $\vec{x}_{1:n}$ as is, while the second RNN $(R^b, O^b)$ is fed the input sequence in reverse.
\item The state representation $\vec{s}_i$ is then composed of both the forward and backward states.
\end{itemize}
\end{scriptsize}
\end{frame}
\begin{frame}{Bidirectional RNNs (biRNN)}
\begin{scriptsize}
\begin{itemize}
\item The output at position $i$ is based on the concatenation of the two output vectors:
\begin{displaymath}
\vec{y}_i = [\vec{y}_{i}^{f};\vec{y}_{i}^{b}]=[O^{f}(s_{i}^{f});O^{b}(s_{i}^{b})]
\end{displaymath}
\item The output takes into account both the past and the future.
\item The biRNN encoding of the $i$th word in a sequence is the concatenation of two RNNs, one reading the sequence from the beginning, and the other reading it from the end.
\item We define $biRNN(\vec{x}_{1:n}, i)$ to be the output vector corresponding to the $i$th sequence position:
\begin{displaymath}
biRNN(\vec{x}_{1:n}, i) = \vec{y}_i = [RNN^{f}(\vec{x}_{1:i});RNN^{b}(\vec{x}_{n:i})]
\end{displaymath}
\end{itemize}
\end{scriptsize}
\end{frame}
\begin{frame}{Bidirectional RNNs (biRNN)}
\begin{scriptsize}
\begin{itemize}
\item The vector $\vec{y}_i$ can then be used directly for prediction, or fed as part of the input to a more complex network.
\item While the two RNNs are run independently of each other, the error gradients at position $i$ will flow both forward and backward through the two RNNs.
\item Feeding the vector $\vec{y}_i$ through an MLP prior to prediction will further mix the forward and backward signals.
\begin{figure}[h]
\includegraphics[scale = 0.35]{pics/biRNN.png}
\end{figure}
\item Note how the vector $\vec{y}_4$ , corresponding to the word \textbf{jumped}, encodes an infinite window around (and including) the focus vector $\vec{x}_{jumped}$.
\end{itemize}
\end{scriptsize}
\end{frame}
\begin{frame}{Bidirectional RNNs (biRNN)}
\begin{scriptsize}
\begin{itemize}
\item Similarly to the $RNN$ case, we also define $biRNN^{*}(\vec{x}_{1:n})$ as the sequence of vectors $\vec{y}_{1:n}$:
\begin{displaymath}
biRNN^{*}(\vec{x}_{1:n})= \vec{y}_{1:n} = biRNN(\vec{x}_{1:n},1),\dots,biRNN(\vec{x}_{1:n},n)
\end{displaymath}
\item The $n$ output vectors $\vec{y}_{1:n}$ can be efficiently computed in linear time by first running the forward and backward RNNs, and then concatenating the relevant outputs.
\end{itemize}
\end{scriptsize}
\end{frame}
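\begin{frame}[fragile]{Bidirectional RNNs (biRNN): Code Sketch}
\begin{scriptsize}
A minimal sketch of the biRNN$^{*}$ computation, assuming two independent update functions \texttt{R\_f} and \texttt{R\_b} (e.g., two Elman cells as in the earlier sketch) whose states also serve as outputs:
\begin{verbatim}
import numpy as np

def run_rnn(R, x_seq, s0):
    s, states = s0, []
    for x in x_seq:
        s = R(s, x)
        states.append(s)
    return states                        # one state (= output) per position

def birnn_star(R_f, R_b, x_seq, s0):
    ys_f = run_rnn(R_f, x_seq, s0)       # forward RNN reads x_1 ... x_n
    ys_b = run_rnn(R_b, x_seq[::-1], s0) # backward RNN reads x_n ... x_1
    ys_b = ys_b[::-1]                    # align backward outputs to positions
    # y_i = [y_i^f ; y_i^b]: concatenation of forward and backward outputs
    return [np.concatenate([f, b]) for f, b in zip(ys_f, ys_b)]
\end{verbatim}
\end{scriptsize}
\end{frame}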
\begin{frame}{Bidirectional RNNs (biRNN)}
\begin{scriptsize}
\begin{figure}[h]
\includegraphics[scale = 0.35]{pics/biRNNstar.png}
\end{figure}
\begin{itemize}
\item The biRNN is very effective for tagging tasks, in which each input vector corresponds to one output vector.
\item It is also useful as a general-purpose trainable feature-extracting component, that can be used whenever a window around a given word is required.
\end{itemize}
\end{scriptsize}
\end{frame}
\begin{frame}{Multi-layer (Stacked) RNNs}
\begin{scriptsize}
\begin{itemize}
\item RNNs can be stacked in layers, forming a grid.
\item Consider $k$ RNNs, $RNN_{1},\dots, RNN_{k}$, where the $j$th RNN has states $\vec{s}_{1:n}^{j}$ and outputs $\vec{y}_{1:n}^{j}$.
\item The inputs to the first RNN are the vectors $\vec{x}_{1:n}$.
\item The inputs to the $j$th RNN ($j\geq 2$) are the outputs of the RNN below it, $\vec{y}_{1:n}^{j-1}$.
\item The output of the entire formation is the output of the last RNN, $\vec{y}_{1:n}^k$.
\item Such layered architectures are often called deep RNNs.
\end{itemize}
\end{scriptsize}
\end{frame}
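\begin{frame}[fragile]{Multi-layer (Stacked) RNNs: Code Sketch}
\begin{scriptsize}
A sketch of the stacking scheme, assuming each layer is given as an RNN$^{*}$-style function that maps a sequence of vectors to a sequence of output vectors:
\begin{verbatim}
def deep_rnn_star(layers, x_seq):
    """layers: [RNN_1, ..., RNN_k], each mapping a sequence to a sequence."""
    seq = x_seq
    for rnn_layer in layers:   # the outputs of layer j-1 feed layer j
        seq = rnn_layer(seq)
    return seq                 # y^k_{1:n}: the outputs of the last RNN
\end{verbatim}
\end{scriptsize}
\end{frame}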
\begin{frame}{Multi-layer (Stacked) RNNs}
\begin{scriptsize}
\begin{figure}[h]
\includegraphics[scale = 0.35]{pics/stackedRNN.png}
\end{figure}
\begin{itemize}
\item It is not theoretically clear what additional power is gained by the deeper architecture.
\item It was observed empirically that deep RNNs work better than shallower ones on some tasks (e.g., machine translation).
\end{itemize}
\end{scriptsize}
\end{frame}
\begin{frame}{Gated Architectures}
\begin{scriptsize}
\begin{itemize}
\item So far we have seen only one instantiation of the RNN: the simple RNN (S-RNN) or Elman network.
\item This model is hard to train effectively because of the \textbf{vanishing gradients} problem.
\item Error signals (gradients) in later steps in the sequence diminish quickly in the backpropagation process.
\item Thus, they do not reach earlier input signals, making it hard for the S-RNN to capture long-range dependencies.
\item Gating-based architectures, such as the LSTM \cite{hochreiter1997long} and the GRU \cite{cho2014learning} are designed to solve this deficiency.
\end{itemize}
\end{scriptsize}
\end{frame}
\begin{frame}{The vanishing gradients problem in Recurrent Neural Networks}
\begin{scriptsize}
\begin{itemize}
\item Intuitively, recurrent neural networks can be thought of as very deep feed-forward networks, with shared parameters across different layers.
\item For the Simple-RNN, the gradients then include repeated multiplication of the matrix $W$, making it very likely for the values to vanish or explode.
\item The gating mechanism mitigates this problem to a large extent by getting rid of this repeated multiplication of a single matrix.
\end{itemize}
\end{scriptsize}
\end{frame}
\begin{frame}{Gated Architectures}
\begin{scriptsize}
\begin{itemize}
\item Consider the RNN as a general purpose computing device, where the state $\vec{s}_i$ represents a finite memory.
\item Each application of the function $R$ reads in an input $\vec{x}_{i+1}$ , reads in the current memory $\vec{s}_i$ , operates on them in some way, and writes the result into memory.
\item This results in a new memory state $\vec{s}_{i+1}$.
\item An apparent problem with the S-RNN architecture is that the memory access is not controlled.
\item At each step of the computation, the entire memory state is read, and the entire memory state is written.
\end{itemize}
\end{scriptsize}
\end{frame}
\begin{frame}{Gated Architectures}
\begin{scriptsize}
\begin{itemize}
\item How does one provide more controlled memory access?
\item Consider a binary vector $\vec{g} \in \{0,1\}^n$.
\item Such a vector can act as a \textbf{gate} for controlling access to $n$-dimensional vectors, using the Hadamard product operation $\vec{x} \odot \vec{g}$.
\item The Hadamard product is simply the element-wise multiplication of two vectors:
\begin{displaymath}
\vec{x} = \vec{u} \odot \vec{v} \Leftrightarrow \vec{x}_{[i]} = \vec{u}_{[i]} \cdot \vec{v}_{[i]} \quad \forall i \in [1,n]
\end{displaymath}
\end{itemize}
\end{scriptsize}
\end{frame}
\begin{frame}{Gated Architectures}
\begin{scriptsize}
\begin{itemize}
\item Consider a memory $\vec{s} \in \mathcal{R}^{d}$, an input $\vec{x} \in \mathcal{R}^{d}$ and a gate $\vec{g} \in \{0,1\}^{d}$.
\item The computation
\begin{displaymath}
\vec{s}' \leftarrow \vec{g} \odot \vec{x} + (\vec{1}-\vec{g}) \odot \vec{s}
\end{displaymath}
reads the entries in $\vec{x}$ that correspond to the 1 values in $\vec{g}$, and writes them to the new memory $\vec{s}'$.
\item The remaining locations are copied from the old memory $\vec{s}$ to the new memory $\vec{s}'$ through the complementary gate $(\vec{1}-\vec{g})$.
\end{itemize}
\end{scriptsize}
\end{frame}
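\begin{frame}[fragile]{Gated Architectures: Hard Gating in Code}
\begin{scriptsize}
A NumPy illustration of this controlled read/write with a hard binary gate (the values are chosen arbitrarily for the example):
\begin{verbatim}
import numpy as np

s = np.array([10., 20., 30., 40.])   # current memory s
x = np.array([ 1.,  2.,  3.,  4.])   # new input x
g = np.array([ 1.,  0.,  0.,  1.])   # binary gate g

s_new = g * x + (1 - g) * s          # element-wise (Hadamard) products
print(s_new)                         # [ 1. 20. 30.  4.]
\end{verbatim}
\end{scriptsize}
\end{frame}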
\begin{frame}{Gated Architectures}
\begin{scriptsize}
\begin{figure}[h]
\includegraphics[scale = 0.35]{pics/gatedEx.png}
\end{figure}
\begin{itemize}
\item This gating mechanism can serve as a building block in our RNN.
\item Gate vectors can be used to control access to the memory state $\vec{s}_i$.
\item We are still missing two important (and related) components:
\begin{enumerate}
\begin{scriptsize}
\item The gates should not be static, but be controlled by the
current memory state and the input.
\item Their behavior should be learned.
\end{scriptsize}
\end{enumerate}
\item This introduces an obstacle, as learning in our framework requires differentiability (because of the backpropagation algorithm).
\item The binary 0-1 values used in the gates are not differentiable.
\end{itemize}
\end{scriptsize}
\end{frame}
\begin{frame}{Gated Architectures}
\begin{scriptsize}
\begin{itemize}
\item A solution to this problem is to approximate the hard gating mechanism with a soft—but differentiable—gating mechanism.
\item To achieve these differentiable gates, we drop the requirement that $\vec{g} \in \{0,1\}^n$ and allow arbitrary real numbers, $\vec{g}' \in \mathcal{R}^n$.
\item These real numbers are then passed through a sigmoid function $\sigma(\vec{g}')$.
\item This bounds the value in the range $(0,1)$, with most values
near the borders.
\end{itemize}
\end{scriptsize}
\end{frame}
\begin{frame}{Gated Architectures}
\begin{scriptsize}
\begin{itemize}
\item When using the gate $\sigma(\vec{g}')\odot \vec{x}$, indices in $\vec{x}$ corresponding to near-one values in $\sigma(\vec{g}')$ are allowed to pass, while those corresponding to near-zero values are blocked.
\item The gate values can then be conditioned on the input and the current memory, and trained using a gradient-based method to perform a desired behavior.
\item This controllable gating mechanism is the basis of the LSTM and the GRU architectures.
\item At each time step, differentiable gating mechanisms decide which parts of the inputs will be written to memory and which parts of memory will be overwritten (forgotten).
\end{itemize}
\end{scriptsize}
\end{frame}
\begin{frame}{LSTM}
\begin{scriptsize}
\begin{itemize}
\item The Long Short-Term Memory (LSTM) architecture \cite{hochreiter1997long} was designed to solve the vanishing gradients problem.
\item It was the first architecture to introduce the gating mechanism.
\item The LSTM architecture explicitly splits the state vector $\vec{s}_i$ into two halves: 1) memory cells and 2) working memory.
\item The memory cells are designed to preserve the memory, and also the error gradients, across time, and are controlled through differentiable gating components\footnote{Smooth mathematical functions that simulate logical gates.}.
\item At each input state, a gate is used to decide how much of the new input should be written to the memory cell, and how much of the current content of the memory cell should be forgotten.
\end{itemize}
\end{scriptsize}
\end{frame}
\begin{frame}{LSTM}
\begin{scriptsize}
\begin{itemize}
\item Mathematically, the LSTM architecture is defined as:
\begin{figure}[h]
\includegraphics[scale = 0.35]{pics/LSTMform.png}
\end{figure}
\end{itemize}
\end{scriptsize}
\end{frame}
\begin{frame}{LSTM}
\begin{scriptsize}
\begin{itemize}
\item The state at time $j$ is composed of two vectors, $\vec{c}_j$ and $\vec{h}_{j}$, where $\vec{c}_j$ is the memory component and $\vec{h}_j$ is the hidden state component.
\item There are three gates, $\vec{i}$ , $\vec{f}$ , and $\vec{o}$, controlling for \textbf{i}nput, \textbf{f}orget, and \textbf{o}utput.
\item The gate values are computed based on linear combinations of the current input $\vec{x}_j$ and the previous state $\vec{h}_{j-1}$, passed through a sigmoid activation function.
\item An update candidate $\vec{z}$ is computed as a linear combination of $\vec{x}_j$ and $\vec{h}_{j-1}$ , passed through a tanh activation function (to push the values to be between -1 and 1).
\item The memory $\vec{c}_j$ is then updated: the forget gate controls how much of the previous memory to keep ($\vec{f} \odot \vec{c}_{j-1}$), and the input gate controls how much of the proposed update to keep ($\vec{i} \odot \vec{z}$).
\item Finally, the value of $\vec{h}_j$ (which is also the output $\vec{y}_j$ ) is determined based on the content of the memory $\vec{c}_j$ , passed through a tanh nonlinearity and controlled by the output gate.
\item The gating mechanisms allow for gradients related to the memory part $\vec{c}_j$ to stay high across very long time ranges.
\end{itemize}
\end{scriptsize}
\end{frame}
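\begin{frame}[fragile]{LSTM: Code Sketch}
\begin{scriptsize}
A NumPy sketch of a single LSTM step following the description above (biases are omitted for brevity; the dimensions and random weights are purely illustrative):
\begin{verbatim}
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

d_x, d_h = 4, 3
rng = np.random.default_rng(0)
# one input matrix and one recurrent matrix per gate / update candidate
Wx = {k: rng.normal(size=(d_x, d_h)) for k in "ifoz"}
Wh = {k: rng.normal(size=(d_h, d_h)) for k in "ifoz"}

def lstm_step(c_prev, h_prev, x):
    i = sigmoid(x @ Wx['i'] + h_prev @ Wh['i'])   # input gate
    f = sigmoid(x @ Wx['f'] + h_prev @ Wh['f'])   # forget gate
    o = sigmoid(x @ Wx['o'] + h_prev @ Wh['o'])   # output gate
    z = np.tanh(x @ Wx['z'] + h_prev @ Wh['z'])   # update candidate
    c = f * c_prev + i * z                        # new memory cell
    h = o * np.tanh(c)                            # new hidden state (= output)
    return c, h
\end{verbatim}
\end{scriptsize}
\end{frame}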
\begin{frame}{LSTM}
\begin{figure}[h]
\includegraphics[scale = 0.35]{pics/LSTMchain.png}
\end{figure}
{\scriptsize Source: \url{http://colah.github.io/posts/2015-08-Understanding-LSTMs/}}
\end{frame}
\begin{frame}{LSTM}
\begin{scriptsize}
\begin{itemize}
\item Intuitively, recurrent neural networks can be thought of as very deep feed-forward networks, with shared parameters across different layers.
\item For the Simple-RNN, the gradients then include repeated multiplication of the matrix $W$.
\item This causes the gradient values to vanish or explode.
\item The gating mechanism mitigates this problem to a large extent by getting rid of this repeated multiplication of a single matrix.
\item LSTMs are currently the most successful type of RNN architecture, and they are responsible for many state-of-the-art sequence modeling results.
\item The main competitor of the LSTM RNN is the GRU, to be discussed next.
\end{itemize}
\end{scriptsize}
\end{frame}
\begin{frame}{GRU}
\begin{scriptsize}
\begin{itemize}
\item The LSTM architecture is very effective, but also quite complicated.
\item The complexity of the system makes it hard to analyze, and also computationally expensive to work with.
\item The gated recurrent unit (GRU) was introduced in \cite{cho2014learning} as an alternative to the LSTM.
\item It was subsequently shown in \cite{chung2014empirical} to perform comparably to the LSTM on several (non textual) datasets.
\item Like the LSTM, the GRU is also based on a gating mechanism, but with substantially fewer gates and without a separate memory component.
\end{itemize}
\end{scriptsize}
\end{frame}
\begin{frame}{GRU}
\begin{figure}[h]
\includegraphics[scale = 0.35]{pics/GRU.png}
\end{figure}
\end{frame}
\begin{frame}{GRU}
\begin{scriptsize}
\begin{itemize}
\item One gate $\vec{r}$ is used to control access to the previous state $\vec{s}_{j-1}$ and compute a proposed update $\vec{\widetilde{s}}_j$.
\item The updated state $\vec{s}_j$ (which also serves as the output $\vec{y}_j$) is then determined based on an interpolation of the previous state $\vec{s}_{j-1}$ and the proposal $\vec{\widetilde{s}}_j$.
\item The proportions of the interpolation are controlled using the gate $\vec{z}$.
\item The GRU was shown to be effective in language modeling and machine translation.
\item However, the jury is still out between the GRU, the LSTM and possible alternative RNN architectures, and the subject is actively researched.
\item For an empirical exploration of the GRU and the LSTM architectures, see \cite{jozefowicz2015empirical}.
\end{itemize}
\end{scriptsize}
\end{frame}
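\begin{frame}[fragile]{GRU: Code Sketch}
\begin{scriptsize}
An analogous NumPy sketch of a single GRU step (following the common formulation with an update gate $\vec{z}$ and a reset gate $\vec{r}$; biases omitted, weights random and illustrative):
\begin{verbatim}
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

d_x, d_s = 4, 3
rng = np.random.default_rng(1)
Wx = {k: rng.normal(size=(d_x, d_s)) for k in "zrs"}
Ws = {k: rng.normal(size=(d_s, d_s)) for k in "zrs"}

def gru_step(s_prev, x):
    z = sigmoid(x @ Wx['z'] + s_prev @ Ws['z'])              # update gate
    r = sigmoid(x @ Wx['r'] + s_prev @ Ws['r'])              # reset gate
    s_tilde = np.tanh(x @ Wx['s'] + (r * s_prev) @ Ws['s'])  # proposal
    s = (1 - z) * s_prev + z * s_tilde    # interpolation controlled by z
    return s                              # s_j also serves as the output y_j
\end{verbatim}
\end{scriptsize}
\end{frame}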
\begin{frame}{Sentiment Classification with RNNs}
\begin{scriptsize}
\begin{itemize}
\item The simplest use of RNNs is as acceptors: read in an input sequence, and produce a binary or multi-class answer at the end.
\item RNNs are very strong sequence learners, and can pick up on very intricate patterns in the data.
\item An example of naturally occurring positive and negative sentences in the movie-reviews domain would be the following:
\item Positive: It's not life-affirming—it’s vulgar and mean, but I liked it.
\item Negative: It's disappointing that it only manages to be decent instead of dead brilliant.
\end{itemize}
\end{scriptsize}
\end{frame}
\begin{frame}{Sentiment Classification with RNNs}
\begin{scriptsize}
\begin{itemize}
\item Note that the positive example contains some negative phrases (not life affirming, vulgar, and mean).
\item While the negative example contains some positive ones (dead brilliant).
\item Correctly predicting the sentiment requires understanding not only the individual phrases but also the context in which they occur, linguistic constructs such as negation, and the overall structure of the sentence.
\end{itemize}
\end{scriptsize}
\end{frame}
\begin{frame}{Sentiment Classification with RNNs}
\begin{scriptsize}
\begin{itemize}
\item The sentence-level sentiment classification task is modelled using an RNN-acceptor.
\item After tokenization, the RNN reads in the words of the sentence one at a time.
\item The final RNN state is then fed into an MLP followed by a softmax-layer with two outputs (positive and negative).
\item The network is trained with cross-entropy loss based on the gold sentiment labels.
\begin{equation}
\begin{split}
p(label = k | \vec{w}_{1:n}) & = \hat{\vec{y}}_{[k]} \\
\hat{\vec{y}} & = softmax(MLP(RNN(\vec{x}_{1:n}))) \\
\vec{x}_{1:n} & = E_{[w_{1}]}, \dots, E_{[w_{n}]}
\end{split}
\end{equation}
\end{itemize}
\end{scriptsize}
\end{frame}
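\begin{frame}[fragile]{Sentiment Classification with RNNs: Code Sketch}
\begin{scriptsize}
A hedged PyTorch sketch of this acceptor model, using an LSTM as the RNN (the vocabulary size, dimensions, and data below are placeholders):
\begin{verbatim}
import torch
import torch.nn as nn

class SentimentRNN(nn.Module):
    def __init__(self, vocab_size=10000, d_emb=100, d_h=128, n_classes=2):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, d_emb)   # embedding matrix E
        self.rnn = nn.LSTM(d_emb, d_h, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(d_h, 64), nn.Tanh(),
                                 nn.Linear(64, n_classes))

    def forward(self, word_ids):          # word_ids: (batch, n)
        x = self.emb(word_ids)            # x_{1:n} = E[w_1], ..., E[w_n]
        _, (h_n, _) = self.rnn(x)         # final RNN state
        return self.mlp(h_n[-1])          # logits; softmax applied in the loss

model  = SentimentRNN()
words  = torch.randint(0, 10000, (8, 20))  # 8 dummy sentences, 20 tokens each
labels = torch.randint(0, 2, (8,))
loss = nn.functional.cross_entropy(model(words), labels)  # cross-entropy loss
\end{verbatim}
\end{scriptsize}
\end{frame}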
\begin{frame}{Sentiment Classification with RNNs}
\begin{scriptsize}
\begin{itemize}
\item The word embeddings matrix $E$ is initialized using pre-trained embeddings learned over a large external corpus with an algorithm such as word2vec or GloVe with a relatively wide window.
\item It is often helpful to extend the model by considering bidirectional RNNs.
\item For longer sentences, \cite{li2015tree} found it useful to use a hierarchical architecture, in which the sentence is split into smaller spans based on punctuation.
\item Then, each span is fed into a bidirectional RNN.
\item The sequence of resulting vectors (one for each span) is then fed into an RNN acceptor.
\item A similar hierarchical architecture was used for document-level sentiment classification in \cite{tang2015document}.
\end{itemize}
\end{scriptsize}
\end{frame}
% https://arxiv.org/abs/1606.01781
\begin{frame}{Twitter Sentiment Classification with LSTMs and Emojis}
\begin{scriptsize}
\begin{itemize}
\item An emoji-based distant supervision model for detecting sentiment and other affective states from short social media messages was proposed in \cite{FelboMSRL17}.
\item Emojis are used as a source of distant supervision for various affect detection tasks (e.g., emotion, sentiment, sarcasm) over a large corpus of 634M tweets with 64 emojis.
\item A neural network architecture is pretrained with this corpus.
\item The network is an LSTM variant formed by an embedding layer and 2 bidirectional LSTM layers, with normal skip connections and temporal average-pooling skip connections.
\end{itemize}
\end{scriptsize}
\end{frame}
\begin{frame}{DeepEmoji}
\begin{figure}[h]
\includegraphics[scale = 0.45]{pics/deepEmoji1.png}
\end{figure}
\end{frame}
\begin{frame}{DeepEmoji}
\begin{figure}[h]
\includegraphics[scale = 0.45]{pics/deepEmoji2.png}
\end{figure}
\end{frame}
\begin{frame}{Twitter Sentiment Classification with LSTMs and Emojis}
\begin{scriptsize}
\begin{itemize}
\item The authors propose the chain-thaw transfer-learning approach, in which the pretrained network is fine-tuned for the target task.
\item Here, each layer is individually fine-tuned in each step with the target gold data, and then they are all fine-tuned together.
\item The model achieves state-of-the-art results on the detection of emotion, sentiment, and sarcasm.
\item The pretrained network is released to the public.
\item A demo of the model: \url{https://deepmoji.mit.edu/}.
\end{itemize}
\end{scriptsize}
\end{frame}
\begin{frame}{DeepEmoji}
\begin{figure}[h]
\includegraphics[scale = 0.45]{pics/deepEmoji3.png}
\end{figure}
\end{frame}
\begin{frame}{Bi-LSTM CRF}
\begin{scriptsize}
\begin{itemize}
\item An alternative approach to the transducer for sequence labeling is to combine a BI-LSTM with a Conditional Random Field (CRF).
\item This produces a powerful tagger \cite{huang2015bidirectional} called Bi-LSTM-CRF, in which the LSTM acts as a feature extractor, and the CRF as a special layer that models the transitions from one tag to another.
\item The CRF layer performs a global normalization over all possible tag sequences, which helps to find the optimal sequence.
\item In a CRF we model the conditional probability of the output tag sequence given the input sequence:
\begin{displaymath}
P(s_1, \dots, s_m | x_1, \dots, x_m) = P(s_{1:m}|x_{1:m})
\end{displaymath}
\end{itemize}
\end{scriptsize}
\end{frame}
\begin{frame}{Bi-LSTM CRF}
\begin{scriptsize}
\begin{itemize}
\item We do this by defining a feature map
\begin{displaymath}
\vec{\Phi}(x_{1:m},s_{1:m}) \in \mathcal{R}^d
\end{displaymath}
that maps an entire input sequence $x_{1:m}$ paired with an entire tag sequence $s_{1:m}$ to some $d$-dimensional feature vector.
\item Then we can model the probability as a log-linear model with the parameter vector $\vec{w} \in \mathcal{R}^d$:
\begin{displaymath}
P(s_{1:m}|x_{1:m}; \vec{w}) = \frac{\exp (\vec{w} \cdot \vec{\Phi}(x_{1:m},s_{1:m}))}{\sum_{s'_{1:m} \in S^m}\exp (\vec{w} \cdot \vec{\Phi}(x_{1:m},s'_{1:m}))}
\end{displaymath}
where $s'_{1:m}$ ranges over all possible output sequences.
\item We can view the expression $\vec{w} \cdot \vec{\Phi}(x_{1:m},s_{1:m})= \text{score}_{crf}(x_{1:m},s_{1:m}) $ as a score of how well the tag sequence fits the given input sequence.
\item Recall from the CRF lecture that $\vec{\Phi}(x_{1:m},s_{1:m})$ was created with manually designed features.
\end{itemize}
\end{scriptsize}
\end{frame}
\begin{frame}{Bi-LSTM CRF}
\begin{scriptsize}
\begin{itemize}
\item The idea of the LSTM CRF is to replace $\vec{\Phi}(x_{1:m},s_{1:m})$ with the output of an LSTM transducer (automatically learned features):
\begin{displaymath}
\text{score}_{lstm-crf}(x_{1:m},s_{1:m})= \sum_{i=0}^{m} W_{[s_{i-1},s_i]}\cdot\text{LSTM}^*(x_{1:m})_{[i]} +\vec{b}_{[s_{i-1},s_i]}
\end{displaymath}
where $W_{[s_{i-1},s_i]}$ corresponds to a transition score from tag $s_{i-1}$ to $s_i$, $\vec{b}_{[s_{i-1},s_i]}$ is a bias term for the transition, and $\text{LSTM}^*(x_{1:m})_{[i]}$ comes from the hidden state of the Bi-LSTM at time step $i$.
\item All these parameters are learned jointly.
\item Note that the transition matrix $W$ is position independent.
\item The forward-backward algorithm is used during training and the Viterbi algorithm during decoding.
\item More information can be found in two online tutorials.\footnote{\url{https://pytorch.org/tutorials/beginner/nlp/advanced_tutorial.html}}\footnote{\url{https://www.depends-on-the-definition.com/sequence-tagging-lstm-crf/}}
\end{itemize}
\end{scriptsize}
\end{frame}
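\begin{frame}[fragile]{Bi-LSTM CRF: Code Sketch of the Score}
\begin{scriptsize}
A sketch of the score computation only, assuming per-position emission scores from a Bi-LSTM (via a linear layer) and a learned transition matrix into which the bias term is folded; a full implementation also needs the forward algorithm for the normalizer and Viterbi decoding (the start tag and sizes are illustrative assumptions):
\begin{verbatim}
import torch

n_tags = 5                              # tag 0 plays the role of a START tag
START  = 0
W_trans = torch.randn(n_tags, n_tags)   # W[s_prev, s]: transition scores

def lstm_crf_score(emissions, tags):
    """emissions: (m, n_tags) Bi-LSTM features, one row per position.
       tags:      (m,) a candidate tag sequence."""
    score = torch.tensor(0.0)
    prev = START
    for i, t in enumerate(tags):
        score = score + W_trans[prev, t] + emissions[i, t]
        prev = t
    return score
\end{verbatim}
\end{scriptsize}
\end{frame}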
\begin{frame}
\frametitle{Questions?}
%\vspace{1.5cm}
\begin{center}\LARGE Thanks for your Attention!\\ \end{center}
\end{frame}
\begin{frame}[allowframebreaks]\scriptsize
\frametitle{References}
\bibliography{bio}
\bibliographystyle{apalike}
%\bibliographystyle{flexbib}
\end{frame}
%%%%%%%%%%%%%%%%%%%%%%%%%%%
\end{document}