Lecture 10 _ Recurrent Neural Networks.srt
1
00:00:07,961 --> 00:00:12,153
- Okay. Can everyone hear me?
Okay. Sorry for the delay.
2
00:00:12,153 --> 00:00:24,081
I had a bit of technical difficulty. Today was the first time I was trying to use my new Touch Bar MacBook Pro for presenting,
and none of the adapters are working. So, I had to switch laptops at the last minute. So, thanks. Sorry about that.
3
00:00:25,353 --> 00:00:30,420
So, today is lecture 10. We're talking
about recurrent neural networks.
4
00:00:30,420 --> 00:00:33,003
So, as usual, a
couple of administrative notes.
5
00:00:33,003 --> 00:00:37,353
So, we're working hard
on assignment one grading.
6
00:00:37,353 --> 00:00:46,251
Those grades will probably be out sometime later today. Hopefully,
they can get out before the A2 deadline. That's what I'm hoping for.
7
00:00:46,251 --> 00:00:50,361
On a related note, Assignment
two is due today at 11:59 p.m.
8
00:00:50,361 --> 00:00:56,633
So, who's done with that already?
About half of you guys.
9
00:00:56,633 --> 00:01:03,811
So, if you remember, I did warn you when the assignment went out that
it was quite long, and to start early. So, you were warned about that.
10
00:01:03,811 --> 00:01:06,561
But, hopefully, you guys
have some late days left.
11
00:01:06,561 --> 00:01:10,531
Also, as another reminder,
the midterm will be in class on Tuesday.
12
00:01:10,531 --> 00:01:15,881
If you kind of look around the lecture hall, there are not enough
seats in this room to seat all the enrolled students in the class.
13
00:01:15,881 --> 00:01:20,062
So, we'll actually be having the midterm in
several other lecture halls across campus.
14
00:01:20,062 --> 00:01:26,099
And we'll be sending out some more details on
exactly where to go in the next couple of days.
15
00:01:26,099 --> 00:01:28,179
So, another
bit of announcement.
16
00:01:28,179 --> 00:01:34,950
We've been working on this sort of fun bit of extra credit thing for you to play with
that we're calling the training game. It's this cool browser-based experience,
17
00:01:34,950 --> 00:01:39,927
where you can go in and interactively train neural
networks and tweak the hyperparameters during training.
18
00:01:39,927 --> 00:01:47,646
And this should be a really cool interactive way for you to practice some of these
hyperparameter tuning skills that we've been talking about the last couple of lectures.
19
00:01:47,646 --> 00:01:53,190
So this is not required, but this, I think, will be a really
useful experience to gain a little bit more intuition
20
00:01:53,190 --> 00:01:57,481
into how some of these hyperparameters work
for different types of data sets in practice.
21
00:01:57,481 --> 00:02:05,790
So we're still working on getting all the bugs worked out of this setup, and we'll probably
send out some more instructions on exactly how this will work in the next couple of days.
22
00:02:05,790 --> 00:02:11,008
But again, not required. But please do check it out. I think it'll
be really fun and a really cool thing for you to play with.
23
00:02:11,008 --> 00:02:17,204
And we'll give you a bit of extra credit if you
end up working with this and doing a couple of runs with it.
24
00:02:18,208 --> 00:02:23,458
So, we'll again send out some more details about
this soon once we get all the bugs worked out.
25
00:02:24,720 --> 00:02:28,139
As a reminder, last time we were
talking about CNN Architectures.
26
00:02:28,139 --> 00:02:35,006
We kind of walked through the timeline of some of the various winners of
the ImageNet classification challenge. Kind of the breakthrough result,
27
00:02:35,006 --> 00:02:41,331
as we saw, was the AlexNet architecture in 2012, which was a
nine layer convolutional network. It did amazingly well,
28
00:02:41,331 --> 00:02:48,081
and it sort of kick started this whole deep learning revolution in computer
vision, and kind of brought a lot of these models into the mainstream.
29
00:02:48,081 --> 00:02:56,699
Then we skipped ahead a couple years, and saw that in the 2014 ImageNet challenge, we
had these two really interesting models, VGG and GoogLeNet, which were much deeper.
30
00:02:56,699 --> 00:03:02,930
So VGG was, they had a 16 and a 19 layer model,
and GoogLeNet was, I believe, a 22 layer model.
31
00:03:02,930 --> 00:03:11,230
Although one thing that is kind of interesting about these models is that the
2014 ImageNet challenge was right before batch normalization was invented.
32
00:03:11,230 --> 00:03:18,761
So at this time, before the invention of batch normalization, training these
relatively deep models of roughly twenty layers was very challenging.
33
00:03:18,761 --> 00:03:24,869
So, in fact, both of these two models had to resort to a little
bit of hackery in order to get their deep models to converge.
34
00:03:24,869 --> 00:03:28,579
So for VGG, they had the
16 and 19 layer models,
35
00:03:28,579 --> 00:03:34,107
but actually they first trained an 11 layer model,
because that was what they could get to converge.
36
00:03:34,107 --> 00:03:40,059
And then they added some extra random layers in the middle and
continued training, to actually train the 16 and 19 layer models.
37
00:03:40,059 --> 00:03:46,539
So, managing this training process was very challenging
in 2014 before the invention of batch normalization.
38
00:03:46,539 --> 00:03:52,539
Similarly, for GoogLeNet, we saw that GoogLeNet has these auxiliary
classifiers that were stuck into lower layers of the network.
39
00:03:52,539 --> 00:03:56,539
And these were not really needed
to get good classification performance.
40
00:03:56,539 --> 00:04:03,430
This was just sort of a way to cause extra gradient to be
injected directly into the lower layers of the network.
41
00:04:03,430 --> 00:04:10,411
And this, again, was before the invention of batch normalization,
and now once you have these networks with batch normalization,
42
00:04:10,411 --> 00:04:17,321
then you no longer need these slightly ugly hacks
in order to get these deeper models to converge.
43
00:04:17,321 --> 00:04:24,350
Then in the 2015 ImageNet challenge, we also saw this
really cool model called ResNet, these residual networks
44
00:04:24,350 --> 00:04:28,310
that now have these shortcut connections that
actually have these little residual blocks
45
00:04:28,310 --> 00:04:39,110
where we're going to take our input, pass it through the residual block, and then
add our input to the block back onto the output from these convolutional layers.
46
00:04:39,110 --> 00:04:43,308
This is kind of a funny architecture, but
it actually has two really nice properties.
47
00:04:43,308 --> 00:04:49,531
One is that if we just set all the weights in this residual
block to zero, then this block is computing the identity.
48
00:04:49,531 --> 00:04:55,681
So in some way, it's relatively easy for this model
to learn not to use the layers that it doesn't need.
49
00:04:55,681 --> 00:05:02,171
In addition, it kind of adds this interpretation to L2
regularization in the context of these neural networks,
50
00:05:02,171 --> 00:05:08,321
because once you put L2 regularization, remember, on the weights
of your network, that's going to drive all the parameters towards zero.
51
00:05:08,321 --> 00:05:12,739
And maybe in your standard convolutional architecture, driving all the
parameters towards zero doesn't really make sense.
52
00:05:12,739 --> 00:05:20,510
But in the context of a residual network, if you drive all the parameters towards
zero, that's kind of encouraging the model to not use layers that it doesn't need,
53
00:05:20,510 --> 00:05:26,310
because it will just drive those residual blocks towards
the identity if they're not needed for classification.
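To make that identity property concrete, here is a minimal numpy sketch of a residual block, simplified to two fully connected layers instead of convolutions; the shapes and the ReLU are illustrative assumptions, not the exact ResNet design.

import numpy as np

def residual_block(x, W1, W2):
    # F(x): a small two-layer transform standing in for the block's conv layers
    # (real ResNet blocks use convolutions and batch norm; this is simplified).
    f = np.maximum(0, x @ W1) @ W2
    # Shortcut connection: add the block's input back onto its output.
    return f + x

x = np.random.randn(4, 8)
W1 = np.zeros((8, 8))   # weights driven to zero, e.g. by L2 regularization
W2 = np.zeros((8, 8))
out = residual_block(x, W1, W2)
print(np.allclose(out, x))   # True: with zero weights the block is the identity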
54
00:05:26,310 --> 00:05:31,371
The other really useful property of these residual networks
has to do with the gradient flow in the backward pass.
55
00:05:31,371 --> 00:05:34,361
If you remember what happens at these
addition gates in the backward pass,
56
00:05:34,361 --> 00:05:39,881
when upstream gradient is coming in through an addition gate,
then it will split and fork along these two different paths.
57
00:05:39,881 --> 00:05:46,361
So then, when upstream gradient comes in, it'll
take one path through these convolutional blocks,
58
00:05:46,361 --> 00:05:50,811
but it will also have a direct connection of
the gradient through this residual connection.
59
00:05:50,811 --> 00:05:59,150
So then, when you imagine stacking many of these residual blocks on top
of each other, our network ends up with potentially hundreds of layers.
60
00:05:59,150 --> 00:06:05,561
Then, these residual connections give a sort of gradient super
highway for gradients to flow backward through the entire network.
61
00:06:05,561 --> 00:06:09,630
And this allows it to train much easier
and much faster.
62
00:06:09,630 --> 00:06:15,738
And actually allows these things to converge reasonably well,
even when the model is potentially hundreds of layers deep.
63
00:06:15,738 --> 00:06:21,550
And this idea of managing gradient flow in your models is
actually super important everywhere in machine learning.
64
00:06:21,550 --> 00:06:28,564
And super prevalent in recurrent networks as well. So we'll definitely
revisit this idea of gradient flow later in today's lecture.
65
00:06:31,148 --> 00:06:38,068
So then, we kind of also saw a couple other more exotic, more recent
CNN architectures last time, including DenseNet and FractalNet,
66
00:06:38,068 --> 00:06:43,070
and once you think about these architectures in terms
of gradient flow, they make a little bit more sense.
67
00:06:43,070 --> 00:06:48,619
These things like DenseNet and FractalNet are adding these
additional shortcut or identity connections inside the model.
68
00:06:48,619 --> 00:07:00,571
And if you think about what happens in the backwards pass for these models, these additional funny topologies are basically providing
direct paths for gradients to flow from the loss at the end of the network more easily into all the different layers of the network.
69
00:07:00,571 --> 00:07:09,760
So I think that, again, this idea of managing gradient flow properly in your CNN
Architectures is something that we've really seen a lot more in the last couple of years.
70
00:07:09,760 --> 00:07:15,221
And will probably see more moving forward
as more exotic architectures are invented.
71
00:07:16,257 --> 00:07:24,331
We also saw this kind of nice plot, plotting performance versus the number of flops,
the number of parameters, and the runtime of these various models.
72
00:07:24,331 --> 00:07:27,971
And there's some interesting characteristics
that you can dive in and see from this plot.
73
00:07:27,971 --> 00:07:32,801
One idea is that VGG and AlexNet
have a huge number of parameters,
74
00:07:32,801 --> 00:07:37,119
and these parameters actually come almost entirely
from the fully connected layers of the models.
75
00:07:37,119 --> 00:07:39,959
So AlexNet has something like
roughly 62 million parameters,
76
00:07:39,959 --> 00:07:47,771
and if you look at the first fully connected layer in AlexNet,
it's going from an activation volume of six by six by 256
77
00:07:47,771 --> 00:07:51,190
into this fully connected vector of 4096.
78
00:07:51,190 --> 00:07:56,851
So if you imagine what the weight matrix needs to look
like at that layer, the weight matrix is gigantic.
79
00:07:56,851 --> 00:08:01,921
Its number of entries is
six times six times 256 times 4096.
80
00:08:01,921 --> 00:08:06,370
And if you multiply that out, you see that
that single layer has 38 million parameters.
81
00:08:06,370 --> 00:08:11,859
So more than half of the parameters of the entire AlexNet
model are just sitting in that one fully connected layer.
82
00:08:11,859 --> 00:08:24,241
And if you add up all the parameters in just the fully connected layers of AlexNet, including these other fully connected
layers, you see something like 59 of the 62 million parameters in AlexNet are sitting in these fully connected layers.
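As a quick sanity check on those numbers, the arithmetic works out as in this small Python snippet (the layer names fc6/fc7/fc8 follow the usual AlexNet naming, and bias terms are ignored for simplicity):

# Parameters in AlexNet's fully connected layers, assuming the standard shapes:
# a 6x6x256 activation volume feeding a 4096-unit layer, then 4096 -> 4096,
# then 4096 -> 1000 class scores.
fc6 = 6 * 6 * 256 * 4096      # 37,748,736 weights, roughly 38 million
fc7 = 4096 * 4096             # about 16.8 million
fc8 = 4096 * 1000             # about 4.1 million
print(fc6, fc6 + fc7 + fc8)   # ~38M for that one layer, ~59M across the FC layers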
83
00:08:24,241 --> 00:08:31,110
So then when we move to other architectures, like GoogLeNet and ResNet,
they do away with a lot of these large fully connected layers
84
00:08:31,110 --> 00:08:33,698
in favor of global average pooling
at the end of the network.
85
00:08:33,698 --> 00:08:40,935
And this allows these nicer architectures
to really cut down the parameter count.
86
00:08:44,463 --> 00:08:49,604
So that was kind of our brief recap of the CNN
architectures that we saw last lecture, and then today,
87
00:08:49,604 --> 00:08:56,321
we're going to move to one of my favorite topics
to talk about, which is recurrent neural networks.
88
00:08:56,321 --> 00:09:03,222
So, so far in this class, we've seen what I like to think of as kind of a
vanilla feed-forward network; all of our network architectures have this flavor,
89
00:09:03,222 --> 00:09:08,593
where we receive some input and that input is
a fixed size object, like an image or vector.
90
00:09:08,593 --> 00:09:13,850
That input is fed through some set of
hidden layers and produces a single output,
91
00:09:13,850 --> 00:09:18,876
like a set of classification
scores over a set of categories.
92
00:09:20,071 --> 00:09:25,942
But in some context in machine learning, we want to have more
flexibility in the types of data that our models can process.
93
00:09:25,942 --> 00:09:35,313
So once we move to this idea of recurrent neural networks, we have a lot more opportunities
to play around with the types of input and output data that our networks can handle.
94
00:09:35,313 --> 00:09:41,009
So once we have recurrent neural networks, we
can do what we call these one to many models.
95
00:09:41,009 --> 00:09:48,721
Where maybe our input is some object of fixed size, like an image,
but now our output is a sequence of variable length, such as a caption.
96
00:09:48,721 --> 00:09:54,081
Where different captions might have different numbers
of words, so our output needs to be variable in length.
97
00:09:54,081 --> 00:09:56,491
We also might have many to one models,
98
00:09:56,491 --> 00:10:01,001
where our input could be variably sized. This
might be something like a piece of text,
99
00:10:01,001 --> 00:10:06,161
and we want to say what is the sentiment of that text,
whether it's positive or negative in sentiment.
100
00:10:06,161 --> 00:10:12,512
Or in a computer vision context, you might imagine taking as input
a video, and that video might have a variable number of frames.
101
00:10:12,512 --> 00:10:16,401
And now we want to read this entire video
of potentially variable length.
102
00:10:16,401 --> 00:10:22,721
And then at the end, make a classification decision about maybe
what kind of activity or action is going on in that video.
103
00:10:22,721 --> 00:10:29,931
We might also have problems where we want
both the input and the output to be variable in length.
104
00:10:29,931 --> 00:10:37,302
We might see something like this in machine translation, where our input
is some, maybe, sentence in English, which could have a variable length,
105
00:10:37,302 --> 00:10:41,633
and our output is maybe some sentence in French,
which also could have a variable length.
106
00:10:41,633 --> 00:10:46,801
And crucially, the length of the English sentence might
be different from the length of the French sentence.
107
00:10:46,801 --> 00:10:53,931
So we need some models that have the capacity to accept both
variable length sequences on the input and on the output.
108
00:10:53,931 --> 00:11:04,771
Finally, we might also consider problems where our input is variable in length, like a video sequence
with a variable number of frames. And now we want to make a decision for each element of that input sequence.
109
00:11:04,771 --> 00:11:11,891
So in the context of videos, that might be making some
classification decision along every frame of the video.
110
00:11:11,891 --> 00:11:17,401
And recurrent neural networks are this kind of general
paradigm for handling variable sized sequence data
111
00:11:17,401 --> 00:11:23,469
that allow us to pretty naturally capture all of
these different types of setups in our models.
112
00:11:24,349 --> 00:11:33,752
So recurrent neural networks are actually important; even for some problems that have a fixed
size input and a fixed size output, recurrent neural networks can still be pretty useful.
113
00:11:33,752 --> 00:11:38,793
So in this example, we might want to do, for
example, sequential processing of our input.
114
00:11:38,793 --> 00:11:46,227
So here, we're receiving a fixed size input like an image, and we want to make a
classification decision about, like, what number is being shown in this image?
115
00:11:46,227 --> 00:11:50,393
But now, rather than just doing a single feed
forward pass and making the decision all at once,
116
00:11:50,393 --> 00:11:55,553
this network is actually looking around the image and
taking various glimpses of different parts of the image.
117
00:11:55,553 --> 00:12:01,742
And then after making some series of glimpses, then it makes
its final decision as to what kind of number is present.
118
00:12:01,742 --> 00:12:17,473
So here, even though our input was an image and our output was a classification decision, even in this context,
this idea of being able to handle variable length processing with recurrent neural networks can lead to some really interesting types of models.
119
00:12:17,473 --> 00:12:23,923
There's a really cool paper that I like that applied
this same type of idea to generating new images.
120
00:12:23,923 --> 00:12:29,723
Where now, we want the model to synthesize brand new images
that look kind of like the images it saw in training,
121
00:12:29,723 --> 00:12:36,254
and we can use a recurrent neural network architecture to actually
paint these output images sort of one piece at a time in the output.
122
00:12:36,254 --> 00:12:46,380
You can see that, even though our output is this fixed size image, we can have these models
that are working over time to compute parts of the output one at a time sequentially.
123
00:12:46,380 --> 00:12:51,662
And we can use recurrent neural networks
for that type of setup as well.
124
00:12:51,662 --> 00:12:58,785
So after this sort of cool pitch about all these cool things that
RNNs can do, you might wonder, like what exactly are these things?
125
00:12:58,785 --> 00:13:04,163
So in general, a recurrent neural network
has this little recurrent core cell,
126
00:13:04,163 --> 00:13:11,382
and it will take some input x, feed that input into
the RNN, and that RNN has some internal hidden state,
127
00:13:11,382 --> 00:13:17,641
and that internal hidden state will be updated
every time that the RNN reads a new input.
128
00:13:17,641 --> 00:13:23,980
And that internal hidden state will be then fed
back to the model the next time it reads an input.
129
00:13:23,980 --> 00:13:28,822
And frequently, we will want our RNNs to
also produce some output at every time step,
130
00:13:28,822 --> 00:13:31,043
so we'll have this pattern
where it will read an input,
131
00:13:31,043 --> 00:13:34,469
update its hidden state,
and then produce an output.
132
00:13:35,814 --> 00:13:40,961
So then the question is what is the functional form
of this recurrence relation that we're computing?
133
00:13:40,961 --> 00:13:46,443
So inside this little green RNN block, we're computing
some recurrence relation, with a function f.
134
00:13:46,443 --> 00:13:49,094
So this function f will
depend on some weights, w.
135
00:13:49,094 --> 00:13:55,374
It will accept the previous hidden state, h t - 1,
as well as the input at the current state, x t,
136
00:13:55,374 --> 00:14:01,420
and this will output the next hidden state, or
the updated hidden state, that we call h t.
137
00:14:01,420 --> 00:14:11,552
And now, then as we read the next input, this hidden state, this new hidden state, h t,
will then just be passed into the same function as we read the next input, x t plus one.
138
00:14:11,552 --> 00:14:21,797
And now, if we wanted to produce some output at every time step of this network, we might
attach some additional fully connected layers that read in this h t at every time step.
139
00:14:21,797 --> 00:14:27,327
And make that decision based
on the hidden state at every time step.
140
00:14:27,327 --> 00:14:35,662
And one thing to note is that we use the same function, f w, and
the same weights, w, at every time step of the computation.
141
00:14:36,921 --> 00:14:43,434
So then kind of the simplest functional form that you can imagine
is what we call this vanilla recurrent neural network.
142
00:14:43,434 --> 00:14:46,866
So here, we have this same functional form
from the previous slide,
143
00:14:46,866 --> 00:14:52,483
where we're taking in our previous hidden state and our
current input and we need to produce the next hidden state.
144
00:14:52,483 --> 00:15:00,124
And the kind of simplest thing you might imagine is that we have
some weight matrix, w x h, that we multiply against the input, x t,
145
00:15:00,124 --> 00:15:05,615
as well as another weight matrix, w h h, that
we multiply against the previous hidden state.
146
00:15:05,615 --> 00:15:09,327
So we make these two multiplications
against our two states, add them together,
147
00:15:09,327 --> 00:15:13,514
and squash them through a tanh, so we get
some kind of non-linearity in the system.
148
00:15:13,514 --> 00:15:17,312
You might be wondering why we use a tanh here
and not some other type of non-linearity?
149
00:15:17,312 --> 00:15:20,594
After all the negative things we've said
about tanh in previous lectures,
150
00:15:20,594 --> 00:15:26,507
and I think we'll return a little bit to that later on
when we talk about more advanced architectures, like LSTMs.
151
00:15:27,346 --> 00:15:33,394
So then, in addition, in this architecture,
if we wanted to produce some y t at every time step,
152
00:15:33,394 --> 00:15:40,375
you might have another weight matrix that accepts
this hidden state and then transforms it into some y
153
00:15:40,375 --> 00:15:44,826
to produce maybe some class score
predictions at every time step.
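As a concrete sketch of that vanilla recurrence (the weight names follow the slide's w x h, w h h, w h y notation; the sizes are just illustrative):

import numpy as np

def rnn_step(x_t, h_prev, W_xh, W_hh, W_hy):
    # Vanilla RNN recurrence: two matrix multiplies, summed, squashed by tanh.
    h_t = np.tanh(x_t @ W_xh + h_prev @ W_hh)
    y_t = h_t @ W_hy   # optional per-time-step output, e.g. class scores
    return h_t, y_t

# Toy sizes: 10-dim input, 20-dim hidden state, 5 output scores per step.
rng = np.random.default_rng(0)
W_xh = rng.standard_normal((10, 20))
W_hh = rng.standard_normal((20, 20))
W_hy = rng.standard_normal((20, 5))
h = np.zeros(20)   # h_0, typically initialized to zeros
h, y = rnn_step(rng.standard_normal(10), h, W_xh, W_hh, W_hy)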
154
00:15:44,826 --> 00:15:51,487
And when I think about recurrent neural networks, I kind of think about, you
can also, you can kind of think of recurrent neural networks in two ways.
155
00:15:51,487 --> 00:15:57,095
One is this concept of having a hidden state
that feeds back on itself, recurrently.
156
00:15:57,095 --> 00:16:05,914
But I find that picture a little bit confusing. And sometimes, I find it clearer
to think about unrolling this computational graph for multiple time steps.
157
00:16:05,914 --> 00:16:11,786
And this makes the data flow of the hidden states and the inputs
and the outputs and the weights maybe a little bit more clear.
158
00:16:11,786 --> 00:16:15,494
So then at the first time step, we'll
have some initial hidden state h zero.
159
00:16:15,494 --> 00:16:22,415
This is usually initialized to zeros
in most contexts, and then we'll have some input, x t.
160
00:16:22,415 --> 00:16:28,324
This initial hidden state, h zero, and our current
input, x t, will go into our f w function.
161
00:16:28,324 --> 00:16:36,154
This will produce our next hidden state, h one. And then,
we'll repeat this process when we receive the next input.
162
00:16:36,154 --> 00:16:42,847
So now our current h one and our x one, will go into
that same f w, to produce our next output, h two.
163
00:16:42,847 --> 00:16:50,866
And this process will repeat over and over again, as we
consume all of the input, x ts, in our sequence of inputs.
164
00:16:50,866 --> 00:16:58,036
And now, one thing to note, is that we can actually make this even
more explicit and write the w matrix in our computational graph.
165
00:16:58,036 --> 00:17:03,415
And here you can see that we're re-using the same
w matrix at every time step of the computation.
166
00:17:03,415 --> 00:17:11,006
So now every time that we have this little f w block, it's receiving a
unique h and a unique x, but all of these blocks are taking the same w.
167
00:17:11,007 --> 00:17:20,786
And if you remember, we talked about how gradient flows in back propagation, when you
re-use the same, when you re-use the same node multiple times in a computational graph,
168
00:17:20,786 --> 00:17:28,218
then remember during the backward pass, you end up summing the
gradients into the w matrix when you're computing dLoss/dW.
169
00:17:28,218 --> 00:17:32,526
So, if you kind of think about
the back propagation for this model,
170
00:17:32,526 --> 00:17:42,503
then you'll have a separate gradient for w flowing from each of those time steps, and then
the final gradient for w will be the sum of all of those individual per-time-step gradients.
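To make the weight sharing explicit, a minimal unrolled forward pass might look like this sketch (the shapes, scaling, and toy sequence are assumptions for illustration):

import numpy as np

def rnn_forward(xs, h0, W_xh, W_hh):
    # Unroll the recurrence over the whole sequence, reusing the same
    # W_xh and W_hh at every time step.
    h, hs = h0, []
    for x_t in xs:
        h = np.tanh(x_t @ W_xh + h @ W_hh)
        hs.append(h)
    return hs   # one hidden state per time step

rng = np.random.default_rng(0)
W_xh = 0.1 * rng.standard_normal((10, 20))
W_hh = 0.1 * rng.standard_normal((20, 20))
xs = [rng.standard_normal(10) for _ in range(5)]   # a length-5 input sequence
hs = rnn_forward(xs, np.zeros(20), W_xh, W_hh)
# In the backward pass, dW_xh and dW_hh would each be the sum of the
# gradients contributed by every time step, because the weights are shared.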
171
00:17:43,615 --> 00:17:47,727
We can also write this y t explicitly
in this computational graph.
172
00:17:47,727 --> 00:17:54,858
So then, this output, h t, at every time step might feed into
some other little neural network that can produce a y t,
173
00:17:54,858 --> 00:17:59,087
which might be some class scores, or
something like that, at every time step.
174
00:17:59,087 --> 00:18:00,738
We can also make the loss more explicit.
175
00:18:00,738 --> 00:18:14,068
So in many cases, you might imagine that you have some ground truth label at every time step
of your sequence, and then you'll compute some individual loss at every time step on these outputs, y t's.
176
00:18:14,068 --> 00:18:22,497
And this loss will frequently be something like a softmax loss, in the case
where you have, maybe, a ground truth label at every time step of the sequence.
177
00:18:22,497 --> 00:18:27,887
And now the final loss for this entire
training step will be the sum of these individual losses.
178
00:18:27,887 --> 00:18:34,196
So now, we have a scalar loss at every time step, and we just sum
them up to get our final scalar loss at the top of the network.
179
00:18:34,196 --> 00:18:42,098
And now, if you think again about back propagation through this thing, in
order to train the model, we need to compute the gradient of the loss with respect to w.
180
00:18:42,098 --> 00:18:46,178
So, we'll have gradients flowing from that
final loss into each of these time steps.
181
00:18:46,178 --> 00:18:49,840
And then each of those time steps will
compute a local gradient on the weights, w,
182
00:18:49,840 --> 00:18:54,343
which will all then be summed to give us
our final gradient for the weights, w.
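A rough sketch of that summed per-time-step loss, assuming a ground truth label at every step and a softmax loss (the W_hy matrix and the toy labels here are illustrative assumptions):

import numpy as np

def softmax_loss(scores, label):
    # Standard softmax / cross-entropy loss for a single time step.
    p = np.exp(scores - scores.max())
    p /= p.sum()
    return -np.log(p[label])

def sequence_loss(hs, labels, W_hy):
    # Per-time-step class scores from the hidden states; the scalar losses
    # from every time step are summed into one final loss.
    return sum(softmax_loss(h @ W_hy, y) for h, y in zip(hs, labels))

rng = np.random.default_rng(0)
hs = [rng.standard_normal(20) for _ in range(5)]   # 5 hidden states of size 20
labels = [0, 2, 1, 1, 0]                           # a ground truth label per step
total_loss = sequence_loss(hs, labels, rng.standard_normal((20, 3)))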
183
00:18:55,597 --> 00:19:01,188
Now if we have a, sort of, this many to one situation, where
maybe we want to do something like sentiment analysis,
184
00:19:01,188 --> 00:19:05,799
then we would typically make that decision based
on the final hidden state of this network.
185
00:19:05,799 --> 00:19:11,868
Because this final hidden state kind of summarizes
all of the context from the entire sequence.
186
00:19:11,868 --> 00:19:14,788
Also, if we have a kind of
a one to many situation,
187
00:19:14,788 --> 00:19:19,319
where we want to receive a fixed size input
and then produce a variably sized output.
188
00:19:19,319 --> 00:19:26,050
Then you'll commonly use that fixed size input to
initialize, somehow, the initial hidden state of the model,
189
00:19:26,050 --> 00:19:30,079
and now the recurrent network will tick
for each cell in the output.
190
00:19:30,079 --> 00:19:36,915
And now, as you produce your variably sized output,
you'll unroll the graph for each element in the output.
191
00:19:38,490 --> 00:19:44,308
So then, when we talk about these sequence to sequence models,
where you might do something like machine translation,
192
00:19:44,308 --> 00:19:47,648
where you take a variably sized input
and produce a variably sized output.
193
00:19:47,648 --> 00:19:52,398
You can think of this as a combination
of the many to one, plus a one to many.
194
00:19:52,398 --> 00:19:56,900
So, we'll kind of proceed in two stages,
what we call an encoder and a decoder.
195
00:19:56,900 --> 00:20:02,159
So for the encoder, we'll receive the variably
sized input, which might be your sentence in English,
196
00:20:02,159 --> 00:20:08,110
and then summarize that entire sentence using
the final hidden state of the encoder network.
197
00:20:08,110 --> 00:20:15,769
And now we're in this many to one situation where we've summarized
this entire variably sized input in this single vector,
198
00:20:15,769 --> 00:20:23,111
and now, we have a second decoder network, which is a one to many situation,
which will input that single vector summarizing the input sentence
199
00:20:23,111 --> 00:20:28,969
and now produce this variably sized output, which
might be your sentence in another language.
200
00:20:28,969 --> 00:20:34,609
And now in this variably sized output, we might make some
predictions at every time step, maybe about what word to use.
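Here is a very rough sketch of that encoder/decoder idea (the weight names, sizes, fixed output length, and a simplified decoder that does not feed its predictions back in are all assumptions for illustration, not the full translation model):

import numpy as np

def encode(xs, W_xh, W_hh):
    # Encoder (many to one): consume the whole variable length input and
    # summarize it in the final hidden state.
    h = np.zeros(W_hh.shape[0])
    for x_t in xs:
        h = np.tanh(x_t @ W_xh + h @ W_hh)
    return h

def decode(h, W_hh_dec, W_hy, out_len):
    # Decoder (one to many): start from the encoder's summary vector and
    # unroll for each element of the output, emitting scores at every step.
    scores = []
    for _ in range(out_len):
        h = np.tanh(h @ W_hh_dec)
        scores.append(h @ W_hy)
    return scores

rng = np.random.default_rng(0)
W_xh, W_hh = rng.standard_normal((10, 20)), rng.standard_normal((20, 20))
W_hh_dec, W_hy = rng.standard_normal((20, 20)), rng.standard_normal((20, 30))
src = [rng.standard_normal(10) for _ in range(7)]   # input sequence of length 7
out = decode(encode(src, W_xh, W_hh), W_hh_dec, W_hy, out_len=5)   # 5 output steps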
201
00:20:34,609 --> 00:20:38,199
And you can imagine kind of training this entire