Lecture 10 _ Recurrent Neural Networks.srt
1
00:00:07,961 --> 00:00:12,153
- Okay. Can everyone hear me?
Okay. Sorry for the delay.
2
00:00:12,153 --> 00:00:24,081
I had a bit of technical difficulty. Today was the first time I was trying to use my new Touch Bar MacBook Pro for presenting,
and none of the adapters are working. So, I had to switch laptops at the last minute. So, thanks. Sorry about that.
3
00:00:25,353 --> 00:00:30,420
So, today is lecture 10. We're talking
about recurrent neural networks.
4
00:00:30,420 --> 00:00:33,003
So, as usual, a
couple of administrative notes.
5
00:00:33,003 --> 00:00:37,353
So, we're working hard
on assignment one grading.
6
00:00:37,353 --> 00:00:46,251
Those grades will probably be out sometime later today. Hopefully,
they can get out before the A2 deadline. That's what I'm hoping for.
7
00:00:46,251 --> 00:00:50,361
On a related note, Assignment
two is due today at 11:59 p.m.
8
00:00:50,361 --> 00:00:56,633
So, who's done with that already?
About half of you guys.
9
00:00:56,633 --> 00:01:03,811
So, if you remember, I did warn you when the assignment went out that
it was quite long, and to start early. So, you were warned about that.
10
00:01:03,811 --> 00:01:06,561
But, hopefully, you guys
have some late days left.
11
00:01:06,561 --> 00:01:10,531
Also, as another reminder,
the midterm will be in class on Tuesday.
12
00:01:10,531 --> 00:01:15,881
If you kind of look around the lecture hall, there are not enough
seats in this room to seat all the enrolled students in the class.
13
00:01:15,881 --> 00:01:20,062
So, we'll actually be having the midterm in
several other lecture halls across campus.
14
00:01:20,062 --> 00:01:26,099
And we'll be sending out some more details on
exactly where to go in the next couple of days.
15
00:01:26,099 --> 00:01:28,179
So, another
bit of announcement.
16
00:01:28,179 --> 00:01:34,950
We've been working on this sort of fun bit of extra credit thing for you to play with
that we're calling the training game. It's this cool browser-based experience,
17
00:01:34,950 --> 00:01:39,927
where you can go in and interactively train neural
networks and tweak the hyperparameters during training.
18
00:01:39,927 --> 00:01:47,646
And this should be a really cool interactive way for you to practice some of these
hyperparameter tuning skills that we've been talking about the last couple of lectures.
19
00:01:47,646 --> 00:01:53,190
So this is not required, but this, I think, will be a really
useful experience to gain a little bit more intuition
20
00:01:53,190 --> 00:01:57,481
into how some of these hyperparameters work
for different types of data sets in practice.
21
00:01:57,481 --> 00:02:05,790
So we're still working on getting all the bugs worked out of this setup, and we'll probably
send out some more instructions on exactly how this will work in the next couple of days.
22
00:02:05,790 --> 00:02:11,008
But again, not required. But please do check it out. I think it'll
be really fun and a really cool thing for you to play with.
23
00:02:11,008 --> 00:02:17,204
And we'll give you a bit of extra credit if you
end up working with this and doing a couple of runs with it.
24
00:02:18,208 --> 00:02:23,458
So, we'll again send out some more details about
this soon once we get all the bugs worked out.
25
00:02:24,720 --> 00:02:28,139
As a reminder, last time we were
talking about CNN Architectures.
26
00:02:28,139 --> 00:02:35,006
We kind of walked through the timeline of some of the various winners of
the ImageNet classification challenge. Kind of the breakthrough result,
27
00:02:35,006 --> 00:02:41,331
as we saw, was the AlexNet architecture in 2012, which was a
nine layer convolutional network. It did amazingly well,
28
00:02:41,331 --> 00:02:48,081
and it sort of kick started this whole deep learning revolution in computer
vision, and kind of brought a lot of these models into the mainstream.
29
00:02:48,081 --> 00:02:56,699
Then we skipped ahead a couple years, and saw that in the 2014 ImageNet challenge, we
had these two really interesting models, VGG and GoogLeNet, which were much deeper.
30
00:02:56,699 --> 00:03:02,930
So VGG was, they had a 16 and a 19 layer model,
and GoogLeNet was, I believe, a 22 layer model.
31
00:03:02,930 --> 00:03:11,230
Although one thing that is kind of interesting about these models is that the
2014 ImageNet challenge was right before batch normalization was invented.
32
00:03:11,230 --> 00:03:18,761
So at this time, before the invention of batch normalization, training these
relatively deep models of roughly twenty layers was very challenging.
33
00:03:18,761 --> 00:03:24,869
So, in fact, both of these two models had to resort to a little
bit of hackery in order to get their deep models to converge.
34
00:03:24,869 --> 00:03:28,579
So for VGG, they had the
16 and 19 layer models,
35
00:03:28,579 --> 00:03:34,107
but actually they first trained an 11 layer model,
because that was what they could get to converge.
36
00:03:34,107 --> 00:03:40,059
And then they added some extra random layers in the middle and
continued training, to actually train the 16 and 19 layer models.
37
00:03:40,059 --> 00:03:46,539
So, managing this training process was very challenging
in 2014 before the invention of batch normalization.
38
00:03:46,539 --> 00:03:52,539
Similarly, for GoogLeNet, we saw that GoogLeNet has these auxiliary
classifiers that were stuck into lower layers of the network.
39
00:03:52,539 --> 00:03:56,539
And these were not really needed
to get good classification performance.
40
00:03:56,539 --> 00:04:03,430
This was just sort of a way to cause extra gradient to be
injected directly into the lower layers of the network.
41
00:04:03,430 --> 00:04:10,411
And this, again, was before the invention of batch normalization,
and now once you have these networks with batch normalization,
42
00:04:10,411 --> 00:04:17,321
then you no longer need these slightly ugly hacks
in order to get these deeper models to converge.
43
00:04:17,321 --> 00:04:24,350
Then in the 2015 ImageNet challenge, we also saw this
really cool model called ResNet, these residual networks
44
00:04:24,350 --> 00:04:28,310
that now have these shortcut connections that
actually have these little residual blocks
45
00:04:28,310 --> 00:04:39,110
where we're going to take our input, pass it through the residual block, and then
add our input to the block back onto the output from these convolutional layers.
46
00:04:39,110 --> 00:04:43,308
This is kind of a funny architecture, but
it actually has two really nice properties.
47
00:04:43,308 --> 00:04:49,531
One is that if we just set all the weights in this residual
block to zero, then this block is computing the identity.
48
00:04:49,531 --> 00:04:55,681
So in some way, it's relatively easy for this model
to learn not to use the layers that it doesn't need.
49
00:04:55,681 --> 00:05:02,171
In addition, it kind of adds this interpretation to L2
regularization in the context of these neural networks,
50
00:05:02,171 --> 00:05:08,321
because once you put L2 regularization, remember, on the weights
of your network, that's going to drive all the parameters towards zero.
51
00:05:08,321 --> 00:05:12,739
And maybe in your standard convolutional architecture, driving all the
parameters towards zero doesn't really make sense.
52
00:05:12,739 --> 00:05:20,510
But in the context of a residual network, if you drive all the parameters towards
zero, that's kind of encouraging the model to not use layers that it doesn't need,
53
00:05:20,510 --> 00:05:26,310
because it will just drive those residual blocks towards
the identity if they're not needed for classification.
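To make that identity property concrete, here is a minimal numpy sketch of a residual block, simplified to two fully connected layers instead of convolutions; the shapes and the ReLU are illustrative assumptions, not the exact ResNet design.

import numpy as np

def residual_block(x, W1, W2):
    # F(x): a small two-layer transform standing in for the block's conv layers
    # (real ResNet blocks use convolutions and batch norm; this is simplified).
    f = np.maximum(0, x @ W1) @ W2
    # Shortcut connection: add the block's input back onto its output.
    return f + x

x = np.random.randn(4, 8)
W1 = np.zeros((8, 8))   # weights driven to zero, e.g. by L2 regularization
W2 = np.zeros((8, 8))
out = residual_block(x, W1, W2)
print(np.allclose(out, x))   # True: with zero weights the block is the identity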
54
00:05:26,310 --> 00:05:31,371
The other really useful property of these residual networks
has to do with the gradient flow in the backward pass.
55
00:05:31,371 --> 00:05:34,361
If you remember what happens at these
addition gates in the backward pass,
56
00:05:34,361 --> 00:05:39,881
when upstream gradient is coming in through an addition gate,
then it will split and fork along these two different paths.
57
00:05:39,881 --> 00:05:46,361
So then, when upstream gradient comes in, it'll
take one path through these convolutional blocks,
58
00:05:46,361 --> 00:05:50,811
but it will also have a direct connection of
the gradient through this residual connection.
59
00:05:50,811 --> 00:05:59,150
So then, when you imagine stacking many of these residual blocks on top
of each other, our network ends up with potentially hundreds of layers.
60
00:05:59,150 --> 00:06:05,561
Then, these residual connections give a sort of gradient super
highway for gradients to flow backward through the entire network.
61
00:06:05,561 --> 00:06:09,630
And this allows it to train much easier
and much faster.
62
00:06:09,630 --> 00:06:15,738
And actually allows these things to converge reasonably well,
even when the model is potentially hundreds of layers deep.
63
00:06:15,738 --> 00:06:21,550
And this idea of managing gradient flow in your models is
actually super important everywhere in machine learning.
64
00:06:21,550 --> 00:06:28,564
And super prevalent in recurrent networks as well. So we'll definitely
revisit this idea of gradient flow later in today's lecture.
65
00:06:31,148 --> 00:06:38,068
So then, we kind of also saw a couple other more exotic, more recent
CNN architectures last time, including DenseNet and FractalNet,
66
00:06:38,068 --> 00:06:43,070
and once you think about these architectures in terms
of gradient flow, they make a little bit more sense.
67
00:06:43,070 --> 00:06:48,619
These things like DenseNet and FractalNet are adding these
additional shortcut or identity connections inside the model.
68
00:06:48,619 --> 00:07:00,571
And if you think about what happens in the backwards pass for these models, these additional funny topologies are basically providing
direct paths for gradients to flow from the loss at the end of the network more easily into all the different layers of the network.
69
00:07:00,571 --> 00:07:09,760
So I think that, again, this idea of managing gradient flow properly in your CNN
Architectures is something that we've really seen a lot more in the last couple of years.
70
00:07:09,760 --> 00:07:15,221
And will probably see more moving forward
as more exotic architectures are invented.
71
00:07:16,257 --> 00:07:24,331
We also saw this kind of nice plot, plotting performance versus the number of flops,
the number of parameters, and the runtime of these various models.
72
00:07:24,331 --> 00:07:27,971
And there's some interesting characteristics
that you can dive in and see from this plot.
73
00:07:27,971 --> 00:07:32,801
One idea is that VGG and AlexNet
have a huge number of parameters,
74
00:07:32,801 --> 00:07:37,119
and these parameters actually come almost entirely
from the fully connected layers of the models.
75
00:07:37,119 --> 00:07:39,959
So AlexNet has something like
roughly 62 million parameters,
76
00:07:39,959 --> 00:07:47,771
and if you look at the first fully connected layer in AlexNet,
it's going from an activation volume of six by six by 256
77
00:07:47,771 --> 00:07:51,190
into this fully connected vector of 4096.
78
00:07:51,190 --> 00:07:56,851
So if you imagine what the weight matrix needs to look
like at that layer, the weight matrix is gigantic.
79
00:07:56,851 --> 00:08:01,921
Its number of entries is
six times six times 256 times 4096.
80
00:08:01,921 --> 00:08:06,370
And if you multiply that out, you see that
that single layer has 38 million parameters.
81
00:08:06,370 --> 00:08:11,859
So more than half of the parameters of the entire AlexNet
model are just sitting in that one fully connected layer.
82
00:08:11,859 --> 00:08:24,241
And if you add up all the parameters in just the fully connected layers of AlexNet, including these other fully connected
layers, you see something like 59 of the 62 million parameters in AlexNet are sitting in these fully connected layers.
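As a quick sanity check on those numbers, the arithmetic works out as in this small Python snippet (the layer names fc6/fc7/fc8 follow the usual AlexNet naming, and bias terms are ignored for simplicity):

# Parameters in AlexNet's fully connected layers, assuming the standard shapes:
# a 6x6x256 activation volume feeding a 4096-unit layer, then 4096 -> 4096,
# then 4096 -> 1000 class scores.
fc6 = 6 * 6 * 256 * 4096      # 37,748,736 weights, roughly 38 million
fc7 = 4096 * 4096             # about 16.8 million
fc8 = 4096 * 1000             # about 4.1 million
print(fc6, fc6 + fc7 + fc8)   # ~38M for that one layer, ~59M across the FC layers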
83
00:08:24,241 --> 00:08:31,110
So then when we move to other architectures, like GoogLeNet and ResNet,
they do away with a lot of these large fully connected layers
84
00:08:31,110 --> 00:08:33,698
in favor of global average pooling
at the end of the network.
85
00:08:33,698 --> 00:08:40,935
And this allows these nicer architectures
to really cut down the parameter count.
86
00:08:44,463 --> 00:08:49,604
So that was kind of our brief recap of the CNN
architectures that we saw last lecture, and then today,
87
00:08:49,604 --> 00:08:56,321
we're going to move to one of my favorite topics
to talk about, which is recurrent neural networks.
88
00:08:56,321 --> 00:09:03,222
So, so far in this class, we've seen what I like to think of as kind of a
vanilla feed-forward network; all of our network architectures have this flavor,
89
00:09:03,222 --> 00:09:08,593
where we receive some input and that input is
a fixed size object, like an image or vector.
90
00:09:08,593 --> 00:09:13,850
That input is fed through some set of
hidden layers and produces a single output,
91
00:09:13,850 --> 00:09:18,876
like a set of classification
scores over a set of categories.
92
00:09:20,071 --> 00:09:25,942
But in some context in machine learning, we want to have more
flexibility in the types of data that our models can process.
93
00:09:25,942 --> 00:09:35,313
So once we move to this idea of recurrent neural networks, we have a lot more opportunities
to play around with the types of input and output data that our networks can handle.
94
00:09:35,313 --> 00:09:41,009
So once we have recurrent neural networks, we
can do what we call these one to many models.
95
00:09:41,009 --> 00:09:48,721
Where maybe our input is some object of fixed size, like an image,
but now our output is a sequence of variable length, such as a caption.
96
00:09:48,721 --> 00:09:54,081
Where different captions might have different numbers
of words, so our output needs to be variable in length.
97
00:09:54,081 --> 00:09:56,491
We also might have many to one models,
98
00:09:56,491 --> 00:10:01,001
where our input could be variably sized. This
might be something like a piece of text,
99
00:10:01,001 --> 00:10:06,161
and we want to say what is the sentiment of that text,
whether it's positive or negative in sentiment.
100
00:10:06,161 --> 00:10:12,512
Or in a computer vision context, you might imagine taking as input
a video, and that video might have a variable number of frames.
101
00:10:12,512 --> 00:10:16,401
And now we want to read this entire video
of potentially variable length.
102
00:10:16,401 --> 00:10:22,721
And then at the end, make a classification decision about maybe
what kind of activity or action is going on in that video.
103
00:10:22,721 --> 00:10:29,931
We might also have problems where we want
both the input and the output to be variable in length.
104
00:10:29,931 --> 00:10:37,302
We might see something like this in machine translation, where our input
is some, maybe, sentence in English, which could have a variable length,
105
00:10:37,302 --> 00:10:41,633
and our output is maybe some sentence in French,
which also could have a variable length.
106
00:10:41,633 --> 00:10:46,801
And crucially, the length of the English sentence might
be different from the length of the French sentence.
107
00:10:46,801 --> 00:10:53,931
So we need some models that have the capacity to accept both
variable length sequences on the input and on the output.
108
00:10:53,931 --> 00:11:04,771
Finally, we might also consider problems where our input is variable in length, like a video sequence
with a variable number of frames. And now we want to make a decision for each element of that input sequence.
109
00:11:04,771 --> 00:11:11,891
So in the context of videos, that might be making some
classification decision along every frame of the video.
110
00:11:11,891 --> 00:11:17,401
And recurrent neural networks are this kind of general
paradigm for handling variable sized sequence data
111
00:11:17,401 --> 00:11:23,469
that allow us to pretty naturally capture all of
these different types of setups in our models.
112
00:11:24,349 --> 00:11:33,752
So recurrent neural networks are actually important; even for some problems that have a fixed
size input and a fixed size output, recurrent neural networks can still be pretty useful.
113
00:11:33,752 --> 00:11:38,793
So in this example, we might want to do, for
example, sequential processing of our input.
114
00:11:38,793 --> 00:11:46,227
So here, we're receiving a fixed size input like an image, and we want to make a
classification decision about, like, what number is being shown in this image?
115
00:11:46,227 --> 00:11:50,393
But now, rather than just doing a single feed
forward pass and making the decision all at once,
116
00:11:50,393 --> 00:11:55,553
this network is actually looking around the image and
taking various glimpses of different parts of the image.
117
00:11:55,553 --> 00:12:01,742
And then after making some series of glimpses, then it makes
its final decision as to what kind of number is present.
118
00:12:01,742 --> 00:12:17,473
So here, even though our input was an image and our output was a classification decision, even in this context,
this idea of being able to handle variable length processing with recurrent neural networks can lead to some really interesting types of models.
119
00:12:17,473 --> 00:12:23,923
There's a really cool paper that I like that applied
this same type of idea to generating new images.
120
00:12:23,923 --> 00:12:29,723
Where now, we want the model to synthesize brand new images
that look kind of like the images it saw in training,
121
00:12:29,723 --> 00:12:36,254
and we can use a recurrent neural network architecture to actually
paint these output images sort of one piece at a time in the output.
122
00:12:36,254 --> 00:12:46,380
You can see that, even though our output is this fixed size image, we can have these models
that are working over time to compute parts of the output one at a time sequentially.
123
00:12:46,380 --> 00:12:51,662
And we can use recurrent neural networks
for that type of setup as well.
124
00:12:51,662 --> 00:12:58,785
So after this sort of cool pitch about all these cool things that
RNNs can do, you might wonder, like what exactly are these things?
125
00:12:58,785 --> 00:13:04,163
So in general, a recurrent neural network
has this little recurrent core cell,
126
00:13:04,163 --> 00:13:11,382
and it will take some input x, feed that input into
the RNN, and that RNN has some internal hidden state,
127
00:13:11,382 --> 00:13:17,641
and that internal hidden state will be updated
every time that the RNN reads a new input.
128
00:13:17,641 --> 00:13:23,980
And that internal hidden state will be then fed
back to the model the next time it reads an input.
129
00:13:23,980 --> 00:13:28,822
And frequently, we will want our RNNs to
also produce some output at every time step,
130
00:13:28,822 --> 00:13:31,043
so we'll have this pattern
where it will read an input,
131
00:13:31,043 --> 00:13:34,469
update its hidden state,
and then produce an output.
132
00:13:35,814 --> 00:13:40,961
So then the question is what is the functional form
of this recurrence relation that we're computing?
133
00:13:40,961 --> 00:13:46,443
So inside this little green RNN block, we're computing
some recurrence relation, with a function f.
134
00:13:46,443 --> 00:13:49,094
So this function f will
depend on some weights, w.
135
00:13:49,094 --> 00:13:55,374
It will accept the previous hidden state, h t - 1,
as well as the input at the current state, x t,
136
00:13:55,374 --> 00:14:01,420
and this will output the next hidden state, or
the updated hidden state, that we call h t.
137
00:14:01,420 --> 00:14:11,552
And now, then as we read the next input, this hidden state, this new hidden state, h t,
will then just be passed into the same function as we read the next input, x t plus one.
138
00:14:11,552 --> 00:14:21,797
And now, if we wanted to produce some output at every time step of this network, we might
attach some additional fully connected layers that read in this h t at every time step.
139
00:14:21,797 --> 00:14:27,327
And make that decision based
on the hidden state at every time step.
140
00:14:27,327 --> 00:14:35,662
And one thing to note is that we use the same function, f w, and
the same weights, w, at every time step of the computation.
141
00:14:36,921 --> 00:14:43,434
So then kind of the simplest functional form that you can imagine
is what we call this vanilla recurrent neural network.
142
00:14:43,434 --> 00:14:46,866
So here, we have this same functional form
from the previous slide,
143
00:14:46,866 --> 00:14:52,483
where we're taking in our previous hidden state and our
current input and we need to produce the next hidden state.
144
00:14:52,483 --> 00:15:00,124
And the kind of simplest thing you might imagine is that we have
some weight matrix, w x h, that we multiply against the input, x t,
145
00:15:00,124 --> 00:15:05,615
as well as another weight matrix, w h h, that
we multiply against the previous hidden state.
146
00:15:05,615 --> 00:15:09,327
So we make these two multiplications
against our two states, add them together,
147
00:15:09,327 --> 00:15:13,514
and squash them through a tanh, so we get
some kind of non-linearity in the system.
148
00:15:13,514 --> 00:15:17,312
You might be wondering why we use a tanh here
and not some other type of non-linearity?
149
00:15:17,312 --> 00:15:20,594
After all the negative things we've said
about tanh in previous lectures,
150
00:15:20,594 --> 00:15:26,507
and I think we'll return a little bit to that later on
when we talk about more advanced architectures, like LSTMs.
151
00:15:27,346 --> 00:15:33,394
So then, in addition, in this architecture,
if we wanted to produce some y t at every time step,
152
00:15:33,394 --> 00:15:40,375
you might have another weight matrix that accepts
this hidden state and then transforms it into some y
153
00:15:40,375 --> 00:15:44,826
to produce maybe some class score
predictions at every time step.
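As a concrete sketch of that vanilla recurrence (the weight names follow the slide's w x h, w h h, w h y notation; the sizes are just illustrative):

import numpy as np

def rnn_step(x_t, h_prev, W_xh, W_hh, W_hy):
    # Vanilla RNN recurrence: two matrix multiplies, summed, squashed by tanh.
    h_t = np.tanh(x_t @ W_xh + h_prev @ W_hh)
    y_t = h_t @ W_hy   # optional per-time-step output, e.g. class scores
    return h_t, y_t

# Toy sizes: 10-dim input, 20-dim hidden state, 5 output scores per step.
rng = np.random.default_rng(0)
W_xh = rng.standard_normal((10, 20))
W_hh = rng.standard_normal((20, 20))
W_hy = rng.standard_normal((20, 5))
h = np.zeros(20)   # h_0, typically initialized to zeros
h, y = rnn_step(rng.standard_normal(10), h, W_xh, W_hh, W_hy)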
154
00:15:44,826 --> 00:15:51,487
And when I think about recurrent neural networks, I kind of think about, you
can also, you can kind of think of recurrent neural networks in two ways.
155
00:15:51,487 --> 00:15:57,095
One is this concept of having a hidden state
that feeds back on itself, recurrently.
156
00:15:57,095 --> 00:16:05,914
But I find that picture a little bit confusing. And sometimes, I find it clearer
to think about unrolling this computational graph for multiple time steps.
157
00:16:05,914 --> 00:16:11,786
And this makes the data flow of the hidden states and the inputs
and the outputs and the weights maybe a little bit more clear.
158
00:16:11,786 --> 00:16:15,494
So then at the first time step, we'll
have some initial hidden state h zero.
159
00:16:15,494 --> 00:16:22,415
This is usually initialized to zeros
in most contexts, and then we'll have some input, x t.
160
00:16:22,415 --> 00:16:28,324
This initial hidden state, h zero, and our current
input, x t, will go into our f w function.
161
00:16:28,324 --> 00:16:36,154
This will produce our next hidden state, h one. And then,
we'll repeat this process when we receive the next input.
162
00:16:36,154 --> 00:16:42,847
So now our current h one and our x one, will go into
that same f w, to produce our next output, h two.
163
00:16:42,847 --> 00:16:50,866
And this process will repeat over and over again, as we
consume all of the input, x ts, in our sequence of inputs.
164
00:16:50,866 --> 00:16:58,036
And now, one thing to note, is that we can actually make this even
more explicit and write the w matrix in our computational graph.
165
00:16:58,036 --> 00:17:03,415
And here you can see that we're re-using the same
w matrix at every time step of the computation.
166
00:17:03,415 --> 00:17:11,006
So now every time that we have this little f w block, it's receiving a
unique h and a unique x, but all of these blocks are taking the same w.
167
00:17:11,007 --> 00:17:20,786
And if you remember, we talked about how gradient flows in back propagation, when you
re-use the same, when you re-use the same node multiple times in a computational graph,
168
00:17:20,786 --> 00:17:28,218
then remember during the backward pass, you end up summing the
gradients into the w matrix when you're computing dLoss/dW.
169
00:17:28,218 --> 00:17:32,526
So, if you kind of think about
the back propagation for this model,
170
00:17:32,526 --> 00:17:42,503
then you'll have a separate gradient for w flowing from each of those time steps, and then
the final gradient for w will be the sum of all of those individual per-time-step gradients.
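To make the weight sharing explicit, a minimal unrolled forward pass might look like this sketch (the shapes, scaling, and toy sequence are assumptions for illustration):

import numpy as np

def rnn_forward(xs, h0, W_xh, W_hh):
    # Unroll the recurrence over the whole sequence, reusing the same
    # W_xh and W_hh at every time step.
    h, hs = h0, []
    for x_t in xs:
        h = np.tanh(x_t @ W_xh + h @ W_hh)
        hs.append(h)
    return hs   # one hidden state per time step

rng = np.random.default_rng(0)
W_xh = 0.1 * rng.standard_normal((10, 20))
W_hh = 0.1 * rng.standard_normal((20, 20))
xs = [rng.standard_normal(10) for _ in range(5)]   # a length-5 input sequence
hs = rnn_forward(xs, np.zeros(20), W_xh, W_hh)
# In the backward pass, dW_xh and dW_hh would each be the sum of the
# gradients contributed by every time step, because the weights are shared.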
171
00:17:43,615 --> 00:17:47,727
We can also write this y t explicitly
in this computational graph.
172
00:17:47,727 --> 00:17:54,858
So then, this output, h t, at every time step might feed into
some other little neural network that can produce a y t,
173
00:17:54,858 --> 00:17:59,087
which might be some class scores, or
something like that, at every time step.
174
00:17:59,087 --> 00:18:00,738
We can also make the loss more explicit.
175
00:18:00,738 --> 00:18:14,068
So in many cases, you might imagine that you have some ground truth label at every time step
of your sequence, and then you'll compute some individual loss at every time step on these outputs, y t's.
176
00:18:14,068 --> 00:18:22,497
And this loss will frequently be something like a softmax loss, in the case
where you have, maybe, a ground truth label at every time step of the sequence.
177
00:18:22,497 --> 00:18:27,887
And now the final loss for this entire
training step will be the sum of these individual losses.
178
00:18:27,887 --> 00:18:34,196
So now, we have a scalar loss at every time step, and we just sum
them up to get our final scalar loss at the top of the network.
179
00:18:34,196 --> 00:18:42,098
And now, if you think again about back propagation through this thing, in
order to train the model, we need to compute the gradient of the loss with respect to w.
180
00:18:42,098 --> 00:18:46,178
So, we'll have gradients flowing from that
final loss into each of these time steps.
181
00:18:46,178 --> 00:18:49,840
And then each of those time steps will
compute a local gradient on the weights, w,
182
00:18:49,840 --> 00:18:54,343
which will all then be summed to give us
our final gradient for the weights, w.
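A rough sketch of that summed per-time-step loss, assuming a ground truth label at every step and a softmax loss (the W_hy matrix and the toy labels here are illustrative assumptions):

import numpy as np

def softmax_loss(scores, label):
    # Standard softmax / cross-entropy loss for a single time step.
    p = np.exp(scores - scores.max())
    p /= p.sum()
    return -np.log(p[label])

def sequence_loss(hs, labels, W_hy):
    # Per-time-step class scores from the hidden states; the scalar losses
    # from every time step are summed into one final loss.
    return sum(softmax_loss(h @ W_hy, y) for h, y in zip(hs, labels))

rng = np.random.default_rng(0)
hs = [rng.standard_normal(20) for _ in range(5)]   # 5 hidden states of size 20
labels = [0, 2, 1, 1, 0]                           # a ground truth label per step
total_loss = sequence_loss(hs, labels, rng.standard_normal((20, 3)))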
183
00:18:55,597 --> 00:19:01,188
Now if we have a, sort of, this many to one situation, where
maybe we want to do something like sentiment analysis,
184
00:19:01,188 --> 00:19:05,799
then we would typically make that decision based
on the final hidden state of this network.
185
00:19:05,799 --> 00:19:11,868
Because this final hidden state kind of summarizes
all of the context from the entire sequence.
186
00:19:11,868 --> 00:19:14,788
Also, if we have a kind of
a one to many situation,
187
00:19:14,788 --> 00:19:19,319
where we want to receive a fixed size input
and then produce a variably sized output.
188
00:19:19,319 --> 00:19:26,050
Then you'll commonly use that fixed size input to
initialize, somehow, the initial hidden state of the model,
189
00:19:26,050 --> 00:19:30,079
and now the recurrent network will tick
for each cell in the output.
190
00:19:30,079 --> 00:19:36,915
And now, as you produce your variably sized output,
you'll unroll the graph for each element in the output.
191
00:19:38,490 --> 00:19:44,308
So then, when we talk about these sequence to sequence models,
where you might do something like machine translation,
192
00:19:44,308 --> 00:19:47,648
where you take a variably sized input
and produce a variably sized output.
193
00:19:47,648 --> 00:19:52,398
You can think of this as a combination
of the many to one, plus a one to many.
194
00:19:52,398 --> 00:19:56,900
So, we'll kind of proceed in two stages,
what we call an encoder and a decoder.
195
00:19:56,900 --> 00:20:02,159
So for the encoder, we'll receive the variably
sized input, which might be your sentence in English,
196
00:20:02,159 --> 00:20:08,110
and then summarize that entire sentence using
the final hidden state of the encoder network.
197
00:20:08,110 --> 00:20:15,769
And now we're in this many to one situation where we've summarized
this entire variably sized input in this single vector,
198
00:20:15,769 --> 00:20:23,111
and now, we have a second decoder network, which is a one to many situation,
which will input that single vector summarizing the input sentence
199
00:20:23,111 --> 00:20:28,969
and now produce this variably sized output, which
might be your sentence in another language.
200
00:20:28,969 --> 00:20:34,609
And now in this variably sized output, we might make some
predictions at every time step, maybe about what word to use.
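Here is a very rough sketch of that encoder/decoder idea (the weight names, sizes, fixed output length, and a simplified decoder that does not feed its predictions back in are all assumptions for illustration, not the full translation model):

import numpy as np

def encode(xs, W_xh, W_hh):
    # Encoder (many to one): consume the whole variable length input and
    # summarize it in the final hidden state.
    h = np.zeros(W_hh.shape[0])
    for x_t in xs:
        h = np.tanh(x_t @ W_xh + h @ W_hh)
    return h

def decode(h, W_hh_dec, W_hy, out_len):
    # Decoder (one to many): start from the encoder's summary vector and
    # unroll for each element of the output, emitting scores at every step.
    scores = []
    for _ in range(out_len):
        h = np.tanh(h @ W_hh_dec)
        scores.append(h @ W_hy)
    return scores

rng = np.random.default_rng(0)
W_xh, W_hh = rng.standard_normal((10, 20)), rng.standard_normal((20, 20))
W_hh_dec, W_hy = rng.standard_normal((20, 20)), rng.standard_normal((20, 30))
src = [rng.standard_normal(10) for _ in range(7)]   # input sequence of length 7
out = decode(encode(src, W_xh, W_hh), W_hh_dec, W_hy, out_len=5)   # 5 output steps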
201
00:20:34,609 --> 00:20:38,199
And you can imagine kind of training this entire