Lecture 13 | Generative Models
1
00:00:11,077 --> 00:00:14,258
- Okay we have a lot to cover
today so let's get started.
2
00:00:14,258 --> 00:00:17,454
Today we'll be talking
about Generative Models.
3
00:00:17,454 --> 00:00:20,484
And before we start, a few
administrative details.
4
00:00:20,484 --> 00:00:23,522
So midterm grades will be
released on Gradescope this week.
5
00:00:23,522 --> 00:00:27,730
A reminder that A3 is
due next Friday May 26th.
6
00:00:27,730 --> 00:00:32,709
The HyperQuest deadline for extra credit:
you can still do this until Sunday May 21st.
7
00:00:33,632 --> 00:00:37,799
And our poster session is
June 6th from 12 to 3 P.M.
8
00:00:40,812 --> 00:00:47,759
Okay, so an overview of what we're going to talk about today: we're going to
switch gears a little bit and take a look at unsupervised learning.
9
00:00:47,759 --> 00:00:54,103
And in particular we're going to talk about generative
models, which are a type of unsupervised learning.
10
00:00:54,103 --> 00:00:57,112
And we'll look at three
types of generative models.
11
00:00:57,112 --> 00:01:01,174
So pixelRNNs and pixelCNNs,
variational autoencoders,
12
00:01:01,174 --> 00:01:04,174
and Generative Adversarial networks.
13
00:01:05,571 --> 00:01:11,168
So so far in this class we've talked a lot about supervised
learning and different kinds of supervised learning problems.
14
00:01:11,168 --> 00:01:16,078
So in the supervised learning setup we have
our data X and then we have some labels Y.
15
00:01:16,078 --> 00:01:21,417
And our goal is to learn a function that's
mapping from our data X to our labels Y.
16
00:01:21,417 --> 00:01:26,237
And these labels can take
many different forms.
17
00:01:26,237 --> 00:01:34,934
So for example, we've looked at classification where our input is
an image and we want to output Y, a class label for the category.
18
00:01:34,934 --> 00:01:44,093
We've talked about object detection where now our input is still an image, but here
we want to output the bounding boxes around object instances, for example multiple dogs or cats.
19
00:01:46,138 --> 00:01:51,986
We've talked about semantic segmentation where here we have a
label for every pixel, the category that each pixel belongs to.
20
00:01:53,572 --> 00:01:58,961
And we've also talked about image captioning
where here our label is now a sentence
21
00:01:58,961 --> 00:02:02,961
and so it's now in the
form of natural language.
22
00:02:03,998 --> 00:02:15,661
So unsupervised learning: in this setup we have unlabeled
training data and our goal now is to learn some underlying hidden structure of the data.
23
00:02:15,661 --> 00:02:20,370
Right, so an example of this can be something like
clustering which you guys might have seen before
24
00:02:20,370 --> 00:02:25,029
where here the goal is to find groups within the
data that are similar through some type of metric.
25
00:02:25,029 --> 00:02:27,187
For example, K means clustering.
26
00:02:27,187 --> 00:02:32,871
Another example of an unsupervised learning
task is dimensionality reduction.
27
00:02:33,777 --> 00:02:38,939
So in this problem we want to find axes along which
our training data has the most variation,
28
00:02:38,939 --> 00:02:43,537
and so these axes are part of the
underlying structure of the data.
29
00:02:43,537 --> 00:02:51,095
And then we can use this to reduce the dimensionality of the data such that
the data has significant variation along each of the remaining dimensions.
30
00:02:51,095 --> 00:02:57,842
Right, so this example here we start off with data in three
dimensions and we're going to find two axes of variation in this case
31
00:02:57,842 --> 00:03:01,259
and project our data down to 2D.
32
00:03:04,205 --> 00:03:09,964
Another example of unsupervised learning is
learning feature representations for data.
33
00:03:11,006 --> 00:03:17,209
We've seen how to do this in supervised ways before where
we used the supervised loss, for example classification.
34
00:03:17,209 --> 00:03:21,617
where we have the classification label
and something like a softmax loss.
35
00:03:21,617 --> 00:03:29,869
And we can train a neural network where we can interpret the activations, for
example our FC7 layer, as some kind of feature representation for the data.
36
00:03:29,869 --> 00:03:35,742
And in an unsupervised setting, for example here
autoencoders which we'll talk more about later
37
00:03:35,742 --> 00:03:46,872
In this case our loss is now trying to reconstruct the input data, to basically
get a good reconstruction of our input data, and we use this to learn features.
38
00:03:46,872 --> 00:03:52,245
So we're learning a feature representation
without using any additional external labels.
39
00:03:53,471 --> 00:03:59,585
And finally another example of unsupervised learning
is density estimation where in this case we want to
40
00:03:59,585 --> 00:04:02,884
estimate the underlying
distribution of our data.
41
00:04:02,884 --> 00:04:10,811
So for example in this top case over here, we have points
in 1-D and we can try and fit a Gaussian to this density,
42
00:04:10,811 --> 00:04:16,605
and in this bottom example over here it's 2D data and
here again we're trying to estimate the density and
43
00:04:16,605 --> 00:04:24,239
and model this density. We want to fit a model such that
the density is higher where more points are concentrated.
44
00:04:26,100 --> 00:04:35,990
And so to summarize the differences: in supervised learning, which we've looked
at a lot so far, we want to use labeled data to learn a function mapping from X to Y,
45
00:04:35,990 --> 00:04:44,124
and in unsupervised learning we use no labels and instead we try to learn
some underlying hidden structure of the data, whether this is grouping,
46
00:04:44,124 --> 00:04:48,291
axes of variation, or
underlying density estimation.
47
00:04:49,662 --> 00:04:54,113
And unsupervised learning is a huge
and really exciting area of research,
48
00:04:54,113 --> 00:05:04,339
and some of the reasons are that training data is really cheap: it doesn't use labels,
so we're able to learn from a lot of data at one time and basically utilize a lot
49
00:05:04,339 --> 00:05:09,977
more data than if we required annotating
or finding labels for data.
50
00:05:09,977 --> 00:05:17,823
And unsupervised learning is still a relatively unsolved research
area by comparison. There are a lot of open problems in this,
51
00:05:17,823 --> 00:05:24,669
but it also holds the potential that if you're able to
successfully learn and represent a lot of the underlying structure
52
00:05:24,669 --> 00:05:32,729
in the data then this also takes you a long way towards the Holy
Grail of trying to understand the structure of the visual world.
53
00:05:35,026 --> 00:05:40,432
So that's a little bit of kind of a high-level
big picture view of unsupervised learning.
54
00:05:40,432 --> 00:05:44,155
And today we'll focus more
specifically on generative models,
55
00:05:44,155 --> 00:05:52,933
which is a class of models for unsupervised learning where given training
data our goal is to try and generate new samples from the same distribution.
56
00:05:52,933 --> 00:05:57,686
Right, so we have training data over here
generated from some distribution P data
57
00:05:57,686 --> 00:06:04,955
and we want to learn a model, P model to
generate samples from the same distribution
58
00:06:04,955 --> 00:06:09,854
and so we want to learn P
model to be similar to P data.
59
00:06:09,854 --> 00:06:12,636
And generative models
address density estimation.
60
00:06:12,636 --> 00:06:22,180
So this problem that we saw earlier of trying to estimate the underlying
distribution of your training data which is a core problem in unsupervised learning.
61
00:06:22,180 --> 00:06:25,190
And we'll see that there's
several flavors of this.
62
00:06:25,190 --> 00:06:33,353
We can use generative models to do explicit density estimation
where we're going to explicitly define and solve for our P model
63
00:06:35,045 --> 00:06:37,610
or we can also do implicit
density estimation
64
00:06:37,610 --> 00:06:45,035
where in this case we'll learn a model that can produce
samples from P model without explicitly defining it.
65
00:06:47,700 --> 00:06:54,096
So, why do we care about generative models? Why is this a
really interesting core problem in unsupervised learning?
66
00:06:54,096 --> 00:06:57,451
Well there's a lot of things that
we can do with generative models.
67
00:06:57,451 --> 00:07:04,659
If we're able to create realistic samples from the data distributions
that we want we can do really cool things with this, right?
68
00:07:04,659 --> 00:07:14,568
We can generate just beautiful samples to start with, so on the left you can
see completely new samples just generated by these generative models.
69
00:07:14,568 --> 00:07:21,042
Also in the center here are generated samples of
images. We can also do tasks like super-resolution,
70
00:07:21,042 --> 00:07:32,145
and colorization, so hallucinating or filling in these edges with
generated ideas of colors and what the purse should look like.
71
00:07:32,145 --> 00:07:41,619
We can also use generative models of time-series data for simulation and
planning, and so this will be useful for reinforcement learning applications,
72
00:07:41,619 --> 00:07:45,089
and we'll talk a bit more about
reinforcement learning in a later lecture.
73
00:07:45,089 --> 00:07:50,261
And training generative models can also
enable inference of latent representations.
74
00:07:50,261 --> 00:07:57,435
Learning latent features that can be useful
as general features for downstream tasks.
75
00:07:59,059 --> 00:08:05,688
So if we look at types of generative models
these can be organized into the taxonomy here
76
00:08:05,688 --> 00:08:13,180
where we have these two major branches that we talked
about, explicit density models and implicit density models.
77
00:08:13,180 --> 00:08:19,062
And then we can also get down into many
of these other sub categories.
78
00:08:19,062 --> 00:08:27,814
This figure is adapted
from a tutorial on GANs by Ian Goodfellow,
79
00:08:27,814 --> 00:08:36,861
and so if you're interested in these different taxonomies and categorizations
of generative models this is a good resource that you can take a look at.
80
00:08:36,861 --> 00:08:45,645
But today we're going to discuss three of the most popular types
of generative models that are in use and in research today.
81
00:08:45,645 --> 00:08:49,475
And so we'll talk first briefly
about pixelRNNs and CNNs
82
00:08:49,475 --> 00:08:52,162
And then we'll talk about
variational autoencoders.
83
00:08:52,162 --> 00:08:55,661
These are both types of
explicit density models.
84
00:08:55,661 --> 00:08:57,494
One that's using a tractable density
85
00:08:57,494 --> 00:09:01,312
and another that's using
an approximate density
86
00:09:01,312 --> 00:09:05,614
And then we'll talk about
generative adversarial networks,
87
00:09:05,614 --> 00:09:09,781
GANs which are a type of
implicit density estimation.
88
00:09:12,152 --> 00:09:16,304
So let's first talk
about pixelRNNs and CNNs.
89
00:09:16,304 --> 00:09:20,015
So these are a type of fully
visible belief networks
90
00:09:20,015 --> 00:09:22,432
which are modeling a density explicitly
91
00:09:22,432 --> 00:09:34,941
So in this case we have image data X and we want to model the probability
or likelihood of this image, P of X. Right, and so for these kinds of models,
92
00:09:34,941 --> 00:09:40,384
we use the chain rule to decompose this likelihood
into a product of one-dimensional distributions.
93
00:09:40,384 --> 00:09:43,493
So we have here the
probability of each pixel Xi
94
00:09:43,493 --> 00:09:47,871
conditioned on all previous
pixels X1 through Xi-1,
95
00:09:47,871 --> 00:09:58,073
and your joint likelihood of all the pixels in your image is
going to be the product of all of these per-pixel likelihoods together.
96
00:09:58,073 --> 00:10:08,938
And then once we define this likelihood, in order to train this model we can
just maximize the likelihood of our training data under this defined density.
97
00:10:10,980 --> 00:10:20,833
So if we look at this distribution over pixel values, right, we have this P of
Xi given all the previous pixel values, well this is a really complex distribution.
98
00:10:20,833 --> 00:10:22,700
So how can we model this?
99
00:10:22,700 --> 00:10:29,042
Well we've seen before that if we want to have complex
transformations we can do these using neural networks.
100
00:10:29,042 --> 00:10:32,828
Neural networks are a good way to
express complex transformations.
101
00:10:32,828 --> 00:10:42,300
And so what we'll do is we'll use a neural network to express
this complex function that we have of the distribution.
102
00:10:43,235 --> 00:10:44,796
And one thing you'll see here is that,
103
00:10:44,796 --> 00:10:51,212
okay even if we're going to use a neural network for this another
thing we have to take care of is how do we order the pixels.
104
00:10:51,212 --> 00:10:58,886
Right, I said here that we have a distribution for P of Xi given
all previous pixels, but what does "all previous pixels" mean?
105
00:10:58,886 --> 00:11:01,303
So we'll take a look at that.
106
00:11:03,336 --> 00:11:06,669
So PixelRNN was a model proposed in 2016
107
00:11:07,595 --> 00:11:17,657
that basically defines a way for setting up and
optimizing this problem and so how this model works is
108
00:11:17,657 --> 00:11:21,187
that we're going to generate pixels
starting in a corner of the image.
109
00:11:21,187 --> 00:11:31,050
So we can look at this grid as basically the pixels of your image and so
what we're going to do is start from the pixel in the upper left-hand corner
110
00:11:31,050 --> 00:11:37,195
and then we're going to sequentially generate pixels based
on these connections from the arrows that you can see here.
111
00:11:37,195 --> 00:11:44,332
And each of the dependencies on the previous pixels
in this ordering is going to be modeled using an RNN
112
00:11:44,332 --> 00:11:48,092
or more specifically an LSTM which
we've seen before in lecture.
113
00:11:48,092 --> 00:11:55,242
Right, so using this we can basically continue to
move forward, just moving down along this diagonal
114
00:11:55,242 --> 00:12:01,244
and generating all of these pixel values dependent
on the pixels that they're connected to.
115
00:12:01,244 --> 00:12:08,736
And so this works really well, but the drawback here is this
sequential generation, right, it's actually quite slow to do this.
116
00:12:08,736 --> 00:12:15,061
You can imagine, you know, if you're going to generate a new image, instead
of the feed-forward passes that we've seen with CNNs,
117
00:12:15,061 --> 00:12:20,952
here we're going to have to iteratively go through
and generate all of these pixels.
118
00:12:24,044 --> 00:12:30,575
So a little bit later, after pixelRNN,
another model called pixelCNN was introduced.
119
00:12:30,575 --> 00:12:34,570
And this has a very
similar setup to pixelRNN,
120
00:12:34,570 --> 00:12:43,074
and we're still going to do this image generation starting from the corner of
the image and expanding outwards, but the difference is that now instead of using
121
00:12:43,074 --> 00:12:47,752
an RNN to model all these dependencies,
we're going to use a CNN instead.
122
00:12:47,752 --> 00:12:52,179
And we're now going to use a
CNN over a context region
123
00:12:52,179 --> 00:12:56,384
that you can see here around the particular
pixel that we're going to generate next.
124
00:12:56,384 --> 00:13:09,313
Right so we take the pixels around it, this gray area within the region that's already been
generated and then we can pass this through a CNN and use that to generate our next pixel value.
125
00:13:11,041 --> 00:13:18,055
And so what this is going to give us
is a CNN, a neural network, at each pixel location,
126
00:13:18,055 --> 00:13:22,967
right, and so the output of this is going to be
a softmax loss over the pixel values here.
127
00:13:22,967 --> 00:13:31,193
In this case we have values 0 to 255, and then we can train
this by maximizing the likelihood of the training images.
128
00:13:31,193 --> 00:13:43,482
Right, so we say that basically we want to take a training image, we're going to
do this generation process, and at each pixel location we have the ground truth
129
00:13:43,482 --> 00:13:53,976
training image value that we have here, and this is basically the label, the
classification label, that we want our pixel to be, which of these 256 values,
130
00:13:53,976 --> 00:13:56,723
and we can train this
using a softmax loss.
131
00:13:56,723 --> 00:14:05,597
Right and so basically the effect of doing this is that we're going to
maximize the likelihood of our training data pixels being generated.
132
00:14:05,597 --> 00:14:08,413
Okay any questions about this?
Yes.
133
00:14:08,413 --> 00:14:12,159
[student's words obscured
due to lack of microphone]
134
00:14:12,159 --> 00:14:18,675
Yeah, so the question is, I thought we were talking about unsupervised
learning, why do we have basically a classification label here?
135
00:14:18,675 --> 00:14:24,970
The reason is that this loss, this output that
we have is the value of the input training data.
136
00:14:24,970 --> 00:14:26,983
So we have no external labels, right?
137
00:14:26,983 --> 00:14:38,533
We didn't go and have to manually collect any labels for this, we're just taking
our input data and saying that this is what we use for the loss function.
138
00:14:41,199 --> 00:14:45,366
[student's words obscured
due to lack of microphone]
139
00:14:47,998 --> 00:14:50,746
The question is, is
this like bag of words?
140
00:14:50,746 --> 00:14:53,109
I would say it's not really bag of words,
141
00:14:53,109 --> 00:15:01,466
it's more saying that we're outputting a distribution over
pixel values at each location of our image, right, and what we want to do
142
00:15:01,466 --> 00:15:10,442
is we want to maximize the likelihood of our input,
our training data being produced, being generated.
143
00:15:10,442 --> 00:15:15,761
Right so, in that sense, this is why it's
using our input data to create our loss.
144
00:15:21,006 --> 00:15:24,904
So with pixelCNN, training
is faster than with pixelRNN,
145
00:15:24,904 --> 00:15:34,301
because here, right, at every pixel location we want
to maximize the likelihood of our training data
146
00:15:34,301 --> 00:15:40,739
showing up and so we have all of these values already right,
just from our training data and so we can do this much
147
00:15:40,739 --> 00:15:47,296
faster, but at generation time, at test time, we want to
generate a completely new image, right, just starting from
148
00:15:47,296 --> 00:15:59,197
the corner, and we're not trying to do any type of learning, so at generation time
we still have to generate each of these pixel locations before we can generate the next.
149
00:15:59,197 --> 00:16:03,025
And so generation here is still slow
even though training time is faster.
150
00:16:03,025 --> 00:16:04,204
Question.
151
00:16:04,204 --> 00:16:08,365
[student's words obscured
due to lack of microphone]
152
00:16:08,365 --> 00:16:14,077
So the question is, is this trained distribution
sensitive to what you pick for the first pixel?
153
00:16:14,077 --> 00:16:21,208
Yeah, so it is dependent on what you have as the initial pixel
distribution and then everything is conditioned based on that.
154
00:16:23,203 --> 00:16:32,171
So again, how do you pick this distribution? So at training time you have
these distributions from your training data and then at generation time
155
00:16:32,171 --> 00:16:38,368
you can just initialize this with either uniform
or from your training data, however you want.
156
00:16:38,368 --> 00:16:42,553
And then once you have that everything
else is conditioned based on that.
157
00:16:42,553 --> 00:16:43,912
Question.
158
00:16:43,912 --> 00:16:48,079
[student's words obscured
due to lack of microphone]
159
00:17:07,415 --> 00:17:14,146
Yeah, so the question is, do we have to define this in this
chain rule fashion, or could we predict all the pixels at one time?
160
00:17:14,146 --> 00:17:17,884
And so we'll see, we'll see
models later that do do this,
161
00:17:17,884 --> 00:17:27,868
but what the chain rule allows us to do is define this very tractable
density that we can then basically optimize, directly optimizing the likelihood.
162
00:17:31,864 --> 00:17:39,606
Okay so these are some examples of generations from
this model and so here on the left you can see
163
00:17:39,606 --> 00:17:48,846
generations where the training data is the CIFAR-10 dataset. And so you can
see that in general they are starting to capture statistics of natural images.
164
00:17:48,846 --> 00:17:56,848
You can see general types of blobs and kind of things
that look like parts of natural images coming out.
165
00:17:56,848 --> 00:18:02,768
On the right here it's ImageNet, we can again see samples
from here and these are starting to look like natural images
166
00:18:05,060 --> 00:18:09,966
but they're still not, there's
still room for improvement.
167
00:18:09,966 --> 00:18:17,059
You can still see that there are differences, obviously, with the real
training images, and some of the semantics are not clear here.
168
00:18:19,371 --> 00:18:27,020
So, to summarize this, pixelRNNs and CNNs allow
you to explicitly compute the likelihood P of X.
169
00:18:27,020 --> 00:18:29,297
It's an explicit density
that we can optimize.
170
00:18:29,297 --> 00:18:34,043
And being able to do this also has another
benefit of giving a good evaluation metric.
171
00:18:34,043 --> 00:18:40,958
You know you can kind of measure how good your samples
are by this likelihood of the data that you can compute.
172
00:18:40,958 --> 00:18:47,043
And it's able to produce pretty good samples
but it's still an active area of research
173
00:18:47,043 --> 00:18:53,760
and the main disadvantage of these methods is that the
generation is sequential and so it can be pretty slow.
174
00:18:53,760 --> 00:18:59,324
And these kinds of methods have also been
used for generating audio for example.
175
00:18:59,324 --> 00:19:08,170
And you can look online for some pretty interesting examples of this, but
again the drawback is that it takes a long time to generate these samples.
176
00:19:08,170 --> 00:19:14,565
And so there's been a lot of work since
then on improving pixelCNN performance,
177
00:19:14,565 --> 00:19:22,346
all kinds of different architecture changes, changes to the loss function,
formulating this differently, and different types of training tricks.
178
00:19:22,346 --> 00:19:29,495
And so if you're interested in learning more about
this you can look at some of these papers on PixelCNN
179
00:19:29,495 --> 00:19:35,115
and then PixelCNN++, an
improved version that came out this year.
180
00:19:37,455 --> 00:19:44,321
Okay, so now we're going to talk about another type
of generative model called variational autoencoders.
181
00:19:44,321 --> 00:19:52,204
And so far we saw that pixelCNNs defined a tractable
density function, right, using this definition,
182
00:19:52,204 --> 00:19:58,365
and based on that we can directly
optimize the likelihood of the training data.
183
00:19:59,419 --> 00:20:04,195
So with variational autoencoders now we're going
to define an intractable density function.
184
00:20:04,195 --> 00:20:10,769
We're now going to model this with an additional latent
variable Z and we'll talk in more detail about how this looks.
185
00:20:10,769 --> 00:20:17,886
And so our data likelihood P of X
now basically has to be this integral, right,
186
00:20:17,886 --> 00:20:21,422
taking the expectation over
all possible values of Z.
187
00:20:21,422 --> 00:20:26,909
And so this now is going to be a problem. We'll
see that we cannot optimize this directly.
188
00:20:26,909 --> 00:20:33,706
And so instead what we have to do is we have to derive
and optimize a lower bound on the likelihood instead.
189
00:20:33,706 --> 00:20:34,956
Yeah, question.
190
00:20:35,864 --> 00:20:37,592
So the question is, what is Z?
191
00:20:37,592 --> 00:20:42,862
Z is a latent variable and I'll go
through this in much more detail.
192
00:20:44,479 --> 00:20:48,538
So let's talk about some background first.
193
00:20:48,538 --> 00:20:54,733
Variational autoencoders are related to a type of
unsupervised learning model called autoencoders.
194
00:20:54,733 --> 00:21:00,965
And so we'll first talk a little bit more about autoencoders
and what they are, and then I'll explain how variational
195
00:21:00,965 --> 00:21:05,851
autoencoders are related and build off
of this and allow you to generate data.
196
00:21:05,851 --> 00:21:09,168
So with autoencoders we don't
use this to generate data,
197
00:21:09,168 --> 00:21:15,719
but it's an unsupervised approach for learning a lower
dimensional feature representation from unlabeled training data.
198
00:21:15,719 --> 00:21:21,550
All right so in this case we have our input data X and then
we're going to want to learn some features that we call Z.
199
00:21:22,541 --> 00:21:29,605
And then we'll have an encoder that's going to be a mapping,
a function mapping from this input data to our feature Z.
200
00:21:30,911 --> 00:21:33,905
And this encoder can take
many different forms right,
201
00:21:33,905 --> 00:21:41,239
and we would generally use neural networks. These models,
autoencoders, have been around for a long time.
202
00:21:41,239 --> 00:21:45,803
So in the 2000s we used linear
layers plus non-linearities,
203
00:21:45,803 --> 00:21:54,389
then later on we had fully connected deeper networks and then
after that we moved on to using CNNs for these encoders.
204