% Options for packages loaded elsewhere
\PassOptionsToPackage{unicode}{hyperref}
\PassOptionsToPackage{hyphens}{url}
\PassOptionsToPackage{dvipsnames,svgnames*,x11names*}{xcolor}
%
\documentclass[
11pt,
]{book}
\usepackage{amsmath,amssymb}
\usepackage[]{mathpazo}
\usepackage{ifxetex,ifluatex}
\ifnum 0\ifxetex 1\fi\ifluatex 1\fi=0 % if pdftex
\usepackage[T1]{fontenc}
\usepackage[utf8]{inputenc}
\usepackage{textcomp} % provide euro and other symbols
\else % if luatex or xetex
\usepackage{unicode-math}
\defaultfontfeatures{Scale=MatchLowercase}
\defaultfontfeatures[\rmfamily]{Ligatures=TeX,Scale=1}
\fi
% Use upquote if available, for straight quotes in verbatim environments
\IfFileExists{upquote.sty}{\usepackage{upquote}}{}
\IfFileExists{microtype.sty}{% use microtype if available
\usepackage[]{microtype}
\UseMicrotypeSet[protrusion]{basicmath} % disable protrusion for tt fonts
}{}
\makeatletter
\@ifundefined{KOMAClassName}{% if non-KOMA class
\IfFileExists{parskip.sty}{%
\usepackage{parskip}
}{% else
\setlength{\parindent}{0pt}
\setlength{\parskip}{6pt plus 2pt minus 1pt}}
}{% if KOMA class
\KOMAoptions{parskip=half}}
\makeatother
\usepackage{xcolor}
\IfFileExists{xurl.sty}{\usepackage{xurl}}{} % add URL line breaks if available
\IfFileExists{bookmark.sty}{\usepackage{bookmark}}{\usepackage{hyperref}}
\hypersetup{
pdftitle={Data Science Project Guide: Spotify},
pdfauthor={TechAcademy e.V.},
colorlinks=true,
linkcolor=Maroon,
filecolor=Maroon,
citecolor=Blue,
urlcolor=blue,
pdfcreator={LaTeX via pandoc}}
\urlstyle{same} % disable monospaced font for URLs
\usepackage[top=0.5in, bottom=1.5in, left=1in, right=1in, a4paper]{geometry}
\usepackage{longtable,booktabs,array}
\usepackage{calc} % for calculating minipage widths
% Correct order of tables after \paragraph or \subparagraph
\usepackage{etoolbox}
\makeatletter
\patchcmd\longtable{\par}{\if@noskipsec\mbox{}\fi\par}{}{}
\makeatother
% Allow footnotes in longtable head/foot
\IfFileExists{footnotehyper.sty}{\usepackage{footnotehyper}}{\usepackage{footnote}}
\makesavenoteenv{longtable}
\usepackage{graphicx}
\makeatletter
\def\maxwidth{\ifdim\Gin@nat@width>\linewidth\linewidth\else\Gin@nat@width\fi}
\def\maxheight{\ifdim\Gin@nat@height>\textheight\textheight\else\Gin@nat@height\fi}
\makeatother
% Scale images if necessary, so that they will not overflow the page
% margins by default, and it is still possible to overwrite the defaults
% using explicit options in \includegraphics[width, height, ...]{}
\setkeys{Gin}{width=\maxwidth,height=\maxheight,keepaspectratio}
% Set default figure placement to htbp
\makeatletter
\def\fps@figure{htbp}
\makeatother
\setlength{\emergencystretch}{3em} % prevent overfull lines
\providecommand{\tightlist}{%
\setlength{\itemsep}{0pt}\setlength{\parskip}{0pt}}
\setcounter{secnumdepth}{5}
\newcommand{\sectionbreak}{\clearpage}
\usepackage{booktabs}
\RequirePackage{fix-cm}
\usepackage[many]{tcolorbox}
\usepackage{xcolor}
\definecolor{r}{HTML}{2369bd}
\definecolor{p}{HTML}{ffde4d}
\definecolor{boxgrey}{HTML}{fefefe}
\newtcolorbox{rbox}{
colback=boxgrey,
colframe=r,
coltext=black,
boxsep=5pt,
arc=4pt}
\newtcolorbox{pbox}{
colback=boxgrey,
colframe=p,
coltext=black,
boxsep=5pt,
arc=4pt}
\newenvironment{tips}[1]
{
\begin{itemize}
\footnotesize
\renewcommand{\labelitemi}{
\raisebox{-.7\height}[0pt][0pt]{
{\setkeys{Gin}{width=3em,keepaspectratio}
\includegraphics{images/#1.png}}
}
}
\setlength{\fboxsep}{1em}
\begin{rbox}
\item
}
{
\end{rbox}
\end{itemize}
}
\newenvironment{tipsp}[1]
{
\begin{itemize}
\footnotesize
\renewcommand{\labelitemi}{
\raisebox{-.7\height}[0pt][0pt]{
{\setkeys{Gin}{width=3em,keepaspectratio}
\includegraphics{images/#1.png}}
}
}
\setlength{\fboxsep}{1em}
\begin{pbox}
\item
}
{
\end{pbox}
\end{itemize}
}
\usepackage{fancyhdr}
\pagestyle{fancyplain}
\usepackage{setspace}
\usepackage{chngcntr}
\onehalfspacing
\usepackage{titling}
\pretitle{\begin{center}\LARGE\includegraphics[width=6cm]{plots/TA_Logo.png}\\[\bigskipamount]}
\posttitle{\end{center}}
\ifluatex
\usepackage{selnolig} % disable illegal ligatures
\fi
\usepackage[]{natbib}
\bibliographystyle{apalike}
\title{Data Science Project Guide: Spotify}
\author{TechAcademy e.V.}
\date{Summer Term 2021}
\begin{document}
\maketitle
\clearpage
\addtolength{\headheight}{17.82275pt}
\rhead{\includegraphics[height=0.5cm]{plots/TA_logo.png}}
\fancyfoot{}
\fancyfoot[R]{\thepage}
\fancyfoot[L]{Data Science Project Guide | Spotify | \copyright\ 2021, TechAcademy e.V.}
\renewcommand{\headrulewidth}{0.25pt}
\renewcommand{\footrulewidth}{0.25pt}
\tableofcontents
\clearpage
\hypertarget{welcome}{%
\chapter{Welcome!}\label{welcome}}
In the first few chapters, you will be introduced to the basics of the \texttt{R} and \texttt{Python} tracks respectively, and you will find helpful explanations for questions you might have at the beginning of your coding journey. A quick introduction to the Data Science track follows, so that you can get started with the project right away. So let's begin with the basics!
In all tracks you will work on your project in small groups of fellow students. This not only helps you get the project done faster, it also makes your results better. Our experience shows that teams whose members bring different backgrounds and discuss different opinions and ideas produce the best results. Besides, it is of course more fun to work on a project together than to code alone!\\
The groups can consist of a maximum of four members. You can choose your teammates yourselves; we won't interfere with your arrangements. It is important that all group members complete the same difficulty level (beginner or advanced), since the scope of the tasks differs between levels. We explicitly encourage you to collaborate with students from other departments. This not only allows you to get to know people from other fields, but may even give you a whole new perspective on the project and its tasks.\\
When submitting, it is important to note: to receive a certificate, each person must submit the project individually. However, the submission can be identical within your group. You will get more information at our first Coding Meetup on \textbf{May 19, 2021}.
This Spotify case study and the associated project guide were developed and written entirely from scratch by TechAcademy's Data Science team. Karlo Lukic and Lara Zaremba developed the project in \texttt{R}, while Felix Schneider and Manuel Mair am Tinkhof developed it in \texttt{Python}.
\hypertarget{whats-data-science-and-how-do-i-do-it}{%
\chapter{What's Data Science and How Do I Do It?}\label{whats-data-science-and-how-do-i-do-it}}
Data science is a multi-layered field in which the use of the latest machine learning methods is only one sub-area. Many steps come before that -- from collecting to manipulating to exploring the data. And eventually, you will need to communicate your findings.
But first things first. To analyze data, it must first be obtained. You need to know where to find it and how to load it into your respective tools. Data is rarely available in the exact form needed for further processing. Familiarizing yourself with the available information, cleaning it up, and transforming it into formats that both humans and machines can read are important steps that often make up a large part of the work.
Before the obtained data can be analyzed, the right tool must be selected and mastered: the programming language. The languages most commonly used for data science are \texttt{R}, which was developed explicitly for statistics, and \texttt{Python}, which is characterized by its additional versatility. A data scientist does not have to be a perfect software developer who masters every detail and paradigm, but competent handling of the syntax and its idiosyncrasies is essential.
There are some well-developed method collections, so-called packages or libraries, which provide a lot of functionality. The use of these collections also has to be learned and mastered.
Once all of this is achieved, the data can finally be analyzed. Here too, it is important to know and understand the multitude of statistical approaches in order to choose the right method for the problem at hand. The newest, best, most beautiful neural network is not always the solution for everything.\\
One step is still missing in the data science process: understanding and communicating the results. The results are often not intuitive at first glance, and sometimes even surprising. This is where domain expertise and creativity come into play, especially in visualization.
\hypertarget{whats-r}{%
\section{What's R?}\label{whats-r}}
\texttt{R} is a programming language that was developed by statisticians in the early 1990s for use in the calculation and visualization of statistical applications. A lot has happened since then and by now, \texttt{R} is one of the most widely used programming languages in the field of data science. Code in \texttt{R} does not have to be compiled, but can be used interactively and dynamically. This makes it possible to quickly gain basic knowledge about existing data and to display it graphically.\\
\texttt{R} offers much more than just a programming language: it is a complete system for solving statistical problems. A large number of packages and interfaces are available that expand its functionality and enable integration into other applications.
\hypertarget{rstudio-cloud}{%
\subsection{RStudio Cloud}\label{rstudio-cloud}}
Before you can use \texttt{R}, you usually have to install some separate programs locally on your computer. Typically, you first install a ``raw'' version of \texttt{R}. In theory, you can then already start programming. However, it is very difficult to carry out an entire project with it. That's why there is RStudio, an Integrated Development Environment (IDE) for \texttt{R}. It includes many essential features that simplify programming with \texttt{R}, such as code auto-completion, a clearly structured user interface, and many extension options.
Experience has shown that installing \texttt{R} and RStudio locally takes some effort. Fortunately, RStudio also has a cloud solution that eliminates these steps: RStudio Cloud. There it is possible to edit your project in exactly the same IDE in the browser without any prior installations. You can also easily switch your project from private to public and give your team an insight into your code via a link or by giving them access to the workspace directly. In this way you are able to easily exchange ideas within your team.
We will introduce RStudio Cloud and unlock access to our workspace at our first Coding Meetup. Until then, focus on learning the ``hard skills'' of programming with your courses on DataCamp. This brings us to your curriculum in the next section.
\hypertarget{curriculum}{%
\subsection{Curriculum}\label{curriculum}}
The following list shows the required DataCamp courses for the Data Science with \texttt{R} track. As a beginner, please stick to the courses of the ``beginner'' program; ambitious beginners can of course also take the advanced courses afterwards. However, the courses should be worked through in the order in which they are listed.\\
The same applies to the advanced courses. Here, too, the specified courses should be completed in the given order. Since you may already have mastered the topics of an advanced course, individual courses can be replaced. The topics of the advanced courses are given as key points; if these seem familiar to you, take a look at the table of contents of the corresponding DataCamp course. If you are convinced that a course does not provide any added value for you, it can be replaced by one of the courses in the ``Exchange Pool'' (see list). However, such an exchange course should not be started until all other courses in the advanced curriculum have been completed.
Both beginners and advanced learners must have completed at least two thirds of the curriculum in order to receive the certificate. For the beginners this means at least up to the course ``Data Visualization with ggplot2 (Part 1)'' and for the advanced at least up to ``Supervised Learning in \texttt{R}: Classification''. In addition, at least two thirds of the project tasks must have been completed.
\begin{tips}r
\textbf{R Fundamentals (Beginner)}
\begin{itemize}
\tightlist
\item
\href{https://www.datacamp.com/courses/free-introduction-to-r}{Introduction to R (4h)}
\item
\href{https://www.datacamp.com/courses/intermediate-r}{Intermediate R (6h)}
\item
\href{https://www.datacamp.com/courses/importing-data-in-r-part-1}{Introduction to Importing Data in R (3h)}
\item
\href{https://www.datacamp.com/courses/cleaning-data-in-r}{Cleaning Data in R (4h)}
\item
\href{https://www.datacamp.com/courses/data-manipulation-with-dplyr-in-r}{Data Manipulation with dplyr (4h)}
\item
\href{https://www.datacamp.com/courses/data-visualization-with-ggplot2-1}{Data Visualization with ggplot2 (Part 1) (5h)}
\item
\href{https://www.datacamp.com/courses/exploratory-data-analysis}{Exploratory Data Analysis in R (4h)}
\item
\href{https://www.datacamp.com/courses/correlation-and-regression}{Correlation and Regression in R (4h)}
\item
\href{https://www.datacamp.com/courses/multiple-and-logistic-regression}{Multiple and Logistic Regression in R (4h)}
\end{itemize}
\textbf{Machine Learning Fundamentals in R (Advanced)}
\begin{itemize}
\tightlist
\item
\href{https://www.datacamp.com/courses/intermediate-r}{Intermediate R (6h)}: conditionals, loops, functions, apply
\item
\href{https://www.datacamp.com/courses/importing-data-in-r-part-1}{Introduction to Importing Data in R (3h)}: utils, readr, data.table, XLConnect
\item
\href{https://www.datacamp.com/courses/cleaning-data-in-r}{Cleaning Data in R (4h)}: raw data, tidying \& preparing data
\item
\href{https://www.datacamp.com/courses/importing-cleaning-data-in-r-case-studies}{Importing \& Cleaning Data in R: Case Studies (4h)}: case studies
\item
\href{https://www.datacamp.com/courses/data-visualization-with-ggplot2-1}{Data Visualization with ggplot2 (Part 1) (5h)}: aesthetics, geometries, qplot
\item
\href{https://www.datacamp.com/courses/supervised-learning-in-r-classification}{Supervised Learning in R: Classification (4h)}: kNN, naive bayes, logistic regression, classification trees
\item
\href{https://www.datacamp.com/courses/supervised-learning-in-r-regression}{Supervised learning in R: Regression (4h)}: linear \& non-linear regression, tree-based methods
\item
\href{https://www.datacamp.com/courses/unsupervised-learning-in-r}{Unsupervised Learning in R (4h)}: k-means, clustering, dimensionality reduction
\item
\href{https://www.datacamp.com/courses/machine-learning-toolbox}{Machine Learning with caret in R (4h)}: train()-function, cross-validation, auc
\end{itemize}
\textbf{Data Science R (Advanced) -- Exchange Pool}
\begin{itemize}
\tightlist
\item
\href{https://www.datacamp.com/courses/data-visualization-with-ggplot2-2}{Data Visualization with ggplot2 (Part 2) (5h)}
\item
\href{https://www.datacamp.com/courses/interactive-maps-with-leaflet-in-r}{Interactive Maps with leaflet in R (4h)}
\item
\href{https://www.datacamp.com/courses/machine-learning-in-the-tidyverse}{Machine Learning in Tidyverse (5h)}
\item
\href{https://www.datacamp.com/courses/writing-efficient-r-code}{Writing Efficient R Code (4h)}
\item
\href{https://www.datacamp.com/courses/support-vector-machines-in-r}{Support Vector Machines in R (4h)}
\item
\href{https://www.datacamp.com/courses/supervised-learning-in-r-case-studies}{Supervised Learning in R: Case Studies (4h)}
\item
\href{https://www.datacamp.com/courses/optimizing-r-code-with-rcpp}{Optimizing R Code with Rcpp (4h)}
\end{itemize}
\end{tips}
\hypertarget{links}{%
\subsection{Links}\label{links}}
\begin{itemize}
\tightlist
\item
RStudio Cheat Sheets: \url{https://rstudio.cloud/learn/cheat-sheets}
\item
RMarkdown Explanation (to document your analyses): \url{https://rmarkdown.rstudio.com/lesson-1.html}
\item
StackOverflow (forum for all kinds of coding questions): \url{https://stackoverflow.com/}
\item
CrossValidated (Statistics and Data Science forum): \url{https://stats.stackexchange.com/}
\end{itemize}
\hypertarget{whats-python}{%
\section{What's Python?}\label{whats-python}}
\texttt{Python} is a dynamic programming language. Code is executed by an interpreter, which means it does not have to be compiled first. This makes \texttt{Python} very easy and quick to use. Good usability, easy readability, and simple structuring were and still are core ideas in the development of this programming language.
Basically, you can use \texttt{Python} to program according to any paradigm. Structured and object-oriented programming are the most natural fits given the design of the language, but functional or aspect-oriented programming is also possible. These options give users great freedom to design projects the way they want -- but also great freedom to write code that is confusing and difficult to understand. For this reason, certain standards based on the so-called \texttt{Python} Enhancement Proposals (PEPs) have developed over the years.
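To make the multi-paradigm point concrete, here is a small sketch of our own (not taken from the course material) showing the same computation written once in an imperative style and once in a functional style:

```python
# The same computation, written in two of the paradigms Python supports.

# Structured / imperative style: build the list step by step in a loop.
squares_loop = []
for n in range(5):
    squares_loop.append(n * n)

# Functional style: a list comprehension expresses the whole mapping at once.
squares_comp = [n * n for n in range(5)]

# Both produce [0, 1, 4, 9, 16].
```

Both versions are equally valid \texttt{Python}; which one is clearer depends on the task at hand, which is exactly why style conventions such as PEP 8 matter.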
\hypertarget{anaconda-and-jupyter}{%
\subsection{Anaconda and Jupyter}\label{anaconda-and-jupyter}}
Before you can use \texttt{Python}, it must be installed on your computer. \texttt{Python} comes preinstalled on Linux and Unix systems (such as macOS), but it is often an older version. Since there are differences between \texttt{Python} version 2 -- which is no longer supported -- and version 3, we decided to work with version 3.6 or higher.
One of the easiest ways to get both \texttt{Python} and most of the best known programming libraries is to install Anaconda. There are detailed explanations for the installation on all operating systems on the \href{https://docs.anaconda.com/anaconda/install/}{website} of the provider.
With Anaconda installed, all you have to do is open the Anaconda Navigator and you're ready to go. There are two ways to get started: Spyder or Jupyter. Spyder is an integrated development environment (IDE) for \texttt{Python} and offers everything from syntax highlighting to debugging (links to tutorials below). The other option is to use Jupyter notebooks, a browser-based interface for executing commands. Their big advantage is that you can quickly write short pieces of code and try them out interactively without writing an entire executable program.
Now you can get started! If you have not worked with Jupyter before, we recommend that you complete the course on DataCamp (\url{https://www.datacamp.com/projects/33}) first. There you will get to know many tips and tricks that will make your workflow with Jupyter much easier.
In order to make your work and, above all, your collaboration easier, we will be working with Google Colab, a platform that provides a Jupyter environment with the necessary libraries. You can import all the data necessary for the project via Google Drive. We will introduce this environment during our first Coding Meetup. Until then, focus on learning the ``hard skills'' of programming with your courses on DataCamp. This brings us to your curriculum in the next section.
\hypertarget{curriculum-1}{%
\subsection{Curriculum}\label{curriculum-1}}
The following list shows the DataCamp courses for the \texttt{Python} data science track. As a beginner, please follow the courses for the beginner level. These should be completed in the order in which they are listed.\\
The same applies to the advanced courses. Here, too, the specified courses should be completed in the given order. Since you may already have mastered the topics of an advanced course, individual courses can be replaced. The topics of the advanced courses are given in brief; if these key points seem familiar to you, take a look at the table of contents of the corresponding DataCamp course.
If you are convinced that a course does not provide any added value for you, it can be replaced by one of the courses in the ``Exchange Pool'' (see list). However, such an exchange course should not be started until all other courses in the advanced \texttt{Python} curriculum have been completed.
Both beginners and advanced learners must have completed at least two thirds of the curriculum in order to receive the certificate. For beginners this means at least up to and including \href{https://learn.datacamp.com/courses/merging-dataframes-with-pandas}{Merging DataFrames with pandas (4h)}, and for advanced learners at least up to and including \href{https://learn.datacamp.com/courses/exploratory-data-analysis-in-python}{Exploratory Data Analysis in Python (4h)}. In addition, at least two thirds of the project tasks have to be completed.
\begin{tipsp}p
\textbf{Python Fundamentals (Beginner) }
\begin{itemize}
\tightlist
\item
\href{https://www.datacamp.com/courses/introduction-to-data-science-in-python}{Introduction to Data Science in Python (4h)}
\item
\href{https://www.datacamp.com/courses/intermediate-python-for-data-science}{Intermediate Python (4h)}
\item
\href{https://www.datacamp.com/courses/python-data-science-toolbox-part-1}{Python Data Science Toolbox (Part 1) (3h)}
\item
\href{https://www.datacamp.com/courses/introduction-to-matplotlib}{Introduction to Data Visualization with Matplotlib (4h)}
\item
\href{https://www.datacamp.com/courses/manipulating-dataframes-with-pandas}{Manipulating DataFrames with pandas (4h)}
\item
\href{https://www.datacamp.com/courses/merging-dataframes-with-pandas}{Merging DataFrames with pandas (4h)}
\item
\href{https://www.datacamp.com/courses/exploratory-data-analysis-in-python}{Exploratory Data Analysis in Python (4h)}
\item
\href{https://www.datacamp.com/projects/33}{Introduction to DataCamp Projects (2h)}
\item
\href{https://www.datacamp.com/courses/introduction-to-linear-modeling-in-python}{Introduction to Linear Modeling in Python (4h)}
\end{itemize}
\textbf{Data Science with Python (Advanced) }
\begin{itemize}
\tightlist
\item
\href{https://www.datacamp.com/courses/intermediate-python-for-data-science}{Intermediate Python (4h)}: Matplotlib, Dict, Pandas, Loops
\item
\href{https://www.datacamp.com/courses/python-data-science-toolbox-part-1}{Python Data Science Toolbox (Part 1) (3h)}: Default arguments, Lambdas, Error handling
\item
\href{https://www.datacamp.com/courses/python-data-science-toolbox-part-2}{Python Data Science Toolbox (Part 2) (4h)}: Iterators, generators, List comprehension
\item
\href{https://www.datacamp.com/courses/cleaning-data-in-python}{Cleaning Data in Python (4h)}: Using pandas for Data cleaning
\item
\href{https://www.datacamp.com/projects/82}{Exploring the Bitcoin Cryptocurrency Market (3h)}: Small project
\item
\href{https://www.datacamp.com/courses/exploratory-data-analysis-in-python}{Exploratory Data Analysis in Python (4h)}: How to start a data analysis
\item
\href{https://www.datacamp.com/courses/introduction-to-linear-modeling-in-python}{Introduction to Linear Modeling in Python (4h)}: Linear Regression, sklearn
\item
\href{https://www.datacamp.com/courses/supervised-learning-with-scikit-learn}{Supervised Learning with Scikit-Learn (4h)}: Classification, Regression, Tuning
\item
\href{https://www.datacamp.com/courses/linear-classifiers-in-python}{Linear Classifiers in Python (4h)}: Logistic regression, SVM, Loss functions
\end{itemize}
\textbf{Data Science with Python (Advanced) - Exchange Pool}
\begin{itemize}
\tightlist
\item
\href{https://www.datacamp.com/projects/684}{TV, Halftime Shows and the Big Game (4h)}
\item
\href{https://www.datacamp.com/courses/interactive-data-visualization-with-bokeh}{Interactive Data Visualization with Bokeh (4h)}
\item
\href{https://www.datacamp.com/courses/introduction-to-time-series-analysis-in-python}{Time Series Analysis (4h)}
\item
\href{https://www.datacamp.com/courses/machine-learning-for-time-series-data-in-python}{Machine Learning for Time Series Data in Python (4h)}
\item
\href{https://www.datacamp.com/courses/advanced-deep-learning-with-keras-in-python}{Advanced Deep Learning with Keras (4h)}
\item
\href{https://www.datacamp.com/courses/data-visualization-with-seaborn}{Data Visualization with Seaborn (4h)}
\item
\href{https://www.datacamp.com/courses/web-scraping-with-python}{Web Scraping in Python (4h)}
\item
\href{https://www.datacamp.com/courses/writing-efficient-python-code}{Writing Efficient Python Code (4h)}
\item
\href{https://www.datacamp.com/courses/unsupervised-learning-in-python}{Unsupervised Learning in Python (4h)}
\item
\href{https://www.datacamp.com/courses/optimizing-python-code-with-pandas}{Writing Efficient Code with pandas (4h)}
\item
\href{https://www.datacamp.com/courses/deep-learning-in-python}{Introduction to Deep Learning in Python (4h)}
\item
\href{https://www.datacamp.com/courses/forecasting-using-arima-models-in-python}{ARIMA Models in Python (4h)}
\end{itemize}
\end{tipsp}
\hypertarget{links-1}{%
\subsection{Links}\label{links-1}}
Official Tutorials/Documentation:
\begin{itemize}
\tightlist
\item
\url{https://docs.python.org/3/tutorial/index.html}
\item
\url{https://jupyter.org/documentation}
\end{itemize}
Further Explanations:
\begin{itemize}
\tightlist
\item
\url{https://pythonprogramming.net/}
\item
\url{https://automatetheboringstuff.com/}
\item
\url{https://www.reddit.com/r/learnpython}
\item
\url{https://www.datacamp.com/community/tutorials/tutorial-jupyter-notebook}
\end{itemize}
\hypertarget{your-data-science-project}{%
\section{Your Data Science Project}\label{your-data-science-project}}
\hypertarget{coding-meetups-and-requirements}{%
\subsection{Coding Meetups and Requirements}\label{coding-meetups-and-requirements}}
Now that you have learned the theoretical foundation in the DataCamp courses, you can put your skills into practice. We have put together a project for you based on real data sets. You can read about the details in the following chapters of this project guide.\\
Of course, we will also go into more detail about the project and the tools that go with it. We will discuss everything you need to know during the first Coding Meetup, which will take place on \textbf{May 19, 2021}. After that, the work on the project will officially begin.
You can find the exact project tasks together with further explanations and hints in the following chapters.
To receive the certificate, it is essential that you have solved at least two thirds of the ``Exploratory Data Analysis'' part of the project. For the advanced participants, the entire ``Price Prediction -- The Application of Statistical Models'' part is added. In addition, two thirds of the respective curriculum on DataCamp must be completed. You can find more detailed information on this in the ``Curriculum'' section of the respective programming language above.
\newpage
\hypertarget{introduction-to-your-project}{%
\chapter{Introduction to Your Project}\label{introduction-to-your-project}}
\hypertarget{purpose-of-the-project-guide}{%
\section{Purpose of the Project Guide}\label{purpose-of-the-project-guide}}
Welcome to the project guide for your TechAcademy Data Science project! This document will guide you through the different steps of your project and will provide you with valuable hints along the way. However, it is not a detailed step-by-step manual, because we believe you should develop the skill of finding your own way to solve different tasks. This is a great way to apply the knowledge and tools you have acquired through DataCamp.
Questions might come up, or you might not know how to solve a task right away---but don't worry---this is just part of coding. In those cases, you can also find helpful links in the introductory chapters, where your questions might already have been answered. If not, and in the unlikely case that even Google can't help you, the TechAcademy mentors will help you via Slack or directly during the coding meetups.
\textbf{At the end of the project guide, you will find an overview of all tasks that have to be completed}, depending on your track (beginner/advanced). You can use this list to check which tasks you still need to complete or which assignments are relevant for your track.
\hypertarget{what-is-this-project-about}{%
\section{What is this Project About?}\label{what-is-this-project-about}}
Last semester, we dealt with the business model of Airbnb. This semester, we will have a look at Spotify data! More precisely, you will first analyze a very detailed Spotify data set, after which you will get the chance to look at your own Spotify data. You will find all kinds of information in the data sets -- some of it valuable, some of it useless.
Are you already curious to see for yourself? In analogy to the typical Data Science workflow, we split this project into two parts. First, you are going to learn how to perform an Exploratory Data Analysis (EDA). You will have a closer look at the data, transform it, and get to know the different variables and what they look like in various visualizations. Beginners will have completed the project after this task. Still, it will be beneficial to try and work on the next part: In the second part of the project, you will develop a model that predicts a song's popularity as accurately as possible. You will start with a linear regression model, which you can modify as you please, and then you can explore all the other possibilities of modeling and predicting data.
But first things first: What exactly is EDA, and what can you achieve with it?
\hypertarget{exploratory-data-analysis-getting-to-know-the-data-set}{%
\section{Exploratory Data Analysis -- getting to know the data set}\label{exploratory-data-analysis-getting-to-know-the-data-set}}
As a first step, you will get to know the data set. That means you will describe the data and answer questions like ``Which variables are contained in the data set? \ldots and how are they related?''. For this, you can often use graphical tools like box plots or histograms.
This first part of the project is structured in a way that lets you get to know the data thoroughly by completing the given tasks one after the other. As a beginner, you can stop after this part because you will have fulfilled the necessary coding requirements for the certificate. However, if this first part inspires you to learn more, we encourage you to also work on the second part.
This project guide is structured in the following format. Since the concept of Data Science is independent of specific programming languages, we will describe the general approach in this part of the text. After you understand the overall picture and the tasks we ask you to do, you will find language-specific tips and tricks in visually separated boxes. If you participate in our \texttt{R}-program, you'll only need to look at the boxes with the blue border. Conversely, you only need to look at the yellow-bordered boxes if you are coding in \texttt{Python}. From time to time, it might be interesting to check out the other language -- though you can do the same in both, they sometimes have a different approach to the identical problem. It makes sense that you complete the first few beginner chapters mentioned in the introductory chapter. We recommend finishing the courses at least until---and including---\emph{Exploratory Data Analysis} for both tracks.
\hypertarget{prediction-apply-statistical-methods}{%
\section{Prediction -- Apply Statistical Methods}\label{prediction-apply-statistical-methods}}
This part is mainly for the advanced TechAcademy participants. If you are a beginner and you were able to complete the first part without too many difficulties, we highly recommend trying the second part as well. Statistical models are a significant part of data science, and this is your chance to develop skills in this area.
Having gotten to know the data in the first part, you should now be familiar enough with it to make predictions about a song's popularity based on the information you have about the songs in the data set.
For this part of the project, we recommend the advanced courses mentioned in the introductory chapter. Please note that more classes are available, so if you want to extend your skills even further, feel free and complete more courses on the topics that interest you. We recommend that you finish the lessons until---and including---\emph{Unsupervised Learning in Python} for the \texttt{Python} track and \emph{Machine Learning Toolbox} for the \texttt{R} track.
Ready? After getting a first impression of what this project is all about, let's get started!
\newpage
\hypertarget{exploratory-data-analysis}{%
\chapter{Exploratory Data Analysis}\label{exploratory-data-analysis}}
At the end of this chapter, you will look into your personal Spotify streaming history. You need to request this data from Spotify, which can take a few days, so you should request your data as early as possible. Then, start with the first exercises of this chapter, which use the data we provide.
\textbf{Download your personal Spotify data}
If you do not use Spotify personally, you can skip this step and start with chapter 4.1.
In chapter 4.5, we will use the requested personal Spotify data, and you can use the data from one of your team members. If nobody on your team uses Spotify, we will provide you with some dummy data.
We will follow eight steps based on \href{https://www.makeuseof.com/tag/download-privacy-data-spotify/}{this} blog post by Dan Price to download your personal Spotify data:
\begin{enumerate}
\def\labelenumi{\arabic{enumi}.}
\tightlist
\item
Go to \url{https://accounts.spotify.com/en/status} and enter your login credentials.
\item
In the menu on the screen's left-hand side, click on Privacy settings.
\item
Scroll down to the Download your data section. You will see the following:
\end{enumerate}
\begin{center}\includegraphics[width=1\linewidth]{splots/spotify} \end{center}
\begin{enumerate}
\def\labelenumi{\arabic{enumi}.}
\setcounter{enumi}{3}
\tightlist
\item
Click on \emph{Request}. You will probably have to complete a captcha and will receive a confirmation email.
\item
Click ``Confirm'' in the email. This confirmation will take you back to Spotify, where you will receive an on-screen notification saying: ``We are preparing your data file. This {[}process{]} can take up to 30 days to complete. You'll receive an email when it is ready to download.'' - Spotify
\item
When the data preparation is complete, return to Privacy settings; you will now have access to Step 3.
\item
Click on Download.
\end{enumerate}
In the next couple of days, you will receive an email from Spotify to download your personal Spotify data as a \texttt{my\_spotify\_data.zip} file. Unzip the file and look for the \texttt{StreamingHistory0.json} file.
\textbf{Hint}: Check what Spotify knows about you by inspecting \texttt{Userdata.json}, \texttt{SearchQueries.json}, or \texttt{Inferences.json} files using Notepad (Windows), TextEdit (MacOS), or VIM (Linux). Surprised? :)
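If you prefer code over a text editor, the \texttt{json} module from Python's standard library can open these files just as well. Below is a minimal sketch; the fields shown are made up for illustration, and your actual export will contain different (and many more) entries.

```python
import json

# Made-up stand-in for a snippet of Spotify's Userdata.json export;
# the real file has different fields and values.
raw = '{"username": "example_user", "country": "DE", "creationTime": "2016-05-01"}'

data = json.loads(raw)           # parse the JSON text into a Python dict
for key, value in data.items():  # list every stored field
    print(f"{key}: {value}")
```

With your real file, you would open it and call \texttt{json.load()} on the file handle instead of \texttt{json.loads()} on a string.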
\hypertarget{getting-started}{%
\section{Getting Started}\label{getting-started}}
In this chapter, you will apply the knowledge that you gained throughout the semester to real-world data. The exercises will cover typical problems that you usually encounter when analyzing data.
Before you can dive into the data, set up your programming environment. This environment will be the place where the magic happens -- all your coding will take place there.
\begin{tips}r
In your workspace on \href{https://rstudio.cloud/projects}{rstudio.cloud}, we have already uploaded an ``assignment'' for you (Template Spotify). When you create a new project within the workspace \emph{Class of '21 \textbar{} TechAcademy \textbar{} Data Science with R}, your workspace will open up. We've already made some arrangements for you: The data sets you will be working with throughout the project are already available in your working directory. We also created an `RMarkdown' file, with which you will be able to create a detailed report of your project. You can convert this file into an HTML document when you have completed the documentation. Open the file ``Markdown\_Spotify.Rmd'' and see for yourself!
\end{tips}
\begin{tipsp}p
We recommend using \href{https://colab.research.google.com}{Google Colab} for this project since it requires no particular setup, stores its notebooks to your Google Drive, and makes it easy for you to share them with your team members.
As an alternative to Google Colab, you might want to install Jupyter Notebook locally using the Anaconda distribution.
Either way, when importing Spotify data, you can use the links to the separate data files provided in the ``Data-Links'' document, which you will find in the TechAcademy drive.
We will give you a more detailed step-by-step demo during the first coding meetup.
\end{tipsp}
\hypertarget{importing-data}{%
\subsection{Importing Data}\label{importing-data}}
First things first, you will need to load the following two CSV files into your coding environment:
\begin{itemize}
\tightlist
\item
\href{https://drive.google.com/uc?export=download\&id=1OmFrdUaPFatpPLRikLJHS-VoF1F27G73}{spotify\_songs.csv}
\item
\href{https://drive.google.com/uc?export=download\&id=1gHO566_qxBbz7t8fcH10uVQDAmxK33IF}{song\_features.csv}
\end{itemize}
How is the data structured, and which variables does it contain? To get a peek at the data you'll be working with, output the first/last few rows. Also, compute some basic descriptive statistics (count, mean, and standard deviation) and determine the columns' data types.
\begin{tips}r
First, the data set has to be loaded from your folder structure into the workspace. Import both the \texttt{song\_features.csv} and the \texttt{spotify\_songs.csv} data sets into your workspace (e.g., using \{base\} R's \texttt{read.csv()}, \href{https://readr.tidyverse.org}{\{readr\}}'s \texttt{read\_csv()} or \href{https://rdatatable.gitlab.io/data.table/}{\{data.table\}}'s \texttt{fread()} function) and name the object(s) accordingly.
Though you can name the object however you wish, we recommend using \texttt{original\_file\_name} + e.g., \texttt{\_df} (for data.frame) / \texttt{\_tbl}(for tibble) / \texttt{\_DT} (for data.table) naming convention, so that you can easily distinguish your R objects (data frames, plots, vectors, etc.) later on.
Now, get an overview of the data set. Are there any variables you could drop while reading the data set? To print the data frame, you can use functions such as \texttt{head()} or \texttt{glimpse()}.
Lastly, get a quick summary of the data using \{base\} R's \texttt{summary()} or \href{https://personality-project.org/r/psych/}{\{psych\}}'s \texttt{describe()} function.
To find more ways to generate quick data summaries in R, check \href{https://dabblingwithdata.wordpress.com/2018/01/02/my-favourite-r-package-for-summarising-data/}{this} blog post from Adam Medcalf.
\end{tips}
\begin{tipsp}p
Start by importing the pandas library: \texttt{import\ pandas\ as\ pd}.
You can use the same pandas method you used to import CSV data from your local machine to import CSV data from a web server: Just replace the file path with the URL of the CSV file.
The URLs are:
\begin{itemize}
\tightlist
\item
spotify\_songs.csv: \url{https://tinyurl.com/TechAcademy-songs}
\item
song\_features.csv: \url{https://tinyurl.com/TechAcadey-song-features}
\end{itemize}
Finally, get familiar with the data by outputting some of the data frame (e.g.~\texttt{df.head()}), as well as some basic information such as column data types and some descriptive statistics (\texttt{df.info()}, \texttt{df.describe()}).
\end{tipsp}
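In \texttt{Python}, this whole first-look workflow fits in a few lines. The sketch below uses a tiny inline CSV (with assumed column names) in place of \texttt{spotify\_songs.csv} so that it runs on its own; in the project, you would pass the file's URL or path to \texttt{pd.read\_csv()} instead.

```python
import io
import pandas as pd

# Tiny inline sample standing in for spotify_songs.csv;
# the column names here are assumptions for illustration.
csv_text = io.StringIO(
    "name,artists,year,duration_ms\n"
    "Song A,Artist X,1999,210000\n"
    "Song B,Artist Y,2005,185000\n"
)
songs_df = pd.read_csv(csv_text)

print(songs_df.head())      # first few rows
songs_df.info()             # column names, dtypes, non-null counts
print(songs_df.describe())  # count, mean, std, ... of numeric columns
```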
\hypertarget{convert-milliseconds-to-minutes}{%
\subsection{Convert Milliseconds to Minutes}\label{convert-milliseconds-to-minutes}}
\textbf{We will start with the data set \texttt{spotify\_songs.csv}.}
You might have noticed a column with song durations given in milliseconds from the task before. Since (presumably) no one measures song durations in milliseconds, convert them into minutes.
\begin{tips}r
You could use \href{https://dplyr.tidyverse.org/index.html}{\{dplyr\}}'s \texttt{mutate} function to compute and add a new column \texttt{duration\_min} using the existing column \texttt{duration\_ms}. If you haven't heard of the \{dplyr\} package yet, take the respective \href{https://www.datacamp.com/courses/data-manipulation-with-dplyr}{DataCamp course} and the DataCamp's course on \href{https://www.datacamp.com/courses/exploratory-data-analysis}{Exploratory Data Analysis}. Use the respective function and add the new variable to your existing data set.
You can use \href{https://www.timecalculator.net/milliseconds-to-minutes}{this} online calculator as a point of reference for the needed \texttt{mutate()} calculation.
\end{tips}
\begin{tipsp}p
Compute the length of songs in minutes based on the column ``duration\_ms'' and assign it to a new column with
\texttt{df{[}“duration\_minutes”{]}=\ ...}
\end{tipsp}
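As a quick sketch of the conversion (using a made-up frame with only the \texttt{duration\_ms} column):

```python
import pandas as pd

# Hypothetical frame with the duration_ms column from spotify_songs.csv
df = pd.DataFrame({"duration_ms": [210000, 185000, 364000]})

# 1 minute = 60 * 1000 milliseconds, so divide by 60_000
df["duration_minutes"] = df["duration_ms"] / 60_000
print(df)
```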
\hypertarget{song-durations}{%
\subsection{Song Durations}\label{song-durations}}
Next, we want to focus on the concept of distribution:
In this task, you should focus on computing the mean and standard deviation of the duration in minutes and finding the song with the longest duration.
Ideally, the output should be well formatted to something like ``The song with the longest duration is \texttt{{[}song{]}\ by\ {[}artist{]}}.''
\begin{tips}r
Simply use \{base\} R's \texttt{mean()} function to compute a mean of any column in R. Similarly, R comes with \texttt{sd()} function to calculate the standard deviation of a column. You can combine these two functions, e.g., in a \{dplyr\}'s \texttt{summarise()} verb.
Next, find the song with the longest duration (in minutes) using the newly created \texttt{duration\_min} column. The \texttt{filter()} verb from \{dplyr\} in combination with \texttt{max()} function could be helpful in this case. Moreover, you can combine \{dplyr\}'s \texttt{select()} verb with the mentioned \texttt{filter()} verb to show: song's duration, its artist(s) and its name.
For example:
\begin{longtable}[]{@{}rrr@{}}
\toprule
duration\_min & artist(s) & song name \\
\midrule
\endhead
X.X & YY & ZZ \\
\bottomrule
\end{longtable}
\end{tips}
\begin{tipsp}p
Calculating mean and standard deviation should be reasonably easy by applying the \texttt{mean()} and \texttt{std()} functions to the corresponding column.
To find the song with the longest duration, you can use pandas' \texttt{nlargest} function. You can apply it to a data frame \texttt{df} with \texttt{df.nlargest(n,\ column)}, where \texttt{n} is the number of rows to keep and \texttt{column} defines in which column you want to find the largest values.
\end{tipsp}
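A compact sketch of all three steps, on a made-up miniature data set:

```python
import pandas as pd

# Hypothetical data: durations in minutes plus artist and song names
df = pd.DataFrame({
    "artists": ["Artist X", "Artist Y", "Artist Z"],
    "name": ["Song A", "Song B", "Song C"],
    "duration_min": [3.5, 4.2, 9.8],
})

print(f"mean: {df['duration_min'].mean():.2f} min")
print(f"std:  {df['duration_min'].std():.2f} min")

# Row with the longest duration
longest = df.nlargest(1, "duration_min")
row = longest.iloc[0]
print(f"The song with the longest duration is {row['name']} by {row['artists']}.")
```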
\hypertarget{visualizing-song-distributions}{%
\subsection{Visualizing Song Distributions}\label{visualizing-song-distributions}}
Visualizing data is essential to facilitate perception and understanding of information: Create a graph to visualize the distribution of song lengths.
We focus on creating a histogram in the tips section since it's perhaps the most common approach for plotting distributions. You can, however, choose a different chart type; just make sure that the information you want to display is clear and correct.
And, as in school, always add axis labels where possible.
\begin{tips}r
You can use \{base\} R's \texttt{hist()} function to generate a simple histogram of song duration (in minutes). However, we encourage you to use the more extensive \href{https://ggplot2.tidyverse.org}{\{ggplot2\}} package. \{ggplot2\}'s syntax is perhaps a bit harder to learn at the start, but it gives you more plotting benefits over the long run. If you haven't heard of the \{ggplot2\} package yet, take the respective \href{https://www.datacamp.com/courses/data-visualization-with-ggplot2-1}{DataCamp course}.
You could, e.g.~install the \href{https://github.com/dreamRs/esquisse\#esquisse-}{\{esquisse\}} package to learn \{ggplot2\}'s syntax more easily. \{esquisse\} provides a graphical user interface (GUI) inside RStudio to generate \{ggplot2\} figures. You can then copy/paste the syntax that generated the \{ggplot2\} figure into your R Markdown file. Awesome, eh?
\end{tips}
\begin{tipsp}p
With matplotlib's \texttt{pyplot.hist(...)}, you can plot a histogram in one line of code.
With some more lines of code, you could (and should) add axis labels.
Try adding a title to the histogram, change the color to your preferences, etc.
Maybe you could even remove outliers with methods such as \texttt{col1.between()} and \texttt{col1.quantile()}; but don't bother with such details early on, and leave them for the end if you have enough time.
\end{tipsp}
\begin{center}\includegraphics[width=1\linewidth]{splots/durations (3)} \end{center}
\hypertarget{time-series-data}{%
\section{Time Series Data}\label{time-series-data}}
\hypertarget{visualizing-song-durations-over-time}{%
\subsection{Visualizing Song Durations over Time}\label{visualizing-song-durations-over-time}}
While we are on the topic of song durations, it might also be interesting to see how the average song length has evolved over time: Choose and plot the chart you think is most appropriate for displaying this type of information.
After plotting, please comment/interpret the graph: Were there any significant increases/decreases in song length over time? If so, what could be the reason?
\begin{tips}r
You first need to prepare the data before visualizing it. Generate the average duration of a song (in minutes) by each year. You already used \texttt{summarise()} and \texttt{mean()} functions before. Now, you need to generate that same statistic for each year. You can, e.g., use \{dplyr\}'s \texttt{group\_by()} verb for this purpose.
Afterward, use \{ggplot2\} or \{base\} R's \texttt{plot()} to create a chart.
Please answer open questions in RMarkdown outside the code chunk.
\end{tips}
\begin{tipsp}p
There is a pandas method to group your data frame by year. In combination with the \texttt{.mean()} method, you can compute the average song length by year in one line of code.
Plot the result with matplotlib.
Please answer open questions in text cells.
\end{tipsp}
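The grouping step might look like this sketch (again on a made-up miniature frame); the resulting series is what you would pass to matplotlib:

```python
import pandas as pd

# Hypothetical songs with release year and duration in minutes
df = pd.DataFrame({
    "year":         [1999, 1999, 2005, 2005],
    "duration_min": [3.0,  4.0,  3.5,  2.5],
})

# Average song length per year; plot it e.g. with avg.plot()
# or plt.plot(avg.index, avg.values)
avg = df.groupby("year")["duration_min"].mean()
print(avg)
```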
\begin{center}\includegraphics[width=1\linewidth]{splots/durations_over_time} \end{center}
\hypertarget{merge}{%
\subsection{Merge}\label{merge}}
As a data scientist, you'll often find yourself working with data sets from different data sources referencing the same object. For example, you might have the song names in one file and the respective song features in a separate file. It would make more sense to just merge the two data sets into one.
Indeed, this is what happened with our data. So, to make life easier for the upcoming tasks, you'll now need to merge both data sets.
As you might have noticed, there is one column that both data sets share. Find out which column that is and combine the data sets on that column.
\begin{tips}r
You could, e.g., use \{dplyr\}'s \texttt{left\_join()} to combine the two data sets (\texttt{song\_features.csv} and \texttt{spotify\_songs.csv}) side-by-side.
Take a look at the \href{https://r4ds.had.co.nz/relational-data.html\#understanding-joins}{``R for Data Science''} book for an illustration of join types.
\end{tips}
\begin{tipsp}p
We want to combine the two data sets \texttt{spotify\_songs.csv} and \texttt{song\_features.csv}.
The official documentation of the Pandas library provides a great guide on how to deal with this stuff:
\url{https://pandas.pydata.org/docs/user_guide/merging.html}
\end{tipsp}
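As a hedged sketch of such a join (the shared column is called \texttt{id} here purely for illustration; find the real shared column in your data first):

```python
import pandas as pd

# Two hypothetical frames sharing an "id" column; the real shared
# column in the Spotify data may have a different name.
songs = pd.DataFrame({"id": [1, 2], "name": ["Song A", "Song B"]})
features = pd.DataFrame({"id": [1, 2], "energy": [0.8, 0.3]})

# A left join keeps every song and attaches its features
merged = songs.merge(features, on="id", how="left")
print(merged)
```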
\hypertarget{line-plot}{%
\subsection{Line Plot}\label{line-plot}}
Your goal for this task is to plot how \texttt{acousticness}, \texttt{danceability}, \texttt{energy}, \texttt{valence}, \texttt{speechiness}, and \texttt{instrumentalness} changed over time.
\begin{tips}r
This task involves three steps: creating a summarized data set, reshaping that data set from wide to long format if you'd like to use \{ggplot2\} and finally -- plotting it.
Use the familiar \texttt{group\_by()} and \texttt{summarise()} \{dplyr\} verbs for the first part of the task. Unlike before, include multiple columns in your data summary. Use the familiar \texttt{mean()} function on each column to summarize the data. Save the resulting data summary as a new object.
Use the \href{https://tidyr.tidyverse.org}{\{tidyr\}}'s \texttt{pivot\_longer()} for the second part of the task. \texttt{pivot\_longer()} allows you to transform derived summarized data set from wide to long format by decreasing the number of columns and increasing the number of rows (a.k.a., ``lengthening'' the data). Such transformation is useful for plotting, e.g., in \{ggplot2\}.
Use the \{ggplot2\} to create a line chart of six variables for this task's final step. Alternatively, use the \{base\} R's \texttt{plot()}.
\end{tips}
\begin{tipsp}p
Once again, group the entries by year, get the mean values of the respective columns for all years, and use matplotlib's pyplot for plotting.
Don't forget to add a legend and axis labels to the plot.
\end{tipsp}
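A sketch of the data preparation with a made-up frame and just two of the six feature columns:

```python
import pandas as pd

# Hypothetical frame with two of the six feature columns
df = pd.DataFrame({
    "year":         [1999, 1999, 2005],
    "energy":       [0.4,  0.6,  0.8],
    "danceability": [0.5,  0.7,  0.9],
})

# Yearly averages of several feature columns at once
yearly = df.groupby("year")[["energy", "danceability"]].mean()
print(yearly)
# Each column of `yearly` becomes one line in the chart, e.g.:
# yearly.plot()  # pandas draws one line per column via matplotlib
```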
\begin{center}\includegraphics[width=1\linewidth]{splots/features_over_time} \end{center}
\hypertarget{radar-plot-advanced}{%
\subsection{Radar Plot (Advanced)}\label{radar-plot-advanced}}
\textbf{If you are on an advanced track, do this exercise; otherwise, you can skip it.}
Let's get a little fancy by displaying the yearly average song feature values through an interactive radar chart (aka spider plot or polar plot). This chart type works well when you want to display more than two dimensions. Here, each column (or dimension) gets its own axis, and the dots on each axis are connected, resulting in a polygon shape.
The features you'll need to plot are: \texttt{acousticness}, \texttt{danceability}, \texttt{energy}, \texttt{valence}, \texttt{speechiness}, and \texttt{instrumentalness}.
\begin{tips}r
Use the \href{https://plotly.com/r/getting-started/}{\{plotly\}} package to create an interactive Radar/Spider chart of six columns over time. Similar to task 2.3 before, this task will involve summarizing, transforming, and plotting the data.
You are now already experienced in summarizing the data with, e.g., \texttt{group\_by()} and \texttt{summarise()} \{dplyr\} verbs. Focus on summarizing the six columns.
Afterward, transform the data from wide to long format. You could, e.g., use \{tidyr\}'s \texttt{pivot\_longer()} again or \{data.table\}'s \texttt{melt()}.
Lastly, plot the figure using \{plotly\}. Use \href{https://plotly.com/r/radar-chart/}{this} resource as a point of reference for a static plot. Then, create an interactive animation using \href{https://plotly.com/r/animations/}{this} resource.
\end{tips}
\begin{tipsp}p
Here we work with the visualization library plotly, which is a more sophisticated plotting tool than matplotlib:
\texttt{import\ plotly.express\ as\ px}.
Use the already computed data frame with average feature values for each year. Note that after you grouped the data frame, the \texttt{year} column has automatically been set as the index column -- reverse that (i.e., ``reset'' the index).
Then apply the \texttt{.melt("year")} method on the data frame and have a look at the result. Before continuing, briefly try to understand what happened.
Having done that, you can create the radar plot with the \texttt{px.line\_polar(...)} method. The documentation (google something like ``plotly express radar chart'' to find it) tells you which arguments you need (and can) pass to this method:
- Of course, you need to pass some value for the \texttt{data\_frame} argument. For our purpose, this will be the melted data frame.
- You'll also need to pass values for the keyword arguments \texttt{r}, \texttt{theta}, \texttt{animation\_frame}, \texttt{line\_close}, and \texttt{range\_r}.
The values for these arguments are the strings \texttt{"value"}, \texttt{"variable"}, \texttt{"year"}, the boolean value \texttt{True}, and the tuple \texttt{(0,1)}, respectively.
Apply \texttt{.show()} directly after the \texttt{line\_polar(...)} method to display the plot.
As always, try updating font size, layout, etc., as you like.
\end{tipsp}
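The reshaping step can be sketched as follows, with a made-up stand-in for your yearly feature averages:

```python
import pandas as pd

# Hypothetical yearly feature averages in wide format (index already
# reset, so "year" is an ordinary column)
wide = pd.DataFrame({
    "year":    [1999, 2005],
    "energy":  [0.5, 0.8],
    "valence": [0.3, 0.6],
})

# melt() lengthens the frame: one row per (year, variable) pair --
# the shape px.line_polar() expects for r/theta/animation_frame
long_df = wide.melt("year")
print(long_df)
```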
\begin{center}\includegraphics[width=1\linewidth]{splots/features_over_time_advanced} \end{center}
\hypertarget{artists}{%
\section{Artists}\label{artists}}
\hypertarget{most-productive-artists}{%
\subsection{Most Productive Artists}\label{most-productive-artists}}
Our subsequent interest lies in analyzing music artists: Start by showing which artists wrote the most songs using a horizontal bar plot (or any other plot you think works best).
\begin{tips}r
You could use the familiar \texttt{group\_by()} \{dplyr\} verb for this task. This time, however, you need to count the songs of each artist. Does that ring a bell for another \{dplyr\} verb?
After grouping and counting, sort the result by its count. Then, subset the 15 artists with most songs. You can save the derived data summary to a new object.
Lastly, plot the result using \{base\} R or \{ggplot2\} by creating a bar chart.
\end{tips}
\begin{tipsp}p
Pandas' \texttt{.value\_counts(...)} method is all you need to prepare the data.
For plotting, use matplotlib's \texttt{plt.barh(...)}.
\end{tipsp}
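In \texttt{Python}, the data preparation boils down to one method call; here is a sketch on a made-up artists column:

```python
import pandas as pd

# Hypothetical artists column of the song data set
artists = pd.Series(["Artist X", "Artist Y", "Artist X", "Artist Z", "Artist X"])

# Songs per artist, sorted descending; slice the top entries for plotting
top = artists.value_counts().head(2)
print(top)
# plt.barh(top.index, top.values) would then draw the horizontal bars
```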
\begin{center}\includegraphics[width=1\linewidth]{splots/artists_bar_chart} \end{center}
\hypertarget{charts}{%
\subsection{Charts}\label{charts}}
Care to discover new songs?
In this section, you'll work with entirely new data (links in the respective tips section).
The goal should be to create a map with a hover effect displaying each country's top song (and its composer) when hovering your mouse over that country.
\begin{tips}r
First, load the data set \texttt{weekly\_charts.csv} and find each country's top song. You could use the familiar \{dplyr\} verbs for this part. You could also save the resulting data set for plotting.
Then, use \href{https://jkunst.com/highcharter/}{\{highcharter\}} to create a simple interactive map. Check \href{https://jkunst.com/highcharter/articles/maps.html}{this documentation} if you'd like to do that. In the case of \{highcharter\}, you do not need any additional data as the package relies on \texttt{ISO\ Alpha\ 2} (two-letter) country codes supplied in the \texttt{country} column of the \texttt{weekly\_charts.csv} data set. Make use of \{highcharter\}'s \texttt{hc\_tooltip()} function and its \texttt{pointFormat} argument for a mouse hover effect.
Alternatively, use \{plotly\} to create an interactive map. In this case, you have more customization options, but you need to add \texttt{ISO\ Alpha\ 3} (three-letter) country codes to the \texttt{weekly\_charts.csv} data set. You can load \href{https://github.com/lukes/ISO-3166-Countries-with-Regional-Codes/blob/master/all/all.csv}{this CSV file} via URL to merge \texttt{ISO\ Alpha\ 3} country codes with countries in the \texttt{weekly\_charts.csv} data set. Then, create a plot using \{plotly\} as described \href{https://plotly.com/r/choropleth-maps/}{here}.
\end{tips}
\begin{tipsp}p
There exists a library for plotting maps: \texttt{import\ folium}.
With folium, you can easily plot a map by running the following two lines:
\begin{verbatim}
m = folium.Map(location=[20,10], zoom_start=2)
m
\end{verbatim}
Folium requires you to specify the border coordinates of each country you'd like to be drawn (i.e., colored, like in our example map). For this, we prepared a JSON file with all that data. Note that this is not a CSV file; hence you can't use the pandas \texttt{.read\_csv()} method. You'll need to work with two new libraries for importing the data: one for requesting the data from a web server (i.e., Google Drive) and one to parse (i.e., read) the data. Here's the code:
\begin{verbatim}
import requests
import json

# request data from a URL
response = requests.get("https://tinyurl.com/TechAcademy-GeoData")

# parse the JSON data
geo_data = json.loads(response.content)
\end{verbatim}
If you print out the data, you will notice that each data subset belongs to a country and also contains the top song for the respective country: \texttt{print(geo\_data{[}"features"{]}{[}0{]})\ \ \#\ prints\ data\ of\ the\ first\ country\ in\ JSON\ file}
Now, ``draw'' the country borders by passing the \texttt{geo\_data} as an argument to \texttt{folium.GeoJson(...).add\_to(m)}, and assign it to the variable \texttt{geo}. Note: no other arguments need to be passed here.
Having done that, add the hover effect displaying each country's top song with \texttt{geo.add\_child(folium.GeoJsonTooltip(...))} and specify fields and aliases through the method's parameters. There is no need to assign this to a variable. You can look up folium's \texttt{GeoJsonTooltip} class online to understand how fields and aliases are passed as arguments.
Display the map by simply typing the letter \texttt{m}.
\end{tipsp}
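Independent of folium, it helps to see how such GeoJSON-style data is parsed and accessed. The sketch below uses a tiny made-up stand-in for the prepared file; the real file's property names (e.g., the top-song field) may differ.

```python
import json

# Minimal made-up stand-in for the prepared GeoJSON file
geo_text = """{
  "features": [
    {"type": "Feature",
     "properties": {"name": "Germany", "top_song": "Song A"},
     "geometry": {"type": "Polygon", "coordinates": []}}
  ]
}"""

# requests' response.content would be parsed the same way
geo_data = json.loads(geo_text)
first = geo_data["features"][0]          # data of the first country
print(first["properties"]["top_song"])   # its stored top song
```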
\begin{center}\includegraphics[width=1\linewidth]{splots/map} \end{center}
\hypertarget{compare-artists-based-on-their-songs}{%
\section{Compare Artists Based on their Songs}\label{compare-artists-based-on-their-songs}}
\hypertarget{pairplot-of-different-artists}{%
\subsection{Pairplot of different artists}\label{pairplot-of-different-artists}}
Until now, we analyzed songs individually or compared how many songs individual artists produced. Now, we want to compare the features of songs from different artists with each other. For example, how do songs by ``Bob Marley'' differ from those by ``Nirvana''?
Search the data set for three different artists or bands that you like. We recommend picking artists from different genres so that the differences between them are more pronounced.
\begin{tips}r
Pick the three artists of your choice (e.g., ``Bob Dylan,'' ``Nirvana,'' and ``Eminem''). You could save your selection to a character vector, e.g., \texttt{my\_artists\_vec}.
You then need to filter/subset the data set for these three artists. That means finding the rows that contain any of those three artists' names. While doing that, make sure to include the four columns of interest for these artists too. You could use \{dplyr\}'s \texttt{filter()} and \texttt{select()} verbs for this purpose.
Afterward, create a pairplot between three artists of your choice. Remember, you are comparing the four features between the three artists. You could use \href{http://ggobi.github.io/ggally/}{\{GGally\}}'s \texttt{ggpairs()} for this purpose. Feel free to play around with the \texttt{ggpairs()} function's arguments as described in \href{http://ggobi.github.io/ggally/articles/ggpairs.html}{its documentation}.
\end{tips}
\begin{tipsp}p
Pick three artists of your choice (e.g., ``Bob Dylan'', ``Nirvana'', and ``Eminem''). Filter the data set so that it contains only songs of these three artists.
Now, you can use the \texttt{seaborn} library for visualization. Import this library and use the \texttt{pairplot()} function where you pass the filtered data frame as a parameter, and as the \texttt{hue} value, you choose ``artists''.
\end{tipsp}
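The filtering step can be sketched like this (made-up miniature data; \texttt{isin()} does the subsetting):

```python
import pandas as pd

# Hypothetical subset of the merged song data
df = pd.DataFrame({
    "artists":      ["Bob Dylan", "Nirvana", "Eminem", "ABBA"],
    "danceability": [0.4, 0.5, 0.8, 0.7],
})

my_artists = ["Bob Dylan", "Nirvana", "Eminem"]
# isin() keeps only rows whose artist appears in the list; the result
# is what you would hand to seaborn.pairplot(..., hue="artists")
subset = df[df["artists"].isin(my_artists)]
print(subset)
```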
\begin{center}\includegraphics[width=1\linewidth]{splots/feature_comparison} \end{center}
\hypertarget{discover-new-visualization-methods}{%
\subsection{Discover New Visualization Methods}\label{discover-new-visualization-methods}}
Python and R can produce insightful visualizations, and in this exercise, we want to give you an impression of the wide range of visualization types available. Have a look at a gallery that shows many example plots, choose one of them, and generate such a plot with the Spotify data.
Write in a comment what your visualization shows and what message it delivers. What insights about the data do you get from this plot?
\begin{tips}r
Visit \href{https://www.r-graph-gallery.com/all-graphs}{``The R Graph Gallery''} to find inspiration for your figure. You could then use either the \{base\} R or the familiar \{ggplot2\} package to create the plot of your choice.
\end{tips}
\begin{tipsp}p
Visit the \href{https://seaborn.pydata.org/examples/index.html\#}{``Seaborn Gallery''}, look for a plot that you could use to visualize an aspect of the Spotify data, and generate that visualization.
\end{tipsp}
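As one example of a gallery-style plot, the sketch below draws a correlation heatmap of the audio features with seaborn. The column names and inline values are assumptions standing in for the real data set:

```python
import pandas as pd
import seaborn as sns

# Small stand-in for the Spotify data set (replace with your loaded data frame).
spotify_df = pd.DataFrame({
    "danceability": [0.48, 0.55, 0.60, 0.40, 0.52, 0.47, 0.80],
    "energy":       [0.40, 0.35, 0.50, 0.85, 0.92, 0.88, 0.70],
    "valence":      [0.60, 0.55, 0.65, 0.30, 0.25, 0.35, 0.50],
})

# Pairwise Pearson correlations between the numeric features.
corr = spotify_df.corr()

# Annotated heatmap; vmin/vmax pin the color scale to the [-1, 1] range.
ax = sns.heatmap(corr, annot=True, vmin=-1, vmax=1)
```

A plot like this can reveal, for instance, whether danceable songs in the data set also tend to be high-energy, which would be worth noting in your comment.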
\hypertarget{analyse-personal-spotify-data}{%
\section{Analyse Personal Spotify Data}\label{analyse-personal-spotify-data}}
We dived into the provided Spotify data set, and now it is time to look into your personal streaming history. For this, we use the data set that we requested from Spotify.
If you do not use Spotify, you can also use data from a member of your team. If nobody on your team uses Spotify, contact your mentor, and we will provide you with some dummy data.
Each year, Spotify generates a personal wrap-up of your streaming history, and now we want to produce something similar. First, we look at our favorite artists and afterward at our streaming behavior: on which days do we listen to more music, and at which times of the day?
This section only scratches the surface of what is possible after you receive your Spotify data. We will illustrate some ideas on how you can analyze your data. In the end, we will mention how you could expand your analysis.
\hypertarget{generate-bar-plot-with-top-10-artists}{%
\subsection{Generate Bar Plot with Top 10 artists}\label{generate-bar-plot-with-top-10-artists}}
We want to start with a bar plot that shows your favorite artists.
First, we have to load the data into our work environment. If you look at the downloaded data, you can see the JSON file \texttt{StreamingHistory0.json}. This is not a CSV file, but, as you will see, that is not a big problem.
\begin{tips}r
First of all, you have to upload your data to access it on RStudio Cloud. Check \href{https://community.rstudio.com/t/importing-data-into-r-studio-cloud/14332/5}{this} series of posts on how to do it. Click on the \texttt{Files} tab in the right pane of your RStudio dashboard. You will then find the \texttt{Upload} button right below the \texttt{Packages} button. After uploading your file, you will see it in your RStudio dashboard under the \texttt{Files} tab -- ready to be loaded.
You can use the \texttt{fromJSON()} function available in the \href{https://cran.r-project.org/package=jsonlite}{\{jsonlite\}}, \href{https://cran.r-project.org/package=rjson}{\{rjson\}} or \href{https://cran.r-project.org/package=RJSONIO}{\{RJSONIO\}} package to load the \texttt{StreamingHistory0.json} file into your session. This works similarly to the \texttt{read\_csv()} function that you already used to load a CSV file.
Pick the package whose naming convention comes off most intuitively to you, load the JSON file via \texttt{fromJSON()} and save it as a familiar data frame object (e.g., \texttt{streaming\_history\_df}).
\end{tips}
\begin{tipsp}p
First of all, you have to upload your data to access it with Google Colab. On the left side, you can find a ``Folder'' symbol. Click on it, and you can upload your data with drag and drop. The path to your file is then ``/content/your\_filename.json''.
Until now, we loaded CSV files with the pandas method \texttt{pd.read\_csv(file)}.
So, which pandas method could you use to load JSON files? ;)
\end{tipsp}
After you have loaded the file into your environment, take some time to look at the data first. Print out a couple of rows and look at which columns this data set contains. It could again be helpful to compute a column \texttt{minutesPlayed} from the column \texttt{msPlayed}.
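In Python, loading the file and deriving \texttt{minutesPlayed} could look like the sketch below. The field names follow Spotify's streaming-history export (\texttt{endTime}, \texttt{artistName}, \texttt{trackName}, \texttt{msPlayed}), and a tiny inline file stands in for the real download:

```python
import json
import pandas as pd

# Two fabricated records standing in for the real StreamingHistory0.json.
records = [
    {"endTime": "2021-01-01 10:15", "artistName": "Nirvana",
     "trackName": "Lithium", "msPlayed": 255000},
    {"endTime": "2021-01-01 10:20", "artistName": "Bob Dylan",
     "trackName": "Hurricane", "msPlayed": 515000},
]
with open("StreamingHistory0.json", "w", encoding="utf-8") as f:
    json.dump(records, f)

# pandas reads record-oriented JSON directly into a data frame.
streaming_history_df = pd.read_json("StreamingHistory0.json")

# msPlayed is in milliseconds; 60,000 ms = 1 minute.
streaming_history_df["minutesPlayed"] = streaming_history_df["msPlayed"] / 60_000
```

In R, \texttt{fromJSON()} plus \{dplyr\}'s \texttt{mutate()} achieves the same result.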
Afterward, you can create a graph with the artists you listened to the most.
\begin{tips}r
You could add a variable to your personal Spotify data set that counts the number of times you listened to a particular song. Take advantage, e.g., of \{dplyr\}'s \texttt{mutate()} verb, and name the new variable, for example, \texttt{n\_times\_song\_played}.
Afterward, you could arrange the data set by the newly created \texttt{n\_times\_song\_played} variable.
Finally, create a chart of the ten most played artists.
\end{tips}
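A pandas sketch of the same top-10 idea, counting plays per artist and drawing a bar chart (the column names \texttt{artistName} and \texttt{msPlayed} follow Spotify's export, and the inline data frame is a stand-in for your loaded streaming history):

```python
import pandas as pd

# Stand-in for the streaming history loaded from StreamingHistory0.json.
streaming_history_df = pd.DataFrame({
    "artistName": ["Nirvana", "Nirvana", "Bob Dylan", "Eminem", "Nirvana"],
    "msPlayed":   [255000, 180000, 515000, 200000, 90000],
})

# Number of streams per artist, sorted largest first.
plays_per_artist = streaming_history_df["artistName"].value_counts()

# Keep the ten most played artists and plot them as bars.
top10 = plays_per_artist.head(10)
ax = top10.plot.bar()
```

Instead of counting streams, you could also sum \texttt{msPlayed} per artist to rank by total listening time.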
\begin{tipsp}p