The goal of this project is to analyze a sample of the most popular songs of each year for the last 60 years using topological data analysis (TDA) and natural language processing (NLP). Depending on time constraints, more analysis might be done using machine learning (ML) or analysis of word frequencies in the lyrics.
In more detail, the goal is to analyze and compare the most popular songs of the past 60 years against each other, using the top 100 most popular songs of each year as the sample for that year. Hopefully this project will show some interesting trends in the progression of music.
This project is for the BGSU Hackathon.
Hackathon Results
This project won the award for best design at the 2020 BGSU Hackathon. I will upload our presentation (named Hackathon Presentation) and freeze this repository. I'd like to thank my teammates, David Ash and Jackson Conrad, for their contributions to this project. Since the application of topological data analysis in this project was unsuccessful, I will clone this repository and continue working on it as my schedule allows. If I get good results I will write a paper for publication, as this field still needs some development.
I will be referencing two main papers for this project and hopefully won't need to redesign the algorithms they describe in order for them to work in my project. That would be problematic, as my understanding of abstract algebraic group theory can be described as limited at best.
This project will be divided into multiple parts:
1. Creating the data set
- Accomplished by text mining websites (probably Lyrics On Demand) using Python.
- This stage also involves all preprocessing of the data, such as scrubbing, normalization, etc.
2. NLP setup
- Setting up and using the natural language processing tools. Will show comparisons of different NLP methods and techniques.
3. TDA of the NLP-processed data
- The fun part! Using TDA to model the data generated by the NLP stage.
- Generate a bunch of graphs and statistics to be used for analysis.
4. Analysis
- Lots of visuals showing the results.
- Analysing what the results actually mean or indicate.
- Find trends and patterns in the differences between popular songs.
- Compare how the output changes after changing the methods and parameters used for NLP and TDA.
5. Optimization
- This is really just part of analysis, as you need to analyze your output before knowing what to optimize.
- This stage will involve a lot of experimenting with the methods and parameters used in the implementation of the NLP and TDA.
6. Presentation
- Prepping the code for an easy-to-follow presentation.
- Probably also writing a paper on the project.
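Stages 1 to 3 can be illustrated in miniature with the standard library alone. The sketch below is not the project's implementation: the bag-of-words representation, the cosine distance, and the helper names `normalize`, `cosine_distance`, and `h0_persistence` are all assumptions made for illustration, and the lyrics are invented placeholders. It computes only the simplest topological summary, the 0-dimensional persistence of a Vietoris-Rips filtration, which records the distance at which clusters of songs merge. A real run would more likely rely on a dedicated TDA library and richer NLP features.

```python
import math
import re
from collections import Counter
from itertools import combinations


def normalize(lyrics):
    """Lowercase the text and split it into word tokens (apostrophes kept)."""
    return re.findall(r"[a-z']+", lyrics.lower())


def cosine_distance(a, b):
    """Return 1 - cosine similarity between two word-count Counters."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)


def h0_persistence(vectors):
    """Death values of connected components in a Vietoris-Rips filtration.

    Every component is born at 0 and dies when an edge first merges it
    into another component. The one component that never dies is omitted.
    """
    parent = list(range(len(vectors)))

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Process all pairwise edges from shortest to longest.
    edges = sorted(
        (cosine_distance(vectors[i], vectors[j]), i, j)
        for i, j in combinations(range(len(vectors)), 2)
    )
    deaths = []
    for dist, i, j in edges:
        root_i, root_j = find(i), find(j)
        if root_i != root_j:
            parent[root_i] = root_j  # two components merge: one dies here
            deaths.append(dist)
    return deaths


# Placeholder lyrics, invented purely for illustration (assumption).
songs = [
    "love me love me do",
    "love love me do",
    "dance all night dance",
    "dance dance all night long",
]
vectors = [Counter(normalize(s)) for s in songs]
deaths = h0_persistence(vectors)
print([round(d, 3) for d in deaths])
```

With these placeholder songs, the two small death values come from near-duplicate lyrics merging early, while the final value of 1.0 reflects two groups that share no vocabulary at all. Long-lived components of this kind are what the analysis stage would look for across years.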
Note that stage 4, Analysis, is very tricky because there isn't a correct answer. Typically in supervised (and even unsupervised) learning, you have targets or metrics that let you put a number on the performance of your model. In this case, however, the work is mostly exploratory analysis, so comparing the outputs of different methods will show different interpretations of the data. One isn't going to be more correct than the other (probably).
If there is time, I will try to extend this to guess what year a song came from given its lyrics. If I get to this part, it wouldn't be much work to train an ML algorithm on the data using supervised learning. Another possibility is to set up a GAN and have it try to guess what year a randomly generated song (with lyrics produced by the generator) would be from.
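As a rough idea of how little code the supervised version needs, here is a minimal sketch of a multinomial naive Bayes classifier with add-one smoothing, written with the standard library only. The choice of naive Bayes, the function names `tokenize`, `train`, and `predict`, and the tiny training set are all assumptions for illustration. The lyrics are invented placeholders, not real data, and the project may well end up using a different model.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(lyrics):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z']+", lyrics.lower())


def train(labelled_songs):
    """Count words per label; labelled_songs is a list of (label, lyrics)."""
    word_counts = defaultdict(Counter)
    song_counts = Counter()
    for label, lyrics in labelled_songs:
        word_counts[label].update(tokenize(lyrics))
        song_counts[label] += 1
    vocab = {w for counts in word_counts.values() for w in counts}
    return word_counts, song_counts, vocab


def predict(model, lyrics):
    """Return the label with the highest log-posterior under naive Bayes."""
    word_counts, song_counts, vocab = model
    total_songs = sum(song_counts.values())
    best_label, best_score = None, -math.inf
    for label in song_counts:
        # Log prior: fraction of training songs carrying this label.
        score = math.log(song_counts[label] / total_songs)
        total_words = sum(word_counts[label].values())
        for word in tokenize(lyrics):
            if word not in vocab:
                continue  # ignore words never seen during training
            # Laplace (add-one) smoothed word likelihood.
            count = word_counts[label][word]
            score += math.log((count + 1) / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Placeholder training data, invented purely for illustration (assumption).
training = [
    (1965, "twist and shout baby twist"),
    (1965, "baby hold my hand"),
    (2015, "turn up the bass tonight"),
    (2015, "bass drop in the club tonight"),
]
model = train(training)
print(predict(model, "baby twist again"))
print(predict(model, "bass tonight"))
```

With the real data set, the labels could be years or decades, and held-out songs would give an accuracy figure, which is the kind of performance number the analysis stage otherwise lacks.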