This project uses unsupervised learning to group Reddit text and identify major conspiracy theories, using NLP techniques including LDA, spaCy, SVD, SBERT embeddings and HDBSCAN.
Shape - (17155, 2)
Columns – date (dtype datetime64[ns]), text (dtype object)
Date Range - 01-01-2015 to 17-02-2023
The spaCy model "en_core_web_sm" has been used for term extraction, together with the Matcher feature it provides. We look for patterns that begin with an adjective or a noun, followed by singular or plural common nouns or proper nouns, allowing for hyphens.
The extracted terms show that the text contains words that do not contribute to our analysis, so we run a second round of data cleaning to remove them.
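The matching pattern described above could be sketched as follows. This is one possible reading of the description, not the project's exact pattern: the names `TERM_PATTERNS` and `extract_terms`, the two-pattern handling of hyphens, and the choice to keep only the longest non-overlapping matches are all assumptions.

```python
# Penn Treebank tags for singular/plural common nouns and proper nouns.
NOUN_TAGS = ["NN", "NNS", "NNP", "NNPS"]

# A term starts with an adjective or a noun ...
HEAD = {"POS": {"IN": ["ADJ", "NOUN"]}}
# ... and continues with one or more nouns.
NOUNS = {"TAG": {"IN": NOUN_TAGS}, "OP": "+"}

# Assumed reading of "along with hyphen": the same shape, with or
# without a hyphen token right after the first word.
TERM_PATTERNS = [
    [HEAD, NOUNS],
    [HEAD, {"ORTH": "-"}, NOUNS],
]


def extract_terms(doc):
    """Return the longest non-overlapping candidate terms in a tagged spaCy Doc."""
    # Imported here so the patterns above can be inspected without loading spaCy.
    from spacy.matcher import Matcher

    matcher = Matcher(doc.vocab)
    # greedy="LONGEST" keeps the longest match when candidates overlap.
    matcher.add("TERM", TERM_PATTERNS, greedy="LONGEST")
    matches = sorted(matcher(doc), key=lambda m: m[1])  # order by start token
    return [doc[start:end].text for _, start, end in matches]
```

With the model loaded through `spacy.load("en_core_web_sm")`, calling `extract_terms(nlp(post))` would return the candidate terms for a single post.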
This is followed by term extraction using C-values with theta = 100. The ten most common terms and the 20 least common terms are listed below.
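For readers unfamiliar with C-values, here is a minimal sketch of the scoring idea: a term's frequency is discounted by the average frequency of the longer candidate terms that contain it, then weighted by the log of its length. This is a simplified version of the published formula rather than the project's implementation. The function names and toy counts are hypothetical, and `theta` is assumed to act as a minimum-score threshold.

```python
import math
from collections import Counter


def contains(longer, shorter):
    """True if `shorter` occurs as a contiguous token run inside `longer`."""
    m = len(shorter)
    return any(longer[i:i + m] == shorter for i in range(len(longer) - m + 1))


def c_values(term_counts, theta=0.0):
    """Score multi-word terms; keep those whose C-value is at least `theta`."""
    scores = {}
    for term, freq in term_counts.items():
        # Frequencies of longer candidates in which this term is nested.
        parents = [f for t, f in term_counts.items()
                   if len(t) > len(term) and contains(t, term)]
        # Nested terms are discounted by the mean frequency of their parents.
        adjusted = freq - sum(parents) / len(parents) if parents else freq
        score = math.log2(len(term)) * adjusted
        if score >= theta:
            scores[term] = score
    return scores


# Toy counts (made up for illustration), keyed by token tuples.
counts = Counter({
    ("deep", "state"): 8,
    ("deep", "state", "actors"): 2,
    ("mind", "control"): 4,
})
scores = c_values(counts, theta=1.0)
```

Because the weight is `log2` of the term length, single-word terms always score zero, which suits a pipeline that only extracts multi-word candidates.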
We create tokens from the text data using a spaCy pipeline that incorporates the terms extracted above.
We have used the LDA model from tomotopy with the following features:
The twenty topics are plotted on a 2D space as below:
Looking at the top 50 words from each topic, we label the topics as shown in the table below.
- The "text" data has been cleaned to remove emojis, unnecessary text and all punctuation except ".", which is required for sentence tokenization.
- We have removed all posts that have fewer than 5 words.
- We have removed posts that begin with "your post" or "please contact" to filter out Reddit submission messages.
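The three cleaning rules above can be sketched with the standard library alone. This is an illustration under stated assumptions, not the project's cleaning code: the function names are hypothetical, and "unnecessary text" is approximated by keeping only ASCII letters, digits, whitespace and ".".

```python
import re

BOILERPLATE_PREFIXES = ("your post", "please contact")
MIN_WORDS = 5


def clean_text(text):
    """Drop emojis and punctuation other than '.', then collapse whitespace."""
    # Assumption: anything that is not an ASCII letter, digit, '.' or
    # whitespace counts as noise (this also removes emojis).
    text = re.sub(r"[^A-Za-z0-9.\s]", " ", text)
    return re.sub(r"\s+", " ", text).strip()


def keep_post(text):
    """Reject Reddit submission messages and posts with fewer than 5 words."""
    if text.strip().lower().startswith(BOILERPLATE_PREFIXES):
        return False
    return len(text.split()) >= MIN_WORDS


def clean_posts(posts):
    """Clean every post, then keep only the ones that pass the filters."""
    cleaned = (clean_text(p) for p in posts)
    return [p for p in cleaned if keep_post(p)]


sample = [
    "Your post has been removed for breaking rule 2.",
    "They are hiding the truth!!! 👀 Wake up, people.",
    "lol ok",
]
kept = clean_posts(sample)
```

Here only the second sample post survives: the first is a submission message and the third is too short.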
We have used the sentence splitter from spaCy to take a batch of 5 sentences at a time from each post.
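The batching step might look like the sketch below. Two assumptions apply: the batches do not overlap, and a naive split on "." stands in for spaCy's sentence splitter so that the example runs without a model. Both function names are hypothetical.

```python
def split_sentences(text):
    """Naive stand-in for spaCy's sentence splitter: split on '.'."""
    return [s.strip() + "." for s in text.split(".") if s.strip()]


def batch_sentences(sentences, size=5):
    """Join consecutive sentences into non-overlapping batches of `size`."""
    return [" ".join(sentences[i:i + size])
            for i in range(0, len(sentences), size)]


post = "S1. S2. S3. S4. S5. S6. S7."
batches = batch_sentences(split_sentences(post))
```

A post with seven sentences yields two batches, the second holding the remaining two sentences.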
We have used TruncatedSVD to transform the data into 20-dimensional vectors.
We have used the "en_core_web_lg" model from spaCy to transform the cleaned text into 300-dimensional vectors. TruncatedSVD then reduces these to 20 dimensions.
Dynamic embedding was done using the SBERT sentence transformer "all-mpnet-base-v2" to obtain 768-dimensional vectors. TruncatedSVD then reduces these to 20 dimensions.
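The reduction step shared by the three representations above can be sketched with scikit-learn's `TruncatedSVD`. Random numbers stand in for real SBERT embeddings here. The helper name `reduce_dimensions`, the sample count and the fixed seed are assumptions made for illustration.

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD


def reduce_dimensions(vectors, n_components=20, seed=0):
    """Project row vectors onto their top `n_components` singular directions."""
    svd = TruncatedSVD(n_components=n_components, random_state=seed)
    return svd.fit_transform(vectors)


# Stand-in for SBERT output: 200 texts, 768 dimensions each (random data).
rng = np.random.default_rng(0)
sbert_like = rng.normal(size=(200, 768))
reduced = reduce_dimensions(sbert_like)
```

The same call would apply unchanged to 300-dimensional spaCy vectors, since only the input width differs.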
As seen above, with perplexity = 50, the SBERT model clearly produces better results than LSA and spaCy. We therefore plotted the vector projections with perplexity 100, 150 and 200 to compare results and select the best model for clustering.
A perplexity of 200 seems to provide the best results for the analysis.
We have considered a cluster size of 20 and fitted the SBERT vectors.
Highlighting the terms in each cluster after incorporating C-values gives the following focus topics:
For interactive features and further details, please refer to the notebook and project report. A few examples have been highlighted below.