Topic Modelling

Identifying major conspiracy theories from Reddit text

This project uses unsupervised learning to group Reddit text and identify major conspiracy theories, using NLP techniques: LDA, spaCy, SVD, SBERT embeddings and HDBSCAN.

Data:

Shape - (17155, 2)
Columns – date (dtype datetime64[ns]), text (dtype object)
Date Range - 01-01-2015 to 17-02-2023

Process Flow:

Data Cleaning:

Removing leading spaces
Removing emojis and any other component that is not a word or a number
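
The two cleaning steps above could be sketched roughly as follows. The function name `clean_text` and the ASCII-only character class are assumptions made for illustration; the project's own cleaning rules may differ (for instance, this version also drops accented letters).

```python
import re


def clean_text(text: str) -> str:
    """Keep only ASCII letters, digits and whitespace, then tidy the spacing."""
    # Replace emojis, punctuation and any other symbol with a space.
    text = re.sub(r"[^A-Za-z0-9\s]", " ", text)
    # Collapse runs of whitespace and drop leading/trailing spaces.
    return re.sub(r"\s+", " ", text).strip()


print(clean_text("   The moon landing was staged!! 🚀 (1969)"))
# The moon landing was staged 1969
```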

Term Extraction:

The spaCy model "en_core_web_sm" has been used for term extraction, along with the Matcher feature it provides. We are trying to detect a pattern that begins with an adjective or a noun, followed by singular or plural common nouns or proper nouns, optionally joined by a hyphen. From the extracted terms, we noticed that some terms in the text do not contribute to our analysis.
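
A minimal sketch of such a Matcher pattern is given here. The exact pattern used in the project is not shown in this README, so the token specification, the `TERM` label and the helper names are assumptions. To keep the sketch runnable without downloading a model, the example tokens are tagged by hand; in the project, the `Doc` would come from `spacy.load("en_core_web_sm")`.

```python
from spacy.matcher import Matcher
from spacy.tokens import Doc
from spacy.vocab import Vocab


def build_term_matcher(vocab):
    """Matcher for: ADJ or NOUN, an optional hyphen, then one or more NOUN/PROPN."""
    matcher = Matcher(vocab)
    pattern = [
        {"POS": {"IN": ["ADJ", "NOUN"]}},
        {"ORTH": "-", "OP": "?"},
        {"POS": {"IN": ["NOUN", "PROPN"]}, "OP": "+"},
    ]
    matcher.add("TERM", [pattern])
    return matcher


def extract_terms(doc, matcher):
    """Return the lower-cased text of every span the matcher finds."""
    return {doc[start:end].text.lower() for _, start, end in matcher(doc)}


# Hand-tagged tokens stand in for the output of a trained tagger (assumption).
vocab = Vocab()
doc = Doc(
    vocab,
    words=["the", "flat", "earth", "theory", "is", "popular"],
    pos=["DET", "ADJ", "NOUN", "NOUN", "AUX", "ADJ"],
)
terms = extract_terms(doc, build_term_matcher(vocab))
print(sorted(terms))
```

Because the last token uses the `+` operator, the matcher returns overlapping candidates (for example both a shorter and a longer noun phrase), which is what a later scoring step such as C-value expects.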

We therefore run a second round of data cleaning to remove these non-contributing words.
This is followed by term extraction using C-values with theta = 100. The ten most common terms and the 20 least common terms have been provided below.
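
As a rough illustration of the C-value scoring mentioned above, the sketch below scores candidate terms from their frequencies. It is not the project's code: it uses the common `log2(length + 1)` variant so that single-word terms do not score zero, and it assumes theta acts as a minimum-score cutoff, which this README does not confirm. The toy counts and the small `theta` in the demo are made up.

```python
import math


def _contains(longer, shorter):
    """True if `shorter` occurs as a contiguous sub-sequence of `longer`."""
    n = len(shorter)
    return any(longer[i:i + n] == shorter for i in range(len(longer) - n + 1))


def c_values(term_freqs):
    """term_freqs maps a term (tuple of tokens) to its corpus frequency."""
    scores = {}
    for term, freq in term_freqs.items():
        # Frequencies of longer candidates that contain this term.
        nested_in = [
            f for other, f in term_freqs.items()
            if len(other) > len(term) and _contains(other, term)
        ]
        weight = math.log2(len(term) + 1)
        if nested_in:
            # Discount occurrences that are explained by longer terms.
            scores[term] = weight * (freq - sum(nested_in) / len(nested_in))
        else:
            scores[term] = weight * freq
    return scores


# Toy frequencies (hypothetical); the project uses theta = 100 on real counts.
freqs = {("flat", "earth"): 10, ("flat", "earth", "theory"): 4, ("deep", "state"): 7}
scores = c_values(freqs)
theta = 9
kept = {term for term, score in scores.items() if score >= theta}
print(kept)
```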

Tokenization:

We create tokens from the text data using the spaCy pipeline, incorporating the terms extracted above.

Topic Modelling:

We have used the LDA model from tomotopy with the following features

The twenty topics are plotted on a 2D space as below:

Looking at the top 50 words from each topic, we label them as shown in the table below.

Clustering:

The “text” data has been cleaned to remove emojis, unnecessary text and punctuation, except “.”, which is required for sentence tokenization.
We have removed all posts that have fewer than 5 words.
We have removed posts that begin with "your post" or "please contact" to remove Reddit submission messages.
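
The two filters above could look roughly like this. The function name `keep_post` is hypothetical, and matching the prefixes case-insensitively is an assumption.

```python
def keep_post(text, min_words=5, blocked_prefixes=("your post", "please contact")):
    """Return True if a post is long enough and is not a submission message."""
    stripped = text.strip()
    if len(stripped.split()) < min_words:
        return False
    # str.startswith accepts a tuple and checks each prefix in turn.
    return not stripped.lower().startswith(blocked_prefixes)


posts = [
    "Your post has been removed for breaking the rules.",
    "Too short.",
    "The earth is flat and nobody can tell me otherwise.",
]
kept_posts = [p for p in posts if keep_post(p)]
print(kept_posts)
# ['The earth is flat and nobody can tell me otherwise.']
```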

Sentence Splitter:

We have used the sentence splitter from spaCy to take a batch of 5 sentences at a time from each post.
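
One way to sketch this batching is shown here. It uses spaCy's rule-based `sentencizer` on a blank English pipeline so that no model download is needed; whether the project used this component or a parser-based splitter, and whether its batches overlap, is not stated, so non-overlapping batches are an assumption.

```python
import spacy

# Blank pipeline with only the rule-based sentence splitter.
nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")


def sentence_batches(text, batch_size=5):
    """Split a post into sentences and join them in groups of `batch_size`."""
    sentences = [s.text.strip() for s in nlp(text).sents if s.text.strip()]
    return [
        " ".join(sentences[i:i + batch_size])
        for i in range(0, len(sentences), batch_size)
    ]


post = (
    "Alpha runs. Beta walks. Gamma sleeps. Delta eats. "
    "Epsilon reads. Zeta writes. Eta sings."
)
batches = sentence_batches(post)
print(batches)
```

The last batch of a post may hold fewer than five sentences, as in this seven-sentence example.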

Embedding:

SVD:

We have used TruncatedSVD to transform the data into a 20-dimensional vector.
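
A minimal sketch of this step with scikit-learn follows. The README does not say what matrix TruncatedSVD was applied to; a TF-IDF matrix (the usual LSA setup) is assumed here, and the synthetic documents only exist to make the example runnable.

```python
import random

from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer


def lsa_embed(texts, n_components=20, seed=0):
    """TF-IDF followed by TruncatedSVD, giving one dense vector per text."""
    tfidf = TfidfVectorizer().fit_transform(texts)
    svd = TruncatedSVD(n_components=n_components, random_state=seed)
    return svd.fit_transform(tfidf)


# Synthetic documents (hypothetical) drawn from a 40-word vocabulary.
rng = random.Random(0)
vocab = [f"word{i}" for i in range(40)]
docs = [" ".join(rng.choices(vocab, k=15)) for _ in range(30)]

vectors = lsa_embed(docs)
print(vectors.shape)
```

The same `TruncatedSVD(n_components=20)` call can be applied to dense embedding matrices, which is how the spaCy and SBERT vectors below are reduced to 20 dimensions.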

Spacy:

We have used the "en_core_web_lg" model from spaCy to transform the cleaned text into a 300-dimensional vector. Further dimension reduction has been done using TruncatedSVD to bring it down to 20 dimensions.

SBERT:

Dynamic embedding was done using the SBERT sentence transformer "all-mpnet-base-v2" to obtain a 768-dimensional vector. Further dimension reduction has been done using TruncatedSVD to bring it down to 20 dimensions.

Perplexity:

As seen above, with perplexity = 50, the SBERT model clearly produces better results than LSA and spaCy. We therefore plotted the vector projections with perplexity 100, 150 and 200 to compare the results and select the best model for clustering.
A perplexity of 200 appears to give the best results for this analysis.
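
The comparison across perplexity values could be reproduced along these lines. This README does not name the projection method; t-SNE from scikit-learn is assumed here because perplexity is its main tuning parameter. The random matrix stands in for the 20-dimensional SBERT vectors, and the helper name `project_2d` is hypothetical.

```python
import numpy as np
from sklearn.manifold import TSNE


def project_2d(vectors, perplexities=(50, 100, 150, 200), seed=0):
    """Return one 2D t-SNE projection per perplexity value."""
    return {
        p: TSNE(n_components=2, perplexity=p, random_state=seed).fit_transform(vectors)
        for p in perplexities
    }


# Placeholder data (assumption): 220 random 20-dimensional vectors.
rng = np.random.default_rng(0)
vectors = rng.normal(size=(220, 20))

projections = project_2d(vectors)
for p, points in projections.items():
    print(p, points.shape)
```

t-SNE requires the perplexity to be smaller than the number of samples, so the largest value tried bounds how small a sample can be used for such a comparison.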

HDBSCAN

We have considered a cluster size of 20 and fitted the SBERT vectors. Highlighting the terms in each cluster after incorporating C-values gives us the following topics of focus.

Results:

Dynamic Bokeh Plot

For the interactive features and further details, please refer to the notebook and the project report. A few examples have been highlighted below.

Presidential Election Fraud
Evolution Theory vs Religious Beliefs
Zionists – Israel and Palestine Crisis
Flat Earth
Ukraine War
Anti-Maskers (COVID-19)

anurima-saha/Topic_Modelling_LDA_HDBSCAN
