Topic Modelling

Identifying major conspiracy theories from Reddit text

This project uses unsupervised learning to group Reddit text and identify major conspiracy theories using NLP, LDA, spaCy, SVD, SBERT embeddings and HDBSCAN.

Data:

Shape – (17155, 2)
Columns – date (dtype datetime64[ns]), text (dtype object)
Date range – 01-01-2015 to 17-02-2023 (DD-MM-YYYY)

Process Flow:

Data Cleaning:

  • Removing leading spaces
  • Removing emojis and any other component that is not a word or a number
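A minimal sketch of these two steps, assuming simple regex-based cleaning (the function name and exact patterns are illustrative):

```python
import re

def clean_text(text: str) -> str:
    # strip leading spaces
    text = text.lstrip()
    # drop emojis and any other character that is not a word, digit or space
    text = re.sub(r"[^\w\s]", " ", text, flags=re.ASCII)
    # collapse the gaps left behind
    return re.sub(r"\s+", " ", text).strip()

print(clean_text("  Great video 😂🔥 watch!!"))  # -> "Great video watch"
```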

Term Extraction:

The spaCy model "en_core_web_sm" has been used for term extraction, along with the Matcher feature it provides. We detect a pattern that begins with an adjective or a noun, followed by singular/plural common nouns or proper nouns, optionally joined by a hyphen. Among the extracted terms we notice some that do not contribute to the analysis, so we invoke a second round of data cleaning to remove such words. This is followed by term extraction using C-values with theta = 100; the ten most common and twenty least common terms are listed in the notebook.
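A sketch of such a Matcher pattern (the exact pattern used in the project may differ):

```python
import spacy
from spacy.matcher import Matcher

nlp = spacy.load("en_core_web_sm")
matcher = Matcher(nlp.vocab)

# begins with an adjective or noun, optionally hyphenated,
# followed by one or more common or proper nouns
pattern = [
    {"POS": {"IN": ["ADJ", "NOUN"]}},
    {"TEXT": "-", "OP": "?"},
    {"POS": {"IN": ["NOUN", "PROPN"]}, "OP": "+"},
]
matcher.add("TERM", [pattern])

doc = nlp("Flat-earth believers discuss presidential election fraud.")
print([doc[start:end].text for _, start, end in matcher(doc)])
```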

Tokenization:

We create tokens from the text data using a spaCy pipeline that incorporates the terms created above.
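A sketch of one way to fold the extracted terms into the pipeline, merging each multi-word term into a single token (the component name and sample terms are illustrative):

```python
import spacy
from spacy.language import Language
from spacy.matcher import PhraseMatcher
from spacy.util import filter_spans

nlp = spacy.load("en_core_web_sm")
terms = ["flat earth", "election fraud"]  # placeholder for the extracted term list
phrase_matcher = PhraseMatcher(nlp.vocab, attr="LOWER")
phrase_matcher.add("TERMS", [nlp.make_doc(t) for t in terms])

@Language.component("merge_terms")
def merge_terms(doc):
    spans = [doc[start:end] for _, start, end in phrase_matcher(doc)]
    with doc.retokenize() as retok:
        for span in filter_spans(spans):  # drop overlapping matches
            retok.merge(span)
    return doc

nlp.add_pipe("merge_terms")
print([t.text for t in nlp("They claim flat earth proves election fraud.")])
```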

Topic Modelling:

We have used the LDA model from tomotopy, configured for twenty topics (the remaining settings are listed in the notebook). The twenty topics are plotted in a 2D space, and by inspecting the top 50 words of each topic we assign each one a descriptive label (see the table in the project report).
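A minimal tomotopy sketch with k = 20 as in the write-up; the other hyperparameters and the `tokenized_docs` variable are assumptions:

```python
import tomotopy as tp

mdl = tp.LDAModel(k=20, min_cf=5, seed=42)  # k = 20 topics; min_cf/seed assumed
for tokens in tokenized_docs:               # tokenized_docs: list of token lists
    mdl.add_doc(tokens)

mdl.train(iter=1000)
for topic_id in range(mdl.k):
    print(topic_id, [w for w, _ in mdl.get_topic_words(topic_id, top_n=10)])
```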

Clustering:

  • The “text” data has been cleaned to remove emojis, unnecessary text and punctuation, except “.” which is required for sentence tokenization.
  • We have removed all posts with fewer than 5 words.
  • We have removed posts that begin with “your post” or “please contact” to drop Reddit submission messages (a sketch of these filters follows this list).
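A sketch of these filters under the stated rules (the helper name is illustrative):

```python
import re

def filter_posts(posts):
    kept = []
    for text in posts:
        # keep "." for sentence tokenization; drop emojis and other punctuation
        cleaned = re.sub(r"[^\w\s.]", " ", text)
        cleaned = re.sub(r"\s+", " ", cleaned).strip()
        if len(cleaned.split()) < 5:          # drop posts with fewer than 5 words
            continue
        if cleaned.lower().startswith(("your post", "please contact")):
            continue                          # drop reddit submission messages
        kept.append(cleaned)
    return kept
```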

Sentence Splitter:

We have used the sentence splitter from spaCy to take batches of 5 sentences at a time from each post.
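A sketch of the batching step (the function name and model choice are assumptions):

```python
import spacy

nlp = spacy.load("en_core_web_sm")

def sentence_batches(post, batch_size=5):
    """Yield up to 5 sentences at a time from a post."""
    sents = [s.text for s in nlp(post).sents]
    for i in range(0, len(sents), batch_size):
        yield " ".join(sents[i:i + batch_size])
```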

Embedding:

SVD:

We have used TruncatedSVD to transform the data into a 20-dimensional vector.
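A sketch of this LSA-style reduction, assuming a TF-IDF matrix as input (the vectorizer settings and the `chunks` variable are assumptions):

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer(max_features=10000)   # settings assumed
X = tfidf.fit_transform(chunks)               # chunks: the 5-sentence batches

svd = TruncatedSVD(n_components=20, random_state=42)
lsa_20d = svd.fit_transform(X)                # shape (n_chunks, 20)
```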

spaCy:

We have used the "en_core_web_lg" from Spacy to transform the cleaned text into a 300-dimension vector. Further dimension reduction has been done using TruncatedSVD to bring it down to 20 dimension. image
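A sketch of this step (variable names are illustrative):

```python
import numpy as np
import spacy
from sklearn.decomposition import TruncatedSVD

nlp_lg = spacy.load("en_core_web_lg")
# doc.vector averages the model's 300-d word vectors over the text
vectors = np.array([nlp_lg(text).vector for text in chunks])

svd = TruncatedSVD(n_components=20, random_state=42)
spacy_20d = svd.fit_transform(vectors)
```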

SBERT:

Dynamic embedding was done using the SBERT sentence transformer "all-mpnet-base-v2" to obtain a 768-dimensional vector. Further dimensionality reduction has been done using TruncatedSVD to bring it down to 20 dimensions.
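A sketch of the SBERT step (variable names are illustrative):

```python
from sentence_transformers import SentenceTransformer
from sklearn.decomposition import TruncatedSVD

model = SentenceTransformer("all-mpnet-base-v2")
sbert_768d = model.encode(chunks)             # shape (n_chunks, 768)

svd = TruncatedSVD(n_components=20, random_state=42)
sbert_20d = svd.fit_transform(sbert_768d)
```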

Perplexity:

With a t-SNE perplexity of 50, the SBERT model clearly produces better separation than LSA and spaCy. We therefore plotted the vector projections at perplexities of 100, 150 and 200 to compare results and select the best model for clustering. A perplexity of 200 appears to provide optimal results for the analysis.
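A sketch of the projection step, assuming the perplexity refers to scikit-learn's t-SNE:

```python
from sklearn.manifold import TSNE

for perplexity in (50, 100, 150, 200):
    tsne = TSNE(n_components=2, perplexity=perplexity, random_state=42)
    projection = tsne.fit_transform(sbert_20d)
    # plot each 2-D projection and compare cluster separation visually
```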

HDBSCAN:

We have fitted HDBSCAN on the SBERT vectors with a cluster size of 20. Highlighting terms in each cluster after incorporating C-values gives us the focus topics listed in the project report.
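A minimal sketch, assuming the "cluster size of 20" maps to HDBSCAN's min_cluster_size parameter:

```python
import hdbscan

clusterer = hdbscan.HDBSCAN(min_cluster_size=20)
labels = clusterer.fit_predict(sbert_20d)  # label -1 marks noise points
```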

Results:

Dynamic Bokeh Plot

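A minimal sketch of such an interactive scatter plot; `projection` and `labels` are the t-SNE and HDBSCAN outputs from the steps above (names assumed):

```python
from bokeh.models import ColumnDataSource
from bokeh.plotting import figure, show

source = ColumnDataSource(dict(
    x=projection[:, 0],
    y=projection[:, 1],
    cluster=[str(c) for c in labels],
))
p = figure(title="SBERT clusters (t-SNE, perplexity = 200)",
           tools="pan,wheel_zoom,reset,hover",
           tooltips=[("cluster", "@cluster")])
p.scatter("x", "y", source=source)
show(p)
```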

For the interactive features and further details, please refer to the notebook and project report. A few examples are highlighted below.

  1. Presidential Election Fraud
  2. Evolution Theory vs Religious Beliefs
  3. Zionists – Israel and Palestine Crisis
  4. Flat Earth
  5. Ukraine War
  6. Anti-Maskers (COVID-19)
