Data Handling and Pre-processing
- Data Cleaning: Processed unstructured text data to handle missing values and duplicates, ensuring high-quality input for model training.
- Feature Engineering: Utilized count vectorization, TF-IDF, and Doc2Vec to create meaningful features from raw text data, enhancing the model's ability to understand sentiment.
- Data Visualization: Used libraries like Seaborn and Matplotlib to visualize sentiment distribution across regions, helping to identify patterns and trends in the data (see the sketch below).
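A minimal sketch of the cleaning and visualization steps above, assuming the tweets live in a pandas DataFrame; the input file name and the `text`, `sentiment`, and `region` columns are illustrative placeholders, not the project's actual schema:

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

df = pd.read_csv("tweets.csv")  # hypothetical input file

# Basic cleaning: drop rows with missing text, then exact duplicates.
df = df.dropna(subset=["text"]).drop_duplicates(subset=["text"])

# Visualize the sentiment distribution per region.
sns.countplot(x="region", hue="sentiment", data=df)
plt.title("Sentiment distribution by region")
plt.tight_layout()
plt.show()
```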
Machine Learning Algorithms
- Supervised Learning: Trained the sentiment analysis model using supervised learning techniques on labeled tweet data, focusing on accurately classifying sentiment.
- Unsupervised Learning: Applied clustering methods to explore patterns in sentiment data, providing additional insights into the data's structure (sketched below).
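A sketch of the unsupervised exploration, assuming tweets are vectorized with TF-IDF and grouped with k-means; the toy tweets and cluster count are placeholders:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

tweets = ["great service today", "worst flight ever", "on time and friendly"]

# Vectorize the tweets, then assign each one to a cluster.
X = TfidfVectorizer().fit_transform(tweets)
labels = KMeans(n_clusters=2, random_state=42).fit_predict(X)
print(labels)  # cluster assignment per tweet
```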
Natural Language Processing (NLP)
- Text Pre-processing: Implemented tokenization, stemming, and lemmatization using NLTK to standardize and clean the text data, making it suitable for analysis (illustrated after this list).
- NLP Models: Leveraged advanced models like Doc2Vec for feature extraction, capturing semantic meaning from the text data.
- Libraries: Utilized NLTK and Gensim for various NLP tasks, ensuring robust and efficient text processing.
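A minimal NLTK pipeline showing the three preprocessing steps; the example sentence is illustrative, and the resource downloads are one-time:

```python
import nltk
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer, WordNetLemmatizer

nltk.download("punkt")
nltk.download("wordnet")

tokens = word_tokenize("The flights were delayed badly".lower())
stems = [PorterStemmer().stem(t) for t in tokens]
lemmas = [WordNetLemmatizer().lemmatize(t) for t in tokens]
print(stems)   # e.g. ['the', 'flight', 'were', 'delay', 'badli']
print(lemmas)  # e.g. ['the', 'flight', 'were', 'delayed', 'badly']
```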
Model Evaluation and Validation
- Metrics: Assessed model performance using metrics such as accuracy, precision, recall, and F1 score to ensure a comprehensive evaluation (see the sketch following this list).
- Cross-Validation: Conducted k-fold cross-validation to validate model stability and robustness, ensuring the model generalizes well to unseen data.
- A/B Testing: Performed A/B testing to evaluate model changes and improvements, ensuring continuous enhancement of model performance.
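A sketch of how these metrics can be computed with scikit-learn; `y_test` and `y_pred` stand in for real held-out labels and model predictions (assumed binary):

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

y_test = [1, 0, 1, 1, 0, 1]  # placeholder true labels
y_pred = [1, 0, 0, 1, 0, 1]  # placeholder model predictions

print("accuracy :", accuracy_score(y_test, y_pred))
print("precision:", precision_score(y_test, y_pred))
print("recall   :", recall_score(y_test, y_pred))
print("f1       :", f1_score(y_test, y_pred))
```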
Technologies
- Python==3.6: Primary programming language used for the project.
- NLTK==3.4.5: Used for text preprocessing tasks such as tokenization, stemming, and lemmatization.
- Gensim==3.8.3: Employed for advanced NLP tasks including the implementation of Doc2Vec.
- Matplotlib==3.2.1: Utilized for data visualization to explore and understand sentiment distributions.
- Seaborn==0.10.1: Enhanced data visualization capabilities for better presentation of sentiment analysis results.
- scikit-learn==0.21.3: Used for machine learning model training and evaluation.
Part 1: Data Collection and Pre-processing
- Data Collection: Gathered tweets using the Twitter API, ensuring a diverse dataset across various geographic regions. Also used a sample dataset from Kaggle containing tweets extracted via the Twitter API.
- Data Cleaning: Processed the raw tweet data to handle missing values, duplicates, and irrelevant content (see the loading sketch below).
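A sketch of loading and cleaning the Kaggle sample; the file name and column layout follow the public Sentiment140 release, so adjust them if your copy differs:

```python
import pandas as pd

cols = ["target", "id", "date", "flag", "user", "text"]
df = pd.read_csv("training.1600000.processed.noemoticon.csv",
                 encoding="latin-1", header=None, names=cols)

# Sentiment140 encodes negative as 0 and positive as 4; map to 0/1.
df["sentiment"] = df["target"].map({0: 0, 4: 1})

# Drop rows with missing text, then exact duplicates.
df = df.dropna(subset=["text"]).drop_duplicates(subset=["text"])
```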
Part 2: Feature Engineering
- Count Vectorization: Transformed text data into numerical vectors using count vectorization.
- TF-IDF: Applied Term Frequency-Inverse Document Frequency to weigh the importance of words in the dataset.
- Doc2Vec: Used Doc2Vec to capture the semantic meaning of tweets, enhancing feature representation. All three approaches are sketched below.
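A sketch of the three featurization approaches on a toy corpus; the corpus and the Doc2Vec hyperparameters are illustrative, not the project's actual settings:

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

corpus = ["great flight", "terrible delay", "great crew terrible food"]

counts = CountVectorizer().fit_transform(corpus)  # raw term counts
tfidf = TfidfVectorizer().fit_transform(corpus)   # counts reweighted by IDF

# Doc2Vec learns a dense vector per document from tagged, tokenized text.
docs = [TaggedDocument(words=text.split(), tags=[i])
        for i, text in enumerate(corpus)]
d2v = Doc2Vec(docs, vector_size=50, min_count=1, epochs=40)
vec = d2v.infer_vector("great food".split())  # embed an unseen tweet
```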
Part 3: Model Training and Tuning
- Supervised Learning: Trained a sentiment analysis model using labeled data, employing algorithms like logistic regression and support vector machines.
- Hyperparameter Tuning: Optimized model parameters to improve performance using techniques like grid search (see the sketch below).
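A sketch of supervised training with grid search over logistic regression; the texts, labels, and parameter grid are placeholders for the real preprocessed data and search space:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

texts = ["love it", "hate it", "so good", "so bad"]  # placeholder tweets
labels = [1, 0, 1, 0]                                # placeholder sentiments

X = TfidfVectorizer().fit_transform(texts)

# Search over regularization strength, scoring each candidate by CV accuracy.
grid = GridSearchCV(LogisticRegression(solver="liblinear"),
                    param_grid={"C": [0.1, 1.0, 10.0]}, cv=2)
grid.fit(X, labels)
print(grid.best_params_, grid.best_score_)
```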
Part 4: Model Evaluation and Validation
- Metrics: Evaluated model performance using accuracy, precision, recall, and F1 score.
- Cross-Validation: Conducted k-fold cross-validation to ensure model robustness and generalizability (sketched after this list).
- A/B Testing: Implemented A/B testing to compare different model versions and select the best-performing model.
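A sketch of k-fold cross-validation with scikit-learn; `X` and `y` are random stand-ins for the engineered features and sentiment labels:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.RandomState(0)
X = rng.rand(100, 20)            # stand-in feature matrix
y = rng.randint(0, 2, size=100)  # stand-in binary labels

# Score the model across 5 folds to check stability across splits.
scores = cross_val_score(LogisticRegression(solver="liblinear"), X, y, cv=5)
print("fold accuracies:", scores)
print("mean +/- std   :", scores.mean(), scores.std())
```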
Getting Started
- Clone the Repository: Clone this repository to your local machine.
- Install Dependencies: Manually install the required tools and libraries listed in the Technologies section above; the required versions are noted there.
- Dataset: Download the dataset using the Twitter API or a sample dataset from Kaggle (https://www.kaggle.com/datasets/kazanova/sentiment140) and place it in the designated directory.
- Run the Preprocessing Script: Preprocess the tweets using the provided scripts to clean and standardize the data.
- Feature Engineering: Execute the feature engineering scripts to transform the text data into numerical features.
- Train the Model: Use the training scripts to build and optimize the sentiment analysis model.
- Evaluate the Model: Run the evaluation scripts to assess the model performance using various metrics and validation techniques.
Contributors: Contributions are welcome. Please reach out for the project's contribution guidelines.