Twitter Bot Detector is a machine learning project that identifies Twitter/X bots. Using the TwiBot-22 dataset, we analyzed and engineered user features to build a custom dataset for training a Random Forest classifier. The model predicts whether a Twitter/X account is a bot or a human.
Through our intuitive web application, users can search for any Twitter/X account and leverage our model to evaluate its authenticity.
For tweet-level features, we analyzed 125 of each user's tweets because of rate-limiting constraints. Features were divided into the following categories:
Feature | Description |
---|---|
user_id | User's unique ID (not used for model training). |
screen_name | Screen name of the user (not used for model training). |
is_bot | Target variable indicating if the account is a bot (from TwiBot-22). |
account_age | Account age in days. |
is_blue_verified | Whether the account has a blue checkmark. |
is_verified | Whether the account is organizationally verified. |
profile_description_sentiment | Sentiment score (-1 to 1) of the user's profile description. |
following_count | Number of users the account is following. |
followers_count | Number of followers the account has. |
following_to_followers | Ratio of following count to follower count. |
is_possibly_sensitive | Whether the account contains sensitive content. |
is_default_profile_image | Whether the account has the default profile image. |
is_profile_banner | Whether the account has a banner. |
is_profile_image_valid | Whether the account has a valid profile image (via OpenCV). |
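
Two of the profile features above are simple derived values. A minimal sketch of how they might be computed follows; the function names and the choice to treat zero followers as one (to avoid division by zero) are assumptions for illustration, not taken from the project code.

```python
from datetime import datetime, timezone


def account_age_days(created_at: datetime, now: datetime) -> int:
    """Whole days between account creation and a reference time."""
    return max((now - created_at).days, 0)


def following_to_followers(following_count: int, followers_count: int) -> float:
    """Ratio of following to followers.

    Assumption: a follower count of zero is treated as one so the
    ratio stays defined for brand-new accounts.
    """
    return following_count / max(followers_count, 1)


created = datetime(2020, 1, 1, tzinfo=timezone.utc)
reference = datetime(2024, 1, 1, tzinfo=timezone.utc)
print(account_age_days(created, reference))  # 1461
print(following_to_followers(300, 150))      # 2.0
```
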
Feature | Description |
---|---|
tweet_freq | Total number of tweets divided by account age. |
likes_freq | Number of tweets liked divided by account age. |
media_freq | Media posts divided by account age. |
followers_freq | Followers count divided by account age. |
following_freq | Following count divided by account age. |
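
The frequency features above all divide a raw count by the account age in days. The sketch below shows one way to do that; the input keys (`tweet_count`, `likes_count`, `media_count`) are hypothetical names, and clamping the age to at least one day is an assumption made to keep the division safe.

```python
def per_day(count: int, account_age: int) -> float:
    """Divide a raw count by account age in days (age clamped to >= 1)."""
    return count / max(account_age, 1)


def frequency_features(user: dict) -> dict:
    """Build the *_freq features from raw counts (input keys are hypothetical)."""
    age = user["account_age"]
    return {
        "tweet_freq": per_day(user["tweet_count"], age),
        "likes_freq": per_day(user["likes_count"], age),
        "media_freq": per_day(user["media_count"], age),
        "followers_freq": per_day(user["followers_count"], age),
        "following_freq": per_day(user["following_count"], age),
    }


example_user = {
    "account_age": 200,
    "tweet_count": 1000,
    "likes_count": 400,
    "media_count": 50,
    "followers_count": 100,
    "following_count": 300,
}
freqs = frequency_features(example_user)
print(freqs["tweet_freq"], freqs["media_freq"])  # 5.0 0.25
```
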
Feature | Description |
---|---|
replies_to_owned | Ratio of replies to total owned tweets. |
quotes_to_owned | Ratio of quotes to total owned tweets. |
retweets_to_owned | Ratio of retweets to total owned tweets. |
avg_urls | Average number of URLs per tweet. |
avg_hashtags | Average number of hashtags per tweet. |
avg_tweet_sentiment | Average sentiment score for all tweets. |
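
The ratio and average features above can be computed from the sampled tweets. The sketch below is one interpretation only: it assumes each tweet is a dict with hypothetical keys `kind`, `urls`, and `hashtags`, and it assumes "owned" means tweets that are neither replies, quotes, nor retweets. The project may define these differently. Sentiment is left out because it depends on a model.

```python
def tweet_ratio_features(tweets: list[dict]) -> dict:
    """Compute ratio and average features over a sample of tweets.

    Assumption: "owned" tweets are those with kind == "original".
    """
    counts = {"original": 0, "reply": 0, "quote": 0, "retweet": 0}
    for tweet in tweets:
        counts[tweet["kind"]] += 1

    owned = max(counts["original"], 1)  # avoid dividing by zero
    total = max(len(tweets), 1)
    return {
        "replies_to_owned": counts["reply"] / owned,
        "quotes_to_owned": counts["quote"] / owned,
        "retweets_to_owned": counts["retweet"] / owned,
        "avg_urls": sum(len(t["urls"]) for t in tweets) / total,
        "avg_hashtags": sum(len(t["hashtags"]) for t in tweets) / total,
    }


sample = [
    {"kind": "original", "urls": ["a"], "hashtags": ["x", "y"]},
    {"kind": "original", "urls": [], "hashtags": []},
    {"kind": "reply", "urls": [], "hashtags": ["z"]},
    {"kind": "retweet", "urls": ["b"], "hashtags": []},
]
ratios = tweet_ratio_features(sample)
print(ratios["replies_to_owned"], ratios["avg_hashtags"])  # 0.5 0.75
```
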
Feature | Description |
---|---|
avg_replies_per_follower | Average number of replies per 1000 followers. |
avg_likes_per_follower | Average number of likes per 1000 followers. |
avg_retweets_per_follower | Average number of retweets per 1000 followers. |
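
The engagement features above scale an average per-tweet count to 1000 followers. A minimal sketch, assuming the average is taken over the sampled tweets and that a follower count of zero is clamped to one; the function name is hypothetical.

```python
from statistics import mean


def per_1000_followers(per_tweet_counts: list[int], followers_count: int) -> float:
    """Average per-tweet engagement, scaled to 1000 followers."""
    if not per_tweet_counts:
        return 0.0
    return 1000 * mean(per_tweet_counts) / max(followers_count, 1)


followers = 500
print(per_1000_followers([2, 4], followers))    # 6.0  (avg_replies_per_follower)
print(per_1000_followers([10, 30], followers))  # 40.0 (avg_likes_per_follower)
print(per_1000_followers([1, 1], followers))    # 2.0  (avg_retweets_per_follower)
```
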
- Data Constraints: Without access to the Twitter/X API, we used the Twikit web scraping library, whose rate limits restricted how much data we could process.
- Dataset Freshness: The TwiBot-22 dataset, while a reputable benchmark, is nearly three years old and may not reflect the evolving Twitter/X user base.
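
One common way to cope with rate limiting like that described above is to retry with exponential backoff. The sketch below is generic and does not use Twikit's API: `RateLimited` and `fetch_with_backoff` are hypothetical names standing in for whatever error and fetch call the scraper actually exposes.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class RateLimited(Exception):
    """Hypothetical stand-in for the error raised when requests are throttled."""


def fetch_with_backoff(
    fetch: Callable[[], T],
    max_attempts: int = 4,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fetch(), doubling the wait after each rate-limit error."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        try:
            return fetch()
        except RateLimited:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the error to the caller
            sleep(base_delay * 2 ** attempt)
    raise RuntimeError("unreachable")


# Usage with a fake fetch that is throttled twice, then succeeds.
attempts = {"n": 0}


def flaky_fetch() -> str:
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise RateLimited()
    return "tweets"


delays: list[float] = []
result = fetch_with_backoff(flaky_fetch, sleep=delays.append)  # record delays instead of sleeping
print(result, delays)  # tweets [1.0, 2.0]
```
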
- 🐍 Backend: Python & Flask
- 💻 Frontend: TypeScript, Next.js, TailwindCSS
- Python: Extensive libraries for ML development.
- Twikit: Web scraping Twitter/X data.
- Jupyter Notebook: Interactive environment for model visualization.
- Pandas: Data manipulation and processing.
- scikit-learn: Implementation of Random Forest Classifier.
- OpenCV: Image analysis for profile validation.
- Transformers: Sentiment analysis for text data.
- SHAP: Model interpretation and fine-tuning.
- Joblib: Saving and deploying the trained model.
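
To illustrate how scikit-learn and Joblib from the list above fit together, here is a minimal sketch of training a Random Forest classifier and saving it. It uses synthetic random data as a stand-in for the engineered features, not TwiBot-22, and the hyperparameters and the file name `bot_detector.joblib` are assumptions rather than the project's actual settings.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 200 accounts x 4 numeric features (NOT TwiBot-22).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = (X[:, 0] + X[:, 1] > 0).astype(int)  # made-up labeling rule: 1 = bot, 0 = human

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_train, y_train)
print("held-out accuracy:", model.score(X_test, y_test))

# Persist the trained model and load it back, as a serving backend might.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "bot_detector.joblib")  # hypothetical file name
    joblib.dump(model, path)
    restored = joblib.load(path)

predictions = restored.predict(X_test)
```

The reloaded model should give the same predictions as the original, which is what makes it possible to train in a notebook and predict from a separate server process.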
Clone the repository and install dependencies to run the project locally.
# Clone the repository
git clone https://github.com/aross2010/twitter-bot-detector.git
# Navigate to the project directory
cd twitter-bot-detector
# Navigate to the backend directory
cd backend
# Install dependencies (backend)
pip install -r requirements.txt
# Start the backend server
python3 app.py
# Navigate to the frontend directory (in a new terminal, from the project root)
cd frontend/client-bot-detector
# Install dependencies (frontend)
npm install
# Start the frontend server
npm run dev