This project extracts data from Twitter and stores it as a CSV file with a specific naming format. The second part creates a MongoDB database called Tweets_db and stores the extracted tweets in a collection named raw_tweets.
Write a script that downloads tweet data on a specific search topic using the standard search API. The script should contain the following functions:
- scrape_tweets(), which takes the following parameters and returns a dataframe:
  - Search topic
  - The number of tweets to download per request
  - The number of requests
- save_results_as_csv(), which takes the dataframe returned by scrape_tweets() and saves it to a CSV file with the following naming format: tweets_downloaded_yymmdd_hhmmss.csv (where 'yymmdd_hhmmss' is the current timestamp)
The following attributes of the tweets should be extracted:
- Tweet text
- Tweet id
- Source
- Coordinates
- Retweet count
- Likes count
- User info:
  - Username
  - Screen name
  - Location
  - Friends count
  - Verification status
  - Description
  - Followers count
Create a MongoDB database called Tweets_db and store the extracted tweets into a collection named: raw_tweets.
- Python version: 3.6
- Packages: json, pandas, tweepy, pymongo
- OS: macOS Catalina
- External client tools: MongoDB Atlas, MongoDB Compass
- Environment tooling: virtualenv, requirements.txt
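The environment setup above boils down to a few terminal commands (a sketch, assuming macOS with `python3` on the PATH; the environment name `tweets-env` is my own choice):

```shell
# create and activate an isolated environment
python3 -m venv tweets-env          # virtualenv tweets-env also works
source tweets-env/bin/activate

# install the client libraries used by the script
pip install tweepy pandas pymongo

# record the exact versions so others can reproduce the setup
pip freeze > requirements.txt
```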
To apply for a Twitter developer key, you first need a Twitter account; if you do not have one, follow these steps to create one.
The first thing I did was apply for the Twitter developer key. This is not as straightforward as applying for the YouTube developer key. Your best chance is to apply as a hobbyist, which gives a higher probability of your application getting approved faster. In some cases approval is instant, especially if your reason for applying is convincing. Use this link to apply for access.
When you finally get your credentials, it is best to copy them and keep them safe. You will be presented with an API_key, API_secret_key, Bearer_token, Access_token, and Access_token_secret. In writing the script, all of them will be needed except the Bearer_token.
The kind of environment to use is also key. There are a number of environments data engineers can use for this exercise, e.g. Google Colab, Jupyter Notebook, etc. I used Jupyter Notebook because I am conversant with it and because everything I do is saved on my local computer. For Google Colab, you need internet access to create your workspace and to access your files; with an unstable connection, this delays the execution of your project.
By default, Jupyter Notebook doesn't come with pre-installed packages for interacting with the Twitter backend, so I had to install Tweepy (pip install tweepy). Before installing these libraries, be sure to check your Python version so you install the appropriate client version. You can follow these steps to help create an authentication object.
The next step I took was to hide my credentials by writing a script in my local container. I then imported the module in my main script and called the credentials from it. There are other ways around this, but you have to find the one that works best for you.
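One simple pattern for this (the file and variable names here are my own choice, not a standard) is to put the keys in a separate credentials.py that stays out of version control:

```python
# credentials.py -- keep this file out of version control (add it to .gitignore)
# The values below are placeholders; substitute your own keys.
API_key = "YOUR_API_KEY"
API_secret_key = "YOUR_API_SECRET_KEY"
Access_token = "YOUR_ACCESS_TOKEN"
Access_token_secret = "YOUR_ACCESS_TOKEN_SECRET"
```

In the scraper script you would then write `import credentials` and pass `credentials.API_key` and the other values to `tweepy.OAuthHandler`, so no key ever appears in the main script.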
Supplementary files that helped me get around the task
There are a number of ways to interact with MongoDB: using Atlas, Compass, or your local machine. I had to work with my local machine, so I installed MongoDB using the terminal.
I already had Homebrew installed on my machine, so installing MongoDB was pretty straightforward, but if you do not have Homebrew installed, follow these steps to install it.
Follow these steps to install MongoDB locally.
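With Homebrew in place, the install comes down to a few commands. These follow MongoDB's official Homebrew tap; the exact formula name may differ depending on the version you install:

```shell
# add MongoDB's official Homebrew tap and install the community edition
brew tap mongodb/brew
brew install mongodb-community

# run mongod as a background service (listens on localhost:27017 by default)
brew services start mongodb-community
```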
Since I will be using Python, I had to learn how to connect to a MongoDB database. Here are the steps I followed to perform basic Create, Retrieve, Update, and Delete (CRUD) operations using PyMongo. This will take you through the steps to create clusters and configure admin settings.
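As a minimal sketch of those CRUD operations with PyMongo (assuming a local mongod on the default port; the database and collection names match the task):

```python
def crud_demo():
    """Run one Create/Retrieve/Update/Delete cycle against Tweets_db.raw_tweets."""
    from pymongo import MongoClient  # local import so the file parses without pymongo

    client = MongoClient("mongodb://localhost:27017/")
    col = client["Tweets_db"]["raw_tweets"]

    # Create: insert a single document
    col.insert_one({"id": 1, "text": "hello"})
    # Retrieve: fetch it back by a field value
    doc = col.find_one({"id": 1})
    # Update: modify a field with the $set operator
    col.update_one({"id": 1}, {"$set": {"text": "hello world"}})
    # Delete: remove the document again
    col.delete_one({"id": 1})
    return doc
```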
I ran into a problem when connecting to Atlas that I want to point out. A ConnectionError can be an indication that PyMongo is not getting access to your database. When this happens, check the Database Access settings under Security and change the authentication method to SCRAM and the MongoDB Roles to readWriteAnyDatabase@admin as shown in the image below.
The next thing to check is the IP address under Network Access. You have to change the IP address to 0.0.0.0/0, which will accept connections from any endpoint with your credentials, as shown below.
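Once access is configured, connecting to Atlas from Python looks roughly like this. The helper names are my own, and Atlas shows your exact connection string under its "Connect" dialog, so treat this as a sketch of the format:

```python
def atlas_uri(user, password, cluster_host):
    """Build an Atlas SRV connection string (format as shown in the Atlas Connect dialog)."""
    return f"mongodb+srv://{user}:{password}@{cluster_host}/?retryWrites=true&w=majority"


def connect_to_atlas(user, password, cluster_host):
    """Return a MongoClient connected to the given Atlas cluster."""
    from pymongo import MongoClient  # local import: the URI helper works without pymongo
    return MongoClient(atlas_uri(user, password, cluster_host))
```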
After successfully connecting to my database with PyMongo, I ingested my CSV file (shown in my script) into my collection. The data can be seen in the collections in my cluster, but Compass is another way to view and interact with it. I installed Compass, which let me see a graphical representation of my data. You have to be familiar with JSON if you want to perform any CRUD operations.
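The ingestion step can be sketched as follows: read the CSV back with pandas, turn each row into a dict, and hand the list to insert_many(). The function names are my own; only ingest_csv() needs a running MongoDB:

```python
import pandas as pd


def dataframe_to_documents(df):
    """Turn each DataFrame row into a dict, the shape insert_many() expects."""
    return df.to_dict(orient="records")


def ingest_csv(path):
    """Load a downloaded tweets CSV and store it in Tweets_db.raw_tweets."""
    from pymongo import MongoClient  # local import: the helper above needs only pandas

    client = MongoClient("mongodb://localhost:27017/")
    docs = dataframe_to_documents(pd.read_csv(path))
    client["Tweets_db"]["raw_tweets"].insert_many(docs)
    return len(docs)
```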
I used VS Code to access my .py file to create a virtual environment and the requirements.txt. If you're new to using virtual environments, this material could be of help if you want to use VS Code.