The main task of this data challenge was to handle the class imbalance using techniques like random oversampling or SMOTE. Please see the EDA folder for different visualizations of the provided data.
I have tried to map the sentiments on maps and have plotted different analyses of the data using data profiling.
If you found this repo useful in some way, please star it.
I still need to improve the code, and I will keep working on it as I get more time. Many thanks :)
Useful URLs to understand the solution:
- Right way to oversample in predictive modeling: https://beckernick.github.io/oversampling-modeling/
- Cross-validation pipeline with a random forest: https://www.kaggle.com/alexisbcook/cross-validation
- Tackling class imbalance in Twitter sentiment analysis: https://towardsdatascience.com/yet-another-twitter-sentiment-analysis-part-1-tackling-class-imbalance-4d7a7f717d44
- Preventing data leakage: https://www.kaggle.com/alexisbcook/data-leakage
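The first and last links share one key point: resampling must happen inside cross-validation, on the training folds only, or the validation scores become optimistically biased. A minimal sketch of that idea, assuming `imbalanced-learn` is installed (the `X`/`y` arrays here are random placeholders, not the challenge data):

```python
# Minimal sketch: oversample only inside each CV training fold.
import numpy as np
from imblearn.over_sampling import RandomOverSampler
from imblearn.pipeline import Pipeline  # imblearn's Pipeline resamples on fit only
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X = np.random.rand(200, 5)                    # placeholder features
y = np.random.choice([0, 0, 0, 1], size=200)  # imbalanced placeholder labels

pipeline = Pipeline([
    ("oversample", RandomOverSampler(random_state=42)),
    ("model", RandomForestClassifier(random_state=42)),
])

# cross_val_score refits the whole pipeline per fold, so the oversampler
# never sees the validation fold and no leakage occurs.
scores = cross_val_score(pipeline, X, y, cv=5, scoring="f1_macro")
print(scores.mean())
```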
This question will test some basic skills in cleaning data and building a machine learning pipeline.
The focus of this test is to evaluate:
- Ability to quickly learn a new framework (luigi)
- Ability to manipulate and process data (cleaning, processing, feature engineering)
- Competency in software development
This test does not focus on modelling accuracy, ability to use a fancy model, or efficiency. It is mainly about the mechanics of building a proper machine learning pipeline.
There are two files: `airline_tweets.csv` and `cities.csv`.
`airline_tweets.csv` has Twitter data regarding airline sentiment, augmented with some extra columns. The relevant columns are:
- `airline_sentiment`: a string indicating if the tweet had positive, neutral or negative sentiment.
- `tweet_coord`: a string of the form "[latitude, longitude]" if a geo-coordinate exists for that tweet, or an empty string otherwise.
`cities.csv` contains information about latitude and longitude for large cities. The relevant columns are:
- `name`: The name of the city.
- `latitude`: The latitude of the city.
- `longitude`: The longitude of the city.
Build a basic ML pipeline using the `luigi` Python framework. The pipeline should clean the tweet data, prepare features for building a model, train a classifier and score using the model. The pipeline should have these steps:
`CleanDataTask`: Cleans the input tweet CSV file by removing any rows without valid geo-coordinates.
- An invalid coordinate has either an empty `tweet_coord` column or is the coordinate (0.0, 0.0). A pandas sketch of this filtering follows below.
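A possible sketch of the cleaning logic in pandas (the column names come from the spec; the parsing assumes `tweet_coord` is stored like "[37.77, -122.41]"):

```python
# Sketch of CleanDataTask's core logic: drop rows with missing or (0.0, 0.0) coords.
import ast
import pandas as pd

def clean_tweets(path="airline_tweets.csv"):
    df = pd.read_csv(path)
    # Drop rows where tweet_coord is missing or an empty string.
    df = df[df["tweet_coord"].notna() & (df["tweet_coord"].str.strip() != "")]
    # Parse the "[lat, lon]" string into a (lat, lon) tuple.
    coords = df["tweet_coord"].apply(lambda s: tuple(ast.literal_eval(s)))
    # Drop the (0.0, 0.0) placeholder coordinates.
    return df[coords.apply(lambda c: c != (0.0, 0.0))]
```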
`TrainingDataTask`: Extracts features/outcome variable in preparation for training a model.
- This prepares the cleaned data into the exact form that can be fit by the model.
- The "y" variable will be the multi-class sentiment (0, 1, 2 for negative, neutral and positive respectively).
- The "X" variables will be the closest city to the `tweet_coord` using Euclidean distance.
- You should use the `cities.csv` file to find the closest city.
- You probably will need to one-hot encode the city names, as sketched below.
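One way this step could look, using scipy's `cKDTree` for the nearest-city lookup and pandas for the one-hot encoding (function and variable names here are illustrative, not part of the provided skeleton):

```python
# Sketch of TrainingDataTask: nearest city by Euclidean distance, then one-hot.
import ast
import numpy as np
import pandas as pd
from scipy.spatial import cKDTree

SENTIMENT_MAP = {"negative": 0, "neutral": 1, "positive": 2}

def build_features(tweets: pd.DataFrame, cities: pd.DataFrame):
    # A KD-tree over city (latitude, longitude) pairs gives a fast
    # nearest-neighbour lookup under plain Euclidean distance.
    tree = cKDTree(cities[["latitude", "longitude"]].to_numpy())
    coords = np.array([ast.literal_eval(s) for s in tweets["tweet_coord"]])
    _, idx = tree.query(coords)
    tweets = tweets.assign(closest_city=cities["name"].to_numpy()[idx])

    # One-hot encode the city names so the classifier can consume them.
    X = pd.get_dummies(tweets["closest_city"])
    y = tweets["airline_sentiment"].map(SENTIMENT_MAP)
    return X, y
```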
`TrainModelTask`: Trains a classifier to predict negative, neutral, positive based only on the input city.
- Train a classifier that uses the closest cities as features.
- Dump the fitted model to the output file.
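For example (logistic regression is an arbitrary choice here, since the test is explicitly not about model accuracy; `joblib` is one assumed option for serialization):

```python
# Sketch of TrainModelTask: fit a simple multi-class classifier and persist it.
import joblib
from sklearn.linear_model import LogisticRegression

def train_model(X, y, output_file="model.pkl"):
    # Any multi-class classifier would do here; accuracy is not the point.
    clf = LogisticRegression(max_iter=1000)
    clf.fit(X, y)
    # Dump the fitted model so ScoreTask can reload it.
    joblib.dump(clf, output_file)
    return clf
```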
`ScoreTask`: Uses the trained model to compute the sentiment for each city.
- Use the trained model to predict the probability/score of negative, neutral and positive sentiment for each city.
- Output a list of cities, sorted by the predicted positive sentiment score, to the output file.
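A hedged sketch of the scoring step, reusing the one-hot city columns from training (names illustrative; it assumes all three sentiment classes appeared in the training data):

```python
# Sketch of ScoreTask: score every known city, sort by positive sentiment.
import numpy as np
import pandas as pd

def score_cities(clf, city_names, output_file="scores.csv"):
    # One row per city: the identity matrix is the full set of one-hot rows.
    # Column order must match the one-hot columns used at training time.
    X_all = pd.DataFrame(np.eye(len(city_names), dtype=int), columns=city_names)
    proba = clf.predict_proba(X_all)  # columns follow clf.classes_: 0, 1, 2

    scores = pd.DataFrame({
        "city": city_names,
        "negative": proba[:, 0],
        "neutral": proba[:, 1],
        "positive": proba[:, 2],
    }).sort_values("positive", ascending=False)
    scores.to_csv(output_file, index=False)
    return scores
```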
- We have provided a skeleton file named `pipeline.py` to get you started, and a script `run.sh` that will execute this luigi pipeline.
- You must use the `luigi` package.
- You must use Python (any version is fine).
- Feel free to use any Python packages. We used `pandas`, `scikit-learn`, `numpy` (as seen in the included requirements.txt).
- Do not worry too much about run-time/memory efficiency. So long as it runs within 15 minutes, it should be fine.
* Luigi package: `http://luigi.readthedocs.io/en/stable/`
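If `luigi` is new to you, the core pattern is small: each task declares its dependencies (`requires`), its output target (`output`), and its work (`run`). A minimal sketch, with illustrative bodies rather than the provided skeleton's contents:

```python
# Minimal luigi sketch: two chained tasks (illustrative, not pipeline.py).
import luigi

class CleanDataTask(luigi.Task):
    tweet_file = luigi.Parameter(default="airline_tweets.csv")

    def output(self):
        # Luigi uses this target's existence to decide whether the task ran.
        return luigi.LocalTarget("clean_data.csv")

    def run(self):
        # Read self.tweet_file, drop invalid rows, write the result.
        with self.output().open("w") as f:
            f.write("...")

class TrainingDataTask(luigi.Task):
    def requires(self):
        # Declaring a dependency makes luigi run CleanDataTask first.
        return CleanDataTask()

    def output(self):
        return luigi.LocalTarget("features.csv")

    def run(self):
        with self.input().open() as f:
            pass  # build features from the cleaned data here
```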