This repo collects solutions to different machine learning tasks, most of which come from Kaggle competitions.
I wrote everything in Python 3 using Jupyter Notebooks in an Anaconda3 environment.
The examples contain code for both structured and unstructured data. They mostly serve to showcase the code, because the data I used was stored locally.
Whale Identification (one-shot learning, siamese network)
Protein Classification (Multilabel classification, resnet50)
structured data example and feature engineering
Kaggle competition: https://www.kaggle.com/c/champs-scalar-coupling
In this competition, you will develop an algorithm that can predict the magnetic interaction between two atoms in a molecule (i.e., the scalar coupling constant).
Dataformat | Metric | Prediction |
---|---|---|
structured data (graph based) | log of the mean absolute error (log MAE) | regression |
I started working on this competition with LightGBM and then used a modified implementation of a message passing neural network.
A lot of additional data was provided, but it is not directly usable because it is not available for the test set.
Domain knowledge about atom interactions in molecules was important, at least to a certain degree.
Most of the features were calculated using RDKit and Open Babel.
- 1JHC: -1.371
- 2JHC: -2.229
- 3JHC: -1.975
- 1JHN: -1.538
- 2JHN: -2.504
- 3JHN: -2.517
- 2JHH: -2.501
- 3JHH: -2.383
Average local log MAE over the coupling types above: -2.12
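As a rough illustration of how these numbers are obtained, the sketch below computes the log of the mean absolute error separately for each coupling type and then averages the per-type values. The function name `group_log_mae`, the `floor` value, and the toy inputs are assumptions made for this example and are not taken from the notebooks.

```python
import math
from collections import defaultdict


def group_log_mae(y_true, y_pred, types, floor=1e-9):
    """Return (overall score, per-type scores) for a grouped log-MAE metric.

    `floor` guards against log(0) when a group is predicted perfectly;
    its value here is an assumption.
    """
    abs_errors = defaultdict(list)
    for t, p, group in zip(y_true, y_pred, types):
        abs_errors[group].append(abs(t - p))

    # log(MAE) per coupling type
    per_type = {
        group: math.log(max(sum(errs) / len(errs), floor))
        for group, errs in abs_errors.items()
    }
    # The overall score is the plain average of the per-type values.
    overall = sum(per_type.values()) / len(per_type)
    return overall, per_type


# Toy values, made up for illustration (not competition data).
y_true = [84.8, 85.1, -11.2, -10.9]
y_pred = [84.7, 85.3, -11.0, -10.9]
types = ["1JHC", "1JHC", "2JHH", "2JHH"]
score, per_type = group_log_mae(y_true, y_pred, types)
print(score, per_type)
```

Because the log is taken per type before averaging, lower (more negative) is better, and a type with small absolute errors can pull the score down noticeably. Averaging the eight per-type values listed above gives about -2.127.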
top 2%
leaderboard | score | placement |
---|---|---|
public | -2.37190 | 43/2757 |
private | -2.36477 | 42/2757 |
In this competition, you will develop models capable of classifying mixed patterns of proteins in microscope images. The Human Protein Atlas will use these models to build a tool integrated with their smart-microscopy system to identify a protein's location(s) from a high-throughput image.
Dataformat | Metric | Prediction |
---|---|---|
4 channel image | macro F1 Score | multi label classification |
118/2172: top 5%
Focal loss worked much better than binary cross-entropy.
Started with ResNet34 using fastai for multilabel classification; ResNet50 worked even better (by about 0.05 macro F1 score).
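To make the difference between focal loss and binary cross-entropy concrete, here is a minimal pure-Python sketch of a binary focal loss applied independently to each label. It is not the fastai/PyTorch code used in the notebooks; the function name and the defaults `gamma=2.0` and `alpha=0.25` are assumptions chosen for illustration.

```python
import math


def binary_focal_loss(probs, targets, gamma=2.0, alpha=0.25, eps=1e-7):
    """Mean focal loss over independent binary labels (multi-label setting).

    probs:   predicted probabilities, one per label
    targets: 0/1 ground-truth values, one per label
    gamma:   focusing parameter; gamma=0 reduces to weighted cross-entropy
    alpha:   weight of the positive class (assumed default)
    """
    total = 0.0
    for p, y in zip(probs, targets):
        p = min(max(p, eps), 1.0 - eps)          # avoid log(0)
        p_t = p if y == 1 else 1.0 - p           # probability of the true class
        alpha_t = alpha if y == 1 else 1.0 - alpha
        # (1 - p_t)^gamma down-weights labels that are already well classified
        total += -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)
    return total / len(probs)


# An easy positive (p=0.9) contributes far less than a hard one (p=0.1).
easy = binary_focal_loss([0.9], [1])
hard = binary_focal_loss([0.1], [1])
print(easy, hard)
```

The down-weighting of easy labels is one plausible reason the loss helps when most of the labels are negative for any given image.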
In this competition, you’ll detect and delineate distinct objects of interest in biological images depicting neuronal cell types commonly used in the study of neurological disorders. More specifically, you'll use phase contrast microscopy images to train and test your model for instance segmentation of neuronal cells. Successful models will do this with a high level of accuracy.
224/1505: top 15%
In this competition, you’re challenged to build an algorithm to identify individual whales in images. You’ll analyze Happywhale’s database of over 25,000 images, gathered from research institutions and public contributors. By contributing, you’ll help to open rich fields of understanding for marine mammal population dynamics around the globe.
Dataformat | Metric | Prediction |
---|---|---|
3 channel image | Mean Average Precision @ 5 | single label classification (@ 5) |
555/2131: top 26%
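For reference, Mean Average Precision @ 5 with a single correct label per image can be sketched as below: each image scores 1/rank of the first correct guess among its top five predictions, or 0 if the label is missing. The function name and the whale IDs are hypothetical.

```python
def map_at_5(true_labels, predictions):
    """Mean Average Precision @ 5 for one true label per sample.

    true_labels: list of correct labels
    predictions: list of ranked label lists (best guess first)
    """
    total = 0.0
    for truth, preds in zip(true_labels, predictions):
        for rank, pred in enumerate(preds[:5], start=1):
            if pred == truth:
                total += 1.0 / rank   # credit depends on the rank of the hit
                break                 # only the first hit counts
    return total / len(true_labels)


# Hypothetical IDs: hit at rank 1, hit at rank 2, and a miss.
truth = ["w_1", "w_2", "w_3"]
preds = [["w_1", "w_9"], ["w_9", "w_2"], ["w_7", "w_8"]]
print(map_at_5(truth, preds))
```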
The greatest challenge in this competition was the small number of images per humpback whale label (1-20 different images). I therefore tried different kinds of one-shot learning algorithms, such as siamese networks with LAP matching of positive and negative examples.
In the end, metric learning and siamese networks turned out to be good approaches to the problem, but time ran short.
In this competition, you're challenged to build a model that recognizes toxicity and minimizes this type of unintended bias with respect to mentions of identities. You'll be using a dataset labeled for identity mentions and optimizing a metric designed to measure unintended bias.
Dataformat | Metric | Prediction |
---|---|---|
text | generalized mean of bias AUCs | classification |
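The "generalized mean" in the metric is a power mean over per-subgroup AUCs. The sketch below shows only that aggregation step; the exponent `p=-5` is the value recalled from the competition's evaluation description and should be treated as an assumption, and the AUC values are made up.

```python
def power_mean(values, p=-5.0):
    """Generalized (power) mean of positive values.

    A negative p pulls the result toward the smallest value, so the
    worst-performing identity subgroup dominates the score.
    """
    return (sum(v ** p for v in values) / len(values)) ** (1.0 / p)


# Hypothetical per-subgroup AUCs: one weak subgroup drags the mean down.
subgroup_aucs = [0.95, 0.60]
print(power_mean(subgroup_aucs))             # well below the arithmetic mean
print(sum(subgroup_aucs) / len(subgroup_aucs))
```

The full competition score combines several such means with the overall AUC; that combination is not shown here.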
This was the first NLP competition I joined, and I still feel I have a lot to learn in this space.
Used GloVe embeddings combined with an LSTM + word embedding neural network.
0.93568, 718/2646 placement
In this competition, you are required to locate ships in images, and put an aligned bounding box segment around the ships you locate. Many images do not contain ships, and those that do may contain multiple ships. Ships within and across images may differ in size (sometimes significantly) and be located in open sea, at docks, marinas, etc.
Dataformat | Metric | Prediction |
---|---|---|
3 channel image | F2 Score | binary segmentation |
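The F2 score weights recall more heavily than precision. A minimal sketch of the F-beta formula from true positive, false positive, and false negative counts follows; the function name and the counts are illustrative assumptions. The sketch does not cover how predicted ship masks are matched to ground-truth masks (by IoU threshold), which the competition scoring also involves.

```python
def f_beta(tp, fp, fn, beta=2.0):
    """F-beta score from counts; beta=2 weights recall higher than precision."""
    b2 = beta ** 2
    denom = (1 + b2) * tp + b2 * fn + fp
    # Returning 0.0 for an empty denominator is a choice made for this sketch.
    return (1 + b2) * tp / denom if denom else 0.0


# Made-up counts: missed ships (fn) cost more than false alarms (fp).
print(f_beta(tp=8, fp=2, fn=4))
print(f_beta(tp=8, fp=4, fn=2))
```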
leaderboard | score | placement |
---|---|---|
public | 0.70823 | 208/884 |
private | 0.82704 | 524/884 |
Used fastai with ResNet34 for image segmentation.
My placement dropped sharply on the private test set: I trained on a selected part of the train set to reduce computing time, but my selection method was lacking. I will definitely keep this mistake in mind for the future.
You are given 5 years of store-item sales data, and asked to predict 3 months of sales for 50 different items at 10 different stores.
Dataformat | Metric | Prediction |
---|---|---|
time series data | symmetric mean absolute percentage error | regression |
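For readers unfamiliar with the metric, the following sketch computes the symmetric mean absolute percentage error (SMAPE) in percent. The function name and the sales figures are made up for illustration, and treating a pair where both actual and forecast are zero as zero error is an assumption of this sketch.

```python
def smape(actual, forecast):
    """Symmetric mean absolute percentage error, in percent (0 to 200)."""
    total = 0.0
    for a, f in zip(actual, forecast):
        denom = (abs(a) + abs(f)) / 2.0
        # Assumption: a 0/0 pair counts as zero error instead of dividing by zero.
        total += abs(f - a) / denom if denom else 0.0
    return 100.0 * total / len(actual)


# Made-up daily sales for one item at one store.
actual = [100, 200]
forecast = [110, 180]
print(smape(actual, forecast))
```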
Because of a leak, the competition was reset in the last few weeks, and I did not have time to submit again.
In this competition, you will develop an algorithm that can predict the magnetic interaction between two atoms in a molecule (i.e., the scalar coupling constant).
Dataformat | Metric | Prediction |
---|---|---|
structured data (graph based) | log of the mean absolute error (log MAE) | regression |
A lot of additional data was provided, but it is not directly usable because it is not available for the test set. Domain knowledge about atom interactions in molecules also seems really important.
Solved using LightGBM and a message passing neural network.
top 2%
leaderboard | score | placement |
---|---|---|
public | -2.37190 | 43/2757 |
private | -2.36477 | 42/2757 |