Commit f1daada: remove completed projects

ablaom committed Dec 27, 2024 (1 parent: 50106a6)
Showing 1 changed file, jsoc/gsoc/MLJ.md, with 116 additions and 117 deletions.

### List of projects

MLJ is released under the MIT license.

\toc

## Categorical variable encoding

Extend the categorical variable encoding of MLJ.

**Difficulty.** Moderate. **Duration.** 350 hours

### Description

MLJ provides basic one-hot encoding of categorical variables but no sophisticated encoding
techniques. One-hot encoding is rather limited, in particular when a categorical feature has a
very large number of classes. Many other techniques exist, and this project aims to make
some of these available to the MLJ user.
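For instance, one popular alternative is *target* (mean) encoding. The sketch below, in plain Julia with made-up data and a made-up smoothing parameter `m`, only illustrates the idea — it is not a proposed MLJ API:

```julia
# Target (mean) encoding: replace each class with a smoothed mean of the target.
# `m` is an illustrative smoothing strength pulling rare classes toward the
# global target mean.
function target_encode(x::AbstractVector, y::AbstractVector{<:Real}; m::Real=1.0)
    μ = sum(y) / length(y)                      # global target mean
    sums = Dict{Any,Float64}()
    counts = Dict{Any,Int}()
    for (xi, yi) in zip(x, y)
        sums[xi] = get(sums, xi, 0.0) + yi
        counts[xi] = get(counts, xi, 0) + 1
    end
    # smoothed per-class mean: (sum + m*μ) / (count + m)
    enc = Dict(k => (sums[k] + m * μ) / (counts[k] + m) for k in keys(sums))
    return [enc[xi] for xi in x]
end

x = ["a", "b", "a", "c"]
y = [1.0, 0.0, 1.0, 0.0]
target_encode(x, y; m=0.0)  # → [1.0, 0.0, 1.0, 0.0]
```

With `m > 0` the encoding shrinks classes with few observations toward the global mean, which guards against overfitting on rare classes.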


**Mentors.** [Anthony Blaom](https://ablaom.github.io/) (best contact: direct message on Julia slack)

### Prerequisites

- Julia language fluency is essential.

- Git-workflow familiarity is strongly preferred.

- Experience with machine learning and data science workflows.

- Familiarity with MLJ's API a plus.

### Your contribution

In this project you will survey popular existing methods for encoding categorical
variables. In collaboration with the mentor, you will make a plan for integrating some of
these techniques into MLJ. You will begin work on the plan, initially focusing on simple
methods, providing MLJ interfaces to existing Julia packages, or new implementations where
needed. If the project advances well, you will implement more advanced techniques, such as
[entity embedding](https://arxiv.org/abs/1604.06737) via MLJFlux.jl (MLJ's neural network
interface).
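For orientation, the core of entity embedding (Guo and Berkhahn, 2016) is a learned lookup table mapping each class to a dense vector. Everything below (names, sizes) is illustrative only; in practice the matrix `E` would be trained by backpropagation, via MLJFlux/Flux:

```julia
# Entity embedding reduces each of k classes to a dense d-vector, learned
# jointly with the rest of the network. The lookup itself is just column
# indexing into a k-column weight matrix.
k, d = 5, 2                        # 5 classes embedded in 2 dimensions
E = randn(d, k)                    # embedding matrix (trainable in practice)
levels = ["mon", "tue", "wed", "thu", "fri"]
index = Dict(l => i for (i, l) in enumerate(levels))
embed(x) = E[:, index[x]]          # embedding of one categorical value
embed("wed")                       # a 2-vector for class "wed"
```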

### References

- Existing encoding in MLJ:
[OneHotEncoder](https://alan-turing-institute.github.io/MLJ.jl/dev/models/OneHotEncoder_MLJModels/#OneHotEncoder_MLJModels);
[ContinuousEncoder](https://alan-turing-institute.github.io/MLJ.jl/dev/models/ContinuousEncoder_MLJModels/#ContinuousEncoder_MLJModels);
[UnivariateContinuousTimeEncoder](https://alan-turing-institute.github.io/MLJ.jl/dev/models/UnivariateTimeTypeToContinuous_MLJModels/#UnivariateTimeTypeToContinuous_MLJModels)

- StatsModels.jl [encoders](https://juliastats.org/StatsModels.jl/stable/contrasts/)

- MLJ [feature request](https://github.com/JuliaAI/MLJModels.jl/issues/534)

- Guo and Berkhahn [(2016)](https://arxiv.org/abs/1604.06737), "Entity Embeddings of Categorical Variables"

- [MLJFlux.jl](https://github.com/FluxML/MLJFlux.jl)


## Machine Learning in Predictive Survival Analysis

Implement survival analysis models for use in the MLJ machine learning
default), commercial ventures (customer churn), engineering (component
lifetime), and many more. This project aims to implement models for
performing survival analysis with the MLJ machine learning framework.

**[mlr3proba](https://mlr3proba.mlr-org.com) is currently the most complete survival
analysis interface; let's get SurvivalAnalysis.jl to the same standard — learning from mistakes
along the way.**


**Mentors.** [Sebastian Vollmer](https://sebastian.vollmer.ms), [Anthony Blaom](https://ablaom.github.io/).

### Prerequisites

### Your contribution

You will work towards creating a survival analysis package with a range of metrics,
capable of making distribution predictions for classical and ML models. You will bake
competing risks in early, as well as prediction transformations, and include both left and
interval censoring. You will code up basic models (Cox PH and AFT), as well as one ML
model as a proof of concept (probably a decision tree, as the simplest, or Coxnet).

Specifically, you will:

- Familiarize yourself with the training and evaluation of machine
  learning models in MLJ.

- Survey existing survival models in Julia.

- For SurvivalAnalysis.jl, implement the [MLJ model
  interface](https://alan-turing-institute.github.io/MLJ.jl/dev/adding_models_for_general_use/),
  integrating some existing classical survival models into MLJ.

- Consider explainability of survival analysis through SurvSHAP(t).

- Develop a proof of concept for newer advanced survival analysis
  models not currently implemented in Julia.
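As a flavor of the estimators involved, here is a bare-bones Kaplan–Meier survival-curve estimator in plain Julia. It assumes distinct event times; real packages such as Survival.jl also handle ties and confidence intervals:

```julia
# Kaplan–Meier estimate of the survival function S(t).
# `times` are observed times; `events[i]` is true for a death, false for censoring.
function kaplan_meier(times::Vector{<:Real}, events::Vector{Bool})
    order = sortperm(times)
    n = length(times)                    # subjects at risk at the start
    s = 1.0
    curve = Tuple{Float64,Float64}[]     # (time, S(t)) pairs at each death
    for (i, j) in enumerate(order)
        if events[j]
            at_risk = n - i + 1          # still at risk just before this death
            s *= 1 - 1 / at_risk         # multiply in this death's factor
            push!(curve, (times[j], s))
        end
    end
    return curve
end

kaplan_meier([1.0, 2.0, 3.0, 4.0], [true, false, true, false])
# deaths at t=1 (4 at risk) and t=3 (2 at risk): S drops to 3/4, then to 3/8
```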

### References

- Krzyziński, M., et al. (2023). [SurvSHAP(t): Time-Dependent Explanations of Machine Learning Survival Models](https://doi.org/10.1016/j.knosys.2022.110234). Knowledge-Based Systems, 262, 110234.

- Kvamme, H., Borgan, Ø., & Scheel, I. (2019). [Time-to-event prediction with neural networks and Cox regression](https://arxiv.org/abs/1907.00825). Journal of Machine Learning Research, 20(129), 1--30.

- Lee, C., Zame, W. R., Yoon, J., & van der Schaar, M. (2018). [DeepHit: A deep learning approach to survival analysis with competing risks](https://ojs.aaai.org/index.php/AAAI/article/view/11842/11701). In Thirty-Second AAAI Conference on Artificial Intelligence.

- Katzman, J. L., Shaham, U., Cloninger, A., Bates, J., Jiang, T., & Kluger, Y. (2018). [DeepSurv: personalized treatment recommender system using a Cox proportional hazards deep neural network](https://doi.org/10.1186/s12874-018-0482-1). BMC Medical Research Methodology, 18(1), 24.

- Gensheimer, M. F., & Narasimhan, B. (2019). [A scalable discrete-time survival model for neural networks](https://peerj.com/articles/6257/). PeerJ, 7, e6257.

- [Survival.jl documentation](https://juliastats.org/Survival.jl/latest/)

- [SurvivalAnalysis.jl](https://github.com/RaphaelS1/SurvivalAnalysis.jl)

## Time series forecasting at scale - speed up via Julia

Time series are ubiquitous - stocks, sensor readings, vital signs. This project aims at adding time series forecasting to MLJ and performing benchmark comparisons to [sktime](https://github.com/alan-turing-institute/sktime), [tslearn](https://github.com/rtavenar/tslearn) and [tsml](https://github.com/uea-machine-learning/tsml/).

**Difficulty.** Moderate - hard. **Duration.** 350 hours.

### Description

Time series are everywhere in real-world applications, and there has been an increase of interest in time series frameworks recently (see e.g. [sktime](https://github.com/alan-turing-institute/sktime), [tslearn](https://github.com/rtavenar/tslearn), [tsml](https://github.com/uea-machine-learning/tsml/)). But there are still very few principled time-series libraries out there, so you would be working on something that could be very useful for a large number of people. To find out more, check out this [paper](http://learningsys.org/neurips19/assets/papers/sktime_ml_systems_neurips2019.pdf) on sktime.

### Prerequisites

- Julia language fluency essential.
- Git-workflow familiarity essential.
- Some prior contact with time series forecasting.
- HPC experience in Julia is desirable.

### Your contribution

MLJ is so far focused on tabular data and time series classification. This project is to add support for time series data in a modular, composable way.
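As a toy example of the kind of classical forecaster such a framework would wrap, here is an AR(1) model fitted by least squares in plain Julia — illustrative only, with no MLJ API implied:

```julia
# Fit y[t] ≈ φ * y[t-1] by least squares, then forecast h steps ahead
# by iterating the fitted recurrence from the last observation.
function ar1_forecast(y::Vector{<:Real}, h::Int)
    num = sum(y[t] * y[t-1] for t in 2:length(y))
    den = sum(y[t-1]^2 for t in 2:length(y))
    φ = num / den                     # least-squares AR(1) coefficient
    f = Float64[]
    last = float(y[end])
    for _ in 1:h
        last *= φ                     # one-step-ahead forecast, iterated
        push!(f, last)
    end
    return φ, f
end

φ, f = ar1_forecast([1.0, 0.5, 0.25, 0.125], 2)
# φ = 0.5; forecasts [0.0625, 0.03125]
```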

**Mentors**: [Sebastian Vollmer](https://sebastian.vollmer.ms), [Markus Löning](https://github.com/mloning) (sktime developer).

### References

* [sktime](https://github.com/alan-turing-institute/sktime)
* [tslearn](https://github.com/rtavenar/tslearn)
* [tsml](https://github.com/uea-machine-learning/tsml/)
* [sktime paper](http://learningsys.org/neurips19/assets/papers/sktime_ml_systems_neurips2019.pdf)

## Deeper Bayesian Integration

Bayesian methods and probabilistic supervised learning provide uncertainty quantification. This project aims at increasing integration to combine Bayesian and non-Bayesian methods using Turing.

**Difficulty.** Difficult. **Duration.** 350 hours.

### Description

As an initial step, reproduce [SossMLJ](https://github.com/cscherrer/SossMLJ.jl) in Turing. The bulk of the project is to implement methods that combine multiple predictive distributions.

### Your contribution

- Build an interface between Turing and MLJ.
- Compare ensembling and stacking of predictive distributions.
- Produce reproducible benchmarks across various settings.

### References

- [Bayesian Stacking](http://www.stat.columbia.edu/~gelman/research/published/stacking_paper_discussion_rejoinder.pdf)
- [SKpro](https://github.com/alan-turing-institute/skpro/blob/master/README.md)

**Mentors**: [Hong Ge](https://github.com/yebai), [Sebastian Vollmer](https://sebastian.vollmer.ms)

## Tracking and sharing MLJ workflows using MLflow

Help data scientists using MLJ track and share their machine learning experiments using
[MLflow](https://mlflow.org). The emphasis in this phase of the project is to:

- support *asynchronous* workflows, as appear in parallelized model tuning
- support live logging while training *iterative* models, such as neural networks

**Difficulty.** Moderate. **Duration.** 350 hours.

### Description

MLflow is an open source platform for the machine learning life cycle. It allows the data
scientist to upload experiment metadata and outputs to the platform for reproducing and
sharing purposes. MLJ [already allows](https://github.com/JuliaAI/MLJFlow.jl) users to
report basic model performance evaluations to an MLflow service, and this project seeks to
greatly enhance this integration.

### Prerequisites

- Julia language fluency essential.
- Understanding of asynchronous programming principles.
- General familiarity with data science workflows.

### Your contribution

- Familiarize yourself with the MLJ, MLflow and MLflowClient.jl client APIs.
- Familiarize yourself with the MLJFlow.jl package providing MLJ <--> MLflow integration.
- Implement the changes needed to allow correct *asynchronous* logging of model performance evaluations.
- Extend logging to (parallelized) model tuning (MLJ's `TunedModel` wrapper).
- Extend logging to controlled training of iterative models (MLJ's `IteratedModel` wrapper).

### References

* [MLflow](https://mlflow.org) website.
* [MLflow REST API](https://mlflow.org/docs/latest/rest-api.html).
* [MLJFlow.jl](https://github.com/JuliaAI/MLJFlow.jl)
* [MLflowClient.jl](https://github.com/JuliaAI/MLFlowClient.jl)
* [MLJIteration.jl](https://github.com/JuliaAI/MLJIteration.jl)
* [Issue on asynchronous reporting](https://github.com/JuliaAI/MLJFlow.jl/issues/26)

**Mentors.** [Anthony Blaom](https://ablaom.github.io/)
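The asynchronous-logging problem above in miniature, using only Julia tasks and a `Channel`; `records` stands in for calls to a real MLflow client and is purely illustrative:

```julia
# Many tasks produce evaluation results concurrently; a single consumer
# serializes them so that logging calls never interleave.
results = Channel{Tuple{Int,Float64}}(32)
records = Tuple{Int,Float64}[]
logger = @async for r in results     # lone consumer: drains until channel closes
    push!(records, r)                # here: a real call to an MLflow service
end
@sync for id in 1:4                  # e.g. four tuning evaluations in parallel
    @async put!(results, (id, rand()))
end
close(results)                       # lets the consumer loop terminate
wait(logger)
length(records)                      # → 4, one record per evaluation
```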

## Interpretable Machine Learning in Julia

Interpreting and explaining black box models is crucial to establish trust and improve performance.

**Difficulty.** Easy - moderate. **Duration.** 350 hours.

### Description

It is important to have mechanisms in place to interpret the results of machine learning models: to identify the relevant factors behind a decision or scoring of a model. This project will implement methods for model and feature interpretability.

**Mentors.** [Diego Arenas](https://github.com/darenasc), [Sebastian Vollmer](https://sebastian.vollmer.ms).

### Prerequisites

- Julia language fluency essential.
- Git-workflow familiarity strongly preferred.
- Some prior contact with explainable AI/ML methods is desirable.
- A passing familiarity with machine learning goals and workflows preferred.

### Your contribution

The aim of this project is to implement several interpretability algorithms, such as:

- methods to show feature importance
- partial dependence plots
- tree surrogates
- LocalModel: Local Interpretable Model-agnostic Explanations (LIME)
- dataset loaders for standard interpretability datasets
- performance metrics for interpretability
- glue code to a SHAP package

Specifically, you will:

- familiarize yourself with MLJ
- survey some of the literature and existing implementations in Julia and other languages, and prepare a short summary
- implement visualizations of explanations
- implement use cases
- learn about the benefits and shortcomings of model interpretation methods and how to use them

### References

- [Interpretable Machine Learning - A Guide for Making Black Box Models Explainable, by Christoph Molnar](https://christophm.github.io/interpretable-ml-book/)
- [iml R package](https://github.com/christophM/iml/)
- [AI Explainability 360 by IBM](https://github.com/Trusted-AI/AIX360)

Tutorials:

- [AI explainability 360: hands-on tutorial](https://dl.acm.org/doi/abs/10.1145/3351095.3375667)
- [IML tutorial](https://mlr3book.mlr-org.com/iml.html)
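To make the partial-dependence idea concrete, here is a minimal plain-Julia sketch; `predict` is any prediction function of a feature row, and all data below is made up for illustration:

```julia
# Partial dependence of a model on feature j: average the model's prediction
# over the data while clamping column j to each value of a grid.
function partial_dependence(predict, X::Matrix{Float64}, j::Int, grid)
    pd = Float64[]
    for v in grid
        Xv = copy(X)
        Xv[:, j] .= v                 # force feature j to the grid value
        n = size(Xv, 1)
        push!(pd, sum(predict(Xv[i, :]) for i in 1:n) / n)
    end
    return pd
end

f(row) = 2row[1] + row[2]             # a toy "model"
X = [1.0 10.0; 2.0 20.0]
partial_dependence(f, X, 1, [0.0, 1.0])   # → [15.0, 17.0]
```

Plotting `grid` against the returned averages gives the familiar partial-dependence plot for feature `j`.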


## Model visualization in MLJ

Design and implement a data visualization module for MLJ.

**Difficulty.** Easy. **Duration.** 350 hours.

### Description

Design and implement a data visualization module for MLJ to visualize numeric and categorical features (histograms, boxplots, correlations, frequencies), intermediate results, and metrics generated by MLJ machines, using a suitable Julia package for data visualization.

The idea is to implement a resource similar to what [mlr3viz](https://github.com/mlr-org/mlr3viz) does for [mlr3](https://mlr3.mlr-org.com).

### Prerequisites

* Julia language fluency essential.
* Git-workflow essential.
* Some prior work on data visualization is desirable.

### Your contribution

So far, visualizing data or features in MLJ is an ad-hoc task, defined by the user case by case. You will implement a standard way to visualize model performance, residuals, benchmarks and predictions for MLJ users.

The structures and metrics will be given by the results of the models or data sets used; your task will be to implement the right visualizations depending on the data type of the features. A relevant part of this project is to visualize the target variable against the rest of the features.

You will enhance your visualisation skills as well as your ability to "debug" and understand models and their predictions visually.

### References

* [mlr3viz](https://github.com/mlr-org/mlr3viz)
* [StatsPlots](https://github.com/JuliaPlots/StatsPlots.jl)

**Mentors**: [Sebastian Vollmer](https://sebastian.vollmer.ms), [Diego Arenas](https://github.com/darenasc).
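As a taste of the feature-visualization primitives involved, here is a crude text histogram in plain Julia; a real module would instead provide recipes for packages like Plots or StatsPlots:

```julia
# A unicode-bar histogram of a numeric feature, using only Base Julia.
# Assumes the data is not constant (so the bin width is positive).
function text_hist(x::Vector{<:Real}; bins::Int=5)
    lo, hi = extrema(x)
    w = (hi - lo) / bins                       # bin width
    counts = zeros(Int, bins)
    for v in x
        i = min(bins, Int(fld(v - lo, w)) + 1) # bin index; hi lands in last bin
        counts[i] += 1
    end
    for (i, c) in enumerate(counts)
        println(round(lo + (i - 1) * w; digits=2), "  ", "█"^c)
    end
    return counts
end

text_hist(randn(100))
```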


## Speed demons only need apply
In addition to investigating a number of known performance bottlenecks, you will …
### Your contribution
In this project you will:
- familiarize yourself with the training, evaluation and tuning of machine learning models in MLJ
- benchmark and profile common workflows to identify opportunities for further code optimizations, with a focus on the most popular models
- work to address problems identified, including:
  - limitations of the generic Tables.jl interface for interacting with tabular data which, in common cases (DataFrames), has extra functionality that can be exploited
  - in conjunction with your mentor, identifying the best design for introducing better sparse data support to MLJ models (e.g., naive Bayes)
- roll out a new data front-end for iterative models, to avoid unnecessary copying of data
- experiment with adding multi-processor parallelism to the current learning networks scheduler

### References
- [MLJ Roadmap](https://github.com/alan-turing-institute/MLJ.jl/blob/dev/ROADMAP.md#scalability). See, in particular, the "Scalability" section.
- [Taking performance more seriously GitHub issue](https://github.com/alan-turing-institute/MLJBase.jl/issues/309)
- [Data front end](https://alan-turing-institute.github.io/MLJ.jl/dev/adding_models_for_general_use/#Implementing-a-data-front-end-1) for MLJ models.

**Mentors.** [Anthony Blaom](https://ablaom.github.io).
