Transitive prediction of small molecule function through alignment of high-content screening resources

CLIPn is a Python package for integrating phenotypic screening data generated from diverse experimental setups. It provides tools for aligning reference compound profiles across datasets and for making predictions for uncharacterized compounds.

Figure 1: Overview of CLIPn. (a) Examples of three distinct experiments profiled using different cell assays, hardware/software setups, and resulting feature maps. (b) Cartoon illustration depicting CLIPn integration and its applications. CLIPn transforms diverse phenotypic feature spaces into a unified integrated space, enabling accurate grouping of perturbation categories from different experiments while separating distinct perturbations. The integrated space facilitates classification, annotation, and profile transfer. (c) Model architecture of CLIPn. In each iteration, datasets are partitioned into one pivot dataset and multiple auxiliary datasets. Each dataset is then mapped to integrated embeddings through a dataset-specific encoder. Embedding contrast matrices between pivot and auxiliary datasets are calculated and compared with their perturbation similarities. (d) Construction of the cross-dataset contrastive graph by iteratively selecting each dataset as the pivot dataset. The entire contrastive graph is optimized together to reflect perturbation categorical similarities between any two integrated datasets.
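To make the idea in panel (c) concrete, here is a small, purely illustrative sketch (this is not the CLIPn implementation; the function name and the binary cross-entropy loss are assumptions for exposition): embeddings from a pivot dataset and an auxiliary dataset are compared pairwise, and their contrast matrix is pushed toward agreement with the perturbation-category similarities.

# Illustrative sketch only -- NOT the actual CLIPn code.
import torch
import torch.nn.functional as F

def contrastive_alignment_loss(z_pivot, z_aux, y_pivot, y_aux):
    # Cosine similarity between every pivot embedding and every auxiliary embedding
    sim = F.normalize(z_pivot, dim=1) @ F.normalize(z_aux, dim=1).T
    # Target: 1 where the two samples share a perturbation category, else 0
    target = (y_pivot.unsqueeze(1) == y_aux.unsqueeze(0)).float()
    return F.binary_cross_entropy_with_logits(sim, target)

# Toy usage with random embeddings and labels
z_pivot, z_aux = torch.randn(32, 10), torch.randn(48, 10)
y_pivot, y_aux = torch.randint(0, 5, (32,)), torch.randint(0, 5, (48,))
loss = contrastive_alignment_loss(z_pivot, z_aux, y_pivot, y_aux)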

Key Features

  1. Profiles: integration is performed directly on HCS profile datasets (rather than on raw images);
  2. Drug categories: HCS profile datasets can be aligned even when the individual datasets use overlapping, but not identical, reference compound categories, potentially with different choices of compounds within those categories (see the sketch after this list);
  3. Datasets: integration can be performed simultaneously on two or more datasets.
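As a minimal sketch of points 2 and 3, random data standing in for two assays with different feature dimensions and overlapping, but not identical, category sets can be passed to CLIPn together. One assumption here (not stated elsewhere in this README) is that shared categories are encoded with the same integer label across datasets:

import numpy as np
from clipn import CLIPn

# Two datasets measured with different assays, hence different feature dimensions
X = {0: np.random.rand(200, 50),                  # dataset 0: 50 features
     1: np.random.rand(300, 120)}                 # dataset 1: 120 features
# Overlapping, but not identical, reference categories
y = {0: np.random.choice([0, 1, 2], size=200),    # dataset 0: categories 0-2
     1: np.random.choice([1, 2, 3], size=300)}    # dataset 1: categories 1-3

model = CLIPn(X, y, latent_dim=10)
loss = model.fit(X, y)
Z = model.predict(X)                              # integrated latent space per dataset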

Prerequisites

Before you start using CLIPn, ensure that you have the following Python packages installed:

  • NumPy
  • Torch
  • tqdm

CLIPn was developed with Python 3.7.4, NumPy 1.21.6, Torch 1.10.0, and tqdm 4.66.2, and tested on Ubuntu 20.04.6 LTS with an Nvidia Titan X GPU and a Xeon E5 CPU.

Installation

You can install CLIPn via pip:

pip install clipn

Getting Started

To start using CLIPn, import the package in your Python environment:

from clipn import CLIPn

Running the CLIPn Model

This guide will walk you through the process of running the CLIPn model on simulated data.

Prerequisites

Ensure that you have the following Python packages installed:

  • matplotlib
  • umap-learn (imported as umap)
  • clipn

The example also uses the functions.simulation and functions.helper modules provided in this repository.

Steps

  1. Generate Simulation Data

    Use the assay_simulator function from the functions.simulation module to generate the simulation data. The function takes parameters such as the number of samples, clusters, and assays, along with other simulation settings.

    from functions import simulation   # simulation utilities provided in this repository

    n_datasets = 8
    data = simulation.assay_simulator(n_sample=10000, n_cluster=10, n_assay=n_datasets,
                                      sigma_max=1, sigma_min=0.1,
                                      rho_max=0.8, rho_min=0.1,
                                      cluster_observe_ratio=0.5, random_seed=2023)
    # Collect per-dataset feature matrices and labels into dictionaries keyed by dataset index
    X = dict()
    y = dict()
    for i in range(n_datasets):
        X[i] = data["dataset_" + str(i) + "_feature"]
        y[i] = data["dataset_" + str(i) + "_label"]
  2. Visualize the Simulation Data

    You can visualize the simulation data using UMAP and matplotlib.
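    One way to do this is sketched below, assuming the umap-learn package and the X/y dictionaries from step 1; the repository's own plotting helper (used in step 4) may be used instead.

    import matplotlib.pyplot as plt
    import umap

    # One UMAP projection per simulated dataset (feature spaces may differ across assays)
    fig, axes = plt.subplots(2, 4, figsize=(16, 8))
    for i, ax in enumerate(axes.ravel()):
        emb = umap.UMAP(random_state=2023).fit_transform(X[i])
        ax.scatter(emb[:, 0], emb[:, 1], c=y[i], s=2, cmap="tab10")
        ax.set_title("dataset_" + str(i))
    plt.show()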

    Figure 2: Visualization of the simulation data

  3. Run CLIPn on the Generated Data

    Instantiate the CLIPn model with the simulation data and the latent dimension. Then, fit the model with the data and predict the latent representations.

    Input:
        X: Python dictionary of feature matrices from all datasets
        y: Python dictionary of labels from all datasets
        latent_dim: number of latent dimensions

    latent_dim = 10
    clipn = CLIPn(X, y, latent_dim=latent_dim)   # build dataset-specific encoders
    loss = clipn.fit(X, y)                       # optimize the cross-dataset contrastive graph
    Z = clipn.predict(X)                         # map each dataset into the integrated space

    Output:
        Z: Python dictionary of latent representations from all datasets
  4. Visualize the Latent Representations

    The output Z is also a dictionary, corresponding to the input X. You can visualize the latent representations using UMAP.

    from functions.helper import umap_scatter   # plotting helper provided in this repository

    umap_scatter(Z, y)   # UMAP projection of the integrated latent space, colored by label

Figure 3: Visualization of the latent representations
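As illustrated in Figure 1b, the integrated space can also be used for downstream tasks such as classifying or annotating compounds in one dataset using the labels of the others. Below is a minimal label-transfer sketch with a nearest-neighbor classifier; scikit-learn is an added assumption here, not a CLIPn dependency, and treating dataset 0 as the "uncharacterized" one is arbitrary.

import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Treat dataset 0 as "uncharacterized" and the remaining datasets as references
ref_ids = [i for i in range(n_datasets) if i != 0]
Z_ref = np.concatenate([Z[i] for i in ref_ids], axis=0)
y_ref = np.concatenate([y[i] for i in ref_ids], axis=0)

# Transfer reference labels to dataset 0 via nearest neighbors in the integrated space
knn = KNeighborsClassifier(n_neighbors=15).fit(Z_ref, y_ref)
y_pred = knn.predict(Z[0])
accuracy = (y_pred == y[0]).mean()   # sanity check against the simulated labels
print("transfer accuracy:", accuracy)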

Reproducing Results

Scripts to reproduce the main results from the paper can be found in the scripts directory.

Data Availability

The processed experimental data used in the study are available at Zenodo.

License

CLIPn is released under the MIT License. See the LICENSE file for details.
