Transitive prediction of small molecule function through alignment of high-content screening resources
CLIPn is a Python package for integrating phenotypic screening data generated from diverse experimental setups. It provides tools for integrating reference compound profiles across experiments and for making predictions for uncharacterized compounds.
Figure 1: (a) Examples of three distinct experiments profiled using different cell assays, hardware/software setups, and resulting feature maps. (b) Cartoon illustration depicting CLIPn integration and its applications. CLIPn transforms diverse phenotypic feature spaces into a unified integrated space, enabling accurate grouping of perturbation categories from different experiments while separating distinct perturbations. The integrated space facilitates classification, annotation, and profile transfer. (c) Model architecture of CLIPn. In each iteration, datasets are partitioned into one pivot dataset and multiple auxiliary datasets. Each dataset is then mapped to integrated embeddings through a dataset-specific encoder. Embedding contrast matrices between pivot and auxiliary datasets are calculated and compared with their perturbation similarities. (d) Construction of the cross-dataset contrastive graph by iteratively selecting each dataset as the pivot dataset. The entire contrastive graph is optimized together to reflect perturbation categorical similarities between any two integrated datasets.
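To make panel (c) concrete, the sketch below illustrates one pivot/auxiliary comparison: an embedding contrast matrix is computed and compared against whether samples share a perturbation category. This is a simplified illustration, not CLIPn's actual implementation; the cosine-similarity contrast and the binary cross-entropy comparison are assumptions made for readability.

```python
import torch
import torch.nn.functional as F

def contrastive_step(z_pivot, y_pivot, z_aux, y_aux):
    """Illustrative sketch: compare the embedding contrast matrix between a
    pivot and an auxiliary dataset with their perturbation-category similarity."""
    # Embedding contrast matrix: pairwise similarity of integrated embeddings.
    sim = F.normalize(z_pivot, dim=1) @ F.normalize(z_aux, dim=1).T
    # Perturbation similarity: 1 if two samples share a category, else 0.
    target = (y_pivot.unsqueeze(1) == y_aux.unsqueeze(0)).float()
    # Penalize disagreement between the two matrices.
    return F.binary_cross_entropy_with_logits(sim, target)
```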
- Profile: integration can be performed directly from HCS profile datasets (rather than raw images);
- Drug category: HCS profile datasets can be aligned even when the individual datasets use overlapping, but not identical, reference compound categories with potentially different choices of compounds in these categories (see the illustrative sketch after this list);
- Datasets: integration can be performed on two or more datasets simultaneously.
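For illustration, the snippet below sketches two screens that share only some reference categories. The sample counts, feature dimensions, and category codes are hypothetical; only the dictionary-of-arrays layout matches the tutorial further down.

```python
import numpy as np

# Hypothetical example: two screens with different feature maps and
# overlapping, but not identical, reference compound categories.
X = {
    0: np.random.rand(500, 120),      # screen A: 500 profiles, 120 features
    1: np.random.rand(300, 80),       # screen B: 300 profiles, 80 features
}
y = {
    0: np.random.randint(0, 5, 500),  # screen A labels: categories 0-4
    1: np.random.randint(2, 7, 300),  # screen B labels: categories 2-6 (overlap 2-4)
}
```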
Before you start using CLIPn, ensure that you have the following Python packages installed:
- NumPy
- Torch
- tqdm

CLIPn was developed using Python 3.7.4 with NumPy (1.21.6), Torch (1.10.0), and tqdm (4.66.2), and tested on Ubuntu 20.04.6 LTS with an Nvidia Titan X GPU and a Xeon E5 CPU.
You can install CLIPn via pip:

```bash
pip install clipn
```
To start using CLIPn, import the package in your Python environment:

```python
from clipn import CLIPn
```
This guide will walk you through the process of running the CLIPn model on simulated data.

Ensure that you have the following Python packages installed:
- functions.simulation
- matplotlib
- clipn
- umap
- Generate Simulation Data

Use the assay_simulator function from the functions.simulation module to generate the simulation data. The function takes several parameters, such as the number of samples, clusters, and assays.

```python
from functions import simulation

n_datasets = 8
data = simulation.assay_simulator(n_sample=10000, n_cluster=10, n_assay=n_datasets,
                                  sigma_max=1, sigma_min=0.1, rho_max=0.8, rho_min=0.1,
                                  cluster_observe_ratio=0.5, random_seed=2023)
X = dict()
y = dict()
for i in range(n_datasets):
    X[i] = data["dataset_" + str(i) + "_feature"]
    y[i] = data["dataset_" + str(i) + "_label"]
```
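You can inspect the generated dictionaries to confirm that each simulated assay has its own feature map (this assumes, as in the tutorial, that the features and labels are NumPy arrays):

```python
for i in range(n_datasets):
    print("dataset", i, "features:", X[i].shape, "labels:", y[i].shape)
```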
- Visualize the Simulation Data
You can visualize the simulation data using UMAP and matplotlib.
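For example, a minimal sketch with umap-learn and matplotlib, plotting each dataset's raw feature space coloured by cluster label (the grid layout and plotting parameters here are illustrative choices, not part of CLIPn):

```python
import matplotlib.pyplot as plt
import umap

fig, axes = plt.subplots(2, 4, figsize=(16, 8))
for i, ax in enumerate(axes.ravel()):
    # Embed each dataset's raw features separately with UMAP.
    embedding = umap.UMAP(random_state=2023).fit_transform(X[i])
    ax.scatter(embedding[:, 0], embedding[:, 1], c=y[i], s=2, cmap="tab10")
    ax.set_title("dataset " + str(i))
plt.show()
```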
- Run CLIPn on the Generated Data

Instantiate the CLIPn model with the simulation data and the latent dimension. Then fit the model with the data and predict the latent representations.

Input:
- X: Python dictionary of feature matrices from all datasets
- y: Python dictionary of labels from all datasets
- latent_dim: number of latent dimensions

```python
latent_dim = 10
clipn = CLIPn(X, y, latent_dim=latent_dim)
loss = clipn.fit(X, y)
Z = clipn.predict(X)
```

Output:
- Z: Python dictionary of latent representations from all datasets
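Because the integrated embeddings share one space, Z can be used directly for the classification and annotation tasks mentioned above. The sketch below uses scikit-learn (an assumption; it is not a CLIPn dependency) to transfer labels from one dataset to another:

```python
from sklearn.neighbors import KNeighborsClassifier

# Treat dataset 0 as the annotated reference and dataset 1 as uncharacterized.
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(Z[0], y[0])
predicted = knn.predict(Z[1])
print("label-transfer accuracy:", (predicted == y[1]).mean())
```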
- Visualize the Latent Representations

The output Z is also in dictionary format, corresponding to the input X. You can visualize the latent representations using UMAP.

```python
from functions.helper import *
umap_scatter(Z, y)
```
Figure 3: Latent Representations Visualization
Scripts to reproduce the main results from the paper can be found in the scripts directory. The scripts include:
- Integration from simulated data.
- Integration from multiple hypoxia screens.
- Integration from curated compound screens across 20 years.
- Integration from transcript + image profiles.
The processed experimental data used in the study are available at Zenodo.
CLIPn is released under the MIT License. See the LICENSE file for details.