If you make use of the brca_tcga code, please cite the Zenodo DOI: https://zenodo.org/record/8251328
If you make use of the acc_tcga code, please cite the Zenodo DOI: https://zenodo.org/record/8286179
This project was carried out during Google Summer of Code 2023 (GSoC'23).
The aim of the project is to develop an example dataset that is compatible with PyTorch Geometric (PyG). The dataset is created by integrating data obtained from the cBioPortal DataHub and Pathway Commons into a single, comprehensive dataset, which is then used to train GNN models for downstream tasks.
The proposed solution is a three-stage process: retrieving and preprocessing data from the two sources, integrating the data, and developing and training Graph Neural Network (GNN) models on the integrated dataset.
[Figure: project workflow]
Two cancer types were considered during this project:
- Adrenocortical Carcinoma
- Breast Invasive Carcinoma
This repository contains a set of notebooks that show how the data was collected, preprocessed and integrated, then converted to a PyTorch Geometric dataset.
- Data Collection: The following datasets were collected from cBioPortal and Pathway Commons:
  - Reactome subset from Pathway Commons: https://www.pathwaycommons.org/archives/PC2/v12/PathwayCommons12.reactome.hgnc.sif.gz
  - brca_tcga_pan_can_atlas_2018 (data_clinical_patient.txt, data_clinical_sample.txt and data_mrna_seq_v2_rsem.txt): https://github.com/cBioPortal/datahub/tree/master/public/brca_tcga_pan_can_atlas_2018
  - acc_tcga_pan_can_atlas_2018 (data_clinical_patient.txt, data_clinical_sample.txt and data_mrna_seq_v2_rsem.txt): https://github.com/cBioPortal/datahub/tree/master/public/acc_tcga_pan_can_atlas_2018
To access these datasets without going through the cBioPortal download process, you can get them from Zenodo: https://zenodo.org/record/8251328
NB: To repeat these steps for a different dataset, you have to download the required data from the source sites. Instructions for downloading from cBioPortal are here: https://github.com/cBioPortal/datahub/tree/master
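As a rough illustration of the collection step, the sketch below reads network data in SIF layout with pandas. It assumes the file is tab-separated with three unnamed columns (source gene, interaction type, target gene); the column names, the `read_sif` helper and the three sample rows are hypothetical and only serve the example. Passing the `.sif.gz` path or URL instead of the in-memory sample should also work, since pandas infers gzip compression from the file extension.

```python
import io

import pandas as pd

# Hypothetical three-line excerpt in SIF layout:
# source gene, interaction type, target gene (tab-separated, no header).
sample_sif = (
    "A1BG\tinteracts-with\tABCC6\n"
    "A1CF\tin-complex-with\tAPOBEC1\n"
    "AKT1\tcontrols-state-change-of\tGSK3B\n"
)


def read_sif(source):
    """Read a tab-separated SIF file (path, URL or file-like object)."""
    return pd.read_csv(
        source,
        sep="\t",
        header=None,
        names=["source", "interaction", "target"],  # assumed column names
    )


edges = read_sif(io.StringIO(sample_sif))

# Every gene that appears at either end of an interaction.
genes = sorted(set(edges["source"]) | set(edges["target"]))
print(edges.shape, len(genes))
```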
- Data Preprocessing: To create this sample dataset, data_clinical_patient.txt and data_clinical_sample.txt were merged on the patient identifier, and only the sample identifier, patient identifier and overall survival (months) columns were kept. The gene expression features (N=9288) that overlap with the biological network data from Pathway Commons were then extracted from data_mrna_seq_v2_rsem.txt and merged with the clinical data on the sample identifier. Overall survival time in months was used as the value to be predicted. The resulting dataset was split into X (features) and y (labels), then into training, validation and test sets using a 60:20:20 split for brca_tcga, and into training and test sets using a 70:30 split for acc_tcga.
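The preprocessing steps above can be sketched with pandas and NumPy on toy data. This is a minimal sketch and not the notebooks' actual code: the column names (`PATIENT_ID`, `SAMPLE_ID`, `OS_MONTHS`, `Hugo_Symbol`) are assumed to follow the usual cBioPortal layout, the toy values are made up, and the random shuffling used for the 60:20:20 split is an assumption about how the split was done.

```python
import numpy as np
import pandas as pd

# Toy stand-ins for the cBioPortal files (column names are assumptions).
ids = [f"P{i}" for i in range(10)]
patients = pd.DataFrame({"PATIENT_ID": ids, "OS_MONTHS": np.linspace(5.0, 50.0, 10)})
samples = pd.DataFrame({"PATIENT_ID": ids, "SAMPLE_ID": [f"{p}-01" for p in ids]})
expression = pd.DataFrame(
    np.arange(30, dtype=float).reshape(3, 10),
    columns=list(samples["SAMPLE_ID"]),
)
expression.insert(0, "Hugo_Symbol", ["AKT1", "GSK3B", "NOT_IN_NETWORK"])
network_genes = {"AKT1", "GSK3B", "TP53"}  # genes present in the pathway network

# 1. Merge the clinical tables on the patient identifier; keep three columns.
clinical = patients.merge(samples, on="PATIENT_ID")
clinical = clinical[["SAMPLE_ID", "PATIENT_ID", "OS_MONTHS"]]

# 2. Keep only genes that also occur in the network; put samples on rows.
kept = expression[expression["Hugo_Symbol"].isin(network_genes)]
features = kept.set_index("Hugo_Symbol").T

# 3. Attach expression features to the clinical data by sample identifier.
merged = clinical.merge(features, left_on="SAMPLE_ID", right_index=True)

# 4. Separate features (X) and labels (y).
gene_cols = list(features.columns)
X = merged[gene_cols].to_numpy()
y = merged["OS_MONTHS"].to_numpy()

# 5. Shuffle once, then cut into 60% train, 20% validation, 20% test.
rng = np.random.default_rng(0)
order = rng.permutation(len(merged))
n_train = int(0.6 * len(order))
n_val = int(0.2 * len(order))
train_idx, val_idx, test_idx = np.split(order, [n_train, n_train + n_val])

X_train, y_train = X[train_idx], y[train_idx]
X_val, y_val = X[val_idx], y[val_idx]
X_test, y_test = X[test_idx], y[test_idx]
print(X_train.shape, X_val.shape, X_test.shape)
```

For a 70:30 train/test split, the same pattern applies with a single cut point.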
The final outputs of these steps are:
For brca_tcga (https://zenodo.org/record/8251328):
- X_train, y_train, X_val, y_val, X_test, y_test
- Gene features: graph_idx
- Labels: graph_labels
- Edges: edge_index
For acc_tcga (https://zenodo.org/record/8286179):
- X_train, y_train, X_test, y_test
- Gene features: graph_idx
- Labels: graph_labels
- Edges: edge_index
The steps that were followed after downloading the required datasets are shown in the notebooks below.
- Data Integration: In this stage, an edge index (N=271771 edges) was generated from the Pathway Commons v12 data in tabular format. To convert to a PyG dataset, a list of graphs was first created from the preprocessed dataset. The two data types were then integrated, resulting in patient-specific graphs that were converted into PyG data objects. These steps are shown in the provided notebooks. Finally, the graphs are wrapped using the InMemoryDataset class for use with PyG.
You can view the final dataset statistics here: https://github.com/cannin/gsoc_2023_pytorch_pathway_commons/blob/main/docs/dataset_statistics.csv
Notebook for brca_tcga Integration: https://github.com/cannin/gsoc_2023_pytorch_pathway_commons/blob/main/Notebooks/inmemorydataset_class_with_brca_tcga.ipynb
Notebook for acc_tcga Integration: https://github.com/cannin/gsoc_2023_pytorch_pathway_commons/blob/main/Notebooks/pyg_sample_data_with_inmemorydataset_class.ipynb
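The integration idea can be illustrated with NumPy alone: gene names are mapped to integer node ids, network edges become a 2 x E index array, and each patient's expression row becomes the node features of one graph that shares that edge index. This is a sketch under assumptions, not the notebooks' code; the gene names, edges and values are made up, and dropping edges whose genes are not measured is one possible choice. Each resulting dictionary corresponds to what would be passed to `torch_geometric.data.Data(x=..., edge_index=..., y=...)` after conversion to torch tensors.

```python
import numpy as np

# Hypothetical inputs: feature order, network edges, expression and labels.
gene_names = ["AKT1", "GSK3B", "TP53"]
sif_edges = [("AKT1", "GSK3B"), ("TP53", "AKT1"), ("TP53", "MDM2")]
X = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0]])      # 2 patients x 3 genes
y = np.array([12.5, 30.0])           # overall survival in months

# Map each gene to a node id that matches its column position in X.
node_id = {gene: i for i, gene in enumerate(gene_names)}

# Keep only edges whose two genes are both measured (MDM2 is dropped here).
pairs = [
    (node_id[a], node_id[b])
    for a, b in sif_edges
    if a in node_id and b in node_id
]
edge_index = np.array(pairs, dtype=np.int64).T  # shape (2, num_edges)

# One graph per patient: same topology, patient-specific node features.
graphs = [
    {"x": row.reshape(-1, 1), "edge_index": edge_index, "y": np.array([target])}
    for row, target in zip(X, y)
]
print(edge_index.shape, len(graphs))
```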
- Modelling: The two datasets were then used for modelling. First, a baseline model was created using FLAML; then a Graph Neural Network (GNN) model was built using GCNConv. For the brca_tcga dataset, a second GNN architecture, the Graph Attention Network (GAT), was also used. The notebooks below show the modelling steps that were carried out.
For brca_tcga: Baseline model: https://github.com/cannin/gsoc_2023_pytorch_pathway_commons/blob/main/Notebooks/baseline_model_with_brca_data.ipynb
For acc_tcga: Baseline model: https://github.com/cannin/gsoc_2023_pytorch_pathway_commons/blob/main/Notebooks/baseline_model_with_acc_data.ipynb
View the modelling statistics here: https://github.com/cannin/gsoc_2023_pytorch_pathway_commons/blob/main/docs/modelling_statistics.csv
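For readers unfamiliar with graph convolutions, the NumPy sketch below shows the propagation rule that a GCN layer is built around: features are averaged over each node's neighbourhood using a symmetrically normalised adjacency matrix with self-loops, then multiplied by a weight matrix. It is an illustration only and not the model used in the notebooks; the `gcn_layer` helper, the toy graph, the ReLU activation and the treatment of edges as undirected are all assumptions made for the example.

```python
import numpy as np


def gcn_layer(x, edge_index, weight):
    """One graph-convolution step: ReLU(D^-1/2 (A + I) D^-1/2 X W)."""
    n = x.shape[0]
    adj = np.zeros((n, n))
    adj[edge_index[0], edge_index[1]] = 1.0
    adj[edge_index[1], edge_index[0]] = 1.0   # assumption: undirected edges
    adj += np.eye(n)                          # add self-loops
    deg_inv_sqrt = 1.0 / np.sqrt(adj.sum(axis=1))
    norm_adj = adj * deg_inv_sqrt[:, None] * deg_inv_sqrt[None, :]
    return np.maximum(norm_adj @ x @ weight, 0.0)


# Toy patient graph: 3 genes, 1 expression feature per gene.
x = np.array([[1.0], [2.0], [3.0]])
edge_index = np.array([[0, 2],
                       [1, 0]])
weight = np.array([[1.0]])                    # identity weight for readability

h = gcn_layer(x, edge_index, weight)

# Graph-level readout: average node embeddings to get one vector per patient,
# which a final linear layer could map to survival time (assumed design).
graph_embedding = h.mean(axis=0)
print(h.shape, graph_embedding.shape)
```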