Official Python implementation of NSGCCA, from the following paper:
Nonlinear Sparse Generalized Canonical Correlation Analysis for Multi-view High-dimensional Data.
Rong Wu, Ziqi Chen, Gen Li and Hai Shu.
New York University
[arXiv](https://arxiv.org/abs/2502.18756)
We propose three nonlinear, sparse, generalized CCA methods, HSIC-SGCCA, SA-KGCCA, and TS-KGCCA, for variable selection in multi-view high-dimensional data. These methods extend existing SCCA-HSIC, SA-KCCA, and TS-KCCA from two-view to multi-view settings. While SA-KGCCA and TS-KGCCA yield multi-convex optimization problems solved via block coordinate descent, HSIC-SGCCA introduces a necessary unit-variance constraint previously ignored in SCCA-HSIC, resulting in a nonconvex, non-multiconvex problem. We efficiently address this challenge by integrating the block prox-linear method with the linearized alternating direction method of multipliers. Simulations and TCGA-BRCA data analysis demonstrate that HSIC-SGCCA outperforms competing methods in variable selection.
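As background for the HSIC-based method, here is a minimal sketch of the biased empirical Hilbert-Schmidt Independence Criterion (HSIC), the dependence measure that SCCA-HSIC and HSIC-SGCCA build on. It is not this repository's implementation: the function names (`gaussian_gram`, `center`, `hsic`), the Gaussian kernel, and the bandwidth `sigma=1.0` are assumptions made for illustration, and the code uses only the standard library and handles one-dimensional samples.

```python
import math
import random


def gaussian_gram(x, sigma=1.0):
    """Gram matrix K[i][j] = exp(-(x_i - x_j)^2 / (2 sigma^2)) for a 1-D sample."""
    return [[math.exp(-((a - b) ** 2) / (2.0 * sigma ** 2)) for b in x] for a in x]


def center(K):
    """Double-center a Gram matrix: H K H with H = I - (1/n) * ones."""
    n = len(K)
    row_means = [sum(row) / n for row in K]
    col_means = [sum(K[i][j] for i in range(n)) / n for j in range(n)]
    total_mean = sum(row_means) / n
    return [
        [K[i][j] - row_means[i] - col_means[j] + total_mean for j in range(n)]
        for i in range(n)
    ]


def hsic(x, y, sigma=1.0):
    """Biased empirical HSIC: trace(K H L H) / (n - 1)^2."""
    n = len(x)
    Kc = center(gaussian_gram(x, sigma))  # H K H
    L = gaussian_gram(y, sigma)
    # trace(K H L H) equals trace((H K H) L) by cyclicity of the trace.
    trace = sum(Kc[i][j] * L[j][i] for i in range(n) for j in range(n))
    return trace / (n - 1) ** 2


# Illustrative data (hypothetical, not from the paper):
# y depends on x nonlinearly, z is drawn independently of x.
rng = random.Random(0)
x = [rng.gauss(0.0, 1.0) for _ in range(150)]
y = [v ** 2 for v in x]
z = [rng.gauss(0.0, 1.0) for _ in range(150)]

print("HSIC(x, x^2):", hsic(x, y))
print("HSIC(x, z):  ", hsic(x, z))
```

The dependent pair should score noticeably higher than the independent pair even though x and x^2 are nearly uncorrelated linearly, which is the motivation for using HSIC in a nonlinear CCA objective.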
Clone this repository and install other required packages:
git clone git@github.com:Rows21/NSGCCA
- Synthetic datasets: synth_data.py
- TCGA breast cancer (TCGA-BRCA) data in Realdata, from https://tcga-data.nci.nih.gov/docs/publications
(Feel free to open an issue to suggest recently proposed CCA methods for comparison. The baselines folder currently holds the comparison models.)
If you find this repository helpful, please consider citing:
@article{wu2025nonlinear,
title={Nonlinear Sparse Generalized Canonical Correlation Analysis for Multi-view High-dimensional Data},
author={Wu, Rong and Chen, Ziqi and Li, Gen and Shu, Hai},
journal={arXiv preprint arXiv:2502.18756},
year={2025}
}
Figure 2: Simulation performance on the synthetic datasets.
Data_download_preprocess: TCGA-BRCA download and preprocessing via R scripts.
Venn Diagram: Clustering results for TCGA-BRCA.
This repository is built using the timm library.
This project is released under the MIT license. Please see the LICENSE file for more information.