Add expression data, sample metadata, and recount2 #2

Merged
merged 8 commits into from
Apr 3, 2018
Changes from 7 commits
1 change: 1 addition & 0 deletions .gitattributes
@@ -0,0 +1 @@
*.pcl filter=lfs diff=lfs merge=lfs -text
29 changes: 29 additions & 0 deletions .gitignore
@@ -0,0 +1,29 @@
# Ignore hidden metadata files
._*

# History files
.Rhistory
.Rapp.history

# Session Data files
.RData

# Output files from R CMD check
/*.Rcheck/

# RStudio files
.Rproj.user/

# knitr and R markdown default cache directories
/*_cache/
/cache/

# Temporary files created by R markdown
*.utf8.md
*.knit.md

# Rplot default output
Rplots.pdf

# recount2 data from figshare
data/recount2_PLIER_data
39 changes: 39 additions & 0 deletions 00-data_download.sh
@@ -0,0 +1,39 @@
#!/bin/bash

# set up directories
mkdir -p data plots results util

# data directory subdirectories
cd data && mkdir expression_data

# get recount2 data & model from figshare, source code in
# greenelab/rheum-plier-data
wget https://ndownloader.figshare.com/files/10881866 \
-O recount2.zip
unzip recount2.zip && rm recount2.zip

## microarray data from greenelab/rheum-plier-data
cd expression_data

# sle-wb data
wget https://github.com/greenelab/rheum-plier-data/raw/4be547553f24fecac9e2f5c2b469a17f9df253f0/sle-wb/processed/aggregated_data/SLE_WB_all_microarray_QN_zto_before.pcl

I am wondering why you're also storing these files in this repo. They can be read directly from the URL (in both R and Python):

url <- 'https://github.com/greenelab/rheum-plier-data/raw/4be547553f24fecac9e2f5c2b469a17f9df253f0/sle-wb/processed/aggregated_data/SLE_WB_all_microarray_QN_zto_before.pcl'
data <- readr::read_tsv(url)

Collaborator Author

If I need to load this into R from multiple scripts or notebooks, I would rather not read it from the URL each time (the file is somewhere between 400 and 500 MB in this particular case). I can add these files to .gitignore, though, if you think that's better.

@gwaybio, Apr 3, 2018

Is it faster to read from the file than the url? Are there other concerns too?

Member

It should be a few orders of magnitude faster unless this does some serious caching.


Yeah, in Python (I did not test in R):

%%time
import pandas as pd
url = 'https://github.com/greenelab/rheum-plier-data/raw/4be547553f24fecac9e2f5c2b469a17f9df253f0/sle-wb/processed/aggregated_data/SLE_WB_all_microarray_QN_zto_before.pcl'
data = pd.read_table(url)

CPU times: user 6.61 s, sys: 1.51 s, total: 8.12 s
Wall time: 55.1 s

%%time
file_loc = 'SLE_WB_all_microarray_QN_zto_before.pcl'
data = pd.read_table(file_loc)

CPU times: user 6.61 s, sys: 1.2 s, total: 7.81 s
Wall time: 12.4 s
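
For anyone repeating this comparison outside a notebook, where the `%%time` magic is unavailable, a small wall-clock helper is one option. This is only a sketch: `timed` is a hypothetical name, and the commented usage assumes pandas is installed and that `url` and `file_loc` are defined as in the comparison above.

```python
import time
from typing import Any, Callable, Tuple


def timed(reader: Callable[..., Any], *args: Any, **kwargs: Any) -> Tuple[Any, float]:
    """Call reader(*args, **kwargs) and return (result, wall-clock seconds)."""
    start = time.perf_counter()
    result = reader(*args, **kwargs)
    return result, time.perf_counter() - start


# Hypothetical usage (requires pandas, and network access for the first call):
#
#   import pandas as pd
#   _, remote_s = timed(pd.read_csv, url, sep="\t")
#   _, local_s = timed(pd.read_csv, file_loc, sep="\t")
#   print(f"remote: {remote_s:.1f} s, local: {local_s:.1f} s")
```

Wall-clock time is the number that matters here, since the CPU time for parsing is about the same in both cases and the difference comes from the transfer.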

Member

Fast internet connection there 👍

@gwaybio, Apr 3, 2018

I wonder if this use case is suitable for a git submodule then 🤔. Up to you @jaclyn-taroni; I will approve this PR. (Just wondering about pros vs. cons of alternative solutions.)

Collaborator Author

So I think I will keep downloading the data this way rather than using a submodule or reading directly from the url, but I will ignore the data from that GitHub repo (same as what I'm doing with the recount2 data from figshare). Will update in next commit.
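
A middle ground between the two positions in this thread is to download each file once and reuse the local copy afterwards. The following is a minimal sketch of that idea using only the Python standard library; `fetch_once` and the `.part` suffix are hypothetical choices, not anything this PR defines.

```python
import shutil
import urllib.request
from pathlib import Path


def fetch_once(url: str, dest: Path) -> Path:
    """Download url to dest unless dest already exists; return dest."""
    dest = Path(dest)
    if dest.exists():
        return dest  # already downloaded, so skip the network entirely
    dest.parent.mkdir(parents=True, exist_ok=True)
    tmp = dest.with_name(dest.name + ".part")
    with urllib.request.urlopen(url) as response, open(tmp, "wb") as out:
        shutil.copyfileobj(response, out)
    tmp.replace(dest)  # rename only after the download has completed
    return dest
```

Writing to a temporary name first means an interrupted download is never mistaken for a complete file on the next run. The downloaded paths would still need to be listed in `.gitignore`, as the comment above proposes.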


# NARES
wget https://github.com/greenelab/rheum-plier-data/raw/4be547553f24fecac9e2f5c2b469a17f9df253f0/NARES/processed/NARES_SCANfast_ComBat.pcl

# GPA blood dataset (GSE18885)
wget https://github.com/greenelab/rheum-plier-data/raw/4be547553f24fecac9e2f5c2b469a17f9df253f0/gpa-blood/GSE18885_series_matrix.txt

# isolated blood cell populations from autoimmune conditions
wget https://github.com/greenelab/rheum-plier-data/raw/4be547553f24fecac9e2f5c2b469a17f9df253f0/isolated-cell-pop/processed/E-MTAB-2452_hugene11st_SCANfast.pcl

# get sample (e.g., phenotype) data
cd .. && mkdir sample_info && cd sample_info
# sle-wb sample to dataset of origin data
wget https://github.com/jaclyn-taroni/rheum-plier-data/raw/4be547553f24fecac9e2f5c2b469a17f9df253f0/sle-wb/processed/sle-wb_sample_dataset_mapping.tsv
# other/single dataset sample information
wget https://github.com/greenelab/rheum-plier-data/raw/4be547553f24fecac9e2f5c2b469a17f9df253f0/sle-wb/arrayexpress/E-GEOD-65391/E-GEOD-65391.sdrf.txt
wget https://github.com/greenelab/rheum-plier-data/raw/4be547553f24fecac9e2f5c2b469a17f9df253f0/isolated-cell-pop/E-MTAB-2452.sdrf.txt
wget https://github.com/greenelab/rheum-plier-data/raw/4be547553f24fecac9e2f5c2b469a17f9df253f0/sle-wb/arrayexpress/E-GEOD-39088/E-GEOD-39088.sdrf.txt
wget https://github.com/greenelab/rheum-plier-data/raw/4be547553f24fecac9e2f5c2b469a17f9df253f0/sle-wb/arrayexpress/E-GEOD-78193/E-GEOD-78193.sdrf.txt
wget https://github.com/greenelab/rheum-plier-data/raw/4be547553f24fecac9e2f5c2b469a17f9df253f0/NARES/NARES_demographic_data.tsv
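
Every `wget` call above repeats the same commit hash, which pins the data to one revision of the source repository. As an illustration only, the URLs could be assembled from the repository, commit, and path so that the pin lives in one place. `raw_url`, `COMMIT`, and `EXPRESSION_FILES` are hypothetical names; the paths are copied from the script.

```python
def raw_url(repo: str, commit: str, path: str) -> str:
    """Build a GitHub raw-file URL pinned to a specific commit."""
    return f"https://github.com/{repo}/raw/{commit}/{path}"


# Commit hash and file paths as they appear in 00-data_download.sh.
COMMIT = "4be547553f24fecac9e2f5c2b469a17f9df253f0"
EXPRESSION_FILES = [
    "sle-wb/processed/aggregated_data/SLE_WB_all_microarray_QN_zto_before.pcl",
    "NARES/processed/NARES_SCANfast_ComBat.pcl",
    "gpa-blood/GSE18885_series_matrix.txt",
    "isolated-cell-pop/processed/E-MTAB-2452_hugene11st_SCANfast.pcl",
]

urls = [raw_url("greenelab/rheum-plier-data", COMMIT, p) for p in EXPRESSION_FILES]
for url in urls:
    print(url)
```

Updating to a newer revision of the data would then mean changing `COMMIT` once instead of editing every line.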
3 changes: 3 additions & 0 deletions data/expression_data/E-MTAB-2452_hugene11st_SCANfast.pcl
Git LFS file not shown
22,264 changes: 22,264 additions & 0 deletions data/expression_data/GSE18885_series_matrix.txt

Large diffs are not rendered by default.

3 changes: 3 additions & 0 deletions data/expression_data/NARES_SCANfast_ComBat.pcl
Git LFS file not shown
3 changes: 3 additions & 0 deletions data/expression_data/SLE_WB_all_microarray_QN_zto_before.pcl
Git LFS file not shown
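
The "3 additions" reported for each `.pcl` file are consistent with Git LFS pointer files, which are short text stubs (a `version` line, an `oid` line, and a `size` line) committed in place of the real data. If the repository is cloned without Git LFS installed, those stubs are what end up on disk. A rough check, assuming the standard pointer format and using the hypothetical helper name `is_lfs_pointer`, might look like this:

```python
from pathlib import Path

# First line of a Git LFS pointer file under the v1 pointer spec.
LFS_SPEC_LINE = b"version https://git-lfs.github.com/spec/v1"


def is_lfs_pointer(path: Path) -> bool:
    """Return True if the file starts with the Git LFS pointer header."""
    with open(path, "rb") as handle:
        head = handle.read(len(LFS_SPEC_LINE))
    return head == LFS_SPEC_LINE
```

Running this on a `.pcl` file before parsing it gives a clearer failure than a confusing parse error from a three-line stub.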
143 changes: 143 additions & 0 deletions data/sample_info/E-GEOD-39088.sdrf.txt

Large diffs are not rendered by default.

997 changes: 997 additions & 0 deletions data/sample_info/E-GEOD-65391.sdrf.txt

Large diffs are not rendered by default.

126 changes: 126 additions & 0 deletions data/sample_info/E-GEOD-78193.sdrf.txt

Large diffs are not rendered by default.

48 changes: 48 additions & 0 deletions data/sample_info/E-MTAB-2452.sdrf.txt

Large diffs are not rendered by default.

77 changes: 77 additions & 0 deletions data/sample_info/NARES_demographic_data.tsv
@@ -0,0 +1,77 @@
Sample Disease Disease_Activity Batch Classification Severity ANCA Age Race Gender Ethnicity Disease_Duration (yrs) Flares PGA BVAS VDI Smoking_pkyrs Smoking_Status Steroids_daily_pred_mg Steroids_Cat Immune_Meds GC_or_Immune Nasal_Steroids Immune_or_Nasal Any_Immune
N1004 GPA Active 1 V3 Severe MPO 62 White F non-Hisp 6 8 4 7 10 5 Former 30 1 None 1 Y 1 1
N1007 GPA Inactive 1 V2 Severe PR3 43 White M non-Hisp 5 0 0 0 1 19 Former 0 0 None 0 N 0 0
N1017 GPA Never 1 V1 Severe PR3 64 White F non-Hisp 0.75 0 0 0 0 20 Former 5 1 Azathioprine 1 N 1 1
N1025 Control Control 1 C1 N/A Neg 67 White F non-Hisp . . . . . 25 Former 0 0 None 0 N 0 0
N1030 Control Control 1 C2 N/A Neg 63 White F non-Hisp 2 0 . . . 40 Former 15 1 None 1 Y 1 1
N1033 Control Control 1 C2 N/A Neg 67 White M non-Hisp 1 2 . . . 30 Former 15 1 None 1 N 0 1
N1006 Control Control 1 C3 N/A Neg 52 White M non-Hisp . . . . . 0 Never 0 0 None 0 N 0 0
N1009 Control Control 1 C1 N/A Neg 61 White F non-Hisp . . . . . 0 Never 0 0 None 0 N 0 0
N1011 GPA Inactive 1 V2 Severe PR3 54 White M non-Hisp 3 1 0 0 0 0 Never 0 0 None 0 N 0 0
N1012 Control Control 1 C3 N/A Neg 52 White F non-Hisp . . . . . 0 Never 15 1 Rituximab 1 N 1 1
N1014 Control Control 1 C1 N/A Neg 59 White F non-Hisp . . . . . 0 Never 0 0 None 0 N 0 0
N1015 GPA Active 1 V3 Limited PR3 19 White F non-Hisp 5 6 3 3 2 0 Never 7 1 Rituximab 1 N 1 1
N1018 GPA Inactive 1 V2 Limited MPO 72 White F non-Hisp 10 3 0 0 2 0 Never 0 0 Azathioprine 1 N 1 1
N1021 GPA Never 1 V1 Severe MPO 54 White M non-Hisp 4 2 0 0 3 15 Former 40 1 CYC 1 N 1 1
N1026 GPA Never 1 V1 Severe PR3 78 White F non-Hisp 6 0 0 0 2 0 Never 0 0 None 0 N 0 0
N1029 Control Control 1 C2 N/A Neg 51 White M non-Hisp 2 0 . . . 0 Never 0 0 MTX 1 N 1 1
N1001 GPA Inactive 2 V2 Limited PR3 31 White F non-Hisp 3 0 0 0 0 0 Never 2.5 1 MTX 1 N 1 1
N1002 Control Control 2 C1 N/A Neg 27 White M Hispanic . . . . . 0 Never 0 0 None 0 N 0 0
N1013 GPA Never 2 V1 Limited PR3 61 White M non-Hisp 17 0 0 0 0 94 Current 0 0 MTX 1 N 1 1
N1016 GPA Inactive 2 V2 Limited PR3 51 White M non-Hisp 7 0 0 0 1 0 Never 0 0 None 0 N 0 0
N1019 Control Control 2 C1 N/A Neg 76 White M non-Hisp . . . . . 40 Former 0 0 None 0 N 0 0
N1024 GPA Inactive 2 V2 Severe PR3 67 White M non-Hisp 4 0 0 0 2 20 Former 0 0 MTX 1 N 1 1
N1027 Control Control 2 C2 N/A Neg 69 White M non-Hisp 6 2 . . . 15 Former 0 0 None 0 N 0 0
N1028 Control Control 2 C2 N/A Neg 40 White M non-Hisp 1 0 . . . 0 Never 10 1 None 1 N 0 1
N1031 GPA Never 2 V1 Severe PR3 44 White F non-Hisp 2 0 0 0 2 5 Former 0 0 MTX 1 N 1 1
N1032 Control Control 2 C1 N/A Neg 47 White M non-Hisp . . . . . 0 Never 0 0 None 0 N 0 0
N1034 GPA Inactive 2 V2 Severe PR3 48 White M non-Hisp 2 0 0 0 0 30 Current 0 0 None 0 N 0 0
N1036 Control Control 2 C3 N/A Neg 39 Asian M non-Hisp . . . . . 0 Never 0 0 None 0 Y 1 1
N1037 Control Control 2 C2 N/A Neg 44 White F non-Hisp 8 . . . . 0 Never 0 0 CYC 1 N 1 1
N1038 Control Control 2 C3 N/A Neg 18 Black F Hispanic . . . . . 0 Never 0 0 None 0 Y 1 1
N1040 GPA Active 2 V3 Limited PR3 32 White F non-Hisp 3 1 2 2 0 0 Never 0 0 MTX 1 N 0 0
N1041 GPA Active 2 V3 Severe MPO 56 White F non-Hisp 1.5 1 2 3 3 0 Never 7.5 1 Rituximab 1 Y 1 1
N1093 Control Control 3 C2 N/A Neg 28 White M non-Hisp 1.5 0 . . 0 2 Former 0 0 None 0 N 0 0
N1003 GPA Inactive 3 V2 Limited MPO 63 White F non-Hisp 5 0 0 0 5 100 Former 0 0 None 0 Y 1 1
N1042A GPA Active 3 V3 Limited PR3 42 White M non-Hisp 0.3 0 3 4 0 0 Never 50 1 None 1 N 0 1
N1043B Control Control 3 C1 N/A Neg 38 White F non-Hisp . . . . . 0 Never 0 0 None 0 N 0 0
N1044B GPA Active 3 V3 Limited PR3 40 White M Hispanic 0.8 0 4 5 2 0 Never 15 1 Azathioprine 1 N 1 1
N1048B GPA Active 3 V3 Severe PR3 63 White M non-Hisp 8 1 5 10 5 50 Former 30 1 Azathioprine 1 N 1 1
N1050A GPA Active 3 V3 Limited PR3 44 Arabic F non-Hisp 0.8 0 1 1 0 0 Never 7.5 1 MTX 1 N 1 1
N1052A GPA Never 3 V1 Severe PR3 69 White F non-Hisp 13 1 0 0 3 7 Former 5 1 MMF 1 N 1 1
N1053B GPA Never 3 V1 Severe PR3 48 White M non-Hisp 5 2 0 0 0 7 Former 0 0 Rituximab 1 N 1 1
N1055B GPA Inactive 3 V2 Limited PR3 47 White F non-Hisp 9 2 0 0 2 0 Never 0 0 None 0 N 0 0
N1057A GPA Never 3 V1 Severe MPO 75 White F non-Hisp 6 3 0 0 3 45 Former 0 0 None 0 N 0 0
N1058 GPA Inactive 3 V2 Severe PR3 29 White M non-Hisp 3 2 1 1 0 0 Never 0 0 MTX 1 N 1 1
N1060 GPA Never 3 V1 Severe PR3 58 White M non-Hisp 5 1 0 0 0 25 Former 10 1 RTX 1 N 1 1
N1061 GPA Active 3 V3 Severe PR3 55 White F non-Hisp 0.5 0 2 9 1 30 Former 30 1 None 1 N 0 1
N1064 Control Control 3 C1 N/A Neg 57 White M non-Hisp . . . . . 0 Never 0 0 None 0 N 0 0
N1066 Control Control 3 C1 N/A Neg 71 White M non-Hisp . . . . . 30 Former 0 0 None 0 N 0 0
N1068 Control Control 3 C2 N/A Neg 51 White M non-Hisp . . . . . 0 Never 0 0 None 0 Y 1 1
N1069 Control Control 3 C2 N/A Neg 51 Black F non-Hisp . . . . . 0 Never 0 0 MTXINFLIX 1 N 1 1
N1070 Control Control 3 C2 N/A Neg 43 White M non-Hisp . . . . . 10 Former 5 1 None 1 N 0 1
N1073 EGPA "NA" 3 C4 N/A MPO 56 White F non-Hisp 2 0 0 . 0 0 Never 5 1 MTX 1 N 1 1
N1074 GPA Inactive 3 V2 Limited PR3 71 White M non-Hisp 11 2 0 0 3 0 Never 0 0 MTX 1 N 1 1
N1081 Control Control 3 C1 N/A Neg 62 White F non-Hisp . . . . . 0 Never 0 0 None 0 N 0 0
N1082 Control Control 3 C2 N/A Neg 45 White F non-Hisp 1 0 . . . 0 Never 0 0 None 0 Y 1 1
N1086 Control Control 3 C3 N/A Neg 41 White M non-Hisp . . . . . 0 Never 0 0 None 0 N 0 0
N1087 Control Control 3 C2 N/A Neg 51 White M non-Hisp 0.5 0 . . . 0 Never 20 1 MTX 1 N 1 1
N1088 GPA Inactive 3 V2 Severe PR3 23 White M non-Hisp 5 1 0 0 2 0 Never 5 1 None 1 N 0 1
N1091 GPA Inactive 3 V2 Severe PR3 64 White M non-Hisp 10 7 0 0 5 0 Never 7 1 MMF 1 N 1 1
N1092 Control Control 3 C1 N/A Neg 59 White F non-Hisp . . . . . 0 Never 0 0 None 0 N 0 0
N1094 EGPA "NA" 3 C4 N/A Neg 63 White F non-Hisp 14 4 0 0 5 0 Never 0 0 Azathioprine 1 N 1 1
N1095 EGPA "NA" 3 C4 N/A Neg 52 White F non-Hisp NA 2 0 0 4 0 Never 10 1 Azathioprine 1 N 1 1
N1096 EGPA "NA" 3 C4 N/A MPO 64 White F non-Hisp . 6 0 1 4 0 Never 2 1 MTX 1 N 1 1
N1097 Control Control 3 C1 N/A Neg 66 White M non-Hisp . . . . . 0 Never 0 0 None 0 N 0 0
N1098 Control Control 3 C2 N/A Neg 59 White F non-Hisp 8 . . . . 0 Never 0 0 MTX 1 N 1 1
N1099 Control Control 3 C2 N/A Neg 53 White M non-Hisp 3 . . . . 18 Former 20 1 None 1 N 0 1
N1100 GPA Active 3 V3 Severe PR3 43 White F non-Hisp 0.1 0 6 8 0 0 Never 60 1 None 1 N 0 1
N1101 EGPA "NA" 3 C4 N/A MPO 68 White M non-Hisp 6 4 2 3 1 0 Never 15 1 None 1 N 0 1
N1102 EGPA "NA" 3 C4 N/A MPO 68 White M non-Hisp 3 1 1 1 3 30 Former 0 0 MTX 1 N 1 1
N1103 EGPA "NA" 3 C4 N/A Neg 42 White M non-Hisp 10 3 0 0 5 0 Never 5 1 MTX 1 N 1 1
N1104 EGPA "NA" 3 C4 Severe MPO 50 White F non-Hisp 7 4 4 NA 0 0 Never 50 1 None 1 N 0 1
N1105 EGPA "NA" 3 C4 N/A Neg 28 White F non-Hisp 10 9 5 5 5 0 Never 20 1 RTX 1 N 1 1
N1106 EGPA "NA" 3 C4 N/A Neg 73 White F non-Hisp 2 1 0 0 5 40 Former 10 1 Azathioprine 1 N 1 1
N1107 EGPA "NA" 3 C4 N/A Neg 70 White F non-Hisp 4 1 0 0 1 0 Never 2 1 MMF 1 N 1 1
N1108 EGPA "NA" 3 C4 N/A Neg 56 White F non-Hisp 25 4 0 0 4 0 Never 5 1 Azathioprine 1 N 1 1
N1110 EGPA "NA" 3 C4 N/A MPO 38 Asian M non-Hisp 5 3 1 1 4 0 Never 8 1 None 1 N 0 1
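
The table above marks missing values with `.` and, in a few cells, `NA`, and one column name (`Disease_Duration (yrs)`) contains a space, so the file needs to be split on tabs rather than on whitespace. Below is a minimal sketch of reading it with the standard library. It assumes the file is tab-delimited, as the `.tsv` extension suggests, and that `.` and `NA` both mean "missing"; `N/A` in the Severity column is left untouched because it may mean "not applicable". `read_demographics` and `MISSING` are hypothetical names.

```python
import csv
import io
from typing import Dict, Iterable, List, Optional

# Assumed missing-value markers; "N/A" is deliberately not included.
MISSING = {".", "NA", ""}


def read_demographics(lines: Iterable[str]) -> List[Dict[str, Optional[str]]]:
    """Parse tab-delimited rows, mapping missing-value markers to None."""
    reader = csv.DictReader(lines, delimiter="\t")
    return [
        {key: (None if value in MISSING else value) for key, value in row.items()}
        for row in reader
    ]


# Small illustrative sample using a subset of the columns and two rows
# (N1004 and N1025) from the table above.
sample = (
    "Sample\tDisease\tSeverity\tDisease_Duration (yrs)\tFlares\n"
    "N1004\tGPA\tSevere\t6\t8\n"
    "N1025\tControl\tN/A\t.\t.\n"
)
rows = read_demographics(io.StringIO(sample))
for row in rows:
    print(row["Sample"], row["Disease_Duration (yrs)"], row["Flares"])
```

For the real file, the same function can be passed an open file handle, for example `open("data/sample_info/NARES_demographic_data.tsv", newline="")`.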