Add expression data, sample metadata, and recount2 #2

jaclyn-taroni · 2018-04-03T17:20:20Z

This PR covers the initial data download and setup, including git LFS, for this repository. This project uses a data repository (greenelab/rheum-plier-data), from which I obtain the expression data and sample metadata. Due to size constraints, the recount2 data processed using source code from that repository is stored on figshare, so the recount2 expression data and PLIER model are downloaded from figshare with wget.

huqiwen0313

Looks good to me

gwaybio · 2018-04-03T17:47:06Z

00-data_download.sh

+cd expression_data
+
+# sle-wb data
+wget https://github.com/greenelab/rheum-plier-data/raw/4be547553f24fecac9e2f5c2b469a17f9df253f0/sle-wb/processed/aggregated_data/SLE_WB_all_microarray_QN_zto_before.pcl


I am wondering why you're storing these files in this repo as well. These files can be read in directly from the URL (in both R and Python):

url <- 'https://github.com/greenelab/rheum-plier-data/raw/4be547553f24fecac9e2f5c2b469a17f9df253f0/sle-wb/processed/aggregated_data/SLE_WB_all_microarray_QN_zto_before.pcl'
data <- readr::read_tsv(url)

In the case that I have to load this into R in multiple scripts/notebooks/what have you, I think it would be preferable to not have to read this from the URL (somewhere between 400-500MB in this particular case) each time. I can add these files to .gitignore, though, if you think that's better.

Is it faster to read from the file than the url? Are there other concerns too?

It should be a few orders of magnitude faster unless this does some serious caching.

yeah, in python (did not test in R)

%%time
import pandas as pd
url = 'https://github.com/greenelab/rheum-plier-data/raw/4be547553f24fecac9e2f5c2b469a17f9df253f0/sle-wb/processed/aggregated_data/SLE_WB_all_microarray_QN_zto_before.pcl'
data = pd.read_table(url)

CPU times: user 6.61 s, sys: 1.51 s, total: 8.12 s
Wall time: 55.1 s

%%time
file_loc = 'SLE_WB_all_microarray_QN_zto_before.pcl'
data = pd.read_table(file_loc)

CPU times: user 6.61 s, sys: 1.2 s, total: 7.81 s
Wall time: 12.4 s
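The `%%time` cell magic only works inside IPython/Jupyter. For anyone who wants to repeat the URL-versus-local-file comparison from a plain script, a minimal sketch using only the standard library might look like the following. The helper name `timed` is hypothetical and not part of this PR.

```python
import time


def timed(fn, *args, **kwargs):
    """Call fn(*args, **kwargs) and return (result, wall-clock seconds).

    Hypothetical helper for illustration; it measures wall time only,
    not the user/sys CPU split that %%time reports.
    """
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    elapsed = time.perf_counter() - start
    return result, elapsed


if __name__ == "__main__":
    # Stand-in workload; swap in the actual read call to compare sources.
    total, seconds = timed(sum, range(1_000_000))
    print(f"sum took {seconds:.3f} s")
```

Assuming pandas is installed, `timed(pd.read_table, url)` and `timed(pd.read_table, file_loc)` would give the two wall times to compare.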

Fast internet connection there 👍

I wonder if this use case is suitable for a git submodule then 🤔. Up to you @jaclyn-taroni; I will approve this PR (just wondering about pros vs. cons of alternative solutions).

So I think I will keep downloading the data this way rather than using a submodule or reading directly from the url, but I will ignore the data from that GitHub repo (same as what I'm doing with the recount2 data from figshare). Will update in next commit.
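The "download once, then read the local copy" approach discussed above can also be sketched in Python for notebooks that cannot rely on the shell script having been run. This is only an illustration under stated assumptions: the function name `fetch_once` is hypothetical, and the default `expression_data` directory is taken from the `cd expression_data` line in the diff.

```python
from pathlib import Path
from urllib.parse import urlparse
from urllib.request import urlretrieve


def fetch_once(url, dest_dir="expression_data"):
    """Download url into dest_dir unless a file of that name already exists.

    Hypothetical helper, not part of this PR. Returns the local path.
    """
    dest_dir = Path(dest_dir)
    dest_dir.mkdir(parents=True, exist_ok=True)
    # Use the last component of the URL path as the local file name.
    target = dest_dir / Path(urlparse(url).path).name
    if not target.exists():
        urlretrieve(url, str(target))
    return target
```

A later `pd.read_table(fetch_once(url))` would then hit the network only on the first call. Note that this sketch does not verify the download, so an interrupted transfer could leave a truncated file that is then reused.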

jaclyn-taroni added 7 commits April 3, 2018 10:44

Add .gitignore

bf69541

Add shell script for data download

f29b2ae

Ignore recount2 data from figshare

5a2e754

Add git LFS tracking pcl

8c763ad

Add microarray PCL (lfs)

bd1bb44

Add GSE18885 series matrix

e2de1a7

Add sample/phenotype data

f9125ea

jaclyn-taroni requested review from gwaybio and huqiwen0313 April 3, 2018 17:20

huqiwen0313 approved these changes Apr 3, 2018

gwaybio reviewed Apr 3, 2018

gwaybio approved these changes Apr 3, 2018

Ignore microarray expression and sample metadata

ef60a5b

jaclyn-taroni merged commit 11d92ce into greenelab:master Apr 3, 2018

jaclyn-taroni deleted the data-download branch April 3, 2018 20:56
