Add expression data, sample metadata, and recount2 #2

Merged
merged 8 commits into from
Apr 3, 2018

Conversation

jaclyn-taroni
Collaborator

This PR is the initial data download and setup for this repository, including Git LFS. The project relies on a separate data repository, greenelab/rheum-plier-data, from which I obtain the expression data and sample metadata. Due to size constraints, the recount2 data processed with source code from that repository is stored on figshare, so the recount2 expression data and PLIER model are downloaded from figshare with wget.

@huqiwen0313 huqiwen0313 left a comment

Looks good to me

cd expression_data

# sle-wb data
wget https://github.com/greenelab/rheum-plier-data/raw/4be547553f24fecac9e2f5c2b469a17f9df253f0/sle-wb/processed/aggregated_data/SLE_WB_all_microarray_QN_zto_before.pcl

I am wondering why you're also storing these files in this repo. They can be read directly from the URL (in both R and Python):

url <- 'https://github.com/greenelab/rheum-plier-data/raw/4be547553f24fecac9e2f5c2b469a17f9df253f0/sle-wb/processed/aggregated_data/SLE_WB_all_microarray_QN_zto_before.pcl'
data <- readr::read_tsv(url)

Collaborator Author

If I have to load this into R in multiple scripts, notebooks, or what have you, I would prefer not to read it from the URL each time (the file is somewhere between 400 and 500 MB in this particular case). I can add these files to .gitignore, though, if you think that's better.
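The download-once approach described here could be sketched in Python roughly as follows. This is a minimal illustration, not code from this PR: the function name `fetch_once` and its `downloader` parameter are hypothetical, and the only library call assumed is the standard library's `urllib.request.urlretrieve`.

```python
import os
import urllib.request


def fetch_once(url, dest_dir, downloader=urllib.request.urlretrieve):
    """Download `url` into `dest_dir` unless a local copy already exists.

    Returns the path to the local file. `downloader` is any callable that
    takes (url, destination_path); it is a hypothetical hook that lets the
    logic be exercised without network access.
    """
    os.makedirs(dest_dir, exist_ok=True)
    local_path = os.path.join(dest_dir, os.path.basename(url))
    if not os.path.exists(local_path):
        # Download to a temporary name first so an interrupted transfer
        # does not leave a truncated file at the final location.
        tmp_path = local_path + ".part"
        downloader(url, tmp_path)
        os.replace(tmp_path, local_path)
    return local_path
```

Every script or notebook could then call something like `pd.read_table(fetch_once(url, "expression_data"))`, so only the first call pays the download cost and later calls read the local copy.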

@gwaybio gwaybio Apr 3, 2018

Is it faster to read from the file than the url? Are there other concerns too?

Member

It should be a few orders of magnitude faster unless this does some serious caching.

yeah, in python (did not test in R)

%%time
import pandas as pd
url = 'https://github.com/greenelab/rheum-plier-data/raw/4be547553f24fecac9e2f5c2b469a17f9df253f0/sle-wb/processed/aggregated_data/SLE_WB_all_microarray_QN_zto_before.pcl'
data = pd.read_table(url)

CPU times: user 6.61 s, sys: 1.51 s, total: 8.12 s
Wall time: 55.1 s

%%time
# Read the local copy downloaded with wget
file_loc = 'SLE_WB_all_microarray_QN_zto_before.pcl'
data = pd.read_table(file_loc)

CPU times: user 6.61 s, sys: 1.2 s, total: 7.81 s
Wall time: 12.4 s
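Outside a notebook, where the `%%time` cell magic is not available, a comparison like the one above could be sketched with `time.perf_counter`. The helper names `time_call` and `compare_sources` are hypothetical and introduced only for illustration.

```python
import time


def time_call(func, *args, **kwargs):
    """Run func(*args, **kwargs) once and return (result, wall_seconds)."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    elapsed = time.perf_counter() - start
    return result, elapsed


def compare_sources(reader, sources):
    """Time `reader` on each named source and return {name: wall_seconds}."""
    timings = {}
    for name, source in sources.items():
        # Discard the loaded data; only the elapsed wall time is kept.
        _, timings[name] = time_call(reader, source)
    return timings
```

A call such as `compare_sources(pd.read_table, {"url": url, "file": file_loc})` would return wall times for both sources. A single run per source is noisy (network conditions, disk caching), so repeating the measurement a few times gives a fairer picture.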

Member

Fast internet connection there 👍

@gwaybio gwaybio Apr 3, 2018

I wonder if this use case is suitable for a git submodule then 🤔. Up to you @jaclyn-taroni, I will approve this PR (just wondering about pros vs. cons of alternative solutions).

Collaborator Author

So I think I will keep downloading the data this way rather than using a submodule or reading directly from the url, but I will ignore the data from that GitHub repo (same as what I'm doing with the recount2 data from figshare). Will update in next commit.

@jaclyn-taroni jaclyn-taroni merged commit 11d92ce into greenelab:master Apr 3, 2018
@jaclyn-taroni jaclyn-taroni deleted the data-download branch April 3, 2018 20:56