Add expression data, sample metadata, and recount2 #2
Looks good to me
cd expression_data

# sle-wb data
wget https://github.com/greenelab/rheum-plier-data/raw/4be547553f24fecac9e2f5c2b469a17f9df253f0/sle-wb/processed/aggregated_data/SLE_WB_all_microarray_QN_zto_before.pcl
I am wondering why you're storing these files in this repo as well. They can be read directly from the URL (in both R and Python):
url <- 'https://github.com/greenelab/rheum-plier-data/raw/4be547553f24fecac9e2f5c2b469a17f9df253f0/sle-wb/processed/aggregated_data/SLE_WB_all_microarray_QN_zto_before.pcl'
data <- readr::read_tsv(url)
If I have to load this into R in multiple scripts, notebooks, and so on, I think it would be preferable not to read it from the URL each time (the file is somewhere between 400 and 500 MB in this particular case). I can add these files to .gitignore, though, if you think that's better.
Is it faster to read from the file than the url? Are there other concerns too?
It should be a few orders of magnitude faster unless this does some serious caching.
yeah, in python (did not test in R)
%%time
import pandas as pd
url = 'https://github.com/greenelab/rheum-plier-data/raw/4be547553f24fecac9e2f5c2b469a17f9df253f0/sle-wb/processed/aggregated_data/SLE_WB_all_microarray_QN_zto_before.pcl'
data = pd.read_table(url)
CPU times: user 6.61 s, sys: 1.51 s, total: 8.12 s
Wall time: 55.1 s
%%time
file_loc = 'SLE_WB_all_microarray_QN_zto_before.pcl'
data = pd.read_table(file_loc)
CPU times: user 6.61 s, sys: 1.2 s, total: 7.81 s
Wall time: 12.4 s
Fast internet connection there 👍
I wonder if this use case is suitable for a git submodule, then 🤔. Up to you, @jaclyn-taroni; I will approve this PR. (Just wondering about the pros and cons of alternative solutions.)
So I think I will keep downloading the data this way rather than using a submodule or reading directly from the URL, but I will add the data from that GitHub repo to .gitignore (same as what I'm doing with the recount2 data from figshare). Will update in the next commit.
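For anyone who wants the same download-once behavior from Python rather than from the wget script, here is a minimal sketch. It is not the code in this PR: the function name `fetch_once` and the default `expression_data` target directory are assumptions made for illustration. It skips the download when a file with the same name is already on disk, so repeated runs pay the network cost only once.

```python
import urllib.request
from pathlib import Path


def fetch_once(url, dest_dir="expression_data"):
    """Download url into dest_dir unless the file is already there.

    Hypothetical helper for illustration; returns the local path.
    """
    dest_dir = Path(dest_dir)
    dest_dir.mkdir(parents=True, exist_ok=True)
    # Use the last path component of the URL as the local file name.
    dest = dest_dir / url.rsplit("/", 1)[-1]
    if not dest.exists():
        # Download under a temporary name first so that an interrupted
        # transfer is not mistaken for a complete file on the next run.
        tmp = dest.with_name(dest.name + ".part")
        urllib.request.urlretrieve(url, tmp)
        tmp.replace(dest)
    return dest
```

The returned path can then be handed to the reader used above, for example `pd.read_table(fetch_once(url))`. Note that the existence check does not detect an upstream change; since the URL in this thread is pinned to a commit hash, that should not matter here.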
This PR is the initial data download and setup, including git LFS, for this repository. This project utilizes a data repository (greenelab/rheum-plier-data). I obtain expression data and sample metadata from greenelab/rheum-plier-data. Due to size constraints, the recount2 data processed using source code from that repository is stored on figshare, and the recount2 expression data and PLIER model are downloaded from there with wget.