CGHub download #38
... and a little background: https://cghub.ucsc.edu/datasets/ccle.html
+1
{
  "name": "ccle-wgs",
  "description": "Cancer Cell Line Encyclopedia whole genome sequencing",
  "target": "ccle/wgs",
Nit: We've updated the format recently, so that target is no longer used (see https://github.com/bigdatagenomics/eggo/blob/master/docs/spec.md), so this line can be omitted. (I still need to update the other files in the registry).
This looks good to me. It might be helpful to add a file in test/registry to make it easy to try out on a single file from CGHub.
I updated the registries according to your suggestions, and added the other public dataset available on CGHub, a benchmarking dataset published for the purpose of testing mutation callers. Unfortunately the smallest file from both studies is a ~5 GB RNA-seq BAM, so I used that one as the test registry. I am still testing the ETL code, so this is not ready for merge yet.
Ready for merge. GeneTorrent installation and CGHub download functions are tested. I couldn't get the full luigi DAG working, but it seems like that's a work in progress.
cghub_key = os.environ.get('CGHUB_KEY') or CGHUB_PUBLIC_KEY

# 2. Parse url for analysis ID and filename
analysis_id, filename = url.lstrip('cghub://').split('/')
Does the split ever produce more than two objects?
analysis_id is a CGHub concept?
Yes, the CGHub metadata store centers around the analysis_id. It refers to a single downloadable object (which may contain multiple files). I made up a cghub url to fit the existing registry structure, so it is easy to change. Right now I am creating them to only contain the analysis ID and a single filename of interest (BAM file, generally), so this would always split into 2 objects.
However, the CGHub REST API returns a JSON with lots of metadata for a given analysis_id. One option would be to store this full JSON in the registry, though it would be long and cluttered, not as easily human-readable. Or we could just use the analysis ID and have the code call the API to get the filename, filesize, and other metadata if necessary.
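A minimal sketch of a stricter parse for the cghub url discussed above, in case it is useful. The function name parse_cghub_url and the analysis ID in the example are made up for illustration and are not part of the PR. It slices the prefix off explicitly, because str.lstrip takes a set of characters rather than a prefix, so url.lstrip('cghub://') also eats leading b, c, g, h or u characters of the analysis ID. It also fails loudly if the url does not have both parts.

```python
def parse_cghub_url(url):
    """Split 'cghub://<analysis_id>/<filename>' into (analysis_id, filename).

    Hypothetical helper for illustration; not the function used in this PR.
    """
    prefix = 'cghub://'
    if not url.startswith(prefix):
        raise ValueError('not a cghub url: %r' % url)
    # Slice off the prefix instead of calling lstrip, which strips any
    # leading run of the characters c, g, h, u, b, ':' and '/'.
    remainder = url[len(prefix):]
    # partition splits on the first '/' only, so the result always has
    # exactly two parts even if the filename itself contains a '/'.
    analysis_id, sep, filename = remainder.partition('/')
    if not analysis_id or not sep or not filename:
        raise ValueError('expected cghub://<analysis_id>/<filename>: %r' % url)
    return analysis_id, filename


# The analysis ID below is invented for the example.
example = 'cghub://bc0dee07-06b9/sample.bam'
print(parse_cghub_url(example))
# For comparison, lstrip drops the leading 'bc' of the ID as well.
print(example.lstrip('cghub://').split('/'))
```

With the invented ID above, the lstrip version yields '0dee07-06b9' as the analysis ID, while the explicit slice keeps 'bc0dee07-06b9'.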
Standardize behavior of cghub, ftp, and http download functions
- flatten the cghub download directory and omit return value
- rename http_download to curl download to capture ftp use case
Also add “editions” field to test-cghub.json
Ok, fixed it according to your suggestions (although the awscli globbing issue is now moot after recent updates to master). Let me know if you prefer a rebase.
Hi Uri
Here is a bit of code to follow up on our emails. This is untested and not ready for merge, but I wanted to get your feedback before continuing.
CGHub is hosting more than 1,200 public BAM files from the Cancer Cell Line Encyclopedia that are available only through their GeneTorrent download client. TCGA BAM files, while not publicly available, can be downloaded using the same framework by anyone with an authorized key file.
The dependencies for installing GeneTorrent are tricky on CentOS, so that part would require the most attention. The other changes are straightforward.
I invented a URL of the form
cghub://<analysis_id>/<filename>
to distinguish these downloads, but we should think about whether there is a better way.
Jake
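As a rough illustration of how a scheme like this can distinguish download types, here is a small sketch that picks a download method from the URL scheme. The function name choose_downloader and the labels it returns are hypothetical and do not correspond to names in the eggo code. It assumes only that cghub URLs go through GeneTorrent and that http, https and ftp URLs go through curl.

```python
from urllib.parse import urlparse


def choose_downloader(url):
    """Return a label for the download method implied by the URL scheme.

    Hypothetical helper for illustration only.
    """
    scheme = urlparse(url).scheme
    if scheme == 'cghub':
        # Invented scheme from this PR: fetched with the GeneTorrent client.
        return 'genetorrent'
    if scheme in ('http', 'https', 'ftp'):
        # Plain transfers that curl can handle.
        return 'curl'
    raise ValueError('unsupported URL scheme: %r' % url)


# Example URLs below are placeholders.
print(choose_downloader('cghub://some-analysis-id/sample.bam'))
print(choose_downloader('ftp://example.org/data/sample.vcf.gz'))
```

Keeping the decision on the scheme alone means the registry entries stay plain URL strings, and a new source only needs one more branch.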