CGHub download #38

jfeala · 2015-04-20T14:35:11Z

Hi Uri

Here is a bit of code to follow up on our emails. This is untested and not ready for merge, but I wanted to get your feedback before continuing.

CGHub hosts more than 1,200 public BAM files from the Cancer Cell Line Encyclopedia, and they are available only through CGHub's GeneTorrent download client. TCGA BAM files, while not publicly available, can be downloaded through the same framework by anyone with an authorized key file.

The dependencies for installing GeneTorrent are tricky on CentOS, so that part would require the most attention. The other changes are straightforward.

I invented a URL of the form cghub://<analysis_id>/<filename> to distinguish these downloads, but we should think about whether there is a better way.

Jake

jfeala · 2015-04-20T14:37:36Z

... and a little background: https://cghub.ucsc.edu/datasets/ccle.html

fnothaft · 2015-04-20T15:14:16Z

+1

tomwhite · 2015-04-21T13:47:36Z

registry/ccle-wgs.json

+{
+    "name": "ccle-wgs",
+    "description": "Cancer Cell Line Encyclopedia whole genome sequencing",
+    "target": "ccle/wgs",


Nit: We've updated the format recently and the target field is no longer used (see https://github.com/bigdatagenomics/eggo/blob/master/docs/spec.md), so this line can be omitted. (I still need to update the other files in the registry.)

tomwhite · 2015-04-21T13:49:14Z

This looks good to me. It might be helpful to add a file in test/registry to make it easy to try out on a single file from CGHub.

jfeala · 2015-04-23T14:50:02Z

I updated the registries according to your suggestions and added the other public dataset available on CGHub, a benchmarking dataset published for the purpose of testing mutation callers. Unfortunately, the smallest file from both studies is a ~5Gb RNA-seq BAM, so I used that one for the test registry.

Still testing the ETL code, so this is not ready for merge yet.

jfeala · 2015-04-24T01:53:34Z

Ready for merge. GeneTorrent installation and CGHub download functions are tested. I couldn't get the full luigi DAG working, but it seems like that's a work in progress.

laserson · 2015-04-28T17:23:28Z

eggo/dag.py

+        cghub_key = os.environ.get('CGHUB_KEY') or CGHUB_PUBLIC_KEY
+
+    # 2. Parse url for analysis ID and filename
+    analysis_id, filename = url.lstrip('cghub://').split('/')


Does the split ever produce more than two objects?

analysis_id is a CGHub concept?

Yes, the CGHub metadata store centers around the analysis_id. It refers to a single downloadable object (which may contain multiple files). I made up a cghub URL to fit the existing registry structure, so it is easy to change. Right now I am creating the URLs so that they contain only the analysis ID and a single filename of interest (generally a BAM file), so this would always split into 2 objects.

However, the CGHub REST API returns a JSON document with lots of metadata for a given analysis_id. One option would be to store this full JSON in the registry, though it would be long and cluttered, and not as easily human-readable. Or we could just use the analysis ID and have the code call the API to get the filename, file size, and other metadata if necessary.
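One thing worth noting about the split discussed above: str.lstrip treats its argument as a set of characters rather than a prefix, so url.lstrip('cghub://') can also eat the leading characters of an analysis ID that happens to start with b, c, g, h, or u. Below is a minimal sketch of a stricter parser, assuming the cghub://<analysis_id>/<filename> form stays as proposed. The function name parse_cghub_url and the analysis ID in the usage note are hypothetical, not part of the PR.

```python
CGHUB_PREFIX = 'cghub://'


def parse_cghub_url(url):
    """Split cghub://<analysis_id>/<filename> into (analysis_id, filename).

    Hypothetical helper: slices off the prefix instead of using lstrip,
    and rejects URLs that do not have exactly two non-empty parts.
    """
    if not url.startswith(CGHUB_PREFIX):
        raise ValueError('not a cghub URL: %r' % (url,))
    parts = url[len(CGHUB_PREFIX):].split('/')
    if len(parts) != 2 or not all(parts):
        raise ValueError('expected cghub://<analysis_id>/<filename>: %r' % (url,))
    return parts[0], parts[1]
```

For example, with a made-up ID, parse_cghub_url('cghub://bc8f0e2a-0000-0000-0000-000000000000/sample.bam') keeps the leading "bc" of the ID, whereas the lstrip version would drop it. A URL with extra path segments raises ValueError instead of failing at tuple unpacking.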

jfeala · 2015-05-02T03:20:30Z

OK, fixed it according to your suggestions (although the awscli globbing issue is now moot after recent updates to master). Let me know if you prefer a rebase.

Jake Feala added 2 commits April 20, 2015 10:11

cghub download via GeneTorrent

1cc8f7c

CCLE BAM files registry

e79af51

tomwhite reviewed Apr 21, 2015

Jake Feala added 3 commits April 23, 2015 10:33

added test registry for cghub files

81f513b

removed "target" field from registries

4b5d3de

added TCGA WGS benchmark registry

c7f0ea8

cleanup and debug

ecf47fe

laserson reviewed Apr 28, 2015

Jake Feala added 3 commits May 1, 2015 22:57

Standardize download functions

f52643e

Standardize behavior of cghub, ftp, and http download functions:
- flatten the cghub download directory and omit return value
- rename http_download to curl download to capture ftp use case

Also add “editions” field to test-cghub.json
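As a rough illustration of what "flatten the cghub download directory" could look like, here is a minimal sketch. It assumes the download client leaves the files in a subdirectory named after the analysis ID, which is an assumption and not something shown in this thread. The function name flatten_download_dir is hypothetical and is not taken from the commit.

```python
import os
import shutil


def flatten_download_dir(dest_dir, analysis_id):
    """Move files from dest_dir/<analysis_id>/ up into dest_dir.

    Hypothetical helper. Assumes the downloaded files sit in a
    subdirectory named after the analysis ID. Returns nothing, so that
    it behaves like the other download functions.
    """
    nested = os.path.join(dest_dir, analysis_id)
    for name in os.listdir(nested):
        shutil.move(os.path.join(nested, name), os.path.join(dest_dir, name))
    # The subdirectory is empty at this point and can be removed.
    os.rmdir(nested)
```

With this shape, the cghub, ftp, and http paths would all leave their files directly in the destination directory, so callers would not need to special-case CGHub downloads.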

Merge upstream master

447b24d

Bugfix merge conflict

39003e9

tomwhite mentioned this pull request May 8, 2015

[EGGO-18] WIP: Setup config files #53

Merged

laserson force-pushed the master branch from 38b890a to fbffdb6 Compare August 18, 2015 00:21
