Apply compression to common data formats #10

brainstorm · 2011-12-15T12:56:15Z

.sam files should not be present for more than X months on the filesystem. Automatic conversion to .bam can be performed.

Likewise, .fastq files that have gone unused for a reasonable amount of time should be compressed to .gz, which many bioinformatics tools support natively.
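A minimal sketch of how such candidates could be found on a plain filesystem, assuming "unused" can be approximated by the modification time. The function name `find_stale_files` and its parameters are hypothetical, and the age threshold is deliberately left to the caller since no value has been agreed on here:

```python
import time
from pathlib import Path


def find_stale_files(root, max_age_days, suffixes=(".sam", ".fastq"), now=None):
    """Return files under `root` with a matching suffix whose mtime is older
    than `max_age_days`. Using mtime as a proxy for "unused" is an assumption;
    atime may be more accurate where the filesystem records it."""
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400
    stale = []
    for path in Path(root).rglob("*"):
        if path.is_file() and path.suffix in suffixes and path.stat().st_mtime < cutoff:
            stale.append(path)
    return sorted(stale)
```

The returned .sam paths could then be handed to a converter (for example `samtools view -b`), and the .fastq paths to gzip.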

samuell · 2011-12-15T13:05:13Z

Very interesting ideas! I wonder, though, how that fits in with our current plan, in which iRODS would mostly handle data while it is stored on SweStore, whereas for running analyses one would check out the data as normal files (basically because handling data directly via iRODS is quite cumbersome).

I think these are the kind of things for which an IRL meeting could help ... to decide on what we should aim for in these regards.

jhagberg · 2011-12-15T13:16:22Z

Yes, I agree. This is much like what I have thought iRODS can do!

I feel that we really need a design: a concept of how to work, and of how to use iRODS as a help and a tool in the day-to-day work. Perhaps we need help from other experts in that discussion.

A direct access vault can be a good way to get around the iget/iput problem. Then just iput the results from the analysis and use metadata to associate each result with its input files in iRODS, and so on.

jhagberg · 2012-04-10T10:08:17Z

Can you help sketch up the outline of the rules?

I can then write a periodic rule that checks for files and either bundles and compresses them, or just compresses them and then archives.

brainstorm · 2012-04-18T14:21:00Z

We will irsync files that follow those globs to uppmax:

https://github.com/SciLifeLab/bcbb/blob/master/nextgen/scripts/illumina_finished_msg.py#L239

Then, a first approach would be to look for uncompressed fastq files within iRODS (they will be under the fastq/ dir) and compress them using gzip, md5summing the resulting file.
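As a rough sketch of that step, assuming the files are reachable as normal files (checked out, or through a direct access vault) rather than addressed through iRODS itself. The function names `gzip_with_md5` and `compress_fastq_dir` are hypothetical, and the originals are left in place so that removal stays a separate, explicit decision:

```python
import gzip
import hashlib
import shutil
from pathlib import Path


def gzip_with_md5(fastq_path, chunk_size=1 << 20):
    """Write <name>.gz next to the input and return (gz_path, md5 of the .gz)."""
    src = Path(fastq_path)
    dst = src.with_name(src.name + ".gz")
    with src.open("rb") as fin, gzip.open(dst, "wb") as fout:
        shutil.copyfileobj(fin, fout, chunk_size)
    # Checksum the compressed file, read back in chunks to bound memory use.
    md5 = hashlib.md5()
    with dst.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            md5.update(chunk)
    return dst, md5.hexdigest()


def compress_fastq_dir(fastq_dir):
    """Compress every *.fastq directly under `fastq_dir`; map .gz name to md5."""
    results = {}
    for src in sorted(Path(fastq_dir).glob("*.fastq")):
        gz_path, digest = gzip_with_md5(src)
        results[gz_path.name] = digest
    return results
```

The returned checksums could be stored as metadata alongside the compressed files so that later transfers can be verified.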

We want to have easy access to metadata, so the fastq folder (the biggest) should be bundled independently from the lightweight metadata files (*.xml, etc.).
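One way this split could look, as a sketch using plain tar archives on a local copy of a run folder. The function name `bundle_run`, the archive naming, and the assumption that everything outside the top-level fastq/ dir counts as lightweight metadata are all illustrative, not a settled layout:

```python
import tarfile
from pathlib import Path


def bundle_run(run_dir, out_dir, fastq_dirname="fastq"):
    """Write two tar files into `out_dir`: one holding the fastq/ subtree and
    one holding every other file in the run folder.
    `out_dir` must be outside `run_dir`, or the archives would be re-included."""
    run_dir = Path(run_dir)
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    fastq_tar = out_dir / f"{run_dir.name}_fastq.tar"
    meta_tar = out_dir / f"{run_dir.name}_metadata.tar"
    with tarfile.open(fastq_tar, "w") as fq, tarfile.open(meta_tar, "w") as meta:
        for path in sorted(run_dir.rglob("*")):
            if not path.is_file():
                continue
            rel = path.relative_to(run_dir)
            # Files under the top-level fastq/ dir go to the big bundle.
            target = fq if rel.parts[0] == fastq_dirname else meta
            target.add(path, arcname=str(rel))
    return fastq_tar, meta_tar
```

With this layout the small metadata archive can be fetched and inspected without touching the much larger fastq bundle.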
