Apply compression to common data formats #10

brainstorm opened this issue Dec 15, 2011 · 4 comments

Comments

@brainstorm
Contributor

.sam files should not be present for more than X months on the filesystem. Automatic conversion to .bam can be performed.

Likewise, .fastq files that have gone unused for a reasonable amount of time should be compressed to .gz, which many bioinformatics tools support natively.
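As a rough illustration of the two ideas above, here is a minimal Python sketch that lists .sam and .fastq files older than a given age and builds a samtools command for the .sam to .bam conversion. The function names `find_stale` and `sam_to_bam_command` are hypothetical, the age threshold is left to the caller since "X months" is not decided in this thread, and modification time is used as a stand-in for "unused" (access times are often not recorded on cluster filesystems). It works on ordinary local files, not on iRODS objects.

```python
import time
from pathlib import Path

SECONDS_PER_DAY = 24 * 60 * 60


def find_stale(root, suffixes, max_age_days, now=None):
    """Return files under root whose suffix is in suffixes and whose
    modification time is more than max_age_days in the past."""
    now = time.time() if now is None else now
    cutoff = now - max_age_days * SECONDS_PER_DAY
    stale = []
    for path in Path(root).rglob("*"):
        if path.is_file() and path.suffix in suffixes and path.stat().st_mtime < cutoff:
            stale.append(path)
    return sorted(stale)


def sam_to_bam_command(sam_path):
    """Build (but do not run) a samtools command that writes a .bam file
    next to the given .sam file."""
    sam_path = Path(sam_path)
    bam_path = sam_path.with_suffix(".bam")
    return ["samtools", "view", "-b", "-o", str(bam_path), str(sam_path)]
```

A periodic job could call `find_stale(root, {".sam"}, max_age_days)` and hand each result to `subprocess.run(sam_to_bam_command(path), check=True)`, assuming samtools is installed on the machine doing the conversion.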

@samuell
Contributor

samuell commented Dec 15, 2011

Very interesting ideas! I wonder, though, how that fits in with our current plan: iRODS would mostly handle data while it is stored on SweStore, whereas when running analyses one would check out the data as normal files (basically because handling data directly via iRODS is quite cumbersome).

I think this is the kind of thing an in-person meeting could help with, to decide what we should aim for in these regards.

@jhagberg
Contributor

Yes, I agree. This is much like what I have thought iRODS can do!

I feel that we really need a design: a concept of how to work, and of how to use iRODS as a help and a tool in the day-to-day work. Perhaps we need help from other experts in that discussion.

A direct access vault can be a good way to get around the iget/iput problem. Then one would just iput the results from the analysis and use metadata to associate them with the different input files in iRODS, and so on.

@jhagberg
Contributor

Can you help sketch up the outline of the rules?

I can then write a periodic rule that checks for files and either bundles them and applies compression, or just applies compression and then archives them.

@brainstorm
Contributor Author

We will irsync the files that match these globs to UPPMAX:

https://github.com/SciLifeLab/bcbb/blob/master/nextgen/scripts/illumina_finished_msg.py#L239

Then, a first approach would be to look for uncompressed fastq files within iRODS (they will be under the fastq/ directory), compress them using gzip, and md5sum the resulting files.
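The compress-then-checksum step could be sketched in Python roughly as follows. This is only an illustration under some assumptions: it operates on files visible as a normal directory (for example after checking them out, or through a direct access vault), not through iRODS rules, and the function names `md5sum`, `compress_fastq` and `compress_fastq_dir` are made up for this example.

```python
import gzip
import hashlib
import shutil
from pathlib import Path


def md5sum(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in chunks to bound memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def compress_fastq(fastq_path, remove_original=False):
    """Gzip one fastq file to <name>.gz and return (gz_path, md5 of the .gz file)."""
    fastq_path = Path(fastq_path)
    gz_path = fastq_path.with_name(fastq_path.name + ".gz")
    with open(fastq_path, "rb") as src, gzip.open(gz_path, "wb") as dst:
        shutil.copyfileobj(src, dst)
    checksum = md5sum(gz_path)
    if remove_original:
        fastq_path.unlink()
    return gz_path, checksum


def compress_fastq_dir(fastq_dir):
    """Compress every uncompressed *.fastq file directly under fastq_dir.

    Returns a dict mapping each new .gz file name to its MD5 digest.
    Files that already end in .gz do not match the glob and are skipped.
    """
    results = {}
    for fastq_path in sorted(Path(fastq_dir).glob("*.fastq")):
        gz_path, checksum = compress_fastq(fastq_path)
        results[gz_path.name] = checksum
    return results
```

The originals are kept by default so that the checksums can be recorded (for instance as iRODS metadata) before anything is deleted; passing `remove_original=True` drops each source file once its .gz has been written.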

We want to have easy access to the metadata, so the fastq folder (the biggest one) should be bundled independently from the lightweight metadata files (*.xml, etc.).
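One way to picture the split bundling is the Python sketch below, which writes the fastq/ directory and the metadata files into two separate tar archives using the standard `tarfile` module. The function name `bundle_run`, the archive naming scheme, and the default pattern `("*.xml",)` are assumptions for illustration only; the real set of metadata globs would come from the script linked above.

```python
import tarfile
from pathlib import Path


def bundle_run(run_dir, out_dir, metadata_patterns=("*.xml",)):
    """Write two tar archives for one run directory:
    one holding the fastq/ subdirectory, one holding top-level metadata files.

    Returns (fastq_tar_path, metadata_tar_path).
    """
    run_dir = Path(run_dir)
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    fastq_tar = out_dir / (run_dir.name + "_fastq.tar")
    meta_tar = out_dir / (run_dir.name + "_metadata.tar")

    # Plain (uncompressed) tar: the fastq files are assumed to be gzipped already.
    with tarfile.open(str(fastq_tar), "w") as tar:
        tar.add(str(run_dir / "fastq"), arcname="fastq")

    # Small archive that can be fetched without touching the large fastq bundle.
    with tarfile.open(str(meta_tar), "w") as tar:
        for pattern in metadata_patterns:
            for path in sorted(run_dir.glob(pattern)):
                tar.add(str(path), arcname=path.name)

    return fastq_tar, meta_tar
```

Keeping the metadata archive separate means it stays small enough to retrieve and inspect on its own, which is the point of bundling the two independently.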

3 participants