Using s3 bucket for resources #2

kopardev · 2020-11-05T19:38:34Z

Can the resources be hosted in S3 buckets? For example, for hg38 and GENCODE release 30, can we use:

s3://nciccbr/Resources/hg38/gencode_release30/hg38.rRNA_interval_list.gz
s3://nciccbr/Resources/hg38/gencode_release30/qualimap_info.txt.gz
s3://nciccbr/Resources/hg38/gencode_release30/annotate.genes.txt.gz
s3://nciccbr/Resources/hg38/gencode_release30/karyoplot_gene_coordinates.txt.gz
s3://nciccbr/Resources/hg38/gencode_release30/genes.ref.bed.gz
s3://nciccbr/Resources/hg38/gencode_release30/karyobeds.tar.gz
s3://nciccbr/Resources/hg38/gencode_release30/geneinfo.bed.gz
s3://nciccbr/Resources/hg38/gencode_release30/annotate.isoforms.txt.gz
s3://nciccbr/Resources/hg38/gencode_release30/refFlat.txt.gz
s3://nciccbr/Resources/hg38/gencode_release30/gencode.v30.annotation.gtf.gz
s3://nciccbr/Resources/hg38/gencode_release30/rsemref.tar.gz
s3://nciccbr/Resources/hg38/hg38.fa.gz
s3://nciccbr/Resources/common/TruSeq_and_nextera_adapters.ngsqc.dat.gz
s3://nciccbr/Resources/common/fastq_screen.conf.gz
s3://nciccbr/Resources/common/fastqc.adapters.gz
s3://nciccbr/Resources/common/TruSeq_and_nextera_adapters_new.fa.gz
s3://nciccbr/Resources/common/adapters2.fa.gz

I have already uploaded all resources for hg38 (GENCODE release 30) except the STAR indices. If we provide the GTF on the fly, we only need the noGTF version of the STAR index, which is independent of the release version. All files on the S3 bucket are gzipped, and folders are stored as tar.gz archives (e.g. rsemref.tar.gz).
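As a rough illustration of the naming convention described above (plain files end in .gz, folders end in .tar.gz), here is a minimal Python sketch that works out what each resource should look like once unpacked. The helper name `unpacked_name` is hypothetical and not part of the pipeline.

```python
from typing import Tuple


def unpacked_name(s3_uri: str) -> Tuple[str, str]:
    """Return (kind, name) for a resource URI.

    kind is "dir" for a .tar.gz archive (a packed folder) and
    "file" for a single gzipped file or an uncompressed file.
    """
    # Take the last path component of the URI.
    name = s3_uri.rsplit("/", 1)[-1]
    # Check .tar.gz first, because it also ends with .gz.
    if name.endswith(".tar.gz"):
        return "dir", name[: -len(".tar.gz")]
    if name.endswith(".gz"):
        return "file", name[: -len(".gz")]
    return "file", name


for uri in (
    "s3://nciccbr/Resources/hg38/gencode_release30/rsemref.tar.gz",
    "s3://nciccbr/Resources/hg38/hg38.fa.gz",
):
    print(uri, "->", unpacked_name(uri))
```

A download step could use the returned kind to decide between extracting an archive and decompressing a single file.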

skchronicles · 2020-11-05T20:05:15Z

@kopardev
Yes, I will look into it.

I was also thinking we should create a custom set of references for the CI workflow.

Here is what I am thinking:

1. Find a dataset with a differentially expressed (DE) gene
   - The DE gene should be composed of uniquely mapped reads (reads mapping to only one location). This is so we can spike this gene into a pre-computed counts matrix later on.
   - Optional: Differential expression is validated through a secondary method
2. Extract these uniquely mapped reads for said DE gene to create the following:
   1. Sub-sampled fastq files for testing purposes
   2. Custom reference files (with a custom ref.fa and genes.gtf)

The ref.fa should only contain the sequence for the gene of interest (you can pad it with +/- 10 kb). The GTF file will have to be modified to accommodate the new ref.fa, and it should only contain our gene of interest.
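To make the coordinate bookkeeping concrete, here is a minimal Python sketch of the two steps: cutting a padded window around the gene out of a chromosome sequence, and shifting GTF coordinates so they refer to the new, smaller ref.fa. The function names and the `gene_of_interest` sequence name are hypothetical, and the sketch assumes 1-based inclusive GTF coordinates and a chromosome sequence already loaded as a string.

```python
def padded_window(gene_start, gene_end, chrom_len, pad=10_000):
    """Return a 1-based inclusive (start, end) window around the gene,
    padded by `pad` bases on each side and clipped to the chromosome."""
    start = max(1, gene_start - pad)
    end = min(chrom_len, gene_end + pad)
    return start, end


def extract_region(chrom_seq, start, end):
    """Slice a 1-based inclusive region out of a chromosome sequence."""
    return chrom_seq[start - 1:end]


def rebase_gtf_line(line, window_start, new_seqname):
    """Shift the start/end columns of one GTF record so that
    `window_start` becomes position 1 of the new reference."""
    fields = line.rstrip("\n").split("\t")
    fields[0] = new_seqname
    fields[3] = str(int(fields[3]) - window_start + 1)
    fields[4] = str(int(fields[4]) - window_start + 1)
    return "\t".join(fields)


# Toy example: a gene at 15,000-16,000 on a 100,000 bp chromosome.
start, end = padded_window(15_000, 16_000, 100_000)
record = 'chr1\tHAVANA\tgene\t15000\t16000\t.\t+\t.\tgene_id "X";'
print(start, end)
print(rebase_gtf_line(record, start, "gene_of_interest"))
```

Only records belonging to the gene of interest would be passed through `rebase_gtf_line`; filtering the GTF down to that gene is left out of the sketch.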

Do you have some time to look into this more?

kopardev · 2020-11-07T13:26:59Z

This is a great idea for creating a small dataset for workflow CI. As this is a completely different issue, I am moving it to a separate issue.

skchronicles · 2020-11-10T00:07:03Z

@kopardev
Okay, sounds good. Yes, I just saw the new issue: #3

I just finished creating a new Docker image for kraken2+krona and rewriting and testing the new rule. I was also able to get it to integrate with the latest version of MultiQC.

I will look into integrating S3 resources tomorrow. I was reading through the Snakemake documentation, and it looks pretty straightforward.

kopardev · 2020-11-10T03:29:15Z

I was reading the Snakemake documentation on S3, and it appears that you need to log in to the S3 bucket with AWS credentials. I don't know the best way to authenticate from a pipeline (maybe a service account?). Another option is to convert the S3 bucket into a static website. For example, the fastqc adapters are now available at the following URL:
https://nciccbr.s3.amazonaws.com/Resources/common/fastqc.adapters.gz

To make this happen, I had to:

- edit the permissions of the S3 bucket
- add the following bucket policy

{
    "Version": "2012-10-17",
    "Id": "Policy1604978077308",
    "Statement": [
        {
            "Sid": "Stmt1604978061256",
            "Effect": "Allow",
            "Principal": "*",
            "Action": "s3:GetObject",
            "Resource": "arn:aws:s3:::nciccbr/Resources/*"
        }
    ]
}

Then edit the bucket properties to enable static website hosting. More details about my steps are here

With this in place, we should be able to access the objects in the S3 bucket in read-only mode over HTTP.
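As a minimal sketch of what read-only HTTP access could look like from Python, the helper below rewrites an s3:// URI into the https://BUCKET.s3.amazonaws.com/KEY form used for the fastqc adapters URL above, and downloads it with the standard library. The function names `s3_to_https` and `fetch` are hypothetical, and the sketch assumes the object is publicly readable under a bucket policy like the one shown.

```python
import urllib.request
from urllib.parse import urlparse


def s3_to_https(s3_uri):
    """Convert s3://bucket/key into https://bucket.s3.amazonaws.com/key."""
    parsed = urlparse(s3_uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"not an s3:// URI: {s3_uri!r}")
    return f"https://{parsed.netloc}.s3.amazonaws.com{parsed.path}"


def fetch(s3_uri, dest):
    """Download a publicly readable object over HTTPS, no AWS credentials."""
    urllib.request.urlretrieve(s3_to_https(s3_uri), dest)
    return dest


print(s3_to_https("s3://nciccbr/Resources/common/fastqc.adapters.gz"))
```

Calling `fetch("s3://nciccbr/Resources/common/fastqc.adapters.gz", "fastqc.adapters.gz")` would then save the file locally, provided the object is still public.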

What do you think?

kopardev · 2020-11-10T03:30:50Z

Regarding kraken2: since the visualization is now completely handled by the newer version of MultiQC, do you still need krona?

skchronicles added the enhancement label (New feature or request) on Nov 18, 2020

skchronicles self-assigned this Nov 18, 2020

skchronicles added the rna run label (rna-seek run related task) on Mar 25, 2021
