Using s3 bucket for resources #2

Open
kopardev opened this issue Nov 5, 2020 · 5 comments

Comments

@kopardev
Collaborator

kopardev commented Nov 5, 2020

Can the resources be hosted in S3 buckets? For example, for hg38 and GENCODE release 30, can we use:

s3://nciccbr/Resources/hg38/gencode_release30/hg38.rRNA_interval_list.gz
s3://nciccbr/Resources/hg38/gencode_release30/qualimap_info.txt.gz
s3://nciccbr/Resources/hg38/gencode_release30/annotate.genes.txt.gz
s3://nciccbr/Resources/hg38/gencode_release30/karyoplot_gene_coordinates.txt.gz
s3://nciccbr/Resources/hg38/gencode_release30/genes.ref.bed.gz
s3://nciccbr/Resources/hg38/gencode_release30/karyobeds.tar.gz
s3://nciccbr/Resources/hg38/gencode_release30/geneinfo.bed.gz
s3://nciccbr/Resources/hg38/gencode_release30/annotate.isoforms.txt.gz
s3://nciccbr/Resources/hg38/gencode_release30/refFlat.txt.gz
s3://nciccbr/Resources/hg38/gencode_release30/gencode.v30.annotation.gtf.gz
s3://nciccbr/Resources/hg38/gencode_release30/rsemref.tar.gz
s3://nciccbr/Resources/hg38/hg38.fa.gz
s3://nciccbr/Resources/common/TruSeq_and_nextera_adapters.ngsqc.dat.gz
s3://nciccbr/Resources/common/fastq_screen.conf.gz
s3://nciccbr/Resources/common/fastqc.adapters.gz
s3://nciccbr/Resources/common/TruSeq_and_nextera_adapters_new.fa.gz
s3://nciccbr/Resources/common/adapters2.fa.gz

I have already uploaded all resources for hg38 (GENCODE release 30), except the STAR indices. If we provide the GTF on the fly, we only need the noGTF version of the STAR index, which is independent of the release version. All files on the S3 bucket are gzipped, and folders are stored as tar.gz archives (e.g. rsemref.tar.gz).
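
Since every file is gzipped and every folder is a tar.gz archive, a pipeline would need an unpacking step after download. A minimal sketch of that step follows, using only the Python standard library. The function name `unpack_resource` and the destination layout are hypothetical and not part of the pipeline.

```python
import gzip
import shutil
import tarfile
from pathlib import Path


def unpack_resource(archive: Path, dest_dir: Path) -> Path:
    """Unpack one downloaded resource into dest_dir.

    Returns the decompressed file for a plain *.gz resource, or dest_dir
    itself for a *.tar.gz archive (the archive decides what it creates there).
    """
    dest_dir.mkdir(parents=True, exist_ok=True)
    name = archive.name
    if name.endswith(".tar.gz"):
        # Folders such as rsemref.tar.gz; only extract archives you trust.
        with tarfile.open(archive, "r:gz") as tar:
            tar.extractall(dest_dir)
        return dest_dir
    if name.endswith(".gz"):
        # Single files such as refFlat.txt.gz -> refFlat.txt
        target = dest_dir / name[: -len(".gz")]
        with gzip.open(archive, "rb") as src, open(target, "wb") as dst:
            shutil.copyfileobj(src, dst)
        return target
    raise ValueError(f"unexpected resource name: {name}")
```

The `.tar.gz` check has to come first, because those names also end in `.gz`.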

@skchronicles
Owner

@kopardev
Yes, I will look into it.

I was also thinking we should create a custom set of references for the CI workflow.

Here is what I am thinking:

  1. Find a dataset with a differentially expressed (DE) gene
    • The DE gene should be covered by uniquely mapped reads (reads that map to only one location), so that we can later spike this gene into a pre-computed counts matrix.
    • Optional: the differential expression is validated through a secondary method
  2. Extract the uniquely mapped reads for that DE gene to create the following:
    1. Sub-sampled FASTQ files for testing purposes
    2. Custom reference files (with a custom ref.fa and genes.gtf)
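
One way to pick out the uniquely mapped reads in step 2 is sketched below. It works on plain SAM text and assumes the aligner writes the `NH` tag (number of reported alignments), as STAR does with its default attributes. The function names are hypothetical, and restricting the reads to the gene's coordinates is left out.

```python
from typing import Iterable, Iterator


def is_uniquely_mapped(sam_line: str) -> bool:
    """Return True if a SAM alignment record reports exactly one hit (NH:i:1)."""
    fields = sam_line.rstrip("\n").split("\t")
    if len(fields) < 11:
        return False  # not a complete alignment record
    flag = int(fields[1])
    if flag & 0x4:
        return False  # read is unmapped
    # Optional tags start at column 12; exact match so NH:i:10 is not accepted.
    return "NH:i:1" in fields[11:]


def unique_reads(sam_lines: Iterable[str]) -> Iterator[str]:
    """Yield header lines unchanged and keep only uniquely mapped alignments."""
    for line in sam_lines:
        if line.startswith("@") or is_uniquely_mapped(line):
            yield line
```

The surviving read names could then be used to pull the matching records out of the original FASTQ files.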

The ref.fa should contain only the sequence for the gene of interest (you can pad it by +/- 10 kb). The GTF file will have to be modified to match the new ref.fa, and it should contain only our gene of interest.
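
The coordinate bookkeeping for such a custom reference can be sketched as follows. This is only an illustration under stated assumptions: the chromosome sequence is already loaded as a string, coordinates are 1-based and inclusive as in GTF, and the helper names and the new contig name are hypothetical.

```python
from typing import Iterable, List, Tuple


def padded_window(start: int, end: int, chrom_len: int, pad: int = 10_000) -> Tuple[int, int]:
    """Return a 1-based inclusive window around [start, end], clipped to the chromosome."""
    return max(1, start - pad), min(chrom_len, end + pad)


def extract_sequence(chrom_seq: str, win_start: int, win_end: int) -> str:
    """Slice the 1-based inclusive window out of a chromosome sequence."""
    return chrom_seq[win_start - 1 : win_end]


def shift_gtf(
    gtf_lines: Iterable[str], gene_id: str, chrom: str, win_start: int, new_name: str
) -> List[str]:
    """Keep only records of one gene and shift them into window coordinates."""
    out = []
    for line in gtf_lines:
        if line.startswith("#"):
            continue  # skip GTF header comments
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9 or cols[0] != chrom:
            continue
        if f'gene_id "{gene_id}"' not in cols[8]:
            continue
        cols[0] = new_name  # contig name used in the custom ref.fa
        cols[3] = str(int(cols[3]) - win_start + 1)
        cols[4] = str(int(cols[4]) - win_start + 1)
        out.append("\t".join(cols))
    return out
```

For example, a gene at positions 51 to 53 with 10 bases of padding gives the window 41 to 63, and the gene lands at positions 11 to 13 of the new reference.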

Do you have some time to look into this more?

@kopardev
Collaborator Author

kopardev commented Nov 7, 2020

This is a great idea for creating a small dataset for workflow CI. As this is a completely different issue, I am moving it to a separate issue.

@skchronicles
Owner

@kopardev
Okay, sounds good. Yes, I just saw the new issue: #3

I just finished creating a new docker image for kraken2+krona and re-writing/testing the new rule. I was also able to get it to integrate with the latest version of MultiQC:
image

I will look into integrating S3 resources tomorrow. I was reading through Snakemake's documentation, and it looks pretty straightforward.

@kopardev
Collaborator Author

I was reading the Snakemake documentation on S3, and it appears that you need to log in to the S3 bucket with AWS credentials. I don't know the best way to authenticate from within a pipeline (maybe a service account?). Another option is to convert the S3 bucket into a static website. For example, the fastqc adapters are now available at the following URL:
https://nciccbr.s3.amazonaws.com/Resources/common/fastqc.adapters.gz
To make this happen, I had to:

  • edit the permissions of the s3 bucket
  • add the following bucket policy
{
    "Version": "2012-10-17",
    "Id": "Policy1604978077308",
    "Statement": [
        {
            "Sid": "Stmt1604978061256",
            "Effect": "Allow",
            "Principal": "*",
            "Action": "s3:GetObject",
            "Resource": "arn:aws:s3:::nciccbr/Resources/*"
        }
    ]
}
  • then edit bucket properties to enable static website hosting. More details about my steps are here
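
If the same read-only policy is ever needed for another bucket or prefix, it could be generated rather than edited by hand. The sketch below builds an equivalent policy document; the `Sid` value and both function names are hypothetical. Applying it with boto3's `put_bucket_policy` is an assumption about tooling (it requires the third-party boto3 package and AWS credentials) and is not what was done above, where the policy was added manually.

```python
import json


def public_read_policy(bucket: str, prefix: str) -> dict:
    """Build a bucket policy allowing anonymous s3:GetObject under one prefix."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "PublicReadResources",  # hypothetical statement id
                "Effect": "Allow",
                "Principal": "*",
                "Action": "s3:GetObject",
                "Resource": f"arn:aws:s3:::{bucket}/{prefix.strip('/')}/*",
            }
        ],
    }


def apply_policy(bucket: str, policy: dict) -> None:
    """Attach the policy to the bucket (needs boto3 and AWS credentials)."""
    import boto3  # third-party; imported lazily so the builder works without it

    boto3.client("s3").put_bucket_policy(Bucket=bucket, Policy=json.dumps(policy))
```

Calling `public_read_policy("nciccbr", "Resources")` yields the same `Resource` ARN as the policy shown above.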

Now, we can possibly use this to access the files/objects in the s3 bucket in read-only mode via HTTP.
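
As a rough sketch of what that read-only HTTP access might look like from a script, the snippet below maps an `s3://` URI to the same URL pattern as the fastqc adapters link above and downloads the object without credentials. It assumes the bucket policy above is in place and that every object is reachable under that URL pattern; the function names are hypothetical.

```python
import shutil
import urllib.request
from pathlib import Path
from urllib.parse import quote, urlparse


def s3_uri_to_https(uri: str) -> str:
    """Map s3://bucket/key to https://bucket.s3.amazonaws.com/key."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"not an s3:// URI: {uri}")
    key = quote(parsed.path.lstrip("/"))  # keeps "/" and escapes unsafe characters
    return f"https://{parsed.netloc}.s3.amazonaws.com/{key}"


def fetch(uri: str, dest_dir: Path) -> Path:
    """Download one public object over HTTPS into dest_dir and return its path."""
    url = s3_uri_to_https(uri)
    dest_dir.mkdir(parents=True, exist_ok=True)
    target = dest_dir / Path(urlparse(url).path).name
    with urllib.request.urlopen(url) as resp, open(target, "wb") as out:
        shutil.copyfileobj(resp, out)
    return target
```

For instance, `s3_uri_to_https("s3://nciccbr/Resources/common/fastqc.adapters.gz")` returns the URL quoted above.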

What do you think?

@kopardev
Collaborator Author

Regarding kraken2: since the visualization is now handled entirely by the newer version of MultiQC, do you still need krona?

@skchronicles skchronicles added the enhancement New feature or request label Nov 18, 2020
@skchronicles skchronicles self-assigned this Nov 18, 2020
@skchronicles skchronicles added the rna run rna-seek run related task label Mar 25, 2021