preparing submission

ialbert · Apr 10, 2024 · 0403878
1 parent deb6897
commit 0403878
Showing 2 changed files with 169 additions and 0 deletions.
diff --git a/docs/paper.bib b/docs/paper.bib
@@ -0,0 +1,121 @@
+@misc{edirect,
+  author = {Kans, J.},
+  title = {Entrez Direct: E-utilities on the Unix Command Line},
+  year = {2013},
+  note = {Updated 2024 Apr 4},
+  url = {https://www.ncbi.nlm.nih.gov/books/NBK179288/},
+  month = {04},
+  day = {23},
+  booktitle = {Entrez Programming Utilities Help},
+  publisher = {National Center for Biotechnology Information (US)},
+  address = {Bethesda (MD)},
+  edition = {2010-}
+}
+
+@article{ffq,
+    author = {Gálvez-Merchán, Ángel and Min, Kyung Hoi (Joseph) and Pachter, Lior and Booeshaghi, A Sina},
+    title = "{Metadata retrieval from sequence databases with ffq}",
+    journal = {Bioinformatics},
+    volume = {39},
+    number = {1},
+    pages = {btac667},
+    year = {2023},
+    month = {01},
+    abstract = "{Several genomic databases host data and metadata for an ever-growing collection of sequence datasets. While these databases have a shared hierarchical structure, there are no tools specifically designed to leverage it for metadata extraction. We present a command-line tool, called ffq, for querying user-generated data and metadata from sequence databases. Given an accession or a paper’s DOI, ffq efficiently fetches metadata and links to raw data in JSON format. ffq’s modularity and simplicity make it extensible to any genomic database exposing its data for programmatic access. ffq is free and open source, and the code can be found here: https://github.com/pachterlab/ffq.}",
+    issn = {1367-4811},
+    doi = {10.1093/bioinformatics/btac667},
+    url = {https://doi.org/10.1093/bioinformatics/btac667},
+    eprint = {https://academic.oup.com/bioinformatics/article-pdf/39/1/btac667/48942763/btac667.pdf},
+}
+
+@article{gget,
+    author = {Luebbert, Laura and Pachter, Lior},
+    title = "{Efficient querying of genomic reference databases with gget}",
+    journal = {Bioinformatics},
+    volume = {39},
+    number = {1},
+    pages = {btac836},
+    year = {2023},
+    month = {01},
+    abstract = "{A recurring challenge in interpreting genomic data is the assessment of results in the context of existing reference databases. With the increasing number of command line and Python users, there is a need for tools implementing automated, easy programmatic access to curated reference information stored in a diverse collection of large, public genomic databases. gget is a free and open-source command line tool and Python package that enables efficient querying of genomic reference databases, such as Ensembl. gget consists of a collection of separate but interoperable modules, each designed to facilitate one type of database querying required for genomic data analysis in a single line of code. The manual and source code are available at https://github.com/pachterlab/gget. Supplementary data are available at Bioinformatics online.}",
+    issn = {1367-4811},
+    doi = {10.1093/bioinformatics/btac836},
+    url = {https://doi.org/10.1093/bioinformatics/btac836},
+    eprint = {https://academic.oup.com/bioinformatics/article-pdf/39/1/btac836/48646674/btac836.pdf},
+}
+
+@article{pysradb,
+doi = {10.12688/f1000research.18676.1},
+url = {https://doi.org/10.12688/f1000research.18676.1},
+year = {2019},
+month = apr,
+publisher = {F1000 (Faculty of 1000 Ltd)},
+volume = {8},
+pages = {532},
+author = {Saket Choudhary},
+title = {pysradb: A {P}ython package to query next-generation sequencing metadata and data from {NCBI} {S}equence {R}ead {A}rchive},
+journal = {F1000Research}
+}
+
+@article{parasail,
+  author = {Daily, Jeff},
+  title = {Parasail: SIMD C library for global, semi-global, and local pairwise sequence alignments},
+  journal = {BMC Bioinformatics},
+  year = {2016},
+  volume = {17},
+  number = {1},
+  pages = {81},
+  date = {2016-02-10},
+  abstract = {Sequence alignment algorithms are a key component of many bioinformatics applications.},
+  issn = {1471-2105},
+  doi = {10.1186/s12859-016-0930-z},
+  url = {https://doi.org/10.1186/s12859-016-0930-z}
+}
+
+@article{entrezpy,
+    author = {Buchmann, Jan P and Holmes, Edward C},
+    title = "{Entrezpy: a Python library to dynamically interact with the NCBI Entrez databases}",
+    journal = {Bioinformatics},
+    volume = {35},
+    number = {21},
+    pages = {4511-4514},
+    year = {2019},
+    month = {05},
+    abstract = "{Entrezpy is a Python library that automates the querying and downloading of data from the Entrez databases at National Center for Biotechnology Information by interacting with E-Utilities. Entrezpy implements complex queries by automatically creating E-Utility parameters from the results obtained that can then be used directly in subsequent queries. Entrezpy also allows the user to cache and retrieve results locally, implements interactions with all Entrez databases as part of an analysis pipeline and adjusts parameters within an ongoing query or using prior results. Entrezpy’s modular design enables it to easily extend and adjust existing E-Utility functions. Entrezpy is implemented in Python 3 (≥3.6) and depends only on the Python Standard Library. It is available via PyPi (https://pypi.org/project/entrezpy/) and at https://gitlab.com/ncbipy/entrezpy.git. Entrezpy is licensed under the LGPLv3 and also at http://entrezpy.readthedocs.io/.}",
+    issn = {1367-4803},
+    doi = {10.1093/bioinformatics/btz385},
+    url = {https://doi.org/10.1093/bioinformatics/btz385},
+    eprint = {https://academic.oup.com/bioinformatics/article-pdf/35/21/4511/50722030/bioinformatics\_35\_21\_4511.pdf},
+}
+
+@article{taxonkit,
+title = {TaxonKit: A practical and efficient NCBI taxonomy toolkit},
+journal = {Journal of Genetics and Genomics},
+volume = {48},
+number = {9},
+pages = {844-850},
+year = {2021},
+note = {Special issue on Microbiome},
+issn = {1673-8527},
+doi = {10.1016/j.jgg.2021.03.006},
+url = {https://www.sciencedirect.com/science/article/pii/S1673852721000837},
+author = {Wei Shen and Hong Ren},
+keywords = {NCBI Taxonomy, TaxonKit, TaxId, Lineage, TaxId changelog},
+abstract = {The National Center for Biotechnology Information (NCBI) Taxonomy is widely applied in biomedical and ecological studies. Typical demands include querying taxonomy identifier (TaxIds) by taxonomy names, querying complete taxonomic lineages by TaxIds, listing descendants of given TaxIds, and others. However, existed tools are either limited in functionalities or inefficient in terms of runtime. In this work, we present TaxonKit, a command-line toolkit for comprehensive and efficient manipulation of NCBI Taxonomy data. TaxonKit comprises seven core subcommands providing functions, including TaxIds querying, listing, filtering, lineage retrieving and reformatting, lowest common ancestor computation, and TaxIds change tracking. The practical functions, competitive processing performance, scalability with different scales of datasets and good accessibility can facilitate taxonomy data manipulations. TaxonKit provides free access under the permissive MIT license on GitHub, Brewsci, and Bioconda. The documents are also available at https://bioinf.shenwei.me/taxonkit/.}
+}
+
+@article{biopython,
+    author = {Cock, Peter J. A. and Antao, Tiago and Chang, Jeffrey T. and Chapman, Brad A. and Cox, Cymon J. and Dalke, Andrew and Friedberg, Iddo and Hamelryck, Thomas and Kauff, Frank and Wilczynski, Bartek and de Hoon, Michiel J. L.},
+    title = "{Biopython: freely available Python tools for computational molecular biology and bioinformatics}",
+    journal = {Bioinformatics},
+    volume = {25},
+    number = {11},
+    pages = {1422-1423},
+    year = {2009},
+    month = {03},
+    abstract = "{Summary: The Biopython project is a mature open source international collaboration of volunteer developers, providing Python libraries for a wide range of bioinformatics problems. Biopython includes modules for reading and writing different sequence file formats and multiple sequence alignments, dealing with 3D macro molecular structures, interacting with common tools such as BLAST, ClustalW and EMBOSS, accessing key online databases, as well as providing numerical methods for statistical learning. Availability: Biopython is freely available, with documentation and source code at www.biopython.org under the Biopython license. Contact: All queries should be directed to the Biopython mailing lists, see www.biopython.org/wiki/\_Mailing\[email protected].}",
+    issn = {1367-4803},
+    doi = {10.1093/bioinformatics/btp163},
+    url = {https://doi.org/10.1093/bioinformatics/btp163},
+    eprint = {https://academic.oup.com/bioinformatics/article-pdf/25/11/1422/48989335/bioinformatics\_25\_11\_1422.pdf},
+}
diff --git a/docs/paper.md b/docs/paper.md
@@ -0,0 +1,48 @@
+---
+title: 'bio: making bioinformatics fun again'
+tags:
+  - Python
+  - biology
+  - bioinformatics
+authors:
+  - name: Istvan Albert
+    orcid: 0000-0001-8366-984X
+    affiliation: "1, 2"
+
+affiliations:
+ - name: Bioinformatics Consulting Center, Pennsylvania State University, United States of America
+   index: 1
+ - name: Department of Biochemistry and Molecular Biology, Pennsylvania State University, United States of America
+   index: 2
+
+date: 30 March 2024
+bibliography: paper.bib
+---
+
+# Summary
+
+Biological data is distributed from a variety of sources and in a variety of formats. The life sciences in general are characterized by distributed oversight, where various databases and resources are maintained by different organizations. The same data may be represented under disparate names and stored in different formats.
+
+As an example, a gene with a common name of ??? may be represented as ...
+
+
+# Statement of need
+
+Anyone who has ever done bioinformatics knows how even seemingly straightforward tasks typically require multiple convoluted steps: scouring the various corners of the internet, reading documentation, and clicking around various websites. Together these can slow down progress immensely. What could be a five-minute task turns into hours or days of work.
+
+The cause of these difficulties is the disconnected nature of data and information: there is no common interface and no unified approach to data access and manipulation.
+
+
+The `bio` package is meant to relieve that tedium. It is a bioinformatics toy to play with. Like LEGO pieces that match one another, `bio` aims to provide commands that naturally fit together and let users express their intent with short, explicit, and simple commands.
+
+The `bio` package is a Python package that simplifies access to and manipulation of biological data. It provides a set of commands to fetch, manipulate, and analyze biological data, and is designed to be easy to use and flexible so that users can perform a wide range of tasks with minimal effort. The intended audience is students, educators, and researchers in bioinformatics. The package is open source and freely available from the [GitHub repository][bio-src]. Detailed documentation is available at [bioinfo.help][bio-docs].
+
+
+[bio-src]: https://github.com/ialbert/bio
+[bio-docs]: https://www.bioinfo.help/
+
+# Acknowledgments
+
+We acknowledge support from the Huck Institutes of the Life Sciences at the Pennsylvania State University.
+
+# References