Skip to content

Commit

Permalink
Add IGDWriter, IGDFile->IGDReader, and add IGDTransformer (#5)
Browse files Browse the repository at this point in the history
* Add IGD writing capability, IGDTransformer
* IGDFile is now IGDReader, and takes an IO buffer instead of a filename. This is more flexible, and removes the need to have a context manager. IGDFile remains, but will be removed soon and should be avoided.
* IGDWriter is new: create IGD files in Python.
* IGDTransformer lets you take an input IGD buffer and modify the sample information, producing a new output IGD buffer.
* Documentation and examples
* Make flake8 happy, fix test dependencies
* Bump version
  • Loading branch information
dcdehaas authored Sep 19, 2024
1 parent 640e7ec commit 0d68f19
Show file tree
Hide file tree
Showing 15 changed files with 777 additions and 111 deletions.
2 changes: 1 addition & 1 deletion .github/workflows/python-package.yml
Original file line number Diff line number Diff line change
Expand Up @@ -27,7 +27,7 @@ jobs:
- name: Install dependencies
run: |
python -m pip install --upgrade pip
python -m pip install flake8 pytest
python -m pip install flake8 pytest BitVector
if [ -f requirements.txt ]; then pip install -r requirements.txt; fi
- name: Lint with flake8
run: |
Expand Down
31 changes: 5 additions & 26 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -23,35 +23,14 @@ pip install --force-reinstall dist/*.whl

## Usage

The `pyigd.IGDFile` class is a context manager, so it is recommended that you use it via the `with` statement. Below is an example script that loads an IGD file, prints out some meta-data, and then iterates the genotype data for all variants.

The `pyigd.IGDReader` class reads IGD data from a buffer. See [the example script](https://github.com/aprilweilab/pyigd/blob/main/examples/igdread.py) that loads an IGD file, prints out some meta-data, and then iterates the genotype data for all variants. Generally the usage pattern is:
```
import pyigd
import sys
if len(sys.argv) < 2:
print("Pass in IGD filename")
exit(1)
with pyigd.IGDFile(sys.argv[1]) as igd_file:
print(f"Version: {igd_file.version}")
print(f"Ploidy: {igd_file.ploidy}")
print(f"Variants: {igd_file.num_variants}")
print(f"Individuals: {igd_file.num_individuals}")
print(f"Source: {igd_file.source}")
print(f"Description: {igd_file.description}")
for variant_index in range(igd_file.num_variants):
# Approach 1: Get the samples as a list
print(f"REF: {igd_file.get_ref_allele(variant_index)}, ALT: {igd_file.get_alt_allele(variant_index)}")
position, is_missing, sample_list = igd_file.get_samples(variant_index)
print( (position, is_missing, len(sample_list)) )
# Approach 2: Get the samples as a BitVector object
# See https://engineering.purdue.edu/kak/dist/BitVector-3.5.0.html
position, is_missing, bitvect = igd_file.get_samples_bv(variant_index)
print( (position, is_missing, bitvect.count_bits()) )
with open(filename, "rb") as f:
igd_reader = pyigd.IGDReader(f)
```

There is also the `pyigd.IGDWriter` class to construct IGD files. Related is `pyigd.IGDTransformer`, which is a way to create a copy of an IGD while modifying its contents. See the IGDTransformer [sample list example](https://github.com/aprilweilab/pyigd/blob/main/examples/xform.py) and [bitvector example](https://github.com/aprilweilab/pyigd/blob/main/examples/xform_bv.py).

IGD can be highly performant for a few reasons:
1. It stores sparse data sparsely. Low-frequency variants are stored as sample lists. Medium/high frequency variants are stored as bit vectors.
2. It is indexable (you can jump directly to data for the `ith` variant). Since the index is stored in its own section of the file, scanning the index is extremely fast. So only looking at variants for a particular range of the genome is very fast (in this case you would use `pyigd.IGDFile.get_position_and_flags()` to find the first variant index within the range, and then use `pyigd.IGDFile.get_samples()` after that).
Expand Down
20 changes: 20 additions & 0 deletions docs/Makefile
Original file line number Diff line number Diff line change
@@ -0,0 +1,20 @@
# Minimal makefile for Sphinx documentation
#

# You can set these variables from the command line, and also
# from the environment for the first two.
SPHINXOPTS ?=
SPHINXBUILD ?= sphinx-build
SOURCEDIR = .
BUILDDIR = _build

# Put it first so that "make" without argument is like "make help".
help:
@$(SPHINXBUILD) -M help "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O)

.PHONY: help Makefile

# Catch-all target: route all unknown targets to Sphinx using the new
# "make mode" option. $(O) is meant as a shortcut for $(SPHINXOPTS).
%: Makefile
@$(SPHINXBUILD) -M $@ "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O)
28 changes: 28 additions & 0 deletions docs/conf.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,28 @@
# Configuration file for the Sphinx documentation builder.
#
# For the full list of built-in configuration values, see the documentation:
# https://www.sphinx-doc.org/en/master/usage/configuration.html

# -- Project information -----------------------------------------------------
# https://www.sphinx-doc.org/en/master/usage/configuration.html#project-information

project = 'pyigd'
copyright = '2024, Drew DeHaas'
author = 'Drew DeHaas'
release = '0.2'

# -- General configuration ---------------------------------------------------
# https://www.sphinx-doc.org/en/master/usage/configuration.html#general-configuration

extensions = ["sphinx.ext.autodoc"]

templates_path = ['_templates']
exclude_patterns = ['_build', 'Thumbs.db', '.DS_Store']



# -- Options for HTML output -------------------------------------------------
# https://www.sphinx-doc.org/en/master/usage/configuration.html#options-for-html-output

html_theme = 'sphinx_rtd_theme'
html_static_path = ['_static']
74 changes: 74 additions & 0 deletions docs/index.rst
Original file line number Diff line number Diff line change
@@ -0,0 +1,74 @@
.. pyigd documentation master file, created by
sphinx-quickstart on Wed Aug 21 14:20:22 2024.
You can adapt this file completely to your liking, but it should at least
contain the root `toctree` directive.
pyigd Documentation
===================

.. toctree::
:maxdepth: 2
:caption: Contents:

pyigd is a Python library for reading and writing Indexable Genotype Data
(IGD) files. `IGD is a binary format <https://github.com/aprilweilab/picovcf/blob/main/IGD.FORMAT.md>`_ that is very simple and uses no
compression, but tends to be quite small and fast.

IGDReader example
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Example of how to read an IGD file. See :py:class:`pyigd.IGDReader` for the relevant API reference.

::

import pyigd
import sys

if len(sys.argv) < 2:
print("Pass in IGD filename")
exit(1)

with open(sys.argv[1], "rb") as f:
igd_file = pyigd.IGDReader(f)
print(f"Version: {igd_file.version}")
print(f"Ploidy: {igd_file.ploidy}")
print(f"Variants: {igd_file.num_variants}")
print(f"Individuals: {igd_file.num_individuals}")
print(f"Source: {igd_file.source}")
print(f"Description: {igd_file.description}")
for variant_index in range(igd_file.num_variants):
print(f"REF: {igd_file.get_ref_allele(variant_index)}, ALT: {igd_file.get_alt_allele(variant_index)}")
position, is_missing, sample_list = igd_file.get_samples(variant_index)
print( (position, is_missing, len(sample_list)) )

IGDWriter example
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Example of how to write an IGD file. See :py:class:`pyigd.IGDWriter` for the relevant API reference. Some of the methods invoked have a required order, and in general all the variants (sample info) needs to be written immediately after the header. The IGDWriter maintains some state, and for large datasets can use a non-trivial amount of RAM to maintain indexes while writing variant data.

::

num_individuals = 100
with open("output.igd", "wb") as f:
w = IGDWriter(f, num_individuals)
w.write_header()
w.write_variant(100, "A", "G", [1, 22, 99, 101])
w.write_variant(101, "A", "C", list(range(200)))
w.write_index()
w.write_variant_info()
w.write_individual_ids([]) # optional
w.write_variant_ids([]) # optional
# We write the header twice because the IGDWriter is updating fields
# on the header as we call the other methods above. For example, it is
# counting the number of variants being written and that gets updated
# in the header the second time we write it.
f.seek(0)
w.write_header()


Indices and tables
==================

* :ref:`genindex`
* :ref:`modindex`
* :ref:`search`
7 changes: 7 additions & 0 deletions docs/modules.rst
Original file line number Diff line number Diff line change
@@ -0,0 +1,7 @@
pyigd
=====

.. toctree::
:maxdepth: 4

pyigd
10 changes: 10 additions & 0 deletions docs/pyigd.rst
Original file line number Diff line number Diff line change
@@ -0,0 +1,10 @@
pyigd package
=============

Module contents
---------------

.. automodule:: pyigd
:members:
:undoc-members:
:show-inheritance:
3 changes: 2 additions & 1 deletion examples/igdread.py
Original file line number Diff line number Diff line change
Expand Up @@ -5,7 +5,8 @@
print("Pass in IGD filename")
exit(1)

with pyigd.IGDFile(sys.argv[1]) as igd_file:
with open(sys.argv[1], "rb") as f:
igd_file = pyigd.IGDReader(f)
print(f"Version: {igd_file.version}")
print(f"Ploidy: {igd_file.ploidy}")
print(f"Variants: {igd_file.num_variants}")
Expand Down
18 changes: 18 additions & 0 deletions examples/xform.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,18 @@
import pyigd
import sys

if len(sys.argv) < 3:
print("Pass in IGD input/output filenames")
exit(1)

# Our own IGDTransformer. This just deletes the first sample from every variant, but
# in general you can change the sample list any way you want. Return None to delete
# the variant entirely.
class MyXformer(pyigd.IGDTransformer):
def modify_samples(self, position, is_missing, samples):
return samples[1:]

with open(sys.argv[1], "rb") as fin, open(sys.argv[2], "wb") as fout:
xformer = MyXformer(fin, fout, use_bitvectors=False)
xformer.transform()
print(f"Copied {sys.argv[1]} to {sys.argv[2]} and deleted the first sample from every variant.")
22 changes: 22 additions & 0 deletions examples/xform_bv.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,22 @@
import pyigd
import sys

if len(sys.argv) < 3:
print("Pass in IGD input/output filenames")
exit(1)

# Our own IGDTransformer. This just deletes the first sample from every variant, but
# in general you can change the sample list any way you want. Return None to delete
# the variant entirely.
class MyXformer(pyigd.IGDTransformer):
def modify_samples(self, position, is_missing, samples):
# BitVector version. The ith element being 1 means the ith sample has the variant.
for i in range(len(samples)):
if samples[i]:
samples[i] = 0
return samples

with open(sys.argv[1], "rb") as fin, open(sys.argv[2], "wb") as fout:
xformer = MyXformer(fin, fout, use_bitvectors=True)
xformer.transform()
print(f"Copied {sys.argv[1]} to {sys.argv[2]} and deleted the first sample from every variant.")
Loading

0 comments on commit 0d68f19

Please sign in to comment.