Add IGDWriter, IGDFile->IGDReader, and add IGDTransformer (#5)

* Add IGD writing capability, IGDTransformer * IGDFile is now IGDReader, and takes an IO buffer instead of a filename. This is more flexible, and removes the need to have a context manager. IGDFile remains, but will be removed soon and should be avoided. * IGDWriter is new: create IGD files in Python. * IGDTransformer lets you take an input IGD buffer and modify the sample information, producing a new output IGD buffer. * Documentation and examples * Make flake8 happy, fix test dependencies * Bump version
aprilweilab · Sep 19, 2024 · 0d68f19 · 0d68f19
1 parent 640e7ec
commit 0d68f19
Show file tree

Hide file tree

Showing 15 changed files with 777 additions and 111 deletions.
diff --git a/.github/workflows/python-package.yml b/.github/workflows/python-package.yml
@@ -27,7 +27,7 @@ jobs:
     - name: Install dependencies
       run: |
         python -m pip install --upgrade pip
-        python -m pip install flake8 pytest
+        python -m pip install flake8 pytest BitVector
         if [ -f requirements.txt ]; then pip install -r requirements.txt; fi
     - name: Lint with flake8
       run: |

diff --git a/README.md b/README.md
@@ -23,35 +23,14 @@ pip install --force-reinstall dist/*.whl
 
 ## Usage
 
-The `pyigd.IGDFile` class is a context manager, so it is recommended that you use it via the `with` statement. Below is an example script that loads an IGD file, prints out some meta-data, and then iterates the genotype data for all variants.
-
+The `pyigd.IGDReader` class reads IGD data from a buffer. See [the example script](https://github.com/aprilweilab/pyigd/blob/main/examples/igdread.py) that loads an IGD file, prints out some meta-data, and then iterates the genotype data for all variants. Generally the usage pattern is:
 ```
-import pyigd
-import sys
-
-if len(sys.argv) < 2:
-    print("Pass in IGD filename")
-    exit(1)
-
-with pyigd.IGDFile(sys.argv[1]) as igd_file:
-    print(f"Version: {igd_file.version}")
-    print(f"Ploidy: {igd_file.ploidy}")
-    print(f"Variants: {igd_file.num_variants}")
-    print(f"Individuals: {igd_file.num_individuals}")
-    print(f"Source: {igd_file.source}")
-    print(f"Description: {igd_file.description}")
-    for variant_index in range(igd_file.num_variants):
-        # Approach 1: Get the samples as a list
-        print(f"REF: {igd_file.get_ref_allele(variant_index)}, ALT: {igd_file.get_alt_allele(variant_index)}")
-        position, is_missing, sample_list = igd_file.get_samples(variant_index)
-        print( (position, is_missing, len(sample_list))  )
-
-        # Approach 2: Get the samples as a BitVector object
-        # See https://engineering.purdue.edu/kak/dist/BitVector-3.5.0.html
-        position, is_missing, bitvect = igd_file.get_samples_bv(variant_index)
-        print( (position, is_missing, bitvect.count_bits())  )
+with open(filename, "rb") as f:
+  igd_reader = pyigd.IGDReader(f)
 ```
 
+There is also the `pyigd.IGDWriter` class to construct IGD files. Related is `pyigd.IGDTransformer`, which is a way to create a copy of an IGD while modifying its contents. See the IGDTransformer [sample list example](https://github.com/aprilweilab/pyigd/blob/main/examples/xform.py) and [bitvector example](https://github.com/aprilweilab/pyigd/blob/main/examples/xform_bv.py).
+
 IGD can be highly performant for a few reasons:
 1. It stores sparse data sparsely. Low-frequency variants are stored as sample lists. Medium/high frequency variants are stored as bit vectors.
 2. It is indexable (you can jump directly to data for the `ith` variant). Since the index is stored in its own section of the file, scanning the index is extremely fast. So only looking at variants for a particular range of the genome is very fast (in this case you would use `pyigd.IGDFile.get_position_and_flags()` to find the first variant index within the range, and then use `pyigd.IGDFile.get_samples()` after that).

diff --git a/docs/Makefile b/docs/Makefile
@@ -0,0 +1,20 @@
+# Minimal makefile for Sphinx documentation
+#
+
+# You can set these variables from the command line, and also
+# from the environment for the first two.
+SPHINXOPTS    ?=
+SPHINXBUILD   ?= sphinx-build
+SOURCEDIR     = .
+BUILDDIR      = _build
+
+# Put it first so that "make" without argument is like "make help".
+help:
+	@$(SPHINXBUILD) -M help "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O)
+
+.PHONY: help Makefile
+
+# Catch-all target: route all unknown targets to Sphinx using the new
+# "make mode" option.  $(O) is meant as a shortcut for $(SPHINXOPTS).
+%: Makefile
+	@$(SPHINXBUILD) -M $@ "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O)
diff --git a/docs/conf.py b/docs/conf.py
@@ -0,0 +1,28 @@
+# Configuration file for the Sphinx documentation builder.
+#
+# For the full list of built-in configuration values, see the documentation:
+# https://www.sphinx-doc.org/en/master/usage/configuration.html
+
+# -- Project information -----------------------------------------------------
+# https://www.sphinx-doc.org/en/master/usage/configuration.html#project-information
+
+project = 'pyigd'
+copyright = '2024, Drew DeHaas'
+author = 'Drew DeHaas'
+release = '0.2'
+
+# -- General configuration ---------------------------------------------------
+# https://www.sphinx-doc.org/en/master/usage/configuration.html#general-configuration
+
+extensions = ["sphinx.ext.autodoc"]
+
+templates_path = ['_templates']
+exclude_patterns = ['_build', 'Thumbs.db', '.DS_Store']
+
+
+
+# -- Options for HTML output -------------------------------------------------
+# https://www.sphinx-doc.org/en/master/usage/configuration.html#options-for-html-output
+
+html_theme = 'sphinx_rtd_theme'
+html_static_path = ['_static']
diff --git a/docs/index.rst b/docs/index.rst
@@ -0,0 +1,74 @@
+.. pyigd documentation master file, created by
+   sphinx-quickstart on Wed Aug 21 14:20:22 2024.
+   You can adapt this file completely to your liking, but it should at least
+   contain the root `toctree` directive.
+
+pyigd Documentation
+===================
+
+.. toctree::
+   :maxdepth: 2
+   :caption: Contents:
+
+pyigd is a Python library for reading and writing Indexable Genotype Data
+(IGD) files. `IGD is a binary format <https://github.com/aprilweilab/picovcf/blob/main/IGD.FORMAT.md>`_ that is very simple and uses no
+compression, but tends to be quite small and fast.
+
+IGDReader example
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Example of how to read an IGD file. See :py:class:`pyigd.IGDReader` for the relevant API reference.
+
+::
+
+  import pyigd
+  import sys
+
+  if len(sys.argv) < 2:
+      print("Pass in IGD filename")
+      exit(1)
+
+  with open(sys.argv[1], "rb") as f:
+      igd_file = pyigd.IGDReader(f)
+      print(f"Version: {igd_file.version}")
+      print(f"Ploidy: {igd_file.ploidy}")
+      print(f"Variants: {igd_file.num_variants}")
+      print(f"Individuals: {igd_file.num_individuals}")
+      print(f"Source: {igd_file.source}")
+      print(f"Description: {igd_file.description}")
+      for variant_index in range(igd_file.num_variants):
+          print(f"REF: {igd_file.get_ref_allele(variant_index)}, ALT: {igd_file.get_alt_allele(variant_index)}")
+          position, is_missing, sample_list = igd_file.get_samples(variant_index)
+          print( (position, is_missing, len(sample_list))  )
+
+IGDWriter example
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Example of how to write an IGD file. See :py:class:`pyigd.IGDWriter` for the relevant API reference. Some of the methods invoked have a required order, and in general all the variants (sample info) needs to be written immediately after the header. The IGDWriter maintains some state, and for large datasets can use a non-trivial amount of RAM to maintain indexes while writing variant data.
+
+::
+
+  num_individuals = 100
+  with open("output.igd", "wb") as f:
+      w = IGDWriter(f, num_individuals)
+      w.write_header()
+      w.write_variant(100, "A", "G", [1, 22, 99, 101])
+      w.write_variant(101, "A", "C", list(range(200)))
+      w.write_index()
+      w.write_variant_info()
+      w.write_individual_ids([]) # optional
+      w.write_variant_ids([])    # optional
+      # We write the header twice because the IGDWriter is updating fields
+      # on the header as we call the other methods above. For example, it is
+      # counting the number of variants being written and that gets updated
+      # in the header the second time we write it.
+      f.seek(0)
+      w.write_header()
+
+
+Indices and tables
+==================
+
+* :ref:`genindex`
+* :ref:`modindex`
+* :ref:`search`
diff --git a/docs/modules.rst b/docs/modules.rst
@@ -0,0 +1,7 @@
+pyigd
+=====
+
+.. toctree::
+   :maxdepth: 4
+
+   pyigd
diff --git a/docs/pyigd.rst b/docs/pyigd.rst
@@ -0,0 +1,10 @@
+pyigd package
+=============
+
+Module contents
+---------------
+
+.. automodule:: pyigd
+   :members:
+   :undoc-members:
+   :show-inheritance:
diff --git a/examples/igdread.py b/examples/igdread.py
@@ -5,7 +5,8 @@
     print("Pass in IGD filename")
     exit(1)
 
-with pyigd.IGDFile(sys.argv[1]) as igd_file:
+with open(sys.argv[1], "rb") as f:
+    igd_file = pyigd.IGDReader(f)
     print(f"Version: {igd_file.version}")
     print(f"Ploidy: {igd_file.ploidy}")
     print(f"Variants: {igd_file.num_variants}")

diff --git a/examples/xform.py b/examples/xform.py
@@ -0,0 +1,18 @@
+import pyigd
+import sys
+
+if len(sys.argv) < 3:
+    print("Pass in IGD input/output filenames")
+    exit(1)
+
+# Our own IGDTransformer. This just deletes the first sample from every variant, but
+# in general you can change the sample list any way you want. Return None to delete
+# the variant entirely.
+class MyXformer(pyigd.IGDTransformer):
+    def modify_samples(self, position, is_missing, samples):
+        return samples[1:]
+
+with open(sys.argv[1], "rb") as fin, open(sys.argv[2], "wb") as fout:
+    xformer = MyXformer(fin, fout, use_bitvectors=False)
+    xformer.transform()
+print(f"Copied {sys.argv[1]} to {sys.argv[2]} and deleted the first sample from every variant.")
diff --git a/examples/xform_bv.py b/examples/xform_bv.py
@@ -0,0 +1,22 @@
+import pyigd
+import sys
+
+if len(sys.argv) < 3:
+    print("Pass in IGD input/output filenames")
+    exit(1)
+
+# Our own IGDTransformer. This just deletes the first sample from every variant, but
+# in general you can change the sample list any way you want. Return None to delete
+# the variant entirely.
+class MyXformer(pyigd.IGDTransformer):
+    def modify_samples(self, position, is_missing, samples):
+        # BitVector version. The ith element being 1 means the ith sample has the variant.
+        for i in range(len(samples)):
+            if samples[i]:
+                samples[i] = 0
+        return samples
+
+with open(sys.argv[1], "rb") as fin, open(sys.argv[2], "wb") as fout:
+    xformer = MyXformer(fin, fout, use_bitvectors=True)
+    xformer.transform()
+print(f"Copied {sys.argv[1]} to {sys.argv[2]} and deleted the first sample from every variant.")