Fix typos in docs #160

Merged
merged 1 commit into from Oct 25, 2023
2 changes: 1 addition & 1 deletion docs/source/configuration.rst
@@ -1,7 +1,7 @@
Configuration Reference
=======================

- SqlSynthGen is configured using a YAML file, which is passed to several commands with the ``--config`` option.
+ SqlSynthGen is configured using a YAML file, which is passed to several commands with the ``--config-file`` option.
Throughout the docs, we will refer to this file as ``config.yaml`` but it can be called anything (the exception being that there will be a naming conflict if you have a vocabulary table called ``config``).
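
For illustration, the option is passed the same way to each command (the commands below appear elsewhere in these docs; ``config.yaml`` is just the conventional filename):

```console
$ sqlsynthgen make-generators --config-file config.yaml
$ sqlsynthgen create-data --config-file config.yaml
```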

Below, we see the schema for the configuration file.
4 changes: 2 additions & 2 deletions docs/source/faq.rst
@@ -5,14 +5,14 @@ Can SqlSynthGen work with two different schemas?
************************************************

SqlSynthGen can only work with a single source schema and a single destination schema at a time.
- However, you can choose for the destination schema to have a different name to the source schema by setting the DST_SCHEMA environment variable.
+ However, you can choose for the destination schema to have a different name to the source schema by setting the ``DST_SCHEMA`` environment variable.
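
A minimal sketch of that workflow (the variable name is from the FAQ above; the schema name ``synthetic`` is made up for illustration):

```shell
# Assumption: DST_SCHEMA is read by SqlSynthGen commands at run time,
# e.g. a subsequent `sqlsynthgen create-data` would write to "synthetic".
export DST_SCHEMA=synthetic
echo "$DST_SCHEMA"
```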

Which DBMSs does SqlSynthGen support?
*************************************

* SqlSynthGen most fully supports **PostgreSQL**, which it uses for its end-to-end functional tests.
* SqlSynthGen also supports **MariaDB**, as long as you don't set ``use-asyncio: true`` in your config.
- * SqlSynthGen *might*, work with **SQLite** but this is largely untested.
+ * SqlSynthGen *might* work with **SQLite** but this is largely untested.
* SqlSynthGen may also work with SQL Server.
To connect to SQL Server, you will need to install `pyodbc <https://pypi.org/project/pyodbc/>`_ and an `ODBC driver <https://learn.microsoft.com/en-us/sql/connect/odbc/download-odbc-driver-for-sql-server?view=sql-server-ver16>`_, after which you should be able to use a DSN setting similar to ``SRC_DSN="mssql+pyodbc://username:password@hostname/dbname?driver=ODBC Driver 18 for SQL Server"``.

16 changes: 8 additions & 8 deletions docs/source/health_data.rst
@@ -13,10 +13,10 @@ The full configuration we wrote for the CCHIC data set is available `here <https

Before getting into the config itself, we need to discuss a few peculiarities of the OMOP CDM that need to be taken into account:

- 1. Some versions of OMOP contain a circular foreign key, for instance between the `vocabulary`, `concept`, and `domain` tables.
- 2. There are several standardized vocabulary tables (`concept`, `concept_relationship`, etc).
+ 1. Some versions of OMOP contain a circular foreign key, for instance between the ``vocabulary``, ``concept``, and ``domain`` tables.
+ 2. There are several standardized vocabulary tables (``concept``, ``concept_relationship``, etc).
These should be marked as such in the sqlsynthgen config file.
- The tables will be exported to ``.yaml`` files during the ``make-tables`` step.
+ The tables will be exported to ``.yaml`` files during the ``make-generators`` step.
However, some of these vocabulary tables may be too large to practically be writable to ``.yaml`` files, and will need to be dealt with manually.
You should also check the license agreement of each standardized vocabulary before sharing any of the ``.yaml`` files.
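
As a sketch, marking these tables in ``config.yaml`` might look like the following (the ``vocabulary_table`` key follows the pattern used for vocabulary tables elsewhere in the SSG docs; check the configuration reference for the exact schema):

```yaml
tables:
  concept:
    vocabulary_table: true
  concept_relationship:
    vocabulary_table: true
  vocabulary:
    vocabulary_table: true
```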

@@ -195,7 +195,7 @@ Here is our config for the person table:
columns_assigned: care_site_id

``num_rows_per_pass`` is set to 0, because all rows are generated by the story generator.
- Let's use the gender columns as an emxample.
+ Let's use the gender columns as an example.
Here is the relevant function from ``row_generators.py``.

.. code-block:: python
@@ -355,8 +355,8 @@ You can find examples of this in the `full configuration <https://github.com/ala
After creating a person, ``patient_story`` possibly creates an entry in the ``death`` table, and then one in ``visit_occurrence``.
The configurations and generators for these aren't very interesting; their main point is to make the chronology and time scales make sense, so that people born a long time ago are more likely to have died, and the order of birth, visit start, visit end, and possible death is correct.

- After that the story generates a set of rows for tables like `observation`, `measurement`, `condition_occurrence`, etc., the ones that involve procedures and events that took place during the hospital stay.
- The procedure is very similar for each one of these, we'll discuss `measurement` as an example.
+ After that the story generates a set of rows for tables like ``observation``, ``measurement``, ``condition_occurrence``, etc., the ones that involve procedures and events that took place during the hospital stay.
+ The procedure is very similar for each one of these; we'll discuss ``measurement`` as an example.

The first stop is the ``avg_measurements_per_hour`` src-stats query, which looks like this

@@ -394,11 +394,11 @@ The first stop is the ``avg_measurements_per_hour`` src-stats query, which looks
upper: 100

Note how the ``query`` part, which is executed on the database server, tries to do as much of the work as possible:
- It extracts the number of `measurement` entries, divided by the length of the hospital stay, for each person.
+ It extracts the number of ``measurement`` entries, divided by the length of the hospital stay, for each person.
The ``dp-query`` then only computes the average.
This is both to circumvent the limitations of SNSQL, which can't, for instance, do subqueries or differences between columns, and also to minimise the data transferred to and work done on the local machine running SSG.
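
The shape of such a server-side query is roughly as follows. This is only a sketch to convey the division of labour; the real query is in the linked example repository, and the join and column names here are assumptions based on the OMOP CDM:

```sql
-- Sketch only: per-person measurement rate, computed entirely on the server,
-- leaving just the averaging to the local dp-query.
SELECT m.person_id,
       COUNT(*)
         / (EXTRACT(EPOCH FROM (v.visit_end_datetime - v.visit_start_datetime)) / 3600)
         AS measurements_per_hour
FROM measurement AS m
JOIN visit_occurrence AS v ON v.person_id = m.person_id
GROUP BY m.person_id, v.visit_start_datetime, v.visit_end_datetime;
```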

- Based on that information, we generate a set of times, roughly at the right frequency, at which a `measurement` entry should generated for our synthetic patient.
+ Based on that information, we generate a set of times, roughly at the right frequency, at which a ``measurement`` entry should be generated for our synthetic patient.
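
One simple way to produce such times (a sketch of the idea only; SSG's real generator lives in the linked example repository, and the function name and signature here are invented) is to sample a Poisson process at the measured average rate:

```python
import random


def measurement_times(stay_length_hours: float, avg_per_hour: float,
                      seed: int = 0) -> list[float]:
    """Sample measurement times (hours since admission) as a Poisson process.

    Illustrative sketch: exponential inter-arrival times yield events at
    roughly ``avg_per_hour`` per hour over the hospital stay.
    """
    rng = random.Random(seed)
    times: list[float] = []
    t = 0.0
    while True:
        t += rng.expovariate(avg_per_hour)
        if t >= stay_length_hours:
            return times
        times.append(t)
```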
The relevant `src-stats queries <https://github.com/alan-turing-institute/sqlsynthgen/blob/main/examples/cchic_omop/>`_ for this are

* ``count_measurements``, which counts the relative frequencies of various types of measurements, like blood pressure, pulse taking, different lab results, etc.
10 changes: 5 additions & 5 deletions docs/source/introduction.rst
@@ -106,7 +106,7 @@ Now when we run ``create-data`` we get valid, if not very sensible, values in ea
- 485
- 534

- SSG’s default generators have minimal fidelity: All data is generated based purely on the datatype of the its column, e.g. random strings in string columns.
+ SSG’s default generators have minimal fidelity: All data is generated based purely on the datatype of the column, e.g. random strings in string columns.
Foreign key relations are respected by picking random rows from the table referenced.
Even this synthetic data, nearly the crudest imaginable, can be useful, for instance for testing software pipelines.
Note that this data has no privacy implications, since it is only based on the schema.
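
A minimal illustration of the idea, not SSG's actual code: a default generator keyed purely on the column's datatype might look like this (the function and type names are invented for the sketch):

```python
import random
import string


def default_value(sql_type: str) -> object:
    """Generate a value from the column datatype alone, as SSG's default
    generators do conceptually. Sketch only; type names are illustrative."""
    if sql_type == "INTEGER":
        return random.randint(0, 1_000_000)
    if sql_type == "VARCHAR":
        # Random string for string columns.
        return "".join(random.choices(string.ascii_letters, k=8))
    if sql_type == "BOOLEAN":
        return random.choice([True, False])
    raise NotImplementedError(sql_type)
```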
@@ -121,7 +121,7 @@ This should of course only be done for tables that hold no privacy-sensitive dat
For instance, in the AirBnB dataset, the ``users`` table has a foreign key reference to a table of world countries: ``users.country_destination`` references the ``countries.country_destination`` primary key column.
Since the ``countries`` table doesn’t contain personal data, we can make it a vocabulary table.

- Besides manual edition, on SSG we can also customise the generation of ``ssg.py`` via a YAML file,
+ Besides manually editing it, we can also customise the generation of ``ssg.py`` via a YAML file,
typically named ``config.yaml``.
We identify ``countries`` as a vocabulary table in our ``config.yaml`` file:

Expand Down Expand Up @@ -164,7 +164,7 @@ We need to truncate any tables in our destination database before importing the
$ sqlsynthgen remove-data --config-file config.yaml
$ sqlsynthgen create-vocab

- Since ``make-generators`` rewrote ``ssg.py``, we must now re-edit it to add the primary key ``VARCHAR`` workaroundsfor the ``users`` and ``age_gender_bkts`` tables, as we did in section above.
+ Since ``make-generators`` rewrote ``ssg.py``, we must now re-edit it to add the primary key ``VARCHAR`` workarounds for the ``users`` and ``age_gender_bkts`` tables, as we did in the section above.
Once this is done, we can generate random data for the other three tables with::

$ sqlsynthgen create-data
@@ -293,7 +293,7 @@ Then, we tell SSG to import our custom ``airbnb_generators.py`` and assign the r
columns_assigned: ["date_account_created", "date_first_booking"]

Note how we pass the ``generic`` object as a keyword argument to ``user_dates_provider``.
- Row generators can have positional arguments specified as a list under the ``args`` list and keyword arguments as a dictionary under the ``kwargs`` entry.
+ Row generators can have positional arguments specified as a list under the ``args`` entry and keyword arguments as a dictionary under the ``kwargs`` entry.
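
Sketched in context, such an entry might look like this (the per-table nesting is assumed from the SSG config layout; check the full example configuration for the exact schema):

```yaml
tables:
  users:
    row_generators:
      - name: airbnb_generators.user_dates_provider
        kwargs:
          generic: generic   # keyword argument passed to the provider
        columns_assigned: ["date_account_created", "date_first_booking"]
```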

Limitations of this approach to increasing fidelity are that rows cannot be correlated with other rows in the same table, nor with any rows in other tables, except for trivially fulfilling foreign key constraints as in the default configuration.
We will see how to address this later when we talk about :ref:`story generators <story-generators>`.
@@ -537,7 +537,7 @@ For instance, it may first yield a row specifying a person in the ``users`` tabl
Three features make story generators more practical than simply manually writing code that creates the synthetic data bit-by-bit:

1. When a story generator yields a row, it can choose to only specify values for some of the columns. The values for the other columns will be filled by custom row generators (as explained in a previous section) or, if none are specified, by SSG's default generators. Above, we have chosen to specify the value for ``first_device_type`` but the date columns will still be handled by our ``user_dates_provider`` and the age column will still be populated by the ``user_age_provider``.
- 2. Any default values that are set when the rows yielded by the story generator are written into the database are available to the story generator when it resumes. In our example, the user's ``id`` is available so that we can respect the foreign key relationship between ``users`` and ``sessions``, even though we did not explicitly set the user's ``id`` when creating the user.
+ 2. Any default values that are set when the rows yielded by the story generator are written into the database are available to the story generator when it resumes. In our example, the user's ``id`` is available so that we can respect the foreign key relationship between ``users`` and ``sessions``, even though we did not explicitly set the user's ``id`` when creating the user on line 8.

To use and get the most from story generators, we will need to make some changes to our configuration:

2 changes: 1 addition & 1 deletion docs/source/loan_data.rst
@@ -104,7 +104,7 @@ We notice that the ``districts`` table doesn't contain any sensitive data so we
.. literalinclude:: ../../examples/loans/config2.yaml
:language: yaml

- We can export the vocabularies to `.yaml` files, delete the old synthetic data, import the vocabularies and create new synthetic data with:
+ We can export the vocabularies to ``.yaml`` files, delete the old synthetic data, import the vocabularies and create new synthetic data with:

.. code-block:: console
