MariaDB #110

Merged
merged 10 commits into from
Jul 6, 2023
16 changes: 13 additions & 3 deletions docs/source/faq.rst
@@ -1,7 +1,17 @@
FAQ
===

Can sqlsynthgen work with two different schemas
***********************************************
Can SqlSynthGen work with two different schemas?
************************************************

sqlsynthgen can only work with a single source schema and a single destination schema at a time. However, you can choose for the destination schema to have a different name to the source schema by setting the DST_SCHEMA environment variable.
SqlSynthGen can only work with a single source schema and a single destination schema at a time.
However, you can choose for the destination schema to have a different name to the source schema by setting the DST_SCHEMA environment variable.

Which DBMSs does SqlSynthGen support?
*************************************

* SqlSynthGen most fully supports **PostgreSQL**, which it uses for its end-to-end functional tests.
* SqlSynthGen also supports **MariaDB** with one exception: you cannot use source statistics (i.e. the ``make-stats`` command).
* SqlSynthGen *might* work with **SQLite**, but this is largely untested.

Please open a GitHub issue if you would like to see support for another DBMS.
2 changes: 1 addition & 1 deletion docs/source/installation.rst
@@ -10,7 +10,7 @@ To use sqlsynthgen, first install it:

.. code-block:: console

$ pip install git+https://github.com/alan-turing-institute/sqlsynthgen.git
$ pip install sqlsynthgen

and check that you can view the help message with:

13 changes: 4 additions & 9 deletions docs/source/quickstart.rst
@@ -27,21 +27,16 @@ For the simplest case, we will need `make-tables`, `make-generators`, `create-ta
we need to set environment variables to tell sqlsynthgen how to access our source database (where the real data resides now) and destination database (where the synthetic data will go).
We can do that in the terminal with the `export` keyword, as shown below, or in a file called `.env`.
The source and destination may be on the same database server, as long as the database or schema names differ.
If the source and destination schemas are the default schema for the user on that database, you should not set those variables.
If you are using a DBMS that does not support schemas (e.g. MariaDB), you must not set the ``SRC_SCHEMA`` and ``DST_SCHEMA`` variables.

.. code-block:: console

$ export SRC_HOST_NAME='[email protected]'
$ export SRC_USER_NAME='someuser'
$ export SRC_PASSWORD='secretpassword'
$ export SRC_DSN="postgresql://someuser:[email protected]"
$ export SRC_SCHEMA='myschema'
$ export SRC_DB_NAME='source_db'

$ export DST_HOST_NAME='[email protected]'
$ export DST_USER_NAME='someuser'
$ export DST_PASSWORD='secretpassword'
$ export DST_DSN="postgresql://someuser:[email protected]/dst_db"
$ export DST_SCHEMA='myschema'
$ export DST_DB_NAME='destination_db'
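
A DSN packs the user name, password, host and database into a single URL, so a password containing characters such as ``@`` or ``/`` has to be percent-encoded before it is embedded. The following is a minimal sketch of how such a string could be assembled with the Python standard library. The helper name ``build_dsn``, the host ``db.example.org`` and the credentials are hypothetical and only serve as illustration; they are not part of SqlSynthGen.

```python
from urllib.parse import quote


def build_dsn(driver, user, password, host, database=None, port=None):
    """Assemble a driver://user:password@host[:port][/database] URL.

    Hypothetical helper for illustration; not part of SqlSynthGen.
    """
    # Percent-encode the credentials so that characters such as '@' or '/'
    # cannot be mistaken for URL delimiters.
    netloc = f"{quote(user, safe='')}:{quote(password, safe='')}@{host}"
    if port is not None:
        netloc += f":{port}"
    path = f"/{database}" if database else ""
    return f"{driver}://{netloc}{path}"


# Example values are assumptions, chosen to show the encoding of '@' and '/'.
src_dsn = build_dsn("postgresql", "someuser", "p@ss/word", "db.example.org", "source_db")
print(src_dsn)
```

The resulting string could then be exported as ``SRC_DSN`` or written to the ``.env`` file.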


Next, we make a SQLAlchemy file that defines the structure of your database using the `make-tables` command:

12 changes: 4 additions & 8 deletions docs/source/tutorials/airbnb.rst
@@ -1,6 +1,8 @@
An Introduction to SqlSynthGen
==============================

.. _introduction:

`SqlSynthGen <https://github.com/alan-turing-institute/sqlsynthgen/>`_, or SSG for short, is a software package that we have written for synthetic data generation, focussed on relational data.
When pointed to an existing relational database, SSG creates another database with the same database schema, and populates it with synthetic data.
By default the synthetic data is crudely low fidelity, but the user is given various ways to configure the behavior of SSG to increase fidelity, while maintaining transparency and control over how the original data is used to inform the synthetic data, to control privacy risks.
@@ -26,14 +28,8 @@ First, we need to provide SSG with the connection parameters, using a ``.env`` f

.. code-block:: console

SRC_HOST_NAME=localhost
SRC_USER_NAME=postgres
SRC_PASSWORD=password
SRC_DB_NAME=airbnb
DST_HOST_NAME=localhost
DST_USER_NAME=postgres
DST_PASSWORD=password
DST_DB_NAME=dst
SRC_DSN='postgresql://postgres:password@localhost/airbnb'
DST_DSN='postgresql://postgres:password@localhost/dst'

We can start the schema migration process by running the following command::

65 changes: 65 additions & 0 deletions docs/source/tutorials/loan_applications.rst
@@ -0,0 +1,65 @@
Tutorial: Loan Data
===================

There are many potential applications of synthetic data in banking and finance where the nature of the data, being both personally and commercially sensitive, may rule out sharing real, identifiable data.

Here, we show how to use SqlSynthGen to generate a simple (uniformly random) synthetic version of the freely-available `PKDD'99 <https://relational.fit.cvut.cz/dataset/Financial>`_ dataset.
This dataset contains 606 successful and 76 unsuccessful loan applications.

The PKDD'99 dataset is stored in a MariaDB database, which means that we need a local MariaDB database to store the synthetic data.
MariaDB installation instructions can be found `here <https://mariadb.org/download/?t=mariadb&p=mariadb&r=11.2.0#entry-header>`_.
We presume that you have a local server running on port 3306, with a user called ``myuser``, a password ``mypassword`` and a database called ``financial``.

.. code-block:: console

$ mysql
MariaDB > create user 'myuser'@'localhost' identified by 'mypassword';
MariaDB > create database financial;
MariaDB > grant all privileges on financial.* to 'myuser'@'localhost';
MariaDB > \q

After :ref:`installing SqlSynthGen <enduser>`, we create a ``.env`` file to set some environment variables to define the source database as the one linked at the bottom of the PKDD'99 page, and the destination database as the local one:

**.env**

.. code-block:: console

SRC_DSN="mariadb+pymysql://guest:[email protected]:3306/Financial_ijs"
DST_DSN="mariadb+pymysql://myuser:mypassword@localhost:3306/financial"
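
Before running any commands, it can help to check that a DSN splits into the parts we expect (dialect, driver, user, host, port and database). The sketch below does this with the Python standard library only. The function name ``describe_dsn`` is a hypothetical helper, not part of SqlSynthGen, and it deliberately leaves the password out of its result.

```python
from urllib.parse import urlsplit


def describe_dsn(dsn):
    """Split a DSN into its components (hypothetical helper for illustration)."""
    parts = urlsplit(dsn)
    # A scheme such as "mariadb+pymysql" combines the dialect and the driver.
    dialect, _, driver = parts.scheme.partition("+")
    return {
        "dialect": dialect,
        "driver": driver or None,
        "user": parts.username,
        "host": parts.hostname,
        "port": parts.port,
        "database": parts.path.lstrip("/") or None,
    }


# The destination DSN from the .env file above.
info = describe_dsn("mariadb+pymysql://myuser:mypassword@localhost:3306/financial")
print(info)
```

If the port or database comes back as ``None``, the DSN is probably missing a component.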

We run SqlSynthGen's ``make-tables`` command to create a file called ``orm.py`` that contains the schema of the source database.

.. code-block:: console

$ sqlsynthgen make-tables

Inspecting the ``orm.py`` file, we see that the ``tkeys`` table has a column called ``goodClient``, which is a ``TINYINT``.
SqlSynthGen doesn't know what to do with ``TINYINT`` columns, so we need to create a config file to tell it how to handle them. This isn't necessary for normal ``Integer`` columns.

**config.yaml**

.. literalinclude:: ../../../tests/examples/loans/config.yaml
:language: yaml

We run SqlSynthGen's ``make-generators`` command to create ``ssg.py``, which contains a generator class for each table in the source database:

.. code-block:: console

$ sqlsynthgen make-generators --config config.yaml

We then run SqlSynthGen's ``create-tables`` command to create the tables in the destination database:

.. code-block:: console

$ sqlsynthgen create-tables

Note that, alternatively, you could use another tool, such as ``mysqldump``, to create the tables in the destination database.

Finally, we run SqlSynthGen's ``create-data`` command to populate the tables with synthetic data:

.. code-block:: console

$ sqlsynthgen create-data --num-passes 100

This will make 100 rows in each of the nine tables.
The data will be entirely random, so you may wish to fine-tune it using the source statistics, custom generators or "story generators" explained in the longer :ref:`introduction <introduction>`.
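
The four commands used in this tutorial can also be chained from a small script. The following is a sketch only: the commands and flags are the ones shown above, while the names ``STEPS`` and ``run_pipeline`` and the ``dry_run`` switch are our own assumptions for illustration. By default it only prints the commands; passing ``dry_run=False`` would execute them and stop at the first failure.

```python
import shlex
import subprocess

# The commands from this tutorial, in the order they are run.
STEPS = [
    ["sqlsynthgen", "make-tables"],
    ["sqlsynthgen", "make-generators", "--config", "config.yaml"],
    ["sqlsynthgen", "create-tables"],
    ["sqlsynthgen", "create-data", "--num-passes", "100"],
]


def run_pipeline(steps=STEPS, dry_run=True):
    """Print each command and, unless dry_run is set, execute it."""
    shown = []
    for step in steps:
        line = shlex.join(step)
        print(f"$ {line}")
        if not dry_run:
            # check=True raises CalledProcessError if a command fails,
            # so later steps never run against a half-built state.
            subprocess.run(step, check=True)
        shown.append(line)
    return shown


# Dry run: nothing is executed, the commands are only listed.
commands = run_pipeline()
```

With 100 passes over nine tables, a full run would be expected to create 900 rows in total.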