diff --git a/CHANGELOG.md b/CHANGELOG.md
deleted file mode 100644
index 32d63f7..0000000
--- a/CHANGELOG.md
+++ /dev/null
@@ -1,51 +0,0 @@
-# Change Log
-
-## v1.3.2 - 2022-05-03
-
-### Added
-* html and public, commandline parameters to `sdnist.challenge.submission` module to allow using this
-module for scoring with public data sets.
-
-### Changed
-
-
-### Fixed
-* `sdnist.challenge.submission` to read epsilon from dataset's parameters json file
-instead of using hard-coded values.
-* Generating collective html visualization of all synthetic datasets scored using `sdnist.challenge.submission`
-module.
-* Jinja template to render drop-down selection of year and puma for census challenge visualization.
-
-## v1.3.1 - 2022-04-05
-
-### Added
-* Graph Map Edge scoring for taxi challenge.
-* Apparent Match Distribution privacy metric.
-* Support for running `sdnist.challenge.submission` module
-for commandline. `sdnist.challenge.submission` module generates
-final score which can be compared to leaderboard scores
-
-### Changed
-
-* Score function computes aggregate score over all scoring metrics
-available for a challenge.
- * k-marginal for census challenge.
- * k-marginal, hoc and graph-edge-map for taxi challenge
-
-### Fixed
-* Higher Order Conjunction Metric
-
-## v1.3.0 - 2022-03-23
-
-### Added
-* Support for visualizing k-marginal scores of each puma with openstreet map for all available census datasets in SDNist. Support available for datasets:
- * IL_OH_10Y_PUMS.[csv, json, parquet] tabular supported with IL_OH_10Y_PUMS.geojson.
- * GA_NC_SC_10Y_PUMS.[csv, json, parquet] tabular (csv, json, parquet) supported with GA_NC_SC_10Y_PUMS.geojson.
- * NY_PA_10Y_PUMS.[csv, json, parquet] tabular supported with NY_PA_10Y_PUMS.geojson.
-
-### Changed
-* No more support for downloading a specific missing dataset instead SDNist downloads all the available datasets from the [sdnist github releases](https://github.com/usnistgov/SDNist/releases).
-
-### Fixed
-* problem with importing `importlib_resources` in python >= 3.7
-* broken download links for fetching data.
\ No newline at end of file
diff --git a/CITATION.cff b/CITATION.cff
new file mode 100644
index 0000000..1ebac0a
--- /dev/null
+++ b/CITATION.cff
@@ -0,0 +1,27 @@
+cff-version: 1.2.0
+title: "SDNist: Deidentified Data Report Tool"
+abstract: "SDNist provides benchmark data and a suite of both machine- and human-readable outputs with more than ten metrics including univariate and multivariate statistics, database distance metrics, principal component analysis, propensity, basic privacy evaluation, and other information-rich tools."
+message: >-
+ If you use this repository or present information about it publicly, please cite us.
+type: software
+version: 2.0.0
+doi: 10.18434/mds2-2943
+date-released: 2021-12-16
+contact:
+ - affiliation: "National Institute of Standards and Technology"
+ email: gary.howarth@nist.gov
+ family-names: Howarth
+ given-names: Gary
+authors:
+- family-names: Task
+ given-names: Christine
+ affiliation: Knexus Research Corporation
+ email: christine.task@knexusresearch.com
+- family-names: Bhagat
+ given-names: Karan
+ affiliation: Knexus Research Corporation
+- family-names: Howarth
+ given-names: Gary
+ affiliation: National Institute of Standards and Technology
+ email: gary.howarth@nist.gov
+ orcid: "https://orcid.org/0000-0002-3587-0546"
diff --git a/README.md b/README.md
index db9d282..d631fe6 100644
--- a/README.md
+++ b/README.md
@@ -1,10 +1,8 @@
-# SDNist v1.4 beta: Deidentified Data Report Tool
-
-## We anticipate releasing SDNist v2 February 21 2023!
+# SDNist v2.0: Deidentified Data Report Tool
## [SDNist is the offical software package for engaging in the NIST Collaborative Research Cycle](https://pages.nist.gov/privacy_collaborative_research_cycle)
-Welcome! SDNist v1.4b is a python package that provides benchmark data and evaluation metrics for deidentified data generators. This version of SDNist supports using the [NIST Diverse Community Excerpts](https://github.com/usnistgov/SDNist/tree/main/nist%20diverse%20communities%20data%20excerpts), a geographically partioned, limited feature data set.
+Welcome! SDNist v2.0 is a Python package that provides benchmark data and evaluation metrics for deidentified data generators. This version of SDNist supports using the [NIST Diverse Community Excerpts](https://github.com/usnistgov/SDNist/tree/main/nist%20diverse%20communities%20data%20excerpts), a geographically partitioned, limited-feature data set.
The deidentified data report evaluates utility and privacy of a given deidentified dataset and generates a summary quality report with performance of a deidentified dataset enumerated and illustrated for each utility and privacy metric.
@@ -25,16 +23,24 @@ Help us improve the package and this guide by reporting issues [here](https://gi
### Temporal Map Challenge Environment
-SDNist v1.4b does not support the Temporal Map Challenge environment.
+SDNist v2.0 does not support the Temporal Map Challenge environment.
To run the testing environment from the [*NIST PSCR Differential Privacy Temporal Map Challenge*](https://www.nist.gov/ctl/pscr/open-innovation-prize-challenges/past-prize-challenges/2020-differential-privacy-temporal) for the Chicago Taxi data sprint or the American Community Survey sprint, please go to the the [Temporal Map Challenge assets repository](https://github.com/usnistgov/Differential-Privacy-Temporal-Map-Challenge-assets).
+
+### Citing SDNist Deidentified Data Report Tool
+If you publish work that uses the SDNist Deidentified Data Report Tool, please cite the software. Recommended citation:
+> Task C., Bhagat K., and Howarth G.S. (2023), SDNist v2: Deidentified Data Report Tool,
+> National Institute of Standards and Technology,
+> https://doi.org/10.18434/mds2-2943
+(NOTE: DOI is not yet active, but should be by 1 APR 2023).
+
Setting Up the SDNIST Report Tool
------------------------
### Brief Setup Instructions
-SDNist v1.4 requires Python version 3.7 or greater. If you have installed a previous version of the SDNist library, we recommend uninstalling or installing v1.4 in a virtual environment. v1.4 can be installed via [Release 1.4.0b](https://github.com/usnistgov/SDNist/releases/tag/v1.4.1-b.1). The NIST Diverse Community Exceprt data will download on the fly.
+SDNist v2.0 requires Python version 3.7 or greater. If you have installed a previous version of the SDNist library, we recommend either uninstalling it or installing v2.0 in a virtual environment. v2.0 can be installed via [Release 2.0](https://github.com/usnistgov/SDNist/releases/tag/v2.0.0). The NIST Diverse Community Excerpt data will download on the fly.
### Detailed Setup Instructions
@@ -54,10 +60,10 @@ SDNist v1.4 requires Python version 3.7 or greater. If you have installed a prev
c:\\sdnist-project>
```
-4. Download the sdnist installable wheel (sdnist-1.4.1b-py3-none-any.whl) from the [Github:SDNist beta release](https://github.com/usnistgov/SDNist/releases/download/v1.4.1-b.1/sdnist-1.4.1b1-py3-none-any.whl).
+4. Download the sdnist installable wheel (sdnist-2.0.0-py3-none-any.whl) from the [Github:SDNist v2.0.0 release](https://github.com/usnistgov/SDNist/releases/download/v2.0.0/sdnist-2.0.0-py3-none-any.whl).
-5. Move the downloaded sdnist-1.4.1b1-py3-none-any.whl file to the sdnist-project directory.
+5. Move the downloaded sdnist-2.0.0-py3-none-any.whl file to the sdnist-project directory.
6. Using the terminal on Mac/Linux or powershell on Windows, navigate to the sdnist-project directory.
@@ -110,7 +116,7 @@ SDNist v1.4 requires Python version 3.7 or greater. If you have installed a prev
```
-10. Per step 5 above, the sdnist-1.4.1b1-py3-none-any.whl file should already be present in the sdnist-project directory. Check whether that is true by listing the files in the sdnist-project directory.
+10. Per step 5 above, the sdnist-2.0.0-py3-none-any.whl file should already be present in the sdnist-project directory. Check whether that is true by listing the files in the sdnist-project directory.
**MAC OS/Linux:**
```
@@ -120,12 +126,12 @@ SDNist v1.4 requires Python version 3.7 or greater. If you have installed a prev
```
(venv) c:\\sdnist-project> dir
```
- The sdnist-1.4.0b2-py3-none-any.whl file should be in the list printed by the above command; otherwise, follow steps 4 and 5 again to download the .whl file.
+ The sdnist-2.0.0-py3-none-any.whl file should be in the list printed by the above command; otherwise, follow steps 4 and 5 again to download the .whl file.
11. Install sdnist Python library:
```
- (venv) c:\\sdnist-project> pip install sdnist-1.4.1b1-py3-none-any.whl
+ (venv) c:\\sdnist-project> pip install sdnist-2.0.0-py3-none-any.whl
```
@@ -143,8 +149,9 @@ SDNist v1.4 requires Python version 3.7 or greater. If you have installed a prev
TARGET_DATASET_NAME Select name of the target dataset that was used to generated given deidentified dataset
optional arguments:
- \-h, \--help show this help message and exit
- \--data-root DATA_ROOT Path of the directory to be used as the root for the target datasets\--download DOWNLOAD Download toy datasets if not present locallyChoices for Target Dataset Name::
+ \-h, \--help Show this help message and exit
+ \--data-root DATA_ROOT Path of the directory to be used as the root for the target datasets
+ \--download DOWNLOAD Download toy datasets if not present locallyChoices for Target Dataset Name::
(dataname) (filename)
MA ma2019
@@ -204,7 +211,7 @@ Generate Data Quality Report
- TX
- NATIONAL
- - **--data-root**: The absolute or relative path to the directory containing the bundled dataset, or the directory where the bundled dataset should be downloaded to if it is not available locally. The default directory is set to sdnist_toy_data.
+ - **--data-root**: The absolute or relative path to the directory containing the bundled dataset, or the directory where the bundled dataset should be downloaded to if it is not available locally. The default directory is set to **diverse_community_excerpts_data**.
## Setup Data for SDNIST Report Tool
@@ -215,7 +222,7 @@ Generate Data Quality Report
(venv) c:\\sdnist-project> python -m sdnist.report syn_tx.csv TX
Downloading all SDNist datasets from:
- https://github.com/usnistgov/SDNist/releases/download/v1.4.0-b.1/SDNist-toy-data-1.4.0-b.1.zip ...
+ https://github.com/usnistgov/SDNist/releases/download/v2.0.0/diverse_community_excerpts_data.zip ...
...5%, 47352 KB, 8265 KB/s, 5 seconds elapsed
```
@@ -227,30 +234,30 @@ Generate Data Quality Report
3. The sdnist.report package also needs a deidentified dataset that it can evaluate against its original counterpart. Since the sdnist.report package comes bundled with the datasets, the deidentified dataset should be generated using the bundled datasets.
- You can download a copy of the datasets from [Github Sdnist Toy Dataset](https://github.com/usnistgov/SDNist/tree/main/nist%20diverse%20communities%20data%20excerpts). This copy is similar to the one bundled with the sdnist.report package, but it contains more documentation and a description of the datasets.
+ You can download a copy of the datasets from Github [Diverse Community Excerpts Data](https://github.com/usnistgov/SDNist/tree/main/nist%20diverse%20communities%20data%20excerpts). This copy is similar to the one bundled with the sdnist.report package, but it contains more documentation and a description of the datasets.
-4. You can download the toy deidentified datasets from [Github Sdnist Toy Synthetic Dataset](https://github.com/usnistgov/SDNist/releases/download/v1.4.0-b.1/toy_synthetic_data.zip). Unzip the downloaded file, and move the unzipped toy_synthetic_dataset directory to the sdnist-project directory.
+4. You can download the toy deidentified datasets from Github [Sdnist Toy Deidentified Dataset](https://github.com/usnistgov/SDNist/releases/download/v2.0.0/toy_deidentified_data.zip). Unzip the downloaded file, and move the unzipped toy_deidentified_data directory to the sdnist-project directory.
-5. Each toy deidentified dataset file is generated using the [Sdnist Toy Dataset](https://github.com/usnistgov/SDNist/releases/download/v1.4.0-b.1/SDNist-toy-data-1.4.0-b.1.zip). The syn_ma.csv, syn_tx.csv, and syn_national.csv deidentified dataset files are created from target datasets MA (ma2019.csv), TX (tx2019.csv), and NATIONAL(national2019.csv), respectively. You can use one of the toy synthetic dataset files for testing whether the sdnist.report package is installed correctly on your system.
+5. Each toy deidentified dataset file is generated using the [Diverse Community Excerpts Data](https://github.com/usnistgov/SDNist/releases/download/v2.0.0/diverse_community_excerpts_data.zip). The syn_ma.csv, syn_tx.csv, and syn_national.csv deidentified dataset files are created from target datasets MA (ma2019.csv), TX (tx2019.csv), and NATIONAL (national2019.csv), respectively. You can use one of the toy deidentified dataset files to test whether the sdnist.report package is installed correctly on your system.
6. Use the following commands for generating reports if you are using a toy deidentified dataset file:
For evaluating the Massachusetts dataset:
```
- (venv) c:\\sdnist-project> python -m sdnist.report toy_synthetic_data/syn_ma.csv MA
+ (venv) c:\\sdnist-project> python -m sdnist.report toy_deidentified_data/syn_ma.csv MA
```
For evaluating the Texas dataset:
```
- (venv) c:\\sdnist-project> python -m sdnist.report toy_synthetic_data/syn_tx.csv TX
+ (venv) c:\\sdnist-project> python -m sdnist.report toy_deidentified_data/syn_tx.csv TX
```
For evaluating the national dataset:
```
- (venv) c:\\sdnist-project> python -m sdnist.report toy_synthetic_data/syn_national.csv NATIONAL
+ (venv) c:\\sdnist-project> python -m sdnist.report toy_deidentified_data/syn_national.csv NATIONAL
```
7. A deidentified dataset can be a .csv or a parquet file, and the path of this file is required
@@ -258,9 +265,6 @@ by the sdnist.report package to generate a data quality report.
## Download Data Manually
-1. If the sdnist.report package is not able to download the datasets, you can download them from [Github:SDNist toy data beta release](https://github.com/usnistgov/SDNist/releases/download/v1.4.0-b.1/SDNist-toy-data-1.4.0-b.1.zip).
-2. Move the downloaded SDNist-toy-data-1.4.0-b.1.zip file to the sdnist-project directory.
-3. Unzip the SDNist-toy-data-1.4.0-b.1.zip file and move the data directory inside it to the sdnist-project directory.
-4. Delete the SDNist-toy-data-1.4.0-b.1.zip file once the data directory is successfully moved out of the unzipped directory.
-5. Also delete the now-empty SDNist-toy-data-1.4.0-b.1 directory from where the zip file was extracted.
-6. And finally, to successfully install datasets manually, change the name of the data directory inside the sdnist-project directory to sdnist_toy_data.
+1. If the sdnist.report package is not able to download the datasets, you can download them from Github [Diverse Community Excerpts Data](https://github.com/usnistgov/SDNist/releases/download/v2.0.0/diverse_community_excerpts_data.zip).
+2. Unzip the **diverse_community_excerpts_data.zip** file and move the unzipped **diverse_community_excerpts_data** directory to the **sdnist-project** directory.
+3. Delete the **diverse_community_excerpts_data.zip** file once the data is successfully extracted from the zip.
diff --git a/SDNist_introduction_paper_PPAI22.pdf b/SDNist_introduction_paper_PPAI22.pdf
deleted file mode 100644
index b5636c9..0000000
Binary files a/SDNist_introduction_paper_PPAI22.pdf and /dev/null differ
diff --git a/challenge benchmark problems/Location Sequence Benchmark.pdf b/challenge benchmark problems/Location Sequence Benchmark.pdf
deleted file mode 100644
index 5bf9284..0000000
Binary files a/challenge benchmark problems/Location Sequence Benchmark.pdf and /dev/null differ
diff --git a/challenge benchmark problems/Survey Data Benchmark.pdf b/challenge benchmark problems/Survey Data Benchmark.pdf
deleted file mode 100644
index e7da2d3..0000000
Binary files a/challenge benchmark problems/Survey Data Benchmark.pdf and /dev/null differ
diff --git a/examples/DPSyn/LICENSE b/examples/DPSyn/LICENSE
deleted file mode 100644
index 15629b1..0000000
--- a/examples/DPSyn/LICENSE
+++ /dev/null
@@ -1,21 +0,0 @@
-MIT License
-
-Copyright (c) 2021 DPSyn
-
-Permission is hereby granted, free of charge, to any person obtaining a copy
-of this software and associated documentation files (the "Software"), to deal
-in the Software without restriction, including without limitation the rights
-to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
-copies of the Software, and to permit persons to whom the Software is
-furnished to do so, subject to the following conditions:
-
-The above copyright notice and this permission notice shall be included in all
-copies or substantial portions of the Software.
-
-THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
-IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
-FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
-AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
-LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
-OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
-SOFTWARE.
\ No newline at end of file
diff --git a/examples/DPSyn/config/data.yaml b/examples/DPSyn/config/data.yaml
deleted file mode 100644
index 86505c7..0000000
--- a/examples/DPSyn/config/data.yaml
+++ /dev/null
@@ -1,176 +0,0 @@
----
-pub_dataset_path: /codeexecution/dataloader/public.csv
-priv_dataset_path: /codeexecution/data/ground_truth.csv
-parameter_spec: /codeexecution/data/parameters.json
-numerical_binning:
- "AGE":
- - 20
- - 105
- - 5
- "INCTOT":
- - 0
- - 105_000
- - 5_000
- "INCWAGE":
- - 0
- - 105_000
- - 5_000
- "INCWELFR":
- - 0
- - 105_000
- - 5_000
- "INCINVST":
- - 0
- - 105_000
- - 5_000
- "INCEARN":
- - 0
- - 105_000
- - 5_000
- "POVERTY":
- - 0
- - 520
- - 20
- "HHWT":
- - 0
- - 520
- - 20
- "PERWT":
- - 0
- - 520
- - 20
- "DEPARTS":
- - 0
- - 15
- - 30
- - 45
- "ARRIVES":
- - 0
- - 15
- - 30
- - 45
-grouping_attributes:
- - attributes:
- - "SEX"
- - "MARST"
- num_values: 12
- grouped_name: "SEX+MARST"
- combinations:
- - !!python/tuple [1, 1]
- - !!python/tuple [1, 2]
- - !!python/tuple [1, 3]
- - !!python/tuple [1, 4]
- - !!python/tuple [1, 5]
- - !!python/tuple [1, 6]
- - !!python/tuple [2, 1]
- - !!python/tuple [2, 2]
- - !!python/tuple [2, 3]
- - !!python/tuple [2, 4]
- - !!python/tuple [2, 5]
- - !!python/tuple [2, 6]
- - attributes:
- - "HCOVANY"
- - "HCOVPRIV"
- - "HINSEMP"
- - "HINSCAID"
- - "HINSCARE"
- num_values: 13
- grouped_name: "HINS-COV"
- combinations:
- - !!python/tuple [1, 1, 1, 1, 1]
- - !!python/tuple [2, 1, 1, 1, 1]
- - !!python/tuple [2, 1, 1, 1, 2]
- - !!python/tuple [2, 1, 1, 2, 1]
- - !!python/tuple [2, 1, 1, 2, 2]
- - !!python/tuple [2, 2, 1, 1, 1]
- - !!python/tuple [2, 2, 1, 1, 2]
- - !!python/tuple [2, 2, 1, 2, 1]
- - !!python/tuple [2, 2, 1, 2, 2]
- - !!python/tuple [2, 2, 2, 1, 1]
- - !!python/tuple [2, 2, 2, 1, 2]
- - !!python/tuple [2, 2, 2, 2, 1]
- - !!python/tuple [2, 2, 2, 2, 2]
- - attributes:
- - "EMPSTATD"
- - "WORKEDYR"
- - "WRKLSTWK"
- num_values: 28
- grouped_name: "EMP"
- combinations:
- - !!python/tuple [0, 0, 0]
- - !!python/tuple [10, 3, 2]
- - !!python/tuple [30, 1, 1]
- - !!python/tuple [30, 2, 1]
- - !!python/tuple [30, 1, 3]
- - !!python/tuple [30, 3, 1]
- - !!python/tuple [10, 3, 3]
- - !!python/tuple [20, 3, 1]
- - !!python/tuple [20, 2, 1]
- - !!python/tuple [30, 2, 3]
- - !!python/tuple [30, 3, 3]
- - !!python/tuple [12, 3, 1]
- - !!python/tuple [20, 1, 1]
- - !!python/tuple [12, 3, 2]
- - !!python/tuple [14, 3, 2]
- - !!python/tuple [20, 3, 3]
- - !!python/tuple [30, 3, 2]
- - !!python/tuple [20, 2, 3]
- - !!python/tuple [12, 3, 3]
- - !!python/tuple [20, 1, 3]
- - !!python/tuple [10, 3, 1]
- - !!python/tuple [14, 3, 3]
- - !!python/tuple [30, 1, 2]
- - !!python/tuple [30, 2, 2]
- - !!python/tuple [15, 3, 1]
- - !!python/tuple [14, 3, 1]
- - !!python/tuple [15, 3, 3]
- - !!python/tuple [20, 1, 2]
- - attributes:
- - "ABSENT"
- - "LOOKING"
- num_values: 10
- grouped_name: "ABS+LOOK"
- combinations:
- - !!python/tuple [0, 0]
- - !!python/tuple [1, 1]
- - !!python/tuple [1, 2]
- - !!python/tuple [1, 3]
- - !!python/tuple [3, 1]
- - !!python/tuple [3, 2]
- - !!python/tuple [3, 3]
- - !!python/tuple [4, 1]
- - !!python/tuple [4, 2]
- - !!python/tuple [4, 3]
- - attributes:
- - "AVAILBLE"
- - "WRKRECAL"
- num_values: 12
- grouped_name: "AVA+RECAL"
- combinations:
- - !!python/tuple [0, 0]
- - !!python/tuple [2, 1]
- - !!python/tuple [2, 2]
- - !!python/tuple [2, 3]
- - !!python/tuple [3, 1]
- - !!python/tuple [3, 2]
- - !!python/tuple [3, 3]
- - !!python/tuple [4, 1]
- - !!python/tuple [4, 2]
- - !!python/tuple [4, 3]
- - !!python/tuple [5, 1]
- - !!python/tuple [5, 2]
- - !!python/tuple [5, 3]
-determined_attributes:
- "EMPSTAT":
- by: "EMPSTATD"
- mapping:
- 20: 2
- 30: 3
- 0: 0
- default: 1
- "LABFORCE":
- by: "EMPSTATD"
- mapping:
- 30: 1
- 0: 0
- default: 2
diff --git a/examples/DPSyn/config/data_type.py b/examples/DPSyn/config/data_type.py
deleted file mode 100644
index 23d59b8..0000000
--- a/examples/DPSyn/config/data_type.py
+++ /dev/null
@@ -1,38 +0,0 @@
-
-COLS = {
- "PUMA": "str",
- "YEAR": "uint32",
- "HHWT": "float",
- "GQ": "uint8",
- "PERWT": "float",
- "SEX": "uint8",
- "AGE": "uint8",
- "MARST": "uint8",
- "RACE": "uint8",
- "HISPAN": "uint8",
- "CITIZEN": "uint8",
- "SPEAKENG": "uint8",
- "HCOVANY": "uint8",
- "HCOVPRIV": "uint8",
- "HINSEMP": "uint8",
- "HINSCAID": "uint8",
- "HINSCARE": "uint8",
- "EDUC": "uint8",
- "EMPSTAT": "uint8",
- "EMPSTATD": "uint8",
- "LABFORCE": "uint8",
- "WRKLSTWK": "uint8",
- "ABSENT": "uint8",
- "LOOKING": "uint8",
- "AVAILBLE": "uint8",
- "WRKRECAL": "uint8",
- "WORKEDYR": "uint8",
- "INCTOT": "int32",
- "INCWAGE": "int32",
- "INCWELFR": "int32",
- "INCINVST": "int32",
- "INCEARN": "int32",
- "POVERTY": "uint32",
- "DEPARTS": "uint32",
- "ARRIVES": "uint32",
-}
diff --git a/examples/DPSyn/config/path.py b/examples/DPSyn/config/path.py
deleted file mode 100644
index 9aee3d5..0000000
--- a/examples/DPSyn/config/path.py
+++ /dev/null
@@ -1,17 +0,0 @@
-from pathlib import Path
-
-import os
-
-ROOT_DIRECTORY = Path("")
-DATA_DIRECTORY = ROOT_DIRECTORY / "data"
-CONFIG_DIRECTORY = ROOT_DIRECTORY / "config"
-DATALOADER_DIRECTORY = ROOT_DIRECTORY / "dataloader"
-PICKLE_DIRECTORY = DATALOADER_DIRECTORY / "pkl"
-
-SUBMISSION_FORMAT = DATA_DIRECTORY / "submission_format.csv"
-INPUT = DATA_DIRECTORY / "ground_truth.csv"
-PUBLIC_INPUT = DATA_DIRECTORY / "ground_truth.csv"
-PARAMS = DATA_DIRECTORY / "parameters.json"
-CONFIG_DATA = CONFIG_DIRECTORY / "data.yaml"
-OUTPUT = ROOT_DIRECTORY / "submission.csv"
-
diff --git a/examples/DPSyn/dataloader/DataLoader.py b/examples/DPSyn/dataloader/DataLoader.py
deleted file mode 100644
index 7cf64b8..0000000
--- a/examples/DPSyn/dataloader/DataLoader.py
+++ /dev/null
@@ -1,239 +0,0 @@
-import json
-import os
-import pickle
-from typing import Tuple, Dict
-import numpy as np
-import pandas as pd
-import yaml
-from loguru import logger
-
-from config.path import CONFIG_DATA, PICKLE_DIRECTORY, DATA_DIRECTORY, INPUT
-from config.data_type import COLS
-
-import sdnist
-
-
-class DataLoader:
- def __init__(self):
- self.public_data = None
- self.private_data = None
- self.all_attrs = []
-
- self.encode_mapping = {}
- self.decode_mapping = {}
-
- self.pub_marginals = {}
- self.priv_marginals = {}
-
- self.encode_schema = {}
-
- self.general_schema = {}
- self.filter_values = {}
-
- self.config = None
-
- def load_data(self, pub_only=False):
- # load public data and get grouping mapping and filter values
-
- with open(CONFIG_DATA, 'r') as f:
- config = yaml.load(f, Loader=yaml.FullLoader)
- self.config = config
-
- # load public data
- logger.info("Loading public data")
- self.public_data, self.general_schema = sdnist.census(root="~/datasets", public=False)
- self.public_data = self.binning_attributes(config['numerical_binning'], self.public_data)
- self.public_data = self.grouping_attributes(config['grouping_attributes'], self.public_data)
- self.public_data = self.remove_determined_attributes(config['determined_attributes'], self.public_data)
- self.public_data = self.recode_remain(self.general_schema, config, self.public_data)
- # pickle.dump([self.public_data, self.encode_mapping], open(public_pickle_path, 'wb'))
-
- # load private data
- logger.info("Loading private data")
- self.private_data, self.general_schema = sdnist.census(root="~/datasets", public=True)
- self.private_data = self.binning_attributes(config['numerical_binning'], self.private_data)
- self.private_data = self.grouping_attributes(config['grouping_attributes'], self.private_data)
- self.private_data = self.remove_determined_attributes(config['determined_attributes'], self.private_data)
- self.private_data = self.recode_remain(self.general_schema, config, self.private_data, is_private=True)
- # pickle.dump([self.private_data, self.encode_mapping], open(priv_pickle_path, 'wb'))
-
- for attr, encode_mapping in self.encode_mapping.items():
- self.encode_schema[attr] = sorted(encode_mapping.values())
-
- logger.info(f"public data size {self.public_data.shape}, priv data size {self.private_data.shape}")
-
- def obtain_attrs(self):
- if not self.all_attrs:
- all_attrs = list(self.public_data.columns)
- try:
- all_attrs.remove("sim_individual_id")
- except:
- pass
- self.all_attrs = all_attrs
- return self.all_attrs
-
- def binning_attributes(self, binning_info, data):
- """
- Numerical attributes can be binned
- """
- for attr, spec_list in binning_info.items():
- if attr == "DEPARTS" or attr == "ARRIVES":
- bins = np.r_[-np.inf, [h * 100 + m for h in range(24) for m in spec_list], np.inf]
- else:
- [s, t, step] = spec_list
- bins = np.r_[-np.inf, np.arange(s, t, step), np.inf]
- data[attr] = pd.cut(data[attr], bins).cat.codes
- self.encode_mapping[attr] = {(bins[i], bins[i + 1]): i for i in range(len(bins) - 1)}
- self.decode_mapping[attr] = [i for i in range(len(bins) - 1)]
- return data
-
- def grouping_attributes(self, grouping_info, data):
- """
- Some attributes can be grouped
- """
- for grouping in grouping_info:
- attributes = grouping['attributes']
- new_attr = grouping['grouped_name']
-
- # group attribute values into tuples
- data[new_attr] = data[attributes].apply(tuple, axis=1)
-
- # map tuples to new values in new columns
- encoding = {v: i for i, v in enumerate(grouping['combinations'])}
- data[new_attr] = data[attributes].apply(tuple, axis=1)
- data[new_attr] = data[new_attr].map(encoding)
- self.encode_mapping[new_attr] = encoding
- self.decode_mapping[new_attr] = grouping['combinations']
-
- # drop grouped columns
- data = data.drop(attributes, axis=1)
- return data
-
- @staticmethod
- def remove_determined_attributes(determined_info, data):
- """
- Some dataset are determined by other attributes
- """
- for determined_attr in determined_info.keys():
- data = data.drop(determined_attr, axis=1)
- # print("remove", determined_attr)
- data = data.drop('sim_individual_id', axis=1)
- return data
-
- # recode the remaining single attributes to save storage
- def recode_remain(self, schema, config, data, is_private=False):
- encoded_attr = list(config['numerical_binning'].keys()) + [grouping['grouped_name'] for grouping in config['grouping_attributes']]
- for attr in data.columns:
- if attr in ['sim_individual_id'] or attr in encoded_attr:
- continue
- # print("encode remain:", attr)
- assert attr in schema and 'values' in schema[attr]
- if is_private and attr == 'PUMA':
- mapping = data[attr].unique()
- else:
- mapping = schema[attr]['values']
- encoding = {v: i for i, v in enumerate(mapping)}
- data[attr] = data[attr].map(encoding)
- self.encode_mapping[attr] = encoding
- self.decode_mapping[attr] = mapping
- return data
-
- def generate_all_pub_marginals(self):
- pub_marginal_pickle = PICKLE_DIRECTORY / f"pub_all_marginals.pkl"
-
- if pub_marginal_pickle is not None and os.path.isfile(pub_marginal_pickle):
- self.pub_marginals = pickle.load(open(pub_marginal_pickle, 'rb'))
- return self.pub_marginals
-
- all_attrs = list(self.public_data.columns)
- # all_attrs.remove("sim_individual_id")
- # one-way marginals except PUMA and YEAR
- for attr in all_attrs:
- if attr == 'PUMA' or attr == 'YEAR':
- continue
- self.pub_marginals[frozenset([attr])] = self.generate_one_way_marginal(self.public_data, attr)
- # two_way marginals except PUMA and YEAR
- for i, attr in enumerate(all_attrs):
- if attr == 'PUMA' or attr == 'YEAR':
- continue
- for j in range(i + 1, len(all_attrs)):
- if all_attrs[j] == 'PUMA' or all_attrs[j] == 'YEAR':
- continue
- self.pub_marginals[frozenset([all_attrs[i], all_attrs[j]])] = self.generate_two_way_marginal(
- self.public_data, all_attrs[i], all_attrs[j])
-
- if pub_marginal_pickle is not None:
- pickle.dump(self.pub_marginals, open(pub_marginal_pickle, 'wb'))
-
- return self.pub_marginals
-
- def generate_one_way_marginal(self, records: pd.DataFrame, index_attribute: list):
- marginal = records.assign(n=1).pivot_table(values='n', index=index_attribute, aggfunc=np.sum, fill_value=0)
- indices = sorted([i for i in self.encode_mapping[index_attribute].values()])
- marginal = marginal.reindex(index=indices).fillna(0).astype(np.int32)
- return marginal
-
- def generate_two_way_marginal(self, records: pd.DataFrame, index_attribute: list, column_attribute: list):
- marginal = records.assign(n=1).pivot_table(values='n', index=index_attribute, columns=column_attribute,
- aggfunc=np.sum, fill_value=0)
- indices = sorted([i for i in self.encode_mapping[index_attribute].values()])
- columns = sorted([i for i in self.encode_mapping[column_attribute].values()])
- marginal = marginal.reindex(index=indices, columns=columns).fillna(0).astype(np.int32)
- return marginal
-
- def generate_all_one_way_marginals_except_PUMA_YEAR(self, records: pd.DataFrame):
- all_attrs = self.obtain_attrs()
- marginals = {}
- for attr in all_attrs:
- if attr == 'PUMA' or attr == 'YEAR':
- continue
- marginals[frozenset([attr])] = self.generate_one_way_marginal(records, attr)
- return marginals
-
- def generate_all_two_way_marginals_except_PUMA_YEAR(self, records: pd.DataFrame):
- all_attrs = self.obtain_attrs()
- marginals = {}
- for i, attr in enumerate(all_attrs):
- if attr == 'PUMA' or attr == 'YEAR':
- continue
- for j in range(i + 1, len(all_attrs)):
- if all_attrs[j] == 'PUMA' or all_attrs[j] == 'YEAR':
- continue
- marginals[frozenset([attr, all_attrs[j]])] = self.generate_two_way_marginal(records, attr, all_attrs[j])
- return marginals
-
- def generate_marginal_by_config(self, records: pd.DataFrame, config: dict) -> Tuple[Dict, Dict]:
- marginal_sets = {}
- epss = {}
- for marginal_key, marginal_dict in config.items():
- marginals = {}
- if marginal_key == 'priv_all_one_way':
- # merge the returned marginal dictionary
- marginals.update(self.generate_all_one_way_marginals_except_PUMA_YEAR(records))
- elif marginal_key == 'priv_all_two_way':
- # merge the returned marginal dictionary
- marginals.update(self.generate_all_two_way_marginals_except_PUMA_YEAR(records))
- else:
- attrs = marginal_dict['attributes']
- if len(attrs) == 1:
- marginals[frozenset(attrs)] = self.generate_one_way_marginal(records, attrs[0])
- elif len(attrs) == 2:
- marginals[frozenset(attrs)] = self.generate_two_way_marginal(records, attrs[0], attrs[1])
- else:
- raise NotImplementedError
- epss[marginal_key] = marginal_dict['total_eps']
- marginal_sets[marginal_key] = marginals
- return marginal_sets, epss
-
- def get_marginal_grouping_info(self, cur_attrs):
- info = {}
- grouping_info = self.config['grouping_attributes']
- for attr in cur_attrs:
- for grouping in grouping_info:
- new_attr = grouping['grouped_name']
- if new_attr == attr:
- info[new_attr] = grouping['attributes']
- break
- if attr not in info:
- info[attr] = [attr]
- return info
diff --git a/examples/DPSyn/dataloader/RecordPostprocessor.py b/examples/DPSyn/dataloader/RecordPostprocessor.py
deleted file mode 100644
index 73036e3..0000000
--- a/examples/DPSyn/dataloader/RecordPostprocessor.py
+++ /dev/null
@@ -1,121 +0,0 @@
-import numpy as np
-import pandas as pd
-import yaml
-
-COLS = {
- "PUMA": "str",
- "YEAR": "uint32",
- "HHWT": "float",
- "GQ": "uint8",
- "PERWT": "float",
- "SEX": "uint8",
- "AGE": "uint8",
- "MARST": "uint8",
- "RACE": "uint8",
- "HISPAN": "uint8",
- "CITIZEN": "uint8",
- "SPEAKENG": "uint8",
- "HCOVANY": "uint8",
- "HCOVPRIV": "uint8",
- "HINSEMP": "uint8",
- "HINSCAID": "uint8",
- "HINSCARE": "uint8",
- "EDUC": "uint8",
- "EMPSTAT": "uint8",
- "EMPSTATD": "uint8",
- "LABFORCE": "uint8",
- "WRKLSTWK": "uint8",
- "ABSENT": "uint8",
- "LOOKING": "uint8",
- "AVAILBLE": "uint8",
- "WRKRECAL": "uint8",
- "WORKEDYR": "uint8",
- "INCTOT": "int32",
- "INCWAGE": "int32",
- "INCWELFR": "int32",
- "INCINVST": "int32",
- "INCEARN": "int32",
- "POVERTY": "uint32",
- "DEPARTS": "uint32",
- "ARRIVES": "uint32",
-}
-
-class RecordPostprocessor:
- def __init__(self):
- self.config = None
- pass
-
- def post_process(self, data: pd.DataFrame, config_file_path: str, grouping_mapping: dict):
- assert isinstance(data, pd.DataFrame)
- with open(config_file_path, 'r') as f:
- self.config = yaml.load(f, Loader=yaml.BaseLoader)
-
- data = self.ungrouping_attributes(data, grouping_mapping)
- data = self.unbinning_attributes(data)
- data = self.add_determined_attrs(data)
- data = self.decode_other_attributes(data, grouping_mapping)
- data = self.ensure_types(data)
- return data
-
- def unbinning_attributes(self, data: pd.DataFrame):
- binning_info = self.config['numerical_binning']
- print(binning_info)
- for att, spec_list in binning_info.items():
- if att == "DEPARTS" or att == "ARRIVES":
- bins = np.r_[-np.inf, [int(h) * 100 + int(m) for h in range(24) for m in spec_list], np.inf]
- else:
- [s, t, step] = spec_list
- bins = np.r_[-np.inf, np.arange(int(s), int(t), int(step)), np.inf]
-
- # remove np.inf
- bins[0] = bins[1] - 1
- bins[-1] = bins[-2] + 2
-
- values_map = {i: int((bins[i] + bins[i + 1]) / 2) for i in range(len(bins) - 1)}
- data[att] = data[att].map(values_map)
- return data
-
- def ungrouping_attributes(self, data: pd.DataFrame, decode_mapping: dict):
- grouping_info = self.config['grouping_attributes']
- for grouping in grouping_info:
- grouped_attr = grouping['grouped_name']
- attributes = grouping['attributes']
-
- data[grouped_attr] = [decode_mapping[grouped_attr][i] for i in data[grouped_attr]]
-
- # mapping = pd.Index(decode_mapping[grouped_attr])
- # data[grouped_attr] = mapping[data[grouped_attr]] # somehow this raises an error
-
- data[attributes] = pd.DataFrame(data[grouped_attr].tolist(), index=data.index)
- data = data.drop(grouped_attr, axis=1)
- return data
-
- def decode_other_attributes(self, data: pd.DataFrame, decode_mapping: dict):
- grouping_attr = [info["grouped_name"] for info in self.config['grouping_attributes']]
- binning_attr = [attr for attr in self.config['numerical_binning'].keys()]
- for attr, mapping in decode_mapping.items():
- if attr in grouping_attr or attr in binning_attr:
- continue
- else:
- mapping = pd.Index(mapping)
- data[attr] = mapping[data[attr]]
- return data
-
- def add_determined_attrs(self, data: pd.DataFrame):
- """
- Some attributes are determined by other attributes
- """
- determined_info = self.config['determined_attributes']
- for determined_attr in determined_info.keys():
- control_attr = determined_info[determined_attr]['by']
- mapping = determined_info[determined_attr]['mapping']
- default = determined_info[determined_attr]['default']
- # type = data[control_attr].dtype
- mapping = {int(k): int(v) for k, v in mapping.items()}
- data[determined_attr] = data.apply(lambda row: mapping.get(row[control_attr], default), axis=1)
- return data
-
- def ensure_types(self, data: pd.DataFrame):
- for col, data_type in COLS.items():
- data[col] = data[col].astype(data_type)
- return data
diff --git a/examples/DPSyn/dataloader/parameters.json b/examples/DPSyn/dataloader/parameters.json
deleted file mode 100644
index 55fa6b8..0000000
--- a/examples/DPSyn/dataloader/parameters.json
+++ /dev/null
@@ -1,490 +0,0 @@
-{
- "runs": [
- {
- "epsilon": 0.1,
- "delta": 3.4498908254380166e-11,
- "max_records": 1350000,
- "max_records_per_individual": 7
- },
- {
- "epsilon": 1.0,
- "delta": 3.4498908254380166e-11,
- "max_records": 1350000,
- "max_records_per_individual": 7
- },
- {
- "epsilon": 10.0,
- "delta": 3.4498908254380166e-11,
- "max_records": 1350000,
- "max_records_per_individual": 7
- }
- ],
- "schema": {
- "PUMA": {
- "dtype": "str",
- "values": [
- "17-1001",
- "17-104",
- "17-105",
- "17-1104",
- "17-1105",
- "17-1204",
- "17-1205",
- "17-1300",
- "17-1500",
- "17-1602",
- "17-1701",
- "17-1900",
- "17-2000",
- "17-202",
- "17-2100",
- "17-2200",
- "17-2300",
- "17-2400",
- "17-2501",
- "17-2601",
- "17-2700",
- "17-2801",
- "17-2901",
- "17-300",
- "17-3005",
- "17-3007",
- "17-3008",
- "17-3009",
- "17-3102",
- "17-3105",
- "17-3106",
- "17-3107",
- "17-3108",
- "17-3202",
- "17-3203",
- "17-3204",
- "17-3205",
- "17-3207",
- "17-3208",
- "17-3209",
- "17-3306",
- "17-3307",
- "17-3308",
- "17-3309",
- "17-3310",
- "17-3401",
- "17-3407",
- "17-3408",
- "17-3409",
- "17-3410",
- "17-3411",
- "17-3412",
- "17-3413",
- "17-3414",
- "17-3415",
- "17-3416",
- "17-3417",
- "17-3418",
- "17-3419",
- "17-3420",
- "17-3421",
- "17-3422",
- "17-3501",
- "17-3502",
- "17-3503",
- "17-3504",
- "17-3520",
- "17-3521",
- "17-3522",
- "17-3523",
- "17-3524",
- "17-3525",
- "17-3526",
- "17-3527",
- "17-3528",
- "17-3529",
- "17-3530",
- "17-3531",
- "17-3532",
- "17-3601",
- "17-3602",
- "17-3700",
- "17-401",
- "17-501",
- "17-600",
- "17-700",
- "17-800",
- "17-900",
- "39-100",
- "39-1000",
- "39-1100",
- "39-1200",
- "39-1300",
- "39-1400",
- "39-1500",
- "39-1600",
- "39-1700",
- "39-1801",
- "39-1802",
- "39-1803",
- "39-1804",
- "39-1805",
- "39-1900",
- "39-200",
- "39-2000",
- "39-2100",
- "39-2200",
- "39-2300",
- "39-2400",
- "39-2500",
- "39-2600",
- "39-2700",
- "39-2800",
- "39-2900",
- "39-300",
- "39-3000",
- "39-3100",
- "39-3200",
- "39-3300",
- "39-3400",
- "39-3500",
- "39-3600",
- "39-3700",
- "39-3800",
- "39-3900",
- "39-400",
- "39-4000",
- "39-4101",
- "39-4102",
- "39-4103",
- "39-4104",
- "39-4105",
- "39-4106",
- "39-4107",
- "39-4108",
- "39-4109",
- "39-4110",
- "39-4111",
- "39-4200",
- "39-4300",
- "39-4400",
- "39-4500",
- "39-4601",
- "39-4602",
- "39-4603",
- "39-4604",
- "39-4700",
- "39-4800",
- "39-4900",
- "39-500",
- "39-5000",
- "39-5100",
- "39-5200",
- "39-5301",
- "39-5302",
- "39-5401",
- "39-5402",
- "39-5403",
- "39-5501",
- "39-5502",
- "39-5503",
- "39-5504",
- "39-5505",
- "39-5506",
- "39-5507",
- "39-5600",
- "39-5700",
- "39-600",
- "39-700",
- "39-801",
- "39-802",
- "39-901",
- "39-902",
- "39-903",
- "39-904",
- "39-905",
- "39-906",
- "39-907",
- "39-908",
- "39-909",
- "39-910"
- ]
- },
- "YEAR": {
- "dtype": "uint32",
- "values": [
- 2012,
- 2013,
- 2014,
- 2015,
- 2016,
- 2017,
- 2018
- ]
- },
- "HHWT": {
- "dtype": "float"
- },
- "GQ": {
- "values": [
- 0,
- 1,
- 2,
- 3,
- 4,
- 5,
- 6
- ],
- "dtype": "uint8"
- },
- "PERWT": {
- "dtype": "float"
- },
- "SEX": {
- "values": [
- 1,
- 2
- ],
- "dtype": "uint8"
- },
- "AGE": {
- "min": 0,
- "max": 135,
- "dtype": "uint8"
- },
- "MARST": {
- "values": [
- 1,
- 2,
- 3,
- 4,
- 5,
- 6
- ],
- "dtype": "uint8"
- },
- "RACE": {
- "values": [
- 1,
- 2,
- 3,
- 4,
- 5,
- 6,
- 7,
- 8,
- 9
- ],
- "dtype": "uint8"
- },
- "HISPAN": {
- "values": [
- 0,
- 1,
- 2,
- 3,
- 4,
- 9
- ],
- "dtype": "uint8"
- },
- "CITIZEN": {
- "values": [
- 0,
- 1,
- 2,
- 3,
- 4,
- 5,
- 6
- ],
- "dtype": "uint8"
- },
- "SPEAKENG": {
- "values": [
- 0,
- 1,
- 2,
- 3,
- 4,
- 5,
- 6,
- 7,
- 8
- ],
- "dtype": "uint8"
- },
- "HCOVANY": {
- "values": [
- 1,
- 2
- ],
- "dtype": "uint8"
- },
- "HCOVPRIV": {
- "values": [
- 1,
- 2
- ],
- "dtype": "uint8"
- },
- "HINSEMP": {
- "values": [
- 1,
- 2
- ],
- "dtype": "uint8"
- },
- "HINSCAID": {
- "values": [
- 1,
- 2
- ],
- "dtype": "uint8"
- },
- "HINSCARE": {
- "values": [
- 1,
- 2
- ],
- "dtype": "uint8"
- },
- "EDUC": {
- "values": [
- 0,
- 1,
- 2,
- 3,
- 4,
- 5,
- 6,
- 7,
- 8,
- 9,
- 10,
- 11
- ],
- "dtype": "uint8"
- },
- "EMPSTAT": {
- "values": [
- 0,
- 1,
- 2,
- 3
- ],
- "dtype": "uint8"
- },
- "EMPSTATD": {
- "values": [
- 0,
- 10,
- 11,
- 12,
- 13,
- 14,
- 15,
- 20,
- 21,
- 22,
- 30,
- 31,
- 32,
- 33,
- 34
- ],
- "dtype": "uint8"
- },
- "LABFORCE": {
- "values": [
- 0,
- 1,
- 2
- ],
- "dtype": "uint8"
- },
- "WRKLSTWK": {
- "values": [
- 0,
- 1,
- 2,
- 3
- ],
- "dtype": "uint8"
- },
- "ABSENT": {
- "values": [
- 0,
- 1,
- 2,
- 3,
- 4
- ],
- "dtype": "uint8"
- },
- "LOOKING": {
- "values": [
- 0,
- 1,
- 2,
- 3
- ],
- "dtype": "uint8"
- },
- "AVAILBLE": {
- "values": [
- 0,
- 1,
- 2,
- 3,
- 4,
- 5
- ],
- "dtype": "uint8"
- },
- "WRKRECAL": {
- "values": [
- 0,
- 1,
- 2,
- 3
- ],
- "dtype": "uint8"
- },
- "WORKEDYR": {
- "values": [
- 0,
- 1,
- 2,
- 3
- ],
- "dtype": "uint8"
- },
- "INCTOT": {
- "dtype": "int32"
- },
- "INCWAGE": {
- "dtype": "int32"
- },
- "INCWELFR": {
- "dtype": "int32"
- },
- "INCINVST": {
- "dtype": "int32"
- },
- "INCEARN": {
- "dtype": "int32"
- },
- "POVERTY": {
- "min": 0,
- "max": 501,
- "dtype": "uint32"
- },
- "DEPARTS": {
- "min": 0,
- "max": 2359,
- "dtype": "uint32"
- },
- "ARRIVES": {
- "min": 0,
- "max": 2359,
- "dtype": "uint32"
- }
- }
-}
\ No newline at end of file
diff --git a/examples/DPSyn/dataloader/pkl/.gitkeep b/examples/DPSyn/dataloader/pkl/.gitkeep
deleted file mode 100644
index e69de29..0000000
diff --git a/examples/DPSyn/dataloader/read_csv_kwargs.json b/examples/DPSyn/dataloader/read_csv_kwargs.json
deleted file mode 100644
index c0d0c67..0000000
--- a/examples/DPSyn/dataloader/read_csv_kwargs.json
+++ /dev/null
@@ -1,40 +0,0 @@
-{
- "dtype": {
- "PUMA": "str",
- "YEAR": "uint32",
- "HHWT": "float",
- "GQ": "uint8",
- "PERWT": "float",
- "SEX": "uint8",
- "AGE": "uint8",
- "MARST": "uint8",
- "RACE": "uint8",
- "HISPAN": "uint8",
- "CITIZEN": "uint8",
- "SPEAKENG": "uint8",
- "HCOVANY": "uint8",
- "HCOVPRIV": "uint8",
- "HINSEMP": "uint8",
- "HINSCAID": "uint8",
- "HINSCARE": "uint8",
- "EDUC": "uint8",
- "EMPSTAT": "uint8",
- "EMPSTATD": "uint8",
- "LABFORCE": "uint8",
- "WRKLSTWK": "uint8",
- "ABSENT": "uint8",
- "LOOKING": "uint8",
- "AVAILBLE": "uint8",
- "WRKRECAL": "uint8",
- "WORKEDYR": "uint8",
- "INCTOT": "int32",
- "INCWAGE": "int32",
- "INCWELFR": "int32",
- "INCINVST": "int32",
- "INCEARN": "int32",
- "POVERTY": "uint32",
- "DEPARTS": "uint32",
- "ARRIVES": "uint32"
- },
- "skipinitialspace": true
-}
diff --git a/examples/DPSyn/dataloader/submission_format.csv b/examples/DPSyn/dataloader/submission_format.csv
deleted file mode 100644
index 7682eeb..0000000
--- a/examples/DPSyn/dataloader/submission_format.csv
+++ /dev/null
@@ -1,16 +0,0 @@
-epsilon,PUMA,YEAR,HHWT,GQ,PERWT,SEX,AGE,MARST,RACE,HISPAN,CITIZEN,SPEAKENG,HCOVANY,HCOVPRIV,HINSEMP,HINSCAID,HINSCARE,EDUC,EMPSTAT,EMPSTATD,LABFORCE,WRKLSTWK,ABSENT,LOOKING,AVAILBLE,WRKRECAL,WORKEDYR,INCTOT,INCWAGE,INCWELFR,INCINVST,INCEARN,POVERTY,DEPARTS,ARRIVES,sim_individual_id
-0.1,39-1900,2015,301.0,2,151.3,1,102,2,3,2,2,7,1,2,2,2,1,5,0,10,1,1,1,3,3,0,0,10882,22317,0,0,16966,48,474,1082,0
-0.1,39-801,2015,364.0,6,60.9,1,50,4,7,3,0,2,1,1,1,1,1,6,1,12,0,3,1,1,0,1,0,7875,8273,0,0,34490,427,775,34,1
-0.1,17-3602,2015,20.1,3,37.4,2,53,2,4,9,4,1,2,2,2,2,2,3,1,34,2,3,3,0,5,3,0,14763,9496,0,0,52936,409,2264,763,2
-0.1,39-3800,2016,15.2,6,18.1,1,70,1,8,0,3,7,2,1,1,1,1,7,2,11,0,2,4,1,1,0,3,59655,29289,0,0,12054,47,502,1751,3
-0.1,17-3205,2014,89.5,5,218.2,1,100,6,3,0,4,6,1,2,1,1,1,9,1,31,0,2,0,3,4,3,3,71437,8602,0,0,15848,170,540,1059,4
-1.0,17-2000,2018,34.2,5,212.4,2,44,6,9,4,5,3,2,2,1,2,2,6,3,21,2,1,0,0,0,2,1,15,13042,0,0,10905,197,1479,1434,0
-1.0,39-4110,2017,25.1,2,31.4,2,58,6,8,0,3,0,2,1,2,2,2,6,1,11,0,0,0,3,0,2,0,112669,15881,0,0,1008,223,1661,1153,1
-1.0,17-3413,2015,59.4,5,27.7,2,122,1,6,4,3,1,2,2,1,1,2,0,0,31,2,1,4,2,3,3,2,7463,37255,0,0,49743,147,1734,1843,2
-1.0,39-5503,2018,276.0,0,112.6,1,41,3,7,3,1,8,2,2,1,2,1,11,0,13,2,2,0,0,2,2,3,56253,58205,0,0,15551,160,627,586,3
-1.0,39-2800,2015,228.8,3,86.9,2,98,1,2,9,1,1,2,1,2,1,1,3,0,12,0,1,4,3,3,3,3,40897,38289,0,0,11047,175,459,954,4
-10.0,39-300,2017,30.8,5,89.1,2,108,3,6,2,2,1,1,2,1,2,1,0,0,31,0,3,2,3,2,3,0,1337,156367,0,0,19043,45,2146,1768,0
-10.0,17-3205,2016,55.9,4,84.9,2,115,5,5,0,2,1,1,2,1,2,2,2,3,33,1,3,1,2,2,1,1,11682,56648,0,0,4407,245,1500,1705,1
-10.0,17-3007,2014,29.6,4,8.5,1,56,4,4,0,5,2,1,1,2,2,2,6,2,14,2,0,1,2,5,0,1,26808,929,0,0,1141,325,1077,2311,2
-10.0,17-3413,2015,24.3,4,61.1,2,108,4,3,3,4,1,2,2,2,1,1,7,2,11,1,2,1,0,1,0,1,54233,35719,0,0,15775,57,633,512,3
-10.0,17-3202,2017,9.9,0,107.9,1,128,5,7,4,5,6,1,2,2,2,1,4,1,21,0,1,3,0,0,1,1,116635,5783,0,0,519,473,550,1635,4
diff --git a/examples/DPSyn/lib_dpsyn/consistent.py b/examples/DPSyn/lib_dpsyn/consistent.py
deleted file mode 100644
index 24081e1..0000000
--- a/examples/DPSyn/lib_dpsyn/consistent.py
+++ /dev/null
@@ -1,150 +0,0 @@
-import copy
-from loguru import logger
-
-import numpy as np
-from lib_dpsyn.view import View
-
-
-class Consistenter:
- class SubsetWithDependency:
- def __init__(self, attributes_set):
- self.attributes_set = attributes_set
- # a set of tuples this object depends on
- self.dependency = set()
-
- def __init__(self, views, num_categories):
- self.views = views
- self.num_categories = num_categories
- self.iterations = 30
-
- def compute_dependency(self):
- subsets_with_dependency = {}
- ret_subsets = {}
-
- for key, view in self.views.items():
- new_subset = self.SubsetWithDependency(view.attributes_set)
- subsets_temp = copy.deepcopy(subsets_with_dependency)
-
- for subset_key, subset_value in subsets_temp.items():
- attributes_intersection = subset_value.attributes_set & view.attributes_set
-
- if attributes_intersection:
- if tuple(attributes_intersection) not in subsets_with_dependency:
- intersection_subset = self.SubsetWithDependency(attributes_intersection)
- subsets_with_dependency[tuple(attributes_intersection)] = intersection_subset
-
- if not tuple(attributes_intersection) == subset_key:
- subsets_with_dependency[subset_key].dependency.add(tuple(attributes_intersection))
- new_subset.dependency.add(tuple(attributes_intersection))
-
- subsets_with_dependency[tuple(view.attributes_set)] = new_subset
-
- for subset_key, subset_value in subsets_with_dependency.items():
- if len(subset_key) == 1:
- subset_value.dependency = set()
-
- ret_subsets[subset_key] = subset_value
-
- return subsets_with_dependency
-
- def consist_views(self):
- def find_subset_without_dependency():
- for key, subset in subsets_with_dependency_temp.items():
- if not subset.dependency:
- return key, subset
-
- return None, None
-
- def find_views_containing_target(target):
- result = []
-
- for key, view in self.views.items():
- if target <= view.attributes_set:
- result.append(view)
-
- return result
-
- # current strategy: if two views disagree on the levels (v1: 4*2, v2: 2*4), make them consistent on 2*2
- def consist_on_subset(target):
- target_views = find_views_containing_target(target)
-
- common_view_indicator = np.zeros(self.num_categories.shape[0])
- for index in target:
- common_view_indicator[index] = 1
-
- common_view = View(common_view_indicator, self.num_categories)
- common_view.initialize_consist_parameters(len(target_views))
-
- for index, view in enumerate(target_views):
- common_view.project_from_bigger_view(view, index)
-
- common_view.calculate_delta()
- # if np.sum(np.absolute(common_view.delta)) > 1000:
- # print(common_view.attr_one_hot)
- # print(np.sum(np.absolute(common_view.delta)))
- if np.sum(np.absolute(common_view.delta)) > 1e-3:
- for index, view in enumerate(target_views):
- view.update_view(common_view, index)
-
- def remove_subset_from_dependency(target):
- for _, subset in subsets_with_dependency_temp.items():
- if tuple(target.attributes_set) in subset.dependency:
- subset.dependency.remove(tuple(target.attributes_set))
-
- # calculate necessary variables
- for key, view in self.views.items():
- view.calculate_tuple_key()
- view.generate_attributes_index_set()
- view.sum = np.sum(view.count)
-
- # calculate the dependency relationship
- subsets_with_dependency = self.compute_dependency()
- logger.debug("dependency computed")
-
- # the ripple step needs several iterations
- # for i in range(self.iterations):
- non_negativity = True
- iterations = 0
-
- while non_negativity and iterations < self.iterations:
- # first make sure the summations are the same
- consist_on_subset(set())
-
- for key, view in self.views.items():
- view.sum = np.sum(view.count)
-
- subsets_with_dependency_temp = copy.deepcopy(subsets_with_dependency)
-
- while len(subsets_with_dependency_temp) > 0:
- key, subset = find_subset_without_dependency()
-
- if not subset:
- break
-
- consist_on_subset(subset.attributes_set)
- remove_subset_from_dependency(subset)
- subsets_with_dependency_temp.pop(key, None)
-
- logger.debug("consist finish")
-
- nonneg_view_count = 0
-
- for key, view in self.views.items():
- if (view.count < 0.0).any():
- view.non_negativity()
- view.sum = np.sum(view.count)
- else:
- nonneg_view_count += 1
-
- if nonneg_view_count == len(self.views):
- logger.info("finish in %s round" % (iterations,))
- non_negativity = False
-
- iterations += 1
-
- logger.debug("non-negativity finish")
-
- # calculate normalized count
- for key, view in self.views.items():
- view.sum = np.sum(view.count)
- view.normalize_count = view.count if view.sum <= 0 else view.count / view.sum
diff --git a/examples/DPSyn/lib_dpsyn/record_synthesizer.py b/examples/DPSyn/lib_dpsyn/record_synthesizer.py
deleted file mode 100644
index 0364cc4..0000000
--- a/examples/DPSyn/lib_dpsyn/record_synthesizer.py
+++ /dev/null
@@ -1,242 +0,0 @@
-from loguru import logger
-from numpy import linalg as LA
-import copy
-
-import numpy as np
-import pandas as pd
-
-
-class RecordSynthesizer:
- records = None
- df = None
- error_tracker = None
-
- rounding_method = 'deterministic'
-
- under_cell_indices = None
- zero_cell_indices = None
- over_cell_indices = None
- records_throw_indices = None
-
- add_amount = 0
- add_amount_zero = 0
- reduce_amount = 0
-
- actual_marginal = None
- synthesize_marginal = None
- alpha = 1.0
-
- encode_records = None
- encode_records_sort_index = None
-
- def __init__(self, attrs, domains, num_records):
- self.attrs = attrs
- self.domains = domains
- self.num_records = num_records
-
- def update_alpha(self, iteration):
- self.alpha = 1.0 * 0.84 ** (iteration // 20)
-
- def update_order(self, iteration, views, iterate_keys):
-
- self.error_tracker.insert(loc=0, column=f"{iteration}-before", value=0)
-
- for key_i, key in enumerate(iterate_keys):
- self.track_error(views[key], key_i)
-
- sort_error_tracker = self.error_tracker.sort_values(by=f"{iteration}-before", ascending=False)
-
- self.error_tracker.insert(loc=0, column=f"{iteration}-after", value=0)
- return list(sort_error_tracker.index)
-
- def update_records(self, original_view, iteration):
- view = copy.deepcopy(original_view)
-
- if iteration % 2 == 0:
- self.complete_partial_ratio(view, 0.5)
- else:
- self.complete_partial_ratio(view, 1.0)
-
- def initialize_records(self, iterate_keys, method="random", singleton_views=None):
- self.records = np.empty([self.num_records, len(self.attrs)], dtype=np.uint32)
-
- for attr_i, attr in enumerate(self.attrs):
- if method == "random":
- self.records[:, attr_i] = np.random.randint(0, self.domains[attr_i], size=self.num_records)
-
- elif method == "singleton":
- self.records[:, attr_i] = self.generate_singleton_records(singleton_views[attr])
-
- self.df = pd.DataFrame(self.records, columns=self.attrs)
- self.error_tracker = pd.DataFrame(index=iterate_keys)
-
- def generate_singleton_records(self, singleton):
- record = np.empty(self.num_records, dtype=np.uint32)
- dist_cumsum = np.cumsum(singleton.count)
- start = 0
-
- for index, value in enumerate(dist_cumsum):
- end = int(round(value * self.num_records))
- record[start: end] = index
- start = end
-
- np.random.shuffle(record)
-
- return record
-
- def update_records_prepare(self, view):
- alpha = self.alpha
-
- # deal with under cells (synthesize_marginal < actual_marginal) where synthesize_marginal != 0
- self.under_cell_indices = np.where((self.synthesize_marginal < self.actual_marginal) & (self.synthesize_marginal != 0))[0]
-
- under_rate = (self.actual_marginal[self.under_cell_indices] - self.synthesize_marginal[self.under_cell_indices]) / self.synthesize_marginal[self.under_cell_indices]
- ratio_add = np.minimum(under_rate, np.full(self.under_cell_indices.shape[0], alpha))
- self.add_amount = self._rounding(ratio_add * self.synthesize_marginal[self.under_cell_indices] * self.num_records)
-
- # deal with the case synthesize_marginal == 0 and actual_marginal != 0
- self.zero_cell_indices = np.where((self.synthesize_marginal == 0) & (self.actual_marginal != 0))[0]
- self.add_amount_zero = self._rounding(alpha * self.actual_marginal[self.zero_cell_indices] * self.num_records)
-
- # determine the number of records to be removed
- self.over_cell_indices = np.where(self.synthesize_marginal > self.actual_marginal)[0]
- num_add_total = np.sum(self.add_amount) + np.sum(self.add_amount_zero)
-
- beta = self.find_optimal_beta(num_add_total, self.over_cell_indices)
- over_rate = (self.synthesize_marginal[self.over_cell_indices] - self.actual_marginal[self.over_cell_indices]) / self.synthesize_marginal[self.over_cell_indices]
- ratio_reduce = np.minimum(over_rate, np.full(self.over_cell_indices.shape[0], beta))
- self.reduce_amount = self._rounding(ratio_reduce * self.synthesize_marginal[self.over_cell_indices] * self.num_records).astype(int)
-
- # logger.debug("alpha: %s | beta: %s" % (alpha, beta))
- # logger.debug("num_boost: %s | num_reduce: %s" % (num_add_total, np.sum(self.reduce_amount)))
-
- # convert each record from multiple attributes to one attribute
- self.encode_records = np.matmul(self.records[:, view.attributes_index], view.encode_num)
- self.encode_records_sort_index = np.argsort(self.encode_records)
- self.encode_records = self.encode_records[self.encode_records_sort_index]
-
- def determine_throw_indices(self):
- valid_indices = np.nonzero(self.reduce_amount)[0]
- valid_cell_over_indices = self.over_cell_indices[valid_indices]
- valid_cell_num_reduce = self.reduce_amount[valid_indices]
- valid_data_over_index_left = np.searchsorted(self.encode_records, valid_cell_over_indices, side="left")
- valid_data_over_index_right = np.searchsorted(self.encode_records, valid_cell_over_indices, side="right")
-
- valid_num_reduce = np.sum(valid_cell_num_reduce)
- self.records_throw_indices = np.zeros(valid_num_reduce, dtype=np.uint32)
- throw_pointer = 0
-
- for i, cell_index in enumerate(valid_cell_over_indices):
- match_records_indices = self.encode_records_sort_index[valid_data_over_index_left[i]: valid_data_over_index_right[i]]
- throw_indices = np.random.choice(match_records_indices, valid_cell_num_reduce[i], replace=False)
-
- self.records_throw_indices[throw_pointer: throw_pointer + throw_indices.size] = throw_indices
- throw_pointer += throw_indices.size
-
- np.random.shuffle(self.records_throw_indices)
-
- def handle_zero_cells(self, view):
- # overwrite / partial when synthesize_marginal == 0
- if self.zero_cell_indices.size != 0:
- for index, cell_index in enumerate(self.zero_cell_indices):
- num_partial = int(self.add_amount_zero[index])
-
- if num_partial != 0:
- for i in range(view.view_num_attr):
- self.records[self.records_throw_indices[: num_partial], view.attributes_index[i]] = \
- view.tuple_key[cell_index, i]
-
- self.records_throw_indices = self.records_throw_indices[num_partial:]
-
- def complete_partial_ratio(self, view, complete_ratio):
- num_complete = np.rint(complete_ratio * self.add_amount).astype(int)
- num_partial = np.rint((1 - complete_ratio) * self.add_amount).astype(int)
-
- valid_indices = np.nonzero(num_complete + num_partial)
- num_complete = num_complete[valid_indices]
- num_partial = num_partial[valid_indices]
-
- valid_cell_under_indices = self.under_cell_indices[valid_indices]
- valid_data_under_index_left = np.searchsorted(self.encode_records, valid_cell_under_indices, side="left")
- valid_data_under_index_right = np.searchsorted(self.encode_records, valid_cell_under_indices, side="right")
-
- for valid_index, cell_index in enumerate(valid_cell_under_indices):
- match_records_indices = self.encode_records_sort_index[valid_data_under_index_left[valid_index]: valid_data_under_index_right[valid_index]]
-
- np.random.shuffle(match_records_indices)
-
- if self.records_throw_indices.shape[0] >= (num_complete[valid_index] + num_partial[valid_index]):
- # complete update code
- if num_complete[valid_index] != 0:
- self.records[self.records_throw_indices[: num_complete[valid_index]]] = self.records[
- match_records_indices[: num_complete[valid_index]]]
-
- # partial update code
- if num_partial[valid_index] != 0:
- self.records[np.ix_(
- self.records_throw_indices[num_complete[valid_index]: (num_complete[valid_index] + num_partial[valid_index])],
- view.attributes_index)] = view.tuple_key[cell_index]
-
- # update records_throw_indices
- self.records_throw_indices = self.records_throw_indices[num_complete[valid_index] + num_partial[valid_index]:]
-
- else:
- # TODO: simply apply the complete operation here; it is unclear whether this makes sense
- self.records[self.records_throw_indices] = self.records[match_records_indices[: self.records_throw_indices.size]]
-
- def find_optimal_beta(self, num_add_total, cell_over_indices):
- actual_marginal_under = self.actual_marginal[cell_over_indices]
- synthesize_marginal_under = self.synthesize_marginal[cell_over_indices]
-
- lower_bound = 0.0
- upper_bound = 1.0
- beta = 0.0
- current_num = 0.0
- iteration = 0
-
- while abs(num_add_total - current_num) >= 1.0:
- beta = (upper_bound + lower_bound) / 2.0
- current_num = np.sum(
- np.minimum((synthesize_marginal_under - actual_marginal_under) / synthesize_marginal_under,
- np.full(cell_over_indices.shape[0], beta)) * synthesize_marginal_under * self.records.shape[0])
-
- if current_num < num_add_total:
- lower_bound = beta
- elif current_num > num_add_total:
- upper_bound = beta
- else:
- return beta
-
- iteration += 1
- if iteration > 50:
- # logger.warning("cannot find the optimal beta")
- break
-
- return beta
-
- def track_error(self, view, key_i):
- self.actual_marginal = view.count
- count = view.count_records_general(self.records)
- self.synthesize_marginal = count / np.sum(count)
-
- l1_error = LA.norm(self.actual_marginal - self.synthesize_marginal, 1)
- self.error_tracker.iloc[key_i, 0] = l1_error
-
- # logger.info("the l1 error before updating is %s" % (l1_error,))
-
- def _rounding(self, vector):
- if self.rounding_method == 'stochastic':
- ret_vector = np.zeros(vector.size)
- rand = np.random.rand(vector.size)
-
- integer = np.floor(vector)
- decimal = vector - integer
-
- ret_vector[rand > decimal] = np.floor(decimal[rand > decimal])
- ret_vector[rand < decimal] = np.ceil(decimal[rand < decimal])
- ret_vector += integer
- return ret_vector
- elif self.rounding_method == 'deterministic':
- return np.round(vector)
- else:
- raise NotImplementedError(self.rounding_method)
diff --git a/examples/DPSyn/lib_dpsyn/view.py b/examples/DPSyn/lib_dpsyn/view.py
deleted file mode 100644
index 14746e1..0000000
--- a/examples/DPSyn/lib_dpsyn/view.py
+++ /dev/null
@@ -1,221 +0,0 @@
-import numpy as np
-
-
-class View:
- def __init__(self, attr_one_hot: np.array, domain_size_list: np.array):
- self.attr_one_hot = attr_one_hot
- self.domain_size_list = domain_size_list
-
- self.domain_size = np.prod(self.domain_size_list[np.nonzero(self.attr_one_hot)[0]])
- self.total_num_attr = len(self.attr_one_hot)
- self.view_num_attr = np.count_nonzero(self.attr_one_hot)
-
- self.encode_num = np.zeros(self.view_num_attr, dtype=np.uint32)
- self.cum_mul = np.zeros(self.view_num_attr, dtype=np.uint32)
- self.attributes_index = np.nonzero(self.attr_one_hot)[0]
-
- self.count = np.zeros(self.domain_size)
- self.sum = 0
- self.calculate_encode_num(self.domain_size_list)
-
- self.attributes_set = set()
- self.tuple_key = np.array([0], dtype=np.uint32)
-
- self.count_matrix = None
- self.summations = None
- self.weights = []
- self.delta = 0
- self.weight_coeff = 1
-
- ########################################### general functions ####################################
- def calculate_encode_num(self, domain_size_list):
- if self.view_num_attr != 0:
- categories_index = self.attributes_index
-
- categories_num = domain_size_list[categories_index]
- categories_num = np.roll(categories_num, 1)
- categories_num[0] = 1
- self.cum_mul = np.cumprod(categories_num)
-
- categories_num = domain_size_list[categories_index]
- categories_num = np.roll(categories_num, self.view_num_attr - 1)
- categories_num[-1] = 1
- categories_num = np.flip(categories_num)
- self.encode_num = np.flip(np.cumprod(categories_num))
-
- def calculate_tuple_key(self):
- self.tuple_key = np.zeros([self.domain_size, self.view_num_attr], dtype=np.uint32)
-
- if self.view_num_attr != 0:
- for i in range(self.attributes_index.shape[0]):
- index = self.attributes_index[i]
- categories = np.arange(self.domain_size_list[index])
- column_key = np.tile(np.repeat(categories, self.encode_num[i]), self.cum_mul[i])
-
- self.tuple_key[:, i] = column_key
- else:
- self.tuple_key = np.array([0], dtype=np.uint32)
- self.domain_size = 1
-
- def count_records(self, records):
- encode_records = np.matmul(records[:, self.attributes_index], self.encode_num)
- encode_key, count = np.unique(encode_records, return_counts=True)
-
- indices = np.where(np.isin(np.arange(self.domain_size), encode_key))[0]
- self.count[indices] = count
-
- def calculate_count_matrix(self):
- shape = []
-
- for attri in self.attributes_index:
- shape.append(self.domain_size_list[attri])
-
- self.count_matrix = np.copy(self.count).reshape(tuple(shape))
-
- return self.count_matrix
-
- def generate_attributes_index_set(self):
- self.attributes_set = set(self.attributes_index)
-
- ################################### functions for external use #########################
- def calculate_encode_num_general(self, attributes_index):
- categories_index = attributes_index
-
- categories_num = self.domain_size_list[categories_index]
- categories_num = np.roll(categories_num, attributes_index.size - 1)
- categories_num[-1] = 1
- categories_num = np.flip(categories_num)
- encode_num = np.flip(np.cumprod(categories_num))
-
- return encode_num
-
- def count_records_general(self, records):
- count = np.zeros(self.domain_size)
-
- encode_records = np.matmul(records[:, self.attributes_index], self.encode_num)
- encode_key, value_count = np.unique(encode_records, return_counts=True)
-
- indices = np.where(np.isin(np.arange(self.domain_size), encode_key))[0]
- count[indices] = value_count
-
- return count
-
- def calculate_count_matrix_general(self, count):
- shape = []
-
- for attri in self.attributes_index:
- shape.append(self.domain_size_list[attri])
-
- return np.copy(count).reshape(tuple(shape))
-
- def calculate_tuple_key_general(self, unique_value_list):
- self.tuple_key = np.zeros([self.domain_size, self.view_num_attr], dtype=np.uint32)
-
- if self.view_num_attr != 0:
- for i in range(self.attributes_index.shape[0]):
- categories = unique_value_list[i]
- column_key = np.tile(np.repeat(categories, self.encode_num[i]), self.cum_mul[i])
-
- self.tuple_key[:, i] = column_key
- else:
- self.tuple_key = np.array([0], dtype=np.uint32)
- self.domain_size = 1
-
- def project_from_bigger_view_general(self, bigger_view):
- encode_num = np.zeros(self.total_num_attr, dtype=np.uint32)
- encode_num[self.attributes_index] = self.encode_num
- encode_num = encode_num[bigger_view.attributes_index]
-
- encode_records = np.matmul(bigger_view.tuple_key, encode_num)
-
- for i in range(self.domain_size):
- key_index = np.where(encode_records == i)[0]
- self.count[i] = np.sum(bigger_view.count[key_index])
-
- ######################################## functions for consistency #######################
- ############ used in common view #############
- def initialize_consist_parameters(self, num_target_views):
- self.summations = np.zeros([self.domain_size, num_target_views])
- self.weights = np.zeros(num_target_views)
-
- def calculate_delta(self):
- target = np.matmul(self.summations, self.weights) / np.sum(self.weights)
- self.delta = - (self.summations - target.reshape(len(target), 1))
-
- def project_from_bigger_view(self, bigger_view, index):
- encode_num = np.zeros(self.total_num_attr, dtype=np.uint32)
- encode_num[self.attributes_index] = self.encode_num
- encode_num = encode_num[bigger_view.attributes_index]
-
- encode_records = np.matmul(bigger_view.tuple_key, encode_num)
-
- self.weights[index] = bigger_view.weight_coeff / np.product(self.domain_size_list[np.setdiff1d(bigger_view.attributes_index, self.attributes_index)])
-
- for i in range(self.domain_size):
- key_index = np.where(encode_records == i)[0]
- self.summations[i, index] = np.sum(bigger_view.count[key_index])
-
- ############### used in views to be made consistent ###############
- def update_view(self, common_view, index):
- encode_num = np.zeros(self.total_num_attr, dtype=np.uint32)
- encode_num[common_view.attributes_index] = common_view.encode_num
- encode_num = encode_num[self.attributes_index]
-
- encode_records = np.matmul(self.tuple_key, encode_num)
-
- for i in range(common_view.domain_size):
- key_index = np.where(encode_records == i)[0]
- self.count[key_index] += common_view.delta[i, index] / len(key_index)
-
- def non_negativity(self):
- count = np.copy(self.count)
- self.norm_cut(count)
- # self.norm_sub(count)
- self.count = count
-
- @staticmethod
- def norm_sub(count):
- while (np.fabs(sum(count) - 1) > 1e-6) or (count < 0).any():
- count[count < 0] = 0
- total = sum(count)
- mask = count > 0
- if sum(mask) == 0:
- count[:] = 1.0 / len(count)
- break
- diff = (1 - total) / sum(mask)
- count[mask] += diff
- return count
-
- @staticmethod
- def norm_cut(count):
- # set all negative values to 0.0
- negative_indices = np.where(count < 0.0)[0]
- negative_total = abs(np.sum(count[negative_indices]))
- count[negative_indices] = 0.0
-
- # find all positive values
- positive_indices = np.where(count > 0.0)[0]
-
- if positive_indices.size != 0:
- positive_sort_indices = np.argsort(count[positive_indices])
- sort_cumsum = np.cumsum(count[positive_indices[positive_sort_indices]])
-
- # zero out the smallest positive values so that the total density is preserved
- threshold_indices = np.where(sort_cumsum <= negative_total)[0]
-
- if threshold_indices.size == 0:
- count[positive_indices[positive_sort_indices[0]]] = sort_cumsum[0] - negative_total
- else:
- count[positive_indices[positive_sort_indices[threshold_indices]]] = 0.0
- next_index = threshold_indices[-1] + 1
-
- if next_index < positive_sort_indices.size:
- count[positive_indices[positive_sort_indices[next_index]]] = sort_cumsum[next_index] - negative_total
- else:
- count[:] = 0.0
-
- return count
-
-
-if __name__ == "__main__":
- view = View([1, 1, 0, 0], [3, 3, 0, 0])
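The `norm_cut` routine removed above zeroes the negative cells and then takes the same amount of mass away from the smallest positive cells, so the total is unchanged. A minimal standalone sketch of that idea follows. It assumes a flat list of floats and uses plain Python instead of NumPy; the name `norm_cut_sketch` is hypothetical and does not appear in the removed file.

```python
def norm_cut_sketch(count):
    """Zero out negative cells, then remove the same mass from the smallest positive cells."""
    count = [float(c) for c in count]
    # Total mass that must be removed from positive cells to keep the sum unchanged.
    deficit = -sum(c for c in count if c < 0.0)
    count = [max(c, 0.0) for c in count]

    # Visit cells from smallest to largest and pay off the deficit.
    for idx in sorted(range(len(count)), key=lambda i: count[i]):
        if deficit <= 0.0:
            break
        removed = min(count[idx], deficit)
        count[idx] -= removed
        deficit -= removed
    return count


example = norm_cut_sketch([5.0, -3.0, 1.0, 4.0])
print(example, sum(example))
```

If the positive mass is smaller than the deficit, every cell ends at zero, which mirrors the behaviour of the removed code when no cell is left to absorb the remainder.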
diff --git a/examples/DPSyn/main.py b/examples/DPSyn/main.py
deleted file mode 100644
index 4d8857a..0000000
--- a/examples/DPSyn/main.py
+++ /dev/null
@@ -1,53 +0,0 @@
-import argparse
-import numpy as np
-import yaml
-from loguru import logger
-
-from dataloader.DataLoader import *
-from dataloader.RecordPostprocessor import RecordPostprocessor
-from method.sample_parallel import Sample
-from config.path import *
-
-import sdnist
-
-
-# Warning
-# This version is modified so as to pretrain on the private data
-# and train on the public data (unlike the original challenge)
-# The goal is to use the visualization tools of the public dataset.
-
-
-def main():
- with open(CONFIG_DATA, 'r') as f:
- config = yaml.load(f, Loader=yaml.BaseLoader)
-
- # dataloader initialization
- dataloader = DataLoader()
- dataloader.load_data()
-
- # sample
- eps, delta, sensitivity = 1, 2.5e-4, 7
- logger.info(f'working on eps={eps}, delta={delta}, and sensitivity={sensitivity}')
- synthesizer = Sample(dataloader, eps, delta, sensitivity)
- synthetic_data = synthesizer.synthesize()
-
- # postprocess
- postprocessor = RecordPostprocessor()
- synthetic_data = postprocessor.post_process(synthetic_data, args.config, dataloader.decode_mapping)
- logger.info("post-processed synthetic data")
- synthetic_data.to_csv("DP_synth.csv")  # output is saved to the working directory
-
- public_data, schema = sdnist.census(root="~/datasets", public=True)
- score = sdnist.score(public_data, synthetic_data, schema)
- print(score)
- score.html(browser=True)
-
-if __name__ == "__main__":
- parser = argparse.ArgumentParser()
-
- parser.add_argument("--config", type=str, default="./config/data.yaml",
- help="specify the path of config file in yaml")
-
- args = parser.parse_args()
- main()
-
diff --git a/examples/DPSyn/method/dpsyn.py b/examples/DPSyn/method/dpsyn.py
deleted file mode 100644
index 96f9580..0000000
--- a/examples/DPSyn/method/dpsyn.py
+++ /dev/null
@@ -1,422 +0,0 @@
-import copy
-import multiprocessing as mp
-from typing import List, Tuple, Dict, KeysView
-
-import numpy as np
-import pandas as pd
-from loguru import logger
-from numpy import linalg as LA
-
-from lib_dpsyn.consistent import Consistenter
-from lib_dpsyn.record_synthesizer import RecordSynthesizer
-from lib_dpsyn.view import View
-from method.synthesizer import Synthesizer
-
-
-class DPSyn(Synthesizer):
- synthesized_df = None
- update_iterations = 60
-
- attrs_view_dict = {}
- onehot_view_dict = {}
-
- attr_list = []
- domain_list = []
- attr_index_map = {}
-
- Attrs = List[str]
- Domains = np.ndarray
- Marginals = Dict[Tuple[str], np.array]
- Clusters = Dict[Tuple[str], List[Tuple[str]]]
-
- d = None
-
- def obtain_consistent_marginals(self, priv_marginal_config, priv_split_method):
- # marginals are specified by a dict from attribute tuples to frequency (pandas) tables
- pub_marginals = self.data.generate_all_pub_marginals()
- noisy_marginals = self.get_noisy_marginals(priv_marginal_config, priv_split_method)
-
- num_synthesize_records = np.mean([np.sum(x.values) for _, x in noisy_marginals.items()]).round().astype(int)
- noisy_puma_year = noisy_marginals[frozenset(['PUMA', 'YEAR'])]
- del noisy_marginals[frozenset(['PUMA', 'YEAR'])]
-
- self.attr_list = self.data.obtain_attrs()
- self.domain_list = np.array([len(self.data.encode_schema[att]) for att in self.attr_list])
- self.attr_index_map = {att: att_i for att_i, att in enumerate(self.attr_list)}
-
- # views are wrappers of marginals with additional functions for consistency
- pub_onehot_view_dict, pub_attr_view_dict = self.construct_views(pub_marginals)
- noisy_onehot_view_dict, noisy_attr_view_dict = self.construct_views(noisy_marginals)
-
- # onehot_view_dict maps one-hot keys to views; attrs_view_dict maps attribute tuples to views
- # the two formats differ because the consistenter needs the former and the synthesizer the latter
- self.onehot_view_dict, self.attrs_view_dict = self.normalize_views(
- pub_onehot_view_dict,
- pub_attr_view_dict,
- noisy_attr_view_dict,
- self.attr_index_map,
- num_synthesize_records)
-
- consistenter = Consistenter(self.onehot_view_dict, self.domain_list)
- consistenter.consist_views()
-
- # consistenter uses unnormalized counts; after consistency, synthesizer uses normalized counts
- for _, view in self.onehot_view_dict.items():
- view.count /= sum(view.count)
-
- return noisy_puma_year, noisy_marginals
-
-
- def synthesize(self, fixed_n=0) -> pd.DataFrame:
- noisy_puma_year = self.obtain_consistent_marginals()
-
- # find clusters for synthesis; a cluster is a set of closely connected marginals
- # here we do not cluster and use all marginals as a single cluster
- clusters = self.cluster(self.attrs_view_dict)
-
- # target_marginals = self.data.generate_all_two_way_marginals_except_PUMA_YEAR(self.data.private_data)
- # pub_marginals = self.data.generate_all_pub_marginals()
- # self.calculate_l1_errors_v2(pub_marginals, self.attrs_view_dict, target_marginals, self.data.private_data)
-
- self.synthesize_records_PUMA_YEAR(noisy_puma_year, clusters, fixed_n)
- # self.synthesize_records_numbers(noisy_puma_year, clusters, fixed_n)
-
- return self.synthesized_df
-
- # synthesize for each possible number
- # then for each puma-year, we just duplicate the appropriate synthesized data
- def synthesize_records_numbers(self, puma_year: pd.DataFrame, clusters: Clusters, fixed_n: int):
- interval = 100
- puma_year = puma_year.round(interval).astype(int)
- details = False
- cell_count_max = puma_year.max().max()
- cell_count_min = puma_year.min().min()
- cell_count_min = cell_count_max - 100
-
- singleton_views = self.obtain_singleton_views(self.attrs_view_dict)
-
- for cluster_attrs, list_marginal_attrs in clusters.items():
- attrs_index_map = {attrs: index for index, attrs in enumerate(list_marginal_attrs)}
-
- pool = mp.Pool(mp.cpu_count())
- # pool = mp.Pool(1)
- manager = mp.Manager()
- self.d = manager.list([])
-
- counts = range(cell_count_min, cell_count_max, interval)
- for count_i, cell_count in enumerate(counts):
- logger.info(f"working on puma-year: {count_i + 1}/{len(counts)}")
- pool.apply_async(self.syn_puma_year, args=(
- count_i, 0, cell_count, attrs_index_map, singleton_views, list_marginal_attrs, details),
- callback=self.log_result)
- pool.close()
- pool.join()
-
- syn_df_list = []
- for puma, puma_row in puma_year.iterrows():
- for year_i, cell_count in enumerate(puma_row):
- for df in self.d:
- if df.shape[0] == cell_count:
- tmp = copy.deepcopy(df)
- tmp['PUMA'] = puma
- tmp['YEAR'] = year_i
- syn_df_list.append(tmp)
- self.synthesized_df = pd.concat(syn_df_list, ignore_index=True)
-
- # synthesize for each combination of puma-year because the final scoring is on puma-year
- def synthesize_records_PUMA_YEAR(self, puma_year: pd.DataFrame, clusters: Clusters, fixed_n: int):
- if fixed_n:
- puma_year = pd.DataFrame([[fixed_n]])
- details = True
- else:
- puma_year = puma_year.round().astype(int)
- details = False
-
- for cluster_attrs, list_marginal_attrs in clusters.items():
- logger.info("synthesizing for %s" % (cluster_attrs,))
-
- attrs_index_map = {attrs: index for index, attrs in enumerate(list_marginal_attrs)}
-
- singleton_views = self.obtain_singleton_views(self.attrs_view_dict)
-
- if fixed_n:
- for puma, puma_row in puma_year.iterrows():
- for year_i, cell_count in enumerate(puma_row):
- self.synthesized_df = self.syn_puma_year(puma, year_i, cell_count, attrs_index_map,
- singleton_views, list_marginal_attrs, details)
- else:
- pool = mp.Pool(mp.cpu_count())
- # pool = mp.Pool(1)
- manager = mp.Manager()
- self.d = manager.list([])
- for puma, puma_row in puma_year.iterrows():
- for year_i, cell_count in enumerate(puma_row):
- logger.info(f"working on puma-year: {puma + 1}/{len(puma_year)}-{year_i + 1}/{len(puma_row)}")
- pool.apply_async(self.syn_puma_year, args=(
- puma, puma_year, year_i, puma_row, cell_count, attrs_index_map, singleton_views,
- list_marginal_attrs, details), callback=self.log_result)
- pool.close()
- pool.join()
- self.synthesized_df = pd.concat(self.d, ignore_index=True)
- logger.info(f'finished with a list of {len(self.d)} dataframes')
- logger.info(f'the final dataframe size is {self.synthesized_df.shape}')
-
- def syn_puma_year(self, puma, year, cell_count, attrs_index_map, singleton_views, list_marginal_attrs, details):
- cur_syn_df = None
- synthesizer = RecordSynthesizer(self.attr_list, self.domain_list, cell_count)
- synthesizer.initialize_records(list_marginal_attrs, singleton_views=singleton_views)
-
- for update_iteration in range(self.update_iterations + 1):
-
- synthesizer.alpha = 1.0 * 0.84 ** (update_iteration // 20)
- error_sorted_attrs_list = synthesizer.update_order(update_iteration, self.attrs_view_dict,
- list_marginal_attrs)
-
- for cur_attrs in error_sorted_attrs_list:
- attrs_i = attrs_index_map[cur_attrs]
- view = self.attrs_view_dict[cur_attrs]
-
- synthesizer.track_error(view, attrs_i)
- synthesizer.update_records_prepare(view)
- synthesizer.determine_throw_indices()
- synthesizer.handle_zero_cells(view)
- synthesizer.update_records(view, update_iteration)
- synthesizer.track_error(view, attrs_i)
-
- if update_iteration % 20 == 0 and details:
- tmp_df = synthesizer.df.copy()
- tmp_df['iteration'] = update_iteration
- if cur_syn_df is None:
- cur_syn_df = tmp_df
- else:
- cur_syn_df = cur_syn_df.append(tmp_df, ignore_index=True)
- logger.info(update_iteration)
-
- if update_iteration == self.update_iterations:
- # target_marginals = self.data.generate_all_two_way_marginals_except_PUMA_YEAR(self.data.private_data)
- # T_M, T_S, M_S = self.calculate_l1_errors(synthesizer.records, target_marginals, self.attrs_view_dict)
- # logger.success(f'L1 errors of 2-way: T_M = {T_M}, T_S = {T_S}, M_S = {M_S}')
-
- # target_marginals = self.data.generate_all_one_way_marginals_except_PUMA_YEAR(self.data.private_data)
- # T_M, T_S, M_S = self.calculate_l1_errors(synthesizer.records, target_marginals, self.attrs_view_dict)
- # logger.success(f'L1 errors of 1-way: T_M = {T_M}, T_S = {T_S}, M_S = {M_S}')
- pass
- if cur_syn_df is None:
- cur_syn_df = synthesizer.df
- else:
- cur_syn_df.append(synthesizer.df)
- cur_syn_df.loc[:, 'PUMA'] = puma
- cur_syn_df.loc[:, 'YEAR'] = year
- return cur_syn_df
-
- @staticmethod
- def calculate_l1_errors(records, target_marginals, attrs_view_dict):
- l1_T_Ms = []
- l1_T_Ss = []
- l1_M_Ss = []
-
- for cur_attrs, target_marginal_pd in target_marginals.items():
- view = attrs_view_dict[cur_attrs]
- syn_marginal = view.count_records_general(records)
- target_marginal = target_marginal_pd.values.flatten()
-
- T = target_marginal / np.sum(target_marginal)
- M = view.count
- S = syn_marginal / np.sum(syn_marginal)
-
- l1_T_Ms.append(LA.norm(T - M, 1))
- l1_T_Ss.append(LA.norm(T - S, 1))
- l1_M_Ss.append(LA.norm(M - S, 1))
-
- return np.mean(l1_T_Ms), np.mean(l1_T_Ss), np.mean(l1_M_Ss)
-
- @staticmethod
- def normalize_views(pub_onehot_view_dict: Dict, pub_attr_view_dict, noisy_view_dict, attr_index_map, num_synthesize_records) -> Tuple[Dict, Dict]:
- pub_weight = 0.00
- noisy_weight = 1 - pub_weight
-
- for key, view in pub_onehot_view_dict.items():
- if noisy_view_dict:
- view.weight_coeff = 0.01
- # compute (num_synthesize_records / np.sum(view.count)) first; otherwise there are numerical problems
- view.count = view.count * (num_synthesize_records / np.sum(view.count))
- else:
- if not np.sum(view.count) == np.sum(list(pub_onehot_view_dict.values())[0].count):
- raise ValueError(
- f'view sizes do not match; maybe a data reading problem (current key: {key}, sum: {np.sum(view.count)}')
- view.count = view.count.astype(float)
-
- views_dict = pub_attr_view_dict
- onehot_view_dict = pub_onehot_view_dict
- for view_att, view in noisy_view_dict.items():
- if view_att in views_dict:
- views_dict[view_att].count = pub_weight * pub_attr_view_dict[view_att].count + noisy_weight * view.count
- views_dict[view_att].weight_coeff = pub_weight * pub_attr_view_dict[
- view_att].weight_coeff + noisy_weight * view.weight_coeff
- else:
- views_dict[view_att] = view
- view_onehot = DPSyn.one_hot(view_att, attr_index_map)
- onehot_view_dict[tuple(view_onehot)] = view
- return onehot_view_dict, views_dict
-
- @staticmethod
- def obtain_singleton_views(attrs_view_dict):
- singleton_views = {}
- for cur_attrs, view in attrs_view_dict.items():
- # puma and year won't be there because they only appear together (size=2)
- if len(cur_attrs) == 1:
- singleton_views[cur_attrs] = view
- return singleton_views
-
- def construct_views(self, marginals: Marginals) -> Tuple[Dict, Dict]:
- onehot_view_dict = {}
- attr_view_dict = {}
-
- for marginal_att, marginal_value in marginals.items():
- view_onehot = DPSyn.one_hot(marginal_att, self.attr_index_map)
- view = View(view_onehot, self.domain_list)
- view.count = marginal_value.values.flatten()
-
- onehot_view_dict[tuple(view_onehot)] = view
- attr_view_dict[marginal_att] = view
-
- if not len(view.count) == view.domain_size:
- raise Exception('no match')
-
- return onehot_view_dict, attr_view_dict
-
- def log_result(self, result):
- self.d.append(result)
-
- @staticmethod
- def build_attr_set(attrs: KeysView[Tuple[str]]) -> Tuple[str]:
- attrs_set = set()
-
- for attr in attrs:
- attrs_set.update(attr)
-
- return tuple(attrs_set)
-
- # simple clustering: just build the data structure; not doing any clustering
- def cluster(self, marginals: Marginals) -> Clusters:
- clusters = {}
- keys = []
- for marginal_attrs, _ in marginals.items():
- keys.append(marginal_attrs)
-
- clusters[DPSyn.build_attr_set(marginals.keys())] = keys
- return clusters
-
- @staticmethod
- def one_hot(cur_att, attr_index_map):
- cur_view_key = [0] * len(attr_index_map)
- for attr in cur_att:
- cur_view_key[attr_index_map[attr]] = 1
- return cur_view_key
-
- # synthesize cluster by cluster: the general function, not used for now
- # (we have a graph where nodes represent attributes and edges represent marginals,
- # it helps in terms of running time and accuracy if we do it cluster by cluster)
- def synthesize_records(self, attrs: Attrs, domains: Domains, clusters: Clusters, num_synthesize_records: int):
- for cluster_attrs, list_marginal_attrs in clusters.items():
- logger.info("synthesizing for %s" % (cluster_attrs,))
-
- # singleton_views = {attr: self.attr_view_dict[frozenset([attr])] for attr in attrs}
- singleton_views = {}
- for cur_attrs, view in self.attrs_view_dict.items():
- if len(cur_attrs) == 1:
- singleton_views[cur_attrs] = view
-
- synthesizer = RecordSynthesizer(attrs, domains, num_synthesize_records)
- synthesizer.initialize_records(list_marginal_attrs, singleton_views=singleton_views)
-
- attrs_index_map = {attrs: index for index, attrs in enumerate(list_marginal_attrs)}
-
- for update_iteration in range(self.update_iterations):
- logger.info("update round: %d" % (update_iteration,))
-
- synthesizer.update_alpha(update_iteration)
- sorted_error_attrs = synthesizer.update_order(update_iteration, self.attrs_view_dict,
- list_marginal_attrs)
-
- for attrs in sorted_error_attrs:
- attrs_i = attrs_index_map[attrs]
- synthesizer.update_records_prepare(self.attrs_view_dict[attrs])
- synthesizer.update_records(self.attrs_view_dict[attrs], attrs_i)
-
- self.synthesized_df.loc[:, cluster_attrs] = synthesizer.df.loc[:, cluster_attrs]
-
- def calculate_l1_errors_v2(self, M0, M1, M2, Te):
-
- l1_0_1 = []
- l1_1_2 = []
- l1_0_2 = []
- l1_t_0 = []
- l1_t_1 = []
- l1_t_2 = []
-
- count = 0
- total = len(M1)
- for cur_attrs, m1 in M1.items():
- if len(cur_attrs) == 1:
- continue
-
- count += 1
- # logger.info(f'working on {count}/{total}: {cur_attrs}')
-
- m0 = M0[cur_attrs].values.flatten()
- m1 = m1.count
- m2 = M2[cur_attrs].values.flatten()
-
- m0 = m0 / np.sum(m0)
- m1 = m1 / np.sum(m1)
- m2 = m2 / np.sum(m2)
-
- l1_0_1.append(LA.norm(m0 - m1, 1))
- l1_1_2.append(LA.norm(m1 - m2, 1))
- l1_0_2.append(LA.norm(m0 - m2, 1))
-
- tmp_l1_t_0 = []
- tmp_l1_t_1 = []
- tmp_l1_t_2 = []
-
- cur_attrs_list = [M0[cur_attrs].index.name, M0[cur_attrs].columns.name]
- indices = sorted([i for i in self.data.encode_mapping[cur_attrs_list[0]].values()])
- columns = sorted([i for i in self.data.encode_mapping[cur_attrs_list[1]].values()])
- att_list = ['PUMA', 'YEAR', ] + cur_attrs_list
- cur_Te = Te[att_list]
- puma_year_cur_Te = cur_Te.groupby(['PUMA', 'YEAR'])
- for puma_year, one_Te in puma_year_cur_Te:
- tmp = one_Te.assign(n=1).pivot_table(values='n', index=cur_attrs_list[0], columns=cur_attrs_list[1], aggfunc=np.sum, fill_value=0)
- marginal = tmp.reindex(index=indices, columns=columns).fillna(0).astype(np.int32).values.flatten()
- marginal = marginal / np.sum(marginal)
-
- tmp_l1_t_0.append(LA.norm(marginal - m0, 1))
- tmp_l1_t_1.append(LA.norm(marginal - m1, 1))
- tmp_l1_t_2.append(LA.norm(marginal - m2, 1))
-
- # print(tmp_l1_t_0)
- # print(tmp_l1_t_1)
- # print(tmp_l1_t_2)
- l1_t_0.append(np.mean(tmp_l1_t_0))
- l1_t_1.append(np.mean(tmp_l1_t_1))
- l1_t_2.append(np.mean(tmp_l1_t_2))
-
- # logger.success(f't_0 = {np.mean(tmp_l1_t_0)}, t_1 = {np.mean(tmp_l1_t_1)}, t_2 = {np.mean(tmp_l1_t_2)}')
- logger.success(f'0_1 = {np.mean(l1_0_1)}, 0_2 = {np.mean(l1_0_2)}, 1_2 = {np.mean(l1_1_2)}, t_0 = {np.mean(l1_t_0)}, t_1 = {np.mean(l1_t_1)}, t_2 = {np.mean(l1_t_2)}')
- exit()
- return
-
- def internal_synthesize(self, noisy_puma_year, fixed_n=0) -> pd.DataFrame:
- # find clusters for synthesis; a cluster is a set of closely connected marginals
- # here we do not cluster and use all marginals as a single cluster
- clusters = self.cluster(self.attrs_view_dict)
-
- # target_marginals = self.data.generate_all_two_way_marginals_except_PUMA_YEAR(self.data.private_data)
- # pub_marginals = self.data.generate_all_pub_marginals()
- # self.calculate_l1_errors_v2(pub_marginals, self.attrs_view_dict, target_marginals, self.data.private_data)
-
- self.synthesize_records_PUMA_YEAR(noisy_puma_year, clusters, fixed_n)
- # self.synthesize_records_numbers(noisy_puma_year, clusters, fixed_n)
-
- return self.synthesized_df
\ No newline at end of file
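The update loop in `syn_puma_year` above sets `synthesizer.alpha = 1.0 * 0.84 ** (update_iteration // 20)`, which is a step decay: the update rate stays constant for 20 iterations and is then multiplied by 0.84. A small sketch of that schedule is given here for reference. The function name `step_decay_alpha` and its keyword parameters are hypothetical; only the constants come from the removed code.

```python
def step_decay_alpha(update_iteration, base=1.0, decay=0.84, step=20):
    """Return the update rate for a given iteration under a step-decay schedule."""
    # Integer division keeps alpha constant within each block of `step` iterations.
    return base * decay ** (update_iteration // step)


# Iterations 0..60 inclusive match range(update_iterations + 1) with update_iterations = 60.
schedule = {it: step_decay_alpha(it) for it in (0, 19, 20, 40, 60)}
for iteration, alpha in schedule.items():
    print(iteration, alpha)
```

With 61 iterations the rate takes four distinct values, and the last one applies only to the final iteration.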
diff --git a/examples/DPSyn/method/sample_parallel.py b/examples/DPSyn/method/sample_parallel.py
deleted file mode 100644
index dcad0f2..0000000
--- a/examples/DPSyn/method/sample_parallel.py
+++ /dev/null
@@ -1,247 +0,0 @@
-import pandas as pd
-import numpy as np
-from loguru import logger
-import multiprocessing
-
-from method.dpsyn import DPSyn
-from lib_dpsyn.view import View
-
-
-class Sample(DPSyn):
-
- def synthesize(self) -> pd.DataFrame:
- '''
- decide budget distribution strategy
- '''
- eps_0 = self.sensitivity / 1400
- eps_1 = self.sensitivity / 50
- eps_2 = self.sensitivity / 35
- eps_3 = self.sensitivity / 25
-
- priv_marginal_config = {}
- priv_split_method = {}
- if self.eps < eps_2:
- # get only PUMA-YEAR(lap)
- logger.info("get only PUMA-YEAR(lap), no total count estimate")
- priv_marginal_config['priv_PUMA_YEAR'] = {'total_eps': self.eps, 'attributes': ['PUMA', 'YEAR']}
- priv_split_method['priv_PUMA_YEAR'] = 'lap'
- sample_data, scoring = 'pub', 'pub'
- else:
- # first try to use eps_0 to get total count of samples
- noisy_total_count = self.data.private_data.shape[0] + np.random.laplace(scale=self.sensitivity / eps_0)
- tau_1 = 1500 * self.sensitivity / noisy_total_count
- tau_2 = 6 * tau_1
- if self.eps < eps_0 + eps_1 + tau_1:
- # get only PUMA-YEAR(lap)
- logger.info(
- f"get only PUMA-YEAR(lap), total count estimate {noisy_total_count}, tau_1 {tau_1}, tau_2 {tau_2}")
- priv_marginal_config['priv_PUMA_YEAR'] = {'total_eps': self.eps - eps_0, 'attributes': ['PUMA', 'YEAR']}
- priv_split_method['priv_PUMA_YEAR'] = 'lap'
- sample_data, scoring = 'pub', 'pub'
- elif self.eps < eps_0 + eps_1 + tau_2:
- # get PUMA_YEAR(lap) + one way (lap)
- logger.info(
- f"get only PUMA-YEAR(lap)+ one way (lap), total count estimate {noisy_total_count}, tau_1 {tau_1}, tau_2 {tau_2}")
- x = self.eps - eps_0 - eps_1 - tau_1
- y = np.min([eps_3, eps_1 + x / 2])
- priv_marginal_config['priv_PUMA_YEAR'] = {'total_eps': y,
- 'attributes': ['PUMA', 'YEAR']}
- priv_marginal_config['priv_all_one_way'] = {'total_eps': self.eps - eps_0 - y}
- priv_split_method['priv_PUMA_YEAR'] = 'lap'
- priv_split_method['priv_all_one_way'] = 'lap'
- sample_data, scoring = 'dpsyn', '1way'
- else:
- # get PUMA_YEAR(lap) + two way (zcdp)
- logger.info(
- f"get only PUMA-YEAR(lap)+ two way (zcdp), total count estimate {noisy_total_count}, tau_1 {tau_1}, tau_2 {tau_2}")
- x = self.eps - eps_0 - eps_1 - tau_1
- y = np.min([eps_3, eps_1 + x / 2])
- priv_marginal_config['priv_PUMA_YEAR'] = {'total_eps': y,
- 'attributes': ['PUMA', 'YEAR']}
- priv_marginal_config['priv_all_two_way'] = {'total_eps': self.eps - eps_0 - y}
- priv_split_method['priv_PUMA_YEAR'] = 'lap'
- priv_split_method['priv_all_two_way'] = 'gauss'
- sample_data, scoring = 'dpsyn', '2way'
-
- num_processes = 5
- ''' obtain DP marginals (the only place that accesses private data, other than noisy_total_count) '''
- noisy_puma_year, noisy_marginals = self.obtain_consistent_marginals(priv_marginal_config, priv_split_method)
-
- '''
- generate data
- 1. when eps is sufficiently large (eps>=0.2), call the parent class (DPSyn)'s method with fixed_n=10000
- 2. when eps is small (e.g. eps<0.2), sample from pub
- self.obtain_consistent_marginals() already got the noisy marginals and stored them in self.attrs_view_dict
- self.obtain_consistent_marginals() also built other data structures
- '''
- if sample_data == 'dpsyn':
- logger.info("DPsyn generate candidate samples")
- init_data = super().internal_synthesize(noisy_puma_year, fixed_n=10000)
- init_data = init_data.drop('iteration', axis=1)
- init_data = init_data.values
- else:
- logger.info(f'eps {self.eps}, sampling from pub data')
- init_data = self.data.public_data.sample(n=int(10000)).values
-
- attrs = self.data.obtain_attrs()
-
- '''
- calculate weights based on number of attributes in marginals
- '''
- weights = {}
- # generate weights for marginals
- for cur_attrs in self.attrs_view_dict.keys():
- attrs_info = self.data.get_marginal_grouping_info(cur_attrs)
- weight = 1
- for attr, sub_attrs in attrs_info.items():
- weight *= len(sub_attrs)
- weights[cur_attrs] = weight
-
- '''
- decide which marginals will be used in scoring in sampling
- '''
- if scoring is None or scoring == 'pub':
- logger.info("using pub 2-way as scoring marginals for sampling...")
- scoring_marginals = self.data.generate_all_two_way_marginals_except_PUMA_YEAR(self.data.public_data)
- elif scoring == '1way':
- # 1 way marginals are not consistent
- logger.info("using noisy 1-way as scoring marginals for sampling...")
- # noisy_marginals has only 1-ways
- scoring_marginals = {}
- for key, view in self.attrs_view_dict.items():
- if key in noisy_marginals:
- scoring_marginals[key] = view
- elif scoring == '2way':
- # 2-way marginals are consistent
- logger.info("using consistent 2-way as scoring marginals for sampling...")
- # noisy_marginals has only 2-ways
- scoring_marginals = {}
- for key, view in self.attrs_view_dict.items():
- if key in noisy_marginals:
- scoring_marginals[key] = view
- else:
- raise NotImplementedError
-
- '''
- generate one dataset per distinct rounded PUMA-YEAR size
- '''
- if self.eps < self.sensitivity / 35:
- # if eps is too small, round puma year to the closest 10
- rounded_puma_year = (np.round(noisy_puma_year / 10) * 10).astype(int)
- else:
- rounded_puma_year = noisy_puma_year.round(-2).astype(int)
- rounded_puma_year[rounded_puma_year < 300] = 300
- logger.info(f'sampling for sizes {np.unique(rounded_puma_year)}')
-
- ''' sample for the largest PUMA-YEAR size'''
- n = np.max(np.unique(rounded_puma_year))
- splited_data = np.split(init_data, num_processes)
- grouped_params = zip(splited_data, [scoring_marginals] * num_processes, [weights] * num_processes,
- [int(n / num_processes)] * num_processes, [int(n / num_processes / 3)] * num_processes,
- [self.attrs_view_dict] * num_processes)
- with multiprocessing.Pool(processes=num_processes) as pool:
- sub_set = pool.imap(self.map_sample, grouped_params)
- sample_data = list(sub_set)
- total_data = np.concatenate(sample_data, axis=0)
- logger.info("largest PUMA-YEAR sampling finish")
-
- ''' parallel sampling for each unique PUMA-YEAR size '''
- syn_data_count = {}
- num_sizes = len(np.unique(rounded_puma_year))
- grouped_params = zip([np.copy(total_data)] * num_sizes, [scoring_marginals] * num_sizes, [weights] * num_sizes,
- np.unique(rounded_puma_year), [int(n / 20)] * num_sizes,
- [self.attrs_view_dict] * num_sizes)
- with multiprocessing.Pool(processes=num_processes) as pool:
- sub_set = pool.imap(self.map_sample, grouped_params)
- sample_data = list(sub_set)
- for d in sample_data:
- syn_data_count[d.shape[0]] = d
-
- ''' assign samples to each PUMA-YEAR '''
- syn_df_list = []
- for puma, puma_row in rounded_puma_year.iterrows():
- for year_i, cell_count in enumerate(puma_row):
- # logger.info(f'=========== PUMA: {puma} YEAR: {year_i}, size:{cell_count} =================')
- tmp = np.copy(syn_data_count[cell_count])
- tmp = pd.DataFrame(tmp, columns=attrs)
- tmp['PUMA'] = puma
- tmp['YEAR'] = year_i
- syn_df_list.append(tmp)
- syn_pd = pd.concat(syn_df_list, ignore_index=True)
- return syn_pd
-
- '''
- parallel version of sampling
- '''
- @staticmethod
- def map_sample(grouped) -> np.ndarray:
- assert len(grouped) == 6
- init_data, target_marginals, weights, n, T, attrs_view_dict = grouped
- if n == init_data.shape[0]:
- return init_data
- ''' split candidate records into two parts, keep replacing records in D with R '''
- D = np.copy(init_data[:n, :])
- R = np.copy(init_data[n:, :])
- early_stopping_threshold = 0.0001
-
- pre_l1_error = len(target_marginals) * 2
- for i in range(T):
- D_scores = np.zeros(D.shape[0])
- R_scores = np.zeros(R.shape[0])
- l1_error = 0
- for cur_attrs, target_marginal in target_marginals.items():
- view = attrs_view_dict[cur_attrs]
- syn_marginal = view.count_records_general(D)
- if isinstance(target_marginal, pd.DataFrame):
- target_marginal = np.copy(target_marginal.values.flatten())
- elif isinstance(target_marginal, View):
- target_marginal = target_marginal.count
- target_marginal = target_marginal / np.sum(target_marginal) * np.sum(syn_marginal)
- l1_error += Sample._simple_l1(syn_marginal, target_marginal)
-
- under_cell_indices = np.where(syn_marginal < target_marginal - 1)[0]
- over_cell_indices = np.where(syn_marginal > target_marginal + 1)[0]
-
- # compute D_score and R_score
- for data, scores in zip([D, R], [D_scores, R_scores]):
- # for cell_indices in [over_cell_indices, under_cell_indices]:
- encode_records = np.matmul(data[:, view.attributes_index], view.encode_num)
-
- scores[np.in1d(encode_records, over_cell_indices)] += weights[cur_attrs]
- scores[np.in1d(encode_records, under_cell_indices)] -= weights[cur_attrs]
-
- # reverse R_score
- R_scores = -1 * R_scores
-
- D_scores_sort_index = np.argsort(D_scores)
- R_scores_sort_index = np.argsort(R_scores)
- ''' break ties at random when several records share the highest score '''
- d_i = Sample._sampled_largest_if_tie(D_scores, D_scores_sort_index)
- r_i = Sample._sampled_largest_if_tie(R_scores, R_scores_sort_index)
-
- tmp = np.copy(R[R_scores_sort_index[-r_i], :])
- R[R_scores_sort_index[-r_i], :] = np.copy(D[D_scores_sort_index[-d_i], :])
- D[D_scores_sort_index[-d_i], :] = tmp
-
- ''' check early stop '''
- if pre_l1_error - l1_error < early_stopping_threshold:
- logger.info(f' ==== EARLY STOP at round {i + 1}/{T}, threshold: {early_stopping_threshold} ')
- break
- else:
- pre_l1_error = l1_error
- return D
-
- @staticmethod
- def _simple_l1(m1, m2):
- normalize_m1 = m1 / np.sum(m1)
- normalize_m2 = m2 / np.sum(m2)
- return np.sum(np.abs(normalize_m1 - normalize_m2))
-
- @staticmethod
- def _sampled_largest_if_tie(scores, scores_sort_index):
- i = 1
- while i < len(scores) and scores[scores_sort_index[-i]] == scores[scores_sort_index[-i - 1]]:
- i += 1
- # logger.info(f"i {i}")
- i = np.random.randint(low=1, high=i + 1)
- return i
\ No newline at end of file
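The budget logic at the top of `Sample.synthesize` above picks one of three measurement strategies from `eps`, `sensitivity`, and an estimate of the total record count. The sketch below restates only that branching so the thresholds are easier to read. It is an illustration under assumptions: the function name `choose_strategy` and the returned labels are hypothetical, and the total count is passed in directly, whereas the removed code draws it with Laplace noise.

```python
def choose_strategy(eps, sensitivity, total_count):
    """Return which marginals the budget allows, following the removed branching."""
    eps_0 = sensitivity / 1400  # budget reserved for the total-count estimate
    eps_1 = sensitivity / 50
    eps_2 = sensitivity / 35

    if eps < eps_2:
        # Too little budget: measure only PUMA-YEAR and sample from public data.
        return "puma_year_only"

    tau_1 = 1500 * sensitivity / total_count
    tau_2 = 6 * tau_1
    if eps < eps_0 + eps_1 + tau_1:
        return "puma_year_only"
    if eps < eps_0 + eps_1 + tau_2:
        return "puma_year_plus_one_way"
    return "puma_year_plus_two_way"


for eps in (0.1, 0.205, 1.0):
    print(eps, choose_strategy(eps, sensitivity=7, total_count=1_000_000))
```

With `sensitivity=7`, `eps_2` equals 0.2, which agrees with the `eps>=0.2` remark in the docstring of the removed method.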
diff --git a/examples/DPSyn/method/synthesizer.py b/examples/DPSyn/method/synthesizer.py
deleted file mode 100644
index 221cd25..0000000
--- a/examples/DPSyn/method/synthesizer.py
+++ /dev/null
@@ -1,59 +0,0 @@
-import abc
-
-import numpy as np
-import pandas as pd
-from loguru import logger
-
-from dataloader.DataLoader import DataLoader
-from utils import advanced_composition
-from typing import Dict, Tuple
-
-
-class Synthesizer(abc.ABC):
-    # abc.ABC (not a Python 2 style __metaclass__ attribute) is what makes @abc.abstractmethod take effect
-    Marginals = Dict[Tuple[str, ...], np.ndarray]
-
-    def __init__(self, data: DataLoader, eps: float, delta: float, sensitivity: int):
-        self.data = data
-        self.eps = eps
-        self.delta = delta
-        self.sensitivity = sensitivity
-
-    @abc.abstractmethod
-    def synthesize(self, fixed_n: int) -> pd.DataFrame:
-        pass
-
-    # make sure the synthetic data size does not exceed the max allowed size
-    # currently not used: the sampled frame is discarded, so the input is returned unchanged
-    def synthesize_cutoff(self, submit_data: pd.DataFrame) -> pd.DataFrame:
-        if submit_data.shape[0] > 0:
-            submit_data.sample()
-        return submit_data
-
-    def anonymize(self, priv_marginal_sets: Dict, epss: Dict, priv_split_method: Dict) -> Marginals:
-        noisy_marginals = {}
-        for set_key, marginals in priv_marginal_sets.items():
-            eps = epss[set_key]
-            # noise_type, noise_param = advanced_composition.get_noise(eps, self.delta, self.sensitivity, len(marginals))
-            noise_type = priv_split_method[set_key]
-            if noise_type == 'lap':
-                noise_param = 1 / advanced_composition.lap_comp(eps, self.delta, self.sensitivity, len(marginals))
-                for marginal_att, marginal in marginals.items():
-                    marginal += np.random.laplace(scale=noise_param, size=marginal.shape)
-                    noisy_marginals[marginal_att] = marginal
-            else:
-                noise_param = advanced_composition.gauss_zcdp(eps, self.delta, self.sensitivity, len(marginals))
-                for marginal_att, marginal in marginals.items():
-                    noise = np.random.normal(scale=noise_param, size=marginal.shape)
-                    marginal += noise
-                    noisy_marginals[marginal_att] = marginal
-            logger.info(f"marginal {set_key} use eps={eps}, noise type:{noise_type}, noise parameter={noise_param}, sensitivity:{self.sensitivity}")
-        return noisy_marginals
-
-    def get_noisy_marginals(self, priv_marginal_config, priv_split_method):
-        ''' THIS IS THE ONLY PLACE THAT ACCESSES PRIVATE DATA (other than calculating noisy_totaly_count)'''
-        priv_marginal_sets, epss = self.data.generate_marginal_by_config(self.data.private_data, priv_marginal_config)
-        # Add DP noise to the private marginals with the pre-defined privacy allocation strategy
-        noisy_marginals = self.anonymize(priv_marginal_sets, epss, priv_split_method)
-        del priv_marginal_sets
-        return noisy_marginals
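Note on `anonymize` above: in the Laplace branch the budget `eps` is split evenly over the `k` marginals of a set, so each marginal receives noise with scale `k * sensitivity / eps` (the reciprocal of `lap_comp`). The sketch below illustrates that split with the standard library only. The helper names `laplace_scale`, `sample_laplace` and `add_laplace_noise` are hypothetical, marginals are plain lists rather than NumPy arrays, and, unlike the removed code, the sketch returns new lists instead of mutating its input.

```python
import random


def laplace_scale(eps, sensitivity, k):
    """Laplace scale per marginal when eps is split evenly over k marginals."""
    return k * sensitivity / eps


def sample_laplace(scale, rng):
    """Draw one Laplace(0, scale) sample.

    The difference of two independent exponentials with mean `scale`
    follows a Laplace distribution with that scale.
    """
    return rng.expovariate(1 / scale) - rng.expovariate(1 / scale)


def add_laplace_noise(marginals, eps, sensitivity, rng=None):
    """Return noisy copies of `marginals` (dict: attribute tuple -> list of counts)."""
    rng = rng or random.Random()
    scale = laplace_scale(eps, sensitivity, len(marginals))
    return {
        attrs: [count + sample_laplace(scale, rng) for count in counts]
        for attrs, counts in marginals.items()
    }


if __name__ == "__main__":
    demo = {("age",): [10.0, 20.0], ("sex", "race"): [1.0, 2.0, 3.0, 4.0]}
    print(add_laplace_noise(demo, eps=1.0, sensitivity=1))
```

Returning copies is a design choice of this sketch only; it keeps the raw counts available for comparison, whereas the removed method adds noise in place.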
diff --git a/examples/DPSyn/proof.pdf b/examples/DPSyn/proof.pdf
deleted file mode 100644
index 3e6275b..0000000
Binary files a/examples/DPSyn/proof.pdf and /dev/null differ
diff --git a/examples/DPSyn/readme.md b/examples/DPSyn/readme.md
deleted file mode 100644
index 00a44b1..0000000
--- a/examples/DPSyn/readme.md
+++ /dev/null
@@ -1,39 +0,0 @@
-# DPSyn
-
-### Overview
-This is the original code we submitted for Sprint 2.
-The code can be run once it is placed in the benchmark folder of the [runtime repo](https://github.com/drivendataorg/deid2-runtime/tree/sprint-2).
-The competition description is at [drivendata](https://www.drivendata.org/competitions/75/deid2-sprint-2-prescreened/page/285/),
-and the data can be downloaded [here](https://www.drivendata.org/competitions/75/deid2-sprint-2-prescreened/data/).
-We will refactor the code for public use and publish the polished version.
-
-### Description of the code
-* ``main.py`` is the entry point of our algorithm.
-
-* ``config/data.yaml`` is the configuration file with public dataset path, target dataset paths
-and binning/grouping attributes strategies.
-
-* ``dataloader/public.csv`` is the public data used in the open/pre-screened arena (IL and OH data).
-
-* ``dataloader/DataLoader.py`` is the dataloader used to load and preprocess both the public and private (target) datasets.
-It also contains the methods for generating 1-way/2-way marginals.
-
-* ``dataloader/RecordPostprocessor.py`` is for post-processing synthetic dataset to have the same attributes
-as input dataset.
-
-* ``method/sample_parallel.py`` is one of the key components in our algorithm.
-Details can be found in our pdf document.
-
-* ``method/dpsyn.py`` is another key component in our algorithm.
-It generates synthetic data (for sampling) based on the noisy 1-way or 2-way marginals,
-when we have sufficient privacy budget.
-
-* ``method/synthesizer.py`` is the base class. It contains functions for generating privacy-preserving
-marginals.
-
-* ``lib_dysyn/`` contains the classes for enforcing consistency on marginals and synthesizing data.
-
-* ``utils/advanced_composition.py`` contains functions to compute noise variance
-given privacy budget, sensitivity and number of queries.
-
-* ``proof.pdf`` contains the detailed description of the code and the formal proof.
\ No newline at end of file
diff --git a/examples/DPSyn/utils/advanced_composition.py b/examples/DPSyn/utils/advanced_composition.py
deleted file mode 100644
index e991e9f..0000000
--- a/examples/DPSyn/utils/advanced_composition.py
+++ /dev/null
@@ -1,164 +0,0 @@
-import math
-
-import numpy as np
-from scipy.optimize import fsolve
-
-
-def lap_comp(epsilon, delta, sensitivity, k):
-    return epsilon * 1.0 / k / sensitivity
-
-
-def lap_adv_comp(epsilon, delta, sensitivity, k):
-    def func(x_0):
-        eps_0 = x_0[0]
-        return math.sqrt(2 * k * math.log(1 / delta)) * eps_0 + k * (math.exp(eps_0) - 1) * eps_0 - epsilon
-
-    result = fsolve(func, np.array([0.0]))
-
-    return result[0] / sensitivity
-
-
-def gauss_adv_comp(epsilon, delta, sensitivity, k):
-    def gauss(delta_0):
-        dlt = delta - delta_0 * k
-
-        def eps_func(x_0):
-            eps_0 = x_0[0]
-            return math.sqrt(2 * k * math.log(1 / dlt)) * eps_0 + k * (math.exp(eps_0) - 1) * eps_0 - epsilon
-
-        epsilon_0 = fsolve(eps_func, np.array([0.0]))[0]
-        sigma = sensitivity * np.sqrt(2 * math.log(1.25 / delta_0)) / epsilon_0
-        return sigma
-
-    l, h = 1e-30, delta * 1.0 / k / 1.1
-    min_delta_0 = my_minimize(gauss, l, h)
-    return gauss(min_delta_0)
-
-
-def my_minimize(func, l, h):
-    vfunc = np.vectorize(func)
-    cur_l, cur_h = l, h
-    n = 20000
-    for i in range(10):
-        xs = np.linspace(cur_l, cur_h, n)
-        vs = vfunc(xs)
-        vs_index = np.argsort(vs)
-        cur_l_index, cur_h_index = vs_index[0], vs_index[1]
-        cur_l, cur_h = xs[cur_l_index], xs[cur_h_index]
-
-    return (cur_l + cur_h) / 2
-
-
-def gauss_renyi(epsilon, delta, sensitivity, k):
-    def renyi(low):
-        epsilon0 = max(1e-20, epsilon - np.log(1.0 / delta) * 1.0 / (low - 1))
-        sigma = np.sqrt(k * low * sensitivity ** 2 * 1.0 / 2 / epsilon0)
-        return sigma
-
-    l, h = 1.00001, 100000
-    min_low = my_minimize(renyi, l, h)
-    min_sigma = renyi(min_low)
-
-    return min_sigma
-
-
-def gauss_zcdp(epsilon, delta, sensitivity, k):
-    tmp_var = 2 * k * sensitivity ** 2 * math.log(1 / delta)
-
-    sigma = (math.sqrt(tmp_var) + math.sqrt(tmp_var + 2 * k * sensitivity ** 2 * epsilon)) / (2 * epsilon)
-
-    return sigma
-
-
-# zcdp and zcdp2 and rdp perform the same
-def gauss_zcdp2(epsilon, delta, sensitivity, k):
-    my_log = math.log(1 / delta)
-
-    sigma = sensitivity * math.sqrt(k / 2) / (math.sqrt(epsilon + my_log) - math.sqrt(my_log))
-
-    return sigma
-
-
-def lap_zcdp_comp(epsilon, delta, sensitivity, k):
-    return math.sqrt(2.0 * (math.sqrt(k) * sensitivity / epsilon) ** 2)
-
-
-def get_noise(eps, delta, sensitivity, num_composition):
-    lap_param = lap_comp(eps, delta, sensitivity, num_composition)
-    lap_naive_var = 2 * (1.0 / lap_param ** 2)
-
-    gauss_param = gauss_zcdp(eps, delta, sensitivity, num_composition)
-    gauss_var_zcdp = gauss_param ** 2
-    if lap_naive_var < gauss_var_zcdp:
-        return 'lap', 1 / lap_param
-    else:
-        return 'gauss', gauss_param
-
-
-# print(gauss_zcdp(0.88, 3e-11, 7, 22) ** 2)
-# print(2 * (1.0 / lap_comp(0.88, 3e-11, 7, 22) ** 2))
-#
-# print(gauss_zcdp(10, 3e-11, 7, 233) ** 1)
-# print(2 * (1.0 / lap_comp(10, 3e-11, 7, 233) ** 2))
-
-'''
-# when 2 * k ** 0.5 > (math.log(1/delta) + epsilon) ** 0.5 + (math.log(1/delta)) ** 0.5, zcdp gives better accuracy
-
-total_epss = [0.001, 0.002]
-total_epss = [0.3, 1, 8]
-total_epss = [0.252, 0.87, 7.28]
-total_epss = [0.01]
-sensitivitys = [1]
-n = 30000
-# total_delta = 1.0 / n ** 2
-total_delta = 1e-15
-ks = [1, 5, 10, 15, 20, 30, 50, 100]
-
-alpha = 0.05
-domain_size = 100
-
-for total_eps in total_epss:
- for sensitivity in sensitivitys:
- print(total_eps)
- for k in ks:
- lap_param = lap_comp(total_eps, total_delta, sensitivity, k)
- lap_naive_var = 2 * (1.0 / lap_param ** 2)
- print('& $%.1E$' % lap_naive_var, end='\t')
-
- print('\\\\')
- for k in ks:
- lap_param = lap_adv_comp(total_eps, total_delta, sensitivity, k)
- lap_var = 2 * (1.0 / lap_param ** 2)
- print('& $%.1E$' % lap_var, end='\t')
-
- print('\\\\')
- for k in ks:
- gauss_param = gauss_adv_comp(total_eps, total_delta, sensitivity, k)
- gauss_var_adv = gauss_param ** 2
- print('& $%.1E$' % gauss_var_adv, end='\t')
-
- # gauss_param = gauss_renyi(total_eps, total_delta, sensitivity, k)
- # gauss_var = gauss_param ** 2
-
- print('\\\\')
- for k in ks:
- gauss_param = gauss_zcdp(total_eps, total_delta, sensitivity, k)
- gauss_var_zcdp = gauss_param ** 2
- print('& $%.1E$' % gauss_var_zcdp, end='\t')
-
- # gauss_param3 = gauss_zcdp2(total_eps, total_delta, sensitivity, k)
- # gauss_var_zcdp3 = gauss_param3 ** 2
-
- # print(
- # f'${total_eps}, ${sensitivity}$, ${math.sqrt(lap_naive_var)}$, ${math.sqrt(lap_var)}$, ${math.sqrt(gauss_var_adv)}$, ${math.sqrt(gauss_var_zcdp)}$')
-
- norm_scale = gauss_param
- lap_scale = 1.0 / lap_param
- norm_mean = 0
- laplace_mean = 0
-
- # norm_threshold = 4.0 * norm.ppf(1 - alpha / domain_size, norm_mean, norm_scale)
- # lap_threshold = 4.0 * laplace.ppf(1 - alpha / domain_size, laplace_mean, lap_scale)
- # print(norm_threshold, lap_threshold)
- print('\\\\')
-'''
\ No newline at end of file
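Note on `gauss_zcdp` and `get_noise` above: the closed form in `gauss_zcdp2` can be read as solving `eps = rho + 2 * sqrt(rho * ln(1/delta))` for the noise scale, where `rho = k * sensitivity**2 / (2 * sigma**2)` is the zCDP cost of `k` Gaussian releases. The sketch below, using only the standard library, checks that round trip and repeats the variance comparison made by `get_noise`. The function names `sigma_zcdp`, `eps_from_sigma` and `choose_noise` are hypothetical, and the parameter values in the demo are arbitrary.

```python
import math


def sigma_zcdp(eps, delta, sensitivity, k):
    """Gaussian sigma for k queries under (eps, delta)-DP via zCDP composition."""
    log_term = math.log(1 / delta)
    return sensitivity * math.sqrt(k / 2) / (math.sqrt(eps + log_term) - math.sqrt(log_term))


def eps_from_sigma(sigma, delta, sensitivity, k):
    """Inverse direction: the eps implied by running k Gaussian queries at sigma."""
    rho = k * sensitivity ** 2 / (2 * sigma ** 2)
    return rho + 2 * math.sqrt(rho * math.log(1 / delta))


def choose_noise(eps, delta, sensitivity, k):
    """Pick whichever mechanism has the smaller per-cell noise variance."""
    lap_scale = k * sensitivity / eps      # sequential composition for Laplace
    lap_var = 2 * lap_scale ** 2           # variance of Laplace(scale) is 2 * scale^2
    sigma = sigma_zcdp(eps, delta, sensitivity, k)
    if lap_var < sigma ** 2:
        return "lap", lap_scale
    return "gauss", sigma


if __name__ == "__main__":
    for k in (1, 100):
        print(k, choose_noise(1.0, 1e-9, 1, k))
```

With these demo values Laplace noise has the smaller variance for a single query, while Gaussian noise wins for 100 queries, since the Laplace variance grows with `k**2` and the Gaussian variance only with `k`.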
diff --git a/examples/Minutemen/LICENSE.txt b/examples/Minutemen/LICENSE.txt
deleted file mode 100644
index 261eeb9..0000000
--- a/examples/Minutemen/LICENSE.txt
+++ /dev/null
@@ -1,201 +0,0 @@
- Apache License
- Version 2.0, January 2004
- http://www.apache.org/licenses/
-
- TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION
-
- 1. Definitions.
-
- "License" shall mean the terms and conditions for use, reproduction,
- and distribution as defined by Sections 1 through 9 of this document.
-
- "Licensor" shall mean the copyright owner or entity authorized by
- the copyright owner that is granting the License.
-
- "Legal Entity" shall mean the union of the acting entity and all
- other entities that control, are controlled by, or are under common
- control with that entity. For the purposes of this definition,
- "control" means (i) the power, direct or indirect, to cause the
- direction or management of such entity, whether by contract or
- otherwise, or (ii) ownership of fifty percent (50%) or more of the
- outstanding shares, or (iii) beneficial ownership of such entity.
-
- "You" (or "Your") shall mean an individual or Legal Entity
- exercising permissions granted by this License.
-
- "Source" form shall mean the preferred form for making modifications,
- including but not limited to software source code, documentation
- source, and configuration files.
-
- "Object" form shall mean any form resulting from mechanical
- transformation or translation of a Source form, including but
- not limited to compiled object code, generated documentation,
- and conversions to other media types.
-
- "Work" shall mean the work of authorship, whether in Source or
- Object form, made available under the License, as indicated by a
- copyright notice that is included in or attached to the work
- (an example is provided in the Appendix below).
-
- "Derivative Works" shall mean any work, whether in Source or Object
- form, that is based on (or derived from) the Work and for which the
- editorial revisions, annotations, elaborations, or other modifications
- represent, as a whole, an original work of authorship. For the purposes
- of this License, Derivative Works shall not include works that remain
- separable from, or merely link (or bind by name) to the interfaces of,
- the Work and Derivative Works thereof.
-
- "Contribution" shall mean any work of authorship, including
- the original version of the Work and any modifications or additions
- to that Work or Derivative Works thereof, that is intentionally
- submitted to Licensor for inclusion in the Work by the copyright owner
- or by an individual or Legal Entity authorized to submit on behalf of
- the copyright owner. For the purposes of this definition, "submitted"
- means any form of electronic, verbal, or written communication sent
- to the Licensor or its representatives, including but not limited to
- communication on electronic mailing lists, source code control systems,
- and issue tracking systems that are managed by, or on behalf of, the
- Licensor for the purpose of discussing and improving the Work, but
- excluding communication that is conspicuously marked or otherwise
- designated in writing by the copyright owner as "Not a Contribution."
-
- "Contributor" shall mean Licensor and any individual or Legal Entity
- on behalf of whom a Contribution has been received by Licensor and
- subsequently incorporated within the Work.
-
- 2. Grant of Copyright License. Subject to the terms and conditions of
- this License, each Contributor hereby grants to You a perpetual,
- worldwide, non-exclusive, no-charge, royalty-free, irrevocable
- copyright license to reproduce, prepare Derivative Works of,
- publicly display, publicly perform, sublicense, and distribute the
- Work and such Derivative Works in Source or Object form.
-
- 3. Grant of Patent License. Subject to the terms and conditions of
- this License, each Contributor hereby grants to You a perpetual,
- worldwide, non-exclusive, no-charge, royalty-free, irrevocable
- (except as stated in this section) patent license to make, have made,
- use, offer to sell, sell, import, and otherwise transfer the Work,
- where such license applies only to those patent claims licensable
- by such Contributor that are necessarily infringed by their
- Contribution(s) alone or by combination of their Contribution(s)
- with the Work to which such Contribution(s) was submitted. If You
- institute patent litigation against any entity (including a
- cross-claim or counterclaim in a lawsuit) alleging that the Work
- or a Contribution incorporated within the Work constitutes direct
- or contributory patent infringement, then any patent licenses
- granted to You under this License for that Work shall terminate
- as of the date such litigation is filed.
-
- 4. Redistribution. You may reproduce and distribute copies of the
- Work or Derivative Works thereof in any medium, with or without
- modifications, and in Source or Object form, provided that You
- meet the following conditions:
-
- (a) You must give any other recipients of the Work or
- Derivative Works a copy of this License; and
-
- (b) You must cause any modified files to carry prominent notices
- stating that You changed the files; and
-
- (c) You must retain, in the Source form of any Derivative Works
- that You distribute, all copyright, patent, trademark, and
- attribution notices from the Source form of the Work,
- excluding those notices that do not pertain to any part of
- the Derivative Works; and
-
- (d) If the Work includes a "NOTICE" text file as part of its
- distribution, then any Derivative Works that You distribute must
- include a readable copy of the attribution notices contained
- within such NOTICE file, excluding those notices that do not
- pertain to any part of the Derivative Works, in at least one
- of the following places: within a NOTICE text file distributed
- as part of the Derivative Works; within the Source form or
- documentation, if provided along with the Derivative Works; or,
- within a display generated by the Derivative Works, if and
- wherever such third-party notices normally appear. The contents
- of the NOTICE file are for informational purposes only and
- do not modify the License. You may add Your own attribution
- notices within Derivative Works that You distribute, alongside
- or as an addendum to the NOTICE text from the Work, provided
- that such additional attribution notices cannot be construed
- as modifying the License.
-
- You may add Your own copyright statement to Your modifications and
- may provide additional or different license terms and conditions
- for use, reproduction, or distribution of Your modifications, or
- for any such Derivative Works as a whole, provided Your use,
- reproduction, and distribution of the Work otherwise complies with
- the conditions stated in this License.
-
- 5. Submission of Contributions. Unless You explicitly state otherwise,
- any Contribution intentionally submitted for inclusion in the Work
- by You to the Licensor shall be under the terms and conditions of
- this License, without any additional terms or conditions.
- Notwithstanding the above, nothing herein shall supersede or modify
- the terms of any separate license agreement you may have executed
- with Licensor regarding such Contributions.
-
- 6. Trademarks. This License does not grant permission to use the trade
- names, trademarks, service marks, or product names of the Licensor,
- except as required for reasonable and customary use in describing the
- origin of the Work and reproducing the content of the NOTICE file.
-
- 7. Disclaimer of Warranty. Unless required by applicable law or
- agreed to in writing, Licensor provides the Work (and each
- Contributor provides its Contributions) on an "AS IS" BASIS,
- WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
- implied, including, without limitation, any warranties or conditions
- of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A
- PARTICULAR PURPOSE. You are solely responsible for determining the
- appropriateness of using or redistributing the Work and assume any
- risks associated with Your exercise of permissions under this License.
-
- 8. Limitation of Liability. In no event and under no legal theory,
- whether in tort (including negligence), contract, or otherwise,
- unless required by applicable law (such as deliberate and grossly
- negligent acts) or agreed to in writing, shall any Contributor be
- liable to You for damages, including any direct, indirect, special,
- incidental, or consequential damages of any character arising as a
- result of this License or out of the use or inability to use the
- Work (including but not limited to damages for loss of goodwill,
- work stoppage, computer failure or malfunction, or any and all
- other commercial damages or losses), even if such Contributor
- has been advised of the possibility of such damages.
-
- 9. Accepting Warranty or Additional Liability. While redistributing
- the Work or Derivative Works thereof, You may choose to offer,
- and charge a fee for, acceptance of support, warranty, indemnity,
- or other liability obligations and/or rights consistent with this
- License. However, in accepting such obligations, You may act only
- on Your own behalf and on Your sole responsibility, not on behalf
- of any other Contributor, and only if You agree to indemnify,
- defend, and hold each Contributor harmless for any liability
- incurred by, or claims asserted against, such Contributor by reason
- of your accepting any such warranty or additional liability.
-
- END OF TERMS AND CONDITIONS
-
- APPENDIX: How to apply the Apache License to your work.
-
- To apply the Apache License to your work, attach the following
- boilerplate notice, with the fields enclosed by brackets "[]"
- replaced with your own identifying information. (Don't include
- the brackets!) The text should be enclosed in the appropriate
- comment syntax for the file format. We also recommend that a
- file or class name and description of purpose be included on the
- same "printed page" as the copyright notice for easier
- identification within third-party archives.
-
- Copyright [yyyy] [name of copyright owner]
-
- Licensed under the Apache License, Version 2.0 (the "License");
- you may not use this file except in compliance with the License.
- You may obtain a copy of the License at
-
- http://www.apache.org/licenses/LICENSE-2.0
-
- Unless required by applicable law or agreed to in writing, software
- distributed under the License is distributed on an "AS IS" BASIS,
- WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
- See the License for the specific language governing permissions and
- limitations under the License.
diff --git a/examples/Minutemen/README.md b/examples/Minutemen/README.md
deleted file mode 100644
index 1114d22..0000000
--- a/examples/Minutemen/README.md
+++ /dev/null
@@ -1,329 +0,0 @@
-# nist-synthetic-data-2021
-Source code for the second place submission in the third round of the 2021 NIST differential privacy temporal map challenge.
-
-The contest-submission folder contains the code submitted during the contest, and only works on the contest dataset. A writeup about the solution can be found in this folder: [AdaptiveGrid.pdf](https://github.com/ryan112358/nist-synthetic-data-2021/blob/main/contest-submission/AdaptiveGrid.pdf). The extensions folder contains a new mechanism, inspired by the solution to the competition, that works on arbitrary discrete datasets. Several benchmark datasets can be found in the extensions/datasets folder.
-
-## Setting up
-
-The following setup instructions apply to Linux and OSX. This code has not been tested on Windows, although it should run with a modified setup procedure.
-First, make sure you have Python>=3.6 installed, and create a virtual environment as follows:
-
-```
-$ mkdir $HOME/venvs
-$ python3 -m venv $HOME/venvs/pgm
-$ source ~/venvs/pgm/bin/activate
-$ pip install -r requirements.txt
-```
-
-This code depends on [Private-PGM](https://github.com/ryan112358/private-pgm). Private-PGM can be set up using the following commands:
-```
-$ cd $HOME
-$ git clone git@github.com:ryan112358/private-pgm.git
-$ echo 'export PYTHONPATH="$PYTHONPATH:$HOME/private-pgm/src/"' >> ~/.bashrc
-$ source ~/.bashrc
-$ cd private-pgm/test
-$ nosetests
-........................................
-----------------------------------------------------------------------
-Ran 40 tests in 5.009s
-
-OK
-```
-
-## Assumptions and Limitations
-
-Here is a list of assumptions and limitations we are imposing:
-
-* The input dataset must be discrete --- i.e., all columns must be categorical. Numerical columns must be discretized into a small number of bins. We expect the input dataset to be specified as an mbi.Dataset object (from Private-PGM). This object expects each attribute to take values from the set {0, 1, …, n_i}. We include a collection of eight already preprocessed datasets alongside the released code in the extensions/datasets folder. We also provide utilities for preprocessing arbitrary datasets into the required format, which we describe how to use in the next section.
-* We do not support “id” columns. Other columns with high-cardinality domains should be discarded. The domain size for each column should be finite and reasonably small (definitely less than 1000, smaller than 100 preferred).
-* Our mechanism satisfies unbounded differential privacy (add/remove one individual) and we assume a single individual only affects one row of the dataset. If these assumptions are not satisfied, the privacy parameters must be modified accordingly.
-
-## Preprocessing the Data
-
-If you would like to run our mechanism on your own dataset, this section shows you how to do that. If you would like to run the mechanism on one of our datasets, feel free to skip to the next section. As mentioned in the previous section, our mechanism expects the data for each column to come from the set {0, 1, ..., n_i}. This section shows how to transform an arbitrary dataset into one of this form, and how to reverse this transformation as well.
-
-Our script, `transform.py`, can discretize and undiscretize your data. To use it, we first need to get the schema of the data using `schemagen`. Schemagen was written by Maia Hansen and the original repo can be accessed [here](https://github.com/hd23408/nist-schemagen). For convenience, we have included a copy of the script in our repo. It is located in `extensions/schemagen.py`. We have also provided a copy of the undiscretized version of the adult dataset so that users can test the script before using it on their own dataset. To run schemagen on the adult dataset, use this command:
-
-```
-python schemagen.py /path_to_repo/nist_synthetic-data-2021/extensions/datasets/raw/adult.csv --max_categorical 40
-```
-
-The README [here](https://github.com/hd23408/nist-schemagen) provides detailed descriptions of each argument. In the example above, we specify the path to the dataframe and the maximum number of categorical features for any column. Users may want to play around with the flags of the script to see what works for their use case. Note that if the dataset contains an id column, you can use the `--skip_columns` argument to list any columns you don't want to be privatized, making sure that they are separated with commas.
-
-By default, the script will place `parameters.json` and `column_datatypes.json` into the current directory. If you want to specify a different output directory, use the `--output_dir` argument. Additionally, for `numeric` column types, the default number of bins is 10. If you want custom binning, it is straightforward to open the `parameters.json` file in your favorite text editor and edit the `bins` field for each column.
-
-To discretize your data and place the result into a folder `PATH_TO_OUTPUT`,
-run:
-
-```
-python transform.py --transform discretize --df
-PATH_TO_REPO/nist_synthetic-data-2021/extensions/datasets/raw/adult.csv
---schema parameters.json --output_dir PATH_TO_OUTPUT
-```
-
-Note that if you use `--transform discretize`, the script will also write a
-`domain.json` file to `OUTPUT_DIR`. This file will be necessary to run PGM.
-
-The arguments for `transform.py` are:
-
-```
-usage: transform.py [-h] [--output_dir OUTPUT_DIR] --transform TRANSFORM --df
- DF --schema SCHEMA
-
-Pre and post processing functions for the Adagrid mechanism
-
-optional arguments:
- -h, --help show this help message and exit
- --output_dir OUTPUT_DIR
- output directory for transformed data and domain if
- using `discretize` (default: .)
-
-required arguments:
- --transform TRANSFORM
- either discretize or undo_discretize (default: None)
- --df DF path to dataset (default: None)
- --schema SCHEMA path to schema file from schemagen (default: None)
-```
-
-And that's all there is to it!
-
-## Running the Mechanism
-
-After setting up Private-PGM, we can generate synthetic data using the following command
-
-```
-$ cd extensions/
-$ python adaptive_grid.py --dataset datasets/adult.zip --domain datasets/adult-domain.json --save adult-synthetic.csv
-
-Measuring ('native-country',), L2 sensitivity 1.000000
-Measuring ('fnlwgt',), L2 sensitivity 1.000000
-Measuring ('relationship',), L2 sensitivity 1.000000
-Measuring ('capital-gain',), L2 sensitivity 1.000000
-Measuring ('hours-per-week',), L2 sensitivity 1.000000
-Measuring ('income>50K',), L2 sensitivity 1.000000
-Measuring ('workclass',), L2 sensitivity 1.000000
-Measuring ('sex',), L2 sensitivity 1.000000
-Measuring ('marital-status',), L2 sensitivity 1.000000
-Measuring ('capital-loss',), L2 sensitivity 1.000000
-Measuring ('occupation',), L2 sensitivity 1.000000
-Measuring ('age',), L2 sensitivity 1.000000
-Measuring ('race',), L2 sensitivity 1.000000
-Measuring ('education-num',), L2 sensitivity 1.000000
-
-Measuring ('age', 'marital-status'), L2 sensitivity 1.000000
-Measuring ('age', 'hours-per-week'), L2 sensitivity 1.000000
-Measuring ('age', 'fnlwgt'), L2 sensitivity 1.000000
-Measuring ('age', 'capital-gain'), L2 sensitivity 1.000000
-Measuring ('age', 'capital-loss'), L2 sensitivity 1.000000
-Measuring ('workclass', 'occupation'), L2 sensitivity 1.000000
-Measuring ('fnlwgt', 'native-country'), L2 sensitivity 1.000000
-Measuring ('fnlwgt', 'race'), L2 sensitivity 1.000000
-Measuring ('education-num', 'occupation'), L2 sensitivity 1.000000
-Measuring ('marital-status', 'relationship'), L2 sensitivity 1.000000
-Measuring ('occupation', 'hours-per-week'), L2 sensitivity 1.000000
-Measuring ('relationship', 'sex'), L2 sensitivity 1.000000
-Measuring ('relationship', 'income>50K'), L2 sensitivity 1.000000
-
-Post-processing with Private-PGM, will take some time...
-```
-
-As we can see, the mechanism measured queries about all 1-way marginals, and a subset of 13 2-way marginals. This produces an output adult-synthetic.csv that we can readily view:
-
-```
-$ head adult-synthetic.csv
-age,workclass,fnlwgt,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,income>50K
-14,0,13,3,3,9,3,0,0,0,0,46,0,0
-10,0,11,12,2,4,5,0,0,0,0,39,0,0
-15,0,9,9,0,3,2,0,1,0,45,19,0,0
-8,1,16,8,2,10,1,0,0,0,0,19,20,0
-32,0,4,13,0,8,2,0,1,0,0,59,0,1
-4,0,21,9,2,9,1,0,1,0,0,39,0,0
-21,0,10,8,1,7,3,0,0,0,0,39,0,0
-31,8,12,6,0,14,2,0,1,0,0,34,0,1
-21,0,14,13,1,4,3,3,0,0,0,44,0,0
-```
-
-We can undo the discretization function we applied earlier:
-
-```
-python transform.py --transform undo_discretize --df adult-synthetic.csv
---schema parameters.json --output_dir .
-```
-
-## The target option
-
-By default, this mechanism will try to preserve all 2-way marginals. If one column has increased importance, we can specify that with the **targets** option. With this option specified, we will instead try to preserve higher order marginals involving the targets. If we specify ```--targets="income>50K"``` then the mechanism will try to preserve 3-way marginals involving the income column. We can pass in multiple targets if desired, although scalability will suffer if the list is longer than a few columns.
-
-```
-$ python adaptive_grid.py --dataset datasets/adult.zip --domain datasets/adult-domain.json --targets="income>50K" --save adult-synthetic-target.csv
-
-Measuring ('income>50K',), L2 sensitivity 1.000000
-Measuring ('marital-status',), L2 sensitivity 1.000000
-Measuring ('age',), L2 sensitivity 1.000000
-Measuring ('race',), L2 sensitivity 1.000000
-Measuring ('capital-gain',), L2 sensitivity 1.000000
-Measuring ('workclass',), L2 sensitivity 1.000000
-Measuring ('relationship',), L2 sensitivity 1.000000
-Measuring ('education-num',), L2 sensitivity 1.000000
-Measuring ('hours-per-week',), L2 sensitivity 1.000000
-Measuring ('capital-loss',), L2 sensitivity 1.000000
-Measuring ('fnlwgt',), L2 sensitivity 1.000000
-Measuring ('occupation',), L2 sensitivity 1.000000
-Measuring ('native-country',), L2 sensitivity 1.000000
-Measuring ('sex',), L2 sensitivity 1.000000
-
-Measuring ('marital-status', 'income>50K'), L2 sensitivity 1.000000
-Measuring ('race', 'income>50K'), L2 sensitivity 1.000000
-Measuring ('relationship', 'income>50K'), L2 sensitivity 1.000000
-Measuring ('capital-loss', 'income>50K'), L2 sensitivity 1.000000
-Measuring ('fnlwgt', 'income>50K'), L2 sensitivity 1.000000
-Measuring ('native-country', 'income>50K'), L2 sensitivity 1.000000
-Measuring ('workclass', 'income>50K'), L2 sensitivity 1.000000
-Measuring ('occupation', 'income>50K'), L2 sensitivity 1.000000
-Measuring ('hours-per-week', 'income>50K'), L2 sensitivity 1.000000
-Measuring ('age', 'income>50K'), L2 sensitivity 1.000000
-Measuring ('education-num', 'income>50K'), L2 sensitivity 1.000000
-Measuring ('capital-gain', 'income>50K'), L2 sensitivity 1.000000
-Measuring ('sex', 'income>50K'), L2 sensitivity 1.000000
-
-Measuring ('age', 'marital-status', 'income>50K'), L2 sensitivity 1.000000
-Measuring ('age', 'hours-per-week', 'income>50K'), L2 sensitivity 1.000000
-Measuring ('age', 'fnlwgt', 'income>50K'), L2 sensitivity 1.000000
-Measuring ('age', 'native-country', 'income>50K'), L2 sensitivity 1.000000
-Measuring ('age', 'capital-gain', 'income>50K'), L2 sensitivity 1.000000
-Measuring ('age', 'capital-loss', 'income>50K'), L2 sensitivity 1.000000
-Measuring ('age', 'race', 'income>50K'), L2 sensitivity 1.000000
-Measuring ('workclass', 'occupation', 'income>50K'), L2 sensitivity 1.000000
-Measuring ('education-num', 'occupation', 'income>50K'), L2 sensitivity 1.000000
-Measuring ('marital-status', 'relationship', 'income>50K'), L2 sensitivity 1.000000
-Measuring ('occupation', 'hours-per-week', 'income>50K'), L2 sensitivity 1.000000
-Measuring ('relationship', 'sex', 'income>50K'), L2 sensitivity 1.000000
-
-Post-processing with Private-PGM, will take some time...
-```
-
-As we can see, the mechanism now measures many more marginals involving the target column. In particular, it measured every 2-way marginal involving income, and 12 3-way marginals involving income.
-
-## Evaluating the synthetic data
-
-We can score our synthetic data using the score.py script, as follows:
-
-```
-$ python score.py --synthetic adult-synthetic.csv
-relationship sex 0.003655
- income>50K 0.003675
-marital-status relationship 0.007279
-sex income>50K 0.008364
-marital-status income>50K 0.011373
- ...
-age education-num 0.139112
-occupation relationship 0.151089
-age hours-per-week 0.157170
-occupation sex 0.159074
-age fnlwgt 0.207127
-Length: 91, dtype: float64
-Average Error: 0.05595402713661589
-```
-
-The error is calculated as the total variation distance between true and synthetic marginals, averaged over all 2-way marginals. We can see both the breakdown (which marginals are estimated well and which are not) and the overall error. We can also specify a list of targets, which modifies the evaluation criteria to include the target columns in all evaluation marginals.
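
The averaging described above can be illustrated with a minimal, standard-library sketch. It is not the code in score.py: the helper names (`marginal`, `tv_distance`, `average_two_way_error`) and the toy records are hypothetical, and total variation distance is assumed to mean half the L1 distance between normalized marginals.

```python
from collections import Counter
from itertools import combinations


def marginal(rows, cols):
    """Normalized histogram of the given columns over a list of dict records."""
    counts = Counter(tuple(row[c] for c in cols) for row in rows)
    total = sum(counts.values())
    return {key: value / total for key, value in counts.items()}


def tv_distance(p, q):
    """Total variation distance, taken here as half the L1 distance (assumption)."""
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)


def average_two_way_error(real, synth, columns):
    """Per-pair errors and their mean over every pair of columns."""
    pairs = list(combinations(columns, 2))
    errors = {
        pair: tv_distance(marginal(real, pair), marginal(synth, pair))
        for pair in pairs
    }
    return errors, sum(errors.values()) / len(pairs)


# Toy data: the synthetic records get the sex/race marginal exactly right
# but shift one record from income 1 to income 0.
real = [
    {"sex": "F", "income>50K": 0, "race": "A"},
    {"sex": "F", "income>50K": 1, "race": "A"},
    {"sex": "M", "income>50K": 0, "race": "B"},
    {"sex": "M", "income>50K": 1, "race": "B"},
]
synth = [
    {"sex": "F", "income>50K": 0, "race": "A"},
    {"sex": "F", "income>50K": 0, "race": "A"},
    {"sex": "M", "income>50K": 0, "race": "B"},
    {"sex": "M", "income>50K": 1, "race": "B"},
]

errors, average = average_two_way_error(real, synth, ["sex", "income>50K", "race"])
for pair, err in sorted(errors.items(), key=lambda item: item[1]):
    print(pair, round(err, 6))
print("Average Error:", average)
```

Sorting the per-pair errors mirrors the breakdown printed by the real script, where the best-estimated marginals appear first.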
-
-```
-$ python score.py --synthetic adult-synthetic-target.csv --targets "income>50K"
-relationship sex income>50K 0.005139
-marital-status relationship income>50K 0.011445
-workclass race income>50K 0.024426
-sex capital-loss income>50K 0.027927
-marital-status sex income>50K 0.028029
- ...
-occupation relationship income>50K 0.140740
-age education-num income>50K 0.151325
-occupation sex income>50K 0.161418
-age hours-per-week income>50K 0.163691
- fnlwgt income>50K 0.174420
-Length: 78, dtype: float64
-Average Error: 0.06613134555274515
-```
-
-**NOTE**: By specifying targets in adaptive_grid.py, we can expect the synthetic data to score better when passing --targets to score.py. If we score adult-synthetic.csv with the target option enabled, the score is 0.1038, roughly 1.6X worse than the 0.0661 we achieved.
-
-We can compare the score we obtain from our differentially private mechanism with that of a simple non-private baseline which samples n records with replacement from the original dataset. We can run this baseline using resample.py and score it using the same score function.
-
-```
-$ python resample.py --dataset datasets/adult.csv --save resample.csv
-$ python score.py --synthetic resample.csv
-sex income>50K 0.001679
-relationship sex 0.002191
-race income>50K 0.002211
-relationship income>50K 0.002764
-capital-loss income>50K 0.003030
- ...
-age education-num 0.042545
- occupation 0.046599
-fnlwgt hours-per-week 0.048462
-age hours-per-week 0.063142
- fnlwgt 0.069919
-Length: 91, dtype: float64
-Average Error: 0.014676838660295519
-```
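
For intuition, the non-private baseline amounts to a bootstrap resample of the original records. The sketch below is an illustration only, not the contents of resample.py; the `resample` helper, its `seed` parameter, and the toy records are hypothetical.

```python
import random


def resample(records, seed=None):
    """Draw len(records) records with replacement from the original data."""
    rng = random.Random(seed)
    return rng.choices(records, k=len(records))


original = [
    {"age": 39, "sex": "M"},
    {"age": 50, "sex": "F"},
    {"age": 28, "sex": "F"},
]
baseline = resample(original, seed=0)
print(len(baseline), "records drawn with replacement")
```

Because every sampled record is a real record, this baseline offers no privacy protection; its error reflects sampling noise alone, which is why it serves as a reference point for the private mechanism.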
-
-
-## Full configuration options
-
-The default configuration options are shown below. In general, the dataset, domain, epsilon, delta, targets, and save options should be specified. For other options, the default settings should work fine in most cases. Interested users can try modifying them if desired.
-
-```
-$ cd extensions/
-$ python adaptive_grid.py --help
-usage: adaptive_grid.py [-h] [--dataset DATASET] [--domain DOMAIN] [--epsilon EPSILON]
- [--delta DELTA] [--targets TARGETS [TARGETS ...]] [--pgm_iters PGM_ITERS]
- [--warm_start WARM_START] [--metric {L1,L2}] [--threshold THRESHOLD]
- [--split_strategy SPLIT_STRATEGY [SPLIT_STRATEGY ...]] [--save SAVE]
-
-A generalization of the Adaptive Grid Mechanism that won 2nd place in the 2020 NIST temporal map
-challenge
-
-optional arguments:
- -h, --help show this help message and exit
- --dataset DATASET dataset to use (default: datasets/adult.zip)
- --domain DOMAIN dataset to use (default: datasets/adult-domain.json)
- --epsilon EPSILON privacy parameter (default: 1.0)
- --delta DELTA privacy parameter (default: 1e-10)
- --targets TARGETS [TARGETS ...]
- target columns to preserve (default: [])
- --pgm_iters PGM_ITERS
- number of iterations (default: 2500)
- --warm_start WARM_START
- warm start PGM (default: True)
- --metric {L1,L2} loss function metric to use (default: L2)
- --threshold THRESHOLD
- adagrid treshold parameter (default: 5.0)
- --split_strategy SPLIT_STRATEGY [SPLIT_STRATEGY ...]
- budget split for 3 steps (default: [0.1, 0.1, 0.8])
- --save SAVE path to save synthetic data (default: out.csv)
-
-$ python score.py --help
-usage: score.py [-h] [--dataset DATASET] [--domain DOMAIN]
- [--synthetic SYNTHETIC] [--targets TARGETS [TARGETS ...]]
- [--save SAVE]
-
-A script to score the quality of synthetic data
-
-optional arguments:
- -h, --help show this help message and exit
- --dataset DATASET dataset to use (default: datasets/adult.zip)
- --domain DOMAIN domain of dataset (default: datasets/adult-
- domain.json)
- --synthetic SYNTHETIC
- synthetic dataset to use (default: out.csv)
- --targets TARGETS [TARGETS ...]
- target columns to define evaluation criteria (default:
- [])
- --save SAVE path to save error report (default: error.csv)
-```
-
-Notes about other options:
-* The metric option corresponds to the loss function used to resolve inconsistencies in noisy marginals. The L2 loss function is more natural with Gaussian noise, which is what we use, and also has better smoothness properties, making optimization simpler. We don't recommend changing this, but feel free to try it and see if it makes a difference in your use case.
-* The `pgm_iters` option specifies how many iterations to run the proximal gradient algorithm underlying Private-PGM. Increasing this value may improve error slightly, at the cost of increased runtime. Decreasing this value too aggressively could destroy performance.
-* The warm_start option is activated during the second invocation of Private-PGM after step 3. If it is turned on, then the parameters from the previous invocation will be used to initialize the proximal algorithm.
-* The threshold option is used to determine whether to measure a cell in a marginal at finer granularity. If a cell in a marginal had a noisy count below threshold\*sigma, then no future measurements will be made at finer granularity. The mechanism seems fairly robust to the choice of threshold, and even setting it to -inf should be fine (no threshold). Feel free to modify it and see how it impacts performance on your problem, but expect the default to provide reasonable behavior.
-* The split_strategy option specifies how much of the privacy budget to devote to the three steps of the algorithm. By default 10% is used on step 1 (measuring 1-way marginals), 10% is used on step 2 (selecting higher-order marginals), and 80% is used on step 3 (measuring higher-order marginals). Since the first two steps are coarse-grained aggregations, they are more robust to noise than step 3. Again, feel free to change this, but expect the default to work reasonably well.
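
The threshold rule in the notes above can be sketched in a few lines. This is a simplified illustration, not the mechanism's implementation: `plausible_cells` and the sample counts are hypothetical, and `sigma` stands for the standard deviation of the Gaussian noise added to each count.

```python
def plausible_cells(noisy_counts, sigma, threshold=5.0):
    """Flag cells whose noisy count reaches threshold * sigma.

    Cells flagged False would not be measured again at finer granularity.
    """
    cutoff = threshold * sigma
    return [count >= cutoff for count in noisy_counts]


noisy = [3.0, 60.0, 49.9, 50.0]
mask = plausible_cells(noisy, sigma=10.0)

# A threshold of -inf disables the filter, so every cell stays eligible.
no_filter = plausible_cells(noisy, sigma=10.0, threshold=float("-inf"))
print(mask, no_filter)
```

With the default threshold of 5.0 and sigma of 10.0, only cells with a noisy count of at least 50 remain candidates for finer measurement.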
diff --git a/examples/Minutemen/adaptive_grid.py b/examples/Minutemen/adaptive_grid.py
deleted file mode 100644
index 906aace..0000000
--- a/examples/Minutemen/adaptive_grid.py
+++ /dev/null
@@ -1,173 +0,0 @@
-import numpy as np
-import pandas as pd
-import json
-from util import discretize, undo_discretize, downward_closure
-from mbi import FactoredInference, Factor
-from scipy import sparse
-from autodp import privacy_calibrator
-from functools import partial
-from pathlib import Path
-import typer
-from multiprocessing import Pool
-from math import ceil
-
-import sdnist
-
-
-def get_permutation_matrix(cl1, cl2, domain):
- # permutation matrix that maps datavector of cl1 factor to datavector of cl2 factor
- assert set(cl1) == set(cl2)
- n = domain.size(cl1)
- fac = Factor(domain.project(cl1),np.arange(n))
- new = fac.transpose(cl2)
- data = np.ones(n)
- row_ind = fac.datavector()
- col_ind = new.datavector()
- return sparse.csr_matrix((data, (row_ind, col_ind)), shape=(n,n))
-
-def get_aggregate(cl, matrices, domain):
- children = [r for r in matrices if set(r) < set(cl) and len(r)+1 == len(cl)]
- ans = [sparse.csr_matrix((0,domain.size(cl)))]
- for c in children:
- coef = 1.0 / np.sqrt(len(children))
- a = tuple(set(cl)-set(c))
- cl2 = a + c
- Qc = matrices[c]
- P = get_permutation_matrix(cl, cl2, domain)
- T = np.ones(domain.size(a))
- Q = sparse.kron(T, Qc) @ P
- ans.append(coef*Q)
- return sparse.vstack(ans)
-
-def get_identity(cl, post_plausibility, domain):
- # determine which cells in the cl marginal *could* have a count above threshold,
- # based on previous measurements
- children = [r for r in post_plausibility if set(r) < set(cl) and len(r)+1 == len(cl)]
- plausibility = Factor.ones(domain.project(cl))
- for c in children:
- plausibility *= post_plausibility[c]
-
- row_ind = col_ind = np.nonzero(plausibility.datavector())[0]
- data = np.ones_like(row_ind)
- n = domain.size(cl)
- Q = sparse.csr_matrix((data, (row_ind, col_ind)), (n,n))
- return Q
-
-
-def adagrid(data, epsilon, delta, threshold, cliques, iters=2500, clip=200):
- # Calibrate noise using Gaussian differential privacy
- # We have an adaptive composition of K=len(cliques) Gaussian mechanisms,
- # each applied to a quantity with L2 sensitivity of 1
- # Required noise is thus sqrt(K) * sigma(epsilon, delta) * 200
- # the 200 can be reduced if clipping is done
- noise = privacy_calibrator.gaussian_mech(epsilon, delta)['sigma']*clip*np.sqrt(len(cliques))
- domain = data.domain
- threshold = noise*threshold
- measurements = []
- post_plausibility = {}
- matrices = {}
-
- for k in [1,2,3,4]:
- split = [cl for cl in cliques if len(cl) == k]
- print()
- for cl in split:
- I = sparse.eye(domain.size(cl))
- Q1 = get_identity(cl, post_plausibility, domain) # get fine-granularity measurements
- Q2 = get_aggregate(cl, matrices, domain) @ (I - Q1) #get remaining aggregate measurements
- Q1 = Q1[Q1.getnnz(1)>0] # remove all-zero rows
- Q = sparse.vstack([Q1,Q2])
- Q.T = sparse.csr_matrix(Q.T) # a trick to improve efficiency of Private-PGM
- # Q has sensitivity 1 by construction
- print('Measuring %s, L2 sensitivity %.6f' % (cl, Q.power(2).sum(axis=0).max()))
- #########################################
- ### This code uses the sensitive data ###
- #########################################
- mu = data.project(cl).datavector()
- y = Q @ mu + np.random.normal(loc=0, scale=noise, size=Q.shape[0])
- #########################################
- est = Q1.T @ y[:Q1.shape[0]]
-
- post_plausibility[cl] = Factor(domain.project(cl), est >= threshold)
- matrices[cl] = Q
- measurements.append((Q, y, 1.0, cl))
-
- print('Post-processing with Private-PGM, will take some time...')
- elim_order = ['trip_seconds', 'payment_type', 'trip_miles', 'trip_total', 'tips', 'fare', 'company_id', 'dropoff_community_area', 'pickup_community_area', 'shift']
- engine = FactoredInference(domain,elim_order=elim_order,log=False,iters=iters,warm_start=True)
-
- small = [M for M in measurements if len(M[-1]) == 1]
- engine.estimate(small)
-
- return engine.estimate(measurements, total=engine.model.total)
-
-def assign_taxi_ids(priv):
- gt = pd.read_csv('public_taxiid.zip')
- pair = ['pickup_community_area','shift']
- sizes = priv.groupby(pair).size().unstack().fillna(0).astype(int).stack()
-
- def assign_identifier(g):
- num = sizes[g.name]
- if num == 0:
- return pd.DataFrame(columns=g.columns)
- g = g.sample(frac=1) # shuffle
- reps = ceil(num/g.shape[0])
- g = pd.concat([g]*reps, ignore_index=True).iloc[:num] # grab correct number of rows
- g['key'] = np.arange(g.shape[0]) # assign key for later join operation
- return g
- gt2 = gt.groupby(pair).apply(assign_identifier).reset_index(drop=True)
- priv2 = priv.groupby(pair).apply(assign_identifier).reset_index(drop=True)
- ans = priv2.merge(gt2, how='left', on=pair+['key']).drop(columns=['key'])
- ans['taxi_id'] = ans.taxi_id.astype('category').cat.codes
- cumct = ans.groupby('taxi_id').cumcount()
- num = (cumct >= 200).sum()
- #new_ids = np.repeat(np.arange(ceil(num/200.0))+ans.taxi_id.max()+1, 200)[:num]
- ans.loc[cumct >= 200,'taxi_id'] = np.arange(num)+ans.taxi_id.max()+1
- return ans
-
-def run_mechanism(df, schema, run):
- epsilon, delta = run['epsilon'], run['delta']
- iters = int(344*epsilon+256)
- threshold = 5
- copies = 8
-
- if epsilon <= 1:
- clip = 150
- else:
- clip = 200
-
- data = discretize(df, schema, clip)
-
- cliques = [('pickup_community_area', 'shift', 'fare', 'trip_total'),
- ('pickup_community_area', 'shift', 'trip_total', 'trip_seconds'),
- ('pickup_community_area', 'shift', 'dropoff_community_area', 'fare'),
- ('pickup_community_area', 'shift', 'payment_type', 'trip_total'),
- ('pickup_community_area', 'shift', 'fare', 'trip_miles'),
- ('pickup_community_area', 'shift', 'company_id'),
- ('pickup_community_area', 'shift', 'tips', 'trip_total')]
-
- cliques = downward_closure(cliques)
- cliques += [('pickup_community_area', 'dropoff_community_area')]*(copies - 1)
-
- model = adagrid(data,epsilon,delta,threshold,cliques,iters,clip)
-
- synth = model.synthetic_data()
- submit = undo_discretize(synth, schema)
- # submit = assign_taxi_ids(submit)
- submit["taxi_id"] = 0
- cols = ['taxi_id', 'shift', 'company_id', 'pickup_community_area', 'dropoff_community_area', 'payment_type', 'fare', 'tips', 'trip_total', 'trip_seconds', 'trip_miles']
- submit = submit[cols]
- return submit
-
-
-def main():
- df, schema = sdnist.taxi()
-
- del schema["trip_day_of_week"]
- del schema["trip_hour_of_day"]
- run = {"epsilon": 1, "delta": 2.5e-4}
- synthetic_df = run_mechanism(df, schema, run)
-
- print(sdnist.score(df, synthetic_df, challenge="taxi"))
-
-if __name__ == '__main__':
- typer.run(main)
\ No newline at end of file
diff --git a/examples/Minutemen/util.py b/examples/Minutemen/util.py
deleted file mode 100644
index 48fe25c..0000000
--- a/examples/Minutemen/util.py
+++ /dev/null
@@ -1,117 +0,0 @@
-import numpy as np
-import pandas as pd
-from mbi import Domain, Dataset
-import itertools
-
-def powerset(iterable):
- s = list(iterable)
- return itertools.chain.from_iterable(itertools.combinations(s, r) for r in range(1,len(s)+1))
-
-def downward_closure(cliques):
- ans = set()
- for proj in cliques:
- ans.update(powerset(proj))
- return list(sorted(ans, key=len))
-
-BINS = {
- "fare": np.r_[-1, np.arange(0, 100, step=10), 9900],
- "tips": np.r_[-1, np.arange(0, 100, step=10), 407],
- "trip_total": np.r_[-1, np.arange(0, 100, step=10), 9900],
- "trip_seconds": np.r_[-1, np.arange(0, 2000, step=200), 86400],
- "trip_miles": np.r_[-1, np.arange(0, 100, step=10), 1428] }
-
-def discretize(df, schema, clip=None):
- weights = None
- if clip is not None:
- # each individual now only contributes "clip" records
- # achieved by reweighting records, rather than resampling them
- weights = df.taxi_id.value_counts()
- weights = np.minimum(clip/weights, 1.0)
- weights = np.array(df.taxi_id.map(weights).values)
-
- new = df.copy()
- domain = { }
- for col in schema:
- info = schema[col]
- #print(col)
- if col in BINS:
- new[col] = pd.cut(df[col], BINS[col], right=False).cat.codes
- domain[col] = len(BINS[col]) - 1
- elif 'values' in info:
- new[col] = df[col].astype(pd.CategoricalDtype(info['values'])).cat.codes
- domain[col] = len(info['values'])
- else:
- new[col] = df[col] - info['min']
- domain[col] = info['max'] - info['min'] + 1
-
- domain = Domain.fromdict(domain)
- return Dataset(new, domain, weights)
-
-def undo_discretize(dataset, schema):
- df = dataset.df
- new = df.copy()
-
- for col in dataset.domain:
- info = schema[col]
- if col in BINS:
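- # copy the slices so the edge adjustments below do not mutate the shared BINS arrays
- low = BINS[col][:-1].copy()
- high = BINS[col][1:].copy()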
- low[0] = low[1]-2
- high[-1] = high[-2]+2
- mid = (low + high) / 2
- new[col] = mid[df[col].values]
- elif 'values' in info:
- mapping = np.array(info['values'])
- new[col] = mapping[df[col].values]
- else:
- new[col] = df[col] + info['min']
-
- #if 'max' in info:
- # new[col] = np.minimum(new[col], info['max'])
- #if 'min' in info:
- # new[col] = np.maximum(new[col], info['min'])
-
- dtypes = { col : schema[col]['dtype'] for col in schema }
-
- return new.astype(dtypes)
-
-
-def score(real, synth):
- # Replicate the NIST scoring metric
- # Calculates score for *every* 2-way marginal instead of a sample of them
- # performs scoring using the mbi.Dataset representation, which is different from raw data format
- # scores should match exactly
- # to score raw dataset, call score(discretize(real, schema), discretize(synth, schema))
- assert real.domain == synth.domain
- dom = real.domain
- proj = ('pickup_community_area','shift')
- newdom = dom.project(dom.invert(proj))
- keys = dom.project(proj)
- pairs = list(itertools.combinations(newdom.attrs, 2))
-
- idx = np.argsort(real.project('pickup_community_area').datavector())
-
- overall = 0
- breakdown = {}
- breakdown2 = np.zeros(dom.size('pickup_community_area'))
-
- for pair in pairs:
- #print(pair)
- proj = ('pickup_community_area','shift') + pair
- X = real.project(proj).datavector(flatten=False)
- Y = synth.project(proj).datavector(flatten=False)
- X /= X.sum(axis=(2,3), keepdims=True)
- Y /= Y.sum(axis=(2,3), keepdims=True)
-
- err = np.nan_to_num( np.abs(X-Y).sum(axis=(2,3)), nan=2.0)
- breakdown[pair] = err.mean()
- breakdown2 += err.mean(axis=1)
- overall += err.mean()
-
- score = overall / len(pairs)
-
- nist_score = ((2.0 - score) / 2.0) * 1_000
- breakdown2 /= len(pairs)
-
- return nist_score, pd.Series(breakdown), (2.0 - breakdown2[idx]) / 2.0
-
diff --git a/examples/bayesnet/bayesnet.ipynb b/examples/bayesnet/bayesnet.ipynb
deleted file mode 100644
index 7aa3655..0000000
--- a/examples/bayesnet/bayesnet.ipynb
+++ /dev/null
@@ -1,1484 +0,0 @@
-{
- "cells": [
- {
- "cell_type": "code",
- "execution_count": 1,
- "id": "c68aab23",
- "metadata": {},
- "outputs": [],
- "source": [
- "import itertools\n",
- "import functools\n",
- "\n",
- "import numpy as np\n",
- "import pandas as pd\n",
- "\n",
- "import networkx as nx\n",
- "import pydot\n",
- "from networkx.drawing.nx_pydot import graphviz_layout\n",
- "import matplotlib.pyplot as plt\n",
- "\n",
- "from tqdm import tqdm\n",
- "\n",
- "import sdnist"
- ]
- },
- {
- "cell_type": "markdown",
- "id": "7b6eff8c",
- "metadata": {},
- "source": [
- "## 1. Load both the public & the private dataset"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 2,
- "id": "78af8a1e",
- "metadata": {},
- "outputs": [
- {
- "data": {
- "text/plain": [
- " PUMA YEAR HHWT GQ PERWT SEX AGE MARST RACE HISPAN ... \\\n",
- "0 17-1001 2012 88.0 1 61.0 1 21 6 1 0 ... \n",
- "1 17-1001 2012 61.0 1 85.0 1 21 6 1 0 ... \n",
- "2 17-1001 2012 54.0 1 54.0 1 21 6 1 0 ... \n",
- "3 17-1001 2012 106.0 1 69.0 1 21 6 1 0 ... \n",
- "4 17-1001 2012 31.0 1 56.0 1 21 6 1 0 ... \n",
- "... ... ... ... .. ... ... ... ... ... ... ... \n",
- "1035196 39-4300 2018 103.0 1 90.0 2 37 1 9 0 ... \n",
- "1035197 39-4106 2018 207.0 1 207.0 2 41 6 9 0 ... \n",
- "1035198 17-2200 2018 73.0 1 58.0 2 46 4 9 0 ... \n",
- "1035199 17-2300 2018 47.0 1 47.0 2 46 1 9 0 ... \n",
- "1035200 39-910 2018 86.0 1 86.0 2 75 1 9 0 ... \n",
- "\n",
- " WORKEDYR INCTOT INCWAGE INCWELFR INCINVST INCEARN POVERTY \\\n",
- "0 3 14000 14000 0 0 14000 118 \n",
- "1 3 18000 0 0 0 18000 262 \n",
- "2 3 14000 14000 0 0 14000 118 \n",
- "3 3 3800 3800 0 0 3800 262 \n",
- "4 3 14000 14000 0 0 14000 501 \n",
- "... ... ... ... ... ... ... ... \n",
- "1035196 3 36000 36000 0 0 36000 231 \n",
- "1035197 3 52800 52000 0 0 52000 361 \n",
- "1035198 2 25800 0 0 0 0 200 \n",
- "1035199 3 5000 5000 0 0 5000 399 \n",
- "1035200 1 9600 0 0 0 0 501 \n",
- "\n",
- " DEPARTS ARRIVES sim_individual_id \n",
- "0 902 909 12 \n",
- "1 732 744 33 \n",
- "2 642 654 401 \n",
- "3 0 0 470 \n",
- "4 0 0 702 \n",
- "... ... ... ... \n",
- "1035196 1605 1624 556291 \n",
- "1035197 1005 1019 1139708 \n",
- "1035198 0 0 346052 \n",
- "1035199 732 754 40265 \n",
- "1035200 0 0 811103 \n",
- "\n",
- "[1035201 rows x 36 columns]"
- ]
- },
- "execution_count": 2,
- "metadata": {},
- "output_type": "execute_result"
- }
- ],
- "source": [
- "public, public_schema = sdnist.census(public=True)\n",
- "private, private_schema = sdnist.census(public=False)\n",
- "public"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 3,
- "id": "f3c1f7ee",
- "metadata": {},
- "outputs": [],
- "source": [
- "BINS = sdnist.kmarginal.CensusKMarginalScore.BINS\n",
- "\n",
- "public_bin = sdnist.utils.discretize(public, public_schema, BINS)\n",
- "private_bin = sdnist.utils.discretize(private, private_schema, BINS)"
- ]
- },
- {
- "cell_type": "markdown",
- "id": "c577a79b",
- "metadata": {},
- "source": [
- "## 2. First-order Bayesian network ($k=1$)\n",
- "\n",
- "As described in *PrivBayes: Private Data Release via Bayesian Networks* (http://dimacs.rutgers.edu/~graham/pubs/papers/PrivBayes.pdf), except that we compute the tree structure on a public dataset.\n",
- "\n",
- "\n",
- "### 2.1 Building the Chow-Liu tree structure from the public dataset\n",
- "\n",
- "We represent the joint distribution as a first-order Bayesian network. The dependency tree is constructed from the public dataset."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 4,
- "id": "4a0b51af",
- "metadata": {},
- "outputs": [
- {
- "name": "stderr",
- "output_type": "stream",
- "text": [
- "100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 34/34 [00:19<00:00, 1.78it/s]\n"
- ]
- }
- ],
- "source": [
- "COLS = list(public_schema.keys())\n",
- "n = len(public_bin)\n",
- "\n",
- "def mutual_information(df, col_a, col_b):\n",
- " ab = df.groupby([*col_a, col_b]).size().unstack(col_b, fill_value=0).to_numpy() / n\n",
- " a = ab.sum(axis=0, keepdims=True)\n",
- " b = ab.sum(axis=1, keepdims=True)\n",
- " \n",
- " llr = np.zeros_like(ab)\n",
- " np.log(ab / (a * b), where=ab > 0, out=llr)\n",
- " return np.sum(ab * llr)\n",
- "\n",
- "def greedy_bayes(df, root: str = \"PUMA\", order=1):\n",
- " # Graph: for visualization purposes\n",
- " graph = nx.DiGraph()\n",
- " graph.add_node(root)\n",
- " \n",
- " # Conditional distribution in topological order\n",
- " cond = [(root,)]\n",
- "\n",
- " # Greedy algorithm\n",
- " remaining = set(COLS)\n",
- " remaining.remove(root)\n",
- " \n",
- " mutual_information_memoize = functools.cache(functools.partial(mutual_information, df))\n",
- " \n",
- " for i in tqdm(range(len(COLS)-1)):\n",
- " max_col = None\n",
- " max_parents = None\n",
- " max_mi = 0\n",
- " \n",
- " for col in remaining:\n",
- " for parents in itertools.combinations(graph.nodes, r=min(len(graph.nodes), order)):\n",
- " mi = mutual_information_memoize(parents, col)\n",
- " \n",
- " if mi > max_mi:\n",
- " max_mi = mi\n",
- " max_col = col\n",
- " max_parents = parents\n",
- " \n",
- " graph.add_node(max_col)\n",
- " graph.add_edges_from(((p, max_col) for p in max_parents), weight=max_mi)\n",
- " \n",
- " cond.append((*max_parents, max_col))\n",
- " \n",
- " remaining.remove(max_col) \n",
- " \n",
- " return graph, cond\n",
- " \n",
- "graph, cond = greedy_bayes(public, order=1)"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 5,
- "id": "de465b99",
- "metadata": {},
- "outputs": [
- {
- "data": {
- "text/plain": [
- "Text(0.5, 1.0, 'First order bayesian network over the public dataset')"
- ]
- },
- "execution_count": 5,
- "metadata": {},
- "output_type": "execute_result"
- },
- {
- "data": {
- "image/png": "\n",
- "text/plain": [
- ""
- ]
- },
- "metadata": {},
- "output_type": "display_data"
- }
- ],
- "source": [
- "plt.figure(dpi=100)\n",
- "pos = graphviz_layout(graph, prog=\"neato\")\n",
- "width = [w ** .5 * 2 for w in nx.get_edge_attributes(graph, \"weight\").values()]\n",
- "nx.draw(graph, pos=pos, with_labels=True, font_size=8, width=width, edge_color=\"grey\")\n",
- "plt.title(\"First order bayesian network over the public dataset\")"
- ]
- },
- {
- "cell_type": "markdown",
- "id": "5768ba16",
- "metadata": {},
- "source": [
- "### 2.2 Computing the conditional distributions from the private dataset\n",
- "\n",
- "We compute the conditional distribution on the private dataset using noisy histograms, for each edge of the tree and starting from the root column (PUMA)."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 6,
- "id": "3dc8d6ea",
- "metadata": {},
- "outputs": [
- {
- "name": "stderr",
- "output_type": "stream",
- "text": [
- "100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 34/34 [00:01<00:00, 29.55it/s]\n"
- ]
- },
- {
- "data": {
- "image/png": "\n",
- "text/plain": [
- ""
- ]
- },
- "metadata": {
- "needs_background": "light"
- },
- "output_type": "display_data"
- }
- ],
- "source": [
- "eps = 1\n",
- "noise_scale = len(COLS) / (n * eps) * 7 # 7 accounts for repeated rows\n",
- "\n",
- "def compute_marginal_distribution(df, column, noise_scale: float = 0):\n",
- "    marginal = df.groupby(column).size() / n\n",
- "    # Only place where the data is accessed\n",
- "    if noise_scale > 0:\n",
- "        marginal += np.random.laplace(scale=noise_scale, size=marginal.size)\n",
- "        marginal = marginal.clip(lower=0)\n",
- "    return marginal / marginal.sum(axis=0)\n",
- "\n",
- "def compute_conditional_distribution(df, column, parents, noise_scale: float = 0):\n",
- "    joint = (df.groupby([column] + parents).size() / n)\n",
- "    joint = joint.unstack(column, fill_value=0)\n",
- "\n",
- "    # Only place where the data is accessed\n",
- "    if noise_scale > 0:\n",
- "        joint += np.random.laplace(scale=noise_scale, size=joint.shape)\n",
- "        joint = joint.clip(lower=0)\n",
- "\n",
- "    return joint.div(joint.sum(axis=1), axis=\"rows\")\n",
- "\n",
- "def compute_bayesnet(df, cond, noise_scale: float = noise_scale):\n",
- "    bayesnet = {}\n",
- "\n",
- "    node = cond[0]\n",
- "    *parents, root = node\n",
- "\n",
- "    bayesnet[node] = compute_marginal_distribution(df, root, noise_scale)\n",
- "\n",
- "\n",
- "    for node in tqdm(cond[1:]):\n",
- "        *parents, column = node\n",
- "        bayesnet[node] = compute_conditional_distribution(df, column, parents, noise_scale)\n",
- "\n",
- "    return bayesnet\n",
- " \n",
- "bayesnet = compute_bayesnet(private_bin, cond, noise_scale=noise_scale)\n",
- "\n",
- "# Example: marital status as a function of the age\n",
- "plt.imshow(bayesnet[(\"AGE\", \"MARST\")].to_numpy())\n",
- "plt.xlabel(\"MARST\")\n",
- "plt.ylabel(\"AGE (binned)\")\n",
- "plt.show()"
- ]
- },
- {
- "cell_type": "markdown",
- "id": "dfc0f576",
- "metadata": {},
- "source": [
- "The plot above shows, for instance, the distribution of marital status (`MARST`) conditioned on the binned age (`AGE`).\n",
- "\n",
- "### 2.3 Generating samples from the estimated joint probability distribution"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 7,
- "id": "0b172932",
- "metadata": {},
- "outputs": [
- {
- "name": "stderr",
- "output_type": "stream",
- "text": [
- "100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 34/34 [00:07<00:00, 4.39it/s]\n"
- ]
- },
- {
- "data": {
- "text/plain": [
- " PUMA YEAR HHWT GQ PERWT SEX AGE MARST RACE HISPAN ... \\\n",
- "0 64 0 8 1 11 0 9 0 5 4 ... \n",
- "1 150 0 15 1 26 1 13 4 0 0 ... \n",
- "2 175 2 8 1 9 0 8 3 8 0 ... \n",
- "3 21 0 7 1 14 1 11 4 0 0 ... \n",
- "4 114 3 2 1 2 0 10 0 3 4 ... \n",
- "... ... ... ... .. ... ... ... ... ... ... ... \n",
- "1348359 116 1 9 1 6 0 7 5 0 0 ... \n",
- "1348360 29 0 3 1 3 0 8 0 0 0 ... \n",
- "1348361 54 4 3 1 3 1 5 0 0 0 ... \n",
- "1348362 164 0 5 1 5 0 2 5 0 0 ... \n",
- "1348363 30 2 3 1 3 0 2 5 0 0 ... \n",
- "\n",
- " WORKEDYR INCTOT INCWAGE INCWELFR INCINVST INCEARN POVERTY \\\n",
- "0 3 19 19 1 1 19 26 \n",
- "1 3 5 1 1 1 5 6 \n",
- "2 3 8 4 1 1 8 17 \n",
- "3 3 5 5 1 1 5 16 \n",
- "4 3 6 1 1 1 1 14 \n",
- "... ... ... ... ... ... ... ... \n",
- "1348359 3 5 5 1 1 5 7 \n",
- "1348360 3 8 8 1 1 8 18 \n",
- "1348361 3 9 9 1 1 9 26 \n",
- "1348362 1 3 1 1 1 1 6 \n",
- "1348363 1 8 1 1 1 1 19 \n",
- "\n",
- " DEPARTS ARRIVES sim_individual_id \n",
- "0 18 89 0 \n",
- "1 45 54 0 \n",
- "2 41 41 0 \n",
- "3 30 31 0 \n",
- "4 1 1 0 \n",
- "... ... ... ... \n",
- "1348359 17 19 0 \n",
- "1348360 29 33 0 \n",
- "1348361 56 75 0 \n",
- "1348362 1 1 0 \n",
- "1348363 1 1 0 \n",
- "\n",
- "[1348364 rows x 36 columns]"
- ]
- },
- "execution_count": 7,
- "metadata": {},
- "output_type": "execute_result"
- }
- ],
- "source": [
- "def generate_sample(columns, cond, bayesnet, size=10):\n",
- "    synthetic = pd.DataFrame(0, columns=columns, index=np.arange(size))\n",
- "\n",
- "    # Sample the root column as i.i.d. draws from its marginal distribution.\n",
- "    node = cond[0]\n",
- "    *parents, root = node\n",
- "    dist = bayesnet[node]\n",
- "    synthetic[root] = dist.index[np.random.choice(a=len(dist), size=size, replace=True, p=dist.to_numpy())]\n",
- "\n",
- "    # Conditional distributions, sampled in topological order\n",
- "    for node in tqdm(cond[1:]):\n",
- "        *parents, column = node\n",
- "        dist = bayesnet[node]\n",
- "\n",
- "        if len(parents) == 1:\n",
- "            cumsum = dist.loc[synthetic[parents[0]]].to_numpy().cumsum(axis=1)\n",
- "        else:\n",
- "            raise NotImplementedError\n",
- "\n",
- "        # Inverse-CDF sampling: count how many cumulative sums lie below u\n",
- "        u = np.random.rand(size)\n",
- "        k = (u[:, None] > cumsum).sum(axis=1)\n",
- "        # Guard against round-off when the last cumulative sum is slightly below 1\n",
- "        k = np.minimum(k, cumsum.shape[1] - 1)\n",
- "        synthetic[column] = dist.columns[k]\n",
- "\n",
- "    return synthetic\n",
- "\n",
- "synthetic_bin = generate_sample(public_bin.columns, cond, bayesnet, size=len(private))\n",
- "\n",
- "# Display the synthetic dataset\n",
- "synthetic_bin"
- ]
- },
- {
- "cell_type": "markdown",
- "id": "d2855d57",
- "metadata": {},
- "source": [
- "### 2.4 Compute the final score"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 8,
- "id": "817bd734",
- "metadata": {},
- "outputs": [
- {
- "data": {
- "text/plain": [
- " PUMA YEAR HHWT GQ PERWT SEX AGE MARST RACE HISPAN ... \\\n",
- "0 36-3304 2012 140.0 1 200.0 1 60.0 1 6 4 ... \n",
- "1 42-1300 2012 280.0 1 500.0 2 80.0 5 1 0 ... \n",
- "2 42-2500 2014 140.0 1 160.0 1 55.0 4 9 0 ... \n",
- "3 36-2002 2012 120.0 1 260.0 2 70.0 5 1 0 ... \n",
- "4 36-4017 2015 20.0 1 20.0 1 65.0 1 4 4 ... \n",
- "... ... ... ... .. ... ... ... ... ... ... ... \n",
- "1348359 36-402 2013 160.0 1 100.0 1 50.0 6 1 0 ... \n",
- "1348360 36-2500 2012 40.0 1 40.0 1 55.0 1 1 0 ... \n",
- "1348361 36-3206 2016 40.0 1 40.0 2 40.0 1 1 0 ... \n",
- "1348362 42-1900 2012 80.0 1 80.0 1 25.0 6 1 0 ... \n",
- "1348363 36-2600 2014 40.0 1 40.0 1 25.0 6 1 0 ... \n",
- "\n",
- " WORKEDYR INCTOT INCWAGE INCWELFR INCINVST INCEARN POVERTY \\\n",
- "0 3 90000.0 90000.0 0.0 0.0 90000.0 500.0 \n",
- "1 3 20000.0 0.0 0.0 0.0 20000.0 100.0 \n",
- "2 3 35000.0 15000.0 0.0 0.0 35000.0 320.0 \n",
- "3 3 20000.0 20000.0 0.0 0.0 20000.0 300.0 \n",
- "4 3 25000.0 0.0 0.0 0.0 0.0 260.0 \n",
- "... ... ... ... ... ... ... ... \n",
- "1348359 3 20000.0 20000.0 0.0 0.0 20000.0 120.0 \n",
- "1348360 3 35000.0 35000.0 0.0 0.0 35000.0 340.0 \n",
- "1348361 3 40000.0 40000.0 0.0 0.0 40000.0 500.0 \n",
- "1348362 1 10000.0 0.0 0.0 0.0 0.0 100.0 \n",
- "1348363 1 35000.0 0.0 0.0 0.0 0.0 360.0 \n",
- "\n",
- " DEPARTS ARRIVES sim_individual_id \n",
- "0 415.0 2200.0 0 \n",
- "1 1100.0 1315.0 0 \n",
- "2 1000.0 1000.0 0 \n",
- "3 715.0 730.0 0 \n",
- "4 0.0 0.0 0 \n",
- "... ... ... ... \n",
- "1348359 400.0 430.0 0 \n",
- "1348360 700.0 800.0 0 \n",
- "1348361 1345.0 1830.0 0 \n",
- "1348362 0.0 0.0 0 \n",
- "1348363 0.0 0.0 0 \n",
- "\n",
- "[1348364 rows x 36 columns]"
- ]
- },
- "execution_count": 8,
- "metadata": {},
- "output_type": "execute_result"
- }
- ],
- "source": [
- "# Unbin data\n",
- "synthetic = sdnist.utils.undo_discretize(synthetic_bin, private_schema, BINS)\n",
- "synthetic"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 9,
- "id": "6eac1e6d",
- "metadata": {},
- "outputs": [
- {
- "name": "stderr",
- "output_type": "stream",
- "text": [
- "100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 50/50 [00:10<00:00, 4.78it/s]\n"
- ]
- },
- {
- "data": {
- "image/png": "iVBORw0KGgoAAAANSUhEUgAAAW0AAACMCAYAAAC3bvixAAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjQuMywgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy/MnkTPAAAACXBIWXMAAAsTAAALEwEAmpwYAAALeklEQVR4nO3df4yVVX7H8fd3GBxhQGD4MSKskV23BANaWVPWtmmhbrTuNlI3dHdJ17XZdqHZ2Gy72TQQYgp/GGjTVGOaXdpojdWKu6EGrGndKmXS0N1VocoPy46iCIIwMzjjrgyCzMzpH/dhZhAY5uL88Mx9v5In97nnPj/O+XLvJ/ee594hUkpIkvJQNdwdkCT1n6EtSRkxtCUpI4a2JGXE0JakjBjakpSR6nI2njhxYrr22msHqy9ZaW9vp7a2dri78YlgLXpYix7WoseOHTuOpZSmDsSxygrt+vp6tm/fPhDnzV5DQwMLFy4c7m58IliLHtaih7XoEREHBupYTo9IUkYMbUnKiKEtSRkxtCUpI4a2JGXE0JakjBjakpQRQ1uSMmJoS1JGDG1JykhZP2OXpDPu/+n9PPTyQwTBvPp5PLL4EU6cPsFXN36Vt957iwldE3huwXNMGjOJFw+/yLJ/WwZAIrH6t1dz55w7h3kEefKdtqSyHf7lYR588UG2f2s7e769h86uTp7c8yTrtq3jllm38Pqfvc78SfNZt20dAHOnzWX7su288qev8OwfPsvyZ5bT0dUxzKPIk6Et6ZJ0dHXwQccHdHR1cOL0Ca4afxWbGzdz9w13A3Bb/W1satwEwNjRY6muKn2wP9lxkogYrm5nz+kRSWWbccUMvnfz97j6/qsZM3oMt37mVm79zK00HW9i+vjpAEyumUxze3P3Pi8ceoFvPv1NDrx3gMfufKw7xFUe32lLKlvbB21sbtzM/u/s553vvkP7h+08vuvxPvdZMHMBr377VV761kus3baWkx0nh6i3I4uhLalsz7/5PLMmzmJq7VRGjxrNl+d8mZ+8/RPqx9Vz5P0jALx76l2m1U47Z985U+dQe1kte5r3DHW3RwRDW1LZrp5wNT87/DNOnD5BSokt+7cwZ8oc7viVO3h056MA/LjpxyyevRiA/W37uy88HnjvAI3HGrlm4jXD1f2sOakkqWwLZi5gyZwlzP+H+VRXVXPj9BtZ9rllHP/wOF/Z+BUefvlhxneN5/k/eB6AbQe3se5/1jG6ajRVUcX3v/R9poydMsyjyJOhLemSrFm0hjWL1pzVVlNdw5ZvbAFK/91Y3Zg6AO664S7uuuGuIe/jSOT0iCRlxNCW9LHU1dUREecsixYt6l5n9YTzbhMR1NXVDfcQsmJoS/pY2traSCmds2zdurV7HTjvNikl2trahnkEeTG0JSkjhrYkZcTQlqSMGNqSlBFDW5IyYmhLUkYMbUnKiKEtSRkxtCUpI4a2JGXE0JakjBjakpQRQ1uSMmJoS1JGDG2pAkTEcHdhUIzUcfXF0JakjBjakpQRQ1uSMmJoS1JGDG1JyoihLUkZMbQlKSOGtiRlxNCWpIwY2pKUEUNbkjJiaEtSRgxtaYht2LCByZMnExHdy7hx485pG8gFSn9c6frrrx+SMbacaOGBow9w7INjQ3K+SmJoS0Now4YNLF++nLa2Nurq6li7di1jx46lvb2d1tbWQT//7t27hyS41+9az5un3mT9zvWDfq5KY2hLQ+i+++7j5MmT1NfXs3HjRlasWMG0adOoqup5KdbX13PllVcyatSos9oHyu7duwf8mL21nGhh877NJBKb9m3y3fYAq77YBhGxDFgGMHXqVBoaGga7T1k4fvy4tShYix4Xq8XevXvp6uqiubmZzs5OGhoaOHjwIF1dXd3bNDU1UVVVdVbbQBvov0Pde8w/fPeHdHR2ANDR2cG9/34vP/jINgPZn0p77kVKqd8bz549OzU2Ng5id/LR0NDAwoULh7sbnwjW
osfFajF37lxee+01Jk+ezBNPPMGiRYuYNWvWWcFdX19PRNDS0kJKaVDCu5zX/cVERPfxWk60cPtTt3Oq81T34zWjanh2/xtMubftovt/nHN/kkXEjpTSTQNxLKdHpCG0atUqLr/8cpqamliyZAnr1q2jubn5nHfaR48epbOzc1ACe968eQN+zDPW71pPVzq7z12pi/UTJwzaOSvNRadHJA2cpUuXAnDPPffQ2trKypUrAaitraWmpmbQL0bOmzePXbt2Ddrxdzbv5HTX6bPaTned5pWamkE7Z6UxtKUhtnTp0u7wHipDNY2w8Y6N3etnTRWt9p32QHF6RJIyYmhLUkYMbUnKiKEtSRkxtCUpI4a2JGXE0JakjBjakpQRQ1uSMmJoS1JGDG1JyoihLUkZMbQlKSOGtlQBcviPAi7FSB1XXwxtScqIoS1JGTG0JSkjhrYkZcTQlqSMGNqSlBFDW5IyYmhLUkYMbUnKiKEtSRkxtCUpI4a2JGXE0JakjBjakpSR6uHugKT8RUSfj6e/uuKC20yaNGkwujRi+U5b0seSUjrvsnXr1u51Vv/igtu1trYO9xCyYmhLUkYMbUnKiKEtSRkxtCUpI4a2JGXE0JakjBjakpQRQ1uSMmJoS1JGDG1JyoihLUkZMbQlKSOGtiRlxNCWpIwY2pKUEUNbkjJiaEtSRgxtScqIoS1JGTG0JSkjkVLq/8YR7wONg9edrEwBjg13Jz4hrEUPa9HDWvSYnVIaPxAHqi5z+8aU0k0DceLcRcR2a1FiLXpYix7WokdEbB+oYzk9IkkZMbQlKSPlhvY/Dkov8mQteliLHtaih7XoMWC1KOtCpCRpeDk9IkkZ6VdoR8TvRkRjROyLiBWD3anhFhGfioitEbE3Il6NiO8U7XUR8VxEvF7cTuq1z8qiPo0Rcdvw9X7gRcSoiHg5Ip4p7ldkHQAiYmJEbIyInxfPj5srtR4R8RfF62NPRGyIiMsrpRYR8U8R0RwRe3q1lT32iPhcROwuHnswIuKiJ08p9bkAo4A3gE8DlwE7gesutl/OCzAdmF+sjwdeA64D/gZYUbSvAP66WL+uqEsNMKuo16jhHscA1uO7wBPAM8X9iqxDMcZHgT8p1i8DJlZiPYAZwH5gTHH/R8AfVUotgN8C5gN7erWVPXbgReBmIID/AG6/2Ln7807714B9KaU3U0ofAk8Ci/uxX7ZSSkdSSv9brL8P7KX0JF1M6UVLcfv7xfpi4MmU0qmU0n5gH6W6ZS8iZgJfAh7q1VxxdQCIiCsovVgfBkgpfZhSeo8KrQel33mMiYhqYCzwDhVSi5TSfwOtH2kua+wRMR24IqX001RK8H/utc8F9Se0ZwBv97p/qGirCBFxDXAj8AJQn1I6AqVgB6YVm43kGj0A/CXQ1autEusApU+bLcAjxXTRQxFRSwXWI6V0GPhb4CBwBPhFSuk/qcBa9FLu2GcU6x9t71N/Qvt8cywV8ZWTiBgH/Cvw5ymlX/a16Xnasq9RRPwe0JxS2tHfXc7Tln0deqmm9JH4BymlG4F2Sh+DL2TE1qOYr11M6eP+VUBtRHy9r13O0zYiatEPFxr7JdWkP6F9CPhUr/szKX0MGtEiYjSlwP6XlNJTRXNT8ZGG4ra5aB+pNfoN4I6IeIvStNjvRMTjVF4dzjgEHEopvVDc30gpxCuxHl8A9qeUWlJKp4GngF+nMmtxRrljP1Ssf7S9T/0J7ZeAz0bErIi4DPga8HQ/9stWcQX3YWBvSunvej30NHB3sX43sLlX+9cioiYiZgGfpXSBIWsppZUppZkppWso/bv/V0rp61RYHc5IKR0F3o6I2UXTLcD/UZn1OAh8PiLGFq+XWyhd+6nEWpxR1tiLKZT3I+LzRQ2/0WufC+vnldIvUvoGxRvAquG+cjsEV4Z/k9LHlF3AK8XyRWAysAV4vbit67XPqqI+jfTjCnBuC7CQnm+PVHIdfhXYXjw3NgGTKrUewBrg58Ae4DFK
346oiFoAGyjN5Z+m9I75jy9l7MBNRf3eAP6e4gePfS3+IlKSMuIvIiUpI4a2JGXE0JakjBjakpQRQ1uSMmJoS1JGDG1JyoihLUkZ+X/sWMXl3xZ6bQAAAABJRU5ErkJggg==\n",
- "text/plain": [
- ""
- ]
- },
- "metadata": {
- "needs_background": "light"
- },
- "output_type": "display_data"
- }
- ],
- "source": [
- "# Compute score\n",
- "score = sdnist.score(private, synthetic, schema=private_schema, challenge=\"census\")\n",
- "\n",
- "plt.figure(figsize=(6, 2))\n",
- "score.boxplot()"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "id": "3eec75b9",
- "metadata": {},
- "outputs": [],
- "source": []
- }
- ],
- "metadata": {
- "kernelspec": {
- "display_name": "Python 3 (ipykernel)",
- "language": "python",
- "name": "python3"
- },
- "language_info": {
- "codemirror_mode": {
- "name": "ipython",
- "version": 3
- },
- "file_extension": ".py",
- "mimetype": "text/x-python",
- "name": "python",
- "nbconvert_exporter": "python",
- "pygments_lexer": "ipython3",
- "version": "3.9.7"
- }
- },
- "nbformat": 4,
- "nbformat_minor": 5
-}
diff --git a/examples/bayesnet/longitudinal_bayesnet.ipynb b/examples/bayesnet/longitudinal_bayesnet.ipynb
deleted file mode 100644
index 23611ed..0000000
--- a/examples/bayesnet/longitudinal_bayesnet.ipynb
+++ /dev/null
@@ -1,1221 +0,0 @@
-{
- "cells": [
- {
- "cell_type": "code",
- "execution_count": 1,
- "id": "1064cd81",
- "metadata": {},
- "outputs": [],
- "source": [
- "import itertools\n",
- "import functools\n",
- "\n",
- "import numpy as np\n",
- "import pandas as pd\n",
- "\n",
- "import networkx as nx\n",
- "import pydot\n",
- "from networkx.drawing.nx_pydot import graphviz_layout\n",
- "import matplotlib.pyplot as plt\n",
- "import matplotlib.colors as mcolors\n",
- "\n",
- "from tqdm import tqdm\n",
- "\n",
- "import sdnist"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 2,
- "id": "efb27cb3",
- "metadata": {},
- "outputs": [
- {
- "data": {
- "text/plain": [
- ""
- ]
- },
- "execution_count": 2,
- "metadata": {},
- "output_type": "execute_result"
- }
- ],
- "source": [
- "import importlib\n",
- "importlib.reload(sdnist)"
- ]
- },
- {
- "cell_type": "markdown",
- "id": "1f29c592",
- "metadata": {},
- "source": [
- "## 1. Load both the public & the private dataset"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 3,
- "id": "8391a015",
- "metadata": {},
- "outputs": [
- {
- "data": {
- "text/plain": [
- " PUMA YEAR HHWT GQ PERWT SEX AGE MARST RACE HISPAN ... \\\n",
- "0 17-1001 2012 88.0 1 61.0 1 21 6 1 0 ... \n",
- "1 17-1001 2012 61.0 1 85.0 1 21 6 1 0 ... \n",
- "2 17-1001 2012 54.0 1 54.0 1 21 6 1 0 ... \n",
- "3 17-1001 2012 106.0 1 69.0 1 21 6 1 0 ... \n",
- "4 17-1001 2012 31.0 1 56.0 1 21 6 1 0 ... \n",
- "... ... ... ... .. ... ... ... ... ... ... ... \n",
- "1035196 39-4300 2018 103.0 1 90.0 2 37 1 9 0 ... \n",
- "1035197 39-4106 2018 207.0 1 207.0 2 41 6 9 0 ... \n",
- "1035198 17-2200 2018 73.0 1 58.0 2 46 4 9 0 ... \n",
- "1035199 17-2300 2018 47.0 1 47.0 2 46 1 9 0 ... \n",
- "1035200 39-910 2018 86.0 1 86.0 2 75 1 9 0 ... \n",
- "\n",
- " WORKEDYR INCTOT INCWAGE INCWELFR INCINVST INCEARN POVERTY \\\n",
- "0 3 14000 14000 0 0 14000 118 \n",
- "1 3 18000 0 0 0 18000 262 \n",
- "2 3 14000 14000 0 0 14000 118 \n",
- "3 3 3800 3800 0 0 3800 262 \n",
- "4 3 14000 14000 0 0 14000 501 \n",
- "... ... ... ... ... ... ... ... \n",
- "1035196 3 36000 36000 0 0 36000 231 \n",
- "1035197 3 52800 52000 0 0 52000 361 \n",
- "1035198 2 25800 0 0 0 0 200 \n",
- "1035199 3 5000 5000 0 0 5000 399 \n",
- "1035200 1 9600 0 0 0 0 501 \n",
- "\n",
- " DEPARTS ARRIVES sim_individual_id \n",
- "0 902 909 12 \n",
- "1 732 744 33 \n",
- "2 642 654 401 \n",
- "3 0 0 470 \n",
- "4 0 0 702 \n",
- "... ... ... ... \n",
- "1035196 1605 1624 556291 \n",
- "1035197 1005 1019 1139708 \n",
- "1035198 0 0 346052 \n",
- "1035199 732 754 40265 \n",
- "1035200 0 0 811103 \n",
- "\n",
- "[1035201 rows x 36 columns]"
- ]
- },
- "execution_count": 3,
- "metadata": {},
- "output_type": "execute_result"
- }
- ],
- "source": [
- "public, schema = sdnist.census(public=True)\n",
- "\n",
- "public"
- ]
- },
- {
- "cell_type": "markdown",
- "id": "58a5e3cd",
- "metadata": {},
- "source": [
- "## 2. Subsampling \n",
- "\n",
- "As a warm-up, we compute the score of a subsampled dataset.\n",
- "\n",
- "### 2.1 Naive subsampling"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 4,
- "id": "c4b385b7",
- "metadata": {},
- "outputs": [
- {
- "name": "stderr",
- "output_type": "stream",
- "text": [
- " 8%|███▌ | 25/300 [00:00<00:02, 116.53it/s]/opt/homebrew/Caskroom/miniforge/base/envs/py39/lib/python3.9/site-packages/pandas/core/indexes/multi.py:3554: RuntimeWarning: The values in the array are unorderable. Pass `sort=False` to suppress this warning.\n",
- " result = lib.fast_unique_multiple([self._values, rvals], sort=sort)\n",
- "100%|██████████████████████████████████████████| 300/300 [00:02<00:00, 104.17it/s]\n"
- ]
- },
- {
- "data": {
- "text/plain": [
- "135.37621976379612"
- ]
- },
- "execution_count": 4,
- "metadata": {},
- "output_type": "execute_result"
- }
- ],
- "source": [
- "# Randomly select rows\n",
- "synthetic = public.sample(frac=0.02)\n",
- "\n",
- "# Compute longitudinal score\n",
- "score = sdnist.kmarginal.CensusLongitudinalKMarginalScore(public, synthetic)\n",
- "score.compute_score()"
- ]
- },
- {
- "cell_type": "markdown",
- "id": "04c67f01",
- "metadata": {},
- "source": [
- "### 2.2 User-level subsampling"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 5,
- "id": "5f9900a5",
- "metadata": {},
- "outputs": [
- {
- "name": "stderr",
- "output_type": "stream",
- "text": [
- "100%|██████████████████████████████████████████| 300/300 [00:02<00:00, 104.15it/s]\n"
- ]
- },
- {
- "data": {
- "text/plain": [
- "948.6686829496338"
- ]
- },
- "execution_count": 5,
- "metadata": {},
- "output_type": "execute_result"
- }
- ],
- "source": [
- "# Build flat dataset\n",
- "public_flat = sdnist.utils.unstack(public)\n",
- "\n",
- "# Randomly sample individuals\n",
- "synthetic_flat = public_flat.sample(frac=0.1)\n",
- "\n",
- "# Reset into original format\n",
- "synthetic = sdnist.utils.stack(synthetic_flat)\n",
- "\n",
- "# Compute score\n",
- "score = sdnist.kmarginal.CensusLongitudinalKMarginalScore(public, synthetic)\n",
- "score.compute_score()"
- ]
- },
- {
- "cell_type": "markdown",
- "id": "16cfe8f4",
- "metadata": {},
- "source": [
- "## 3. First-order Bayesian network ($k=1$)\n",
- "\n",
- "We follow *PrivBayes: Private Data Release via Bayesian Networks* (http://dimacs.rutgers.edu/~graham/pubs/papers/PrivBayes.pdf), except that we compute the tree structure on a public dataset.\n",
- "\n",
- "\n",
- "### 3.1 Building the Chow-Liu tree structure from the public dataset\n",
- "\n",
- "We represent the joint distribution as a first-order Bayesian network. The dependency tree is constructed from the public dataset.\n",
- "\n",
- "We only allow a column to depend on the current or the previous year."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 6,
- "id": "5502e6cb",
- "metadata": {},
- "outputs": [
- {
- "name": "stderr",
- "output_type": "stream",
- "text": [
- "100%|███████████████████████████████████████████| 237/237 [00:44<00:00, 5.30it/s]\n"
- ]
- }
- ],
- "source": [
- "public_bin = sdnist.utils.discretize(public, sdnist.kmarginal.CensusKMarginalScore.BINS)\n",
- "public_flat = sdnist.utils.unstack(public_bin)\n",
- "n = len(public_flat)\n",
- "\n",
- "def mutual_information(df, col_a, col_b):\n",
- "    ab = df.groupby([*col_a, col_b]).size().unstack(col_b, fill_value=0).to_numpy() / n\n",
- "    a = ab.sum(axis=0, keepdims=True)\n",
- "    b = ab.sum(axis=1, keepdims=True)\n",
- "\n",
- "    llr = np.zeros_like(ab)\n",
- "    np.log(ab / (a * b), where=ab > 0, out=llr)\n",
- "    return np.sum(ab * llr)\n",
- "\n",
- "def greedy_bayes(df, root = (\"PUMA\", 2012), order=1):\n",
- "    # Graph: for visualization purposes\n",
- "    graph = nx.DiGraph()\n",
- "    graph.add_node(root)\n",
- "\n",
- "    # Conditional distribution in topological order\n",
- "    cond = [(root,)]\n",
- "\n",
- "    # Greedy algorithm\n",
- "    remaining = list(sorted(df.columns, key=lambda c: c[1]))\n",
- "    remaining.remove(root)\n",
- "\n",
- "    mutual_information_memoize = functools.cache(functools.partial(mutual_information, df))\n",
- "\n",
- "    for i in tqdm(range(len(df.columns)-1)):\n",
- "        max_col = None\n",
- "        max_parents = None\n",
- "        max_mi = 0\n",
- "\n",
- "        for col in remaining:\n",
- "            # col[1] : YEAR\n",
- "            if col[1] != remaining[0][1]:\n",
- "                continue\n",
- "\n",
- "            for parents in itertools.combinations(graph.nodes, r=min(len(graph.nodes), order)):\n",
- "                if parents[0][1] < col[1] - 1:\n",
- "                    continue\n",
- "\n",
- "                mi = mutual_information_memoize(parents, col)\n",
- "\n",
- "                if mi > max_mi:\n",
- "                    max_mi = mi\n",
- "                    max_col = col\n",
- "                    max_parents = parents\n",
- "\n",
- "        graph.add_node(max_col)\n",
- "        graph.add_edges_from(((p, max_col) for p in max_parents), weight=max_mi)\n",
- "\n",
- "        cond.append((*max_parents, max_col))\n",
- "\n",
- "        remaining.remove(max_col)\n",
- "\n",
- "    return graph, cond\n",
- " \n",
- "graph, cond = greedy_bayes(public_flat, order=1)"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 7,
- "id": "d7de594a",
- "metadata": {},
- "outputs": [
- {
- "data": {
- "image/png": "\n",
- "text/plain": [
- ""
- ]
- },
- "metadata": {},
- "output_type": "display_data"
- }
- ],
- "source": [
- "colors = list(mcolors.TABLEAU_COLORS.values())\n",
- "\n",
- "plt.figure(dpi=200)\n",
- "pos = graphviz_layout(graph, prog=\"neato\")\n",
- "width = [w ** .5 for w in nx.get_edge_attributes(graph, \"weight\").values()]\n",
- "nx.draw(graph, pos=pos, width=width, edge_color=\"grey\", node_color=[colors[n[1]-2012] for n in graph.nodes],\n",
- " node_size=50)\n",
- "nx.draw_networkx_labels(graph, pos, labels={n:n[0] for n in nx.nodes(graph)}, font_size=3);"
- ]
- },
- {
- "cell_type": "markdown",
- "id": "5e15c345",
- "metadata": {},
- "source": [
- "### 3.2 Computing the conditional distributions"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 8,
- "id": "0ccbbf00",
- "metadata": {},
- "outputs": [
- {
- "name": "stderr",
- "output_type": "stream",
- "text": [
- "100%|██████████████████████████████████████████| 237/237 [00:01<00:00, 204.69it/s]\n"
- ]
- },
- {
- "data": {
- "text/plain": [
- ""
- ]
- },
- "execution_count": 8,
- "metadata": {},
- "output_type": "execute_result"
- },
- {
- "data": {
- "image/png": "\n",
- "text/plain": [
- ""
- ]
- },
- "metadata": {
- "needs_background": "light"
- },
- "output_type": "display_data"
- }
- ],
- "source": [
- "eps = 1\n",
- "noise_scale = len(public_flat.columns) / (n * eps)  # Laplace scale is inversely proportional to eps\n",
- "\n",
- "def compute_marginal_distribution(df, column, noise_scale: float = 0):\n",
- "    marginal = df.groupby(column).size() / n\n",
- "\n",
- "    # Only place where the data is accessed\n",
- "    if noise_scale > 0:\n",
- "        marginal += np.random.laplace(scale=noise_scale, size=marginal.size)\n",
- "        marginal = marginal.clip(lower=0)\n",
- "    return marginal / marginal.sum(axis=0)\n",
- "\n",
- "def compute_conditional_distribution(df, column, parents, noise_scale: float = 0):\n",
- "    joint = (df.groupby([column] + parents).size() / n)\n",
- "    joint = joint.unstack(column, fill_value=0)\n",
- "\n",
- "    # For some reason this is necessary (?)\n",
- "    joint.columns = pd.Index([k[0] for k in joint.columns], name=str(column))\n",
- "\n",
- "    if noise_scale > 0:\n",
- "        joint += np.random.laplace(scale=noise_scale, size=joint.shape)\n",
- "        joint = joint.clip(lower=0)\n",
- "\n",
- "    return joint.div(joint.sum(axis=1), axis=\"rows\")\n",
- "\n",
- "def compute_bayesnet(df, cond, noise_scale: float = noise_scale):\n",
- "    bayesnet = {}\n",
- "\n",
- "    node = cond[0]\n",
- "    *parents, root = node\n",
- "    bayesnet[node] = compute_marginal_distribution(df, root, noise_scale)\n",
- "\n",
- "    for node in tqdm(cond[1:]):\n",
- "        *parents, column = node\n",
- "        bayesnet[node] = compute_conditional_distribution(df, column, parents, noise_scale)\n",
- "\n",
- "    return bayesnet\n",
- " \n",
- "# This technique does not actually work, as any small amount of noise\n",
- "# completely destroys the conditional distributions\n",
- "bayesnet = compute_bayesnet(public_flat, cond, noise_scale=0)\n",
- "\n",
- "plt.imshow(bayesnet[(\"PUMA\", 2012), (\"PUMA\", 2013)].to_numpy())"
- ]
- },
- {
- "cell_type": "markdown",
- "id": "7fa04fbb",
- "metadata": {},
- "source": [
- "Above, we plot the PUMA transitions between 2012 and 2013.\n",
- "\n",
- "### 3.3 Sampling from the first-order Bayesian network"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 9,
- "id": "0be6fb06",
- "metadata": {
- "scrolled": false
- },
- "outputs": [
- {
- "name": "stderr",
- "output_type": "stream",
- "text": [
- "100%|███████████████████████████████████████████| 237/237 [00:07<00:00, 33.59it/s]\n"
- ]
- },
- {
- "data": {
- "text/html": [
- ],
- "text/plain": [
- " PUMA YEAR HHWT GQ PERWT SEX AGE MARST RACE HISPAN ... \\\n",
- "539 17-1001 2012 80.0 1 80.0 2 45.0 1 1 0 ... \n",
- "875 17-1001 2012 80.0 1 80.0 2 45.0 1 1 0 ... \n",
- "1974 17-1001 2012 40.0 1 60.0 2 65.0 1 1 1 ... \n",
- "2457 17-1001 2012 40.0 1 40.0 2 65.0 4 1 0 ... \n",
- "3122 17-1001 2012 20.0 1 20.0 1 30.0 6 1 0 ... \n",
- "... ... ... ... .. ... ... ... ... ... ... ... \n",
- "1177343 39-910 2018 20.0 1 20.0 2 60.0 6 1 0 ... \n",
- "1178043 39-910 2018 100.0 1 100.0 1 55.0 1 1 0 ... \n",
- "1182138 39-910 2018 140.0 1 160.0 1 35.0 4 1 0 ... \n",
- "1184966 39-910 2018 60.0 1 80.0 1 55.0 1 1 0 ... \n",
- "1186261 39-910 2018 60.0 1 60.0 2 65.0 4 1 0 ... \n",
- "\n",
- " WORKEDYR INCTOT INCWAGE INCWELFR INCINVST INCEARN POVERTY \\\n",
- "539 3 55000.0 50000.0 0.0 0.0 50000.0 400.0 \n",
- "875 3 10000.0 10000.0 0.0 0.0 10000.0 80.0 \n",
- "1974 1 10000.0 0.0 0.0 0.0 0.0 60.0 \n",
- "2457 1 10000.0 0.0 0.0 0.0 0.0 80.0 \n",
- "3122 3 85000.0 85000.0 0.0 0.0 85000.0 501.0 \n",
- "... ... ... ... ... ... ... ... \n",
- "1177343 3 100000.0 0.0 5000.0 0.0 100000.0 501.0 \n",
- "1178043 1 15000.0 0.0 0.0 0.0 0.0 501.0 \n",
- "1182138 3 80000.0 80000.0 0.0 0.0 80000.0 501.0 \n",
- "1184966 3 100000.0 100000.0 0.0 0.0 100000.0 501.0 \n",
- "1186261 1 25000.0 0.0 0.0 0.0 0.0 200.0 \n",
- "\n",
- " DEPARTS ARRIVES sim_individual_id \n",
- "539 715.0 745.0 77 \n",
- "875 715.0 715.0 125 \n",
- "1974 0.0 0.0 282 \n",
- "2457 0.0 0.0 351 \n",
- "3122 915.0 915.0 446 \n",
- "... ... ... ... \n",
- "1177343 600.0 630.0 168191 \n",
- "1178043 0.0 0.0 168291 \n",
- "1182138 845.0 845.0 168876 \n",
- "1184966 645.0 730.0 169280 \n",
- "1186261 0.0 0.0 169465 \n",
- "\n",
- "[668219 rows x 36 columns]"
- ]
- },
- "execution_count": 9,
- "metadata": {},
- "output_type": "execute_result"
- }
- ],
- "source": [
- "def generate_sample(columns, cond, bayesnet, size=10):\n",
- "    synthetic = pd.DataFrame(0, columns=columns, index=np.arange(size))\n",
- "\n",
- "    # Sample the first column as i.i.d variables on the root node.\n",
- "    node = cond[0]\n",
- "    *parents, root = node\n",
- "    dist = bayesnet[node]\n",
- "    k = np.random.choice(a=len(dist), size=size, replace=True, p=dist.to_numpy())\n",
- "    synthetic[root] = dist.index[k]\n",
- "    # Conditional distributions\n",
- "    for node in tqdm(cond[1:]):\n",
- "        *parents, column = node\n",
- "        dist = bayesnet[node]\n",
- "\n",
- "        if len(parents) == 1:\n",
- "            rows = synthetic[parents[0]]\n",
- "            cumsum = dist.loc[rows].to_numpy().cumsum(axis=1)\n",
- "        else:\n",
- "            raise NotImplementedError()\n",
- "\n",
- "        u = np.random.rand(size)\n",
- "        k = (u[:, None] > cumsum).sum(axis=1)\n",
- "        synthetic[column] = dist.columns[k]\n",
- "\n",
- "    return synthetic\n",
- "\n",
- "synthetic_flat = generate_sample(public_flat.columns, cond, bayesnet, size=len(public_flat))\n",
- "synthetic_bin = sdnist.utils.stack(synthetic_flat)\n",
- "\n",
- "synthetic = sdnist.utils.undo_discretize(synthetic_bin, sdnist.kmarginal.CensusKMarginalScore.BINS)\n",
- "synthetic = synthetic.reindex(public.columns, axis=1).sort_values([\"PUMA\", \"YEAR\"])\n",
- "\n",
- "synthetic"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 10,
- "id": "b5545769",
- "metadata": {
- "scrolled": false
- },
- "outputs": [
- {
- "name": "stderr",
- "output_type": "stream",
- "text": [
- " 10%|████▍ | 30/300 [00:00<00:03, 86.79it/s]/opt/homebrew/Caskroom/miniforge/base/envs/py39/lib/python3.9/site-packages/pandas/core/indexes/multi.py:3554: RuntimeWarning: The values in the array are unorderable. Pass `sort=False` to suppress this warning.\n",
- " result = lib.fast_unique_multiple([self._values, rvals], sort=sort)\n",
- "100%|███████████████████████████████████████████| 300/300 [00:04<00:00, 71.09it/s]\n"
- ]
- },
- {
- "data": {
- "text/plain": [
- "493.97682178239444"
- ]
- },
- "execution_count": 10,
- "metadata": {},
- "output_type": "execute_result"
- }
- ],
- "source": [
- "score = sdnist.kmarginal.CensusLongitudinalKMarginalScore(public, synthetic)\n",
- "score.compute_score()"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "id": "3103cb23",
- "metadata": {},
- "outputs": [],
- "source": []
- }
- ],
- "metadata": {
- "kernelspec": {
- "display_name": "Python 3 (ipykernel)",
- "language": "python",
- "name": "python3"
- },
- "language_info": {
- "codemirror_mode": {
- "name": "ipython",
- "version": 3
- },
- "file_extension": ".py",
- "mimetype": "text/x-python",
- "name": "python",
- "nbconvert_exporter": "python",
- "pygments_lexer": "ipython3",
- "version": "3.9.7"
- }
- },
- "nbformat": 4,
- "nbformat_minor": 5
-}
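Reviewer note on the deleted `mutual_information` helper (section 3.1 of the notebook above): the following is a minimal standalone sketch of the same quantity, assuming two discrete columns in a pandas DataFrame. It uses `pd.crosstab` in place of the notebook's `groupby(...).unstack(...)` and takes plain column names rather than tuple-valued parents, so it illustrates the idea and is not the notebook's own code.

```python
import numpy as np
import pandas as pd


def mutual_information(df: pd.DataFrame, col_a: str, col_b: str) -> float:
    """Empirical mutual information (in nats) between two discrete columns."""
    # Joint distribution as a 2-D table: rows index col_a, columns index col_b.
    joint = pd.crosstab(df[col_a], df[col_b]).to_numpy() / len(df)
    # Marginals, kept 2-D so they broadcast against the joint table.
    p_a = joint.sum(axis=1, keepdims=True)
    p_b = joint.sum(axis=0, keepdims=True)
    # log(p(a, b) / (p(a) p(b))), left at 0 wherever the joint probability is 0.
    log_ratio = np.zeros_like(joint)
    np.log(joint / (p_a * p_b), where=joint > 0, out=log_ratio)
    return float(np.sum(joint * log_ratio))
```

Identical columns give the entropy of the column, and independent columns give zero, which is the ordering the greedy tree construction relies on when it picks the parent with the largest mutual information.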
diff --git a/examples/census_subsample.pdf b/examples/census_subsample.pdf
deleted file mode 100644
index 57a6d5a..0000000
Binary files a/examples/census_subsample.pdf and /dev/null differ
diff --git a/examples/minimal.ipynb b/examples/minimal.ipynb
deleted file mode 100644
index 5c22038..0000000
--- a/examples/minimal.ipynb
+++ /dev/null
@@ -1,116 +0,0 @@
-{
- "cells": [
- {
- "cell_type": "code",
- "execution_count": null,
- "id": "946f4143",
- "metadata": {},
- "outputs": [],
- "source": [
- "import matplotlib.pyplot as plt\n",
- "import sdnist"
- ]
- },
- {
- "cell_type": "markdown",
- "id": "c2ffe2d4",
- "metadata": {},
- "source": [
- "1. Load the public dataset:"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "id": "c7ba484b",
- "metadata": {},
- "outputs": [],
- "source": [
- "public_data, schema = sdnist.census(public=True)\n",
- "dataset_path = sdnist.load.build_name(challenge='census', public=True)\n",
- "public_data"
- ]
- },
- {
- "cell_type": "markdown",
- "id": "b498877f",
- "metadata": {},
- "source": [
- "2. Compute k-marginal scores:"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "id": "618c29f5",
- "metadata": {},
- "outputs": [],
- "source": [
- "fracs = (0.02, 0.1, 0.5)\n",
- "scores = []\n",
- "\n",
- "for i, frac in enumerate(fracs):\n",
- "    synthetic_data = public_data.sample(frac=frac)\n",
- "    score = sdnist.score(public_data, synthetic_data, schema, challenge=\"census\")\n",
- "    scores.append(score)\n",
- "    score.boxplot(idx=i, name=str(frac))\n",
- "\n",
- "plt.ylabel(\"Subsample fraction\")\n",
- "plt.xlabel(\"K-marginal score\")\n",
- "plt.title(\"Scores of subsampled datasets (census)\")\n",
- "plt.savefig(\"census_subsample.pdf\")\n",
- "\n",
- "scores"
- ]
- },
- {
- "cell_type": "markdown",
- "id": "318268c7",
- "metadata": {},
- "source": [
- "3. Display the k-marginal score"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "id": "55cffca1",
- "metadata": {},
- "outputs": [],
- "source": [
- "# As a map (only works on the public dataset - IL/OH)\n",
- "print(dataset_path)\n",
- "scores[0].html(dataset_path, schema)"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "id": "8567d8c6",
- "metadata": {},
- "outputs": [],
- "source": []
- }
- ],
- "metadata": {
- "kernelspec": {
- "display_name": "knexusnlp",
- "language": "python",
- "name": "knexusnlp"
- },
- "language_info": {
- "codemirror_mode": {
- "name": "ipython",
- "version": 3
- },
- "file_extension": ".py",
- "mimetype": "text/x-python",
- "name": "python",
- "nbconvert_exporter": "python",
- "pygments_lexer": "ipython3",
- "version": "3.7.10"
- }
- },
- "nbformat": 4,
- "nbformat_minor": 5
-}
diff --git a/examples/pgm/pgm.ipynb b/examples/pgm/pgm.ipynb
deleted file mode 100644
index bc39f13..0000000
--- a/examples/pgm/pgm.ipynb
+++ /dev/null
@@ -1,1117 +0,0 @@
-{
- "cells": [
- {
- "cell_type": "markdown",
- "id": "9d397bd7",
- "metadata": {},
- "source": [
- "# NIST-MST on Census dataset\n",
- "\n",
- "Adapted from **WINNING THE NIST CONTEST: A SCALABLE AND GENERAL APPROACH TO DIFFERENTIALLY PRIVATE SYNTHETIC DATA** (https://arxiv.org/pdf/2108.04978.pdf)\n",
- "\n",
- "Depends on https://github.com/ryan112358/private-pgm."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 1,
- "id": "302b3208",
- "metadata": {},
- "outputs": [],
- "source": [
- "import itertools\n",
- "\n",
- "import numpy as np\n",
- "import pandas as pd\n",
- "import networkx as nx\n",
- "\n",
- "import scipy.stats, scipy.optimize\n",
- "\n",
- "import pydot\n",
- "from networkx.drawing.nx_pydot import graphviz_layout\n",
- "import matplotlib.pyplot as plt\n",
- "\n",
- "from tqdm.auto import tqdm\n",
- "\n",
- "from mbi import Dataset, Domain, FactoredInference\n",
- "\n",
- "import sdnist"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 2,
- "id": "d44b3abb",
- "metadata": {},
- "outputs": [
- {
- "data": {
- "text/html": [
- ],
- "text/plain": [
- " PUMA YEAR HHWT GQ PERWT SEX AGE MARST RACE HISPAN ... \\\n",
- "517534 164 0 8 1 8 1 6 5 0 0 ... \n",
- "737386 143 3 3 1 3 1 10 3 0 0 ... \n",
- "990959 179 3 6 1 6 1 6 5 1 0 ... \n",
- "189454 17 3 5 1 5 0 4 0 0 0 ... \n",
- "349642 173 5 4 1 4 0 10 0 0 0 ... \n",
- "... ... ... ... .. ... ... ... ... ... ... ... \n",
- "278741 90 4 2 1 1 0 8 5 0 0 ... \n",
- "595636 84 1 3 1 3 1 9 0 0 0 ... \n",
- "262882 129 4 6 1 7 0 6 0 0 0 ... \n",
- "1031190 46 6 11 1 11 1 4 0 6 1 ... \n",
- "681411 86 2 4 1 4 1 13 4 0 0 ... \n",
- "\n",
- " WORKEDYR INCTOT INCWAGE INCWELFR INCINVST INCEARN POVERTY \\\n",
- "517534 3 3 3 1 1 3 7 \n",
- "737386 3 1 1 1 1 1 2 \n",
- "990959 3 1 1 1 1 1 4 \n",
- "189454 3 15 15 1 1 15 26 \n",
- "349642 3 21 1 1 1 21 26 \n",
- "... ... ... ... ... ... ... ... \n",
- "278741 1 4 1 1 1 1 26 \n",
- "595636 1 1 1 1 1 1 15 \n",
- "262882 3 21 21 1 1 21 26 \n",
- "1031190 3 7 7 1 1 7 16 \n",
- "681411 1 4 1 1 1 1 7 \n",
- "\n",
- " DEPARTS ARRIVES sim_individual_id \n",
- "517534 1 1 1166909 \n",
- "737386 1 1 581009 \n",
- "990959 1 1 570487 \n",
- "189454 76 76 166984 \n",
- "349642 1 1 1008940 \n",
- "... ... ... ... \n",
- "278741 1 1 1270662 \n",
- "595636 1 1 625537 \n",
- "262882 29 29 891652 \n",
- "1031190 39 41 1223988 \n",
- "681411 1 1 586270 \n",
- "\n",
- "[100000 rows x 36 columns]"
- ]
- },
- "execution_count": 2,
- "metadata": {},
- "output_type": "execute_result"
- }
- ],
- "source": [
- "private, schema = sdnist.census(public=True)\n",
- "private = private.sample(100000)\n",
- "private_bin = sdnist.utils.discretize(private, schema, bins=sdnist.kmarginal.CensusKMarginalScore.BINS)\n",
- "\n",
- "# Build private-pgm Domain and Dataset objects\n",
- "attrs = []\n",
- "shape = []\n",
- "\n",
- "for col in private_bin:\n",
- "    if col == \"sim_individual_id\":\n",
- "        continue\n",
- "\n",
- "    attrs.append(col)\n",
- "    shape.append(private_bin[col].max() + 1)\n",
- " \n",
- "domain = Domain(attrs=attrs, shape=shape)\n",
- "data = Dataset(private_bin, domain)\n",
- "\n",
- "private_bin"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 3,
- "id": "8b24aa77",
- "metadata": {},
- "outputs": [
- {
- "name": "stdout",
- "output_type": "stream",
- "text": [
- "41.23660003602558\n"
- ]
- }
- ],
- "source": [
- "# Calibrate noise to add to each marginal\n",
- "logsf = scipy.stats.norm.logsf\n",
- "\n",
- "def delta_eps_normal(eps, mu):\n",
- "    a = logsf(eps/mu - mu*.5)\n",
- "    b = eps + logsf(eps/mu + mu*.5)\n",
- "\n",
- "    return np.exp(b) * (np.expm1(a - b))\n",
- "\n",
- "eps, delta = 1, 2.5e-5\n",
- "C = 7 # maximum contribution of each individual\n",
- "\n",
- "mu = scipy.optimize.bisect(lambda m: delta_eps_normal(eps, m) - delta, a=1e-5, b=2)\n",
- "sigma = np.sqrt((2 * len(attrs) - 1) * C / mu) \n",
- "print(sigma)"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 4,
- "id": "86d6a3a4",
- "metadata": {},
- "outputs": [
- {
- "name": "stdout",
- "output_type": "stream",
- "text": [
- "Total clique size: 682\n",
- "iteration\t\ttime\t\tl1_loss\t\tl2_loss\t\tfeasibility\n",
- "0.00\t\t0.00\t\t83306.88\t\t31352539.74\t\t0.00\n",
- "50.00\t\t0.60\t\t3136.26\t\t13505.00\t\t0.00\n",
- "100.00\t\t1.34\t\t947.89\t\t1260.56\t\t0.00\n",
- "150.00\t\t2.07\t\t531.24\t\t446.09\t\t0.00\n",
- "200.00\t\t2.80\t\t373.46\t\t236.39\t\t0.00\n",
- "250.00\t\t3.53\t\t293.89\t\t153.36\t\t0.00\n",
- "300.00\t\t4.27\t\t246.46\t\t110.69\t\t0.00\n",
- "350.00\t\t5.00\t\t217.41\t\t86.79\t\t0.00\n",
- "400.00\t\t5.71\t\t197.77\t\t71.83\t\t0.00\n",
- "450.00\t\t6.43\t\t183.67\t\t61.61\t\t0.00\n",
- "500.00\t\t7.15\t\t172.87\t\t54.46\t\t0.00\n",
- "550.00\t\t7.88\t\t164.41\t\t49.23\t\t0.00\n",
- "600.00\t\t8.61\t\t157.58\t\t45.29\t\t0.00\n",
- "650.00\t\t9.33\t\t151.96\t\t42.24\t\t0.00\n",
- "700.00\t\t10.09\t\t147.30\t\t39.77\t\t0.00\n",
- "750.00\t\t10.81\t\t143.25\t\t37.81\t\t0.00\n",
- "800.00\t\t11.57\t\t139.77\t\t36.21\t\t0.00\n",
- "850.00\t\t12.27\t\t136.81\t\t34.90\t\t0.00\n",
- "900.00\t\t13.02\t\t134.23\t\t33.79\t\t0.00\n",
- "950.00\t\t13.75\t\t132.02\t\t32.85\t\t0.00\n"
- ]
- }
- ],
- "source": [
- "# Step 1 : measure all rank-1 marginals\n",
- "measurements_1 = []\n",
- "\n",
- "for col in attrs:\n",
- "    y = data.project([col]).datavector()\n",
- "    y += np.random.normal(scale=sigma, size=y.size)\n",
- "\n",
- "    measurements_1.append((np.eye(y.size), y, sigma, (col,)))\n",
- " \n",
- "engine = FactoredInference(domain, log=True)\n",
- "model = engine.estimate(measurements_1, engine=\"MD\")"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 5,
- "id": "2e89abcf",
- "metadata": {
- "scrolled": false
- },
- "outputs": [
- {
- "data": {
- "application/vnd.jupyter.widget-view+json": {
- "model_id": "1a44d05908c6478cb6edfb114b4dcdf1",
- "version_major": 2,
- "version_minor": 0
- },
- "text/plain": [
- " 0%| | 0/595 [00:00, ?it/s]"
- ]
- },
- "metadata": {},
- "output_type": "display_data"
- }
- ],
- "source": [
- "q = {}\n",
- "\n",
- "total = int(len(attrs) * (len(attrs) - 1) * .5)\n",
- "for i, j in tqdm(itertools.combinations(attrs, r=2), total=total):\n",
- " q[i, j] = np.abs(data.project([i, j]).datavector() - model.project([i, j]).datavector()).sum()"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 6,
- "id": "4c19990b",
- "metadata": {},
- "outputs": [
- {
- "name": "stdout",
- "output_type": "stream",
- "text": [
- "340.0\n"
- ]
- }
- ],
- "source": [
- "graph = nx.Graph()\n",
- "graph.add_nodes_from(attrs)\n",
- "\n",
- "# Compute noise scale (using naive composition for simplicity)\n",
- "eps = 0.05\n",
- "r = len(attrs) - 1\n",
- "noise_scale = .5 / (eps) * r\n",
- "print(noise_scale)\n",
- "\n",
- "for _ in range(len(attrs)-1):\n",
- "    # Compute set of disconnected attrs\n",
- "    # The whole algorithm is probably equivalent to a DP-Maximum spanning tree (?)\n",
- "    S = {(i,j): q[i,j] + np.random.gumbel(scale=noise_scale) for (i, j) in q if not nx.has_path(graph, i,j)}\n",
- "\n",
- "    # Exponential mechanism\n",
- "    ij_max = 0\n",
- "    qij_max = 0\n",
- "\n",
- "    for ij, qij in S.items():\n",
- "        if qij > qij_max:\n",
- "            qij_max = qij\n",
- "            ij_max = ij\n",
- "\n",
- "    graph.add_edge(*ij_max, weight=qij_max)\n",
- " \n"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 7,
- "id": "181cb28c",
- "metadata": {},
- "outputs": [
- {
- "data": {
- "image/png": "\n",
- "text/plain": [
- ""
- ]
- },
- "metadata": {},
- "output_type": "display_data"
- }
- ],
- "source": [
- "plt.figure(dpi=100)\n",
- "pos = graphviz_layout(graph, prog=\"neato\")\n",
- "width = [w ** .5 * .003 for w in nx.get_edge_attributes(graph, \"weight\").values()]\n",
- "nx.draw(graph, pos=pos, with_labels=True, font_size=8, width=width, edge_color=\"grey\")"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 8,
- "id": "26bda7c6",
- "metadata": {},
- "outputs": [
- {
- "name": "stdout",
- "output_type": "stream",
- "text": [
- "Total clique size: 34234\n",
- "iteration\t\ttime\t\tl1_loss\t\tl2_loss\t\tfeasibility\n",
- "0.00\t\t0.00\t\t212967.95\t\t49990107.75\t\t0.00\n",
- "50.00\t\t5.90\t\t63995.64\t\t884823.73\t\t0.00\n",
- "100.00\t\t13.55\t\t49242.30\t\t308068.88\t\t0.00\n",
- "150.00\t\t21.19\t\t43896.49\t\t186526.22\t\t0.00\n",
- "200.00\t\t29.06\t\t40848.17\t\t131835.94\t\t0.00\n",
- "250.00\t\t36.67\t\t38871.86\t\t100987.97\t\t0.00\n",
- "300.00\t\t45.18\t\t37419.72\t\t81512.79\t\t0.00\n",
- "350.00\t\t53.69\t\t36241.84\t\t67954.61\t\t0.00\n",
- "400.00\t\t61.63\t\t35326.58\t\t58597.01\t\t0.00\n",
- "450.00\t\t70.37\t\t34524.29\t\t51390.81\t\t0.00\n",
- "500.00\t\t78.76\t\t33863.88\t\t46000.76\t\t0.00\n",
- "550.00\t\t93.82\t\t33295.41\t\t41773.42\t\t0.00\n",
- "600.00\t\t103.89\t\t32804.28\t\t38399.02\t\t0.00\n",
- "650.00\t\t111.98\t\t32370.04\t\t35658.83\t\t0.00\n",
- "700.00\t\t119.59\t\t31993.86\t\t33413.24\t\t0.00\n",
- "750.00\t\t127.12\t\t31649.82\t\t31524.05\t\t0.00\n",
- "800.00\t\t135.21\t\t31341.75\t\t29937.51\t\t0.00\n",
- "850.00\t\t143.25\t\t31064.36\t\t28592.97\t\t0.00\n",
- "900.00\t\t151.37\t\t30799.75\t\t27412.58\t\t0.00\n",
- "950.00\t\t158.78\t\t30566.23\t\t26404.30\t\t0.00\n"
- ]
- }
- ],
- "source": [
- "# Step 2\n",
- "\n",
- "measurements_2 = [] \n",
- "\n",
- "for i, j in graph.edges:\n",
- " y = data.project([i, j]).datavector()\n",
- " y += np.random.normal(scale=sigma, size=y.size)\n",
- " \n",
- " measurements_2.append((np.eye(y.size), y, sigma, (i, j)))\n",
- " \n",
- "model = engine.estimate(measurements_1 + measurements_2, engine=\"MD\")"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 9,
- "id": "dc21b9e6",
- "metadata": {
- "scrolled": false
- },
- "outputs": [
- {
- "data": {
- "text/plain": [
- " PUMA YEAR HHWT GQ PERWT SEX AGE MARST RACE HISPAN ... \\\n",
- "0 5 6 4 1 4 0 3 5 0 0 ... \n",
- "1 149 1 4 1 5 1 13 0 0 0 ... \n",
- "2 34 5 4 1 5 1 8 0 0 0 ... \n",
- "3 149 3 5 1 10 1 12 4 0 0 ... \n",
- "4 105 3 4 1 4 1 10 0 0 0 ... \n",
- "... ... ... ... .. ... ... ... ... ... ... ... \n",
- "99995 89 4 4 1 5 1 1 5 0 0 ... \n",
- "99996 120 5 3 1 3 0 1 5 0 1 ... \n",
- "99997 123 5 8 1 4 0 3 0 0 0 ... \n",
- "99998 174 2 4 1 4 1 2 0 0 0 ... \n",
- "99999 80 1 26 1 10 1 7 0 0 0 ... \n",
- "\n",
- " WRKRECAL WORKEDYR INCTOT INCWAGE INCWELFR INCINVST INCEARN \\\n",
- "0 3 3 5 1 1 1 1 \n",
- "1 3 2 1 1 1 1 1 \n",
- "2 3 3 7 7 1 1 7 \n",
- "3 3 1 4 1 1 1 1 \n",
- "4 3 1 5 1 1 1 1 \n",
- "... ... ... ... ... ... ... ... \n",
- "99995 3 3 2 1 1 15 1 \n",
- "99996 3 3 5 7 1 1 7 \n",
- "99997 3 3 8 11 1 1 8 \n",
- "99998 3 3 6 1 1 1 20 \n",
- "99999 3 3 9 9 1 1 9 \n",
- "\n",
- " POVERTY DEPARTS ARRIVES \n",
- "0 11 35 27 \n",
- "1 22 1 1 \n",
- "2 20 21 22 \n",
- "3 6 1 1 \n",
- "4 26 1 1 \n",
- "... ... ... ... \n",
- "99995 4 1 1 \n",
- "99996 9 27 27 \n",
- "99997 20 28 24 \n",
- "99998 26 48 78 \n",
- "99999 21 84 33 \n",
- "\n",
- "[100000 rows x 35 columns]"
- ]
- },
- "execution_count": 9,
- "metadata": {},
- "output_type": "execute_result"
- }
- ],
- "source": [
- "synthetic_bin = model.synthetic_data(rows=len(private)).df\n",
- "synthetic_bin"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 10,
- "id": "afa0ae64",
- "metadata": {},
- "outputs": [
- {
- "name": "stderr",
- "output_type": "stream",
- "text": [
- "100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 50/50 [00:01<00:00, 25.90it/s]\n"
- ]
- }
- ],
- "source": [
- "synthetic = sdnist.utils.undo_discretize(synthetic_bin, schema, sdnist.kmarginal.CensusKMarginalScore.BINS)\n",
- "score = sdnist.score(private, synthetic, schema)"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 14,
- "id": "a3fd2bae",
- "metadata": {},
- "outputs": [
- {
- "data": {
- "text/plain": [
- "Text(0.5, 1.0, 'Score distribution over (PUMA,YEAR)')"
- ]
- },
- "execution_count": 14,
- "metadata": {},
- "output_type": "execute_result"
- },
- {
- "data": {
- "image/png": "\n",
- "text/plain": [
- ""
- ]
- },
- "metadata": {
- "needs_background": "light"
- },
- "output_type": "display_data"
- }
- ],
- "source": [
- "plt.figure(figsize=(6, 2))\n",
- "score.boxplot(name=\"private-pgm\")\n",
- "plt.title(f\"Score distribution over (PUMA,YEAR)\")"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "id": "e1be6f97",
- "metadata": {},
- "outputs": [],
- "source": []
- }
- ],
- "metadata": {
- "kernelspec": {
- "display_name": "Python 3 (ipykernel)",
- "language": "python",
- "name": "python3"
- },
- "language_info": {
- "codemirror_mode": {
- "name": "ipython",
- "version": 3
- },
- "file_extension": ".py",
- "mimetype": "text/x-python",
- "name": "python",
- "nbconvert_exporter": "python",
- "pygments_lexer": "ipython3",
- "version": "3.9.7"
- }
- },
- "nbformat": 4,
- "nbformat_minor": 5
-}
diff --git a/examples/score_example.png b/examples/score_example.png
deleted file mode 100644
index 3947483..0000000
Binary files a/examples/score_example.png and /dev/null differ
diff --git a/leadersboard/Leadersboard_census_challenge.pdf b/leadersboard/Leadersboard_census_challenge.pdf
deleted file mode 100644
index d9dc03b..0000000
Binary files a/leadersboard/Leadersboard_census_challenge.pdf and /dev/null differ
diff --git a/leadersboard/Leadersboard_taxi_challenge.pdf b/leadersboard/Leadersboard_taxi_challenge.pdf
deleted file mode 100644
index 5ace13c..0000000
Binary files a/leadersboard/Leadersboard_taxi_challenge.pdf and /dev/null differ
diff --git a/sdnist/load.py b/sdnist/load.py
index 55e91f8..eeb9d66 100644
--- a/sdnist/load.py
+++ b/sdnist/load.py
@@ -16,6 +16,8 @@
import sdnist.strs as strs
+DEFAULT_DATASET = 'diverse_community_excerpts_data'
+
class TestDatasetName(Enum):
NONE = 1
@@ -29,6 +31,13 @@ class TestDatasetName(Enum):
national2019 = 9
+dataset_name_state_map = {
+ TestDatasetName.national2019.name: 'national',
+ TestDatasetName.ma2019.name: 'massachusetts',
+ TestDatasetName.tx2019.name: 'texas'
+}
+
+
data_challenge_map = {
TestDatasetName.NONE: None,
TestDatasetName.GA_NC_SC_10Y_PUMS: strs.CENSUS,
@@ -76,7 +85,8 @@ def check_exists(root: Path, name: Path, download: bool, data_name: str = strs.D
version = "1.4.0-b.1"
version_v = f"v{version}"
- sdnist_version = f"SDNist-{data_name}-{version}"
+ sdnist_version = DEFAULT_DATASET
+
download_link = f"https://github.com/usnistgov/SDNist/releases/download/{version_v}/{sdnist_version}.zip"
if zip_path.exists() and error_opening_zip(zip_path):
os.remove(zip_path)
@@ -90,14 +100,15 @@ def check_exists(root: Path, name: Path, download: bool, data_name: str = strs.D
zip_path.as_posix(),
reporthook
)
- print('\n Success! Downloaded all datasets zipfile to'.format(zip_path))
+ print(f'\n Success! Downloaded all datasets to "{root}" directory\n')
except:
- shutil.rmtree(zip_path)
+ if zip_path.exists():
+ os.remove(zip_path)
raise RuntimeError(f"Unable to download {name}. Try: \n "
f"- re-running the command, \n "
f"- downloading manually from {download_link} "
- f"and unpack the zip, and copy 'data' directory in the root/working-directory, \n "
- f"- or download the data as part of a release: https://github.com/usnistgov/SDNist/releases")
+ f"and unpack the zip. \n "
+ f"- or download the data as part of a release: https://github.com/usnistgov/SDNist/releases\n")
if zip_path.exists():
# extract zipfile
@@ -110,7 +121,8 @@ def check_exists(root: Path, name: Path, download: bool, data_name: str = strs.D
raise e
# delete zipfile
os.remove(zip_path)
- copy_from_path = str(Path(extract_path, sdnist_version, 'data'))
+ print()
+ copy_from_path = str(Path(extract_path, sdnist_version))
copy_to_path = str(Path(root))
copy_tree(copy_from_path, copy_to_path)
shutil.rmtree(extract_path)
@@ -122,11 +134,11 @@ def build_name(challenge: str,
root: Path = Path("data"),
public: bool = False,
test: TestDatasetName = TestDatasetName.NONE,
- data_name: str = strs.DATA):
+ data_name: str = DEFAULT_DATASET):
root = root.expanduser()
directory = root
- if data_name == "toy-data":
+ if data_name == DEFAULT_DATASET:
directory = root
elif challenge == strs.CENSUS:
directory = root / "census" / "dataset"
@@ -164,7 +176,8 @@ def build_name(challenge: str,
else:
raise ValueError(f"Unrecognized challenge {challenge}")
-
+ if fname in dataset_name_state_map.keys():
+ fname = Path(dataset_name_state_map[fname], fname)
return directory / fname
@@ -173,7 +186,7 @@ def load_parameters(challenge: str,
public: bool = True,
test: TestDatasetName = TestDatasetName.NONE,
download: bool = True,
- data_name: str = strs.DATA) -> dict:
+ data_name: str = DEFAULT_DATASET) -> dict:
dataset_path = build_name(challenge=challenge, root=root,
public=public, test=test, data_name=data_name)
dataset_parameters = dataset_path.with_suffix('.json')
@@ -211,7 +224,7 @@ def load_dataset(challenge: str,
test: TestDatasetName = TestDatasetName.NONE,
download: bool = True,
format_: str = "parquet",
- data_name: str = 'data') -> Tuple[pd.DataFrame, dict]:
+ data_name: str = DEFAULT_DATASET) -> Tuple[pd.DataFrame, dict]:
""" Load one of the original SDNist datasets.
:param challenge: str: base challenge. Must be `census` or `taxi`.
@@ -275,7 +288,7 @@ def load_dataset(challenge: str,
else:
raise ValueError(f"Unknown format {format_}")
- if data_name != 'toy-data':
+ if data_name != DEFAULT_DATASET:
config = load_config(challenge, root, public, test, download)
params[strs.CONFIG] = config
return dataset, params
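Review note on the `sdnist/load.py` change above: a minimal sketch of where a bundled file ends up once `build_name` nests the state-level datasets under a per-state directory. The map and default root mirror the diff. The helper `nested_name` is hypothetical and reproduces only the last two added lines of `build_name`, assuming `fname` is a plain string at that point.

```python
from pathlib import Path

DEFAULT_DATASET = "diverse_community_excerpts_data"

# Mirrors dataset_name_state_map from the diff (keys are enum member names).
dataset_name_state_map = {
    "national2019": "national",
    "ma2019": "massachusetts",
    "tx2019": "texas",
}


def nested_name(directory: Path, fname: str) -> Path:
    """Hypothetical helper: put mapped datasets in a per-state subdirectory."""
    if fname in dataset_name_state_map:
        return directory / dataset_name_state_map[fname] / fname
    return directory / fname


root = Path(DEFAULT_DATASET)
target = nested_name(root, "ma2019")

print(target.as_posix())                       # dataset path without suffix
print(target.with_suffix(".json").as_posix())  # parameters file next to it
print(target.parent.parent.as_posix())         # two levels up is the data root
```

Under this layout the data root sits two levels above the dataset file, which matches the switch to `target_data_path.parent.parent` for `config.json`, `mappings.json` and `data_dictionary.json` in `sdnist/report/dataset.py`.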
diff --git a/sdnist/metrics/inconsistency.py b/sdnist/metrics/inconsistency.py
index b9785dc..ee45d43 100644
--- a/sdnist/metrics/inconsistency.py
+++ b/sdnist/metrics/inconsistency.py
@@ -24,6 +24,11 @@
["AGEP", "EDU"]),
("a", "child_NOC", "Children (< 10) don't have children",
["AGEP", "NOC"]),
+ ("a", "adult_child", "Even when the AGEP feature is not explicitly used, "
+ "features which use N to indicate children ( < 15) must agree",
+ ["MSP", "PINCP", "PINCP_DECILE"]),
+ ("a", "adult_N", "Adults ( > 14) must specify values (other than N) for all adult features",
+ ["AGEP", "MSP", "PINCP", "PINCP_DECILE", "EDU", "DPHY", "DREM"]),
("a", "toddler_DPHY", "Toddlers (< 5) naturally toddle, it's not a physical disability",
["AGEP", "DPHY"]),
("a", "toddler_DREM", "Toddlers (< 5) are naturally forgetful, it's not a cognitive disability",
@@ -155,6 +160,32 @@ def compute(self):
if "EDU" in fl and not (r["EDU"] == 'N'):
ic_dict["infant_EDU"].append(i)
+ if not ("AGEP" in fl):
+ # This forces agreement on MSP, PINCP and PINCP_DECILE if at least 2 exist.
+ if ("MSP" in fl and (r["MSP"] == 'N')) and (
+ ("PINCP" in fl and not (r["PINCP"] == 'N')) or (
+ "PINCP_DECILE" in fl and not (r["PINCP_DECILE"] == 'N'))):
+ ic_dict["adult_child"].append(i)
+ if ("MSP" in fl and not (r["MSP"] == 'N')) and (
+ ("PINCP" in fl and (r["PINCP"] == 'N')) or (
+ "PINCP_DECILE" in fl and (r["PINCP_DECILE"] == 'N'))):
+ ic_dict["adult_child"].append(i)
+ if not ("MSP" in fl) and ("PINCP" in fl) and ("PINCP_DECILE" in fl):
+ if ((r["PINCP"] == 'N') and not (r["PINCP_DECILE"] == 'N')) or (
+ not (r["PINCP"] == 'N') and (r["PINCP_DECILE"] == 'N')):
+ ic_dict["adult_child"].append(i)
+
+ # this catches adults who still have the child 'N' for their features.
+ if "AGEP" in fl and r["AGEP"] > 14:
+ if ("MSP" in fl and (r["MSP"] == 'N')) or (
+ "PINCP" in fl and (r["PINCP"] == 'N')) or (
+ "PINCP_DECILE" in fl and (r["PINCP_DECILE"] == 'N')) or (
+ "EDU" in fl and (r["EDU"] == 'N')) or (
+ "DPHY" in fl and (r["DPHY"] == 'N')) or (
+ "DREM" in fl and (r["DREM"] == 'N')):
+ ic_dict["adult_N"].append(i)
+
+
# -------------------work and finance related inconsistencies---------------
# income > 300K
if "PINCP" in fl and not (r["PINCP"] == 'N') and float(r["PINCP"]) > 3000000:
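Review note on the `adult_child` check above: the three added conditions appear to reduce to one rule, namely that among whichever of `MSP`, `PINCP` and `PINCP_DECILE` are present, either all are `'N'` or none are. The sketch below restates that rule in isolation. The function name and the record layout (a plain dict) are assumptions for illustration, and the `AGEP`-absent guard from the diff is left to the caller.

```python
from typing import Dict, List

# Features that use 'N' to mark children, as listed in the diff.
CHILD_FLAG_FEATURES = ["MSP", "PINCP", "PINCP_DECILE"]


def child_flags_disagree(record: Dict[str, object], features: List[str]) -> bool:
    """Return True if some, but not all, of the present features are 'N'."""
    present = [f for f in CHILD_FLAG_FEATURES if f in features]
    is_n = [record[f] == "N" for f in present]
    return any(is_n) and not all(is_n)


features = ["MSP", "PINCP", "PINCP_DECILE"]
consistent_child = {"MSP": "N", "PINCP": "N", "PINCP_DECILE": "N"}
mixed = {"MSP": "N", "PINCP": "25000", "PINCP_DECILE": "N"}

print(child_flags_disagree(consistent_child, features))  # all 'N': no conflict
print(child_flags_disagree(mixed, features))             # mixed: conflict
```

With only one of the three features present there is nothing to compare, so the helper returns `False`, which matches the "if at least 2 exist" comment in the diff.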
diff --git a/sdnist/report/README.md b/sdnist/report/README.md
index a9ffa5b..1db6a83 100644
--- a/sdnist/report/README.md
+++ b/sdnist/report/README.md
@@ -1,4 +1,4 @@
-SDNist v1.4 beta: Deidentified Data Report Tool
+SDNist v2.0: Deidentified Data Report Tool
====================================
This tool evaluates utility and privacy of a given deidentified dataset and generates a summary quality report with performance of a deide dataset enumerated and illustrated for each utility and privacy metric.
@@ -18,7 +18,7 @@ Setting Up the SDNIST Report Tool
### Brief Setup Instructions
-SDNist v1.4 requires Python version 3.7 or greater. If you have installed a previous version of the SDNist library, we recommend uninstalling or installing v1.4 in a virtual environment. v1.4 can be installed via [Release 1.4.0b](https://github.com/usnistgov/SDNist/releases/tag/v1.4.1-b.1). The NIST Diverse Community Exceprt data will download on the fly.
+SDNist v2.0 requires Python version 3.7 or greater. If you have installed a previous version of the SDNist library, we recommend uninstalling it or installing v2.0 in a virtual environment. v2.0 can be installed via [Release 2.0](https://github.com/usnistgov/SDNist/releases/tag/v2.0.0). The NIST Diverse Community Excerpt data will download on the fly.
### Detailed Setup Instructions
@@ -38,10 +38,10 @@ SDNist v1.4 requires Python version 3.7 or greater. If you have installed a prev
c:\\sdnist-project>
```
-4. Download the sdnist installable wheel (sdnist-1.4.1b-py3-none-any.whl) from the [Github:SDNist beta release](https://github.com/usnistgov/SDNist/releases/download/v1.4.1-b.1/sdnist-1.4.1b1-py3-none-any.whl).
+4. Download the sdnist installable wheel (sdnist-2.0.0-py3-none-any.whl) from the Github [SDNist Release 2.0](https://github.com/usnistgov/SDNist/releases/download/v2.0.0/sdnist-2.0.0-py3-none-any.whl).
-5. Move the downloaded sdnist-1.4.1b1-py3-none-any.whl file to the sdnist-project directory.
+5. Move the downloaded sdnist-2.0.0-py3-none-any.whl file to the sdnist-project directory.
6. Using the terminal on Mac/Linux or powershell on Windows, navigate to the sdnist-project directory.
@@ -94,7 +94,7 @@ SDNist v1.4 requires Python version 3.7 or greater. If you have installed a prev
```
-10. Per step 5 above, the sdnist-1.4.1b1-py3-none-any.whl file should already be present in the sdnist-project directory. Check whether that is true by listing the files in the sdnist-project directory.
+10. Per step 5 above, the sdnist-2.0.0-py3-none-any.whl file should already be present in the sdnist-project directory. Check whether that is true by listing the files in the sdnist-project directory.
**MAC OS/Linux:**
```
@@ -104,12 +104,12 @@ SDNist v1.4 requires Python version 3.7 or greater. If you have installed a prev
```
(venv) c:\\sdnist-project> dir
```
- The sdnist-1.4.0b2-py3-none-any.whl file should be in the list printed by the above command; otherwise, follow steps 4 and 5 again to download the .whl file.
+ The sdnist-2.0.0-py3-none-any.whl file should be in the list printed by the above command; otherwise, follow steps 4 and 5 again to download the .whl file.
11. Install sdnist Python library:
```
- (venv) c:\\sdnist-project> pip install sdnist-1.4.1b1-py3-none-any.whl
+ (venv) c:\\sdnist-project> pip install sdnist-2.0.0-py3-none-any.whl
```
@@ -127,8 +127,9 @@ SDNist v1.4 requires Python version 3.7 or greater. If you have installed a prev
TARGET_DATASET_NAME Select name of the target dataset that was used to generated given deidentified dataset
optional arguments:
- \-h, \--help show this help message and exit
- \--data-root DATA_ROOT Path of the directory to be used as the root for the target datasets\--download DOWNLOAD Download toy datasets if not present locallyChoices for Target Dataset Name::
+ \-h, \--help Show this help message and exit
+ \--data-root DATA_ROOT Path of the directory to be used as the root for the target datasets
+ \--download DOWNLOAD Download toy datasets if not present locally
+ Choices for Target Dataset Name::
(dataname) (filename)
MA ma2019
@@ -188,7 +189,7 @@ Generate Data Quality Report
- TX
- NATIONAL
- - **--data-root**: The absolute or relative path to the directory containing the bundled dataset, or the directory where the bundled dataset should be downloaded to if it is not available locally. The default directory is set to sdnist_toy_data.
+ - **--data-root**: The absolute or relative path to the directory containing the bundled dataset, or the directory where the bundled dataset should be downloaded to if it is not available locally. The default directory is set to **diverse_community_excerpts_data**.
## Setup Data for SDNIST Report Tool
@@ -199,7 +200,7 @@ Generate Data Quality Report
(venv) c:\\sdnist-project> python -m sdnist.report syn_tx.csv TX
Downloading all SDNist datasets from:
- https://github.com/usnistgov/SDNist/releases/download/v1.4.0-b.1/SDNist-toy-data-1.4.0-b.1.zip ...
+ https://github.com/usnistgov/SDNist/releases/download/v2.0.0/diverse_community_excerpts_data.zip ...
...5%, 47352 KB, 8265 KB/s, 5 seconds elapsed
```
@@ -211,30 +212,30 @@ Generate Data Quality Report
3. The sdnist.report package also needs a deidentified dataset that it can evaluate against its original counterpart. Since the sdnist.report package comes bundled with the datasets, the deidentified dataset should be generated using the bundled datasets.
- You can download a copy of the datasets from [Github Sdnist Toy Dataset](https://github.com/usnistgov/SDNist/tree/main/nist%20diverse%20communities%20data%20excerpts). This copy is similar to the one bundled with the sdnist.report package, but it contains more documentation and a description of the datasets.
+ You can download a copy of the datasets from Github [Diverse Community Excerpts Data](https://github.com/usnistgov/SDNist/tree/main/nist%20diverse%20communities%20data%20excerpts). This copy is similar to the one bundled with the sdnist.report package, but it contains more documentation and a description of the datasets.
-4. You can download the toy deidentified datasets from [Github Sdnist Toy Synthetic Dataset](https://github.com/usnistgov/SDNist/releases/download/v1.4.0-b.1/toy_synthetic_data.zip). Unzip the downloaded file, and move the unzipped toy_synthetic_dataset directory to the sdnist-project directory.
+4. You can download the toy deidentified datasets from Github [SDNist Toy Deidentified Dataset](https://github.com/usnistgov/SDNist/releases/download/v2.0.0/toy_deidentified_data.zip). Unzip the downloaded file, and move the unzipped toy_deidentified_data directory to the sdnist-project directory.
-5. Each toy deidentified dataset file is generated using the [Sdnist Toy Dataset](https://github.com/usnistgov/SDNist/releases/download/v1.4.0-b.1/SDNist-toy-data-1.4.0-b.1.zip). The syn_ma.csv, syn_tx.csv, and syn_national.csv deidentified dataset files are created from target datasets MA (ma2019.csv), TX (tx2019.csv), and NATIONAL(national2019.csv), respectively. You can use one of the toy synthetic dataset files for testing whether the sdnist.report package is installed correctly on your system.
+5. Each toy deidentified dataset file is generated using the [Diverse Community Excerpts Data](https://github.com/usnistgov/SDNist/releases/download/v2.0.0/diverse_community_excerpts_data.zip). The syn_ma.csv, syn_tx.csv, and syn_national.csv deidentified dataset files are created from target datasets MA (ma2019.csv), TX (tx2019.csv), and NATIONAL (national2019.csv), respectively. You can use one of the toy deidentified dataset files for testing whether the sdnist.report package is installed correctly on your system.
6. Use the following commands for generating reports if you are using a toy deidentified dataset file:
For evaluating the Massachusetts dataset:
```
- (venv) c:\\sdnist-project> python -m sdnist.report toy_synthetic_data/syn_ma.csv MA
+ (venv) c:\\sdnist-project> python -m sdnist.report toy_deidentified_data/syn_ma.csv MA
```
For evaluating the Texas dataset:
```
- (venv) c:\\sdnist-project> python -m sdnist.report toy_synthetic_data/syn_tx.csv TX
+ (venv) c:\\sdnist-project> python -m sdnist.report toy_deidentified_data/syn_tx.csv TX
```
For evaluating the national dataset:
```
- (venv) c:\\sdnist-project> python -m sdnist.report toy_synthetic_data/syn_national.csv NATIONAL
+ (venv) c:\\sdnist-project> python -m sdnist.report toy_deidentified_data/syn_national.csv NATIONAL
```
7. A deidentified dataset can be a .csv or a parquet file, and the path of this file is required
@@ -242,9 +243,6 @@ by the sdnist.report package to generate a data quality report.
## Download Data Manually
-1. If the sdnist.report package is not able to download the datasets, you can download them from [Github:SDNist toy data beta release](https://github.com/usnistgov/SDNist/releases/download/v1.4.0-b.1/SDNist-toy-data-1.4.0-b.1.zip).
-2. Move the downloaded SDNist-toy-data-1.4.0-b.1.zip file to the sdnist-project directory.
-3. Unzip the SDNist-toy-data-1.4.0-b.1.zip file and move the data directory inside it to the sdnist-project directory.
-4. Delete the SDNist-toy-data-1.4.0-b.1.zip file once the data directory is successfully moved out of the unzipped directory.
-5. Also delete the now-empty SDNist-toy-data-1.4.0-b.1 directory from where the zip file was extracted.
-6. And finally, to successfully install datasets manually, change the name of the data directory inside the sdnist-project directory to sdnist_toy_data.
+1. If the sdnist.report package is not able to download the datasets, you can download them from Github [Diverse Community Excerpts Data](https://github.com/usnistgov/SDNist/releases/download/v2.0.0/diverse_community_excerpts_data.zip).
+2. Unzip the **diverse_community_excerpts_data.zip** file and move the unzipped **diverse_community_excerpts_data** directory to the **sdnist-project** directory.
+3. Delete the **diverse_community_excerpts_data.zip** file once the data is successfully extracted from the zip.
diff --git a/sdnist/report/__main__.py b/sdnist/report/__main__.py
index 926987c..e9294fd 100644
--- a/sdnist/report/__main__.py
+++ b/sdnist/report/__main__.py
@@ -15,37 +15,51 @@
from sdnist.load import TestDatasetName
from sdnist.strs import *
+from sdnist.utils import *
+# from setup import version
def run(synthetic_filepath: Path,
output_directory: Path = REPORTS_DIR,
test: TestDatasetName = TestDatasetName.NONE,
- data_root: Path = Path('sdnist_toy_data'),
+ data_root: Path = Path("diverse_community_excerpts_data"),
download: bool = False,
test_mode: bool = False):
outfile = Path(output_directory, 'report.json')
ui_data = ReportUIData(output_directory=output_directory)
report_data = ReportData(output_directory=output_directory)
+ log = SimpleLogger()
+ log.msg('SDNist: Deidentified Data Report Tool', level=0, timed=False)
+ log.msg(f'Creating Evaluation Report for Deidentified Data at path: {synthetic_filepath}',
+ level=1)
if not outfile.exists():
- print('Loading Dataset...')
- dataset = Dataset(synthetic_filepath, test, data_root, download)
+ log.msg('Loading Datasets', level=2)
+ dataset = Dataset(synthetic_filepath, log, test, data_root, download)
ui_data = data_description(dataset, ui_data)
+ log.end_msg()
# Create scores
- print('Computing Utility Scores...')
- ui_data, report_data = utility_score(dataset, ui_data, report_data)
- print('Computing Privacy Scores...')
- ui_data, report_data = privacy_score(dataset, ui_data, report_data)
+ log.msg('Computing Utility Scores', level=2)
+ ui_data, report_data = utility_score(dataset, ui_data, report_data, log)
+ log.end_msg()
+
+ log.msg('Computing Privacy Scores', level=2)
+ ui_data, report_data = privacy_score(dataset, ui_data, report_data, log)
+ log.end_msg()
+
+ log.msg('Saving Report Data')
ui_data.save()
report_data.save()
ui_data = ui_data.data
+ log.end_msg()
else:
with open(outfile, 'r') as f:
ui_data = json.load(f)
-
+ log.end_msg()
# Generate Report
generate(ui_data, output_directory, test_mode)
+ log.msg(f'Reports available at path: {output_directory}', level=0, timed=False)
class NoAction(argparse.Action):
@@ -73,7 +87,7 @@ def __call__(self, parser, namespace, values, option_string=None):
help="Select name of the target dataset "
"that was used to generated given deidentified dataset")
parser.add_argument("--data-root", type=Path,
- default=Path("sdnist_toy_data"),
+ default=Path("diverse_community_excerpts_data"),
help="Path of the directory "
"to be used as the root for the target datasets")
parser.add_argument("--download", type=bool, default=True,
diff --git a/sdnist/report/config.json b/sdnist/report/config.json
index 222c1e9..69424e4 100644
--- a/sdnist/report/config.json
+++ b/sdnist/report/config.json
@@ -15,7 +15,7 @@
"PUMA",
"SEX",
"AGEP",
- "MST",
+ "MSP",
"HISP",
"RAC1P",
"NOC",
@@ -27,7 +27,9 @@
"PINCP_DECILE",
"DVET",
"DEAR",
- "DPHYS",
- "DEYE"
+ "DPHY",
+ "DEYE",
+ "DREM",
+ "DENSITY"
]
}
\ No newline at end of file
diff --git a/sdnist/report/dataset.py b/sdnist/report/dataset.py
index a367dfe..9f1f7fc 100644
--- a/sdnist/report/dataset.py
+++ b/sdnist/report/dataset.py
@@ -1,3 +1,4 @@
+import math
from pathlib import Path
from typing import Dict, List
from dataclasses import dataclass, field
@@ -154,6 +155,74 @@ def add_bin_for_NA(data, reference_data, features):
return d
+def bin_density(data: pd.DataFrame, data_dict: Dict, update: bool = True) -> pd.DataFrame:
+ """
+ data: Data containing density feature
+ data_dict: Dictionary containing values range for density feature
+ update: if True, update the input data's density feature and return
+ else, create two new columns: binned_density and bin_range
+ and return the data
+ """
+ def get_bin_range_log(x):
+ for i, v in enumerate(bins):
+ if i == x:
+ return f'({round(v, 2)}, {round(bins[i + 1], 2)}]'
+ d = data
+ dd = data_dict
+ base = 10
+ # Drop the first 8 of the 20 log-spaced bins and prepend two bins,
+ # (0, 150] and (150, first kept edge], giving 14 effective bins. This
+ # bottom-codes the density category for PUMAs with small density.
+ n_bins = 20 # number of bins
+ # max of range
+ n_max = dd['DENSITY']['values']['max'] + 500
+
+ bins = np.logspace(start=math.log(10, base), stop=math.log(n_max, base), num=n_bins+1)
+ # remove first 8 bins and prepend two new bins
+ bins = [0, 150] + list(bins[8:])
+ # print('Bins', bins)
+ # print('Densities', d['DENSITY'].unique().tolist())
+ n_bins = len(bins) # number of bin edges; there are n_bins - 1 effective bins
+ labels = [i for i in range(n_bins-1)]
+
+ # top code values to n_max and bottom code values to 0 in the data
+ d.loc[d['DENSITY'] < 0, 'DENSITY'] = float(0)
+ d.loc[d['DENSITY'] > n_max, 'DENSITY'] = float(n_max) - 100
+
+ if update:
+ d['DENSITY'] = pd.cut(d['DENSITY'], bins=bins, labels=labels)
+ return d
+ else:
+ d['binned_density'] = pd.cut(d['DENSITY'], bins=bins, labels=labels)
+
+ d['bin_range'] = d['binned_density'].apply(lambda x: get_bin_range_log(x))
+ return d
+
+
+def get_density_bins_description(data: pd.DataFrame, data_dict: Dict, mappings: Dict) -> Dict:
+ bin_desc = dict()
+ # If puma is not available in the features, return empty description dictionary
+ if 'PUMA' not in data:
+ return bin_desc
+
+ d = bin_density(data.copy(), data_dict, update=False)
+
+ for dbin, g in d.groupby(by=['binned_density']):
+ if g.shape[0] == 0:
+ continue
+
+ density_range = g['bin_range'].unique()[0]
+ bin_data = []
+ for puma, pg in g.groupby(by='PUMA'):
+ density = pg['DENSITY'].unique()[0]
+ bin_data.append([puma, density, mappings["PUMA"][puma]["name"]])
+ bin_df = pd.DataFrame(bin_data, columns=['PUMA', 'DENSITY', 'PUMA NAME'])
+ bin_desc[dbin] = (density_range, bin_df)
+ del d
+ # print(bin_desc)
+ return bin_desc
+
+
def unavailable_features(config: Dict, synthetic_data: pd.DataFrame):
"""remove features from configuration that are not available in
the input synthetic data"""
@@ -170,6 +239,7 @@ def unavailable_features(config: Dict, synthetic_data: pd.DataFrame):
@dataclass
class Dataset:
synthetic_filepath: Path
+ log: u.SimpleLogger
test: TestDatasetName = TestDatasetName.NONE
data_root: Path = Path('sdnist_toy_data')
download: bool = True
@@ -188,25 +258,23 @@ def __post_init__(self):
download=self.download,
public=False,
test=self.test,
- format_="csv",
- data_name="toy-data"
+ format_="csv"
)
self.target_data_path = build_name(
challenge=strs.CENSUS,
root=self.data_root,
public=False,
- test=self.test,
- data_name="toy-data"
+ test=self.test
)
self.schema = params[strs.SCHEMA]
-
+ configs_path = self.target_data_path.parent.parent
# add config packaged with data and also the config package with sdnist.report package
- config_1 = u.read_json(Path(self.target_data_path.parent, 'config.json'))
+ config_1 = u.read_json(Path(configs_path, 'config.json'))
config_2 = u.read_json(Path(FILE_DIR, 'config.json'))
self.config = {**config_1, **config_2}
- self.mappings = u.read_json(Path(self.target_data_path.parent, 'mappings.json'))
- self.data_dict = u.read_json(Path(self.target_data_path.parent, 'data_dictionary.json'))
+ self.mappings = u.read_json(Path(configs_path, 'mappings.json'))
+ self.data_dict = u.read_json(Path(configs_path, 'data_dictionary.json'))
self.features = self.target_data.columns.tolist()
drop_features = self.config[strs.DROP_FEATURES] \
@@ -241,11 +309,25 @@ def __post_init__(self):
self.features = list(set(self.features).difference(set(ind_features)))
self.features = list(set(self.features).intersection(list(common_columns)))
+ self.log.msg(f'Features ({len(self.features)}): {self.features}', level=3, timed=False)
+ self.log.msg(f'Deidentified Data Records Count: {self.synthetic_data.shape[0]}', level=3, timed=False)
+ self.log.msg(f'Target Data Records Count: {self.target_data.shape[0]}', level=3, timed=False)
+
validate(self.synthetic_data, self.schema, self.features)
+
# raw data
self.target_data = self.target_data[self.features]
self.synthetic_data = self.synthetic_data[self.features]
+ # bin the density feature if present in the datasets
+ self.density_bin_desc = dict()
+ if 'DENSITY' in self.features:
+ self.density_bin_desc = get_density_bins_description(self.target_data,
+ self.data_dict,
+ self.mappings)
+ self.target_data = bin_density(self.target_data, self.data_dict)
+ self.synthetic_data = bin_density(self.synthetic_data, self.data_dict)
+
# update config to contain only available features
self.config = unavailable_features(self.config, self.synthetic_data)
@@ -348,6 +430,18 @@ def data_description(dataset: Dataset, ui_data: ReportUIData) -> ReportUIData:
dd_as.append(Attachment(name=f_name,
_data=data,
_type=AttachmentType.Table))
+ if feat == 'DENSITY':
+ for dbin, bdata in dataset.density_bin_desc.items():
+ bdc = bdata[1].columns.tolist() # bin data columns
+ # rbd: bin rows as a list of {column: value} dicts for the report table
+ rbd = [{c: row[c] for c in bdc}
+ for _, row in bdata[1].iterrows()]
+ dd_as.append(Attachment(name=None,
+ _data=f'Density Bin: {dbin} | Bin Range: {bdata[0]}',
+ _type=AttachmentType.String))
+ dd_as.append(Attachment(name=None,
+ _data=rbd,
+ _type=AttachmentType.Table))
r_ui_d.add(ScorePacket(metric_name='Data Dictionary',
score=None,
diff --git a/sdnist/report/score/privacy.py b/sdnist/report/score/privacy.py
index eb8c4fd..486aea5 100644
--- a/sdnist/report/score/privacy.py
+++ b/sdnist/report/score/privacy.py
@@ -7,17 +7,21 @@
PrivacyScorePacket, Attachment, AttachmentType
from sdnist.report.score.paragraphs import *
from sdnist.strs import *
+from sdnist.utils import *
-def privacy_score(dataset: Dataset, ui_data: ReportUIData, report_data) \
+def privacy_score(dataset: Dataset, ui_data: ReportUIData, report_data, log: SimpleLogger) \
-> Tuple[ReportUIData, ReportData]:
ds = dataset
r_ui_d = ui_data
rd = report_data
+ log.msg('Apparent Match Distribution', level=3)
quasi_idf = [] # list of quasi-identifier features
excluded = [] # list of excluded features from apparent match computation
if ds.challenge == CENSUS:
+
+
quasi_idf = ['SEX', 'RAC1P', 'EDU', 'INDP_CAT', 'MST']
quasi_idf = list(set(ds.features).intersection(set(quasi_idf)))
excluded = ['PUMA', 'RACE']
@@ -82,4 +86,5 @@ def privacy_score(dataset: Dataset, ui_data: ReportUIData, report_data) \
total_quasi_matched,
adp_para_a,
adp]))
+ log.end_msg()
return r_ui_d, rd
diff --git a/sdnist/report/score/utility/__init__.py b/sdnist/report/score/utility/__init__.py
index 68acd1b..bdd4d92 100644
--- a/sdnist/report/score/utility/__init__.py
+++ b/sdnist/report/score/utility/__init__.py
@@ -428,24 +428,25 @@ def grid_plot_attachment(group_features: List[str],
return gp_a
-def utility_score(dataset: Dataset, ui_data: ReportUIData, report_data: ReportData) \
+def utility_score(dataset: Dataset, ui_data: ReportUIData, report_data: ReportData,
+ log: SimpleLogger) \
-> Tuple[ReportUIData, ReportData]:
ds = dataset
r_ui_d = ui_data # report ui data
rd = report_data
- scorers = []
-
features = ds.features
corr_features = ds.config[strs.CORRELATION_FEATURES]
# Initiated k-marginal, correlation and propensity scorer
# selected challenge type: census or taxi
if ds.challenge == strs.CENSUS:
+ log.msg('Univariates', level=3)
up = UnivariatePlots(ds.d_synthetic_data, ds.d_target_data,
ds, r_ui_d.output_directory, ds.challenge)
u_feature_data = up.save() # univariate features data
rd.add('Univariate', up.report_data())
+
u_as = [] # univariate attachments
for k, v in u_feature_data.items():
@@ -476,7 +477,9 @@ def utility_score(dataset: Dataset, ui_data: ReportUIData, report_data: ReportDa
strs.PATH: u_rel_path}],
_type=AttachmentType.ImageLinks)
u_as.append(a)
+ log.end_msg()
+ log.msg('Correlations', level=3)
cdp_saved_file_paths = []
pcp_saved_file_paths = []
if len(corr_features) > 1:
@@ -492,17 +495,12 @@ def utility_score(dataset: Dataset, ui_data: ReportUIData, report_data: ReportDa
rd.add('Correlations', {"kendall correlation difference": cdp.report_data,
"pearson correlation difference": pcp.report_data})
+ log.end_msg()
- scorers = [CensusKMarginalScore(ds.d_target_data,
- ds.d_synthetic_data,
- ds.schema, **ds.config[strs.K_MARGINAL]),
- PropensityMSE(ds.t_target_data,
- ds.t_synthetic_data,
- r_ui_d.output_directory,
- features)]
else:
raise Exception(f'Unknown challenge type: {ds.challenge}')
+ log.msg('K-Marginal', level=3)
group_features = ds.config[strs.K_MARGINAL][strs.GROUP_FEATURES]
f_val_dict = {
f: {i: v for i, v in enumerate(ds.schema[f]['values'])}
@@ -513,59 +511,71 @@ def utility_score(dataset: Dataset, ui_data: ReportUIData, report_data: ReportDa
prop_pkt = None # propensity score packet
# compute scores and plots
- for s in scorers:
- s.compute_score()
- metric_name = s.NAME
-
- metric_score = int(s.score) if s.score > 100 else round(s.score, 5)
- metric_attachments = []
-
- if s.NAME == CensusKMarginalScore.NAME \
- and ds.challenge == strs.CENSUS:
-
- group_scores = s.scores if hasattr(s, 'scores') and len(s.scores) else None
- subsample_scores = kmarginal_subsamples(ds, CensusKMarginalScore)
- kmarg_sum_pkt, kmarg_det_pkt = kmarginal_score_packet(metric_score,
- f_val_dict,
- ds,
- r_ui_d,
- rd,
- subsample_scores,
- 'PUMA',
- group_features,
- group_scores)
-
- elif s.NAME == PropensityMSE.NAME:
- p_dist_plot = PropensityDistribution(s.prob_dist, r_ui_d.output_directory)
- # pps = PropensityPairPlot(s.std_two_way_scores, rd.output_directory)
- #
- prop_rep_data = {**s.report_data, **p_dist_plot.report_data}
- rd.add('propensity mean square error', prop_rep_data)
-
- p_dist_paths = p_dist_plot.save()
- # pps_paths = pps.save('spmse',
- # 'Two-Way Standardized Propensity Mean Square Error')
- rel_pd_path = ["/".join(list(p.parts)[-2:])
- for p in p_dist_paths]
- # rel_pps_path = ["/".join(list(p.parts)[-2:])
- # for p in pps_paths]
-
- # probability distribution attachment
- pd_para_a = Attachment(name=None,
- _data=propensity_para,
- _type=AttachmentType.String)
- pd_score_a = Attachment(name=None,
- _data=f"Highlight-Score: {metric_score}",
- _type=AttachmentType.String)
- pd_a = Attachment(name=f'Propensities Distribution',
- _data=[{strs.IMAGE_NAME: Path(p).stem, strs.PATH: p}
- for p in rel_pd_path],
- _type=AttachmentType.ImageLinks)
-
- prop_pkt = UtilityScorePacket(metric_name,
- None,
- [pd_para_a, pd_score_a, pd_a])
+ s = CensusKMarginalScore(ds.d_target_data,
+ ds.d_synthetic_data,
+ ds.schema, **ds.config[strs.K_MARGINAL])
+ s.compute_score()
+ metric_name = s.NAME
+
+ metric_score = int(s.score) if s.score > 100 else round(s.score, 5)
+ metric_attachments = []
+
+ if s.NAME == CensusKMarginalScore.NAME \
+ and ds.challenge == strs.CENSUS:
+ group_scores = s.scores if hasattr(s, 'scores') and len(s.scores) else None
+ subsample_scores = kmarginal_subsamples(ds, CensusKMarginalScore)
+ kmarg_sum_pkt, kmarg_det_pkt = kmarginal_score_packet(metric_score,
+ f_val_dict,
+ ds,
+ r_ui_d,
+ rd,
+ subsample_scores,
+ 'PUMA',
+ group_features,
+ group_scores)
+ log.end_msg()
+
+ log.msg('PropensityMSE', level=3)
+ s = PropensityMSE(ds.t_target_data,
+ ds.t_synthetic_data,
+ r_ui_d.output_directory,
+ features)
+ s.compute_score()
+ metric_name = s.NAME
+
+ metric_score = int(s.score) if s.score > 100 else round(s.score, 5)
+
+ p_dist_plot = PropensityDistribution(s.prob_dist, r_ui_d.output_directory)
+ # pps = PropensityPairPlot(s.std_two_way_scores, rd.output_directory)
+ #
+ prop_rep_data = {**s.report_data, **p_dist_plot.report_data}
+ rd.add('propensity mean square error', prop_rep_data)
+
+ p_dist_paths = p_dist_plot.save()
+ # pps_paths = pps.save('spmse',
+ # 'Two-Way Standardized Propensity Mean Square Error')
+ rel_pd_path = ["/".join(list(p.parts)[-2:])
+ for p in p_dist_paths]
+ # rel_pps_path = ["/".join(list(p.parts)[-2:])
+ # for p in pps_paths]
+
+ # probability distribution attachment
+ pd_para_a = Attachment(name=None,
+ _data=propensity_para,
+ _type=AttachmentType.String)
+ pd_score_a = Attachment(name=None,
+ _data=f"Highlight-Score: {metric_score}",
+ _type=AttachmentType.String)
+ pd_a = Attachment(name='Propensities Distribution',
+ _data=[{strs.IMAGE_NAME: Path(p).stem, strs.PATH: p}
+ for p in rel_pd_path],
+ _type=AttachmentType.ImageLinks)
+
+ prop_pkt = UtilityScorePacket(metric_name,
+ None,
+ [pd_para_a, pd_score_a, pd_a])
+ log.end_msg()
# rel_up_saved_file_paths = ["/".join(list(p.parts)[-2:])
# for p in up_saved_file_paths]
@@ -601,7 +611,7 @@ def utility_score(dataset: Dataset, ui_data: ReportUIData, report_data: ReportDa
corr_metric_a.append(pc_para_a)
corr_metric_a.append(pc_a)
-
+ log.msg('PCA', level=3)
pca = PCAMetric(dataset.t_target_data,
dataset.t_synthetic_data,
r_ui_d.output_directory)
@@ -622,6 +632,8 @@ def utility_score(dataset: Dataset, ui_data: ReportUIData, report_data: ReportDa
_data=[{strs.IMAGE_NAME: Path(p).stem, strs.PATH: p}
for p in rel_pca_save_file_path],
_type=AttachmentType.ImageLinks)
+ log.end_msg()
+
# Add metrics reports to UI
if kmarg_sum_pkt:
@@ -639,16 +651,21 @@ def utility_score(dataset: Dataset, ui_data: ReportUIData, report_data: ReportDa
r_ui_d.add(UtilityScorePacket("Correlations",
None,
corr_metric_a))
+ log.msg('Linear Regression', level=3)
lgr = LinearRegressionReport(ds, r_ui_d, rd)
lgr.add_to_ui()
+ log.end_msg()
if prop_pkt:
r_ui_d.add(prop_pkt)
r_ui_d.add(UtilityScorePacket("PCA",
None,
[pca_para_a, pca_a_tt, pca_a]))
+
+ log.msg('Inconsistencies', level=3)
icr = InconsistenciesReport(ds, r_ui_d, rd)
icr.add_to_ui()
+ log.end_msg()
if kmarg_det_pkt:
r_ui_d.add(kmarg_det_pkt)
diff --git a/sdnist/utils.py b/sdnist/utils.py
index dd8afba..5a2f6cb 100644
--- a/sdnist/utils.py
+++ b/sdnist/utils.py
@@ -3,6 +3,8 @@
import pandas as pd
import json
import os
+import time
+import sys
from pathlib import Path
@@ -184,3 +186,86 @@ def df_filter(data: pd.DataFrame, filters: Optional[List] = None) -> pd.DataFram
data = data[data[feature].isin(values)]
return data
+
+class SimpleLogger:
+ ptrn = r'/*\*/' # separator pattern (raw string: '\*' is not a valid escape sequence)
+
+ def __init__(self):
+ self.level_messages = dict()
+ self.current_head = None
+ self.current_level = None
+ self.msg_path = dict()
+ self.root = None
+
+ def msg(self, message: str, level=1, timed=True):
+ if timed:
+ if self.root is None:
+ self.root = message
+ self.current_level = level
+
+ t = Time()
+ t.start(message)
+ msg_full_path = self.get_msg_path(message, level)
+ self.current_level = level
+ self.current_head = msg_full_path
+ self.level_messages[self.current_head] = (message, level, t)
+
+ if level < 3:
+ level_indent = '|' + ''.join(['--'
+ for _ in range(level)])
+
+ sys_print(level_indent + ' ' + message)
+ elif not timed:
+ level_indent = '|' + ''.join(['--'
+ for _ in range(level)])
+
+ sys_print(level_indent + ' ' + message)
+
+ def end_msg(self):
+ head_data = self.level_messages[self.current_head]
+ message, level, t = head_data
+ del self.level_messages[self.current_head]
+ if self.current_head != self.root:
+ parent_path = self.ptrn.join(self.current_head.split(self.ptrn)[:-1])
+ self.current_head = parent_path
+ self.current_level = level - 1
+ secs = t.time()
+ level_indent = '|' + ''.join(['--'
+ for _ in range(level)])
+ sys_print(level_indent + f' >>>> Finished {message} | Time: {round(secs, 1)}s <<<<')
+
+ def get_msg_path(self, msg, level):
+ if self.current_level == level and self.current_head != self.root:
+ if self.current_head:
+ parent_path = self.ptrn.join(self.current_head.split(self.ptrn)[:-1])
+ msg_path = parent_path + self.ptrn + msg
+ return msg_path
+ else:
+ self.current_head = self.root
+ return self.current_head
+ else:
+ return self.current_head + self.ptrn + msg
+
+
+class Time:
+ def __init__(self):
+ self.labels = dict()
+ self.last_label = None
+
+ def start(self, label: str):
+ self.last_label = label
+ self.labels[label] = time.time()
+
+ def time(self):
+ if not self.last_label:
+ sys_print('sdnist.utils.Time.time() Invalid Use of Time: No Label Found')
+ return 0.0 # keep callers such as round(secs, 1) from failing on None
+ start = self.labels[self.last_label]
+ elapsed = time.time() - start
+ self.labels[self.last_label] = elapsed
+ return elapsed
+
+
+def sys_print(data: str):
+ sys.stdout.write(data + '\n')
+ sys.stdout.flush() # flush after writing so the message appears immediately
diff --git a/setup.py b/setup.py
index 4e8df5b..cf71376 100644
--- a/setup.py
+++ b/setup.py
@@ -8,8 +8,8 @@
setup(
name='sdnist',
- version='1.4.1',
- description='SDNist: datasets and evaluation tools for data synthesizers',
+ version='2.0.0',
+ description='SDNist: Deidentified Data Report Generator',
long_description=long_description,
long_description_content_type='text/markdown',
url='https://github.com/usnistgov/SDNist',