
Commit

Merge pull request #10 from garyhowarth/main
Merge develop to main
garyhowarth authored Feb 28, 2023
2 parents 5892ef1 + 6eedf7a commit 380f68e
Showing 48 changed files with 438 additions and 7,712 deletions.
51 changes: 0 additions & 51 deletions CHANGELOG.md

This file was deleted.

27 changes: 27 additions & 0 deletions CITATION.cff
@@ -0,0 +1,27 @@
cff-version: 1.2.0
title: "SDNist: Deidentified Data Report Tool"
abstract: "SDNist provides benchmark data and a suite of both machine- and human-readable outputs with more than ten metrics including univariate and multivariate statistics, database distance metrics, principal component analysis, propensity, basic privacy evaluation, and other information-rich tools. "
message: >-
If you use this repository or present information about it publicly, please cite us.
type: software
version: 2.0.0
doi: 10.18434/mds2-2943
date-released: 2021-12-16
contact:
  - affiliation: "National Institute of Standards and Technology"
    email: [email protected]
    family-names: Gary
    given-names: Howarth
authors:
  - family-names: Task
    given-names: Christine
    affiliation: Knexus Research Corporation
    email: [email protected]
  - family-names: Bhagat
    given-names: Karan
    affiliation: Knexus Research Corporation
  - family-names: Howarth
    given-names: Gary
    affiliation: National Institute of Standards and Technology
    email: [email protected]
    ORCID: 0000-0002-3587-0546
58 changes: 31 additions & 27 deletions README.md
@@ -1,10 +1,8 @@
# SDNist v1.4 beta: Deidentified Data Report Tool

## We anticipate releasing SDNist v2 February 21 2023!
# SDNist v2.0: Deidentified Data Report Tool

## [SDNist is the official software package for engaging in the NIST Collaborative Research Cycle](https://pages.nist.gov/privacy_collaborative_research_cycle)

Welcome! SDNist v1.4b is a Python package that provides benchmark data and evaluation metrics for deidentified data generators. This version of SDNist supports using the [NIST Diverse Community Excerpts](https://github.com/usnistgov/SDNist/tree/main/nist%20diverse%20communities%20data%20excerpts), a geographically partitioned, limited-feature data set.
Welcome! SDNist v2.0 is a Python package that provides benchmark data and evaluation metrics for deidentified data generators. This version of SDNist supports using the [NIST Diverse Community Excerpts](https://github.com/usnistgov/SDNist/tree/main/nist%20diverse%20communities%20data%20excerpts), a geographically partitioned, limited-feature data set.

The deidentified data report evaluates utility and privacy of a given deidentified dataset and generates a summary quality report with performance of a deidentified dataset enumerated and illustrated for each utility and privacy metric.

@@ -25,16 +23,24 @@ Help us improve the package and this guide by reporting issues [here](https://gi

### Temporal Map Challenge Environment

SDNist v1.4b does not support the Temporal Map Challenge environment.
SDNist v2.0 does not support the Temporal Map Challenge environment.

To run the testing environment from the [*NIST PSCR Differential Privacy Temporal Map Challenge*](https://www.nist.gov/ctl/pscr/open-innovation-prize-challenges/past-prize-challenges/2020-differential-privacy-temporal) for the Chicago Taxi data sprint or the American Community Survey sprint, please go to the [Temporal Map Challenge assets repository](https://github.com/usnistgov/Differential-Privacy-Temporal-Map-Challenge-assets).


### Citing SDNist Deidentified Data Report Tool
If you publish work that utilizes the SDNist Deidentified Data Tool, please cite the software. Citation recommendation:
> Task C., Bhagat K., and Howarth G.S. (2023), SDNist v2: Deidentified Data Report Tool,
> National Institute of Standards and Technology,
> https://doi.org/10.18434/mds2-2943
(NOTE: DOI is not yet active, but should be by 1 APR 2023).

Setting Up the SDNIST Report Tool
------------------------

### Brief Setup Instructions

SDNist v1.4 requires Python version 3.7 or greater. If you have installed a previous version of the SDNist library, we recommend uninstalling or installing v1.4 in a virtual environment. v1.4 can be installed via [Release 1.4.0b](https://github.com/usnistgov/SDNist/releases/tag/v1.4.1-b.1). The NIST Diverse Community Excerpt data will download on the fly.
SDNist v2.0 requires Python version 3.7 or greater. If you have installed a previous version of the SDNist library, we recommend uninstalling or installing v2.0 in a virtual environment. v2.0 can be installed via [Release 2.0](https://github.com/usnistgov/SDNist/releases/tag/v2.0.0). The NIST Diverse Community Excerpt data will download on the fly.
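
A quick way to confirm that the active interpreter meets the stated minimum before installing is sketched below. The helper name `meets_minimum` is hypothetical and is not part of SDNist; only the 3.7 floor comes from the instructions above.

```python
import sys

# Minimum version stated in the setup instructions above.
MIN_PYTHON = (3, 7)


def meets_minimum(version_info=None, minimum=MIN_PYTHON):
    """Return True if the (major, minor, ...) version satisfies the minimum.

    Hypothetical helper for illustration; not an SDNist function.
    """
    if version_info is None:
        version_info = sys.version_info
    return tuple(version_info[:2]) >= tuple(minimum)


if __name__ == "__main__":
    if meets_minimum():
        print("Python version is new enough for SDNist v2.0.")
    else:
        sys.exit("SDNist v2.0 requires Python 3.7 or greater.")
```

Running this inside the virtual environment you plan to install into checks the interpreter that `pip` will actually use.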


### Detailed Setup Instructions
@@ -54,10 +60,10 @@ SDNist v1.4 requires Python version 3.7 or greater. If you have installed a prev
c:\\sdnist-project>
```
4. Download the sdnist installable wheel (sdnist-1.4.1b-py3-none-any.whl) from the [Github:SDNist beta release](https://github.com/usnistgov/SDNist/releases/download/v1.4.1-b.1/sdnist-1.4.1b1-py3-none-any.whl).
4. Download the sdnist installable wheel (sdnist-2.0.0-py3-none-any.whl) from the [Github:SDNist beta release](https://github.com/usnistgov/SDNist/releases/download/v2.0.0/sdnist-2.0.0-py3-none-any.whl).
5. Move the downloaded sdnist-1.4.1b1-py3-none-any.whl file to the sdnist-project directory.
5. Move the downloaded sdnist-2.0.0-py3-none-any.whl file to the sdnist-project directory.
6. Using the terminal on Mac/Linux or powershell on Windows, navigate to the sdnist-project directory.
@@ -110,7 +116,7 @@ SDNist v1.4 requires Python version 3.7 or greater. If you have installed a prev
```
10. Per step 5 above, the sdnist-1.4.1b1-py3-none-any.whl file should already be present in the sdnist-project directory. Check whether that is true by listing the files in the sdnist-project directory.
10. Per step 5 above, the sdnist-2.0.0-py3-none-any.whl file should already be present in the sdnist-project directory. Check whether that is true by listing the files in the sdnist-project directory.
**MAC OS/Linux:**
```
@@ -120,12 +126,12 @@ SDNist v1.4 requires Python version 3.7 or greater. If you have installed a prev
```
(venv) c:\\sdnist-project> dir
```
The sdnist-1.4.0b2-py3-none-any.whl file should be in the list printed by the above command; otherwise, follow steps 4 and 5 again to download the .whl file.
The sdnist-2.0.0-py3-none-any.whl file should be in the list printed by the above command; otherwise, follow steps 4 and 5 again to download the .whl file.
11. Install sdnist Python library:
```
(venv) c:\\sdnist-project> pip install sdnist-1.4.1b1-py3-none-any.whl
(venv) c:\\sdnist-project> pip install sdnist-2.0.0-py3-none-any.whl
```
@@ -143,8 +149,9 @@ SDNist v1.4 requires Python version 3.7 or greater. If you have installed a prev
TARGET_DATASET_NAME Select name of the target dataset that was used to generated given deidentified dataset
optional arguments:
\-h, \--help show this help message and exit
\--data-root DATA_ROOT Path of the directory to be used as the root for the target datasets\--download DOWNLOAD Download toy datasets if not present locallyChoices for Target Dataset Name::
\-h, \--help Show this help message and exit
\--data-root DATA_ROOT Path of the directory to be used as the root for the target datasets
\--download DOWNLOAD Download toy datasets if not present locally
Choices for Target Dataset Name::
(dataname) (filename)
MA ma2019
@@ -204,7 +211,7 @@ Generate Data Quality Report
- TX
- NATIONAL
- **--data-root**: The absolute or relative path to the directory containing the bundled dataset, or the directory where the bundled dataset should be downloaded to if it is not available locally. The default directory is set to sdnist_toy_data.
- **--data-root**: The absolute or relative path to the directory containing the bundled dataset, or the directory where the bundled dataset should be downloaded to if it is not available locally. The default directory is set to **diverse_community_excerpts_data**.
## Setup Data for SDNIST Report Tool
@@ -215,7 +222,7 @@ Generate Data Quality Report
(venv) c:\\sdnist-project> python -m sdnist.report syn_tx.csv TX
Downloading all SDNist datasets from:
https://github.com/usnistgov/SDNist/releases/download/v1.4.0-b.1/SDNist-toy-data-1.4.0-b.1.zip ...
https://github.com/usnistgov/SDNist/releases/download/v2.0.0/diverse_community_excerpts_data.zip ...
...5%, 47352 KB, 8265 KB/s, 5 seconds elapsed
```
@@ -227,40 +234,37 @@ Generate Data Quality Report
3. The sdnist.report package also needs a deidentified dataset that it can evaluate against its original counterpart. Since the sdnist.report package comes bundled with the datasets, the deidentified dataset should be generated using the bundled datasets.
You can download a copy of the datasets from [Github Sdnist Toy Dataset](https://github.com/usnistgov/SDNist/tree/main/nist%20diverse%20communities%20data%20excerpts). This copy is similar to the one bundled with the sdnist.report package, but it contains more documentation and a description of the datasets.
You can download a copy of the datasets from Github [Diverse Community Excerpts Data](https://github.com/usnistgov/SDNist/tree/main/nist%20diverse%20communities%20data%20excerpts). This copy is similar to the one bundled with the sdnist.report package, but it contains more documentation and a description of the datasets.
4. You can download the toy deidentified datasets from [Github Sdnist Toy Synthetic Dataset](https://github.com/usnistgov/SDNist/releases/download/v1.4.0-b.1/toy_synthetic_data.zip). Unzip the downloaded file, and move the unzipped toy_synthetic_dataset directory to the sdnist-project directory.
4. You can download the toy deidentified datasets from Github [Sdnist Toy Synthetic Dataset](https://github.com/usnistgov/SDNist/releases/download/v2.0.0/toy_deidentified_data.zip). Unzip the downloaded file, and move the unzipped toy_deidentified_data directory to the sdnist-project directory.
5. Each toy deidentified dataset file is generated using the [Sdnist Toy Dataset](https://github.com/usnistgov/SDNist/releases/download/v1.4.0-b.1/SDNist-toy-data-1.4.0-b.1.zip). The syn_ma.csv, syn_tx.csv, and syn_national.csv deidentified dataset files are created from target datasets MA (ma2019.csv), TX (tx2019.csv), and NATIONAL(national2019.csv), respectively. You can use one of the toy synthetic dataset files for testing whether the sdnist.report package is installed correctly on your system.
5. Each toy deidentified dataset file is generated using the [Diverse Community Excerpts Data](https://github.com/usnistgov/SDNist/releases/download/v2.0.0/diverse_community_excerpts_data.zip). The syn_ma.csv, syn_tx.csv, and syn_national.csv deidentified dataset files are created from target datasets MA (ma2019.csv), TX (tx2019.csv), and NATIONAL(national2019.csv), respectively. You can use one of the toy synthetic dataset files for testing whether the sdnist.report package is installed correctly on your system.
6. Use the following commands for generating reports if you are using a toy deidentified dataset file:
For evaluating the Massachusetts dataset:
```
(venv) c:\\sdnist-project> python -m sdnist.report toy_synthetic_data/syn_ma.csv MA
(venv) c:\\sdnist-project> python -m sdnist.report toy_deidentified_data/syn_ma.csv MA
```
For evaluating the Texas dataset:
```
(venv) c:\\sdnist-project> python -m sdnist.report toy_synthetic_data/syn_tx.csv TX
(venv) c:\\sdnist-project> python -m sdnist.report toy_deidentified_data/syn_tx.csv TX
```
For evaluating the national dataset:
```
(venv) c:\\sdnist-project> python -m sdnist.report toy_synthetic_data/syn_national.csv NATIONAL
(venv) c:\\sdnist-project> python -m sdnist.report toy_deidentified_data/syn_national.csv NATIONAL
```
7. A deidentified dataset can be a .csv or a parquet file, and the path of this file is required
by the sdnist.report package to generate a data quality report.
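
The three report commands above can also be driven from a short script. The sketch below only assembles and runs the same `python -m sdnist.report` command line shown in the steps; the helper names are hypothetical, and treating `.parquet` as the parquet file extension is an assumption rather than something the tool documents here.

```python
import subprocess
import sys
from pathlib import Path

# Target dataset names and toy files as listed in the steps above.
TOY_FILES = {
    "MA": "toy_deidentified_data/syn_ma.csv",
    "TX": "toy_deidentified_data/syn_tx.csv",
    "NATIONAL": "toy_deidentified_data/syn_national.csv",
}
# Assumption: parquet files use the ".parquet" suffix.
ALLOWED_SUFFIXES = {".csv", ".parquet"}


def build_report_command(deid_path, target_name, data_root=None):
    """Build the argument list for `python -m sdnist.report` (hypothetical helper)."""
    if Path(deid_path).suffix.lower() not in ALLOWED_SUFFIXES:
        raise ValueError(f"expected a .csv or parquet file, got {deid_path!r}")
    cmd = [sys.executable, "-m", "sdnist.report", str(deid_path), target_name]
    if data_root is not None:
        # --data-root is the option described in the help text above.
        cmd += ["--data-root", str(data_root)]
    return cmd


def run_all_toy_reports():
    """Run the report tool once per toy dataset (requires sdnist to be installed)."""
    for target_name, deid_path in TOY_FILES.items():
        subprocess.run(build_report_command(deid_path, target_name), check=True)
```

Call `run_all_toy_reports()` from the sdnist-project directory with the virtual environment active; using `sys.executable` keeps the subprocess on the same interpreter that has sdnist installed.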
## Download Data Manually
1. If the sdnist.report package is not able to download the datasets, you can download them from [Github:SDNist toy data beta release](https://github.com/usnistgov/SDNist/releases/download/v1.4.0-b.1/SDNist-toy-data-1.4.0-b.1.zip).
2. Move the downloaded SDNist-toy-data-1.4.0-b.1.zip file to the sdnist-project directory.
3. Unzip the SDNist-toy-data-1.4.0-b.1.zip file and move the data directory inside it to the sdnist-project directory.
4. Delete the SDNist-toy-data-1.4.0-b.1.zip file once the data directory is successfully moved out of the unzipped directory.
5. Also delete the now-empty SDNist-toy-data-1.4.0-b.1 directory from where the zip file was extracted.
6. And finally, to successfully install datasets manually, change the name of the data directory inside the sdnist-project directory to sdnist_toy_data.
1. If the sdnist.report package is not able to download the datasets, you can download them from Github [Diverse Community Excerpts Data](https://github.com/usnistgov/SDNist/releases/download/v2.0.0/diverse_community_excerpts_data.zip).
3. Unzip the **diverse_community_excerpts_data.zip** file and move the unzipped **diverse_community_excerpts_data** directory to the **sdnist-project** directory.
4. Delete the **diverse_community_excerpts_data.zip** file once the data is successfully extracted from the zip.
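
The manual download steps can be scripted with the Python standard library. This is a minimal sketch, assuming the archive unpacks to a top-level diverse_community_excerpts_data directory as the steps describe; the function names are hypothetical and the URL is the one given in step 1.

```python
import urllib.request
import zipfile
from pathlib import Path

# URL copied from step 1 above.
DATA_URL = (
    "https://github.com/usnistgov/SDNist/releases/download/"
    "v2.0.0/diverse_community_excerpts_data.zip"
)


def download_archive(url=DATA_URL, dest="diverse_community_excerpts_data.zip"):
    """Download the zip archive to `dest` and return its path."""
    urllib.request.urlretrieve(url, dest)
    return Path(dest)


def extract_and_remove(zip_path, project_dir="."):
    """Unzip into `project_dir`, delete the archive, and return the
    top-level names that were extracted."""
    zip_path = Path(zip_path)
    with zipfile.ZipFile(zip_path) as archive:
        # Zip member names always use "/" as the separator.
        top_level = sorted({name.split("/")[0] for name in archive.namelist()})
        archive.extractall(project_dir)
    zip_path.unlink()
    return top_level
```

From the sdnist-project directory, `extract_and_remove(download_archive())` performs the download, extraction, and cleanup in one call; checking that the returned list contains `diverse_community_excerpts_data` confirms the layout assumption.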
Binary file removed SDNist_introduction_paper_PPAI22.pdf
Binary file not shown.
Binary file not shown.
Binary file removed challenge benchmark problems/Survey Data Benchmark.pdf
Binary file not shown.
21 changes: 0 additions & 21 deletions examples/DPSyn/LICENSE

This file was deleted.
