Add tabular stats content for DLI (#3219)

Add tabular stats content.

### Description

Add tabular stats content since it was missed in the copy with the image stats.

### Types of changes

- [x] Non-breaking change (fix or new feature that would not break existing functionality).
- [ ] Breaking change (fix or new feature that would cause existing functionality to change).
- [ ] New tests added to cover the changes.
- [ ] Quick tests passed locally by running `./runtest.sh`.
- [ ] In-line docstrings updated.
- [ ] Documentation updated.
NVIDIA · Feb 12, 2025 · 32295a8
1 parent 98489fb
commit 32295a8

Showing 11 changed files with 755 additions and 9 deletions.
diff --git a/...ns/02.1_federated_statistics/federated_statistics_with_image_data/code/image_stats_job.py b/...ns/02.1_federated_statistics/federated_statistics_with_image_data/code/image_stats_job.py
@@ -25,7 +25,6 @@ def define_parser():
     parser.add_argument("-o", "--stats_output_path", type=str, nargs="?", default="statistics/stats.json")
     parser.add_argument("-j", "--job_dir", type=str, nargs="?", default="/tmp/nvflare/jobs/image_stats")
     parser.add_argument("-w", "--work_dir", type=str, nargs="?", default="/tmp/nvflare/workspace/image_stats")
-    parser.add_argument("-co", "--export_config", action="store_true", help="config only mode, export config")
 
     return parser.parse_args()
 
@@ -38,7 +37,6 @@ def main():
     output_path = args.stats_output_path
     job_dir = args.job_dir
     work_dir = args.work_dir
-    export_config = args.export_config
 
     statistic_configs = {"count": {}, "histogram": {"*": {"bins": 20, "range": [0, 256]}}}
     # define local stats generator
@@ -54,10 +52,9 @@ def main():
     sites = [f"site-{i + 1}" for i in range(n_clients)]
     job.setup_clients(sites)
 
-    if export_config:
-        job.export_job(job_dir)
-    else:
-        job.simulator_run(work_dir, gpu="0")
+    job.export_job(job_dir)
+
+    job.simulator_run(work_dir, gpu="0")
 
 
 if __name__ == "__main__":
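
Reviewer note on the hunk above: with `--export_config` removed, the script no longer chooses between exporting and simulating. It exports the job and then runs the simulator, in that order. A minimal sketch of the new control flow follows. `StubJob` is a hypothetical stand-in that only records calls (it is not the NVFlare job class), and the parser is trimmed to the two path arguments shown in the diff.

```python
import argparse


class StubJob:
    """Hypothetical stand-in for the job object: records calls, runs nothing."""

    def __init__(self):
        self.calls = []

    def export_job(self, job_dir):
        self.calls.append(("export_job", job_dir))

    def simulator_run(self, work_dir, gpu=None):
        self.calls.append(("simulator_run", work_dir, gpu))


def define_parser(argv=None):
    # Only the two path arguments from the diff; "-co/--export_config" is gone.
    parser = argparse.ArgumentParser()
    parser.add_argument("-j", "--job_dir", type=str, nargs="?", default="/tmp/nvflare/jobs/image_stats")
    parser.add_argument("-w", "--work_dir", type=str, nargs="?", default="/tmp/nvflare/workspace/image_stats")
    return parser.parse_args(argv)


def run(job, args):
    # New behavior: always export the job config, then always run the simulator.
    job.export_job(args.job_dir)
    job.simulator_run(args.work_dir, gpu="0")
    return job.calls


calls = run(StubJob(), define_parser([]))
print(calls)
```

One consequence worth noting in the notebook text: every run now writes the job configuration to `job_dir` before the simulator starts, so there is no longer an export-only mode.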

diff --git a/...tatistics/federated_statistics_with_image_data/federated_statistics_with_image_data.ipynb b/...tatistics/federated_statistics_with_image_data/federated_statistics_with_image_data.ipynb
@@ -130,7 +130,7 @@
    "id": "7e972070",
    "metadata": {},
    "source": [
-    "The file [image_stats_job.py](code/image_stats_job.py) uses the StatsJob to generate a job configuration in a Pythonic way. With the default arguments, the job will be exported to `/tmp/nvflare/jobs/image_stats` and then the job will be run with the FL simulator with the `simulator_run()` command with a work_dir of `/tmp/nvflare/workspace/image_stats`."
+    "The file [image_stats_job.py](code/image_stats_job.py) uses `StatsJob` to generate a job configuration in a Pythonic way. With the default arguments, the job will be exported to `/tmp/nvflare/jobs/image_stats` and then the job will be run with the FL simulator with the `simulator_run()` command with a work_dir of `/tmp/nvflare/workspace/image_stats`."
    ]
   },
   {

diff --git a/...atistics/federated_statistics_with_tabular_data/code/df_stats/demo/df_stats.png b/...atistics/federated_statistics_with_tabular_data/code/df_stats/demo/df_stats.png
diff --git a/...tistics/federated_statistics_with_tabular_data/code/df_stats/demo/hist_plot.png b/...tistics/federated_statistics_with_tabular_data/code/df_stats/demo/hist_plot.png
diff --git a/...atistics/federated_statistics_with_tabular_data/code/df_stats/demo/stats_df.png b/...atistics/federated_statistics_with_tabular_data/code/df_stats/demo/stats_df.png
diff --git a/..._statistics/federated_statistics_with_tabular_data/code/df_stats/demo/visualization.ipynb b/..._statistics/federated_statistics_with_tabular_data/code/df_stats/demo/visualization.ipynb
@@ -0,0 +1,293 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "id": "7c4be7b0",
+   "metadata": {},
+   "source": [
+    "# NVFlare Federated Statistics Visualization"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "987e6028",
+   "metadata": {},
+   "source": [
+    "#### Dependencies\n",
+    "\n",
+    "To run the examples, you will need to install the following dependencies:\n",
+    "* numpy\n",
+    "* pandas\n",
+    "* wget\n",
+    "* matplotlib\n",
+    "* jupyter\n",
+    "* notebook\n",
+    "\n",
+    "These are captured in [requirements.txt](../../requirements.txt)."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "665dc17e",
+   "metadata": {},
+   "source": [
+    "## Tabular Data Statistics Visualization\n",
+    "In this example, we demonstrate how to visualize the statistics computed for tabular data. The visualization requires the json, pandas, and matplotlib modules as well as the NVFlare visualization utilities."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "c44a0217",
+   "metadata": {
+    "tags": []
+   },
+   "outputs": [],
+   "source": [
+    "import json\n",
+    "import pandas as pd\n",
+    "\n",
+    "from nvflare.app_opt.statistics.visualization.statistics_visualization import Visualization"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "30c79d1a",
+   "metadata": {},
+   "source": [
+    "First, copy the resulting JSON file to the demo directory. In this example, the file is called `adults_stats.json`. Then load the JSON file:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "44f6bed2",
+   "metadata": {
+    "tags": []
+   },
+   "outputs": [],
+   "source": [
+    "with open('adults_stats.json', 'r') as f:\n",
+    "    data = json.load(f)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "c5cdbcc0",
+   "metadata": {},
+   "source": [
+    "Initialize the Visualization utilities:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "93c62d5e",
+   "metadata": {
+    "tags": []
+   },
+   "outputs": [],
+   "source": [
+    "vis = Visualization()"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "1b0f21fd",
+   "metadata": {},
+   "source": [
+    "### Overall Statistics"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "b49588c2",
+   "metadata": {},
+   "source": [
+    "`vis.show_stats()` shows the statistics for each feature, at each site, for each dataset:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "ab771712",
+   "metadata": {
+    "tags": []
+   },
+   "outputs": [],
+   "source": [
+    "\n",
+    "vis.show_stats(data = data)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "4986dd14",
+   "metadata": {},
+   "source": [
+    "### Select feature statistics using white_list_features\n",
+    "You can optionally show only specified features via the `white_list_features` argument. In the following, only three features are selected instead of all the features:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "563a8bb7",
+   "metadata": {
+    "tags": []
+   },
+   "outputs": [],
+   "source": [
+    "vis.show_stats(data = data, white_list_features= ['Age', 'fnlwgt', 'Hours per week'])"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "95e42829",
+   "metadata": {},
+   "source": [
+    "### Histogram Visualization\n",
+    "You can use `vis.show_histograms()` to visualize the histograms. Before doing that, you can set some IPython display settings so that the graph uses the full width of the cell."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "fcdfb197",
+   "metadata": {
+    "tags": []
+   },
+   "outputs": [],
+   "source": [
+    "from IPython.display import display, HTML\n",
+    "display(HTML(\"<style>.container { width:100% !important; }</style>\"))"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "3e86860e",
+   "metadata": {},
+   "source": [
+    "The following command displays histograms for numeric features. The result shows both the main plot and sub-plots:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "f3dd3821",
+   "metadata": {
+    "tags": []
+   },
+   "outputs": [],
+   "source": [
+    "vis.show_histograms(data = data)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "49b74579",
+   "metadata": {},
+   "source": [
+    "### Display Options\n",
+    "As with the other statistics, you can use white_list_features to select the features to display in the histograms. You can also use display_format=\"percent\" so that all datasets and sites are displayed on the same scale. You can set:\n",
+    "\n",
+    "* display_format: \"percent\" or \"sample_count\"\n",
+    "* white_list_features: feature names\n",
+    "* plot_type : \"both\" or \"main\" or \"subplot\"\n",
+    "\n",
+    "#### Show percent display format with selected features\n",
+    "In the following, only the feature \"Age\" is displayed, in \"percent\" display_format, with \"both\" as the plot_type (since that is the default setting)."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "e07b9266",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "vis.show_histograms(data = data, display_format = \"percent\", white_list_features= ['Age'])"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "cddf21af",
+   "metadata": {},
+   "source": [
+    "#### Display main plot_type with selected features\n",
+    "In this example, two features are displayed in \"sample_counts\" display_format, with \"main\" plot_type."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "038e238e",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "vis.show_histograms(data, \"sample_counts\", ['Age', 'Hours per week' ], plot_type=\"main\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "b06195ac",
+   "metadata": {},
+   "source": [
+    "#### Selected features with subplot plot_type\n",
+    "In the next example, two features are displayed in \"sample_counts\" display_format, with \"subplot\" plot_type."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "8958e124",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "vis.show_histograms(data, \"sample_counts\", ['Age', 'Hours per week' ], plot_type=\"subplot\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "2f330eb6",
+   "metadata": {},
+   "source": [
+    "### Tip: Avoid repeated calculation\n",
+    "If you intend to plot the histogram main plot and subplot separately, repeatedly calling `show_histograms()` with different plot_types is not efficient, because each call recalculates the same set of DataFrames. To avoid the duplicated calculations, you can use the following functions instead of `show_histograms()`. If you intend to show both plots, then `show_histograms()` should be used."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "395315a4",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "feature_dfs = vis.get_histogram_dataframes(data, display_format=\"percent\")\n",
+    "   \n",
+    "vis.show_dataframe_plots(feature_dfs, plot_type=\"main\")"
+   ]
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "nvflare_example",
+   "language": "python",
+   "name": "nvflare_example"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.10.2"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 5
+}
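
Reviewer note on the Display Options cell in the notebook above: the text says `display_format="percent"` puts all datasets and sites on the same scale. A rough sketch of why that helps, assuming "percent" means each bin's count divided by the dataset's total sample count. That normalization is an assumption made for illustration, not something taken from the NVFlare source, and `to_percent` and the bin counts below are made up.

```python
def to_percent(bin_counts):
    """Convert raw per-bin sample counts to fractions of the dataset total."""
    total = sum(bin_counts)
    if total == 0:
        # An empty dataset has no distribution; report zeros instead of dividing by zero.
        return [0.0 for _ in bin_counts]
    return [count / total for count in bin_counts]


# Hypothetical "Age" histograms from two sites with the same shape but
# very different sample sizes (100 vs. 1000 samples).
site_1_age = [10, 30, 60]
site_2_age = [100, 300, 600]

p1 = to_percent(site_1_age)
p2 = to_percent(site_2_age)

# On the raw-count scale the larger site dominates the y-axis; after
# normalization each site's bins sum to 1 and the shapes can be compared.
print(max(site_2_age) / max(site_1_age))
print(p1, p2)
```

With raw counts, the smaller site's bars would be barely visible next to the larger site's. On the normalized scale the two distributions overlap, which is the comparison the percent format is meant to support.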