Apply automated document enhancement modifications #3165

nvkevlu · 2025-01-22T17:40:29Z

Applies the more straightforward automated document enhancement modifications.

Description

Applies the more straightforward automated document enhancement modifications that are the same as what was merged in the 2.5 branch.

Types of changes

- [x] Non-breaking change (fix or new feature that would not break existing functionality).
- [ ] Breaking change (fix or new feature that would cause existing functionality to change).
- [ ] New tests added to cover the changes.
- [ ] Quick tests passed locally by running `./runtest.sh`.
- [ ] In-line docstrings updated.
- [ ] Documentation updated.

nvkevlu · 2025-01-22T19:00:26Z

/build

nvkevlu · 2025-01-27T19:22:22Z

/build

ZiyueXu77 · 2025-01-29T23:32:00Z

/build

YuanTingHsieh

Mostly LGTM, two questions

YuanTingHsieh · 2025-01-30T05:04:26Z

docs/user_guide/dashboard_api.rst


 .. include:: nvflare_cli/dashboard_command.rst

 **********************************
-NVIDIA FLARE Dashboard backend API
+NVIDIA FLARE Dashboard backend APIs
 **********************************


Do we need to lengthen these rows of asterisks to match the longer title? If this renders fine, then we are good.
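For reference, here is a rough local check for this kind of mismatch. It is only a heuristic sketch, not how docutils or Sphinx actually parse section titles: it flags any text line whose following adornment line is shorter than the text. The function name `short_adornments` and the 34-character adornment length in the sample are assumptions made for illustration (34 matches the length of the old title).

```python
# Heuristic check for RST section adornments that are shorter than their title.
# Not a docutils replacement: it only looks at (text line, adornment line) pairs.

ADORNMENT_CHARS = set("=-`:'\"~^_*+#<>.")


def is_adornment(line: str) -> bool:
    """True if the line is a run of one repeated punctuation character."""
    return len(line) >= 2 and len(set(line)) == 1 and line[0] in ADORNMENT_CHARS


def short_adornments(rst_text: str) -> list[tuple[int, str]]:
    """Return (1-based title line number, title) for each too-short underline."""
    lines = [line.rstrip() for line in rst_text.splitlines()]
    problems = []
    for i in range(1, len(lines)):
        title, under = lines[i - 1], lines[i]
        if not title.strip() or is_adornment(title):
            continue  # previous line is blank or itself an adornment
        if is_adornment(under) and len(under) < len(title):
            problems.append((i, title))
    return problems


# Sample modeled on the hunk above; the adornment length of 34 is an assumption.
sample = "\n".join(
    [
        "*" * 34,
        "NVIDIA FLARE Dashboard backend APIs",
        "*" * 34,
    ]
)

if __name__ == "__main__":
    for line_no, title in short_adornments(sample):
        print(f"line {line_no}: adornment shorter than title {title!r}")
```

If the function reports the title, extending both rows of asterisks by one character should silence it. Building the docs and looking for a title-adornment warning is still the authoritative test.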

YuanTingHsieh · 2025-01-30T05:05:02Z

docs/user_guide/flower_integration/flower_job_structure.rst

+Server App Specification



To be consistent with ClientApp, should we add a "------------------------" underline here as well?

chesterxgchen · 2025-02-02T06:53:02Z

/build

Bumps [cross-spawn](https://github.com/moxystudio/node-cross-spawn) from 7.0.3 to 7.0.6. - [Changelog](https://github.com/moxystudio/node-cross-spawn/blob/master/CHANGELOG.md) - [Commits](https://github.com/moxystudio/node-cross-spawn/compare/v7.0.3...v7.0.6) --- updated-dependencies: - dependency-name: cross-spawn dependency-type: indirect ... Signed-off-by: dependabot[bot] <[email protected]> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> add launcher for esm2 inference run mlp training run local & federated mlp training with job api remove unused import remove unused imports return inference information fix poc command (#3087) Co-authored-by: Chester Chen <[email protected]> Fixed the simulator without custom folder job run error. (#3091) Docker job launcher (#3072) * Added ClientDockerJobLauncher. * Added ServerDockerJobLauncher, replaced "server" with const. * extract generate_run_command() for sharing the same method between ProcessJobLauncher and DockerJobLauncher. * rename methods. * Moved the shared functions to job_launcher_utils. Removed the improper inheritence. * refactored ProcessJobLauncher. * merged from main codes. * Revert "merged from main codes." This reverts commit 09cac6d96ca47a558088a16b1dc8f0f72439b247. * refactored. * removed the docker workspace requirement for DockerJobLauncher. * removed no use imports. * renamed the GET_JOB_LAUNCHER to BEFORE_JOB_LAUNCH. 
--------- Co-authored-by: Chester Chen <[email protected]> Logger hierarchy (#3081) * convert to logger hierarchy * add functions to log_utils Refactor provision for general use - Part 1 (#3092) * refactor provision for general use * reformat * address pr comments * fix test case * reorg file structure * address pr comments Implement a new algorithm for the CUDA plugin (#3085) * New cuda plugin implementation Signed-off-by: YuanTingHsieh <[email protected]> * Update docstring * Rename CellTable to GHPairArray for clarity and allow max_num_of_gh_pair_per_launch to be customized --------- Signed-off-by: YuanTingHsieh <[email protected]> Removed the extra client app custom folder. (#3101) Add storage capability for client logs and allow for use with LogSender and LogReceiver (#3077) * add commit * fix order of callback to have file closed in order to access * clean up * more cleanup and use thread for sending * fix ci Keep project_name shorter than the limit (#3106) * Fix #3093 * Address comments Directly send tensor via jit serialization (#3088) * Added support for bfloat16 tensor using JIT * directly send tensor via jit serialization * polish sft_job * polish sft_job * polish local training script * polish tensor params converter * polish decomposer * format correction * header update * update decomposer * end to end tensor communication passed --------- Co-authored-by: Zhihong Zhang <[email protected]> Update root readme (#3108) * update root readme * fix logo alt text Moved xgboost plugin building instructions to prerequisite section (#3111) Handle param converter according to exchange format (#3115) * Added support for bfloat16 tensor using JIT * directly send tensor via jit serialization * polish sft_job * polish sft_job * polish local training script * polish tensor params converter * polish decomposer * format correction * header update * update decomposer * end to end tensor communication passed * update quantization filters to handle tensor * bug fixes and 
unittest updates * unit test cannot run on gpu, update case * bug fixes and polish * format update * handle param converter according to exchange_format * add missed modifications * expose param_exchange_format to scriptrunner * expose from/to_nvflare_converter to basescriptrunner * update torch exchange format default to numpy --------- Co-authored-by: Zhihong Zhang <[email protected]> Bump nanoid from 3.3.7 to 3.3.8 in /web (#3113) Bumps [nanoid](https://github.com/ai/nanoid) from 3.3.7 to 3.3.8. - [Release notes](https://github.com/ai/nanoid/releases) - [Changelog](https://github.com/ai/nanoid/blob/main/CHANGELOG.md) - [Commits](https://github.com/ai/nanoid/compare/3.3.7...3.3.8) --- updated-dependencies: - dependency-name: nanoid dependency-type: indirect ... Signed-off-by: dependabot[bot] <[email protected]> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Bump astro from 4.16.6 to 4.16.18 in /web (#3117) Bumps [astro](https://github.com/withastro/astro/tree/HEAD/packages/astro) from 4.16.6 to 4.16.18. - [Release notes](https://github.com/withastro/astro/releases) - [Changelog](https://github.com/withastro/astro/blob/[email protected]/packages/astro/CHANGELOG.md) - [Commits](https://github.com/withastro/astro/commits/[email protected]/packages/astro) --- updated-dependencies: - dependency-name: astro dependency-type: direct:production ... Signed-off-by: dependabot[bot] <[email protected]> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> fix external lib issue (#3119) Add list of research examples (#3110) * add list of research examples * update research list * blog link, updated doc, publications --------- Co-authored-by: Ziyue Xu <[email protected]> dictConfig, log structure, formatters, and filters (#3126) Fed Statistics: Adding Percentiles support (#3124) * 1. Add percentile support using t-digest 2. Add examples for df_stats 3. refactoring the some of the codebase 4. missing work 1. 
add DP noise 2. make writing filer easier for end-user 3. add job API for the stats. Job 4. make it even easier to work on stats. 5. unit tests * 1. Add percentile support using t-digest 2. Add examples for df_stats 3. refactoring the some of the codebase 4. missing work 1. add DP noise 2. make writing filer easier for end-user 3. add job API for the stats. Job 4. make it even easier to work on stats. 5. unit tests * add unit tests add job api in example format style * add tdigest license file * remove debugging print * fix test * format style changes * format style changes Fixed the FOBS lazy loading issue (#3121) * Fixed the FOBS lazy loading issue * Fixed format * Added check for builtins without module name Add the fed event runner to the SP and CP. (#3129) Add object retrieval thru streaming (#3125) * add object retrieval thru streaming * address PR comments * fix typo * added exception handling for validate_request update the client send_aux_request logging (#3136) * update the client send_aux_request logging. * Update the logging message. Add capability to publish metrics to prometheus (#2684) One of the feature request is to add system metrics to monitoring FLARE running metrics via Prometheus + Grafana or other monitoring systems. This PR add that missing piece. Here are few pieces to make this work 1) JobMetricsCollector/SysMetricsCollecor, this collector will subscribe a callback for the ReservedTopic.APP_METRICS topic in the DataBus; and receive callback when the topic is published. The SysMetricsCollector listens to the parent process events ( system start/end etc.) for client and server process The JobMetricsCollector listens to the job process events, mostly related to the job, task etc. 2) StatsD-reporter The statsd-reporter post the the metrics received ( from event callback) to the statsd-exporter interface: by default localhost:9125. 
The statsd-exporter exposes the <host>:9102/metrics web interface for Prometheus to scrape, which can then be used as a data source for Grafana to visualize. This is a standard setup. We added an example with a docker-compose file to illustrate this process.

| Event | Metric Count | Metric Time Taken |
|-------|--------------|-------------------|
| SYSTEM_START | _system_start_count | |
| SYSTEM_END | _system_end_count | _system_time_taken |
| ABOUT_TO_START_RUN | _about_to_start_run_count | |
| START_RUN | _start_run_count | |
| ABOUT_TO_END_RUN | _about_to_end_run_count | |
| END_RUN | _end_run_count | _run_time_taken |
| CHECK_END_RUN_READINESS | _check_end_run_readiness_count | |
| SWAP_IN | _swap_in_count | |
| SWAP_OUT | _swap_out_count | |
| START_WORKFLOW | _start_workflow_count | |
| END_WORKFLOW | _end_workflow_count | _workflow_time_taken |
| ABORT_TASK | _abort_task_count | |
| FATAL_SYSTEM_ERROR | _fatal_system_error_count | |
| JOB_DEPLOYED | _job_deployed_count | |
| JOB_STARTED | _job_started_count | |
| JOB_COMPLETED | _job_completed_count | _job_time_taken |
| JOB_ABORTED | _job_aborted_count | |
| JOB_CANCELLED | _job_cancelled_count | |
| CLIENT_DISCONNECTED | _client_disconnected_count | |
| CLIENT_RECONNECTED | _client_reconnected_count | |
| BEFORE_PULL_TASK | _before_pull_task_count | |
| AFTER_PULL_TASK | _after_pull_task_count | _pull_task_time_taken |
| BEFORE_PROCESS_TASK_REQUEST | _before_process_task_request_count | |
| AFTER_PROCESS_TASK_REQUEST | _after_process_task_request_count | _process_task_request_time_taken |
| BEFORE_PROCESS_SUBMISSION | _before_process_submission_count | |
| AFTER_PROCESS_SUBMISSION | _after_process_submission_count | _process_submission_time_taken |
| BEFORE_TASK_DATA_FILTER | _before_task_data_filter_count | |
| AFTER_TASK_DATA_FILTER | _after_task_data_filter_count | _data_filter_time_taken |
| BEFORE_TASK_RESULT_FILTER | _before_task_result_filter_count | |
| AFTER_TASK_RESULT_FILTER | _after_task_result_filter_count | _result_filter_time_taken |
| BEFORE_TASK_EXECUTION | _before_task_execution_count | |
| AFTER_TASK_EXECUTION | _after_task_execution_count | _task_execution_time_taken |
| BEFORE_SEND_TASK_RESULT | _before_send_task_result_count | |
| AFTER_SEND_TASK_RESULT | _after_send_task_result_count | _send_task_result_time_taken |
| BEFORE_PROCESS_RESULT_OF_UNKNOWN_TASK | _before_process_result_of_unknown_task_count | |
| AFTER_PROCESS_RESULT_OF_UNKNOWN_TASK | _after_process_result_of_unknown_task_count | _process_result_of_unknown_task_time_taken |
| PRE_RUN_RESULT_AVAILABLE | _pre_run_result_available_count | |
| BEFORE_CHECK_CLIENT_RESOURCES | _before_check_client_resources_count | |
| AFTER_CHECK_CLIENT_RESOURCES | _after_check_client_resources_count | _check_client_resources_time_taken |
| SUBMIT_JOB | _submit_job_count | |
| DEPLOY_JOB_TO_SERVER | _deploy_job_to_server_count | |
| DEPLOY_JOB_TO_CLIENT | _deploy_job_to_client_count | |
| BEFORE_CHECK_RESOURCE_MANAGER | _before_check_resource_manager_count | |
| BEFORE_SEND_ADMIN_COMMAND | _before_send_admin_command_count | |
| BEFORE_CLIENT_REGISTER | _before_client_register_count | |
| AFTER_CLIENT_REGISTER | _after_client_register_count | client_register_time_taken |
| CLIENT_REGISTER_RECEIVED | _client_register_received_count | |
| CLIENT_REGISTER_PROCESSED | _client_register_processed_count | |
| CLIENT_QUIT | _client_quit_count | |
| SYSTEM_BOOTSTRAP | _system_bootstrap_count | |

These metrics can be separated into Job Metrics and System Metrics. System Metrics are associated with the Client and Server parent processes, while Job Metrics are associated with each job.
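To illustrate the naming pattern in the count column of the table above, here is a minimal sketch. It is not the NVFlare collector or reporter code: the helper names and the optional `prefix` argument are hypothetical, and the leading underscore is assumed to follow some site- or job-specific prefix that the description does not spell out. The counter line uses the generic StatsD text format (`name:value|c`), sent over UDP to the default exporter address mentioned earlier (localhost:9125).

```python
import socket


def count_metric_name(event: str, prefix: str = "") -> str:
    """Derive the count metric name for an event, e.g. SYSTEM_START -> _system_start_count.

    `prefix` is a hypothetical placeholder for whatever precedes the underscore.
    """
    return f"{prefix}_{event.lower()}_count"


def statsd_counter_line(metric: str, value: int = 1) -> str:
    """Format a counter increment in the plain StatsD line protocol."""
    return f"{metric}:{value}|c"


def send_counter(event: str, host: str = "localhost", port: int = 9125) -> None:
    """Send one counter increment over UDP (fire-and-forget, no reply expected)."""
    line = statsd_counter_line(count_metric_name(event))
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(line.encode("utf-8"), (host, port))


if __name__ == "__main__":
    for event in ("SYSTEM_START", "JOB_STARTED", "AFTER_PULL_TASK"):
        print(statsd_counter_line(count_metric_name(event)))
```

The time-taken column does not follow a single mechanical rule (for example END_RUN maps to `_run_time_taken`), so the sketch deliberately covers only the count names.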
We support three different setups: ![setup-1](https://github.com/user-attachments/assets/c031cf99-a997-4d0d-9601-be1e71394bc3) ![setup-2](https://github.com/user-attachments/assets/dd37ac9b-32d3-4c6f-94f1-b2878dda1616) ![setup-3](https://github.com/user-attachments/assets/28182d8c-3672-41e9-9e3a-227c613ccf31) The detailed examples for setup 1 and 2 are given using hello-pt A few sentences describing the changes proposed in this pull request.  - [x] Non-breaking change (fix or new feature that would not break existing functionality). - [ ] Breaking change (fix or new feature that would cause existing functionality to change). - [ ] New tests added to cover the changes. - [ ] Quick tests passed locally by running `./runtest.sh`. - [ ] In-line docstrings updated. - [ ] Documentation updated. Update dashboard to use nvflare-dashboard prefix (#3131) * update dashboard to use nvflare-dashboard prefix * fix ci * update with variable Support connection security and message authentication (#3135) Fixes # . This PR implements: - Support of different connection security: mTLS, TLS, and clear (insecure) - Support of explicit message authentication - Refactored job process arg computation to make it easy for different job launchers. This PR is based on the following PRs for 2.5: https://github.com/NVIDIA/NVFlare/pull/3105 https://github.com/NVIDIA/NVFlare/pull/3103 https://github.com/NVIDIA/NVFlare/pull/3096 https://github.com/NVIDIA/NVFlare/pull/3062  - [x] Non-breaking change (fix or new feature that would not break existing functionality). - [ ] Breaking change (fix or new feature that would cause existing functionality to change). - [ ] New tests added to cover the changes. - [ ] Quick tests passed locally by running `./runtest.sh`. - [ ] In-line docstrings updated. - [ ] Documentation updated. Update to use vulnerability-scan runner [skip ci] (#3142) Required by ProdSec team to enable new malware scan. 
note: this change does not enable the malware scan directly but try to change to use the new shared runner within NVIDIA org. After merging this change, we will need to ask the Blossom team to enable the actual malware scan internally  - [x] Non-breaking change (fix or new feature that would not break existing functionality). - [ ] Breaking change (fix or new feature that would cause existing functionality to change). - [ ] New tests added to cover the changes. - [ ] Quick tests passed locally by running `./runtest.sh`. - [ ] In-line docstrings updated. - [ ] Documentation updated. Signed-off-by: Peixin Li <[email protected]> Remove the build folder for unit test to ensure correctness (#3143) Fix the occasionally issue in TestTaskScriptRunner Sometimes the unit test failed because there is a "build" dir, this will lead to test failure: ``` AssertionError: Lists differ: ['/ho[28 chars]VFlare-premerge/build/lib/nvflare/cli.py', '--batch_size', '4'] != ['/ho[28 chars]VFlare-premerge/nvflare/cli.py', '--batch_size', '4'] ``` That build dir is generated by pip install and can be removed during the unit test phase to ensure correctness  - [x] Non-breaking change (fix or new feature that would not break existing functionality). - [ ] Breaking change (fix or new feature that would cause existing functionality to change). - [ ] New tests added to cover the changes. - [x] Quick tests passed locally by running `./runtest.sh`. - [ ] In-line docstrings updated. - [ ] Documentation updated. Add user guide and docs for logging (#3134)  - [x] Non-breaking change (fix or new feature that would not break existing functionality). - [ ] Breaking change (fix or new feature that would cause existing functionality to change). - [ ] New tests added to cover the changes. - [ ] Quick tests passed locally by running `./runtest.sh`. - [ ] In-line docstrings updated. - [ ] Documentation updated. 
Dynamic logging with admin commands (#3127) depends on https://github.com/NVIDIA/NVFlare/pull/3126 - Add dynamic logging mechanism with options to provide a file, levelname/level for root level, or to reload - Add log_config argument for simulator - `configure_site_log target config` under operate command category - `configure_job_log job_id target config` under manage_job command category  - [x] Non-breaking change (fix or new feature that would not break existing functionality). - [ ] Breaking change (fix or new feature that would cause existing functionality to change). - [ ] New tests added to cover the changes. - [ ] Quick tests passed locally by running `./runtest.sh`. - [ ] In-line docstrings updated. - [ ] Documentation updated. Add ignore_errors=True when trying to remove the build dir (#3146) The build directory sometimes is removed during the test is running, right after the check "if os.path.exists(build_dir)", so we add "ignore_errors=True"  - [x] Non-breaking change (fix or new feature that would not break existing functionality). - [ ] Breaking change (fix or new feature that would cause existing functionality to change). - [ ] New tests added to cover the changes. - [ ] Quick tests passed locally by running `./runtest.sh`. - [ ] In-line docstrings updated. - [ ] Documentation updated. Add commands to list and retrieve additional components (#3114) * add commands to list and retrieve additional components * fix file Add error log sending to master template and add provisioning configuration (#3138) Add error log sending to master template and an ability to configure it in provisioning. Add error log sending to master template and an ability to configure it in provisioning. A new PropKey is expected for allow_error_sending: true to be in the project.yml for a client to allow error log sending. 
If this is not included, a _modify_error_sender callback added to StaticFileBuilder's build_from_template will remove the error_log_sender component from the client during provisioning.  - [x] Non-breaking change (fix or new feature that would not break existing functionality). - [ ] Breaking change (fix or new feature that would cause existing functionality to change). - [ ] New tests added to cover the changes. - [ ] Quick tests passed locally by running `./runtest.sh`. - [ ] In-line docstrings updated. - [ ] Documentation updated. docker job launcher provision (#3116) Fixes # . Added docker job launcher configuration provision ability.  - [x] Non-breaking change (fix or new feature that would not break existing functionality). - [ ] Breaking change (fix or new feature that would cause existing functionality to change). - [ ] New tests added to cover the changes. - [ ] Quick tests passed locally by running `./runtest.sh`. - [ ] In-line docstrings updated. - [ ] Documentation updated. Logging enhancements (#3148) - factor out fl_ctx from message into log record attribute - improve formatter inheritance - add exclude_logger_names option to LoggerNameFilter - fix simulator log_config path for running directory  - [x] Non-breaking change (fix or new feature that would not break existing functionality). - [ ] Breaking change (fix or new feature that would cause existing functionality to change). - [ ] New tests added to cover the changes. - [ ] Quick tests passed locally by running `./runtest.sh`. - [ ] In-line docstrings updated. - [ ] Documentation updated. Add self-paced training tutorial structure (#3145) Training tutorial structure: initial push  - [x] Non-breaking change (fix or new feature that would not break existing functionality). - [ ] Breaking change (fix or new feature that would cause existing functionality to change). - [ ] New tests added to cover the changes. - [ ] Quick tests passed locally by running `./runtest.sh`. 
- [ ] In-line docstrings updated. - [ ] Documentation updated. Improve json serialization to accomodate numpy float32 (#3028) Since numpy 2.0, some type promoting rules are different: https://numpy.org/devdocs/numpy_2_0_migration_guide.html#changes-to-numpy-data-type-promotion - Change to always cast np.float32 to np.float64 (for serialization) before calling json.dump - Remove unused calls of dxo.validate, it is already invoked inside the DXO init method and checked  - [x] Non-breaking change (fix or new feature that would not break existing functionality). - [ ] Breaking change (fix or new feature that would cause existing functionality to change). - [ ] New tests added to cover the changes. - [ ] Quick tests passed locally by running `./runtest.sh`. - [ ] In-line docstrings updated. - [ ] Documentation updated. Co-authored-by: Chester Chen <[email protected]> Remove the hardcode 0 and use the default value for LauncherExecutor (#3153) The original hardcode 0 has a problem, if the external user code has an exception and the program will never return. Our FL client job process (running LauncherExecutor) will never ends. By using the default heartbeat_timeout value, if the FL client job process does not receive the heartbeat from the user process for heartbeat_timeout seconds, then we will consider it dead. - Remove the hardcode 0 and use the default value for LauncherExecutor  - [x] Non-breaking change (fix or new feature that would not break existing functionality). - [ ] Breaking change (fix or new feature that would cause existing functionality to change). - [ ] New tests added to cover the changes. - [ ] Quick tests passed locally by running `./runtest.sh`. - [ ] In-line docstrings updated. - [ ] Documentation updated. Fix Brats18 readme typo [skip ci] (#3154) fixed poc start command and simulator error message (#3157) …age. Fixes # . - Fix the simulator run missing __server_config__ in FL context error message due to identify changes. 
- Fix the "nvflare poc start" error.  - [x] Non-breaking change (fix or new feature that would not break existing functionality). - [ ] Breaking change (fix or new feature that would cause existing functionality to change). - [ ] New tests added to cover the changes. - [ ] Quick tests passed locally by running `./runtest.sh`. - [ ] In-line docstrings updated. - [ ] Documentation updated. Self paced training -- structure change plus Chapter 1 (#3158) Fixes # . Restructure the self-paced-training A few sentences describing the changes proposed in this pull request.  - [x] Non-breaking change (fix or new feature that would not break existing functionality). - [ ] Breaking change (fix or new feature that would cause existing functionality to change). - [ ] New tests added to cover the changes. - [ ] Quick tests passed locally by running `./runtest.sh`. - [ ] In-line docstrings updated. - [ ] Documentation updated. Enhance launcher executor (#2744) - Add clear of peer_is_up_or_dead - Enhance error messages - Change "train_with_evaluation" default value to False so we don't require user to return metrics  - [x] Non-breaking change (fix or new feature that would not break existing functionality). - [ ] Breaking change (fix or new feature that would cause existing functionality to change). - [ ] New tests added to cover the changes. - [ ] Quick tests passed locally by running `./runtest.sh`. - [ ] In-line docstrings updated. - [ ] Documentation updated. Add logging tutorial notebook (#3150)  - [x] Non-breaking change (fix or new feature that would not break existing functionality). - [ ] Breaking change (fix or new feature that would cause existing functionality to change). - [ ] New tests added to cover the changes. - [ ] Quick tests passed locally by running `./runtest.sh`. - [ ] In-line docstrings updated. - [ ] Documentation updated. 
CC authorizers (#3052) * Sync main branch with cc authorizers * Fix merge conflict * Both authorizers verified * Code clenup * Reformatted * Remove MAA authorizer * Address PR comments Update HF training items for future-proof (#3161) Fixes JIRA FLARE-2332 on main branch. Add HF lib version, and remove HF training config items that are either outdated already or to be deprecated in next release  - [x] Non-breaking change (fix or new feature that would not break existing functionality). - [ ] Breaking change (fix or new feature that would cause existing functionality to change). - [ ] New tests added to cover the changes. - [ ] Quick tests passed locally by running `./runtest.sh`. - [ ] In-line docstrings updated. - [ ] Documentation updated. Disable HA CI tests (#3097) - Disable HA tests as the current impl. is obsolete - Make master_template consistent (in nvflare/lighter/templates/master_template.yml and nvflare/lighter/impl/master_template.yml)  - [x] Non-breaking change (fix or new feature that would not break existing functionality). - [ ] Breaking change (fix or new feature that would cause existing functionality to change). - [ ] New tests added to cover the changes. - [ ] Quick tests passed locally by running `./runtest.sh`. - [ ] In-line docstrings updated. - [ ] Documentation updated. Co-authored-by: Sean Yang <[email protected]> Add log_config.json to MANIFEST.in (#3166) Fixes [2332](https://jirasw.nvidia.com/browse/FLARE-2332)  - [x] Non-breaking change (fix or new feature that would not break existing functionality). - [ ] Breaking change (fix or new feature that would cause existing functionality to change). - [ ] New tests added to cover the changes. - [ ] Quick tests passed locally by running `./runtest.sh`. - [ ] In-line docstrings updated. - [ ] Documentation updated. Co-authored-by: Ziyue Xu <[email protected]> remove jobs, support bf16 in sklearn mlp finetuning script Remove 8bit tests causing current unit test failure (#3174) Fixes # . 
Update quantization_test.py to Remove 8bit tests  - [x] Non-breaking change (fix or new feature that would not break existing functionality). - [ ] Breaking change (fix or new feature that would cause existing functionality to change). - [ ] New tests added to cover the changes. - [ ] Quick tests passed locally by running `./runtest.sh`. - [ ] In-line docstrings updated. - [ ] Documentation updated. test with external process, launch several times test sequence level classification update run script for sequence classification fix training script move finetune script minor updates clean imports clean finetuning script nemo wrapper fix update examples shuffle data loaders each round test with shuffling call llm.train once only Fix custom dir path passed to Flower client supernode subprocess (#3168) Fixes #3169. Since the supernode is started with its current working directory set to the client app dir, if the custom directory is not an absolute path, it needs to be passed relative to the app dir.  - [ ] Non-breaking change (fix or new feature that would not break existing functionality). - [X] Breaking change (fix or new feature that would cause existing functionality to change). - [ ] New tests added to cover the changes. - [ ] Quick tests passed locally by running `./runtest.sh`. - [ ] In-line docstrings updated. - [ ] Documentation updated. Co-authored-by: Ziyue Xu <[email protected]> Change ParamsConverter logs to debug level (#3175)  - [x] Non-breaking change (fix or new feature that would not break existing functionality). - [ ] Breaking change (fix or new feature that would cause existing functionality to change). - [ ] New tests added to cover the changes. - [ ] Quick tests passed locally by running `./runtest.sh`. - [ ] In-line docstrings updated. - [ ] Documentation updated. Migrate job template to job api (#3172) Fixes # . For KM example  - [x] Non-breaking change (fix or new feature that would not break existing functionality). 
- [ ] Breaking change (fix or new feature that would cause existing functionality to change). - [ ] New tests added to cover the changes. - [ ] Quick tests passed locally by running `./runtest.sh`. - [ ] In-line docstrings updated. - [ ] Documentation updated. Remove duplicate templates, add to MANIFEST.in (#3183) - Remove duplicate templates files after provision refactor, update paths - Build process does not include lighter/templates -> added to MANIFEST.in  - [x] Non-breaking change (fix or new feature that would not break existing functionality). - [ ] Breaking change (fix or new feature that would cause existing functionality to change). - [ ] New tests added to cover the changes. - [ ] Quick tests passed locally by running `./runtest.sh`. - [ ] In-line docstrings updated. - [ ] Documentation updated. Fixed the subprocess read deadlock (#3188) This fixes the subprocess launcher deadlock issue caused by readline. It's cherry-picked from 2.5 branch.  - [x] Non-breaking change (fix or new feature that would not break existing functionality). - [ ] Breaking change (fix or new feature that would cause existing functionality to change). - [ ] New tests added to cover the changes. - [ ] Quick tests passed locally by running `./runtest.sh`. - [ ] In-line docstrings updated. - [ ] Documentation updated. Fix log fl_ctx parsing (#3179) Add failure case when attempting to parse bracketed fl_ctx from message  - [x] Non-breaking change (fix or new feature that would not break existing functionality). - [ ] Breaking change (fix or new feature that would cause existing functionality to change). - [ ] New tests added to cover the changes. - [ ] Quick tests passed locally by running `./runtest.sh`. - [ ] In-line docstrings updated. - [ ] Documentation updated. Quant unittest fix (#3191) Fixes # . 
Remove quantization unit test, mainly due to limited testable scope and random errors caused by imported but not used packages  - [x] Non-breaking change (fix or new feature that would not break existing functionality). - [ ] Breaking change (fix or new feature that would cause existing functionality to change). - [ ] New tests added to cover the changes. - [ ] Quick tests passed locally by running `./runtest.sh`. - [ ] In-line docstrings updated. - [ ] Documentation updated. Add yaml include support (#3133) Fixes # . Added the support for yaml to include another yaml configuration file. YAML does not naturally support any kind of "import" or "include" statement to include another yaml file. Adding this to support the yaml config in the format like: (include could be single file, or a list of include files.) .... include: 1.yml or: include: [1.yml, 2.yml] The "include" can be used at any level. Also support recursively include.  - [x] Non-breaking change (fix or new feature that would not break existing functionality). - [ ] Breaking change (fix or new feature that would cause existing functionality to change). - [ ] New tests added to cover the changes. - [ ] Quick tests passed locally by running `./runtest.sh`. - [ ] In-line docstrings updated. - [ ] Documentation updated. --------- Co-authored-by: Ziyue Xu <[email protected]> Container Streaming/Retriever (#3173) 1. ContainerStreamer to stream containers. 2. ContainerRetriever to retrieve containers from a remote site. 3. Added examples for file and container streaming. 4. Merged the class loading functions in class_utils and FOBS. 5. Fixed a F3 bug that wipes out original exception.  - [x] Non-breaking change (fix or new feature that would not break existing functionality). - [ ] Breaking change (fix or new feature that would cause existing functionality to change). - [ ] New tests added to cover the changes. - [ ] Quick tests passed locally by running `./runtest.sh`. - [ ] In-line docstrings updated. 
- [ ] Documentation updated. Specify framwork for km exmaple (#3194) Fixes # . Otherwise the default pytorch launcher will be used, which is not necessary  - [x] Non-breaking change (fix or new feature that would not break existing functionality). - [ ] Breaking change (fix or new feature that would cause existing functionality to change). - [ ] New tests added to cover the changes. - [ ] Quick tests passed locally by running `./runtest.sh`. - [ ] In-line docstrings updated. - [ ] Documentation updated. Update streaming example (#3195) Fixes # . Convert job templates to job API, add end-to-end example with memory comparison, update Readme  - [x] Non-breaking change (fix or new feature that would not break existing functionality). - [ ] Breaking change (fix or new feature that would cause existing functionality to change). - [ ] New tests added to cover the changes. - [ ] Quick tests passed locally by running `./runtest.sh`. - [ ] In-line docstrings updated. - [ ] Documentation updated. Cleanup Bogus Errors (#3190) Removed several bogus errors: 1. Error processing frame: RuntimeError: cannot schedule new futures after shutdown This can happen when the message arrives for a cell already being shutdown. Changed it to debug. 2. Logical Error: Endpoint is already removed This is actually not an error. The weak-ref set entry can be removed when the ref is gone. Changed it to debug. 3. Added name to several threads to help debugging the dangling thread issue.  - [x] Non-breaking change (fix or new feature that would not break existing functionality). - [ ] Breaking change (fix or new feature that would cause existing functionality to change). - [ ] New tests added to cover the changes. - [ ] Quick tests passed locally by running `./runtest.sh`. - [ ] In-line docstrings updated. - [ ] Documentation updated. 
Co-authored-by: Ziyue Xu

Apply automated document enhancement modifications (#3165)

Applies the more straightforward automated document enhancement modifications that are the same as what was merged in the 2.5 branch.

- [x] Non-breaking change.

Co-authored-by: Ziyue Xu
Co-authored-by: Chester Chen

Add P2P distributed optimization to advanced examples (#3189)

This PR adds a new set of advanced examples in `examples/advanced/distributed_optimization`, showing how to use the lower-level APIs to build P2P distributed optimization algorithms.

- [x] Non-breaking change.

Co-authored-by: Holger Roth
Co-authored-by: Chester Chen

Separate DockerBuilder and DockerLauncherBuilder (#3186)

Separate DockerBuilder and DockerLauncherBuilder. Fix **nvflare poc prepare -d 'nvflare/nvflare'** by keeping the DockerBuilder as before.

- [x] Non-breaking change.

Support relay - Part 1 (#3198)

This PR implements support for using relays in building the NVFLARE cellnet. Relays are nodes in the cellnet that are used only for routing messages. They do not have any learning functions.

This is Part 1 of the relay support. It does not include provisioning functions for creating startup kits for relays; these will be done in future PRs.

Just like regular clients, relays register with the Server when they are started. Once a relay has registered successfully, the Server sends it an auth token and signature.

Relay nodes also perform message authentication: relays validate auth headers for all messages going through them. Since all cross-site messages must go through either the Server or relays (or both), enforcing message authentication at both the Server and the relays ensures that no cross-site message can get through without valid auth headers.

Currently, peers of a CellPipe can only communicate via the Server. This PR includes an enhancement that allows peers to communicate via CP or Relay nodes, which can make peer-to-peer communication more efficient.

- [x] Non-breaking change.

self-paced-training: chapter 3 (#3160)

- [x] Non-breaking change.

Co-authored-by: Ziyue Xu

Add self-paced-training tutorial readme (#3202)

- [x] Non-breaking change.

Add content for survival analysis for DLI (#3204)

Add content for KM survival analysis for DLI.

- [x] Non-breaking change.

add missing init file (#3206)

- [x] Non-breaking change.

Co-authored-by: Ziyue Xu

Add init file for log sender and receiver (#3207)

- [x] Non-breaking change.

Add content for logistic regression for DLI (#3209)

Add content for logistic regression for DLI.

- [x] Non-breaking change.

Convert learner+template to clientAPI+jobAPI for nlp example (#3200)

Update the NLP_NER example by converting learner + job template to clientAPI + jobAPI, for the tutorial.

- [x] Non-breaking change.

Add lightning to FL content for DLI (#3208)

Add PyTorch Lightning to FL content for DLI.

- [x] Non-breaking change.

Add shutdown method to Flare Client API (#3152)

- The Flare API does not close the Flare agent and CellPipe resources.
- LauncherExecutor does not shut down its ThreadPoolExecutor.
- SubprocessLauncher has a race condition when multiple threads try to stop/start the process and monitor the process.
- Adds the shutdown functionality to the Flare API.
- In addition, introduces the context concept to prepare for a future in which multiple connections may need to be made.
- Existing API usage is still compatible.
- Makes corresponding changes to the tracking and lightning APIs.
- Adds shutdown of the ThreadPoolExecutor in LauncherExecutor.
- Adds a lock in SubprocessLauncher.

New API Usage:

```
import nvflare.client as flare
from nvflare.client import FLModel

with flare.init() as ctx:
    input_model = flare.receive(ctx=ctx)
    # do some training
    flare.send(FLModel(xxx), ctx=ctx)
```

- [x] Non-breaking change.

Add 8 bit to quantization data type (#3213)

Fixes FLARE-2374. The QA report uses an 8-bit input, which was previously not included in the valid data types. Now that 4-bit schemes are supported, it makes sense to add 8-bit to the original data types.

- [x] Non-breaking change.

Fix filter job (#3211)

Currently, adding multiple filters causes the Job API to fail. This PR fixes the issue.

A filter can be added to a set of tasks (e.g. ["train", "validate"]). All task sets must be unique, meaning that no two task sets may intersect. For example, if you add a filter X to task set ["train", "validate"], you cannot add another filter Y to task set ["train", "eval"], because the two sets share the element "train". If each task set contains a single task, it is fine to add any filters.

You can add any number of filters to the same task set.
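The task-set rule described for #3211 can be illustrated with a small sketch. It is not the Job API itself: `FilterRegistry`, its methods, and the filter names are hypothetical, and the check simply encodes the stated rule that two different task sets may not share a task while one task set may carry any number of filters.

```python
from typing import Dict, FrozenSet, Iterable, List


class FilterRegistry:
    """Hypothetical registry that enforces non-intersecting task sets."""

    def __init__(self) -> None:
        self._filters: Dict[FrozenSet[str], List[str]] = {}

    def add(self, filter_name: str, tasks: Iterable[str]) -> None:
        task_set = frozenset(tasks)
        for existing in self._filters:
            # Identical task sets are fine; partially overlapping ones are not.
            if existing != task_set and existing & task_set:
                shared = sorted(existing & task_set)
                raise ValueError(
                    f"task set {sorted(task_set)} overlaps {sorted(existing)} on {shared}"
                )
        self._filters.setdefault(task_set, []).append(filter_name)

    def filters_for(self, tasks: Iterable[str]) -> List[str]:
        """Return the filters registered for exactly this task set."""
        return list(self._filters.get(frozenset(tasks), []))


registry = FilterRegistry()
registry.add("X", ["train", "validate"])
registry.add("Z", ["validate", "train"])  # same task set, so this is allowed
try:
    registry.add("Y", ["train", "eval"])  # shares "train" with the set above
    rejected = False
except ValueError:
    rejected = True
```

Using a `frozenset` as the key makes the order of tasks irrelevant, so ["train", "validate"] and ["validate", "train"] count as the same task set.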
- [x] Non-breaking change.

Co-authored-by: Chester Chen

Add statistics content for DLI (#3212)

Add statistics content for DLI.

- [x] Non-breaking change.

Co-authored-by: Chester Chen

Add content for experiment tracking for DLI (#3203)

Add content for experiment tracking for DLI.

- [x] Non-breaking change.

Support relay provision (#3215)

This PR adds the following features and enhancements.

Relay Provision

Modified the Provisioner to support provisioning of relays. The main properties of a relay are "connect_to" and "listening_host". The "connect_to" property specifies the parent site of the relay, which is either the server or another relay. A relay must allow other relays or clients to connect to it, hence it must have a listener. The "listening_host" property specifies the host name(s) and port number that the relay will listen on.

Enhanced "connect_to" implementation

The "connect_to" property was already introduced to support multiple host names for the server, but at that time it was only used for connecting to the server. This PR generalizes it so that it can also be used for connecting to relays. It can take two formats: string or dict. The string format is for backward compatibility (connecting to the server), whereas the dict format lets you specify the parent name (which is the name of either the server or a relay), host, port, and connection security.

Enhanced "listening_host" implementation

The "listening_host" property was already used to specify the properties of a site's "internal connection" (which is used for connecting cell children). It was only used for connecting jobs (CJ or SJ). It is now generalized so that it can also be used for connecting relays. Since relays may be connected over the internet (instead of locally, as for jobs), it must be able to support connections from external networks, and hence can have multiple host names, a default host, a port number, a scheme, and connection security.

Client Hierarchy

To support mobile applications at massive scale, CPs will need to be organized hierarchically to allow more efficient message processing and aggregation. Each client site can now specify a "parent" client.

Note that the client hierarchy is not the same as the relay hierarchy, though the two should be closely related. The relay hierarchy guarantees message paths between any nodes. The client hierarchy should be very similar to the relay hierarchy, though it does not have to be.

Other fixes

The Docker Launch builder also requires the use of the internal connection. This PR consolidates this use case and the relay use case into the general-purpose "listening_host" mechanism. This PR also fixes some errors in the Docker launcher implementation.

- [x] Non-breaking change.

add scl example test accuracy metrics update task fitting example to use 8m update figure description fix validation label tokenizer scl launch once = False working use prepare tap data for label_name feature test with step lr reduce adjust learning rate decay test scl example clean previous checkpoints use shared label tokenizer rebuild label tokenizer sabdab classification tap regression update scl results tap regression task exclude vars filter finalize nb add new figs include figures in nb

nvkevlu and others added 2 commits January 22, 2025 12:39

apply automated document enhancement modifications

5ae08f6

Merge branch 'main' into main_automated_doc_enhancement

2937da3

nvkevlu and others added 3 commits January 23, 2025 10:02

Merge branch 'main' into main_automated_doc_enhancement

be6a423

Merge branch 'main' into main_automated_doc_enhancement

1394e41

Merge branch 'main' into main_automated_doc_enhancement

713190b

nvkevlu requested review from chesterxgchen and YuanTingHsieh January 27, 2025 19:22

chesterxgchen approved these changes Jan 27, 2025

nvkevlu and others added 2 commits January 28, 2025 11:50

Merge branch 'main' into main_automated_doc_enhancement

0f44f6f

Merge branch 'main' into main_automated_doc_enhancement

024c918

YuanTingHsieh reviewed Jan 30, 2025

Merge branch 'main' into main_automated_doc_enhancement

a02a9df

chesterxgchen enabled auto-merge (squash) February 2, 2025 06:52

chesterxgchen merged commit 0329d12 into NVIDIA:main Feb 2, 2025
20 checks passed
