SmartRedis Integration Guide #214

Merged: 5 commits, Jun 22, 2022
1 change: 1 addition & 0 deletions doc/index.rst
@@ -40,6 +40,7 @@
:caption: SmartRedis

smartredis
sr_integration
sr_python_walkthrough
sr_cpp_walkthrough
sr_fortran_walkthrough
229 changes: 229 additions & 0 deletions doc/sr_integration.rst
@@ -0,0 +1,229 @@
*****************************
Integrating into a Simulation
*****************************

========
Overview
========

This document provides general guidelines for integrating the SmartRedis client
into an existing simulation codebase. Developers of these codebases will need to
identify the exact places to add the code; generally, SmartRedis calls need to
be added in only two places:

1. Initialization
2. Main loop

==============
Initialization
==============

+++++++++++++++++++
Creating the client
+++++++++++++++++++

The SmartRedis client must be initialized before it can be used to communicate
with the orchestrator. In the C++ and Python clients, this is done when
constructing a new client object. In the C and Fortran clients, an `initialize`
method must be called.

C++::

    #include "client.h"

    SmartRedis::Client client(use_cluster);

Python::

    from smartredis import Client

    client = Client(use_cluster)

Fortran::

    use smartredis_client, only : client_type
    include "enum_fortran.inc"

    type(client_type) :: client
    integer :: return_code

    return_code = client%initialize(use_cluster)
    if (return_code /= SRNoError) stop 'Error in initialization'

C::

    #include "client.h"
    #include "sr_enums.h"

    void* client = NULL;
    SRError return_code = SmartRedisCClient(use_cluster, &client);
    if (return_code != SRNoError) {
        return -1;
    }

All of these initializers have only one configurable parameter, indicated in
the examples above by the variable `use_cluster`. If this parameter is true,
the client expects to communicate with an orchestrator composed of three or
more shards.
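As a minimal sketch of how this flag might be chosen at runtime rather than
hard-coded (the environment-variable name and helper function here are
assumptions for illustration, not part of the SmartRedis API):

```python
import os

def cluster_flag(env_var: str = "SIM_USE_CLUSTER") -> bool:
    # Hypothetical helper: decide whether the orchestrator is clustered
    # from an environment variable, so the same executable can run against
    # both clustered and standalone deployments.
    return os.environ.get(env_var, "false").lower() in ("1", "true", "yes")

use_cluster = cluster_flag()
```

The flag could equally come from a command-line argument or an input file;
the point is simply to avoid recompiling when the deployment changes.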

++++++++++++++++++++++++++++++++++++++++++
(Parallel Programs): Creating unique names
++++++++++++++++++++++++++++++++++++++++++

For parallel applications, each rank or thread that communicates with the
orchestrator will likely need a unique prefix for names, to prevent one rank or
thread from inadvertently overwriting another's data. This prefix should be used
when creating the name of any tensor, dataset, or model that must be unique to a
given rank. (Note: for models run within SmartSim, additional prefixing may be
done by the client when running an ensemble and/or using multiple data sources.)
Any identifier can be used, though the MPI rank number (or an equivalent
identifier) is typically a convenient, unique choice.

C++::

    #include <iomanip>
    #include <sstream>

    std::ostringstream prefix_stream;
    prefix_stream << std::setw(6) << std::setfill('0') << rank_id << "_";
    const std::string name_prefix = prefix_stream.str();

Python::

    name_prefix = f"{rank_id:06d}_"

Fortran::

    character(len=7) :: name_prefix
    write(name_prefix,'(I6.6,A)') rank_id, '_'

C::

    char name_prefix[8];
    snprintf(name_prefix, sizeof(name_prefix), "%06d_", rank_id);
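In every language the pattern is the same: a fixed-width, zero-padded rank
identifier, so that names from different ranks never collide and sort
lexicographically. A small Python sketch (the helper name is ours, not part of
SmartRedis):

```python
def make_name_prefix(rank_id: int, width: int = 6) -> str:
    # Zero-pad so every rank produces a prefix of identical length
    return f"{rank_id:0{width}d}_"

# Each rank gets a distinct namespace for its tensors
tensor_names = [make_name_prefix(r) + "temperature" for r in (0, 3, 42)]
```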

++++++++++++++++++++++++++
Storing scripts and models
++++++++++++++++++++++++++

The last task that typically needs to be done is to store models or scripts
that will be used later in the simulation. When using a clustered orchestrator,
this only needs to be done by one client (unless each rank requires a different
model or script). MPI rank 0 is often a convenient choice to set models and
scripts.

C++::

    if (root_client) {
        client.set_model_from_file(model_name, model_file, backend, device);
    }

Python::

    if root_client:
        client.set_model_from_file(model_name, model_file, backend, device)

Fortran::

    if (root_client) then
        return_code = client%set_model_from_file(model_name, model_file, backend, device)
        if (return_code /= SRNoError) stop 'Error setting model'
    end if

C::

    if (root_client) {
        return_code = set_model_from_file(client, model_name, model_file, backend, device);
        if (return_code != SRNoError) {
            return -1;
        }
    }

=========
Main loop
=========

Within the main loop of the code (e.g. every timestep or iteration of a solver),
the developer typically uses the SmartRedis client methods to implement a workflow which
may include receiving data, sending data, running a script or model, and/or
retrieving a result. These workflows are covered extensively in the walkthroughs
for the Fortran, C++, and Python clients, and in the integrations with MOM6, OpenFOAM,
LAMMPS, and others.

Generally though, developers are advised to:

1. Find locations where file I/O would normally happen and either augment
or replace code to use the SmartRedis client and store the data in the
orchestrator
2. Use the `name_prefix` created during initialization to avoid accidental
   overwrites or reads by other clients
3. Use the SmartRedis `DataSet` type when clients represent decomposed
   subdomains, to make retrieval and use of the data more performant
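To make the call pattern concrete, here is a minimal Python sketch of a
main-loop workflow. `FakeClient` is a dictionary-backed stand-in (not the real
SmartRedis client, whose methods also take tensor dimensions and return error
codes); it only illustrates the put/run/unpack sequence:

```python
class FakeClient:
    """Dictionary-backed stand-in for a SmartRedis client, used only to
    illustrate the put/run/unpack call pattern (not the real API)."""

    def __init__(self):
        self._tensors = {}
        self._models = {}

    def put_tensor(self, name, tensor):
        self._tensors[name] = tensor

    def set_model(self, name, fn):
        self._models[name] = fn

    def run_model(self, model_name, inputs, outputs):
        # Apply the "model" to the first input and store the result
        result = self._models[model_name](self._tensors[inputs[0]])
        self._tensors[outputs[0]] = result

    def unpack_tensor(self, name):
        return self._tensors[name]


prefix = "000000_"  # per-rank prefix from the initialization step
client = FakeClient()
# A trivial "preconditioner" standing in for a stored ML model
client.set_model("preconditioner", lambda field: [0.5 * x for x in field])

temperature = [300.0, 310.0, 320.0]
client.put_tensor(prefix + "temperature", temperature)
client.run_model("preconditioner",
                 inputs=[prefix + "temperature"],
                 outputs=[prefix + "temperature_out"])
temperature = client.unpack_tensor(prefix + "temperature_out")
# temperature is now [150.0, 155.0, 160.0]
```

The real clients follow the same shape; only the signatures differ by language.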

============
Full example
============

Contributor:
    Would the typical reader be more comfortable with C++ than Fortran?

Collaborator (Author):
    We've been fielding more queries from the Fortran dev community than the
    C++ one. We can always include more examples later on down the line, but
    for a first pass I wanted to keep it in one language.

The following pseudocode demonstrates various aspects of instrumenting an
existing simulation code with SmartRedis. This code is representative of solving
the time-evolving heat equation, but we will augment it using an ML model to
provide a preconditioning step each iteration and post the state of the
simulation to the orchestrator. ::

    program main

      ! Initialize the model, set up MPI communications, read input files
      call initialize_model(temperature, number_of_timesteps)

      main_loop: do i = 1, number_of_timesteps

        ! Write the current state of the simulation to a file
        call write_current_state(temperature)

        ! Call a time integrator to step the temperature field forward
        call timestep_simulation(temperature)

      enddo

    end program main

Following the guidelines from above, the first step is to initialize the client
and create a unique identifier for the given processor. This should be done
within roughly the same portion of the code where the rest of the model
performs the initialization of other components. ::

    ! Import SmartRedis modules
    use smartredis_client, only : client_type

    ! Declare a new client variable and a string used to build unique names
    type(client_type) :: smartredis_client
    character(len=7) :: name_prefix
    integer :: mpi_rank, mpi_code, smartredis_code

    ! Note: use_cluster is an additional runtime argument for SmartRedis
    call initialize_model(temperature, number_of_timesteps, use_cluster)
    smartredis_code = smartredis_client%initialize(use_cluster)
    call MPI_Comm_rank(MPI_COMM_WORLD, mpi_rank, mpi_code)
    ! Build the prefix for all tensors set by this rank
    write(name_prefix,'(I6.6,A)') mpi_rank, '_'

    ! All ranks use the same machine learning model, so there is no need to
    ! add the prefix to the model name
    if (mpi_rank == 0) smartredis_code = smartredis_client%set_model_from_file( &
        "example_model_name", "path/to/model.pt", "TORCH", "gpu")

Next, add the calls in the main loop to send the temperature to the orchestrator ::

    character(len=30), dimension(1) :: model_input, model_output
    integer :: return_code

    main_loop: do i = 1, number_of_timesteps

      ! Write the current state of the simulation to a file
      call write_current_state(temperature)

      ! Send the temperature field to the orchestrator
      model_input(1) = name_prefix//"temperature"
      model_output(1) = name_prefix//"temperature_out"
      return_code = smartredis_client%put_tensor(model_input(1), temperature)

      ! Run the machine learning model
      return_code = smartredis_client%run_model("example_model_name", model_input, model_output)
      ! The following line overwrites the prognostic temperature array
      return_code = smartredis_client%unpack_tensor(model_output(1), temperature)

      ! Call a time integrator to step the temperature field forward
      call timestep_simulation(temperature)

    enddo

This model will now use the client every timestep to put a
temperature array in the orchestrator, instruct the orchestrator to call
a machine learning model for prediction/inference, and unpack the resulting
inference into the existing temperature array. For more complex examples,
please see some of the integrations in the SmartSim Zoo or feel free to
contact the team at [email protected]
16 changes: 4 additions & 12 deletions tutorials/getting_started/getting_started.ipynb
@@ -13,10 +13,9 @@
" - Running and Communicating with the Orchestrator\n",
" - Ensembles using SmartRedis\n",
"\n",
"\n",
"## Experiments and Models \n",
"\n",
"`Experiment`s are how users define workflows in SmartSim. The `Experiment` is used to create `Model` instances which represent applications, scripts, or largely any program. An experiment can start and stop a `Model` and monitor execution.\n"
"`Experiment`s are how users define workflows in SmartSim. The `Experiment` is used to create `Model` instances which represent applications, scripts, or generally a program. An experiment can start and stop a `Model` and monitor execution.\n"
]
},
{
@@ -32,7 +31,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"The next step is to initialize an `Experiment` instance. The `Experiment` must be provided a name. This can be any string, but it is best practice to give it a meaningful name as a broad title for what types of models the experiment will be supervising. For our purposes, our `Experiment` will be named `\"getting-started\"`.\n",
"The next step is to initialize an `Experiment` instance. The `Experiment` must be provided a name. This name can be any string, but it is best practice to give it a meaningful name as a broad title for what types of models the experiment will be supervising. For our purposes, our `Experiment` will be named `\"getting-started\"`.\n",
"\n",
"The `Experiment` also needs to have a `launcher` specified. Launchers provide SmartSim the ability to construct and execute complex workloads on HPC systems with schedulers (workload managers) like Slurm, or PBS. SmartSim currently supports\n",
" * `slurm`\n",
@@ -65,7 +64,7 @@
"\n",
"Our first `Model` will simply print `hello` using the shell command `echo`.\n",
"\n",
"`Experiment.create_run_settings` is used to create a `RunSettings` instance for our `Model`. `RunSettings` help parameterize *how* a `Model` should be executed provided the system and available computational resources.\n",
"`Experiment.create_run_settings` is used to create a `RunSettings` instance for our `Model`. `RunSettings` describe *how* a `Model` should be executed provided the system and available computational resources.\n",
"\n",
"`create_run_settings` is a factory method that will instantiate a `RunSettings` object of the appropriate type based on the `run_command` argument (i.e. `mpirun`, `aprun`, `srun`, etc). The default argument of `auto` will attempt to choose a `run_command` based on the available system software and the launcher specified in the experiment. If `run_command=None` is provided, the command will be launched without one."
]
@@ -311,7 +310,7 @@
"source": [
"## Ensembles\n",
"\n",
"In the previous example, the two `Model` instances were created separately. There are more convenient ways of doing this, through `Ensemble`s. `Ensemble`s are groups of `Model` instances that can be treated as a single reference. We start by specifying `RunSettings` similar to how we did with our `Model`s."
"In the previous example, the two `Model` instances were created separately. The `Ensemble` SmartSim object is a more convenient way of setting up multiple models, potentially with different configurations. `Ensemble`s are groups of `Model` instances that can be treated as a single reference. We start by specifying `RunSettings` similar to how we did with our `Model`s."
]
},
{
@@ -1138,13 +1137,6 @@
"source": [
"exp.stop(db)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
Contributor:
    Can you explain why you deleted this stuff?

Collaborator (Author):
    It was an empty cell
}
],
"metadata": {
9 changes: 5 additions & 4 deletions tutorials/ml_inference/Inference-in-SmartSim.ipynb
@@ -9,7 +9,7 @@
"\n",
"This tutorial shows how to use trained PyTorch, TensorFlow, and ONNX (format) models, written in Python, directly in HPC workloads written in Fortran, C, C++ and Python.\n",
"\n",
"The examples simulation here is written in Python for brevity, however, the inference API in SmartRedis is the same (besides extra parameters for compiled langauges) across all clients. Examples comparing the usage of the same model across SmartRedis client langauges can be found (put in link).\n"
"The example simulation here is written in Python for brevity; however, the inference API in SmartRedis is the same (besides extra parameters for compiled languages) across all clients. \n"
]
},
{
@@ -503,7 +503,7 @@
"### Setting TensorFlow and Keras Models\n",
"\n",
"After a model is created (trained or not), the graph of the model is\n",
"frozen saved to file so the client method `client.set_model_from_file`\n",
"frozen and saved to file so the client method `client.set_model_from_file`\n",
"can load it into the database.\n",
"\n",
"SmartSim includes a utility to freeze the graph of a TensorFlow or Keras model in\n",
@@ -607,7 +607,7 @@
"\n",
"\n",
"K-means clustering is an unsupervised ML algorithm. It is used to categorize data points\n",
"into f groups (\"clusters\"). Scikit Learn has a built in implementation of K-means clustering\n",
"into functional groups (\"clusters\"). Scikit Learn has a built in implementation of K-means clustering\n",
"and it is easily converted to ONNX for use with SmartSim through \n",
"[skl2onnx.to_onnx](http://onnx.ai/sklearn-onnx/auto_examples/plot_convert_syntax.html)\n",
"\n",
@@ -769,7 +769,8 @@
"on the same compute hosts as a Model instance defined by the user. In this\n",
"deployment, the database is not connected together in a cluster and each shard\n",
"of the database is addressed individually by the processes running on that compute\n",
"host.\n",
"host. This is particularly important for GPU-intensive workloads which require\n",
"frequent communication with the database.\n",
"\n",
"<img src=\"https://www.craylabs.org/docs/_images/co-located-orc-diagram.png\" alt=\"lattice\" width=\"600\"/>\n"
]
24 changes: 13 additions & 11 deletions tutorials/online_analysis/lattice/online_analysis.ipynb

Large diffs are not rendered by default.