
Commit

Merge remote-tracking branch 'upstream/releases/2021/4' into feature/azaytsev/mo-devguide-changes
andrew-zaytsev committed Jun 28, 2021
2 parents a4f2192 + c40da68 commit 49d6708
Showing 44 changed files with 680 additions and 145 deletions.
2 changes: 1 addition & 1 deletion README.md
@@ -42,7 +42,7 @@ Please report questions, issues and suggestions using:
---
\* Other names and brands may be claimed as the property of others.

[Open Model Zoo]:https://github.com/opencv/open_model_zoo
[Open Model Zoo]:https://github.com/openvinotoolkit/open_model_zoo
[Inference Engine]:https://software.intel.com/en-us/articles/OpenVINO-InferEngine
[Model Optimizer]:https://software.intel.com/en-us/articles/OpenVINO-ModelOptimizer
[nGraph]:https://docs.openvinotoolkit.org/latest/openvino_docs_nGraph_DG_DevGuide.html
36 changes: 36 additions & 0 deletions docs/IE_DG/API_Changes.md
@@ -10,10 +10,14 @@ The sections below contain a detailed list of changes made to the Inference Engine

### Deprecated API

**InferenceEngine::Parameter**

* InferenceEngine::Parameter(const std::shared_ptr<ngraph::Variant>&)
* InferenceEngine::Parameter(std::shared_ptr<ngraph::Variant>& var)
* std::shared_ptr<ngraph::Variant> InferenceEngine::Parameter::asVariant() const
* InferenceEngine::Parameter::operator std::shared_ptr<ngraph::Variant>() const

**GPU plugin configuration keys**
* KEY_CLDNN_NV12_TWO_INPUTS GPU plugin option. Use KEY_GPU_NV12_TWO_INPUTS instead
* KEY_CLDNN_PLUGIN_PRIORITY GPU plugin option. Use KEY_GPU_PLUGIN_PRIORITY instead
* KEY_CLDNN_PLUGIN_THROTTLE GPU plugin option. Use KEY_GPU_PLUGIN_THROTTLE instead
@@ -24,6 +28,38 @@ The sections below contain a detailed list of changes made to the Inference Engine
* KEY_TUNING_MODE GPU plugin option
* KEY_TUNING_FILE GPU plugin option

**InferenceEngine::IInferRequest**
* The IInferRequest interface is deprecated, use the InferRequest wrapper:
  * Constructor for InferRequest from IInferRequest::Ptr is deprecated
  * Cast operator for InferRequest to IInferRequest shared pointer is deprecated

**InferenceEngine::ICNNNetwork**
* The ICNNNetwork interface is deprecated through deprecation of all its methods, use the CNNNetwork wrapper
* CNNNetwork methods working with ICNNNetwork are deprecated:
  * Cast to ICNNNetwork shared pointer
  * Cast to reference to the ICNNNetwork interface
  * Constructor from ICNNNetwork shared pointer

**InferenceEngine::IExecutableNetwork**
* IExecutableNetwork is deprecated, use the ExecutableNetwork wrapper:
  * Constructor of ExecutableNetwork from IExecutableNetwork shared pointer is deprecated
  * The following ExecutableNetwork methods are deprecated:
    * ExecutableNetwork::reset
    * Cast operator to IExecutableNetwork shared pointer
    * ExecutableNetwork::CreateInferRequestPtr - use ExecutableNetwork::CreateInferRequest instead (see the sketch below)
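
For illustration, a minimal sketch of the wrapper-based flow might look as follows (the model path and device name are placeholders):

```cpp
#include <inference_engine.hpp>

int main() {
    InferenceEngine::Core ie;
    // Placeholder model and device; replace with your own
    InferenceEngine::CNNNetwork network = ie.ReadNetwork("model.xml");
    InferenceEngine::ExecutableNetwork executableNetwork = ie.LoadNetwork(network, "CPU");

    // Deprecated: obtaining a shared pointer to the request
    // InferenceEngine::InferRequest::Ptr requestPtr = executableNetwork.CreateInferRequestPtr();

    // Preferred: work with the InferRequest wrapper directly
    InferenceEngine::InferRequest request = executableNetwork.CreateInferRequest();
    request.Infer();
    return 0;
}
```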

**Extensions API**
* InferenceEngine::make_so_pointer, which was used to create the extension library, is replaced by std::make_shared<Extension>(..) (see the sketch below)
* InferenceEngine::IExtension::Release is deprecated with no replacement
* Use the IE_DEFINE_EXTENSION_CREATE_FUNCTION helper macro instead of an explicit declaration of the CreateExtension function, which creates the extension
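
A sketch of this migration for loading a custom extension library is shown below (the library file name is a placeholder, and the deprecated call is kept only as a comment):

```cpp
#include <inference_engine.hpp>
#include <ie_extension.h>

#include <memory>

int main() {
    InferenceEngine::Core ie;

    // Deprecated: creating the extension through make_so_pointer
    // auto extension = InferenceEngine::make_so_pointer<InferenceEngine::IExtension>("libcustom_extension.so");

    // Preferred: construct the Extension wrapper directly and register it in the Core
    auto extension = std::make_shared<InferenceEngine::Extension>("libcustom_extension.so");
    ie.AddExtension(extension);
    return 0;
}
```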

**Other changes**
* Version::ApiVersion structure is deprecated; Inference Engine no longer has an API version
* LowLatency - use lowLatency2 instead
* CONFIG_KEY(DUMP_EXEC_GRAPH_AS_DOT) - use InferenceEngine::ExecutableNetwork::GetExecGraphInfo::serialize() instead
* Core::ImportNetwork with no device - pass the device name explicitly
* details::InferenceEngineException - use InferenceEngine::Exception and its derivatives instead (see the sketch below)
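
As an example of the last item, a sketch of the updated error handling might look like this (the model path and device name are placeholders):

```cpp
#include <inference_engine.hpp>

#include <iostream>

int main() {
    InferenceEngine::Core ie;
    try {
        auto network = ie.ReadNetwork("model.xml");               // placeholder model path
        auto executableNetwork = ie.LoadNetwork(network, "CPU");  // placeholder device
    } catch (const InferenceEngine::Exception& ex) {              // replaces details::InferenceEngineException
        std::cerr << "Inference Engine error: " << ex.what() << std::endl;
        return 1;
    }
    return 0;
}
```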

## 2021.3

### New API
6 changes: 6 additions & 0 deletions docs/IE_DG/Intro_to_Performance.md
@@ -31,6 +31,12 @@ input images to achieve optimal throughput. However, high batch size also comes
latency penalty. So, for more real-time oriented usages, lower batch sizes (as low as a single input) are used.
Refer to the [Benchmark App](../../inference-engine/samples/benchmark_app/README.md) sample, which allows latency vs. throughput measuring.

## Using Caching API for first inference latency optimization
Since the 2021.4 release, Inference Engine provides the ability to enable internal caching of loaded networks.
This can significantly reduce load-network latency for some devices at application startup.
Internally, caching uses the plugin's Export/ImportNetwork flow, as the [Compile tool](../../inference-engine/tools/compile_tool/README.md) does, while the application keeps using the regular ReadNetwork/LoadNetwork API.
Refer to the [Model Caching Overview](Model_caching_overview.md) for a more detailed explanation.

## Using Async API
To gain better performance on accelerators, such as VPU, the Inference Engine uses the asynchronous approach (see
[Integrating Inference Engine in Your Application (current API)](Integrate_with_customer_application_new_API.md)).
65 changes: 65 additions & 0 deletions docs/IE_DG/Model_caching_overview.md
@@ -0,0 +1,65 @@
# Model Caching Overview {#openvino_docs_IE_DG_Model_caching_overview}

## Introduction

As described in [Inference Engine Introduction](inference_engine_intro.md), a common application flow consists of the following steps:

1. **Create Inference Engine Core object**

2. **Read the Intermediate Representation** - Read an Intermediate Representation file into an `InferenceEngine::CNNNetwork` object

3. **Prepare inputs and outputs**

4. **Set configuration** - Pass device-specific loading configurations to the device

5. **Compile and Load Network to device** - Use the `InferenceEngine::Core::LoadNetwork()` method with a specific device

6. **Set input data**

7. **Execute**

Step #5 can potentially perform several time-consuming device-specific optimizations and network compilations,
and such delays can lead to a bad user experience on application startup. To avoid this, some devices offer
the Import/Export network capability, and it is possible to either use the [Compile tool](../../inference-engine/tools/compile_tool/README.md)
or enable model caching to export the compiled network automatically. Reusing cached networks can significantly reduce load network time.


## Set "CACHE_DIR" config option to enable model caching

To enable model caching, the application must specify the folder where cached blobs will be stored:


@snippet snippets/InferenceEngine_Caching0.cpp part0

With this code, if the device supports the Import/Export network capability, a cached blob is automatically created inside the `myCacheFolder` folder specified by the CACHE_DIR config set on the Core object. If the device does not support the Import/Export capability, the cache is simply not created and no error is thrown.
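
Since the snippet above is included from the repository, a minimal sketch of the same idea is shown below (the model path and device name are placeholders; `myCacheFolder` matches the folder mentioned above):

```cpp
#include <inference_engine.hpp>

int main() {
    InferenceEngine::Core ie;
    // Enable model caching: compiled blobs are stored in and reloaded from "myCacheFolder"
    ie.SetConfig({{CONFIG_KEY(CACHE_DIR), "myCacheFolder"}});

    auto network = ie.ReadNetwork("model.xml");  // placeholder model path
    // The first call compiles the network and, if the device supports Import/Export,
    // exports it to the cache; subsequent calls import the cached blob instead of recompiling
    auto executableNetwork = ie.LoadNetwork(network, "GPU");  // placeholder device
    return 0;
}
```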

Depending on your device, the total time for loading a network on application startup can be significantly reduced.
Note that the very first LoadNetwork (when the cache is not yet created) takes slightly longer, as it also exports the compiled blob into a cache file.
![caching_enabled]

## Even faster: use LoadNetwork(modelPath)

In some cases, applications do not need to customize inputs and outputs every time. Such applications always
call `cnnNet = ie.ReadNetwork(...)` and then `ie.LoadNetwork(cnnNet, ..)`, and this flow can be further optimized.
For such cases, the 2021.4 release introduces a more convenient API that loads the network in a single call.

@snippet snippets/InferenceEngine_Caching1.cpp part1

With model caching enabled, the total load time is even smaller, because ReadNetwork is optimized as well:

@snippet snippets/InferenceEngine_Caching2.cpp part2
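
Since both snippets above are included from the repository, a combined sketch of the cached one-call flow might look like this (the cache folder, model path, and device name are placeholders):

```cpp
#include <inference_engine.hpp>

int main() {
    InferenceEngine::Core ie;
    ie.SetConfig({{CONFIG_KEY(CACHE_DIR), "myCacheFolder"}});  // enable caching

    // Read and load in a single call; on a cache hit the compiled network may be
    // imported directly, so even the ReadNetwork step can be skipped internally
    auto executableNetwork = ie.LoadNetwork("model.xml", "GPU");
    return 0;
}
```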

![caching_times]


## Advanced examples

Not every device supports the network import/export capability, and enabling caching for such devices has no effect.
To check in advance whether a particular device supports model caching, your application can use the following code:

@snippet snippets/InferenceEngine_Caching3.cpp part3
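
A sketch of such a check is shown below (it assumes the device reports the IMPORT_EXPORT_SUPPORT metric; the device name is a placeholder):

```cpp
#include <inference_engine.hpp>

#include <algorithm>
#include <string>
#include <vector>

int main() {
    InferenceEngine::Core ie;
    const std::string deviceName = "GPU";  // placeholder device

    // Query the list of metrics the device supports
    std::vector<std::string> supportedMetrics =
        ie.GetMetric(deviceName, METRIC_KEY(SUPPORTED_METRICS));

    // The device can be used with model caching if it reports IMPORT_EXPORT_SUPPORT as true
    bool cachingSupported = false;
    if (std::find(supportedMetrics.begin(), supportedMetrics.end(),
                  METRIC_KEY(IMPORT_EXPORT_SUPPORT)) != supportedMetrics.end()) {
        cachingSupported = ie.GetMetric(deviceName, METRIC_KEY(IMPORT_EXPORT_SUPPORT));
    }
    return cachingSupported ? 0 : 1;
}
```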


[caching_enabled]: ../img/caching_enabled.png
[caching_times]: ../img/caching_times.png
3 changes: 3 additions & 0 deletions docs/IE_DG/img/applying_low_latency_2.png
3 changes: 3 additions & 0 deletions docs/IE_DG/img/llt2_use_const_initializer.png
130 changes: 128 additions & 2 deletions docs/IE_DG/network_state_intro.md
@@ -209,9 +209,135 @@ Descriptions can be found in [Samples Overview](./Samples_Overview.md)
[state_network_example]: ./img/state_network_example.png
## LowLatency Transformation
## LowLatency Transformations
If the original framework does not have a special API for working with states, after importing the model, OpenVINO representation will not contain Assign/ReadValue layers. For example, if the original ONNX model contains RNN operations, IR will contain TensorIterator operations and the values will be obtained only after the execution of whole TensorIterator primitive, intermediate values from each iteration will not be available. To be able to work with these intermediate values of each iteration and receive them with a low latency after each infer request, a special LowLatency transformation was introduced.
If the original framework does not have a special API for working with states, then after importing the model, the OpenVINO representation will not contain Assign/ReadValue layers. For example, if the original ONNX model contains RNN operations, the IR will contain TensorIterator operations and the values will be obtained only after execution of the whole TensorIterator primitive; intermediate values from each iteration will not be available. To enable working with these intermediate values of each iteration and receiving them with low latency after each infer request, special LowLatency and LowLatency2 transformations were introduced.
### How to get TensorIterator/Loop operations from different frameworks via Model Optimizer
**ONNX and frameworks supported via ONNX format:** *LSTM, RNN, GRU* original layers are converted to the TensorIterator operation; the TensorIterator body contains LSTM/RNN/GRU Cell operations. Peephole and InputForget modifications are not supported; the sequence_lengths optional input is supported.
The *ONNX Loop* layer is converted to the OpenVINO Loop operation.
**MXNet:** *LSTM, RNN, GRU* original layers are converted to the TensorIterator operation; the TensorIterator body contains LSTM/RNN/GRU Cell operations.
**TensorFlow:** *BlockLSTM* is converted to the TensorIterator operation; the TensorIterator body contains the LSTM Cell operation. Peephole and InputForget modifications are not supported.
The *While* layer is converted to TensorIterator; the TensorIterator body can contain any supported operations. However, dynamic cases, where the number of iterations cannot be calculated at shape inference (Model Optimizer conversion) time, are not supported.
**TensorFlow2:** The *While* layer is converted to the Loop operation. The Loop body can contain any supported operations.
**Kaldi:** Kaldi models already contain Assign/ReadValue (Memory) operations after model conversion. TensorIterator/Loop operations are not generated.
## LowLatency2
The LowLatency2 transformation changes the structure of a network containing [TensorIterator](../ops/infrastructure/TensorIterator_1.md) and [Loop](../ops/infrastructure/Loop_5.md) by adding the ability to work with state, inserting Assign/ReadValue layers as shown in the picture below.
### The differences between LowLatency and LowLatency2:
* Unrolling of TensorIterator/Loop operations became a part of LowLatency2, not a separate transformation. After invoking the transformation, the network can be serialized and inferred without re-invoking the transformation.
* Added support for TensorIterator and Loop operations with multiple iterations inside. TensorIterator/Loop will not be unrolled in this case.
* Resolved the ‘Parameters connected directly to ReadValues’ limitation. To apply the previous version of the transformation in this case, additional manual manipulations were required; now this case is processed automatically.
#### Example of applying LowLatency2 transformation:
![applying_low_latency_2_example](./img/applying_low_latency_2.png)
After applying the transformation, ReadValue operations can receive other operations as an input, as shown in the picture above. These inputs should set the initial value for initialization of ReadValue operations. However, such initialization is not supported in the current State API implementation. Input values are ignored and the initial values for the ReadValue operations are set to zeros unless otherwise specified by the user via [State API](#openvino-state-api).
### Steps to apply LowLatency2 Transformation
1. Get CNNNetwork. Either way is acceptable:
* [from IR or ONNX model](./Integrate_with_customer_application_new_API.md)
* [from nGraph Function](../nGraph_DG/build_function.md)
2. Change the number of iterations inside TensorIterator/Loop nodes in the network using the [Reshape](ShapeInference.md) feature.
For example, if the *sequence_lengths* dimension of the network input is greater than 1, the TensorIterator layer has number_of_iterations > 1. You can reshape the network inputs to set the *sequence_lengths* dimension to exactly 1.
```cpp
// Network before reshape: Parameter (name: X, shape: [2 (sequence_lengths), 1, 16]) -> TensorIterator (num_iteration = 2, axis = 0) -> ...
cnnNetwork.reshape({{"X", {1, 1, 16}}});
// Network after reshape: Parameter (name: X, shape: [1 (sequence_lengths), 1, 16]) -> TensorIterator (num_iteration = 1, axis = 0) -> ...
```
**Unrolling**: If the LowLatency2 transformation is applied to a network containing TensorIterator/Loop nodes with exactly one iteration inside, these nodes are unrolled; otherwise, the nodes remain as they are. Please see [the picture](#example-of-applying-lowlatency2-transformation) for more details.

3. Apply LowLatency2 transformation
```cpp
#include "ie_transformations.hpp"

...

InferenceEngine::lowLatency2(cnnNetwork); // 2nd argument 'use_const_initializer = true' by default
```
**Use_const_initializer argument**
By default, the LowLatency2 transformation inserts a constant subgraph of the same shape as the previous input node, with zero values, as the initializing value for ReadValue nodes; see the picture below. You can disable insertion of this subgraph by passing `false` as the `use_const_initializer` argument.
```cpp
InferenceEngine::lowLatency2(cnnNetwork, false);
```

![use_const_initializer_example](./img/llt2_use_const_initializer.png)

**State naming rule:** the name of a state is a concatenation of names: the original TensorIterator operation, the Parameter of the body, and an additional suffix "variable_" + id (0-based indexing, new indexing for each TensorIterator). You can use these rules to predict the name of the inserted state after the transformation is applied. For example:
```cpp
// Precondition in ngraph::function.
// Created TensorIterator and Parameter in body of TensorIterator with names
std::string tensor_iterator_name = "TI_name";
std::string body_parameter_name = "param_name";
std::string idx = "0"; // it is the first variable in the network

// The State will be named "TI_name/param_name/variable_0"
auto state_name = tensor_iterator_name + "/" + body_parameter_name + "/" + "variable_" + idx;

InferenceEngine::CNNNetwork cnnNetwork = InferenceEngine::CNNNetwork{function};
InferenceEngine::lowLatency2(cnnNetwork);

InferenceEngine::ExecutableNetwork executableNetwork = core->LoadNetwork(/*cnnNetwork, targetDevice, configuration*/);

// Try to find the Variable by name
auto states = executableNetwork.QueryState();
for (auto& state : states) {
auto name = state.GetName();
if (name == state_name) {
// some actions
}
}
```
4. Use the state API. See the sections [OpenVINO state API](#openvino-state-api) and [Example of stateful network inference](#example-of-stateful-network-inference).
### Known Limitations
1. Unable to execute [Reshape](ShapeInference.md) to change the number of iterations of TensorIterator/Loop layers, which is required to apply the transformation correctly.
The only way to change the number of iterations of a TensorIterator/Loop layer is to use the Reshape feature, but some networks are non-reshapable; the most common reason is that shape values are hardcoded in a constant somewhere in the network.
![low_latency_limitation_2](./img/low_latency_limitation_2.png)
**Current solution:** Trim non-reshapable layers via the [Model Optimizer CLI](../MO_DG/prepare_model/convert_model/Converting_Model_General.md) options `--input` and `--output`. For example, the parameter and the problematic constant in the picture above can be trimmed using the command-line option
`--input Reshape_layer_name`. The problematic constant can also be replaced using nGraph, as shown in the example below.
```cpp
// nGraph example: how to replace a Constant with hardcoded shape values with a new Constant holding the correct values.
// Assume we know which Constant (const_with_hardcoded_shape) prevents the reshape from being applied.
// We can then find this Constant by name in the network and replace it with a new one that has the correct shape.
auto func = cnnNetwork.getFunction();
// Creating the new Constant with a correct shape.
// For the example shown in the picture above, the new values of the Constant should be 1, 1, 10 instead of 1, 49, 10
auto new_const = std::make_shared<ngraph::opset6::Constant>( /*type, shape, value_with_correct_shape*/ );
for (const auto& node : func->get_ops()) {
// Trying to find the problematic Constant by name.
if (node->get_friendly_name() == "name_of_non_reshapable_const") {
auto const_with_hardcoded_shape = std::dynamic_pointer_cast<ngraph::opset6::Constant>(node);
// Replacing the problematic Constant with a new one. Do this for all the problematic Constants in the network, then
// you can apply the reshape feature.
ngraph::replace_node(const_with_hardcoded_shape, new_const);
}
}
```
## [DEPRECATED] LowLatency

The LowLatency transformation changes the structure of a network containing [TensorIterator](../ops/infrastructure/TensorIterator_1.md) and [Loop](../ops/infrastructure/Loop_5.md) by adding the ability to work with state, inserting Assign/ReadValue layers as shown in the picture below.

