
[GPU] LSTMSequence and LSTMCell optimization #26767

Merged: 178 commits, Nov 26, 2024

Conversation

@michal-miotk (Contributor) commented Sep 24, 2024

Details:

  • create a single lstm_seq primitive so it runs faster than the previous approach, which composed many primitives
  • use oneDNN for the actual computation (see the sketch below)
  • based on commit c99ddc0 from 25732
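
For context, a minimal sketch of what creating such a fused oneDNN LSTM sequence primitive looks like (oneDNN 3.x API); all dimension values and the f16 data type below are illustrative assumptions, not the PR's actual code:

    #include <oneapi/dnnl/dnnl.hpp>
    using namespace dnnl;

    // Illustrative sizes: T=seq_len, N=batch, C=input size, H=hidden size,
    // L=1 layer, D=1 direction, G=4 LSTM gates.
    const memory::dim T = 16, N = 1, C = 64, H = 128, L = 1, D = 1, G = 4;

    engine eng(engine::kind::gpu, 0);
    auto dt = memory::data_type::f16;
    memory::desc src_layer({T, N, C}, dt, memory::format_tag::tnc);
    memory::desc src_iter({L, D, N, H}, dt, memory::format_tag::ldnc);
    memory::desc src_iter_c({L, D, N, H}, dt, memory::format_tag::ldnc);
    memory::desc w_layer({L, D, C, G, H}, dt, memory::format_tag::ldigo);
    memory::desc w_iter({L, D, H, G, H}, dt, memory::format_tag::ldigo);
    memory::desc bias({L, D, G, H}, dt, memory::format_tag::ldgo);
    memory::desc dst_layer({T, N, H}, dt, memory::format_tag::tnc);

    // One primitive processes the whole sequence, replacing the per-timestep
    // subgraph of many cldnn primitives.
    auto pd = lstm_forward::primitive_desc(eng, prop_kind::forward_inference,
        rnn_direction::unidirectional_left2right,
        src_layer, src_iter, src_iter_c, w_layer, w_iter, bias,
        dst_layer, src_iter, src_iter_c);
    auto lstm = lstm_forward(pd);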

Tickets:

  • 146601

commit 232d272f11fbe65e82fa9787260a8b9d34b57d20
Author: michal-miotk <[email protected]>
Date:   Mon Jul 29 11:17:47 2024 +0000

    wip

commit e642ca3
Author: michal-miotk <[email protected]>
Date:   Sun Jul 28 22:08:24 2024 +0000

    wip

commit c6b74d3
Author: michal-miotk <[email protected]>
Date:   Fri Jul 26 14:10:26 2024 +0000

    wip

commit 0451429
Author: michal-miotk <[email protected]>
Date:   Thu Jul 25 20:35:11 2024 +0000

    wip3

commit 1164592
Author: michal-miotk <[email protected]>
Date:   Tue Aug 6 09:25:45 2024 +0000

    wip

commit 8b2c049
Author: michal-miotk <[email protected]>
Date:   Tue Aug 6 09:24:02 2024 +0000

    wip

commit 886b412
Author: michal-miotk <[email protected]>
Date:   Mon Aug 5 14:59:14 2024 +0000

    wip

commit 08fb207
Author: michal-miotk <[email protected]>
Date:   Sun Aug 4 20:21:38 2024 +0000

    wip, errors on half

commit 125884d
Author: michal-miotk <[email protected]>
Date:   Sat Aug 3 23:59:58 2024 +0000

    wip

commit af4f209
Author: michal-miotk <[email protected]>
Date:   Fri Aug 2 17:58:38 2024 +0000

    wip

commit 12626fc
Author: michal-miotk <[email protected]>
Date:   Fri Aug 2 10:52:15 2024 +0000

    wip

commit dfdd052
Author: michal-miotk <[email protected]>
Date:   Thu Aug 1 15:38:41 2024 +0000

    wip

commit 54ee912
Author: michal-miotk <[email protected]>
Date:   Thu Aug 1 11:01:55 2024 +0000

    only bfyx layout

commit 240fe4a
Author: michal-miotk <[email protected]>
Date:   Thu Aug 1 10:34:45 2024 +0000

    two outputs from prim

commit bc775be
Author: michal-miotk <[email protected]>
Date:   Wed Jul 31 22:13:14 2024 +0000

    wip

commit d1cfd60
Author: michal-miotk <[email protected]>
Date:   Wed Jul 31 22:07:06 2024 +0000

    wip

commit 7d18884
Author: michal-miotk <[email protected]>
Date:   Wed Jul 31 19:19:04 2024 +0000

    begin of handling reverse

commit 39f64af
Author: michal-miotk <[email protected]>
Date:   Wed Jul 31 15:37:06 2024 +0000

    betterbetter

commit 67b3c9a
Author: michal-miotk <[email protected]>
Date:   Wed Jul 31 13:12:39 2024 +0000

    better

commit 6ded5aa
Author: michal-miotk <[email protected]>
Date:   Wed Jul 31 10:12:31 2024 +0000

    wip

commit 1ccdacc
Author: michal-miotk <[email protected]>
Date:   Tue Jul 30 23:07:21 2024 +0000

    wip

commit ab1307c
Author: michal-miotk <[email protected]>
Date:   Tue Jul 30 22:00:50 2024 +0000

    test passed

commit bc65969
Author: michal-miotk <[email protected]>
Date:   Tue Jul 30 15:37:20 2024 +0000

    wip

commit 03cbf57
Author: michal-miotk <[email protected]>
Date:   Tue Jul 30 15:15:06 2024 +0000

    only 2 outputs

commit fd5f3dc
Author: michal-miotk <[email protected]>
Date:   Tue Jul 30 14:47:12 2024 +0000

    wip

commit 939d23c
Author: michal-miotk <[email protected]>
Date:   Tue Jul 30 11:34:56 2024 +0000

    wip

commit 2bb561f
Author: michal-miotk <[email protected]>
Date:   Tue Jul 30 09:28:03 2024 +0000

    added to binary buffer

commit 1ef83ff
Author: michal-miotk <[email protected]>
Date:   Mon Jul 29 22:30:57 2024 +0000

    not works
std::vector<cldnn::activation_func> activations;
std::vector<cldnn::activation_additional_params> activation_params;
GetLSTMActivationParams(op, activations, activation_params);
float clip = op->get_clip();

assert(!inputs[5].pid.empty());
if (p.use_new_shape_infer()) {
Contributor:

I suggest replacing it with OPENVINO_ASSERT to ensure the method is called correctly

Contributor Author:

done
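
For reference, a hedged sketch of the OPENVINO_ASSERT form; the message strings are illustrative, not taken from the PR:

    // OPENVINO_ASSERT throws an ov::Exception with context instead of
    // aborting like plain assert(); the message text here is made up.
    OPENVINO_ASSERT(!inputs[5].pid.empty(), "[GPU] lstm_cell: missing bias input");
    OPENVINO_ASSERT(p.use_new_shape_infer(), "[GPU] lstm_cell requires new shape infer");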

op_mode, 1, axis, num_splits));
p.add_primitive(*op, cldnn::reshape(outputCellID, cldnn::input_info(outputCellCropID),
false, outSzPt, op->get_output_partial_shape(1)));
p.add_primitive(*op, cldnn::lstm_cell(layerName+".out0", inputs[0], inputs[1], inputs[2], inputs[3], inputs[4], inputs[5], \
Contributor:

Is it? I think item 2 is still relevant. You pass this layerName + "_md_write.1" argument, and the corresponding parameters from primitive API are still there.

@@ -278,6 +278,9 @@ ov::SupportedOpsMap Plugin::query_model(const std::shared_ptr<const ov::Model>&

ExecutionConfig config = m_configs_map.at(device_id);
config.set_user_property(orig_config);
if (ctx->get_engine().get_device_info().supports_immad) {
Contributor:

These two changes are not needed either.

Contributor Author:

done

p.add_primitive(*op, cldnn::crop(cellStr, cldnn::input_info(lstm_elt_id), hiddenSz, cellCropSz));
}
const float clip = op->get_clip();
if (op->get_input_shape(2).size() != 3 || op->get_input_shape(3).size() != 1 \
Contributor:

nit: also redundant backslashes here and in other places. Please remove those

Contributor Author:

done

p.add_primitive(*op, cldnn::reshape(layerName + ".out0", concatStr, tensor_from_dims(op->get_output_shape(0))), {layerName});
p.add_primitive(*op, cldnn::reshape(layerName + ".out1", hiddenStr, tensor_from_dims(op->get_output_shape(1))));
p.add_primitive(*op, cldnn::reshape(layerName + ".out2", cellStr, tensor_from_dims(op->get_output_shape(2))));
if (p.use_new_shape_infer()) {
Contributor:

OPENVINO_ASSERT here as well

Contributor Author:

done

public:
using parent::parent;

program_node& input() const { return get_dependency(0); }
Contributor:

Likely the same unused methods as for lstm_seq primitive

Contributor Author:

done

Comment on lines 30 to 44
std::vector<format::type> in_fmts(node.get_dependencies().size(), format::any);
std::vector<format::type> out_fmts(node.get_outputs_count(), format::any);

size_t out_rank = node.get_output_layout().get_rank();
for (size_t idx = 0 ; idx < node.get_dependencies().size() ; idx++) {
if (node.get_dependency(idx).is_constant())
continue;

auto target_format = format::get_default_format(out_rank);

in_fmts[idx] = target_format;
}
out_fmts[0] = format::ybfx;

return {in_fmts, out_fmts};
Contributor:

I think that code should actually query onednn for the required tensor formats (as it's done for convolutions). You can do it in the next PR

Contributor Author:

ok
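
A rough sketch of that follow-up, assuming helpers along the lines of the convolution impl for building the descriptor and translating dnnl formats back to cldnn (both helper names here are hypothetical):

    // Hypothetical: create the oneDNN primitive descriptor, then take the
    // formats oneDNN actually selected for each tensor.
    auto prim_desc = get_rnn_primitive_descriptor(node);  // assumed helper
    for (size_t idx = 0; idx < node.get_dependencies().size(); idx++) {
        if (node.get_dependency(idx).is_constant())
            continue;
        // assumed helper mapping dnnl::memory::desc -> cldnn::format
        in_fmts[idx] = onednn::find_data_format(prim_desc.src_desc(static_cast<int>(idx)));
    }
    out_fmts[0] = onednn::find_data_format(prim_desc.dst_desc(0));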

Comment on lines 24 to 25
return node.get_input_layout(0).format == cldnn::format::bfyx || node.get_input_layout(0).format == cldnn::format::fbyx \
|| node.get_input_layout(0).format == cldnn::format::ybfx;
Contributor:

I think tensor format is not the only restriction. At least we need

  1. Type checks
  2. info.arch == gpu_arch::unknown (see other impls)
  3. padding checks

Contributor Author:

1 and 2 done, 3 not done

Contributor Author:

3 done as well
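
A condensed sketch of what the combined checks can look like; the function name and the accepted type set are assumptions modeled on other onednn impls in the plugin:

    // Sketch: format, data type, architecture, and padding checks together.
    static bool is_supported(const program_node& node) {
        const auto& info = node.get_program().get_engine().get_device_info();
        if (info.arch == gpu_arch::unknown)           // matches other onednn impls
            return false;
        auto in_layout = node.get_input_layout(0);
        if (in_layout.format != cldnn::format::bfyx && in_layout.format != cldnn::format::fbyx &&
            in_layout.format != cldnn::format::ybfx)  // format restriction from above
            return false;
        if (in_layout.data_type != data_types::f16 && in_layout.data_type != data_types::f32)
            return false;                             // assumed supported types
        if (in_layout.data_padding != padding())      // reject padded inputs
            return false;
        return true;
    }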

Comment on lines 35 to 59
int i = 0;
auto& input = instance.input_memory(i);
auto offset = onednn::get_offset(instance.get_input_layout(i),
_pd.dnnl::primitive_desc_base::src_desc(i));
auto mem = input.get_onednn_memory(_pd.dnnl::primitive_desc_base::src_desc(i), offset);
args.insert({DNNL_ARG_SRC_LAYER, mem});
}

{
int i = 1;
auto& input = instance.input_memory(i);
auto offset = onednn::get_offset(instance.get_input_layout(i),
_pd.dnnl::primitive_desc_base::src_desc(i));
auto mem = input.get_onednn_memory(_pd.dnnl::primitive_desc_base::src_desc(i), offset);
args.insert({DNNL_ARG_SRC_ITER, mem});
}

{
int i = 2;
auto& input = instance.input_memory(i);
auto offset = onednn::get_offset(instance.get_input_layout(i),
_pd.dnnl::primitive_desc_base::src_desc(i));
auto mem = input.get_onednn_memory(_pd.dnnl::primitive_desc_base::src_desc(i), offset);
args.insert({DNNL_ARG_SRC_ITER_C, mem});
}
Contributor:

I think this code can be done in a loop if you store DNNL_ARG_SRC_LAYER, DNNL_ARG_SRC_ITER, etc. in a vector. Same for the weights and dst buffers.

Contributor Author:

done
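
For instance, a minimal sketch of that loop, reusing exactly the calls from the block above:

    // Map input index -> oneDNN argument id, then bind all sources in one loop.
    const std::vector<int> src_args = {DNNL_ARG_SRC_LAYER, DNNL_ARG_SRC_ITER, DNNL_ARG_SRC_ITER_C};
    for (int i = 0; i < static_cast<int>(src_args.size()); i++) {
        auto& input = instance.input_memory(i);
        auto desc = _pd.dnnl::primitive_desc_base::src_desc(i);
        auto offset = onednn::get_offset(instance.get_input_layout(i), desc);
        args.insert({src_args[i], input.get_onednn_memory(desc, offset)});
    }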

Comment on lines 220 to 232
auto hiddenSize = reorder_params->get_output_layout().get_shape()[1] / 4;
auto cropSize = cldnn::tensor{dir_num, static_cast<int>(hiddenSize), 1, 1};
std::string crop_id_b = input_id + "_c";
auto get_crop_node = [&](int cropNum) -> cldnn::program_node& {
auto crop_id = primitive_id(crop_id_b + std::to_string(cropNum));
auto crop_prim = std::make_shared<cldnn::crop>(crop_id, input_id, cropSize, cldnn::tensor{0, static_cast<int>(cropNum*hiddenSize), 0, 0});
return p.get_or_create(crop_prim);
};
auto& crop0_node = get_crop_node(0);
auto& crop1_node = get_crop_node(1);
auto& crop2_node = get_crop_node(2);
auto& crop3_node = get_crop_node(3);
std::vector<input_info> con_input{input_info(crop1_node.id()), input_info(crop0_node.id()), input_info(crop2_node.id()), input_info(crop3_node.id())};
Contributor:

Can it be done with some kind of Slice/StridedSlice primitive?

Contributor Author:

It can be. Actually, I've deleted one crop, but I don't think it would be easy to end up with fewer nodes using a StridedSlice primitive.
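
For what it's worth, the four crops and their reordering can also be collected in one pass over the gate order; a sketch reusing the get_crop_node lambda from the excerpt above:

    // Gate order {1, 0, 2, 3} taken from the original con_input construction.
    std::vector<input_info> con_input;
    for (int part : {1, 0, 2, 3})
        con_input.emplace_back(get_crop_node(part).id());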

@sshlyapn (Contributor) left a comment:

Looks good to me, left a couple of minor suggestions

@@ -120,6 +120,9 @@ concatenation_inst::typed_primitive_inst(network& network, concatenation_node co
if (dim == node.get_primitive()->axis) {
concat_count += input_mem_size[dim];
} else {
if (i.first->get_outputs_count() > 1 && i.first->get_user_index(node) > 0) {
Contributor:

Did you try to use port number i.second to obtain the proper output layout here?

Contributor Author:

done

}
}
p.get_processing_order().calc_processing_order(p);
Contributor:

It makes sense to call this recalculation only for the LSTM case

Contributor Author:

done
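
A minimal sketch of such a guard; 'lstm_rewritten' is a hypothetical flag assumed to be set earlier in the pass:

    // Recompute the processing order only when an LSTM node was actually
    // rewritten, instead of unconditionally for every program.
    if (lstm_rewritten)
        p.get_processing_order().calc_processing_order(p);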

dispatchData.gws[2] = input.Batch().v;
dispatchData.gws[1] = input.Feature().v;
dispatchData.gws[0] = input.Y().v*input.X().v;
dispatchData.lws = {1, 1, 1};
Contributor:

This workgroup size might not provide optimal performance; we may consider optimizing it in the future.

Contributor Author:

ok

@vladimir-paramuzov (Contributor) left a comment:

Overall, LGTM. Please check the performance carefully

Comment on lines 99 to 104
seed = hash_combine(seed, initial_hidden_state.pid);
seed = hash_combine(seed, initial_cell_state.pid);
seed = hash_combine(seed, seq_lenghts.pid);
seed = hash_combine(seed, W.pid);
seed = hash_combine(seed, R.pid);
seed = hash_combine(seed, B.pid);
Contributor:

Comparison and hashing of the primitive ids prevents primitive reuse if we have multiple instances of the same op. So you should hash and compare only a presence flag for each input. As an example you can use the convolution op.

Contributor Author:

done
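
The resolved form presumably hashes presence rather than ids, along these lines (member names, including the misspelled seq_lenghts, are taken from the excerpt above):

    // Hash whether each optional input is wired, not which primitive feeds it,
    // so identical ops with different producers can reuse one implementation.
    seed = hash_combine(seed, !initial_hidden_state.pid.empty());
    seed = hash_combine(seed, !initial_cell_state.pid.empty());
    seed = hash_combine(seed, !seq_lenghts.pid.empty());
    seed = hash_combine(seed, !W.pid.empty());
    seed = hash_combine(seed, !R.pid.empty());
    seed = hash_combine(seed, !B.pid.empty());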

Comment on lines 117 to 122
auto input_i_layout = i.first->get_output_layout();
auto input_mem_size = input_i_layout.get_dims();
if (i.first->get_outputs_count() > 1 && i.second > 0) {
input_i_layout = i.first->get_output_layout(false, i.second);
input_mem_size = input_i_layout.get_dims();
}
Contributor:

I think that can be changed to

        auto input_i_layout = i.first->get_output_layout(false, i.second);
        auto input_mem_size = input_i_layout.get_dims();

isn't it?

Contributor Author:

done

bool cell_state_check = one_of(in2_dt, {data_types::f16, data_types::bf16, data_types::f32}) &&
one_of(out2_dt, {data_types::f16, data_types::bf16, data_types::f32});
bool f16_case = everyone_is(data_types::f16, in0_dt, in1_dt, in3_dt, in4_dt, out0_dt, out1_dt);
bool bf16_case = everyone_is(data_types::bf16, in0_dt, in1_dt, in3_dt, in4_dt, out0_dt, out1_dt);
Contributor:

bf16 is not supported by the GPU plugin for now. I think it can be removed from here as well.

Contributor Author:

done
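
For reference, the same checks with bf16 dropped would read:

    bool cell_state_check = one_of(in2_dt, {data_types::f16, data_types::f32}) &&
                            one_of(out2_dt, {data_types::f16, data_types::f32});
    bool f16_case = everyone_is(data_types::f16, in0_dt, in1_dt, in3_dt, in4_dt, out0_dt, out1_dt);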

Comment on lines 44 to 46
if (node.get_preferred_impl_type() == impl_types::onednn && node.get_preferred_output_fmt() != format::any) {
first_out_fmt = node.get_preferred_output_fmt();
}
Contributor:

Why do you consider the first output port only?

Contributor Author:

done
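
A sketch of extending this to every port, assuming get_preferred_output_fmt accepts a port index and an out_fmts array as in the select_preferred_formats excerpt earlier in this review:

    // Take oneDNN's preferred format for every output port, not just port 0.
    if (node.get_preferred_impl_type() == impl_types::onednn) {
        for (size_t idx = 0; idx < node.get_outputs_count(); idx++) {
            auto fmt = node.get_preferred_output_fmt(idx);  // assumed per-port overload
            if (fmt != format::any)
                out_fmts[idx] = fmt;
        }
    }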

Comment on lines 49 to 51
return {cldnn::layout{ShapeType{lstm_batch_size, num_directions, lstm_seq_length, lstm_hidden_size}, input_layout.data_type, first_out_fmt}, \
cldnn::layout{ShapeType{lstm_batch_size, num_directions, lstm_hidden_size}, input_layout.data_type, second_out_fmt}, \
cldnn::layout{ShapeType{lstm_batch_size, num_directions, lstm_hidden_size}, input_layout.data_type, third_out_fmt}};
Contributor:

Suggested change (drop the trailing backslashes):

        return {cldnn::layout{ShapeType{lstm_batch_size, num_directions, lstm_seq_length, lstm_hidden_size}, input_layout.data_type, first_out_fmt},
                cldnn::layout{ShapeType{lstm_batch_size, num_directions, lstm_hidden_size}, input_layout.data_type, second_out_fmt},
                cldnn::layout{ShapeType{lstm_batch_size, num_directions, lstm_hidden_size}, input_layout.data_type, third_out_fmt}};

Contributor Author:

done

const std::vector<activation_additional_params>& activation_params = {},
const lstm_weights_order& offset_order = lstm_weights_order::iofz,
const ov::op::RecurrentSequenceDirection direction = ov::op::RecurrentSequenceDirection::FORWARD,
const padding& output_padding = padding(),
Contributor:

I think the padding arg is not needed, as it's always set to the default.

Contributor Author:

done

}
OPENVINO_ASSERT(!inputs[5].pid.empty());
OPENVINO_ASSERT(p.use_new_shape_infer());
p.add_primitive(*op, cldnn::lstm_cell(layerName+".out0", inputs[0], inputs[1], inputs[2], inputs[3], inputs[4], inputs[5], cldnn::input_info(),
Contributor:

I think this ".out0" suffix is not needed for new shape infer

Contributor Author:

done

Comment on lines 132 to 134
cmp_fields(W) &&
cmp_fields(R) &&
cmp_fields(B) &&
Contributor:

Here you also shouldn't compare string values, but rather check presence of inputs

Contributor Author:

done
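
A sketch of the presence-based comparison; the 'present' helper and the 'rhs' name are illustrative:

    // Compare whether each optional input is wired, not its primitive id string.
    auto present = [](const input_info& in) { return !in.pid.empty(); };
    bool inputs_match = present(W) == present(rhs.W) &&
                        present(R) == present(rhs.R) &&
                        present(B) == present(rhs.B);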

Signed-off-by: Michal Miotk <[email protected]>
@vladimir-paramuzov vladimir-paramuzov added this to the 2025.0 milestone Nov 26, 2024
@vladimir-paramuzov vladimir-paramuzov added this pull request to the merge queue Nov 26, 2024
Merged via the queue into openvinotoolkit:master with commit a88bf5a Nov 26, 2024
172 of 174 checks passed
Labels
category: build (OpenVINO cmake script / infra), category: GPU (OpenVINO GPU plugin), category: IE Tests (OpenVINO Test: plugins and common), under_perf_check
5 participants