
[GPU] LSTMSequence and LSTMCell optimization #26767

Merged: 178 commits, Nov 26, 2024

Conversation

@michal-miotk (Contributor) commented Sep 24, 2024

Details:

  • create a single lstm_seq primitive so it runs faster than the previous approach, which composed many primitives
  • use oneDNN for the actual computation (see the sketch below)
  • based on commit c99ddc0 from 25732
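
For context, a minimal sketch of what creating such a fused oneDNN LSTM sequence primitive looks like (oneDNN 3.x API); all dimension values and the f16 data type below are illustrative assumptions, not the PR's actual code:

    #include <oneapi/dnnl/dnnl.hpp>
    using namespace dnnl;

    // Illustrative sizes: T=seq_len, N=batch, C=input size, H=hidden size,
    // L=1 layer, D=1 direction, G=4 LSTM gates.
    const memory::dim T = 16, N = 1, C = 64, H = 128, L = 1, D = 1, G = 4;

    engine eng(engine::kind::gpu, 0);
    auto dt = memory::data_type::f16;
    memory::desc src_layer({T, N, C}, dt, memory::format_tag::tnc);
    memory::desc src_iter({L, D, N, H}, dt, memory::format_tag::ldnc);
    memory::desc src_iter_c({L, D, N, H}, dt, memory::format_tag::ldnc);
    memory::desc w_layer({L, D, C, G, H}, dt, memory::format_tag::ldigo);
    memory::desc w_iter({L, D, H, G, H}, dt, memory::format_tag::ldigo);
    memory::desc bias({L, D, G, H}, dt, memory::format_tag::ldgo);
    memory::desc dst_layer({T, N, H}, dt, memory::format_tag::tnc);

    // One primitive processes the whole sequence, replacing the per-timestep
    // subgraph of many cldnn primitives.
    auto pd = lstm_forward::primitive_desc(eng, prop_kind::forward_inference,
        rnn_direction::unidirectional_left2right,
        src_layer, src_iter, src_iter_c, w_layer, w_iter, bias,
        dst_layer, src_iter, src_iter_c);
    auto lstm = lstm_forward(pd);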

Tickets:

  • 146601

commit 232d272f11fbe65e82fa9787260a8b9d34b57d20
Author: michal-miotk <[email protected]>
Date:   Mon Jul 29 11:17:47 2024 +0000

    wip

commit e642ca3
Author: michal-miotk <[email protected]>
Date:   Sun Jul 28 22:08:24 2024 +0000

    wip

commit c6b74d3
Author: michal-miotk <[email protected]>
Date:   Fri Jul 26 14:10:26 2024 +0000

    wip

commit 0451429
Author: michal-miotk <[email protected]>
Date:   Thu Jul 25 20:35:11 2024 +0000

    wip3

commit 1164592
Author: michal-miotk <[email protected]>
Date:   Tue Aug 6 09:25:45 2024 +0000

    wip

commit 8b2c049
Author: michal-miotk <[email protected]>
Date:   Tue Aug 6 09:24:02 2024 +0000

    wip

commit 886b412
Author: michal-miotk <[email protected]>
Date:   Mon Aug 5 14:59:14 2024 +0000

    wip

commit 08fb207
Author: michal-miotk <[email protected]>
Date:   Sun Aug 4 20:21:38 2024 +0000

    wip, errors on half

commit 125884d
Author: michal-miotk <[email protected]>
Date:   Sat Aug 3 23:59:58 2024 +0000

    wip

commit af4f209
Author: michal-miotk <[email protected]>
Date:   Fri Aug 2 17:58:38 2024 +0000

    wip

commit 12626fc
Author: michal-miotk <[email protected]>
Date:   Fri Aug 2 10:52:15 2024 +0000

    wip

commit dfdd052
Author: michal-miotk <[email protected]>
Date:   Thu Aug 1 15:38:41 2024 +0000

    wip

commit 54ee912
Author: michal-miotk <[email protected]>
Date:   Thu Aug 1 11:01:55 2024 +0000

    only bfyx layout

commit 240fe4a
Author: michal-miotk <[email protected]>
Date:   Thu Aug 1 10:34:45 2024 +0000

    two outputs from prim

commit bc775be
Author: michal-miotk <[email protected]>
Date:   Wed Jul 31 22:13:14 2024 +0000

    wip

commit d1cfd60
Author: michal-miotk <[email protected]>
Date:   Wed Jul 31 22:07:06 2024 +0000

    wip

commit 7d18884
Author: michal-miotk <[email protected]>
Date:   Wed Jul 31 19:19:04 2024 +0000

    begin of handling reverse

commit 39f64af
Author: michal-miotk <[email protected]>
Date:   Wed Jul 31 15:37:06 2024 +0000

    betterbetter

commit 67b3c9a
Author: michal-miotk <[email protected]>
Date:   Wed Jul 31 13:12:39 2024 +0000

    better

commit 6ded5aa
Author: michal-miotk <[email protected]>
Date:   Wed Jul 31 10:12:31 2024 +0000

    wip

commit 1ccdacc
Author: michal-miotk <[email protected]>
Date:   Tue Jul 30 23:07:21 2024 +0000

    wip

commit ab1307c
Author: michal-miotk <[email protected]>
Date:   Tue Jul 30 22:00:50 2024 +0000

    test passed

commit bc65969
Author: michal-miotk <[email protected]>
Date:   Tue Jul 30 15:37:20 2024 +0000

    wip

commit 03cbf57
Author: michal-miotk <[email protected]>
Date:   Tue Jul 30 15:15:06 2024 +0000

    only 2 outputs

commit fd5f3dc
Author: michal-miotk <[email protected]>
Date:   Tue Jul 30 14:47:12 2024 +0000

    wip

commit 939d23c
Author: michal-miotk <[email protected]>
Date:   Tue Jul 30 11:34:56 2024 +0000

    wip

commit 2bb561f
Author: michal-miotk <[email protected]>
Date:   Tue Jul 30 09:28:03 2024 +0000

    added to binary buffer

commit 1ef83ff
Author: michal-miotk <[email protected]>
Date:   Mon Jul 29 22:30:57 2024 +0000

    not works
std::vector<cldnn::activation_func> activations;
std::vector<cldnn::activation_additional_params> activation_params;
GetLSTMActivationParams(op, activations, activation_params);
float clip = op->get_clip();

assert(!inputs[5].pid.empty());
if (p.use_new_shape_infer()) {
Contributor:

I suggest replacing it with OPENVINO_ASSERT to ensure the method is called correctly

Contributor Author:

done
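
For reference, a hedged sketch of the OPENVINO_ASSERT form; the message strings are illustrative, not taken from the PR:

    // OPENVINO_ASSERT throws an ov::Exception with context instead of
    // aborting like plain assert(); the message text here is made up.
    OPENVINO_ASSERT(!inputs[5].pid.empty(), "[GPU] lstm_cell: missing bias input");
    OPENVINO_ASSERT(p.use_new_shape_infer(), "[GPU] lstm_cell requires new shape infer");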

op_mode, 1, axis, num_splits));
p.add_primitive(*op, cldnn::reshape(outputCellID, cldnn::input_info(outputCellCropID),
false, outSzPt, op->get_output_partial_shape(1)));
p.add_primitive(*op, cldnn::lstm_cell(layerName+".out0", inputs[0], inputs[1], inputs[2], inputs[3], inputs[4], inputs[5], \
Contributor:

Is it? I think item 2 is still relevant. You pass this layerName + "_md_write.1" argument, and the corresponding parameters from primitive API are still there.

@@ -278,6 +278,9 @@ ov::SupportedOpsMap Plugin::query_model(const std::shared_ptr<const ov::Model>&

ExecutionConfig config = m_configs_map.at(device_id);
config.set_user_property(orig_config);
if (ctx->get_engine().get_device_info().supports_immad) {
Contributor:

These two changes are not needed either.

Contributor Author:

done

p.add_primitive(*op, cldnn::crop(cellStr, cldnn::input_info(lstm_elt_id), hiddenSz, cellCropSz));
}
const float clip = op->get_clip();
if (op->get_input_shape(2).size() != 3 || op->get_input_shape(3).size() != 1 \
Contributor:

nit: also redundant backslashes here and in other places. Please remove those

Contributor Author:

done

p.add_primitive(*op, cldnn::reshape(layerName + ".out0", concatStr, tensor_from_dims(op->get_output_shape(0))), {layerName});
p.add_primitive(*op, cldnn::reshape(layerName + ".out1", hiddenStr, tensor_from_dims(op->get_output_shape(1))));
p.add_primitive(*op, cldnn::reshape(layerName + ".out2", cellStr, tensor_from_dims(op->get_output_shape(2))));
if (p.use_new_shape_infer()) {
Contributor:

OPENVINO_ASSERT here as well

Contributor Author:

done

public:
using parent::parent;

program_node& input() const { return get_dependency(0); }
Contributor:

Likely the same unused methods as for lstm_seq primitive

Contributor Author:

done

Comment on lines 30 to 44
std::vector<format::type> in_fmts(node.get_dependencies().size(), format::any);
std::vector<format::type> out_fmts(node.get_outputs_count(), format::any);

size_t out_rank = node.get_output_layout().get_rank();
for (size_t idx = 0 ; idx < node.get_dependencies().size() ; idx++) {
if (node.get_dependency(idx).is_constant())
continue;

auto target_format = format::get_default_format(out_rank);

in_fmts[idx] = target_format;
}
out_fmts[0] = format::ybfx;

return {in_fmts, out_fmts};
Contributor:

I think that code should actually query onednn for the required tensor formats (as it's done for convolutions). You can do it in the next PR

Contributor Author:

ok
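
A rough sketch of that follow-up, assuming helpers along the lines of the convolution impl for building the descriptor and translating dnnl formats back to cldnn (both helper names here are hypothetical):

    // Hypothetical: create the oneDNN primitive descriptor, then take the
    // formats oneDNN actually selected for each tensor.
    auto prim_desc = get_rnn_primitive_descriptor(node);  // assumed helper
    for (size_t idx = 0; idx < node.get_dependencies().size(); idx++) {
        if (node.get_dependency(idx).is_constant())
            continue;
        // assumed helper mapping dnnl::memory::desc -> cldnn::format
        in_fmts[idx] = onednn::find_data_format(prim_desc.src_desc(static_cast<int>(idx)));
    }
    out_fmts[0] = onednn::find_data_format(prim_desc.dst_desc(0));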

Comment on lines 24 to 25
return node.get_input_layout(0).format == cldnn::format::bfyx || node.get_input_layout(0).format == cldnn::format::fbyx \
|| node.get_input_layout(0).format == cldnn::format::ybfx;
Contributor:

I think tensor format is not the only restriction. At least we need

  1. Type checks
  2. info.arch == gpu_arch::unknown (see other impls)
  3. padding checks

Contributor Author:

1 and 2 done, 3 not done

Contributor Author:

3 done as well
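
A condensed sketch of what the combined checks can look like; the function name and the accepted type set are assumptions modeled on other onednn impls in the plugin:

    // Sketch: format, data type, architecture, and padding checks together.
    static bool is_supported(const program_node& node) {
        const auto& info = node.get_program().get_engine().get_device_info();
        if (info.arch == gpu_arch::unknown)           // matches other onednn impls
            return false;
        auto in_layout = node.get_input_layout(0);
        if (in_layout.format != cldnn::format::bfyx && in_layout.format != cldnn::format::fbyx &&
            in_layout.format != cldnn::format::ybfx)  // format restriction from above
            return false;
        if (in_layout.data_type != data_types::f16 && in_layout.data_type != data_types::f32)
            return false;                             // assumed supported types
        if (in_layout.data_padding != padding())      // reject padded inputs
            return false;
        return true;
    }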

Comment on lines 35 to 59
int i = 0;
auto& input = instance.input_memory(i);
auto offset = onednn::get_offset(instance.get_input_layout(i),
_pd.dnnl::primitive_desc_base::src_desc(i));
auto mem = input.get_onednn_memory(_pd.dnnl::primitive_desc_base::src_desc(i), offset);
args.insert({DNNL_ARG_SRC_LAYER, mem});
}

{
int i = 1;
auto& input = instance.input_memory(i);
auto offset = onednn::get_offset(instance.get_input_layout(i),
_pd.dnnl::primitive_desc_base::src_desc(i));
auto mem = input.get_onednn_memory(_pd.dnnl::primitive_desc_base::src_desc(i), offset);
args.insert({DNNL_ARG_SRC_ITER, mem});
}

{
int i = 2;
auto& input = instance.input_memory(i);
auto offset = onednn::get_offset(instance.get_input_layout(i),
_pd.dnnl::primitive_desc_base::src_desc(i));
auto mem = input.get_onednn_memory(_pd.dnnl::primitive_desc_base::src_desc(i), offset);
args.insert({DNNL_ARG_SRC_ITER_C, mem});
}
Contributor:

I think this code can be done in a loop if you store DNNL_ARG_SRC_LAYER, DNNL_ARG_SRC_ITER, etc. in a vector. Same for the weights and dst buffers.

Contributor Author:

done
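
For instance, a minimal sketch of that loop, reusing exactly the calls from the block above:

    // Map input index -> oneDNN argument id, then bind all sources in one loop.
    const std::vector<int> src_args = {DNNL_ARG_SRC_LAYER, DNNL_ARG_SRC_ITER, DNNL_ARG_SRC_ITER_C};
    for (int i = 0; i < static_cast<int>(src_args.size()); i++) {
        auto& input = instance.input_memory(i);
        auto desc = _pd.dnnl::primitive_desc_base::src_desc(i);
        auto offset = onednn::get_offset(instance.get_input_layout(i), desc);
        args.insert({src_args[i], input.get_onednn_memory(desc, offset)});
    }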

Comment on lines 220 to 232
auto hiddenSize = reorder_params->get_output_layout().get_shape()[1] / 4;
auto cropSize = cldnn::tensor{dir_num, static_cast<int>(hiddenSize), 1, 1};
std::string crop_id_b = input_id + "_c";
auto get_crop_node = [&](int cropNum) -> cldnn::program_node& {
auto crop_id = primitive_id(crop_id_b + std::to_string(cropNum));
auto crop_prim = std::make_shared<cldnn::crop>(crop_id, input_id, cropSize, cldnn::tensor{0, static_cast<int>(cropNum*hiddenSize), 0, 0});
return p.get_or_create(crop_prim);
};
auto& crop0_node = get_crop_node(0);
auto& crop1_node = get_crop_node(1);
auto& crop2_node = get_crop_node(2);
auto& crop3_node = get_crop_node(3);
std::vector<input_info> con_input{input_info(crop1_node.id()), input_info(crop0_node.id()), input_info(crop2_node.id()), input_info(crop3_node.id())};
Contributor:

Can it be done with some kind of Slice/StridedSlice primitive?

Contributor Author:

It can be. Actually, I've deleted one crop, but I don't think it would be easy to end up with fewer nodes using a StridedSlice primitive.
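
For what it's worth, the four crops and their reordering can also be collected in one pass over the gate order; a sketch reusing the get_crop_node lambda from the excerpt above:

    // Gate order {1, 0, 2, 3} taken from the original con_input construction.
    std::vector<input_info> con_input;
    for (int part : {1, 0, 2, 3})
        con_input.emplace_back(get_crop_node(part).id());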

@sshlyapn (Contributor) left a comment:

Looks good to me, left a couple of minor suggestions

@@ -120,6 +120,9 @@ concatenation_inst::typed_primitive_inst(network& network, concatenation_node co
if (dim == node.get_primitive()->axis) {
concat_count += input_mem_size[dim];
} else {
if (i.first->get_outputs_count() > 1 && i.first->get_user_index(node) > 0) {
Contributor:

Did you try to use port number i.second to obtain the proper output layout here?

Contributor Author:

done

}
}
p.get_processing_order().calc_processing_order(p);
Contributor:

It makes sense to call this recalculation only for the LSTM case

Contributor Author:

done
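
A minimal sketch of such a guard; 'lstm_rewritten' is a hypothetical flag assumed to be set earlier in the pass:

    // Recompute the processing order only when an LSTM node was actually
    // rewritten, instead of unconditionally for every program.
    if (lstm_rewritten)
        p.get_processing_order().calc_processing_order(p);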

dispatchData.gws[2] = input.Batch().v;
dispatchData.gws[1] = input.Feature().v;
dispatchData.gws[0] = input.Y().v*input.X().v;
dispatchData.lws = {1, 1, 1};
Contributor:

This workgroup size might not provide optimal performance; we may consider optimizing it in the future.

Contributor Author:

ok

@vladimir-paramuzov (Contributor) left a comment:

Overall, LGTM. Please check the performance carefully

Comment on lines 99 to 104
seed = hash_combine(seed, initial_hidden_state.pid);
seed = hash_combine(seed, initial_cell_state.pid);
seed = hash_combine(seed, seq_lenghts.pid);
seed = hash_combine(seed, W.pid);
seed = hash_combine(seed, R.pid);
seed = hash_combine(seed, B.pid);
Contributor:

Comparison and hashing of the primitive ids prevents primitive reuse if we have multiple instances of the same op. So you should hash and compare only a presence flag for each input. As an example you can use the convolution op.

Contributor Author:

done
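
The resolved form presumably hashes presence rather than ids, along these lines (member names, including the misspelled seq_lenghts, are taken from the excerpt above):

    // Hash whether each optional input is wired, not which primitive feeds it,
    // so identical ops with different producers can reuse one implementation.
    seed = hash_combine(seed, !initial_hidden_state.pid.empty());
    seed = hash_combine(seed, !initial_cell_state.pid.empty());
    seed = hash_combine(seed, !seq_lenghts.pid.empty());
    seed = hash_combine(seed, !W.pid.empty());
    seed = hash_combine(seed, !R.pid.empty());
    seed = hash_combine(seed, !B.pid.empty());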

Comment on lines 117 to 122
auto input_i_layout = i.first->get_output_layout();
auto input_mem_size = input_i_layout.get_dims();
if (i.first->get_outputs_count() > 1 && i.second > 0) {
input_i_layout = i.first->get_output_layout(false, i.second);
input_mem_size = input_i_layout.get_dims();
}
Contributor:

I think that can be changed to

        auto input_i_layout = i.first->get_output_layout(false, i.second);
        auto input_mem_size = input_i_layout.get_dims();

isn't it?

Contributor Author:

done

bool cell_state_check = one_of(in2_dt, {data_types::f16, data_types::bf16, data_types::f32}) &&
one_of(out2_dt, {data_types::f16, data_types::bf16, data_types::f32});
bool f16_case = everyone_is(data_types::f16, in0_dt, in1_dt, in3_dt, in4_dt, out0_dt, out1_dt);
bool bf16_case = everyone_is(data_types::bf16, in0_dt, in1_dt, in3_dt, in4_dt, out0_dt, out1_dt);
Contributor:

bf16 is not supported by the GPU plugin for now. I think it can be removed from here as well.

Contributor Author:

done
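
For reference, the same checks with bf16 dropped would read:

    bool cell_state_check = one_of(in2_dt, {data_types::f16, data_types::f32}) &&
                            one_of(out2_dt, {data_types::f16, data_types::f32});
    bool f16_case = everyone_is(data_types::f16, in0_dt, in1_dt, in3_dt, in4_dt, out0_dt, out1_dt);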

Comment on lines 44 to 46
if (node.get_preferred_impl_type() == impl_types::onednn && node.get_preferred_output_fmt() != format::any) {
first_out_fmt = node.get_preferred_output_fmt();
}
Contributor:

Why do you consider the first output port only?

Contributor Author:

done
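
A sketch of extending this to every port, assuming get_preferred_output_fmt accepts a port index and an out_fmts array as in the select_preferred_formats excerpt earlier in this review:

    // Take oneDNN's preferred format for every output port, not just port 0.
    if (node.get_preferred_impl_type() == impl_types::onednn) {
        for (size_t idx = 0; idx < node.get_outputs_count(); idx++) {
            auto fmt = node.get_preferred_output_fmt(idx);  // assumed per-port overload
            if (fmt != format::any)
                out_fmts[idx] = fmt;
        }
    }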

Comment on lines 49 to 51
return {cldnn::layout{ShapeType{lstm_batch_size, num_directions, lstm_seq_length, lstm_hidden_size}, input_layout.data_type, first_out_fmt}, \
cldnn::layout{ShapeType{lstm_batch_size, num_directions, lstm_hidden_size}, input_layout.data_type, second_out_fmt}, \
cldnn::layout{ShapeType{lstm_batch_size, num_directions, lstm_hidden_size}, input_layout.data_type, third_out_fmt}};
Contributor:

Suggested change (drop the trailing backslashes):

        return {cldnn::layout{ShapeType{lstm_batch_size, num_directions, lstm_seq_length, lstm_hidden_size}, input_layout.data_type, first_out_fmt},
                cldnn::layout{ShapeType{lstm_batch_size, num_directions, lstm_hidden_size}, input_layout.data_type, second_out_fmt},
                cldnn::layout{ShapeType{lstm_batch_size, num_directions, lstm_hidden_size}, input_layout.data_type, third_out_fmt}};

Contributor Author:

done

const std::vector<activation_additional_params>& activation_params = {},
const lstm_weights_order& offset_order = lstm_weights_order::iofz,
const ov::op::RecurrentSequenceDirection direction = ov::op::RecurrentSequenceDirection::FORWARD,
const padding& output_padding = padding(),
Contributor:

I think the padding arg is not needed, as it's always set to the default.

Contributor Author:

done

}
OPENVINO_ASSERT(!inputs[5].pid.empty());
OPENVINO_ASSERT(p.use_new_shape_infer());
p.add_primitive(*op, cldnn::lstm_cell(layerName+".out0", inputs[0], inputs[1], inputs[2], inputs[3], inputs[4], inputs[5], cldnn::input_info(),
Contributor:

I think this ".out0" suffix is not needed for new shape infer

Contributor Author:

done

Comment on lines 132 to 134
cmp_fields(W) &&
cmp_fields(R) &&
cmp_fields(B) &&
Contributor:

Here you also shouldn't compare string values, but rather check presence of inputs

Contributor Author:

done
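
A sketch of the presence-based comparison; the 'present' helper and the 'rhs' name are illustrative:

    // Compare whether each optional input is wired, not its primitive id string.
    auto present = [](const input_info& in) { return !in.pid.empty(); };
    bool inputs_match = present(W) == present(rhs.W) &&
                        present(R) == present(rhs.R) &&
                        present(B) == present(rhs.B);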

Signed-off-by: Michal Miotk <[email protected]>
@vladimir-paramuzov vladimir-paramuzov added this to the 2025.0 milestone Nov 26, 2024
@vladimir-paramuzov vladimir-paramuzov added this pull request to the merge queue Nov 26, 2024
Merged via the queue into openvinotoolkit:master with commit a88bf5a Nov 26, 2024
172 of 174 checks passed
Labels
category: build (OpenVINO cmake script / infra), category: GPU (OpenVINO GPU plugin), category: IE Tests (OpenVINO Test: plugins and common), under_perf_check
5 participants