refactor EmbeddingBagOffsetsSum
Zhibin Li committed Nov 12, 2020
1 parent 89d83c2 commit 6130516
Showing 1 changed file with 20 additions and 20 deletions.
@@ -40,6 +40,7 @@ static OpRegistration reg("EmbeddingBagOffsetsSum", [](const Context& ctx) {
IE_ASSERT(indices.rank() == 1);
auto num_indices = indices.compute_shape().sizes()[0];
auto offsets = cast_constant_operand<int32_t>(2, layer);

int32_t default_index = -1;
Tensor per_sample_weights;
bool with_weights = false;
@@ -57,44 +58,43 @@ static OpRegistration reg("EmbeddingBagOffsetsSum", [](const Context& ctx) {
offsets.push_back(num_indices);

auto I_gathered = gather(I, indices);

auto ndims = I_gathered.rank();
std::vector<TensorDim> I_dims(ndims);
std::vector<TensorIndex> I_idxs(ndims);
std::vector<Tensor> slices, Os;
std::vector<Tensor> Os;
I_gathered.bind_dims(I_dims);
auto O_dims = I_dims;
auto O_idxs = I_idxs;

O_dims[0] = edsl::TensorDim(1);
for (size_t i = 0; i < num_indices; ++i) {
O_idxs[0] = I_idxs[0] - i;
auto slice = edsl::Contraction(O_dims, O_idxs).sum(I_gathered(I_idxs)).build();
if (with_weights == true) {
Tensor weight = op::slice(per_sample_weights).add_dim(i, i + 1);
slice = slice * weight;
if (with_weights) {
std::vector<int64_t> unsqueeze_axes;
for (int64_t i = 1; i < I_gathered.rank(); i++) {
unsqueeze_axes.push_back(i);
}
slices.push_back(slice);
auto weights_expanded = op::unsqueeze(per_sample_weights, unsqueeze_axes);
I_gathered = I_gathered * weights_expanded;
}

for (uint32_t l = 0; l < batch; ++l) {
if (offsets[l + 1] == offsets[l]) {
for (uint32_t i = 0; i < batch; ++i) {

@tzerrell commented on Nov 12, 2020:

I think it might be possible to perform one overall Contraction, rather than having a separate Contraction for each iteration of the for loop, with the results concatenated later. I will think more about whether this is possible... it would require a tensor that looked like I_gathered in the offsets[i+1] != offsets[i] locations and like zero or I in the offsets[i+1] == offsets[i] locations. Also it would probably require careful use of Constraints.

If it is possible we should do it: building one contraction will make optimizations work better than using multiple concatenated contractions, and we should expect better performance from doing that.
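
A rough, untested sketch of what that single contraction might look like, assuming a precomputed one-hot mask tensor (called bag_mask here purely for illustration, of shape [batch, num_indices], with bag_mask[b, j] == 1 when index j falls in bag b) and assuming I_gathered is rank 2:

  // Illustrative only: with bag_mask available, all per-bag sums collapse into the single
  // contraction O[b, d] += bag_mask[b, j] * I_gathered[j, d].
  std::vector<TensorDim> mask_dims(2);
  std::vector<TensorDim> gathered_dims(2);
  bag_mask.bind_dims(mask_dims);
  I_gathered.bind_dims(gathered_dims);
  TensorIndex b, j, d;
  Tensor O = edsl::Contraction({mask_dims[0], gathered_dims[1]}, {b, d})
                 .sum(bag_mask(b, j) * I_gathered(j, d))
                 .build();

Building bag_mask from offsets is, of course, the part that would still need the careful Constraints mentioned above.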

@haoyouab commented on Nov 13, 2020:

I agree, but looking into the implementation of the Contraction class, I don't find anything like a conditional branching function that could fit:

a tensor that looked like I_gathered in the offsets[i+1] != offsets[i] locations and like zero or I in the offsets[i+1] == offsets[i] locations

The only function that deals with conditions is add_constraint, if I'm not misunderstanding, and it's not enough for what is mentioned above.
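
As far as I can tell, add_constraint only expresses static index bounds. A rough, untested sketch of the kind of thing it can do (a width-3 sliding-window sum over an example input tensor In; names purely illustrative):

  // The constraint bounds the otherwise-unbounded contraction index k to [0, 3).
  std::vector<TensorDim> in_dims(1);
  In.bind_dims(in_dims);
  TensorIndex i, k;
  Tensor windowed = edsl::Contraction(in_dims, {i})
                        .sum(In(i + k))
                        .add_constraint(k < 3)
                        .build();

And there's no obvious way to make such a bound vary per bag the way offsets[i + 1] - offsets[i] does.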

@haoyouab commented on Nov 13, 2020:

Also, offsets[i+1] - offsets[i] varies as the loop goes forward. I think it's much like the variable-length version of the split op, where split_lengths also varies. If we can do this in an overall Contraction, we might have to revise all the related code blocks that have outer for loops. What do you think?

@tzerrell commented on Nov 17, 2020:

I will set aside some time tomorrow to think about this further, but I think you may be correct that there is currently no way around this.

@tzerrell commented on Nov 18, 2020:

After consulting with @jbruestle, I found there is an approach that avoids for loops in C++. However, it requires summing scatter, where if you scatter to the same index multiple times the values are summed. Unfortunately, at the moment, our scatter does not support this (it will overwrite/race if you scatter to the same index).

The good news is that if we added summing scatter, then this approach would allow us to read offsets as a variable Tensor, not a constant.

I'm noting down the (untested) code that I believe would perform this operation if we had a summing scatter so we can come back to it:

  // Generate a tensor `bags` of shape `[num_indices]` that indicates which bag each index belongs to
  std::vector<TensorDim> num_indices(1);
  std::vector<TensorDim> batch_size(1);
  indices.bind_dims(num_indices);
  offsets.bind_dims(batch_size);
  Tensor count = edsl::index({num_indices[0], batch_size[0]}, 0);
  // above_offsets[i, b] == 1 when index i is at or past the start of bag b
  Tensor above_offsets = count >= offsets;
  // Summing over the bag axis b counts how many bags start at or before index i
  TensorIndex i, b;
  Tensor bags = edsl::Contraction({num_indices[0]}, {i}).sum(above_offsets(i, b)).build();
  // This produces 1-based indexing of the bags; so subtract 1 to get 0-based indexing
  bags = bags - 1;

  auto I_gathered = gather(I, indices);
  std::vector<TensorDim> emb_table_dims(I.rank());
  I.bind_dims(emb_table_dims);
  TensorDim num_emb = emb_table_dims[0];
  // Currently `scatter` requires the scatter size (number of legal index values) to be provided
  // as a 1D Tensor whose shape is the scatter size. So build a Tensor of shape [num_emb] but
  // don't bother to initialize it. This should be adapted once the scatter API is upgraded.
  Tensor scatter_shape_tensor(num_emb);
  // TODO: This needs to be a summing `scatter` (i.e., when the same index is scattered to
  // multiple times, sum), but current behavior is overwrite/race.
  auto result = scatter(I_gathered, bags, scatter_shape_tensor);

Note that this doesn't account for cases where the default_index and per_sample_weights parameters are used, which I did not consider after seeing the requirement for a summing scatter. More thought would be needed for them, but I'm optimistic they would be possible to include.
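
(For per_sample_weights, my guess is the same pre-multiplication used in this diff would compose with the scatter approach; a rough, untested sketch reusing the pattern from the refactored code above:)

  // Scale each gathered row by its per-sample weight before the summing scatter,
  // mirroring the unsqueeze-and-multiply pattern from the changed code in this commit.
  if (with_weights) {
    std::vector<int64_t> unsqueeze_axes;
    for (int64_t ax = 1; ax < I_gathered.rank(); ax++) {
      unsqueeze_axes.push_back(ax);
    }
    I_gathered = I_gathered * op::unsqueeze(per_sample_weights, unsqueeze_axes);
  }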

@haoyouab commented on Nov 26, 2020:

I'm not sure whether we should move this discussion to a more public place, but I hope this thread is not outdated.

I'm working on the summation feature of the edsl scatter op. According to the TensorFlow reference, summing duplicate indices is the default behavior, so I guess overwrite/race should be deprecated instead of co-existing with summing as two separate options.
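
To make the intended summing semantics concrete, here is a minimal plain-C++ sketch (not the eDSL API; the function name is only illustrative):

  #include <vector>

  // Duplicate indices accumulate instead of overwriting.
  std::vector<float> scatter_sum(const std::vector<float>& updates,
                                 const std::vector<int>& indices, size_t size) {
    std::vector<float> out(size, 0.0f);
    for (size_t k = 0; k < indices.size(); ++k) {
      out[indices[k]] += updates[k];  // '+=' is the only difference from an overwrite scatter
    }
    return out;
  }

  // scatter_sum({1, 2, 3}, {0, 0, 1}, 2) gives {3, 3}; an overwrite scatter would give {2, 3}.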

As for default_index, I think ScatterNDUpdate is a good fit for it. Pseudocode for EmbeddingSegmentsSum is below (EmbeddingBagOffsetsSum is similar, but with some extra code to get bags, as Tim listed above):

if (default_index) {
  auto default_slice = Contraction(..., ...).assign(I(default_index)).build();
  auto I_default_slice = op::repeat(default_slice).count(num_segments).axis(0);
}
// TODO: The API of the edsl scatter op needs modification to add an `update` feature which
// scatters slices into an existing tensor, and a `summing` feature to sum slices under the same index.
auto result = scatter(I_gathered, /*existing_tensor_to_update=*/I_default_slice, segment_ids, /*shape_of_output_tensor=*/shape);

But I'm not sure where this update feature should be implemented. I think the Tile eDSL level is better, and it could be added as an option/parameter like what Yanglei did for gather.

if (offsets[i + 1] == offsets[i]) {
if (default_index == -1) {
auto zero = cast(Tensor{0}, slices[0].dtype());
auto slice_shape = slices[0].compute_shape().sizes();
auto zero = cast(Tensor{0}, I_gathered.dtype());
auto slice_shape = I_gathered.compute_shape().sizes();
slice_shape[0] = 1;
std::vector<int> target_shape(begin(slice_shape), end(slice_shape));
std::vector<int> target_axes = {};
Os.push_back(op::broadcast(zero, target_shape, target_axes));

@tzerrell commented on Nov 12, 2020:

If we had the like function we discussed earlier this week, we would want to use it here. It's also possible to switch to a better workaround for dealing with the absence of such a like function. (I.e., one that avoids compute_shape.) But we can just wait for a like function IMO.

@tzerrell commented on Nov 12, 2020:

Also this entire block of code might be obsolete with a contraction for the outer loop (like I discussed in other comments).

@haoyouab commented on Nov 13, 2020:

Sure. I will keep an eye on this and update it whenever a new possible solution comes up.

} else {
O_dims[0] = edsl::TensorDim(1);
O_idxs[0] = I_idxs[0] - default_index;
Os.push_back(edsl::Contraction(O_dims, O_idxs).sum(I(I_idxs)));
Os.push_back(edsl::Contraction(O_dims, O_idxs).assign(I(I_idxs)));
}
} else {
Tensor t = slices[offsets[l]];
for (uint32_t i = offsets[l] + 1; i < offsets[l + 1]; ++i) {
t = t + slices[i];
}
Os.push_back(t);
O_dims[0] = edsl::TensorDim(offsets[i + 1] - offsets[i]);
O_idxs[0] = I_idxs[0] - offsets[i];
Tensor reduced = edsl::Contraction(O_dims, O_idxs).assign(I_gathered(I_idxs)).build();

@tzerrell commented on Nov 12, 2020:

Contractions are higher performance than for loops. PlaidML can access, optimize, and parallelize the loop structure of contractions. It cannot do so for C++ for loops. So although this Contraction is more verbose to express than the previous version of the code, I expect it to be more efficient since it has a Contraction rather than a C++ for loop.

reduced = op::unsqueeze(op::sum(reduced, edsl::make_tuple(0)), {0});
Os.push_back(reduced);
}
}
