This repository has been archived by the owner on Dec 16, 2022. It is now read-only.

Several micro optimizations #4833

Merged
merged 3 commits from benchmark-transfers into master on Dec 2, 2020

Conversation

@epwalsh (Member) commented on Dec 2, 2020

This mainly replaces calls that create a tensor and then send it to a device via `.to(device)` with calls that create the tensor directly on the device, which is over twice as fast. You can run the benchmarks yourself with:

pytest -c benchmarks/pytest.ini benchmarks/nn/util_bench.py -k 'create_tensor'

These are the results I got:

[Screenshot: benchmark results]
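The before/after pattern the PR describes can be sketched as follows (a minimal illustration; the device selection and tensor shapes here are my own, not taken from the PR):

```python
import torch

# Use the GPU when one is available, otherwise fall back to the CPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Before: the tensor is allocated on the CPU, then copied to the device.
slow = torch.ones(3, 4).to(device)

# After: the tensor is allocated directly on the target device,
# skipping the intermediate CPU allocation and copy.
fast = torch.ones(3, 4, device=device)
```

Both produce identical tensors; the second form simply avoids the extra allocation and transfer, which is where the ~2x speedup comes from.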

@@ -1548,7 +1548,6 @@ def add_sentence_boundary_token_ids(
The new mask for the tensor, taking into account the appended tokens
marking the beginning and end of the sentence.
"""
# TODO: matthewp, profile this transfer
epwalsh (Member Author) commented:


I benchmarked this. The function is actually faster as written than it is when keeping sequence_lengths on the GPU with:

sequence_lengths = mask.sum(dim=1).detach()
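For reference, a toy sketch of the two variants being compared (the mask here is illustrative, and I'm assuming the function keeps the lengths on the CPU, which is the transfer the `TODO` referred to):

```python
import torch

# A padded batch mask: row 0 has 3 real tokens, row 1 has 2.
mask = torch.tensor([[True, True, True, False],
                     [True, True, False, False]])

# Variant benchmarked against: lengths stay on the mask's device.
lengths_on_device = mask.sum(dim=1).detach()

# Variant the function keeps: lengths are transferred to host memory,
# which the benchmark showed to be faster overall for this function.
lengths_on_host = mask.sum(dim=1).detach().cpu()
```

The counterintuitive result is that the device-to-host transfer here is cheaper than the downstream work of indexing with device-resident lengths.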

@epwalsh epwalsh requested review from AkshitaB and dirkgr December 2, 2020 19:29

@dirkgr (Member) left a comment


Heh, cool!

@dirkgr dirkgr merged commit cec9209 into master Dec 2, 2020
@dirkgr dirkgr deleted the benchmark-transfers branch December 2, 2020 22:24

3 participants