
Schema performance improvements #632

Merged: 45 commits into CrayLabs:mli-feature, Jul 18, 2024

Conversation

@AlyssaCote (Contributor) commented Jul 15, 2024

This PR makes performance improvements that reduce the number of copies we were making and cut de/serialization time. Instead of building a Tensor and then adding it to a Request, the Request now holds TensorDescriptors, and the actual tensor data is sent after the request through the FLInterface.

Now that build_tensor has been replaced by build_tensor_descriptor, I was also able to delete many of the TensorFlow and Torch tests that had been separated out.
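
For illustration, here is a minimal sketch of the descriptor-first flow described above. The TensorDescriptor and InferenceRequest shapes and the channel.send API are stand-ins, not the PR's actual types:

```python
from dataclasses import dataclass, field

import numpy as np


@dataclass
class TensorDescriptor:
    """Metadata-only stand-in for the schema's tensor descriptor (fields assumed)."""

    order: str             # memory layout, e.g. "c" for row-major
    dtype: str             # element type, e.g. "float32"
    dimensions: list[int]  # tensor shape


@dataclass
class InferenceRequest:
    """The request carries descriptors only; raw tensor bytes travel separately."""

    model_key: str
    input_descriptors: list[TensorDescriptor] = field(default_factory=list)


def send_request(channel, request_bytes: bytes, tensors: list[np.ndarray]) -> None:
    # Send the small serialized request first...
    channel.send(request_bytes)
    # ...then each tensor's raw bytes, so tensor data is never copied
    # into the serialized request itself.
    for tensor in tensors:
        channel.send(tensor.tobytes())
```

The point of the split is that de/serializing the request no longer scales with tensor size; only descriptor metadata goes through the schema.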

@AlyssaCote AlyssaCote marked this pull request as draft July 15, 2024 21:22

codecov bot commented Jul 15, 2024

Codecov Report

Attention: Patch coverage is 27.02703% with 27 lines in your changes missing coverage. Please review.

Please upload report for BASE (mli-feature@eace71e). Learn more about missing BASE report.

Additional details and impacted files

@@              Coverage Diff               @@
##             mli-feature     #632   +/-   ##
==============================================
  Coverage               ?   76.61%           
==============================================
  Files                  ?      100           
  Lines                  ?     6905           
  Branches               ?        0           
==============================================
  Hits                   ?     5290           
  Misses                 ?     1615           
  Partials               ?        0           
Files                                                   Coverage Δ
smartsim/_core/mli/comm/channel/channel.py              75.00% <ø> (ø)
smartsim/_core/mli/infrastructure/worker/worker.py      53.75% <ø> (ø)
smartsim/_core/mli/message_handler.py                   99.51% <100.00%> (ø)
...rtsim/_core/mli/mli_schemas/tensor/tensor_capnp.py   100.00% <ø> (ø)
smartsim/_core/mli/comm/channel/dragonchannel.py        52.38% <66.66%> (ø)
...im/_core/mli/infrastructure/worker/torch_worker.py   85.41% <0.00%> (ø)
smartsim/_core/mli/comm/channel/dragonfli.py            64.00% <11.11%> (ø)
.../_core/mli/infrastructure/control/workermanager.py   22.15% <0.00%> (ø)

@AlyssaCote AlyssaCote marked this pull request as ready for review July 16, 2024 16:18
@AlyssaCote AlyssaCote requested review from al-rigazzi and ankona July 16, 2024 16:18

@mellis13 (Contributor) left a comment

Just some general comments and questions but overall great changes to the code.

msg_tensor = MessageHandler.build_tensor(
    tensor,

# TODO isn't this what output descriptors are for?

Contributor:

Can we resolve these two TODO comments? Do we need to make tickets or can they be deleted?

@AlyssaCote (Contributor, Author):

That's actually a note I left so I'd remember to ask this question for the group! We don't use OutputDescriptors anywhere yet. I think the hardcoded information here can come from the OutputDescriptors, so we know how the tensor needs to be reconstructed. I'll make a ticket for further discussion and remove these TODOs.
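
For illustration, if an OutputDescriptor exposes dtype, dimensions, and order fields (names assumed here, not taken from the schema), reconstruction on the receiving side could look like:

```python
import numpy as np


def reconstruct_output(raw: bytes, descriptor) -> np.ndarray:
    """Rebuild a tensor from raw bytes using descriptor metadata instead of
    hardcoded dtype/shape values (descriptor field names are assumed)."""
    array = np.frombuffer(raw, dtype=np.dtype(descriptor.dtype))
    array = array.reshape(descriptor.dimensions)
    if descriptor.order == "f":  # caller asked for column-major layout
        array = np.asfortranarray(array)
    return array
```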


interm = time.perf_counter()  # timing
request = deserialize_message(
    request_bytes, self._comm_channel_type, self._device
)

if request.input_meta and tensor_list:

Contributor:

Maybe the logic from 248 to 264 (and the deserialize_message() call) would be better encapsulated in an unpack_request. I think _on_iteration should have minimal manipulation of the request based on the serialization and communication specifics.
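
One possible shape for that refactor, reusing the names from the quoted snippet (the raw_inputs attribute is an assumption):

```python
def _unpack_request(self, request_bytes, tensor_list):
    """Hide serialization/communication details from _on_iteration by
    deserializing the message and attaching the raw tensors here."""
    request = deserialize_message(
        request_bytes, self._comm_channel_type, self._device
    )
    if request.input_meta and tensor_list:
        request.raw_inputs = tensor_list  # attribute name assumed
    return request
```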

Contributor:

(Does this make it more difficult to do perf timing though?)

@AlyssaCote (Contributor, Author):

I completely agree, but maybe we wait to refactor _on_iteration until we're solid with performance timing? It might make it more difficult.
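
If it helps, the refactor wouldn't necessarily make timing harder: a small context manager could keep per-stage measurements even after the logic moves into a helper. A sketch (the timings dict and label are hypothetical):

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(timings: dict, label: str):
    """Record the wall-clock duration of the enclosed block under `label`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings.setdefault(label, []).append(time.perf_counter() - start)


# Usage inside _on_iteration, even after unpack_request is factored out:
#     with timed(self._timings, "deserialize"):
#         request = self._unpack_request(request_bytes, tensor_list)
```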

@al-rigazzi (Collaborator) left a comment

LGTM!

@AlyssaCote AlyssaCote merged commit 7169f1c into CrayLabs:mli-feature Jul 18, 2024
42 checks passed