
Add a hardware benchmark to test memory, disk, and network bandwidths #5966

Merged · 23 commits into dask:main · Mar 28, 2022

Conversation

@mrocklin (Member)

This includes both a client-side function and a `/hardware` dashboard page.

[Screenshot of the new /hardware dashboard page]
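For reference, a minimal usage sketch of the client-side entry point (the method name matches the PR; the exact output shape shown in the comments is an assumption based on the `Scheduler.benchmark_hardware` signature later in this thread):

```python
# Hedged sketch: run the hardware benchmark from a client connected to a
# live cluster. Per Scheduler.benchmark_hardware below, the return type
# is dict[str, dict[str, float]].
from distributed import Client

client = Client()  # assumes a scheduler is reachable (or starts a local one)
result = client.benchmark_hardware()

# Assumed shape (values are bandwidths, presumably bytes per second):
# {"memory": {"1 kiB": ..., ...},
#  "disk":   {"1 kiB": ..., ...},
#  "network": {"1 kiB": ..., ...}}
for subsystem, bandwidths in result.items():
    print(subsystem, bandwidths)
```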

@mrocklin (Member, Author)

@quasiben you all might find this interesting

@github-actions bot (Contributor) commented Mar 20, 2022

Unit Test Results

12 files ±0 · 12 suites ±0 · 6h 9m 16s ⏱️ −9m 38s
2 673 tests +3: 2 591 ✔️ +4, 81 💤 −1, 1 ❌ +1
15 942 runs +18: 15 083 ✔️ +24, 858 💤 −5, 1 ❌ ±0

For more details on these failures, see this check.

Results for commit 67f7514. ± Comparison against base commit 2fbf9eb.

♻️ This comment has been updated with latest results.

@mrocklin changed the title from "Adds a hardware benchmark to test memory, disk, and network bandwidths" to "Add a hardware benchmark to test memory, disk, and network bandwidths" on Mar 21, 2022
@mrocklin (Member, Author)

cc @crusaderky if you have time to take a quick look

@ncclementi (Member)

@mrocklin, quick comment: could you add the corresponding entry to this doc page, please? https://github.com/dask/distributed/blob/main/docs/source/http_services.rst

@quasiben (Member)

This is awesome!

I think we should also build `client.go_burr()` cc @jacobtomlinson

@mrocklin (Member, Author)

> could you add the corresponding entry to this doc page, please?

I've gone ahead and done this, but I'm not sure that I agree with that page generally.

Historically distributed.dask.org gets almost no traffic. We've really focused on docs.dask.org. I don't think we should be investing much in writing these docs unless we're prepared to really invest in them. Instead, I think we should be focusing on docs.dask.org. I'm also not sure how much value there is in a listing of some of the dashboard pages. I might suggest a few alternatives:

  1. If we're going to document the dashboard then let's really document the dashboard, showing what pages do what with images, diagrams, videos, and so on
  2. If we're going for more of API documentation then we might consider adding these to the docstrings and using automatically generated sphinx style docs. If we do this, I'd recommend moving this over to docs.dask.org
  3. Just don't bother doing this, especially if it's not getting read

Personally, I would do 1 or 3

@mrocklin (Member, Author)

Yeah, last week twelve people went to that page. 100% of them bounced in about a minute. Rather than continue spending dev time developing that page in its current style I'm inclined instead to remove it. Thoughts?

@mrocklin (Member, Author)

Thank you for the thorough feedback @crusaderky. I've handled or responded to all of your comments and pushed an update.

@ncclementi (Member) commented Mar 22, 2022

> Historically distributed.dask.org gets almost no traffic. We've really focused on docs.dask.org. I don't think we should be investing much in writing these docs unless we're prepared to really invest in them. Instead, I think we should be focusing on docs.dask.org. I'm also not sure how much value there is in a listing of some of the dashboard pages. I might suggest a few alternatives:
>
> 1. If we're going to document the dashboard then let's really document the dashboard, showing what pages do what with images, diagrams, videos, and so on
> 2. If we're going for more of API documentation then we might consider adding these to the docstrings and using automatically generated sphinx-style docs. If we do this, I'd recommend moving this over to docs.dask.org
> 3. Just don't bother doing this, especially if it's not getting read
>
> Personally, I would do 1 or 3

Matt, I agree that there should be more dashboard documentation, and there has been an effort to start that (see https://docs.dask.org/en/latest/dashboard.html). The truth is that outside the main status page we don't have proper docs, and the dashboard docstrings are not necessarily good. The HTTP endpoints page is the only place that lists all the names of the dashboard pages, and it's a good reference for finding which tab corresponds to which plot.

> Yeah, last week twelve people went to that page. 100% of them bounced in about a minute. Rather than continue spending dev time developing that page in its current style I'm inclined instead to remove it. Thoughts?

Those twelve people, I believe, were Bryan, myself, and a couple more trying to find information about certain plots, and it was helpful. I agree it's not the best page. But until we have better docs on every page, I think we should keep it. Besides, the effort of maintaining this page is adding one line every time a plot is added, which I don't find to be much of a problem.

@mrocklin (Member, Author)

> Those twelve people, I believe, were Bryan, myself, and a couple more trying to find information about certain plots,

To be clear, when I say "twelve" I really mean that I'm rounding that down to zero. If there weren't at least hundreds of views (preferably thousands) then in my mind the doc page doesn't really provide value.

> But until we have better docs on every page, I think we should keep it. Besides, the effort of maintaining this page is adding one line every time a plot is added, which I don't find to be much of a problem.

Eh, things like this add inertia to the dev process. Asking everyone who makes a plot to add a line to a doc page that no one sees feels like unnecessary and useless work to me. I'd rather we trim off things that aren't useful and focus more on critical issues.

If we want to document the dashboard then great, but we need to work on documentation that has an impact. I'm against asking people to do work that doesn't have an impact. If the answer is "we should document this somewhere" then sure, I agree, but it needs to be somewhere that people actually look. Currently the dashboard is undocumented, and that page isn't helping to solve the problem.

@crusaderky (Collaborator)

The dashboard test is failing, likely because of the task spawned by the new widget:
https://github.com/dask/distributed/runs/5660831076?check_suite_focus=true

@crusaderky (Collaborator)

I can't find the new widget in the [More...] dropdown in the GUI splash page.
I guess it's because of this line?

"plots": [x.replace("/", "") for x in applications if "individual" in x],

@mrocklin (Member, Author)

> I can't find the new widget in the [More...] dropdown in the GUI splash page.
> I guess it's because of this line?

That's also my read of how things worked. I've added this but am still not seeing it locally. I'm not yet sure why.

```diff
@@ -7329,6 +7331,51 @@ async def get_call_stack(self, keys=None):
         response = {w: r for w, r in zip(workers, results) if r}
         return response
 
+    async def benchmark_hardware(self) -> "dict[str, dict[str, float]]":
```
@crusaderky (Collaborator) commented on this diff:

Could you add a docstring and explain the output?
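A sketch of the kind of docstring being requested here (illustrative wording based on the signature above; the text that was eventually merged may differ):

```python
async def benchmark_hardware(self) -> "dict[str, dict[str, float]]":
    """Run a benchmark on worker hardware.

    (Illustrative docstring only; not the wording that was merged.)

    Returns
    -------
    result : dict
        Maps the names "memory", "disk", and "network" to dictionaries
        that map payload sizes (e.g. "1 kiB") to measured bandwidths,
        averaged across the workers in the cluster.
    """
```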

@crusaderky (Collaborator) commented Mar 25, 2022

Outstanding points:

  • add docstring to Scheduler.benchmark_hardware
  • failing test test_scheduler_bokeh.py::test_simple
  • failing linters
  • plot does not appear in [More...] tab

This didn't pass pre-commit checks. I don't understand mypy well enough to diagnose why it's upset here.

```
distributed/dashboard/scheduler.py:133: error: No overload variant of "sorted" matches argument types "object", "Callable[[Any], Any]"
distributed/dashboard/scheduler.py:133: note: Possible overload variants:
distributed/dashboard/scheduler.py:133: note:     def [SupportsRichComparisonT] sorted(Iterable[SupportsRichComparisonT], *, key: None = ..., reverse: bool = ...) -> List[SupportsRichComparisonT]
distributed/dashboard/scheduler.py:133: note:     def [_T] sorted(Iterable[_T], *, key: Callable[[_T], Union[SupportsDunderLT, SupportsDunderGT]], reverse: bool = ...) -> List[_T]
distributed/dashboard/scheduler.py:138: error: "object" has no attribute "insert"
Found 2 errors in 1 file (checked 1 source file)
```
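For context, a minimal reproduction of this mypy behaviour (an illustration, not code from the PR): mypy joins the value types of a heterogeneous dict literal to `object`, so every lookup comes back typed as `object`, which neither `sorted` nor `.insert` accepts.

```python
# Hypothetical names, illustrating the inference problem only:
template_variables = {
    "pages": ["status", "workers"],  # list[str]
    "version": 1,                    # int -> value type widens to object
}

sorted(template_variables["pages"])         # mypy: no overload of "sorted"
template_variables["pages"].insert(0, "x")  # mypy: "object" has no attribute

# Annotating the variable as plain `dict` (i.e. dict[Any, Any]) erases the
# narrow inference, which matches the `template_variables: dict` fix in
# the diff crusaderky posts below:
template_variables_fixed: dict = {
    "pages": ["status", "workers"],
    "version": 1,
}
```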
@mrocklin (Member, Author)

  • add docstring to Scheduler.benchmark_hardware

    Done

  • failing test test_scheduler_bokeh.py::test_simple

    I'm not seeing this

  • failing linters

    I'm not seeing this either, but I'll check CI when it starts up again

  • plot does not appear in [More...] tab

    I found out what was going on here and fixed it. However mypy is unhappy and I'm not sure why.

I'm also happy to just close this PR. I've probably spent about as much time as I can justify here. If we can get this closed out soon then awesome, I'll keep pushing on it. If it's going to take more than, say, 30m of my time though then I'll need to prioritize elsewhere.

@mrocklin (Member, Author)

> failing linters
> I'm not seeing this either, but I'll check CI when it starts up again

Oh, I see: the `duration: str` change that you suggested forced an issue with `parse_timedelta`. I reverted the typing change.
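The likely mechanics (my reading of the conflict; the thread doesn't spell it out): `parse_timedelta` deliberately accepts more than strings, so narrowing the parameter annotation to `str` makes mypy reject the other inputs it supports.

```python
# parse_timedelta lives in dask.utils and accepts strings, numbers, and
# timedeltas alike; all of these are valid:
from datetime import timedelta

from dask.utils import parse_timedelta

parse_timedelta("1 s")                        # -> 1
parse_timedelta(2.5)                          # -> 2.5 (numbers pass through)
parse_timedelta(timedelta(milliseconds=500))  # -> 0.5

# So a signature like `duration: str = "1 s"` is stricter than what
# parse_timedelta supports, and reverting the annotation (as done here)
# keeps non-string durations working.
```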

@mrocklin (Member, Author)

See also dask/dask#8848

@crusaderky (Collaborator)

> • failing test test_scheduler_bokeh.py::test_simple
>   I'm not seeing this

It's visible in the latest CI run:
https://github.com/dask/distributed/runs/5697032918?check_suite_focus=true

```
def check_thread_leak():
E               AssertionError: (<Thread(ThreadPoolExecutor-100_1, started daemon 140416036894464)>, ['  File "distributed/worker.py", line 4622, in benchmark_disk
```

> failing linters
> I found out what was going on here and fixed it. However mypy is unhappy and I'm not sure why.

```diff
diff --git a/distributed/dashboard/scheduler.py b/distributed/dashboard/scheduler.py
index 154aaafb2..a7dc0a053 100644
--- a/distributed/dashboard/scheduler.py
+++ b/distributed/dashboard/scheduler.py
@@ -108,7 +108,7 @@ applications = {
 }
 
 
-template_variables = {
+template_variables: dict = {
     "pages": [
         "status",
         "workers",
diff --git a/distributed/worker.py b/distributed/worker.py
index 5501d50ac..16d6d789a 100644
--- a/distributed/worker.py
+++ b/distributed/worker.py
@@ -54,11 +54,11 @@ from distributed.comm.utils import OFFLOAD_THRESHOLD
 from distributed.compatibility import randbytes
 from distributed.core import (
     CommClosedError,
+    ConnectionPool,
     Status,
     coerce_to_address,
     error_message,
     pingpong,
-    rpc,
     send_recv,
 )
 from distributed.diagnostics import nvml
@@ -4657,7 +4657,7 @@ def benchmark_memory(
 
 async def benchmark_network(
     address: str,
-    rpc: rpc,
+    rpc: ConnectionPool,
     sizes: Iterable[str] = ("1 kiB", "10 kiB", "100 kiB", "1 MiB", "10 MiB", "50 MiB"),
     duration="1 s",
 ) -> dict[str, float]:
```
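Beyond the annotation that silences mypy, the worker hunk also resolves a name collision (my reading of the diff): the `rpc` parameter of `benchmark_network` shadowed the `rpc` class imported from `distributed.core`, making `rpc: rpc` self-referential. Annotating the parameter as `ConnectionPool`, which is what the scheduler's `self.rpc` actually is, keeps the annotation meaningful.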

> I'm also happy to just close this PR

I think we're almost done.

From follow-up commit messages:

Otherwise we let a lingering thread start, which makes some tests unhappy.

There is some chance that the worker might have an executor which doesn't work well with asyncio. That seems rare enough and this feature seems fringe enough that I'm totally willing to take that chance.
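The pattern these commit messages describe is roughly the following (a sketch under that reading, not the exact PR code): run the blocking benchmark on an executor the worker already manages, so no stray thread outlives the test.

```python
import asyncio
from concurrent.futures import ThreadPoolExecutor


def benchmark_disk() -> dict[str, float]:
    # Stand-in for the blocking disk benchmark
    return {"1 kiB": 0.0}


async def run_benchmark(executor: ThreadPoolExecutor) -> dict[str, float]:
    loop = asyncio.get_running_loop()
    # Reusing a managed executor (rather than spawning a fresh thread)
    # avoids the lingering-thread assertion from check_thread_leak.
    return await loop.run_in_executor(executor, benchmark_disk)
```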
@crusaderky (Collaborator) left a review:

LGTM. Thanks!

@crusaderky merged commit 06170d5 into dask:main on Mar 28, 2022
@mrocklin (Member, Author) commented Mar 28, 2022 via email
