Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Gapis timeout #773

Closed
dsrbecky opened this issue Jul 24, 2017 · 3 comments
Closed

Gapis timeout #773

dsrbecky opened this issue Jul 24, 2017 · 3 comments

Comments

@dsrbecky
Copy link
Contributor

My gapis keeping exiting on a big trace:
W 00:17:00.031 gapis Stopping GAPIS server as it has been idle for more than 3.333333333s (--idle-timeout)
I am not stopping any process, and running just at info log-level.

Also, should we it an error? It took me a while to notice it all the other log messages.

It also seems that gapis will explicitly kill gapir process on timeout. Can we just let gapir die on its own? I mistook the abrupt death for driver crash.

@ben-clayton
Copy link
Contributor

"3.333333333s" is a bug, it's actually 10 seconds.

I've seen this as well, but I'm not sure what the correct solution is.
It seems odd / wrong that GAPIC hasn't sent a ping to GAPIS within 10 seconds when it should be pinging at 1 second intervals.

I guess it's possible that with the workload of opening a large trace the heartbeat thread is just not scheduled. If so, the only short-term solution I can think of is to bump the timeouts. @pmuetschard proposed scanning for process IDs, but that involves per-platform code.

Also, should we it an error?

Sure.

It also seems that gapis will explicitly kill gapir process on timeout. Can we just let gapir die on its own?

When gapis times out, the main function returns, the main context is cancelled so that things clean up, and any outstanding processes are terminated so we don't orphan them. I'd argue that killing child processes is the correct thing to do when we exit - but maybe we should be using an interrupt instead of forcefully closing the child process. Unfortunately that isn't supported on windows.

@dsrbecky
Copy link
Contributor Author

Increased the timeout in #777
It still dies after a minute though

@AWoloszyn
Copy link
Contributor

FYI, I got the 1M timeout on Windows for a particularly large trace. Things had not hung yet though.

ben-clayton added a commit to ben-clayton/gapid that referenced this issue Aug 1, 2017
I believe RPC limits the number of concurrent RPCs, and once a bunch of long-running calls consume all the slots, the pings don't get through.

Count the number of concurrent RPCs, and consider this when deciding to timeout.

Fixes: google#773
ben-clayton added a commit that referenced this issue Aug 1, 2017
I believe RPC limits the number of concurrent RPCs, and once a bunch of long-running calls consume all the slots, the pings don't get through.

Count the number of concurrent RPCs, and consider this when deciding to timeout.

Fixes: #773
purvisa-at-google-com pushed a commit that referenced this issue Sep 29, 2022
Image Format conversion in AGI done by the converters that are
registered in init time of Go Packages. There are details about this
but simply init function in go packages runs before main when the
packages are loaded.

Recently #651 moved ETC conversion to its own packages similar to ASTC
but different from ASTC, image creation remained in the parent package.
As all the conversion functions called via register map that has registered
during the init, this causes no function from the etc package, which ended up
with it's not being loaded therefore no converter registered.

This is a workaround that just create translator functions to reflect the image
create functions in the parent package and uses those functions in Vulkan/resources
to ensure the package is initialized.

This is suboptimal and only a workaround. I am creating another PR with a refactor
of the image formats. This is only a workaround until the other one can be merged.
purvisa-at-google-com pushed a commit that referenced this issue Sep 29, 2022
This reverts commit 8402534.

This commit breaks compilation on Windows.
purvisa-at-google-com pushed a commit that referenced this issue Sep 29, 2022
* Refactors the ETC proto structs to simplify the code around ETC format image creation

This is a work continuation of #773

The way that ETC protos structured, end up with lots of constructors that are not really necessary and to avoid duplicate them, actual converters and the image format definitions were in different packages that caused b/184954321

While investigating that bug, I have realized the situation and created an initial fix. This requires more work later that I discuss in the design doc attached to refactor work's bug tracker.

For a summary, all ETC formats are under one struct and the information that buried into their name now lies as fields in proto, which ended up having much more simpler code to interact with it.

Refactor Work Bug Tracking: b/186110226
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants