Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Disable bodyTimeout and headersTimeout when fetching from backend #1853

Merged
merged 1 commit into from
Jun 18, 2023

Conversation

Splendide-Imaginarius
Copy link
Contributor

Fixes bug where CLI would crash if the chain took more than 5 minutes to process.

@RunDevelopment
Copy link
Member

@Splendide-Imaginarius Please explain the bug you're trying to fix a bit more. CLI and GUI use the same code for talking to the backend. The GUI obviously doesn't have this timeout problem, so why should the CLI have it?

I also want to point out that the native fetch method should not have a timeout built into it. So I don't understand how this can be an issue at all.

@Splendide-Imaginarius
Copy link
Contributor Author

Default fetch implementation in Node has a timeout of 5 minutes (300 seconds); fetch is implemented using a bundled version of undici, and the docs for that package say that the default timeout is 300 seconds. If I run a chain via CLI (with current main branch), and fetching the /run endpoint takes longer than 5 minutes (the timing matches down to a few milliseconds, so I'm quite confident this is the issue), an exception in undici.fetch is thrown, with the message TypeError: fetch failed. This symptom goes away with this PR applied.

Why does this only affect CLI? I'm honestly not sure. I spent several hours trying to trace why this behavior is CLI-specific and didn't turn up much. My working hypothesis is that the fetch of /run fails in GUI too, but that the exception is caught in a way that doesn't cause any user-visible symptoms. The error handling is a little different in CLI (see

process.on('uncaughtException', fatalErrorInMain);
, and note that there's no try/catch in
const response = await backend.run({
data,
options,
sendBroadcastData: false,
});
whereas there is a try/catch in the equivalent GUI code. Conceptually, I don't think it's correct behavior for a /run fetch to time out after 5 minutes, since a chain can easily take longer than that, so I believe the correct fix is to disable the timeout rather than mess with exception handling to induce the same masking behavior in the CLI that the GUI seems to have. I suppose maybe some other endpoints might have different optimal timeouts -- maybe it would be good to tweak the timeout per endpoint if needed.

@Splendide-Imaginarius
Copy link
Contributor Author

(If you happen to have a better understanding of why this behavior is CLI-only, maybe you can suggest a better fix than I came up with.)

@Splendide-Imaginarius
Copy link
Contributor Author

I see that the Linter is throwing a warning on CI; I'll fix that after we've figured out whether this patch is following the right approach.

@joeyballentine
Copy link
Member

Maybe this issue does affect the GUI? I don't think I've tried a > 5 minute request since we switched to fetch

@Splendide-Imaginarius
Copy link
Contributor Author

I guess the first question here is, can you guys reproduce the TypeError: fetch failed exception in CLI mode? If so, then yes, checking if there's some way to reproduce in the GUI would make sense. If not, then I'd need to investigate whether there's some strange issue in my setup that's causing different behavior.

@Splendide-Imaginarius
Copy link
Contributor Author

I'm pretty sure that when I tried to do a >5 minute chain in the GUI, I didn't see any user-facing errors, but I'm not sure what exactly would be expected to happen if that fetch fails in the GUI. From looking at the GUI code, it doesn't look like it would trigger a complete exit of the process like the CLI does.

@Splendide-Imaginarius
Copy link
Contributor Author

Splendide-Imaginarius commented Jun 10, 2023

On the very off chance that the node types have something to do with triggering the bug, I ran into the error in CLI while running a spritesheet iterator; the individual iterations were only a few seconds long, but the total time spent in the spritesheet iterator was on the order of 6-7 minutes, so just long enough to trigger the bug.

@RunDevelopment
Copy link
Member

Default fetch implementation in Node has a timeout of 5 minutes (300 seconds); fetch is implemented using a bundled version of undici, and the docs for that package say that the default timeout is 300 seconds.

Why does fetch even have an implementation-defined timeout? The standard doesn't mention this. MDN doesn't mention this.

Conceptually, I don't think it's correct behavior for a /run fetch to time out after 5 minutes, since a chain can easily take longer than that, so I believe the correct fix is to disable the timeout rather than mess with exception handling to induce the same masking behavior in the CLI that the GUI seems to have

I very much agree.

I also want to point out this: Since a timeout behavior is not specified in the standard, the timeout error fetch returns is not guaranteed to be consistent across implementations. So we can't even detect the timeout error in an platform/runtime/implementation-independent way.

My working hypothesis is that the fetch of /run fails in GUI too, but that the exception is caught in a way that doesn't cause any user-visible symptoms.

This might be the case. We ignore are certain kind of exception. If Chromium implements its 300s timeout as an internal abort controller, it would make sense that the UI doesn't show us an error.

} catch (err: unknown) {
if (!(err instanceof DOMException && err.name === 'AbortError')) {
sendAlert({
type: AlertType.ERROR,
message: `An unexpected error occurred: ${String(err)}`,
});
}
} finally {

So I tested this. I added time.sleep(10 * 60) (sleep for 10min) to one of our nodes, removed the !(err instanceof DOMException && err.name === 'AbortError') condition to always show errors. No timeout error. fetch waited 10min for the /run request to complete.

This is weird because nodejs/undici#1373 mentions that Chrome has a timeout of 300s. So either Electron is setting a different timeout or something weird is going on.

Note: The if before sending out error messages is to catch intentionally aborted requests. When stopping on ongoing execution, the backend may be slow to respond, so we kill it after a short delay. (#1521)

I also tried the same chain using CLI. I was able to reproduce the error:

12:16:45.034 > Read chain file C:\\Users\\micha\\Desktop\\cli timeout.chn
12:16:45.045 > Backend: [19336] [INFO] Running new executor...
12:21:45.061 > TypeError: fetch failed
    at Object.fetch (node:internal/deps/undici/undici:11118:11)
    at async Backend.fetchJson (C:\Users\micha\Git\chaiNNer\.webpack\main\index.js:23663:18)
    at async runChainInCli (C:\Users\micha\Git\chaiNNer\.webpack\main\index.js:29472:20)
12:21:45.062 > Cleaning up temp folders...

So how do we fix this?

I was able to verify that your PR correctly fixes the issue for CLI, but it causes issues in the GUI:
image

So I guess one way around that would be to detect whether we are currently inside node or a browser. This would fix CLI, but it would also leave the fate (correctness) of backend requests in the GUI up to the whims of Electron, chromium, and v8.

@Splendide-Imaginarius
Copy link
Contributor Author

Good analysis @RunDevelopment , thanks for that. Yeah, seems like a mess.

So I guess one way around that would be to detect whether we are currently inside node or a browser. This would fix CLI, but it would also leave the fate (correctness) of backend requests in the GUI up to the whims of Electron, chromium, and v8.

I mean, I don't exactly like this solution, but I definitely can't think of a better one. Might be the least bad option.

@Splendide-Imaginarius
Copy link
Contributor Author

What's the preferred way to check whether we're running in Node vs Electron? I assume there are lots of ways to do this that will usually work, but if there's already a pattern you use in chaiNNer, I guess I should probably copy that for this PR.

@joeyballentine
Copy link
Member

I don't think we have a way of checking that per se, we just have different entry points for CLI vs GUI

@RunDevelopment
Copy link
Member

Node vs Electron

We have isRenderer from src/common/env.ts.

@Splendide-Imaginarius
Copy link
Contributor Author

We have isRenderer from src/common/env.ts.

Good pointer, thanks @RunDevelopment. PR updated to only change behavior in CLI mode.

@RunDevelopment
Copy link
Member

Please run ESLint to fix the formatting: npm run lint-fix

@Splendide-Imaginarius
Copy link
Contributor Author

For some reason npm run lint-fix doesn't work on my machine (I think my ESLint version is too old, and I'm not motivated enough to get a newer release), but I was able to use the CI output to figure out the needed changes (I think).

@Splendide-Imaginarius
Copy link
Contributor Author

The TypeScript linter errors may take me a few days to fix.

@RunDevelopment
Copy link
Member

I think my ESLint version is too old, and I'm not motivated enough to get a newer release

What your version of ESLint? This project has ESLint in its dev deps, so we just use our version of ESLint. You did run npm ci, right?

The TypeScript linter errors may take me a few days to fix.

Here's what is happening: Native fetch and undici fetch have different interfaces for RequestInit. They are both called RequestInit, but they are different interfaces. This is why you get the well-worded and not-confusing-at-all error:

Argument of type 'RequestInit' is not assignable to parameter of type 'RequestInit & RequestInit'

In line 108 const options: RequestInit, we declare options as the native RequestInit and then try to use it for both fetches.

Here's how I would fix it:

@@ -1,4 +1,4 @@
-import { Agent as UndiciAgent, fetch as undiciFetch } from 'undici';
+import * as undici from 'undici';
 import {
     BackendJsonNode,
     Category,
@@ -105,12 +105,12 @@ export class Backend {
     }

     private async fetchJson<T>(path: string, method: 'POST' | 'GET', json?: unknown): Promise<T> {
-        const options: RequestInit = isRenderer
+        const options: RequestInit & undici.RequestInit = isRenderer
             ? { method, cache: 'no-cache' }
             : {
                   method,
                   cache: 'no-cache',
-                  dispatcher: new UndiciAgent({
+                  dispatcher: new undici.Agent({
                       bodyTimeout: 0,
                       headersTimeout: 0,
                   }),
@@ -123,7 +123,7 @@ export class Backend {
             };
             options.signal = signal;
         }
-        const resp = await (isRenderer ? fetch : undiciFetch)(
+        const resp = await (isRenderer ? fetch : undici.fetch)(
             `http://127.0.0.1:${this.port}${path}`,
             options
         );

Fixes bug where CLI would crash if the chain took more than 5 minutes to
process.
@Splendide-Imaginarius
Copy link
Contributor Author

What your version of ESLint? This project has ESLint in its dev deps, so we just use our version of ESLint. You did run npm ci, right?

Hadn't realized that; npm shipped its own ESLint dependency, which seems to be the wrong version for this, and I wasn't aware that chaiNNer has its own, newer ESLint dependency in package.json. Thanks for pointing that out. Linting passes locally here now; hopefully the CI will pass as well.

Copy link
Member

@RunDevelopment RunDevelopment left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking good! Thank you @Splendide-Imaginarius!

@joeyballentine joeyballentine merged commit f1e394f into chaiNNer-org:main Jun 18, 2023
@Splendide-Imaginarius Splendide-Imaginarius deleted the cli-timeout branch June 30, 2023 11:47
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants