ERROR: Exception in ASGI application: IndexError: index 24 is out of bounds for axis 0 with size 24, and meanwhile the search hangs. #2624

Closed
thuzhf opened this issue Mar 28, 2023 · 14 comments
Labels
area / SDK-storage Issue area: SDK and storage related issues help wanted Extra attention is needed phase / shipped Issue phase: shipped type / bug Issue type: something isn't working

@thuzhf

thuzhf commented Mar 28, 2023

🐛 Bug

aim up server log:

image

Search hangs in aim web UI:
image

To reproduce

I am not sure how to reproduce this exactly. It may occur if you run many experiments/runs and then search, regardless of whether you search while the experiments are running or after they have finished.

Expected behavior

The search does not hang, and the aim up server log shows no exceptions.

Environment

  • Aim Version (e.g., 3.0.1): 3.16.2 & 3.17.1
  • Python version: 3.9.16
  • pip version: 23.0.1
  • OS (e.g., Linux): Linux
  • Any other relevant information
@thuzhf thuzhf added help wanted Extra attention is needed type / bug Issue type: something isn't working labels Mar 28, 2023
@SGevorg
Member

SGevorg commented Mar 28, 2023

@thuzhf thanks for sharing this. Is this a blocker on your end?

@thuzhf
Author

thuzhf commented Mar 28, 2023

@SGevorg What is blocked on my end? The server shows this exception log, but is still running. And the aim web UI search will hang when it reaches 255 of 1798.

@SGevorg
Member

SGevorg commented Mar 28, 2023

ah I see, Aim hangs across all explorers on the UI?

@thuzhf
Author

thuzhf commented Mar 28, 2023

@SGevorg I just tested all explorers and found that only the metrics explorer hangs. The runs explorer shows only the first 45 matched runs.

@SGevorg
Member

SGevorg commented Mar 28, 2023

Thanks @thuzhf :-)
Strange. Have you tracked things other than metrics?
Runs explorer is virtualized and auto-loads the rest of the runs if you scroll down (if it doesn't then that's a bug, but hope not haha).

@thuzhf
Author

thuzhf commented Mar 28, 2023

@SGevorg

  1. Yes, I only tracked metrics.
  2. Yes, runs explorer auto-loads the rest of the runs if I scroll down.

@gorarakelyan
Contributor

@thuzhf the search hangs because the exception mentioned above is raised. The issue seems to appear when sorting metric values by step. Do you specify steps when tracking metrics?

@thuzhf
Author

thuzhf commented Apr 2, 2023

@gorarakelyan Do you mean steps like the following? I didn't specify them.
image

@thuzhf
Author

thuzhf commented Apr 6, 2023

@SGevorg Today the search hung in all explorers, and there were no error logs from the aim server.
In every explorer, the search hung here:
image

@SGevorg
Member

SGevorg commented Apr 6, 2023

this is definitely terrible news.
@mihran113 @gorarakelyan @roubkar fyi

@roubkar
Contributor

roubkar commented Apr 6, 2023

@thuzhf does the browser page freeze or can you still use the UI (navigate to other pages from the sidebar, hit the cancel button, etc..)?
would like to understand whether you're facing a UI performance issue or there's no response from the server

@thuzhf
Author

thuzhf commented Apr 6, 2023

@roubkar It does not freeze. I can still use the UI in the ways you describe. It seems there's no response from the server.

@mihran113
Contributor

Hey @thuzhf! I've artificially recreated this case on my end. Because the tracking call for aim.Run is not atomic, the step data can already be written when the process terminates before the value data is written, which results in the error mentioned above. I've opened a pull request that will prevent such cases from happening in the future.
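
The failure mode described here can be pictured with a small sketch. It is not Aim's storage code: the `steps` and `values` lists, the `track` function, and the `Crash` exception are hypothetical stand-ins, and plain Python lists are assumed in place of the arrays that produce the "index 24 is out of bounds for axis 0 with size 24" message. The point is only that a step written without its value leaves 25 steps but 24 values, so a lookup driven by the steps runs past the end of the values.

```python
# Illustrative sketch only; none of these names come from Aim.
steps = []   # stands in for the stored step sequence
values = []  # stands in for the stored value sequence


class Crash(Exception):
    """Simulates the process terminating in the middle of a write."""


def track(value, step, crash_before_value=False):
    # Two separate writes: the (step, value) pair is not written atomically.
    steps.append(step)
    if crash_before_value:
        raise Crash("terminated after writing the step")
    values.append(value)


# 24 complete writes, for steps 0..23.
for i in range(24):
    track(float(i), step=i)

# The 25th write is interrupted: step 24 is stored, its value is not.
try:
    track(24.0, step=24, crash_before_value=True)
except Crash:
    pass


def values_sorted_by_step():
    # Order the indices by step, then look each index up in `values`.
    order = sorted(range(len(steps)), key=steps.__getitem__)
    return [values[i] for i in order]


try:
    values_sorted_by_step()
    error = None
except IndexError as exc:
    error = exc  # index 24 requested from a sequence of length 24
```

Here `len(steps)` is 25 and `len(values)` is 24, which matches the sizes reported in the issue title.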

Regarding how to fix your logs:
I've pushed some changes to a branch named fix/temporary-fix-for-index-error. You can install from the source of that branch using
pip install git+https://github.com/aimhubio/aim.git@fix/temporary-fix-for-index-error
This will print the corrupted metric's run hash, metric name, and context in the terminal of the aim up command. Using that information, you'll need to override the corrupted step with this simple script:

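from aim import Run

# Replace each {placeholder} with the value printed in the `aim up` terminal.
run = Run(repo={path_to_your_repo}, run_hash={run_hash}, read_only=False)
# Overwrite the corrupted step (24 in the error above) with a value of 0.
run.track(0, name={metric_name}, context={context}, step=24)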

Let me know if you encounter any issues during the process, will be glad to help.
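
As a rough picture of what the override accomplishes, the sketch below pads a value sequence so that every recorded step has a value again. It uses plain lists and a hypothetical helper, `repair_missing_values`, which is not part of Aim. It also assumes the missing values are the trailing ones, as in the interrupted-write case described above.

```python
def repair_missing_values(steps, values, fill=0):
    """Pad `values` in place so every recorded step has a value.

    Hypothetical helper for illustration; returns the steps that were filled.
    """
    if len(values) > len(steps):
        raise ValueError("more values than steps; not the case described here")
    # Steps beyond the end of `values` never received their value.
    missing = steps[len(values):]
    values.extend([fill] * len(missing))
    return missing


# 25 steps were stored, but only 24 values.
steps = list(range(25))
values = [float(i) for i in range(24)]

filled = repair_missing_values(steps, values)
```

After the call, `filled` contains only step 24 and both sequences have 25 entries, which mirrors tracking a value of 0 at step 24 in the script above.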

@mihran113 mihran113 self-assigned this Apr 11, 2023
@mihran113 mihran113 added the area / SDK-storage Issue area: SDK and storage related issues label Apr 11, 2023
@mihran113 mihran113 added this to the v3.17.x milestone Apr 11, 2023
@mihran113 mihran113 added the phase / review-needed Issue phase: issues that are done and needs review label Apr 11, 2023
@mihran113 mihran113 moved this to Current in Aim roadmap Apr 11, 2023
@mihran113 mihran113 added phase / ready-to-go Issue phase: issues that are merged and will be included in the upcoming release and removed phase / review-needed Issue phase: issues that are done and needs review labels Apr 27, 2023
@mihran113 mihran113 added phase / shipped Issue phase: shipped and removed phase / ready-to-go Issue phase: issues that are merged and will be included in the upcoming release labels Nov 14, 2023
@mihran113
Contributor

Closing the issue as the fix was shipped with aim v3.17.4.

@mihran113 mihran113 moved this from Patch-issues to Done in Aim roadmap Nov 14, 2023