[Feature Request] append to existing lcurve for restart training #4039

Closed
anyangml opened this issue Aug 1, 2024 · 4 comments
Comments

@anyangml
Collaborator

anyangml commented Aug 1, 2024

Summary

Currently, when a training run is restarted, the lcurve.out file is overwritten. It would be more natural to append to the existing file, so that the entire training history is kept in a single file.

Detailed Description

When a training run crashes and I want to restart it from the latest checkpoint, I have to move the checkpoint and input file into a separate folder to prevent the log files from being overwritten. This adds complexity to downstream workflows such as training history visualization. If there is no technical obstacle, I believe appending to the existing log file is more desirable than overwriting it.
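
To make the requested behavior concrete, here is a minimal sketch of append-on-restart logging. It is not DeePMD-kit's actual implementation: the function name `write_lcurve`, the `restart` flag, and the header columns are all hypothetical and only illustrate the idea of opening the file in append mode when a restart finds an existing, non-empty log.

```python
from pathlib import Path

# Hypothetical header; the real lcurve.out columns depend on the training setup.
HEADER = "# step rmse_val rmse_trn lr\n"


def write_lcurve(path, rows, restart=False):
    """Write learning-curve rows, appending if this is a restart.

    `rows` is an iterable of (step, rmse_val, rmse_trn, lr) tuples.
    """
    p = Path(path)
    # Append only when restarting and a non-empty log already exists.
    keep_history = restart and p.exists() and p.stat().st_size > 0
    with p.open("a" if keep_history else "w") as f:
        if not keep_history:
            # A fresh file gets the header exactly once.
            f.write(HEADER)
        for step, rmse_val, rmse_trn, lr in rows:
            f.write(f"{step} {rmse_val:.3e} {rmse_trn:.3e} {lr:.1e}\n")
```

With this pattern, a first call with `restart=False` creates the file, and a later call with `restart=True` adds rows below the existing ones, so a plotting script can read the whole history from one file.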

Further Information, Files, and Links

No response

@njzjz
Member

njzjz commented Aug 1, 2024

> Currently, when a training run is restarted, the lcurve.out file is overwritten.

I can't reproduce this with the water example. Do you have any specific configuration (e.g. multi-task)?

@anyangml
Collaborator Author

anyangml commented Aug 2, 2024

Interesting, this happens to me for both single-task and multi-task training. Maybe it's something associated with the cloud server, Ali-Pai. I will look into that.

Here is the single-task training input:
image

command: dp --pt train input.json --restart model.ckpt-200200.pt --skip-neighbor-stat

image

@njzjz
Member

njzjz commented Aug 2, 2024

Are you sure you are using the latest code? It needs to be at least after #3985.

@anyangml
Collaborator Author

anyangml commented Aug 4, 2024

I see, I didn't notice this was already available.

@anyangml anyangml closed this as completed Aug 4, 2024