Understanding Ithaca's Evaluation/Checkpointing System #5

Open
alocaputo opened this issue Jun 26, 2023 · 1 comment

@alocaputo

Hello all,

I'm seeking clarification on how Ithaca's evaluation/checkpointing system works.

From my understanding, the evaluate function should calculate the evaluation metrics and store the checkpoint's pickle file on disk. However, I'm uncertain about when this function is called.

Currently, when I execute the code, it only generates a log file containing the training loss and accuracy. The log does not include the validation loss, and no checkpoint is produced.

Also, when I try to run:
python3 experiment.py --config=config.py --jaxline_mode=eval --logtostderr
it says:
Checkpoint None invalid or already evaluated, waiting.
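
This message is consistent with an evaluator that polls for a checkpoint and never finds one. The sketch below is a simplified, hypothetical illustration of that pattern, not Jaxline's actual code; `wait_for_new_checkpoint`, `get_latest_step`, and `evaluated_steps` are made-up names. It assumes the evaluator asks some checkpoint store for the latest step and gets `None` when nothing has been published.

```python
import time
from typing import Callable, Optional, Set


def wait_for_new_checkpoint(
    get_latest_step: Callable[[], Optional[int]],
    evaluated_steps: Set[int],
    poll_seconds: float = 10.0,
    max_polls: Optional[int] = None,
) -> Optional[int]:
    """Poll until a checkpoint step appears that has not been evaluated yet.

    Returns the new step, or None if max_polls is exhausted first.
    All names here are hypothetical and only illustrate the polling pattern.
    """
    polls = 0
    while max_polls is None or polls < max_polls:
        step = get_latest_step()  # None means no checkpoint is visible
        if step is not None and step not in evaluated_steps:
            return step
        print(f"Checkpoint {step} invalid or already evaluated, waiting.")
        time.sleep(poll_seconds)
        polls += 1
    return None
```

Under this assumption, if the eval process cannot see any checkpoint written by the training process, the step stays `None` and the loop keeps waiting, which would match the output above.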

Thank you for your time and assistance.

Best regards,
Alessandro

@Pragash-Mohanarajah

Pragash-Mohanarajah commented Feb 17, 2024

Hello all,

I have also been working independently on Ithaca's Transformer model.
I have been experiencing the same issues when trying to recreate the model.

My understanding of Jaxline is that intermediate training checkpoints are saved to TensorBoard itself, not as a pickle file. However, towards the end of training, it would be helpful to generate a pickle file to save the trained model.
I believe that training works relatively well, creating a TensorBoard log in the selected checkpoint directory.
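
As a rough idea of what saving the trained model to a pickle file could look like, here is a minimal sketch using only the standard library. The helpers `save_checkpoint` and `load_checkpoint` are hypothetical and are not part of Ithaca or Jaxline; the sketch assumes the training state can be turned into plain picklable Python objects (for example a nested dict of parameters plus a step counter).

```python
import os
import pickle
import tempfile
from typing import Any


def save_checkpoint(path: str, state: Any) -> None:
    """Write `state` to `path` as a pickle, replacing the file atomically."""
    directory = os.path.dirname(os.path.abspath(path))
    os.makedirs(directory, exist_ok=True)
    # Write to a temporary file in the same directory first, so a crash
    # mid-write never leaves a truncated checkpoint at `path`.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as f:
            pickle.dump(state, f, protocol=pickle.HIGHEST_PROTOCOL)
        os.replace(tmp_path, path)
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise


def load_checkpoint(path: str) -> Any:
    """Read back a state previously written by save_checkpoint."""
    with open(path, "rb") as f:
        return pickle.load(f)
```

With JAX arrays, the state would probably need to be copied to host memory first (for example with `jax.device_get`) before pickling; that step is left out here to keep the sketch dependency-free.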

However, as described above, the evaluation sequence continues to fail, regardless of how it is run.
There seems to be an issue with how the checkpoint directory is chosen; the model logs are not being saved as a result.
I have tried various adjustments to the original experiment.py file, but have had no success in building and saving a model.
Similar issues arose in Jaxline's parallel training/evaluation mode.

Could you please update the experiment.py file and any other associated files so that they work together?
It would also be helpful to know the exact package versions that worked for training and evaluation. New package versions have been released since Ithaca came out, often containing breaking changes.
An updated requirements.txt could resolve these difficulties.
I have found that Ithaca works seamlessly in its usual form, but struggles in --editable mode.
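
To make it easier to compare environments, a small script like the following could print the installed versions in requirements.txt style. This is only a sketch: `pinned_requirements` is a hypothetical helper, and the package list in the usage example is an assumption about what the training setup depends on, not the project's confirmed requirements.

```python
from importlib import metadata
from typing import Iterable, List


def pinned_requirements(packages: Iterable[str]) -> List[str]:
    """Return 'name==version' lines for installed packages.

    Packages that are not installed are reported as comment lines.
    """
    lines = []
    for name in packages:
        try:
            lines.append(f"{name}=={metadata.version(name)}")
        except metadata.PackageNotFoundError:
            lines.append(f"# {name}: not installed")
    return lines


if __name__ == "__main__":
    # Assumed package list; adjust it to match the project's requirements.txt.
    for line in pinned_requirements(["jax", "jaxlib", "jaxline", "dm-haiku", "optax"]):
        print(line)
```

Posting this output from an environment where training and evaluation both work would give others concrete versions to pin.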

Thank you very much for your time and consideration.
I look forward to hearing from you on this platform.

Kind Regards,
Pragash Mohanarajah
