
from upstream #7

Merged

merged 7 commits into jcoffi:master on Feb 15, 2023

Conversation

@jcoffi (Owner) commented Feb 15, 2023

Why are these changes needed?

Related issue number

Checks

  • I've signed off every commit (by using the -s flag, i.e., `git commit -s`) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests; see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

justinvyu and others added 7 commits February 14, 2023 14:35
This PR adds a `Tuner.restore(param_space=...)` argument. This allows object refs used in the original run to be updated on restore.

This is a follow-up to #31927

Signed-off-by: Justin Yu <[email protected]>
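
As a rough usage sketch (the experiment path and payload below are illustrative, not from this PR), the new argument lets the caller swap in fresh object refs when restoring:

```python
import ray
from ray.tune import Tuner

ray.init()

# Object refs captured in the original param_space become stale after a
# cluster restart, so re-create them and pass a fresh param_space on restore.
refreshed_data = ray.put({"train": [1, 2, 3]})  # illustrative payload

tuner = Tuner.restore(
    "~/ray_results/my_experiment",           # illustrative experiment path
    param_space={"data": refreshed_data},    # only the refs need to change
)
results = tuner.fit()
```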
…#32518)

* RLTrainer -> Learner
* TrainerRunner -> LearnerGroup

Signed-off-by: Kourosh Hakhamaneshi <[email protected]>
Previously, `get_preprocessor` would always serialize the Checkpoint into a dictionary first. This is incredibly wasteful and causes huge memory usage and long runtimes with large directory-based Checkpoints. This PR changes the logic to first check whether a directory Checkpoint actually needs to be loaded into a dictionary in order to obtain the preprocessor.

Context: I ran into this when trying to run predictions with a 25 GB Hugging Face model. `HuggingFacePredictor` calls `get_preprocessor` internally, which took ages to complete and almost caused an OOM for me - all of which is unnecessary, as the preprocessor has to be loaded from a file anyway.

Signed-off-by: Antoni Baum <[email protected]>
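
The shape of the change, sketched with illustrative names rather than Ray's actual internals (`PREPROCESSOR_FILE` and `load_full_checkpoint_as_dict` are hypothetical), is roughly:

```python
import os
import pickle

PREPROCESSOR_FILE = "preprocessor.pkl"  # illustrative name, not Ray's constant

def get_preprocessor_from_dir(checkpoint_dir: str):
    """Try to read the preprocessor straight from the checkpoint directory
    before falling back to materializing the whole checkpoint as a dict."""
    path = os.path.join(checkpoint_dir, PREPROCESSOR_FILE)
    if os.path.exists(path):
        with open(path, "rb") as f:
            return pickle.load(f)  # cheap: only the preprocessor file is read
    # Fallback (old behavior): load the entire checkpoint into memory.
    return load_full_checkpoint_as_dict(checkpoint_dir).get("preprocessor")  # hypothetical helper
```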
This PR improves Train lazy checkpointing with NFS setups. Previously, the logic to determine whether lazy checkpointing should be used was dependent on whether the Train worker-actor was on the same node as the Trainable actor. The new logic instead has the Trainable actor drop a marker file in the Trial's directory. If a worker-actor can detect that file, it means it can access the same directory as the Trainable actor.

This PR also fixes lazy checkpointing env var propagation.

Signed-off-by: Antoni Baum <[email protected]>
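
A minimal sketch of the marker-file handshake described above, with an illustrative file name rather than the one used in Ray's source:

```python
import os

MARKER_FILE = ".tune_shared_dir_marker"  # illustrative name

def drop_marker(trial_dir: str) -> None:
    """Trainable actor side: drop a marker file in the trial directory."""
    open(os.path.join(trial_dir, MARKER_FILE), "w").close()

def can_use_lazy_checkpointing(trial_dir: str) -> bool:
    """Worker actor side: if the marker is visible, the worker shares the
    trial directory with the Trainable (same node or a common NFS mount),
    so checkpoints can be written in place instead of being transferred."""
    return os.path.exists(os.path.join(trial_dir, MARKER_FILE))
```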
…_dir` w/ endpoint and params (#32479)

Currently, URI handling with parameters is done in multiple places in different ways (using `urllib.parse` or splitting by `'?'` directly). In some places, it's not done at all, which **causes errors when performing cloud checkpointing.** In particular, `Trial.remote_checkpoint_dir` and `Trainable._storage_path` do not handle URI path appends correctly when URL params are present.


Signed-off-by: Justin Yu <[email protected]>
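
A sketch of the kind of param-preserving path append this fixes, using `urllib.parse` (the helper name is illustrative):

```python
from urllib.parse import urlparse, urlunparse

def append_to_uri_path(uri: str, *segments: str) -> str:
    """Append path segments to a URI without dropping its query params."""
    parsed = urlparse(uri)
    path = parsed.path.rstrip("/")
    for segment in segments:
        path += "/" + segment.strip("/")
    return urlunparse(parsed._replace(path=path))

uri = "s3://bucket/exp?endpoint_override=http://localhost:9000"
# -> s3://bucket/exp/trial_0/checkpoint_000001?endpoint_override=http://localhost:9000
print(append_to_uri_path(uri, "trial_0", "checkpoint_000001"))
```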
Signed-off-by: SangBin Cho <[email protected]>

* Remove the tool sub-directory and put all the tools at the top level.
* Move the Ray dashboard to the top item of the monitoring & debugging section.
* Add a link to the overview page for all getting started guides.
* Add an observability section to the top-level getting started guide.
* Remove the verbose dashboard overview and add a picture instead. Note: I will make another PR to improve the overview page of the dashboard.
@jcoffi jcoffi merged commit 0894778 into jcoffi:master Feb 15, 2023