Cloud Training
This feature allows training in the cloud, on Runpod or on any remote Linux server. Make sure to read the limitations and best-usage notes before using it; it will save you time.

It was already possible to train on a remote server or Runpod with the command line interface. With this new tab you can train remotely while using the OneTrainer UI on your own PC. It is the user-friendly option and offers additional advantages, such as saving models and anything else you need locally, and stopping or deleting the pod at the end, so there is no need to worry about extra cost if you go out or go to sleep.
You set your training configuration as usual and define the remote server and its options. When you hit "Start Training", OneTrainer uploads the concepts and training configuration to the remote server. During the training it downloads the remote workspace locally, so you can view samples, open TensorBoard, and access backups and saves just as you would for a local training.
During the training you will see its progression (epochs/steps). You can stop your local OneTrainer instance and switch off your computer; the training will continue in the cloud. When you start your local OneTrainer again, it will synchronize the remote workspace so you don't lose any information.
After the training, it will either stop or delete the pod (when using Runpod).
Every field on this tab has a very descriptive tooltip. This page is meant to make its usage even simpler.
Enable the cloud training with the Enabled switch (default OFF).
Choose Type: RUNPOD.
First create an API key with Read&Write permissions in your Runpod user settings.
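If you want to double-check the key before pasting it into OneTrainer, a minimal hedged sketch using the official runpod Python SDK (an extra dependency, not something OneTrainer requires) could look like this; the exact fields returned may vary between SDK versions:

```python
# Sketch only: verify a Runpod API key by listing your pods with the runpod SDK.
# Assumes "pip install runpod"; the printed field names are illustrative.
import runpod

runpod.api_key = "YOUR_RUNPOD_API_KEY"  # the key created in your Runpod user settings

for pod in runpod.get_pods():
    print(pod.get("id"), pod.get("name"))
```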
Create your SSH public and private keys (*) and upload the public key to the Runpod user settings.
(*) On Windows you can create these keys with the command ssh-keygen
in a Command Prompt or PowerShell. They will be placed in "C:\Users\username\.ssh"; open the public key id_ed25519.pub
(ed25519 is the default algorithm; it doesn't matter if you use another) in a text editor to get its value. It is a long string of text starting with the protocol name (e.g. ssh-ed25519) and ending with something that looks like an email address.
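If you prefer not to open the file in an editor, a small sketch like this (assuming the default key name and location) prints the public key so you can copy it into the Runpod settings:

```python
# Sketch: print the default ed25519 public key created by ssh-keygen on Windows.
from pathlib import Path

pub_key = Path.home() / ".ssh" / "id_ed25519.pub"
print(pub_key.read_text().strip())  # e.g. "ssh-ed25519 AAAA... user@host"
```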
Note: when deploying a new pod, it will use a Docker image with OneTrainer already installed, to save the OneTrainer installation time.
Note on the current state of development: do not change the name or set a passphrase for the SSH keys (just hit Enter when prompted). This may change in a future PR.
Now, you have the option to use an existing pod or to create a new one.
Enter your pod id under Cloud ID.
Leave hostname and port empty, keep user as default (root).
Directories: when deploying a pod, the workspace directory defaults to /workspace (Volume mount path), so change it only if you changed this value in your pod.
Install OneTrainer: it will install OneTrainer if the /workspace/OneTrainer directory is empty.
Create Cloud: unused.
Cloud Name / Type: unused with an existing pod.
GPU: unused.
Leave hostname, port and pod id empty, keep user as default (root).
Create Cloud: on.
Cloud Name (default OneTrainer): name of the pod.
Type: Community or Secure, see Runpod FAQ for more info.
Directories: keep it as default.
Install OneTrainer: on.
GPU: first click on the three dots next to it to update the list. Navigate to Pods in Runpod and select Deploy a Pod to see the characteristics, cost and availability of each GPU. Choose a GPU that at least fits your training needs in terms of RAM and VRAM.
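If you prefer scripting over the web UI, a hedged sketch with the runpod SDK can list the GPU types and their VRAM so you can compare them before picking one in the dropdown (field names may differ between SDK versions; the Runpod site remains the authoritative source for cost and availability):

```python
# Sketch only: list Runpod GPU types and their VRAM via the runpod SDK.
import runpod

runpod.api_key = "YOUR_RUNPOD_API_KEY"

for gpu in runpod.get_gpus():
    print(gpu.get("id"), gpu.get("displayName"), gpu.get("memoryInGb"), "GB")
```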
For any other remote Linux server the setup is the same as for Runpod, except that you must fill in the connection parameters yourself (hostname, port, user).
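Before pointing OneTrainer at a generic server, you can sanity-check the SSH connection yourself. A rough sketch with Paramiko (the library behind FABRIC_SFTP), using a placeholder hostname and key path:

```python
# Sketch: confirm the server accepts your SSH key and has a visible GPU.
# Hostname, port and key path are placeholders; adjust to your server.
import paramiko

client = paramiko.SSHClient()
client.set_missing_host_key_policy(paramiko.AutoAddPolicy())
client.connect(
    hostname="my-server.example.com",
    port=22,
    username="root",
    key_filename=r"C:\Users\username\.ssh\id_ed25519",
)
_, stdout, _ = client.exec_command("nvidia-smi --query-gpu=name,memory.total --format=csv")
print(stdout.read().decode())
client.close()
```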
Install / Update OneTrainer (ON) and OneTrainer directory (/workspace/OneTrainer): keep everything as default.
File sync method: NATIVE_SCP or FABRIC_SFTP. In case of doubt choose NATIVE_SCP, which is much faster on Windows; FABRIC_SFTP is the classic method based on Paramiko/Fabric and is slower on Windows.
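For the curious, this is roughly what an SFTP-style transfer looks like with Fabric, the library the FABRIC_SFTP method is built on (paths and hostname are placeholders; OneTrainer performs this synchronization for you):

```python
# Illustrative sketch: download one file from the remote workspace with Fabric.
from fabric import Connection

with Connection(
    host="my-server.example.com",  # placeholder
    user="root",
    port=22,
    connect_kwargs={"key_filename": r"C:\Users\username\.ssh\id_ed25519"},
) as conn:
    # Copy a remote sample into the local workspace (paths are examples only).
    conn.get("/workspace/run/samples/sample.png", "workspace/run/samples/sample.png")
```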
Min Download (default 0): defines the minimum download speed in Mbps required when choosing a GPU. Set it between 600 and 3000 to prevent download timeout exceptions. This is mostly relevant for the community cloud; I never faced the issue with the secure cloud, whatever the GPU (so the default of 0 was fine).
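To get a feel for why a few hundred Mbps is a reasonable floor, a quick back-of-the-envelope calculation (the model size is only illustrative):

```python
# Rough estimate: download time for a checkpoint at a given link speed.
size_gb = 6.5        # illustrative checkpoint size in gigabytes
speed_mbps = 600     # the "Min Download" value, in megabits per second
seconds = size_gb * 8 * 1000 / speed_mbps
print(f"~{seconds:.0f} s (~{seconds / 60:.1f} min)")  # ~87 s at 600 Mbps
```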
Jupyter password: password for Jupyter access. If you leave it empty, Jupyter doesn't start at all.
Actions on finish / error: either do nothing, stop, or delete the pod. Self-explanatory.
Download Samples / Output Model / Saved Checkpoints...: synchronization options. Downloading backups is not recommended but is provided as an option.
Delete remote workspace: option to save space on your remote server.
Tensorboard access: either activate the TensorBoard TCP tunnel, in which case you don't need to download the TensorBoard logs, or deactivate the TCP tunnel and download the TensorBoard logs instead. Note: when not using the TCP tunnel, it is recommended not to expose the samples to TensorBoard, to avoid download time.
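If the TCP tunnel is off and you download the TensorBoard logs, you can point a local TensorBoard at the synced workspace. A hedged sketch using TensorBoard's programmatic API (the logdir path is a placeholder; use your actual local workspace directory):

```python
# Sketch: launch a local TensorBoard on downloaded logs.
from tensorboard import program

tb = program.TensorBoard()
tb.configure(argv=[None, "--logdir", "workspace/run/tensorboard"])  # placeholder path
print("TensorBoard running at", tb.launch())  # e.g. http://localhost:6006/
```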
Detach Remote Trainer (default OFF):
- If selected, the training will keep running on the cloud if you lose the connection or close the local OneTrainer instance. At the end of the training it will behave as you set in the option "action on detach finish". You can synchronize back with the button "reattach now"; it will download the latest logs, and the model if finished.
Note: if you can't start the pod because no GPU is available (message in the console), you can manually start the pod on Runpod without a GPU; you will still be able to connect to it and download the model.
- If not selected, it behaves like a local OneTrainer instance: if you close the local OneTrainer instance, the training is stopped and the pod keeps running.
Use SDP attention; it is much faster on Linux, and pods run on Linux...
Actions on finish/error/detach or not: STOP (safest option), saving cost if you plan to continue training in the next days, as everything is kept in the pod (OneTrainer installation, model, datasets and workspaces).
Important note: if you stop the training manually, the pod will keep running. This is intentional: when setting up a new training, you may realize after a few steps that you forgot to change some options for sampling, training, whatever. Stop the training, adapt your settings and start it again. The pod is waiting for you: no need to upload the datasets again or download the models, and no risk that no GPU is available anymore.
Detach Remote Trainer: ON if you plan to shut down your PC or local OneTrainer, have an unstable connection, Windows automatic updates or whatever. OFF if you're confident.
Currently the solution can only perform actions on the remote server. E.g. Stop, "backup now" and "sample now" work, but editing samples and some training parameters on the fly does not. It may be implemented in future releases.
So if you realize you need to change a setting (e.g. sample resolution, a training setting), it is better to stop the training, edit the settings and start it again.
For expert users:
You can now use paths, as in the screenshot, to refer to files and directories that you don't have locally but only on the cloud. This can be useful when you have trained a LoRA and now want to continue it, or for large datasets that you want to download from third-party cloud storage onto RunPod and use without having them locally.