Feature/sg 521 gpu tests #587

Merged
merged 74 commits into from
Dec 29, 2022
57c077d
workflow added
shaydeci Dec 19, 2022
077e43d
first tests added
shaydeci Dec 19, 2022
0ef726a
sanity tests moved
shaydeci Dec 19, 2022
7f5bf04
-m removed
shaydeci Dec 19, 2022
bd0dc11
env var added
shaydeci Dec 19, 2022
5e930d1
installation from branch added
shaydeci Dec 19, 2022
933ab79
more changes
shaydeci Dec 19, 2022
a195d97
command fix
shaydeci Dec 19, 2022
5829e10
formatt
shaydeci Dec 20, 2022
9b065d7
remove env adde to recipe+tests
shaydeci Dec 20, 2022
799c55c
Merge remote-tracking branch 'origin/master' into feature/SG-521_gpu_…
shaydeci Dec 20, 2022
6183ee3
command fix in config
shaydeci Dec 20, 2022
567dcdf
torchrun instead of python
shaydeci Dec 20, 2022
35a1b38
command update
shaydeci Dec 20, 2022
6d4ca53
hydra full error env var
shaydeci Dec 20, 2022
6e7a7f0
train from recipe cmd
shaydeci Dec 20, 2022
d084309
torch installation fix
shaydeci Dec 20, 2022
e06eabe
protobuf version try
shaydeci Dec 22, 2022
f5c6a11
lets get this running
shaydeci Dec 22, 2022
1ffd2bb
lets get this working
shaydeci Dec 22, 2022
422de1f
let make this work2
shaydeci Dec 22, 2022
27f866b
lets make this work 3.0
shaydeci Dec 22, 2022
6adf90f
let make this work 4.0
shaydeci Dec 25, 2022
f7f8beb
Merge remote-tracking branch 'origin/master' into feature/SG-521_gpu_…
shaydeci Dec 25, 2022
ea7c9a5
lets make this work 5.0
shaydeci Dec 25, 2022
f63bd60
coco try
shaydeci Dec 25, 2022
04d9528
coco try yolox
shaydeci Dec 25, 2022
38cd95f
coco try yolox fix num gpus
shaydeci Dec 25, 2022
92931e7
reordr installs
shaydeci Dec 25, 2022
4e2fca2
order installs + python3.8 removed
shaydeci Dec 25, 2022
551ce4d
Merge remote-tracking branch 'origin/master' into feature/SG-521_gpu_…
shaydeci Dec 25, 2022
ee3f7bd
order installs + python3.8
shaydeci Dec 25, 2022
9b79d9c
torch 1.12
shaydeci Dec 25, 2022
4db4ff2
linter
shaydeci Dec 25, 2022
6f7384c
cleanup and 11.6
shaydeci Dec 25, 2022
c39a9c1
11.6 with 2 epochs
shaydeci Dec 25, 2022
9cf0e09
dist launch used
shaydeci Dec 25, 2022
b6b0eab
dataset params lines removed
shaydeci Dec 25, 2022
d35cc2f
nccl debug
shaydeci Dec 25, 2022
56ccad1
assert with abs, cifar rolled back
shaydeci Dec 26, 2022
a8a8e2b
cifar recipe fix
shaydeci Dec 26, 2022
be0acec
formatting
shaydeci Dec 26, 2022
ea816dc
formatter
shaydeci Dec 26, 2022
179d6fd
teardown added to test + seg and det tests added
shaydeci Dec 26, 2022
d4f2da3
formatting
shaydeci Dec 26, 2022
6f1564f
large delta for det so it passes
shaydeci Dec 26, 2022
adf9245
larger shm 2nd try det
shaydeci Dec 26, 2022
e656aa8
40g shm
shaydeci Dec 26, 2022
66d9667
yolox goal map updated
shaydeci Dec 27, 2022
86db15d
coverage run added to config
shaydeci Dec 27, 2022
9d886c4
typo in config
shaydeci Dec 27, 2022
3784966
max epochs fix
shaydeci Dec 27, 2022
f5fe241
max epochs fix2
shaydeci Dec 27, 2022
e624d33
old test removed
shaydeci Dec 27, 2022
c49b7b8
format
shaydeci Dec 27, 2022
09ae870
exit 0 added to train from recipe
shaydeci Dec 27, 2022
53dd3fb
exit code moved to hydra main
shaydeci Dec 27, 2022
531fd3c
exit 0 addded
shaydeci Dec 27, 2022
7c4323b
exit code for ddp
shaydeci Dec 27, 2022
054410a
cifar num epochs fix 100
shaydeci Dec 27, 2022
603d8c0
added determinism for train from recipe and commands for yolox and re…
shaydeci Dec 27, 2022
5f7a3f5
torch deterministic mode fix
shaydeci Dec 27, 2022
af6a7e1
env var for reproducibality
shaydeci Dec 27, 2022
f1e2ebe
2nd option for env var
shaydeci Dec 27, 2022
7314b6e
cublas envvar
shaydeci Dec 27, 2022
76d2d7f
remove determins flags
shaydeci Dec 27, 2022
75bd128
yolox test arch set to n
shaydeci Dec 28, 2022
2a127ec
exp name fixes
shaydeci Dec 28, 2022
4055cfa
recipe tests added to release workflow
shaydeci Dec 28, 2022
f84daea
updated delta for cifar
shaydeci Dec 28, 2022
dab31f7
Merge branch 'master' into feature/SG-521_gpu_tests
shaydeci Dec 28, 2022
8fdc881
Merge remote-tracking branch 'origin/master' into feature/SG-521_gpu_…
shaydeci Dec 28, 2022
f946203
Merge remote-tracking branch 'origin/feature/SG-521_gpu_tests' into f…
shaydeci Dec 28, 2022
b41fa21
Merge branch 'master' into feature/SG-521_gpu_tests
ofrimasad Dec 29, 2022
40 changes: 39 additions & 1 deletion .circleci/config.yml
@@ -104,7 +104,6 @@ jobs:
- store_artifacts:
path: ~/sg_logs


release_candidate:
parameters:
py_version:
@@ -180,6 +179,40 @@ jobs:
tag: $CIRCLE_TAG
notes: "This GitHub Release was done automatically by CircleCI"

  recipe_tests:
    machine: true
    resource_class: deci-ai/sg-gpu-on-premise
    parameters:
      sg_existing_env_path:
        type: string
        default: "/env/persistent_env"
      sg_new_env_name:
        type: string
        default: "${CIRCLE_BUILD_NUM}"
      sg_new_env_python_version:
        type: string
        default: "python3.8"
    steps:
      - checkout
      - run:
          name: install requirements and run recipe tests
          command: |
            << parameters.sg_new_env_python_version >> -m venv << parameters.sg_new_env_name >>
            source << parameters.sg_new_env_name >>/bin/activate
            python3.8 -m pip install --upgrade setuptools pip wheel
            python3.8 -m pip install -r requirements.txt
            python3.8 -m pip install git+https://github.com/Deci-AI/super-gradients.git@${CIRCLE_BRANCH}
            python3.8 -m pip install torch==1.12.0+cu116 torchvision==0.13.0+cu116 torchaudio==0.12.0 --extra-index-url https://download.pytorch.org/whl/cu116
            python3.8 src/super_gradients/examples/train_from_recipe_example/train_from_recipe.py --config-name=cifar10_resnet experiment_name=shortened_cifar10_resnet_accuracy_test training_hyperparams.max_epochs=100 training_hyperparams.average_best_models=False +multi_gpu=DDP +num_gpus=4
            python3.8 src/super_gradients/examples/train_from_recipe_example/train_from_recipe.py --config-name=coco2017_yolox experiment_name=shortened_coco2017_yolox_n_map_test architecture=yolox_n training_hyperparams.loss=yolox_fast_loss training_hyperparams.max_epochs=10 training_hyperparams.average_best_models=False multi_gpu=DDP num_gpus=4
            python3.8 src/super_gradients/examples/train_from_recipe_example/train_from_recipe.py --config-name=cityscapes_regseg48 experiment_name=shortened_cityscapes_regseg48_iou_test training_hyperparams.max_epochs=10 training_hyperparams.average_best_models=False multi_gpu=DDP num_gpus=4
            coverage run --source=super_gradients -m unittest tests/deci_core_recipe_test_suite_runner.py

      - run:
          name: Remove new environment when failed
          command: "rm -r << parameters.sg_new_env_name >>"
          when: on_fail



workflows:
@@ -199,10 +232,13 @@ workflows:
- deci-common/persist_version_info
- login_to_codeartifact_release
<<: *release_tag_filter
- recipe_tests:
<<: *release_tag_filter
- release_version:
py_version: "3.7"
requires:
- "build3.7"
- recipe_tests
<<: *release_tag_filter
- deci-common/pip_upload_package_from_codeartifact_to_global_pypi:
package_name: "super-gradients"
@@ -219,13 +255,15 @@ workflows:
- deci-common/persist_version_info
- deci-common/codeartifact_login:
repo_name: "deci-packages"

- build:
name: "build3.7"
py_version: "3.7"
package_name: "super-gradients"
requires:
- deci-common/persist_version_info
- deci-common/codeartifact_login

- release_candidate: # happens on merge
py_version: "3.7"
requires:
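
The three training commands in the `recipe_tests` job differ only in the recipe name, the experiment name, the epoch count and a few overrides. As an illustration only (this helper is not part of the PR; `build_train_command` and `SCRIPT` are hypothetical names), the argument lists could be assembled from a small mapping. Note that the cifar10 run passes `+multi_gpu`/`+num_gpus` while the other two pass plain `multi_gpu`/`num_gpus`; in Hydra's override syntax a leading `+` appends a key that the config does not define yet, whereas a plain override expects the key to exist already.

```python
import shlex
from typing import Dict, List

# Hypothetical helper, not part of the PR: it only assembles the argument list
# that the CircleCI step spells out by hand.
SCRIPT = "src/super_gradients/examples/train_from_recipe_example/train_from_recipe.py"


def build_train_command(config_name: str, overrides: Dict[str, str], python: str = "python3.8") -> List[str]:
    """Return the argv for one train_from_recipe run with Hydra-style overrides."""
    args = [python, SCRIPT, f"--config-name={config_name}"]
    # Dicts preserve insertion order, so overrides appear in the order given.
    args += [f"{key}={value}" for key, value in overrides.items()]
    return args


cifar_cmd = build_train_command(
    "cifar10_resnet",
    {
        "experiment_name": "shortened_cifar10_resnet_accuracy_test",
        "training_hyperparams.max_epochs": "100",
        "training_hyperparams.average_best_models": "False",
        "+multi_gpu": "DDP",
        "+num_gpus": "4",
    },
)
# shlex.join renders the list as a single shell-safe command line.
print(shlex.join(cifar_cmd))
```

Keeping the overrides in one place would make it harder for the experiment names in the CI step and in the accuracy tests to drift apart.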
1 change: 1 addition & 0 deletions requirements.txt
@@ -32,3 +32,4 @@ wheel>=0.38.0
# not directly required, pinned by Snyk to avoid a vulnerability
pygments>=2.7.4
stringcase>=1.2.0
numpy<=1.23
1 change: 0 additions & 1 deletion src/super_gradients/recipes/cifar10_resnet.yaml
@@ -24,7 +24,6 @@ resume: False
training_hyperparams:
resume: ${resume}


ckpt_root_dir:

architecture: resnet18_cifar
8 changes: 7 additions & 1 deletion src/super_gradients/training/sg_trainer/sg_trainer.py
@@ -954,7 +954,13 @@ def forward(self, inputs, targets):
             training_params = dict()
         self.train_loader = train_loader or self.train_loader
         self.valid_loader = valid_loader or self.valid_loader
-        if len(self.train_loader.dataset) % self.train_loader.batch_size != 0 and not self.train_loader.drop_last:
+
+        if hasattr(self.train_loader, "batch_sampler") and self.train_loader.batch_sampler is not None:
+            batch_size = self.train_loader.batch_sampler.batch_size
+        else:
+            batch_size = self.train_loader.batch_size
+
+        if len(self.train_loader.dataset) % batch_size != 0 and not self.train_loader.drop_last:
             logger.warning("Train dataset size % batch_size != 0 and drop_last=False, this might result in smaller " "last batch.")
         self._set_dataset_params()
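
The batch-size lookup in this hunk matters when the train loader was built with a custom batch sampler: as far as I know, PyTorch's `DataLoader` then reports `batch_size=None`, so a modulo against `train_loader.batch_size` would raise. A minimal sketch of the same fallback, using stand-in objects instead of a real `DataLoader` (the `resolve_batch_size` helper, the `SimpleNamespace` loaders and the dataset size are assumptions for illustration, not code from the PR):

```python
from types import SimpleNamespace


def resolve_batch_size(loader):
    """Prefer the batch sampler's batch size, fall back to the loader's own."""
    batch_sampler = getattr(loader, "batch_sampler", None)
    if batch_sampler is not None:
        return batch_sampler.batch_size
    return loader.batch_size


# Stand-in for a loader built with a custom batch sampler (batch_size is None on the loader itself).
sampler_loader = SimpleNamespace(batch_sampler=SimpleNamespace(batch_size=32), batch_size=None, drop_last=False)
# Stand-in for a loader without a batch sampler.
plain_loader = SimpleNamespace(batch_sampler=None, batch_size=64, drop_last=False)

dataset_size = 50_000  # made-up dataset length
for loader in (sampler_loader, plain_loader):
    batch_size = resolve_batch_size(loader)
    if dataset_size % batch_size != 0 and not loader.drop_last:
        print(f"batch_size={batch_size}: last batch holds {dataset_size % batch_size} samples")
```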

@@ -242,7 +242,7 @@ def restart_script_with_ddp(num_gpus: int = None):
     elastic_launch(config=config, entrypoint=sys.executable)(*sys.argv, *EXTRA_ARGS)

     # The code below should actually never be reached as the process will be in a loop inside elastic_launch until any subprocess crashes.
-    sys.exit("Main process finished")
+    sys.exit(0)


def get_gpu_mem_utilization():
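
Why the exit call in `restart_script_with_ddp` matters for CI: `sys.exit` with a string argument prints the string to stderr and exits with status 1, so a CI step would treat a finished run as failed, while `sys.exit(0)` reports success. A small self-contained check (it spawns child interpreters through `subprocess`; the `exit_status` helper is a hypothetical name used only here):

```python
import subprocess
import sys


def exit_status(snippet: str) -> int:
    """Run a Python snippet in a child interpreter and return its exit status."""
    completed = subprocess.run([sys.executable, "-c", snippet], capture_output=True, text=True)
    return completed.returncode


# A string argument is treated as an error message, an integer as the status itself.
string_status = exit_status('import sys; sys.exit("Main process finished")')
zero_status = exit_status("import sys; sys.exit(0)")
print(f"sys.exit(str) -> {string_status}, sys.exit(0) -> {zero_status}")
```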
23 changes: 23 additions & 0 deletions tests/deci_core_recipe_test_suite_runner.py
@@ -0,0 +1,23 @@
import sys
import unittest

from tests.recipe_training_tests.shortened_recipes_accuracy_test import ShortenedRecipesAccuracyTests


class CoreUnitTestSuiteRunner:
    def __init__(self):
        self.test_loader = unittest.TestLoader()
        self.recipe_tests_suite = unittest.TestSuite()
        self._add_modules_to_unit_tests_suite()
        self.test_runner = unittest.TextTestRunner(verbosity=3, stream=sys.stdout)

    def _add_modules_to_unit_tests_suite(self):
        """
        _add_modules_to_unit_tests_suite - Adds the recipe TestCase classes to the recipe tests suite
        :return:
        """
        self.recipe_tests_suite.addTest(self.test_loader.loadTestsFromTestCase(ShortenedRecipesAccuracyTests))


if __name__ == "__main__":
    unittest.main()
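
For reference, a sketch of how a runner like the one in this file could execute a suite and expose the result. `DummyRecipeTest` and `run_suite` are made-up names for illustration; the CI job in this PR invokes the file through `-m unittest` rather than through the runner class. `loadTestsFromTestCase` is the loader call that accepts a `TestCase` class.

```python
import io
import unittest


class DummyRecipeTest(unittest.TestCase):
    """Hypothetical stand-in for ShortenedRecipesAccuracyTests."""

    def test_metric_within_delta(self):
        # Passes when the two values differ by at most delta.
        self.assertAlmostEqual(0.91, 0.9167, delta=0.05)


def run_suite(test_case_cls) -> unittest.TestResult:
    """Load every test method of a TestCase class into a suite and run it."""
    suite = unittest.TestSuite()
    suite.addTest(unittest.TestLoader().loadTestsFromTestCase(test_case_cls))
    # Write the report to an in-memory stream to keep the example quiet.
    runner = unittest.TextTestRunner(verbosity=2, stream=io.StringIO())
    return runner.run(suite)


result = run_suite(DummyRecipeTest)
print(f"ran={result.testsRun} ok={result.wasSuccessful()}")
```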
Empty file.
44 changes: 44 additions & 0 deletions tests/recipe_training_tests/shortened_recipes_accuracy_test.py
@@ -0,0 +1,44 @@
import os
import shutil
import unittest

import torch
from super_gradients.common.environment import environment_config


class ShortenedRecipesAccuracyTests(unittest.TestCase):
    @classmethod
    def setUpClass(cls):
        cls.experiment_names = ["shortened_cifar10_resnet_accuracy_test", "shortened_coco2017_yolox_n_map_test", "shortened_cityscapes_regseg48_iou_test"]

    def test_shortened_cifar10_resnet_accuracy(self):
        self.assertTrue(self._reached_goal_metric(experiment_name="shortened_cifar10_resnet_accuracy_test", metric_value=0.9167, delta=0.05))

    def test_shortened_coco2017_yolox_n_map(self):
        self.assertTrue(self._reached_goal_metric(experiment_name="shortened_coco2017_yolox_n_map_test", metric_value=0.044, delta=0.02))

    def test_shortened_cityscapes_regseg48_iou(self):
        self.assertTrue(self._reached_goal_metric(experiment_name="shortened_cityscapes_regseg48_iou_test", metric_value=0.263, delta=0.05))

    @classmethod
    def _reached_goal_metric(cls, experiment_name: str, metric_value: float, delta: float):
        ckpt_dir = os.path.join(environment_config.PKG_CHECKPOINTS_DIR, experiment_name)
        sd = torch.load(os.path.join(ckpt_dir, "ckpt_best.pth"))
        metric_val_reached = sd["acc"].cpu().item()
        diff = abs(metric_val_reached - metric_value)
        print(
            "Goal metric value: " + str(metric_value) + ", metric value reached: " + str(metric_val_reached) + ", diff: " + str(diff) + ", delta: " + str(delta)
        )
        return diff <= delta

    @classmethod
    def tearDownClass(cls) -> None:
        # ERASE ALL THE FOLDERS THAT WERE CREATED DURING THIS TEST
        for folder in cls.experiment_names:
            ckpt_dir = os.path.join(environment_config.PKG_CHECKPOINTS_DIR, folder)
            if os.path.isdir(ckpt_dir):
                shutil.rmtree(ckpt_dir)


if __name__ == "__main__":
    unittest.main()
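
The goal-metric check in these tests boils down to an absolute-tolerance comparison. A sketch of that comparison on its own, without loading a checkpoint (the function name `within_delta` and the "reached" values are illustrative assumptions; only the goal values and deltas come from the tests above):

```python
def within_delta(reached: float, goal: float, delta: float) -> bool:
    """True when the reached metric is at most `delta` away from the goal, in either direction."""
    return abs(reached - goal) <= delta


# Goal values and deltas as in the three tests; the "reached" values are made up.
checks = [
    ("cifar10_resnet", 0.90, 0.9167, 0.05),
    ("coco2017_yolox_n", 0.07, 0.044, 0.02),
    ("cityscapes_regseg48", 0.30, 0.263, 0.05),
]
for name, reached, goal, delta in checks:
    print(f"{name}: reached={reached}, goal={goal}, pass={within_delta(reached, goal, delta)}")
```

Because the comparison is two-sided, a run that lands well above the goal fails as well, so the tests flag unexpected changes in either direction.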