Lightning Dataset (including optimized dataloading of s3 buckets) (#17743)

* Lightning DataLoader

* lightning dataloader

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* init

* example

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* env var

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Update src/lightning/pytorch/utilities/data/__init__.py

Co-authored-by: Justus Schock <[email protected]>

* remove unused functions

* extra reqs

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Update src/lightning/pytorch/utilities/data/fileio.py

Co-authored-by: Justus Schock <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* imports work now! yay

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* tests

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* imports

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* missing import

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* error handling

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* update creds for local use case

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* codeowners

* recursive get index

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* index

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* clean up get index

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* update imagenet example

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* docstrings

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* docstrings

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* docstrings

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* example cleanup

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* changelog

* reqs

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* codeowners

* requirements

* expose LightningDataset too

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* expost LightningDataset at top level

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* remove unused private methods from init

* remove private imports

* upper bound on extra requirements

* review comments

* loosen req

* deps

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* test updating fabric base req

* remove version pin on s3fs to test

* recover missing function

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* tests

* update

* random

* torchdata >= 0.3.0

* update torchdata version

* remove torchdata version to test

* try rem torch version pin

* req

* update bucket in test

* req

* skips

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* import

* update structure to lightning.data

* base.txt for data reqs

* fix imports

* rename to LightningS3Dataset

* new workflow

* dont need to test warnings

* reqs

* req

* revert data folder in pytorch

* test import

* tests

* req

* req

* req

* torch version

* req

* req

* open dep

* reformatted

* pin strict

* pin strict extra

* req

* modify workflow, no cache

* try

* patch

* import

* fix

* dataset test

* update getattr

* pin everything to test

* remove torch preinstall from workflow

* workflow

* req

* Update .github/workflows/ci-tests-data.yml

Co-authored-by: Jirka Borovec <[email protected]>

* workflow

* workflow

* req

* Update .github/workflows/ci-tests-data.yml

Co-authored-by: Jirka Borovec <[email protected]>

* workflow

* print

* skip test for now

* update path join

* revert app dep version bump

* Update .github/workflows/ci-tests-data.yml

Co-authored-by: Jirka Borovec <[email protected]>

* workflow updates

* app base req

* req

* window test failure

* add data req to assistant

* try

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* add missing comma

* updates

* update

* typo

* requirements

* try widening req

* older torch version

* update

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* update

* update

* update

* update

* cleanup tests

* typo again

* update

* remove unnecessary line

* Update .github/CODEOWNERS

* Discard changes to requirements/pytorch/base.txt

* Discard changes to requirements/fabric/base.txt

* Discard changes to requirements/app/base.txt

* requirements

* requirements

* one line

* app workflow pick only app reqs

* rename package

* undo

* don't use cache

* examples CI

* pytorch and fabric CI

* try remove cache

* Apply suggestions from code review

* jirka playing

* jirka playing

* jirka playing

* blah

* flatten LightningDataset

* cleans up dataset class

* jirka playing

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* jirka playing

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* extra

* fix dataset test

* update checkgroups

* Luca's review comments

* val error fix

* unskip test

* min

* fix precommit warning

* cpu

* docstrings

* req

* 2.0.1

* add return type

* typing errors

* req

* return types with quotations

* import for type-checking

* no botocore in cloudagnostic code

* exit args

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* update

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* backends typing

* remove oldest from data tests

* typing

* typing

* typing

* types

* type

* typing

* typing

* typing

* import fix

* Changelog

---------

Co-authored-by: Noha Alon <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Justus Schock <[email protected]>
Co-authored-by: Jirka Borovec <[email protected]>
Co-authored-by: Jirka <[email protected]>
Co-authored-by: Justus Schock <[email protected]>
7 people authored Jun 13, 2023
1 parent 377bfd2 commit ca30fd7
Showing 28 changed files with 1,232 additions and 9 deletions.
12 changes: 11 additions & 1 deletion .actions/assistant.py
@@ -12,6 +12,7 @@
# See the License for the specific language governing permissions and
# limitations under the License.
import glob
import logging
import os
import pathlib
import re
@@ -43,6 +44,11 @@
        "requirements/fabric/base.txt",
        "requirements/fabric/strategies.txt",
    ),
    "data": (
        "requirements/data/data.txt",
        "requirements/data/cloud.txt",
        "requirements/data/examples.txt",
    ),
}
REQUIREMENT_FILES_ALL = list(chain(*REQUIREMENT_FILES.values()))

@@ -146,6 +152,9 @@ def load_requirements(path_dir: str, file_name: str = "base.txt", unfreeze: str
    """
    assert unfreeze in {"none", "major", "all"}
    path = Path(path_dir) / file_name
    if not path.exists():
        logging.warning(f"Folder {path_dir} does not have any base requirements.")
        return []
    assert path.exists(), (path_dir, file_name, path)
    text = path.read_text()
    return [req.adjust(unfreeze) for req in _parse_requirements(text)]
@@ -240,7 +249,7 @@ def _load_aggregate_requirements(req_dir: str = "requirements", freeze_requireme
        requires = [
            load_requirements(d, unfreeze="none" if freeze_requirements else "major")
            for d in glob.glob(os.path.join(req_dir, "*"))
            # skip empty folder as git artefacts, and resolving Will's special issue
            # skip empty folder (git artifacts), and resolving Will's special issue
            if os.path.isdir(d) and len(glob.glob(os.path.join(d, "*"))) > 0 and not os.path.basename(d).startswith("_")
        ]
        if not requires:
@@ -404,6 +413,7 @@ def _replace_min(fname: str) -> None:
    def replace_oldest_ver(requirement_fnames: Sequence[str] = REQUIREMENT_FILES_ALL) -> None:
        """Replace the min package version by fixed one."""
        for fname in requirement_fnames:
            print(fname)
            AssistantCLI._replace_min(fname)

    @staticmethod
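
Note on the `load_requirements` hunk: the new guard returns an empty list when the requirements file is absent, so the `assert path.exists()` that follows it can no longer fail. A minimal sketch of that behavior follows. It assumes a simplified stand-in parser: `load_requirements_sketch` is a hypothetical name, and the real function uses `_parse_requirements` and `req.adjust`, which are not reproduced here.

```python
import logging
import tempfile
from pathlib import Path
from typing import List


def load_requirements_sketch(path_dir: str, file_name: str = "base.txt") -> List[str]:
    """Hypothetical stand-in: return requirement lines, or [] if the file is absent."""
    path = Path(path_dir) / file_name
    if not path.exists():
        # Same guard as in the diff: warn and return nothing instead of asserting.
        logging.warning(f"Folder {path_dir} does not have any base requirements.")
        return []
    # Simplified parsing (assumption): drop blank lines and comment lines.
    lines = (ln.strip() for ln in path.read_text().splitlines())
    return [ln for ln in lines if ln and not ln.startswith("#")]


with tempfile.TemporaryDirectory() as tmp:
    missing = load_requirements_sketch(tmp)  # no base.txt yet
    (Path(tmp) / "base.txt").write_text("# core\ntorch>=2.0\n\nnumpy\n")
    present = load_requirements_sketch(tmp)  # base.txt now exists
```

This matters for the new `requirements/data/` folder, which ships `data.txt`, `cloud.txt`, and `examples.txt` but, judging by the file list in this commit, no `base.txt`.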
5 changes: 5 additions & 0 deletions .github/CODEOWNERS
@@ -41,6 +41,11 @@
/src/lightning/pytorch/core/hooks.py @williamfalcon @tchaton @awaelchli @carmocca
/src/lightning/pytorch/core/module.py @williamfalcon @tchaton @awaelchli @carmocca

# Data Utilities
/examples/data/ @nohalon @justusschock
/src/lightning/data/ @nohalon @justusschock
/tests/tests_data @nohalon @justusschock

# Lightning Fabric
/src/lightning/fabric @awaelchli @carmocca @justusschock
/src/lightning_fabric @awaelchli @carmocca @justusschock
20 changes: 20 additions & 0 deletions .github/checkgroup.yml
@@ -150,6 +150,26 @@ subprojects:
      - "build-pl (3.9, 1.13, 11.7.1)"
      - "build-pl (3.10, 2.0, 11.7.1)"

  # SECTIONS: lightning_data

  - id: "lightning_data: CPU workflow"
    paths:
      - ".actions/**"
      - "requirements/data/**"
      - "src/lightning/data/**"
      - "src/lightning_data/*"
      - "tests/tests_data/**"
      - "examples/data/**"
      - "pyproject.toml" # includes pytest config
      - ".github/workflows/ci-tests-data.yml"
      - "!requirements/*/docs.txt"
      - "!*.md"
      - "!**/*.md"
    checks:
      - "data-cpu (macOS-11, lightning, 3.10, 2.0)"
      - "data-cpu (ubuntu-20.04, lightning, 3.10, 2.0)"
      - "data-cpu (windows-2022, lightning, 3.10, 2.0)"

  # SECTION: lightning_fabric

  - id: "lightning_fabric: CPU workflow"
118 changes: 118 additions & 0 deletions .github/workflows/ci-tests-data.yml
@@ -0,0 +1,118 @@
name: Test Data

# see: https://help.github.com/en/actions/reference/events-that-trigger-workflows
on:
  push:
    branches: [master, "release/*"]
  pull_request:
    branches: [master, "release/*"]
    types: [opened, reopened, ready_for_review, synchronize] # added `ready_for_review` since draft is skipped
    paths:
      - ".actions/**"
      - "requirements/data/**"
      - "src/lightning/data/**"
      - "tests/tests_data/**"
      - "pyproject.toml" # includes pytest config
      - ".github/workflows/ci-tests-data.yml"
      - "!requirements/*/docs.txt"
      - "!*.md"
      - "!**/*.md"

concurrency:
  group: ${{ github.workflow }}-${{ github.ref }}-${{ github.head_ref }}
  cancel-in-progress: ${{ ! (github.ref == 'refs/heads/master' || startsWith(github.ref, 'refs/heads/release/')) }}

defaults:
  run:
    shell: bash

jobs:
  data-cpu:
    runs-on: ${{ matrix.os }}
    if: github.event.pull_request.draft == false
    strategy:
      fail-fast: false
      matrix:
        include:
          - {os: "macOS-11", pkg-name: "lightning", python-version: "3.10", pytorch-version: "2.0"}
          - {os: "ubuntu-20.04", pkg-name: "lightning", python-version: "3.10", pytorch-version: "2.0"}
          - {os: "windows-2022", pkg-name: "lightning", python-version: "3.10", pytorch-version: "2.0"}
          # "oldest" versions tests, only on minimum Python
          # - {os: "macOS-11", pkg-name: "lightning", python-version: "3.8", pytorch-version: "2.0", requires: "oldest"}
          # - {os: "ubuntu-20.04", pkg-name: "lightning", python-version: "3.8", pytorch-version: "2.0", requires: "oldest"}
          # - {os: "windows-2022", pkg-name: "lightning", python-version: "3.8", pytorch-version: "2.0", requires: "oldest"}
    timeout-minutes: 25 # because of building grpcio on Mac
    env:
      PACKAGE_NAME: ${{ matrix.pkg-name }}
      FREEZE_REQUIREMENTS: ${{ ! (github.ref == 'refs/heads/master' || startsWith(github.ref, 'refs/heads/release/')) }}
      # PYPI_CACHE_DIR: "_pip-wheels"
      TORCH_URL_STABLE: "https://download.pytorch.org/whl/cpu/torch_stable.html"
      TORCH_URL_TEST: "https://download.pytorch.org/whl/test/cpu/torch_test.html"
    steps:
      - uses: actions/checkout@v3

      - name: Set up Python ${{ matrix.python-version }}
        uses: actions/setup-python@v4
        with:
          python-version: ${{ matrix.python-version }}

      - name: basic setup
        run: pip install -q -r .actions/requirements.txt

      - name: Set min. dependencies
        if: ${{ matrix.requires == 'oldest' }}
        run: |
          python .actions/assistant.py replace_oldest_ver
      - name: Adjust PyTorch versions in requirements files
        if: ${{ matrix.requires != 'oldest' && matrix.release != 'pre' }}
        run: |
          pip install -q wget packaging
          python -m wget https://raw.githubusercontent.com/Lightning-AI/utilities/main/scripts/adjust-torch-versions.py
          for fpath in `ls requirements/data/*.txt`; do \
            python ./adjust-torch-versions.py $fpath ${{ matrix.pytorch-version }}; \
          done
          cat requirements/data/data.txt
          cat requirements/data/cloud.txt
      # - name: pip wheels cache
      #   uses: actions/cache/restore@v3
      #   with:
      #     path: ${{ env.PYPI_CACHE_DIR }}
      #     key: pypi_wheels
      # - run: |
      #     mkdir -p $PYPI_CACHE_DIR
      #     ls -lh $PYPI_CACHE_DIR

      # removing torch stable line:
      # pip install -e ".[${extra}test]" "pytest-timeout" -U -f ${TORCH_URL} ${TORCH_PREINSTALL} -f ${PYPI_CACHE_DIR} --prefer-binary
      - name: Install package & dependencies
        run: |
          python -m pip install -q pip -U
          pip install -e ".[data-dev]" "pytest-timeout" -U -f ${TORCH_URL} --prefer-binary
          pip list
      - name: Testing Data
        working-directory: tests/tests_data
        # NOTE: do not include coverage report here, see: https://github.com/nedbat/coveragepy/issues/1003
        run: |
          python -m coverage run --source lightning \
            -m pytest -v --timeout=30 --durations=50
      - name: Statistics
        if: success()
        working-directory: tests/tests_data
        run: |
          coverage report
          coverage xml
      - name: Upload coverage to Codecov
        uses: codecov/codecov-action@v3
        # see: https://github.com/actions/toolkit/issues/399
        continue-on-error: true
        with:
          token: ${{ secrets.CODECOV_TOKEN }}
          file: tests/tests_data/coverage.xml
          flags: lightning,cpu,pytest,python${{ matrix.python-version }}
          name: CPU-coverage
          fail_ci_if_error: false
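
Note on the workflow: `cancel-in-progress` and `FREEZE_REQUIREMENTS` both use the same expression, which is true for every ref except `master` and `release/*` branches. The sketch below restates that expression in Python as an illustration only. The function names are hypothetical, and GitHub's own expression evaluation may differ in details such as case handling.

```python
def is_protected_ref(ref: str) -> bool:
    """True for master and release branches, mirroring the expression's inner condition."""
    return ref == "refs/heads/master" or ref.startswith("refs/heads/release/")


def freeze_requirements(ref: str) -> bool:
    """Mirrors `! (ref == master || startsWith(ref, release/))` from the workflow."""
    return not is_protected_ref(ref)


# Hypothetical refs, chosen only to exercise each branch of the condition.
on_master = freeze_requirements("refs/heads/master")
on_release = freeze_requirements("refs/heads/release/stable")
on_pull_request = freeze_requirements("refs/pull/1/merge")
```

Read this way, pull-request runs cancel superseded jobs and freeze requirements, while pushes to `master` and release branches do neither.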
4 changes: 4 additions & 0 deletions .gitignore
@@ -189,6 +189,10 @@ our_model.tar
test.png
saved_models
data/
!src/lightning/data/
!examples/data/
!tests/tests_pytorch/utilities/data/
!requirements/data/
.shared
.lightning
node_modules/
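
Note on the `.gitignore` hunk: the existing `data/` rule ignores any directory named `data`, so the new `!` lines re-include the source, example, test, and requirements folders that happen to share that name. The sketch below models only the "last matching rule wins" idea for directory paths. It is a deliberately simplified assumption, not git's full matching algorithm, and `dir_is_ignored` is a hypothetical helper.

```python
from typing import List

# Rules taken from the hunk above, in file order.
RULES = [
    "data/",
    "!src/lightning/data/",
    "!examples/data/",
    "!tests/tests_pytorch/utilities/data/",
    "!requirements/data/",
]


def dir_is_ignored(rel_dir: str, rules: List[str] = RULES) -> bool:
    """Simplified model: the last rule that matches the directory decides."""
    ignored = False
    for rule in rules:
        negated = rule.startswith("!")
        pattern = rule.lstrip("!").rstrip("/")
        if "/" in pattern:
            # Pattern with an inner slash: treated as anchored at the repository root.
            matches = rel_dir == pattern
        else:
            # Bare name: matches a directory with that name at any depth.
            matches = rel_dir.split("/")[-1] == pattern
        if matches:
            ignored = not negated
    return ignored


top_level = dir_is_ignored("data")
package = dir_is_ignored("src/lightning/data")
unrelated = dir_is_ignored("docs/data")
```

Under this model `src/lightning/data` is tracked again, while an unrelated `docs/data` directory stays ignored.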