feat: add multi-storage-client backend for file open #1455
lhotse/serialization.py
Outdated
class MSCIOBackend(IOBackend):
    """
    Uses multi-storage client to download data from object store
Can you add a link to MSC here? It'd be good to add 1-2 sentences about how MSC is different and what its unique features are.
Added a few lines describing MSC, with links to its documentation. Let me know if that looks good.
Thanks, great work!
.github/workflows/unit_tests.yml
Outdated
@@ -57,7 +57,7 @@ jobs:
# the torchaudio env var does nothing when torchaudio is installed, but doesn't require it's presence when it's not
pip install lilcom '.[tests]'
# Enable some optional tests
pip install h5py dill smart_open[http] kaldi_native_io webdataset==0.2.5 s3prl scipy nara_wpe pyloudnorm ${{ matrix.extra_deps }}
pip install h5py dill smart_open[http] kaldi_native_io webdataset==0.2.5 s3prl scipy nara_wpe pyloudnorm ${{ matrix.extra_deps }} multi-storage-client==0.16.0
We'll have to move this requirement to the extra_deps field in the Python version matrix, because the Python 3.8 tests are failing (it looks like MSC doesn't support that version anymore). Please add it to the tests for all Python versions starting from 3.9.
Looks great, please fix the formatting tests and the unit test failures, and LGTM!
Thanks, great work!
This PR adds support for the Multi-Storage Client (MSC) backend to handle object storage access in Lhotse. The changes include:
Features
- MSCIOBackend for handling MSC protocol URLs
- LHOTSE_MSC_OVERRIDE_PROTOCOLS env for supported protocols, e.g. s3:// -> msc://
- LHOTSE_MSC_PROFILE env for profile/bucket name overrides, e.g. msc://my-bucket -> msc://my-profile
Implementation Details
Configuration
MSC behavior can be configured through environment variables:
- LHOTSE_MSC_OVERRIDE_PROTOCOLS: Comma-separated list of protocols to override (e.g., "s3,gs")
- LHOTSE_MSC_PROFILE: Profile name to use for bucket override
Dependencies
- Requires the multistorageclient package to be installed
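
To make the two environment variables concrete, here is a minimal sketch of the URL rewriting they describe. It is not the code from this PR: the helper name rewrite_to_msc and its env parameter are hypothetical, and the assumption that the profile replaces the first path segment after msc:// is inferred only from the msc://my-bucket -> msc://my-profile example above.

```python
import os
from typing import Mapping, Optional

MSC_PREFIX = "msc://"


def rewrite_to_msc(url: str, env: Optional[Mapping[str, str]] = None) -> str:
    """Hypothetical helper: apply the protocol and profile overrides to a URL."""
    env = os.environ if env is None else env

    # LHOTSE_MSC_OVERRIDE_PROTOCOLS is a comma-separated list such as "s3,gs".
    raw = env.get("LHOTSE_MSC_OVERRIDE_PROTOCOLS", "")
    protocols = [p.strip() for p in raw.split(",") if p.strip()]
    for proto in protocols:
        prefix = f"{proto}://"
        if url.startswith(prefix):
            # e.g. s3://my-bucket/key -> msc://my-bucket/key
            url = MSC_PREFIX + url[len(prefix):]
            break

    # LHOTSE_MSC_PROFILE swaps the bucket segment for a profile name
    # (assumed behavior, based on the example in the PR description).
    profile = env.get("LHOTSE_MSC_PROFILE")
    if profile and url.startswith(MSC_PREFIX):
        _bucket, sep, key = url[len(MSC_PREFIX):].partition("/")
        url = f"{MSC_PREFIX}{profile}{sep}{key}"
    return url


if __name__ == "__main__":
    settings = {
        "LHOTSE_MSC_OVERRIDE_PROTOCOLS": "s3,gs",
        "LHOTSE_MSC_PROFILE": "my-profile",
    }
    print(rewrite_to_msc("s3://my-bucket/audio/utt1.wav", settings))
```

Passing the settings as a mapping rather than reading os.environ directly keeps the sketch easy to test; URLs whose protocol is not listed pass through unchanged.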