Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: add multi-storage-client backend for file open #1455

Merged
merged 7 commits into from
Mar 19, 2025

Conversation

jayya2
Copy link
Contributor

@jayya2 jayya2 commented Feb 15, 2025

This PR adds support for the Multi-Storage Client (MSC) backend to handle object storage access in Lhotse. The changes include:

Features

  • New MSCIOBackend for handling MSC protocol URLs
  • URL transformation from existing protocols (e.g., s3://) to MSC format via
    • LHOTSE_MSC_OVERRIDE_PROTOCOLS env for supported protocols, e.g. s3:// -> msc://
    • LHOTSE_MSC_PROFILE env for profile/bucket name overrides, e.g. msc://my-bucket -> msc://my-profile

Implementation Details

  • Added URL transformation utilities for bucket/profile name handling
  • Integrated MSC backend into the default IO backend chain
  • Added unit tests for MSC functionality

Configuration

MSC behavior can be configured through environment variables:

  • LHOTSE_MSC_OVERRIDE_PROTOCOLS: Comma-separated list of protocols to override (e.g., "s3,gs")
  • LHOTSE_MSC_PROFILE: Profile name to use for bucket override

Dependencies

  • Requires multistorageclient package to be installed

Copy link
Collaborator

@pzelasko pzelasko left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, it looks good! I left a few comments.

In addition to those, can you add MSC to the list of optional dependencies in README.md?

You might also need to add msc to the list of optional dependencies installed in the CI for its tests to succeed (see here).


class MSCIOBackend(IOBackend):
"""
Uses multi-storage client to download data from object store
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add a link to MSC here? It'd be good to add 1-2 sentences about how MSC is different and what are it's unique features.

Copy link
Contributor Author

@jayya2 jayya2 Mar 7, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added a few lines to describe MSC and the documents' links. Let me know if that looks good

Copy link
Collaborator

@pzelasko pzelasko left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, great work!

@@ -57,7 +57,7 @@ jobs:
# the torchaudio env var does nothing when torchaudio is installed, but doesn't require it's presence when it's not
pip install lilcom '.[tests]'
# Enable some optional tests
pip install h5py dill smart_open[http] kaldi_native_io webdataset==0.2.5 s3prl scipy nara_wpe pyloudnorm ${{ matrix.extra_deps }}
pip install h5py dill smart_open[http] kaldi_native_io webdataset==0.2.5 s3prl scipy nara_wpe pyloudnorm ${{ matrix.extra_deps }} multi-storage-client==0.16.0
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We'll have to move this requirement to extra_deps field in python version matrix, because Python 3.8 tests are failing (looks like MSC doesn't support that version anymore). Please add this to all python version tests starting from 3.9

pzelasko
pzelasko previously approved these changes Mar 17, 2025
Copy link
Collaborator

@pzelasko pzelasko left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great, please fix the formatting tests and the unit test failures, and LGTM!

Copy link
Collaborator

@pzelasko pzelasko left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, great work!

@pzelasko pzelasko merged commit 0ca0394 into lhotse-speech:master Mar 19, 2025
9 checks passed
@pzelasko pzelasko added this to the v1.30 milestone Mar 19, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants