Define library extras and refactor S3 readers #71

Open
javihern98 opened this issue Feb 12, 2025 · 3 comments

Comments


javihern98 commented Feb 12, 2025

Overview

Currently, S3 support pulls in 15 libraries that are always installed, regardless of the use case. Almost all functionality in the library should depend on as few libraries as possible.

When vtlengine is installed as a dependency, the Poetry dependency resolver takes a long time.

Task to perform

  • Explore which libraries can be made optional without breaking core functionality, such as semantic analysis and running VTL code.
  • Define extras based on their relevance and the number of libraries they import.
  • Ensure a user-friendly message is shown whenever a user attempts to use functionality that requires an extra (see https://github.com/bis-med-it/pysdmx/blob/develop/src/pysdmx/__extras_check.py)
  • Ensure we can reduce the S3 code base by using only pandas (with extras).
  • Edit pyproject.toml to conform to Poetry 2.0: https://python-poetry.org/docs/pyproject/
  • Prefer libraries that are also available as APT packages (usually named python3-LIBRARY_NAME). If a library is not, try to find a way to build it from source.
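
A minimal sketch of the user-friendly extras check mentioned above. It is not taken from pysdmx or vtlengine: the names `EXTRA_MODULES`, `missing_modules` and `require_extra`, the extra name `s3`, and the modules listed for it are all assumptions for illustration.

```python
import importlib.util

# Hypothetical mapping from extra name to the modules it must provide.
# The "s3" extra and its module list are assumptions, not the real definition.
EXTRA_MODULES = {"s3": ("fsspec", "s3fs")}


def missing_modules(extra: str) -> list[str]:
    """Return the modules of an extra that cannot be found in this environment."""
    return [
        name
        for name in EXTRA_MODULES[extra]
        if importlib.util.find_spec(name) is None
    ]


def require_extra(extra: str, package: str = "vtlengine") -> None:
    """Raise an ImportError with an install hint if the extra is not installed."""
    missing = missing_modules(extra)
    if missing:
        raise ImportError(
            f"This feature requires the '{extra}' extra "
            f"(missing: {', '.join(missing)}). "
            f"Install it with: pip install {package}[{extra}]"
        )
```

A reader or writer that needs S3 could call `require_extra("s3")` as its first statement, so the user sees the install hint instead of a bare `ModuleNotFoundError` from deep inside pandas.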
@javihern98 javihern98 added the dependencies Dependencies management label Feb 12, 2025
@javihern98 javihern98 self-assigned this Feb 12, 2025

javihern98 commented Feb 12, 2025

Handling of files on the cloud using Pandas storage options:
https://pandas.pydata.org/pandas-docs/version/2.1/user_guide/io.html#reading-writing-remote-files
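
As a rough sketch of what this could look like, assuming the engine reads CSV files through pandas: the helper names `build_read_kwargs` and `read_csv_anywhere` are hypothetical, and only `pandas.read_csv` with its `storage_options` parameter is real API.

```python
from typing import Any, Optional


def build_read_kwargs(
    path: str, storage_options: Optional[dict[str, Any]] = None
) -> dict[str, Any]:
    """Build keyword arguments for a pandas reader such as read_csv.

    storage_options is forwarded only for URL-style paths (s3://, gs://, ...),
    because it is meant for remote filesystems, not plain local paths.
    """
    if storage_options and "://" in path:
        return {"storage_options": dict(storage_options)}
    return {}


def read_csv_anywhere(path: str, storage_options: Optional[dict[str, Any]] = None):
    """Read a CSV from a local path or a remote URL with the same call."""
    import pandas as pd  # imported lazily; s3:// URLs also need s3fs installed

    return pd.read_csv(path, **build_read_kwargs(path, storage_options))
```

With something like this, `read_csv_anywhere("s3://some-bucket/data.csv", {"anon": True})` (bucket name made up) and `read_csv_anywhere("data.csv")` go through the same code path, so no dedicated S3 reader would be needed.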

@javihern98 javihern98 changed the title Define library extras Define library extras and refactor S3 readers Feb 12, 2025

javihern98 commented Feb 12, 2025

Instead of adding an S3 extra, maybe we could write installation docs that explain how to enable S3 compatibility, and only define the storage options for the engine in some way, using one of:

  • A global config parameter
  • An environment variable (preferred)
  • Passing parameters for input and output storage options in some way?

In any case, users who want S3 compatibility can always install the required extras. Resolving the dependencies for the pandas aws and gcp extras takes a very long time, but once they are pinned in the lock file it is fast again. Installing them also downloads many additional libraries that are used only for S3 and not by the rest of the library.
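
One possible way to combine the three options listed above, as a sketch only. The variable name `VTL_STORAGE_OPTIONS`, the `GLOBAL_STORAGE_OPTIONS` default, the function name, and the precedence order (explicit parameter, then environment variable, then global config) are all assumptions, not decisions made in this issue.

```python
import json
import os
from typing import Any, Mapping, Optional

# Hypothetical environment variable name; expected to hold a JSON object.
ENV_VAR = "VTL_STORAGE_OPTIONS"

# Hypothetical global config default, empty unless an application sets it.
GLOBAL_STORAGE_OPTIONS: dict[str, Any] = {}


def resolve_storage_options(
    explicit: Optional[Mapping[str, Any]] = None,
    env: Optional[Mapping[str, str]] = None,
) -> dict[str, Any]:
    """Pick storage options: explicit parameter, then env var, then global config."""
    if explicit is not None:
        return dict(explicit)
    env = os.environ if env is None else env
    raw = env.get(ENV_VAR)
    if raw:
        parsed = json.loads(raw)
        if not isinstance(parsed, dict):
            raise ValueError(f"{ENV_VAR} must contain a JSON object")
        return parsed
    return dict(GLOBAL_STORAGE_OPTIONS)
```

The resulting dictionary could then be handed to the pandas readers and writers as `storage_options`, which keeps all S3-specific configuration outside the engine itself.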


ghomem commented Feb 12, 2025

Some notes:

I am attaching the full list of Python libraries packaged for Ubuntu 24.04 (the latest Ubuntu LTS), so that we can work on making the VTL Engine runnable on plain Ubuntu.

libs-ubuntu2404.txt


2 participants