Skip to content

Forward args to _get_remote_config() and honour core/no_scm if present #10719

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 6 commits into
base: main
Choose a base branch
from

Conversation

rgoya
Copy link

@rgoya rgoya commented Apr 10, 2025

This is a proposed fix for #10608, the code here makes steps 9 and 10 described in the issue work.

Summary:

This change allows a user to access the dvc information in an environment that is disconnected from the original Git backend (e.g. in a deployed container, see #10608), by using something like:

dvc.api.get_url(path,
                repo,
                config={"core": {"no_scm": True}}
                )

Description:

Mainly, a call to dvc/repo/open_repo.py:open_repo(url, *args, **kwargs) may contain a parameter config in **kwargs. With this config a user might indicate they do not want to access the repo with Git support, by using config={"core": {"no_scm": True}}.

During the execution of dvc/repo/open_repo.py:open_repo(), there is a call to a function dvc/repo/open_repo.py:_get_remote_config() that returns the remote configuration({"core": {"remote"}}. This is then merged to the user provided config parameter before calling Repo(url, *args, **kwargs).

dvc/repo/open_repo.py:_get_remote_config(), in turn, does a quick Repo() call to get the remote configuration. However, it does not use any of the parameters requested via dvc/repo/open_repo.py:open_repo() and thus relies entirely on the contents of .dvc/config. This means that even if the user requested no SCM support, it will try to look for a Git repo if .dvc/config says so, and fail if it does not find it.

This PR modifies dvc/repo/open_repo.py:_get_remote_config() to receive *args, **kwargs and honour the request to use or ignore Git support when accessing the dvc repo.

Thank you for the contribution - we'll try to review it as soon as possible. 🙏

Copy link

codecov bot commented Apr 10, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 91.07%. Comparing base (2431ec6) to head (652560a).
Report is 40 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main   #10719      +/-   ##
==========================================
+ Coverage   90.68%   91.07%   +0.39%     
==========================================
  Files         504      504              
  Lines       39795    39965     +170     
  Branches     3141     3158      +17     
==========================================
+ Hits        36087    36400     +313     
+ Misses       3042     2938     -104     
+ Partials      666      627      -39     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Comment on lines 104 to 106
# It seems some tests might be passing a 'config' key that is not a dict
if not isinstance(user_config, dict):
user_config = {}
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking into this, some tests send kwargs = {'config': None, ...; this safeguard protects against this.


if no_scm_flag is not None:
# Honour specific SCM treatment if requested in the call
repo = Repo(url, config={"core": {"no_scm": no_scm_flag}})
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

AFAIR, Repo(config=...) should just work.

Suggested change
repo = Repo(url, config={"core": {"no_scm": no_scm_flag}})
repo = Repo(url, config=kwargs.get("config"))

I don't want to specialize core.no_scm in any way or handle it.

Copy link
Author

@rgoya rgoya Apr 10, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree that it doesn't feel ideal to handle core.no_scm itself. Your solution makes sense and it works for my specific use case, but it triggers other errors in the dvc test suite; which makes me think there are other non-core.no_scm configuration options that are being used that _get_remote_config() doesn't like.

(I went with the core.no_scm specific approach to highlight the need.)

These are the errors I get when using repo = Repo(url, config=kwargs.get("config")):

FAILED tests/func/test_import.py::test_import_no_hash[files1-expected_info_calls1] - dvc_data.index.index.DataIndexDirError: failed to load directory ('dir',)
FAILED tests/func/test_repo_index.py::test_data_index - dvc_data.index.index.DataIndexDirError: failed to load directory ('edir',)
FAILED tests/func/repro/test_repro_pull.py::test_repro_pulls_missing_import - dvc.exceptions.ReproductionError: failed to reproduce 'foo.dvc'
FAILED tests/func/test_data_cloud.py::test_pull_external_dvc_imports - FileNotFoundError: [Errno 2] No storage files available: 'foo'
FAILED tests/func/test_data_cloud.py::test_pull_external_dvc_imports_mixed - FileNotFoundError: [Errno 2] No storage files available: 'foo'
FAILED tests/func/repro/test_repro.py::test_repro_pulls_missing_import - dvc.exceptions.ReproductionError: failed to reproduce 'foo.dvc'
FAILED tests/func/test_import.py::test_import_dir - dvc_data.index.index.DataIndexDirError: failed to load directory ('dir',)
FAILED tests/func/test_import.py::test_import_file_from_dir - dvc_data.index.index.DataIndexDirError: failed to load directory ('dir',)
FAILED tests/func/test_import.py::test_import_file_from_dir_to_dir - dvc_data.index.index.DataIndexDirError: failed to load directory ('dir',)
FAILED tests/func/test_import.py::test_import_rev - FileNotFoundError: [Errno 2] No storage files available: 'foo'
FAILED tests/func/test_import.py::test_pull_imported_stage - dvc.exceptions.CheckoutError: Checkout failed for following targets:
FAILED tests/func/test_import.py::test_pull_import_no_download - FileNotFoundError: [Errno 2] No such file or directory: '/private/var/folders/_3/h8jc4f6d5gg6dwk7t6464_z80000gp/T/pytest-of-rodrigo.goya/pytest-17/popen-gw13/test_pull_import_no_download0/.dvc/cache/fs/local/e3501e821bcee8f40107794afbe767d1/.F0VDC8H_fGFnvnEcrt...
FAILED tests/func/test_import.py::test_pull_import_no_download_rev_lock - dvc.exceptions.DownloadError: 1 files failed to download
FAILED tests/func/test_import.py::test_pull_imported_directory_stage[dir] - dvc_data.index.index.DataIndexDirError: failed to load directory ('dir',)
FAILED tests/func/test_import.py::test_pull_imported_directory_stage[dir/] - dvc_data.index.index.DataIndexDirError: failed to load directory ('dir',)
FAILED tests/func/test_import.py::test_pull_wildcard_imported_directory_stage - dvc_data.index.index.DataIndexDirError: failed to load directory ('dir123',)
FAILED tests/func/test_update.py::test_update_import[True] - FileNotFoundError: [Errno 2] No storage files available: 'version'
FAILED tests/func/test_import.py::test_pull_non_workspace - FileNotFoundError: [Errno 2] No storage files available: 'foo'
FAILED tests/func/test_update.py::test_update_import_after_remote_updates_to_dvc - FileNotFoundError: [Errno 2] No storage files available: 'version'
FAILED tests/func/test_import.py::test_import_with_jobs - dvc_data.index.index.DataIndexDirError: failed to load directory ('dir1',)

@rgoya rgoya requested a review from skshetry April 10, 2025 07:51
@rgoya rgoya force-pushed the respect_no_scm_in_open_repo_call branch 2 times, most recently from 91cf291 to f598ecd Compare April 26, 2025 06:38
@rgoya rgoya force-pushed the respect_no_scm_in_open_repo_call branch from f598ecd to 652560a Compare April 26, 2025 07:03
@rgoya
Copy link
Author

rgoya commented Apr 26, 2025

AFAIR, Repo(config=...) should just work.
- repo = Repo(url, config={"core": {"no_scm": no_scm_flag}})
+ repo = Repo(url, config=kwargs.get("config"))
I don't want to specialize core.no_scm in any way or handle it.

@skshetry, Ok, I finally got some time to track down the errors.

I took your suggestion of using something closer to Repo(url, config=kwargs.get("config")).

I tracked the errors that were triggered by that call to the way that dvc manages imports. Mainly, it creates a DVCFileSystem for a remote repo using a local cache (see issue #9385 / fix #9423 ). This means that when _get_remote_config() is eventually called, it receives the url of the remote repo, but a kwargs["config"]["cache"] that points to the local cache.

Since the function of _get_remote_config() is to get the actual remote's cache location, in order for dvc import to work, _get_remote_config() needs to ignore anything sent in kwargs["config"]["cache"].

I modified the code to to the above, and added some comments to the code for future reference.

I also added a related test to the test suite.

Please let me know what you think.


Note: FYI, I've been seeing some flakiness in the test suite from what seems to be a race condition in the logs, where a message of DEBUG s3fs:utils.py:80 RC: discarding all clients sometimes sneaks in and fails the assert, e.g.:

>       assert first(caplog.messages) == expected_message
E       assert 'RC: discarding all clients' == "Output 'path... 'stage.dvc'."
E         
E         + RC: discarding all clients
E         - Output 'path'(stage: 'stage.dvc') is missing version info. Cache for it will not be collected. Use `dvc repro` to get your pipeline up to date.
E         - You can also use `dvc commit stage.dvc` to associate existing 'path' with stage: 'stage.dvc'.

tests/unit/output/test_output.py:106: AssertionError
----------------------------- Captured stderr call -----------------------------
WARNING: Output 'path'(stage: 'stage.dvc') is missing version info. Cache for it will not be collected. Use `dvc repro` to get your pipeline up to date.
You can also use `dvc commit stage.dvc` to associate existing 'path' with stage: 'stage.dvc'.
------------------------------ Captured log call -------------------------------
DEBUG    s3fs:utils.py:80 RC: discarding all clients
WARNING  dvc.output:output.py:1160 Output 'path'(stage: 'stage.dvc') is missing version info. Cache for it will not be collected. Use `dvc repro` to get your pipeline up to date.
You can also use `dvc commit stage.dvc` to associate existing 'path' with stage: 'stage.dvc'.

@skshetry
Copy link
Member

Great research, @rgoya. I have taken a quick look at the PR and it looks good. Please give me some time (maybe this week) to look at this more closely.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants