Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allow cache TTL override for --find-links urls #8109

Open
wisechengyi opened this issue Apr 21, 2020 · 15 comments · Fixed by pex-tool/pip#7
Open

Allow cache TTL override for --find-links urls #8109

wisechengyi opened this issue Apr 21, 2020 · 15 comments · Fixed by pex-tool/pip#7
Labels
S: needs triage Issues/PRs that need to be triaged

Comments

@wisechengyi
Copy link

wisechengyi commented Apr 21, 2020

What's the problem this feature will solve?

Our wheels (thousands of them) are hosted in a flat directory, e.g. https://myhost.com/wheels/ has

...
django_auth_ldap-1.6.1-py2.py3-none-any.whl
django_bootstrap_toolkit-2.15.0-py2-none-any.whl
django_bootstrap_toolkit-2.15.0-py3-none-any.whl
django_braces-1.14.0-py2.py3-none-any.whl
django_common_helpers-0.9.2-py2-none-any.whl
django_common_helpers-0.9.2-py3-none-any.whl
django_configurations-2.2-py2.py3-none-any.whl
django_crispy_forms-1.8.1-py2.py3-none-any.whl
django_cron-0.5.1-py2-none-any.whl
django_cron-0.5.1-py3-none-any.whl
...

With --no-index --find-links https://myhost.com/wheels/, pip will query it for every artifact it needs transitively. However, when large number of builds are running concurrently with pip, the wheel server can be overwhelmed.

Describe the solution you'd like

We'd like some mechanism to force the cache TTL for the index page. Something to the effect of:

diff --git a/src/pip/_vendor/cachecontrol/controller.py b/src/pip/_vendor/cachecontrol/controller.py
index dafe55c..f0066a7 100644
--- a/src/pip/_vendor/cachecontrol/controller.py
+++ b/src/pip/_vendor/cachecontrol/controller.py
@@ -84,7 +84,9 @@ class CacheController(object):
 
         retval = {}
 
+        cc_headers = 'max-age=8'
         for cc_directive in cc_headers.split(","):
             if not cc_directive.strip():
                 continue

In this case I was hardcoding the cache TTL to be 8 seconds.

It can be plumbed via an option, e.g. adding --force-index-cache-ttl=<some seconds> to below:

Package Index Options:
  -i, --index-url <url>       Base URL of the Python Package Index (default https://pypi.org/simple). This should point to a repository compliant with PEP 503 (the
                              simple repository API) or a local directory laid out in the same format.
  --extra-index-url <url>     Extra URLs of package indexes to use in addition to --index-url. Should follow the same rules as --index-url.
  --no-index                  Ignore package index (only looking at --find-links URLs instead).
  -f, --find-links <url>      If a URL or path to an html file, then parse for links to archives such as sdist (.tar.gz) or wheel (.whl) files. If a local path or
                              file:// URL that's a directory,  then look for archives in the directory listing. Links to VCS project URLs are not supported.

Please kindly let me know if the concept is acceptable or if there's a better approach to this.

Thanks!

@triage-new-issues triage-new-issues bot added the S: needs triage Issues/PRs that need to be triaged label Apr 21, 2020
@wisechengyi
Copy link
Author

Hi @cjerdonek, would you be able to take a look at this? Saw you were the last person who made related changes.

@NoahGorny
Copy link
Contributor

Take a look at #8042, seems like it can solve this problem as well

@wisechengyi
Copy link
Author

wisechengyi commented Apr 24, 2020

Hi @NoahGorny, thanks for the pointer! This use case is a bit different from #8042 which sets the header in the outgoing requests to server, whereas this is more about how to interpreter/change the header in the response received from server.

Specifically the existing code is overwriting max-age to be 0 regardless what the server response is

The goal is here to manipulate to 'max-age' to force pip to cache the response, regardless what 'max-age' is in the actual server response.

Edit: clarify where the change should be.

@wisechengyi
Copy link
Author

ok looks like pip is using the request header to determine whether the response should be cached. #8009

index dafe55c..f0066a7 100644
--- a/src/pip/_vendor/cachecontrol/controller.py
+++ b/src/pip/_vendor/cachecontrol/controller.py
@@ -84,7 +84,9 @@ class CacheController(object):
 
         retval = {}
 
+        cc_headers = 'max-age=8'
         for cc_directive in cc_headers.split(","):
             if not cc_directive.strip():
                 continue

Happened to work because it will change the parsing for both request and response.

@NoahGorny
Copy link
Contributor

@wisechengyi I think that allowing custom headers for the request will solve it, but only if the max-age override is somehow skipped if specified by the user. The blocker is the client side request and not the server side response.

@wisechengyi
Copy link
Author

wisechengyi commented Apr 24, 2020

Ah yes you are totally right! @NoahGorny. I was able to test it by passing max-age on CLI and comment out:

"Cache-Control": "max-age=0",

Spoke too soon.
With

diff --git a/src/pip/_internal/index/collector.py b/src/pip/_internal/index/collector.py
index e2c800c..3394b1e 100644
--- a/src/pip/_internal/index/collector.py
+++ b/src/pip/_internal/index/collector.py
@@ -163,7 +163,7 @@ def _get_html_response(url, session):
             # trip for the conditional GET now instead of only
             # once per 10 minutes.
             # For more information, please see pypa/pip#5670.
-            "Cache-Control": "max-age=0",
+            # "Cache-Control": "max-age=0",
         },
     )
     resp.raise_for_status()
diff --git a/src/pip/_vendor/cachecontrol/controller.py b/src/pip/_vendor/cachecontrol/controller.py
index dafe55c..26bca79 100644
--- a/src/pip/_vendor/cachecontrol/controller.py
+++ b/src/pip/_vendor/cachecontrol/controller.py
@@ -126,6 +126,8 @@ class CacheController(object):
         logger.debug('Looking up "%s" in the cache', cache_url)
         cc = self.parse_cache_control(request.headers)
 
+        print('request header:', cc)
+
         # Bail out if the request insists on fresh data
         if "no-cache" in cc:
             logger.debug('Request header has "no-cache", cache bypassed')

On top of #8078

$ python3 src/pip -vvv download --dest /tmp/yic_cache --no-index --find-links https://myhost.com/home/third_party/source/python/wheels/ -H 'Cache-Control: max-age=10' Flask==1.1.1 
...
Getting page https://myhost.com/home/third_party/source/python/wheels/
Looking up "https://myhost.com/home/third_party/source/python/wheels/" in the cache
request header: {'max-age': 10}
No cache entry available // this appears multiple times.

I wonder if server response also requires max-age.

@uranusjr
Copy link
Member

I wonder if server response also requires max-age.

Yes, and they mean different things. The request’s Cache-Control tells the server what to send. If you want the server to avoid receiving requests at all, you should set Cache-Control in the response to tell the client to not send the same request for the duration specified by max-age. (And of course pip’s client needs to implement support to respect that; I don’t know if it already does or not.)

@uranusjr
Copy link
Member

IIUC the response header combined with #7729 (which is available in 20.1b1) should reduce the request to one per pip invocation.

@wisechengyi
Copy link
Author

Thanks, @uranusjr! That's more clear, and luckily pip already is able to support the caching mechanism if configured correctly.

Having experimented with the code, to summarize it in a more understandable way:

  1. non-zero max-age in request header allows/turns on caching for a particular url
  2. max-age in the response determines how long pip is going to keep it valid

Hi @amancevice, do you think it would be reasonable to incorporate the below change into #8078? I think it should still comply with the philosophy of the change, i.e. default header should be overwritten by the one specified on CLI.

diff --git a/src/pip/_internal/index/collector.py b/src/pip/_internal/index/collector.py
index e2c800c..e653d4e 100644
--- a/src/pip/_internal/index/collector.py
+++ b/src/pip/_internal/index/collector.py
@@ -163,7 +163,10 @@ def _get_html_response(url, session):
             # trip for the conditional GET now instead of only
             # once per 10 minutes.
             # For more information, please see pypa/pip#5670.
-            "Cache-Control": "max-age=0",
+            # However if we want to override Cache-Control, e.g. via CLI,
+            # we can still do so.
+            "Cache-Control": session.headers['Cache-Control']
+            if 'Cache-Control' in session.headers else "max-age=0"
         },
     )
     resp.raise_for_status()

@amancevice
Copy link

amancevice commented Apr 26, 2020

Sounds good to me, @wisechengyi! But would you be opposed to a slightly different logic using the get() method of the dict? It's a little less verbose and should give the same behavior (get the value if its in the dict, otherwise return the 2nd arg)

diff --git a/src/pip/_internal/index/collector.py b/src/pip/_internal/index/collector.py
index e2c800c2..2c3ba917 100644
--- a/src/pip/_internal/index/collector.py
+++ b/src/pip/_internal/index/collector.py
@@ -163,7 +163,9 @@ def _get_html_response(url, session):
             # trip for the conditional GET now instead of only
             # once per 10 minutes.
             # For more information, please see pypa/pip#5670.
-            "Cache-Control": "max-age=0",
+            # However if we want to override Cache-Control, e.g. via CLI,
+            # we can still do so.
+            "Cache-Control": session.headers.get('Cache-Control', 'max-age=0'),
         },
     )
     resp.raise_for_status()

@wisechengyi
Copy link
Author

That looks much prettier. Thanks, @amancevice !

@NoahGorny
Copy link
Contributor

@wisechengyi you should make sure that the above change will not change default behaviour (without specifying max-age)
Test that the max-age is 0 in that case

@amancevice
Copy link

amancevice commented Apr 26, 2020

I wrote a test for this last night that @wisechengyi can grab.

(sorry for the very large diff, not sure if this is the best way to share it)

diff --git a/src/pip/_internal/index/collector.py b/src/pip/_internal/index/collector.py
index e2c800c2..2c3ba917 100644
--- a/src/pip/_internal/index/collector.py
+++ b/src/pip/_internal/index/collector.py
@@ -163,7 +163,9 @@ def _get_html_response(url, session):
             # trip for the conditional GET now instead of only
             # once per 10 minutes.
             # For more information, please see pypa/pip#5670.
-            "Cache-Control": "max-age=0",
+            # However if we want to override Cache-Control, e.g. via CLI,
+            # we can still do so.
+            "Cache-Control": session.headers.get('Cache-Control', 'max-age=0'),
         },
     )
     resp.raise_for_status()
diff --git a/tests/unit/test_collector.py b/tests/unit/test_collector.py
index cfc2af1c..a8ae51aa 100644
--- a/tests/unit/test_collector.py
+++ b/tests/unit/test_collector.py
@@ -60,6 +60,7 @@ def test_get_html_response_archive_to_http_scheme(url, content_type):
     if the scheme supports it, and raise `_NotHTML` if the response isn't HTML.
     """
     session = mock.Mock(PipSession)
+    session.headers = {}
     session.head.return_value = mock.Mock(**{
         "request.method": "HEAD",
         "headers": {"Content-Type": content_type},
@@ -87,6 +88,7 @@ def test_get_html_response_archive_to_http_scheme_is_html(url):
     request is responded with text/html.
     """
     session = mock.Mock(PipSession)
+    session.headers = {}
     session.head.return_value = mock.Mock(**{
         "request.method": "HEAD",
         "headers": {"Content-Type": "text/html"},
@@ -120,6 +122,7 @@ def test_get_html_response_no_head(url):
     look like an archive, only the GET request that retrieves data.
     """
     session = mock.Mock(PipSession)
+    session.headers = {}
 
     # Mock the headers dict to ensure it is accessed.
     session.get.return_value = mock.Mock(headers=mock.Mock(**{
@@ -145,6 +148,7 @@ def test_get_html_response_dont_log_clear_text_password(caplog):
     in its DEBUG log message.
     """
     session = mock.Mock(PipSession)
+    session.headers = {}
 
     # Mock the headers dict to ensure it is accessed.
     session.get.return_value = mock.Mock(headers=mock.Mock(**{
@@ -167,6 +171,57 @@ def test_get_html_response_dont_log_clear_text_password(caplog):
     ]
 
 
+@pytest.mark.parametrize(
+    ("url", "headers", "cache_control"),
+    [
+        (
+            "http://python.org/python-3.7.1.zip",
+            {},
+            "max-age=0",
+        ),
+        (
+            "https://pypi.org/pip-18.0.tar.gz",
+            {},
+            "max-age=0",
+        ),
+        (
+            "http://python.org/python-3.7.1.zip",
+            {"Cache-Control": "max-age=1"},
+            "max-age=1",
+        ),
+        (
+            "https://pypi.org/pip-18.0.tar.gz",
+            {"Cache-Control": "max-age=1"},
+            "max-age=1",
+        ),
+    ],
+)
+def test_get_html_response_override_cache_control(url, headers, cache_control):
+    """
+    `_get_html_response()` should use the session's default value for the
+    Cache-Control header if provided.
+    """
+    session = mock.Mock(PipSession)
+    session.headers = headers
+    session.head.return_value = mock.Mock(**{
+        "request.method": "HEAD",
+        "headers": {"Content-Type": "text/html"},
+    })
+    session.get.return_value = mock.Mock(headers={"Content-Type": "text/html"})
+
+    resp = _get_html_response(url, session=session)
+
+    assert resp is not None
+    assert session.mock_calls == [
+        mock.call.head(url, allow_redirects=True),
+        mock.call.head().raise_for_status(),
+        mock.call.get(url, headers={
+            "Accept": "text/html", "Cache-Control": cache_control,
+        }),
+        mock.call.get().raise_for_status(),
+    ]
+
+
 @pytest.mark.parametrize(
     ("html", "url", "expected"),
     [
@@ -416,6 +471,7 @@ def test_request_http_error(caplog):
     caplog.set_level(logging.DEBUG)
     link = Link('http://localhost')
     session = Mock(PipSession)
+    session.headers = {}
     session.get.return_value = resp = Mock()
     resp.raise_for_status.side_effect = requests.HTTPError('Http error')
     assert _get_html_page(link, session=session) is None
@@ -429,6 +485,7 @@ def test_request_retries(caplog):
     caplog.set_level(logging.DEBUG)
     link = Link('http://localhost')
     session = Mock(PipSession)
+    session.headers = {}
     session.get.side_effect = requests.exceptions.RetryError('Retry error')
     assert _get_html_page(link, session=session) is None
     assert (
@@ -501,6 +558,7 @@ def test_get_html_page_directory_append_index(tmpdir):
     expected_url = "{}/index.html".format(dir_url.rstrip("/"))
 
     session = mock.Mock(PipSession)
+    session.headers = {}
     fake_response = make_fake_html_response(expected_url)
     mock_func = mock.patch("pip._internal.index.collector._get_html_response")
     with mock_func as mock_func:

@wisechengyi
Copy link
Author

@amancevice looks reasonable to me. thanks! it might be a good idea to put it on #8078, so if folks have further feedback, they can use the PR for review.

@wisechengyi
Copy link
Author

wisechengyi commented Apr 27, 2020

Hi @NoahGorny the tests Alexandar added should do there. If you would be able to review #8078, that'll be appreciated.

amancevice pushed a commit to amancevice/pip that referenced this issue May 6, 2020
jsirois pushed a commit to pex-tool/pip that referenced this issue May 11, 2020
amancevice pushed a commit to amancevice/pip that referenced this issue May 14, 2020
amancevice pushed a commit to amancevice/pip that referenced this issue Jul 16, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
S: needs triage Issues/PRs that need to be triaged
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants