Allow cache TTL override for --find-links urls #8109

wisechengyi · 2020-04-21T23:04:13Z

What's the problem this feature will solve?

Our wheels (thousands of them) are hosted in a flat directory, e.g. https://myhost.com/wheels/ has

...
django_auth_ldap-1.6.1-py2.py3-none-any.whl
django_bootstrap_toolkit-2.15.0-py2-none-any.whl
django_bootstrap_toolkit-2.15.0-py3-none-any.whl
django_braces-1.14.0-py2.py3-none-any.whl
django_common_helpers-0.9.2-py2-none-any.whl
django_common_helpers-0.9.2-py3-none-any.whl
django_configurations-2.2-py2.py3-none-any.whl
django_crispy_forms-1.8.1-py2.py3-none-any.whl
django_cron-0.5.1-py2-none-any.whl
django_cron-0.5.1-py3-none-any.whl
...

With --no-index --find-links https://myhost.com/wheels/, pip will query it for every artifact it needs transitively. However, when large number of builds are running concurrently with pip, the wheel server can be overwhelmed.

Describe the solution you'd like

We'd like some mechanism to force the cache TTL for the index page. Something to the effect of:

diff --git a/src/pip/_vendor/cachecontrol/controller.py b/src/pip/_vendor/cachecontrol/controller.py
index dafe55c..f0066a7 100644
--- a/src/pip/_vendor/cachecontrol/controller.py
+++ b/src/pip/_vendor/cachecontrol/controller.py
@@ -84,7 +84,9 @@ class CacheController(object):
 
         retval = {}
 
+        cc_headers = 'max-age=8'
         for cc_directive in cc_headers.split(","):
             if not cc_directive.strip():
                 continue

In this case I was hardcoding the cache TTL to be 8 seconds.

It can be plumbed via an option, e.g. adding --force-index-cache-ttl=<some seconds> to below:

Package Index Options:
  -i, --index-url <url>       Base URL of the Python Package Index (default https://pypi.org/simple). This should point to a repository compliant with PEP 503 (the
                              simple repository API) or a local directory laid out in the same format.
  --extra-index-url <url>     Extra URLs of package indexes to use in addition to --index-url. Should follow the same rules as --index-url.
  --no-index                  Ignore package index (only looking at --find-links URLs instead).
  -f, --find-links <url>      If a URL or path to an html file, then parse for links to archives such as sdist (.tar.gz) or wheel (.whl) files. If a local path or
                              file:// URL that's a directory,  then look for archives in the directory listing. Links to VCS project URLs are not supported.

Please kindly let me know if the concept is acceptable or if there's a better approach to this.

Thanks!

The text was updated successfully, but these errors were encountered:

wisechengyi · 2020-04-23T22:35:52Z

Hi @cjerdonek, would you be able to take a look at this? Saw you were the last person who made related changes.

NoahGorny · 2020-04-24T18:47:41Z

Take a look at #8042, seems like it can solve this problem as well

wisechengyi · 2020-04-24T22:26:09Z

Hi @NoahGorny, thanks for the pointer! This use case is a bit different from #8042 which sets the header in the outgoing requests to server, whereas this is more about how to interpreter/change the header in the response received from server.

~~Specifically the existing code is overwriting max-age to be 0 regardless what the server response is~~

The goal is here to manipulate to 'max-age' to force pip to cache the response, regardless what 'max-age' is in the actual server response.

Edit: clarify where the change should be.

wisechengyi · 2020-04-24T23:12:36Z

ok looks like pip is using the request header to determine whether the response should be cached. #8009

index dafe55c..f0066a7 100644
--- a/src/pip/_vendor/cachecontrol/controller.py
+++ b/src/pip/_vendor/cachecontrol/controller.py
@@ -84,7 +84,9 @@ class CacheController(object):
 
         retval = {}
 
+        cc_headers = 'max-age=8'
         for cc_directive in cc_headers.split(","):
             if not cc_directive.strip():
                 continue

Happened to work because it will change the parsing for both request and response.

NoahGorny · 2020-04-24T23:30:20Z

@wisechengyi I think that allowing custom headers for the request will solve it, but only if the max-age override is somehow skipped if specified by the user. The blocker is the client side request and not the server side response.

wisechengyi · 2020-04-24T23:45:48Z

~~Ah yes you are totally right! @NoahGorny. I was able to test it by passing max-age on CLI and comment out:~~

pip/src/pip/_internal/index/collector.py

Line 166 in 327a315

"Cache-Control": "max-age=0",

Spoke too soon.
With

diff --git a/src/pip/_internal/index/collector.py b/src/pip/_internal/index/collector.py
index e2c800c..3394b1e 100644
--- a/src/pip/_internal/index/collector.py
+++ b/src/pip/_internal/index/collector.py
@@ -163,7 +163,7 @@ def _get_html_response(url, session):
             # trip for the conditional GET now instead of only
             # once per 10 minutes.
             # For more information, please see pypa/pip#5670.
-            "Cache-Control": "max-age=0",
+            # "Cache-Control": "max-age=0",
         },
     )
     resp.raise_for_status()
diff --git a/src/pip/_vendor/cachecontrol/controller.py b/src/pip/_vendor/cachecontrol/controller.py
index dafe55c..26bca79 100644
--- a/src/pip/_vendor/cachecontrol/controller.py
+++ b/src/pip/_vendor/cachecontrol/controller.py
@@ -126,6 +126,8 @@ class CacheController(object):
         logger.debug('Looking up "%s" in the cache', cache_url)
         cc = self.parse_cache_control(request.headers)
 
+        print('request header:', cc)
+
         # Bail out if the request insists on fresh data
         if "no-cache" in cc:
             logger.debug('Request header has "no-cache", cache bypassed')

On top of #8078

$ python3 src/pip -vvv download --dest /tmp/yic_cache --no-index --find-links https://myhost.com/home/third_party/source/python/wheels/ -H 'Cache-Control: max-age=10' Flask==1.1.1 
...
Getting page https://myhost.com/home/third_party/source/python/wheels/
Looking up "https://myhost.com/home/third_party/source/python/wheels/" in the cache
request header: {'max-age': 10}
No cache entry available // this appears multiple times.

I wonder if server response also requires max-age.

uranusjr · 2020-04-25T08:11:45Z

I wonder if server response also requires max-age.

Yes, and they mean different things. The request’s Cache-Control tells the server what to send. If you want the server to avoid receiving requests at all, you should set Cache-Control in the response to tell the client to not send the same request for the duration specified by max-age. (And of course pip’s client needs to implement support to respect that; I don’t know if it already does or not.)

uranusjr · 2020-04-25T08:16:20Z

IIUC the response header combined with #7729 (which is available in 20.1b1) should reduce the request to one per pip invocation.

wisechengyi · 2020-04-25T23:21:55Z

Thanks, @uranusjr! That's more clear, and luckily pip already is able to support the caching mechanism if configured correctly.

Having experimented with the code, to summarize it in a more understandable way:

non-zero max-age in request header allows/turns on caching for a particular url
max-age in the response determines how long pip is going to keep it valid

Hi @amancevice, do you think it would be reasonable to incorporate the below change into #8078? I think it should still comply with the philosophy of the change, i.e. default header should be overwritten by the one specified on CLI.

diff --git a/src/pip/_internal/index/collector.py b/src/pip/_internal/index/collector.py
index e2c800c..e653d4e 100644
--- a/src/pip/_internal/index/collector.py
+++ b/src/pip/_internal/index/collector.py
@@ -163,7 +163,10 @@ def _get_html_response(url, session):
             # trip for the conditional GET now instead of only
             # once per 10 minutes.
             # For more information, please see pypa/pip#5670.
-            "Cache-Control": "max-age=0",
+            # However if we want to override Cache-Control, e.g. via CLI,
+            # we can still do so.
+            "Cache-Control": session.headers['Cache-Control']
+            if 'Cache-Control' in session.headers else "max-age=0"
         },
     )
     resp.raise_for_status()

amancevice · 2020-04-26T01:07:22Z

Sounds good to me, @wisechengyi! But would you be opposed to a slightly different logic using the get() method of the dict? It's a little less verbose and should give the same behavior (get the value if its in the dict, otherwise return the 2nd arg)

diff --git a/src/pip/_internal/index/collector.py b/src/pip/_internal/index/collector.py
index e2c800c2..2c3ba917 100644
--- a/src/pip/_internal/index/collector.py
+++ b/src/pip/_internal/index/collector.py
@@ -163,7 +163,9 @@ def _get_html_response(url, session):
             # trip for the conditional GET now instead of only
             # once per 10 minutes.
             # For more information, please see pypa/pip#5670.
-            "Cache-Control": "max-age=0",
+            # However if we want to override Cache-Control, e.g. via CLI,
+            # we can still do so.
+            "Cache-Control": session.headers.get('Cache-Control', 'max-age=0'),
         },
     )
     resp.raise_for_status()

wisechengyi · 2020-04-26T04:07:03Z

That looks much prettier. Thanks, @amancevice !

NoahGorny · 2020-04-26T09:14:45Z

@wisechengyi you should make sure that the above change will not change default behaviour (without specifying max-age)
Test that the max-age is 0 in that case

amancevice · 2020-04-26T15:30:41Z

I wrote a test for this last night that @wisechengyi can grab.

(sorry for the very large diff, not sure if this is the best way to share it)

diff --git a/src/pip/_internal/index/collector.py b/src/pip/_internal/index/collector.py
index e2c800c2..2c3ba917 100644
--- a/src/pip/_internal/index/collector.py
+++ b/src/pip/_internal/index/collector.py
@@ -163,7 +163,9 @@ def _get_html_response(url, session):
             # trip for the conditional GET now instead of only
             # once per 10 minutes.
             # For more information, please see pypa/pip#5670.
-            "Cache-Control": "max-age=0",
+            # However if we want to override Cache-Control, e.g. via CLI,
+            # we can still do so.
+            "Cache-Control": session.headers.get('Cache-Control', 'max-age=0'),
         },
     )
     resp.raise_for_status()
diff --git a/tests/unit/test_collector.py b/tests/unit/test_collector.py
index cfc2af1c..a8ae51aa 100644
--- a/tests/unit/test_collector.py
+++ b/tests/unit/test_collector.py
@@ -60,6 +60,7 @@ def test_get_html_response_archive_to_http_scheme(url, content_type):
     if the scheme supports it, and raise `_NotHTML` if the response isn't HTML.
     """
     session = mock.Mock(PipSession)
+    session.headers = {}
     session.head.return_value = mock.Mock(**{
         "request.method": "HEAD",
         "headers": {"Content-Type": content_type},
@@ -87,6 +88,7 @@ def test_get_html_response_archive_to_http_scheme_is_html(url):
     request is responded with text/html.
     """
     session = mock.Mock(PipSession)
+    session.headers = {}
     session.head.return_value = mock.Mock(**{
         "request.method": "HEAD",
         "headers": {"Content-Type": "text/html"},
@@ -120,6 +122,7 @@ def test_get_html_response_no_head(url):
     look like an archive, only the GET request that retrieves data.
     """
     session = mock.Mock(PipSession)
+    session.headers = {}
 
     # Mock the headers dict to ensure it is accessed.
     session.get.return_value = mock.Mock(headers=mock.Mock(**{
@@ -145,6 +148,7 @@ def test_get_html_response_dont_log_clear_text_password(caplog):
     in its DEBUG log message.
     """
     session = mock.Mock(PipSession)
+    session.headers = {}
 
     # Mock the headers dict to ensure it is accessed.
     session.get.return_value = mock.Mock(headers=mock.Mock(**{
@@ -167,6 +171,57 @@ def test_get_html_response_dont_log_clear_text_password(caplog):
     ]
 
 
+@pytest.mark.parametrize(
+    ("url", "headers", "cache_control"),
+    [
+        (
+            "http://python.org/python-3.7.1.zip",
+            {},
+            "max-age=0",
+        ),
+        (
+            "https://pypi.org/pip-18.0.tar.gz",
+            {},
+            "max-age=0",
+        ),
+        (
+            "http://python.org/python-3.7.1.zip",
+            {"Cache-Control": "max-age=1"},
+            "max-age=1",
+        ),
+        (
+            "https://pypi.org/pip-18.0.tar.gz",
+            {"Cache-Control": "max-age=1"},
+            "max-age=1",
+        ),
+    ],
+)
+def test_get_html_response_override_cache_control(url, headers, cache_control):
+    """
+    `_get_html_response()` should use the session's default value for the
+    Cache-Control header if provided.
+    """
+    session = mock.Mock(PipSession)
+    session.headers = headers
+    session.head.return_value = mock.Mock(**{
+        "request.method": "HEAD",
+        "headers": {"Content-Type": "text/html"},
+    })
+    session.get.return_value = mock.Mock(headers={"Content-Type": "text/html"})
+
+    resp = _get_html_response(url, session=session)
+
+    assert resp is not None
+    assert session.mock_calls == [
+        mock.call.head(url, allow_redirects=True),
+        mock.call.head().raise_for_status(),
+        mock.call.get(url, headers={
+            "Accept": "text/html", "Cache-Control": cache_control,
+        }),
+        mock.call.get().raise_for_status(),
+    ]
+
+
 @pytest.mark.parametrize(
     ("html", "url", "expected"),
     [
@@ -416,6 +471,7 @@ def test_request_http_error(caplog):
     caplog.set_level(logging.DEBUG)
     link = Link('http://localhost')
     session = Mock(PipSession)
+    session.headers = {}
     session.get.return_value = resp = Mock()
     resp.raise_for_status.side_effect = requests.HTTPError('Http error')
     assert _get_html_page(link, session=session) is None
@@ -429,6 +485,7 @@ def test_request_retries(caplog):
     caplog.set_level(logging.DEBUG)
     link = Link('http://localhost')
     session = Mock(PipSession)
+    session.headers = {}
     session.get.side_effect = requests.exceptions.RetryError('Retry error')
     assert _get_html_page(link, session=session) is None
     assert (
@@ -501,6 +558,7 @@ def test_get_html_page_directory_append_index(tmpdir):
     expected_url = "{}/index.html".format(dir_url.rstrip("/"))
 
     session = mock.Mock(PipSession)
+    session.headers = {}
     fake_response = make_fake_html_response(expected_url)
     mock_func = mock.patch("pip._internal.index.collector._get_html_response")
     with mock_func as mock_func:

wisechengyi · 2020-04-26T22:05:33Z

@amancevice looks reasonable to me. thanks! it might be a good idea to put it on #8078, so if folks have further feedback, they can use the PR for review.

See pypa#8109

wisechengyi · 2020-04-27T20:29:10Z

Hi @NoahGorny the tests Alexandar added should do there. If you would be able to review #8078, that'll be appreciated.

See pypa#8109

Cherry pick pypa#8078 Fixes: pypa#8109

See pypa#8109

triage-new-issues bot added the S: needs triage Issues/PRs that need to be triaged label Apr 21, 2020

amancevice pushed a commit to amancevice/pip that referenced this issue Apr 26, 2020

Allow override of Cache-Control header

6385f02

See pypa#8109

wisechengyi mentioned this issue Apr 27, 2020

Add CLI options for enhancing requests with HTTP headers pex-tool/pip#7

Merged

amancevice pushed a commit to amancevice/pip that referenced this issue May 6, 2020

Allow override of Cache-Control header

3c38b23

See pypa#8109

jsirois pushed a commit to pex-tool/pip that referenced this issue May 11, 2020

Add CLI options for enhancing requests with HTTP headers (#7)

f9dde7c

Cherry pick pypa#8078 Fixes: pypa#8109

amancevice pushed a commit to amancevice/pip that referenced this issue May 14, 2020

Allow override of Cache-Control header

e62870b

See pypa#8109

amancevice pushed a commit to amancevice/pip that referenced this issue Jul 16, 2020

Allow override of Cache-Control header

813db6b

See pypa#8109

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Allow cache TTL override for --find-links urls #8109

Allow cache TTL override for --find-links urls #8109

wisechengyi commented Apr 21, 2020 •

edited

Loading

wisechengyi commented Apr 23, 2020

NoahGorny commented Apr 24, 2020

wisechengyi commented Apr 24, 2020 •

edited

Loading

wisechengyi commented Apr 24, 2020

NoahGorny commented Apr 24, 2020

wisechengyi commented Apr 24, 2020 •

edited

Loading

uranusjr commented Apr 25, 2020

uranusjr commented Apr 25, 2020

wisechengyi commented Apr 25, 2020

amancevice commented Apr 26, 2020 •

edited

Loading

wisechengyi commented Apr 26, 2020

NoahGorny commented Apr 26, 2020

amancevice commented Apr 26, 2020 •

edited

Loading

wisechengyi commented Apr 26, 2020

wisechengyi commented Apr 27, 2020 •

edited

Loading

Allow cache TTL override for --find-links urls #8109

Allow cache TTL override for --find-links urls #8109

Comments

wisechengyi commented Apr 21, 2020 • edited Loading

wisechengyi commented Apr 23, 2020

NoahGorny commented Apr 24, 2020

wisechengyi commented Apr 24, 2020 • edited Loading

wisechengyi commented Apr 24, 2020

NoahGorny commented Apr 24, 2020

wisechengyi commented Apr 24, 2020 • edited Loading

uranusjr commented Apr 25, 2020

uranusjr commented Apr 25, 2020

wisechengyi commented Apr 25, 2020

amancevice commented Apr 26, 2020 • edited Loading

wisechengyi commented Apr 26, 2020

NoahGorny commented Apr 26, 2020

amancevice commented Apr 26, 2020 • edited Loading

wisechengyi commented Apr 26, 2020

wisechengyi commented Apr 27, 2020 • edited Loading

wisechengyi commented Apr 21, 2020 •

edited

Loading

wisechengyi commented Apr 24, 2020 •

edited

Loading

wisechengyi commented Apr 24, 2020 •

edited

Loading

amancevice commented Apr 26, 2020 •

edited

Loading

amancevice commented Apr 26, 2020 •

edited

Loading

wisechengyi commented Apr 27, 2020 •

edited

Loading