Prevent / support deep pagination #5838

Closed · 2 tasks
elynema opened this issue May 21, 2024 · 10 comments

@elynema (Contributor) commented May 21, 2024

Description

It is a known issue with Solr / Blacklight that deep pagination into either facet sets or results causes significant performance issues. We fairly regularly see occurrences in the MCO logs where requests come in past the 100-page mark. Although these don't normally exhibit sequential paging (e.g., requesting pages 650, 651, 652, etc.) and are one-off requests, more targeted paging does sometimes occur and may be contributing to sudden slowdowns in Solr and CPU spikes on the Solr server.
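For a sense of scale, the cost comes from Solr's offset-based paging, where reading page N requires materializing everything before it. A quick illustration, assuming 10 results per page (our actual per_page setting may differ):

```ruby
# Rough arithmetic for a "page 651" request, assuming 10 results per page
# (a common Blacklight default; our actual per_page setting may differ).
rows_per_page = 10
page          = 651
start         = (page - 1) * rows_per_page
# Solr receives ?start=6500&rows=10 and must score, sort, and buffer the
# top 6,510 matches just to return the final 10.
puts start # => 6500
```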

How can we better handle these situations?

Done Looks Like

  • Investigation into remediation of deep pagination performance issues
  • would sitemaps help, or is that a separate question?
@joncameron (Contributor) commented:

There may be suggestions from the Blacklight community on pagination performance that we could add to this issue.

masaball self-assigned this Sep 3, 2024
@cjcolvar (Member) commented Sep 4, 2024

Some quick links to Blacklight discussions about deep pagination for me to look at later:

  • projectblacklight/blacklight#3094
  • projectblacklight/blacklight#1665
  • https://code4lib.slack.com/archives/C54TB5WDQ/p1564656616035000
  • https://code4lib.slack.com/archives/C0B3ELJQM/p1631023853055000
The first one is interesting since it suggests a possible solution: remove the last-page links from the pagination bar at the bottom of search results so it reads 1 2 3 4 5 ... instead of 1 2 3 4 5 ... 10,825 10,826. Bots and users could still dig that deep, but this would make it less likely.
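For Blacklight 7, one way to approximate that without the upstream change would be overriding the Kaminari paginator partial that renders the page links. A rough sketch, assuming the app uses Kaminari's standard theme partials (the file path, markup, and dropped helpers are illustrative, not Blacklight's shipped template):

```erb
<%# app/views/kaminari/_paginator.html.erb -- sketch of an override that
    keeps pages near the start and around the current page, but drops the
    right-outer (last-page) links so the bar ends at "..." %>
<%= paginator.render do -%>
  <nav class="pagination">
    <%= prev_page_tag unless current_page.first? %>
    <% each_page do |page| -%>
      <% if page.left_outer? || page.inside_window? -%>
        <%= page_tag page %>
      <% elsif !page.was_truncated? -%>
        <%= gap_tag %>
      <% end -%>
    <% end -%>
    <%= next_page_tag unless current_page.last? %>
  </nav>
<% end -%>
```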

@masaball (Contributor) commented Sep 4, 2024

Thanks for the links! I had found some of the code linked from the Slack discussions and issue 1665, but there was some good additional info in the other threads.

PR 3094 is an interesting one; it does not look like it is in any v7 release, so it would be another impetus for pushing to Blacklight 8 once it supports Rails 7.2. Stanford homebrewed a similar approach 6-7 years ago, so it is neat to see comparable behavior incorporated upstream.

From reading through all of @cjcolvar's links, the most common approach among other Blacklight institutions has been to limit deep pagination:

  • UPenn has established pagination thresholds that return the message "You have paginated too deep into facets. Please contact us if you need to view facets past page #{FACET_PAGINATION_THRESHOLD}" if a user or bot tries to navigate past that point. UPenn limits both search results and facet results.
  • Stanford, in addition to removing the deep page links at the bottom of searches, redirected users to the homepage with a message similar to UPenn's if they tried to navigate past a certain page threshold.
  • Brown blocked requests for any page past 1,000, returning a 400 error to non-HTML requests (most likely bots) and, for HTML requests, an error message recommending that the user narrow their search with facets or contact support. (A sketch of this kind of threshold guard follows this list.)
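A minimal sketch of that kind of threshold guard in a Rails controller; the threshold value, message text, and method name below are illustrative rather than taken from any of these institutions' code:

```ruby
# app/controllers/catalog_controller.rb (sketch)
class CatalogController < ApplicationController
  MAX_RESULT_PAGE = 250 # hypothetical threshold

  before_action :enforce_pagination_limit, only: [:index, :facet]

  private

  # Redirect humans with an explanation; give bots (non-HTML requests)
  # a bare 400, similar to Brown's approach.
  def enforce_pagination_limit
    return if params[:page].to_i <= MAX_RESULT_PAGE

    respond_to do |format|
      format.html do
        flash[:alert] = "Please use facets to narrow your search rather than paging past page #{MAX_RESULT_PAGE}."
        redirect_to root_path
      end
      format.any { head :bad_request }
    end
  end
end
```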

Other site search behavior:

  • Google, in my testing, seems to limit you to the first 200-300 items returned (none of my searches would go past page 30).
  • GitHub limits you to the first 1,000 results (100 pages at 10 results per page). Trying to manually navigate to a later page gets a 404 error page.
  • Bing seems to allow unlimited manual paging, but trying to jump to an arbitrary page by adjusting the URL frequently returns a single page with no option to navigate forward or backward from there. However, I think they are using a system similar to Solr's cursors, where each next-page request is fed the end point of the previous one, so the engine can jump directly to the new start point instead of paging through everything before it.

Blacklight does not support cursorMarks because Solr's cursors only move forward, not bi-directionally, which would break paging backward through results. They also would not help with bots or users jumping to arbitrary pages. If we wanted to investigate them anyway, we would have to roll our own implementation, and given the limitations on the Solr side I am not confident how much real benefit there would be.
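For reference, this is roughly what driving Solr's cursor API directly looks like with the rsolr gem (the core URL and uniqueKey field are assumptions):

```ruby
require 'rsolr'

solr   = RSolr.connect(url: 'http://localhost:8983/solr/avalon') # assumed core
cursor = '*' # '*' asks Solr to start a new cursor

loop do
  response = solr.get('select', params: {
    q:          '*:*',
    rows:       100,
    sort:       'id asc', # cursors require a sort ending on the uniqueKey
    cursorMark: cursor
  })
  response['response']['docs'].each { |doc| puts doc['id'] }

  next_cursor = response['nextCursorMark']
  break if next_cursor == cursor # the cursor stops advancing at the end

  cursor = next_cursor
end
```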

There is some discussion of sitemaps and schemas in the code4lib conversations. People seemed to be saying that sitemaps can reduce bot traffic, but that deep pagination requests are heavy enough that a single request can sometimes cripple a large enough dataset. I do not think our dataset is quite that large. Sitemaps seem like they would be beneficial in general, but they would not necessarily have a direct effect on this issue.

So at this time, it seems like the main way forward would be to limit how deep users can paginate, and perhaps upgrade to Blacklight 8 to get the configurable pagination bar.

@elynema (Contributor, Author) commented Sep 5, 2024

Presumably a future Blacklight 8.x release will support Rails 7.2; the current Blacklight 8 isn't there yet.

@elynema (Contributor, Author) commented Sep 5, 2024

Proposal: discuss this first at Backlog Refinement; we can then schedule more time for discussion if needed.

@joncameron (Contributor) commented:

Looking at log data could be helpful as well to see what the requests are like in practice.

@elynema (Contributor, Author) commented Sep 11, 2024

I asked Digital Collections and IUCAT folks if they are doing anything about this. Digital Collections said no.

David Elyea said about IUCAT:

I think you could still use Rack Attack to limit paging if what you're seeing is the same IP number (or possibly user agent) requesting a new page of results every couple of seconds. Here's a link showing how one app attempted this: https://github.com/mastodon/mastodon/blob/a021dee64214fcc662c0c36ad4e44dc1deaba65f/config/initializers/rack_attack.rb#L93

I've done A LOT with Rack Attack for IUCAT if you have any questions or need help. It's helped us a lot with bot issues. I think you might even be able to put up a custom response page in case any actual user accidentally gets throttled by your "deep pagination" rule.
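A minimal sketch of the Rack::Attack approach David describes, for rack-attack 6.x; the path prefix, threshold, and limits are assumptions, not IUCAT's actual rules:

```ruby
# config/initializers/rack_attack.rb (sketch)
class Rack::Attack
  DEEP_PAGE_THRESHOLD = 100 # hypothetical cutoff

  # Throttle any client that keeps requesting pages past the threshold.
  throttle('catalog/deep_pagination', limit: 5, period: 1.minute) do |req|
    req.ip if req.path.start_with?('/catalog') &&
              req.params['page'].to_i > DEEP_PAGE_THRESHOLD
  end

  # Custom response so a throttled human sees an explanation, not a bare 429.
  self.throttled_responder = lambda do |_request|
    [429, { 'Content-Type' => 'text/plain' },
     ["Too many deep pagination requests. Please use facets to narrow your search.\n"]]
  end
end
```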

@joncameron (Contributor) commented Sep 11, 2024

Putting Blacklight 8.x on the roadmap would be a good next step in that part of the investigation.

Next step: look at the logs and retrieve service statistics. Examining how much of an issue this is for us can be part of that; we don't need to fully block deep pagination if it isn't a large performance problem in our case. If the logs don't point to humans, best practice would be to disable deep pagination unless we can show it's not a problem.
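Something like this could bucket deep-page requests out of a Rails log as a starting point (a rough sketch; the log path, threshold, and parameter format are assumptions):

```ruby
# tally_deep_pages.rb (sketch): count requests past a page threshold,
# bucketed by hundreds, from a standard Rails request log.
threshold = 100
buckets   = Hash.new(0)

File.foreach('log/production.log') do |line|
  next unless line =~ /Parameters: .*"page"\s*=>\s*"(\d+)"/
  page = Regexp.last_match(1).to_i
  buckets[(page / 100) * 100] += 1 if page > threshold
end

buckets.sort.each do |floor, count|
  puts format('pages %d-%d: %d requests', floor, floor + 99, count)
end
```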

It would be ideal for us not to disable this; in practice, though, real users are unlikely to be regularly paging deep into search results.

Others report using https://github.com/rack/rack-attack successfully. See https://github.com/mastodon/mastodon/blob/a021dee64214fcc662c0c36ad4e44dc1deaba65f/config/initializers/rack_attack.rb#L93 for an example throttling configuration in that library.

Also: what throttling is currently in place at the proxy level? We can check on how this is currently handled in our production server architecture.

@joncameron (Contributor) commented:

@joncameron to write a new investigation issue carrying on the work here: paging behavior, real-world load, and options for mitigating the performance issues.

@joncameron (Contributor) commented:

Follow-on issue: #6038
