Search: stop relying on the DB when indexing #10696

stsewd · 2023-08-31T23:47:00Z

- Removed the "wipe" actions from the admin instead of porting them, since I'm not sure we need an admin action just to delete the search index of a project. Re-index seems useful.
- fileify was replaced by index_build, which only requires the build id to be passed; any other information can be retrieved from the build/version object.
- fileify isn't removed in this PR to avoid downtime during deploy; it's safe to keep it around until the next deploy.
- The new code avoids any deep connection to the django-elasticsearch-dsl package, since it no longer makes sense to have it and I'm planning on removing it.
- We are no longer tracking all files in the DB, only the ones of interest.
- Re-indexing a version will also re-evaluate the files from the DB, which is useful for old projects that are out of sync.
- The reindex command now generates tasks per version rather than per collection of files, since we no longer track all files in the DB.
- Closes Search: stop relying on the DB when indexing #10623
- Closes Build: track imported files for external versions #10690

We don't need to do anything special during deploy; this is zero downtime out of the box. We can trigger a re-index for all versions if we want to delete the HTML files that we don't need from the DB, but that operation will also re-index their contents in ES, so it's probably better to do that once we have settled on any changes to ES.

stsewd · 2023-09-13T01:41:40Z

readthedocs/builds/models.py

@@ -320,7 +320,7 @@ def config(self):
        :rtype: dict
        """
        last_build = (
-            self.builds(manager=INTERNAL).filter(
+            self.builds.filter(


We don't need to filter builds by internal/external here. If we are accessing the version, we already know whether it's external or internal.

humitos · 2023-09-13T08:30:39Z

Re-indexing a version will also re-evaluate the files from the DB, useful for old projects that are out of sync.

Can we split this into two different tasks so we can trigger them in different situations? Like the one you mentioned:

We can trigger a re-index for all versions if we want to delete the HTML files that we don't need from the DB, but that operation will also re-index their contents in ES

Then, for the normal workflow, if we need them to be run sequentially we can chain the two tasks like celery.chain([sync_html_files, index_build]) (pseudocode, since I don't remember the functions' names).

humitos

This looks good to me! I'd like to understand the "build ID hack" I commented on a little better, and also whether it's possible to split "create HTML objects and reindex" into two different tasks so we can call them separately.

humitos · 2023-09-13T08:32:32Z

readthedocs/builds/querysets.py

+                builds__success=True,
+            )
+            .exclude(project__delisted=True)
+            .exclude(project__is_spam=True)


We should probably use a score here. Otherwise, only projects manually marked as spam will be excluded here.

This is a copy of readthedocs.org/readthedocs/search/documents.py, lines 120 to 130 in 53e21bb:

    def get_queryset(self):
        """Don't include ignored files and delisted projects."""
        queryset = super().get_queryset()
        queryset = (
            queryset
            .exclude(ignore=True)
            .exclude(project__delisted=True)
            .exclude(project__is_spam=True)
            .select_related('version', 'project')
        )
        return queryset

It's probably better to discuss this at #9899.

humitos · 2023-09-13T08:46:16Z

readthedocs/projects/tasks/search.py

+                # Last pattern to match takes precedence
+                # XXX: see if we can implement another type of precedence,
+                # like the longest pattern.


Suggested change

-                # Last pattern to match takes precedence
-                # XXX: see if we can implement another type of precedence,
-                # like the longest pattern.
+                # Last pattern to match takes precedence
+                # TODO: see if we can implement another type of precedence,
+                # like the longest pattern.

I think we can just remove this note. I don't think we will be changing how our search priority works, since that would be a breaking change. I like how the current precedence works.
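For readers unfamiliar with the rule in the code comment, here is a minimal sketch of "last pattern to match takes precedence". It is not the code from this PR: the function name `resolve_rank`, the example patterns, and the default rank of 0 are assumptions made for illustration, and glob matching is done with the standard library's `fnmatch`.

```python
from fnmatch import fnmatch


def resolve_rank(path, ranking, default=0):
    """Return the rank of the last pattern in ``ranking`` that matches ``path``.

    ``ranking`` is a hypothetical mapping of glob pattern -> rank.
    """
    rank = default
    # Dicts preserve insertion order, so a later matching pattern
    # overwrites the rank set by an earlier one.
    for pattern, value in ranking.items():
        if fnmatch(path, pattern):
            rank = value
    return rank


# Hypothetical configuration: the broad pattern comes first,
# the more specific override comes last.
ranking = {
    "api/*": -5,
    "api/v2/*": 3,
}

print(resolve_rank("api/v2/index.html", ranking))    # matched by both, last one wins
print(resolve_rank("api/v1/index.html", ranking))    # matched only by "api/*"
print(resolve_rank("guides/install.html", ranking))  # no match, default rank
```

Under this rule the order of the patterns is part of the configuration, which is why switching to something like "longest pattern wins" would change results for existing users.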

stsewd · 2023-09-13T17:48:34Z

Can we split this into two different tasks so we can trigger them in different situations? Like the one you mentioned:

We can, but then we will need to walk the storage twice. I don't think we currently need to do these operations separately. If we do, we can probably just have three tasks: one that does both operations together in a single walk, and two others that do them independently.
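The trade-off described above can be sketched roughly as follows. This is an illustration only, not the implementation in this PR: it walks a local temporary directory with `os.walk` instead of the build storage, and `walk_html_files`, `process_in_one_walk`, `sync_file`, and `index_file` are hypothetical names.

```python
import os
import tempfile


def walk_html_files(root):
    """Yield the path (relative to ``root``) of every HTML file under ``root``."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in sorted(filenames):
            if name.endswith(".html"):
                yield os.path.relpath(os.path.join(dirpath, name), root)


def process_in_one_walk(root, handlers):
    """Walk ``root`` once and pass each HTML file to every handler.

    Returns the number of HTML files visited.
    """
    count = 0
    for path in walk_html_files(root):
        for handler in handlers:
            handler(path)
        count += 1
    return count


# Hypothetical stand-ins for "sync the file objects in the DB"
# and "index the file contents in ES".
synced, indexed = [], []


def sync_file(path):
    synced.append(path)


def index_file(path):
    indexed.append(path)


with tempfile.TemporaryDirectory() as root:
    os.makedirs(os.path.join(root, "guides"))
    for parts in (("index.html",), ("guides", "install.html"), ("objects.inv",)):
        with open(os.path.join(root, *parts), "w") as fh:
            fh.write("")
    # Both operations share a single traversal of the tree.
    walked = process_in_one_walk(root, [sync_file, index_file])

print(walked, sorted(synced), sorted(indexed))
```

Running each operation as its own task would amount to calling `process_in_one_walk` twice with one handler each, which is the double walk mentioned in the comment.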

humitos

🚀

readthedocs/projects/tasks/search.py

This task was replaced in #10696.

stsewd added 11 commits August 31, 2023 18:45

Search: stop relying on the DB when indexing

871bfe2

- Closes #10623 - Closes #10690

Second draft

c6df611

Merge branch 'main' into dont-rely-on-db-to-index

912e457

Fix tests

bc8e4ff

Merge branch 'main' into dont-rely-on-db-to-index

f3853ea

Merge branch 'main' into dont-rely-on-db-to-index

4ef4105

Remove more code

3e2fd31

Simplify more things

c5ec080

Fix test

d57c481

Small updates

022e7e9

Update test

19d7c5c

stsewd commented Sep 13, 2023

stsewd marked this pull request as ready for review September 13, 2023 01:55

stsewd requested review from a team as code owners September 13, 2023 01:55

stsewd requested review from ericholscher and agjohnson September 13, 2023 01:55

auto-assign bot assigned stsewd Sep 13, 2023

stsewd removed request for a team and agjohnson September 13, 2023 01:55

This was referenced Sep 13, 2023

Unify HTMLFile and ImportedFile models #10729

Open

Search: remove django-elasticsearch-dsl dependency #10730

Open

ImportedFile: remove unused fields #10731

Open

humitos reviewed Sep 13, 2023

stsewd added 2 commits September 13, 2023 13:06

Updates from review and fixes

79e6141

Format

2ccf0e8

Rename build -> sync_id

8fdbc06

stsewd mentioned this pull request Sep 13, 2023

Search/ImportedFile: rename build to sync_id #10734

Open

humitos approved these changes Sep 14, 2023

Updates from review

dd9e226

stsewd merged commit fa54900 into main Sep 14, 2023

stsewd deleted the dont-rely-on-db-to-index branch September 14, 2023 15:40

humitos mentioned this pull request Sep 15, 2023

Don't rely on the HTMLFile model to index and re-index files #7875

Closed

stsewd added a commit that referenced this pull request Sep 18, 2023

Tasks: remove old fileify task

953dd72

This task was replaced in #10696.

stsewd mentioned this pull request Sep 18, 2023

Tasks: remove old fileify task #10747

Merged

stsewd added a commit that referenced this pull request Sep 19, 2023

Tasks: remove old fileify task (#10747)

283fd10

This task was replaced in #10696.
