Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Search: stop relying on the DB when indexing #10696

Merged
merged 15 commits into from
Sep 14, 2023
Merged

Conversation

stsewd
Copy link
Member

@stsewd stsewd commented Aug 31, 2023

  • Removed the "wipe" actions from the admin instead of porting them, since I'm not sure that we need an action in the admin just to delete the search index of a project. Re-index seems useful.

  • fileify was replaced by index_build, and it only requires the build id to be passed, any other information can be retrieved from the build/version object.

  • fileify isn't removed in this PR to avoid downtimes during deploy, it's safe to keep it around till next deploy.

  • New code is avoiding any deep connection to the django-elasticsearch-dsl package, since it doesn't make sense anymore to have it, and I'm planning on removing it.

  • We are no longer tracking all files in the DB, only the ones of interest.

  • Re-indexing a version will also re-evaluate the files from the DB, useful for old projects that are out of sync.

  • The reindex command now generates taks per-version rather than per-collection of files, since we no longer track all files in the DB.

  • Closes Search: stop relying on the DB when indexing #10623

  • Closes Build: track imported files for external versions #10690

We don't need to do anything special during deploy, zero downtime out of the box. We can trigger a re-index for all versions if we want to delete the HTML files that we don't need from the DB, but that operation will also re-index their contents in ES, so probably better do that after we are all settled with any changes to ES.

@@ -320,7 +320,7 @@ def config(self):
:rtype: dict
"""
last_build = (
self.builds(manager=INTERNAL).filter(
self.builds.filter(
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't need to filter builds by internal/external here, if we are accessing the version, we already know if it's external o internal.

@humitos
Copy link
Member

humitos commented Sep 13, 2023

Re-indexing a version will also re-evaluate the files from the DB, useful for old projects that are out of sync.

Can we split this into two different tasks so we can trigger them in different situations? Like the one you mentioned:

We can trigger a re-index for all versions if we want to delete the HTML files that we don't need from the DB, but that operation will also re-index their contents in ES

Then, for normal workflow, if we need them to be ran sequentially we can chain the two tasks like celery.chain([sync_html_files, index_build]) (pseudocode since I don't remember the function's name).

Copy link
Member

@humitos humitos left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good to me! I'd like to understand a little more the "build ID hack" I commented about and also if it's possible to split "create HTML objects and reindex" into two different tasks so we can call them separately.

builds__success=True,
)
.exclude(project__delisted=True)
.exclude(project__is_spam=True)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should probably use a score here. Otherwise, only projects manually marked as spam will be excluded here.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a copy of

def get_queryset(self):
"""Don't include ignored files and delisted projects."""
queryset = super().get_queryset()
queryset = (
queryset
.exclude(ignore=True)
.exclude(project__delisted=True)
.exclude(project__is_spam=True)
.select_related('version', 'project')
)
return queryset

probably better discuss this at #9899

Comment on lines 200 to 202
# Last pattern to match takes precedence
# XXX: see if we can implement another type of precedence,
# like the longest pattern.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
# Last pattern to match takes precedence
# XXX: see if we can implement another type of precedence,
# like the longest pattern.
# Last pattern to match takes precedence
# TODO: see if we can implement another type of precedence,
# like the longest pattern.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we can just remove this note, don't think we will be changing how our search priority works, that will be a breaking change. I like how the current precedence work.

@stsewd
Copy link
Member Author

stsewd commented Sep 13, 2023

Can we split this into two different tasks so we can trigger them in different situations? Like the one you mentioned:

We can, but then we will need to walk the storage twice. I don't think we currently have the need to do these operations separately, if we do, then we can probably just have 3 tasks, one that does both operations together in a single walk and two others that do it independently.

Copy link
Member

@humitos humitos left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🚀

@stsewd stsewd merged commit fa54900 into main Sep 14, 2023
@stsewd stsewd deleted the dont-rely-on-db-to-index branch September 14, 2023 15:40
stsewd added a commit that referenced this pull request Sep 18, 2023
This task was replaced in #10696.
stsewd added a commit that referenced this pull request Sep 19, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Build: track imported files for external versions Search: stop relying on the DB when indexing
2 participants