Search: stop relying on the DB when indexing #10696
Conversation
```diff
@@ -320,7 +320,7 @@ def config(self):
         :rtype: dict
         """
         last_build = (
-            self.builds(manager=INTERNAL).filter(
+            self.builds.filter(
```
We don't need to filter builds by internal/external here; if we are accessing the version, we already know if it's external or internal.
Can we split this into two different tasks so we can trigger them in different situations, like the one you mentioned?
Then, for the normal workflow, if we need them to run sequentially, we can chain the two tasks.
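Chaining the two steps so they run one after another while still being triggerable independently could look roughly like this. This is a minimal sketch: in production these would be Celery tasks combined with `celery.chain()`, but here a tiny stand-in chain runner (and the task names `create_html_objects` / `reindex_version`) are illustrative assumptions, not the actual code from this PR.

```python
# Sketch: two independent steps that can also be run sequentially,
# mimicking how a Celery chain passes each task's result to the next.
# All names here are hypothetical stand-ins.

def create_html_objects(build_id):
    # Hypothetical step 1: record the HTML files produced by this build.
    return {"build_id": build_id, "html_created": True}

def reindex_version(state):
    # Hypothetical step 2: push the version's files to the search index.
    state["reindexed"] = True
    return state

def run_chain(build_id, *steps):
    # Each step receives the previous step's result, like celery.chain().
    result = build_id
    for step in steps:
        result = step(result)
    return result

result = run_chain(42, create_html_objects, reindex_version)
print(result)  # {'build_id': 42, 'html_created': True, 'reindexed': True}
```

Because each step is a separate callable (task), either one can also be triggered on its own for the "different situations" mentioned above.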
This looks good to me! I'd like to understand the "build ID hack" I commented about a little more, and also whether it's possible to split "create HTML objects and reindex" into two different tasks so we can call them separately.
```python
                builds__success=True,
            )
            .exclude(project__delisted=True)
            .exclude(project__is_spam=True)
```
We should probably use a score here. Otherwise, only projects manually marked as spam will be excluded here.
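Using a score instead of only the manual flag could look something like the sketch below. The threshold, field names, and sample data are all illustrative assumptions, not the actual spam fields or cutoffs used by the project.

```python
# Sketch: excluding projects by spam *score* in addition to the manual
# is_spam flag, so projects never manually reviewed can still be excluded.
# Threshold and field names are hypothetical.
SPAM_SCORE_THRESHOLD = 300

projects = [
    {"slug": "good-docs", "spam_score": 0, "is_spam": False},
    {"slug": "flagged", "spam_score": 50, "is_spam": True},
    {"slug": "spammy", "spam_score": 900, "is_spam": False},
]

def is_indexable(project):
    # A score-based check also catches "spammy", which was never
    # manually marked as spam but has a high score.
    return not project["is_spam"] and project["spam_score"] < SPAM_SCORE_THRESHOLD

indexable = [p["slug"] for p in projects if is_indexable(p)]
print(indexable)  # ['good-docs']
```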
This is a copy of readthedocs.org/readthedocs/search/documents.py, lines 120 to 130 in 53e21bb:

```python
def get_queryset(self):
    """Don't include ignored files and delisted projects."""
    queryset = super().get_queryset()
    queryset = (
        queryset
        .exclude(ignore=True)
        .exclude(project__delisted=True)
        .exclude(project__is_spam=True)
        .select_related('version', 'project')
    )
    return queryset
```
Probably better to discuss this at #9899.
readthedocs/projects/tasks/search.py
Outdated
```python
# Last pattern to match takes precedence
# XXX: see if we can implement another type of precedence,
# like the longest pattern.
```
Suggested change:

```diff
 # Last pattern to match takes precedence
-# XXX: see if we can implement another type of precedence,
+# TODO: see if we can implement another type of precedence,
 # like the longest pattern.
```
I think we can just remove this note; I don't think we will be changing how our search priority works, since that would be a breaking change. I like how the current precedence works.
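For reference, "last pattern to match takes precedence" behaves roughly like the sketch below. The patterns, rank values, and `fnmatch`-based matching are illustrative assumptions, not the project's actual implementation.

```python
# Sketch of last-match-wins precedence: patterns are checked in the order
# the user configured them, and the last one that matches a path sets the
# final rank. Patterns and ranks here are hypothetical.
import fnmatch

search_ranks = [
    ("*", 0),                     # default rank for everything
    ("docs/api/*", -2),           # de-prioritize API pages...
    ("docs/api/index.html", 5),   # ...but boost the API landing page
]

def rank_for(path):
    rank = 0
    # Walk every pattern in order; the last match overrides earlier ones.
    for pattern, value in search_ranks:
        if fnmatch.fnmatch(path, pattern):
            rank = value
    return rank

print(rank_for("docs/api/client.html"))  # -2
print(rank_for("docs/api/index.html"))   # 5
```

Longest-pattern precedence would instead pick the most specific match regardless of order, which is the alternative the removed note was speculating about.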
We can, but then we would need to walk the storage twice. I don't think we currently need to do these operations separately; if we do, then we can probably just have three tasks: one that does both operations together in a single walk, and two others that do them independently.
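The single-walk variant could be sketched like this. `walk()`, the set of tracked files, and the per-file operations are hypothetical stand-ins, not the actual task code.

```python
# Sketch: one storage walk feeding both operations (indexing every HTML
# file in search, and tracking only files of interest in the DB), so the
# storage is not traversed twice. All names are illustrative.

FILES_OF_INTEREST = {"index.html", "404.html"}

def walk(storage_path):
    # Stand-in for walking the build output in storage.
    yield from ["index.html", "guide.html", "404.html"]

def process_version(storage_path):
    indexed, tracked = [], []
    for html_file in walk(storage_path):
        indexed.append(html_file)           # operation 1: send to search index
        if html_file in FILES_OF_INTEREST:  # operation 2: track only some files
            tracked.append(html_file)
    return indexed, tracked

indexed, tracked = process_version("html/project/latest")
print(indexed)  # ['index.html', 'guide.html', '404.html']
print(tracked)  # ['index.html', '404.html']
```

Splitting this into three tasks would mean keeping this combined walk plus two narrower tasks that each perform only one of the two per-file operations.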
🚀
This task was replaced in #10696.
Removed the "wipe" actions from the admin instead of porting them, since I'm not sure we need an admin action just to delete the search index of a project. Re-index seems useful.
`fileify` was replaced by `index_build`, and it only requires the build id to be passed; any other information can be retrieved from the build/version object. `fileify` isn't removed in this PR to avoid downtime during deploy; it's safe to keep it around till the next deploy.

New code avoids any deep connection to the django-elasticsearch-dsl package, since it doesn't make sense anymore to have it, and I'm planning on removing it.
We are no longer tracking all files in the DB, only the ones of interest.
Re-indexing a version will also re-evaluate the files in the DB, which is useful for old projects that are out of sync.
The reindex command now generates tasks per version rather than per collection of files, since we no longer track all files in the DB.
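Per-version task generation could be sketched as follows; the version iterator and task name are hypothetical stand-ins for the actual management command.

```python
# Sketch: a reindex command queueing one task per version, instead of
# batching collections of file IDs pulled from the DB. Names are
# illustrative, not the real ones.

def iter_active_versions():
    # Stand-in for a queryset of versions that should be indexed.
    yield from [("docs-project", "latest"), ("docs-project", "stable")]

def queued_tasks():
    tasks = []
    for project_slug, version_slug in iter_active_versions():
        # One task per version; the task itself walks storage to find the
        # files, so no per-file state needs to come from the DB.
        tasks.append(("index_version", project_slug, version_slug))
    return tasks

print(queued_tasks())
# [('index_version', 'docs-project', 'latest'), ('index_version', 'docs-project', 'stable')]
```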
Closes Search: stop relying on the DB when indexing #10623
Closes Build: track imported files for external versions #10690
We don't need to do anything special during deploy; we get zero downtime out of the box. We can trigger a re-index for all versions if we want to delete the HTML files that we don't need from the DB, but that operation will also re-index their contents in ES, so it's probably better to do that after we have settled on any changes to ES.