[WIP] Allowed search results for Django code terms which contain stop words. #1942

Draft: wants to merge 1 commit into base: main
18 changes: 14 additions & 4 deletions docs/models.py
@@ -15,7 +15,7 @@
     TrigramSimilarity,
 )
 from django.core.cache import cache
-from django.db import models, transaction
+from django.db import connection, models, transaction
 from django.db.models import Prefetch, Q
 from django.db.models.fields.json import KeyTextTransform
 from django.utils.functional import cached_property
@@ -174,6 +174,18 @@ def sync_to_db(self, decoded_documents):
                 if line.startswith(f"Disallow: /{self.lang}/{self.release_id}/")
             ]

+        language_mapping = TSEARCH_CONFIG_LANGUAGES
+        english = "custom_english"
+        with connection.cursor() as cursor:
+            cursor.execute(
+                "SELECT EXISTS(SELECT 1 FROM pg_ts_config WHERE cfgname = %s)",
+                [english],
+            )
+            has_custom_english_config = cursor.fetchone()[0]
+
+        if has_custom_english_config:
+            language_mapping["en"] = english
+
         for document in decoded_documents:
             if (
                 "body" not in document
@@ -192,9 +204,7 @@ def sync_to_db(self, decoded_documents):
                 path=document_path,
                 title=html.unescape(strip_tags(document["title"])),
                 metadata=document,
-                config=TSEARCH_CONFIG_LANGUAGES.get(
-                    self.lang[:2], DEFAULT_TEXT_SEARCH_CONFIG
-                ),
+                config=language_mapping.get(self.lang[:2], DEFAULT_TEXT_SEARCH_CONFIG),
             )
         for document in self.documents.all():
             document.metadata["breadcrumbs"] = list(
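The `docs/models.py` change above checks `pg_ts_config` for a `custom_english` configuration and, when it exists, uses it for English documents. Because `language_mapping = TSEARCH_CONFIG_LANGUAGES` binds a second name to the same dict, the `["en"]` assignment also changes the shared mapping. The following is a minimal sketch of a copy-based variant, not the PR's code: `resolve_config` is a hypothetical helper, and the mapping contents and default value are stand-ins for the project's real constants.

```python
# Stand-in values for illustration only; the real constants live in the project.
TSEARCH_CONFIG_LANGUAGES = {"en": "english", "fr": "french"}
DEFAULT_TEXT_SEARCH_CONFIG = "simple"


def resolve_config(lang, has_custom_english_config):
    """Pick a text search config without mutating the shared mapping."""
    # dict(...) makes a shallow copy, so the override below stays local.
    language_mapping = dict(TSEARCH_CONFIG_LANGUAGES)
    if has_custom_english_config:
        language_mapping["en"] = "custom_english"
    # Only the first two characters of the language code are used for lookup.
    return language_mapping.get(lang[:2], DEFAULT_TEXT_SEARCH_CONFIG)
```

With the copy in place, a later call made while the custom configuration is absent still falls back to the stock `english` entry.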
38 changes: 38 additions & 0 deletions docs/stopwords/README.md
@@ -0,0 +1,38 @@
# Instructions to create a new search dictionary
Contributor Author

I have never done this before, so these instructions may not be very good.
I would love it if we could create this custom search dictionary in our Docker setup as well.

Member

Great job. I have not had much time to review this properly yet, but I have two ideas:

  1. Create a migration that creates the custom English dictionary.
  2. Write down the list of words you removed from the original stop word list, plus a command, patch, or other mechanism that removes those words to produce the final stop word list you uploaded.


In this folder, there is `custom_english.stop`.

It is a copy of the [Snowball English stop words](https://github.com/postgres/postgres/blob/master/src/backend/snowball/stopwords/english.stop)
with some entries removed, such as "through" and "when", because these
terms are also used in Django code.
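
As a rough illustration of how the custom list relates to the original one, here is a minimal sketch that drops a set of words from a stop word list. The helper name `remove_stop_words` and the short sample list are assumptions for the example; only "through" and "when" are taken from the text above.

```python
def remove_stop_words(original_words, words_to_keep_searchable):
    """Return original_words without the entries that should stay searchable."""
    excluded = {word.lower() for word in words_to_keep_searchable}
    # Preserve the original order so the result is easy to diff.
    return [word for word in original_words if word.lower() not in excluded]


# A tiny sample, not the full Snowball list.
sample = ["the", "through", "and", "when", "of"]
custom = remove_stop_words(sample, {"through", "when"})
```

Keeping the removed words in one explicit set makes it easier to regenerate the file if the upstream list changes.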

The file format is a list of words, one per line. Blank lines and trailing
spaces are ignored, and upper case is folded to lower case, but no other
processing is done on the file contents.
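
The format rules above can be sketched in a few lines of Python. This is an illustration of the described behaviour only, not PostgreSQL's own parser, and `parse_stop_words` is a hypothetical helper name.

```python
def parse_stop_words(text):
    """Parse stop word file contents: one word per line."""
    words = []
    for line in text.splitlines():
        # Drop trailing whitespace and fold upper case to lower case.
        word = line.rstrip().lower()
        if word:  # Skip blank lines.
            words.append(word)
    return words
```

A helper like this can be handy for checking a stop word file for accidental duplicates before installing it.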

This file needs to be created in `$SHAREDIR/tsearch_data/custom_english.stop`,
where `$SHAREDIR` means the PostgreSQL installation's shared-data directory,
available via `pg_config --sharedir`.
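
A minimal sketch of locating the target path and copying the file into place is shown below. The function names are hypothetical, it assumes `pg_config` is on the `PATH` of the machine that hosts PostgreSQL, and writing into the shared-data directory may need elevated permissions.

```python
import shutil
import subprocess
from pathlib import Path


def get_sharedir():
    """Ask pg_config for the shared-data directory (needs pg_config on PATH)."""
    result = subprocess.run(
        ["pg_config", "--sharedir"], capture_output=True, text=True, check=True
    )
    return Path(result.stdout.strip())


def stop_file_target(sharedir, name="custom_english"):
    """Build the destination path for a stop word file."""
    return Path(sharedir) / "tsearch_data" / f"{name}.stop"


def install_stop_file(source, sharedir=None):
    """Copy the stop word file into tsearch_data and return the new path."""
    target = stop_file_target(sharedir or get_sharedir())
    shutil.copyfile(source, target)
    return target
```

Calling `install_stop_file("docs/stopwords/custom_english.stop")` on the database host would then place the file where PostgreSQL looks for it.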

See https://www.postgresql.org/docs/current/textsearch-dictionaries.html

Once the custom stop words file has been created, we can run the following SQL:

```sql
CREATE TEXT SEARCH DICTIONARY custom_english (
    TEMPLATE = snowball,
    Language = english,
    StopWords = custom_english
);

CREATE TEXT SEARCH CONFIGURATION public.custom_english (
    COPY = pg_catalog.english
);

ALTER TEXT SEARCH CONFIGURATION public.custom_english
    ALTER MAPPING
        FOR asciiword, asciihword, hword_asciipart, hword, hword_part, word
        WITH custom_english;
```

This should then mean the `custom_english` text search configuration is available.
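
Since the stop word file, the dictionary, and the configuration looked up in `docs/models.py` all have to agree on a name, one option is to generate the statements from a single value. The sketch below is an assumption about how that could look, with a hypothetical helper `build_search_config_sql`; it uses the name `custom_english`, matching both the stop word file and the `cfgname` queried in `docs/models.py`.

```python
def build_search_config_sql(name="custom_english"):
    """Return the SQL statements that set up a custom English search config.

    The same name is used for the stop word file, the dictionary, and the
    configuration, so the three always agree.
    """
    # Basic guard: the name is interpolated into SQL as an identifier.
    if not name.isidentifier():
        raise ValueError(f"unsafe identifier: {name!r}")

    create_dictionary = (
        f"CREATE TEXT SEARCH DICTIONARY {name} ("
        f" TEMPLATE = snowball, Language = english, StopWords = {name});"
    )
    create_configuration = (
        f"CREATE TEXT SEARCH CONFIGURATION public.{name} ("
        " COPY = pg_catalog.english);"
    )
    alter_mapping = (
        f"ALTER TEXT SEARCH CONFIGURATION public.{name} ALTER MAPPING"
        " FOR asciiword, asciihword, hword_asciipart, hword, hword_part, word"
        f" WITH {name};"
    )
    return [create_dictionary, create_configuration, alter_mapping]
```

A list like this could be passed to a Django `migrations.RunSQL` operation, as suggested in the review comment above, provided the stop word file is already in place on the database server.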
119 changes: 119 additions & 0 deletions docs/stopwords/custom_english.stop
@@ -0,0 +1,119 @@
i
me
my
myself
we
our
ours
ourselves
you
your
yours
yourself
yourselves
he
him
his
himself
she
her
hers
herself
it
its
itself
they
them
their
theirs
themselves
what
which
who
whom
this
that
these
those
am
is
are
was
were
be
been
being
have
has
had
having
do
does
did
doing
a
an
the
and
but
or
because
as
until
while
of
at
by
about
against
between
into
during
before
after
above
below
to
from
up
down
in
out
on
off
over
under
again
further
then
once
here
there
where
why
how
any
both
each
few
more
most
other
some
such
no
nor
not
own
same
so
than
too
very
s
t
can
will
just
don
should