Commit

Query feature/pla 364 update vespa schema to include concept counts (#175)

* Update vespa test files

* Adding concept counts to families.

* Bumping the version.

* Updating the schema.

* Adding the initial concept count filter object and query in the comments.

* Adding to yql builder.

* Developing query method.

* Refactoring.

* Extending tests.

* Updating docstring.

* Precommit fix.

* Adding in sort functionality as well as concept counts as a hit.

* Adding sort functionality to the tests.

* Adding in the ability to search for false matches.

* Resolving merge conflict.

* Removing comment.

* Precommit fix.

* Updating some comments.

* Resolving some PR comments.

* Moving a test and adding one to combine filter permutations.

* Adding a unit test for the yql builder clause.

* Adding documentation.

* Precommit fix.

---------

Co-authored-by: THOR300 <[email protected]>
Co-authored-by: Mark <[email protected]>
3 people authored Jan 16, 2025
1 parent e35b9cb commit 47dd75d
Showing 8 changed files with 340 additions and 8 deletions.
20 changes: 19 additions & 1 deletion README.md
Expand Up @@ -238,6 +238,24 @@ For clean up:
make vespa_dev_down
```

### Filtering for concept counts

The cpr_sdk supports complex queries on the aggregated concept counts held in the family index, via `SearchParameters` and a dedicated build clause in the `YqlBuilder` class.

These counts refer to the total number of matches for a concept in a family document. For example, concept Q123 (say, forestry) may have 100 matches because it is mentioned in the text 100 times.

So what queries can we perform?
- An extensive set of tests has been written for the concept count filters; these demonstrate the full capabilities of the filtering functionality:
`tests/test_search_adaptors.py:test_vespa_search_adaptor__concept_counts`

This shows that we can:
- Filter for documents with a match for a concept.
- Filter for documents that don't have a match for a concept.
- Filter for documents with a match for a specific concept and a specific count (e.g. > 10 matches).
- Filter for documents by the count of any concept (e.g. > 10 matches).
- Stack filters via an AND operator, e.g. 100 matches for Q123 AND 10 matches for Q456.
- Order results in ascending or descending order such that documents with the most/least matches appear first in search.
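
As a rough illustration of the list above, the sketch below expresses each query as a filter object. It uses stdlib stand-ins that mirror the `OperandTypeEnum` and `ConceptCountFilter` fields added in this commit rather than importing the SDK, and the concept IDs and thresholds are illustrative assumptions. In particular, expressing "has a match" as `count > 0` is one plausible encoding, not something the README prescribes.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class OperandTypeEnum(Enum):
    """Stand-in mirroring the comparison operators added in this commit."""

    GREATER_THAN = ">"
    GREATER_THAN_OR_EQUAL = ">="
    LESS_THAN = "<"
    LESS_THAN_OR_EQUAL = "<="
    EQUALS = "="


@dataclass
class ConceptCountFilter:
    """Stand-in for the pydantic model: same field names and defaults."""

    count: int
    operand: OperandTypeEnum
    concept_id: Optional[str] = None
    negate: bool = False


# Documents with at least one match for Q123 (assumed encoding: count > 0).
has_q123 = ConceptCountFilter(
    concept_id="Q123", count=0, operand=OperandTypeEnum.GREATER_THAN
)

# Documents with no match for Q123: the same filter, negated.
lacks_q123 = ConceptCountFilter(
    concept_id="Q123", count=0, operand=OperandTypeEnum.GREATER_THAN, negate=True
)

# Documents where any concept has more than 10 matches (no concept_id given).
any_concept = ConceptCountFilter(count=10, operand=OperandTypeEnum.GREATER_THAN)

# Stacked filters, combined with AND when passed together.
stacked = [
    ConceptCountFilter(
        concept_id="Q123", count=100, operand=OperandTypeEnum.GREATER_THAN_OR_EQUAL
    ),
    ConceptCountFilter(
        concept_id="Q456", count=10, operand=OperandTypeEnum.GREATER_THAN_OR_EQUAL
    ),
]
```

In the SDK itself, a list like `stacked` would be passed as `concept_count_filters` on `SearchParameters`.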

## Release Flow:

- Make updates to the package.
Expand All @@ -247,4 +265,4 @@ make vespa_dev_down
- Merge.
- Tag a release manually in github with a version that matches the latest on main that you just merged.
- In CI/CD we will check that the latest release matches the versions defined in code.
- Check in `pypi`.
- Check in `pypi`.
49 changes: 48 additions & 1 deletion src/cpr_sdk/models/search.py
@@ -1,6 +1,7 @@
import re
from datetime import datetime
from typing import List, Literal, Optional, Sequence, Any
from enum import Enum
from typing import Any, List, Literal, Optional, Sequence

from pydantic import (
AliasChoices,
Expand All @@ -24,6 +25,7 @@
"date": "family_publication_ts",
"title": "family_name",
"name": "family_name",
"concept_counts": "concept_counts.value",
}

filter_fields = {
Expand All @@ -38,6 +40,16 @@
ID_PATTERN = re.compile(rf"{_ID_ELEMENT}\.{_ID_ELEMENT}\.{_ID_ELEMENT}\.{_ID_ELEMENT}")


class OperandTypeEnum(Enum):
    """Enumeration of possible operands for yql queries"""

    GREATER_THAN = ">"
    GREATER_THAN_OR_EQUAL = ">="
    LESS_THAN = "<"
    LESS_THAN_OR_EQUAL = "<="
    EQUALS = "="


class MetadataFilter(BaseModel):
"""A filter for metadata fields"""

Expand Down Expand Up @@ -151,6 +163,34 @@ def sanitise_filter_inputs(cls, field):
return clean_values


class ConceptCountFilter(BaseModel):
    """
    A filter for a concept count.
    Can combine filters for concept ID and concept count to achieve logic like:
    - Documents with greater than 10 matches of concept Q123.
    - Documents with greater than 1000 matches of any concept.
    These ConceptCountFilters can be combined with an 'and' operator to create more
    complex queries like:
    - Documents with more than 10 matches for concept Q123 and more than 5 matches for
        concept Q456.
    param concept_id: If provided this is the ID of the concept to filter on. If it
        is left blank then all concepts that match the query will be counted.
    param count: The number of matches to filter on.
    param operand: The operand to use for the filter.
        E.g. we want to filter for documents with more than 10 matches of concept Q123.
    param negate: Whether to negate the filter.
        E.g. we want to filter for documents that do NOT have a match for a concept.
    """

    concept_id: Optional[str] = None
    count: int
    operand: OperandTypeEnum
    negate: bool = False


class SearchParameters(BaseModel):
"""Parameters for a search request"""

Expand Down Expand Up @@ -257,6 +297,11 @@ class SearchParameters(BaseModel):
so can also be used to override YQL or ranking profiles.
"""

concept_count_filters: Optional[Sequence[ConceptCountFilter]] = None
"""
A list of concept count filters to apply to the search.
"""

replace_acronyms: bool = False
"""
Whether to perform acronym replacement based on the 'acronyms' ruleset.
Expand Down Expand Up @@ -395,6 +440,7 @@ class Hit(BaseModel):
concepts: Optional[Sequence[Concept]] = None
relevance: Optional[float] = None
rank_features: Optional[dict[str, float]] = None
concept_counts: Optional[dict[str, int]] = None

@classmethod
def from_vespa_response(cls, response_hit: dict) -> "Hit":
Expand Down Expand Up @@ -476,6 +522,7 @@ def from_vespa_response(cls, response_hit: dict) -> "Document":
concepts=fields.get("concepts"),
relevance=response_hit.get("relevance"),
rank_features=fields.get("summaryfeatures"),
concept_counts=fields.get("concept_counts"),
)


Expand Down
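
The hunk above adds `concept_counts` to `Hit` and reads it with `fields.get("concept_counts")`. A minimal sketch of that lookup on a raw response hit follows. The sample hit, the helper names, and the assumption that the summary fields sit under a `"fields"` key are illustrative and not taken from the SDK.

```python
from typing import Dict, Optional


def concept_counts_from_hit(response_hit: dict) -> Optional[Dict[str, int]]:
    """Return the concept counts of a hit, or None if the field is absent."""
    # Assumption: summary fields are nested under "fields" in the response hit.
    fields = response_hit.get("fields", {})
    return fields.get("concept_counts")


def most_frequent_concept(counts: Optional[Dict[str, int]]) -> Optional[str]:
    """Return the concept ID with the highest count, or None if there are none."""
    if not counts:
        return None
    return max(counts, key=counts.get)


# Hypothetical hit, using the counts from the test fixture in this commit.
sample_hit = {
    "relevance": 0.5,
    "fields": {
        "concept_counts": {"concept_1_1": 101, "concept_2_2": 15, "concept_3_3": 1543}
    },
}

counts = concept_counts_from_hit(sample_hit)
top = most_frequent_concept(counts)
```

Returning `None` for a missing field matches the `Optional[dict[str, int]]` type on the model, so callers have to handle hits without any counts.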
2 changes: 1 addition & 1 deletion src/cpr_sdk/version.py
@@ -1,5 +1,5 @@
_MAJOR = "1"
_MINOR = "13"
_MINOR = "14"
_PATCH = "0"
_SUFFIX = ""

Expand Down
24 changes: 24 additions & 0 deletions src/cpr_sdk/yql_builder.py
Expand Up @@ -155,6 +155,29 @@ def build_year_end_filter(self) -> Optional[str]:
return f"(family_publication_year <= {end})"
return None

    def build_concept_count_filter(self) -> Optional[str]:
        """Create the part of the query that filters on concept counts"""
        concept_count_filters_subqueries = []
        if self.params.concept_count_filters:
            for concept_count_filter in self.params.concept_count_filters:
                concept_count_filters_subqueries.append(
                    f"""
                    {"!" if concept_count_filter.negate else ""}
                    (
                        concept_counts contains sameElement(
                            {(
                                f'key contains "{concept_count_filter.concept_id}", '
                                if concept_count_filter.concept_id is not None else ""
                            )}
                            value {concept_count_filter.operand.value} {concept_count_filter.count}
                        )
                    )
                    """
                )

            return f"({' and '.join(concept_count_filters_subqueries)})"
        return None

def build_where_clause(self) -> str:
"""Create the part of the query that adds filters"""
filters = []
Expand All @@ -173,6 +196,7 @@ def build_where_clause(self) -> str:
filters.append(self._inclusive_filters(f, "family_source"))
filters.append(self.build_year_start_filter())
filters.append(self.build_year_end_filter())
filters.append(self.build_concept_count_filter())
return " and ".join([f for f in filters if f]) # Remove empty

def build_continuation(self) -> str:
Expand Down
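
To make the shape of the generated clause easier to see, here is a standalone sketch that follows the same logic as `build_concept_count_filter` but emits each subquery on a single line. The function name and the tuple-based filter spec are hypothetical simplifications for illustration. The real method reads `ConceptCountFilter` objects from `self.params` and keeps the multi-line whitespace.

```python
from typing import Optional, Sequence, Tuple

# Hypothetical compact spec: (concept_id, operator, count, negate).
FilterSpec = Tuple[Optional[str], str, int, bool]


def render_concept_count_clause(filters: Sequence[FilterSpec]) -> Optional[str]:
    """Render an AND-joined YQL clause over the concept_counts map."""
    if not filters:
        return None

    subqueries = []
    for concept_id, operator, count, negate in filters:
        # The key condition is only emitted when a concept ID is given.
        key_part = f'key contains "{concept_id}", ' if concept_id is not None else ""
        prefix = "!" if negate else ""
        subqueries.append(
            f"{prefix}(concept_counts contains "
            f"sameElement({key_part}value {operator} {count}))"
        )
    return f"({' and '.join(subqueries)})"


clause = render_concept_count_clause(
    [("Q123", ">", 10, False), (None, ">=", 1000, True)]
)
print(clause)
# ((concept_counts contains sameElement(key contains "Q123", value > 10)) and !(concept_counts contains sameElement(value >= 1000)))
```

Because both conditions sit inside one `sameElement`, the key and value conditions have to hold for the same map entry, which is what ties a count to its concept.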
7 changes: 7 additions & 0 deletions tests/local_vespa/test_app/schemas/family_document.sd
Expand Up @@ -163,6 +163,12 @@ schema family_document {
attribute: fast-search
}
}

field concept_counts type map<string, int> {
    indexing: summary
    struct-field key { indexing: attribute }
    struct-field value { indexing: attribute }
}
}

import field search_weights_ref.name_weight as name_weight {}
Expand Down Expand Up @@ -245,6 +251,7 @@ schema family_document {
summary document_cdn_object {}
summary document_source_url {}
summary metadata {}
summary concept_counts {}
summary corpus_import_id {}
summary corpus_type_name {}
summary collection_title {}
Expand Down
15 changes: 12 additions & 3 deletions tests/local_vespa/test_documents/family_document.json
Expand Up @@ -295,7 +295,10 @@
"name": "family.sector",
"value": "Government"
}
]
],
"concept_counts": {
"concept_0_0": 2
}
}
},
{
Expand Down Expand Up @@ -604,7 +607,12 @@
"name": "family.instrument",
"value": "Capacity building"
}
]
],
"concept_counts": {
"concept_1_1": 101,
"concept_2_2": 15,
"concept_3_3": 1543
}
}
},
{
Expand Down Expand Up @@ -925,7 +933,8 @@
"name": "family.instrument",
"value": "Capacity building"
}
]
],
"concept_counts": {}
}
}
]
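
The three `concept_counts` values added to the test documents can be used to reason about which filters should match. The sketch below emulates the filter semantics in plain Python, assuming that `key contains` behaves as an exact match on the concept ID and that a filter matches when at least one map entry satisfies every condition. The document labels `doc_a` to `doc_c` and the `matches` helper are hypothetical, and this is an in-memory approximation, not how Vespa evaluates the query.

```python
import operator
from typing import Callable, Dict, Optional

OPERATORS: Dict[str, Callable[[int, int], bool]] = {
    ">": operator.gt,
    ">=": operator.ge,
    "<": operator.lt,
    "<=": operator.le,
    "=": operator.eq,
}


def matches(
    concept_counts: Dict[str, int],
    op: str,
    count: int,
    concept_id: Optional[str] = None,
    negate: bool = False,
) -> bool:
    """Return True if one entry satisfies both the key and value conditions."""
    compare = OPERATORS[op]
    hit = any(
        (concept_id is None or key == concept_id) and compare(value, count)
        for key, value in concept_counts.items()
    )
    return not hit if negate else hit


# Counts copied from the fixture changes in this commit; labels are made up.
fixtures = {
    "doc_a": {"concept_0_0": 2},
    "doc_b": {"concept_1_1": 101, "concept_2_2": 15, "concept_3_3": 1543},
    "doc_c": {},
}

# Documents where any concept has more than 1000 matches.
over_1000 = [name for name, counts in fixtures.items() if matches(counts, ">", 1000)]

# Documents with no match at all for concept_1_1 (negated filter).
without_1_1 = [
    name
    for name, counts in fixtures.items()
    if matches(counts, ">", 0, concept_id="concept_1_1", negate=True)
]
```

Under these assumptions, the empty map in the third document never satisfies a positive filter and always satisfies a negated one.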