Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Custom mapping write doc fix #297

Merged
merged 7 commits into from
Aug 10, 2020

Conversation

karimjp
Copy link
Contributor

@karimjp karimjp commented Aug 7, 2020

Fix to address issue #293

@tanaysoni
Copy link
Contributor

Hi @karimjp, thank you for working on this.

I had a closer look at the ElasticsearchDocumentStore. In the current version, the custom fields(eg, self.text or self.embedding) are not used at the Elasticsearch level when writing documents to an index.

def write_documents(self, documents: Union[List[dict], List[Document]], index: Optional[str] = None):
"""
Indexes documents for later queries in Elasticsearch.
When using explicit document IDs, any existing document with the same ID gets updated.
:param documents: a list of Python dictionaries or a list of Haystack Document objects.
For documents as dictionaries, the format is {"text": "<the-actual-text>"}.
Optionally: Include meta data via {"text": "<the-actual-text>",
"meta":{"name": "<some-document-name>, "author": "somebody", ...}}
It can be used for filtering and is accessible in the responses of the Finder.
Advanced: If you are using your own Elasticsearch mapping, the key names in the dictionary
should be changed to what you have set for self.text_field and self.name_field.
:param index: Elasticsearch index where the documents should be indexed. If not supplied, self.index will be used.
:return: None
"""
if index and not self.client.indices.exists(index=index):
self._create_document_index(index)
if index is None:
index = self.index
# Make sure we comply to Document class format
documents_objects = [Document.from_dict(d) if isinstance(d, dict) else d for d in documents]
documents_to_index = []
for doc in documents_objects:
_doc = {
"_op_type": "index" if self.update_existing_documents else "create",
"_index": index,
**doc.to_dict()
} # type: Dict[str, Any]
# rename id for elastic
_doc["_id"] = str(_doc.pop("id"))
# don't index query score and empty fields
_ = _doc.pop("query_score", None)
_doc = {k:v for k,v in _doc.items() if v is not None}
# In order to have a flat structure in elastic + similar behaviour to the other DocumentStores,
# we "unnest" all value within "meta"
if "meta" in _doc.keys():
for k, v in _doc["meta"].items():
_doc[k] = v
_doc.pop("meta")
documents_to_index.append(_doc)
bulk(self.client, documents_to_index, request_timeout=300, refresh="wait_for")

At line 170, the dictionaries get converted to Document objects. Within a Document object, the text field is always named as text and that's what gets indexed to the Elasticsearch.

We might want to have a fix for it first before we can continue with this PR. What do you think?

@karimjp
Copy link
Contributor Author

karimjp commented Aug 7, 2020

Hi @tanaysoni you were absolutely right about the custom keys being indexed with the default key names, thanks for catching this. I added the changes in the latest commits to the branch custom_mapping_write_doc_fix. In addition to the fixes we talked, I noticed rest_api/config.py was missing a configuration option to map a custom name field and so I added it and also included it in rest_api/controller/search.py.

@tanaysoni
Copy link
Contributor

Thank you for the fixes @karimjp. I added a small test case to the PR and it works well!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants