Elasticsearch custom mapping fails to insert with write_documents() #293

Closed
karimjp opened this issue Aug 6, 2020 · 4 comments


@karimjp
Contributor

karimjp commented Aug 6, 2020

Hi all,

I was trying to insert a list of dictionaries with the structure shown below using write_documents():

[
    {
        "context": "CHAPTER 4 LIVE LOADS",
        "page": 68,
        "section": "CHAPTER_4",
        "chapter": "CHAPTER_4",
        "title": "ASCE_7_16-CHAPTER_4"
    },
    {...},
    {...}
]

I have a custom mapping, plus field definitions for text_field and name_field, as shown below:

haystack_es = ElasticsearchDocumentStore(host=host, port=9200, username="elastic",
                                         password="some_password", index="some_index",
                                         search_fields = ["title", "context"],
                                         name_field="title", text_field="context",
                                         custom_mapping=pdf_custom_mapping)

This worked at some point, but it now fails with the following error:

Traceback (most recent call last):
  File "/Users/karim/PycharmProjects/haystack/playground/solution.py", line 27, in <module>
    haystack_es.write_documents(haystack_dicts)
  File "/Users/karim/PycharmProjects/haystack/haystack/database/elasticsearch.py", line 164, in write_documents
    documents_objects = [Document.from_dict(d) if isinstance(d, dict) else d for d in documents]
  File "/Users/karim/PycharmProjects/haystack/haystack/database/elasticsearch.py", line 164, in <listcomp>
    documents_objects = [Document.from_dict(d) if isinstance(d, dict) else d for d in documents]
  File "/Users/karim/PycharmProjects/haystack/haystack/database/base.py", line 57, in from_dict
    return cls(**_doc)
TypeError: __init__() missing 1 required positional argument: 'text'

I investigated the issue, which appears to originate from this code:

@classmethod
def from_dict(cls, dict):
    _doc = dict.copy()
    init_args = ["text", "id", "query_score", "question", "meta", "embedding"]
    if "meta" not in _doc.keys():
        _doc["meta"] = {}
    # copy additional fields into "meta"
    for k, v in _doc.items():
        if k not in init_args:
            _doc["meta"][k] = v
    # remove additional fields from top level
    _doc = {k: v for k, v in _doc.items() if k in init_args}
    return cls(**_doc)

I made the following finding:

  • This function enforces a fixed schema and does not take into account the custom text_field and name_field mappings defined when the ElasticsearchDocumentStore object is initialized.
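
The failure can be reproduced without Elasticsearch. The following is a minimal sketch, assuming a simplified stand-in for haystack's Document class whose constructor requires `text` (the real class may differ); `from_dict` follows the logic of the snippet above, with the parameter renamed to `data` to avoid shadowing the built-in.

```python
class Document:
    """Simplified stand-in for haystack's Document (assumption, not the real class)."""

    def __init__(self, text, id=None, query_score=None, question=None,
                 meta=None, embedding=None):
        self.text = text
        self.id = id
        self.meta = meta or {}

    @classmethod
    def from_dict(cls, data):
        _doc = data.copy()
        init_args = ["text", "id", "query_score", "question", "meta", "embedding"]
        if "meta" not in _doc:
            _doc["meta"] = {}
        # Every key outside init_args, including "context", is moved into meta.
        for k, v in _doc.items():
            if k not in init_args:
                _doc["meta"][k] = v
        # Only init_args survive at the top level, so "text" is never set.
        _doc = {k: v for k, v in _doc.items() if k in init_args}
        return cls(**_doc)


record = {
    "context": "CHAPTER 4 LIVE LOADS",
    "page": 68,
    "title": "ASCE_7_16-CHAPTER_4",
}

try:
    Document.from_dict(record)
    error = None
except TypeError as exc:
    error = str(exc)

print(error)
```

Because the record carries its body under `context`, the constructor is called without `text` and raises the TypeError shown in the traceback.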

To test a fix for this, I made the following change to base.py. It only covers mapping my custom text_field to the expected text field; the name field does not seem to be part of the document schema that haystack expects internally:

@classmethod
def from_dict(cls, dict, field_map={"context": "text"}):
    _doc = dict.copy()
    init_args = ["text", "id", "query_score", "question", "meta", "embedding"]
    if "meta" not in _doc.keys():
        _doc["meta"] = {}
    # copy additional fields into "meta"
    for k, v in _doc.items():
        if k not in init_args and k not in field_map:
            _doc["meta"][k] = v
    # remove additional fields from top level
    _new_doc = {}
    for k, v in _doc.items():
        if k in init_args:
            _new_doc[k] = v
        elif k in field_map:
            k = field_map[k]
            _new_doc[k] = v

    return cls(**_new_doc)
@tanaysoni
Contributor

Hi @karimjp, thank you for the detailed explanation!

Yes, you're right, the custom_mapping currently only works when querying documents but not for indexing.

The solution you proposed looks good. Similar to the text field, we could add other fields in the field map. Would you like to create a PR?
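
Extending the field map to other fields could look roughly like the sketch below. It is an illustration only and not necessarily what the eventual fix does: `apply_field_map` is a hypothetical helper, the `"name"` target key is an assumption, and the sketch assumes that anything not accepted by the constructor, including the document name, is stored under `meta`.

```python
INIT_ARGS = ["text", "id", "query_score", "question", "meta", "embedding"]


def apply_field_map(record, field_map):
    """Hypothetical helper: rename custom keys, then split into init args and meta."""
    doc = {"meta": dict(record.get("meta", {}))}
    for key, value in record.items():
        if key == "meta":
            continue
        # Rename custom fields; keys without a mapping keep their own name.
        target = field_map.get(key, key)
        if target in INIT_ARGS:
            doc[target] = value
        else:
            doc["meta"][target] = value
    return doc


record = {
    "context": "CHAPTER 4 LIVE LOADS",
    "page": 68,
    "title": "ASCE_7_16-CHAPTER_4",
}
doc_kwargs = apply_field_map(record, {"context": "text", "title": "name"})
print(doc_kwargs)
```

Passing the map as an argument, rather than hard-coding it as a default, would let the document store forward its own text_field and name_field settings.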

@karimjp
Contributor Author

karimjp commented Aug 6, 2020

Thank you for reviewing, @tanaysoni. Sounds good, I'll work on this tonight.

karimjp added a commit to karimjp/haystack that referenced this issue Aug 7, 2020
@tanaysoni
Contributor

Resolved by #297.

@iravkr

iravkr commented Nov 9, 2021

for index, row in tqdm(dff.iterrows()):
    print(index)
    print(row)
    dicts = {}
    dicts['text'] = clean_text(row['full_text'])
    doc_len.append(len(dicts['text']))
    corpora.append(dicts['text'])
    dicts['meta'] = {}
    dicts['meta']['name'] = clean_text(row['index'])
    docs.append(dicts)

document_store.write_documents(docs)

TypeError: __init__() missing 1 required positional argument: 'content'

Can anyone check this error?
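
The message names `content` rather than `text`, which suggests that the installed version expects the document body under a `content` key. A minimal sketch under that assumption, using a hypothetical helper `rename_text_key` and made-up sample data in place of the DataFrame rows:

```python
def rename_text_key(doc, old_key="text", new_key="content"):
    """Hypothetical helper: return a copy of doc with old_key renamed to new_key."""
    renamed = dict(doc)
    if old_key in renamed and new_key not in renamed:
        renamed[new_key] = renamed.pop(old_key)
    return renamed


# Sample data standing in for the dictionaries built in the loop above.
docs = [{"text": "some cleaned text", "meta": {"name": "doc-1"}}]
docs = [rename_text_key(d) for d in docs]
print(docs)
```

The converted list would then be passed to document_store.write_documents(docs) as before.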
