Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Incorrect bunsetu_span detection boundary condition in bunsetu_recognizer pipeline #195

Closed
borh opened this issue Oct 15, 2021 · 1 comment

Comments

@borh
Copy link

borh commented Oct 15, 2021

When processing a multi-sentence line document from the command line (reproducible by running the ginza command against the text file here), analyze_conllu in command_line.py can trigger an IndexError at line 287 below:

ginza/ginza/command_line.py

Lines 286 to 287 in 31a22bc

for idx in range(phrase.start - sent.start, phrase.end - sent.start):
np_labels[idx] = "NP_B" if idx == phrase.start else "NP_I"

This can trigger because the bunsetu_span function in bunsetu_recognizer.py uses 0 as the ending boundary condition in the for loop (L81) and else branch (L86), which can apparently traverse into a previous sentence, return a phrase from there and thus result in a negative index above.

def bunsetu_span(token: Token) -> Span:
bunsetu_bi_list = bunsetu_bi_labels(token.doc)
start = token.i
end = start + 1
for idx in range(start, 0, -1):
if bunsetu_bi_list[idx] == "B":
start = idx
break
else:
start = 0
doc_len = len(token.doc)
for idx in range(end, doc_len):
if bunsetu_bi_list[idx] == "B":
end = idx
break
else:
end = doc_len
doc = token.doc
return Span(doc, start=start, end=end, label=POS_PHRASE_MAP.get(doc[start:end].root.pos_, ""))

I am actually not sure this isn't a bug with incorrect BI labels, but the logic change that fixes this error for me is to change the boundary conditions (L81 and L86) from 0 (=token.doc.start) to token.sent.start. If a PR would help, I can make one. Note that I could only get this to trigger with the ja-ginza-electra model and not with the ja-ginza version. This obviously does not trigger with the sentencizer disabled. Used versions:

ginza==5.0.2
ginza-transformers==0.3.1
ja-ginza-electra==5.0.0
@hiroshi-matsuda-rit
Copy link
Contributor

@borh Thanks for reporting! You observation is True. The bunsetu span should not cross the sentence boundary.
I fixed it and released v5.0.3. Please check it out.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants