When processing a multi-sentence line document from the command line (reproducible by running the ginza command against the text file here), analyze_conllu in command_line.py can trigger an IndexError at line 287 (ginza/ginza/command_line.py, lines 286 to 287 at commit 31a22bc).
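To make the symptom concrete, here is a minimal sketch of the failure mode (this is not the actual analyze_conllu code; the collect_phrases helper, the phrases list, and the offset arithmetic are assumptions for illustration): if the span returned by bunsetu_span begins before token.sent.start, its sentence-relative start offset is negative, and indexing a per-sentence structure with it can fail.

```python
from ginza import bunsetu_span

# Hypothetical illustration of the symptom only; not the code at command_line.py line 287.
def collect_phrases(sent):
    phrases = ["" for _ in sent]           # one slot per token of the current sentence
    for token in sent:
        span = bunsetu_span(token)         # with the bug, the span may begin in the previous sentence
        offset = span.start - sent.start   # negative when the span crosses the sentence boundary
        phrases[offset] = span.text        # IndexError once -offset exceeds len(phrases)
    return phrases
```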
This happens because the bunsetu_span function in bunsetu_recognizer.py (ginza/ginza/bunsetu_recognizer.py, lines 77 to 96 at commit 31a22bc) uses 0 as the ending boundary condition both in the for loop (L81) and in the else branch (L86). The leftward scan can therefore cross into a previous sentence, return a phrase from there, and thus produce the negative index that triggers the error above.
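For reference, a minimal sketch of the scan as described above (not the actual bunsetu_recognizer.py source; the bi_labels list and the exact loop shape are assumptions): scanning left until index 0 instead of the sentence start lets the left boundary escape the current sentence whenever no "B" label is met first.

```python
# Hypothetical sketch of the left-boundary scan; not the actual GiNZA source.
# Assume bi_labels[i] is "B" at the first token of each bunsetu and "I" elsewhere.
def bunsetu_start_buggy(token, bi_labels):
    begin = token.i
    for i in range(token.i, 0, -1):  # ends at index 0, the document start (cf. L81)
        if bi_labels[i] == "B":
            begin = i
            break
    else:
        begin = 0                    # else branch also falls back to the document start (cf. L86)
    # If no "B" was found within the current sentence, begin now points into a
    # previous sentence, so begin - token.sent.start is negative downstream.
    return begin
```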
I am not sure this isn't actually a bug with incorrect B/I labels, but the logic change that fixes the error for me is to change both boundary conditions (L81 and L86) from 0 (= the document start) to token.sent.start, as sketched below. If a PR would help, I can make one. Note that I could only get this to trigger with the ja-ginza-electra model, not with ja-ginza, and it does not trigger with the sentencizer disabled. Used versions:
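Regarding the boundary change proposed above, here is a hedged sketch of what it could look like (the real patch may differ; same assumed shapes as the sketch above): the scan stops at token.sent.start instead of 0, so the left boundary can never leave the current sentence.

```python
# Hypothetical sketch of the proposed fix; not the actual patch.
def bunsetu_start_fixed(token, bi_labels):
    sent_start = token.sent.start
    begin = token.i
    for i in range(token.i, sent_start, -1):  # stop at the sentence start, not index 0
        if bi_labels[i] == "B":
            begin = i
            break
    else:
        begin = sent_start                    # clamp to the current sentence instead of token 0
    return begin
```

With this change begin >= token.sent.start always holds, so the sentence-relative index computed from the span can no longer go negative.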
@borh Thanks for reporting! Your observation is correct: the bunsetu span should not cross the sentence boundary.
I fixed it and released v5.0.3. Please check it out.