Incorrect bunsetu_span detection boundary condition in bunsetu_recognizer pipeline #195

borh · 2021-10-15T08:20:06Z

When processing a multi-sentence line document from the command line (reproducible by running the ginza command against the text file here), analyze_conllu in command_line.py can trigger an IndexError at line 287 below:

ginza/ginza/command_line.py

Lines 286 to 287 in 31a22bc

    
           for idx in range(phrase.start - sent.start, phrase.end - sent.start): 
        
               np_labels[idx] = "NP_B" if idx == phrase.start else "NP_I"

This can trigger because the bunsetu_span function in bunsetu_recognizer.py uses 0 as the ending boundary condition in the for loop (L81) and else branch (L86), which can apparently traverse into a previous sentence, return a phrase from there and thus result in a negative index above.

ginza/ginza/bunsetu_recognizer.py

Lines 77 to 96 in 31a22bc

    
           def bunsetu_span(token: Token) -> Span: 
        
               bunsetu_bi_list = bunsetu_bi_labels(token.doc) 
        
               start = token.i 
        
               end = start + 1 
        
               for idx in range(start, 0, -1): 
        
                   if bunsetu_bi_list[idx] == "B": 
        
                       start = idx 
        
                       break 
        
               else: 
        
                   start = 0 
        
               doc_len = len(token.doc) 
        
               for idx in range(end, doc_len): 
        
                   if bunsetu_bi_list[idx] == "B": 
        
                       end = idx 
        
                       break 
        
               else: 
        
                   end = doc_len 
        
               doc = token.doc 
        
               return Span(doc, start=start, end=end, label=POS_PHRASE_MAP.get(doc[start:end].root.pos_, ""))

I am actually not sure this isn't a bug with incorrect BI labels, but the logic change that fixes this error for me is to change the boundary conditions (L81 and L86) from 0 (=token.doc.start) to token.sent.start. If a PR would help, I can make one. Note that I could only get this to trigger with the ja-ginza-electra model and not with the ja-ginza version. This obviously does not trigger with the sentencizer disabled. Used versions:

ginza==5.0.2
ginza-transformers==0.3.1
ja-ginza-electra==5.0.0

The text was updated successfully, but these errors were encountered:

hiroshi-matsuda-rit · 2021-10-15T09:08:41Z

@borh Thanks for reporting! You observation is True. The bunsetu span should not cross the sentence boundary.
I fixed it and released v5.0.3. Please check it out.

hiroshi-matsuda-rit closed this as completed in 185c4ae Oct 15, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Incorrect bunsetu_span detection boundary condition in bunsetu_recognizer pipeline #195

Incorrect bunsetu_span detection boundary condition in bunsetu_recognizer pipeline #195

borh commented Oct 15, 2021

hiroshi-matsuda-rit commented Oct 15, 2021

Incorrect bunsetu_span detection boundary condition in bunsetu_recognizer pipeline #195

Incorrect bunsetu_span detection boundary condition in bunsetu_recognizer pipeline #195

Comments

borh commented Oct 15, 2021

hiroshi-matsuda-rit commented Oct 15, 2021