BUG: duplicated tokens should not be allowed in pdf_structure tokens list #186

CarloNicolini · 2022-08-26T21:18:52Z

In the pdfplumber preprocessing pipeline, I've found that duplicated tokens can occur. Specifically, in obtain_word_tokens in pdfplumber.py, .drop_duplicates should be called before converting the DataFrame to a list:

word_tokens = df.apply(self.convert_to_pagetoken, axis=1).drop_duplicates(keep="first").tolist()

In some cases the tokens in a PAWLS PDF structure appear duplicated, which throws off indexing into the tokens list from the annotation file.
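To make the intended behaviour concrete, here is a minimal pandas-free sketch of the same keep-first deduplication. It is not the PAWLS implementation: the helper names token_key and drop_duplicate_tokens are hypothetical, and it assumes each token is a plain dict with text, x, y, width and height keys.

```python
from typing import Dict, Hashable, List, Tuple


def token_key(token: Dict) -> Tuple[Hashable, ...]:
    # Hypothetical notion of token identity: the text plus its bounding box.
    return (token["text"], token["x"], token["y"], token["width"], token["height"])


def drop_duplicate_tokens(tokens: List[Dict]) -> List[Dict]:
    """Return tokens with exact repeats removed, keeping the first occurrence."""
    seen = set()
    unique = []
    for token in tokens:
        key = token_key(token)
        if key not in seen:
            seen.add(key)
            unique.append(token)
    return unique


# Assumed sample data: the second entry is an exact repeat of the first.
tokens = [
    {"text": "Hello", "x": 10.0, "y": 20.0, "width": 30.0, "height": 8.0},
    {"text": "Hello", "x": 10.0, "y": 20.0, "width": 30.0, "height": 8.0},
    {"text": "world", "x": 45.0, "y": 20.0, "width": 28.0, "height": 8.0},
]
unique_tokens = drop_duplicate_tokens(tokens)
```

The order of the surviving tokens is preserved, so token indices stay stable after the repeats are dropped. Keying on a tuple also sidesteps the fact that plain dicts are unhashable, which matters if the tokens are dicts rather than hashable objects.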
