New versions of pdfminer.six cannot extract chinese characters from pdf #391

yadavsandip32 · 2020-03-16T08:45:46Z

Describe the bug:

I have observed that the latest version of pdfminer.six, 20200124, is not able to read Chinese characters.

The last version that works correctly is 20181108. All later versions have this problem.

To Reproduce:

Simply run pdf2txt.py input.pdf

Current output:

Ysj Lvf
Tel:
Fax:
Email:
...
...
Abc Lvf
28H abcdefg Road#06-90 Tower 9 The Abcdef@AbcdefGhijklmnop 8336661

Expected Output:

Ysj退Lvf送
Tel:
Fax:
Email:
...
...
Abc退Lvf送
28H abcdefg Road#06-90 Tower 9 The Abcdef@AbcdefGhijklmnop 8336661

Reference PDF:
Sample pdf for chinese character.pdf

P.S. Acrobat may show an error when opening this PDF; you can ignore it. The error appears because parts of the PDF contained customer data, which I changed manually.

pietermarsman · 2020-03-24T21:05:41Z

I can replicate this issue.

I did a git bisect and the commit that introduced the error is fa40043.

I tested with

python tools/pdf2txt.py uncompressedunbroken.pdf | head -n 1

yadavsandip32 · 2020-03-31T13:37:43Z

Hi @pietermarsman, are you or someone else working on this issue?

pietermarsman · 2020-03-31T20:48:13Z

Not that I am aware of.

fakabbir · 2020-05-21T05:36:33Z

@yadavsandip32 Looking into this

dwalton76 · 2020-06-06T13:37:56Z

After fa40043, CMapDB.get_cmap(cmap_name) is only called if cmap_name is in IDENTITY_ENCODER. For the Chinese characters in the PDF from this issue, the cmap_name is UniGB-UCS2-H. If get_cmap is called for UniGB-UCS2-H, it has the information needed to create the cmap.

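        # Identity encodings are resolved through the IDENTITY_ENCODER mapping.
        if cmap_name in IDENTITY_ENCODER:
            return CMapDB.get_cmap(IDENTITY_ENCODER[cmap_name])
        else:
            # Any other name (e.g. UniGB-UCS2-H) is still looked up;
            # an empty CMap is returned only if that lookup fails.
            try:
                return CMapDB.get_cmap(cmap_name)
            except Exception as e:
                log.info(f"get_cmap for {cmap_name} failed due to '{e}'")
                return CMap()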

I have not tested on other PDFs, but for the PDF in this issue the code block above (at the end of get_cmap_from_spec) restores the previous behavior: the output is Ysj退Lvf送 again.
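
The lookup order described in this comment can be tried out in isolation. The following is a minimal sketch that does not import pdfminer: `IDENTITY_ALIASES`, `KNOWN_CMAPS`, `EMPTY_CMAP`, `lookup`, `resolve_restricted` and `resolve_with_fallback` are all hypothetical stand-ins invented for illustration, and the "restricted" variant only assumes that names outside the identity table end up with an empty map.

```python
import logging

log = logging.getLogger(__name__)

# Hypothetical stand-ins for illustration only; these are not pdfminer's data.
IDENTITY_ALIASES = {"Identity-H": "Identity-H", "DLIdent-H": "Identity-H"}
KNOWN_CMAPS = {
    "Identity-H": {"kind": "identity"},
    "UniGB-UCS2-H": {"kind": "predefined"},
}
EMPTY_CMAP = {"kind": "empty"}


def lookup(name):
    """Stand-in for a CMap database lookup; raises KeyError for unknown names."""
    return KNOWN_CMAPS[name]


def resolve_restricted(name):
    """Assumed regressed behaviour: only identity names ever reach the lookup."""
    if name in IDENTITY_ALIASES:
        return lookup(IDENTITY_ALIASES[name])
    return EMPTY_CMAP


def resolve_with_fallback(name):
    """Proposed behaviour: always try the lookup, fall back to empty on failure."""
    if name in IDENTITY_ALIASES:
        return lookup(IDENTITY_ALIASES[name])
    try:
        return lookup(name)
    except KeyError as e:
        log.info("lookup for %s failed due to %r", name, e)
        return EMPTY_CMAP


if __name__ == "__main__":
    for cmap_name in ("DLIdent-H", "UniGB-UCS2-H", "NoSuchCMap"):
        print(cmap_name, resolve_restricted(cmap_name), resolve_with_fallback(cmap_name))
```

In this toy setup only the fallback variant returns a usable map for UniGB-UCS2-H, which mirrors why the Chinese characters reappear once the lookup is attempted for every name.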

pietermarsman · 2020-06-06T14:35:41Z

Thanks @dwalton76. I've used your analysis as a starting point.

Would someone like to review my PR, #438?

yadavsandip32 mentioned this issue Mar 23, 2020

Can not extrat text from some chinese pdf document #400

Closed

pietermarsman added the "type: bug" and "component: characters" (anything with encodings, character mappings or CJK languages) labels Mar 24, 2020

pietermarsman mentioned this issue May 21, 2020

Version 20200517 cannot extract correct text from pdf file (version: pdf-1.2) but Version 20181108 can #430

Closed

pietermarsman mentioned this issue Jun 6, 2020

Always try to get CMap, even if name is not recognized #438

Merged

pietermarsman self-assigned this Jul 11, 2020

pietermarsman closed this as completed in #438 Jul 23, 2020
