New versions of pdfminer.six cannot extract chinese characters from pdf #391

Closed
yadavsandip32 opened this issue Mar 16, 2020 · 6 comments · Fixed by #438
Labels: component:characters (Anything with encodings, character mappings or CJK languages), type: bug

Comments

@yadavsandip32

yadavsandip32 commented Mar 16, 2020

Describe the bug:

I have observed that the latest version of pdfminer.six, 20200124, cannot extract Chinese characters.

The last version that works correctly is 20181108. All later versions have this problem.

To Reproduce:

Simply run pdf2txt.py input.pdf

Current output:

Ysj Lvf
Tel:
Fax:
Email:
...
...
Abc Lvf
28H abcdefg Road#06-90 Tower 9 The Abcdef@AbcdefGhijklmnop 8336661

Expected Output:

Ysj退Lvf送
Tel:
Fax:
Email:
...
...
Abc退Lvf送
28H abcdefg Road#06-90 Tower 9 The Abcdef@AbcdefGhijklmnop 8336661
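To make the difference between the two outputs explicit, here is a minimal sketch that lists the non-ASCII characters present in the expected text but absent from the actual text. The helper name `missing_non_ascii` is hypothetical and not part of pdfminer.six; it only uses the standard library and the first output line from each listing above.

```python
import unicodedata


def missing_non_ascii(expected, actual):
    """Return (char, Unicode name) pairs for non-ASCII characters that
    occur in `expected` but nowhere in `actual`, in first-seen order."""
    present = set(actual)
    missing = []
    for ch in expected:
        if ord(ch) > 127 and ch not in present and ch not in missing:
            missing.append(ch)
    return [(ch, unicodedata.name(ch, "UNKNOWN")) for ch in missing]


# First line of the expected and current output from this report.
expected_line = "Ysj退Lvf送"
current_line = "Ysj Lvf"

for ch, name in missing_non_ascii(expected_line, current_line):
    print(f"U+{ord(ch):04X} {name}")
```

Running this on the full extracted text of both versions would show whether only the CJK characters are dropped or whether other characters are affected too.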

Reference PDF:
Sample pdf for chinese character.pdf

P.S. Acrobat may show an error when opening this PDF; you can ignore it. The error appears because part of the PDF contained customer data, which I changed manually.

@pietermarsman
Member

pietermarsman commented Mar 24, 2020

I can replicate this issue.

I did a git bisect and the commit that introduced the error is fa40043.

I tested with

python tools/pdf2txt.py uncompressedunbroken.pdf | head -n 1
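The manual check above could also be automated for `git bisect run`. The following is only a sketch under a few assumptions: the function names are hypothetical, a commit counts as "good" when the first extracted line contains at least one character in the CJK Unified Ideographs range (U+4E00 to U+9FFF), and `tools/pdf2txt.py` is run from the repository root of the checked-out commit.

```python
import os
import subprocess
import sys


def has_cjk(text):
    """True if text contains a CJK Unified Ideograph (U+4E00 to U+9FFF)."""
    return any("\u4e00" <= ch <= "\u9fff" for ch in text)


def bisect_exit_code(first_line):
    """Map the first extracted line to a `git bisect run` exit code:
    0 marks the commit as good, 1 marks it as bad."""
    return 0 if has_cjk(first_line) else 1


def check_commit(pdf_path):
    """Run tools/pdf2txt.py on pdf_path and classify the current commit."""
    result = subprocess.run(
        [sys.executable, "tools/pdf2txt.py", pdf_path],
        capture_output=True,
        encoding="utf-8",
        errors="replace",
        # Assumption: forcing UTF-8 keeps CJK output intact when piped.
        env={**os.environ, "PYTHONIOENCODING": "utf-8"},
    )
    if result.returncode != 0:
        return 125  # exit code 125 asks git bisect to skip this commit
    lines = result.stdout.splitlines()
    return bisect_exit_code(lines[0] if lines else "")
```

A small wrapper script that ends with `sys.exit(check_commit("uncompressedunbroken.pdf"))` could then be passed to `git bisect run`.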

pietermarsman added the labels type: bug and component:characters on Mar 24, 2020
@yadavsandip32
Author

Hi @pietermarsman, are you or someone else working on this issue?

@pietermarsman
Member

Not that I am aware of.

@fakabbir
Contributor

@yadavsandip32 Looking into this

@dwalton76

Since fa40043, CMapDB.get_cmap(cmap_name) is only called if cmap_name is in IDENTITY_ENCODER. For the Chinese characters in the PDF in this issue, the cmap_name is UniGB-UCS2-H. If get_cmap is called for UniGB-UCS2-H, CMapDB has the information needed to create the CMap.

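        # cmap_name is a key of IDENTITY_ENCODER: look up the CMap it maps to.
        if cmap_name in IDENTITY_ENCODER:
            return CMapDB.get_cmap(IDENTITY_ENCODER[cmap_name])
        else:
            # Any other name (e.g. UniGB-UCS2-H) is looked up in CMapDB directly.
            try:
                return CMapDB.get_cmap(cmap_name)
            except Exception as e:
                # Lookup failed: log it and fall back to a default CMap.
                log.info(f"get_cmap for {cmap_name} failed due to '{e}'")
                return CMap()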

I have not tested other PDFs, but for the PDF in this issue the code block above (at the end of get_cmap_from_spec) restores the previous behavior and outputs Ysj退Lvf送.
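To illustrate the lookup behavior described in this comment without needing pdfminer.six installed, here is a toy model. Everything in it is a stand-in: the `CMap` and `CMapDB` classes, the contents of `IDENTITY_ENCODER`, the set of known CMap names, and both function names are assumptions made for illustration and are not the library's real implementation. The "regressed" variant assumes that names outside `IDENTITY_ENCODER` get an empty CMap.

```python
# Stand-in for pdfminer's IDENTITY_ENCODER (contents assumed).
IDENTITY_ENCODER = {"Identity-H": "Identity-H", "Identity-V": "Identity-V"}


class CMap:
    """Stand-in CMap; name is None for an empty, default CMap."""

    def __init__(self, name=None):
        self.name = name


class CMapNotFound(Exception):
    """Stand-in error raised when a CMap name is unknown."""


class CMapDB:
    """Stand-in database that only knows a few CMap names."""

    _known = {"Identity-H", "Identity-V", "UniGB-UCS2-H"}

    @classmethod
    def get_cmap(cls, name):
        if name not in cls._known:
            raise CMapNotFound(name)
        return CMap(name)


def get_cmap_regressed(cmap_name):
    """Assumed regressed behavior: only identity names reach CMapDB."""
    if cmap_name in IDENTITY_ENCODER:
        return CMapDB.get_cmap(IDENTITY_ENCODER[cmap_name])
    return CMap()


def get_cmap_with_fallback(cmap_name):
    """Proposed behavior: try CMapDB for every name, else default CMap."""
    if cmap_name in IDENTITY_ENCODER:
        return CMapDB.get_cmap(IDENTITY_ENCODER[cmap_name])
    try:
        return CMapDB.get_cmap(cmap_name)
    except CMapNotFound:
        return CMap()


print(get_cmap_regressed("UniGB-UCS2-H").name)
print(get_cmap_with_fallback("UniGB-UCS2-H").name)
```

The toy version catches a specific stand-in exception rather than a bare `Exception`. Whether a narrower exception is appropriate in the real code depends on what `CMapDB.get_cmap` can raise, which this sketch does not establish.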

@pietermarsman
Member

Thanks @dwalton76. I've used your analysis as a starting point.

Does anyone want to review my PR, #438?
