New versions of pdfminer.six cannot extract chinese characters from pdf #391
Comments
I can replicate this issue. I did a I tested with
Hi @pietermarsman, are you or someone else working on this issue?
Not that I am aware of.
@yadavsandip32 Looking into this.
Post fa40043
I have not tested on other PDFs, but for the PDF in this issue the code block above (at the end of
Thanks @dwalton76. I've used your analysis as a starting point. Would someone like to review my PR, #438?
Describe the bug:
I have observed that the latest version of pdfminer.six, 20200124, is not able to read Chinese characters.
The last version that works correctly is 20181108. All later versions have this problem.
To Reproduce:
Simply run
pdf2txt.py input.pdf
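To compare versions without inspecting the output by eye, a rough sketch like the one below could count the Chinese characters in the extracted text. The helper names `count_cjk` and `check_pdf` are hypothetical, and the check only covers the basic CJK Unified Ideographs block (U+4E00 to U+9FFF). It assumes `pdfminer.high_level.extract_text` is available in the installed release, which may not hold for older versions such as 20181108.

```python
def count_cjk(text: str) -> int:
    """Count characters in the CJK Unified Ideographs block (U+4E00..U+9FFF)."""
    return sum(1 for ch in text if "\u4e00" <= ch <= "\u9fff")


def check_pdf(path: str) -> int:
    """Extract text from a PDF and return how many CJK ideographs it contains.

    Hypothetical helper: the import is done lazily so that count_cjk can be
    used on its own, for example on text saved from pdf2txt.py output.
    Assumes the installed pdfminer.six provides high_level.extract_text.
    """
    from pdfminer.high_level import extract_text

    return count_cjk(extract_text(path))
```

Calling `check_pdf("input.pdf")` under each installed version should give a nonzero count where extraction works. On a release without `extract_text`, the output of `pdf2txt.py input.pdf` can be saved to a file and passed to `count_cjk` instead.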
Current output:
Expected Output:
Reference PDF:
Sample pdf for chinese character.pdf
P.S. Acrobat may show an error when opening this PDF; you can ignore it. The error appears because part of the PDF contained customer data, which I changed manually.