Allow selecting decode errors behaviour #87

Closed
stefanor opened this issue Jul 4, 2015 · 11 comments

Comments

@stefanor
Contributor

stefanor commented Jul 4, 2015

Forwarded from https://bugs.launchpad.net/ubuntu/+source/python-html2text/+bug/1318227

Currently it stops conversion on any decode error:

$ html2markdown broken_text
Traceback (most recent call last):
  File "/usr/bin/html2markdown", line 9, in <module>
    load_entry_point('html2text==3.200.3', 'console_scripts', 'html2text')()
  File "/usr/lib/python3/dist-packages/html2text.py", line 781, in main
    data = data.decode(encoding)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x91 in position 4: invalid start byte

But for the files I'm working on it would be perfectly fine just to add

data = data.decode(encoding, errors='ignore')

It can be exposed as an option.
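To illustrate the request, here is a minimal sketch of how Python's standard error handlers treat a stray 0x91 byte at position 4, as in the traceback above. The helper name `decode_with` and the sample bytes are made up for illustration; they are not part of html2text.

```python
# Sample input (an assumption): valid ASCII with a stray 0x91 byte at position 4.
raw = b"abcd\x91efgh"


def decode_with(data: bytes, encoding: str = "utf-8", errors: str = "strict") -> str:
    """Hypothetical helper: decode bytes with the chosen error handler."""
    return data.decode(encoding, errors=errors)


strict_failure = None
try:
    # 'strict' is the default handler and raises on the first bad byte.
    decode_with(raw)
except UnicodeDecodeError as exc:
    strict_failure = exc

# 'ignore' silently drops the undecodable byte.
ignored = decode_with(raw, errors="ignore")

# 'replace' substitutes U+FFFD so the loss stays visible in the output.
replaced = decode_with(raw, errors="replace")
```

With `ignore` the bad byte simply disappears, while `replace` leaves a marker behind, which may be preferable when silent data loss is a concern.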

@Alir3z4
Owner

Alir3z4 commented Jul 4, 2015

load_entry_point('html2text==3.200.3', 'console_scripts', 'html2text')()

This is possibly a duplicate of #83 and some others, I guess, but the main issue is that this report is not against the latest version.

@stefanor Could you please confirm the latest version throws the same error as well?

@theSage21
Collaborator

@stefanor That would be a good idea. Am I right in saying that this should be a non-boolean value, something like decode_errors = ignore?

@int-ua

int-ua commented Jul 5, 2015

@Alir3z4 I'm afraid it's not fixed; just check the code here:
https://github.com/Alir3z4/html2text/blob/master/html2text/cli.py#L204

No option to ignore errors. Have you seen one anywhere?

@theSage21
Collaborator

@stefanor @int-ua I am working on this. Should the default be 'ignore' or 'strict'?

@int-ua

int-ua commented Jul 6, 2015

Looks like backward compatibility needs 'strict'. Also, losing some characters should be acknowledged IMO.
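A rough sketch of what a backward-compatible option could look like: a non-boolean setting that defaults to 'strict', so existing behaviour is unchanged unless the user opts in. This uses `argparse` for brevity and is an assumption, not the project's actual CLI code; the flag name `--decode-errors` follows the wording used in this thread, and `build_parser` and `decode_input` are hypothetical names.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser; not the real html2text command-line interface."""
    parser = argparse.ArgumentParser(prog="html2text-sketch")
    parser.add_argument(
        "--decode-errors",
        dest="decode_errors",
        default="strict",  # keeps the old fail-fast behaviour by default
        choices=("strict", "ignore", "replace"),
        help="how to handle bytes that cannot be decoded",
    )
    return parser


def decode_input(data: bytes, encoding: str, decode_errors: str) -> str:
    """Decode input bytes using the handler selected on the command line."""
    return data.decode(encoding, errors=decode_errors)


# No flag given: the default stays 'strict'.
default_args = build_parser().parse_args([])

# Opting in to lossy decoding explicitly.
lenient_args = build_parser().parse_args(["--decode-errors", "ignore"])
text = decode_input(b"abcd\x91", "utf-8", lenient_args.decode_errors)
```

Restricting the value with `choices` means a typo in the handler name is rejected at argument parsing time rather than surfacing later as a `LookupError` from `decode`.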

@theSage21
Collaborator

@int-ua @stefanor Does this fix it?

@int-ua

int-ua commented Jul 6, 2015

Yes, thanks. Also, it would be nice to catch this UnicodeDecodeError and recommend using the new option.
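One way the suggestion to catch the error could look, as a sketch only: the function name `decode_or_hint` and the exact hint wording are assumptions, and the error is re-raised so that strict mode still fails as before.

```python
import sys


def decode_or_hint(data: bytes, encoding: str = "utf-8", errors: str = "strict") -> str:
    """Hypothetical helper: decode, and point at the option when decoding fails."""
    try:
        return data.decode(encoding, errors=errors)
    except UnicodeDecodeError as exc:
        # Tell the user where decoding failed and how to opt in to lossy decoding.
        print(
            f"Warning: cannot decode byte at position {exc.start} as {encoding}. "
            "Use --decode-errors=ignore to skip undecodable bytes.",
            file=sys.stderr,
        )
        raise  # strict mode keeps failing; only the hint is new
```

Writing the hint to stderr keeps it out of the converted output when the result is piped to a file.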

@theSage21
Collaborator

@int-ua Something like "Warning: this is set for deprecation. Use --decode-errors=ignore"?

@theSage21
Collaborator

@int-ua It is done. @Alir3z4 Consider the issue closed?

@theSage21
Collaborator

@Alir3z4 You merged #88. This issue is closed.

@Alir3z4
Owner

Alir3z4 commented Aug 10, 2015

@theSage21 Thanks for the heads up.
The issue is closed now.

Alir3z4 closed this as completed Aug 10, 2015