LEFT-TO-RIGHT MARK sometimes causing a crash #208

Closed
JulienPalard opened this issue Aug 3, 2018 · 5 comments

@JulienPalard

Hi, using html2text 2018.1.9 and Python 3.7.0, I'm having an issue with the U+200E LEFT-TO-RIGHT MARK character:

$ html2text <<< '<html> <body> <b>b</b>&#8206; </body> </html>'
Traceback (most recent call last):
  File "/home/mdk/.venvs/googlesearchd/bin/html2text", line 11, in <module>
    sys.exit(main())
  File "/home/mdk/.venvs/googlesearchd/lib/python3.7/site-packages/html2text/cli.py", line 324, in main
    wrapwrite(h.handle(data))
  File "/home/mdk/.venvs/googlesearchd/lib/python3.7/site-packages/html2text/__init__.py", line 149, in handle
    self.feed(data)
  File "/home/mdk/.venvs/googlesearchd/lib/python3.7/site-packages/html2text/__init__.py", line 146, in feed
    HTMLParser.HTMLParser.feed(self, data)
  File "/home/mdk/.pyenv/versions/3.7.0-debug/lib/python3.7/html/parser.py", line 111, in feed
    self.goahead(0)
  File "/home/mdk/.pyenv/versions/3.7.0-debug/lib/python3.7/html/parser.py", line 204, in goahead
    self.handle_charref(name)
  File "/home/mdk/.venvs/googlesearchd/lib/python3.7/site-packages/html2text/__init__.py", line 186, in handle_charref
    self.handle_data(self.charref(c), True)
  File "/home/mdk/.venvs/googlesearchd/lib/python3.7/site-packages/html2text/__init__.py", line 802, in handle_data
    and re.match(r'[^\s.!?]', data[0])
IndexError: string index out of range
@JulienPalard
Author

JulienPalard commented Aug 3, 2018

I think the main problem is that charref can easily return the empty string: the explicit return '' obviously does, and return unifiable_n[c] can too.

But handle_data, which receives the result of charref, accesses data[0] as early as its sixth line, so an empty string raises the IndexError shown in the traceback.

Comes from b2765e2
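
To make the failure mode concrete, here is a minimal, self-contained sketch of the pattern described above. It is not the html2text source: `UNIFIABLE_N_SKETCH`, `charref_sketch`, `starts_sentence_unsafe`, and `starts_sentence_safe` are hypothetical names, and the table entry for U+200E is an assumption made for illustration. Only the regular expression and the `data[0]` access are taken from the traceback.

```python
import re

# Hypothetical stand-in for a code-point replacement table such as
# unifiable_n. The single entry is an assumption for illustration.
UNIFIABLE_N_SKETCH = {0x200E: ""}  # U+200E LEFT-TO-RIGHT MARK -> nothing


def charref_sketch(name: str) -> str:
    """Resolve a numeric character reference such as '8206' or 'x200e'."""
    if name[0] in ("x", "X"):
        code = int(name[1:], 16)
    else:
        code = int(name)
    if code in UNIFIABLE_N_SKETCH:
        return UNIFIABLE_N_SKETCH[code]  # may be the empty string
    try:
        return chr(code)
    except ValueError:
        return ""  # out-of-range code point: also the empty string


def starts_sentence_unsafe(data: str) -> bool:
    # Same expression as in the traceback: data[0] raises IndexError
    # when data is the empty string.
    return bool(re.match(r"[^\s.!?]", data[0]))


def starts_sentence_safe(data: str) -> bool:
    # Check for emptiness before indexing.
    return bool(data) and bool(re.match(r"[^\s.!?]", data[0]))
```

With these definitions, `starts_sentence_unsafe(charref_sketch("8206"))` raises `IndexError: string index out of range`, while `starts_sentence_safe(charref_sketch("8206"))` returns `False`. Whether the real fix should guard inside handle_data or skip empty results in handle_charref is a design choice this sketch does not settle.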

@Alir3z4
Owner

Alir3z4 commented Jan 7, 2019

Did #206 fix this?

@JulienPalard
Author

No:

$ git rev-parse --short HEAD
a9a6133
$ pip install -e .
Obtaining file:///home/mdk/clones/html2text
Installing collected packages: html2text
  Running setup.py develop for html2text
Successfully installed html2text
$ which html2text 
/home/mdk/.venvs/3.7.1/bin/html2text
$ html2text <<< '<html> <body> <b>b</b>&#8206; </body> </html>'
Traceback (most recent call last):
  File "/home/mdk/.venvs/3.7.1/bin/html2text", line 11, in <module>
    load_entry_point('html2text', 'console_scripts', 'html2text')()
  File "/home/mdk/clones/html2text/html2text/cli.py", line 337, in main
    wrapwrite(h.handle(data))
  File "/home/mdk/clones/html2text/html2text/__init__.py", line 151, in handle
    self.feed(data)
  File "/home/mdk/clones/html2text/html2text/__init__.py", line 148, in feed
    HTMLParser.HTMLParser.feed(self, data)
  File "/home/mdk/.pyenv/versions/3.7.1/lib/python3.7/html/parser.py", line 111, in feed
    self.goahead(0)
  File "/home/mdk/.pyenv/versions/3.7.1/lib/python3.7/html/parser.py", line 204, in goahead
    self.handle_charref(name)
  File "/home/mdk/clones/html2text/html2text/__init__.py", line 188, in handle_charref
    self.handle_data(self.charref(c), True)
  File "/home/mdk/clones/html2text/html2text/__init__.py", line 813, in handle_data
    and re.match(r'[^\s.!?]', data[0])
IndexError: string index out of range

@Alir3z4
Owner

Alir3z4 commented Jan 8, 2019

Would you be able to submit a patch for this problem?

@jdufresne
Collaborator

@JulienPalard I've proposed a fix for this in #255. If you have the time, could you test it against your real-life use case? Thanks.


3 participants