URL & Unicode characters #9905
Replies: 3 comments 5 replies
-
Note: these links are perfectly fine in HTML5, so the issue concerns only HTML4 (and EPUB2).
-
I made the change.
-
One more question related to non-ASCII characters: What does pandoc consider a valid id? I faced a funny case:
Now let's try to generate a span with such an id:
Oops, the construct is not recognized as a span at all. There is a workaround, though:
HTML allows almost any character in an id, but pandoc can't do the same, since a space separates the id from other attributes, a closing brace terminates the attribute list, etc. Currently the pandoc documentation does not specify which characters are allowed and which are not:
-
Hi,
I faced a problem related to Unicode (non-ASCII) characters and URLs. Look:
The HTML standard places very few restrictions on ids: an id must be non-empty, unique, and free of whitespace, that's all. So an id made of non-ASCII characters is perfectly valid:
I used a Greek letter as the id. OK, now let's use it as a link destination:
And see how pandoc processes it:
Let's ignore the first 3 tidy warnings: they appear because I generated an HTML fragment, not a standalone file. The fourth warning is my point. pandoc passed the id of the destination to the link's href attribute as-is: <a href="#Ψ">, but href is expected to be a URL, and URL syntax does not allow non-ASCII characters. Non-ASCII characters in URLs (let's forget about the domain part for a moment) must be percent-encoded, so the correct HTML code would be: <a href="#%CE%A8">.
I can modify the markdown source to get fully correct HTML:
But it looks ugly, is unmaintainable, and contradicts the markdown philosophy that the source must be human-readable.
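To illustrate the encoding step discussed above, here is a minimal Python sketch. It is not pandoc's code; the helper name encode_fragment is hypothetical, and it assumes the id is encoded as UTF-8 before percent-encoding.

```python
from urllib.parse import quote


def encode_fragment(identifier: str) -> str:
    """Build an href value that points at the given id.

    Hypothetical helper for illustration only. Every character outside
    letters, digits and "_.-~" is percent-encoded as UTF-8 bytes.
    """
    return "#" + quote(identifier, safe="")


print(encode_fragment("Ψ"))  # prints "#%CE%A8"
```

With something like this applied by the writer, the markdown source could keep the readable Greek letter while the generated href stays ASCII-only.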
The same problem arises with Unicode characters in URL paths, e. g.:
In such a case not only à but also the parentheses must be percent-encoded. Special encoding rules are used for characters in the domain part of a URL.
The problem is not very serious, though, since modern browsers accept non-ASCII characters in URLs: I tested Firefox and Chromium and didn't see any problem with such links.
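The path and domain rules mentioned above can be sketched in Python as follows. This is only an illustration under assumptions: the path and host are made-up examples, the helper names are hypothetical, and the domain part uses Python's built-in "idna" codec (which implements the older IDNA 2003 rules).

```python
from urllib.parse import quote


def encode_path(path: str) -> str:
    """Percent-encode a URL path (hypothetical helper).

    "/" is kept as the segment separator; everything else outside
    letters, digits and "_.-~" is encoded, including parentheses.
    """
    return quote(path, safe="/")


def encode_host(host: str) -> str:
    """Convert a non-ASCII domain name to its ASCII (Punycode) form."""
    return host.encode("idna").decode("ascii")


# Made-up example values, not taken from the original post.
print(encode_path("/wiki/à_(test)"))   # prints "/wiki/%C3%A0_%28test%29"
print(encode_host("bücher.example"))   # prints "xn--bcher-kva.example"
```

Note that the two parts of a URL need different treatment: percent-encoding applies to the path and fragment, while the host is converted label by label.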
What do you think? Should pandoc properly encode URLs to generate valid HTML, or pass them as is and hope browsers will do the work?
P. S. I think the problem has been discussed, but I can't find it in issues or in discussions.