HTML link attributes are erased when parsed as HTML but not as Markdown #6970

Closed
gwern opened this issue Dec 20, 2020 · 1 comment

Comments

@gwern
Contributor

gwern commented Dec 20, 2020

Pandoc correctly generates an HTML link with ID & attributes:

$ echo '[foo](https://www.example.com){#foo key1=value1 key2=value2}' | pandoc -f markdown -w html
<p><a href="https://www.example.com" id="foo" data-key1="value1" data-key2="value2">foo</a></p>

On reading its own HTML as HTML and generating either HTML or Markdown, the key-value attributes are silently erased:

$ echo '[foo](https://www.example.com){#foo key1=value1 key2=value2}' | pandoc -f markdown -w html | pandoc -f html -w html
<p><a href="https://www.example.com" id="foo">foo</a></p>
$ echo '[foo](https://www.example.com){#foo key1=value1 key2=value2}' | pandoc -f markdown -w html | pandoc -f html -w markdown
[foo](https://www.example.com){#foo}

But on reading its own HTML as Markdown, the data is preserved correctly:

$ echo '[foo](https://www.example.com){#foo key1=value1 key2=value2}' | pandoc -f markdown -w html | pandoc -f markdown -w markdown
```{=html}
<p>
```
`<a href="https://www.example.com" id="foo" data-key1="value1" data-key2="value2">`{=html}foo`</a>`{=html}
```{=html}
</p>
```
$ echo '[foo](https://www.example.com){#foo key1=value1 key2=value2}' | pandoc -w html | pandoc -f markdown -w html
<p>
<a href="https://www.example.com" id="foo" data-key1="value1" data-key2="value2">foo</a>
</p>

This turned out to be a serious problem for my link annotation code. I write the annotations as HTML, so naturally my processing code also used readHtml; unfortunately, that erases most (but not all) of the data. This fooled me for a while: the classes/IDs were all still there when I checked the final generated HTML, and I didn't notice that the data-* attributes were all gone. Debugging in ghci & on the CLI was even more confusing, until I happened to check every possible pair of HTML/Markdown input/output formats and discovered that readMarkdown is better at reading HTML than readHtml is (!). Switching to readMarkdown solved the immediate problem of silently stripped annotations, but introduced further downstream problems, like needing to strip the <p></p> surrounding fragments such as titles/authors. So it would be good for this to be fixed.
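
Since the loss is silent and the IDs survive, eyeballing the output is not a reliable check. One way to catch it earlier is to diff the attributes on each link before and after the round trip. Below is a minimal sketch in Python using only the standard library; `link_attrs` and `dropped_attrs` are hypothetical helper names invented for illustration (not part of Pandoc), and the two input strings are the HTML outputs quoted above.

```python
from html.parser import HTMLParser


class LinkAttrCollector(HTMLParser):
    """Collect the attributes of every <a> element, in document order."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; names are lowercased.
        if tag == "a":
            self.links.append(dict(attrs))


def link_attrs(html):
    """Return one attribute dict per <a> element found in `html`."""
    collector = LinkAttrCollector()
    collector.feed(html)
    collector.close()
    return collector.links


def dropped_attrs(before, after):
    """For each link, list attribute names present in `before` but not `after`.

    Assumption: both documents contain the same links in the same order.
    """
    links_before = link_attrs(before)
    links_after = link_attrs(after)
    if len(links_before) != len(links_after):
        raise ValueError("documents contain a different number of links")
    return [
        sorted(set(old) - set(new))
        for old, new in zip(links_before, links_after)
    ]


# HTML written by `pandoc -f markdown -w html`, as quoted in this issue.
original = (
    '<p><a href="https://www.example.com" id="foo" '
    'data-key1="value1" data-key2="value2">foo</a></p>'
)
# The same document after `pandoc -f html -w html`, as quoted in this issue.
roundtrip = '<p><a href="https://www.example.com" id="foo">foo</a></p>'

print(dropped_attrs(original, roundtrip))  # [['data-key1', 'data-key2']]
```

An empty inner list means a link kept all of its attributes, so a processing pipeline could fail loudly whenever any inner list is non-empty.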

@jgm
Owner

jgm commented Dec 20, 2020

Yes, I think we could do that; most of the HTML reader predates link attributes in the AST and wasn't written with that in mind.
