HTML link attributes are erased when parsed as HTML but not as Markdown #6970

gwern · 2020-12-20T19:17:02Z

Pandoc correctly generates an HTML link with ID & attributes:

$ echo '[foo](https://www.example.com){#foo key1=value1 key2=value2}' | pandoc -f markdown -w html
<p><a href="https://www.example.com" id="foo" data-key1="value1" data-key2="value2">foo</a></p>

But when Pandoc reads its own HTML as HTML and writes either HTML or Markdown, the key-value attributes are silently erased:

$ echo '[foo](https://www.example.com){#foo key1=value1 key2=value2}' | pandoc -f markdown -w html | pandoc -f html -w html
<p><a href="https://www.example.com" id="foo">foo</a></p>
$ echo '[foo](https://www.example.com){#foo key1=value1 key2=value2}' | pandoc -f markdown -w html | pandoc -f html -w markdown
[foo](https://www.example.com){#foo}
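
Because the ID survives, the loss is easy to miss by eye. A minimal sketch of an automated check, assuming only Python's standard library: it compares the attributes of each `<a>` element before and after a round trip. The names `LinkAttrCollector`, `link_attrs`, and `dropped_attrs` are hypothetical helpers for illustration, not part of Pandoc, and the two input strings are copied from the outputs above.

```python
from html.parser import HTMLParser


class LinkAttrCollector(HTMLParser):
    """Collect the attributes of every <a> element, in document order."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.links.append(dict(attrs))


def link_attrs(html):
    """Return one attribute dict per <a> element found in `html`."""
    collector = LinkAttrCollector()
    collector.feed(html)
    collector.close()
    return collector.links


def dropped_attrs(before, after):
    """For each link (paired by position), list the attribute names
    present in `before` but missing from `after`."""
    return [
        sorted(set(old) - set(new))
        for old, new in zip(link_attrs(before), link_attrs(after))
    ]


# Output of `pandoc -f markdown -w html`, as shown above.
before = ('<p><a href="https://www.example.com" id="foo" '
          'data-key1="value1" data-key2="value2">foo</a></p>')
# The same HTML after `pandoc -f html -w html`, as shown above.
after = '<p><a href="https://www.example.com" id="foo">foo</a></p>'

print(dropped_attrs(before, after))  # [['data-key1', 'data-key2']]
```

Links are paired by position, so this sketch assumes the round trip neither adds nor removes links; a link that disappears entirely is not reported.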

When Pandoc instead reads its own HTML as Markdown, the data is preserved correctly:

$ echo '[foo](https://www.example.com){#foo key1=value1 key2=value2}' | pandoc -f markdown -w html | pandoc -f markdown -w markdown
```{=html}
<p>
```
`<a href="https://www.example.com" id="foo" data-key1="value1" data-key2="value2">`{=html}foo`</a>`{=html}
```{=html}
</p>
```
$ echo '[foo](https://www.example.com){#foo key1=value1 key2=value2}' | pandoc -w html | pandoc -f markdown -w html
<p>
<a href="https://www.example.com" id="foo" data-key1="value1" data-key2="value2">foo</a>
</p>

This turned out to be a serious problem for my link annotation code: I write the annotations as HTML, so naturally my processing code also used readHtml. Unfortunately, that erases most (but not all) of the data, which fooled me for a while: I could see the classes/IDs were all still there when I checked the final generated HTML, but didn't notice the data-* attributes were all gone. Debugging in ghci & the CLI was even more confusing until I happened to check every possible pair of HTML/Markdown input/output formats and discovered that readMarkdown is better at reading HTML than readHtml is (!). Switching to readMarkdown solved the immediate problem of silently stripped annotations but introduced further downstream problems, like needing to strip the <p></p> surrounding fragments such as titles/authors... So it would be good for this to be fixed.
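
For reference, the `<p></p>` stripping mentioned above can be sketched roughly as follows. This is an illustration in Python, not the actual annotation code; `strip_paragraph_wrapper` is a hypothetical helper, and it assumes the wrapper is a bare `<p>` with no attributes, as in the Pandoc output shown earlier.

```python
import re

# Matches a fragment that consists of exactly one bare <p>...</p> wrapper,
# allowing surrounding whitespace and newlines inside the wrapper.
_WRAPPER = re.compile(r"\A\s*<p>(.*)</p>\s*\Z", re.DOTALL)


def strip_paragraph_wrapper(fragment):
    """Remove a single enclosing <p>...</p> from an HTML fragment.

    The fragment is returned unchanged if it is not wrapped in one
    paragraph, or if it contains several paragraphs."""
    match = _WRAPPER.match(fragment)
    if match is None:
        return fragment
    inner = match.group(1)
    # A nested paragraph tag means the input was several paragraphs,
    # not a single wrapper, so leave it alone.
    if "<p>" in inner or "</p>" in inner:
        return fragment
    return inner.strip()


title = '<p>\n<a href="https://www.example.com" id="foo">foo</a>\n</p>'
print(strip_paragraph_wrapper(title))
# <a href="https://www.example.com" id="foo">foo</a>
```

A regex is enough here only because the fragments are short, machine-generated titles/authors; arbitrary HTML would need a real parser.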

jgm · 2020-12-20T22:24:02Z

Yes, I think we could do that; most of the HTML reader predates link attributes in the AST and wasn't written with that in mind.

jgm added the format:HTML and reader labels on Dec 20, 2020

cderv mentioned this issue Nov 24, 2021

Some valid attributes are dropped from HTML to Markdown conversion #7714

Closed

jgm closed this as completed in 6072bdc Nov 24, 2021
