un-escaped characters for asciidoc output #2337

Closed
tsagkase opened this issue Jul 30, 2015 · 24 comments

Comments

@tsagkase

Pandoc version 1.15.0.6 doesn't correctly escape asciidoc output

$ echo '<a href="http://example.com">][</a>' | pandoc -f html -t asciidoc
http://example.com[][]

which asciidoc would render back as ...

$ echo '<a href="http://example.com">][</a>' | pandoc -f html -t asciidoc|asciidoc - |grep example\.com
<div class="paragraph"><p><a href="http://example.com">http://example.com</a>[]</p></div>

Unfortunately, the rules for escaping asciidoc special chars are complex and I cannot point to a single place in the asciidoc documentation. The general rule is that the '\' character is used to escape. So with correct quoting/escaping ...

$ echo '<a href="http://example.com">\][</a>' | pandoc -f html -t asciidoc|asciidoc - |grep example\.com
<div class="paragraph"><p><a href="http://example.com">][</a></p></div>

References

@jgm
Owner

jgm commented May 6, 2017

It would be easy to escape all these special characters, but the output would likely be ugly.
Not sure it's worth it if these cases are rare...

jgm removed this from the pandoc 2.0 milestone May 6, 2017
@mako4

mako4 commented Apr 10, 2018

Escaping with backslashes is not that easy in asciidoc, because it is very picky: it only accepts a backslash escape in exactly those cases where it would otherwise recognize a command (with exceptions), and in all other cases it renders the backslash literally. (I'm using asciidoctor as the reference here; I haven't tried the original implementation.)

E.g. escaping <<...>> to make asciidoc not render them as in-document references

\<<not a proper reference>

\<<proper reference>>

will render a backslash for the first line:

\<<not a proper reference>
<<proper reference>>

Edit: Apparently, there is a much more reliable way to do this with passthroughs: ++<<++proper reference>> will work just fine. This is the unconstrained version of the +...+ passthrough markers. Here is the relevant section of the documentation: Escaping unconstrained quotes
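
As a rough illustration of the passthrough idea above, here is a minimal Python sketch. The function name `passthrough` is made up for this example and is not part of pandoc or asciidoctor. It conservatively refuses text that already contains `++`, on the assumption that the markers cannot safely appear inside the span they delimit.

```python
def passthrough(text: str) -> str:
    """Wrap text in AsciiDoc unconstrained passthrough markers (++...++).

    Hypothetical helper for illustration only.
    """
    if "++" in text:
        # Assumption: the delimiter itself cannot be wrapped this way,
        # so such input needs a different escaping strategy.
        raise ValueError("text containing '++' needs a different escape")
    return f"++{text}++"


# Escape only the opening "<<" so it is not read as a cross reference.
print(passthrough("<<") + "proper reference>>")
# prints: ++<<++proper reference>>
```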

@lisa

lisa commented Jan 9, 2019

I believe I have two more instances of this, but with this mediawiki input:

# pandoc-mediawiki-asciidoc-bug.mediawiki file
Syntax defect begin <code>[a-zA-Z_][a-zA-Z0-9_]*</code> (syntax defect middle <code>__</code>) syntax defect near-end <code>[a-zA-Z_:][a-zA-Z0-9_:]*</code>. syntax defect end.

I have used variations of the phrase "syntax defect" as a way to sanitize and minimize the real-life source, and to illustrate the defect. Converting the file with pandoc -s -f mediawiki pandoc-mediawiki-asciidoc-bug.mediawiki -t asciidoc provides this output:

Syntax defect begin `[a-zA-Z_][a-zA-Z0-9_]*` (syntax defect middle `__`)
syntax defect near-end `[a-zA-Z_:][a-zA-Z0-9_:]*`. syntax defect end.

There are two escape issues with the output identified below with ^ characters:

Syntax defect begin `[a-zA-Z_][a-zA-Z0-9_]*` (syntax defect middle `__`)
                                          ^                         ^
syntax defect near-end `[a-zA-Z_:][a-zA-Z0-9_:]*`. syntax defect end.

Before the characters indicated by the ^ should be a literal \ to escape them, as in:

Syntax defect begin `[a-zA-Z_][a-zA-Z0-9_]\*` (syntax defect middle `\__`)
syntax defect near-end `[a-zA-Z_:][a-zA-Z0-9_:]*`. syntax defect end.

To summarize: I believe there are two separate defects broadly related to unescaped characters:

  1. The two regular expressions appear to interact with one another, with the * character in the first regex appearing to act as a bold start and the * in the second regex acting as the bold end.
  2. The __ appears to act as a single _ when it should be treated as a literal __ because it is between <code></code> mediawiki tags.

Version information:

pandoc 2.5
Compiled with pandoc-types 1.17.5.4, texmath 0.11.1.2, skylighting 0.7.4

MacOS 10.13.6, pandoc installed via homebrew.

@jgm
Owner

jgm commented Jan 9, 2019

Asciidoc is crazy!!
With this input

`[0-9]*`
`[0-9]*`

asciidoctor gives you

<code><strong class="0-9"></code>
<code>[0-9]</strong></code>

which isn't even well-formed HTML.
But with

`0-9*`
`0-9*`

you get

<code>0-9*</code>
<code>0-9*</code>

I have to believe this is a bug in asciidoctor and not the intended behavior. I'm not going to try to work around all these quirks.

@jgm
Owner

jgm commented Jan 9, 2019

Even worse, if you try to escape the *s in the first example above

`[0-9]\*`
`[0-9]\*`

you get

<code>[0-9]*</code>
<code>[0-9]\*</code>

The first backslash acts as an escape and the second one doesn't!
If this is intentional, it's an insane design decision. How are users supposed to keep track of what a backslash does in these contexts??

@lisa

lisa commented Jan 9, 2019

@jgm Would it help if we opened an issue about this with the upstream project, or supported you (as the owner of this repo) in that endeavour?

@jgm
Owner

jgm commented Jan 9, 2019

@lisa If you'd like to inquire upstream about whether this is intended behavior, and ask them to clarify the escaping rules, that would be great.

@mako4

mako4 commented Feb 4, 2019

Passthrough quotes fix this as well:

`++[0-9]*++`
`++[0-9]*++`

will produce the intended output.

This is still a bug in asciidoctor as well, since the output isn't proper HTML.

@henribru

henribru commented Jun 16, 2019

Escaping is also missing for these relatively simple cases:

$ pandoc --version
pandoc.exe 2.7.3
Compiled with pandoc-types 1.17.5.4, texmath 0.11.2.2, skylighting 0.8.1
[...]
$ echo "*Foo*" | pandoc -f html -t asciidoctor
*Foo*
$ echo "_Foo_" | pandoc -f html -t asciidoctor
_Foo_

Unfortunately I'm not sure what the "correct" output is in cases like this. According to https://asciidoctor.org/docs/asciidoc-syntax-quick-reference/#escaping-text I guess it would be \*Foo* and \_Foo_, but the handling of backslash escaping seems quite complex and this might break down in more complicated cases. There's also plus escaping, the pass macro, character replacement ({asterisk} for *, doesn't seem to be one for _) and possibly more options ...

@jgm
Owner

jgm commented Jun 21, 2019

Unfortunately escaping in asciidoc is not well designed.

@grv87

grv87 commented Jul 2, 2019

+1 for escaping, at least in URLs

@jgm
Owner

jgm commented Sep 2, 2019

> Passthrough quotes fix this as well:
>
> `++[0-9]*++`
> `++[0-9]*++`
>
> will produce the intended output.

OK...but what if you want to have ++[0-9]*++, quoting the plus signs too? I tried

`\+\+[0-9]*\+\+`
`\+\+[0-9]*\+\+`

which yields

<code>+\+<strong class="0-9">\+\+</code>
<code>+\+[0-9]</strong>\+\+</code>

in which the first backslash acts as an escape but the others don't. Argh! Asciidoc needs some clear, consistent escaping rules.

@mako4
Copy link

mako4 commented Nov 4, 2019

I think this works

`pass:[++[0-9\]*++]`

You can use backslashes here to escape the ] in this context.

@jasom

jasom commented Apr 29, 2021

I have a local patch similar to what @mako4 suggests, but with one modification: it specifies that special character substitutions still apply. Otherwise asciidoctor will pass special HTML characters through into the final document:

pass:specialcharacters[++[0-9\]*++]

asciidoctor allows "c" as an abbreviation for "specialcharacters". I haven't implemented that yet, but it makes things very slightly less ugly. If this is applied in escapeString only to text that contains special characters, the output isn't too ugly for prose.

The alternative is to define attributes for each special character:

:plus: +
:rbracket: ]
:lbracket: [
:star: *

{plus}{plus}{lbracket}0-9{rbracket}{star}{plus}{plus}

When there are relatively few special characters the latter looks better; when there are many, the former looks better.
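
To make the attribute-based alternative concrete, here is a small Python sketch. The function name `attribute_escape` is hypothetical, and the attribute names are simply the ones defined in the example above. It returns the definition lines that would have to go in the document header, plus the rewritten text.

```python
# Attribute names copied from the definitions in the comment above.
ATTRIBUTE_NAMES = {"+": "plus", "]": "rbracket", "[": "lbracket", "*": "star"}


def attribute_escape(text):
    """Replace special characters with attribute references.

    Returns (header_lines, body). Only attributes that are actually
    used get a definition line. Hypothetical helper for illustration.
    """
    definitions = {}  # attribute name -> character, in first-use order
    pieces = []
    for ch in text:
        name = ATTRIBUTE_NAMES.get(ch)
        if name is None:
            pieces.append(ch)
        else:
            definitions[name] = ch
            pieces.append("{" + name + "}")
    header = [f":{name}: {ch}" for name, ch in definitions.items()]
    return header, "".join(pieces)


header, body = attribute_escape("++[0-9]*++")
print("\n".join(header))
print()
print(body)
# last line printed: {plus}{plus}{lbracket}0-9{rbracket}{star}{plus}{plus}
```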

@jgm
Owner

jgm commented May 17, 2022

So to summarize: for asciidoctor, at least, we can do

`pass:c[CODE]`

where CODE is the raw code with all ] characters backslash-escaped.
(Question: what about backslashes in the code, should they all be backslash-escaped too?)

@jasom

jasom commented May 17, 2022

@jgm no, you can't escape backslashes, which means the CODE part of pass:c[CODE] may not end with a backslash. Sigh.

@jgm
Owner

jgm commented May 17, 2022

Hm, it also means that if code contains \] already, the backslash will disappear.
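
Putting the last few comments together, a writer could build the `pass:c[...]` form roughly like this. This is only a Python sketch: the function name `pass_macro_code` is made up, and the two rejected cases simply mirror the caveats reported in this thread (code ending in a backslash, and code that already contains `\]`). They have not been checked against asciidoctor's parser.

```python
def pass_macro_code(code: str):
    """Render inline code as `pass:c[...]`, or return None if the
    caveats described in this thread apply.

    Hypothetical helper for illustration only.
    """
    # Caveat 1: a trailing backslash would escape the closing bracket.
    # Caveat 2: an existing "\]" would lose its backslash.
    if code.endswith("\\") or "\\]" in code:
        return None
    escaped = code.replace("]", "\\]")
    return "`pass:c[" + escaped + "]`"


print(pass_macro_code("++[0-9]*++"))
# prints: `pass:c[++[0-9\]*++]`
print(pass_macro_code("ends with backslash\\"))
# prints: None
```

Returning None leaves it to the caller to pick some other strategy for the inputs this form cannot represent.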

@jgm
Owner

jgm commented May 17, 2022

Just did an experiment: it looks like you can use numeric entities to escape special characters inside ...
Example

`&#x5b;&#x30;&#x2d;&#x39;&#x5d;&#x2a;`

output from asciidoc (original):

<p><code>&amp;#x5b;&amp;#x30;&amp;#x2d;&amp;#x39;&amp;#x5d;&amp;#x2a;</code></p>

output from asciidoctor:

<p><code>&#x5b;&#x30;&#x2d;&#x39;&#x5d;&#x2a;</code></p>

So that's an interesting behavior change!
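
For reference, the entity-encoded string in that experiment can be generated mechanically. Here is a minimal Python sketch; the function name `to_numeric_entities` is made up for this example. It encodes every character, as in the experiment, although a real writer would presumably encode only the special ones.

```python
def to_numeric_entities(code: str) -> str:
    """Encode every character as a hexadecimal numeric character reference."""
    return "".join(f"&#x{ord(ch):x};" for ch in code)


print("`" + to_numeric_entities("[0-9]*") + "`")
# prints: `&#x5b;&#x30;&#x2d;&#x39;&#x5d;&#x2a;`
```

Per the outputs above, this only helps with asciidoctor, since the original asciidoc escapes the ampersands.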

@jasom

jasom commented May 17, 2022

Also, I somehow missed this (maybe it's a new addition?), but this works in asciidoctor only:

{blank}{empty}{sp}{nbsp}{zwsp}{wj}{apos}{quot}{lsquo}{rsquo}{ldquo}{rdquo}{deg}{plus}{brvbar}{vbar}{amp}{lt}{gt}{startsb}{endsb}{caret}{asterisk}{tilde}{backslash}{backtick}{two-colons}{two-semicolons}{cpp}{pp}

Output from asciidoctor:

<p> &#160;&#8203;&#8288;&#39;&#34;&#8216;&#8217;&#8220;&#8221;&#176;&#43;&#166;|&<>[]^*~\`::;;C&#43;&#43;&#43;&#43;</p>
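
A sketch of how these built-in references might be used for escaping, in Python. The function name `builtin_escape` is hypothetical. The attribute names come from the list above, and the character each one stands for is read off the asciidoctor output shown there. Whether the references behave the same way in every context (for example inside inline code) is not something this thread establishes.

```python
# Subset of the built-in attribute references listed above, keyed by
# the character each one produced in the asciidoctor output.
BUILTIN_ATTRIBUTES = {
    "+": "plus",
    "|": "vbar",
    "&": "amp",
    "<": "lt",
    ">": "gt",
    "[": "startsb",
    "]": "endsb",
    "^": "caret",
    "*": "asterisk",
    "~": "tilde",
    "\\": "backslash",
    "`": "backtick",
}


def builtin_escape(text: str) -> str:
    """Replace special characters with asciidoctor attribute references."""
    return "".join(
        "{" + BUILTIN_ATTRIBUTES[ch] + "}" if ch in BUILTIN_ATTRIBUTES else ch
        for ch in text
    )


print(builtin_escape("[0-9]*"))
# prints: {startsb}0-9{endsb}{asterisk}
```

Unlike custom attributes, these need no definitions in the document header.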

@kbroch-rivosinc
Contributor

Still an issue in latest pandoc:

(venv)kbroch@penguin:~ $ pandoc --version
pandoc 3.1.4
Features: +server +lua
Scripting engine: Lua 5.4
User data directory: /home/kbroch/.local/share/pandoc
Copyright (C) 2006-2023 John MacFarlane. Web: https://pandoc.org
This is free software; see the source for copying conditions. There is no
warranty, not even for merchantability or fitness for a particular purpose.
(venv)kbroch@penguin:~ $ pandoc -o debug.adoc debug-escaping-asterisk.docx 
(venv)kbroch@penguin:~ $ cat debug.adoc 
*There should be asterisks on either side of this*
(venv)kbroch@penguin:~ $ 

But should see: \*There should be asterisks on either side of this\*

debug-escaping-asterisk.docx

@jgm
Owner

jgm commented Jul 5, 2023

@kbroch-rivosinc Please see the comments above. The suggested output you give isn't correct. Given this input, asciidoctor yields the following HTML:

<p>*There should be asterisks on either side of this\*</p>

If it were just a matter of backslash-escaping all * signs, we could easily do that. But that won't work. The interpretation of backslashes is highly non-regular, and that's why this issue is still open...

@kbroch-rivosinc
Contributor

@jgm : here's what I see from asciidoctor (sorry I should have put this in original comment):

(venv)kbroch@penguin:~ $ asciidoctor --version
Asciidoctor 2.0.20 [https://asciidoctor.org]
Runtime Environment (ruby 3.2.2 (2023-03-30 revision e51014f9c0) [x86_64-linux]) (lc:UTF-8 fs:UTF-8 in:UTF-8 ex:UTF-8)
(venv)kbroch@penguin:~ $ asciidoctor debug.adoc 
(venv)kbroch@penguin:~ $ grep "should be asterisks" debug.html 
<p><strong>There should be asterisks on either side of this</strong></p>

@jgm
Owner

jgm commented Jul 6, 2023

Well yes, asciidoc(tor) will turn

*hello*

into strong emphasis. But we're concerned here with how to represent literal asterisk characters. And asciidoctor turns

\*hello\*

into

<p>*hello\*</p>

So we can't simply backslash-escape all the literal asterisks as you suggested.

@kbroch-rivosinc
Contributor

Thanks for explanation. I see above: #2337 (comment) where all this was explained. Sorry I didn't catch it the first time. I appreciate you taking the time to help.
