an escaped line break spoils the parsing of a following CodeBlock #3730

Closed
rose00 opened this issue Jun 10, 2017 · 8 comments

@rose00

rose00 commented Jun 10, 2017

A trailing backslash interferes with the parsing of a fenced code block on the next line.

Workaround: Pre-filter the input to insert a blank line after each escaped newline (a backslash immediately followed by a newline).

$ (echo 'nice line @'; echo '``` {style=".impeccable"}'; echo "  preformatted       stuff";
    echo '```') | pandoc -t native -f markdown+escaped_line_breaks
[Para [Str "nice",Space,Str "line",Space,Str "@"]
,CodeBlock ("",[],[("style",".impeccable")]) "  preformatted       stuff"]

Note: In the next two commands, the backslash appears as a single backslash in the markdown input that comes out of "echo".

$ (echo 'bad line break \'; echo '``` {style=".impeccable"}'; echo "  preformatted       stuff";
    echo '```') | pandoc -t native -f markdown+escaped_line_breaks
[Para [Str "bad",Space,Str "line",Space,Str "break",LineBreak,Code ("",[],[]) "{style=\".impeccable\"}   preformatted       stuff"]]

Because the preformatted text is parsed as inline Code rather than a CodeBlock, it loses its formatting when it passes through HTML. That is how the bug was noticed.

$ (echo 'bad line break \'; echo '``` {style=".impeccable"}'; echo "  preformatted       stuff";
    echo '```') | pandoc -t native -f markdown-escaped_line_breaks
pandoc: 
Error at "source" (line 2, column 1):unknown parse error
``` {style=".impeccable"}
^
CallStack (from HasCallStack):
  error, called at src/Text/Pandoc/Error.hs:66:13 in pandoc-1.19.2.1-J1nmFBg9ln971v0RrPbKLJ:Text.Pandoc.Error

This error suggests that the parsing after a backslash is a little too unforgiving.

Suggested fix: When parsing an escaped newline, do not consume the newline character itself.
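
To make the suggestion concrete, here is a toy Python model. It is not pandoc's Haskell parser, and the names `escaped_newline` and `fence_follows` are hypothetical. It assumes a block rule that only recognizes a fence directly after a newline still present in the input, and shows why leaving the newline unconsumed keeps the fence visible:

```python
FENCE = "```"

def escaped_newline(text, pos, consume_newline):
    """Toy inline rule: match backslash-newline at pos.

    Returns the index where parsing resumes, or None if there is no match.
    """
    if not text.startswith("\\\n", pos):
        return None
    # Consuming both characters hides the line boundary from later rules;
    # consuming only the backslash leaves the newline in the input.
    return pos + 2 if consume_newline else pos + 1

def fence_follows(text, pos):
    """Toy block rule: a fence opens only right after an unconsumed newline."""
    return text.startswith("\n" + FENCE, pos)

source = "bad line break \\\n``` {style=\".impeccable\"}\n  preformatted       stuff\n```\n"
backslash_at = source.index("\\")

old_pos = escaped_newline(source, backslash_at, consume_newline=True)
new_pos = escaped_newline(source, backslash_at, consume_newline=False)

print(fence_follows(source, old_pos))  # False: the newline was swallowed
print(fence_follows(source, new_pos))  # True: the newline is still there
```

In this model the only difference between the two behaviors is one character of input position, which is all the suggested fix changes.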

@jgm
Owner

jgm commented Jun 10, 2017

It's not obvious that this is a bug. (Of course, the lack of a spec makes it hard to resolve this definitively; pandoc does not yet follow the CommonMark spec.)

Generally when pandoc sees an escaped newline, it assumes that this is a line break inside a block, so it assumes that the next line is not meant to start a new block.

Another example:

hi\
---

which doesn't turn into a setext header.

On the other hand, CommonMark has a philosophy of discerning block structure independently of inline parsing, so CommonMark would do things the way you suggest.

Still, I'm inclined to agree that this should be changed in the way you suggest.

The following patch implements your suggestion "do not consume the newline character itself":

diff --git a/src/Text/Pandoc/Readers/Markdown.hs b/src/Text/Pandoc/Readers/Markdown.hs
index 5694c43..8220bac 100644
--- a/src/Text/Pandoc/Readers/Markdown.hs
+++ b/src/Text/Pandoc/Readers/Markdown.hs
@@ -1471,12 +1471,18 @@ escapedChar' = try $ do
 
 escapedChar :: PandocMonad m => MarkdownParser m (F Inlines)
 escapedChar = do
-  result <- escapedChar'
+  result <- lookAhead escapedChar'
   case result of
-       ' '   -> return $ return $ B.str "\160" -- "\ " is a nonbreaking space
-       '\n'  -> guardEnabled Ext_escaped_line_breaks >>
-                return (return B.linebreak)  -- "\[newline]" is a linebreak
-       _     -> return $ return $ B.str [result]
+       ' '   -> do
+         void $ count 2 anyChar
+         return $ return $ B.str "\160" -- "\ " is a nonbreaking space
+       '\n'  -> do
+         guardEnabled Ext_escaped_line_breaks
+         void $ anyChar -- eat the backslash, leaving the newline (see #3730)
+         return (return B.linebreak)  -- "\[newline]" is a linebreak
+       _     -> do
+         void $ count 2 anyChar
+         return $ return $ B.str [result]

The test suite fails because of a test involving this Markdown:

# Title\
foo

which current pandoc converts to

<h1 id="title-foo">Title<br />
foo</h1>

and the changed code converts to

<h1 id="title">Title<br />
</h1>
<p>foo</p>

So, one effect of making this change is that one can no longer use this trick to get newlines in headers. I have a feeling that some people may be relying on the current behavior, so this change would require some further discussion on pandoc-discuss (though I'm still inclined to make it, to bring pandoc's parsing closer to CommonMark).

@jgm
Owner

jgm commented Jun 10, 2017

Here's a different patch that might make more sense. It improves over current pandoc behavior in disallowing escaped newlines in some contexts where newlines aren't allowed. But it has the same effect as the above patch of disallowing hard breaks in headers.

diff --git a/src/Text/Pandoc/Readers/Markdown.hs b/src/Text/Pandoc/Readers/Markdown.hs
index 5694c43..807b178 100644
--- a/src/Text/Pandoc/Readers/Markdown.hs
+++ b/src/Text/Pandoc/Readers/Markdown.hs
@@ -1450,6 +1450,7 @@ inline = choice [ whitespace
                 , autoLink
                 , spanHtml
                 , rawHtmlInline
+                , escapedNewline
                 , escapedChar
                 , rawLaTeXInline'
                 , exampleRef
@@ -1466,16 +1467,20 @@ escapedChar' = try $ do
   (guardEnabled Ext_all_symbols_escapable >> satisfy (not . isAlphaNum))
      <|> (guardEnabled Ext_angle_brackets_escapable >>
             oneOf "\\`*_{}[]()>#+-.!~\"<>")
-     <|> (guardEnabled Ext_escaped_line_breaks >> char '\n')
      <|> oneOf "\\`*_{}[]()>#+-.!~\""
 
+escapedNewline :: PandocMonad m => MarkdownParser m (F Inlines)
+escapedNewline = try $ do
+  guardEnabled Ext_escaped_line_breaks
+  char '\\'
+  lookAhead (char '\n') -- don't consume the newline (see #3730)
+  return $ return B.linebreak
+
 escapedChar :: PandocMonad m => MarkdownParser m (F Inlines)
 escapedChar = do
   result <- escapedChar'
   case result of
        ' '   -> return $ return $ B.str "\160" -- "\ " is a nonbreaking space
-       '\n'  -> guardEnabled Ext_escaped_line_breaks >>
-                return (return B.linebreak)  -- "\[newline]" is a linebreak
        _     -> return $ return $ B.str [result]
 
 ltSign :: PandocMonad m => MarkdownParser m (F Inlines)

@jgm
Owner

jgm commented Jun 10, 2017

Further note: pandoc is a bit inconsistent, because if you leave two spaces at the end of the line, you don't get a hard break in a header:

# Title followed by two spaces  
content

Pandoc also disallows hard breaks in setext headers:

Title\
second line
----

doesn't produce a header.

This argues for removing the inconsistency by disallowing backslash-newline hard breaks in atx headers. The only reason not to do this is that it may break behavior that people are relying on (and this isn't something we should do lightly).

@jgm jgm added this to the pandoc 2.0 milestone Jun 10, 2017
@rose00
Author

rose00 commented Jun 11, 2017

The main problem with this bug, for me, is that a new block construct (fenced code) is not recognized after an escaped newline. Perhaps the best behavior is to let a new block start after the escaped newline when a block-introducing construct follows it, and otherwise not to terminate a header line (is that the same kind of block?).

And here's more: The escaped newline, immediately followed by a fenced code block, was originally produced by pandoc working from an HTML source. It was surprisingly tricky to produce a minimized test case, but here is one (the <div> is required or the bug will not manifest):

$ echo '<div>nice line<br><pre class="impeccable">nice code</pre></div>' | pandoc -f html -t markdown
<div>

nice line\
``` {.impeccable}
nice code
```

</div>

This bug could be made moot for me if the markdown unparser put an extra newline between backslash and backtick, in this case. In fact, I work around the bug with a sed script to post-process the possibly-broken markdown:

$ sed '/[\\]$/{N;s/\([\\]\)\(\n\)\(```\)/\1\2\2\3/;}'
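
For readers who would rather not rely on sed, the same post-processing can be sketched in Python. This is a rough equivalent rather than an exact port: the function name `separate_fences` is hypothetical, and it only handles backtick fences, as the sed script does.

```python
def separate_fences(markdown: str) -> str:
    """Insert a blank line between a trailing backslash and a fence on the next line."""
    out = []
    for line in markdown.split("\n"):
        # Previous line ends in a backslash and this line opens a backtick fence:
        # add an empty line so the fence can start a new block.
        if out and out[-1].endswith("\\") and line.startswith("```"):
            out.append("")
        out.append(line)
    return "\n".join(out)

broken = "nice line\\\n``` {.impeccable}\nnice code\n```\n"
print(separate_fences(broken))
```

Lines that end in a backslash but are not followed by a fence are left untouched, so ordinary hard line breaks inside paragraphs survive.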

@jgm jgm closed this as completed in b466152 Jun 11, 2017
@adunning
Contributor

This behaviour was fairly widely documented for use with headings; I see the problem with it, but is there any other way of achieving the same thing?

@agusmba
Copy link
Contributor

agusmba commented Oct 31, 2017

@adunning did you try using <br/> and writing the header on one line, as proposed in the discussion topic you linked?

root@18cbcfcaf28c:/source# pandoc -t html -f markdown
# Line<br/>break

^D

<h1 id="line-break">Line<br/>break</h1>

If you need LaTeX output, you could use the raw_attribute extension:

root@18cbcfcaf28c:/source# pandoc -t latex -f markdown
# Line`\\`{=latex} Break

^D

\hypertarget{line-break}{%
\section{\texorpdfstring{Line\\ Break}{Line Break}}\label{line-break}}

This isn't ideal, since you'd need a specific line break for each desired output format, but it may serve as a workaround.

@jgm
Owner

jgm commented Oct 31, 2017 via email

@adunning
Contributor

Many thanks, both, for the ideas!
