Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Exponential runtime for [link](url) #9710

Closed
xrat opened this issue May 1, 2024 · 8 comments
Closed

Exponential runtime for [link](url) #9710

xrat opened this issue May 1, 2024 · 8 comments

Comments

@xrat
Copy link

xrat commented May 1, 2024

(Apparently, I cannot add labels. Would have added: reader:markdown + performance.)

I ran into a case of exponential runtime for the Markdown reader that baffles me. Sadly, I was still not able to reproduce the bug with generic data. I cannot make my pathological file publicly available but I can share it privately upon request. The type of input is

[dir1/foo_bar_1100.md](https://www.example.org/dir1/foo/bar1)
[dir2/foo_bar_2100.md](https://www.example.org/dir2/foo/bar2)
...

The exponential runtime is obvious:

$ N=0; while ((N++<10)); do n=$((10*N)); head -n $n pathological.md > tmp.md;
echo -n "$n lines: "; start=${EPOCHREALTIME//[.,]};
pandoc --to native tmp.md >/dev/null;
echo $(((${EPOCHREALTIME//[.,]}-start)/1000)) ms ; done
10 lines: 24 ms
20 lines: 41 ms
30 lines: 45 ms
40 lines: 34 ms
50 lines: 34 ms
60 lines: 50 ms
70 lines: 286 ms
80 lines: 1838 ms
90 lines: 3549 ms
100 lines: 5158 ms

At this stage (following above code) tmp.md has 100 lines of pathological.md. I tried to run --trace and I notice that the output seems broken:

$ time pandoc --trace --to native tmp.md | wc -l
[trace] Parsed [Para [Link ("",[],[]) [Str "2factor_authentication.md"] ("h at line 103
703

real    0m5.182s
user    0m5.122s
sys     0m0.061s

The Commonmark reader does not have this problem:

$ time pandoc --from commonmark_x --to native tmp.md | wc -l
703

real    0m0.088s
user    0m0.082s
sys     0m0.009s

Baffling, too, is the effect of s/^/* /:

$ sed -i 's/^/* /' tmp.md; wc -l tmp.md
100 tmp.md
$ time pandoc --to native tmp.md | wc -l
1005

real    0m0.088s
user    0m0.063s
sys     0m0.024s

Pandoc 3.1.13 on amd64 Debian GNU/Linux 11.9

@xrat xrat added the bug label May 1, 2024
@jgm
Copy link
Owner

jgm commented May 1, 2024

I can't do much without the file. You can email it to me; my email address is in pandoc.cabal in the top level of this repository.

@xrat
Copy link
Author

xrat commented May 1, 2024

Thanks, sent you the file.

In the meantime, I got a little closer to what might be the trigger, but I still fail to reproduce it with generic data.

  • Wrapping the link text in `` also makes Pandoc parse the data just fine.
  • pandoc --from markdown-citations also runs as fast as usually.
  • All other extensions make no difference, except intraword_underscores which seemingly makes Pandoc run forever. I ^C'ed it after 15s. But that might be due to another issue.

@jgm
Copy link
Owner

jgm commented May 2, 2024

I think the problem is the URLs with double __ in them. If you take those out, it goes quite quickly. Now that I've seen that I should be able to narrow in on a fix.

Note: adding <..> around your URLs may also work.

@jgm
Copy link
Owner

jgm commented May 2, 2024

Here's a minimal case that takes 9s to parse on my machine:

[x_x_x__x_x_x.x](x://x.x.x.x/~x/x/x/x/x_x_x__x_x_x.x)  
[x_x_x__x_x_x.x](x://x.x.x.x/~x/x/x/x/x_x_x__x_x_x.x)  
[x_x_x__x_x_x.x](x://x.x.x.x/~x/x/x/x/x_x_x__x_x_x.x)  
[x_x_x__x_x_x.x3x.x](x://x.x.x.x/~x/x/x/x/x_x_x__x_x_x.x3x.x)  
[x_x_x_x__x_x.x](x://x.x.x.x/~x/x/x/x/x_x_x_x__x_x.x)  
[x_x_x__x_x_x.x](x://x.x.x.x/~x/x/x/x/x_x_x__x_x_x.x)  
[x_x_x__x_x_x.x](x://x.x.x.x/~x/x/x/x/x_x_x__x_x_x.x)  
[x_x_x__x_x_x.x](x://x.x.x.x/~x/x/x/x/x_x_x__x_x_x.x)  
[x_x_x__x_x_x.x3x.x](x://x.x.x.x/~x/x/x/x/x_x_x__x_x_x.x3x.x)  
[x_x_x_x__x_x.x](x://x.x.x.x/~x/x/x/x/x_x_x_x__x_x.x)  

@xrat
Copy link
Author

xrat commented May 2, 2024

For reference, your minimal case takes 12s on my machine.

Thanks to your analysis I now found a minimal case that is way simpler: [foo__bar](url)

$ N=0; while ((N++<9)); do n=$((2*N)); for ((i=1;i<=n;i++)); do echo "[foo__bar](url)" ; done > tmp.md; start=${EPOCHREALTIME//[.,]}; echo -n "$n lines: "; pandoc tmp.md >/dev/null; echo $(((${EPOCHREALTIME//[.,]}-start)/1000)) ms ; done
2 lines: 24 ms
4 lines: 11 ms
6 lines: 34 ms
8 lines: 64 ms
10 lines: 154 ms
12 lines: 484 ms
14 lines: 1868 ms
16 lines: 7114 ms
18 lines: 28634 ms

So, 18 lines of [foo__bar](url) take ~28s on my machine.

@jgm
Copy link
Owner

jgm commented May 2, 2024

Substituting * for _ gives the same result. This must have something to do with emphasis parsing, but I'm still puzzled.

@xrat
Copy link
Author

xrat commented May 2, 2024

More data points in the hope they are of use for the debugging:

  • [foo_bar_foobar](url): linear time
  • [foo__bar__foobar](url): exponential time
  • [foo___bar___foobar](url): linear time
  • * [foo__bar__foobar](url): linear time
  • [__foo__](url): linear time
  • [foo__bar__](url): exponential time
  • [__foo__bar](url): exponential time

@jgm
Copy link
Owner

jgm commented May 2, 2024

Commenting out cite in
https://github.com/jgm/pandoc/blob/main/src/Text/Pandoc/Readers/Markdown.hs#L1527
fixes the issue. That's where the problem lies.

The link parser grabs a string between [..] and parses it, but the cite parser just ploughs ahead parsing inlines after [, not stopping when it sees ] but allowing that as a constituent of an emphasized inline. Culprit is the parser normalCite.

Not sure about the exact fix yet but at least I now see the problem.

@jgm jgm closed this as completed in 2da454d May 2, 2024
@jgm jgm added the performance label May 2, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants