Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Unexpected (?) behavior change in parsing newlines #1041

Closed
mattiaverga opened this issue Oct 14, 2020 · 3 comments
Closed

Unexpected (?) behavior change in parsing newlines #1041

mattiaverga opened this issue Oct 14, 2020 · 3 comments

Comments

@mattiaverga
Copy link

The following code:

import markdown
text = ('# this is a header\n''this is some **text**\n''<script>alert("pants")</script>')
extensions = ['markdown.extensions.fenced_code', ]
markdown.markdown(text, extensions=extensions)

produces different output between markdown 3.2.2 and 3.3.1
This is 3.2.2:

<h1>this is a header</h1>\n<p>this is some <strong>text</strong>\n<script>alert("pants")</script></p>

And this is 3.3.1:

<h1>this is a header</h1>\n<p>this is some <strong>text</strong></p>\n<script>alert("pants")</script>

Note how the paragraph is closed before the <script> tag in 3.3.1. In 3.2.2 that required a double \n.
Is this change expected?

@facelessuser
Copy link
Collaborator

Besides being different. Does this present issues?

The HTML parsing was rewritten in 3.0. The old code was very difficult to maintain. With that said the new parser may have a couple of corner cases that got missed.

This is obviously a change, but I also think this behavior is more sane with block elements.

If I recall, I think this particular case is expected. But @waylan can probably speak better to it than I. If this was happening with an inline element, it would definitely be a problem.

@waylan
Copy link
Member

waylan commented Oct 14, 2020

See Babelmark. It appears that the various implementations are split down the middle with half including the script within the <p> tag and half not. Of note is the reference implementation (markdown.pl), which does not. In fact, I made a conscious choice to follow the reference implementation in this case.

As @facelessuser noted, the HTML parser in 3.3 is a complete rewrite, We started over completely from scratch. It would have been more work to get the old behavior in this case. However, as the reference implementation behaves this way, I left it as-is.

I should note the following section of the rules

The only restrictions are that block-level HTML elements — e.g. <div>, <table>, <pre>, <p>, etc. — must be separated from surrounding content by blank lines, and the start and end tags of the block should not be indented with tabs or spaces. Markdown is smart enough not to add extra (unwanted) <p> tags around HTML block-level tags.

To be clear, <script> is a block-level element and that rule suggests that a blank line is required. However, as previously noted, the reference implementation ignores the blank line requirement and only enforces the indentation restriction. In fact, over the years, we have received bug reports from people who expect the same behavior and they objected when they were quoted that rule and instructed to include a blank line.

Perhaps if the tag was a <div> rather than <script> tag, then the current behavior would make more sense. After all, it is invalid to have a <div> nested within a <p>. We are not going to special case <script> and have it behave differently just because it is not invalid to include it within a <p>. So let's look more closely at the <div> case. If this input:

A paragraph.
<div>foo</div>

was rendered as

<p>A paragraph.
<div>foo</div></p>

that would be invalid HTML and a browser would interpret it as

<p>A paragraph.
</p>
<div>foo</p>
<p></p>

In other words, it would close the <p> at the start of the <div> and then create an empty <p> at the closing </p> tag. That empty <p> at the end would likely cause some extra undesired whitespace to appear in the rendered document. That is clearly not what the document author intended. Therefore, the correct way to render that is as

<p>A paragraph.</p>
<div>foo</div>

We get the same behavior for all block-level tags, therefore, the current behavior for <script> tags.

Finally, I will note that almost all implementations agree with the rendering in the <div> case as per Babelmark. It seems that some do special case <script>, or at least those implementations don't consider <script> to be be a block-level element. Python-Markdown has always followed the reference implementation by including the tag in the list of block-level elements. We have no intention of changing that now.

@mattiaverga
Copy link
Author

Thanks @waylan for the detailed explanation! I agree that following the reference is always better. I was searching a confirmation that the change was intentional, so I'll change the failing tests in accordance with the new behavior.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants