unclosed tag in code span #1066
Comments
There seem to be two problems here.
I'm not sure what the issue is here as I don't have time to dig further right now. It looks like we may be attempting to handle I would definitely argue this is a bug. |
@waylan, I believe that preprocessing instructions need some tweaks, but looking at this inline code case, I am slightly curious about your thoughts on this case:
Part of the idea in the rewrite was to fix generally buggy handling, but it also added new functionality where a tag is treated as HTML if it is at the start of a line, even with no empty line before it. The old processor didn't do this, and I think I'm starting to understand why it didn't. Looking at different parsers, handling seems to be all over the place: In the end, there may not be a "right" way, I'm just curious whether this case had been considered. |
I don't necessarily think the above question demonstrates an additional problem, just curious if this is the intention or not. |
Actually, that is unrelated to the issue. It would have actually been more work to replicate the old behavior. The fact that no blank line is required is a red herring. The examples provided here behave the same whether there is a blank line or not. The issue is that the HTML parser doesn't care about (insignificant in HTML) whitespace. In Markdown, a raw HTML block can only start at the beginning of a line, whereas in HTML the block can begin anywhere. The trick is getting the parser to ignore everything that Markdown does not see as a valid opening block tag. That means we need to anticipate every scenario in which the HTML parser should ignore an opening block tag and force the parser to ignore it. Apparently this is an edge case we missed, although I was sure I had a test for this specific case. I can't find it now though, so perhaps not.
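To make the line-start rule described above concrete, here is a minimal sketch. It is not Python-Markdown's actual code: `is_markdown_block_start` is a hypothetical helper, and allowing up to three characters of leading whitespace is an assumption based on Markdown's usual indentation rule.

```python
def is_markdown_block_start(text: str, i: int) -> bool:
    """Return True if index ``i`` is preceded on its line only by
    whitespace, and by at most three characters of it."""
    # Find where the current line begins (0 if there is no earlier newline).
    line_start = text.rfind("\n", 0, i) + 1
    prefix = text[line_start:i]
    return len(prefix) <= 3 and prefix.strip() == ""


src = "text `<div`.\n\n<div>\nhello\n</div>\n"
first = src.index("<div")               # inside the code span, mid-line
second = src.index("<div", first + 1)   # the real block, at a line start

print(is_markdown_block_start(src, first))
print(is_markdown_block_start(src, second))
```

An HTML parser has no such notion, so both occurrences of `<div` look like candidate tags to it. Only the second one qualifies as a raw block under the rule sketched here.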
I would have expected the only reasonable rendering to be:
<p>some text <code>some html code we want to wrap <body>
<div>text</div></body></code>
</p>
However, I see that markdown.pl 1.0.2b8 outputs the same as we do:
<p>some text `some html code we want to wrap
<body>
</p>
<div>text</div>
<p>
</body>`</p>
So was this a situation where I intentionally changed the behavior or was this accidental? I don't recall. I also see that Commonmark almost matches this behavior (they leave off the wrapping In the end, I think that your example is a completely different issue than the original report. |
Yes, I agree it was only related by the fact that I started thinking about this when I saw this code related issue. Mainly that HTML blocks are handled before inline code, but yes, a separate issue (if even an issue at all). I suspect that this current issue (unrelated to what I brought up) is a problem with differences in how preprocessors are handled vs block tags, but I haven't looked that deep into this, only confirmed that this appears to be a bug. |
The problem here is that
from test.test_htmlparser import EventCollector

src = """
A text

Another text `<?php`.

<div>
hello
</div>
"""
# Feed the whole document to the parser and collect the events it emits.
ec = EventCollector()
ec.feed(src)
ec.close()
print(ec.get_events())
The output is:
[
('data', '\nA text\n\nAnother text `'),
('pi', 'php`.\n\n<div'),
('data', '\nhello\n'),
('endtag', 'div'),
('data', '\n')
]
Notice the second token, which is a processing instruction ( We have a similar issue with this input:
We get:
[
('data', '\ntext `'),
('starttag', 'div`.', [('<div', None)]),
('data', '\nhello\n'),
('endtag', 'div'),
('data', '\n')
]
The starttag is all of This is completely an issue with how the parser works and not the fault of our work. Unless we can find a way to change the way the underlying parser works, we would need to completely abandon using the HTML parser. If we had caught this before releasing 3.3, this would have been a blocker to the entire thing. |
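The processing-instruction case can also be reproduced without the CPython test helper. The following is a minimal sketch using only the public `html.parser.HTMLParser` API. `EventLogger` is an illustrative name, not part of any library, and the exact events may differ between Python versions.

```python
from html.parser import HTMLParser


class EventLogger(HTMLParser):
    """Record every parser callback as a tuple, in order."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("starttag", tag, attrs))

    def handle_endtag(self, tag):
        self.events.append(("endtag", tag))

    def handle_data(self, data):
        self.events.append(("data", data))

    def handle_pi(self, data):
        self.events.append(("pi", data))


src = "A text\n\nAnother text `<?php`.\n\n<div>\nhello\n</div>\n"

parser = EventLogger()
parser.feed(src)
parser.close()
events = parser.events

for event in events:
    print(event)
```

The point to look for is that the `pi` event runs on until the first `>` it can find, which belongs to the following `<div>`. As a result the opening `div` tag is never reported as a start tag.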
And this is why |
So there is no way to abandon an incomplete tag? Are you implying this isn't solvable? |
This would not be solvable with the public API of HtmlParser. It would require overriding the methods in the parent class, which I have been trying to avoid. I haven't actually dug into the code to see how extensive of a change it would be. A few similar issues were resolved by monkey-patching a regex. But this is dependent upon whether we are in a raw block or not, which means it depends on the status of |
Is there any reason to support/allow a backtick to be in a tag at all (tag name or attributes)? Perhaps if we simply disallowed backticks within tags (between |
You can't really be sure what will be in an attribute. |
Additionally, a preprocessing statement could have anything in it. |
Right. Both good points. Scratch that idea. |
I'm adding a second use case, where parser fails, but silently (no error). If code inside a Fails <div markdown="1">
A `<p>`
</div> With an HTML inline element, no problem. Also fails with Fails in 3.3.3 and works in 3.2.1 Is it related, or should I open a new issue? |
I would expect these cases to be treated just like any span element. It looks like it is still trying to treat these as block elements, but if they don't have a newline, I think it shouldn't get block logic. While not entirely the same as the opening post, I do feel there is a bit of a relation. I don't think preprocessing statements should be treated as such if they are not at the start of the line either and should be evaluated similarly to spans. My concern is that it is easy to parse the HTML tags when they are on their own line, but when they are in the middle of a paragraph, we have no context to know whether they are a tag or raw content until we are in the inline parser. This is due to the way Python Markdown is designed. Other parsers are designed a bit differently and may have an easier time with this scenario than we do. Can we treat block elements differently if they are not at the start of a line? |
The second case reported by @iamvdo is a separate issue which was reported in #1068. Please, let's not mix issues here. To be clear, this issue specifically deals with code spans in which a tag does not contain a closing bracket, for example |
Note that the fix to #1068 (in #1069) addresses the issue with
now outputs: <p>text <code><div</code>.</p>
<p><div>
hello
</div></p> Note that the actual div is wrapped in a <p>A text</p>
<p>Another text <code><?php</code>.</p>
<p><div>
hello
</div></p>
While that is clearly wrong, it is a lot less alarming than the previous behavior. It is pretty clear to me that the reason is as outlined in this comment. That said, the issue in #1068 was certainly adding additional complexity to the problem. So now that that is out of the way, we can focus on a fix for this issue. To break down the issue, let's use the As an aside, the behavior is exactly the same with or without the |
In addition to In my initial attempt to address the issue, I found a solution for the
+    def parse_pi(self, i):
+        if self.at_line_start() or self.intail:
+            return super().parse_pi(i)
+        # This is not the beginning of a raw block so treat as plain data
+        # and avoid consuming any tags which may follow (see #1066).
+        self.handle_data('<?')
+        return i + 2
And it works well in our simple test cases. However, this means that we can never have a processing instruction which doesn't start a line, even within a raw block. For example, consider this:
Yes, that works with the default behavior. The same concerns exist for any similar fix for the other cases ( As an aside, I discovered another bug, which is detailed in #1070. Until that is resolved, the concerns raised in this comment are moot for PIs, although they are still valid for the other cases. It occurs to me that it would be surprising to me for someone to expect to have valid PHP embedded within the HTML generated by Markdown. However, I have seen people try crazier things. And there should be no reason why we would intentionally prevent it from working. Using PIs was simply the easiest way to demonstrate the problem with the proposed fix. In the end I suspect a better solution would be to only run the HTML parser on raw HTML blocks from within the block parser. I tried that in the early commits in #803. However, I reverted back to a preprocessor in 7a8a6b5 as I was encountering too many complications. I wonder if that would be less of an issue now that the parser covers more edge cases than it did at the time. |
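For anyone who wants to experiment with the `parse_pi` idea outside of Python-Markdown, here is a rough, self-contained sketch against the standard-library parser. It is an approximation under stated assumptions: `_at_line_start` is a hypothetical stand-in for `at_line_start()`, the `intail` condition from the diff is omitted, `parse_pi` is an internal method of `HTMLParser` rather than public API, and the input is assumed to be fed in a single `feed()` call.

```python
from html.parser import HTMLParser


class LineStartPIParser(HTMLParser):
    """Only treat '<?' as a processing instruction at the start of a line."""

    def __init__(self):
        super().__init__()
        self.events = []

    def _at_line_start(self, i):
        # Hypothetical helper: only whitespace (at most 3 chars) precedes
        # index ``i`` on its line.
        line_start = self.rawdata.rfind("\n", 0, i) + 1
        prefix = self.rawdata[line_start:i]
        return len(prefix) <= 3 and prefix.strip() == ""

    def parse_pi(self, i):
        if self._at_line_start(i):
            return super().parse_pi(i)
        # Mid-line: emit '<?' as plain data so following tags are not consumed.
        self.handle_data("<?")
        return i + 2

    def handle_starttag(self, tag, attrs):
        self.events.append(("starttag", tag))

    def handle_endtag(self, tag):
        self.events.append(("endtag", tag))

    def handle_data(self, data):
        self.events.append(("data", data))

    def handle_pi(self, data):
        self.events.append(("pi", data))


def collect(src):
    """Parse ``src`` in one pass and return the recorded events."""
    parser = LineStartPIParser()
    parser.feed(src)
    parser.close()
    return parser.events


# The code-span case: '<?php' is mid-line, so the following div survives.
print(collect("Another text `<?php`.\n\n<div>\nhello\n</div>\n"))

# The drawback: a mid-line PI inside a raw block is no longer seen as a PI.
print(collect("<div>\n<p>text <?php echo 1; ?></p>\n</div>\n"))
```

The second call shows the concern raised above. Because this sketch has no notion of being inside a raw block, a processing instruction that does not start a line is reported as plain data even when it sits inside raw HTML.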
Working on #1070 I was reminded that we have modified processing instructions to require them to end with However, that doesn't address the other cases ( |
* fix unclosed pi in code span
* fix unclosed dec in code span
* fix unclosed tag in code span

Closes #1066.
Hello,
I'm facing a problem, not sure where the problem really is (I'm not a Python developer), but I managed to create a little use case showing it.
Using a fresh install of mkdocs (1.1.2) and markdown 3.3.3 (the problem does not occur with markdown 3.2.2).
This .md file works. After adding a new line, it no longer works and fails with the error
AttributeError: 'NoneType' object has no attribute 'end'
It also fails with other combinations of code tags containing HTML and/or PHP, but this one is the smallest use case I've found. Thanks!