Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Speedup line_offset property #1392

Merged
merged 5 commits into from
Oct 26, 2023
Merged

Speedup line_offset property #1392

merged 5 commits into from
Oct 26, 2023

Conversation

eanorige
Copy link
Contributor

@eanorige eanorige commented Oct 20, 2023

When using mkdocs, I noticed huge runtimes and tracked these down to HTMLExtractor.line_offset taking all the time. Before this patch, a run of mkdocs could take 500+ seconds, with this patch, it takes ~10 seconds.

Flamegraph before optimization:
mkdocs2

Flamegraph after optimization:
mkdocs2a

Note: It may be the case that the full cache of start-positions of all lines isn't needed, and just a single entry is sufficient; I didn't explore this optimization.

* Replace dynamic regex with string find operation
* Add cache of where each line starts so we don't have quadratic behavior identifying line numbers when importing large chunks of html
Forgot to remove old implementation
@waylan waylan added the needs-review Needs to be reviewed and/or approved. label Oct 22, 2023
@eanorige
Copy link
Contributor Author

Thanks for triggering CI. Based on its feedback, I have

  1. added the auto-requested changelog entry
  2. removed the extra blank line accidentally introduced
  3. reworded a comment.

In rewording this comment, I realize that there is a slight change in behavior of this function when self.lineno is past the last line; the old implementation would return the position of the last '\n' in the input, while this implementation returns the length of the input (i.e. EOF). If the exact backwards compatible behavior is needed, lf_pos = len(self.rawdata) should be changed to = self.rawdata.rfind('\n'). Let me know what you think.

@facelessuser
Copy link
Collaborator

@waylan I am not as familiar with the HTML parsing that you implemented, can a buffer in this situation change on you invalidating the cached offset?

If the exact backwards compatible behavior is needed, lf_pos = len(self.rawdata) should be changed to = self.rawdata.rfind('\n'). Let me know what you think.

Generally, I would say that to have a better chance of getting merged quickly, backwards compatible would be better. When something backwards incompatible is introduced, we would need to be even more cautious to ensure this doesn't inadvertently break some other behavior that is going to cause us more time to debug later.

While the speedup is certainly nice, I would definitely want to ensure this hasn't introduced some unexpected parsing behavior that is going to break 3rd party plugins that aren't immediately obvious.

@facelessuser
Copy link
Collaborator

I guess if the change could be somehow identified as inadvertent behavior, then it would make sense to not retain backwards compatibility.

@waylan
Copy link
Member

waylan commented Oct 23, 2023

My memory is very hazy about this. That said, the code that this PR replaces is discussed in #1066 and/or #1068 and was committed in #1069. It appears that we added tests to cover the various edge cases.

the old implementation would return the position of the last '\n' in the input, while this implementation returns the length of the input (i.e. EOF). If the exact backwards compatible behavior is needed, lf_pos = len(self.rawdata) should be changed to = self.rawdata.rfind('\n'). Let me know what you think.

Yes, we want the exact same behavior. It is possible that a file may not end with a newline and that would have been taken into consideration with the existing implementation. Therefore, we would need the method to always return the position of a \n exactly. Returning any other position would be an error. Perhaps we need to add a test for that.

@waylan
Copy link
Member

waylan commented Oct 23, 2023

can a buffer in this situation change on you invalidating the cached offset?

In short yes. See this comment (and the few that follow) for an explanation. As a reminder, we are using the standard library HTML parser to extract HTML from non-HTML and need to hack it to make it work with any non-HTML which contains angle brackets.

Therefore, we would need the method to always return the position of a \n exactly.

Actually, what I should have said is that the method always must return the beginning of a line. EOF is only the beginning of a (empty) line if the buffer ends with a newline. If it does not end with a newline, the EOF is not the beginning of a line and it would be an error to return that position. In fact, looking at the example given in the above linked comment, it is possible that returning EOF could fail to address the purpose of this hack altogether.

@facelessuser
Copy link
Collaborator

In short yes. See #1066 (comment) (and the few that follow) for an explanation. As a reminder, we are using the standard library HTML parser to extract HTML from non-HTML and need to hack it to make it work with any non-HTML which contains angle brackets.

This poses a problem then. If the cache can be invalidated without us knowing, then caching is likely a bad idea.

@waylan
Copy link
Member

waylan commented Oct 23, 2023

Sorry, I think I misread your question. No, the buffer doesn't change. However, the parser jumps too far ahead and we need to back it up. What is being cached is the last position. The new position we back it up to should always be after the last cached position. And now I'm wondering if I am wrong about what to do if no more newlines are found.

@eanorige
Copy link
Contributor Author

eanorige commented Oct 23, 2023

Actually, what I should have said is that the method always must return the beginning of a line.

The old behavior is actually a bit more complex than that. Here's a sketch of the current (before my patch) implementation:

When there's no newlines in the input file or lineno is <= 1, return 0
Else if lineno is a valid line number, return the position of the first character *after* the newline
Else return the position of the last newline.  

I suspect that there's a bug in the existing code and that we should return rfind('\n')+1 in the last case to be consistent with the first two.

ed: remove caching invalidation discussion

@waylan
Copy link
Member

waylan commented Oct 24, 2023

I suspect that there's a bug in the existing code and that we should return rfind('\n')+1 in the last case to be consistent with the first two.

What is curious is why no tests are failing. Unless the character at the +1 position is insignificant whitespace or something. I wondered if maybe the tests weren't triggering that code path, but the coverage report says they are. I guess I need to take a closer look at the tests. One or two might need to be tweaked to be sure it fails if we are off by one character.

@waylan
Copy link
Member

waylan commented Oct 26, 2023

I just took another look at this. Specifically, the behavior when no newlines are found. I confirmed that we have no tests for this edge case. In fact, the existing code has a # pragma: no cover comment on that code block. If I remove the comment, that line shows up in the coverage report as missing.

I don't recall if I added the comment because I couldn't come up with a scenario which triggered that code path, or because I was focusing on other things and expected to come back to it later. In any event, we should probably try to come up with a test for this. If we do, then that should dictate how it should behave. If not, then I'm not going to be too concerned about it.

@waylan
Copy link
Member

waylan commented Oct 26, 2023

So, the concern (how the HTML parser could get into the state where the line number is greater than the total number of lines) is described in this comment. However, among various similar tests, the following test exists:

def test_raw_processing_instruction_code_span(self):
self.assertMarkdownRenders(
self.dedent(
"""
`<?php`
<div>
foo
</div>
"""
),
self.dedent(
"""
<p><code>&lt;?php</code></p>
<div>
foo
</div>
"""
)
)

I just tried many other variations and I cannot get that condition. I expect that it would take a very unusual edge case (with very poorly formed HTML) to ever get that condition. Therefore, I'm not going to sweet it.

@waylan waylan merged commit 524e4da into Python-Markdown:master Oct 26, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
needs-review Needs to be reviewed and/or approved.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants