-
Notifications
You must be signed in to change notification settings - Fork 39
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Rewrite POD parser and Markdown converter #142
base: main
Are you sure you want to change the base?
Conversation
Oh, one thing I appear to have missed is including Will see it I can support this gracefully somehow. I guess it's best to just include all content starting from the matched Apart from that, let me know what you think! I'm happy to answer any questions, of course. |
Looks great, thanks! Working through it now as it's a large commit. Some comments: I think being more error tolerant would be helpful, especially as POD doesn't always follow the POD standard. Generally, as long as it looks like pod, we should probably show it. Some examples that now break: File::Basename and File::Compare are core modules that break entirely because they don't end with a File::Fetch errors with File::Temp throws an uncaught error on this line: I found the above examples by flipping through autocompletion results on And as you mentioned, the lookup from Also, for translating Overall though, this is great. Rearchitecting this is definitely helpful and I appreciate you diving in here. POD parsing is very painful. |
Thank you for the feedback! Regarding fault-tolerance: Interesting, I didn't know POD was interpreted so "liberally" across the ecosystem (but alas, I should've figured, it's Perl after all). I'll definitely address your points and make it more forgiving; as forgiving as possible. Perhaps I'll just make it ignore errors completely, if possible. Regarding headers: I did not know that VSCode would actually display them so prominently; I agree that it looks way too large! I'll use a level of I'll also address the uncaught error, of course. Besides that, I recently also found two small issues:
I'll also correct those. So in total, the following things need to be addressed:
If there's anything else you find, let me know! I'll add it to the list above and address it. |
d31f8ac
to
d033f4d
Compare
Alright, I managed to churn through all this today, which was quicker than I expected. All new commits address the points I listed above; I tried to be as granular as possible, so reviewing is a little easier. I did run a quick format over the file and fixed up some minor code formatting things here and there; this did affect a couple existing commits, but the contents of those commits aren't changed otherwise. All new or altered behaviour is added through new commits. Sorry for the noise there! (I blame For good measure I also added a little formatting feature -- see the last commit. If you notice any other issues, please let me know! I'm always happy to help. |
Hmm, on second thought, I might revisit the symbol lookup behaviour regarding =over
=item *
$object->foo()
Lorem ipsum dolor sit amet, consectetur adipiscing elit.
=item *
$object->bar()
Vestibulum ac placerat arcu. Nulla eget risus sed eros pretium tempus in sed purus.
=back
=cut This is genuinely annoying to handle though, because you can't really parse it all at once; e.g. it might seem logical to just include the next paragraph in the item if there's a plain I'll see if I can revisit the data model in that regard to just make handling So overall, I'm not 100% satisfied with the |
As of this commit, the language server protocol doesn't support HTML as markup content type and it seems that there are currently no plans to support it. [1] It's not really defined whether LSP servers should escape HTML or return HTML escape sequences for things like hover documentation; at least I wasn't able to find any clear specification on that. However, there is a note on the `MarkupContent` interface [2] that says that *clients* might sanitize the returned markup, in order to e.g. remove any HTML that might be in there. This goes for both `plaintext` and `markdown` markup kinds. It's therefore best to let clients decide whether they should escape any markdown content returned by the server or not. Because there are several servers that do this, the Neovim LSP client specifically handles common HTML escape sequences and converts them back to their un-escaped counterparts. [3] I think it's therefore safe to say that HTML shouldn't be escaped server-side and let clients decide whether they want to sanitize their strings or not. [1]: microsoft/language-server-protocol#781 [2]: https://github.com/microsoft/vscode-languageserver-node/blob/14ddabfc22187b698e83ecde072247aa40727308/types/src/main.ts#L2048-L2049 [3]: https://github.com/neovim/neovim/blob/3a61f05dd2d6cb2ac9bca4795467f459595e58dd/runtime/lua/vim/lsp/util.lua#L1265-L1276 Signed-off-by: Max R. Carrara <[email protected]>
This commit makes the regexes that handle the conversion of POD inline format codes to markdown formatting more robust and more conformant to the POD specification. Specifically, previously only codes like "B<< foobar >>" were supported, whereas now codes like the following are also supported: B<< foo bar >> B<<<< foo bar >>>> Both of these formatting codes result in the following markdown: **foo bar** However, due to the nature of doing this kind of conversion using regexes, mismatching numbers of brackets are still possible ("allowed"). Moreover, because this conversion is performed line-by-line, formatting codes like the following two examples will not be recognized, even though the POD spec considers them valid: B<<<<< foo bar >>>>> B<< foo bar >> Both the mismatching brackets and the multiline format code parsing could be addressed by writing a separate parser & converter specifically for formatting codes, but this was deemed as having little to no benefit for the amount of work that it requires. Signed-off-by: Max R. Carrara <[email protected]>
Signed-off-by: Max R. Carrara <[email protected]>
This commit completely rewrites the existing POD parsing and markdown conversion logic, separating it into several components and data types. The three main components, the "parser", "processor" and "converter", are all represented as classes in order to keep their corresponding helpers namespaced (and private). While a purely functional approach is also possible here, I found this to be the most straightforward way to keep everything contained / grouped without introducing any additional files and such. These components are as follows: 1. `RawPodParser` The `RawPodParser` class takes the contents of a file, parses any POD document it encounters and returns it as a `RawPodDocument`. This kind of POD document is called "raw" because no additional validity checks are performed during the parse (e.g. checking if an `=over` command is eventually matched by a `=back` command). 2. `PodProcessor` This class takes a `RawPodDocument` and transforms it into a `PodDocument`, performing various validity checks and data conversions. In essence, this makes sure that the given document conforms as much to the POD specification as feasibly possible (without overcomplicating all of this). Additionally, certain kinds of paragraphs (e.g. verbatim paragraphs) are merged, in order to make them easier to use when converting the POD doc to markdown. 3. `PodToMarkdownConverter` As the name implies, this class converts the given `PodDocument` into a string containing Markdown. This conversion is infallible, as every `PodDocument` returned by the POD processor has already been verified to be correct. All of the components above make it possible to return an entire section (denoted by `=head1`, `=head2`, etc.) belonging to the symbol name being looked up when a client requests the symbol's hover documentation. This means that all sub-sections and other POD commands (e.g. lists in `=over ... =back` regions) are now also included in the hover documentation. Additionally, several other POD paragraphs / structures are now also supported while also fixing some minor things. To summarize: - Consecutive verbatim paragraphs are merged to one large paragraph and rendered as markdown code block without syntax highlighting - Nested `=over ... =back` regions with specified indent levels * This also allows nested lists to be rendered neatly - `=begin html ... =end html` regions are converted to markdown code blocks with HTML syntax highlighting - `=begin code ... =end code` regions are converted to markdown code blocks with Perl syntax highlighting - Nested `=begin ... =end` regions with (custom) format names and parameters (parameters are currently ignored) - `=for` paragraphs are treated the same as `=begin ... =end` regions, as per spec - `=head1`, `=head2`, etc. are now correctly converted to their respective Markdown counterparts (e.g. `=head1 foo` --> `# foo`) - Searching for documentation of methods is now more precise Signed-off-by: Max R. Carrara <[email protected]>
These unit tests ensure that changes to the POD parser, processor and markdown converter don't cause any regressions in the resulting output. Furthermore, these tests also test for certain quirks and peculiarities that this specific POD parser implementation allows, which aren't allowed per the official spec. The tests can be run from the `server/` directory by running `npm ci` followed by `npm test`. The latter command is added by this commit. Signed-off-by: Max R. Carrara <[email protected]>
.. and update the corresponding test case accordingly. Signed-off-by: Max R. Carrara <[email protected]>
.. and also add a corresponding unit test. At the same time, if the `=for` paragraph doesn't contain any content (no lines), don't create an empty ordinary or data paragraph during the processing stage. In other words, a `=for` paragraph without any content now always results in an empty `=begin ... =end` block. Signed-off-by: Max R. Carrara <[email protected]>
.. and update unit tests accordingly. Note that this obviously prevents these kinds of blocks to span across multiple `=pod ... =cut` regions, but this doesn't seem to show up anywhere anyways AFAIK. Signed-off-by: Max R. Carrara <[email protected]>
In other words, this restores the original behaviour of e.g. `=head1` in POD becoming `###` in Markdown. The header level is incremented by `2` up until a maximum level of `6` in order to not exceed the maximum header level of Markdown. Unit tests are updated correspondingly. Signed-off-by: Max R. Carrara <[email protected]>
Instead of always omitting the last indentation level of nested lists, the initial level of the first nesting is now unindented. This means that nested `=over ... =back` blocks with the levels `5, 3, 2` are converted to e.g.: - foo (0) - bar (3) - baz (2) .. instead of: - foo (0) - bar (5) - baz (3) Unit tests are updated correspondingly. Signed-off-by: Max R. Carrara <[email protected]>
.. and use type predicates instead of plain booleans as return types for the `isItem` and `isOverBlockWithItem` functions. Signed-off-by: Max R. Carrara <[email protected]>
Although presumably pretty rare, it doesn't hurt to support them. Signed-off-by: Max R. Carrara <[email protected]>
This commit adds support for the following: - symbol lookup for `=item` paragraphs, restoring the original behaviour - consecutive `=item` or `=head\d` paragraphs (of the same level) for the same subroutine, method, etc., e.g. in order to show an alternative way to call the subroutine, are now displayed as expected - support symbol lookup via index entries (`X<entry>`), which may be used to e.g. look up an alias of a subroutine To elaborate on the last point, here's a quick example: `File::Copy::move` has an alias called `File::Copy::mv`. The documentation for move contains an index entry for `mv`, so that the docs count for both subroutines: =item move X<move> X<mv> X<rename> The symbol lookup regex will now also match index entries, so if a signature doesn't match, the regex will try to match for an index entry, if there is any. This makes it possible to look up the docstring for `File::Copy::mv`. Also, add documentation for the `lookupSymbolInPod` function. Signed-off-by: Max R. Carrara <[email protected]>
For all general cases, the `formatPodDoc` function is now used to perform various actions on a processed pod document. At the moment, this only removes `=head1 NAME` paragraphs from the document, as those are kind of unnecessary in hover docs. After a successful symbol lookup, the returned pod document is formatted in the following ways: - `=item` paragraphs of the same indentation level are converted to header paragraphs, which makes the matched symbol appear more nicely. - `=head\d` paragraphs are now "normalized" -- this means that if e.g. a subroutine signature starts at `=head4`, it will be converted to a `=head1` internally, which is then adapted as needed by the markdown converter. The above results in the docs of all looked up symbols being uniformly rendered as `h3` (`###`), with sub-headings being converted to the respective relative level. Signed-off-by: Max R. Carrara <[email protected]>
Some Perl modules in the wild document their subroutines like this: =over =item * $obj->foo() Lorem ipsum odor amet, consectetuer adipiscing elit. Quis a habitasse hendrerit efficitur phasellus lacus. [...] =back While this would render to Markdown as expected, the =item paragraph would still contain no text in terms of how the data is represented. This commit makes it so that if an =item without text is followed by an ordinary paragraph, the latter is melded into the former during the initial parse of the POD document. Thus, symbol lookup for this case in particular works again as expected. Test cases are adapted correspondingly. Signed-off-by: Max R. Carrara <[email protected]>
d033f4d
to
fe7c1e7
Compare
Hello! Happy new year! It's been a little while, but I've now fixed the This means that this PR is ready to be reviewed again. I appreciate any feedback! FWIW I can also meld the commits logically together instead of adding |
Hi @Aequitosh, Happy New Year and thanks for continuing to push this forward! There's a lot of code here, so it may take me a while to review, but I wanted to provide some initial feedback on functionality. It looks great overall and is working for me on the WWW::Mechanize example as well as a few other key POD files. It might makes sense to remove all of the error checking logic, and attempt to smoothly handle any non-conforming POD. Currently, when an error is encountered, it fails to retrieve any documentation at all. Removing error handling would also simplify much of the code. Some examples that currently fail:
Other modules may fail to parse as well, especially people's local modules which are less likely to be properly formatted. A minor suggestion is to change the 8 spaces in tabsToSpaces to either 2 or 4 spaces (I'm currently using 2 spaces since the syntax highlighting is already a visual indicator that the section is different). When displaying docs in the autocompletion callback, horizontal space is extremely limited in the default size. Here's an example with only one level of tab. Two or three levels may be unreadable. Also, I see that you're using Neovim. Would you mind adding Otherwise, everything seems to be working and looks like a solid improvement. Thanks again! |
This is a solid approach. There's going to be a lot of broken Pod out there. 😄 |
Hi @bscan, thanks for the detailed response! To keep things short, I agree with all your points here and see if I can address them soon, especially the error handling part. I did keep in mind that some POD out there is more lax in its adherence to the spec, but not that lax. Thanks a bunch! I'll address these things in additional commits. |
As suggested in PR bscan#142 [1], get rid of the error handling in the RawPodParser class as part of making the overall POD-to-Markdown conversion process more lax. This only affects =begin, =end and =for command paragraphs, which are now parsed even if there is no format name, despite the spec [2] requiring one. [1]: bscan#142 (comment) [2]: https://perldoc.perl.org/perlpodspec#%22=begin-formatname%22 Signed-off-by: Max R. Carrara <[email protected]>
Similar to the previous commit, get rid of the error handling in the PodProcessor class in order to make the overall POD-to-Markdown conversion process more lax (as suggested in PR bscan#142 [1]). This has a few consequences: - =head paragraphs are now allowed within =over ... =back blocks (even though it is suggested not to put them there [2]). - If there are any disallowed command paragraphs inside =begin ... =end, ignore them instead of returning an error. - The formatname parameter of =begin and =end paragraphs does not have to match anymore. - Should there be any =back or =end paragraphs without a matching =over or =begin, ignore them instead of returning an error. - Even though data paragraphs are constructed by the processor itself and should never occur naturally inside the raw pod blocks being processed, handle them anyways -- this means that they get ignored everywhere, except inside =begin ... =end blocks, where they're just added to the processed block. This is a minor detail, but still worth pointing out. [1]: bscan#142 (comment) [2]: https://perldoc.perl.org/perlpod#=over-indentlevel Signed-off-by: Max R. Carrara <[email protected]>
.. as suggested in PR bscan#142 [1]. [1]: bscan#142 (comment) Signed-off-by: Max R. Carrara <[email protected]>
Signed-off-by: Max R. Carrara <[email protected]>
Due to a tiny logic error, no docstrings would be rendered for modules like File::Basename or File::Find::Rule. This would happen if no symbol name set (which means that just the entire module's POD should be returned). Signed-off-by: Max R. Carrara <[email protected]>
This PR is made mainly in order to address issue #134, but goes beyond that and also addresses a bunch of other things in one go.
In essence, this PR rewrites the entire POD parsing and POD-to-Markdown conversion logic, trying to stay as compliant to the POD specification1 as possible. Compliance is however broken in certain places in order to not over-complicate things. More on that below.
Furthermore, the conversion of POD inline formatting codes (e.g.
B<foobar>
) to their respective Markdown equivalents is also improved. The server-side escaping of HTML is removed, as it's generally the client's decision whether to sanitize HTML within markdown or not.The last commit of this PR also adds a lot of unit tests that have aided me in writing all of this; they're added as a separate commit in case they're undesired. Though, I highly suggest keeping them, as they also test for certain quirks that this implementation allows.
The unit tests are not automatically executed however; they must be run manually in the
server/
directory vianpm test
(requiresnpm ci
to be called first).For all of the above, it's best to read the commit messages for further details.
Highlights
Sub-sections, lists, etc. are now included when looking up the hover documentation for subroutines and methods.
foo
, all POD content from=head3 foo
until the next=head3
(or=head2
or=head1
) will be included and converted to Markdown.=head3 foo
were included.Nested lists (
=over ... =back
) and data regions (=begin ... =end
) are now supported and rendered according to their contents.Headers now correspond to their markdown equivalents.
Subroutine and method lookup is now more precise.
Previews
These previews have been made in Neovim and might look different depending on your editor.
WWW::Mechanize
- BeforeWWW::Mechanize
- AfterWWW::Mechanize::new
- BeforeWWW::Mechanize::new
- AfterRegarding Spec Conformity
While the new parser is much more spec-compliant than the old one, there are still a couple of cases where conforming to the spec was deemed too complicated or having too little ROI.
The spec1 for example does not explicitly specify whether
=item
commands may also exist outside of an=over ... =back
region, so the new parser just supports it (as it doesn't really make much of a difference in terms of the implementation's logic).Furthermore,
=over ... =back
regions have a couple of different "flavours"2, each having different requirements. For simplicity's sake, none of these flavours are accounted for; instead, all content inside=over ... =back
regions is treated in the same manner (except=head1
etc. commands, these are not allowed and will raise an error).This means that mixed list types like the following are valid POD:
The above is equivalent to:
.. and is converted to the following Markdown:
Closing Thoughts
Any feedback is highly appreciated! I've tested this implementation via
WWW::Mechanize
(as you can see above) and also via the provided unit tests; there could always be edge cases I missed though, however, because POD isn't a necessarily... straightforward format.Footnotes
https://perldoc.perl.org/perlpodspec ↩ ↩2
https://perldoc.perl.org/perlpodspec#About-=over...=back-Regions ↩