Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Rewrite POD parser and Markdown converter #142

Open
wants to merge 20 commits into
base: main
Choose a base branch
from

Conversation

Aequitosh
Copy link

This PR is made mainly in order to address issue #134, but goes beyond that and also addresses a bunch of other things in one go.

In essence, this PR rewrites the entire POD parsing and POD-to-Markdown conversion logic, trying to stay as compliant to the POD specification1 as possible. Compliance is however broken in certain places in order to not over-complicate things. More on that below.

Furthermore, the conversion of POD inline formatting codes (e.g. B<foobar>) to their respective Markdown equivalents is also improved. The server-side escaping of HTML is removed, as it's generally the client's decision whether to sanitize HTML within markdown or not.

The last commit of this PR also adds a lot of unit tests that have aided me in writing all of this; they're added as a separate commit in case they're undesired. Though, I highly suggest keeping them, as they also test for certain quirks that this implementation allows.

The unit tests are not automatically executed however; they must be run manually in the server/ directory via npm test (requires npm ci to be called first).

For all of the above, it's best to read the commit messages for further details.

Highlights

  • Sub-sections, lists, etc. are now included when looking up the hover documentation for subroutines and methods.

    • This means that for a subroutine named foo, all POD content from =head3 foo until the next =head3 (or =head2 or =head1) will be included and converted to Markdown.
    • Previously, only the ordinary and verbatim paragraphs directly after =head3 foo were included.
  • Nested lists (=over ... =back) and data regions (=begin ... =end) are now supported and rendered according to their contents.

  • Headers now correspond to their markdown equivalents.

  • Subroutine and method lookup is now more precise.

Previews

These previews have been made in Neovim and might look different depending on your editor.

WWW::Mechanize - Before

Screenshot_20240902_162953

WWW::Mechanize - After

Screenshot_20240902_155827

WWW::Mechanize::new - Before

Screenshot_20240902_163219

WWW::Mechanize::new - After

Screenshot_20240902_163255

Regarding Spec Conformity

While the new parser is much more spec-compliant than the old one, there are still a couple of cases where conforming to the spec was deemed too complicated or having too little ROI.

The spec1 for example does not explicitly specify whether =item commands may also exist outside of an =over ... =back region, so the new parser just supports it (as it doesn't really make much of a difference in terms of the implementation's logic).

Furthermore, =over ... =back regions have a couple of different "flavours"2, each having different requirements. For simplicity's sake, none of these flavours are accounted for; instead, all content inside =over ... =back regions is treated in the same manner (except =head1 etc. commands, these are not allowed and will raise an error).

This means that mixed list types like the following are valid POD:

=pod

=over

=item * Foo

=item 42. Bar

=back

=cut

The above is equivalent to:

=pod

=over 4

=item *

Foo

=item 42.

Bar

=back

=cut

.. and is converted to the following Markdown:

- Foo
42. Bar

Closing Thoughts

Any feedback is highly appreciated! I've tested this implementation via WWW::Mechanize (as you can see above) and also via the provided unit tests; there could always be edge cases I missed though, however, because POD isn't a necessarily... straightforward format.

Footnotes

  1. https://perldoc.perl.org/perlpodspec 2

  2. https://perldoc.perl.org/perlpodspec#About-=over...=back-Regions

@Aequitosh
Copy link
Author

Oh, one thing I appear to have missed is including =items when looking up a symbol's name -- it seems that some docs out there in the wild group and enumerate their subroutines in =over ... =back lists.

Will see it I can support this gracefully somehow. I guess it's best to just include all content starting from the matched =item until another =item (on the same nesting level) is encountered. Shouldn't be too hard.

Apart from that, let me know what you think! I'm happy to answer any questions, of course.

@bscan
Copy link
Owner

bscan commented Sep 10, 2024

Looks great, thanks! Working through it now as it's a large commit. Some comments:

I think being more error tolerant would be helpful, especially as POD doesn't always follow the POD standard. Generally, as long as it looks like pod, we should probably show it. Some examples that now break:

File::Basename and File::Compare are core modules that break entirely because they don't end with a =cut. It throws the error: message: '"=pod ... =cut" region beginning at line 400 was never closed (missing "=cut")'

File::Fetch errors with message: 'unexpected end of paragraphs while processing "=over ... =back" block'. This is because it doesn't have a =back line, only a =cut to end the list.

File::Temp throws an uncaught error on this line:
let contents = matchResult.groups?.contents.trim() || "";
Although this fixes it:
let contents = (matchResult.groups?.contents || "").trim();

I found the above examples by flipping through autocompletion results on use File::, which is also a nice way to quickly spot check a large number of modules.

And as you mentioned, the lookup from item needs to be added back: use File::Copy qw(move), the move doesn't work anymore.

Also, for translating =head\d into markdown, I see that it now does "#".repeat(headerPara.level). The prior version started at 3 levels down instead which I found was a better use of space. The enormous display of =head1 would waste substantial amounts of vertical space. See the comparison is below. I also previously had it remove the NAME: title as it felt unneccessary,.

Before:
image

After:
image

Overall though, this is great. Rearchitecting this is definitely helpful and I appreciate you diving in here. POD parsing is very painful.

@Aequitosh
Copy link
Author

Aequitosh commented Sep 10, 2024

Thank you for the feedback!

Regarding fault-tolerance: Interesting, I didn't know POD was interpreted so "liberally" across the ecosystem (but alas, I should've figured, it's Perl after all). I'll definitely address your points and make it more forgiving; as forgiving as possible. Perhaps I'll just make it ignore errors completely, if possible.

Regarding headers: I did not know that VSCode would actually display them so prominently; I agree that it looks way too large! I'll use a level of 3 as the starting level again.

I'll also address the uncaught error, of course.

Besides that, I recently also found two small issues:

  1. Nested lists aren't correctly indented; the last indentation level is always skipped in order to unindent the list in Markdown. Instead, the first indentation level should be skipped.
  2. Multiple headings belonging to the same subroutine aren't rendered correctly, as the symbol lookup function will just consider everything from the first to the succeeding heading, without checking if the heading also belongs to the subroutine.
    • In other words, if there's =head3 foo(bar => \%baz) followed by =head3 foo(bar => \@baz) (because the author wants to show that keys can have either array refs or hash refs as values), only the content starting from the former up until and excluding the latter header is included. This means that the actual docs are lost when rendered as Markdown.

I'll also correct those.

So in total, the following things need to be addressed:

  • Use a level of 3 as starting level for headers (So, a =head1 becomes ### in MD)
  • Make the parser more tolerant regarding errors, as the exact spec isn't upheld in the wild
    • If possible, just ignore errors entirely (kind of done in some places already)
  • Fix indentation of nested lists
  • Fix consecutive headers belonging to the same subroutine being ignored, leading to the rendered docs missing content
  • Fix =item commands not being taken into account when looking up a symbol's name
    • Note that I'll probably not stay on spec here and only consider =item foo, =item * foo and =item N. foo, because being spec-compliant in this case is a huge hassle -- e.g. =item *\n\nfoo\n\n is painful to handle, because foo is now an ordinary paragraph
  • Fix uncaught error (shows up with File::Temp)
  • Remove =head1 NAME from top of document if it exists, as it's unnecessary

If there's anything else you find, let me know! I'll add it to the list above and address it.

@Aequitosh Aequitosh force-pushed the fix-134-pod-hover-doc branch from d31f8ac to d033f4d Compare September 10, 2024 15:36
@Aequitosh
Copy link
Author

Alright, I managed to churn through all this today, which was quicker than I expected.

All new commits address the points I listed above; I tried to be as granular as possible, so reviewing is a little easier.

I did run a quick format over the file and fixed up some minor code formatting things here and there; this did affect a couple existing commits, but the contents of those commits aren't changed otherwise. All new or altered behaviour is added through new commits. Sorry for the noise there! (I blame git absorb).

For good measure I also added a little formatting feature -- see the last commit.

If you notice any other issues, please let me know! I'm always happy to help.

@Aequitosh
Copy link
Author

Aequitosh commented Sep 13, 2024

Hmm, on second thought, I might revisit the symbol lookup behaviour regarding =item paragraphs... There do seem to be some packages that document their subroutines like this:

=over

=item *

$object->foo()

Lorem ipsum dolor sit amet, consectetur adipiscing elit.

=item *

$object->bar()

Vestibulum ac placerat arcu. Nulla eget risus sed eros pretium tempus in sed purus.

=back

=cut

This is genuinely annoying to handle though, because you can't really parse it all at once; e.g. it might seem logical to just include the next paragraph in the item if there's a plain =item or =item * without any additional text, but that opens a can of worms of additional cases that ought to be handled.

I'll see if I can revisit the data model in that regard to just make handling =items much easier overall, or perhaps I'll cook up some sort of "peekable" or "rewindable" iterator, or something like that.

So overall, I'm not 100% satisfied with the =item handling -- so if you consider this fine to merge, please don't do so yet. I really want to make this proper and implement something that's maintainable.

Max R. Carrara added 15 commits January 2, 2025 14:42
As of this commit, the language server protocol doesn't support HTML
as markup content type and it seems that there are currently no plans
to support it. [1]

It's not really defined whether LSP servers should escape HTML or
return HTML escape sequences for things like hover documentation; at
least I wasn't able to find any clear specification on that.

However, there is a note on the `MarkupContent` interface [2] that
says that *clients* might sanitize the returned markup, in order to
e.g. remove any HTML that might be in there. This goes for both
`plaintext` and `markdown` markup kinds.

It's therefore best to let clients decide whether they should escape
any markdown content returned by the server or not.

Because there are several servers that do this, the Neovim LSP client
specifically handles common HTML escape sequences and converts them
back to their un-escaped counterparts. [3]

I think it's therefore safe to say that HTML shouldn't be escaped
server-side and let clients decide whether they want to sanitize their
strings or not.

[1]: microsoft/language-server-protocol#781
[2]: https://github.com/microsoft/vscode-languageserver-node/blob/14ddabfc22187b698e83ecde072247aa40727308/types/src/main.ts#L2048-L2049
[3]: https://github.com/neovim/neovim/blob/3a61f05dd2d6cb2ac9bca4795467f459595e58dd/runtime/lua/vim/lsp/util.lua#L1265-L1276

Signed-off-by: Max R. Carrara <[email protected]>
This commit makes the regexes that handle the conversion of POD inline
format codes to markdown formatting more robust and more conformant to
the POD specification.

Specifically, previously only codes like "B<< foobar >>" were
supported, whereas now codes like the following are also supported:

  B<<    foo bar  >>
  B<<<<     foo bar        >>>>

Both of these formatting codes result in the following markdown:

  **foo bar**

However, due to the nature of doing this kind of conversion using
regexes, mismatching numbers of brackets are still possible
("allowed").

Moreover, because this conversion is performed line-by-line,
formatting codes like the following two examples will not be
recognized, even though the POD spec considers them valid:

  B<<<<<
  foo bar
  >>>>>

  B<<
       foo bar
                     >>

Both the mismatching brackets and the multiline format code parsing
could be addressed by writing a separate parser & converter
specifically for formatting codes, but this was deemed as having
little to no benefit for the amount of work that it requires.

Signed-off-by: Max R. Carrara <[email protected]>
This commit completely rewrites the existing POD parsing and markdown
conversion logic, separating it into several components and data
types.

The three main components, the "parser", "processor" and "converter",
are all represented as classes in order to keep their corresponding
helpers namespaced (and private).

While a purely functional approach is also possible here, I found this
to be the most straightforward way to keep everything contained /
grouped without introducing any additional files and such.

These components are as follows:

1. `RawPodParser`

The `RawPodParser` class takes the contents of a file, parses any POD
document it encounters and returns it as a `RawPodDocument`. This kind
of POD document is called "raw" because no additional validity checks
are performed during the parse (e.g. checking if an `=over` command is
eventually matched by a `=back` command).

2. `PodProcessor`

This class takes a `RawPodDocument` and transforms it into a
`PodDocument`, performing various validity checks and data
conversions.

In essence, this makes sure that the given document conforms as much
to the POD specification as feasibly possible (without
overcomplicating all of this).

Additionally, certain kinds of paragraphs (e.g. verbatim paragraphs)
are merged, in order to make them easier to use when converting the
POD doc to markdown.

3. `PodToMarkdownConverter`

As the name implies, this class converts the given `PodDocument` into
a string containing Markdown. This conversion is infallible, as every
`PodDocument` returned by the POD processor has already been verified
to be correct.

All of the components above make it possible to return an entire
section (denoted by `=head1`, `=head2`, etc.) belonging to the symbol
name being looked up when a client requests the symbol's hover
documentation.

This means that all sub-sections and other POD commands (e.g. lists in
`=over ... =back` regions) are now also included in the hover
documentation.

Additionally, several other POD paragraphs / structures are now also
supported while also fixing some minor things. To summarize:

- Consecutive verbatim paragraphs are merged to one large paragraph
  and rendered as markdown code block without syntax highlighting

- Nested `=over ... =back` regions with specified indent levels
  * This also allows nested lists to be rendered neatly

- `=begin html ... =end html` regions are converted to markdown code
  blocks with HTML syntax highlighting

- `=begin code ... =end code` regions are converted to markdown code
  blocks with Perl syntax highlighting

- Nested `=begin ... =end` regions with (custom) format names and
  parameters (parameters are currently ignored)

- `=for` paragraphs are treated the same as `=begin ... =end` regions,
  as per spec

- `=head1`, `=head2`, etc. are now correctly converted to their
  respective Markdown counterparts (e.g. `=head1 foo` --> `# foo`)

- Searching for documentation of methods is now more precise

Signed-off-by: Max R. Carrara <[email protected]>
These unit tests ensure that changes to the POD parser, processor and
markdown converter don't cause any regressions in the resulting
output.

Furthermore, these tests also test for certain quirks and
peculiarities that this specific POD parser implementation allows,
which aren't allowed per the official spec.

The tests can be run from the `server/` directory by running `npm ci`
followed by `npm test`. The latter command is added by this commit.

Signed-off-by: Max R. Carrara <[email protected]>
.. and update the corresponding test case accordingly.

Signed-off-by: Max R. Carrara <[email protected]>
.. and also add a corresponding unit test.

At the same time, if the `=for` paragraph doesn't contain any content
(no lines), don't create an empty ordinary or data paragraph during
the processing stage.

In other words, a `=for` paragraph without any content now always
results in an empty `=begin ... =end` block.

Signed-off-by: Max R. Carrara <[email protected]>
.. and update unit tests accordingly.

Note that this obviously prevents these kinds of blocks to span across
multiple `=pod ... =cut` regions, but this doesn't seem to show up
anywhere anyways AFAIK.

Signed-off-by: Max R. Carrara <[email protected]>
In other words, this restores the original behaviour of e.g. `=head1`
in POD becoming `###` in Markdown.

The header level is incremented by `2` up until a maximum level of
`6` in order to not exceed the maximum header level of Markdown.

Unit tests are updated correspondingly.

Signed-off-by: Max R. Carrara <[email protected]>
Instead of always omitting the last indentation level of nested lists,
the initial level of the first nesting is now unindented.

This means that nested `=over ... =back` blocks with the levels
`5, 3, 2` are converted to e.g.:

- foo (0)
   - bar (3)
     - baz (2)

.. instead of:

- foo (0)
     - bar (5)
        - baz (3)

Unit tests are updated correspondingly.

Signed-off-by: Max R. Carrara <[email protected]>
.. and use type predicates instead of plain booleans as return types
for the `isItem` and `isOverBlockWithItem` functions.

Signed-off-by: Max R. Carrara <[email protected]>
Although presumably pretty rare, it doesn't hurt to support them.

Signed-off-by: Max R. Carrara <[email protected]>
This commit adds support for the following:
- symbol lookup for `=item` paragraphs, restoring the original
  behaviour
- consecutive `=item` or  `=head\d` paragraphs (of the same level) for
  the same subroutine, method, etc., e.g. in order to show an
  alternative way to call the subroutine, are now displayed as
  expected
- support symbol lookup via index entries (`X<entry>`), which may be
  used to e.g. look up an alias of a subroutine

To elaborate on the last point, here's a quick example:

`File::Copy::move` has an alias called `File::Copy::mv`. The
documentation for move contains an index entry for `mv`, so that the
docs count for both subroutines:

  =item move
  X<move> X<mv> X<rename>

The symbol lookup regex will now also match index entries, so if a
signature doesn't match, the regex will try to match for an index
entry, if there is any.

This makes it possible to look up the docstring for `File::Copy::mv`.

Also, add documentation for the `lookupSymbolInPod` function.

Signed-off-by: Max R. Carrara <[email protected]>
For all general cases, the `formatPodDoc` function is now used to
perform various actions on a processed pod document. At the moment,
this only removes `=head1 NAME` paragraphs from the document, as those
are kind of unnecessary in hover docs.

After a successful symbol lookup, the returned pod document is
formatted in the following ways:
- `=item` paragraphs of the same indentation level are converted to
  header paragraphs, which makes the matched symbol appear more
  nicely.
- `=head\d` paragraphs are now "normalized" -- this means that if e.g.
  a subroutine signature starts at `=head4`, it will be converted to a
  `=head1` internally, which is then adapted as needed by the
  markdown converter.

The above results in the docs of all looked up symbols being uniformly
rendered as `h3` (`###`), with sub-headings being converted to the
respective relative level.

Signed-off-by: Max R. Carrara <[email protected]>
Some Perl modules in the wild document their subroutines like this:

  =over

  =item *

  $obj->foo()

  Lorem ipsum odor amet, consectetuer adipiscing elit. Quis a habitasse
  hendrerit efficitur phasellus lacus.

  [...]

  =back

While this would render to Markdown as expected, the =item paragraph
would still contain no text in terms of how the data is represented.

This commit makes it so that if an =item without text is followed by
an ordinary paragraph, the latter is melded into the former during the
initial parse of the POD document. Thus, symbol lookup for this case
in particular works again as expected.

Test cases are adapted correspondingly.

Signed-off-by: Max R. Carrara <[email protected]>
@Aequitosh Aequitosh force-pushed the fix-134-pod-hover-doc branch from d033f4d to fe7c1e7 Compare January 2, 2025 13:57
@Aequitosh
Copy link
Author

Hello! Happy new year!

It's been a little while, but I've now fixed the =item handling in a separate commit. I've also updated some very tiny spelling issues and comments here and there in older commits and commit messages, but those shouldn't matter much.

This means that this PR is ready to be reviewed again. I appreciate any feedback!

FWIW I can also meld the commits logically together instead of adding fix ... commits here and there, but I hope it's a bit easier to review now as I'm not adding everything as a huge commit at once. If you've got any preferences in that regard anyhow, please let me know.

@bscan
Copy link
Owner

bscan commented Jan 2, 2025

Hi @Aequitosh, Happy New Year and thanks for continuing to push this forward! There's a lot of code here, so it may take me a while to review, but I wanted to provide some initial feedback on functionality. It looks great overall and is working for me on the WWW::Mechanize example as well as a few other key POD files.

It might makes sense to remove all of the error checking logic, and attempt to smoothly handle any non-conforming POD. Currently, when an error is encountered, it fails to retrieve any documentation at all. Removing error handling would also simplify much of the code. Some examples that currently fail:

'=back' does not have matching '=over' looks occurs in File::Basename, File::Fetch, File::Find::Rule, B::Utils, Data::Float

unexpected command paragraph "over" in "=begin future ... =end future" block happens in the core module Encode::Encoding, and a similar error occurs in Reply::Plugin::TypeTiny

Other modules may fail to parse as well, especially people's local modules which are less likely to be properly formatted.

A minor suggestion is to change the 8 spaces in tabsToSpaces to either 2 or 4 spaces (I'm currently using 2 spaces since the syntax highlighting is already a visual indicator that the section is different). When displaying docs in the autocompletion callback, horizontal space is extremely limited in the default size. Here's an example with only one level of tab. Two or three levels may be unreadable.

image

Also, I see that you're using Neovim. Would you mind adding perl as the language identifier in the verbatim blocks in convertVerbatimPara and seeing if it works? Vscode already natively interprets all verbatim blocks as perl, and would be a nice Neovim improvement. The idea was suggested here: #152 . I don't mean to add scope-creep, but I think it would be an easy fix for you.

Otherwise, everything seems to be working and looks like a solid improvement. Thanks again!

@oalders
Copy link
Contributor

oalders commented Jan 3, 2025

It might makes sense to remove all of the error checking logic, and attempt to smoothly handle any non-conforming POD. Currently, when an error is encountered, it fails to retrieve any documentation at all. Removing error handling would also simplify much of the code.

This is a solid approach. There's going to be a lot of broken Pod out there. 😄

@Aequitosh
Copy link
Author

Hi @bscan, thanks for the detailed response! To keep things short, I agree with all your points here and see if I can address them soon, especially the error handling part. I did keep in mind that some POD out there is more lax in its adherence to the spec, but not that lax.

Thanks a bunch! I'll address these things in additional commits.

Max R. Carrara added 2 commits January 7, 2025 14:21
As suggested in PR bscan#142 [1], get rid of the error handling in the
RawPodParser class as part of making the overall POD-to-Markdown
conversion process more lax.

This only affects =begin, =end and =for command paragraphs, which are
now parsed even if there is no format name, despite the spec [2]
requiring one.

[1]: bscan#142 (comment)
[2]: https://perldoc.perl.org/perlpodspec#%22=begin-formatname%22

Signed-off-by: Max R. Carrara <[email protected]>
Similar to the previous commit, get rid of the error handling in the
PodProcessor class in order to make the overall POD-to-Markdown
conversion process more lax (as suggested in PR bscan#142 [1]).

This has a few consequences:

- =head paragraphs are now allowed within =over ... =back blocks (even
  though it is suggested not to put them there [2]).

- If there are any disallowed command paragraphs inside =begin ...
  =end, ignore them instead of returning an error.

- The formatname parameter of =begin and =end paragraphs does not have
  to match anymore.

- Should there be any =back or =end paragraphs without a matching
  =over or =begin, ignore them instead of returning an error.

- Even though data paragraphs are constructed by the processor itself
  and should never occur naturally inside the raw pod blocks being
  processed, handle them anyways -- this means that they get ignored
  everywhere, except inside =begin ... =end blocks, where they're just
  added to the processed block. This is a minor detail, but still
  worth pointing out.

[1]: bscan#142 (comment)
[2]: https://perldoc.perl.org/perlpod#=over-indentlevel

Signed-off-by: Max R. Carrara <[email protected]>
Max R. Carrara added 3 commits January 7, 2025 16:38
.. as suggested in PR bscan#142 [1].

[1]: bscan#142 (comment)

Signed-off-by: Max R. Carrara <[email protected]>
Due to a tiny logic error, no docstrings would be rendered for modules
like File::Basename or File::Find::Rule. This would happen if no
symbol name set (which means that just the entire module's POD should
be returned).

Signed-off-by: Max R. Carrara <[email protected]>
@Aequitosh
Copy link
Author

Hello! Was quicker than I thought. The commit messages should speak for themselves, but in summary, I addressed all of your points.

Regarding docstrings not popping up for e.g. File::Basename -- that was actually a small logic error on my part. Perhaps I could clean that part up more overall, but I'll refrain from doing that for the time being, I don't want to dump more commits onto you to look at.

Also, the docstrings look much better now with syntax highlighting in verbatim paragraphs, like you suggested:
image

(The two empty lines (instead of one) are a Neovim thing by the way, it hides the backticks for readability unless your cursor goes over the line containing them.)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants