Outline enhancement request #41

Closed
danjgill opened this issue Jan 5, 2025 · 2 comments

@danjgill

danjgill commented Jan 5, 2025

Hi,

I am trying to use outlines with EPUB files, and it is not working. Chapter and PageNumber are always returned as -1. Perhaps I am missing a step for setting them up, or perhaps they simply aren't meant to work with reflowable files?

As an alternative, I tried to use links, but ran into a different problem. Here Chapter and PageNumber are correct, even when I change the font size and re-Layout the document. So good job there! However, the MuPDFLink object lacks a title, or any other property I could use to populate a combobox. I might be able to figure out the text, but every approach I take seems quite hacky.

So is it possible to either (A) get the correct Chapter and PageNumber values into the Outline (including updating them when the layout changes), or (B) add a "Title" property to the MuPDFLink object that captures the text of the link?

(A) would be preferable, as sometimes the links are very short (such as in an index).

Or maybe I am missing something obvious, such as a setup step. If so, any direction would be appreciated.

Thanks

@arklumpus
Owner

arklumpus commented Jan 6, 2025

Hi, thanks for pointing this out! It turns out that the outline items were correctly resolved to a URI like EPUB/xhtml/section0002.xhtml, but the next step (i.e., matching that URI to a page number) was missing.

I just pushed version 2.0.1, which should fix this issue; please have a look (the link destinations should also update automatically if you reflow the document with a different page/font size). This should address your part A.

For part B, the problem is that a link is not strictly associated with a text element on the page, so there is no such thing as the "text" of the link. However, a link has an ActiveArea, which represents the rectangle on which the user can click to activate it; if you really wanted to get the text associated with a link, you could use this area in combination with a MuPDFStructuredTextPage (see below).

Code example
using MuPDFContext ctx = new MuPDFContext();
using MuPDFDocument doc = new MuPDFDocument(ctx, @"C:\Users\Giorgio\Downloads\LinkExamples.pdf");

using MuPDFStructuredTextPage structTextPage = doc.GetStructuredTextPage(0);

foreach (MuPDFLink link in doc.Pages[0].Links)
{
    // Middle-left point of the link area.
    PointF linkStart = new PointF(link.ActiveArea.X0, (link.ActiveArea.Y0 + link.ActiveArea.Y1) * 0.5f);

    // Middle-right point of the link area.
    PointF linkEnd = new PointF(link.ActiveArea.X1, (link.ActiveArea.Y0 + link.ActiveArea.Y1) * 0.5f);

    // Get the closest text character to the start and end of the link area.
    MuPDFStructuredTextAddress? start = structTextPage.GetHitAddress(linkStart, false);
    MuPDFStructuredTextAddress? end = structTextPage.GetHitAddress(linkEnd, false);

    string linkText;

    // If both of them exist...
    if (start != null && end != null)
    {
        // ... then we can get the link text.
        linkText = structTextPage.GetText(new MuPDFStructuredTextAddressSpan(start.Value, end));
    }
    else
    {
        // ... otherwise, there is no useful link text we can retrieve.
        linkText = "<no text>";
    }

    string destination = link.Destination.Type switch
    {
        MuPDFLinkDestination.DestinationType.External => (link.Destination as MuPDFExternalLinkDestination)!.Uri,
        MuPDFLinkDestination.DestinationType.Internal => $"Page {(link.Destination as MuPDFInternalLinkDestination)!.PageNumber}",
        _ => "<Unknown destination>"
    };

    Console.WriteLine("\"{0}\" links to \"{1}\"", linkText, destination);
}

Here is the example PDF: LinkExamples.pdf

However, note that this may not always work as expected, depending on how the link is implemented in the document (see examples below).

Link examples
  • In example 1, the link is a single rectangle encompassing the whole link text, so everything is fine:

    image

    This produces:

    "Link text" links to "https://github.com/arklumpus/MuPDFCore"
    
  • At first sight, example 2 looks exactly the same, but in this case the text is drawn as a series of paths, rather than as actual text:

    image

    As a result, the link text cannot be extracted (unless you use the OCR feature):

    "<no text>" links to "https://github.com/arklumpus/MuPDFCore"
    
  • In example 3, each letter has its own link area:

    image

    In this case, you get a bunch of different links (the last one comes from the underline):

    "L" links to "https://github.com/arklumpus/MuPDFCore"
    "i" links to "https://github.com/arklumpus/MuPDFCore"
    "in" links to "https://github.com/arklumpus/MuPDFCore"
    "nk " links to "https://github.com/arklumpus/MuPDFCore"
    "k" links to "https://github.com/arklumpus/MuPDFCore"
    " t" links to "https://github.com/arklumpus/MuPDFCore"
    "te" links to "https://github.com/arklumpus/MuPDFCore"
    "x" links to "https://github.com/arklumpus/MuPDFCore"
    "xt" links to "https://github.com/arklumpus/MuPDFCore"
    "Link text" links to "https://github.com/arklumpus/MuPDFCore"
    
  • In example 4, the link is broken by a line end:

    image

    In this case, you get two different links:

    "Link" links to "https://github.com/arklumpus/MuPDFCore"
    "text" links to "https://github.com/arklumpus/MuPDFCore"
    

In any case, I hope this helps! Please let me know if you have any more problems or questions!

@danjgill
Author

danjgill commented Jan 7, 2025

Hi there,

I just tried out the Outline fix and can confirm that it works. Thanks for making a great resource even better!

I think I agree with your points about the Links. While it is feasible to retrieve the text, there may be too many corner cases and exceptions to make it worthwhile. I do appreciate the code above, however, as it may come in handy in the future for better understanding how both StructuredTextPages and Links work.

Thanks again for the speedy resolution.

@danjgill danjgill closed this as completed Jan 7, 2025