Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Warn on duplicate or conflicting YAML metadata key-values within a metadata block #10312

Closed
gwern opened this issue Oct 20, 2024 · 11 comments
Closed

Comments

@gwern
Copy link
Contributor

gwern commented Oct 20, 2024

Pandoc currently silently allows key-value metadata fields which are redundant or conflicting within a YAML metadata block. This allows errors or unintended metadata risk to build up over time.

I suggest that Pandoc ought to warn about there being more than 1 instance of a key within a YAML metadata block (but not across multiple YAML metadata blocks).


The current Pandoc metadata block documentation doesn't say much about how fields are required to be unique or non-overlapping, saying just

A document may contain multiple metadata blocks. If two metadata blocks attempt to set the same field, the value from the second block will be taken.

Currently, Pandoc appears to read values in like a Data.Map.fromList operation: the last pair wins, overriding all previous key-values.

Example of a conflicting pair of key-values:

---
title: title
key: value1
key: value2
...

Hello world!

yields value2, and no warnings:

$ xclip -o | pandoc -s -w native
Pandoc
  Meta
    { unMeta =
        fromList
          [ ( "key" , MetaInlines [ Str "value2" ] )
          , ( "title" , MetaInlines [ Str "title" ] )
          ]
    }
  [ Para [ Str "Hello" , Space , Str "world!" ] ]

(And if it's just a redundant key, like key: value2 twice, then it yields value2, also no warnings.)

I was trying out a new site feature involving the metadata, and discovered I had ~10 instances, built up over the past decades, where a Pandoc YAML metadata field was either: 1. required but missing; 2. duplicated twice exactly; or 3. had two keys with different conflicting values (only one of which was right). Pandoc obviously hadn't warned or hinted about them at all, or else I would've fixed them long ago.

Now, enforcing required metadata is out of scope for Pandoc and should be handled by another tool like the build script, but the other 2 should be handled by Pandoc.

They have to be handled by Pandoc because it is difficult to handle them as a user. They do not appear in the Pandoc API: the unMeta map has already erased the duplicates. So there is no way to add a check or lint easily at the Pandoc user level. It needs to be done while reading the original YAML.

One possibility would be to change the fromList to append, but this would probably be confusing and no one would use it. So it would be more useful to just warn about the cases of redundant keys.

When might we want redundant or conflicting keys in a metadata block?

I have a hard time thinking of any legitimate usecase for having the exact same key-value pair twice in the same metadata block. It's easy to imagine that it would be useful for having redundant pairs across multiple metadata blocks, for templating or defaults, that sort of thing, but not within the same metadata block. So I think warning on a redundant pair is a very safe warning to add for what is almost certainly an oversight or error.

Conflicting keys are a little trickier. Again, you obviously might need it across blocks, but within blocks? It's possible that there are systems which try to munge metadata blocks and inject user metadata settings after a default, or vice-versa. But it seems like anything you would do like that where the redundancy/override is the goal, you could do more safely and more cleanly by using multiple metadata blocks?

For example, instead of having multiple title or key values in a single metadata block, just write multiple ones in a clean, compact, easier to generate & read way, like:

---
title: Template Title
key: template-value
...

---
key: my-value
...

Hello world!

which yields the expected AST:

$ xclip -o | pandoc -s -w native
Pandoc
  Meta
    { unMeta =
        fromList
          [ ( "key" , MetaInlines [ Str "my-value" ] )
          , ( "title"
            , MetaInlines [ Str "Template" , Space , Str "Title" ]
            )
          ]
    }
  [ Para [ Str "Hello" , Space , Str "world!" ] ]

So a warning there is useful to encourage systems to move to a safer way of handling metadata in general, by using metadata defined in other blocks or in other files (like the --metadata-file option) .

@jgm
Copy link
Owner

jgm commented Oct 21, 2024

I would certainly like to be able to issue a warning in such cases, but we're using the Hackage yaml package, and it would be the one that would have to generate this warning. It doesn't.

@jgm
Copy link
Owner

jgm commented Oct 21, 2024

Interestingly, yaml has (as of 0.10) added
https://hackage.haskell.org/package/yaml-0.11.11.2/docs/Data-Yaml.html#v:decodeFileWithWarnings
which gives us warnings -- but this is only available in a version that takes a file path, not in a version that takes a string or bytestring of the YAML, so it wouldn't help us.

@jgm
Copy link
Owner

jgm commented Oct 21, 2024

We could maybe use
https://hackage.haskell.org/package/yaml-0.11.11.2/docs/Data-Yaml.html#v:decodeHelper
with conduit primitives.

@jgm
Copy link
Owner

jgm commented Oct 21, 2024

Using decodeHelper_ it should be fairly straightforward; something ilke

Data.Yaml.Internal.decodeHelper_ Data.Libyaml.decode

@gwern
Copy link
Contributor Author

gwern commented Oct 21, 2024

I am not entirely sure the underlying warning there is right:

https://hackage.haskell.org/package/yaml-0.11.11.2/docs/src/Data.Yaml.Internal.html#local-6989586621679083465

There's no docs here explaining what this warning is supposed to do, and the codebase is unfamiliar, but if I'm reading this right, this would detect conflicting keys (by finding a key-value pair in the original Set not in the final Set?) but it would not detect redundant keys (because then the redundancy would have disappeared when the original Set was created)?

@jgm
Copy link
Owner

jgm commented Oct 21, 2024

This works:

import Data.ByteString (ByteString)
import Data.Yaml.Internal (decodeHelper_, Warning, ParseException)
import Data.Aeson (FromJSON)
import qualified Text.Libyaml as Y

decodeWarn :: FromJSON a
           => ByteString -> IO (Either ParseException ([Warning], a))
decodeWarn = decodeHelper_ . Y.decode

IT does give you duplicate key warnings, e.g.

[DuplicateKey [Key "foo"]]

@jgm
Copy link
Owner

jgm commented Oct 21, 2024

Oh, I see, you're more worried about the case where you have multiple YAML metadata sections in the same document? I read too hastily. Yes, that part is handled by pandoc.

@gwern
Copy link
Contributor Author

gwern commented Oct 21, 2024

Ah, so DuplicateKey here is simply ignoring the value part, so regardless whether a key-value pair is redundant or it is conflicting (based on the value part), the duplicate key is detected? Then yeah, I think that decode approach should work.

(As long as the different YAML blocks are handled separately (otherwise we risk spurious warnings from reasonable metadata combination uses as I outlined), but Pandoc presumably does the obvious thing of reading them separately & then merging them later on for use and that would avoid any issues here.)

@jgm
Copy link
Owner

jgm commented Oct 21, 2024

That's right. OK, I will plan to implement this warning when I get a chance.

@jgm
Copy link
Owner

jgm commented Oct 21, 2024

Wait, never mind, this won't work because it's in IO. We have to be able to do YAML parsing outside IO. It may be possible to rewrite the function so it doesn't need IO, but it would take some understanding of conduit and the libyaml and yaml libraries.

@jgm
Copy link
Owner

jgm commented Oct 21, 2024

It turns out that yaml's non-IO function decodeEither' simply calls unsafePerformIO on the IO function, so we can do the same I guess!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants