Adopting the pandoc-citeproc markdown citation format #2

Closed
dhimmel opened this issue Jun 30, 2017 · 15 comments

Comments

@dhimmel
Member

dhimmel commented Jun 30, 2017

Currently we cite multiple documents like:

Several groups [@doi:10.1371/journal.pone.0032235 @doi:10.1109/TCBB.2014.2343960 @doi:10.1038/srep11476] initiated

Prior to pandoc, this gets converted to:

Several groups [@1AlhRKQbe; @ZzaRyGuJ; @UpFrhdJf] initiated

Then post pandoc conversion, it will look like:

Several groups [30,192,193] initiated

Note how we have to add semicolons to separate each reference. We figured this out at lierdakil/pandoc-crossref#110. It would be nice to align our format with the pandoc-citeproc format. This presumably would also allow us to make non-bracketed citations like:

@doi:10.1371/journal.pone.0032235 was the first group 

This would presumably render to

Qi et al 2012 was the first group

However, I haven't found the actual docs for the markdown citation formatting supported by pandoc-citeproc (docs). Tagging @lierdakil and @slochower in case they have any insights.
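For illustration, the semicolon insertion described above could be sketched as follows. This is only a sketch, not the actual build code: it assumes the current format, where a bracket holds nothing but whitespace-separated `@` citations, and it skips the conversion of raw citations to short tags. The name `add_semicolons` is hypothetical.

```python
import re

# Assumption: a citation group is a bracket whose content starts with '@'
# and holds only whitespace-separated citations (the current format).
BRACKETED = re.compile(r"\[(@[^\[\]]+)\]")


def add_semicolons(text):
    """Join whitespace-separated citations in each bracket with '; '."""
    def _join(match):
        citations = match.group(1).split()
        return "[" + "; ".join(citations) + "]"
    return BRACKETED.sub(_join, text)


line = (
    "Several groups [@doi:10.1371/journal.pone.0032235 "
    "@doi:10.1109/TCBB.2014.2343960 @doi:10.1038/srep11476] initiated"
)
converted = add_semicolons(line)
print(converted)
```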

@dhimmel
Member Author

dhimmel commented Jun 30, 2017

@jgm (creator of pandoc-citeproc): could you point us to where we can find more info on the syntax for markdown citations?

@dhimmel
Member Author

dhimmel commented Jun 30, 2017

Just found the following in the pandoc docs. Converted to markdown using this cool tool.


Citations go inside square brackets and are separated by semicolons. Each citation must have a key, composed of '@' + the citation identifier from the database, and may optionally have a prefix, a locator, and a suffix. The citation key must begin with a letter, digit, or _, and may contain alphanumerics, _, and internal punctuation characters (:.#$%&-+?<>~/). Here are some examples:

Blah blah [see @doe99, pp. 33-35; also @smith04, chap. 1].

Blah blah [@doe99, pp. 33-35, 38-39 and *passim*].

Blah blah [@smith04; @doe99].

pandoc-citeproc detects locator terms in the CSL locale files. Either abbreviated or unabbreviated forms are accepted. In the en-US locale, locator terms can be written in either singular or plural forms, as book, bk./bks.; chapter, chap./chaps.; column, col./cols.; figure, fig./figs.; folio, fol./fols.; number, no./nos.; line, l./ll.; note, n./nn.; opus, op./opp.; page, p./pp.; paragraph, para./paras.; part, pt./pts.; section, sec./secs.; sub verbo, s.v./s.vv.; verse, v./vv.; volume, vol./vols.; ¶/¶¶; §/§§. If no locator term is used, "page" is assumed.

A minus sign (-) before the @ will suppress mention of the author in the citation. This can be useful when the author is already mentioned in the text:

Smith says blah [-@smith04].

You can also write an in-text citation, as follows:

@smith04 says blah.

@smith04 [p. 33] says blah.

So I think this answers some of our questions about what is valid syntax. Next step will be to see if we can find the regexes to extract (and then replace in our case) the citations.
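A first attempt at such a regex in Python, based only on the key rule quoted above. This is a sketch, not pandoc's actual parser: it reads "internal" punctuation as punctuation that must be followed by an alphanumeric or `_`, and it only handles ASCII. The helper name `extract_citation_keys` is made up for this example.

```python
import re

# Assumption: a key starts with a letter, digit, or '_'; punctuation from
# the documented set is allowed only when followed by an alphanumeric or '_',
# so a key can never end in punctuation. ASCII only.
CITATION = re.compile(
    r"@([A-Za-z0-9_](?:[A-Za-z0-9_]|[:.#$%&\-+?<>~/](?=[A-Za-z0-9_]))*)"
)


def extract_citation_keys(text):
    """Return citation keys (without the leading '@') in order of appearance."""
    return CITATION.findall(text)


example = "Blah blah [see @doe99, pp. 33-35; also @smith04, chap. 1]."
keys = extract_citation_keys(example)
print(keys)
```

Note that this would also match things like email addresses, so a real implementation would need more context checks.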

@dhimmel
Member Author

dhimmel commented Jun 30, 2017

One difficulty that we may face is that DOIs can include some prohibited characters, meaning that we can't use identical regex to pandoc-citeproc.

@slochower
Collaborator

To be clear: is your goal to allow authors to use pandoc-citeproc syntax but still use things like @doi... to cite things on the fly?

@dhimmel
Copy link
Member Author

dhimmel commented Jun 30, 2017

> To be clear: is your goal to allow authors to use pandoc-citeproc syntax but still use things like @doi... to cite things on the fly?

Exactly, for two reasons:

  1. ideally all valid pandoc-citeproc syntax will still work. In other words, we could do:

    Blah blah [see @doi:10.1371/journal.pone.0032235, pp. 33-35; also @doi:10.1109/TCBB.2014.2343960, chap. 1].
    
  2. it's counter-productive to maintain a separate citation syntax. It's most harmonious if we don't deviate from pandoc-citeproc.

However, given the weird characters that can be in DOIs and URLs, I'm not sure this will be possible? We could always require using tags for certain problematic IDs.

@slochower
Collaborator

Right. So is the first step using the DOI regex to replace all @doi... strings with tags but leaving prefixes and suffixes still associated with those tags, and then using pandoc-citeproc to render the tags with prefixes and suffixes into citations? Maybe I should ask, what happens if you currently write:

Blah blah [see @doi:10.1371/journal.pone.0032235, pp. 33-35]

Is it broken?

@lierdakil

As far as I can tell, the main problem with this whole thing is that DOIs can include semicolons, while Pandoc citation keys can't, because the semicolon is the citation separator.
So I think your best bet is to use some other separator. Spaces are problematic in the presence of arbitrary prefixes and suffixes, because they are impossible to parse: in [see @doi:... also @doi:...], would "also" be a suffix of the first citation or a prefix of the second one? Some other character or string can work just fine. My first thought is using several semicolons, but it obviously needs more thought.
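To make the semicolon problem concrete, here is a small Python sketch. The DOIs are made-up placeholders, `split_citations` is a hypothetical helper, and the double-semicolon separator is only the first thought mentioned above, not a settled design.

```python
def split_citations(bracket_body, sep=";"):
    """Split the inside of a citation bracket on a separator string."""
    return [part.strip() for part in bracket_body.split(sep) if part.strip()]


# Hypothetical DOI containing a semicolon, followed by a second citation.
naive = split_citations("@doi:10.1000/abc;2-x; @doi:10.1000/def")
print(naive)  # the first DOI is cut in two

# The same two citations with ';;' as the separator.
doubled = split_citations("@doi:10.1000/abc;2-x;; @doi:10.1000/def", sep=";;")
print(doubled)
```

A DOI that itself contains two consecutive semicolons would still break the second variant.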

@slochower
Collaborator

@dhimmel thoughts on next steps or division of labor?

@dhimmel
Member Author

dhimmel commented Jul 7, 2017

> @dhimmel thoughts on next steps or division of labor?

I think the immediate priority is #1. You should go ahead and submit a PR that adds support for pandoc-fignos and pandoc-tablenos. Also add examples in the manuscript to show how it works.

As for this issue, I'm not sure what the best solution is. Here's what I'm thinking. If we're going to change our citation syntax, sooner is better. If you want to pursue this, the next step will be to precisely define valid citations as a subset of valid pandoc-citeproc citations. We would need to build a regex to match:

The citation key must begin with a letter, digit, or _, and may contain alphanumerics, _, and internal punctuation characters (:.#$%&-+?<>~/).

In the case of DOIs or URLs that would be invalid pandoc-citeproc citations, these would have to use a tag. It would be helpful to scan all standard citations from the deep review and see what percent are invalid at the moment. URLs ending in / seem like they would commonly break.
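A rough sketch of that scan in Python, assuming the citation keys have already been collected into a list. The regex is one reading of the rule quoted above (punctuation counts as "internal" only when followed by an alphanumeric or `_`, ASCII only), and both function names and the sample identifiers are made up for illustration.

```python
import re

# Assumption: one interpretation of the documented key rule, ASCII only.
KEY = re.compile(
    r"[A-Za-z0-9_](?:[A-Za-z0-9_]|[:.#$%&\-+?<>~/](?=[A-Za-z0-9_]))*"
)


def is_valid_pandoc_key(key):
    """True if the whole key matches the assumed pandoc-citeproc key rule."""
    return KEY.fullmatch(key) is not None


def percent_invalid(keys):
    """Percentage of keys that would need a tag instead of a raw citation."""
    keys = list(keys)
    if not keys:
        return 0.0
    invalid = [key for key in keys if not is_valid_pandoc_key(key)]
    return 100 * len(invalid) / len(keys)


# Placeholder identifiers, not real references.
sample = [
    "doi:10.1371/journal.pone.0032235",  # valid
    "tag:deep-review",                   # valid
    "doi:10.1000/example(1)2",           # parentheses are not allowed
    "url:https://example.org/",          # trailing slash
]
print(percent_invalid(sample))
```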

We should make sure we think this revised format will be superior. It'll be nice to align with pandoc-citeproc, but the semicolons will be mildly annoying. The more versatile references, e.g. outside of brackets, would be a plus.

@slochower
Collaborator

> I think the immediate priority is #1. You should go ahead and submit a PR that adds support for pandoc-fignos and pandoc-tablenos. Also add examples in the manuscript to show how it works.

Working on it. Logistical question: if I set up a separate Travis build on my fork to test the changes before creating a PR, then I also need to change things in ci/, like the deploy key, so Travis will be happy. But if I do that and commit the changes, does it get tricky to merge my commits into a PR with this repository, since I don't want to change the deploy key here? I haven't encountered this issue before.

@dhimmel
Member Author

dhimmel commented Jul 7, 2017

@slochower I wouldn't change any Travis settings for your fork. Your pull request will get built... but not deployed so you won't actually see the PDF / HTML output.

Instead you can run sh build/build.sh locally (assuming you've got the environment activated). Use this to check that your manuscript builds work as expected. Let me know if you're not on Linux.

@dhimmel
Member Author

dhimmel commented Jul 22, 2017

@tarleb this issue is one of the most important outstanding issues. In short, our citation syntax currently differs from the pandoc citation syntax. We're hoping to adhere more closely to the pandoc syntax, so we can rely more heavily on pandoc's citation features.

The only additional feature we add on top of pandoc is that raw citations (for example to a DOI like @doi:10.7287/peerj.preprints.3100) trigger metadata retrieval and citeproc CSL JSON creation for that reference.

@tarleb, since you seem to know the pandoc codebase, perhaps you could help us with the following question:

  1. What is the regex (or series of processing steps) that pandoc uses to extract references?

The goal here is to extract the citation keys for further processing. We want to mirror the same process pandoc uses to extract these keys.

@tarleb

tarleb commented Jul 22, 2017

My suggestion would be to go the other way: match on DOI URIs, percent-encode any ; in those, and use a pandoc filter (e.g., with panflute) to do the rest of the processing. You'll have to call pandoc-citeproc explicitly and need to make sure that it runs after your other filter.
I can elaborate on this when I find the time, but maybe I misunderstood some crucial details again ;)
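The percent-encoding step might look roughly like this in Python. It is a sketch under several assumptions: citations are written as `@doi:10.…`, a semicolon that ends the token is the pandoc separator, and separators are always followed by whitespace or the closing bracket. The function name and the DOIs are hypothetical, and the pandoc filter part is not shown.

```python
import re

# Assumption: a DOI citation runs from '@doi:10.' up to the next whitespace
# or closing bracket.
DOI_CITATION = re.compile(r"@doi:10\.[0-9]{4,9}/[^\s\]]+")


def encode_doi_semicolons(text):
    """Percent-encode semicolons inside DOI citations, keeping separators."""
    def _encode(match):
        token = match.group(0)
        # Only semicolons followed by another character are inside the DOI;
        # a trailing ';' is treated as the citation separator and kept.
        return re.sub(r";(?=.)", "%3B", token)
    return DOI_CITATION.sub(_encode, text)


encoded = encode_doi_semicolons("[@doi:10.1000/abc;2-x; @doi:10.1000/def]")
print(encoded)
```

Since `%` is among the internal punctuation characters allowed in citation keys, the encoded key should survive pandoc's parsing, but whatever retrieves the metadata would need to decode `%3B` back to `;` first.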

dhimmel added a commit to dhimmel/manubot that referenced this issue Jul 30, 2017
dhimmel added a commit to dhimmel/manubot that referenced this issue Aug 7, 2017
dhimmel added a commit to manubot/manubot that referenced this issue Aug 7, 2017
@agitter
Member

agitter commented Aug 11, 2017

@dhimmel Did #48 close this or are there still more changes needed?

@dhimmel
Member Author

dhimmel commented Aug 11, 2017

> Did #48 close this

Yes.

One more thing I'd like to do is enable warnings if something like [@tag:a @tag:b] is detected, since the authors likely meant [@tag:a; @tag:b].
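A possible starting point for that check, sketched in Python. It assumes a citation group is any bracket containing an `@`, and it flags a group when one semicolon-delimited segment holds more than one `@` key. The function name is made up, and turning the result into actual warnings is left out.

```python
import re

# Assumption: a citation group is a non-nested bracket containing '@'.
BRACKET = re.compile(r"\[([^\[\]]*@[^\[\]]*)\]")
KEY_START = re.compile(r"@\w")


def find_missing_semicolons(text):
    """Return bracketed groups where two citations share one segment."""
    suspicious = []
    for match in BRACKET.finditer(text):
        segments = match.group(1).split(";")
        if any(len(KEY_START.findall(seg)) > 1 for seg in segments):
            suspicious.append(match.group(0))
    return suspicious


flagged = find_missing_semicolons("Blah [@tag:a @tag:b] and [@tag:a; @tag:b].")
print(flagged)
```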
