Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

SimpleMRS codec fails when predicates contain colons #302

Closed
goodmami opened this issue Jul 17, 2020 · 6 comments
Closed

SimpleMRS codec fails when predicates contain colons #302

goodmami opened this issue Jul 17, 2020 · 6 comments
Labels
Milestone

Comments

@goodmami
Copy link
Member

goodmami commented Jul 17, 2020

The form of predicates is very permissive. From the PredicateRfc wiki page:

The lemma field of a surface pred may be just about anything that does not contain underscores or spaces.

The "just about" leaves room for interpretation. What I settled on before was a sequence of anything that wasn't a space, newline ], or a < followed by something that looks like a lnk value.

(r'(?:[^ \n\]<]+'
r'|<(?![-0-9:#@ ]*>\s))+', 'SYMBOL:a symbol'),

This is tripping up on some predicates (that aren't great, but it's hard to say they are invalid, either) that have ]:

_+-]\?[/NN_u_unknown_rel"<12:18>

Also, if a colon appears in the predicate, it can look like a feature (e.g., LBL:), which breaks on things like urls, etc.:

_xml:tm/NN_u_unknown<57:63>

While it would be great to have a more restrictive syntax for these things, for now PyDelphin will need to deal with them somehow, even if it's just throwing some error about an invalid predicate form.

@goodmami goodmami added the bug label Jul 17, 2020
@goodmami goodmami added this to the v1.4.0 milestone Jul 20, 2020
@oepen
Copy link

oepen commented Jul 21, 2020

hi mike,

in general, i agree: those examples are not great, but at least some may be valid syntax. looking at the PredicateRfc page, we arguably still need to make those rules more specific. in general, i believe we have two notations for MRS predicates, (a) as C-like quoted strings and (b) as 'plain' (unquoted) identifiers (traditionally called 'symbols' in lisp).

in variant (a), i would think anything goes (within any additional constraints we might impose on the fields of surface predicates) and standard escape conventions apply: \" for embedded double quotes and \\ for embedded backslashes. i am pretty sure the MRSs from the pre-2020 ERG that you are reading are generated by @sweaglesw (if that were his M$ GitHub user), and i am wondering whether we could make the overall pipeline more robust by suggesting that FFTB serialize predicates as quoted strings and, in doing so, make sure that embedded quote marks and backslashes are escaped.

for variant (b), the rules seem a bit murkier. reading the PredicateRfc page literally, the following should all be valid surface predicates _:_c_1, "_v_2, <_p_3, or _[_x_4. because the page likens the sense field to the lemma field, one could also argue for the following: <_p_<, "_p_", or even <_p_angle<. for abstract predicates, our current PredicateRfc does not seem to impose any constraints, strictly speaking?

seeing as ERG predicates for unknown words are generated on the fly, and further taking into account that we want the lemma field of a surface predicate to correspond to the actual surface string, i would be sceptical of imposing too many constraints on valid MRS predicates. but, nevertheless, we need a syntax that is well defined and parseable with adequate effort. i believe banning a wider range of characters in type-(b) predicates and generally pushing toward broader use of type-(a) quoting and escaping might allow us to attain these goals?

@goodmami
Copy link
Member Author

Thanks, Stephan,

I mostly agree with what you said. In general I think we need to consider 2 things: the allowed shape of predicates, and issues with serialization in any particular format. As this GitHub issue is specifically about SimpleMRS, I'll take the more general discussion to the message on the developers list. If you think my solution to this particular issue is not correct, we can open a new issue (as this one is already merged into a release) to fix it further.

@oepen
Copy link

oepen commented Aug 12, 2020

just fyi: i have in the meantime updated the LKB reader and writer for simple MRS serializations to accept all characters except ", [, and < in unquoted predicates and arbitrary strings when quoted using standard C conventions. upon serialization, the writer will output a quoted string iff one of the above characters is present in the predicate value.

@goodmami
Copy link
Member Author

goodmami commented Aug 12, 2020

Thanks, Stephan, that sounds reasonable. Two questions:

  • Do you also unquote quoted strings that do not contain special characters? Or is this a one-way conversion?
  • I understand " and <, but why is [ disallowed? It does not start argument roles in SimpleMRS, only in EDS (in which case you'd also need to check for { for optional properties)

@oepen
Copy link

oepen commented Aug 12, 2020

internally in the LKB (when predicate normalization is enabled, which has been the case for the ERG since 1214 and is now becoming the default in the stock LKB), all predicates are represented as strings. so serialization defaults to the unquoted format, because that seems more (human) reader-friendly to me. thus, reading and writing an MRS in the LKB will normalize the extend to which quoting is used, i.e. in practise can remove some unnecessary quotes.

regarding the range of allowed characters, you are right: my inclusion of [ betrays my intention to use the same quoting conventions for (native) EDS serialization and possibly also other (non-XML, non-JSON) MRS serializations, e.g. the indexed format. i am inclined to immediately add { to my list of characters unwelcome in unquoted predicates, and maybe ( should go on that list too?

alternatively, one could take the position that simple MRS has the most generous rules for unquoted predicates (banning only ", <, and whitespace), and other serializations add their own constraints (e.g. [ and { for EDS). i realize i ended up putting what should be a follow-up to the email thread you started into M$ GitHub ...

@goodmami
Copy link
Member Author

goodmami commented Aug 12, 2020 via email

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

2 participants