-
Notifications
You must be signed in to change notification settings - Fork 27
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
SimpleMRS codec fails when predicates contain colons #302
Comments
hi mike, in general, i agree: those examples are not great, but at least some may be valid syntax. looking at the PredicateRfc page, we arguably still need to make those rules more specific. in general, i believe we have two notations for MRS predicates, (a) as C-like quoted strings and (b) as 'plain' (unquoted) identifiers (traditionally called 'symbols' in lisp). in variant (a), i would think anything goes (within any additional constraints we might impose on the fields of surface predicates) and standard escape conventions apply: for variant (b), the rules seem a bit murkier. reading the PredicateRfc page literally, the following should all be valid surface predicates seeing as ERG predicates for unknown words are generated on the fly, and further taking into account that we want the lemma field of a surface predicate to correspond to the actual surface string, i would be sceptical of imposing too many constraints on valid MRS predicates. but, nevertheless, we need a syntax that is well defined and parseable with adequate effort. i believe banning a wider range of characters in type-(b) predicates and generally pushing toward broader use of type-(a) quoting and escaping might allow us to attain these goals? |
Thanks, Stephan, I mostly agree with what you said. In general I think we need to consider 2 things: the allowed shape of predicates, and issues with serialization in any particular format. As this GitHub issue is specifically about SimpleMRS, I'll take the more general discussion to the message on the developers list. If you think my solution to this particular issue is not correct, we can open a new issue (as this one is already merged into a release) to fix it further. |
just fyi: i have in the meantime updated the LKB reader and writer for simple MRS serializations to accept all characters except |
Thanks, Stephan, that sounds reasonable. Two questions:
|
internally in the LKB (when predicate normalization is enabled, which has been the case for the ERG since 1214 and is now becoming the default in the stock LKB), all predicates are represented as strings. so serialization defaults to the unquoted format, because that seems more (human) reader-friendly to me. thus, reading and writing an MRS in the LKB will normalize the extend to which quoting is used, i.e. in practise can remove some unnecessary quotes. regarding the range of allowed characters, you are right: my inclusion of alternatively, one could take the position that simple MRS has the most generous rules for unquoted predicates (banning only |
Re: normalization, that sounds good to me
Re: allowed characters, I’m afraid I find these dual-purpose expedients harmful in the long run, leading to confusion and inconsistency. Instead, as I advocated in the thread, I think we should have a universal set of escaping conventions for predicates, namely that a literal `_` is escaped to `\s` (or similar) and `\` to `\\`. Each format (simplemrs, eds, prolog-mrs, indexed mrs, etc.) may then impose additional constraints as fits their own requirements. In this view, the quotes surrounding some predicates in simplemrs should never persist in an internal representation of the predicate, they are merely artifacts of simplemrs encoding.
But yes, maybe we should return to the email thread :)
…On Aug 12, 2020, 6:29 PM +0800, Stephan Oepen , wrote:
internally in the LKB (when predicate normalization is enabled, which has been the case for the ERG since 1214 and is now becoming the default in the stock LKB), all predicates are represented as strings. so serialization defaults to the unquoted format, because that seems more (human) reader-friendly to me. thus, reading and writing an MRS in the LKB will normalize the extend to which quoting is used, i.e. in practise can remove some unnecessary quotes.
regarding the range of allowed characters, you are right: my inclusion of [ betrays my intention to use the same quoting conventions for (native) EDS serialization and possibly also other (non-XML, non-JSON) MRS serializations, e.g. the indexed format. i am inclined to immediately add { to my list of characters unwelcome in unquoted predicates, and maybe ( should go on that list too?
alternatively, one could take the position that simple MRS has the most generous rules for unquoted predicates (banning only ", <, and whitespace), and other serializations add their own constraints (e.g. [ and { for EDS). i realize i ended up putting what should be a follow-up to the email thread you started into M$ GitHub ...
—
You are receiving this because you modified the open/close state.
Reply to this email directly, view it on GitHub, or unsubscribe.
|
The form of predicates is very permissive. From the PredicateRfc wiki page:
The "just about" leaves room for interpretation. What I settled on before was a sequence of anything that wasn't a space, newline
]
, or a<
followed by something that looks like a lnk value.pydelphin/delphin/codecs/simplemrs.py
Lines 148 to 149 in b940073
This is tripping up on some predicates (that aren't great, but it's hard to say they are invalid, either) that have
]
:Also, if a colon appears in the predicate, it can look like a feature (e.g.,
LBL:
), which breaks on things like urls, etc.:While it would be great to have a more restrictive syntax for these things, for now PyDelphin will need to deal with them somehow, even if it's just throwing some error about an invalid predicate form.
The text was updated successfully, but these errors were encountered: