-
Notifications
You must be signed in to change notification settings - Fork 32
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix malformed clobs #130
Fix malformed clobs #130
Conversation
ion/tokenizer.go
Outdated
|
||
func isClobWithInvalidChar(clob bool, c int) bool { | ||
// The string in clob may only contain 7-bit ASCII characters. | ||
return clob && c > 0x7f |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The clob section of the primary spec page says:
The string may only contain legal 7-bit ASCII characters, using the same escaping syntax as string and symbol values.
However, there's an important clarification buried on the Strings and Clobs page:
Non-printable ASCII and non-ASCII Unicode code points are not allowed un-escaped in the string bodies.
This means that ASCII characters below 32
/0x20
/space
must also be escaped because they are non-printable. The same page provides a listing of alternative escapes (BEL
/ 0x07
/ \a
, for example).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We have already done that in an earlier pull request; it is isProhibitedControlChar()
function.
Both of these checks are done on both readString()
and readLongString()
.
I did not combine them into one function, as my understanding was string
can hold characters beyond \x7f
- Unless I am wrong. I do not see any mention of 7-bit ASCII limitation for string in the doc and also looked at some of the string examples in good directory where they contain 8-bit characters.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We have already done that in an earlier pull request; it is isProhibitedControlChar() function.
Thanks for pointing that out. PR reviews always involve looking at the codebase through a keyhole -- only the items in the diff are brought to my attention.
I'd still change the name of isClobWithInvalidChar
. It's really checking to see if the current character is ASCII or not, which is a subset of checking for clob character validity. Maybe isClobWithNonAsciiChar
?
I did not combine them into one function, as my understanding was string can hold characters beyond \x7f - Unless I am wrong.
No, you're right. The string
type holds unicode code points and so can store non-ascii character data. I don't think you need to combine the two functions, I just think the isClobWithInvalidChar
function should have a name which describes its purpose more crisply.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will update the name to isClobWithNonAsciiChar
I think we can close #130?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I meant this issue: amazon-ion/ion-docs#130
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, sorry. I just noticed the issue is opened in ion-doc repo
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh! It's the same issue number in a different repo. What a fun coincidence.
Oh, sorry. I just noticed the issue is opened in ion-doc repo
Yeah, the ion-docs
issue is reorganize that section of the spec so the "non-displayable" restriction on clob ASCII characters is more easily discoverable.
ion/tokenizer.go
Outdated
@@ -1379,3 +1383,15 @@ func isNewLineChar(c int) bool { | |||
return c == 0x0A || //new line | |||
c == 0x0D //carriage return | |||
} | |||
|
|||
func isClobWithInvalidChar(clob bool, c int) bool { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Throughout: I suggest using is____
for bool
variable names.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would you rather just cutting is
prefix from all the following functions:
isProhibitedControlChar()
isStringWhitespace()
isNewLineChar()
isClobWithNonAsciiChar()
(All these methods are added either in this PR or #58)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm suggesting changing the boolean variable clob
to isClob
for readability. Removing the is
from those functions seems to be going in the opposite direction?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I thought you were talking about keeping is___
only for variable names and not for function names.
I will updated the clob
variable to isClob
- I was following the existing clob
variable.
ion/skipper.go
Outdated
// ensureNoCommentsHandler is a commentHandler that returns an | ||
// error if any comments are found, else no error is returned. | ||
func (t *tokenizer) ensureNoCommentsHandler() (bool, error) { | ||
return false, &UnexpectedTokenError{"No Comments are allowed", t.Pos() - 1} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This error message might be confusing without some additional context:
return false, &UnexpectedTokenError{"No Comments are allowed", t.Pos() - 1} | |
return false, &UnexpectedTokenError{"Comments are not allowed within long-form strings.", t.Pos() - 1} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Isn't it the case that we can have comments in long-form string?
( '''hello ''' // Sexp with one element
'''world!''' )
And the restriction is only on clob, I assume.
How about "Comments are not allowed within text format clob." ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(Extreme nitpick: per https://github.com/golang/go/wiki/CodeReviewComments#error-strings error strings in go should not start with a capitalized word or end in punctuation.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Isn't it the case that we can have comments in long-form string?
( '''hello ''' // Sexp with one element '''world!''' )
And the restriction is only on clob, I assume.
How about "Comments are not allowed within text format clob." ?
Yes, that's correct. Since the binary format doesn't have comments, I think you can simplify it further to comments are not allowed within a clob
.
ion/tokenizer.go
Outdated
@@ -356,9 +356,9 @@ func (t *tokenizer) ReadValue(tok token) (string, error) { | |||
case tokenSymbolOperator, tokenDot: | |||
str, err = t.readOperator() | |||
case tokenString: | |||
str, err = t.readString() | |||
str, err = t.readString(false) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Open question: This PR adds an unlabeled boolean to several methods; is there a way we can make the meaning of function call parameters like this more obvious?
Is it worth making readClobString
, readLongClobString
, etc? I suspect the answer is no, but you have more context here than I do.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is there a way we can make the meaning function call parameters like this more obvious?
How about:
const isClob = true
const isNotClob = false
And using these two where I need to distinguish clobs from string: str, err = t.readString(isNotClob)
Is it worth making readClobString, readLongClobString, etc?
I think the majority of what is being done in these functions are similar between clob and string, except some small changes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
const isClob = true const isNotClob = false
This is a good approach. My only quibble with it is that isClob
and isNotClob
sound like open-ended boolean values rather than constants with well-known values. For example, if I read this:
str, err = t.readString(isClob)
it seems like I'll need to track down the definition of isClob
to see whether it's true
or false
. How about:
const clobText = true;
const nonClobText = false;
The downside is you have booleans that don't "look like" booleans (they don't start with is
), but the upside is that the function calls end up being more self-documenting:
str, err = t.readString(clobText);
// or
str, err = t.readString(nonClobText);
Thoughts?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I think clobText
and nonClobText
are better options, specially given we will name our input variable isClob
.
Will update the PR.
ion/skipper.go
Outdated
// ensureNoCommentsHandler is a commentHandler that returns an | ||
// error if any comments are found, else no error is returned. | ||
func (t *tokenizer) ensureNoCommentsHandler() (bool, error) { | ||
return false, &UnexpectedTokenError{"No Comments are allowed", t.Pos() - 1} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(Extreme nitpick: per https://github.com/golang/go/wiki/CodeReviewComments#error-strings error strings in go should not start with a capitalized word or end in punctuation.)
ion/tokenizer.go
Outdated
@@ -583,7 +583,7 @@ func (t *tokenizer) readString() (string, error) { | |||
return "", err | |||
} | |||
// -1 denotes EOF, and new lines are not allowed in short string | |||
if c == -1 || c == '\n' || isProhibitedControlChar(c) { | |||
if c == -1 || c == '\n' || isProhibitedControlChar(c) || isClobWithInvalidChar(clob, c) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
... || (clob && !isASCII(c))
would read more naturally to me here 🤷.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
... || (clob && !isASCII(c))
would read more naturally to me here 🤷.
This would work for me too.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will change.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good! 👍
This is to fix various cases where a
clob
should be considered invalid:\u
and\U
in this commit.