Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix malformed clobs #130

Merged
merged 6 commits into from
Aug 13, 2020
Merged

Fix malformed clobs #130

merged 6 commits into from
Aug 13, 2020

Conversation

R-maan
Copy link
Contributor

@R-maan R-maan commented Aug 12, 2020

This is to fix various cases where a clob should be considered invalid:

ion/tokenizer.go Outdated

func isClobWithInvalidChar(clob bool, c int) bool {
// The string in clob may only contain 7-bit ASCII characters.
return clob && c > 0x7f
Copy link
Contributor

@zslayton zslayton Aug 12, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The clob section of the primary spec page says:

The string may only contain legal 7-bit ASCII characters, using the same escaping syntax as string and symbol values.

However, there's an important clarification buried on the Strings and Clobs page:

Non-printable ASCII and non-ASCII Unicode code points are not allowed un-escaped in the string bodies.

This means that ASCII characters below 32/0x20/space must also be escaped because they are non-printable. The same page provides a listing of alternative escapes (BEL / 0x07 / \a, for example).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We have already done that in an earlier pull request; it is isProhibitedControlChar() function.

Both of these checks are done on both readString() and readLongString().

I did not combine them into one function, as my understanding was string can hold characters beyond \x7f - Unless I am wrong. I do not see any mention of 7-bit ASCII limitation for string in the doc and also looked at some of the string examples in good directory where they contain 8-bit characters.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We have already done that in an earlier pull request; it is isProhibitedControlChar() function.

Thanks for pointing that out. PR reviews always involve looking at the codebase through a keyhole -- only the items in the diff are brought to my attention.

I'd still change the name of isClobWithInvalidChar. It's really checking to see if the current character is ASCII or not, which is a subset of checking for clob character validity. Maybe isClobWithNonAsciiChar?

I did not combine them into one function, as my understanding was string can hold characters beyond \x7f - Unless I am wrong.

No, you're right. The string type holds unicode code points and so can store non-ascii character data. I don't think you need to combine the two functions, I just think the isClobWithInvalidChar function should have a name which describes its purpose more crisply.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will update the name to isClobWithNonAsciiChar

I think we can close #130?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we can close #130?

#130 is this PR. Did you mean to post a different number?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I meant this issue: amazon-ion/ion-docs#130

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, sorry. I just noticed the issue is opened in ion-doc repo

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh! It's the same issue number in a different repo. What a fun coincidence.

Oh, sorry. I just noticed the issue is opened in ion-doc repo

Yeah, the ion-docs issue is reorganize that section of the spec so the "non-displayable" restriction on clob ASCII characters is more easily discoverable.

ion/tokenizer.go Outdated
@@ -1379,3 +1383,15 @@ func isNewLineChar(c int) bool {
return c == 0x0A || //new line
c == 0x0D //carriage return
}

func isClobWithInvalidChar(clob bool, c int) bool {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Throughout: I suggest using is____ for bool variable names.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would you rather just cutting is prefix from all the following functions:

isProhibitedControlChar()
isStringWhitespace() 
isNewLineChar()  
isClobWithNonAsciiChar()

(All these methods are added either in this PR or #58)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm suggesting changing the boolean variable clob to isClob for readability. Removing the is from those functions seems to be going in the opposite direction?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought you were talking about keeping is___ only for variable names and not for function names.

I will updated the clob variable to isClob - I was following the existing clob variable.

ion/skipper.go Outdated
// ensureNoCommentsHandler is a commentHandler that returns an
// error if any comments are found, else no error is returned.
func (t *tokenizer) ensureNoCommentsHandler() (bool, error) {
return false, &UnexpectedTokenError{"No Comments are allowed", t.Pos() - 1}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This error message might be confusing without some additional context:

Suggested change
return false, &UnexpectedTokenError{"No Comments are allowed", t.Pos() - 1}
return false, &UnexpectedTokenError{"Comments are not allowed within long-form strings.", t.Pos() - 1}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn't it the case that we can have comments in long-form string?

( '''hello '''     // Sexp with one element
  '''world!'''  )

And the restriction is only on clob, I assume.
How about "Comments are not allowed within text format clob." ?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(Extreme nitpick: per https://github.com/golang/go/wiki/CodeReviewComments#error-strings error strings in go should not start with a capitalized word or end in punctuation.)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn't it the case that we can have comments in long-form string?

( '''hello '''     // Sexp with one element
  '''world!'''  )

And the restriction is only on clob, I assume.
How about "Comments are not allowed within text format clob." ?

Yes, that's correct. Since the binary format doesn't have comments, I think you can simplify it further to comments are not allowed within a clob.

ion/tokenizer.go Outdated
@@ -356,9 +356,9 @@ func (t *tokenizer) ReadValue(tok token) (string, error) {
case tokenSymbolOperator, tokenDot:
str, err = t.readOperator()
case tokenString:
str, err = t.readString()
str, err = t.readString(false)
Copy link
Contributor

@zslayton zslayton Aug 12, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Open question: This PR adds an unlabeled boolean to several methods; is there a way we can make the meaning of function call parameters like this more obvious?

Is it worth making readClobString, readLongClobString, etc? I suspect the answer is no, but you have more context here than I do.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is there a way we can make the meaning function call parameters like this more obvious?

How about:

const isClob = true
const isNotClob = false

And using these two where I need to distinguish clobs from string: str, err = t.readString(isNotClob)

Is it worth making readClobString, readLongClobString, etc?

I think the majority of what is being done in these functions are similar between clob and string, except some small changes.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

const isClob = true
const isNotClob = false

This is a good approach. My only quibble with it is that isClob and isNotClob sound like open-ended boolean values rather than constants with well-known values. For example, if I read this:

str, err = t.readString(isClob)

it seems like I'll need to track down the definition of isClob to see whether it's true or false. How about:

const clobText = true;
const nonClobText = false;

The downside is you have booleans that don't "look like" booleans (they don't start with is), but the upside is that the function calls end up being more self-documenting:

str, err = t.readString(clobText);
// or
str, err = t.readString(nonClobText);

Thoughts?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I think clobText and nonClobText are better options, specially given we will name our input variable isClob.

Will update the PR.

ion/skipper.go Outdated
// ensureNoCommentsHandler is a commentHandler that returns an
// error if any comments are found, else no error is returned.
func (t *tokenizer) ensureNoCommentsHandler() (bool, error) {
return false, &UnexpectedTokenError{"No Comments are allowed", t.Pos() - 1}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(Extreme nitpick: per https://github.com/golang/go/wiki/CodeReviewComments#error-strings error strings in go should not start with a capitalized word or end in punctuation.)

ion/tokenizer.go Outdated
@@ -583,7 +583,7 @@ func (t *tokenizer) readString() (string, error) {
return "", err
}
// -1 denotes EOF, and new lines are not allowed in short string
if c == -1 || c == '\n' || isProhibitedControlChar(c) {
if c == -1 || c == '\n' || isProhibitedControlChar(c) || isClobWithInvalidChar(clob, c) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

... || (clob && !isASCII(c)) would read more naturally to me here 🤷.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

... || (clob && !isASCII(c)) would read more naturally to me here 🤷.

This would work for me too.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will change.

@R-maan R-maan requested a review from zslayton August 13, 2020 19:05
Copy link
Contributor

@zslayton zslayton left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good! 👍

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants