Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix for string values #58

Merged
merged 11 commits into from
Jun 11, 2020
6 changes: 0 additions & 6 deletions ion/integration_test.go
Original file line number Diff line number Diff line change
Expand Up @@ -110,7 +110,6 @@ var textRoundTripSkipList = []string{
"lists.ion",
"localSymbolTableImportZeroMaxId.ion",
"nonNulls.ion",
"nonNulls.ion",
"notVersionMarkers.ion",
"structWhitespace.ion",
"subfieldInt.ion",
Expand Down Expand Up @@ -204,7 +203,6 @@ var malformedIonsSkipList = []string{
"localSymbolTableWithMultipleSymbolsAndImportsFields.ion",
"localSymbolTableWithMultipleSymbolsFields.10n",
"localSymbolTableWithMultipleSymbolsFields.ion",
"longStringRawControlCharacter.ion",
"minLongWithLenTooLarge.10n",
"minLongWithLenTooSmall.10n",
"negativeIntZero.10n",
Expand All @@ -213,8 +211,6 @@ var malformedIonsSkipList = []string{
"nopPadWithAnnotations.10n",
"nullDotCommentInt.ion",
"sexpOperatorAnnotation.ion",
"stringLenTooLarge.10n",
"stringRawControlCharacter.ion",
"stringWithLatinEncoding.10n",
"structOrderedEmpty.10n",
"surrogate_1.ion",
Expand All @@ -235,7 +231,6 @@ var malformedIonsSkipList = []string{

var equivsSkipList = []string{
"annotatedIvms.ion",
"clobs.ion",
"localSymbolTableAppend.ion",
"localSymbolTableNullSlots.ion",
"localSymbolTableWithAnnotations.ion",
Expand All @@ -244,7 +239,6 @@ var equivsSkipList = []string{
"nonIVMNoOps.ion",
"sexps.ion",
"stringUtf8.ion",
"strings.ion",
"structsFieldsDiffOrder.ion",
"structsFieldsRepeatedNames.ion",
"systemSymbols.ion",
Expand Down
33 changes: 32 additions & 1 deletion ion/tokenizer.go
Original file line number Diff line number Diff line change
Expand Up @@ -542,6 +542,9 @@ func (t *tokenizer) readString() (string, error) {
if err != nil {
return "", err
}
if isInvalidChar(c) {
return "", &SyntaxError{"Invalid character", t.pos}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

t.invalidChar(c)?

}

switch c {
case -1, '\n':
Expand Down Expand Up @@ -582,20 +585,27 @@ func (t *tokenizer) readLongString() (string, error) {
if err != nil {
return "", err
}
if isInvalidChar(c) {
return "", &SyntaxError{"Invalid character", t.pos}
}

switch c {
case -1:
return "", t.invalidChar(c)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We're reporting an invalid character here when we encounter an EOF, which may be confusing. Looks like we do the same thing in readString above. In both cases, I suggest handling the EOF case immediately after read() out so we can provide a more informative error to the user.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

t.invalidChar, as previously suggested, does the right thing here (and should probably have a better name :)).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed.


case '\'':
startPosition := t.pos
ok, err := t.skipEndOfLongString(t.skipCommentsHandler)
if err != nil {
return "", err
}
if ok {
return ret.String(), nil
}

if startPosition == t.pos {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like skipEndOfLongString reports whether it consumed the string ending, so I think you can replace this position check with an else {...} attached to the if ok {...}.

wasConsumed, err := t.skipEndOfLongString(t.skipCommentsHandler)
if err != nil {
	return "", err
}
if wasConsumed {
	return ret.String(), nil
} else {
        // The ' was not part of a long string ending.
        ret.writeByte(byte(c))
}

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It'll also return false if it skipped over a ''' /* pause in the string */ ''' sequence, in which case you don't want to keep the '. :( This seems correct to me in the short term, skipEndOfLongString should probably should be refactored to return a tri-state in the longer term.

Copy link
Contributor Author

@R-maan R-maan Jun 9, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I almost refactored skipEndOfLongString but then I decided to KISS.

#61 to change skipEndOfLongString

// No character has been consumed. It is single '.
ret.WriteByte(byte(c))
}
case '\\':
c, err = t.peek()
if err != nil {
Expand Down Expand Up @@ -1263,3 +1273,24 @@ func (t *tokenizer) unread(c int) {
t.pos--
t.buffer = append(t.buffer, c)
}

func isInvalidChar(c int) bool {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This name's a bit ambiguous at the top level. Maybe isInvalidStringChar? It'd fit nicely alongside the other isXxx functions in textutils.go.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now that's embarrassing! I read this comment and thought you're suggesting to change the name of the function, totally misread it.
I originally had named it isProhibitedControlChar. Changing it back

if c < 0x00 || c > 0x1F {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It took a bit of investigation to understand why a negative int was considered a valid character. Taking a look at read(), I found that it's returning -1 instead of an EOF err like Buffer.ReadByte does (docs). I'd be in favor of refactoring this behavior to align with the standard library in the future, but in the meantime I think that isInvalidCharacter should treat EOF (-1) as an invalid character, simplifying this check to:

// Values lower than this are non-displayable ASCII characters
if c > 0x1F {
  return false;
}

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I found a couple things easier treating EOF as a character instead of an error, and it's an internal-only API so it felt like a good trade-off. It's very possible that's just my old C habits speaking though. :) Feel free to refactor it if you think it improves readability.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 for having this treat -1 as an invalid char. In both places we're calling it we're checking for -1 immediately after, which you could then remove.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Modified and created #60 to refactor read()

return false
}
if isWhiteSpaceChar(c) || isNewLineChar(c) {
return false
}
return true
}

func isWhiteSpaceChar(c int) bool {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

isWhitespace is similarly-named but works a bit differently. Maybe worth combining them? Otherwise maybe rename this one to isStringWhitespace or something like that.

return c == 0x09 || //horizontal tab
c == 0x0B || //vertical tab
c == 0x0C // form feed
}

func isNewLineChar(c int) bool {
return c == 0x0A || //new line
c == 0x0D //carriage return
}