-
-
Notifications
You must be signed in to change notification settings - Fork 656
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ledger: handling unicode #195
Comments
Hmm, lemme try working the exercise... |
It looks like a little better would be to count just runes that pass Edit: And a little better would be to use Ha! I have my program passing with descriptions of "Freude schöner Götterfunken" (those are combining characters) and "Ode To Joy 기쁨 에 송시 Hyuna". I'll hold off on posting an iteration so I don't spoil your fun. |
Huh. I was trying to use
The "pattern" mentioned in the comment would be a CLDR Number Pattern, which would do the right thing with a space between the currency symbol and the amount -- rather than hard coding a space. |
Ahhh, so we should think about wide characters as well, and this is super interesting for your example "Ode To Joy 기쁨 에 송시 Hyuna" because the cut should happen in the middle of a wide character (in the middle of that 시) . What would you say the result should be when that happens? Replace the wide character with a
or
(Oy, that doesn't even look right on my browser because the wide characters aren't exactly the width of two narrows... but it does look fine in a monospace font in my editor) |
Eh, submitted an iteration that deals with wide characters and combining characters, though I'm not confident that I'm really doing the combining characters the right way. I'm somewhat wondering whether we should have people deal with wide characters at all. The argument for it is that if we're dealing with one aspect of unicode (that some codepoints take more than one byte) then we should deal with other aspects too (that characters have varying widths). On the other hand, it could be a lot to deal with in an exercise that is mostly about refactoring some existing code. Then again, maybe it's not that bad if we write some existing code in the initial ledger.go I am almost tempted to say "this variable width business even deserves its own exercise", but I don't know. |
Cool. I uploaded an iteration at http://exercism.io/submissions/fbab39b6626b498d96e9f0466d4a5c6d. You can see I handle combining characters with Unicode is a rabbit hole. To keep the exercise focused on refactoring, we could just take out the multibyte test case and make it all ASCII. If we leave it in, we should at least handle the existing test case properly. Beyond that, I don't know. The exercise gets more complex, but I think in the real world most anyone who handles Unicode has to handle combining characters. I hear Go is popular in China. It's good stuff to be able do. |
Ah I see. I guess my solution just kinda works by chance. I also do a Given all this, maybe the best thing to do is to go with ASCII only in the ledger tests, and then possibly think of some separate exercise that will deal with Unicode. Will probably be about 24 hours before I can do the first thing so someone could feel free to beat me to it. The Unicode exercise will of course need a bit of thought. |
I think you're both right that a) unicode handling is important, and |
resolved with #199 |
While doing the ledger exercise, I noticed that both the example solution and initial implementation have a defect in how they count lines that are too long. The "Description" column of the ledger has width 25, so any string that has more than 25 characters should display the first 22 characters followed by "..."
And yet, the example solution and initial implementation both check for length greater than 27 characters, and cut at index 24 instead of 22.
https://github.com/exercism/xgo/blob/master/ledger/example.go#L44
https://github.com/exercism/xgo/blob/master/ledger/ledger.go#L98
You may ask, is it something to do with the padding in the string
" | "
that separates each column? No, it's not even that because there is in fact a test that tests for proper cutoffs: https://github.com/exercism/xgo/blob/master/ledger/ledger_test.go#L128In fact it's because that is the only test that tests too-long descriptions and it happens to have two unicode codepoints that take up two bytes each, the two ö (C3 B6 in utf-8). Therefore, for that particular string, cutting at the 24th byte is cutting at the 22nd rune, and everything is correct.
I'm not sure if I'm comfortable leaving that code in example.go, because it's very much not generic - only works on strings with two more bytes than runes. I'm very tempted to add tests that more thoroughly test cutoffs, such as master...petertseng:ledger-unicode and then fix the code. Both the example and initial implementations currently fail:
(They cut the string at the wrong place and misalign the columns)
Thing is, if I add the tests, that means the initial implementation has to pass the tests too, and that means I have to also write bad code that happens to pass the tests. I'll consider whether that's a thing I want to do.
So while I think about that, I wanted to open an issue and ask what y'all think about it as well. Worth adding the tests? And ideas on how to write an intentionally-bad implementation that nevertheless passes tests?
As for a good implementation that works correctly, despite there being https://golang.org/pkg/unicode/utf8/#RuneCountInString , I'm not aware of something in Go that lets you substring at a certain rune count instead of a byte count. So that means I may have to iterate over the first 25 runes of the string (
j := 0; for i, _ := range description { if (j == 25) { ... } ... j++ }
or similar)The text was updated successfully, but these errors were encountered: