ledger: handling unicode #195

petertseng · 2015-10-31T02:43:43Z

While doing the ledger exercise, I noticed that both the example solution and initial implementation have a defect in how they count lines that are too long. The "Description" column of the ledger has width 25, so any string that has more than 25 characters should display the first 22 characters followed by "..."

And yet, the example solution and initial implementation both check for length greater than 27 characters, and cut at index 24 instead of 22.
https://github.com/exercism/xgo/blob/master/ledger/example.go#L44
https://github.com/exercism/xgo/blob/master/ledger/ledger.go#L98

You may ask, is it something to do with the padding in the string " | " that separates each column? No, it's not even that because there is in fact a test that tests for proper cutoffs: https://github.com/exercism/xgo/blob/master/ledger/ledger_test.go#L128

In fact it's because that is the only test that tests too-long descriptions and it happens to have two unicode codepoints that take up two bytes each, the two ö (C3 B6 in utf-8). Therefore, for that particular string, cutting at the 24th byte is cutting at the 22nd rune, and everything is correct.

I'm not sure if I'm comfortable leaving that code in example.go, because it's very much not generic - only works on strings with two more bytes than runes. I'm very tempted to add tests that more thoroughly test cutoffs, such as master...petertseng:ledger-unicode and then fix the code. Both the example and initial implementations currently fail:

--- FAIL: TestFormatLedgerSuccess (0.00s)
    ledger_test.go:347: FormatLedger for input named "overlong ascii descriptions" was expected to return...
        Date_______|_Description_______________|_Change
        01/01/2015_|_This_description_is_wa..._|________$1.00_

        ...but returned...
        Date_______|_Description_______________|_Change
        01/01/2015_|_This_description_is_way_..._|________$1.00_

(They cut the string at the wrong place and misalign the columns)

Thing is, if I add the tests, that means the initial implementation has to pass the tests too, and that means I have to also write bad code that happens to pass the tests. I'll consider whether that's a thing I want to do.

So while I think about that, I wanted to open an issue and ask what y'all think about it as well. Worth adding the tests? And ideas on how to write an intentionally-bad implementation that nevertheless passes tests?

As for a good implementation that works correctly, despite there being https://golang.org/pkg/unicode/utf8/#RuneCountInString , I'm not aware of something in Go that lets you substring at a certain rune count instead of a byte count. So that means I may have to iterate over the first 25 runes of the string (j := 0; for i, _ := range description { if (j == 25) { ... } ... j++ } or similar)

The text was updated successfully, but these errors were encountered:

soniakeys · 2015-10-31T11:51:58Z

Hmm, lemme try working the exercise...

soniakeys · 2015-10-31T13:52:10Z

It looks like a little better would be to count just runes that pass ~~unicode.IsGraphic~~? (didn't quite work.)

Edit: And a little better would be to use godoc.org/golang.org/x/text/width...

Ha! I have my program passing with descriptions of "Freude schöner Götterfunken" (those are combining characters) and "Ode To Joy 기쁨 에 송시 Hyuna". I'll hold off on posting an iteration so I don't spoil your fun.

soniakeys · 2015-10-31T18:27:56Z

Huh. I was trying to use godoc.org/golang.org/x/text/currency but sadly ran into this:

    // TODO: use pattern.
    io.WriteString(s, opt.symbol(lang, cur))
    if v.amount != nil {
        s.Write(space)

The "pattern" mentioned in the comment would be a CLDR Number Pattern, which would do the right thing with a space between the currency symbol and the amount -- rather than hard coding a space.

petertseng · 2015-11-02T02:12:52Z

Ahhh, so we should think about wide characters as well, and this is super interesting for your example "Ode To Joy 기쁨 에 송시 Hyuna" because the cut should happen in the middle of a wide character (in the middle of that 시) . What would you say the result should be when that happens? Replace the wide character with a " ...", or should it just be "..." without the space?

Date       | Description               | Change
01/01/2015 | Ode To Joy 기쁨 에 송...  |        $1.00

or

Date       | Description               | Change
01/01/2015 | Ode To Joy 기쁨 에 송 ... |        $1.00

(Oy, that doesn't even look right on my browser because the wide characters aren't exactly the width of two narrows... but it does look fine in a monospace font in my editor)

petertseng · 2015-11-02T04:05:16Z

Eh, submitted an iteration that deals with wide characters and combining characters, though I'm not confident that I'm really doing the combining characters the right way. I'm somewhat wondering whether we should have people deal with wide characters at all. The argument for it is that if we're dealing with one aspect of unicode (that some codepoints take more than one byte) then we should deal with other aspects too (that characters have varying widths). On the other hand, it could be a lot to deal with in an exercise that is mostly about refactoring some existing code. Then again, maybe it's not that bad if we write some existing code in the initial ledger.go

I am almost tempted to say "this variable width business even deserves its own exercise", but I don't know.

soniakeys · 2015-11-02T06:21:45Z

Cool. I uploaded an iteration at http://exercism.io/submissions/fbab39b6626b498d96e9f0466d4a5c6d. You can see I handle combining characters with if !unicode.Is(unicode.Mn, r) {.

Unicode is a rabbit hole. To keep the exercise focused on refactoring, we could just take out the multibyte test case and make it all ASCII. If we leave it in, we should at least handle the existing test case properly. Beyond that, I don't know. The exercise gets more complex, but I think in the real world most anyone who handles Unicode has to handle combining characters. I hear Go is popular in China. It's good stuff to be able do.

petertseng · 2015-11-02T09:22:42Z

Ah I see. I guess my solution just kinda works by chance. I also do a width.LookupRune(r).Kind() but it so happened that the comining characters were EastAsianAmbiguous and so I just treated them as zero... but I'm sure there have to be other EastAsianAmbiguous characters that have nonzero widths, so my solution is not quite good enough.

Given all this, maybe the best thing to do is to go with ASCII only in the ledger tests, and then possibly think of some separate exercise that will deal with Unicode. Will probably be about 24 hours before I can do the first thing so someone could feel free to beat me to it. The Unicode exercise will of course need a bit of thought.

kytrinyx · 2015-11-02T14:36:44Z

I think you're both right that

a) unicode handling is important, and
b) it's a rabbit hole, and
c) we should probably put it in a separate exercise.

soniakeys · 2015-11-04T05:13:05Z

resolved with #199

petertseng mentioned this issue Nov 4, 2015

ledger: remove multibyte characters #199

Merged

soniakeys mentioned this issue Nov 4, 2015

Consider creating new Unicode exercise #200

Closed

soniakeys closed this as completed Nov 4, 2015

petertseng mentioned this issue Nov 4, 2016

Discussion: Non-Ascii test cases. exercism/problem-specifications#428

Closed

petertseng mentioned this issue Jan 7, 2017

Suggest specific exercises that deal with multi-language text handling. exercism/problem-specifications#455

Open

dexterlemmer mentioned this issue Mar 13, 2017

Incomplete test coverage for scrabble-score exercism/rust#264

Closed

robphoenix mentioned this issue Apr 27, 2017

bob: ensure generator is up-to-date #624

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

ledger: handling unicode #195

ledger: handling unicode #195

petertseng commented Oct 31, 2015

soniakeys commented Oct 31, 2015

soniakeys commented Oct 31, 2015

soniakeys commented Oct 31, 2015

petertseng commented Nov 2, 2015

petertseng commented Nov 2, 2015

soniakeys commented Nov 2, 2015

petertseng commented Nov 2, 2015

kytrinyx commented Nov 2, 2015

soniakeys commented Nov 4, 2015

ledger: handling unicode #195

ledger: handling unicode #195

Comments

petertseng commented Oct 31, 2015

soniakeys commented Oct 31, 2015

soniakeys commented Oct 31, 2015

soniakeys commented Oct 31, 2015

petertseng commented Nov 2, 2015

petertseng commented Nov 2, 2015

soniakeys commented Nov 2, 2015

petertseng commented Nov 2, 2015

kytrinyx commented Nov 2, 2015

soniakeys commented Nov 4, 2015