Fix Unicode 15.0 line breaking #4389

Merged Dec 1, 2023 (31 commits)
Commits
88c80b4  traces  (eggrobin, Oct 24, 2023)
cb3307f  The first monkey passes  (eggrobin, Nov 22, 2023)
cf91a15  More traces  (eggrobin, Nov 22, 2023)
48eb682  trace the right thing  (eggrobin, Nov 22, 2023)
ec17694  Maybe we don’t need that LB25_HY state?  (eggrobin, Nov 22, 2023)
6592f33  What a complete mess  (eggrobin, Nov 23, 2023)
be70044  This is going to be tedious, isn’t it  (eggrobin, Nov 24, 2023)
6b6e173  LB8a  (eggrobin, Nov 24, 2023)
7bfd662  Keep hammering at the ZWJ CM case  (eggrobin, Nov 24, 2023)
6f3cd7c  Surprisingly this moves me slightly further into the pile of tests  (eggrobin, Nov 24, 2023)
93f7186  Onto the next test.  (eggrobin, Nov 24, 2023)
c1a12d8  Back to completely untailored LB25, the recommended tailoring needs l…  (eggrobin, Nov 29, 2023)
147bf00  Any and then some  (eggrobin, Nov 29, 2023)
1982d90  handle ZWJ for OP in extended context  (eggrobin, Nov 29, 2023)
a9b2e85  RI_RI_ZWJ  (eggrobin, Nov 29, 2023)
3fb0288  HL_ZWJ in extended context in LB21  (eggrobin, Nov 29, 2023)
ca04538  HL HY CM, more left Any for failure on test 387  (eggrobin, Nov 29, 2023)
6bcb745  ID_CN_ZWJ for test case 2077 😭  (eggrobin, Nov 29, 2023)
ff737aa  Handle ZWJ ZWJ, pushing the failure to test case 3556  (eggrobin, Nov 29, 2023)
2056364  Push the failure to 5441  (eggrobin, Nov 29, 2023)
055f1eb  Now fails on 6437  (eggrobin, Nov 30, 2023)
a7e012c  9227...  (eggrobin, Nov 30, 2023)
23df846  twenty kilotests passing.  (eggrobin, Nov 30, 2023)
dd7471c  Remove traces  (eggrobin, Nov 30, 2023)
6616daf  Check in a few tests  (eggrobin, Nov 30, 2023)
31e23d4  Untailor LineBreakTest.txt  (eggrobin, Nov 30, 2023)
d985de2  An attempt at reducing spurious changes in that untailored LineBreakTest  (eggrobin, Nov 30, 2023)
184540a  Try to remove even more spurious diffs  (eggrobin, Nov 30, 2023)
442e00c  cargo make testdata  (eggrobin, Nov 30, 2023)
ea5ddc7  Document how the sausage was made  (eggrobin, Dec 1, 2023)
8a1cb70  doc test for #4146  (eggrobin, Dec 1, 2023)
15 changes: 15 additions & 0 deletions components/segmenter/src/line.rs
@@ -151,6 +151,21 @@ pub type LineBreakIteratorUtf16<'l, 's> = LineBreakIterator<'l, 's, LineBreakTyp
 /// let breakpoints: Vec<usize> = segmenter.segment_str(text).collect();
 /// // 9 and 22 are mandatory breaks, 14 is a line break opportunity.
 /// assert_eq!(&breakpoints, &[0, 9, 14, 22]);
+///
+/// // There is a break opportunity between emoji, but not within the ZWJ sequence 🏳️‍🌈.
+/// let flag_equation = "🏳️➕🌈🟰🏳️\u{200D}🌈";
+/// let possible_first_lines: Vec<&str> =
+///     segmenter.segment_str(flag_equation).skip(1).map(|i| &flag_equation[..i]).collect();
+/// assert_eq!(
+///     &possible_first_lines,
+///     &[
+///         "🏳️",
+///         "🏳️➕",
+///         "🏳️➕🌈",
+///         "🏳️➕🌈🟰",
+///         "🏳️➕🌈🟰🏳️‍🌈"
+///     ]
+/// );
 /// ```
 ///
 /// # Examples
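The doc test added in this hunk turns break offsets into candidate first lines with `.skip(1).map(|i| &flag_equation[..i])`. Below is a minimal standalone sketch of that step only. The helper name `possible_first_lines` and the hand-written offsets are assumptions for illustration; they are not part of this PR or of the `icu` API, and no segmenter is run.

```rust
/// Hypothetical helper (not part of icu_segmenter): given break offsets that
/// start with 0, return every prefix of `text` ending at a later offset.
/// Each offset must fall on a UTF-8 character boundary, otherwise the slice
/// operation panics.
fn possible_first_lines<'a>(text: &'a str, breakpoints: &[usize]) -> Vec<&'a str> {
    breakpoints
        .iter()
        // Skip the leading 0, which would produce an empty first line.
        .skip(1)
        .map(|&i| &text[..i])
        .collect()
}
```

For example, with the hand-picked offsets `[0, 4, 8, 13]` on `"one two three"`, the helper yields `"one "`, `"one two "` and `"one two three"`. In the doc test the offsets come from `segmenter.segment_str(...)` instead.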
46 changes: 41 additions & 5 deletions components/segmenter/tests/spec_test.rs
@@ -106,18 +106,52 @@ impl Iterator for TestContentIterator {
 fn line_break_test(filename: &str) {
     let test_iter = TestContentIterator::new(filename);
     let segmenter = LineSegmenter::new_dictionary();
-    for mut test in test_iter {
+    for (i, mut test) in test_iter.enumerate() {
         let s: String = test.utf8_vec.into_iter().collect();
         let iter = segmenter.segment_str(&s);
         let result: Vec<usize> = iter.collect();
         // NOTE: For consistency with ICU4C and other Segmenters, we return a breakpoint at
         // index 0, despite UAX #14 suggesting otherwise. See issue #3283.
-        test.break_result_utf8.insert(0, 0);
-        assert_eq!(result, test.break_result_utf8, "{}", test.original_line);
+        if test.break_result_utf8.first() != Some(&0) {
+            test.break_result_utf8.insert(0, 0);
+        }
+        if result != test.break_result_utf8 {
+            let lb = icu::properties::maps::line_break();
+            let lb_name = icu::properties::LineBreak::enum_to_long_name_mapper();
+            let mut iter = segmenter.segment_str(&s);
+            // TODO(egg): It would be really nice to have Name here.
+            println!(" | A | E | Code pt. | Line_Break | Literal");
+            for (i, c) in s.char_indices() {
+                let expected_break = test.break_result_utf8.contains(&i);
+                let actual_break = result.contains(&i);
+                if actual_break {
+                    iter.next();
+                }
+                println!(
+                    "{}| {} | {} | {:>8} | {:>18} | {}",
+                    if actual_break != expected_break {
+                        "😭"
+                    } else {
+                        " "
+                    },
+                    if actual_break { "÷" } else { "×" },
+                    if expected_break { "÷" } else { "×" },
+                    format!("{:04X}", c as u32),
+                    lb_name
+                        .get(lb.get(c))
+                        .unwrap_or(&format!("{:?}", lb.get(c))),
+                    c
+                )
+            }
+            println!("Test case #{}", i);
+            panic!()
+        }

         let iter = segmenter.segment_utf16(&test.utf16_vec);
         let result: Vec<usize> = iter.collect();
-        test.break_result_utf16.insert(0, 0);
+        if test.break_result_utf16.first() != Some(&0) {
+            test.break_result_utf16.insert(0, 0);
+        }
         assert_eq!(
             result, test.break_result_utf16,
             "UTF16: {}",
@@ -127,7 +161,9 @@ fn line_break_test(filename: &str) {
         // Test data is Latin-1 character only, it can run for Latin-1 segmenter test.
         if let Some(mut break_result_latin1) = test.break_result_latin1 {
             let iter = segmenter.segment_latin1(&test.latin1_vec);
-            break_result_latin1.insert(0, 0);
+            if break_result_latin1.first() != Some(&0) {
+                break_result_latin1.insert(0, 0);
+            }
             let result: Vec<usize> = iter.collect();
             assert_eq!(
                 result, break_result_latin1,
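The test changes above do two things: they prepend the index-0 breakpoint only when the expected list does not already start with it, and on a UTF-8 mismatch they print one row per character comparing actual (A) and expected (E) breaks. A rough std-only sketch of both ideas follows. The function names `ensure_leading_zero` and `break_report` are hypothetical, the `Line_Break` column is omitted because it needs `icu::properties`, and the mismatch marker is plain ASCII here.

```rust
/// Hypothetical helper: prepend offset 0 unless the list already starts with it.
fn ensure_leading_zero(breaks: &mut Vec<usize>) {
    if breaks.first() != Some(&0) {
        breaks.insert(0, 0);
    }
}

/// Hypothetical helper: build one report row per character of `s`.
/// Columns: mismatch flag, actual break, expected break, code point, literal.
/// "÷" marks a break before the character, "×" marks no break.
fn break_report(s: &str, actual: &[usize], expected: &[usize]) -> Vec<String> {
    s.char_indices()
        .map(|(i, c)| {
            let actual_break = actual.contains(&i);
            let expected_break = expected.contains(&i);
            // Code point as at least four uppercase hex digits, e.g. 0061.
            let code_point = format!("{:04X}", c as u32);
            format!(
                "{}| {} | {} | {:>8} | {}",
                if actual_break != expected_break { "!!" } else { "  " },
                if actual_break { "÷" } else { "×" },
                if expected_break { "÷" } else { "×" },
                code_point,
                c
            )
        })
        .collect()
}
```

Like the loop in the diff, this sketch only visits offsets where a character starts, so a disagreement about the break at the very end of the string would not appear in any row.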
208 changes: 208 additions & 0 deletions components/segmenter/tests/testdata/LineBreakExtraTest.txt

Large diffs are not rendered by default.

209 changes: 105 additions & 104 deletions components/segmenter/tests/testdata/LineBreakTest.txt

Large diffs are not rendered by default.
