-
Notifications
You must be signed in to change notification settings - Fork 4
/
Copy pathUPDATES.txt
151 lines (117 loc) · 5.15 KB
/
UPDATES.txt
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
December 26,
- pinyin-generation improvements for compounds rooted in the number class
- updated version of the database from Popup Chinese
November 20,
- Added /search directory. It contains a basic java program that will index Chinese texts
using Chinese, English and Pinyin using the Lucene engine, along with brief instructions
on getting up and running.
November 20,
- SQLite support is back
- SQLite database creation scripts in the scripts directory
November 11,
- --editable-five-fields UTF8 single quote output edited to create more standards-compliant
javascript that will not produce errors under certain conditions handling UTF8 encoded text.
November 10,
- --refined UTF8 mysql-based parsing algorithm. It now checks against the traditional database
when it does not find content in the simplified space and vice-versa.This enables the software
to understand "mixed" simplified and traditionalUTF8 texts for the first time.
November 3,
- --no-phrases command line option added. This ignores database content marked as "PHRASE" when
parsing content. This options works for both the mysql and internally compiled datasets.
October 21,
- database updates, improvements to the number class to better handle decimals and some
new command line options for segmentation (--skip-basic, --skip-advanced), and see --help
for the rest.
September 11,
- improved traditional support in backend database - purging of empty single-character entries so that
UNKNOWN entries in automatically-generated words are less frequent and/or non-existent.
September 2
- improvements to pinyin generation for the number class. It will no longer drop certain characters which have
been unified with numbers.
July 29, 2008
- mass simplification of the number class. There are still some numbers which will run off the upper limits
of our handling capabilities, but treatment of large numbers with 亿 should be much more reliable and
accurate, even with decimal figures.
July 4, 2008
- changes to --editable-five-fields to improve treatment of ASCII for more reliable popups
June 27, 2008
- minor changes for --editable-five-fields ignores ASCII-only text for Popup Chinese
June 22, 2008
- clode cleanup
- addition of "--editable-five-fields" flag
- traditional variants now generated automatically if non-existent
April 29, 2008
- major dictionary update (+6000 entries)
April 24, 2008
- major dictionary update (+2000 entries)
- fix to number class that caused segmentation fault in certain cases involving suffixed numbers
March 28, 2008
- fixed issue with pinyin generation causing segmentation fault
- --split-on-ascii-punctuation added (NLP feature)
March 10, 2008
- fixed compile issue for Mac OS
March 07, 2008
- reduplicated pinyin error fixed
- improvements to Name recognition
Feb 24, 2008
- improvements to editing features (better traditional support)
- --tonalize mode added for conversion from numeric to tonal pinyin
- vocabulary output mode for traditional output enabled
Feb 18, 2008
- improvements to editing features (better traditional support)
- more dictionary entries added/reviewed
- proper escaping of apostrophes in --textbook output javascript
- some additional improvements to grammar/translation.grammar file
to support better translation of phraes from contemporary news
texts.
Feb 16, 2008
- automatic detection of traditional characters re-enabled
Feb 11, 2008
- routine database update (1000 or so new words/phrases).
- improvements to noun recognition, number handling
- correction of minor errors in textbook output mode
- further improvements in date processing
Feb 7, 2008
- major correction to parsing class
- improvements to corporate/proper noun recognition
- improvements to --textbook output for cleaner javascript
in pinyin/english with apostrophes
- month+time+minute handling improved
Jan 25, 2008
- input method --textbook-markup accepts --textbook output
as input. Useful for altering, editing existing segmentation
for pre-analysed texts.
Jan 21, 2008
- improvement to nonchinese.cpp for the correct recognition of
nonchinese/number/nonchinese compounds as nonchinese compounds.
examples include URLs that include both ASCII and numbers.
Jan 19, 2008
- --segcn (Chinese segmentation), --vocab (vocab list) features added
- more modifications to pronoun class to improve identification of possessive
pronouns
Jan 15, 2008
- addition of double field to dictionary for logorithmic frequency data
- FREQ2 added to expanded_unified table, as well as each individual table
Jan 14, 2008
- improved spacing for pinyin generation of Proper Nouns.
Jan 13, 2008
- word/frequency identification feature added
--frequency
--frequency-reset
Jan 12, 2008
- modifications to pronoun class
- minor bugfixes
Dec 27, 2007
- automation of stand-alone .dat files (from MySQL)
- scripts included in the scripts directory.
Dec 19, 2007
- question mark now handled properly (textbook mode)
Dec 13, 2007
- fix to handling of bei (passive) that would segfault
Nov 13, 2007
- asciify Chinese punctuation feature added (--ucp)
- minor bug-fixes
Oct 03, 2007
- basic support for traditional parsing added
- improved resegmentation handling
- improved support for address recognition