========
1.5.0
========
---------------
Overview
---------------
We are proud to announce what may be the biggest release yet in terms of content in Spark-NLP!
This release makes the library far easier to use for newcomers: annotators are simpler to import,
and the model downloader has been extended to cover both pretrained models and pipelines.
It also includes two new annotators that run deep learning algorithms on TensorFlow graphs,
a first for this library.
Apart from this, we include new Light Pipelines that are more than 10x faster when working with
datasets of fewer than roughly 50,000 rows.
Finally, we included several bugfixes across the library, from the algorithms to the developer API.
We'll gladly welcome any feedback! The website has been extensively updated.
---------------
New features
---------------
* Light Pipelines are annotator pipelines created from Spark ML pipelines that run more than 10x faster on small datasets
* Deep learning NER based on Bi-LSTM and convolutional neural networks, trained from word embeddings datasets
* Deep learning Assertion Status model based on LSTM to compute status identification from word embeddings
* Easier to use Spark-NLP:
1. Imports have been simplified in the Scala API: com.johnsnowlabs.annotator._ brings in all annotators
2. BasicPipeline and AdvancedPipeline downloadable pipelines created for quick annotation of text
3. Light Pipelines are easy to use and accept plain strings, annotating with a Spark ML Pipeline without Spark datasets (see the sketch after this list)
* New downloadable models: CRF NER, Lemmatizer, POS and Spell checker
* New downloadable pipelines: Vivekn sentiment analysis, BasicPipeline and AdvancedPipeline
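As an illustrative sketch (not taken from the docs), here is how a Light Pipeline can be used from
Scala, assuming an already fitted Spark ML PipelineModel named pipelineModel:

    import com.johnsnowlabs.nlp.LightPipeline

    // Wrap a fitted PipelineModel; annotate() takes plain strings,
    // so no Spark dataset is needed for small inputs
    val lightPipeline = new LightPipeline(pipelineModel)
    val annotations: Map[String, Seq[String]] = lightPipeline.annotate("Light Pipelines are fast")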
---------------
Enhancements
---------------
* Model downloader significantly improved in terms of usability
---------------
Documentation
---------------
* Website widely improved
* Added an invite to our first Slack chat channel
---------------
Bugfixes
---------------
* Fixed wrong positional index values when creating Annotations from the constructor
* Fixed Hamming distance calculation in the spell checker
* Fixed the downloadable NER model failing sporadically due to missing temporary files
* Fixed the SearchTrie algorithm used in TextMatcher (formerly EntityExtractor); thanks @avenka11 for reporting and proposing a solution
* Fixed some model deserialization issues happening on Windows
---------------
Other
---------------
* Thanks to @showy, we now have automatic integration testing with TravisCI
* Finisher now outputs to array by default
* Training example resources removed in favor of using the model downloader
========
1.4.2
========
---------------
Bugfixes
---------------
* Filesystem protocols are now properly read across the library; fixed a use case for the s3:// protocol (thanks @avenka11)
* Library now works properly in Windows environments
* PySpark annotator param getters now work properly when retrieving default values
* Fixed stemmer serialization due to misspelled param name
* Renamed Tokenizer param infixPattern to infixPatterns, fixing broken PySpark serialization of that param (see the sketch after this list)
* Added the missing addInfixPattern() function to PySpark, to allow appending patterns to the current value
* Model Downloader clearCache now properly removes both .zip files and extracted content
* Model Downloader is now capable of reading all types of models properly
* Added the missing clearCache function to PySpark
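For reference, a minimal Scala sketch of the renamed param and the pattern-appending call; the
regexes and column names below are placeholders, not taken from the docs:

    import com.johnsnowlabs.nlp.annotators.Tokenizer

    val tokenizer = new Tokenizer()
      .setInputCols("document")
      .setOutputCol("token")
      .setInfixPatterns(Array("([\\p{L}]+)"))  // param renamed from infixPattern
    tokenizer.addInfixPattern("(\\d+)")        // appends to the current value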
---------------
Developer API
---------------
* Function names in the model downloader code have been refactored for consistency
---------------
Other
---------------
* RocksDB rolled back to previous version to support Windows
* NerCRF unit test modified to reduce testing time
* Removed training scripts from the repository
* Updated the Spark and Scala versions used in the build
========
1.4.1
========
---------------
New features
---------------
* Model and Pipeline Downloader
We are glad to announce our first experimental model downloader, working in both Python and Scala.
It allows downloading pre-trained models from our public storage. This release does not include any
pre-trained models yet, just the logic to retrieve them.
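As a loose illustration only: the method name and signature below are assumptions for the sake of a
sketch, not confirmed by this changelog:

    // Hypothetical illustration: the method name and signature are assumed
    import com.johnsnowlabs.nlp.pretrained.ResourceDownloader

    // Fetch a pre-trained pipeline from the public storage by name (assumed API)
    val basicPipeline = ResourceDownloader.downloadPipeline("pipeline_basic", Some("en"))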
---------------
Enhancements
---------------
* Improved the ExternalResource API (introduced in 1.4.0) to make it easier to provide external corpus and resource information
to annotators, such as readAs (which sets how you would like SparkNLP to read your source), delimiters, and parse settings, among
other options that may be passed directly to the Spark reader. Annotators using external sources now all share this
functionality (see the sketch after this list). WordEmbeddings do not yet support this format.
* All python annotators now properly have getter functions to retrieve param values
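A hedged Scala sketch of the ExternalResource shape, assuming the 1.4.x readAs name LINE_BY_LINE
and using the Lemmatizer as an example consumer; the path and delimiter options are placeholders:

    import com.johnsnowlabs.nlp.util.io.{ExternalResource, ReadAs}
    import com.johnsnowlabs.nlp.annotators.Lemmatizer

    // Path, read strategy and options that may be forwarded to the Spark reader
    val dictionary = ExternalResource(
      "/data/lemmas.txt",
      ReadAs.LINE_BY_LINE,
      Map("keyDelimiter" -> "->", "valueDelimiter" -> "\t"))

    val lemmatizer = new Lemmatizer()
      .setInputCols("token")
      .setOutputCol("lemma")
      .setDictionary(dictionary)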
--------------
Bugfixes
--------------
* Fixed some Python annotators not being deserializable on their own outside a Pipeline
* Fixed CRF NER not working when not using word embeddings (thanks @crisliu for reporting)
* Fixed Tokenizer not properly recognizing some stop words (thanks @easimadi)
* Fixed Tokenizer not properly recognizing composite tokens when changing target pattern param (thanks @easimadi)
* ReadAs parameter now properly read from string in all ExternalResource setters
---------------
Developer API
---------------
* PySpark API further improvements within AnnotatorApproach, AnnotatorModel and now private internal _AnnotatorModel for fit() result representation
* Automated getters have been written so that getter functions no longer need to be written manually in every annotator
-----------
Other
-----------
* RocksDB dependency rolled back to 5.2.1 for better universal compatibility, particularly to support the Databricks platform
---------------
Documentation
---------------
* Updated website components page to match 1.4.x
* Replaced the notebooks site with a placeholder linking to the current Python notebooks, for lower maintenance
========
1.4.0
========
---------------
New features
---------------
* All annotator external sources have been unified through an ExternalResource component.
This component represents external data information and deals with content in HDFS or locally, just as Spark deals with data.
It also improves performance globally and allows customization
of how these sources are read (e.g. as RDD or line-by-line sequences)
* NorvigSweeting SpellChecker, ViveknSentiment and POS Perceptron can now train from the dataset passed to fit().
For Spell Checker, this will be applied if the user did not supply a corpus, forcing fit() to learn from words in the data column.
For ViveknSentiment and POS Perceptron, this strategy will be applied if sentimentCol and posCol params have been set respectively.
---------------
Enhancements
---------------
* ResourceHelper now has an improved SourceStream class which allows for more consistent HDFS/Filesystem reading by using
more of the Hadoop APIs.
* application.conf is now a global setting and can be overridden at run-time through ConfigLoader.setConfigPath() (see the sketch after this list). It may also be accessed from PySpark
* PySpark API improved by creating AnnotatorApproach and AnnotatorModel classes
* Part-of-speech tagging performance has been improved throughout the prediction algorithm
* EntityMatcher may now use a RecursivePipeline in order to tokenize external data with the same pipeline provided by the user
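A minimal sketch of overriding the configuration path at run-time; the package
com.johnsnowlabs.util and the path string are assumptions for illustration:

    import com.johnsnowlabs.util.ConfigLoader

    // Point the library at a custom application.conf before annotators read settings
    ConfigLoader.setConfigPath("/opt/myapp/application.conf")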
---------------
Developer API
---------------
* PySpark API has been significantly improved to make it easier to extend JVM classes
* PySpark API improved for extending annotator approaches and models appropriately
----------------
Bugfixes
----------------
* Fixed a regression that caused NER not to read datasets properly from HDFS
* Fixed EntityMatcher wrongly normalizing external content (thanks @sofianeh)
----------------
Documentation
----------------
* Fixed obsolete params in the EntityMatcher documentation (thanks @sofianeh)
* Fixed the NER CRF documentation on the website
========
1.3.0
========
IMPORTANT: Pipelines from 1.2.6 or older cannot be loaded from 1.3.0
---------------
New features
---------------
* https://github.com/JohnSnowLabs/spark-nlp/pull/94
The Tokenizer annotator has been revamped. It now follows standard NLP rules, matching above 90% of StanfordNLP tokens.
The annotator now has more complex rules, allowing custom composite words to be set as exceptions (e.g. to not break New York)
along with custom prefix, infix, suffix and breaking rules. It uses regular expression groups in order to match various tokens per target word.
Defaults have been updated to be language agnostic and to support foreign characters from the Unicode charset
* https://github.com/JohnSnowLabs/spark-nlp/pull/93
Assertion Status. This annotator identifies negated sequences within a target scope. Assertion status is a machine learning
annotator and works with a set of word embeddings; a sample set is provided as part of our Python notebook examples.
* https://github.com/JohnSnowLabs/spark-nlp/pull/90
Recursive Pipelines. We have created our own Pipeline class which takes better advantage of Spark-NLP annotators.
This Pipeline is completely optional and works well alongside default Apache Spark estimators and transformers; it allows
training our annotators more efficiently by letting annotator approaches access the previous state of the Pipeline,
which they can use to tokenize or transform their own external content. Using such Pipelines is recommended (see the sketch after this list).
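A minimal Scala sketch: RecursivePipeline is used exactly like a Spark ML Pipeline; the stage
variables below (documentAssembler, tokenizer, lemmatizer) are placeholders assumed to exist:

    import com.johnsnowlabs.nlp.RecursivePipeline

    // Drop-in replacement for org.apache.spark.ml.Pipeline: annotators that
    // support it may access the pipeline's previous state during training
    val pipeline = new RecursivePipeline()
      .setStages(Array(documentAssembler, tokenizer, lemmatizer))

    val model = pipeline.fit(trainingData)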
----------------
Enhancements
----------------
* https://github.com/JohnSnowLabs/spark-nlp/pull/83
Part-of-speech training has been improved in both performance and quality, and now makes better use of the provided input corpus.
New params, corpusFormat and corpusLimit, give more control over training: whether to read training
data as a Dataset or as raw text files, and how many files to read when a folder is provided
* https://github.com/JohnSnowLabs/spark-nlp/pull/84
Thanks to @lambdaofgod, Normalizer can now optionally lowercase tokens
* Thanks to Lorenz Bernauer, the Normalizer default pattern is now language agnostic and no longer breaks Unicode characters such as Spanish or German letters
* Features now have appropriate default values which are lazy by nature and executed only once upon request. As a side effect, this improves Lemmatizer performance.
* RuleFactory (a regex rule factory) performance has been improved: it now uses a factory pattern and no longer re-checks its strategy on every transformation at run-time.
This should have positive side effects in SentenceDetector, DateMatcher and RegexMatcher, which use this class extensively.
----------------
Class Renames
----------------
RegexTokenizer -> Tokenizer (it is not just regex anymore)
SentenceDetectorModel -> SentenceDetector (it is not a model, it is a rule-based algorithm)
SentimentDetectorModel -> SentimentDetector (it is not a model, it is a rule-based algorithm)
----------------
User Utilities
----------------
* ResourceHelper has a function createDatasetFromText which allows the user to more
easily read one or multiple text files from a path into a dataset, with various options,
including filename per row or aggregation by file. This class should be more widely
used, since it helps with parsing local files. It shall be better documented.
* com.johnsnowlabs.util now contains a Benchmark class which allows measuring the running time of
any function easily, by using it as Benchmark.time("Description of what is measured") { someFunction() } (see the sketch below)
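For instance (the pipeline and data names are placeholders):

    import com.johnsnowlabs.util.Benchmark

    // Runs the enclosed block and reports its elapsed time
    Benchmark.time("Fitting the pipeline") {
      pipeline.fit(trainingData)
    }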
----------------
Developer API
----------------
* https://github.com/JohnSnowLabs/spark-nlp/pull/89/files
Word embedding traits have been generalized. Now any annotator that wants to use them can easily access their properties
* Recursive pipelines now allow injecting a PipelineModel object into the train() stage. It is an optional parameter. If the user
utilizes a RecursivePipeline, the annotator may use this pipeline for transforming secondary data inputs.
* The Annotator abstract class has been split: a new parent RawAnnotator class contains all annotator properties
and validations but does not make use of the annotate() function. This allows for annotators that need to work directly with
the transform() call while still participating alongside other annotators in the pipeline
----------------
Bugfixes
----------------
* Fixed a bug in annotators with word embeddings not correctly serializing to disk
* Fixed a bug that created temporary folders in the home folder
* Fixed a broken geospatial pattern in sentence detection
========
1.2.6
========
---------------
Enhancements
---------------
* https://github.com/JohnSnowLabs/spark-nlp/pull/82
Vivekn sentiment analysis improved memory consumption and training performance.
The pruneCorpus parameter is now adjustable and defaults to 1. Higher values lead to better performance
but are meant for larger corpora (see the sketch after this list). The tokenPattern params allow different tokenization regexes
for the corpora provided to the Vivekn and Norvig models.
* https://github.com/JohnSnowLabs/spark-nlp/pull/81
Serialization improvements. The new default format is RDD objects (Parquet did not last long), which proved lighter on
heap memory. Also added lazier default values for Feature containers. New application.conf performance tuning
settings allow customizing whether Features are broadcast or not, and whether Parquet or objects are used in serialization.
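A hedged sketch of adjusting pruneCorpus, assuming the approach class ViveknSentimentApproach and
its setter; the column names are placeholders:

    import com.johnsnowlabs.nlp.annotators.sda.vivekn.ViveknSentimentApproach

    val vivekn = new ViveknSentimentApproach()
      .setInputCols("document", "token")
      .setOutputCol("sentiment")
      .setPruneCorpus(1)  // default; raise it on larger corpora for better performance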
========
1.2.5
========
IMPORTANT: Pipelines from 1.2.4 or older cannot be loaded from 1.2.5
---------------
New features
---------------
* https://github.com/JohnSnowLabs/spark-nlp/pull/70
Word embeddings parameter for CRF NER annotator
* https://github.com/JohnSnowLabs/spark-nlp/pull/78
Annotator features replace params and are now serialized using Kryo and partitioned files, which increases performance and reduces
memory consumption in the driver when saving and loading pipelines with large corpora. Such features are now also broadcast
for better performance in distributed environments. This enhancement is a breaking change and does not allow loading older pipelines
----------------
Bug fixes
----------------
* https://github.com/JohnSnowLabs/spark-nlp/commit/cb9aa4366f3e2c9863482df39e07b7bacff13049
Stemmer was not capable of being deserialized (it now implements DefaultParamsReadable)
* https://github.com/JohnSnowLabs/spark-nlp/pull/75
Sentence Boundary detector was not properly setting bounds
----------------
Documentation (thanks @maziyarpanahi)
----------------
* https://github.com/JohnSnowLabs/spark-nlp/pull/79
Typo in code
* https://github.com/JohnSnowLabs/spark-nlp/pull/74
Bad description
========
1.2.4
========
---------------
New features
---------------
* https://github.com/JohnSnowLabs/spark-nlp/commit/c17ddac7a5a9e775cddc18d672e80e60f0040e38
ResourceHelper now allows input files to be read in the shape of a Spark Dataset, implicitly enabling HDFS paths and allowing larger annotator input files. Set 'TXTDS' as the input format Param to let annotators read this way. Allowed in: Lemmatizer, EntityExtractor, RegexMatcher, Sentiment Analysis models, Spell Checker and Dependency Parser.
---------------
Enhancements and progress
---------------
* https://github.com/JohnSnowLabs/spark-nlp/commit/4920e5ce394b25937969cc4cab1d81172be722a3
CRF NER Benchmarking progress
* https://github.com/JohnSnowLabs/spark-nlp/pull/64
EntityExtractor refactored. This annotator uses an input file containing a list of entities to look for inside the target text. It has been refactored to be more useful and, specifically, faster, by using a Trie search algorithm. Proper examples are included in the Python notebooks.
---------------
Bug fixes
---------------
* Issue https://github.com/JohnSnowLabs/spark-nlp/issues/41 <> https://github.com/JohnSnowLabs/spark-nlp/commit/d3b9086e834233f3281621d7c82e32195479fc82
Fixed default resources not being loaded properly when using the library through --spark-packages. Improved input reading from resources and folder resources, with fallback to disk and better error handling.
* https://github.com/JohnSnowLabs/spark-nlp/commit/08405858c6186e6c3e8b668233e30df12fa50374
Corrected param names in DocumentAssembler
* Issue https://github.com/JohnSnowLabs/spark-nlp/issues/58 <> https://github.com/JohnSnowLabs/spark-nlp/commit/5a533952cdacf67970c5a8042340c8a4c9416b13
Deleted a left-over deprecated function which was misleading.
* https://github.com/JohnSnowLabs/spark-nlp/commit/c02591bd683db3f615150d7b1d121ffe5d9e4535
Added filtering to ensure no empty sentences arrive at unnormalized Vivekn sentiment analysis
---------------
Documentation and examples
---------------
* https://github.com/JohnSnowLabs/spark-nlp/commit/b81e95ce37ed3c4bd7b05e9f9c7b63b31d57e660
Added additional resources into FAQ page.
* https://github.com/JohnSnowLabs/spark-nlp/commit/0c3f43c0d3e210f3940f7266fe84426900a6294e
Added Spark Summit example notebook with full Pipeline use case
* Issue https://github.com/JohnSnowLabs/spark-nlp/issues/53 <> https://github.com/JohnSnowLabs/spark-nlp/commit/20efe4a3a5ffbceedac7bf775466b7a8cde5044f
Fixed Scala and Python documentation mistakes
* https://github.com/JohnSnowLabs/spark-nlp/commit/782eb8dce171b69a615887b3defaf8b729b735f2
Typo fixes
---------------
Other
---------------
* https://github.com/JohnSnowLabs/spark-nlp/commit/91d8acb1f0f4840dad86db3319d0b062bd63b8c6
Removed Regex NER due to slowness and little use. CRF NER replaces it.
========
1.2.3
========
---------------
Bugfixes
---------------
* Sentence detection not properly bounding punctuation marks
* Sentence detection abbreviations feature disabled by default, since it is slow and not necessarily beneficial
* Sentence detection punctuation bounds: improved algorithm performance by removing redundant formatting
* Fixed a bug causing sentiment analysis not to work when normalized tokens may be deleted from the sentence
* Fixed ResourceHelper text reading missing the first line from the text stream. More improvements to come later.
---------------
Other
---------------
CRF NER progress, word embeddings support. Not yet officially released.
========
1.2.2
========
---------------
New features
---------------
* Finisher parameter to export output as Array
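A minimal sketch, assuming the param is exposed through a setOutputAsArray setter; the input
column name is a placeholder:

    import com.johnsnowlabs.nlp.Finisher

    val finisher = new Finisher()
      .setInputCols("token")
      .setOutputAsArray(true)  // export finished annotations as an array column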
========
1.2.1
========
---------------
Bugfixes
---------------
* Fixed Finisher not properly deleting annotation columns even when explicitly set to do so
========
1.2.0
========
--------------
New features
--------------
* New annotator: CRF NER
* New transformer: Token Assembler
* Added custom bounds parameter to the sentence detector (see the sketch after this list)
* Added custom pattern parameter to the Normalizer
* SpellChecker is now able to train from text files as a dataset
* Typesafe configuration may now be read from disk at runtime
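A hedged sketch of the custom bounds parameter, assuming a setCustomBounds setter and using the
post-1.3.0 class name SentenceDetector (in this release the annotator was still called
SentenceDetectorModel); the bound strings are placeholders:

    import com.johnsnowlabs.nlp.annotators.sbd.pragmatic.SentenceDetector

    val sentenceDetector = new SentenceDetector()
      .setInputCols("document")
      .setOutputCol("sentence")
      .setCustomBounds(Array("\n\n"))  // also split sentences on these custom bounds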
-------------------------
Performance improvements
-------------------------
* Enabled and suggested the Kryo serializer for better pipeline write/read performance
--------------------
Code improvements
--------------------
* Pack/unpack in annotators leads to better code reuse
* ResourceHelper has reduced I/O responsibility
* Annotation now centralizes the main result, plus Finisher improvements
---------------
Bugfixes
---------------
* RegexMatcher fixed to read from input files properly
* DocumentAssembler fixed relative positioning
------------------
Release framework
------------------
* Better package release readiness for different repositories
* Organization package name changed to comply with Maven Central's standards