Performance issues with UnicodeTokenizer #80

GoogleCodeExporter · 2015-05-25T15:03:04Z

What steps will reproduce the problem?
1. Call ArticleExtractor.getInstance().getText() on the example data
(Stability.html).

What is the expected output? What do you see instead?
The extraction takes a very long time (1-3 minutes, depending on hardware and
JVM load), with heavy memory reallocation in StringBuilder during
Matcher.replaceAll calls. HTML of this size typically takes 2-3 s on the same
hardware.

What version of the product are you using? On what operating system?
1.1.0 and 1.2.0 on Ubuntu 12.04 with the Oracle JVM:
java version "1.7.0_67"
Java(TM) SE Runtime Environment (build 1.7.0_67-b01)
Java HotSpot(TM) 64-Bit Server VM (build 24.65-b04, mixed mode)

Please provide any additional information below.
The attached patch fixes the performance problem and improves the
tokenization of text containing word, non-word, and transitional characters.
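
The attached patch is not reproduced here. As a rough illustration of the
general idea, the sketch below tokenizes in a single pass over the input
instead of chaining Matcher.replaceAll calls, so no intermediate strings are
built. The class name SinglePassTokenizer, the TRANSITIONAL character set, and
the splitting rules are assumptions made for illustration; they are not taken
from the patch or from UnicodeTokenizer.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical single-pass tokenizer; not the code from the attached patch.
final class SinglePassTokenizer {

    // Assumed set of "transitional" characters: they stay attached to the
    // current token and never cause a split on their own.
    private static final String TRANSITIONAL = "\"'.,!@-:;$?()/";

    private enum Kind { NONE, WORD, OTHER }

    static String[] tokenize(CharSequence text) {
        List<String> tokens = new ArrayList<String>();
        StringBuilder current = new StringBuilder();
        Kind kind = Kind.NONE;

        // Walk the input once, by code point, so supplementary characters
        // are classified as a whole rather than as two surrogates.
        for (int i = 0; i < text.length(); ) {
            int cp = Character.codePointAt(text, i);
            i += Character.charCount(cp);

            if (Character.isWhitespace(cp)) {
                // Whitespace always ends the current token.
                flush(current, tokens);
                kind = Kind.NONE;
            } else if (TRANSITIONAL.indexOf(cp) >= 0) {
                // Transitional characters join the token without changing its kind.
                current.appendCodePoint(cp);
            } else {
                Kind next = Character.isLetterOrDigit(cp) ? Kind.WORD : Kind.OTHER;
                // Split where a word run meets a non-word run, or vice versa.
                if (kind != Kind.NONE && kind != next) {
                    flush(current, tokens);
                }
                current.appendCodePoint(cp);
                kind = next;
            }
        }
        flush(current, tokens);
        return tokens.toArray(new String[0]);
    }

    // Emits the pending token, if any, and resets the buffer for reuse.
    private static void flush(StringBuilder current, List<String> tokens) {
        if (current.length() > 0) {
            tokens.add(current.toString());
            current.setLength(0);
        }
    }
}
```

Under these assumed rules, SinglePassTokenizer.tokenize("Hello, world! a+b")
yields the five tokens "Hello,", "world!", "a", "+", and "b". Each input
character is visited once and the work buffer is reused, which avoids the
repeated whole-string copies that chained replaceAll calls produce.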

Note: I am not the author of the attached HTML file that triggers the slow
performance.

Original issue reported on code.google.com by [email protected] on 14 Oct 2014 at 7:22

Attachments:

GoogleCodeExporter added Priority-Medium Type-Defect auto-migrated labels May 25, 2015
