- bumped Go to 1.22;
- bumped `github.com/zoomio/inout` to `v0.14.0`;
- introduced `UserAgent` (`-ua` in CLI mode) to allow passing a custom user agent for headless HTTP calls.
- bumped Go to 1.20;
- bumped `github.com/zoomio/inout` to `v0.13.0`.
- fixed dictionary loader for segmenter for Chinese & Japanese languages.
- BREAKING: from now on the `ContentOnly` option is set to `true` by default;
- optimization: moved the segmenter inside the config with lazy initialization, so it now happens only once;
- fix: in cases when language detection is reliable it is now using correct value;
- fix: use the same segmenter logic in the plain text processor.
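
The lazy, once-only initialization mentioned above can be sketched with `sync.Once`. This is a minimal illustration, not Tagify's actual code: the `config` and `segmenter` types, the `Segmenter` method and the `loadCount` field are all hypothetical names.

```go
package main

import (
	"fmt"
	"sync"
)

// segmenter is a hypothetical stand-in for an expensive-to-build text segmenter.
type segmenter struct {
	dict []string
}

// config is a hypothetical configuration struct, not Tagify's actual type.
type config struct {
	once      sync.Once
	seg       *segmenter
	loadCount int // counts how many times the segmenter was built
}

// Segmenter returns the shared segmenter, building it on first use only.
func (c *config) Segmenter() *segmenter {
	c.once.Do(func() {
		c.loadCount++
		c.seg = &segmenter{dict: []string{"example"}}
	})
	return c.seg
}

func main() {
	c := &config{}
	a := c.Segmenter()
	b := c.Segmenter()
	// Both calls return the same instance and the build ran a single time.
	fmt.Println(a == b, c.loadCount)
}
```

Keeping the `sync.Once` next to the value it guards means callers never need to know whether the segmenter has been built yet.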
- graduated `ContentOnly` option (`-content` option in the CLI mode);
- BREAKING: from now on the `-content` option in the CLI mode is set to `true` by default.
- use different segmentation logic based on the `github.com/go-ego/gse` segmenter for Chinese & Japanese languages;
- improved HTML parser logic: optimised the way it collects contents of a document and improved logic for splitting into sentences;
- fallback to the English language for the stop words in cases when language detection is not reliable;
- added `lang` option to the CLI to be able to provide the language of the document;
- bumped `github.com/zoomio/stopwords` to `0.11.0`.
- stopped ignoring `<h1>` headings when they are equal to the `<title>`; they are now included.
- Bumped `github.com/zoomio/inout` to `0.12.0`;
- Fixed `-q` option or `Query` in the code (HTTP/HTML mode only), so now it actually works and retrieves contents of the DOM element for the query;
- Introduced `-r` option or `WaitFor` (HTTP/HTML mode only) to allow waiting for a certain DOM element to be ready before getting HTML;
- Introduced `-u` option or `WaitUntil` (HTTP/HTML mode only) to allow waiting for a certain delay before getting HTML;
- Introduced `-i` option or `Screenshot` (HTTP/HTML mode only) to capture a full screenshot of HTML in the given path.
- Added macOS (darwin) ARM64 release.
- Bumped Go to 1.18;
- BREAKING: renamed `ParseHTML`, `ParseMD` & `ParseText` to `ProcessHTML`, `ProcessMD` & `ProcessText` respectively;
- BREAKING: renamed `extension.Result` to `extension.ExtResult`;
- New option `AllTagWeights` for enabling parsing through everything;
- New option `ExcludeTagsString` for prohibiting some of the tags;
- `ParseHTML` & `ParseMD` are made public to open up parsing capabilities.
- improved handling of the words with the "`" or "'" symbols.
- BREAKING FROM 0.53.0: changed `config.StopWords` option signature to expect a slice of strings instead of `*stopwords.Register`;
- bumped `github.com/zoomio/stopwords` to `0.10.0`.
- added `Option` called `StopWords` to allow for custom stop-words setup, also made `Domains` variable public.
- added URL sanitization in the texts, so it excludes things like http, https & www from them.
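
One way such URL sanitization could look is sketched below. It only strips the `http://`, `https://` and `www.` prefixes; the regular expression and the `stripURLPrefixes` name are assumptions made for illustration and may differ from what Tagify actually does.

```go
package main

import (
	"fmt"
	"regexp"
)

// urlPrefix matches a leading scheme or "www." prefix, case-insensitively.
// The exact pattern is an assumption made for this sketch.
var urlPrefix = regexp.MustCompile(`(?i)\b(?:https?://|www\.)`)

// stripURLPrefixes removes URL prefixes so they do not end up as tags.
func stripURLPrefixes(text string) string {
	return urlPrefix.ReplaceAllString(text, "")
}

func main() {
	fmt.Println(stripURLPrefixes("see https://www.example.com now"))
}
```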
- HTML processor: fallback to
- HTML processor: use the longest parsed line in order to detect document language.
- [BREAKING CHANGE (most likely)] extensions (BETA) release - this is the BIGGEST RELEASE since the addition of the Markdown (documentation is in progress);
- support backwards compatibility for `ContentTypeOf`.
- same as `v0.49.0`.
- added language detection in order to improve handling of stop words.
- FEATURE: added new parameter `-adjust-scores` to allow configuring scores adjustment to the interval from 0.0 to 1.0.
- consider only the <title> tags which are part of the .
- FEATURE: added two new parameters `-tag-weights` and `-tag-weights-json` to allow configuring parsed tags & weights for HTML and Markdown sources;
- FEATURE: HTML mode is now parsing contents of `<meta name="description" content="...">` by default;
- MISC: re-organised `processor` package into smaller focused sub-packages.
- HTML: prioritize longer page titles over the shorter ones;
- bumped Go version to 1.16;
- released with GitHub Actions.
- added support for Chinese, Hindi, Spanish, Arabic, Japanese, German, Hebrew, French and Korean tags.
- a milestone release (hence the jump in the version number), part of the "Faster Stronger Better" initiative;
- added support for Markdown content type;
- improved performance and accuracy of HTML tagifier;
- added experimental `-content` option, which targets "content"-only tags (such as headings and paragraphs);
- added experimental `-site` option, which allows Tagify to process a full site;
- added `Result#ForEach` for easier iteration through the tags;
- added `-version` option to show version of Tagify.
- BREAKING CHANGE (with `v0.33.0`): renamed `Result.Meta.DocVersion` to `Result.Meta.DocHash`.
- unified `#GetTagsFromString` with `#Run` so it is now a single API call - `#Run`;
- fixed logic with `Query` option when it was wrongly setting `ContentType` to `Text` instead of `HTML`, when `Query` was not empty.
- added hash value, which represents the version of the document in `Result.Meta.DocVersion`.
- added support for Russian tags.
- better handling of the page titles for HTML content types;
- more informative output in verbose mode in CLI - added "title" and "content-type".
- BREAKING CHANGE: changed shape of returned struct by `tagify#Run`;
- optimized the page title selection.
- BREAKING CHANGE: `tagify#Run` now returns a struct with extra data along with tags.
- better handling for the apostrophe stopwords.
- `#sanitize` now splits a word into parts if there are disallowed symbols in it;
- moved onto table-driven tests for some cases.
- simplified sentence splitting regex to only split by either of `.,!;:` symbols.
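
A sentence splitter along these lines might look as follows. This is only a sketch under the assumption that the regex is a plain character class over the listed symbols; `sentenceSep` and `splitSentences` are hypothetical names, and trimming and dropping empty pieces is an extra step added here for readability.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// sentenceSep matches any single one of the separator symbols . , ! ; :
var sentenceSep = regexp.MustCompile(`[.,!;:]`)

// splitSentences splits text on the separator symbols, trims whitespace
// and drops empty pieces.
func splitSentences(text string) []string {
	var out []string
	for _, part := range sentenceSep.Split(text, -1) {
		if s := strings.TrimSpace(part); s != "" {
			out = append(out, s)
		}
	}
	return out
}

func main() {
	for _, s := range splitSentences("Hello, world! Tags: go; nlp.") {
		fmt.Println(s)
	}
}
```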
- improved HTML parser, now it keeps crawling inside the tag even if there are other tags inside;
- added `<code>` tag for HTML processor;
- TF part of the TF-IDF score calculation is now logarithmically scaled;
- added more tests, plus a bit of a code re-shuffling;
- bumped `github.com/zoomio/stopwords` to `0.3.0`.
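
For reference, a common form of logarithmically scaled TF-IDF is sketched below. It assumes the textbook formulas `1 + ln(count)` for the TF part and `ln(N / n)` for the IDF part; the function names are hypothetical and Tagify's exact scoring may differ.

```go
package main

import (
	"fmt"
	"math"
)

// logTF returns a logarithmically scaled term frequency: 1 + ln(count),
// or 0 when the term does not occur.
func logTF(count int) float64 {
	if count <= 0 {
		return 0
	}
	return 1 + math.Log(float64(count))
}

// idf returns the inverse document frequency ln(totalDocs / docsWithTerm),
// or 0 when either argument is not positive.
func idf(totalDocs, docsWithTerm int) float64 {
	if totalDocs <= 0 || docsWithTerm <= 0 {
		return 0
	}
	return math.Log(float64(totalDocs) / float64(docsWithTerm))
}

// tfidf combines both parts into a single score.
func tfidf(count, totalDocs, docsWithTerm int) float64 {
	return logTF(count) * idf(totalDocs, docsWithTerm)
}

func main() {
	// A term seen 100 times scores higher than one seen once,
	// but far less than 100 times higher.
	fmt.Printf("%.3f\n", tfidf(1, 10, 2))
	fmt.Printf("%.3f\n", tfidf(100, 10, 2))
}
```

The logarithm dampens the effect of very frequent words, so a single repeated term does not dominate the ranking.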
- use TF-IDF for better tags scoring;
- bumped `github.com/zoomio/inout` to `0.8.2`;
- improved stop words detection by checking after sanitisation;
- added `<li>` tag for HTML processor.
- HTML: do not count page's title twice, if it is represented in one of the headings.
- fixed test.
- added `<title>` HTML tag handling;
- minor code optimizations.
- bumped `github.com/zoomio/inout` to `0.8.0`;
- removed dependency on `hizer` so now HTML parser actually works better.
- bumped `github.com/zoomio/inout` to `0.7.0`, to handle bigger lines of text;
- made private internal housekeeping struct in .
- bumped Go to `1.13.x`;
- better error printing/wrapping with `fmt.Fprintf(os.Stderr, ...)` and `fmt.Errorf("...%w...", ...)`;
- added benchmark test and profiling;
- improved infrastructure scripts for build and install, added `Makefile`;
- better help on `--help` option.
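
The error printing and wrapping pattern named above can be illustrated as follows. Wrapping uses the lowercase `%w` verb, available since Go 1.13. The `readSource` and `run` helpers and the `errEmptySource` sentinel are hypothetical and exist only for this sketch.

```go
package main

import (
	"errors"
	"fmt"
	"os"
)

// errEmptySource is a hypothetical sentinel error used only for this sketch.
var errEmptySource = errors.New("empty source")

// readSource is a hypothetical helper that fails for an empty source name.
func readSource(name string) error {
	if name == "" {
		return errEmptySource
	}
	return nil
}

// run wraps the underlying error with context using the %w verb.
func run(name string) error {
	if err := readSource(name); err != nil {
		return fmt.Errorf("failed to read source %q: %w", name, err)
	}
	return nil
}

func main() {
	if err := run(""); err != nil {
		// Errors go to stderr so they do not mix with regular output.
		fmt.Fprintf(os.Stderr, "error: %v\n", err)
		// errors.Is still finds the sentinel through the wrapping.
		fmt.Println(errors.Is(err, errEmptySource))
	}
}
```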
- bumped `github.com/zoomio/inout` to `0.6.0`;
- re-used "self-referential functions and the design of options" approach by Rob Pike by introducing `Option` and new API method `#Run`.
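
A simplified variant of that options pattern is sketched below. Only the `Option` and `Run` names come from the entry above; the `config` fields, the `Source`, `Limit` and `NoStopWords` options, the default limit and the return value of `Run` are assumptions made for illustration and do not describe Tagify's real API.

```go
package main

import "fmt"

// config holds hypothetical settings; the field names are illustrative only.
type config struct {
	source string
	limit  int
	noStop bool
}

// Option mutates the config; each option is a small self-contained function.
type Option func(*config)

// Source sets the input source (hypothetical option).
func Source(s string) Option { return func(c *config) { c.source = s } }

// Limit sets the maximum number of tags (hypothetical option).
func Limit(n int) Option { return func(c *config) { c.limit = n } }

// NoStopWords toggles the stop-words filter (hypothetical option).
func NoStopWords(v bool) Option { return func(c *config) { c.noStop = v } }

// Run applies the defaults first, then every option in order, and returns
// a description of the resulting config instead of doing real work.
func Run(opts ...Option) string {
	c := &config{limit: 5} // assumed default, for the sketch only
	for _, opt := range opts {
		opt(c)
	}
	return fmt.Sprintf("source=%s limit=%d noStop=%t", c.source, c.limit, c.noStop)
}

func main() {
	fmt.Println(Run(Source("page.html"), Limit(3)))
}
```

New settings can be added as new `Option` constructors without changing the signature of `Run`, which keeps later releases from breaking existing callers.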
- CSS query (`-q` option): improved overall querying logic, now it retrieves all texts from the matching tags;
- bumped `github.com/zoomio/inout` to `0.5.0`.
- bumped `github.com/zoomio/inout` to be able to wait for DOM elements to be visible on the web-pages;
- introduced `-q` option, which is short for "query", to allow providing a CSS query.
- added support for more HTML tags: `<h5>`, `<h6>` and `<a>`.
- breaking change: signature of `processor#ParseHTML` changed, removed `bool` argument `doTagify`; previously it returned a tuple `([]string, []*Tag)` and now it is a single result, `[]*Tag`;
- increased code coverage.
- breaking change: signature of `processor#ParseHTML` changed, added extra `bool` argument `tagify` (if set to true, then output `[]*Tag` slice will be populated, otherwise it will be empty) and return tuple values swapped places: `([]string, []*Tag)` instead of `([]*Tag, []string)`.
- bumped `github.com/zoomio/inout` from `0.1.0` to `0.2.0`.
- externalized inout into standalone package `github.com/zoomio/inout`.
- skip words that start with hyphen;
- moved to Go Modules.
- externalized stop-words into standalone package `github.com/zoomio/stopwords`.
- more refactoring;
- enabled skipped test;
- added `-no-stop` boolean flag, to allow disabling of stop-words filter.
- moved stop words into a `*.go` file;
- removed dependency on `github.com/gobuffalo/packr`.
- improved stop words list;
- changed math for HTML tag weights;
- improved `#normalize` to be more defensive in case the sanitize regex still returns something that is not a word.
- removed default and max limits for the tags query;
- moved `_scripts` to `_bin`;
- moved `_files` to `_resources`.
- refactored everything;
- added comments in some places to better understand logic;
- added tests for normalization/de-duping;
- added `-d` option to return tags along with detailed information.
- added de-duplication algorithm based on the inflection;
- `#GetTags` and `#GetTagsFromString` now accept `contentType` of type `ContentType`, which is more typo-proof.
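
To show why a dedicated type is more typo-proof than a bare `int` or string, here is a minimal sketch. Only the `ContentType` name comes from the entry above; the constants, the `String` method and the `parseContentType` helper are hypothetical and may not match the library's actual definitions.

```go
package main

import "fmt"

// ContentType is a dedicated type, so arbitrary ints or strings cannot be
// passed by accident where a content type is expected.
type ContentType byte

// The set of values below is an assumption made for this sketch.
const (
	Unknown ContentType = iota
	Text
	HTML
)

// String returns a human-readable name for the content type.
func (ct ContentType) String() string {
	switch ct {
	case Text:
		return "text"
	case HTML:
		return "html"
	default:
		return "unknown"
	}
}

// parseContentType is a hypothetical helper mapping user input to a
// ContentType; unrecognised input maps to Unknown instead of slipping through.
func parseContentType(name string) ContentType {
	switch name {
	case "text":
		return Text
	case "html":
		return HTML
	default:
		return Unknown
	}
}

func main() {
	fmt.Println(parseContentType("html"), parseContentType("htlm"))
}
```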
- code refactoring;
- simplified internal structure for ease of API use.
- better error handling;
- code refactoring;
- `#GetTags` now accepts `int` variable `conteType`;
- added `#GetTagsFromString` with equal signature to `#GetTags`.
- first release.