diff --git a/.vscode/settings.json b/.vscode/settings.json new file mode 100644 index 0000000..55712c1 --- /dev/null +++ b/.vscode/settings.json @@ -0,0 +1,3 @@ +{ + "typescript.tsdk": "node_modules/typescript/lib" +} \ No newline at end of file diff --git a/docs/guide/matchers.md b/docs/guide/matchers.md index 4835e44..5baa9a6 100644 --- a/docs/guide/matchers.md +++ b/docs/guide/matchers.md @@ -4,28 +4,20 @@ We've previously discussed patterns and transformers. It's time to learn about how to use Obscenity to search for blacklisted terms in text, while respecting whitelisted terms. -Obscenity provides two matchers which implement this behavior, which are quite similar: the `RegExpMatcher` and the `NfaMatcher`. Both have their pros and cons, which we'll discuss briefly here. +To facilitate this, Obscenity provides the `RegExpMatcher`, which -- as the name suggests -- implements matching using regular expressions and string searching methods. At a high level, all it does is: -- The `RegExpMatcher` implements matching using regular expressions and string searching methods. At a high level, all it does is: - - ``` - apply transformations to text before matching whitelisted terms - find whitelisted terms in text - - apply transformations to text before matching blacklisted terms - for each blacklisted term - for all matches of the blacklisted term in the text - if a whitelisted term did not match this part of the text - emit match - ``` - - The `RegExpMatcher` is the implementation we recommend for most applications, as it performs better than the `NfaMatcher` on small - medium numbers of patterns and consumes less memory as well. - -- The `NfaMatcher` implements matching using finite automata (more specifically, it builds a heavily modified [Aho-Corasick automaton](https://en.wikipedia.org/wiki/Aho%E2%80%93Corasick_algorithm) from the patterns and runs through the text once, walking the trie as it does so). - - It is, in theory, more efficient than the `RegExpMatcher` as it uses a single pass to match all the patterns, but the performance difference is only noticeable when you have a high number of patterns (> 100). Furthermore, as it has to build a trie from the patterns, it consumes more memory than the `RegExpMatcher` as well. +``` +apply transformations to text before matching whitelisted terms +find whitelisted terms in text + +apply transformations to text before matching blacklisted terms +for each blacklisted term + for all matches of the blacklisted term in the text + if a whitelisted term did not match this part of the text + emit match +``` -> **Note:** For the rest of this article, we will be using the `RegExpMatcher`, but it applies equally to the `NfaMatcher`. +For now, the `RegExpMatcher` is the only matcher implementation offered by Obscenity, though this may change in future versions. 
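+To make the flow above concrete, here is a minimal usage sketch. It is adapted from the examples in the reference documentation; it assumes the English preset for the terms and transformers, and the exact matches reported will of course depend on your input and configuration.
+
+```typescript
+import { RegExpMatcher, englishDataset, englishRecommendedTransformers } from 'obscenity';
+
+// Build a matcher from the English preset: the dataset supplies the
+// blacklisted and whitelisted terms, and the recommended transformers
+// normalize the input text before matching.
+const matcher = new RegExpMatcher({
+	...englishDataset.build(),
+	...englishRecommendedTransformers,
+});
+
+// Cheap presence check.
+matcher.hasMatch('fuck you'); // => true
+
+// Detailed match payloads; pass `true` to sort them by position and term ID.
+for (const match of matcher.getAllMatches('fuck you', true)) {
+	console.log(match.termId, match.startIndex, match.endIndex);
+}
+```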
## Providing matcher options diff --git a/docs/reference/README.md b/docs/reference/README.md index 80c7f5d..3e58498 100644 --- a/docs/reference/README.md +++ b/docs/reference/README.md @@ -11,7 +11,6 @@ obscenity ### Classes - [DataSet](classes/DataSet.md) -- [NfaMatcher](classes/NfaMatcher.md) - [ParserError](classes/ParserError.md) - [PhraseBuilder](classes/PhraseBuilder.md) - [RegExpMatcher](classes/RegExpMatcher.md) @@ -25,7 +24,6 @@ obscenity - [LiteralNode](interfaces/LiteralNode.md) - [MatchPayload](interfaces/MatchPayload.md) - [Matcher](interfaces/Matcher.md) -- [NfaMatcherOptions](interfaces/NfaMatcherOptions.md) - [OptionalNode](interfaces/OptionalNode.md) - [ParsedPattern](interfaces/ParsedPattern.md) - [PhraseContainer](interfaces/PhraseContainer.md) @@ -80,7 +78,7 @@ Context passed to [[TextCensorStrategy | text censoring strategies]]. #### Defined in -[src/censor/TextCensor.ts:104](https://github.com/jo3-l/obscenity/blob/563159b/src/censor/TextCensor.ts#L104) +[src/censor/TextCensor.ts:104](https://github.com/jo3-l/obscenity/blob/ae4df1a/src/censor/TextCensor.ts#L104) ___ @@ -94,7 +92,7 @@ should be a set of characters that map to the transformed character. #### Defined in -[src/transformer/remap-characters/index.ts:60](https://github.com/jo3-l/obscenity/blob/563159b/src/transformer/remap-characters/index.ts#L60) +[src/transformer/remap-characters/index.ts:60](https://github.com/jo3-l/obscenity/blob/ae4df1a/src/transformer/remap-characters/index.ts#L60) ___ @@ -106,7 +104,7 @@ All the profane words that are included in the [[englishDataset | english datase #### Defined in -[src/preset/english.ts:377](https://github.com/jo3-l/obscenity/blob/563159b/src/preset/english.ts#L377) +[src/preset/english.ts:377](https://github.com/jo3-l/obscenity/blob/ae4df1a/src/preset/english.ts#L377) ___ @@ -124,7 +122,7 @@ Extends the default match payload by adding phrase metadata. #### Defined in -[src/dataset/DataSet.ts:199](https://github.com/jo3-l/obscenity/blob/563159b/src/dataset/DataSet.ts#L199) +[src/dataset/DataSet.ts:190](https://github.com/jo3-l/obscenity/blob/ae4df1a/src/dataset/DataSet.ts#L190) ___ @@ -136,7 +134,7 @@ All the possible kinds of nodes. #### Defined in -[src/pattern/Nodes.ts:24](https://github.com/jo3-l/obscenity/blob/563159b/src/pattern/Nodes.ts#L24) +[src/pattern/Nodes.ts:24](https://github.com/jo3-l/obscenity/blob/ae4df1a/src/pattern/Nodes.ts#L24) ___ @@ -163,7 +161,7 @@ replacement string. #### Defined in -[src/censor/TextCensor.ts:99](https://github.com/jo3-l/obscenity/blob/563159b/src/censor/TextCensor.ts#L99) +[src/censor/TextCensor.ts:99](https://github.com/jo3-l/obscenity/blob/ae4df1a/src/censor/TextCensor.ts#L99) ## Variables @@ -224,7 +222,7 @@ SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE. 
#### Defined in -[src/preset/english.ts:103](https://github.com/jo3-l/obscenity/blob/563159b/src/preset/english.ts#L103) +[src/preset/english.ts:103](https://github.com/jo3-l/obscenity/blob/ae4df1a/src/preset/english.ts#L103) ___ @@ -237,20 +235,20 @@ A set of transformers to be used when matching blacklisted patterns with the #### Defined in -[src/preset/english.ts:14](https://github.com/jo3-l/obscenity/blob/563159b/src/preset/english.ts#L14) +[src/preset/english.ts:14](https://github.com/jo3-l/obscenity/blob/ae4df1a/src/preset/english.ts#L14) ___ ### englishRecommendedTransformers -• `Const` **englishRecommendedTransformers**: `Pick`<[`NfaMatcherOptions`](interfaces/NfaMatcherOptions.md), ``"blacklistMatcherTransformers"`` \| ``"whitelistMatcherTransformers"``\> +• `Const` **englishRecommendedTransformers**: `Pick`<[`RegExpMatcherOptions`](interfaces/RegExpMatcherOptions.md), ``"blacklistMatcherTransformers"`` \| ``"whitelistMatcherTransformers"``\> Recommended transformers to be used with the [[englishDataset | english word -dataset]] and the [[RegExpMatcher]] or the [[NfaMatcher]]. +dataset]] and the [[RegExpMatcher]]. #### Defined in -[src/preset/english.ts:48](https://github.com/jo3-l/obscenity/blob/563159b/src/preset/english.ts#L48) +[src/preset/english.ts:48](https://github.com/jo3-l/obscenity/blob/ae4df1a/src/preset/english.ts#L48) ___ @@ -263,7 +261,7 @@ A set of transformers to be used when matching whitelisted terms with the #### Defined in -[src/preset/english.ts:36](https://github.com/jo3-l/obscenity/blob/563159b/src/preset/english.ts#L36) +[src/preset/english.ts:36](https://github.com/jo3-l/obscenity/blob/ae4df1a/src/preset/english.ts#L36) ___ @@ -275,7 +273,7 @@ The current version of the library, formatted as `MAJOR.MINOR.PATCH`. #### Defined in -[src/index.ts:28](https://github.com/jo3-l/obscenity/blob/563159b/src/index.ts#L28) +[src/index.ts:27](https://github.com/jo3-l/obscenity/blob/ae4df1a/src/index.ts#L27) ## Functions @@ -310,11 +308,11 @@ const matcher = new RegExpMatcher({ [`BlacklistedTerm`](interfaces/BlacklistedTerm.md)[] A list of blacklisted terms with valid IDs which can then be passed -to the [[RegExpMatcher]] or [[NfaMatcher]]. +to the [[RegExpMatcher]]. #### Defined in -[src/matcher/BlacklistedTerm.ts:37](https://github.com/jo3-l/obscenity/blob/563159b/src/matcher/BlacklistedTerm.ts#L37) +[src/matcher/BlacklistedTerm.ts:37](https://github.com/jo3-l/obscenity/blob/ae4df1a/src/matcher/BlacklistedTerm.ts#L37) ___ @@ -341,7 +339,7 @@ A [[TextCensorStrategy]] for use with the [[TextCensor]]. #### Defined in -[src/censor/BuiltinStrategies.ts:71](https://github.com/jo3-l/obscenity/blob/563159b/src/censor/BuiltinStrategies.ts#L71) +[src/censor/BuiltinStrategies.ts:71](https://github.com/jo3-l/obscenity/blob/ae4df1a/src/censor/BuiltinStrategies.ts#L71) ___ @@ -400,11 +398,11 @@ const matcher = new RegExpMatcher({ ..., blacklistMatcherTransformers: [transfor `StatefulTransformerContainer` A container holding the transformer, which can then be passed to the -[[RegExpMatcher]] or the [[NfaMatcher]]. +[[RegExpMatcher]]. #### Defined in -[src/transformer/collapse-duplicates/index.ts:46](https://github.com/jo3-l/obscenity/blob/563159b/src/transformer/collapse-duplicates/index.ts#L46) +[src/transformer/collapse-duplicates/index.ts:46](https://github.com/jo3-l/obscenity/blob/ae4df1a/src/transformer/collapse-duplicates/index.ts#L46) ___ @@ -445,7 +443,7 @@ than the first. 
#### Defined in -[src/matcher/MatchPayload.ts:57](https://github.com/jo3-l/obscenity/blob/563159b/src/matcher/MatchPayload.ts#L57) +[src/matcher/MatchPayload.ts:57](https://github.com/jo3-l/obscenity/blob/ae4df1a/src/matcher/MatchPayload.ts#L57) ___ @@ -479,7 +477,7 @@ A [[TextCensorStrategy]] for use with the [[TextCensor]]. #### Defined in -[src/censor/BuiltinStrategies.ts:134](https://github.com/jo3-l/obscenity/blob/563159b/src/censor/BuiltinStrategies.ts#L134) +[src/censor/BuiltinStrategies.ts:134](https://github.com/jo3-l/obscenity/blob/ae4df1a/src/censor/BuiltinStrategies.ts#L134) ___ @@ -523,7 +521,7 @@ A [[TextCensorStrategy]] for use with the [[TextCensor]]. #### Defined in -[src/censor/BuiltinStrategies.ts:115](https://github.com/jo3-l/obscenity/blob/563159b/src/censor/BuiltinStrategies.ts#L115) +[src/censor/BuiltinStrategies.ts:115](https://github.com/jo3-l/obscenity/blob/ae4df1a/src/censor/BuiltinStrategies.ts#L115) ___ @@ -552,7 +550,7 @@ A [[TextCensorStrategy]] for use with the [[TextCensor]]. #### Defined in -[src/censor/BuiltinStrategies.ts:89](https://github.com/jo3-l/obscenity/blob/563159b/src/censor/BuiltinStrategies.ts#L89) +[src/censor/BuiltinStrategies.ts:89](https://github.com/jo3-l/obscenity/blob/ae4df1a/src/censor/BuiltinStrategies.ts#L89) ___ @@ -586,7 +584,7 @@ A [[TextCensorStrategy]] for use with the [[TextCensor]]. #### Defined in -[src/censor/BuiltinStrategies.ts:51](https://github.com/jo3-l/obscenity/blob/563159b/src/censor/BuiltinStrategies.ts#L51) +[src/censor/BuiltinStrategies.ts:51](https://github.com/jo3-l/obscenity/blob/ae4df1a/src/censor/BuiltinStrategies.ts#L51) ___ @@ -631,7 +629,7 @@ A [[TextCensorStrategy]] for use with the [[TextCensor]]. #### Defined in -[src/censor/BuiltinStrategies.ts:28](https://github.com/jo3-l/obscenity/blob/563159b/src/censor/BuiltinStrategies.ts#L28) +[src/censor/BuiltinStrategies.ts:28](https://github.com/jo3-l/obscenity/blob/ae4df1a/src/censor/BuiltinStrategies.ts#L28) ___ @@ -662,11 +660,11 @@ pattern. [`ParsedPattern`](interfaces/ParsedPattern.md) The parsed pattern, which can then be used with the -[[RegExpMatcher]] or the [[NfaMatcher]]. +[[RegExpMatcher]]. #### Defined in -[src/pattern/Pattern.ts:130](https://github.com/jo3-l/obscenity/blob/563159b/src/pattern/Pattern.ts#L130) +[src/pattern/Pattern.ts:130](https://github.com/jo3-l/obscenity/blob/ae4df1a/src/pattern/Pattern.ts#L130) ___ @@ -795,11 +793,11 @@ using a template tag. [`ParsedPattern`](interfaces/ParsedPattern.md) The parsed pattern, which can then be used with the -[[RegExpMatcher]] or the [[NfaMatcher]]. +[[RegExpMatcher]]. #### Defined in -[src/pattern/Pattern.ts:106](https://github.com/jo3-l/obscenity/blob/563159b/src/pattern/Pattern.ts#L106) +[src/pattern/Pattern.ts:106](https://github.com/jo3-l/obscenity/blob/ae4df1a/src/pattern/Pattern.ts#L106) ___ @@ -833,7 +831,7 @@ A [[TextCensorStrategy]] for use with the [[TextCensor]]. #### Defined in -[src/censor/BuiltinStrategies.ts:155](https://github.com/jo3-l/obscenity/blob/563159b/src/censor/BuiltinStrategies.ts#L155) +[src/censor/BuiltinStrategies.ts:155](https://github.com/jo3-l/obscenity/blob/ae4df1a/src/censor/BuiltinStrategies.ts#L155) ___ @@ -889,11 +887,11 @@ const matcher = new RegExpMatcher({ ..., blacklistMatcherTransformers: [transfor `SimpleTransformerContainer` A container holding the transformer, which can then be passed to the -[[RegExpMatcher]] or the [[NfaMatcher]]. +[[RegExpMatcher]]. 
#### Defined in -[src/transformer/remap-characters/index.ts:38](https://github.com/jo3-l/obscenity/blob/563159b/src/transformer/remap-characters/index.ts#L38) +[src/transformer/remap-characters/index.ts:38](https://github.com/jo3-l/obscenity/blob/ae4df1a/src/transformer/remap-characters/index.ts#L38) ___ @@ -922,11 +920,11 @@ const matcher = new RegExpMatcher({ ..., blacklistMatcherTransformers: [transfor `SimpleTransformerContainer` A container holding the transformer, which can then be passed to the -[[RegExpMatcher]] or the [[NfaMatcher]]. +[[RegExpMatcher]]. #### Defined in -[src/transformer/resolve-confusables/index.ts:22](https://github.com/jo3-l/obscenity/blob/563159b/src/transformer/resolve-confusables/index.ts#L22) +[src/transformer/resolve-confusables/index.ts:22](https://github.com/jo3-l/obscenity/blob/ae4df1a/src/transformer/resolve-confusables/index.ts#L22) ___ @@ -956,11 +954,11 @@ const matcher = new RegExpMatcher({ ..., blacklistMatcherTransformers: [transfor `SimpleTransformerContainer` A container holding the transformer, which can then be passed to the -[[RegExpMatcher]] or the [[NfaMatcher]]. +[[RegExpMatcher]]. #### Defined in -[src/transformer/resolve-leetspeak/index.ts:23](https://github.com/jo3-l/obscenity/blob/563159b/src/transformer/resolve-leetspeak/index.ts#L23) +[src/transformer/resolve-leetspeak/index.ts:23](https://github.com/jo3-l/obscenity/blob/ae4df1a/src/transformer/resolve-leetspeak/index.ts#L23) ___ @@ -990,11 +988,11 @@ const matcher = new RegExpMatcher({ ..., blacklistMatcherTransformers: [transfor `SimpleTransformerContainer` A container holding the transformer, which can then be passed to the -[[RegExpMatcher]] or the [[NfaMatcher]]. +[[RegExpMatcher]]. #### Defined in -[src/transformer/skip-non-alphabetic/index.ts:23](https://github.com/jo3-l/obscenity/blob/563159b/src/transformer/skip-non-alphabetic/index.ts#L23) +[src/transformer/skip-non-alphabetic/index.ts:23](https://github.com/jo3-l/obscenity/blob/ae4df1a/src/transformer/skip-non-alphabetic/index.ts#L23) ___ @@ -1017,8 +1015,8 @@ of varying cases. `SimpleTransformerContainer` A container holding the transformer, which can then be passed to the -[[RegExpMatcher]] or the [[NfaMatcher]]. +[[RegExpMatcher]]. 
#### Defined in -[src/transformer/to-ascii-lowercase/index.ts:18](https://github.com/jo3-l/obscenity/blob/563159b/src/transformer/to-ascii-lowercase/index.ts#L18) +[src/transformer/to-ascii-lowercase/index.ts:18](https://github.com/jo3-l/obscenity/blob/ae4df1a/src/transformer/to-ascii-lowercase/index.ts#L18) diff --git a/docs/reference/classes/DataSet.md b/docs/reference/classes/DataSet.md index b757069..1d8203a 100644 --- a/docs/reference/classes/DataSet.md +++ b/docs/reference/classes/DataSet.md @@ -63,7 +63,7 @@ const customDataset = new DataSet().addAll(englishDataset); #### Defined in -[src/dataset/DataSet.ts:29](https://github.com/jo3-l/obscenity/blob/563159b/src/dataset/DataSet.ts#L29) +[src/dataset/DataSet.ts:29](https://github.com/jo3-l/obscenity/blob/ae4df1a/src/dataset/DataSet.ts#L29) ___ @@ -96,16 +96,15 @@ const data = new DataSet<{ originalWord: string }>() #### Defined in -[src/dataset/DataSet.ts:75](https://github.com/jo3-l/obscenity/blob/563159b/src/dataset/DataSet.ts#L75) +[src/dataset/DataSet.ts:75](https://github.com/jo3-l/obscenity/blob/ae4df1a/src/dataset/DataSet.ts#L75) ___ ### build -▸ **build**(): `Pick`<[`NfaMatcherOptions`](../interfaces/NfaMatcherOptions.md), ``"blacklistedTerms"`` \| ``"whitelistedTerms"``\> +▸ **build**(): `Pick`<[`RegExpMatcherOptions`](../interfaces/RegExpMatcherOptions.md), ``"blacklistedTerms"`` \| ``"whitelistedTerms"``\> -Returns the dataset in a format suitable for usage with the [[RegExpMatcher]] -or the [[NfaMatcher]]. +Returns the dataset in a format suitable for usage with the [[RegExpMatcher]]. **`Example`** @@ -117,23 +116,13 @@ const matcher = new RegExpMatcher({ }); ``` -**`Example`** - -```typescript -// With the NfaMatcher: -const matcher = new NfaMatcher({ - ...dataset.build(), - // additional options here -}); -``` - #### Returns -`Pick`<[`NfaMatcherOptions`](../interfaces/NfaMatcherOptions.md), ``"blacklistedTerms"`` \| ``"whitelistedTerms"``\> +`Pick`<[`RegExpMatcherOptions`](../interfaces/RegExpMatcherOptions.md), ``"blacklistedTerms"`` \| ``"whitelistedTerms"``\> #### Defined in -[src/dataset/DataSet.ts:127](https://github.com/jo3-l/obscenity/blob/563159b/src/dataset/DataSet.ts#L127) +[src/dataset/DataSet.ts:118](https://github.com/jo3-l/obscenity/blob/ae4df1a/src/dataset/DataSet.ts#L118) ___ @@ -165,7 +154,7 @@ const phraseMetadata = matchesWithPhraseMetadata[0].phraseMetadata; #### Defined in -[src/dataset/DataSet.ts:94](https://github.com/jo3-l/obscenity/blob/563159b/src/dataset/DataSet.ts#L94) +[src/dataset/DataSet.ts:94](https://github.com/jo3-l/obscenity/blob/ae4df1a/src/dataset/DataSet.ts#L94) ___ @@ -195,4 +184,4 @@ const customDataset = new DataSet<{ originalWord: string }>() #### Defined in -[src/dataset/DataSet.ts:46](https://github.com/jo3-l/obscenity/blob/563159b/src/dataset/DataSet.ts#L46) +[src/dataset/DataSet.ts:46](https://github.com/jo3-l/obscenity/blob/ae4df1a/src/dataset/DataSet.ts#L46) diff --git a/docs/reference/classes/NfaMatcher.md b/docs/reference/classes/NfaMatcher.md deleted file mode 100644 index cb2e452..0000000 --- a/docs/reference/classes/NfaMatcher.md +++ /dev/null @@ -1,176 +0,0 @@ -[obscenity](../README.md) / NfaMatcher - -# Class: NfaMatcher - -An implementation of the [[Matcher]] interface using finite automata -techniques. - -It is theoretically faster than the [[RegExpMatcher]]: the `hasMatch()` and -`getAllMatches()` execute in time proportional only to that of the length of -the input text and the number of matches. 
In other words, it _theoretically_ -should not degrade in performance as you add more terms - matching with 100 -and 1000 patterns should have the same performance. It achieves this by -building a heavily modified [Aho-Corasick -automaton](https://en.wikipedia.org/wiki/Aho%E2%80%93Corasick_algorithm) from -the input patterns. - -In practice, its high constant factors make it slower than the -[[RegExpMatcher]] until about ~100 patterns, at which point both -implementations have approximately the same performance. - -The regular-expression matcher should be preferred to this one if at all -possible, as it uses more memory and is only marginally faster at the scale -most users of this package are expected to use it at. However, it may be -appropriate if: - -- You have a large number of patterns (> 100); -- You expect to be matching on long text; -- You have benchmarked the implementations and found the [[NfaMatcher]] to be - noticeably faster. - -## Implements - -- [`Matcher`](../interfaces/Matcher.md) - -## Table of contents - -### Constructors - -- [constructor](NfaMatcher.md#constructor) - -### Methods - -- [getAllMatches](NfaMatcher.md#getallmatches) -- [hasMatch](NfaMatcher.md#hasmatch) - -## Constructors - -### constructor - -• **new NfaMatcher**(`options`) - -Creates a new [[NfaMatcher]] with the options given. - -**`Example`** - -```typescript -// Use the options provided by the English preset. -const matcher = new NfaMatcher({ - ...englishDataset.build(), - ...englishRecommendedTransformers, -}); -``` - -**`Example`** - -```typescript -// Simple matcher that only has blacklisted patterns. -const matcher = new NfaMatcher({ - blacklistedTerms: assignIncrementingIds([ - pattern`fuck`, - pattern`f?uck`, // wildcards (?) - pattern`bitch`, - pattern`b[i]tch` // optionals ([i] matches either "i" or "") - ]), -}); - -// Check whether some string matches any of the patterns. -const doesMatch = matcher.hasMatch('fuck you bitch'); -``` - -**`Example`** - -```typescript -// A more advanced example, with transformers and whitelisted terms. -const matcher = new NfaMatcher({ - blacklistedTerms: [ - { id: 1, pattern: pattern`penis` }, - { id: 2, pattern: pattern`fuck` }, - ], - whitelistedTerms: ['pen is'], - blacklistMatcherTransformers: [ - resolveConfusablesTransformer(), // '🅰' => 'a' - resolveLeetSpeakTransformer(), // '$' => 's' - foldAsciiCharCaseTransformer(), // case insensitive matching - skipNonAlphabeticTransformer(), // 'f.u...c.k' => 'fuck' - collapseDuplicatesTransformer(), // 'aaaa' => 'a' - ], -}); - -// Output all matches. -console.log(matcher.getAllMatches('fu.....uuuuCK the pen is mightier than the sword!')); -``` - -#### Parameters - -| Name | Type | Description | -| :------ | :------ | :------ | -| `options` | [`NfaMatcherOptions`](../interfaces/NfaMatcherOptions.md) | Options to use. | - -#### Defined in - -[src/matcher/nfa/NfaMatcher.ts:170](https://github.com/jo3-l/obscenity/blob/563159b/src/matcher/nfa/NfaMatcher.ts#L170) - -## Methods - -### getAllMatches - -▸ **getAllMatches**(`input`, `sorted?`): [`MatchPayload`](../interfaces/MatchPayload.md)[] - -Returns all matches of blacklisted terms in the text. - -If you only need to check for the presence of a match, and do not need -more specific information about the matches, use the `hasMatch()` method, -which is typically more efficient. - -#### Parameters - -| Name | Type | Default value | Description | -| :------ | :------ | :------ | :------ | -| `input` | `string` | `undefined` | Text to find profanities in. 
| -| `sorted` | `boolean` | `false` | Whether the resulting list of matches should be sorted using [[compareMatchByPositionAndId]]. Defaults to `false`. | - -#### Returns - -[`MatchPayload`](../interfaces/MatchPayload.md)[] - -A list of matches of the matcher on the text. The matches are -guaranteed to be sorted if and only if the `sorted` parameter is `true`, -otherwise, their order is unspecified. - -#### Implementation of - -[Matcher](../interfaces/Matcher.md).[getAllMatches](../interfaces/Matcher.md#getallmatches) - -#### Defined in - -[src/matcher/nfa/NfaMatcher.ts:202](https://github.com/jo3-l/obscenity/blob/563159b/src/matcher/nfa/NfaMatcher.ts#L202) - -___ - -### hasMatch - -▸ **hasMatch**(`input`): `boolean` - -Checks whether there is a match for any blacklisted term in the text. - -This is typically more efficient than calling `getAllMatches` and -checking the result, though it depends on the implementation. - -#### Parameters - -| Name | Type | Description | -| :------ | :------ | :------ | -| `input` | `string` | Text to check. | - -#### Returns - -`boolean` - -#### Implementation of - -[Matcher](../interfaces/Matcher.md).[hasMatch](../interfaces/Matcher.md#hasmatch) - -#### Defined in - -[src/matcher/nfa/NfaMatcher.ts:197](https://github.com/jo3-l/obscenity/blob/563159b/src/matcher/nfa/NfaMatcher.ts#L197) diff --git a/docs/reference/classes/ParserError.md b/docs/reference/classes/ParserError.md index 3ad63b5..17b7337 100644 --- a/docs/reference/classes/ParserError.md +++ b/docs/reference/classes/ParserError.md @@ -44,7 +44,7 @@ Error.constructor #### Defined in -[src/pattern/ParserError.ts:18](https://github.com/jo3-l/obscenity/blob/563159b/src/pattern/ParserError.ts#L18) +[src/pattern/ParserError.ts:18](https://github.com/jo3-l/obscenity/blob/ae4df1a/src/pattern/ParserError.ts#L18) ## Properties @@ -57,7 +57,7 @@ Note that surrogate pairs are counted as 1 column wide, not 2. #### Defined in -[src/pattern/ParserError.ts:16](https://github.com/jo3-l/obscenity/blob/563159b/src/pattern/ParserError.ts#L16) +[src/pattern/ParserError.ts:16](https://github.com/jo3-l/obscenity/blob/ae4df1a/src/pattern/ParserError.ts#L16) ___ @@ -69,7 +69,7 @@ The line on which the error occurred (one-based). #### Defined in -[src/pattern/ParserError.ts:10](https://github.com/jo3-l/obscenity/blob/563159b/src/pattern/ParserError.ts#L10) +[src/pattern/ParserError.ts:10](https://github.com/jo3-l/obscenity/blob/ae4df1a/src/pattern/ParserError.ts#L10) ___ @@ -83,7 +83,7 @@ Error.message #### Defined in -node_modules/.pnpm/typescript@5.1.3/node_modules/typescript/lib/lib.es5.d.ts:1068 +node_modules/.pnpm/typescript@5.2.2/node_modules/typescript/lib/lib.es5.d.ts:1068 ___ @@ -97,7 +97,7 @@ Error.name #### Defined in -[src/pattern/ParserError.ts:5](https://github.com/jo3-l/obscenity/blob/563159b/src/pattern/ParserError.ts#L5) +[src/pattern/ParserError.ts:5](https://github.com/jo3-l/obscenity/blob/ae4df1a/src/pattern/ParserError.ts#L5) ___ @@ -111,4 +111,4 @@ Error.stack #### Defined in -node_modules/.pnpm/typescript@5.1.3/node_modules/typescript/lib/lib.es5.d.ts:1069 +node_modules/.pnpm/typescript@5.2.2/node_modules/typescript/lib/lib.es5.d.ts:1069 diff --git a/docs/reference/classes/PhraseBuilder.md b/docs/reference/classes/PhraseBuilder.md index 564ad98..1855623 100644 --- a/docs/reference/classes/PhraseBuilder.md +++ b/docs/reference/classes/PhraseBuilder.md @@ -55,7 +55,7 @@ Associates a pattern with this phrase. 
#### Defined in -[src/dataset/DataSet.ts:158](https://github.com/jo3-l/obscenity/blob/563159b/src/dataset/DataSet.ts#L158) +[src/dataset/DataSet.ts:149](https://github.com/jo3-l/obscenity/blob/ae4df1a/src/dataset/DataSet.ts#L149) ___ @@ -77,7 +77,7 @@ Associates a whitelisted pattern with this phrase. #### Defined in -[src/dataset/DataSet.ts:168](https://github.com/jo3-l/obscenity/blob/563159b/src/dataset/DataSet.ts#L168) +[src/dataset/DataSet.ts:159](https://github.com/jo3-l/obscenity/blob/ae4df1a/src/dataset/DataSet.ts#L159) ___ @@ -94,7 +94,7 @@ Builds the phrase, returning a [[PhraseContainer]] for use with the #### Defined in -[src/dataset/DataSet.ts:187](https://github.com/jo3-l/obscenity/blob/563159b/src/dataset/DataSet.ts#L187) +[src/dataset/DataSet.ts:178](https://github.com/jo3-l/obscenity/blob/ae4df1a/src/dataset/DataSet.ts#L178) ___ @@ -116,4 +116,4 @@ Associates some metadata with this phrase. #### Defined in -[src/dataset/DataSet.ts:178](https://github.com/jo3-l/obscenity/blob/563159b/src/dataset/DataSet.ts#L178) +[src/dataset/DataSet.ts:169](https://github.com/jo3-l/obscenity/blob/ae4df1a/src/dataset/DataSet.ts#L169) diff --git a/docs/reference/classes/RegExpMatcher.md b/docs/reference/classes/RegExpMatcher.md index 99226d3..61d650b 100644 --- a/docs/reference/classes/RegExpMatcher.md +++ b/docs/reference/classes/RegExpMatcher.md @@ -5,13 +5,6 @@ An implementation of the [[Matcher]] interface using regular expressions and string searching methods. -It should be the default choice for users of this package, as though it is -theoretically slower than the more complex [[NfaMatcher]], it uses much less -memory and is more efficient for low/medium numbers of patterns. - -Refer to the documentation of the [[NfaMatcher]] class for further discussion -on when to choose that implementation over this one. - ## Implements - [`Matcher`](../interfaces/Matcher.md) @@ -93,7 +86,7 @@ console.log(matcher.getAllMatches('fu.....uuuuCK the pen is mightier than the sw #### Defined in -[src/matcher/regexp/RegExpMatcher.ts:81](https://github.com/jo3-l/obscenity/blob/563159b/src/matcher/regexp/RegExpMatcher.ts#L81) +[src/matcher/regexp/RegExpMatcher.ts:74](https://github.com/jo3-l/obscenity/blob/ae4df1a/src/matcher/regexp/RegExpMatcher.ts#L74) ## Methods @@ -128,7 +121,7 @@ otherwise, their order is unspecified. #### Defined in -[src/matcher/regexp/RegExpMatcher.ts:93](https://github.com/jo3-l/obscenity/blob/563159b/src/matcher/regexp/RegExpMatcher.ts#L93) +[src/matcher/regexp/RegExpMatcher.ts:86](https://github.com/jo3-l/obscenity/blob/ae4df1a/src/matcher/regexp/RegExpMatcher.ts#L86) ___ @@ -157,4 +150,4 @@ checking the result, though it depends on the implementation. #### Defined in -[src/matcher/regexp/RegExpMatcher.ts:123](https://github.com/jo3-l/obscenity/blob/563159b/src/matcher/regexp/RegExpMatcher.ts#L123) +[src/matcher/regexp/RegExpMatcher.ts:116](https://github.com/jo3-l/obscenity/blob/ae4df1a/src/matcher/regexp/RegExpMatcher.ts#L116) diff --git a/docs/reference/classes/TextCensor.md b/docs/reference/classes/TextCensor.md index 85b8f57..e34e4b9 100644 --- a/docs/reference/classes/TextCensor.md +++ b/docs/reference/classes/TextCensor.md @@ -58,7 +58,7 @@ The censored text. 
#### Defined in -[src/censor/TextCensor.ts:66](https://github.com/jo3-l/obscenity/blob/563159b/src/censor/TextCensor.ts#L66) +[src/censor/TextCensor.ts:66](https://github.com/jo3-l/obscenity/blob/ae4df1a/src/censor/TextCensor.ts#L66) ___ @@ -104,4 +104,4 @@ utility functions: #### Defined in -[src/censor/TextCensor.ts:41](https://github.com/jo3-l/obscenity/blob/563159b/src/censor/TextCensor.ts#L41) +[src/censor/TextCensor.ts:41](https://github.com/jo3-l/obscenity/blob/ae4df1a/src/censor/TextCensor.ts#L41) diff --git a/docs/reference/enums/SyntaxKind.md b/docs/reference/enums/SyntaxKind.md index 539bb8e..af91d2e 100644 --- a/docs/reference/enums/SyntaxKind.md +++ b/docs/reference/enums/SyntaxKind.md @@ -21,7 +21,7 @@ An enumeration of the kinds of nodes there are. #### Defined in -[src/pattern/Nodes.ts:33](https://github.com/jo3-l/obscenity/blob/563159b/src/pattern/Nodes.ts#L33) +[src/pattern/Nodes.ts:33](https://github.com/jo3-l/obscenity/blob/ae4df1a/src/pattern/Nodes.ts#L33) ___ @@ -31,7 +31,7 @@ ___ #### Defined in -[src/pattern/Nodes.ts:32](https://github.com/jo3-l/obscenity/blob/563159b/src/pattern/Nodes.ts#L32) +[src/pattern/Nodes.ts:32](https://github.com/jo3-l/obscenity/blob/ae4df1a/src/pattern/Nodes.ts#L32) ___ @@ -41,7 +41,7 @@ ___ #### Defined in -[src/pattern/Nodes.ts:30](https://github.com/jo3-l/obscenity/blob/563159b/src/pattern/Nodes.ts#L30) +[src/pattern/Nodes.ts:30](https://github.com/jo3-l/obscenity/blob/ae4df1a/src/pattern/Nodes.ts#L30) ___ @@ -51,4 +51,4 @@ ___ #### Defined in -[src/pattern/Nodes.ts:31](https://github.com/jo3-l/obscenity/blob/563159b/src/pattern/Nodes.ts#L31) +[src/pattern/Nodes.ts:31](https://github.com/jo3-l/obscenity/blob/ae4df1a/src/pattern/Nodes.ts#L31) diff --git a/docs/reference/interfaces/BlacklistedTerm.md b/docs/reference/interfaces/BlacklistedTerm.md index dd6edf5..52ac446 100644 --- a/docs/reference/interfaces/BlacklistedTerm.md +++ b/docs/reference/interfaces/BlacklistedTerm.md @@ -21,7 +21,7 @@ The identifier of the pattern; should be unique across all patterns. #### Defined in -[src/matcher/BlacklistedTerm.ts:10](https://github.com/jo3-l/obscenity/blob/563159b/src/matcher/BlacklistedTerm.ts#L10) +[src/matcher/BlacklistedTerm.ts:10](https://github.com/jo3-l/obscenity/blob/ae4df1a/src/matcher/BlacklistedTerm.ts#L10) ___ @@ -33,4 +33,4 @@ The parsed pattern. #### Defined in -[src/matcher/BlacklistedTerm.ts:15](https://github.com/jo3-l/obscenity/blob/563159b/src/matcher/BlacklistedTerm.ts#L15) +[src/matcher/BlacklistedTerm.ts:15](https://github.com/jo3-l/obscenity/blob/ae4df1a/src/matcher/BlacklistedTerm.ts#L15) diff --git a/docs/reference/interfaces/BoundaryAssertionNode.md b/docs/reference/interfaces/BoundaryAssertionNode.md index 01c4262..0032616 100644 --- a/docs/reference/interfaces/BoundaryAssertionNode.md +++ b/docs/reference/interfaces/BoundaryAssertionNode.md @@ -18,4 +18,4 @@ A boundary assertion node. 
#### Defined in -[src/pattern/Nodes.ts:72](https://github.com/jo3-l/obscenity/blob/563159b/src/pattern/Nodes.ts#L72) +[src/pattern/Nodes.ts:72](https://github.com/jo3-l/obscenity/blob/ae4df1a/src/pattern/Nodes.ts#L72) diff --git a/docs/reference/interfaces/CollapseDuplicatesTransformerOptions.md b/docs/reference/interfaces/CollapseDuplicatesTransformerOptions.md index f010251..4fb2198 100644 --- a/docs/reference/interfaces/CollapseDuplicatesTransformerOptions.md +++ b/docs/reference/interfaces/CollapseDuplicatesTransformerOptions.md @@ -37,7 +37,7 @@ new Map() #### Defined in -[src/transformer/collapse-duplicates/index.ts:91](https://github.com/jo3-l/obscenity/blob/563159b/src/transformer/collapse-duplicates/index.ts#L91) +[src/transformer/collapse-duplicates/index.ts:91](https://github.com/jo3-l/obscenity/blob/ae4df1a/src/transformer/collapse-duplicates/index.ts#L91) ___ @@ -59,4 +59,4 @@ would be transformed to `aa`. #### Defined in -[src/transformer/collapse-duplicates/index.ts:102](https://github.com/jo3-l/obscenity/blob/563159b/src/transformer/collapse-duplicates/index.ts#L102) +[src/transformer/collapse-duplicates/index.ts:102](https://github.com/jo3-l/obscenity/blob/ae4df1a/src/transformer/collapse-duplicates/index.ts#L102) diff --git a/docs/reference/interfaces/LiteralNode.md b/docs/reference/interfaces/LiteralNode.md index 4498d6c..25b9465 100644 --- a/docs/reference/interfaces/LiteralNode.md +++ b/docs/reference/interfaces/LiteralNode.md @@ -21,7 +21,7 @@ The code points that this literal matches. #### Defined in -[src/pattern/Nodes.ts:63](https://github.com/jo3-l/obscenity/blob/563159b/src/pattern/Nodes.ts#L63) +[src/pattern/Nodes.ts:63](https://github.com/jo3-l/obscenity/blob/ae4df1a/src/pattern/Nodes.ts#L63) ___ @@ -31,4 +31,4 @@ ___ #### Defined in -[src/pattern/Nodes.ts:65](https://github.com/jo3-l/obscenity/blob/563159b/src/pattern/Nodes.ts#L65) +[src/pattern/Nodes.ts:65](https://github.com/jo3-l/obscenity/blob/ae4df1a/src/pattern/Nodes.ts#L65) diff --git a/docs/reference/interfaces/MatchPayload.md b/docs/reference/interfaces/MatchPayload.md index d012c86..7d6b533 100644 --- a/docs/reference/interfaces/MatchPayload.md +++ b/docs/reference/interfaces/MatchPayload.md @@ -29,7 +29,7 @@ then this points to the index of the low surrogate. #### Defined in -[src/matcher/MatchPayload.ts:16](https://github.com/jo3-l/obscenity/blob/563159b/src/matcher/MatchPayload.ts#L16) +[src/matcher/MatchPayload.ts:16](https://github.com/jo3-l/obscenity/blob/ae4df1a/src/matcher/MatchPayload.ts#L16) ___ @@ -41,7 +41,7 @@ Total number of code points that matched. #### Defined in -[src/matcher/MatchPayload.ts:21](https://github.com/jo3-l/obscenity/blob/563159b/src/matcher/MatchPayload.ts#L21) +[src/matcher/MatchPayload.ts:21](https://github.com/jo3-l/obscenity/blob/ae4df1a/src/matcher/MatchPayload.ts#L21) ___ @@ -53,7 +53,7 @@ Start index of the match, inclusive. #### Defined in -[src/matcher/MatchPayload.ts:26](https://github.com/jo3-l/obscenity/blob/563159b/src/matcher/MatchPayload.ts#L26) +[src/matcher/MatchPayload.ts:26](https://github.com/jo3-l/obscenity/blob/ae4df1a/src/matcher/MatchPayload.ts#L26) ___ @@ -65,4 +65,4 @@ ID of the blacklisted term that matched.
#### Defined in -[src/matcher/MatchPayload.ts:31](https://github.com/jo3-l/obscenity/blob/563159b/src/matcher/MatchPayload.ts#L31) +[src/matcher/MatchPayload.ts:31](https://github.com/jo3-l/obscenity/blob/ae4df1a/src/matcher/MatchPayload.ts#L31) diff --git a/docs/reference/interfaces/Matcher.md b/docs/reference/interfaces/Matcher.md index 7a3a523..dd3da8e 100644 --- a/docs/reference/interfaces/Matcher.md +++ b/docs/reference/interfaces/Matcher.md @@ -6,15 +6,10 @@ Searches for blacklisted terms in text, ignoring parts matched by whitelisted terms. See: -- [[NfaMatcher]] for an implementation using finite automata; - [[RegExpMatcher]] for an implementation using regular expressions. -Refer to the documentation of the classes mentioned above for discussion of -which circumstances one should prefer one over the other. - ## Implemented by -- [`NfaMatcher`](../classes/NfaMatcher.md) - [`RegExpMatcher`](../classes/RegExpMatcher.md) ## Table of contents @@ -53,7 +48,7 @@ otherwise, their order is unspecified. #### Defined in -[src/matcher/Matcher.ts:29](https://github.com/jo3-l/obscenity/blob/563159b/src/matcher/Matcher.ts#L29) +[src/matcher/Matcher.ts:25](https://github.com/jo3-l/obscenity/blob/ae4df1a/src/matcher/Matcher.ts#L25) ___ @@ -78,4 +73,4 @@ checking the result, though it depends on the implementation. #### Defined in -[src/matcher/Matcher.ts:39](https://github.com/jo3-l/obscenity/blob/563159b/src/matcher/Matcher.ts#L39) +[src/matcher/Matcher.ts:35](https://github.com/jo3-l/obscenity/blob/ae4df1a/src/matcher/Matcher.ts#L35) diff --git a/docs/reference/interfaces/NfaMatcherOptions.md b/docs/reference/interfaces/NfaMatcherOptions.md deleted file mode 100644 index b149c98..0000000 --- a/docs/reference/interfaces/NfaMatcherOptions.md +++ /dev/null @@ -1,94 +0,0 @@ -[obscenity](../README.md) / NfaMatcherOptions - -# Interface: NfaMatcherOptions - -Options for the [[NfaMatcher]]. - -## Table of contents - -### Properties - -- [blacklistMatcherTransformers](NfaMatcherOptions.md#blacklistmatchertransformers) -- [blacklistedTerms](NfaMatcherOptions.md#blacklistedterms) -- [whitelistMatcherTransformers](NfaMatcherOptions.md#whitelistmatchertransformers) -- [whitelistedTerms](NfaMatcherOptions.md#whitelistedterms) - -## Properties - -### blacklistMatcherTransformers - -• `Optional` **blacklistMatcherTransformers**: `TransformerContainer`[] - -A set of transformers that should be applied to the input text before -blacklisted patterns are matched. This does not affect the matching of -whitelisted terms. - -Transformers will be applied in the order they appear. - -**`Default`** - -```ts -[] -``` - -#### Defined in - -[src/matcher/nfa/NfaMatcher.ts:622](https://github.com/jo3-l/obscenity/blob/563159b/src/matcher/nfa/NfaMatcher.ts#L622) - -___ - -### blacklistedTerms - -• **blacklistedTerms**: [`BlacklistedTerm`](BlacklistedTerm.md)[] - -A list of blacklisted terms. - -#### Defined in - -[src/matcher/nfa/NfaMatcher.ts:627](https://github.com/jo3-l/obscenity/blob/563159b/src/matcher/nfa/NfaMatcher.ts#L627) - -___ - -### whitelistMatcherTransformers - -• `Optional` **whitelistMatcherTransformers**: `TransformerContainer`[] - -A set of transformers that should be applied to the input text before -whitelisted terms are matched. This does not affect the matching of -blacklisted terms. - -Transformers will be applied in the order they appear. 
**`Default`** - -```ts -[] -``` - -#### Defined in - -[src/matcher/nfa/NfaMatcher.ts:638](https://github.com/jo3-l/obscenity/blob/563159b/src/matcher/nfa/NfaMatcher.ts#L638) - -___ - -### whitelistedTerms - -• `Optional` **whitelistedTerms**: `string`[] - -A list of whitelisted terms. If a whitelisted term matches some part of -the text, a match of a blacklisted pattern within that part of the text -will not be emitted. - -For example, if we had a pattern `penis` and a whitelisted term `pen is`, -no matches would be reported for the input text `the pen is mightier -than the sword.` - -**`Default`** - -```ts -[] -``` - -#### Defined in - -[src/matcher/nfa/NfaMatcher.ts:651](https://github.com/jo3-l/obscenity/blob/563159b/src/matcher/nfa/NfaMatcher.ts#L651) diff --git a/docs/reference/interfaces/OptionalNode.md b/docs/reference/interfaces/OptionalNode.md index 6990a2b..43c0a33 100644 --- a/docs/reference/interfaces/OptionalNode.md +++ b/docs/reference/interfaces/OptionalNode.md @@ -22,7 +22,7 @@ would be a literal node with the value `abc`. #### Defined in -[src/pattern/Nodes.ts:44](https://github.com/jo3-l/obscenity/blob/563159b/src/pattern/Nodes.ts#L44) +[src/pattern/Nodes.ts:44](https://github.com/jo3-l/obscenity/blob/ae4df1a/src/pattern/Nodes.ts#L44) ___ @@ -32,4 +32,4 @@ ___ #### Defined in -[src/pattern/Nodes.ts:46](https://github.com/jo3-l/obscenity/blob/563159b/src/pattern/Nodes.ts#L46) +[src/pattern/Nodes.ts:46](https://github.com/jo3-l/obscenity/blob/ae4df1a/src/pattern/Nodes.ts#L46) diff --git a/docs/reference/interfaces/ParsedPattern.md b/docs/reference/interfaces/ParsedPattern.md index 3addb8a..ce1ac2a 100644 --- a/docs/reference/interfaces/ParsedPattern.md +++ b/docs/reference/interfaces/ParsedPattern.md @@ -22,7 +22,7 @@ A list of nodes which make up the pattern. #### Defined in -[src/pattern/Nodes.ts:8](https://github.com/jo3-l/obscenity/blob/563159b/src/pattern/Nodes.ts#L8) +[src/pattern/Nodes.ts:8](https://github.com/jo3-l/obscenity/blob/ae4df1a/src/pattern/Nodes.ts#L8) ___ @@ -34,7 +34,7 @@ Whether the pattern requires a word boundary at the end. #### Defined in -[src/pattern/Nodes.ts:13](https://github.com/jo3-l/obscenity/blob/563159b/src/pattern/Nodes.ts#L13) +[src/pattern/Nodes.ts:13](https://github.com/jo3-l/obscenity/blob/ae4df1a/src/pattern/Nodes.ts#L13) ___ @@ -46,4 +46,4 @@ Whether the pattern requires a word boundary at the start. #### Defined in -[src/pattern/Nodes.ts:18](https://github.com/jo3-l/obscenity/blob/563159b/src/pattern/Nodes.ts#L18) +[src/pattern/Nodes.ts:18](https://github.com/jo3-l/obscenity/blob/ae4df1a/src/pattern/Nodes.ts#L18) diff --git a/docs/reference/interfaces/PhraseContainer.md b/docs/reference/interfaces/PhraseContainer.md index 813098f..9da7496 100644 --- a/docs/reference/interfaces/PhraseContainer.md +++ b/docs/reference/interfaces/PhraseContainer.md @@ -28,7 +28,7 @@ Metadata associated with this phrase. #### Defined in -[src/dataset/DataSet.ts:213](https://github.com/jo3-l/obscenity/blob/563159b/src/dataset/DataSet.ts#L213) +[src/dataset/DataSet.ts:204](https://github.com/jo3-l/obscenity/blob/ae4df1a/src/dataset/DataSet.ts#L204) ___ @@ -40,7 +40,7 @@ Patterns associated with this phrase. #### Defined in -[src/dataset/DataSet.ts:218](https://github.com/jo3-l/obscenity/blob/563159b/src/dataset/DataSet.ts#L218) +[src/dataset/DataSet.ts:209](https://github.com/jo3-l/obscenity/blob/ae4df1a/src/dataset/DataSet.ts#L209) ___ @@ -52,4 +52,4 @@ Whitelisted terms associated with this phrase.
#### Defined in -[src/dataset/DataSet.ts:223](https://github.com/jo3-l/obscenity/blob/563159b/src/dataset/DataSet.ts#L223) +[src/dataset/DataSet.ts:214](https://github.com/jo3-l/obscenity/blob/ae4df1a/src/dataset/DataSet.ts#L214) diff --git a/docs/reference/interfaces/ProcessedCollapseDuplicatesTransformerOptions.md b/docs/reference/interfaces/ProcessedCollapseDuplicatesTransformerOptions.md index 6748364..2ed1bca 100644 --- a/docs/reference/interfaces/ProcessedCollapseDuplicatesTransformerOptions.md +++ b/docs/reference/interfaces/ProcessedCollapseDuplicatesTransformerOptions.md @@ -17,7 +17,7 @@ #### Defined in -[src/transformer/collapse-duplicates/index.ts:68](https://github.com/jo3-l/obscenity/blob/563159b/src/transformer/collapse-duplicates/index.ts#L68) +[src/transformer/collapse-duplicates/index.ts:68](https://github.com/jo3-l/obscenity/blob/ae4df1a/src/transformer/collapse-duplicates/index.ts#L68) ___ @@ -27,4 +27,4 @@ ___ #### Defined in -[src/transformer/collapse-duplicates/index.ts:69](https://github.com/jo3-l/obscenity/blob/563159b/src/transformer/collapse-duplicates/index.ts#L69) +[src/transformer/collapse-duplicates/index.ts:69](https://github.com/jo3-l/obscenity/blob/ae4df1a/src/transformer/collapse-duplicates/index.ts#L69) diff --git a/docs/reference/interfaces/RegExpMatcherOptions.md b/docs/reference/interfaces/RegExpMatcherOptions.md index c7cfb18..790b1a0 100644 --- a/docs/reference/interfaces/RegExpMatcherOptions.md +++ b/docs/reference/interfaces/RegExpMatcherOptions.md @@ -33,7 +33,7 @@ Transformers will be applied in the order they appear. #### Defined in -[src/matcher/regexp/RegExpMatcher.ts:227](https://github.com/jo3-l/obscenity/blob/563159b/src/matcher/regexp/RegExpMatcher.ts#L227) +[src/matcher/regexp/RegExpMatcher.ts:220](https://github.com/jo3-l/obscenity/blob/ae4df1a/src/matcher/regexp/RegExpMatcher.ts#L220) ___ @@ -45,7 +45,7 @@ A list of blacklisted terms. #### Defined in -[src/matcher/regexp/RegExpMatcher.ts:232](https://github.com/jo3-l/obscenity/blob/563159b/src/matcher/regexp/RegExpMatcher.ts#L232) +[src/matcher/regexp/RegExpMatcher.ts:225](https://github.com/jo3-l/obscenity/blob/ae4df1a/src/matcher/regexp/RegExpMatcher.ts#L225) ___ @@ -67,7 +67,7 @@ Transformers will be applied in the order they appear. #### Defined in -[src/matcher/regexp/RegExpMatcher.ts:243](https://github.com/jo3-l/obscenity/blob/563159b/src/matcher/regexp/RegExpMatcher.ts#L243) +[src/matcher/regexp/RegExpMatcher.ts:236](https://github.com/jo3-l/obscenity/blob/ae4df1a/src/matcher/regexp/RegExpMatcher.ts#L236) ___ @@ -91,4 +91,4 @@ than the sword.` #### Defined in -[src/matcher/regexp/RegExpMatcher.ts:256](https://github.com/jo3-l/obscenity/blob/563159b/src/matcher/regexp/RegExpMatcher.ts#L256) +[src/matcher/regexp/RegExpMatcher.ts:249](https://github.com/jo3-l/obscenity/blob/ae4df1a/src/matcher/regexp/RegExpMatcher.ts#L249) diff --git a/docs/reference/interfaces/WildcardNode.md b/docs/reference/interfaces/WildcardNode.md index 627da69..71b81f6 100644 --- a/docs/reference/interfaces/WildcardNode.md +++ b/docs/reference/interfaces/WildcardNode.md @@ -18,4 +18,4 @@ A wildcard node. 
#### Defined in -[src/pattern/Nodes.ts:53](https://github.com/jo3-l/obscenity/blob/563159b/src/pattern/Nodes.ts#L53) +[src/pattern/Nodes.ts:53](https://github.com/jo3-l/obscenity/blob/ae4df1a/src/pattern/Nodes.ts#L53) diff --git a/jest.config.ts b/jest.config.ts index 8526a3f..2f7acb4 100644 --- a/jest.config.ts +++ b/jest.config.ts @@ -5,8 +5,14 @@ const config: Config.InitialOptions = { testEnvironment: 'node', testRunner: 'jest-circus/runner', testMatch: ['<rootDir>/test/**/*.test.ts'], - globals: { - 'ts-jest': { tsconfig: '<rootDir>/test/tsconfig.json' }, + transform: { + // eslint-disable-next-line @typescript-eslint/naming-convention + '^.+\\.ts$': [ + 'ts-jest', + { + tsconfig: '<rootDir>/test/tsconfig.json', + }, + ], }, collectCoverage: true, collectCoverageFrom: ['<rootDir>/src/**/*.ts'], diff --git a/package.json b/package.json index fb77444..b119622 100644 --- a/package.json +++ b/package.json @@ -68,16 +68,16 @@ "fast-check": "^2.25.0", "gen-esm-wrapper": "^1.1.3", "is-ci": "^3.0.1", - "jest": "^29.5.0", + "jest": "^29.7.0", "jest-circus": "^29.5.0", "prettier": "^2.8.8", "rimraf": "^5.0.0", "standard-version": "^9.5.0", - "ts-jest": "^29.1.0", + "ts-jest": "^29.1.1", "ts-node": "^10.9.1", "typedoc": "^0.25.0", "typedoc-plugin-markdown": "^3.15.3", - "typescript": "^5.1.3" + "typescript": "^5.2.2" }, "engines": { "node": ">=14.0.0" } diff --git a/pnpm-lock.yaml b/pnpm-lock.yaml index f98f7fa..56375c1 100644 --- a/pnpm-lock.yaml +++ b/pnpm-lock.yaml @@ -51,7 +51,7 @@ devDependencies: specifier: ^3.0.1 version: 3.0.1 jest: - specifier: ^29.5.0 + specifier: ^29.7.0 version: 29.7.0(@types/node@20.2.5)(ts-node@10.9.1) jest-circus: specifier: ^29.5.0 version: 29.5.0(@types/node@20.2.5)(ts-node@10.9.1) @@ -66,7 +66,7 @@ devDependencies: specifier: ^9.5.0 version: 9.5.0 ts-jest: - specifier: ^29.1.0 + specifier: ^29.1.1 version: 29.1.1(@babel/core@7.22.1)(@jest/types@29.5.0)(jest@29.7.0)(typescript@5.2.2) ts-node: specifier: ^10.9.1 version: 10.9.1(@types/node@20.2.5)(typescript@5.2.2) @@ -78,7 +78,7 @@ devDependencies: specifier: ^3.15.3 version: 3.15.3(typedoc@0.25.0) typescript: - specifier: ^5.1.3 + specifier: ^5.2.2 version: 5.2.2 packages: diff --git a/src/dataset/DataSet.ts b/src/dataset/DataSet.ts index e1474f7..dc91bb8 100644 --- a/src/dataset/DataSet.ts +++ b/src/dataset/DataSet.ts @@ -1,6 +1,6 @@ import { assignIncrementingIds } from '../matcher/BlacklistedTerm'; import type { MatchPayload } from '../matcher/MatchPayload'; -import type { NfaMatcherOptions } from '../matcher/nfa/NfaMatcher'; +import type { RegExpMatcherOptions } from '../matcher/regexp/RegExpMatcher'; import type { ParsedPattern } from '../pattern/Nodes'; /** @@ -104,8 +104,7 @@ export class DataSet { } /** - * Returns the dataset in a format suitable for usage with the [[RegExpMatcher]] - * or the [[NfaMatcher]]. + * Returns the dataset in a format suitable for usage with the [[RegExpMatcher]].
* * @example * ```typescript * // With the RegExpMatcher: * const matcher = new RegExpMatcher({ * ...dataset.build(), * // additional options here * }); * ``` - * @example - * ```typescript - * // With the NfaMatcher: - * const matcher = new NfaMatcher({ - * ...dataset.build(), - * // additional options here - * }); - * ``` */ - public build(): Pick<NfaMatcherOptions, 'blacklistedTerms' | 'whitelistedTerms'> { + public build(): Pick<RegExpMatcherOptions, 'blacklistedTerms' | 'whitelistedTerms'> { return { blacklistedTerms: assignIncrementingIds(this.containers.flatMap((p) => p.patterns)), whitelistedTerms: this.containers.flatMap((p) => p.whitelistedTerms), diff --git a/src/index.ts b/src/index.ts index fd93d09..5675ee3 100644 --- a/src/index.ts +++ b/src/index.ts @@ -4,7 +4,6 @@ export * from './censor/TextCensor'; export * from './dataset/DataSet'; export * from './matcher/regexp/RegExpMatcher'; -export * from './matcher/nfa/NfaMatcher'; export * from './matcher/BlacklistedTerm'; export * from './matcher/MatchPayload'; export * from './matcher/Matcher'; diff --git a/src/matcher/BlacklistedTerm.ts b/src/matcher/BlacklistedTerm.ts index 3cb7f6e..1b2c383 100644 --- a/src/matcher/BlacklistedTerm.ts +++ b/src/matcher/BlacklistedTerm.ts @@ -32,7 +32,7 @@ export interface BlacklistedTerm { * ``` * @param patterns - List of parsed patterns. * @returns A list of blacklisted terms with valid IDs which can then be passed - * to the [[RegExpMatcher]] or [[NfaMatcher]]. + * to the [[RegExpMatcher]]. */ export function assignIncrementingIds(patterns: ParsedPattern[]) { let currentId = 0; diff --git a/src/matcher/Matcher.ts b/src/matcher/Matcher.ts index f2cda21..3962494 100644 --- a/src/matcher/Matcher.ts +++ b/src/matcher/Matcher.ts @@ -5,11 +5,7 @@ import type { MatchPayload } from './MatchPayload'; * terms. * * See: - * - [[NfaMatcher]] for an implementation using finite automata; * - [[RegExpMatcher]] for an implementation using regular expressions. - * - * Refer to the documentation of the classes mentioned above for discussion of - * which circumstances one should prefer one over the other.
*/ export interface Matcher { /** diff --git a/src/matcher/nfa/NfaMatcher.ts b/src/matcher/nfa/NfaMatcher.ts deleted file mode 100644 index 0decfab..0000000 --- a/src/matcher/nfa/NfaMatcher.ts +++ /dev/null @@ -1,663 +0,0 @@ -import type { LiteralNode } from '../../pattern/Nodes'; -import { SyntaxKind } from '../../pattern/Nodes'; -import type { SimpleNode } from '../../pattern/Simplifier'; -import { simplify } from '../../pattern/Simplifier'; -import type { LiteralGroup, WildcardGroup } from '../../pattern/Util'; -import { computePatternMatchLength, groupByNodeType, potentiallyMatchesEmptyString } from '../../pattern/Util'; -import { TransformerSet } from '../../transformer/TransformerSet'; -import type { TransformerContainer } from '../../transformer/Transformers'; -import { isHighSurrogate, isLowSurrogate, isWordChar } from '../../util/Char'; -import { CharacterIterator } from '../../util/CharacterIterator'; -import { CircularBuffer } from '../../util/CircularBuffer'; -import { Queue } from '../../util/Queue'; -import type { BlacklistedTerm } from '../BlacklistedTerm'; -import { IntervalCollection } from '../IntervalCollection'; -import type { MatchPayload } from '../MatchPayload'; -import { compareMatchByPositionAndId } from '../MatchPayload'; -import type { Matcher } from '../Matcher'; -import { WhitelistedTermMatcher } from './WhitelistedTermMatcher'; -import type { PartialMatchData } from './trie/BlacklistTrieNode'; -import { BlacklistTrieNode, NodeFlag, PartialMatchFlag, SharedFlag, hashPartialMatch } from './trie/BlacklistTrieNode'; -import type { ForwardingEdgeCollection } from './trie/edge/ForwardingEdgeCollection'; - -/** - * An implementation of the [[Matcher]] interface using finite automata - * techniques. - * - * It is theoretically faster than the [[RegExpMatcher]]: the `hasMatch()` and - * `getAllMatches()` execute in time proportional only to that of the length of - * the input text and the number of matches. In other words, it _theoretically_ - * should not degrade in performance as you add more terms - matching with 100 - * and 1000 patterns should have the same performance. It achieves this by - * building a heavily modified [Aho-Corasick - * automaton](https://en.wikipedia.org/wiki/Aho%E2%80%93Corasick_algorithm) from - * the input patterns. - * - * In practice, its high constant factors make it slower than the - * [[RegExpMatcher]] until about ~100 patterns, at which point both - * implementations have approximately the same performance. - * - * The regular-expression matcher should be preferred to this one if at all - * possible, as it uses more memory and is only marginally faster at the scale - * most users of this package are expected to use it at. However, it may be - * appropriate if: - * - * - You have a large number of patterns (> 100); - * - You expect to be matching on long text; - * - You have benchmarked the implementations and found the [[NfaMatcher]] to be - * noticeably faster. - */ -export class NfaMatcher implements Matcher { - private readonly rootNode = new BlacklistTrieNode(); - - private readonly originalIds: number[] = []; // originalIds[i] is the term ID of the pattern with ID i - - private readonly matchLengths: number[] = []; // matchLengths[i] is the match length of the pattern with ID i - - private readonly partialMatchStepCounts = new Map<number, number>(); // partialMatchStepCounts[i] is the total number of steps of the pattern with ID i. Only applies to partial matches.
- private readonly wildcardOnlyPatterns: WildcardOnlyPatternData[] = []; - - // Maximum number of trailing wildcards. - // - // x x x ? y y ? ? ? ? - // ^^^^^^^ - // 4 trailing wildcards - private maxTrailingWildcardCount = 0; - - // Maximum distance between the start of a partial pattern and the end of the partial pattern following it. - // - // x x x ? ? ? y y y y - // 0 1 2 3 4 5 6 7 8 9 - // ^^^^^^^^^^^^^^^^^^^ - // distance of 10 - // - // This value is equal to how long we need to keep partial matches around. - private maxPartialPatternDistance = 0; - - private maxMatchLength = 0; // Maximum match length of any pattern, equal to how many indices to the left of the current position we need to track. - - private currentId = 0; // Current generated pattern ID. - - private readonly whitelistedTermMatcher: WhitelistedTermMatcher; - - private readonly slowTransformers: TransformerSet; - - private readonly fastTransformers: TransformerSet; - - // Use two iterators: one fast, and one slow. The fast iterator will - // constantly be |maxTrailingWildcardCount| positions ahead of the slow - // iterator. - private readonly slowIter = new CharacterIterator(); - - private readonly fastIter = new CharacterIterator(); - - // Sliding window of indices used for matching. - // - // current position - // | - // i0 i1 i2 i3 i4 i5 - // ^^^^^^^^^^^^^^^^^ - // maxMatchLength - private readonly usedIndices: CircularBuffer<number>; - - // Sliding window of indices to the right of the current position. - // - // current position - // | - // i6 i7 i8 i9 i10 i11 i12 i13 - // ^^^^^^^^^^^^^^^^^^^^^^^^^^ - // maxTrailingWildcardCount - private readonly futureIndices: CircularBuffer<number>; - - private matches: MatchPayload[] = []; - - private readonly partialMatches: CircularBuffer<Set<number> | undefined>; // partial matches found; value is a set of partial match hashes - - private currentNode = this.rootNode; - - private whitelistedIntervals = new IntervalCollection(); - - /** - * Creates a new [[NfaMatcher]] with the options given. - * - * @example - * ```typescript - * // Use the options provided by the English preset. - * const matcher = new NfaMatcher({ - * ...englishDataset.build(), - * ...englishRecommendedTransformers, - * }); - * ``` - * @example - * ```typescript - * // Simple matcher that only has blacklisted patterns. - * const matcher = new NfaMatcher({ - * blacklistedTerms: assignIncrementingIds([ - * pattern`fuck`, - * pattern`f?uck`, // wildcards (?) - * pattern`bitch`, - * pattern`b[i]tch` // optionals ([i] matches either "i" or "") - * ]), - * }); - * - * // Check whether some string matches any of the patterns. - * const doesMatch = matcher.hasMatch('fuck you bitch'); - * ``` - * @example - * ```typescript - * // A more advanced example, with transformers and whitelisted terms. - * const matcher = new NfaMatcher({ - * blacklistedTerms: [ - * { id: 1, pattern: pattern`penis` }, - * { id: 2, pattern: pattern`fuck` }, - * ], - * whitelistedTerms: ['pen is'], - * blacklistMatcherTransformers: [ - * resolveConfusablesTransformer(), // '🅰' => 'a' - * resolveLeetSpeakTransformer(), // '$' => 's' - * foldAsciiCharCaseTransformer(), // case insensitive matching - * skipNonAlphabeticTransformer(), // 'f.u...c.k' => 'fuck' - * collapseDuplicatesTransformer(), // 'aaaa' => 'a' - * ], - * }); - * - * // Output all matches. - * console.log(matcher.getAllMatches('fu.....uuuuCK the pen is mightier than the sword!')); - * ``` - * @param options - Options to use.
- */ - public constructor({ - blacklistedTerms, - whitelistedTerms = [], - blacklistMatcherTransformers = [], - whitelistMatcherTransformers = [], - }: NfaMatcherOptions) { - this.whitelistedTermMatcher = new WhitelistedTermMatcher({ - terms: whitelistedTerms, - transformers: whitelistMatcherTransformers, - }); - this.slowTransformers = new TransformerSet(blacklistMatcherTransformers); - this.fastTransformers = new TransformerSet(blacklistMatcherTransformers); - this.ensureNoDuplicateIds(blacklistedTerms); - this.buildTrie(blacklistedTerms); - this.constructLinks(); - this.useUnderlyingEdgeCollectionImplementation(this.rootNode); - - // Sort wildcard-only patterns by the number of wildcards they have. - this.wildcardOnlyPatterns.sort((a, b) => - /* istanbul ignore next: not really possible to write a robust test for this */ - a.wildcardCount < b.wildcardCount ? -1 : b.wildcardCount < a.wildcardCount ? 1 : 0, - ); - this.usedIndices = new CircularBuffer(this.maxMatchLength); - this.futureIndices = new CircularBuffer(this.maxTrailingWildcardCount); - this.partialMatches = new CircularBuffer(this.maxPartialPatternDistance); - } - - public hasMatch(input: string) { - this.setInput(input); - return this.run(true); - } - - public getAllMatches(input: string, sorted = false) { - this.setInput(input); - this.run(); - if (sorted) this.matches.sort(compareMatchByPositionAndId); - return this.matches; - } - - private setInput(input: string) { - this.slowIter.setInput(input); - this.fastIter.setInput(input); - this.whitelistedIntervals = this.whitelistedTermMatcher.getMatches(input); - this.currentNode = this.rootNode; - this.slowTransformers.resetAll(); - this.fastTransformers.resetAll(); - this.usedIndices.clear(); - this.futureIndices.clear(); - this.partialMatches.clear(); - this.matches = []; - } - - private run(breakAfterFirstMatch = false) { - // Fill the future index buffer by advancing the fast iterator forward. - while (this.futureIndices.length < this.futureIndices.capacity) { - const char = this.fastIter.next().value; - if (char === undefined) { - // Iterator is done. - this.futureIndices.push(undefined); - } else { - const transformed = this.fastTransformers.applyTo(char); - // Only add the position if the character didn't become - // undefined after transformation. - if (transformed !== undefined) this.futureIndices.push(this.fastIter.position); - } - } - - for (const char of this.slowIter) { - const transformed = this.slowTransformers.applyTo(char); - if (transformed === undefined) continue; - - // Advance the index window forward. - // 1 3 4 5 7 8 9 - // becomes - // 3 4 5 7 8 9 10 (if 10 is the current position) - this.usedIndices.push(this.slowIter.position); - - // Advance the partial matches buffer forward. - this.partialMatches.push(undefined); - - // Find next usable character for the fast iterator. - if (this.maxTrailingWildcardCount > 0) { - let found = false; - while (!this.fastIter.done && !found) { - found = this.fastTransformers.applyTo(this.fastIter.next().value!) !== undefined; - if (found) this.futureIndices.push(this.fastIter.position); - } - - if (!found) this.futureIndices.push(undefined); - } - - // Follow failure links until we find a node that has a transition for the current character. - while (this.currentNode !== this.rootNode && !this.currentNode.edges.get(transformed)) { - this.currentNode = this.currentNode.failureLink; - } - - this.currentNode = this.currentNode.edges.get(transformed) ?? this.rootNode; - - // Emit matches for wildcard-only patterns. 
Patterns of the form
-            // ?^N ('?' repeated N times) always have a match ending at the
-            // current index if the number of characters seen is
-            // >= N.
-            for (const data of this.wildcardOnlyPatterns) {
-                if (data.wildcardCount > this.usedIndices.length) break;
-                const matchLength = this.matchLengths[data.id];
-                const startIndex = this.usedIndices.get(this.usedIndices.length - matchLength)!;
-                const matched = this.emitMatch(
-                    data.id,
-                    data.flags,
-                    startIndex,
-                    this.slowIter.position + this.slowIter.lastWidth - 1,
-                    matchLength,
-                );
-                if (matched && breakAfterFirstMatch) return true;
-            }
-
-            // Emit matches for the current node, then follow its output links.
-            if (this.currentNode.flags & NodeFlag.MatchLeaf) {
-                const matchLength = this.matchLengths[this.currentNode.termId];
-                const startIndex = this.usedIndices.get(this.usedIndices.length - matchLength)!;
-                const matched = this.emitMatch(
-                    this.currentNode.termId,
-                    this.currentNode.flags,
-                    startIndex,
-                    this.slowIter.position + this.slowIter.lastWidth - 1,
-                    matchLength,
-                );
-                if (matched && breakAfterFirstMatch) return true;
-            }
-
-            if (this.currentNode.flags & NodeFlag.PartialMatchLeaf) {
-                for (const partialMatch of this.currentNode.partialMatches!) {
-                    if (this.emitPartialMatch(partialMatch) && breakAfterFirstMatch) return true;
-                }
-            }
-
-            let outputLink = this.currentNode.outputLink;
-            while (outputLink) {
-                if (outputLink.flags & NodeFlag.PartialMatchLeaf) {
-                    for (const partialMatch of outputLink.partialMatches!) {
-                        if (this.emitPartialMatch(partialMatch) && breakAfterFirstMatch) return true;
-                    }
-                }
-
-                if (outputLink.flags & NodeFlag.MatchLeaf) {
-                    const matchLength = this.matchLengths[outputLink.termId];
-                    const startIndex = this.usedIndices.get(this.usedIndices.length - matchLength)!;
-                    const matched = this.emitMatch(
-                        outputLink.termId,
-                        outputLink.flags,
-                        startIndex,
-                        this.slowIter.position + this.slowIter.lastWidth - 1,
-                        matchLength,
-                    );
-                    if (matched && breakAfterFirstMatch) return true;
-                }
-
-                outputLink = outputLink.outputLink;
-            }
-        }
-
-        return this.matches.length > 0;
-    }
-
-    private emitPartialMatch(data: PartialMatchData) {
-        // ??xxxxx
-        // If we have a match for 'xxxxx', the whole pattern matches if the
-        // number of characters seen is at least the number of leading
-        // wildcards (in this case 2) plus the length of the match.
-        const hasSufficientCharactersBefore = data.leadingWildcardCount + data.matchLength <= this.usedIndices.length;
-        if (!hasSufficientCharactersBefore) return false;
-
-        // x x ? ? y y y y y
-        // 0 1 2 3 4 5 6 7 8
-        //
-        // If we have a match for 'yyyyy', the whole pattern matches if we have
-        // a match for 'xx' ending 7 positions before the current one (the
-        // length of 'yyyyy' plus the two wildcards; the buffer lookup below
-        // subtracts one more because the last slot holds the current position).
-        const hasMatchForPreviousStep =
-            // First step has no match before it.
-            data.step === 1 ||
-            (this.partialMatches
-                .get(this.partialMatches.length - data.leadingWildcardCount - data.matchLength - 1)
-                ?.has(hashPartialMatch(data.step - 1, data.termId)) ??
-                false);
-        if (!hasMatchForPreviousStep) return false;
-        if (data.step === this.partialMatchStepCounts.get(data.termId)) {
-            // Say the pattern is 'xx???yyyyy'.
-            // We're currently on 'yyyyy' and we know that the steps before
-            // match. We can safely emit a match if there are no trailing
-            // wildcards.
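To make the sliding-window bookkeeping used by run() and emitPartialMatch() concrete, here is a minimal, self-contained sketch. The names are hypothetical and this is not the library's CircularBuffer; it only shows how keeping the last few used indices lets a matcher recover the start index of a match of length N ending at the current position.

```typescript
// A toy fixed-capacity window of "used indices"; a simplified stand-in for
// illustration only.
class SlidingIndexWindow {
    private readonly indices: number[] = [];

    public constructor(private readonly capacity: number) {}

    public get length() {
        return this.indices.length;
    }

    public push(index: number) {
        this.indices.push(index);
        if (this.indices.length > this.capacity) this.indices.shift(); // evict the oldest index
    }

    public get(offset: number): number | undefined {
        return this.indices[offset];
    }
}

// After consuming positions 0..9 with capacity 5, a match of length 3 ending
// at position 9 must have started at position 7.
const usedIndices = new SlidingIndexWindow(5);
for (let position = 0; position < 10; position++) usedIndices.push(position);
console.log(usedIndices.get(usedIndices.length - 3)); // => 7
```

The library's CircularBuffer presumably offers the same get/length contract with constant-time pushes; the sketch trades efficiency for brevity.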
- if (data.trailingWildcardCount === 0) { - const matchLength = this.matchLengths[data.termId]; - const startIndex = this.usedIndices.get(this.usedIndices.length - matchLength)!; - return this.emitMatch( - data.termId, - data.flags, - startIndex, - this.slowIter.position + this.slowIter.lastWidth - 1, - matchLength, - ); - } - - // Say the pattern is 'xx??yy??'. - // This pattern matches if there are at least two characters that are - // usable to the right of the current position. - let endIndex = this.futureIndices.get(data.trailingWildcardCount - 1); - if (endIndex === undefined) return false; - - // Adjust for surrogate pairs. - if ( - // not the last character - endIndex < this.slowIter.input.length - 1 && - // character is a high surrogate - isHighSurrogate(this.slowIter.input.charCodeAt(endIndex)) && - // next character is a low surrogate - isLowSurrogate(this.slowIter.input.charCodeAt(endIndex + 1)) - ) { - endIndex++; - } - - const matchLength = this.matchLengths[data.termId]; - const startIndex = this.usedIndices.get(this.usedIndices.length - matchLength + data.trailingWildcardCount)!; - return this.emitMatch(data.termId, data.flags, startIndex, endIndex, matchLength); - } - - // Otherwise, add a partial match. - let hashes = this.partialMatches.get(this.partialMatches.length - 1); - if (!hashes) this.partialMatches.set(this.partialMatches.length - 1, (hashes = new Set())); - hashes.add(hashPartialMatch(data.step, data.termId)); - return false; - } - - private emitMatch(id: number, flags: number, startIndex: number, endIndex: number, matchLength: number) { - const startBoundaryOk = - !(flags & SharedFlag.RequireWordBoundaryAtStart) || // doesn't require word boundary at the start - startIndex === 0 || // first character - !isWordChar(this.slowIter.input.charCodeAt(startIndex - 1)); // character before isn't a word char - const endBoundaryOk = - !(flags & SharedFlag.RequireWordBoundaryAtEnd) || // doesn't require word boundary at the end - endIndex === this.slowIter.input.length - 1 || // last character - !isWordChar(this.slowIter.input.charCodeAt(endIndex + 1)); // character after isn't a word char - if (!startBoundaryOk || !endBoundaryOk) return false; - - const termId = this.originalIds[id]; - if (this.whitelistedIntervals.query(startIndex, endIndex)) return false; - - this.matches.push({ termId, matchLength, startIndex, endIndex }); - return true; - } - - private ensureNoDuplicateIds(terms: BlacklistedTerm[]) { - const seen = new Set(); - for (const term of terms) { - if (seen.has(term.id)) throw new Error(`Found duplicate blacklisted term ID ${term.id}.`); - seen.add(term.id); - } - } - - private buildTrie(patterns: BlacklistedTerm[]) { - for (const pattern of patterns) this.registerTerm(pattern); - } - - private registerTerm(term: BlacklistedTerm) { - if (potentiallyMatchesEmptyString(term.pattern)) { - throw new Error(`Pattern with ID ${term.id} potentially matches empty string; this is unsupported.`); - } - - const simplifiedPatterns = simplify(term.pattern.nodes); - for (const pattern of simplifiedPatterns) { - // Each pattern may actually correspond to several simplified - // patterns, so use an incrementing numerical ID internally. 
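Stepping back to the boundary logic in emitMatch() above: it reduces to a small predicate over the characters adjacent to the match. A standalone sketch follows, assuming the conventional definition of a word character as an ASCII digit or letter (consistent with the `[^\dA-Za-z]` boundary classes used in the fuzz test further below); the helper names are stand-ins, not the library's.

```typescript
// Hypothetical stand-ins for the library's isWordChar helper and the start
// boundary check performed by emitMatch().
const isWordChar = (code: number) =>
    (code >= 0x30 && code <= 0x39) || // '0'-'9'
    (code >= 0x41 && code <= 0x5a) || // 'A'-'Z'
    (code >= 0x61 && code <= 0x7a); // 'a'-'z'

function startBoundaryOk(input: string, startIndex: number) {
    // A word boundary exists at the start if the match begins the input or
    // the preceding character is not a word character.
    return startIndex === 0 || !isWordChar(input.charCodeAt(startIndex - 1));
}

console.log(startBoundaryOk('foo bar', 4)); // => true: 'bar' is preceded by a space
console.log(startBoundaryOk('foobar', 3)); // => false: 'bar' is preceded by 'o'
```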
-            const id = this.currentId++;
-            this.originalIds.push(term.id);
-
-            if (pattern.every((node): node is LiteralNode => node.kind === SyntaxKind.Literal)) {
-                this.registerPatternWithOnlyLiterals(id, pattern, term);
-            } else if (pattern.every((node) => node.kind === SyntaxKind.Wildcard)) {
-                this.registerPatternWithOnlyWildcards(id, pattern, term);
-            } else {
-                this.registerPatternWithWildcardsAndLiterals(id, pattern, term);
-            }
-        }
-    }
-
-    private registerPatternWithOnlyLiterals(id: number, pattern: LiteralNode[], term: BlacklistedTerm) {
-        const matchLength = computePatternMatchLength(pattern);
-        this.matchLengths[id] = matchLength;
-        this.maxMatchLength = Math.max(this.maxMatchLength, matchLength);
-
-        const endNode = this.extendTrie(pattern[0].chars);
-        endNode.flags |= NodeFlag.MatchLeaf;
-        endNode.termId = id;
-        if (term.pattern.requireWordBoundaryAtStart) endNode.flags |= NodeFlag.RequireWordBoundaryAtStart;
-        if (term.pattern.requireWordBoundaryAtEnd) endNode.flags |= NodeFlag.RequireWordBoundaryAtEnd;
-    }
-
-    private registerPatternWithOnlyWildcards(id: number, pattern: SimpleNode[], term: BlacklistedTerm) {
-        const matchLength = computePatternMatchLength(pattern);
-        this.matchLengths[id] = matchLength;
-        this.maxMatchLength = Math.max(this.maxMatchLength, matchLength);
-
-        const data: WildcardOnlyPatternData = {
-            id,
-            flags: 0,
-            wildcardCount: matchLength,
-        };
-        if (term.pattern.requireWordBoundaryAtStart) data.flags |= WildcardOnlyPatternFlag.RequireWordBoundaryAtStart;
-        if (term.pattern.requireWordBoundaryAtEnd) data.flags |= WildcardOnlyPatternFlag.RequireWordBoundaryAtEnd;
-        this.wildcardOnlyPatterns.push(data);
-    }
-
-    private registerPatternWithWildcardsAndLiterals(id: number, pattern: SimpleNode[], term: BlacklistedTerm) {
-        const matchLength = computePatternMatchLength(pattern);
-        this.matchLengths[id] = matchLength;
-        this.maxMatchLength = Math.max(this.maxMatchLength, matchLength);
-
-        // If a pattern has a wildcard in addition to at least one literal, we
-        // will split the pattern at its wildcards, resulting in a number of
-        // partial patterns. For example, given 'l1 w1 l2 w2' where l1, l2 are
-        // literals and w1, w2 are wildcards, we would have 2 partial patterns:
-        // l1 and l2.
-        //
-        // We will then assign each partial pattern a step: l1 would be step 1
-        // and l2 step 2. Then, we will extend the trie with l1 and l2. After
-        // that is done, we will decorate the leaf nodes of each partial
-        // pattern with some additional metadata to indicate that they are the
-        // leaf node of a partial match.
-        //
-        // So how does this help us match wildcards?
-        //
-        // Let's say that we find the pattern l1 in the text. Since it is the
-        // first step, we will hash it and add it to the set of partial matches
-        // ending at that position. Now, let's say that we find pattern l2 in
-        // the text. We can combine the partial matches l1 and l2 iff exactly
-        // one character (consumed by the wildcard w1) lies between the end of
-        // l1's match and the start of l2's match, one being the number of
-        // wildcards separating l1 and l2 in the original pattern.
-        //
-        // Since l2 is the last partial pattern, we add it to a stack of pending
-        // partial matches. (Note that if there was no wildcard after l2, we
-        // could emit it immediately. However, as there are wildcards after l2,
-        // we have to wait until we are sure that we have an adequate number of
-        // characters to satisfy the required number of wildcards).
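The splitting scheme the comment above describes can be seen in isolation. Below is an illustrative sketch, with hypothetical names, that operates on raw wildcard strings rather than parsed pattern nodes: it cuts a pattern at its wildcards and records, for each partial pattern, its step number and the wildcard counts around it.

```typescript
interface PartialPattern {
    step: number;
    literal: string;
    leadingWildcardCount: number;
    trailingWildcardCount: number;
}

// Split a pattern such as 'xx???yyyyy' into its literal runs, annotating each
// with the number of wildcards immediately before and after it. Assumes an
// ES2020+ target for String#matchAll.
function splitAtWildcards(pattern: string): PartialPattern[] {
    const parts: PartialPattern[] = [];
    for (const match of pattern.matchAll(/[^?]+/g)) {
        const start = match.index!;
        const end = start + match[0].length;
        let trailing = 0;
        while (pattern[end + trailing] === '?') trailing++;
        let leading = 0;
        while (pattern[start - 1 - leading] === '?') leading++;
        parts.push({
            step: parts.length + 1,
            literal: match[0],
            leadingWildcardCount: leading,
            trailingWildcardCount: trailing,
        });
    }

    return parts;
}

console.log(splitAtWildcards('xx???yyyyy'));
// => [ { step: 1, literal: 'xx', leadingWildcardCount: 0, trailingWildcardCount: 3 },
//      { step: 2, literal: 'yyyyy', leadingWildcardCount: 3, trailingWildcardCount: 0 } ]
```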
- const groups = groupByNodeType(pattern); - let step = 1; - - const startsWithLiteral = groups[0].isLiteralGroup; - for (let i = startsWithLiteral ? 0 : 1; i < groups.length; i += 2, step++) { - // Count the number of trailing and leading wildcards - // before/after the current literal segment. - const lastLiteralGroupLength = - i < 2 ? 0 : (groups[i - 2] as LiteralGroup).literals.reduce((a, b) => a + b.chars.length, 0); - const leadingWildcardCount = i === 0 ? 0 : (groups[i - 1] as WildcardGroup).wildcardCount; - const trailingWildcardCount = i === groups.length - 1 ? 0 : (groups[i + 1] as WildcardGroup).wildcardCount; - - // Extend the trie with the characters of the literal. - const chars = (groups[i] as LiteralGroup).literals.flatMap((node) => node.chars); - const endNode = this.extendTrie(chars); - - // Add some additional metadata to the leaf node. - const data: PartialMatchData = { - step, - termId: id, - flags: 0, - leadingWildcardCount, - trailingWildcardCount, - matchLength: chars.length, - }; - if (term.pattern.requireWordBoundaryAtStart) data.flags |= PartialMatchFlag.RequireWordBoundaryAtStart; - if (term.pattern.requireWordBoundaryAtEnd) data.flags |= PartialMatchFlag.RequireWordBoundaryAtEnd; - (endNode.partialMatches ??= []).push(data); - endNode.flags |= NodeFlag.PartialMatchLeaf; - - this.maxPartialPatternDistance = Math.max( - this.maxPartialPatternDistance, - lastLiteralGroupLength + leadingWildcardCount + chars.length, - ); - if (i >= groups.length - 2) { - // Last group of literals. - this.maxTrailingWildcardCount = Math.max(this.maxTrailingWildcardCount, trailingWildcardCount); - } - } - - this.partialMatchStepCounts.set(id, step - 1); - } - - private extendTrie(chars: number[]) { - let currentNode = this.rootNode; - for (const char of chars) { - const nextNode = currentNode.edges.get(char); - if (nextNode) { - currentNode = nextNode; - } else { - const newNode = new BlacklistTrieNode(); - currentNode.edges.set(char, newNode); - currentNode = newNode; - } - } - - return currentNode; - } - - private constructLinks() { - // Compute the failure and output functions for the trie. This - // implementation is fairly straightforward and is essentially the exact - // same as that detailed in Aho and Corasick's original paper. Refer to - // section 3 in said paper for more details. - this.rootNode.failureLink = this.rootNode; - const queue = new Queue(); - for (const node of this.rootNode.edges.values()) { - node.failureLink = this.rootNode; - queue.push(node); - } - - while (queue.length > 0) { - const node = queue.shift()!; - for (const [char, childNode] of node.edges) { - let cur = node.failureLink; - while (!cur.edges.get(char) && cur !== this.rootNode) cur = cur.failureLink; - - const failureLink = cur.edges.get(char) ?? this.rootNode; - childNode.failureLink = failureLink; - queue.push(childNode); - } - - node.outputLink = - node.failureLink.flags & NodeFlag.MatchLeaf || node.failureLink.flags & NodeFlag.PartialMatchLeaf - ? node.failureLink - : node.failureLink.outputLink; - } - } - - private useUnderlyingEdgeCollectionImplementation(node: BlacklistTrieNode) { - node.edges = (node.edges as ForwardingEdgeCollection).underlyingImplementation; - for (const childNode of node.edges.values()) this.useUnderlyingEdgeCollectionImplementation(childNode); - } -} - -/** - * Options for the [[NfaMatcher]]. - */ -export interface NfaMatcherOptions { - /** - * A set of transformers that should be applied to the input text before - * blacklisted patterns are matched. 
This does not affect the matching of
-     * whitelisted terms.
-     *
-     * Transformers will be applied in the order they appear.
-     *
-     * @default []
-     */
-    blacklistMatcherTransformers?: TransformerContainer[];
-
-    /**
-     * A list of blacklisted terms.
-     */
-    blacklistedTerms: BlacklistedTerm[];
-
-    /**
-     * A set of transformers that should be applied to the input text before
-     * whitelisted terms are matched. This does not affect the matching of
-     * blacklisted terms.
-     *
-     * Transformers will be applied in the order they appear.
-     *
-     * @default []
-     */
-    whitelistMatcherTransformers?: TransformerContainer[];
-
-    /**
-     * A list of whitelisted terms. If a whitelisted term matches some part of
-     * the text, a match of a blacklisted pattern within that part of the text
-     * will not be emitted.
-     *
-     * For example, if we had a pattern `penis` and a whitelisted term `pen is`,
-     * no matches would be reported for the input text `the pen is mightier
-     * than the sword.`
-     *
-     * @default []
-     */
-    whitelistedTerms?: string[];
-}
-
-interface WildcardOnlyPatternData {
-    flags: number;
-    id: number;
-    wildcardCount: number;
-}
-
-const enum WildcardOnlyPatternFlag {
-    RequireWordBoundaryAtStart = 1,
-    RequireWordBoundaryAtEnd = 1 << 1,
-}
diff --git a/src/matcher/nfa/WhitelistedTermMatcher.ts b/src/matcher/nfa/WhitelistedTermMatcher.ts
deleted file mode 100644
index 0b1e713..0000000
--- a/src/matcher/nfa/WhitelistedTermMatcher.ts
+++ /dev/null
@@ -1,132 +0,0 @@
-import { TransformerSet } from '../../transformer/TransformerSet';
-import type { TransformerContainer } from '../../transformer/Transformers';
-import { CharacterIterator } from '../../util/CharacterIterator';
-import { CircularBuffer } from '../../util/CircularBuffer';
-import { Queue } from '../../util/Queue';
-import { IntervalCollection } from '../IntervalCollection';
-import { WhitelistTrieNode } from './trie/WhitelistTrieNode';
-import type { ForwardingEdgeCollection } from './trie/edge/ForwardingEdgeCollection';
-
-export class WhitelistedTermMatcher {
-    private readonly rootNode = new WhitelistTrieNode();
-
-    private currentId = 0;
-
-    private readonly matchLengths = new Map<number, number>(); // term ID -> match length
-
-    private maxMatchLength = 0;
-
-    private readonly transformers: TransformerSet;
-
-    public constructor({ terms, transformers = [] }: WhitelistedTermMatcherOptions) {
-        this.transformers = new TransformerSet(transformers);
-        for (const term of terms) this.registerTerm(term);
-        this.constructLinks();
-        this.useUnderlyingEdgeCollectionImplementation(this.rootNode);
-    }
-
-    public getMatches(text: string) {
-        if (this.rootNode.edges.size === 0) return new IntervalCollection();
-        const usedIndices = new CircularBuffer<number>(this.maxMatchLength);
-        const matches = new IntervalCollection();
-
-        let currentNode = this.rootNode;
-        const iter = new CharacterIterator(text);
-        for (const char of iter) {
-            const transformed = this.transformers.applyTo(char);
-            if (transformed === undefined) continue; // Returning undefined from a transformer skips that character.
-
-            // Mark the current position as one used for matching.
-            usedIndices.push(iter.position);
-
-            // Follow failure links until we find a node that has a transition for the current character.
-            while (currentNode !== this.rootNode && !currentNode.edges.get(transformed)) {
-                currentNode = currentNode.failureLink;
-            }
-
-            currentNode = currentNode.edges.get(transformed) ?? this.rootNode;
-
-            // Report matches as needed.
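The loop above is the classic Aho-Corasick traversal; the failure links it relies on are precomputed in constructLinks() further below. As a reference point, here is a compact, self-contained sketch of that BFS construction over a plain Map-based trie. The types and names are illustrative, not the library's node classes.

```typescript
class TrieNode {
    public readonly edges = new Map<string, TrieNode>();
    public failureLink!: TrieNode;
}

// BFS failure-link construction, essentially as in section 3 of Aho and
// Corasick's original paper.
function constructFailureLinks(root: TrieNode) {
    root.failureLink = root;
    const queue: TrieNode[] = [];
    for (const child of root.edges.values()) {
        child.failureLink = root; // depth-1 nodes always fall back to the root
        queue.push(child);
    }

    while (queue.length > 0) {
        const node = queue.shift()!;
        for (const [char, child] of node.edges) {
            // Walk failure links until some fallback node has a transition
            // for this character.
            let cur = node.failureLink;
            while (!cur.edges.has(char) && cur !== root) cur = cur.failureLink;
            child.failureLink = cur.edges.get(char) ?? root;
            queue.push(child);
        }
    }
}
```

Every node deeper than one level inherits its link from its parent's fallback chain, which is what lets getMatches() resume matching after a mismatch without rescanning the input.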
- if (currentNode.isOutputNode) { - const matchLength = this.matchLengths.get(currentNode.termId)!; - const startIndex = usedIndices.get(usedIndices.length - matchLength)!; - // Adjust the end index by iter.lastWidth - 1 to account for surrogate pairs. - matches.insert(startIndex, iter.position + iter.lastWidth - 1); - } - - let linkedNode = currentNode.outputLink; - while (linkedNode) { - const matchLength = this.matchLengths.get(linkedNode.termId)!; - const startIndex = usedIndices.get(usedIndices.length - matchLength)!; - // Similar. - matches.insert(startIndex, iter.position + iter.lastWidth - 1); - linkedNode = linkedNode.outputLink; - } - } - - this.transformers.resetAll(); - return matches; - } - - private registerTerm(term: string) { - if (term.length === 0) throw new Error('Unexpected empty whitelisted term.'); - - const id = this.currentId++; - // Track the match length of this term. - const chars = [...new CharacterIterator(term)]; - const matchLength = chars.length; - this.matchLengths.set(id, matchLength); - if (matchLength > this.maxMatchLength) this.maxMatchLength = matchLength; - - let currentNode = this.rootNode; - for (const char of chars) { - const nextNode = currentNode.edges.get(char); - if (nextNode) { - currentNode = nextNode; - } else { - const newNode = new WhitelistTrieNode(); - currentNode.edges.set(char, newNode); - currentNode = newNode; - } - } - - currentNode.isOutputNode = true; - currentNode.termId = id; - } - - private constructLinks() { - // Compute the failure and output functions for the trie. This - // implementation is fairly straightforward and is essentially the exact - // same as that detailed in Aho and Corasick's original paper. Refer to - // section 3 in said paper for more details. - this.rootNode.failureLink = this.rootNode; - const queue = new Queue(); - for (const node of this.rootNode.edges.values()) { - node.failureLink = this.rootNode; - queue.push(node); - } - - while (queue.length > 0) { - const node = queue.shift()!; - for (const [char, childNode] of node.edges) { - let cur = node.failureLink; - while (!cur.edges.get(char) && cur !== this.rootNode) cur = cur.failureLink; - - childNode.failureLink = cur.edges.get(char) ?? this.rootNode; - queue.push(childNode); - } - - node.outputLink = node.failureLink.isOutputNode ? 
node.failureLink : node.failureLink.outputLink; - } - } - - private useUnderlyingEdgeCollectionImplementation(node: WhitelistTrieNode) { - node.edges = (node.edges as ForwardingEdgeCollection).underlyingImplementation; - for (const childNode of node.edges.values()) this.useUnderlyingEdgeCollectionImplementation(childNode); - } -} - -export interface WhitelistedTermMatcherOptions { - terms: string[]; - transformers?: TransformerContainer[]; -} diff --git a/src/matcher/nfa/trie/BlacklistTrieNode.ts b/src/matcher/nfa/trie/BlacklistTrieNode.ts deleted file mode 100644 index d398880..0000000 --- a/src/matcher/nfa/trie/BlacklistTrieNode.ts +++ /dev/null @@ -1,46 +0,0 @@ -import type { EdgeCollection } from './edge/EdgeCollection'; -import { ForwardingEdgeCollection } from './edge/ForwardingEdgeCollection'; - -export class BlacklistTrieNode { - public edges: EdgeCollection = new ForwardingEdgeCollection(); - - public termId = -1; - - public failureLink!: this; - - public outputLink?: this; - - public partialMatches?: PartialMatchData[]; // partial matches that end at this node - - public flags = 0; -} - -export const enum SharedFlag { - RequireWordBoundaryAtStart = 1, - RequireWordBoundaryAtEnd = 1 << 1, -} - -export const enum NodeFlag { - RequireWordBoundaryAtStart = 1, - RequireWordBoundaryAtEnd = 1 << 1, - MatchLeaf = 1 << 2, - PartialMatchLeaf = 1 << 3, -} - -export const enum PartialMatchFlag { - RequireWordBoundaryAtStart = 1, - RequireWordBoundaryAtEnd = 1 << 1, -} - -export interface PartialMatchData { - flags: number; - leadingWildcardCount: number; - matchLength: number; - step: number; - termId: number; - trailingWildcardCount: number; -} - -export function hashPartialMatch(step: number, termId: number) { - return `${step}-${termId}`; -} diff --git a/src/matcher/nfa/trie/WhitelistTrieNode.ts b/src/matcher/nfa/trie/WhitelistTrieNode.ts deleted file mode 100644 index 62215ec..0000000 --- a/src/matcher/nfa/trie/WhitelistTrieNode.ts +++ /dev/null @@ -1,14 +0,0 @@ -import type { EdgeCollection } from './edge/EdgeCollection'; -import { ForwardingEdgeCollection } from './edge/ForwardingEdgeCollection'; - -export class WhitelistTrieNode { - public edges: EdgeCollection = new ForwardingEdgeCollection(); - - public termId = -1; - - public failureLink!: WhitelistTrieNode; - - public outputLink?: WhitelistTrieNode; - - public isOutputNode = false; -} diff --git a/src/matcher/nfa/trie/edge/ArrayEdgeCollection.ts b/src/matcher/nfa/trie/edge/ArrayEdgeCollection.ts deleted file mode 100644 index e9302ab..0000000 --- a/src/matcher/nfa/trie/edge/ArrayEdgeCollection.ts +++ /dev/null @@ -1,70 +0,0 @@ -import type { Edge, EdgeCollection } from './EdgeCollection'; - -export class ArrayEdgeCollection implements EdgeCollection { - // Crossover point at which binary search becomes faster than a linear - // search. Somewhat arbitrary as benchmarking get() is hard (both linear - // search and binary search execute in less than one-tenth of a millisecond - // at the scale we're looking at) but micro-benchmarks seem to point to 8-12 - // being a crossover point. - private static readonly binarySearchThreshold = 10; - - private readonly edges: Edge[] = []; - - private dirty = false; - - public set(char: number, node: T) { - // Prefer overwriting an existing edge. 
- const index = this.edges.findIndex((edge) => edge[0] === char); - if (index === -1) { - this.edges.push([char, node]); - this.dirty = true; - } else { - this.edges[index][1] = node; - } - } - - public get(char: number) { - if (this.edges.length <= ArrayEdgeCollection.binarySearchThreshold) { - for (const edge of this.edges) { - if (edge[0] === char) return edge[1]; - } - - return; - } - - if (this.dirty) { - // Sort by character value. - this.edges.sort( - /* istanbul ignore next: not possible to write a robust test for this */ - (a, b) => (a[0] < b[0] ? -1 : b[0] < a[0] ? 1 : 0), - ); - this.dirty = false; - } - - let low = 0; - let high = this.edges.length - 1; - while (low <= high) { - const mid = (low + high) >>> 1; - const edge = this.edges[mid]; - if (edge[0] > char) high = mid - 1; - else if (edge[0] === char) return edge[1]; - else low = mid + 1; - } - } - - public get size() { - return this.edges.length; - } - - public keys() { - return this.edges.map((edge) => edge[0]).values(); - } - - public values() { - return this.edges.map((edge) => edge[1]).values(); - } - - public [Symbol.iterator]() { - return this.edges.values(); - } -} diff --git a/src/matcher/nfa/trie/edge/BucketEdgeCollection.ts b/src/matcher/nfa/trie/edge/BucketEdgeCollection.ts deleted file mode 100644 index 738dafd..0000000 --- a/src/matcher/nfa/trie/edge/BucketEdgeCollection.ts +++ /dev/null @@ -1,42 +0,0 @@ -import { CharacterCode } from '../../../../util/Char'; -import type { Edge, EdgeCollection } from './EdgeCollection'; - -export class BucketEdgeCollection implements EdgeCollection { - private _size = 0; - - private buckets = Array.from({ length: 26 }); - - public set(char: number, node: T) { - const k = char - CharacterCode.LowerA; - // Only increment the size if we didn't already have a node corresponding to it. 
-        if (!this.buckets[k]) this._size++;
-        this.buckets[k] = node;
-    }
-
-    public get(char: number) {
-        const k = char - CharacterCode.LowerA;
-        if (k >= 0 && k < 26) return this.buckets[k];
-    }
-
-    public get size() {
-        return this._size;
-    }
-
-    public *keys() {
-        for (let i = 0; i < 26; i++) {
-            if (this.buckets[i] !== undefined) yield i + CharacterCode.LowerA;
-        }
-    }
-
-    public *values() {
-        for (let i = 0; i < 26; i++) {
-            if (this.buckets[i] !== undefined) yield this.buckets[i]!;
-        }
-    }
-
-    public *[Symbol.iterator]() {
-        for (let i = 0; i < 26; i++) {
-            if (this.buckets[i] !== undefined) yield [i + CharacterCode.LowerA, this.buckets[i]] as Edge<T>;
-        }
-    }
-}
diff --git a/src/matcher/nfa/trie/edge/EdgeCollection.ts b/src/matcher/nfa/trie/edge/EdgeCollection.ts
deleted file mode 100644
index fdcb80c..0000000
--- a/src/matcher/nfa/trie/edge/EdgeCollection.ts
+++ /dev/null
@@ -1,9 +0,0 @@
-export type EdgeCollection<T> = Iterable<Edge<T>> & {
-    get(char: number): T | undefined;
-    keys(): IterableIterator<number>;
-    set(char: number, node: T): void;
-    get size(): number;
-    values(): IterableIterator<T>;
-};
-
-export type Edge<T> = [char: number, node: T];
diff --git a/src/matcher/nfa/trie/edge/ForwardingEdgeCollection.ts b/src/matcher/nfa/trie/edge/ForwardingEdgeCollection.ts
deleted file mode 100644
index cc86ad2..0000000
--- a/src/matcher/nfa/trie/edge/ForwardingEdgeCollection.ts
+++ /dev/null
@@ -1,93 +0,0 @@
-import { isLowerCase } from '../../../../util/Char';
-import { ArrayEdgeCollection } from './ArrayEdgeCollection';
-import { BucketEdgeCollection } from './BucketEdgeCollection';
-import type { EdgeCollection } from './EdgeCollection';
-
-export class ForwardingEdgeCollection<T> implements EdgeCollection<T> {
-    private _underlyingImplementation: EdgeCollection<T> = new ArrayEdgeCollection<T>();
-
-    private implementation = Implementation.Array;
-
-    private areKeysAllLowerCase = true;
-
-    public set(char: number, node: T) {
-        this.areKeysAllLowerCase &&= isLowerCase(char);
-        if (!this.areKeysAllLowerCase && this.implementation === Implementation.Bucket) {
-            this.useImplementation(this.selectImplementation());
-        }
-
-        this._underlyingImplementation.set(char, node);
-        this.useImplementation(this.selectImplementation());
-    }
-
-    public get(char: number) {
-        return this._underlyingImplementation.get(char);
-    }
-
-    public get size() {
-        return this._underlyingImplementation.size;
-    }
-
-    public keys() {
-        return this._underlyingImplementation.keys();
-    }
-
-    public values() {
-        return this._underlyingImplementation.values();
-    }
-
-    public get underlyingImplementation() {
-        return this._underlyingImplementation;
-    }
-
-    public [Symbol.iterator]() {
-        return this._underlyingImplementation[Symbol.iterator]();
-    }
-
-    private selectImplementation() {
-        // These thresholds are all somewhat arbitrary as all implementations
-        // execute in less than one-tenth of a millisecond at the scale we're at
-        // here. However, micro-benchmarks point to the bucket implementation
-        // always being faster when it's applicable (lower-case ASCII
-        // characters). As it's not very memory-efficient for small numbers of
-        // edges, we use the array implementation if the size is at most 10
-        // and the bucket implementation otherwise.
-        //
-        // When the bucket implementation is not available we choose between the
-        // array and map implementation. Both are fairly fast; though the map is
-        // fastest, the difference is not noticeable until ~50-60 edges are
-        // being stored.
Thus, as the array implementation uses less memory, we - // choose it for medium sized collections and use the map implementation - // in all other cases. - if (this.size <= 10) return Implementation.Array; - if (this.areKeysAllLowerCase) return Implementation.Bucket; - if (this.size <= 35) return Implementation.Array; - return Implementation.Map; - } - - private useImplementation(newImplementation: Implementation) { - if (this.implementation === newImplementation) return; - const newCollection = this.instantiateImplementation(newImplementation); - for (const [k, v] of this._underlyingImplementation) newCollection.set(k, v); - this._underlyingImplementation = newCollection; - this.implementation = newImplementation; - } - - private instantiateImplementation(implementation: Implementation): EdgeCollection { - switch (implementation) { - case Implementation.Array: - /* istanbul ignore next: instantiateImplementation() should never be called with Array */ - return new ArrayEdgeCollection(); - case Implementation.Bucket: - return new BucketEdgeCollection(); - case Implementation.Map: - return new Map(); - } - } -} - -const enum Implementation { - Array, - Bucket, - Map, -} diff --git a/src/matcher/regexp/RegExpMatcher.ts b/src/matcher/regexp/RegExpMatcher.ts index 463c388..7f4fdb1 100644 --- a/src/matcher/regexp/RegExpMatcher.ts +++ b/src/matcher/regexp/RegExpMatcher.ts @@ -12,13 +12,6 @@ import type { Matcher } from '../Matcher'; /** * An implementation of the [[Matcher]] interface using regular expressions and * string searching methods. - * - * It should be the default choice for users of this package, as though it is - * theoretically slower than the more complex [[NfaMatcher]], it uses much less - * memory and is more efficient for low/medium numbers of patterns. - * - * Refer to the documentation of the [[NfaMatcher]] class for further discussion - * on when to choose that implementation over this one. */ export class RegExpMatcher implements Matcher { private readonly blacklistedTerms: CompiledBlacklistedTerm[]; diff --git a/src/pattern/Pattern.ts b/src/pattern/Pattern.ts index 4f3478c..5d26226 100644 --- a/src/pattern/Pattern.ts +++ b/src/pattern/Pattern.ts @@ -97,7 +97,7 @@ const parser = new Parser(); * const parsed = pattern`my initials are \[??\]`; // match "my initials are [", then any two characters, then a "]" * ``` * @returns The parsed pattern, which can then be used with the - * [[RegExpMatcher]] or the [[NfaMatcher]]. + * [[RegExpMatcher]]. * @throws [[ParserError]] if a syntactical error was detected while parsing the * pattern. * @see [[parseRawPattern]] if you want to parse a string into a pattern without @@ -125,7 +125,7 @@ export function pattern(strings: TemplateStringsArray, ...expressions: unknown[] * @throws [[ParserError]] if a syntactical error was detected while parsing the * pattern. * @returns The parsed pattern, which can then be used with the - * [[RegExpMatcher]] or the [[NfaMatcher]]. + * [[RegExpMatcher]]. 
*/ export function parseRawPattern(pattern: string) { return parser.parse(pattern); diff --git a/src/preset/english.ts b/src/preset/english.ts index 55e72e9..dd739f9 100644 --- a/src/preset/english.ts +++ b/src/preset/english.ts @@ -1,5 +1,5 @@ import { DataSet } from '../dataset/DataSet'; -import type { NfaMatcherOptions } from '../matcher/nfa/NfaMatcher'; +import type { RegExpMatcherOptions } from '../matcher/regexp/RegExpMatcher'; import { pattern } from '../pattern/Pattern'; import { collapseDuplicatesTransformer } from '../transformer/collapse-duplicates'; import { resolveConfusablesTransformer } from '../transformer/resolve-confusables'; @@ -43,10 +43,10 @@ export const englishRecommendedWhitelistMatcherTransformers = [ /** * Recommended transformers to be used with the [[englishDataset | english word - * dataset]] and the [[RegExpMatcher]] or the [[NfaMatcher]]. + * dataset]] and the [[RegExpMatcher]]. */ export const englishRecommendedTransformers: Pick< - NfaMatcherOptions, + RegExpMatcherOptions, 'blacklistMatcherTransformers' | 'whitelistMatcherTransformers' > = { blacklistMatcherTransformers: englishRecommendedBlacklistMatcherTransformers, diff --git a/src/transformer/Transformers.ts b/src/transformer/Transformers.ts index 69c2aa7..a014c25 100644 --- a/src/transformer/Transformers.ts +++ b/src/transformer/Transformers.ts @@ -40,7 +40,7 @@ export type TransformerContainer = SimpleTransformerContainer | StatefulTransfor * character. A return value of `undefined` indicates that the character should * be ignored. * @returns A container holding the transformer, which can then be passed to the - * [[RegExpMatcher]] or the [[NfaMatcher]]. + * [[RegExpMatcher]]. */ export function createSimpleTransformer(transformer: TransformerFn): SimpleTransformerContainer { return { type: TransformerType.Simple, transform: transformer }; @@ -95,7 +95,7 @@ export interface SimpleTransformerContainer { * @param factory A function that returns an instance of the stateful * transformer. * @returns A container holding the stateful transformer, which can then be - * passed to the [[RegExpMatcher]] or the [[NfaMatcher]]. + * passed to the [[RegExpMatcher]]. */ export function createStatefulTransformer(factory: StatefulTransformerFactory): StatefulTransformerContainer { return { type: TransformerType.Stateful, factory }; diff --git a/src/transformer/collapse-duplicates/index.ts b/src/transformer/collapse-duplicates/index.ts index 45d591d..b196bcd 100644 --- a/src/transformer/collapse-duplicates/index.ts +++ b/src/transformer/collapse-duplicates/index.ts @@ -41,7 +41,7 @@ import { CollapseDuplicatesTransformer } from './transformer'; * ``` * @param options - Options for the transformer. * @returns A container holding the transformer, which can then be passed to the - * [[RegExpMatcher]] or the [[NfaMatcher]]. + * [[RegExpMatcher]]. */ export function collapseDuplicatesTransformer({ defaultThreshold = 1, diff --git a/src/transformer/remap-characters/index.ts b/src/transformer/remap-characters/index.ts index 3d5fc6f..a6a756d 100644 --- a/src/transformer/remap-characters/index.ts +++ b/src/transformer/remap-characters/index.ts @@ -31,7 +31,7 @@ import { createSimpleTransformer } from '../Transformers'; * ``` * @param mapping - A map/object mapping certain characters to others. * @returns A container holding the transformer, which can then be passed to the - * [[RegExpMatcher]] or the [[NfaMatcher]]. + * [[RegExpMatcher]]. 
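Since these doc comments now point solely at the RegExpMatcher, a brief usage sketch of remapCharactersTransformer with it may help. The specific mapping and pattern below are illustrative, not part of the library; the call shape follows the doc comment above, where each value lists the characters that remap to its key.

```typescript
import { pattern, remapCharactersTransformer, RegExpMatcher } from 'obscenity';

// Treat '@' and '4' as the letter 'a' before blacklisted patterns are matched.
const matcher = new RegExpMatcher({
    blacklistedTerms: [{ id: 0, pattern: pattern`baad` }],
    blacklistMatcherTransformers: [remapCharactersTransformer({ a: '@4' })],
});

console.log(matcher.hasMatch('b@4d')); // => true: both characters remap to 'a'
```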
* @see [[resolveConfusablesTransformer| Transformer that handles confusable Unicode characters]] * @see [[resolveLeetSpeakTransformer | Transformer that handles leet-speak]] */ diff --git a/src/transformer/resolve-confusables/index.ts b/src/transformer/resolve-confusables/index.ts index 543e720..c8f39cb 100644 --- a/src/transformer/resolve-confusables/index.ts +++ b/src/transformer/resolve-confusables/index.ts @@ -17,7 +17,7 @@ import { confusables } from './confusables'; * const matcher = new RegExpMatcher({ ..., blacklistMatcherTransformers: [transformer] }); * ``` * @returns A container holding the transformer, which can then be passed to the - * [[RegExpMatcher]] or the [[NfaMatcher]]. + * [[RegExpMatcher]]. */ export function resolveConfusablesTransformer() { return remapCharactersTransformer(confusables); diff --git a/src/transformer/resolve-leetspeak/index.ts b/src/transformer/resolve-leetspeak/index.ts index cca260e..0e19398 100644 --- a/src/transformer/resolve-leetspeak/index.ts +++ b/src/transformer/resolve-leetspeak/index.ts @@ -18,7 +18,7 @@ import { dictionary } from './dictionary'; * const matcher = new RegExpMatcher({ ..., blacklistMatcherTransformers: [transformer] }); * ``` * @returns A container holding the transformer, which can then be passed to the - * [[RegExpMatcher]] or the [[NfaMatcher]]. + * [[RegExpMatcher]]. */ export function resolveLeetSpeakTransformer() { return remapCharactersTransformer(dictionary); diff --git a/src/transformer/skip-non-alphabetic/index.ts b/src/transformer/skip-non-alphabetic/index.ts index 2a41f7a..e967588 100644 --- a/src/transformer/skip-non-alphabetic/index.ts +++ b/src/transformer/skip-non-alphabetic/index.ts @@ -18,7 +18,7 @@ import { createSimpleTransformer } from '../Transformers'; * const matcher = new RegExpMatcher({ ..., blacklistMatcherTransformers: [transformer] }); * ``` * @returns A container holding the transformer, which can then be passed to the - * [[RegExpMatcher]] or the [[NfaMatcher]]. + * [[RegExpMatcher]]. */ export function skipNonAlphabeticTransformer() { return createSimpleTransformer((c) => (isAlphabetic(c) ? c : undefined)); diff --git a/src/transformer/to-ascii-lowercase/index.ts b/src/transformer/to-ascii-lowercase/index.ts index 4202b9e..a8b16ff 100644 --- a/src/transformer/to-ascii-lowercase/index.ts +++ b/src/transformer/to-ascii-lowercase/index.ts @@ -13,7 +13,7 @@ import { createSimpleTransformer } from '../Transformers'; * of varying cases. * * @returns A container holding the transformer, which can then be passed to the - * [[RegExpMatcher]] or the [[NfaMatcher]]. + * [[RegExpMatcher]]. */ export function toAsciiLowerCaseTransformer() { return createSimpleTransformer((c) => (isUpperCase(c) ? 
invertCaseOfAlphabeticChar(c) : c)); diff --git a/test/matcher/nfa/NfaMatcher.fuzz.test.ts b/test/matcher/nfa/NfaMatcher.fuzz.test.ts deleted file mode 100644 index cd38437..0000000 --- a/test/matcher/nfa/NfaMatcher.fuzz.test.ts +++ /dev/null @@ -1,147 +0,0 @@ -import * as fc from 'fast-check'; -import { assignIncrementingIds } from '../../../src/matcher/BlacklistedTerm'; -import { NfaMatcher } from '../../../src/matcher/nfa/NfaMatcher'; -import type { LiteralNode, ParsedPattern } from '../../../src/pattern/Nodes'; -import { SyntaxKind } from '../../../src/pattern/Nodes'; -import { CharacterCode } from '../../../src/util/Char'; -import { CharacterIterator } from '../../../src/util/CharacterIterator'; -import type { Interval } from '../../../src/util/Interval'; -import { compareIntervals } from '../../../src/util/Interval'; - -test('running the pattern matcher on a set of patterns and input should have the same result as using a brute force approach with regexp', () => { - fc.assert( - fc.property( - fc.stringOf(fc.char()).chain((input) => { - // Generate patterns that are a substring of the input. - const arbitrarySubstringPatterns = - input.length < 2 - ? fc.constant([]) - : fc.array( - fc.tuple( - fc - .tuple(fc.integer({ min: 0, max: input.length - 1 }), fc.integer({ min: 0, max: input.length - 1 })) - .filter(([a, b]) => a !== b) - .chain(([a, b]) => { - // eslint-disable-next-line no-param-reassign - if (a > b) [a, b] = [b, a]; - return fc.tuple(fc.constant(input.slice(a, b)), fc.uniqueArray(fc.integer({ min: a, max: b }))); - }) - .map(([pattern, wildcardIndices]) => { - let patternWithWildcards = ''; - // eslint-disable-next-line unicorn/no-for-loop - for (let i = 0; i < pattern.length; i++) { - if (wildcardIndices.includes(i)) patternWithWildcards += '?'; - else patternWithWildcards += pattern[i]; - } - - return patternWithWildcards; - }), - fc.boolean(), - fc.boolean(), - ), - ); - // Completely random patterns. - const completelyArbitraryPatterns = fc.array( - fc.tuple( - fc - .stringOf(fc.oneof(fc.char16bits(), fc.char16bits(), fc.char16bits(), fc.constant('?'))) - .filter((p) => p.length > 0), - fc.boolean(), - fc.boolean(), - ), - ); - return fc.tuple(fc.constant(input), completelyArbitraryPatterns, arbitrarySubstringPatterns); - }), - ([input, randomPatterns, substrPatterns]) => { - const seen = new Set(); - const allPatterns: [string, boolean, boolean][] = []; - for (const pattern of randomPatterns) { - // Make sure we don't use the same pattern twice. - if (!seen.has(pattern[0])) { - allPatterns.push(pattern); - seen.add(pattern[0]); - } - } - - for (const pattern of substrPatterns) { - // Similar. 
-                    if (!seen.has(pattern[0])) {
-                        allPatterns.push(pattern);
-                        seen.add(pattern[0]);
-                    }
-                }
-
-                const matcher = new NfaMatcher({
-                    blacklistedTerms: assignIncrementingIds(
-                        allPatterns.map(([pattern, requireWordBoundaryAtStart, requireWordBoundaryAtEnd]) =>
-                            toNodes(pattern, requireWordBoundaryAtStart, requireWordBoundaryAtEnd),
-                        ),
-                    ),
-                });
-
-                const matchedRegions = matcher.getAllMatches(input);
-                const transformedMatches: Record<number, Interval[]> = {};
-                for (const payload of matchedRegions) {
-                    // eslint-disable-next-line @typescript-eslint/no-unnecessary-condition
-                    (transformedMatches[payload.termId] ??= []).push([payload.startIndex, payload.endIndex]);
-                }
-
-                for (const matches of Object.values(transformedMatches)) matches.sort((a, b) => compareIntervals(...a, ...b));
-                expect(transformedMatches).toStrictEqual(
-                    bruteForceMatch(
-                        allPatterns.map(([pattern, requireWordBoundaryAtStart, requireWordBoundaryAtEnd]) =>
-                            toRegExp(pattern, requireWordBoundaryAtStart, requireWordBoundaryAtEnd),
-                        ),
-                        input,
-                    ),
-                );
-            },
-        ),
-    );
-});
-
-function bruteForceMatch(regExps: RegExp[], input: string) {
-    const result: Record<number, Interval[]> = {};
-    for (const [i, regExp] of regExps.entries()) {
-        let match: RegExpExecArray | null;
-        while ((match = regExp.exec(input))) {
-            // eslint-disable-next-line @typescript-eslint/no-unnecessary-condition
-            (result[i] ??= []).push([match.index, match.index + match[0].length - 1]);
-            regExp.lastIndex = match.index + 1;
-        }
-    }
-
-    for (const matches of Object.values(result)) matches.sort((a, b) => compareIntervals(...a, ...b));
-    return result;
-}
-
-const regExpSpecialChars = ['.', '*', '+', '^', '$', '{', '}', '(', ')', '|', '[', '\\', ']'];
-
-function toRegExp(pattern: string, requireWordBoundaryAtStart: boolean, requireWordBoundaryAtEnd: boolean) {
-    let regexpStr = '';
-    if (requireWordBoundaryAtStart) regexpStr += '(?<=[^\\dA-Za-z]|^)';
-    for (const char of pattern) {
-        if (regExpSpecialChars.includes(char)) regexpStr += `\\${char}`;
-        else if (char === '?') regexpStr += '.';
-        else regexpStr += char;
-    }
-
-    if (requireWordBoundaryAtEnd) regexpStr += '(?=[^\\dA-Za-z]|$)';
-    return new RegExp(regexpStr, 'gs');
-}
-
-function toNodes(pattern: string, requireWordBoundaryAtStart: boolean, requireWordBoundaryAtEnd: boolean) {
-    const parsed: ParsedPattern = { nodes: [], requireWordBoundaryAtStart, requireWordBoundaryAtEnd };
-    for (const char of new CharacterIterator(pattern)) {
-        if (char === CharacterCode.QuestionMark) {
-            parsed.nodes.push({ kind: SyntaxKind.Wildcard });
-        } else if (parsed.nodes.length === 0 || parsed.nodes[parsed.nodes.length - 1].kind !== SyntaxKind.Literal) {
-            parsed.nodes.push({ kind: SyntaxKind.Literal, chars: [char] });
-        } else {
-            // eslint-disable-next-line @typescript-eslint/no-unnecessary-type-assertion
-            (parsed.nodes[parsed.nodes.length - 1] as LiteralNode).chars.push(char);
-        }
-    }
-
-    return parsed;
-}
diff --git a/test/matcher/nfa/NfaMatcher.test.ts b/test/matcher/nfa/NfaMatcher.test.ts
deleted file mode 100644
index ed7d5c7..0000000
--- a/test/matcher/nfa/NfaMatcher.test.ts
+++ /dev/null
@@ -1,845 +0,0 @@
-import { assignIncrementingIds } from '../../../src/matcher/BlacklistedTerm';
-import type { MatchPayload } from '../../../src/matcher/MatchPayload';
-import { NfaMatcher } from '../../../src/matcher/nfa/NfaMatcher';
-import { WhitelistedTermMatcher } from '../../../src/matcher/nfa/WhitelistedTermMatcher';
-import { parseRawPattern, pattern } from '../../../src/pattern/Pattern';
-import { createSimpleTransformer } from
'../../../src/transformer/Transformers'; -import { skipNonAlphabeticTransformer } from '../../../src/transformer/skip-non-alphabetic'; -import { CharacterCode } from '../../../src/util/Char'; - -describe('constructor', () => { - it('should not accept patterns with the same id', () => { - expect( - () => - new NfaMatcher({ - blacklistedTerms: [ - { id: 10, pattern: pattern`` }, - { id: 10, pattern: pattern`yo` }, - ], - }), - ).toThrow(new Error('Found duplicate blacklisted term ID 10.')); - }); - - it('should not accept empty patterns', () => { - expect( - () => - new NfaMatcher({ - blacklistedTerms: [{ id: 10, pattern: pattern`` }], - }), - ).toThrow('potentially matches empty string'); - }); - - it('should not accept patterns with optionals that have the empty string in their match set', () => { - expect( - () => - new NfaMatcher({ - blacklistedTerms: [{ id: 10, pattern: pattern`[abc]` }], - }), - ).toThrow('potentially matches empty string'); - }); -}); - -it('should match nothing if there are no patterns', () => { - const matcher = new NfaMatcher({ blacklistedTerms: [] }); - expect(matcher.getAllMatches('foo bar')).toHaveLength(0); -}); - -describe('simple matching; no wildcards/optionals', () => { - it.each([ - [ - 'should match a term at the start of the string', - ['hello'], - 'hello world', - { - 0: [[0, 4]], - }, - ], - [ - 'should match a term at the end of the string', - ['world'], - 'hello world', - { - 0: [[6, 10]], - }, - ], - ['should be case sensitive (no match)', ['WORLD'], 'hello world', []], - ['should be case sensitive (with match)', ['yO'], 'hello yO yo', { 0: [[6, 7]] }], - [ - 'should support spaces in terms', - ['hello W0rld'], - 'hello world! hello W0rld!', - { - 0: [[13, 23]], - }, - ], - [ - 'should support surrogate pairs', - ['cool 🌉'], - 'cool cool cool cool 🌉', - { - 0: [[15, 21]], - }, - ], - [ - 'should work with terms that are suffixes of other ones', - ['cool', 'cool beans'], - 'cool cool beans', - { - 0: [ - [0, 3], - [5, 8], - ], - 1: [[5, 14]], - }, - ], - [ - 'should work with terms that are suffixes of other ones, test 2', - ['he', 'she', 'his', 'her', 'here'], - 'he waited for she and her mom to go there', - { - 0: [ - [0, 1], - [15, 16], - [22, 23], - [37, 38], - ], - 1: [[14, 16]], - 3: [ - [37, 39], - [22, 24], - ], - 4: [[37, 40]], - }, - ], - ['should only match on the term exactly', ['her'], 'h he! her', { 0: [[6, 8]] }], - [ - 'should work with very long terms', - ['Pneumonoultramicroscopicsilicovolcanoconiosis', 'horrible'], - 'wow this word is quite long: Pneumonoultramicroscopicsilicovolcanoconiosie <- did you notice there was a typo there? horrible of me to do that... 
Pneumonoultramicroscopicsilicovolcanoconiosis', - { - 0: [[146, 190]], - 1: [[117, 124]], - }, - ], - [ - 'should match several similar terms', - ['thing', 'thang'], - 'im just doin my thign thing ok thang', - { - 0: [[22, 26]], - 1: [[31, 35]], - }, - ], - ['should work with terms that normalize to a different string', ['豈'], '豈', { 0: [[0, 0]] }], - ['should work with the null character', ['\u0000'], '\u0000', { 0: [[0, 0]] }], - ])('%s', (_, patterns, input, matches) => { - const expected: MatchPayload[] = []; - for (const [idStr, matchData] of Object.entries(matches)) { - const id = Number(idStr); - for (const match of matchData) { - expected.push({ - termId: id, - startIndex: match[0], - endIndex: match[1], - matchLength: [...patterns[id]].length, - }); - } - } - - const matcher = new NfaMatcher({ blacklistedTerms: assignIncrementingIds(patterns.map(parseRawPattern)) }); - expect(matcher.getAllMatches(input)).toBePermutationOf(expected); - }); -}); - -describe('matching with optionals', () => { - it('should emit matches with the correct ID', () => { - const matches = new NfaMatcher({ blacklistedTerms: [{ id: 10, pattern: pattern`w[o]rld` }] }).getAllMatches( - 'world wrld', - ); - expect(matches).toHaveLength(2); - expect(matches[0].termId).toBe(10); - expect(matches[1].termId).toBe(10); - }); - - it.each([ - [ - 'should match a single pattern with an optional at the start', - ['[a]bc'], - 'abc bc', - { - 0: [ - [0, 2, 3], - [1, 2, 2], - [4, 5, 2], - ], - }, - ], - [ - 'should match a single pattern with an optional at the end', - ['bc[d]'], - 'cant think of any good strings bc :(d', - { - 0: [[31, 32, 2]], - }, - ], - [ - 'should match a single pattern with an optional in the middle', - ['b[c]d'], - 'getting tired of writing tests, bcd bd :P', - { - 0: [ - [32, 34, 3], - [36, 37, 2], - ], - }, - ], - [ - 'should match a single pattern with an optional wildcard', - ['pi[?]kle'], - 'pickles are good and so are pikles, whatever those are', - { - 0: [ - [0, 5, 6], - [28, 32, 5], - ], - }, - ], - [ - 'should match several patterns with optionals', - ['s[?]m[e]th', 's[?]metimes'], - 'sometimes i like smth', - { - 0: [[17, 20, 4]], - 1: [[0, 8, 9]], - }, - ], - ])('%s', (_, patterns, input, matches) => { - const expected: MatchPayload[] = []; - for (const [idStr, matchData] of Object.entries(matches)) { - const id = Number(idStr); - for (const match of matchData) { - expected.push({ - termId: id, - startIndex: match[0], - endIndex: match[1], - matchLength: match[2], - }); - } - } - - const matcher = new NfaMatcher({ blacklistedTerms: assignIncrementingIds(patterns.map(parseRawPattern)) }); - expect(matcher.getAllMatches(input)).toBePermutationOf(expected); - }); -}); - -describe('matching with wildcards', () => { - it.each([ - [ - 'should match a pattern that only contains wildcards', - ['??', '?'], - 'abc', - { - 0: [ - [0, 1], - [1, 2], - ], - 1: [ - [0, 0], - [1, 1], - [2, 2], - ], - }, - ], - [ - 'should match a single pattern with an wildcard at the end correctly', - ['hello?'], - 'hellom world', - { - 0: [[0, 5]], - }, - ], - [ - 'should match a single pattern with a wildcard at the start correctly', - ['?world'], - 'my world', - { - 0: [[2, 7]], - }, - ], - [ - 'should match a single pattern with a wildcard in the middle correctly', - ['?world?'], - 'the world!', - { - 0: [[3, 9]], - }, - ], - [ - 'should match several patterns with wildcards in varying positions correctly', - ['?start', 'end?', 'mid?le?'], - 'look, wildcards can be at the start, the end, or the middle!', - { - 0: 
[[29, 34]], - 1: [[41, 44]], - 2: [[53, 59]], - }, - ], - [ - 'should match two patterns where the first is a proper suffix of the latter and has a wildcard correctly', - ['hello', 'ell?'], - 'hey, hello there!', - { - 0: [[5, 9]], - 1: [[6, 9]], - }, - ], - [ - 'should match four patterns where the first is a proper suffix of the second, similar with the second and so on', - ['l?', 'll?', 'ell?', 'hello'], - 'test test test hello??', - { - 0: [ - [17, 18], - [18, 19], - ], - 1: [[17, 19]], - 2: [[16, 19]], - 3: [[15, 19]], - }, - ], - [ - 'should match two patterns where one is a single wildcard and the second is a literal', - ['a!', '?'], - 'a! ', - { - 0: [[0, 1]], - 1: [ - [0, 0], - [1, 1], - [2, 2], - ], - }, - ], - [ - 'should treat surrogate pairs as a single character and thus match a wildcard', - ['night', 'cool ?'], - 'what a cool 🌉 night sky', - { - 0: [[15, 19]], - 1: [[7, 13]], - }, - ], - [ - 'should not match patterns with leading wildcards if there are insufficient characters at the start', - ['??bye'], - 'dbye', - {}, - ], - [ - 'should not match patterns with trailing wildcards if there are insufficient characters at the end', - ['hi????'], - 'hid', - {}, - ], - ])('%s', (_, patterns, input, matches) => { - const expected: MatchPayload[] = []; - for (const [idStr, matchData] of Object.entries(matches)) { - const id = Number(idStr); - for (const match of matchData) { - expected.push({ - termId: id, - startIndex: match[0], - endIndex: match[1], - matchLength: [...patterns[id]].length, - }); - } - } - - const matcher = new NfaMatcher({ blacklistedTerms: assignIncrementingIds(patterns.map(parseRawPattern)) }); - expect(matcher.getAllMatches(input)).toBePermutationOf(expected); - }); -}); - -describe('matching with word boundaries', () => { - it.each([ - // normal patterns - [ - 'should not emit matches for patterns which require a word boundary at the start if the matched segment has a word char before it', - ['|cool'], - 'something is quitecool', - {}, - ], - [ - 'should emit matches for patterns which require a word boundary at the start if the matched segment has a non-word char before it', - ['|beans'], - 'delicious beans', - { - 0: [[10, 14, 5]], - }, - ], - [ - 'should emit matches for patterns which require a word boundary at the start if the matched segment begins at the start of the string', - ['|things'], - 'things are cool', - { - 0: [[0, 5, 6]], - }, - ], - [ - 'should not emit matches for patterns which require a word boundary at the end if the matched segment does not have a non-word char after it', - ['cool|'], - 'something is quite coolbeans', - {}, - ], - [ - 'should emit matches for patterns which require a word boundary at the start if the matched segment has a non-word char after it', - ['beans|'], - 'delicious beans yes', - { - 0: [[10, 14, 5]], - }, - ], - [ - 'should emit matches for patterns which require a word boundary at the start if the matched segment ends at the eof', - ['things|'], - 'there are many things', - { - 0: [[15, 20, 6]], - }, - ], - - // normal patterns w/ non-word chars - [ - 'should not emit matches for patterns which require a word boundary at the start if the matched segment has a word char before it (pattern has non-word char near the start)', - ['|c!ol'], - 'something is quitec!ol', - {}, - ], - [ - 'should emit matches for patterns which require a word boundary at the start if the matched segment has a non-word char before it (pattern has non-word char near the start)', - ['|b*ans'], - 'delicious b*ans', - { - 0: [[10, 14, 5]], 
-			},
-		],
-		[
-			'should emit matches for patterns which require a word boundary at the start if the matched segment begins at the start of the string (pattern has non-word char near the start)',
-			['|t^ings'],
-			't^ings are cool',
-			{
-				0: [[0, 5, 6]],
-			},
-		],
-		[
-			'should not emit matches for patterns which require a word boundary at the end if the matched segment does not have a non-word char after it (pattern has non-word char near the end)',
-			['co#l|'],
-			'something is quite co#lbeans',
-			{},
-		],
-		[
-			'should emit matches for patterns which require a word boundary at the end if the matched segment has a non-word char after it (pattern has non-word char near the end)',
-			['bea!s|'],
-			'delicious bea!s yes',
-			{
-				0: [[10, 14, 5]],
-			},
-		],
-		[
-			'should emit matches for patterns which require a word boundary at the end if the matched segment ends at the end of the string (pattern has non-word char near the end)',
-			['thin$s|'],
-			'there are many thin$s',
-			{
-				0: [[15, 20, 6]],
-			},
-		],
-
-		// patterns with wildcards
-		[
-			'should not emit matches for patterns which require a word boundary at the start if the matched segment does not have a non-word char before it (with wildcards)',
-			['|c?ol'],
-			'something is quitecool',
-			{},
-		],
-		[
-			'should emit matches for patterns which require a word boundary at the start if the matched segment has a non-word char before it (with wildcards)',
-			['|be?ns'],
-			'delicious beans',
-			{
-				0: [[10, 14, 5]],
-			},
-		],
-		[
-			'should emit matches for patterns which require a word boundary at the start if the matched segment begins at the start of the string (with wildcards)',
-			['|?hings'],
-			'things are cool',
-			{
-				0: [[0, 5, 6]],
-			},
-		],
-		[
-			'should not emit matches for patterns which require a word boundary at the end if the matched segment does not have a non-word char after it (with wildcards)',
-			['?ool|'],
-			'something is quite coolbeans',
-			{},
-		],
-		[
-			'should emit matches for patterns which require a word boundary at the end if the matched segment has a non-word char after it (with wildcards)',
-			['be?ns|'],
-			'delicious beans yes',
-			{
-				0: [[10, 14, 5]],
-			},
-		],
-		[
-			'should emit matches for patterns which require a word boundary at the end if the matched segment ends at the end of the string (with wildcards)',
-			['thing?|'],
-			'there are many things',
-			{
-				0: [[15, 20, 6]],
-			},
-		],
-		[
-			'should match a pattern with only wildcards and a word boundary at the start correctly',
-			['|??'],
-			'myby',
-			{
-				0: [[0, 1, 2]],
-			},
-		],
-		[
-			'should match a pattern with only wildcards and a word boundary at the end correctly',
-			['??|'],
-			'myby',
-			{
-				0: [[2, 3, 2]],
-			},
-		],
-
-		// patterns with wildcards and non-word chars
-		[
-			'should not emit matches for patterns which require a word boundary at the start if the matched segment does not have a non-word char before it (with wildcards and a non-word char near the start)',
-			['|c!?ol'],
-			'something is quitec!ool',
-			{},
-		],
-		[
-			'should emit matches for patterns which require a word boundary at the start if the matched segment has a non-word char before it (with wildcards and a non-word char near the start)',
-			['|b$e?ns'],
-			'delicious b$eans',
-			{
-				0: [[10, 15, 6]],
-			},
-		],
-		[
-			'should emit matches for patterns which require a word boundary at the start if the matched segment begins at the start of the string (with wildcards and a non-word char near the start)',
-			['|?^ings'],
-			't^ings are cool',
-			{
-				0: [[0, 5, 6]],
-			},
-		],
-		[
-			'should not emit matches for patterns which require a word boundary at the end if the matched segment does not have a non-word char after it (with wildcards and a non-word char near the end)',
-			['?o_l|'],
-			'something is quite coolbeans',
-			{},
-		],
-		[
-			'should emit matches for patterns which require a word boundary at the end if the matched segment has a non-word char after it (with wildcards and a non-word char near the end)',
-			['be?%s|'],
-			'delicious bea%s yes',
-			{
-				0: [[10, 14, 5]],
-			},
-		],
-		[
-			'should emit matches for patterns which require a word boundary at the end if the matched segment ends at the end of the string (with wildcards and a non-word char near the end)',
-			['thin*?|'],
-			'there are many thin*s',
-			{
-				0: [[15, 20, 6]],
-			},
-		],
-	])('%s', (_, patterns, input, matches) => {
-		const expected: MatchPayload[] = [];
-		for (const [idStr, matchData] of Object.entries(matches)) {
-			const id = Number(idStr);
-			for (const match of matchData) {
-				expected.push({
-					termId: id,
-					startIndex: match[0],
-					endIndex: match[1],
-					matchLength: match[2],
-				});
-			}
-		}
-
-		const matcher = new NfaMatcher({ blacklistedTerms: assignIncrementingIds(patterns.map(parseRawPattern)) });
-		expect(matcher.getAllMatches(input)).toBePermutationOf(expected);
-	});
-});
-
-describe('matching with whitelisted terms', () => {
-	it('should call the getMatches() method of the WhitelistedTermMatcher with the input', () => {
-		const spy = jest.spyOn(WhitelistedTermMatcher.prototype, 'getMatches');
-		const matcher = new NfaMatcher({
-			blacklistedTerms: [{ id: 1, pattern: pattern`thing` }],
-			blacklistMatcherTransformers: [skipNonAlphabeticTransformer()],
-			whitelistedTerms: ['thi ing'],
-		});
-		matcher.getAllMatches('the thi ing');
-		expect(spy).toHaveBeenCalledTimes(1);
-		expect(spy).toHaveBeenLastCalledWith('the thi ing');
-	});
-
-	it('should not match parts of the text which are completely matched by a whitelisted term', () => {
-		const matcher = new NfaMatcher({
-			blacklistedTerms: [{ id: 1, pattern: pattern`penis` }],
-			blacklistMatcherTransformers: [skipNonAlphabeticTransformer()],
-			whitelistedTerms: ['pen is'],
-		});
-		expect(matcher.getAllMatches('the pen is mightier than the penis')).toStrictEqual([
-			{
-				termId: 1,
-				startIndex: 29,
-				endIndex: 33,
-				matchLength: 5,
-			},
-		]);
-	});
-
-	it('should match parts of the text that merely overlap (and are not completely contained by) a whitelisted term', () => {
-		const matcher = new NfaMatcher({
-			blacklistedTerms: [{ id: 1, pattern: pattern`bitch` }],
-			whitelistedTerms: ['bit', 'itch'],
-		});
-		expect(matcher.getAllMatches('a bitch')).toStrictEqual([
-			{
-				termId: 1,
-				startIndex: 2,
-				endIndex: 6,
-				matchLength: 5,
-			},
-		]);
-	});
-});
-
-describe('matching with blacklist transformers', () => {
-	it('should skip characters which became undefined after transformation', () => {
-		const skipSpaces = createSimpleTransformer((c) => (c === 32 ? undefined : c));
-		const matcher = new NfaMatcher({
-			blacklistedTerms: [{ id: 1, pattern: pattern`something` }],
-			blacklistMatcherTransformers: [skipSpaces],
-		});
-		expect(matcher.getAllMatches('s o m e t h i n g')).toStrictEqual([
-			{
-				termId: 1,
-				startIndex: 0,
-				endIndex: 17,
-				matchLength: 9,
-			},
-		]);
-	});
-
-	it('should work with transformers that change chars (no match)', () => {
-		// eslint-disable-next-line @typescript-eslint/naming-convention, @typescript-eslint/restrict-plus-operands
-		const changeAToB = createSimpleTransformer((c) => (c === CharacterCode.LowerA ? c + 1 : c));
-		const matcher = new NfaMatcher({
-			blacklistedTerms: [{ id: 1, pattern: pattern`sa?e` }],
-			blacklistMatcherTransformers: [changeAToB],
-		});
-		expect(matcher.getAllMatches('same')).toHaveLength(0);
-	});
-
-	it('should work with transformers that change chars (with match)', () => {
-		// eslint-disable-next-line @typescript-eslint/naming-convention, @typescript-eslint/restrict-plus-operands
-		const changeAToB = createSimpleTransformer((c) => (c === CharacterCode.LowerA ? c + 1 : c));
-		const matcher = new NfaMatcher({
-			blacklistedTerms: [{ id: 1, pattern: pattern`hbllo?` }],
-			blacklistMatcherTransformers: [changeAToB],
-		});
-		expect(matcher.getAllMatches('sup hallothere')).toStrictEqual([
-			{
-				termId: 1,
-				startIndex: 4,
-				endIndex: 9,
-				matchLength: 6,
-			},
-		]);
-	});
-
-	it('should not affect matching of whitelisted terms', () => {
-		// eslint-disable-next-line @typescript-eslint/restrict-plus-operands
-		const ignoreAllAs = createSimpleTransformer((c) => (c === CharacterCode.LowerA ? c + 1 : c));
-		const matcher = new NfaMatcher({
-			blacklistedTerms: [{ id: 1, pattern: pattern`bbb` }],
-			whitelistedTerms: ['aabbbaa'],
-			blacklistMatcherTransformers: [ignoreAllAs],
-		});
-		expect(matcher.getAllMatches('!!!! $$aabbbaa## !!!')).toHaveLength(0);
-	});
-
-	it('should work with patterns that have trailing wildcards', () => {
-		const skipSpaces = createSimpleTransformer((c) => (c === 32 ? undefined : c));
-		const matcher = new NfaMatcher({
-			blacklistedTerms: [{ id: 1, pattern: pattern`trailing?` }],
-			whitelistedTerms: [],
-			blacklistMatcherTransformers: [skipSpaces],
-		});
-		expect(matcher.getAllMatches(' !!!! $$ t r a i l i n g## !!!')).toStrictEqual([
-			{
-				termId: 1,
-				startIndex: 9,
-				endIndex: 33,
-				matchLength: 9,
-			},
-		]);
-	});
-});
-
-describe('matching with whitelist transformers', () => {
-	it('should work with transformers that make chars undefined after transformation', () => {
-		const skipSpaces = createSimpleTransformer((c) => (c === 32 ? undefined : c));
-		const matcher = new NfaMatcher({
-			blacklistedTerms: [{ id: 1, pattern: pattern`world` }],
-			whitelistedTerms: ['helloworld!'],
-			whitelistMatcherTransformers: [skipSpaces],
-		});
-		expect(matcher.getAllMatches('h e l l o world!')).toHaveLength(0);
-	});
-
-	it('should work with transformers that change chars (no match)', () => {
-		// eslint-disable-next-line @typescript-eslint/naming-convention, @typescript-eslint/restrict-plus-operands
-		const changeAToB = createSimpleTransformer((c) => (c === CharacterCode.LowerA ? c + 1 : c));
-		const matcher = new NfaMatcher({
-			blacklistedTerms: [{ id: 1, pattern: pattern`biash` }],
-			whitelistedTerms: ['a biash'],
-			whitelistMatcherTransformers: [changeAToB],
-		});
-		expect(matcher.getAllMatches('the a biash was')).toStrictEqual([
-			{ termId: 1, startIndex: 6, endIndex: 10, matchLength: 5 },
-		]);
-	});
-
-	it('should work with transformers that change chars (with match)', () => {
-		// eslint-disable-next-line @typescript-eslint/naming-convention, @typescript-eslint/restrict-plus-operands
-		const changeAToB = createSimpleTransformer((c) => (c === CharacterCode.LowerA ? c + 1 : c));
-		const matcher = new NfaMatcher({
-			blacklistedTerms: [{ id: 1, pattern: pattern`ass` }],
-			whitelistedTerms: ['bss'],
-			whitelistMatcherTransformers: [changeAToB],
-		});
-		expect(matcher.getAllMatches('a big ass')).toHaveLength(0);
-	});
-
-	it('should not affect matching of blacklisted terms', () => {
-		// eslint-disable-next-line @typescript-eslint/restrict-plus-operands
-		const ignoreAllAs = createSimpleTransformer((c) => (c === CharacterCode.LowerA ? c + 1 : c));
-		const matcher = new NfaMatcher({
-			blacklistedTerms: [{ id: 1, pattern: pattern`dader` }],
-			whitelistedTerms: ['a dader'],
-			whitelistMatcherTransformers: [ignoreAllAs],
-		});
-		expect(matcher.getAllMatches('there is a dader')).toStrictEqual([
-			{ termId: 1, startIndex: 11, endIndex: 15, matchLength: 5 },
-		]);
-	});
-});
-
-describe('NfaMatcher#getAllMatches()', () => {
-	describe('result match order', () => {
-		it('should be sorted if the sorted parameter is set to true', () => {
-			const matcher = new NfaMatcher({
-				blacklistedTerms: assignIncrementingIds([pattern`sup`, pattern`u?`, pattern`dude`]),
-			});
-			expect(matcher.getAllMatches('sup guys there are some dudes here', true)).toStrictEqual([
-				{ termId: 0, startIndex: 0, endIndex: 2, matchLength: 3 },
-				{ termId: 1, startIndex: 1, endIndex: 2, matchLength: 2 },
-				{ termId: 1, startIndex: 5, endIndex: 6, matchLength: 2 },
-				{ termId: 2, startIndex: 24, endIndex: 27, matchLength: 4 },
-				{ termId: 1, startIndex: 25, endIndex: 26, matchLength: 2 },
-			]);
-		});
-	});
-
-	it('should work when called several times in a row', () => {
-		const matcher = new NfaMatcher({
-			blacklistedTerms: assignIncrementingIds([pattern`foobar`, pattern`hello`]),
-			whitelistedTerms: ['the foobar'],
-		});
-		expect(matcher.getAllMatches('the foobar is quite foobar hello yo')).toBePermutationOf([
-			{ termId: 0, startIndex: 20, endIndex: 25, matchLength: 6 },
-			{ termId: 1, startIndex: 27, endIndex: 31, matchLength: 5 },
-		]);
-		expect(matcher.getAllMatches('the foobar is quite foobar hello yo')).toBePermutationOf([
-			{ termId: 0, startIndex: 20, endIndex: 25, matchLength: 6 },
-			{ termId: 1, startIndex: 27, endIndex: 31, matchLength: 5 },
-		]);
-		expect(matcher.getAllMatches('the foobar is quite foobar hello yo')).toBePermutationOf([
-			{ termId: 0, startIndex: 20, endIndex: 25, matchLength: 6 },
-			{ termId: 1, startIndex: 27, endIndex: 31, matchLength: 5 },
-		]);
-	});
-});
-
-describe('NfaMatcher#hasMatch()', () => {
-	it('should be true if there is a match', () => {
-		const matcher = new NfaMatcher({
-			blacklistedTerms: assignIncrementingIds([pattern`there`, pattern`yo there hi`]),
-		});
-		expect(matcher.hasMatch('the yo there has a yo there')).toBeTruthy();
-	});
-
-	it('should be falsy if there is no match', () => {
-		const matcher = new NfaMatcher({
-			blacklistedTerms: assignIncrementingIds([pattern`yo`]),
-		});
-		expect(matcher.hasMatch('no y-word here!')).toBeFalsy();
-	});
-
-	it('should not return true if a match with incorrect word boundaries is found', () => {
-		const matcher = new NfaMatcher({
-			blacklistedTerms: assignIncrementingIds([pattern`|xs`]),
-		});
-		expect(matcher.hasMatch('yoxs')).toBeFalsy();
-	});
-
-	it('should return true if there is a match for a pattern with a wildcard at the end', () => {
-		const matcher = new NfaMatcher({
-			blacklistedTerms: assignIncrementingIds([pattern`x?`]),
-		});
-		expect(matcher.hasMatch('my xo')).toBeTruthy();
-	});
-
-	it('should return true if there is a match for a pattern that only contains wildcards', () => {
-		const matcher = new NfaMatcher({
-			blacklistedTerms: assignIncrementingIds([pattern`??`]),
-		});
-		expect(matcher.hasMatch('xy')).toBeTruthy();
-	});
-
-	it('should return true if there is a match for a pattern that contains wildcards at the start (only 1 pattern)', () => {
-		const matcher = new NfaMatcher({
-			blacklistedTerms: assignIncrementingIds([pattern`?x`]),
-		});
-		expect(matcher.hasMatch('foo bar quux')).toBeTruthy();
-	});
-
-	it('should return true if there is a match for a pattern that contains wildcards at the start (two patterns)', () => {
-		const matcher = new NfaMatcher({
-			blacklistedTerms: assignIncrementingIds([pattern`?a`, pattern`?ba|`]),
-		});
-		expect(matcher.hasMatch('xbac')).toBeTruthy();
-	});
-
-	it('should work when called several times in a row', () => {
-		const matcher = new NfaMatcher({
-			blacklistedTerms: assignIncrementingIds([pattern`yo there`]),
-			whitelistedTerms: ['the yo there'],
-		});
-		expect(matcher.hasMatch('the yo there has a yo there')).toBeTruthy();
-		expect(matcher.hasMatch('the yo there has a yo there')).toBeTruthy();
-		expect(matcher.hasMatch('the yo there has a yo there')).toBeTruthy();
-	});
-});
diff --git a/test/matcher/nfa/WhitelistedTermMatcher.fuzz.test.ts b/test/matcher/nfa/WhitelistedTermMatcher.fuzz.test.ts
deleted file mode 100644
index 011b746..0000000
--- a/test/matcher/nfa/WhitelistedTermMatcher.fuzz.test.ts
+++ /dev/null
@@ -1,52 +0,0 @@
-import * as fc from 'fast-check';
-import { WhitelistedTermMatcher } from '../../../src/matcher/nfa/WhitelistedTermMatcher';
-import type { Interval } from '../../../src/util/Interval';
-
-test('running the whitelist matcher with a set of terms and input should have the same result as running the brute-force string searching algorithm on it', () => {
-	fc.assert(
-		fc.property(
-			fc.unicodeString().chain((input) => {
-				// Generate patterns that are substrings of the input.
-				const arbitrarySubstringPatterns =
-					input.length < 2
-						? fc.constant([])
-						: fc.array(
-								fc
-									.tuple(fc.integer({ min: 0, max: input.length - 1 }), fc.integer({ min: 0, max: input.length - 1 }))
-									.map(([a, b]) => {
-										if (a > b) return input.slice(b, a);
-										return input.slice(a, b);
-									})
-									.filter((p) => p.length > 0),
-						  );
-				return fc.tuple(
-					fc.constant(input),
-					fc.array(fc.unicodeString().filter((p) => p.length > 0)),
-					arbitrarySubstringPatterns,
-				);
-			}),
-			([input, randomPatterns, substrPatterns]) => {
-				// Deduplicate the patterns.
-				const set = new Set<string>();
-				for (const pattern of randomPatterns) set.add(pattern);
-				for (const pattern of substrPatterns) set.add(pattern);
-				const allPatterns = [...set];
-				const matcher = new WhitelistedTermMatcher({ terms: allPatterns });
-				expect([...matcher.getMatches(input)]).toBePermutationOf(bruteForceMatch(allPatterns, input));
-			},
-		),
-	);
-});
-
-function bruteForceMatch(patterns: string[], input: string) {
-	const result: Interval[] = [];
-	for (let i = 0; i < input.length; i++) {
-		for (const pattern of patterns) {
-			if (input.startsWith(pattern, i)) {
-				result.push([i, i + pattern.length - 1]);
-			}
-		}
-	}
-
-	return result;
-}
diff --git a/test/matcher/nfa/WhitelistedTermMatcher.test.ts b/test/matcher/nfa/WhitelistedTermMatcher.test.ts
deleted file mode 100644
index 0453d4f..0000000
--- a/test/matcher/nfa/WhitelistedTermMatcher.test.ts
+++ /dev/null
@@ -1,98 +0,0 @@
-import { WhitelistedTermMatcher } from '../../../src/matcher/nfa/WhitelistedTermMatcher';
-import { createSimpleTransformer } from '../../../src/transformer/Transformers';
-import { CharacterCode } from '../../../src/util/Char';
-
-describe('constructor', () => {
-	it('should not allow empty terms', () => {
-		expect(() => new WhitelistedTermMatcher({ terms: [''] })).toThrow(new Error('Unexpected empty whitelisted term.'));
-	});
-});
-
-describe('WhitelistedTermMatcher#getMatches()', () => {
-	it('should return an empty interval collection if there are no terms', () => {
-		const matches = new WhitelistedTermMatcher({ terms: [] }).getMatches('hello world');
-		expect([...matches]).toHaveLength(0);
-	});
-
-	it.each([
-		['should match a term at the start of the string', ['hello'], 'hello world', [[0, 4]]],
-		['should match a term at the end of the string', ['world'], 'hello world', [[6, 10]]],
-		['should be case sensitive (no match)', ['WORLD'], 'hello world', []],
-		['should be case sensitive (with match)', ['yO'], 'hello yO yo', [[6, 7]]],
-		['should support spaces in terms', ['hello W0rld'], 'hello world! hello W0rld!', [[13, 23]]],
-		['should support surrogate pairs', ['cool 🌉'], 'cool cool cool cool 🌉', [[15, 21]]],
-		[
-			'should work with terms that are suffixes of other ones',
-			['cool', 'cool beans'],
-			'cool cool beans',
-			[
-				[0, 3],
-				[5, 8],
-				[5, 14],
-			],
-		],
-		[
-			'should work with terms that are suffixes of other ones, test 2',
-			['he', 'she', 'his', 'her', 'here'],
-			'he waited for she and her mom to go there',
-			[
-				[0, 1],
-				[14, 16],
-				[15, 16],
-				[22, 23],
-				[22, 24],
-				[37, 40],
-				[37, 38],
-				[37, 39],
-			],
-		],
-		['should only match on the term exactly', ['her'], 'h he! her', [[6, 8]]],
-		[
-			'should work with very long terms',
-			['Pneumonoultramicroscopicsilicovolcanoconiosis', 'horrible'],
-			'wow this word is quite long: Pneumonoultramicroscopicsilicovolcanoconiosie <- did you notice there was a typo there? horrible of me to do that... Pneumonoultramicroscopicsilicovolcanoconiosis',
-			[
-				[117, 124],
-				[146, 190],
-			],
-		],
-		[
-			'should match several similar terms',
-			['thing', 'thang'],
-			'im just doin my thign thing ok thang',
-			[
-				[22, 26],
-				[31, 35],
-			],
-		],
-		['should work with terms that normalize to a different string', ['豈'], '豈', [[0, 0]]],
-		['should handle null characters correctly', ['\u0000'], '\u0000', [[0, 0]]],
-	])('%s', (_, terms, input, expected) => {
-		const matches = new WhitelistedTermMatcher({ terms }).getMatches(input);
-		expect([...matches]).toBePermutationOf(expected);
-	});
-
-	describe('transformers', () => {
-		it('should work with transformers that skip chars', () => {
-			const skipA = createSimpleTransformer((c) => (c === CharacterCode.LowerA ? undefined : c));
-			const matches = new WhitelistedTermMatcher({ terms: ['intriguing'], transformers: [skipA] }).getMatches(
-				'hello world! inatrigauainagfoo bar.',
-			);
-			expect([...matches]).toBePermutationOf([[13, 26]]);
-		});
-
-		it('should work with transformers that change chars (no match)', () => {
-			// eslint-disable-next-line @typescript-eslint/naming-convention, @typescript-eslint/restrict-plus-operands
-			const changeAToB = createSimpleTransformer((c) => (c === CharacterCode.LowerA ? CharacterCode.LowerA + 1 : c));
-			const matches = new WhitelistedTermMatcher({ terms: ['hallo'], transformers: [changeAToB] }).getMatches('hallo');
-			expect([...matches]).toHaveLength(0);
-		});
-
-		it('should work with transformers that change chars (with match)', () => {
-			// eslint-disable-next-line @typescript-eslint/naming-convention, @typescript-eslint/restrict-plus-operands
-			const changeAToB = createSimpleTransformer((c) => (c === CharacterCode.LowerA ? CharacterCode.LowerA + 1 : c));
-			const matches = new WhitelistedTermMatcher({ terms: ['hbllo'], transformers: [changeAToB] }).getMatches('hallo');
-			expect([...matches]).toBePermutationOf([[0, 4]]);
-		});
-	});
-});
diff --git a/test/matcher/nfa/trie/BlacklistTrieNode.test.ts b/test/matcher/nfa/trie/BlacklistTrieNode.test.ts
deleted file mode 100644
index 82e6a5d..0000000
--- a/test/matcher/nfa/trie/BlacklistTrieNode.test.ts
+++ /dev/null
@@ -1,21 +0,0 @@
-import { BlacklistTrieNode, hashPartialMatch } from '../../../../src/matcher/nfa/trie/BlacklistTrieNode';
-
-describe('constructor', () => {
-	it('should set edges to an empty edge list', () => {
-		expect(new BlacklistTrieNode().edges.size).toBe(0);
-	});
-
-	it('should set term id to -1', () => {
-		expect(new BlacklistTrieNode().termId).toBe(-1);
-	});
-
-	it('should set flags to 0', () => {
-		expect(new BlacklistTrieNode().flags).toBe(0);
-	});
-});
-
-describe('hashPartialMatch()', () => {
-	it('should return a string in the format step-termId', () => {
-		expect(hashPartialMatch(0, 5)).toBe('0-5');
-	});
-});
diff --git a/test/matcher/nfa/trie/WhitelistTrieNode.test.ts b/test/matcher/nfa/trie/WhitelistTrieNode.test.ts
deleted file mode 100644
index e8617a7..0000000
--- a/test/matcher/nfa/trie/WhitelistTrieNode.test.ts
+++ /dev/null
@@ -1,15 +0,0 @@
-import { WhitelistTrieNode } from '../../../../src/matcher/nfa/trie/WhitelistTrieNode';
-
-describe('constructor', () => {
-	it('should set edges to an empty edge list', () => {
-		expect(new WhitelistTrieNode().edges.size).toBe(0);
-	});
-
-	it('should set termId to -1', () => {
-		expect(new WhitelistTrieNode().termId).toBe(-1);
-	});
-
-	it('should set isOutputNode to false', () => {
-		expect(new WhitelistTrieNode().isOutputNode).toBeFalsy();
-	});
-});
diff --git a/test/matcher/nfa/trie/edge/ArrayEdgeCollection.test.ts b/test/matcher/nfa/trie/edge/ArrayEdgeCollection.test.ts
deleted file mode 100644
index b7a51bf..0000000
--- a/test/matcher/nfa/trie/edge/ArrayEdgeCollection.test.ts
+++ /dev/null
@@ -1,113 +0,0 @@
-import { ArrayEdgeCollection } from '../../../../../src/matcher/nfa/trie/edge/ArrayEdgeCollection';
-import type { Edge } from '../../../../../src/matcher/nfa/trie/edge/EdgeCollection';
-
-let coll: ArrayEdgeCollection<string>;
-
-beforeEach(() => {
-	coll = new ArrayEdgeCollection<string>();
-});
-
-describe('ArrayEdgeCollection#set()', () => {
-	it('should add the edge to the collection', () => {
-		coll.set(5, 'a');
-		expect([...coll]).toBePermutationOf([[5, 'a']]);
-	});
-
-	it('should overwrite an existing edge if possible', () => {
-		coll.set(1, 'y');
-		// eslint-disable-next-line sonarjs/no-element-overwrite
-		coll.set(1, 'z');
-		expect([...coll]).toBePermutationOf([[1, 'z']]);
-	});
-
-	it('should increment the size only if no existing edge was found', () => {
-		coll.set(7, 'y');
-		expect(coll.size).toBe(1);
-		coll.set(7, 'z');
-		expect(coll.size).toBe(1);
-	});
-});
-
-const edges: Edge<string>[] = [
-	[1, 'e'],
-	[4, 'o'],
-	[3, 'i'],
-	[7, 'z'],
-	[0, 'w'],
-	[10, 'b'],
-	[43, 'c'],
-	[57, 'v'],
-	[19, 'f'],
-];
-
-describe('ArrayEdgeCollection#get()', () => {
-	it('should return the node corresponding to the edge (<= 10 values in the collection)', () => {
-		coll.set(7, 'd');
-		coll.set(9, 'z');
-		coll.set(5, 'c');
-		coll.set(10, 'e');
-		expect(coll.get(5)).toBe('c');
-		expect(coll.get(7)).toBe('d');
-		expect(coll.get(9)).toBe('z');
-		expect(coll.get(10)).toBe('e');
-	});
-
-	it('should return the node corresponding to the edge (> 10 values in the collection)', () => {
-		for (const edge of edges) coll.set(...edge);
-		expect(coll.get(1)).toBe('e');
-		expect(coll.get(57)).toBe('v');
-		expect(coll.get(43)).toBe('c');
-		expect(coll.get(0)).toBe('w');
-		expect(coll.get(10)).toBe('b');
-	});
-
-	it('should return undefined if there is no node corresponding to the edge (<= 3 values in the collection)', () => {
-		coll.set(5, 'x');
-		coll.set(7, 'd');
-		expect(coll.get(0)).toBeUndefined();
-		expect(coll.get(6)).toBeUndefined();
-		expect(coll.get(494)).toBeUndefined();
-	});
-
-	it('should return undefined if there is no node corresponding to the edge (> 3 values in the collection)', () => {
-		for (const edge of edges) coll.set(...edge);
-		expect(coll.get(12)).toBeUndefined();
-		expect(coll.get(15)).toBeUndefined();
-		expect(coll.get(-1)).toBeUndefined();
-		expect(coll.get(58)).toBeUndefined();
-		expect(coll.get(554)).toBeUndefined();
-	});
-});
-
-describe('ArrayEdgeCollection#keys()', () => {
-	it('should return an iterator over the keys of the collection', () => {
-		coll.set(5, 'd');
-		coll.set(8, 'e');
-		coll.set(3, 'x');
-		coll.set(10, 'e');
-		expect([...coll.keys()]).toBePermutationOf([5, 8, 3, 10]);
-	});
-});
-
-describe('ArrayEdgeCollection#values()', () => {
-	it('should return an iterator over the values of the collection', () => {
-		coll.set(19, 'd');
-		coll.set(9, 'd');
-		coll.set(15, 'e');
-		coll.set(13, 'e');
-		expect([...coll.values()]).toBePermutationOf(['d', 'd', 'e', 'e']);
-	});
-});
-
-it('should be iterable', () => {
-	coll.set(12, 'j');
-	coll.set(43, 'e');
-	coll.set(17, 'p');
-	coll.set(59, 'e');
-	expect([...coll]).toBePermutationOf([
-		[12, 'j'],
-		[43, 'e'],
-		[17, 'p'],
-		[59, 'e'],
-	]);
-});
diff --git a/test/matcher/nfa/trie/edge/BucketEdgeCollection.test.ts b/test/matcher/nfa/trie/edge/BucketEdgeCollection.test.ts
deleted file mode 100644
index c159f4d..0000000
--- a/test/matcher/nfa/trie/edge/BucketEdgeCollection.test.ts
+++ /dev/null
@@ -1,85 +0,0 @@
-import { BucketEdgeCollection } from '../../../../../src/matcher/nfa/trie/edge/BucketEdgeCollection';
-
-const getCode = (c: string) => c.charCodeAt(0);
-
-let coll: BucketEdgeCollection<string>;
-
-beforeEach(() => {
-	coll = new BucketEdgeCollection<string>();
-});
-
-describe('BucketEdgeCollection#set()', () => {
-	it('should add the edge to the collection', () => {
-		coll.set(getCode('a'), 'x');
-		expect([...coll]).toBePermutationOf([[getCode('a'), 'x']]);
-	});
-
-	it('should increment the size only if no existing edge was found', () => {
-		coll.set(getCode('d'), 'a');
-		expect(coll.size).toBe(1);
-		coll.set(getCode('d'), 'c');
-		expect(coll.size).toBe(1);
-	});
-});
-
-describe('BucketEdgeCollection#get()', () => {
-	it('should return the node corresponding to the character', () => {
-		coll.set(getCode('c'), 'd');
-		coll.set(getCode('e'), 'y');
-		coll.set(getCode('z'), 'e');
-		expect(coll.get(getCode('c'))).toBe('d');
-		expect(coll.get(getCode('e'))).toBe('y');
-		expect(coll.get(getCode('z'))).toBe('e');
-	});
-
-	it('should return undefined if no node exists', () => {
-		coll.set(getCode('e'), 'd');
-		coll.set(getCode('z'), 'e');
-		coll.set(getCode('y'), 'z');
-		coll.set(getCode('p'), 'y');
-		expect(coll.get(getCode('a'))).toBeUndefined();
-		expect(coll.get(getCode('w'))).toBeUndefined();
-		expect(coll.get(getCode('f'))).toBeUndefined();
-	});
-
-	it('should return undefined if the key was not a lowercase letter', () => {
-		expect(coll.get(-1)).toBeUndefined();
-		expect(coll.get(0)).toBeUndefined();
-		expect(coll.get(getCode('z') + 1)).toBeUndefined();
-		expect(coll.get(getCode('a') - 1)).toBeUndefined();
-		expect(coll.get(594)).toBeUndefined();
-	});
-});
-
-describe('BucketEdgeCollection#keys()', () => {
-	it('should return an iterator over the keys', () => {
-		coll.set(getCode('z'), 'd');
-		coll.set(getCode('u'), 'e');
-		coll.set(getCode('n'), 'p');
-		coll.set(getCode('m'), 'r');
-		expect([...coll.keys()]).toBePermutationOf(['z', 'u', 'n', 'm'].map(getCode));
-	});
-});
-
-describe('BucketEdgeCollection#values()', () => {
-	it('should return an iterator over the values', () => {
-		coll.set(getCode('l'), 'e');
-		coll.set(getCode('h'), 'e');
-		coll.set(getCode('a'), 'p');
-		coll.set(getCode('i'), 'v');
-		expect([...coll.values()]).toBePermutationOf(['e', 'e', 'p', 'v']);
-	});
-});
-
-it('should be iterable', () => {
-	coll.set(getCode('r'), 'j');
-	coll.set(getCode('l'), 'e');
-	coll.set(getCode('n'), 'p');
-	coll.set(getCode('s'), 'e');
-	expect([...coll]).toBePermutationOf([
-		[getCode('r'), 'j'],
-		[getCode('l'), 'e'],
-		[getCode('n'), 'p'],
-		[getCode('s'), 'e'],
-	]);
-});
diff --git a/test/matcher/nfa/trie/edge/ForwardingEdgeCollection.test.ts b/test/matcher/nfa/trie/edge/ForwardingEdgeCollection.test.ts
deleted file mode 100644
index 796465e..0000000
--- a/test/matcher/nfa/trie/edge/ForwardingEdgeCollection.test.ts
+++ /dev/null
@@ -1,117 +0,0 @@
-import { ArrayEdgeCollection } from '../../../../../src/matcher/nfa/trie/edge/ArrayEdgeCollection';
-import { BucketEdgeCollection } from '../../../../../src/matcher/nfa/trie/edge/BucketEdgeCollection';
-import type { Edge } from '../../../../../src/matcher/nfa/trie/edge/EdgeCollection';
-import { ForwardingEdgeCollection } from '../../../../../src/matcher/nfa/trie/edge/ForwardingEdgeCollection';
-import { CharacterCode } from '../../../../../src/util/Char';
-
-let coll: ForwardingEdgeCollection<number>;
-
-beforeEach(() => {
-	coll = new ForwardingEdgeCollection<number>();
-});
-
-afterEach(() => {
-	jest.restoreAllMocks();
-});
-
-describe('ForwardingEdgeCollection#set()', () => {
-	it('should use the array implementation by default', () => {
-		const spy = jest.spyOn(ArrayEdgeCollection.prototype, 'set');
-		coll.set(5, 7);
-		expect(coll.underlyingImplementation).toBeInstanceOf(ArrayEdgeCollection);
-		expect(spy).toHaveBeenCalledTimes(1);
-		expect(spy).toHaveBeenLastCalledWith(5, 7);
-	});
-
-	it('should switch to the bucket implementation if the number of edges is > 10 and the keys are all lowercase', () => {
-		const spy = jest.spyOn(BucketEdgeCollection.prototype, 'set');
-		const edges = [...Array.from({ length: 11 }).keys()].map<Edge<number>>((i) => [i + CharacterCode.LowerA, i]);
-		for (const edge of edges) coll.set(...edge);
-		expect(coll.underlyingImplementation).toBeInstanceOf(BucketEdgeCollection);
-		expect(spy).toHaveBeenCalledTimes(11);
-		expect(spy.mock.calls).toBePermutationOf(edges);
-	});
-
-	it('should switch back to the array implementation if currently using the bucket implementation and an edge with a non-lowercase key is added', () => {
-		const bucketImplSpy = jest.spyOn(BucketEdgeCollection.prototype, 'set');
-		const arrayImplSpy = jest.spyOn(ArrayEdgeCollection.prototype, 'set');
-		const edges = [...Array.from({ length: 11 }).keys()].map<Edge<number>>((i) => [i + CharacterCode.LowerA, i]);
-		for (const edge of edges) coll.set(...edge);
-		coll.set(5, 19);
-		expect(bucketImplSpy).toHaveBeenCalledTimes(11);
-		expect(bucketImplSpy).not.toHaveBeenCalledWith(5, 19);
-		expect(coll.underlyingImplementation).toBeInstanceOf(ArrayEdgeCollection);
-		expect(arrayImplSpy).toHaveBeenCalledTimes(23);
-		expect(arrayImplSpy.mock.calls).toBePermutationOf([...edges, ...edges, [5, 19]]);
-	});
-
-	it('should use the map implementation if the number of edges is > 35', () => {
-		const spy = jest.spyOn(Map.prototype, 'set');
-		const edges = [...Array.from({ length: 36 }).keys()].map<Edge<number>>((i) => [i, i + 5]);
-		for (const edge of edges) coll.set(...edge);
-		expect(coll.underlyingImplementation).toBeInstanceOf(Map);
-		expect(spy).toHaveBeenCalledTimes(36);
-		expect(spy.mock.calls).toBePermutationOf(edges);
-	});
-});
-
-function getEdgeCollWithArrayImpl() {
-	const coll = new ForwardingEdgeCollection<number>();
-	coll.set(5, 19);
-	return coll;
-}
-
-function getEdgeCollWithBucketImpl() {
-	const coll = new ForwardingEdgeCollection<number>();
-	for (let i = 0; i < 11; i++) coll.set(i + CharacterCode.LowerA, i);
-	return coll;
-}
-
-function getEdgeCollWithMapImpl() {
-	const coll = new ForwardingEdgeCollection<number>();
-	for (let i = 0; i < 36; i++) coll.set(i, i + 5);
-	return coll;
-}
-
-describe('ForwardingEdgeCollection#get()', () => {
-	it('should forward the call to the array implementation if that is the underlying implementation', () => {
-		const spy = jest.spyOn(ArrayEdgeCollection.prototype, 'get');
-		getEdgeCollWithArrayImpl().get(5);
-		expect(spy).toHaveBeenCalledTimes(1);
-		expect(spy).toHaveBeenLastCalledWith(5);
-	});
-
-	it('should forward the call to the bucket implementation if that is the underlying implementation', () => {
-		const spy = jest.spyOn(BucketEdgeCollection.prototype, 'get');
-		getEdgeCollWithBucketImpl().get(95);
-		expect(spy).toHaveBeenCalledTimes(1);
-		expect(spy).toHaveBeenLastCalledWith(95);
-	});
-
-	it('should forward the call to the map implementation if that is the underlying implementation', () => {
-		const spy = jest.spyOn(Map.prototype, 'get');
-		getEdgeCollWithMapImpl().get(39);
-		expect(spy).toHaveBeenCalledTimes(1);
-		expect(spy).toHaveBeenLastCalledWith(39);
-	});
-});
-
-describe.each<'keys' | 'values'>(['keys', 'values'])('ForwardingEdgeCollection#%s()', (method) => {
-	it('should forward the call to the array implementation if that is the underlying implementation', () => {
-		const spy = jest.spyOn(ArrayEdgeCollection.prototype, method);
-		getEdgeCollWithArrayImpl()[method]();
-		expect(spy).toHaveBeenCalledTimes(1);
-	});
-
-	it('should forward the call to the bucket implementation if that is the underlying implementation', () => {
-		const spy = jest.spyOn(BucketEdgeCollection.prototype, method);
-		getEdgeCollWithBucketImpl()[method]();
-		expect(spy).toHaveBeenCalledTimes(1);
-	});
-
-	it('should forward the call to the map implementation if that is the underlying implementation', () => {
-		const spy = jest.spyOn(Map.prototype, method);
-		getEdgeCollWithMapImpl()[method]();
-		expect(spy).toHaveBeenCalledTimes(1);
-	});
-});
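
With the `NfaMatcher` and its trie/edge-collection machinery deleted above, the `RegExpMatcher` is the only remaining `Matcher` implementation. As a minimal sketch of where this coverage now points (this assumes `RegExpMatcher` accepts the same `blacklistedTerms`/`whitelistedTerms`/transformer options exercised by the deleted tests, and uses the `englishDataset` preset referenced elsewhere in this patch):

```ts
import { RegExpMatcher, englishDataset, englishRecommendedTransformers } from 'obscenity';

// Build a matcher from the bundled English dataset plus the recommended
// transformer presets; the two spreads supply the blacklist/whitelist options.
const matcher = new RegExpMatcher({
	...englishDataset.build(),
	...englishRecommendedTransformers,
});

matcher.hasMatch('some input text'); // boolean, as in the hasMatch() tests above
matcher.getAllMatches('some input text', true); // MatchPayload[], sorted when `true` is passed
```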