tokenizer

A method for writing very fast tokenizers/lexers with re2c. A set of regex rules and their associated token types is compiled into an optimized finite state automaton (FSA) in C.
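
To make the idea concrete, here is a hand-written sketch of the kind of automaton such a reduction produces. It is not re2c output and not code from this repo: the token types, `token_t`, and `next_token` are hypothetical names, and the ASCII-only rules (WORD = `[a-zA-Z]+`, NUMBER = `[0-9]+`, SPACE = `[ \t\r\n]+`, OTHER = any other single byte) are assumptions chosen only for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical token types; a real tokenizer would define its own set. */
typedef enum { TOK_END, TOK_WORD, TOK_NUMBER, TOK_SPACE, TOK_OTHER } token_type_t;

typedef struct {
    token_type_t type;
    size_t offset; /* byte offset of the token in the input */
    size_t len;    /* token length in bytes */
} token_t;

/* Automaton states: one start state plus one state per repeating rule. */
typedef enum { S_START, S_WORD, S_NUMBER, S_SPACE } state_t;

static int is_alpha(unsigned char c) {
    return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z');
}
static int is_digit(unsigned char c) { return c >= '0' && c <= '9'; }
static int is_space(unsigned char c) {
    return c == ' ' || c == '\t' || c == '\n' || c == '\r';
}

/* Scan one token from the NUL-terminated string s, starting at *pos,
 * and advance *pos past it. Returns TOK_END at the end of the input. */
token_t next_token(const char *s, size_t *pos) {
    assert(s != NULL && pos != NULL);
    size_t start = *pos, i = start;
    state_t state = S_START;
    token_t tok = { TOK_END, start, 0 };

    for (;;) {
        unsigned char c = (unsigned char)s[i];
        switch (state) {
        case S_START:
            if (c == '\0') return tok; /* end of input, nothing consumed */
            i++;
            if (is_alpha(c))      { state = S_WORD;   tok.type = TOK_WORD; }
            else if (is_digit(c)) { state = S_NUMBER; tok.type = TOK_NUMBER; }
            else if (is_space(c)) { state = S_SPACE;  tok.type = TOK_SPACE; }
            else { tok.type = TOK_OTHER; goto done; } /* single-byte token */
            break;
        case S_WORD:
            if (!is_alpha(c)) goto done; /* longest match reached */
            i++;
            break;
        case S_NUMBER:
            if (!is_digit(c)) goto done;
            i++;
            break;
        case S_SPACE:
            if (!is_space(c)) goto done;
            i++;
            break;
        }
    }
done:
    tok.len = i - start;
    *pos = i;
    return tok;
}
```

Calling `next_token` repeatedly on `"ab 12!"` with `pos` starting at 0 yields WORD, SPACE, NUMBER, OTHER, and then `TOK_END`. Each input byte is inspected at most twice (once to end a token, once to start the next), which is where a generated automaton gets its speed; re2c additionally handles rule priorities and much larger character classes than this sketch does.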

The rules can be derived from data or written by hand. Unicode categories are provided and kept up to date.

Since there is no one-size-fits-all tokenizer, this repo is a copier template: use it to start a new repo for each tokenizer. The generated project is written in C using clib, with an optional Python binding.
