Tokenize source code into integer vectors, symbols, or discrete tokens.
The following languages are currently supported.
- C
- C#
- C++
- Go
- Java
- JavaScript
- PHP
- Python
- Rust
- TypeScript
cd src
make
Ensure CppUnit is installed.
Depending on your environment, you may also need to pass its installation
directory prefixes to make as command-line arguments.
For example, under macOS pass
ADDCXXFLAGS='-I /opt/homebrew/include' ADDLDFLAGS='-L /opt/homebrew/lib'
as arguments to make.
cd src
make test
cd src
sudo make install
tokenizer file.c
tokenizer -l Java -o statement <file.java
$ curl -s https://raw.githubusercontent.com/leachim6/hello-world/master/c/c.c | tokenizer -l C
35 320 60 2000 46 2001 62 322 2002 40 41 123 2003 40 625 41 59 327 1500 59 125
$ curl -s https://raw.githubusercontent.com/leachim6/hello-world/master/c/c.c | tokenizer -l C -s
# include < ID:2000 . ID:2001 > int ID:2002 ( ) { ID:2003 ( STRING_LITERAL
) ; return 0 ; }
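The integer output above appears to follow a simple scheme: single-character punctuators are emitted as their ASCII codes (`#` is 35, `(` is 40), and identifiers are numbered from 2000 in order of first appearance. The following C++ sketch reproduces that scheme for this one sample. It is inferred from the output shown here, not taken from the tokenizer's source. The function name `encode` is hypothetical, and the keyword, string-literal, and number codes are copied from the sample output rather than from the tool's own tables.

```cpp
#include <cctype>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Codes copied from the sample output above; the real tables are larger.
static const std::map<std::string, int> kSampleCodes = {
    {"include", 320}, {"int", 322}, {"return", 327},
    {"STRING_LITERAL", 625},
    {"0", 1500},  // the code the sample shows for the literal 0
};

// Hypothetical helper: encode already-split tokens as integers.
std::vector<int> encode(const std::vector<std::string>& tokens) {
    std::map<std::string, int> ids;  // identifier -> assigned number
    int next_id = 2000;              // first identifier number seen in the sample
    std::vector<int> out;
    for (const std::string& t : tokens) {
        auto known = kSampleCodes.find(t);
        if (known != kSampleCodes.end()) {
            out.push_back(known->second);
        } else if (t.size() == 1 &&
                   std::ispunct(static_cast<unsigned char>(t[0]))) {
            // Single-character punctuator: use its ASCII code.
            out.push_back(static_cast<unsigned char>(t[0]));
        } else {
            // Anything else is treated as an identifier, numbered on first use.
            auto it = ids.find(t);
            if (it == ids.end())
                it = ids.insert(std::make_pair(t, next_id++)).first;
            out.push_back(it->second);
        }
    }
    return out;
}
```

Assuming the sample's identifiers are `stdio`, `h`, `main`, and `printf`, passing its token list to `encode` yields the same integer sequence as the first command above.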
$ curl -s https://raw.githubusercontent.com/leachim6/hello-world/master/c/csharp.cs | tokenizer -l "C#"
312 2000 123 360 376 2001 40 41 123 2002 46 2003 46 2004 40 627 41 59 125 125
$ curl -s https://raw.githubusercontent.com/leachim6/hello-world/master/c/csharp.cs | tokenizer -l "C#" -s
class ID:2000 { static void ID:2001 ( ) { ID:2002 . ID:2003 . ID:2004
( STRING_LITERAL ) ; } }
$ curl -s https://raw.githubusercontent.com/leachim6/hello-world/master/c/csharp.cs | tokenizer -l "C#" -o method
123 2002 46 2003 46 2004 40 627 41 59 125
$ curl -s https://raw.githubusercontent.com/leachim6/hello-world/master/c/c%2B%2B.cpp | tokenizer -l C++ -s
# include < ID:2000 > LINE_COMMENT using namespace ID:2001 ; int ID:2002
( ) LINE_COMMENT { ID:2003 LSHIFT STRING_LITERAL LSHIFT ID:2004 ;
LINE_COMMENT return 0 ; LINE_COMMENT }
$ curl -s https://raw.githubusercontent.com/leachim6/hello-world/master/j/Java.java | tokenizer -l Java -s
public class ID:2000 { public static void ID:2001 ( ID:2002 [ ] ID:2003 )
{ ID:2004 . ID:2005 . ID:2006 ( STRING_LITERAL ) ; } }
$ curl -s https://raw.githubusercontent.com/leachim6/hello-world/master/c/c%2B%2B.cpp | tokenizer -l C++ -c
#
include
<
iostream
>
// ...
using
namespace
std
;
int
main
(
)
// ...
{
cout
<<
"..."
<<
endl
;
// ...
return
0
;
// ...
}
Produce a token-by-token difference between the current version of the
file tokenizer.cpp and the one in version v1.1.
diff <(git show v1.1:./tokenizer.cpp | tokenizer -l C++ -b) \
<(tokenizer -l C++ -b tokenizer.cpp)
List Type 2 (near) clones in the tokenizer source code.
tokenizer -l C++ -c -f -o line *.cpp *.h | mpcd
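Type 2 clones are code fragments that are identical except for the names of identifiers (and, in most definitions, literal values). Abstracting identifiers in the token stream makes such fragments compare equal. The C++ sketch below illustrates the idea on pre-split tokens. It does not reproduce the behavior of the tokenizer or of mpcd; the names `is_identifier`, `normalize`, and `is_type2_clone` are hypothetical, and it only matches fragments whose identifiers are renamed consistently.

```cpp
#include <cctype>
#include <map>
#include <set>
#include <string>
#include <utility>
#include <vector>

// Tiny keyword set for the illustration; a real tokenizer has a full list.
static const std::set<std::string> kKeywords = {
    "for", "if", "int", "return", "while"};

// Hypothetical test: starts with a letter or '_' and is not a keyword.
bool is_identifier(const std::string& t) {
    if (t.empty()) return false;
    unsigned char c = static_cast<unsigned char>(t[0]);
    return (std::isalpha(c) || c == '_') && kKeywords.count(t) == 0;
}

// Replace each identifier with ID:n, numbered by first appearance.
std::vector<std::string> normalize(const std::vector<std::string>& tokens) {
    std::map<std::string, int> ids;
    std::vector<std::string> out;
    for (const std::string& t : tokens) {
        if (!is_identifier(t)) {
            out.push_back(t);
            continue;
        }
        auto it = ids.find(t);
        if (it == ids.end()) {
            int n = static_cast<int>(ids.size());
            it = ids.insert(std::make_pair(t, n)).first;
        }
        out.push_back("ID:" + std::to_string(it->second));
    }
    return out;
}

// Two fragments match when their normalized token sequences are equal.
bool is_type2_clone(const std::vector<std::string>& a,
                    const std::vector<std::string>& b) {
    return normalize(a) == normalize(b);
}
```

With this sketch, `int x = x + y ;` and `int total = total + delta ;` normalize to the same sequence, while `int x = y + x ;` does not.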
You can read the command's Unix manual page through this link.
Version 2.0 of the tokenizer, released in 2023, has a simpler and more orthogonal command-line interface. To convert code written for the old interface, you can read the Unix manual page of the original v1.1 version through this link.
To support a new language, proceed as follows.
- Open an issue with the language name and a pointer to its lexical structure definition.
- Add a comment indicating that you're working on it.
- List the language's keywords in a file named `language-keyword.txt`. Keep them in alphabetical order. If the language supports a C-like preprocessor, add those keywords as well.
- Copy the source code files of the existing language that most resembles the new language to create the new language files: `languageTokenizer.cpp`, `languageTokenizer.h`, `languageTokenizerTest.h`.
- In the copied files, rename all instances (uppercase, lowercase, CamelCase) of the existing language name to the new language name.
- Create a list of the new language's operators and punctuators, and methodically go through the `languageTokenizer.cpp` `switch` statements to ensure that these are handled correctly. When code is missing or different, base the new code on an existing pattern. Keep token names that serve the same semantic purpose identical across languages. If you need a new token name, just write `Token:MY_NAME` and it will be defined automatically.
- Add code to handle the language's comments.
- Adjust, if needed, the handling of constants and literals. Note that for the sake of simplicity and efficiency, the tokenizer can assume that its input is correct.
- To implement features that aren't handled in the language whose tokenizer implementation you copied, look at the implementation of other language tokenizers that have these features.
- If you need to reuse a method from another language, move it to `TokenizerBase`.
- Add the object file `languageTokenizer.o` to the `OBJ` list of file names in the `Makefile`.
- Add unit tests for any new or modified features you implemented.
- Update the file `UnitTests.cpp` to include the unit test header file, and call `addTest` with the unit test suite.
- Update the method `process_file` in `tokenizer.cpp` to call the tokenizer you implemented, and add the language's name to the list of supported languages.
- Ensure the language is correctly tokenized, both by running the tokenizer and by running the unit tests with `make test`.
- Update the manual page `tokenizer.1` and this `README.md` file.
- Increment the middle (minor) number of the semantic version string in `tokenizer.cpp`.
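
The operator and comment steps above usually come down to a `switch` on the current character with one character of lookahead. The following self-contained C++ sketch shows that pattern for `+`, `++`, `+=`, `/`, `/=`, and C-style comments. It is not taken from the tokenizer's source: the functions `next_token` and `token_names`, the use of strings as token names, and all names other than `LINE_COMMENT` are assumptions made for illustration. As the steps above allow, it assumes its input is correct.

```cpp
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical sketch: return the name of the next operator or comment
// token, "OTHER" for any other character, or "EOF" at end of input.
std::string next_token(std::istream& in) {
    const int eof = std::istream::traits_type::eof();
    int c = in.get();
    while (c == ' ' || c == '\t' || c == '\n')  // skip whitespace
        c = in.get();
    if (c == eof)
        return "EOF";
    switch (c) {
    case '+':
        if (in.peek() == '+') { in.get(); return "PLUS_PLUS"; }
        if (in.peek() == '=') { in.get(); return "PLUS_EQUAL"; }
        return "PLUS";
    case '/':
        if (in.peek() == '=') { in.get(); return "DIV_EQUAL"; }
        if (in.peek() == '/') {  // line comment: skip to end of line
            while ((c = in.get()) != '\n' && c != eof) {}
            return "LINE_COMMENT";
        }
        if (in.peek() == '*') {  // block comment: skip to the closing */
            in.get();
            int prev = 0;
            while ((c = in.get()) != eof && !(prev == '*' && c == '/'))
                prev = c;
            return "BLOCK_COMMENT";
        }
        return "DIV";
    default:
        return "OTHER";
    }
}

// Collect the token names for a whole string.
std::vector<std::string> token_names(const std::string& text) {
    std::istringstream in(text);
    std::vector<std::string> names;
    for (std::string t = next_token(in); t != "EOF"; t = next_token(in))
        names.push_back(t);
    return names;
}
```

Checking the longest operator first (`++` and `+=` before `+`) keeps each `case` self-contained, which makes it easier to compare a new language's operator list against the code one character at a time.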