funycode - Unicode encoding for C symbol names

Like Punycode (its namesake and inspiration), funycode maps an input string in an extended alphabet to an output string in a more limited alphabet. The output alphabet used by funycode consists of all 7-bit ASCII characters that are valid in ANSI/ISO C symbol names:

0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz_

A funycode-encoded string consists of a prefix and a suffix, separated by an underscore (_); at least one of the two is always present. The prefix contains all direct-mapped characters of the input string (every character in the alphabet except the underscore). The suffix is a sequence of variable-length encoded step counts for the decoder's state machine (see the Punycode description for an explanation of how this works).

| Original | Encoded | Remarks |
|----------|---------|---------|
| `foo` | `foo` | Prefix only. |
| `føø` | `f_670` | Prefix and suffix. |
| `𝓯𝓸𝓸` | `cxr0I00_` | Suffix only. |
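
To make the notion of direct-mapped characters concrete, here is a minimal sketch that collects them from a UTF-8 input string. The function name `direct_mapped_prefix` is hypothetical and not part of funycode's API. The sketch covers only the prefix: it does not produce the suffix, does not apply compression, and ignores whatever rule funycode uses to keep the output from starting with a digit.

```c
#include <stddef.h>

/*
 * Hypothetical helper, not taken from funycode.c.
 * Copies the ASCII letters and digits of `in` to `out`, which has room
 * for `cap` bytes including the terminating NUL. Returns the number of
 * characters written, or (size_t)-1 if `out` is too small.
 */
size_t direct_mapped_prefix(const char *in, char *out, size_t cap)
{
    size_t n = 0;

    if (cap == 0)
        return (size_t)-1;

    for (; *in != '\0'; in++) {
        unsigned char c = (unsigned char)*in;
        int direct = (c >= '0' && c <= '9') ||
                     (c >= 'A' && c <= 'Z') ||
                     (c >= 'a' && c <= 'z');

        if (!direct)
            continue;           /* underscore, punctuation, UTF-8 bytes: skipped here */
        if (n + 1 >= cap)
            return (size_t)-1;  /* no room left for this character plus the NUL */
        out[n++] = (char)c;
    }
    out[n] = '\0';
    return n;
}
```

For short inputs where compression finds nothing to remove, the result should agree with the part before the underscore in the examples: `føø` gives `f`, and `foo_bar` gives `foobar`.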

The position of the separator and the encoding of the suffix are both chosen so that the result is always a valid, non-reserved C identifier; specifically, the output never starts with an underscore or a digit.
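
The guarantee above can be checked mechanically. The following sketch tests only the two properties stated in this document: every character belongs to the output alphabet, and the first character is neither an underscore nor a digit. The function name `looks_like_funycode_output` is hypothetical. Passing this check does not prove that a string decodes successfully.

```c
#include <stdbool.h>

/*
 * Hypothetical check, not taken from funycode.c.
 * Returns true if `s` is non-empty, uses only [0-9A-Za-z_], and does
 * not start with an underscore or a digit.
 */
bool looks_like_funycode_output(const char *s)
{
    if (s[0] == '\0' || s[0] == '_' || (s[0] >= '0' && s[0] <= '9'))
        return false;

    for (; *s != '\0'; s++) {
        unsigned char c = (unsigned char)*s;
        bool ok = (c >= '0' && c <= '9') ||
                  (c >= 'A' && c <= 'Z') ||
                  (c >= 'a' && c <= 'z') ||
                  c == '_';

        if (!ok)
            return false;       /* outside the output alphabet */
    }
    return true;
}
```

A check like this could serve as a cheap sanity test in a build script before passing generated names to a C compiler.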

Compression

Symbol names in modern programming languages typically contain a lot of redundancy, not only in the names of parameter types but also in the form of deeply nested namespaces. When these symbol names are encoded, their length tends to become unwieldy. Therefore, a simple compression algorithm is a mandatory part of funycode.

Consider the following 261-character symbol (from OpenBSD 7.1's /usr/lib/libc++.a):

std::__1::__fs::filesystem::__last_write_time(std::__1::__fs::filesystem::path const&, std::__1::chrono::time_point<std::__1::__fs::filesystem::_FilesystemClock, std::__1::chrono::duration<__int128, std::__1::ratio<1ll, 1000000000ll> > >, std::__1::error_code*)

Without compression, this would translate to the following 278-character funycode string:

std1fsfilesystemlastwritetimestd1fsfilesystempathconststd1chronotimepointstd1fsfilesystemFilesystemClockstd1chronodurationint128std1ratio1ll1000000000llstd1errorcode_n05oOCC00xw3M7pAf5vAp0PDFxsI01020A0H01020A0G01060C01020A0K01060J010T010xSc1Lvg11rrx1030G045A030Y0FB030GM0K0D0c08

With compression, however, the result is this large but manageable 176-character string:

std1fsfilesystemlastwritetimepathconstchronopointFClockdurationint1281ll10llerrorcode_X05Y40sjb4l6z2z7Zrr7030h0BzEF6rH11vwT0J5Q6G0Qqrwrzx0Wt1nuGEr4WrFln1ouLQp29SmzE8LSty1Mu5Ph6

Typically, for longer strings, compression yields output that is 80-90% of the input length; without compression, the output would be around 110% of the input length.

The compression algorithm is based on LZRW1-A. Internally, matches are encoded as symbols in the 0xd800-0xdfff range (the Unicode surrogate range, whose 2048 values correspond to 11 bits), with a 4-bit length and a 7-bit distance.
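
As an illustration of how a 4-bit length and a 7-bit distance fit into the 0xd800-0xdfff range, the sketch below packs both fields into one symbol and unpacks them again. The bit layout (length in the upper 4 bits, distance in the lower 7) and all names are assumptions made for this example. The actual field order in funycode.c, and any bias applied to the length or distance, may differ.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed layout: 0xD800 + (length << 7) + distance. Not confirmed by the source. */
#define MATCH_BASE      0xD800u
#define MATCH_LAST      0xDFFFu
#define MATCH_DIST_BITS 7u
#define MATCH_DIST_MASK 0x7Fu   /* 7-bit distance: 0..127 */
#define MATCH_LEN_MASK  0x0Fu   /* 4-bit length:   0..15  */

/* Pack a raw length and distance into a single match symbol. */
uint32_t match_pack(unsigned len, unsigned dist)
{
    return MATCH_BASE
         + ((len & MATCH_LEN_MASK) << MATCH_DIST_BITS)
         + (dist & MATCH_DIST_MASK);
}

/* True if `sym` lies in the range reserved for match symbols. */
bool is_match_symbol(uint32_t sym)
{
    return sym >= MATCH_BASE && sym <= MATCH_LAST;
}

/* Extract the raw 4-bit length field; `sym` must be a match symbol. */
unsigned match_len(uint32_t sym)
{
    return (unsigned)((sym - MATCH_BASE) >> MATCH_DIST_BITS);
}

/* Extract the raw 7-bit distance field; `sym` must be a match symbol. */
unsigned match_dist(uint32_t sym)
{
    return (unsigned)((sym - MATCH_BASE) & MATCH_DIST_MASK);
}
```

Because surrogate code points cannot occur in well-formed input text, a decoder can tell match symbols apart from literal characters with a single range test.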

Examples

| Original | Encoded |
|----------|---------|
| `foo` | `foo` |
| `foo_bar` | `foobar_H7` |
| `supercalifragilisticexpialidocious` | `supercalifragilisticexpialidocious` |
| `bücher` | `bcher_eL` |
| `hörbücher` | `hrbcher_5S0u0` |
| `_` | `C1_` |
| (space) | `A0_` |
| `自転車` | `qeE4K2A1_` |
| `велосипед` | `FH420EHL9G_` |
| `wikipedia::article::wikilink::wikilink(std::string const&)` | `wikipediaarticlelinkstdstringconst_T0zGw0s0sw007080sywurJ3t1` |
| `<mycrate::Foo<u32> as mycrate::Bar<u64>>::foo` | `mycrateFoou32asBaru64foo_D02qs10G0ZCAy0B0sqzxxE` |
