Skip to content

Commit

Permalink
Introduce custom UTF-8 decoding pipeline.
Browse files Browse the repository at this point in the history
WordPress relies on various extensions, regular expressions, and basic
string operations when working with text potentially encoded as UTF-8.

In this patch an efficient UTF-8 decoding pipeline is introduced which
can remove these dependencies, normalize all decoding behaviors, and
open up new kinds of processing opportunities.

The decoder was taken from [Björn Höhrmann]. While it may be possible
that other methods are more efficient, such as in the multi-byte
extension, this decoder provides a streamable interface useful for
more flexible kinds of processing: for example, whether or not to
replace invalid byte sequences, zero-memory-overhead code point
counting, and partially decoding strings.

[Björn Höhrmann]: http://bjoern.hoehrmann.de/utf-8/decoder/dfa/
  • Loading branch information
dmsnell committed Jun 24, 2024
1 parent 4241c2b commit d154bea
Show file tree
Hide file tree
Showing 4 changed files with 492 additions and 127 deletions.
Loading

0 comments on commit d154bea

Please sign in to comment.