UTF-8 BOM (Byte-Order-Mark) Strip

Introduction

Bomstrip is a very simple tool that reads UTF-8 text from standard input (stdin), removes BOMs (byte-order marks), and writes the result to standard output (stdout).

This tool is written in C. If you don't have a C compiler, or want a tool that does the same thing in another language, take a look at Mechiel Lukkien's bomstrip project.

It also includes a set of snippets that all do the same thing in multiple programming languages:

Python
C
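
As a rough illustration of the behavior described above, here is a minimal Python sketch of a stdin-to-stdout BOM filter. It is not the code shipped in this repository: the function name `strip_bom` is hypothetical, and the sketch assumes that only a BOM at the very start of the input needs to be removed.

```python
import sys

# The three bytes of the UTF-8 BOM.
UTF8_BOM = b"\xef\xbb\xbf"


def strip_bom(data: bytes) -> bytes:
    """Return data without a leading UTF-8 BOM (hypothetical helper)."""
    # Assumption: only a BOM at the very start of the input is removed.
    if data.startswith(UTF8_BOM):
        return data[len(UTF8_BOM):]
    return data


def main() -> None:
    # Work on raw bytes so the rest of the stream passes through unchanged.
    data = sys.stdin.buffer.read()
    sys.stdout.buffer.write(strip_bom(data))


if __name__ == "__main__":
    main()
```

Saved under a hypothetical name such as `strip_bom.py`, it could be run as `python3 strip_bom.py < input.txt > output.txt`.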

What is a UTF-8 BOM

The UTF-8 BOM can be found at the start of some files. It consists of three bytes: \xEF \xBB \xBF. This is the UTF-8 encoding of the Unicode character U+FEFF.
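
The byte sequence can be checked with a few lines of Python. This is only a sketch: `has_utf8_bom` is a hypothetical helper name, while `codecs.BOM_UTF8` and the `utf-8-sig` codec come from the Python standard library.

```python
import codecs

# Encoding U+FEFF as UTF-8 yields the three BOM bytes.
bom_bytes = "\ufeff".encode("utf-8")
assert bom_bytes == b"\xef\xbb\xbf"
assert bom_bytes == codecs.BOM_UTF8


def has_utf8_bom(data: bytes) -> bool:
    """Report whether data starts with the UTF-8 BOM (hypothetical helper)."""
    return data.startswith(codecs.BOM_UTF8)


sample = codecs.BOM_UTF8 + b"hello"

# The plain "utf-8" codec keeps the BOM as a leading U+FEFF character,
# while "utf-8-sig" drops it during decoding.
kept = sample.decode("utf-8")
dropped = sample.decode("utf-8-sig")
```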

Reasons for not having a BOM in UTF-8 encoded files

It breaks compatibility with ASCII.
It may break shell scripts.
It may break all kinds of text processing.
It looks ugly in your editor, unless the editor tries to be smart and hides it from you.
Since UTF-8 is interpreted as a sequence of bytes, there is no endian problem as there is for encoding forms that use 16-bit or 32-bit code units. Where a BOM is used with UTF-8, it is only used as an encoding signature to distinguish UTF-8 from other encodings — it has nothing to do with byte order. ( From unicode.org's UTF-8, UTF-16, UTF-32 & BOM FAQ )
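
The first two points can be made concrete with a small sketch. The script content and the helper name `starts_with_shebang` are made up for illustration. The sketch shows that a BOM pushes the `#!` marker away from the start of a script, which is where typical Unix-like systems look for it, and that the result is no longer pure ASCII.

```python
UTF8_BOM = b"\xef\xbb\xbf"

# A made-up shell script, with and without a leading BOM.
script = b"#!/bin/sh\necho hello\n"
script_with_bom = UTF8_BOM + script


def starts_with_shebang(data: bytes) -> bool:
    """Check whether the first two bytes are '#!' (hypothetical helper)."""
    return data[:2] == b"#!"


# Without a BOM the shebang is the first thing in the file.
shebang_ok = starts_with_shebang(script)

# With a BOM the first bytes are EF BB BF, so the shebang is not found.
shebang_with_bom = starts_with_shebang(script_with_bom)

# bytes.isascii() is True only if every byte is below 0x80.
ascii_ok = script.isascii()
ascii_with_bom = script_with_bom.isascii()
```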

How can I help

Implementations in the following languages are wanted:

awk
Brainfuck
C++
C# / Mono
Common Lisp
Erlang
Forth
Go
Haskell
Java
Lua
Limbo
OCaml
Objective-C
Ook!
Pascal (tested on FreePascal, GNU Pascal)
Perl
PHP
PostScript
Prolog
Ruby
sed
Tcl/Tk
Unlambda

You can send your implementation to me < [email protected] >, or clone this repo and send me a pull request.

Credits

Mechiel Lukkien < [email protected] > ( This project comes from his bomstrip project )
Timothy Boronczyk ( The UTF-8 handling code in the initial version of the bomstrip CLI tool comes from his article )

zer4tul/utf8-bom-strip
