This application reads every text and character field in a MySQL database and analyzes each DataPoint (the intersection of a row and a column, like a cell in a spreadsheet), looking for invalid utf8 sequences. It is a command-line-only application.
The utf8convert database stores a deconstructed target database and provides a working environment for storing conversions and converted data.
8Â°6 crew
This should be rendered as 8°6 crew, but the extended ° character has been decoded from utf8 into its multibyte component parts. Where there is one there are probably many.
Duvalierâ€™s Dream
This will be corrected to Duvalier’s Dream.
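The kind of damage shown above can be reproduced outside MySQL. The following Python sketch is an illustration only, not this tool's implementation: it assumes the text was valid utf8 that was wrongly decoded as Windows-1252 (a common cause, but an assumption here), and the helper names garble and repair are hypothetical.

```python
# Illustration only: assumes valid utf8 bytes were mis-decoded as Windows-1252.


def garble(text: str) -> str:
    """Reproduce the damage: read utf8 bytes as if they were cp1252."""
    return text.encode("utf-8").decode("cp1252")


def repair(text: str) -> str:
    """Reverse the damage; return the input unchanged if it is not garbled."""
    try:
        return text.encode("cp1252").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Not representable in cp1252, or the bytes are not valid utf8:
        # the text was not damaged in this particular way.
        return text


print(garble("8°6 crew"))            # 8Â°6 crew
print(repair("Duvalierâ€™s Dream"))  # Duvalier’s Dream
```

Returning the input unchanged on failure mirrors the idea that correct values are evaluated and then left alone.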
I created this tool to correct every invalid utf8 sequence in my database in a single conversion. In my example database it finds around 89,000 invalid sequences. Only DataPoints which have been converted will be exported. Valid utf8 characters are evaluated too, and ignored if they are correct.
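To make the DataPoint idea concrete, here is a rough Python sketch of scanning rows and keeping only the cells that a repair would change. The row layout, the function names changed_datapoints and repair, and the Windows-1252 assumption are all illustrative; the real tool works against MySQL and stores its results in the utf8convert database.

```python
from typing import Dict, Iterator, List, Tuple


def repair(text: str) -> str:
    """Hypothetical repair step: undo a utf8-read-as-cp1252 mix-up."""
    try:
        return text.encode("cp1252").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text


def changed_datapoints(
    rows: List[Dict[str, object]], key: str = "id"
) -> Iterator[Tuple[object, str, str, str]]:
    """Yield (row key, column, old value, new value) for cells that change."""
    for row in rows:
        for column, value in row.items():
            if column == key or not isinstance(value, str):
                continue  # only text cells are candidates
            fixed = repair(value)
            if fixed != value:
                yield row[key], column, value, fixed


rows = [
    {"id": 1, "title": "8Â°6 crew"},
    {"id": 2, "title": "Duvalier’s Dream"},  # already correct, so skipped
]
for datapoint in changed_datapoints(rows):
    print(datapoint)
```

Only the first row produces a DataPoint here; the second is evaluated and ignored because repairing it changes nothing.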
This application was inspired by https://www.bluebox.net/insight/blog-article/getting-out-of-mysql-character-set-hell
composer install
cp config/autoload/local.php.dist config/autoload/local.php
; edit local for specific environment
php public/index.php orm:schema-tool:create
The database must be validated before a conversion can be run.
Step 1: Validate the database. This command verifies that all database settings, table data types, and column data types are utf8.
php public/index.php database:validate
If the validate command fails, you must correct the problem(s) before continuing.
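A check along these lines can also be done by hand against information_schema. The Python sketch below is not the database:validate implementation; it only filters rows shaped like (TABLE_NAME, COLUMN_NAME, CHARACTER_SET_NAME) from information_schema.COLUMNS. Which character set names count as utf8 is an assumption here, and the function name and sample table names are hypothetical.

```python
from typing import Iterable, List, Optional, Tuple

# Assumption: these MySQL character set names are treated as "utf8".
UTF8_CHARSETS = {"utf8", "utf8mb3", "utf8mb4"}

# Query that would produce the rows. The %s placeholder follows the style of
# common Python MySQL drivers; adapt it to whichever client you use.
COLUMN_QUERY = """
SELECT TABLE_NAME, COLUMN_NAME, CHARACTER_SET_NAME
FROM information_schema.COLUMNS
WHERE TABLE_SCHEMA = %s
"""

Column = Tuple[str, str, Optional[str]]


def non_utf8_columns(columns: Iterable[Column]) -> List[Column]:
    """Return character columns whose character set is not utf8."""
    return [
        col for col in columns
        # CHARACTER_SET_NAME is NULL (None) for non-character columns.
        if col[2] is not None and col[2].lower() not in UTF8_CHARSETS
    ]


sample = [
    ("album", "title", "latin1"),
    ("album", "id", None),
    ("artist", "name", "utf8"),
]
print(non_utf8_columns(sample))
```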
You need to create a conversion. Each conversion requires a name.
The whitelist and blacklist options are comma-delimited lists of table names. If neither is specified, all tables will be evaluated.
php public/index.php conversion:create [--name=conversionName] [--whitelist=] [--blacklist=]
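As a sketch of how comma-delimited whitelist and blacklist values might select tables: the Python below assumes the whitelist narrows the set first and the blacklist then removes from it. That ordering, the function name, and the sample table names are assumptions rather than documented behavior of conversion:create.

```python
from typing import List, Optional, Set


def select_tables(
    all_tables: List[str],
    whitelist: Optional[str] = None,
    blacklist: Optional[str] = None,
) -> List[str]:
    """Pick tables from comma-delimited whitelist/blacklist strings."""

    def parse(value: Optional[str]) -> Set[str]:
        # An empty or missing option means "no names given".
        return {name.strip() for name in (value or "").split(",") if name.strip()}

    allowed = parse(whitelist)
    denied = parse(blacklist)
    selected = [t for t in all_tables if not allowed or t in allowed]
    return [t for t in selected if t not in denied]


tables = ["album", "artist", "sessions"]
print(select_tables(tables))
print(select_tables(tables, whitelist="album, artist"))
print(select_tables(tables, blacklist="sessions"))
```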
After your conversion has been created you must convert it.
php public/index.php conversion:convert --name=conversionName
To copy corrected utf8 data back into your database you must export it:
php public/index.php conversion:export --name=conversionName
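The commands above can be chained from a script. This Python sketch only assembles the command lines documented in this README and stops at the first failure; the function names and the conversion name firstPass are hypothetical. It assumes it is run from the project root and that the CLI exits with a non-zero status on failure. Export is kept separate so that DataPoints can be reviewed first.

```python
import subprocess
from typing import List, Optional

CLI = ["php", "public/index.php"]


def conversion_commands(
    name: str, whitelist: Optional[str] = None, blacklist: Optional[str] = None
) -> List[List[str]]:
    """Build validate, create and convert command lines for one conversion."""
    create = CLI + ["conversion:create", f"--name={name}"]
    if whitelist:
        create.append(f"--whitelist={whitelist}")
    if blacklist:
        create.append(f"--blacklist={blacklist}")
    return [
        CLI + ["database:validate"],
        create,
        CLI + ["conversion:convert", f"--name={name}"],
    ]


def export_command(name: str) -> List[str]:
    """Build the export command line; run it only after reviewing the data."""
    return CLI + ["conversion:export", f"--name={name}"]


def run_all(commands: List[List[str]]) -> None:
    """Run each command in order; check=True raises on a non-zero exit."""
    for command in commands:
        subprocess.run(command, check=True)


if __name__ == "__main__":
    # Print the plan instead of executing it.
    for command in conversion_commands("firstPass", blacklist="sessions"):
        print(" ".join(command))
```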
To clone a conversion to a new name:
php public/index.php conversion:clone --from=conversionName --to=newConversionName
Any text field containing data which is not just text may cause a problem. Examples are a text field which stores serialized data or a text field which stores otherwise binary data. If you have any fields like this, be sure to set their approved flag in the DataPoint table to 0 (false) before exporting a conversion.
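One reason serialized data is risky: PHP's serialize() format records the byte length of each string, so changing the bytes of a value without updating the length makes it unreadable. The Python sketch below is a rough heuristic for spotting such fields before export; the patterns and the function name are assumptions, and it will produce both false positives and misses.

```python
import re

# Start of a PHP serialize() payload: array/object/string with a length,
# an integer/boolean/double value, or null.
PHP_SERIALIZED = re.compile(r'^(?:[aOs]:\d+:|[ibd]:[^;]+;|N;)')


def looks_non_text(value: str) -> bool:
    """Guess whether a text field holds serialized or binary data."""
    if PHP_SERIALIZED.match(value):
        return True
    # Control characters other than tab, newline and carriage return
    # rarely appear in ordinary text.
    return any(ord(ch) < 32 and ch not in "\t\n\r" for ch in value)


print(looks_non_text('a:1:{s:5:"title";s:9:"8°6 crew";}'))
print(looks_non_text("Duvalier’s Dream"))
print(looks_non_text("\x00\x01binary"))
```

Fields flagged this way are candidates for review, not proof that a conversion would break them.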
If you ever get stuck you can always delete the utf8convert database and start over.