Charset/Encoding Issues and Conversion (Files, MySQL, PHP, UTF)
Resources
Tables
- ASCII: https://www.ullright.org/uploads/tableTool/UllWiki/attachments/ascii-hex.png
- iso-8859-15 table: https://tablesofstuff.com/encodings/iso-8859-15/
- WINDOWS-1258 table with utf8 codes: http://www.cp1252.com/
- UTF-8 table with latin equivalents: https://www.utf8-zeichentabelle.de/unicode-utf8-table.pl
Online Converters
- Hex + Charsets: https://8gwifi.org/HexToStringFunctions.jsp
- Charset converter http://string-functions.com/encodedecode.aspx
Some german chars
ISO UTF8 Ä c4 c3 84 Ö d6 c3 96 Ü dc c3 9c ß df c3 9f ä e4 c3 a4 ö f6 c3 b6 ü fc c3 bc
Wrong translation examples
- ? in a black diamond = iso umlaut/special char displayed as UTF-8
- "Prävention" should read "Prävention" with umlaut "a"
- Fix double encoded UTF-8 strings in PHP without double decoding correct umlauts:
-
$decoded = utf8_decode($string); // If still valid UTF-8 after decoding, it was double encoded if (mb_check_encoding($decoded, 'UTF-8')) { $string = $decoded; }
Hex Editors:
- Ubuntu Graphical
- apt install bless
Client System: Ubuntu 16.04
check
- locale
- LANG=de_DE.UTF-8
bin2hex
- echo -n ö | xxd
- 00000000: c3b6
- echo -n ö | xxd -p
- c3b6
hex2bin
- echo -n c3b6 | xxd -r -p
- ö
File Encoding
Grafical tool to rename files:
https://wiki.ubuntuusers.de/H2rename/
Overview chartset detection tools:
- uchardet: not very reliable (utf8 ü -> WINDOWS-1258)
- chardetect: even worse
- enca: only good for eastern european langs
- file -b -e soft -
- usable for ascii, iso, utf8
- isutf8 no stdin
- encguess no stdin
- convmv no stdin converts filenames
Guess charset with uchardet (iconv compatible charset names)
https://github.com/BYVoid/uchardet
- sudo apt install uchardet
- or for Ubuntu 16.04 or older:
https://packages.ubuntu.com/artful/amd64/uchardet/download
sudo dpkg -i uchardet_0.0.6-2_amd64.deb
- or for Ubuntu 16.04 or older:
- uchardet myfile.txt
- windows-1252
- UTF-8
- unknown
- find . \( -name "*.php" -or -name "*.html" -or -name "*.css" -or -name "*.js" \) -exec echo {} \; -exec uchardet {} \;
Recode
- sudo apt install recode
Conversion Script
- vi convert-textfiles-utf8.sh
-
#!/bin/bash # # By Klemens Ullmann-Marx klemens@ull.at 2018-09-24 # Exit script immediately if a command exits with a nonzero exit value set -e FILES=`find . -type f \( -name "*.php" -or -name "*.inc" -or -name "*.htm*" -or -name "*.*css" -or -name "*.js" -or -name "*.txt" \)` for FILE in $FILES; do echo echo $FILE ENCODING=`uchardet "$FILE"` echo before: $ENCODING if [ "$ENCODING" == "unknown" ]; then echo "skipping..." continue; fi recode $ENCODING..UTF-8 "$FILE" echo after: `uchardet "$FILE"` done
-
- chmod u+x convert-textfiles-utf8.sh
MySQL General Charset Settings
https://severalnines.com/blog/understanding-character-sets-and-collations-mysql/
Modern targets:
- utf8mb4/ utf8mb4_unicode_ci
(utf8mb4_general_ci is a simplified set of sorting rules which takes shortcuts designed to improve speed)
Show available charsets:
-
SELECT * FROM information_schema.CHARACTER_SETS ORDER BY CHARACTER_SET_NAME;
Show default charset:
-
SHOW VARIABLES LIKE 'char%'; SHOW VARIABLES LIKE 'collation%';
- collation utf8mb4_0900_ai_ci => 0900 is the version
- Everything is good if all values are utf8mb4. Otherwise check the mysql config file.
Overview/All
- Databases/Schemata
-
SELECT schema_name,default_character_set_name FROM information_schema.SCHEMATA;
-
- Collation
-
SELECT T.table_schema, T.table_name, CCSA.character_set_name FROM information_schema.`TABLES` T, information_schema.`COLLATION_CHARACTER_SET_APPLICABILITY` CCSA WHERE CCSA.collation_name = T.table_collation;
-
- Columns
-
SELECT table_schema, table_name, column_name, character_set_name FROM information_schema.`COLUMNS` C;
-
Specific
- Database:
-
SELECT default_character_set_name FROM information_schema.SCHEMATA S WHERE schema_name = "my_database";
- Often the default mysql charset is used
-
- Tables:
-
SELECT T.table_schema, T.table_name, CCSA.character_set_name FROM information_schema.`TABLES` T, information_schema.`COLLATION_CHARACTER_SET_APPLICABILITY` CCSA WHERE CCSA.collation_name = T.table_collation AND T.table_schema = "my_database";
-
- Columns:
-
SELECT table_schema, table_name, column_name, character_set_name FROM information_schema.`COLUMNS` C WHERE table_schema='my_database';
- Mostly the table charset is used
-
Convert table charset
-
ALTER TABLE my_table CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;
Convert data (deprecated?)
cd /tmpmysqldump -c -e --default-character-set=utf8mb4 --single-transaction --skip-set-charset --add-drop-database -B my_schema > my_schema.sqlCheck encoding of my_schema.sqlsed 's/DEFAULT CHARACTER SET latin1/DEFAULT CHARACTER SET utf8mb4 COLLATE utf8_general_ci/' < my_schema.sql | sed 's/DEFAULT CHARSET=latin1/DEFAULT CHARSET=utf8mb4/' > my_schema_utf8_fixed.sqliconv -f ISO-8859-15 my_schema_utf8_fixed.sql -t UTF-8 > my_schema_utf8_converted.sqlmysql mydb < my_schema_utf8_converted.sql
PHP
Show hex value of character: bin2hex()
UTF Issue "Decomposed Umlauts with Trema"
- https://blog.marcoka.de/index.php/posts/mit-umlauten-ins-21jahrhundert
- find . -maxdepth 1 -type f -exec sh -c 'printf "%-10s %s\n" "$1" "$(printf "$1" | xxd -pu )"' None {} \;
-
2e2f4d656e 75cc88 20 506f737467726164756174652e706e67 2e2f4d656e c3bc 20 506f737467726164756174652e706e67 . / M e n ü space P c3bc = normal UTF-8 ü 75cc88 = u + trema
- Dabei kommt es zu eben dem skurillem Verhalten, weil die Umlaute in dem PDF offenbar nicht direkt als Umlaut (was z.B. bei einem 'ä' U+00E4 wäre) hinterlegt sind, sondern als 'a' (Unicode: U+0061) mit Trema (Unicode: U+0308). Dieses Verhalten nennt man decomposed.
-