Charset/Encoding Issues and Conversion (Files, MySQL, PHP, UTF)

Resources

Tables

ASCII: https://www.ullright.org/uploads/tableTool/UllWiki/attachments/ascii-hex.png
iso-8859-15 table: https://tablesofstuff.com/encodings/iso-8859-15/
WINDOWS-1258 table with utf8 codes: http://www.cp1252.com/
UTF-8 table with latin equivalents: https://www.utf8-zeichentabelle.de/unicode-utf8-table.pl

Online Converters

Hex + Charsets: https://8gwifi.org/HexToStringFunctions.jsp
Charset converter http://string-functions.com/encodedecode.aspx

Some german chars

   ISO  UTF8 
Ä  c4   c3 84
Ö  d6   c3 96
Ü  dc   c3 9c
ß  df   c3 9f
ä  e4   c3 a4
ö  f6   c3 b6
ü  fc   c3 bc

Wrong translation examples

? in a black diamond = iso umlaut/special char displayed as UTF-8

"PrÃ¤vention" should read "Prävention" with umlaut "a"

Fix double encoded UTF-8 strings in PHP without double decoding correct umlauts:

$decoded = utf8_decode($string);

// If still valid UTF-8 after decoding, it was double encoded
if (mb_check_encoding($decoded, 'UTF-8')) {
    $string = $decoded;
}

Hex Editors:

Ubuntu Graphical
- apt install bless

Client System: Ubuntu 16.04

check

locale
- LANG=de_DE.UTF-8

bin2hex

echo -n ö | xxd
- 00000000: c3b6
echo -n ö | xxd -p
- c3b6

hex2bin

echo -n c3b6 | xxd -r -p
- ö

File Encoding

Grafical tool to rename files:

https://wiki.ubuntuusers.de/H2rename/

Overview chartset detection tools:

uchardet: not very reliable (utf8 ü -> WINDOWS-1258)
chardetect: even worse
enca: only good for eastern european langs
file -b -e soft -
- usable for ascii, iso, utf8
isutf8 no stdin
encguess no stdin
convmv no stdin converts filenames

Guess charset with uchardet (iconv compatible charset names)

https://github.com/BYVoid/uchardet

sudo apt install uchardet
- or for Ubuntu 16.04 or older:
  https://packages.ubuntu.com/artful/amd64/uchardet/download
  sudo dpkg -i uchardet_0.0.6-2_amd64.deb
uchardet myfile.txt
- windows-1252
- UTF-8
- unknown
find . $ -name "*.php" -or -name "*.html" -or -name "*.css" -or -name "*.js" $ -exec echo {} \; -exec uchardet {} \;

Recode

sudo apt install recode

Conversion Script

vi convert-textfiles-utf8.sh

#!/bin/bash
#
# By Klemens Ullmann-Marx klemens@ull.at 2018-09-24

# Exit script immediately if a command exits with a nonzero exit value
set -e

FILES=`find . -type f \( -name "*.php" -or -name "*.inc" -or -name "*.htm*" -or -name "*.*css" -or -name "*.js" -or -name "*.txt" \)`

for FILE in $FILES; do
  echo
  echo $FILE
  ENCODING=`uchardet "$FILE"`
  echo before: $ENCODING
  if [ "$ENCODING" == "unknown" ]; then
    echo "skipping..."
    continue;
  fi
  recode $ENCODING..UTF-8 "$FILE"
  echo after: `uchardet "$FILE"`
done

chmod u+x convert-textfiles-utf8.sh

MySQL General Charset Settings

https://severalnines.com/blog/understanding-character-sets-and-collations-mysql/

Modern targets:

utf8mb4/ utf8mb4_unicode_ci

(utf8mb4_general_ci is a simplified set of sorting rules which takes shortcuts designed to improve speed)

Show available charsets:

SELECT * FROM information_schema.CHARACTER_SETS ORDER BY CHARACTER_SET_NAME;

Show default charset:

SHOW VARIABLES LIKE 'char%'; SHOW VARIABLES LIKE 'collation%';

collation utf8mb4_0900_ai_ci => 0900 is the version
Everything is good if all values are utf8mb4. Otherwise check the mysql config file.

Overview/All

Databases/Schemata

SELECT schema_name,default_character_set_name FROM information_schema.SCHEMATA;

Collation

SELECT T.table_schema, T.table_name, CCSA.character_set_name FROM information_schema.`TABLES` T, information_schema.`COLLATION_CHARACTER_SET_APPLICABILITY` CCSA WHERE CCSA.collation_name = T.table_collation;

Columns

SELECT table_schema, table_name, column_name, character_set_name FROM information_schema.`COLUMNS` C;

Specific

Database:

SELECT default_character_set_name FROM information_schema.SCHEMATA S WHERE schema_name = "my_database";

Often the default mysql charset is used

Tables:

SELECT T.table_schema, T.table_name, CCSA.character_set_name FROM information_schema.`TABLES` T,        information_schema.`COLLATION_CHARACTER_SET_APPLICABILITY` CCSA WHERE CCSA.collation_name = T.table_collation   AND T.table_schema = "my_database";

Columns:

SELECT table_schema, table_name, column_name, character_set_name FROM information_schema.`COLUMNS` C WHERE table_schema='my_database';

Mostly the table charset is used

Convert table charset

ALTER TABLE my_table CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;

Convert data (deprecated?)

~~cd /tmp~~
~~mysqldump -c -e --default-character-set=utf8mb4 --single-transaction --skip-set-charset --add-drop-database -B my_schema > my_schema.sql~~
~~Check encoding of my_schema.sql~~
~~sed 's/DEFAULT CHARACTER SET latin1/DEFAULT CHARACTER SET utf8mb4 COLLATE utf8_general_ci/' < my_schema.sql | sed 's/DEFAULT CHARSET=latin1/DEFAULT CHARSET=utf8mb4/' > my_schema_utf8_fixed.sql~~
~~iconv -f ISO-8859-15 my_schema_utf8_fixed.sql -t UTF-8 > my_schema_utf8_converted.sql~~
~~mysql mydb < my_schema_utf8_converted.sql~~

PHP

Show hex value of character: bin2hex()

UTF Issue "Decomposed Umlauts with Trema"

https://blog.marcoka.de/index.php/posts/mit-umlauten-ins-21jahrhundert
find . -maxdepth 1 -type f -exec sh -c 'printf "%-10s %s\n" "$1" "$(printf "$1" | xxd -pu )"' None {} \;
- ```
2e2f4d656e 75cc88 20    506f737467726164756174652e706e67
2e2f4d656e c3bc   20    506f737467726164756174652e706e67
. / M e n  ü      space P


c3bc   = normal UTF-8 ü
75cc88 = u + trema
```
- Dabei kommt es zu eben dem skurillem Verhalten, weil die Umlaute in dem PDF offenbar nicht direkt als Umlaut (was z.B. bei einem 'ä' U+00E4 wäre) hinterlegt sind, sondern als 'a' (Unicode: U+0061) mit Trema (Unicode: U+0308). Dieses Verhalten nennt man decomposed.

@see: https://www.ullright.org/ullWiki/show/php-cli-script-fix-directories-and-file-encodings-with-german-umlauts

ascii-hex.png

Updated at 04/11/2024 12:33

https://www.ullright.org/ullWiki/show/mysql-charset-issues-and-conversion

Create

Links