Charset/Encoding Issues and Conversion (Files, MySQL, PHP, UTF)

Resources

Tables

Online Converters

Some german chars

   ISO  UTF8 
Ä  c4   c3 84
Ö  d6   c3 96
Ü  dc   c3 9c
ß  df   c3 9f
ä  e4   c3 a4
ö  f6   c3 b6
ü  fc   c3 bc

Wrong translation examples

  • ? in a black diamond = iso umlaut/special char displayed as UTF-8
  • "Prävention" should read "Prävention" with umlaut "a"
    • Fix double encoded UTF-8 strings in PHP without double decoding correct umlauts:
    • $decoded = utf8_decode($string);
      
      // If still valid UTF-8 after decoding, it was double encoded
      if (mb_check_encoding($decoded, 'UTF-8')) {
          $string = $decoded;
      }

Hex Editors:

  • Ubuntu Graphical
    • apt install bless

Client System: Ubuntu 16.04

check

  • locale
    • LANG=de_DE.UTF-8

bin2hex

  • echo -n ö | xxd
    • 00000000: c3b6
  • echo -n ö | xxd -p
    • c3b6

hex2bin

  • echo -n c3b6 | xxd -r -p
    • ö

File Encoding

Grafical tool to rename files:

https://wiki.ubuntuusers.de/H2rename/

Overview chartset detection tools:

  • uchardet: not very reliable (utf8 ü -> WINDOWS-1258)
  • chardetect: even worse
  • enca: only good for eastern european langs
  • file -b -e soft -
    • usable for ascii, iso, utf8
  • isutf8         no stdin
  • encguess    no stdin
  • convmv        no stdin converts filenames

Guess charset with uchardet (iconv compatible charset names)

https://github.com/BYVoid/uchardet

  • sudo apt install uchardet
  • uchardet myfile.txt
    • windows-1252
    • UTF-8
    • unknown
  • find . \( -name "*.php" -or -name "*.html" -or -name "*.css" -or -name "*.js" \) -exec echo {} \; -exec uchardet {} \;

Recode

  • sudo apt install recode

Conversion Script

  • vi convert-textfiles-utf8.sh
    • #!/bin/bash
      #
      # By Klemens Ullmann-Marx klemens@ull.at 2018-09-24
      
      # Exit script immediately if a command exits with a nonzero exit value
      set -e
      
      FILES=`find . -type f \( -name "*.php" -or -name "*.inc" -or -name "*.htm*" -or -name "*.*css" -or -name "*.js" -or -name "*.txt" \)`
      
      for FILE in $FILES; do
        echo
        echo $FILE
        ENCODING=`uchardet "$FILE"`
        echo before: $ENCODING
        if [ "$ENCODING" == "unknown" ]; then
          echo "skipping..."
          continue;
        fi
        recode $ENCODING..UTF-8 "$FILE"
        echo after: `uchardet "$FILE"`
      done 
  • chmod u+x convert-textfiles-utf8.sh

MySQL General Charset Settings

https://severalnines.com/blog/understanding-character-sets-and-collations-mysql/

Modern targets:

  • utf8mb4/ utf8mb4_unicode_ci

    (utf8mb4_general_ci is a simplified set of sorting rules which takes shortcuts designed to improve speed)

Show available charsets:

  • SELECT * FROM information_schema.CHARACTER_SETS ORDER BY CHARACTER_SET_NAME;

Show default charset:

  • SHOW VARIABLES LIKE 'char%'; SHOW VARIABLES LIKE 'collation%';
  • collation utf8mb4_0900_ai_ci => 0900 is the version
  • Everything is good if all values are utf8mb4. Otherwise check the mysql config file.

Overview/All

  • Databases/Schemata
    • SELECT schema_name,default_character_set_name FROM information_schema.SCHEMATA;
  • Collation
    • SELECT T.table_schema, T.table_name, CCSA.character_set_name FROM information_schema.`TABLES` T, information_schema.`COLLATION_CHARACTER_SET_APPLICABILITY` CCSA WHERE CCSA.collation_name = T.table_collation;
  • Columns
    • SELECT table_schema, table_name, column_name, character_set_name FROM information_schema.`COLUMNS` C;

 

Specific

 

  • Database:
    • SELECT default_character_set_name FROM information_schema.SCHEMATA S WHERE schema_name = "my_database";
    • Often the default mysql charset is used
  • Tables:
    • SELECT T.table_schema, T.table_name, CCSA.character_set_name FROM information_schema.`TABLES` T,        information_schema.`COLLATION_CHARACTER_SET_APPLICABILITY` CCSA WHERE CCSA.collation_name = T.table_collation   AND T.table_schema = "my_database";
  • Columns:
    • SELECT table_schema, table_name, column_name, character_set_name FROM information_schema.`COLUMNS` C WHERE table_schema='my_database';
    • Mostly the table charset is used

Convert table charset

  • ALTER TABLE my_table CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;

 

 

Convert data (deprecated?)

 

  • cd /tmp
  • mysqldump -c -e --default-character-set=utf8mb4 --single-transaction --skip-set-charset --add-drop-database -B my_schema > my_schema.sql
  • Check encoding of my_schema.sql
    •  
  • sed 's/DEFAULT CHARACTER SET latin1/DEFAULT CHARACTER SET utf8mb4 COLLATE utf8_general_ci/' < my_schema.sql | sed 's/DEFAULT CHARSET=latin1/DEFAULT CHARSET=utf8mb4/' > my_schema_utf8_fixed.sql
  • iconv -f ISO-8859-15 my_schema_utf8_fixed.sql -t UTF-8 > my_schema_utf8_converted.sql
  • mysql mydb < my_schema_utf8_converted.sql

 

 

 

 

PHP

Show hex value of character: bin2hex()

UTF Issue "Decomposed Umlauts with Trema"

  • https://blog.marcoka.de/index.php/posts/mit-umlauten-ins-21jahrhundert
  • find . -maxdepth 1 -type f -exec sh -c 'printf "%-10s %s\n" "$1" "$(printf "$1" | xxd -pu )"' None {} \;
    • 2e2f4d656e 75cc88 20    506f737467726164756174652e706e67
      2e2f4d656e c3bc   20    506f737467726164756174652e706e67
      . / M e n  ü      space P
      
      
      c3bc   = normal UTF-8 ü
      75cc88 = u + trema
      
    • Dabei kommt es zu eben dem skurillem Verhalten, weil die Umlaute in dem PDF offenbar nicht direkt als Umlaut (was z.B. bei einem 'ä' U+00E4 wäre) hinterlegt sind, sondern als 'a' (Unicode: U+0061) mit Trema (Unicode: U+0308). Dieses Verhalten nennt man decomposed.

@see: https://www.ullright.org/ullWiki/show/php-cli-script-fix-directories-and-file-encodings-with-german-umlauts