
Charset/Encoding Issues and Conversion (Files, MySQL, PHP)

Resources

Tables

Online Converters

Some German characters (hex byte values)

   ISO-8859-1  UTF-8
Ä  c4          c3 84
Ö  d6          c3 96
Ü  dc          c3 9c
ß  df          c3 9f
ä  e4          c3 a4
ö  f6          c3 b6
ü  fc          c3 bc
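
The byte values in the table can be reproduced on the command line. This is a minimal sketch, assuming iconv and od are installed and the terminal uses UTF-8; the helper name to_hex_in is made up for illustration.

```shell
# Print the bytes of a UTF-8 string as hex after converting it to the
# given target charset. to_hex_in is a hypothetical helper name.
to_hex_in() {
  # $1 = target charset (an iconv name), $2 = text, given as UTF-8
  printf '%s' "$2" | iconv -f UTF-8 -t "$1" | od -An -v -tx1 | tr -d ' \n'
  echo
}

to_hex_in ISO-8859-1 'ö'   # value for the ISO column
to_hex_in UTF-8 'ö'        # value for the UTF-8 column
```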

Wrong translations

  • ? in a black diamond (U+FFFD, the replacement character) = an ISO-8859 umlaut or special character is being decoded as UTF-8
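
The reason for the replacement character is that a single ISO-8859 umlaut byte is not a valid UTF-8 sequence. A small sketch of that check, assuming iconv is available; is_valid_utf8 is a hypothetical helper name:

```shell
# Succeeds if stdin is valid UTF-8. iconv exits non-zero when it meets
# a byte sequence it cannot decode.
is_valid_utf8() {
  iconv -f UTF-8 -t UTF-8 >/dev/null 2>&1
}

# 0xf6 is "ö" in ISO-8859-1, but on its own it is not valid UTF-8
if printf '\366' | is_valid_utf8; then
  echo "valid UTF-8"
else
  echo "not valid UTF-8"
fi
```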

Client System: Ubuntu 16.04

check

  • locale
    • LANG=de_DE.UTF-8
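
The locale check can also be scripted. The sketch below only inspects the locale name; locale_is_utf8 is a hypothetical helper, and on most systems `locale charmap` reports the active charset directly.

```shell
# Succeeds if a locale name such as "de_DE.UTF-8" selects UTF-8.
# locale_is_utf8 is a hypothetical helper; it only looks at the name.
locale_is_utf8() {
  case $(printf '%s' "$1" | tr '[:lower:]' '[:upper:]') in
    *.UTF-8*|*.UTF8*) return 0 ;;
    *) return 1 ;;
  esac
}

# Effective setting: LC_ALL overrides LC_CTYPE, which overrides LANG
current="${LC_ALL:-${LC_CTYPE:-${LANG:-}}}"
if locale_is_utf8 "$current"; then
  echo "UTF-8 locale"
else
  echo "not a UTF-8 locale: ${current:-unset}"
fi
```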

bin2hex

  • echo -n ö | xxd
    • 00000000: c3b6
  • echo -n ö | xxd -p
    • c3b6

hex2bin

  • echo -n c3b6 | xxd -r -p
    • ö
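
If xxd is not installed, the same two conversions can be approximated with od and printf. This is a sketch under that assumption; the function names simply mirror the headings above and are not existing commands.

```shell
# bin2hex: stdin -> lowercase hex without separators
bin2hex() {
  od -An -v -tx1 | tr -d ' \n'
  echo
}

# hex2bin: hex string argument -> raw bytes on stdout
hex2bin() {
  hex=$1
  # refuse odd-length input, which cannot be split into whole bytes
  [ $(( ${#hex} % 2 )) -eq 0 ] || return 1
  while [ -n "$hex" ]; do
    rest=${hex#??}          # everything after the first two digits
    byte=${hex%"$rest"}     # the first two digits
    # %b understands \0ddd octal escapes, so convert hex to octal first
    printf '%b' "\\0$(printf '%03o' "0x$byte")"
    hex=$rest
  done
}

hex2bin c3b6 | bin2hex
```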

File Encoding

Overview of charset detection tools:

  • uchardet: not very reliable (UTF-8 ü -> WINDOWS-1258)
  • chardetect: even worse
  • enca: only good for Eastern European languages
  • file -b -e soft -
    • usable for ASCII, ISO-8859, UTF-8
  • isutf8: no stdin
  • encguess: no stdin
  • convmv: no stdin; converts filenames
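
For input that arrives on stdin, a very coarse classification is possible with iconv alone. The following is a sketch, not a replacement for the tools above: guess_encoding is a hypothetical helper that can only tell ASCII, UTF-8 and "something else 8-bit" apart.

```shell
# Classify stdin as ascii, utf-8 or unknown-8bit.
# guess_encoding is a hypothetical helper, not one of the tools above.
guess_encoding() {
  tmp=$(mktemp) || return 1
  cat > "$tmp"                       # stdin can only be read once
  if iconv -f ASCII -t ASCII "$tmp" >/dev/null 2>&1; then
    echo ascii
  elif iconv -f UTF-8 -t UTF-8 "$tmp" >/dev/null 2>&1; then
    echo utf-8
  else
    echo unknown-8bit                # e.g. ISO-8859-x or Windows-125x
  fi
  rm -f "$tmp"
}

printf 'plain text' | guess_encoding
```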

Guess charset with uchardet (iconv compatible charset names)

https://github.com/BYVoid/uchardet

  • sudo apt install uchardet
  • uchardet myfile.txt
    • windows-1252
    • UTF-8
    • unknown
  • find . \( -name "*.php" -or -name "*.html" -or -name "*.css" -or -name "*.js" \) -exec echo {} \; -exec uchardet {} \;

Recode

  • sudo apt install recode

Conversion Script

  • vi convert-textfiles-utf8.sh
    • #!/bin/bash
      #
      # By Klemens Ullmann-Marx klemens@ull.at 2018-09-24
      #
      # Converts text files below the current directory to UTF-8 in place.
      
      # Exit script immediately if a command exits with a nonzero exit value
      set -e
      
      # NUL-separated file names, so that paths containing spaces survive
      find . -type f \( -name "*.php" -or -name "*.inc" -or -name "*.htm*" -or -name "*.*css" -or -name "*.js" -or -name "*.txt" \) -print0 |
      while IFS= read -r -d '' FILE; do
        echo
        echo "$FILE"
        ENCODING=$(uchardet "$FILE")
        echo "before: $ENCODING"
        # uchardet could not detect a charset: leave the file untouched
        if [ "$ENCODING" == "unknown" ]; then
          echo "skipping..."
          continue
        fi
        recode "$ENCODING..UTF-8" "$FILE"
        echo "after: $(uchardet "$FILE")"
      done
  • chmod u+x convert-textfiles-utf8.sh
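
Where recode is not available, a single file can be converted with iconv instead. This is a sketch under the assumption that the source charset is already known; to_utf8 is a hypothetical helper name, and the original file is only overwritten if iconv succeeds.

```shell
# Convert one file to UTF-8 in place; the original is only replaced if
# iconv succeeds. to_utf8 is a hypothetical helper name.
to_utf8() {
  # $1 = file, $2 = source charset (an iconv name such as ISO-8859-1)
  tmp=$(mktemp) || return 1
  if iconv -f "$2" -t UTF-8 "$1" > "$tmp"; then
    cat "$tmp" > "$1"     # keeps owner and permissions of the original
    rm -f "$tmp"
  else
    rm -f "$tmp"
    return 1
  fi
}
```

Usage: to_utf8 myfile.txt ISO-8859-1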

MySQL General Charset Settings

  • SHOW VARIABLES LIKE  'char%';

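    • the character_set_* values should be "utf8mb4" (except character_set_filesystem, which is "binary", and character_set_system, which is fixed by the server)
    • SHOW VARIABLES LIKE 'collation%'; should show "utf8mb4_general_ci"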

  • If not:

    • sudo vi /etc/mysql/my.cnf

      • [mysqld]
        character-set-server=utf8mb4
        collation-server=utf8mb4_general_ci

        [client]
        default-character-set=utf8mb4

        [mysql]
        default-character-set=utf8mb4

    • sudo service mysql restart

MySQL Schema specific Settings

  • SELECT * FROM information_schema.SCHEMATA;
  • Change charset
    • ALTER DATABASE my_schema DEFAULT CHARACTER SET utf8mb4  DEFAULT COLLATE utf8mb4_general_ci;

Convert data

  • cd /tmp
  • mysqldump -c -e --default-character-set=utf8mb4 --single-transaction --skip-set-charset --add-drop-database -B my_schema > my_schema.sql
  • Check encoding of my_schema.sql
    • e.g. uchardet my_schema.sql
  • sed -e 's/DEFAULT CHARACTER SET latin1/DEFAULT CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci/' -e 's/DEFAULT CHARSET=latin1/DEFAULT CHARSET=utf8mb4/' my_schema.sql > my_schema_utf8_fixed.sql
  • iconv -f ISO-8859-15 -t UTF-8 my_schema_utf8_fixed.sql > my_schema_utf8_converted.sql
  • mysql my_schema < my_schema_utf8_converted.sql
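
The effect of the sed step can be tried on a single line before running it on a whole dump. A minimal sketch: fix_charset_decl is a hypothetical wrapper around the two substitutions, and it uses utf8mb4_general_ci, the collation named in the settings sections above, since a utf8mb4 charset needs a utf8mb4 collation.

```shell
# Rewrite latin1 declarations in a dump read from stdin.
# fix_charset_decl is a hypothetical wrapper around the two sed rules.
fix_charset_decl() {
  sed -e 's/DEFAULT CHARACTER SET latin1/DEFAULT CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci/' \
      -e 's/DEFAULT CHARSET=latin1/DEFAULT CHARSET=utf8mb4/'
}

# Sample table footer as it typically appears in a dump
printf '%s\n' ') ENGINE=InnoDB DEFAULT CHARSET=latin1;' | fix_charset_decl
```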

 


Example output where the server and database defaults are still latin1:

mysql> show variables like 'char%';
+--------------------------+----------------------------+
| Variable_name            | Value                      |
+--------------------------+----------------------------+
| character_set_client     | utf8mb4                    |
| character_set_connection | utf8mb4                    |
| character_set_database   | latin1                     |
| character_set_filesystem | binary                     |
| character_set_results    | utf8mb4                    |
| character_set_server     | latin1                     |
| character_set_system     | utf8                       |
| character_sets_dir       | /usr/share/mysql/charsets/ |
+--------------------------+----------------------------+
8 rows in set (0,00 sec)
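
Spotting the latin1 rows in such output can be automated. The sketch below assumes tab-separated input as produced by `mysql -N -B -e "SHOW VARIABLES LIKE 'char%'"`; non_utf8mb4_vars is a hypothetical helper name.

```shell
# Read "name<TAB>value" lines and print the names of the charset
# variables that are not utf8mb4.
non_utf8mb4_vars() {
  awk -F '\t' '
    $1 == "character_set_filesystem" { next }   # binary by design
    $1 == "character_set_system"     { next }   # fixed by the server
    $1 == "character_sets_dir"       { next }   # a path, not a charset
    $2 != "utf8mb4"                  { print $1 }
  '
}
```

Usage: mysql -N -B -e "SHOW VARIABLES LIKE 'char%'" | non_utf8mb4_vars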
 

PHP

Show the hex bytes of a string: bin2hex()

UTF-8 issue: decomposed umlauts (base letter + combining trema)

  • https://blog.marcoka.de/index.php/posts/mit-umlauten-ins-21jahrhundert
  • find . -maxdepth 1 -type f -exec sh -c 'printf "%-10s %s\n" "$1" "$(printf "%s" "$1" | xxd -p)"' None {} \;
    • 2e2f4d656e 75cc88 20    506f737467726164756174652e706e67
      2e2f4d656e c3bc   20    506f737467726164756174652e706e67
      . / M e n  ü      space P
      
      
      c3bc   = normal (precomposed) UTF-8 ü, U+00FC
      75cc88 = u (U+0075) + combining trema (U+0308)
      
    • Quote from the linked post (translated): This is what causes the odd behavior: the umlauts in the PDF are apparently not stored directly as umlauts (which for an 'ä' would be U+00E4) but as an 'a' (Unicode: U+0061) followed by a trema (Unicode: U+0308). This form is called decomposed.
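
Decomposed names can be found by looking for the bytes cc 88, the UTF-8 encoding of U+0308. A minimal sketch that only detects them and does not normalize anything; has_combining_trema is a hypothetical helper name.

```shell
# Succeeds if the argument contains U+0308 (combining trema),
# which is encoded as the bytes cc 88 in UTF-8.
has_combining_trema() {
  trema=$(printf '\314\210')
  case $1 in
    *"$trema"*) return 0 ;;
    *) return 1 ;;
  esac
}

# List entries of the current directory whose names are decomposed
for name in ./*; do
  if has_combining_trema "$name"; then
    printf '%s\n' "$name"
  fi
done
```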

@see: https://www.ullright.org/ullWiki/show/php-cli-script-fix-directories-and-file-encodings-with-german-umlauts