The Developer's Guide to Character Encoding
You open a text file and see `Ã©` where you expected `é`. This is "Mojibake" (garbled text). It means your text editor is decoding the file's bytes with the wrong encoding.
The Byte Problem
Computers store text as bytes, and each byte is just a number (0-255). We need a map that says "number 65 is the letter 'A'". That map is the Character Encoding.
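As a quick sketch in Python (whose `chr` and `ord` built-ins expose this number-to-character map directly):

```python
# A character encoding maps numbers to characters and back.
print(chr(65))   # number 65 maps to the letter 'A'
print(ord('A'))  # and 'A' maps back to 65
```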
Era 1: ASCII (The American Era)
In the 1960s, ASCII used 7 bits (0-127). It covered English letters, numbers, and control codes. It was simple. But it didn't support accents (é), Cyrillic, or Kanji.
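You can see the 7-bit limit directly; in Python, encoding anything outside the 0-127 range as ASCII raises an error:

```python
# Plain English round-trips through ASCII without trouble...
data = "Hello".encode("ascii")
print(data)  # b'Hello'

# ...but any character outside 0-127 simply has no ASCII byte.
try:
    "café".encode("ascii")
except UnicodeEncodeError as exc:
    print("not representable in ASCII:", exc)
```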
Era 2: Code Pages (The Chaos Era)
To support other languages, vendors used the 8th bit (128-255).
- ISO-8859-1 (Western Europe) said 196 is Ä.
- ISO-8859-5 (Cyrillic) said 196 is Ф.
If you opened a Russian file on a French computer, it looked like garbage. This was valid byte data, but interpreted with the wrong map.
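A minimal demonstration in Python, decoding the exact same byte with two different code pages:

```python
# One byte, two maps: 196 (0xC4) means different things
# depending on which code page you decode it with.
raw = bytes([196])
print(raw.decode("iso8859-1"))  # Western Europe: 'Ä'
print(raw.decode("iso8859-5"))  # Cyrillic: 'Ф'
# The byte is identical; only the chosen map differs.
```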
Era 3: Unicode (The Universal Era)
In the 90s, the industry united. Unicode assigns a unique number (a Code Point) to every character in every writing system. 'A' is U+0041. The 'Pile of Poo 💩' emoji is U+1F4A9.
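In Python, `ord` returns the code point, and the `\U...` escape writes one directly:

```python
# Code points are written U+XXXX, where XXXX is the number in hex.
print(hex(ord('A')))    # 0x41    -> U+0041
print(hex(ord('💩')))   # 0x1f4a9 -> U+1F4A9
print('\U0001F4A9')     # the escape form of the same code point: 💩
```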
UTF-8: The Implementation
Unicode is the concept; UTF-8 is the storage format.
- It is Variable Width: Standard English uses 1 byte (same as ASCII!). Complex symbols use 2, 3, or 4 bytes.
- It is Backward Compatible: Any valid ASCII file is also a valid UTF-8 file.
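Both properties are easy to check in Python by encoding a few characters and counting the bytes:

```python
# Variable width: byte length grows with the code point.
for ch in ("A", "é", "€", "💩"):
    print(repr(ch), "->", len(ch.encode("utf-8")), "byte(s)")
# 'A' -> 1 (same as ASCII), 'é' -> 2, '€' -> 3, '💩' -> 4

# Backward compatible: a pure-ASCII string has identical
# bytes whether you encode it as ASCII or as UTF-8.
assert "Hello".encode("utf-8") == "Hello".encode("ascii")
```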
The Solution for Developers
- Always use UTF-8. Set your IDE to save as UTF-8.
- Declare it. HTML: `<meta charset="UTF-8">`. HTTP header: `Content-Type: text/html; charset=utf-8`.
- Database: ensure your columns are `utf8mb4` (in MySQL) to support Emojis.
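The same rule applies in code: pass the encoding explicitly instead of trusting the platform default. A sketch in Python (`notes.txt` is just a placeholder filename):

```python
# Always name the encoding; on some platforms the default is not UTF-8.
with open("notes.txt", "w", encoding="utf-8") as f:
    f.write("café 💩")

# Reading it back with the same declared encoding round-trips cleanly.
with open("notes.txt", "r", encoding="utf-8") as f:
    print(f.read())  # café 💩
```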
Character encoding bugs are silent data corrupters. Using UTF-8 everywhere is the only vaccine.