The Developer's Guide to Character Encoding
You open a text file and see `Ã©` where you expected `é`. This is "Mojibake" (garbled text). It means your text editor is decoding the file's bytes with the wrong encoding.
The Byte Problem
Computers store text as bytes, and each byte is just a number (0-255). We need a map that says "number 65 is the letter 'A'". That map is the Character Encoding.
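As a quick sketch in Python (whose `chr` and `ord` built-ins expose this number-to-character map directly):

```python
# A character encoding maps numbers to characters and back.
print(chr(65))   # number 65 maps to the letter 'A'
print(ord('A'))  # and 'A' maps back to 65
```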
Era 1: ASCII (The American Era)
In the 1960s, ASCII used 7 bits (0-127). It covered English letters, numbers, and control codes. It was simple. But it didn't support accents (é), Cyrillic, or Kanji.
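You can see the 7-bit limit directly; in Python, encoding anything outside the 0-127 range as ASCII raises an error:

```python
# Plain English round-trips through ASCII without trouble...
data = "Hello".encode("ascii")
print(data)  # b'Hello'

# ...but any character outside 0-127 simply has no ASCII byte.
try:
    "café".encode("ascii")
except UnicodeEncodeError as exc:
    print("not representable in ASCII:", exc)
```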
Era 2: Code Pages (The Chaos Era)
To support other languages, vendors used the 8th bit (128-255).
- ISO-8859-1 (Western Europe) said 196 is Ä.
- ISO-8859-5 (Cyrillic) said 196 is Ф.
If you opened a Russian file on a French computer, it looked like garbage. This was valid byte data, but interpreted with the wrong map.
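A minimal demonstration in Python, decoding the exact same byte with two different code pages:

```python
# One byte, two maps: 196 (0xC4) means different things
# depending on which code page you decode it with.
raw = bytes([196])
print(raw.decode("iso8859-1"))  # Western Europe: 'Ä'
print(raw.decode("iso8859-5"))  # Cyrillic: 'Ф'
# The byte is identical; only the chosen map differs.
```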
Era 3: Unicode (The Universal Era)
In the 90s, the industry united. Unicode assigns a unique number (a Code Point) to every character in every writing system. 'A' is U+0041. The 'Pile of Poo 💩' emoji is U+1F4A9.
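In Python, `ord` returns the code point, and the `\U...` escape writes one directly:

```python
# Code points are written U+XXXX, where XXXX is the number in hex.
print(hex(ord('A')))    # 0x41    -> U+0041
print(hex(ord('💩')))   # 0x1f4a9 -> U+1F4A9
print('\U0001F4A9')     # the escape form of the same code point: 💩
```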
UTF-8: The Implementation
Unicode is the concept; UTF-8 is the storage format.
- It is Variable Width: Standard English uses 1 byte (same as ASCII!). Complex symbols use 2, 3, or 4 bytes.
- It is Backward Compatible: Any valid ASCII file is also a valid UTF-8 file.
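Both properties are easy to check in Python by encoding a few characters and counting the bytes:

```python
# Variable width: byte length grows with the code point.
for ch in ("A", "é", "€", "💩"):
    print(repr(ch), "->", len(ch.encode("utf-8")), "byte(s)")
# 'A' -> 1 (same as ASCII), 'é' -> 2, '€' -> 3, '💩' -> 4

# Backward compatible: a pure-ASCII string has identical
# bytes whether you encode it as ASCII or as UTF-8.
assert "Hello".encode("utf-8") == "Hello".encode("ascii")
```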
The Solution for Developers
- Always use UTF-8. Set your IDE to save as UTF-8.
- Declare it. HTML: `<meta charset="UTF-8">`. HTTP header: `Content-Type: text/html; charset=utf-8`.
- Database: ensure your columns are `utf8mb4` (in MySQL) to support Emojis.
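The same rule applies in code: pass the encoding explicitly instead of trusting the platform default. A sketch in Python (`notes.txt` is just a placeholder filename):

```python
# Always name the encoding; on some platforms the default is not UTF-8.
with open("notes.txt", "w", encoding="utf-8") as f:
    f.write("café 💩")

# Reading it back with the same declared encoding round-trips cleanly.
with open("notes.txt", "r", encoding="utf-8") as f:
    print(f.read())  # café 💩
```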
Character encoding bugs are silent data corrupters. Using UTF-8 everywhere is the only vaccine.