
Understanding HTML Character Encoding: Why UTF-8 Matters

Ensure your HTML documents display special characters correctly by understanding character encoding and why UTF-8 is the preferred standard for modern web development.

 

This post isn’t meant to be a treatise on character encoding in general. It looks at encoding in the context of HTML documents, and especially at how you can prevent special characters, such as German umlauts, from being displayed as cryptic characters. If you keep the following two points in mind, there shouldn’t be any problem:

  • In the HTML document, you need to specify the character encoding of the document in the head data between <head> and </head>, for example with <meta charset="UTF-8">. Unless you have a specific reason not to, you should always use the UTF-8 value for the charset attribute.
  • However, specifying the character encoding in the HTML document isn’t enough: the file must also be saved in that encoding using the editor of your choice. Consequently, if you’ve specified UTF-8 as the character encoding in the HTML document, you must also save the document with the UTF-8 encoding. Most editors now save in UTF-8 by default, so you rarely have to take care of this yourself, but it’s worth mentioning briefly.
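The second point is easy to reproduce outside the browser. The following is a minimal Python sketch, not something taken from the book: the sample page, the file name index.html, and the helper save_then_decode are made up for illustration. It saves a page as UTF-8 and then decodes the stored bytes once with the matching encoding and once with ISO-8859-1, which is roughly what happens when the declared and the saved encoding disagree.

```python
import tempfile
from pathlib import Path

# Hypothetical sample page; the file name and text are made up for illustration.
PAGE = """<!DOCTYPE html>
<html lang="de">
<head>
<meta charset="UTF-8">
<title>Umlaute</title>
</head>
<body><p>Käse aus Köln</p></body>
</html>
"""


def save_then_decode(text: str, saved_as: str, decoded_as: str) -> str:
    """Save text in one encoding, then decode the stored bytes with another."""
    with tempfile.TemporaryDirectory() as folder:
        path = Path(folder) / "index.html"
        path.write_text(text, encoding=saved_as)
        return path.read_bytes().decode(decoded_as)


# Saved and read as UTF-8: the umlauts survive.
matching = save_then_decode(PAGE, "utf-8", "utf-8")

# Saved as UTF-8 but read as ISO-8859-1: each umlaut turns into two characters.
mismatched = save_then_decode(PAGE, "utf-8", "iso-8859-1")

print("Käse aus Köln" in matching)      # True
print("KÃ¤se aus KÃ¶ln" in mismatched)  # True
```

The cryptic pair "Ã¤" appears because the two bytes that UTF-8 uses for "ä" are each interpreted as a separate ISO-8859-1 character.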

From Bytes to Character Encoding

The smallest unit, the bit, can be skipped here because you don’t need to go into that much detail at this point. The byte is quite sufficient for this purpose. When the computer reads a file or other data into main memory, it basically sees just bytes, each of which has a certain value. The value of a single byte results from the states of its individual bits. Let’s use a byte with the value 68 (bit pattern 01000100) as an example.
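If you want to check the relationship between a byte value and its bits yourself, a short Python sketch is enough. The helper names bits_of and value_of are chosen for this illustration only.

```python
def bits_of(value: int) -> str:
    """Return the eight bits of a byte value as a string of 0s and 1s."""
    if not 0 <= value <= 255:
        raise ValueError("a byte holds values from 0 to 255")
    return format(value, "08b")


def value_of(bits: str) -> int:
    """Return the numeric value of a string of bits."""
    return int(bits, 2)


print(bits_of(68))           # 01000100
print(value_of("01000100"))  # 68, because 64 + 4 = 68
```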

 

To create a human-readable character from this byte with the value 68, a convention is needed that describes which byte value corresponds to which representable character. For this purpose, a translation table (also referred to as encoding table) is used for encoding bytes.

 

From ASCII to ISO-8859

You now know that an encoding table is responsible for turning a byte into a readable character. Among the first character sets of this kind were ASCII and EBCDIC. ASCII uses 7 bits and can therefore represent 128 different values, whereas EBCDIC is an 8-bit encoding used mainly on IBM mainframes. ASCII encoding has become established in common practice. In the ASCII encoding table, the first 32 values (0 to 31) are pure control characters, as is the value 127 (delete); the printable characters occupy the values 32 to 126. A look at the ASCII encoding table shows that the value 68 corresponds to the capital letter D.
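The lookup in the ASCII table can be tried out with Python's built-in ascii codec. This is only a small sketch; the variable names are chosen for illustration.

```python
# Decode the single byte with the value 68 using the ASCII table.
letter = bytes([68]).decode("ascii")
print(letter)  # D

# Split the 128 ASCII values into control and printable characters.
control_values = [v for v in range(128) if not chr(v).isprintable()]
printable_values = [v for v in range(128) if chr(v).isprintable()]
print(len(control_values), len(printable_values))  # 33 95

# Values above 127 are not part of ASCII, so decoding them fails.
try:
    bytes([200]).decode("ascii")
    outside_ascii_fails = False
except UnicodeDecodeError:
    outside_ascii_fails = True
print(outside_ascii_fails)  # True
```

The 33 control values are 0 to 31 plus 127; the 95 printable ones run from 32 (the space) to 126.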

 

The 8th bit was initially used only for error detection (as a parity bit) on communication lines or for other control tasks. Because there was no space left in the ASCII character set for language-specific characters (e.g., umlauts), the 8th bit was later used to extend the character set. This is also where the Babylonian confusion of characters arose, because different vendors defined their own “8th-bit codes.” IBM PCs and English MS-DOS systems used code page 437, for example. In Germany, code page 850 was used for Western European characters.
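How the same byte can mean different things becomes visible when one value is decoded with two of these tables. The sketch below relies on the codecs cp437 (the original IBM PC code page) and cp850 that ship with Python; the helper decode_with is a name made up for this example.

```python
def decode_with(byte_value: int, codepages=("cp437", "cp850")) -> dict:
    """Decode one byte value with several code pages and collect the results."""
    return {cp: bytes([byte_value]).decode(cp) for cp in codepages}


# 0x84 (132) is the umlaut "ä" in both code pages.
same = decode_with(0x84)

# 0x9B (155) is the cent sign in one table and "ø" in the other.
different = decode_with(0x9B)

print(same["cp437"] == same["cp850"])            # True
print(different["cp437"] == different["cp850"])  # False
```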

 

Newer standards such as ISO-8859 also use 8 bits. Here, several character set tables were developed at once. For example, ISO-8859-1 (or Latin-1) covers the Western European languages. The first 128 values (0 to 127) were taken over from the ASCII encoding. The values between 128 and 255 hold many special characters and important characters from different European languages, such as the German umlauts, the Spanish ñ with its tilde, and the French accented characters.
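As a quick check of these ranges, the following Python sketch looks up a few characters in the ISO-8859-1 table and confirms that the lower half of the table matches ASCII. The selection of sample characters and the helper latin1_value are illustrative choices only.

```python
def latin1_value(character: str) -> int:
    """Return the single byte value of a character in ISO-8859-1."""
    return character.encode("iso-8859-1")[0]


# A few Western European characters and their values in the upper half.
values = {ch: latin1_value(ch) for ch in ["ä", "ö", "ü", "ß", "ñ", "é"]}
print(all(128 <= v <= 255 for v in values.values()))  # True

# The lower half (0 to 127) decodes to the same characters as ASCII.
lower_half = bytes(range(128))
same_as_ascii = lower_half.decode("iso-8859-1") == lower_half.decode("ascii")
print(same_as_ascii)  # True
```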

 

So, theoretically, you can use the ISO-8859-1 character set for the HTML document:

 

<meta charset="iso-8859-1">

 

While in theory you can specify any character set for charset, you should keep in mind that not every web browser understands every character set. The more widely used the character set, the better the chance that a web browser anywhere in the world will decode the page correctly.

 

Microsoft also created its own variation of the ISO-8859-1 encoding with code page 1252, which contains the euro sign. ISO-8859-1, on the other hand, doesn’t know the euro sign because the euro didn’t even exist when this table was created. The ISO added the euro sign only later, with ISO-8859-15. The result is that ISO-8859-1 doesn’t know the euro sign, while ISO-8859-15 and code page 1252 do, but at different values in their encoding tables (164 in ISO-8859-15 and 128 in code page 1252). Fortunately, today you don’t need to deal with the different character sets of a language. The description of the ISO 8859 standards here serves only as background information on the subject.
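The euro sign makes a compact test case for these three tables. The Python sketch below uses the codecs iso-8859-1, iso-8859-15, and cp1252 from the standard library; the helper encode_or_none is a hypothetical name for this example.

```python
from typing import Optional


def encode_or_none(character: str, encoding: str) -> Optional[bytes]:
    """Encode a character, or return None if the table doesn't contain it."""
    try:
        return character.encode(encoding)
    except UnicodeEncodeError:
        return None


euro = "\u20ac"  # the euro sign

in_latin1 = encode_or_none(euro, "iso-8859-1")   # None: not in the table
in_latin9 = encode_or_none(euro, "iso-8859-15")  # one byte with the value 164
in_cp1252 = encode_or_none(euro, "cp1252")       # one byte with the value 128

print(in_latin1, in_latin9[0], in_cp1252[0])  # None 164 128

# The byte 164 read as ISO-8859-1 is the generic currency sign instead.
misread = in_latin9.decode("iso-8859-1")
```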

 

The current HTML specification uses UTF-8, an encoding of the Unicode character set, which you declare with charset="UTF-8".

 

Beyond the Byte Boundary with Unicode

The preceding sections gave a good impression of the confusion surrounding the different character encodings. Note, however, that they dealt only with the Western European character sets, and even those not in great detail. The fact is that character encoding can become relatively complex if you pack everything into one byte and then want to use characters from different cultures. To bring all characters under one roof, the Unicode system was introduced.

 

The Unicode character set aims to represent the characters of all writing systems. In purely theoretical terms, more than four billion characters could be addressed with 32 bits per character; in practice, Unicode is limited to 1,114,112 code points (about 1.1 million). UTF-8 is an encoding of Unicode based on 8-bit units that is also backward compatible with ASCII encoding. A character occupies between one and four bytes. UTF-8 is now a uniform standard. For example, many operating systems use UTF-8 by default, and UTF-8 has become the norm in web development with HTML for representing language-specific characters, where it has largely replaced the use of HTML entities.
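The variable length of UTF-8 is easy to observe in Python. The following sketch encodes four sample characters, chosen here only because they need one, two, three, and four bytes respectively, and checks the upper limit of the Unicode code space.

```python
import sys

# Sample characters: capital D, a-umlaut, euro sign, and an emoji.
samples = ["D", "\u00e4", "\u20ac", "\U0001F600"]

# Number of bytes each character occupies in UTF-8.
lengths = [len(ch.encode("utf-8")) for ch in samples]
print(lengths)  # [1, 2, 3, 4]

# Backward compatibility: ASCII characters keep their single ASCII byte.
ascii_compatible = "D".encode("utf-8") == "D".encode("ascii") == bytes([68])
print(ascii_compatible)  # True

# The highest code point is 0x10FFFF, which gives 1,114,112 code points.
code_point_count = sys.maxunicode + 1
print(code_point_count)  # 1114112
```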

 

For more information, visit http://r12a.github.io/scripts/tutorial/ and https://home.unicode.org/. You can also find the characters of the Unicode encoding at www.unicode.org/charts/.

 

Editor’s note: This post has been adapted from a section of the book HTML and CSS: The Comprehensive Guide by Jürgen Wolf. Jürgen is a web and software developer and the author of several seminal works about programming and photography. Find out more about him on www.pronix.de.

 

This post was originally published 4/2025.

Rheinwerk Computing
by Rheinwerk Computing

Rheinwerk Computing is an imprint of Rheinwerk Publishing and publishes books by leading experts in the fields of programming, administration, security, analytics, and more.
