Character Sets & Encodings

Character Encodings

A character is an abstract unit of written language, such as a letter, digit, or punctuation mark. This is different from a glyph, which is the way the character is drawn in a particular font. A character set is a defined collection of characters.

A character encoding defines the mapping between a set of numeric codes and the characters of a particular character set. Encoding converts characters to numbers; decoding converts the numbers back to characters.
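The mapping in both directions can be tried out in a few lines. This is a minimal sketch using Python's built-in codecs; the variable names are illustrative only.

```python
import unicodedata

# One numeric code (0xA3 = 163) interpreted under ISO-8859-1.
code = 0xA3
char = bytes([code]).decode("iso-8859-1")

# Look up the official name of the decoded character.
name = unicodedata.name(char)
print(name)  # POUND SIGN

# Encoding is the reverse mapping: character back to numeric code.
round_trip = char.encode("iso-8859-1")[0]
print(round_trip)  # 163
```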

Many legacy character encodings are 8-bit, allowing 2 to the power 8 (= 256) characters to be encoded. The commonest one used for English and other Western European languages is ISO-8859-1, also known as Latin 1 or Western European. The limited number of characters means that different character encodings are often required for different languages.

CJK (Chinese-Japanese-Korean) languages have many thousands of characters. These cannot fit in a single byte, and the legacy encodings for them typically use 2 bytes (16 bits) per character, so they are often referred to as double-byte character sets.
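As a rough illustration of the byte counts involved, the sketch below encodes one ideograph and one Latin letter with Shift_JIS, a legacy Japanese encoding chosen here only as an example. It also shows that the ideograph has no code at all in an 8-bit Western encoding.

```python
ideograph = "\u5973"  # a CJK ideograph, code point U+5973
latin = "A"

# Shift_JIS uses two bytes for the ideograph but one for the ASCII letter.
sjis_ideograph = ideograph.encode("shift_jis")
sjis_latin = latin.encode("shift_jis")
print(len(sjis_ideograph))  # 2
print(len(sjis_latin))      # 1

# ISO-8859-1 has only 256 codes, none of them for this character.
try:
    ideograph.encode("iso-8859-1")
    fits_in_latin1 = True
except UnicodeEncodeError:
    fits_in_latin1 = False
print(fits_in_latin1)  # False
```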

Before the advent of Unicode, many different character sets and encodings had to be used for different languages. A plain text file normally uses a single encoding, so it was generally not possible to combine text from languages that needed different character sets in the same document.

Sometimes there are even several encodings in common use for one language, for example EUC-JP, SHIFT_JIS, and ISO-2022-JP for Japanese, or ISO-8859-1 and ISO-8859-15 for German.
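The practical consequence is that the same text produces different bytes, and the same byte produces different characters, depending on which encoding is assumed. The following sketch shows both effects; the sample text and variable names are chosen for illustration only.

```python
import unicodedata

# The same three Japanese characters under three Japanese encodings.
text = "\u65e5\u672c\u8a9e"
japanese = {name: text.encode(name)
            for name in ("euc_jp", "shift_jis", "iso2022_jp")}
for name, data in japanese.items():
    print(name, data.hex())

# All three byte sequences differ from one another.
distinct = len(set(japanese.values()))
print(distinct)  # 3

# The same byte read as ISO-8859-1 and as ISO-8859-15.
byte = b"\xa4"
latin1_name = unicodedata.name(byte.decode("iso8859-1"))
latin9_name = unicodedata.name(byte.decode("iso8859-15"))
print(latin1_name)  # CURRENCY SIGN
print(latin9_name)  # EURO SIGN
```

Reading a file with the wrong one of these encodings is what produces garbled text on screen.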

Unicode

Unicode is a character set which aims to include all the characters of all the world's languages - a universal character set. It was originally designed as a 16-bit code, allowing 2 to the power 16 (= 65536) characters, but has since been extended: valid Unicode code points now range from U+0000 to U+10FFFF, giving 1,114,112 possible code points. The first 256 characters of Unicode are the same as ISO-8859-1. Each character receives a "code point" which is unique and unchangeable. A code point is written as "U+" followed by a hexadecimal value.
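Code points are easy to inspect programmatically. The sketch below uses a hypothetical helper, code_point, to print the "U+" notation, and checks the two claims above: that the first 256 code points match ISO-8859-1 and that the range ends at U+10FFFF.

```python
import sys

def code_point(ch: str) -> str:
    """Format a single character as U+ followed by at least 4 hex digits."""
    return f"U+{ord(ch):04X}"

print(code_point("A"))       # U+0041
print(code_point("\u00a3"))  # U+00A3
print(code_point("\u5973"))  # U+5973

# Every ISO-8859-1 byte value decodes to the code point with the same number.
same_as_latin1 = all(bytes([i]).decode("iso-8859-1") == chr(i)
                     for i in range(256))
print(same_as_latin1)  # True

# The highest code point Python will accept is the top of the Unicode range.
highest = f"U+{sys.maxunicode:X}"
print(highest)  # U+10FFFF
```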

Unicode 4.0.0 was published in April 2003. Around 96000 of the possible code points had been assigned to characters by that version.

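There are 2 commonly-used encoding forms of Unicode: UTF-8 and UTF-16. The code point for a character is the same in both, but it is stored differently. UTF-16 uses one 16-bit unit for each code point up to U+FFFF (so small values are padded with leading zeros) and a pair of 16-bit units for higher code points. UTF-8 uses between 1 and 4 bytes per code point, with ASCII characters taking a single byte. This compactness for ASCII-heavy text makes UTF-8 suitable for use on the Internet, in networks and in some kinds of application which need to use slow connections.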
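The difference between the two forms is easiest to see by printing the bytes side by side. In this sketch, encoded_forms is a hypothetical helper, and the big-endian variant of UTF-16 is used so that no byte order mark is added to the output.

```python
def encoded_forms(ch: str):
    """Return (code point, UTF-8 bytes as hex, UTF-16 big-endian bytes as hex)."""
    return (f"U+{ord(ch):04X}",
            ch.encode("utf-8").hex(),
            ch.encode("utf-16-be").hex())

for ch in ("A", "\u00a3", "\u5973", "\U0001F600"):
    print(*encoded_forms(ch))
# U+0041 41 0041
# U+00A3 c2a3 00a3
# U+5973 e5a5b3 5973
# U+1F600 f09f9880 d83dde00
```

The ASCII letter takes 1 byte in UTF-8 but 2 in UTF-16, while the last character, which lies above U+FFFF, takes 4 bytes in both.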

UTF stands for Unicode Transformation Format or UCS Transformation Format. UCS stands for Universal Character Set. ISO/IEC 10646, also known as ISO UCS, is the ISO (International Organization for Standardization) standard which is equivalent to Unicode.

Unicode is useful even for a language such as English because it includes many punctuation and technical symbols which are in common use but are not in any legacy encoding. It also contains more kanji than any of the pre-Unicode Japanese national standard character sets. For Korean, Unicode is needed in order to represent both modern and old Korean and Chinese characters in the same document.

See the Unicode ranges.

Character Encodings in HTML

The character encoding should be specified in the <head> of a well-written HTML document. This enables the browser to map the bytes of the file to the correct characters. It's a good idea to put this before items such as the <title> which contain text that depends on the encoding.

For ISO-8859-1 the character encoding is specified as follows:

<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">

UTF-8 is the usual Unicode encoding for a web page. This is specified as follows:

<meta http-equiv="Content-Type" content="text/html; charset=utf-8">

For a Unicode page to be fully represented, it requires 2 things: a browser which supports Unicode and a Unicode font with the required glyphs.
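To show how a program might make use of the <meta> declaration, here is a minimal sketch that reads the charset from raw page bytes and then decodes the page with it. MetaCharsetFinder and declared_charset are hypothetical names, and the approach assumes an ASCII-compatible encoding so that the tag itself is readable before the encoding is known. Real browsers follow a more elaborate procedure.

```python
from html.parser import HTMLParser

class MetaCharsetFinder(HTMLParser):
    """Record the charset from the first Content-Type <meta> tag seen."""

    def __init__(self):
        super().__init__()
        self.charset = None

    def handle_starttag(self, tag, attrs):
        if tag != "meta" or self.charset is not None:
            return
        attr = dict(attrs)
        if (attr.get("http-equiv") or "").lower() != "content-type":
            return
        # content looks like "text/html; charset=iso-8859-1"
        for part in (attr.get("content") or "").split(";"):
            key, _, value = part.strip().partition("=")
            if key.lower() == "charset" and value:
                self.charset = value.strip().lower()

def declared_charset(raw: bytes):
    finder = MetaCharsetFinder()
    # Non-ASCII bytes are replaced; only the ASCII markup matters here.
    finder.feed(raw.decode("ascii", errors="replace"))
    return finder.charset

page = (b'<html><head><meta http-equiv="Content-Type" '
        b'content="text/html; charset=iso-8859-1">'
        b'<title>\xa3</title></head></html>')
charset = declared_charset(page)
print(charset)  # iso-8859-1

# With the charset known, the whole page can be decoded correctly.
decoded = page.decode(charset)
```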

Character Encodings in XHTML

An XML declaration at the beginning of a document should be used to specify the character encoding. For example:

<?xml version="1.0" encoding="UTF-8"?>

This declaration should go before the <!DOCTYPE> declaration. Unfortunately this puts IE 6 into quirks mode!

The XML declaration is required when the character encoding of the document is other than the default UTF-8 or UTF-16, and no encoding was determined by a higher-level protocol.
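The effect of this rule can be seen with an XML parser. The sketch below uses Python's standard xml.etree.ElementTree module on two hypothetical byte strings that differ only in whether the ISO-8859-1 encoding is declared.

```python
import xml.etree.ElementTree as ET

# The byte 0xA3 is a pound sign in ISO-8859-1 but is not valid UTF-8 on its own.
declared = b'<?xml version="1.0" encoding="ISO-8859-1"?><p>\xa3</p>'
undeclared = b'<p>\xa3</p>'

# With the declaration, the parser decodes the byte correctly.
text = ET.fromstring(declared).text
print(f"U+{ord(text):04X}")  # U+00A3

# Without it, the parser assumes UTF-8 and rejects the document.
try:
    ET.fromstring(undeclared)
    undeclared_ok = True
except ET.ParseError:
    undeclared_ok = False
print(undeclared_ok)  # False
```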

The character encoding can also be specified in the same way as for HTML.

Character Entity References in HTML

HTML character references begin with & and end with ;

Certain characters have a "character entity reference" assigned to them in HTML. (There are 252 character entity references in HTML 4.0.) For example, £ is represented by &pound;. Note that these names are case-sensitive.

Whether or not there is a character entity, all characters can be represented by using a numeric character reference. The number is the character's Unicode code point, regardless of the character encoding specified in the <head>. If the number is given in decimal, it is prefixed by &#; if it is given in hexadecimal, it is prefixed by &#x. For example, &#163; and &#xA3; can both be used as an alternative to &pound;. 女, which does not have a character entity, can be represented by &#22899; or &#x5973;.
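These references can be resolved and generated with Python's standard html module, as in the sketch below. decimal_ref and hex_ref are hypothetical helper names; html.unescape and the "xmlcharrefreplace" error handler are part of the standard library.

```python
import html

# Entity, decimal and hexadecimal references all resolve to the same character.
forms = [html.unescape(ref) for ref in ("&pound;", "&#163;", "&#xA3;")]
all_same = len(set(forms)) == 1
print(all_same)  # True

# Entity names are case-sensitive: these are two different characters.
lower = html.unescape("&eacute;")  # U+00E9
upper = html.unescape("&Eacute;")  # U+00C9

def decimal_ref(ch: str) -> str:
    return f"&#{ord(ch)};"

def hex_ref(ch: str) -> str:
    return f"&#x{ord(ch):X};"

print(decimal_ref("\u5973"))  # &#22899;
print(hex_ref("\u5973"))      # &#x5973;

# The standard library can also produce decimal references while encoding.
ascii_safe = "\u5973".encode("ascii", "xmlcharrefreplace")
print(ascii_safe)  # b'&#22899;'
```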

Among older browsers, there is better support for decimal numeric character references than hexadecimal ones.
