Choose and apply a character encoding

Intended audience: HTML developers (who use editors or scripts), script developers (PHP, JSP, etc.), CSS developers, web project managers and anyone looking for instructions on how to choose and use a character encoding


Which character encoding should you choose for your content and how do you apply it to your content?

Content is made up of a sequence of characters. Characters include the letters of the alphabet, punctuation marks, etc. In a computer, however, content is stored as a sequence of bytes, which are numeric values. Some characters are represented by more than one byte. As with ciphers in espionage, the way in which sequences of bytes are converted into characters depends on the key used to encode the text. In this context, that key is called a character encoding.
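To make this concrete, here is a minimal Python sketch showing that one sequence of bytes yields different characters depending on the encoding used to interpret it. The sample character is an arbitrary choice for illustration.

```python
# The same bytes, read with two different "keys" (character encodings).
text = "é"                   # one character
data = text.encode("utf-8")  # stored as two bytes in UTF-8

as_utf8 = data.decode("utf-8")          # interpreted with the matching encoding
as_latin1 = data.decode("iso-8859-1")   # interpreted with a different encoding

print(list(data))   # [195, 169]
print(as_utf8)      # é
print(as_latin1)    # Ã©  (two unrelated characters)
```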

This article will give you simple advice on what character encoding to use for your content and how to apply it, i.e. how to create a document using that character encoding.

If you want to better understand what characters and character encodings are, read the article Character encoding for beginners.

Short answer

Use UTF-8 for all of your content. Consider converting content in legacy character encodings to UTF-8.
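As a rough illustration of what converting legacy content involves, the following Python sketch decodes bytes from a legacy encoding and re-encodes them as UTF-8. It assumes you already know which legacy encoding the content is in; the helper name `to_utf8` and the sample text are hypothetical.

```python
def to_utf8(legacy_bytes: bytes, legacy_encoding: str) -> bytes:
    """Re-encode bytes from a known legacy encoding as UTF-8."""
    text = legacy_bytes.decode(legacy_encoding)  # bytes -> characters
    return text.encode("utf-8")                  # characters -> UTF-8 bytes


# Sample content as it might be stored in a legacy encoding.
legacy = "Grüße".encode("windows-1252")
converted = to_utf8(legacy, "windows-1252")

print(len(legacy))     # 5 bytes: one byte per character
print(len(converted))  # 7 bytes: ü and ß take two bytes each in UTF-8
```

If the legacy encoding is guessed wrongly, the conversion may succeed but produce the wrong characters, so it is worth spot-checking the result.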

If you cannot use a Unicode encoding, check that the encoding you choose is widely supported by browsers and that it is not one of the encodings that the current specifications say should be avoided.

Check whether your choice is overridden by server-side HTTP settings.

In addition to declaring the encoding of the document inside the document and/or on the server, you need to save the text in that encoding to apply it to your content.

Developers also need to ensure that the different parts of the system can communicate with each other.


Apply the character encoding to the content

Content authors should declare the character encoding of their pages using one of the methods described in Specification of the character encoding in HTML.

However, it is important to understand that it is not enough to declare the character encoding in the document or on the server. That doesn't change the bytes; you also need to save the text in this character encoding. (The declaration only helps the browser to interpret the byte sequence in which the text is stored.)
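The difference between declaring an encoding and saving in it can be sketched in Python as follows. The file names and the sample markup are made up for illustration; the point is only that an identical declaration can sit on top of different bytes.

```python
import os
import tempfile

html = '<meta charset="utf-8"><p>café</p>'

with tempfile.TemporaryDirectory() as tmp:
    good = os.path.join(tmp, "good.html")
    bad = os.path.join(tmp, "bad.html")

    # Saved in the encoding the document declares.
    with open(good, "w", encoding="utf-8") as f:
        f.write(html)

    # Same declaration, but saved in a different encoding.
    with open(bad, "w", encoding="iso-8859-1") as f:
        f.write(html)

    with open(good, "rb") as f:
        good_bytes = f.read()
    with open(bad, "rb") as f:
        bad_bytes = f.read()

# The declaration is identical, but the stored bytes are not.
print(good_bytes == bad_bytes)  # False

# Reading the mismatched file as UTF-8 (as declared) damages the text:
# the byte for "é" is not valid UTF-8 and becomes U+FFFD.
print(bad_bytes.decode("utf-8", errors="replace"))
```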

The article Setting the character encoding in web editors and text editors gives advice on how to set the character encoding when saving in a number of different editors.

You should also make sure that your server delivers documents with the correct HTTP information, as this overrides the information within the document (see below).

Developers also need to ensure that the different parts of the system can communicate with each other. Websites must be able to communicate with scripts in the backend, databases, etc. Of course, this works best when everything is UTF-8 encoded. What developers need to consider can be found in the article Migration to Unicode.

Why should you use UTF-8?

An HTML page can only be encoded in one character encoding. You cannot encode different parts of a document in different character encodings.

The barriers to using Unicode are very low these days. In January 2012, Google reported that, of the billions of web pages it had examined, over 60% used UTF-8. If you add the pages that are pure ASCII (ASCII is a subset of UTF-8), the figure rises to approximately 80%.

There are three different character encodings for Unicode: UTF-8, UTF-16 and UTF-32. Of these, only UTF-8 is recommended for web content. The HTML5 specification says, "Authors should use UTF-8. Validators can advise authors not to use outdated character encodings. Authoring tools should use UTF-8 as the default setting for new documents."

All ASCII characters are encoded in UTF-8 by the exact same bytes as in the ASCII encoding, which is often helpful for interoperability and backward compatibility.
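This property is easy to check. The short Python sketch below compares the ASCII and UTF-8 encodings of every ASCII code point; the sample strings are arbitrary.

```python
# Every ASCII code point (0-127) is a single, identical byte in UTF-8.
ascii_matches = all(
    chr(i).encode("ascii") == chr(i).encode("utf-8") == bytes([i])
    for i in range(128)
)
print(ascii_matches)  # True

# An ASCII-only string therefore produces the same bytes in both encodings.
sample = "Hello, world!"
same_bytes = sample.encode("ascii") == sample.encode("utf-8")
print(same_bytes)  # True

# Only non-ASCII characters need more than one byte.
print(len("é".encode("utf-8")))  # 2
```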

Consideration of the HTTP header

A character encoding declared in the HTTP header overrides any declaration within the document. If the HTTP header declares a character encoding that is not the one you want to use for your content, this is a problem if you cannot change the server settings.
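The precedence described here can be pictured with a deliberately simplified Python sketch. The function name and parameters are hypothetical, and real browsers take further signals into account (such as a byte-order mark), which this sketch ignores.

```python
from typing import Optional


def effective_encoding(http_charset: Optional[str],
                       in_document_charset: Optional[str]) -> Optional[str]:
    """Simplified model: the HTTP declaration wins over the in-document one."""
    if http_charset:
        return http_charset.lower()
    if in_document_charset:
        return in_document_charset.lower()
    return None  # nothing declared; the browser would have to guess


# The server says ISO-8859-1, the page says UTF-8: the server wins.
print(effective_encoding("ISO-8859-1", "utf-8"))  # iso-8859-1

# No HTTP declaration: the in-document declaration is used.
print(effective_encoding(None, "utf-8"))          # utf-8
```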

You may not have access to the information in the HTTP header, and you may need to contact your server administrators for help. On the other hand, you may be able to change server settings yourself if you have limited access to the configuration files or if you generate pages with scripting languages. Read Setting of the HTTP charset parameter for more information on changing the character encoding declaration for a set of files on the server or for content generated by a scripting language.

Before you do that, you should check whether the HTTP header includes any character encoding information. You can use the W3C Internationalization Checker to find out whether a character encoding is declared in the HTTP header, and if so, which one. The article Check HTTP headers points to alternative tools for verifying the server's character encoding declaration.
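If you already have the value of a Content-Type header (for example, copied from one of those tools), extracting the declared encoding can be sketched in Python with the standard library. The helper name `charset_from_content_type` and the sample header values are assumptions for illustration.

```python
from email.message import Message
from typing import Optional


def charset_from_content_type(header_value: str) -> Optional[str]:
    """Return the charset parameter of a Content-Type value, or None."""
    msg = Message()
    msg["Content-Type"] = header_value
    # get_content_charset() returns the charset in lower case,
    # or None when the header carries no charset parameter.
    return msg.get_content_charset()


print(charset_from_content_type("text/html; charset=UTF-8"))  # utf-8
print(charset_from_content_type("text/html"))                 # None
```

A result of None means the server declares no encoding in HTTP, so the in-document declaration is what browsers will see.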

Additional information

This section contains subtleties that you don't necessarily need to know, but that are mentioned here for the sake of completeness.

What to do if you can't use UTF-8

If you really can't avoid using an encoding other than UTF-8, you will need to choose one of a limited set of character encoding identifiers to ensure maximum interoperability and future readability of your content, and to minimize security vulnerabilities.

Until recently, the IANA registry was the reference for character encoding identifiers. The IANA registry often contains several identifiers for the same encoding. In these cases you should use the identifier marked as "preferred".

The newer Encoding specification contains a list that has been tested against current browser implementations. You can find it in the table in the Encodings section. It is best to use the identifiers in the left column of this table.

Note: The fact that an identifier appears in one of these sources does not automatically mean that the encoding is a good choice. Read the following section to find out which character encodings you should avoid.

Avoid these character encodings

The HTML5 specification lists some character encodings that you should avoid.

Documents must not use JIS_C6226-1983, JIS_X0212-1990, HZ-GB-2312, JOHAB (Windows code page 1361), ISO-2022-based encodings or EBCDIC-based encodings. The reason is that these encodings use ASCII code values to represent non-ASCII characters, which is a security risk.
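The problem can be seen with an ISO-2022-based encoding in a short Python sketch. The sample text is arbitrary, and Python's `iso2022_jp` codec is used here only as one representative of this family.

```python
text = "日本語"
encoded = text.encode("iso2022_jp")

# Every byte of the encoded text lies in the ASCII range (below 0x80).
all_in_ascii_range = all(b < 0x80 for b in encoded)
print(all_in_ascii_range)  # True

# Software that assumes ASCII therefore decodes the bytes without error,
# but sees ASCII characters where the author wrote Japanese text.
misread = encoded.decode("ascii")
print(misread == text)  # False

# Only a decoder that knows the real encoding recovers the original.
print(encoded.decode("iso2022_jp") == text)  # True
```

Because the same bytes are valid under both interpretations, a filter and a browser can disagree about what the content says, which is where the security risk comes from.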

Documents must also not use the CESU-8, UTF-7, BOCU-1 or SCSU encodings; these were never intended for web content, and the HTML5 specification prohibits browsers from supporting them.

The specification also advises against the use of UTF-16, and the use of UTF-32 is "particularly discouraged".

Other character encodings listed in the Encoding specification should also be avoided, including Big5 and EUC-JP, which are problematic in terms of interoperability. ISO-8859-8 (a Hebrew encoding for text in visual order) should not be used either; use an encoding that stores text in logical order instead (UTF-8, or if that is not possible, ISO-8859-8-i).

The replacement encoding listed in the Encoding specification is not actually an encoding, but a fallback that maps every octet (byte) to the Unicode code point U+FFFD REPLACEMENT CHARACTER. Of course, it makes no sense to transmit data in this encoding.

The x-user-defined encoding is a single-byte encoding whose lower half is ASCII and whose upper half is mapped into the Unicode Private Use Area (PUA). As with the Private Use Area in general, this encoding should be avoided on the public Internet because it is detrimental to interoperability and long-term use.
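The shape of this mapping can be sketched in Python. Python has no built-in codec for it, so the function below is a hypothetical helper; it assumes the mapping given in the Encoding specification, in which bytes 0x80 to 0xFF map to U+F780 to U+F7FF.

```python
def decode_x_user_defined(data: bytes) -> str:
    """Sketch of x-user-defined decoding (assumed mapping, see lead-in)."""
    return "".join(
        chr(b) if b < 0x80           # lower half: plain ASCII
        else chr(0xF780 + b - 0x80)  # upper half: Private Use Area
        for b in data
    )


decoded = decode_x_user_defined(b"A\x80\xff")
print([hex(ord(ch)) for ch in decoded])  # ['0x41', '0xf780', '0xf7ff']
```

Private Use Area code points have no agreed meaning, which is why text decoded this way cannot be exchanged reliably between systems.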

Further reading