Lainbo's Blog

Writing the lang attribute of HTML tags

Format Introduction

The value of the HTML lang attribute is a language tag that can be made up of up to 7 kinds of subtags. The commonly used ones, which are also the ones MDN introduces, sit in the 1st, 3rd, and 4th positions (language, script, and region).

  1. Language: For example, zh for Chinese, en for English.
  2. Extended language (dialect): For example, cmn for Mandarin, yue for Cantonese, lzh for Classical Chinese. Its main practical use is to tell screen readers how the text should be read aloud.
  3. Script (writing system): For example, Hans for Simplified Chinese, Hant for Traditional Chinese.
  4. Region: For example, CN for China, HK for Hong Kong.
  5. Variant: Identifies a specific variant of a language. Few variants can follow zh; examples are pinyin (Hanyu Pinyin) and wadegile (Wade-Giles romanization). For Chinese, both must follow the prefix zh-Latn (for example, zh-Latn-pinyin).
  6. Extension: Used to add language-specific extensions.
  7. Private use: Subtags reserved for private use. They are generally not used unless there is a special need.
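
To make the positions concrete, here is a small sketch of tags that fill in different positions. The subtags are the ones mentioned in the list above; the sample text inside each element is only illustrative.

```html
<!-- Position 1 only: language -->
<p lang="zh">中文</p>

<!-- Positions 1 + 3: language + script -->
<p lang="zh-Hans">简体中文</p>
<p lang="zh-Hant">繁體中文</p>

<!-- Positions 1 + 3 + 4: language + script + region -->
<p lang="zh-Hant-HK">香港</p>

<!-- Positions 1 + 2 + 3 + 4: language + extlang + script + region -->
<p lang="zh-yue-Hant-HK">粵語</p>

<!-- Positions 1 + 3 + 5: language + script + variant -->
<p lang="zh-Latn-pinyin">Hànyǔ Pīnyīn</p>
```

Every position after the first is optional, which is why the shorter forms are as well-formed as the longer ones.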

Extended Language/Dialect (2nd position)

For tags that start with zh, 13 dialect subtags had been approved at the time of writing (this list covers dialects only, not every subtag that can follow zh):

  1. cdo - Min Dong

  2. cjy - Jin

  3. cpx - Pu-Xian Min

  4. czh - Huizhou

  5. czo - Min Zhong

  6. gan - Gan

  7. hak - Hakka

  8. hsn - Xiang

  9. lzh - Classical Chinese

  10. mnp - Min Bei

  11. nan - Min Nan

  12. wuu - Wu

  13. yue - Cantonese

Confusingly, these subtags can be used either as an extlang in the 2nd position to mark a dialect (e.g., zh-cdo) or in the 1st position as the primary language (e.g., cdo-Hans). So how do they relate to the traditional "zh" tag? IANA registers "zh" as a "macrolanguage", a term I am not sure how best to render: macro language, or language family? The view taken by BCP 47 is that Chinese encompasses several languages, so it appears to treat the Chinese dialects as independent languages.

Therefore, both of the following forms are valid:

  1. <html lang="zh-cdo-Hans"> (zh as the main language in the 1st position)
  2. <html lang="cdo-Hans"> (cdo, this dialect, as the main language in the 1st position)

With the dialect confusion cleared up, my personal suggestion is to use "zh" as the primary language. I don't want to get into political discussions or deep academic questions. The only reason I suggest "zh" as the sole primary language subtag is to avoid confusion. Who knows how many more languages of China the people at IANA will approve in the future? Do we have to memorize them all? Or ask an AI or consult a dictionary every time we maintain the code, just to find out which obscure language or dialect a tag refers to?

"zh" means Chinese, and "zh-xxx" still means Chinese, only with the characteristics of a particular dialect taken into account. Written this way, the tag cannot be misunderstood.

How to Write the lang Attribute in HTML

Shorter is better!

The W3C recommendation is:

The golden rule when creating language tags is to keep the tag as short as possible.

Therefore, the W3C example "zh-Hans" (Simplified Chinese) is the best choice. Personally, I think plain "zh" is also fine, and it remains acceptable when a page mixes Simplified and Traditional Chinese.
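
As a minimal sketch of the "keep it short" rule, the page below declares zh-Hans once on the root element and overrides it only where the content actually differs. The page title and sample sentences are made up for illustration.

```html
<!DOCTYPE html>
<html lang="zh-Hans">
  <head>
    <meta charset="utf-8">
    <title>示例页面</title>
  </head>
  <body>
    <!-- No lang attribute here: the paragraph inherits zh-Hans from <html> -->
    <p>这一段继承了根元素的语言。</p>

    <!-- Override only for the passage written in Traditional Chinese -->
    <blockquote lang="zh-Hant">這一段是繁體中文。</blockquote>
  </body>
</html>
```

Because descendants inherit the language of their nearest ancestor with a lang attribute, one short tag on the root element is usually enough.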

How to Write CSS Selectors

Only write the prefix!

In some open-source projects on GitHub, I have seen authors diligently list every language variety and extlang in their CSS (e.g., zh, zh-CN, zh-SG, zh-Hans, cmn, cmn-Hans, zh-cmn-Hans). I admire the diligence, and I believe the author must have consulted a lot of material to collect all seven Dragon Balls. However, this is unnecessary. As the figure shows, no matter how complex the lang attribute on the html tag is, the CSS only needs the prefix in order to match; there is no need to list every dialect. An exhaustive list will sooner or later miss a combination, or fail to cover dialects registered in the future, and then the CSS stops matching.

[image: a CSS selector containing only the prefix, matching a complex lang attribute on the html tag]

If you are not sure what you are doing, do not list tags exhaustively; just write the prefix.
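
The prefix approach can be sketched as follows. This is an illustration using the standard :lang() pseudo-class and the |= attribute selector, not necessarily the exact selector the figure used; the border and padding declarations are placeholders chosen only because they are not inherited, so it is easy to see which elements matched.

```html
<!DOCTYPE html>
<html lang="zh-cmn-Hans-CN">
  <head>
    <meta charset="utf-8">
    <title>lang prefix matching</title>
    <style>
      /* Matches every element whose language is "zh" or starts with "zh-",
         including elements that inherit the language from an ancestor. */
      :lang(zh) {
        border-left: 3px solid currentColor;
      }

      /* Attribute-selector form: matches only elements that carry the lang
         attribute themselves, with a value of "zh" or starting with "zh-". */
      [lang|="zh"] {
        padding-left: 0.5em;
      }
    </style>
  </head>
  <body>
    <!-- Matched by :lang(zh) through the inherited zh-cmn-Hans-CN -->
    <p>简体中文</p>

    <!-- Matched by both selectors -->
    <p lang="zh-Hant-HK">繁體中文</p>

    <!-- Matched by neither selector: the tag does not start with zh -->
    <p lang="cmn-Hans">普通话</p>
  </body>
</html>
```

The last paragraph also shows why keeping "zh" in the 1st position helps: a tag such as cmn-Hans falls outside the zh prefix and would need its own selector.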

Compatibility Issues

In practice, many Chinese websites use zh-CN. The tag is well-formed, since the 2nd and 3rd positions are optional, but it does skip both the dialect and the script positions, and I am not sure how many browsers, if any, have trouble with it. From the specification's point of view, if the CN element really must be expressed, zh-Hans-CN may be the better choice (it omits only the dialect position, which matters mainly to screen readers).

References

  1. W3C Documentation
  2. IANA Registered Subtags
  3. BCP 47
  4. RFC 5646