Template:BCP47/doc

Usage

This template fixes the language codes used by Wikimedia sites that currently violate the BCP47 standard, so that they can be included in the HTML lang attribute.

It also normalizes these codes to lowercase (this is not mandatory, as BCP47 tags are case-insensitive).

Use this template only in HTML lang="" attributes, in XML xml:lang="" pseudo-attributes, or in CSS lang() selectors, to tag the language of content or to create localisations.

Notes

The template does not replace language family codes (part of ISO 639-5), such as "cr"="cre" for the Cree family, with any one of their individual languages or macrolanguages (most of these family codes have no supported language code for localisation in MediaWiki). However, Wikimedia sites make some assumptions in a few cases:

  • "bh"="bih" (Bihari family, the only language family inherited from ISO 639-1 with a 2-letter code) is interpreted as meaning only the single language "bho" (other members of the Bihari family have their own ISO 639-3 codes and may have their own supported language codes for use in interwikis).
  • All other language families come from ISO 639-2 with both a 2-letter and a 3-letter code, or are present only in ISO 639-5 with a 3-letter code. All these codes are left unchanged by this template, as they are valid in BCP47 for language tagging of contents (even if they are not very precise). This includes codes like "roa" for Romance languages.
  • "cr"="cre" (Cree family, inherited from ISO 639-2 with a 2-letter code and a 3-letter code, and later assigned the same 3-letter code "cre" in ISO 639-5) and all other family codes are left unchanged, even if they are not supported language codes in Wikimedia (so they should not be used for localisation, but may be used for data classification in some templates). "cr" is however a supported language code for interwikis, but the Cree wiki sites (such as Wikipedia) are in fact multilingual and have to support all languages of the family with separately tagged contents. The "cr" language code is still valid under BCP47, even if it should be deprecated for language tagging of HTML/XML contents and replaced by individual language codes or macrolanguage codes. The Cree Wikipedia site, for example, is unable to define its own default content language, so users have to select their preferred language in their browser. The Universal Language Selector does not work there, as it cannot return "cr" as the current selection for user preferences. It may however return some individual language code from the Cree family, but contents (pages, templates, media...) will then need to be sorted, or the site should support multilingual internationalisation like Commons.

Most macrolanguages are left unchanged by this template, but some of them are treated as supported languages (with a valid interwiki code):

  • "zh"="zho" (Chinese macrolanguage) actually means "cmn" for Mandarin in Wikimedia sites (there is little or no difference between members of this macrolanguage in their written form using Han sinograms). For some media however (audio and video, or phonetic transcriptions), the difference between members of the macrolanguage is significant. For this reason, members of the Chinese macrolanguage other than Mandarin must use their own ISO 639-3 individual language code.
  • Some old BCP47 codes (which have been retired from ISO 639) are no longer supported and have replacement codes. This is the case of "mo" for Moldavian, now considered an alias of "ro" for Romanian. However, in Wikimedia sites the interwiki code "mo" means Romanian in the Cyrillic script, i.e. "ro-cyrl", whereas the language code "ro" is actually the same as "ro-Latn" (both "ro" and "mo" are used as interwiki codes, even if only one is valid for BCP47).

All violations of the BCP47 standard by Wikimedia language codes (used in interwikis) are fixed by this template, but most language codes are not checked for existence or for being well-formed, so they will remain intact if the template is used with such codes. The template also makes no assumption about whether these languages are supported by Wikimedia: all valid BCP47 codes will be accepted and violations in supported language codes will be corrected, but the template cannot be used to validate these codes.

However, all language codes supported by Wikimedia sites should be replaced by conforming BCP47 codes in this template, so that HTML validators will not warn or return an error about them in pages rendered by Wikimedia sites, and so that no browser or other software (such as web content indexing robots) will fall into a quirks mode when rendering text, or misinterpret it.

All "grandfathered" language tags in BCP47 are supported and converted to their preferred code where possible, if one exists in the IANA database. Some of them have been retired and have no replacement code: the template will make some reasonable matches to standard tags, but where this is impossible due to ambiguity, these tags will be left unchanged, unless a tag is used with a defined precise meaning in a Wikimedia site that disambiguates it. For example, "zh-min-nan" is grandfathered with a preferred tag "nan", and this template returns it.
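
The lookup described above can be sketched in Python. This is only an illustration, not the template's actual implementation: the names GRANDFATHERED and resolve_grandfathered are hypothetical, the table is a partial copy of grandfathered entries from the IANA Language Subtag Registry, and the "reasonable matches" the template makes for retired tags are not modelled (such tags are simply returned unchanged).

```python
# Partial table of grandfathered tags and their Preferred-Value in the
# IANA Language Subtag Registry. None means the registry lists no replacement.
# (Hypothetical names; this is a sketch, not the template's own code.)
GRANDFATHERED = {
    "art-lojban": "jbo",
    "i-klingon": "tlh",
    "i-lux": "lb",
    "i-navajo": "nv",
    "no-bok": "nb",
    "no-nyn": "nn",
    "zh-guoyu": "cmn",
    "zh-hakka": "hak",
    "zh-min-nan": "nan",
    "zh-xiang": "hsn",
    "i-default": None,
    "i-mingo": None,
    "zh-min": None,
}


def resolve_grandfathered(tag: str) -> str:
    """Return the preferred tag if one is registered, else the lowercased input."""
    key = tag.lower()
    preferred = GRANDFATHERED.get(key)
    return preferred if preferred is not None else key


print(resolve_grandfathered("zh-min-nan"))  # nan
print(resolve_grandfathered("i-mingo"))     # i-mingo (no replacement registered)
```

Tags that are not in the table, such as "en", pass through unchanged, which mirrors the behaviour described for unaffected codes.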

All "redundant" language tags of BCP47 (which existed before RFC 5646) are supported and will be left unchanged: they are valid, even if their registration is now redundant, as they have an equivalent meaning when decomposed into their subtags. Some registered redundant tags have defined "preferred" codes, and this template returns them. However, it cannot normalize all the many possible redundant tags that have not been registered in the IANA database (and that never will be). This would require a complete Lua module for Scribunto, or a PHP library exposed to Lua or as a MediaWiki extension, and is impossible to write in a template with reasonable performance.
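
To illustrate what "decomposed into their subtags" means, here is a minimal Python sketch that splits a tag into language, extended language, script and region subtags using only the length and character class of each subtag. It is an assumption-laden simplification: the function name decompose is hypothetical, nothing is checked against the IANA database, and variants, extensions and private-use subtags are all collected under "other".

```python
def decompose(tag: str) -> dict:
    """Split a language tag into subtags by shape only (no registry lookup)."""
    subtags = tag.lower().split("-")
    parts = {
        "language": subtags[0],
        "extlang": [],
        "script": None,
        "region": None,
        "other": [],
    }
    for sub in subtags[1:]:
        alpha = sub.isascii() and sub.isalpha()
        digits = sub.isascii() and sub.isdigit()
        # Nothing later than the current kind of subtag may have been seen yet.
        nothing_later = not parts["other"] and parts["region"] is None
        if (len(sub) == 3 and alpha and nothing_later
                and parts["script"] is None and len(parts["extlang"]) < 3):
            parts["extlang"].append(sub)   # e.g. "cmn" in "zh-cmn"
        elif len(sub) == 4 and alpha and nothing_later and parts["script"] is None:
            parts["script"] = sub          # e.g. "hans"
        elif ((len(sub) == 2 and alpha) or (len(sub) == 3 and digits)) and nothing_later:
            parts["region"] = sub          # e.g. "cn"
        else:
            parts["other"].append(sub)     # variants, "x" private use, etc.
    return parts


print(decompose("zh-cmn-hans-cn"))
print(decompose("cbk-x-zam"))
```

A tag such as "zh-cmn-hans-cn" is fully described by its subtags, which is why a separate registration for the whole tag adds no information.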

Only one code is still left intact by this template even though its use in Wikimedia conflicts with another language: "ksh" actually means "Kölsch" (Colonian) in standards, but in a small Wikipedia it is used to mean "Ripuarian" (a dialectal branch of Central Franconian). Central Franconian is not yet encoded by itself, but it is a dialectal branch within Old High German, which has a standard code "goh"; Ripuarian has no defined standard code yet (and it has not been registered in the IANA database as a valid variant of Central Franconian). Ideally it should be replaced by the code of Old High German, with a private extension appended after "-x-" (such as "goh-x-rip"). "ksh" however is a valid BCP47 code, even if it is used incorrectly on the Ripuarian Wikipedia. It is not clear how Ripuarian is distinct from Kölsch, or whether the Ripuarian Wikipedia mixes these two Germanic languages in their current modern use: according to sources, Ripuarian (under this name) has been an extinct language since about 1100, but the term is reused today as an alias for the modern "Kölsch" (Colonian), also referenced as "German (Ripuarian)", so "goh" would be a wrong match and "ksh" seems acceptable for use in Wikipedia as a living language.

Examples

  • Unaffected codes (most of them, although valid for use in HTML/XML/CSS, are not valid interwikis), for example:
    • "{{BCP47|en}}" returns: "en"
    • "{{BCP47|nb}}" returns: "nb" (same as "no" only where it's used by interwikis on Wikimedia sites)
    • "{{BCP47|zh-hans}}" returns: "zh-hans"
    • "{{BCP47|zh-hant}}" returns: "zh-hant"
    • "{{BCP47|zh-cn}}" returns: "zh-cn"
  • Changes preferable with BCP47 for improved interoperability (not really violations, these replacements for HTML/XML/CSS should not be used for interwikis, as long as their domain names are not aliased):
    • "{{BCP47|be-x-old}}" returns: "be-tarask"
    • "{{BCP47|bh}}" returns: "bho" (it's an old ISO 639-1 code for a language family, valid for BCP47 but not recommended; Wikimedia uses it to mean only "bho" in that family, and replaces this ISO 639-1 code in HTML contents by the ISO 639-3 code)
    • "{{BCP47|mo}}" returns: "ro-cyrl"
    • "{{BCP47|no-bok}}" returns: "nb" (same as "no", only where "no" is used by interwikis on Wikimedia sites to mean only Bokmål plus Riksmål as its minor dialectal variant, but not Nynorsk as allowed in BCP47 and ISO 639-1, where "no" is a macrolanguage)
    • "{{BCP47|no-nyn}}" returns: "nn"
    • "{{BCP47|zh-cmn}}" returns: "cmn" (alias of "zh" on Wikimedia sites?)
    • "{{BCP47|zh-wuu}}" returns: "wuu"
    • "{{BCP47|zh-yue}}" returns: "yue"
  • The following codes are not recommended (they are unnecessary aliases with preferred values) but they are still conforming to the standard and are not affected (this template still does not resolve them to their canonical codes):
    • "{{BCP47|en-latn}}" returns: "en-latn" (alias of "en"?)
    • "{{BCP47|zh-cmn-hans-cn}}" returns: "zh-cmn-hans-cn" (alias of "cmn-hans-cn"? or simply "zh-hans-cn"?)
  • Changes required by BCP47 (due to standard violation), using standard codes when they exist (but these replacements for HTML/XML/CSS conformance are not valid in interwikis, as long as their domains are not aliased):
    • "{{BCP47|als}}" returns: "gsw"
    • "{{BCP47|bat-smg}}" returns: "sgs"
    • "{{BCP47|fiu-vro}}" returns: "vro"
    • "{{BCP47|roa-rup}}" returns: "rup"
    • "{{BCP47|simple}}" returns: "en"
    • "{{BCP47|sr-sc}}" returns: "sr-cyrl"
    • "{{BCP47|sr-sl}}" returns: "sr-latn"
    • "{{BCP47|zh-classical}}" returns: "lzh"
  • Changes required by BCP47 (due to standard violation), currently using private-use extensions:
    • "{{BCP47|cbk-zam}}" returns: "cbk-x-zam"
    • "{{BCP47|de-formal}}" returns: "de"
    • "{{BCP47|eml}}" returns: "it-x-eml"
    • "{{BCP47|map-bms}}" returns: "map-x-bms"
    • "{{BCP47|nl-informal}}" returns: "nl"
    • "{{BCP47|nrm}}" returns: "fr-x-nrm" (note that Norman is actually an unencoded macrolanguage with three separate modern branches: one is the continental minority regional language, considered a variant of French and called "Normand" in French; the two others are Jèrriais and Guernésiais, which now both have local official status in Jersey and Guernsey as individual languages. None of these three modern languages, nor the macrolanguage if there is one, nor the historic Old Norman from which they descend, is encoded).
    • "{{BCP47|roa-tara}}" returns: "it-x-tara"
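
The examples above can be summarised as a lookup table with a pass-through default. The following Python sketch is a hypothetical model of that behaviour, assuming only what this page documents: the names FIXES and bcp47 are invented for illustration, the table contains just the input/output pairs listed in this documentation, and the real template may handle more codes.

```python
# Input/output pairs taken from the examples on this page (hypothetical model).
FIXES = {
    # Preferable changes for interoperability
    "be-x-old": "be-tarask", "bh": "bho", "mo": "ro-cyrl",
    "no-bok": "nb", "no-nyn": "nn",
    "zh-cmn": "cmn", "zh-wuu": "wuu", "zh-yue": "yue", "zh-min-nan": "nan",
    # Violations replaced by standard codes
    "als": "gsw", "bat-smg": "sgs", "fiu-vro": "vro", "roa-rup": "rup",
    "simple": "en", "sr-sc": "sr-cyrl", "sr-sl": "sr-latn",
    "zh-classical": "lzh",
    # Violations replaced using private-use extensions, or by dropping a suffix
    "cbk-zam": "cbk-x-zam", "de-formal": "de", "eml": "it-x-eml",
    "map-bms": "map-x-bms", "nl-informal": "nl", "nrm": "fr-x-nrm",
    "roa-tara": "it-x-tara",
}


def bcp47(code: str) -> str:
    """Lowercase the code and replace it only if it is a known violation.

    Unknown codes are returned as-is: nothing is validated, so a malformed
    input stays malformed.
    """
    key = code.lower()
    return FIXES.get(key, key)


for code in ("als", "EN", "zh-hans", "nrm", "not a tag!"):
    print(code, "->", bcp47(code))
```

The last input shows the limitation stated in the notes: because unknown codes are passed through without any check, the function (like the template) cannot be used to validate language codes.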

See also