W3C Group Draft Note 07 January 2025
This version: https://www.w3.org/TR/2025/DNOTE-string-search-20250107/
Latest published version: https://www.w3.org/TR/string-search/
Latest editor's draft: https://w3c.github.io/string-search/
History: https://www.w3.org/standards/history/string-search/
Editor: Addison Phillips (Invited Expert)
Feedback: GitHub w3c/string-search
Copyright © 2016-2025 World Wide Web Consortium. W3C® liability, trademark and permissive document license rules apply.
This document describes string searching operations on the Web in order to allow greater interoperability. String searching refers to natural language string matching such as the "find" command in a Web browser. This document builds upon the concepts found in Character Model for the World Wide Web 1.0: Fundamentals [CHARMOD] and Character Model for the World Wide Web 1.0: String Matching [CHARMOD-NORM] to provide authors of specifications, software developers, and content developers the information they need to describe and implement search features suitable for global audiences.
This section describes the status of this document at the time of its publication. A list of current W3C publications and the latest revision of this technical report can be found in the W3C technical reports index at https://www.w3.org/TR/.
Caution: work in progress
This document is not being actively developed by the Internationalization Working Group. It was created to capture information about substring matching in natural language text that was off-topic in other I18N documents as well as to serve as a ready reference for conversations about the problems of substring matching on the Web.
Readers should not expect that the materials found here represent full guidance for the implementation or specification of "find" features.
This document was published by the Internationalization Working Group as a Group Draft Note using the Note track.
Group Draft Notes are not endorsed by W3C nor its Members.
This is a draft document and may be updated, replaced or obsoleted by other documents at any time. It is inappropriate to cite this document as other than work in progress.
The W3C Patent Policy does not carry any licensing requirements or commitments on this document.
This document is governed by the 03 November 2023 W3C Process Document.
This document describes the problems, requirements, and considerations for specification or implementations of string searching operations. A common example of string searching is the "find" command in a Web browser, but there are many other forms of searching that a specification might wish to define.
This document builds on Character Model for the World Wide Web: Fundamentals [CHARMOD] and Character Model for the World Wide Web: String Matching [CHARMOD-NORM]. Understanding the concepts in those documents is important to being able to understand and apply this document successfully.
The main target audience of this specification is W3C specification developers who need to define some form of search or find algorithm: the goal is to provide a stable reference to the concepts, terms, and requirements needed.
The concepts described in this document provide authors of specifications, software developers, and content developers with a common reference for consistent, interoperable text searching on the World Wide Web. Working together, these three groups can build a globally accessible Web.
This document contains best practices and requirements for other specifications, as well as recommendations for implementations and content authors. These best practices for specifications (and others) can also be found in the Internationalization Working Group's document Internationalization Best Practices for Spec Developers [INTERNATIONAL-SPECS], which is intended to serve as a general reference for all Internationalization best practices in W3C specifications.
In this document [RFC2119] keywords in uppercase italics have their usual meaning. We also use these stylistic conventions:
Definitions appear with a different background color and decoration like this.
Best practices appear with a different background color and decoration like this.
Issues, gaps, and recommendations for future work appear with a different background color and decoration like this.
This section contains terminology specific to this document.
Much of the terminology needed to understand this document is provided by the Internationalization Glossary [I18N-GLOSSARY]. Some terms are also defined by [CHARMOD-NORM] and can be found in the Terminology and Notation section of that document.
Unicode, also known as the Universal Character Set, allows Web documents to be authored in any of the world's writing systems, scripts, or languages, on any computing platform, and then to be exchanged, read, and searched by the Web's users around the world. The first few chapters of the Unicode Standard [Unicode] provide useful background reading. Also see the Unicode Collation Algorithm [UTS10], which contains a chapter on searching.
Corpus The natural language text contained by a document or set of documents which the user would like to search.
Segmentation The process of breaking natural language text up into distinct words and phrases. This often includes operations such as "named entity recognition" (such as recognizing that the three word sequence Dr. Jonas Salk is a person's name).
Stemming A process or operation that reduces words to their "stem" or root. For example, the words runs, ran, and running all share the stem run. This is sometimes called (more formally) lemmatization and the stem is sometimes called the lemma.
Full-Text Search refers to searches that process the entire contents of the textual document or set of documents. Full-text queries perform linguistic searches against text data in full-text indexes by operating on words and phrases based on the rules of a particular language such as English or Japanese. Full-text queries can include simple words and phrases or multiple forms of a word or phrase.
Frequently this means that a full-text search employs indexes and natural language processing. When you are using a search engine, you are using a form of full text search. Full text search often breaks natural language text into words or phrases (this is called segmentation) and may apply complex processing to get at the semantic "root" values of words (this is called stemming). These processes are sensitive to language, context, and many other aspects of textual variation.
Natural Language Processing (NLP) refers to the domain of software designed to understand, process, and manipulate human languages (that is, natural language). This is a very wide ranging term. It can cover relatively simple problems, such as word tokenization, or more complex behaviors, such as deriving "meaning" from text, recognizing parts of speech, performing accurate translation, and much else.
Users of the Web often want to search for specific text in a document or collection of documents without having to read line-by-line. Specifications sometimes seek to support this desire by exposing text searching in the Web platform.
There are different types of document searching. One type, called a full text search, is the sort of searching most often found in applications such as a search engine. This type of searching is complex, can be resource intensive, and often depends on processes outside the scope of a given search request.
A more limited form of text search (and the topic of this document) is sub-string matching. One familiar form of sub-string matching is the find feature of browsers and other types of user-agent. For user agents with physical keyboards, this functionality is often accessed via a key combination such as Cmd+F or Ctrl+F. Such a feature might be exposed on the Web via the API window.find, which is currently not fully standardized, or capabilities such as the proposed [SCROLL-TO-TEXT-FRAGMENT].
Textual search is different from the sorts of programmatic matching needed by formal languages, such as markup languages like [HTML]; style sheets [CSS21]; or data formats such as [TURTLE] or [JSON-LD]. String matching in formal languages is described by our document String Matching [CHARMOD-NORM].
Find operations can provide optional mechanisms for improving or tailoring the matching behavior. Examples include the ability to add (or remove) case sensitivity, support for aspects of a regular expression language such as wildcard characters, or the option to limit matches to whole words.
One way that sub-string matching usually differs from full-text search is that, while it might use various algorithms in an attempt to suppress or ignore textual variations, it usually does not produce matches that contain additional or unspecified character sequences, words, or phrases, such as would result from stemming or other NLP processes.
When attempting to standardize sub-string matching, specification authors often struggle with the complexity that is inherent in the encoding of natural language in computer systems, including the different mechanisms employed to encode characters in the [Unicode] standard.
Quite often, the user's input doesn't consist of exactly the same sequence of code points as that used in the document being searched, while the user still expects a match to occur. This can happen for a variety of reasons. Sometimes it is because the text being searched varies in ways the user could not have predicted. In other cases it is because the user's keyboard or input method does not provide ready access to the textual variations needed. It can even be because the user cannot be bothered to input the text accurately.
In this section, we examine various common cases known to us which specification authors need to take into consideration when specifying a sub-string match API or mechanism.
User expectations about whether their search term matches a given part of a document or corpus sometimes depend on the user's language, the language of the document, or both. They might also involve other factors, such as which keyboards or input methods are available on a given device. This might be because various operations that are part of searching, such as case folding, are locale-affected, or because, given the complexity of human language and culture, expectations about matching or about the use and interpretation of various character sequences differ, even within a given script. Similarly, the handling of accents, alternate scripts, or character encoding (such as variations in the formation of grapheme clusters) is linked to the specific language of the text in question.
It is important to emphasize that we mean language here, and not script. Many different languages that share a script apply different processing or imply different expectations.
Implementations of a "find" feature often have to guess what language the user intended based solely on the user's input or on various "hints" in the runtime environment, such as the operating environment locale, the user agent's localization, or the language of the active keyboard. These hints are, at best, a proxy for the user's intent, particularly when the user is searching a document that doesn't match any of these or when the searched document contains more than one language.
Different languages treat the letter combinations a, ae, and ä differently. English speakers expect ae to be different from a and ä. Since ä is a foreign letter, they usually expect it to match the unmarked a. German speakers expect ae and ä to be equivalent (and different from a). Finnish speakers expect all three to be separate.
Now suppose you have a sentence in Finnish: Haen Han Solon. Hän on salakuljettaja.
(For the curious, this translates to: I’ll go get Han Solo. He is a smuggler.)
The above sentence is tagged as Finnish (lang="fi"). Notice that the letter "n" attached to the end of Han Solo's name (Han Solon) is a part of Finnish grammar.
Here are some spelling variations that speakers of English, German, and Finnish might enter when performing a "find" operation on the text. (Hint: Try them in the "find" command for your browser when viewing this page.)
Finnish speakers expect that each of the above examples is a different word. They might expect that the case variation between Hän and hän might be ignored. German speakers might expect that Hän and Haen are equivalent. English speakers might expect Han to match Hän (but perhaps not the reverse, since ä is not native to English). However, the language tagging of the document doesn't seem to affect most find operations. Neither is there usually a way for the user to affect which language is applied to the search term.
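The differing expectations above can be sketched as a language-dependent search key. This is a minimal illustration, not a recommended algorithm: the `A_UMLAUT_FOLDING` table, `search_key`, and `matches` are hypothetical names, the table covers only the letter ä, and the text is assumed to use the precomposed character U+00E4. The folding is symmetric, so it does not model the "perhaps not the reverse" expectation mentioned for English.

```python
# Hypothetical per-language treatment of U+00E4 (a with diaeresis).
# Illustrative only; not real locale data.
A_UMLAUT_FOLDING = {
    "en": "a",        # English: treated as an accented "a"
    "de": "ae",       # German: interchangeable with "ae"
    "fi": "\u00e4",   # Finnish: a distinct letter, left unchanged
}

def search_key(text: str, lang: str) -> str:
    """Case-fold the text, then fold U+00E4 according to the language."""
    folded = text.casefold()
    return folded.replace("\u00e4", A_UMLAUT_FOLDING.get(lang, "\u00e4"))

def matches(term: str, text: str, lang: str) -> bool:
    """Substring match on language-dependent search keys."""
    return search_key(term, lang) in search_key(text, lang)

sentence = "H\u00e4n on salakuljettaja."
print(matches("Han", sentence, "en"))   # English expectation
print(matches("Haen", sentence, "de"))  # German expectation
print(matches("Han", sentence, "fi"))   # Finnish expectation
```

The same search term produces different results depending on the language supplied, which is exactly the information that most "find" implementations do not have.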
Here is a phrase that we believe means warm marrow in Turkish: ılık ilik.
Here are some spelling variations that English and Turkish speakers might enter:
| ILIK | U+0049 U+004C U+0049 U+004B |
| İLİK | U+0130 U+004C U+0130 U+004B |
| ilik | U+0069 U+006C U+0069 U+006B |
| ılık | U+0131 U+006C U+0131 U+006B |
Depending on your browser and runtime locale, you can get anomalous matching with these terms. In some browsers, the first three terms above consistently match ilik (with an ASCII dotted-i) but not the word ılık with ı [U+0131 LATIN SMALL LETTER DOTLESS I].
This is not what Turkish users would expect, since they expect "I"/"ı" and "İ"/"i" to be caseless pairs. A side-effect of this is that the search term "ılık" only matches its lowercase equivalent—and that the uppercase variations do not match that word, even when they match the lowercase version with dotted letter i ("ilik"). Such variation means that both English and Turkish users will notice that the search misses words.
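The mismatch can be reproduced with a short sketch. Python's built-in `str.lower()` is not locale-sensitive, so it pairs "I" with "i". The `turkish_lower` helper below is a hypothetical illustration of a Turkish-tailored mapping; it handles only the two dotted and dotless pairs and is not a complete locale implementation.

```python
def turkish_lower(text: str) -> str:
    """Lowercase with the Turkish pairing I/\u0131 and \u0130/i (sketch only)."""
    # Map the two Turkish-specific uppercase letters first, then let the
    # default lowercasing handle everything else.
    return text.replace("I", "\u0131").replace("\u0130", "i").lower()

upper_dotless = "ILIK"             # U+0049 U+004C U+0049 U+004B
upper_dotted = "\u0130L\u0130K"    # U+0130 U+004C U+0130 U+004B

# Default (language-neutral) lowercasing pairs "I" with dotted "i".
print(upper_dotless.lower() == "ilik")
# Turkish-tailored lowercasing pairs "I" with dotless U+0131.
print(turkish_lower(upper_dotless) == "\u0131l\u0131k")
print(turkish_lower(upper_dotted) == "ilik")
```

A find feature that always applies the default mapping will therefore never match the uppercase terms against ılık.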
A user might expect a term entered in lowercase to match uppercase equivalents (and perhaps vice-versa). Sub-string matching features, such as the browser "find" command, often offer a user-selectable option for matching (or not) the case of the input to that of the text.
For a survey of case folding, see the discussion in [CHARMOD-NORM].
Unicode defines canonical and compatibility relationships between characters which can impact user perceptions of string searching. For a detailed discussion of Unicode Normalization forms see Section 2.2 of [CHARMOD-NORM] as well as the definitions found in Unicode Normalization Forms [UAX15].
For example, consider the letter "K". The characters with a normalization including U+004B LATIN CAPITAL LETTER K include the following, many of which might be expected to match a letter "K" in a sub-string search request by a user because they appear to contain a logical "letter K":
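Such a set of characters can be approximated programmatically. The sketch below assumes that "a normalization including U+004B" means the compatibility decomposition (NFKD) contains that code point; `chars_normalizing_to` is a hypothetical helper name, and the result depends on the Unicode version bundled with the Python runtime.

```python
import sys
import unicodedata

def chars_normalizing_to(target: str) -> list[str]:
    """Return every character whose NFKD form contains `target`."""
    found = []
    for cp in range(sys.maxunicode + 1):
        if 0xD800 <= cp <= 0xDFFF:
            continue  # skip surrogate code points
        ch = chr(cp)
        if target in unicodedata.normalize("NFKD", ch):
            found.append(ch)
    return found

k_like = chars_normalizing_to("K")
for ch in k_like:
    # Print the code point and the character name where one is available.
    print(f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>"))
```

The output includes accented forms such as U+1E30, the compatibility character U+212A KELVIN SIGN, and the full-width letter U+FF2B, alongside units and enclosed forms that a user may or may not expect to match.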
In many complex scripts it is possible to encode letters or vowel-signs in more than one way, but the alternatives are canonically equivalent.
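As a minimal sketch of what canonical equivalence means in practice, the following compares two encodings of the Bengali vowel sign O: the single code point U+09CB and the two-part sequence U+09C7 U+09BE. The choice of this particular vowel sign and the helper name `canonically_equal` are illustrative assumptions.

```python
import unicodedata

def canonically_equal(a: str, b: str) -> bool:
    """True when two strings are identical after canonical decomposition."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)

single = "\u09CB"          # BENGALI VOWEL SIGN O
two_part = "\u09C7\u09BE"  # VOWEL SIGN E + VOWEL SIGN AA

print(single == two_part)                   # code point comparison
print(canonically_equal(single, two_part))  # comparison after normalization
```

A search that compares raw code points treats the two encodings as different text, while one that normalizes both the search term and the corpus does not.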
Some languages are written in more than one script. A user searching a document might type in text in one script, but wish to find equivalent text in both scripts.
Japanese uses two syllabic scripts, hiragana and katakana. These scripts encode the same phonemes; thus the user might expect that typing in a search term in hiragana would find the exact same word spelled out in katakana.
In the example shown here, the word nihongo (Japanese for "Japanese") is shown in both hiragana and katakana. Note that this word is usually represented by kanji (Han ideograph) characters: 日本語.
| Hiragana | にほんご | U+306B U+307B U+3093 U+3054 |
| Katakana | ニホンゴ | U+30CB U+30DB U+30F3 U+30B4 |
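One way to let a hiragana search term find katakana text is to fold both to a single script before comparing. The sketch below relies on the fact that the hiragana letters U+3041 to U+3096 sit exactly 0x60 below their katakana counterparts in Unicode; `hiragana_to_katakana` is a hypothetical helper, and it leaves kanji and all other characters untouched.

```python
def hiragana_to_katakana(text: str) -> str:
    """Shift hiragana letters (U+3041..U+3096) to the katakana block."""
    return "".join(
        chr(ord(ch) + 0x60) if "\u3041" <= ch <= "\u3096" else ch
        for ch in text
    )

hiragana = "\u306B\u307B\u3093\u3054"  # nihongo in hiragana
katakana = "\u30CB\u30DB\u30F3\u30B4"  # nihongo in katakana

print(hiragana_to_katakana(hiragana) == katakana)
```

This does nothing for the kanji spelling, which would need dictionary or reading data to match.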
Some compatibility characters were encoded into Unicode to account for single- or multibyte representation in legacy character encodings or for compatibility with certain layout behaviors in East Asian languages.
| full-width katakana | ニホンゴ | U+30CB U+30DB U+30F3 U+30B4 |
| half-width katakana (these are compatibility characters) | ﾆﾎﾝｺﾞ | U+FF86 U+FF8E U+FF9D U+FF7A U+FF9E |
| half-width Latin letters (these are ASCII letters!) | abcXYZ | U+0061 U+0062 U+0063 U+0058 U+0059 U+005A |
| full-width Latin letters (these are compatibility characters) | ａｂｃＸＹＺ | U+FF41 U+FF42 U+FF43 U+FF38 U+FF39 U+FF3A |
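Width differences of this kind are removed by Unicode compatibility normalization. The sketch below uses NFKC from Python's standard `unicodedata` module; note that NFKC also folds many other compatibility characters (ligatures, superscripts, and so on), which may be more than a given search feature wants.

```python
import unicodedata

half_width_kana = "\uFF86\uFF8E\uFF9D\uFF7A\uFF9E"          # half-width katakana
full_width_latin = "\uFF41\uFF42\uFF43\uFF38\uFF39\uFF3A"   # full-width abcXYZ

# NFKC maps half-width katakana to the ordinary katakana letters and
# composes the voiced sound mark with the preceding letter.
print(unicodedata.normalize("NFKC", half_width_kana) == "\u30CB\u30DB\u30F3\u30B4")

# NFKC maps full-width Latin letters to their ASCII counterparts.
print(unicodedata.normalize("NFKC", full_width_latin) == "abcXYZ")
```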
Many scripts have their own digit characters for the numbers from 0 to 9. In some Web applications, the familiar ASCII digits are replaced for display purposes with the local digit shapes. In other cases, the text actually might contain the Unicode characters for the local digits. Users attempting to search a document might expect that typing one form of digit will find the equivalent digits.
Here are some selected examples of different digit shapes, from zero to nine, in four scripts. Many scripts have equivalent sets of digits with distinct shapes.
| Latin | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
| Gujarati | ૦ | ૧ | ૨ | ૩ | ૪ | ૫ | ૬ | ૭ | ૮ | ૯ |
| Thai | ๐ | ๑ | ๒ | ๓ | ๔ | ๕ | ๖ | ๗ | ๘ | ๙ |
| Arabic | ٠ | ١ | ٢ | ٣ | ٤ | ٥ | ٦ | ٧ | ٨ | ٩ |
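A find implementation could make digit shapes interchangeable by folding every decimal digit to ASCII before comparison. This is a sketch using the Unicode decimal-digit property exposed by Python's `unicodedata` module; `fold_digits` is a hypothetical helper name.

```python
import unicodedata

def fold_digits(text: str) -> str:
    """Replace every Unicode decimal digit (category Nd) with its ASCII form."""
    return "".join(
        str(unicodedata.decimal(ch)) if unicodedata.category(ch) == "Nd" else ch
        for ch in text
    )

print(fold_digits("\u0AE7\u0AE8\u0AE9"))  # Gujarati digits one, two, three
print(fold_digits("\u0E54\u0E55"))        # Thai digits four, five
print(fold_digits("\u0660\u0669"))        # Arabic-Indic digits zero, nine
```

Applying the same folding to both the search term and the corpus lets any digit form find any other.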
Some languages have different orthographic traditions that vary by region or dialect or allow different spellings of the same word. Searches and spell-checking may need to know about these variations.
US English (language tag en-US) and UK English (language tag en-GB) have different spelling traditions, which manifest in different ways. For example, color versus colour or exchanging the letters s and z as in internationaliZation vs. internationaliSation. A few words have even more divergent spellings, such as jail vs. gaol.
The spelling variants for US vs UK English are mostly standardised. However, sometimes the spelling is down to personal preference (or sometimes lack of knowledge). For example, the US English word 'through' can be spelled 'thru'.
Indic script languages have many instances of this kind of problem. Sometimes these are spelling errors, but in other cases multiple spellings are acceptable.
For example, the Bengali language (language tag bn) is notorious for having a wide range of spelling variations permitted by the language: nearly 80% of Bengali words have at least two spellings. Many words have 3, 4, or more variations—with at least one word having 16 different valid spellings.
One example is the word which transliterates to the Latin script as rani, but which users may spell with different letters and vowel marks. In modern Bengali ণ [U+09A3 BENGALI LETTER NNA] and ন [U+09A8 BENGALI LETTER NA] are pronounced /n/, and ি [U+09BF BENGALI VOWEL SIGN I ] and ী [U+09C0 BENGALI VOWEL SIGN II ] are both pronounced /i/. Therefore different users might choose any of the following alternative code point sequences for the same word:
| রানি | রাণি |
| U+09B0 U+09BE U+09A8 U+09BF | U+09B0 U+09BE U+09A3 U+09BF |
| রানী | রাণী |
| U+09B0 U+09BE U+09A8 U+09C0 | U+09B0 U+09BE U+09A3 U+09C0 |
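To make the four sequences above find one another, a search could fold the letters that are pronounced alike into a single representative. The mapping below is a hypothetical illustration covering only the two pairs named in the text; it is not an established Bengali folding table, and folding this broadly could also conflate words that are genuinely distinct.

```python
# Hypothetical folding: NNA -> NA, VOWEL SIGN II -> VOWEL SIGN I.
BENGALI_FOLD = str.maketrans({
    "\u09A3": "\u09A8",
    "\u09C0": "\u09BF",
})

def bengali_key(text: str) -> str:
    """Fold same-sounding Bengali letters to one form (sketch only)."""
    return text.translate(BENGALI_FOLD)

variants = [
    "\u09B0\u09BE\u09A8\u09BF",
    "\u09B0\u09BE\u09A3\u09BF",
    "\u09B0\u09BE\u09A8\u09C0",
    "\u09B0\u09BE\u09A3\u09C0",
]
# All four spellings of "rani" reduce to the same key.
print(len({bengali_key(v) for v in variants}))
```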
Other Indic scripts provide alternative mechanisms for representing particular sounds, and in most cases either representation is considered equally valid. The most common instance of this involves representation of syllable-final nasals.
For example, the /n/ sound in the word for snake in Hindi can be written using either ँ [U+0901 DEVANAGARI SIGN CANDRABINDU] or ं [U+0902 DEVANAGARI SIGN ANUSVARA]. Both of the following are possible valid spellings:
| With ँ [U+0901 DEVANAGARI SIGN CANDRABINDU] | साँप | U+0938 U+093E U+0901 U+092A |
| With ं [U+0902 DEVANAGARI SIGN ANUSVARA] | सांप | U+0938 U+093E U+0902 U+092A |
In an additional twist to this story, two diacritics with different code points could be used here. In our previous example we used ं [U+0902 DEVANAGARI SIGN ANUSVARA ] to represent the nasal sound because the accompanying vowel-sign rises above the hanging baseline. If the vowel-sign was one that didn't rise above the hanging baseline, we would normally use ँ [U+0901 DEVANAGARI SIGN CANDRABINDU ] instead. The function of both of these diacritics is the same, but their code points are different.
The alternative use of either a letter or a diacritic for syllable-final nasals is common to several other Indian languages. In addition to Devanagari, used to write languages such as Hindi (language tag hi) or Marathi (language tag mr), scripts such as Malayalam, Gujarati, Odia, and others provide similar spelling options.
Here is an example from Malayalam (ml) showing alternative spellings of the same word.
| with U+0D03 MALAYALAM SIGN VISARGA | ദുഃഖം | U+0D26 U+0D41 U+0D03 U+0D16 U+0D02 |
| without U+0D03 MALAYALAM SIGN VISARGA | ദുഖം | U+0D26 U+0D41 U+0D16 U+0D02 |
Some languages use whitespace to separate words, sentences, or paragraphs while others do not. When performing sub-string matching, different forms of whitespace found in [Unicode] must be normalized so that the match succeeds.
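A minimal sketch of whitespace normalization follows. It relies on Python's `str.split()`, which with no arguments splits on any run of characters Python considers whitespace, including U+00A0 NO-BREAK SPACE and U+3000 IDEOGRAPHIC SPACE. The helper name is hypothetical, and invisible format characters such as U+200B ZERO WIDTH SPACE are not whitespace and are left alone.

```python
def collapse_whitespace(text: str) -> str:
    """Replace each run of Unicode whitespace with a single ASCII space."""
    return " ".join(text.split())

corpus = "string\u00A0search\u3000on the\u2003Web"
term = "search on the Web"

print(term in corpus)                       # raw comparison
print(term in collapse_whitespace(corpus))  # comparison after normalization
```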
Users will sometimes vary their input when dealing with letters that contain accents or diacritic marks when entering search terms in scripts (such as the Latin script) that use various diacritics, even though the text they are searching includes the additional marks. This is particularly true on mobile keyboards, where input of these characters can require additional effort. In these cases, users generally expect the search operation to be more "promiscuous" to make up for their failure to make the additional effort needed.
Users in languages such as French sometimes omit entering accents when inputting search terms because it is more work to enter the correct character, even though this affects the meaning. For example, they might type cote and might expect to find the variations (which have different meanings) like côte or côté, etc. This is "misspelling".
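A common way to support this kind of accent-insensitive matching is to decompose the text and drop the combining marks. The sketch below uses Python's standard `unicodedata` module; `strip_marks` is a hypothetical helper, and removing every mark is a blunt instrument that is wrong for languages where the marked letter is a distinct letter.

```python
import unicodedata

def strip_marks(text: str) -> str:
    """Remove combining marks after canonical decomposition (sketch only)."""
    decomposed = unicodedata.normalize("NFD", text)
    kept = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return unicodedata.normalize("NFC", kept)

for word in ("c\u00F4te", "c\u00F4t\u00E9", "cote"):
    # Each variant reduces to the same unaccented key.
    print(strip_marks(word))
```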
German uses several letters that have an umlaut accent, such as ö [U+00F6 LATIN SMALL LETTER O WITH DIAERESIS] or ü [U+00FC LATIN SMALL LETTER U WITH DIAERESIS]. Users sometimes will enter these accents when searching, but sometimes they replace the umlauts with the letter e. For example, instead of entering Dürst they might enter Duerst. Either spelling is recognizable and has the same meaning. The umlauts are probably "better" than the e spelling, but German speakers are not confused by the difference.
Other languages use these same characters for a different purpose than German does. The formal name of the "umlaut" diacritic in Unicode is diaeresis, which means approximately "break" or "pause". Languages such as French, Spanish, and English occasionally use the diaeresis to indicate the need to pronounce a specific letter, such as the word "ambigüedad" in Spanish or a name like "Zoë" in English.
This effect might vary depending on context as well. For example, a person using a physical keyboard may have direct access to accented letters, while a virtual or on-screen keyboard may require extra effort to access and select the same letters.
In some orthographies it is necessary to match strings with different numbers of characters.
A prime example of this involves vowel diacritics in abjads. For example, some languages that use the Arabic and Hebrew scripts do not require (but optionally allow) the user to input short vowels. (For some other languages in these scripts, the inclusion of the short vowels is not optional.) The presence or absence of vowels in the text being input or searched might impede a match if the user doesn't enter or know to enter them.
Arabic, Persian, and Urdu users generally do not enter short vowels—but some texts do include them. Searching is affected by this, but meaning generally is not. A generalized description of this might be "optional to encode" sequences.
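One possible treatment of "optional to encode" sequences is to remove them from both the search term and the corpus before matching. The sketch below strips the Arabic marks in the range U+064B to U+0652; treating exactly that range as the optional set is an assumption made for illustration, and languages in which the vowels are mandatory would need different handling.

```python
import re

# Assumed set of optional marks: tanween, fatha, damma, kasra, shadda, sukun.
_TASHKIL = re.compile("[\u064B-\u0652]")

def strip_tashkil(text: str) -> str:
    """Remove optional Arabic vowel marks (sketch under the assumption above)."""
    return _TASHKIL.sub("", text)

vowelled = "\u0643\u064E\u062A\u064E\u0628\u064E"  # k-t-b with fatha marks
plain = "\u0643\u062A\u0628"                       # k-t-b without marks

print(plain in vowelled)                 # raw comparison
print(plain in strip_tashkil(vowelled))  # comparison after stripping
```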
In some cases, visually similar or identical glyph patterns can be made from different sequences of code points. Sometimes this is intentional and variations can be removed via Unicode normalization. But there are other cases in which similar-appearing graphemes are not made the same by normalisation, and they are not semantically equivalent.
For example, here are a number of character sequences that produce the same or similar textual appearance in the Malayalam script. The inappropriate sequences should be avoided because they will cause the meaning of the text to change: searches, matching and other aspects of the text will fail to be understood by the application or the font. In some cases, fonts will indicate that there is a problem by forcing the appearance of a dotted circle or otherwise failing to render the text correctly, but this may not always be the case.
| ൈ | െെ |
| [U+0D48 MALAYALAM VOWEL SIGN AI] | [U+0D46 MALAYALAM VOWEL SIGN E + U+0D46 VOWEL SIGN E] |
| ഈ | ഇൗ |
| [U+0D08 MALAYALAM LETTER II] | [U+0D07 MALAYALAM LETTER I + U+0D57 AU LENGTH MARK] |
| ഊ | ഉൗ |
| [U+0D0A MALAYALAM LETTER UU] | [U+0D09 MALAYALAM LETTER U + U+0D57 AU LENGTH MARK] |
| ഓ | ഒാ |
| [U+0D13 MALAYALAM LETTER OO] | [U+0D12 MALAYALAM LETTER O + U+0D3E VOWEL SIGN AA] |
| ഐ | എെ |
| [U+0D10 MALAYALAM LETTER AI] | [U+0D0E MALAYALAM LETTER E + U+0D46 VOWEL SIGN E] |
| ഔ | ഒൗ |
| [U+0D14 MALAYALAM LETTER AU] | [U+0D12 MALAYALAM LETTER O + U+0D57 MALAYALAM AU LENGTH MARK] |
Some languages which use the Arabic script also have graphemes which can be encoded in more than one way. In some cases, these variations are handled by Unicode Normalization, but in other cases they are not considered equivalent by Unicode, even if they appear visually to be identical. Sometimes these variations are considered to be valid spelling variations. In other cases they are the result of a user's mistaken perception.
A number of languages are written in the Arabic script but are unrelated to the Arabic language. Some of these languages therefore require character sequences to represent sounds not present in Arabic. A significant problem for some of these languages is that these specially-encoded character sequences can be visually similar (or identical) to character sequences encoded for other uses and users may experience difficulty entering or knowing how to enter the correct sequence, such as when inputting a search term.
One such language is Kashmiri (language tag ks). Here are some selected examples one might find in Kashmiri:
| Canonically equivalent alternatives (differences resolved by Unicode Normalization) |
| إ | U+0625 ARABIC LETTER ALEF WITH HAMZA BELOW | إ | U+0627 ARABIC LETTER ALEF + U+0655 ARABIC HAMZA BELOW |
| Not canonically equivalent (differences that remain after Unicode Normalization). Many of these are linked to user perception of whether the vowel is part of the base letter (ijam) vs. separable (tashkil) |
| ێ | U+06CE ARABIC LETTER YEH WITH SMALL V | یٚ | U+06CC ARABIC LETTER FARSI YEH + U+065A ARABIC VOWEL SIGN SMALL V ABOVE |
| Confusables or spelling errors: these can be common in certain kinds of text due to gaps in keyboard support or due to a similarity in appearance |
| ئ | U+0626 ARABIC LETTER YEH WITH HAMZA ABOVE | یٔ | U+06CC ARABIC LETTER FARSI YEH + U+0654 ARABIC HAMZA ABOVE |
(For more information, see Richard Ishida's doc here.)
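The three categories can be checked mechanically. The sketch below compares each pair after applying NFC with Python's `unicodedata` module; `same_after_nfc` is a hypothetical helper name.

```python
import unicodedata

def same_after_nfc(a: str, b: str) -> bool:
    """True when two sequences are identical after NFC normalization."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)

# Canonically equivalent: normalization resolves the difference.
print(same_after_nfc("\u0625", "\u0627\u0655"))

# Not canonically equivalent: the difference remains after normalization.
print(same_after_nfc("\u06CE", "\u06CC\u065A"))

# Confusable: visually similar, but distinct to Unicode.
print(same_after_nfc("\u0626", "\u06CC\u0654"))
```

Only the first pair is unified by normalization, so a find feature that wants the other two to match needs additional, language-specific folding.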
Some languages, such as English or Arabic, use spaces between words. Other languages, such as Chinese, Japanese, or Thai, do not. Some languages use spaces to separate other text units, such as phrases. In those languages that do not use spaces between words, computing "whole word" matching often depends on the ability to determine word boundaries when the boundaries are not themselves encoded into the text.
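The following sketch shows a naive "whole word" test and why it breaks down. It treats a match as a whole word when it is not adjacent to another word character, which works for space-separated text. The helper name is hypothetical, and the approach is a simplification: it has no knowledge of word boundaries in text that does not encode them.

```python
import re

def whole_word_match(term: str, text: str) -> bool:
    """Naive whole-word test: the match must not touch other word characters."""
    pattern = r"(?<!\w)" + re.escape(term) + r"(?!\w)"
    return re.search(pattern, text) is not None

print(whole_word_match("cat", "a cat sat"))    # separate word
print(whole_word_match("cat", "concatenate"))  # embedded in a longer word

# Japanese: U+65E5 U+672C U+8A9E is a whole word in the sentence below, but
# it is followed directly by another letter, so the naive test rejects it.
print(whole_word_match("\u65E5\u672C\u8A9E", "\u65E5\u672C\u8A9E\u3092\u8A71\u3059"))
```

Handling the last case correctly requires a segmentation step, typically dictionary-based, before whole-word matching can be applied.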
This section was identified as a new area needing documentation as part of the overall rearchitecting of the document. The text here is incomplete and needs further development. Contributions from the community are invited.
Implementers often need to provide simple "find text" algorithms and specifications often try to define APIs to support these needs. Find operations on text generate different user expectations and thus have different requirements from the absolute identity matching needed by document formats and protocols. It is important to note that domain-specific requirements may impose additional restrictions or alter the considerations presented here.
Increasing input effort from the user SHOULD be mirrored by more selective matching.
When the user expends more effort on the input—by using the shift key to produce uppercase or by entering a letter with diacritics instead of just the base letter—they might expect their search results to match (only) their more-specific input.
Consider a document containing these strings: "re-resume", "RE-RESUME", "re-résumé", and "RE-RÉSUMÉ".
In the table below, the user's input (on the left) might be considered a match for the above items as follows:
| e (lowercase 'e') | "re-resume", "RE-RESUME", "re-résumé", and "RE-RÉSUMÉ" |
| E (uppercase 'E') | "RE-RESUME" and "RE-RÉSUMÉ" |
| é (lowercase 'e' with acute accent) | "re-résumé" and "RE-RÉSUMÉ" |
| É (uppercase 'E' with acute accent) | "RE-RÉSUMÉ" |
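The asymmetric behavior in the table can be sketched as a per-character rule: a plain lowercase letter in the query matches any case and any accent, an uppercase letter matches only uppercase, and an accented letter matches only the same accent. The function names are hypothetical, and the sketch assumes that each user-perceived letter is a single code point after NFC, which does not hold for every script.

```python
import unicodedata

def _split(ch: str) -> tuple[str, str]:
    """Return (base letter, combining marks) for one NFC character."""
    decomposed = unicodedata.normalize("NFD", ch)
    return decomposed[0], decomposed[1:]

def char_matches(q: str, t: str) -> bool:
    """Query character q matches text character t; more input effort is stricter."""
    q_base, q_marks = _split(q)
    t_base, t_marks = _split(t)
    if q_base.casefold() != t_base.casefold():
        return False
    if q_base.isupper() and not t_base.isupper():
        return False  # uppercase input only matches uppercase text
    if q_marks and q_marks != t_marks:
        return False  # accented input only matches the same accent
    return True

def asymmetric_find(query: str, text: str) -> bool:
    """True when the query matches somewhere in the text under the rule above."""
    q = unicodedata.normalize("NFC", query)
    t = unicodedata.normalize("NFC", text)
    n = len(q)
    return any(
        all(char_matches(a, b) for a, b in zip(q, t[i:i + n]))
        for i in range(len(t) - n + 1)
    )

docs = ["re-resume", "RE-RESUME", "re-r\u00E9sum\u00E9", "RE-R\u00C9SUM\u00C9"]
for query in ("e", "E", "\u00E9", "\u00C9"):
    print(query, [d for d in docs if asymmetric_find(query, d)])
```

Run against the four example strings, this reproduces the rows of the table above.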
When creating a string search API or algorithm, the following textual options might be useful to users:
The W3C Internationalization Working Group and Interest Group, as well as others, provided many comments and suggestions. The Working Group would like to thank: all of the contributors to the Character Model series of documents over the many years of their development.
The examples in this example were taken from a page authored by Henri Sivonen, as were a number of concepts and ideas recorded by him in this issue.
[CHARMOD] Character Model for the World Wide Web 1.0: Fundamentals. Martin Dürst; François Yergeau; Richard Ishida; Misha Wolf; Tex Texin et al. W3C. 15 February 2005. W3C Recommendation. URL: https://www.w3.org/TR/charmod/
[CHARMOD-NORM] Character Model for the World Wide Web: String Matching. Addison Phillips et al. W3C. 11 August 2021. W3C Working Group Note. URL: https://www.w3.org/TR/charmod-norm/
[CSS21] Cascading Style Sheets Level 2 Revision 1 (CSS 2.1) Specification. Bert Bos; Tantek Çelik; Ian Hickson; Håkon Wium Lie. W3C. 7 June 2011. W3C Recommendation. URL: https://www.w3.org/TR/CSS21/
[HTML] HTML Standard. Anne van Kesteren; Domenic Denicola; Dominic Farolino; Ian Hickson; Philip Jägenstedt; Simon Pieters. WHATWG. Living Standard. URL: https://html.spec.whatwg.org/multipage/
[I18N-GLOSSARY] Internationalization Glossary. Richard Ishida; Addison Phillips. W3C. 17 October 2024. W3C Working Group Note. URL: https://www.w3.org/TR/i18n-glossary/
[INTERNATIONAL-SPECS] Internationalization Best Practices for Spec Developers. Richard Ishida; Addison Phillips. W3C. 17 October 2024. W3C Working Group Note. URL: https://www.w3.org/TR/international-specs/
[JSON-LD] JSON-LD 1.0. Manu Sporny; Gregg Kellogg; Markus Lanthaler. W3C. 3 November 2020. W3C Recommendation. URL: https://www.w3.org/TR/json-ld/
[RFC2119] Key words for use in RFCs to Indicate Requirement Levels. S. Bradner. IETF. March 1997. Best Current Practice. URL: https://www.rfc-editor.org/rfc/rfc2119
[SCROLL-TO-TEXT-FRAGMENT] URL Fragment Text Directives. W3C. Draft Community Group Report. URL: https://wicg.github.io/scroll-to-text-fragment/
[TURTLE] RDF 1.1 Turtle. Eric Prud'hommeaux; Gavin Carothers. W3C. 25 February 2014. W3C Recommendation. URL: https://www.w3.org/TR/turtle/
[UAX15] Unicode Normalization Forms. Ken Whistler. Unicode Consortium. 14 August 2024. Unicode Standard Annex #15. URL: https://www.unicode.org/reports/tr15/tr15-56.html
[Unicode] The Unicode Standard. Unicode Consortium. URL: https://www.unicode.org/versions/latest/
[UTS10] Unicode Collation Algorithm. Ken Whistler; Markus Scherer. Unicode Consortium. 22 August 2024. Unicode Technical Standard #10. URL: https://www.unicode.org/reports/tr10/tr10-51.html