<!DOCTYPE html>
<html>
    <head>
        <meta charset="utf-8">
        <!--[if IE]><meta http-equiv="X-UA-Compatible" content="IE=edge,chrome=1"><![endif]-->
        <meta name="viewport" content="width=1024, user-scalable=no">

        <title>Building International Sites: Background Information</title>

        <!-- Required stylesheet -->
        <link rel="stylesheet" href="deck.js/core/deck.core.css">


        <link rel="stylesheet" href="deck.js/extensions/goto/deck.goto.css">
        <link rel="stylesheet" href="deck.js/extensions/menu/deck.menu.css">
        <link rel="stylesheet" href="deck.js/extensions/navigation/deck.navigation.css">
        <link rel="stylesheet" href="deck.js/extensions/status/deck.status.css">
        <link rel="stylesheet" href="deck.js/extensions/hash/deck.hash.css">

        <link rel="stylesheet" href="deck.js/themes/style/swiss.css">

        <link rel="stylesheet" href="custom.css">

        <script src="deck.js/modernizr.custom.js"></script>
    </head>

    <body class="deck-container">
        <section class="slide" id="intro">
            <h1>Background</h1>

            <div class="bottom">
                <ul class="inline center">
                    <li><a href="#basic-terminology">Terminology</a></li>
                    <li><a href="#cultural-differences">Cultural Differences</a></li>
                    <li><a href="#writing">Writing</a></li>
                </ul>
            </div>
        </section>

        <section class="slide terminology" id="basic-terminology">
            <h2>Terminology</h2>

            <aside class="note">
                These terms are not standardized. Since we're web-oriented, I'll follow the
                <a href="http://www.w3.org/International/questions/qa-i18n/">W3C</a>.
            </aside>

            <dl>
                <dt><dfn>Locale</dfn></dt>
                <dd>
                    A collection of preferences defining how a system should
                    behave for a target group. For example, users in the United
                    States, Great Britain and Australia mostly share a language
                    but choose different ways to spell words, display dates and
                    measure.
                </dd>

                <dt><dfn>Localization (l10n)</dfn></dt>
                <dd>
                    Adapting the user interface to behave correctly for a
                    locale. This involves a number of surprisingly complex
                    topics, ranging from basic text processing and number
                    formatting to questions about preferred colors and icons
                    and even legal requirements.
                </dd>

                <dt><dfn>Internationalization (i18n)</dfn></dt>
                <dd>
                    Making it <em>easy</em> to localize software: in general
                    this involves identifying locale dependency points and
                    adding an abstraction mechanism to manage locale-specific
                    changes
                </dd>
            </dl>
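            <p>
                A minimal sketch of a locale dependency point, assuming a
                JavaScript runtime with the standard <code>Intl</code> API and
                locale data for the three English locales mentioned above.
                The <code>formatDate</code> helper is a hypothetical name used
                only for illustration, and the exact output depends on the
                runtime's locale data.
            </p>
<pre>
```javascript
// A fixed instant, formatted in UTC so the result does not depend on
// the machine's time zone: 1 February 2013.
const when = new Date(Date.UTC(2013, 1, 1));

// Hypothetical helper: format the same date under a given locale.
function formatDate(locale) {
    return new Intl.DateTimeFormat(locale, { timeZone: "UTC" }).format(when);
}

// Same language, same date, potentially different conventions.
for (const locale of ["en-US", "en-GB", "en-AU"]) {
    console.log(locale, formatDate(locale));
}
```
</pre>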
        </section>

        <section class="slide" id="cultural-differences">
            <h2>Cultural Differences</h2>

            <p class="collapse-inactive">
                Localization is about more than cosmetic differences such as
                whether the year comes before or after the month in a date:
                basic beliefs about how the world works vary significantly
            </p>

            <ol>
                <li class="slide">
                    Geography's pretty universal, right?

                    <figure class="slide collapse-inactive" id="geography-falsehoods">
                        <figcaption class="center"><a href="http://wiesmann.codiferes.net/wordpress/?p=15187">Falsehoods Programmers Believe About Geography</a></figcaption>
                        <blockquote class="long" cite="http://wiesmann.codiferes.net/wordpress/?p=15187">
                            <ul>
                                <li>Places have only one official name</li>
                                <li>Place names follow the character rules of the language</li>
                                <li>Place names can be written with the usual character set of a country</li>
                            </ul>
                        </blockquote>
                    </figure>
                </li>
                <li class="slide">
                    Well, what about someone's name?

                    <figure class="slide collapse-inactive" id="name-falsehoods">
                        <figcaption class="center"><a href="http://www.kalzumeus.com/2010/06/17/falsehoods-programmers-believe-about-names/">Falsehoods Programmers Believe About Names</a></figcaption>
                        <blockquote class="long" cite="http://www.kalzumeus.com/2010/06/17/falsehoods-programmers-believe-about-names/">
                            <ol>
                                <li>People have exactly one canonical full name.</li>
                                <li>People have exactly one full name which they go by.</li>
                                <li value="7">People’s names do not change.</li>
                                <li value="8">People’s names change, but only at a certain enumerated set of events.</li>
                                <li value="20">People have last names, family names, or anything else which is shared by folks recognized as their relatives.</li>
                            </ol>
                        </blockquote>
                    </figure>
                </li>
                <li class="slide">
                    What about something as simple as a person's gender? That's just biology, right?

                    <figure class="slide collapse-inactive" id="gender-falsehoods">
                        <figcaption class="center"><a href="http://www.cscyphers.com/blog/2012/06/28/falsehoods-programmers-believe-about-gender/">Falsehoods Programmers Believe About Gender</a></figcaption>
                        <blockquote class="long" cite="http://www.cscyphers.com/blog/2012/06/28/falsehoods-programmers-believe-about-gender/">
                            <ul>
                                <li>There are two and only two genders</li>
                                <li>Okay, then there are two and only two biological genders.</li>
                                <li>Gender is determined solely by biology.</li>
                            </ul>
                        </blockquote>
                    </figure>
                </li>
            </ol>
        </section>

        <section class="slide" id="dealing-with-cultural-differences">
            <h2>Dealing with Cultural Differences</h2>
            <ul>
                <li>
                    The easiest way to spend less time dealing with complex
                    data is not to ask for it: do you really need to know your
                    users' gender? This is also a good way to avoid your signup
                    process feeling nosy
                </li>
                <li>
                    If you do need data, ask how much structure you need: a
                    simple “What name should we display on your profile?” field
                    is easy to build and much easier than trying to migrate a
                    simplistic system after it's full of user data
                </li>
                <li>
                    If you have to model something complicated, pay the cost
                    upfront: use a library or service, test major assumptions
                    and potential outliers regularly and think about how you'll
                    deal with problems
                </li>
            </ul>
        </section>

        <section class="slide" id="writing">
            <h1>Writing</h1>
        </section>

        <section class="slide">
            <h2>An Abbreviated History of Electronic Text</h2>
            <p>
                When people started building electronic communication systems,
                the natural approach was to assign each distinct character a
                number. Since early systems needed to be simple, each character
                was assigned a fixed-length binary number
            </p>
            <ul>
                <li>
                    <a href="http://en.wikipedia.org/wiki/Baudot_code">Baudot code</a>
                    (1870) to ITA2 (1930): 5 bits - just enough for the English alphabet
                </li>
                <li>
                    TeleTypeSetter and
                    <a href="http://en.wikipedia.org/wiki/BCD_(6-bit)">BCD</a>
                    (1928): 6 bits allowed punctuation. Unfortunately,
                    different manufacturers used different schemes, making it
                    difficult to exchange data or even mix computers and
                    printers from different manufacturers!
                </li>
                <li>
                    <a href="http://en.wikipedia.org/wiki/American_Standard_Code_for_Information_Interchange">ASCII</a> (1963, lower case added in 1967):
                    7 bits allow both upper <em>and</em> lower case!
                    Standardization should also help avoid painful conversion
                    issues between manufacturers…
                </li>
                <li class="slide">
                    <p>… but almost everyone outside the United States needs
                    more characters and uses 8 bits to store extended
                    characters beyond basic ASCII. Worse, text in mismatched
                    encodings often appears to exchange correctly until
                    someone notices the first document that uses one of the
                    differing characters!</p>

                    <p>Since there are individual languages which need more
                    than 256 characters, there's no possibility of a standard
                    8-bit encoding emerging</p>
                </li>
            </ul>
        </section>

        <section class="slide" id="writing-systems">
            <h2>The Range of Human Writing</h2>

            <figure>
                <img src="img/WritingSystemsOfTheWorld.svg" />
                <figcaption class="center">
                    <a href="http://commons.wikimedia.org/wiki/File%3AWritingSystemsOfTheWorld.svg">Writing Systems of the World</a>
                    <p class="attribution"><cite>Maximilian Dörrbecker via Wikimedia Commons (<a href="http://creativecommons.org/licenses/by-sa/3.0/">CC-BY-SA-3.0</a>)</cite></p>
                </figcaption>
            </figure>
        </section>

        <section class="slide" id="unicode">
            <h2>Unicode</h2>

            <p>
                In the 1980s, engineers from various companies began
                working on an ambitious project: a universal 16-bit character
                set which could represent every character used in human writing.
                Unicode 2.0 (1996) expanded it beyond 16 bits but the goal
                hasn't changed
            </p>

            <figure>
                <figcaption><a href="http://www.unicode.org/standard/principles.html">The Unicode® Standard: A Technical Introduction</a></figcaption>
                <blockquote class="long" cite="http://www.unicode.org/standard/principles.html">
                    <p>The Unicode Standard defines codes for characters used in <em>all the major
                        languages written today</em>. Scripts include the European alphabetic scripts,
                        Middle Eastern right-to-left scripts, and many scripts of Asia.</p>
                    <p>The Unicode Standard further includes punctuation marks,
                    diacritics, mathematical symbols, technical symbols, arrows,
                    dingbats, emoji, etc. … In all, the Unicode Standard, Version
                    6.0 provides codes for <em>109,449</em> characters from the world&#39;s
                    alphabets, ideograph sets, and symbol collections.</p>
                </blockquote>
            </figure>
        </section>

        <section class="slide terminology" id="terminology-unicode">
            <h2>Terminology</h2>

            <dl>
                <dt>Character</dt>
                <dd>
                    <q cite="http://www.unicode.org/glossary/#character">
                        The smallest component of written language that has semantic value; refers to the abstract meaning…
                    </q>
                    <p><strong>Key concept: this is not the same as a byte or number!</strong></p>
                </dd>

                <dt><dfn>Diacritic</dfn></dt>
                <dd>
                    <q cite="http://www.unicode.org/glossary/#diacritic">
                        (1) A mark applied or attached to a symbol to create a new
                        symbol that represents a modified or new value. (2) A
                        mark applied to a symbol irrespective of whether it
                        changes the value of that symbol. In the latter case,
                        the diacritic usually represents an independent value
                        (for example, an accent, tone, or some other linguistic
                        information).
                    </q>
                </dd>

                <dt>Grapheme Cluster</dt>
                <dt><dfn>Combining Character</dfn></dt>
                <dd>
                    A key concept in Unicode is that multiple characters can
                    be combined to produce a single displayed character
                    (generally, anything a user would consider a single
                    character is a “grapheme cluster”).

                    <table class="center-block align-right">
                        <tr>
                            <td colspan="3"><samp>é</samp></td>
                        </tr>
                        <tr>
                            <td colspan="3">LATIN SMALL LETTER E WITH ACUTE (U+00E9)</td>
                        </tr>
                        <tr>
                            <td><samp>e</samp></td>
                            <td><samp>´</samp></td>
                            <td><samp>é</samp></td>
                        </tr>
                        <tr>
                            <td>LATIN SMALL LETTER E (U+0065)</td>
                            <td>COMBINING ACUTE ACCENT (U+0301)</td>
                            <td></td>
                        </tr>
                    </table>
                </dd>
            </dl>
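            <p>
                The two spellings of é in the table can be inspected with a
                short sketch like this one. The <code>codePoints</code> helper
                is a hypothetical name, not part of any library; it relies
                only on standard string iteration, which walks a string one
                code point at a time.
            </p>
<pre>
```javascript
// Hypothetical helper: list the code points in a string as U+XXXX labels.
function codePoints(text) {
    return Array.from(text, function (ch) {
        return "U+" + ch.codePointAt(0).toString(16).toUpperCase().padStart(4, "0");
    });
}

// Precomposed form: a single code point.
console.log(codePoints("\u00E9")); // [ 'U+00E9' ]

// Base letter followed by a combining mark: two code points,
// but still one character as far as the reader is concerned.
console.log(codePoints("e\u0301")); // [ 'U+0065', 'U+0301' ]
```
</pre>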
        </section>

        <section class="slide terminology" id="terminology-unicode-2">
            <h2>Terminology</h2>

            <dl>
                <dt id="equivalence">Equivalence</dt>
                <dd>
                    <p>Unicode provides rules to determine when characters are
                    exactly the same (e.g. U+00F1 = U+006E + U+0303 = ñ) and also
                    when they are functionally the same (e.g. "&#xFB00;" == "ff" for
                    searching but not display).</p>

                    <p>
                        This can also apply to numbers – these are all numerically equivalent but have significantly different semantic meaning:<br/>
                        <samp>
                            <span title="DIGIT FIVE">&#x35;</span>
                            <span title="ARABIC-INDIC DIGIT FIVE">&#x665;</span>
                            <span title="EXTENDED ARABIC-INDIC DIGIT FIVE">&#x6f5;</span>
                            <span title="NKO DIGIT FIVE">&#x7c5;</span>
                            <span title="DEVANAGARI DIGIT FIVE">&#x96b;</span>
                            <span title="BENGALI DIGIT FIVE">&#x9eb;</span>
                            <span title="GURMUKHI DIGIT FIVE">&#xa6b;</span>
                            <span title="GUJARATI DIGIT FIVE">&#xaeb;</span>
                            <span title="ORIYA DIGIT FIVE">&#xb6b;</span>
                            <span title="TAMIL DIGIT FIVE">&#xbeb;</span>
                            <span title="TELUGU DIGIT FIVE">&#xc6b;</span>
                            <span title="KANNADA DIGIT FIVE">&#xceb;</span>
                            <span title="MALAYALAM DIGIT FIVE">&#xd6b;</span>
                            <span title="THAI DIGIT FIVE">&#xe55;</span>
                            <span title="LAO DIGIT FIVE">&#xed5;</span>
                            <span title="TIBETAN DIGIT FIVE">&#xf25;</span>
                            <span title="MYANMAR DIGIT FIVE">&#x1045;</span>
                            <span title="MYANMAR SHAN DIGIT FIVE">&#x1095;</span>
                            <span title="KHMER DIGIT FIVE">&#x17e5;</span>
                            <span title="MONGOLIAN DIGIT FIVE">&#x1815;</span>
                            <span title="LIMBU DIGIT FIVE">&#x194b;</span>
                            <span title="NEW TAI LUE DIGIT FIVE">&#x19d5;</span>
                            <span title="TAI THAM HORA DIGIT FIVE">&#x1a85;</span>
                            <span title="TAI THAM THAM DIGIT FIVE">&#x1a95;</span>
                            <span title="BALINESE DIGIT FIVE">&#x1b55;</span>
                            <span title="SUNDANESE DIGIT FIVE">&#x1bb5;</span>
                            <span title="LEPCHA DIGIT FIVE">&#x1c45;</span>
                            <span title="OL CHIKI DIGIT FIVE">&#x1c55;</span>
                            <span title="VAI DIGIT FIVE">&#xa625;</span>
                            <span title="SAURASHTRA DIGIT FIVE">&#xa8d5;</span>
                            <span title="KAYAH LI DIGIT FIVE">&#xa905;</span>
                            <span title="JAVANESE DIGIT FIVE">&#xa9d5;</span>
                            <span title="CHAM DIGIT FIVE">&#xaa55;</span>
                            <span title="MEETEI MAYEK DIGIT FIVE">&#xabf5;</span>
                            <span title="FULLWIDTH DIGIT FIVE">&#xff15;</span>
                            <span title="OSMANYA DIGIT FIVE">&#x104a5;</span>
                            <span title="BRAHMI DIGIT FIVE">&#x1106b;</span>
                            <span title="MATHEMATICAL BOLD DIGIT FIVE">&#x1d7d3;</span>
                            <span title="MATHEMATICAL DOUBLE-STRUCK DIGIT FIVE">&#x1d7dd;</span>
                            <span title="MATHEMATICAL SANS-SERIF DIGIT FIVE">&#x1d7e7;</span>
                            <span title="MATHEMATICAL SANS-SERIF BOLD DIGIT FIVE">&#x1d7f1;</span>
                            <span title="MATHEMATICAL MONOSPACE DIGIT FIVE">&#x1d7fb;</span>
                        </samp>
                    </p>
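                    <p>
                        A rough sketch of both kinds of equivalence, assuming a
                        runtime that provides the standard
                        <code>String.prototype.normalize</code> method. The
                        variable names are illustrative only.
                    </p>
<pre>
```javascript
// Two spellings of the same user-perceived character.
const precomposed = "\u00F1"; // LATIN SMALL LETTER N WITH TILDE
const decomposed = "n\u0303"; // LATIN SMALL LETTER N + COMBINING TILDE

// Canonical equivalence: the raw strings differ, but they match once
// both are normalized to the same form (NFC here).
const rawEqual = precomposed === decomposed;
const canonicalEqual = precomposed.normalize("NFC") === decomposed.normalize("NFC");

// Compatibility equivalence: the "ff" ligature (U+FB00) matches the
// two-letter sequence only under a compatibility form such as NFKC.
const ligature = "\uFB00";
const nfcMatch = ligature.normalize("NFC") === "ff";
const nfkcMatch = ligature.normalize("NFKC") === "ff";

console.log(rawEqual, canonicalEqual, nfcMatch, nfkcMatch); // false true false true
```
</pre>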
                </dd>

                <dt><dfn>Case Mapping</dfn></dt>
                <dd>
                    Various alphabets (Latin, Georgian, Armenian, Cyrillic,
                    etc.) have the concept of “case” and Unicode has rules for
                    converting text from one case to another, including the
                    various language-specific complications this can entail:
                    for example, the German ß (LATIN SMALL LETTER SHARP S) converts
                    to uppercase as “SS”

                    (see <cite><a href="http://unicode.org/reports/tr21/tr21-5.html">Unicode Standard Annex #21: CASE MAPPINGS</a></cite>).
                </dd>

                <dt>Case Folding</dt>
                <dd>
                    Case folding is a similar process, generally used for
                    caseless comparisons (e.g. search). In Unicode this is
                    an expanded form of the lowercase mapping which is
                    consistent but the output is not suitable for display to
                    users
                </dd>
            </dl>
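            <p>
                A small sketch of the ß example, using only the standard
                <code>toUpperCase</code> and <code>toLowerCase</code> string
                methods. It also hints at why case folding exists: simply
                lowercasing both sides is not enough for a caseless
                comparison, because the round trip does not restore the
                original word.
            </p>
<pre>
```javascript
const word = "stra\u00DFe"; // "straße"

// Full case mapping: one character becomes two, so the string grows.
const upper = word.toUpperCase();

// Lowercasing the result does not bring the sharp s back.
const roundTrip = upper.toLowerCase();

console.log(upper); // STRASSE
console.log(upper.length - word.length); // 1
console.log(roundTrip === word); // false
```
</pre>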
        </section>

        <section class="slide" id="terminology-collation">
            <h2>Terminology</h2>
            <h3>Collation</h3>

            <p>Determining how strings compare for the purposes of sorting</p>

            <div class="center-block">
                <table>
                    <caption>
                        Sample character collation rules
                    </caption>
                    <tr>
                        <th class="section" colspan="2">By Language</th>
                    </tr>
                    <tr>
                        <th>Swedish:</th>
                        <td>z &lt; ö</td>
                    </tr>
                    <tr>
                        <th>German:</th>
                        <td>ö &lt; z</td>
                    </tr>
                    <tr>
                        <th>German:</th>
                        <td><abbr title="LATIN SMALL LETTER SHARP S">ß</abbr> = ss</td>
                    </tr>

                    <tr>
                        <th class="section" colspan="2">By Context</th>
                    </tr>
                    <tr>
                        <th>French:</th>
                        <td>cote &lt; côte &lt; coté &lt; côté</td>
                    </tr>

                    <tr>
                        <th class="section" colspan="2">By Usage</th>
                    </tr>
                    <tr>
                        <th>German Dictionary:</th>
                        <td>of &lt; öf</td>
                    </tr>
                    <tr>
                        <th>German Telephone:</th>
                        <td>öf &lt; of</td>
                    </tr>
                </table>

                <p class="attribution">
                    Sources: <cite><a href="http://unicode.org/reports/tr10/">Unicode Technical Standard #10</a></cite>
                       and <cite><a href="http://en.wikipedia.org/wiki/Alphabetical_order#Language-specific_conventions">Wikipedia: Alphabetical order</a></cite>
                </p>
            </div>
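            <p>
                The rules in the table can be tried out with a sketch like
                this, assuming a runtime whose <code>Intl.Collator</code>
                ships Swedish and German collation data (results fall back to
                default rules when it does not). The <code>sortedFor</code>
                helper is a hypothetical name; <code>de-u-co-phonebk</code>
                requests the German phone book ordering where available.
            </p>
<pre>
```javascript
// Hypothetical helper: return a sorted copy of the words using the
// collation rules of the requested locale.
function sortedFor(locale, words) {
    const collator = new Intl.Collator(locale);
    return words.slice().sort(collator.compare);
}

const letters = ["\u00F6", "z"]; // "ö" and "z"
console.log("sv:", sortedFor("sv", letters).join(" "));
console.log("de:", sortedFor("de", letters).join(" "));

// Dictionary versus phone book ordering within the same language.
const pair = ["\u00F6f", "of"]; // "öf" and "of"
console.log("de dictionary:", sortedFor("de", pair).join(" "));
console.log("de phone book:", sortedFor("de-u-co-phonebk", pair).join(" "));
```
</pre>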
        </section>

        <section class="slide" id="directionality">
            <h2>Directionality</h2>
            <figure>
                <img src="img/Writing_directions_of_the_world.svg" />
                <figcaption class="center">
                    <a href="http://commons.wikimedia.org/wiki/File%3AWriting_directions_of_the_world.svg">Writing Directions of the World</a>
                    <p class="attribution"><cite>SPQRobin via Wikimedia Commons (<a href="http://creativecommons.org/licenses/by-sa/3.0">CC-BY-SA-3.0</a>)</cite></p>
                </figcaption>
            </figure>
        </section>

        <section class="slide" id="unicode-encodings">
            <h2>Unicode Encodings</h2>

            <p>
                The Unicode standard describes abstract characters but we need
                a way to convert them into bytes for storage and exchange.
                Early experiments which simply widened every character to a
                fixed 16 bits revealed significant problems:
            </p>

            <ol>
                <li>
                    Not every processor stores the bytes comprising a 16-bit
                    integer the same way — big-endian (“UNIX”) or little-endian
                    (“NUXI”) — necessitating a special byte-order mark (BOM) at
                    the start of the string simply to decode it
                </li>
                <li><strong>All</strong> existing text would need to be converted!</li>
                <li>All text becomes more expensive to store and process</li>
                <li>It wasn't enough: Chinese alone would require at least 3 bytes!</li>
            </ol>

            <p>
                UTF-8 was developed to avoid these problems. It's
                <a href="http://tools.ietf.org/html/rfc3629#page-4">a very
                clever variable-length encoding scheme</a> under which all
                existing 7-bit ASCII is valid, all common non-Asian characters
                require at most 2 bytes, common CJK characters need only 3
                bytes, and 4 bytes are used only for rare and historical
                characters. Because it's read one byte at a time, there's no
                need for a BOM.
            </p>
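            <p>
                The variable-length scheme described above can be observed
                with a short sketch, assuming a runtime that provides the
                standard <code>TextEncoder</code> (which always produces
                UTF-8). The <code>utf8Length</code> helper is a hypothetical
                name used for illustration.
            </p>
<pre>
```javascript
const encoder = new TextEncoder(); // TextEncoder always encodes to UTF-8

// Hypothetical helper: number of bytes a string occupies in UTF-8.
function utf8Length(text) {
    return encoder.encode(text).length;
}

// One sample from each length class:
// ASCII letter, accented Latin letter, CJK ideograph, emoji.
const samples = ["A", "\u00E9", "\u4E2D", "\u{1F30F}"];
for (const sample of samples) {
    console.log(sample, utf8Length(sample)); // 1, 2, 3 and 4 bytes respectively
}
```
</pre>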
        </section>

        <section class="slide" id="state-of-unicode">
            <h2>State of Unicode</h2>

            <ul>
                <li>
                    In <a href="http://googleblog.blogspot.com/2008/05/moving-to-unicode-51.html">2008</a>,
                    Google announced that for web content UTF-8 had surpassed ASCII in popularity. In
                    <a href="http://googleblog.blogspot.com/2010/01/unicode-nearing-50-of-web.html">2010</a>,
                    it was approaching 50%, and as of
                    <a href="http://googleblog.blogspot.com/2012/02/unicode-over-60-percent-of-web.html">February 2012</a>
                    it has passed 60%
                </li>
                <li>
                    <p>
                        All major operating systems and programming languages
                        support Unicode and UTF-8, although compatibility is
                        still a consideration for the more recent features
                        added in
                        <a href="http://babelstone.blogspot.com/2009/11/whats-new-in-unicode-60.html">version 6</a>
                        such as <a href="http://www.unicode.org/~scherer/emoji4unicode/snapshot/emojidata.html">emoji</a> (🌏)
                        or regional indicators (
                        &#x1F1FA; &#x1F1F8; = &#x1F1FA;&#x1F1F8;,
                        &#x1F1EB; &#x1F1F7; = &#x1F1EB;&#x1F1F7;, etc.)
                    </p>
                </li>
            </ul>
        </section>

        <section class="slide" id="complex-scripts">
            <h2>Complex scripts</h2>

            <p>
                We previously discussed how accented characters can be formed
                by combining a base character with the desired diacritic. The
                concept of multiple characters producing a visually distinct
                glyph is relatively unusual in English, where only a few
                ligatures are at all commonly used - perhaps the best known
                being the “ae” in encyclopædia - but other languages depend on
                this behaviour.
            </p>
            <p>
                A <a href="http://en.wikipedia.org/wiki/Complex_text_layout">complex text layout</a>
                system allows the visual display to be significantly altered
                based on the context. If you need to support languages written
                in complex scripts, this will affect your font choices and
                design options!
            </p>
            <div class="slide">
                <figure>
                    <figcaption class="center">
                        The name of the Arabic language as individual characters and written normally
                    </figcaption>
                    <p lang="ar" class="center-block larger">
                        <span class="slide">&#x627; &#x644; &#x639; &#x631; &#x628; &#x64A; &#x629;</span><br>
                        <span class="slide">&#x627;&#x644;&#x639;&#x631;&#x628;&#x64A;&#x629;</span>
                    </p>
                </figure>
            </div>
        </section>

        <section class="slide exit">
            <nav>
                <a href="index.html#agenda">Back to agenda</a>
            </nav>
        </section>

        <a href="#" class="deck-prev-link" title="Previous">&#8592;</a>
        <a href="#" class="deck-next-link" title="Next">&#8594;</a>

        <!-- deck.status snippet -->
        <p class="deck-status">
            <span class="deck-status-current"></span>
            /
            <span class="deck-status-total"></span>
        </p>

        <!-- deck.goto snippet -->
        <form action="." method="get" class="goto-form">
            <label for="goto-slide">Go to slide:</label>
            <input type="text" name="slidenum" id="goto-slide" list="goto-datalist">
            <datalist id="goto-datalist"></datalist>
            <input type="submit" value="Go">
        </form>

        <!-- deck.hash snippet -->
        <a href="." title="Permalink to this slide" class="deck-permalink">#</a>


        <script src="deck.js/jquery-1.7.2.min.js"></script>
        <script src="deck.js/core/deck.core.js"></script>

        <script src="deck.js/extensions/hash/deck.hash.js"></script>
        <script src="deck.js/extensions/menu/deck.menu.js"></script>
        <script src="deck.js/extensions/goto/deck.goto.js"></script>
        <script src="deck.js/extensions/status/deck.status.js"></script>
        <script src="deck.js/extensions/navigation/deck.navigation.js"></script>

        <script>
            $(function() {
                $.deck('.slide');
            });
        </script>
    </body>
</html>