handling-unicode.html

<!DOCTYPE html>
<html>
    <head>
        <meta charset="utf-8">
        <!--[if IE]><meta http-equiv="X-UA-Compatible" content="IE=edge,chrome=1"><![endif]-->
        <meta name="viewport" content="width=1024, user-scalable=no">

        <title>Handling Unicode in Python &amp; Django</title>

        <!-- Required stylesheet -->
        <link rel="stylesheet" href="deck.js/core/deck.core.css">


        <link rel="stylesheet" href="deck.js/extensions/goto/deck.goto.css">
        <link rel="stylesheet" href="deck.js/extensions/menu/deck.menu.css">
        <link rel="stylesheet" href="deck.js/extensions/navigation/deck.navigation.css">
        <link rel="stylesheet" href="deck.js/extensions/status/deck.status.css">
        <link rel="stylesheet" href="deck.js/extensions/hash/deck.hash.css">

        <link rel="stylesheet" href="deck.js/themes/style/swiss.css">

        <link rel="stylesheet" href="custom.css">

        <script src="deck.js/modernizr.custom.js"></script>
    </head>

    <body class="deck-container">
        <section class="slide">
            <h1>Handling Unicode</h1>
        </section>

        <section class="slide" id="unicode-in-python2">
            <h2>Unicode in Python 2.x</h2>

            <p>
                There's a persistent misconception that you need Python 3 to
                support Unicode. This is untrue: the default <code>""</code>
                may produce a <code>str</code> (i.e. bytes without a specified
                encoding) instance but simply prefixing any string with
                <code>u</code> produces a full <code>unicode</code> object.
            </p>

            <p>Python 2 will even helpfully convert them when you mix the two:</p>

            <samp class="block">
                &gt;&gt;&gt; type(&quot;foo&quot;)<br>
                &lt;type &#x27;<b>str</b>&#x27;&gt;<br>
                &gt;&gt;&gt; type(u&quot;bar&quot;)<br>
                &lt;type &#x27;<b>unicode</b>&#x27;&gt;<br>
                &gt;&gt;&gt; &quot;foo&quot; + u&quot;bar&quot;<br>
                u&#x27;foobar&#x27;
            </samp>

            <p class="center bottom">Background: <a href="http://docs.python.org/howto/unicode.html" title="Unicode HOWTO">http://docs.python.org/howto/unicode.html</a></p>
        </section>

        <section class="slide" id="unicode-in-python2-the-bad-parts">
            <h2>Unicode in Python 2.x: the bad parts</h2>

            <p>
                To preserve compatibility with existing code, Python 2 assumes
                byte-strings are ASCII. This means that any input source is a
                risk of <code>UnicodeDecodeError</code>s:
            </p>

            <samp class="block">
                &gt;&gt;&gt; s = &quot; &quot;.join([&#x27;I&#x27;, <b class="mistake">'\xe2\x98\xba'</b>, &#x27;Unicode&#x27;])<br>
                &gt;&gt;&gt; print s  <i># Raw UTF-8 displays correctly in my UTF-8 terminal</i><br>
                I &#x263A; Unicode<br>
                &gt;&gt;&gt; unicode(s)<br>
                Traceback (most recent call last):<br>
                  File &quot;&lt;stdin&gt;&quot;, line 1, in &lt;module&gt;<br>
                UnicodeDecodeError: &#x27;ascii&#x27; codec can&#x27;t decode byte 0xe2 in position 2: ordinal not in range(128)<br>
            </samp>

            <p>Everything works if you know the encoding and remember to decode first:</p>
            <samp class="block">
                &gt;&gt;&gt; s.decode(&quot;utf-8&quot;)<br>
                u&#x27;I \u263a Unicode&#x27;
            </samp>

            <p>
                It also “works” if you simply avoid treating bytes as text until
                you're prepared to deal with them. This was a problem in
                django-localeurl when requests were received with incorrectly
                encoded data - the rest of the system handled it but the
                redirect logic would choke on a <code>UnicodeDecodeError</code>.

                This was fixed by

                <a href="https://bitbucket.org/carljm/django-localeurl/changeset/32afa6f7e79db683ee32c4c3389ebc872db57c6c">
                    doing less
                </a>
            </p>
        </section>

        <section class="slide" id="unicode-in-python3">
            <h2>Unicode in Python 3.x</h2>
            <h3>Explicit is better than implicit</h3>

            <p>
                Learning from what was problematic in Python 2, the default
                string type is <code>unicode</code>, with a separate
                <code>bytes</code> type for raw I/O. You can't mix the two
                without an explicit conversion:
            </p>

            <samp class="block">
                &gt;&gt;&gt; &quot;☹&quot; + b&quot;bytes&quot;<br>
                Traceback (most recent call last):<br>
                  File &quot;&lt;stdin&gt;&quot;, line 1, in &lt;module&gt;<br>
                TypeError: Can&#x27;t convert &#x27;bytes&#x27; object to str implicitly<br>
                &gt;&gt;&gt; &quot;☺&quot; + b&quot; bytes&quot;.decode(&quot;ascii&quot;)<br>
                &#x27;☺ bytes&#x27;<br>
                &gt;&gt;&gt; &quot;&#x263A;&quot;.encode(&quot;utf-8&quot;) + b&quot; bytes&quot;<br>
                b&#x27;\xe2\x98\xba bytes&#x27;<br>
            </samp>

            <p class="center">Python 2 fails later; Python 3 forces you to fix it now</p>
        </section>

        <section class="slide" id="unicode-best-practices">
            <h2>Unicode Best Practices</h2>

            <p class="center note">These rules are the same for Python 2 and 3 so you won't need to change when Django migrates</p>

            <ul>
                <li>
                    <strong>Always use Unicode internally</strong>
                </li>
                <li>
                    <strong>Always convert to Unicode as soon as you receive something and encode it when you send it</strong>
                </li>
                <li>
                    <strong>Decide how to handle invalid values</strong>
                </li>
            </ul>
        </section>

        <section class="slide" id="unicode-conversion">
            <h2>Unicode Best Practices</h2>
            <h3>I/O Conversion</h3>

            <p>
                The most basic form involves str and anything str-like where
                you want to convert everything as soon as you see it:
            </p>
            <code class="block">
                s = foo.read().decode("utf-8")<br>
                print >>foo, s.encode("utf-8")
            </code>

            <p>
                For files, you might want to use the <a
                href="http://docs.python.org/library/codecs.html">codecs</a>
                module so you can read from the file normally and receive
                Unicode characters instead of byte-strings:
            </p>
            <samp class="block">
                &gt;&gt;&gt; import codecs<br>
                &gt;&gt;&gt; my_file = codecs.open(foo_file, encoding=&quot;utf-8&quot;)
            </samp>
        </section>

        <section class="slide" id="invalid-unicode">
            <h2>Unicode Best Practices</h2>
            <h3>Handling invalid data</h3>

            <p>
                In general the best policy is to reject invalid text so the
                user can correct it immediately. However, in a batch context
                you might need to process what you can. Python allows you to
                “ignore” or strip anything which fails to decode:
            </p>

            <code class="block">foo.decode("utf-8", "ignore")</code>

            <p>
                It may be preferable to instead replace each invalid character
                with the
                <a href="http://en.wikipedia.org/wiki/Specials_(Unicode_block)#Replacement_character">Unicode Replacement character �</a>
                to clearly indicate that data was lost:
            </p>

            <code class="block">foo.decode("utf-8", "replace")</code>
        </section>

        <section class="slide" id="unicode-normalization">
            <h2>Unicode Best Practices</h2>
            <h3><a href="http://unicode.org/faq/normalization.html">Normalization</a></h3>

            <p>
                There are <a href="background.html#terminology-unicode-2">multiple ways to
                display many characters</a>, producing situations where a
                program treats two values as different even though they appear
                identical to the user. We can avoid this problem by normalizing
                everything to a consistent internal representation for
                comparison:
            </p>
            <samp class="block">
                &gt;&gt;&gt; from unicodedata import normalize<br>
                &gt;&gt;&gt; precomposed = u&quot;\N{LATIN SMALL LETTER E WITH ACUTE}&quot;<br>
                &gt;&gt;&gt; decomposed = u&quot;\N{LATIN SMALL LETTER E}\N{COMBINING ACUTE ACCENT}&quot;<br>
                &gt;&gt;&gt; precomposed == decomposed<br>
                False<br>
                &gt;&gt;&gt; normalize(&quot;NFC&quot;, precomposed) == normalize(&quot;NFC&quot;, decomposed)<br>
                True
            </samp>
            <aside>
                <p>
                    Those <code>\N{<i>name</i>}</code> Unicode named literals
                    are a handy way to avoid confusion in your source code.
                    Your source should be saved as UTF-8 but that doesn't make
                    it any easier to quickly read a string with visually similar
                    characters or changes in directionality which are tricky to
                    work with in many text editors.
                </p>
                <p>
                    Set your source code with this comment on the first line or
                    second if you're writing a shell-script:<br>
                    <code class="block"># encoding: utf-8</code>
                    <code class="block">
                        #!/usr/bin/env python<br>
                        # encoding: utf-8
                    </code>
                </p>
            </aside>
        </section>

        <section class="slide" id="unicode-numeric-value">
            <h2>Other stdlib features</h2>
            <h3>Comparing numbers with Unicode</h3>

            <p>
                Python's <a href="http://docs.python.org/library/unicodedata.html">unicodedata</a>
                module has other features for working with Unicode text. Several of them allow you
                to retrieve information about characters, as in this example which generated the
                different characters in the
                <a href="background.html#terminology-unicode-2">numeric equivalence example</a>:
            </p>
            <code class="block">
            from __future__ import print_function, unicode_literals<br>
            from unicodedata import decimal<br>
            import sys<br>
            <br>
            if sys.version_info[0] == 3:<br>
            &nbsp;&nbsp;&nbsp;&nbsp;unichr = chr<br>
            <br>
            for i in range(0, sys.maxunicode):<br>
            &nbsp;&nbsp;&nbsp;&nbsp;u = unichr(i)<br>
            &nbsp;&nbsp;&nbsp;&nbsp;if decimal(u, None) == 5:<br>
            &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;print(u.encode("utf-8"))<br>
            </code>
        </section>

        <section class="slide" id="re-vs-regex">
            <h2>Unicode Best Practices</h2>
            <h3>re vs. regex</h3>

            <p>
                If you need to process text using regular expressions, you'll
                quickly learn that even with Python 3 there are significant
                drawbacks with the built-in re module. The
                <a href="http://pypi.python.org/pypi/regex">regex</a> module
                is the successor being developed as a standalone library for
                ease of testing and it has much better Unicode support:
            </p>

            <code class="block">
                &gt;&gt;&gt; re.match(r&quot;(?i)strasse&quot;, u&quot;stra\N{LATIN SMALL LETTER SHARP S}e&quot;, flags=re.UNICODE)<br>
                None<br>
                &gt;&gt;&gt; regex.match(ur&quot;(?iV1)strasse&quot;, u&quot;stra\N{LATIN SMALL LETTER SHARP S}e&quot;)<br>
                &lt;_regex.Match object at …&gt;<br>
                &gt;&gt;&gt; regex.match(ur&quot;(?iV1)stra\N{LATIN SMALL LETTER SHARP S}e&quot;, &quot;STRASSE&quot;)<br>
                &lt;_regex.Match object at …&gt;<br>
            </code>
            <p class="note">
                regex's smart behaviour depends on either <a
                href="http://pypi.python.org/pypi/regex#flags">setting the
                Unicode flag</a> or using a Unicode string as the pattern!
            </p>
        </section>

        <section class="slide" id="unicode-in-django">
            <h2>Unicode in Django</h2>

            <blockquote class="long" cite="https://docs.djangoproject.com/en/1.4/ref/unicode/">
                A bytestring does not carry any information with it about its
                encoding. For that reason, we have to make an assumption, and
                Django assumes that all bytestrings are in UTF-8.<br>
                …<br>
                In most cases when Django is dealing with strings, it
                will convert them to Unicode strings before doing
                anything else. So, as a general rule, if you pass in a
                bytestring, be prepared to receive a Unicode string
                back in the result.
            </blockquote>

            <div class="slide">
                <h4>Handy Utilities</h4>
                <ul class="slide collapse-inactive">
                    <li>
                        <a href="https://docs.djangoproject.com/en/1.4/ref/utils/#django.utils.encoding.smart_unicode">django.utils.encoding.smart_unicode()</a>:
                        attempts to convert the input to a <code>unicode</code>
                        instance. Set <code>strings_only=False</code> to leave
                        numbers, boolean and <code>None</code> unconverted
                    </li>

                    <li>
                        <a href="https://docs.djangoproject.com/en/1.4/ref/utils/#django.utils.encoding.force_unicode">django.utils.encoding.force_unicode()</a>:
                        almost identical to <code>smart_unicode</code> except that
                        lazy translations will be processed
                    </li>

                    <li>
                        <a href="https://docs.djangoproject.com/en/1.4/ref/utils/#django.utils.encoding.smart_str">django.utils.encoding.smart_str()</a>:
                        converts the input to a byte-string. Set
                        <code>strings_only=False</code> to leave numbers, boolean
                        and <code>None</code> unconverted
                    </li>
                </ul>
                <ul class="slide collapse-inactive">
                    <li>
                        <a href="https://docs.djangoproject.com/en/1.4/ref/utils/#django.utils.encoding.iri_to_uri">django.utils.encoding.iri_to_uri()</a>
                        correctly encodes Unicode domain IRIs:<br>
                        <a href="http://xn--n3h.net/">http://xn--n3h.net/</a> == <a href="http://☃.net">http://☃.net</a>
                    </li>
                    <li>
                        <a href="https://docs.djangoproject.com/en/1.4/ref/utils/#django.utils.http.urlquote">django.utils.http.urlquote()</a>
                        and
                        <a href="https://docs.djangoproject.com/en/1.4/ref/utils/#django.utils.http.urlquote_plus">django.utils.http.urlquote_plus()</a>
                        are Unicode-safe versions of the stdlib utilities with the
                        same names
                    </li>
                </ul>
                <aside class="slide collapse-inactive">
                    The default <a href="https://docs.djangoproject.com/en/1.4/ref/templates/builtins/#slugify">slugify</a>
                    filter simply drops Unicode characters, which is useless
                    for significantly non-ASCII names. Fortunately, the fine
                    developers at Mozilla have written a drop-in replacement:
                    <a href="http://pypi.python.org/pypi/unicode-slugify">unicode-slugify</a>
                </p>
            </div>

            <p class="note slide">
                See the official
                <a href="https://docs.djangoproject.com/en/1.4/ref/unicode/">Django Unicode</a>
                documentation for more information
            </p>
        </section>

        <section class="slide" id="django-and-databases">
            <h2>Unicode in databases</h2>

            <p>
                Django attempts to use Unicode for communication with the
                database. There's only one catch but it's significant: you
                <em>must</em> ensure that the database is configured to use a
                Unicode encoding when you create it. Otherwise you'll probably
                be able to store data but queries may be erratic - remember
                collation? - and data can be lost or truncated: thanks to
                UTF-8's variable length, a <code>VARCHAR(20)</code> could
                require up to 80 bytes to store!
            </p>
        </section>

        <section class="slide" id="mysql-unicode">
            <h2>Unicode in databases: MySQL</h2>

            <aside>
                <p>
                    n.b. MySQL's <code>utf8</code> only supports up to 3 bytes,
                    breaking on newer or obscure characters such as emoji.
                    <code>utf8mb4</code> support 4-bytes but requires MySQL 5.5!
                </p>

                <p>
                    The older <code>utf8_general_ci</code> is faster than
                    <code>utf8_unicode_ci</code> because it only compares
                    single characters without performing the full Unicode
                    collation rules (e.g. <samp>ß</samp> == <samp>ss</samp>)
                </p>
            </aside>

            <h4>Per database</h4>
            <code class="block">CREATE DATABASE my_site CHARACTER SET utf8 COLLATE utf8_unicode_ci;</code>

            <h4>Server-Wide (recommended)</h4>

            <code class="block">
                [mysqld]<br>
                character-set-server = utf8<br>
                collation-server = utf8_unicode_ci
            </code>

            <aside>
                If you use Amazon Web Services, here are step-by-step
                instructions for
                <a href="https://gist.github.com/1084682">creating a UTF-8
                Amazon RDS database</a>
            </aside>

            <p class="center note">
                See
                <a href="http://dev.mysql.com/doc/refman/5.1/en/globalization.html" title="MySQL :: MySQL 5.1 Reference Manual :: 10 Globalization">http://dev.mysql.com/doc/refman/5.1/en/globalization.html</a>
            </p>
        </section>

        <section class="slide" id="postgresql-unicode">
            <h2>Unicode in databases: PostgreSQL</h2>

            <h4>Per database</h4>

            <code class="block">CREATE DATABASE my_app WITH ENCODING 'UTF8';</code>
            or
            <code class="block">createdb -E UTF8 myapp</code>

            <h4>Server-Wide (recommended)</h4>
            <code class="block">initdb -E UTF8</code>

            <p class="center note">
                See
                <a href="http://www.postgresql.org/docs/9.1/static/multibyte.html" title="PostgreSQL: Documentation: 9.1: Character Set Support">PostgreSQL: Character Set Support</a>
            </p>
        </section>

        <section class="slide" id="model-data-normalization">
            <h2>Normalizing your model data</h2>

            <p class="note">
                <strong>There is no one right approach</strong>: you need to
                validate your inputs at every point but the details vary from
                project to project.
            </p>

            <p>
                My fledgling
                <a href="https://github.com/acdha/django-i18n-utils">django-i18n-utils</a>
                package provides an example of one approach:
                <a href="https://github.com/acdha/django-i18n-utils/blob/master/django_i18n_utils/models.py#L11"><code>UnicodeNormalizerMixin</code></a>
                uses the standard
                <a href="https://docs.djangoproject.com/en/1.4/ref/models/instances/#django.db.models.Model.clean_fields">
                    model validation <code>clean_fields()</code>
                </a> method to normalize all text fields to Unicode NFC:
            </p>

            <code class="block">
                class Person(UnicodeNormalizerMixin, models.Model):<br>
                    name = models.CharField(…)
            </code>

            <p>
                This works well as long as you scrupulously call
                <code>clean()</code> before saving a model. Since this is
                important for other reasons, make it a key point in testing
            </p>
        </section>

        <section class="slide" id="homoglyph-attacks">
            <h2>Unicode Security: Homoglyph Attacks</h2>

            <p>
                Unicode has a number of characters which need to be different
                for some reason but are visually quite similar. This is most
                concerning in the case of domain names if an

                <a href="http://en.wikipedia.org/wiki/IDN_homograph_attack">IDN homograph attack</a>

                allows an attacker to register something like

                <samp>http://www.p<abbr title="CYRILLIC SMALL LETTER A">а</abbr>ypal.com/</samp>

                but the concern applies any time there's something to gain from
                spoofing.
            </p>
            <p>
                Solving this requires some combination of restricting your data
                or alerting the user:

                <ul>
                    <li>
                        Browsers increasingly restrict mixed script display: Chrome
                        displays the above URL as
                        <samp>http://www.xn--pypal-4ve.com/</samp> because it mixes
                        Cyrillic and latin characters
                    </li>
                    <li>
                        Twitter simply blocks non-ASCII usernames:
                        <img src="img/twitter-homoglyphs.png" alt="Blocked attempt to register @[CYRILLIC SMALL LETTER A]cdha on Twitter">
                    </li>
                </ul>
            </p>
        </section>

        <section class="slide exit">
            <nav>
                <a href="index.html#agenda">Back to agenda</a>
            </nav>
        </section>

        <a href="#" class="deck-prev-link" title="Previous">&#8592;</a>
        <a href="#" class="deck-next-link" title="Next">&#8594;</a>

        <!-- deck.status snippet -->
        <p class="deck-status">
            <span class="deck-status-current"></span>
            /
            <span class="deck-status-total"></span>
        </p>

        <!-- deck.goto snippet -->
        <form action="." method="get" class="goto-form">
            <label for="goto-slide">Go to slide:</label>
            <input type="text" name="slidenum" id="goto-slide" list="goto-datalist">
            <datalist id="goto-datalist"></datalist>
            <input type="submit" value="Go">
        </form>

        <!-- deck.hash snippet -->
        <a href="." title="Permalink to this slide" class="deck-permalink">#</a>


        <script src="deck.js/jquery-1.7.2.min.js"></script>
        <script src="deck.js/core/deck.core.js"></script>

        <script src="deck.js/core/deck.core.js"></script>
        <script src="deck.js/extensions/hash/deck.hash.js"></script>
        <script src="deck.js/extensions/menu/deck.menu.js"></script>
        <script src="deck.js/extensions/goto/deck.goto.js"></script>
        <script src="deck.js/extensions/status/deck.status.js"></script>
        <script src="deck.js/extensions/navigation/deck.navigation.js"></script>

        <script>
            $(function() {
                $.deck('.slide');
            });
        </script>
    </body>
</html>