<!DOCTYPE html>
<html>
<head>
<meta charset="utf-8">
<!--[if IE]><meta http-equiv="X-UA-Compatible" content="IE=edge,chrome=1"><![endif]-->
<meta name="viewport" content="width=1024, user-scalable=no">
<title>Handling Unicode in Python &amp; Django</title>
<!-- Required stylesheet -->
<link rel="stylesheet" href="deck.js/core/deck.core.css">
<link rel="stylesheet" href="deck.js/extensions/goto/deck.goto.css">
<link rel="stylesheet" href="deck.js/extensions/menu/deck.menu.css">
<link rel="stylesheet" href="deck.js/extensions/navigation/deck.navigation.css">
<link rel="stylesheet" href="deck.js/extensions/status/deck.status.css">
<link rel="stylesheet" href="deck.js/extensions/hash/deck.hash.css">
<link rel="stylesheet" href="deck.js/themes/style/swiss.css">
<link rel="stylesheet" href="custom.css">
<script src="deck.js/modernizr.custom.js"></script>
</head>
<body class="deck-container">
<section class="slide">
<h1>Handling Unicode</h1>
</section>
<section class="slide" id="unicode-in-python2">
<h2>Unicode in Python 2.x</h2>
<p>
There's a persistent misconception that you need Python 3 to
support Unicode. This is untrue: the default <code>""</code>
may produce a <code>str</code> (i.e. bytes without a specified
encoding) instance but simply prefixing any string with
<code>u</code> produces a full <code>unicode</code> object.
</p>
<p>Python 2 will even helpfully convert them when you mix the two:</p>
<samp class="block">
>>> type("foo")<br>
&lt;type '<b>str</b>'&gt;<br>
>>> type(u"bar")<br>
&lt;type '<b>unicode</b>'&gt;<br>
>>> "foo" + u"bar"<br>
u'foobar'
</samp>
<p class="center bottom">Background: <a href="http://docs.python.org/howto/unicode.html" title="Unicode HOWTO">http://docs.python.org/howto/unicode.html</a></p>
</section>
<section class="slide" id="unicode-in-python2-the-bad-parts">
<h2>Unicode in Python 2.x: the bad parts</h2>
<p>
To preserve compatibility with existing code, Python 2 assumes
byte-strings are ASCII whenever it converts them implicitly. This
means that any input source can trigger a <code>UnicodeDecodeError</code>:
</p>
<samp class="block">
>>> s = " ".join(['I', <b class="mistake">'\xe2\x98\xba'</b>, 'Unicode'])<br>
>>> print s <i># Raw UTF-8 displays correctly in my UTF-8 terminal</i><br>
I ☺ Unicode<br>
>>> unicode(s)<br>
Traceback (most recent call last):<br>
File "&lt;stdin&gt;", line 1, in &lt;module&gt;<br>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 2: ordinal not in range(128)<br>
</samp>
<p>Everything works if you know the encoding and remember to decode first:</p>
<samp class="block">
>>> s.decode("utf-8")<br>
u'I \u263a Unicode'
</samp>
<p>
It also “works” if you simply avoid treating bytes as text until
you're prepared to deal with them. This was a problem in
django-localeurl when requests arrived with incorrectly
encoded data: the rest of the system handled it, but the
redirect logic would choke on a <code>UnicodeDecodeError</code>.
This was fixed by
<a href="https://bitbucket.org/carljm/django-localeurl/changeset/32afa6f7e79db683ee32c4c3389ebc872db57c6c">
doing less
</a>.
</p>
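<p>
The same failure and fix can be reproduced in Python 3, where the
implicit ASCII conversion has to be written out by hand. This is a
minimal sketch, not code from django-localeurl; the variable names are
illustrative only:
</p>

```python
# Python 3 sketch: bytes must be decoded with the right codec before use as text.
raw = b"I \xe2\x98\xba Unicode"  # UTF-8 bytes, as in the example above

try:
    raw.decode("ascii")          # what Python 2 attempts implicitly
    ascii_failed = False
except UnicodeDecodeError:
    ascii_failed = True

text = raw.decode("utf-8")       # explicit decode with the known encoding
print(ascii_failed)
print(text)
```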
</section>
<section class="slide" id="unicode-in-python3">
<h2>Unicode in Python 3.x</h2>
<h3>Explicit is better than implicit</h3>
<p>
Learning from what was problematic in Python 2, the default
string type (<code>str</code>) is now Unicode, with a separate
<code>bytes</code> type for raw I/O. You can't mix the two
without an explicit conversion:
</p>
<samp class="block">
>>> "☹" + b"bytes"<br>
Traceback (most recent call last):<br>
File "&lt;stdin&gt;", line 1, in &lt;module&gt;<br>
TypeError: Can't convert 'bytes' object to str implicitly<br>
>>> "☺" + b" bytes".decode("ascii")<br>
'☺ bytes'<br>
>>> "☺".encode("utf-8") + b" bytes"<br>
b'\xe2\x98\xba bytes'<br>
</samp>
<p class="center">Python 2 fails later; Python 3 forces you to fix it now</p>
</section>
<section class="slide" id="unicode-best-practices">
<h2>Unicode Best Practices</h2>
<p class="center note">These rules are the same for Python 2 and 3 so you won't need to change when Django migrates</p>
<ul>
<li>
<strong>Always use Unicode internally</strong>
</li>
<li>
<strong>Always convert to Unicode as soon as you receive something and encode it when you send it</strong>
</li>
<li>
<strong>Decide how to handle invalid values</strong>
</li>
</ul>
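<p>
The three rules above can be sketched as a single function. This is an
illustrative Python 3 example; <code>shout</code> and its parameters are
hypothetical names, and the default <code>"strict"</code> error mode is
just one possible policy for invalid input:
</p>

```python
def shout(raw_bytes, encoding="utf-8", errors="strict"):
    """Hypothetical helper: decode on input, work in Unicode, encode on output."""
    text = raw_bytes.decode(encoding, errors)  # 1. convert to Unicode at the boundary
    text = text.upper()                        # 2. all processing happens on text
    return text.encode(encoding)               # 3. encode only when sending

result = shout("café ☺".encode("utf-8"))
print(result.decode("utf-8"))
```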
</section>
<section class="slide" id="unicode-conversion">
<h2>Unicode Best Practices</h2>
<h3>I/O Conversion</h3>
<p>
The most basic approach applies to <code>str</code> and anything
str-like: decode as soon as you read it and encode just before you write it:
</p>
<code class="block">
s = foo.read().decode("utf-8")<br>
print >>foo, s.encode("utf-8")
</code>
<p>
For files, you might want to use the <a
href="http://docs.python.org/library/codecs.html">codecs</a>
module so you can read from the file normally and receive
Unicode characters instead of byte-strings:
</p>
<samp class="block">
>>> import codecs<br>
>>> my_file = codecs.open(foo_file, encoding="utf-8")
</samp>
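<p>
As an alternative to <code>codecs.open</code>, the standard library's
<code>io.open</code> (the built-in <code>open</code> in Python 3) accepts
the same <code>encoding</code> argument. The sketch below assumes a
throwaway temporary file and shows a round trip from text to bytes on
disk and back:
</p>

```python
import io
import os
import tempfile

# Hypothetical scratch file used only for this demonstration
path = os.path.join(tempfile.mkdtemp(), "example.txt")

with io.open(path, "w", encoding="utf-8") as f:
    f.write(u"I \u263a Unicode\n")      # text in, UTF-8 bytes on disk

with io.open(path, encoding="utf-8") as f:
    line = f.readline()                 # UTF-8 bytes on disk, text out

with io.open(path, "rb") as f:
    raw = f.read()                      # the same file viewed as raw bytes

print(repr(line))
print(repr(raw))
```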
</section>
<section class="slide" id="invalid-unicode">
<h2>Unicode Best Practices</h2>
<h3>Handling invalid data</h3>
<p>
In general the best policy is to reject invalid text so the
user can correct it immediately. However, in a batch context
you might need to process what you can. Python allows you to
“ignore” or strip anything which fails to decode:
</p>
<code class="block">foo.decode("utf-8", "ignore")</code>
<p>
It may be preferable to instead replace each invalid byte sequence
with the
<a href="http://en.wikipedia.org/wiki/Specials_(Unicode_block)#Replacement_character">Unicode Replacement character �</a>
to clearly indicate that data was lost:
</p>
<code class="block">foo.decode("utf-8", "replace")</code>
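<p>
To compare the three policies side by side, here is a small Python 3
sketch. The sample bytes are made up for illustration: valid UTF-8 with
one stray <code>0xff</code> byte in the middle.
</p>

```python
bad = b"caf\xc3\xa9 \xff ok"   # valid UTF-8 plus one stray byte (0xff)

ignored = bad.decode("utf-8", "ignore")    # drops the invalid byte silently
replaced = bad.decode("utf-8", "replace")  # marks the loss with U+FFFD

try:
    bad.decode("utf-8")                    # default "strict" mode rejects it
    rejected = False
except UnicodeDecodeError:
    rejected = True

print(repr(ignored))
print(repr(replaced))
print(rejected)
```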
</section>
<section class="slide" id="unicode-normalization">
<h2>Unicode Best Practices</h2>
<h3><a href="http://unicode.org/faq/normalization.html">Normalization</a></h3>
<p>
There are <a href="background.html#terminology-unicode-2">multiple ways to
represent many characters</a>, producing situations where a
program treats two values as different even though they appear
identical to the user. We can avoid this problem by normalizing
everything to a consistent internal representation before
comparison:
</p>
<samp class="block">
>>> from unicodedata import normalize<br>
>>> precomposed = u"\N{LATIN SMALL LETTER E WITH ACUTE}"<br>
>>> decomposed = u"\N{LATIN SMALL LETTER E}\N{COMBINING ACUTE ACCENT}"<br>
>>> precomposed == decomposed<br>
False<br>
>>> normalize("NFC", precomposed) == normalize("NFC", decomposed)<br>
True
</samp>
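<p>
One way to make this harder to forget is to wrap the comparison in a
helper. The function name <code>same_text</code> is hypothetical, and
the sketch assumes Python 3 and that NFC is the form you have chosen:
</p>

```python
from unicodedata import normalize

def same_text(a, b, form="NFC"):
    """Hypothetical helper: compare two strings after normalizing both."""
    return normalize(form, a) == normalize(form, b)

precomposed = "\N{LATIN SMALL LETTER E WITH ACUTE}"
decomposed = "e\N{COMBINING ACUTE ACCENT}"

print(len(precomposed), len(decomposed))   # one code point vs. two
print(same_text(precomposed, decomposed))
print(len(normalize("NFD", precomposed)))  # NFD splits the accent back out
```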
<aside>
<p>
Those <code>\N{<i>name</i>}</code> Unicode named literals
are a handy way to avoid confusion in your source code.
Your source should be saved as UTF-8 but that doesn't make
it any easier to quickly read a string with visually similar
characters or changes in directionality which are tricky to
work with in many text editors.
</p>
<p>
Declare your source file's encoding with this comment on the first
line, or on the second line if the file starts with a shebang:<br>
<code class="block"># encoding: utf-8</code>
<code class="block">
#!/usr/bin/env python<br>
# encoding: utf-8
</code>
</p>
</aside>
</section>
<section class="slide" id="unicode-numeric-value">
<h2>Other stdlib features</h2>
<h3>Comparing numbers with Unicode</h3>
<p>
Python's <a href="http://docs.python.org/library/unicodedata.html">unicodedata</a>
module has other features for working with Unicode text. Several of them allow you
to retrieve information about characters, as in this example which generated the
different characters in the
<a href="background.html#terminology-unicode-2">numeric equivalence example</a>:
</p>
<code class="block">
from __future__ import print_function, unicode_literals<br>
from unicodedata import decimal<br>
import sys<br>
<br>
if sys.version_info[0] == 3:<br>
    unichr = chr  # Python 3 renamed unichr() to chr()<br>
<br>
# Walk every code point and print those whose decimal digit value is 5<br>
for i in range(0, sys.maxunicode + 1):<br>
    u = unichr(i)<br>
    if decimal(u, None) == 5:<br>
        print(u.encode("utf-8"))<br>
</code>
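<p>
For a closer look at what <code>unicodedata</code> reports for
individual characters, here is a short Python 3 sketch. The four sample
characters are arbitrary choices for illustration:
</p>

```python
import unicodedata

# ASCII five, Arabic-Indic five, one half, and a smiley
for ch in ["5", "\u0665", "\u00bd", "\u263a"]:
    print(
        "U+%04X" % ord(ch),
        unicodedata.name(ch),
        unicodedata.category(ch),
        unicodedata.decimal(ch, None),   # None when the character is not a decimal digit
        unicodedata.numeric(ch, None),   # None when the character has no numeric value
    )
```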
</section>
<section class="slide" id="re-vs-regex">
<h2>Unicode Best Practices</h2>
<h3>re vs. regex</h3>
<p>
If you need to process text using regular expressions, you'll
quickly learn that even with Python 3 there are significant
drawbacks with the built-in re module. The
<a href="http://pypi.python.org/pypi/regex">regex</a> module
is the successor being developed as a standalone library for
ease of testing and it has much better Unicode support:
</p>
<code class="block">
>>> re.match(r"(?i)strasse", u"stra\N{LATIN SMALL LETTER SHARP S}e", flags=re.UNICODE)<br>
None<br>
>>> regex.match(ur"(?iV1)strasse", u"stra\N{LATIN SMALL LETTER SHARP S}e")<br>
&lt;_regex.Match object at …&gt;<br>
>>> regex.match(ur"(?iV1)stra\N{LATIN SMALL LETTER SHARP S}e", "STRASSE")<br>
&lt;_regex.Match object at …&gt;<br>
</code>
<p class="note">
regex's smart behaviour depends on either <a
href="http://pypi.python.org/pypi/regex#flags">setting the
Unicode flag</a> or using a Unicode string as the pattern!
</p>
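<p>
If installing <code>regex</code> is not an option, a rough workaround
for simple literal searches is to case-fold both sides yourself. This
sketch assumes Python 3.3 or later (for <code>str.casefold()</code>)
and a pattern with no regex metacharacters; it is not equivalent to
<code>regex</code>'s full case folding.
</p>

```python
import re

word = "stra\N{LATIN SMALL LETTER SHARP S}e"

# The stdlib re module only applies simple case mapping, so "ss" never matches "ß"
plain = re.match(r"(?i)strasse", word)

# Assumption: full case folding of both sides is acceptable for this search
folded = re.match("strasse".casefold(), word.casefold())

print(plain)
print(word.casefold())
print(folded is not None)
```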
</section>
<section class="slide" id="unicode-in-django">
<h2>Unicode in Django</h2>
<blockquote class="long" cite="https://docs.djangoproject.com/en/1.4/ref/unicode/">
A bytestring does not carry any information with it about its
encoding. For that reason, we have to make an assumption, and
Django assumes that all bytestrings are in UTF-8.<br>
…<br>
In most cases when Django is dealing with strings, it
will convert them to Unicode strings before doing
anything else. So, as a general rule, if you pass in a
bytestring, be prepared to receive a Unicode string
back in the result.
</blockquote>
<div class="slide">
<h4>Handy Utilities</h4>
<ul class="slide collapse-inactive">
<li>
<a href="https://docs.djangoproject.com/en/1.4/ref/utils/#django.utils.encoding.smart_unicode">django.utils.encoding.smart_unicode()</a>:
attempts to convert the input to a <code>unicode</code>
instance. Set <code>strings_only=True</code> to leave
numbers, booleans and <code>None</code> unconverted
</li>
<li>
<a href="https://docs.djangoproject.com/en/1.4/ref/utils/#django.utils.encoding.force_unicode">django.utils.encoding.force_unicode()</a>:
almost identical to <code>smart_unicode</code> except that
lazy translations will be processed
</li>
<li>
<a href="https://docs.djangoproject.com/en/1.4/ref/utils/#django.utils.encoding.smart_str">django.utils.encoding.smart_str()</a>:
converts the input to a byte-string. Set
<code>strings_only=True</code> to leave numbers, booleans
and <code>None</code> unconverted
</li>
</ul>
<ul class="slide collapse-inactive">
<li>
<a href="https://docs.djangoproject.com/en/1.4/ref/utils/#django.utils.encoding.iri_to_uri">django.utils.encoding.iri_to_uri()</a>
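percent-encodes the non-ASCII portions of an IRI so it can be used in a URL.
Internationalized domain names are encoded separately using IDNA:<br>
<a href="http://xn--n3h.net/">http://xn--n3h.net/</a> == <a href="http://☃.net">http://☃.net</a>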
</li>
<li>
<a href="https://docs.djangoproject.com/en/1.4/ref/utils/#django.utils.http.urlquote">django.utils.http.urlquote()</a>
and
<a href="https://docs.djangoproject.com/en/1.4/ref/utils/#django.utils.http.urlquote_plus">django.utils.http.urlquote_plus()</a>
are Unicode-safe versions of the stdlib utilities with the
same names
</li>
</ul>
<aside class="slide collapse-inactive">
The default <a href="https://docs.djangoproject.com/en/1.4/ref/templates/builtins/#slugify">slugify</a>
filter simply drops non-ASCII characters, which makes it useless
for names that are mostly non-ASCII. Fortunately, the fine
developers at Mozilla have written a drop-in replacement:
<a href="http://pypi.python.org/pypi/unicode-slugify">unicode-slugify</a>
</aside>
</div>
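<p>
A URL containing non-ASCII characters needs two different encodings:
IDNA (Punycode) for the host name and percent-encoded UTF-8 for the
path. The following sketch uses only the Python 3 standard library to
show the difference; it is not Django's implementation, and the example
path is made up.
</p>

```python
from urllib.parse import quote

host = "\u2603.net"          # ☃.net
path = "/caf\u00e9"          # /café (hypothetical path)

ascii_host = host.encode("idna").decode("ascii")  # IDNA (Punycode) for the host
ascii_path = quote(path)                          # percent-encoded UTF-8 for the path

url = "http://" + ascii_host + ascii_path
print(url)
```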
<p class="note slide">
See the official
<a href="https://docs.djangoproject.com/en/1.4/ref/unicode/">Django Unicode</a>
documentation for more information
</p>
</section>
<section class="slide" id="django-and-databases">
<h2>Unicode in databases</h2>
<p>
Django attempts to use Unicode for communication with the
database. There's only one catch but it's significant: you
<em>must</em> ensure that the database is configured to use a
Unicode encoding when you create it. Otherwise you'll probably
be able to store data but queries may be erratic - remember
collation? - and data can be lost or truncated: thanks to
UTF-8's variable length, a <code>VARCHAR(20)</code> could
require up to 80 bytes to store!
</p>
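<p>
The arithmetic behind that 80-byte figure is easy to check. This Python 3
sketch encodes four 20-character strings, one for each UTF-8 sequence
length; the sample characters are arbitrary:
</p>

```python
samples = {
    "ASCII": "a",
    "Latin-1 range": "\u00e9",      # é
    "BMP symbol": "\u263a",         # ☺
    "emoji": "\U0001f600",          # 😀
}

for label, ch in samples.items():
    value = ch * 20                 # fits a VARCHAR(20) by character count
    print(label, len(value), len(value.encode("utf-8")))
```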
</section>
<section class="slide" id="mysql-unicode">
<h2>Unicode in databases: MySQL</h2>
<aside>
<p>
n.b. MySQL's <code>utf8</code> only supports up to 3 bytes per character,
breaking on newer or obscure characters such as emoji.
<code>utf8mb4</code> supports 4-byte characters but requires MySQL 5.5 or later!
</p>
<p>
The older <code>utf8_general_ci</code> is faster than
<code>utf8_unicode_ci</code> because it only compares
single characters without performing the full Unicode
collation rules (e.g. <samp>ß</samp> == <samp>ss</samp>).
</p>
</aside>
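<p>
If you are stuck on the 3-byte <code>utf8</code> character set, it can
help to detect problem input before it reaches the database. The
function below is a hypothetical check, assuming Python 3 strings:
anything outside the Basic Multilingual Plane needs 4 bytes in UTF-8.
</p>

```python
def needs_utf8mb4(text):
    """Hypothetical check: True if any character lies outside the BMP."""
    return any(ord(ch) > 0xFFFF for ch in text)

print(needs_utf8mb4("snowman \u2603"))      # 3-byte character
print(needs_utf8mb4("grin \U0001f600"))     # 4-byte character
```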
<h4>Per database</h4>
<code class="block">CREATE DATABASE my_site CHARACTER SET utf8 COLLATE utf8_unicode_ci;</code>
<h4>Server-Wide (recommended)</h4>
<code class="block">
[mysqld]<br>
character-set-server = utf8<br>
collation-server = utf8_unicode_ci
</code>
<aside>
If you use Amazon Web Services, here are step-by-step
instructions for
<a href="https://gist.github.com/1084682">creating a UTF-8
Amazon RDS database</a>
</aside>
<p class="center note">
See
<a href="http://dev.mysql.com/doc/refman/5.1/en/globalization.html" title="MySQL :: MySQL 5.1 Reference Manual :: 10 Globalization">http://dev.mysql.com/doc/refman/5.1/en/globalization.html</a>
</p>
</section>
<section class="slide" id="postgresql-unicode">
<h2>Unicode in databases: PostgreSQL</h2>
<h4>Per database</h4>
<code class="block">CREATE DATABASE my_app WITH ENCODING 'UTF8';</code>
or
<code class="block">createdb -E UTF8 my_app</code>
<h4>Server-Wide (recommended)</h4>
<code class="block">initdb -E UTF8</code>
<p class="center note">
See
<a href="http://www.postgresql.org/docs/9.1/static/multibyte.html" title="PostgreSQL: Documentation: 9.1: Character Set Support">PostgreSQL: Character Set Support</a>
</p>
</section>
<section class="slide" id="model-data-normalization">
<h2>Normalizing your model data</h2>
<p class="note">
<strong>There is no one right approach</strong>: you need to
validate your inputs at every point but the details vary from
project to project.
</p>
<p>
My fledgling
<a href="https://github.com/acdha/django-i18n-utils">django-i18n-utils</a>
package provides an example of one approach:
<a href="https://github.com/acdha/django-i18n-utils/blob/master/django_i18n_utils/models.py#L11"><code>UnicodeNormalizerMixin</code></a>
uses the standard
<a href="https://docs.djangoproject.com/en/1.4/ref/models/instances/#django.db.models.Model.clean_fields">
model validation <code>clean_fields()</code>
</a> method to normalize all text fields to Unicode NFC:
</p>
<code class="block">
class Person(UnicodeNormalizerMixin, models.Model):<br>
name = models.CharField(…)
</code>
<p>
This works well as long as you scrupulously call
<code>full_clean()</code> (which runs <code>clean_fields()</code>)
before saving a model. Since this is
important for other reasons, make it a key point in testing.
</p>
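<p>
Independent of Django, the core of this kind of normalization is small.
The sketch below is not the package's code; <code>normalize_text_fields</code>
is a hypothetical helper that applies NFC to every string value in a plain
dictionary and leaves other values alone (Python 3 assumed):
</p>

```python
from unicodedata import normalize

def normalize_text_fields(record, form="NFC"):
    """Hypothetical helper: return a copy with every str value normalized."""
    return {
        key: normalize(form, value) if isinstance(value, str) else value
        for key, value in record.items()
    }

person = {"name": "Rene\N{COMBINING ACUTE ACCENT}", "age": 42}
cleaned = normalize_text_fields(person)
print(len(person["name"]), len(cleaned["name"]))
```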
</section>
<section class="slide" id="homoglyph-attacks">
<h2>Unicode Security: Homoglyph Attacks</h2>
<p>
Unicode has a number of characters which need to be different
for some reason but are visually quite similar. This is most
concerning in the case of domain names if an
<a href="http://en.wikipedia.org/wiki/IDN_homograph_attack">IDN homograph attack</a>
allows an attacker to register something like
<samp>http://www.p<abbr title="CYRILLIC SMALL LETTER A">а</abbr>ypal.com/</samp>
but the concern applies any time there's something to gain from
spoofing.
</p>
<p>
Solving this requires some combination of restricting your data
or alerting the user:
</p>
<ul>
<li>
Browsers increasingly restrict mixed-script display: Chrome
displays the above URL as
<samp>http://www.xn--pypal-4ve.com/</samp> because it mixes
Cyrillic and Latin characters
</li>
<li>
Twitter simply blocks non-ASCII usernames:
<img src="img/twitter-homoglyphs.png" alt="Blocked attempt to register @[CYRILLIC SMALL LETTER A]cdha on Twitter">
</li>
</ul>
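<p>
A mixed-script check along the lines of what browsers do can be
approximated in Python 3. The standard <code>unicodedata</code> module
does not expose the Unicode script property, so this sketch falls back
on a rough heuristic, the first word of each letter's character name;
<code>scripts_used</code> is a hypothetical function and not a complete
defence.
</p>

```python
import unicodedata

def scripts_used(text):
    """Rough heuristic: collect the first word of each letter's Unicode name."""
    return {
        unicodedata.name(ch).split()[0]
        for ch in text
        if ch.isalpha()
    }

spoofed = "p\N{CYRILLIC SMALL LETTER A}ypal"
print(sorted(scripts_used(spoofed)))
print(sorted(scripts_used("paypal")))
```

<p>
A name that reports more than one script is a candidate for rejection or
for a warning to the user.
</p>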
</section>
<section class="slide exit">
<nav>
<a href="index.html#agenda">Back to agenda</a>
</nav>
</section>
<a href="#" class="deck-prev-link" title="Previous">←</a>
<a href="#" class="deck-next-link" title="Next">→</a>
<!-- deck.status snippet -->
<p class="deck-status">
<span class="deck-status-current"></span>
/
<span class="deck-status-total"></span>
</p>
<!-- deck.goto snippet -->
<form action="." method="get" class="goto-form">
<label for="goto-slide">Go to slide:</label>
<input type="text" name="slidenum" id="goto-slide" list="goto-datalist">
<datalist id="goto-datalist"></datalist>
<input type="submit" value="Go">
</form>
<!-- deck.hash snippet -->
<a href="." title="Permalink to this slide" class="deck-permalink">#</a>
<script src="deck.js/jquery-1.7.2.min.js"></script>
<script src="deck.js/core/deck.core.js"></script>
<script src="deck.js/extensions/hash/deck.hash.js"></script>
<script src="deck.js/extensions/menu/deck.menu.js"></script>
<script src="deck.js/extensions/goto/deck.goto.js"></script>
<script src="deck.js/extensions/status/deck.status.js"></script>
<script src="deck.js/extensions/navigation/deck.navigation.js"></script>
<script>
$(function() {
$.deck('.slide');
});
</script>
</body>
</html>