<!DOCTYPE html>
<html>
<head>
<meta charset="utf-8">
<!--[if IE]><meta http-equiv="X-UA-Compatible" content="IE=edge,chrome=1"><![endif]-->
<meta name="viewport" content="width=1024, user-scalable=no">
<title>Building International Sites: Background Information</title>
<!-- Required stylesheet -->
<link rel="stylesheet" href="deck.js/core/deck.core.css">
<link rel="stylesheet" href="deck.js/extensions/goto/deck.goto.css">
<link rel="stylesheet" href="deck.js/extensions/menu/deck.menu.css">
<link rel="stylesheet" href="deck.js/extensions/navigation/deck.navigation.css">
<link rel="stylesheet" href="deck.js/extensions/status/deck.status.css">
<link rel="stylesheet" href="deck.js/extensions/hash/deck.hash.css">
<link rel="stylesheet" href="deck.js/themes/style/swiss.css">
<link rel="stylesheet" href="custom.css">
<script src="deck.js/modernizr.custom.js"></script>
</head>
<body class="deck-container">
<section class="slide" id="intro">
<h1>Background</h1>
<div class="bottom">
<ul class="inline center">
<li><a href="#basic-terminology">Terminology</a></li>
<li><a href="#cultural-differences">Cultural Differences</a></li>
<li><a href="#writing">Writing</a></li>
</ul>
</div>
</section>
<section class="slide terminology" id="basic-terminology">
<h2>Terminology</h2>
<aside class="note">
These terms are not standardized. Since we're web-oriented, I'll follow the
<a href="http://www.w3.org/International/questions/qa-i18n/">W3C</a> definitions.
</aside>
<dl>
<dt><dfn>Locale</dfn></dt>
<dd>
A collection of preferences defining how a system should
behave for a target group. For example, users in the United
States, Great Britain and Australia mostly share a language
but choose different ways to spell words, display dates and
measure.
</dd>
<dt><dfn>Localization (l10n)</dfn></dt>
<dd>
Adapting the user interface so that it behaves correctly for
a locale. This involves a number of surprisingly complex
topics, ranging from how basic text processing and number
formatting work to questions about preferred colors and
icons and even legal requirements.
</dd>
<dt><dfn>Internationalization (i18n)</dfn></dt>
<dd>
Making it <em>easy</em> to localize software: in general
this involves identifying locale dependency points and
adding an abstraction mechanism to manage locale-specific
changes
</dd>
</dl>
</section>
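<section class="slide">
<h2>Example: A Locale Abstraction</h2>
<p>
As a minimal sketch of the "abstraction mechanism" idea, the snippet
below routes date formatting through a lookup table keyed by locale
instead of hard-coding one layout. The table, its keys and the
<code>format_date</code> helper are hypothetical names invented for
illustration; a real project would more likely rely on maintained
locale data than on a hand-written dictionary.
</p>
<pre>
```python
from datetime import date

# Hypothetical per-locale date patterns, written by hand for illustration.
DATE_PATTERNS = {
    "en-US": "{month:02d}/{day:02d}/{year}",
    "en-GB": "{day:02d}/{month:02d}/{year}",
    "en-AU": "{day:02d}/{month:02d}/{year}",
}


def format_date(value: date, locale: str) -> str:
    """Format a date via the locale table rather than a fixed layout."""
    pattern = DATE_PATTERNS[locale]
    return pattern.format(day=value.day, month=value.month, year=value.year)


release = date(2012, 9, 3)
print(format_date(release, "en-US"))  # 09/03/2012
print(format_date(release, "en-GB"))  # 03/09/2012
```
</pre>
<p>
The calling code never mentions a specific layout, so supporting another
locale means adding data rather than editing every call site.
</p>
</section>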
<section class="slide" id="cultural-differences">
<h2>Cultural Differences</h2>
<p class="collapse-inactive">
Localization is about more than cosmetic differences: forget
whether the year comes before or after the month in a date,
basic beliefs about how the world works vary significantly
</p>
<ol>
<li class="slide">
Geography's pretty universal, right?
<figure class="slide collapse-inactive" id="geography-falsehoods">
<figcaption class="center"><a href="http://wiesmann.codiferes.net/wordpress/?p=15187">Falsehoods Programmers Believe About Geography</a></figcaption>
<blockquote class="long" cite="http://wiesmann.codiferes.net/wordpress/?p=15187">
<ul>
<li>Places have only one official name</li>
<li>Place names follow the character rules of the language</li>
<li>Place names can be written with the usual character set of a country</li>
</ul>
</blockquote>
</figure>
</li>
<li class="slide">
Well, what about someone's name?
<figure class="slide collapse-inactive" id="name-falsehoods">
<figcaption class="center"><a href="http://www.kalzumeus.com/2010/06/17/falsehoods-programmers-believe-about-names/">Falsehoods Programmers Believe About Names</a></figcaption>
<blockquote class="long" cite="http://www.kalzumeus.com/2010/06/17/falsehoods-programmers-believe-about-names/">
<ol>
<li>People have exactly one canonical full name.</li>
<li>People have exactly one full name which they go by.</li>
<li value="7">People’s names do not change.</li>
<li value="8">People’s names change, but only at a certain enumerated set of events.</li>
<li value="20">People have last names, family names, or anything else which is shared by folks recognized as their relatives.</li>
</ol>
</blockquote>
</figure>
</li>
<li class="slide">
What about something as simple as a person's gender? That's just biology, right?
<figure class="slide collapse-inactive" id="gender-falsehoods">
<figcaption class="center"><a href="http://www.cscyphers.com/blog/2012/06/28/falsehoods-programmers-believe-about-gender/">Falsehoods Programmers Believe About Gender</a></figcaption>
<blockquote class="long" cite="http://www.cscyphers.com/blog/2012/06/28/falsehoods-programmers-believe-about-gender/">
<ul>
<li>There are two and only two genders</li>
<li>Okay, then there are two and only two biological genders.</li>
<li>Gender is determined solely by biology.</li>
</ul>
</blockquote>
</figure>
</li>
</ol>
</section>
<section class="slide" id="dealing-with-cultural-differences">
<h2>Dealing with Cultural Differences</h2>
<ul>
<li>
The easiest way to spend less time dealing with complex
data is not to ask for it: do you really need to know your
users' gender? This is also a good way to avoid your signup
process feeling nosy
</li>
<li>
If you do need data, ask how much structure you need: a
simple “What name should we display on your profile?” field
is easy to build and much easier than trying to migrate a
simplistic system after it's full of user data
</li>
<li>
If you have to model something complicated, pay the cost
upfront: use a library or service, test major assumptions
and potential outliers regularly and think about how you'll
deal with problems
</li>
</ul>
</section>
<section class="slide" id="writing">
<h1>Writing</h1>
</section>
<section class="slide">
<h2>An Abbreviated History of Electronic Text</h2>
<p>
When people started building electronic communication systems,
it was easy to continue assigning each distinct character a
number. Since early systems needed to be simple, each character
was assigned a fixed-length binary number
</p>
<ul>
<li>
<a href="http://en.wikipedia.org/wiki/Baudot_code">Baudot code</a>
(1870) to ITA2 (1930): 5 bits - just enough for the English alphabet
</li>
<li>
TeleTypeSetter and
<a href="http://en.wikipedia.org/wiki/BCD_(6-bit)">BCD</a>
(1928): 6 bits allowed punctuation. Unfortunately,
different manufacturers used different schemes, making it
difficult to exchange data or even mix computers and
printers from different manufacturers!
</li>
<li>
<a href="http://en.wikipedia.org/wiki/American_Standard_Code_for_Information_Interchange">ASCII</a> (1963):
7 bits allow both upper <em>and</em> lower case!
Standardization should also help avoid painful conversion
issues between manufacturers…
</li>
<li class="slide">
<p>… but almost everyone outside the United States needs
more characters and uses 8 bits to store extended
characters beyond basic ASCII. Worse, text exchanged
between incompatible encodings often appears correct until
the first document uses one of the characters that
differ!</p>
<p>Since there are individual languages which need more
than 256 characters, there's no possibility of a standard
8-bit encoding emerging</p>
</li>
</ul>
</section>
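<section class="slide">
<h2>Example: One Byte, Two Meanings</h2>
<p>
A small sketch of the problem described on the previous slide: each
fixed width allows only so many codes, and the same 8-bit value decodes
to different characters depending on which encoding the reader assumes.
The choice of Latin-1 and ISO 8859-5 here is only an illustration; any
pair of 8-bit encodings that disagree above 127 would do.
</p>
<pre>
```python
# Number of distinct codes available at each fixed width.
for bits in (5, 6, 7, 8):
    print(bits, "bits:", 2 ** bits, "codes")

# Encode a word containing one non-ASCII character as Latin-1.
raw = "caf\u00e9".encode("latin-1")

# The ASCII part survives either way; the last byte (0xE9) does not.
as_western = raw.decode("latin-1")      # e with acute accent
as_cyrillic = raw.decode("iso8859_5")   # CYRILLIC SMALL LETTER SHCHA

print(raw)
print(as_western)
print(as_cyrillic)
print(as_western[:3] == as_cyrillic[:3])  # True: the damage is easy to miss
```
</pre>
</section>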
<section class="slide" id="writing-systems">
<h2>The Range of Human Writing</h2>
<figure>
<img src="img/WritingSystemsOfTheWorld.svg" alt="Map of the writing systems of the world" />
<figcaption class="center">
<a href="http://commons.wikimedia.org/wiki/File%3AWritingSystemsOfTheWorld.svg">Writing Systems of the World</a>
<p class="attribution"><cite>Maximilian Dörrbecker via Wikimedia Commons (<a href="http://creativecommons.org/licenses/by-sa/3.0/">CC-BY-SA-3.0</a>)</cite></p>
</figcaption>
</figure>
</section>
<section class="slide" id="unicode">
<h2>Unicode</h2>
<p>
Starting in the 1980s, engineers from various companies started
working on an ambitious project: a universal 16-bit character
set which could represent every character used in human writing.
At some point it expanded beyond 16 bits but the goal hasn't
changed
</p>
<figure>
<figcaption><a href="http://www.unicode.org/standard/principles.html">The Unicode® Standard: A Technical Introduction</a></figcaption>
<blockquote class="long" cite="http://www.unicode.org/standard/principles.html">
<p>The Unicode Standard defines codes for characters used in <em>all the major
languages written today</em>. Scripts include the European alphabetic scripts,
Middle Eastern right-to-left scripts, and many scripts of Asia.</p>
<p>The Unicode Standard further includes punctuation marks,
diacritics, mathematical symbols, technical symbols, arrows,
dingbats, emoji, etc. … In all, the Unicode Standard, Version
6.0 provides codes for <em>109,449</em> characters from the world's
alphabets, ideograph sets, and symbol collections.</p>
</blockquote>
</figure>
</section>
<section class="slide terminology" id="terminology-unicode">
<h2>Terminology</h2>
<dl>
<dt>Character</dt>
<dd>
<q cite="http://www.unicode.org/glossary/#character">
The smallest component of written language that has semantic value; refers to the abstract meaning…
</q>
<p><strong>Key concept: this is not the same as a byte or number!</strong></p>
</dd>
<dt><dfn>Diacritic</dfn></dt>
<dd>
<q cite="http://www.unicode.org/glossary/#diacritic">
A mark applied or attached to a symbol to create a new
symbol that represents a modified or new value. (2) A
mark applied to a symbol irrespective of whether it
changes the value of that symbol. In the latter case,
the diacritic usually represents an independent value
(for example, an accent, tone, or some other linguistic
information).
</q>
</dd>
<dt>Grapheme Cluster</dt>
<dt><dfn>Combining Character</dfn></dt>
<dd>
Unicode supports a key concept that multiple characters can
be combined to produce a single displayed character
(generally anything a user would consider the same is a
“grapheme cluster”).
<table class="center-block align-right">
<tr>
<td colspan="3"><samp>é</samp></td>
</tr>
<tr>
<td colspan="3">LATIN SMALL LETTER E WITH ACUTE (U+00E9)</td>
</tr>
<tr>
<td><samp>e</samp></td>
<td><samp>´</samp></td>
<td><samp>é</samp></td>
</tr>
<tr>
<td>LATIN SMALL LETTER E (U+0065)</td>
<td>COMBINING ACUTE ACCENT (U+0301)</td>
<td></td>
</tr>
</table>
</dd>
</dl>
</section>
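<section class="slide">
<h2>Example: Combining Characters</h2>
<p>
The table on the previous slide can be reproduced with Python's standard
<code>unicodedata</code> module. This is only a sketch of the idea that
characters, code points and bytes are three different counts; the
variable names are illustrative.
</p>
<pre>
```python
import unicodedata

precomposed = "\u00e9"   # LATIN SMALL LETTER E WITH ACUTE
decomposed = "e\u0301"   # LATIN SMALL LETTER E + COMBINING ACUTE ACCENT

# Both display as a single grapheme cluster, but the code point counts differ.
print(len(precomposed), len(decomposed))  # 1 2

# Neither count matches the number of bytes once the text is encoded.
print(len(precomposed.encode("utf-8")), len(decomposed.encode("utf-8")))  # 2 3

# A naive comparison sees two different strings.
print(precomposed == decomposed)  # False

# Normalizing both sides to the same form makes them comparable.
print(unicodedata.normalize("NFC", decomposed) == precomposed)  # True
print(unicodedata.normalize("NFD", precomposed) == decomposed)  # True

# List the code points that make up the decomposed form.
for ch in decomposed:
    print(f"U+{ord(ch):04X}", unicodedata.name(ch))
```
</pre>
</section>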
<section class="slide terminology" id="terminology-unicode-2">
<h2>Terminology</h2>
<dl>
<dt id="equivalence">Equivalence</dt>
<dd>
<p>Unicode provides rules to determine when characters are
exactly the same (e.g. U+00F1 = U+006E + U+0303 = ñ) and also
when they are functionally the same (e.g. "&#xFB00;" == "ff" for
searching but not display).</p>
<p>
This can also apply to numbers – these are all numerically equivalent but have significantly different semantic meaning:<br/>
<samp>
<span title="DIGIT FIVE">5</span>
<span title="ARABIC-INDIC DIGIT FIVE">٥</span>
<span title="EXTENDED ARABIC-INDIC DIGIT FIVE">۵</span>
<span title="NKO DIGIT FIVE">߅</span>
<span title="DEVANAGARI DIGIT FIVE">५</span>
<span title="BENGALI DIGIT FIVE">৫</span>
<span title="GURMUKHI DIGIT FIVE">੫</span>
<span title="GUJARATI DIGIT FIVE">૫</span>
<span title="ORIYA DIGIT FIVE">୫</span>
<span title="TAMIL DIGIT FIVE">௫</span>
<span title="TELUGU DIGIT FIVE">౫</span>
<span title="KANNADA DIGIT FIVE">೫</span>
<span title="MALAYALAM DIGIT FIVE">൫</span>
<span title="THAI DIGIT FIVE">๕</span>
<span title="LAO DIGIT FIVE">໕</span>
<span title="TIBETAN DIGIT FIVE">༥</span>
<span title="MYANMAR DIGIT FIVE">၅</span>
<span title="MYANMAR SHAN DIGIT FIVE">႕</span>
<span title="KHMER DIGIT FIVE">៥</span>
<span title="MONGOLIAN DIGIT FIVE">᠕</span>
<span title="LIMBU DIGIT FIVE">᥋</span>
<span title="NEW TAI LUE DIGIT FIVE">᧕</span>
<span title="TAI THAM HORA DIGIT FIVE">᪅</span>
<span title="TAI THAM THAM DIGIT FIVE">᪕</span>
<span title="BALINESE DIGIT FIVE">᭕</span>
<span title="SUNDANESE DIGIT FIVE">᮵</span>
<span title="LEPCHA DIGIT FIVE">᱅</span>
<span title="OL CHIKI DIGIT FIVE">᱕</span>
<span title="VAI DIGIT FIVE">꘥</span>
<span title="SAURASHTRA DIGIT FIVE">꣕</span>
<span title="KAYAH LI DIGIT FIVE">꤅</span>
<span title="JAVANESE DIGIT FIVE">꧕</span>
<span title="CHAM DIGIT FIVE">꩕</span>
<span title="MEETEI MAYEK DIGIT FIVE">꯵</span>
<span title="FULLWIDTH DIGIT FIVE">&#xFF15;</span>
<span title="OSMANYA DIGIT FIVE">𐒥</span>
<span title="BRAHMI DIGIT FIVE">𑁫</span>
<span title="MATHEMATICAL BOLD DIGIT FIVE">𝟓</span>
<span title="MATHEMATICAL DOUBLE-STRUCK DIGIT FIVE">𝟝</span>
<span title="MATHEMATICAL SANS-SERIF DIGIT FIVE">𝟧</span>
<span title="MATHEMATICAL SANS-SERIF BOLD DIGIT FIVE">𝟱</span>
<span title="MATHEMATICAL MONOSPACE DIGIT FIVE">𝟻</span>
</samp>
</p>
</dd>
<dt><dfn>Case Mapping</dfn></dt>
<dd>
Various alphabets (Latin, Georgian, Armenian, Cyrillic,
etc.) have the concept of “case” and Unicode has rules for
converting text from one case to another, including the
various language-specific complications this can entail:
for example, the German ß (LATIN SMALL LETTER SHARP S)
converts to uppercase as “SS”
(see <cite><a href="http://unicode.org/reports/tr21/tr21-5.html">Unicode Standard Annex #21: CASE MAPPINGS</a></cite>).
</dd>
<dt>Case Folding</dt>
<dd>
Case folding is a similar process, generally used for
caseless comparisons (e.g. search). In Unicode, this is
an expanded form of the lowercase mapping that gives
consistent results for comparison, but its output is not
suitable for display to users
</dd>
</dl>
</section>
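<section class="slide">
<h2>Example: Equivalence and Case</h2>
<p>
The terms on the previous slide map onto a few standard-library calls in
Python. This is a rough sketch rather than a recipe: which normalization
form is appropriate depends on whether the result is stored, searched or
displayed.
</p>
<pre>
```python
import unicodedata

# Canonical equivalence: n + COMBINING TILDE composes to U+00F1.
print(unicodedata.normalize("NFC", "n\u0303") == "\u00f1")  # True

# Compatibility equivalence: the ff ligature (U+FB00) becomes "ff" under
# NFKC, which suits search keys but discards the original presentation.
ligature = "\ufb00"
print(unicodedata.normalize("NFKC", ligature) == "ff")  # True
print(unicodedata.normalize("NFC", ligature) == "ff")   # False

# Digits from different scripts share one numeric value:
# ASCII, Arabic-Indic, Devanagari, fullwidth and mathematical bold.
fives = ["5", "\u0665", "\u096b", "\uff15", "\U0001d7d3"]
print([unicodedata.digit(ch) for ch in fives])  # [5, 5, 5, 5, 5]

# Case mapping: sharp s has a two-letter uppercase form.
print("stra\u00dfe".upper())  # STRASSE

# Case folding: intended for comparison, not for display.
print("Stra\u00dfe".casefold() == "STRASSE".casefold())  # True
print("Stra\u00dfe".lower() == "STRASSE".lower())        # False
```
</pre>
</section>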
<section class="slide" id="terminology-collation">
<h2>Terminology</h2>
<h3>Collation</h3>
<p>Determining how strings compare for the purposes of sorting</p>
<div class="center-block">
<table>
<caption>
Sample character collation rules
</caption>
<tr>
<th class="section" colspan="2">By Language</th>
</tr>
<tr>
<th>Swedish:</th>
<td>z &lt; ö</td>
</tr>
<tr>
<th>German:</th>
<td>ö &lt; z</td>
</tr>
<tr>
<th>German:</th>
<td><abbr title="LATIN SMALL LETTER SHARP S">ß</abbr> = ss</td>
</tr>
<tr>
<th class="section" colspan="2">By Context</th>
</tr>
<tr>
<th>French:</th>
<td>cote &lt; côte &lt; coté &lt; côté</td>
</tr>
<tr>
<th class="section" colspan="2">By Usage</th>
</tr>
<tr>
<th>German Dictionary:</th>
<td>of &lt; öf</td>
</tr>
<tr>
<th>German Telephone:</th>
<td>öf &lt; of</td>
</tr>
</table>
<p class="attribution">
Sources: <cite><a href="http://unicode.org/reports/tr10/">Unicode Technical Standard #10</a></cite>
and <cite><a href="http://en.wikipedia.org/wiki/Alphabetical_order#Language-specific_conventions">Wikipedia: Alphabetical order</a></cite>
</p>
</div>
</section>
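<section class="slide">
<h2>Example: Language-Dependent Sorting</h2>
<p>
To illustrate why one list can have more than one correct order, the
sketch below sorts the same words with two hand-written sort keys. Both
key functions are hypothetical and heavily simplified (the German one
treats umlauts as fully ignorable, with no tie-breaking); production
code should use a library that implements the Unicode Collation
Algorithm instead.
</p>
<pre>
```python
import unicodedata

# Simplified Swedish alphabet: a-ring, a-umlaut and o-umlaut follow z.
SWEDISH_ALPHABET = "abcdefghijklmnopqrstuvwxyz\u00e5\u00e4\u00f6"


def swedish_key(word: str) -> list:
    """Rank each letter by its position in the Swedish alphabet."""
    composed = unicodedata.normalize("NFC", word.casefold())
    # Letters outside the table get -1 and therefore sort first.
    return [SWEDISH_ALPHABET.find(ch) for ch in composed]


def german_dictionary_key(word: str) -> str:
    """Ignore umlaut marks and fold sharp s to 'ss'."""
    decomposed = unicodedata.normalize("NFD", word.casefold())
    return "".join(ch for ch in decomposed if unicodedata.combining(ch) == 0)


words = ["zon", "\u00f6l", "ost"]
print(sorted(words, key=swedish_key))            # o-umlaut word sorts last
print(sorted(words, key=german_dictionary_key))  # o-umlaut word sorts first
```
</pre>
</section>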
<section class="slide" id="directionality">
<h2>Directionality</h2>
<figure>
<img src="img/Writing_directions_of_the_world.svg" alt="Map of the writing directions of the world" />
<figcaption class="center">
<a href="http://commons.wikimedia.org/wiki/File%3AWriting_directions_of_the_world.svg">Writing Directions of the World</a>
<p class="attribution"><cite>SPQRobin via Wikimedia Commons (<a href="http://creativecommons.org/licenses/by-sa/3.0">CC-BY-SA-3.0</a>)</cite></p>
</figcaption>
</figure>
</section>
<section class="slide" id="unicode-encodings">
<h2>Unicode Encodings</h2>
<p>
The Unicode standard describes abstract characters but we need
a way to convert them into bytes for storage and exchange.
Early experiments which simply widened every character from
8 bits to 16 bits revealed significant problems:
</p>
<ol>
<li>
Not every processor stores the bytes comprising a 16-bit
integer the same way — big-endian (“UNIX”) or little-endian
(“NUXI”) — necessitating a special byte-order mark (BOM) at
the start of the string simply to decode it
</li>
<li><strong>All</strong> existing text would need to be converted!</li>
<li>All text becomes more expensive to store and process</li>
<li>It wasn't enough: Chinese alone would require at least 3 bytes!</li>
</ol>
<p>
UTF-8 was developed to avoid these problems. It's
<a href="http://tools.ietf.org/html/rfc3629#page-4">a very
clever variable-length encoding scheme</a> under which all
existing 7-bit ASCII is valid, most common non-Asian characters
require only 2 bytes, common CJK characters need only 3 bytes,
and 4 bytes are needed only for rarer characters such as
historical scripts and emoji. Because it's read one byte at a
time, byte order is not an issue and there's no need for a BOM.
</p>
</section>
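<section class="slide">
<h2>Example: Encodings in Practice</h2>
<p>
The points on the previous slide can be checked directly in Python. This
is a small sketch using only built-in codecs; the sample characters are
arbitrary picks from each length class.
</p>
<pre>
```python
import codecs

# Byte order matters as soon as a character occupies two bytes.
print("A".encode("utf-16-be"))  # b'\x00A'
print("A".encode("utf-16-le"))  # b'A\x00'

# UTF-16 data is therefore commonly prefixed with a byte-order mark.
print(codecs.BOM_UTF16_BE, codecs.BOM_UTF16_LE)

# UTF-8 is variable length: one byte for ASCII, up to four otherwise.
samples = ["A", "\u00e9", "\u4e2d", "\U0001f30f"]
for ch in samples:
    encoded = ch.encode("utf-8")
    print(f"U+{ord(ch):04X} needs {len(encoded)} byte(s): {encoded.hex()}")

# Existing ASCII text is already valid UTF-8, byte for byte.
print("Django".encode("ascii") == "Django".encode("utf-8"))  # True
```
</pre>
</section>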
<section class="slide" id="state-of-unicode">
<h2>State of Unicode</h2>
<ul>
<li>
In <a href="http://googleblog.blogspot.com/2008/05/moving-to-unicode-51.html">2008</a>,
Google announced that for web content UTF-8 had surpassed ASCII in popularity. In
<a href="http://googleblog.blogspot.com/2010/01/unicode-nearing-50-of-web.html">2010</a>,
it was approaching 50%, and as of
<a href="http://googleblog.blogspot.com/2012/02/unicode-over-60-percent-of-web.html">February 2012</a>
it has passed 60%
</li>
<li>
<p>
All major operating systems and programming languages
support Unicode and UTF-8, although compatibility is
still a consideration for the more recent features
added in
<a href="http://babelstone.blogspot.com/2009/11/whats-new-in-unicode-60.html">version 6</a>
such as <a href="http://www.unicode.org/~scherer/emoji4unicode/snapshot/emojidata.html">emoji</a> (🌏)
or regional indicators (
🇺 🇸 = 🇺🇸,
🇫 🇷 = 🇫🇷, etc.)
</p>
</li>
</ul>
</section>
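<section class="slide">
<h2>Example: Regional Indicators</h2>
<p>
Flags are a handy illustration of why "one character" is a slippery
idea: each flag is a pair of regional indicator symbols that a font may
or may not render as a single glyph. The <code>flag</code> helper below
is a hypothetical name for this sketch, and it assumes a two-letter
ASCII country code as input.
</p>
<pre>
```python
REGIONAL_INDICATOR_A = 0x1F1E6  # REGIONAL INDICATOR SYMBOL LETTER A


def flag(country_code: str) -> str:
    """Map a two-letter ASCII country code to its regional indicator pair."""
    return "".join(
        chr(REGIONAL_INDICATOR_A + ord(letter) - ord("A"))
        for letter in country_code.upper()
    )


us = flag("US")
print(us)                                # a flag where the font supports it
print(len(us))                           # 2 code points
print(len(us.encode("utf-8")))           # 8 bytes in UTF-8
print(len(us.encode("utf-16-le")) // 2)  # 4 UTF-16 code units
```
</pre>
<p>
One visible flag is therefore 1, 2, 4 or 8 "characters" long depending
on what is being counted, which matters for length limits and
truncation.
</p>
</section>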
<section class="slide" id="complex-scripts">
<h2>Complex scripts</h2>
<p>
We previously discussed how accented characters can be formed
by combining a base character with the desired diacritic. The
concept of multiple characters producing a visually distinct
glyph is relatively unusual in English, where only a few
ligatures are at all commonly used - perhaps the best known
being the “ae” in encyclopædia - but other languages depend on
this behaviour.
</p>
<p>
A <a href="http://en.wikipedia.org/wiki/Complex_text_layout">complex text layout</a>
system allows the visual display to be significantly altered
based on the context. If you need to support complex languages
this will affect your font choices and design options!
</p>
<div class="slide">
<figure>
<figcaption class="center">
The name of the Arabic language as individual characters and written normally
</figcaption>
<p lang="ar" class="center-block larger">
<span class="slide">ا ل ع ر ب ي ة</span><br>
<span class="slide">العربية</span>
</p>
</figure>
</div>
</section>
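<section class="slide">
<h2>Example: Same Characters, Different Shapes</h2>
<p>
As a sketch of what complex text layout means for stored data, the
snippet below shows that the two Arabic lines on the previous slide
contain the same letters: joining and right-to-left ordering are applied
by the rendering system, not encoded in the string. It uses only the
standard <code>unicodedata</code> module.
</p>
<pre>
```python
import unicodedata

# The letters shown separately, then the word as normally written.
spaced = "\u0627 \u0644 \u0639 \u0631 \u0628 \u064a \u0629"
joined = "\u0627\u0644\u0639\u0631\u0628\u064a\u0629"

# Apart from the spaces, the stored code points are identical.
print(spaced.replace(" ", "") == joined)  # True

# Bidirectional class "AL" marks right-to-left Arabic letters; the
# renderer uses it, together with joining rules, to lay out the text.
for ch in joined:
    print(f"U+{ord(ch):04X}", unicodedata.bidirectional(ch), unicodedata.name(ch))
```
</pre>
</section>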
<section class="slide exit">
<nav>
<a href="index.html#agenda">Back to agenda</a>
</nav>
</section>
<a href="#" class="deck-prev-link" title="Previous">←</a>
<a href="#" class="deck-next-link" title="Next">→</a>
<!-- deck.status snippet -->
<p class="deck-status">
<span class="deck-status-current"></span>
/
<span class="deck-status-total"></span>
</p>
<!-- deck.goto snippet -->
<form action="." method="get" class="goto-form">
<label for="goto-slide">Go to slide:</label>
<input type="text" name="slidenum" id="goto-slide" list="goto-datalist">
<datalist id="goto-datalist"></datalist>
<input type="submit" value="Go">
</form>
<!-- deck.hash snippet -->
<a href="." title="Permalink to this slide" class="deck-permalink">#</a>
<script src="deck.js/jquery-1.7.2.min.js"></script>
<script src="deck.js/core/deck.core.js"></script>
<script src="deck.js/extensions/hash/deck.hash.js"></script>
<script src="deck.js/extensions/menu/deck.menu.js"></script>
<script src="deck.js/extensions/goto/deck.goto.js"></script>
<script src="deck.js/extensions/status/deck.status.js"></script>
<script src="deck.js/extensions/navigation/deck.navigation.js"></script>
<script>
$(function() {
$.deck('.slide');
});
</script>
</body>
</html>