Group 8

UNICODE
Semi-structured Data 1
Contents
• Motivation
• History – Birth of Unicode
• The Organisation
• Application areas
• Technical view / character sets
• Fonts / encoding criteria
• Unicode in HTML / XML
• Links & Resources / Q & A
Motivation
• Problem: different countries, different scripts…
• Goal: combine all text characters known worldwide into a single character set
• The number is huge (more than 9,000 Chinese characters alone)
• Wanted: an innovative solution that brings everything under one roof
What gets encoded?
• Line end
• Paragraph end
• Writing direction (left to right / right to left)
• 94,140 characters (on several planes)
• BUT: not every font can display them, and fonts that can usually cost money
What is Unicode?
• International standard
• Character set
→ a unique number for every character
• Unicode is:
– platform-independent
– independent of program and programming language
– language-independent
• First unifying encoding able to represent ALL characters
• Developed and governed by a central non-profit consortium
• Intended use, above all for programs that:
– run on multiple platforms
– run in multiple languages
– implement a wide variety of languages without much effort
• Latest version: 4.1.0
• Unicode is NOT a font
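The "one unique number per character" idea can be tried out directly. A minimal Python sketch; the helper name code_point_label and the sample characters are chosen here for illustration only:

```python
# Each character has exactly one code point, independent of font or platform.
samples = ["e", "$", "\u0685", "\uA460"]


def code_point_label(ch: str) -> str:
    """Return the conventional U+XXXX label for a single character."""
    return f"U+{ord(ch):04X}"


labels = [code_point_label(ch) for ch in samples]
print(labels)

# The mapping is reversible: chr() turns the number back into the character.
assert all(chr(ord(ch)) == ch for ch in samples)
```

The number says nothing about how the character is drawn; that remains the job of the font.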
History – Birth of Unicode V 1.0
1986
• Xerox works on an idea to merge Japanese and Chinese characters more easily
• Apple works out a theory to come up with a universal character set for the Apple File
Exchange development
1987
• Unicode's original "begin at 0 and add the next character" architecture is created
• Xerox begins discussing multilingual issues; a new character encoding is a major topic, and a fixed-width design is preferred
• Earliest documented use of the term "Unicode"
1988
• Apple advances the idea of fixed-width 16-bit characters
• First presentation of the Unicode principles in Dallas
History – Birth of Unicode V 1.0
1989
• Meetings joined by Sun, then Adobe, Claris, HP, NeXT and Pacific Rim Connections (later
morphed into Unicode Technical Committee)
• Decision to incorporate all composite characters in existing ISO registered standards and to
guarantee round trip conversion to existing standards.
• Decision to use logical ordering for bidirectional (Middle Eastern) and Indic text.
• ANSI proposes to ISO a compromise on Han unification and the use of C0 and C1. Apple, Claris, Metaphor, NeXT, and Sun participate on behalf of Unicode. As a result of this compromise, the Unicode working group decides to use the existing ISO orderings for script subsets and the ISO naming schemes.
• Unicode is presented to Microsoft, IBM, Unix, ISO SC2, WG2
1990
• Microsoft shows interest in Unicode; Apple Japan, Microsoft KK and IBM also become active
• First implementation of a WYSIWYG prototype for demonstration
• Final review draft of Unicode is distributed internationally
• Decision to use logical ordering for all South Asian scripts, add length marks
History – Birth of Unicode V 1.0
1991
• Creation of the Unicode Technical Committee (UTC)
• First articles about Unicode appear in the New York Times
• Novell joins
• First unofficial 2-day Unicode Workshop is a success
• First Unicode book finally appears
1992
• The Unicode Standard Version 1.0, Volume 2 is printed.
• Article "Kiss your ASCII Goodbye" appears in PC Magazine.
The Organisation
• Unicode is governed by a central consortium
• Non-profit organisation
• Cooperates with W3C and ISO
• Keeps Unicode synchronised with the character set ISO/IEC 10646 (which is maintained by ISO/IEC)
• Goal: development and extension
• Members include the global players of the IT industry (IBM, Microsoft, Apple, Cisco, Oracle, …)
• Website: www.unicode.org
The Organisation
Original board members of Unicode Inc.:
• Larry Tesler, Vice President Advanced Products, Apple Computer, Inc.
• Robert Carr, Vice President Software Development, GO Corporation
• Richard Holleman, Director of Telecommunications, IBM Corporation
• Charles Irby, Vice President of Development, Metaphor Computer Systems
• Paul Maritz, Vice President Advanced Operating Systems, Microsoft Corporation
• Bud Tribble, Vice President Software Engineering, NeXT Computer Inc.
• Jay Israel, Vice President Advanced Technology, Novell, Inc.
• David Richards, Director of Development, The Research Libraries Group
• John Gage, Vice President Desktop Development, Sun Microsystems Inc.
Officers and founding members:
• Mark Davis, President
• Mike Kernaghan, Vice-President
• Joe Becker, Technical Vice-President
• Ken Whistler, Secretary
• Bill English, Treasurer
Application areas (1)
Databases
• Adabas
• Caché and Ensemble
• FrontBase
• IBM
• Ingres
• Justsystem Goro
• Microsoft Access, SQL Server
• Mimer SQL
• NCR Teradata
• Onix
• Oracle 8
• PostgreSQL
• Progress Software
• Qwikly
• Sybase
• Unisys UREP
Application areas (2)
Operating systems
• Apple Mac OS 9.2, Mac OS X 10.1, Mac OS X Server, ATSUI
• Compaq's Tru64 UNIX, OpenVMS
• GNU/Linux with glibc 2.2.2 or newer - FAQ support
• IBM AIX, AS/400, OS/2
• Inferno by Vita Nuova
• Microsoft Windows CE, NT, 2000, XP
• SCO UnixWare 7.1.0
• Sun Solaris
• Symbian Platform
Application areas (3)
Standards
• XML
• XHTML
• XSL
• LDAP
• CORBA 3.0
• WAP (WML)
• …
Search engines
• AltaVista
• Yahoo
• Google
• Fastsearch
Application areas (4)
Programming languages, development environments
• Ada 95
• CLISP Common Lisp
• G2 5.0 Rev. 0 by Gensym Corporation
• GAWK 3.0.3
• Java
• JavaScript (ECMAScript)
• Led C++ class library
• Microsoft VJ++
• Visual Studio 7.0 (forthcoming)
• Visual Basic
• Perl
• Python
• XML Spy 3.0 from Icon Information-Systems GmbH
Technical view
• UTF (Unicode Transformation Format)
– Specifies a unique byte sequence for every character
• Several encoding forms:
– UTF-8: mainly the Web
– UTF-16: mainly Java and Windows
– UTF-32: mainly UNIX
– UTF-7: e-mail without MIME (listed for completeness)
– UTF-EBCDIC: mainframes (listed for completeness)
• Conversions between UTF-8 / 16 / 32:
– lossless
– fast
– algorithmic conversion
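The lossless conversion between the three encoding forms can be checked with a short round trip. A minimal Python sketch using the standard codecs; the sample text is an assumption chosen to include one character outside the BMP:

```python
# Sample text: ASCII, Arabic, Yi and one character outside the BMP (U+1D11E).
text = "e$\u0685\uA460\U0001D11E"

# Convert UTF-8 -> UTF-16 -> UTF-32 and back to a string.
utf8 = text.encode("utf-8")
utf16 = utf8.decode("utf-8").encode("utf-16-be")
utf32 = utf16.decode("utf-16-be").encode("utf-32-be")
round_trip = utf32.decode("utf-32-be")

print(len(utf8), len(utf16), len(utf32))  # sizes differ, content does not
print(round_trip == text)
```

No lookup tables are involved: each step is pure bit arithmetic on the code points, which is why the conversion is both fast and reversible.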
Technische Sichtweise
UTF-8
• ASCII-compatible
– characters in the range U+0000→U+007F are encoded as a single byte
• Ken Thompson made Plan 9 from AT&T Bell Labs the world's first operating system to use UTF-8
• Default encoding for XML
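How UTF-8 stays ASCII-compatible becomes clear when the encoding is written out by hand. The following Python sketch follows the standard UTF-8 bit layout (1 to 4 bytes); the function name utf8_encode is hypothetical, and the built-in codec is only used to cross-check the result:

```python
def utf8_encode(cp: int) -> bytes:
    """Encode one code point as UTF-8 (illustrative, not optimised)."""
    if not 0 <= cp <= 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError("not an encodable code point")
    if cp < 0x80:        # 0xxxxxxx: identical to ASCII
        return bytes([cp])
    if cp < 0x800:       # 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp < 0x10000:     # 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0xF0 | (cp >> 18),
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])


for cp in (0x65, 0x24, 0x0685, 0xA460, 0x1D11E):
    # Cross-check against Python's own UTF-8 codec.
    assert utf8_encode(cp) == chr(cp).encode("utf-8")
    print(f"U+{cp:04X}", utf8_encode(cp).hex(" "))
```

Bytes below 0x80 only ever stand for ASCII characters, so software that handles plain ASCII text passes UTF-8 through unharmed.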
Technical view
UTF-8 / 16 / 32 (1)
                       UTF-8    UTF-16   UTF-32
Code unit size         8 bit    16 bit   32 bit
Min. bytes/character   1        2        4
Max. bytes/character   4        4        4
Example 1: a character that is 1 UTF-16 code unit can take 2 UTF-8 code units
Example 2: 1 UTF-32 code unit ↔ 2 UTF-16 code units ↔ 4 UTF-8 code units (characters outside the BMP)
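The size figures can be measured rather than memorised. A small Python sketch, assuming the BOM-free big-endian codecs so that only the character itself is counted; byte_lengths is a hypothetical helper name:

```python
ENCODINGS = ("utf-8", "utf-16-be", "utf-32-be")


def byte_lengths(ch: str) -> dict:
    """Number of bytes one character needs in each encoding form."""
    return {enc: len(ch.encode(enc)) for enc in ENCODINGS}


# ASCII letter, Arabic letter, Yi syllable, musical symbol outside the BMP
for ch in ("e", "\u0685", "\uA460", "\U0001D11E"):
    print(f"U+{ord(ch):04X}", byte_lengths(ch))
```

UTF-32 always needs 4 bytes per character, while UTF-8 and UTF-16 grow only when the code point requires it.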
UTF-8 ↔ UTF-16 ↔ UTF-32
Example characters in each encoding (code units in hexadecimal):
Character       UTF-8       UTF-16   UTF-32
"e"             65          0065     00000065
"$"             24          0024     00000024
Hah (Arabic)    DA 85       0685     00000685
Yi (Xip)        EA 91 A0    A460     0000A460
Technical view
UTF-8 / 16 / 32 (2)
Special features / differences of UTF-16 and UTF-32:
• UTF-16/32 BE: big-endian (most significant byte first)
• UTF-16/32 LE: little-endian (least significant byte first)
• UTF-16/32 without suffix: big-endian by default, or as indicated by a BOM (byte order mark)
Bytes          Encoding
FF FE 00 00    UTF-32 LE
00 00 FE FF    UTF-32 BE
FF FE          UTF-16 LE
FE FF          UTF-16 BE
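The table translates into a simple detection routine. A minimal Python sketch using the BOM constants from the standard codecs module; sniff_bom is a hypothetical helper name, and the check order is a design choice explained in the comment:

```python
import codecs

# Longer marks first: the UTF-32 LE mark (FF FE 00 00) begins with the
# UTF-16 LE mark (FF FE), so checking UTF-16 first would misclassify it.
_BOMS = [
    (codecs.BOM_UTF32_LE, "UTF-32 LE"),
    (codecs.BOM_UTF32_BE, "UTF-32 BE"),
    (codecs.BOM_UTF16_LE, "UTF-16 LE"),
    (codecs.BOM_UTF16_BE, "UTF-16 BE"),
]


def sniff_bom(data: bytes):
    """Return the encoding named by a leading BOM, or None if there is none."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name
    return None


print(sniff_bom(bytes.fromhex("FFFE0000") + b"e\x00\x00\x00"))
print(sniff_bom(bytes.fromhex("FEFF") + b"\x00e"))
print(sniff_bom(b"plain ASCII"))
```

One ambiguity remains: UTF-16 LE text whose first character is U+0000 looks like a UTF-32 LE mark, so a BOM check is a heuristic rather than a proof.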
Technical view
Scripts
Unicode does not define languages, but scripts
Reason: many languages share the same characters, so these can be unified
The latest version, 4.1.0, supports the following scripts:
Character Sets
Scripts
Character Sets
Special characters
Character Sets – Ranges (1)
• U+0000* – U+007F* Controls and Basic Latin (~ASCII)
• U+0080* – U+00FF* Controls and Latin-1
* UTF-16
Character Sets – Ranges (2)
U+0600* – U+06FF* Arabic
Direction: right to left
U+0685* Hah
U+06B4* Gaf
U+069C* Seen
* UTF-16
Character Sets – Ranges (3)
U+0F00* – U+0FFF* Tibetan
Direction: left to right
U+0F47* Ja
U+0F5C* Dzha
U+0F43* Gha
* UTF-16
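Character names and writing direction for such ranges can be looked up programmatically. A minimal Python sketch using the standard unicodedata module; describe is a hypothetical helper, and the names come from whatever Unicode version the Python installation ships with:

```python
import unicodedata


def describe(ch: str) -> tuple:
    """Return (code point label, character name, bidirectional class)."""
    return (f"U+{ord(ch):04X}",
            unicodedata.name(ch),
            unicodedata.bidirectional(ch))


# Sample characters from the Arabic and Tibetan ranges
for ch in ("\u0685", "\u06B4", "\u069C", "\u0F47", "\u0F5C", "\u0F43"):
    print(*describe(ch))
```

The bidirectional class "AL" marks Arabic letters (written right to left), while "L" marks left-to-right characters such as the Tibetan letters.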
Fonts
In general:
• A font maps a byte sequence to a glyph
Unicode font:
• The byte sequences of the respective Unicode encoding are available as mappings to glyphs
Example: Arial Unicode MS:
included since MS Office 2002, 38,917 characters, 50,377 glyphs
Fonts
in Java
Unicode fonts in Java:
1. Copy the font files into the Java font directory:
jre/lib/fonts
2. Adapt the font.properties file or create a new one
(when a mapping between logical and physical fonts or a localisation is needed)
e.g. font.properties.ko (for Korean)
e.g. serif.0=Arial,ANSI_CHARSET (maps to Arial)
e.g. serif.1=WingDings,SYMBOL_CHARSET,NEED_CONVERTED
fontcharset.serif.1=sun.awt.windows.CharToByteWingDings
3. Java code example: new Font("serif", Font.PLAIN, 12)
Input methods
• Entering Chinese characters
– Possible with any keyboard in principle
– Because of the large number of characters: key combinations
• the most frequently used characters = 1 key
• all other characters = key combination (<1%)
Encoding criteria
Stability of encoded characters:
• Extremely careful review before standardisation
• Once encoded, characters may never be removed
• This guarantees the longevity of digital data
• BMP (Basic Multilingual Plane) vs. astral planes
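The practical difference between the BMP and the astral planes shows up in UTF-16: code points above U+FFFF need a surrogate pair. A minimal Python sketch of that arithmetic; to_surrogates is a hypothetical helper, and the built-in codec is used only as a cross-check:

```python
def to_surrogates(cp: int) -> tuple:
    """Split an astral code point into a UTF-16 high/low surrogate pair."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("code point is inside the BMP or out of range")
    offset = cp - 0x10000               # 20 significant bits remain
    high = 0xD800 + (offset >> 10)      # upper 10 bits
    low = 0xDC00 + (offset & 0x3FF)     # lower 10 bits
    return high, low


high, low = to_surrogates(0x1D11E)
print(f"{high:04X} {low:04X}")

# Cross-check against Python's UTF-16 codec.
assert "\U0001D11E".encode("utf-16-be") == bytes.fromhex(f"{high:04X}{low:04X}")
```

Characters inside the BMP never need this step, which is why they fit into a single 16-bit code unit.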
Encoding criteria
Unicode encodes abstract characters
(the idea of a letter),
not glyphs
(the concrete graphical representation)
Glyph variants are made possible by:
• 256 variation selectors
• appended to the code point where needed
Unicode in HTML
The encoding is declared in HTML through a meta element:
<meta http-equiv="content-type"
content="text/html; charset=UTF-8">
However: auto-detection by the browser (byte order mark) has to work at least up to the meta declaration
Unicode in XML
• XML:
<?xml version="1.0" encoding="UTF-8"?>
• Default encoding: Unicode, detected via the byte order mark
• If there is no BOM, then UTF-8
• XML processors must support UTF-8 and UTF-16
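That a conforming parser accepts both encodings can be tried with Python's standard xml.etree.ElementTree module. A minimal sketch; the element name note and the sample character are assumptions made for this illustration:

```python
import xml.etree.ElementTree as ET

body = "<note>\u0685</note>"

# UTF-8 without declaration and without BOM: the XML default applies.
utf8_doc = body.encode("utf-8")

# UTF-16 with declaration; Python's "utf-16" codec prepends a BOM.
utf16_doc = ('<?xml version="1.0" encoding="UTF-16"?>' + body).encode("utf-16")

parsed = [ET.fromstring(doc).text for doc in (utf8_doc, utf16_doc)]
print(parsed[0] == parsed[1] == "\u0685")
```

The parser hands back the same character in both cases, so the application never sees which byte encoding was on disk.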
Unicode in HTML/XML
Numeric character reference:
• Decimal: &#160;
• Hexadecimal: &#x00A0;
A document does not have to be stored in a Unicode encoding and can still contain numeric references to code points!
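Both directions can be illustrated with the Python standard library: html.unescape resolves numeric references, and the xmlcharrefreplace error handler produces them when the target encoding lacks a character. A minimal sketch; the sample text and the choice of Latin-1 are assumptions for this illustration:

```python
import html

# Decimal and hexadecimal references name the same code point (U+00A0).
nbsp_dec = html.unescape("&#160;")
nbsp_hex = html.unescape("&#x00A0;")
print(nbsp_dec == nbsp_hex == "\u00A0")

# Storing Arabic text in a Latin-1 file: characters the encoding cannot
# represent are written as numeric character references instead.
original = "Hah: \u0685"
stored = original.encode("latin-1", "xmlcharrefreplace")
print(stored)

# Reading the file back and resolving the references restores the text.
restored = html.unescape(stored.decode("latin-1"))
print(restored == original)
```

The reference always uses the Unicode code point, no matter which encoding the file itself is stored in.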
Unicode and XML
• Text: a sequence of characters (data and markup)
• Character: atomic unit of text
• Allowed character range:
#x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
(excluding the surrogate blocks, FFFE and FFFF)
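The character range above maps directly onto a membership test. A minimal Python sketch of the production as quoted; is_xml_char and first_illegal are hypothetical helper names:

```python
def is_xml_char(cp: int) -> bool:
    """True if the code point is allowed in an XML 1.0 document."""
    return (
        cp in (0x9, 0xA, 0xD)
        or 0x20 <= cp <= 0xD7FF
        or 0xE000 <= cp <= 0xFFFD
        or 0x10000 <= cp <= 0x10FFFF
    )


def first_illegal(text: str):
    """Index of the first character not allowed in XML, or None."""
    for index, ch in enumerate(text):
        if not is_xml_char(ord(ch)):
            return index
    return None


print(first_illegal("tab\tand newline\nare fine"))
print(first_illegal("form feed\x0cis not"))
```

Most ASCII control characters fail the test, which is why binary data has to be escaped or encoded before it is put into XML.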
Unicode and XML
But: names may only use a subset of the characters allowed in markup and text
NameChar ::= Letter | Digit | '.' | '-' | '_' | ':' | CombiningChar | Extender
Name ::= (Letter | '_' | ':') (NameChar)*
Note: letters are not just A-Z!
Unicode in XML
Some characters are not suitable for XML
• Deprecated in the Unicode Standard
• Problematic without additional data
• Functionality better provided by markup
• Conflict with markup
Unicode in XML
Examples
• Problem: overlap between control codes and XML markup
e.g. line and paragraph separator
Code points: U+2028 .. U+2029
Solution: <xhtml:br />, <xhtml:p></xhtml:p> or equivalent
• Furthermore: contradictions between control codes and markup are possible, which raises the question of priority
e.g. language identification
Code points: U+E0000 .. U+E007F
Solution: xhtml:lang or xml:lang
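A document can be scanned for the two example ranges before it is published. A minimal Python sketch that covers only the code points named above, not the complete list of characters discouraged in markup; all names in it are hypothetical:

```python
# Only the two example ranges from the slide; not an exhaustive list.
DISCOURAGED = {
    "line/paragraph separator": (0x2028, 0x2029),
    "language tag character": (0xE0000, 0xE007F),
}


def find_discouraged(text: str) -> list:
    """Return (index, code point label, reason) for each problematic character."""
    hits = []
    for index, ch in enumerate(text):
        for reason, (low, high) in DISCOURAGED.items():
            if low <= ord(ch) <= high:
                hits.append((index, f"U+{ord(ch):04X}", reason))
    return hits


print(find_discouraged("first line\u2028second line"))
print(find_discouraged("\U000E0001tagged text"))
print(find_discouraged("clean text"))
```

Each hit can then be replaced by the corresponding markup, for example a break element or an xml:lang attribute.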
Links and Resources
• Unicode organisation homepage:
– http://www.unicode.org
• Unicode general information:
– http://de.wikipedia.org/wiki/Unicode
• Unicode characters:
– http://www.decodeunicode.org/
• File encoding information:
– http://www.fileformat.info/info/unicode/char/search.htm
• XML 1.0 (3rd Ed) W3C Recommendation
– http://www.w3.org/TR/REC-xml/
Questions & Answers
Please ask your questions about Unicode now