03.11.2008

XML Databases
1. Introduction, 27.10.08
Silke Eckstein
Andreas Kupfer
Institut für Informationssysteme
Technische Universität Braunschweig
http://www.ifis.cs.tu-bs.de

XML Databases – Silke Eckstein – Institut für Informationssysteme – TU Braunschweig

1. Introduction
  1.1 Motivation
  1.2 Relational Databases – Recap
  1.3 Why use XML?
  1.4 XML & Databases
  1.5 XML Fundamentals
  1.6 Organisational matters
  1.7 Overview
  1.8 References

1.1 Motivation

• Within the last 10 years XML has become the de facto standard for data exchange over the web
• "If I invent another programming language, its name will contain the letter X" (N. Wirth, Software Pioniere Konferenz, Bonn 2001)

Aim of this lecture
• Give answers to the following questions:
  – What (additional) concepts do we need in order to store XML data in an RDBMS?
  – What concepts are crucial in order to build native XML-DBMS systems?

• Why is XML relevant from a DB perspective? [Fisch05]
  – XML is becoming the data "format"
    • Amount of XML is ever increasing
    • DBMS are good at handling GBs and TBs of data
  – Accepted model for semi-structured data
    • Overcome limitations of structured data
    • Extend usefulness of DBMS
  – DB technology is not limited to DBMS
    • App servers, application integration

• Examples:
  – The latest office documents
  – SVG graphics files
  – Lots of configuration files
  – Some WebCMSs store page contents in XML format
  – MPEG-7 is a standard for describing media metadata in XML format
  – ...
• In order to see examples of XML-structured documents, browse through your computer's file system and check for file contents starting with "<?xml "!

1.2 Relational Databases

What is a Database?
• A database (DB) is a collection of related data
  – Represents some aspects of the real world
    • Universe of Discourse (UoD)
  – Data is logically coherent
  – Is provided for an intended group of users and applications
[EN06, 1.1]

What is a Database Management System?
• A database management system (DBMS) is a collection of programs to maintain a database, i.e. for
  – Definition of Data and Structure
  – Physical Construction
  – Manipulation
  – Sharing/Protecting
  – Persistence/Recovery
[EN06, 1.1]

Why not use the File System?
• File management systems are physical interfaces
[Figure: the applications App 1 and App 2 (customer letters, money transfer, balance sheets) access the files Customer Data, Account Data and Loans directly through the file system]

File Systems
• Advantages
  – Fast and easy access
• Disadvantages
  – Uncontrolled redundancy
  – Inconsistent data
  – Limited data sharing and access rights
  – Poor enforcement of standards
  – Excessive data and access paths maintenance

• Databases are logical interfaces
  – Controlled redundancy
  – Data consistency & integrity constraints
  – Integration of data
  – Effective and secure data sharing
  – Backup and recovery
• However…
  – More complex
  – More expensive data access

• Databases control redundancy
  – Same data used by different applications/tasks is only stored once
  – Access via a single interface provided by DBMS
  – Redundancy only purposefully used to speed up data access (e.g. materialized views)
• Databases are well-structured
  – Catalog (data dictionary) contains all meta-data
  – Defines the structure of the data in the database
    • Data is strictly typed (Integer, Timestamp, VarChar, …)
[EN06, 1.6.1, 1.3]

• Databases aim at efficient manipulation of data
  – Physical tuning allows for good data allocation
  – Indexes speed up search and access
  – Query plans are optimized for improved performance
• Example: Simple Index
[Figure: an index file holding AccNo values that point into the data file below]

  Data File
  AccNo    type      balance
  1278945  saving    € 312.10
  2437954  saving    € 1324.82
  4543032  checking  € -43.03
  5539783  saving    € 12.54
  7809849  checking  € 7643.89
  8942214  checking  € -345.17
  9134354  saving    € 2.22
  9543252  saving    € 524.89
[EN06, 1.3]

• Isolation between applications and data
  – Database employs data abstraction by providing data models
  – Applications work only on the conceptual representation of data
    • Details on where data is actually stored and how it is accessed are hidden by the DBMS
    • Applications can access and manipulate data by invoking abstract operations (e.g. SQL SELECT statements)
  – DBMS-controlled parts of the file system are strongly protected against outside manipulation (tablespaces)
[EN06, 1.3]

• Example: Schema is changed and table-space moved without an application noticing
[Figure: the application sends SELECT AccNo FROM account WHERE balance>0 to the DBMS. Before the change, the table with the columns AccNo, type and balance is stored on Disk 1; after the change, a table with the columns AccNo and balance is stored on Disk 2. The application's query is unchanged.]

• Databases support multiple views of the data
  – Views provide a different perspective of the DB
    • A user's conceptual understanding or task-based excerpt of all data (e.g. aggregations)
    • Security considerations and access control (e.g. projections)
  – For the application, a view does not differ from a table
  – Views may contain subsets of a DB and/or contain virtual data
    • Virtual data is derived from the DB (mostly by simple SQL statements, e.g. joins over several tables)
    • Can either be computed at query time or materialized upfront
[EN06, 1.3]

• Sharing of data and support for atomic multiuser transactions
  – Multiple users and applications may access the DB at the same time
  – Concurrency control is necessary for maintaining consistency
  – Transactions need to be atomic and isolated from each other
[EN06, 1.3]

• Persistence of data and disaster recovery
  – Data needs to be persistent and accessible at all times
  – Quick recovery from system crashes without data loss
  – Recovery from natural disasters (fire, earthquakes, …)
[EN06, 1.3]

1.3 Why use XML?

• Bioinformatics example:
  – Search in TRANSPATH database for molecule "TLR4"
• Presentation and processing of database query results
  – Flat file
  – Web page
  – HTML text
  – XML text
[Figures: the TLR4 entry shown as flat file, web page, HTML text and XML text, annotated with the fields Key, Originator, Molecule name, Species, Links to other DBs, Gene Ontology references, Reactions the molecule participates in, and Publications]
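The presentation formats named for the bioinformatics example (flat file, HTML text, XML text) can be tried out on a much smaller query result. The following is a minimal sketch, not part of the lecture material: it reuses four account rows and a variant of the query `SELECT AccNo FROM account WHERE balance>0` from the relational recap, runs it in an in-memory SQLite database, and renders the result three ways. The element names `accounts`, `account` and `balance`, the `currency` attribute, and all function names are assumptions made for illustration.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Account rows as shown in the relational recap (AccNo, type, balance in euros).
ROWS = [
    (1278945, "saving", 312.10),
    (2437954, "saving", 1324.82),
    (4543032, "checking", -43.03),
    (5539783, "saving", 12.54),
]


def query_positive_balances():
    """Run the query against an in-memory SQLite database."""
    con = sqlite3.connect(":memory:")
    con.execute(
        "CREATE TABLE account (AccNo INTEGER PRIMARY KEY, type TEXT, balance REAL)"
    )
    con.executemany("INSERT INTO account VALUES (?, ?, ?)", ROWS)
    result = con.execute(
        "SELECT AccNo, balance FROM account WHERE balance > 0 ORDER BY AccNo"
    ).fetchall()
    con.close()
    return result


def as_flat_file(result):
    """Flat file: compact, but a consumer has to know the column order."""
    return "\n".join(f"{acc_no};{balance:.2f}" for acc_no, balance in result)


def as_html(result):
    """HTML: the tags describe layout (table, row, cell), not meaning."""
    cells = "".join(f"<tr><td>{a}</td><td>{b:.2f}</td></tr>" for a, b in result)
    return f"<table>{cells}</table>"


def as_xml(result):
    """XML: the tags name the content; layout is left to other tools."""
    root = ET.Element("accounts")
    for acc_no, balance in result:
        acc = ET.SubElement(root, "account", AccNo=str(acc_no))
        ET.SubElement(acc, "balance", currency="EUR").text = f"{balance:.2f}"
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    rows = query_positive_balances()
    print(as_flat_file(rows))
    print(as_html(rows))
    print(as_xml(rows))
```

Only the XML rendering can be read back by a generic parser into named fields; the flat file needs out-of-band knowledge of the column order, and the HTML table only says where a value is displayed.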
• Flat files
  – Little layout information
  – Suitable for presentation only to a limited extent
  – Can be parsed, but parsing is cumbersome
• HTML
  – Only layout information
  – Good for presentation
  – Automatic processing is difficult, as is the generation of other presentation formats
• Solution
  – Separation of layout and content

• What is XML?
  – The Extensible Markup Language (XML) is the universal format for structured documents and data on the Web.
  – Base specifications:
    • XML 1.0, W3C Recommendation Feb '98
    • XML 1.1 (2nd Ed.), W3C Recommendation Aug '06
    • Namespaces, W3C Recommendation Jan '99
[Fisch05]

• XML Data Example

  <Buch>
    <Autor id="1234567890">Rainer Eckstein</Autor>
    <Autor id="1234568723">Silke Eckstein</Autor>
    <Titel>XML und Datenmodellierung</Titel>
    <Untertitel>XML-Schema ...</Untertitel>
    <Verlag id="3-89864">dpunkt.Verlag</Verlag>
  </Buch>

  – Syntax, no abstract model
  – Documents, elements and attributes
  – Tree-based, nested, hierarchically organized structure

• What is XML now then?
  – XML is semi-structured text
  – XML is a tag-based markup language (like HTML)
    • eXtensible Markup Language
  – XML was designed to exchange data
  – XML tags are not predefined
    • Tags are defined in a separate schema
  – XML is designed to be self-descriptive
  – XML is a W3C Recommendation
  – XML became highly popular due to its simplicity and flexibility
[Fisch05]

1.4 XML & Databases

• Database world
  – 1970 relational databases
  – 1990 nested relational model and object-oriented databases
  – 1995 semi-structured databases
• Documents world
  – 1974 SGML (Standard Generalized Markup Language)
  – 1990 HTML (Hypertext Markup Language)
  – 1992 URL (Uniform Resource Locator)
  – URI (Uniform Resource Identifier)
• Data + documents = information
  – 1996 XML (Extensible Markup Language)

• Information systems have different degrees of data structure rigidness
  – Structured, e.g., relational databases
    • Structure explicitly specified in schema
    • Every tuple in a table has the same attributes and domains
    • Queries can take advantage of structure
  – Unstructured, e.g., information retrieval systems
    • Often just full text with no or only limited structure information
    • Properties of data usually unknown
    • Queries difficult to evaluate
• But there is also something in between
  – Semi-structured, e.g., XML
    • Structure of data follows a template, but still allows for a degree of flexibility
    • Data instances following the same schema may have a different structure
    • Often, complex relationships between data are allowed (associations, inheritance, sub-classing, aggregation, etc.)
    • Queries often involve those relationships

• Relational data
  – Killer application: Banking
  – Invented as a mathematically clean abstract data model
  – Philosophy: schema first, then data
• XML
  – 1st killer application: Publishing industry
  – Invented as a syntax for data, only later an abstract data model
  – Philosophy: data and schemas should not be correlated; data can exist with or without schema, or with multiple schemas
[Fisch05]

• Relational data
  – Never had a standard syntax for data
  – Strict rules for data normalization, flat tables
  – Order is irrelevant, textual data supported but not primary goal
• XML
  – Standard syntax existed
  – No data normalization, flexibility is a must, nesting is good
  – Order may be very important, textual data support a primary goal
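The contrast between flat, unordered tuples and nested, ordered XML can be made concrete with the book example from 1.3. The sketch below uses only Python's standard `xml.etree.ElementTree` module. It parses two `Buch` records that follow the same template but differ in structure: one has two authors and a subtitle, the other has one author and no subtitle. The second record, the wrapping `Bücher` element in this position, and the helper `describe` are hypothetical additions for illustration.

```python
import xml.etree.ElementTree as ET

# First record: the book example from the lecture.
# Second record: a hypothetical book with one author and no subtitle.
DOC = """
<Bücher>
  <Buch>
    <Autor id="1234567890">Rainer Eckstein</Autor>
    <Autor id="1234568723">Silke Eckstein</Autor>
    <Titel>XML und Datenmodellierung</Titel>
    <Untertitel>XML-Schema ...</Untertitel>
    <Verlag id="3-89864">dpunkt.Verlag</Verlag>
  </Buch>
  <Buch>
    <Autor id="0000000001">Example Author</Autor>
    <Titel>Example Title</Titel>
    <Verlag id="0-00000">Example Press</Verlag>
  </Buch>
</Bücher>
"""


def describe(buch):
    """Return (authors in document order, title, subtitle or None)."""
    authors = [a.text for a in buch.findall("Autor")]  # document order is kept
    title = buch.findtext("Titel")
    subtitle = buch.findtext("Untertitel")  # None when the element is absent
    return authors, title, subtitle


books = [describe(b) for b in ET.fromstring(DOC).findall("Buch")]
for authors, title, subtitle in books:
    print(authors, title, subtitle)
```

A flat relational design for the same information would typically need a nullable subtitle column plus a separate author table with an explicit position column, because tuple order carries no meaning in the relational model.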
• Document-Centric XML
  – Just loosely structured with a lot of unstructured text
  – Often intended for human consumption
  – Querying and processing quite difficult
  – Advantages of relational DBs don't pay off
  – Additional IR techniques advantageous

• Data-Centric XML
  – XML is used to store or transport regularly structured and fine-grained data
  – Data can be mapped to relational tables with some tricks
  – Is often designed to be processed by machines
[Figure labels: Table, Columns, Aggregated Columns?, Foreign Keys?, Another table?]

• XML documents thus can store all kinds of data
• Thus, is an XML document already a database?
  – Generally speaking… yes. But a crappy one!
  – For allowing effective XML use, we additionally need
    • Storage schemes for efficiently storing even huge documents
    • Query Languages
    • Schema Languages
    • Support for data integrity and transactions (ACID)
    • Support for data security
    • Programming Interfaces
    • … and all the other things we know from real DBMS systems

• Many of these requirements can be fulfilled by specialized standards and technologies
  – Storage:
    • XML document on the file system
  – Queries:
    • Simple queries with XPath
    • Complex queries with XQuery
  – Schemas:
    • Simple schemas with DTD
    • Complex schemas with XML Schema (XSD)
  – Programming Interfaces:
    • Provided by various implementations of SAX, DOM, StAX, …

• Still, those isolated technologies are not yet a real DBMS
• The topic of XML Databases deals with integrating them into a fully functional DBMS
• Two options
  – Integrating XML support into RDBMS systems
    • Especially suited for data-centric XML
  – Building native XML-DBMS systems
    • Suited for data-centric and document-centric XML

• What are XML-supporting RDBMS?
  – They map XML data into relational tables
  – Main problem: How to create an efficient and meaningful mapping?
• What are native XML databases?
  – "Native" is a marketing term
  – Common agreement:
    • Native XML DBs work with a logical model of the XML document (not directly with the data)
      – i.e. nodes, attributes, types, tree structure, CDATA entries, …
    • XML is the primary form of storage
    • They are not limited to a particular storage model (could use a relational DB, an object DB, the file system, etc.)
  – Main problem: How to query and store efficiently?

• Example (very simple):

  Relational Mapping

  Flights
  id  airline  origin  destination
  1   ABC Air  Dallas  Fort Worth

  Flight
  id  departure  arrival  flight_ref
  1   09:15      09:16    1
  2   11:15      11:16
  3   13:15      13:16

  Native Mapping

  Tags
  id  parent  name         value
  1   null    Flights      null
  2   1       Airline      ABC Air
  3   1       Origin       Dallas
  4   1       Destination  Fort Worth
  5   1       Flight       null
  6   5       Departure    09:15

[Figure: RDBMS with XML support vs. native XML-DBMS systems]
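The native mapping in the flights example stores one row per element, with a reference to the parent element. A minimal sketch of such a shredding step is given here. The input document is an assumption reconstructed from the values in the example tables (the slides do not show the XML source), and the function name `shred` is hypothetical. It illustrates the idea of a node table only and does not describe how any particular XML DBMS stores its data.

```python
import xml.etree.ElementTree as ET

# Assumed source document, reconstructed from the values in the example tables.
FLIGHTS_XML = """
<Flights>
  <Airline>ABC Air</Airline>
  <Origin>Dallas</Origin>
  <Destination>Fort Worth</Destination>
  <Flight>
    <Departure>09:15</Departure>
    <Arrival>09:16</Arrival>
  </Flight>
</Flights>
"""


def shred(xml_text):
    """Map each element to a row (id, parent, name, value), ids in document order."""
    rows = []

    def visit(elem, parent_id):
        node_id = len(rows) + 1
        text = (elem.text or "").strip()
        # Elements that only contain child elements get the value None (NULL).
        rows.append((node_id, parent_id, elem.tag, text or None))
        for child in elem:
            visit(child, node_id)

    visit(ET.fromstring(xml_text), None)
    return rows


rows = shred(FLIGHTS_XML)
for row in rows:
    print(row)
```

In this sketch the `Departure` row gets the id of the enclosing `Flight` element as its parent, so the tree can be rebuilt from the table by following the parent column. Attributes, mixed content and sibling order beyond the id sequence are deliberately left out.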
1.5 XML Fundamentals

• Reasons for the XML success:
  – XML is a general data representation format
  – XML is human readable
  – XML is machine readable
  – XML is internationalized (UNICODE)
  – XML is platform independent
  – XML is vendor independent
  – XML is endorsed by the World Wide Web Consortium
  – XML is not a new technology
  – XML is not only a data representation format, it's a full infrastructure of technologies
[Fisch05]

• W3C: World Wide Web Consortium
  – Established in 1994
  – Initiator: Tim Berners-Lee
  – Over 400 member organizations from more than 40 countries
  – Mission:
    • "To lead the World Wide Web to its full potential by developing protocols and guidelines that ensure long-term growth for the Web."

• W3C Process
[Figure: W3C process. Source: Mario Jeckle, www.jeckle.de]

• Structure of XML documents
  – XML prolog
  – Document Type Definition (DTD)
  – Document Instance

  <Bücher>
    <Buch>
      <Autor id="1234567890">Rainer Eckstein</Autor>
      <Autor id="1234568723">Silke Eckstein</Autor>
      <Titel>XML und Datenmodellierung</Titel>
      <Untertitel>XML-Schema ...</Untertitel>
      <Verlag id="3-89864">dpunkt.Verlag</Verlag>
    </Buch>
  </Bücher>

  – XML documents have to be well-formed

• Document Type Definition

  <!DOCTYPE Bücher [
    <!ELEMENT Bücher (Buch)* >
    <!ELEMENT Buch (Autor+, Titel, Untertitel?, Verlag) >
    <!ELEMENT Autor (#PCDATA) >
    <!ATTLIST Autor id    ID    #REQUIRED
                    email CDATA #IMPLIED >
    <!ELEMENT Titel (#PCDATA) >
    <!ELEMENT Untertitel (#PCDATA) >
    <!ELEMENT Verlag (#PCDATA) >
  ]>

  – Validity: a document is valid if it is well-formed and conforms to its DTD
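Well-formedness, which every XML document has to satisfy, can be checked mechanically. The sketch below uses the parser in Python's standard `xml.etree.ElementTree` module, which checks well-formedness only and does not validate a document against a DTD. The helper name `is_well_formed` and the two broken sample strings are assumptions made for this illustration.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text):
    """Return True if the text parses as XML, False otherwise."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


good = "<Buch><Titel>XML und Datenmodellierung</Titel></Buch>"
bad_nesting = "<Buch><Titel>XML und Datenmodellierung</Buch></Titel>"
missing_close = "<Buch><Titel>XML und Datenmodellierung</Titel>"

for doc in (good, bad_nesting, missing_close):
    print(is_well_formed(doc))
```

Validity is the stronger property: checking that a `Buch` really contains at least one `Autor` followed by a `Titel` requires a validating parser, which the module used here is not.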
• XML Schema

  <xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
    <xsd:element name="Bücher">
      <xsd:complexType>
        <xsd:sequence>
          <xsd:element name="Buch" maxOccurs="unbounded" minOccurs="0">
            <xsd:complexType>
              <xsd:sequence>
                <xsd:element name="Autor" maxOccurs="unbounded">
                  <xsd:complexType>
                    <xsd:simpleContent>
                      <xsd:extension base="xsd:string">
                        <xsd:attribute name="id" type="xsd:ID"/>
                        <xsd:attribute name="email" type="xsd:string"/>
                      </xsd:extension>
                    </xsd:simpleContent>
                  </xsd:complexType>
                </xsd:element>
                ...
  </xsd:schema>

• Misunderstanding about XML
  – "Data is self-describing."
  – Tags don't hold semantics, they only hold the structure of the information
  – The interpretation of the tags is in the application that handles the data, not in the tags themselves.
[Fisch05]

• XML as a family of technologies
  – XML Information Set
  – XML Schema
  – XML Query
  – Extensible Stylesheet Language Transformations (XSLT)
  – XLink, XPointer
  – XML Forms
  – XML Protocol
  – XML Encryption
  – XML Signature
  – Others
  – … almost all the pieces needed for a good XML-based information hub
[Fisch05]

[Figure: XML technology family. Source: Mario Jeckle, www.jeckle.de]

• Overview of XML Technologies
  – W3C Standards
    • Data: XML, Namespaces, Infoset, Schema
    • Communication: SOAP, Encryption, WSDL, UDDI
    • Processing: XPath, XSLT, XQuery, XUpdate, XQuery Text
    • Integration: RDF, OWL
  – Other Standards
    • Vertical domains: RosettaNet, ebXML, SBML, GML
    • Workflow: BPEL
    • Interfaces: DOM, SAX, JAXP, SQL/XML
[Fisch05]

1.6 Organisational matters

• Who is who?
  – Silke Eckstein (Lecture, exams)
  – Andreas Kupfer (Tutorial)
  – Regine Dalkıran (Office)
  – Wolf-Tilo Balke (Head)

• Lectures:
  – Monday, 9:45 – 11:15 (IZ 131, lecture)
  – Monday, 11:30 – 12:15 (IZ 131, tutorial)
• Office hours:
  – Silke Eckstein: Tuesday, 12:30 – 13:30, IZ 232
  – Andreas Kupfer: Friday, 10:30 – 11:30, IZ 213
• Course homepage:
  – http://infbsdb1.idb.cs.tu-bs.de/eckstein/xmldatabases
  – Lecture notes, links, latest news etc.
• In case of questions, don't hesitate to ask us.

• Assignments:
  – Presentations as well as programming
  – Details will be announced
• Credits: 4
• Exams: Oral
  – Master students: agree on a certain week in Feb./Mar.
  – Diploma students: by appointment
• Please contact R. Dalkiran (regine.dalkiran at tu-braunschweig.de) for an exam appointment.

1.7 Overview

1. Introduction
2. XML Basics
3. Schema definition
4. XML query languages I
5. Mapping relational data to XML
6. SQL/XML
7. XML processing
8. XML query languages II
9. XML storage I
10. XML storage - index
11. XML storage - native
12. Updates / Transactions
13. Systems
14. XML Benchmarks
1.8 References

• [W3C] http://www.w3.org/
• [LS04] Lehner & Schöning: XQuery: Grundlagen und fortgeschrittene Methoden. Dpunkt-Verlag, 2004, ISBN 3898642666
• [HM04] Harold & Means: XML in a Nutshell. O'Reilly, 2004, ISBN 0596007647
• [KM02] Klettke & Meyer: XML & Datenbanken. Konzepte, Sprachen und Systeme. Dpunkt-Verlag, 2002, ISBN 3898641481
• [Pow07] Gavin Powell: Beginning XML Databases. Wiley & Sons, 2007, ISBN 0471791202
• [Sch02] Harald Schöning: XML und Datenbanken. Hanser, 2002, ISBN 3446220089
• [Fisch05] Peter Fischer: "XML und Datenbanken", Lecture, ETH Zürich, WS 05/06
• [EN06] Elmasri & Navathe: Fundamentals of Database Systems. Addison Wesley, 2006, ISBN 032141506X

Questions, Ideas, Comments

• Now, or ...
  – Room: IZ 232
  – Office hour: Tuesday, 12:30 – 13:30 or by appointment
  – Email: [email protected]