10. XML Storage 1 10.1 Motivation 10.2 Text-based storage XML Databases 10. XML Storage 1 – Overview 10.3 Model-based storage Silke Eckstein Andreas Kupfer Institut für Informationssysteme Technische Universität Braunschweig http://www.ifis.cs.tu-bs.de 10.5 Conclusion 10.4 Schema-based storage 10.6 Overview and References XML Databases – Silke Eckstein – Institut für Informationssysteme – TU Braunschweig 10.1 Motivation 10.1 Motivation • Applications require different types of XML documents • Characterisation of XML documents: – Structure vs. content – Regular vs. irregular – Data-centric documents • Thus, XML documents are • Structured, regular • E.g. product catalog, order, invoice – Data-centric – Document-centric – or somewhere in-between – Document-centric documents • Questions – Storage of XML documents – Efficient processing of queries on the stored documents or data • Unstructured, irregular • E.g. scientific article, book, email, web page • There are several methods for storage – Semi-structured documents – 1st goal: Learn and understand methods – 2nd goal: Classify methods • Data-centric and document-centric parts • E.g. publications, Amazon, MS Press (example chapters) • Principles • Advantages and disadvantages • Usage XML Databases – Silke Eckstein – Institut für Informationssysteme – TU Braunschweig 3 10.1 Motivation 4 • Storage approaches for XML documents – Order preserving and lossless storage of XML documents – Efficient access to XML documents or parts thereof – Text-based • Storage as character data – Model-based • Quick response time for • On top of the relational DBMS • Inside the relational DBMS – Queries – Update operations – Schema-based Indexing Transaction processing Support of XPath and XQuery Support of SAX and DOM for applications XML Databases – Silke Eckstein – Institut für Informationssysteme – TU Braunschweig XML Databases – Silke Eckstein – Institut für Informationssysteme – TU Braunschweig 10.1 Motivation • Requirements for the physical layer: • • • • 2 • Mapping to (object-)relational databases – Deriving the database schema from the XML structure – Using user defined mapping procedures 5 XML Databases – Silke Eckstein – Institut für Informationssysteme – TU Braunschweig 6 10.1 Motivation Text-based storage Schema-based storage 10. XML Storage 1 Model-based storage 10.1 Motivation 10.2 Text-based storage 10.2.1 Index structures 10.3 Model-based storage 10.4 Schema-based storage 10.5 Conclusion 10.6 Overview and References – Examples – 7 XML Databases – Silke Eckstein – Institut für Informationssysteme – TU Braunschweig XML Databases – Silke Eckstein – Institut für Informationssysteme – TU Braunschweig 10.2 Text-based storage 8 10.2.1 Index structures • The whole XML document text is stored as character data • Index structures for XML documents allow efficient access for specific queries – File in the file system – CLOB (Character-Large-OBject) in the DBS – Different types of indexes are optimized for different types of queries • Operations on documents as a whole are very efficient – Reading and writing the whole document – But the content is monolithic and opaque with respect to the relational query engine (query can't inspect a fragment) • Generate redundancy – Index has to be up-to-date by propagating data changes • Getting granular access requires additional support – Full text index – Path index • Index structures can be storage structures as well – They define the storage method XML Databases – Silke Eckstein – Institut für Informationssysteme – TU Braunschweig 9 XML Databases – Silke Eckstein – Institut für Informationssysteme – TU Braunschweig 10.2.1 Index structures 10 10. 2.1 Index structures • Types of index structures • B-tree as value index for an XML fragment document – Value index • Indexes atomar values of an XML document, like element content or attribute values • Index format for structured parts of XML documents • Already known from databases (B-trees, hash index, …) – Full text index • Indexes single words from the full text • Index format for unstructured parts of XML documents • Already known from Information Retrieval (inverted lists, tries, suffix trees, …) – Path index • Indexes subtrees/paths in an XML document • Index format for semistructured parts of XML documents • Already known from object-databases (access support relations, …) XML Databases – Silke Eckstein – Institut für Informationssysteme – TU Braunschweig 11 [Tür08] XML Databases – Silke Eckstein – Institut für Informationssysteme – TU Braunschweig 12 10. 2.1 Index structures 10. 2.1 Index structures • Full text index • Inverted list as full text index for XML – Not limited to exact matches word occurrence word position in the text • Keyword-based search and boolean retrieval • Pattern search (with regular expressions) – Use of • Statistical, word-based methods – Stop word removal – Elimination of uncommon items • Linguistic methods – Normalization of words (e.g. capitalisation, hyphenation,) – Word decomposition by rules (engl.) or dictionaries (german) – Stemming • Knowledge-based methods – Use of ontologies and thesauri to search for synonyms, hypernyms and hyponyms XML Databases – Silke Eckstein – Institut für Informationssysteme – TU Braunschweig 13 10. 2.1 Index structures word [Tür08] XML Databases – Silke Eckstein – Institut für Informationssysteme – TU Braunschweig 14 10. 2.1 Index structures occurrence • Path index – Structure information must be identifiable and reconstructable • Assigning the markup to the content as well as • Representing the hierarchical nesting and order of elements/attributes – Especially suited for keyword search with regard to structure or path expressions FOR $b IN //book WHERE CONTAINS($b/author,"Benjamin") RETURN $b [Tür08] XML Databases – Silke Eckstein – Institut für Informationssysteme – TU Braunschweig 15 10. 2.1 Index structures • Conclusion – Nested path index – Efficient query processing on XML documents requires different types of index structures – Value index • Access to root node from every node – Multi-index • For efficient access to structured parts • Keyword search, value search • Accessing parent nodes – Join-index – Full text index • Access parent and child nodes • For efficient access to unstructured parts – Access Support Relations (ASR) – Path index • Using the document structure • Navigating queries • Generalization of indexes above, by listing all paths in a table XML Databases – Silke Eckstein – Institut für Informationssysteme – TU Braunschweig 16 10. 2.1 Index structures • Types of path indexes [Tür08] XML Databases – Silke Eckstein – Institut für Informationssysteme – TU Braunschweig 17 XML Databases – Silke Eckstein – Institut für Informationssysteme – TU Braunschweig 18 10.2 Text-based storage 10. XML Storage 1 • Summary text-based storage 10.1 Motivation – Schema definition: • not required 10.2 Text-based storage – Document reconstruction: • documents stay in their original format 10.3 Model-based storage – Queries: • Information retrieval queries • Processing the markup of the queries • XML queries possible 10.3.1 On top of the relational DBMS 10.3.2 Inside the relational DBMS – Special features: • Full text functions 10.4 Schema-based storage – Efficiency: • Character string must be parsed on every access with XML processors expensive • No concurrency on read or write no parallel processing 10.5 Conclusion – Usage: 10.6 Overview and References • For document-centric XML applications • Suitable to only a limited extent also for semi-structured applications XML Databases – Silke Eckstein – Institut für Informationssysteme – TU Braunschweig 19 XML Databases – Silke Eckstein – Institut für Informationssysteme – TU Braunschweig 10.3.1 Model-based storage 20 10.3.1 Model-based storage • The EDGE approach [FK99] • Idea: generic storage of the graph structure – XML elements, XML attributes, … are nodes of a graph – Nesting of elements defines edges – Nodes get an (internal) ID based on graph traversal • Using relations or object classes to store elements and attributes Elements ID Element name Value Reference to preceeding Rank Attributes ID Attribute name Value Reference to element XML documents • Document structure can be restored completely • Extension for data type adapted storage is possible XML Databases – Silke Eckstein – Institut für Informationssysteme – TU Braunschweig – Variant BINARY: horizontal partition of EDGE based on label 21 [Tür08] 10.3.1 Model-based storage XML Databases – Silke Eckstein – Institut für Informationssysteme – TU Braunschweig 22 10.3.1 Model-based storage • XML queries • Example: list bargain buy with prices SELECT a.content, b.content FROM Edge a, Edge b WHERE (a.label = 'price') AND (a.content < 10.00) AND (b.label = 'description') AND (b.parent = a.parent) AND (a.key = b.key) – XML queries (XPath, XQuery) are mapped to SQL queries (taking storage structures into account) – Result of XML query is generated from result of database query • "Labeling" of the result tuples • Result is in XML format [Tür08] XML Databases – Silke Eckstein – Institut für Informationssysteme – TU Braunschweig 23 [Tür08] XML Databases – Silke Eckstein – Institut für Informationssysteme – TU Braunschweig 24 10.3.1 Model-based storage 10.3.1 Model-based storage DOM-based storage – example • DOM-based storage Node type: ELEMENT – Information from the Document Object Model are stored in the database – Storage alternatives • (Object-)relational databases • Object-oriented databases • Developing own data structure [Tür08] XML Databases – Silke Eckstein – Institut für Informationssysteme – TU Braunschweig Node type: ATTRIBUTE Node type: TEXT 25 [Tür08] 10.3.1 Model-based storage 26 XML Databases – Silke Eckstein – Institut für Informationssysteme – TU Braunschweig 10.3.1 Model-based storage Summary model-based storage on top of the DBMS • XML Queries – Schema definition: • not required for storage – Document reconstruction: • Possible, but expensive – Queries: – XML queries (DOM method invocations) are mapped to SQL queries (taking storage structures into account) – Result of method invocation is generated from result of database query • XML queries possible • Adapted database queries – Special features: • Querying many elements/attributes is expensive – Efficiency: • Navigation from the given context is efficient • Restoring the document and evaluating path expressions is inefficient – Usage: • For data- and document-centric as well as for semi-structured XML applications [Tür08] XML Databases – Silke Eckstein – Institut für Informationssysteme – TU Braunschweig 27 28 XML Databases – Silke Eckstein – Institut für Informationssysteme – TU Braunschweig 10. XML Storage 1 10.3.2 XML data type <City> • SQL/XML 10.1 Motivation 10.2 Text-based storage <City> Storing XML documents inside the database as values of type XML 10.3 Model-based storage 10.3.1 On top of the relational DBMS <City> <Name> Braunschweig </Name> <Zip>38100</Zip> <Zip>38106</Zip> <State> Niedersachsen </State> </City> Generating XML documents using SQL/XML functions – Datatype XML with Mapping between SQl and XML belonging functions SQL XQuery – Mapping between SQL and XML SQL database XML datatype – Embedding XQuery in SQL 10.3.2 Inside the relational DBMS 10.4 Schema-based storage 10.5 Conclusion 10.6 Overview and References XML Databases – Silke Eckstein – Institut für Informationssysteme – TU Braunschweig 29 [Tür08] XML Databases – Silke Eckstein – Institut für Informationssysteme – TU Braunschweig 30 10.3.2 XML data type 10.3.2 XML data type • A value of the data type XML can contain XML(SEQUENCE) – whole XML document – XML element – a set of XML elements NULL or document node Untyped elements & attributes, elements not NULL • All XML publishing operators from chapter 9.4 create values of the data type XML, not a string XML(CONTENT(ANY)) Validated against schema 1 element child XML(CONTENT(UNTYPED)) 1 element child XML(CONTENT(XMLSCHEMA)) XML(DOCUMENT(ANY)) 1 element child Validated against schema XML(DOCUMENT(UNTYPED)) XML Databases – Silke Eckstein – Institut für Informationssysteme – TU Braunschweig 31 [Tür08] XML Databases – Silke Eckstein – Institut für Informationssysteme – TU Braunschweig 10.3.2 XML data type • Create a table that is an XML data type in itself – CREATE TABLE XMLDOCUMENT OF XMLTYPE; XML [({DOCUMENT|CONTENT|SEQUENCE} [({ANY|UNTYPED|XMLSCHEMA schema name})])] • Create a table containing an XMLType data type column • Modifiers are optional • Primary type modifier – DOCUMENT (XML document) – CONTENT (XML element) – SEQUENCE (sequence of XML elements) CREATE TABLE XML ( ID XML CONSTRAINT ); • Secondary type modifier – UNTYPED – XMLSCHEMA (typed) – ANY (may be typed) XML Databases – Silke Eckstein – Institut für Informationssysteme – TU Braunschweig 33 [Pow07] 10.3.2 XML data type 123 <Groups>Annabelle</Groups> 234 <Groups>Magdalena, Marius</Groups> 345 <?xml version 1.0?> <Groups> <Person>Patrick</Person> <Person>Robert</Person> </Groups> 654 34 – Allowed values: • XML documents (including prolog) • XML content according to XML 1.0 (includes pure text comments, PI?) • NULL – No comparison possible (compare CLOB in SQL) • User can define an order, if comparison is necessary – No corresponding type in programming languages for embedding in SQL available – Standard defines operators to convert to other SQL data types <Groups>Rebecca</Groups> <Groups>Torben</Groups> XML Databases – Silke Eckstein – Institut für Informationssysteme – TU Braunschweig XML Databases – Silke Eckstein – Institut für Informationssysteme – TU Braunschweig • Characteristics CREATE TABLE Groups ( ID INTEGER, Name XML ); Name NUMBER NOT NULL, XMLTYPE, XPK PRIMARY KEY (ID) 10.3.2 XML data type • Example: Definition of an XML type column ID 32 10.3.2 XML data type • Specification of XML type [Tür08] XML(DOCUMENT(XMLSCHEMA)) 35 [Kud07] XML Databases – Silke Eckstein – Institut für Informationssysteme – TU Braunschweig 36 10.3.2 XML data type 10.3.2 XML data type • Parsing & Serialization – XMLParse: • Validation of XML <City> <Name> Braunschweig </Name> <Zip>38100</Zip> <Zip>38106</Zip> <State> Niedersachsen </State> </City> – Is like integrity constraints in DBs – Requires an XML Schema – XML Schemas may be registered with the SQL-server • Parses a string value using an XML parser • Produces value whose specific type is XML(DOCUMENT(ANY)), or …CONTENT…, or • Implementation-defined mechanism • Known by SQL name & by target namespace URI – XMLSerialize – Schema does need a unique name • Transforms an XML value into a string value (CHAR, VARCHAR, CLOB, or BLOB) • Used by XMLValidate(), IS VALID, and to restrict values of XML(DOCUMENT-or-CONTENT(XMLSCHEMA)) <City> <Name> Braunschweig </Name> <Zip>38100</Zip> <Zip>38106</Zip> <State> Niedersachsen </State> </City> [Mel05] XML Databases – Silke Eckstein – Institut für Informationssysteme – TU Braunschweig 37 [Mel05] 10.3.2 XML data type • Schema definition – Syntax Register XMLSCHEMA 'http://www.Alfred-Moos.de/GrussSchema.xsd' FROM 'file://c:/XML_Schemata/GrussSchema.xsd' AS GrussSchema COMPLETE ; XML(CONTENT(XMLSCHEMA) <schema> [<elements>])) <schema> := URI <namespace> [LOCATION <loc>] | NO NAMESPACE [LOCATION <loc>] | ID <registered schema name> – <element> := [NAMESPACE <namespace>] ELEMENT <element name> CREATE TABLE Dokument_XML (Dokument_XML_Nr CHAR (4) NOT NULL PRIMARY KEY, Dokument XML, CONSTRAINT validieren CHECK (Dokument IS VALIDATED ACCORDING TO XMLSCHEMA ID GrussSchema ) ) ; 39 [Tür08] 10.3.2 XML data type 40 • Benefits of schema registration – XMLValidate – Security issues • Validates an XML value against an XML Schema (or target namespace), returning new XML value with type annotations • Schemas cannot “disappear” without SQLserver knowing about it • Schemas cannot be “hijacked” (altered in inappropriate ways) without SQL-server knowing about it • Documents cannot be marked “valid” against schemas unless SQL-server knows about them – IS VALID • Tests an XML value to determine whether or not it is valid according to an XML Schema (or target namespace); return true/false without altering the XML value itself – IS DOCUMENT • determines whether an XML value satisfies the (SQL/XML) criteria for an XML document – IS CONTENT • determines whether an XML value satisfies the (SQL/XML) criteria for XML content XML Databases – Silke Eckstein – Institut für Informationssysteme – TU Braunschweig XML Databases – Silke Eckstein – Institut für Informationssysteme – TU Braunschweig 10.3.2 XML data type • New functions and predicates: [Mel05] 38 10.3.2 XML data type • Schema registration XML Databases – Silke Eckstein – Institut für Informationssysteme – TU Braunschweig XML Databases – Silke Eckstein – Institut für Informationssysteme – TU Braunschweig 41 [Mel05] XML Databases – Silke Eckstein – Institut für Informationssysteme – TU Braunschweig 42 10.3.2 XML standard 10.3.2 SQL/XML standard • SQL/XML standard published as • Predefined schemas (build-in namespaces) – ISO/IEC 9074-14:2003 – xs:http://www.w3.org/2001/XMLSchema – xsi:http://www.w3.org/2001/XMLSchema-instance – sqlxml:http://standards.iso.org/iso/9075/2003/sqlxml • Mappings and Publishing Functions – ISO/IEC 9075-14:2006 – More depending on the DB implementation • Adds XQuery, including Data Model,Validation • Completely supported per XML+Namespaces: – ISO/IEC 9075-14:2008 – XMLElement, XMLForest, XMLTable • Updates – Default namespace, explicit namespace (prefix) – Declare namespace within scopes of WITH clause, column definitions, constraint definitions, insert/delete/update statements, compound statements [Mel05] XML Databases – Silke Eckstein – Institut für Informationssysteme – TU Braunschweig 43 [Mel05] 10.3.2 SQL/XML standard • Overview of some operators for the XML type – Additional publishing functions – XQuery data model – More precise XML type (modifiers) – XMLQuery, XMLTable – XMLValidate, IS VALID – XMLExists, IS DOCUMENT, IS CONTENT – Casting between XML type and SQL types XML Databases – Silke Eckstein – Institut für Informationssysteme – TU Braunschweig – XMLELEMENT – creates an XML element node – XMLFOREST – creates a sequence of XML element nodes from a table – XMLCOMMENT – creates an XML comment node – XMLTEXT – creates a text node – XMLPI – creates a processing instruction – XMLAGG – aggregates XML values of a group – XMLCONCAT – concatenates XML type values – XMLTRANSFORM – applies an XSL to a document 45 [Tür08] 10.3.2 SQL/XML standard 46 • Review of SQL/XML – Two components • A data type XML to store XML data • Functions to map relational structures to XML – XMLPARSE – a well-formed SQL text to XML value – XMLSERIALIZE – converts an XML value to a SQL text – XMLDOCUMENT – creates an XML document node from an XML value – XMLVALIDATE – validates an XML value with a schema – XMLQUERY – evaluates an XQuery expression – XMLTABLE – transforms an XQuery result to a SQL table – XMLITERATE – transforms an XQery sequence to a SQL table XML Databases – Silke Eckstein – Institut für Informationssysteme – TU Braunschweig XML Databases – Silke Eckstein – Institut für Informationssysteme – TU Braunschweig 10.3.2 SQL/XML standard • ... Overview of some operators for the XML type [Tür08] 44 10.3.2 SQL/XML standard • SQL/XML:2003 plus [Mel05] XML Databases – Silke Eckstein – Institut für Informationssysteme – TU Braunschweig 47 – Only construction operators • No extraction of values or search • But construction operators are based on XQuery – Mapping of tables, schemas, catalogues ignores some information from the relational schema • UNIQUE • REFERENCES • CHECK – Further extensions are expected [Kud07] XML Databases – Silke Eckstein – Institut für Informationssysteme – TU Braunschweig 48 10.3.2 Model-based storage 10. XML Storage 1 Summary model-based storage inside the DBMS 10.1 Motivation – Schema definition: 10.2 Text-based storage • not required for storage – Document reconstruction: 10.3 Model-based storage • Possible without problems – Queries: 10.4 Schema-based storage • XML queries possible – Special features: 10.4.1 Automatic mapping • Full integration of SQL and XML 10.4.2 User-defined mapping – Efficiency: • Vendor specific • Optimized data structures possible 10.5 Conclusion – Usage: 10.6 Overview and References • For data- and document-centric as well as for semi-structured XML applications 49 XML Databases – Silke Eckstein – Institut für Informationssysteme – TU Braunschweig 10.4.1 Schema-based storage • Motivation 10.4.1 Schema-based storage • Generating the DB schema for a DTD: – XML content shall be stored in a conventional database – Accepting the loss of native access – DB schema is derieved from a DTD or an XML schema – Rules to map element types: XML element type Sequence of element types Alternative of element types Element type with quantifier ? Element type with quantifier +,* Nested element types • Problem – Generate DB schema automatically – Thereby use as much structure information as possible • General approach for mapping from a DTD – – – – Transform DTD into a tree representation Nodes: element types, attributes, etc. (type layer!!!) Edges: nesting relationships of element types and their restrictions Traverse tree in order to transform nodes and edges into database tables (according to certain rules) XML attribute IMPLIED REQUIRED Default value DTD is usually required Queries use SQL functionality RDBMS data types are used (e.g. prices are NUMERIC) Problem: Mapping of collection types • Subdivide into additional relations Customer_Info: ID Fname Customer_info Feedback C0001 F0001 Lname Comment: Email Comment_ID C0001 Charles Sanchez C.Sanchez@hotmail... Feedback: ID F001 10.4.1 Schema-based storage – DTD is usually required – Queries use SQL functionality – "Natural" mapping to tuple types, collection types – In case of irregular document structure databases contain many null values. – Example: 44901 52 • Mapping to object relational databases • Mapping to relational databases Comment: Comment_ID column of a table null values allowed null values not allowed DEFAULT constraint XML Databases – Silke Eckstein – Institut für Informationssysteme – TU Braunschweig 10.4.1 Schema-based storage – – – – column of a table columns of a table column of a table column with null values set/list of columns (SET OF, LIST OF) TUPLE OF – Rules to map attributes: 51 XML Databases – Silke Eckstein – Institut für Informationssysteme – TU Braunschweig 50 XML Databases – Silke Eckstein – Institut für Informationssysteme – TU Braunschweig Type Content opinion Darjeeling Special… XML Databases – Silke Eckstein – Institut für Informationssysteme – TU Braunschweig 53 44901 <Customer_info> <Feedback> Fname Lname Email Type Charles Sanchez C.Sanchez@hotmail... opinion Darjeeling Specia… XML Databases – Silke Eckstein – Institut für Informationssysteme – TU Braunschweig Content 54 10.4.1 Schema-based storage 10.4.1 Schema-based storage • Mapping of recursive data definitions – DTDs can be recursive – Infinite recursion is impossible on instance layer of a database – Procedure: • • • • Marking the nodes Subdividing into separate tables Use primary and foreign keys in RDBMS Use reference types in ORDBMS <!ELEMENT book (front, body, references)> <!ELEMENT references (book+)> [Tür08] XML Databases – Silke Eckstein – Institut für Informationssysteme – TU Braunschweig 55 XML Databases – Silke Eckstein – Institut für Informationssysteme – TU Braunschweig 10.4.1 Schema-based storage • Mapping of element sequences 56 10.4.1 Schema-based storage • Mapping of alternatives – Sequence can be important – XML allows to specify alternatives – Example: • Use an additional attribute in these cases – Example: <!ELEMENT car (compactCar | sedan | van)*> – Three possible storage variants <lecture> <lesson>Introduction</lesson> <lesson>XML basics</lesson> … • Each alternative is stored as separate table column • Subdivide alternatives in separate tables • Use a table column of type XML type ⇓ Order Lesson 1 Introduction 2 XML basics XML Databases – Silke Eckstein – Institut für Informationssysteme – TU Braunschweig 57 XML Databases – Silke Eckstein – Institut für Informationssysteme – TU Braunschweig 10.4.1 Schema-based storage • Variant 1 – all alternatives in one table 58 10.4.1 Schema-based storage • Variant 2 – subdivided into multiple tables • • – Problem: many null values (wasting storage space) [Tür08] XML Databases – Silke Eckstein – Institut für Informationssysteme – TU Braunschweig – For queries, combination of tables is needed 59 [Tür08] XML Databases – Silke Eckstein – Institut für Informationssysteme – TU Braunschweig 60 10.4.1 Schema-based storage 10.4.1 Schema-based storage • Mapping with STORED (Semistructured TO RElational Data) • Variant 3 – Using column type XML – Basic idea: Use data mining techniques on the XML structure to find a good mapping to tables [DFS99] – Input • • • • XML documents (or an average sample of the collection) Query workload Restrictions of storage space, number of tables, … No DTD or XML schema is required! – Output • Relational schema • STORED-queries: Mapping instructions for XML documents to DB tables – Procedure • Determine the XML subtrees with the largest support in the collection and in the queries • These subtrees are materialised in tables • Irregular data is stored in overflow tables according to the EDGE approach – XML type allows XML queries or DOM methods [Tür08] XML Databases – Silke Eckstein – Institut für Informationssysteme – TU Braunschweig 61 10.4.1 Schema-based storage XML Databases – Silke Eckstein – Institut für Informationssysteme – TU Braunschweig 62 10.4.1 Schema-based storage • Mapping with STORED – example • Schema-based storage with automatic mapping Subtrees with high support – Advantages • Queries, data types, aggregation functions, views • Integration in other databases when storing structured data – Disadvantages • Large schema, sparsely filled databases (many null values) • No flexible data types, storage of alternatives has problems • Less flexible queries XML documents shown as tree structure [Tür08] XML Databases – Silke Eckstein – Institut für Informationssysteme – TU Braunschweig – No information retrieval queries possible without additional extensions – No full text operations for semi- or unstructured data – Usually native access is not possible any more 63 10.4.1 Schema-based storage • Summary Schema-based storage with automatic mapping 64 10.4.2 User-defined Mapping • User defined mapping – Schema definition: – Idea • Is usually required and analysed • not required, e.g. for STORED • In all previously shown methods it is not possible to affect the storage in the DB • With user defined mappings the user defines the storage structure • The structure of XML documents and database schema can be designed independently from each other • Also possible: storing XML documents in existing databases – Document reconstruction: • Limited (requires logging of the mapping process) – Queries: • Database queries • XML queries possible,but lack the XPath horizontal axes, e.g. following, preceding-sibling – Annotation of DTD and XML schema, respectively – Special features: • In many cases the mapping definition is combined with existing schema information • Federation with existing databases is possible – Efficiency: – Only limited XML queries possible • High efficiency by using the DB-engine • Logging of the mapping process from XML documents to databases • For a given query all relevant data has to be stored (lossless mapping) – Usage: • For data-centric XML applications, but with limited nesting XML Databases – Silke Eckstein – Institut für Informationssysteme – TU Braunschweig XML Databases – Silke Eckstein – Institut für Informationssysteme – TU Braunschweig 65 XML Databases – Silke Eckstein – Institut für Informationssysteme – TU Braunschweig 66 10.4.2 User-defined Mapping 10.4.2 User-defined Mapping • Mapping solutions with different specializations • The shredder can be part of the DB – Algorithms, middleware, commercial applications, … – Varying amount of required input or user decisions – Many algorithms create different database schemas – Usually requires an XML schema – In the IBM Data Studio, the shredder is part of the "annotated XML schema decomposition" – Direct approach in DB2: • Two phases • register the XML schema and call the stored procedure: – Mapping register xmlschema http://our.org/custacc from dec_files/custacc.xsd as cust_schema ; complete xmlschema cust_schema enable decomposition ; call SYSPROC.XDBDECOMPXML ('VRODRIG', 'CUST_SCHEMA', ? , ?, 1, null, null, null) • Assign a place for each node type in the DB – Shredding • Import the XML data as DB tuples 67 XML Databases – Silke Eckstein – Institut für Informationssysteme – TU Braunschweig 68 XML Databases – Silke Eckstein – Institut für Informationssysteme – TU Braunschweig 10.4.2 User-defined Mapping 10.4.2 User-defined Mapping • Shredding without XML schema in DB2 • Example: XML document – XMLTABLE function in combination with an INSERT mapping instruction INSERT INTO ENVELOPEXT (MAILFROM, MAILTO, MAILDATE, SUBJECT) SELECT MAILFROM, MAILTO, MAILDATE, SUBJECT FROM XMLTABLE( XMLNAMESPACES('http://www.sal.com/mails' AS "email"), '$doc/email:mails/mail' (: some xquery-expression :) PASSING xml-source AS "doc" COLUMNS MAILFROM VARCHAR (100) PATH 'envelope/from', MAILTO VARCHAR (100) PATH 'envelope/to', MAILDATE VARCHAR (30) PATH 'envelope/email:Date', SUBJECT VARCHAR (100) PATH 'envelope/Subject') AS T; http://www.ibm.com/developerworks/db2/l ibrary/techarticle/dm-0801ledezma/ 69 XML Databases – Silke Eckstein – Institut für Informationssysteme – TU Braunschweig 10.4.2 User-defined Mapping 70 • Remarks – Example syntax for XML-DBMS (Roland Bourret) XML Databases – Silke Eckstein – Institut für Informationssysteme – TU Braunschweig XML Databases – Silke Eckstein – Institut für Informationssysteme – TU Braunschweig 10.4.2 User-defined Mapping • Mapping instruction <ClassMap> <ElementType Name="sales:SalesOrder"/> <ToClassTable> <Table Name="Sales"/> </ToClassTable> <PropertyMap> <Attribute Name="SONumber"/> <ToColumn> <Column Name="Number"/> </ToColumn> </PropertyMap> </ClassMap> [Tür08] – Many different mapping languages or schema annotations • Automatic mappings usually have an internal mapping language Connection between elements and tables – Remember the mapping constructs from last lecture (9). The SQL/XML annotations are a mapping language, too. – DB2 uses similar annotations as SQL/XML Connection between elements/attributes and table columns • see next slide: 71 XML Databases – Silke Eckstein – Institut für Informationssysteme – TU Braunschweig 72 <xsd:complexType xmlns:db2-xdb= "http://www.ibm.com/xmlns/prod/db2/xdb1" name="ROW.ACCOUNT"> <xsd:sequence> <xsd:element name="NAME" Mapping SQL type="CHAR_20" table columns to db2-xdb:rowSet="Account" XML elements db2-xdb:column="Name"/> <xsd:element name="BALANCE" type="NUMERIC_12_2"/> Mapping table db2-xdb:rowSet="Account" rows to XML db2-xdb:column="Balance"/> <row> </xsd:sequence> elements </xsd:complexType> Mapping SQL tables CREATE TABLE Account ( Name CHAR(20), Balance NUMERIC(12,2), ); Name Balance Joe 2000 Jim 3500 <ACCOUNT> <row> <NAME>Joe</NAME> <BALANCE>2000</BALANCE> </row> <row> <NAME>Jim</NAME> <BALANCE>3500</BALANCE> </row> </ACCOUNT> [Tür08] <xsd:complexType name="TABLE.ACCOUNT"> <xsd:sequence> <xsd:element name="row" type="ROW.ACCOUNT"/> </xsd:sequence> </xsd:complexType> 10.4.2 User-defined Mapping • Summary schema-based storage with user defined mapping – Schema definition: • Depends on mapping language – Document reconstruction: • Not possible in most cases (requires logging of the mapping process) – Queries: • Database queries • XML queries in rare cases only! – Special features: • Integration with existing databases is possible – Efficiency: • High efficiency by using the DB-engine – Usage: • For data-centric XML applications <xsd:element name="ACCOUNT" type="TABLE.ACCOUNT"/> XML Databases – Silke Eckstein – Institut für Informationssysteme – TU Braunschweig 73 10. XML Storage 1 XML Databases – Silke Eckstein – Institut für Informationssysteme – TU Braunschweig 10.5 Conclusion • Different methods for storage of XML documents 10.1 Motivation – Text-based • Storing whole XML documents as string • Can use full text index or path index 10.2 Text-based storage 10.3 Model-based storage – Model-based 10.4 Schema-based storage – Schema-based 10.5 Conclusion – Hybrid approaches 10.6 Overview and References – No algorithm has the optimal solution for all kind of XML documents – Reasonable solution is heavily dependent on the application XML Databases – Silke Eckstein – Institut für Informationssysteme – TU Braunschweig • Generic mapping of the tree structure • Detect and analyse the structure of the XML documents • Derive a DB schema from the structure • A combination of some of those methods 75 10.6 References • "XML und Datenbanken" [Tür08] "XML und Datenbanken" [KM03] – M. Klettke, H. Meier – dpunkt.verlag, 2003 • "Generierung eines adaptiven Datenbankschemas für datenzentrierte XMLDokumente" [Bus08] • [FK99] • [DFS99] – Carsten Busche – Diplomarbeit,TU Braunschweig, 2008 – D. Florescu, D. Kossmann: Storing and Querying XML Data using an RDBMS. IEEE Data engineering Bulletin (DEBU),Volume 22(3), Seiten 27-34, 1999. – A. Deutsch, M.F. Fernández, D. Suciu: Storing Semistructured Data with STORED. Proceedings of the 1999 ACM SIGMOD international conference on Management of data, Seiten 431-442, ACM, 1999. XML Databases – Silke Eckstein – Institut für Informationssysteme – TU Braunschweig XML Databases – Silke Eckstein – Institut für Informationssysteme – TU Braunschweig 76 10.6 Overview – Can Türker – Lecture, University of Zurich, 2008 • 74 77 Introduction and Basics 1. Introduction 2. XML Basics 3. Schema Definition 4. XML Processing Querying XML 5. XPath & SQL/XML Queries 6. XQuery Data Model 7. XQuery XML Updates 8. XML Updates & XSLT Producing XML 9. Producing XML Storing XML 10. XML storage 11.Relational XML storage 12. Storage Optimization Systems 13. Technology Overview XML Databases – Silke Eckstein – Institut für Informationssysteme – TU Braunschweig 78 Questions, Ideas, Comments • Now, or ... • Room: IZ 232 • Office our: Tuesday, 12:30 – 13:30 Uhr or on appointment • Email: [email protected] XML Databases – Silke Eckstein – Institut für Informationssysteme – TU Braunschweig 79