10. XML Storage 1 10.1 Motivation 10.2 Text-based storage XML Databases 10. XML Storage 1 – Overview 10.2.1 Index structures 10.3 Model-based storage Silke Eckstein Andreas Kupfer Institut für Informationssysteme Technische Universität Braunschweig http://www.ifis.cs.tu-bs.de 10.4 Schema-based storage 10.5 Conclusion 10.6 Overview and References XML Databases – Silke Eckstein – Institut für Informationssysteme – TU Braunschweig 10.1 Motivation 10.1 Motivation • Applications require different types of XML documents • Characterisation of XML documents: – Structure vs. content – Regular vs. irregular – Data-centric documents • Thus, XML documents are • Structured, regular • E.g. product catalog, order, invoice – Data-centric – Document-centric – or somewhere in-between – Document-centric documents • Questions – Storage of XML documents – Efficient processing of queries on the stored documents or data • Unstructured, irregular • E.g. scientific article, book, email, web page • There are several methods for storage – Semi-structured documents – 1st goal: Learn and understand methods – 2nd goal: Classify methods • Data-centric and document-centric parts • E.g. publications, Amazon, MS Press (example chapters) • Principles • Advantages and disadvantages • Usage XML Databases – Silke Eckstein – Institut für Informationssysteme – TU Braunschweig 3 10.1 Motivation 4 • Storage approaches for XML documents – Order preserving and lossless storage of XML documents – Efficient access to XML documents or parts thereof – Text-based • Storage as character data – Model-based • Quick response time for • Generic storage of the graph structure • Storage of the DOM – Queries – Update operations – Schema-based Indexing Transaction processing Support of XPath and XQuery Support of SAX and DOM for applications XML Databases – Silke Eckstein – Institut für Informationssysteme – TU Braunschweig XML Databases – Silke Eckstein – Institut für Informationssysteme – TU Braunschweig 10.1 Motivation • Requirements for the physical layer: • • • • 2 • Mapping to (object-)relational databases – Deriving the database schema from the XML structure – Using user defined mapping procedures 5 XML Databases – Silke Eckstein – Institut für Informationssysteme – TU Braunschweig 6 10. XML Storage 1 10.2 Text-based storage • The whole XML document text is stored as character data 10.1 Motivation – File in the file system – CLOB (Character-Large-OBject) in the DBS 10.2 Text-based storage 10.2.1 Index structures • Operations documents as a whole are very efficient – Reading and writing the whole document – But the content is monolithic and opaque with respect to the relational query engine (query can't inspect a fragment) 10.3 Model-based storage 10.4 Schema-based storage • Getting granular access requires additional support 10.5 Conclusion – Full text index – Path index 10.6 Overview and References XML Databases – Silke Eckstein – Institut für Informationssysteme – TU Braunschweig 7 10.2.1 Index structures 10.2.1 Index structures • Types of index structures • Index structures for XML documents allow efficient access for specific queries – Value index • Indexes atomar values of an XML document, like element content or attribute values • Index format for structured parts of XML documents • Already known from databases (B-trees, hash index, …) – Different types of indexes are optimized for different types of queries – Full text index • Generate redundancy • Indexes single words from the full text • Index format for unstructured parts of XML documents • Already known from Information Retrieval (inverted lists, tries, suffix trees, …) – Index has to be up-to-date by propagating data changes • Index structures can be storage structures as well – They define the storage method XML Databases – Silke Eckstein – Institut für Informationssysteme – TU Braunschweig 8 XML Databases – Silke Eckstein – Institut für Informationssysteme – TU Braunschweig – Path index • Indexes subtrees/paths in an XML document • Index format for semistructured parts of XML documents • Already known from object-databases (access support relations, …) 9 10. 2.1 Index structures XML Databases – Silke Eckstein – Institut für Informationssysteme – TU Braunschweig 10 10. 2.1 Index structures • Full text index • B-tree as value index for an XML fragment document – Not limited to exact matches • Keyword-based search and boolean retrieval • Pattern search (with regular expressions) – Use of • Statistical, word-based methods – Stop word removal – Elimination of uncommon items • Linguistic methods – Normalization of words (e.g. capitalisation, hyphenation,) – Word decomposition by rules (engl.) or dictionaries (german) – Stemming • Knowledge-based methods – Use of ontologies and thesauri to search for synonyms, hypernyms and hyponyms [Tür08] XML Databases – Silke Eckstein – Institut für Informationssysteme – TU Braunschweig 11 XML Databases – Silke Eckstein – Institut für Informationssysteme – TU Braunschweig 12 10. 2.1 Index structures 10. 2.1 Index structures word • Inverted list as full text index for XML word [Tür08] occurrence occurrence word position in the text XML Databases – Silke Eckstein – Institut für Informationssysteme – TU Braunschweig 13 [Tür08] 10. 2.1 Index structures • Types of path indexes – Nested path index – Structure information must be identifiable and reconstructable • Access to root node from every node • Assigning the markup to the content as well as • Representing the hierarchical nesting and order of elements/attributes – Multi-index • Accessing parent nodes – Join-index – Especially suited for keyword search with regard to structure or path expressions • Access parent and child nodes – Access Support Relations (ASR) FOR $b IN //book WHERE CONTAINS($b/author,"Benjamin") RETURN $b • Generalization of indexes above, by listing all paths in a table 15 10. 2.1 Index structures [Tür08] XML Databases – Silke Eckstein – Institut für Informationssysteme – TU Braunschweig 16 10.2 Text-based storage • Summary text-based storage • Conclusion – Schema definition: – Efficient query processing on XML documents requires different types of index structures – Value index • not required – Document reconstruction: • documents stay in their original format – Queries: • Information retrieval queries • Processing the markup of the queries • XML queries possible • For efficient access to structured parts • Keyword search, value search – Special features: – Full text index • Full text functions – Efficiency: • For efficient access to unstructured parts • Character string must be parsed on every access with XML processors expensive • No concurrency on read or write no parallel processing – Path index • Using the document structure • Navigating queries XML Databases – Silke Eckstein – Institut für Informationssysteme – TU Braunschweig 14 10. 2.1 Index structures • Path index XML Databases – Silke Eckstein – Institut für Informationssysteme – TU Braunschweig XML Databases – Silke Eckstein – Institut für Informationssysteme – TU Braunschweig – Usage: • For document-centric XML applications • Suitable to only a limited extent also for semi-structured applications 17 XML Databases – Silke Eckstein – Institut für Informationssysteme – TU Braunschweig 18 10. XML Storage 1 10.3 Model-based storage • Idea: generic storage of the graph structure 10.1 Motivation – XML elements, XML attributes, … are nodes of a graph – Nesting of elements defines edges – Nodes get an (internal) ID based on graph traversal 10.2 Text-based storage • Using relations or object classes to store elements and attributes 10.2.1 Index structures 10.3 Model-based storage Elements ID Element name Value Reference to preceeding Rank 10.4 Schema-based storage Attributes ID Attribute name Value Reference to element 10.5 Conclusion • Document structure can be restored completely • Extension for data type adapted storage is possible 10.6 Overview and References XML Databases – Silke Eckstein – Institut für Informationssysteme – TU Braunschweig 19 XML Databases – Silke Eckstein – Institut für Informationssysteme – TU Braunschweig 10.3 Model-based storage 20 10.3 Model-based storage • The EDGE approach [FK99] • XML queries – XML queries (XPath, XQuery) are mapped to SQL queries (taking storage structures into account) – Result of XML query is generated from result of database query • "Labeling" of the result tuples • Result is in XML format XML documents – Variant BINARY: horizontal partition of EDGE based on label [Tür08] XML Databases – Silke Eckstein – Institut für Informationssysteme – TU Braunschweig 21 [Tür08] 10.3 Model-based storage XML Databases – Silke Eckstein – Institut für Informationssysteme – TU Braunschweig 22 10.3 Model-based storage • Example: list bargain buy with prices • DOM-based storage – Information from the Document Object Model are stored in the database – Storage alternatives SELECT a.content, b.content FROM Edge a, Edge b WHERE (a.label = 'price') AND (a.content < 10.00) AND (b.label = 'description') AND (b.parent = a.parent) AND (a.key = b.key) • (Object-)relational databases • Object-oriented databases • Developing own data structure [Tür08] XML Databases – Silke Eckstein – Institut für Informationssysteme – TU Braunschweig 23 [Tür08] XML Databases – Silke Eckstein – Institut für Informationssysteme – TU Braunschweig 24 10.3 Model-based storage 10.3 Model-based storage DOM-based storage – example Node type: ELEMENT • XML Queries Node type: ATTRIBUTE – XML queries (DOM method invocations) are mapped to SQL queries (taking storage structures into account) – Result of method invocation is generated from result of database query Node type: TEXT [Tür08] XML Databases – Silke Eckstein – Institut für Informationssysteme – TU Braunschweig 25 10.3 Model-based storage • not required for storage 10.2 Text-based storage – Document reconstruction: • Possible, but expensive 10.2.1 Index structures – Queries: • XML queries possible • Adapted database queries 10.3 Model-based storage – Special features: • Querying many elements/attributes is expensive 10.4 Schema-based storage – Efficiency: • Navigation from the given context is efficient • Restoring the document and evaluating path expressions is inefficient 10.5 Conclusion – Usage: 10.6 Overview and References • For data- and document-centric as well as for semi-structured XML applications 27 10.4 Schema-based storage XML Databases – Silke Eckstein – Institut für Informationssysteme – TU Braunschweig 28 10.4 Schema-based storage • Motivation • Generating the DB schema for a DTD: – XML content shall be stored in a conventional database – Accepting the loss of native access – DB schema is derieved from a DTD or an XML schema – Rules to map element types: XML element type Sequence of element types Alternative of element types Element type with quantifier ? Element type with quantifier +,* Nested element types • Problem – Generate DB schema automatically – Thereby use as much structure information as possible • General approach for mapping from a DTD Transform DTD into a tree representation Nodes: element types, attributes, etc. (type layer!!!) Edges: nesting relationships of element types and their restrictions Traverse tree in order to transform nodes and edges into database tables (according to certain rules) XML Databases – Silke Eckstein – Institut für Informationssysteme – TU Braunschweig 26 10.1 Motivation – Schema definition: – – – – XML Databases – Silke Eckstein – Institut für Informationssysteme – TU Braunschweig 10. XML Storage 1 Summary model-based storage XML Databases – Silke Eckstein – Institut für Informationssysteme – TU Braunschweig [Tür08] column of a table columns of a table column of a table column with null values set/list of columns (SET OF, LIST OF) TUPLE OF – Rules to map attributes: XML attribute IMPLIED REQUIRED Default value 29 column of a table null values allowed null values not allowed DEFAULT constraint XML Databases – Silke Eckstein – Institut für Informationssysteme – TU Braunschweig 30 10.4 Schema-based storage 10.4 Schema-based storage • Mapping with STORED (Semistructured TO RElational Data) • Mapping to relational databases – – – – – Basic idea: Use data mining techniques on the XML structure to find a good mapping to tables [DFS99] – Input DTD is usually required Queries use SQL functionality RDBMS data types are used (e.g. prices are NUMERIC) Problem: Mapping of collection types • • • • • Subdivide into additional relations – Example: Comment: Comment_ID 44901 Customer_Info: ID Fname – Output Customer_info Feedback C0001 F0001 Lname • Relational schema • STORED-queries: Mapping instructions for XML documents to DB tables – Procedure Email • Determine the XML subtrees with the largest support in the collection and in the queries • These subtrees are materialised in tables • Irregular data is stored in overflow tables according to the EDGE approach C0001 Charles Sanchez C.Sanchez@hotmail... Feedback: ID F001 XML documents (or an average sample of the collection) Query workload Restrictions of storage space, number of tables, … No DTD or XML schema is required! Type Content opinion Darjeeling Special… XML Databases – Silke Eckstein – Institut für Informationssysteme – TU Braunschweig 31 32 XML Databases – Silke Eckstein – Institut für Informationssysteme – TU Braunschweig 10.4 Structure-based storage 10.4 Schema-based storage • Mapping with STORED – example • Mapping to object relational databases Subtrees with high support – DTD is usually required – Queries use SQL functionality – "Natural" mapping to tuple types, collection types – In case of irregular document structure databases contain many null values. Comment: XML documents shown as tree structure Comment_ID 44901 [Tür08] XML Databases – Silke Eckstein – Institut für Informationssysteme – TU Braunschweig 33 <Customer_info> <Feedback> Fname Lname Email Type Charles Sanchez C.Sanchez@hotmail... opinion Darjeeling Specia… Content XML Databases – Silke Eckstein – Institut für Informationssysteme – TU Braunschweig 10.4 Schema-based storage 34 10.4 Schema-based storage • Mapping of recursive data definitions – DTDs can be recursive – Infinite recursion is impossible on instance layer of a database – Procedure: • • • • Marking the nodes Subdividing into separate tables Use primary and foreign keys in RDBMS Use reference types in ORDBMS <!ELEMENT book (front, body, references)> <!ELEMENT references (book+)> [Tür08] XML Databases – Silke Eckstein – Institut für Informationssysteme – TU Braunschweig 35 XML Databases – Silke Eckstein – Institut für Informationssysteme – TU Braunschweig 36 10.4 Schema-based storage 10.4 Schema-based storage • Mapping of element sequences • Mapping of alternatives – Sequence can be important – XML allows to specify alternatives – Example: • Use an additional attribute in these cases – Example: <!ELEMENT car (compactCar | sedan | van)*> – Three possible storage variants <lecture> <lesson>Introduction</lesson> <lesson>XML basics</lesson> … • Each alternative is stored as separate table column • Subdivide alternatives in separate tables • Use a table column of type XML type ⇓ Order Lesson 1 Introduction 2 XML basics XML Databases – Silke Eckstein – Institut für Informationssysteme – TU Braunschweig 37 XML Databases – Silke Eckstein – Institut für Informationssysteme – TU Braunschweig 10.4 Schema-based storage 38 10.4 Schema-based storage • Variant 1 – all alternatives in one table • Variant 2 – subdivided into multiple tables • • – Problem: many null values (wasting storage space) [Tür08] XML Databases – Silke Eckstein – Institut für Informationssysteme – TU Braunschweig – For queries, combination of tables is needed 39 [Tür08] 10.4 Schema-based storage XML Databases – Silke Eckstein – Institut für Informationssysteme – TU Braunschweig 40 10.4 Schema-based storage Mapping of mixed content – example • Variant 3 – Using column type XML – XML type allows XML queries or DOM methods [Tür08] XML Databases – Silke Eckstein – Institut für Informationssysteme – TU Braunschweig 41 [Tür08] XML Databases – Silke Eckstein – Institut für Informationssysteme – TU Braunschweig 42 10.4 Schema-based storage 10.4 Schema-based storage • Schema-based storage with automatic mapping • Mapping of mixed content – Mapping to plain tables is ill-suited – Use variant 3 from above or – Advantages • Queries, data types, aggregation functions, views • Integration in other databases when storing structured data • Content model ANY is not representable at all – Disadvantages – Arbitrary content, arbitrary element types – Often the fitting storage structure can only be decided on instance layer XML Databases – Silke Eckstein – Institut für Informationssysteme – TU Braunschweig • Large schema, sparsely filled databases (many null values) • No flexible data types, storage of alternatives has problems • Less flexible queries – No information retrieval queries possible without additional extensions – No full text operations for semi- or unstructured data – Usually native access is not possible any more 43 XML Databases – Silke Eckstein – Institut für Informationssysteme – TU Braunschweig 10.4 Schema-based storage 44 10.4 Schema-based storage Algorithm/product |based on: n/a DTD schema |restrictions: keys cardin. types | DTD optimisation • Mapping solutions with different specializations – Algorithms, middleware, commercial applications, … – Varying amount of required input or user decisions – Many algorithms create different database schemas • Two phases – Mapping • Assign a place for each node type in the DB – Shredding • Import the XML data as DB tuples XML Databases – Silke Eckstein – Institut für Informationssysteme – TU Braunschweig 45 10.4 Schema-based storage XML Databases – Silke Eckstein – Institut für Informationssysteme – TU Braunschweig 46 10.4 Schema-based storage • The shredder can be part of the DB • Shredding without XML schema in DB2 – Usually requires an XML schema – In the IBM Data Studio, the shredder is part of the "annotated XML schema decomposition" – Direct approach in DB2: – XMLTABLE function in combination with an INSERT INSERT INTO ENVELOPEXT (MAILFROM, MAILTO, MAILDATE, SUBJECT) SELECT MAILFROM, MAILTO, MAILDATE, SUBJECT FROM XMLTABLE( XMLNAMESPACES('http://www.sal.com/mails' AS "email"), '$doc/email:mails/mail' (: some xquery-expression :) PASSING xml-source AS "doc" COLUMNS MAILFROM VARCHAR (100) PATH 'envelope/from', MAILTO VARCHAR (100) PATH 'envelope/to', MAILDATE VARCHAR (30) PATH 'envelope/email:Date', SUBJECT VARCHAR (100) PATH 'envelope/Subject') AS T; • register the XML schema and call the stored procedure: register xmlschema http://our.org/custacc from dec_files/custacc.xsd as cust_schema ; complete xmlschema cust_schema enable decomposition ; call SYSPROC.XDBDECOMPXML ('VRODRIG', 'CUST_SCHEMA', ? , ?, 1, null, null, null) XML Databases – Silke Eckstein – Institut für Informationssysteme – TU Braunschweig [Bus08] 47 http://www.ibm.com/developerworks/db2/l ibrary/techarticle/dm-0801ledezma/ XML Databases – Silke Eckstein – Institut für Informationssysteme – TU Braunschweig 48 10.4 Schema-based storage 10.4 Schema-based storage • Summary Schema-based storage with automatic mapping • User defined mapping – Schema definition: – Idea • Is usually required and analysed • not required, e.g. for STORED • In all previously shown methods it is not possible to affect the storage in the DB • With user defined mappings the user defines the storage structure • The structure of XML documents and database schema can be designed independently from each other • Also possible: storing XML documents in existing databases – Document reconstruction: • Limited (requires logging of the mapping process) – Queries: • Database queries • XML queries possible,but lack the XPath horizontal axes, e.g. following, preceding-sibling – Annotation of DTD and XML schema, respectively – Special features: • In many cases the mapping definition is combined with existing schema information • Federation with existing databases is possible – Efficiency: – Only limited XML queries possible • High efficiency by using the DB-engine • Logging of the mapping process from XML documents to databases • For a given query all relevant data has to be stored (lossless mapping) – Usage: • For data-centric XML applications, but with limited nesting 49 XML Databases – Silke Eckstein – Institut für Informationssysteme – TU Braunschweig 10.4 Schema-based storage 10.4 Schema-based storage • Example: XML document • Mapping instruction mapping instruction – Example syntax for XML-DBMS (Roland Bourret) <ClassMap> <ElementType Name="sales:SalesOrder"/> <ToClassTable> <Table Name="Sales"/> </ToClassTable> <PropertyMap> <Attribute Name="SONumber"/> <ToColumn> <Column Name="Number"/> </ToColumn> </PropertyMap> </ClassMap> [Tür08] XML Databases – Silke Eckstein – Institut für Informationssysteme – TU Braunschweig 51 Connection between elements and tables Connection between elements/attributes and table columns XML Databases – Silke Eckstein – Institut für Informationssysteme – TU Braunschweig 52 <xsd:complexType xmlns:db2-xdb= "http://www.ibm.com/xmlns/prod/db2/xdb1" name="ROW.ACCOUNT"> <xsd:sequence> <xsd:element name="NAME" Mapping SQL type="CHAR_20" table columns to db2-xdb:rowSet="Account" XML SQL/XML elements db2-xdb:column="Name"/> schema <xsd:element name="BALANCE" annotations in type="NUMERIC_12_2"/> Mapping table DB2 db2-xdb:rowSet="Account" rows to XML (table is called db2-xdb:column="Balance"/> <row> rowSet) </xsd:sequence> elements </xsd:complexType> Mapping SQL tables 10.4 Schema-based storage CREATE TABLE Account ( Name CHAR(20), Balance NUMERIC(12,2), ); • Remarks – Many different mapping languages or schema annotations Name • Automatic mappings usually have an internal mapping language 2000 Jim 3500 <ACCOUNT> <row> <NAME>Joe</NAME> <BALANCE>2000</BALANCE> </row> <row> <NAME>Jim</NAME> <BALANCE>3500</BALANCE> </row> </ACCOUNT> • On the next slide, the example from lecture 6 is shown with DB2 syntax 53 Balance Joe – Remember the mapping constructs from lecture 5 and 6. The SQL/XML annotations are a mapping language, too. – DB2 uses similar annotations as SQL/XML XML Databases – Silke Eckstein – Institut für Informationssysteme – TU Braunschweig 50 XML Databases – Silke Eckstein – Institut für Informationssysteme – TU Braunschweig [Tür08] <xsd:complexType name="TABLE.ACCOUNT"> <xsd:sequence> <xsd:element name="row" type="ROW.ACCOUNT"/> </xsd:sequence> </xsd:complexType> <xsd:element name="ACCOUNT" type="TABLE.ACCOUNT"/> XML Databases – Silke Eckstein – Institut für Informationssysteme – TU Braunschweig 54 10.4 Schema-based storage 10. XML Storage 1 • Summary schema-based storage with user defined mapping 10.1 Motivation – Schema definition: 10.2 Text-based storage • Depends on mapping language – Document reconstruction: 10.2.1 Index structures • Not possible in most cases (requires logging of the mapping process) – Queries: 10.3 Model-based storage • Database queries • XML queries in rare cases only! – Special features: 10.4 Schema-based storage • Integration with existing databases is possible – Efficiency: 10.5 Conclusion • High efficiency by using the DB-engine – Usage: 10.6 Overview and References • For data-centric XML applications XML Databases – Silke Eckstein – Institut für Informationssysteme – TU Braunschweig 55 XML Databases – Silke Eckstein – Institut für Informationssysteme – TU Braunschweig 10.5 Conclusion 10.6 References • Different methods for storage of XML documents • • Storing whole XML documents as string • Can use full text index or path index • • Generic mapping of the tree structure • – Schema-based – Hybrid approaches • [FK99] • [DFS99] – D. Florescu, D. Kossmann: Storing and Querying XML Data using an RDBMS. IEEE Data engineering Bulletin (DEBU),Volume 22(3), Seiten 27-34, 1999. • A combination of some of those methods – No algorithm has the optimal solution for all kind of XML documents – Reasonable solution is heavily dependent on the application – A. Deutsch, M.F. Fernández, D. Suciu: Storing Semistructured Data with STORED. Proceedings of the 1999 ACM SIGMOD international conference on Management of data, Seiten 431-442, ACM, 1999. 57 10.6 Overview XML Databases – Silke Eckstein – Institut für Informationssysteme – TU Braunschweig 58 Questions, Ideas, Comments 8. XML query languages II – XQuery Data Model 9. XML query languages III – XQuery 10. XML storage I – Overview 11.XML storage II 12. Updates / Transactions 13. Systems XML Databases – Silke Eckstein – Institut für Informationssysteme – TU Braunschweig "Generierung eines adaptiven Datenbankschemas für datenzentrierte XMLDokumente" [Bus08] – Carsten Busche – Diplomarbeit,TU Braunschweig, 2008 • Detect and analyse the structure of the XML documents • Derive a DB schema from the structure Introduction XML Basics Schema definition XML query languages I Mapping relational data to XML 6. SQL/XML 7. XML processing "XML und Datenbanken" [KM03] – M. Klettke, H. Meier – dpunkt.verlag, 2003 – Model-based 1. 2. 3. 4. 5. "XML und Datenbanken" [Tür08] – Can Türker – Lecture, University of Zurich, 2008 – Text-based XML Databases – Silke Eckstein – Institut für Informationssysteme – TU Braunschweig 56 59 • Now, or ... • Room: IZ 232 • Office our: Tuesday, 12:30 – 13:30 Uhr or on appointment • Email: [email protected] XML Databases – Silke Eckstein – Institut für Informationssysteme – TU Braunschweig 60