Indexing XML Data in RDBMS using ORDPATH Microsoft® SQL Server 2005™ Concepts developed by: Patrick O‘Neil, Elizabeth O‘Neil, (University of Massachusetts Boston) Shankar Pal, Istvan Cseri, Oliver Seeliger, Gideon Schaller, Leo Giakoumakis, Vasili Zolotov, Nigel Westbury (Microsoft Corporation) XML Data Model Sample XML Data (serialized form): <BOOK ISBN=”1-55860-438-3”> <SECTION> <TITLE> Bad Bugs</TITLE> Nobody loves bad bugs. <FIGURE CAPTION=”Sample bug”/> </SECTION> <SECTION> <TITLE> Tree Frogs </TITLE> All right-thinking people <BOLD> love </BOLD> tree frogs. </SECTION> </BOOK> 5. Juli 2006 Stephan Müller 2 XML Data Model XML Document / Fragment - Properties: 1 2 ISBN Hierarchy 4 Title 3 Book Section Section 8 5 Nobody 6 Figure Caption Title 9 All right Bold Frogs 10 11 12 7 Document Order: 1 < 2 < 3 < 4 < 5 < ….. < 11 < 12 5. Juli 2006 Stephan Müller 3 XML Data Stored in a Relational Database SQL Command: CREATE TABLE docs ( id INT PRIMARY KEY, xdoc XML ); SQL with embedded XQuery and XPath: Created docs Table: ID XDOC 1 XML Fragment as BLOB 2 XML Document as BLOB … … 7 XML Fragment as BLOB … … SELECT id, xdoc.query(‘ for $s in /BOOK[@ISBN=“1-55860-438-3“]//SECTION return <topic> { data($s/TITLE) } </topic> ‘) FROM docs; 5. Juli 2006 Stephan Müller 4 ORDPATH What we expect from a labeling scheme: Introduction ► Support for structural fidelity (Hierarchy + Document Order) ► Support for efficient structural modifications to the XML tree - insert sub-tree - delete sub-tree without relabeling !!! - move sub-tree ► Support for high-performance query plans for native XML queries using relational primitives ► Independence of XML schemas typing XML instances 5. Juli 2006 Stephan Müller 6 1 Example of an Initial Load Book 1.1 Section 1.3 1.5 Section Hierarchy ISBN 1.3.5 Title Nobody 1.3.1 1.3.3 Figure Caption Title All right Bold Frogs 1.5.1 1.5.3 1.5.5 1.5.7 1.3.5.1 Document Order: 1 < 1.1 < 1.3 < 1.3.1 < … < 1.5.7 5. Juli 2006 Primary Index: infoset ORDPATH TAG NODE_TYPE VALUE 1 1 (BOOK) 1 (Element) Null 1.1 2 (ISBN ) 2 (Attribute) '1'1-5586055860-438438-3' 1.3 3 (SECTION) 1 (Element) Null 1.3.1 4 (TITLE) 1 (Element) 'Bad Bugs' 1.3.3 -- 4 (Value (Value)) 'Nobody loves bad bugs' bugs' 1.3.5 5 (FIGURE) 1 (Element) Null 1.3.5.1 6 (CAPTION) 2 (Attribute) 'Sample bug' bug' 1.5 3 (SECTION) 1 (Element) Null 1.5.1 4 (TITLE) 1 (Element) 'Tree frogs' frogs' 1.5.3 -- 4 (Value (Value)) 'All rightright-thinking people' people' 1.5.5 7 (BOLD) 1 (Element) 'love' love' 1.5.7 -- 4 (Value (Value)) 'tree frogs' frogs' Li/Oi Pair Design Li /Oi Pair Design 1.5.3.-9.11 ORDPATH Example Value: Li /Oi Pair Desgin: ORDPATH bit pattern: L0 O0 L1 O1 … LK OK 0100101101010110001111111000011 We need a prefix-free Li encoding… 5. Juli 2006 Stephan Müller 9 Prefix–Free Encoding of the Li Bitstrings (using the Fano Condition) 5. Juli 2006 Stephan Müller 10 Li /Oi Pair Design 1.5.3.-9.11 ORDPATH Example Value: Using Li values from Figure 3.2a L0 = 3 O0 = 1 L1 = 3 O1 = 5 L2 = 3 O2 = 3 L3 = 4 O3 = -9 L4 = 4 O4 = 11 01 001 01 101 01 011 00011 1111 100 0011 ORDPATH bit pattern 0100101101010110001111111000011 (Figure 3.2a) 5. Juli 2006 Stephan Müller 11 Li /Oi Pair Design Advantages of comparing ORDPATH Values: ► Determination of ancestor–descendent relationships for any two ORDPATHs is very easy. ► Easy determination of the distance between two ORDPATHs. ► Simple bitstring (or byte-by-byte) comparison yields document order. 5. Juli 2006 Stephan Müller 12 Descendants of a given Context Node Context Node ( cn = 1.3 ) 1 Book 1.1 1.3 ISBN Section Section 1.5 1.3.5 Title Nobody 1.3.1 1.3.3 Figure Caption Title All right Bold Frogs 1.5.1 1.5.3 1.5.5 1.5.7 1.3.5.1 5. Juli 2006 Stephan Müller 13 Descendants of a given Context Node SQL Query: SELECT FROM WHERE AND Ordpath infoset 1.3 < Ordpath 1.4 > Ordpath (cn) (cn+1) Infoset Table: ORDPATH TAG NODE_TYPE VALUE 1 1 (BOOK) 1 (Element) Null 1.1 2 (ISBN ) 2 (Attribute) '1'1-5586055860-438438-3' 1.3 3 (SECTION) 1 (Element) Null 1.3.1 4 (TITLE) 1 (Element) 'Bad Bugs' 1.3.3 -- 4 (Value (Value)) 'Nobody loves bad bugs' bugs' 1.3.5 5 (FIGURE) 1 (Element) Null 1.3.5.1 6 (CAPTION) 2 (Attribute) 'Sample bug' bug' 1.5 3 (SECTION) 1 (Element) Null 1.5.1 4 (TITLE) 1 (Element) ‚Tree frogs' frogs' 1.5.3 -- 4 (Value (Value)) ‚All rightright-thinking people' people' 1.5.5 7 (BOLD) 1 (Element) ‚love' love' 1.5.7 -- 4 (Value (Value)) ‚tree frogs' frogs' 14 Arbitrary Inserts Arbitrary Insertions Rightmost / Leftmost Insertion: 3.5 Parent 5. Juli 2006 Child4 Child1 Child2 3.5.-1 3.5.1 3.5.3 Stephan Müller Child3 3.5.5 16 Arbitrary Insertions Careting in nodes between two existing nodes… 3.5 3.5.1 3.5.2 3.5.2.1 3.5.3 3.5.2.2 3.5.2.2.-1 5. Juli 2006 Stephan Müller 3.5.2.3 3.5.2.2.1 17 Arbitrary Insertions Careting in nodes between two existing nodes… 3.5 Parent 5. Juli 2006 Child1 Child3 Child6 Child5 Child4 3.5.1 3.5.2.1 3.5.2.2.-1 3.5.2.2.1 3.5.2.3 Stephan Müller Child2 3.5.3 18 Comment Note: ► Multiple levels of carets are extremely rare in practice. Advantage: ► Insertions require no relabelings of old nodes… We avoid updates to primary key values which would involve the primary index and all secondary indexes. 5. Juli 2006 Stephan Müller 19 Conclusion ORDPATH … ► … is a hierarchical prefix-based labeling scheme. ► … provides efficient access to subtrees. ► … provides all kinds of modifications. ► … 5. Juli 2006 Stephan Müller 20