print - IfIS - Technische Universität Braunschweig

10. XML Storage 1
10.1 Motivation
10.2 Text-based storage
XML Databases
10. XML Storage 1 – Overview
10.3 Model-based storage
Silke Eckstein
Andreas Kupfer
Institut für Informationssysteme
Technische Universität Braunschweig
http://www.ifis.cs.tu-bs.de
10.5 Conclusion
10.4 Schema-based storage
10.6 Overview and References
XML Databases – Silke Eckstein – Institut für Informationssysteme – TU Braunschweig
10.1 Motivation
10.1 Motivation
• Applications require different types of XML documents
• Characterisation of XML documents:
– Structure vs. content
– Regular vs. irregular
– Data-centric documents
• Thus, XML documents are
• Structured, regular
• E.g. product catalog, order, invoice
– Data-centric
– Document-centric
– or somewhere in-between
– Document-centric documents
• Questions
– Storage of XML documents
– Efficient processing of queries on the stored
documents or data
• Unstructured, irregular
• E.g. scientific article, book, email, web page
• There are several methods for storage
– Semi-structured documents
– 1st goal: Learn and understand methods
– 2nd goal: Classify methods
• Data-centric and document-centric parts
• E.g. publications, Amazon, MS Press (example chapters)
• Principles
• Advantages and disadvantages
• Usage
XML Databases – Silke Eckstein – Institut für Informationssysteme – TU Braunschweig
3
10.1 Motivation
4
• Storage approaches for XML documents
– Order preserving and lossless storage of XML
documents
– Efficient access to XML documents or parts thereof
– Text-based
• Storage as character data
– Model-based
• Quick response time for
• On top of the relational DBMS
• Inside the relational DBMS
– Queries
– Update operations
– Schema-based
Indexing
Transaction processing
Support of XPath and XQuery
Support of SAX and DOM for applications
XML Databases – Silke Eckstein – Institut für Informationssysteme – TU Braunschweig
XML Databases – Silke Eckstein – Institut für Informationssysteme – TU Braunschweig
10.1 Motivation
• Requirements for the physical layer:
•
•
•
•
2
• Mapping to (object-)relational databases
– Deriving the database schema from the XML structure
– Using user defined mapping procedures
5
XML Databases – Silke Eckstein – Institut für Informationssysteme – TU Braunschweig
6
10.1 Motivation
Text-based
storage
Schema-based
storage
10. XML Storage 1
Model-based
storage
10.1 Motivation
10.2 Text-based storage
10.2.1 Index structures
10.3 Model-based storage
10.4 Schema-based storage
10.5 Conclusion
10.6 Overview and References
– Examples –
7
XML Databases – Silke Eckstein – Institut für Informationssysteme – TU Braunschweig
XML Databases – Silke Eckstein – Institut für Informationssysteme – TU Braunschweig
10.2 Text-based storage
8
10.2.1 Index structures
• The whole XML document text is stored as character data
• Index structures for XML documents
allow efficient access for specific queries
– File in the file system
– CLOB (Character-Large-OBject) in the DBS
– Different types of indexes are optimized for different
types of queries
• Operations on documents as a whole are very efficient
– Reading and writing the whole document
– But the content is monolithic and opaque with respect to the
relational query engine (query can't inspect a fragment)
• Generate redundancy
– Index has to be up-to-date by propagating data
changes
• Getting granular access requires additional support
– Full text index
– Path index
• Index structures can be storage structures as well
– They define the storage method
XML Databases – Silke Eckstein – Institut für Informationssysteme – TU Braunschweig
9
XML Databases – Silke Eckstein – Institut für Informationssysteme – TU Braunschweig
10.2.1 Index structures
10
10. 2.1 Index structures
• Types of index structures
• B-tree as value index for an XML fragment
document
– Value index
• Indexes atomar values of an XML document, like element content or
attribute values
• Index format for structured parts of XML documents
• Already known from databases (B-trees, hash index, …)
– Full text index
• Indexes single words from the full text
• Index format for unstructured parts of XML documents
• Already known from Information Retrieval (inverted lists, tries, suffix
trees, …)
– Path index
• Indexes subtrees/paths in an XML document
• Index format for semistructured parts of XML documents
• Already known from object-databases (access support relations, …)
XML Databases – Silke Eckstein – Institut für Informationssysteme – TU Braunschweig
11
[Tür08]
XML Databases – Silke Eckstein – Institut für Informationssysteme – TU Braunschweig
12
10. 2.1 Index structures
10. 2.1 Index structures
• Full text index
• Inverted list as full text index for XML
– Not limited to exact matches
word
occurrence
word position in the text
• Keyword-based search and boolean retrieval
• Pattern search (with regular expressions)
– Use of
• Statistical, word-based methods
– Stop word removal
– Elimination of uncommon items
• Linguistic methods
– Normalization of words (e.g. capitalisation, hyphenation,)
– Word decomposition by rules (engl.) or dictionaries (german)
– Stemming
• Knowledge-based methods
– Use of ontologies and thesauri to search for synonyms, hypernyms and
hyponyms
XML Databases – Silke Eckstein – Institut für Informationssysteme – TU Braunschweig
13
10. 2.1 Index structures
word
[Tür08]
XML Databases – Silke Eckstein – Institut für Informationssysteme – TU Braunschweig
14
10. 2.1 Index structures
occurrence
• Path index
– Structure information must be identifiable and
reconstructable
• Assigning the markup to the content as well as
• Representing the hierarchical nesting and order of
elements/attributes
– Especially suited for keyword search with regard to
structure or path expressions
FOR $b IN //book
WHERE CONTAINS($b/author,"Benjamin")
RETURN $b
[Tür08]
XML Databases – Silke Eckstein – Institut für Informationssysteme – TU Braunschweig
15
10. 2.1 Index structures
• Conclusion
– Nested path index
– Efficient query processing on XML documents
requires different types of index structures
– Value index
• Access to root node from every
node
– Multi-index
• For efficient access to structured parts
• Keyword search, value search
• Accessing parent nodes
– Join-index
– Full text index
• Access parent and child nodes
• For efficient access to unstructured parts
– Access Support Relations
(ASR)
– Path index
• Using the document structure
• Navigating queries
• Generalization of indexes above,
by listing all paths in a table
XML Databases – Silke Eckstein – Institut für Informationssysteme – TU Braunschweig
16
10. 2.1 Index structures
• Types of path indexes
[Tür08]
XML Databases – Silke Eckstein – Institut für Informationssysteme – TU Braunschweig
17
XML Databases – Silke Eckstein – Institut für Informationssysteme – TU Braunschweig
18
10.2 Text-based storage
10. XML Storage 1
• Summary text-based storage
10.1 Motivation
– Schema definition:
• not required
10.2 Text-based storage
– Document reconstruction:
• documents stay in their original format
10.3 Model-based storage
– Queries:
• Information retrieval queries
• Processing the markup of the queries
• XML queries possible
10.3.1 On top of the relational DBMS
10.3.2 Inside the relational DBMS
– Special features:
• Full text functions
10.4 Schema-based storage
– Efficiency:
• Character string must be parsed on every access with XML processors
expensive
• No concurrency on read or write
no parallel processing
10.5 Conclusion
– Usage:
10.6 Overview and References
• For document-centric XML applications
• Suitable to only a limited extent also for semi-structured applications
XML Databases – Silke Eckstein – Institut für Informationssysteme – TU Braunschweig
19
XML Databases – Silke Eckstein – Institut für Informationssysteme – TU Braunschweig
10.3.1 Model-based storage
20
10.3.1 Model-based storage
• The EDGE approach [FK99]
• Idea: generic storage of the graph structure
– XML elements, XML attributes, … are nodes of a graph
– Nesting of elements defines edges
– Nodes get an (internal) ID based on graph traversal
• Using relations or object classes to store elements and
attributes
Elements ID Element name Value Reference to preceeding Rank
Attributes ID Attribute name Value Reference to element
XML documents
• Document structure can be restored completely
• Extension for data type adapted storage is possible
XML Databases – Silke Eckstein – Institut für Informationssysteme – TU Braunschweig
– Variant BINARY: horizontal partition of EDGE based on
label
21
[Tür08]
10.3.1 Model-based storage
XML Databases – Silke Eckstein – Institut für Informationssysteme – TU Braunschweig
22
10.3.1 Model-based storage
• XML queries
• Example: list bargain buy with prices
SELECT a.content, b.content FROM Edge a, Edge b
WHERE (a.label = 'price') AND (a.content < 10.00)
AND (b.label = 'description')
AND (b.parent = a.parent) AND (a.key = b.key)
– XML queries (XPath, XQuery) are mapped to SQL
queries (taking storage structures into account)
– Result of XML query is generated from result of
database query
• "Labeling" of the result tuples
• Result is in XML format
[Tür08]
XML Databases – Silke Eckstein – Institut für Informationssysteme – TU Braunschweig
23
[Tür08]
XML Databases – Silke Eckstein – Institut für Informationssysteme – TU Braunschweig
24
10.3.1 Model-based storage
10.3.1 Model-based storage
DOM-based storage – example
• DOM-based storage
Node type:
ELEMENT
– Information from the
Document Object Model
are stored in the database
– Storage alternatives
• (Object-)relational databases
• Object-oriented databases
• Developing own data
structure
[Tür08]
XML Databases – Silke Eckstein – Institut für Informationssysteme – TU Braunschweig
Node type:
ATTRIBUTE
Node type:
TEXT
25
[Tür08]
10.3.1 Model-based storage
26
XML Databases – Silke Eckstein – Institut für Informationssysteme – TU Braunschweig
10.3.1 Model-based storage
Summary model-based storage on top of the DBMS
• XML Queries
– Schema definition:
• not required for storage
– Document reconstruction:
• Possible, but expensive
– Queries:
– XML queries (DOM method invocations) are mapped
to SQL queries (taking storage structures into
account)
– Result of method invocation is generated from result
of database query
• XML queries possible
• Adapted database queries
– Special features:
• Querying many elements/attributes is expensive
– Efficiency:
• Navigation from the given context is efficient
• Restoring the document and evaluating path expressions is inefficient
– Usage:
• For data- and document-centric as well as for semi-structured
XML applications
[Tür08]
XML Databases – Silke Eckstein – Institut für Informationssysteme – TU Braunschweig
27
28
XML Databases – Silke Eckstein – Institut für Informationssysteme – TU Braunschweig
10. XML Storage 1
10.3.2 XML data type
<City>
• SQL/XML
10.1 Motivation
10.2 Text-based storage
<City>
Storing XML
documents inside
the database as
values of type XML
10.3 Model-based storage
10.3.1 On top of the relational DBMS
<City>
<Name>
Braunschweig
</Name>
<Zip>38100</Zip>
<Zip>38106</Zip>
<State>
Niedersachsen
</State>
</City>
Generating XML
documents using
SQL/XML functions
– Datatype XML with
Mapping between SQl and XML
belonging functions
SQL
XQuery
– Mapping between
SQL and XML
SQL database
XML datatype
– Embedding XQuery in SQL
10.3.2 Inside the relational DBMS
10.4 Schema-based storage
10.5 Conclusion
10.6 Overview and References
XML Databases – Silke Eckstein – Institut für Informationssysteme – TU Braunschweig
29
[Tür08]
XML Databases – Silke Eckstein – Institut für Informationssysteme – TU Braunschweig
30
10.3.2 XML data type
10.3.2 XML data type
• A value of the data type XML can contain
XML(SEQUENCE)
– whole XML document
– XML element
– a set of XML elements
NULL or document node
Untyped elements &
attributes,
elements not NULL
• All XML publishing operators from chapter 9.4
create values of the data type XML, not a string
XML(CONTENT(ANY))
Validated against schema
1 element child
XML(CONTENT(UNTYPED))
1 element child
XML(CONTENT(XMLSCHEMA))
XML(DOCUMENT(ANY))
1 element child
Validated against schema
XML(DOCUMENT(UNTYPED))
XML Databases – Silke Eckstein – Institut für Informationssysteme – TU Braunschweig
31
[Tür08]
XML Databases – Silke Eckstein – Institut für Informationssysteme – TU Braunschweig
10.3.2 XML data type
• Create a table that is an XML data type in itself
– CREATE TABLE XMLDOCUMENT OF XMLTYPE;
XML [({DOCUMENT|CONTENT|SEQUENCE}
[({ANY|UNTYPED|XMLSCHEMA schema name})])]
• Create a table containing an XMLType data type
column
• Modifiers are optional
• Primary type modifier
– DOCUMENT (XML document)
– CONTENT (XML element)
– SEQUENCE (sequence of XML elements)
CREATE TABLE XML
(
ID
XML
CONSTRAINT
);
• Secondary type modifier
– UNTYPED
– XMLSCHEMA (typed)
– ANY (may be typed)
XML Databases – Silke Eckstein – Institut für Informationssysteme – TU Braunschweig
33
[Pow07]
10.3.2 XML data type
123
<Groups>Annabelle</Groups>
234
<Groups>Magdalena, Marius</Groups>
345
<?xml version 1.0?>
<Groups>
<Person>Patrick</Person>
<Person>Robert</Person>
</Groups>
654
34
– Allowed values:
• XML documents (including prolog)
• XML content according to XML 1.0 (includes pure text
comments, PI?)
• NULL
– No comparison possible (compare CLOB in SQL)
• User can define an order, if comparison is necessary
– No corresponding type in programming languages for
embedding in SQL available
– Standard defines operators to convert to other SQL
data types
<Groups>Rebecca</Groups>
<Groups>Torben</Groups>
XML Databases – Silke Eckstein – Institut für Informationssysteme – TU Braunschweig
XML Databases – Silke Eckstein – Institut für Informationssysteme – TU Braunschweig
• Characteristics
CREATE TABLE Groups (
ID
INTEGER,
Name
XML );
Name
NUMBER NOT NULL,
XMLTYPE,
XPK PRIMARY KEY (ID)
10.3.2 XML data type
• Example: Definition of an XML type column
ID
32
10.3.2 XML data type
• Specification of XML type
[Tür08]
XML(DOCUMENT(XMLSCHEMA))
35
[Kud07]
XML Databases – Silke Eckstein – Institut für Informationssysteme – TU Braunschweig
36
10.3.2 XML data type
10.3.2 XML data type
• Parsing & Serialization
– XMLParse:
• Validation of XML
<City>
<Name>
Braunschweig
</Name>
<Zip>38100</Zip>
<Zip>38106</Zip>
<State>
Niedersachsen
</State>
</City>
– Is like integrity constraints in DBs
– Requires an XML Schema
– XML Schemas may be registered with the SQL-server
• Parses a string value using an
XML parser
• Produces value whose specific type is
XML(DOCUMENT(ANY)), or …CONTENT…, or
• Implementation-defined mechanism
• Known by SQL name & by target namespace URI
– XMLSerialize
– Schema does need a unique name
• Transforms an XML value into a string value (CHAR,
VARCHAR, CLOB, or BLOB)
• Used by XMLValidate(), IS VALID, and to restrict
values of
XML(DOCUMENT-or-CONTENT(XMLSCHEMA))
<City>
<Name>
Braunschweig
</Name>
<Zip>38100</Zip>
<Zip>38106</Zip>
<State>
Niedersachsen
</State>
</City>
[Mel05]
XML Databases – Silke Eckstein – Institut für Informationssysteme – TU Braunschweig
37
[Mel05]
10.3.2 XML data type
• Schema definition
– Syntax
Register XMLSCHEMA
'http://www.Alfred-Moos.de/GrussSchema.xsd'
FROM 'file://c:/XML_Schemata/GrussSchema.xsd'
AS
GrussSchema
COMPLETE
;
XML(CONTENT(XMLSCHEMA) <schema> [<elements>]))
<schema> := URI <namespace> [LOCATION <loc>]
| NO NAMESPACE [LOCATION <loc>]
| ID <registered schema name>
–
<element> := [NAMESPACE <namespace>]
ELEMENT <element name>
CREATE TABLE Dokument_XML
(Dokument_XML_Nr CHAR (4)
NOT NULL
PRIMARY KEY,
Dokument XML,
CONSTRAINT validieren
CHECK (Dokument IS VALIDATED
ACCORDING TO XMLSCHEMA
ID GrussSchema
)
)
;
39
[Tür08]
10.3.2 XML data type
40
• Benefits of schema registration
– XMLValidate
– Security issues
• Validates an XML value against an XML Schema (or target
namespace), returning new XML value with type annotations
• Schemas cannot “disappear” without SQLserver knowing about it
• Schemas cannot be “hijacked” (altered in
inappropriate ways) without SQL-server
knowing about it
• Documents cannot be marked “valid”
against schemas unless SQL-server knows
about them
– IS VALID
• Tests an XML value to determine whether or not it is valid
according to an XML Schema (or target namespace); return
true/false without altering the XML value itself
– IS DOCUMENT
• determines whether an XML value satisfies the (SQL/XML)
criteria for an XML document
– IS CONTENT
• determines whether an XML value satisfies the (SQL/XML)
criteria for XML content
XML Databases – Silke Eckstein – Institut für Informationssysteme – TU Braunschweig
XML Databases – Silke Eckstein – Institut für Informationssysteme – TU Braunschweig
10.3.2 XML data type
• New functions and predicates:
[Mel05]
38
10.3.2 XML data type
• Schema registration
XML Databases – Silke Eckstein – Institut für Informationssysteme – TU Braunschweig
XML Databases – Silke Eckstein – Institut für Informationssysteme – TU Braunschweig
41
[Mel05]
XML Databases – Silke Eckstein – Institut für Informationssysteme – TU Braunschweig
42
10.3.2 XML standard
10.3.2 SQL/XML standard
• SQL/XML standard published as
• Predefined schemas (build-in namespaces)
– ISO/IEC 9074-14:2003
– xs:http://www.w3.org/2001/XMLSchema
– xsi:http://www.w3.org/2001/XMLSchema-instance
– sqlxml:http://standards.iso.org/iso/9075/2003/sqlxml
• Mappings and Publishing Functions
– ISO/IEC 9075-14:2006
– More depending on the DB implementation
• Adds XQuery, including Data Model,Validation
• Completely supported per XML+Namespaces:
– ISO/IEC 9075-14:2008
– XMLElement, XMLForest, XMLTable
• Updates
– Default namespace, explicit namespace (prefix)
– Declare namespace within scopes of WITH clause, column
definitions, constraint definitions, insert/delete/update
statements, compound statements
[Mel05]
XML Databases – Silke Eckstein – Institut für Informationssysteme – TU Braunschweig
43
[Mel05]
10.3.2 SQL/XML standard
• Overview of some operators for the XML
type
– Additional publishing functions
– XQuery data model
– More precise XML type (modifiers)
– XMLQuery, XMLTable
– XMLValidate, IS VALID
– XMLExists, IS DOCUMENT, IS CONTENT
– Casting between XML type and SQL types
XML Databases – Silke Eckstein – Institut für Informationssysteme – TU Braunschweig
– XMLELEMENT – creates an XML element node
– XMLFOREST – creates a sequence of XML element nodes
from a table
– XMLCOMMENT – creates an XML comment node
– XMLTEXT – creates a text node
– XMLPI – creates a processing instruction
– XMLAGG – aggregates XML values of a group
– XMLCONCAT – concatenates XML type values
– XMLTRANSFORM – applies an XSL to a document
45
[Tür08]
10.3.2 SQL/XML standard
46
• Review of SQL/XML
– Two components
• A data type XML to store XML data
• Functions to map relational structures to XML
– XMLPARSE – a well-formed SQL text to XML value
– XMLSERIALIZE – converts an XML value to a SQL text
– XMLDOCUMENT – creates an XML document node
from an XML value
– XMLVALIDATE – validates an XML value with a schema
– XMLQUERY – evaluates an XQuery expression
– XMLTABLE – transforms an XQuery result to a SQL table
– XMLITERATE – transforms an XQery sequence to a SQL
table
XML Databases – Silke Eckstein – Institut für Informationssysteme – TU Braunschweig
XML Databases – Silke Eckstein – Institut für Informationssysteme – TU Braunschweig
10.3.2 SQL/XML standard
• ... Overview of some operators for the XML
type
[Tür08]
44
10.3.2 SQL/XML standard
• SQL/XML:2003 plus
[Mel05]
XML Databases – Silke Eckstein – Institut für Informationssysteme – TU Braunschweig
47
– Only construction operators
• No extraction of values or search
• But construction operators are based on XQuery
– Mapping of tables, schemas, catalogues ignores some
information from the relational schema
• UNIQUE
• REFERENCES
• CHECK
– Further extensions are expected
[Kud07]
XML Databases – Silke Eckstein – Institut für Informationssysteme – TU Braunschweig
48
10.3.2 Model-based storage
10. XML Storage 1
Summary model-based storage inside the DBMS
10.1 Motivation
– Schema definition:
10.2 Text-based storage
• not required for storage
– Document reconstruction:
10.3 Model-based storage
• Possible without problems
– Queries:
10.4 Schema-based storage
• XML queries possible
– Special features:
10.4.1 Automatic mapping
• Full integration of SQL and XML
10.4.2 User-defined mapping
– Efficiency:
• Vendor specific
• Optimized data structures possible
10.5 Conclusion
– Usage:
10.6 Overview and References
• For data- and document-centric as well as for semi-structured
XML applications
49
XML Databases – Silke Eckstein – Institut für Informationssysteme – TU Braunschweig
10.4.1 Schema-based storage
• Motivation
10.4.1 Schema-based storage
• Generating the DB schema for a DTD:
– XML content shall be stored in a conventional database
– Accepting the loss of native access
– DB schema is derieved from a DTD or an XML schema
– Rules to map element types:
XML element type
Sequence of element types
Alternative of element types
Element type with quantifier ?
Element type with quantifier +,*
Nested element types
• Problem
– Generate DB schema automatically
– Thereby use as much structure information as possible
• General approach for mapping from a DTD
–
–
–
–
Transform DTD into a tree representation
Nodes: element types, attributes, etc. (type layer!!!)
Edges: nesting relationships of element types and their restrictions
Traverse tree in order to transform nodes and edges into database
tables (according to certain rules)
XML attribute
IMPLIED
REQUIRED
Default value
DTD is usually required
Queries use SQL functionality
RDBMS data types are used (e.g. prices are NUMERIC)
Problem: Mapping of collection types
• Subdivide into additional relations
Customer_Info:
ID
Fname
Customer_info
Feedback
C0001
F0001
Lname
Comment:
Email
Comment_ID
C0001 Charles Sanchez C.Sanchez@hotmail...
Feedback: ID
F001
10.4.1 Schema-based storage
– DTD is usually required
– Queries use SQL functionality
– "Natural" mapping to tuple
types, collection types
– In case of irregular document
structure databases contain
many null values.
– Example:
44901
52
• Mapping to object relational databases
• Mapping to relational databases
Comment: Comment_ID
column of a table
null values allowed
null values not allowed
DEFAULT constraint
XML Databases – Silke Eckstein – Institut für Informationssysteme – TU Braunschweig
10.4.1 Schema-based storage
–
–
–
–
column of a table
columns of a table
column of a table
column with null values
set/list of columns (SET OF, LIST OF)
TUPLE OF
– Rules to map attributes:
51
XML Databases – Silke Eckstein – Institut für Informationssysteme – TU Braunschweig
50
XML Databases – Silke Eckstein – Institut für Informationssysteme – TU Braunschweig
Type
Content
opinion
Darjeeling Special…
XML Databases – Silke Eckstein – Institut für Informationssysteme – TU Braunschweig
53
44901
<Customer_info>
<Feedback>
Fname
Lname
Email
Type
Charles
Sanchez
C.Sanchez@hotmail... opinion Darjeeling Specia…
XML Databases – Silke Eckstein – Institut für Informationssysteme – TU Braunschweig
Content
54
10.4.1 Schema-based storage
10.4.1 Schema-based storage
• Mapping of recursive data definitions
– DTDs can be recursive
– Infinite recursion is impossible on instance layer of a database
– Procedure:
•
•
•
•
Marking the nodes
Subdividing into separate tables
Use primary and foreign keys in RDBMS
Use reference types in ORDBMS
<!ELEMENT book (front, body, references)>
<!ELEMENT references (book+)>
[Tür08]
XML Databases – Silke Eckstein – Institut für Informationssysteme – TU Braunschweig
55
XML Databases – Silke Eckstein – Institut für Informationssysteme – TU Braunschweig
10.4.1 Schema-based storage
• Mapping of element sequences
56
10.4.1 Schema-based storage
• Mapping of alternatives
– Sequence can be important
– XML allows to specify alternatives
– Example:
• Use an additional attribute in these cases
– Example:
<!ELEMENT car (compactCar | sedan | van)*>
– Three possible storage variants
<lecture>
<lesson>Introduction</lesson>
<lesson>XML basics</lesson>
…
• Each alternative is stored as separate table column
• Subdivide alternatives in separate tables
• Use a table column of type XML type
⇓
Order
Lesson
1
Introduction
2
XML basics
XML Databases – Silke Eckstein – Institut für Informationssysteme – TU Braunschweig
57
XML Databases – Silke Eckstein – Institut für Informationssysteme – TU Braunschweig
10.4.1 Schema-based storage
• Variant 1 – all alternatives in one table
58
10.4.1 Schema-based storage
• Variant 2 – subdivided into multiple tables
•
•
– Problem: many null values (wasting storage space)
[Tür08]
XML Databases – Silke Eckstein – Institut für Informationssysteme – TU Braunschweig
– For queries, combination of tables is needed
59
[Tür08]
XML Databases – Silke Eckstein – Institut für Informationssysteme – TU Braunschweig
60
10.4.1 Schema-based storage
10.4.1 Schema-based storage
• Mapping with STORED (Semistructured TO RElational Data)
• Variant 3 – Using column type XML
– Basic idea: Use data mining techniques on the XML structure to find a good
mapping to tables [DFS99]
– Input
•
•
•
•
XML documents (or an average sample of the collection)
Query workload
Restrictions of storage space, number of tables, …
No DTD or XML schema is required!
– Output
• Relational schema
• STORED-queries: Mapping instructions for XML documents to DB tables
– Procedure
• Determine the XML subtrees with the largest support in the collection and in the
queries
• These subtrees are materialised in tables
• Irregular data is stored in overflow tables according to the EDGE approach
– XML type allows XML queries or DOM methods
[Tür08]
XML Databases – Silke Eckstein – Institut für Informationssysteme – TU Braunschweig
61
10.4.1 Schema-based storage
XML Databases – Silke Eckstein – Institut für Informationssysteme – TU Braunschweig
62
10.4.1 Schema-based storage
• Mapping with STORED – example
• Schema-based storage with automatic
mapping
Subtrees with
high support
– Advantages
• Queries, data types, aggregation functions, views
• Integration in other databases when storing structured data
– Disadvantages
• Large schema, sparsely filled databases (many null values)
• No flexible data types, storage of alternatives has problems
• Less flexible queries
XML documents shown as tree structure
[Tür08]
XML Databases – Silke Eckstein – Institut für Informationssysteme – TU Braunschweig
– No information retrieval queries possible without additional extensions
– No full text operations for semi- or unstructured data
– Usually native access is not possible any more
63
10.4.1 Schema-based storage
• Summary Schema-based storage with automatic mapping
64
10.4.2 User-defined Mapping
• User defined mapping
– Schema definition:
– Idea
• Is usually required and analysed
• not required, e.g. for STORED
• In all previously shown methods it is not possible to affect the storage in
the DB
• With user defined mappings the user defines the storage structure
• The structure of XML documents and database schema can be designed
independently from each other
• Also possible: storing XML documents in existing databases
– Document reconstruction:
• Limited (requires logging of the mapping process)
– Queries:
• Database queries
• XML queries possible,but lack the XPath horizontal axes,
e.g. following, preceding-sibling
– Annotation of DTD and XML schema, respectively
– Special features:
• In many cases the mapping definition is combined with existing schema
information
• Federation with existing databases is possible
– Efficiency:
– Only limited XML queries possible
• High efficiency by using the DB-engine
• Logging of the mapping process from XML documents to databases
• For a given query all relevant data has to be stored (lossless mapping)
– Usage:
• For data-centric XML applications, but with limited nesting
XML Databases – Silke Eckstein – Institut für Informationssysteme – TU Braunschweig
XML Databases – Silke Eckstein – Institut für Informationssysteme – TU Braunschweig
65
XML Databases – Silke Eckstein – Institut für Informationssysteme – TU Braunschweig
66
10.4.2 User-defined Mapping
10.4.2 User-defined Mapping
• Mapping solutions with different specializations
• The shredder can be part of the DB
– Algorithms, middleware, commercial applications, …
– Varying amount of required input or user decisions
– Many algorithms create different database schemas
– Usually requires an XML schema
– In the IBM Data Studio, the shredder is part of the
"annotated XML schema decomposition"
– Direct approach in DB2:
• Two phases
• register the XML schema and call the stored procedure:
– Mapping
register xmlschema http://our.org/custacc from
dec_files/custacc.xsd as cust_schema ;
complete xmlschema cust_schema enable decomposition ;
call SYSPROC.XDBDECOMPXML ('VRODRIG', 'CUST_SCHEMA', ? ,
?, 1, null, null, null)
• Assign a place for each node type in the DB
– Shredding
• Import the XML data as DB tuples
67
XML Databases – Silke Eckstein – Institut für Informationssysteme – TU Braunschweig
68
XML Databases – Silke Eckstein – Institut für Informationssysteme – TU Braunschweig
10.4.2 User-defined Mapping
10.4.2 User-defined Mapping
• Shredding without XML schema in DB2
• Example:
XML document
– XMLTABLE function in combination with an INSERT
mapping instruction
INSERT INTO ENVELOPEXT (MAILFROM, MAILTO, MAILDATE, SUBJECT)
SELECT MAILFROM, MAILTO, MAILDATE, SUBJECT
FROM XMLTABLE(
XMLNAMESPACES('http://www.sal.com/mails' AS "email"),
'$doc/email:mails/mail' (: some xquery-expression :)
PASSING xml-source AS "doc"
COLUMNS
MAILFROM VARCHAR (100) PATH 'envelope/from',
MAILTO
VARCHAR (100) PATH 'envelope/to',
MAILDATE VARCHAR (30) PATH 'envelope/email:Date',
SUBJECT
VARCHAR (100) PATH 'envelope/Subject') AS T;
http://www.ibm.com/developerworks/db2/l
ibrary/techarticle/dm-0801ledezma/
69
XML Databases – Silke Eckstein – Institut für Informationssysteme – TU Braunschweig
10.4.2 User-defined Mapping
70
• Remarks
– Example syntax for XML-DBMS (Roland Bourret)
XML Databases – Silke Eckstein – Institut für Informationssysteme – TU Braunschweig
XML Databases – Silke Eckstein – Institut für Informationssysteme – TU Braunschweig
10.4.2 User-defined Mapping
• Mapping instruction
<ClassMap>
<ElementType Name="sales:SalesOrder"/>
<ToClassTable>
<Table Name="Sales"/>
</ToClassTable>
<PropertyMap>
<Attribute Name="SONumber"/>
<ToColumn>
<Column Name="Number"/>
</ToColumn>
</PropertyMap>
</ClassMap>
[Tür08]
– Many different mapping languages or schema
annotations
• Automatic mappings usually have an internal mapping
language
Connection
between elements
and tables
– Remember the mapping constructs from last lecture
(9). The SQL/XML annotations are a mapping language,
too.
– DB2 uses similar annotations as SQL/XML
Connection
between
elements/attributes
and table columns
• see next slide:
71
XML Databases – Silke Eckstein – Institut für Informationssysteme – TU Braunschweig
72
<xsd:complexType xmlns:db2-xdb=
"http://www.ibm.com/xmlns/prod/db2/xdb1"
name="ROW.ACCOUNT">
<xsd:sequence>
<xsd:element name="NAME"
Mapping SQL
type="CHAR_20"
table columns to
db2-xdb:rowSet="Account"
XML elements
db2-xdb:column="Name"/>
<xsd:element name="BALANCE"
type="NUMERIC_12_2"/>
Mapping table
db2-xdb:rowSet="Account"
rows to XML
db2-xdb:column="Balance"/>
<row>
</xsd:sequence>
elements
</xsd:complexType>
Mapping SQL tables
CREATE TABLE Account
(
Name CHAR(20),
Balance NUMERIC(12,2),
);
Name
Balance
Joe
2000
Jim
3500
<ACCOUNT>
<row>
<NAME>Joe</NAME>
<BALANCE>2000</BALANCE>
</row>
<row>
<NAME>Jim</NAME>
<BALANCE>3500</BALANCE>
</row>
</ACCOUNT>
[Tür08]
<xsd:complexType name="TABLE.ACCOUNT">
<xsd:sequence>
<xsd:element name="row"
type="ROW.ACCOUNT"/>
</xsd:sequence>
</xsd:complexType>
10.4.2 User-defined Mapping
• Summary schema-based storage with user defined
mapping
– Schema definition:
• Depends on mapping language
– Document reconstruction:
• Not possible in most cases (requires logging of the mapping process)
– Queries:
• Database queries
• XML queries in rare cases only!
– Special features:
• Integration with existing databases is possible
– Efficiency:
• High efficiency by using the DB-engine
– Usage:
• For data-centric XML applications
<xsd:element name="ACCOUNT"
type="TABLE.ACCOUNT"/>
XML Databases – Silke Eckstein – Institut für Informationssysteme – TU Braunschweig
73
10. XML Storage 1
XML Databases – Silke Eckstein – Institut für Informationssysteme – TU Braunschweig
10.5 Conclusion
• Different methods for storage of XML documents
10.1 Motivation
– Text-based
• Storing whole XML documents as string
• Can use full text index or path index
10.2 Text-based storage
10.3 Model-based storage
– Model-based
10.4 Schema-based storage
– Schema-based
10.5 Conclusion
– Hybrid approaches
10.6 Overview and References
– No algorithm has the optimal solution for all kind of XML
documents
– Reasonable solution is heavily dependent on the application
XML Databases – Silke Eckstein – Institut für Informationssysteme – TU Braunschweig
• Generic mapping of the tree structure
• Detect and analyse the structure of the XML documents
• Derive a DB schema from the structure
• A combination of some of those methods
75
10.6 References
•
"XML und Datenbanken" [Tür08]
"XML und Datenbanken" [KM03]
– M. Klettke, H. Meier
– dpunkt.verlag, 2003
•
"Generierung eines adaptiven Datenbankschemas für datenzentrierte XMLDokumente" [Bus08]
•
[FK99]
•
[DFS99]
– Carsten Busche
– Diplomarbeit,TU Braunschweig, 2008
– D. Florescu, D. Kossmann: Storing and Querying XML Data using an RDBMS. IEEE Data
engineering Bulletin (DEBU),Volume 22(3), Seiten 27-34, 1999.
– A. Deutsch, M.F. Fernández, D. Suciu: Storing Semistructured Data with STORED.
Proceedings of the 1999 ACM SIGMOD international conference on Management of
data, Seiten 431-442, ACM, 1999.
XML Databases – Silke Eckstein – Institut für Informationssysteme – TU Braunschweig
XML Databases – Silke Eckstein – Institut für Informationssysteme – TU Braunschweig
76
10.6 Overview
– Can Türker
– Lecture, University of Zurich, 2008
•
74
77
Introduction and Basics
1. Introduction
2. XML Basics
3. Schema Definition
4. XML Processing
Querying XML
5. XPath & SQL/XML
Queries
6. XQuery Data Model
7. XQuery
XML Updates
8. XML Updates & XSLT
Producing XML
9. Producing XML
Storing XML
10. XML storage
11.Relational XML
storage
12. Storage Optimization
Systems
13. Technology Overview
XML Databases – Silke Eckstein – Institut für Informationssysteme – TU Braunschweig
78
Questions, Ideas, Comments
• Now, or ...
• Room:
IZ 232
• Office our:
Tuesday, 12:30 – 13:30 Uhr
or on appointment
• Email:
[email protected]
XML Databases – Silke Eckstein – Institut für Informationssysteme – TU Braunschweig
79