How to effectively store the history of data in a relational DBMS Database systems MSE-Seminar 14.12.2008 © Raphael Gfeller, [email protected] 1 Agenda • Time – Definition – Storing within a DMBS – General problems • History – Motivation – On Data warehouses, Online Analytical Processing system, OLAP systems • Common patterns available: slowly changing dimensions, SDC – On Online transaction Processing system, OLTP systems • No common patterns available • Analyzed data models – Performed tests – Results – Conclusion • Questions • References 14.12.2008 © Raphael Gfeller, [email protected] 2 Time is defined as one second: “the duration of 9 192 631 770 periods of the radiation corresponding to the transition between the two hyperfine levels of the ground state of the caesium 133 atom.” 14.12.2008 © Raphael Gfeller, [email protected] 3 Storing within a DMBS • a date is represented by an offset with a defined accuracy on a reference point • an interval is represented by a value with a defined accuracy • a duration – is represented by a composite value of two dates or – a date value and an interval value 14.12.2008 © Raphael Gfeller, [email protected] 4 Storing within a DMBS \ Example Value to store: 2008-12-02 10:00 Used date type: smalldatetime (SQL 2), resolution 1 min, reference point: 1900-01-01, based on the Gregorian calendar Internal stored Value: 57286680 („Value to store“ - reference point [min]) 14.12.2008 © Raphael Gfeller, [email protected] 5 Storing within a DMBS \ Date Types Available date types on Microsoft SQL Server 2008 14.12.2008 © Raphael Gfeller, [email protected] 6 Time \ General Problems • • • • Different time zones Different implementation of data types Different calendars Time synchronization – local – networked • Summer, winter time problematic • Choosing the date type – Resolution is to small – Resolution is to high – Range is to small 14.12.2008 © Raphael Gfeller, [email protected] 7 History \ Motivation Why do we need to know the history of our data? For example for: • Legal requirements – A bank has to know at each time what the exact balance of the customer was – A internet provider has to be able to store all traffic from a user over a given time • Business requirements – A version control has to be able to manage multiple revisions of the same unit of information – A customer relationship management (CRM) software has to be able to present the volume of sales of a costumer over time • Entertainment requirements – A chat program has to be able to present the conversation between two person over time • Other requirements – The “Time Machine” function in the Mac OS X has to be able to go back in time for locating older version of your files 14.12.2008 © Raphael Gfeller, [email protected] 8 Main approaches to store data Data warehouses (Online Analytical Processing system, OLAP) • Designed for – Reporting – Analysis – Speed of data retrieval • Uses to following approaches – data are stored denormalised based on a dimension-based model (logical data grouped together) • Include often business intelligence tools to retrieve and analyze data • History – Common pattern exits for storing historical data (Slow Changing Dimensions (SCD)) Online transaction Processing system, OLTP • Designed for – Perform day-to-day transaction processing – Preservation of data integrity – Speed of recording of business transactions • Uses to following approaches – Database normalization • Codd rules of data normalization – Entity-relationship model • History – 14.12.2008 No common pattern available for storing historical data © Raphael Gfeller, [email protected] 9 History \ Slow Changing Dimensions [0/2] Type 0: an attribute of a dimension is fixed, no history available, is not frequently used yet Type 1: overwrites the old data with the new data, no history available Type 2: tracks historical data by creating multiple records with a separated key, unlimited history is possible Type 3: additional columns in the tables track changes, limited history is available Type 4: creates separate historical tables that stores the historical data Type 6: is a hybrid approach that combines SCD 1, 2 and 3 , unlimited history is possible, is not frequently used yet Type 1,2 and 3 are the most common 14.12.2008 © Raphael Gfeller, [email protected] 10 History \ Slow Changing Dimensions [1/2] SCD 1, overwrites the old data with the new data, no history available Results in SCD 2, tracks historical data by creating multiple records with a separated key, unlimited history is possible 14.12.2008 © Raphael Gfeller, [email protected] 11 History \ Slow Changing Dimensions [2/2] SCD 3, Additional columns in the tables track changes, limited history is available SCD 4, creates separate historical tables that stores the historical data 14.12.2008 © Raphael Gfeller, [email protected] 12 History \ OnLine Transaction Processing system, OLTP No common patterns available for storing historical data Common used rational data models are analyzed (all based on SCD type 2 (unlimited history is possible)) – – – – Method “Duplication” Method “Transaction” Method “Linked history items” Method “Bidirectional linked history items” Based on the following rational data model: 14.12.2008 © Raphael Gfeller, [email protected] 13 History \ OLTP \ Duplication Method “Duplication” Focused on • Fast access of historical data • Easy implementation • Data integrity ID Name Salary ID_Change ID_Company ID DateTime 1 Gfeller Raphael 1000 1 2 1 2008-12-2 1 Gfeller Raphael 2500 2 1 2 2008-12-3 2 Hans Meier 500 1 1 2 Hans Meier 550 2 1 3 Fritz Müller 1750 2 2 14.12.2008 ID Name ID_Change 1 UBS 1 2 HSR 1 1 UBS 2 2 HSR 2 © Raphael Gfeller, [email protected] 14 History \ OLTP \ Transaction Method “Transaction” Focused on • Less used data storage • Precious information about history at every point on time ID Name Salary ID_Company 1 Gfeller Raphael 2500 1 2 Hans Meier 500 1 3 Fritz Müller 1750 2 ID Name 2 HSR 1 ID DateTime Entry ID NewStringValue Action 2 2008-12-2 1 Gfeller Raphael Person.SetName 1 2008-12-1 1 Gfeller Raphal, 2500, 1 Person.Create UBS 14.12.2008 © Raphael Gfeller, [email protected] 15 History \ OLTP \ Linked history items Method “Linked history items” Focused on • Avoid huge changes to the underlying database • Easy to implement • Fast insertion of new entries ID Name Salary ID_Company FK_Old_ID 1 Gfeller Raphael 2500 1 5 2 Hans Meier 500 1 NULL 3 Gfeller Raphael 1500 1 NULL 4 Gfeller Raphael 1000 2 3 5 Fritz Müller 1750 2 NULL 14.12.2008 © Raphael Gfeller, [email protected] 16 History \ OLTP \ Bidirectional Linked history items Method “Bidirectional Linked history items” Focused on • Avoid huge changes to the underlying database • Extendibility by adding additional metadata to the separated table • Fast insertion of new entries • Providing additional backward and forward navigation Old Person New Person DateTime User 3 4 2008-12-3 User 1 4 1 2008-12-4 User 2 14.12.2008 ID Name Salary ID_Company 1 Gfeller Raphael 2500 1 2 Hans Meier 500 1 3 Gfeller Raphael 1500 1 4 Gfeller Raphael 1000 2 5 Fritz Müller 1750 2 © Raphael Gfeller, [email protected] 17 History \ OLTP \ Analyzed criteria’s Analyzing criteria's • Insert an entry • Updating an entry • Storage cost • Get an entry at (Time – 1) • Get en entry at (Time – n) • Entry at time x • Get an integrity state over all entries • Get the next entry by a entry at the past • Get the previous entry by a entry at the past • A person by a company at the past 14.12.2008 © Raphael Gfeller, [email protected] 18 History \ OLTP \ Theory 14.12.2008 © Raphael Gfeller, [email protected] 19 History \ OLTP \ Test Test environment CPU: Intel Core 2, 2Ghz Memory: 2 Gb Operating System: Windows XP, Sp3, Database: Microsoft SQL Server 2005, Express Edition with SP1, Benchmark written in C# Benchmark input • Count inserted companies • Count inserted persons • Count companies to change • Count persons to change Benchmark steps 0. Insert companies 1. Insert persons 2. Change companies 3. Change persons 4. Find a person by its parent person 5. Collect all persons and companies that are valid at a specific time. 6. Find a person in the past by a datetime value 7. Find a person by a company by a datetime value. 14.12.2008 © Raphael Gfeller, [email protected] 20 History \ OLTP \ Results The measurements confirm the theory Possible optimizations • Method Change Set based on Duplication – Only changed entries are duplicated acceptable overhead in reading fewer data storage used • Method “Transaction with anchors “ – Using “anchors transaction”, they resave the entire state of the entries Fewer network traffic Restoring an entry becomes linear O(maxChangesBetweenTwoAnchors) instead of O(nChanges) 14.12.2008 © Raphael Gfeller, [email protected] 21 History \ OLTP \ Result \ Conclusion Advices based on these tests • If storage is limited use the methods in the following order: – – – – – – • If network bandwidth is limited, use the methods in the following order: – – – – – – • transaction mechanism Linked history items Bidirectional linked history items Transaction with anchors Change Set based on Duplication Duplication Change Set based on Duplication Duplication Linked history items Bidirectional Linked history items Transaction with anchors transaction mechanism If the knowledge of the developers is low • use either method duplication or Linked history items 14.12.2008 © Raphael Gfeller, [email protected] 22 History \ OLTP \ Result \ Conclusion Advices based on these tests • If data volume is high, use the methods in the following order: – – – – – – • transaction mechanism Linked history items Bidirectional linked history items Transaction with anchors Change Set based on Duplication Duplication If change frequency of the data is high, use the methods in the following order: – – – – – – Transaction with anchors transaction mechanism Linked history items Bidirectional linked history items Change Set based on Duplication Duplication 14.12.2008 © Raphael Gfeller, [email protected] 23 History \ OLTP \ Result \ Conclusion Practical example of the usage of the methods 14.12.2008 © Raphael Gfeller, [email protected] 24 Questions? 14.12.2008 © Raphael Gfeller, [email protected] 25 References • Gfeller Raphael, How to effectively store the history of data in a ration DBMS, [Online] 2.12.2008. [Citied 2.12.2009] http://wiki.hsr.ch/Datenbanken/wiki.cgi?HistorieUndZeitInDatenbanken. 14.12.2008 © Raphael Gfeller, [email protected] 26