Technische Universität München Lecture 6: Data organisation - Review: codes and operating system - Basic data organisation: file system - Single level data structures - Multi level data structures - Primary and secondary keys Lect6-Page1 Technische Universität München ? Review: design ? What is a weaver (graphical scheme)? For what can it be used (care about key words)? What is UML? Give 5 guidelines from design rules, you have heard in the lecture? Lect6-Page2 Technische Universität München Review: codes - ASCII/ANSI-Code - Unicode transformation format (UTF) - Binary numbers 0010 0011 0000 1111bin = 8975dec … Lect6-Page3 Technische Universität München Review: abstraction by operating system Interface and device drivers for the different hardware devices, e.g hard drives or optical drives User prog. Games App. Dev. Systems … System Software Shell Editor GUI Batch Compiler, … Interface Interface Interface Operating System (Kernel) Device Driver Machine Code Microprogramming - High-level abstraction of hardware with simple interface (system calls) - Application independence - Ressource sharing, critical section management -… - Disk and file system - Device driver -… - Basic input/output methodes Physical Devices Hardware Device Lect6-Page4 Technische Universität München Lecture 6: Data organisation - Review: codes and operating system - Basic data organisation: file system - Single level data structures - Multi level data structures - Primary and secondary keys Lect6-Page5 Technische Universität München Basic data organisation: file system Filesystem Main memory Network Directories (folder folder) Organisation of data in useful blocks by the operating system Operating System Harddrives (HDD) Physically in blocks (e.g. 4 kByte Optical Drives Files … Schicker, E.: Datenbanken und SQL. Teubner 1996 Lect6-Page6 Technische Universität München Basic data organisation: file system Directories (folders folders) Organisation of data in useful blocks by the operating system Main memory Filesystem Operating System Files Supported operations from the operating system - Creating, deleting, copiing, moving of files - Searching for files - Manipulating of data in the files using system calls (load content into memory, change it and write back to medium) - Sequentially searching of content (example filesystems are FAT, FAT32, NTFS, ext2, ext3, reiserfs) Schicker, E.: Datenbanken und SQL. Teubner 1996 Lect6-Page7 Technische Universität München Basic data organisation: file system Matlab example functions: 1) fid = fopen(filename, mode) opens the file filename in the specified mode. The mode argument can be any of the following: r, w, a, r+, w+, a+, A, W 2) A = fread(fid) reads data in binary format from the file specified by fid into matrix A. or tline = fgetl(fid) returns the next line of the file associated with the file identifier fid. or A = fscanf(fid, format) reads all the data from the file specified by fid, converts it according to the specified format string, and returns it in matrix A. 3) count = fwrite(fid, A, precision) writes the elements of matrix A to the specified file, translating MATLAB values to the specified precision. or [count, errmsg] = fprintf(fid, format, A, ...) formats the data in the real part of matrix A (and in any additional matrix arguments) under control of the specified format string, and writes it to the file associated with file identifier fid. 4) status = fseek(fid, offset, origin) repositions the file position indicator in the file with the given fid to the byte with the specified offset relative to origin. 5) status = fclose(fid) closes the specified file if it is open, returning 0 if successful and -1 if unsuccessful. Argument fid is a file identifier associated with an open file. Lect6-Page8 Technische Universität München The ideas ... How to organize data within the files so that large data sets can be searched and accessed in a very fast way. The optimisation aims: - Fast searching within files - Optimized memory usage - Easy possibility for deleting, inserting and changing of data - Easy maintainable - Stabile and without errors Schicker, E.: Datenbanken und SQL. Teubner 1996 Lect6-Page9 Technische Universität München Sequential vs. address-based Data record or just record: Logic data unit in a file, e.g. a line in a text file or one line in a table Sequential file: The records are saved as sequence without gaps (e.g. lists) <Begin> Record1 Record5 Record9 Record2 Record3 Record6 Record7 Record10 … Record4 Record8 <End> Characteristics - Start is known - Each element has a direct successor - At the end is an end signal Address based file: The records are saved at dedicated addresses which allows direct access (e.g. tables) => fast access, but dealing with addresses Byte 0 Byte100 Byte200 Byte800 Byte900 Schicker, E.: Datenbanken und SQL. Teubner 1996 Record1 Record2 Record3 Characteristics - Each element is directly accessible - The elements have the same size Record4 Record5 … Lect6-Page10 Technische Universität München Lecture 6: Data organisation - Review: codes and operating system - Basic data organisation: file system - Single level data structures - Multi level data structures - Primary and secondary keys Lect6-Page11 Technische Universität München Address-based with search keys Directly addressed: The address is a function of the data or one element of the data Example: Calendar or filofax February 2nd is day 33 of the year Search key or just key If each entry day has a limited memory of 1000 Bytes it is possible to directly calculate the position of the record startpoint Address = StartaddressOfYear + (DOY-1)*1000 … Byte32000 DOY32 = Feb. 01 DOY33 = Feb. 02 Byte33000 Byte34000 DOY34 = Feb. 03 … Schicker, E.: Datenbanken und SQL. Teubner 1996 Lect6-Page12 Technische Universität München Ordered lists and binary search Ordered: The elements are part of an ordered data set, e.g. yellow pages or telefone book (ordered lists) But the address is not directly derivable. => Method to find records Binary search: <= ? <= ? <= ? <= ? Anton Anton Anton Anton Betty Betty Betty Betty Cesar Cesar Cesar >? Michael Michael >? Norbert Norbert Key: Otto Otto Betty Betty Paul >? Pauline Richard Efficiency (e.g. with 1 Mio. records): Sandy Sequential search would need 500000 search Tim accesses in mean Wolfgang Binary search would need 20 search accesses >? (= 1 Mio ~ 220) Problem: Deleting and inserting! Schicker, E.: Datenbanken und SQL. Teubner 1996 Lect6-Page13 Technische Universität München Linked lists Ordering with singly-linked or doubly-linked list with pointers Anton Betty Characteristics - Start is known - Each element has a link to ist successor - At the end is an end signal Cesar Michael Richard Tim Schicker, E.: Datenbanken und SQL. Teubner 1996 Lect6-Page14 Technische Universität München Linked lists Ordering with singly-linked or doubly-linked list with pointers Anton Betty Cesar Characteristics - Start is known - Each element has a link to its successor - At the end is an end signal - The physical order is not important Simple inserting and deleting of records! Michael 1) Search position 3) Link previous list with new element Pauline 4) Delete old pointer X 2) Link new element to rest of the list Richard Tim Schicker, E.: Datenbanken und SQL. Teubner 1996 Lect6-Page15 Technische Universität München Overview single-level data structures Sequential lists Tables Ordered lists Storage type Sequential Address-based Memory Memory consumption Sequential Very low (no gaps) Address-based Very high (large gaps) Sequential ordered Address-based Low (records with same size) Access time Sequential search Adding at the end One access direct Binary search Directly changeable Field of application Universal Very restricted Complexity Simple Simple Change possibility Schicker, E.: Datenbanken und SQL. Teubner 1996 Moving of following elements Universal on address-based memory Simple Ordered linked lists Address-based Address-based Low (additional memory for addresses Sequential search Address changes Universal on address-based memory Simple Lect6-Page16 Technische Universität München ? Review: single-level structures ? What is the difference between sequential lists and ordered linked lists? Describe the calculation of addresses in direct addressed methods (e.g. for a calendar)? Describe the steps for deleting an element in a linked list? Lect6-Page17 Technische Universität München Lecture 6: Data organisation - Review: codes and operating system - Basic data organisation: file system - Single level data structures - Multi level data structures - Primary and secondary keys Lect6-Page18 Technische Universität München Hierarchic multi-level data structures: trees Single-level structures: - Simple algorithms - But either long access times nor reduced changeability => Usage of hierarchy with administrative meta-information meta meta-information Binary trees or trees (see binary search): 8 4 12 2 1 2 6 3 4 5 6 10 7 8 9 10 14 11 12 13 14 15 Advantages: Keep the meta-information in the memory for fast searching and leave the real data in files and can be unordered Schicker, E.: Datenbanken und SQL. Teubner 1996 Lect6-Page19 Technische Universität München Hierarchic multi-level data structures: index-sequential Binary trees or trees: - Additional memory for meta-data needed - Unbalanced trees after inserts and deletes => index-sequential files or B-trees with usage of external addressable memory (blockoriented with equal block sizes) 8 Unbalanced tree 4 12 14 6 5 8 12 13 14 15 Anton: Addr. Block 8 Indexblock DataBlock 7 Block Paul: Addr. Block 7 1KB Tim: Addr. Block 9 Paul Pauline DataBlock 8 Anton Betty Cesar DataBlock 9 Tim Wolfgang Schicker, E.: Datenbanken und SQL. Teubner 1996 ISAM: Index Sequential Access Method (VSAM: Virtual Sequential Access Method) Lect6-Page20 Technische Universität München Hierarchic multi-level data structures: index-sequential Index-sequential search: 1) Load index block into memory 2) Search sequ. for “Betty” Meta-information can be kept in the memory, even when millions of records are in the file Anton: Addr. Block 8 Paul: Addr. Block 7 Tim: Addr. Block 9 3) “Betty” is lexically greater then “Anton” and lower then “Paul” 4) Load data block with possible record “Betty” into memory Anton Betty Cesar 5) Search sequ. for “Betty” At least two hard drive block accesses are necessary! Schicker, E.: Datenbanken und SQL. Teubner 1996 Paul Pauline Tim Wolfgang 6) Found record “Betty” Lect6-Page21 Technische Universität München Hierarchic multi-level data structures: index-sequential Index-sequential delete: 1) Load index block into memory 2) Search sequ. for “Betty” Anton: Addr. Block 8 Paul: Addr. Block 7 Tim: Addr. Block 9 3) “Betty” is lexically greater then “Anton” and lower then “Paul” 4) Load data block with possible record “Betty” into memory Paul Pauline Anton Betty Cesar 5) Search sequ. for “Betty” Tim Wolfgang 6) Delete found record “Betty” 8) Maybe adapt index block and save back Schicker, E.: Datenbanken und SQL. Teubner 1996 Anton Cesar 7) Save back Lect6-Page22 Technische Universität München Hierarchic multi-level data structures: index-sequential Index-sequential change/add (1): 1) Load index block into memory 2) Search sequ. for “Otto” Anton: Addr. Block 8 Paul: Addr. Block 7 Tim: Addr. Block 9 3) “Otto” is lexically greater then “Anton” and lower then “Paul” 4) Load data block with possible record “Otto” into memory Paul Pauline Anton Betty Cesar 5) Search sequ. for “Otto” Tim Wolfgang 6) Add “Otto” Anton Betty Cesar Otto Schicker, E.: Datenbanken und SQL. Teubner 1996 7) Save back Lect6-Page23 Technische Universität München Hierarchic multi-level data structures: index-sequential Index-sequential change/add (2, splitting): 1) Load index block into memory Anton: Addr. Block 8 Paul: Addr. Block 7 Tim: Addr. Block 9 2) Search sequ. for “Norbert” 4) Load data block with possible record “Norbert” into memory 9) Adapt index block and save back Keep fill level of data blocks at a max. of 70% to 80% 3) “Norbert” is lexically greater then “Anton” and lower then “Paul” 5) Search sequ. for “Norbert” Paul Pauline Anton Betty Cesar Otto 6) Block is full => splitting Tim Wolfgang 7) Add “Norbert” Anton Betty Cesar Schicker, E.: Datenbanken und SQL. Teubner 1996 Norbert Otto 8) Save back Lect6-Page24 Technische Universität München Hierarchic multi-level data structures: index-sequential Organize multiple index blocks: First level index block Address Second level index block It’s a tree structure again! Schicker, E.: Datenbanken und SQL. Teubner 1996 nth level index block Address … … Lect6-Page25 Technische Universität München Hierarchic multi-level data structures: index-sequential Organize multiple index blocks – a small calculation: A given a block size of 2048 Bytes, a record size of 116 Bytes and an index size of 24 Bytes with a max. fill level of 75% results in 64 record per index block and 13 records per data block 1 index level => 64 blocks => 64*13 = 830 data records => 97 kBytes data 2 index level => 642 blocks => 53000 data records => 6 MBytes data 3 index level => => 400 MBytes data 4 index level => => 25 GBytes data In addition integer keys reduce the index sizes and allow the administration of much more data records Schicker, E.: Datenbanken und SQL. Teubner 1996 Lect6-Page26 Technische Universität München Overview: index-sequential Index sequential access method (ISAM) Storage type Address based Memory Address based Memory consumption Low (if not ideal filled it is satisfiing) Access time Short via index levels Change possibility Local changeability with some changes on index levels Field of application Universal on address based memory with reorganizations from time to time Very high Complexity Schicker, E.: Datenbanken und SQL. Teubner 1996 Lect6-Page27 Technische Universität München Hierarchic multi-level data structures: hash Another way beside ISAM: ISAM combine linked lists and the binary/sequential search to setup trees which can be used for a fast administration of large data sets => Another possibility is to use direct addressing BUT: Names/strings are not usable for addresses => Method to calculate addresses out of strings Function f Set of keys Set of memory addresses Number of keys = n n >> = Number of addresses BUT: Number of keys = m = Number of addresses Hash function h Set of memory addresses e.g. possible number of names with max. 20 characters is 2620 Set of keys Schicker, E.: Datenbanken und SQL. Teubner 1996 Lect6-Page28 Technische Universität München Hierarchic multi-level data structures: hash What happens when hash functions calculates the same addfress: exception handling finds another address => Hash function of second order Number of keys = n >> m = Hash function h Number of addresses Hash function h2 X Set of memory addresses Set of keys Schicker, E.: Datenbanken und SQL. Teubner 1996 Lect6-Page29 Technische Universität München Hierarchic multi-level data structures: hash Example: Given a file with 100kByte containing address records of 100 bytes each with a name element of 20 characters => max. 1000 records can be saved Hash function h: 20 h(name) = addr = 100 * ((∑i*integer(name[i])) modulo 1000) i=1 Calculates the integer value of the character and sums up the values of the 20 name characters. Modulo is used to map into a smaller memory set of addressed data. The possible set of all names is hashed into the real finite memory set. Hash exception function h2: h2(addr) = exceptaddr = 100 * ((addr/100+43) modulo 1000) Because of modulo and the reduced number of possible addresses the same address can be calculated for different names. h2 then shifts the address 43 elements. If there is again already a record h2 is used again. (But in this case h2 is not optimal) Schicker, E.: Datenbanken und SQL. Teubner 1996 Lect6-Page30 Technische Universität München Hierarchic multi-level data structures: hash Hash search: 1) Use h Search for “Betty” addr=52000 2) Compare key with saved 3) If equal “Betty” found 5) If not equal use h2 as long until key and name is equal or record empty 4) If empty “Betty” not found Hash delete: 1) Use h Search for “Betty” addr=52000 2) Compare key with saved 3) If equal delete “Betty” 5) If not equal use h2 as long until key and name is equal or record empty 4) If empty “Betty” not found But delete is dangerous: it can destroy search pathes for h2 function => Select a good h2 and use additional structures to remember search pathes Schicker, E.: Datenbanken und SQL. Teubner 1996 Lect6-Page31 Technische Universität München Hierarchic multi-level data structures: hash Hash update/add: 1) Use h Search for “Betty” addr=52000 2) Compare key with saved 3) If equal “Betty” found, so update it Schicker, E.: Datenbanken und SQL. Teubner 1996 5) If not equal use h2 as long until key and name is equal or record empty 4) If empty “Betty” can be added Lect6-Page32 Technische Universität München Hierarchic multi-level data structures: hash Additional notes about hash: - If 50% of the possible, real memory is filled between 20 and 40 % of the h-usage lead in collisions so that h2 must be used (80 % fill status cost 5 attempts for over 80 % of the accesses in mean) => 40 – 60 % memory usage is a good compromise between memory consumption and access times - Large data sets cost a lot of additional memory - The defined memory usage is fixed from the beginning on Schicker, E.: Datenbanken und SQL. Teubner 1996 Lect6-Page33 Technische Universität München Overview: hash Hash tables Storage type Directly addressed Memory Address based Memory consumption Satisfiing for most data volumes Access time Very short via direct addressing Change possibility In general very easy but deleting is more complex Field of application Universal on address based memory, but not useful when memory usage changes Very high Complexity Schicker, E.: Datenbanken und SQL. Teubner 1996 Lect6-Page34 Technische Universität München Lecture 6: Data organisation - Review: codes and operating system - Basic data organisation: file system - Single level data structures - Multi level data structures - Primary and secondary keys Lect6-Page35 Technische Universität München Primary and secondary keys Car registration M KM 526 CHA KL 23 B SH 123 Familly name Smith Mills Sunny Calling name Betty Frank James … Car type Primary key: search key which defines the order of the file BUT: It is just easy to access records very fast by using the primary key. But each other search access on other columns must be operated sequentially! => Use additional search keys in inverted files Inverted files contain ordered additional keys and the according primary keys. Familly name Mills Smith Sunny Car registration CHA KL 23 M KM 526 B SH 123 Secondary key in an inverted file Schicker, E.: Datenbanken und SQL. Teubner 1996 Lect6-Page36 Technische Universität München Primary and secondary keys Searching with primary and secondary keys: 1) Is the primary key used a direct access can be used for a very fast access to ISAM or hash files 2) Is a secondary key used the inverted file gives a possibility to find the primary key. Inverted files can be organized as ISAM or as hash again. After finding the primary key it can be used for a direct access to the data files. 3) If no secondary and no primary key is used a sequential serach is necessary => But each key must be unique!!! Disadvatages: - Changes must be done on several files, the original data files and the oinverted - Addintional memory is needed - This redundancy is a problem when a computer crash happens while the different updates so that inconsistent files exist => Additional techniques are necessary (like transactions) which are offered by a data base management system (DBMS) Schicker, E.: Datenbanken und SQL. Teubner 1996 Lect6-Page37 Technische Universität München ? Review: multi-level structures and keys ? What happens in ISAM files when a data or an index block runs over its size? Describe the procedure! Describe the procedure to add a new value to ISAM structures! What is the difference to hash methods? Describe hashing methods? How can you improve their quality? Lect6-Page38 Technische Universität München Matlab (I) Thank you! Lect6-Page39