Lecture 6: Data organisation

Technische Universität München
Lecture 6: Data organisation
- Review: codes and operating system
- Basic data organisation: file system
- Single level data structures
- Multi level data structures
- Primary and secondary keys
Lect6-Page1
Technische Universität München
?
Review: design
?
What is a weaver (graphical scheme)?
For what can it be used (care about key
words)?
What is UML?
Give 5 guidelines from design rules, you
have heard in the lecture?
Lect6-Page2
Technische Universität München
Review: codes
- ASCII/ANSI-Code
- Unicode transformation format (UTF)
- Binary numbers
0010 0011 0000 1111bin = 8975dec
…
Lect6-Page3
Technische Universität München
Review: abstraction by operating system
Interface and device drivers for the different hardware devices, e.g
hard drives or optical drives
User prog.
Games
App. Dev.
Systems
…
System Software
Shell Editor GUI Batch Compiler, …
Interface
Interface
Interface
Operating System (Kernel)
Device
Driver
Machine Code
Microprogramming
- High-level abstraction of hardware with
simple interface (system calls)
- Application independence
- Ressource sharing, critical section management
-…
- Disk and file system
- Device driver
-…
- Basic input/output methodes
Physical Devices
Hardware
Device
Lect6-Page4
Technische Universität München
Lecture 6: Data organisation
- Review: codes and operating system
- Basic data organisation: file system
- Single level data structures
- Multi level data structures
- Primary and secondary keys
Lect6-Page5
Technische Universität München
Basic data organisation: file system
Filesystem
Main
memory
Network
Directories (folder
folder)
Organisation of data in useful blocks by the operating system
Operating
System
Harddrives
(HDD)
Physically
in blocks
(e.g. 4 kByte
Optical
Drives
Files
…
Schicker, E.: Datenbanken und SQL. Teubner 1996
Lect6-Page6
Technische Universität München
Basic data organisation: file system
Directories
(folders
folders)
Organisation of data in useful blocks by the operating system
Main
memory
Filesystem
Operating
System
Files
Supported operations from the operating system
- Creating, deleting, copiing, moving of files
- Searching for files
- Manipulating of data in the files using system calls (load content
into memory, change it and write back to medium)
- Sequentially searching of content
(example filesystems are FAT, FAT32, NTFS, ext2, ext3, reiserfs)
Schicker, E.: Datenbanken und SQL. Teubner 1996
Lect6-Page7
Technische Universität München
Basic data organisation: file system
Matlab example functions:
1) fid = fopen(filename, mode) opens the file filename in the specified mode. The mode
argument can be any of the following: r, w, a, r+, w+, a+, A, W
2) A = fread(fid) reads data in binary format from the file specified by fid into matrix A.
or
tline = fgetl(fid) returns the next line of the file associated with the file identifier fid.
or
A = fscanf(fid, format) reads all the data from the file specified by fid, converts it
according to the specified format string, and returns it in matrix A.
3) count = fwrite(fid, A, precision) writes the elements of matrix A to the specified file,
translating MATLAB values to the specified precision.
or
[count, errmsg] = fprintf(fid, format, A, ...) formats the data in the real part of matrix A
(and in any additional matrix arguments) under control of the specified format string,
and writes it to the file associated with file identifier fid.
4) status = fseek(fid, offset, origin) repositions the file position indicator in the file with the
given fid to the byte with the specified offset relative to origin.
5) status = fclose(fid) closes the specified file if it is open, returning 0 if successful and -1 if
unsuccessful. Argument fid is a file identifier associated with an open file.
Lect6-Page8
Technische Universität München
The ideas ...
How to organize data within the files so that large data sets
can be searched and accessed in a very fast way.
The optimisation aims:
- Fast searching within files
- Optimized memory usage
- Easy possibility for deleting,
inserting and changing of data
- Easy maintainable
- Stabile and without errors
Schicker, E.: Datenbanken und SQL. Teubner 1996
Lect6-Page9
Technische Universität München
Sequential vs. address-based
Data record or just record:
Logic data unit in a file, e.g. a line in a text file or one line in a
table
Sequential file:
The records are saved as sequence without gaps (e.g. lists)
<Begin>
Record1
Record5
Record9
Record2
Record3
Record6
Record7
Record10 …
Record4
Record8
<End>
Characteristics
- Start is known
- Each element has a direct successor
- At the end is an end signal
Address based file:
The records are saved at dedicated addresses which allows
direct access (e.g. tables) => fast access, but dealing with
addresses
Byte 0
Byte100
Byte200
Byte800
Byte900
Schicker, E.: Datenbanken und SQL. Teubner 1996
Record1
Record2
Record3
Characteristics
- Each element is directly accessible
- The elements have the same size
Record4
Record5
…
Lect6-Page10
Technische Universität München
Lecture 6: Data organisation
- Review: codes and operating system
- Basic data organisation: file system
- Single level data structures
- Multi level data structures
- Primary and secondary keys
Lect6-Page11
Technische Universität München
Address-based with search keys
Directly addressed:
The address is a function of the data or one element of the
data
Example: Calendar or filofax
February 2nd is day 33 of the year
Search key or just key
If each entry day has a limited memory of 1000 Bytes it is possible to
directly calculate the position of the record startpoint
Address = StartaddressOfYear + (DOY-1)*1000
…
Byte32000
DOY32 = Feb. 01
DOY33 = Feb. 02
Byte33000
Byte34000
DOY34 = Feb. 03
…
Schicker, E.: Datenbanken und SQL. Teubner 1996
Lect6-Page12
Technische Universität München
Ordered lists and binary search
Ordered:
The elements are part of an ordered data set, e.g. yellow
pages or telefone book (ordered lists)
But the address is not directly derivable.
=> Method to find records
Binary search:
<= ?
<= ?
<= ?
<= ?
Anton
Anton
Anton
Anton
Betty
Betty
Betty
Betty
Cesar
Cesar
Cesar
>?
Michael
Michael
>?
Norbert
Norbert
Key:
Otto
Otto
Betty
Betty
Paul
>?
Pauline
Richard
Efficiency (e.g. with 1 Mio. records):
Sandy
Sequential search would need 500000 search
Tim
accesses in mean
Wolfgang
Binary search would need 20 search accesses
>?
(= 1 Mio ~ 220)
Problem: Deleting and inserting!
Schicker, E.: Datenbanken und SQL. Teubner 1996
Lect6-Page13
Technische Universität München
Linked lists
Ordering with singly-linked or doubly-linked list with pointers
Anton
Betty
Characteristics
- Start is known
- Each element has a link to ist successor
- At the end is an end signal
Cesar
Michael
Richard
Tim
Schicker, E.: Datenbanken und SQL. Teubner 1996
Lect6-Page14
Technische Universität München
Linked lists
Ordering with singly-linked or doubly-linked list with pointers
Anton
Betty
Cesar
Characteristics
- Start is known
- Each element has a link to its successor
- At the end is an end signal
- The physical order is not important
Simple inserting and deleting of records!
Michael
1) Search position
3) Link previous list with
new element
Pauline
4) Delete old pointer
X
2) Link new element to rest
of the list
Richard
Tim
Schicker, E.: Datenbanken und SQL. Teubner 1996
Lect6-Page15
Technische Universität München
Overview single-level data structures
Sequential lists
Tables
Ordered lists
Storage type
Sequential
Address-based
Memory
Memory
consumption
Sequential
Very low (no
gaps)
Address-based
Very high (large
gaps)
Sequential
ordered
Address-based
Low (records
with same size)
Access time
Sequential
search
Adding at the
end
One access direct Binary search
Directly
changeable
Field of
application
Universal
Very restricted
Complexity
Simple
Simple
Change
possibility
Schicker, E.: Datenbanken und SQL. Teubner 1996
Moving of
following
elements
Universal on
address-based
memory
Simple
Ordered linked
lists
Address-based
Address-based
Low (additional
memory for
addresses
Sequential
search
Address changes
Universal on
address-based
memory
Simple
Lect6-Page16
Technische Universität München
?
Review: single-level structures
?
What is the difference between sequential
lists and ordered linked lists?
Describe the calculation of addresses in
direct addressed methods (e.g. for a
calendar)?
Describe the steps for deleting an
element in a linked list?
Lect6-Page17
Technische Universität München
Lecture 6: Data organisation
- Review: codes and operating system
- Basic data organisation: file system
- Single level data structures
- Multi level data structures
- Primary and secondary keys
Lect6-Page18
Technische Universität München
Hierarchic multi-level data structures: trees
Single-level structures:
- Simple algorithms
- But either long access times nor reduced changeability
=> Usage of hierarchy with administrative meta-information
meta
meta-information
Binary trees or trees (see binary search):
8
4
12
2
1
2
6
3
4
5
6
10
7
8
9
10
14
11
12
13
14
15
Advantages:
Keep the meta-information in the memory for fast searching
and leave the real data in files and can be unordered
Schicker, E.: Datenbanken und SQL. Teubner 1996
Lect6-Page19
Technische Universität München
Hierarchic multi-level data structures: index-sequential
Binary trees or trees:
- Additional memory for meta-data needed
- Unbalanced trees after inserts and deletes
=> index-sequential files or B-trees with
usage of external addressable memory
(blockoriented with equal block sizes)
8 Unbalanced tree
4
12
14
6
5
8
12
13
14
15
Anton: Addr. Block 8
Indexblock
DataBlock 7
Block Paul: Addr. Block 7
1KB Tim: Addr. Block 9
Paul
Pauline
DataBlock 8
Anton
Betty
Cesar
DataBlock 9
Tim
Wolfgang
Schicker, E.: Datenbanken und SQL. Teubner 1996
ISAM:
Index Sequential Access
Method
(VSAM:
Virtual Sequential Access
Method)
Lect6-Page20
Technische Universität München
Hierarchic multi-level data structures: index-sequential
Index-sequential search:
1) Load index block
into memory
2) Search
sequ. for “Betty”
Meta-information can be kept
in the memory, even when
millions of records are in the
file
Anton: Addr. Block 8
Paul: Addr. Block 7
Tim: Addr. Block 9
3) “Betty” is lexically
greater then “Anton”
and lower then “Paul”
4) Load data block
with possible record
“Betty” into memory
Anton
Betty
Cesar
5) Search
sequ. for “Betty”
At least two hard drive block
accesses are necessary!
Schicker, E.: Datenbanken und SQL. Teubner 1996
Paul
Pauline
Tim
Wolfgang
6) Found record “Betty”
Lect6-Page21
Technische Universität München
Hierarchic multi-level data structures: index-sequential
Index-sequential delete:
1) Load index block
into memory
2) Search
sequ. for “Betty”
Anton: Addr. Block 8
Paul: Addr. Block 7
Tim: Addr. Block 9
3) “Betty” is lexically
greater then “Anton”
and lower then “Paul”
4) Load data block
with possible record
“Betty” into memory
Paul
Pauline
Anton
Betty
Cesar
5) Search
sequ. for “Betty”
Tim
Wolfgang
6) Delete found record “Betty”
8) Maybe adapt
index block and
save back
Schicker, E.: Datenbanken und SQL. Teubner 1996
Anton
Cesar
7) Save back
Lect6-Page22
Technische Universität München
Hierarchic multi-level data structures: index-sequential
Index-sequential change/add (1):
1) Load index block
into memory
2) Search
sequ. for “Otto”
Anton: Addr. Block 8
Paul: Addr. Block 7
Tim: Addr. Block 9
3) “Otto” is lexically
greater then “Anton”
and lower then “Paul”
4) Load data block
with possible record
“Otto” into memory
Paul
Pauline
Anton
Betty
Cesar
5) Search
sequ. for “Otto”
Tim
Wolfgang
6) Add “Otto”
Anton
Betty
Cesar
Otto
Schicker, E.: Datenbanken und SQL. Teubner 1996
7) Save back
Lect6-Page23
Technische Universität München
Hierarchic multi-level data structures: index-sequential
Index-sequential change/add (2, splitting):
1) Load index block
into memory
Anton: Addr. Block 8
Paul: Addr. Block 7
Tim: Addr. Block 9
2) Search
sequ. for “Norbert”
4) Load data block
with possible record
“Norbert” into
memory
9) Adapt index
block and save
back
Keep fill level of data blocks
at a max. of 70% to 80%
3) “Norbert” is lexically
greater then “Anton”
and lower then “Paul”
5) Search
sequ. for
“Norbert”
Paul
Pauline
Anton
Betty
Cesar
Otto
6) Block is full
=> splitting
Tim
Wolfgang
7) Add “Norbert”
Anton
Betty
Cesar
Schicker, E.: Datenbanken und SQL. Teubner 1996
Norbert
Otto
8) Save back
Lect6-Page24
Technische Universität München
Hierarchic multi-level data structures: index-sequential
Organize multiple index blocks:
First level
index block
Address
Second level
index block
It’s a tree structure again!
Schicker, E.: Datenbanken und SQL. Teubner 1996
nth level
index block Address
…
…
Lect6-Page25
Technische Universität München
Hierarchic multi-level data structures: index-sequential
Organize multiple index blocks – a small calculation:
A given a block size of 2048 Bytes, a record size of 116 Bytes and an index size of 24
Bytes with a max. fill level of 75% results in 64 record per index block and 13 records per
data block
1 index level => 64 blocks => 64*13 = 830 data records => 97 kBytes data
2 index level => 642 blocks => 53000 data records
=> 6 MBytes data
3 index level =>
=> 400 MBytes data
4 index level =>
=> 25 GBytes data
In addition integer keys reduce the index sizes and allow the administration
of much more data records
Schicker, E.: Datenbanken und SQL. Teubner 1996
Lect6-Page26
Technische Universität München
Overview: index-sequential
Index sequential access method (ISAM)
Storage type
Address based
Memory
Address based
Memory consumption Low (if not ideal filled it is satisfiing)
Access time
Short via index levels
Change possibility
Local changeability with some changes on index levels
Field of application
Universal on address based memory with reorganizations from
time to time
Very high
Complexity
Schicker, E.: Datenbanken und SQL. Teubner 1996
Lect6-Page27
Technische Universität München
Hierarchic multi-level data structures: hash
Another way beside ISAM:
ISAM combine linked lists and the binary/sequential search to
setup trees which can be used for a fast administration of
large data sets
=> Another possibility is to use direct addressing
BUT: Names/strings are not usable for addresses
=> Method to calculate addresses out of strings
Function f
Set of keys
Set of memory addresses
Number of keys
=
n
n
>>
=
Number of addresses
BUT:
Number of keys
=
m =
Number of addresses
Hash function h
Set of memory addresses
e.g. possible number of names
with max. 20 characters is 2620
Set of keys
Schicker, E.: Datenbanken und SQL. Teubner 1996
Lect6-Page28
Technische Universität München
Hierarchic multi-level data structures: hash
What happens when hash functions calculates the same
addfress: exception handling finds another address
=> Hash function of second order
Number of keys
=
n
>>
m =
Hash function h
Number of addresses
Hash function h2
X
Set of memory addresses
Set of keys
Schicker, E.: Datenbanken und SQL. Teubner 1996
Lect6-Page29
Technische Universität München
Hierarchic multi-level data structures: hash
Example:
Given a file with 100kByte containing address records of 100
bytes each with a name element of 20 characters
=> max. 1000 records can be saved
Hash function h:
20
h(name) = addr = 100 * ((∑i*integer(name[i])) modulo 1000)
i=1
Calculates the integer value of the character and sums up the values
of the 20 name characters. Modulo is used to map into a smaller
memory set of addressed data. The possible set of all names is
hashed into the real finite memory set.
Hash exception function h2:
h2(addr) = exceptaddr = 100 * ((addr/100+43) modulo 1000)
Because of modulo and the reduced number of possible addresses the same
address can be calculated for different names. h2 then shifts the address 43
elements. If there is again already a record h2 is used again.
(But in this case h2 is not optimal)
Schicker, E.: Datenbanken und SQL. Teubner 1996
Lect6-Page30
Technische Universität München
Hierarchic multi-level data structures: hash
Hash search:
1) Use h
Search
for “Betty”
addr=52000
2) Compare
key with saved
3) If equal
“Betty”
found
5) If not equal use h2 as
long until key and name is
equal or record empty
4) If empty
“Betty” not
found
Hash delete:
1) Use h
Search
for “Betty”
addr=52000
2) Compare
key with saved
3) If equal
delete
“Betty”
5) If not equal use h2 as
long until key and name is
equal or record empty
4) If empty
“Betty” not
found
But delete is dangerous: it can destroy search pathes for h2 function
=> Select a good h2 and use additional structures to remember search
pathes
Schicker, E.: Datenbanken und SQL. Teubner 1996
Lect6-Page31
Technische Universität München
Hierarchic multi-level data structures: hash
Hash update/add:
1) Use h
Search
for “Betty”
addr=52000
2) Compare
key with saved
3) If equal
“Betty”
found, so
update it
Schicker, E.: Datenbanken und SQL. Teubner 1996
5) If not equal use h2 as
long until key and name is
equal or record empty
4) If empty
“Betty” can
be added
Lect6-Page32
Technische Universität München
Hierarchic multi-level data structures: hash
Additional notes about hash:
- If 50% of the possible, real memory is filled between 20 and
40 % of the h-usage lead in collisions so that h2 must be
used (80 % fill status cost 5 attempts for over 80 % of the
accesses in mean)
=> 40 – 60 % memory usage is a good compromise
between memory consumption and access times
- Large data sets cost a lot of additional memory
- The defined memory usage is fixed from the beginning on
Schicker, E.: Datenbanken und SQL. Teubner 1996
Lect6-Page33
Technische Universität München
Overview: hash
Hash tables
Storage type
Directly addressed
Memory
Address based
Memory consumption Satisfiing for most data volumes
Access time
Very short via direct addressing
Change possibility
In general very easy but deleting is more complex
Field of application
Universal on address based memory, but not useful when memory
usage changes
Very high
Complexity
Schicker, E.: Datenbanken und SQL. Teubner 1996
Lect6-Page34
Technische Universität München
Lecture 6: Data organisation
- Review: codes and operating system
- Basic data organisation: file system
- Single level data structures
- Multi level data structures
- Primary and secondary keys
Lect6-Page35
Technische Universität München
Primary and secondary keys
Car registration
M KM 526
CHA KL 23
B SH 123
Familly name
Smith
Mills
Sunny
Calling name
Betty
Frank
James
…
Car type
Primary key: search key which defines the order of the file
BUT: It is just easy to access records very fast by using the primary key. But each
other search access on other columns must be operated sequentially!
=> Use additional search keys in inverted files
Inverted files contain ordered additional keys and the according primary keys.
Familly name
Mills
Smith
Sunny
Car registration
CHA KL 23
M KM 526
B SH 123
Secondary key in an inverted file
Schicker, E.: Datenbanken und SQL. Teubner 1996
Lect6-Page36
Technische Universität München
Primary and secondary keys
Searching with primary and secondary keys:
1) Is the primary key used a direct access can be used for a very fast access to ISAM
or hash files
2) Is a secondary key used the inverted file gives a possibility to find the primary key.
Inverted files can be organized as ISAM or as hash again. After finding the primary key
it can be used for a direct access to the data files.
3) If no secondary and no primary key is used a sequential serach is necessary
=> But each key must be unique!!!
Disadvatages:
- Changes must be done on several files, the original data files and the oinverted
- Addintional memory is needed
- This redundancy is a problem when a computer crash happens while the different
updates so that inconsistent files exist
=> Additional techniques are necessary (like transactions) which are offered by a data
base management system (DBMS)
Schicker, E.: Datenbanken und SQL. Teubner 1996
Lect6-Page37
Technische Universität München
?
Review: multi-level structures and keys
?
What happens in ISAM files when a data
or an index block runs over its size?
Describe the procedure!
Describe the procedure to add a new value
to ISAM structures! What is the difference
to hash methods?
Describe hashing methods? How can you
improve their quality?
Lect6-Page38
Technische Universität München
Matlab (I)
Thank you!
Lect6-Page39