Editorial - Gesellschaft für Informatik

Datenbank-Spektrum
Zeitschrift für Datenbanktechnologien und Information Retrieval
Band 12 Heft 3 November 2012
Schwerpunkt: Scientific Data Management
Gastherausgeber: Michael Gertz
EDITORIAL
DISSERTATIONEN
Editorial
M. Gertz W. Müller 157
Dissertationen 219
COMMUNITY
SCHWERPUNKTBEITRÄGE
Data Management Challenges in Next Generation
Sequencing
S. Wandelt A. Rheinländer M. Bux L. Thalheim B. Haldemann U. Leser 161
Handling Big Data in Astronomy and Astrophysics:
Rich Structured Queries on Replicated Cloud Data
with XtreemFS
H. Enke A. Partl A. Reinefeld F. Schintke 173
Towards Large-Scale Meteorological Data
Services: A Case Study
D. Misev P. Baumann J. Seib 183
Der 175. Datenbankstammtisch an der HTW
Dresden
U. Wloka G. Gräfe 223
Bericht über den 24. GI-Workshop “Grundlagen
von Datenbanken”
I. Schmitt H. Höpfner 227
News 229
NACHRUF
Die deutsche DB-Community trauert um Hagen
Höpfner (25.01.1977–01.10.2012) 233
Scientific Workflows and Provenance: Introduction
and Research Opportunities
V. Cuevas-Vicenttín S. Dey S. Köhler S. Riddle B. Ludäscher 193
FACHBEITRÄGE
XPath and XQuery Full Text Standard and Its
Support in RDBMSs
D. Petković 205
CityPlot: Colored ER Diagrams to Visualize
Structure and Contents of Databases
M. Dugas G. Vossen 215
Weitere Artikel finden Sie auf www.springerlink.com.
Abstracts publiziert/indexiert in Google Scholar, DBLP,
io-port.net, OCLC, Summon by Serial Solutions.
Hinweise für Autoren für die Zeitschrift Datenbank Spektrum
finden Sie auf www.springer.com/13222.
Datenbank Spektrum (2012) 12:157–159
DOI 10.1007/s13222-012-0104-8
EDITORIAL
Editorial
Michael Gertz · Wolfgang Müller
Online publiziert: 5. Oktober 2012
© Springer-Verlag Berlin Heidelberg 2012
1 Schwerpunktthema: Scientific Data Management
Spricht man heutzutage von Herausforderungen an die Verwaltung und Analyse großer Datenmengen, so bezieht man
sich dabei meist auf Anwendungen im Bereich des eCommerce sowie neuerdings insbesondere auf Analysen sozialer Netzwerke. Dieser Fokus ist sicherlich gut begründet, da
hier typischerweise große internationale Firmen wie Facebook, Twitter, eBay oder Amazon mit Geschäftsmodellen
im Vordergrund stehen, bei denen der Verkauf von Produkten sowie die Pflege und Analyse von Kunden- und Verkaufsdaten die Geschäftsgrundlage bilden. Ähnliches gilt
für die Telekommunikationsindustrie, bei der in großen Data Warehouses Informationen zu Anwendern und deren Nutzung von Services verwaltet und zur Verbesserung jener Services analysiert werden. Schwerpunkte traditioneller Forschung und Entwicklung im Bereich Datenbanken lassen
sich generell in den oben genannten Bereichen der Verwaltung und Analyse von Geschäftsdaten ansiedeln.
Der Menge an geschäftsorientierten Daten steht aber eine noch größere Menge an wissenschaftlichen Daten gegenüber, der bisher weniger Aufmerksamkeit im Rahmen der
Mainstream-Datenbankforschung und -entwicklung gewidmet wurde. Gerade in den letzten Jahren haben die Naturwissenschaften immense Fortschritte in der Instrumentierung von Experimenten, Simulationen und Beobachtungen
erfahren. Dies betrifft nahezu alle Bereiche in den NaturM. Gertz ()
Heidelberg University, Heidelberg, Deutschland
e-mail: [email protected]
W. Müller
HITS gGmbH, Heidelberg, Deutschland
e-mail: [email protected]
wissenschaften. Hierzu gehören u.a. die Physik, insbesondere die Astrophysik, Kosmologie und Teilchenphysik, die
Biologie mit Schwerpunkten in der Genetik und Molekularbiologie sowie die in den Geowissenschaften angesiedelten
Umweltwissenschaften und die Klimaforschung. In der Biologie nimmt zum Beispiel die Fähigkeit, mit Experimenten
Daten zu generieren, schneller zu als die Rechenleistung zur
Verarbeitung der Daten, was insbesondere bei der Gensequenzierung ein großes Bottleneck darstellt.
Aus Datensicht handelt es sich bei wissenschaftlichen
Daten nicht um einfache transaktionale Daten, sondern um
Daten, die typischerweise sehr heterogen und komplex sind
und multiple Skalen beschreiben. Traditionelle Datenbanktechniken sind hierdurch teilweise nicht einfach auf diese
Anwendungsbereiche zu übertragen. Die Datenintegration
ist ein wichtiges Thema, das beispielsweise durch Scientific
Workflows angegangen wird. Gleichzeitig ergeben sich neue
Optimierungsmöglichkeiten durch die gegenüber universellen Datenbanken veränderte Domäne, denn neue Zugriffsmuster und andere vorherrschende konzeptuelle Datenstrukturen ermöglichen neue Optimierungen.
Die effiziente Verwaltung, Speicherung, Suche und Analyse wissenschaftlicher Daten stellt eine immense Herausforderung an diese und verwandte Bereiche der Naturwissenschaften dar. Wie kann man effektiv neues Wissen aus
den Daten ableiten? Wo stoßen aktuelle Systeme an ihre
Grenzen? Wo bieten sich neue und interessante Themen für
die Datenbankforschung?
In unserem Themenheft werden in vier Artikeln interessante Themenstellungen angegangen, die einen Eindruck
von der Vielfältigkeit der obigen Herausforderungen geben.
Der erste Beitrag Data Management Challenges in
Next Generation Sequencing von Sebastian Wandelt, Astrid
Rheinländer, Marc Bux, Lisa Thalheim, Berit Haldemann
und Ulf Leser widmet sich dem Next Generation Sequen-
158
cing, dem „Big Data“-Thema in der Biologie überhaupt.
Häufig ist zu hören, dass wir uns auf einen Zustand zubewegen, in dem die Datenverarbeitung von Gensequenzen
(technisch gesehen sind dies lange Strings) teurer wird als
ihre eigentliche Gewinnung. Der Artikel gibt einen Überblick über die Herausforderungen sowie die wichtigsten Lösungsansätze.
Der zweite Beitrag Handling Big Data in Astronomy and
Astrophysics: Rich Structured Queries on Replicated Cloud
Data with XtreemFS von Harry Enke, Adrian Partl, Alexander Reinefeld und Florian Schintke befasst sich mit der Anfragebearbeitung in verteilt vorliegenden Datensammlungen. Das Anwendungsgebiet sind hier virtuelle Observatorien und deren astronomische Daten, die entweder experimentell oder durch Simulation gewonnen worden sind. Hier
werden durch die Kombination eines verteilten Dateisystems und eines Anfrageoptimierers und -verteilers erhebliche Vorteile erzielt.
In dem dritten Artikel dieses Heftes befassen sich Dimitar Misev, Peter Baumann und Jürgen Seib unter dem Titel
Towards Large-Scale Meteorological Data Services: A Case Study mit Daten, die im Rahmen von Wettersimulationen verwendet werden. Diese umfassen u.a. verschiedenste Formen von Echtzeit-Sensordaten, Satellitendaten sowie
historische Daten, die in vieldimensionalen Arrays integriert
werden. Gegenstand ist hier rasdaman, eine Datenbank, die
zwar auf relationalen Datenbanken aufsetzt, aber auf dieser
Basis für die Verwaltung von und Anfrage an meteorologischen Daten effiziente und praktisch relevante Anfragemöglichkeiten bereitstellt.
Schließlich beschäftigt sich der Artikel Scientific Workflows and Provenance: Introduction and Research Opportunities von Víctor Cuevas Vicenttín, Saumen Dey, Sven Köhler, Sean Riddle und Bertram Ludäscher mit wissenschaftlichen Workflows. Wie eingangs erwähnt werden diese für die
Verarbeitung wissenschaftlicher Daten immer wichtiger. Bei
der Verarbeitung und Analyse wissenschaftlicher Daten stehen solche Workflows als komplexe Arbeitsabläufe im Hintergrund, die Daten auf verschiedenste Arten vorverarbeiten,
integrieren und umstrukturieren, um sie beispielsweise einer
Analyse zugänglich zu machen. Die Bedeutung von Workflows liegt darin, dass sie eine Perspektive bieten, auf einfachere Art und Weise Domänenwissen einzubringen. Im Bereich der wissenschaftlichen Datenbanken spielt das jeweilige Domänenwissen eine sehr große Rolle. Häufig arbeiten
Domänenexperten und Entwickler zusammen, um konkrete Probleme zu lösen. Den Domänenexperten interessiert im
Wesentlichen das Resultat, der Informatiker ist häufig an der
Durchführung, der Flexibilität, Generalität und Wiederverwendbarkeit interessiert.
Während oben die Unterschiede zwischen wissenschaftlichen und „normalen“ Datenbanken betont wurden, sollte
man aber auch darauf hinweisen, dass eine Vielzahl von Entwicklungen aus dem Datenbankbereich bei der Verarbeitung
Datenbank Spektrum (2012) 12:157–159
und Analyse wissenschaftlicher Daten erfolgreich eingesetzt
werden. Hierzu gehören z.B. eine Vielfalt von Indexstrukturen für räumliche und zeitlich veränderliche Daten, Verfahren zur Analyse von Datenströmen, effiziente Data-MiningMethoden zum Clustering hochdimensionaler Daten oder
Techniken zur Analyse von Graphstrukturen (wie sie beispielsweise gerade in der Molekularbiologie von Interesse sind). Nichtsdestotrotz bietet der Bereich Scientific Data
Management noch eine Vielzahl von interessanten Möglichkeiten, Methoden und Techniken aus der Datenbanktechnologie geeignet in den oben genannten und weiteren Anwendungsbereichen einzubringen und somit diesen Wissenschaften bei ihren Problemen mit der Datenflut („Data Deluge“) zu helfen. Ein guter Wegweiser hierzu war und ist
immer noch der Artikel von Jim Gray et al. Scientific data
management in the coming decade (SIGMOD Record 34(4):
34–41, 2005).
Diese Schwerpunktbeiträge werden ergänzt durch zwei
Fachbeiträge XPath and XQuery Full Text Standard and Its
Support in RDBMSs von Dus̆an Petković und CityPlot: Colored ER Diagrams to Visualize Structure and Contents of
Databases von Martin Dugas und Gottfried Vossen. Die Rubrik „Dissertationen“ enthält in diesem Heft sechs Kurzfassungen von Dissertationen.
In der Rubrik „Community“ berichten Uwe Wloka und
Gunter Gräfe von einer in der deutschen Datenbankgemeinde einmaligen und sehr lebendigen Wissenschaftseinrichtung, die seit fast 20 Jahren vom Fachinteresse der Dresdener Datenbankkollegen zeugt. In ihrem Beitrag Der 175.
Datenbankstammtisch an der HTW Dresden beschreiben sie
das Konzept und die Historie dieser Veranstaltungsreihe und
untermauern deren Erfolg mit verschiedenen statistischen
Daten. Weiterhin enthält diese Rubrik einen Bericht über
den 24. GI-Workshop „Grundlagen von Datenbanken“ von
Ingo Schmitt und Hagen Höffner sowie aktuelle Informationen.
2 Künftige Schwerpunktthemen
2.1 MapReduce Programming Model
MapReduce (MR) is a programming model which facilitates
parallel processing of large, distributed, and even heterogeneous data sets. To accelerate the development of specific
MR applications, an MR implementation provides a framework dealing with data distribution and and scheduling of
parallel tasks. The user only has to complement this framework by specifying a map function—processing key/value
pairs to generate intermediate key/value pairs—and a reduce function which groups all records with the same intermediate key and merges all values of such groups.
Datenbank Spektrum (2012) 12:157–159
Using this approach, programs written in such a functional style can automatically exploit large degrees of parallelism and thereby perfectly scale. As a consequence,
the MR model had tremendous success in recent years
covering many areas of Big Data processing. For this
reason, the “Datenbank-Spektrum” wants to publish research contributions—especially of the German database
community—providing an overview over ongoing work in
this particular area.
Submissions covering topics from the following nonexclusive list are encouraged:
–
–
–
–
–
Applications of the MR paradigm
Optimization of the MR framework and its applications
MR-conform compilation of DB languages
Schema flexibility (key/value stores) and MapReduce
Comparison of applications running under MapReduce/Hadoop and parallel DBMSs
– Cooperation of NoSQL and SQL when processing XXL
data
Paper format: 8–10 pages, double column.
Notice of intent for a contribution: July 15th, 2012.
Guest editor:
Theo Härder, University of Kaiserslautern,
[email protected]
159
a wide adoption in many other application domains including life sciences, multifaceted data integration, as well as
community-based data collection, and large knowledge bases like DBpedia.
This special issue of the “Datenbank-Spektrum” aims to
provide an overview of recent developments, challenges,
and future directions in the field of RDF technologies and
applications.
Topics of interest include (but are not limited to):
– RDF data management
– RDF access over the Web
– Querying and query optimization over RDF data—especially when accessed over the Web
– Applications and usage scenarios
– Case studies and experience reports
Paper format: 8–10 pages, double column.
Notice of intent for a contribution: Nov. 15th, 2012.
Guest editors:
Johann-Christoph Freytag, Humboldt-Universität zu Berlin,
[email protected]
Bernhard Mitschang, University of Stuttgart,
[email protected]
Deadline for submissions: February 1st, 2013.
Deadline for submissions: October 1st, 2012.
2.3 Best Workshop Papers of BTW 2013
2.2 RDF Data Management
This special issue of the „Datenbank-Spektrum“ is dedicated to the Best Papers of the Workshops running at the BTW
2013 at the TU Magdeburg. The selected Workshop contributions should be extended to match the format of regular
DASP papers.
Nowadays, more and more data is modeled and managed by
means of the W3C Resource Description Framework (RDF)
and queried by the W3C SPARQL Protocol and RDF Query Language (SPARQL). RDF is commonly known as a
conceptual data model for structured information that was
standardized to become a key enabler of the Semantic Web
to express metadata on the Web. It supports relationships
between resources as first-class citizens, provides modeling
flexibility towards any kind of schema, and is even usable
without a schema at all. Furthermore, RDF allows to collect data starting with very little schema information and refining the schema later, as required. This flexibility led to
Paper format: 8–10 pages, double column.
Selection of the Best Papers by the Workshop chairs and the
guest editor: April 15th, 2013.
Guest editor:
Theo Härder, University of Kaiserslautern,
[email protected]
Deadline for submissions: June 1st, 2013.
Datenbank Spektrum (2012) 12:173–181
DOI 10.1007/s13222-012-0099-1
S C H W E R P U N K T B E I T R AG
Handling Big Data in Astronomy and Astrophysics:
Rich Structured Queries on Replicated Cloud Data
with XtreemFS
Harry Enke · Adrian Partl · Alexander Reinefeld ·
Florian Schintke
Received: 2 July 2012 / Accepted: 28 July 2012 / Published online: 10 August 2012
© Springer-Verlag 2012
Abstract With recent observational instruments and survey
campaigns in astrophysics, efficient analysis of big structured data becomes more and more relevant. While providing good query expressiveness and data analysis capabilities through SQL, off-the-shelf RDBMS are yet not well
prepared to handle high volume scientific data distributed
across several nodes, neither for fast data ingest nor for fast
spatial queries. Our SQL query parser and job manager performs query reformulation to spread queries to data nodes,
gathering outputs on a head node and providing them again
to the shards for subsequent processing steps. We combine
this data analysis architecture with the cloud data storage
component XtreemFS for automatic data replication to improve the availability and access latency. With our solution we perform rich structured data analysis expressed using SQL on large amounts of structured astrophysical data
distributed across numerous storage nodes in parallel. The
cloud storage virtualization with XtreemFS provides elasticity and reproducibility of scientific analysis tasks through
its snapshot capability.
H. Enke · A. Partl
Astrophysical Institute in Potsdam (AIP), Berlin, Germany
H. Enke
e-mail: [email protected]
A. Partl
e-mail: [email protected]
A. Reinefeld () · F. Schintke
Zuse Institute Berlin (ZIB), Berlin, Germany
e-mail: [email protected]
F. Schintke
e-mail: [email protected]
1 Introduction
In astronomy and astrophysics, immense amounts of scientific data are produced by the ever growing capabilities of observational instruments, surveys covering substantial parts of the whole sky in various frequency ranges,
and computational simulations. For instance, the LOw Frequency ARray (LOFAR), a European radio telescope based
on distributed antennae fields located in the Netherlands,
Germany, France, UK, and Denmark produces up to five
Petabyte of raw data for a mean observation time of only
four hours [1]. Even the subsequently processed and correlated data from the various stations still amounts to several tens of Terabytes for such an observation. The general
characteristic of this raw, first stage data is that it is huge
in size and uniform in structure. Additionally, the data format is usually well known and any infrastructural problems
in managing these data can be resolved by improving the
hardware setup hosting the data. If we plot the size of individual data sets against the number of such data sets in
astronomy and astrophysics (see Fig. 1), the first stage data
can be found in area I.
Any further data processing steps such as removing instrument characteristics or correlating the station data in the
case of LOFAR yield intermediate data (area II in Fig. 1),
which is often referred to as ‘science ready’ data.1 The resulting datasets are already smaller in size and their structure
and formats are diverse. As a whole, ‘science ready’ data
still accumulate to large storage needs. In the case of LOFAR it is expected that several Petabytes of ‘science ready’
data are produced per year.
Today, organizing access to and enabling further processing of ‘science ready’ data usually requires dedicated data
1 This
process is also called ‘pipelining’ or data reduction.
174
Fig. 1 Schematic view of the number of data sets as a function of the
size of an individual data set. Area I refers to the raw data or first stage
data, area II marks the second stage data known as ‘science ready’
data, and area III the third stage data (see text for further discussion).
The real interesting challenges for scientific data management in astronomy arise in the second area: huge and highly structured data
centers. It is no longer feasible to store such large quantities
at the scientist’s workstation. Because the data is generally
used by other scientists than the producers, it also has to
be curated, validated, checked for consistency, and enriched
by metadata. Thus, any tool handling and analyzing such
amounts of ‘science ready’ data has to be co-located with
the data. Then only a small fraction of the data (i.e. results)
needs to be transferred to the individual scientist for publication. This highly structured and individual data is what we
refer to as third stage data in Fig. 1. Formerly, only such condensed results were publicly available and regarded as scientifically interesting, and second stage data was located at the
scientist’s workstation. This view has considerably changed
and more focus is laid on the publication of the underlying
‘science ready’ data.
Observation Data Most of the published observational
data in astronomy is now available in digital form.2 Many
data archives also feature databases of their objects with core
properties and provide access through web based services.
The most successful examples are the Sloan Digital Sky
Survey (SDSS),3 the Centre des Données Astronomiques de
Strasbourg (CDS) with a huge collection of small archives,4
2 A common file format in astronomy, the Flexible Image Transfer Sys-
Datenbank Spektrum (2012) 12:173–181
and all the archives published using the Virtual Observatory
protocols.
In observations, highly structured ‘science ready’ data
such as images, table data, graphs, and others pose new
problems when various sources of information or archives
need to be integrated for a compound view. Such a multifrequency view of an object from infrared, microwave, radio, and gamma ray catalogs demand standardized calibrations and uniformly accessible data structures. The need for
a multi-wavelength view of the Universe gave rise to the
International Virtual Observatory Alliance (IVOA) with its
goal of a distributed search and retrieval engine for astronomical data [4].
The application of data mining concepts and the multifrequency approach moves the scientific focus to the second
stage data. It is therefore essential that in the production and
operation of science ready data collections the whole process of data curation and validation is applied. Formerly, this
was used only for scientific papers and the few data tables
summarizing scientific results. For observational data, the
organization of key properties in database tables is already
one of the ubiquitous features of the data preparation, where
Table 1 provides a brief overview.
Simulation Data To study the validity of theoretical models, simulations are an essential tool. Especially in the field
of cosmology, simulations are needed to interpret observational results. Despite their importance in cosmology, data
archives of large simulations were established only in the
recent years with the Millennium Simulation5 [12] and the
MultiDark databases6 [17]. The Millennium database introduced this archiving based on relational database management systems (RDBMS) to the simulation community as a
valuable tool in obtaining scientific results. Using a simulation data archive, more scientists can work on the data and
ask questions far beyond the focus of the scientists who generated the simulation.
Formerly, scientific work on simulations required direct
access to the raw data. However, because of its large sizes
and difficulties in transferring the data, this was only feasible for small groups. The results of this limitation were nonpublicly available data mining tools and a large amount of
bespoke data formats and access procedures for simulation
data. This proved to be an obstacle for efficient scientific
work in larger collaborations. As is the case with observational data, a collaborative access to the simulations requires
the collection of this data in one location with the provision
of a collaborative data analysis environment. Furthermore,
tools for simulation analysis require compute facilities tuned
tem (FITS) was created in 1979 and is approved as a standard by the
IAU [2].
3 http://www.sdss.org.
5 http://gavo.mpa-garching.mpg.de/Millennium/.
4 http://cds.u-strasbg.fr/.
6 http://www.multidark.org.
Datenbank Spektrum (2012) 12:173–181
175
Table 1 Selection of digital astronomical archives with observational data [3]
Year
Survey
Data size
1994
Digitized Sky Survey (DSS)
∼73 GB
Digitized photo plates
2001
Catalog of DSS
∼16 GB
Digitized 89 Mio. objects
1997–2001
2 Micron All Sky Survey (2MASS)
∼200 GB
300 Mio. point sources,
1 Mio. extended sources
2006
Sloan Digital Sky Survey (SDSS)
∼6 TB Catalog
70 TB images and spectra
<200 Mio. objects,
1 Mio. spectra
2011 ff
LOw Frequency ARray (LOFAR)
∼2 TB/day
Radio data (images, spectra)
2013 ff
GAIA (satellite)
∼50 GB/day
1 Billion objects
2014 ff
Large Synoptic Survey Telescope (LSST)
∼30 TB/day
∼60 PB, 20 Billion objects
Table 2 Selection of
astrophysical simulations [3]
Year
Simulation
N particles
Data size (raw)
CPUh
1990
Collision of two galaxies
323
∼20 MB
104
2009
Collision of two galaxies
10243
∼100 TB
106
2009
Star dynamo
–
∼50 GB
105
2009
Black Hole Collision
–
∼5 TB
106
Large Scale Structures
20483
∼1 PB
2 × 107 .. >1 × 108
2010 ff
to the individual analysis workflows as opposed to ‘number
crunching’ in the generation of the first stage data.7
Piloted by the Millennium Simulation database and later
followed by the MultiDark database, the organization of
simulation data in databases focused on providing second
stage products such as halo catalogs, merger tree analysis
and other information extracted from the raw data through
RDBMS. However, it turned out that for various scientific
questions the full 6D information of the phase space is required. For instance, full phase space information is required, when halos and especially subhalos need to be redetermined with different algorithms than the ones that produced the provided halo catalogues. Additionally, the gravitational potential of a simulated object can only be determined from the full phase space information, which is an
important quantity in the study of gravitational lenses.
A snapshot of a modern cosmological N-body simulation has around 20483 ≈ 9.6 × 109 particles (see Table 2)
and consists of multiple (≈100) snapshots. Available SQL
database implementations show performance issues for this
kind of data. One of the problems is connected to the spatial
nature of the queries. Additionally, huge chunks of the data
have to be returned and processed for scientifically useful
queries.
Until today, only simulation data sets were large enough
to pose problems for available RDBMS. However, the challenges faced in the simulation community start to appear in
7A
Content
Millennium like simulation requires for the production of raw data
(aka. snapshots) several Million CPU hours (see Table 2).
..
40963
observational data management as well. Only recently, the
observational community started to explore various roads to
deal with the upcoming large observational datasets. For instance, GAIA, a satellite mission to be launched in 2013,
will produce large data sets that are highly complex and require data mining facilities with efficient access to the whole
data. One concept currently explored by the GAIA consortium is storing the data (>1 PB) in a Hadoop file system, so
that MapReduce algorithms can be applied. With this approach a constant performance for most types of queries
can be achieved. However, since lots of known information about the data structure is neglected in this approach,
RDMBS using this information perform much better for certain types of queries, whereas for other cases the Hadoop
approach provides better performance. Thus there will be
different types of GAIA archives and tools for their exploitation.
2 Challenges for Current Relational Database Systems
Overcoming the problems posed by large datasets is a task
where various approaches such as Hadoop, column based
databases (such as SciDB), NoSQL, or sharded relational
databases are currently considered in science data centers [14]. One very important consideration besides performance when choosing a big data management system is usability. From the perspective of scientists it is important that
a system provides a standardized and easy to use interface
to the data, which is flexible enough for the implementation
of complex data mining and analysis algorithms.
176
Data Access Since observational astronomers started to
slowly adopt SQL based query interfaces thanks to the success of the SDSS survey, it is important to continue and
promote this development. One key aspect of this development is, that by using SQL it is possible to standardize
various science questions and thus use similar SQL queries
on different datasets. It is also easier to publish in a paper
the SQL query used in an analysis than a documentation of
the specialized data format used in storing the data and the
associated analysis tools. This greatly improves on the aspect of reproducibility of the scientific results. The IVOA
even develops and maintains an extension to SQL named
ADQL (Astronomical Data Query Language [7]) enabling
additional query possibilities such as geometrical functions
on spheres.
Having a standardized data query language is therefore
very important. Even though new NoSQL developments
such as SciDB or Hadoop may provide sometimes better
suited data management and processing facilities for scientific data, the main problem about NoSQL is, that each solutions provides its own query language. Any use of these new
and thus not established access methods would only alienate
the user base and thus render a data service less valuable.
Therefore, solutions that enable the use of SQL are to be
preferred and relational databases are a natural choice.
Performance A major challenge for any data store solution is performance. One crucial point for any solution is
to provide competitive data analysis and retrieval performance to the scientists’ FORTRAN or C tools. Any data
service that is significantly slower than these bespoke tools
will not be accepted. Unfortunately, this cannot be easily
achieved with available RDBM systems. They are usually
designed with completely different use cases and requirements in mind than scientific data analysis requires. Until
recently, RDBMS running on single instances were sufficient for the amount of data that scientists needed to handle. The large increase in the amount of data produced by
simulations (Table 2) or by large scale surveys with new instruments (Table 1) start to conflict with the limitations of
these systems. A single instance is no longer sufficient and
parallel solutions are needed. Only with the various developments in business intelligence and data warehousing solutions, distributed commercial (and open source) tools start
to cope with the requirements of these scientific data sets. It
is therefore crucial on the one hand, that new optimized indexing methods and algorithms are developed for the scientific data. On the other hand, large sharded RDBM systems
need to be built, which allow for efficient aggregation and
data mining algorithms and fast retrieval of large amounts
of data (multiple million or billions of rows at a time) at
affordable expenses.
The huge datasets require considerable preparatory work
(curation, ingestion, management) when using a database
Datenbank Spektrum (2012) 12:173–181
system. Many RDBMS are not designed to handle large binary datasets and customized ingestion tools need to be written. Even worse, many RDBMS systems provide only bulk
ingest facilities through ASCII files. A solid and performant
API is therefore desirable and should provide fast bulk ingestion facilities. In our experience, available ODBC data
connectors do not generally meet all these requirements (for
instance, no fast open source implementation of TDS, the
MS SQL Server protocol, is available) and even APIs such
as the MySQL C API provide only limited performance.
The MySQL C API e.g. implements prepared statements
and binding of variables with too many string parsing operations.
3 Rich Structured Queries on Replicated Cloud Data
Storage
We therefore developed an open source solution for a parallel sharded RDBMS by customizing MySQL setups. We
are building a package for scientific data centers that allows
easy maintenance of a MySQL cluster with head and data
nodes, and features a facility for exploiting the sharded architecture through parallel queries. We chose MySQL over
other open source RDBMS such as Postgres mainly because
MySQL allows custom implementation of storage engines.
We plan to directly access the raw simulation result files
through custom storage engines in the future. Since we only
require read access to the data, this approach will greatly
facilitate the ingestion process of the raw simulation files.
Postgres on the other hand would provide the Generalized
Search Tree (GiST) which would be a natural choice for implementing an optimized algorithm for spatial queries (see
below Sect. 3.3).
Our development of a MySQL cluster hosting scientific
data focuses on the needs of computational cosmologists.
The main goal is to provide fast access to all the raw simulation data such as the full particle data (≥20483 ≈ 1010
rows per time step, see Table 2) and its derived products
such as halo catalogues, merger trees and mock galaxy catalogues [17]. The system should be able to handle complex
spatial queries such as object queries in arbitrary geometries
and nearest neighbor searches. At the moment, no emphasis is laid on optimizing cross products between large distributed tables, but this will be a priority for future development.
The main tasks in the setup of our MySQL cluster involve
two distinct stages. The first stage is the setup of the MySQL
cluster itself with the development of administrative tools,
replication and backup solutions, data partitioning strategies
and a parallel SQL query facility. The second stage involves
the development and optimization of customized data access
strategies (such as indexing).
Datenbank Spektrum (2012) 12:173–181
177
3.1 Parallel Query Facility
The actual sharding of the data across the individual MySQL
nodes is provided by the Spider storage engine developed
by Kentoku Shiba.8 The Spider engine runs on the head
node and shards data tables using partition functions to the
MySQL nodes. It is similar in functionality to the federated
engine and provides transparent links to the MySQL nodes
from the head to the sharding nodes. Unfortunately, the Spider engine was developed with large web applications in
mind, where the distribution of the server load was one of
the main development goals. It does not provide the possibility to transparently run SQL queries in parallel yet. We
therefore developed a SQL query parser and job manager
facility, that is able to schedule and run queries in parallel
on the sharding nodes and collect the results on the head
node.
For the parallel query facility to provide correct results, queries need to be reformulated. Basically, we take a
MapReduce like approach. Implicit and explicit joins have
to be reformulated in such a way, that intermediate results
are collected in parallel from the sharding nodes to the head
node. The aggregated intermediate result on the head node is
then made available to all the shards and is used in the next
step to execute the join on the shards. The query is transformed in such a way that the joins are rendered into nested
subqueries. Each subquery can be executed in parallel starting from the deepest nested subquery. The intermediate results of each sharding node are collected on the head node
and are subsequently made available to the sharding nodes
for further processing of the next outer subquery (see Fig. 2).
Fig. 2 Flow chart illustrating the execution of joins and subqueries in
parallel on the shards
3.2 Parallel SQL Queries
3.3 Spatial Queries
In order to keep the amount of data transferred to the head
node as small as possible, the nested subqueries need to be
ordered according to their selectivity. Currently our system
does not make use of the information that is available to the
MySQL query optimizer. We make the rather crude assumption that the subquery with the most constraints is also the
one with the highest selectivity. A schematic example of the
query reformulation algorithm’s output is given in Fig. 3.
Additionally, any queries with aggregation functions
such as summation, averaging, or the calculation of standard deviations need to be reformulated to account for the
distribution of the data. For instance, to average over distributed data, the sum and count of a given column needs to
be determined and its final value calculated accordingly on
the head node.
Secondly, spatial queries need to be further optimized.
A common spatial query on the raw particle information is
the cutout of boxy or spherical regions around dark matter
halos. This data is then used to study the environment a halo
resides in. Another example of spatial queries in cosmological simulations are nearest neighbor searches. To study
the formation of our own Milky Way for instance, all halos
within a given mass range, that have a massive companion
within a given radius, and lack the presence of other massive
objects in a larger radii are needed. An even more ambitious
science case for spatial queries is the retrieval of clusters in
the 6D phase space. Such clusters correspond to dark matter halos. Since such spatial queries are an essential tool in
the analysis of cosmological simulations, it is important to
facilitate them in the database.
Currently, the MultiDark database uses the B-tree index
provided by MS SQL Server together with Peano-Hilbert
keys calculated from the spatial information to acceler-
8 http://spiderformysql.com/.
178
Datenbank Spektrum (2012) 12:173–181
SELECT
a .∗ , b .∗ , c .∗
FROM
a, b, c
WHERE
b = 2 AND
b . i d = c . b _ i d AND
a . id = b . a_id ;
transforms to
SELECT
a . ∗ , tmp . ∗
FROM
a,
( SELECT b . ∗ , c . ∗
FROM
c,
( SELECT b . ∗
FROM
b
WHERE b = 2 ) AS b
WHERE b . i d = c . b _ i d )
WHERE
a . i d = tmp . b . a _ i d
Fig. 3 Example of a schematic SQL query that uses implicit joins and
its transformed equivalent using nested queries. The nested queries are
ordered by selectivity, with the inner most query being the most selective one
ate spatial queries [11]. This approach works reasonably
well for normal boxy or spherical cutouts. However nearest neighbor searches cannot be efficiently addressed with
these index types.
To further improve the query performance we investigate spatial queries through R-trees. Initial tests using a
Hilbert Packed R-tree [20] on the simulation particle data
show a promising increase in the retrieval performance of
spatial range queries. The performance of nearest neighbor
searches with the Packed R-tree still needs to be assessed
and a proper implementation as a MySQL server plugin has
to be developed.
3.4 Distributed File System Snapshots for Database
Backup
Since the database tables storing the simulation data are
written once and read many times, the XtreemFS cloud file
system (see below) can be used for the backup and replication of the tables. The overhead of this method is smaller
than using the MySQL server build in replication.
With this setup we increase the availability of data on the
shard nodes. If the data on the shard nodes should become
unavailable, MySQL can still access the data table through
XtreemFS on a remote data server. In this case, the query
performance is slightly degraded but the data remains accessible. This also ensures that any temporary in memory
table on the shard nodes remain accessible to MySQL and
operation can continue without interruption.
This setup additionally integrates nicely with the storage
setup for the raw data and all other data that is not stored
in the database. Details of the configuration are given in the
next Section.
4 Cloud Data Storage Layer with XtreemFS
XtreemFS9 [5, 18] is a distributed file system developed at
ZIB. It was designed to fill the gap between parallel file
systems such as Lustre, PVFS, GPFS or Panasas on the
one hand, and the specialized data storage software used
in datacenters like Amazon S3 or Google Cloud Storage
on the other hand. XtreemFS allows to run legacy applications in the Cloud since it supports the complete POSIX semantics and APIs. XtreemFS stores its data and metadata
on servers in local or wide area networks. For fast file access, XtreemFS provides striping of large files, read-only
and read-write replication, and an efficient mechanism for
snapshots.
4.1 Cloud Storage for Astrophysics
Most of the scientific data in astrophysics is stored in various formats and only about one quarter of it can be handled by databases. Various kinds of binary data like imaging
data from CCD readouts or files with spectral information
form the bulk of data that must be managed. These require
large file systems, where most of the data is written once
and never modified. All data must be kept online for further
processing. The data processing requires efficient access to
both, RDBMS and bulk data. File backup solutions should
take advantage of this situation of only partially changing
data. On top of this, mechanisms for the automatic data distribution across several sites are needed to support collaborative research of working groups at different geographical
sites.
XtreemFS comes with a set of properties, which address
theses desired features. Bulk data can be stored directly in
XtreemFS. The read-only replication of XtreemFS provides
redundancy for fault-tolerance by storing all data on several geographically distributed sites. Replicating database
files would require the read-write replication of XtreemFS
when these are frequently modified. However, active/active
replication for the database can be solved more efficiently
by the MySQL cluster itself. Nevertheless, XtreemFS is
beneficial for reliably storing database tables that are written once and accessed many times. Furthermore, database
dumps generated by mysqldump can be easily managed
with XtreemFS. Instead of collecting numerous complete
dump files, one could repeatedly overwrite the same file
9 http://www.xtreemfs.org.
Datenbank Spektrum (2012) 12:173–181
and create an XtreemFS snapshot for the written file shortly
thereafter to provide access to older versions of the dump.
As XtreemFS’ snapshots use copy-on-write, only the modified blocks of the dump add to the actual disk usage footprint.
4.2 Object Based Architecture
Instead of storing the metadata and file content together
on the same disk as typically done by local file systems,
XtreemFS is implemented as an object based file system
where the metadata is kept separate from the file content—
the objects. This leads to a better sequential I/O performance, because reading or writing large amounts of consecutive data blocks is not interrupted by metadata disk operations which would cause time-consuming head movements
on the same disk.
From an architectural point of view, the XtreemFS software comprises four different modules which are briefly described below: (1) the metadata and replica catalog, (2) the
object storage devices, (3) the directory service, and (4) the
client.
Metadata and Replica Catalog (MRC) The MRC contains
all metadata information that is necessary for providing file
system services. It maintains, for example, the folder hierarchy, access permissions of directories and files, file sizes,
available replicas, and other metadata. The MRC is implemented with BabuDB [19], a fast append-only database developed at ZIB, which uses log-structured merge trees [15].
This data structure has the advantage that it mainly performs
sequential disk access for good I/O performance. For a typical size of the MRC database see Table 3.
Object Storage Device (OSD) The OSDs hold the contents
of the files, that is, the ‘objects’. Objects are allowed to be
striped over several OSDs to improve the bandwidth of the
data access. Additionally, the data availability can be improved by replicating the objects over several OSDs.
Directory Service (DIR) The directory service is the service registry of XtreemFS, where MRCs and OSDs of an
XtreemFS deployment are registered.
Client The XtreemFS client builds the link between MRCs
and OSDs. When opening a file, the client asks the MRC
for the corresponding metadata and access permission. The
MRC replies with an object-id, a capability (that is a digitally signed access permission checked by the OSDs), a list
of OSDs serving the file, the replication scheme and the
striping scheme of that file. For read and write operations
(see Fig. 4), the client then directly accesses the OSDs with
the obtained capability.
179
Table 3 Typical numbers and sizes in a practical XtreemFS installation (file system used for simulation intermediate data on ZIB’s
D-Grid-Cluster, June 2012)
Number of OSDs
16 (files are striped over 2 OSDs)
OSD’s total storage capacity
85.92 TB (5.37 TB * 16)
OSD’s storage used
16.48 TB
Number of files
1,302,691
Number of directories
172
Number of objects
17,2 millions
MRC-DB’s disk usage
267 MB
MRC-DB’s RAM usage
Max. identical to MRC-DB disk usage
Note that we deliberately decided to omit any direct communication between MRC and OSDs. This was done to
avoid any potential performance bottleneck of clients consulting the MRC(s) before performing any data block movement. Instead, the client builds this missing link by asking
the MRC once and then acting autonomously on the OSDs
with the obtained file capabilities.
4.3 Replication Scheme
XtreemFS provides two replication schemes, a read-only
replication with asynchronous replication for the efficient
distribution of write-once data and a read-write replication
for mutable files.
Read-only replication is done asynchronously and therefore does not affect the write performance of a client. When
a file is marked in the MRC as ‘read-only replicated’, the
OSDs automatically create the necessary number of replicas when a new file is written. In addition to full replicas,
XtreemFS also supports partial replicas where the objects
are replicated only on demand, that is, at a remote read operation. This scheme is well-suited for handling large-scale
experimental raw data with many objects. When a client attempts to access a missing object, the OSD automatically
pre-fetches subsequent objects to reduce the access latency.
Read-write replication provides linearizable data consistency as required by the POSIX standard [6]. This means
that after a successful write operation a subsequent read always returns the written data, irrespective of which replica
was accessed. To ensure the total order of reads-after-writes
and writes-after-writes, a primary/backup scheme is used at
file granularity. File operations are only performed on OSDs
that hold the primary role for that file. The primary thereby
acts as a sequencer and it propagates file modifications to
the other OSDs holding replicas. For fault-tolerance, at least
a quorum of the replicas has to acknowledge file modifications before they are declared stable.
Fail-over of a primary is done with leases. A lease grants
an OSD the role of a primary for a defined time period.
For long-running operations, leases can be renewed. When
180
Datenbank Spektrum (2012) 12:173–181
a variant of PAXOS consensus [10, 16]. Flease does not require disk operations for stable storage, because leases become invalid after their expiration. Thereby interventions of
the OSD’s normal disk operations are avoided, which otherwise would considerably degrade the I/O performance.
Replication of MRC and DIR Besides the replication
schemes for file content established between a set of OSDs
per file, also the MRC and DIR services may be replicated
for improved fault-tolerance. They use the same masterslave approach based on leases and fault-tolerant lease management as the read-write replication of the file content. This
allows a minority of servers to fail without rendering the system unavailable for much longer than a lease timeout and to
support a consistent recovery from persistent storage if a
majority or more of the servers crash and recover.
4.4 Snapshots
Fig. 4 File access with XtreemFS
a lease is expired, another OSD can become the new primary. Hence, when the primary (i.e. the current lease holder)
crashes, the file becomes unavailable only for the lease timeout period.
File Access Figure 4 illustrates the file access protocol in
XtreemFS. When a client tries to access a file, it sends a request to the MRC which returns a file capability and a list of
OSDs holding a replica of the file. The MRC sorts the list of
OSDs according to the configured replica selection strategy
taking the requesting client into account, for example, based
on vicinity or access latency.
The client then selects a suitable OSD and sends the request to that OSD. If the OSD is not yet the primary, it tries
to acquire the lease for the file and thus becomes primary. If
another OSD holds the lease, it tells the client to redirect its
request to the primary.
When an OSD has acquired the lease, it must ensure that
its local objects of the file are up to date. This is necessary,
because the OSD may have missed previous changes which
have been written on a majority of the other OSDs. Hence
it asks a majority to send their contents to update the local
version.
When the local replica is up-to-date, the OSD can answer
read and write requests to the client directly. Write requests
are, however, only acknowledged to the client after the OSD
has sent the data to all other OSDs and a majority has answered. Once a majority has written the data and acknowledged the update, the primary informs the client about the
successful operation.
Handling Leases The lease coordination is done between
the OSDs themselves with the Flease algorithm [8], a scalable and fault-tolerant distributed algorithm that is based on
A snapshot represents a consistent state of all data objects
of the file system at a given time. Snapshots are written to
stable storage (disks) to allow users to restore a backup of
the file system in case of data loss or any kind of malfunctioning.
Providing a consistent snapshot in a distributed system
is a non-trivial task, because (1) there is no global clock,
(2) replicas may be in a transient (unsynchronized) state, and
(3) messages may be delayed for an arbitrary time. Based
on Lamport’s happened before relations [9] and Mattern’s
vector clocks [13] numerous algorithms have been devised
that address the first problem. But in practice the challenge
is to find a snapshot algorithm that collects all necessary
data on-the-fly without pausing the file system—even under harsh conditions with crashing components and weak
network links.
Any client may initiate a snapshot in XtreemFS by sending a request to the MRC. This invokes a protocol that keeps
a consistent state of the current metadata at the MRC and the
corresponding data at the OSDs. As all data is versioned,
care must be taken to write the correct versions to stable
storage.
MRC Snapshot A snapshot in the MRC is written by creating a new active tree of the log-structured merge trees and
to write the old tree, which represents the snapshot, to stable storage. From this point on, all newly incoming updates
are recorded in the log and also in the in-memory tree which
becomes the new active tree. The former tree can be compactified in the background with all previous trees to save
storage space.
OSD Snapshot OSDs provide object versioning and copyon-write. This makes it simple to write a snapshot: at a new
Datenbank Spektrum (2012) 12:173–181
write, a new copy is created with a newer version number
given by the current time stamp. As XtreemFS uses loosely
synchronized clocks with a maximum deviation10 of , a
snapshot at time t then contains all changes up to t − and
no changes that happened after t + . Modifications during
the interval, may or may not be part of the snapshot.
In consequence, a message delay much shorter than could lead to a server requesting a snapshot immediately after a file modification not seeing the change in the snapshot.
Piggybacking the local time on all messages in response to a
local change, helps in solving this issue: A server receiving
time-stamped messages delays its request until its own clock
passes the received time stamp.
5 Conclusion
Requirements of scientific data management in astrophysics
are diverse and depend to a large extent on the concrete application goals. When handling big data, especially the highvolume structured data needed to study large data collections, raise high demands and challenges on the underlying
data infrastructure.
For the efficient access and flexible use of such data
we presented a data infrastructure based on MySQL-cluster
and an extension of the Spider-engine. Combined with the
storage virtualization provided by the Cloud file system
XtreemFS, we get an extendable platform with seamless
data redundancy and improved data availability. Thanks to
the XtreemFS snapshot capability, we can store database
dumps with moderate storage overhead to make scientific workflows reproducible by re-executing them on the
recorded state of the database.
For big unstructured raw data a storage virtualization as
provided by XtreemFS can be used as a collaborative content delivery network inside the community. It provides data
redundancy and improved availability by automatic replication and fast data access by good data placement strategies.
In contrast to typical other solutions, applications do not
need to be modified and can access XtreemFS data in the
Cloud as they access files in a local file system, regardless
of the actual storage location.
Acknowledgements Part of the work on the RDBMS is funded by
“Virtuelles Datenzentum (VDZ)”, BMBF grant 05A09BAB. We thank
K. Riebe and J. Klar (AIP) for critical discussions and suggestions.
The XtreemFS development was partly funded by the EU projects
XtreemOS (2006–2010) and Contrail (2010–2013) and by the German BMBF projects MoSGrid (2009–2012) and VDZ (2010–2012).
We thank the XtreemFS team for the design and implementation of
XtreemFS which, with its unique feature set, became a perfect tool for
research on distributed data management.
10 Note
that is given by the system environment. Typically, is between 10 and 100 ms for WANs and approx. 1 ms in LANs.
181
References
1. Begeman K et al (2011) LOFAR information system. Future generation computer systems, vol 27. Elsevier, Amsterdam, pp 319–
328
2. A brief introduction to FITS. http://fits.gsfc.nasa.gov/fits_
overview.html
3. Enke H, Wambsganss JK (2012) Astronomie und Astrophysik. In:
Langzeitarchivierung von Forschungsdaten – eine Bestandsaufnahme. Verlag Werner Hülsbusch, Göttingen
4. Guidelines for participation, IVOA note 2010 July 7. Chaps. 1
and 3. http://www.ivoa.net/Documents/latest/IVOAParticipation.
html
5. Hupfeld F, Cortes T, Kolbeck B, Stender J, Focht E, Hess M,
Malo J, Martí J, Cesario E (2008) The XtreemFS architecture—
a case for object-based file systems in grids. Concurr Comput
20:2049–2060
6. IEEE Std 1003.1-2008. POSIX.1-2008, The Open Group Base
Specifications Issue 7. The Open Group, 2008
7. IVOA astronomical data query language. http://www.ivoa.net/
Documents/cover/ADQL-20081030.html
8. Kolbeck B, Högqvist M, Stender J, Hupfeld F (2011) Flease—
lease coordination without a lock server. In: 25th IEEE international symposium on parallel and distributed processing (IPDPS
2011), pp 978–988
9. Lamport L (1978) Time, clocks, and the ordering of events in a
distributed system. Commun ACM 21(7):558–565
10. Lamport L (1998) The part-time parliament. ACM Trans Comput
Syst 16(2):133–169
11. Lemson G, Budavari T (2011) Implementing a general spatial indexing library for relational databases of large numerical simulations. In: 23rd international conference on scientific and statistical
database management. Springer, Berlin
12. Lemson G, Virgo Consortium (2006) Halo and galaxy formation histories from the millennium simulation: public release of a VO-oriented and SQL-queryable database for studying the evolution of galaxies in the LambdaCDM cosmogony.
arXiv:astro-ph/0608019
13. Mattern F (1988) Virtual time and global states of distributed
systems. In: Cosnard M (ed) Proc workshop on parallel and distributed algorithms. Elsevier, Amsterdam, pp 215–226
14. O’Mullane W (2011) Blue skies and clouds, archives of the future.
GAIA-TN-PL-ESAC-WOM-057-0
15. O’Neil P, Cheng E, Gawlick D, O’Neil E (1996) The logstructured merge-tree (LSM-tree). Acta Inform 33(4):351–385
16. Prisco RD, Lampson BW, Lynch NA (2000) Revisiting the
PAXOS algorithm. Theor Comput Sci 243(1–2):35–91
17. Riebe K et al (2011) The MultiDark database: release of the
Bolshoi and MultiDark cosmological simulations. arXiv:1109.
0003v2
18. Stender J, Berlin M, Reinefeld A (2012, to appear) XtreemFS—
a file system for the cloud. In: Kyriazis D, Voulodimos A, Gogouvitis S, Varvarigou T (eds) Data intensive storage services for
cloud environments. IGI Global Press
19. Stender J, Kolbeck B, Högqvist M, Hupfeld F (2010) BabuDB:
fast and efficient file system metadata storage. In: 2010 international workshop on storage network architecture and parallel
I/Os (SNAPI ’10), Washington, DC, USA. IEEE Comput Soc, Los
Alamitos, pp 51–58
20. Kamel I, Faloutsos C (1993) On packing R-trees. In: Proceedings
of the second international conference on information and knowledge management (CIKM ’93), pp 490–499
Datenbank Spektrum (2012) 12:223–225
DOI 10.1007/s13222-012-0101-y
COMMUNITY
Der 175. Datenbankstammtisch an der HTW Dresden
Uwe Wloka · Gunter Gräfe
Online publiziert: 27. September 2012
© Springer-Verlag 2012
Der im Jahr 1993 ins Leben gerufene Datenbankstammtisch
führt am 14.11.2012 an der Hochschule für Technik und
Wirtschaft Dresden (HTW Dresden) seine 175. Veranstaltung durch.
In einer Zeit, als sich in den neuen Bundesländern auf
Grund der zunehmenden Kommerzialisierung der Beziehungen zwischen Hochschulen und Wirtschaft eine gewisse
Sprachlosigkeit zwischen beiden Seiten breit machte, fanden sich beherzte Kollegen zusammen, um dieser Sprachlosigkeit entgegen zu wirken. Das waren Herr Bittner, Geschäftsführer der SQL GmbH Dresden (jetzt SQL Projekt AG) und die Universitäts- und Hochschulprofessoren
Meyer-Wegener, Benn und Wloka.
Die zu konzipierenden Veranstaltungen sollten interessierte Fachleute aus der Wirtschaft und dem Hochschulwesen in einer Diskussionsrunde vereinen, in der unkonventionell theoretische und praktische Probleme vorgestellt,
diskutiert und daraus gemeinsame Arbeiten abgeleitet werden sollten. Die Hochschulen wollten dabei ihr beträchtliches Know-how einbringen und im Rahmen von Beleg-,
Praktikums-, Diplom- und Forschungsarbeiten Aufgaben
der Praxis lösen helfen.
Diesen Zielen wurde am besten die Einrichtung eines
Stammtisches gerecht.
Der Stammtisch verfügt dabei über folgende Charakteristika, die denen von Stammtischrunden in einer Gaststätte
ähneln:
U. Wloka () · G. Gräfe
Fakultät Informatik/Mathematik, Hochschule für Technik und
Wirtschaft Dresden (FH), Friedrich-List-Platz 1, 01069 Dresden,
Deutschland
e-mail: [email protected]
G. Gräfe
e-mail: [email protected]
– Es erfolgt keine Anmeldung zu den Stammtischveranstaltungen (schließlich meldet man sich auch nicht an, wenn
man zum Stammtisch in eine Gaststätte geht).
– Es werden keine Gebühren erhoben (das trifft auch auf
die Stammtischbrüder in der Gaststätte zu).
– Es sollen sich weitgehend die gleichen Teilnehmer an den
Diskussionen beteiligen (auch das charakterisiert einen
Stammtisch)
– Die Diskussionen sollen im Gegensatz zu großen Tagungen auch zu Details und bis zum Ende geführt werden
können.
– Im Gegensatz zu großen Tagungen ist der Teilnehmerkreis, zwischen 20 und 40 Teilnehmern schwankend, aber
doch klein gehalten.
– Letztendlich wird der Stammtischcharakter dadurch dokumentiert, dass nach einem ca. einstündigem Vortrag in
der Pause und während der Diskussion Bier und Brötchen
gereicht werden (letzteres wurde dankenswerter Weise
von den beteiligten oder referierenden Firmen finanziert).
Der Datenbankstammtisch wird seit seiner Initiierung einmal im Monat an einem Mittwochnachmittag an der HTW
Dresden durchgeführt (ausgenommen ist die Sommerpause). 15 Jahre wurde der Datenbankstammtisch von Prof.
Wloka mit großem Engagement organisiert und moderiert.
Bis zu seiner Verabschiedung aus dem aktiven Hochschuldienst im Jahr 2008 hat Prof. Wloka den Datenbankstammtisch mit 8 Vorträgen und zahlreichen Diskussionsbeiträgen
bereichert. Seit 2008 wurde die Organisation und Moderation des Datenbankstammtisches von Prof. Gräfe übernommen.
Inhaltlich werden auf den Stammtischveranstaltungen
Themen zu folgenden großen Problemkreisen behandelt:
– Grundlegende Datenbanktechnologien unabhängig von
konkreten Produkten
224
– Stand und Neuentwicklungen bei Datenbanktechnologien, Datenmodellen und Datenbankbetriebssystemen
– Konkrete Datenbankanwendungen und Anwendungserfahrungen
– Technologien führender Datenbankbetriebssysteme (Vorstellung neuer Feature und Versionen).
Bei den ersten beiden Problemkreisen können sich vor allem die Hochschulen und Universitäten einbringen. Bei dem
dritten Problemkreis kommen die Datenbankanwendungsentwickler und die Endanwender zu Wort. Beim vierten Problemkreis können die Anbieter großer Datenbankbetriebssysteme neue Feature und neue Versionen ihrer Produkte
vorstellen.
Dabei wird großer Wert darauf gelegt, dass die Stammtischveranstaltungen nicht zu Marketingveranstaltungen
missbraucht werden, dass Datenbanktechnologien immer im
Vordergrund stehen.
Weiterhin wird darauf geachtet, dass Themen aus den
Problemkreisen herausgesucht werden, die von einer großen
Anzahl der Teilnehmer als interessant und wichtig empfunden werden. So ist es für die Praktiker meist nicht relevant,
die allerneuesten Forschungen aus dem Gebiet theoretische
Grundlagen der Datenbanken zu erfahren.
Die Pausen zwischen dem Vortrags- und dem Diskussionsteil wurden oft zu als Praktikums-, Abschlussarbeitenoder Jobbörse zwischen interessierten Studenten und Firmenvertretern oder zu bilateralen Absprachen zwischen den
Teilnehmern genutzt.
Wenn man die Liste der Referenten durchsieht, kann man
rückblickend auch die Qualität der 175 Veranstaltungen einschätzen:
1. Führende Datenbankwissenschaftler aus den Universitäten und Hochschulen stellten sich als Referenten
zur Verfügung. Wie bereits Prof. Lehner auf dem 125.
Datenbankstammtisch feststellte, ist „das Who-is-Who
der deutschsprachigen Datenbank-Community“ auf dem
Datenbankstammtisch an der HTW Dresden aufgetreten, wie z. B. Prof. Benn (4 Vorträge), Prof. Dittrich,
Prof. Freitag, Prof. Freytag, Prof. Härder (4 Vorträge),
Prof. Heuer, Prof. Jasper, Prof. Kemper, Prof. Küspert,
Prof. Lehner (3 Vorträge), Prof. Lockemann (3 Vorträge), Prof. Meyer-Wegener (5 Vorträge), Prof. Mitschang
(2 Vorträge), Prof. Rahm (2 Vorträge), Prof. Saake, Prof.
Sattler (2 Vorträge), Prof. Scheck, Prof. Schmidt (Hamburg), Prof. Schmitt (Cottbus – 2 Vorträge), Prof. Scholl,
Prof. Thalheim (2 Vorträge), Prof. Vossen, Prof. Wedekind, Prof. Weikum u. a.
2. Vertreter aller großen Datenbankanbieter, wie z. B. der
Firmen Oracle, IBM/Informix (6 Vorträge), Microsoft (5
Vorträge), Sybase/SAP (4 Vorträge), Intersystems (3 Vorträge), Software AG (3 Vorträge) u. a., stellten auf dem
Datenbankstammtisch Technologien ihrer Produkte vor.
Datenbank Spektrum (2012) 12:223–225
3. Vertreter überregionaler Wirtschaftsunternehmen, wie
beispielsweise die Deutsche Bahn AG, Global Foundries (3 Vorträge), ILOS AG Grevenmacher Luxemburg
(2 Vorträge), Infineon, Linde KCA (2 Vorträge), die Otto
Group, Pit-Cup, Xavo AG Bayreuth (2 Vorträge) u. a., referierten über ihre Datenbankanwendungssysteme bzw.
-projekte.
4. Vertreter lokaler Datenbankanwendungsentwickler und
Datenbankanwender, wie z. B. SQL Projekt AG Dresden (17 Vorträge), Robotron Datenbank-Software GmbH
Dresden (10 Vorträge), CAD-Systemhaus Dr. Joachim
Oelschlegel Dresden (6 Vorträge), T-Systems Multimedia Solution GmbH Dresden (4 Vorträge), GICON
GmbH (2 Vorträge), Newtron AG, SALT Solutions
GmbH oder das Arzneimittelwerk Dresden GmbH referierten über die zur Anwendung benutzten Entwicklungstools, Datenbankbetriebssysteme oder Datenbankanwendungen.
Anlässlich des 175. Datenbankstammtisches durchgeführte
Analysen zeigen, dass es gelungen ist, ein nahezu ausgewogenes Verhältnis von Vertretern der Hochschulen/Universitäten und der Wirtschaft, sowohl bei den Referenten als
auch bei den Teilnehmer zu erreichen (zu ähnlichen Ergebnissen kam schon Herr Bittner in seiner Laudatio auf dem
100. Datenbankstammtisch). Damit wurde über einen längeren Zeitraum kontinuierlich ein fruchtbarer Dialog zwischen
Vertretern der Wissenschaft und der Wirtschaft geführt und
die eingangs beschriebene Sprachlosigkeit überwunden. Mit
Herrn Bittner, Vorstandsvorsitzender der SQL Projekt AG
(früher SQL GmbH) Dresden, führt mit 17 Vorträgen ein
Vertreter der Wirtschaft die Liste der aktivsten Referenten
an.
Ein Blick auf die Liste der Referenten zeigt auch,
dass nur knapp 50 % der Vortragenden aus dem Dresdner
Industrie- und Hochschulraum kommen. Der größere Anteil der Referenten reiste von außerhalb aus dem gesamten Bundesgebiet, dem europäischen Ausland (Schweiz, Luxemburg, Tschechien, Dänemark), aber auch aus den USA
oder der VR China an.
Eine Analyse der Vortragsthemen zeigt, wie sich im
Laufe der vergangenen 19 Jahre Forschungsschwerpunkte und praktische Probleme des Datenbankeinsatzes gewandelt haben. Während in den ersten Jahren des Datenbankstammtisches objektorientierte Modelle und Systeme
sowie Features zur Verwaltung von multimedialen Daten
sehr häufig ein Thema waren, rückten später weitere Entwicklungen wie Datenbanken in GIS-Anwendungen, DataWarehouse-Technologien, Datenbanken und XML, Internet/Web-Technologien für den Datenbankzugriff sowie insbesondere in den letzten Jahren Technologien zur Verwaltung großer Datenbestände (Big Data) in den Mittelpunkt des Interesses. Aus Anwendersicht waren Themen
Datenbank Spektrum (2012) 12:223–225
wie Datenbank-Tuning, Performance-Sicherung oder Hochverfügbarkeit aber zu jeder Zeit ein Schwerpunkt der Präsentationen.
Auf Grund des nach wie vor großen Interesses von
Vertretern der Hochschulen, Universitäten und aus der
Wirtschaft an den Stammtischveranstaltungen, aber auch
auf Grund der Wünsche einer zum Teil festen Teilnehmerschaft, die von Berlin über Magdeburg bis nach Jena
reicht, sollen die Veranstaltungen des Datenbankstammti-
225
sches auch über die Jubiläums-Zahl von 175 fortgeführt
werden.
Dazu laden wir alle Interessenten und potentiellen Referenten ganz herzlich ein.
Weitere Informationen zum Datenbankstammtisch, die
vollständige Referenten- und Themenliste sowie die Liste
der geplanten Veranstaltungen befindet sich unter: http://
www.htw-dresden.de/fakultaet-informatikmathematik/
veranstaltungen/datenbankstammtisch.html.
Datenbank Spektrum (2012) 12:229–231
DOI 10.1007/s13222-012-0106-6
COMMUNITY
News
Online publiziert: 28. September 2012
© Springer-Verlag Berlin Heidelberg 2012
1 Neues Leitungsgremium der Fachgruppe „Mobilität
und Mobile Informationssysteme (FG MMS)“
Die im Frühjahr 2005 gegründete GI-Fachgruppe „Mobilität
und Mobile Informationssysteme (FG MMS)“ hat sich zum
Ziel gesetzt, eine offene Plattform und ein Diskussionsforum für mit mobilen Technologien und Anwendungen zusammenhängende Themen und Fragestellungen zu schaffen.
Sowohl organisatorisch als auch inhaltlich gehört die Fachgruppe zu den beiden Fachbereichen „Datenbanken und Informationssysteme (FB DBIS)“ und „Wirtschaftsinformatik
(FB WI)“. Bereits seit 2006 führt die Fachgruppe jährlich
die Konferenz „Mobile und Ubiquitäre Informationssysteme
(MMS)“ durch. Ihre Schwerpunktsetzung alterniert konsequenter Weise zwischen Wirtschaftsinformatik (gerade Jahre) und Informatik (ungerade Jahre), wodurch die angestrebte Interdisziplinarität sehr erfolgreich erreicht wurde. Des
Weiteren ist die Fachgruppe Mitveranstalter der Konferenz
„Mobile Communications – Technologien, Märkte, Anwendungen (MCTA)“, die bereits seit 2001 jährlich stattfindet
und auch mit der Gründung der Fachgruppe eng verbunden ist. Die MCTA setzt ihren Schwerpunkt dabei im Bereich „science meets industry“, hat sich über die Jahre bei
hochkarätigen Praktikern etabliert und zieht überregionales
Medieninteresse auf sich. Die Fachgruppe beteiligt sich zudem an der Ausrichtung zahlreicher Workshops sowie international führender Veranstaltungen wie der „International Conference on Innovative Internet Community Systems
(I2CS)“ und der „International Conference on Mobile Business (ICMB)“ in Deutschland.
Ende April 2012 wurde das neue Sprechergremium der
FG MMS gewählt. Die Fachgruppenmitglieder waren aufgerufen, aus zehn Kandidatinnen und Kandidaten acht Kolleginnen und Kollegen zu wählen, die die Leitung der Fachgruppe für drei Jahre übernehmen. Gewählt wurden:
– Prof. Dr. Markus Bick, ESCP Europe Wirtschaftshochschule Berlin
– Prof. Dr. Martin Breunig, Karlsruher Institut für Technologie
– Prof. Dr.-Ing. Hagen Höpfner (JP), Bauhaus-Universität
Weimar
– Prof. Dr. Birgitta König-Ries, Friedrich-Schiller-Universität Jena
– PD Dr. Key Pousttchi, Universität Augsburg
– Prof. Dr. Kai Rannenberg, Johann Wolfgang Goethe Universität Frankfurt
– Prof. Dr. Thomas Ritz, Fachhochschule Aachen
– Prof. Dr. Frédéric Thiesse, Julius-Maximilians-Universität Würzburg
In der konstituierenden Sitzung des neuen Leitungsgremiums am 8. Mai 2012 wurden darüber hinaus PD Dr. Key
Pousttchi (FB WI) als Sprecher und Prof. Dr.-Ing. Hagen
Höpfner (JP) (FB DBIS) als Stellvertreter einstimmig gewählt. Damit spiegelt sich die thematische Interdisziplinarität auch in einer organisatorischen Zusammenarbeit zwischen dem Sprecher und dem stellvertretenden Sprecher
wider. Nachdem es für 2013 gelungen ist, die „International Conference on Mobile Business (ICMB)“ erstmals
nach Deutschland zu holen, wurde im Leitungsgremium beschlossen, die MMS im kommenden Jahr co-located zur
ICMB zu veranstalten, Termin ist der 10.–13.06.2013 in
Berlin. Außerdem wird die FG MMS begleitend zur BTW
2013 einen Workshop organisieren, welcher die Erarbeitung einer Forschungslandkarte zu mobilen und ubiquitären Technologien, Märkten, Systemen und Anwendungen im
deutschsprachigen Raum zum Ziel hat. Dadurch soll einerseits das thematische Fundament der FG MMS systematisiert und gefestigt werden, andererseits aber auch ein Angebot an benachbarte Communities gemacht und das interdis-
230
ziplinäre Partnering für Forschungsprojekte unterstützt werden.
Das Leitungsgremium der FG MMS freut sich auf die
Fortsetzung der konstruktiven Zusammenarbeit sowohl innerhalb der Fachgruppe als auch fachgruppenübergreifend.
Nehmen Sie also gerne mit uns Kontakt auf!
2 Gerard-Salton-Award für Norbert Fuhr
Die ACM Special Interest Group on Information Retrieval
(SIGIR) hat Norbert Fuhr in diesem Jahr mit dem SaltonAward geehrt. Er erhält diesen Preis für seine grundlegenden
Beiträge zu wesentlichen Methoden heutiger Suchmaschinen. Fuhr, der Professor an der Universität Duisburg-Essen
ist, entwickelte probabilistische Retrievalmodelle für Datenbanken und XML. Seine Forschung über probabilistische
Modelle hat das aktuelle Interesse an Learning-to-RankAnsätzen vorweggenommen. Fuhr hat den Gerard-SaltonAward anlässlich der diesjährigen ACM SIGIR-Konferenz
in Portland, Oregon, erhalten, wo er die Eröffnungskeynote
gehalten hat.
Neben seinen Beiträgen zur theoretischen Suchmodellen
hat Norbert Fuhr in den 1980er und 1990er Jahren an probabilistischen IR-Verfahren gearbeitet. Seine Arbeiten zeigten
den Nutzen des Schätzens von Modellparametern anhand
von aus Trainingsdaten extrahierten Features.
In seiner aktuellen Forschung befasst sich Norbert Fuhr
mit verschiedenen Aspekten von Information Retrieval, wie
zum Beispiel Text Mining, verteiltem IR, interaktivem IR,
und dem benutzerorientierten Design von IR-Systemen. Er
hat mehr als 200 Veröffentlichungen in den Gebieten IR, Datenbanken, und digitale Bibliotheken. Er war von 1991 bis
2008 Sprecher des Leitungsgremiums der GI-Fachgruppe
Information Retrieval, war mehrfach PC-Chair von internationalen IR-Konferenzen, und ist im Herausgebergremium
von zwei internationalen Fachzeitschriften.
Der Salton-Award wird alle drei Jahre an eine Person verliehen, die bedeutende Beiträge zur IR-Forschung erbracht
hat. Er ist nach Gerard Salton benannt, dem Pionier des
Information Retrieval, der von 1958 bis 1995 in den USA
wirkte und im Jahr 1983 der erste Preisträger war.
Die Fachgruppe Information Retrieval gratuliert Norbert
Fuhr herzlich zu dieser außergewöhnlichen Auszeichnung!
Datenbank Spektrum (2012) 12:229–231
Dach für die Workshops verschiedener Fachgruppen innerhalb der Gesellschaft für Informatik. Ziel des IR-Workshops
war es, ein Forum zur wissenschaftlichen Diskussion und
zum Austausch neuer Ideen zu schaffen. Der Workshop richtete sich daher gezielt auch an Nachwuchswissenschaftler
und Teilnehmer aus der Industrie. Dementsprechend boten
sich interessante Vorträge zu aktuellen Themen aus dem Information Retrieval wie Evaluierung von Retrievalsystemen,
Interaktives Information Retrieval und Anwendungsszenarien. Gemeinsamen Vortragssitzungen mit den Fachgruppen
Knowledge Discovery, Data Mining und Maschinelles Lernen sowie Wissensmanagement waren eine willkommene
Gelegenheit, Arbeiten auch über die Grenzen der eigenen
Fachgruppe hinaus zu präsentieren und zu diskutieren.
Vier Keynotes rundeten die LWA in diesem Jahr ab:
Kristian Kersting (Fraunhofer IAIS und Universität Bonn)
sprach über „High Throughput Phenotyping“, Johannes
Fürnkranz (TU Darmstadt) stellte verschiedene Lernverfahren vor, die auf Preference Learning basierten, Benno Stein
(Universität Weimar) präsentierte drei Anwendungen, die
das Web als großen Korpus verwenden, und Petra Perner
(ibai) berichtete über „Case-Based Reasoning and the Statistical Challenges“.
Auch im nächsten Jahr wird die Fachgruppe IR wieder
mit einem Workshop an der LWA teilnehmen, die im Herbst
2013 in Bamberg stattfinden wird.
4 Produkt-News
4.1 MongoDB: Version 2.2 mit Aggregatfunktionen
Die wichtigste Neuigkeit in MongoDB 2.2 ist ein Aggregationsframework, das die Ausführung von Aggregatfunktionen ohne die Verwendung von MapReduce-Funktionen
ermöglicht. Die Verwendung des Aggregationsframeworks
für Analysen ist nicht nur einfacher, sondern soll laut Aussagen der Entwickler auch signifikant schneller als die Ausführung von MapReduce-Funktionen sein. Damit bestätigt
MongoDB die gegenwärtig generell zu beobachtende Tendenz der Anreicherung von NoSQL-Datenbanksystemen
mit mächtigeren Datenbank-Operatoren als in den ersten
Versionen.
MongoDB, http://www.mongodb.org/.
4.2 Multimodel-DBMS OrientDB 1.0 veröffentlicht
3 Workshop der Fachgruppe Information Retrieval im
Rahmen der LWA 2012
Der traditionelle Herbstworkshop der Fachgruppe Information Retrieval (IR) fand vom 12.–14. September 2012 im
Rahmen der Konferenz „Lernen, Wissen, Adaption“ (LWA)
2012 in Dortmund statt. Traditionell bot die LWA wieder ein
Mit OrientDB ist ein Java-basiertes NoSQL-Datenbanksystem entwickelt worden, welches mehrere der NoSQLModellansätze (hier: Dokumentendatenbank und Graphdatenbank) vereint und deshalb in die relativ neue Kategorie der Multimodel-DBMS eingeordnet wird (http://nosqldatabase.org/). Bezüglich der Schemata sind drei Modi
Datenbank Spektrum (2012) 12:229–231
möglich: schemafrei, vollständige oder teilweise SchemaUnterstützung. Außerdem bietet OrientDB optional eine
SQL-API – auch dies ein in letzter Zeit häufiger zu beobachtender Trend von „NoSQL“-Datenbanksystemen.
OrientDB, http://www.orientdb.org/.
4.3 Software AG: Data Masking for Adabas
Die Software AG hat das Produkt „Data Masking for Adabas“ vorgestellt. Mit diesem Produkt können aktuelle Produktionsdaten in geänderter Form für das Design und Testen von Anwendungen verwendet und damit der Schutz vertraulicher Daten sichergestellt werden. Damit entfällt das
zeitaufwändige und fehleranfällige manuelle Maskieren vertraulicher Daten und Unternehmen wird es ermöglicht, Datenschutzauflagen wie HIPAA, PCI DSS, GLBA oder die
EU-Datenschutzverordnung einzuhalten.
Software AG, http://www.softwareag.com/de.
4.4 Relationales DBMS Google F1 vorgestellt
Google hat ein selbstentwickeltes relationales DBMS mit
dem Namen F1 vorgestellt. Google F1 soll die Skalierbarkeit moderner NoSQL-Ansätze mit dem Funktionsumfang
klassischer SQL-DBMS vereinen. So hat F1 ein festes Sche-
231
ma, eine parallele SQL-Query-Engine, Indexstrukturen und
unterstützt die bekannten ACID-Transaktionskonzepte. F1
setzt auf einem verteilten Storage-System auf, welches sich
auf Standardhardware skalieren lässt und eine transaktionskonsistente Replikation über Rechenzentren hinweg unterstützt. Als erstes Einsatzszenario wurde der MySQL-AdServer des Google Ad-Word-Systems durch F1 abgelöst.
Langfristig soll F1 in weiteren kritischen Google-Systemen
die bestehenden MySQL-Umgebungen ersetzen.
Google, http://research.google.com/pubs/pub38125.html.
4.5 Genealogie relationaler Datenbanksysteme
Am Hasso-Plattner-Institut (HPI) in Potsdam wurde in
der Arbeitsgruppe von Felix Naumann eine Grafik entwickelt, welche die Genealogie von ca. 50 relationalen Datenbanksystemen visualisiert. Dabei werden Einführungszeitpunkt, Versionen, Zusammenschlüsse etc. dargestellt.
Das inzwischen bereits in der dritten Version vorliegende
Poster kann unter http://www.hpi.uni-potsdam.de/naumann/
projekte/rdbms_genealogy.html heruntergeladen werden.
HPI, http://www.hpi.uni-potsdam.de/.
Uta Störl