Analyze Big Data Live Yourself
Hands-on Workshop on IBM InfoSphere BigInsights
Harald Gröger
Wilfried Hoge
Gerhard Wenzel
IBM
© 2013 IBM Corporation
Agenda
15:00 - 15:10  Introduction to the IBM Big Data platform and BigInsights
15:15 - 15:25  Lab 1: Managing your big data environment
15:25 - 16:05  Lab 2: Analyzing big data with BigSheets
16:05 - 16:10  Demo: BigSheets highlights
16:10 - 16:20  Demo: text analytics highlights
What is Big Data?
• Volume (Data at Scale): terabytes to petabytes of data.
• Variety (Data in Many Forms): structured, unstructured, text, multimedia.
• Velocity (Data in Motion): analysis of streaming data to enable decisions within fractions of a second.
• Veracity (Data Uncertainty): managing the reliability and predictability of inherently imprecise data types.
The IBM Big Data Zone Architecture
[Diagram: zone architecture. Recoverable labels:]
• Data inputs: Data in Motion, Data at Rest, Data in Many Forms
• Real-time Analytics: Streams
• Ingestion and Integration: ETL, Quality, MDM
• Landing, Analytics and Archive: Hadoop, MapReduce
• Warehouse / Marts: BI and Predictive Analytics
• Integrated Exploration: Navigation and Discovery
• Intelligence Analysis
• Decision Management
• Across all zones: Information Governance, Security and Business Continuity
What is Hadoop?
Apache™ Hadoop® is an open source software project that enables the distributed processing of large data sets across clusters of commodity servers.
• MapReduce - The framework that understands and assigns work to the nodes in a cluster.
• HDFS - A file system that spans all the nodes in a Hadoop cluster for data storage. It links together the file systems on many local nodes to make them into one big file system. HDFS assumes nodes will fail, so it achieves reliability by replicating data across multiple nodes.
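The MapReduce idea described above can be illustrated with a minimal, single-process sketch in plain Python. It is not Hadoop code and uses no Hadoop API; the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are hypothetical and chosen only for illustration. On a real cluster the framework would run the map and reduce steps on many nodes and perform the grouping step itself.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(line: str) -> Iterator[Tuple[str, int]]:
    """Map: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Shuffle: group all emitted values by key (done by the framework on a cluster)."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key: str, values: List[int]) -> Tuple[str, int]:
    """Reduce: collapse all values for one key into a single count."""
    return key, sum(values)


def word_count(lines: Iterable[str]) -> Dict[str, int]:
    """Run map, shuffle, and reduce in sequence over the input lines."""
    pairs = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


if __name__ == "__main__":
    sample = ["big data at scale", "data in motion", "data in many forms"]
    print(word_count(sample))
```

Because each `map_phase` call only looks at one line and each `reduce_phase` call only looks at one key, both steps can be spread across nodes without coordination, which is the property the framework exploits.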
• Scalable – add nodes without changing data formats, how data is loaded, how jobs are written, or the applications on top.
• Cost effective – massively parallel computing on commodity servers with a sizeable decrease in storage cost, which makes it affordable to model all your data.
• Flexible – schema-less, can absorb any type of data; data from multiple sources can be joined and aggregated in arbitrary ways, enabling deep analyses.
• Fault tolerant – when a node is lost, work is redirected to another location of the data and processing continues.
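The fault-tolerance point can be made concrete with a small simulation. This is a sketch under simplifying assumptions: the round-robin placement in `place_replicas` is a stand-in and is not the placement policy HDFS actually uses, and the names `place_replicas` and `schedule` are hypothetical. It only shows why replicated blocks let work move to another copy of the data when a node is lost.

```python
from typing import Dict, List, Set


def place_replicas(blocks: List[str], nodes: List[str],
                   replication: int = 3) -> Dict[str, List[str]]:
    """Assign each block to `replication` distinct nodes (simplified round-robin)."""
    if replication > len(nodes):
        raise ValueError("replication factor exceeds number of nodes")
    placement: Dict[str, List[str]] = {}
    for i, block in enumerate(blocks):
        placement[block] = [nodes[(i + r) % len(nodes)] for r in range(replication)]
    return placement


def schedule(placement: Dict[str, List[str]], failed: Set[str]) -> Dict[str, str]:
    """For each block, pick the first replica holder that is still alive."""
    plan: Dict[str, str] = {}
    for block, holders in placement.items():
        alive = [node for node in holders if node not in failed]
        if not alive:
            raise RuntimeError(f"all replicas of {block} are lost")
        plan[block] = alive[0]
    return plan


if __name__ == "__main__":
    nodes = ["n1", "n2", "n3", "n4"]
    placement = place_replicas(["b0", "b1", "b2", "b3"], nodes)
    print(schedule(placement, failed=set()))    # every block read from its first holder
    print(schedule(placement, failed={"n1"}))   # work on n1's blocks moves to a replica
```

With three copies of every block, processing only stops for a block when all three of its holders have failed at once.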
Scope of the IBM BigInsights Hadoop Distribution
[Diagram: editions plotted by "Enterprise class" against "Breadth of capabilities"]
Apache Hadoop
Basic Edition
- Free download
- Jaql
- Integrated install
Quick Start Edition
- New for V2.1
- Free
- Non-production only
Enterprise Edition
- Sold by # of terabytes managed
PureData for Hadoop
- Appliance simplicity
- Brings BigInsights to the market in an appliance form factor
Enterprise ready
- Integrated web console
- Administrative tools, security
- RDBMS, warehouse connectivity
- Enterprise integration
- Performance optimization
- Pre-built applications
Analytics included
- Visualization capabilities
- Spreadsheet-style tool
- Big SQL
- Text analytics
- Eclipse development
- Accelerators
General Information
• Name
• Host name of the VM = bivm
• Login
• User = biadmin
• Password = biadmin
Tutorial - Managing your Big Data environment
• Duration: approx. 10 minutes
• Start the "BigInsights Web Console" via the desktop icon,
• then continue with Chapter 2 / Lesson 1 / Step 3 (page 4).
Tutorial - Analyzing Big Data with BigSheets
• Duration: approx. 40 minutes
• All prerequisites are already met.
• The data has been downloaded and imported.
• Start in the Files tab of the BigInsights Web Console
• with Lesson 1 / Step 3 (page 14),
(hdfs/biginsights/sheets/Watson_data_preloaded)
• End after Lesson 6 / Step 3 (page 21).
Console Demo
BigSheets Demo
From unstructured text to formatted spreadsheets and charts
[Diagram: Blog, News → Spreadsheet Format → Chart]
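The flow from text to spreadsheet to chart can be approximated outside BigSheets with a short Python sketch. Everything here is an assumption made for illustration: the sample posts in `RAW_POSTS`, their `[source] date title` layout, and the function names are hypothetical, and this is not how BigSheets itself is implemented. The sketch only shows the three stages: parse raw lines into rows, lay the rows out in a spreadsheet-style format, and aggregate a column into a series that a chart could plot.

```python
import csv
import io
import re
from collections import Counter
from typing import Dict, List

# Hypothetical raw posts; real blog or news data would be far less regular.
RAW_POSTS = [
    "[blog] 2013-02-01 IBM Watson wins again",
    "[news] 2013-02-03 Watson moves into healthcare",
    "[blog] 2013-02-03 Notes on big data",
]

# Assumed line layout: [source] YYYY-MM-DD title
LINE = re.compile(r"^\[(?P<source>\w+)\]\s+(?P<date>\d{4}-\d{2}-\d{2})\s+(?P<title>.+)$")


def to_rows(posts: List[str]) -> List[Dict[str, str]]:
    """Parse each raw line into a row; lines that do not match are skipped."""
    rows = []
    for post in posts:
        match = LINE.match(post)
        if match:
            rows.append(match.groupdict())
    return rows


def to_csv(rows: List[Dict[str, str]]) -> str:
    """Render the rows as CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=["source", "date", "title"])
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


def chart_series(rows: List[Dict[str, str]]) -> Counter:
    """Count rows per source, giving the values a bar chart would show."""
    return Counter(row["source"] for row in rows)


if __name__ == "__main__":
    rows = to_rows(RAW_POSTS)
    print(to_csv(rows))
    print(chart_series(rows))
```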
Text Analytics Demo
From unstructured text documents to a text analytics result table
[Diagram: unstructured text (text highlight) → Labels / Examples → AQL Regex / Dictionary → AQL Candidates (combination of regex and dictionaries plus distance, case, ...) → AQL Filter (duplicates, irrelevant candidates, ...) → Result Table; arrow labels: generate, create]
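The stages in the diagram (regex and dictionary matches, candidates built from their combination plus a distance rule, then a filter that yields the result table) can be mimicked in plain Python. This is an analog for illustration only and is not AQL; the dictionary contents, the amount pattern, the `MAX_DISTANCE` value, and all function names are assumptions.

```python
import re
from typing import Dict, List, Tuple

COMPANY_DICTIONARY = {"IBM", "Acme"}  # hypothetical dictionary entries
AMOUNT_REGEX = re.compile(r"\$\d+(?:\.\d+)?\s*(?:million|billion)")
MAX_DISTANCE = 40  # assumed limit: characters allowed between company and amount

Span = Tuple[int, int, str]  # (start, end, matched text)


def dictionary_matches(text: str) -> List[Span]:
    """Find case-sensitive, whole-word occurrences of dictionary entries."""
    spans = []
    for name in COMPANY_DICTIONARY:
        for m in re.finditer(r"\b" + re.escape(name) + r"\b", text):
            spans.append((m.start(), m.end(), m.group()))
    return sorted(spans)


def regex_matches(text: str) -> List[Span]:
    """Find money amounts with the regular expression."""
    return [(m.start(), m.end(), m.group()) for m in AMOUNT_REGEX.finditer(text)]


def candidates(text: str) -> List[Dict[str, str]]:
    """Pair each company with every amount that follows it within MAX_DISTANCE."""
    found = []
    for _, company_end, company in dictionary_matches(text):
        for amount_start, _, amount in regex_matches(text):
            if 0 <= amount_start - company_end <= MAX_DISTANCE:
                found.append({"company": company, "amount": amount})
    return found


def filter_candidates(rows: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Drop duplicate candidates while keeping first-seen order."""
    seen = set()
    result = []
    for row in rows:
        key = (row["company"], row["amount"])
        if key not in seen:
            seen.add(key)
            result.append(row)
    return result


def extract(documents: List[str]) -> List[Dict[str, str]]:
    """Build the result table from all documents."""
    rows = [row for doc in documents for row in candidates(doc)]
    return filter_candidates(rows)


if __name__ == "__main__":
    docs = [
        "IBM reported revenue of $24.7 billion this quarter.",
        "IBM reported revenue of $24.7 billion this quarter.",
        "Acme was founded long ago; decades later a rival raised $5 million.",
    ]
    for row in extract(docs):
        print(row)
```

In this sketch the distance rule already discards the Acme sentence at the candidate stage, because the amount is too far from the company name, and the filter stage removes the repeated IBM row.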
Thank You!