Data Center Monitoring and Analysis at LRZ

Werbung
Data Center Monitoring and Analysis at LRZ
Torsten Wilde, Hayk Shoukourian, Detlef Labrenz, Albert Kirnberger, and Herbert Huber
7th European Workshop on HPC Centre Infrastructures, LRZ, Germany
The Leibniz Supercomputing Centre
Torsten Wilde (LRZ), 7th European Workshop on HPC Centre Infrastructures 2016
2
Scientific Computing at LRZ
http://www.lrz.de/services/compute/supermuc/magazinesbooks/2014_SuperMUC-Results-Reports.pdf
http://www.gauss-centre.eu/gauss-centre/EN/Projects/projects_node.html
3
Motivation: Improve the Energy Efficiency of LRZ
Torsten Wilde (LRZ), 7th European Workshop on HPC Centre Infrastructures 2016
4
LRZ Trend of Energy Costs and Energy Consumption from 2000 till
2015 with prediction for 2016 and 2017
8.000.000 €
7.000.000 €
6.000.000 €
HPC System
5.000.000 €
LRZ Total
4.000.000 €
3.000.000 €
Costs for 1kWh: 2000 – 0.07€
2015 – 0.161€
2.000.000 €
1.000.000 €
0€
Energy Cnsumption in MWh
50.000
45.000
40.000
35.000
HPC System
30.000
LRZ Total
25.000
20.000
15.000
10.000
5.000
0
Torsten Wilde (LRZ), 7th European Workshop on HPC Centre Infrastructures 2016
5
Where does Our Energy go?
• Need to understand
each pillar
External Influences/Constraints
Data Center (Goal: Reduce Total Cost of Ownership)
Seasonal support
(well, vapor)
Neighboring Buildings
Cooling towers
(dry, wet, hybrid)
Water (W1 – W5)
Air (18°C – 27°C)
Others
Network Equipment
Compilers
Storage (Tape, Drives)
Virtualized Systems
Cluster Systems
HPC Systems
Software Stack
Programing Models
Schedulers
Others
(DSP, FPGA)
Many Core
GPU (Nvidia, AMD)
CPU/APU
(x86, ARM, Power)
Utility
Providers
Reduce Hardware
Power
Consumption
• Need global approach
for optimal results
• includes utility
provider
• define operating
points
Global Optimization Strategy
Improve PUE
(Power Usage
Effectivness)
• Optimize and measure
(KPIs) for each
Optimize Resource
Usage
Tune System
Optimize
Performance
Pillar 1
Pillar 2
Pillar 3
Pillar 4
Building Infrastructure
IT System Hardware
System Software
Applications
• keep infrastructure
efficiency constant
over the whole
operating range
• measure and assess
Open Access 4 Pillar Framework Paper: http://www.springerlink.com/openurl.asp?genre=article&id=doi:10.1007/s00450-013-0244-6
Torsten Wilde (LRZ), 7th European Workshop on HPC Centre Infrastructures 2016
6
Data Collection
Torsten Wilde (LRZ), 7th European Workshop on HPC Centre Infrastructures 2016
7
Current Data Center Monitoring at LRZ
Torsten Wilde (LRZ), 7th European Workshop on HPC Centre Infrastructures 2016
8
Data Consolidation
Hayk Shoukourian, Torsten Wilde, Axel Auweter, Arndt Bode: “Monitoring Power Data: A first step towards a unified energy efficiency evaluation toolset for HPC data centers” published in
Environmental Modelling & Software (Thematic issue on Modelling and evaluating the sustainability of smart solutions), Volume 56, June 2014, Pages 13–26; DOI: http://dx.doi.org/10.1016/j.envsoft.2013.11.011
Torsten Wilde (LRZ), 7th European Workshop on HPC Centre Infrastructures 2016
9
Monitored Systems
Data Center (Goal: Reduce Total Cost of Ownership)
Clustware
iCinga
CooLMUC 2
CooLMUC
WinCC
Johnson Controls
SuperMUC Phase1 & 2
IPMI
Slurm
LoadLeveler
Climate /
Weather
Neighboring Buildings
External Influences/Constraints
Utility
Providers
PowerDAM V2.X
Pillar 1
Pillar 2
Pillar 3
Building Infrastructure
IT System Hardware
System Software
Torsten Wilde (LRZ), 7th European Workshop on HPC Centre Infrastructures 2016
Pillar 4
Applications
10
Data Design
Root
Resource
(System)
Systems
Rack
PowerDAM internal API:
Circuit
Sensors
๐‘น๐’๐’๐’•๐‘น๐’†๐’”๐’๐’–๐’“๐’„๐’†(. ๐‘น๐’†๐’”๐’๐’–๐’“๐’„๐’†) ∗ _๐‘บ๐’†๐’๐’”๐’๐’“๐‘ป๐’š๐’‘๐’†
= ๐‘ฝ๐’‚๐’๐’–๐’†; ๐‘ป๐’Š๐’Ž๐’†๐’”๐’•๐’‚๐’Ž๐’‘
PowerDAM sensor name:
Node 1
Sensors
Node 1
Device
Sensors
Sensors
๐‘น๐’๐’๐’•๐‘น๐’†๐’”๐’๐’–๐’“๐’„๐’†(. ๐‘น๐’†๐’”๐’๐’–๐’“๐’„๐’†) ∗ _๐‘บ๐’†๐’๐’”๐’๐’“๐‘ป๐’š๐’‘๐’†
Data Storage
(currently SQL Table)
Torsten Wilde (LRZ), 7th European Workshop on HPC Centre Infrastructures 2016
11
Collect only specific Sensors
1.
2.
3.
4.
Agent Registration with PowerDAM
๏‚ง Generate XML file of all available
sensors
File with sensors and/or sensor roots to
activate
Activate sensors
Collection uses only activated sensors in
XML file
๏‚ง XML file can be updated, new sensor
will be added, removed sensors will be
marked as inactive
๏‚ง Can add additional properties (for
example, valid range)
Torsten Wilde (LRZ), 7th European Workshop on HPC Centre Infrastructures 2016
12
Fun With Sensor Names
PowerDAM API:
๐‘น๐’๐’๐’•๐‘น๐’†๐’”๐’๐’–๐’“๐’„๐’†(. ๐‘น๐’†๐’”๐’๐’–๐’“๐’„๐’†) ∗ _๐‘บ๐’†๐’๐’”๐’๐’“๐‘ป๐’š๐’‘๐’†
= ๐‘ฝ๐’‚๐’๐’–๐’†; ๐‘ป๐’Š๐’Ž๐’†๐’”๐’•๐’‚๐’Ž๐’‘
๏ฎ
•
•
๏ฎ
•
•
jci.IUGSS41.EZ04_MW__PE_power
IUGSS_41EZ__PEMW04
@JCSQL:NAE054-01:NAE054-01/N2 Trunk1.
NAE054MIG136.IUGSS_41EZ__PEMW04.Present Value
WinCC:
Names relate to individual circuit names without
any regard to possible structure
Use of German Umlauts
Torsten Wilde (LRZ), 7th European Workshop on HPC Centre Infrastructures 2016
13
Current Status
Torsten Wilde (LRZ), 7th European Workshop on HPC Centre Infrastructures 2016
14
Data Collection using PowerDAM
๏‚ง
Collection via: Pulling, Publish – Subscribe (test phase)
๏‚ง
Import options: SQL, CSV, SSH-script
๏‚ง
Collection interval 1min, Backfilling (for specific sensors,
timeframe)
๏‚ง
Export options: Report plug-ins; CSV file export; XLSX file
export
Torsten Wilde (LRZ), 7th European Workshop on HPC Centre Infrastructures 2016
15
LRZ PUE and SuperMUC System PUE
Torsten Wilde (LRZ), 7th European Workshop on HPC Centre Infrastructures 2016
16
Lightweight Adaptive Consumption Prediction (LACP)
Application
identifier
Number of
resources
CPU frequency of the
resources
(1)
(3)
(3)
Monitoring tool
(3)
(2)
Available application
relevant history data
[1] Hayk Shoukourian, Torsten Wilde, Axel Auweter, and Arndt
Bode. Predicting the Energy and Power Consumption of Strong
and Weak Scaling HPC Applications. Journal of Supercomputing
Frontiers And Innovations. ISSN: 2313-8734
(online) DOI: 10.14529/jsfi
URL: http://superfri.org/superfri/article/view/9/8
[2] Hayk Shoukourian, Torsten Wilde, Axel Auweter, Arndt Bode,
and Daniele Tafani. Predicting Energy Consumption Relevant
Indicators of Strong Scaling HPC Applications for Different
Compute Resource Configurations. Proceedings of the 23rd High
Performance Computing Symposium, Society for Modeling and
Simulation International (SCS), ACM, 2015
๐‘ท๐’“๐’†๐’…๐’Š๐’„๐’•๐’๐’“
(4)
[3] Hayk Shoukourian. Adviser for Energy Consumption
Management: Green Energy Conservation Ph.D. dissertation,
München, Technische Universität München (TUM), 2015
Predicted TtS/APC/EtS of the
application for a given number of
resources and given CPU frequency
Torsten Wilde (LRZ), 7th European Workshop on HPC Centre Infrastructures 2016
17
Power Prediction for running Epoch on 311 Compute Nodes
| ๐ด๐‘ƒ๐ถ๐ฝ |๐‘š๐‘–๐‘› = ๐ด๐‘ƒ๐ถ(๐ฝ)๐‘– − σ๐‘ข ๐‘ข๐‘ก๐‘–๐‘™๐‘–๐‘ง๐‘’๐‘‘ ๐‘›๐‘œ๐‘‘๐‘’ ๐‘œ๐‘“ ๐ฝ(๐‘ƒ๐‘ข −๐‘ƒm๐‘–๐‘› )
| ๐ด๐‘ƒ๐ถ๐ฝ |๐‘š๐‘Ž๐‘ฅ = ๐ด๐‘ƒ๐ถ(๐ฝ)๐‘– + σ๐‘ข ๐‘ข๐‘ก๐‘–๐‘™๐‘–๐‘ง๐‘’๐‘‘ ๐‘›๐‘œ๐‘‘๐‘’ ๐‘œ๐‘“ ๐ฝ(๐‘ƒ๐‘š๐‘Ž๐‘ฅ −๐‘ƒu )
System-vendor provided approximation
for 311 nodes is: 118108 W
LACP predicted maximum
power consumption for 311
nodes is: 55993 W
Power Cap
Torsten Wilde (LRZ), 7th European Workshop on HPC Centre Infrastructures 2016
18
Monitoring vs. Data Analysis
๏ฎ
Data collected by building automation system can be inaccurate for system
analysis
๏ฏ
๏ฏ
Changes of +-X are ignored (not recorded) since value is used for control
decision
Sensor readout frequency limited by connectivity. Same value is recorded for inbetween timestamps.
๏ฎ
Gaps in data not considered critical
Data quality and accuracy is not guaranteed
๏ฎ
Currently thinking about data verification and validation options
๏ฎ
Torsten Wilde (LRZ), 7th European Workshop on HPC Centre Infrastructures 2016
19
Analysis Example
Torsten Wilde (LRZ), 7th European Workshop on HPC Centre Infrastructures 2016
20
Performance of Cooling Infrastructure 2015
COP = Q / P
Torsten Wilde (LRZ), 7th European Workshop on HPC Centre Infrastructures 2016
21
Warm Water Cooling Loop
∞ KLT 11 ∞
∞ KLT 12 ∞
∞ KLT 13 ∞
M
M
M
T
M
T
T
M
M
N
N
T
T
M
M
∞ KLT 14 ∞
M
T
T
M
M
S
S
N
N
T
M
S
S
M
M
M
M
M
SuperMUC
Cluster
HRR
NSR
Torsten Wilde (LRZ), 7th European Workshop on HPC Centre Infrastructures 2016
22
KLT14 (2/2015) Power&Heat Profile Q(t), P(t)
Torsten Wilde (LRZ), 7th European Workshop on HPC Centre Infrastructures 2016
23
KLT14 (2/2015) Power Profile P(Q)
Torsten Wilde (LRZ), 7th European Workshop on HPC Centre Infrastructures 2016
24
KLT14 (9/2015) Power&Heat Profile Q(t), P(t)
Torsten Wilde (LRZ), 7th European Workshop on HPC Centre Infrastructures 2016
25
KLT14 (9/2015) Power Profile Q(P)
Torsten Wilde (LRZ), 7th European Workshop on HPC Centre Infrastructures 2016
26
KLT14 (9/2015) Power Profile Fans//Pumps
Torsten Wilde (LRZ), 7th European Workshop on HPC Centre Infrastructures 2016
27
KLT14 (9/2015) Mode of Operations of Pumps
Torsten Wilde (LRZ), 7th European Workshop on HPC Centre Infrastructures 2016
28
Acknowledgment
Part of the work presented here has been carried out within the SIMOPEK
project, which has received funding from the German Federal Ministry for
Education and Research under grant number 01IH13007A, at the Leibniz
Supercomputing Centre (LRZ) with support of the State of Bavaria, Germany,
and the Gauss Centre for Supercomputing (GCS).
Torsten Wilde (LRZ), 7th European Workshop on HPC Centre Infrastructures 2016
29
Torsten Wilde
[email protected]
www.simopek.de
Data correlation with different read-out frequencies
Longest read-out interval
Sensor Type Unit
average
average
average
Timestamp
Torsten Wilde (LRZ), 7th European Workshop on HPC Centre Infrastructures 2016
31
Herunterladen