Data Center Monitoring and Analysis at LRZ Torsten Wilde, Hayk Shoukourian, Detlef Labrenz, Albert Kirnberger, and Herbert Huber 7th European Workshop on HPC Centre Infrastructures, LRZ, Germany The Leibniz Supercomputing Centre Torsten Wilde (LRZ), 7th European Workshop on HPC Centre Infrastructures 2016 2 Scientific Computing at LRZ http://www.lrz.de/services/compute/supermuc/magazinesbooks/2014_SuperMUC-Results-Reports.pdf http://www.gauss-centre.eu/gauss-centre/EN/Projects/projects_node.html 3 Motivation: Improve the Energy Efficiency of LRZ Torsten Wilde (LRZ), 7th European Workshop on HPC Centre Infrastructures 2016 4 LRZ Trend of Energy Costs and Energy Consumption from 2000 till 2015 with prediction for 2016 and 2017 8.000.000 € 7.000.000 € 6.000.000 € HPC System 5.000.000 € LRZ Total 4.000.000 € 3.000.000 € Costs for 1kWh: 2000 – 0.07€ 2015 – 0.161€ 2.000.000 € 1.000.000 € 0€ Energy Cnsumption in MWh 50.000 45.000 40.000 35.000 HPC System 30.000 LRZ Total 25.000 20.000 15.000 10.000 5.000 0 Torsten Wilde (LRZ), 7th European Workshop on HPC Centre Infrastructures 2016 5 Where does Our Energy go? • Need to understand each pillar External Influences/Constraints Data Center (Goal: Reduce Total Cost of Ownership) Seasonal support (well, vapor) Neighboring Buildings Cooling towers (dry, wet, hybrid) Water (W1 – W5) Air (18°C – 27°C) Others Network Equipment Compilers Storage (Tape, Drives) Virtualized Systems Cluster Systems HPC Systems Software Stack Programing Models Schedulers Others (DSP, FPGA) Many Core GPU (Nvidia, AMD) CPU/APU (x86, ARM, Power) Utility Providers Reduce Hardware Power Consumption • Need global approach for optimal results • includes utility provider • define operating points Global Optimization Strategy Improve PUE (Power Usage Effectivness) • Optimize and measure (KPIs) for each Optimize Resource Usage Tune System Optimize Performance Pillar 1 Pillar 2 Pillar 3 Pillar 4 Building Infrastructure IT System Hardware System Software Applications • keep infrastructure efficiency constant over the whole operating range • measure and assess Open Access 4 Pillar Framework Paper: http://www.springerlink.com/openurl.asp?genre=article&id=doi:10.1007/s00450-013-0244-6 Torsten Wilde (LRZ), 7th European Workshop on HPC Centre Infrastructures 2016 6 Data Collection Torsten Wilde (LRZ), 7th European Workshop on HPC Centre Infrastructures 2016 7 Current Data Center Monitoring at LRZ Torsten Wilde (LRZ), 7th European Workshop on HPC Centre Infrastructures 2016 8 Data Consolidation Hayk Shoukourian, Torsten Wilde, Axel Auweter, Arndt Bode: “Monitoring Power Data: A first step towards a unified energy efficiency evaluation toolset for HPC data centers” published in Environmental Modelling & Software (Thematic issue on Modelling and evaluating the sustainability of smart solutions), Volume 56, June 2014, Pages 13–26; DOI: http://dx.doi.org/10.1016/j.envsoft.2013.11.011 Torsten Wilde (LRZ), 7th European Workshop on HPC Centre Infrastructures 2016 9 Monitored Systems Data Center (Goal: Reduce Total Cost of Ownership) Clustware iCinga CooLMUC 2 CooLMUC WinCC Johnson Controls SuperMUC Phase1 & 2 IPMI Slurm LoadLeveler Climate / Weather Neighboring Buildings External Influences/Constraints Utility Providers PowerDAM V2.X Pillar 1 Pillar 2 Pillar 3 Building Infrastructure IT System Hardware System Software Torsten Wilde (LRZ), 7th European Workshop on HPC Centre Infrastructures 2016 Pillar 4 Applications 10 Data Design Root Resource (System) Systems Rack PowerDAM internal API: Circuit Sensors ๐น๐๐๐๐น๐๐๐๐๐๐๐(. ๐น๐๐๐๐๐๐๐) ∗ _๐บ๐๐๐๐๐๐ป๐๐๐ = ๐ฝ๐๐๐๐; ๐ป๐๐๐๐๐๐๐๐ PowerDAM sensor name: Node 1 Sensors Node 1 Device Sensors Sensors ๐น๐๐๐๐น๐๐๐๐๐๐๐(. ๐น๐๐๐๐๐๐๐) ∗ _๐บ๐๐๐๐๐๐ป๐๐๐ Data Storage (currently SQL Table) Torsten Wilde (LRZ), 7th European Workshop on HPC Centre Infrastructures 2016 11 Collect only specific Sensors 1. 2. 3. 4. Agent Registration with PowerDAM ๏ง Generate XML file of all available sensors File with sensors and/or sensor roots to activate Activate sensors Collection uses only activated sensors in XML file ๏ง XML file can be updated, new sensor will be added, removed sensors will be marked as inactive ๏ง Can add additional properties (for example, valid range) Torsten Wilde (LRZ), 7th European Workshop on HPC Centre Infrastructures 2016 12 Fun With Sensor Names PowerDAM API: ๐น๐๐๐๐น๐๐๐๐๐๐๐(. ๐น๐๐๐๐๐๐๐) ∗ _๐บ๐๐๐๐๐๐ป๐๐๐ = ๐ฝ๐๐๐๐; ๐ป๐๐๐๐๐๐๐๐ ๏ฎ • • ๏ฎ • • jci.IUGSS41.EZ04_MW__PE_power IUGSS_41EZ__PEMW04 @JCSQL:NAE054-01:NAE054-01/N2 Trunk1. NAE054MIG136.IUGSS_41EZ__PEMW04.Present Value WinCC: Names relate to individual circuit names without any regard to possible structure Use of German Umlauts Torsten Wilde (LRZ), 7th European Workshop on HPC Centre Infrastructures 2016 13 Current Status Torsten Wilde (LRZ), 7th European Workshop on HPC Centre Infrastructures 2016 14 Data Collection using PowerDAM ๏ง Collection via: Pulling, Publish – Subscribe (test phase) ๏ง Import options: SQL, CSV, SSH-script ๏ง Collection interval 1min, Backfilling (for specific sensors, timeframe) ๏ง Export options: Report plug-ins; CSV file export; XLSX file export Torsten Wilde (LRZ), 7th European Workshop on HPC Centre Infrastructures 2016 15 LRZ PUE and SuperMUC System PUE Torsten Wilde (LRZ), 7th European Workshop on HPC Centre Infrastructures 2016 16 Lightweight Adaptive Consumption Prediction (LACP) Application identifier Number of resources CPU frequency of the resources (1) (3) (3) Monitoring tool (3) (2) Available application relevant history data [1] Hayk Shoukourian, Torsten Wilde, Axel Auweter, and Arndt Bode. Predicting the Energy and Power Consumption of Strong and Weak Scaling HPC Applications. Journal of Supercomputing Frontiers And Innovations. ISSN: 2313-8734 (online) DOI: 10.14529/jsfi URL: http://superfri.org/superfri/article/view/9/8 [2] Hayk Shoukourian, Torsten Wilde, Axel Auweter, Arndt Bode, and Daniele Tafani. Predicting Energy Consumption Relevant Indicators of Strong Scaling HPC Applications for Different Compute Resource Configurations. Proceedings of the 23rd High Performance Computing Symposium, Society for Modeling and Simulation International (SCS), ACM, 2015 ๐ท๐๐๐ ๐๐๐๐๐ (4) [3] Hayk Shoukourian. Adviser for Energy Consumption Management: Green Energy Conservation Ph.D. dissertation, München, Technische Universität München (TUM), 2015 Predicted TtS/APC/EtS of the application for a given number of resources and given CPU frequency Torsten Wilde (LRZ), 7th European Workshop on HPC Centre Infrastructures 2016 17 Power Prediction for running Epoch on 311 Compute Nodes | ๐ด๐๐ถ๐ฝ |๐๐๐ = ๐ด๐๐ถ(๐ฝ)๐ − σ๐ข ๐ข๐ก๐๐๐๐ง๐๐ ๐๐๐๐ ๐๐ ๐ฝ(๐๐ข −๐m๐๐ ) | ๐ด๐๐ถ๐ฝ |๐๐๐ฅ = ๐ด๐๐ถ(๐ฝ)๐ + σ๐ข ๐ข๐ก๐๐๐๐ง๐๐ ๐๐๐๐ ๐๐ ๐ฝ(๐๐๐๐ฅ −๐u ) System-vendor provided approximation for 311 nodes is: 118108 W LACP predicted maximum power consumption for 311 nodes is: 55993 W Power Cap Torsten Wilde (LRZ), 7th European Workshop on HPC Centre Infrastructures 2016 18 Monitoring vs. Data Analysis ๏ฎ Data collected by building automation system can be inaccurate for system analysis ๏ฏ ๏ฏ Changes of +-X are ignored (not recorded) since value is used for control decision Sensor readout frequency limited by connectivity. Same value is recorded for inbetween timestamps. ๏ฎ Gaps in data not considered critical Data quality and accuracy is not guaranteed ๏ฎ Currently thinking about data verification and validation options ๏ฎ Torsten Wilde (LRZ), 7th European Workshop on HPC Centre Infrastructures 2016 19 Analysis Example Torsten Wilde (LRZ), 7th European Workshop on HPC Centre Infrastructures 2016 20 Performance of Cooling Infrastructure 2015 COP = Q / P Torsten Wilde (LRZ), 7th European Workshop on HPC Centre Infrastructures 2016 21 Warm Water Cooling Loop ∞ KLT 11 ∞ ∞ KLT 12 ∞ ∞ KLT 13 ∞ M M M T M T T M M N N T T M M ∞ KLT 14 ∞ M T T M M S S N N T M S S M M M M M SuperMUC Cluster HRR NSR Torsten Wilde (LRZ), 7th European Workshop on HPC Centre Infrastructures 2016 22 KLT14 (2/2015) Power&Heat Profile Q(t), P(t) Torsten Wilde (LRZ), 7th European Workshop on HPC Centre Infrastructures 2016 23 KLT14 (2/2015) Power Profile P(Q) Torsten Wilde (LRZ), 7th European Workshop on HPC Centre Infrastructures 2016 24 KLT14 (9/2015) Power&Heat Profile Q(t), P(t) Torsten Wilde (LRZ), 7th European Workshop on HPC Centre Infrastructures 2016 25 KLT14 (9/2015) Power Profile Q(P) Torsten Wilde (LRZ), 7th European Workshop on HPC Centre Infrastructures 2016 26 KLT14 (9/2015) Power Profile Fans//Pumps Torsten Wilde (LRZ), 7th European Workshop on HPC Centre Infrastructures 2016 27 KLT14 (9/2015) Mode of Operations of Pumps Torsten Wilde (LRZ), 7th European Workshop on HPC Centre Infrastructures 2016 28 Acknowledgment Part of the work presented here has been carried out within the SIMOPEK project, which has received funding from the German Federal Ministry for Education and Research under grant number 01IH13007A, at the Leibniz Supercomputing Centre (LRZ) with support of the State of Bavaria, Germany, and the Gauss Centre for Supercomputing (GCS). Torsten Wilde (LRZ), 7th European Workshop on HPC Centre Infrastructures 2016 29 Torsten Wilde [email protected] www.simopek.de Data correlation with different read-out frequencies Longest read-out interval Sensor Type Unit average average average Timestamp Torsten Wilde (LRZ), 7th European Workshop on HPC Centre Infrastructures 2016 31