Scaling on AWS for the First 10 Million Users

From 1 to 10 Million:
Scaling with AWS
Matthias Jung, Solutions Architect, AWS
AWS Summit Berlin, May 15, 2014
© 2014 Amazon.com, Inc. and its affiliates. All rights reserved. May not be copied, modified, or distributed in whole or in part without the express consent of Amazon.com, Inc.
So how does scaling actually work?
A lot of results
Not the right starting point
First, a few basics…
AWS Regions
EU-WEST (Ireland)
US-WEST (Oregon)
CHINA (Beijing)
ASIA PAC (Tokyo)
AWS GovCloud (US)
US-EAST (Virginia)
ASIA PAC (Sydney)
US-WEST (N. California)
ASIA PAC (Singapore)
SOUTH AMERICA (Sao Paulo)
Availability Zones (AZs)
Edge locations
[Diagram: overview of AWS services]
Layers shown: AWS global infrastructure; networking; data processing (DV); storage & content delivery; databases; application services; deployment & management
Services shown: Amazon EC2, Amazon VPC, AWS Direct Connect, Amazon Route 53, Amazon S3, Amazon Glacier, Amazon CloudFront, AWS Storage Gateway, Amazon RDS, Amazon DynamoDB, Amazon ElastiCache, Amazon Redshift, Amazon EMR, Amazon Kinesis, AWS Data Pipeline, Amazon SNS, Amazon SES, Amazon SQS, Amazon SWF, Amazon CloudSearch, Amazon Elastic Transcoder, Amazon CloudWatch, AWS CloudTrail, AWS IAM, AWS CloudFormation, AWS Elastic Beanstalk, AWS OpsWorks
1
(you)
Day 1, 1 user
• A single EC2 instance
– running the complete stack: web app, database, management tools, etc.
• Elastic IP address
• Amazon Route 53 for DNS
[Diagram: user, Amazon Route 53, Elastic IP address, EC2 instance]
“We need a bigger box”
• Change the instance size
• Change the instance type
• Configure EBS with Provisioned IOPS (PIOPS)
• Fairly straightforward
• Several thousand users are possible
• Does not work indefinitely
Instance types shown: m1.small, m3.xlarge, i2.4xlarge
First steps
• No failover
• No redundancy
• No spreading of risk
[Diagram: user, Amazon Route 53, Elastic IP address, EC2 instance]
1,000
1,000 users and more
Separate the database from the web application
A managed database service instead of self-management?
[Diagram: users, Amazon Route 53, Elastic IP address, web instance, database instance]
Database options
Self-managed
• Database on Amazon EC2: your choice of software and version; bring your own license (BYOL)
Database services
• Amazon RDS: Microsoft SQL Server, Oracle, PostgreSQL and MySQL as a service; license included or BYOL
• Amazon DynamoDB: NoSQL service with SSD storage; seamless scaling, no operational overhead
• Amazon Redshift: data warehouse service (SQL); massively parallel, highly scalable, fast access
Which database technology?
Why a SQL database?
• Widely used and well understood
• Thousands of tools, code samples, communities, books, …
• Proven scaling patterns and recipes
• Can handle 10 million users too*
* Except for extremely write-heavy workloads on huge data sets. And even then there is a place for a SQL database in your stack.
When is NoSQL a better fit?
• Huge data volumes (in the TB range)
• Thousands of write and update operations
• Unstructured data, no fixed tables
• Data with loose relationships
• Storing metadata
• Applications with low-latency requirements
• Experience and expertise in the team
Amazon DynamoDB
• Fully managed service
• High and predictable performance
• Fully distributed, fault-tolerant architecture
Configurable, consistent throughput
Automatic scaling of tables
Built-in fault tolerance
Strong consistency and atomic counters
Integrated monitoring
High security
Integration with Amazon EMR
More than 1,000 users
Separate the database from the web application
Amazon RDS database service instead of self-management
[Diagram: users, Amazon Route 53, Elastic IP address, web instance, RDS database instance]
10,000
10,000 users and more
Failover and redundancy
• Spread across Availability Zones
• Elastic Load Balancing
• Amazon RDS Multi-AZ
[Diagram: users, Amazon Route 53 and Elastic Load Balancing in front of two Availability Zones. Zone A: web instance and Amazon RDS DB instance with Multi-AZ enabled. Zone B: web instance and Amazon RDS standby DB instance (Multi-AZ).]
Elastic Load Balancing
• For fault-tolerant and highly scalable applications
Highly available and elastic
Health checks
Load balancing at layers 4 and 7
SSL support and offloading
Integrated monitoring
Logging
IPv6 support
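The combination of health checks and load distribution listed above can be illustrated with a small sketch. It is not how Elastic Load Balancing is implemented; it is a toy round-robin balancer, and the class name, method names and backend names (`web-a`, `web-b`) are made up for illustration.

```python
from itertools import cycle


class RoundRobinBalancer:
    """Toy load balancer: rotates over backends and skips unhealthy ones."""

    def __init__(self, backends):
        self.backends = list(backends)
        self.healthy = set(self.backends)
        self._rotation = cycle(self.backends)

    def report_health(self, backend, is_healthy):
        """Record the result of a health check for one backend."""
        if is_healthy:
            self.healthy.add(backend)
        else:
            self.healthy.discard(backend)

    def next_backend(self):
        """Return the next healthy backend, or raise if none is left."""
        for _ in range(len(self.backends)):
            candidate = next(self._rotation)
            if candidate in self.healthy:
                return candidate
        raise RuntimeError("no healthy backend available")


# Hypothetical setup: one web instance per Availability Zone.
balancer = RoundRobinBalancer(["web-a", "web-b"])
balancer.report_health("web-a", False)  # the zone A instance fails its check
picked = [balancer.next_backend() for _ in range(3)]  # all traffic goes to web-b
```

Once `web-a` passes its health check again, it rejoins the rotation without any change on the client side, which is the property that makes the Multi-AZ setup useful.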
500,000
Horizontal scaling
[Diagram: users, Amazon Route 53 and Elastic Load Balancing in front of two Availability Zones, each with several web instances and RDS read replicas. Zone A also holds the RDS master DB instance (Multi-AZ), zone B the RDS standby DB instance (Multi-AZ).]
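Read replicas only help if the application sends its reads to them. A minimal sketch of such a read/write split follows; the router class and endpoint names are hypothetical, and the rule "only SELECT goes to a replica" is a deliberately conservative assumption rather than anything prescribed by the talk.

```python
import itertools


class ReplicaRouter:
    """Sends SELECTs to read replicas in turn and everything else to the master."""

    def __init__(self, master, replicas):
        self.master = master
        # cycle() over an empty list would never yield, so treat that as "no replicas".
        self._replicas = itertools.cycle(replicas) if replicas else None

    def endpoint_for(self, sql):
        words = sql.split()
        is_read = bool(words) and words[0].upper() == "SELECT"
        if is_read and self._replicas is not None:
            return next(self._replicas)
        return self.master


# Hypothetical endpoint names, for illustration only.
router = ReplicaRouter("master-db", ["replica-1", "replica-2"])
read_target = router.endpoint_for("SELECT name FROM users WHERE id = 7")
write_target = router.endpoint_for("UPDATE users SET name = 'x' WHERE id = 7")
```

Replicas are typically updated asynchronously and can lag behind the master, so a read that must see a write made a moment ago should still go to the master.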
Shifting load…
…away from the web servers and the database
• Store static content on S3
• Deliver static (and dynamic) content through CloudFront
• Cache database queries in ElastiCache
• Move session state to ElastiCache or DynamoDB
[Diagram: users, Amazon Route 53, Amazon CloudFront, Elastic Load Balancing, Amazon S3, web instance, ElastiCache, RDS master DB instance (Multi-AZ) in an Availability Zone, Amazon DynamoDB]
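Caching database queries usually follows the cache-aside pattern: look in the cache first and only hit the database on a miss. The sketch below uses an in-memory dictionary as a stand-in for a cache such as ElastiCache; `TTLCache`, `cached_query` and the sample key are illustrative names, not part of any AWS API.

```python
import time


class TTLCache:
    """In-memory stand-in for an external cache (assumption for this sketch)."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock
        self._items = {}

    def get(self, key):
        entry = self._items.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if self.clock() >= expires_at:
            del self._items[key]  # expired entries count as a miss
            return None
        return value

    def set(self, key, value):
        self._items[key] = (value, self.clock() + self.ttl)


def cached_query(cache, key, run_query):
    """Cache-aside: try the cache first, fall back to the database on a miss.

    A result of None is never cached, so such queries always reach the database.
    """
    value = cache.get(key)
    if value is None:
        value = run_query()
        cache.set(key, value)
    return value


db_calls = []


def load_top_articles():
    """Pretend database query; records each call so the effect is visible."""
    db_calls.append("query")
    return ["article-1", "article-2"]


cache = TTLCache(ttl_seconds=60)
first = cached_query(cache, "top-articles", load_top_articles)
second = cached_query(cache, "top-articles", load_top_articles)  # served from cache
```

The time-to-live bounds how stale a cached result can get; choosing it is a trade-off between database load and freshness.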
Traffic to Amazon.com in November
[Chart: November traffic to Amazon.com plotted against provisioned capacity; labels 76% and 24%]
Auto Scaling
Automatic adjustment of capacity to the load
• Integration with Amazon CloudWatch
• Integration with Elastic Load Balancing
• For scaling and availability
[Diagram: Amazon CloudWatch triggers an Auto Scaling policy]
as-create-auto-scaling-group MyGroup \
  --launch-configuration MyConfig \
  --availability-zones us-east-1a \
  --min-size 4 \
  --max-size 200
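What the size limits in the command above mean for a scaling decision can be sketched in a few lines. The proportional rule and the 60% CPU target below are assumptions chosen for illustration, not the policy used in the talk; only the limits 4 and 200 come from the command.

```python
import math


def desired_capacity(current, avg_cpu_percent, target_cpu_percent=60.0,
                     min_size=4, max_size=200):
    """Proportional scaling rule, clamped to the group's size limits.

    If the fleet runs hotter than the target, grow it in proportion;
    if it runs cooler, shrink it, but never leave [min_size, max_size].
    """
    if current < 1:
        raise ValueError("current must be at least 1")
    wanted = math.ceil(current * avg_cpu_percent / target_cpu_percent)
    return max(min_size, min(max_size, wanted))


scale_out = desired_capacity(current=10, avg_cpu_percent=90.0)   # grows the group
floor_hit = desired_capacity(current=4, avg_cpu_percent=10.0)    # held at min_size
ceiling_hit = desired_capacity(current=150, avg_cpu_percent=95.0)  # capped at max_size
```

The lower bound keeps enough instances running for availability even when traffic is low, and the upper bound caps cost if a metric misbehaves.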
500,000 users
[Diagram: users, Amazon Route 53, Amazon CloudFront, Elastic Load Balancing, Amazon S3 and Amazon DynamoDB around two Availability Zones. Each zone contains web instances, ElastiCache and an RDS read replica; one zone holds the RDS master DB instance (Multi-AZ), the other the RDS standby DB instance (Multi-AZ).]
Automation
[Diagram: spectrum from convenience to control: AWS Elastic Beanstalk, AWS OpsWorks, AWS CloudFormation, Amazon EC2]
SERVER METRICS
AGGREGATED METRICS
LOG ANALYSIS
EXTERNAL MEASUREMENTS
AWS Marketplace & Partners
• Find and buy pre-installed software solutions
• Usage-based billing
• Launch within minutes
• Billing integrated into your AWS account
• More than 1,300 products and 20 categories
Learn more at: aws.amazon.com/marketplace
1,000,000
Decouple components
• Decoupling helps with scaling and optimizing
– Independent components
– Designed as black boxes
– Clear interfaces
– Decoupled interactions
Decouple components
Loose coupling: Amazon SQS as a buffer
[Diagram: before and after views of Controller A and Controller B, with queues (Q) between them]
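The idea of a queue as a buffer between two controllers can be sketched with Python's standard library. `queue.Queue` stands in for Amazon SQS here, which is an assumption of this sketch; the job names and the upper-casing "work" are placeholders.

```python
import queue
import threading


def controller_a(jobs, buffer):
    """Producer: hands work to the queue and returns without waiting for B."""
    for job in jobs:
        buffer.put(job)
    buffer.put(None)  # sentinel: no more work


def controller_b(buffer, results):
    """Consumer: drains the queue at its own pace."""
    while True:
        job = buffer.get()
        if job is None:
            break
        results.append(job.upper())  # placeholder for the real processing


buffer = queue.Queue()
results = []
worker = threading.Thread(target=controller_b, args=(buffer, results))
worker.start()
controller_a(["resize", "notify", "index"], buffer)
worker.join()
```

Unlike this in-process queue, a standard SQS queue delivers messages at least once and does not guarantee ordering, so a real consumer should tolerate duplicates and reordering.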
Think in services
• Split monolithic blocks into fine-grained services
• Package services into their own modules
• Treat each service as a 100% independent component
• Scale each service independently
= Core principle of AWS and Amazon.com
Don't reinvent the wheel
If a suitable service already exists, why build your own?
Examples
• Notifications
• E-mail
• Search
• Workflows
• Queuing
• Transcoding
• Monitoring
Services shown: Amazon SNS, Amazon SES, Amazon CloudSearch, Amazon SWF, Amazon SQS, Amazon Elastic Transcoder
1 million users and more
[Diagram: users, Amazon Route 53, Amazon CloudFront, Elastic Load Balancing, Amazon S3, Amazon SQS, Amazon SES, Amazon CloudWatch, Amazon DynamoDB and ElastiCache around an Availability Zone containing web instances, worker instances, internal app servers, the RDS master DB instance (Multi-AZ) and RDS read replicas]
10,000,000
Between 5 and 10 million users
Writing to the database becomes the bottleneck
Solutions
• Federation: split the database structure across multiple database systems by function
• Sharding: distribute the data across multiple database systems (e.g. users by region)
• NoSQL: move certain data to NoSQL databases
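Federation and sharding both come down to a routing decision in the application. The sketch below shows one possible shape of that decision; all endpoint names are hypothetical, and the shard function uses a hash of the user ID rather than the region-based split mentioned on the slide.

```python
import hashlib

# Hypothetical endpoints; none of these names come from the talk.
FEDERATED = {"users": "users-db", "orders": "orders-db", "forum": "forum-db"}
SHARDS = ["users-shard-0", "users-shard-1", "users-shard-2", "users-shard-3"]


def federated_endpoint(function_name):
    """Federation: each functional area gets its own database."""
    return FEDERATED[function_name]


def shard_endpoint(user_id, shards=SHARDS):
    """Sharding: a stable hash of the key picks one of several databases."""
    digest = hashlib.sha256(str(user_id).encode("utf-8")).digest()
    return shards[int.from_bytes(digest[:8], "big") % len(shards)]


orders_db = federated_endpoint("orders")
user_db = shard_endpoint(42)  # the same user always maps to the same shard
```

A plain modulo over the shard count moves most keys when a shard is added, which is one reason to plan the number of shards (or a lookup table) early.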
Moving data to a NoSQL database
• Examples
– Scores and leaderboards
– Clickstream or log data
– Temporary storage
• (e.g. shopping carts, session information)
– Extremely popular data
– Metadata
Amazon DynamoDB
…and that brings us to
10 million users
From 1 to 10 million users
• Choose the right database for each use case
• Spread the entire infrastructure across Availability Zones (Multi-AZ)
• Caching, caching, caching
• Decouple and think in services (SOA)
• Don't reinvent the wheel
– Elastic Load Balancing, Amazon S3, Amazon SNS, Amazon SQS, Amazon SWF, Amazon SES, ...
• Auto Scaling (once the groundwork has been laid)
• Monitoring at every level
• Automate operations
100,000,000?
10 to 100 million users
• Even more, and more specialized, fine-grained services
• Even more performance tuning through precise measurement of the entire stack
• From Multi-AZ to multi-region
• More custom solutions
Next steps?
Read?
• aws.amazon.com/de/documentation
• aws.amazon.com/de/architecture
• aws.amazon.com/de/start-ups
Listen?
• Motain - Onefootball
From 1 to 10 Million
Jonathan Lavigne
15/05/2014
Today
•Jonathan Lavigne, CTO/CPO, Onefootball
•What we’ll talk about:
–From a German iPhone app to a multi-country success
–How we use AWS
–Lessons learned while growing
–Our TODOs for the next evolution of our infrastructure
We fuel people’s addictions
It’s a team effort
•Balancing new feature development,
performance, monitoring, quality, process
management
•All while avoiding silos, delays, systems being
down, unhappy users
Thanks to my team!
The initial infrastructure
•Classic infrastructure
•1 server to produce content, 4 servers to deliver it
–Very big servers: 2 x Xeon CPU w/ 16 cores, 48GB RAM, SSD
–Central NAS w/ 1TB space over a 1GB/s connection
–1 DB master and 4 DB slaves on each frontend server
–2 load balancers (1 active, 1 hot stand-by)
•Very predictable metrics
•Worked well for a long time
Until… Euro 2012
•From 1 to 2 million users in 2 weeks
•International traffic: everybody
follows the same games
•Servers start to send warnings, we
NEED to do something
Until… Euro 2012
•Monolith = not easy to scale
•Fast code optimizations only get you so far
•Solution: CloudFront and S3 to the rescue
•Lessons learned
–Think in terms of systems early
–Shift load to a CDN or other services (S3, etc.)
Post Euro 2012 - Optimizations
•From 2 to 4 million users
•Refine CDN caching and invalidation rules
•Move static assets to S3 (images, static files)
•Optimize code, services and infrastructure
•Admit when the optimization effort costs more than it delivers
Small change, big difference
Time for elasticity - AWS
•Difficult to do quickly
•Software too tightly coupled
•Internal resources busy with other projects
•Step-by-step
•Evaluate performance hot-spots
•Create a plan to decouple software
•Choose the setup for YOUR needs
Our score architecture today
Technologies we use
•EC2
•RabbitMQ and HAProxy
•PHP & Go Stack
•RDS and ElastiCache
•ELB and Route 53
•OpsWorks
•Deployment with Chef
•VPC
RabbitMQ
•OpsWorks handles clustering with Chef
•Queues are mirrored across the instances
•HAProxy is used on the EC2 instances that connect to RabbitMQ
•Balancing between various RabbitMQ hosts
•Don't change anything in the software
•10,000 messages/second on a c3.large instance
OpsWorks
•Makes it easy for any engineer to manage the infrastructure
•Integrates with Chef out of the box
•Allows us to layer the infrastructure and optimize
accordingly
•Lets us benefit from time-based scaling, since usage patterns are predictable (football matches)
•Allows load-based scaling if required
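Time-based scaling around predictable events can be reduced to a small scheduling rule. The sketch below is not OpsWorks configuration; the function name, the instance counts and the one-hour lead and three-hour tail around kickoff are all made-up values for illustration.

```python
from datetime import datetime, timedelta


def instances_needed(now, kickoffs, baseline=2, match_day=10,
                     lead=timedelta(hours=1), tail=timedelta(hours=3)):
    """Return the scheduled instance count for a given moment.

    Capacity is raised from `lead` before each kickoff until `tail` after it;
    outside those windows the baseline applies.
    """
    for kickoff in kickoffs:
        if kickoff - lead <= now <= kickoff + tail:
            return match_day
    return baseline


# Hypothetical fixture list with a single evening match.
kickoffs = [datetime(2014, 5, 17, 20, 0)]
before_match = instances_needed(datetime(2014, 5, 17, 19, 30), kickoffs)
quiet_hours = instances_needed(datetime(2014, 5, 17, 12, 0), kickoffs)
```

A schedule like this covers the expected peaks, while load-based scaling remains the safety net for traffic that the fixture list does not predict.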
Monitoring
•Amazon
•Basic CPU and Load usage
•Monitoring the ELBs
•CopperEgg
•Monitors instances, helps determine instance
size and optimize/improve
The future
•Fine tune auto-scaling rules
•Move more of our infrastructure to OpsWorks
•Unify databases
•Deploy to multiple AWS regions
•Continue the move to Go
•Leverage more AWS technologies such as Redshift and Elastic MapReduce
THANK YOU