2005-06-22

A Fault Tolerant Java Virtual Machine
Malte Tiedje
Seminar Zuverlässigkeit von Software in sicherheitskritischen Systemen
28 June 2005

Introduction: Fault-tolerance

What is fault tolerance?

Definition: "... is the property of a system that continues operating properly in the event of failure of some of its parts." (www.wikipedia.org)

In our case we implement a system (a JRE) that tolerates fail-stop failures: in response to a failure, the component changes to a state that permits other components to detect that a failure has occurred, and then stops. Note that this does not cover Byzantine failures.

Introduction: Why Java?

Java is ...
- portable
- secure: strong typing, ...
- distributed: RMI
- and of course: OO, simple, widely used
but Java is not fault-tolerant.

Because the JVM is defined independently of the hardware that implements it, Java programs can run unmodified on any platform that implements a JVM. This implementation likewise changes only machine-independent code to achieve fault tolerance.
The approach: 4 Steps

1. Define a deterministic state machine as the unit of replication
2. Implement independently failing replicas of the state machine
3. Ensure all replicas start from identical states and perform the same sequence of state transitions
4. Ensure each output-producing transition yields a single output to the environment

Currently, fault tolerance is handled at the application level, for example with transaction numbers or group technology.

The approach: State Machines

A state machine is ...
- a set of state variables and
- a sequence of commands
A command ...
- reads a subset of the state variables (read-set values = rsvs)
- modifies a subset of the state variables (write-set values = wsvs)
A command is deterministic ...
- when it produces deterministic wsvs and outputs for given rsvs
A deterministic state machine ...
- reads a fixed sequence of deterministic commands
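The state-machine definition above can be illustrated with a minimal sketch. It assumes state variables held in an `int[]` and commands modelled as pure functions; the class name `StateMachineSketch` and the two demo commands are hypothetical and do not come from the slides or from Napper 2003.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.UnaryOperator;

/** Sketch only: state variables are an int[], commands are pure functions on it. */
final class StateMachineSketch {

    /** A fixed sequence of deterministic commands (hypothetical examples). */
    static List<UnaryOperator<int[]>> demoCommands() {
        return Arrays.asList(
            s -> new int[] { s[0] + 1, s[1] },  // rsvs: s[0]; wsvs: s[0]
            s -> new int[] { s[0], s[0] * 2 }   // rsvs: s[0]; wsvs: s[1]
        );
    }

    /** Runs one replica: starts from the given state and applies every command in order. */
    static int[] run(int[] initialState, List<UnaryOperator<int[]>> commands) {
        int[] state = initialState.clone();
        for (UnaryOperator<int[]> command : commands) {
            state = command.apply(state);
        }
        return state;
    }

    public static void main(String[] args) {
        int[] replicaA = run(new int[] { 0, 0 }, demoCommands());
        int[] replicaB = run(new int[] { 0, 0 }, demoCommands());
        // Identical start state plus identical command sequence gives identical end state.
        System.out.println(Arrays.toString(replicaA) + " " + Arrays.equals(replicaA, replicaB));
    }
}
```

Two replicas that start from the same state and apply the same commands end in the same state, which is the property steps 1 and 3 rely on.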
The approach: Fault-tolerance by duplication

Replication

Definition: "Providing multiple identical instances of the same system, directing tasks to all of them in parallel, and choosing the correct result on the basis of a quorum" (www.wikipedia.org)

Each replica undergoes the same sequence of state transitions and produces the same output!
(Napper 2003)

The approach: JVM as State Machine I

1. Not all commands executed by the JVM are deterministic
2. Replicas of a JVM do not in general execute identical sequences of commands
3. The read set for a given command is not guaranteed to contain identical values at all replicas

The approach: JVM as State Machine II

Problem: the JVM is multi-threaded, and state machines typically are not
Solution: every thread is a state machine, and the JVM is a set of cooperating state machines
In particular: BEEs (Bytecode Execution Engines) as sets of functions that together define a replica
(Napper 2003)

Although BEEs do not explicitly exist as components of the JVM, we can conceptually associate a BEE with a set of functions that perform bytecode execution and track the state of each thread.
Implementing replica coordination in the JVM poses 3 challenges.

Details: Non-deterministic commands

- Exclusively invoked through the Java Native Interface (JNI)
- e.g. reading the hardware clock
Problem: the replicas have different input values, because the input is performed outside the scope of the JVM
Solution: the protocol forces the backup to adopt the write-set values produced by the primary
But this is not enough: we have to restrict the behavior of the native methods

Details: Non-deterministic commands, Restriction 2

Restriction 2: Native methods must invoke other methods deterministically

Example

    // Not allowed: the lock is acquired inside the native method,
    // depending on a clock value the JVM never sees.
    native void DoNotDo() { lc = read_time_of_day(); if (lc > 17:24:32) acquire_lock(); }

    // Allowed: the native method only returns the input;
    // the decision to acquire the lock is made in Java code.
    native long Input() { return read_time_of_day(); }
    void do(long lc) { lc = Input(); if (lc > 17:24:32) acquire_lock(); }
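The idea that the backup adopts the values produced by the primary can be sketched as follows. This is a minimal illustration under stated assumptions, not the protocol from Napper 2003: an in-memory queue stands in for the channel between primary and backup, and the names `NativeResultLogSketch`, `invokeAtPrimary` and `invokeAtBackup` are hypothetical.

```java
import java.util.ArrayDeque;
import java.util.Queue;
import java.util.function.LongSupplier;

/** Sketch only: an in-memory queue stands in for the primary-to-backup channel. */
final class NativeResultLogSketch {
    private final Queue<Long> channel = new ArrayDeque<>();

    /** Primary: run the non-deterministic native call and log its result. */
    long invokeAtPrimary(LongSupplier nativeCall) {
        long result = nativeCall.getAsLong();
        channel.add(result);
        return result;
    }

    /** Backup: do not run the native call; adopt the value the primary logged. */
    long invokeAtBackup() {
        Long logged = channel.poll();
        if (logged == null) {
            throw new IllegalStateException("no logged result from the primary");
        }
        return logged;
    }

    public static void main(String[] args) {
        NativeResultLogSketch log = new NativeResultLogSketch();
        long atPrimary = log.invokeAtPrimary(System::nanoTime); // stand-in for a clock read
        long atBackup = log.invokeAtBackup();
        System.out.println(atPrimary == atBackup); // prints true
    }
}
```

Because the backup never performs its own clock read, both replicas continue with the same value even though the input itself is non-deterministic.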
Details: Non-deterministic commands, Restriction 1

Restriction 1: Native methods must not produce non-deterministic output to the environment

Example

    // Not allowed: non-deterministic input and output in the same native method.
    native void DoNotDo() { lc = read_time_of_day(); print(lc); }

    // Allowed: input and output are separate native methods.
    native long Input() { return read_time_of_day(); }
    native void Output(long lc) { print(lc); }

Details: Non-deterministic commands, Implementation

- Checked all native methods in the JRE libraries: fewer than 100 are non-deterministic
- Stored the signature of each such method in a hash table (class, method, arguments)
- When the primary invokes a native method, check the hash table
- On a match, send the backup the return values and modified arguments
- On recovery, the backup may use the logged values

Details: Non-deterministic Read Sets I

Because of multi-threading in the JVM, the values of shared variables are non-deterministic

Solution I: if all access to shared data is guarded by correct use of monitors (using synchronized), it is enough to replicate the lock synchronization

Sun's JVM provides two implementations of multithreading. The native threads version leaves scheduling to the underlying OS, while the green threads version implements a user-level thread library for a uniprocessor inside the JVM.

Details: Non-deterministic rsvs, Implementation I

Definition: <tid, tasn, lid, lasn>
- tid: thread id of the locking thread
- tasn: thread acquire sequence number, recording the number of locks acquired so far by thread tid
- lid: lock id
- lasn: lock acquire sequence number, recording the number of times lid has been acquired so far
(Napper 2003)
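A minimal sketch of how the record <tid, tasn, lid, lasn> could be produced at the primary and checked at the backup. It assumes plain `long` values as thread and lock ids and a log passed around as a list. The class `LockReplicationSketch` and its methods are hypothetical, and the matching rule is one plausible reading of the tuple rather than the implementation in Napper 2003.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Sketch only: plain long ids stand in for thread and lock ids. */
final class LockReplicationSketch {

    /** One lock acquisition record <tid, tasn, lid, lasn>. */
    static final class Record {
        final long tid, tasn, lid, lasn;
        Record(long tid, long tasn, long lid, long lasn) {
            this.tid = tid; this.tasn = tasn; this.lid = lid; this.lasn = lasn;
        }
    }

    private final Map<Long, Long> locksAcquiredByThread = new HashMap<>();
    private final Map<Long, Long> acquisitionsOfLock = new HashMap<>();

    /** Counts the acquisition and returns the record describing it. */
    Record acquired(long tid, long lid) {
        long tasn = locksAcquiredByThread.merge(tid, 1L, Long::sum);
        long lasn = acquisitionsOfLock.merge(lid, 1L, Long::sum);
        return new Record(tid, tasn, lid, lasn);
    }

    /** Backup side: may thread tid acquire lid now, according to the primary's log? */
    boolean mayAcquire(List<Record> primaryLog, long tid, long lid) {
        long nextTasn = locksAcquiredByThread.getOrDefault(tid, 0L) + 1;
        long nextLasn = acquisitionsOfLock.getOrDefault(lid, 0L) + 1;
        for (Record r : primaryLog) {
            if (r.tid == tid && r.tasn == nextTasn) {
                return r.lid == lid && r.lasn == nextLasn;
            }
        }
        return false; // nothing logged for this acquisition: wait, in this sketch
    }

    public static void main(String[] args) {
        LockReplicationSketch primary = new LockReplicationSketch();
        List<Record> log = new ArrayList<>();
        log.add(primary.acquired(1, 7)); // thread 1 acquires lock 7 first
        log.add(primary.acquired(2, 7)); // thread 2 acquires it second

        LockReplicationSketch backup = new LockReplicationSketch();
        System.out.println(backup.mayAcquire(log, 2, 7)); // false: thread 1 must go first
        System.out.println(backup.mayAcquire(log, 1, 7)); // true
        backup.acquired(1, 7);
        System.out.println(backup.mayAcquire(log, 2, 7)); // true
    }
}
```

The two sequence numbers let the backup reproduce the primary's order per lock without relying on the global order of events.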
Details: Non-deterministic rsvs, Implementation II

- Hard to create unambiguous ids
- Cannot use the object address as lid
- Cannot use the order of events at the primary
(Napper 2003)

Details: Non-deterministic Read Sets II

Solution I falls short: many programs do not guard all shared data with monitors (not even Sun's JRE)

Example

    class Example {
        static Formatter shared_data = null;
        String toString() {
            // shared_data is read and written outside any monitor
            if (shared_data == null) {
                shared_data = new Formatter();
                synchronized_method();
                ...
            }
        }
    }

Details: Non-deterministic rsvs, Implementation

Definition: <brcnt, pcoff, moncnt, lasn, tid>
- brcnt: counts the control flow changes executed by thread t (e.g. branches, jumps, and method invocations)
- pcoff: records the bytecode offset of the PC within the method currently executed by t
- moncnt: counts the monitor acquisitions and releases performed by t
- lasn: records the lock acquisition sequence number when t is rescheduled while waiting on a lock
- tid: the thread id of the next scheduled thread
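The tuple <brcnt, pcoff, moncnt, lasn, tid> can be pictured as a small data structure. The sketch below assumes the backup compares its own progress counters against the logged values to find the point at which to switch threads. That comparison rule, the class name `ScheduleRecordSketch` and the method `reachedBy` are assumptions for illustration, not details confirmed by the slides.

```java
/** Sketch only: field names follow the tuple <brcnt, pcoff, moncnt, lasn, tid>. */
final class ScheduleRecordSketch {
    final long brcnt;   // control flow changes executed by thread t
    final int pcoff;    // bytecode offset of the PC in the current method
    final long moncnt;  // monitor acquisitions and releases performed by t
    final long lasn;    // lock acquisition sequence number, if t was waiting on a lock
    final long tid;     // id of the thread to schedule next

    ScheduleRecordSketch(long brcnt, int pcoff, long moncnt, long lasn, long tid) {
        this.brcnt = brcnt;
        this.pcoff = pcoff;
        this.moncnt = moncnt;
        this.lasn = lasn;
        this.tid = tid;
    }

    /** True when the backup thread's progress equals the logged switch point. */
    boolean reachedBy(long curBrcnt, int curPcoff, long curMoncnt) {
        return curBrcnt == brcnt && curPcoff == pcoff && curMoncnt == moncnt;
    }

    public static void main(String[] args) {
        ScheduleRecordSketch rec = new ScheduleRecordSketch(3, 12, 1, 0, 2);
        System.out.println(rec.reachedBy(2, 12, 1)); // false: one control flow change to go
        System.out.println(rec.reachedBy(3, 12, 1)); // true: switch to thread rec.tid
    }
}
```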
Details: Non-deterministic Read Sets II

Solution II: a thread has exclusive access to all shared variables while it is scheduled; therefore we need to replicate the thread scheduling

Details: Output to the environment

Objective: simulate a single, fault-tolerant state machine
- In general impossible
Restriction 3: All native method output to the environment is either idempotent or testable; therefore we need a Side Effect Handler

A function f is idempotent iff (f ∘ f)(x) = f(x) for all x. An action is testable when the environment can be queried to determine whether a specific output completed.

Evaluation

Overhead: depends on the application and on the technique (replicated lock synchronization or replicated thread scheduling)
Experiments: SPEC JVM98 benchmark (among others: compress, db, raytracer rendering)
Quantitatively: overhead ranges from 5% up to 375%; the average is 60% for rts (replicated thread scheduling) and 140% for rla (replicated lock acquisition)
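Restriction 3 suggests how a backup could treat an output whose completion is uncertain after the primary fails. The following is a minimal sketch assuming two cases: an idempotent output is simply repeated, and a testable output is repeated only if the environment reports that it did not complete. The names `OutputRecoverySketch`, `Kind` and `recover` are hypothetical.

```java
import java.util.function.BooleanSupplier;

/** Sketch only: decides on recovery whether an uncertain output must be re-executed. */
final class OutputRecoverySketch {
    enum Kind { IDEMPOTENT, TESTABLE }

    /** Returns true if the backup executed the output during recovery. */
    static boolean recover(Kind kind, Runnable output, BooleanSupplier completedInEnvironment) {
        if (kind == Kind.TESTABLE && completedInEnvironment.getAsBoolean()) {
            return false; // the environment confirms the primary's output completed
        }
        output.run();     // idempotent, or testable and not yet completed
        return true;
    }

    public static void main(String[] args) {
        int[] writes = { 0 };
        boolean redone = recover(Kind.TESTABLE, () -> writes[0]++, () -> true);
        System.out.println(redone + " " + writes[0]); // prints "false 0"
    }
}
```

Repeating an idempotent output is harmless by definition, which is why only the testable case needs to query the environment.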
Details: Output to the environment, Side Effect Handler

- register: method's signature, what should be logged, etc.
- test: called on testable, uncertain commands
- log & receive: how primary and backup exchange state
- restore: called at the backup during recovery
(Napper 2003)

Evaluation: Replicated Lock Acquisition
(figure from Napper 2003)

Evaluation: Replicated Thread Scheduling
(figure from Napper 2003)

References

- Jeff Napper, Lorenzo Alvisi, Harrick Vin: A Fault-Tolerant Java Virtual Machine. http://www.cs.utexas.edu/users/jmn/papers/napper03fault.ppt
- www.wikipedia.org

Conclusion

- A fault-tolerant JVM (at a reasonable cost)
- Write Once, Run Anywhere
- A framework for replicating multi-threaded SMs