27 Nov 2017

Sinfonia Transactions

###Sinfonia

Ideas to think about:

  1. Is NFS a good application for Gigapaxos.
  2. Did it work for sinfonia because of node locality.
  3. If we can use node locality to increase throughput, we can use reconfigurability of gigapaxos to have an evolving database, whose performances gets better with time.

Key Term: Distributed Infrastructure Application: Application used to build other application.

Features of the system:

  1. Provides a paradigm for distributed systems without worrying about message passing protocols.(Instead of system design, design datastructures within sinfonia)
  2. Access Data on the system, through a linear address space exported.
  3. Minitransaction primitive to handle transactions.

System Design: 1) Consists of memory nodes and application nodes. 2) Application nodes contain user library which manipulates memory node 3) Memory nodes keep uninterpreted bytes and exports address space <memory node id, node locality> 4) Minitransactions: Application nodes play the role of co-ordinators 5) Minitransaction class: cmp(memid,addr,len,data),read(memid,addr,len,buf),write(memid,addr,len,data) IMPORTANT:: Mintransaction is structured in a way that for a single memid The projection of the transaction on this memid can be executed in the first phase and then based on a vote.It is committed or rejected This is from the principle of reduce coupling

6) If all compare operations succeed read from location and write to locations 7) Can build more complicated feateures such as swap, atomic reads .. 8) Applications can build their own cache handling management. Data accessed through sinfonia is always current. 9) Fault tolerance through : disk image: written asynchronously Append log: written synchronously Primary Copy replication 10) Load balancing:Application choose where to place data on memory node. Applications get memory load information to perform balancing.

Implementation and ALgorithms

  1. Modified 2 Phase Commit.
         Co-ordinator:		
             Init:	generate tid
                 Collasce operations per memory node q
                 q_read,q_write,q_compare:
                 D be all memory nodes being involved				
                 For all q (paralley) send_prepare(tid,q_read,q_write,q_compare)
             recv_all_votes:
                 if all vote ok:
                     send_commit
                 else abort
                        
         Participant q:
             recv_prepare(D,tid,q_read,q_write,q_compare):
                 Put tid in in_doubt
                 acquire all locks
                 if q_compare success			
                     Perform q_read,q_write
                     return vote_ok
                 else:
                     return vote fail			
             recv_accept(tid):
                 if tid not in_doubt:
                     \\recovery co-ordinator fixed it.
                     abort
                 if commit:
                     apply write 
                     release locks
                 else: 
                     abort
    

2.Participants acquire locks without blocking . If a transaction fails due to lock acquire failure, it reboots after a random time out.

3.Co-ordinator has no log. Can get away with this because, block on participant failure not on co-ordinator failure. Can resume the transaction by getting back all the votes from all participants. A periodic probe checks for tid in doubt, from this participant get the memory nodes across which the transaction is spread and resumes it.

4.Indoubt log: contains tid after prepare phase of transaction Decided_log: After state is decided Abort_log: After aborted

5.Log garbage collection: A committed tid can be cleaned after it is committed by ever other memory node involved in the transaction. Memory Node periodically sends status information to all other nodes involved in the transaction.Forced aborts are removed after 2 epochs

  1. If a transaction can be done using just one memory node, then can be committed in one message. This node locality increases throughput. Can we do this ???? Ask Arun

  2. A primary-copy replication, relies on synchrony so can lead to false fail overs. Solution: use lights out management to kill the primary.

  3. Applications refer to memory nodes with logical memory id. This is remapped to physical memory at the sinfonia directory server.

Application

  1. Cluster File System: Application nodes act as cluster nodes. Beating Berkley DB, as it locks per page while sinfonia does it at the granualarity of a memory location. Beating Linux FileSystem, because of sequential write ahead logging.
  2. Group Communication: All Process in the group receive, broadcasts in the same order. Implemented by having a tail queue and updating using transactions.
    • Threaded queue to decrease duration of the transaction.

Interesting notes:

  1. Key to aceiving scalability is to decouple operations executed by hosts. This is done as sinfonia,provides fine grained address space without imposing structure.
  2. Node locality is the opposite of data striping. It seems useless, if there is replication But in this model of master slave, node locality is present at the slave also
  3. Batching commit of a transaction with perpare of the next transaction saves round trip time [This mini transaction primitive seems important]
  4. Transactions can be tagged to perform, load analysis.
  5. By finding the correct application for a defined abstraction.
  6. Across transactions,spread load ; within a transaction focus load.

Doubts.

  1. Because of limited scope, Sinfonia claims to happen in 2 network round trips while database transactions take more than 2(2 to just commit and a start and execute)

  2. Is node locality useless, if there is replication.

  3. look at the use cases to see number of application nodes vs data nodes

  4. What is the advantage of having the co-ordinator not have a log

Reading Continuiation: 1) VoltDb 2) Percolator (I have no idea, where I got these but made a note of them) 3)

Til next time,
Sandeep Polisetty at 00:00

scribble