Two-phase commit - Everything2.com

Context: databases, transactional data

Two phase commit (2PC, sometimes tpc) is a technique used for distributing data between databases. There is no standard two phase commit protocol or implementation; every database has their own implementation. Two phase commit is used when data has to be updated in several databases, and when it is paramount that all databases are in synchronization. Examples of such database environments are

airline booking systems
stock and money markets
banking systems
credit card systems.

In these systems it is important that all data is in synch, in order to maintain data integrity. If just one of the many databases in these systems are out of synch, the whole system can be corrupt. End-users can be working on the same data entities but in different databases, and if they aren't correctly updated throughout the system, there's a huge problem. Systems with only one database use locking of the data that is accessed, but this doesn't work in distributed database systems.

The problem with synchronizing data is not as trivial as one first may think. If two separate databases are to be kept in synch, they have to rely on transactions. They also have to know that the transaction was not interrupted in any way. If you have a Transaction Processing Monitor (TPM), you rely on the TPM to deliver the transaction at any price - which is also done by a two phase commit protocol. Usually, the price is high, since TPM's are very expensive. The other solution is to use database two phase commit, which also will ensure data integrity when transactions fail. The implementations differ between TPM and DB two phase commits.

Consider the following setup: One database system (A) has to send data to another database (B), and A cannot use the data itself until it knows that the data has entered the other database as well. As you know from using the internet, data transactions can disappear for no apparent reason. How can A be sure that B has received the data ? This could be accomplished by having B send an acknowledgement when it has received the data. But, and this is the crux; the ACK is data in itself, so how can we know that the ACK has been received ? We would end up in an infinite loop of ACK's being sent between A and B.

 ___        ___
|   |      |   |   A: Sending data
| A |-->---| B |   B: Waiting 
|___|      |___|
 ___        ___
|   |      |   |   A: Waiting for ack for data received
| A |------> B |   B: Receiving data 
|___|      |___|
 ___        ___
|   |      |   |   A: Waiting for ack for data received
| A |---<--| B |   B: Sending ack for data received
|___|      |___|
 ___        ___
|   |      |   |   A: Receiving ack for data received
| A <------| B |   B: Waiting for ack for ack received 
|___|      |___|
 ___        ___
|   |      |   |   A: Sending ack for ack received
| A |-->---| B |   B: Waiting for ack for ack received 
|___|      |___|
 ___        ___
|   |      |   |   A: Waiting for ack for ack for ack received 
| A |------> B |   B: Receiving ack for ack received 
|___|      |___|

This would go on and on and on, as you can see, and therefore the two phase commit technology is used.

There are two steps in the two phase commit, as the name suggests:

Prepare - The initiating database ask all receiving databases to promise to either commit or rollback a given transaction
Commit - The receiving databases respond whether they are prepared to commit the transaction, and the initiating database request the commit, if all are ready.

If not all receiving databases respond that they are prepared, the initiating database will require everyone to rollback the transaction. When the commit is issued, all databases commit the transaction and send back an acknowledgement. No further communication is used, and it is expected and taken for granted that all data is simultaneously updated throughout the whole system.

In order to protect the commit process from network, software or hardware problems, the databases have a pending transactions table, which is used by the recovery process that is automatically invoked if the commit fails. After the decision has been made by the initiating database whether to commit or rollback, this decision will be honored by each system as soon as the become available again. The reason is that when the databases reply that they are prepared to commit, they have already committed the transaction in the pending transactions table. If they don't get an answer back that they should commit, the pending transaction stays until they get this message. The initiating database will send the commit until it gets the the acknowledgement back.

The downside with the two phase commit is, besides that it requires a lot of network traffic, it is also dependent on its weakest link. If something fails, the data entity will be locked for use. If you don't have valuable data, or trust your network technicians, you can just leave the two phase commit turned off.

SQL optimization	Two Phase Commit	Automatic teller machine	Transaction
Boston Latin School	Distributed system	atomicity	Diamond
acid	ATM	3pc	Melanie Griffith
Apostrophe Protection Society	The Promise Ring	Prepare	Disadvantage
abort	initiator	commit	Athenaeum
Protocol