There’s always a next problem.
A few days ago, Subhash, our web app’s main programmer, passed along his analysis of a question one of Margie’s customers had raised. He ended the analysis with a suggestion that we meet to discuss the implications. I read the message, filed it, thought about the implications a bit, re-read the note more carefully, and seconded the suggestion: we’d lost a transaction. I don’t see how we could have prevented it, and I cannot imagine an easily implemented workaround. This is a little troubling.
A couple of days after reading my response, Mel (my boss, and the implementation project’s manager) was wondering aloud whether we’d gotten to a bad place. I don’t yet have an informed opinion on that, but I spent much of yesterday researching the issue, and the results are not encouraging. I figure I’m three or four days from being able to discuss it intelligently.
The app runs through this processing sequence at Checkout:
- User presses [Order Now].
- If this is a Credit Card transaction:
  - The web app debits the card.
  - The credit card agency acknowledges the billing.
- Account customers are approved earlier in the flow, and billed later.
- The app passes the transaction to the back end app.
- The back end acknowledges receipt of the transaction.
- The back end app processes the transaction.
- The back end app notifies the web app that the documents are available.
- If this is an account transaction, the customer’s account is billed.
- The web app retrieves the documents and stores them.
- The web app emails the user with a link to the appropriate URI.
- The user retrieves the documents from the web app’s Order History page.
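The sequence above can be written down as a small model, which makes the difference between the card and account paths easier to see. This is only a sketch: the `Payment` enum and the `checkout_steps` function are hypothetical names for illustration, not part of either app.

```python
from __future__ import annotations

from enum import Enum


class Payment(Enum):
    CARD = "card"
    ACCOUNT = "account"


def checkout_steps(payment: Payment) -> list[str]:
    """Return the ordered checkout steps for one order, per the list above."""
    steps = ["user presses [Order Now]"]
    if payment is Payment.CARD:
        # Card customers are billed up front, before the back end sees the order.
        steps += ["web app debits the card", "card agency acknowledges the billing"]
    steps += [
        "web app passes transaction to back end",
        "back end acknowledges receipt",
        "back end processes transaction",
        "back end notifies web app that documents are available",
    ]
    if payment is Payment.ACCOUNT:
        # Account customers were approved earlier and are billed only now.
        steps.append("customer's account is billed")
    steps += [
        "web app retrieves and stores documents",
        "web app emails user a link",
        "user retrieves documents from Order History",
    ]
    return steps


if __name__ == "__main__":
    for number, step in enumerate(checkout_steps(Payment.CARD), start=1):
        print(number, step)
```

Note that in both paths the billing happens whether or not the web app ever hears the back end's acknowledgement.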
The processing time after the back end’s transaction acknowledgement is quite variable. We need to study this in more detail, but it clearly ranges from a few minutes to perhaps four hours. Getting better data on this is one of next week’s projects. I expect it to be outside the main discussion, though, as the ACK itself appears to be the issue.
Most of the time the process works, or fails in harmless (albeit annoying) ways. But failures during step three (the hand-off to the back end and its acknowledgement) are, well, inconvenient: the transaction might be completely lost, but more likely it will be completed and then improperly delivered. Two scenarios appear to be troublesome:
- The app “stalls” just after [Order Now], so the user presses the button again.
  - Two documents, and two billings, are produced.
  - This is an operator error, but we can’t treat it that way.
  - We can perhaps mitigate this failure by modifying the application.
- The app loses the connection as the back end acknowledges receipt.
  - Since that receipt does not reach the web app, that app times the transaction out.
  - The back end app continues to process the transaction (and the customer is billed).
  - Since the web app has discarded the transaction record, the document, although completed, never gets delivered.
  - Typically, the user repeats the obviously-lost transaction the next workday.
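The lost-acknowledgement scenario can be reproduced in a few lines. The sketch below is a toy model, not our code: `BackEnd`, `place_order`, and the `key` argument are all hypothetical. The key illustrates one commonly discussed mitigation, an idempotency key that lets the receiver recognize a repeated order; nothing here suggests it would be easy to retrofit to our two apps.

```python
class BackEnd:
    """Toy back end that bills every order it accepts."""

    def __init__(self):
        self.billings = []    # one entry per billed order
        self.seen_keys = {}   # idempotency key -> receipt already issued

    def submit(self, order, key=None):
        """Accept an order; skip billing if this key was already seen."""
        if key is not None and key in self.seen_keys:
            return self.seen_keys[key]   # duplicate: hand back the first receipt
        receipt = f"receipt-{len(self.billings) + 1}"
        self.billings.append(order)
        if key is not None:
            self.seen_keys[key] = receipt
        return receipt


def place_order(back_end, order, ack_lost, key=None):
    """Web-app side: return the receipt, or None if the ACK never arrived."""
    receipt = back_end.submit(order, key)   # the back end always does its work
    return None if ack_lost else receipt    # but the web app may never hear of it


# The scenario as described: the ACK is lost, and the user reorders later.
plain = BackEnd()
first = place_order(plain, "doc-A", ack_lost=True)
second = place_order(plain, "doc-A", ack_lost=False)
print(first, second, len(plain.billings))   # None receipt-2 2

# The same two attempts, both carrying one hypothetical key.
keyed = BackEnd()
place_order(keyed, "doc-A", ack_lost=True, key="order-123")
retry = place_order(keyed, "doc-A", ack_lost=False, key="order-123")
print(retry, len(keyed.billings))           # receipt-1 1
```

The keyed version only helps if the repeat attempt carries the same key as the original. A double press of [Order Now] could plausibly do that; a user re-entering the order the next workday probably would not, which is part of why the second scenario looks harder.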
Both scenarios result in duplicate processing, and generate two billings. The users typically figure out there’s a problem when they reconcile their bills. We don’t yet know the scope of either problem, but the second scenario needs to be extremely rare to avoid becoming a major concern for a system with a normal workload approaching a thousand transactions each day. Scoping those details out will be another project for next week.
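To put "extremely rare" in numbers, here is a bit of arithmetic. The failure rates below are made-up placeholders, since we haven't measured the real one yet; only the thousand-a-day workload comes from our own figures.

```python
DAILY_TRANSACTIONS = 1000  # "approaching a thousand" transactions each day


def expected_failures(rate, days=1):
    """Expected number of failed transactions over `days` at a given per-transaction rate."""
    return rate * DAILY_TRANSACTIONS * days


if __name__ == "__main__":
    # Hypothetical rates, for illustration only.
    for rate in (1e-2, 1e-3, 1e-4):
        per_month = expected_failures(rate, days=30)
        print(f"{rate:.2%} failure rate: about {per_month:.0f} per 30 days")
```

Even at one failure in ten thousand transactions, that works out to roughly three affected orders every thirty days of operation.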
Two points are clear. First, the problems result directly from the decoupled design, and will not be easily fixed. Second, we must reduce the network latencies, as the problems are triggered by messaging delays. We will be asking both coding teams to suggest ways to improve transaction reliability.
I spent yesterday afternoon searching for discussions on the web, as it was clear to us that our problem wouldn’t be unique to our application. That led me into the thickets and jungles of WS-Transaction and its kindred. ACID lives, uncomfortably, in that space as well. This could be an interesting venture….