Bernie's Blog A confusing concoction of Java, mobile devices, technology and photography

19 Aug 07

Batch Processing Pains

A recent project I worked on came with an interesting programming challenge - batch processing. Batch processing is non-trivial because of a combination of a few factors, and I'll try to go into detail on each of them.

Is transactional, yet...
Now by itself, this requirement isn't hard to achieve. There are numerous frameworks available that help - Spring and EJB immediately come to mind (both allow declarative transaction management, which is IMHO the best way to do this). However, if frameworks ain't your style, feel free to use pure JDBC. What makes transaction management so tough in batch processing is the fact that a batch is usually a long conversational process. A long conversation conflicts with the need to keep transactions short, so that locks and connections are released quickly and deadlocks are avoided. This is even more important when the batch processing is multi-threaded. If each thread holds on to a transaction for hours, your database connection pool will run dry (pun intended). To make matters worse, there is no way of finding out accurately how much time the batch needs to complete. It might be as quick as 10 minutes or it could span a few hours, depending on the volume of processing.

There are a few ways of going about this problem. One way is to make everything non-transactional and keep tabs on progress via status flags at different stages of the batch, using those to do manual rollbacks. This is usually a bad idea because even the DB access that updates the statuses can fail, and recovering from that usually requires a rollback - something you cannot achieve without transactions. The inability to roll back is potentially disastrous because there's no way to guarantee data integrity. Also, this sort of manual attempt to control transactions grows rapidly in complexity with the number of external systems and data stores you have. If the batch processing involves multiple data sources, relying on JTA makes your work much easier.

Of course, making the whole batch transactional would not do either. For example, having a method called doSomeBatchWork() and making it transactional would most likely result in a deadlock or, if you have a timeout declared, in a timeout exception should the batch take too long to complete. Extending the timeout makes matters worse because the longer the locks are held, the more likely a deadlock becomes (not to mention that with no way to predict a batch's completion time, the transaction might still time out). Bummer. The solution, of course, is to choose a middle ground: break up the batch processing into several method calls and make those transactional. How coarse or fine-grained those calls should be is up to you and the business rules to decide. Believe me, it is not easy ;)
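To make the middle-ground idea a bit more concrete, here is a minimal sketch in Java. It assumes the batch can be cut into chunks that are safe to commit independently, which is exactly the part the business rules have to decide. `ChunkedBatch`, `ChunkHandler` and the chunk size are made-up names for illustration and are not part of Spring, EJB or JDBC; the handler stands in for whatever transactional method you would really call.

```java
import java.util.ArrayList;
import java.util.List;

class ChunkedBatch {

    /** Hypothetical callback: processes one chunk inside its own short transaction. */
    interface ChunkHandler<T> {
        void handle(List<T> chunk) throws Exception;
    }

    /** Splits the items into consecutive chunks of at most chunkSize elements. */
    static <T> List<List<T>> split(List<T> items, int chunkSize) {
        if (chunkSize <= 0) {
            throw new IllegalArgumentException("chunkSize must be positive");
        }
        List<List<T>> chunks = new ArrayList<>();
        for (int i = 0; i < items.size(); i += chunkSize) {
            int end = Math.min(i + chunkSize, items.size());
            chunks.add(new ArrayList<>(items.subList(i, end)));
        }
        return chunks;
    }

    /**
     * Hands each chunk to the handler, one after the other, and stops at the
     * first chunk that fails. Returns the number of items in the chunks that
     * completed, so the caller knows how far the batch got.
     */
    static <T> int process(List<T> items, int chunkSize, ChunkHandler<T> handler) {
        int done = 0;
        for (List<T> chunk : split(items, chunkSize)) {
            try {
                // In real code: begin a transaction, do the work, commit.
                handler.handle(chunk);
                done += chunk.size();
            } catch (Exception e) {
                // In real code: roll back this chunk only. Earlier chunks
                // were already committed and stay committed.
                break;
            }
        }
        return done;
    }
}
```

With pure JDBC, the handler would typically call `setAutoCommit(false)` on the `Connection`, then `commit()` on success and `rollback()` on failure. With Spring or EJB, the handler would simply be a method marked as transactional. Either way, no single transaction lives longer than one chunk.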

Runs unmonitored
Usually, batch processing runs late at night when there is little load on the servers, which is good. But with no immediate human intervention, it has to be very reliable. So the processing has to take into account potential issues such as memory management (flushing to disk or to the data store is the usual practice), and if persistence frameworks are used, ways to use them effectively for batch processing have to be explored. However, even though the system is designed to be resilient, it must also be designed for failure. A paradox? Yes, but a necessity nevertheless. Batch processing must have sufficient and informative logging to allow easy debugging, as well as the ability to see at which point it failed. It must also have ways to send out system alerts when a failure occurs. There must be provisions for it to retry before ultimately giving up. And finally, assuming the system did fail, the batch must have the option of resuming from where it left off or starting all over again. Unfortunately, there is no magic bullet for these issues. However, if you've already been practicing good programming, some of these shouldn't come off as too difficult. Having a strong foundation in programming definitely helps as well. Reliable and fault-tolerant systems ain't easy to design and code.
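The retry, logging and resume requirements can be sketched roughly as follows. This is only an illustration under simplifying assumptions: `ResumableBatch`, `Step` and `Checkpoint` are hypothetical names, the checkpoint lives in memory (a real one would be persisted to a table or file so it survives a crash), retries happen immediately with no back-off, and "logging" is just a line on standard error.

```java
import java.util.List;

class ResumableBatch {

    /** Hypothetical unit of batch work that may fail. */
    interface Step {
        void run() throws Exception;
    }

    /** Hypothetical in-memory checkpoint; real code would persist this. */
    static class Checkpoint {
        private int nextStep = 0;

        int nextStep() {
            return nextStep;
        }

        void advance() {
            nextStep++;
        }
    }

    /**
     * Runs the steps starting at the checkpoint. Each step is tried up to
     * maxAttempts times. Returns true if every remaining step succeeded,
     * false if a step exhausted its attempts. In the failure case the
     * checkpoint still points at the failed step, so a later call resumes
     * there instead of starting all over again.
     */
    static boolean run(List<Step> steps, Checkpoint checkpoint, int maxAttempts) {
        for (int i = checkpoint.nextStep(); i < steps.size(); i++) {
            boolean ok = false;
            for (int attempt = 1; attempt <= maxAttempts && !ok; attempt++) {
                try {
                    steps.get(i).run();
                    ok = true;
                } catch (Exception e) {
                    // Log enough to see which step failed and why.
                    System.err.println("step " + i + ", attempt " + attempt
                            + " failed: " + e.getMessage());
                }
            }
            if (!ok) {
                // Giving up: this is where a system alert would be sent.
                return false;
            }
            checkpoint.advance();
        }
        return true;
    }
}
```

To start all over again rather than resume, the caller would simply pass a fresh `Checkpoint`.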

Must be scalable
Now this is another headache, because the trick is how to achieve this without messing up data. The key to scalability is to ensure that the work can be divided into independent work units. The reason is that these work units can then be run on multiple threads, or sent over a messaging bus to other servers for processing. This way, whenever a CPU upgrade takes place or, in the second scenario, servers are added to the server farm, your batch processing should see an increase in performance. However, the challenge is that some business processes cannot run concurrently, or you risk losing data integrity. Therefore scalability is not so much a programming challenge anymore but rather the art of finding the balance between getting the best multi-threaded performance and still respecting the business rules. In other words, it is a design challenge. This does not mean that programming for scalability is easy; I'm saying it's easier than before, especially with the advent of JMS and the concurrency packages in JDK 5 and above. And don't forget about clustering (which is quite common in JEE projects) and all the programming caveats that come with it. Luckily, clustering support has matured so much in most application servers that it's almost just a matter of configuration.
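As a rough illustration of dividing work while respecting a "these must not run concurrently" rule, the sketch below groups work units by a business key and gives each group to a thread pool as a single task. Units that share a key run one after another in their original order, while different keys run in parallel. `PartitionedBatch` and the idea of a per-unit key are assumptions made for this example; only the `java.util.concurrent` classes are real JDK APIs.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Consumer;
import java.util.function.Function;

class PartitionedBatch {

    /** Groups work units by key, keeping the original order inside each group. */
    static <T, K> Map<K, List<T>> partition(List<T> units, Function<T, K> keyOf) {
        Map<K, List<T>> groups = new LinkedHashMap<>();
        for (T unit : units) {
            groups.computeIfAbsent(keyOf.apply(unit), k -> new ArrayList<>()).add(unit);
        }
        return groups;
    }

    /**
     * Processes all units on a fixed-size pool. Each group is one task, so
     * units with the same key never run concurrently. Returns the number of
     * units processed.
     */
    static <T, K> int runAll(List<T> units, Function<T, K> keyOf,
                             Consumer<T> worker, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Callable<Integer>> tasks = new ArrayList<>();
            for (List<T> group : partition(units, keyOf).values()) {
                tasks.add(() -> {
                    for (T unit : group) {
                        worker.accept(unit);
                    }
                    return group.size();
                });
            }
            int processed = 0;
            // invokeAll blocks until every task has finished.
            for (Future<Integer> result : pool.invokeAll(tasks)) {
                processed += result.get();
            }
            return processed;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("batch interrupted", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("a work unit failed", e.getCause());
        } finally {
            pool.shutdown();
        }
    }
}
```

Raising `threads` is the in-process equivalent of the CPU upgrade mentioned above. Spreading the groups across servers would mean sending them over JMS instead of submitting them to a pool, which this sketch does not attempt.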

Suffice it to say that the recent project has been a roller coaster ride. There were a number of times when all the problems seemed insurmountable. Then came a time of celebration for solving all of those, only to find more issues later on. This cycle repeated itself a couple of times, more frequently than in other projects I've been involved with. I'm grateful though for the open source community and the excellent work they've come up with. Hibernate, Spring and Quartz were used heavily to speed up development and they are definitely production-worthy. Hibernate really eased the pain of transaction management and modeling the database. Quartz surprised me by providing support for clustered, persistent scheduling. Spring worked its magic by being the glue that held the project together (as well as providing many convenience classes). I could go on and on about them, but these projects take up a lot of space on bookshelves already, so I suggest reading a good book about them instead.

P.S.: Keep an eye on Spring Batch.
