Hmmm... I just realized I don’t talk much about work here. Well, this post is as good as any to start :). I’ve just finished co-developing a very fun project. Fun in the sense that it’s a financially critical system and a high-transaction one. I’ve never really been involved in heavy-duty projects before, and this experience was refreshing. It was a rewrite of an existing application, one that was starting to burst at the seams as load increased. The project is for DiGi, Malaysia’s No. 1 prepaid telco provider. It shouldn’t be a surprise then that the project has to do with its prepaid business. It doesn’t have an internal code name, but its commercial name should be Flexi e-load (I could be mistaken, so correct me if you know better :)). About 30% of DiGi’s reloads are done using this system, peaking at about 20 transactions per second (with rumors that this load might increase a lot). Internal testing showed that the upgraded system can take about 130 transactions per second (up from a paltry 7), so I believe we have headroom here :)

OK, now let’s talk about Flexi e-load’s guts. It runs on 4 Sun SPARC servers: 3 application servers and 1 database server. The application servers are hardware load balanced and clustered to provide high availability. The J2EE application server used is BEA WebLogic 8. The parts of the J2EE stack used are JDBC connection pooling and JMS, both exposed via JNDI. JMS is great for fast asynchronous transactions (which suits the business requirements of Flexi e-load). Believe it or not, the previous application wasn’t even threaded (this is partly why it crumbled when load increased). Earlier upgrades included some threading, but that too was bug-riddled. Switching to JMS provided a clean and easy way of multi-threading the application. Another problem was that data was being transferred among classes via XML. Yes, XML. It was slow and memory-hungry (to put it lightly). I have no idea why this was used - every single class that communicated with another had a bunch of XML parsing and XML building code, resulting in huge bloat, none of which was easy to debug. So another part of the optimization was removing most of the XML junk and relying on POJOs and MDBs instead. Maybe XML was used to facilitate integration with external systems? I guess we’ll never know at this point, everything’s been removed :). Yet another problem was the persistence layer used - iBATIS. I want to like the framework (really), but coding in SQL is soooo yesterday. I don’t mind hand-tuning certain SQL for performance, but iBATIS forces the developer to write every single SQL statement required. Dang, that hurt. It also didn’t help that some of the SQL was so badly written (performance-wise) that tweaking it was just such a pain (in the midst of the pain, I wished Hibernate had been used instead).
Because the SQL was handwritten, a lot of badly performing statements crept in (see below), and they were very hard to fix since rewriting them is risky (we’re talking about business rules here). Database access was also way too frequent. A single business transaction sometimes resulted in 5 to 10 database hits. I’m not joking. Lookup tables weren’t cached, and objects already in memory were re-queried for no good reason (feels like the developers working on different modules weren’t communicating properly, yes?). If I were asked to summarize the optimizations performed, it would be: removal of all the XML junk, use of JMS for asynchronous, multi-threaded processing, and reduction of database hits (on average, an access to a small lookup table incurs a 600ms wait, assuming no locks, while a read from a HashMap-cached version takes 2ms). That’s not to say those were the only tweaks though. I know that the rewrite of the multi-threaded socket connectors to the SMSC servers also improved reliability and performance.
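
To give a feel for the lookup-table caching mentioned above, here’s a minimal sketch. It is not the actual Flexi e-load code: the `LookupCache` class, the loader function, and the "RM10" denomination codes are all made up for illustration, and the loader simply stands in for whatever query hits the lookup table. I’ve used a `ConcurrentHashMap` rather than a plain `HashMap` on the assumption that the cache is read from several threads.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Function;

/** Read-through cache for small, rarely changing lookup tables (hypothetical sketch). */
final class LookupCache<K, V> {
    private final Map<K, V> cache = new ConcurrentHashMap<>();
    private final Function<K, V> loader; // stands in for the real database query

    LookupCache(Function<K, V> loader) {
        this.loader = loader;
    }

    /** Returns the cached value; the loader runs only on the first request for a key. */
    V get(K key) {
        return cache.computeIfAbsent(key, loader);
    }

    /** Drops every entry so the next read goes back to the database. */
    void clear() {
        cache.clear();
    }
}

final class LookupCacheDemo {
    public static void main(String[] args) {
        AtomicInteger dbHits = new AtomicInteger();
        // Hypothetical loader: pretends to query a denomination lookup table.
        LookupCache<String, Integer> denominations = new LookupCache<>(code -> {
            dbHits.incrementAndGet();                   // count simulated database hits
            return Integer.parseInt(code.substring(2)); // e.g. "RM10" -> 10
        });
        for (int i = 0; i < 1000; i++) {
            denominations.get("RM10");
        }
        System.out.println("reads=1000, database hits=" + dbHits.get());
    }
}
```

A thousand reads cost one simulated database hit. The trade-off is staleness: if the lookup table changes, something has to call `clear()` (or the server has to be restarted), which is only acceptable for data that rarely changes.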

Speaking of database hits, part of the performance bottleneck was (still is?) the database - Oracle. OK, granted that application developers don’t know a lot about databases, but I question whether any thought at all was put into optimizing the database before this. I think a little bit of database knowledge is essential for every programmer, more so after this project. Indexing was totally absent from most of the tables (I’m not talking about PK indexes, those were there). Because of this, Oracle’s explain plan showed that most of the SQL queries were resulting in full table scans. There was also too much dynamic SQL (see, I told you developers write bad SQL), resulting in very little SQL re-use (a high parse-to-execute ratio). Then there’s the matter of partitioning and block size, neither of which was touched (some of these should really have been considered). Let’s not go into other known tweaks - pinning lookup tables in the db cache (some of these were cached at the application level eventually), optimizing the cache-hit ratio in the SGA, locking the SGA in memory (we have 4GB of RAM…). Furthermore, some odd options such as db tracing and logging were enabled, gasp! In fact, they’re still enabled. I’m gonna try to have them disabled unless there’s a good reason why tracing and logging are enabled on a fully functional OLTP database.
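
Here’s a rough illustration of the dynamic SQL point. Again, this is a sketch under assumptions, not the real code: the `subscriber` table, its `balance` and `msisdn` columns, and the phone numbers are invented. When a value is concatenated into the statement, every distinct value produces a distinct SQL text, and the database has to parse each one afresh; with a bind variable (a JDBC `PreparedStatement` with a `?` placeholder) the text stays the same and can be re-used. The demo only counts distinct statement texts, so it runs without a database; `fetchBalance` shows what the bound version would look like against a real connection.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

final class BindVariableSketch {
    // Hypothetical table and column names, for illustration only.
    static final String BOUND_SQL = "SELECT balance FROM subscriber WHERE msisdn = ?";

    /** Dynamic SQL: the literal is baked into the text, so each value yields a new statement. */
    static String dynamicSql(String msisdn) {
        return "SELECT balance FROM subscriber WHERE msisdn = '" + msisdn + "'";
    }

    /** Bind-variable version: identical SQL text every time, only the bound value changes. */
    static long fetchBalance(Connection conn, String msisdn) throws SQLException {
        try (PreparedStatement ps = conn.prepareStatement(BOUND_SQL)) {
            ps.setString(1, msisdn);
            try (ResultSet rs = ps.executeQuery()) {
                return rs.next() ? rs.getLong(1) : 0L; // 0 when no row matches (sketch only)
            }
        }
    }

    /** Counts how many distinct statement texts the database would see for these values. */
    static int distinctStatements(List<String> msisdns, boolean useBinds) {
        Set<String> texts = new HashSet<>();
        for (String m : msisdns) {
            texts.add(useBinds ? BOUND_SQL : dynamicSql(m));
        }
        return texts.size();
    }

    public static void main(String[] args) {
        List<String> msisdns = Arrays.asList("0161111111", "0162222222", "0163333333");
        System.out.println("dynamic SQL texts: " + distinctStatements(msisdns, false));
        System.out.println("bound SQL texts: " + distinctStatements(msisdns, true));
    }
}
```

Three values give three statement texts the dynamic way and one with the bind variable. As a bonus, the bound version also sidesteps SQL injection, since the value never becomes part of the statement text.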

So that’s what I’ve been up to for the past 3 months. Now I’m working on some parts of Flexi e-load’s reports (accessible via the Web) and on general performance. The tech lead for the project resigned recently, which is a great loss for the team (he was responsible for maybe 70%-80% of the rewrite). It was a pleasure working with you, man! I certainly wish you a great future with the new company.

Looking forward, hopefully, there’ll be a new web project for me to work on. I wanna use AppFuse ;)