Managing Batch Processing in an SOA
Most enterprise IT operations rely heavily on batch processing. That reliance doesn't go away when you move to a service-oriented architecture (SOA), yet to many people SOA means only online transaction processing. Sridhar Sudarsan has met this problem head on. As an executive IT architect with IBM's Software Lab Services, he has led enterprise architecture solutions for customers worldwide, including major enterprises in the finance, public-sector and automobile industries.
For Sudarsan's clients, batch processing remains a big question mark as clients migrate toward SOAs.
"Almost every legacy modernization scenario that I work with asks, 'When I transform to SOA, what about the batch applications and systems?'" Sudarsan said.
Sudarsan co-created IBM's batch programming model in Java 2 Platform, Enterprise Edition (J2EE), and spends much of his time on batch modernization efforts. He also gives presentations on the topic, such as "Batch Processing Best Practices in SOA: Transformation Scenarios," which he delivered at the Open Group's Enterprise Architecture Practitioners Conference in San Francisco on January 29th.
Have Batch and Real-Time Processing, Too
Back when batch processing could simply run overnight, the architecture was simple. It consisted of job submission, scheduling (using a scheduler and a dispatcher), and execution. You'd have a large stream of data, processed in a loop, with some sort of checkpointing mechanism.
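That classic loop can be sketched in a few lines of Java. This is a minimal illustration only; the record source, checkpoint store, and method names here are hypothetical stand-ins, not part of any IBM model:

```java
import java.util.List;

/** Minimal sketch of a classic batch loop with checkpointing (illustrative only). */
public class BatchLoop {
    interface RecordSource { List<String> nextChunk(long offset, int size); }
    interface CheckpointStore { long load(); void save(long offset); }

    static void run(RecordSource source, CheckpointStore checkpoints, int chunkSize) {
        long offset = checkpoints.load();           // resume from the last checkpoint
        List<String> chunk;
        while (!(chunk = source.nextChunk(offset, chunkSize)).isEmpty()) {
            for (String record : chunk) {
                process(record);                    // per-record business logic
            }
            offset += chunk.size();
            checkpoints.save(offset);               // commit progress so a restart resumes here
        }
    }

    static void process(String record) {
        // placeholder for real business logic
    }
}
```

The checkpoint is saved once per chunk rather than per record, which is the "some kind of bulking" trade-off Sudarsan discusses below.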
No more. Modern batch processing has to happen while everything else is happening, more or less. So you have to deal with shrinking batch windows and the concomitant need for rigorous scheduling and optimized IT resources. You have to integrate batch processing into modern design methods and run it on Java and multiple platforms, offloading some processing to UNIX platforms where computation is cheaper.
For some, the logical conclusion of having to integrate batch processing into modern design methods means that all processing will become transactional.
"If I hit a SUBMIT, I want my answer right away," Sudarsan observed. But he firmly believes that's not happening any time soon.
"From an efficiency standpoint you need to have some kind of bulking. You can't process every request at the same time," he explained. That's why you need rigorous scheduling.
Sudarsan doesn't see batch processing being replaced by online transaction processing (OLTP). However, he has found that enterprises seeking a competitive advantage (or, sometimes, sheer survival) need to blend and integrate batch and real-time/online processing.
With the integration of batch and real-time processing in place, enterprises realize cost advantages by maintaining fewer systems, as well as by skills consolidation. The same people run the batch and OLTP systems, using an open and flexible architecture. Processing systems get spread out across various geographies. So batch processing occurs more often, in smaller batches, in more locations, concurrently with OLTP.
Java Coders Don't Know Batch
Sudarsan acknowledged that there's a fear factor involved in talking about touching an enterprise's batch processing systems.
"People are really scared to touch legacy systems," he admitted. "The creators of systems written 15 or 20 years ago are retiring. And the new people are not skilled in these areas, including batch processing and how to integrate it with your current systems and business context."
There's no way around it, though.
"Customers can't afford to maintain two code bases for Java applications and batch jobs," he explained. Moreover, both "need to reuse the same logic but can't."
The need for 24x7 processing means that batch applications employing the old "hog the system at the end of the day" approach must be replaced. Also, you can't just make everything OLTP because current hardware and software won't allow it. Sudarsan understands these limitations because he tried to create a real-time solution once.
"I was also a victim of one of these situations where you come up with an ambitious architecture," he said. "We tried to make the whole thing real-time, but when we tried to test and scale it up to production, it all broke down."
However, he did find that integrating batch and OLTP was both useful and necessary.
SOA Enables Batch/OLTP Integration
Sudarsan summed up the need for SOA to enable integration in a modernization effort, citing a Gartner Research quote:
"…business function used in online transactions may be the same business function used in batch processes, so organizations should think about their IT modernization strategy and consider SOA as a standardized application integration mechanism."
Sudarsan now uses service composition to apply batch in an SOA environment.
"A batch environment can be applied as a lightweight wrapper to an existing OLTP application infrastructure," he said. So your process choreography makes batch a business process step -- another legacy integration approach. You no longer do OLTP from 8 a.m. to 8 p.m. and batch processing from 8 p.m. to 8 a.m. Instead, you run batch processing jobs throughout the 24-hour day.
Conceptually, you divide SOA-based batch workload management into three parts: batch clients, batch scheduler and workload resource managers. The clients require services for planning, scheduling, execution, monitoring and management of batch jobs. The scheduler plans, optimizes, triggers and choreographs unattended execution of jobs or networks of jobs. The scheduler provides services for monitoring and managing batch jobs. Then the workload resource managers dispatch application workloads across all available resources to match job policies and service level agreement requirements.
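The three-part division can be expressed as a small set of contracts. The sketch below is illustrative only; the class, method, and status names are hypothetical, not drawn from any product:

```java
import java.util.HashMap;
import java.util.Map;

/** Sketch of the client/scheduler/resource-manager split (all names hypothetical). */
public class BatchWorkload {
    public enum JobStatus { PLANNED, DISPATCHED }

    /** Resource manager: dispatches workloads across available resources per policy/SLA. */
    public interface WorkloadResourceManager {
        void dispatch(String jobId);
    }

    /** Scheduler: plans jobs, triggers unattended execution, offers monitoring to clients. */
    public static class Scheduler {
        private final Map<String, JobStatus> jobs = new HashMap<>();
        private final WorkloadResourceManager resources;

        public Scheduler(WorkloadResourceManager resources) { this.resources = resources; }

        /** Batch-client entry point: plan a job and return its id. */
        public String submit(String jobName) {
            jobs.put(jobName, JobStatus.PLANNED);
            return jobName;
        }

        /** Trigger execution: hand the job to the resource manager. */
        public void trigger(String jobId) {
            resources.dispatch(jobId);
            jobs.put(jobId, JobStatus.DISPATCHED);
        }

        /** Monitoring service for clients. */
        public JobStatus monitor(String jobId) { return jobs.get(jobId); }
    }
}
```

The point of the split is that the scheduler never cares what sits behind `WorkloadResourceManager` -- a PC cluster, a grid pool, or a mainframe, as the next paragraph notes.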
The batch execution environments can be anything from clusters of PCs to grid pools and distributed environments, right up to good old mainframes.
IT groups trying to bring batch processing into the 21st century often make the same kinds of mistakes, according to Sudarsan. Typically, such groups work up a complicated application support infrastructure and security system. They spin their wheels developing overengineered frameworks. They make excessive use of third-party libraries, effectively throwing money at the problem.
Java application developers get too much control of the process -- despite their general lack of knowledge about batch processing systems. Meanwhile, the platform support staff lacks the skills to argue their case.
Detail the Business Logic Before Coding
Sudarsan has adopted a lessons-learned approach when it comes to moving batch processing into an SOA environment. His list of best practices starts with avoiding the "Ready! Fire! Aim!" orientation of so many IT shops.
If you just do a "code translation" of batch applications from, say, COBOL to Java without due diligence, it will probably prove expensive and unworkable. Moreover, you'll produce unreadable, unmaintainable, inefficient code. If you want to understand what the platform-independent model for the batch jobs would be, you have to abstract the business logic and flows.
Sudarsan advocated using manual processes and tools to create these business logic/flow artifacts. Doing so ensures that current implementations and future requirements will be captured, he said. Automated design processes won't do that.
Speed Trumps Portability
Next, Sudarsan said you need to push down processing as close to the system as possible for efficiency. This can be hard for Java programmers to swallow, because they've been so indoctrinated about the virtue of portability above all else.
But even if Java is being used as the target batch platform, that isn't a mandate for implementing every component or job in Java.
For example, you should push sorting down to the operating system level. You can still invoke it from Java. Partitioning shouldn't be done in Java either, nor should operational interactions (operating system-specific logging, auditing and monitoring). You may do these in Java in OLTP systems, but in batch processing the sheer volume of data (files, databases) precludes it.
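Invoking an OS-level sort from Java is straightforward. This is a sketch only, assuming a UNIX-like host with the POSIX `sort` utility on the path; the class and method names are hypothetical:

```java
import java.io.IOException;

/** Sketch: delegate sorting to the OS-level sort utility instead of sorting in Java. */
public class OsSort {
    /** Sorts inputPath into outputPath using POSIX sort; assumes a UNIX-like host. */
    public static void sortFile(String inputPath, String outputPath)
            throws IOException, InterruptedException {
        Process p = new ProcessBuilder("sort", "-o", outputPath, inputPath)
                .redirectErrorStream(true)   // fold stderr into stdout for simpler handling
                .start();
        int exit = p.waitFor();
        if (exit != 0) {
            throw new IOException("sort exited with status " + exit);
        }
    }
}
```

The Java side only orchestrates; the data never passes through the JVM heap, which is the efficiency argument Sudarsan is making.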
You should only opt for a more portable implementation if the batch flows are unusually light. But Sudarsan said that overall, "pure data processing or ETL [extract/transform/load]-type processing is best done at the data level rather than at the object level."
Aim for Efficient Data Access
Batch performance problems can bog down the whole system. The key to avoiding them is efficient data access, Sudarsan said. He listed several areas to exploit to ensure speedy access while designing your data model:
- Separation of read-only from update
- Global file systems
- Data partitioning and movement
- Bulk data access vs. single record at a time
- Data federation/transformation/virtualization
- Declarative ways of specifying a job's data intent and needs
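The bulk-vs.-single-record point amounts to chunking: gather records and hand each chunk to one bulk operation (say, a JDBC `executeBatch` or a bulk-load utility) rather than making a round trip per record. A minimal, generic sketch (the class and method names are hypothetical):

```java
import java.util.List;
import java.util.function.Consumer;

/** Sketch: process records in bulk chunks rather than one at a time. */
public class BulkAccess {
    /**
     * Splits records into chunks of chunkSize and hands each chunk to a bulk
     * operation -- e.g. a batched database update or a bulk file write.
     */
    public static <T> void inBulk(List<T> records, int chunkSize, Consumer<List<T>> bulkOp) {
        for (int i = 0; i < records.size(); i += chunkSize) {
            bulkOp.accept(records.subList(i, Math.min(i + chunkSize, records.size())));
        }
    }
}
```

With a chunk size in the hundreds or thousands, the per-record overhead (network, transaction, logging) is amortized across the whole chunk.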
Put Data Close to the Application Layer
Lastly, according to Sudarsan, you'll greatly improve throughput if you place data close to the application layer.
Performance bogs down if your application-layer and data-layer interaction is chatty. You can also bog down performance if you place data in a different subsystem from the application -- doing so leads to overhead problems, which are caused by network load, translation, serialization and the like.
Do the Migration Four-Step
Your current implementation probably employs several distinct batch functions. If so, Sudarsan said you need to do four steps for each function:
- Ask whether a particular step is even required as a separate batch function when implementing it in Java. Maybe it can be merged, or perhaps it needs to be split. This should take 10 percent of the total time for one implementation.
- Identify the batch function's components -- data streams, logic, checkpoint and relevant job control parameters. This should take 30 percent of your time.
- Write the business logic independent of the batch container logic; provide an API that the batch application would call. This should take 20 percent of your time.
- Test (unit, performance and scalability) and tune. This should take 40 percent of your time.
This process enables you to develop multiple jobs in parallel, according to Sudarsan, as long as you have the staffing to support a scalable development model.
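The third step -- business logic kept independent of the batch container, behind an API -- can be sketched as follows. The domain (interest calculation), class names, and parameter values here are all hypothetical illustrations:

```java
/** Sketch of step 3: business logic behind an API, independent of the batch container. */
public class InterestCalculator {
    /** Pure business logic: callable from OLTP code and from a batch step alike. */
    public double applyInterest(double balance, double annualRate, int days) {
        return balance * (1.0 + annualRate * days / 365.0);
    }
}

/** Thin container-facing step: concerns itself only with records, not business rules. */
class InterestBatchStep {
    private final InterestCalculator logic = new InterestCalculator();

    public double processRecord(double balance) {
        return logic.applyInterest(balance, 0.05, 1);  // rate and period are illustrative
    }
}
```

Because the calculator knows nothing about checkpoints, job control or data streams, the same class serves the OLTP path -- which is exactly the "reuse the same logic" problem Sudarsan raised earlier.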
Sudarsan wound up his presentation by describing IBM's batch programming model for J2EE. Its key features include asynchronous execution, transactional architecture, record-oriented and container-managed design, a true batch programming model, implementation with J2EE components, and use of XML job control language (xJCL). He finished by describing how he designs Java wrappers, which execute existing batch processing applications on mainframes.
His bottom line for Java programmers is that they can develop batch processing systems just fine, as long as they acquaint themselves with the peculiar needs of 24x7 batch processing and how these needs dovetail with those of OLTP.