A Non-Blocking Call of an External Service Inside a Process

Articles 4 months ago

Stas Makarov

Quite often one has to make API calls to various external services. In essence, it is a standard system orchestration scenario, or even a microservice orchestration scenario (sorry for the buzzword). It looks simple and logical on a BPMN diagram: we knock at some door using an API, receive a response, and move on to the next task. For analysis-level models, that is perfectly fine.

Synchronous and asynchronous running

BPM engines support two types of task running: synchronous and asynchronous. In my opinion, those names are rather unfortunate and unfitting, and only add to confusion.

In fact, it is all about transaction boundaries, not real asynchrony. In the context of BPMN, asynchronous execution of a task determines how the process stores its state and handles step execution.

When you mark a task as asynchronous (e.g., with asyncBefore in Camunda, or async in Flowable), the process execution is explicitly split into separate transactions. At the async boundary, the engine persists the current state of the process in the database and creates a job in the job queue. A special component, the Job Executor, later picks up that job and continues execution in a new transaction. (Camunda's asyncAfter differs slightly: the split happens after the task executes but before the next step.)

Therefore, asynchrony in BPMN is a way of handling transactions or task processing via a queue and not parallel execution of code or threads.

Synchronous tasks

Service tasks are created as synchronous by default. There is certain logic in that: the engine works faster this way, since there is no need to store the process state in the database after every task. Everything is executed within the same transaction until we encounter a user task or another element that causes a waiting state. For simple tasks, which, for example, change the value of a process variable, carry out trivial calculations or write something to the log, this works fine. But when the result is not guaranteed and a failure is highly probable, synchronous execution is highly undesirable: a failure causes a transaction rollback, the results of all previous tasks in that transaction are cancelled, and the process throws an error. Do you really need any of that?

In general, if there is even a little potential risk that a certain task may fail to complete because of some external conditions, don't leave it synchronous by default. The asynchronous option is not a silver bullet either, but we are going to discuss that below. It is better than nothing, anyway.

Asynchronous tasks and Fail Retry

Now, let's look at asynchronous tasks. As you have already realized, asynchrony in BPMN doesn't imply independent execution. The token won't move forward until the task is completed. The only difference is this: now it is not the BPM engine that is responsible for the task execution, but a special component called the Job Executor. That is to say, all asynchronous tasks from all processes are lumped together in a common queue, and the Job Executor executes them in queue order. If the task is completed successfully, the Job Executor informs the process about it, and the token moves forward.

In other words, we have artificially created a new transaction boundary, and now the results of the previous tasks won't be cancelled if something goes wrong.
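The effect of the boundary can be illustrated with a toy model in plain Java. This is only a sketch: it uses no engine API, and the names (TransactionBoundaryDemo, Step, run) are made up for the illustration. Work done since the last commit is lost when a task fails, while everything before an async boundary has already been committed.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of engine transactions; it does not use any BPM engine API. */
class TransactionBoundaryDemo {

    /** One task: its name, whether it is marked asynchronous, and whether it throws. */
    record Step(String name, boolean async, boolean fails) {}

    /** Runs the steps in order and returns the names of the tasks whose results were committed. */
    static List<String> run(List<Step> steps) {
        List<String> committed = new ArrayList<>();
        List<String> openTransaction = new ArrayList<>(); // work not yet committed
        for (Step step : steps) {
            if (step.async()) {
                // Async boundary: commit the state reached so far
                // and continue in a new transaction.
                committed.addAll(openTransaction);
                openTransaction.clear();
            }
            if (step.fails()) {
                // Only the currently open transaction is rolled back.
                return committed;
            }
            openTransaction.add(step.name());
        }
        committed.addAll(openTransaction); // normal end: commit the rest
        return committed;
    }

    public static void main(String[] args) {
        List<Step> allSync = List.of(
                new Step("reserve stock", false, false),
                new Step("call external service", false, true));
        List<Step> asyncCall = List.of(
                new Step("reserve stock", false, false),
                new Step("call external service", true, true));
        System.out.println(run(allSync));   // the first task is rolled back too
        System.out.println(run(asyncCall)); // the first task is already committed
    }
}
```

In the all-synchronous case the failure of the second task wipes out the first one as well; with the second task marked asynchronous, only the failed task itself is affected.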

Okay, and what happens if our task times out or throws an exception? In BPMN terms, this is called a "technical error", and if the fail retry parameter is set, the Job Executor will attempt to execute the task again and again until the retry counter reaches zero. Once that happens, Camunda creates an incident, and Flowable just marks the task as failed. Then the admin can try to correct the situation manually.

It is a rather primitive solution, but it was invented twenty years ago, and in those days it was considered the norm. Nobody was in a hurry, the majority of processes involved some human actions, and human beings tend to slow processes down. Therefore the recommended value for the fail retry parameter in Camunda and Flowable is still R3/PT5M, which means three retries at five-minute intervals. For completely automated processes geared towards microservice orchestration, a five-minute pause between attempts looks like an eternity. When we are fighting for efficiency, it's usually a matter of milliseconds, and now this…

And why can't we set the fail retry to, for example, a hundred attempts a hundred milliseconds apart? Well, it won't work. This mechanism works like a timer event: there is a special record in the database, and the already familiar Job Executor checks from time to time whether the timer is due. In our practical experience, the minimum workable interval is about 20 seconds; anything shorter just doesn't work. Of course, it depends on the hardware, but it is never fast enough.

But speed is not even the main consideration. Let's look at it from the logical point of view. We have a service that doesn't respond. And the system creates an instance of a process every second, and they keep trying to call the service in question. Or a hundred instances. Or a thousand. Before long you will find yourself overcrowded with processes knocking at the door of an unresponsive service. Fail retry cannot resolve such a situation by design. It can only exacerbate it.

Which means that the time has come to accept that this mechanism is outdated. It shouldn't be used in fully automated, high-load processes. So, what to do? You could look at the Circuit Breaker pattern, but that's a completely different story for another article.

Attached (boundary) events

BPMN contains another mechanism that can be leveraged for our scenario with an unreliable external service: attached (a.k.a. boundary) events. They can help us implement the same scenario as fail retry, but in a more flexible way. For example, you can retry after a timeout, just as before, or you can escalate the task to an employee for manual completion.

Suppose your process automatically calls a counterparty checking service, and the API has been changed on the other side, so every call fails. Instead of creating incidents, the task can be delegated to a human being, who can simply phone your business partner. With error handling, the guiding principle is basically the same. If the service returns some suspicious code, we can model any logic to react to it. It is a much more flexible approach than fail retry, and a more visually comprehensible one as well. The diagram becomes more complex, though. But life is a complex thing overall.

However, the service call in this implementation is still of the blocking type: the process is waiting for the task to complete. And an open transaction is left hanging.

– What will happen if the server fails while the transaction is still open?

Nothing entirely fatal is going to happen. The transaction will be rolled back, since the engine uses a two-phase commit, and this behaviour is normal. However, if the transaction was not committed, the BPM engine will find the unfinished task in the ACT_RU_JOB table after the restart and will process it again. That is, the task will be executed a second time. And if it is not idempotent, this can cause duplication of data. This is true for both the synchronous and the asynchronous mode.
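One common way to make a repeated execution harmless is to key each side effect on something stable, for example the process instance ID plus the task ID, and skip the work if that key has been seen before. The following is a minimal sketch in plain Java, not engine code: the class and method names are hypothetical, and the in-memory set is an assumption standing in for a database table with a unique constraint on the key.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

/** Runs a side effect at most once per key, so a re-executed task does not duplicate data. */
class IdempotentTask {

    // In-memory stand-in (assumption) for a table with a unique constraint on the key.
    private final Set<String> processedKeys = ConcurrentHashMap.newKeySet();

    /** Returns true if the action ran, false if this key was already processed. */
    boolean runOnce(String processInstanceId, String taskId, Runnable action) {
        String key = processInstanceId + ":" + taskId;
        if (!processedKeys.add(key)) {
            return false; // second execution of the same task: skip the side effect
        }
        try {
            action.run();
            return true;
        } catch (RuntimeException e) {
            processedKeys.remove(key); // the action failed, so allow a later retry
            throw e;
        }
    }
}
```

With such a guard in place, a second execution after a restart returns false instead of writing the same data again.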

As you can see, long transactions in BPMN can create problems, even if they include only one task. Long transactions keep database-related resources and other system resources blocked, which can cause deadlocks and reduce the system's performance. And let's remember that there can be a lot of instances of that process. To summarize, long transactions are evil. So we would like not to leave the task hanging for an indefinite period of time, but to send a request to an external service, complete the task, move forward with the process, and then, later, receive the response.

Event-based gateway

The event-based gateway is a BPMN element used for making decisions based on events. Unlike other gateways, such as the exclusive gateway, which make decisions based on data, the event-based gateway waits for one of several possible events and chooses a path through the process depending on which event occurs first.
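The "first event wins" behaviour can be modelled in plain Java with CompletableFuture. This is only an analogy under stated assumptions: a real engine persists the waiting state in the database instead of holding futures in memory, and the names EventBasedGatewayDemo and firstEvent are invented for this sketch.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;

/** Models an event-based gateway: the first event to arrive selects the path. */
class EventBasedGatewayDemo {

    /**
     * Races a message (completed by some other party) against a timer.
     * The returned future completes with whichever event happens first;
     * the later event is simply ignored.
     */
    static CompletableFuture<String> firstEvent(CompletableFuture<String> message,
                                                long timeoutMillis) {
        // A future that completes by itself with "Timeout" after the given delay.
        CompletableFuture<String> timer = new CompletableFuture<String>()
                .completeOnTimeout("Timeout", timeoutMillis, TimeUnit.MILLISECONDS);
        return message.applyToEither(timer, event -> event);
    }

    public static void main(String[] args) {
        // Case 1: the message arrives before the timer fires.
        CompletableFuture<String> answered = new CompletableFuture<>();
        CompletableFuture<String> path1 = firstEvent(answered, 1_000);
        answered.complete("Address OK");
        System.out.println("Path taken: " + path1.join());

        // Case 2: nobody ever sends the message, so the timer decides.
        CompletableFuture<String> silent = new CompletableFuture<>();
        System.out.println("Path taken: " + firstEvent(silent, 100).join());
    }
}
```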

It looks like something useful and particularly fitting for our scenario: we send a request to an unreliable service, add this magical gateway and then handle every variant of the response. Such a diagram is very easy to read; it clearly shows all the possible results of the task execution and allows us to model a scenario for each outcome. Furthermore, this approach makes it possible to create separate branches inside the process for various error types, which substantially increases the expressiveness and flexibility of the model.

It looks beautiful on paper! But it doesn't work if you make the API call from a regular JavaDelegate or Spring bean by means of the good old RestTemplate: the call blocks, so the service task does not complete until the response arrives. (Of course, RestTemplate is in itself a synchronous HTTP client that blocks the thread while the request is being completed. But this approach is easier and more familiar to many people than WebClient, so we are going to use it.)

We just need to let the service task complete, so that the process can move forward and enter the waiting state at the event-based gateway. But how?

Well, let's send the HTTP request not from the service task itself, but from a different place. Let the service task initiate this action through an application event. Then everything will be fine: our service task will quickly shoot out an event, and its mission will be immediately accomplished. Then the event will be grabbed by a listener, but that will happen outside the process context. The listener, in its turn, will execute the HTTP request using RestTemplate, wait for the response and send one message or another into the process.

This is what it looks like on a time sequence diagram. We have uncoupled the BPMN process execution and the external service call. The process will go on along one path or another depending on the request result, OK or error.

The last little detail: our listener has to be asynchronous; otherwise, it will block the process by itself, and we will again have to wait until the external service condescends to respond. Fortunately, Spring has the @Async annotation; when we add it, the method is executed in a separate thread and blocks nothing.
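The hand-off can be sketched in plain Java, without Spring or a BPM engine, to show the timing. Everything here is an assumption made for the illustration: the class name, the sendRequest method and the fake slow service are hypothetical, and a thread pool plays the role that the @Async listener plays in the real application.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.function.Predicate;

/** Plain-Java model of "the service task publishes an event, the listener calls the service". */
class NonBlockingCallDemo {

    /**
     * Plays the role of the service task: it submits the work and returns immediately.
     * The returned future completes later with the name of the message for the process.
     */
    static CompletableFuture<String> sendRequest(String address,
                                                 Predicate<String> addressExists,
                                                 ExecutorService listenerPool) {
        // The lambda runs on a pool thread, much like an asynchronous listener would.
        return CompletableFuture.supplyAsync(
                () -> addressExists.test(address) ? "Address OK" : "Fake address",
                listenerPool);
    }

    public static void main(String[] args) {
        ExecutorService listenerPool = Executors.newFixedThreadPool(2);
        try {
            // Hypothetical slow service: sleeps, then accepts any non-blank address.
            Predicate<String> slowService = address -> {
                sleep(300);
                return !address.isBlank();
            };
            CompletableFuture<String> message =
                    sendRequest("221b Baker Street", slowService, listenerPool);
            System.out.println("Service task finished, response pending: " + !message.isDone());
            System.out.println("Message for the process: " + message.join());
        } finally {
            listenerPool.shutdown();
        }
    }

    private static void sleep(long millis) {
        try {
            Thread.sleep(millis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}
```

The caller gets control back before the slow call has finished, which is exactly what lets the process reach its waiting state.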

Implementing the process in Jmix BPM with the Flowable engine

Okay, let's move from theory to practice and try to implement a non-blocking external service call from a business process. Suppose we receive addresses from somewhere and need to check whether they are real. The check is done with the help of the Nominatim service: if the address exists, the service returns a location on the map; otherwise, it returns null.

Open-source geocoding with OpenStreetMap data

Nominatim (from the Latin "by name") is a tool that searches OSM data by name and address and also generates synthetic addresses for OSM locations (reverse geocoding). It also has a limited ability to search for objects by type (pubs, hotels, churches, etc.). It has an API, which we are going to use.

In the middle of the diagram, you can see a service task named Send request to Nominatim. As we discussed above, the task itself doesn't make the API call; it only sends an application event, which is caught by our listener. The listener is not shown on the BPMN diagram, but it does the most important work: the geocoding itself. After the service task, we have the event-based gateway set up to handle three options for the process continuation:

Normal, when a positive response is received, and real coordinates have been found for the address.
Error, when the address turns out to be fake and the service has returned null. In this case, the process throws a BPMN error with a certain code, which can be handled in the higher-level process.
Timeout exit. Naturally, an external service is not always available.

The external service doesn't even have to be down: it may simply be unreachable, for example because the application is not connected to the internet. But the process shouldn't just hang. This situation must be handled correctly. Therefore, we model a fail retry and give it three attempts to receive a response. At every attempt the counter is incremented, and after three attempts we throw a BPMN error, but with a different code.
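The decision made after each timeout boils down to a comparison against the attempt counter. In the process it would typically be a condition on a process variable; the sketch below expresses the same rule in plain Java, with hypothetical names (RetryDecision, afterTimeout) and the limit of three attempts taken from the description above.

```java
/** Decides what the process does after a timeout, based on the attempt counter. */
class RetryDecision {

    static final int MAX_ATTEMPTS = 3; // three attempts, as described in the text

    enum Next { RETRY, THROW_ERROR }

    /** attemptsSoFar counts the attempts already made, including the one that just timed out. */
    static Next afterTimeout(int attemptsSoFar) {
        return attemptsSoFar < MAX_ATTEMPTS ? Next.RETRY : Next.THROW_ERROR;
    }

    public static void main(String[] args) {
        // Simulates a service that never answers: every attempt ends in a timeout.
        for (int attempt = 1; attempt <= MAX_ATTEMPTS; attempt++) {
            System.out.println("Attempt " + attempt + " timed out -> " + afterTimeout(attempt));
        }
    }
}
```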

How this differs from the standard approach:

The risks related to long transactions in the process are completely eliminated.
All exceptional situations are visible and can be further improved. For example, after three attempts we can escalate the task to a human being instead of throwing an error. Or you can invent your own option for this.

Implementation

The Send request to Nominatim service task calls the AddressVerificator bean:

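import org.flowable.engine.delegate.DelegateExecution;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.context.ApplicationEventPublisher;
import org.springframework.stereotype.Component;

// Order and RequestSendEvent are project classes; their imports are omitted here.
@Component(value = "ord_AddressVerificator")
public class AddressVerificator {

    private static final Logger log = LoggerFactory.getLogger(AddressVerificator.class);

    @Autowired
    private ApplicationEventPublisher applicationEventPublisher;

    // Called from the service task. It only publishes an application event;
    // the HTTP request itself is made by the listener.
    public void verify(Order order, DelegateExecution execution) {
        String processInstanceId = execution.getProcessInstanceId();
        if (order.getAddress() != null) {
            applicationEventPublisher.publishEvent(new RequestSendEvent(this, order, processInstanceId));
        } else {
            // No event is published, so no message will reach the process.
            log.error("Order # {}: Address is NULL", order.getNumber());
        }
    }
}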

In fact, the verify method from this bean only sends a RequestSendEvent message with the help of ApplicationEventPublisher, and it contains no other logic.

This message is caught by the onRequestSendEvent listener:

// Runs in a separate thread thanks to @Async.
@Async
@EventListener
public void onRequestSendEvent(RequestSendEvent event) {
    Order order = event.getOrder();
    String processInstanceId = event.getProcessInstanceId();

    // Blocking HTTP call. If it throws, no message is sent
    // and the process is left to its timer branch.
    Point point = geoCodingService.verifyAddress(order.getAddress());

    if (point != null) {
        orderService.setLocation(order, point);
        sendMessageToProcess("Address OK", processInstanceId);
        log.info("Order # {}, Address verified: {}", order.getNumber(), order.getAddress());
    } else {
        sendMessageToProcess("Fake address", processInstanceId);
        log.info("Order # {}, Invalid address: {}", order.getNumber(), order.getAddress());
    }
}

// Delivers the message to the execution that is waiting for it in the given process instance.
private void sendMessageToProcess(String messageName, String processInstanceId) {
    ProcessInstance processInstance = runtimeService.createProcessInstanceQuery()
            .processInstanceId(processInstanceId)
            .singleResult();
    if (processInstance == null) return; // the process instance no longer exists

    Execution execution = runtimeService.createExecutionQuery()
            .messageEventSubscriptionName(messageName)
            .parentId(processInstance.getId())
            .singleResult();
    if (execution == null) return; // no execution is subscribed to this message

    runtimeService.messageEventReceivedAsync(messageName, execution.getId());
}

It is important for this listener to be asynchronous, because otherwise it will be executed in the same thread with the process, and the task will be completed only when it finishes working.

The @Async annotation in Spring executes a method asynchronously: the annotated method runs in a separate thread without blocking the calling thread, which can continue with other work without waiting for the async method to complete. Note that @Async takes effect only when async support is enabled in the application with @EnableAsync.

Then, we call the verifyAddress method from the GeoCodingService bean, and this method makes the actual call to the Nominatim service. The method returns a point on the map: the Point object. Depending on the result, we send a message to the process, which can be "Address OK" or "Fake address". And then the event-based gateway gets to work.

The geocoding service executes a typical HTTP request using RestTemplate:

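import org.locationtech.jts.geom.Coordinate;
import org.locationtech.jts.geom.GeometryFactory;
import org.locationtech.jts.geom.Point;
import org.springframework.http.ResponseEntity;
import org.springframework.stereotype.Service;
import org.springframework.web.client.RestTemplate;

// NominatimResponse is a project class that maps one search result; it is not shown here.
@Service
public class GeoCodingService {

    private static final String NOMINATIM_URL =
            "https://nominatim.openstreetmap.org/search?q={address}&format=json&polygon_kml=1&addressdetails=1";

    private final RestTemplate restTemplate;

    public GeoCodingService(RestTemplate restTemplate) {
        this.restTemplate = restTemplate;
    }

    // Returns the location of the first search result, or null if nothing was found.
    public Point verifyAddress(String address) {
        // Blocking call: the thread waits here until Nominatim responds.
        ResponseEntity<NominatimResponse[]> response = restTemplate.getForEntity(NOMINATIM_URL,
                NominatimResponse[].class,
                address);

        if (response.getStatusCode().is2xxSuccessful() && response.getBody() != null) {
            NominatimResponse[] results = response.getBody();
            if (results.length > 0) {
                double latitude = results[0].latitude();
                double longitude = results[0].longitude();
                GeometryFactory geometryFactory = new GeometryFactory();
                // Coordinate takes x (longitude) first, then y (latitude).
                return geometryFactory.createPoint(new Coordinate(longitude, latitude));
            }
        }
        return null;
    }
}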

If the service hasn't responded within a certain period of time, the timer fires and another attempt is made. To avoid an infinite loop, we use the attempt counter. In theory, after exhausting all the attempts we can escalate the task to an employee, who will do the check manually.

Conclusion

When we design a business process, it is important to pay attention to the duration of system task executions and possible exceptional situations, which should be correctly processed. The fail retry mechanism should be treated with caution. It was invented a long time ago, and now we have more advanced patterns for such situations.

By combining application events and BPMN events, we can uncouple the external service request from the task completion, which makes it possible to use the event-based gateway. This approach reduces the risks related to long transactions.