The 100x slowdown bug; the 10-line fix
The story of how we made our system 100x slower and didn't catch it for 4 years.
Background
The story of our system starts some years back, when we wrote a service to process invoices and run validations on them. At the time, invoices were fairly small, and the workflow to process one was simple:
Download the invoice header.
Get all transactions of an invoice (line items).
Aggregate and match the total amounts.
For this, we would make a call to a different service that would return the transactions and their details.
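As a rough sketch, the flow looked something like this (the class and method names here are illustrative, not our actual code):

// Illustrative sketch of the validation flow; all types except TransactionLogger are hypothetical.
class InvoiceValidator {
    private InvoiceStore invoiceStore;           // fetches invoice headers
    private TransactionLogger transactionLogger; // fetches line items (shown below)

    public boolean validate(String invoiceId) {
        // 1. Download the invoice header.
        InvoiceHeader header = invoiceStore.getHeader(invoiceId);

        // 2. Get all transactions (line items) of the invoice.
        List<Transaction> lineItems = transactionLogger.getTransactions(header.getTransactionIds());

        // 3. Aggregate the line-item amounts and match them against the header total.
        BigDecimal total = BigDecimal.ZERO;
        for (Transaction lineItem : lineItems) {
            total = total.add(lineItem.getAmount());
        }
        return total.compareTo(header.getTotalAmount()) == 0;
    }
}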
The crippling slowness
Over the years, bigger invoices started coming into our system; soon, invoices had millions of line items. Naturally, invoice processing slowed down, and the slowdown was written off as a side effect of larger invoices coming in for validation.
The problem kept growing: invoices now took anywhere from 6 to 24 hours to process. Because the degradation was gradual, it never set off an alarm; it simply became the accepted norm. Before long, we had teams designing system architecture and solutions around the long-running validations.
Identification
As part of one of our resource utilization reviews, it was identified that this particular validation service was neither compute-intensive nor memory-heavy. When asked to justify the fleet size, the team said that they process huge invoices (1 GB to 10 GB), write them to disk, and that disk writes are slow; this is why they need more hosts to scale the system and process more invoices.
A large invoice is 1 to 10 gigabytes in file size. That should have been the first hint to go back to the drawing board and find the real issue: modern disks do not take a day to write in 202x (even a conservative 100 MB/s of sequential write throughput clears 10 GB in under two minutes). Somehow this was overlooked, and slow disk writes were accepted as the explanation for the slowness of the service. During the review, however, that explanation did not fly, precisely because of the 202x (2020-2030s, because the future looks dark) logic.
The blunder
Yeah, so for years we accepted the slowness of the system and changed the architecture around the slow validation process. During this review, however, we were unable to identify any single resource that was making the hosts slow, so we took a deep dive into the code. And lo and behold, we found a bug. I'm sharing the pseudo-code below so that readers can spot the issue themselves.
class TransactionLogger {
    // Client to call the external service that returns transaction details
    private TransactionAuthorityService tasSAO;

    public List<Transaction> getTransactions(List<String> transactionIds) {
        // DiskList is our disk-backed List implementation
        List<Transaction> transactionDetails = new DiskList<>();
        for (String transactionId : transactionIds) {
            transactionDetails.add(tasSAO.getTransactionDetails(transactionId));
        }
        return transactionDetails;
    }
}
Now, at first sight, there is nothing wrong with this code. We are doing what we are expected to do: get transactions and store them to disk.
The issue is that this code is sequential. The disk is not the bottleneck; it is the damn RPC call, which waits for every call to return before making the next one.
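To put rough numbers on it (the per-call latency here is an assumption, not a measurement): if each call to the transaction service takes around 25-50 ms, an invoice with a million line items needs a million sequential round trips, which works out to roughly 7-14 hours spent doing nothing but waiting on the network. That lines up suspiciously well with the 6-24 hours we had learned to live with.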
The 10-line fix to a 5-year-old architectural problem
The fix for the problem is rather simple: fetch the data from the dependency service with parallel calls.
class TransactionLogger {
    // Client to call the external service that returns transaction details
    private TransactionAuthorityService tasSAO;
    private ExecutorService executorService;

    public List<Transaction> getTransactions(List<String> transactionIds)
            throws InterruptedException, ExecutionException {
        List<Transaction> transactionDetails = new DiskList<>();

        // Submit the RPC calls in parallel
        List<Future<Transaction>> transactionFutures = new ArrayList<>();
        for (String transactionId : transactionIds) {
            Future<Transaction> future = executorService.submit(() -> tasSAO.getTransactionDetails(transactionId));
            transactionFutures.add(future);
        }

        // Collect all transactions
        for (Future<Transaction> future : transactionFutures) {
            transactionDetails.add(future.get());
        }
        return transactionDetails;
    }
}
Disclaimer: The above code is not production-ready. You need to:
Add batching to bound the number of in-flight futures at any given time.
Add rate limiting on calls to the downstream service so you don't brown it out.
There are better ways to write this code, but I just wanted to keep it simple for the readers.
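For the curious, here is roughly what the batching could look like. This is a sketch under my own assumptions (the batch size, the pool size, and the use of a plain fixed thread pool are all made up for illustration), not our production code:

// Sketch: bounded parallelism by submitting the IDs in fixed-size batches
// to a fixed-size thread pool. BATCH_SIZE and the pool size are illustrative;
// tune them against what the downstream service can actually absorb.
private static final int BATCH_SIZE = 100;
private final ExecutorService executorService = Executors.newFixedThreadPool(20);

public List<Transaction> getTransactions(List<String> transactionIds)
        throws InterruptedException, ExecutionException {
    List<Transaction> transactionDetails = new DiskList<>();
    for (int start = 0; start < transactionIds.size(); start += BATCH_SIZE) {
        List<String> batch = transactionIds.subList(
                start, Math.min(start + BATCH_SIZE, transactionIds.size()));

        // Fan out one batch; the next batch is not submitted until
        // every call in this batch has completed.
        List<Future<Transaction>> batchFutures = new ArrayList<>();
        for (String transactionId : batch) {
            batchFutures.add(executorService.submit(() -> tasSAO.getTransactionDetails(transactionId)));
        }
        for (Future<Transaction> future : batchFutures) {
            transactionDetails.add(future.get());
        }
    }
    return transactionDetails;
}

Rate limiting is deliberately left out of the sketch; a Semaphore around the submit, or whatever rate limiter your resilience library already provides, covers the brown-out concern.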
The takeaway
As developers, we need to know the breaking points of our services and which resources actually constrain them before we resort to horizontal scaling for every problem.