When application code is not available and logs lie!
Debugging issues in proprietary software with unreliable application logs.
Preface
“Proprietary” (sounds like pruh·prai·uh·tree) is a tough-to-spell word with an interesting pronunciation. Skipping the tedious explanation of how I’ll inevitably flunk the dictation despite my repeated attempts, let’s get straight to the issue at hand. We find ourselves in an interesting setup for this one:
We have a client claiming very high latency from our service.
We have a backend service claiming very low latency.
Needless to say, the initial few calls with the client went in circles, because we thought it was their network and they thought it was ours.
So we ran a bunch of tests, found some minor configuration issues on the application end, and okayed the system for the client. To our surprise, the client kept seeing the same issue. At this point I was more curious than frustrated about what was happening, since there had been no improvement whatsoever.
Problem
We picked up a high-latency request and traced it across the logs in our system, and here is what we found.
When we looked at the gateway logs (the proprietary software), we could clearly see the request arriving at the gateway with a two-second delay, and like all other developers we agreed it was a network issue. Because why not? I could see that I did not see the request on my system before that point (nice wordplay, huh!). Anyway, after doing tons of optimization we saw no improvement. So we sat with a laptop next to the server, plugged it in, ran the tests, and still saw 2-3 SECOND latency (a quick client-side timing check with curl is sketched after the list below). This told us:
Network is not the problem.
Either I have a bad machine.
Or my logs are not reliable.
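As an aside, here is a minimal sketch of the kind of client-side timing check you can run with curl’s timing variables, so the numbers don’t depend on anyone’s logs (the URL is a placeholder):

# Break down request latency from the client side (URL is a placeholder)
curl -o /dev/null -s -w 'connect=%{time_connect}s ttfb=%{time_starttransfer}s total=%{time_total}s\n' \
  https://gateway.example.com/api/health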
We ruled out the bad-machine theory because we tested on multiple machines, and the odds of being unlucky enough to land on all the bad servers are quite low (knowing my luck, we can’t completely rule it out :D). And now for the meaty problem of this post: how does one debug an application whose code you don’t own and whose logs you can’t trust?
We reached out to the proprietary software’s support, and after multiple discussions they refused to acknowledge that this was a problem in their software. We were stuck, and we still needed to find the issue.
Debugging without code and logs
So we were at a point where the application owners could not come up with any suggestions on what might be causing this. I decided to deep dive on the issue. While I didn’t have access to the application’s internals or any authoritative logs, I did have access to the server. What can you do with server access to proprietary software? Well, unless the application is designed with security in mind, you can figure out pretty much everything it is doing. In my case, I used two tools to identify what was happening:
strace
This allowed me to see which system calls the application was making to the kernel and where it was spending time from Linux’s perspective. strace is very useful for identifying what is happening in pretty much any application.
tcpdump
This allowed me to see what was happening on the connection stream (not entirely, since my traffic was over TLS).
How to’s
This post would be useless if I didn’t give you a quick way to use both strace and tcpdump, so here it is, within the next two minutes.
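Before either tool, you need two inputs: the PID of the gateway process for strace and the network interface for tcpdump. A quick sketch of how to find both (the process name and port below are assumptions, not the real ones):

# Find the gateway's PID (the process name "gateway" is an assumption)
pgrep -f gateway
# Or find it via the port it listens on (8080 is a placeholder)
sudo ss -ltnp | grep ':8080'
# List network interfaces for tcpdump's -i flag
ip -br link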
strace
On Linux, you can use the following command to spot the high-latency syscalls a process is making: how much time is spent in each call and what data is passed in its parameters.
strace -f -T -s 100 -p PID 2>&1 | awk '{gsub(/[<>]/,"",$NF); if ($NF+0 > 0.8) print}'
What does it do:
-p → attaches to the already-running process with the given PID.
-f → follows all child processes and threads forked by that PID.
-T → appends the time spent in each syscall as <seconds> at the end of each line (this is what the awk filter reads).
-s → sets how many characters of string data to print in the call parameters (here 100).
2>&1 → redirects stderr to stdout (strace writes its output to stderr by default).
awk → filters the strace output to show only syscalls that took more than 0.8 seconds.
Looking at the results of this command, we could see a lot of time being spent on calls to a SQL database.
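If the slow syscalls turn out to be reads or writes on a socket, you can map that file descriptor back to its remote endpoint to confirm it really is the database. A small sketch, where PID and fd 12 are placeholders and 3306 is the assumed MySQL port:

# Resolve a suspicious file descriptor to its socket (PID and fd 12 are placeholders)
sudo ls -l /proc/PID/fd/12
# List the process's TCP peers and look for the DB host/port (3306 assumed)
sudo ss -tnp | grep 'pid=PID,'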
tcpdump
To see what is happening in your system on a per-connection basis, you can use:
tcpdump -i INTERFACE -s 0 -w output.pcap
What does it do:
It captures all the traffic on INTERFACE, keeping full packet contents (-s 0), and writes it to output.pcap.
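In practice you will usually want to narrow the capture. If you already suspect database traffic, filtering on the MySQL port keeps the file small (3306 is the default port and an assumption about your setup):

# Capture only traffic to/from the MySQL port, with full packets
sudo tcpdump -i INTERFACE -s 0 port 3306 -w mysql.pcap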
You’ll need Wireshark to open and analyze the generated pcap file. In my case, it clearly showed a bunch of DB calls to the MySQL server. When those DB calls hit high latency, they created back-pressure and caused all requests to have higher latency (my guess is that the proprietary code has some synchronized block holding up all threads, but that is for another day or another blog post).
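If you would rather stay on the command line, tshark (Wireshark’s CLI companion) gives a quick first look at the capture; the display filter below assumes the DB traffic is on the standard MySQL port:

# Summarize packets on the MySQL port from the capture (3306 assumed)
tshark -r output.pcap -Y 'tcp.port == 3306'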
I highly recommend the short read by Julia Evans: How I use Wireshark. You have already invested five minutes in learning the tool; why not invest another five and get some more depth?
Closing thoughts
Since we didn’t own the gateway code, there was only so much we could do. The application team eventually acknowledged the issue and decided to provide a patch for it. I think the big takeaway was how much speed these two tools gave us in deep diving on the issue, even while the application team was denying any problem. My recommendation: definitely give them a shot.