Riverbed SteelCentral recently helped me pinpoint a very complicated performance outage at one of my clients. Yes, it made me a superhero! I wanted to share this workflow with you, and I hope you can be the APM superhero within your own organization.
I have a client who invested in Riverbed SteelCentral and our consultancy services about a year ago. They have AppResponse, AppInternals, Packet Analyzer, and Transaction Analyzer.
Recently they outsourced one of their application projects to a very well-known software development company, and the app was deployed in production. As expected, the outsourced dev company had their own APM solution in place, and there was an intermittent, hard-to-pin-down performance problem that could also be reproduced during load tests. Their application monitoring solution pointed the finger at the backend search server as the root cause... The finding was: "it's not the outsourced application."
Luckily, monitoring for that backend search server had been implemented during our SteelCentral project, so it was already on AppResponse's radar. (Only AppResponse, because the server was a black box; its vendor did not allow us to install AppInternals on it.)
Wait a sec... AppResponse was saying the backend server's response time was OK (the red portion of the stacked graph), but there were significant retransmissions (yellow) and increased connection setup time (blue).
Before digging into the packet level, I wanted to monitor their outsourced application with AppInternals and see how its view would differ from their monitoring solution's. Thanks to the team for accepting my invitation and contributing so well. I call this teamwork, and it is essential for successful APM.
AppInternals immediately detected the performance degradation during the load test, just as their existing profiler had. But there was a big difference, which Riverbed folks call the "Big Data Approach": this time we had every single transaction captured and examined during the load test.
AppInternals pointed to a method named classXYZ.get, where they were calling the backend search server. But AppInternals was also showing something very interesting in the traces: what we call socket or remote delays. The developers were calling the .NET HTTP client class asynchronously within the "get" method (which is a perfectly normal thing to do). Unlike their monitoring solution, AppInternals was not only blaming the "get" method (that is, the backend server the "get" method was calling), but also exposing socket-level delays that aligned very well with AppResponse's findings.
Would you call it a network problem on the LAN? I didn't, at least not before looking deep into the packets...
First I wanted to take a tomography of the packets with Packet Analyzer before starting surgery with Transaction Analyzer.
Time to start surgery with Transaction Analyzer
Wow, wow, wow... Wait a sec: a 576 KB response for each backend call. They had more than 10,000 requests per minute to that backend during prime time. That works out to roughly 96 MB per second to read from the backend search server. What should the poor TCP stack or HTTP client buffer do? Fill up, of course... And what is that payload? XML data transferred over HTTP... Is it compressed? Let's ask AppDoctor...
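The back-of-the-envelope math behind that 96 MB/s figure is quick to verify (a small sketch using the 576 KB and 10,000 req/min numbers observed above):

```python
# Rough throughput estimate for the backend search traffic.
response_size_kb = 576          # observed payload size per backend call
requests_per_minute = 10_000    # observed peak request rate

kb_per_second = response_size_kb * requests_per_minute / 60
mb_per_second = kb_per_second / 1000  # using 1 MB = 1000 KB

print(f"~{mb_per_second:.0f} MB/s sustained read from the search server")
# -> ~96 MB/s sustained read from the search server
```

That is a lot of data to pull continuously through application-level HTTP client buffers, which is exactly what the socket delays were hinting at.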
There was no HTTP compression support advertised to the backend server in the GET request (no Accept-Encoding header). As a result, the response payload came back uncompressed and very large, overwhelming the app server's receive buffers.
Solution: Explaining the finding to the developers and kindly asking them to add HTTP compression support to their .NET HTTP client invocation class fixed the problem.
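To illustrate why this single header matters so much, here is a small Python sketch (the XML payload below is hypothetical, just standing in for the ~576 KB search responses) showing how much repetitive XML shrinks under gzip, the encoding a client opts into by sending `Accept-Encoding: gzip`:

```python
import gzip

# Hypothetical repetitive XML search response, roughly the size of the
# 576 KB payloads seen in the traces. Real XML result sets, with their
# repeated tag names, compress similarly well.
records = "".join(
    f"<result><id>{i}</id><title>product {i}</title></result>"
    for i in range(10_000)
)
payload = f"<results>{records}</results>".encode("utf-8")

compressed = gzip.compress(payload)
ratio = len(payload) / len(compressed)

print(f"raw: {len(payload) // 1024} KB, "
      f"gzipped: {len(compressed) // 1024} KB, "
      f"ratio: {ratio:.1f}x")
```

In the client's case the fix was on the .NET side; in current .NET, for example, `HttpClientHandler.AutomaticDecompression` makes `HttpClient` advertise and transparently decode gzip responses. Cutting each response by an order of magnitude took the 96 MB/s stream down to something the receive buffers could handle comfortably.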
Conclusion: We (DevOps) face complicated problems like this one, and complicated problems require advanced tools like Riverbed SteelCentral APM/NPM, not fancy-UI APM toys.
Thanks for reading.