Having used AppInternals for four years, I've found that my investigations have always ended in one of these areas:
Changes - newly introduced code, change in type or amount of payload being processed
Database - a slow query, table scan, index or plan chosen
Remote System - downstream dependency, webservice
Resource - disk space, CPU or memory exhaustion
Capacity - just too much volume on the old spec
Until, that is...the day DNS died.
A P1 was about to be initiated. It sounded quite big: lots of gatherings on chat, email alerts starting to arrive, and all users had lost access to their applications.
One minute after the first reports, we opened up AppInternals and verified a complete drop in application transactions and an increase in response time.
Using the most-used analysis operator, "| timeseries", we quickly listed all URLs, locations, servers and instances...there was definitely a problem, though my alerts had already told me this anyway.
Trawling through the transactions, we began to think about the pieces of infrastructure that could cause such widespread damage:
Network - not this time
Load Balancer - nope, the VIPs were working, I could connect
Reverse Proxies - definite correlation here, I could connect, but not much else was loading
Two minutes into the first alert, we had already narrowed in on the Reverse Proxies...these were Java, so we focused on the HTTP 500s and exceptions:
| timeseries -group_by httpstatus
| timeseries -group_by exception
The exception java.net.UnknownHostException, thrown from the method getAddressFromNameService in the class java.net.InetAddress, had appeared out of the blue...this was new to me, never seen before, here is what it looks like....
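For context, java.net.UnknownHostException is what the JVM throws when a hostname simply cannot be resolved. A minimal sketch of triggering and catching it (the hostname is made up; the .invalid TLD is reserved and can never resolve):

```java
import java.net.InetAddress;
import java.net.UnknownHostException;

public class DnsCheck {
    /** Returns true if the host resolves, false if the DNS lookup fails. */
    static boolean resolves(String host) {
        try {
            InetAddress.getByName(host);
            return true;
        } catch (UnknownHostException e) {
            // This is the exception that flooded the Reverse Proxy traces
            // when DNS died: resolution failed inside InetAddress.
            return false;
        }
    }

    public static void main(String[] args) {
        // .invalid is reserved (RFC 2606), so this lookup always fails
        System.out.println(resolves("some-backend.invalid"));
        System.out.println(resolves("localhost"));
    }
}
```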
TIP: If you are an AppInternals user, define a transaction for these requests and add an alert on this class.method based on the exception count being high, response time, etc. It may save your ar%% one day; also copy the DNS team on the alert. If we only sampled our applications for feedback now and then, there would be a very high chance we'd miss this exception!
AppInternals was 100% right. The Reverse Proxies had lost their ability to resolve DNS. A change took place TWO minutes before the failures, and the TTL (Time to Live) on all load balanced DNS entries was 120 seconds, so the cached entries expired within two minutes of the change that had blocked those hosts' access to DNS. We captured the cause and stopped a major outage from becoming business impacting!
Another TIP: once you have the query that finds the exceptions, start using the |timeline analysis operator. You see all the transactions at once, in time order, which confirms everything again.
We had restored service before every member of the support team had even joined the call...this is the true value of having "All of the transactions, all of the time".
You may ask why the agent continued to work if DNS stopped on the Reverse Proxies...well, we have a 3-hour TTL on our Analysis Server address, not 120 seconds, so our connectivity was never in question.
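The JVM has its own layer of DNS caching on top of the record TTLs, and it is tunable. A minimal sketch, using the real JVM security properties networkaddress.cache.ttl and networkaddress.cache.negative.ttl (the values here are illustrative, echoing the TTLs from this story, not a recommendation):

```java
import java.security.Security;

public class DnsCacheTtl {
    public static void main(String[] args) {
        // Must be set before the JVM performs its first lookup to take effect.
        // Cache successful lookups for 120 seconds -- the same TTL as the
        // load balanced DNS entries in this story. A longer value (like the
        // 3-hour TTL on the Analysis Server address) rides out short DNS
        // outages, at the cost of reacting slowly to genuine address changes.
        Security.setProperty("networkaddress.cache.ttl", "120");

        // Cache failed lookups only briefly, so a DNS blip isn't pinned
        // as a failure for long after the resolver recovers.
        Security.setProperty("networkaddress.cache.negative.ttl", "10");

        System.out.println(Security.getProperty("networkaddress.cache.ttl"));
    }
}
```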
Hope you found this useful.
This document was generated from the following discussion: The day DNS Died