Picture this... you manage many non-production environments that are pumping out logs at a rate of 200TB per environment per day. Nobody has used or tested anything for a while, yet your logging infrastructure is as busy as ever. The production environment is logging at an even more aggressive rate: web, app, and db servers all continuously writing to files, and those files then mysteriously moving across the network to a central collection point or out to an internet SaaS environment (i.e. Splunk or ELK). I bet that regardless of testing or user activity, the logs fill up at the same rate and the retention is poor (or you pay for more!).
Sound familiar? Users asking for more retention? Your storage guys telling you that "it's not like buying a disk from Amazon and just installing it; it's expensive and you really need to forecast growth in advance when asking for more disk, or database storage is the priority"?
It gets worse, right? When it comes to troubleshooting important customer-impacting issues, you are literally tasked with finding a needle in a haystack because of the noisiness of verbose logging. Data retention is poor, and most of the data you have been collecting is useless and unrelated to the needle you are trying to find... going back to my point above, regardless of user activity the logging throughput is the same; nearly all your logging is "noise".
A customer recently told me: "99% of our logs are junk, a running commentary on stuff going on that is non-critical; most of the time we focus on less than 1% of the logs when investigating something. There is a low chance of us building a complete picture of what happened because by the time we were asked to help, the data had already rolled out".
What if AppInternals could provide the high-value 1%? After all, a log entry is written to a file by your developers anyway, outputting parameters using a method within a class – easy, right? Well, we have been able to extract this information for some time; see the following post: Extracting application log messages as parameters (log4net example). In the v10 release of AppInternals we actually built this capability into the product – how cool!
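To make the idea concrete, here is a minimal sketch of the kind of logging call involved. The class, method, and message below are hypothetical (the linked post used log4net; this uses Java's built-in java.util.logging since the application in this story is a Java/Play app): the string the developer passes to the logger is exactly what can be captured and attached to the transaction trace as a log parameter.

```java
import java.util.logging.Logger;

// Hypothetical controller, for illustration only – not the customer's code.
public class NotificationController {
    private static final Logger LOG =
            Logger.getLogger(NotificationController.class.getName());

    // Builds and logs the message; returning it keeps the sketch easy to test.
    public String unsubscribe(String userId) {
        String msg = "Action notification.unsubscribe could not be found for user " + userId;
        // This one line is the "high value 1%": it can be captured as a
        // trace parameter instead of only rolling through a log file.
        LOG.warning(msg);
        return msg;
    }

    public static void main(String[] args) {
        new NotificationController().unsubscribe("user-42");
    }
}
```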
So, a few weeks back (same customer) we upgraded all the v9 agents to v10 and enabled two features for a very critical online application (Web Parameters & Log Packages). We enriched the already excellent application transactions by adding any logging or transaction parameters to the traces. An incident arrived, slightly delayed by the time it reached the right team – it was related to a customer who had an issue completing a transaction. Problem was, the email and screenshot were a few days old, and all attempts to find the error in ELK had been abandoned because the retention was too low.
We had all we needed to try out these cool new features. We had the date, time, user, and transaction type; we knew the user received an HTTP 500; and we even had the exception output from the screenshot:
The exception looked like this:
Action notification.unsubscribe could not be found. Error raised is Controller controllers.notification not found
play.exceptions.ActionNotFoundException: Action notification.unsubscribe not found
Caused by: java.lang.Exception: Controller controllers.notification not found
So, the first thing was to select a time range either side of the exception, then apply a search to find my exception or log parameters – using typeahead, I worked this out very easily:
http.status = 500 and application.id = 16 and parametername = 'logmessage' and parametervalue = '*play.exceptions.ActionNotFoundException: Action notification.unsubscribe not found*'
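The query reads as a simple AND of conditions over a transaction's fields, with `*…*` acting as a contains-style wildcard on the parameter value. A rough sketch of the equivalent filter logic in plain Java (the flat field names here are hypothetical, and this is of course not AppInternals' actual search implementation):

```java
import java.util.Map;

public class TraceFilter {
    // Plain values are exact matches; the '*...*' wildcard in the query
    // is modelled here as a substring (contains) match.
    static boolean matches(Map<String, String> trace) {
        return "500".equals(trace.get("http.status"))
            && "16".equals(trace.get("application.id"))
            && "logmessage".equals(trace.get("parametername"))
            && trace.getOrDefault("parametervalue", "")
                    .contains("play.exceptions.ActionNotFoundException: "
                            + "Action notification.unsubscribe not found");
    }

    public static void main(String[] args) {
        Map<String, String> trace = Map.of(
            "http.status", "500",
            "application.id", "16",
            "parametername", "logmessage",
            "parametervalue", "play.exceptions.ActionNotFoundException: "
                + "Action notification.unsubscribe not found");
        System.out.println(matches(trace)); // a matching transaction
    }
}
```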
We found transactions… we also visually inspected the parameters to work out the product selected and a few other combining data items, enough to simulate the same journey as the user (as good as having the XML itself!). The same thing happened for us – an HTTP 500 error, same exception. The problem originated from a product configuration issue where the application would only complete if certain notification features were enabled.
Success – we proved that we could find a log entry that would normally have rolled out of Kibana, and we helped fix an issue thanks to the new features in AppInternals 10.
Whilst logged into the UI, we decided to look a little further into the exceptions… we produced a nice report detailing all the exceptions, correlated them to the product found in the parameters, and listed the customers potentially impacted, allowing the customer service teams to proactively correct the issue.
A long read, I know, but if you are using ELK or Splunk, it's worth looking at these new features on your application servers: we focus on the 1% of high-value log messages and attach them to the actual transactions continuously!