NetProfiler analytics are sensitive out of the box to variances from the predicted baseline. In my opinion it is better to desensitize the analytics to reduce "false positive" reporting than to tune the system out so far that we miss important event notifications.
Why "false positive"? The system has baselined the traffic using historical data, therefore any variance outside of a reasonable tolerance from the sigma level should be considered an event. The question is "How sensitive does my service monitoring need to be to suit my environment and the application's business priority?" One person's "false positive" could be another's "serious issue". Low-latency networks are an example: a response time greater than 1ms may be considered a major event, yet in a "normal" network environment we would be satisfied to see times far greater than this. (If you are seeing times greater than 50ms on your WAN then perhaps you should consider deploying SteelHeads.)
Those application and network owners who find these alerts too sensitive for their specific needs can carry out some simple adjustments in the analytics to reduce the "false positives".
The only metrics in the services that specifically indicate problems are % retransmissions and response time.
The others, such as "active connections" and "resets", may be related purely to an increase in traffic rather than any underlying issue. These metrics should be viewed together with the others to determine whether a problem is being identified, e.g. an increase in active connections may be related to an increase in the volume of resets. As will be discussed below, resets in themselves may not be a problem, but local knowledge and experience of the application and network path may indicate that they are. (If you don't have the local experience or knowledge, then run historical reports looking at these metrics to see if this behaviour is extraordinary.)
Active connections from a user perspective are generally unpredictable because individual users, and the way they use an application, are themselves erratic, e.g.
- Do you open and use this Splash site in exactly the same way as me or any other users here today?
- Do you always open it at the same time every day?
- Do you always view the same pages?
- Do you always close the page after the same amount of time after opening it?
Probably not to all of these questions: you are human and consequently an individual and "fickle". With this in mind, the host group you are using for services also becomes an important factor. The default host group type is ByLocation. If host groups within this contain locations with a small number of active users, the consequence is that individual users will have a disproportionate effect on the baseline. As with any statistical measurement, the larger the sample, the less likely it is that individual users with their random methods of using an application will result in a false positive event. If your groups contain a small number of users, changing to a larger sample group (i.e. ByRegion or ByCountry) may be beneficial, but this may not meet your business needs.
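To illustrate why larger host groups smooth the baseline, here is a minimal sketch (my own simulation, not NetProfiler's actual algorithm): each simulated user opens a random number of connections per 15-minute period, and we compare how steady the group total is for a small group versus a large one.

```python
import random
import statistics

random.seed(42)

def group_variability(n_users, periods=500):
    """Relative variability (coefficient of variation) of a group's
    total active connections across many 15-minute periods."""
    totals = []
    for _ in range(periods):
        # Each user behaves erratically: 0 to 10 connections at random.
        totals.append(sum(random.randint(0, 10) for _ in range(n_users)))
    return statistics.stdev(totals) / statistics.mean(totals)

small = group_variability(5)     # a small ByLocation group
large = group_variability(500)   # a larger ByRegion/ByCountry group

# The larger group's total is far steadier relative to its mean, so
# individual "fickle" users are less likely to push it outside tolerance.
assert large < small
```

The user counts and connection ranges here are invented for illustration; the point is simply that the variability of the group total shrinks as the sample grows, which is why small ByLocation groups generate more false positives.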
However, server to server traffic tends to be far more predictable. A transaction is required, a connection opened, transaction completed, connection closed, job done. Very predictable. In fact this traffic is so predictable that, left to its own devices, the NetProfiler analytics will gradually adjust the "green river" tolerances so tight to the typical line that even the smallest glitch will generate an event. We need to manage the analytic tuning to mitigate false positive events caused both by user unpredictability and by the excessive predictability of server to server traffic.
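A simple sigma-based band of "typical ± k × sigma" (a hedged stand-in for Riverbed's proprietary analytics, with invented sample values) shows why very predictable traffic ends up with a razor-thin green river:

```python
import statistics

def tolerance_band(samples, k=3.0):
    """Return (low, high) of a simple typical ± k*sigma tolerance band."""
    typical = statistics.mean(samples)
    sigma = statistics.stdev(samples)
    return typical - k * sigma, typical + k * sigma

# Erratic end-user response times (ms): band stays wide.
user_rt = [32, 55, 41, 38, 60, 35, 47, 52]
# Near-constant server-to-server response times (ms): band collapses.
server_rt = [10.0, 10.1, 9.9, 10.0, 10.1, 10.0, 9.9, 10.0]

lo_u, hi_u = tolerance_band(user_rt)
lo_s, hi_s = tolerance_band(server_rt)

# A tiny 0.5 ms glitch on the server traffic falls outside its thin band,
# even though operationally it is meaningless.
assert (hi_s - lo_s) < (hi_u - lo_u)
assert not (lo_s <= 10.5 <= hi_s)
```

This is the mechanism behind the "too much predictability" problem: the tighter the observed variation, the smaller the glitch needed to escape the band.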
Another metric that can be interesting as part of a holistic view, but may not in itself be indicative of problems, is Resets.
Resets - essentially a time-out. Some applications are known to use a reset rather than "waste time" performing the [FIN,ACK] or [FIN,FIN-ACK,ACK] handshake. This is rather like rebooting a server by unplugging the power: effective, but dirty and potentially destructive. Web servers and security devices such as firewalls are also common sources of resets. Both will force a connection reset after a predefined period of connection inactivity, and inevitably another new active connection will be generated by the host system.
To desensitize the analytics, I would always review each event and determine whether the metric is a minor infringement of the tolerances or indicative of a genuine problem. There are a number of steps to reduce or eliminate the "false positives":
- Disable the end user to server Active Connections monitoring (or detune them as described below).
- Move the tolerance sliders further to the right (e.g. 5 for low and 6 for high). This is a fine-tuning adjustment, so it may not in itself reduce the false positives dramatically, and you'll need something a little less subtle...
- Add noise floor values for response time and % retransmit. These will be dependent upon "normal" values for your network, so running reports or reviewing the event notification and looking at the "context" graph would be helpful. The noise floor increases the tolerance bandwidth around the baseline calculated by the analytic process. I normally find that setting a value of 25-50ms on response time and 1% to 2% on % retransmits will reduce the number of alerts.
- Increase the event duration from 1 to 2. This will eliminate event triggers that only occur within one 15-minute period. These are often found to be a short spike and can usually be ignored. Events that last more than 15 minutes tend to be more remarkable; they are either long-lived or multiple outliers, so are more indicative of an issue.
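The duration setting in the last step can be sketched as a simple run-length filter (my own illustration, not NetProfiler internals): an event only fires when the outlier condition persists for the required number of consecutive 15-minute periods.

```python
def events(outlier_flags, min_duration=2):
    """Count events: runs of consecutive outlier periods that reach
    min_duration. Each flag represents one 15-minute period."""
    count, run = 0, 0
    for is_outlier in outlier_flags:
        run = run + 1 if is_outlier else 0
        if run == min_duration:  # fire once, when the run reaches the threshold
            count += 1
    return count

# Two short single-period spikes and one sustained three-period problem.
periods = [False, True, False, False, True, False, True, True, True, False]

print(events(periods, min_duration=1))  # 3: every spike alerts
print(events(periods, min_duration=2))  # 1: only the sustained run alerts
```

With duration 1 every brief spike raises an alert; with duration 2 the short spikes are suppressed and only the sustained deviation gets through.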
Don't forget, you can update multiple policies with all of these settings simultaneously by tuning from the upper levels of the service.
Note: Care should be taken to not over-compensate with these settings to the point where you stop reporting genuine problems.
Thanks for your valuable input.
I have already adjusted the settings and considerably reduced the number of alerts.
I am in the process of monitoring the performance of the users that are still causing alerts and will tune the settings accordingly, while building a better knowledge of normal application behaviour.
The customer is deploying SteelHead throughout the enterprise network, but this is being done in a phased approach. From our observations, the issues happening at sites with SteelHead already installed have so far been TCP retransmissions.
That was a very useful reply.
Can you give more details on Tolerance and Noise Floor?
As per the documentation, Noise Floor is the minimum amount of change that is considered significant.
Say I specify a Noise Floor for Resp Time of 50ms, so only a change of more than 50ms will trigger an event, right?
The question is: change relative to what?
Also, what are the Tolerance values specifying? What do these numbers from 2 to 6, i.e. sigmas, mean?
This is actually a good question. I am also looking for more details on the Tolerance and NoiseFloor. I have been reading about standard deviations and tolerance intervals, but not sure I am any closer to understanding.
I have a very basic understanding that the high and low tolerances map to the high- and low-level alerts, but I don't understand exactly how. What exactly do the sigmas mean? How are they used to compute the high and low range?
How does the Noise Floor affect the tolerance?
I found this link which basically states a 4.5 Sigma for 100Kb traffic is 45 Kb, but this seems almost too simple of an explanation.
If this is the case, then how does the noise floor affect this?
Thanks in advance for any help.
The NetProfiler analytics are continually measuring and updating each metric that is being monitored. If a metric shows very little variation, i.e. has a constant or near-constant value, or has a regular, predictable pattern of behaviour, the analytic engine will reduce the tolerance down to be very close to these values. Eventually you will find that the "green river" tolerance is so close to the predicted/actual value that it is almost impossible to discern the 3 different graph traces (actual/typical/tolerance).
(I was trying to insert an image here to demonstrate this, but it didn't work)
The result of this is that a very small change or deviation in metric behaviour will trigger an alert. In most networks these small "bumps" in behaviour are not significant enough to indicate a problem and could therefore be classed as a false-positive. To mitigate this the Noise Floor can be set to a value that represents the amount of change required before the deviation is considered to be significant. In practice I find that adding a noise floor even on a metric that is unpredictable is sufficient to minimise the number of false-positives.
Sigmas (or standard deviations) are calculated on the traffic observed over a period of time and are not influenced by the Noise Floor value. The analytics calculate (using clever Riverbed proprietary statistical analysis) the typical metric values for a given day and time and apply the tolerances that you can see in the policy tuning graphs as the normal range, often known as the "green river". As mentioned above, if there is little or no variation in observed traffic then the normal range will become tighter and tighter to the typical value, to the point where it is almost unobservable.
So, on a given day at a particular time, the system, based on previous observations at the same day and time in previous weeks, will compare actual traffic to the standard deviations (normal range). If the variance is outside of the normal range, the system will flag an outlier. Inserting a noise floor into the policy will add this metric value to the normal range, therefore increasing the overall tolerance. This is easily observed in the policy tuning pages by increasing/decreasing the noise floor values and watching the "green river" adjust accordingly.
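The interaction described above can be sketched as follows (a simplified model under my own assumptions, not the proprietary algorithm): the noise floor is added on top of the sigma-based tolerance, widening the normal range by a fixed amount.

```python
import statistics

def is_outlier(actual, history, k=3.0, noise_floor=0.0):
    """Flag an outlier when the deviation from the typical value exceeds
    k sigmas plus the noise floor (which widens the "green river")."""
    typical = statistics.mean(history)
    sigma = statistics.stdev(history)
    tolerance = k * sigma + noise_floor
    return abs(actual - typical) > tolerance

# Very predictable response times (ms) observed for this day/time slot.
history = [10.0, 10.1, 9.9, 10.0, 10.1, 9.9, 10.0, 10.0]

# A small 5 ms bump escapes the razor-thin sigma-only band...
print(is_outlier(15.0, history))                    # True
# ...but with a 25 ms noise floor it is no longer significant...
print(is_outlier(15.0, history, noise_floor=25.0))  # False
# ...while a genuine problem still gets through the widened band.
print(is_outlier(60.0, history, noise_floor=25.0))  # True
```

The 25 ms value matches the 25-50ms range suggested earlier in the thread; the history values are invented for illustration.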