Detecting Intermittent UC Outages: Why It’s Hard, and How We Do It
Detecting intermittent outages is one of the tougher challenges in large unified communications environments. For one thing, it can be hard to recreate the problem. For another, users tend not to complain if they get through on the next try. And if users don’t complain, how can IT know there’s a problem?
Cisco IT now has the tools and processes to detect intermittent outages before hearing about them from our users. We use Cisco Unified Operations Manager to conduct “synthetic tests,” which replicate user activity like getting dial tone, making phone calls, leaving voicemail, and creating or joining conference calls. We have several dozen virtual IP phones in each region of the world that make test calls every minute, both within and between clusters, for a total of 268,000 test calls daily.
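As a rough sanity check on that volume, the arithmetic can be sketched in a few lines of Python. The assumption here, which the figures above do not spell out, is that each virtual phone places exactly one test call per minute around the clock; the function name is ours, purely for illustration.

```python
MINUTES_PER_DAY = 24 * 60  # 1,440 minutes in a day


def implied_phone_count(daily_calls: int, calls_per_phone_per_minute: float = 1.0) -> float:
    """Estimate how many virtual phones would produce `daily_calls`.

    Assumes every phone calls at the same steady rate all day, which is
    a simplification rather than a documented fact.
    """
    return daily_calls / (MINUTES_PER_DAY * calls_per_phone_per_minute)


if __name__ == "__main__":
    phones = implied_phone_count(268_000)
    print(round(phones))  # 186 under the one-call-per-minute assumption
```

Under that assumption, 268,000 daily calls corresponds to roughly 186 virtual phones worldwide, which is consistent with several dozen per region.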
For example, one synthetic test report showed that a portion of test calls were failing every night at about the same time, for 30 minutes to an hour. Calls randomly dropped or went unanswered, and then the problem went away on its own…until the next night. We never heard about the outages from users, who probably just hung up and tried again.
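A pattern like this can be surfaced from raw test results by grouping failures by time of day and checking whether the same window goes bad on several nights. The following is a minimal Python sketch of that idea, not the reporting logic inside Cisco Unified Operations Manager. The function name, the (timestamp, succeeded) record format, and the thresholds are all assumptions chosen for illustration.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def recurring_failure_slots(results, slot_minutes=30, min_nights=3, min_failure_rate=0.2):
    """Find time-of-day slots that fail repeatedly across days.

    `results` is an iterable of (timestamp, succeeded) pairs, a
    hypothetical record format. Returns "HH:MM" slot start times whose
    failure rate met `min_failure_rate` on at least `min_nights` dates.
    """
    # (date, slot index) -> [failed calls, total calls]
    counts = defaultdict(lambda: [0, 0])
    for ts, ok in results:
        slot = (ts.hour * 60 + ts.minute) // slot_minutes
        entry = counts[(ts.date(), slot)]
        entry[1] += 1
        if not ok:
            entry[0] += 1

    # slot index -> set of dates on which that slot looked unhealthy
    bad_days = defaultdict(set)
    for (day, slot), (failed, total) in counts.items():
        if failed / total >= min_failure_rate:
            bad_days[slot].add(day)

    return sorted(
        f"{(slot * slot_minutes) // 60:02d}:{(slot * slot_minutes) % 60:02d}"
        for slot, days in bad_days.items()
        if len(days) >= min_nights
    )


if __name__ == "__main__":
    # Made-up data: one test call per minute from 01:00 to 03:00 on three
    # nights, with every other call failing between 02:00 and 02:30.
    sample = []
    for day in range(1, 4):
        start = datetime(2024, 1, day, 1, 0)
        for m in range(120):
            ts = start + timedelta(minutes=m)
            in_window = ts.hour == 2 and ts.minute < 30
            sample.append((ts, not (in_window and m % 2 == 0)))
    print(recurring_failure_slots(sample))  # ['02:00']
```

Requiring the same slot to fail on several distinct dates is what separates a recurring nightly problem from a one-off blip, and it is the kind of signal users who simply redial would never report.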
Because the failures fell in the same time window each night, we realized they might be related to the nightly Cisco Unity server refresh. That clue pointed us to the probable cause: the way we had implemented the dial string, in which we inserted commas to slow down dialing. This seemingly minor error heaped errors on the Unity memory stack that took several hours to clear out. We reported the problem to the team responsible for Cisco Unity, and they revised the configuration software so that our customers don’t run into the same issue.
The value of synthetic testing in this case is that we were able to fix an availability issue before we ever heard from users. That’s the goal for Cisco IT.