Every IT leader faces the same paradox: innovate faster while maintaining rock-solid stability. At Cisco IT, we were deploying AI systems and new technologies at breakneck speed—and watching our incident rate climb. Then we turned it around. Here’s how we reduced major incidents by 25% in one year while accelerating our pace of innovation.

The innovation tax: When speed becomes your enemy

Like most IT organizations, we were adding AI capabilities, deploying cloud services, and modernizing applications at an unprecedented pace. Innovation was our mandate. 

But with each new system came hidden costs: 

  • Visibility gaps: New technologies brought new dashboards — each siloed, none talking to each other. Our operations team was drowning in alerts with no unified view of actual business impact. 
  • Change-driven instability: We discovered a direct correlation: the more changes we pushed, the more incidents we experienced. Innovation was causing outages. 
  • AI uncertainty: While AI promised efficiency, it also introduced new failure modes. How do you monitor what you don’t fully understand? 

The question became urgent: How do we innovate without disruption? 

To address this, Cisco IT has made observability a cornerstone of our approach. 

Our North Star: Innovation without disruption 

Rather than slow down innovation, we made a different choice: become radically better at observability. 

Our Service Operations team and Enterprise Operations Center (EOC) set three clear objectives: 

  1. Detect faster – Spot issues before users report them, with complete business impact context 
  2. Assign smarter – Route problems to the right experts immediately, no handoffs 
  3. Resolve proactively – Fix issues automatically when possible, communicate clearly when not 

The goal wasn’t just faster incident response. It was to make our environment so observable that we could innovate faster, and with less risk. 

Cisco IT’s observability approach and technology

For Cisco IT, observability is critical to delivering end-to-end visibility, actionable insights, and AI-driven automation to enable us to detect, address, and even prevent issues before they impact the business. 

Cisco IT’s observability strategy is built on a layered approach spanning three teams. In the first two ‘layers’, dedicated teams are responsible for end-to-end observability across our network, applications, services, and infrastructure. Leveraging critical solutions like ThousandEyes and Splunk, they aggregate telemetry from our global environment and transform raw data into meaningful insights.  

  • Splunk: Our central nervous system for IT health. By aggregating logs, metrics, and events across our global infrastructure, Splunk gave us something we’d never had: a single source of truth. When an issue emerges, our team sees correlated signals across systems — not isolated alerts — enabling us to understand root cause in minutes, not hours. 
  • Cisco ThousandEyes: Our eyes on the end-user experience. ThousandEyes provides deep visibility into network paths and application performance from the user’s perspective — pinpointing exactly where and why slowdowns occur. When a critical application underperforms, our Service Operations team doesn’t guess whether it’s our network, a third-party provider, or the application itself. We know immediately, isolate the issue, and engage the right team to fix it — often before users open a ticket.

Our Service Operations team is where these insights are put into action to quickly identify, address, and even prevent issues before they impact the business. 

To enable our team to use the data and insights from these solutions even more effectively, we deploy AI-driven automation across a variety of incident management use cases: 

  • Predict assignment groups: AI analyzes incident descriptions against historical patterns to route issues to the right team immediately. This has resulted in a 19% reduction in reassignments and faster time-to-expertise. 
  • Suggest resolution options: By matching current issues to our knowledge base of 100,000+ resolved incidents, AI surfaces proven fixes instantly.  
  • Automate resolution: Self-healing systems now handle routine issues like storage cleanup and session resets without human intervention. AI automations now handle 99.998% of the ~4 million daily alerts that represent potential issues or incidents. 
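As a simplified illustration (not Cisco IT's actual implementation), assignment-group prediction of the kind described above can be sketched as nearest-neighbor matching of a new incident's description against historically resolved incidents; the example history, group names, and bag-of-words similarity below are all assumptions for demonstration:

```python
from collections import Counter
import math

# Hypothetical historical incidents: (description, assignment group that resolved it)
HISTORY = [
    ("storage volume full on database server", "storage-ops"),
    ("vpn session drops for remote users", "network-ops"),
    ("application login page returns 500 error", "app-support"),
]

def vectorize(text: str) -> Counter:
    """Bag-of-words term counts for an incident description."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two term-count vectors."""
    shared = set(a) & set(b)
    dot = sum(a[t] * b[t] for t in shared)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def predict_group(description: str) -> str:
    """Route a new incident to the group that handled the most similar past one."""
    vec = vectorize(description)
    best = max(HISTORY, key=lambda item: cosine(vec, vectorize(item[0])))
    return best[1]

print(predict_group("database server storage volume nearly full"))  # storage-ops
```

A production system would use richer features and a trained model over hundreds of thousands of records, but the principle is the same: historical resolution patterns drive first-touch routing, which is what cuts reassignments.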

While observability platforms and automation provide a critical foundation, technology alone isn’t enough. That’s where our team and established best practices make the difference. 

Beyond the technology: the human element of observability

The true value of our team goes beyond technology — it lies in the people and processes that convert information and insights into action. We work to quickly detect, analyze, assign, and resolve issues to minimize disruption.  

To do this effectively, we’ve identified three best practices that are key to our success: 

  • Intelligent change management: Not all changes carry equal risk, so treat them accordingly. We didn’t slow down changes — we got smarter about them. By categorizing changes based on risk, we automated approvals for 80% of standard, low-risk tasks while intensifying our focus and monitoring for higher-risk initiatives.
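Risk-based auto-approval can be sketched as a simple scoring gate. This is an illustrative model only — the risk factors, thresholds, and field names below are assumptions, not Cisco IT's actual change-management rules:

```python
from dataclasses import dataclass

@dataclass
class ChangeRequest:
    description: str
    touches_production: bool      # change lands in a production environment
    affects_shared_service: bool  # blast radius beyond one team
    has_tested_rollback: bool     # a verified back-out plan exists

def risk_score(change: ChangeRequest) -> int:
    """Accumulate points for each risk factor present (hypothetical weights)."""
    score = 0
    if change.touches_production:
        score += 2
    if change.affects_shared_service:
        score += 2
    if not change.has_tested_rollback:
        score += 1
    return score

def triage(change: ChangeRequest) -> str:
    """Auto-approve low-risk standard changes; escalate everything else for review."""
    return "auto-approved" if risk_score(change) <= 1 else "manual-review"

patch = ChangeRequest("routine certificate rotation", False, False, True)
print(triage(patch))  # auto-approved
```

The design point is that approval effort scales with risk: the 80% of standard, low-risk changes flow through without human review, freeing reviewers to concentrate on the changes most likely to cause incidents.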

 

  • Data quality and accuracy: Quality AI requires quality data, so prioritize CMDB hygiene. This is our foundation for AI effectiveness. AI is only as intelligent as the data feeding it — garbage in, garbage out. We built a comprehensive data quality framework around our Enterprise Service Platform (ESP), with our Configuration Management Database (CMDB) serving as the single source of truth for our entire technology environment. Through automated quality reporting and workflows, we continuously identify gaps, flag stale information, and trigger updates in real-time. When our AI predicts assignment groups or suggests resolutions, it’s working from accurate, current data — not outdated records from three months ago.  
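One piece of such a quality framework — flagging stale configuration items — can be sketched as a staleness sweep. The field names (`ci_id`, `last_verified`) and the 90-day window are assumptions for illustration, not the actual ESP/CMDB schema:

```python
from datetime import datetime, timedelta, timezone

# Hypothetical staleness window: records unverified for longer get flagged.
STALE_AFTER = timedelta(days=90)

def find_stale_records(records: list[dict], now: datetime) -> list[str]:
    """Return IDs of configuration items not verified within the staleness window."""
    return [
        r["ci_id"]
        for r in records
        if now - r["last_verified"] > STALE_AFTER
    ]

now = datetime(2024, 6, 1, tzinfo=timezone.utc)
cmdb = [
    {"ci_id": "srv-001", "last_verified": datetime(2024, 5, 20, tzinfo=timezone.utc)},
    {"ci_id": "srv-002", "last_verified": datetime(2024, 1, 5, tzinfo=timezone.utc)},
]
print(find_stale_records(cmdb, now))  # ['srv-002']
```

In practice a sweep like this would feed a workflow that notifies the owning team and blocks AI routing from trusting the flagged record until it is re-verified.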

 

  • Effective communications: In a crisis, clarity is as valuable as speed. This is our bridge between technical chaos and business clarity. During critical incidents, technical teams understand the problem, but business stakeholders need to understand the impact. Our Service Operations team translates complex technical issues into clear business language: which services are affected, how many users are impacted, what we’re doing to fix it, and when normal operations will resume. This disciplined communication approach keeps executives informed without overwhelming them, enables business units to make contingency decisions quickly, and maintains trust even during disruptions.  

The bottom line: Measurable business impact

Over 18 months, our observability transformation delivered results that directly enabled business agility: 

  • 25% reduction in major incidents – Fewer disruptions to employee productivity and customer-facing services 
  • 20% fewer change-related incidents – Innovation without instability 
  • 45% faster mean time to restore – From hours to minutes for critical service recovery 
  • 80% of changes now auto-approved – Faster deployment, lower risk 

What this means: Cisco employees experience fewer disruptions, IT teams spend less time firefighting and more time innovating, and the business moves faster with confidence. 

 

Ready to transform your IT operations?

The lessons from Cisco IT’s observability journey are clear: you don’t have to choose between innovation and stability. With the right approach to observability, AI-driven automation, and operational discipline, you can have both. 

 

 


Authors

Mark Hutchins

Director

IT Service Management