
Four telemetry pillars for clarity from a torrent of signals

In this hyper-connected age of distributed applications, when users face poor digital experiences, they do not care much about what is causing them. They just want it fixed – and now! Whether transferring funds, ordering dinner, collaborating remotely with a colleague, or streaming the latest movies, customers and end users want flawless digital experiences that work every time, are secure, and are highly personalized.

Consistent delivery of these digital experiences is immensely complex. Results depend on a myriad of interactions across disparate systems hosted in multi-cloud environments – each generating a torrent of metrics, events, logs, and traces (MELT) containing fragmented information about performance, connectivity, responses, experiences, and outcomes.

Collectively, this telemetry data contains what teams need to ensure problems do not lead to security, performance, or experience issues down the line. It also holds the information that developers require to deliver optimized applications.

However, well-publicized examples show us that when something goes wrong, ranging from degraded performance to complete unavailability, the digital experience breakdown can be difficult to understand, analyze, and resolve.

Moreover, in a world demanding real-time flawless experiences, availability and performance are key success metrics and a “blink” of disruption comes at a high cost. For example, in the case of downtime alone, the average cost per hour approaches $250,000 according to a 2023 IDC global survey on full-stack observability (FSO).

Considering the business's real-time and near real-time expectations, it is often faster – and just as valuable – to determine where a problem is not than to pinpoint the root cause of a multi-domain incident before an initial remedial action can even be considered. This applies to both reactive and proactive or predictive motions. The challenge is typically two-fold.

First, the sheer volume of siloed telemetry – amplified in real-time use cases – makes it almost impossible to assess the relevant data in a workable timeframe because it lacks proper context. Solutions have emerged that rapidly surface anomalies or issues that are out of baseline, but just 17% of IDC's survey respondents said their current monitoring and visibility solutions deliver the necessary context to take meaningful action.

Second, the distributed nature of today's applications and workloads means that relevant data may not even be captured by some monitoring solutions because they lack visibility into the full application stack – from the application itself to infrastructure and security, up to the cloud and out to the internet.

Telemetry in a complex, distributed world

To be truly useful, an observability solution must have a clear line of sight to every possible touchpoint that could affect the way an application and its dependencies perform, as well as how the application is consumed by its users.

This requires a massive stream of incoming telemetry, which can be extracted from networks, security devices, and services and used to gain visibility as a basis for action. Cisco has long sourced telemetry data from routers, switches, access points, and firewalls, just to name a few.

Every day, Cisco surfaces more than 630 billion observability metrics, derived from telemetry streams from applications down to infrastructure, through the network, and out to the internet, while absorbing 400 billion security events.

In addition, telemetry from other sources – such as application security solutions, the internet, and business applications themselves – provides performance insights, uptime records, and even logs from public cloud providers. Here again, modern telemetry architecture ensures that observability gets the required streams of data to work without compromise.

In fact, with distributed workforces and the new reality of working from home, the correlation between end-to-end connectivity, application performance, and end user experience is so significant that any fast path to problem resolution must be able to assess MELT signals through the lens of connectivity, performance, and security, as well as looking at elements such as dependencies, code quality, and the end-user journey.

Furthermore, artificial intelligence (AI) and machine learning (ML) have become a requirement for building reliable predictive data models that derive actionable insights directly tied to business goals and objectives. Finally, organizations now demand more integration points to collect different pieces of data, along with root cause analysis, pattern matching, behavioral analysis, and predictive capabilities.

To that end, standardization through open source projects such as OpenTelemetry has made it possible to normalize data ingestion so that telemetry can be uniformly collected. OpenTelemetry provides an open, extensible observability framework that uses vendor-neutral APIs and other tools for collecting data from traditional and cloud-native applications and services, as well as the associated infrastructure, helping teams understand normal business operations. It also enriches the foundation of correlation solutions handling application performance, security threats, and ultimately business outcomes.

Cisco, one of the leading contributors to the OpenTelemetry project, has long been committed to open standards to build products and platforms such as Cisco Observability Platform.

Telemetry diversity drives performant digital experiences

For effective observability, all four types of telemetry data are essential.

  1. Metrics are useful for creating baselines and triggering alerts when the output falls outside of the expected range.
  2. Events are helpful to confirm or notify that a particular action occurred at a particular time.
  3. Logs are versatile and empower many use cases from security analytics to those that rely on a detailed, play-by-play record of what happened at a particular time.
  4. Traces record the chains of events within and between applications and are also key to tracking end-user experiences. Traces, in particular, have the potential to move observability beyond single domain monitoring into full-stack visibility, insights, and actions in a multi-cloud environment. For instance, through integrations with key portfolio solutions, Cisco has tapped the power of traces among the domains of applications, security and networking, to drive the correlations that reveal insights mapped to business risk and other crucial business indicators.
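The baseline-and-alert role of metrics described in the list above can be sketched with a simple rolling baseline. This is an illustrative standard-library example only; the window size, minimum sample count, and 3-sigma threshold are assumptions, not defaults of any Cisco product:

```python
# Rolling-baseline anomaly check for a metric stream (illustrative sketch;
# the window, warm-up count, and 3-sigma threshold are assumed values).
from collections import deque
from statistics import mean, stdev

class BaselineAlert:
    def __init__(self, window: int = 20, sigmas: float = 3.0):
        self.samples = deque(maxlen=window)  # recent metric values
        self.sigmas = sigmas                 # distance from baseline that counts as anomalous

    def observe(self, value: float) -> bool:
        """Record a sample; return True if it falls outside the baseline."""
        alert = False
        if len(self.samples) >= 5:  # need a minimal baseline before alerting
            mu, sd = mean(self.samples), stdev(self.samples)
            if sd > 0 and abs(value - mu) > self.sigmas * sd:
                alert = True
        self.samples.append(value)
        return alert

monitor = BaselineAlert()
latencies = [102, 98, 101, 99, 103, 100, 97, 250]  # ms; the last sample spikes
alerts = [monitor.observe(v) for v in latencies]
print(alerts[-1])  # the 250 ms spike falls outside the baseline -> True
```

Real observability platforms replace this static threshold with adaptive, seasonal baselines, but the core idea – compare each incoming metric against a learned range and alert on departures – is the same.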

Not only does telemetry diversity allow organizations to derive insights from the broadest set of data, but it also lets teams see those insights in their own context. For instance, the impact of end-user experience on business outcomes associated with a mobile application hosted in a multi-cloud environment – SaaS or otherwise – can be seen through the lens of a consolidated visualization (c-suite) as well as through the automated action required by site reliability engineers (SREs) to address the issue causing that impact.

While their perspectives differ, teams within IT and across other business functions increasingly rely on each other in a world where applications, and the digital experiences they create, are crucial to business success.

This is at the root of the ongoing industry transformation associated with observability, and Cisco brings the observability perspective across the full stack by tapping into billions of points of telemetry data across multiple sources to achieve cross-domain ingestion and analysis.

With Cisco Full-Stack Observability solutions, teams can prioritize and then remediate issues together, becoming true partners in achieving business objectives while ensuring customers and end users always get the best digital experiences.

Learn more about Cisco Observability Platform



Authors

Carlos Pereira

Cisco Fellow and Chief Architect

Strategy, Applications, Emerging Technologies and Incubation