This week we saw the largest solar storm in nearly a decade. Such “solar weather”, or cosmic radiation, is what generates phenomena such as the “Northern Lights”. However, the intense solar activity that creates electromagnetic storms can also generate exceptionally strong power surges that damage electrical distribution systems, knock out satellites, and affect sensitive electronics. This has happened before, most notably the 1989 grid failure in Quebec, which blacked out the entire province.
It’s a well-known problem for aerospace engineers designing electronics for airplanes and satellites: a single stray particle can flip a bit in a chip, an event known as a “Single Event Upset”. These upsets are also an issue in ground-based systems that must meet high-reliability operating requirements (although on the ground they are typically caused by something other than cosmic radiation). The key challenge is that as electronics operate at faster speeds (beyond 10G) and the density of silicon chips increases, it becomes more likely that a stray bit of energy will cause problems that affect the performance of a router or switch. For service providers building mission-critical networks, even “very rare” is still too often. Our challenge was therefore to figure out how to prevent these unusual events, despite the lack of data or industry standards.
Cisco kicked off a program in 2001 to research the effects of these rare but real events and determine how to prevent them, especially for our larger, mission-critical systems such as the CRS-3. We’ve even gone so far as to place equipment in a particle accelerator to simulate the long-term effects of cosmic radiation. One key discovery was that small, incremental changes were insufficient. It was necessary to architect systems from the ground up to hit our reliability objectives, with system, component, and software elements designed to work together. To validate our designs, we also tested our competitors’ equipment under the same accelerated conditions.
Several current and former Cisco employees (Allan Silburt, Shi-Jie Wen, David Ward, Adrian Evans, and Dean Hogle) wrote a pathbreaking paper on the subject, published in “IEEE Transactions on Nuclear Science” in 2008 under the title “Specification and Verification of Soft Error Performance in Reliable Internet Core Routers”. Needless to say, it’s not light reading, but if you are an IEEE member you can download a copy (Digital Object Identifier: 10.1109/TNS.2008.2001742).
The key point of the paper is that achieving reliable performance requires a top-down understanding of the system, to define how the hardware needs to behave, followed by a bottom-up design methodology. This methodology must span everything from custom silicon chips to software to protocols that leverage the resiliency features.
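To make the top-down, bottom-up idea a little more concrete, here is a minimal Python sketch of how a reliability budget might be checked. It is an illustration only, not the method from the paper: the component names, raw upset rates, mitigation fractions, and the ten-year target are all hypothetical. The one standard piece is the FIT unit (one failure per billion device-hours).

```python
# Illustrative sketch only: every component name and number here is hypothetical.
HOURS_PER_YEAR = 24 * 365
FIT_HOURS = 1e9  # 1 FIT = one failure per 10**9 device-hours


def fit_from_interval_years(years_between_events: float) -> float:
    """Convert 'one event every N years' into a FIT rate."""
    return FIT_HOURS / (years_between_events * HOURS_PER_YEAR)


def rolled_up_fit(components: dict[str, tuple[float, float]]) -> float:
    """Sum the residual FIT: raw upset rate times the fraction NOT mitigated."""
    return sum(raw * (1.0 - mitigated) for raw, mitigated in components.values())


# Top down: assume the system may see one service-affecting upset per 10 years.
target_fit = fit_from_interval_years(10)

# Bottom up: (raw upset rate in FIT, fraction absorbed by hardware or software).
components = {
    "forwarding_asic": (50_000.0, 0.95),
    "packet_buffer_memory": (200_000.0, 0.99),
    "control_plane_cpu": (20_000.0, 0.90),
}

residual_fit = rolled_up_fit(components)
meets_target = residual_fit <= target_fit

print(f"target: {target_fit:.0f} FIT, residual: {residual_fit:.0f} FIT")
print("meets target" if meets_target else "misses target")
```

The point of the sketch is the shape of the reasoning: the system-level requirement sets the budget, and each chip, memory, and software recovery mechanism has to be accounted for before the total can be compared against it.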
As a result of this research, Cisco has sought to innovate with ASICs, system architectures, and software designs for our mission-critical service provider platforms to minimize the impact of Single Event Upsets. As our lives depend more and more on networked electronic devices to get us through the day, an increased emphasis on reliability is bound to propagate out from the network core to a mobile device near you.
So can the Internet survive a blast of cosmic rays? When it’s built on Cisco, the answer is yes.