Service providers (SPs) often face a number of service quality challenges. These challenges, more often than not, result from hardware failures, software bugs, network outages, packet loss, and capacity issues. The majority of these challenges may not be new, and may have already been resolved by SPs’ technology partners, or by other operators. Indeed, SPs could capture significant operational benefits simply by adopting well-established best practices.
However, adopting these best practices requires a proactive and open relationship between SPs and their technology partners. Without open cooperation, both adopting these practices and improving continuously will remain difficult.
To explore the relationship between an SP’s culture and the adoption of best practices, I will be writing a series of articles on the SP360 blog covering operational and engineering best practices, challenges, and benchmarks observed in the course of working with major service providers worldwide. The specific topics I will cover include operational practices (testing, certification, engineering rules, go-live, and incident management) and organizational capabilities (planning, program management, culture, management practices, IP skillsets, and staffing levels).
A good place to start is testing. Testing is critical, as any complex system will always have bugs. The way in which new network elements and software are tested prior to their integration into a production network can heavily influence that network’s quality. We have found that leading SPs test new software extensively, both in their own labs and in Cisco’s. Testing typically lasts eight to ten weeks and includes functional, scale, integration, and regression testing. However, there can be significant differences in how a given SP coordinates the testing with Cisco, configures the test environment, and performs the actual testing. SPs with the best service quality often develop common test plans with Cisco. Both the SP and Cisco then use the common test plans to coordinate and perform the testing in their respective environments.
In most instances, the SP and Cisco will each follow the test procedures in the common test plan to the letter, and then compare results with one another. That means that everything running on the SP’s production network is shared with Cisco. Furthermore, every change to the network, and every newly added feature, is reflected in the test plan.
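As a rough illustration, the comparison step could be sketched in Python as follows. The test IDs, the pass/fail dictionaries, and the function name `compare_results` are assumptions made for this example; they do not describe any actual SP or Cisco tooling.

```python
def compare_results(sp_results, vendor_results):
    """Compare pass/fail outcomes from two labs running the same test plan.

    Both arguments are hypothetical dicts mapping a test ID to True (pass)
    or False (fail). Returns two sorted lists:
      - tests that both labs ran but with different outcomes
      - tests that only one of the two labs reported
    """
    ran_by_both = sp_results.keys() & vendor_results.keys()
    mismatched = sorted(t for t in ran_by_both if sp_results[t] != vendor_results[t])
    missing = sorted(sp_results.keys() ^ vendor_results.keys())
    return mismatched, missing


# Hypothetical results from the SP lab and the vendor lab.
sp_lab = {"func-001": True, "scale-004": False, "regr-010": True}
vendor_lab = {"func-001": True, "scale-004": True, "integ-002": True}

mismatched, missing = compare_results(sp_lab, vendor_lab)
print("Disagreements to investigate:", mismatched)
print("Tests reported by only one lab:", missing)
```

A disagreement is worth investigating in its own right: it may point to a difference between the two lab environments rather than to a software bug.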
In one instance, the SP and Cisco actually divided testing responsibilities 50-50, essentially halving the testing cycle. This was possible only because their respective labs closely resembled the SP’s actual production network in several ways. First, they reproduced the network topology and routing architecture, using a sufficiently large number of routers to simulate real traffic flows. Second, the labs’ networks reproduced the feature functionality configured on the routers in the production network. This was critical because service-impacting bugs can be caused by individual features, as well as by unintended interactions among them. Third, any external software interacting with the routers, such as scripts and MIBs, was simulated, as it could conceivably impact service availability. Last, they reproduced absolute traffic levels on the production network and simulated multi-user environments, because some bugs only emerge when a router’s CPUs or interfaces are under heavy load. Other SPs went so far as to simulate adverse conditions, such as route flaps, because these can potentially trigger non-linear effects, which in turn can lead to a service outage.
On the other hand, SPs with the worst service quality records tend not to follow many of these testing practices. As a result, they are likely to miss a significant number of software bugs, which then manifest themselves on the production network and lead to service outages.
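The four points of resemblance above lend themselves to a simple checklist. The Python sketch below is one hypothetical way to record them and list where a lab falls short of production; the dictionary keys, example values, and the function name `fidelity_gaps` are all assumptions for illustration, not a description of how any SP or Cisco lab is actually audited.

```python
def fidelity_gaps(production, lab):
    """Return, per dimension, what production has that the lab does not reproduce."""
    gaps = {}
    # Set-valued dimensions: every production item should also exist in the lab.
    for dimension in ("topology_roles", "features", "external_software"):
        missing = set(production[dimension]) - set(lab[dimension])
        if missing:
            gaps[dimension] = sorted(missing)
    # Traffic: the lab should be able to reach the production peak load.
    shortfall = production["peak_traffic_gbps"] - lab["peak_traffic_gbps"]
    if shortfall > 0:
        gaps["peak_traffic_gbps"] = shortfall
    return gaps


# Hypothetical inventories; the values are placeholders, not real network data.
production = {
    "topology_roles": ["core", "edge", "route-reflector"],
    "features": ["bgp", "mpls", "qos"],
    "external_software": ["config-script", "snmp-poller"],
    "peak_traffic_gbps": 400,
}
lab = {
    "topology_roles": ["core", "edge"],
    "features": ["bgp", "mpls", "qos"],
    "external_software": ["config-script", "snmp-poller"],
    "peak_traffic_gbps": 400,
}

print(fidelity_gaps(production, lab))
```

An empty result would suggest that, on these four dimensions at least, the lab is close enough to production for test results to be shared between the two parties.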
In closing, the need for extensive testing highlights that network-related issues are indeed a significant challenge for SPs. But through the use of industry best practices and collaboration with technology partners, SPs can take a more proactive approach to managing these issues and can dramatically improve operational quality. I will cover more best practices in upcoming columns.