Service Capacity Planning at Cisco
Capacity planning is getting far more complicated as network services get more complex. It requires understanding each service as a whole, cutting across several traditional IT services like network and data center capacity planning. Here’s how Cisco IT is starting to address these new service-based capacity issues, focusing mainly on network and voice capacity management.
Cisco IT design and support teams are divided into regional and global technology areas. Campus LAN and regional WAN support are handled by one of four different regional teams around the world. Other services – the global network backbone, data center support, wireless services, Extranet services, and others, as well as collaboration services like voice and video and web conferencing – are supported by global teams.
In the past, capacity planning was handled by each team, acting separately. If a WAN link or a voice gateway or a data center floor was reaching full capacity, that team would identify the issue and set about dealing with it. When IT services mostly meant providing a set of working applications in the data center and the network required to connect to them, this relatively simple and fragmented capacity planning process worked fine. But as services got more complex and more interconnected, Cisco IT needed to take an overall Capacity Management service view, looking at the architecture as a whole and all the services within that architecture, to make sure that we provide the capacity to support our business services. We also need to be more proactive. Services like networked video and data center storage consume large amounts of costly IT resources, are growing quickly, and are critical to the smooth functioning of our business, so we have to stay ahead of demand while installing only as much of these costly resources as needed.
An excellent example of a complex service requiring an architectural service view is Webex conferencing. Webex conferencing allows groups of Cisco employees to meet and share a lot of information – sharing laptop screens and presentations, combined with voice and video conferencing – and also allows the meeting host to record the meeting to share with people who couldn’t attend. Webex conference usage has been growing rapidly at Cisco as we work more globally, and also as we try to reduce our travel spending.
This rapid growth in Webex conferencing is adding a lot more voice traffic to the network, and changing its direction as well. People used to dial into a meeting bridge, which increased inbound traffic; but as people use the callback function from the Webex voice bridges in San Jose (and soon in London as well) to dial back to their phones – office phones, home phones, mobile phones – our outbound traffic has increased dramatically. All this Webex conferencing usage also generates network traffic as people share screens, presentations, and video streams over the WAN. And as more people record their calls (at around 25-30MB per hour of recorded conference call) on the shared Webex storage frames, storage capacity is becoming an issue too, at a time when Cisco data centers are so full that we are being forced into a major migration to new data centers worldwide.
All of these parts – voice, video, web, and data center storage – come together to make up the conferencing experience. If voice or video is bad, or screen sharing fails, or you can’t record the meeting, people will be unhappy with the service. Measuring the capacity and quality of Webex requires knowing the capacity and quality of each of these areas of the service, down to knowing how the voice queues and gateways used by Webex are doing, how the servers and storage available to Webex are doing, and so on. Only when we can monitor, track, and do capacity planning for all of these traditional IT services together can we be certain that we are delivering good service. We can still measure the capacity of the WAN, of voice, and of the data center as separate IT components; but we have to start thinking about them, and managing their capacity, as parts of a large set of business services, at least from a planning perspective.
Capacity planning depends on accurate measurement; but what you do with the measurements depends on the service, the region, and where your business is going. Here’s how we do it, and what we expect to be facing in the future.
Measuring WAN circuit capacity depends on the circuit design at each branch office. The standard Cisco architecture for any WAN connection is a primary and a secondary WAN circuit. For most sites, where available and cost effective, the two circuits are the same size and we load balance across the two. Sometimes, however, we provide a smaller backup circuit to reduce costs, and assume that some of the traffic will not be served during the short time of a primary WAN link outage (video conferencing may stop, voice may go out the voice gateway, etc.). Capacity planning is done on the primary circuit.
There are not many tools available for doing capacity planning, and not much automation has grown up around that process. Mostly, we use three different homegrown reports for this. The first of these reports remains the same from our earlier capacity-planning days; the second helps us deal with transient peak traffic; and the third helps us look at service levels.
(1) Mostly we use a “60:10” report: a circuit that hits 60% utilization for more than 10% of business hours is flagged to be looked at. When a circuit is flagged, we investigate the sources to determine: Is it justified traffic, or are there problems with the router configuration or QoS settings, or is this traffic due to malware or infection? If we see that this is business-justified traffic due to organic growth, new sites, or new services, we provide a final recommendation to individual implementation teams for bandwidth or equipment upgrades, or QoS or configuration changes. This requires partnership with Workplace resources, who know when locations are going to open, close, grow or shrink. We maintain a spreadsheet showing the number of people at each site, and try to provide a certain amount of bandwidth per person based on the services used at that site.
(2) We are also starting to use a peak report – looking for locations where a site might peak beyond 90% on rare occasions. In the past most traffic was somewhat predictable, but as broadband desktop video is increasing at Cisco, bandwidth variability is increasing dramatically. When we see a peak load, even for a short time, we need to figure out what caused that peak, and whether we need to plan for that peak happening again, or more often.
(3) And we are looking more carefully at utilization not just at the circuit level, but also at utilization and packet loss within each of the QoS queues. Packet loss in the voice or video queues is far more significant than packet loss in the “scavenger” (or “batch”) queues. We’re using NetVoyant and other NetQoS tools for this, which pull information from SNMP traps. We can tell when a Telepresence (TP) session drops packets, so we can monitor sessions for capacity issues very easily. Other tools pull from NetFlow information on routers.
(4) The combination of the NetQoS Report Analyzer, based on NetFlow data, and the NetQoS NetVoyant tool, based on SNMP traps, allows us to build dashboards to monitor special virtual events. All critical links are on one dashboard, which shows the utilization by queue, top talker applications, dropped packets, and so on.
(5) Very similar processes are followed for our Unified Communications environment. PhonEx from MindCTI is the tool we use. It collects all CDR (call detail record) data from our UC environment and allows us to report on voice utilization in a similar way. It also has a voice fraud function which alerts us when there are changes in the call patterns.
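The 60:10 and peak reports described in (1) and (2) can be sketched in a few lines of Python. This is a minimal illustration, not Cisco IT's actual report: the function and field names are hypothetical, and it assumes utilization has already been collected as equal-length samples (values from 0.0 to 1.0) of the primary circuit during business hours.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class CircuitReport:
    site: str
    busy_fraction: float  # share of samples at or above the utilization threshold
    peak: float           # highest single utilization sample
    flag_60_10: bool      # True if the 60:10 rule is tripped
    flag_peak: bool       # True if any sample went beyond the peak threshold


def evaluate_circuit(site: str, samples: List[float],
                     util_threshold: float = 0.60,
                     time_threshold: float = 0.10,
                     peak_threshold: float = 0.90) -> CircuitReport:
    """Apply the 60:10 rule and the peak rule to one circuit.

    samples: business-hours utilization of the primary circuit, one value
    per equal-length polling interval (an assumption of this sketch).
    """
    if not samples:
        raise ValueError("no utilization samples for " + site)
    busy = sum(1 for u in samples if u >= util_threshold)
    busy_fraction = busy / len(samples)
    peak = max(samples)
    return CircuitReport(
        site=site,
        busy_fraction=busy_fraction,
        peak=peak,
        flag_60_10=busy_fraction > time_threshold,
        flag_peak=peak > peak_threshold,
    )


# Example with made-up samples: 20 of 100 intervals are at or above 60%,
# and the highest sample is 95%, so both flags are raised.
report = evaluate_circuit("branch-a", [0.30] * 80 + [0.65] * 15 + [0.95] * 5)
```

A flag here only means "investigate": as described above, the next step is to work out whether the traffic is business-justified before recommending an upgrade.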
There are common variations on the “60% utilization” rule:
(1) The 60% utilization rule varies depending on region. Some regions have longer lead times for circuit upgrades than others. In the US you can often get a new circuit installed in 30-45 days. In Europe it can be 60 days, and in parts of the Middle East, Africa, and LATAM, it can be 90-180 days or more. For these regions we set our utilization thresholds lower, to give us more time to respond.
(2) The 60% utilization rule varies depending on size of current circuit, for the same reason. Larger circuits can take longer for service providers to deliver.
(3) However, we also tend to reduce our upgrade lead time for larger circuits since they cost more money, and every time we upgrade a circuit several months too soon, we spend money we shouldn’t have.
(4) We also look for offices that have low utilization (e.g. 10% or less) and consider reducing their bandwidth to reduce costs. (10% is used as a rule of thumb because, in general, bandwidth is purchased in multiples of 4 and drops in multiples of ¼, e.g. from an OC-12 to an OC-3. A circuit running at 10% would run at roughly 40% after such a drop, still below the 60% threshold.)
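The arithmetic behind these variations can be sketched as follows. The downsizing check follows directly from the 10% rule of thumb. The lead-time function is only one possible way to reason about lower thresholds in slow-delivery regions: it assumes steady compound monthly growth and a utilization ceiling, neither of which comes from our process, and all names are hypothetical.

```python
def utilization_after_resize(current_util: float, size_ratio: float) -> float:
    """Utilization if the same traffic ran on a circuit size_ratio times as large.

    size_ratio=0.25 models a drop to one quarter of the bandwidth.
    """
    if size_ratio <= 0:
        raise ValueError("size_ratio must be positive")
    return current_util / size_ratio


def can_downsize(current_util: float, size_ratio: float = 0.25,
                 flag_threshold: float = 0.60) -> bool:
    """True if the smaller circuit would still sit below the flag threshold."""
    return utilization_after_resize(current_util, size_ratio) < flag_threshold


def threshold_for_lead_time(ceiling: float, monthly_growth: float,
                            lead_days: int) -> float:
    """Utilization at which to order an upgrade so that traffic stays at or
    below `ceiling` until the circuit is delivered.

    Assumes compound growth of `monthly_growth` per 30 days (an assumption
    of this sketch, not a measured figure).
    """
    months = lead_days / 30.0
    return ceiling / (1.0 + monthly_growth) ** months


# A site at 10% utilization would run at about 40% on a quarter-size circuit,
# so it is a downsizing candidate; a site at 20% would run at about 80%.
candidate = can_downsize(0.10)
not_candidate = can_downsize(0.20)

# With the same assumed growth, a 180-day lead time calls for a lower
# trigger threshold than a 45-day lead time.
slow_region = threshold_for_lead_time(0.80, 0.03, 180)
fast_region = threshold_for_lead_time(0.80, 0.03, 45)
```

The growth rate and ceiling above are placeholders; in practice the thresholds are set per region and per circuit size from experience with local service providers.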
Capacity planning is facing some significant problems with two new services in the future: high definition desktop/laptop video, and home Telepresence. Video has a significant impact on bandwidth use, and these two services threaten to place new demands on the network.
Home telepresence will place significant load on the Internet and on our enterprise Internet points of presence. Currently our few Internet POPs don’t handle the sort of traffic that will be generated by large numbers of high definition video streams; and in addition to traffic load, we have to maintain strict control over latency and packet drops for real-time video. We are still working on figuring out how to size and architect the Internet POPs to maintain security and still deliver high quality video through the POPs.
High definition desktop / laptop video has the most potential for stressing the network, since the video streams will come from unpredictable locations. Telepresence video sites on our WAN are fixed in location, but we are about to enable high quality, high definition (and thus higher bandwidth) video from the desktop – both with new IP phones (8900 and 9900), and with HD video cameras for conferencing. We could have a branch office which has one or two TP rooms – but with 50 people we could suddenly have 50 additional HD video streams coming from a single site. The next day that same building could host a meeting where 50 more people come, and they could generate another 50 HD video streams from that one site. This could come from any Cisco building at any time. This is a huge difference from the WAN suddenly carrying twice the normal email traffic: video takes more bandwidth, and is very sensitive to latency and packet loss. This is the major reason for flagging the peak (90%) traffic utilization. We don’t have any other solution to this problem as yet.
Without perfect solutions to these problems, we are looking at a range of other options. For example, today our standard practice is that all voice goes over the WAN. But we are beginning to offload a lot of our outbound voice traffic via SIP trunks to the Internet, and are open to comparing the price of Internet voice traffic with the price of added WAN bandwidth. Also, we’re carrying a lot of 1080p HD video, when we could reduce the load considerably by downgrading the video streams to 720p HD. We are working to provide more flexibility in our video services, and would like to be able to drop individual streams from 1080p to 720p intelligently when some portion of the end-to-end circuit reaches capacity.
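To make the 1080p-to-720p idea concrete, here is a small sketch of one possible downgrade policy: step streams down one at a time until the total video load fits the bandwidth available. The per-stream bitrates are hypothetical placeholders chosen for easy arithmetic, not measured figures, and the function names are illustrative only.

```python
from typing import List

# Hypothetical per-stream bitrates in Mbps (placeholders, not measured values).
BITRATE_MBPS = {"1080p": 4.0, "720p": 2.0}


def total_mbps(resolutions: List[str]) -> float:
    """Total bandwidth of a set of streams, given each stream's resolution."""
    return sum(BITRATE_MBPS[r] for r in resolutions)


def fit_streams(num_streams: int, available_mbps: float) -> List[str]:
    """Start every stream at 1080p and downgrade streams to 720p one at a
    time until the total fits in available_mbps or all are downgraded.

    If the result still exceeds available_mbps, downgrading alone is not
    enough and some other measure is needed.
    """
    if num_streams < 0:
        raise ValueError("num_streams must be non-negative")
    resolutions = ["1080p"] * num_streams
    total = num_streams * BITRATE_MBPS["1080p"]
    saving = BITRATE_MBPS["1080p"] - BITRATE_MBPS["720p"]
    i = 0
    while total > available_mbps and i < num_streams:
        resolutions[i] = "720p"
        total -= saving
        i += 1
    return resolutions


# 50 desktop streams from one site with 150 Mbps available for video
# (both numbers are illustrative): half the streams are stepped down.
plan = fit_streams(50, 150.0)
downgraded = plan.count("720p")
```

A real policy would also have to decide which streams to downgrade first and which segment of the end-to-end path is the bottleneck; this sketch treats the path as a single pool of bandwidth.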
We are preparing various options for our business decision makers: supporting video for all users, or supporting video for some, and providing cost information for each option; and we will see what the business would like to support. We are also trying to build additional capacity into our network in the core to support increased traffic demands (see NLR blog – hyperlink).
For more information on Capacity Planning, see