Capacity Planning at Cisco
Capacity planning depends on accurate measurement, but what you do with the measurements depends on the service, the region, and where your business is going. Here’s how we do it, and what we expect to face in the future.
Measuring WAN circuit capacity depends on the circuit design at each branch office. The standard Cisco architecture for any WAN connection is a primary and a secondary WAN circuit. For most sites, where available and cost-effective, the two circuits are the same size and we load balance across them. Sometimes, however, to reduce costs, we provide a smaller backup circuit and accept that some traffic will not be served during the short time of a primary WAN link outage (video conferencing may stop, voice may go out the voice gateway, etc.). In that case, capacity planning is done on the primary circuit.
There are not many tools available for capacity planning, and not much automation has grown up around the process. Mostly, we use three different homegrown reports. The first of these reports remains the same from our earlier capacity-planning days; the second helps us deal with transient peak traffic; and the third helps us look at service levels.
(1) Mostly we use a “60:10” report: a circuit that hits 60% utilization for more than 10% of business hours is flagged for review. We then investigate the sources of the traffic to determine: Is it justified traffic, are there problems with the router configuration or QoS settings, or is the traffic due to malware or infection? If we see that this is business-justified traffic due to organic growth, new sites, or new services, we provide a final recommendation to the individual implementation teams for bandwidth or equipment upgrades, or for QoS or configuration changes. This requires partnership with Workplace Resources, who know when locations are going to open, close, grow, or shrink. We maintain a spreadsheet showing the number of people at each site, and we try to provide a certain amount of bandwidth per person based on the services used at that site.
(2) We are also starting to use a peak report, which looks for sites that peak beyond 90% utilization on rare occasions. In the past most traffic was somewhat predictable, but as broadband desktop video increases at Cisco, bandwidth variability is increasing dramatically. When we see a peak load, even for a short time, we need to figure out what caused that peak, and whether we need to plan for it happening again, or more often.
(3) We are also looking more carefully at utilization not just at the circuit level, but also at utilization and packet loss within each of the QoS queues. Packet loss in the voice or video queues is far more significant than packet loss in the “scavenger” (or “batch”) queues. We’re using NetVoyant and other NetQoS tools for this, which pull information from SNMP traps. We can tell when a TP session drops packets, so we can monitor sessions for capacity issues very easily. Other tools pull from NetFlow information on routers.
(4) The combination of the NetQoS Report Analyzer (based on NetFlow data) and the NetQoS NetVoyant tool (based on SNMP traps) allows us to build dashboards to monitor special virtual events. All critical links are on one dashboard, which shows the utilization by queue, top talker applications, dropped packets,…
(5) Very similar processes are followed for our Unified Communications (UC) environment, where the tool we use is PhonEx from MindCTI. It collects all CDR data from our UC environment and allows us to report on voice utilization in a similar way. It also has a voice fraud function that raises an alert when call patterns change.
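The 60:10 and peak checks described in (1) and (2) can be sketched in a few lines of Python. This is a minimal illustration, not the homegrown reports themselves: the names `evaluate_circuit` and `CircuitReport`, and the assumption that utilization arrives as a list of per-interval fractions sampled during business hours, are hypothetical.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class CircuitReport:
    site: str
    share_above_busy: float   # fraction of business-hour samples at or above 60%
    peak_utilization: float   # highest single sample seen
    flagged_60_10: bool       # the "60:10" rule fired
    flagged_peak: bool        # the peak report rule fired


def evaluate_circuit(
    site: str,
    samples: Sequence[float],
    busy_threshold: float = 0.60,   # the "60" in 60:10
    busy_share: float = 0.10,       # the "10" in 60:10
    peak_threshold: float = 0.90,   # peak report limit
) -> CircuitReport:
    """Apply the 60:10 and peak rules to one circuit.

    `samples` holds utilization fractions (0.0 to 1.0), one per polling
    interval, already restricted to business hours (an assumption).
    """
    if not samples:
        raise ValueError("no utilization samples for " + site)
    busy = sum(1 for u in samples if u >= busy_threshold)
    share = busy / len(samples)
    peak = max(samples)
    return CircuitReport(
        site=site,
        share_above_busy=share,
        peak_utilization=peak,
        flagged_60_10=share > busy_share,   # "more than 10% of business hours"
        flagged_peak=peak > peak_threshold, # "peak beyond 90%"
    )
```

A flagged circuit is only a prompt for investigation: the report cannot tell whether the traffic is business-justified, misconfigured, or malicious. The thresholds are parameters so that they can be lowered for regions or circuit sizes with longer lead times.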
There are common variations on the “60% utilization” rule:
(1) The 60% utilization rule varies by region. Some regions have longer lead times for circuit upgrades than others. In the US you can often get a new circuit installed in 30-45 days. In Europe it can be 60 days, and in parts of the Middle East, Africa, and Latin America, it can be 90-180 days or more. For these regions we set our utilization thresholds lower, to give us more time to respond.
(2) The 60% utilization rule varies depending on size of current circuit, for the same reason. Larger circuits can take longer for service providers to deliver.
(3) However, we also tend to reduce our upgrade lead time for larger circuits, since they cost more money: every time we upgrade a circuit several months too soon, we spend money we did not need to spend.
(4) We also look for offices that have low utilization (e.g. 10% or less) and consider reducing their bandwidth to reduce costs. (10% is used as a rule of thumb because, in general, bandwidth is purchased in multiples of 4 and drops in multiples of ¼, e.g. from an OC-12 to an OC-3, so utilization roughly quadruples after a downgrade.)
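The arithmetic behind the 10% rule of thumb in (4) can be checked with a small Python sketch. The function names are hypothetical, and the model makes a simplifying assumption that the site's traffic volume stays the same after the circuit is resized.

```python
def utilization_after_resize(current_utilization: float, size_ratio: float) -> float:
    """Estimate utilization after changing circuit size.

    size_ratio is new bandwidth divided by old bandwidth, so 0.25 models
    a one-step downgrade such as OC-12 to OC-3. Assumes the traffic
    volume itself does not change (a simplification).
    """
    if size_ratio <= 0:
        raise ValueError("size_ratio must be positive")
    return current_utilization / size_ratio


def is_downgrade_candidate(
    current_utilization: float,
    step_down_ratio: float = 0.25,    # bandwidth drops in multiples of 1/4
    upgrade_threshold: float = 0.60,  # the 60% rule
) -> bool:
    """True if the smaller circuit would still sit below the upgrade threshold."""
    projected = utilization_after_resize(current_utilization, step_down_ratio)
    return projected < upgrade_threshold


# A site at 10% would run at about 40% on a circuit one quarter the size,
# which still leaves headroom below the 60% threshold.
print(utilization_after_resize(0.10, 0.25))
print(is_downgrade_candidate(0.10))
print(is_downgrade_candidate(0.20))
```

Under this model a site at 20% would land at about 80% after a downgrade and immediately trip the 60% rule, which is why the cutoff for considering a reduction sits well below the upgrade threshold.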
For more information on Capacity Planning, see