
Optimize Your Software-Defined Network by Assessing Hardware Requirements

Software-based techniques are transforming networking. Commercial off-the-shelf hardware is finding a place in several networking use cases. However, high-performance hardware is also an important part of successful software-defined networking (SDN). As you optimize your networks using SDN tools and complementary technologies such as network function virtualization (NFV), an important step is to strategically assess your hardware needs based on the functions and performance requirements of each application and service, aligned with your intended business outcomes.

Two Categories of High Performance Hardware

  • Network hardware that uses purpose-built designs. These often involve specialized application-specific integrated circuits (ASICs) to achieve significantly higher performance than what is possible or economically feasible with commercial off-the-shelf servers based on state-of-the-art, x86-based, general-purpose processors.
  • Network hardware that uses standard x86 servers enhanced to provide high performance and predictable operation, for example via special software techniques that bypass hypervisors, virtualization environments, and operating systems.

Where to Deploy Network Functions
Can virtualized network functions be deployed like cloud-based applications? No. There is a big difference between deploying network functions as software modules on x86 general-purpose servers and using a common cloud computing model to implement network virtualization. Simply migrating existing network functions to general-purpose servers without due regard for all the network requirements leads to dramatically uneven and unpredictable performance. This unpredictability arises mainly because data plane workloads are often I/O bound and/or memory bound, and because the software layers contain configuration details that significantly affect performance.
These issues are not specifically about hardware but about how the software handles the whole environment. Operating systems, hypervisors, and other infrastructure that does not follow best practices for data plane applications will continue to contribute to unpredictable performance.

Bandwidth and CPU Needs

A good way to begin assessing hardware requirements is to examine network functions along two dimensions: I/O bandwidth (throughput) needs and computational power needs. In considering which network functions to virtualize and where to virtualize them, the CPU load and bandwidth load required at different layers of the network can help determine that some, but not all, network functions are suitable for virtualization.

Applications with lower I/O bandwidth and low-to-high CPU requirements may be most appropriate for virtualized deployment on optimized x86 servers. Applications with higher I/O bandwidth and low-to-high CPU requirements may be best deployed on specialized high-performance hardware with specialized silicon. Many other factors may play a role in determining what hardware to use for which applications, including cost, user experience, latency, networking performance, network predictability, and architectural preferences.
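To visualize that two-dimensional assessment, here is a minimal sketch in Python. The 10 Gbps threshold and the function name are hypothetical placeholders, not Cisco guidance; in practice the cutoff depends on cost, latency, predictability, and the other factors listed above.

```python
# A minimal sketch of the two-dimensional assessment described above.
# The threshold is a hypothetical placeholder -- in practice it depends
# on your traffic profile, latency budget, and cost model.

IO_BANDWIDTH_THRESHOLD_GBPS = 10  # assumed cutoff, for illustration only

def suggest_platform(io_bandwidth_gbps: float) -> str:
    """Map a network function's I/O needs to a deployment platform.

    Per the guidance above, CPU demand (low to high) can be served by
    either platform class; I/O bandwidth is the key differentiator.
    """
    if io_bandwidth_gbps < IO_BANDWIDTH_THRESHOLD_GBPS:
        return "optimized x86 server (virtualized deployment)"
    return "purpose-built platform with specialized silicon (e.g., ASICs)"

# A control-plane-heavy function with modest I/O needs:
print(suggest_platform(2.0))    # optimized x86 server
# A data-plane function pushing line-rate traffic:
print(suggest_platform(100.0))  # purpose-built platform
```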
Service-Network Abstraction is Key
Additionally, you might not need high-performance hardware for certain functions initially. But as a particular function scales, it might require a high-performance platform to meet its performance specifications, or it might be more economical on a purpose-built platform. So you might start out on commercial off-the-shelf hardware and transfer the workload to high-performance hardware later. If you have focused on establishing a clean abstraction of the services from the underlying hardware infrastructure using SDN principles, the network deployment can be changed or evolved independently of the upper-layer services and applications. This is the true promise of SDN.
Read more about how to assess hardware performance requirements in your SDN in the Cisco® white paper “High-Performance Hardware: Enhance Its Use in Software-Defined Networking.” You can find it here: “Do You Know your Hardware Needs?” along with other useful information.

Do you have questions or comments? Tweet us at @CiscoSP360


HDX Blog Series #2: Scaling with Turbo Performance

Editor’s Note: This is the second of a four-part deep dive series into High Density Experience (HDX), Cisco’s latest solution suite designed for high density environments and next-generation wireless technologies. For more on Cisco HDX, visit www.cisco.com/go/80211ac.  Read part 1 here

With any new technology comes a new set of obstacles to overcome.  802.11ac is no exception.  Last week we talked about CleanAir for 802.11ac and why spectrum intelligence still matters. Another challenge is scalability. In this post I will give you some details on a new HDX feature, Turbo Performance, which allows the AP 3700 to overcome common scaling issues and scale remarkably well.

What’s Different with 802.11ac?

802.11ac means higher data rates, which means more packets per second (PPS).  There are three reasons for more PPS with 11ac: wider channels, increased modulation, and increased aggregation.  Channel width doubled to 80 MHz, modulation increased from 64-QAM to 256-QAM, and maximum aggregation increased from 64 KB to 1 MB!

With 802.11n, an AP might have had to push 30,000 1500-byte packets per second through the AP's data plane. Today with 802.11ac that could be 75,000+ PPS.  More PPS means more load on the AP's CPU, so to really keep up with the demands of 802.11ac, we needed to go back to the drawing board.   Read More »
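To make the packet-rate arithmetic concrete, here is a quick back-of-the-envelope sketch. The throughput figures are illustrative assumptions chosen to line up with the packet rates quoted above, not measured AP 3700 numbers.

```python
# Back-of-the-envelope PPS math for 1500-byte packets.
# Throughput figures are illustrative assumptions, not measured rates.

PACKET_SIZE_BYTES = 1500
BITS_PER_BYTE = 8

def packets_per_second(throughput_mbps: float) -> float:
    """Packets per second at a given throughput, for 1500-byte frames."""
    return throughput_mbps * 1_000_000 / (PACKET_SIZE_BYTES * BITS_PER_BYTE)

print(packets_per_second(360))  # 30,000 PPS -- an 802.11n-class load
print(packets_per_second(900))  # 75,000 PPS -- an 802.11ac-class load
```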


VDI “The Missing Questions” #7: How memory bus speed affects scale

March 13, 2013 at 11:59 am PST

This was the test I most eagerly anticipated because of the lack of information on the web about running a Xeon-based system at a reduced memory speed. Here I am at Cisco, the company that produces one of the only blades in the industry capable of supporting both the top-bin E5-2690 processor and 24 DIMMs (HP and Dell can't say the same), yet I didn't know the performance impact of using all 24 DIMM slots. Sure, technically I could tell you that the E5-26xx memory bus runs at 1600MHz with two DIMMs per channel (16 DIMMs) and at a slower speed with three DIMMs per channel (24 DIMMs), but how does a change in MHz on a memory bus affect the entire system? Keep reading to find out.

Speaking of memory, don’t forget that this blog is just one in a series of blogs covering VDI:

The situation. As you can see in the 2-socket block diagram below, the E5-2600 family of processors has four memory channels and supports three DIMMs per channel. For a 2-socket blade, that’s 24 DIMMs. That’s a lot of DIMMs. If you populate either 8 or 16 DIMMs (1 or 2 DIMMs per channel), the memory bus runs at the full 1600MHz (when using the appropriately rated DIMMs). But when you add a third DIMM to each channel (for 24 DIMMs), the bus slows down. When we performed this testing, going from 16 to 24 DIMMs slowed the entire memory bus to 1066MHz, so that’s what you’ll see in the results. Cisco has since qualified running the memory bus at 1333MHz in UCSM maintenance releases 2.0(5a) and 2.1(1b), so running updated UCSM firmware should yield even better results than we saw in our testing.
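As a quick sketch of that rule, here is a toy model using only the values reported in this post; check the current UCS spec sheets for your exact DIMM type and firmware level.

```python
# Memory bus speed as a function of DIMMs per channel (DPC) for the
# E5-2600 configuration described above. Values are as reported in this
# post: 1600MHz at 1-2 DPC, 1066MHz at 3 DPC as originally tested, or
# 1333MHz at 3 DPC with updated UCSM firmware (2.0(5a) / 2.1(1b)).

CHANNELS_PER_SOCKET = 4
SOCKETS = 2

def bus_speed_mhz(total_dimms: int, updated_firmware: bool = False) -> int:
    """Bus speed for a 2-socket blade, assuming appropriately rated DIMMs."""
    dpc = total_dimms / (CHANNELS_PER_SOCKET * SOCKETS)  # DIMMs per channel
    if dpc <= 2:
        return 1600  # full speed at 1 or 2 DPC
    return 1333 if updated_firmware else 1066

print(bus_speed_mhz(16))                         # 1600
print(bus_speed_mhz(24))                         # 1066 (as tested here)
print(bus_speed_mhz(24, updated_firmware=True))  # 1333
```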

 

As we’ve done in all of our tests, we looked at two different blades with two very different processors. Let’s start with the results for the E5-2665 processor. The following graph summarizes the results from four different test runs. Let’s focus on the blue lines. We tested 1vCPU virtual desktops with the memory bus running at 1600MHz (the solid blue line) and 1066MHz (the dotted blue line). The test at 1600MHz achieved greater density, but only 4% greater density. That is effectively negligible considering that the load is random in these tests. LoginVSI is designed to randomize the load.

Read More »


VDI “The Missing Questions” #4: How much SPECint is enough

February 25, 2013 at 6:35 am PST

In the first few posts in this series, we have hopefully shown that not all cores are created equal and that not all GHz are created equal. This creates challenges when comparing two CPUs within a processor family, and even greater challenges when comparing CPUs from different processor families. If you read a blog or a study that showed 175 desktops on a blade with dual E7-2870 processors, how many desktops can you expect from the E7-2803 processor? Or from an E5 processor? Our assertion is that SPECint is a reasonable metric for predicting VDI density, and in this blog I intend to show you how much SPECint is enough [for the workload we tested].

You are here. As a quick recap, this is a series of blogs covering the topic of VDI, and here are the posts in this series:

Addition and subtraction versus multiplication and division. Shawn already explained the concept of SPEC in question 2, so I won't repeat it. You've probably noticed that Shawn talked about "blended" SPEC whereas I'm covering SPECint (integer). As it turns out, the majority of task workers really exercise the integer portion of a processor rather than the floating point portion. Therefore, I'll focus on SPECint in this post. If you know more about your users' workload, you can skew your emphasis more or less towards SPECint or SPECfp and create your own blend.

The method to the madness. Let me take you on a short mathematical journey using the figure below. Starting at the top, we know each E5-2665 processor has a SPECint of 305. It doesn’t matter how many cores it has or how fast those cores are clocked. It has a SPECint score of 305 (as compared to 187.5 for the E5-2643 processor). Continuing down the figure below, each blade we tested had two processors, so the E5-2665 based blade has a SPECint of 2 x 305… or 610. The E5-2665 blade has a much higher SPECint of 610 than the E5-2643 blade with just 375. And it produced many more desktops as you can see from the graph embedded in the figure (the graph should look familiar to you from the first “question” in this series).

And now comes the simple math to get the SPECint requirement for each virtual desktop in each test system:
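The figure in the original post carries the exact math; as a rough reconstruction, here is the same division using the blade SPECint scores above and the 1 vCPU session counts reported elsewhere in this series (81 and 130).

```python
# Reconstructing the per-desktop SPECint math from numbers already in
# this series: blade SPECint (2 sockets) divided by the 1 vCPU VSImax
# session count at 1600MHz.

blades = {
    # name: (blade SPECint = 2 x per-socket score, 1vCPU desktop count)
    "E5-2665": (2 * 305.0, 130),
    "E5-2643": (2 * 187.5, 81),
}

for name, (blade_specint, desktops) in blades.items():
    print(f"{name}: {blade_specint} / {desktops} desktops "
          f"= {blade_specint / desktops:.1f} SPECint per desktop")

# Both blades land in the neighborhood of ~4.6-4.7 SPECint per desktop
# for this workload.
```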

Read More »


VDI “The Missing Questions” #3: Realistic Virtual Desktop Limits

So this is the Million Dollar Question, right? You, along with the executives sponsoring your particular VDI project, want to know: how many desktops can I run on that blade? It's funny how such an "it depends" question becomes a benchmark for various vendors' blades, including said vendor here.

Well, for the purpose of this discussion series, the goal here is not to reach some maximum number by spending hours in the lab tweaking various knobs and dials of the underlying infrastructure. The goal of this overall series is to see what happens to the number of sessions as we change various aspects of the compute: CPU speed/cores, memory speed and capacity. Our series posts are as follows:

 

You are Invited!  If you’ve been enjoying our blog series, please join us for a free webinar discussing the VDI Missing Questions, with Doron, Shawn and myself (Jason)!  Access the webinar here!

But for the purpose of this question, let's look simply at the scaling numbers with the appropriate amount of RAM for the VDI count we will achieve (e.g. no memory overcommit) and the maximum allowed memory speed (1600MHz).

As Doron already revealed in question 1, we did find some maximum numbers in our test environment. Other than the customized Cisco ESX build on the hosts and tuning our Windows 7 template per VMware's View Optimization Guide for Windows 7, the VMware View 5.1.1 environment was a fairly default build-out designed for simplicity of testing, not massive scale. We kept unlogged VMs in reserve, as you would in the real world, so that users can log in quickly…yes, that may affect some theoretical maximum number you could get out of the system, but again…not the goal.

And the overall test results look a little something like this:

Configuration      E5-2643 Virtual Desktops   E5-2665 Virtual Desktops
1vCPU, 1600MHz     81                         130
2vCPU, 1600MHz     54                         93

 

As explained in Question 1, cores really do matter…but even then, surprisingly, the two CPUs are neck and neck in the race until around the 40 VM mark. Then the 2 vCPU desktops on the quad-core CPU really take a turn for the worse:


Why?

Co-scheduling!

When a VM has two (or more) vCPUs, the hypervisor must find two (or more) physical cores to place the VM on for execution within a fairly strict timeframe, to keep that VM's multiple vCPUs in sync.

MULTIPLE vCPU VMS ARE NOT FREE!

Multiple vCPUs create a constraint that takes time for the hypervisor to sort out every time it makes a scheduling decision. Not to mention you simply have more cores allocated for the hypervisor to schedule for the same number of sessions: DOUBLE that of the one-vCPU VM. The only way to fix this issue is with more cores.
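To put rough numbers on that doubling, here is a small sketch. The core counts are assumptions based on the descriptions in this post (the E5-2643 is described as quad-core, and the E5-2665 as having double the core count), and the session counts come from the table above.

```python
# vCPU-to-physical-core oversubscription at the session counts from the
# table above. Core counts assumed: E5-2643 = 4 cores/socket,
# E5-2665 = 8 cores/socket, two sockets per blade.

def oversubscription(sessions: int, vcpus_per_vm: int,
                     cores_per_socket: int, sockets: int = 2) -> float:
    """Ratio of scheduled vCPUs to physical cores on the blade."""
    return (sessions * vcpus_per_vm) / (cores_per_socket * sockets)

print(oversubscription(54, 2, 4))  # E5-2643, 2 vCPU: 13.5 vCPUs per core
print(oversubscription(93, 2, 8))  # E5-2665, 2 vCPU: ~11.6 vCPUs per core
print(oversubscription(81, 1, 4))  # E5-2643, 1 vCPU: ~10.1 vCPUs per core
```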

That said: the 2 vCPU VMs continue to scale consistently on the E5-2665, with double the core count of the E5-2643. At around the 85 session mark, even the E5-2665 can no longer provide a consistent experience with 2vCPU VDI sessions running. I'll stop here and jump off that soapbox…we'll dig more into the multiple vCPU virtual desktop configuration in a later question (hint hint hint)…

Now let’s take a look at the more traditional VDI desktop: the 1 vCPU VM:


With the quad-core E5-2643, performance holds strong until around the 60 session mark; then latency quickly builds as the 4000ms threshold is hit at 81 sessions. But look at what a trooper the E5-2665 is! Follow its 1 vCPU scaling line in the chart: all those cores show a very consistent latency line up to around the 100 session mark, where it becomes somewhat less consistent up to the 4000ms VSImax of 130. That's 130 responsive systems on a single server! I remember when it was awesome to get 15 or so systems going on a dual-socket box 10 or so years ago, and we are at 10x the quantity today!

Let’s say you want to impose harsher limits to your environment. You’ve got a pool of users that are a bit more sensitive to response time than others (like your executive sponsors!). 4000ms response time may be too much and you want to halve that to 2000ms. According to our test scenario, the E5-2665 can STILL sustain around 100 sessions before the scaling becomes a bit more erratic in this workload simulation.


Logic would suggest half the response time means half the sessions, but that simply isn't the case, as shown here. We reach a Point of Chaos (POC!) where response times and behaviors become very inconsistent as we continue to add sessions. In other words: in a well-running environment that is close to the "compute cliff," it does not take many more desktop sessions before latency doubles and your end users are not happy. But on the plus side, and assuming storage I/O latency isn't an issue, our testing shows that you do not need to drop many sessions from each individual server in your cluster to rapidly recover session response time.

So in conclusion, the E5-2643, with its high clock speed and lower core count, is best suited for smaller deployments of less than 80 desktops per blade. The E5-2665, with its moderate clock speed and higher core count, is best suited for larger deployments of greater than 100 desktops per blade.

 

Next up…what is the minimum amount of normalized CPU SPEC a virtual desktop needs?

 
