


A few years ago, in order to interact with the audience, I started a Cisco Live presentation involving some Spanning Tree design with three questions:

I drew two conclusions from this:

FabricPath, Dilbert-style

Layer 2 vs. Layer 3, a classic

When Layer 3 guys think of STP, they see RIP. Sure, 10% of STP is a RIP-like distance vector protocol, but the remaining 90% of its complexity is about preventing loops, even transient ones. Just to give you a feel for the problem, we’ve seen critical network conditions caused by bridging loops lasting less than 100ms. It’s that bad, and it’s something Layer 3 guys just don’t see because this problem does not exist in their world. In their world, when there is a power outage, the lift just stops. In the Layer 2 world, the lift falls with a hissing sound and crashes in a cloud of smoke. Not a friendly world. Where does this difference come from?

When a router receives a packet, it looks in its forwarding table to determine where to forward it. If there is no matching prefix in the forwarding table, the router does not forward (i.e. it drops the packet). When a bridge receives a frame, it looks in its filtering database to determine where not to send this frame. If there is no entry in this filtering database, also known as the MAC address table, the frame is not filtered (i.e. it is flooded). Yes, your high-end switch behaves by default as a glorified hub, and this has nothing to do with STP.
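To make the contrast concrete, here is a minimal sketch of the two lookup behaviors (the table structures and names are purely illustrative, not any switch’s actual forwarding code):

```python
# Illustrative only: how a lookup miss is handled at Layer 3 vs. Layer 2.

def route_packet(forwarding_table, dest_prefix, packet):
    """Layer 3: no matching prefix in the forwarding table means the packet is dropped."""
    next_hop = forwarding_table.get(dest_prefix)
    if next_hop is None:
        return []                                   # drop: nothing is forwarded
    return [(next_hop, packet)]

def bridge_frame(filtering_db, ports, in_port, dest_mac, frame):
    """Layer 2: no entry in the filtering database means the frame is flooded."""
    out_port = filtering_db.get(dest_mac)
    if out_port is None:
        # unknown unicast: flood on every port except the one the frame arrived on
        return [(p, frame) for p in ports if p != in_port]
    return [(out_port, frame)]
```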

The consequence is well known. If for any reason a loop is introduced in the network, a single frame can be forwarded forever, because there is no TTL at Layer 2. Worse, the powerful ASICs of your glorified hubs will flood the frame on all their ports, instantly saturating them. Then, with your usual luck, the frame is carrying a Layer 3 broadcast that hits the CPU of every host directly and kills it immediately. Hissing sound. Wait, it’s not over! You might have some low-end switches at the access layer, with poor control plane protection. Their CPUs might also be impacted by this traffic, and they might not be able to run their STP process any more. Interestingly enough, most people don’t realize that those edge switches, the ones furthest from the root, are responsible for blocking ports in a bridged network. If they can’t run STP properly, they’re going to open even more loops… Cloud of smoke. Not only are the servers affected, but the failure condition can be maintained indefinitely. A local issue can have a global, permanent impact.

Meanwhile, in the Layer 3 world, link state protocols like IS-IS or OSPF commonly introduce transient loops during their convergence. It does not matter. There is no flooding at Layer 3, so a looping packet would only have a local effect on the links that are part of the cycle, and the TTL in the data plane would eventually get rid of it. Multidestination traffic is strictly constrained by a powerful reverse path forwarding check (RPFC), again in the data plane. Even if the CPU of a router were affected, adjacencies would drop, routes would be removed from forwarding tables and, in the end, packets would stop being forwarded. The system would fall back to a stable state… The lift stops.
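As a tiny sketch of those two safety nets (simplified pseudo-logic, not any router’s actual implementation), the Layer 3 data plane contains the damage on its own:

```python
# Illustrative only: TTL expiry and a reverse path forwarding check (RPFC)
# keep looping or multidestination packets from spreading at Layer 3.

def forward_multidestination(rpf_table, in_port, source, ttl, packet, out_ports):
    if ttl <= 1:
        return []                                   # TTL expired: a looping packet dies here
    if rpf_table.get(source) != in_port:
        return []                                   # RPFC failed: wrong interface toward the source
    return [(p, (source, ttl - 1, packet)) for p in out_ports if p != in_port]
```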

Routing frames

Would replacing STP with IS-IS help Layer 2? It would not, simply because the control plane was not the problem.

Indeed, the revolutionary change that FabricPath/TRILL introduces is not IS-IS, it’s a new data plane. The MAC-in-MAC encapsulation dramatically reduces the number of MAC addresses seen in the network. As a result, they can be advertised by the control protocol and used as entries in a real routing table, not a filtering database. If a destination is not known, traffic is dropped. The frame also features a TTL, and the RPF check can apply to multidestination traffic. Frames are de facto routed within the network, even if there is still bridging at the edge. Some people consider “routing frames” a heretical statement. I don’t know why “routing” would belong to Layer 3. We route planes and interrupts, so why not frames?
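Here is a simplified sketch of that routed data plane (the field names are illustrative, not the exact FabricPath or TRILL header layout): the edge encapsulates the original frame behind switch-level addresses, and the core routes on the outer destination, decrements a TTL, and drops unknown destinations instead of flooding them.

```python
# Illustrative sketch only; not the real FabricPath/TRILL encapsulation format.
from dataclasses import dataclass

@dataclass
class EncapsulatedFrame:
    outer_dst: str       # egress switch ID, advertised by IS-IS
    outer_src: str       # ingress switch ID
    ttl: int
    inner_frame: bytes   # the original Ethernet frame, carried untouched

def core_forward(switch_routing_table, frame):
    """Routing, not bridging: unknown destinations are dropped and the TTL is decremented."""
    if frame.ttl <= 1:
        return None                                  # loop protection in the data plane
    next_hop = switch_routing_table.get(frame.outer_dst)
    if next_hop is None:
        return None                                  # drop on unknown, never flood
    frame.ttl -= 1
    return (next_hop, frame)
```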

What about Shortest Path Bridging?

One of the most significant differences between TRILL and IEEE Shortest Path Bridging (SPB, 802.1aq) is precisely the data plane. The IEEE did not want to change the existing bridging data plane, as that implies new hardware. Remember that it’s because the Layer 2 data plane needs a tree that we have STP, not the other way around. As a result, SPB still builds trees, with all the synchronization mechanisms that STP was hauling: we’ve replaced the 10% RIP with 10% IS-IS, but the 90% loop prevention complexity is still there (have fun reading clause 13).

Because the IEEE guys are not idiots either, they realized they had to do something, and they split SPB into two flavors. The first, faithful to their pledge of maintaining the existing Layer 2 data plane, is called SPB-V. From my perspective, the main enhancement SPB-V provides over STP is that its name does not include “Spanning” or “Tree”. I’m sure it will not be deployed anywhere, but in a way I’m sorry about that. It would have shown the world that the kind of problem I described earlier is just as likely with IS-IS as it is with STP.

The second flavor of 802.1aq is SPB-M. That one has a data plane closer to TRILL, thanks to its use of the IEEE 802.1ah standard. The twist is that, because 802.1ah had just been finalized, it allowed the IEEE to claim that SPB-M did not require new hardware. In reality, few data center switches, if any, are capable of supporting 802.1ah, and running SPB-M means replacing your hardware only to go halfway to TRILL. Finally, because it could not go further within the charter of 802.1aq, the IEEE recently initiated 802.1Qbp. The latter looks great and will at last introduce a new frame format, this time with a TTL. The sad thing is that it will require yet another hardware change and will probably be available 10 years after TRILL was started with very similar goals…


19 Comments.


  1. FabricPath looks great! Now let’s see it implemented across more than just a handful of switch platforms (at least all of the higher end devices: Nexus, 6500, 4900, 4500, and 3750). A lot of these new features seem to only get implemented in Nexus gear, but spanning tree is something that affects remote sites more than data centers. Anyone who works in large enterprise environments knows that there are sites that are often just as important as a data center.


    • Francois Tallet

      Hi Dave,
      FabricPath really makes sense when offered as an end-to-end solution. That’s why our priority was to provide it consistently across our DC portfolio (it’s been available for a year on the Nexus 7000 with F-Series I/O modules, and the software will be released on the Nexus 5500 in a few weeks).
      Because FabricPath requires ASIC support, some Catalyst platforms will not be capable of running it. Some others could be adapted (because their hardware can meet the new requirements or because they’ve been refreshed recently). Again, because you would ideally want a solution that is available on the whole product line, the final answer is not just technical, and while I know that the Catalyst team is working on this, I don’t know what they’re going to do.
      Regards,
      Francois


  2. I am a CCNP student in Croatia and also a CCAI Instructor in this area. I enjoy learning and teaching others about networking, so I started a blog, howdoesinternetwork.com, with posts about networking. I hope that one day my knowledge, and this project too, will be somewhere close to yours (I am working on it :) )
    I must say, I really like your way of presenting this “complicated” advanced networking technology in an enthusiastic and fun way, with a little sarcasm inside.
    I enjoy reading your posts and hope you will write more.
    Regards


  3. Actually, SPB loop avoidance is easily explained and is not 90% of the complexity, as you would suggest…

    When a topology change occurs, a node determines the set of multicast trees where the distance to the root has changed. Rip the state for those trees, and advertise a digest of the new topology database to your peers.

    When you receive a matching digest from a peer you know they have done the same and it is safe to install updated multicast state on the interface to that peer.

    Short story is: if you agree on the topology database, you agree on the distances to the roots. Trees unaffected by the topology change are left alone. Add RPFC and you have belt and suspenders…

    There are details as to how the strictness of this can be relaxed slightly without losing any robustness, but that is the fundamentals.


  4. Francois Tallet

    Hi Dave,
    The loop avoidance mechanism (the belt of SPB) is pretty much the same hop-by-hop agreement mechanism that STP sync was relying on (an insult to link state protocols’ nimble convergence, should I say). It’s certainly bulletproof on paper, but so was STP sync. My point was mainly that this belt has proven to be insufficient for the users of Layer 2. We definitely need the RPFC suspenders, and those are not possible in SPB-V. We need a routed data plane.
    I’ll let you name your own percentage for the overall complexity (STP sync was probably simpler, but RIP was also simpler;-)
    Thanks for your expert comment and regards,
    Francois


  5. Peter Ashwood-Smith

    SPBM on NNI (i.e. tandem to/from core switches) links looks up the DA and VID and forwards to the next hop while disabling learning. I have 5 switches in my lab with several vanilla cards that are perfectly happy to run SPBM NNIs without upgrades, and in fact they did so in our last interop, which involved three different vendors’ switches and two different test tools.

    It is true that UNI behavior requires MAC-in-MAC, which is not supported on every ASIC but is on at least two major ASICs that have been shipping for a few years now in dozens of different card types in my own company.

    SPBM therefore does not require hardware upgrades on the most expensive NNI link cards but may require upgrades on the UNI links.

    Translation: your core switches could remain unchanged with software-only upgrades, while the ToRs would require upgrades to ASICs that have been shipping for a few years now.

    Note also that upgrading hardware to provide a TTL function on a core NNI link in a 2-layer fat tree is unnecessary, as fat trees don’t have loops. Likewise, upgrading the hardware to support multiple hash-based next hops is also unnecessary, as the 2-layer fat tree only has ECMP choices on the uplinks and not the downlinks.

    Note also that Dave’s comments regarding loop prevention are spot on. Do an MD5 over an adjacency, add (or subtract) it to/from a running sum, advertise that sum in the hello, compare what you advertise with what you receive, and if they differ, remove multicast entries on that link. Not exactly rocket science.
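    As a rough sketch of the mechanism Dave and Peter describe (the helper names are made up here, and the actual 802.1aq encoding differs): hash each adjacency, fold the hashes into a running digest, advertise it in hellos, and only install updated multicast state toward a peer reporting the same digest.

```python
# Rough sketch only (made-up names; not the 802.1aq wire format).
import hashlib

def topology_digest(adjacencies):
    """Fold per-adjacency MD5 hashes into a running sum used as a compact topology digest."""
    total = 0
    for adj in sorted(adjacencies):                  # e.g. ("sw1", "sw2") pairs
        h = hashlib.md5(repr(adj).encode()).digest()
        total = (total + int.from_bytes(h, "big")) % (1 << 128)
    return total

def safe_to_install_multicast(local_digest, digest_in_neighbor_hello):
    """Only install updated multicast state toward a peer that agrees on the topology."""
    return local_digest == digest_in_neighbor_hello
```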


    • Hi Peter,
      I remember the same kind of argument from when we introduced 802.1ah. Only the edge devices need support for the new data plane; the core can remain 802.1ad bridges. This is true, but the reality of a data center is that you commonly get a ratio of 2 spine bridges to 50 leaf bridges. So there’s not a lot you can recycle from your previous generation of switches.
      Furthermore, unless you have a non-blocking network, the port density at the access is way higher than in the core, and the devices supporting 802.1ah are typically high-end, SP-oriented switches. The cost of using them at the access is just not realistic. I don’t need to elaborate too much on that; customers will notice anyway.

      So in my opinion, the idea of recycling hardware is good from the marketing standpoint, but it’s not going to happen in reality. SPB-M (the only relevant one) requires new hardware.

      Thanks and regards,
      Francois


      • Not necessarily true, as that blanket statement confuses the reader and hides the truth. If you are running SPBM for the 802.1ah encapsulation, then you only need to support the mux/demux function of 802.1ah on the edge devices. Core nodes are simply switching on the outer, regular Ethernet header.

        Most chips shipping for the last several years support the 802.1ah header too, so if you’ve bought a switch in the last 4 years then you most likely can run SPBM at the edge, and the core nodes that aren’t participating in the edge function don’t need new chips either, even if they aren’t an NPU of some kind that can be reprogrammed, as they just need a control plane update for managing the SPB paths.


        • Francois Tallet

          Hi Paul,
          I think I’ve elaborated enough on why SPB-M requires new hardware.
          Now I’d like you to elaborate on how you reached the conclusion that most chips support 802.1ah. That looks to me like a blanket statement that confuses the reader and hides the truth;-)
          Cisco has maintained a 70%-80% market share in data center switching over the period of time you mentioned, and our ASICs do TRILL, not 802.1ah. Even in the remaining 20%-30%, I’d be surprised if there was a significant percentage of .1ah-capable switches (especially if, again, you consider the devices that are at the price point for a top-of-rack/end-of-row position in the network).
          Regards,
          Francois


  6. “Who thinks that the root bridge can block a port?”

    so when and how does this happen?


  7. The thesis of this piece is that “the problem” is Ethernet’s data plane learning mechanism, otherwise known as “flood-on-unknown”, which is an entirely reasonable point of view. What is not stated explicitly, so I will do so, is that the SPB-M flavor performs drop-on-unknown within the routed domain, so for unicast traffic SPB-M behaves just like Layer 3. This, and SPB-M’s use of RPFC, means that the synchronization mechanisms are not required for unicast traffic, because unicast is never flooded, so network reconvergence is not delayed by synchronisation.

    For multicast tree installation, synchronization is used, as Dave Allan has outlined. However, comparing this with the traditional behaviour of IS-IS is not an “apples to apples” comparison, because multicast tree computation is not a traditional IS-IS application but a post-process run after IS-IS has converged, and the valid comparison is of the complete processes.


    • Hi Nigel,
      I think we’re clear on the suspenders. The whole article is about the limitation of the “belt” approach. Go ahead and introduce IS-IS as a control protocol; I will not trust the resulting SPB-V technology any more than I trust the current STP.
      Thanks and regards,
      Francois


  8. Hi Francois

    Thanks for replying, a few points to ponder…

    1) SPB can use the multicast capability for LSA flooding, so “nimble convergence” is not really severely impaired for multicast, and is unaffected for unicast, as Nigel Bragg observed.

    2) You have to contrive topologies (long daisy chains of two connected nodes) for digest exchange to really emerge as a problem. And yes, examples were trotted out in front of 802.1 and considered during the deliberations…

    3) Densely meshed multipath deployments would need a coordinated attack on the fabric before distances to roots started changing for the set of multicast sources we are worried about…so digest exchange would be a very rarely used “safety valve”…if at all.

    So this “ain’t your mother’s spanning tree”, so to speak. It is a link state shortest path mesh… and cognisance of all paths exists.

    cheers
    D


    • Hi Dave,
      The funny part is that I’m rather a fanboy of your mother’s STP ;-) Except for the pathological case of the loss of the root bridge (and when customers complain about STP, it’s not because of that), there is no reason why STP would not converge in 50ms. STP is pretty good on paper, but the platforms’ inefficiencies (mainly in changing the state of the VLANs) make this kind of performance impossible to reach. STP is also pretty good at loop prevention on paper, but again, that’s not what customers remember about it.
      SPB-V is only the promise that this time, we did things right in the control plane. As I didn’t see anything wrong in the STP control plane, I don’t see why SPB-V would add any value either.
      But in the end, this post is more about a marketing message than a technical one. You can claim that SPB runs on existing hardware; then you’re talking about SPB-V, and good luck with your new STP. You can claim that SPB has an efficient routed data plane, but then go and replace your hardware. You can’t have it both ways.
      Regards,
      Francois


  9. Guillermo Ravera

    Hi, it’s quite easy to understand your point of view, but I suppose that you are always speaking about FabricPath, as the title suggests.

    But what about TRILL (RFC 6325), which was proposed as a standard in July 2011? Cisco was in the process, but will it be supported by Nexus platforms? Or will Cisco add some changes to FabricPath and have one unique protocol?

    I think this is important to ease the interoperability between platforms ;)


    • Francois Tallet

      Hi Guillermo,
      The post only describes basic FabricPath operation, which is exactly identical in TRILL.
      Our hardware can do TRILL, and we plan on providing a “TRILL mode” for FabricPath.
      Right now, FabricPath includes a few enhancements that are in fact out of the scope of TRILL (distributed port channel, multiple active default gateways, multiple topologies, etc.). We’re trying to get those into TRILL too.
      Regards,
      Francois

