Distributed? Centralized? Both?
Part of the interest in programmatic interfaces is fueled by the desire to logically centralize network control functions. A global view of network state can have many benefits, but it does not preclude the use of distributed protocols within the network. Network Programming Interfaces (NPIs) provide a facility to construct global state, mutate that state, and distribute it to the network. In combination with distributed protocols, this can help achieve greater network efficiency, improve visibility and robustness, and add to the overall value of the network. Used the right way, these NPIs will help set a new balance between centralized and distributed control. Key to this balance will be domain- or deployment-specific constraints.
Finding the appropriate balance in a domain means deciding where authoritative network state is kept, how the network needs to react to state changes, and how quickly applications need stability. If we desire 50 ms restoration in case of link failures, some form of control logic on the individual network element will be necessary. If, on the other hand, we want to place a new service appliance in our network and we know the traffic access patterns, or we want to perform a maintenance task and cleanly isolate the affected systems, we might opt for logically centralized control.
Centralized control for packet forwarding? When it comes to controlling packet forwarding, things become more subtle. Networks are distributed systems and can include a high degree of uncertainty. Today’s network protocols were designed to operate in environments without communication guarantees, and to cope with nearly unbounded latency, highly variable resources, temporary inconsistencies, as well as arbitrary failures and changes. Today’s distributed routing protocols serve these environments well. That said, many of today’s network deployments are not that random. Data center networks, for example, are often built using very few types of network elements and have a very regular and stable network topology. Would this allow us to reconsider our control plane architecture? Could we centralize parts of the forwarding control so we can adapt more easily to the specific needs of a business or application? For instance, we might want to route our backup traffic on the least expensive path rather than the shortest one. Another example might be to route our time-sensitive multicast traffic on our lowest-delay links, with the individual link delay measured dynamically.
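To make the "least expensive versus shortest" idea concrete, here is a minimal sketch of a path computation in which the link metric is a parameter. The topology, node names, dollar costs and delays are all invented for illustration, and the `best_path` helper is hypothetical; it is plain Dijkstra and not tied to any particular controller or NPI.

```python
import heapq

# Hypothetical topology: every link has a hop count of 1, a dollar cost,
# and a measured delay in milliseconds. All values are made up.
LINKS = {
    ("A", "B"): {"hops": 1, "cost": 10, "delay_ms": 2},
    ("B", "D"): {"hops": 1, "cost": 10, "delay_ms": 2},
    ("A", "C"): {"hops": 1, "cost": 1, "delay_ms": 20},
    ("C", "E"): {"hops": 1, "cost": 1, "delay_ms": 20},
    ("E", "D"): {"hops": 1, "cost": 1, "delay_ms": 20},
}


def build_graph(links):
    """Turn the link table into an undirected adjacency list."""
    graph = {}
    for (u, v), attrs in links.items():
        graph.setdefault(u, []).append((v, attrs))
        graph.setdefault(v, []).append((u, attrs))
    return graph


def best_path(graph, src, dst, metric):
    """Dijkstra over the chosen link attribute; returns (total, path) or None."""
    queue = [(0, src, [src])]
    visited = set()
    while queue:
        total, node, path = heapq.heappop(queue)
        if node == dst:
            return total, path
        if node in visited:
            continue
        visited.add(node)
        for neighbor, attrs in graph.get(node, []):
            if neighbor not in visited:
                heapq.heappush(
                    queue, (total + attrs[metric], neighbor, path + [neighbor])
                )
    return None


graph = build_graph(LINKS)
shortest = best_path(graph, "A", "D", "hops")      # fewest hops
cheapest = best_path(graph, "A", "D", "cost")      # least dollar cost
fastest = best_path(graph, "A", "D", "delay_ms")   # lowest measured delay
```

In this made-up topology the fewest-hop and lowest-delay paths both go through B, while the cheapest path takes the longer detour through C and E. The same algorithm serves all three policies; only the input metric changes, which is the kind of input a centralized application may find easier to obtain.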
Consistency requires protocols. When evolving the control plane design, key consideration needs to be given to conditions where the view that the network control plane has of the network differs from the physical reality. This is commonly the case right after, for example, a link change. How quickly can one converge to consistent behavior? Distributed protocols deal with these temporary inconsistencies in two broad ways. One is lightweight bookkeeping such as monotonically increasing sequence numbers or hop limits, with IP’s TTL probably being the number one example. The other is proposal-agreement handshakes (similar in spirit to two-phase commit), which Rapid Spanning Tree employs. The handshake approach achieves strong consistency at the cost of increased delay, while the lightweight approach relies on higher-level protocols to deal with the temporary inconsistencies.
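The two lightweight mechanisms named above can be sketched in a few lines. This is a simplified illustration, not the behavior of any specific protocol implementation: `LinkStateStore` and `forward` are hypothetical names, the store only shows the "newer sequence number wins" rule, and the forwarding loop only shows how a TTL bounds the damage of a transient loop.

```python
class LinkStateStore:
    """Keeps the newest advertisement per origin, judged by sequence number."""

    def __init__(self):
        self._entries = {}  # origin -> (sequence number, payload)

    def receive(self, origin, seq, payload):
        """Accept the update only if it is newer than what is already stored."""
        current = self._entries.get(origin)
        if current is not None and seq <= current[0]:
            return False  # stale or duplicate: ignore it
        self._entries[origin] = (seq, payload)
        return True

    def lookup(self, origin):
        entry = self._entries.get(origin)
        return None if entry is None else entry[1]


def forward(next_hop, start, dst, ttl):
    """Follow next-hop entries until the packet arrives or its TTL runs out."""
    node, hops = start, 0
    while node != dst:
        if ttl == 0:
            return ("dropped", hops)  # loop or overly long path: discard
        node = next_hop[node]
        ttl -= 1
        hops += 1
    return ("delivered", hops)


store = LinkStateStore()
store.receive("R1", 5, "link up")
late_arrival = store.receive("R1", 4, "link down")  # older update arrives late

# Transient inconsistency: A and B point at each other for destination C.
looping = forward({"A": "B", "B": "A"}, "A", "C", ttl=8)
# After convergence the tables agree again.
converged = forward({"A": "B", "B": "C"}, "A", "C", ttl=8)
```

Neither mechanism prevents the inconsistency. The sequence number stops old information from overwriting new information, and the TTL ensures a looping packet is eventually discarded, leaving recovery (for example, retransmission) to a higher layer.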
Let’s keep in mind that part of the success of protocols is due to the fact that they are “pass by value” and leverage well-defined state machines. That way, protocols avoid the problem that RPC-style “request/response” communication suffers from: far too often, requestor and responder have a different understanding of the semantics of the information exchanged. For example, does the return of a positive acknowledgment mean that the request was just received and queued, or that the request was executed? Consensus-building protocols, which are the basis for state machines in the network, resolve this issue by their very nature.
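One way to picture the difference is to compare a bare "OK" with a reply that carries an explicit state from an agreed state machine. The sketch below is purely illustrative: the three states, the `ALLOWED` transition table and the `RequestTracker` class are assumptions made up for this example, not part of any real protocol or RPC framework.

```python
from enum import Enum


class RequestState(Enum):
    RECEIVED = 1  # the responder has seen the request
    QUEUED = 2    # the request is waiting to be executed
    APPLIED = 3   # the change has actually taken effect


# Legal transitions of the (hypothetical) request state machine.
ALLOWED = {
    None: {RequestState.RECEIVED},
    RequestState.RECEIVED: {RequestState.QUEUED},
    RequestState.QUEUED: {RequestState.APPLIED},
    RequestState.APPLIED: set(),
}


class RequestTracker:
    """Tracks one request; both peers are assumed to share this state machine."""

    def __init__(self):
        self.state = None

    def advance(self, new_state):
        """Move to new_state if legal and return the message to send (by value)."""
        if new_state not in ALLOWED[self.state]:
            raise ValueError(f"illegal transition {self.state} -> {new_state}")
        self.state = new_state
        return {"state": new_state.name}

    def is_applied(self):
        return self.state is RequestState.APPLIED


tracker = RequestTracker()
first_reply = tracker.advance(RequestState.RECEIVED)
tracker.advance(RequestState.QUEUED)
applied_before_final_step = tracker.is_applied()
final_reply = tracker.advance(RequestState.APPLIED)
```

Because each message names the state reached, a requestor cannot mistake "received" for "executed", and an out-of-order message is rejected as an illegal transition instead of being silently misread.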
Consequently, even with logically centralized components, the need for protocols and state machines between all elements does not disappear. We should not fool ourselves into believing that there could be just a single central authority which owns the master state of the network and just “programs” the network via RPC. For those who might still believe in full centralization, consider how difficult it is to debug the situation where the “Routing” and “Forwarding” tables of a router get out of sync.
Distributed? Centralized? It depends on your deployment. Whether, and if so where, we’ll use logically centralized components depends on many factors. Key factors include:
- whether there is a need to run a custom control algorithm
- the ability to centrally access sources of information which are not easily accessible through one of today’s network protocols
- the expected performance and scale of the solution; how quickly can we respond to network events, and what is the event frequency
- the need to handle multiple concurrent failures
Handling events where they occur, i.e. on the network element, obviously helps network performance and system scale. But there are benefits to centralization. Therefore, with the advent of NPIs, it is time to re-examine combining fast-reacting, fully distributed control with highly optimized centralized control for specific domains and deployments. The question is: What types of network deployments does this matter to first?
Coming back to our data center example: It is possible to build upon a network running a fully distributed routing protocol such as OSPF. On top of this OSPF network, we could leverage a logically centralized routing application which reviews the link-state database maintained by OSPF, and then computes and injects higher-priority routes (least dollar cost or lowest delay). In such a hybrid control case we have the best of both worlds. Our centralized routing application can fail (or be slow to converge), yet traditional OSPF routes will still be determined in a distributed manner, ensuring that the traffic flows.
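The fallback behavior of this hybrid design can be sketched as a route table that holds candidates from several sources and prefers the one with the lowest preference value, in the style of administrative distance. Everything here is an illustrative assumption: the `Rib` class, the prefix, the next hops, and the preference of 20 for the centralized application are invented (110 is used for OSPF only because it is a common default).

```python
class Rib:
    """Toy routing table: per prefix, keep one candidate route per source."""

    def __init__(self):
        self._routes = {}  # prefix -> {source: (preference, next_hop)}

    def install(self, prefix, source, preference, next_hop):
        self._routes.setdefault(prefix, {})[source] = (preference, next_hop)

    def withdraw_source(self, source):
        """Remove every route a source injected, e.g. when that source fails."""
        for candidates in self._routes.values():
            candidates.pop(source, None)

    def best(self, prefix):
        """Return (source, next_hop) with the lowest preference value, or None."""
        candidates = self._routes.get(prefix, {})
        if not candidates:
            return None
        source, (_, next_hop) = min(
            candidates.items(), key=lambda item: item[1][0]
        )
        return source, next_hop


rib = Rib()
# Baseline route computed by the distributed protocol.
rib.install("10.0.0.0/24", "ospf", 110, "10.1.1.1")
# Route injected by the hypothetical centralized application (lower value wins).
rib.install("10.0.0.0/24", "controller", 20, "10.1.1.2")
with_controller = rib.best("10.0.0.0/24")

# The centralized application fails and its routes are withdrawn.
rib.withdraw_source("controller")
after_failure = rib.best("10.0.0.0/24")
```

While the centralized application is healthy its route is selected; once its routes are withdrawn, the OSPF route takes over without any further coordination, which is the property the hybrid design relies on.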