Cisco IT’s Identity Services Engine Deployment: Cluster and Server Sizing
When sizing clusters for devices in our Identity Services Engine (ISE) deployment, Cisco IT uses a “3+1” formula: for every person we assume three devices (a laptop, a smartphone, and a tablet) plus one device in the background (security camera, printer, network access device, etc.). In a company the size of Cisco, with roughly 80,000 employees, the math is simple: four devices multiplied by the number of employees equals 320,000 possible device authentications at once, although the global distribution makes it unlikely that all devices will authenticate simultaneously. With this in mind, the initial sizing was a four-cluster environment (which we refer to as an “ISE cube”).
In fact, we initially had five main clusters (there was a US Central). We started the design document in 2011 and focused on the capabilities and scaling of ISE version 1.0. The upper limit of devices ISE could handle in this version was 100,000. Taking future capability deployments into consideration, the assumption was that the clusters needed to handle global deployments of 802.1X for Wired and Wireless; hence the initial need for four production clusters (roughly 320,000 devices divided by 100,000 per cluster, rounded up). Because not all regions are equal in size, some regions, like the US, needed multiple clusters given their large user and device base.
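The sizing arithmetic above can be written out as a short sketch. The function and variable names are illustrative only, not part of any Cisco tooling; the employee count, the 3+1 ratio, and the 100,000-endpoint limit are the figures quoted in the text.

```python
import math


def clusters_needed(employees: int, devices_per_person: int,
                    endpoints_per_cluster: int) -> tuple[int, int]:
    """Return (total devices, clusters required) for a worst-case sizing."""
    total_devices = employees * devices_per_person
    # Round up: a partially used cluster is still a whole cluster.
    return total_devices, math.ceil(total_devices / endpoints_per_cluster)


# "3+1": three personal devices plus one background device per employee.
DEVICES_PER_PERSON = 3 + 1

total, clusters = clusters_needed(80_000, DEVICES_PER_PERSON, 100_000)
print(f"{total:,} devices -> {clusters} clusters")  # 320,000 devices -> 4 clusters
```

This treats every device as authenticating at once, which, as noted earlier, is a deliberately pessimistic assumption.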
Our methodology was to start with four clusters and adjust as ISE could handle more endpoints. As the product roadmap matured, the plan was to further consolidate the US East and West clusters into one, for example.
Much of this deployment was a foundation for the Internet of Things, which Cisco calls the Internet of Everything (IoE). Early discussions around ISE infrastructure sizing, which focused on the need for Cisco to build the infrastructure to ensure that everything on our network can be authorized and get access, fit that goal perfectly.
Cisco IT’s ISE deployment is a multi-year effort with plans through our fiscal year 2016 (which ends July 2016). As discussed in my previous blog, we have globally deployed guest networking, 802.1X Monitor Mode, and Profiling, and we are in the process of completing the deployment of 802.1X for Wireless. As these capabilities were rolled out into production, they passed from project support to operational support. The team that supports this infrastructure also supports our Access Control System (ACS) product in production. With years of experience supporting that infrastructure for similar functions, the team provides valuable insight into production support concerns.
The first official deployment of ISE was on version 1.1.3, with plans to upgrade to the latest patches and then to ISE 1.2. Supporting this global multi-cluster configuration added overhead to the team and also created some concerns with regard to patch and upgrade management. As the operations team and project team met and discussed this challenge, it was agreed that we could approach sizing differently for the current and short-term capabilities slated for deployment into late 2014. The phased approach to capability deployment meant a phased increase in devices on the network, so the upper limit would be approached gradually rather than all at once. In addition, we had deployed ISE 1.2, which has an upper limit of 250,000 devices. Given that we had not completely deployed 802.1X Wired globally, only in some high-risk environments where we had intellectual property concerns, we did not need the capacity to accommodate the full 3+1. Instead, we moved toward a single global ISE cluster.
The change to a single cluster means we do not have to wait for the Manager of Managers (MoM) expected in later versions of ISE. Multiple clusters mean having to log into separate management and admin consoles, and patching and upgrading have to be done at each cluster separately. In current versions, there is no MoM in which these efforts can be centralized. Faced with a workforce that was static and with product limitations that might not exist in the next release, the operational support team decided to merge the four clusters into a single global one.
This one cluster will be sized to the maximum number of policy service nodes (PSNs) allowed in a single deployment (40), with standby PAN/MnT pairs ready to split it into two deployments when the limit of 250,000 concurrent endpoints is about to be reached. This split will immediately double the capacity to 500,000 concurrent endpoints without the need to add any PSNs or change the configuration of the network access devices (NADs), by leveraging virtual IP addresses (VIPs) and Application Control Engine (ACE) load balancers.
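A minimal sketch of that split logic follows, under stated assumptions. The 250,000-endpoint and 40-PSN limits come from the text; the 90 percent trigger threshold, the even division of PSNs and load, and all names are hypothetical choices made for illustration.

```python
from dataclasses import dataclass

MAX_ENDPOINTS = 250_000  # concurrent endpoints per deployment (from the text)
MAX_PSNS = 40            # PSNs per deployment (from the text)


@dataclass
class Deployment:
    name: str
    psns: int
    concurrent_endpoints: int


def needs_split(dep: Deployment, threshold: float = 0.9) -> bool:
    """True once load reaches the assumed fraction of the endpoint limit."""
    return dep.concurrent_endpoints >= threshold * MAX_ENDPOINTS


def split(dep: Deployment) -> tuple[Deployment, Deployment]:
    """Divide existing PSNs and load across two deployments (assumed even split)."""
    psns_a = dep.psns // 2
    load_a = dep.concurrent_endpoints // 2
    return (
        Deployment(f"{dep.name}-a", psns_a, load_a),
        Deployment(f"{dep.name}-b", dep.psns - psns_a,
                   dep.concurrent_endpoints - load_a),
    )


def total_capacity(deployments: list[Deployment]) -> int:
    """Each deployment has its own endpoint limit, so capacity scales with count."""
    return MAX_ENDPOINTS * len(deployments)


single = Deployment("global", psns=MAX_PSNS, concurrent_endpoints=230_000)
if needs_split(single):
    halves = split(single)
    print(total_capacity(list(halves)))  # 500000
```

The point of the model is that the endpoint limit applies per deployment, so two deployments built from the same 40 PSNs have twice the ceiling of one.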
At a high level, following are the steps we took to complete the merger of the clusters:
- Infrastructure prerequisites
  - Ensured all clusters are at same version and patch level
  - Disabled all backups
  - Made sure SFTP and ISE servers have enough free storage
  - Completed a backup prior (multiple steps not listed, consult the CCO)
  - Ensured primary PAN and M&T virtual machines (VMs) are cloned
  - Consolidated all PPAN configurations into our development (DEV) environment
  - Added six new PSN VMs to corresponding VIPs and installed ISE
- Main steps
  - Merged US West, APAC, and EMEAR clusters into US East PSN
  - De-registered one PSN behind each VIP in each of the US West, APAC, and EMEA clusters
  - Registered PSN in US East cluster
  - Added PSN to the other clusters
  - Confirmed metrics on US East cluster were correct (now a single global cluster based in our Allen, Texas, data center)
  - Checked replication and Active Directory status on cluster hosts
- After migration
  - Restored services turned off prior
  - Took a backup
  - Checked all monitoring systems for issues
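Several of the prerequisites in this list lend themselves to an automated check before the merge begins. The sketch below is illustrative only: it does not call any ISE API, the inventory is made-up sample data, and the field names and the 100 GB storage threshold are assumptions.

```python
def premerge_problems(clusters: list[dict], min_free_gb: int) -> list[str]:
    """Return a list of human-readable blockers; an empty list means ready."""
    problems = []

    # All clusters must be at the same version and patch level.
    versions = {(c["version"], c["patch"]) for c in clusters}
    if len(versions) > 1:
        problems.append(f"version/patch mismatch: {sorted(versions)}")

    for c in clusters:
        # Backups should be disabled for the duration of the merge.
        if c["backups_enabled"]:
            problems.append(f"{c['name']}: backups still enabled")
        # Servers need enough free storage for the pre-merge backup.
        if c["free_gb"] < min_free_gb:
            problems.append(f"{c['name']}: only {c['free_gb']} GB free")
    return problems


# Hypothetical inventory, for illustration only.
clusters = [
    {"name": "US East", "version": "1.2", "patch": 0, "backups_enabled": False, "free_gb": 200},
    {"name": "US West", "version": "1.2", "patch": 0, "backups_enabled": False, "free_gb": 180},
    {"name": "EMEA", "version": "1.2", "patch": 0, "backups_enabled": False, "free_gb": 150},
    {"name": "APAC", "version": "1.1.3", "patch": 0, "backups_enabled": True, "free_gb": 40},
]

for problem in premerge_problems(clusters, min_free_gb=100):
    print(problem)
```

With this sample data the check reports three blockers, all traceable to the APAC entry; the other three clusters pass.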
At this point we were running one cluster globally. This work was completed in December 2013, following four weekends upgrading from ISE 1.1.3 to ISE 1.2, one cluster at a time, before all clusters were merged into one.
We are finalizing the strategy for how to split the deployment in case we reach any of the product limits before the next version. Each option has its pros and cons. The first option is to split the deployment between East and West: for example, the US is one deployment, and EMEA plus APAC is another. The second option is to keep the deployments global and split them by leveraging the VIPs to point to a subset of PSNs. This can be a functional split (by service) or a select split (by volume).
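The two options can be contrasted with a small sketch. The PSN and VIP names are hypothetical, and alternating the PSNs behind each VIP is just one simple way to model "a subset of PSNs"; it does not represent a functional or volume-based split.

```python
# Hypothetical inventory: two PSNs behind one VIP per region.
PSNS = [
    {"name": "psn-us-1", "region": "US", "vip": "vip-us"},
    {"name": "psn-us-2", "region": "US", "vip": "vip-us"},
    {"name": "psn-emea-1", "region": "EMEA", "vip": "vip-emea"},
    {"name": "psn-emea-2", "region": "EMEA", "vip": "vip-emea"},
    {"name": "psn-apac-1", "region": "APAC", "vip": "vip-apac"},
    {"name": "psn-apac-2", "region": "APAC", "vip": "vip-apac"},
]


def split_by_region(psns, first_regions):
    """Option 1: one deployment per geography (e.g. US vs. EMEA + APAC)."""
    a = [p["name"] for p in psns if p["region"] in first_regions]
    b = [p["name"] for p in psns if p["region"] not in first_regions]
    return a, b


def split_behind_vips(psns):
    """Option 2: both deployments stay global; each VIP's PSNs are divided."""
    by_vip = {}
    for p in psns:
        by_vip.setdefault(p["vip"], []).append(p["name"])
    a, b = [], []
    for members in by_vip.values():
        a.extend(members[0::2])  # every other PSN behind this VIP
        b.extend(members[1::2])
    return a, b


print(split_by_region(PSNS, {"US"}))
print(split_behind_vips(PSNS))
```

In the first option each deployment serves a distinct geography; in the second, every region keeps PSNs in both deployments, so the VIP configuration determines which deployment handles a given request.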
Our support teams have already seen the upsides of the single global cluster. Patch 3 and Patch 6 were applied to our global ISE infrastructure of one cluster. This process involved only logging into the PAN and uploading the patch file; the primary takes care of doling out the patch in a serial fashion to each of the secondaries. If we didn’t have a single cluster, we’d have to apply the patches to each cluster separately. Another downside: if the patch failed in one of those clusters, we would have versions that are out of sync.
This cluster consolidation means we have only one-fourth the amount of work to do and a large reduction in risk by upgrading only in one cluster. We’ve sized the current infrastructure for the current demand. Rather than building for all future capabilities as first deployed, we’ve gone to a right-sizing approach – no excess capacity waiting for future capabilities that aren’t currently deployed.
In the future, we may still get to multiple clusters and, in fact, we probably will. But we can’t determine now when we will need to split the clusters. We don’t know the full impact of IoE and all devices that will eventually connect into our enterprise, for example. Using good capacity management techniques is an ongoing effort for any IT organization; it is the same in this instance as the operational team has numerous ways to view and react to changes in load.
A good example of this active capacity management was a recent effort to upgrade the VM infrastructure running production ISE. At the time of deployment, we engaged with our internal provisioning team for VMs and selected from a standard set of menu items for this service. After the system had run in production for over a year, added capabilities were affecting it: the team noticed high disk I/O and memory use. As a result, there was an active effort in the last three months to upgrade this service. The table below shows the current standards we use in our production VMs.
Cluster and server sizing decisions for Cisco IT are never static. It is constant observation and product knowledge that have led us to the deployment configuration we have now. In one year, it is highly likely that it will be different in many ways. We’ll only know for sure what that looks like as the capacity is observed, analyzed, and acted upon.
Much of the ISE project background and deployment work before my blog can be found in this public article.