I’ve been working with Cisco UCS since the very beginning. From the earliest days, whenever a customer ran into problems, I would often be asked to help figure out what was going wrong and to help fix it. Generally, this would involve a review of the system, and when we found less desirable configurations we would work with the partner and customer to clean things up. As a part of this process, I began documenting the good and the bad I saw, which evolved into what I describe as UCS “better” practices. This post aims to describe some of these practices and why they are useful. Follow-up posts will expand on this and include additional important practices.
Why it matters
Customers that invest in Cisco UCS want a good experience with it. When things aren’t optimal, the experience suffers. Most of the recommendations I make are focused on providing higher levels of reliability and availability while minimizing the administrative effort to set up and maintain the environment. I also focus on maximizing consistency between the servers in the environment.
Why these aren’t “best” practices
Good or bad, there are lots of ways to use UCS. In some cases, the “best” for one customer isn’t the “best” for other customers. In my 6 years at Cisco working with UCS, I’ve learned that while practices can vary with different types and sizes of customers, the rules discussed here are generally universal.
Practice #1: Set up Call Home
This feature was available at launch. It sends emails to your team when things go wrong and will even open a case with Cisco TAC when things fail. Early on, I found that customers almost never had Call Home configured. Over the years, I’ve seen it turned on more often, but about half the time it still isn’t. It’s already there so there’s nothing to install, it’s easy to configure, and it’s free. Why not set it up when it can save you from a job-threatening event like an outage? To set it up, simply follow the instructions here.
Practice #2: Do your backups
Backups can help you quickly recover from minor and major accidents or problems. When working with customers’ systems, I often find that the system has NEVER been backed up, even if it’s been installed for years! UCS has always had a backup capability, but over the years it’s gotten a LOT better. You can manually back up the system, and with the latest code it can back itself up. Do your backups periodically AND before any major changes. Details can be found in the Cisco UCS Manager GUI Configuration guide in the Backing Up and Restoring the Configuration section for the version of UCS Manager you are using. For example, the UCSM 2.2 instructions can be found here.
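The “periodically AND before any major changes” rule is simple enough to write down as a check. Here is a minimal sketch of that rule only. It does not talk to UCS Manager, and the function name, the `major_change_planned` flag and the 7-day threshold are all my own assumptions for illustration, not Cisco defaults.

```python
from datetime import datetime, timedelta
from typing import Optional

# Hypothetical threshold: how old the last backup may be before it is stale.
# Pick a value that matches how often your configuration actually changes.
MAX_BACKUP_AGE = timedelta(days=7)


def backup_needed(last_backup: Optional[datetime],
                  now: datetime,
                  major_change_planned: bool = False,
                  max_age: timedelta = MAX_BACKUP_AGE) -> bool:
    """Return True when a fresh backup should be taken."""
    if last_backup is None:
        return True   # the system has never been backed up
    if major_change_planned:
        return True   # always back up right before a major change
    return now - last_backup > max_age   # periodic backup is overdue


if __name__ == "__main__":
    now = datetime(2015, 6, 1, 12, 0)
    yesterday = now - timedelta(days=1)
    print(backup_needed(None, now))
    print(backup_needed(yesterday, now))
    print(backup_needed(yesterday, now, major_change_planned=True))
```

If you script something like this, feed it the timestamp of your most recent backup file and run it as part of your change checklist.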
Practice #3: Make sure your team understands the basics
What makes UCS a better platform is that it’s not a rubber-stamp replica of another vendor’s antiquated server platform. It’s a simpler, easier-to-operate system, but it’s a different system as well. The team operating the system should have a good working knowledge of it. In some cases, knowledge learned by reading configuration guides and watching videos on cisco.com and other web sites might be enough. In other cases, getting time with an expert mentor from your preferred partner or Cisco is what you need. For others, going through formal training is the right move.
Practice #4: Proactively ensure entitlement with the Cisco Technical Assistance Center (TAC)
When a customer purchases through a sales partner, that partner is responsible for registering that hardware for support. On occasion they may not register the systems properly, which could delay access to support. I recommend that customers open a proactive case to validate entitlement for the hardware for each of their admins. They can do this using the Support Case Manager and providing their serial numbers and the cisco.com IDs of the team members that need access to support on the hardware. If entitlement is correct, the admins will be able to see support and warranty information when they log into their cisco.com support page.
Practice #5: Clean up your faults
A well-managed system does not have a lot of faults, and the faults you do have should have a plan to resolve them. If you have a critical or unknown fault, don’t make major changes to your system, since this can lead to a loss of redundancy or, even worse, an outage. Each fault will have its own cause and resolution, so I can’t give you general instructions for doing this. In some cases, you will need to open a case with the Cisco TAC to help resolve these issues.
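The “no major changes while a critical or unknown fault is open” rule can be expressed as a simple pre-change gate. The sketch below is illustrative only: the `Fault` record, its `understood` flag and the fault codes are made-up placeholders, not the UCS Manager data model, and you would have to populate the list from your own fault export.

```python
from dataclasses import dataclass
from typing import Iterable, List

# Severities that block a major change outright in this sketch.
BLOCKING_SEVERITIES = {"critical"}


@dataclass
class Fault:
    code: str          # placeholder identifier, not a real UCS fault code
    severity: str      # e.g. "critical", "major", "minor", "warning"
    understood: bool   # True once the team knows the cause and has a plan


def blocking_faults(faults: Iterable[Fault]) -> List[Fault]:
    """Return the faults that should stop a major change."""
    return [
        f for f in faults
        if f.severity.lower() in BLOCKING_SEVERITIES or not f.understood
    ]


def safe_to_make_major_change(faults: Iterable[Fault]) -> bool:
    """True only when no critical or not-yet-understood fault is open."""
    return not blocking_faults(faults)


if __name__ == "__main__":
    open_faults = [
        Fault("EXAMPLE-1", "minor", understood=True),
        Fault("EXAMPLE-2", "major", understood=False),
    ]
    for fault in blocking_faults(open_faults):
        print("resolve or explain first:", fault.code, fault.severity)
    print("safe to change:", safe_to_make_major_change(open_faults))
```

The point of the `understood` flag is that a fault does not need to be cleared before every change, but someone should be able to say what it is and what the plan for it is.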
Practice #6: Set up (and use) maintenance policies
This policy makes admins perform an extra step when a change is made that requires the server to reboot in order to reconfigure something. The default policy is “immediate”, so when you click “OK” on the change notification message, the server(s) will immediately reboot. Create a policy called “user-ack” (short for user acknowledgement) and use that one. While you are at it, change the default policy to “user-ack” as well. Details can be found in the Cisco UCS Manager GUI Configuration guide in the Deferring Deployment of Service Profile Updates section for the version of UCS Manager you are using. For example, the UCSM 2.2 instructions can be found here.
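To make the difference between the two settings concrete, here is a toy model of how “immediate” and “user-ack” behave when a disruptive change is applied. It is a simulation for illustration, not UCS Manager code: the `Server` class and both functions are hypothetical names of my own.

```python
from dataclasses import dataclass


@dataclass
class Server:
    name: str
    reboot_policy: str = "immediate"   # the default behavior described above
    pending_reboot: bool = False       # True while waiting for an admin
    reboots: int = 0                   # how many times the server rebooted


def apply_disruptive_change(server: Server) -> None:
    """Apply a change that needs a reboot to take effect."""
    if server.reboot_policy == "immediate":
        server.reboots += 1            # reboots as soon as the change is saved
    elif server.reboot_policy == "user-ack":
        server.pending_reboot = True   # waits for explicit acknowledgement
    else:
        raise ValueError(f"unknown reboot policy: {server.reboot_policy}")


def acknowledge(server: Server) -> None:
    """Admin confirms the pending reboot, for example in a change window."""
    if server.pending_reboot:
        server.pending_reboot = False
        server.reboots += 1


if __name__ == "__main__":
    risky = Server("blade-1")                          # default policy
    safer = Server("blade-2", reboot_policy="user-ack")
    for srv in (risky, safer):
        apply_disruptive_change(srv)
        print(srv.name, "reboots:", srv.reboots, "pending:", srv.pending_reboot)
    acknowledge(safer)
    print(safer.name, "reboots:", safer.reboots, "pending:", safer.pending_reboot)
```

With “user-ack”, the reboot still happens, but only when an admin chooses the moment, which is the whole point of the extra step.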
In the next post I will describe some of the other recommended practices including the use of policies, pools and templates.