Ease of Troubleshooting: Meet the Cisco MDS 9710 Multilayer Director
At the end of April, Cisco introduced its next-generation storage networking innovations with the new MDS 9710 Multilayer Director and the new MDS 9250i Multiservice Switch (see Berna Devrim’s blog). These solutions were presented at Storage Tech Field Day (see blogs and videos below), as well as at the Cisco booth at various events (e.g., EMC World), generating a lot of comments from bloggers and users. So I went back to the team of Cisco engineers to learn more about these solutions. Today I interviewed Bhavin Yadav, one of these fine engineers, about the MDS 9710.
“Hi Bhavin. I’ve heard a lot of comments about troubleshooting, and the claim that it’s easy. Really? How can I achieve that?”
Haha. Welcome to the world of Cisco MDS 9700 Series Multilayer Directors, the easy-to-troubleshoot director-class platform in the storage area networking world! Isn’t it wonderful? Who doesn’t want that? But let me first remind you of some fundamental facts before we come to any conclusion.
What are some of the reasons we have to troubleshoot?
1. The unpredictable human factor: replacing the wrong part, or playing around for fun. (Are you serious? Yes, you can just pull the lever and slide the module out; it’s a piece of cake.)
2. Technology changes: sudden traffic pattern changes such as unexpected data spikes; infrastructure changes such as adding a bunch of new virtual machines or servers; SSD-based storage; exponential growth in nightly data backups; deployment of new data centers; the need for faster networks; and so on.
3. Unexpected infrastructure events: sudden power loss, lost air cooling, burned-out cables, failing hard drives, switches, or cables, etc.
So the question that comes up is: okay, this is nothing new; we deal with it every day. Well, let me ask you this: while you work on the problems above, does it impact your customers or clients? Does it reduce your available resources? Does it wake your staff up at 2 am? How about reducing these kinds of calls and engagements while also enhancing the customer experience? Yes, that’s what we are going to talk about.
Introducing the Cisco MDS 9710 Multilayer Director, a director-class SAN switch. Along with raising the bar and setting new benchmarks for performance, reliability, and flexibility, it now does the same for troubleshooting. Let’s see how.
The MDS 9710 comes with newly refreshed hardware and a new troubleshooting and monitoring tool: GOLD (Generic Online Diagnostics). In this post, I want to talk about both the new hardware and GOLD.
First, let’s talk about the new chassis. It can hold up to eight 16G FC line cards (modules), two supervisor modules, eight power supplies, six fabric modules (backplanes), and three fan trays with four fans in each tray.
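To make those chassis limits concrete, here is a minimal Python sketch that records them and derives a couple of totals. The class and field names are hypothetical, and the figure of 48 ports per line card is an assumption inferred from the 48 to 384 port range mentioned later in this post, not a quoted specification.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ChassisLimits:
    """Maximum component counts for a fully loaded chassis, as listed above."""
    line_cards: int = 8
    supervisors: int = 2
    power_supplies: int = 8
    fabric_modules: int = 6
    fan_trays: int = 3
    fans_per_tray: int = 4
    ports_per_line_card: int = 48  # assumption, inferred from the 48-384 port range

    def total_fans(self) -> int:
        """Fans across all trays."""
        return self.fan_trays * self.fans_per_tray

    def max_ports(self, installed_line_cards: Optional[int] = None) -> int:
        """FC ports available for a given number of installed line cards."""
        cards = self.line_cards if installed_line_cards is None else installed_line_cards
        if not 0 <= cards <= self.line_cards:
            raise ValueError(f"chassis holds 0 to {self.line_cards} line cards")
        return cards * self.ports_per_line_card


limits = ChassisLimits()
print(limits.total_fans())   # fans in a fully loaded chassis
print(limits.max_ports(1))   # ports with a single line card
print(limits.max_ports())    # ports with every line card slot filled
```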
By now, we all know about the hardware redundancy built into every component of Cisco MDS director-class SAN switches. With a fully loaded chassis, if we lose any single component, we still maintain bandwidth, performance, and reliability, and we can take our time replacing the failed component. But how about having predictive analysis that tells us when something is about to fail? How about having advance information on how much bandwidth I will need during weekend backups to avoid any disruption to production traffic?
Is that where the MDS 9710 differs and comes out as the winner?
Exactly. And let me add more color to this statement.
Using DCNM (Data Center Network Manager), we now have the future in our hands: we can estimate future requirements for bandwidth, storage, and other resources based on daily or monthly transaction averages. This helps in planning the OPEX budget.
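As a rough illustration of forecasting from daily averages, the sketch below fits a straight-line trend to a series of daily average bandwidth figures and extends it into the future. This is not how DCNM computes its projections; it is a generic least-squares example, and the function names and sample numbers are made up for illustration.

```python
from statistics import mean
from typing import Sequence, Tuple


def fit_linear_trend(daily_avg_gbps: Sequence[float]) -> Tuple[float, float]:
    """Least-squares line through (day index, daily average); returns (slope, intercept)."""
    n = len(daily_avg_gbps)
    if n < 2:
        raise ValueError("need at least two daily samples to fit a trend")
    xs = range(n)
    x_bar = mean(xs)
    y_bar = mean(daily_avg_gbps)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, daily_avg_gbps))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return slope, intercept


def project_daily_average(daily_avg_gbps: Sequence[float], days_ahead: int) -> float:
    """Extrapolate the fitted trend `days_ahead` days past the last sample."""
    slope, intercept = fit_linear_trend(daily_avg_gbps)
    future_day = len(daily_avg_gbps) - 1 + days_ahead
    return intercept + slope * future_day


# Hypothetical daily averages (Gbps) for four consecutive days.
history = [10.0, 12.0, 14.0, 16.0]
print(project_daily_average(history, days_ahead=3))
```

A straight line is the simplest possible model; real traffic with weekly backup peaks would need something that accounts for seasonality.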
Beginning with NX-OS 6.2.x, we have new GOLD standards for monitoring and troubleshooting the new chassis and its components. Using GOLD, we can verify that the hardware, the software, and their internal components are operating as expected, and rapidly isolate faults, if any. GOLD provides the following categories:
Online diagnostics: We can verify the functionality of the hardware while the device is up and running, using both disruptive and non-disruptive tests. They are classified into the following categories:
Bootup Diagnostics: These hardware checks run whenever we insert a new component into the chassis, for example a line card. As soon as the new line card powers up in the chassis, it goes through a series of boot-up checks. If any checkpoint says “Hello, I have a problem,” the card stops booting right there. How does this help? Think of a plane on the runway, about to bring its engines up to full power, that finds a fault and halts right there to save all the lives on board. These boot-up checks protect the rest of the line cards and their ports (anywhere between 48 and 384 lives, sorry, ports). Isn’t this wonderful? Yes, faulty hardware will not come up, and it will send an alert to the admins. These diagnostics are on by default.
Run-time Diagnostics: Also called health monitoring diagnostics. This is similar to what we do regularly for ourselves: we get blood tests done to make sure everything is normal. Run-time diagnostics detect hardware issues, memory issues, ASIC issues, and performance degradation over time, and identify bottlenecks caused by aging hardware. These are non-disruptive tests, so we can do our normal work while they are being carried out. They cover line card modules and supervisor modules, and include the ASIC register check, boot ROM tests, the snake loopback test, the NVRAM check, control bus and management bus checks, and so on. Every test runs at a fixed, configurable interval and is active by default.
Scheduled / On-Demand Diagnostics: This is like our yearly checkup. If a component has failed, what should we do? How are the other components doing? Are we expecting any failures, or a heavy traffic burst that will raise a resource utilization flag? These include both disruptive and non-disruptive tests, which can be run on demand or scheduled at a defined time and frequency.
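The logic behind these diagnostic categories can be sketched in a few lines of Python. This is only a conceptual model, not GOLD itself and not NX-OS code: the class, function, and test names are hypothetical. It shows two ideas from the descriptions above, namely that a module stops booting at the first failed boot-up check, and that only non-disruptive tests are safe to run while traffic is flowing.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class DiagnosticTest:
    name: str
    check: Callable[[], bool]   # returns True when the hardware passes
    disruptive: bool = False    # disruptive tests interrupt normal traffic


def first_failed_test(tests: List[DiagnosticTest]) -> Optional[str]:
    """Run tests in order and stop at the first failure, like a boot-up sequence."""
    for test in tests:
        if not test.check():
            return test.name
    return None


def bring_module_online(tests: List[DiagnosticTest]) -> bool:
    """A module only comes online if every boot-up check passes."""
    failed = first_failed_test(tests)
    if failed is not None:
        print(f"ALERT: module held offline, boot-up test '{failed}' failed")
        return False
    return True


def safe_for_production(tests: List[DiagnosticTest]) -> List[str]:
    """Names of the tests that can run without disrupting traffic."""
    return [t.name for t in tests if not t.disruptive]


# Hypothetical test set; the lambdas stand in for real hardware checks.
tests = [
    DiagnosticTest("asic-register-check", lambda: True),
    DiagnosticTest("nvram-check", lambda: False),
    DiagnosticTest("port-loopback", lambda: True, disruptive=True),
]
print(bring_module_online(tests))
print(safe_for_production(tests))
```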
High Availability: Well, really, is this a kind of diagnostics? Yes, it is. Every major hardware component in the MDS 9710 chassis has built-in redundancy to make sure performance and throughput are not impacted by any failure, provided we have a fully loaded chassis: N:N fabric module redundancy, N:N and N+1 grid redundancy for power supplies, N:N supervisor modules, redundant control and data paths for fan trays, and so on. One of the coolest features of this chassis is the LED indicators for every hardware component. They are easy to reach and see; there is no need to remove a cover or unscrew anything to see the LEDs.
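To see what the two power redundancy modes mean in practice, here is a small Python sketch of the usual arithmetic: N+1 keeps one supply in reserve, while N:N grid redundancy assumes the supplies are split evenly across two power grids and one whole grid can be lost. The function name is hypothetical, and the 3000 W per supply figure is a placeholder for illustration, not a specification of this chassis.

```python
def usable_power_watts(supplies: int, watts_each: float, mode: str) -> float:
    """Power budget that survives the failure the chosen redundancy mode protects against."""
    if supplies < 2:
        raise ValueError("redundancy needs at least two power supplies")
    if mode == "n+1":
        # One supply is held in reserve to cover a single supply failure.
        return (supplies - 1) * watts_each
    if mode == "grid":
        # Supplies are split across two grids; the budget must survive losing one grid.
        return (supplies // 2) * watts_each
    raise ValueError(f"unknown redundancy mode: {mode!r}")


# Eight supplies, as in a fully loaded chassis; 3000 W each is a placeholder value.
print(usable_power_watts(8, 3000, "n+1"))
print(usable_power_watts(8, 3000, "grid"))
```

Grid redundancy protects against a bigger failure, so it leaves a smaller usable budget than N+1 for the same number of supplies.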
Another feature is the deep integration between NX-OS and the hardware. Let’s say you want to replace a line card in the chassis: you can issue a CLI command to start its ID LED blinking. This is now standard, but this chassis goes further. As soon as the onsite person pushes the eject button or opens the lever on a line card and is ready to pull it out of the chassis, you can verify from the CLI status command which line card’s lever or eject button is being operated.
Temperature control is another set of features for this chassis. There are about 16 ambient temperature sensors across the chassis and on the hardware components, collecting temperature statistics inside the chassis. We can check the temperature of each sensor with a CLI command. This not only helps to automatically adjust fan speeds to maintain the right temperature inside the chassis, but also makes sure no component overheats, preventing further damage and hazardous situations.
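The idea of driving fan speed from sensor readings can be sketched as follows. This is a simplified illustration, not the chassis firmware’s actual control algorithm: the function names and every threshold (30 °C, 60 °C, 75 °C, and the 30 to 100 percent speed range) are assumptions chosen only to make the example run.

```python
from typing import List, Sequence


def fan_speed_percent(
    sensor_temps_c: Sequence[float],
    low_c: float = 30.0,     # assumed: at or below this, fans idle at min_speed
    high_c: float = 60.0,    # assumed: at or above this, fans run at max_speed
    min_speed: int = 30,
    max_speed: int = 100,
) -> int:
    """Scale fan speed linearly with the hottest sensor reading."""
    hottest = max(sensor_temps_c)
    if hottest <= low_c:
        return min_speed
    if hottest >= high_c:
        return max_speed
    fraction = (hottest - low_c) / (high_c - low_c)
    return round(min_speed + fraction * (max_speed - min_speed))


def overheated_sensors(sensor_temps_c: Sequence[float], limit_c: float = 75.0) -> List[int]:
    """Indexes of sensors at or above the (assumed) overheating limit."""
    return [i for i, temp in enumerate(sensor_temps_c) if temp >= limit_c]


# Sixteen hypothetical readings, one of them warmer than the rest.
readings = [25.0] * 15 + [45.0]
print(fan_speed_percent(readings))
print(overheated_sensors(readings))
```

Using the hottest sensor rather than the average is a deliberately conservative choice here: one hot spot is enough to speed the fans up.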
Online diagnostics at such a deep level of integration provide peace of mind to the people operating the switch and managing the environment, and reduce the cycles spent on conference calls doing root cause analysis (RCA). At the same time, they guarantee performance and allow enough time to make wise decisions about replacing the right part at the right time. No 2 am calls, please.
And now comes the creamy center of the sandwich: price. This feature comes free to the customer with the standard license, at no additional cost. So peace of mind, performance, flexibility, and redundancy all come free. Now that’s what I call the Director’s cut.
So, Bhavin, how can our readers who are now curious and excited learn more?
Well, the Cisco Data Center Group (DCG) will be hosting 1:1 customer meetings during Cisco Live! Orlando 2013.
We will have BU Executives, Product Managers, and Technical Marketing Engineers on site to meet with customers. This is a unique opportunity for your customers to learn about next-generation Fibre Channel SANs that provide maximum bandwidth, while building in higher reliability, architectural flexibility, and simplified management.
We can hold private meetings (30-90 minutes) on the following topics:
• Cisco FC and FCoE SAN Vision and Roadmap
• Architectures for Data Center Consolidation and New Application Roll Out with MDS 16G and Nexus SANs
• Backup and Disaster Recovery solutions
• Management in a Virtualized Environment
Please use the Registration Link to schedule a meeting with us: select one of the SAN topics in the request, and the schedulers will work with you to find a time that is convenient for you.
Here are some SAN-related breakout sessions that may be of interest to our visitors:
• BRKSAN-2304 – Storage Area Network Design, Operation, and Extension
• BRKDCT-1044 – FCoE for the IP Network Engineer
• BRKSAN-1121 – SAN Core Edge Design Best Practices
• BRKSAN-2047 – FCoE – Design, Operations and Management Best Practices
• BRKSAN-2267 – Unified Storage Access Design
• BRKSAN-2282 – Operational Models for FCoE Deployments – Best Practices and Examples
• BRKSAN-2378 – Evolution of Connectivity Options for FCoE
• BRKNMS-2695 – Administration and monitoring of the Cisco Data Center with Cisco DCNM
• BRKCOM-2007 – UCS SAN Deployment Models and Best Practices
Finally, I invite everybody to stop by our booth and say hello or join us for a conversation!
Here, among many other demos, is what we will showcase:
• Multiprotocol Storage Networking
• End-to-End LAN and SAN
• Cisco NX-OS Unified Datacenter Operations
As you can see, the storage networking team will provide you with a lot of extremely valuable information at Cisco Live.
But if you can’t attend Cisco Live US, here are some additional resources to keep learning about these new solutions.