Part 1 of the 3-part Wireless Catalyst 9800 WLC KPIs series
When working in critical wireless infrastructures, it is important to be proactive and determine in advance whether there is any potential issue that could impact the end-client experience. Wireless Catalyst 9800 WLC KPIs will help with that task.
In this blog, I will share a systematic approach plus a list of commands that I have used while providing support in the NOC for one of the largest wireless events worldwide. The idea is to show how to keep a close eye on Key Performance Indicators (KPIs) for the Catalyst 9800 WLC.
KPI outputs can be collected periodically to create a baseline while the network is working fine. That makes it easier to spot any deviation later by comparing new outputs with the previously collected ones.
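As an illustration of the baseline idea, here is a minimal Python sketch that compares a new set of KPI values against a stored baseline and reports the ones that moved by more than a tolerance. The KPI names, the sample values, and the 20% tolerance are all hypothetical; they only show the shape of the comparison.

```python
def find_deviations(baseline, current, tolerance_pct=20.0):
    """Return KPIs whose current value differs from the baseline
    by more than tolerance_pct (hypothetical threshold)."""
    deviations = {}
    for name, base_value in baseline.items():
        if name not in current:
            continue  # KPI not collected this time, nothing to compare
        new_value = current[name]
        if base_value == 0:
            # Relative change is undefined, so any non-zero value counts.
            changed = new_value != 0
        else:
            change_pct = abs(new_value - base_value) / abs(base_value) * 100
            changed = change_pct > tolerance_pct
        if changed:
            deviations[name] = (base_value, new_value)
    return deviations


# Hypothetical KPI snapshots: one taken when the network was healthy,
# one taken later.
baseline = {"cpu_5min_pct": 16, "dram_used_pct": 15, "switchovers": 0}
current = {"cpu_5min_pct": 17, "dram_used_pct": 31, "switchovers": 1}

print(find_deviations(baseline, current))
```

In practice the snapshots would come from parsed command outputs collected on a schedule, and the tolerance would likely differ per KPI.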
I have divided WLC KPIs into six different buckets or areas:
- WLC checks
- Connection with other devices
- AP checks
- RF checks
- Client checks
- Packet Drops
KPIs will help us spot issues in any of the six areas mentioned. In this blog, I cover WLC checks and Connection with other devices. Two more blogs will follow, covering AP checks, RF checks, Client checks, and Packet Drops.
WLC checks
I usually start by checking the WLC first, since it is the most critical part. If any issues are seen in the controller, they will soon cascade into problems with APs and clients. In other words, the idea here is to follow a top-down approach.
While reviewing the health of the WLC, I would first confirm that the WLC is running the intended version and is in install mode. Install mode ensures that the controller boots faster and with a reduced memory footprint. After that, I would check the uptime of the WLC to see if any reload has occurred. Use the command: “show version | i uptime|Installation mode|Cisco IOS Software”
Gladius1#show version | i uptime|Installation mode|Cisco IOS Software
Cisco IOS Software [Amsterdam], C9800 Software (C9800_IOSXE-K9), Version 17.3.5a, RELEASE SOFTWARE (fc2)
Gladius1 uptime is 2 weeks, 5 days, 21 hours, 30 minutes
Installation mode is INSTALL
Check expected release, uptime, and WLC running in install mode.
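If the output is collected by a script, the same three checks can be automated. The sketch below is only an illustration: the function names are hypothetical, the sample text is the output shown above, and the regular expressions assume the output keeps this exact wording.

```python
import re

# Sample text copied from the "show version" output shown above.
SAMPLE = """\
Cisco IOS Software [Amsterdam], C9800 Software (C9800_IOSXE-K9), Version 17.3.5a, RELEASE SOFTWARE (fc2)
Gladius1 uptime is 2 weeks, 5 days, 21 hours, 30 minutes
Installation mode is INSTALL
"""


def parse_show_version(text):
    """Extract release, uptime and installation mode from the filtered output."""
    version = re.search(r"Version ([^,\s]+)", text)
    uptime = re.search(r"uptime is (.+)", text)
    mode = re.search(r"Installation mode is (\S+)", text)
    return {
        "version": version.group(1) if version else None,
        "uptime": uptime.group(1).strip() if uptime else None,
        "install_mode": mode.group(1) if mode else None,
    }


def check_version(info, expected_version):
    """Return a list of findings; an empty list means both checks passed."""
    problems = []
    if info["version"] != expected_version:
        problems.append(f"running {info['version']}, expected {expected_version}")
    if info["install_mode"] != "INSTALL":
        problems.append(f"installation mode is {info['install_mode']}")
    return problems


info = parse_show_version(SAMPLE)
print(info)
print(check_version(info, expected_version="17.3.5a"))
```

The uptime is kept as text here; comparing it against the previous collection is enough to notice a reload.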
For Catalyst 9800 WLCs deployed in High Availability (which, by the way, is highly recommended for critical deployments), we first need to verify that the HA pair is formed and that the standby is in a standby-hot state. Secondly, check the stack uptime and each member’s individual uptime. Thirdly, identify the number of switchovers between active and standby. Use the command: “show redundancy | i ptime|Location|Current Software state|Switchovers”.
Gladius1#show redundancy | i ptime|Location|Current Software state|Switchovers
Available system uptime = 2 weeks, 1 day, 2 hours, 48 minutes
Switchovers system experienced = 1
Active Location = slot 1
Current Software state = ACTIVE
Uptime in current state = 7 hours, 10 minutes
Standby Location = slot 2
Current Software state = STANDBY HOT
Uptime in current state = 7 hours, 4 minutes
Check stack uptime, number of switchovers, and uptime for members. Here a switchover occurred about 7 hours ago: slot 1 is the new active and slot 2 reloaded.
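The HA checks can be scripted in the same way. This is a rough sketch with hypothetical function names; it assumes the filtered output looks like the sample above and that the previous switchover count is known from the baseline.

```python
import re

# Sample text copied from the "show redundancy" output shown above.
SAMPLE = """\
Available system uptime = 2 weeks, 1 day, 2 hours, 48 minutes
Switchovers system experienced = 1
Active Location = slot 1
Current Software state = ACTIVE
Uptime in current state = 7 hours, 10 minutes
Standby Location = slot 2
Current Software state = STANDBY HOT
Uptime in current state = 7 hours, 4 minutes
"""


def parse_redundancy(text):
    """Extract the switchover counter and the software state of each member."""
    switchovers = re.search(r"Switchovers system experienced = (\d+)", text)
    states = re.findall(r"Current Software state = (.+)", text)
    return {
        "switchovers": int(switchovers.group(1)) if switchovers else None,
        "states": [state.strip() for state in states],
    }


def ha_problems(info, baseline_switchovers=0):
    """Flag a missing hot standby or a switchover count above the baseline."""
    problems = []
    if "STANDBY HOT" not in info["states"]:
        problems.append("standby is not in STANDBY HOT state")
    if info["switchovers"] is not None and info["switchovers"] > baseline_switchovers:
        problems.append(
            f"switchovers increased from {baseline_switchovers} to {info['switchovers']}"
        )
    return problems


info = parse_redundancy(SAMPLE)
print(info)
print(ha_problems(info, baseline_switchovers=0))
```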
In HA deployments, the recommendation is to use the RMI (Redundancy Management Interface) feature. This allows monitoring of active and standby through both the Wireless Management Interface (WMI) and the Redundancy Port (RP). We should also enable Default-gateway Check to confirm that both active and standby can reach the gateway. Here is a link to the 9800 High Availability deployment guide.
The next step is to check whether there have been any WLC crashes, and to determine whether a crash matches the time of a switchover or an unexpected reload. When a WLC crash occurs, it should generate a core dump or a system report. Those files are stored on the WLC harddisk for the 9800-40/80 or in bootflash for the 9800-L/CL. Use the commands: “dir harddisk:/core/ | i core|system-report” and “dir stby-harddisk:/core/ | i core|system-report”, replacing harddisk with bootflash for the 9800-L/CL.
Gladius1#dir harddisk:/core/ | i core|system-report
Directory of harddisk:/core/
3661831  -rw-   11260562  Mar 25 2022 22:07:12 +01:00  Gladius1_1_RP_0_wncd_16574_20220325-220708-CET.core.gz
3661830  -rw-      48528  Mar 25 2022 21:57:20 +01:00  Gladius1_1_RP_0-system-report_20220325-215658-CET-info.txt
3661829  -rw-  126548098  Mar 25 2022 21:57:10 +01:00  Gladius1_1_RP_0-system-report_20220325-215658-CET.tar.gz
3661828  -rw-      57191   Mar 9 2021 16:21:48 +01:00  Gladius1_1_RP_0-system-report_20210309-161907-CET-info.txt
3661827  -rw-  504311304   Mar 9 2021 16:20:51 +01:00  Gladius1_1_RP_0-system-report_20210309-161907-CET.tar.gz
3661826  -rw-   11714625  Nov 19 2020 10:35:54 +01:00  Gladius1_1_RP_0_wncd_30240_20201119-103550-CET.core.gz
Check for cores and system reports. Here, two cores in the wncd process and two system reports have been generated.
If we observe any core dump, we can identify the impacted process from the file name. For example, WLC_1_RP_0_wncd_16574_20220325-220708-CET.core.gz is a crash in the “wncd” process, and WLC_1_RP_0_dbm_14119_20201104-092800-CET.core.gz is a crash in the “dbm” process. Open a TAC case to identify the root cause of the crash.
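Extracting the process name from a core file name is easy to automate. The pattern below is inferred only from the file names shown in this blog (host, chassis, RP number, process, PID, timestamp), so treat it as an assumption rather than a documented naming scheme; the function name is hypothetical.

```python
import re

# Pattern inferred from the example file names in this post (an assumption):
# <host>_<chassis>_RP_<n>_<process>_<pid>_<YYYYMMDD>-<HHMMSS>-<TZ>.core.gz
CORE_NAME = re.compile(
    r"_RP_\d+_(?P<process>.+)_(?P<pid>\d+)_(?P<date>\d{8})-(?P<time>\d{6})-\w+\.core\.gz$"
)


def parse_core_name(filename):
    """Return process, PID and date for a core file name, or None if it does not match."""
    match = CORE_NAME.search(filename)
    if match is None:
        return None  # for example a system report, which uses a different name
    return {
        "process": match.group("process"),
        "pid": int(match.group("pid")),
        "date": match.group("date"),
    }


for name in (
    "Gladius1_1_RP_0_wncd_16574_20220325-220708-CET.core.gz",
    "WLC_1_RP_0_dbm_14119_20201104-092800-CET.core.gz",
    "Gladius1_1_RP_0-system-report_20220325-215658-CET.tar.gz",
):
    print(name, "->", parse_core_name(name))
```

Grouping the results by process and date makes it easier to line crashes up with switchover or reload times.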
Once we have checked for crashes or unexpected reloads, we can continue by reviewing WLC CPU and memory utilization. For CPU monitoring, we need to run the command several times and detect whether any process shows CPU above 80% consistently rather than as a spike. I prefer to execute the command with the sorted keyword, so that you can focus on the processes with high CPU first. We have seen cases where consistently high CPU in the WNCD process led to AP disconnections. However, releases 17.3.5 and 17.6.3 have received additional hardening to protect AP CAPWAP connections when high CPU occurs. Use the command: “show processes cpu platform sorted | ex 0% 0% 0%”
Gladius1#show processes cpu platform sorted | ex 0% 0% 0%
CPU utilization for five seconds: 14%, one minute: 16%, five minutes: 16%
Core 0: CPU utilization for five seconds: 10%, one minute: 7%, five minutes: 11%
Core 1: CPU utilization for five seconds: 6%, one minute: 28%, five minutes: 12%
Core 2: CPU utilization for five seconds: 48%, one minute: 55%, five minutes: 68%
Core 3: CPU utilization for five seconds: 20%, one minute: 8%, five minutes: 11%
Core 4: CPU utilization for five seconds: 38%, one minute: 13%, five minutes: 17%
Core 5: CPU utilization for five seconds: 14%, one minute: 11%, five minutes: 13%
Core 6: CPU utilization for five seconds: 9%, one minute: 20%, five minutes: 23%
Core 7: CPU utilization for five seconds: 5%, one minute: 8%, five minutes: 18%
Core 8: CPU utilization for five seconds: 7%, one minute: 50%, five minutes: 34%
Core 9: CPU utilization for five seconds: 100%, one minute: 58%, five minutes: 27%
Core 10: CPU utilization for five seconds: 27%, one minute: 17%, five minutes: 25%
   Pid    PPid    5Sec    1Min    5Min  Status        Size  Name
--------------------------------------------------------------------------------
 19056   19037     99%     99%     99%  R          7525896  wncd_0
 21922   21913     96%     97%     99%  R           127488  smand
 19460   19451     37%     34%     33%  R          6363828  wncd_2
 19604   19596     18%     19%     18%  R          4556132  wncd_3
Check CPU utilization per core and per process. Processes wncd_0 and smand are close to 100% CPU utilization.
Catalyst 9800-CL and 9800-L platforms use CPU cores for data forwarding. Therefore, it is expected to see high CPU in ucode_pkt_PPE0. To evaluate data plane performance on those platforms, use the command: “show platform hardware chassis active qfp datapath utilization | i Load”
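To separate sustained load from a spike, a script can require all three averaging windows (5Sec, 1Min, 5Min) to be above the threshold. This is a simplified sketch: the function name is hypothetical, the rows are taken from the output above, and using the three windows of a single sample is only a stand-in for running the command several times.

```python
import re

# Process rows copied from the "show processes cpu platform sorted" output above.
SAMPLE = """\
   Pid    PPid    5Sec    1Min    5Min  Status        Size  Name
--------------------------------------------------------------------------------
 19056   19037     99%     99%     99%  R          7525896  wncd_0
 21922   21913     96%     97%     99%  R           127488  smand
 19460   19451     37%     34%     33%  R          6363828  wncd_2
 19604   19596     18%     19%     18%  R          4556132  wncd_3
"""

# Pid, PPid, 5Sec, 1Min, 5Min, Status, Size, Name
ROW = re.compile(
    r"^\s*(\d+)\s+(\d+)\s+(\d+)%\s+(\d+)%\s+(\d+)%\s+(\S+)\s+(\d+)\s+(\S+)\s*$"
)


def sustained_high_cpu(text, threshold=80):
    """Return process names whose 5Sec, 1Min and 5Min values all exceed threshold."""
    hot = []
    for line in text.splitlines():
        match = ROW.match(line)
        if match is None:
            continue  # header, separator or per-core summary line
        windows = [int(match.group(i)) for i in (3, 4, 5)]
        if min(windows) > threshold:
            hot.append(match.group(8))
    return hot


print(sustained_high_cpu(SAMPLE))
```

A process that is high only in the 5Sec column would be treated as a spike and left out of the result.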
Gladius1#show platform hardware chassis active qfp datapath utilization | i Load
  CPP 0: Subdev 0            5 secs        1 min        5 min       60 min
  Processing: Load (pct)          4            3            4            3
Check datapath load %
While checking memory utilization, we need to monitor whether overall device utilization is too high, and then identify whether any processes are holding memory and not releasing it over time (a leak). Use the commands: “show platform resources” (basic), “show processes memory platform sorted” and “show processes memory platform accounting” (advanced)
Gladius1#show platform resources
**State Acronym: H - Healthy, W - Warning, C - Critical
Resource                 Usage                 Max             Warning         Critical        State
----------------------------------------------------------------------------------------------------
RP0 (ok, active)                                                                               H
 Control Processor       0.79%                 100%            80%             90%             H
  DRAM                   4839MB(15%)           31670MB         88%             93%             H
  harddisk               0MB(0%)               0MB             80%             85%             H
ESP0(ok, active)                                                                               H
 QFP                                                                                           H
  TCAM                   68cells(0%)           1048576cells    65%             85%             H
  DRAM                   420162KB(20%)         2097152KB       85%             95%             H
  IRAM                   13738KB(10%)          131072KB        85%             95%             H
  CPU Utilization        0.00%                 100%            90%             95%             H
Confirm that the state is healthy for all metrics. Review Control Processor and memory utilization.
Gladius1#show processes memory platform sorted
System memory: 15869340K total, 6152000K used, 9717340K free, Lowest: 9717340K
   Pid    Text      Data   Stack   Dynamic       RSS   Name
----------------------------------------------------------------------
  3546  367768   1404580     136       488   1404580   linux_iosd-imag
 23602   22335    449968     136      1052    449968   ucode_pkt_PPE0
 24525     847    437624     136     46628    437624   wncd_0
 24004     160    373176    3956      6400    373176   wncmgrd
 26358     128    344868     136    136628    344868   mobilityd
Check the free memory available. Identify the top processes holding the most memory.
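One way to look for a leak is to compare the RSS column between two collections and list the processes that keep growing. The sketch below is illustrative only: the function names and the growth threshold are hypothetical, and the second snapshot is made up by changing one value of the real output above.

```python
import re

# Copied from the "show processes memory platform sorted" output above.
SAMPLE = """\
System memory: 15869340K total, 6152000K used, 9717340K free, Lowest: 9717340K
   Pid    Text      Data   Stack   Dynamic       RSS   Name
----------------------------------------------------------------------
  3546  367768   1404580     136       488   1404580   linux_iosd-imag
 23602   22335    449968     136      1052    449968   ucode_pkt_PPE0
 24525     847    437624     136     46628    437624   wncd_0
"""

# Pid, Text, Data, Stack, Dynamic, RSS, Name
ROW = re.compile(r"^\s*(\d+)\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)\s+(\S+)\s*$")


def memory_used_pct(text):
    """Percentage of system memory in use, from the 'System memory' line."""
    match = re.search(r"System memory: (\d+)K total, (\d+)K used", text)
    if match is None:
        return None
    total, used = int(match.group(1)), int(match.group(2))
    return round(used / total * 100, 1)


def parse_rss(text):
    """Map process name to its RSS value, in the units the command reports."""
    rss = {}
    for line in text.splitlines():
        match = ROW.match(line)
        if match:
            rss[match.group(7)] = int(match.group(6))
    return rss


def rss_growth(old, new, min_growth):
    """Processes present in both snapshots whose RSS grew by at least min_growth."""
    return {
        name: new[name] - old[name]
        for name in old
        if name in new and new[name] - old[name] >= min_growth
    }


first = parse_rss(SAMPLE)
second = dict(first, wncd_0=500000)  # hypothetical later snapshot
print(memory_used_pct(SAMPLE))
print(rss_growth(first, second, min_growth=50000))
```

A single increase proves little; growth that repeats across many collections without ever being released is the pattern worth raising with TAC.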
Gladius1#show processes memory platform accounting
Hourly Stats
process           callsite_ID(bytes)  max_diff_bytes  callsite_ID(calls)  max_diff_calls  tracekey                            timestamp(UTC)
------------------------------------------------------------------------------------------------------------------------------------------------------------
cpp_cp_svr_fp_0   2887897091          7243446         2887897092          1133            1#e4bd31e0c668be2b8786dec9fcc99486  2022-05-25 14:04
ndbmand_rp_0      3571094529          5453112         3570931712          1119            1#00c5632bf072231d06cf80b8ccc37392  2022-05-09 21:52
wncd_4_rp_0       2556049411          3059712         3028615169          227             1#9f4792f37292983824f5bb97d7e2167c  2022-05-10 14:54
wncd_0_rp_0       2556049411          1990656         3028615168          680             1#9f4792f37292983824f5bb97d7e2167c  2022-05-25 11:05
wncd_2_rp_0       2556049411          1953792         3028615169          682             1#9f4792f37292983824f5bb97d7e2167c  2022-05-13 14:01
smand_rp_0        2887895047          1491984         3028615168          89              1#eaf6dd665e73b1edeee32fb9c5ac8639  2022-05-10 14:54
Check top processes and the number of calls. Stats are hourly, daily, weekly, and monthly.
As a final controller health check, we can validate the hardware. Check the status of power supplies, fans, SFPs, and temperature (only for physical WLCs). Likewise, review the license status and confirm that the right number of licenses is in use. Use the commands: “show platform”, “show inventory”, “show environment” and “show license summary | i Status:”
Gladius1#show platform
Chassis type: C9800-40-K9
Slot      Type                State                 Insert time (ago)
--------- ------------------- --------------------- -----------------
0         C9800-40-K9         ok                    2w5d
 0/0      BUILT-IN-4X10G/1G   ok                    2w5d
R0        C9800-40-K9         ok, active            2w5d
F0        C9800-40-K9         ok, active            2w5d
P0        C9800-AC-750W-R     ok                    2w5d
P1        Unknown             empty                 never
P2        C9800-40-K9-FAN     ok                    2w5d
Slot      CPLD Version        Firmware Version
--------- ------------------- ---------------------------------------
0         19030712            16.10(2r)
R0        19030712            16.10(2r)
F0        19030712            16.10(2r)
Gladius1#show inventory
NAME: "Chassis 1", DESCR: "Cisco C9800-40-K9 Chassis"
PID: C9800-40-K9 , VID: V03 , SN: TTM242504SR
NAME: "Chassis 1 Power Supply Module 0", DESCR: "Cisco Catalyst 9800-40 750W AC Power Supply Reverse Air"
PID: C9800-AC-750W-R , VID: V01 , SN: ART2418F0GJ
NAME: "Chassis 1 Fan Tray", DESCR: "Cisco C9800-40-K9 Fan Tray"
PID: C9800-40-K9-FAN , VID: , SN:
NAME: "module 0", DESCR: "Cisco C9800-40-K9 Modular Interface Processor"
PID: C9800-40-K9 , VID: , SN:
NAME: "SPA subslot 0/0", DESCR: "4-port 10G/1G multirate Ethernet Port Adapter"
PID: BUILT-IN-4X10G/1G , VID: N/A , SN: JAE87654321
NAME: "subslot 0/0 transceiver 0", DESCR: "10GE LR"
PID: SFP-10G-LR , VID: V02 , SN: AVD2141KCFB
NAME: "module R0", DESCR: "Cisco C9800-40-K9 Route Processor"
PID: C9800-40-K9 , VID: V03 , SN: TTM242504SR
NAME: "module F0", DESCR: "Cisco C9800-40-K9 Embedded Services Processor"
PID: C9800-40-K9 , VID: , SN:
NAME: "Crypto Asic F0/0", DESCR: "Asic 0 of module F0"
PID: NOT , VID: V01 , SN: JAE242711XF
Gladius1#show environment
Number of Critical alarms: 0
Number of Major alarms: 0
Number of Minor alarms: 0
Check power supplies, fan status, SFPs, SPAs, and any alarms.
As an example of these Catalyst 9800 WLC KPIs helping to identify an issue, a customer was facing a High Availability setup issue between two WLCs. By reviewing the version and the hardware installed in both WLCs, we identified a difference in SPA adapters that was preventing the WLCs from pairing as HA.
Connection with other devices checks
In addition to WLC health, we can check the status of the WLC’s connections. The most important connections are mobility with other WLCs for inter-WLC roams, telemetry with DNAC/PI for monitoring and automation, and NMSP with DNA-Spaces/CMX for location services. We need to ensure that those connections are established and working fine.
Confirm that mobility tunnels with other WLCs are up and using the right encryption and MTU, and that clients can roam or be anchored to other WLCs. If tunnels are down, we can find out whether the issue is in the control tunnel (UDP port 16666), in the data tunnel (UDP port 16667), or in both. Use the command: “show wireless mobility summary”
Gladius1#sh wireless mobility summary
Wireless Management VLAN: 25
Wireless Management IP Address: 192.168.25.25
Mobility Control Message DSCP Value: 48
Mobility Keepalive Interval/Count: 10/3
Mobility Group Name: eWLC3
Mobility Multicast Ipv4 address: 0.0.0.0
Mobility MAC Address: 001e.f62a.46ff
Mobility Domain Identifier: 0x2e47
Controllers configured in the Mobility Domain:
IP              Public Ip       MAC Address     Group Name  Multicast IPv4  Multicast IPv6  Status                      PMTU
----------------------------------------------------------------------------------------------------------
192.168.25.25   N/A             001e.f62a.46ff  eWLC3       0.0.0.0         ::              N/A                         N/A
192.168.5.35    192.168.5.35    00b0.e1f2.f480  3500-2      0.0.0.0         ::              Up                          1385
192.168.25.23   192.168.25.23   706d.1535.6b0b  DAO2        0.0.0.0         ::              Control And Data Path Down
192.168.25.33   192.168.25.33   f4bd.9e57.ff6b  5500        0.0.0.0         ::              Up                          1005
Check for mobility tunnels that are down and for low PMTU.
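Both conditions can be picked out of the peer table with a small parser. This is a sketch under assumptions: the function names and the 1200-byte PMTU threshold are hypothetical, the rows are copied from the output above, and the parsing assumes group names contain no spaces.

```python
import re

# Peer rows copied from the "show wireless mobility summary" output above.
SAMPLE = """\
192.168.25.25   N/A             001e.f62a.46ff  eWLC3   0.0.0.0  ::  N/A                         N/A
192.168.5.35    192.168.5.35    00b0.e1f2.f480  3500-2  0.0.0.0  ::  Up                          1385
192.168.25.23   192.168.25.23   706d.1535.6b0b  DAO2    0.0.0.0  ::  Control And Data Path Down
192.168.25.33   192.168.25.33   f4bd.9e57.ff6b  5500    0.0.0.0  ::  Up                          1005
"""

IPV4 = re.compile(r"^\d{1,3}(\.\d{1,3}){3}$")


def parse_mobility_peers(text):
    """Parse peer rows into dicts; assumes group names have no spaces."""
    peers = []
    for line in text.splitlines():
        tokens = line.split()
        if len(tokens) < 7 or not IPV4.match(tokens[0]):
            continue  # not a peer row
        rest = tokens[6:]  # status words, optionally followed by the PMTU
        pmtu = None
        if rest[-1].isdigit():
            pmtu = int(rest[-1])
            rest = rest[:-1]
        elif len(rest) > 1 and rest[-1] == "N/A":
            rest = rest[:-1]  # the local controller shows N/A for both columns
        peers.append(
            {"ip": tokens[0], "group": tokens[3], "status": " ".join(rest), "pmtu": pmtu}
        )
    return peers


def mobility_problems(peers, min_pmtu=1200):
    """Flag peers that are not Up and peers with a PMTU below min_pmtu (hypothetical value)."""
    problems = []
    for peer in peers:
        if peer["status"] not in ("Up", "N/A"):
            problems.append(f"{peer['ip']}: {peer['status']}")
        elif peer["pmtu"] is not None and peer["pmtu"] < min_pmtu:
            problems.append(f"{peer['ip']}: PMTU {peer['pmtu']} is below {min_pmtu}")
    return problems


peers = parse_mobility_peers(SAMPLE)
print(mobility_problems(peers))
```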
If we have DNAC for Assurance or Provisioning, we can confirm that the DNAC NETCONF connection is established. Afterward, verify that telemetry statistics for the WLC, APs, and clients are updated in DNAC. Use the command: “show telemetry internal connection”. After 17.7, this command has been replaced by “show telemetry connection all”.
Gladius2#show telemetry internal connection
Load for five secs: 29%/5%; one minute: 4%; five minutes: 2%
Time source is NTP, 10:21:45.942 CET Wed Nov 4 2020
Telemetry connections
Index Peer Address               Port  VRF Source Address             State
----- -------------------------- ----- --- -------------------------- ----------
    1 192.168.0.105              25103   0 192.168.25.42              Active
Check for telemetry state
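A collection script can confirm the state column automatically. The sketch below uses hypothetical function names and assumes the six-column layout of the “show telemetry internal connection” output shown above; the newer “show telemetry connection all” output may be laid out differently.

```python
# Copied from the "show telemetry internal connection" output above.
SAMPLE = """\
Load for five secs: 29%/5%; one minute: 4%; five minutes: 2%
Time source is NTP, 10:21:45.942 CET Wed Nov 4 2020
Telemetry connections
Index Peer Address               Port  VRF Source Address             State
----- -------------------------- ----- --- -------------------------- ----------
    1 192.168.0.105              25103   0 192.168.25.42              Active
"""


def parse_telemetry_connections(text):
    """Parse connection rows: six columns, starting with a numeric index."""
    connections = []
    for line in text.splitlines():
        tokens = line.split()
        if len(tokens) == 6 and tokens[0].isdigit():
            connections.append(
                {
                    "peer": tokens[1],
                    "port": int(tokens[2]),
                    "source": tokens[4],
                    "state": tokens[5],
                }
            )
    return connections


def inactive_peers(connections):
    """Return peer addresses whose connection state is not Active."""
    return [c["peer"] for c in connections if c["state"] != "Active"]


connections = parse_telemetry_connections(SAMPLE)
print(connections)
print(inactive_peers(connections))
```

An empty connection list is also worth flagging, since it means no telemetry session is established at all.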
If we are using DNA-Spaces for location, we can first confirm the NMSP connection status and the number of packets transmitted and received. Secondly, check the list of clients in the WLC probing database. Lastly, verify that the client location is updated in DNA-Spaces. Use the command: “show nmsp status”
Gladius1#show nmsp status
NMSP Status
-----------
DNA Spaces/CMX IP Address   Active    Tx Echo Resp   Rx Echo Req   Tx Data    Rx Data   Transport
----------------------------------------------------------------------------------------------------------
192.168.0.65                Active    693870         693870        16833737   181084    TLS
192.168.0.66                Inactive  21             21            222        7         TLS
Check for inactive servers and any mismatch between echo Tx/Rx.
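Both NMSP checks are simple to automate once the rows are parsed. This sketch uses hypothetical function names and assumes the seven-column layout of the output shown above.

```python
# Copied from the "show nmsp status" output above.
SAMPLE = """\
DNA Spaces/CMX IP Address   Active    Tx Echo Resp   Rx Echo Req   Tx Data    Rx Data   Transport
----------------------------------------------------------------------------------------------------------
192.168.0.65                Active    693870         693870        16833737   181084    TLS
192.168.0.66                Inactive  21             21            222        7         TLS
"""


def parse_nmsp_status(text):
    """Parse server rows: IP, state, four counters and the transport."""
    servers = []
    for line in text.splitlines():
        tokens = line.split()
        if len(tokens) == 7 and tokens[1] in ("Active", "Inactive"):
            servers.append(
                {
                    "ip": tokens[0],
                    "state": tokens[1],
                    "tx_echo_resp": int(tokens[2]),
                    "rx_echo_req": int(tokens[3]),
                    "tx_data": int(tokens[4]),
                    "rx_data": int(tokens[5]),
                    "transport": tokens[6],
                }
            )
    return servers


def nmsp_problems(servers):
    """Flag inactive servers and echo request/response counter mismatches."""
    problems = []
    for server in servers:
        if server["state"] != "Active":
            problems.append(f"{server['ip']}: connection is {server['state']}")
        if server["tx_echo_resp"] != server["rx_echo_req"]:
            problems.append(
                f"{server['ip']}: echo mismatch "
                f"(tx resp {server['tx_echo_resp']}, rx req {server['rx_echo_req']})"
            )
    return problems


servers = parse_nmsp_status(SAMPLE)
print(nmsp_problems(servers))
```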
With the checks provided, we can proactively monitor the health of our 9800 WLC and its connections with other devices like CMX/DNA-Spaces, other WLCs, and DNAC. In the next blog, we will share KPIs to monitor APs and RF.
List of commands to use for KPIs and automation scripts
The document below also links to a script that automatically collects all the commands. It collects commands based on platform and release, saves them in a file, and exports the file. The script uses the “Guest-shell” feature, which for now is only available on the physical WLCs: 9800-40/80 and 9800-L.
The document also provides an example of an EEM script to collect logs periodically. In conclusion, EEM together with the “Guest-shell” script will help you collect 9800 WLC KPIs and build a baseline for your Catalyst 9800 WLC.
For the list of commands used to monitor those KPIs, visit the Monitor Wireless Catalyst 9800 KPIs document.