pyATS Demos: Network Profiling – NetDevOps Series, Part 15

Welcome back to my NetDevOps series! If you are just tuning in, you can find all previous posts in the NetDevOps series.

An amazing and useful use case

Today we will look at one of the most amazing and useful use cases for pyATS and Genie. I can guarantee you will not regret reading this post today!

Please remember you must have a reserved sandbox with your VIRL simulations running and be connected to it via VPN. Please see my previous post for further information on how to complete the environment setup.

Profiling your network for troubleshooting

Let’s say you are responsible for a network and would like help staying informed about issues as they happen. Wouldn’t it be great to have a tool that profiles the network end-to-end and stores that information as snapshots?

Let’s focus, for example, on profiling everything related to BGP, OSPF, interfaces and the platforms in your network, and saving that info to snapshot files. Ideally you would take a first snapshot of your network when everything is working superbly.

Genie can help you do it with a simple command

Specify the features you want to learn (ospf interface bgp platform), the testbed to learn them from (--testbed-file default_testbed.yaml), and the directory where the resulting files should be stored (--output good):

$ docker run -it --rm \
  -v $PWD:/pyats/demos/ \
  --env-file env.list \
  ciscotestautomation/pyats:latest-alpine ash
(pyats) /pyats# cd demos
(pyats) /pyats/demos # genie learn ospf interface bgp platform --testbed-file default_testbed.yaml --output good

Inside the created ‘good’ directory, console files will show you what commands were run to obtain all required info, while ops files will store the resulting information in structured format.

Now let’s simulate that something terrible has happened in your network… by shutting down one of the loopback interfaces on your CSR1000v router. Well, it’s not that terrible, but it serves as an example of what could have happened.

First you need to identify the IP address of that CSR1000v, so you can connect to it:

(pyats) /pyats/demos # cat default_testbed.yaml | grep -A 1 GigabitEthernet1:
      GigabitEthernet1:
        ipv4: 172.16.30.129/24
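
The testbed stores the address together with its prefix length, while ssh wants the bare host address. As a small sketch using only Python’s standard ipaddress module (the variable names are illustrative, not part of the testbed schema), the value shown above can be split like this:

```python
import ipaddress

# Value copied from the testbed output above (address plus prefix length).
testbed_ipv4 = "172.16.30.129/24"

# ip_interface keeps both the host address and the network it belongs to.
interface = ipaddress.ip_interface(testbed_ipv4)

host = str(interface.ip)          # the address to pass to ssh
network = str(interface.network)  # the subnet the management interface sits in

print(f"ssh cisco@{host}")
print(f"management network: {network}")
```

This only handles the single value shown here; reading it straight from the YAML file would need a YAML parser, which is outside the standard library.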

Now you can SSH to it with password cisco (accept it being added to your list of known hosts):

(pyats) /pyats/demos # ssh cisco@172.16.30.129

Once logged in, shut down interface Loopback 1 to simulate that terrible catastrophe in your network:

csr1000v-1#conf t
Enter configuration commands, one per line. End with CNTL/Z.
csr1000v-1(config)#int lo 1
csr1000v-1(config-if)#shut
csr1000v-1(config-if)#exit
csr1000v-1(config)#exit
csr1000v-1#exit
Connection to 172.16.30.129 closed by remote host.
Connection to 172.16.30.129 closed.

In the real world, you would soon be receiving calls from users: “Something is wrong… terribly wrong”, “I lost ALL connectivity”, “My database stopped working!”. So instead of troubleshooting by brute force, how about asking Genie to determine the current state of the network after the outage? Even better, what exactly changed since you took the snapshot of the network in its good state? And while we are at it, would it be possible to show not only configuration changes, but also changes in operational status? Maybe nobody shut an interface, but what if a cable was unplugged or is faulty…

Let’s do this by running the same command as previously, but this time asking the system to store the resulting files in a different directory (--output bad).

(pyats) /pyats/demos # genie learn ospf interface bgp platform --testbed-file default_testbed.yaml --output bad

And now find out what changed between the ‘good’ situation and the ‘bad’ one with yet another simple command.

(pyats) /pyats/demos # genie diff good bad
1it [00:00, 5.96it/s]
+==============================================================================+
| Genie Diff Summary between directories good/ and bad/ |
+==============================================================================+
| File: ospf_iosxe_csr1000v-1_ops.txt |
| - Identical |
|------------------------------------------------------------------------------|
| File: platform_nxos_nx-osv-1_ops.txt |
| - Identical |
|------------------------------------------------------------------------------|
| File: interface_iosxe_csr1000v-1_ops.txt |
| - Diff can be found at ./diff_interface_iosxe_csr1000v-1_ops.txt |
|------------------------------------------------------------------------------|
| File: bgp_nxos_nx-osv-1_ops.txt |
| - Diff can be found at ./diff_bgp_nxos_nx-osv-1_ops.txt |
|------------------------------------------------------------------------------|
| File: ospf_nxos_nx-osv-1_ops.txt |
| - Identical |
|------------------------------------------------------------------------------|
| File: bgp_iosxe_csr1000v-1_ops.txt |
| - Diff can be found at ./diff_bgp_iosxe_csr1000v-1_ops.txt |
|------------------------------------------------------------------------------|
| File: platform_iosxe_csr1000v-1_ops.txt |
| - Identical |
|------------------------------------------------------------------------------|
| File: interface_nxos_nx-osv-1_ops.txt |
| - Identical |
|------------------------------------------------------------------------------|

As you can see, Genie generates diff files that show exactly what has changed from the ‘good’ situation to the ‘bad’ one. In this specific case, one of the files immediately shows that interface Loopback 1 on the CSR1000v has been disabled!

(pyats) /pyats/demos # cat ./diff_interface_iosxe_csr1000v-1_ops.txt
--- learnt/interface_iosxe_csr1000v-1_ops.txt
+++ bad/interface_iosxe_csr1000v-1_ops.txt
info:
 Loopback1:
...
+ enabled: False
- enabled: True
+ oper_status: down
- oper_status: up

Talk about an easy way to determine why your network is not working properly anymore, and quickly find out what happened exactly!
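
To get a feel for what such a structured comparison involves, here is a minimal sketch of a recursive dictionary diff in plain Python. It is not Genie’s implementation, and the `good` and `bad` dictionaries are hypothetical, heavily trimmed stand-ins for the learned interface data; it only mimics the `+`/`-` convention seen in the diff file above.

```python
def diff_dicts(old, new, path=()):
    """Yield (sign, dotted_path, value) for every leaf that differs.

    '+' marks a value found in `new`, '-' a value found in `old`.
    """
    for key in sorted(set(old) | set(new)):
        here = path + (key,)
        in_old, in_new = key in old, key in new
        if in_old and in_new and isinstance(old[key], dict) and isinstance(new[key], dict):
            # Both sides are nested: recurse instead of comparing whole subtrees.
            yield from diff_dicts(old[key], new[key], here)
        elif in_old and in_new and old[key] == new[key]:
            continue  # unchanged leaf, nothing to report
        else:
            if in_new:
                yield ("+", ".".join(here), new[key])
            if in_old:
                yield ("-", ".".join(here), old[key])


# Hypothetical, trimmed stand-ins for the 'good' and 'bad' interface snapshots.
good = {"info": {"Loopback1": {"enabled": True, "oper_status": "up", "type": "Loopback"}}}
bad = {"info": {"Loopback1": {"enabled": False, "oper_status": "down", "type": "Loopback"}}}

changes = list(diff_dicts(good, bad))
for sign, where, value in changes:
    print(f"{sign} {where}: {value}")
```

Because the comparison works on structured data rather than raw CLI text, unchanged keys such as `type` drop out and only the two changed fields are reported.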

But we could do even better…

There’s always room for improvement, right? You have probably noticed that the output from Genie commands is easier to understand than the output from the original pyATS commands. Still, it was a lot for just a couple of devices. Imagine running that same test across a complete network with hundreds or thousands of systems… that would be a lot of logging info! As an operator I probably don’t need that much output, and would be better served by an intuitive summary with the key information about what I am doing.
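
As a rough illustration of trimming that output down, the sketch below filters a diff summary so that only the files that actually changed remain. The sample text is copied from the `genie diff good bad` summary earlier in this post; the regular expressions and the `changed_files` helper are my own assumptions about how to read it, not something Genie provides.

```python
import re

# A few lines copied from the `genie diff good bad` summary shown earlier.
summary = """\
| File: ospf_iosxe_csr1000v-1_ops.txt |
| - Identical |
| File: interface_iosxe_csr1000v-1_ops.txt |
| - Diff can be found at ./diff_interface_iosxe_csr1000v-1_ops.txt |
| File: bgp_nxos_nx-osv-1_ops.txt |
| - Diff can be found at ./diff_bgp_nxos_nx-osv-1_ops.txt |
"""

FILE_RE = re.compile(r"\|\s*File:\s*(\S+)\s*\|")
DIFF_RE = re.compile(r"\|\s*-\s*Diff can be found at\s*(\S+)\s*\|")


def changed_files(text):
    """Return {snapshot_file: diff_file} for every entry that is not identical."""
    changed = {}
    current = None
    for line in text.splitlines():
        file_match = FILE_RE.search(line)
        if file_match:
            current = file_match.group(1)  # remember which file the next line describes
            continue
        diff_match = DIFF_RE.search(line)
        if diff_match and current is not None:
            changed[current] = diff_match.group(1)
    return changed


result = changed_files(summary)
for snapshot, diff_path in result.items():
    print(f"{snapshot} -> {diff_path}")
```

With thousands of devices, a filter like this would reduce the summary to the handful of entries worth opening.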

Besides this, network operators are probably interested in defining their tests in a way that is as close to natural language as possible. Robot Framework is an open-source test automation framework that can help you with these challenges. Let’s take a look at an example of what can be done with it.

We will run the same scenario as before and see some of the benefits we get with Robot. So again, we will take a first snapshot of our network when it is working fine.

Before we start, please go to your CSR and get interface Loopback 1 back up again, so that the network is tidy and clean, as it was in the beginning.

(pyats) /pyats/demos # ssh cisco@172.16.30.129
csr1000v-1#conf t
Enter configuration commands, one per line. End with CNTL/Z.
csr1000v-1(config)#int lo 1
csr1000v-1(config-if)#no shut
csr1000v-1(config-if)#exit
csr1000v-1(config)#exit
csr1000v-1#exit
Connection to 172.16.30.129 closed by remote host.
Connection to 172.16.30.129 closed.

Exit the container. Everything is now back to the normal initial situation.

Now, instead of running the Genie profiling command directly from the CLI, with Robot we will use the initial_snapshot.robot test definition file you will find in the demos directory. This file specifies the libraries to import, where the testbed file resides, and the test cases definition. Please review this file and you will see the different steps in these test cases are defined with very simple language.

First it will connect to the testbed devices:

Connect
    # Initializes the pyATS/Genie Testbed
    use genie testbed "${testbed}"

    # Connect to both devices
    connect to device "nx-osv-1"
    connect to device "csr1000v-1"

And then the system will profile them, specifying where to store the resulting network profile snapshot files:

Profile the devices
    Profile the system for "bgp;config;interface;platform;ospf;arp;routing;vrf;vlan" on devices "nx-osv-1;csr1000v-1" as "./good/good_snapshot"

The language is simple and natural, which makes it easy to understand what the test case is supposed to do.

Let’s run robot with a single command that simply specifies the directory where we want to store the resulting log, output and report (-d good):

(pyats) /pyats/demos # robot -d good initial_snapshot.robot
==============================================================================
Initial Snapshot
==============================================================================
[ WARN ] Could not load the Datafile correctly
Connect | PASS |
------------------------------------------------------------------------------
Profile the devices | PASS |
------------------------------------------------------------------------------
Initial Snapshot | PASS |
2 critical tests, 2 passed, 0 failed
2 tests total, 2 passed, 0 failed
==============================================================================
Output: /pyats/demos/good/output.xml
Log: /pyats/demos/good/log.html
Report: /pyats/demos/good/report.html
(pyats) /pyats/demos #

As you can see, the output an operator gets when executing the test case is now much more concise. It clearly states, in one line per step, whether the test passed, and where you can find the resulting report, output and log files. These make it easy to see from a browser how the tests went, drill down into each specific test and examine the logs to see exactly what happened. In this case we have decided to store these files in the same directory where we keep the profiling snapshots.
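
The output.xml file can also be consumed by scripts. The sketch below pulls the per-test status out of a small, hypothetical sample shaped like Robot’s output.xml, using only Python’s standard library. The element layout reflects my understanding of the format and should be checked against the file your Robot version produces; Robot Framework also ships its own result-parsing API, which is the more robust route.

```python
import xml.etree.ElementTree as ET

# Hypothetical, trimmed sample shaped like Robot Framework's output.xml;
# a real file carries many more elements and attributes.
SAMPLE = """\
<robot>
  <suite name="Initial Snapshot">
    <test name="Connect">
      <kw name="connect to device"><status status="PASS"/></kw>
      <status status="PASS"/>
    </test>
    <test name="Profile the devices">
      <status status="PASS"/>
    </test>
    <status status="PASS"/>
  </suite>
</robot>
"""


def summarize_tests(xml_text):
    """Return {test name: status} using each test's own <status> child."""
    root = ET.fromstring(xml_text)
    results = {}
    for test in root.iter("test"):
        status = test.find("status")  # direct child only, not keyword statuses
        results[test.get("name")] = status.get("status") if status is not None else "UNKNOWN"
    return results


results = summarize_tests(SAMPLE)
for name, status in results.items():
    print(f"{name}: {status}")
```

To try it on a real run, the sample string could be replaced with the contents of good/output.xml.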

The ‘good’ directory now stores everything about your network profile when things work fine. Let’s mess it up again, by connecting to the system and shutting down interface Loopback 1.

(pyats) /pyats/demos # cat default_testbed.yaml | grep -A 1 GigabitEthernet1:
(pyats) /pyats/demos # ssh cisco@172.16.30.129
csr1000v-1#conf t
Enter configuration commands, one per line. End with CNTL/Z.
csr1000v-1(config)#int lo 1
csr1000v-1(config-if)#shut
csr1000v-1(config-if)#exit
csr1000v-1(config)#exit
csr1000v-1#exit
Connection to 172.16.30.129 closed by remote host.
Connection to 172.16.30.129 closed.

After this terrible event it is time to profile the network again, but this time we will use the compare_snapshot.robot file to run a test case that differs slightly from the initial one. It includes one extra step: once it has connected to the devices and profiled them as before, it automatically compares the new snapshots with the old good ones.

Compare snapshots
    Compare profile "./good/good_snapshot" with "./fail/failed_snapshot" on devices "nx-osv-1;csr1000v-1"

Again, the language is simple and natural, which makes it easy to understand what the test case is supposed to do.

(pyats) /pyats/demos # robot -d fail compare_snapshot.robot

As you will see from the output, the first two steps work fine: it connects to the devices and profiles them. However, step 3 fails, indicating that something has changed from the previous ‘good’ situation. Further down, the log clearly states that the CSR interface has been shut down and is no longer operational, compared to the initial ‘good’ state.

Wow, that network issue was easy to debug!

Comparison between ./good/good_snapshot and ./fail/failed_snapshot is different for feature 'config' for device:

'csr1000v-1'
interface Loopback1
+ shutdown

**********
Comparison between ./good/good_snapshot and ./fail/failed_snapshot is different for feature 'interface' for device:

'csr1000v-1'
info:
Loopback1:
+ enabled: False
- enabled: True
+ oper_status: down
- oper_status: up

In summary, using Robot we have been able to define the desired profiling test case in very intuitive, natural language. The outcome is also very clear when debugging possible network issues, and Robot even offers HTML reporting that you can easily consume and share. Really awesome tool!

Learn more about Genie

If you want to learn more about how Genie network profiling can help you manage and debug issues in your network, please check this fantastic lab and also this one. Both offer you the option to run them on mocked devices, so you don’t actually need a reserved sandbox environment… how cool is that?

See you in a couple of weeks for our next set of pyATS demos, stay tuned! If you have any questions or comments, please let me know in the comments section below, or on Twitter or LinkedIn.