FOSDEM 2017: a view from the NOC
FOSDEM 2017 was again a great success. We did a bit less analysis than in 2016, but the numbers we have indicate that the number of visitors grew significantly: the total number of unique MAC addresses went from 9711 to a stunning 11918, an increase of 22.7%.
The number of mobile devices, a more accurate indication of the number of visitors, also went up. For Android, the number of unique MAC addresses went from 3892 to 4640 (+19.2%) and for iOS from 1060 to 2579 (+143.3%).
As in past years, we had an IPv6-only main network and a dual-stack legacy network for the people who needed it. The SSID of the dual-stack network was changed to encourage visitors to try the IPv6-only network. This seems to have worked, as the IPv6-only network was used more to connect to IPv4-only hosts than in the previous edition: this NAT64 traffic went from 6.1 million sessions in 2016 to 10.1 million in 2017 (+65%).
Traffic towards the internet rose from a mere 2982 million packets and 979.8 GB in 2016 to 7924 million packets and 9.321 TB in 2017 (+166% and +851%). From the internet, we received 2621 million packets and 2.912 TB in 2016, against 3620 million packets and 2.733 TB in 2017 (+38% and -6.14%).
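For readers who want to recompute the year-over-year changes from the raw figures, here is a minimal Python sketch. The helper name `pct_change` is ours, and we assume decimal units (9.321 TB taken as 9321 GB); the counters themselves are the ones quoted above.

```python
def pct_change(old: float, new: float) -> float:
    """Relative change from old to new, in percent."""
    return (new - old) / old * 100.0


# (2016 value, 2017 value) as quoted in this post.
# Assumption: decimal units, so 9.321 TB is entered as 9321 GB.
figures = {
    "outbound packets (millions)": (2982, 7924),
    "outbound volume (GB)": (979.8, 9321),
    "inbound packets (millions)": (2621, 3620),
    "inbound volume (GB)": (2912, 2733),
}

for name, (old, new) in figures.items():
    print(f"{name}: {pct_change(old, new):+.1f}%")
```

The same helper works for the MAC address and NAT64 session counts earlier in the post.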
Most of this increase in outgoing traffic was due to the amount of traffic the video team was pushing. They report: "The video team pushed ~288 GB over the internet to the primary restreamer, the same amount to the backup one, and 7.1 TB (sustained 300 Mbps) to the small monitoring/control host that generated the thumbnails used in the control of the video mixer. This probably makes us the biggest user of the internet connection 🙂."
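As a rough sanity check on the video team's figures, the sketch below derives how long a sustained 300 Mbps stream has to run to add up to 7.1 TB. The duration is not part of their report; it is computed here under the assumption of decimal units (1 TB = 10^12 bytes, 1 Mbps = 10^6 bit/s), and the function name is ours.

```python
def hours_at_rate(volume_tb: float, rate_mbps: float) -> float:
    """Hours needed to transfer volume_tb terabytes at rate_mbps megabits per second.

    Assumes decimal units: 1 TB = 1e12 bytes, 1 Mbps = 1e6 bit/s.
    """
    bits = volume_tb * 1e12 * 8
    seconds = bits / (rate_mbps * 1e6)
    return seconds / 3600


print(f"{hours_at_rate(7.1, 300):.1f} hours")
```

Under these assumptions the result is a little over two days of continuous streaming.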
In fact, they were pushing too much traffic. We had not planned for this increase in traffic and the switches we used for the last few years were reaching their limits. We noticed this when we got reports of packets getting dropped. First we checked the load on the switches:
    video-switch-1#show controllers utilization
    Port       Receive Utilization  Transmit Utilization
    Gi0/1              1                    1
    ...
    Gi0/25            12                   22
    Gi0/26            10                   16

    Total Ports : 26
    Switch Receive Bandwidth Percentage Utilization  : 1
    Switch Transmit Bandwidth Percentage Utilization : 2

    Switch Fabric Percentage Utilization : 1
This seemed normal, but when checking for drops we noticed the hard truth:
    video-switch-1#show mls qos interface statistics | i GigabitEthernet|queue|dropped
    ...
    GigabitEthernet0/26
      output queues enqueued:
     queue:    threshold1   threshold2   threshold3
     queue 0:           0            0            0
     queue 1:           0       150564       119978
     queue 2:           0            0            0
     queue 3:           0            0   1645256287
      output queues dropped:
     queue:    threshold1   threshold2   threshold3
     queue 0:           0            0            0
     queue 1:           0            0            0
     queue 2:           0            0            0
     queue 3:           0            0      7154647
Clearly we were dropping packets (about 0.43% of the packets on queue 3 of this port) because we ran out of buffers on some queues. We tried to fix the problem using flow control, but that was a mistake and it did not help. Changing the buffer allocation was not possible, as these switches are limited in their QoS features. In the end we were unable to fix this problem without risking an interruption of the traffic.
Our action point for next year is to design, configure and test a proper QoS architecture, and to replace the old switches, which have served us well for the last 8 years, with switches better suited to these higher amounts of traffic.
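The drop percentage can be recomputed from the counters with a small script. This is only a sketch: the parser is written against the layout of the output shown above rather than any official format description, the function names are ours, and we assume the percentage is simply dropped divided by enqueued per queue.

```python
import re

# Counters for GigabitEthernet0/26, copied from the switch output above.
STATS = """\
output queues enqueued:
queue: threshold1 threshold2 threshold3
queue 0: 0 0 0
queue 1: 0 150564 119978
queue 2: 0 0 0
queue 3: 0 0 1645256287
output queues dropped:
queue: threshold1 threshold2 threshold3
queue 0: 0 0 0
queue 1: 0 0 0
queue 2: 0 0 0
queue 3: 0 0 7154647
"""


def parse_queue_counters(text: str) -> dict:
    """Return {section: {queue_number: [threshold1, threshold2, threshold3]}}."""
    counters = {}
    section = None
    for raw in text.splitlines():
        line = raw.strip()
        header = re.fullmatch(r"output queues (\w+):", line)
        if header:
            section = header.group(1)  # "enqueued" or "dropped"
            counters[section] = {}
            continue
        # The "queue: threshold1 ..." title row has no queue number and is skipped.
        row = re.fullmatch(r"queue (\d+):((?:\s+\d+)+)", line)
        if row and section is not None:
            counters[section][int(row.group(1))] = [int(n) for n in row.group(2).split()]
    return counters


def drop_percentages(counters: dict) -> dict:
    """Per-queue drops as a percentage of enqueued packets (assumed definition).

    Only queues that actually dropped packets are reported.
    """
    result = {}
    for queue, enqueued_by_threshold in counters["enqueued"].items():
        enqueued = sum(enqueued_by_threshold)
        dropped = sum(counters["dropped"].get(queue, []))
        if dropped and enqueued:
            result[queue] = dropped / enqueued * 100
    return result


for queue, pct in drop_percentages(parse_queue_counters(STATS)).items():
    print(f"queue {queue}: {pct:.2f}% dropped")
```

Running something like this periodically against each uplink would flag buffer exhaustion even when the utilization counters still look healthy.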
This year we used a more general HTTP User-Agent analysis, so the client numbers are not directly comparable with last year's, but we detected the following client distribution:
I’m really hoping that these machines running Windows 95 and friends were virtual machines or emulations.
See you all next year, when we hope to be able to use telemetry instead of SNMP/NetFlow!