Bodies started dropping; packets clashed like tumbling dice. A scale-up of a sensor monitoring network became unstable when agent processes began losing their network connections for many seconds at a time. Latency should be close to zero, yet for reasons yet to be known, traffic congestion is affecting a (small) mesh of sensors.
Sensors are great when they just work. Like stool pigeons, what they give you must be weighed against the right backdrop and confirmed with other sources.
I’ve had sensors too close to CPUs where the heat bollixed the measurements, and lately I’ve looked at one that seems to report abruptly different values, like a liar that changes their story every time it’s repeated.
A worse scenario, and the subject of this case, is not just one bad apple but a whole flock of remote devices going in and out of view.
Because this is an open issue (an unsolved case; the tip line is open), I’m posting it as an article rather than a blog post, which I intend to write up if/when the mystery is solved.
(Also, I can post the article directly under the Open Source topic, where posts need approval.)
The open source under discussion is Zabbix, and I’ve “downsized” what I would do on a global network to lab (home) scale. Beneath the covers are an open source PostgreSQL database and the Zabbix server (the kernel level); what I’ll show are frontend web console views (the GUI).
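If you’d rather interrogate the witness without the GUI, here is a minimal sketch (Python, using the `requests` library) of pulling the agent-availability history over the frontend’s JSON-RPC API. The URL, the API token, the host name “b1”, and the internal item key `zabbix[host,agent,available]` are assumptions for my lab; adjust for yours, and note that token handling differs between Zabbix versions.

```python
# Sketch: pull the last hour of "Zabbix agent availability" values for one host
# over the frontend JSON-RPC API, instead of reading the GUI graph.
# Assumptions: frontend URL, API token, host name "b1", and that the host has
# the internal item zabbix[host,agent,available] collecting availability.
import time
import requests

ZABBIX_URL = "http://zabbix.lab.example/api_jsonrpc.php"  # hypothetical URL
API_TOKEN = "replace-with-your-api-token"                 # hypothetical token

def api(method, params):
    """One JSON-RPC 2.0 call. Newer Zabbix takes the token as a Bearer header;
    older versions want it in an "auth" field of the request body instead."""
    resp = requests.post(
        ZABBIX_URL,
        json={"jsonrpc": "2.0", "method": method, "params": params, "id": 1},
        headers={"Authorization": f"Bearer {API_TOKEN}"},
        timeout=10,
    )
    resp.raise_for_status()
    return resp.json()["result"]

host = api("host.get", {"filter": {"host": ["b1"]}, "output": ["hostid"]})[0]
item = api("item.get", {
    "hostids": host["hostid"],
    "filter": {"key_": "zabbix[host,agent,available]"},
    "output": ["itemid", "name"],
})[0]

now = int(time.time())
history = api("history.get", {
    "itemids": item["itemid"],
    "history": 3,                 # unsigned-int history table; adjust if needed
    "time_from": now - 3600,      # the last hour
    "time_till": now,
    "output": "extend",
    "sortfield": "clock",
})
for sample in history:
    stamp = time.strftime("%H:%M:%S", time.localtime(int(sample["clock"])))
    print(stamp, sample["value"])   # 1 = available, 2 = unavailable, 0 = unknown
```

I went through the API rather than straight SQL against PostgreSQL so the sketch doesn’t depend on the history table layout.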
A one-hour view of an agent shows drop-outs of over 50%! Yah, what is causing this? Put on the detective cap and plan the investigation. First, look back in the archives for priors.
The one-year view:
There are definitely significant stretches of both 100% and “way off” readings between 30 and 60%. I like the click-drag-to-zoom gadget the Zabbix frontend allows (though it doesn’t offer a pinch to zoom on Android tablets). So the next step is to focus on when the bad news began.
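To put a number on “when the bad news began”, here is a small sketch that buckets availability samples by day and flags the first dip. The samples are whatever you pulled from history (a value of 1 meaning available); the 95% threshold is my own arbitrary line in the sand.

```python
# Sketch: daily availability percentages from (unix_timestamp, value) samples,
# flagging the first day that falls below a threshold. Value 1 counts as
# available; anything else counts as a drop-out. The threshold is an assumption.
from collections import defaultdict
from datetime import datetime, timezone

def daily_availability(samples, threshold=0.95):
    buckets = defaultdict(lambda: [0, 0])            # day -> [available, total]
    for ts, value in samples:
        day = datetime.fromtimestamp(int(ts), tz=timezone.utc).date()
        buckets[day][0] += 1 if int(value) == 1 else 0
        buckets[day][1] += 1

    first_bad = None
    for day in sorted(buckets):
        available, total = buckets[day]
        pct = available / total
        flag = ""
        if pct < threshold and first_bad is None:
            first_bad = day
            flag = "  <-- first dip"
        print(f"{day}  {pct:6.1%}  ({available}/{total}){flag}")
    return first_bad

# Example feed: the history.get result from the earlier sketch.
# daily_availability((int(s["clock"]), s["value"]) for s in history)
```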
Now what? Check the logs for any installs or tweaks at the first “tickle”. If that is unfruitful, get out the Persian slipper and ponder a root cause analysis (a crude probe for the first suspect follows the list):
- network congestion (self-looping ddos?)
- network driver faults
- data reduction / prevention
- capacity plan mismanagement
- wrong tool for the task
- wi-fi / wired or other data collection protocols
- minimum configuration to duplicate
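As for the first suspect on that list, a crude congestion probe can run alongside Zabbix: the sketch below (standard-library Python) repeatedly opens a TCP connection to the agent’s passive port and logs the connect latency or failure. The host name and the default agent port 10050 are assumptions for my lab.

```python
# Sketch: poor man's congestion probe. Every few seconds, time a TCP connect to
# the Zabbix agent's passive port and log it; failures and latency spikes here
# point at the network path rather than the agent process. Stop with Ctrl-C.
import socket
import time

AGENT = ("b1", 10050)   # hypothetical agent host, default passive-agent port
INTERVAL = 5            # seconds between probes
TIMEOUT = 3             # seconds before a probe counts as a failure

def probe(addr, timeout=TIMEOUT):
    """Return connect latency in milliseconds, or None if the connect failed."""
    start = time.monotonic()
    try:
        with socket.create_connection(addr, timeout=timeout):
            return (time.monotonic() - start) * 1000.0
    except OSError:
        return None

while True:
    latency = probe(AGENT)
    stamp = time.strftime("%Y-%m-%d %H:%M:%S")
    if latency is None:
        print(f"{stamp}  {AGENT[0]}  FAILED (no connect within {TIMEOUT}s)")
    else:
        print(f"{stamp}  {AGENT[0]}  {latency:7.1f} ms")
    time.sleep(INTERVAL)
```

The agent will refuse actual item requests from an address that isn’t in its Server= directive, but the TCP handshake alone is enough to show whether the path itself is going dark.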
I did the minimum isolation technique with a BSD-licensed NetBSD 10 server on a 2-core AMD64 machine, added one agent, and soon duplicated (or rather, re-observed) the symptoms per the server logs:
Log snips:
20267:20240615:033911.080 Zabbix agent item "system.swap.size[,free]" on host "b1" failed: first network error, wait for 15 seconds
4857:20240615:033928.859 resuming Zabbix agent checks on host "b1": connection restored
15939:20240615:034021.875 Zabbix agent item "vm.memory.size[available]" on host "b1" failed: first network error, wait for 15 seconds
4857:20240615:034026.795 enabling Zabbix agent checks on host "a1": interface became available
4857:20240615:034033.494 Zabbix agent item "vm.memory.size[total]" on host "a1" failed: first network error, wait for 15 seconds
4857:20240615:034038.584 resuming Zabbix agent checks on host "b1": connection restored
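Before the next episode, one way to put numbers on snips like these: pair each “first network error” with the next “connection restored” (or “interface became available”) for the same host and print how long the drop-out lasted. A minimal sketch, assuming the stock server log path and the line format shown above:

```python
# Sketch: measure drop-out durations from the Zabbix server log by pairing
# "first network error" lines with the later "connection restored" /
# "interface became available" line for the same host.
# Line format assumed: PID:yyyymmdd:hhmmss.mmm message  (as in the snips above).
import re
from datetime import datetime

LOG = "/var/log/zabbix/zabbix_server.log"   # assumption; use your log path

line_re = re.compile(r"^\s*\d+:(\d{8}):(\d{6})\.\d+\s+(.*)$")
host_re = re.compile(r'host "([^"]+)"')

pending = {}   # host -> datetime when its first network error was logged

with open(LOG, errors="replace") as log:
    for raw in log:
        m = line_re.match(raw)
        if not m:
            continue
        when = datetime.strptime(m.group(1) + m.group(2), "%Y%m%d%H%M%S")
        message = m.group(3)
        h = host_re.search(message)
        if not h:
            continue
        host = h.group(1)
        if "first network error" in message:
            pending.setdefault(host, when)
        elif "connection restored" in message or "interface became available" in message:
            started = pending.pop(host, None)
            if started is not None:
                gap = (when - started).total_seconds()
                print(f"{host}: down ~{gap:.0f}s  ({started} -> {when})")
```

Run against just the snips above, it would report host "b1" down for roughly 17 seconds, twice; run against the full log, it should show whether the drop-outs cluster.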
<to be continued>