Scope of article
In a well-managed IT infrastructure, network monitoring acts as the eyes and ears that spot problems before they turn into outages. System administrators need complete visibility into their critical IT components, such as servers, applications and networks.
Monitoring tools can detect a server crash, a failing application or, in some cases, heavily utilized network bandwidth. This article covers a few important aspects of network monitoring. It also compares three leading tools that are of great importance to IT administrators.
Features of Network Monitoring Tool
A network monitoring tool is usually hosted on a standalone server and runs client software (an agent) on each machine to be managed or monitored. The tool usually runs its own database, such as MySQL or PostgreSQL, which stores its scripts, historical events and actions. Some modern tools do not require an agent on the managed machines, which makes for an agent-less installation. The table below lists features, with examples, that a network monitoring tool should provide by default.
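As a rough illustration of the agent-less approach, the monitoring server can probe a service from the outside instead of running client software on the managed machine. The sketch below assumes that a successful TCP connection is an acceptable liveness signal. The function name `tcp_port_open` is illustrative and does not come from any particular tool.

```python
import socket


def tcp_port_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        # create_connection resolves the host and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution failures.
        return False
```

A scheduler on the monitoring server could call this function periodically for each managed host, for example against a database port, without installing anything on that host.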
|Application||App-specific log check; app run/hung state; query optimization and response|
|Operating system||Task performance; OS service run/hung state|
|Web||URL query string monitoring; web services response; web service up/down; disk quota utilization|
|Network||Packet drops and anomalies|
When it comes to monitoring large-scale IT infrastructures, system administrators need more advanced features to make their lives easier. The following is a list of a few important ones.
Auto discovery – Adding each managed device manually is cumbersome for administrators. Modern monitoring tools scan the entire network segment to enumerate devices and automatically discover their operating system, configuration and settings. This feature gives admins a quick view of their IT inventory.
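The enumeration step of auto discovery can be sketched as follows. This is a minimal example, assuming the segment is given in CIDR notation and that some probe (a ping, a TCP connect or an SNMP query) decides whether an address is alive. The names `discover_hosts` and `is_alive` are hypothetical, and real tools go further by fingerprinting the operating system and configuration.

```python
import ipaddress
from typing import Callable, List


def discover_hosts(cidr: str, is_alive: Callable[[str], bool]) -> List[str]:
    """Return every usable address in the segment for which the probe succeeds."""
    network = ipaddress.ip_network(cidr, strict=False)
    # hosts() skips the network and broadcast addresses of an IPv4 subnet.
    return [str(ip) for ip in network.hosts() if is_alive(str(ip))]
```

Passing the probe in as a function keeps the enumeration logic separate from the probing method, so the same loop works whichever protocol is used to reach the devices.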
Network traffic stats – Earlier monitoring tools looked only at CPU, memory and disk utilization. That is not enough: network bandwidth usage is a key metric, especially when the managed machines need to access the internet. Monitoring network traffic also gives admins insight into how much of the internet service provider’s line is in use, which helps with capacity planning decisions.
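Bandwidth usage is typically derived from two readings of a cumulative byte counter taken some seconds apart. The sketch below assumes such a counter (for example the SNMP `ifInOctets` value of an interface) and a known link speed. The function name and parameters are illustrative, and the wrap handling assumes the counter rolled over at most once between samples.

```python
def bandwidth_utilization(octets_start: int, octets_end: int,
                          interval_s: float, link_bps: float,
                          counter_bits: int = 32) -> float:
    """Return link utilization in percent between two counter samples."""
    if interval_s <= 0 or link_bps <= 0:
        raise ValueError("interval_s and link_bps must be positive")
    delta = octets_end - octets_start
    if delta < 0:
        # The counter wrapped around its maximum value between the samples.
        delta += 2 ** counter_bits
    bits_per_second = delta * 8 / interval_s
    return 100.0 * bits_per_second / link_bps
```

Storing the result for each polling interval gives the history needed for the capacity planning decision mentioned above.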
Log monitoring – All operating systems create activity logs. On Linux, for example, SSH logs and Bash logs are created, while Windows generates application, system and security event logs. A good tool must be capable of reading and parsing log files. Though this sounds easy, it can be tricky because the operating system may hold log files open and lock them, so the tool has to read the file without tampering with or corrupting it. The monitoring tool should be able to check log file size, parse text for a particular string pattern and so on, and then perform configured actions. This gives admins a lot of power to tune their infrastructure monitoring for better control.
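The size check and pattern search described above can be sketched as below. This is an illustrative example, not how any specific tool does it: it assumes a UTF-8 text log that the monitoring process is allowed to read, opens the file read-only so the contents are never modified, and remembers a byte offset so that each pass only looks at new lines. The name `scan_log` and its parameters are hypothetical.

```python
import os
import re
from typing import List, Optional, Tuple


def scan_log(path: str, pattern: str, offset: int = 0,
             max_bytes: Optional[int] = None) -> Tuple[List[str], int, bool]:
    """Scan a log file read-only, starting at byte `offset`.

    Returns (matching_lines, new_offset, oversized).
    """
    size = os.path.getsize(path)
    if offset > size:
        # The file shrank, so it was probably rotated or truncated: start over.
        offset = 0
    regex = re.compile(pattern)
    matches: List[str] = []
    # Binary read-only mode: nothing is written, and byte offsets stay exact.
    with open(path, "rb") as handle:
        handle.seek(offset)
        for raw in handle:
            line = raw.decode("utf-8", errors="replace").rstrip("\r\n")
            if regex.search(line):
                matches.append(line)
        new_offset = handle.tell()
    oversized = max_bytes is not None and size > max_bytes
    return matches, new_offset, oversized
```

The caller keeps `new_offset` between runs and passes it back in, then triggers the configured action when matches are found or when `oversized` is true.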
Device grouping – Grouping makes it easier to manage devices such as firewalls and servers. In some cases administrators create department-wise groups, or a group for each building floor, and populate those groups with the network switches, servers and desktops belonging to that department or floor. In a growing infrastructure this feature is very important.
Alert management – Monitoring alone is not enough; a good tool should let admins generate alerts. For example, if the CPU utilization of a critical server crosses 90%, or if a firewall drops multiple packets in a row, the tool should create a trouble ticket and send an email, or optionally a short message to the admin’s mobile phone. Almost all tools provide such facilities today, which enhances their usefulness; however, admins should examine how configurable the alert management is, and what facilities it offers, before selecting a tool.
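A threshold rule of this kind can be sketched as follows. The example assumes that an alert should fire once when a metric stays above its threshold for a configured number of samples in a row, which is one common way to avoid alerting on a single spike. `ThresholdRule` and `dispatch` are hypothetical names, and the notifiers stand in for whatever email, SMS or ticketing channel the tool offers.

```python
from dataclasses import dataclass, field
from typing import Callable, Iterable


@dataclass
class ThresholdRule:
    metric: str
    threshold: float
    consecutive: int = 3
    _streak: int = field(default=0, init=False, repr=False)

    def observe(self, value: float) -> bool:
        """Record one sample; return True only when the streak reaches the limit."""
        if value > self.threshold:
            self._streak += 1
        else:
            # Any sample at or below the threshold resets the streak.
            self._streak = 0
        return self._streak == self.consecutive


def dispatch(message: str, notifiers: Iterable[Callable[[str], None]]) -> None:
    """Send the same alert text through every configured channel."""
    for notify in notifiers:
        notify(message)
```

With `ThresholdRule("cpu_percent", 90.0, consecutive=3)`, three samples above 90 in a row make `observe` return true once, at which point the caller can pass a message to `dispatch`.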
Customizable web dashboard – A good monitoring tool should let admins access its statistics over a web interface. Furthermore, the web interface must be customizable so that admins can decide what appears on the dashboard front page. Modern tools provide widgets, small screen sections or windows that show monitoring statistics of the admin’s choice and can be moved or removed.
Integrating with helpdesk – Recording the events generated by threshold violations is very important and should be an automated process. The monitoring tool should provide the necessary hooks or connectors so that a trouble ticket or helpdesk system can be connected easily. Monitoring events and the applicable actions should result in a trouble ticket. This shows how much manpower is spent addressing those events, so that intelligent decisions can be made based on that data.
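One possible shape for such a hook is sketched below. It assumes the helpdesk system exposes some way to create a ticket and get back its identifier, represented here by a callable supplied by the integrator. `TicketBridge` and its methods are hypothetical and do not correspond to the API of any real helpdesk product. The sketch also assumes that repeated events for the same device and problem should reuse the open ticket rather than create duplicates.

```python
from typing import Callable, Dict, Tuple


class TicketBridge:
    """Turns monitoring events into helpdesk tickets, one per open problem."""

    def __init__(self, create_ticket: Callable[[str, str, str], str]) -> None:
        self._create_ticket = create_ticket
        self._open: Dict[Tuple[str, str], str] = {}

    def on_event(self, device: str, event: str, detail: str) -> str:
        """Return the ticket id for this problem, creating a ticket if needed."""
        key = (device, event)
        if key not in self._open:
            self._open[key] = self._create_ticket(device, event, detail)
        return self._open[key]

    def on_recovery(self, device: str, event: str) -> None:
        """Forget the problem so that a later recurrence opens a new ticket."""
        self._open.pop((device, event), None)
```

Because every ticket traces back to a monitoring event, the helpdesk data can later be used to estimate the effort spent on each kind of event.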
Report generation – All monitoring tools today provide some level of report generation based on date, time and so on. However, detailed reports, such as device-specific or event-specific ones, are really essential for an admin. For example, a report generator should be able to drill down into a particular event, such as a TCP timeout on a particular server, and provide the historical occurrences of that event for that server. This level of granular detail helps administrators establish a correlation between an event and its root cause.
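The drill-down described above amounts to filtering the stored event history by device and event type, then summarizing it over time. The sketch below assumes events are available as simple records. The `Event` fields and both function names are illustrative, not the schema of any particular tool.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import datetime
from typing import Dict, Iterable, List


@dataclass(frozen=True)
class Event:
    timestamp: datetime
    device: str
    kind: str


def drill_down(events: Iterable[Event], device: str, kind: str) -> List[Event]:
    """Return the history of one event type on one device, oldest first."""
    selected = [e for e in events if e.device == device and e.kind == kind]
    return sorted(selected, key=lambda e: e.timestamp)


def occurrences_per_day(events: Iterable[Event]) -> Dict[str, int]:
    """Count events per calendar day, keyed by ISO date string."""
    return dict(Counter(e.timestamp.date().isoformat() for e in events))
```

Running `occurrences_per_day` on the output of `drill_down` shows how often, say, a TCP timeout occurred on a given server each day, which is the kind of detail that helps link an event to its root cause.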
The following newer features are a trend found mostly in commercial monitoring tools; however, the open source world will surely catch up with them in the days to come.
Plug-in API support – While a few open source tools do provide this, there is still scope for improvement. API calls of the monitoring engine can be exposed in a secure way so that developers can write their own plug-ins. This is especially important when a new network device or software application that must be monitored comes onto the market.
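To make the idea concrete, here is a minimal sketch of a plug-in registry. It borrows the status codes used by Nagios plug-ins (0 for OK, 1 for WARNING, 2 for CRITICAL, 3 for UNKNOWN), but everything else is an assumption for illustration: real Nagios plug-ins are separate executables, whereas this sketch registers Python functions in-process. The names `plugin`, `run_check` and `check_queue_depth` are hypothetical.

```python
from typing import Callable, Dict, Tuple

# Status codes following the Nagios plug-in convention.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3

CheckResult = Tuple[int, str]
_PLUGINS: Dict[str, Callable[..., CheckResult]] = {}


def plugin(name: str):
    """Decorator that registers a check function under the given name."""
    def register(func: Callable[..., CheckResult]):
        _PLUGINS[name] = func
        return func
    return register


def run_check(name: str, **params) -> CheckResult:
    """Run a registered plug-in; any failure is reported as UNKNOWN."""
    func = _PLUGINS.get(name)
    if func is None:
        return UNKNOWN, f"no plugin named {name!r}"
    try:
        return func(**params)
    except Exception as exc:
        # A broken plug-in must not take the monitoring engine down with it.
        return UNKNOWN, f"plugin {name!r} failed: {exc}"


@plugin("queue_depth")
def check_queue_depth(depth: int, warn: int = 100, crit: int = 500) -> CheckResult:
    """Example third-party check for a hypothetical message queue."""
    if depth >= crit:
        return CRITICAL, f"queue depth {depth} >= {crit}"
    if depth >= warn:
        return WARNING, f"queue depth {depth} >= {warn}"
    return OK, f"queue depth {depth}"
```

The engine only needs to know the plug-in name and parameters, so support for a new device or application can be added without changing the engine itself.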
Trend analysis – The network and server monitoring industry is rapidly moving from a reactive mode to a proactive mode. Administrators want to know the historical trends of problems and judge what corrective actions to take today to prevent problems that might happen tomorrow. For example, continuously high CPU utilization on a MySQL server over a period may indicate, in a trend analysis, that one or more stored procedures are either not optimized or misbehaving. This can be traced to the application that uses those procedures. Thus, if that application is expected to see more usage, trend analysis can warn admins that the MySQL server is going to run into trouble.
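A very simple form of trend analysis is to fit a straight line through historical samples and estimate when it will cross a limit. The sketch below uses ordinary least squares and assumes the metric grows roughly linearly, which is a strong simplification that real tools may replace with more robust models. The function names and the daily sampling are assumptions made for the example.

```python
from typing import Optional, Sequence, Tuple


def linear_trend(xs: Sequence[float], ys: Sequence[float]) -> Tuple[float, float]:
    """Return (slope, intercept) of the least-squares line through the points."""
    n = len(xs)
    if n < 2 or n != len(ys):
        raise ValueError("need at least two points and equal-length inputs")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all x values are identical")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def days_until_threshold(days: Sequence[float], values: Sequence[float],
                         threshold: float) -> Optional[float]:
    """Estimate days after the last sample until the trend reaches threshold.

    Returns None when the trend is flat or falling.
    """
    slope, intercept = linear_trend(days, values)
    if slope <= 0:
        return None
    crossing_day = (threshold - intercept) / slope
    return max(0.0, crossing_day - days[-1])
```

For daily CPU averages of 50, 55, 60, 65 and 70 percent on days 0 to 4, the fitted line rises 5 points per day, so a 90 percent limit would be reached about 4 days after the last sample.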
Security monitoring – Very soon, no monitoring tool will be useful unless it supports cyber security monitoring. Attacks at Layers 2 and 3, as well as application-based security problems at Layer 7, should be trapped and reported by a good monitoring tool. This functionality is available in a few commercial tools; however, combining Snort with Nagios or any other monitoring tool can prove to be a powerful security monitoring solution.
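As a small illustration of security-oriented monitoring, the sketch below counts failed SSH logins per source address and flags likely brute-force attempts. Note that this is a log-based heuristic, not the Snort integration mentioned above, and it only covers one kind of application-level problem. It assumes the usual OpenSSH "Failed password for ... from ... port ..." log message; the function name and the default limit of five failures are arbitrary choices for the example.

```python
import re
from collections import Counter
from typing import Dict, Iterable

# Matches OpenSSH failure lines, with or without the "invalid user" prefix.
FAILED_LOGIN = re.compile(
    r"Failed password for (?:invalid user )?\S+ from (\S+) port \d+"
)


def suspicious_sources(lines: Iterable[str], min_failures: int = 5) -> Dict[str, int]:
    """Return source addresses with at least min_failures failed logins."""
    counts: Counter = Counter()
    for line in lines:
        match = FAILED_LOGIN.search(line)
        if match:
            counts[match.group(1)] += 1
    return {ip: n for ip, n in counts.items() if n >= min_failures}
```

The addresses returned could then be raised as events in the monitoring tool, in the same way as any other threshold violation.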
Let’s look at three well-known open source monitoring tools and compare them: Nagios, Zenoss and Zabbix. While there are many features that could be compared, we will discuss only those that matter most for managing a mid-scale IT infrastructure.
Nagios – This is a well-known first-generation network monitoring tool that is widely used across Linux distros. Developed in C and PHP, it supports multiple open source backend databases as well as a legacy flat-file structure.
Zenoss – Written in Python, Zenoss provides a highly flexible monitoring platform for mid-scale and large-scale infrastructures. It surpasses Nagios in a few cases, especially when it comes to alert management.
Zabbix – This is truly an enterprise-class open source tool. Written in C and PHP, it provides very elaborate dashboards that give administrators a detailed drill-down.
Comparison of Nagios, Zenoss, Zabbix
|Feature||Nagios||Zenoss||Zabbix|
|Basic features (CPU, disk, memory)||Yes||Yes||Yes|
|Google Maps View||No||No||Yes|
|User friendly configuration||Yes||Partial||Partial|
|Performance and Reliability||Medium||High||Low|
|Plug-in API support||Partial||Yes||Yes|
While it is tough to decide which tool is best for monitoring, here are a few guidelines. Administrators should first look at their infrastructure from an uptime perspective and decide what really needs to be monitored, rather than checking everything that could possibly be monitored. This focused approach is important because it is easy to get distracted by the many features available in each tool. Hence the basic monitoring mentioned above should be the first item on the agenda. As a second step, admins should look at the applications to be monitored and decide whether custom scripting is needed to achieve what they need from a monitoring standpoint. The third step should be to focus on reporting and trend analysis, because as the infrastructure grows it becomes essential to have a historical record of its problems. The last but important step is to see whether security monitoring is a requirement in the given scenario. If it is, then it is crucial to decide how much additional scripting and log generation will be required. The generated logs can then be captured by the monitoring tool and reported as incidents.
Nagios, Zenoss and Zabbix are all industry-grade, professional tools with a large installed base. It has been observed that Nagios and Zenoss perform very well on the Ubuntu platform, while Zabbix runs great on other distros. Zenoss is somewhat unique among the three tools compared, because it offers more features, interacts well with multiple databases and other tools, and has proved itself to be a robust solution even for high-performance, large-scale IT infrastructures. Besides these three, there are tools such as Cacti, OpenNMS and Cricket, which I leave for readers to explore on the net. It is always better to compare an open source tool with a commercial one before deciding on the required features.