It has been a while since I wrote my last blog post, where I promised to write about whether the AccelOps monitoring solution is a ‘Jack of all trades, and master of none’.
So here I am, on Mother’s Day, sitting in front of the computer writing a geek’s blog. A workaholic mom and entrepreneur? I guess both are true. To me, building a startup is like raising kids: it requires 120% attention, hard work and commitment; there is no other choice.
Before I answer the ‘Jack of all trades and master of none’ question, let me try to start with the requirement for datacenter and network management in 2010, by quoting Evelyn Hubbert, Forrester Research:
The element-based network management era is over. Today, network management teams need to manage and understand network-related issues across silos such as servers, storage, security, databases, and applications. They need to manage complex and dynamic IP networks to connect customers, vendors, and employees. Forrester sees the traditional network management space becoming service-oriented: The attention is on the service that is being delivered to the business. Innovations such as IT automation and Web services management techniques, and best practices such as ITIL, have changed the network management market and will continue to shape it in the years to come.
How true! Data center and network management cannot be at the element level anymore. That used to be the case in the 80’s when I worked at AT&T Bell Labs, and in the 90’s when I was working in the Network Management Business Unit at Cisco; but not in 2010. Times have changed, the datacenter infrastructure has evolved and so have its requirements. Managing the business services that the datacenter infrastructure and elements deliver is now the key.
In order to meet these requirements, two fundamental pieces in a management solution have to be done well first: CMDB and mapping infrastructure impact to business services.
CMDB is a great concept and it is a cornerstone of the new management paradigm: managing by services. But we often see mid-size and large enterprises embark on the process without making much progress. This is because they often start with a top-down approach: cross-functional teams with Excel files trying to map out the organization, the ownership of the infrastructure, and the dependencies of the business services. The process is so heavy that its cost outweighs the benefits and defeats the original intent.
As Evelyn Hubbert from Forrester Research sees it:
A CMDB is a fundamental component of an ITIL framework. The CMDB records Configuration Items (CIs) and the details about the important relationship between CIs. A CI is an instance of an entity that has configurable attributes – for example, a computer, a process, or an employee. A key success factor in implementing a CMDB is the ability to automatically discover information about the CIs – Autodiscovery
In complete agreement with her, we believe that the bottom-up approach via auto-discovery is the right way to go: automatically discover what is in the datacenter including the network, map the applications to the infrastructure, and map out their relationships. Gaining visibility is key. Once the IT organization has the map, it can easily start defining the business services’ relationships to the infrastructure on top of it. This discovery-driven CMDB approach not only makes it easy to populate the CMDB for the first time, it also helps keep the CMDB up to date. Rediscover periodically, or rediscover upon changes, and you are done!
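The rediscover-and-update loop can be pictured as a small reconciliation step. Here is a minimal sketch in Python, assuming the CMDB is nothing more than a dictionary keyed by a CI identifier. The `reconcile` function, the CI names and the attribute fields are all hypothetical illustrations, not the AccelOps data model.

```python
# Hypothetical sketch: fold a fresh discovery snapshot into a CMDB.
# Both the CMDB and the snapshot are plain dicts: CI id -> attribute dict.

def reconcile(cmdb, discovered):
    """Update cmdb in place and report what changed since the last run."""
    changes = {"added": [], "changed": [], "missing": []}
    for ci_id, attrs in discovered.items():
        if ci_id not in cmdb:
            changes["added"].append(ci_id)
        elif cmdb[ci_id] != attrs:
            changes["changed"].append(ci_id)
        cmdb[ci_id] = dict(attrs)
    for ci_id in list(cmdb):
        if ci_id not in discovered:
            # Flag the CI instead of deleting it; it may just be unreachable.
            changes["missing"].append(ci_id)
    return changes


# Made-up example data: one switch upgraded, one server new, one server gone.
cmdb = {
    "sw-01": {"type": "switch", "os": "12.2"},
    "web-01": {"type": "server", "runs_on": "esx-01"},
}
snapshot = {
    "sw-01": {"type": "switch", "os": "12.4"},
    "db-01": {"type": "server", "runs_on": "esx-02"},
}
changes = reconcile(cmdb, snapshot)
print(changes)
```

Flagging a missing CI rather than removing it is a design choice of this sketch: a device that did not answer one discovery run is not necessarily gone.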
Many years of experience working in the network and security management field have taught me that the discovery process has to be very easy for the user; otherwise it defeats the purpose again.
With that philosophy, we have built something that requires very little from the user to get to the final goal quickly: simply define the credentials for the devices and applications, define the appropriate network range(s), and the tool takes over from there. The AccelOps discovery engine discovers all the pieces of the datacenter infrastructure, their attributes and inter-relationships, and how they relate to and impact the critical business services and applications. It discovers the configuration inside the devices, the installed and running software, the patches… It discovers L2/L3 relationships, Guest OS to ESX relationships, Wireless AP to Controller relationships, switch module to switch relationships… It understands changes: differences between saved and running configurations, between saved configurations, ports going up and down, applications going up and down… It categorizes devices and applications and presents them in a logical, easy-to-understand graphical way. Doing all of the above requires an understanding of networks, systems, applications and storage. In a large and complex datacenter environment this is a non-trivial job, as there are so many network scenarios and so many combinations of network configurations.
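As one small illustration of "understanding changes", the difference between a saved and a running configuration can be computed with a plain text diff. This is only a sketch using Python's standard `difflib` module. The `config_changes` helper and the sample configuration lines are hypothetical, and a real discovery engine would need to understand device-specific syntax rather than compare raw text.

```python
import difflib


def config_changes(saved, running):
    """Return only the removed ('-') and added ('+') lines between two configs."""
    diff = list(difflib.unified_diff(
        saved.splitlines(), running.splitlines(),
        fromfile="saved", tofile="running", lineterm="", n=0,
    ))
    # The first two lines of a non-empty diff are the '---'/'+++' file headers.
    body = diff[2:]
    return [line for line in body if line.startswith(("+", "-"))]


# Made-up configuration snippets for illustration only.
saved = "hostname core-sw\ninterface Gi0/1\n shutdown\n"
running = "hostname core-sw\ninterface Gi0/1\n no shutdown\n"

for line in config_changes(saved, running):
    print(line)
```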
That does not sound like something a ‘jack of all trades and master of none’ would be able to pull off, does it? It requires deep understanding and domain knowledge in network, systems and application management.
Now let’s get to the second fundamental in today’s datacenter and network management: mapping infrastructure impact to business services. Here I would like to use examples to show how the requirement of managing by business services cannot tolerate a ‘jack of all trades and master of none’.
In order to be able to manage by business services and map the infrastructure’s impact, a solution must be able to do the following, as a minimum:
(1) Define a problem, an exception or a vulnerability involving any datacenter infrastructure component and detect the issue in real-time. Here are a few datacenter scenarios:
Example 1: Service health critical
For the same hostIP, if
average CPU utilization > 90% or (average memory utilization > 98% and paging rate > ) or (disk I/O utilization > ) or max interface utilization > 50% for 3 consecutive samples within 10 minutes
then generate an incident (alert)
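Example 1 can be sketched in a few lines of Python. This is an illustration under assumptions, not the AccelOps rule engine: the `Sample` record and both function names are hypothetical, and because the rule above does not state the paging rate and disk I/O thresholds, the sketch takes them as parameters and the demo values passed in are made-up placeholders.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    ts: float       # sample time, seconds
    cpu: float      # average CPU utilization, percent
    mem: float      # average memory utilization, percent
    paging: float   # paging rate (unit not given in the rule)
    disk_io: float  # disk I/O utilization
    max_if: float   # max interface utilization, percent


def breaches(s, paging_limit, disk_io_limit):
    """The OR-ed conditions of Example 1 for a single sample."""
    return (s.cpu > 90
            or (s.mem > 98 and s.paging > paging_limit)
            or s.disk_io > disk_io_limit
            or s.max_if > 50)


def service_health_critical(samples, paging_limit, disk_io_limit,
                            needed=3, window=600):
    """samples: one host's samples in time order.
    True if `needed` consecutive breaching samples fall within `window` seconds."""
    run = []
    for s in samples:
        if breaches(s, paging_limit, disk_io_limit):
            run = (run + [s])[-needed:]
            if len(run) == needed and run[-1].ts - run[0].ts <= window:
                return True
        else:
            run = []  # the streak of consecutive samples is broken
    return False


# Placeholder thresholds: NOT from the article, chosen only to run the demo.
PAGING_LIMIT = 100.0
DISK_IO_LIMIT = 80.0

hot = [Sample(ts, 95, 40, 0, 10, 5) for ts in (0, 180, 360)]
cooled = [hot[0], Sample(180, 20, 40, 0, 10, 5), hot[2]]

print(service_health_critical(hot, PAGING_LIMIT, DISK_IO_LIMIT))
print(service_health_critical(cooled, PAGING_LIMIT, DISK_IO_LIMIT))
```

The "for the same hostIP" part of the rule is handled by calling the function once per host with that host's samples.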
Example 2: Excessive vMotion migration
For the same VMName, if
3 or more VM-Hot-Migration events or VM-Migration events in a 15 minute window,
then generate an incident (alert)
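Example 2 is a count over a sliding time window, grouped by VM name. A minimal sketch, assuming events arrive in time order as `(timestamp, vm_name, event_type)` tuples. The tuple layout and the `excessive_migrations` name are hypothetical; the two event type names are taken from the rule above.

```python
from collections import defaultdict, deque

MIGRATION_EVENTS = {"VM-Hot-Migration", "VM-Migration"}


def excessive_migrations(events, threshold=3, window=900):
    """events: iterable of (ts_seconds, vm_name, event_type) in time order.
    Returns the set of VM names with `threshold` or more migrations
    inside any `window`-second span."""
    recent = defaultdict(deque)  # vm_name -> timestamps still inside the window
    flagged = set()
    for ts, vm, etype in events:
        if etype not in MIGRATION_EVENTS:
            continue
        q = recent[vm]
        q.append(ts)
        while ts - q[0] > window:
            q.popleft()  # drop migrations that fell out of the window
        if len(q) >= threshold:
            flagged.add(vm)
    return flagged


# Made-up events: vm-a migrates three times in 10 minutes, vm-b is spread out.
events = [
    (0, "vm-a", "VM-Hot-Migration"),
    (0, "vm-b", "VM-Migration"),
    (300, "vm-a", "VM-Migration"),
    (600, "vm-a", "VM-Hot-Migration"),
    (1000, "vm-b", "VM-Migration"),
    (2000, "vm-b", "VM-Migration"),
]
print(excessive_migrations(events))
```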
Example 3: Excessive End user DNS queries to unauthorized DNS servers
For the same srcIP, if
TCP/UDP port = 53, destination IP is not in the internal DNS server group, source IP is not in the management applications group and not in the internal DNS server group, source IP is from the inside, and this happens 10 times in a 5-minute window,
then generate an incident (alert).
Note that internal DNS server group and internal management application group are populated from auto-discovery.
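Example 3 combines a filter on each flow with a per-source count. The sketch below is an assumption-laden illustration: the flow record layout, the function names, the addresses, and the definition of "inside" as `10.0.0.0/8` are all hypothetical. In the scenario above the two groups would be filled in by auto-discovery; here they are hard-coded sets.

```python
import ipaddress
from collections import defaultdict, deque

# Assumed values for illustration; in the article these groups come from discovery.
INTERNAL_NETS = [ipaddress.ip_network("10.0.0.0/8")]
INTERNAL_DNS = {"10.0.0.53"}
MGMT_APPS = {"10.0.0.10"}


def is_inside(ip):
    addr = ipaddress.ip_address(ip)
    return any(addr in net for net in INTERNAL_NETS)


def matches(flow):
    """The per-flow conditions of Example 3 (TCP and UDP are treated alike)."""
    src, dst = flow["src"], flow["dst"]
    return (flow["dport"] == 53
            and dst not in INTERNAL_DNS
            and src not in MGMT_APPS
            and src not in INTERNAL_DNS
            and is_inside(src))


def rogue_dns_sources(flows, threshold=10, window=300):
    """flows: dicts with ts, src, dst, dport, in time order.
    Returns source IPs with `threshold` matching flows within `window` seconds."""
    recent = defaultdict(deque)
    flagged = set()
    for flow in flows:
        if not matches(flow):
            continue
        q = recent[flow["src"]]
        q.append(flow["ts"])
        while flow["ts"] - q[0] > window:
            q.popleft()
        if len(q) >= threshold:
            flagged.add(flow["src"])
    return flagged


# Made-up traffic: an end user and the internal DNS server both query an
# outside resolver (203.0.113.53 is a documentation address).
flows = [{"ts": t * 10, "src": "10.1.1.5", "dst": "203.0.113.53", "dport": 53}
         for t in range(10)]
flows += [{"ts": t * 10, "src": "10.0.0.53", "dst": "203.0.113.53", "dport": 53}
          for t in range(10)]
flows.sort(key=lambda f: f["ts"])
print(rogue_dns_sources(flows))
```

Only the end user is flagged; the internal DNS server is allowed to talk to outside resolvers because it is a member of the discovered group.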
Example 4: User added as admin in the accounting application. Provide the identity of the user.
If a VPN login event is followed by a Windows server login event, followed by a user-added-to-global-admin-group event, within a 15-minute window, and the following conditions are met
VPN login source IP = Windows server login source IP and
Windows server login user = user add event’s user and
Windows server login id = user add event’s logon id and
reporting IP in accounting server group
then generate an incident (alert)
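Example 4 is a sequence rule: three different event types must occur in order and be joined on shared attributes. A minimal sketch follows, with loud caveats. The event dictionaries, field names, type labels and the `admin_added_via_vpn` function are hypothetical, and the sketch assumes the "reporting IP" in the rule is the one on the user-add event. The accounting server group would come from discovery; here it is a hard-coded set.

```python
# Assumed group membership for illustration only.
ACCOUNTING_SERVERS = {"10.2.0.20"}


def admin_added_via_vpn(events, window=900):
    """events: list of dicts in time order, each with at least 'ts' and 'type'.
    Returns (vpn_user, windows_user, reporting_ip) for every matching
    VPN login -> Windows login -> admin-group-add chain within `window` seconds."""
    hits = []
    for i, vpn in enumerate(events):
        if vpn["type"] != "vpn_login":
            continue
        for j in range(i + 1, len(events)):
            login = events[j]
            if login["ts"] - vpn["ts"] > window:
                break
            if login["type"] != "win_login" or login["src_ip"] != vpn["src_ip"]:
                continue
            for add in events[j + 1:]:
                if add["ts"] - vpn["ts"] > window:
                    break
                if (add["type"] == "admin_group_add"
                        and add["user"] == login["user"]
                        and add["logon_id"] == login["logon_id"]
                        and add["reporting_ip"] in ACCOUNTING_SERVERS):
                    # The VPN user is the identity the rule asks us to provide.
                    hits.append((vpn["user"], login["user"], add["reporting_ip"]))
    return hits


# Made-up event chain for illustration.
events = [
    {"ts": 0, "type": "vpn_login", "src_ip": "172.16.5.9", "user": "jdoe"},
    {"ts": 120, "type": "win_login", "src_ip": "172.16.5.9",
     "user": "CORP\\jdoe", "logon_id": "0x3e7", "reporting_ip": "10.2.0.20"},
    {"ts": 300, "type": "admin_group_add", "user": "CORP\\jdoe",
     "logon_id": "0x3e7", "reporting_ip": "10.2.0.20"},
]
print(admin_added_via_vpn(events))
```

The nested loops are quadratic in the worst case, which is fine for a sketch; a production engine would index events by the join attributes instead.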
(2) Define what makes up a business service, and define how any of the problems from step (1) relates to that business service. Here is an example of one such scenario:
Example. If there are any incidents for the objects/components in a business service (devices, applications, users, etc.), generate an incident for that business service. (This requires nested rule support, i.e. a second-level rule that fires based on first-level rules.)
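The nested rule idea can be sketched as a second pass over the incidents produced by the first-level rules. In this Python illustration the service names, component names and the `service_incidents` function are hypothetical; the service membership is assumed to have been defined on top of the discovered map.

```python
# Hypothetical business service definitions: service -> member components.
BUSINESS_SERVICES = {
    "accounting": {"acct-db-01", "acct-web-01", "core-sw-01"},
    "email": {"mail-01", "core-sw-01"},
}


def service_incidents(first_level_incidents, services=BUSINESS_SERVICES):
    """first_level_incidents: list of (component, rule_name) pairs.
    Returns service -> list of the first-level incidents that impact it."""
    impacted = {}
    for component, rule in first_level_incidents:
        for service, members in services.items():
            if component in members:
                impacted.setdefault(service, []).append((component, rule))
    return impacted


# Made-up first-level incidents, named after the examples in step (1).
incidents = [
    ("core-sw-01", "Service health critical"),
    ("mail-01", "Excessive vMotion migration"),
]
for service, causes in service_incidents(incidents).items():
    print(service, causes)
```

A shared component such as the core switch raises an incident on every service that depends on it, which is exactly the impact mapping the second level of rules is meant to provide.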
(3) All the definitions can be easily entered by the user via the GUI, so users can define these scenarios and behaviors based on their own IT knowledge without waiting for the software vendor to come up with a new upgrade to support them.
So now you can see: without a good understanding of network, systems, applications, VMs, storage and security, without the capability to describe that understanding, and without the capability to monitor and detect exceptions, anomalies and problems based on it, there is no way to meet even the basic requirement of managing by business service.
So today’s datacenter and network management sets a much higher bar for management solutions. The existing siloed management solutions cannot cut it, simply because they do not have a common analytical framework to handle all the data from the disparate parts of the datacenter infrastructure. This, however, is what AccelOps does, via a deep-and-wide understanding of the datacenter.