How to Prepare for the Coming Age of Dynamic Infrastructure

Infrastructure 2.0 Journal



Part 1 | Understanding the Impact of IT on Business

It Takes More than Advanced Correlation

Part 1 of a two-part series looking at the journey enterprise IT departments take as they increasingly seek to understand the relationships and impact of IT infrastructure performance on application performance and business services.

As a product manager at Netuitive, I'm often put in a position to explain how my product works. This question usually refers not just to the nuts and bolts of the technology, but also to the more specific question: "How do I make it work?" To get to the heart of the answer, you need to understand the underpinnings of today's monitoring solutions and why most of them don't represent a complete solution.

To help illustrate this, I'll look at the problem from the perspective of Fred, an operations manager for "Acmecorp." Fred is responsible for keeping Acmecorp's key e-commerce platform, BuyThis, up and performing under stringent 24x7 SLAs.

Acmecorp is a rapidly growing mid-sized enterprise utilizing a variety of conventional infrastructure monitoring tools. In recent days, Acmecorp has been experiencing unexplainable slow performance and downtime. Fred has been tasked with identifying, reporting and correcting the problem.

For Fred, this will be a six-step journey in which he leads the transformation of Acmecorp's old, uninstrumented infrastructure into a new proactive monitoring environment, including the procedural changes necessary to maintain and take advantage of it.

Stage 1: Fred needs data
Fred's goal is to get a handle on the slow performance at BuyThis as quickly as possible. The continual fire drills are eating into the time Fred has to be proactive at work, see his family, sleep, etc. Due to the fast growth at Acmecorp, BuyThis has grown without any organized attempt at monitoring. He realizes that getting data with which to make decisions is his first goal.

Fred decides to make use of the metrics BuyThis's developers are already generating on the state of the platform, along with customer experience, OS Performance and database performance data from commercial packages. Fred's assorted tools each give an individual team (sys admins, DBAs, etc.) a view into its portion of the problem, but he still can't find the forest for the trees. Fred has a flood of data, but nothing that tells him definitively when action is needed.

Stage 2: Thresholds...or Fred draws a line in the sand
Fred is ready to take things to the next logical step and put together thresholds for the metrics he's collected so that he can be alerted when the thresholds are crossed. He begins an analysis to work out what the right thresholds are. Although he doesn't know all the right values, he has access to all the right people. Fred decides to divide and conquer. He distributes lists of OS metrics to his sys admins, database metrics to DBAs, and BuyThis custom metrics to the developers. Fred puts up his feet, confident that the right answers will be flowing back to him shortly.

The actual results vary pretty strongly. The OS team responds, "It depends," and proceeds to explain all the regular variation in the week's schedule for BuyThis: unreasonable behavior at 9 p.m. is very different from unreasonable behavior at 3 a.m. or 9 a.m. The DB team is equally cagey: "Depends on the cluster, the patch level, the load, the balance of inserts and updates vs. reads. And then there are logs and backups..." Only the developers are straightforward: "We don't really know. We just thought that stuff might be interesting. Is that package still enabled?"

Fred takes his best guess given the feedback he's received and learns pretty quickly that his thresholds get violated...a lot. Even as he thinks about adjusting them, he stops. He realizes that even if he somehow managed to get the values right today, he still wouldn't be accounting for cyclic variation, growth in demand, or updated releases by every team. He needs a more automated approach.
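To make Fred's problem concrete, here is a minimal sketch of a single fixed threshold applied to a metric with a daily cycle. The load curve, the 60% threshold, and the function names are all hypothetical, chosen only for illustration; they are not taken from any real monitoring product.

```python
import math

def hourly_cpu(hour):
    """Hypothetical CPU load (%) with a daily cycle that peaks at 21:00."""
    return 50 + 30 * math.cos(2 * math.pi * (hour - 21) / 24)

def static_alerts(samples, threshold):
    """Return the hours whose value exceeds one fixed threshold."""
    return [hour for hour, value in samples if value > threshold]

# One day of perfectly normal, purely cyclic load (assumed data).
samples = [(h, hourly_cpu(h)) for h in range(24)]

# A best-guess threshold of 60% (assumed value).
alerts = static_alerts(samples, threshold=60.0)
print(len(alerts), "of 24 hours raise an alert:", alerts)
```

Nothing is wrong with this imaginary system, yet the fixed threshold fires throughout the evening peak. Raising the threshold would quiet the evening but would also hide genuinely odd behavior at 3 a.m., which is the trade-off the OS team was describing.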

Stage 3: Baselines ... or better living through statistics
Fred does a little research and discovers that what he really needs is some statistical analysis to figure out what counts as normal for each metric, rather than asking his team. He finds a product (and there are several) that will baseline each metric, and then alert him when one of his metrics crosses its baseline. His solution is already way better than trying to figure out the right settings himself, and he gives himself a pat on the back.
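As a rough illustration of what baselining means, the sketch below learns a separate "normal" band for each hour of the day from past samples and flags values outside it. It assumes a simple mean plus or minus k standard deviations per hour; commercial baselining products may use quite different statistics, and the function names and sample data here are hypothetical.

```python
from collections import defaultdict
from statistics import mean, stdev

def build_baseline(history, k=3.0):
    """Learn a (low, high) band per hour of day from (hour, value) samples."""
    by_hour = defaultdict(list)
    for hour, value in history:
        by_hour[hour].append(value)
    bands = {}
    for hour, values in by_hour.items():
        if len(values) < 2:
            continue  # stdev needs at least two data points
        mu, sigma = mean(values), stdev(values)
        bands[hour] = (mu - k * sigma, mu + k * sigma)
    return bands

def is_abnormal(bands, hour, value):
    """True when the value falls outside the band learned for that hour."""
    if hour not in bands:
        return False  # no baseline yet, so no alert
    low, high = bands[hour]
    return value < low or value > high

# Assumed history: busy at 9 p.m., quiet at 3 a.m.
history = [(21, v) for v in (78, 80, 82, 79, 81)]
history += [(3, v) for v in (19, 20, 21, 20, 20)]
bands = build_baseline(history)

print(is_abnormal(bands, 21, 80))  # 80% at 9 p.m. is within the learned band
print(is_abnormal(bands, 3, 80))   # 80% at 3 a.m. is far outside it
```

The same reading of 80% is treated as normal in the evening and abnormal overnight, which is exactly the context a single hand-picked threshold cannot express.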

In fact, he's almost ready to declare victory when he realizes that even though the hurricane of alerts is over, he still definitely has a storm. There are so many alerts that he can't reasonably pass them to his already strained Level 1 responders. Instead, Fred digs deeper into his bag of tricks to find a new approach to learning what is "abnormal."

View Part 2 here.

More Stories By Marcus Jackson

Marcus is Director of Product Management at Netuitive. He is responsible for the direction of Netuitive's flagship product, including analytics and data visualization. He has over 20 years of experience in software engineering and performance management. Previously, he headed development for Netuitive and the IEEE Computer Society. Marcus holds a bachelor's degree in Computer Science from Harvard University.
