IT is one of the few teams where crisis management is a large part of the job. Because IT issues can have serious consequences and are often time-sensitive, it’s not uncommon for a team member to be on-call in case something happens.
In many cases, this team member isn’t a dedicated after-hours resource; IT staff often share the burden of on-call duties in addition to their normal work schedule. Beyond any extra pay on-call employees may receive, this means that incident management can carry a high human cost: being woken up by alerts or having to rearrange personal plans to resolve an issue is a difficult thing to ask of anyone. Burnout and turnover are real considerations.
Automation offers a smarter approach to being on-call. From initially triaging issues and contacting the right expert to preventing incidents before they even happen, here are two key ways automation can help your on-call staff work smarter—not harder.
Contacting the Right IT Experts at the Right Time
According to PagerDuty, a leading incident notification platform, many teams cast a wide net when alerting staff to incidents after hours, sending email alerts to the entire team and relying on someone to volunteer to resolve the issue.
This approach has obvious problems. Most notably, more seasoned team members may wind up handling a disproportionate number of incidents. On the flip side, newer hires are deprived of important learning opportunities. Both outcomes can easily lead to burnout and job dissatisfaction.
Even when there’s a meticulously planned on-call schedule, however, it can still be difficult to triage and assign incidents appropriately and quickly. Hans Gustavson, the senior site reliability engineering director at Coupa, faced exactly this issue. He needed to streamline communication between apps and reduce menial tasks for the engineers.
“We’re responsible for performance and making sure the site is up and available, as well as how we manage and interact with the platform and services,” says Gustavson. “It’s important to allow the engineers to focus on triage and resolution of issues instead of going back and forth between different tools to create tickets.”
Using Workato, he created an automated incident management workflow that helps keep the correct people looped in. Whenever an engineer acknowledges a new issue in VictorOps, Workato automatically creates a new JIRA incident ticket. Workato also opens a new HipChat room for that incident and automatically invites everyone who is on-call into the HipChat room.
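To make the shape of that workflow concrete, here is a minimal Python sketch of the three steps described above: create a ticket, open a chat room, and invite the on-call engineers. It is not Coupa's recipe and calls no real Workato, VictorOps, JIRA, or HipChat API. Every name in it (Incident, create_ticket, open_chat_room, on_alert_acknowledged, the INC- key format) is a hypothetical stand-in chosen for illustration.

```python
from dataclasses import dataclass, field
from itertools import count


@dataclass
class Incident:
    """Everything the workflow attaches to one acknowledged alert."""
    alert_id: str
    summary: str
    ticket_key: str = ""
    chat_room: str = ""
    invited: list = field(default_factory=list)


# Hypothetical ticket numbering; a real ticketing system assigns its own keys.
_ticket_numbers = count(1)


def create_ticket(summary):
    """Stand-in for a ticketing call; returns a made-up ticket key."""
    return f"INC-{next(_ticket_numbers)}"


def open_chat_room(ticket_key):
    """Stand-in for a chat call; derives a room name from the ticket key."""
    return f"incident-{ticket_key.lower()}"


def on_alert_acknowledged(alert_id, summary, on_call):
    """Run the three steps in order when an engineer acknowledges an alert."""
    incident = Incident(alert_id=alert_id, summary=summary)
    incident.ticket_key = create_ticket(summary)
    incident.chat_room = open_chat_room(incident.ticket_key)
    incident.invited = list(on_call)  # everyone currently on call is invited
    return incident
```

Calling on_alert_acknowledged("alert-42", "Checkout latency spike", ["ana", "raj"]) returns one record that links the alert, the ticket, and the chat room, which is the property that spares engineers from hopping between tools.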
Once the engineers start working on the issue, another set of Workato recipes keeps JIRA and their status dashboard in Cachet in sync. Workato watches the issue in JIRA and triggers when the status changes, updating it in Cachet. “Essentially, the entire scenario facilitates communication of the alert status and helps people stay on top of what’s happening,” Gustavson says.
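The sync step boils down to a status mapping plus a change check. The sketch below assumes a hypothetical mapping from ticket statuses to dashboard statuses and uses a plain dictionary in place of the status page; the status names and the sync_status function are illustrative assumptions, not the statuses Coupa uses.

```python
# Hypothetical mapping from ticket workflow states to status-page states.
STATUS_MAP = {
    "Open": "Investigating",
    "In Progress": "Identified",
    "In Review": "Watching",
    "Done": "Fixed",
}


def sync_status(ticket_status, dashboard, incident_id):
    """Mirror a ticket status onto the dashboard.

    Returns True only when the dashboard entry actually changed, so the
    caller can skip redundant updates and notifications.
    """
    new_status = STATUS_MAP.get(ticket_status)
    if new_status is None:
        return False  # unmapped status: leave the dashboard untouched
    if dashboard.get(incident_id) == new_status:
        return False  # already in sync
    dashboard[incident_id] = new_status
    return True
```

Returning a flag instead of always writing keeps the sync idempotent: running it twice for the same status change updates the dashboard once.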
Building Proactive IT Processes with Emerging Technology
One of the most frustrating aspects of being on-call is encountering similar issues over and over again. It’s also resource-intensive for your IT team to address the same problems from scratch every time.
New technology has a lot to offer high-stakes processes like incident management. It can decrease resolution times and provide insight into other KPIs. But it can also make it less likely that the on-call team has to face the same problems repeatedly.
Sanket Naik, the VP of Cloud Infrastructure and Security at Coupa, says that machine learning can make incident management more intelligent. “[With a strong pool of data for ML to pull from], you can predict crashes before they happen. You can identify the triggers—such as storage running out—and remediate the issue before the crash, without waking someone up in the middle of the night. [With ML, you] won’t crash the same way twice.”
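The storage example can be illustrated with something far simpler than machine learning: a straight-line trend fitted to recent disk usage. The sketch below is a plain least-squares extrapolation, not the ML approach Naik describes, and the function name, sample format (hour, used gigabytes), and capacity figure are all assumptions made for illustration.

```python
def hours_until_full(samples, capacity_gb):
    """Estimate hours from the last sample until the disk reaches capacity.

    samples is a list of (hour, used_gb) pairs. Returns None when there is
    too little data or usage is not growing.
    """
    n = len(samples)
    if n < 2:
        return None
    mean_t = sum(t for t, _ in samples) / n
    mean_u = sum(u for _, u in samples) / n
    var_t = sum((t - mean_t) ** 2 for t, _ in samples)
    if var_t == 0:
        return None  # all samples share one timestamp
    # Least-squares slope: growth in GB per hour.
    slope = sum((t - mean_t) * (u - mean_u) for t, u in samples) / var_t
    if slope <= 0:
        return None  # flat or shrinking usage never fills the disk
    intercept = mean_u - slope * mean_t
    hour_full = (capacity_gb - intercept) / slope
    return max(0.0, hour_full - samples[-1][0])


# Usage growing 2 GB per hour from 50 GB on a 100 GB volume.
readings = [(0, 50), (1, 52), (2, 54), (3, 56)]
remaining = hours_until_full(readings, capacity_gb=100)
```

A remediation job could compare the estimate against a lead-time threshold and free space or expand the volume during working hours, rather than paging someone once the disk is already full.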
In order to be truly helpful, however, advanced analytics tools and ML require a robust, up-to-date data pool. Creating and maintaining that pool of data can be daunting because it can entail a lot of manual data entry—and that’s all before you can run any calculations!
Automation can help move your data into the appropriate tools seamlessly, without any extra work. In addition to their automated incident triage and tracking workflow, for example, Coupa uses Workato to enable better analytics. On an hourly basis, a recipe pulls issues from JIRA into a Google Sheet for analysis; a similar recipe aggregates alerts from VictorOps. This way, Gustavson’s team can pick up on evolving incident patterns without doing any manual data entry.
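Once the issues land in a sheet, spotting repeat offenders can be as simple as counting. The sketch below assumes each exported row carries hypothetical "component" and "cause" columns; those column names and the recurring_patterns function are illustrative, since the article does not describe the actual layout of Coupa's sheet.

```python
from collections import Counter


def recurring_patterns(incidents, min_count=2):
    """Return (component, cause) pairs seen at least min_count times,
    most frequent first."""
    counts = Counter((row["component"], row["cause"]) for row in incidents)
    return [(key, n) for key, n in counts.most_common() if n >= min_count]


# Hypothetical rows, as they might look after an hourly export.
exported_rows = [
    {"component": "db", "cause": "disk full"},
    {"component": "api", "cause": "timeout"},
    {"component": "db", "cause": "disk full"},
    {"component": "db", "cause": "disk full"},
    {"component": "api", "cause": "timeout"},
    {"component": "cache", "cause": "eviction storm"},
]
patterns = recurring_patterns(exported_rows)
```

Pairs that keep reappearing are the natural candidates for a permanent fix or an automated remediation.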
Ultimately, automation is the future of incident management. By automating tedious tasks and processes, you can make being on-call less draining and prevent burnout. Joni Klippert, the VP of Product at VictorOps, is confident that automation will soon be a mainstay of every on-call team. “It’s meeting humans where they want to do work,” she commented. “Even analysts are really into it, which means it’ll be ubiquitous in no time!”