IT is one of the few teams where crisis management is a large part of the job. Because IT issues can have serious consequences and are often time-sensitive, it’s not uncommon for a team member to be on-call in case something happens.
In many cases, this team member isn’t a dedicated after-hours resource; IT staff often share on-call duties on top of their normal work schedule. Beyond any extra pay on-call employees may receive, this means that incident management can carry a high human cost: being woken up by alerts or having to rearrange personal plans to resolve an issue is difficult to ask of anyone. Burnout and turnover are real considerations.
Automation offers a smarter approach to being on-call. From initially triaging issues and contacting the right expert to preventing incidents before they even happen, here are three key ways automation can help your on-call staff work smarter, not harder.
Contacting the Right IT Experts at the Right Time
According to PagerDuty, a leading incident notification platform, many teams cast a wide net when alerting staff of incidents after-hours, sending email alerts to the entire team and relying on someone to volunteer to resolve the issue.
This approach has obvious problems. Most notably, more seasoned team members may wind up handling a disproportionate number of incidents. On the flip side, newer hires are deprived of important learning opportunities. Both outcomes can easily lead to burnout and job dissatisfaction.
Even when there’s a meticulously planned on-call schedule, however, it can still be difficult to triage and assign incidents appropriately and quickly. Hans Gustavson, the senior site reliability engineering director at Coupa, faced exactly this issue. He needed to streamline communication between apps and decrease menial tasks for the engineers.
“We’re responsible for performance and making sure the site is up and available, as well as how we manage and interact with the platform and services,” says Gustavson. “It’s important to allow the engineers to focus on triage and resolution of issues instead of going back and forth between different tools to create tickets.”
Using Workato, he created an automated incident management workflow that helps keep the correct people looped in. Whenever an engineer acknowledges a new issue in VictorOps, Workato automatically creates a new Jira incident ticket. Workato also opens a new HipChat room for that incident and automatically invites everyone who is on-call into the HipChat room.
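To make the sequence concrete, here is a minimal Python sketch of the same flow: an acknowledged alert produces a ticket, a chat room, and invitations for everyone on call. It is not Workato's recipe format or any vendor's API; `FakeTracker`, `FakeChat`, and `handle_acknowledged_alert` are hypothetical in-memory stand-ins invented for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class IncidentWorkspace:
    """Everything the workflow creates for one acknowledged alert."""
    ticket_key: str
    room_name: str
    invited: list = field(default_factory=list)


class FakeTracker:
    """Hypothetical stand-in for an issue tracker such as Jira."""

    def __init__(self):
        self.tickets = []

    def create_ticket(self, summary):
        key = f"INC-{len(self.tickets) + 1}"
        self.tickets.append((key, summary))
        return key


class FakeChat:
    """Hypothetical stand-in for a chat tool such as HipChat."""

    def __init__(self):
        self.rooms = {}

    def open_room(self, name):
        self.rooms[name] = []
        return name

    def invite(self, room, user):
        self.rooms[room].append(user)


def handle_acknowledged_alert(alert, on_call, tracker, chat):
    """Run once an engineer acknowledges an alert: ticket, room, invites."""
    ticket_key = tracker.create_ticket(alert["summary"])
    room = chat.open_room(f"incident-{ticket_key.lower()}")
    for user in on_call:
        chat.invite(room, user)
    return IncidentWorkspace(ticket_key, room, list(on_call))


tracker, chat = FakeTracker(), FakeChat()
workspace = handle_acknowledged_alert(
    {"summary": "Checkout latency above threshold"},
    on_call=["ana", "raj"],
    tracker=tracker,
    chat=chat,
)
print(workspace)
```

The point of the design is that the acknowledgement is the only manual step; everything after it is derived from the alert and the on-call list.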
Once the engineers start working on the issue, another set of Workato recipes keeps Jira and their status dashboard in Cachet in sync. Workato watches the issue in Jira and triggers when the status changes, updating it in Cachet. “Essentially, the entire scenario facilitates communication of the alert status and helps people stay on top of what’s happening,” Gustavson says.
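A status sync like this boils down to a mapping plus a rule for when to write. The sketch below is illustrative only: the workflow state names on the left and the mapping itself are assumptions, not Coupa's configuration, and the status page is modeled as a plain dictionary rather than a real Cachet client.

```python
# Hypothetical mapping from tracker workflow states to status-page states.
STATUS_MAP = {
    "Open": "Investigating",
    "In Progress": "Identified",
    "In Review": "Watching",
    "Done": "Fixed",
}


def sync_status(change, status_page):
    """Apply one tracker status change; return True if the page was updated."""
    new_state = STATUS_MAP.get(change["to"])
    if new_state is None:
        return False  # unmapped workflow states are ignored
    if status_page.get(change["issue"]) == new_state:
        return False  # already in sync, nothing to write
    status_page[change["issue"]] = new_state
    return True


status_page = {}
changes = [
    {"issue": "INC-1", "to": "Open"},
    {"issue": "INC-1", "to": "In Progress"},
    {"issue": "INC-1", "to": "Blocked"},  # not in the mapping, skipped
    {"issue": "INC-1", "to": "Done"},
]
results = [sync_status(change, status_page) for change in changes]
print(results, status_page)
```

Skipping unmapped and unchanged states keeps the public dashboard from flickering on internal workflow transitions.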
Getting Your IT Apps in Sync to Speed Up Time-Sensitive Work
One of the most difficult challenges for any IT team is that they use many best-of-breed apps. Each app is useful in its own way, but working with so many disjointed programs can slow down time-critical processes like bug fixes. Not only does your team have to manually copy information from app to app, but processes will grind to a halt if information isn’t up to date across all relevant systems.
Even the most tech-forward IT teams aren’t immune to this problem. Nathan Underwood, an automation engineer at CARFAX, explains that using a multitude of tools—including Jira and ServiceNow—made the development team’s resolution workflows difficult to track and manage.
“We wanted to improve the experience by removing all those extra steps and making it easy for the dev team to see what they need to do, actually do the work, and get approval for changes all from one place,” he explains.
To solve this problem, Underwood created a Workato automation that alerts engineers of issues that need their attention. “When a ServiceNow incident is created or assigned to one of the dev teams that works out of Jira, it triggers a Workato recipe that automatically creates an issue on their Jira project,” Underwood explains.
This eliminates the need for engineers to log into ServiceNow, see what incidents have been assigned to them, and manually move them into Jira for project tracking. Now they know in real time when an incident has been assigned to them, and they never need to leave Jira.
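The routing rule Underwood describes can be sketched in a few lines of Python. Everything named here is an assumption for illustration: the `TEAM_PROJECTS` table, the field names on the incident, and the in-memory `created_issues` record stand in for whatever the real ServiceNow and Jira connectors provide. Because the trigger fires on both creation and assignment, the sketch also guards against creating the same issue twice.

```python
# Hypothetical table of dev teams that work out of the issue tracker.
TEAM_PROJECTS = {"payments-dev": "PAY", "search-dev": "SRCH"}


def route_incident(incident, created_issues):
    """Create a tracker issue for an incident unless one already exists."""
    project = TEAM_PROJECTS.get(incident["assignment_group"])
    if project is None:
        return None  # this team does not work out of the tracker
    if incident["number"] in created_issues:
        return created_issues[incident["number"]]  # already mirrored
    issue = {
        "project": project,
        "summary": incident["short_description"],
        "source": incident["number"],
    }
    created_issues[incident["number"]] = issue
    return issue


created_issues = {}
incident = {
    "number": "INC0012345",
    "assignment_group": "payments-dev",
    "short_description": "Refund job failing",
}
first = route_incident(incident, created_issues)
second = route_incident(incident, created_issues)  # fired again on reassignment
ignored = route_incident(
    {"number": "INC0012346", "assignment_group": "helpdesk",
     "short_description": "Password reset"},
    created_issues,
)
print(first, second is first, ignored)
```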
Underwood also used automation to make change management approvals smoother. Now, instead of having to leave Jira to submit a change request, engineers can make the request directly from the app. Workato will automatically create the change request in ServiceNow.
“We’re an agile shop,” Underwood says, “so our developers want to update products very quickly to address emerging needs. Any time they can reclaim to put towards new things is very valuable to them and to the business itself!”
Building Proactive IT Processes with Emerging Technology
One of the most frustrating aspects of being on-call is encountering similar issues over and over again. It’s also resource-intensive for your IT team to address the same problems from scratch every time.
New technology has a lot to offer high-stakes processes like incident management. It can decrease resolution times and provide insight into other KPIs. But it can also make it less likely that the on-call team has to handle the same problem twice.
Sanket Naik, the VP of Cloud Infrastructure and Security at Coupa, says that machine learning can make incident management more intelligent. “[With a strong pool of data for ML to pull from], you can predict crashes before they happen. You can identify the triggers—such as storage running out—and remediate the issue before the crash, without waking someone up in the middle of the night. [With ML, you] won’t crash the same way twice.”
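Naik's storage example can be illustrated with something far simpler than machine learning: a straight-line fit over recent disk usage. The sketch below is that simpler stand-in, not the approach Coupa uses; the function name, sample data, and 12-hour threshold are all assumptions chosen for illustration.

```python
def hours_until_full(samples, capacity_gb):
    """Estimate hours from the last sample until the disk is full.

    samples is a list of (hour, used_gb) pairs. Uses an ordinary
    least-squares line; returns None if usage is flat or shrinking.
    """
    n = len(samples)
    if n < 2:
        return None
    mean_t = sum(t for t, _ in samples) / n
    mean_u = sum(u for _, u in samples) / n
    var_t = sum((t - mean_t) ** 2 for t, _ in samples)
    if var_t == 0:
        return None
    slope = sum((t - mean_t) * (u - mean_u) for t, u in samples) / var_t
    if slope <= 0:
        return None
    intercept = mean_u - slope * mean_t
    full_at = (capacity_gb - intercept) / slope
    return max(0.0, full_at - samples[-1][0])


usage = [(0, 400), (1, 410), (2, 420), (3, 430)]  # grows 10 GB per hour
remaining = hours_until_full(usage, capacity_gb=500)
needs_cleanup = remaining is not None and remaining < 12  # assumed threshold
print(remaining, needs_cleanup)
```

With a projection like this, a cleanup or expansion job can run during working hours instead of paging someone when the disk actually fills.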
In order to be truly helpful, however, advanced analytics tools and ML require a robust, up-to-date data pool. Creating and maintaining that pool of data can be daunting because it can entail a lot of manual data entry—and that’s all before you can run any calculations!
Automation can help seamlessly move your data into the appropriate tools, without any extra work. In addition to their automated incident triage and tracking workflow, for example, Coupa uses Workato to enable better analytics. On an hourly basis, a recipe pulls issues from Jira into a Google Sheet for analysis; a similar recipe aggregates alerts from VictorOps. This way, Gustavson’s team can pick up on evolving incident patterns without doing any manual data entry.
Ultimately, automation is the future of incident management. By automating tedious tasks and processes, you can make being on-call less draining and prevent burnout. Joni Klippert, the VP of Product at VictorOps, is confident that automation will soon be a mainstay of every on-call team. “It’s meeting humans where they want to do work,” she commented. “Even analysts are really into it, which means it’ll be ubiquitous in no time!”
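Once the issues land in one place, spotting a pattern can be as simple as counting. The following Python sketch groups exported issues by component and cause and writes the result as CSV rows, the kind of table a spreadsheet could ingest. The `component` and `cause` fields and the sample records are hypothetical; the source does not say which fields Coupa's recipe exports.

```python
import csv
import io
from collections import Counter


def summarize(issues):
    """Count issues per (component, cause), most frequent first."""
    counts = Counter((i["component"], i["cause"]) for i in issues)
    return [[component, cause, n]
            for (component, cause), n in counts.most_common()]


issues = [
    {"key": "INC-1", "component": "db", "cause": "disk full"},
    {"key": "INC-2", "component": "api", "cause": "timeout"},
    {"key": "INC-3", "component": "db", "cause": "disk full"},
]
rows = summarize(issues)

# Write the summary as CSV text, ready to load into a spreadsheet.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["component", "cause", "count"])
writer.writerows(rows)
print(buffer.getvalue())
```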