Adam Maksimuk, Senior Incident Response Engineer at Auth0, explains how the bot they built with Tines makes documenting and keeping track of incidents much easier.
Managing incidents is a common task for an engineer on any security team. Quite often those tasks are the same for every incident and can be automated to help resolve the situation as quickly as possible. After working to identify tasks and steps that could be automated to help our incident process, we started working on a technical solution.
Auth0 relies on Slack for internal communication, so naturally this is where we typically run internal incidents. We built IR Bot in Tines within a very short period of development time to help us interact with the incident channel.
IR Bot was created as a way to streamline operations and help relieve some of the stress of running the incident.
Some commands are limited to a subset of engineers on the incident response (IR) team. We’ll cover some of the features of this implementation throughout this blog post.
Incident Creation & Initial Channel Setup
An IR engineer can start up a new incident with the Slack command /irbot incident.
The following occurs behind the scenes via API calls when this command is executed:
A Confluence page for the incident is created and our major incident template is applied to it
A Slack channel for the incident is created and named after the current date in the format IR-YYYYMMDD. For example, if the date was Sept 20th 2020, the channel created would be IR-20200920
All engineers in our IR team are then invited by the bot to the channel
A new case is created in Hive which we use for IR case management
The details of the case, its reference number/URL etc are added to the Confluence page
The Slack channel topic is created, which includes the Confluence page URL and the URL to the Hive case.
Lastly, the Hive case URL is then posted to the channel as well
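As a small illustration of the channel naming step, here is a minimal Python sketch of the IR-YYYYMMDD format. The function name incident_channel_name is hypothetical and is not part of IR Bot or Tines.

```python
from datetime import date


def incident_channel_name(day: date) -> str:
    """Build a channel name in the IR-YYYYMMDD format described above.

    Hypothetical helper for illustration only.
    """
    return f"IR-{day:%Y%m%d}"


if __name__ == "__main__":
    print(incident_channel_name(date(2020, 9, 20)))
```

Zero-padding the month and day keeps channel names sortable by date.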
Once the incident channel has been created, the IR engineer can then assign incident roles that will be communicated to all participants.
Our IR process is split between two primary roles, Incident Commander (IC) and Incident Scribe (IS):
The IC is responsible for running the incident and handing out tasks that need to be completed to various parties
The IS performs documentation and tracking duties. They document key events in the incident timeline, fill out the incident report live as the incident progresses and take over the role of IC if the IC needs to temporarily step away
We assign the roles with these two Slack commands:
/irbot ic @<slack username>
/irbot scribe @<slack username>
Here’s what it looks like in the channel when a new IC is set. In this example there was a previous IC, which shows how the channel can be notified of a handover. If the channel is new and there is no IC, then the bot would report “TBD” as the previous IC.
The scribe is set the same way, and the same type of Slack notification is sent to the channel.
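The handover behaviour can be sketched roughly as follows. This is an assumption-laden illustration: the IncidentRoles class and the message wording are hypothetical, and the only behaviour taken from the post is that a missing previous holder is reported as "TBD".

```python
from typing import Dict, Optional


class IncidentRoles:
    """Tracks who holds the IC and IS roles (hypothetical sketch)."""

    def __init__(self) -> None:
        self.holders: Dict[str, Optional[str]] = {"IC": None, "IS": None}

    def assign(self, role: str, user: str) -> str:
        """Assign a role and return the handover text for the channel."""
        if role not in self.holders:
            raise ValueError(f"unknown role: {role}")
        # A new channel has no previous holder, so report "TBD" instead.
        previous = self.holders[role] or "TBD"
        self.holders[role] = user
        return f"{role} is now {user} (previously {previous})"


if __name__ == "__main__":
    roles = IncidentRoles()
    print(roles.assign("IC", "@alice"))
    print(roles.assign("IC", "@bob"))
```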
Any user can then enter the /irbot who command to see the current IC and IS. A similar message is also sent automatically when a new user joins the channel. These messages are only displayed to the user who requests them, helping keep the channel focused on the incident.
Example output of the who command:
Inviting Users
After the incident is initially set up, inviting users from various teams so that they may assist with the incident takes place on a need-to-know basis. When a user is invited to the channel, the bot can take one of two actions:
If the user was invited by a member of the channel, then the bot will announce this information to the channel
However, if the inviter was NOT part of the IR team, IR bot will also send the inviter a direct message (DM) notifying them that membership of the channel is on a need-to-know basis and they should check with the IC before inviting any further users
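The two cases above amount to a simple rule: always announce the invite, and additionally DM the inviter when they are not on the IR team. A minimal sketch, assuming a hypothetical on_user_invited handler that returns (destination, text) pairs rather than calling Slack directly; the message wording is also invented for illustration:

```python
from typing import List, Set, Tuple

# Hypothetical wording; the real DM text is not given in this post.
NEED_TO_KNOW_NOTICE = (
    "Membership of this channel is on a need-to-know basis. "
    "Please check with the IC before inviting further users."
)


def on_user_invited(
    inviter: str, invitee: str, ir_team: Set[str], channel: str
) -> List[Tuple[str, str]]:
    """Return the messages the bot would send as (destination, text) pairs."""
    messages = [
        (channel, f"{inviter} invited {invitee} to assist with the incident.")
    ]
    if inviter not in ir_team:
        # Inviters outside the IR team also get a direct message.
        messages.append((inviter, NEED_TO_KNOW_NOTICE))
    return messages


if __name__ == "__main__":
    team = {"@alice", "@bob"}
    for destination, text in on_user_invited("@dave", "@erin", team, "#IR-20200920"):
        print(destination, "->", text)
```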
Here is a sample of what this would look like in the channel.
Members can be invited through either of these methods:
Using the invite command
@’ing the user then clicking the invite command
Sometimes, however, you may need a resource to assist with the incident but only know the department or team that could help. For these situations we have created the search command, which the IC can kick off via:
/irbot search
A dialog box will pop up with a drop-down of all the departments/teams in the company, and the IC can click Search once a department selection is made:
The output will be only visible to the person conducting the search, showing names, titles, pictures of every member of the relevant team, followed by an invite button. The IC can then click the invite button and the bot will invite that user to the channel. The channel will then be notified that a user was invited to the channel to assist with the incident.
User DMs During Invites
Each user who joins a channel to assist with the incident gets a DM that goes over the channel rules. The DM is then re-sent every eight hours or so while they are in the channel, reminding them of the rules.
Here is an example of a DM that was sent to a user during an incident test:
All Things Tasks
Tasks are small pieces of work that are assigned to various contributors in the channel during an incident. An IC assigns a task and then the bot handles the rest.
Assigning Tasks
If a new task needs to be assigned to someone, this can be done via the command:
/irbot newtask
A dialog box will be sent to the IC where they can fill in some basic information about the task:
Via the criticality drop-down box, the IC can select from a range of choices, such as:
Critical
High
Normal
Low
This task is then created in Jira, with restrictive permissions to ensure the confidentiality of the incident details.
A DM is then sent to the owner of the task:
Working Tasks
The engineer working the task can then click the button to begin progress. This is recorded on the back end for the IR team, but more on that later. When an engineer clicks the start progress button, the bot modifies the Slack DM and replaces the start progress button with two buttons, one to comment and one to resolve and comment:
An engineer can then leave a comment on the task by clicking the button, or resolve the task and provide their resolution notes. When the comment button is pressed, the engineer gets a dialog box where they can leave notes:
These notes are reported to the IR team and not the incident channel. This is done to keep activity to the channel limited to important discussions. This data is also updated on the Jira board.
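The task flow described in this section behaves like a small state machine, where the buttons shown in the DM depend on the task's state. The sketch below is an illustration under assumptions: the state names, action names, and button labels are hypothetical, and the actual bot is built from Tines actions rather than Python.

```python
from enum import Enum
from typing import Dict, List, Tuple


class TaskState(Enum):
    ASSIGNED = "assigned"
    IN_PROGRESS = "in_progress"
    RESOLVED = "resolved"


# Buttons shown in the task DM for each state (labels are assumptions).
BUTTONS: Dict[TaskState, List[str]] = {
    TaskState.ASSIGNED: ["Start Progress"],
    TaskState.IN_PROGRESS: ["Comment", "Resolve and Comment"],
    TaskState.RESOLVED: [],
}

# Allowed (state, action) pairs and the state each one leads to.
TRANSITIONS: Dict[Tuple[TaskState, str], TaskState] = {
    (TaskState.ASSIGNED, "start_progress"): TaskState.IN_PROGRESS,
    (TaskState.IN_PROGRESS, "comment"): TaskState.IN_PROGRESS,
    (TaskState.IN_PROGRESS, "resolve"): TaskState.RESOLVED,
}


def press(state: TaskState, action: str) -> TaskState:
    """Apply a button press and return the new state."""
    try:
        return TRANSITIONS[(state, action)]
    except KeyError:
        raise ValueError(f"{action!r} is not available while {state.value}") from None


if __name__ == "__main__":
    state = press(TaskState.ASSIGNED, "start_progress")
    print(state.value, BUTTONS[state])
```

Keeping the transitions in one table makes it easy to reject button presses that arrive out of order, for example from a stale DM.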
A confirmation is sent to the user when they notate their task:
Task Reminders
Every 30 minutes the bot reminds all users with an open task to complete it. The DM is similar to the one originally sent during assignment, but it specifically states that it is a reminder for them to work on and close out their task.
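A minimal sketch of how each reminder run might pick its recipients, assuming a hypothetical Task record with resolved and muted flags (muting is described in the next section). The 30-minute scheduling itself is left out.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Task:
    """Hypothetical task record; field names are assumptions."""

    task_id: str
    owner: str
    resolved: bool = False
    muted: bool = False


def tasks_to_remind(tasks: List[Task]) -> List[Task]:
    """Select tasks whose owners should get a reminder DM on this run."""
    return [task for task in tasks if not task.resolved and not task.muted]


if __name__ == "__main__":
    tasks = [
        Task("T-1", "@alice"),
        Task("T-2", "@bob", resolved=True),
        Task("T-3", "@carol", muted=True),
    ]
    for task in tasks_to_remind(tasks):
        print(f"Reminder to {task.owner}: please work on and close out {task.task_id}")
```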
Muting Tasks
Sometimes these reminders can be too noisy and disruptive. For that reason, the IC can mute a task, which stops the bot from sending reminders for it, or unmute it to resume them. This is done via the command /irbot mute.
A dialog is then sent to the IC where they can supply a task ID as well as choose to either mute or unmute a task:
Providing Hourly Updates to the Channel
Generally once an hour, the IC will provide a public update to the channel about the current state of events and give an overview of all the tasks left to be completed.
To do this, the IC can issue the command /irbot update.
A dialog will then be sent to the IC for them to supply an executive synopsis of the current state of affairs:
Once the IC submits the synopsis, the status is sent to the channel, including a summary of all current open tasks and the progress made:
If an engineer for some reason has not started progress on their task, this will be evident in the update, which will also include any recent comments that the engineer may have provided the bot on their work.
The synopsis updates provided by the IC are also automatically added to the timeline. Adding notes to the timeline is covered below.
How IR Bot Assists the Scribe
Scribe duties are now significantly easier due to automation that IR Bot provides. A scribe has the following tasks during an incident:
Document significant findings from the IR channel and put the data into the Confluence major incident page’s timeline. Data such as:
UTC Time of the event
Engineer involved in the event
Note description of the event
Take the hourly updates the IC provides and put them into the Confluence timeline
Take over the role of IC when the IC requests a break
The toughest part about being a scribe was reading through all the various threads, converting the times of noteworthy events from your local timezone to UTC, and then filling out the cells in the incident timeline table in Confluence. This took significant effort, and during a fast moving incident, it was easy to miss key events, meaning a scribe would then have to scroll back through the incident to find them all.
Taking Notes
A scribe can now take notes two ways via IR bot.
Note Command
The first method is to fill in a manual note. Perhaps something occurred in a team video call for the incident, or perhaps outside of Slack, that needs to be notated. To do this, the scribe types this command into the slack channel:
/note <insert note data here>
Once they have submitted the note, the scribe will get a message in the channel confirming the note submission, visible only to them:
The note ID means a scribe can modify a note that has already been created via the command /irbot searchnote. This note is then added to the timeline in Confluence.
Note Shortcut
A scribe can also add a message from the channel that anyone has authored to the timeline by clicking on the triple dot action next to the message. This message could have been typed in the main channel or from within a thread of the main channel. Either way, this allows the scribe to easily add that event to the timeline via a shortcut to “add to timeline”.
The user receives the same confirmation message as in the previous /note example.
Publishing Notes
Every 15 minutes IR bot will take the notes that are stored in its database and publish them to the timeline. If an end user wants to do this immediately, they can use the command /irbot pubtl.
Pubtl is an abbreviation for ‘publish timeline’.
Confirmation is provided in the channel:
This is where things get a bit more tricky. Perhaps the timeline already contains events that were added by hand, or screenshots etc. The bot will NOT overwrite the entire timeline when it publishes to it. It will simply add the new events from its database and place them chronologically based on the timestamps of the events already in the timeline, so as not to disorganize or overwrite any custom made notes in the timeline.
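One way to picture this non-destructive publish is an insert that never reorders or drops existing rows. The sketch below is an assumption about the approach, not the bot's actual logic: events are modelled as hypothetical (UTC timestamp, author, note) tuples, and each new event is placed before the first existing row with a later timestamp.

```python
from datetime import datetime, timezone
from typing import List, Tuple

Event = Tuple[datetime, str, str]  # (UTC timestamp, author, note)


def publish(timeline: List[Event], new_events: List[Event]) -> List[Event]:
    """Insert new events chronologically without touching existing rows."""
    merged = list(timeline)  # work on a copy; existing rows keep their order
    for event in sorted(new_events, key=lambda e: e[0]):
        index = len(merged)
        for i, existing in enumerate(merged):
            if existing[0] > event[0]:
                index = i
                break
        merged.insert(index, event)
    return merged


if __name__ == "__main__":
    def at(hour: int, minute: int) -> datetime:
        return datetime(2020, 9, 20, hour, minute, tzinfo=timezone.utc)

    manual = [(at(10, 0), "scribe", "added by hand"), (at(10, 30), "scribe", "screenshot")]
    from_bot = [(at(10, 45), "bot", "later note"), (at(10, 15), "bot", "earlier note")]
    for timestamp, author, note in publish(manual, from_bot):
        print(timestamp.strftime("%H:%M"), author, note)
```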
Here is an example of the output to the timeline:
As you can see, the bot automatically converts the timestamps to UTC and supplies the author’s name, employee team information and the actual note. If the note shortcut was used on a Slack message written by someone other than the scribe, the bot will put that author’s information into the timeline instead of the scribe’s. In this example, all the notes are authored by me, so only my name appears in the timeline.
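The UTC conversion is the kind of step that is easy to get wrong by hand and trivial to automate. A minimal sketch: Slack message timestamps (the ts field) are strings of epoch seconds, so they can be converted directly, while a locally recorded time needs its offset applied. The function names and the output format are assumptions, not the bot's actual format.

```python
from datetime import datetime, timedelta, timezone


def slack_ts_to_utc(ts: str) -> str:
    """Convert a Slack message ts (epoch seconds as a string) to a UTC string."""
    moment = datetime.fromtimestamp(float(ts), tz=timezone.utc)
    return moment.strftime("%Y-%m-%d %H:%M:%S UTC")


def local_to_utc(local: datetime) -> datetime:
    """Convert a timezone-aware local time to UTC."""
    return local.astimezone(timezone.utc)


if __name__ == "__main__":
    print(slack_ts_to_utc("1600603200.000200"))
    # 08:00 at UTC-4 is the same moment as 12:00 UTC.
    eastern = timezone(timedelta(hours=-4))
    print(local_to_utc(datetime(2020, 9, 20, 8, 0, tzinfo=eastern)))
```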
Closing Out the Incident
When an incident has come to a conclusion the IC will close out the incident with the command /irbot archive.
A dialog appears for the IC to fill out and then once he or she clicks submit, the incident is closed:
Incident closure by the bot on the back end does the following:
The incident Confluence page is changed from ‘In Progress’ to a ‘Resolved’ status
The Hive case is resolved with the correct resolution type, tags, impact and notes from the dialog box
All current open/pending tasks assigned to users are resolved
The Slack channel is archived
IR Bot Admin Channel & Notifications
The bot reports various activity to a separate admin channel whose only members are the incident response team. Events include:
Errors the bot has encountered during various API calls
Users that have been in the incident for a long period of time
Jira task information such as when a user starts progress, resolves a task, or leaves a note on a task
A report when the bot auto reminds users of open tasks. The bot will provide a URL to a Jira search which contains a report of all open tasks
Sample screenshot of a Jira task event:
Sample screenshot of a user activity event:
If a user is kicked with the above button, then IR bot will politely notify the user via DM that they have been removed from the channel due to the need-to-know policy, and that their assistance in the channel has been greatly appreciated.
Post Incident Metrics
IR bot records user events for the Slack channel such as:
When a user joins the Slack channel
When a user leaves the Slack channel
IR bot will then forward these events via API to SumoLogic. The events will contain the following information:
Date timestamp
User’s Slack ID
User’s email
User’s name
Event in question (leave or enter)
In SumoLogic, we calculate the time a user spends in a channel and then represent this data graphically on a dashboard. Here is a sample of that data:
The dashboard displays three types of charts:
A line graph showing total users added or removed from a channel over time
A stacked bar graph showing percentage-wise how long a user has remained in the IR channel as a contributor, compared to other contributors
Lastly a table showing all users in the channel and how many hours they contributed in total to the incident
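The time-in-channel figure behind these charts comes from pairing each user's enter and leave events. In our setup this calculation happens in SumoLogic; the Python below is only a sketch of the same idea, with a hypothetical hours_in_channel function and events modelled as (timestamp, user, kind) tuples. Users still in the channel are counted up to a supplied "now".

```python
from collections import defaultdict
from datetime import datetime, timezone
from typing import Dict, List, Tuple

Event = Tuple[datetime, str, str]  # (timestamp, user, "enter" or "leave")


def hours_in_channel(events: List[Event], now: datetime) -> Dict[str, float]:
    """Total hours each user has spent in the channel."""
    joined: Dict[str, datetime] = {}
    totals: Dict[str, float] = defaultdict(float)
    for timestamp, user, kind in sorted(events, key=lambda e: e[0]):
        if kind == "enter":
            # Ignore a duplicate enter while the user is already present.
            joined.setdefault(user, timestamp)
        elif kind == "leave" and user in joined:
            totals[user] += (timestamp - joined.pop(user)).total_seconds() / 3600
    for user, since in joined.items():
        # Users who have not left are counted up to "now".
        totals[user] += (now - since).total_seconds() / 3600
    return dict(totals)


if __name__ == "__main__":
    def at(hour: int, minute: int = 0) -> datetime:
        return datetime(2020, 9, 20, hour, minute, tzinfo=timezone.utc)

    log = [
        (at(10), "alice", "enter"),
        (at(11), "bob", "enter"),
        (at(12, 30), "alice", "leave"),
    ]
    print(hours_in_channel(log, now=at(14)))
```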
Using this activity data, the bot will send out periodic Slack DMs to users notifying them that they have been in a channel for X hours and if their assistance is no longer required they should leave the channel.
The IR bot admin channel will print a list of these users, as seen in the admin channel section of this post, giving the IR team the ability to remove them. While a major incident is active, the activity calculation runs and the reminders go out to users every eight hours.