Edition e1.0.0
Updated October 9, 2023. As explained by Laura.
It’s a cliche, but a lot of what we do in security is try to avoid bad things happening and prepare to respond if they do. It’s a profession of pessimists, and our pessimism and preparation are what makes the difference between a fast, smooth recovery and a prolonged, public crisis.
Let’s take a look at the two categories of “bad things” that typically affect our organizations—incidents and disasters—how they differ, and how we prepare for them. Think of this less like creating a bug-out bag and embracing survivalism, and more like having a plan for when the fire alarm goes off.
Confusion: Two of the most commonly misused words in security are incident and disaster. They are often used interchangeably, with every “incident” described as a “disaster” for the business. While we all love a good bit of hyperbole, in this chapter and in the plans and processes that result from it, we need to define these two events clearly.
Definition: Incidents are any form of event or occurrence in our organization, system, or processes. While they are typically perceived as negative events, an incident without context or investigation is simply a marker that something has happened. The cause and overall impact of an incident are unknown until a full investigation is carried out.
Incidents are not unique to security. They are categorized in many different ways, in many different fields.
Incident types that growing companies will typically encounter include:
- systems or tool outage
- performance issue on an application or system
- bug identified in production code
- unauthorized access to a system or account
- loss or theft of a computing device
- office alarm triggered outside of working hours.
Some of these are clearly security-related issues, such as alarm system issues and authorization alerts. Others are quite general; while they may have a security impact or association, this may not be immediately obvious without investigation.
Definition: Disasters are a category of event that has a confirmed large-scale impact on the organization, its systems, people, processes, and property. Like incidents, not all disasters are security related, but there are definitely categories of disaster that are security aligned.
Disaster types that growing companies may encounter include:
- earthquakes and natural disasters
- fires
- pandemics
- loss of production databases or equipment.
In the case of these disasters, the scope and impact of the event is clear from the start. It’s likely assumed that the situation is bad and that systems, people, processes, or property have been harmed, destroyed, or otherwise rendered useless.
Incident response focuses its early activity on investigation and evidence gathering, later deciding on appropriate recovery actions. Disaster recovery focuses on the removal of immediate danger, protection of remaining assets, and restoration of that which has been damaged.
If someone steals the last cookies from the cupboard, this is an incident. First you’re going to investigate, then you will respond. You do not respond until you are sure of the facts.
If the kitchen, its cupboards, and the cookie jar are on fire, this is a disaster. First you will clear the area and trigger your fire safety plans, then you (or a trained professional) will extinguish the fire and check everyone is safe. Only later will you investigate the cause of the fire and plan for repairing and replacing the kitchen.
Whether we have an incident or a disaster on our hands, it’s crucial that we have a plan in place for how to respond. Let’s start with incidents, and in the next chapter we’ll dive deeper into disasters.
Incident response is a well-established practice in the technology space and there has been a lot written about it. This introduction gives you a high-level overview of how incident response processes work and the typical actions and considerations that are associated with every stage.
The first thing to note is that for the most part, incident response is not linear. An incident response is a triggered process that will loop between a number of stages until all evidence and impact of the incident is resolved.
Figure: The stages of incident response.
The process itself is typically made up of four stages of action:
1. Identification
2. Verification
3. Containment
4. Remediation
During the Identification stage, an incident is detected via one of the incident notification sources. This information is passed to a first-line responder, who triggers the incident response plan.
Task | Owner | Output |
---|---|---|
Initiate logging and timeline. Start the record for the incident. Note the nature and content of information received/identified in the Security Channel. | Initial Responder | Documented audit trail in the security channel |
Verification of information Source. Where the information leading to the incident acknowledgment was received from outside the organization, it is important to review the source for credibility, agenda, and risk. | Initial Responder | Verification activities and findings noted in the security channel |
High-level triage. Before an incident can be confirmed, a basic assessment should be made. This aims to eliminate known false positives and confirm reported or suspected issues. Triage will vary by incident type. | Initial Responder Support | Triage notes in security channel |
Initiate incident response. Create a channel for the incident within Slack. Notify the Security channel of this new channel and ask for the conversation to be moved to the incident specific space. | Incident Responder Incident Lead | Creation of new incident specific document or communications log. |
(Optional) Activate on call. If the incident has occurred outside of normal working hours, the on-call system should be used to contact and activate on-call staff. | Incident Responder Incident Lead | On-call staff available to respond |
Allocate roles. Assign incident lead, deputy, and communications lead roles. Notify other named parties with incident responsibilities (see Roles and Responsibilities) | Incident Lead | List of allocated roles and contact details in the incident security channel. |
Classification of the incident. Using the classification guidance in this document, classify the issue. Peer review this decision with another member of the incident response team. | Incident Lead | Classification of incidents made and documented in the incident security channel. |
(High severity or above) Executive briefing. Where an incident is of high severity or highly public in nature, a brief should be given to the executive team. They may have questions or concerns that should be addressed. The communications lead or executive liaison should act as the ongoing mediator with this group. | Incident Lead Comms Lead | A concise executive summary of the incident and its status delivered to the executive team and stored in the incident specific security channel. |
Incident response briefing. Initial responder to brief the incident team and answer any initial questions. This marks the end of the active responsibility for the initial responder (unless they have been assigned the lead or deputy role). | All Incident Team | Meeting held with the incident team. Team briefed and if appropriate, the initial responder was relieved of duty. Minutes of meeting documented in incident specific security channel |
(Optional) Update public status page. If the incident is directly affecting customers or public facing systems, an appropriate update should be made on status page or update channels. External messages should be QA’d by the incident lead and a member of the senior leadership. | Comms Lead Incident Lead | Update to status page mechanisms where appropriate. |
Before the incident is a confirmed issue, the accuracy and extent of the issues must be verified. This stage of incident response is focused on the confirmation of the issue and clarification of the scope or extent to which it affects your company, its systems, and users.
Verification includes the identification of the issue across multiple data sources and the reproduction of any suspicious performance behavior in a controlled manner (by organizational staff or on organizational equipment). Even if the verification process flags this incident as a false alarm or inaccurate, it should still be documented.
Task | Owner | Output |
---|---|---|
Identify affected customers and systems. It is crucial that the extent of the incident is understood and recorded. Where appropriate this should include a breakdown of customers affected or systems/hosts at risk | Incident Lead Deputy | List of affected systems or customers in incident specific security channel |
Access and monitor all logs for the affected accounts or systems. (Optional) Where relevant or appropriate, increase logging levels to ensure sufficient granularity. | Deputy | Updates and findings in incident specific security channel |
Establish a timeline of events. Record all findings and investigative paths in the Incident Security Channel. | Scribe | Updates and findings in incident specific security channel |
Reproduce issue on the non-production environment. For issues that are caused by specific bugs or actions, these must be tested and documented. | Deputy | Updates and findings in incident specific security channel |
Identify other potential issue areas. Where an issue is caused by a specific bug or action, extend testing to all associated use cases or similar interaction points where possible. | Deputy | Updates and findings in incident specific security channel |
Investigate root cause or sequence of events leading to incident. Where time allows, ensure that the issue being investigated is the root cause of the issue and not the side effect of another more serious issue. This will require cross log investigation and timeline analysis. | Deputy | Updates and findings in incident specific security channel |
Confirm issue across account types, geographic location, etc. (the scope of the incident). It is crucial that the full scope or extent of the issue is understood. For platform or system issues that are public facing, this includes ruling out privilege and geographic distinctions. Test assumptions and systems from both inside and outside organizational networks to avoid testing environment bias. | Deputy | Updates and findings in incident specific security channel |
Once identified and confirmed, the issue should be contained such that its impact on your systems and customers can be limited. Where possible, affected systems should be isolated from healthy systems. This may include preventative account suspension, removal from networks, or password reset activities if an account has been compromised.
All containment activities should be documented as part of the incident log and implications of said containment communicated to affected stakeholders.
Danger: Containment steps are very specific to the individual incident and scenario type. The following are generic steps and should be used as a guideline, not as a comprehensive and complete approach.
Task | Owner | Output |
---|---|---|
Initiate customer contact. Where customers are affected, directly contact each customer. Contact should aim to reassure and acknowledge rather than provide technical detail. Required actions must be well tested inside the organization before external communications are sent. | Comms Lead | Customer contact drafts and actual messages |
Isolate compromised host(s). Where a host is assumed compromised, remove it from the network wherever possible or lock down ingress and egress to a single controlled IP. Avoid powering down or restarting the host until an image or snapshot can be made. | Incident Lead | List of compromised hosts plus results from checking the isolation is successful |
Suspend compromised account(s). Where an account has (or is suspected to have) been compromised, it should be suspended. Suspension should aim to preserve all access or event logs for the account. Where the account is central to core operations, this should be reflected in the incident severity and classification. A decision must be made as to whether the account can be suspended safely without disrupting availability. | Incident Lead | Suspended account list and access to the relevant access and event logs for said accounts |
Seize relevant hardware or equipment. Where hardware such as laptops are believed to be the cause of or affected by an incident, they should be taken by the incident team for investigation and eventual remediation. Temporary clean devices may be issued as an interim solution, however, these should provide the minimum to get the job done and be replaced once the incident is resolved. | Incident Lead | Seized hardware list including asset tag and assigned owner |
Once contained, the issue must be remediated. This stage may vary in length and complexity based on the incident. If dealing with a security issue or an issue involving complex or legacy systems, consultation with domain experts is strongly recommended.
Changes made during the remediation phase should be undertaken in a controlled and documented manner, ensuring that each change is tested before the next is applied. Chaotic or uncontrolled changes increase the likelihood of introducing additional issues into the system or hiding potentially simple solutions.
Remediation can only be deemed successful once the verification step has been repeated and end-to-end tests have been conducted. For vulnerabilities outside of your company’s control, this might include following security news feeds, running available check tools, and increasing monitoring for the duration of the issue.
Verification, containment, and remediation will continue as a repeating loop until all the issues have been addressed and systems behavior has been returned to normal.
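The repeating loop described above can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed implementation: the `verify`, `contain`, and `remediate` callables, the `max_rounds` cap, and the `CLOSED` stage are all assumptions made for the example.

```python
from enum import Enum


class Stage(Enum):
    IDENTIFICATION = "identification"
    VERIFICATION = "verification"
    CONTAINMENT = "containment"
    REMEDIATION = "remediation"
    CLOSED = "closed"  # hypothetical end state, not one of the four stages


def run_response(open_issues, verify, contain, remediate, max_rounds=10):
    """Loop verification -> containment -> remediation until nothing is left.

    `verify` returns the issues that are confirmed and still present.
    `contain` and `remediate` act on those confirmed issues.
    All three are hypothetical callables supplied by the caller.
    Returns the list of stages visited, suitable for the incident log.
    """
    visited = [Stage.IDENTIFICATION]
    for _ in range(max_rounds):
        visited.append(Stage.VERIFICATION)
        confirmed = verify(open_issues)
        if not confirmed:
            # Verification found nothing left, so the loop can end.
            visited.append(Stage.CLOSED)
            return visited
        visited.append(Stage.CONTAINMENT)
        contain(confirmed)
        visited.append(Stage.REMEDIATION)
        remediate(confirmed)
        open_issues = confirmed
    raise RuntimeError("still unresolved after max_rounds; escalate")
```

Note that the loop only exits through a verification pass that finds nothing, which mirrors the rule that remediation counts as successful only once verification has been repeated.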
Danger: Remediation steps are very specific to the individual incident and scenario type. The following are generic steps and should be used as a guideline, not as a comprehensive and complete approach. As always, if you are unsure how to proceed or don’t have the skills in your team, reach out to professionals for help. Companies specializing in incident response and forensics will have the skills and experience you need to respond.
Task | Owner | Output |
---|---|---|
Patching and systems updates. Where applicable apply vendor patches or assess the availability of application or framework updates. | Incident Lead | List of systems updated, and patches applied in incident specific security channels. |
Address privacy issues. If the privacy of any personal data has been compromised, the privacy officer must assess the impact and determine the appropriate action to take in remediation. | Privacy Officer | Assessment on whether further action is required. |
Address software flaws. Where an incident relates to a vulnerability or issue with an in-house application, ensure that code is fixed and tested before deployment. Ensure that all instances of the flaw or issue are addressed and not just the initial instance. Engage external assistance where appropriate. | Deputy | Changes to code base linked to specific commits and tests. |
Address configuration issues. Where an incident relates to a misconfiguration, ensure that this is addressed in the build systems or scripts and the host is rebuilt with the new configuration. Avoid fixing in place on deployed servers where possible to avoid configuration creep. | Incident Lead | Rebuilt hosts and updated host build files. |
Initiate backup recovery. Where data has been lost or compromised, ensure that a backup is available and prepared for restore. | Incident Lead | Estimated recovery time and recovered data. |
Re-image or rebuild equipment or machines. Where equipment has been compromised or affected by an incident, re-image or rebuild from a trusted base image. Do not attempt to fix individual issues such as malware or viruses in place. | Incident Lead | Rebuilt hardware |
Address gaps in logging and audit. If the incident highlighted gaps in logs or audit trails, address these and ensure logs are centralized, securely stored and monitored. | Deputy | Logging and audit for the acknowledged gaps |
(Optional) Engage an external specialist to assess and retest remediation. For serious or complex incidents, ensure an objective specialist has reviewed and retested the remediation issues. | Incident Lead | Assessment results and report |
Communicate with affected customers. Once remediation is complete, the affected customers should be briefed. Where the action is required on their part (such as resetting a password) this must be clear and concise. Communication content and a distribution list should be QA’d by the Incident lead and a senior leader before sending. | Comms Lead | Draft communications, sign off and actual communications |
(Optional) Executive brief. For high severity issues, an executive brief should be compiled upon remediation. This should address any concerns and explain the risks and effects of the incident in concise terms. | Incident Owner IT Manager | Executive briefing document |
Unlike the actions we have discussed above, this last set of suggested tasks is ongoing. You need to carry them out frequently at all stages of the incident response process. The aim here is to ensure you always have a good record of what you have done or discovered and that you are always taking steps to learn more about the situation as it evolves.
This documentation and discovery not only helps with post-incident reviews but makes it much easier to share the load during an incident and let people swap in and out.
Task | Owner | Output |
---|---|---|
Record all actions, findings, and communications in the log. | All | Documented audit trail |
Access and monitor all logs and audit trails for the affected accounts or systems. (Optional) Where relevant or appropriate, increase logging levels to ensure sufficient granularity. | All | None |
Identify, document, and challenge all assumptions (ongoing). | All | Documented audit trail |
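
The ongoing tasks above amount to keeping an append-only record that can be replayed as a timeline. The following is a rough sketch of such a record, assuming a small in-memory structure; the class names, the four entry kinds, and the method names are invented for illustration and would normally be replaced by whatever channel or tool your team already uses.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

# Hypothetical entry kinds, loosely following the three ongoing tasks above.
ALLOWED_KINDS = {"action", "finding", "communication", "assumption"}


@dataclass(frozen=True)
class LogEntry:
    timestamp: datetime
    author: str
    kind: str
    text: str


@dataclass
class IncidentLog:
    entries: list = field(default_factory=list)

    def record(self, author, kind, text, when=None):
        """Append one entry; `when` defaults to the current UTC time."""
        if kind not in ALLOWED_KINDS:
            raise ValueError(f"unknown entry kind: {kind}")
        entry = LogEntry(when or datetime.now(timezone.utc), author, kind, text)
        self.entries.append(entry)
        return entry

    def timeline(self):
        """Entries in chronological order, regardless of when they were logged."""
        return sorted(self.entries, key=lambda e: e.timestamp)

    def assumptions(self):
        """Every recorded assumption, so each one can be challenged later."""
        return [e.text for e in self.entries if e.kind == "assumption"]
```

Keeping assumptions as their own entry kind makes it easy for someone swapping into the incident to see what has been taken for granted so far.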
Whatever the incident you face, this process provides a stable and predictable set of activities and actions that you and your team can use to respond. When we put our knowledge of this incident response process into a repeatable document, we form what is known as an incident response plan, your grab-and-go guide to surviving in stressful times.
There are many ways to document these plans—stick with what works for your internal culture and documentation style. Rather than define the document template, we will look at the sections you need to include and why they are important.
Like many of the subjects we have discussed in this book, just because something is an incident, it doesn’t mean the world is ending. Security isn’t always critical and that’s OK.
Important: Before you dig into the steps you need to take to respond to an incident, it’s important to define the levels of criticality associated with incidents. As we mentioned when we discussed risk, defining these upfront allows you to prioritize and plan your actions based on likely impact, rather than your emotional response to a stressful situation.
Here is a set of example levels. They may not work for your organization, so it’s important to take a look at each and see what you need to adopt or adapt.
Each level should have a name, a description, and a definition of the impact this incident is having. This detail makes it easier to determine the level of an incident when one arises.
Level | Description | Example |
---|---|---|
Mission Critical | A serious event impacting large numbers of users for extended periods. This would include compromise that would cause large scale financial and reputation damage. Issues should be immediately escalated and addressed as a high priority. Business continuity actions and communications should be made ready. External specialists and law enforcement may need to be involved for security incidents. | • Database compromise. • Entire site outage affecting entire customer base, a large site, or the entire organization for an extended period (24 hours or more) • A critical or high severity vulnerability is made public |
Business Critical | An incident affecting a large number of customers across a wide range of activities. Issues are not remediated in half a working day (4 hours). For security incidents, this includes high risk vulnerabilities that have a high chance of exploitation (publicly known or received from a third party). Issue should be escalated and addressed. | • Issue affecting a number of customers, or a whole branch. • Private vulnerability disclosure or high potential of coverage in mainstream media. • CVSS 7 or above. |
Business Operational | An incident that affects a small group of customers and may affect their ability to complete activities. The issue is present for a short period of time. Issues should be escalated and prioritized. | • Issue affecting a small number of customers, a whole team, or isolated to a small number of data sets. • CVSS 5 or above issue in the software architecture. • Any issue that can be handled exclusively in working hours. |
Administrative | An incident that causes increased resource usage, mild customer discomfort, or confusion to a very small subset of customers. For security events, this would be a low-level security risk with a low likelihood of being exploited. No immediate action is required. | • Support issue. • Incident affecting only one customer/user or one data set, such as individual compromised accounts. |
Once you have your levels defined, they will guide all initial incident responders during the early stages of the incident response process.
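To show how example levels like these could guide a first responder, here is a minimal sketch that encodes a few of the table's examples as rules. It is deliberately incomplete: the `IncidentFacts` fields and the `small_group` threshold of 10 customers are assumptions made for illustration, and the table contains judgment calls (media interest, duration, reputational damage) that a function like this cannot capture.

```python
from dataclasses import dataclass


@dataclass
class IncidentFacts:
    # All field names are hypothetical; adapt them to what you actually record.
    cvss: float = 0.0                  # CVSS score, if a vulnerability is involved
    publicly_disclosed: bool = False   # vulnerability details are public
    database_compromised: bool = False
    outage_hours: float = 0.0          # duration of an outage affecting all customers
    customers_affected: int = 0


def classify(facts, small_group=10):
    """Map known facts to one of the four example levels.

    `small_group` is an invented cut-off between "a small number of
    customers" and "a number of customers"; the table gives no figure.
    """
    if (facts.database_compromised
            or facts.outage_hours >= 24
            or (facts.cvss >= 7 and facts.publicly_disclosed)):
        return "Mission Critical"
    if facts.cvss >= 7 or facts.customers_affected > small_group:
        return "Business Critical"
    if facts.cvss >= 5 or facts.customers_affected > 1:
        return "Business Operational"
    return "Administrative"
```

For example, `classify(IncidentFacts(cvss=7.5))` yields "Business Critical", while the same score with `publicly_disclosed=True` yields "Mission Critical". Whatever rules you encode, the result should still be peer reviewed by another member of the incident response team, as the classification task requires.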
As well as knowing how serious an incident is by defining its classification, we also need to define and simplify the roles we each play during incident response. Assigning and defining roles makes sure everyone knows what to do and avoids people all covering the same tasks (or all ignoring them and assuming someone else has it covered).
The following table is a set of typical incident response roles, their aim, and a brief summary of their responsibilities during an incident.
Remember, this definition stage isn’t about perfection, it’s about assigning responsibilities and removing ambiguity.
Role | Description | Responsibilities |
---|---|---|
Incident Response Owner | Owns this incident response plan and holds management-level ownership of it and its associated risks. | • Update and maintain this document. • Arrange for regular tests of this process. |
Incident Lead | Controls and leads activities for a specific incident | • Lead the incident response team. • Coordinate response activities. • Manage prioritization during incident response. |
Deputy | Supports the Incident Lead and manages communications for a specific incident | • Manage communications with internal and external stakeholders. • Support the incident lead. |
Scribe | Records incident details for later reference | • Records the timeline of events during incident response. • Collects evidence to be used during post-incident review (screenshots, copies of log files, etc.) |
Comms Lead | Coordinates communication between team members | • Ensures that all team members are kept appropriately informed during the progress of the incident. • Escalates issues to the IT Manager when required. |
Privacy Officer | Manages privacy issues within your organization | • Must be informed of any incidents that involve a breach of private data. • Will liaise with the Privacy Commissioner if required. |
Incident response and management requires a number of coordinated roles to work efficiently. To ensure that your company is able to respond quickly, incident specific roles such as “incident lead” and “deputy” should be filled by people currently serving on the on-call roster, which is rotated regularly.
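As a rough sketch of that allocation, the function below takes the incident lead and deputy from the on-call roster and fills the remaining roles from whoever else is available. The function name, the `others` parameter, and the fallback of doubling up on the deputy are assumptions of this example, not rules from the plan.

```python
def allocate_roles(on_call, others=()):
    """Assign incident specific roles for one incident.

    `on_call` is the current on-call roster in priority order.
    `others` is a hypothetical list of additional available staff.
    """
    if len(on_call) < 2:
        raise ValueError("need at least two on-call people for lead and deputy")

    # Lead and deputy come from the on-call roster.
    allocation = {"Incident Lead": on_call[0], "Deputy": on_call[1]}

    # Remaining roles go to anyone left on the roster, then to other staff.
    pool = list(on_call[2:]) + [p for p in others if p not in on_call]
    for role in ("Scribe", "Comms Lead"):
        # Assumption: if nobody else is free, the deputy covers the role.
        allocation[role] = pool.pop(0) if pool else allocation["Deputy"]
    return allocation
```

Calling `allocate_roles(["ana", "ben"], ["cy"])` gives ana the lead, ben the deputy role, cy the scribe role, and leaves ben covering communications. The resulting mapping is the "list of allocated roles" that the identification stage asks you to post in the incident channel.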
While we all like to think our companies are unique, we all secretly know that’s not the truth. There is something that makes your organization special, something your customers love, but many bits of how our companies operate are shared with other organizations around the world.
Identifying these common scenarios allows you to plan for them happening. In incident response we would normally create specific playbooks (as we discussed previously when we talked about policy, standards, and processes) to capture the specific actions our team needs to take if such an incident arises.
Here is a list of the most common scenarios. Feel free to use these as a suggested starting point for your organization’s scenario playbooks.
Scenario | Risks and Considerations |
---|---|
Lost computing device (laptop) | • Loss of sensitive information • Unauthorized device or systems access |
Account compromise (team member) | • Loss of confidential company data • Loss of data integrity • Attacker gains access to other systems or accounts |
Account compromise (single customer) | • Loss of confidential customer data • Loss of data integrity for individual customer • Security incident via support channel |
Account compromise (multiple customers) | • Loss of confidential customer data • Loss of data integrity for many customers • Security incident via support channel • Potential media interest |
Unauthorized systems access detected | • Loss of confidentiality/integrity |
Ransomware | • Systems disruption • Loss of data |
Virus detected | • Systems disruption • Loss of data |
File corruption or data loss | • Loss of data • Potential privacy breach • Potential systems availability issues |
Distributed Denial of Service Attack (DDOS) | • Loss of systems availability • Increase in support volume • Potential media interest |
So far we have defined the classification levels of our incidents based on their severity and impact, defined roles for our team to play, and looked at common scenarios that affect many companies around the world.
To make sure we turn all this definition and planning into action, we first need to understand how we would know if an incident was happening and what information sources would give us early warning.
We call these our incident notification sources and they are the places we need to be monitoring and connecting with frequently if we want to know something is happening as quickly as possible. Remember, you can’t respond to an incident until you know about it, so this is a pretty crucial step.
The following are some simple examples of incident notification sources. Remember that these sources are spread throughout your company, so it won’t always be your engineering or security team that are the first to know something bad is happening.
Type | Description |
---|---|
Alerting and Logs | One or more alerts have been received from an organizational or systems monitoring tool. |
Customer/User | A customer or user has contacted the organization to report an issue, suspicious behavior, or other concern. |
Responsible Disclosure | An individual or group has contacted the organization to report a security vulnerability under the auspices of responsible disclosure. |
Third Party Notification | A notification has been received from any other third-party source, such as vulnerability notification sources or social media. |
Your company may have additional information sources, metrics, or contact points in addition to this list. Make sure you document each of those information sources and that the people who respond to or monitor them are aware of what they need to do should they encounter security messages or alerts in that channel.
While the steps outlined as examples in our overview of the incident response process are a good starting point, each incident scenario will have its own set of recommended actions and priorities. Creating documented playbooks for common incident scenarios can help you respond quickly and minimize the disruption of these events.
In this section, we will take a look at some common examples your company may face. You can use these as the basis for your playbooks or add new scenarios that are specific to your company or operating environment.
Description | • Computing or communications equipment is stolen or lost. |
Potential Scenarios | • Theft from any of the organization’s offices. • Theft while traveling (hotel, in transit, at an event). • Item left behind or lost while traveling. |
Incident Response Priorities | • Device replacement • Assessment of potential data loss • Insurance process compliance |
Suggested Actions | • Notify security team of the loss. • Identify if the device was secured sufficiently (passcode/password, disk encryption). • Gather written accounts of circumstances. • (In case of theft) Contact law enforcement if the intention is to prosecute or claim from insurance. • Contact insurance company to initiate claim. • Conduct root cause analysis to ensure travel choices, storage security, or device security choices remain appropriate. |
Description | • Data is corrupted or lost due to malicious actions or systems compromise. |
Potential Scenarios | • Customer instance is compromised and data for a specific customer is corrupted or lost • Central system component is compromised and data for several (or all) customers is lost or corrupted. • Configurations or source code is corrupted or lost. |
Incident Response Priorities | • Understanding the extent of data loss or compromise • Understand and document the timeline of the incident • Restore lost data to a known trusted state • Manage customer relationships where needed • Identify likelihood of data publication, resale or use in follow-up malicious activity (identity theft, extortion, fraud) |
Suggested Actions | • Extensive log and systems interrogation to understand and document the event timeline. • In case of customer specific compromise, construction of a communications plan and management of relationship • Identification of required backup data and initiation of backup processes. • Monitoring of external communications channels to ensure any chance or instance of follow up malicious activity using this data is known or managed if possible. • If data loss may leave consequences for customers or stakeholders, manage communications to focus on concise, action-oriented messages, and data limitation for all parties. |
Description | • Malicious software is installed and used on computing or communications equipment. |
Potential Scenarios | • Cryptolocker attacker renders organizational files unreadable • Malicious browser extension identified • Malicious application installed on device • Removable media containing malicious software used on network or systems |
Incident Response Priorities | • Containment of the issue to ensure malicious software (or its effects) is unable to spread • Restoration of systems to a known good state • Communication and education of staff to ensure |
Suggested Actions | • Isolation of affected machines from networks and key systems • Revocation of accounts for affected systems if appropriate • Clean build and restore of systems from backups or known clean sources. |
Description | • Unauthorized or inappropriate systems usage is suspected or has been identified. |
Potential Scenarios | • Use of organizational systems for criminal or inappropriate activities • Fraud or deception • Intentional corruption of data or attempts to mislead |
Incident Response Priorities | • Understand the extent of the issue • Limit impact and reverse any damage caused. • Liaise with people and culture team to ensure the process is appropriate and within legal remit |
Suggested Actions | • Evidence gathering from logs and authoritative data sources (forensic investigation) • Interview with individuals or groups in question with appropriate assistance from people and culture teams. • Revocation of access during investigation period |
Description | • Organizational systems are subject to extreme levels of traffic or activity and are unable to continue normal levels of availability. |
Potential Scenarios | • Distributed denial of service (DDoS) attack against hosting provider • Distributed denial of service (DDoS) attack against application layer • Denial of service from unanticipated fault • Denial of service against individual customer instance or assets |
Incident Response Priorities | • Maintain or restore systems availability • Manage communications with customers and stakeholders • Respond to guidance from hosting or third-party providers as and when it emerges. |
Suggested Actions | • Increase monitoring and alerting • Manage operations team to ensure changes are controlled and appropriate • Work closely with communications teams to manage customer experience • Contact hosting providers directly to ensure all possible steps have been taken. |
Description | • One or more accounts are accessed without authorization. |
Potential Scenarios | • Poor quality password used for system • System did not require 2FA • Account left active after staff exiting the organization • Customer account compromised • Phishing attack (see Social Engineering section below) |
Incident Response Priorities | • Containment of affected accounts • Limitation of the access granted to said account(s) • Investigation of data and systems accessible from account • Understanding of scope of compromise (what happened and what was lost). |
Suggested Actions | • Suspension of affected accounts • Investigation of associated accounts and systems • Audit of account logs to understand the scope of compromise • Education for the account holder (if appropriate) and other staff |
Description | • An individual or group within the organization complies with or falls for a social engineering attack. |
Potential Scenarios | • Phishing email • Malicious link in social media channel • Phone scam or phone-based attack |
Incident Response Priorities | • Identification of compromised accounts (if any) • Identification of data loss or corruption (if any) • Attack profile generation and awareness education material creation and delivery • Damage limitation |
Suggested Actions | • Interview with the affected person (people) • Examination of any physical or electronic records for the attack (emails, logs, phone logs) • Suspension or monitoring of suspected compromised systems or accounts • Creation of education material or warning messages for internal staff to reduce the likelihood of future success for the attacker. |
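
The playbooks above all share the same four fields, which makes them easy to keep as structured data and render as a checklist when an incident starts. The sketch below assumes a simple `Playbook` class; the class, its `name` field, and the `checklist` method are illustrative inventions, and the example content is abridged from the lost or stolen equipment playbook.

```python
from dataclasses import dataclass


@dataclass
class Playbook:
    name: str          # hypothetical label used to look the playbook up
    description: str
    scenarios: list
    priorities: list
    actions: list

    def checklist(self):
        """Render the suggested actions as numbered, unchecked steps."""
        return [f"[ ] {n}. {action}" for n, action in enumerate(self.actions, start=1)]


# Abridged from the lost or stolen equipment playbook above.
lost_device = Playbook(
    name="Lost computing device",
    description="Computing or communications equipment is stolen or lost.",
    scenarios=["Theft from any of the organization's offices."],
    priorities=[
        "Device replacement",
        "Assessment of potential data loss",
        "Insurance process compliance",
    ],
    actions=[
        "Notify security team of the loss.",
        "Identify if the device was secured sufficiently.",
        "Gather written accounts of circumstances.",
    ],
)

for line in lost_device.checklist():
    print(line)
```

Posting the rendered checklist in the incident channel gives the team a shared view of what has and has not been done, which fits the ongoing documentation tasks described earlier.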
A disaster recovery plan is critical to your organization’s ability to respond to and recover from a range of disruptive events.