Remediation planning guidance for Meltdown & Spectre

Ever since the announcement of Meltdown & Spectre, two major CPU-related security vulnerabilities, organizations, business leaders, IT executives, system administrators, and cybersecurity professionals have been trying to understand the risks to their environments and what needs to be done to address them. West Monroe has already been working with many of our clients to help them understand the business impacts and mitigate the risks in response to these vulnerabilities. The intent of this article is to share some of the lessons learned: how to gain an objective understanding of these threats, technical remediation steps, and considerations for measuring the performance impacts of the fixes.

To quickly summarize, Meltdown & Spectre allow unauthorized access to data that was never meant to be exposed. Two popular scenarios publicized to date are:

A corporate user is browsing Facebook or the local flower shop’s website, or is tricked into visiting a known malicious site. A malicious advertisement causes the user’s web browser to execute malicious code that compromises the browser and the underlying operating system, gaining access to the user’s personal and corporate credentials/passwords or to documents being processed on the workstation.
Your organization operates a virtualized database server in a multi-tenant cloud where the physical server is shared among multiple organizations. In this scenario, a compromised or malicious virtual server instance running on the same physical server could be used to attack other virtual machines, including your virtual database server instance.

In either one of these scenarios, any data being actively processed on the system may be at risk of being compromised.

So, is it really as scary as it sounds? Thankfully, no.

Thanks to the quick response by CPU vendors, IT vendors, major hosting providers, and other software providers, patches were quickly made available and deployed. In response to the examples above:

Microsoft Internet Explorer & Edge, Google Chrome, Apple Safari, and Mozilla Firefox web browsers have patches or workarounds publicly available.
Amazon Web Services (AWS), Microsoft Azure, and Google Cloud Platform have implemented patches for their underlying public cloud hosting infrastructure.

These quick responses have reduced the likelihood of Meltdown & Spectre being exploited in the two most popular scenarios described above. However, other threats remain that may continue to put your environment at risk, and this is where organizations will need to invest significant effort in understanding, analyzing, and remediating these vulnerabilities.

Regardless of what your organization’s IT infrastructure looks like (locally hosted, hosted in a cloud provider, or a hybrid approach), you are ultimately responsible for ensuring that security patches are applied to all affected systems as they become available. These systems may include:

Physical Server Hardware Firmware/BIOS: Dell, IBM, HP, Cisco, etc.
Virtualization Hypervisors: Microsoft Hyper-V, VMware vSphere, OpenStack, etc.
Operating Systems: Microsoft Servers & Workstations, or Linux/Unix
Containerization Technologies: Docker Images/Containers, Orchestration Management Systems (Kubernetes), and Container Hosts (Azure AKS hosts, Amazon ECS/EKS hosts, or Kubernetes Nodes)
Networking Infrastructure: Routers, Switches, Firewalls, and Wireless Access Points
IT Vendor Appliances: Storage Appliances (SANs), Network Device Management Systems, Network Monitoring Appliances (e.g. SolarWinds, Splunk), or generally any appliance-type solution where your organization cannot update the underlying operating system and components without an official patch from the appliance vendor.

While Meltdown & Spectre are arguably not the most critical or damaging security vulnerabilities disclosed in recent years, patching them will be one of the most challenging efforts organizations face, because the fix reaches the heart of most computers. Compared to the patching routines many organizations are accustomed to (e.g. Patch Tuesday), remediation of these vulnerabilities will require significant planning. The Meltdown & Spectre vulnerabilities are embedded within the lowest level of hardware, the CPU itself. As such, not only does the operating system need to be patched, but the code controlling the CPU also needs to be updated. Additionally, virtualization technologies, such as VMware or Hyper-V, will need to be updated. Updates to CPU code are typically bundled within a BIOS firmware update. However, some vendors (e.g. Dell, HP) may not release a BIOS update for certain systems, in which case a separate CPU code update is required. This is commonly referred to as a “CPU Microcode” update and is distributed by the CPU vendor (e.g. Intel, AMD). While organizations may have a patching process, that process typically does not include deployment of patches to the BIOS or low-level hardware firmware. As such, many organizations will need to adapt their patching process to include these items.

Patching for the Meltdown & Spectre vulnerabilities will require system restarts for each computing layer, including the physical layer, virtualization layer (if applicable), and operating system layer. To identify which patch or change causes an undesired effect on the system (e.g. performance loss or a system crash), each change should be followed by a period of time (e.g. 24-72 hours) in which the system’s stability is monitored. Depending on the size of your IT environment, you’ll likely want to stagger the patch deployment (e.g. Test, then Staging, then Production), so that any negative effect on system stability can be identified prior to affecting the entire environment.
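
As a rough illustration of this kind of staggering, the sketch below lays out a schedule in which each layer is patched in each environment, one change at a time, with a monitoring window after every change. The environment names, layer names, start date, and 48-hour window (one value within the 24-72 hour range mentioned above) are assumptions chosen for the example, not a prescribed plan.

```python
from datetime import datetime, timedelta
from itertools import product

# Assumed names, for illustration only.
ENVIRONMENTS = ["Test", "Staging", "Production"]
LAYERS = ["firmware", "hypervisor", "operating system"]


def build_schedule(start, soak_hours=48):
    """Return one (environment, layer, patch_time, soak_ends) entry per change.

    Changes are strictly sequential: the next change starts only after the
    previous change's monitoring window has ended, so a regression can be
    traced back to a single patch.
    """
    schedule = []
    current = start
    for env, layer in product(ENVIRONMENTS, LAYERS):
        soak_ends = current + timedelta(hours=soak_hours)
        schedule.append((env, layer, current, soak_ends))
        current = soak_ends
    return schedule


if __name__ == "__main__":
    # Arbitrary example start time.
    for env, layer, at, until in build_schedule(datetime(2018, 1, 15, 22, 0)):
        print(f"{env:<10} {layer:<16} patch {at:%Y-%m-%d %H:%M}  "
              f"monitor until {until:%Y-%m-%d %H:%M}")
```

In practice the windows would be aligned to approved maintenance windows rather than computed back to back, but the sequencing idea is the same.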

Additionally, various resources will need to be prepared for any undesired system stability issues that may come up. For corporate workstations and systems, the IT Help Desk will need to be prepared to handle a potential increase in issue or ticket volume. For SaaS or similar providers, the Customer Support/Relations group should be prepared to address any customer concerns regarding the number of maintenance windows, system stability, or system performance.

In the sections below, we’ll discuss some high-level tactical steps to remediation planning, and then discuss considerations for measuring performance impacts.

Guidance on Technical Remediation Planning

While working with our clients to provide the highest risk reduction with the lowest impact to business operations, we found four key activities in common:

1. Performing a Business Impact Analysis

Whether your organization is an SMB or a Fortune 500, identifying and protecting your organization’s most critical assets is a key function of maintaining secure business operations. If your organization hasn’t yet performed a business impact analysis to determine which assets are most critical, developing one should be a high priority. The results of the business impact analysis should tell the business, and the groups responsible for patching, which systems would result in the highest impact to the business if compromised. This will then drive the decision-making process for which systems are patched first or last, which we will discuss later in this post (see #4).

2. Collecting a Detailed Asset Inventory

As organizations have implemented security patches for Meltdown & Spectre, it has become clear that the patches can cause performance and availability disruptions. Further investigation showed that the impact depends on the CPU model in the physical server or workstation. These results indicate the need to collect CPU information, other low-level hardware information, and a list of installed software packages (specifically including Anti-Virus software) for all IT assets.

Additionally, IT organizations and AV software manufacturers have reported that the security patches are incompatible with certain Microsoft Windows Anti-Virus (AV) solutions. For more information on mitigating Meltdown & Spectre on Microsoft Windows see our blog post here.
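
To make the inventory step more concrete, here is a minimal sketch of a per-host record built with Python's standard `platform` module. The record layout and function names are our own assumptions. On Linux kernels that include the mitigations, the kernel reports its view of each vulnerability under `/sys/devices/system/cpu/vulnerabilities`; the sketch reads that directory when it exists and returns an empty result otherwise. It does not collect installed software or AV products, which would need a platform-specific source such as the package manager.

```python
import json
import platform
from pathlib import Path

# Linux kernels that include the mitigations report their status here;
# the directory is absent on other platforms and on older kernels.
VULN_DIR = Path("/sys/devices/system/cpu/vulnerabilities")


def read_mitigation_status(vuln_dir=VULN_DIR):
    """Return {vulnerability_name: kernel-reported status}, or {} if unavailable."""
    vuln_dir = Path(vuln_dir)
    if not vuln_dir.is_dir():
        return {}
    return {
        entry.name: entry.read_text().strip()
        for entry in sorted(vuln_dir.iterdir())
        if entry.is_file()
    }


def collect_inventory_record(vuln_dir=VULN_DIR):
    """Collect basic hardware and OS facts for the local host (assumed layout)."""
    return {
        "hostname": platform.node(),
        "os": platform.system(),
        "os_release": platform.release(),
        "architecture": platform.machine(),
        # platform.processor() may return an empty string on some systems.
        "cpu_model": platform.processor() or "unknown",
        "mitigation_status": read_mitigation_status(vuln_dir),
    }


if __name__ == "__main__":
    print(json.dumps(collect_inventory_record(), indent=2))
```

Records like this, gathered from every host, can then be grouped by CPU model to see which patches and known issues apply to which part of the environment.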

3. Frequently Reviewing Available Security Patches

Many IT vendors were quick to release security patches after Meltdown & Spectre were publicly disclosed. However, IT organizations that deployed some of these patches have reported significant performance load increases, system crashes, or even systems that would not start or boot after the patch was installed. These incidents have resulted in some IT vendors (Intel, Dell, HP, and more) retracting the recently released security patches and advising IT organizations to downgrade to the previous, known-vulnerable versions of firmware/software. It’s important to recognize that the reported performance or availability issues may only affect a small portion of your organization’s overall IT environment. As such, your organization should frequently (e.g. daily) monitor all of its IT vendors’ Meltdown & Spectre related responses, either their press releases on the topic or their software/firmware update download portals. To aid in collecting and processing all this information, an information register should be created so that all stakeholders can quickly see the security patch status from the various IT vendors (servers, virtualization, operating systems, storage appliances, networking, etc.).
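
One possible shape for such a register is sketched below. The field names, the status values ("available", "retracted", "pending"), and the vendor and product names in the usage example are hypothetical placeholders, not real vendor data.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class RegisterEntry:
    vendor: str
    product: str
    layer: str          # e.g. "firmware", "hypervisor", "operating system"
    patch_status: str   # assumed values: "available", "retracted", "pending"
    last_checked: date
    notes: str = ""


def needs_review(entries, today, max_age_days=1):
    """Entries whose vendor status has not been re-checked within max_age_days."""
    return [e for e in entries if (today - e.last_checked).days >= max_age_days]


def deployable(entries):
    """Entries with a patch that is currently published and not retracted."""
    return [e for e in entries if e.patch_status == "available"]


if __name__ == "__main__":
    today = date(2018, 1, 20)  # arbitrary example date
    register = [
        RegisterEntry("ExampleServerCo", "Model X BIOS", "firmware",
                      "retracted", date(2018, 1, 20), "vendor advises rollback"),
        RegisterEntry("ExampleOS", "Server OS", "operating system",
                      "available", date(2018, 1, 18)),
    ]
    print("Re-check today:", [e.product for e in needs_review(register, today)])
    print("Deployable:", [e.product for e in deployable(register)])
```

A shared spreadsheet serves the same purpose; the point is that every vendor has a status, a date it was last checked, and an owner who re-checks it.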

4. Developing an Order of Operations

Now that you’re armed with knowledge of the organization’s most critical assets, a detailed asset inventory, and a register of available security patches, your organization can begin planning the patch deployment effort. Here you’ll want to develop an order of operations to define the order in which each system or individual computing layer is patched.

Here’s an example order of operations we have presented recently to one of our clients:

Scenario: The organization provides a Software-as-a-Service (SaaS) to 300+ external customers in the healthcare industry. The SaaS environment operates on Amazon Web Services (AWS) EC2 virtual servers. The organization has a small corporate footprint with three servers (Active Directory, a file share, and a backup server). There are 30 employees; 3 employees have laptops.

In this scenario the priority to patch would likely be:

The SaaS environment: While AWS has implemented security patches to address these vulnerabilities in its underlying infrastructure (e.g. the physical virtualization hosts that EC2 instances reside on), your organization is responsible for applying the operating system patches to the EC2 instances.
The employee workstations: Consider applying patches to laptops first as they can leave the corporate environment. The outside world will not have the same level of security protections as the corporate environment, such as intrusion detection & prevention systems, web content filtering, etc.
The corporate servers: These are the last to be patched because, in this simple scenario, they present the least risk associated with the Meltdown & Spectre vulnerabilities.

Furthermore, in this example, it’s imperative to recognize that for workstations and corporate servers the hardware will likely need to be patched separately, in addition to the operating system. Hardware-level patches can take multiple forms: a BIOS update from the hardware manufacturer, which commonly bundles the CPU Microcode, or a separate CPU Microcode update from the CPU vendor (Intel, AMD, etc.). In AWS, Amazon is responsible for the hardware upgrades and has already applied security patches to the underlying physical systems.

While your IT environment will likely be significantly more complex than the scenario described here, the principles should remain the same. Start by understanding how the Meltdown & Spectre vulnerabilities can be exploited and identify the most critical assets that your organization needs to protect based on that understanding. From there, implement remediations or mitigations based on the criticality of the system and the defined order of operations. By executing these four activities, your organization will have a solid foundation to address the Meltdown & Spectre vulnerabilities without causing a significant impact to business operations.
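
For larger environments, the ordering can be derived from the business impact analysis rather than decided by hand. The sketch below ranks asset groups by a simple risk score (business impact multiplied by exposure). The 1-5 scales, the scoring formula, and the scores assigned to the example scenario are illustrative assumptions, not measured values.

```python
from dataclasses import dataclass


@dataclass
class AssetGroup:
    name: str
    business_impact: int  # 1 (low) to 5 (high), from the business impact analysis
    exposure: int         # 1 (low) to 5 (high), how reachable the group is to an attacker


def patch_order(groups):
    """Sort highest risk first; risk here is impact multiplied by exposure."""
    return sorted(groups, key=lambda g: g.business_impact * g.exposure, reverse=True)


if __name__ == "__main__":
    # Assumed scores for the example scenario above.
    scenario = [
        AssetGroup("Corporate servers", business_impact=3, exposure=2),
        AssetGroup("Employee laptops", business_impact=3, exposure=4),
        AssetGroup("SaaS EC2 instances", business_impact=5, exposure=4),
        AssetGroup("Employee desktops", business_impact=3, exposure=3),
    ]
    for position, group in enumerate(patch_order(scenario), start=1):
        print(position, group.name)
```

With these assumed scores the result matches the order described in the scenario: the SaaS environment, then laptops, then the remaining workstations, then the corporate servers.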

Analyzing the Performance Impact of Meltdown & Spectre Patches

An unfortunate side effect of implementing the Meltdown & Spectre security patches is the reported performance impact. Intel and other IT vendors have acknowledged that the security patches provided to address the vulnerabilities are causing performance slowdowns and system load increases. For most systems the impact appears to be negligible, especially for low or moderately utilized systems, such as workstations or underutilized servers. However, many organizations will struggle with the decision to apply security patches to a system already under high load (e.g. above 80% utilization). The systems most commonly affected by load increases will be high-utilization systems that require high-frequency transactions with the CPU, such as database engines, message queues, and virtualization hosts.

In preparation for installing these security patches, or for any IT change in general, it’s important to have a reliable and detailed performance baseline of the system. A performance baseline allows the organization to assess the impact of any change, whether from security patches, application upgrades, configuration settings, or performance tuning. While most monitoring systems capture CPU load and memory utilization, these metrics alone are not sufficient to help the organization identify the true impact to the system. When assisting our clients in this process, the metrics we most commonly recommend capturing include:

System Load
CPU usage overall and per processor
CPU state details, including user, system (kernel), I/O wait, interrupt, and stolen usage
Memory utilization
Network utilization
Disk utilization
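
On Linux hosts, the CPU state breakdown in the list above can be derived from the cumulative counters in `/proc/stat`. The sketch below is a minimal, Linux-only illustration with function names of our own choosing: it parses the aggregate `cpu` line from two samples and converts the difference into the percentage of time spent in each state. Most monitoring agents already do this for you, and per-processor lines (`cpu0`, `cpu1`, ...) are ignored here for brevity.

```python
import os
import time

# Field order of the aggregate "cpu" line in Linux /proc/stat.
CPU_FIELDS = ("user", "nice", "system", "idle", "iowait", "irq", "softirq", "steal")


def parse_cpu_line(stat_text):
    """Return the cumulative tick counters from the aggregate 'cpu' line."""
    for line in stat_text.splitlines():
        parts = line.split()
        if parts and parts[0] == "cpu":
            return dict(zip(CPU_FIELDS, map(int, parts[1:])))
    raise ValueError("no aggregate 'cpu' line found")


def cpu_state_percentages(before, after):
    """Convert two cumulative samples into the percentage of time per state."""
    deltas = {name: after[name] - before[name] for name in CPU_FIELDS}
    total = sum(deltas.values())
    if total == 0:
        return {name: 0.0 for name in CPU_FIELDS}
    return {name: 100.0 * value / total for name, value in deltas.items()}


if __name__ == "__main__":
    if os.path.exists("/proc/stat"):
        with open("/proc/stat") as f:
            first = parse_cpu_line(f.read())
        time.sleep(1)
        with open("/proc/stat") as f:
            second = parse_cpu_line(f.read())
        print("load average (1, 5, 15 min):", os.getloadavg())
        for state, pct in cpu_state_percentages(first, second).items():
            print(f"{state:<8} {pct:5.1f}%")
```

The `system` and `steal` values are the ones to watch around these patches: the first shows time spent in the kernel, and the second shows time taken by the hypervisor on virtualized hosts.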

Additionally, we recommend capturing Application Performance Monitoring (APM) metrics that reflect the user experience and per-transaction statistics. APM commonly requires organizations to implement a purpose-built solution within their environment based on the application or service. In a common web application scenario, with a web server, application server, and database server, the following metrics should be captured at a minimum:

The time each request took to process at each step, both per request and averaged over an interval (e.g. 10 seconds)
The total number of requests sent to each server over a short interval (e.g. per second or per minute)
The number of requests in progress on each server, sampled at a periodic interval (e.g. every 10 seconds)
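
If a purpose-built APM solution is not yet in place, the three metrics above can be approximated from request logs. The sketch below assumes each log record for one server has been reduced to a (start, end) pair of timestamps in seconds; the record format and function names are our own assumptions. It would be run once per tier (web, application, database).

```python
from collections import defaultdict
from statistics import mean


def bucket(timestamp, width):
    """Start of the fixed-width interval that contains timestamp."""
    return int(timestamp // width) * width


def average_duration_by_interval(requests, width=10):
    """Mean request duration, grouped by the interval in which the request started."""
    grouped = defaultdict(list)
    for start, end in requests:
        grouped[bucket(start, width)].append(end - start)
    return {interval: mean(durations) for interval, durations in sorted(grouped.items())}


def request_count_by_interval(requests, width=60):
    """Number of requests that started in each interval."""
    counts = defaultdict(int)
    for start, _end in requests:
        counts[bucket(start, width)] += 1
    return dict(sorted(counts.items()))


def in_flight_at(requests, sample_time):
    """Number of requests being processed at the moment sample_time."""
    return sum(1 for start, end in requests if start <= sample_time < end)


if __name__ == "__main__":
    # Made-up sample data: (start_seconds, end_seconds) for one server.
    sample = [(0.0, 1.0), (2.0, 4.0), (9.0, 12.0), (11.0, 11.5), (61.0, 62.0)]
    print(average_duration_by_interval(sample))
    print(request_count_by_interval(sample))
    print(in_flight_at(sample, 11.2))
```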

Having this information readily available before and after implementing the security patches will increase the organization’s ability to analyze the impact to system performance. Applying the security patches to a non-production staging environment and executing performance load tests allows your organization to assess whether the impact of the patches on system performance is negligible, slightly noticeable, or a red flag to stop the implementation process for that system. Stopping the implementation because of a red flag should not be considered a permanent solution, however. The organization should continue to monitor for new patches from the applicable IT vendor or, if appropriate, consider investing in scaling the system to address its existing high utilization.
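
The before/after comparison can be reduced to a small calculation, sketched below. The three categories come from the paragraph above, but the thresholds (5% and 20%), the metric names, and the sample values are assumptions for illustration; each organization would set its own thresholds per system. The sketch assumes metrics where a higher value is worse, such as latency or CPU time.

```python
def percent_change(baseline, patched):
    """Relative change from baseline, in percent (positive means an increase)."""
    if baseline == 0:
        raise ValueError("baseline must be non-zero to compute a relative change")
    return 100.0 * (patched - baseline) / baseline


def classify_impact(change_pct, noticeable_at=5.0, red_flag_at=20.0):
    """Map a percentage increase to one of three categories (assumed thresholds)."""
    if change_pct >= red_flag_at:
        return "red flag"
    if change_pct >= noticeable_at:
        return "slightly noticeable"
    return "negligible"


def compare(baseline, patched):
    """Return {metric: (percent change, category)} for metrics present in both runs."""
    report = {}
    for metric in baseline.keys() & patched.keys():
        change = percent_change(baseline[metric], patched[metric])
        report[metric] = (change, classify_impact(change))
    return report


if __name__ == "__main__":
    # Made-up load test results from a staging environment.
    before = {"avg_request_ms": 120.0, "cpu_system_pct": 10.0, "load_1m": 2.0}
    after = {"avg_request_ms": 123.0, "cpu_system_pct": 13.0, "load_1m": 2.2}
    for metric, (change, category) in sorted(compare(before, after).items()):
        print(f"{metric:<16} {change:+6.1f}%  {category}")
```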

As an additional reference, Mike Newswanger, a Site Reliability Engineer at Stack Overflow, recently published a great write-up analyzing the impact of the security patches within their testing environment.

The processes and recommendations outlined in this blog post are part of a mature IT organization with good cybersecurity hygiene. Organizations that have mature processes in place are already practicing most, if not all, of these measures. For organizations that do not have well-practiced processes in place, the suggestions above should help improve the organization’s cybersecurity posture and practices. This guidance is not specific to Meltdown & Spectre; it should be considered a set of recommendations to help the organization be better prepared for the vulnerabilities we know are to come.

Our Cybersecurity team assists clients in remediation planning and execution. If you’ve already been compromised, we have threat hunting capabilities. Contact us today to have a discussion. We are here to help.