Zero Trust Security: Why zero trust matters (and for more than just access)

Abstract

One of the major themes in network and application access in the past few years has been the concept of “zero trust.” In this paper, we show how, at its core, zero trust can be characterized by a small set of simple beliefs that can be applied not only to access but also more broadly across the cybersecurity space.

This paper presents a framework to encompass the broad concepts around zero trust, relating them to the existing business backdrop that motivates today’s application security business leaders. Finally, this paper provides a characterization of the zero-trust belief system – the driver for the tools and security implementations that address the current and emerging threats, which will be the focus of a follow-on paper.

This paper’s goals are twofold: first, to convey the security mindset that application business leaders should embrace, and second, to introduce a framework for security practitioners that we will expand on in future white papers.

ZTS: Start with principles, not technologies

The term “zero trust” is associated with a number of different concepts at various layers of abstraction: sometimes as a specific solution architecture, sometimes as a prescription to apply specific technologies, and sometimes as a property of a product. We believe zero trust security is, at its core, a mindset – a belief system – from which techniques and tactics emerge. Those techniques leverage specific technologies, which can then be applied to address a broad spectrum of security threats.

The belief system that underlies zero trust security (ZTS) can be viewed as the next step in a security mindset evolution that started with “defense in depth” years ago. Defense in depth is predicated on the belief that while any single defensive screen has some small, but real, likelihood of breach, the addition of increased layers of protection for key assets reduces that likelihood, slowing down the attackers while also forcing them to use more resources for a successful attack.

Zero trust is a maturing of this mindset in three dimensions:

First, it evolves the premise of protection from being a set of simple “filters” gating access to the application to a more asset-appropriate set of protections contextually applied to any system transaction. The mindset of each transaction is to understand: Who is attempting what action on whom. The “who” and “whom” can be any component within the system or using the system, whether human, automated, or a piece of code. The concept of identity is key for who and whom, and the concept of privileges granted to an identity is used to codify what can be done.
Second, it considers the security decision assessment not as being static and absolute, but instead as more dynamic and adaptive, to be continuously reassessed in light of observed behaviors. The decision about the disposition of each transaction – whether to allow the interaction, block it, build additional confidence, and so on – must also consider the broader business context and, specifically, the risk-versus-reward tradeoff.
Finally, it takes as a given that, no matter how many layers of protection are provided, a sufficiently motivated or fortunate adversary will either breach or bypass the protections. Therefore, the timely detection of any compromise and the mitigation of attempted exploits must also be a key part of the overall strategy.

This evolution has been, in part, the result of time and experience, but it is increasingly driven by the confluence of two other application security trends. Specifically, today’s application architectures open up a much larger potential space of application vulnerabilities – notably threats from “inside” the application infrastructure – which are open to exploitation by the increased sophistication of today’s adversaries. Fortunately, concurrent advances in security tools, especially when they are used in conjunction with modern data collection and analysis capabilities, have enabled practical implementation of key components of the defensive strategy.

The remainder of this introduction presents an overview of the framework for our approach to zero trust security as well as its scope. From there, the following sections expand on the problem statement and its relationship to other contemporary approaches to zero trust, leading to a discussion of the core beliefs – the “why” – that should guide the choice of solution technologies and their application. Future papers will dive into the “how” – the requirements placed upon specific technologies such as authentication, access control, and AI-assisted analytics as they relate to the zero-trust maturity model.

ZTS: It starts with why

Simon Sinek’s “Start with Why” approach is an effective tool for communicating the ZTS framework, as shown in Figure 1 below. You can view this graphic from the outside-in, starting with the classes of threats that ZTS addresses, then drilling down into the security methods used, and finally distilling it all down to common principles. Or, you can view it from an inside-out perspective, starting with the core beliefs as the north star, from which the appropriate tools and techniques emerge, which then enable your ability to address a broad range of threats.

Figure 1 -- Zero trust security: Belief system, tools, and benefits.

Later sections dive into each concentric layer of this diagram, but in brief:

The core beliefs of the approach stem from the framing of the use-case as:
“Who is attempting What (action) to Whom?”
- To establish Who and Whom, explicitly verify any attestation of identity.
- Once you establish Who, use the principle of least privilege to limit What can be performed.
- Continually assess and reassess all three factors of Who, What, and Whom, and continue to validate against the policy constraints.
- Always assume that breaches and compromise will occur. Be prepared to intervene if the action (What to Whom), based on a risk-versus-reward analysis (the likelihood of fraud in the authentication, weighted by the business impact of an erroneous transaction approval), exceeds a pre-determined acceptable safety threshold.
The methods used are:
- Authentication and access control to determine Who at some level of confidence, and to then limit the privileges around What (actions) should be permitted by that identity with respect to a specific Whom target.
- Visibility and ML-assisted analysis to observe and continuously assess the full context of each transaction – Who doing What to Whom.
- Automated risk-aware remediation to intervene in a transaction when the risk-reward level rises above the specified tolerance threshold.
Address a wide range of cyberattacks by applying these methods:
- Traditional attacks such as defacement or data exfiltration – You can address these attacks by detecting identity compromise or privilege escalation, using the techniques of authentication and access control to limit Who can do What by policy.
- Availability/DDoS attacks -- Address these attacks by combining authentication and access control with risk-aware remediation. For example, you block (or rate-limit) access to unauthenticated or poorly authenticated actors attempting resource-intensive transactions.
- Advanced threats, such as ransomware or supply chain attacks -- You can address these threats by detecting changes and anomalies in the behaviors of Who is doing What to Whom, again coupled to risk-aware remediation.
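
To make the availability example above more concrete, the following is a minimal sketch of an admission check that blocks or rate-limits poorly authenticated actors attempting resource-intensive transactions. The function name, the cost units, the confidence cut-offs (0.8 and 0.5), and the budgets (100 and 10) are all assumptions chosen for illustration; they are not values given in this paper.

```python
def admission_decision(confidence: float, request_cost: int, spent_in_window: int) -> str:
    """Decide how to treat one request from an actor (illustrative policy only).

    confidence:      0.0-1.0 confidence in the actor's identity (Who)
    request_cost:    assumed cost units of the requested action (What)
    spent_in_window: cost units the actor already consumed in the current window
    """
    # Assumption: well-authenticated actors are not throttled in this sketch.
    if confidence >= 0.8:
        return "allow"
    # Assumption: the cost budget shrinks as confidence in the identity drops.
    budget = 100 if confidence >= 0.5 else 10
    if request_cost > budget:
        return "block"        # a single request is too expensive for this actor
    if spent_in_window + request_cost > budget:
        return "rate_limit"   # the actor has used up its budget for this window
    return "allow"
```

A real deployment would likely derive the budgets from measured capacity rather than fixed constants.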

The scope of ZTS

Zero trust security extends holistically across the full application, infrastructure, and delivery stack, and should span the complete development pipeline. More specifically, this should include:

Complete “top-to-bottom” application and infrastructure stack
- Hardware compute substrate, including servers, network interface cards, co-processors, and system board components.
- Low-layer firmware and BIOS for the hardware.
- The lowest layer operating system, such as the hypervisor or runtime executive.
- The application level/container operating system.
- Imported third-party application components, either commercial or open source.
- Any in-house developed or bespoke application software.
Full “left-to-right” application delivery stack
- The supply chain used for ongoing maintenance and upgrades of each “top-to-bottom” stack element.
- Any inline application delivery components, such as firewalls, load balancers, API gateways, ingress/egress/mesh controllers, and inline fraud mitigation devices.
- Any application delivery components that indirectly affect traffic handling, such as Domain Name System (DNS) resolvers, or that indirectly receive metadata, such as security information event management (SIEM) solutions or federated identity services.

Traditionally, the focus of zero trust has been directed toward application development and delivery teams, often captured as the personas of Dev, DevOps, DevSecOps, and SRE. We point this out primarily to note the larger picture: a comprehensive approach to zero trust should ideally include non-traditional personas and infrastructure as well as additional workflows, such as the supply chain procurement strategy.

Problem Statement

One of the top priorities for CIOs and CISOs is to modernize information technology to help accelerate the business. They also play a key role in corporate risk governance. Their goal is to help the business delight customers with new digital experiences, increase operational efficiency via digital connectivity with third parties, and let employees be productive from anywhere, while minimizing business risk. This requires the CIO and CISO organizations to free their workforce to use the latest technologies, architectures, and best practices for agility while simultaneously tasking these same individuals with taking appropriate security measures and establishing controls over people’s behavior, the information they access, and the technology they use to prevent loss exposure.

Organizations must understand and control risks related to changes in market and macro-economic conditions, consumer behavior, supply chain, and security exposures. An accurate assessment of risk and the ability to take rapid mitigation action is a competitive advantage for businesses. In this paper we focus on the risk of security exposures, which among other things can cause loss of intellectual property, disruption of business processes, loss of personally identifiable information, and fines from regulators. Security risk can be computed by evaluating the following aspects of a situation under consideration:

Value of assets involved
Assets that are involved in transactions have different levels of importance. As an example, a database consisting of personally identifiable information is more valuable than a database that lists retail locations that carry products made by the enterprise. It is possible for security and IT teams to use a deterministic process for creating an inventory of all the assets, categorized by a score representing the “value” of the asset.
Impact of compromise
The nature of the compromise determines the impact related to it. For example, the impact of theft of the information in the datastore containing personally identifiable information is much higher than that of a disruption in the availability of the datastore. While somewhat more nebulous, it is possible to list various types of compromises and assign them an “impact” score.
Likelihood of compromise
The probability that a transaction will lead to compromise of an asset is the least deterministic factor in assessing risk associated with the transaction. Humans are not able to deal with the scale of required observations, so organizations employ machine learning and artificial intelligence to increase confidence in their computation of the probability of compromise.

The challenge at hand is to compute risk associated with every single transaction in near real time and take appropriate mitigation action to control the impact of potential compromise.
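
As a rough illustration of how the three factors above might be combined into a per-transaction risk number, consider the sketch below. The paper does not prescribe a formula: the 0.0-to-1.0 scales, the multiplicative combination, the default threshold, and the names `TransactionRisk` and `needs_mitigation` are all assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TransactionRisk:
    """Hypothetical scoring inputs for one transaction, each on a 0.0-1.0 scale."""
    asset_value: float  # from the asset inventory (e.g., a PII database scores high)
    impact: float       # from the catalog of compromise types (e.g., theft > outage)
    likelihood: float   # estimated probability of compromise (e.g., from an ML model)

    def risk_score(self) -> float:
        # Assumed combination rule: a simple product of the three factors.
        for name in ("asset_value", "impact", "likelihood"):
            value = getattr(self, name)
            if not 0.0 <= value <= 1.0:
                raise ValueError(f"{name} must be between 0.0 and 1.0")
        return self.asset_value * self.impact * self.likelihood


def needs_mitigation(risk: TransactionRisk, threshold: float = 0.2) -> bool:
    """Return True when the score exceeds an assumed tolerance threshold."""
    return risk.risk_score() > threshold


# Example inputs (invented values): theft from a PII store vs. reading a store list.
pii_theft = TransactionRisk(asset_value=0.9, impact=0.9, likelihood=0.5)
store_list_read = TransactionRisk(asset_value=0.1, impact=0.2, likelihood=0.5)
```

With these invented inputs, the PII transaction exceeds the threshold and the store-list transaction does not. The difficult part in practice is estimating the likelihood term quickly enough to act on it.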

Recognizing the problem

Business acceleration demands lead to new practices that exacerbate cybersecurity risk. We discuss this point in more detail below.

Figure 2 -- Business pressures drive new practices, which leads to enhanced risk.

Business acceleration demands
1. Digital experiences
  Consumers have an insatiable desire for digital experiences and they demand frequent improvements on multiple platforms (PC, tablet, mobile phones).
2. Connected business
  Digital business processes require connectivity with partners, vendors, distributors, and services provided by other businesses.
3. Mobile workforce
  Employees need to be able to access business applications from anywhere for execution efficiency.
Practices for meeting business demands
1. Agile development methodology
  Enterprises switched to the incremental, iterative Agile development approach with a strong focus on customer satisfaction, instead of the sequential Waterfall approach that focuses on timely project delivery.
2. Use of ready-made software
  To deliver new digital experiences as quickly as possible, developers reuse code that is open-sourced by their colleagues in the industry.
3. Contract development
  Outsourcing work to contract developers helps enterprises scale their workforce on demand.
4. Use of cloud infrastructure
  The agility, flexibility, scalability, and ease of use of the cloud allow developers to obtain needed infrastructure on demand.
5. Adoption of SaaS
  Software as a service (SaaS) is advantageous for business operations applications as this reduces a significant burden of procuring, deploying, and maintaining such applications in private data centers.
6. Microservices architecture
  Teams use microservices to achieve continuous delivery with faster time to market.
7. Distributed application components
  Organizations achieve efficiency by running microservices in infrastructure that offers the best tools to develop or deliver the functionality of the service. Recent extensions to legacy workflows are done using modern application architectures that need to communicate with the legacy elements, and the two often run on different infrastructure.
8. Open APIs
  An ecosystem of various services is more efficient than an enterprise developing all the services on its own.
9. Network connectivity with third parties
  Enterprises connect with their partners, vendors, and distributors using encrypted tunnels to automate and streamline business processes.
Enhanced security risk
1. Increased attack surface
  Use of third-party software and open-source libraries creates opportunities for supply-chain attacks, and all of the vulnerabilities of externally developed software are inherited. Contract developers are motivated to finish feature functionality on time and security is not their concern. In addition, once they deliver the software and exit the project, it is difficult and time-consuming for internal teams to go through the code and find security holes. Agile sprints are very efficient in delivering feature functionality, but the increased velocity of development does not leave much time for detailed security audits and testing. One vulnerable microservice, which has the ability to talk to other microservices, increases security risk to the entire system. Although interconnected businesses improve operational efficiency, a serious consequence is that security holes in any one of them impact them all. Attackers find the weakest link and use it to proliferate through the rest. A vulnerability or lapse in security in the SaaS platform or cloud infrastructure becomes an additional attack vector, and the shared responsibility model can complicate early detection and remediation.
2. Increased architectural complexity
  Distributed application elements, developed using varying architectures and deployed on multiple infrastructures, have differing security and compliance needs. This makes it complicated for security teams to design and deploy appropriate mechanisms for securing enterprise applications and data, and to meet regulatory compliance requirements.
3. Well-funded, highly motivated, and skilled hackers
  From Operation Aurora of 2010 to Solargate of 2020, we have had a decade of advanced cyberattacks that are discovered only after they have caused great harm. Breaches occurred at organizations equipped with the best security technology, operated by highly trained security teams. Attackers persisted in the IT infrastructure of these organizations for long periods before they were detected. Intellectual property was stolen, personally identifiable information was stolen, business operations were disrupted, and organizations were held hostage to ransomware, even as IT and security teams went above and beyond complying with regulations designed to keep cyberattacks at bay.

US government directives

After a barrage of persistent cyberattacks against various US federal, state, and local government agencies, and several US corporations, President Joe Biden issued an executive order on improving the nation’s cybersecurity on May 12, 2021. One of the key elements of this order is for government agencies to use the principles of zero trust to prepare for cyberattacks. The Biden Administration followed this order with a memorandum from the Office of Management and Budget (OMB) for the heads of executive departments and agencies on January 26, 2022. This memorandum from OMB “sets forth a Federal Zero Trust architecture (ZTA) strategy, requiring agencies to meet specific cybersecurity standards and objectives by the end of Fiscal Year (FY) 2024 in order to reinforce the Government’s defenses against increasingly sophisticated and persistent threat campaigns.”¹ Both the executive order and the OMB memorandum refer to the zero trust architecture described in the National Institute of Standards and Technology (NIST) publication SP 800-207, which was published in August 2020, and require government agencies to follow it.

Zero trust architectures and maturity models

Thought leaders in government agencies and the private sector have published papers that help to explain the zero-trust architecture and offer advice on how best to adopt it. We summarize ideas contained in these papers below. We note that the critical essence of ideas and suggestions contained in these papers is to examine every transaction to assess Who is attempting What action on Whom, and to make a real-time decision to allow or disallow the transaction based on computation of the risk associated with it.

National Institute of Standards and Technology (NIST): Zero trust architecture

The NIST team enumerates the tenets of zero trust and defines an abstract zero trust architecture (ZTA) in their paper NIST SP 800-207.² Further, they discuss variations of zero trust approaches and describe different ways to deploy the abstract architecture.

Key abstractions discussed in the paper are the Policy Decision Point (PDP) and the Policy Enforcement Point (PEP) which work in conjunction with each other. The Policy Decision Point is composed of the Policy Engine (PE) and the Policy Administrator (PA). The Policy Engine uses a trust algorithm to make decisions on whether access to a resource should be granted to a subject. This decision is executed by the Policy Administrator, which communicates with the Policy Enforcement Point to allow or disallow a new session and to modify or terminate an existing session. In addition, the paper discusses variations of the trust algorithm, network components for a zero-trust environment, and various use cases or deployment scenarios. Finally, there is a consideration of ways in which the zero-trust architecture can be thwarted by attackers, such that implementors are mindful of the threats and take appropriate measures to protect their zero trust architecture components.
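
The division of labor among these abstractions can be sketched in a few lines. This is a toy model, not the NIST reference design: the class and method names are hypothetical, and the trust algorithm is reduced to an assumed per-resource confidence threshold purely to keep the example short.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AccessRequest:
    subject: str
    resource: str
    confidence: float  # confidence in the subject's identity, 0.0-1.0 (assumed scale)


class PolicyEngine:
    """Makes the grant/deny decision; the trust algorithm lives here."""

    def __init__(self, min_confidence):
        self._min_confidence = dict(min_confidence)  # resource -> required confidence

    def decide(self, request: AccessRequest) -> bool:
        required = self._min_confidence.get(request.resource)
        # Resources with no policy entry are denied by default.
        return required is not None and request.confidence >= required


class PolicyEnforcementPoint:
    """Sits in the data path and opens or closes sessions when instructed."""

    def __init__(self):
        self.sessions = set()  # set of (subject, resource) pairs

    def allow(self, subject: str, resource: str) -> None:
        self.sessions.add((subject, resource))

    def terminate(self, subject: str, resource: str) -> None:
        self.sessions.discard((subject, resource))


class PolicyAdministrator:
    """Executes the engine's decision by commanding the enforcement point."""

    def __init__(self, engine: PolicyEngine, pep: PolicyEnforcementPoint):
        self._engine = engine
        self._pep = pep

    def handle(self, request: AccessRequest) -> bool:
        granted = self._engine.decide(request)
        if granted:
            self._pep.allow(request.subject, request.resource)
        else:
            self._pep.terminate(request.subject, request.resource)
        return granted


pep = PolicyEnforcementPoint()
pa = PolicyAdministrator(PolicyEngine({"payroll-db": 0.9}), pep)
```

Calling `pa.handle(...)` again for the same subject with a lower confidence terminates the existing session, which mirrors the "modify or terminate an existing session" behavior described above.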

Cybersecurity and Infrastructure Security Agency (CISA): Zero trust maturity model

With the goal of assisting agencies in developing their zero trust plans, thought leaders at the Cybersecurity and Infrastructure Security Agency (CISA) published a zero trust maturity model (https://www.cisa.gov/zero-trust-maturity-model).³ The work builds on the abstract zero trust architecture described in the NIST SP 800-207 paper. The authors identify five areas and recommend that organizations make consistent progress in following the principles of zero trust in each area. The areas are (i) identity, (ii) device, (iii) network, (iv) application workload, and (v) data. They recommend the use of visibility and analytics, and automation and orchestration, in each of the five areas.

The document offers a high-level maturity model across all five foundational pillars of zero trust identified earlier, and then digs deeper into each area. Organizations can use the maturity model to understand their current state and set an iterative course toward the optimal state.

Department of Defense (DOD): Zero trust reference architecture

Following the discovery of the SolarWinds attacks, the National Security Agency (NSA) issued guidance advising cybersecurity teams to adopt a zero trust security model in their paper “Embracing a Zero Trust Security Model.”⁴ Experts from the joint Defense Information Systems Agency (DISA) and National Security Agency zero trust engineering team authored the Department of Defense (DOD) zero trust reference architecture.⁵ The authors express the need for adoption of zero trust with the following statement: “Vulnerabilities exposed by data breaches inside and outside DOD demonstrate the need for a new and more robust cybersecurity model that facilitates well informed risk-based decisions.”⁶

This reference architecture bases its recommendations on the abstractions defined in the NIST SP 800-207 zero trust architecture document. The document presents a high-level operational model and describes its elements in detail. It also includes a high-level maturity model and the mapping of capabilities for applying principles of zero trust to various areas of interest. Collectively these documents help organizations assess their current state and devise a plan.

Managing risk and governance with zero trust principles

Having an attitude that “assumes breach” and following zero-trust principles can help organizations in their risk-management and governance practices. The zero-trust principles of “continuous monitoring” and “explicit verification” of actors involved in transactions allow organizations to build a dynamic risk score for all actors and their activities. This is in line with Gartner’s recommendation to use “continuous adaptive risk and trust assessment” (CARTA) methodology for enhancing existing security programs. Adopting the principle of only allowing least privilege access to resources reduces the risk of loss even in the wake of a breach. Information generated by continuous monitoring and explicit verification is also useful in generating compliance reports.

Mindset in more detail

This section is intended to focus on the mindset – the belief system, the “Why” – that motivates the strategy and decisions around the tools and design that an enterprise should adopt for zero trust security. In fact, one can distill the impetus for all the methods and component technologies employed by today’s zero trust solutions to four simple key principles. This section opens with an enumeration of these principles, followed by a discussion of how they can be applied within the broader context of the application development lifecycle – that is, how to consider these principles up-front, in the design phase, as well as operationally, in the deployment/run-time of the application.

Zero-trust operational principles

If zero trust solutions are understood as fundamentally building trust in system interactions – “Who is doing What to Whom?” – then the building of an interaction-appropriate level of trust can be broken down into four components.

Explicitly verify

As the name suggests, the zero-trust mindset is one of, “Don’t blindly trust, always verify.” Therefore, any actor in the system – the Who and Whom in system interactions – must be able to:

Attest to its identity, including the special case of an “anonymous” identity, if the system allows interactions sourced by or destined to anonymous identities – such as anonymous browsers in an airline reservation system. For all other identities, the actor must be able to state Who they are, in a namespace that the system accepts.
Receive and validate collateral “proof” of the actor’s attested identity (for any non-anonymous attested identity). The nature of the “proof” – passwords, certificates, behaviors, etc. – may vary, as may the strength of that proof, but the system must be able to ascertain some degree of confidence in the attestation. Simple systems may have a binary yes/no view of confidence in the proof, whereas more advanced systems may have a numeric confidence score that can be explicitly referenced as part of a risk-reward based policy. Note that the system may also increase or reduce confidence through other means, such as response to an active challenge or even from passive observation of the actor’s behavior.
Evaluate and adjust the confidence in the attested identity, based on additional contextualization of the current behavior relative to past behaviors. For example, the system may maintain historic metadata about the actor, such as the actor’s previous and current geolocation, hardware platforms, IP addresses, reputation, and so on, to be used with the goal of increasing or decreasing confidence in the “proof” of identity.
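
The third step, adjusting confidence using historic context, might look something like the sketch below. The `ActorHistory` structure, the choice of country and device as signals, and the specific weights are hypothetical; a production system would more plausibly learn such adjustments from data than hard-code them.

```python
from dataclasses import dataclass, field


@dataclass
class ActorHistory:
    """Hypothetical record of contexts previously observed for an actor."""
    known_countries: set = field(default_factory=set)
    known_devices: set = field(default_factory=set)


def adjusted_confidence(base: float, history: ActorHistory,
                        country: str, device: str) -> float:
    """Raise or lower the confidence produced by the credential check.

    base is the 0.0-1.0 confidence from the "proof" of identity; the result
    is clamped to the same range.
    """
    score = base
    # Assumed weights: familiar context adds a little, unfamiliar context removes more.
    score += 0.05 if country in history.known_countries else -0.20
    score += 0.05 if device in history.known_devices else -0.15
    return max(0.0, min(1.0, score))
```

The asymmetry in the weights reflects one possible design choice: an unfamiliar context is treated as stronger evidence against the attestation than a familiar context is evidence for it.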

In addition, the principle of explicitly verify can be applied not only to identity, but also action – the What of the transaction. A common use-case is where the action is expressed as an API call, such as an API to perform a database query. In this example, the explicitly verify principle can be used not only to have confidence in the identity of the actor calling the API, but also to verify the correctness of the use of the API action, such as verifying that the parameters passed to the API conform to the appropriate constraints.
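
As a small illustration of verifying the What, the sketch below checks the parameters of a hypothetical database-query API against explicit constraints before the call is allowed to proceed. The parameter names (`table`, `limit`), the allowed tables, and the row cap are invented for the example.

```python
ALLOWED_TABLES = {"flights", "airports"}  # hypothetical set of queryable tables
MAX_ROWS = 500                            # hypothetical cap on result size


def verify_query_params(params: dict) -> list:
    """Return a list of constraint violations; an empty list means the call is well-formed."""
    errors = []

    table = params.get("table")
    if not isinstance(table, str) or table not in ALLOWED_TABLES:
        errors.append(f"table {table!r} is not queryable")

    limit = params.get("limit")
    # bool is a subclass of int in Python, so reject it explicitly.
    if not isinstance(limit, int) or isinstance(limit, bool) or not 1 <= limit <= MAX_ROWS:
        errors.append(f"limit must be an integer between 1 and {MAX_ROWS}")

    unexpected = set(params) - {"table", "limit"}
    if unexpected:
        errors.append(f"unexpected parameters: {sorted(unexpected)}")

    return errors
```

Rejecting unexpected parameters outright, rather than ignoring them, keeps the check consistent with a default-deny posture.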

In a mature zero trust mindset, “proof” is almost never absolute. Identity credentials can be stolen and devices can be compromised; API parameter constraints are often incomplete. Therefore, the term “confidence” in the zero-trust context should be interpreted as more of an indication of how likely it is that the attested identity or transaction parameters are legitimate. Thus, the “confidence” level should be viewed as one key factor, but not the only one, in the decision to allow, block, or further inspect a transaction.

Least privilege

Once an acceptable level of “confidence” is established in the actors participating in a transaction, the zero-trust approach requires that the actor (typically, the requestor, the “Who”) be granted only the minimum set of privileges required for it to be able to do what it is designed to accomplish in that system. This principle embodies what is also referred to as a “positive security model” – an approach where all actions are disallowed by default, with specific privileges being granted only as required for system operation. For example, a reservation system may allow anonymous users to browse flight schedules but – as per the application design requirements – an anonymous user may not be permitted to book a flight.

These privileges may apply to individual actors, or to classes of actors. Typically, application consumers are either human actors or proxies for humans, and are grouped in classes, such as “anonymous users,” “customers,” “partners,” or “employees.” The less trusted classes of actors typically require a lower confidence threshold to pass authentication, but they also have access to fewer or less-sensitive APIs. Actors internal to the application or its infrastructure, such as specific services or containers within an application, may often have more customized privileges, such as “only containers <X> and <Y> can access the database, only <X> can write to the object store, but <Y> and <Z> can read from it.”
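
The container example above maps naturally onto an explicit allowlist in which anything not listed is denied. The sketch below assumes a simple (Who, What, Whom) triple as the unit of privilege; the identifiers are placeholders standing in for <X>, <Y>, and <Z>.

```python
# Allowlist of (who, what, whom) triples. Anything absent is denied by default,
# which is the essence of a positive security model.
GRANTS = {
    ("container-x", "access", "database"),
    ("container-y", "access", "database"),
    ("container-x", "write", "object-store"),
    ("container-y", "read", "object-store"),
    ("container-z", "read", "object-store"),
}


def is_permitted(who: str, what: str, whom: str) -> bool:
    """Return True only if this exact interaction has been explicitly granted."""
    return (who, what, whom) in GRANTS
```

Note that `container-x` can write to the object store but cannot read from it, because a read was never granted; no privilege is inferred from another.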

An important consideration for the implementation of least privilege is to strive to apply it in a more tailored manner, with forethought.⁷ Specifically, rather than applying a generic set of privileges across all actors, or a class of actors, a mature zero trust implementation should have a more granular view of What action is being attempted. For example, at a very coarse grain, “filesystem access” may be the privilege granted, but “filesystem read” is a tighter specification of the true privilege required, which yields a higher-quality implementation of the positive security model.

Finally, more sophisticated embodiments of least privilege can work in conjunction with mature implementations of “explicitly verify,” by viewing authorized privileges for an actor not as absolute, but instead as being predicated on the authentication-supplied confidence. Thus, privileges are granted only if the confidence in the attested identity (Who) meets a minimum threshold, with the thresholds being specific for the action that is being attempted. For example, some particularly impactful action (e.g., shutting down the system) may require 90% or higher confidence that the actor is an administrator. In this example, if the system’s confidence in the identity is only 80% when the shutdown is attempted, the system would require additional verification to boost the confidence score in the attested identity before allowing the action.
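
The shutdown example can be expressed as a small decision function. This is a sketch only: the 0.90 threshold comes from the example above, while the `Decision` enum, the second table entry, and the function name are assumptions added for illustration.

```python
from enum import Enum


class Decision(Enum):
    ALLOW = "allow"
    STEP_UP = "step_up"  # request additional verification before allowing
    DENY = "deny"


# (role, action) -> minimum confidence in the attested identity.
REQUIRED_CONFIDENCE = {
    ("administrator", "shutdown_system"): 0.90,  # threshold from the example in the text
    ("administrator", "view_status"): 0.60,      # assumed value for a low-impact action
}


def authorize(role: str, action: str, confidence: float) -> Decision:
    required = REQUIRED_CONFIDENCE.get((role, action))
    if required is None:
        return Decision.DENY      # privilege was never granted to this role
    if confidence >= required:
        return Decision.ALLOW
    return Decision.STEP_UP       # privilege exists, but confidence is too low
```

An administrator at 80% confidence who attempts a shutdown receives `Decision.STEP_UP` rather than a flat denial, which matches the behavior described above.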

Continuously assess

Explicit verification and least privilege are key concepts; however, the principle of continuously assess captures the fact that those principles must be continuously evaluated, in the sense that:

All transactions must be subject to verification and privilege. In other words, it should not be the case that only a subset of transactions are subject to inspection – such as “the first transaction in a user session” or “those transactions that are initiated via Docker container workload.” While this may sound self-evident, many zero trust implementations do not validate all transactions, either from poor design or because of lack of visibility. Common shortcomings in this area arise from only applying zero trust to external actors, but not internal ones, and/or assuming a zero trust verdict remains true for an extended period of time.
The key factors in the assessment – the confidence in the identity of the actor and the rights allowed for that actor – must be viewed as dynamic and subject to change. For example, the confidence in an identity may increase or decrease over time, based on observed behaviors and sideband metadata, and any confidence-based policy must therefore also dynamically adjust to changing confidence. In addition, a privilege granted by policy earlier may be revoked later, or the minimum confidence required for some action may change. While the timescales for policy changes will vary – they may be slow (at timescales of human operational processes) or fast (via automated governance agents) – the system should be capable of dynamically responding to and honoring those changes.
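
One way to picture both points is a session that re-evaluates every transaction against the current confidence and the current policy, with no verdict cached from an earlier check. The `LivePolicy` and `Session` classes below are hypothetical names for a minimal sketch under that assumption.

```python
class LivePolicy:
    """Hypothetical policy store whose requirements may change at any time."""

    def __init__(self):
        self._min_confidence = {}  # action -> minimum confidence required

    def grant(self, action: str, min_confidence: float) -> None:
        self._min_confidence[action] = min_confidence

    def revoke(self, action: str) -> None:
        self._min_confidence.pop(action, None)

    def permits(self, action: str, confidence: float) -> bool:
        required = self._min_confidence.get(action)
        return required is not None and confidence >= required


class Session:
    def __init__(self, actor: str, confidence: float):
        self.actor = actor
        self.confidence = confidence

    def observe(self, delta: float) -> None:
        """Apply a confidence change derived from observed behavior or metadata."""
        self.confidence = max(0.0, min(1.0, self.confidence + delta))

    def attempt(self, policy: LivePolicy, action: str) -> bool:
        # Evaluated on every transaction; nothing is remembered from earlier attempts.
        return policy.permits(action, self.confidence)
```

Because `attempt` consults the live policy on each call, a revocation or a drop in confidence takes effect on the very next transaction rather than at the next login.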

Assume breach

The final principle is rooted in the presumption of highly motivated adversaries against a backdrop of healthy paranoia. Specifically, the premise is “assume you have been breached,” where “breach” is defined as “a transaction that should have been denied (with perfect knowledge and execution) but was instead permitted.” The root cause of this escape may be imperfect knowledge (e.g., an incorrect high confidence score from authentication stemming from undetected fraudulent credentials), or may be implementation limitations (e.g., not having sufficiently fine-grain specificity of granted privileges, such as having “open network connection” as a privilege, but without the ability to differentiate based on geolocation of the network destination), or may be caused by an incomplete implementation of zero trust (e.g., not applying zero trust to vulnerable open-sourced components used internally).

Therefore, the zero-trust mindset must also address how to best manage/minimize the impact of such breaches.

The implementation of this principle varies more than the other principles, but generally manifests as:

First, identify any transactions that, although technically allowed by policy, are suspicious. “Suspicious” is often very contextual, but common interpretations are: (a) anomalous transactions very different from past observed behaviors, (b) groups of transactions that are individually normal, but are collectively unusual – for example, reading and writing a file may be common, but reading and writing at a rate of 1000 files per second may be unusual, or (c) actions that are correlated with an undesired and not previously seen system impact – an example would be a pattern where a specific transaction opens a network connection to a TOR node or causes the system CPU load to increase significantly.
Then, perform some deeper analysis, either an entirely automated or a hybrid human-controlled/ML-assisted workflow, to determine if those transactions are invalid. If those transactions are then determined to be invalid, apply mitigation. This can take the form of either a general policy update or, as an additional layer of mitigation, a “backstop” for the other policy-driven mitigations.
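To illustrate interpretation (b) above, where individually normal transactions become collectively unusual, the following is a minimal sketch of a sliding-window rate check. The `RateMonitor` class and its parameters are hypothetical; a real detector would typically compare against a learned baseline rather than a fixed limit, and the sketch assumes timestamps arrive in non-decreasing order.

```python
from collections import deque
from typing import Deque


class RateMonitor:
    """Flags an actor whose transaction count within a sliding window exceeds a limit."""

    def __init__(self, max_events: int, window_seconds: float) -> None:
        self.max_events = max_events
        self.window_seconds = window_seconds
        self._timestamps: Deque[float] = deque()

    def record(self, timestamp: float) -> bool:
        """Record one transaction; return True if the recent rate looks suspicious."""
        self._timestamps.append(timestamp)
        # Drop events that have aged out of the window.
        while timestamp - self._timestamps[0] >= self.window_seconds:
            self._timestamps.popleft()
        return len(self._timestamps) > self.max_events
```

A flagged result here is only the first step described above: it marks the transactions for deeper analysis, not for immediate blocking.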

If the approach chosen is to use policy-based adjustments, the adjustments may be applied by leveraging any of the available static policy tools. Examples of policy-based adjustments would be to restrict the transaction-granular access control privileges (i.e., no longer allow Who to do What to Whom) or to instead apply a stricter “standard of proof” (i.e., require MFA or a higher confidence score) for an actor (or class of actors) to take a specific action.

If, instead, the approach of using an additional “backstop” layer is chosen, this strategy can also be implemented as either fine-grain or coarse-grain. The most fine-grained strategy would be to block exactly and only those transactions that rise above a specific risk-reward ratio, although that solution also has the possibility of adding unacceptable levels of latency to some allowed transactions if the implementation requires additional analysis. Coarser-grained strategies could be used instead, such as sandboxing future transactions from that actor or even entirely disallowing the actor from the system. In extreme cases, even coarser mitigation methods – such as shutting down file I/O – may be appropriate.

Of course, all else being equal, finer-grain backstop mitigation is generally preferable. However, tradeoffs often must be made based on the constraints of the solution technologies available, and a coarse-grain backstop is usually better than no backstop. For example, if the coarse-grain response to prevent suspected ransomware is to disable file I/O, the side effects of that response still may be preferable to the alternative – allowing the actor to continue to operate in the system unchecked – assuming the result of inaction would be a ransomware-encrypted filesystem.
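The preference ordering described above can be sketched in a few lines: pick the finest-grained backstop that the available technology supports, and fall back to a coarser one rather than having none. The `Backstop` enumeration and `choose_backstop` function are hypothetical names, and the ordering simply mirrors the examples in the preceding paragraphs.

```python
from enum import IntEnum
from typing import Iterable, Optional


class Backstop(IntEnum):
    """Hypothetical backstop mitigations, ordered from finest to coarsest."""
    BLOCK_TRANSACTION = 1   # block exactly the suspicious transaction
    SANDBOX_ACTOR = 2       # sandbox future transactions from the actor
    REMOVE_ACTOR = 3        # disallow the actor from the system entirely
    DISABLE_FILE_IO = 4     # extreme case: shut down file I/O


def choose_backstop(available: Iterable[Backstop]) -> Optional[Backstop]:
    """Return the finest-grained mitigation the platform supports, if any."""
    return min(available, default=None)
```

In practice the choice would also weigh latency and side effects, as discussed above; this sketch captures only the "finer is preferable, coarse is better than nothing" rule.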

Zero trust: It really starts before operations

The best starting point for a secure application development using zero trust is, unsurprisingly, at the beginning. The foundational principles that enable operationalizing zero trust principles should be baked into the design phase of application development processes. Therefore, any discussion of an operational zero trust solution integrated into the CI/CD pipeline must begin with an understanding of these foundational elements that should be first-class design considerations.

The core of these foundational elements should capture the intended behavior of the system’s interactions, coupled with sufficient visibility and control to detect deviations and act upon them. The intended interactions are used to define the desired set of actions (What) for each actor (Who) toward each target (Whom). That said, there may be some scenarios and environments where the intended interaction patterns are unknown. In such cases, an organization can leverage deeper visibility to “discover” the set of appropriate interactions,⁸ which it can then codify in policy.

For example, in today’s ZTNA solutions, which focus on identity-driven access control to an application’s external APIs, visibility and controls into the authentication APIs are required. Or, in the cloud workload protection platform (CWPP) context, detection of a compromised container workload requires visibility into the actions each container performs, and in real time, if real-time remediation is required.

The following is a list of the high-level “buckets” related to foundational considerations that should be part of the design process, with additional drilldowns to provide specific questions to think about for each of the key points:

What are the black-box interfaces – inputs and outputs – of the system?
- Who are the classes of users – administrators, authenticated users, non-authenticated users, partners – that interact with the application, and what do they need to do?
- Are all communications via a defined API, or are there “raw” network communications or communications via a data store, such as a database or object store?
  - For API communications, how well-defined is the API, in terms of which users can interact with it, and the constraints around those interactions (e.g., parameter constraints, rate-limits, etc.)?
  - For any network communication, is all data transmitted encrypted?
  - If there are any “raw” data interfaces (e.g., information is shared between the client and the application via an object store or database), are there audit logs to track who accessed what information, when?
Similarly, at the internal “white-box” level, what are the component services that comprise the overall applications, including any services that are externally provided, and how do they communicate?
- How do these services communicate – what is the API being used, and what are the constraints (roles, access controls, rate-limits, parameters, etc.) on those interactions?
  - Similar questions, as those above, exist around the formality of the API and its constraints and the encryption of communications.
  - “Internal” (e.g., inter-workload/container) communications are more likely to use shared memory and filesystem-based interfaces; any such interfaces should be identified.
What control mechanisms does the system make available, to limit access to the black-box interfaces and internal communication paths?
- Is there a policy indicating who – what roles – can invoke specific APIs and what happens if the policy is violated? Similarly, what means exist to detect and mitigate any sort of abuse of the APIs, such as invalid parameters, or being invoked at too high a rate? Can those policies be dynamically updated, based on contextual inputs from other systems?
- Analogously, are there policies that constrain the other forms of inter-workload communications – filesystems, object stores, database tables – in terms of who can access which files/stores/tables, and prevent abuse of those resources (e.g., “SELECT * from <table>”)?
What visibility (dashboards, logs, statistics) does the system make available into both the black-box interfaces and internal communication paths?
- Can the system identify who is invoking which API at what time? Is this data retained, and, if so, for how long? How quickly (real-time, hourly, etc.) is such data made available? How consumable is the data – is it an unstructured file log, a structured JavaScript Object Notation (JSON) record that can be sent to a security information and event management (SIEM) service, or data recorded in a big-data database table?
- What are the answers for the same questions when it comes to other communication paths – memory, files, objects, databases? Who is doing what? How is that data exposed?
Beyond communication paths, what other controls and visibility does the system provide to prevent resource oversubscription or overconsumption?
- Does the system have visibility into resource consumption metrics – CPU, memory, bandwidth, pod scaling, etc.? If so, at what granularity, and how consumable is that data?
- Does the system have controls or guardrails for resource consumption – per-workload memory/CPU/file IO limits, tracking of process creation system calls, limits on scale-up/scale-out of pods, rate-limits for APIs that invoke other services?
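As one illustration of the "how consumable is the data" question above, the following sketch emits a structured JSON audit record that answers who invoked what, against which target, and when. The `audit_record` function and its field names are assumptions made for this example; they are not a schema defined by any particular SIEM service.

```python
import json
from datetime import datetime, timezone


def audit_record(who: str, what: str, whom: str, allowed: bool, when: datetime) -> str:
    """Serialize one transaction as a JSON line (hypothetical field names)."""
    return json.dumps(
        {
            # Normalize to UTC so records from different hosts can be correlated.
            "timestamp": when.astimezone(timezone.utc).isoformat(),
            "who": who,
            "what": what,
            "whom": whom,
            "allowed": allowed,
        },
        sort_keys=True,
    )
```

A record in this form can be parsed and correlated by downstream tooling without log scraping, which is the practical difference between a structured record and an unstructured file log.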

Explicitly asking these questions will help you discover where gaps exist in the foundation to enable zero trust. Often, the mere act of asking, early in the design, will result in the gap being addressed with minimal additional design effort. And, where a gap may persist, simple awareness of the gap will help direct additional security focus or identify vulnerable threat surfaces.

Doing this sort of up-front secure development analysis is a crucial part of the zero-trust mindset. Even so, most enterprises’ zero trust efforts today center on what happens after the initial design. Being thoughtful, in the design phase, about the requirements for effective implementation of zero trust will prevent the much larger incremental efforts needed to “patch the holes” after the application is deployed.

Mitigation considerations: Timeliness, false positives/negatives, and risk

As the “assume breach” premise makes explicit, a security practitioner must assume that some sufficiently motivated adversary will manage to execute a malicious transaction – an instance of Who does What to Whom that met the policy rules but, in a perfect world, should not have been allowed. In such cases, the focus then shifts to having an effective “backstop” mechanism that can find those cases, usually based on observations of patterns of transactions and system side effects, as described in the earlier “assume breach” section.

However, just as with the notion of identity, knowledge of whether a specific transaction is malicious or not will never be perfect. Therefore, just as with identity, an ideal zero trust solution should report a “confidence” score in the legitimacy of the transaction. As an example, seeing a daemon read and write 10 times the normal file rate for 10 seconds might result in a confidence score (of maliciousness) of 70%, but seeing the rate increase 100 times, sustained over 1 minute, might raise the confidence to 95%. In this example, taking the remediative action of inhibiting future file writes will still have some chance (a 30% or 5% possibility) of being the incorrect response – a “false positive.” Therefore, the decision to remediate or not must consider the risk of false positives versus the potential impact of allowing the possibly malicious behavior.

The point of the example is to highlight that any decision to take a user-visible remediative action is really a business decision, one that considers all the risks, costs, and rewards around the suspicious activity. Introducing additional friction to the transaction increases the probability that the value will not be received, but not intervening/adding friction introduces the risk of compromise. Therefore, the decision to act (or not) in such cases is a function of three inputs:

What is the risk of allowing possibly malicious transactions to continue?
This risk may be directly monetary, such as the transfer of bank funds, or may have indirect costs, either operational (e.g., key records being encrypted and held for ransom) or branding (e.g., defacement of a web site). There may also be legal or compliance costs, such as from the leaking of personal customer data. In essence, assignment of risk is a governance issue, and good governance will understand and quantify the risks for the application.
What are the side effects of taking the remediative action?
The side effects can be expressed along the dimensions of (a) friction introduced and (b) blast zone. The friction is how much more difficult it is for a legitimate transaction to proceed; it may range from slightly more inconvenient (e.g., an extra level of authentication) to impossible (the transaction is simply not permitted). The blast zone refers to whether any other independent transactions will also be affected and, if so, how many. For example, blocking a specific user will only impact that user, but shutting down a logging service will affect auditability for all users and transactions.
What is the confidence in the belief the transaction is malicious?
Building confidence typically implies collecting more data points and correlating across more data sources, both of which take time, so this tradeoff is often, in practice, “How long should I watch before I decide to act?”

So, while a zero-trust strategy must embrace the fact that breaches will occur, a thoughtful approach will weigh risk against reward when remediating transactions that are allowed but appear suspicious. That tradeoff will take into account the risk level of different application transactions and the consequences of applying remediation, and will apply the remediation only if the risk level exceeds the expected business reward.
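One way to make this tradeoff concrete is to compare expected costs. The sketch below is a simplified model, not a prescribed formula: it assumes the three inputs above can be reduced to a confidence that the transaction is malicious, a cost if a malicious transaction is allowed, a cost of friction if a legitimate transaction is blocked, and a collateral ("blast zone") cost incurred whenever the remediation is applied. The function name, parameters, and cost figures are all hypothetical.

```python
def should_remediate(confidence_malicious: float,
                     breach_cost: float,
                     false_positive_cost: float,
                     collateral_cost: float = 0.0) -> bool:
    """Return True when the expected cost of inaction exceeds that of acting."""
    if not 0.0 <= confidence_malicious <= 1.0:
        raise ValueError("confidence_malicious must be between 0 and 1")
    # Not acting only costs something if the transaction really is malicious.
    expected_cost_of_inaction = confidence_malicious * breach_cost
    # Acting costs friction when we are wrong, plus any blast-zone side effects.
    expected_cost_of_action = ((1.0 - confidence_malicious) * false_positive_cost
                               + collateral_cost)
    return expected_cost_of_inaction > expected_cost_of_action
```

Under these assumed costs, a 70% confidence may justify a narrowly targeted remediation but not one with a large blast zone, while waiting until confidence reaches 95% can tip the decision toward the coarser response.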

¹ https://www.whitehouse.gov/wp-content/uploads/2022/01/M-22-09.pdf
² https://csrc.nist.gov/publications/detail/sp/800-207/final
³ https://www.cisa.gov/zero-trust-maturity-model
⁴ https://media.defense.gov/2021/Feb/25/2002588479/-1/-1/0/CSI_EMBRACING_ZT_SECURITY_MODEL_UOO115131-21.PDF
⁵ https://dodcio.defense.gov/Portals/0/Documents/Library/(U)ZT_RA_v1.1(U)_Mar21.pdf
⁶ https://dodcio.defense.gov/Portals/0/Documents/Library/(U)ZT_RA_v1.1(U)_Mar21.pdf
⁷ The aforethought begins at the design phase, as described later.
⁸ Or, at least, the set of “thought to be appropriate” interactions that the system appears to require. There is always the risk that the set of interactions derived from observation may be incomplete, or may have some pre-existing risk. Therefore, aforethought in the design is always preferable.

PUBLISHED MAY 04, 2022
