
1 Introduction

Amazon Web Services (AWS) has made significant investments in developing and applying formal tools and techniques to prove the correctness of critical internal systems and to provide services that let AWS users prove the correctness of their own systems [24]. We use and apply a varied set of automated reasoning techniques at AWS. For example, we use (i) bounded model checking [35] to verify memory safety properties of boot code running in AWS data centers and of the real-time operating system used in IoT devices [22, 25, 26], (ii) proof assistants such as EasyCrypt [12] and domain-specific languages such as Cryptol [38] to verify cryptographic protocols [3, 4, 23], (iii) HOL Light [33] to verify the BigNum implementation [2], (iv) P [28] to test key storage components in Amazon S3 [18], and (v) Dafny [37] to verify key authorization and crypto libraries [1]. Automated reasoning capabilities for external AWS users leverage (i) data-flow analysis [17] to prove correct usage of cloud APIs [29, 40], (ii) monotonic SAT theories [14] to check properties of network configurations [5, 13], and (iii) theories for strings and automata in SMT solvers [16, 39, 46] to provide security for access controls [6, 19].

This paper describes key milestones in our journey to generating a billion SMT queries a day in the context of AWS Identity and Access Management (IAM). IAM is a system for controlling access to resources such as applications, data, and workloads in AWS. Resource owners can configure access by writing policies that describe when to allow and deny user requests that access the resource. These configurations are expressed in the IAM policy language. For example, Amazon Simple Storage Service (S3) is an object storage service that offers data durability, availability, security, and performance. S3 is used widely to store and protect data for a range of applications. A bucket is a fundamental container in S3 into which users can upload unlimited amounts of data in the form of objects. Amazon S3 supports fine-grained access control to the data based on the needs of the user. Ensuring that only intended users have access to a resource is important for its security. While the policy language allows for compact specifications of expressive policies, reasoning about the interactions between the semantics of different policy statements is challenging to do manually, especially in large policies with multiple operators and conditions.

To help AWS users secure their resources, we built Zelkova, a policy analysis tool designed to reason about the semantics of AWS access control policies. Zelkova translates policies and properties into Satisfiability Modulo Theories (SMT) formulas and uses SMT solvers to prove a variety of security properties, such as “Does the policy grant broad public access?” [6]. The SMT encoding uses the theory of strings, regular expressions, bit vectors, and integer comparisons. The use of the wildcards \(*\) (any number of characters) and ? (exactly one character) in the string constraints makes the decision problem PSPACE-complete. Zelkova uses a portfolio solver, where it invokes multiple solvers in the backend and uses the results from the solver that returns first, in a winner-take-all strategy. This allows us to leverage the diversity among solvers and solve queries quickly, typically in a couple hundred milliseconds to tens of seconds. A sample of AWS services that integrate Zelkova includes Amazon S3 (object storage), AWS Config (change-based resource auditor), Amazon Macie (security service), AWS Trusted Advisor (compliance with AWS best practices), and Amazon GuardDuty (intelligent threat detection). Zelkova drives preventative control features such as Amazon S3 Block Public Access and provides visibility into who outside an account has access to its resources [19].
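
To make the string encoding concrete, the following is a minimal sketch, using Z3's Python bindings rather than Zelkova's actual encoding, of how a resource pattern containing the * wildcard can be expressed with the theory of strings and regular expressions; the bucket name, the restriction to printable ASCII, and the concrete request are our own illustrative choices.

```python
from z3 import String, Solver, InRe, Re, Concat, Star, Range, sat

# A request's resource name, modeled as an unconstrained string variable.
resource = String("resource")

# Approximate the IAM wildcards over printable ASCII:
#   '*' matches any number of characters, '?' matches exactly one character.
ANY_CHAR = Range(" ", "~")     # one printable character
ANY_STRING = Star(ANY_CHAR)    # zero or more printable characters

# Pattern for a Resource element such as "arn:aws:s3:::my-bucket/*"
# (the bucket name is illustrative).
pattern = Concat(Re("arn:aws:s3:::my-bucket/"), ANY_STRING)

s = Solver()
s.add(InRe(resource, pattern))                            # the statement matches the request
s.add(resource == "arn:aws:s3:::my-bucket/logs/app.log")  # a concrete request
print(s.check() == sat)  # True: the request is covered by the statement
```

Checking whether such a pattern overlaps another pattern, or admits requests from outside an account, reduces to the same kind of satisfiability query.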

Zelkova is an automated reasoning tool developed by formal methods experts, and it requires some degree of expertise in formal methods to use. We cannot expect all AWS users to be experts in formal methods, to have the time to be trained in formal methods tools, or even to be experts in the cloud domain. In this paper, we present the three pillars of our solution that enable Zelkova to be used by all AWS users. Using a combination of techniques, namely eliminating the need to write specifications, domain-specific abstractions, and advances in SMT solvers, we make the power of Zelkova available to all AWS users.

2 Eliminate Writing Specifications

End users will not write a specification

Zelkova follows a traditional verification approach where it takes as input a policy and a specification, and produces a yes or no answer. We have developers and cloud administrators who author policies to govern access to cloud resources. We have someone else, a security engineer, who writes a specification of what is considered acceptable. The automated reasoning engine Zelkova does the verification and returns a yes or no answer. This approach is effective for a limited number of use cases, but it is hard to scale to all AWS users. The bottleneck to scaling the verification effort is the human effort required to specify what is acceptable behavior. The SLAM work made a similar observation about specifications; for the Static Driver Verifier to be used, they needed to provide the tool as well as the specifications [7]. A person has to put in a lot of work upfront to define acceptable behavior, and only at the end of the process do they get back an answer: a boolean. It’s a single bit of information for all the work they’ve put in. They have no information about whether they had the right specification or whether they wrote the specification correctly.

To scale to all AWS users, we had to fundamentally rethink our approach and completely remove the bottleneck of having people write a specification. To achieve that, we flipped the rules of the game and made the automated reasoning engine responsible for the specification. We had the machine put in the upfront cost. The engine now takes as input a policy and returns a detailed set of findings (declarative statements about what is true of the system). These findings are presented to a user, the security engineer, who reviews them and decides whether they represent valid risks in the system that should be fixed or acceptable behaviors of the system. Users are now taking the output of the machine and saying “yes” or “no”.

Fig. 1. An example AWS policy

Fig. 2. Stratified abstraction search tree

2.1 Generating Possible Specifications (Findings)

To remove the bottleneck of specification, we changed the question from is this policy correct? to who has access?. The response to the former is a boolean, while the response to the latter is a set of findings. AWS access control policies specify who has access to a given resource, via a set of Allow and Deny statements that grant and prohibit access, respectively. Figure 1 shows a simplified policy specifying access to an AWS resource. This policy specifies conditions on the cloud-based network (known as a VPC) from which the request originated and on the organizational Amazon customer (referred to by an Org ID) who made the request. The first statement allows access to any request whose SrcVpc is either vpc-a or vpc-b. The second statement allows access to any request whose OrgId is o-2. However, the third statement denies access from vpc-b unless the OrgId is o-1.

For each request, access is granted only if: (a) some Allow statement matches the request, and (b) none of the Deny statements match the request. Consequently, it can be quite tricky to determine what accesses are allowed by a given policy. First, individual statements can use regular expressions, negation, and conditionals. Second, to know the effect of an Allow statement, one must consider all possible Deny statements that can overlap with it, i.e., that can refer to the same request as the Allow. Thus, policy verification is not compositional, in that we cannot determine if a policy is “correct” simply by locally checking that each statement is “correct.” Instead, we require a global verification mechanism that simultaneously considers all the statements and their subtle interactions to determine if a policy grants only the intended access.
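
As a concrete illustration of this composition, here is a minimal sketch, again with Z3's Python bindings and not Zelkova's actual encoding, of the allow/deny semantics of the policy described above, with SrcVpc and OrgId modeled as plain string variables.

```python
from z3 import String, And, Or, Not, Solver, sat

src_vpc = String("SrcVpc")
org_id = String("OrgId")

# Allow statements: requests from vpc-a or vpc-b, or from org o-2.
allow1 = Or(src_vpc == "vpc-a", src_vpc == "vpc-b")
allow2 = org_id == "o-2"

# Deny statement: deny requests from vpc-b unless the OrgId is o-1.
deny = And(src_vpc == "vpc-b", Not(org_id == "o-1"))

# A request is granted iff some Allow matches and no Deny matches.
allowed = And(Or(allow1, allow2), Not(deny))

# Sample query: can a request whose OrgId is neither o-1 nor o-2 be granted?
s = Solver()
s.add(allowed, org_id != "o-1", org_id != "o-2")
if s.check() == sat:
    print(s.model())  # e.g., SrcVpc = "vpc-a", with OrgId unconstrained
```

Dropping the Not(deny) conjunct silently changes the answer, which is exactly the non-compositional interaction between statements described above.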

For the example policy sketch shown in Fig. 1, access can be summarized through a set of three findings, which say that access is granted to a request iff:

  • Its SrcVpc is vpc-a, or,

  • Its OrgId is o-2, or,

  • Its SrcVpc is vpc-b and its OrgId is o-1.

The findings are sound, as no other requests are granted access. The findings are mostly precise: most of the requests that match the conditions are granted access. The finding “OrgId is o-2”, however, also includes some requests that are not allowed, e.g., when SrcVpc is vpc-b. We sacrifice this precision to keep the findings understandable; fully precise findings would need to include negation, which would add complexity for the users making decisions. Finally, the findings compactly summarize the policy in three positive statements declaring who has access. In principle, the notion of compact findings is similar to abstract counterexamples or minimized counterexamples [21, 30, 32]. Since the findings are produced by the machine and already verified to be true, a person only has to decide whether they should be true. The human is making a judgment call and expressing intent.
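
Continuing the same simplified model, the soundness and the deliberate imprecision of these three findings can themselves be checked with two satisfiability queries; the encoding below is ours, not Zelkova's.

```python
from z3 import String, And, Or, Not, Solver, sat, unsat

src_vpc = String("SrcVpc")
org_id = String("OrgId")

# Policy semantics, as in the previous sketch.
allowed = And(Or(src_vpc == "vpc-a", src_vpc == "vpc-b", org_id == "o-2"),
              Not(And(src_vpc == "vpc-b", Not(org_id == "o-1"))))

# The three findings, stated positively.
findings = Or(src_vpc == "vpc-a",
              org_id == "o-2",
              And(src_vpc == "vpc-b", org_id == "o-1"))

# Soundness: every allowed request is covered by some finding.
s = Solver()
s.add(allowed, Not(findings))
print(s.check() == unsat)  # True: no allowed request escapes the findings

# Imprecision: the findings admit a request the policy actually denies,
# e.g., OrgId = o-2 together with SrcVpc = vpc-b.
s = Solver()
s.add(findings, Not(allowed))
print(s.check() == sat)    # True: the "OrgId is o-2" finding over-approximates
```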

We use stratified predicate abstraction to compute the findings. Enumerating all possible requests is computationally intractable, and even if it were not, the resulting set of findings would be far too large to be useful. We tackle the problem of summarizing the super-astronomical request space using predicate abstraction. Specifically, we make a syntactic pass over the policy to extract the set of constants that are used to constrain access, and we use those constants to generate a family of predicates whose conjunctions compactly describe partitions of the space of all requests. For example, from the policy in Fig. 1 we would extract the following predicates:

$$ \begin{array}{lll} p_a \doteq \mathsf{SrcVpc} = \mathtt{vpc\text{-}a}, & p_b \doteq \mathsf{SrcVpc} = \mathtt{vpc\text{-}b}, & p_\star \doteq \mathsf{SrcVpc} = \mathtt{\star},\\ q_1 \doteq \mathsf{OrgId} = \mathtt{o\text{-}1}, & q_2 \doteq \mathsf{OrgId} = \mathtt{o\text{-}2}, & q_\star \doteq \mathsf{OrgId} = \mathtt{\star}. \end{array} $$

The first row has three predicates describing the possible value of the SrcVpc of the request: that it equals \(\mathtt{vpc\text{-}a}\) or \(\mathtt{vpc\text{-}b}\) or some value other than \(\mathtt{vpc\text{-}a}\) and \(\mathtt{vpc\text{-}b}\). Similarly, the second row has three predicates describing the value of the OrgId of the request: that it equals \(\mathtt{o\text{-}1}\) or \(\mathtt{o\text{-}2}\) or some value other than \(\mathtt{o\text{-}1}\) and \(\mathtt{o\text{-}2}\).

Fig. 3. Cubes generated by the predicates \(p_a, p_b, p_\star, q_1, q_2, q_\star\) extracted from the policy in Fig. 1, and the result of querying Zelkova to check whether the requests corresponding to each cube are granted access by the policy.

We can compute findings by enumerating all the cubes generated by the above predicates and querying Zelkova to determine whether the policy allows access to the requests described by each cube. Enumerating cubes is common in SAT solvers and other predicate abstraction based approaches [8, 15, 36]. The set of all the cubes is shown in Fig. 3. The chief difficulty with enumerating all the cubes greedily is that we end up eagerly splitting cases on the values of fields when that may not be required. For example, in Fig. 3, we split cases on the possible value of OrgId even though it is irrelevant when SrcVpc is vpc-a. This observation points the way to a new algorithm where we lazily generate the cubes as follows. Our algorithm maintains a worklist of minimally refined cubes. At each step, we (1) ask Zelkova if the cube allows an access that is not covered by any of its refinements; (2) if so, we add it to the set of findings; and (3) if not, we refine the cube “point-wise” along the values of each field individually and add the results to the worklist. This process is illustrated in Fig. 2.
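
The sketch below, still in the same simplified model, enumerates the nine cubes of Fig. 3 and asks Z3 (standing in for Zelkova) whether the policy grants access to any request in each cube; the lazy, stratified refinement described above is omitted for brevity.

```python
from itertools import product
from z3 import String, And, Or, Not, Solver, sat

src_vpc, org_id = String("SrcVpc"), String("OrgId")
fields = {"SrcVpc": (src_vpc, ["vpc-a", "vpc-b"]),
          "OrgId": (org_id, ["o-1", "o-2"])}

# Policy semantics for the example in Fig. 1 (as in the earlier sketches).
allowed = And(Or(src_vpc == "vpc-a", src_vpc == "vpc-b", org_id == "o-2"),
              Not(And(src_vpc == "vpc-b", Not(org_id == "o-1"))))

def constraint(var, consts, choice):
    # 'choice' is one of the constants extracted from the policy, or "*",
    # meaning "some value other than the extracted constants".
    if choice == "*":
        return And(*[var != c for c in consts])
    return var == choice

names = sorted(fields)
for combo in product(*[fields[f][1] + ["*"] for f in names]):
    cube = And(*[constraint(fields[f][0], fields[f][1], v)
                 for f, v in zip(names, combo)])
    s = Solver()
    s.add(allowed, cube)
    print(dict(zip(names, combo)), "allowed" if s.check() == sat else "denied")
```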

The specifications, or findings, generated by the machine are presented in the context of the access control domain. Developers do not have to learn a new means of specifying correctness, think about what they want to be correct of the system, or check the completeness of their specifications. This is a very important lesson that we need to apply across many other applications for formal methods to be successful at scale. The challenge is that the specifics depend on the domain.

3 Domain-Specific Abstractions

It’s all about the end user

Zelkova was developed by formal methods subject matter experts who learned the domain of AWS access control policies. Once we had the analysis engine, we faced the same challenges all other formal methods tool developers have faced before us. How do we make it accessible to all users? One hard-earned lesson was “eliminating the need for specifications”, as discussed in the previous section. But that was only part of the answer. There was a lot more to do. Many more questions to answer: How do we get users to use it? How do we present the results to the users? How do the results stay updated? The answer was to design and build domain-specific abstractions. Do one thing and do it really well.

We created a higher-level service on top of Zelkova called IAM Access Analyzer. We provide a one-click way to enable Access Analyzer for an AWS account or AWS Organization. An account in AWS is a fundamental construct that serves as a container for the user’s resources, workloads, and data. Users can create policies to grant other users access to resources in their account. In Access Analyzer, we use the account as a zone of trust. This abstraction lets us say that access to resources by users within the zone of trust is considered safe, whereas access by users outside the zone of trust is potentially unsafe.

Once a user enables Access Analyzer, we use stratified predicate abstraction to analyze the policies and generate findings showing which users outside the zone of trust have access to which resources. We had to shift from a mode where Zelkova answers “any access query” to one where Zelkova enumerates “who has access to what”. This brings attention to the permissions that could lead to unintended access to data. While this idea seems simple in hindsight, it took us a couple of years to figure out the right abstraction for the domain. The result can be used by all AWS users. They do not need to be experts in formal methods or even have a deep understanding of how access control in the cloud works.

Fig. 4. Interface that presents Access Analyzer findings to users.

Each finding includes details about the resource, the external entity with access to it, and the permissions granted, so that the user can take appropriate action. We present example findings in Fig. 4. Note that these findings are not presented as SMT-LIB formulas but rather in the domain the user expects: AWS access control constructs. They map to the findings presented in the previous section for Fig. 1. Users can view the details included in a finding to determine whether the access is intentional or a potential risk that they should resolve.

Most automated reasoning tools are run as a one-off: prove something, and then move on to the next challenge. In the cloud environment this is not the case; doing the analysis once was not sufficient in our domain. We had to design a means to continuously monitor the environment and the changes to access control policies within the zone of trust, and to update the findings accordingly. To that end, Access Analyzer analyzes a policy whenever a user adds a new policy or changes an existing one, and generates new findings, removes findings, or updates existing findings as needed. Access Analyzer also analyzes all policies periodically, so that in the rare case where a policy change event is missed by the system, the findings are still kept up to date. The ease of enablement, just-in-time analysis on updates, and periodic analysis across all policies are the key factors in getting us to a billion queries daily.

Fig. 5. Comparing the runtime for solving SMT queries generated by Zelkova with CVC4 and different cvc5 versions: (a) CVC4 vs. cvc5 version 0.0.4, (b) CVC4 vs. cvc5 version 0.0.7. Comparing the runtimes of winner-take-all in the Zelkova portfolio solver with: (c) a portfolio solver consisting of the Z3 sequence string solver, the Z3 automata solver, and cvc5 version 0.0.4, (d) a portfolio solver consisting of the Z3 sequence string solver, the Z3 automata solver, and cvc5 version 0.0.7. Evaluating the performance of the latest cvc5 version 1.0.0 against its older versions: (e) cvc5 version 0.0.4 and (f) cvc5 version 0.0.7.

4 SMT Solving at Cloud Scale

Every query matters

The use of SMT solving in AWS features and services means that millions of users are relying on the correctness and timeliness of the underlying solvers for the security of their cloud infrastructure. The challenges around correctness and timeliness of solver queries have been well studied in the automated reasoning community, but they have largely been treated as independent concerns. Today, we are generating a billion SMT queries every day to support various use cases across a wide variety of AWS services. We have discovered an intricate dependency between correctness and timeliness that manifests at this scale.

4.1 Monotonicity in Runtimes Across Solver Versions

Zelkova uses a portfolio solver to discharge its queries. When given a query, Zelkova invokes multiple solvers in the backend and uses the results from the solver that returns first, in a winner-take-all strategy [6]. The portfolio approach allows us to leverage the diversity amongst solvers. One of our goals is to leverage the latest advancements in the SMT solver community. SMT solver researchers and developers are fixing issues, improving existing features, adding new theories, adding features such as proof generation, and making other performance improvements. Before deploying a new version of a solver to the production environment, we perform extensive offline testing and benchmarking to gain confidence in the correctness of the answers and the performance of the queries, and to ensure there are no regressions.
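
A minimal sketch of a winner-take-all portfolio, assuming each backend solver is available as a command-line binary that reads an SMT-LIB file; the binary names, flags, and file name are illustrative, not our production setup.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor, as_completed

# Illustrative backend commands; the real portfolio and flags differ.
SOLVERS = {
    "z3-seq": ["z3", "query.smt2"],
    "cvc4": ["cvc4", "--lang=smt2", "query.smt2"],
    "cvc5": ["cvc5", "query.smt2"],
}

def run(name, cmd, timeout=30):
    out = subprocess.run(cmd, capture_output=True, text=True, timeout=timeout)
    return name, out.stdout.strip().splitlines()[0]  # "sat" or "unsat"

def portfolio():
    # Launch all solvers concurrently and return the first answer that
    # comes back: winner take all.
    with ThreadPoolExecutor(max_workers=len(SOLVERS)) as pool:
        futures = [pool.submit(run, name, cmd) for name, cmd in SOLVERS.items()]
        for fut in as_completed(futures):
            try:
                return fut.result()
            except Exception:
                continue  # this solver crashed or timed out; wait for the others
    return None, "unknown"

print(portfolio())
```

A production portfolio would also terminate the losing solver processes and record which solver won, to inform later analysis of the portfolio's composition.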

While striving for correctness and timeliness, one of the challenges we face is that new solver versions are not monotonically better in performance than their previous versions. A solution that works well in the cloud setting is a massive portfolio, sometimes even containing older versions of the same solver. This presents two issues. One, when we discover a bug in an older version of a solver, we need to patch that old version; this creates an operational burden of maintaining many different versions of the different solvers. Two, as the number of solvers increases, we need to ensure that each solver provides a correct result. Checking the correctness of queries that result in SAT is straightforward, but SMT solvers need to provide proofs for the UNSAT queries. The proof generation and checking need to be timely as well.

In the Zelkova portfolio solver [6], we use CVC4, and our original goal was to replace CVC4 with the then-latest version of cvc5 (version 0.0.4). We wanted to leverage the proof checking capabilities of cvc5 to ensure the correctness of UNSAT queries [11]. To check the timeliness requirements, we ran experiments across our benchmarks, comparing the results of CVC4 to those of cvc5 (version 0.0.4). The results across a representative set of queries are shown in Fig. 5(a). The graph contains approximately 15,000 SMT queries generated by Zelkova; we select a distribution of queries that are solved between 1 s and 30 s, after which the solver process is killed and a timeout is reported. Some queries that are not solved by CVC4 within the time bound of 30 s are now solved by cvc5 (version 0.0.4), as seen by the points along the y-axis on the extreme right of the graph. However, cvc5 (version 0.0.4) times out on some queries that are solved by CVC4, as seen by the points at the top of the graph.

The results presented in Fig. 5(a) are not surprising, given that the problem space is computationally hard and there is inherent randomness in the search heuristics within SMT solvers. In an evaluation of cvc5, the authors discuss examples where CVC4 outperforms cvc5 [10]. But this poses a challenge for us when we use the results of these solvers in security controls and services that millions of users rely on. The changes did not meet the timeliness requirement of continuing to solve the queries within 30 s. When a query times out, to be sound, the analysis marks the bucket as public. A timeout on a query that was previously solved can therefore lead to the user losing access to the resource. This is unexpected for the user because there was no change in their configuration.

For example, consider the security checks in Amazon S3 Block Public Access that block requests based on the results of the analysis. In this context, suppose that a bucket was marked as “not public” based on the result of a query, and that same query now times out; the bucket will be marked as “public”. This will lock down access to the bucket, and the intended users will not be able to access it. Even a single regression that leads to loss of access for the user is not an acceptable change. As another example, these security checks are also used by IoT devices. In the case of a smart lock, a timeout on a query that was previously solved could lead to the user losing access to their home. The criticality of these use cases combined with the end user expectation is a key challenge in our domain.

We debugged and fixed the issue in cvc5 that was causing certain queries to time out. But even with this fix, CVC4 was 2x faster than cvc5 on many easier problems that originally took 1 s to solve. This slowdown was significant for us because Zelkova is called in the request path of security controls such as Amazon S3 Block Public Access. When a user attempts to attach a new access control policy or update an existing one, a synchronous call is made to Zelkova and the corresponding portfolio solvers to determine whether the policy being attached grants unrestricted public access. The bulk of the analysis time is spent in the SMT solvers, so doubling the analysis time for queries can lead to a degraded user experience. Where and how the analysis results are used plays an important role in how we track changes to the timeliness of the solver queries.

Our solution was to add a new solver to the portfolio rather than replace an existing one. We added cvc5 (version 0.0.7) to the existing portfolio of solvers consisting of CVC4, Z3 with the sequence string solver, and a custom Z3-based automata solver. When we started the evaluation of cvc5, we did not plan to add a new version of the CVC solver to the portfolio. We had expected the latest version of cvc5 to be comparable in timeliness to CVC4. We worked closely with the CVC developers, and cvc5 was better on many queries, but it did not meet our timeliness requirements on all queries. This led to our decision to add cvc5 (version 0.0.7) to the Zelkova portfolio solver.

The results of comparing, under the winner-take-all strategy, the portfolio consisting of the two Z3 solvers, CVC4, and cvc5 (version 0.0.4) against the portfolio without cvc5 (version 0.0.4) are shown in Fig. 5(c). The same comparison with cvc5 (version 0.0.7) is shown in Fig. 5(d). The results show that the portfolio solving approach that Zelkova takes in the cloud is an effective one.

The cycle now repeats with cvc5 (version 1.0.0), and the same question comes up again: do we upgrade the existing cvc5 version to the latest, or add yet another version of CVC to the portfolio solver? Some early experiments show that there is no clear answer yet. The results so far comparing the different versions of cvc5, shown in Fig. 5(e) and (f), indicate that the latest version of cvc5 is not monotonically better in performance than either of its previous versions. We do want to leverage the better proof generating capabilities of cvc5 (version 1.0.0) in order to gain more assurance in the correctness of the UNSAT queries.

Fig. 6. Variance in runtimes after shuffling terms in the problem instances.

4.2 Stability of the Solvers

We have spent quite a bit of time defining and implementing the encoding of AWS access control policies into SMT. We update the encoding as we expand to more use cases or support new features in AWS. This is a slow and careful process that requires expertise in understanding both AWS and how SMT solvers work. It takes a lot of trial and error to figure out an encoding that is correct and performant.

To illustrate the importance of the encoding, we present an experiment on solver runtimes with different orderings of clauses for our encoding (Fig. 6). For the same set of problem instances used in Fig. 5, we now use the standard SMT competition shuffler to reorder assertions and terms and to rename variables, to study the effect of clause ordering on our default encoding. In Fig. 6, each point on the x-axis corresponds to a single problem instance. We run each instance in its original form (the default encoding), which gives the “base time”, and in five shuffled versions. This gives us a total of six versions of the problem; we record the min, max, and mean times. So for each problem instance x we have:

  1. (x, base time): time on the original problem;

  2. (x, min time): minimal time on the original and 5 shuffled problems;

  3. (x, max time): maximal time on the original and 5 shuffled problems; and

  4. (x, mean time): mean time on the original and 5 shuffled problems.

The instances are sorted by base time, so the base-time points form a smooth line while the other points appear more scattered. The comparison between CVC4 in Fig. 6(a) and cvc5 in Fig. 6(b) shows that cvc5 can solve more problems with the default encoding, as seen by the smooth base line. However, when we shuffle the assertions, terms, and other constructs in the problem instances, the performance of cvc5 varies more dramatically than that of CVC4: the points for the maximal time are spread more widely across the graph, and there are now several timeouts in Fig. 6(b).
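
For readers who want to reproduce this kind of experiment on their own benchmarks, the following is a rough sketch of the measurement loop; the scrambler invocation, solver command, and file layout are assumptions, not the exact setup used for Fig. 6.

```python
import statistics
import subprocess
import time

TIMEOUT = 30                        # seconds, as in the experiments above
SOLVER = ["cvc5"]                   # illustrative solver command
SCRAMBLE = ["scrambler", "-seed"]   # hypothetical shuffler invocation

def solve_time(path):
    start = time.monotonic()
    try:
        subprocess.run(SOLVER + [path], capture_output=True, timeout=TIMEOUT)
        return time.monotonic() - start
    except subprocess.TimeoutExpired:
        return TIMEOUT

def shuffled(path, seed):
    # Produce a shuffled copy of the SMT-LIB instance.
    out = f"{path}.shuf{seed}.smt2"
    with open(path, "rb") as src, open(out, "wb") as dst:
        subprocess.run(SCRAMBLE + [str(seed)], stdin=src, stdout=dst, check=True)
    return out

def measure(path, n_shuffles=5):
    base = solve_time(path)
    times = [base] + [solve_time(shuffled(path, s)) for s in range(1, n_shuffles + 1)]
    return {"base": base, "min": min(times), "max": max(times),
            "mean": statistics.mean(times)}

print(measure("instance.smt2"))
```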

4.3 Concluding Remarks

Based on our experience from generating a billion SMT queries a day, we propose some general areas of research for the community. We believe these are key to enabling the use of solvers to evaluate security controls, and to enable applications in emerging technologies such as quantum computing, blockchains, and bio-engineering.

Monotonicity and Stability in Runtimes. One of the main challenges we encountered is the lack of monotonicity and stability in runtimes within a given solver version and across different versions. Providing this stability is a fundamentally hard problem due to the inherent randomness in SMT solver heuristics, search strategies, and configuration flags. One approach would be to incorporate the algorithm portfolio approach [31, 34, 42] within mainstream SMT solvers. One way to enable algorithm portfolios is to leverage serverless and cloud computing environments and to develop parallel SMT solving and distributed search strategies. At AWS, this is an area that we are investing in as well. There has been some work in parallel and distributed SMT solving [41, 45], but we need more. Another avenue of research would be to develop specialized solvers that focus on a specific class of problems. SMT-COMP could devise categories that make room for specific types of problem instances as an incentive for developing such solvers.

Reduce the Barrier to Entry. Generating a billion SMT queries a day is a result of the exceptional work and innovation of the entire SMT community over the past 20 years. A question we are thinking about is how to replicate the success described here in other domains, in Amazon and elsewhere. There is a natural tendency in the formal methods community to target tools at the expert user. This limits their broader use and applicability. If we can find ways to lower the barrier to adoption, we can gain greater traction and improve the security, correctness, availability, and robustness of more systems.

More Abstractions. SMT solvers are powerful engines. One potential research direction for the broader community is to provide one or more higher-level languages that allow people to specify their problems. We could create different languages based on the domain and take into account the expectations of developers. This would make interacting with a solver more of a black-box exercise. The success we have had with SMT at Amazon can be recreated in other domains if we give developers the ability to easily encode their problems in a higher-level language and use SMT solvers to solve them. It will scale more easily by not requiring a formal methods expert as an intermediary. Developing new abstractions or intermediate representations could be one approach to unlocking billions of other SMT queries.

Proof Generation. All SMT solvers should generate proofs to help the end user gain confidence in the results. There has been some initial work in this area [9, 20, 27, 43, 44], but SMT has a long way to go to catch up with SAT solvers, and for good reason. Proof production is important for us to gain greater confidence in the correctness of our answers, though it creates a tension with timeliness. We need proof production to be performant, and the tools that check the generated proofs need to be correct themselves. The push on different testing approaches, including fuzzing and property-based testing of SMT solvers, should continue with the same rigor and enthusiasm. Using these fuzzing and mutation testing based techniques in the development workflow of SMT solvers should become mainstream.

We are working to provide a set of benchmarks that can be leveraged by SMT developers to help further their work, are funding research grants in these areas, and are willing to evaluate new solvers.