ADR020: Introduce static membership in arenas
authors  last revised 

Igor Konnov  20220601 
Table of Contents
 Summary (Overview)
 Context (Problem)
 Options (Alternatives)
 Solution (Decision)
 Consequences (Retrospection)
Summary
We discuss an extension of the model checker arenas. The main application of this extension is a more efficient implementation of powersets and functions sets. Potentially, this extension will let us optimize the number of SMT constraints and thus improve performance of the model checker in general.
1. Context
We give only a brief introduction to arenas. A detailed exposition can be found in KKT19.
1.1. Short introduction to arenas
The model checker heavily relies on the concept of arenas, which are a static overapproximation of the data structures produced by symbolic execution of a TLA+ specification. Here we give a very short recap. In a nutshell, all basic values of TLA+ (such as integers, strings, and Booleans) and data structures (sets, functions, records, tuples, and sequences) are translated into cells. Cells are SMT constants, which can be connected to other cells by edges. Currently, we have three kinds of edges:

has. A membership edge
(c_p, c_e)
represents that a parent cellc_p
potentially contains an elementc_e
(e.g., ifc_p
represents a set). These edges encode manytomany relations. 
dom. A domain edge
(c_f, c_d)
represents that a function cellc_f
has the domain represented with a cellc_d
. These edges encode manytoone relations.Likewise, a domain edge
(c_F, c_c)
represents that a function set cellc_F
has the domain represented with a cellc_c
. 
cdm. A codomain edge
(c_F, c_c)
represents that a function set cellc_F
has the codomain represented with a cellc_c
. These edges encode manytoone relations.For historic reasons, functions are also encoded with the edges called dom and cdm, though the cdm edge points to the function relation, not its codomain. We would prefer to call label the relation edge with rel, not cdm. As these edges are manytoone, we can map them from their kinds
kind > (c_p, c_e)
. This requires simple refactoring, so we are not going to discuss the dom and cdm any further.
There is a need for refactoring and extension of the hasedges. We summarize the issues with the current implementation of this kind of edges:

Originally, every edge
(c_S, c_e)
of the kind has was encoded as a Boolean constantin_${c_e.id}_${c_S.id}
in SMT. Hence, every time we introduce a copyc_T
of a setc_S
, we introduce a new edge(c_T, c_e)
in the arena, and thus we have to introduce another Boolean constantin_${c_e.id}_${c_T.id}
in SMT. Alternatively, we could use a single Boolean variable both for(c_S, c_e)
and(c_T, c_e)
. 
Later, when translating records and tuples, we stopped introducing Boolean constants in SMT for the hasedges. However, we do not track in the arena the fact that these edges are presented only in the arena, not in SMT. Hence, we have to be careful and avoid expressing membership in SMT when working with these edges.

As every hasedge directly refers to its parent in the edge name (that is,
in_${c_e.id}_${c_S.id}
), we cannot share edges when encodingSUBSET S
and[S > T]
. As a result, we have to introduce a massive number of Boolean constants and constraints, which are not necessary. 
We keep adding edges and SMT constants to the solver context, even when we know exactly that an element belongs to a set, e.g., as in
{ 1, 2, 3 }
.
1.2. Arena examples
To introduce the context in more detail, we give an example of how several TLA+ expressions are represented in arenas and SMT.
Consider the following expression:
{ a, b, c } \union { d, e }
Let's denote the arguments of the set union to be S
and T
. In the current arena representation, the rewriting
rule SetCtorRule
creates the following SMT constants (assuming that a, ..., e
were translated to arena cells):

Two cells
c_l
andc_r
to represent the setsS
andT
. These cells are backed with two SMT constants of an uninterpreted sort, which corresponds to the common type ofS
andT
. 
Five SMT constants of the Boolean sort that express set membership of
a, b, c
andd, e
inS
andT
, respectively. The setsS
andT
are backed with SMT constants of the sort ofS
andT
. 
One cell
c_u
to represent the setS \union T
. 
Five Boolean constants of the Boolean sort that express set membership of
a, b, c, d, e
inS \union T
.
It is obvious that 10 Boolean constants introduced for set membership are completely unnecessary, as we know for sure
that the respective elements belong to the three sets. Moreover, when constructing S \union T
, the rule SetCupRule
creates five SMT constraints:
;; a, b, c belong to the union, when they belong to S
(= in_a_u in_a_l)
(= in_b_u in_b_l)
(= in_c_u in_c_l)
;; d and e belong to the union, when they belong to T
(= in_d_u in_d_r)
(= in_e_u in_e_r)
2. Options

Keep things as they are.

Implement the extension of membership edges presented below.
3. Solution
3.1. Pointers to the elements
Instead of the current solution in the arenas, which maps a parent cell to a
list of element cells, we propose to map parent cells to membership pointers of
various kinds. To this end, we introduce an abstract edge (the Scala code can
be found in the package at.forsyte.apalache.tla.bmcmt.arena
):
https://github.com/informalsystems/apalache/blob/81e397fadc6f3ce346d8f8a709ebb3715ac57391/tlabmcmt/src/main/scala/at/forsyte/apalache/tla/bmcmt/arena/ElemPtr.scala#L8L24
Having an abstract edge, we introduce various case classes. The simplest case
is the FixedElemPtr
, which always evaluates to a fixed Boolean value:
https://github.com/informalsystems/apalache/blob/81e397fadc6f3ce346d8f8a709ebb3715ac57391/tlabmcmt/src/main/scala/at/forsyte/apalache/tla/bmcmt/arena/ElemPtr.scala#L26L39
Instances of FixedElemPtr
may be used in cases, when the membership is
statically known. For instance, set membership for the sets {1, 2, 3}
and
1..100
is static (always true
) and thus it does not require any additional
variables and constraints in SMT. The same applies to records and tuples.
The next case is element membership that is represented via a Boolean constant in SMT:
https://github.com/informalsystems/apalache/blob/81e397fadc6f3ce346d8f8a709ebb3715ac57391/tlabmcmt/src/main/scala/at/forsyte/apalache/tla/bmcmt/arena/ElemPtr.scala#L41L64
Instances of SmtConstElemPtr
may be used in cases, when set membership can be
encoded via a Boolean constant. Typically, this is needed when the membership
is either to be defined by the solver, or when this constant caches a complex
SMT constraint. For instance, it can be used by CherryPick
.
The most general case is represented via an SMT expression, which is encoded in TLA+ IR:
https://github.com/informalsystems/apalache/blob/81e397fadc6f3ce346d8f8a709ebb3715ac57391/tlabmcmt/src/main/scala/at/forsyte/apalache/tla/bmcmt/arena/ElemPtr.scala#L66L77
Instances of SmtExprElemPtr
may be used to encode set membership via SMT
expressions. For instance:

Evaluating an array expression, e.g., via
apalacheStoreInFun
. 
Combining several pointers. For instance, when computing
{ x \in S: P }
, we would combine set membership inS
and the value ofP
for everyx
.
3.2. Optimization 1: constant propagation via membership pointers
One immediate application of using the new representation is completely SMTfree construction of some of the TLA+ expressions.
Recall the example in Section 1.2. With the new representation, the set constructor would simply
create five instances of
FixedElemPtr
that carry the value true
, that is, the elements unconditionally belong to S
and T
. Further, the
rule SetCupRule
would simply copy the five pointers, without propagating anything to SMT.
As a result, we obtain constant propagation of set membership, while keeping the general spirit of the arenabased encoding.
3.3. Optimization 2: sharing membership in a powerset
Consider the TLA+ operator that constructs the powerset of S
, that is, the set
of all subsets of S
:
SUBSET S
Let c_S
be the cell that represents the set S
and c_1, ..., c_n
be the
cells that represent the potential elements of S
. Note that in general,
membership of all these cells may be statically unknown. For example, consider
the case when the set S
is constructed from the following TLA+ expression:
{
x \in 1..100:
\E y \in 1..10:
y * y = x
}
In the above example, computation of the predicate is delegated to the SMT solver.
The code in PowSetCtor
constructs 2^Cardinality(S)
sets that contain all subsets of S
. The tricky part here is
that some of the elements of S
may be outside of S
. To deal with that, PowSetCtor
constructs cells for each
potential combinations of c_1, ..., c_n
and adds membership tests for each of them. For instance, consider the
subset T
that is constructed by selecting the indices 1, 3, 5
of 1..n
. The constructor will introduce three
constraints:
(= in_c_1_T in_c_1_S)
(= in_c_3_T in_c_3_S)
(= in_c_5_T in_c_5_S)
Hence, the current encoding introduces 2^n
SMT constants for the subsets and
n * 2^(n  1)
membership constraints in SMT (thanks to Jure @Kukovec for
telling me the precise formula).
With the new representation, we would simply copy the respective membership
pointers of the set S
. This would require us to introduce zero constraints in
the SMT, though we would still introduce 2^n
SMT constants to represent the
subsets themselves. Note that we would still have to introduce n * 2^(n  1)
pointers in the arena. But this would be done during the process of rewriting.
3.4. Feature: computing the set of functions via pointer sharing
Sometimes, it happens that the model checker has to expand a set of functions
[S > T]
. Such a set contains T^S
functions. Since the model checker works with arenas, it can only construct an arena representation of [S > T]
. To this end, assume that the set S
is encoded via cells s_1, ..., s_m
,
and the set T
is encoded via cells t_1, ..., t_n
.
If we wanted to construct [S > T]
in the current encoding, we would have to
introduce a relation for each function in the set [S > T]
. That is, for
every sequence i[1], ..., i[n]
over 1..n
, it would construct the relation
R
:
{ <<s_1, t_i[1]>>, ..., <<s_m, t_i[m]>> }
Let's denote with p_j
the pair <<s_j, t_i[j]>>
for 1 <= j <= m
.
Moreover, we would add m
membership constraints (per function!) in SMT:
(= in_p_1_R (and in_s_1_S in_t_i[1]_T))
...
(= in_p_m_R (and in_s_m_S in_t_i[m]_T))
As a result, this encoding would introduce m * n^m
constants in SMT and the same number of membership constraints. For
instance, if we have m = 10
and n = 5
, then we would introduce 90 million constants and constraints!
Using the approach outlined in this ADR, we can simply combine membership pointers of S
and T
via SmtExprElemPtr
.
This would neither introduce SMT constants, nor SMT constraints. Of course, when this set is used in expressions
like \E x \in S: P
or \A x \in S: P
, the edges will propagate to SMT as constraints.