Folder util

Author: Martin Brain, Owen Jones, Chris Smowton

Data Structures

util contains some of the key data-structures used in the CPROVER codebase.

irept: a general-purpose polymorphic tree

See detailed documentation at irept.

irepts are generic tree nodes. You should think of each node as holding a single string (data, actually an irep_idt) and lots of child nodes, some of which are numbered ([sub](irept::dt::sub)) and some of which are labelled ([named_sub](irept::dt::named_sub)). The the label can either start with a “#” or without one. The meaning of the “#” is that this child shouldn't be considered when comparing two irepts for equality; this is usually used when making an advisory annotation which does not alter the semantics of the program.

They are used to represent many kinds of structured objects throughout the CPROVER codebase, such as expressions, types and code. An exprt represents expressions, including for example equal_exprt, an equality predicate, or dereference_exprt, which represents the * unary dereference operator found in C. A typet represents a type, and may have other typet nodes as children: for example, array_typet represents an array, and has a single numbered child that gives the type of the array elements. Finally, codet represents imperative statements, such as code_assignt, which represents an imperative assignment. It has two numbered operands, its left- and right-hand sides.

Note that exprt, codet and similar classes deriving from irept do so in order to add constraints to the general tree (for example, providing accessor methods with user-friendly names, or enforcing invariants that a node must have a certain number of children), but do not override virtual methods or add fields.

The implementation of irept allows saving memory by sharing instances of its internal storage using a copy-on-write scheme. For example, the statement irep1.sub()[0] = irep2; will share rather than copy irep2 and its children, saving memory unless either irept is subsequently modified, at which point a copy is made.

Discussion

Many other compiler infrastructures represent a polymorphic tree using nodes specialised to a particular expression type: for example, perhaps a binary addition operator could be represented using a tag plus two pointers to child expressions, rather than allocating a whole irept (including a variable-length expression vector and empty maps for storing named subexpressions). This may save memory, but inhibits ad-hoc annotations such as tagging the addition "does not overflow" without adding that field to addition operations globally or maintaining a shadow data structure to track that information. This is easier with a general irept structure that can store an arbitrary number of arbitrarily-named child nodes.

Another optimisation commonly seen when storing polymorphic trees is to use a uniform node data structure (like irept) but to keep the node compact, for example storing at most four pointers and transparently allocating extension nodes when necessary for an unusually large expression. This provides the best of both worlds, obtaining compact storage for common cases such as unannotated binary expressions while permitting deviation at times. The downside is that implementing such a system is more complex than straightforwardly using C++ standard data structures as in irept.

Strings: dstringt, the string_container and the ID_*

Within cbmc, strings are represented using irep_idt, which is typedefed to dstringt. You can also easily convert an irep_idt to a std::string using the id2string function, or to a char* using the c_str() member function.

dstringt stores a string as an index into a large static table of strings. This makes it easy to compare if two irep_idts are equal (just compare the index) and it makes it efficient to store many copies of the same string. The static list of strings is initially populated from irep_ids.def, so for example the fourth entry in irep_ids.def is “IREP_ID_ONE(type)”, so the string “type” has index 3. You can refer to this irep_idt as ID_type. The other kind of line you see is "IREP_ID_TWO(C_source_location, #source_location)", which means the irep_idt for the string “#source_location” can be referred to as ID_C_source_location. The “C” is for comment, meaning that it starts with “#”. Any strings that need to be stored as irep_idts which aren't in irep_ids.def are added to the end of the table when they are first encountered, and the same index is used for all instances.

See documentation at dstringt.

typet

typet represents the type of an expression. Types may have subtypes, stored in two subs named “subtype” (a single type) and “subtypes” (a vector of types). For pre-defined types see std_types.h and mathematical_types.h.

tag_typet

tag_typet is a type used to store a reference to the symbol table. The full symbolt can be retrieved using the identifier of tag_typet.

exprt

exprt is the class to represent an expression. It inherits from irept, and the only things it adds to it are that (1) every exprt has an entry in named_sub containing its type and (2) everything in the sub of an exprt is again an exprt, not just an irept. You can think of exprt as a node in the abstract syntax tree for an expression. There are many classes that inherit from exprt and which represent more specific things. For example, there is minus_exprt, which has a sub of size 2 (for the two arguments of minus).

Recall that every irept has one piece of data of its own, i.e. its id(), and all other information is in its named_sub or sub. For exprts, the id() is used to specify why kind of exprt this is, so a minus_exprt has ID_minus as its id(). This means that a minus_exprt can be passed wherever an exprt is expected, and if you want to check if the expression you are looking at is a minus expression then you have to check its id() (or use can_cast_expr<minus_exprt>).

codet

See documentation at codet.

exprt represents expressions and codet represents statements. codet inherits from exprt, so all codets are exprts, with id() ID_code. Many different kinds of statements inherit from codet, and they are distinguished by their statement. For example, a while loop would be represented by a code_whilet, which has statement ID_while. code_whilet has two operands in its sub, and helper functions to make it easier to use: cond() returns the first subexpression, which is the condition for the while loop, and body() returns the second subexpression, which is the body of the while loop - this has to be a codet, because the body of a while loop is a statement.

symbolt, symbol_tablet, and namespacet

A symbol table is a mapping from symbol names to symbolt objects, which store a symbol's name, attributes, type and perhaps value. They are used to describe local and global variables, type definitions and function prototypes and definitions.

All symbols store a type (an instance of typet). For function or method symbols these are code_typet instances.

Global variable symbols may have a value (an exprt), in which case it is used to initialise the global.

Method or function symbols may also have a value, in which case it is a codet and gives the function definition. A method or function symbol without a value is a prototype (for example, it might be an extern declaration in C). A function symbol that has been converted to a GOTO function may be replaced with a special "compiled" value, but this varies from driver program to program – at the time of writing, only goto-cc does this.

Local variables' symbol values are always ignored; any initialiser must be explicitly assigned after they are instantiated by a declaration (code_declt).

Symbol expressions (symbol_exprt) and types (tag_typet) refer to symbols stored in a symbol table. Symbol expressions can be thought of as referring to the table for more detail about a symbol (for example, is it a local or a global variable, or perhaps a function?), and have a type which must match the type given in the symbol table. Symbol types can be thought of as shorthands or aliases for a type given in more detail in the symbol table, for example permitting a shorthand for a large structure type, as well as permitting the construction of expressions referring to that type before its full definition is known.

Note the implementation of symbol_tablet is split into a base interface (symbol_table_baset) and an implementation (symbol_tablet). There is one alternate implementation (journalling_symbol_tablet) which additionally maintains a log or journal of symbol creation, modification and deletions.

Namespaces (namespacet) provide a read-only view on one or more symbol tables, and provide helper functions that aid accessing the table. A namespace may layer one or more symbol tables, in which case any lookup operation checks the 'top' symbol table before moving down the layers towards the 'bottom' symbol table, looking up the target symbol name in each successive table until one is found. Note class multi_namespacet can layer arbitrary numbers of symbol tables, while for historical reasons namespacet can layer up to two.

The namespace wrapper class also provides the namespacet::follow_tag operation, which dereferences a tag_typet to retrieve the type it refers to, including following a type tag which refers to another symbol which eventually refers to a 'real' type.

Symbol lifetimes

Symbols with symbolt::is_static_lifetime set are initialised before a program's entry-point is called and live until it ends. Such long-lived variables are used to implement globals, but also C's procedure-local static variables, which have restricted visiblity but the lifetime of a global. They may be marked dead using a code_deadt instruction, but this does not render the variable inaccessible, it only indicates that the variable's current value will not be read without an intervening write.

Non-type, non-global symbols (those with symbolt::is_static_lifetime and symbolt::is_type not set) are local variables, and their lifespan description varies depending on context.

In symbol table function definitions (the values of function-typed symbols), local variables are created using a code_declt instruction, and live until their enclosing code_blockt ends (similar to the C idiom { int x; ... } // x lifespan ends). By contrast in GOTO programs locals are declared using a DECL instruction and live until a DEAD instruction referring to the same symbol. Explicit DEAD instructions are always used rather than killing implicitly by exiting a function.

Multiple instances of the same local may be live at the same time by virtue of recursive function calls; a dead instruction or scope end always targets the most recently allocated instance.

Scoping rules are particular to source languages and are not enforced by CPROVER. For example, in standard C it is not possible to refer to a local variable across functions without using a pointer, but in some possible source languages this is permitted.

Symbol details

Symbols have:

A mode, which indicates the source language frontend responsible for creating them. This is mainly used in pretty-printing the symbol table, to indicate the appropriate language frontend to use rendering the symbol's value and/or type. For example, mode == ID_C == "C" indicates that ansi_c_languaget, the C front-end, should be used to pretty-print, which in turn delegates to expr2ct.
A base-name and pretty-name, which are a short and user-friendly version of the symbol's definitive name respectively.
Several flags (see symbolt for full details), including symbolt::is_static_lifetime (is this a global variable symbol?), symbolt::is_type (is this a type definition), symbolt::is_thread_local (of a variable, are there implicitly instances of this variable per-thread?).

cmdlinet

See cmdlinet.

Examples: how to represent the AST of statements, in C and in java

x = y + 123

To be documented..

if (x > 10) { y = 2 } else { v[3] = 4 }

To be documented.

Java arrays: struct Array { int length, int *data };

To be documented.

Goto Model Typecheck

Class typecheckt.

<tt>irept</tt> Serialization

The module provides serialisation and deserialisation of integer numbers, strings, and irept instances.

This is implemented in C++ modules:

irep_serialization.h
irep_serialization.cpp
irep_hash_container.h (applies only to irept instances)
irep_hash_container.cpp (applies only to irept instances)

Serialization of Numbers

A number is serialiased in 7-bit encoding. For example, given a 2-byte number in base 2, like 10101010 01010101, then it is saves in 3 bytes, where each byte takes only 7 bits from the number, reading from the left. The 8th bit in each output byte is set to 1 except in the last byte, where the bit is 0. That 0 bit indicates the end of the encoding of the number. So, the output bytes look like this: 11010101 11010100 00000010.

This is implmented in the function write_gb_word.

The deserialisation does the oposite process and it is implemented in irep_serializationt::read_gb_word.

Serialization of Strings

A string is encoded as classic 0-terminated C string. However, characters 0 and \\ are escaped by writing additional \\ before them.

This is implmented in the function write_gb_string and the deserialisation is implemented in irep_serializationt::read_gb_string.

Each string which is stored inside an irept instance is saved (meaining its characters) into the ouptut stream, only in the first serialisation query of the string. In that case, before the string there is also saved a computed integer hash code of the string. Then, all subsequent serialisation queries save only that integer hash code of the string.

Serialization of <tt>irept</tt> Instances

CProver output - printing.

CProver output can be in plain text, json or xml format. All of CProver output should use the built-in messaging infrastructure, which filters messages by 'verbosity'. Default verbosity filtering is set to the maximum level (10), i.e. all messages are printed. Error messages have the lowest level of verbosity (level 1) while debug messages have the highest level (level 10). Intermediate levels (in order of increasing 'verbosity') are for warnings, results, status/phase information, statistical information and progress information (more information on these verbosity levels can be found in messaget in the message.h header).

Key classes related to the messaging infrastructure can be found in the message.h header. messaget provides messages (with a built-in verbosity level), which are then processed by subclasses of message_handlert. These filter messages according to verbosity level and direct output to an appropriate location. An important group of subclasses is stream_message_handlert and its subtypes, which direct messages to an output stream (std::cout & std::cerr, std::clog is not used). In particular, messages of verbosity level less than 3 are directed to std::cerr, while others go to std::cout (this may change, but you should be aware that not all messages are necessarily directed to only std::cout or std::cerr). Another key message_handlert subclass is ui_message_handlert - which provides output in plain text, json or xml formats.

CProver legacy output

Although all output should use the messaging interface - there are a few locations where this is not yet implemented. These should not be extended - but it may be helpful to be aware of where this happens.

Output from invariants / exceptions

Invariants output to std::cerr - and provide a backtrace and optionally some diagnostic information. For more information on invariants, see invariant.h

Exceptions have a standard what() interface. Best current practice for exceptions is for the output of what() to be routed via a message with verbosity level 1 (as returned by messaget::error()). The message is then processed by a message_handler. Where plain text output (versus json or xml output) is chosen, exceptions print (indirectly) to std::cerr. Json and xml output always goes to std::cout. There are still a few locations where exceptions print directly to std::cerr. These should not be extended. More information on exceptions can be found in exception_utils.h.
Direct output via std::cout & std::cerr

These are in the process of being removed - no new output should go via std::cout or std::cerr, but should instead use the messaget and message_handlert infrastructure.

Graph

Implemented in src/util/graph.h as grapht class. The grapht class represents a directed graph. However, an undirected graph can be emulated by inserting for each edge (u, v) also (v, u). A multi-graph is not supported though, because parallel edges are not allowed between vertices.

Data representation

A graph is defined by a template class grapht<N>, where N is the type of the nodes of the graph. The class grapht stores the nodes in a vector. A user of the graph is supposed to access the nodes via their indices in the vector. The node type N must meet these requirements:

It must be default constructible; the common way how to insert a node to the graph is to call method add_node which pushes a new default-constructed instance of N to the end of the vector and returns its index. Then operator[] can be used to obtain a reference to the pushed instance and set its content as desired.
It must define a type edget, representing type of data attached to any edge of the graph; in case edges should not have any data attached, then one can use a predefined class empty_edget. The type must be default-constructible.
It must define a type edgets as an associative container compatible with the type std::map<std::size_t, edget>.
It must define publicly accessible members of the type edgest, named in and out. The members represent edges in the graph. Namely, if u is an index of some node in the graph and v is a key in the map in (resp. out) of that node, then (v, u) (resp. (u, v)) represent an edge in the graph, with attached data in[v] (resp. out[v]).
It must define methods add_in, erase_in, and add_out, erase_out for insertion and removal of values to and from the maps in and out.
It must define a method pretty converting the node to a "pretty" string.

One can use a predefined template class graph_nodet<E> as the template parameter N of the graph. The template parameter E allows to define the type edget.

Graph algorithms

The graph implementation comes with several graph algorithms. We describe each of them in following paragraphs.

TODO!