CBMC
|
This module provides a front end for Java.
To be documented.
The Java language contains high-level programming concepts like virtual functions and throw/catch semantics. These need to be rewritten in terms of other, more fundamental operations in order to analyse the Java program. This operation is referred to as "lowering". See Background Concepts for more information.
The following lowering operations are done on Java bytecode after converting into a basic codet
representation.
These are performed in process_goto_function
for example jbmc_parse_optionst::process_goto_function
Once these lowerings have been completed, you have a GOTO model that can be handled by goto-symex.
To be documented.
To be documented.
In Java, all variables that are of a non-primitive type are pointers to an object. When initializing such variables using java_object_factoryt, e.g., as input arguments for the method under test, we may need to select the correct pointer type. For example, abstract classes can be replaced with their concrete implementations and type parameters in generic types can be replaced with their specialized types.
The class select_pointer_typet offers the basic interface for this functionality, in particular see select_pointer_typet::convert_pointer_type. Note that this class only implements generic specialization (see Generic specialization), derived classes can override this behavior to provide more sophisticated type selection.
In Java, generics are checked at compile-time for type-correctness. The generic type information is then removed in a process called type erasure. However, throwing away all this type information would be very inconvenient for our analysis. Therefore, when we initialize variables that have generic type we dynamically replace the generic parameters with their specialized types rather than using only the raw type. For example, consider a generic class MyGeneric<T>
with a field T myField
. When initializing a variable MyGeneric<Integer> mg
we change the type of its field mg.myField
from MyGeneric::T
to Integer
. Generic specialization is applied during pointer type selection (see Pointer type selection).
The generic specialization relies on a map that stores the concrete types of all generic parameters in the current scope. The map is maintained in java_object_factoryt::generic_parameter_specialization_map. Note that not all generic parameters need to have a concrete type, e.g., when the method under test is generic. The types are removed from the map when the scope changes. In different depths of the scope the parameters can be specialized with different types so we keep a stack of types for each parameter.
We use the map in select_pointer_typet::specialize_generics to retrieve the concrete type of generic parameters such as MyGeneric::T
and of arrays of generic parameters such as MyGeneric::T[]
. More complicated generic types such as MyGeneric<T>
are specialized indirectly within java_object_factoryt. Their concrete types are already stored in the map and are retrieved when needed, e.g., to initialize their fields.
To be documented.
Every Java class is compiled into a .class file. Inner classes, anonymous classes or classes for tableswitches are also compiled into their own .class files.
There exists an official specification
Each class files contains information about its version, the constant pool, information about the contained class, its super-class, its implemented interfaces, its fields, methods and finally additional attributes, such as information about lambda functions used in the methods of the class or inner classes the it contains.
The content of a .class file can be inspected via the javap
tool which is part of the JDK installation. Useful options are -c
for code, -p
to display protected / private methods -l
for line numbers and local variable information, as well as -verbose
which prints a lot of additional information.
In general, all variable length entries in a class file are represented by first having an integer n
that specifies the number of entries and then an array of n
such entries in the file. All variable length attribute also contain information about their length in bytes.
The integer format used in class files are unsigned integers of size 8/16/32 bit, which are named as u1
/u2
/u4
.
The JVM specification defines different access flags, e.g., final
, static
, protected
, private
etc. where different ones are applicable to the class itself, its fields or methods. All access flags are represented as bits, the set of bits that are defined for one entity is represented as disjunction of those values. Each of these values is defined as a constant with a name prefixed with ACC_
in JBMC, e.g., as
The constant pool contains all strings and referred values that are used in the .class. This includes the names of the class itself and its super-class, as well as the names and signatures of all fields and methods. All strings in the constant pool are in UTF-16 format.
Each member variable of a class has a field entry with a corresponding field information structure. This contains the name of the field, its raw JVM type (called the descriptor) and an optional signature.
A signature is an extension to the Java raw types and contains information about the generic type of an object if applicable.
The name of the field, the descriptor and the signature are all represented as indices into the constant pool of the class file.
Methods are represented in a similar way as fields. Each method has an associated name, descriptor and optional signature entry in the constant pool table.
An implemented method also has several attributes. One is the Code
attribute that stores the actual bytecode instructions. There is also an optional LocalVariableTable
which contains the names of local variables and parameters. In addition to this there is also an optional LocalVariableTypeTable
that contains information about generic local variables and parameters. Finally the exception table is defined as entries in the Exceptions
attribute.
Note: most information about generic types is optional and exists mainly as debugger information. This is because the Java compiler ensures that typing is correct and creates code accordingly. The same holds true for information on local variables. It is therefore advisable to compile Java projects with the -g
option that adds debugging information in all cases.
The last section contains additional attributes, e.g., SourceFile
which specified from which source file the .class was compiled, BootstrapMethods
which is used for lambda functions or InnerClasses
which refers to inner classes of the class in question.
To be documented.
Relevant code:
Concurrency instrumentation for Java programs is disabled by default. It can be enabled by specifying the --java-threading
command line flag.
Thread-blocks have special semantics, in the sense that unlike basic blocks they can be executed independently. JBMC treats a sequence of codet's surrounded with calls to CProver.startThread:(I)V
and CProver.endThread:(I)V
as a thread-block. These functions take one argument, an integer, that is used to associate the start of a thread (i.e: calls to CProver.startThread:(I)V
) to the end of the thread (i.e: calls to CProver.endThread:(I)V
). JBMC assumes that calls to the these functions are well-formed, more specifically, each call to CProver.startThread:(I)V
must have an associated call to CProver.endThread:(I)V
.
The instrumentation process (described here) will transform the aforementioned function calls, synchronized blocks/methods and calls to java.lang.Thread.getCurrentThreadId:()I
into appropriate codet. As part of this process a new variable per thread is created along with a single global variable that is used keep track of thread identifiers. These variables are needed to handle calls to java.lang.Thread.getCurrentThreadId:()I
.
Hold on, how do we go from Java threads to CProver.startThread:(I)V
and CProver.endThread:(I)V
? Well, two ways
CProver.startThread:(I)V
and CProver.endThread:(I)V
.java.lang.Thread
class.Note: instrumentation of Java static initializers changes when the --java-threading
flag is specified, this is because the static initializer of every class needs to be carefully synchronized to prevent superfluous interleavings.
Note': conversation of synchronized methods (void synchronized test(){ ... }
) and synchronized blocks (synchronized(lock) { ... }
) are dealt with differently as they are represented differently in bytecode. One is a flag, while the other has explicit instructions.
JBMC will iterate through the symbol table to find and instrument thread-blocks. Specifically, function calls to CProver.startThread:(I)V
are transformed into a codet(id=ID_start_thread, destination=I)
, which carries a target label in the field destination
. Similarly CProver.endThread(I)V
is transformed into a codet(id=ID_end_thread)
.
For each new thread a thread local variable __CPROVER__thread_id
is created. The new id is obtained by incrementing a global variable __CPROVER__next_thread_id
.
The semantics of codet(id=ID_start_thread, destination=I)
roughly corresponds to: spawn the current thread, continue the execution of the current thread in the next line, and continue the execution of the new thread at the destination.
Example, the following Java code:
is transformed into the following codet:
The ID of the thread is made accessible to the Java program by having calls to the function CProver.getCurrentThreadId()I
replaced by the variable __CPROVER__thread_id
.
Example, the following Java code:
is transformed into the following codet:
Synchronized blocks make it impossible for two synchronized blocks on the same object to interleave.
Converting synchronized blocks is rather straightforward as there is a specific bytecode instruction to indicate the start (monitorenter
) and end (monitorexit
) of a synchronized block.
monitorenter
is converted to a call to java::java.lang.Object.monitorenter:(Ljava/lang/Object;)V
.
monitorexit
is converted to a call to java::java.lang.Object.monitorexit:(Ljava/lang/Object;)V
.
Synchronized methods make it impossible for two invocations of the same method on the same object to interleave.
A method is considered to be synchronized and thus subject to the following process, if the symbol that represents it has the irep_id is_synchronized
specified. In Java this corresponds to the synchronized
keyword in the declaration of a method.
Unlike synchronized blocks Java synchronized methods do not have explicit calls to the instructions monitorenter
and monitorexit
(or equivalents), instead the JVM interprets the synchronized flag and internally implements locking/unlocking on the object.
To emulate this behaviour JBMC will iterate through the symbol table to find synchronized methods. It will then insert a call to our model of monitorenter
(java::java.lang.Object.monitorenter:(Ljava/lang/Object;)V
) at the beginning of the method and calls to our model of monitorexit
(java::java.lang.Object.monitorexit:(Ljava/lang/Object;)V
) is instrumented at each return instruction. We wrap the entire body of the method with an exception handler that will call our model of monitorexit
if the method returns exceptionally.
Example, the following Java code:
The codet version of the above is too verbose be shown here. Instead the following code depicts our equivalent Java code interpretation of the above:
remove_java_new.h is responsible for converting the new
, newarray
and multianewarray
Java bytecode operation into codet. Specifically it converts the bytecode instruction into: - An ALLOCATE with the size of the object being created - An assignment to the value zeroing its contents - If an array, initializing the size and data components - If a multi-dimensional array, recursively calling java_new
on each sub array
An ALLOCATE is a side_effect_exprt that is interpreted by goto_symext::symex_allocate
Note: it does not call the constructor as this is done by a separate java_bytecode operation.
The basic new
operation is represented in Java bytecode by the new
op
These are converted by remove_java_newt::lower_java_new
For example, the following Java code:
Which is represented as the following Java bytecode:
The first instruction only is translated into the following codet
s:
For more details about the zero expression see expr_initializert
The new Java array operation is represented in Java bytecode by the newarray
operation.
These are converted by remove_java_newt::lower_java_new_array
See How are Java arrays represented in GOTO for details on how arrays are represented in codet.
A newarray
is represented as:
For example the following Java:
Which is compiled into the following Java bytecode:
Is translated into the following codet
s:
The ARRAY_SET
codet
sets all the values to null.
The new Java multi dimensional array operation is represented in bytecode by multianewarray
These are also by remove_java_newt::lower_java_new_array
They work the same as single dimensional arrays but create a for loop for each element in the array since these start initialized.
remove_java_new
is then recursively applied to the new subarray
.
When remove_exceptions is called on the goto_modelt, the goto model contains complex instructions (goto_program_instruction_typet) such as CATCH-POP
, CATCH-PUSH
and THROW
. In order to analyze the goto model, the instructions must be simplified to use more basic instructions - this is called "lowering". This class lowers the CATCH
and THROW
instructions.
THROW
instructions are replaced by assigning to @inflight_exception
and a goto to end of the function. CATCH
instructions are replaced by a check of the @inflight_exception
and setting it to null.
Consider a simple method testException(I)V
:
The goto for testException(I)V
before remove_exceptions
(removing comments and replacing irrelevant parts with ...
) is:
where there is a THROW
instruction to be replaced.
After passing the goto model through remove_exceptions
, it is:
where now instead of the instruction THROW
, the global variable @inflight_exception
holds the thrown exception in a separate goto statement.
Consider the method catchSomething(I)V
that tries the above method testException(I)V
and catches the exception:
The goto model before remove_exceptions
is:
The CATCH-PUSH
instruction signifies the start of the try block, the CATCH-POP
instruction signifies the end of the try block, and the EXCEPTION LANDING PAD
signifies beginning of the catch block.
After remove_exceptions
the goto model is:
where the CATCH-PUSH
has been replaced by a check on the @inflight_exception
variable and goto statements, the CATCH-POP
replaced by a check on the class of the exception and a goto statement, and the EXCEPTION LANDING PAD
replaced by a section that assigns the exception to a local variable and sets the @inflight_exception
back to null.
remove_instanceof.h removes the bytecode instruction instanceof
and replaces it with two instructions:
To be documented.
Loading strategies are policies that determine what classes and method bodies are loaded. On an initial pass, the symbols for all classes in all paths on the Java classpath are loaded. Eager loading is a policy that then loads all the method bodies for every one of these classes. Lazy loading strategies are policies that only load class symbols and/or method bodies when they are in some way requested.
The mechanism used to achieve this is initially common to both eager and context-insensitive lazy loading. java_bytecode_languaget::typecheck calls java_bytecode_convert_class (for each class loaded by the class loader) to create symbols for all classes and the methods in them. The methods, along with their parsed representation (including bytecode) are also added to a map called java_bytecode_languaget::method_bytecode via a reference held in java_bytecode_convert_classt::method_bytecode. typecheck then performs a switch over the value of lazy_methods_mode to determine which loading strategy to use.
If lazy_methods_mode is LAZY_METHODS_MODE_EAGER then eager loading is used. Under eager loading java_bytecode_languaget::convert_single_method(const irep_idt &, symbol_table_baset &, lazy_class_to_declared_symbols_mapt &, message_handlert &) is called once for each method listed in method_bytecode (described above). This then calls java_bytecode_languaget::convert_single_method(const irep_idt &, symbol_table_baset &, std::optional<ci_lazy_methods_neededt>, lazy_class_to_declared_symbols_mapt &, message_handlert &);
without a ci_lazy_methods_neededt object, which calls java_bytecode_convert_method, passing in the method parse tree. This in turn uses java_bytecode_convert_methodt to turn the bytecode into symbols for the parameters and local variables and finish populating the symbol for the method, including converting the instructions to a codet.
Context-insensitive lazy loading is an alternative method body loading strategy to eager loading that has been used in Deeptest for a while. Context-insensitive lazy loading starts at the method given by the --function
argument and follows type references (e.g. in the definitions of fields and method parameters) and function references (at function call sites) to locate further classes and methods to load. The following paragraph describes the mechanism used.
If lazy_methods_mode is LAZY_METHODS_MODE_CONTEXT_INSENSITIVE then context-insensitive lazy loading is used. Under this stragegy java_bytecode_languaget::do_ci_lazy_method_conversion is called to do all conversion. This calls operator()
of ci_lazy_methodst, which creates a work list of methods to check, starting with the entry point, and classes, starting with the types of any class-typed parameters to the entry point. For each method in the work list it calls ci_lazy_methodst::convert_and_analyze_method, which calls the same java_bytecode_languaget::convert_single_method used by eager loading to do the conversion (via a std::function
object passed in via parameter method_converter) and then calls ci_lazy_methodst::gather_virtual_callsites to locate virtual calls. Any classes that may implement an override of the virtual function called are added to the work list. Finally the symbol table is iterated over and methods that have been converted, their parameters and local variables, globals accessed from these methods and classes are kept, everything else is thrown away. This leaves a symbol table that contains reachable symbols fully populated, including the instructions for methods converted to a codet.
To be documented.
To be documented.
To be documented.
See also String Solver Interface.
To be documented.
For high-level documentation about goto traces, see Goto Trace Structure.
There are three types of assignments to construct all types of Java objects:
We use the convention left-hand side (LHS) = right-hand side (RHS) to represent assignments in this document.
The expressions on each side vary depending on the type, whether its a member assignment, whether its an assignment by reference, etc. Some examples:
java::SomeClass.someGlobalField
, and a RHS expression containing the value, e.g. a constant_exprt, a struct_exprt or an address_of_exprt. See Variable assignments.java::SomeClass.<init>:()V::this
which corresponds to a symbol with is_static_lifetime
set to true. See Variable assignments.java::array
, then if the array is non-empty it is followed by a series of index_exprt assignments to assign elements to the array of the required type. A notable exception to this assignment structure is when arrays are created non-deterministically for primitive types - these are created by assigning an entire array to the java::array
symbol instead of being built using index assignments. See Array assignments.Variable assignments have LHS symbol_exprt with an identifier that reflects the variable name and scope and a RHS that is one of the following (noting that Strings are an exception, see String assignments):
A primitive variable assignment has the form:
An assignment of a non-primitive (struct) variable can have the following forms:
struct_exprt that have members are constructed first by an assignment of the form
where the members of the struct_exprt are zero-initialized (null for struct_exprt and 0 for constant_exprt) then members can be assigned. For primitive members, this is a simple assignment to the member:
For members of reference type, the object to be assigned must first be created, then assigned by address, e.g.
This concept becomes clearer with an example.
Consider the trace for the function call to Example.assignReferenceTypes()
, which assigns a value to the static field globalReferenceTypeReferenceField
.
There are 54 non-trivial steps in the trace to make this assignment. This example will give detail of the assignment steps only, filling in other steps with description.
CProver initialize initializes the instance of the class containing the function under test, Example, to have null
static field:
java::ObjectReference.globalReferenceTypeReferenceField
= constant_exprt with null
valueObjects are empty-initialized to build the assignment to globalReferenceTypeReferenceField
. symex_dynamic::dynamic_object1
represents globalReferenceTypeReferenceField
and symex_dynamic::dynamic_object2
represents referenceField
.
symex_dynamic::dynamic_object1
= struct_exprt with referenceField
set to null
valuesymex_dynamic::dynamic_object2
= struct_exprt with primitiveField
set to 0
Call to the constructor of ReferenceTypePrimitiveField
, initializing the parameters this
and i
java::ReferenceTypePrimitiveField.<init>:(I)V::this
= constant_exprt with address_of_exprt pointing to symex_dynamic::dynamic_object2
java::ReferenceTypePrimitiveField.<init>:(I)V::i
= constant_exprt with value 5
Call to the super constructor (Object). The to_construct
parameter is internal
java::java.lang.Object.<init>:()V::this
= constant_exprt with address_of_exprt pointing to symex_dynamic::dynamic_object2
java::java.lang.Object.<init>:()V::to_construct
= constant_exprt with address_of_exprt pointing to symex_dynamic::dynamic_object2
Setting the member of symex_dynamic::dynamic_object2, primitiveField. Note that primitives are always assigned using constant expressions with values.
symex_dynamic::dynamic_object2.primitiveField
= constant_exprt with value 5
Call to the constructor of ReferenceTypeReferenceField, initializing the parameters this
and r
java::ReferenceTypeReferenceField.<init>:(LReferenceTypePrimitiveField;)V::this
= constant_exprt with address_of_exprt pointing to symex_dynamic::dynamic_object1
java::ReferenceTypeReferenceField.<init>:(LReferenceTypePrimitiveField;)V::r
= constant_exprt with address_of_exprt pointing to symex_dynamic::dynamic_object2
Call to the super constructor (Object). The to_construct
parameter is internal
java::java.lang.Object.<init>:()V::this
= constant_exprt with address_of_exprt pointing to symex_dynamic::dynamic_object1
java::java.lang.Object.<init>:()V::to_construct
= constant_exprt with address_of_exprt pointing to symex_dynamic::dynamic_object1
Setting the member of symex_dynamic::dynamic_object1, referenceField. Note that this because referenceField is of struct type, it is assigned by reference.
symex_dynamic::dynamic_object1.referenceField
= constant_exprt with address_of_exprt pointing to symex_dynamic::dynamic_object2
Finally, the actual assignment to globalReferenceTypeReferenceField is done by reference to the object we built symex_dynamic::dynamic_object1
java::ObjectReference.globalReferenceTypeReferenceField
= constant_exprt with address_of_exprt pointing to symex_dynamic::dynamic_object1
Using the flag "--validate-trace" will call the function check_trace_assumptions, which will throw an error when these assumptions are not met.
To be documented.
To be documented.