Through our experiments we investigated the existence and impact of atoms of confusion. Through reverse engineering winners of the International Obfuscated C Code Contest, we generated a partial list of 19 potential atoms. In the atom existence experiment we tested how well subjects could hand trace those atoms compared to functionally equivalent pieces of code with the obfuscations removed. Of the 19 proposed atoms, 15 were statistically confirmed to be significantly confusing to our subjects.
Atom                         Example                        Effect Size  p-value
Change of Literal Encoding   printf("%d", 013)              0.63         2.93e-14
Preprocessor in Statement    int V1 = 1 #define M1 1 +1;    0.54         8.53e-11 *
Macro Operator Precedence    #define M1 64-1 2*M1           0.53         1.77e-07 *
Assignment as Value          V1 = V2 = 3;                   0.52         3.78e-10
Logic as Control Flow        V1 && F2();                    0.48         5.62e-09
Post-Increment/Decrement     V1 = V2++;                     0.45         6.98e-08
Type Conversion              (double)(3/2)                  0.42         5.17e-07
Reversed Subscripts          1["abc"]                       0.40         1.52e-06
Conditional Operator         V2 = (V1==3)?2:V2              0.36         1.74e-05 *
Infix Operator Precedence    0 && 1 || 2                    0.33         5.90e-05
Comma Operator               V3 = (V1+=1, V1)               0.30         2.46e-04
Pre-Increment/Decrement      V1 = ++V2;                     0.28         6.89e-04
Implicit Predicate           if (4 % 2)                     0.24         4.27e-03
Repurposed Variable          argc = 7;                      0.22         6.66e-03
Omitted Curly Braces         if (V) F(); G();               0.22         8.64e-03

Unaccepted Atom Candidates

Atom                         Example                        Effect Size  p-value
Dead, Unreachable, Repeated  V1 = 1; V1 = 2;                0.16         0.059
Arithmetic as Logic          (V1-3) * (V2-4)                0.10         0.248
Pointer Arithmetic           "abcdef"+3                     0.03         0.752 *
Constant Variables           int V1 = 5; printf("%d", V1);  0.00         1.000

* See errata for details




Contingency Tables

In our Existence experiment we analyzed our results by comparing pairs of obfuscated and transformed code. For each pair we categorized whether subjects got both snippets correct, only one, or neither. From these counts we could apply a clustering-adjusted variation of McNemar's test (supplied by clust.bin.pair). Below are these data.

Atom                         Both Correct  Obfuscated Correct  Transformed Correct  Neither Correct
Change of Literal Encoding   35            2                   89                   20
Preprocessor in Statement    39            5                   73                   28
Macro Operator Precedence    53            2                   37                   5
Assignment as Value          59            6                   68                   13
Logic as Control Flow        32            12                  72                   30
Post-Increment/Decrement     75            4                   54                   13
Type Conversion              75            10                  52                   9
Reversed Subscripts          70            6                   40                   30
Conditional Operator         109           1                   34                   1
Infix Operator Precedence    105           4                   25                   11
Comma Operator               53            17                  50                   26
Pre-Increment/Decrement      82            11                  35                   18
Implicit Predicate           108           6                   20                   12
Repurposed Variables         54            15                  33                   44
Omitted Curly Braces         77            13                  33                   23
Dead, Unreachable, Repeated  138           1                   6                    1
Arithmetic as Logic          133           4                   8                    1
Pointer Arithmetic           67            21                  23                   35
Constant Variables           142           2                   2                    0
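
To give a feel for how the counts above feed a paired test, here is a minimal sketch of the plain, unadjusted McNemar statistic, which uses only the two discordant columns (Obfuscated Correct and Transformed Correct). This is not the clustering-adjusted variant from clust.bin.pair used in the study, so it will not reproduce the reported p-values; the function name is an assumption made for illustration.

```c
/* Unadjusted McNemar chi-squared statistic (illustrative only).
   The study used a clustering-adjusted variant, which this does NOT
   implement. The function name is hypothetical. */
double mcnemar_chi2(int only_obfuscated, int only_transformed) {
    int discordant = only_obfuscated + only_transformed;
    if (discordant == 0) {
        return 0.0; /* no discordant pairs: no evidence either way */
    }
    double diff = (double)only_obfuscated - (double)only_transformed;
    return diff * diff / discordant;
}
```

For Change of Literal Encoding, the discordant counts 2 and 89 give (2 - 89)^2 / 91, roughly 83.2, while the balanced counts 2 and 2 for Constant Variables give 0.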




Atom descriptions

Change of Literal Encoding

All numbers are stored in binary inside of a computer, but for human convenience we tend to represent numbers in decimal, and occasionally hexadecimal or octal for certain uses. Even though different representations can hold the same number, their accessibility to humans for different computations can be very different.

Confusing: 208 & 13
Non-confusing: 0xD0 & 0x0D

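As a small illustration (the function names are assumptions, not taken from the study), the sketch below shows two faces of this atom: a leading zero silently makes a literal octal, as in the 013 example from the results table, and a bit mask written in decimal hides the bit patterns that hexadecimal makes visible.

```c
/* Illustrative helpers; the names are hypothetical. */

/* A leading 0 makes the literal octal: 013 is 1*8 + 3. */
int octal_literal(void) { return 013; }

/* Decimal operands hide the bit patterns being masked. */
int mask_decimal(void) { return 208 & 13; }

/* Same values in hex: 1101 0000 and 0000 1101 share no set bits. */
int mask_hex(void) { return 0xD0 & 0x0D; }
```

Here octal_literal() evaluates to 11, and both masks evaluate to 0.
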
Preprocessor in Statement

Preprocessor directives must stand alone on their own line. After the preprocessor runs, however, that line is treated as whitespace. As a result, preprocessor directives may be present in the middle of an expression as long as they are on their own lines. Since the preprocessor directive and the source code are processed in different compiler phases, they are processed independently. Yet, to the casual reader, they appear to interact with each other.

Confusing:
    int V1 = 1
    #define M1 1
    +1;
Non-confusing:
    #define M1 1
    int V1 = 1 + 1;

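The following sketch wraps the confusing form in a function (the function name is an assumption) so that its value can be checked. The directive occupies its own line in the middle of the initializer and is gone by the time the statement is compiled.

```c
/* The #define sits on its own line inside the initializer. After
   preprocessing, the statement that remains is "int V1 = 1 + 1;".
   M1 is defined but never used. The function name is hypothetical. */
int preprocessor_in_statement(void) {
    int V1 = 1
#define M1 1
    + 1;
    return V1;
}
```

The function returns 2: the macro definition never takes part in the arithmetic.
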
Macro Operator Precedence

Macros can be used to add many features to C, including guaranteed inlining, duck typing, and metadata such as line numbers and file names in program output. Unfortunately, macro references are impossible to distinguish from other identifiers and can often act in ways that variables and functions cannot. This can mislead readers.

Confusing:
    #define M1 64 - 1
    2 * M1
Non-confusing:
    2 * 64 - 1

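A short sketch of why the macro misleads: expansion is textual, so the multiplication binds to 64 before the subtraction happens. M1_PAREN and both function names are hypothetical additions for contrast, not part of the study's snippets.

```c
#define M1 64 - 1          /* expands token for token */
#define M1_PAREN (64 - 1)  /* hypothetical parenthesized variant */

/* Expands to 2 * 64 - 1, so the multiplication happens first. */
int macro_unparenthesized(void) { return 2 * M1; }

/* Expands to 2 * (64 - 1), which is what a reader who treats M1
   as a single value might expect. */
int macro_parenthesized(void) { return 2 * M1_PAREN; }
```

The first function yields 127 and the second 126.
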
Assignment as Value

The assignment expression changes the underlying state of the machine when it executes. However, it also returns a value. Often when reading an assignment expression people will forget one of the two effects of the expression.

Confusing: V1 = V2 = 3;
Non-confusing: V2 = 3; V1 = V2;

Logic as Control Flow

Traditionally, the && and || operators are used for logical conjunction and disjunction, respectively, in predicates. Due to short-circuiting, they can also be used for conditional execution.

Confusing: V1 && F2();
Non-confusing: if (V1) F2();

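To see the short-circuit behaviour directly, the sketch below counts how often F2 runs in each form. The counter, the body of F2, and the wrapper names are assumptions made for illustration; the (void) cast only silences unused-value warnings.

```c
static int f2_calls = 0;

/* Stand-in for F2: records that it ran. */
static int F2(void) { f2_calls++; return 1; }

/* && evaluates its right operand only when the left one is nonzero. */
int calls_with_logic(int V1) {
    f2_calls = 0;
    (void)(V1 && F2());
    return f2_calls;
}

/* The same control flow written as an explicit conditional. */
int calls_with_if(int V1) {
    f2_calls = 0;
    if (V1) F2();
    return f2_calls;
}
```

Both wrappers report zero calls when V1 is 0 and one call for any nonzero V1.
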
Post-Increment/Decrement

The post-increment (and decrement) operator increases the value of its operand variable by 1, while returning the original value of the variable. Confusion here arises because the value of the expression is different from the resultant value of the variable.

Confusing: V1 = V2++;
Non-confusing: V1 = V2; V2 += 1;

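The two forms can be compared side by side in a small sketch. The function names and the use of a pointer parameter (so the caller can observe the updated V2) are assumptions for illustration.

```c
/* Returns the value assigned to V1; *V2 is updated as a side effect. */
int post_increment(int *V2) {
    int V1 = (*V2)++; /* V1 receives the old value of *V2 */
    return V1;
}

/* The same effect written as two separate statements. */
int separate_statements(int *V2) {
    int V1 = *V2;
    *V2 += 1;
    return V1;
}
```

Starting from 5, both functions return 5 and leave the variable at 6.
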
Type Conversion

The C compiler will implicitly convert types in various situations when there is a mismatch, but sometimes this conversion also results in an implicit change of outcome from what the author may have intended.

Confusing: 3/2;
Non-confusing: trunc(3.0/2.0);

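A minimal sketch of the example from the results table, (double)(3/2), next to a true floating-point division. The function names are hypothetical.

```c
/* The division happens on ints first, so the fraction is already
   gone by the time the cast to double is applied. */
double cast_after_division(void) { return (double)(3 / 2); }

/* With floating-point operands the fraction is kept. */
double floating_division(void) { return 3.0 / 2.0; }
```

The first function returns 1.0 and the second 1.5.
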
Reversed Subscripts

Arrays can be indexed using the subscript operator, but underneath, "E1[E2] is identical to (*((E1)+(E2)))". Since addition is commutative, so too is the subscript operator.

Confusing: 1["abc"];
Non-confusing: "abc"[1]

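The equivalence can be checked with three spellings of the same access. The function names and the local pointer s are illustrative assumptions.

```c
/* Reversed form: index on the left, array on the right. */
char reversed_subscript(void) { return 1["abc"]; }

/* Conventional form. */
char usual_subscript(void) { return "abc"[1]; }

/* What both forms mean underneath: *((E1) + (E2)). */
char explicit_pointer(void) {
    const char *s = "abc";
    return *(s + 1);
}
```

All three return the character 'b'.
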
Conditional Operator

The conditional operator is the only ternary operator in C, and functions similarly to an if/else block. However, the conditional operator is an expression for which the value is that of the executed branch.

Confusing: V2 = V1 == 3 ? 2 : 1;
Non-confusing: if (V1 == 3) { V2 = 2; } else { V2 = 1; }

Infix Operator Precedence

There are 32 binary operators (operators which accept one operand before and one operand after) in C. Each of these operators falls into one of 15 precedence classes and has either right-to-left or left-to-right associativity. Needless to say, the average programmer knows only a functional subset of the information needed to correctly parse complicated expressions of binary operations.

Our preferred method for removing precedence confusion is with parentheses. Other removal transformations are possible, such as introducing intermediate identifiers. These other strategies can have larger impacts on the structure of the code and so were avoided when possible.

Confusing: 0 && 1 || 2
Non-confusing: (0 && 1) || 2

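The sketch below contrasts the unparenthesized expression with its two possible groupings. The function and parameter names are assumptions; parameters are used instead of literals only so the functions can be called with different values. Some compilers emit a warning suggesting parentheses for the first function, which is itself a hint that the expression is easy to misread.

```c
/* && binds tighter than ||, so this parses as (a && b) || c. */
int implicit_grouping(int a, int b, int c) { return a && b || c; }

/* The grouping the compiler actually uses, made explicit. */
int and_first(int a, int b, int c) { return (a && b) || c; }

/* The grouping a reader might wrongly assume. */
int or_first(int a, int b, int c) { return a && (b || c); }
```

With the values 0, 1, 2, the first two functions yield 1 while the third yields 0.
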
Comma Operator

The comma operator is used to sequence an otherwise ambiguous series of computations. Whether due to its eccentricity, or its odd precedence, the comma operator is commonly misinterpreted.

Confusing: V3 = (V1++, V1);
Non-confusing: V1++; V3 = V1;

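A small sketch of the two forms, wrapped in functions whose names are assumptions. The comma operator evaluates its left operand, discards the result, and yields the value of its right operand.

```c
/* Left operand runs first for its side effect; the value of the
   whole parenthesized expression is the right operand. */
int with_comma(int V1) {
    int V3 = (V1++, V1);
    return V3;
}

/* The same sequence written as two statements. */
int without_comma(int V1) {
    V1++;
    int V3 = V1;
    return V3;
}
```

Called with 1, both functions return 2.
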
Pre-Increment/Decrement

Similar to post-increment and post-decrement, the pre-increment and pre-decrement operators change a variable's value by one. In contrast to the post- forms, pre-increment and pre-decrement first update the variable and then return the new value instead of the old.

Confusing: V1 = ++V2;
Non-confusing: V2 += 1; V1 = V2;

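A sketch of the pre-increment form and its two-statement equivalent. The function names and the pointer parameter (used so the caller can see the updated V2) are illustrative assumptions.

```c
/* Returns the value assigned to V1; *V2 is updated first. */
int pre_increment(int *V2) {
    int V1 = ++(*V2); /* V1 receives the new value of *V2 */
    return V1;
}

/* The same effect written as two separate statements. */
int update_then_assign(int *V2) {
    *V2 += 1;
    int V1 = *V2;
    return V1;
}
```

Starting from 5, both functions return 6 and leave the variable at 6.
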
Implicit Predicate

The semantics of an expression change based on the context in which it is consumed. As the rvalue of a char assignment, i.e., char c = expr, the expression (assuming it makes no variable assignments or updates itself) can result in up to 256 different states of the program. By contrast, the same expression in the context of a predicate, i.e., if (expr), can result in at most two different program states. Often in the context of a condition, the reader can become confused as to the effect of a certain predicate value. We clarified these by explicitly adding comparison operators to the predicates.

Confusing: if (4 % 2)
Non-confusing: if (4 % 2 != 0)

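The sketch below generalizes the snippet to an arbitrary operand so both branches can be exercised. The function names and the 1/0 return convention are assumptions.

```c
/* Any nonzero value of x % 2 selects the true branch. */
int implicit_predicate(int x) {
    if (x % 2) return 1;
    return 0;
}

/* The same test with the comparison against zero spelled out. */
int explicit_predicate(int x) {
    if (x % 2 != 0) return 1;
    return 0;
}
```

For x = 4 the remainder is 0, so neither version takes the branch; for x = 3 both do.
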
Repurposed Variable

By convention, variables tend to have a single conceptual identity and represent one idea. When a variable is used in different roles across the lifetime of the program, its current purpose can be difficult to follow.

Confusing: int main(int argc, char **argv) { argc = 7; ...
Non-confusing: int main(int argc, char **argv) { int V1 = 7; ...

Omitted Curly Braces

C looping and selection exhibit dynamic behavior over a trailing statement. The trailing statement, optionally, can be enclosed in braces for clarity, or to extend the number of sub-statements modified by the loop or conditional. Confusion may arise when the braces are omitted for brevity.

Confusing: if (V1) F1(); F2();
Non-confusing: if (V1) { F1(); } F2();

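To check which calls the conditional actually guards, the sketch below counts invocations. The counters, the bodies of F1 and F2, and the wrapper names are assumptions; the unbraced version is laid out across lines to avoid compiler indentation warnings, but it is the same statement sequence as the confusing snippet.

```c
static int f1_calls = 0;
static int f2_calls = 0;

static void F1(void) { f1_calls++; }
static void F2(void) { f2_calls++; }

/* Without braces the if guards only the next statement, F1(). */
void run_without_braces(int V1) {
    f1_calls = f2_calls = 0;
    if (V1)
        F1();
    F2();
}

/* What a reader may wrongly assume: both calls guarded. */
void run_with_both_guarded(int V1) {
    f1_calls = f2_calls = 0;
    if (V1) { F1(); F2(); }
}
```

With V1 equal to 0, run_without_braces still calls F2 once, whereas run_with_both_guarded calls neither function.
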
Dead, Unreachable, Repeated

Redundant code is code that either will never be executed or whose effects are immediately invalidated. It can be counter-intuitive that code exists yet has no impact on the output of the program.

Confusing: V1 = 1; V1 = 2;
Non-confusing: V1 = 2;

Arithmetic as Logic

Arithmetic operators are capable of mimicking any predicate formulated with logical operators. Arithmetic, however, implies a non-Boolean range, which may be confusing to a reader.

Confusing: (V1 - 3) * (V2 - 4)
Non-confusing: V1 != 3 && V2 != 4

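A sketch comparing the two forms when each is used as a predicate. The function names and the 1/0 return convention are assumptions, and the comparison assumes inputs small enough that the product does not overflow.

```c
/* The product is zero exactly when at least one factor is zero
   (assuming no signed overflow for the inputs used). */
int arithmetic_predicate(int V1, int V2) {
    return ((V1 - 3) * (V2 - 4)) ? 1 : 0;
}

/* The same condition stated with logical operators. */
int logical_predicate(int V1, int V2) {
    return (V1 != 3 && V2 != 4) ? 1 : 0;
}
```

Both functions return 0 whenever V1 is 3 or V2 is 4, and 1 otherwise for small inputs such as (5, 6) or (0, 0).
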
Pointer Arithmetic

Pointers admit several operations like integer addition/subtraction, but, in many cases, these operations are interpreted by the reader to update the target data instead of the pointer data.

Confusing: "abcdef"+3
Non-confusing: "abcdef"[3]

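The sketch below illustrates the point that the addition moves the pointer and leaves the characters untouched. The function names and the local pointer s are assumptions for illustration.

```c
/* Adding 3 to the pointer yields a pointer to the fourth character;
   the string data itself is not modified. */
const char *advance_pointer(void) {
    const char *s = "abcdef";
    return s + 3;
}

/* Indexing reads the character at that same position. */
char index_character(void) { return "abcdef"[3]; }
```

The returned pointer refers to the suffix "def", and dereferencing it gives 'd', the same character that index 3 reads.
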
Constant Variables

Constant variables are a layer of abstraction that, in the context of a complex system, let us focus on the concept a value represents rather than the value itself. When simply trying to determine the output of a piece of code, having a layer of indirection that hides the value of your data can cause difficulty.

Confusing: V1 = V2;
Non-confusing: V1 = 5;