Materials
Atoms of confusion have been shown to confuse programmers. What remains to be shown is how prevalent these patterns are in real production software. This project attempts to answer that question.
We have written classifiers that can identify which parts of a C program contain atoms of confusion.
Corpus
All of our experiments are performed on a set of 14 large and popular open source projects.
Project | Domain | Creation | KLOC | Revision |
---|---|---|---|---|
Linux | Operating System | 1991 | 22641 | f341578 |
FreeBSD | Operating System | 1993 | 20496 | c2b6ea8 |
Gecko | Browser Renderer | 1998 | 15170 | dd47bee |
WebKit | Browser Renderer | 2001 | 8216 | e8c7320 |
GCC | Compiler Suite | 1988 | 5488 | 2201c33 |
Clang | Compiler Suite | 2007 | 2001 | 2bcd2d0 |
MongoDB | Database | 2007 | 3872 | 67f735e |
MySQL | Database | 2000 | 2990 | 0138556 |
Subversion | Version Control | 2000 | 720 | 0a73cab |
Git | Version Control | 2005 | 253 | ba78f39 |
Emacs | Text Editor | 1985 | 484 | cb73c70 |
Vim | Text Editor | 1991 | 459 | 6ce6504 |
Httpd | Webserver | 1996 | 637 | 6fe2348 |
Nginx | Webserver | 2002 | 187 | 9cb9ce7 |
Validation Examples
To make sure our classifiers were capturing patterns that represent atoms of confusion, we hand-validated a sample of the positively flagged AST nodes. We randomly selected up to 40 examples for each of our classifiers. Although we hand-evaluated only the first 20 of each, all examples are included below.
- Assignment as Value
- Comma Operator
- Conditional Operator
- Implicit Predicate
- Literal Encoding
- Logic as Control Flow
- Macro Operator Precedence
- Omitted Curly Brace
- Operator Precedence
- Post-Increment
- Pre-Increment
- Preprocessor in Statement
- Repurposed Variable
- Reversed Subscript
- Type Conversion
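For readers unfamiliar with the atom names above, the following sketch shows a few of them in isolation. The snippets are hypothetical illustrations, not drawn from the corpus or from the validation examples, and every function name is invented for this purpose.

```c
/* Hypothetical illustrations of a few atoms; none of this code comes from the corpus. */

/* Assignment as Value: the result of an assignment is used as an operand. */
int assignment_as_value(void) {
    int a, b;
    a = (b = 3) + 1;    /* b becomes 3, then a becomes 4 */
    return a * 10 + b;
}

/* The same computation with one assignment per statement. */
int assignment_as_value_clear(void) {
    int b = 3;
    int a = b + 1;
    return a * 10 + b;
}

/* Comma Operator: the left operand is evaluated and discarded,
   and the expression takes the value of the right operand. */
int comma_operator(void) {
    int x = 1;
    int y = (x += 2, x * 2);    /* x becomes 3, y becomes 6 */
    return y;
}

/* Logic as Control Flow: short-circuit evaluation of && decides
   whether the increment runs at all. */
int logic_as_control_flow(int flag) {
    int count = 0;
    (void)(flag && ++count);    /* cast only silences unused-value warnings */
    return count;
}

/* Literal Encoding: a leading zero makes the literal octal. */
int literal_encoding(void) {
    return 013;    /* eleven, not thirteen */
}

/* Post-Increment: the expression yields the old value of v1. */
int post_increment(void) {
    int v1 = 2;
    int v2 = v1++;    /* v2 is 2, v1 is now 3 */
    return v1 * 10 + v2;
}
```

In each case the behavior is well defined by the C language. The atom lies in how easily a reader can mispredict the result.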
Datasets Mined from our Corpus
Nearly every graphic in our paper is generated from a dataset mined from our corpus and tailored to that graphic. The majority of these datasets are available in our source repository.
Some data is directly readable. For example, the file atom_counts.csv shows the total number of atoms of each type in each of our corpus projects. It is only 15 rows long, including the header, and 18 columns wide:
Project | All Nodes | Non Atoms | Operator Precedence | Omitted Curly Braces | Implicit Predicate | Conditional | Preprocessor In Statement | Logic As Control Flow | Post Increment | Repurposed Variable | Assignment As Value | Comma Operator | Pre Increment | Type Conversion | Literal Encoding | Macro Operator Precedence | Reversed Subscript |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
linux | 78757141 | 77439581 | 555452 | 570565 | 143299 | 53874 | 40085 | 33862 | 23461 | 21381 | 17527 | 8325 | 5500 | 3201 | 2675 | 494 | 0 |
freebsd | 76779090 | 75741579 | 256235 | 528778 | 72499 | 58481 | 92179 | 73127 | 25459 | 24270 | 43904 | 33037 | 8312 | 2568 | 1710 | 1359 | 2 |
gecko-dev | 43274783 | 42930794 | 115529 | 122117 | 13134 | 29027 | 37237 | 37911 | 12663 | 10956 | 13168 | 15557 | 2858 | 2363 | 1377 | 1133 | 0 |
gcc | 33466986 | 32929723 | 110741 | 215568 | 128546 | 79764 | 4964 | 38819 | 10478 | 14243 | 8170 | 8516 | 2197 | 9609 | 1358 | 19 | 32 |
webkit | 15081419 | 14937775 | 35688 | 71575 | 5347 | 10172 | 13859 | 17070 | 2649 | 2877 | 2398 | 4105 | 1218 | 688 | 276 | 362 | 4 |
mongo | 12635600 | 12538099 | 21474 | 47775 | 1953 | 6775 | 2291 | 11166 | 3409 | 2797 | 4584 | 3135 | 2418 | 284 | 192 | 87 | 0 |
clang | 10177857 | 10006694 | 43234 | 69303 | 37182 | 31703 | 1253 | 11921 | 748 | 2197 | 1431 | 2105 | 4535 | 1306 | 20 | 61 | 15 |
mysql-server | 9840835 | 9737142 | 24025 | 45514 | 6398 | 6931 | 2193 | 9227 | 4290 | 1941 | 8568 | 5025 | 964 | 475 | 140 | 395 | 0 |
subversion | 2717036 | 2693808 | 4529 | 14615 | 414 | 1915 | 995 | 1350 | 322 | 287 | 465 | 100 | 87 | 7 | 12 | 19 | 0 |
emacs | 1359693 | 1330562 | 6648 | 13892 | 1824 | 2070 | 429 | 2989 | 1461 | 984 | 1298 | 1250 | 238 | 388 | 209 | 1 | 0 |
git | 1205550 | 1177705 | 6611 | 15727 | 2798 | 1291 | 192 | 1937 | 759 | 612 | 835 | 217 | 224 | 8 | 0 | 27 | 0 |
httpd | 1027265 | 1016524 | 2947 | 2272 | 1408 | 1304 | 988 | 1332 | 607 | 120 | 1496 | 148 | 128 | 234 | 0 | 51 | 0 |
nginx | 730659 | 726250 | 3485 | 1 | 50 | 309 | 17 | 236 | 641 | 150 | 28 | 63 | 81 | 0 | 0 | 0 | 0 |
vim | 644955 | 630928 | 2683 | 7857 | 1328 | 662 | 253 | 1198 | 946 | 342 | 467 | 69 | 177 | 10 | 4 | 0 |
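As a rough sketch of how a row of this table might be used, the code below reads the first three fields of a row and computes the share of AST nodes that belong to an atom. It rests on two assumptions that the table itself does not state: that the file is comma-separated, and that All Nodes minus Non Atoms counts the nodes flagged as atoms. The struct and function names are hypothetical.

```c
#include <stdio.h>

/* Hypothetical container for the first three columns of a row. */
struct project_counts {
    char project[64];
    long all_nodes;
    long non_atoms;
};

/* Parse "project,all_nodes,non_atoms,..." and ignore the remaining fields.
   Assumes comma-separated fields. Returns 1 on success, 0 otherwise. */
int parse_counts_row(const char *row, struct project_counts *out) {
    return sscanf(row, "%63[^,],%ld,%ld",
                  out->project, &out->all_nodes, &out->non_atoms) == 3;
}

/* Fraction of nodes that are part of an atom, under the assumption that
   all_nodes - non_atoms is the number of atom nodes. */
double atom_node_fraction(const struct project_counts *c) {
    if (c->all_nodes <= 0) {
        return 0.0;
    }
    return (double)(c->all_nodes - c->non_atoms) / (double)c->all_nodes;
}
```

Under those assumptions, the linux row (78757141 nodes, 77439581 non-atoms) gives a fraction of roughly 1.7%.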
Some data, on the other hand, is better suited for programmatic consumption. For example, atoms-in-bugs_gcc_2018-01-11_added.csv.bz2 describes the number of atoms added in each file committed to GCC across its entire history. It spans almost 800k lines and takes up about 10 MB after bz2 compression. These are the first few lines:
Author Name | Author Email | File | # Bugs | Added Non Atoms | # Nodes Added | Rev Str | Removed Non Atoms | # Nodes Removed | Preprocessor In Statement | Logic As Control Flow | Conditional | Reversed Subscript | Literal Encoding | Post Increment | Pre Increment | Comma Operator | Omitted Curly Braces | Assignment As Value | Macro Operator Precedence | Operator Precedence | Repurposed Variable | Implicit Predicate | Type Conversion |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
rguenth | rguenth@138bc75d-0d04-0410-961f-82ee72b054a4 | gcc/ChangeLog | 0 | 3 | 3 | 2201c33012d4c6dc522ddbfa97f5aa95a209e24d | 0 | 0 | |||||||||||||||
rguenth | rguenth@138bc75d-0d04-0410-961f-82ee72b054a4 | gcc/tree-ssa-pre.c | 0 | 2 | 2 | 2201c33012d4c6dc522ddbfa97f5aa95a209e24d | 4160 | 4233 | |||||||||||||||
rguenth | rguenth@138bc75d-0d04-0410-961f-82ee72b054a4 | gcc/tree-ssa-sccvn.c | 0 | 4151 | 4225 | 2201c33012d4c6dc522ddbfa97f5aa95a209e24d | 0 | 0 | 1 | 1 | 53 | 2 | 16 | 1 | |||||||||
rguenth | rguenth@138bc75d-0d04-0410-961f-82ee72b054a4 | gcc/tree-ssa-sccvn.h | 0 | 9 | 9 | 2201c33012d4c6dc522ddbfa97f5aa95a209e24d | 0 | 0 | |||||||||||||||
rguenth | rguenth@138bc75d-0d04-0410-961f-82ee72b054a4 | gcc/testsuite/ChangeLog | 1 | 3 | 3 | 476ea17a1752df3ca32ae996e3c88f42f00ecc3a | 0 | 0 | |||||||||||||||
paolo | paolo@138bc75d-0d04-0410-961f-82ee72b054a4 | gcc/testsuite/ChangeLog | 1 | 3 | 3 | 39a925e789721936cf9ed74153a2b375ee504ec9 | 0 | 0 | |||||||||||||||
vries | vries@138bc75d-0d04-0410-961f-82ee72b054a4 | gcc/testsuite/ChangeLog | 0 | 3 | 3 | 1ddd2233adfc059bfb2982a0a5f5dadeb723ec46 | 0 | 0 | |||||||||||||||
vries | vries@138bc75d-0d04-0410-961f-82ee72b054a4 | gcc/testsuite/gcc.dg/tree-ssa/loop-1.c | 0 | 1 | 1 | 1ddd2233adfc059bfb2982a0a5f5dadeb723ec46 | 0 | 0 |
Examples
In the course of our work we found several interesting examples of atoms in our corpus. While they’re described in detail in our paper, we provide references to the original source here.
- New Bugs in Linux - Macro Operator Precedence.
- linux commit 7aa92c4
- Several implementations of the absolute value macro incorrectly parenthesized their arguments. Our team patched those files to use a correctly-implemented version of ABS.
- Old Bugs in FreeBSD - Operator Precedence, Conditional Operator, Omitted Curly Braces, Implicit Predicate
- freebsd commit 74e4174
- After accidentally committing an expression with incorrect operator precedence, the author goes back and replaces a confusing conditional operator with an if-statement without curly braces.
- Correct Code at the Expense of Readability - Parameterizing #imports with temporary #defines.
- gecko/harfbuzz file hb-shape-plan.cc and gecko/harfbuzz file hb-shaper-list.hh
- The X-Macro pattern in C/C++ has some pretty cool technical properties, but it’s pretty confusing to read. We were initially stumped when we first stumbled across it in HarfBuzz, embedded in Gecko.
- Showing Off with Atoms - Reversed Subscript.
- freebsd/ftp file security.c
- A string literal is used to index into an integer to select a character which represents a data channel protection level.
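Three of the patterns described above can be sketched in a few lines of C. Everything here is a hypothetical illustration: the macros are not the ones patched in Linux, the string "ABCD" is a placeholder rather than the literal used in FreeBSD, and the X-Macro is a single-file variant built on a list macro, whereas the HarfBuzz code parameterizes an included file instead.

```c
#include <string.h>

/* Macro Operator Precedence (hypothetical macros). */
#define ABS_UNSAFE(x) (x < 0 ? -x : x)          /* argument not parenthesized */
#define ABS_SAFE(x)   ((x) < 0 ? -(x) : (x))    /* every use of x parenthesized */

int abs_diff_unsafe(int a, int b) {
    /* Expands to (a - b < 0 ? -a - b : a - b): the negation applies to a alone. */
    return ABS_UNSAFE(a - b);
}

int abs_diff_safe(int a, int b) {
    /* Expands to ((a - b) < 0 ? -(a - b) : (a - b)). */
    return ABS_SAFE(a - b);
}

/* Reversed Subscript: E1[E2] means *((E1) + (E2)), so the operands can swap. */
char reversed_subscript(int level) {
    return level["ABCD"];    /* same element as "ABCD"[level] */
}

/* X-Macro: one list, expanded several times with different temporary #defines. */
#define SHAPE_LIST(X) X(circle) X(square) X(triangle)

#define AS_ENUM(name) SHAPE_##name,
enum shape { SHAPE_LIST(AS_ENUM) SHAPE_COUNT };
#undef AS_ENUM

#define AS_STRING(name) #name,
static const char *const shape_names[] = { SHAPE_LIST(AS_STRING) };
#undef AS_STRING

/* Returns 1 if the generated name for s equals expected, 0 otherwise. */
int shape_name_is(enum shape s, const char *expected) {
    return strcmp(shape_names[s], expected) == 0;
}
```

With a = 1 and b = 3, abs_diff_unsafe returns -4 while abs_diff_safe returns 2, which is the kind of discrepancy that correct parenthesization removes. The X-Macro keeps the enum and the name table in sync automatically, at the cost of code that is hard to follow on first reading.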