author: Rick de Vries
title: Clash of Clangs
keywords: grammar-based testing, testing of parsers
topics: Case studies and Applications, Languages, Testing
committee: Vadim Zaytsev
started: November 2020

Description

Parsing C++ is known to be hard [1,6,13,15]. Luckily, the problem has been solved for us by various existing third-party parsers, each of which represents years of design and development effort. However, these parsers work differently, since they set different goals for themselves and reached them through a series of diverging design decisions. In this project, we will investigate the differences by comparing two mature and popular open source frameworks: srcML [3,4,5], which prides itself on performance (it can parse the entire codebase of the Linux kernel in a few minutes on a decent laptop), and Clang [10], which covers the entire language in much more detail while mapping its constructs properly to the LLVM intermediate representation.

The project will definitely involve:

  • writing a differ tool for the trees produced by Clang and by srcML, in a mutually agreed-upon language (preferred/advised options are cross-platform, such as .NET Core)
  • seeking and investigating a few concrete examples of programs where srcML trees are either incorrect according to Clang, or ambiguous
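
The core of such a differ can be sketched as follows. This is only an illustrative sketch, assuming both parsers' outputs have already been converted into a common, parser-neutral tree; the `Node` type, the `diff_trees` function, and the node labels are hypothetical and correspond to neither Clang's nor srcML's actual data model.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Node:
    """Parser-neutral tree node (hypothetical; not a Clang or srcML type)."""
    label: str
    children: List["Node"] = field(default_factory=list)


def diff_trees(a: Node, b: Node, path: str = "") -> List[Tuple[str, str, str]]:
    """Return (path, left, right) triples for positions where the trees disagree."""
    if a.label != b.label:
        # Labels differ: report this position once and do not descend further.
        return [(path or "/", a.label, b.label)]
    here = f"{path}/{a.label}"
    diffs: List[Tuple[str, str, str]] = []
    # Compare children position by position (no attempt at optimal alignment).
    for i, (ca, cb) in enumerate(zip(a.children, b.children)):
        diffs.extend(diff_trees(ca, cb, f"{here}[{i}]"))
    # Children that exist in only one of the two trees.
    for i in range(len(b.children), len(a.children)):
        diffs.append((f"{here}[{i}]", a.children[i].label, "<missing>"))
    for i in range(len(a.children), len(b.children)):
        diffs.append((f"{here}[{i}]", "<missing>", b.children[i].label))
    return diffs


# Made-up labels, standing in for two readings of the same declaration.
left = Node("decl", [Node("type"), Node("name"), Node("init")])
right = Node("decl", [Node("type"), Node("call")])
for entry in diff_trees(left, right):
    print(entry)
```

The positional comparison is the simplest possible choice; a real differ would likely need a mapping between the two parsers' node vocabularies and a more tolerant alignment of children.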

The project can possibly involve:

  • using state-of-the-art test case generators like Csmith [7,14] or any of the grammar/automata-based ones [2,8,9,11,12], to cheaply generate massive amounts of test data from the C++ spec
  • inspecting the source code of srcML [4,5] and/or Clang [10] to gather ideas for the language features to explore
  • cataloging known difficulties in parsing C++ [1,6,13,15] to gather ideas for the language features to explore
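
To give an idea of what grammar-based test data generation looks like, here is a minimal sketch of random derivation from a context-free grammar. It is not the algorithm of any of the cited tools; the toy grammar, the `generate` function, and the depth bound are assumptions made for illustration, and the grammar is not taken from the C++ standard.

```python
import random
from typing import Dict, List

# Toy grammar (hypothetical, NOT from the C++ standard): dictionary keys are
# nonterminals, every other symbol is a terminal token.
GRAMMAR: Dict[str, List[List[str]]] = {
    "stmt": [["type", "name", "(", "expr", ")", ";"],
             ["type", "name", "=", "expr", ";"]],
    "type": [["int"], ["T"]],
    "name": [["x"], ["y"]],
    "expr": [["name"], ["0"], ["expr", "+", "expr"], ["type", "(", "expr", ")"]],
}


def generate(symbol: str, rng: random.Random,
             depth: int = 0, max_depth: int = 6) -> List[str]:
    """Expand `symbol` into a list of terminal tokens by random derivation."""
    if symbol not in GRAMMAR:
        return [symbol]  # terminal: emit as-is
    alternatives = GRAMMAR[symbol]
    if depth >= max_depth:
        # Past the depth budget, always take the shortest alternative. In this
        # toy grammar the shortest alternatives are non-recursive, so the
        # expansion is guaranteed to terminate.
        chosen = min(alternatives, key=len)
    else:
        chosen = rng.choice(alternatives)
    tokens: List[str] = []
    for part in chosen:
        tokens.extend(generate(part, rng, depth + 1, max_depth))
    return tokens


rng = random.Random(2020)  # fixed seed so that runs are reproducible
for _ in range(5):
    print(" ".join(generate("stmt", rng)))
```

Each generated string could then be fed to both parsers and the resulting trees compared. Purely random derivation gives no coverage guarantees, which is one reason to look at the systematic generators cited above.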

If the project is successful and involves all or most of the items mentioned above, its results are publishable at a well-ranked conference, in which case the student responsible for the project becomes the first author, regardless of their direct involvement in the actual writing process.

References

  1. David Beazley, Thoughts on the Insanity of C++ Parsing, SWIG, 2002.
  2. Geoff Birch, Bernd Fischer, Michael Poppleton, Fast test suite-driven model-based fault localisation with application to pinpointing defects in student programs, SoSyM, 2019.
  3. Michael L. Collard, Michael John Decker, Jonathan I. Maletic, Lightweight Transformation and Fact Extraction with the srcML Toolkit, SCAM, 2011.
  4. Michael L. Collard, Jonathan I. Maletic, Michael John Decker, et al., srcML, open source software (GPL).
  5. Michael John Decker, Michael L. Collard, Brian Bartman, Heather Guarnera, Drew Guarnera, srcML, https://github.com/srcML/srcML, GitHub, 2013–2020.
  6. Mike Dimmick, C++ Grammar - Update, comp.compilers, 2001.
  7. Eric Eide, John Regehr, et al., Csmith, https://github.com/csmith-project/csmith, open source software (BSD).
  8. Bernd Fischer, Ralf Lämmel, Vadim Zaytsev, Comparison of Context-Free Grammars Based on Parsing Generated Test Data, SLE, 2011.
  9. Phillip van Heerden, Moeketsi Raselimo, Konstantinos Sagonas, Bernd Fischer, Grammar-based Testing for Little Languages: An Experience Report with Student Compilers, SLE, 2020.
  10. Chris Lattner et al., Clang: a C language family frontend for LLVM, open source software (Apache).
  11. Moeketsi Raselimo, Bernd Fischer, Spectrum-based fault localization for context-free grammars, SLE, 2019.
  12. Christoff Rossouw, Bernd Fischer, Test Case Generation from Context-Free Grammars Using Generalized Traversal of LR-Automata, SLE, 2020.
  13. Edward Willink, Meta-Compilation for C++, PhD thesis, 2001.
  14. Xuejun Yang, Yang Chen, Eric Eide, John Regehr, Finding and understanding bugs in C compilers, PLDI, 2011.
  15. Most Vexing Parse, Wikipedia, 2010–2020.