loarabia edited this page Dec 21, 2011 · 18 revisions

How to parse C programs with Clang: A tutorial

Written by Nico Weber (9/2008)
Updated by Justin LaPre (10/2009)
Updated by Larry Olson (12/2011)

Introduction

From Clang's website:

The goal of the Clang project is to create a new C, C++, Objective C and Objective C++ front-end for the LLVM compiler.

What does that mean, and why should you care? A front-end is a program that takes a piece of code and converts it from a flat string to a structured, tree-formed representation of the same program — an Abstract Syntax Tree or AST.

Once you have an AST of a program, you can do many things with the program that are hard without an AST. For example, renaming a variable without an AST is hard: You cannot simply search for the old name and replace it with a new name, as this will change too much (for example, variables that have the old name but are in a different namespace, and so on). With an AST, it’s easy: In principle, you only have to change the name field of the right VariableDeclaration AST node, and convert the AST back to a string. (In practice, it’s a bit harder for some codebases.)
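To make the renaming problem concrete, here is a minimal sketch of the naive search-and-replace approach. The function naiveRename is a hypothetical helper written for this illustration, not part of clang. It has no notion of tokens or scopes, so it rewrites every textual match.

```cpp
#include <string>

// Naive textual rename (hypothetical helper, not a clang API):
// replaces every occurrence of `from` with `to`, ignoring token
// boundaries, scopes, and namespaces.
std::string naiveRename(std::string source,
                        const std::string &from,
                        const std::string &to) {
    if (from.empty()) return source;  // nothing sensible to search for
    std::string::size_type pos = 0;
    while ((pos = source.find(from, pos)) != std::string::npos) {
        source.replace(pos, from.size(), to);
        pos += to.size();  // continue searching after the inserted text
    }
    return source;
}
```

Renaming x to y in "int x; int max = x;" with this helper also turns max into may. Matching whole words would avoid that particular accident, but it still could not tell two variables named x in different scopes apart, which is where an AST helps.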

Front-ends have existed for decades. So, what’s special about clang? I think the most interesting part is that clang uses a library design, which means that you can easily embed clang into your own programs (and by “easily”, I mean it. Most programs in this tutorial are well below 50 lines). This tutorial tells you how you do this.

So, do you have a large C code-base and want to perform non-trivial analysis? Would you like to have ctags that works better with C++ and at all with Objective-C? Would you like to collect some statistics about your program, and you feel that grep doesn’t cut it? Then clang is for you.

This tutorial will offer a tour through clang’s preprocessor, parser, and AST libraries.

A short word of warning: Clang is a work in progress. Although its API surface is more stable than it was in 2008 when the first version of this tutorial was written, it does not have a stable API, so this tutorial might not be completely up-to-date.

Clang works on all platforms. In this tutorial I assume that you have some Unix-based platform, but everything works on Windows, too.

Getting Started

The official release of Clang is at version 3.0. You can get it here. Download it and add it to your binary, include, and library paths.

Alternatively, you can get the latest from SVN by following these instructions.

A hint for folks on Unix-like systems who are pulling straight from SVN rather than using the official release build: after getting the source for llvm and clang and configuring it per the instructions, run the following:

make happiness
# Note: You'll almost certainly need to run the next command under sudo
make install

make happiness will check out the latest llvm and clang from svn, build them, and run the resulting binaries through a test suite. The result of this command should look like:

Testing Time: 202.10s
  Expected Passes    : 9678
  Expected Failures  : 74
  Unsupported Tests  : 13

If there are any lines that say:

Unexpected Failures: 3

or something similar, run the command again. Often a fix will already be waiting, and the last update just happened to miss it. Otherwise, check the mailing lists, as there may be a bug.

make install will install the built libs, binaries, and include files.

Recommended reading to get some context if needed.

  • Compilers: Principles, Techniques, and Tools by Aho, Lam, Sethi, and Ullman. Pay attention to the first two chapters, especially the discussions of lexical analysis and syntax trees. Skim anything else that looks interesting up to the chapter on syntax-directed translation (chapter 5 in the 1st edition). Note: All of this is what Clang is doing for you.
  • Compiler Construction: Principles and Practice by Kenneth C. Louden. Again, pay attention to the first three chapters, up to section 3.3.

Tutorial 1: The bare minimum

A front-end consists of multiple parts. First is usually a lexer, which converts the input from a stream of characters to a stream of tokens. For example, the input while is converted from the five characters ‘w’, ‘h’, ‘i’, ‘l’, and ‘e’ to the token kw_while. For performance reasons, clang does not have a separate preprocessor program, but does preprocessing while lexing.
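As a rough illustration of what "characters in, tokens out" means, here is a toy lexer sketch. It is not clang's lexer: the TokenKind enum, the Token struct, and the lex function are all hypothetical names invented for this example, and it recognizes only the while keyword, identifiers, and parentheses.

```cpp
#include <cctype>
#include <string>
#include <vector>

// Hypothetical token kinds for this sketch; clang's real set is far larger.
enum TokenKind { kw_while, identifier, l_paren, r_paren, unknown };

struct Token {
    TokenKind kind;
    std::string text;
};

// Converts a flat string into a sequence of tokens.
std::vector<Token> lex(const std::string &input) {
    std::vector<Token> tokens;
    std::string::size_type i = 0;
    while (i < input.size()) {
        unsigned char c = static_cast<unsigned char>(input[i]);
        if (std::isspace(c)) {  // whitespace separates tokens
            ++i;
            continue;
        }
        if (std::isalpha(c) || c == '_') {
            // Consume a whole word, then decide: keyword or identifier.
            std::string::size_type start = i;
            while (i < input.size() &&
                   (std::isalnum(static_cast<unsigned char>(input[i])) ||
                    input[i] == '_'))
                ++i;
            std::string word = input.substr(start, i - start);
            Token t = { word == "while" ? kw_while : identifier, word };
            tokens.push_back(t);
            continue;
        }
        // Single-character tokens; anything unrecognized is `unknown`.
        TokenKind kind = c == '(' ? l_paren : c == ')' ? r_paren : unknown;
        Token t = { kind, std::string(1, input[i]) };
        tokens.push_back(t);
        ++i;
    }
    return tokens;
}
```

Calling lex("while (x)") yields four tokens: kw_while, l_paren, identifier, and r_paren. A real lexer also tracks source locations and, in clang's case, performs preprocessing in the same pass.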

The Preprocessor class is the main interface to the lexer, and it’s a class you will need in almost every program that embeds clang. So, for starters, let’s try to create a Preprocessor object. Our first program will not do anything useful; it only constructs a Preprocessor and exits again.

The constructor of Preprocessor takes no fewer than six arguments: a DiagnosticsEngine object, a LangOptions object, a TargetInfo object, a SourceManager object, a HeaderSearch object, and finally a ModuleLoader object. Let’s break down what those objects are good for, and how we can build them.

First is DiagnosticsEngine. This is used by clang to report errors and warnings to the user. A DiagnosticsEngine object can have a DiagnosticConsumer, which is responsible for actually displaying the messages to the user. We will use clang’s built-in TextDiagnosticPrinter class, which writes errors and warnings to the console (it’s the same DiagnosticConsumer that is used by the clang binary).
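The split between an engine that receives diagnostics and a consumer that displays them can be sketched with a toy model. Everything below (ToyEngine, ToyConsumer, CollectingConsumer, Level) is a hypothetical stand-in written for this illustration and does not mirror clang's actual class interfaces.

```cpp
#include <string>
#include <vector>

enum Level { Warning, Error };

// Hypothetical consumer interface: decides how a message is presented.
class ToyConsumer {
public:
    virtual ~ToyConsumer() {}
    virtual void handle(Level level, const std::string &message) = 0;
};

// A consumer that stores formatted lines instead of printing them.
class CollectingConsumer : public ToyConsumer {
public:
    std::vector<std::string> lines;
    virtual void handle(Level level, const std::string &message) {
        std::string prefix = level == Error ? "error: " : "warning: ";
        lines.push_back(prefix + message);
    }
};

// Hypothetical engine: keeps bookkeeping and forwards to the consumer.
class ToyEngine {
public:
    explicit ToyEngine(ToyConsumer *consumer)
        : consumer_(consumer), errors_(0) {}

    void report(Level level, const std::string &message) {
        if (level == Error) ++errors_;
        if (consumer_) consumer_->handle(level, message);
    }

    unsigned errorCount() const { return errors_; }

private:
    ToyConsumer *consumer_;  // not owned in this sketch
    unsigned errors_;
};
```

The point of the design is that the code reporting a problem never needs to know whether messages end up on the console, in a log, or in an IDE; swapping the consumer changes the presentation only.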

Next up is LangOptions. This class lets you configure whether you’re compiling C or C++, and which language extensions you want to allow. Constructing this object is easy, as its constructor does not take any parameters.

The TargetInfo is easy, too, but we need to call a factory method as the constructor is private. The factory method takes a “host triple” as parameter that defines the architecture clang should compile for, such as “i386-apple-darwin”. We will get and pass the default host triple (getDefaultTargetTriple()), which contains the host triple describing the machine llvm was compiled on. But in principle, you can use clang as a cross-compiler very easily, too. The TargetInfo object is required so that the preprocessor can add target-specific defines, for example __APPLE__. You need to delete this object at the end of the program.
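For readers unfamiliar with triples, the following sketch splits a triple string such as "i386-apple-darwin" into its parts. TripleParts and splitTriple are hypothetical names used only here; LLVM ships its own llvm::Triple class for real use, and this toy makes no attempt to validate or normalize the components.

```cpp
#include <string>

// Hypothetical struct for this sketch (LLVM has its own llvm::Triple).
struct TripleParts {
    std::string arch;
    std::string vendor;
    std::string os;  // everything after the second '-', e.g. "linux-gnu"
};

// Splits "arch-vendor-os" at the first two '-' characters.
TripleParts splitTriple(const std::string &triple) {
    TripleParts parts;
    std::string::size_type first = triple.find('-');
    parts.arch = triple.substr(0, first);
    if (first == std::string::npos) return parts;  // only an arch given

    std::string::size_type second = triple.find('-', first + 1);
    if (second == std::string::npos) {  // arch and vendor only
        parts.vendor = triple.substr(first + 1);
        return parts;
    }
    parts.vendor = triple.substr(first + 1, second - first - 1);
    parts.os = triple.substr(second + 1);
    return parts;
}
```

For "i386-apple-darwin" this gives arch "i386", vendor "apple", and os "darwin". The vendor and OS parts are what allow target-specific defines such as __APPLE__ to be chosen.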

SourceManager is used by clang to load and cache source files. Its constructor takes a DiagnosticsEngine for errors and a FileManager which helps it manage files on disk and in cache.

The constructor of HeaderSearch also requires a DiagnosticsEngine for errors and a FileManager, which helps it manage files on disk and in cache. HeaderSearch configures where clang looks for include files.

Finally, a ModuleLoader is an abstract class whose concrete implementation helps resolve module names. In this case, we'll create a CompilerInstance as the default ModuleLoader.

So, to build a Preprocessor object, the following code is required:

clang::DiagnosticOptions diagnosticOptions;
clang::TextDiagnosticPrinter *pTextDiagnosticPrinter =
    new clang::TextDiagnosticPrinter(
        llvm::outs(),
        diagnosticOptions);
llvm::IntrusiveRefCntPtr<clang::DiagnosticIDs> pDiagIDs;

clang::DiagnosticsEngine *pDiagnosticsEngine =
    new clang::DiagnosticsEngine(pDiagIDs, pTextDiagnosticPrinter);

clang::LangOptions languageOptions;
clang::FileSystemOptions fileSystemOptions;
clang::FileManager fileManager(fileSystemOptions);
clang::SourceManager sourceManager(
    *pDiagnosticsEngine,
    fileManager);
clang::HeaderSearch headerSearch(fileManager, *pDiagnosticsEngine);

clang::TargetOptions targetOptions;
targetOptions.Triple = llvm::sys::getDefaultTargetTriple();

clang::TargetInfo *pTargetInfo = 
    clang::TargetInfo::CreateTargetInfo(
        *pDiagnosticsEngine,
        targetOptions);
clang::CompilerInstance compInst;

clang::Preprocessor preprocessor(
    *pDiagnosticsEngine,
    languageOptions,
    pTargetInfo,
    sourceManager,
    headerSearch,
    compInst);

Note that this is quite verbose. Since we're using a CompilerInstance anyway, I've rebuilt these tutorials using a CompilerInstance object and its helper methods. They make this setup a bit simpler.

Compiling

Now that you've written your first tutorial, you need to compile it. To do that, use llvm-config to obtain the compiler and linker flags. Compile with the -fno-rtti flag; otherwise you'll get a link error. Also tell llvm-config which backend libraries to use, and link against the list of clang libraries. See the checked-in makefile in the project.
