Are there any tools that can build the control flow graph for an entire Linux kernel binary? For example, consider Linux kernel compiled for x86 architecture (vmlinux file). Is it possible to determine all execution paths (regarding indirect call) using both static analysis and dynamic analysis? Are there any tools suitable for this?
Advertisement
Answer
I assume you mean analyzing the source code used to produce the Linux binaries.
Hope you are prepared for a lot of work. There are reasons you can’t get this off the shelf.
You need two kinds of tools:
Machinery to construct a control flow graph of individual C source files, that work for real dialects of C as used by the Linux kernel.
Something that can construct a global call graph including indirect calls; if you don’t handle indirect calls well, your call graph is either ridiculously over connected (the famous “scribble” diagram), or ridiculously underconnected (most functions won’t be reachable).
For [1],
- You could build your own custom tool from scratch, after all, you “just need a parser”. This path is simply madness. See http://www.semdesigns.com/Products/DMS/LifeAfterParsing.html
- Most modern C compilers will do that as part of the process of compiling a source file, so that’s a promising place to look. Given that GCC is used to compile Linux, that’s the most obvious place to look: https://gcc.gnu.org/ Getting GCC to cough up that control flow graph in some usable form for your purpose may be harder; GCC’s internals are famously hard to work with. Clang can compile C programs; I don’t know about its compatibility with the GCC dialect but I’d guess it was highly probable. Clang (https://clang.llvm.org/) appears to be built with the intention of customization so CFG extraction is likely to be technically straightforward, and I’d bet that Google has done this for internal projects. There are other C compilers out there (e.g., Intel’s) , but if they don’t do the GCC dialect used on Linux, you will find yourself first have to adapt them and than extracting what you want, and getting the source code of those compilers so you can make these changes is likely difficult.
- You can try to adapt some security scanning tool such as Coverity or KlocWork. I doubt you will make much progress getting such a tool in a form that you can adapt.
- You can use a (commercial) general program transformation tool like our DMS Software Reengineering Toolkit. DMS is customizable compiler technology which was designed to be internally accessible to motivated engineers. DMS can process many GCC dialects, and has been used on real embedded software systems of tens of millions of lines of code. DMS’s GCC front end constructs control and data flow graphs; see http://www.semdesigns.com/products/DMS/FlowAnalysis.html#DataFlowAnalysis. Another (research) program transformation tool is RascalMPL https://www.rascal-mpl.org/ might work, but I have not heard of it being applied to GCC dialects or operating at this scale; it may do so.
For [2],
- You can build your call-graph construction tool custom tool. Doing a pointer-sensitive flow analysis at this scale isn’t as hard as building your flow graph construction, but its pretty close in the amount of audacity required to do it. (Google likely has the audacity).
- DMS has a points-to analysis subsystem, that has been been used on a 16 million line embedded system (for a client, to enable a custom code generation task for them). It also has call-graph construction that uses this points-to information. See http://www.semdesigns.com/products/DMS/FlowAnalysis.html#CallGraphAnalysis. Specifically see the Sample Call graph link, which shows the results of points-to analysis.
Once you have these pieces, you can consider what you might do with the result. I can tell you that a flow graph for a million line system will cover a football field at 1 inch resolution; you’ll need serious computing power to traverse/analyze such a graph.
If your intent was to analyze the Linux binaries (well you’d want to process the linker modules) directly, you don’t have nearly as bad a problem of building the control flow graph because you don’t have to deal with what amounts to most of a compiler. Now you just have to worry about the entire Intel instruction set. But if you model the machine instructions accurately, your CFG is likely to be 10x the size of one for the source code and a whole lot less helpful in tracing any issues back to the source.