Vasilisk is a V8 JIT fuzzer that focuses on optimization passes.
Fuzzing automates the process of finding crashes in programs through randomized input generation. The probability of finding a crash increases as code coverage of the application increases. In a well-formed fuzzing campaign, test cases are generated systematically so that each iteration increases the likelihood of finding a crash. The structure of the application's expected input must also be taken into account during this process.
TurboFan's design consists of bytecode being JITed into quality machine code through phases of optimization passes. It makes use of static type information to accurately convert bytecode to machine code. During this process the bytecode is translated into an intermediate representation (IR) where traits like evaluation order, code elimination, allocations, and bounds checks can be analyzed. By parsing the nodes in the IR, we can draw conclusions about how actively the JIT optimized our test cases and update our grammars accordingly. Building V8 with instrumentation like ASan and SanitizerCoverage (SANCOV) allows our methodology to be tested and verified.
At a high level, the scoring system is based on the percent change in pass counts after a test case passes through every phase of optimization. Test cases are initially generated from a corpus of grammars. Grammars are scored via the concrete test cases generated from them, and grammars with high scores are more likely to be selected in the next iteration of test case generation. A system for mutating and preserving concrete interactions between different grammars was developed through the idea of grouping.
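The score-weighted selection step can be sketched as follows; the function name and the flat baseline offset are illustrative assumptions, not Vasilisk's actual implementation:

```python
import random

# Minimal sketch of score-weighted grammar selection: higher-scoring
# grammars are proportionally more likely to be picked for the next
# round of test-case generation. The +1.0 baseline (an assumption)
# keeps zero-scoring grammars selectable.
def pick_grammar(scores: dict, rng: random.Random) -> str:
    names = list(scores)
    weights = [max(scores[n], 0.0) + 1.0 for n in names]
    return rng.choices(names, weights=weights, k=1)[0]
```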
Design and Implementation
At the heart of our design is a coverage metric based on changes in the number of optimization passes. We divide our fuzzing into a series of rounds, each focused on a different subset of grammar rules. These rules are progressively combined and scored at each stage. We distinguish between three kinds of grammar rules: actions, variables, and controls.
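The pass-count metric amounts to counting phase delimiters in V8's trace output and comparing against a baseline. A sketch of that idea, assuming a `-- Phase:` marker (the real delimiter depends on which V8 tracing flags are used, e.g. `--trace-turbo-graph`):

```python
# Count TurboFan optimization-phase lines in trace output. The
# "-- Phase:" prefix is an illustrative assumption about the trace format.
def count_passes(trace: str) -> int:
    return sum(1 for line in trace.splitlines() if line.startswith("-- Phase:"))

# Score a test case by the percent change in pass count vs. a baseline.
def percent_change(baseline: int, new: int) -> float:
    if baseline == 0:
        return float(new) * 100.0
    return (new - baseline) / baseline * 100.0
```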
Initially, our fuzzer took advantage of the open source grammar generation library Dharma. However, our unique requirements meant that writing our own grammar generation would yield much faster performance. To save computational time over the course of a fuzzing run, we initially “unravel” our grammars: we walk the master lists of actions, variables, and controls and expand them as far as we can while remaining deterministic at startup. In practice this means we stop short of generating any concrete end value. For example, if an action requires a random numerical value, we stop before generating it. When creating a fuzzing round, we pick a set of actions, variables, and controls from the unraveled grammar. This set then passes through another processing round where it is fully expanded, choosing random end values where necessary. As a result of this caching, we vastly improved generation time over Dharma.
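The two-pass expansion can be sketched like this. The grammar, the `{rule}` reference syntax, and the `%rand%` placeholder are all illustrative assumptions standing in for Vasilisk's actual rule format:

```python
import random

# Toy grammar: rule bodies may reference other rules with {name} and
# defer random end values with %rand% (both conventions are assumptions).
GRAMMAR = {
    "action": "{target} += {value};",
    "target": "arr[%rand%]",
    "value": "1",
}

def unravel(symbol: str, grammar: dict) -> str:
    """Startup pass: expand rule references eagerly, but stop at
    nondeterministic leaves so the result stays cacheable."""
    text = grammar[symbol]
    while "{" in text:
        start, end = text.index("{"), text.index("}")
        text = text[:start] + unravel(text[start + 1:end], grammar) + text[end + 1:]
    return text

def finalize(template: str, rng: random.Random) -> str:
    """Round-creation pass: choose concrete end values where necessary."""
    while "%rand%" in template:
        template = template.replace("%rand%", str(rng.randrange(256)), 1)
    return template
```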
Our framework itself consists of three parts: grammar generation, coverage, and our fuzzing instrumentation. Generated grammars are placed into a thread-safe queue from which our instrumentation, running on different threads, requests grammars. A single thread handles generating grammars, and has managed to keep up so far. The coverage portion of the framework coordinates between the grammar generation and the fuzzing itself. Our coverage information is stored in a Redis database. When a grammar is generated we increment a counter in Redis, and when the grammar is processed by the fuzzing instrumentation we increment a separate counter. Before proceeding to a new round, we ensure that these counters match.
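The producer/consumer handshake can be sketched as follows, with a plain `Counter` standing in for the Redis counters (the real framework uses Redis `INCR`; all names here are illustrative):

```python
import queue
import threading
from collections import Counter

work = queue.Queue()      # thread-safe grammar queue
counters = Counter()      # stand-in for the two Redis counters
lock = threading.Lock()

def generate(grammars):
    """Producer: enqueue grammars and bump the 'generated' counter."""
    for g in grammars:
        work.put(g)
        with lock:
            counters["generated"] += 1

def consume():
    """Consumer: drain the queue, bumping 'processed' per grammar."""
    while True:
        try:
            g = work.get_nowait()
        except queue.Empty:
            return
        # ... run g through d8 and score the result ...
        with lock:
            counters["processed"] += 1

def round_complete() -> bool:
    """A new round may start only once the counters match."""
    with lock:
        return counters["generated"] == counters["processed"]
```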
At the end of a fuzzing round we take the highest-scoring sets of rules (currently 10) and create a group for each. We call a set of actions, variables, controls, and interactions a group. Groups are currently stored as JSON and contain all the information necessary to recreate a test case. When our initial pool of groups reaches a size we consider impactful (~10,000), we move away from progressive generation to a mutation-based method.
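A group might be serialized along these lines; the field names and contents below are hypothetical, standing in for whatever Vasilisk actually records to recreate a test case:

```python
import json

# Hypothetical shape of a stored group: the rules that produced a test
# case, how they interact, and the group's score.
group = {
    "actions": ["arr.push(%rand%);"],
    "variables": ["var arr = [1, 2, 3];"],
    "controls": ["for (var i = 0; i < 10000; i++) { %body% }"],
    "interactions": {"%body%": "actions"},
    "score": 42.0,
}
blob = json.dumps(group)  # what gets written to disk
```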
Each group allows for several mutations. For example, we may choose to change some interactions, replace an action, or regenerate an action that relies on an end value. We compare each mutation to the base group and work toward eliciting more overall optimization passes.
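One such mutation, replacing an action, could be sketched as below; the function name and group layout are assumptions, and the key point is that the base group is left untouched so each mutant can be compared against it:

```python
import random

# Copy the base group, swap one action for another drawn from a pool,
# and return the mutant without modifying the base.
def mutate_replace_action(group: dict, pool: list, rng: random.Random) -> dict:
    mutant = {**group, "actions": list(group["actions"])}
    idx = rng.randrange(len(mutant["actions"]))
    mutant["actions"][idx] = rng.choice(pool)
    return mutant
```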
We conducted an experiment to test the effectiveness of our grammar generation with respect to code coverage. It consisted of a control group, which ran AFL on top of a Dharma-generated input corpus, and an experimental group that incorporated our grammar generation, mutations, and feedback loop. We then collected and compared the coverage data from both groups. The same JS grammar we created was used in both groups, but test cases were generated by Dharma for one and by Vasilisk for the other, yielding significantly different results.
The control group ran AFL against the d8 binary. The goal here was to observe how effective Dharma's code generation was on top of a standard coverage-based fuzzer. Unlike Vasilisk, code coverage here is optimized with respect to the whole binary instead of just the JIT; AFL cannot be explicitly configured to take feedback in that way. Instead, more subtle approaches were taken, like seeding the original corpus with inputs more likely to get JITed.
The coverage report for this run is shown below; it is the result of checking the inputs generated in the queue directory of the AFL output directory.
Here is the total:
The coverage report for the experimental group is shown below (this group was run for half the time AFL was run):
Here is the total:
There are improvements to be made that would complement the design and scalability of Vasilisk. With a larger corpus of initial grammars, it would be better able to trigger the JIT optimizations that arise from complicated language features. Furthermore, after observing our first few trials with the fuzzer, we could find a way to combine sets of grammars we already have to produce more interesting ones. Our current mechanism for re-scaling grammar weights is simply adding to a grammar's weight after each iteration; implementing Q-learning would more aptly fit our use case and adjust bias in a systematic way, allowing Vasilisk to learn the effectiveness of grammars in a more meaningful, long-term capacity. The mutations within groups could also be improved: developing more interactions and mechanics for relating variables to actions would improve the breadth of overall mutations. Finally, our current method of storing interesting test cases and crashes is writing to /dev/shm. An actual database, where test cases could be properly deleted or mutated, would reduce memory usage and improve scalability. Making better use of CPU resources would also significantly improve the design, but Vasilisk is written in Python, which makes working with threads at a granular level difficult.
We appreciate the guidance from our professor Brendan Dolan-Gavitt, who inspired many of our good ideas throughout this project. A link to Vasilisk can be found here.