Vasilisk is a V8 JIT fuzzer that focuses on optimization passes.
Fuzzing automates the process of finding crashes in programs through randomized input generation. The probability of finding a crash increases as code coverage of the application increases. In a well-formed fuzzing campaign, test cases are generated systematically so that each iteration increases the likelihood of finding a crash. The structure of the application's expected input must also be taken into account during this process.
TurboFan's design consists of bytecode being JITed into quality machine code through phases of optimization passes. It makes use of static type information to accurately convert bytecode to machine code. During this process the bytecode is translated into an intermediate representation (IR) where traits like evaluation order, code elimination, allocations, and bounds checks can be analyzed. By parsing the nodes in the IR, we can draw conclusions about how actively the JIT optimized our test cases and update our grammars accordingly. Building V8 with instrumentation like ASan and SanitizerCoverage (SANCOV) allows our methodology to be tested and verified.
At a high level, the scoring system is based on the percent change in pass counts after a test case passes through every phase of optimization. Test cases are initially generated from a corpus of grammars. Grammars are scored via the concrete test cases generated from them, and grammars with high scores are more likely to be selected in the next iteration of test case generation. A system for mutating and preserving concrete interactions between different grammars was developed through the idea of grouping.
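The score-weighted selection step can be sketched as follows; the function name and the flat baseline offset are illustrative assumptions, not Vasilisk's actual implementation:

```python
import random

# Minimal sketch of score-weighted grammar selection: higher-scoring
# grammars are proportionally more likely to be picked for the next
# round of test-case generation. The +1.0 baseline (an assumption)
# keeps zero-scoring grammars selectable.
def pick_grammar(scores: dict, rng: random.Random) -> str:
    names = list(scores)
    weights = [max(scores[n], 0.0) + 1.0 for n in names]
    return rng.choices(names, weights=weights, k=1)[0]
```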
Design and Implementation
At the heart of our design is a coverage metric based on changes in the number of optimization passes. We divide our fuzzing into a series of rounds, each focused on a different subset of grammar rules. These rules are progressively combined and scored at each stage. We distinguish between three kinds of grammar rules: actions, variables, and controls.
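The pass-count metric amounts to counting phase delimiters in V8's trace output and comparing against a baseline. A sketch of that idea, assuming a `-- Phase:` marker (the real delimiter depends on which V8 tracing flags are used, e.g. `--trace-turbo-graph`):

```python
# Count TurboFan optimization-phase lines in trace output. The
# "-- Phase:" prefix is an illustrative assumption about the trace format.
def count_passes(trace: str) -> int:
    return sum(1 for line in trace.splitlines() if line.startswith("-- Phase:"))

# Score a test case by the percent change in pass count vs. a baseline.
def percent_change(baseline: int, new: int) -> float:
    if baseline == 0:
        return float(new) * 100.0
    return (new - baseline) / baseline * 100.0
```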
Initially, our fuzzer took advantage of the open source grammar generation library Dharma. However, our unique requirements meant that writing our own grammar generation would yield much faster performance. To save computational time over the course of a fuzzing run, we initially “unravel” our grammars: we walk the master lists of actions, variables, and controls and expand them as far as we can while remaining deterministic at startup. In practice this means we stop short of generating any concrete end value. For example, if an action requires a random numerical value, we stop before generating it. When creating a fuzzing round, we pick a set of actions, variables, and controls from the unraveled grammar. This set then passes through another processing round where it is fully expanded, choosing random end values where necessary. As a result of this caching, we vastly improved generation time over Dharma.
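The two-pass expansion can be sketched like this. The grammar, the `{rule}` reference syntax, and the `%rand%` placeholder are all illustrative assumptions standing in for Vasilisk's actual rule format:

```python
import random

# Toy grammar: rule bodies may reference other rules with {name} and
# defer random end values with %rand% (both conventions are assumptions).
GRAMMAR = {
    "action": "{target} += {value};",
    "target": "arr[%rand%]",
    "value": "1",
}

def unravel(symbol: str, grammar: dict) -> str:
    """Startup pass: expand rule references eagerly, but stop at
    nondeterministic leaves so the result stays cacheable."""
    text = grammar[symbol]
    while "{" in text:
        start, end = text.index("{"), text.index("}")
        text = text[:start] + unravel(text[start + 1:end], grammar) + text[end + 1:]
    return text

def finalize(template: str, rng: random.Random) -> str:
    """Round-creation pass: choose concrete end values where necessary."""
    while "%rand%" in template:
        template = template.replace("%rand%", str(rng.randrange(256)), 1)
    return template
```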
Our framework itself consists of three parts: grammar generation, coverage, and our fuzzing instrumentation. Generated grammars are placed into a thread-safe queue from which our instrumentation, running on different threads, requests grammars. A single thread handles generating grammars, and has managed to keep up so far. The coverage portion of the framework coordinates between the grammar generation and the fuzzing itself. Our coverage information is stored in a Redis database. When a grammar is generated we increment a counter in Redis, and when the grammar is processed by the fuzzing instrumentation we increment a separate counter. Before proceeding to a new round, we ensure that these counters match.
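The producer/consumer handshake can be sketched as follows, with a plain `Counter` standing in for the Redis counters (the real framework uses Redis `INCR`; all names here are illustrative):

```python
import queue
import threading
from collections import Counter

work = queue.Queue()      # thread-safe grammar queue
counters = Counter()      # stand-in for the two Redis counters
lock = threading.Lock()

def generate(grammars):
    """Producer: enqueue grammars and bump the 'generated' counter."""
    for g in grammars:
        work.put(g)
        with lock:
            counters["generated"] += 1

def consume():
    """Consumer: drain the queue, bumping 'processed' per grammar."""
    while True:
        try:
            g = work.get_nowait()
        except queue.Empty:
            return
        # ... run g through d8 and score the result ...
        with lock:
            counters["processed"] += 1

def round_complete() -> bool:
    """A new round may start only once the counters match."""
    with lock:
        return counters["generated"] == counters["processed"]
```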
At the end of a fuzzing round we take the highest-scoring sets of rules (currently 10) and create a group for each. We call a set of actions, variables, controls, and interactions a group. Groups are currently stored as JSON and contain all the information necessary to recreate a test case. When our initial pool of groups reaches a size we consider impactful (~10,000), we move away from progressive generation to a mutation-based method.
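A group might be serialized along these lines; the field names and contents below are hypothetical, standing in for whatever Vasilisk actually records to recreate a test case:

```python
import json

# Hypothetical shape of a stored group: the rules that produced a test
# case, how they interact, and the group's score.
group = {
    "actions": ["arr.push(%rand%);"],
    "variables": ["var arr = [1, 2, 3];"],
    "controls": ["for (var i = 0; i < 10000; i++) { %body% }"],
    "interactions": {"%body%": "actions"},
    "score": 42.0,
}
blob = json.dumps(group)  # what gets written to disk
```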
Each group allows for several mutations. For example, we may choose to change some interactions, replace an action, or regenerate an action that relies on an end value. We compare each mutation to the base group and work toward eliciting more overall optimization passes.
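One such mutation, replacing an action, could be sketched as below; the function name and group layout are assumptions, and the key point is that the base group is left untouched so each mutant can be compared against it:

```python
import random

# Copy the base group, swap one action for another drawn from a pool,
# and return the mutant without modifying the base.
def mutate_replace_action(group: dict, pool: list, rng: random.Random) -> dict:
    mutant = {**group, "actions": list(group["actions"])}
    idx = rng.randrange(len(mutant["actions"]))
    mutant["actions"][idx] = rng.choice(pool)
    return mutant
```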
We conducted an experiment to test the effectiveness of our grammar generation with respect to code coverage. It consisted of a control group, which ran AFL on top of a Dharma-generated input corpus, and an experimental group that incorporated our grammar generation, mutations, and feedback loop. We then collected and compared the coverage data from both groups. The same JS grammar we created was used in both groups, but test cases were generated by Dharma for one and by Vasilisk for the other, yielding significantly different results.
The control group ran AFL against the d8 binary. The goal here was to observe how effective Dharma's code generation was on top of a standard coverage-based fuzzer. Unlike Vasilisk, code coverage here is optimized with respect to the whole binary instead of just the JIT; AFL cannot be explicitly configured to take feedback in that way. Instead, more subtle approaches were taken, like seeding the original corpus with inputs more likely to get JITed.
The coverage report for this run is shown below; it is the result of checking the inputs generated in the queue directory of the AFL output directory.
Here is the total:
The coverage report for the experimental group is shown below (this group was run for half the time AFL was run):
Here is the total:
There are improvements to be made that would complement the design and scalability of Vasilisk. With a larger corpus of initial grammars, it would be better able to trigger the JIT optimizations that arise from complicated language features. Furthermore, after observing our first few trials with the fuzzer, we could find a way to combine sets of grammars we already have to produce more interesting ones. Our current mechanism for re-scaling grammar weights is simply adding to a grammar's weight after each iteration; implementing Q-learning would more aptly fit our use case and adjust bias in a systematic way, allowing Vasilisk to learn the effectiveness of grammars in a more meaningful, long-term capacity. The mutations within groups could also be improved: developing more interactions and mechanics for relating variables to actions would improve the breadth of overall mutations. Finally, our current method of storing interesting test cases and crashes is writing to /dev/shm. An actual database, where test cases could be properly deleted or mutated, would reduce memory usage and improve scalability. Making better use of CPU resources would also significantly improve the design, but Vasilisk is written in Python, which makes working with threads at a granular level difficult.
We appreciate the guidance from our professor Brendan Dolan-Gavitt, who inspired many of our good ideas throughout this project. A link to Vasilisk can be found here.