Metabob Whitepaper

Overview

Metabob is an ensemble AI system for identifying, classifying, and explaining non-deterministic faults within source code. Metabob employs BERTopic-based topic modeling to build the seed dataset, inferring the underlying reason behind each class of code change from the documentation surrounding that change. This enables supervised training of a classifier over an extended version of the Abstract Syntax Tree (AST) extracted from the source code: the embedded graph serves as the input to a graph attention-based neural network, and the fault class determined by BERTopic serves as the output class for each node. The output is analogous to the input graph with the addition of a ‘bug category’ attribute that can be converted to a string containing the top topic words, as defined by BERTopic. Explanations are then generated via a Seq2Seq Transformer by building a context vector from the topic labels, the source code, and portions of the inline documentation, docstrings, headers, and other non-local information (READMEs, etc.). This produces a simple explanation of the underlying issue behind a particular code change, whose style mirrors that of our primary dataset.

Motivation

Improving software quality is crucial for individual developers and entire organizations alike. Bugs in the code can lead to system failures, data loss, and other issues. AI code review models can help identify these errors early.

Creating an AI code review model to identify coding errors has several benefits, such as efficiency and scalability. AI models can analyze large amounts of code quickly and accurately, which can save time and resources for developers. This is particularly useful for large codebases: as the amount of code in a project grows, it becomes increasingly difficult for humans to manually review it and identify bugs. Larger codebases also typically increase the complexity of errors, making them harder for humans and rule-based tools to identify. AI code review models can handle large codebases with ease, making it possible to identify bugs even in very complex projects. Furthermore, AI can automate the process of identifying bugs, which frees up developer time to focus on other tasks. Overall, AI models have the potential to improve the overall efficiency of the development process.

In addition, these models can provide consistent and unbiased results, which can help eliminate human error and subjectivity when identifying bugs. This likely improves the overall accuracy and reliability of the process. Further, AI code review models can be trained to identify patterns and anomalies in code that are difficult for humans to detect, for example, architectural or algorithmic inefficiencies, unbounded edge cases, and security vulnerabilities.

Data Sources

Data is primarily generated from open-source repositories on GitHub, Bitbucket, and GitLab. In addition, data is collected from secondary public datasets, such as Stack Overflow and Reddit. The primary dataset is principally centered around pull/merge requests, issues, and their associated comments to determine the reasoning behind particular code changes. Pull/merge requests are defined as batches of commits made for a particular purpose.

Data Processing

One of the biggest challenges when cleaning data taken from open-source repositories is the inconsistent data format. Data from different sources can come in different formats, follow different standards and conventions, and use project-specific jargon, with little standardization overall. This can require a significant amount of time and resources to resolve and can lead to inaccuracies if not properly addressed. Furthermore, as the focus is on the subset of changes that correspond to critical issues or other improvements to existing code (rather than the creation of new functionality), we also need to be cognizant of the types of changes that we accept.

Naturally, the challenge of data cleaning increases with larger repositories. Dealing with large amounts of data from different sources can be time-consuming and computationally intensive, making it difficult to clean and prepare the data for use in an AI model. Additionally, some repositories may contain sensitive data that needs to be removed before processing. This can be a challenging task that requires specialized knowledge and experience.

The biggest roadblock is the reliance on manual labeling of technical data. To resolve this in an efficient manner, a technique has been developed that enables the automatic filtering and classification of code changes. To do this, a two-stage neural topic model was created. The topic model first performs binary classification of the raw data to remove unusable items. In the second stage, it assigns a label to each item that corresponds to the reason why the code was changed. This method was bootstrapped using a small, curated set of hand-labeled data and then extended across the entire corpus of changes.

Filtering

Data filtering is necessary due to the noise present in the datasets used for this project. Primarily, the process needs to remove "new feature" and "devops"-related configuration and documentation changes while preserving bug fixes and improvements to the codebase itself. To address this issue, the project uses several simple models based on "ALBERT" and "Longformer" to filter the data. Using an automated solution does come with risks, which are mitigated in the following ways. First, a group of experts familiar with software development operations hand-labeled a curated dataset with labels based on the “usefulness” of the proposed application. The criterion for “usefulness” is the reader’s ability to identify a significant code change made in response to an existing problem. Each data point in this ground truth dataset was reviewed by at least five reviewers and is used as validation data against which the filtering model’s accuracy is compared. The repositories sampled to build the ground truth data were also chosen to reduce the impact of project-specific conventions, such as uniform templates, labels, or language, on the model’s ability to discern between the two classes “useful” and “not useful.”
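
As a sketch of this first filtering stage, the snippet below scores a change description with a transformer-based binary classifier. The checkpoint name, label convention, and threshold are illustrative placeholders rather than Metabob's actual configuration, and the model is assumed to have been fine-tuned on the hand-labeled ground truth data described above.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Assumed: an ALBERT-style checkpoint fine-tuned for the binary "useful"/"not useful" task.
tokenizer = AutoTokenizer.from_pretrained("albert-base-v2")
model = AutoModelForSequenceClassification.from_pretrained("albert-base-v2", num_labels=2)
model.eval()

def is_useful(change_text: str) -> bool:
    """Return True if the PR/issue text likely describes a fix or improvement to existing code."""
    inputs = tokenizer(change_text, truncation=True, max_length=512, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits
    # Assumed label convention: class 1 = "useful", class 0 = "not useful".
    return logits.softmax(dim=-1)[0, 1].item() > 0.5
```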

Labeling

The second stage of the process is to assign a label to each item that corresponds to the reason why the code was changed. The system uses the BERTopic technique to perform topic detection and then trains a clustering model on the data transformed into that embedding space. c-TF-IDF (class-based Term Frequency-Inverse Document Frequency) is then used to reduce the number of clusters, resulting in more easily interpretable topics. The filtered data is then passed through the clustering model to assign a label to each item. The project aims to use topic classification to match the cause of the problem described in the document as well as the solution. The number of topics to search for and the desired distribution of source causes are adjustable, to emphasize certain categories of problems and to better encapsulate changes under salient topics via their topic words. This allows the system to adapt the configuration to better meet the needs of specific users, making it more effective in identifying bugs and performance issues in their codebase.
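
A minimal sketch of this labeling stage with the BERTopic library is shown below; `docs` is assumed to be the list of filtered PR/issue descriptions, and the topic counts are illustrative.

```python
from bertopic import BERTopic

# Fit the topic model on the filtered change descriptions.
topic_model = BERTopic(calculate_probabilities=True)
topics, probs = topic_model.fit_transform(docs)

# Merge similar topics via their c-TF-IDF representations for fewer, more interpretable labels.
topic_model.reduce_topics(docs, nr_topics=20)

# Human-readable label for one change: the top topic words of its assigned topic.
assigned = topic_model.topics_[0]
label_words = [word for word, _ in topic_model.get_topic(assigned)]
```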

Topic validation is an ongoing process in the project, which includes both topic discovery and categorization. This heavily depends on human evaluation, which involves having human experts assess the topics generated by the model to determine their relevance and accuracy. These experts are individuals with a deep understanding of software development and are able to provide valuable feedback on the quality of the topics generated by the model. Particularly relevant samples flagged as a specific topic are then cataloged as part of a ground truth dataset that can be used to validate subsequent versions of the topic model. This ensures that the labels are accurate and that the data is representative of the types of changes that are likely to be encountered in other codebases.

Another approach is the evaluation of the coherence of the topics generated by the model. Coherence is a measure of how well the words in a topic are related to each other. A high coherence score indicates that the words in a topic are semantically related and therefore likely to describe a meaningful topic. The coherence score is calculated using the “u-mass” coherence score. While this can provide useful quantitative metrics for determining consistency, manual validation of the results is still required to ensure that the underlying documents are correctly described by the topic.
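
A sketch of this check with gensim's coherence implementation is shown below; `tokenized_docs` and `topic_words` are assumed to come from the corpus and topic model described above.

```python
from gensim.corpora import Dictionary
from gensim.models.coherencemodel import CoherenceModel

# Build the corpus representation required by the u_mass measure.
dictionary = Dictionary(tokenized_docs)
corpus = [dictionary.doc2bow(doc) for doc in tokenized_docs]

# topic_words: one list of top words per topic, as produced by the topic model.
coherence = CoherenceModel(
    topics=topic_words,
    corpus=corpus,
    dictionary=dictionary,
    coherence="u_mass",
).get_coherence()
```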

Graph Embedding

Metabob employs a two-stage system for accurately embedding context information within a single graph. The source code is first split into semantic tokens through an nlp2 tokenizer, and 80-dimensional vector embeddings are then generated using FastText. These embeddings have been trained on code snippets of a particular language. The model was manually evaluated with the goal of ensuring high correspondence between module/library names and the general conventions for variables, scopes, and typing. Since FastText is a skip-gram method, it can generalize vectors to new words that appear in a given codebase while keeping the vectors for external dependencies or APIs shared across multiple codebases fixed.
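
A minimal sketch of training such token embeddings with gensim's FastText implementation is shown below; the corpus variable and the hyperparameters other than the 80-dimensional vector size are illustrative.

```python
from gensim.models import FastText

# Assumed: `token_streams` is a list of token lists, one per source file in one language.
ft = FastText(vector_size=80, window=5, min_count=3, sg=1)  # sg=1 selects skip-gram
ft.build_vocab(corpus_iterable=token_streams)
ft.train(corpus_iterable=token_streams, total_examples=len(token_streams), epochs=10)

# Character n-grams let FastText produce vectors even for identifiers it has never seen
# (the identifier below is purely illustrative).
vec = ft.wv["parse_request_body"]
```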

Those text tokens are then mapped to groupings identified in the abstract syntax tree. This is done by excluding the individual nodes for each text token, opting instead for the function call with attributes as the smallest individual grouping, and averaging the embeddings across each token type. The reasoning behind this is twofold. Firstly, it decreases the amount of noise generated by individual tokens, as there is a large degree of variability in specific names for variables, modules, aliases, etc. These each vary based on coding style and general architecture. Secondly, it decreases the overall size of the graph, which allows the inclusion of more of the codebase within a single, well-connected graph.
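
A small sketch of this averaging step is shown below; the grouping map and the FastText model are assumed to come from the parsing and embedding steps above.

```python
import numpy as np

def node_features(grouping_tokens: dict, ft) -> dict:
    """Collapse token embeddings into one averaged feature vector per AST grouping.

    grouping_tokens: maps an AST grouping id to the text tokens it covers (assumed input).
    ft:              the FastText model from the previous sketch.
    """
    return {
        group_id: np.mean([ft.wv[tok] for tok in tokens], axis=0)
        for group_id, tokens in grouping_tokens.items()
        if tokens
    }
```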

Next, a “Flow Augmented Abstract Syntax Tree” (FA-AST) is created from the AST by expanding links to other parts of the codebase. This is done by resolving resource calls to nodes defined within the same source file, or across multiple source files by resolving the imports. Edges are then added for each call and for each step in the program's sequential execution order, removing explicit control nodes. Edges are added in stages after the parsing is complete, starting with edges from the definitions of code components to their usages, then the intended call sequence, then bi-directional pairs of edges to demarcate the beginning and end of flow control blocks (conditionals, loops, etc.).
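
The snippet below sketches this edge construction for Python source using the standard ast module and networkx; it is limited to single-file call resolution, with import resolution and the flow-control edge pairs omitted for brevity.

```python
import ast
import networkx as nx

def build_fa_ast(source: str) -> nx.MultiDiGraph:
    """Parse Python source into a simplified FA-AST-style graph (single-file sketch)."""
    tree = ast.parse(source)
    graph = nx.MultiDiGraph()
    definitions = {}

    # First pass: plain AST edges, plus a registry of definitions.
    for node in ast.walk(tree):
        graph.add_node(id(node), ast_type=type(node).__name__)
        for child in ast.iter_child_nodes(node):
            graph.add_edge(id(node), id(child), edge_type="ast")
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            definitions[node.name] = id(node)

    # Second pass: link each call site to the definition it resolves to.
    for node in ast.walk(tree):
        if isinstance(node, ast.Call) and isinstance(node.func, ast.Name):
            target = definitions.get(node.func.id)
            if target is not None:
                graph.add_edge(id(node), target, edge_type="call")
    return graph
```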

Using FA-AST graph embeddings adds context information about the semantic structure spanning multiple components of the codebase. This higher degree of connectivity enables feature attribution across multiple node and edge types, so particular code patterns of arbitrary shape can be identified based on how their structural components are connected to each other.

Hierarchical Aggregation

In highly structured contexts such as abstract syntax, where many of the subgraphs surrounding nodes follow consistent and regularized patterns, individual nodes can carry little meaningful context on their own. By aggregating features from multiple levels of the graph, the model gains a more comprehensive understanding of the codebase and its structure, which helps it detect bugs more accurately. Additionally, by reducing the size of the graph, the model runs more efficiently, which makes it more practical for use on large codebases.

Hierarchical aggregation is a method of graph pooling in graph neural networks. It is a technique used to reduce the size of the graph while preserving important feature information. The idea behind hierarchical aggregation is to use a sparse and differentiable method to capture the graph structure. The method employed by Metabob is largely inspired by the HGP-SACA and HIBPOOL methods. Due to how the graphs in this domain are structured and the regularity of the AST, we can employ some heuristics during the source code parsing phase to attribute subgraphs to specific node groups before applying further aggregation based on local groupings of the broader code architecture. Since the embedding contains additional code symbol reference, function, and execution information, the pooling must account for the differences in meaning that the edge features convey. To that end, the clustering of node groupings is weighted by their edge feature types.
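
The snippet below sketches one such pooling step: pre-attributed node groups are collapsed into single nodes, their features are averaged, and cross-group edges are accumulated with per-type weights. The group assignments and edge-type weights are illustrative stand-ins for the parsing heuristics described above, not the HGP-SACA or HIBPOOL algorithms themselves.

```python
import numpy as np
import networkx as nx

# Illustrative weights: structural edges count less than reference/flow edges.
EDGE_TYPE_WEIGHT = {"ast": 1.0, "call": 2.0, "def_use": 2.0, "flow": 1.5}

def pool_groups(graph: nx.MultiDiGraph, features: dict, group_of: dict):
    """Collapse nodes into their assigned groups, averaging features and weighting edges by type."""
    pooled = nx.DiGraph()
    grouped = {}
    for node, group in group_of.items():
        pooled.add_node(group)
        grouped.setdefault(group, []).append(features[node])
    pooled_features = {g: np.mean(vecs, axis=0) for g, vecs in grouped.items()}

    for u, v, data in graph.edges(data=True):
        gu, gv = group_of[u], group_of[v]
        if gu == gv:
            continue
        weight = EDGE_TYPE_WEIGHT.get(data.get("edge_type"), 1.0)
        prev = pooled.get_edge_data(gu, gv, default={"weight": 0.0})["weight"]
        pooled.add_edge(gu, gv, weight=prev + weight)
    return pooled, pooled_features
```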

Windowing

For large codebases, windowing is a technique used to divide the codebase into smaller chunks, or windows, that can be analyzed independently. This is necessary because large codebases can contain a very large number of files and lines of code, making it difficult to analyze the entire codebase at once. Windowing allows the codebase to be broken down into smaller, manageable chunks, which can be analyzed individually and then combined. Windowing is performed by stepping through the diagonalized edge matrix in increments of half the window size, the window size itself being a tunable parameter. Predictions from the resulting overlapping windows are then averaged to produce node-wise labels.
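
A sketch of this scheme is shown below: windows step through the node ordering by half the window size, and the per-node predictions from every window covering a node are averaged. The `predict` callable stands in for the classifier described in the next section.

```python
import numpy as np

def windowed_predict(node_features: np.ndarray, predict, window_size: int = 512) -> np.ndarray:
    """Run `predict` over overlapping windows (stride = window_size // 2) and average per node."""
    n_nodes = len(node_features)
    step = window_size // 2
    sums, counts = None, np.zeros(n_nodes)

    for start in range(0, max(n_nodes - step, 1), step):
        end = min(start + window_size, n_nodes)
        preds = predict(node_features[start:end])   # shape: (nodes in window, n_classes)
        if sums is None:
            sums = np.zeros((n_nodes, preds.shape[1]))
        sums[start:end] += preds
        counts[start:end] += 1
    return sums / counts[:, None]
```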

Graph Attention Classifier

A Graph Attention Network (GAT) is utilized to categorize nodes within the FA-AST structure into the detectable bug categories. This model offers a number of key benefits over other graph-based networks: the ability to learn attention weights on a per-neighborhood basis for a particular node, the ability to handle “irregularly” shaped graphs without predefining a specific structure prior to training, and the ability to handle directed graphs, which is how ASTs (and, more importantly, FA-ASTs) are structured internally. The model selection process confirmed that, owing to these benefits, a GAT is particularly well suited for this application.
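
A minimal sketch of such a classifier with PyTorch Geometric's GATConv layer is shown below; the dimensions, head counts, number of bug categories, and use of edge features are illustrative choices rather than Metabob's exact architecture.

```python
import torch
import torch.nn.functional as F
from torch_geometric.nn import GATConv

class BugGAT(torch.nn.Module):
    """Two-layer GAT that emits per-node logits over the detectable bug categories."""

    def __init__(self, in_dim: int = 80, hidden: int = 64, n_classes: int = 20, edge_dim: int = 8):
        super().__init__()
        self.gat1 = GATConv(in_dim, hidden, heads=4, edge_dim=edge_dim)
        self.gat2 = GATConv(hidden * 4, n_classes, heads=1, edge_dim=edge_dim)

    def forward(self, x, edge_index, edge_attr):
        x = F.elu(self.gat1(x, edge_index, edge_attr))
        return self.gat2(x, edge_index, edge_attr)
```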

A fully supervised training method is employed for the bug detection system, with the associated labels defined by the topic modeling technique. Each codebase is parsed into the FA-AST structure, and labels are assigned to each changed node (or its nearest parent), as determined from the commit history.

The resulting dataset is fairly lopsided towards larger, better-maintained projects. The reason is that initial sampling focused on these kinds of repositories given that they contain more highly correlated code changes for a particular topic and have better associated documentation. To counteract overspecificity towards particular codebases and architectural styles, some of the underrepresented codebases within the dataset are oversampled in two ways. Firstly, if there are specific labeled categories that are strongly associated with a particular code change their relative weight within the training dataset is increased. Secondly, new samples are generated from the overrepresented datasets by manipulating the graph structure to alter the sequence of function calls in the non-buggy code surrounding the flagged bug regions.
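
As a sketch of the first re-weighting strategy, the snippet below draws training graphs with probability inversely proportional to how often their source repository appears; `repo_of_sample` is an illustrative list giving the repository of each training graph.

```python
import numpy as np
from torch.utils.data import WeightedRandomSampler

# Under-represented repositories receive proportionally higher sampling weights.
repo_counts = {repo: repo_of_sample.count(repo) for repo in set(repo_of_sample)}
weights = np.array([1.0 / repo_counts[repo] for repo in repo_of_sample])
sampler = WeightedRandomSampler(weights, num_samples=len(weights), replacement=True)
```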

Generating Explanations

Found bugs are explained by generating natural language explanations using a subsequent seq2seq model in the AI pipeline. Input context strings are built from the identified bug categories for sequences of more than five consecutive nodes, where each node registers above 70% confidence for a specific bug type. Since the classification categories correspond to classes from our topic modeling system, each topic can be expanded into the keywords that strongly correspond to it. This is then used as a seed for the context string.
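
A sketch of this selection and seeding step is shown below; the node ordering, per-node probabilities, and topic-word mapping are assumed inputs from the classifier and topic model, and the thresholds mirror the ones stated above.

```python
def context_seeds(node_order, node_probs, topic_words, min_len=6, threshold=0.7):
    """Return (node run, topic-word seed string) pairs for runs of confidently flagged nodes.

    node_probs:  maps node id -> {bug category: confidence}.
    topic_words: maps bug category -> list of its top topic words.
    """
    seeds, run = [], []
    for node in node_order:
        category, conf = max(node_probs[node].items(), key=lambda kv: kv[1])
        if conf > threshold and (not run or run[-1][1] == category):
            run.append((node, category))
        else:
            if len(run) >= min_len:
                seeds.append((run, " ".join(topic_words[run[0][1]])))
            run = [(node, category)] if conf > threshold else []
    if len(run) >= min_len:
        seeds.append((run, " ".join(topic_words[run[0][1]])))
    return seeds
```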

In addition, these node sequences are resolved into specific subspans of the codebase. This is done to retrieve code snippets for the detected fault, including additional documentation relating to that area of the code and the purpose of the broader codebase from both co-located documentation, primarily docstrings, and remote documentation, such as READMEs.

Once the code snippets and documentation are detected, they are used as input to a sequence-to-sequence model (seq2seq) in order to generate natural language explanations for the detected bugs. The seq2seq model is trained on a dataset of code-explanation pairs to learn how to generate human-readable explanations from the code snippets. The input context strings that are built earlier can also be used as additional input to the seq2seq model to provide more context for the explanation. This way, the model can generate explanations that are more specific to the codebase and the bug category.
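
The snippet below sketches this generation step with a generic encoder-decoder checkpoint standing in for a model fine-tuned on code/explanation pairs; the checkpoint name and prompt layout are placeholders.

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("t5-base")          # placeholder checkpoint
model = AutoModelForSeq2SeqLM.from_pretrained("t5-base")

def explain(topic_seed: str, code_snippet: str, docs: str) -> str:
    """Generate a natural language explanation from the topic seed, code span, and documentation."""
    context = (
        "topic words: " + topic_seed + "\n"   # keywords expanded from the bug category
        "code:\n" + code_snippet + "\n"       # span resolved from the flagged node sequence
        "docs:\n" + docs                      # co-located and remote documentation
    )
    inputs = tokenizer(context, truncation=True, max_length=512, return_tensors="pt")
    output_ids = model.generate(**inputs, max_new_tokens=128, num_beams=4)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)
```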

Additionally, keywords from the topic modeling system can be used to guide the explanation generation. For example, if the bug category is related to memory management, the model can use keywords such as "memory allocation" and "garbage collection" to generate explanations that focus on those specific aspects of the code.

As a result, the mechanism used to generate explanations can be tuned by using the same levers to adjust the topic modeling as described earlier. This allows the system to adapt to changing detection profiles based on the categories found within the input data.

Generating Code Fixes

Code fixes are essentially guided code recommendations. Currently, the most common techniques for creating code recommendations are the more conventional collaborative and content-based filtering approaches, which build recommendation libraries to search through. In addition, universal Seq2Seq transformers have been applied to the code generation task with good success for specific types of tasks. However, due to the way that Metabob flags bugs, there is another approach that offers additional value given its interoperability with the existing mechanism for performing analysis on code graphs.

The graph-based recommendation approach is built on Graph-to-Graph models of the kind used in translation tasks. Fundamentally, the task requires the re-attribution of existing nodes and edges within the broader codebase to resolve fixes to chained dependencies, as well as the creation of new nodes and edge features when required. Since both of these are present within the parsed representations of the before and after states of a code change, the required set of changes can be represented as the difference between the two parsed graph structures. Within each node, the feature representation reflects the encoding of the underlying text changes; in that way, the smaller adjustments required are also included within the existing source embedding. However, there is some data loss associated with the mechanism used to parse the input data, primarily confined to the individual tokens referenced within a node. Here it does not matter whether the edge features in the embedding link to the original definition or use the exact string, as long as this is done uniformly across the codebase.
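
The snippet below sketches the graph-difference idea with networkx, assuming node identifiers are stable across the before and after parses; it captures only structural additions and removals, not the node-feature edits described above.

```python
import networkx as nx

def graph_diff(before: nx.MultiDiGraph, after: nx.MultiDiGraph) -> dict:
    """Express a code fix as the set difference between the before and after FA-AST graphs."""
    return {
        "added_nodes":   set(after.nodes) - set(before.nodes),
        "removed_nodes": set(before.nodes) - set(after.nodes),
        "added_edges":   set(after.edges) - set(before.edges),
        "removed_edges": set(before.edges) - set(after.edges),
    }
```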

The primary challenges with this technique lie in managing the complexity of the analyzed codebase and in recomposing the code graph alterations back into human-readable source code. This requires inverse parsers for the code representation used in the model for each language, and a customized sequence generation model to create valid inverse mappings of node features (since they are built from embeddings of skip-gram text). Additionally, there may be issues with data loss during the parsing process, which could impact the accuracy of the recommendations generated. Despite these challenges, the graph-based recommendation approach has the potential to provide valuable insights and recommendations for code fixes, particularly in situations where there are complex dependencies between different parts of the code.