Reducing noise from automated code reviews
When talking to different members of engineering teams through research efforts at work, it has become apparent that code reviews take up an awful lot of engineers’ daily work hours. This raised the question of why these engineers or engineering teams weren’t using any tools to automate code reviews, given that multiple options are available on the market. Little did we know that most of them were already using automation for code reviews! Tools such as SonarQube/SonarCloud kept popping up in the discussions. Such tools are meant to automatically identify errors in the code and report them to the user, which they do. What we found out later was that the problem wasn’t that these tools were failing to find errors in the code, but rather that they were finding too many, and too insignificant, ones, ultimately creating a lot of noise in the automated code review. Engineering managers often pointed out that their teams do not have the bandwidth to go through all the errors raised against their legacy code. This problem is one of the reasons why we originally started Metabob, an AI-assisted code review tool that can analyze complete codebases and recognize complex context- and logic-based errors and problems.
In this blog, we are going to compare our AI-assisted tool to traditional code review tools such as SonarCloud, as well as a more recent tool called DeepSource. For this purpose, we selected a repository and ran it through the three code review tools: SonarCloud, DeepSource, and Metabob. Our hypothesis was that SonarCloud and DeepSource would find many more problems than Metabob, but that Metabob’s detections would be more complex and significant.
We chose a repository called pdpipe from GitHub to compare the three tools. Pdpipe is a Python package that “provides a concise interface for building pandas pipelines that have pre-conditions, are verbose, support the fit-transform design of scikit-learn transformers and are highly serializable.” Pdpipe was chosen because of the type of repository it is: at the moment, Metabob’s AI mostly covers Python and performs best with data science-related projects. Because the purpose is to measure and compare the noise generated by AI and non-AI code review tools, it was important to select a repository that the AI can fully process. Since SonarCloud and DeepSource can also analyze pdpipe using the full caliber of their technology, it seemed like an optimal fit for this comparison.
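To give a sense of the kind of code being analyzed, here is a minimal sketch of how a pdpipe pipeline is typically built. It assumes pdpipe and pandas are installed; the column names and data are illustrative and not taken from the repository itself.

```python
# Minimal illustration of the pipelines pdpipe builds (column names and data are
# made up for this example; they are not from the pdpipe repository's own tests).
import pandas as pd
import pdpipe as pdp

df = pd.DataFrame(
    {"name": ["Alice", "Bob"], "age": [34, 29], "country": ["FI", "US"]}
)

# Stages are composed with `+` into a pipeline; each stage checks its
# pre-conditions before transforming the DataFrame.
pipeline = pdp.ColDrop("name") + pdp.OneHotEncode("country")

# Pipelines are callable on DataFrames.
print(pipeline(df))
```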
Alright, so let’s take a quick look at what these different tools were able to detect. Below is a summary of the number of problems and the categories each tool was able to find. As a side note, on DeepSource’s platform users can choose what types and severities of problems they want the tool to detect during the code review process. For the purpose of this blog, we configured DeepSource to detect problems at all severity levels and to include all problem categories except style and documentation.
First of all, the number of problems detected varies heavily among the three tools. DeepSource’s analysis generated a lot of noise by raising 1,584 problems in total. Going through this many errors takes a lot of effort from development teams. One could argue that DeepSource offers a way to reduce the noise by changing the settings for which kinds and severities of problems to detect. However, 1,351 of the raised issues were security-related, and it would be surprising if organizations chose not to detect security issues. So, overall, the result of DeepSource’s code review seemed rather noisy for this particular project. To help address the raised errors, DeepSource offers an autofix that corrects the code automatically, although the autofix feature is not yet available for all detected issues.
SonarCloud detected significantly fewer problems than DeepSource. SonarCloud mostly detected code smells, which aren’t necessarily problems that require immediate attention from developers. SonarCloud defines a code smell as “A maintainability issue that makes your code confusing and difficult to maintain.” The tool also detected 17 bugs. A bug is defined by SonarCloud as “A coding error that will break your code and needs to be fixed immediately.” 16 of the found bugs were classified as major by SonarCloud and one was classified as minor. For most of the bugs SonarCloud detected, the tool suggested to “Remove or refactor this statement; it has no side effects.” However, in this case, following the suggestion would invalidate the logic of the function. The bug is raised inside a pytest test that wraps the expression cond & 5 in pytest.raises(TypeError), so raising the error in that situation is exactly the desired behavior. Because SonarCloud reads the code through pre-set rules without understanding the context, it raises an error since it doesn’t see the statement’s result used anywhere in the code. SonarCloud doesn’t understand that the statement lives in a pytest test that explicitly expects a TypeError to be raised. This is a major disadvantage of a tool that uses preset rules to detect errors compared to an AI-assisted tool, because raising this bug over and over again generates noise. Metabob does not raise the same error because it understands the context of the code due to its ability to analyze the whole codebase.
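To illustrate why that suggestion misfires, here is a hedged sketch of the pattern in question. The Cond class below is a stand-in we made up for illustration, not pdpipe’s actual condition class; the point is only that an expression with “no side effects” is intentional inside pytest.raises.

```python
# Sketch of the pattern SonarCloud flags: a statement that appears to have no
# side effects is intentional, because the test asserts that evaluating it raises.
import pytest


class Cond:
    """Illustrative stand-in for a condition object; `&` with a non-Cond is unsupported."""

    def __and__(self, other):
        if not isinstance(other, Cond):
            raise TypeError("conditions can only be combined with other conditions")
        return self


def test_and_with_non_condition_raises():
    cond = Cond()
    with pytest.raises(TypeError):
        cond & 5  # "no side effects", but raising TypeError here is the whole point
```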
As DeepSource and SonarCloud detect problems through preset rules, we were expecting to see some overlap between the tools’ detections. As an example, both tools detected a bug risk on line 1228 of the core.py file. However, in total there wasn’t as much overlap between the tools’ detections as expected. Below are screenshots of how the same problem was presented by DeepSource and SonarCloud, respectively.
The bug risk error on line 1228 is classified as minor by SonarCloud and as low severity by DeepSource. Metabob’s AI did not detect this issue. Ultimately, there was a surprisingly low amount of overlap between the outputs of DeepSource and SonarCloud, and neither tool was able to detect problems similar to Metabob’s findings. Metabob’s AI has been trained on millions of bug fixes performed by veteran developers, through which it has learned to recognize the root causes of many logical and context-based problems. The tool runs an attention-based graph neural network to detect errors in the codebase and learns dynamically from each node, weighting how important the changes and corrections are. Overall, Metabob’s ability to interpret the whole codebase allows it to find more complicated problems hiding in logic and context, as well as to generate code that follows the existing code architecture, which will become increasingly visible with Metabob’s upcoming IDE plugin. As an example from pdpipe’s case, Metabob detected a problem where “The cause is that the lshift method of the BoundColumnPotential has unhandled present exceptions” (see image below). Put simply, Metabob’s AI is communicating that the lshift method can raise exceptions that are never handled, and that the user should address this.
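For readers unfamiliar with what unhandled exceptions in an lshift method look like in practice, here is a hedged sketch. The class below only mirrors the shape of the reported issue; its attributes and logic are assumptions for illustration, not pdpipe’s actual BoundColumnPotential implementation.

```python
# Illustrative sketch of an __lshift__ operator that can raise exceptions no caller
# handles. The attribute names and logic are assumptions, not pdpipe's real code.
class BoundColumnPotential:
    def __init__(self, column):
        self.column = column

    def __lshift__(self, other):
        # If `other` is not another BoundColumnPotential, accessing `.column`
        # raises AttributeError at runtime, and nothing upstream catches it.
        return BoundColumnPotential(self.column + other.column)


price = BoundColumnPotential("price")
combined = price << BoundColumnPotential("tax")  # works as intended
# price << 5  # would raise AttributeError; validating `other` or handling the
#             # exception is the kind of fix this detection points toward
```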
Below is another example of a high-severity problem that Metabob detected in the pdpipe repository. The problem is that a function is passed into the class performing the operations, but the code never verifies that this argument is actually callable; if a non-callable is passed in, the code fails when it runs.
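As a rough illustration of that issue and one way to address it, here is a minimal sketch. The class and parameter names are hypothetical, not pdpipe’s actual code; the idea is simply that checking callable() up front surfaces the mistake immediately instead of letting the pipeline break later.

```python
# Hypothetical sketch of the callable-check issue: a stage accepts a function
# argument, and without a guard a non-callable only fails once the code runs.
class ApplyStage:
    def __init__(self, func):
        if not callable(func):
            # Fail fast with a clear message instead of breaking mid-run.
            raise TypeError(f"expected a callable, got {type(func).__name__}")
        self.func = func

    def apply(self, value):
        return self.func(value)


stage = ApplyStage(lambda x: x * 2)
print(stage.apply(10))            # 20
# ApplyStage("not a function")    # raises TypeError immediately, not deep in a run
```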
Conclusion
When it comes to acting on automated code reviews, one concern is that addressing all or most of the errors raised by a given tool takes too much time and effort. Rule-based tools such as the ones presented in this blog (DeepSource and SonarCloud) are capable of finding relatively simple errors and issues, and a lot of them. These tools feel more like a final linting checkpoint than a code review tool that detects complex problems. Sometimes the detections can be confusing, as in the case where SonarCloud reported bugs in the pytest files. Metabob’s AI is a useful tool for finding more complex errors that hide in the logic and context of the code, and it generates less noise because it is capable of examining that context.
Ultimately it all depends on what you are looking for from these tools. If you are writing messy code and want help with syntax and style errors, it is fairly simple to use linters either in the IDE or through static analysis tools that apply the same rules across the whole codebase. On the other hand, if you are looking to save your senior developers the time they spend reviewing other developers’ code, Metabob is the closest solution to finding the problems a senior developer would catch while performing a code review.