Software maintenance is an expensive activity in the software lifecycle. Last-minute change requests push developers to make “band-aid fixes,” setting aside good design and implementation principles to deliver code in a short time frame. Continuous software changes that do not follow software quality requirements create technical debt: quick workarounds in the source code that worsen its maintainability. Technical debt makes the code even harder to maintain, leading to a downward spiral.
Technical debt can be recognized through patterns or characteristics that indicate a deeper problem; these are called “code smells.” Code smells arise from poor implementation choices and can negatively affect program comprehensibility, change- and defect-proneness, and maintenance costs.
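To make this concrete, here is a hypothetical sketch of a classic smell, a “Long Method” that tangles parsing, validation, and computation in one function, next to a refactored version. All names and data are invented for illustration:

```python
# A "Long Method" smell: one function parsing, validating, and computing at once.
def average_order_value(raw_lines):
    total, count = 0.0, 0
    for line in raw_lines:
        parts = line.split(",")
        if len(parts) != 2:          # validation tangled with parsing
            continue
        try:
            amount = float(parts[1])
        except ValueError:           # more validation, buried mid-loop
            continue
        total += amount              # computation tangled with both
        count += 1
    return total / count if count else 0.0

# Refactored: each concern gets its own small, testable function.
def parse_amounts(raw_lines):
    for line in raw_lines:
        parts = line.split(",")
        if len(parts) == 2:
            try:
                yield float(parts[1])
            except ValueError:
                pass

def average(values):
    values = list(values)
    return sum(values) / len(values) if values else 0.0
```

Both versions compute the same result; the smelly one is simply harder to read, test, and change, which is exactly how such debt raises maintenance costs.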
Given these problems, it has become increasingly important to check code quality early in the software development cycle, a practice captured by the recent buzzword “shift left.” Recognizing a poor design decision late costs a huge amount of time overall. So-called static analysis tools have the potential to detect faulty code and code smells before they reach production.
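As a minimal sketch of what such a tool does, the snippet below uses Python's standard `ast` module to flag overly long function definitions, a toy “Long Method” check. The 15-line threshold is an arbitrary choice for illustration, not a value any real tool prescribes:

```python
import ast

def find_long_functions(source, max_lines=15):
    """Flag function definitions longer than max_lines lines."""
    flagged = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            # end_lineno/lineno give the span of the definition (Python 3.8+)
            length = node.end_lineno - node.lineno + 1
            if length > max_lines:
                flagged.append((node.name, length))
    return flagged
```

Run on a module's text, e.g. `find_long_functions(pathlib.Path("module.py").read_text())`, it returns `(name, length)` pairs for every offending function, without ever executing the code, which is what makes the analysis “static.”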
The issue with most of the available static analysis tools, however, is low accuracy, which manifests as high false-positive rates and causes programmers to ignore their output. Research has shown that static analysis tools indeed generate too many alerts: Heckman and Williams (2008) found an alert density of 40 alerts per thousand lines of code. The problem is that many of these alerts (35–91%, depending on the study) are unactionable. Checking all the alerts reported by a static analysis tool is incredibly time-consuming: if a tool reported 1,000 alerts and each alert required 5 minutes of inspection, working through them would take 10.4 uninterrupted 8-hour workdays!
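The back-of-the-envelope arithmetic behind that figure is easy to verify:

```python
alerts = 1_000
minutes_per_alert = 5
hours = alerts * minutes_per_alert / 60   # 5,000 minutes ≈ 83.3 hours
workdays = hours / 8                      # ≈ 10.4 eight-hour workdays
print(round(workdays, 1))                 # 10.4
```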
Most of these tools can be considered “rule-based,” that is, systems that apply human-made rules to store, sort, and manipulate data. Therefore, in recent years, different techniques have been proposed to identify crucial and actionable alerts more accurately. The most promising approach is machine learning. Through machine learning, rules can be inferred that human experts never explicitly encoded; moreover, such models can learn from the actions of other programmers.
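A toy illustration of the idea: a nearest-neighbor classifier that labels new alerts as actionable or not by copying the verdict developers gave to the most similar past alert. Both the feature set (severity, file churn, alert age) and the training data here are invented for illustration; real studies use far richer features and models:

```python
from math import dist

# Hypothetical labeled history: (severity, file_churn, alert_age_days) -> actionable?
history = [
    ((3, 12, 2), True),    # severe, in a frequently changed file, fresh
    ((3, 15, 1), True),
    ((1, 0, 400), False),  # minor, in a dormant file, ignored for a year
    ((2, 1, 300), False),
]

def predict_actionable(features):
    """1-nearest-neighbor: copy the label of the closest historical alert."""
    _, label = min(history, key=lambda pair: dist(features, pair[0]))
    return label
```

For example, `predict_actionable((3, 10, 5))` lands near the fresh, high-severity examples and comes back `True`, while an old, low-severity alert is predicted unactionable. The learned behavior comes from developer actions, not from hand-written rules.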
Pecorelli and colleagues (2022) applied machine learning algorithms to the outputs of static code analysis tools. However, using the warnings raised by those tools as features provided only a limited gain for code smell detection, suggesting that the warnings shown to developers do not clearly point to design problems. Applying a similar approach, Lenarduzzi and colleagues (2019) studied the fault-proneness of SonarQube violations on 21 open-source systems. Using seven machine learning algorithms and logistic regression, they showed that violations classified as “bugs” hardly ever lead to a failure. An additional study applied eight machine learning techniques to 33 Java projects to understand whether SonarQube correctly identifies technical debt; its results show that the 28 software metrics derived from SonarQube are in fact not correlated with technical debt. One likely reason for these disappointing results is the high number of false-positive warnings raised by static analysis tools. When such outputs are used as inputs for a machine learning model, poor results are not surprising: the model simply re-learns the rules of the static analysis tool, and since those rules were not overly useful, neither is the model.
Consequently, developers tend to reject static analysis tools, a tendency confirmed by prior research and by a considerable number of interviews conducted by the authors of this article. Research therefore calls for additional features that have mostly been ignored in static analysis, such as AI-specific instruments.
Studies have already shown that machine learning can identify actionable code smells very accurately. Applying 16 different machine-learning techniques to four types of code smells (Data Class, Large Class, Feature Envy, Long Method) across 74 software systems, Arcelli Fontana and Zanoni (2017) achieved accuracy rates of up to 96%. Shcherban et al. (2020) applied two machine learning algorithms to locate code smells with a precision of 98% and a recall of 97%; unlike other studies, their approach mines and analyzes code smell discussions in textual artifacts (e.g., code reviews). Looking ahead, the most promising direction is to systematically assess deep learning methods, which act directly on source code and might therefore combine features more naturally.
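In the same spirit, here is a minimal sketch of metric-based learning: a one-feature “decision stump” that learns a method-length threshold from human-labeled examples instead of having the threshold hard-coded by a rule author. The labeled data and the single-metric setup are simplifications for illustration; the studies above use many metrics and far richer classifiers:

```python
def train_stump(samples):
    """Learn the length threshold that best separates smelly from clean methods.
    samples: list of (method_length, is_long_method) pairs."""
    best_threshold, best_errors = None, len(samples) + 1
    for candidate, _ in samples:
        # Count how often "length > candidate" disagrees with the human label.
        errors = sum((length > candidate) != label for length, label in samples)
        if errors < best_errors:
            best_threshold, best_errors = candidate, errors
    return best_threshold

# Hypothetical labeled methods: lengths with a human "Long Method" verdict.
labeled = [(8, False), (12, False), (15, False), (40, True), (55, True), (70, True)]
threshold = train_stump(labeled)   # learns 15: predict smelly when length > 15
```

The point of the exercise: the cutoff comes out of the labeled data, so it adapts when the labels change, which is exactly the property that lets learned detectors outperform fixed, rule-based ones.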
So machine learning does indeed seem to be a way out of the false-positive dilemma currently plaguing static analysis tools. Feel free to test such a tool on your own repository.