It Will Never Work in Theory

Software development research that is relevant in practice

Khaled El Emam, Saida Benlarbi, Nishith Goel, and Shesh N. Rai: “The Confounding Effect of Class Size on the Validity of Object-Oriented Metrics”. IEEE Transactions on Software Engineering, 27(7), July 2001.

Much effort has been devoted to the development and empirical validation of object-oriented metrics. The empirical validations performed thus far would suggest that a core set of validated metrics is close to being identified. However, none of these studies allow for the potentially confounding effect of class size. In this paper, we demonstrate a strong size confounding effect and question the results of previous object-oriented metrics validation studies. We first investigated whether there is a confounding effect of class size in validation studies of object-oriented metrics and show that, based on previous work, there is reason to believe that such an effect exists. We then describe a detailed empirical methodology for identifying those effects. Finally, we perform a study on a large C++ telecommunications framework to examine if size is really a confounder. This study considered the Chidamber and Kemerer metrics and a subset of the Lorenz and Kidd metrics. The dependent variable was the incidence of a fault attributable to a field failure (fault-proneness of a class). Our findings indicate that, before controlling for size, the results are very similar to previous studies: the metrics that are expected to be validated are indeed associated with fault-proneness. After controlling for size, none of the metrics we studied were associated with fault-proneness any more. This demonstrates a strong size confounding effect and casts doubt on the results of previous object-oriented metrics validation studies. It is recommended that previous validation studies be reexamined to determine whether their conclusions would still hold after controlling for size and that future validation studies should always control for size.

We all know that some programs are more complex than others, but can we actually quantify that? Ever since the early 1970s, researchers have invented metrics (such as cyclomatic complexity or coupling and cohesion), then validated them by seeing how well they correlate with things like post-release bug counts. The idea is that if what we mean by “complex” is “hard to understand”, complex software should have more bugs than simple software, and a measure that can predict the likely number of bugs in a product before it’s released would be a very useful thing.

El Emam and his colleagues repeated some of those experiments using bivariate analysis so that they could allocate a share of the blame to code size (measured by number of lines) and the metric in question. It turned out that code size accounted for all of the significant variation: in other words, the object-oriented metrics they looked at didn’t have any actual predictive power once they controlled for the number of lines of code. Herraiz and Hassan’s chapter in Making Software, which reports on an even larger study using open source software, reached the same conclusion:

…for non-header files written in C language, all the complexity metrics are highly correlated with lines of code, and therefore the more complex metrics provide no further information that could not be measured simply with lines of code… In our opinion, there is a clear lesson from this study: syntactic complexity metrics cannot capture the whole picture of software complexity. Complexity metrics that are exclusively based on the structure of the program or the properties of the text…do not provide information on the amount of effort that is needed to comprehend a piece of code—or, at least, no more information than lines of code do.

This emphatically doesn’t mean that trying to measure software is a waste of time: Weyuker and Ostrand’s chapter in that same book shows that it is possible to predict which files are likely to contain the most bugs. What it does mean, though, is that figuring out whether some new measure actually tells us something we didn’t already know is harder than it seems.
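The size confounding effect described above can be illustrated with a small simulation. This is only a sketch under stated assumptions: the data is synthetic (not from either study), the names `size`, `metric`, and `faults` are made up for illustration, and it uses a partial correlation rather than the regression analysis the researchers performed. Both the made-up metric and the fault count are generated from class size alone, so any association between them is due entirely to size.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def partial_correlation(xs, ys, zs):
    """Correlation of xs and ys after removing the linear effect of zs."""
    r_xy = pearson(xs, ys)
    r_xz = pearson(xs, zs)
    r_yz = pearson(ys, zs)
    return (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))


def simulate(n_classes=2000, seed=0):
    """Generate synthetic classes (assumption: purely illustrative data)."""
    rng = random.Random(seed)
    # Hypothetical class sizes in lines of code.
    size = [rng.uniform(10, 1000) for _ in range(n_classes)]
    # A made-up "complexity metric" that depends only on size plus noise.
    metric = [0.05 * s + rng.gauss(0, 5) for s in size]
    # A made-up fault measure that also depends only on size plus noise.
    faults = [0.01 * s + rng.gauss(0, 1) for s in size]
    return size, metric, faults


size, metric, faults = simulate()
raw = pearson(metric, faults)
controlled = partial_correlation(metric, faults, size)
print(f"metric vs. faults, ignoring size:        {raw:.2f}")
print(f"metric vs. faults, controlling for size: {controlled:.2f}")
```

In this toy setup the metric looks like a strong predictor of faults until size is taken into account, at which point the association all but disappears. That is the pattern the studies above report, though real data is of course far messier than this.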

Zornitza Racheva, Maya Daneva, Andrea Herrmann, Klaus Sikkel, and Roel Wieringa: “Do We Know Enough About Requirements Prioritization in Agile Projects: Insights from a Case Study”. RE10.

Requirements prioritization is an essential mechanism of agile software development approaches. It maximizes the value delivered to the clients and accommodates changing requirements. This paper presents results of an exploratory cross-case study on agile prioritization and business value delivery processes in eight software organizations. We found that some explicit and fundamental assumptions of agile requirement prioritization approaches, as described in the agile literature on best practices, do not hold in all agile project contexts in our study. These are (i) the driving role of the client in the value creation process, (ii) the prevailing position of business value as a main prioritization criterion, (iii) the role of the prioritization process for project goal achievement. This implies that these assumptions have to be reframed and that the approaches to requirements prioritization for value creation need to be extended.

Mention the phrase “requirements engineering” to many software developers and you’ll get a groan. For a long time, requirements engineering (elicitation, analysis, modeling, etc.) has been seen as something you do to satisfy the paper-pushers. There’s even a derogatory acronym: BRUF, Big Requirements Up Front. Nevertheless, the fact remains that we typically build software to satisfy a user, even if the user is ourselves. Doing so requires us to think about what we should build, and, hopefully, why we are building it. While requirements may take different forms (user stories, tasks, use cases), they remain fundamental to the process of building software.

This paper points out that the problem of prioritizing requirements is even more of a concern in agile methodologies than in other approaches. This is because short cycle times require frequent prioritization of the backlog. To do so, both XP and Scrum, for instance, call for an involved customer. However, this study found that this was rarely possible, and that as a result prioritization was done by the developers. Customers either found planning meetings too technical, or were not aware of their own requirements. The study also found a difference in the understanding of the “value” of a requirement: there is a distinct difference between the value for the customer and the value for the developer. For example, developers might prefer to re-use solutions from other projects. Racheva et al. did find, though, that the notion of frequent, short iterations with re-prioritization was highly useful, in particular for dealing with new information and unclear requirements.

Mike Barnett, Manuel Fähndrich, K. Rustan M. Leino, Peter Müller, Wolfram Schulte, and Herman Venter: “Specification and Verification: The Spec# Experience”. ICSE 2011.

Spec# is a programming system that facilitates the development of correct software. The Spec# language extends C# with contracts that allow programmers to express their design intent in the code. The Spec# tool suite consists of a compiler that emits run-time checks for contracts, a static program verifier that attempts to mathematically prove the correctness of programs, and an integration into the Visual Studio development environment. Spec# shows how contracts and verifiers can be integrated seamlessly into the software development process. This paper reflects on the six-year history of the Spec# project, scientific contributions it has made, remaining challenges for tools that seek to establish program correctness, and prospects of incorporating verification into everyday software engineering.

A plethora of testing tools and static analyzers “suddenly” became mainstream a decade ago after years of hard work and experimentation by their creators. Today, program verification tools are poised to become part of every serious developer’s toolbox in the same way, not least because of the challenges of concurrent programming. This experience report describes what a mature verification tool can do, and what its creators learned while building it and trying to persuade people to adopt it. Even if you’re not using C#, it offers a lot of insight into things to come.

Mauro Cherubini, Gina Venolia, Rob DeLine, and Andrew J. Ko: “Let’s Go to the Whiteboard: How and Why Software Developers Use Drawings”. CHI 2007.

Software developers are rooted in the written form of their code, yet they often draw diagrams representing their code. Unfortunately, we still know little about how and why they create these diagrams, and so there is little research to inform the design of visual tools to support developers’ work. This paper presents findings from semi-structured interviews that have been validated with a structured survey. Results show that most of the diagrams had a transient nature because of the high cost of changing whiteboard sketches to electronic renderings. Diagrams that documented design decisions were often externalized in these temporary drawings and then subsequently lost. Current visualization tools and the software development practices that we observed do not solve these issues, but these results suggest several directions for future research.

A lot of people have pointed out that formal diagrammatic notations for software, like UML, are taught much more often than they’re used. This paper goes a long way toward explaining the reason: in almost all cases, developers use diagrams as a way of keeping track of bits of conversation while talking to each other, rather than as archival documentation for the benefit of people who weren’t there at the time, and the cost of turning the first into the second is so great that almost no-one ever does it voluntarily. As well as explaining one aspect of real-world software development, this paper is also a great example of how qualitative methods can produce answers that quantitative investigation never could.

Audris Mockus, “Organizational Volatility and its Effects on Software”. FSE 2010:

The key premise of an organization is to allow more efficient production, including production of high quality software. To achieve that, an organization defines roles and reporting relationships. Therefore, changes in organization’s structure are likely to affect product’s quality. We propose and investigate a relationship between developer-centric measures of organizational change and the probability of customer-reported defects in the context of a large software project. We find that the proximity to an organizational change is significantly associated with reductions in software quality. We also replicate results of several prior studies of software quality supporting findings that code, change, and developer characteristics affect fault-proneness. In contrast to prior studies we find that distributed development decreases quality. Furthermore, recent departures from an organization were associated with increased probability of customer-reported defects, thus demonstrating that in the observed context the organizational change reduces product quality.

The influx of newcomers into an organization does not seem to increase defects in its software, perhaps because newcomers get simple tasks at the start. However, other changes to the organization (and especially departures from it) hurt its software significantly. In his paper, Mockus notes that organizational volatility is not the main driver of defects—that would be the technical complexity of the code. But his organizational change measures still explain over 20% of the variance in fault-proneness of the code.

Kathryn Stolee and Sebastian Elbaum, “Refactoring Pipe-like Mashups for End-User Programmers”. ICSE 2011:

Mashups are becoming increasingly popular as end users are able to easily access, manipulate, and compose data from many web sources. We have observed, however, that mashups tend to suffer from deficiencies that propagate as mashups are reused. To address these deficiencies, we would like to bring some of the benefits of software engineering techniques to the end users creating these programs. In this work, we focus on identifying code smells indicative of the deficiencies we observed in web mashups programmed in the popular Yahoo! Pipes environment. Through an empirical study, we explore the impact of those smells on end-user programmers and observe that users generally prefer mashups without smells. We then introduce refactorings targeting those smells, reducing the complexity of the mashup programs, increasing their abstraction, updating broken data sources and dated components, and standardizing their structures to fit the community development patterns. Our assessment of a large sample of mashups shows that smells are present in 81% of them and that the proposed refactorings can reduce the number of smelly mashups to 16%, illustrating the potential of refactoring to support the thousands of end users programming mashups.

A great use of computer science-y automation to improve end-user programs. In this case the beneficiaries are Yahoo! Pipes users, but the same code smell -> refactor principle can easily be used in other web mashups, and eventually in other kinds of end-user programs.

Foyzur Rahman and Premkumar Devanbu, “Ownership, Experience and Defects: A Fine-Grained Study of Authorship”. ICSE 2011:

Recent research indicates that “people” factors such as ownership, experience, organizational structure, and geographic distribution have a big impact on software quality. Understanding these factors, and properly deploying people resources can help managers improve quality outcomes. This paper considers the impact of code ownership and developer experience on software quality. In a large project, a file might be entirely owned by a single developer, or worked on by many. Some previous research indicates that more developers working on a file might lead to more defects. Prior research considered this phenomenon at the level of modules or files, and thus does not tease apart and study the effect of contributions of different developers to each module or file. We exploit a modern version control system to examine this issue at a fine-grained level. Using version history, we examine contributions to code fragments that are actually repaired to fix bugs. Are these code fragments “implicated” in bugs the result of contributions from many? or from one? Does experience matter? What type of experience? We find that implicated code is more strongly associated with a single developer’s contribution; our findings also indicate that an author’s specialized experience in the target file is more important than general experience. Our findings suggest that quality control efforts could be profitably targeted at changes made by single developers with limited prior experience on that file.

The findings are a bit hard to parse from the abstract, but the upshot is that (a) code worked on by a single developer (as opposed to many developers) is more often implicated in defects, and that (b) while a developer’s general experience in a project is not correlated with the rate of defects introduced in it, the developer’s experience with the specific file in question is correlated with a lower likelihood of introducing defects in it.

People have been building complex software for over sixty years, but until recently, only a handful of researchers had studied how it was actually done. Many people had opinions—often very strong ones—but most of these were based on personal anecdotes, or on the kind of “it’s obvious” reasoning that led Aristotle to conclude that heavy objects fall faster than light ones.

To make matters worse, many of the studies that were done were crippled by lack of generality, artificiality, or small sample sizes. As a result, while software engineering billed itself as a “hard” science, rigor was much less common than in “soft” disciplines like marketing, which has gone from the gut instincts of Mad Men to being a quantitative, analytic discipline.

Over the last fifteen years, though, there has been a sea change. Instead of just inventing new tools or processes, describing their application to toy problems in academic journals, and then wondering why practitioners ignored them, a growing number of software development researchers have been looking to real life for both questions and answers. In doing so, some are increasing the sophistication of their quantitative research toolkit, putting the power of statistics and data mining to good use as they plow through massive amounts of electronic records. Others have used rigorous qualitative techniques from anthropology and business studies to deal with complexities that t-tests and data mining algorithms cannot handle.

Sadly, most people in industry still don’t know what researchers have found out, or even what kinds of questions they could answer. One reason is their belief that software engineering research is so divorced from real-world problems that it has nothing of value to offer them (an impression that is reinforced by how irrelevant most popular software engineering textbooks seem to the undergraduates who are forced to wade through them, and by how little software most software engineering professors have ever built).

Another reason is many programmers’ disdain for qualitative research methods, which are often dismissed out of hand (and out of ignorance) as “soft”. A third reason is ignorance—often willful—among practitioners themselves. People will cling to creationism, refuse to accept the reality of anthropogenic climate change, or insist that vaccines cause autism; it is therefore no surprise that many programmers continue to act as if a couple of pints and a quotation from some self-appointed guru constitute “proof” that one programming language is better than another.

The aim of this blog is to be a bridge between theory and practice. Each week, we will highlight some of the most useful results from studies past and present. We hope that this will encourage researchers and practitioners to talk about what we know, what we think we know that ain’t actually so, why we believe some things but not others, and what questions should be tackled next.