It Will Never Work in Theory

Software development research that is relevant in practice

Jeffrey Stylos and Steven Clarke: “Usability Implications of Requiring Parameters in Objects’ Constructors.” ICSE 2007.

The usability of APIs is increasingly important to programmer productivity. Based on experience with usability studies of specific APIs, techniques were explored for studying the usability of design choices common to many APIs. A comparative study was performed to assess how professional programmers use APIs with required parameters in objects’ constructors as opposed to parameterless “default” constructors. It was hypothesized that required parameters would create more usable and self-documenting APIs by guiding programmers toward the correct use of objects and preventing errors. However, in the study, it was found that, contrary to expectations, programmers strongly preferred and were more effective with APIs that did not require constructor parameters. Participants’ behavior was analyzed using the cognitive dimensions framework, revealing that required constructor parameters interfere with common learning strategies, causing undesirable premature commitment.

Programmers argue endlessly about whether language X is more “natural” or more “expressive” than language Y (see, for example, this recent post by Stephen Colbourne about C-style vs. Pascal-style variable declarations). Almost without exception, these arguments are based on personal experience and anecdote, rather than on the kind of careful empirical analysis that has become the norm among serious usability professionals.

This paper by Stylos and Clarke is a good introduction to how such analyses can be done, and the insights they yield. Their work is based on the cognitive dimensions framework developed by Green and Petre in the early 1990s, which Clarke has successfully applied to new APIs at Microsoft (see, for example, this writeup from 2004). Those interested in the area should also check out the PLATEAU workshops, which have run annually since 2009.
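To make the design choice concrete, here is a minimal sketch of the two constructor styles the study compares. The class and field names are hypothetical and do not come from the paper; they only illustrate the difference between requiring values up front and letting the programmer create an object first and fill it in later.

```python
class RequiredParamsMessage:
    """Required-parameter style: the object cannot exist without its fields."""

    def __init__(self, sender: str, recipient: str, subject: str) -> None:
        self.sender = sender
        self.recipient = recipient
        self.subject = subject


class DefaultConstructorMessage:
    """Default-constructor style: create first, then set fields one by one."""

    def __init__(self) -> None:
        self.sender = ""
        self.recipient = ""
        self.subject = ""


# Required parameters: every value must be known before the object exists,
# which is the early commitment the abstract describes.
strict = RequiredParamsMessage("a@example.com", "b@example.com", "Hello")

# Default constructor: the object can be created and explored first,
# with values filled in as the programmer learns what is needed.
relaxed = DefaultConstructorMessage()
relaxed.sender = "a@example.com"
relaxed.recipient = "b@example.com"
relaxed.subject = "Hello"
```

Both objects end up in the same state; what differs is the order in which the programmer has to make decisions.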

Jo E. Hannay, Erik Arisholm, Harald Engvik, and Dag I. K. Sjøberg. Effects of Personality on Pair Programming. TSE 36(1), 2010.

Personality tests in various guises are commonly used in recruitment and career counseling industries. Such tests have also been considered as instruments for predicting the job performance of software professionals both individually and in teams. However, research suggests that other human-related factors such as motivation, general mental ability, expertise, and task complexity also affect the performance in general. This paper reports on a study of the impact of the Big Five personality traits on the performance of pair programmers together with the impact of expertise and task complexity. The study involved 196 software professionals in three countries forming 98 pairs. The analysis consisted of a confirmatory part and an exploratory part. The results show that: 1) Our data do not confirm a meta-analysis-based model of the impact of certain personality traits on performance and 2) personality traits, in general, have modest predictive value on pair programming performance compared with expertise, task complexity, and country. We conclude that more effort should be spent on investigating other performance-related predictors such as expertise, and task complexity, as well as other promising predictors, such as programming skill and learning. We also conclude that effort should be spent on elaborating on the effects of personality on various measures of collaboration, which, in turn, may be used to predict and influence performance. Insights into such malleable, rather than static, factors may then be used to improve pair programming performance.

The topic of personality often comes up in discussions of pair programming efficiency: whether you need to be an extravert to reap its benefits, whether the contrast in personality with your peer is important, and so on. Many research studies have addressed these questions; Hannay & Co’s is a good place to start reading about them. They report: “we found no strong indications that personality affects pair programming performance or pair gain in a consistent manner”, and suggest that industry and research should “focus on other predictors of performance, including expertise and task complexity” instead, as these factors overshadow any personality effects.

Kinshuman Kinshumann, Kirk Glerum, Steve Greenberg, Gabriel Aul, Vince Orgovan, Greg Nichols, David Grant, Gretchen Loihle, and Galen Hunt: Debugging in the (Very) Large: Ten Years of Implementation and Experience. Communications of the ACM, 54(7), July 2011.

Windows Error Reporting (WER) is a distributed system that automates the processing of error reports coming from an installed base of a billion machines. WER has collected billions of error reports in 10 years of operation. It collects error data automatically and classifies errors into buckets, which are used to prioritize developer effort and report fixes to users. WER uses a progressive approach to data collection, which minimizes overhead for most reports yet allows developers to collect detailed information when needed. WER takes advantage of its scale to use error statistics as a tool in debugging; this allows developers to isolate bugs that cannot be found at smaller scale. WER has been designed for efficient operation at large scale: one pair of database servers records all the errors that occur on all Windows computers worldwide.

Engineering is in part what we do when quantitative differences become qualitative differences: when the ten pounds that we can lift easily becomes twenty, then two hundred. This paper is a very readable overview of how Microsoft’s error reporting team has handled such a change in scale. WER’s carefully tuned bucketing and aggregation help developers pinpoint errors and prioritize their work, so that the things that will affect the most people get the most attention. The authors’ discussion of how this works, and of the insights that such big data can provide, is a great example of how innovative practices can open up new areas of research.
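The core idea of bucketing can be sketched in a few lines. This is only an illustration under simplifying assumptions: the `ErrorReport` fields and the bucket key below are hypothetical, and the real WER classification relies on many more heuristics than a single tuple.

```python
from collections import Counter
from typing import Iterable, NamedTuple


class ErrorReport(NamedTuple):
    # Hypothetical fields chosen for illustration only.
    application: str
    module: str
    offset: int


def bucket_key(report: ErrorReport) -> tuple:
    """Group reports that fail at the same place in the same program."""
    return (report.application, report.module, report.offset)


def rank_buckets(reports: Iterable[ErrorReport]) -> list:
    """Return (bucket, count) pairs, most frequently hit bucket first."""
    return Counter(bucket_key(r) for r in reports).most_common()


reports = [
    ErrorReport("editor.exe", "render.dll", 0x1A2B),
    ErrorReport("editor.exe", "render.dll", 0x1A2B),
    ErrorReport("editor.exe", "spell.dll", 0x0040),
    ErrorReport("player.exe", "codec.dll", 0x0F00),
    ErrorReport("editor.exe", "render.dll", 0x1A2B),
]

ranked = rank_buckets(reports)
top_bucket, top_count = ranked[0]
```

Sorting buckets by hit count is what lets a team spend its effort on the failures that affect the most people.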

Peter C. Rigby and Margaret-Anne Storey, “Understanding Broadcast Based Peer Review on Open Source Projects”. ICSE 2011.

Software peer review has proven to be a successful technique in open source software (OSS) development. In contrast to industry, where reviews are typically assigned to specific individuals, changes are broadcast to hundreds of potentially interested stakeholders. Despite concerns that reviews may be ignored, or that discussions will deadlock because too many uninformed stakeholders are involved, we find that this approach works well in practice. In this paper, we describe an empirical study to investigate the mechanisms and behaviours that developers use to find code changes they are competent to review. We also explore how stakeholders interact with one another during the review process. We manually examine hundreds of reviews across five high profile OSS projects. Our findings provide insights into the simple, community-wide techniques that developers use to effectively manage large quantities of reviews. The themes that emerge from our study are enriched and validated by interviewing long-serving core developers.

Rigby and Storey’s report won’t surprise people steeped in the open source development culture, but everyone else may find it instructive. If you are used to assigning code reviews to some of your peers, having other reviews assigned to you, and working only on the items in your queue, the broadcast method of many open source projects (that is, broadcast your patch to the mailing list and hope it will be picked up, improved, and accepted by others) may seem entirely dysfunctional. Still, for the most part it works very well, and the authors explain how and why. If you don’t have time for the full paper, the last section provides a good summary of its findings.

(Full disclosure: I’m currently affiliated with Dr. Storey’s lab.)

Jan Chong and Tom Hurlbutt, “The Social Dynamics of Pair Programming”. ICSE 2007.

This paper presents data from a four month ethnographic study of professional pair programmers from two software development teams. Contrary to the current conception of pair programmers, the pairs in this study did not hew to the separate roles of “driver” and “navigator”. Instead, the observed programmers moved together through different phases of the task, considering and discussing issues at the same strategic “range” or level of abstraction and in largely the same role. This form of interaction was reinforced by frequent switches in keyboard control during pairing and the use of dual keyboards. The distribution of expertise among the members of a pair had a strong influence on the tenor of pair programming interaction. Keyboard control had a consistent secondary effect on decision-making within the pair. These findings have implications for software development managers and practitioners as well as for the design of software development tools.

The myth of the driver/navigator split in pair programming is very pervasive: I’ve found it in almost all descriptions of pair programming I’ve seen, and in the language that programming pairs use to refer to the work that they do. However, Chong and Hurlbutt report that, effectively, programming pairs perform the same mixed role throughout their collaboration. They also have some interesting observations on the effect of keyboard control and expertise differentials within the pair.

Their suggested implications for pair programming (discussed in Section 6):

  • Move beyond the “driver” and the “navigator”. This artificial role distinction only muddles the learning process of a new practice.
  • Help programmers stay focused and engaged. The authors suggest this can be achieved with good hardware support (hardware that supports, for instance, fast keyboard switching between pairs).
  • Consider differentials in programmer knowledge. Too great a difference is not productive.
  • Avoid pair rotation late in a task. Re-pair based on task completion, not on day cycles.

The last two suggestions seem commonplace in the firms that I’ve observed that do some pair programming, but I’ve found that the first two still need wider dissemination. What is your experience?

Smart Bear Software is hosting two online panel discussions about The Architecture of Open Source Applications, at 1:00 pm EST on Wednesday, July 13, and again at the same time (with different panelists) a week later. You can sign up on their site; we look forward to seeing/hearing from lots of you.

Khaled El Emam, Saida Benlarbi, Nishith Goel, and Shesh N. Rai: “The Confounding Effect of Class Size on the Validity of Object-Oriented Metrics”. IEEE Transactions on Software Engineering, 27(7), July 2001.

Much effort has been devoted to the development and empirical validation of object-oriented metrics. The empirical validations performed thus far would suggest that a core set of validated metrics is close to being identified. However, none of these studies allow for the potentially confounding effect of class size. In this paper, we demonstrate a strong size confounding effect and question the results of previous object-oriented metrics validation studies. We first investigated whether there is a confounding effect of class size in validation studies of object-oriented metrics and show that, based on previous work, there is reason to believe that such an effect exists. We then describe a detailed empirical methodology for identifying those effects. Finally, we perform a study on a large C++ telecommunications framework to examine if size is really a confounder. This study considered the Chidamber and Kemerer metrics and a subset of the Lorenz and Kidd metrics. The dependent variable was the incidence of a fault attributable to a field failure (fault-proneness of a class). Our findings indicate that, before controlling for size, the results are very similar to previous studies: the metrics that are expected to be validated are indeed associated with fault-proneness. After controlling for size, none of the metrics we studied were associated with fault-proneness any more. This demonstrates a strong size confounding effect and casts doubt on the results of previous object-oriented metrics validation studies. It is recommended that previous validation studies be reexamined to determine whether their conclusions would still hold after controlling for size and that future validation studies should always control for size.

We all know that some programs are more complex than others, but can we actually quantify that? Ever since the early 1970s, researchers have invented metrics (such as cyclomatic complexity or coupling and cohesion), then validated them by seeing how well they correlate with things like post-release bug counts. The idea is that if what we mean by “complex” is “hard to understand”, complex software should have more bugs than simple software, and a measure that can predict the likely number of bugs in a product before it’s released would be a very useful thing.

El Emam and his colleagues repeated some of those experiments using bivariate analysis so that they could allocate a share of the blame to code size (measured by number of lines) and the metric in question. It turned out that code size accounted for all of the significant variation: in other words, the object-oriented metrics they looked at didn’t have any actual predictive power once they normalized for the number of lines of code. Herraiz and Hassan’s chapter in Making Software, which reports on an even larger study using open source software, reached the same conclusion:

…for non-header files written in C language, all the complexity metrics are highly correlated with lines of code, and therefore the more complex metrics provide no further information that could not be measured simply with lines of code… In our opinion, there is a clear lesson from this study: syntactic complexity metrics cannot capture the whole picture of software complexity. Complexity metrics that are exclusively based on the structure of the program or the properties of the text…do not provide information on the amount of effort that is needed to comprehend a piece of code—or, at least, no more information than lines of code do.

This emphatically doesn’t mean that trying to measure software is a waste of time: Weyuker and Ostrand’s chapter in that same book shows that it is possible to predict which files are likely to contain the most bugs. What it does mean, though, is that figuring out whether some new measure actually tells us something we didn’t already know is harder than it seems.
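A small numerical sketch shows what a size confounder looks like. The data below is synthetic and deliberately constructed so that both the metric and the fault count depend on size plus unrelated noise; it is not taken from either study. Partial correlation is used here as a simple stand-in for the regression-based analysis in the paper.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def partial_correlation(xs, ys, zs):
    """Correlation of xs and ys after removing the linear effect of zs."""
    r_xy = pearson(xs, ys)
    r_xz = pearson(xs, zs)
    r_yz = pearson(ys, zs)
    return (r_xy - r_xz * r_yz) / sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))


# Synthetic classes: lines of code, a made-up complexity metric, fault counts.
# Metric and faults each track size, with noise that is unrelated to the other.
size = [100, 200, 300, 400, 500, 600, 700, 800]
metric = [12, 18, 28, 42, 52, 58, 68, 82]
faults = [3, 3, 7, 7, 9, 13, 13, 17]

raw = pearson(metric, faults)
controlled = partial_correlation(metric, faults, size)
print(f"metric vs faults, ignoring size:    {raw:.3f}")
print(f"metric vs faults, controlling size: {controlled:.3f}")
```

On this constructed data the raw correlation is strong, yet it disappears once size is held fixed, which is the pattern El Emam and his colleagues report for the metrics they examined.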

Zornitza Racheva, Maya Daneva, Andrea Herrmann, Klaus Sikkel, and Roel Wieringa, “Do We Know Enough About Requirements Prioritization in Agile Projects: Insights from a Case Study”. RE 2010.

Requirements prioritization is an essential mechanism of agile software development approaches. It maximizes the value delivered to the clients and accommodates changing requirements. This paper presents results of an exploratory cross-case study on agile prioritization and business value delivery processes in eight software organizations. We found that some explicit and fundamental assumptions of agile requirement prioritization approaches, as described in the agile literature on best practices, do not hold in all agile project contexts in our study. These are (i) the driving role of the client in the value creation process, (ii) the prevailing position of business value as a main prioritization criterion, (iii) the role of the prioritization process for project goal achievement. This implies that these assumptions have to be reframed and that the approaches to requirements prioritization for value creation need to be extended.

Mention the phrase “requirements engineering” to many software developers and you’ll get a groan. For a long time, requirements engineering (elicitation, analysis, modeling, etc.) has been seen as something you do to satisfy the paper-pushers. There’s even a derogatory acronym: BRUF, Big Requirements Up Front. Nevertheless, the fact remains that we typically build software to satisfy a user, even if the user is ourselves. Doing so requires us to think about what we should build, and, hopefully, why we are building it. While requirements may take different forms (user stories, tasks, use cases), they remain fundamental to the process of building software.

This paper points out that the problem of prioritizing requirements is even more of a concern in agile methodologies than in other approaches. This is because short cycle times require frequent prioritization of the backlog. To do so, both XP and Scrum, for instance, call for an involved customer. However, this study found that this was rarely possible, and that as a result prioritization was done by the developers. Customers either found planning meetings too technical, or were not aware of their own requirements. The authors also found a difference in the understanding of the “value” of a requirement: there is a distinct difference between the value for the customer and the value for the developer. For example, developers might prefer to re-use solutions from other projects. Racheva et al. did find, though, that the notion of frequent, short iterations with re-prioritization was highly useful, in particular for dealing with new information and unclear requirements.

Mike Barnett, Manuel Fähndrich, K. Rustan M. Leino, Peter Müller, Wolfram Schulte, and Herman Venter: “Specification and Verification: The Spec# Experience”. ICSE 2011.

Spec# is a programming system that facilitates the development of correct software. The Spec# language extends C# with contracts that allow programmers to express their design intent in the code. The Spec# tool suite consists of a compiler that emits run-time checks for contracts, a static program verifier that attempts to mathematically prove the correctness of programs, and an integration into the Visual Studio development environment. Spec# shows how contracts and verifiers can be integrated seamlessly into the software development process. This paper reflects on the six-year history of the Spec# project, scientific contributions it has made, remaining challenges for tools that seek to establish program correctness, and prospects of incorporating verification into everyday software engineering.

A plethora of testing tools and static analyzers “suddenly” became mainstream a decade ago after years of hard work and experimentation by their creators. Today, program verification tools are poised to become part of every serious developer’s toolbox in the same way, not least because of the challenges of concurrent programming. This experience report describes what a mature verification tool can do, and what its creators learned while building it and trying to persuade people to adopt it. Even if you’re not using C#, it offers a lot of insight into things to come.
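For readers who have not seen contracts before, the following sketch shows the run-time half of the idea in Python. It is not Spec# and not the Spec# API: the `contract` decorator and its `requires` and `ensures` arguments are hypothetical helpers written for this example, and nothing here attempts the static verification that the paper describes.

```python
import functools


def contract(requires=None, ensures=None):
    """Attach a precondition and a postcondition, both checked at run time."""

    def decorate(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            # The precondition sees the same arguments as the function.
            if requires is not None and not requires(*args, **kwargs):
                raise AssertionError(f"precondition of {func.__name__} violated")
            result = func(*args, **kwargs)
            # The postcondition sees the result followed by the arguments.
            if ensures is not None and not ensures(result, *args, **kwargs):
                raise AssertionError(f"postcondition of {func.__name__} violated")
            return result

        return wrapper

    return decorate


@contract(
    requires=lambda xs: len(xs) > 0,
    ensures=lambda result, xs: result in xs and all(result >= x for x in xs),
)
def largest(xs):
    """Return the largest element of a non-empty list."""
    best = xs[0]
    for x in xs[1:]:
        if x > best:
            best = x
    return best
```

Calling `largest([])` fails the precondition immediately instead of raising an unrelated error deeper in the function. A static verifier aims to rule out such calls before the program ever runs.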

Mauro Cherubini, Gina Venolia, Rob DeLine, and Andrew J. Ko: “Let’s Go to the Whiteboard: How and Why Software Developers Use Drawings”. CHI 2007.

Software developers are rooted in the written form of their code, yet they often draw diagrams representing their code. Unfortunately, we still know little about how and why they create these diagrams, and so there is little research to inform the design of visual tools to support developers’ work. This paper presents findings from semi-structured interviews that have been validated with a structured survey. Results show that most of the diagrams had a transient nature because of the high cost of changing whiteboard sketches to electronic renderings. Diagrams that documented design decisions were often externalized in these temporary drawings and then subsequently lost. Current visualization tools and the software development practices that we observed do not solve these issues, but these results suggest several directions for future research.

A lot of people have pointed out that formal diagrammatic notations for software, like UML, are taught much more often than they’re used. This paper goes a long way toward explaining why: in almost all cases, developers use diagrams as a way of keeping track of bits of conversation while talking to each other, rather than as archival documentation for the benefit of people who weren’t there at the time. The cost of turning the first into the second is so great that almost no one ever does it voluntarily. As well as explaining one aspect of real-world software development, this paper is a great example of how qualitative methods can produce answers that quantitative investigation never could.