It will never work in theory

Software development research that is relevant in practice

Raj Chetty and John Friedman (Harvard), and Jonah Rockoff (Columbia) recently published a study showing how much long-term impact teachers have on students. To make a long story short, the answer is “a lot”, and that impact persists long after the child leaves the classroom. As far as I know, no-one has ever done something similar for programmers, i.e., looked at the long-term impact a particular software developer has on a project (for either good or ill). I think the hardest part would be developing a measure of one person’s impact on software; like all metrics, what you’d get out could all too easily be determined primarily by what assumptions you baked in.

But this example raises a broader question that we’d like to throw out to the whole community. What studies have you seen in other areas that you’d like to see replicated in software development? For example, Evan Robinson’s classic article “Why Crunch Mode Doesn’t Work” does a great job of summarizing research on the effects of sleep deprivation on productivity. None of those studies specifically looked at programmers; I suspect that one that did would be read and cited a lot. What other analyses would you like empirical software engineering researchers to transfer to our domain?

Nachiappan Nagappan, E. Michael Maximilien, Thirumalesh Bhat, and Laurie Williams. Realizing quality improvement through test driven development: results and experiences of four industrial teams. ESE 2008.

Test-driven development (TDD) is a software development practice that has been used sporadically for decades. With this practice, a software engineer cycles minute-by-minute between writing failing unit tests and writing implementation code to pass those tests. Test-driven development has recently re-emerged as a critical enabling practice of agile software development methodologies. However, little empirical evidence supports or refutes the utility of this practice in an industrial context. Case studies were conducted with three development teams at Microsoft and one at IBM that have adopted TDD. The results of the case studies indicate that the pre-release defect density of the four products decreased between 40% and 90% relative to similar projects that did not use the TDD practice. Subjectively, the teams experienced a 15–35% increase in initial development time after adopting TDD.

In the test-driven development (TDD) chapter of Making Software, Turhan & Co. reported that the evidence for it is mixed: there is moderate support for the claim that it improves quality, and it is not quite clear if this entails a productivity cost. To come up with this conclusion, the authors went through all the papers they could find on the topic—some of them reporting experiments with students, some with practitioners, and with highly varying quality.

In my opinion, one of the stronger papers in their sample was this one, by Nagappan & Co. It reports on four teams, one at IBM and three at Microsoft, and it contrasts TDD vs. comparable non-TDD teams post-hoc (so the study did not bias data collection). As the abstract points out, there were far fewer defects in all four products, though managers at all teams reported an increase in development time.

The conservatism in the Making Software chapter is warranted: there is still conflicting empirical evidence with TDD, as with most other practices in software development. But studies like Nagappan & Co.’s show that TDD is likely to be beneficial. Just note that (at least in the Microsoft teams they studied) “there was no enforcement and monitoring of the TDD practice; and decisions to use TDD were made as a group.” In other words, developers applied TDD because they wanted to, not because of a decree from their manager. Whether it would’ve been as effective otherwise is an open question.
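For readers who have not seen the practice, the cycle the abstract describes (write a failing unit test, then write just enough code to pass it) can be sketched as follows. This is a minimal illustration only: the `slugify` function and its tests are hypothetical and are not taken from the paper or from any of the teams studied.

```python
import unittest


def slugify(title):
    """Implementation written only after the tests below existed and failed."""
    # Lowercase the words and join them with hyphens; split() with no
    # arguments also discards leading, trailing, and repeated spaces.
    return "-".join(title.lower().split())


class SlugifyTest(unittest.TestCase):
    # Red step: in the TDD cycle these tests are written first and fail,
    # because slugify() does not exist yet.
    def test_lowercases_and_hyphenates(self):
        self.assertEqual(slugify("Hello World"), "hello-world")

    # Green step: slugify() above is the smallest code that makes both pass.
    def test_collapses_repeated_spaces(self):
        self.assertEqual(slugify("  Hello   World "), "hello-world")
```

Saved to a file, the tests can be run with `python -m unittest` followed by the file name. The next cycle would start by adding another failing test, for example one covering punctuation.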

 

A survey of the practice of computational science. In International Conference for High Performance Computing, Networking, Storage and Analysis, pages 19:1–19:12, 2011. (doi:10.1145/2063348.2063374)

Computing plays an indispensable role in scientific research. Presently, researchers in science have different problems, needs, and beliefs about computation than professional programmers. In order to accelerate the progress of science, computer scientists must understand these problems, needs, and beliefs. To this end, this paper presents a survey of scientists from diverse disciplines, practicing computational science at a doctoral-granting university with very high research activity. The survey covers many things, among them, prevalent programming practices within this scientific community, the importance of computational power in different fields, use of tools to enhance performance and software productivity, computational resources leveraged, and prevalence of parallel computation. The results reveal several patterns that suggest interesting avenues to bridge the gap between scientific researchers and programming tools developers.

Several studies of scientific programmers and scientific programming have come out in the past few years [1]. This in-depth analysis, which is based on semi-structured interviews with 114 researchers in science and engineering at Princeton University, is probably the most insightful to date. It explores the languages and tools researchers use, their debugging techniques, the environments they use, their performance tuning strategies, their use of parallelism [2], and many other aspects of their work. While some of its conclusions are unsurprising (e.g., the fact that scientists don’t test their programs rigorously), others highlight fruitful directions for future research—most particularly, the need to integrate performance analysis and tuning tools into everyday programming. More studies like this in other areas would be very welcome.

[1] Disclosure: one of us (GW) co-authored one of these studies, a web-based survey of over 1900 scientists conducted in 2008-09.

[2] Not surprisingly, job-level parallelism (i.e., running a sequential program many times with slightly different parameters) is by far the most common.
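To make footnote [2] concrete, here is a rough sketch of job-level parallelism: the same sequential program is launched several times, once per parameter value, and the runs never communicate. The one-line program, `run_job`, and `sweep` are all hypothetical stand-ins; the survey does not describe any particular script, and on a real cluster a batch scheduler would typically play the role of the thread pool.

```python
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-in for a sequential scientific program: it reads one
# parameter from the command line and prints a single result.
SEQUENTIAL_PROGRAM = "import sys; x = float(sys.argv[1]); print(x * x)"


def run_job(param):
    """Run one independent copy of the sequential program for one parameter."""
    completed = subprocess.run(
        [sys.executable, "-c", SEQUENTIAL_PROGRAM, str(param)],
        capture_output=True,
        text=True,
        check=True,  # raise if the child process exits with an error
    )
    return float(completed.stdout)


def sweep(params, max_workers=4):
    """Launch one job per parameter; results come back in input order."""
    # The threads only wait on child processes; the work runs in the children.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(run_job, params))
```

Calling `sweep([0.5, 1.0, 1.5, 2.0])` returns one result per parameter. Because the program itself stays sequential, this style needs no changes to the scientific code, which may be part of why it is so common.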

Daryl Posnett, Abram Hindle, and Prem Devanbu. Got Issues? Do New Features and Code Improvements Affect Defects? WCRE 2011.

There is a perception that when new features are added to a system that those added and modified parts of the source-code are more fault prone. Many have argued that new code and new features are defect prone due to immaturity, lack of testing, as well as unstable requirements. Unfortunately most previous work does not investigate the link between a concrete requirement or new feature and the defects it causes, in particular the feature, the changed code and the subsequent defects are rarely investigated. In this paper we investigate the relationship between improvements, new features and defects recorded within an issue tracker. A manual case study is performed to validate the accuracy of these issue types. We combine defect issues and new feature issues with the code from version-control systems that introduces these features; we then explore the relationship of new features with the fault-proneness of their implementations. We describe properties and produce models of the relationship between new features and fault proneness, based on the analysis of issue trackers and version-control systems. We find, surprisingly, that neither improvements nor new features have any significant effect on later defect counts, when controlling for size and total number of changes.

One piece of common wisdom in the software industry is that new code tends to be buggier than old code, because it is immature and more poorly tested. But in this short paper, Posnett, Hindle, and Devanbu present an interesting twist on this. In the open source projects they studied, they found that although code changes in general are associated with future defect fixing activity, as we might expect, those changes that correspond to new feature development and to code improvements are not. That’s interesting and counter-intuitive—one would expect new feature code commits to be among the buggiest. The authors offer a possible explanation: well-established open source projects tend to be quite conservative, and new feature code is heavily scrutinized, so that most defects are found and sorted out before the code is integrated. Which means that projects that are not so careful might experience much more new feature pain.

 

Allen C. Bluedorn, Daniel B. Turban, and Mary Sue Love. The Effects of Stand-Up and Sit-Down Meeting Formats on Meeting Outcomes. Journal of Applied Psychology 84(2), 1999.

The effects of meeting format (standing or sitting) on meeting length and the quality of group decision making were investigated by comparing meeting outcomes for 56 five-member groups that conducted meetings in a standing format with 55 five-member groups that conducted meetings in a seated format. Sit-down meetings were 34% longer than stand-up meetings, but they produced no better decisions than stand-up meetings. Significant differences were also obtained for satisfaction with the meeting and task information use during the meeting but not for synergy or commitment to the group’s decision. The findings were generally congruent with meeting-management recommendations in the time-management literature, although the lack of a significant difference for decision quality was contrary to theoretical expectations. This contrary finding may have been due to differences between the temporal context in which this study was conducted and those in which other time constraint research has been conducted, thereby revealing a potentially important contingency—temporal context.

If there’s one practice that caught on with every software team that calls itself Agile, it’s got to be daily stand-up meetings. If you hold your meetings standing up, the argument goes, they will go briskly, which is great because nobody likes meetings that drag on and on, especially if you hold them daily. This paper provides valuable evidence with respect to the efficacy of stand-up meetings: they are significantly shorter than sit-down meetings, and the decisions taken in them are just as good. Their only downside in the experiment is that participants were less satisfied with the meeting than those in sit-down meetings.

These were all 5-person meetings lasting 10-20 minutes and concerning a well-defined problem. The authors warn: “…additional research is needed to determine whether the stand-up meeting can be used for longer meetings dealing with problems that vary in their structure.”

(Thanks to Laurent Bossavit for pointing me to this paper. If you know of interesting papers that are relevant for software practitioners, even—or especially—if they’re from other disciplines, please send them our way! Also, note that we try to post links to freely downloadable versions of the papers we discuss. Sometimes, as in this case, we found none—but e-mailing the authors and asking nicely usually gets you a copy.)

Laurie McLeod and Stephen G. MacDonell. Factors that Affect Software Systems Development Project Outcomes: A Survey of Research. ACM Computing Surveys, 2011.

Determining the factors that have an influence on software systems development and deployment project outcomes has been the focus of extensive and ongoing research for more than 30 years. We provide here a survey of the research literature that has addressed this topic in the period 1996-2006, with a particular focus on empirical analyses. On the basis of this survey we present a new classification framework that represents an abstracted and synthesized view of the types of factors that have been asserted as influencing project outcomes.

Reading this literature review was a strange experience. Despite its 56-page length, and the fact that it was published only a couple of months ago, it manages to miss most of the interesting research in software development of recent years. There seem to be two reasons for this. First, the paper focuses almost entirely on research coming from the Information Systems community, which for reasons I’ve never understood is fairly disconnected from the Software Engineering research community (such as the TSE and ESE journals and the ICSE and FSE conferences). Second, the paper only considers research published between 1996 and 2006. It took me a while to realize this, but most of the exciting developments in our field (such as the link between organizational and code structure, the exploitation of data mining techniques to predict defects, and the rich and detailed qualitative evaluations of Agile practices) have only flourished in the last five years or so, and therefore would be out of scope for this survey.

In any case, McLeod and MacDonell’s survey provides a long list of factors that have been found to affect software projects, along with citations for each of them, and in that sense it is a useful gateway to research on these topics. Just be aware as you read it that, despite its recent publication date, it is fairly dated already.

PS: The paper is still only available behind a paywall, but it may eventually be posted on the authors’ lab site.

WebFWD recently posted a video presentation by UC Berkeley’s Prof. Homa Bahrami and her student Claire Rudolph, who studied how Mozilla builds software. It’s full of useful insights about how a distributed mix of volunteers and paid professionals builds world-class software without drowning in information, and is a great example of research in progress. We’d welcome pointers to more presentations of this kind.

Mordechai Ben-Ari, Roman Bednarik, Ronit Ben-Bassat Levy, Gil Ebel, Andrés Moreno, Niko Myller, and Erkki Sutinen: “A decade of research and development on program animation: The Jeliot experience”. Journal of Visual Languages & Computing, 22(5), 2011.

Jeliot is a program animation system for teaching and learning elementary programming that has been developed over the past decade, building on the Eliot animation system developed several years before. Extensive pedagogical research has been done on various aspects of the use of Jeliot including improvements in learning, effects on attention, and acceptance by teachers. This paper surveys this research and development, and summarizes the experience and the lessons learned.

Like our two previous papers, this one is about software engineering education rather than software engineering per se, but (a) we’re unlikely to improve the latter until we start getting the former right, and (b) education research has always had a strongly empirical flavor, which people studying “grown up” programmers could learn a lot from. What makes this paper interesting for me is that it describes how a specific research program has evolved over more than ten years. Ideas are turned into tools; how people use those tools, and what impact they have, are studied in situ; those studies produce new insights, which are turned into a new generation of tools, and the cycle repeats. Along the way, the researchers evolve as well: they learn how to ask more penetrating questions, and (hopefully) how to iterate more rapidly. Jonathan Weiner’s book Time, Love, Memory does a great job of describing this process at greater length in genetics; young researchers (and those of us who are not so young) can learn a lot about our craft from reading both.

So what does this paper actually cover? It opens with an eight-paragraph summary of program visualization (tools and methods to draw pictures of the states of programs as they execute), followed by a brief discussion of the difference between program animation and algorithm animation. Section 3 then summarizes the evolution of their software testbed, while Section 4 shows readers what it looks like now. Sections 5-10 are the meat of the paper: what users learn, and what effect program visualization has on attention (both in the classroom as a whole and at the individual level), on teachers, and on collaboration. Section 11 is an in-depth summary of lessons learned. In a way, it’s the whole point of the paper, and everything that comes before it is scene-setting. I wish there were more summaries and retrospectives like this, since every shared insight can save other designers or researchers months of wasted effort going down blind alleys.

Here are a couple of videos (the first about 8 minutes long, the second over an hour) discussing empirical studies in software engineering, and why they matter.

Christopher Hundhausen, Pawan Agarwal, and Michael Trevisan: “Online vs. Face-to-Face Pedagogical Code Reviews: An Empirical Comparison.” SIGCSE 2011.

Given the increased importance of communication, teamwork, and critical thinking skills in the computing profession, we have been exploring studio-based instructional methods, in which students develop solutions and iteratively refine them through critical review by their peers and instructor. We have developed an adaptation of studio-based instruction for computing education called the pedagogical code review (PCR), which is modeled after the code inspection process used in the software industry. Unfortunately, PCRs are time-intensive, making them difficult to implement within a typical computing course. To address this issue, we have developed an online environment that allows PCRs to take place asynchronously outside of class. We conducted an empirical study that compared a CS 1 course with online PCRs against a CS 1 course with face-to-face PCRs. Our study had three key results: (a) in the course with face-to-face PCRs, student attitudes with respect to self-efficacy and peer learning were significantly higher; (b) in the course with face-to-face PCRs, students identified more substantive issues in their reviews; and (c) in the course with face-to-face PCRs, students were generally more positive about the value of PCRs. In light of our findings, we recommend specific ways online PCRs can be better designed.

Like our previous selection, this paper comes from software engineering education rather than software engineering per se, but has a lot to say about the latter. Code review is now a regular part of most open source projects, thanks in part to online code review tools like ReviewBoard. Here, the authors compare those kinds of reviews with face-to-face reviews, and find that the latter are more effective in several ways: people enjoy them more, they find more issues, and they are more likely to come away believing that reviews are worth doing. It would be fascinating to replicate this study with both junior programmers joining established teams, and developers with more experience who are undertaking reviews systematically for the first time.