It Will Never Work in Theory

Software development research that is relevant in practice

Our previous post, “Empirical Evidence for the Value of Version Control”, generated a lot of comments. Many sought to explain why version control is helpful, but that’s not what we were looking for: we were looking for empirical evidence that it is. To see why we need it, take a look at this response from Jordi Cabot [1]. In it, he says:

Quite regularly, I get questions about what empirical evidence supports my “belief” that models are good… Until now, I used to point to the (true, few) scientific empirical studies on the effectiveness of software modeling…but now I have an even better answer to give you: “Empirical Evidence of the Value of Version Control”.
No, I haven’t lost my mind. The point of this link is to show you that there’s no proof that version control is better for software development, and yet, I don’t think any of you would argue against it.
Same for modeling and model-driven engineering. It would be great to have more proof but the absence of proof alone should not be used against it unless you want to start also abandoning other unproven things like version control.

He’s right: if we’re willing to accept that version control is valuable, without proof, then we can hardly require advocates of modeling to prove their case. Or advocates of functional programming, or literate programming, or Hungarian notation. Heck, if we don’t require proof for our claims, then we’re honor-bound to accept that Perl is “intuitive” because its grammar has as many special cases and contradictions as the grammars of natural languages, aren’t we? Or that learning Befunge makes you a better programmer (seriously, I’ve heard that claim too).

At some point, the statement, “If we don’t need to prove the value of version control, we don’t need to prove the value of X” becomes absurd. However, everyone’s threshold of absurdity is different. I personally don’t think that modeling adds value for most developers in most situations—I think that if it did, or if its benefits really were as significant as its advocates claim, more developers would have adopted it by now—but I don’t know. What I do know is, if we can’t demonstrate the value of something that most of us believe in, like version control, what chance do we have of telling whether other practices, like modeling and test-driven development, are worth adopting (or rather, when they’re worth adopting and by whom, since I doubt there’s a one-size-fits-all answer)?

So here are my requests:

  1. Tell us what kind of study would convince you that using Befunge didn’t make programmers more productive.
  2. Then tell us what kind of study would convince you that version control didn’t either.

If your answer to the second question is, “Nothing ever could,” then version control is an article of faith for you, and there’s no point arguing further [2]. If your answer to the second is different from your answer to the first, please tell us why.

[1] Full disclosure: Jordi and I co-authored a study of web-based software project portals. And either way, we hope you have a happy and productive 2013.

[2] This request is inspired by Karl Popper’s notion of falsifiability: a claim is only scientific if there is some way to prove it wrong.

We received this by email:

I use version control for my software, and I encourage others to do so, but I have no experimental evidence on which to base that decision. I pulled out my old copy of Code Complete (it’s a first edition), and the only reference it makes is to “Moore 1992”, which is a private communication that says that Microsoft considers their internal use of version control to be a competitive advantage.

The common practices I know of are:

  1. no version control
  2. every once in a while make a backup, either as a tar/zip file or copy everything into a new directory
  3. use filesystem versioning, like what was on a VAX, or Time Machine on a Mac, or Dropbox for a distributed multi-version file system
  4. use a version control system; though this in turn can vary from SCCS and RCS to Fossil and Veracity

In addition, there’s a difference between the needs of a single developer vs. a small team, vs. a large, distributed team.

Is there published experimental evidence showing that a version control system is more useful than, say, developing using Dropbox? I tried looking for the relevant papers but I don’t know how to search that field and I couldn’t find anything.

It’s a good question—does anyone have an answer?

Jorge Aranda and I submitted a short opinion piece to Communications of the ACM in February 2012 that discussed some of the reasons people in industry and academia don’t talk to each other as much as they should. Ten months later, it has ironically turned into an illustration of one of the reasons: it was six months before we received any feedback at all, and we’ve now waited four months for any further word. In that time, Jorge has left academia and I’ve taken a job with Mozilla, so we have decided to withdraw the manuscript and publish it on my personal blog. We hope you find it interesting, and we would welcome comments.

Many people have noted the wide gulf between the people who study software development and the people who do it. One person trying to close that gap is Michael Feathers, who is running a one-day workshop in London on Wednesday, January 16 titled “Developing Project Guidance Through Code History Mining”. Feathers is the author of the landmark book Working Effectively With Legacy Code, and is actively seeking to build ties with people who have similar interests.

Our recommendation: two thumbs up.

David Ameller, Claudia Ayala, Jordi Cabot, and Xavier Franch, How do Software Architects Consider Non-functional Requirements: An Exploratory Study, RE 2012, Chicago.

Dealing with non-functional requirements (NFRs) has posed a challenge onto software engineers for many years. Over the years, many methods and techniques have been proposed to improve their elicitation, documentation, and validation. Knowing more about the state of the practice on these topics may benefit both practitioners’ and researchers’ daily work. A few empirical studies have been conducted in the past, but none under the perspective of software architects, in spite of the great influence that NFRs have on daily architects’ practices. This paper presents some of the findings of an empirical study based on 13 interviews with software architects. It addresses questions such as: who decides the NFRs, what types of NFRs matter to architects, how are NFRs documented, and how are NFRs validated. The results are contextualized with existing previous work.

In this work, Ameller et al. consider the contention that NFRs ought to be driving concerns for software architects. They conducted a study with Spanish software architects in a variety of domains to understand how they thought of NFRs. Their first finding was that no one held a formal “architect” role, although that was what their work entailed. The job position was based on skills and knowledge rather than training. Their second finding was that NFRs were not of primary importance, which contradicts other research findings. Instead, they found it was more important to consider project-wide constraints like licensing and overall cost. This suggests some interesting directions for new research on the role architecture plays in the software development process.

A related blog post with more detail can be found here.

Stefan Hanenberg. “An experiment about static and dynamic type systems: doubts about the positive impact of static type systems on development time”. OOPSLA 2010.

Although static type systems are an essential part in teaching and research in software engineering and computer science, there is hardly any knowledge about what the impact of static type systems on the development time or the resulting quality for a piece of software is. On the one hand there are authors that state that static type systems decrease an application’s complexity and hence its development time (which means that the quality must be improved since developers have more time left in their projects). On the other hand there are authors that argue that static type systems increase development time (and hence decrease the code quality) since they restrict developers to express themselves in a desired way. This paper presents an empirical study with 49 subjects that studies the impact of a static type system for the development of a parser over 27 hours working time. In the experiments the existence of the static type system has neither a positive nor a negative impact on an application’s development time (under the conditions of the experiment).

How many experiments in software engineering research are you aware of where the researcher developed a new programming language and corresponding IDE just for the experiment? Well, Stefan Hanenberg did exactly that, and the results are remarkable. The goal of his experiment was to measure the impact of static vs. dynamic type systems on development time and software quality. While there is a lot of conventional wisdom around the use of static or dynamic type systems (e.g., static type systems capture many recurring programming errors and make systems easier to maintain, dynamic type systems make life easier by not posing unnecessary restrictions), there is hardly any hard evidence to support these claims, and for a practitioner, it is unclear which arguments can be trusted.

Unlike what has been done in previous work, Hanenberg decided not to use existing programming languages and IDEs in his experiment because he worried that subjects’ familiarity with the tooling would influence the results, in particular if his subjects knew only the dynamic or only the static version used in the study. Therefore, he developed a new object-oriented programming language “Purity” (with some similarities to Smalltalk, Ruby and Java) and a corresponding IDE (class browser, test browser and console). Actually, he developed two versions of Purity: one with static types, the other one with a dynamic type system. The two versions were identical in all other aspects.

His experimental setup followed a between-subject design (i.e., each subject was only used once). He recruited 49 students, divided them into two groups, and taught each group one of the Purity versions (the dynamic type version was taught for 16 hours and the static type version was taught for 18 hours). All subjects were then given exactly 27 hours to implement a scanner and a parser for a given grammar. Hanenberg measured two outcomes: development time and quality. Development time was measured based on log entries and test cases in order to determine the exact point in time when subjects fulfilled all the test cases for a minimal scanner, and the quality of the parser was measured through 400 test cases that represented valid and invalid words in the grammar.

The main result from Hanenberg’s study is that — under the conditions of his experiment — the existence of a static type system did not have a positive impact on development time or quality. In fact, the subjects who used the dynamic type version of Purity were significantly faster in developing a scanner, but there was no statistically significant difference with respect to quality of the final product.
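To make “significantly faster” concrete, here is a minimal sketch of one way to compare two independent groups from a between-subject design: an exact permutation test on the difference in mean development time. The times are invented for illustration, the function name is hypothetical, and this is not necessarily the test Hanenberg used.

```python
from itertools import combinations


def permutation_p_value(group_a, group_b):
    """Two-sided exact permutation test on the difference in means."""
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    n_b = len(pooled) - n_a
    total = sum(pooled)
    observed = abs(sum(group_a) / n_a - sum(group_b) / n_b)

    extreme = 0
    splits = 0
    # Try every way of relabelling the pooled subjects into two groups.
    for chosen in combinations(range(len(pooled)), n_a):
        sum_a = sum(pooled[i] for i in chosen)
        difference = abs(sum_a / n_a - (total - sum_a) / n_b)
        # Small tolerance guards against floating-point rounding.
        if difference >= observed - 1e-12:
            extreme += 1
        splits += 1
    return extreme / splits


# Invented development times in hours, NOT data from the study.
dynamic_hours = [10, 11, 12, 13]
static_hours = [14, 15, 16, 17]
p_value = permutation_p_value(dynamic_hours, static_hours)
print(f"p = {p_value:.4f}")
```

With these made-up numbers only 2 of the 70 possible splits are as extreme as the observed one, so p is about 0.029. With 49 subjects, enumerating every split is infeasible, and one would sample random relabellings instead.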

In addition to conducting and describing a well-planned and well-executed experiment, Hanenberg does a thorough job explaining and justifying his choice of methods, both for data collection and data analysis. But he also discusses the limitations of his work in great depth, in particular that it is impossible to draw general conclusions from one experiment. However, what a single experiment such as Hanenberg’s can do is cast doubts on the role of static type systems in software engineering, and his work opens up lots of avenues for future work on which programming languages work better than others, and why.

As we reported a few days ago, one of our contributors, Greg Wilson, gave a keynote at the MSR Vision 2020 workshop in Kingston on August 20. In it, he explored why there’s still a gulf between software engineering researchers and the people who actually build software for a living (see the slides or the discussion on Reddit for details). He also said that:

  1. there’s no easy way to close that gap, because most of the people in industry that researchers want to collaborate with have never encountered empirical software engineering studies, and therefore don’t understand their scope or value; so
  2. researchers—many of whom are professors—should pivot the software engineering classes they teach to focus on how to analyze real-world data, and what past analyses have told us, so that the next generation of developers will understand (and listen, and want to collaborate).

To make this more concrete, Greg asked the workshop participants to make up some assignments and exam questions for such a course. Some of the suggestions are listed below; we would welcome other ideas as well (please post them as comments). We’d also like to know who’d be interested in trying to teach such a course at their institution, and what you think the prerequisites would have to be: statistics, obviously, but would a database course that introduced students to SQL be necessary? What about a natural language processing course? Or something else we haven’t thought of?

Group 1

Give two examples of success stories in studies of the social aspects of software engineering.
  1. Reorganization based on social structures
  2. Identifying the “big players” in a software project
What are three sources of social interaction in software projects?
  1. Email
  2. IRC
  3. bug comments
  4. source code comments
Name three challenges in preprocessing emails.
  1. signatures
  2. code snippets
  3. stack traces
  4. fake/multiple email addresses
  5. identifying email headers and inline replies
  6. typos
  7. chat acronyms
  8. non-native speakers
  9. use of multiple languages

Group 2

  1. You are given a dataset A of OSS projects and a subset of it B. Evaluate whether a hypothesis H can be rejected on A and B. Design the question in such a way that H is significant (at the 0.05 level) on A but not on B. Discuss the discrepancy.
  2. Given a dataset and a specific question, perhaps from existing MSR papers, discuss which data mining approach is best suited for that question.
  3. Given a specific question (e.g., bug finding) what repositories should you use to solve it? Illustrate it with Bugzilla. How do you adapt this to Jira?
  4. Given that two variables A and B correlate, can you say “A causes B”? Why or why not?
  5. Repeat an existing analysis from an MSR paper. Do you get the same results? Vary a number of variables. How different are the results?
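Question 4 in this group lends itself to a small demonstration. The sketch below uses invented file records in which a third variable (file size) drives both the number of authors and the number of bugs: the overall correlation is strong, yet it vanishes within each size class. The data and variable names are assumptions made up for illustration.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Invented records: (file size class, number of authors, number of bugs).
files = [
    ("small", 1, 1), ("small", 2, 1), ("small", 1, 2), ("small", 2, 2),
    ("large", 5, 8), ("large", 6, 8), ("large", 5, 9), ("large", 6, 9),
]

# Correlation across all files, ignoring size.
overall = pearson([a for _, a, _ in files], [b for _, _, b in files])

# Correlation within each size class, which holds the confounder fixed.
within = {}
for size in ("small", "large"):
    subset = [(a, b) for s, a, b in files if s == size]
    within[size] = pearson([a for a, _ in subset], [b for _, b in subset])

print(f"overall r = {overall:.2f}")
for size, r in within.items():
    print(f"{size} files: r = {abs(r):.2f}")
```

Across all files the correlation is about 0.96, but within each size class it is zero, so cutting the number of authors would not be expected to reduce bugs in this made-up data.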

Group 3

  • Statistics
    1. What is wrong with this claim: “Files with a large number of committers/authors have more defects/bugs, so we conclude that more authors cause more bugs, and we recommend that the number of committers be reduced.”
    2. A tool is 99% accurate in detecting defective lines of code. Should developers use the tool? Why or why not?
    3. What are the internal validity issues and external validity issues with this method? “Researcher X finds that a lack of modularity leads to more defects in Windows, and Y is going to apply that to predict defects in Eclipse.”
    4. Design a study to see whether people who go to lunch together have fewer build defects in their software.
    5. Which would produce fewer false positives: 90% recall and 10% precision, or 10% precision and 90% recall?
  • Data
    1. Given a table of bug reports with severity, etc. and another table of users with qualifications, etc., determine whether experience and bug report frequency are correlated, and if so, how strongly.
    2. Define: evolutionary coupling, tokenizing, word nets, stemming, n-gram, entropy.
    3. List 10 sources of data that could be mined to estimate the risks to a software project, and describe the limitations of each.
  • Interpretation and Actionability
    1. Your boss has asked you to generate documentation for a legacy system that doesn’t have any. What approach(es) would you use to automatically generate some useful documentation for each class and method?
    2. Given a set of version control logs, how would you tell which commits were bug fixes (vs. adding new features)?
    3. What technique(s) would you use to correlate email messages from a mailing list archive with related version control commits?
  • Ethics
    1. Given a data set (mailing list archive, bug reports, and version control log), anonymize it so that it can be shared without risk.
    2. Is it ethical to do an experiment to find out whether one race or gender produces more bugs than another? Justify your answer. How about graduates of one university vs. another?
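For the statistics question above about a tool that is “99% accurate”, a short worked calculation shows why the answer depends on how rare defects are. This is a sketch under stated assumptions: the defect rate is invented, “99% accurate” is read as 99% correct on defective and on clean lines alike, and the function name is hypothetical.

```python
def precision_of_flags(total_lines, defect_rate, sensitivity, specificity):
    """Fraction of flagged lines that are truly defective."""
    defective = total_lines * defect_rate
    clean = total_lines - defective
    true_positives = defective * sensitivity
    false_positives = clean * (1 - specificity)
    return true_positives / (true_positives + false_positives)


# Assumed: 1 line in 1,000 is defective, and the tool is right 99% of the
# time on defective lines (sensitivity) and on clean lines (specificity).
precision = precision_of_flags(
    total_lines=100_000, defect_rate=0.001, sensitivity=0.99, specificity=0.99
)
print(f"{precision:.1%} of flagged lines are really defective")
```

Under these assumptions the tool flags 99 of the 100 defective lines but also about 999 clean ones, so only around 9% of its warnings are real.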

Abram Hindle and Thomas Zimmermann, “Do Topics Extracted from Requirements Make Sense to Managers and Developers?”, International Conference on Software Maintenance, 2012.

Disclosure: Abram and I have collaborated on a somewhat related paper.

Large organizations like Microsoft tend to rely on formal requirements documentation in order to specify and design the software products that they develop. These documents are meant to be tightly coupled with the actual implementation of the features they describe. In this paper we evaluate the value of high-level topic-based requirements traceability in the version control system, using Latent Dirichlet Allocation (LDA). We evaluate LDA topics on practitioners and check if the topics and trends extracted matches the perception that Program Managers and Developers have about the effort put into addressing certain topics. We found that effort extracted from version control that was relevant to a topic often matched the perception of the managers and developers of what occurred at the time. Furthermore we found evidence that many of the identified topics made sense to practitioners and matched their perception of what occurred. But for some topics, we found that practitioners had difficulty interpreting and labelling them. In summary, we investigate the high-level traceability of requirements topics to version control commits via topic analysis and validate with the actual stakeholders the relevance of these topics extracted from requirements.

A holy grail of software research is to (automatically) relate the business value of the software feature to the code implementing that feature, known as requirements traceability. All sorts of benefits are posited to result from this, including the ability to tell whether your customer’s needs are met.

One approach to this is to use an information retrieval technique called topic modelling. Topic modelling generates word distributions for a set of documents, like requirements specifications. One of the problems with topic modelling is that the topics are presented as lists of seemingly unrelated words, and the content of these topics must be captured with a descriptive label. In this paper the authors assess whether developers at Microsoft find the topics easy to label and understand.

What they discovered was that the study participants agreed with the proposed linkages between requirements topics and commits, but that the topics were difficult to label without being customized to the individual developer. Program managers seemed to find the topics more comprehensible, possibly because they deal with a wider array of features in their work. Further use of topic modelling in this area seems to require labelling by domain experts before being widely applicable to the traceability problem.

I gave the opening talk at MSR Vision 2020 in Kingston on Monday (slides), and in the wake of that, an experienced developer at Mozilla sent me a list of ten questions he’d really like empirical software engineering researchers to answer. They’re interesting in their own right, but I think they also reveal a lot about what practitioners want from researchers in general; comments would be very welcome.

  1. Vi vs. Emacs vs. graphical editors/IDEs: which makes me more productive?
  2. Should language developers spend their time on tools, syntax, library, or something else (like speed)? What makes the most difference to their users?
  3. Do unit tests save more time in debugging than they take to write/run/keep updated?
  4. Do distributed version control systems offer any advantages over centralized version control systems? (As a sub-question, Git or Mercurial: which helps me make fewer mistakes/shows me the info I need faster?)
  5. What are the best debugging techniques?
  6. Is it really twice as hard to debug as it is to write the code in the first place?
  7. What are the differences (bug count, code complexity, size, etc.), if any, between community-driven open source projects and corporate-controlled open source projects?
  8. If 10,000-line projects don’t benefit from architecture, but 100,000-line projects do, what do you do when your project slowly grows from the first size to the second?
  9. When does it make sense to reinvent the wheel vs. use an existing library?
  10. Are conferences worth the money? How much do they help junior/intermediate/senior programmers?

Yingnong Dang, Rongxin Wu, Hongyu Zhang, Dongmei Zhang, and Peter Nobel. “ReBucket: A Method for Clustering Duplicate Crash Reports Based on Call Stack Similarity”. ICSE 2012.

Software often crashes. Once a crash happens, a crash report could be sent to software developers for investigation upon user permission. To facilitate efficient handling of crashes, crash reports received by Microsoft’s Windows Error Reporting (WER) system are organized into a set of “buckets”. Each bucket contains duplicate crash reports that are deemed as manifestations of the same bug. The bucket information is important for prioritizing efforts to resolve crashing bugs. To improve the accuracy of bucketing, we propose ReBucket, a method for clustering crash reports based on call stack matching. ReBucket measures the similarities of call stacks in crash reports and then assigns the reports to appropriate buckets based on the similarity values. We evaluate ReBucket using crash data collected from five widely-used Microsoft products. The results show that ReBucket achieves better overall performance than the existing methods. On average, the F-measure obtained by ReBucket is about 0.88.

For successful software products, one nasty consequence of a massive user base is the similarly massive amount of crash reports that they produce. Somebody (or some tool) needs to sift through all of them and categorize them to figure out if there’s anything that’s new and worthy of investigation, as well as which bugs are in most urgent need of attention. Dang & Co developed a method to cluster these crash reports (the paper describes it in some detail), and it seems to have pretty good results so far—and although it has been tried only on Microsoft data, the authors are planning to move onto other projects as well.
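The general idea of bucketing by call stack similarity can be sketched in a few lines. This is not ReBucket’s similarity measure or clustering procedure, which the paper describes: it swaps in Python’s difflib sequence matching and a greedy single pass. The function names, the 0.6 threshold, and the call stacks are all assumptions invented for illustration.

```python
from difflib import SequenceMatcher


def stack_similarity(stack_a, stack_b):
    """Similarity in [0, 1] based on frames shared in the same order."""
    return SequenceMatcher(None, stack_a, stack_b, autojunk=False).ratio()


def bucket_reports(stacks, threshold=0.6):
    """Greedily group stacks; returns lists of indices into `stacks`."""
    buckets = []
    for index, stack in enumerate(stacks):
        for bucket in buckets:
            # Compare against the first report placed in the bucket.
            representative = stacks[bucket[0]]
            if stack_similarity(stack, representative) >= threshold:
                bucket.append(index)
                break
        else:
            # No existing bucket was similar enough, so start a new one.
            buckets.append([index])
    return buckets


# Invented call stacks, innermost frame first.
reports = [
    ["memcpy", "parse_header", "load_file", "main"],
    ["memcpy", "parse_header", "load_file", "open_dialog", "main"],
    ["free", "close_socket", "shutdown", "main"],
]
print(bucket_reports(reports))  # [[0, 1], [2]]
```

The first two reports share four frames in the same order and land in one bucket, while the third shares only `main` with them and gets its own. A single greedy pass like this depends on the order in which reports arrive, which is one reason a real system would need something more careful.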