It Will Never Work in Theory

Software development research that is relevant in practice

Browsing Posts published by Jorge Aranda

Yingnong Dang, Rongxin Wu, Hongyu Zhang, Dongmei Zhang, and Peter Nobel. “ReBucket: A Method for Clustering Duplicate Crash Reports Based on Call Stack Similarity”. ICSE 2012.

Software often crashes. Once a crash happens, a crash report could be sent to software developers for investigation upon user permission. To facilitate efficient handling of crashes, crash reports received by Microsoft’s Windows Error Reporting (WER) system are organized into a set of “buckets”. Each bucket contains duplicate crash reports that are deemed as manifestations of the same bug. The bucket information is important for prioritizing efforts to resolve crashing bugs. To improve the accuracy of bucketing, we propose ReBucket, a method for clustering crash reports based on call stack matching. ReBucket measures the similarities of call stacks in crash reports and then assigns the reports to appropriate buckets based on the similarity values. We evaluate ReBucket using crash data collected from five widely-used Microsoft products. The results show that ReBucket achieves better overall performance than the existing methods. On average, the F-measure obtained by ReBucket is about 0.88.

For successful software products, one nasty consequence of a massive user base is the similarly massive number of crash reports that they produce. Somebody (or some tool) needs to sift through all of them and categorize them to figure out if there’s anything that’s new and worthy of investigation, as well as which bugs are in most urgent need of attention. Dang & Co developed a method to cluster these crash reports (the paper describes it in some detail), and it seems to have pretty good results so far. Although it has been tried only on Microsoft data, the authors are planning to move on to other projects as well.
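To get a feel for what bucketing by call stack similarity involves, here is a minimal sketch in Python. It is not ReBucket's actual similarity measure (the paper defines its own); it assumes plain Jaccard similarity over frame names as a stand-in, and the function names, the threshold, and the sample stacks are all hypothetical.

```python
from typing import List


def stack_similarity(a: List[str], b: List[str]) -> float:
    """Jaccard similarity over frame names (a stand-in, not ReBucket's metric)."""
    set_a, set_b = set(a), set(b)
    if not set_a and not set_b:
        return 1.0  # two empty stacks are treated as identical
    return len(set_a & set_b) / len(set_a | set_b)


def bucket_reports(stacks: List[List[str]], threshold: float = 0.5) -> List[List[int]]:
    """Greedily put each stack in the first bucket whose first member is similar enough."""
    buckets: List[List[int]] = []
    for i, stack in enumerate(stacks):
        for bucket in buckets:
            representative = stacks[bucket[0]]
            if stack_similarity(representative, stack) >= threshold:
                bucket.append(i)
                break
        else:
            # No existing bucket matched, so this report starts a new one.
            buckets.append([i])
    return buckets


# Hypothetical crash reports, each reduced to its call stack.
stacks = [
    ["main", "render", "draw_text", "strlen"],
    ["main", "render", "draw_text", "memcpy"],
    ["main", "load_config", "parse_json"],
]
buckets = bucket_reports(stacks)  # indices of reports grouped by bucket
```

The first two stacks share three of their five distinct frames, so they land in the same bucket; the third shares only `main` with them and gets its own. A real system would care about frame order and position, which set overlap ignores.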

Anthony Finkelstein, the Dean of the Faculty of Engineering Sciences of University College London (and my academic grandfather), has set up a new blog with a purpose similar to ours: he publishes “snappy summaries” of software engineering research with the aim of supporting practitioners. Be sure to check it out! And if you’re doing research, I’d suggest you also read his lists of the Top Ten and Bottom Ten Software Engineering Challenges.

Magne Jørgensen and Stein Grimstad. “Software development estimation biases: the role of interdependence.” TSE 2012 38(3).

Software development effort estimates are frequently too low, which may lead to poor project plans and project failures. One reason for this bias seems to be that the effort estimates produced by software developers are affected by information that has no relevance for the actual use of effort. We attempted to acquire a better understanding of the underlying mechanisms and the robustness of this type of estimation bias. For this purpose, we hired 374 software developers working in outsourcing companies to participate in a set of three experiments. The experiments examined the connection between estimation bias and developer dimensions: self-construal (how one sees oneself), thinking style, nationality, experience, skill, education, sex, and organizational role. We found that estimation bias was present along most of the studied dimensions. The most interesting finding may be that the estimation bias increased significantly with higher levels of interdependence, i.e., with stronger emphasis on connectedness, social context, and relationships. We propose that this connection may be enabled by an activation of one’s self-construal when engaging in effort estimation, and a connection between a more interdependent self-construal and increased search for indirect messages, lower ability to ignore irrelevant context, and a stronger emphasis on socially desirable responses.

Researchers know that estimates of software development effort can be biased pretty easily by anchors and by irrelevant information. An important question, though, is whether these biases occur due to purely cognitive reasons or due to a desire to please and to connect with others. Jørgensen and Grimstad help answer this question by getting hundreds of paid developers from several countries to participate in a set of three experiments. Some of their findings:

Bias in effort estimation seems to be present for developers from all the countries studied. We were unable to find strong and systematic differences between countries or regions (Eastern Europe and Asia). (…)

In spite of the lack of any strong and systematic difference between the countries and regions, there may be culturally related variables that are useful for understanding the mechanisms by which estimation biases occur. In particular, a developer’s level of interdependence (emphasis on connectedness, social context, and relationship) seems to be connected systematically with how much he or she was affected by irrelevant and misleading information and with lower effort estimates.

In other words, they found that everyone (that would include you and your peers) seems susceptible to estimation biases, that people with greater levels of “interdependence” (those who give greater weight to relationships and connectedness) are more subject to bias, and that despite this, cultural differences by country do not seem to play an important role in estimation bias. The best thing you can do, in any case, is to try to shield estimators from anchors and other sources of bias as much as possible.

Chris Parnin, Christoph Treude, Lars Grammel, and Margaret-Anne Storey. “Crowd Documentation: Exploring the Coverage and the Dynamics of API Discussions on Stack Overflow”. Georgia Tech Technical Report, 2012.

Traditionally, many types of software documentation, such as API documentation, require a process where a few people write for many potential users. The resulting documentation, when it exists, is often of poor quality and lacks sufficient examples and explanations. In this paper, we report on an empirical study to investigate how Question and Answer (Q&A) websites, such as Stack Overflow, facilitate crowd documentation — knowledge that is written by many and read by many. We examine the crowd documentation for three popular APIs: Android, GWT, and the Java programming language. We collect usage data using Google Code Search, and analyze the coverage, quality, and dynamics of the Stack Overflow documentation for these APIs. We find that the crowd is capable of generating a rich source of content with code examples and discussion that is actively viewed and used by many more developers. For example, over 35,000 developers contributed questions and answers about the Android API, covering 87% of the classes. This content has been viewed over 70 million times to date. However, there are shortcomings with crowd documentation, which we identify. In addition to our empirical study, we present future directions and tools that can be leveraged by other researchers and software designers for performing API analytics and mining of crowd documentation.

The process of figuring out how to use an API has changed radically since Q&A sites (and in particular Stack Overflow) came along. But to what extent can we depend on such sites for complete, speedy documentation? Parnin and colleagues looked into this and got some pretty interesting stats (a sample: 87% of all classes of the Android API and 77% of the Java API classes have at least one thread at Stack Overflow; questions are answered in a median time of 11 minutes). They also built some visualization tools for you to play with the data: see Chris Parnin’s blog for more details.

(Disclaimer: I’m currently associated with Dr. Storey’s lab. However, I did not participate in this research.)

We interrupt our regular programming to let you know of a slight change in editorial policy here at Never Work in Theory: although we always try to present journal and conference papers that are freely accessible online, we’ve made exceptions for interesting research that is only available behind a paywall. But our goal is to build bridges between software research and practice, and paywalled papers do not help—especially when considering that they report on research that, more often than not, citizens already paid for with their taxes. Therefore, from now on, open access is one of our few basic requirements for discussing papers at NWIT—the others being empirical findings, relevance to practitioners, and no self-promotion.

To clarify: at this point we don’t care if the papers we present are available to the public through an Open Access journal (there are still few of those in our area) or through a researcher’s personal website, as long as you can access them for free.

Open Access is gathering steam right now, and for good reason: academic publishing is one of the most abusive and least necessary businesses around. Currently, in the United States, there is an online petition at the White House website to require free access over the Internet to scientific articles arising from taxpayer-funded research. If enough people sign the petition, the White House is bound to consider it. It’s currently about 6,000 signatures below the threshold; if this is something you care about at all, please consider adding your name today.

Abram Hindle, Earl Barr, Zhendong Su, Prem Devanbu, and Mark Gabel. “On the Naturalness of Software”, ICSE 2012.

Natural languages like English are rich, complex, and powerful. The highly creative and graceful use of languages like English and Tamil, by masters like Shakespeare and Avvaiyar, can certainly delight and inspire. But in practice, given cognitive constraints and the exigencies of daily life, most human utterances are far simpler and much more repetitive and predictable. In fact, these utterances can be very usefully modeled using modern statistical methods. This fact has led to the phenomenal success of statistical approaches to speech recognition, natural language translation, question-answering, and text mining and comprehension.

We begin with the conjecture that most software is also natural, in the sense that it is created by humans at work, with all the attendant constraints and limitations—and thus, like natural language, it is also likely to be repetitive and predictable. We then proceed to ask whether a) code can be usefully modeled by statistical language models and b) such models can be leveraged to support software engineers. Using the widely adopted n-gram model, we provide empirical evidence supportive of a positive answer to both these questions. We show that code is also very repetitive, and in fact even more so than natural languages. As an example use of the model, we have developed a simple code completion engine for Java that, despite its simplicity, already improves Eclipse’s completion capability. We conclude the paper by laying out a vision for future research in this area.

This paper is not directly applicable to software practice, but you may still find it pretty cool and a great read. It uses the statistical approach to Natural Language Processing that is used to such good effect by tools such as Google Translate, but applied to lines of code. The authors find that code is much more amenable to statistical modelling than English. This means that more powerful code completion and code suggestion tools are viable (they prototyped one for Eclipse), and it also opens the door to new approaches in software mining research. Exciting stuff…
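If you have never seen an n-gram model, the idea fits in a few lines. The sketch below is a toy bigram (n = 2) model in Python that counts which token follows which and suggests the most frequent followers. It is only an illustration of the general technique, not the authors' completion engine; the tiny corpus, the whitespace tokenization, and the function names are assumptions.

```python
from collections import Counter, defaultdict
from typing import Dict, List


def train_bigrams(tokens: List[str]) -> Dict[str, Counter]:
    """Count, for every token, how often each other token immediately follows it."""
    follows: Dict[str, Counter] = defaultdict(Counter)
    for prev, nxt in zip(tokens, tokens[1:]):
        follows[prev][nxt] += 1
    return follows


def suggest(model: Dict[str, Counter], prev: str, k: int = 3) -> List[str]:
    """Return up to k tokens most often seen right after `prev`."""
    counts = model.get(prev, Counter())
    return [token for token, _ in counts.most_common(k)]


# A hypothetical "corpus": two Java-like loops, tokenized on whitespace.
corpus = (
    "for ( int i = 0 ; i < n ; i ++ ) { total += a [ i ] ; } "
    "for ( int j = 0 ; j < m ; j ++ ) { total += b [ j ] ; }"
).split()

model = train_bigrams(corpus)
after_for = suggest(model, "for")      # tokens seen right after "for"
after_total = suggest(model, "total")  # tokens seen right after "total"
```

Because code is so repetitive, even this crude model always predicts `(` after `for` in the toy corpus. The models in the paper are trained on far larger corpora and use longer contexts.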

Ekrem Kocaguneli, Tim Menzies, and Jacky Keung. “On the value of ensemble effort estimation”, TSE 2011.

Background: Despite decades of research, there is no consensus on which software effort estimation methods produce the most accurate models.

Aim: Prior work has reported that, given M estimation methods, no single method consistently outperforms all others. Perhaps rather than recommending one estimation method as best, it is wiser to generate estimates from ensembles of multiple estimation methods.

Method: 9 learners were combined with 10 pre-processing options to generate 9 × 10 = 90 solo-methods. These were applied to 20 data sets and evaluated using 7 error measures. This identified the best n (in our case n = 13) solo-methods that showed stable performance across multiple datasets and error measures. The top 2, 4, 8 and 13 solo-methods were then combined to generate 12 multi-methods, which were then compared to the solo-methods.

Results: (i) The top 10 (out of 12) multi-methods significantly out-performed all 90 solo-methods. (ii) The error rates of the multi-methods were significantly less than the solo-methods. (iii) The ranking of the best multi-method was remarkably stable.

Conclusion: While there is no best single effort estimation method, there exist best combinations of such effort estimation methods.

Anybody who has ever done software effort estimation knows that it’s a pretty hard thing to do. It’s tough even for small individual tasks for someone without practice, and it’s a horribly difficult task for large-scale group projects even for estimators with lots of practice. There are many methods that estimators could use, but as Kocaguneli & Co remind us, “no single method consistently outperforms all others”: sometimes you’re better off using method A; other times, method B would’ve been more appropriate. Their proposal: to pair automated learners with pre-processing options into solo-methods, each deficient on its own, and to combine the best of them into ensembles, in the hope that these new multi-methods will provide estimates with less error and more consistency.

The multi-methods approach worked well in their (pretty large) dataset of nearly 1,200 projects. This does not mean that the method that came out on top for them will come out on top for you, too. It only means that ensembles of methods are a good workaround for the problem of inconsistency of method efficacy. What the authors propose is for practitioners to learn the basics of machine learning and build method ensembles themselves:

Therefore, our recommendations to practitioners, who are willing to use multi-methods but lack the knowledge of machine learning algorithms are:

  • Start with initial 2 learners and build the associated multi-methods
  • See the performance of the current multi-methods
  • Build new multi-methods only if you are not pleased with the performance of the current ones

That won’t be an easy task, but it may be less painful than committing to using a single method that often won’t work. If you’re interested in doing it, this paper has several references and pointers to get you started.
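To make the ensemble idea concrete, here is a minimal Python sketch that takes the median of several solo estimates. The three toy estimators, the (size, effort) data, and the choice of the median as the combiner are all assumptions made for illustration; they are not the learners or pre-processors evaluated in the paper.

```python
from statistics import mean, median
from typing import Callable, List, Tuple

# Hypothetical record of a past project: (size in KLOC, effort in person-months).
Project = Tuple[float, float]
Estimator = Callable[[List[Project], float], float]


def mean_effort(history: List[Project], size: float) -> float:
    """Ignore size and predict the average effort of past projects."""
    return mean([effort for _, effort in history])


def nearest_neighbor(history: List[Project], size: float) -> float:
    """Predict the effort of the past project closest in size."""
    return min(history, key=lambda p: abs(p[0] - size))[1]


def productivity_ratio(history: List[Project], size: float) -> float:
    """Scale the new size by the overall effort-per-KLOC of past projects."""
    total_size = sum(s for s, _ in history)
    total_effort = sum(e for _, e in history)
    return size * total_effort / total_size


def ensemble_estimate(methods: List[Estimator], history: List[Project], size: float) -> float:
    """Combine the solo estimates by taking their median."""
    return median([method(history, size) for method in methods])


history = [(10.0, 24.0), (20.0, 50.0), (40.0, 106.0)]
methods = [mean_effort, nearest_neighbor, productivity_ratio]
estimate = ensemble_estimate(methods, history, size=25.0)
```

For a 25 KLOC project, the three solo estimates here are 60, 50, and roughly 64.3 person-months, so the median-based ensemble settles on 60. The appeal of the median is that one wildly wrong solo method cannot drag the combined estimate very far.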

Susan Elliott Sim, Rosalva Gallardo-Valencia, Kavita Philip, Medha Umarji, Megha Agarwala, Cristina V. Lopes, and Sukanya Ratanotayanon. “Software Reuse through Methodical Component Reuse and Amethodical Snippet Remixing”, CSCW 2012.

Every method for developing software is a prescriptive model. Applying a deconstructionist analysis to methods reveals that there are two texts, or sets of assumptions and ideals: a set that is privileged by the method and a second set that is left out, or marginalized by the method. We apply this analytical lens to software reuse, a technique in software development that seeks to expedite one’s own project by using programming artifacts created by others. By analyzing the methods prescribed by Component-Based Software Engineering (CBSE), we arrive at two texts: Methodical CBSE and Amethodical Remixing. Empirical data from four studies on code search on the web draws attention to four key points of tension: status of component boundaries; provenance of source code; planning and process; and evaluation criteria for candidate code. We conclude the paper with a discussion of the implications of this work for the limits of methods, structure of organizations that reuse software, and the design of search engines for source code.

One of the ways in which the Internet transformed software development is the prevalence of the “programming by Google” practice: searching online for a function or snippet that does what you want, and copy-and-pasting it into your own code, stitching it as needed to make it work. This practice is great in some ways (it speeds up development, it helps cross-pollinate ideas and techniques), but it also has its problems (losing track of code provenance, intellectual property concerns, and divergence from policy, for example).

In this paper, Sim & Co provide a very good exploration of the distinction between the safer, more planned, and stuffier “component reuse” approach and the ad-hoc, versatile, under-the-table “snippet remixing” approach to code reuse. They have some interesting statistics on the use of both approaches (for instance: 92% of surveyed developers admit to remixing snippets of code), and they identify points of tension between them. Their paper should be a wake-up call for Software Engineering professors to stop acting as if component-based reuse were all there is to code reuse, and an invitation to practitioners to consider the strengths and weaknesses of both approaches and to define the right balance between them in their own contexts.

(Disclosure: Susan Sim is my academic sister: we had the same graduate advisor, though we did not overlap. Also, here is one more reminder that we link to PDFs of the papers we discuss when we find them. In cases where we do not, asking the authors nicely for a copy usually works.)

Update, April 30: There is now a freely available electronic copy here.

Laura Dabbish, Colleen Stuart, Jason Tsay, and Jim Herbsleb. “Social Coding in GitHub: Transparency and Collaboration in an Open Software Repository” CSCW 2012.

Social applications on the web let users track and follow the activities of a large number of others regardless of location or affiliation. There is a potential for this transparency to radically improve collaboration and learning in complex knowledge-based activities. Based on a series of in-depth interviews with central and peripheral GitHub users, we examined the value of transparency for large-scale distributed collaborations and communities of practice. We find that people make a surprisingly rich set of social inferences from the networked activity information in GitHub, such as inferring someone else’s technical goals and vision when they edit code, or guessing which of several similar projects has the best chance of thriving in the long term. Users combine these inferences into effective strategies for coordinating work, advancing technical skills and managing their reputation.

Platforms like GitHub provide an interesting twist to social dynamics in open source: they make it easy for everyone to keep track of and stay aware of the work of other developers, including some of the best in the world, and to interact and collaborate with them, all in one place. This paper by Dabbish & Co reports on work habits and perceptions of “central and peripheral” GitHub users. If you’re new to GitHub, this paper is a good take on its social aspect, but if you’re already used to working in GitHub, little here will surprise you. Still, I found some cool nuggets that might interest you. For instance, once people amass an audience looking at their code production, they become more careful about what they make available publicly. Also, some people get followers not because of programming ability or personal connections, but because they have “good taste” in the projects they themselves follow.

Patrick Jermann and Marc-Antoine Nüssli. “Effects of Sharing Text Selections on Gaze Cross-recurrence and Interaction Quality in a Pair Programming Task” CSCW 2012.

We present a dual eye-tracking study that demonstrates the effect of sharing selection among collaborators in a remote pair-programming scenario. Forty pairs of engineering students completed several program understanding tasks while their gaze was synchronously recorded. The coupling of the programmers’ focus of attention was measured by a cross-recurrence analysis of gaze that captures how much programmers look at the same sequence of spots within a short time span. A high level of gaze cross-recurrence is typical for pairs who actively engage in grounding efforts to build and maintain shared understanding. As part of their grounding efforts, programmers may use text selection to perform collaborative references. Broadcast selections serve as indexing sites for the selector as they attract non-selector’s gaze shortly after they become visible. Gaze cross-recurrence is highest when selectors accompany their selections with speech to produce a multimodal reference.

The fact that pair programming can work pretty well doesn’t mean that it “just works.” Instead, it requires its own set of skills and considerations, and perhaps some people are better suited for it than others. In a controlled experiment using an eye tracker, Jermann and Nüssli show the effect of some of the very low-level actions that people in pairs may take to improve their performance. Specifically, two seemingly simple kinds of actions (talking aloud and selecting the block of text that you’re talking about) bring your partner’s attention to the same screen area. When pairs do this, their level of code comprehension increases.
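If “gaze cross-recurrence” sounds opaque, a simplified version is easy to compute. The Python sketch below assumes two gaze traces sampled at the same instants as (x, y) screen coordinates, and reports the fraction of sample pairs, at most a few samples apart in time, that fall close together on screen. The radius, the lag window, and the function name are hypothetical, and the paper's actual analysis is more involved than this.

```python
from math import dist
from typing import List, Tuple

Point = Tuple[float, float]  # (x, y) screen coordinates of one gaze sample


def cross_recurrence_rate(
    gaze_a: List[Point],
    gaze_b: List[Point],
    radius: float = 50.0,
    max_lag: int = 2,
) -> float:
    """Fraction of sample pairs within `max_lag` steps of each other in time
    whose gaze points lie within `radius` pixels of each other."""
    hits = 0
    total = 0
    for i, point_a in enumerate(gaze_a):
        # Compare against the partner's samples slightly before and after time i.
        start = max(0, i - max_lag)
        stop = min(len(gaze_b), i + max_lag + 1)
        for j in range(start, stop):
            total += 1
            if dist(point_a, gaze_b[j]) <= radius:
                hits += 1
    return hits / total if total else 0.0


# Hypothetical traces: the partners look at the same spots at the same moments.
trace_a = [(0.0, 0.0), (100.0, 0.0), (200.0, 0.0)]
trace_b = [(0.0, 0.0), (100.0, 0.0), (200.0, 0.0)]
coupled = cross_recurrence_rate(trace_a, trace_b, max_lag=0)

# A partner staring at a far corner of the screen the whole time.
elsewhere = [(1000.0, 1000.0)] * 3
uncoupled = cross_recurrence_rate(trace_a, elsewhere, max_lag=0)
```

With no lag allowed, the identical traces give a rate of 1.0 and the distracted partner gives 0.0. Widening `max_lag` lets one person's gaze trail the other's by a moment and still count, which is the situation the selection-plus-speech references in the study are meant to produce.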

Jermann and Nüssli’s study had engineers sitting separately, in front of different but shared screens. My guess is that if you and your pair are sitting side by side, other actions with the same purpose (such as pointing with your finger, or with your mouse) should have similar effects.

(As usual, we post links to the actual papers when we find them. I couldn’t in this case, but remember that researchers are usually happy to share their work over email if you ask nicely…)