Automated program repair is the problem of finding a transformation (called a patch) of a given incorrect program that eliminates the observable failures. It has important applications such as providing debugging aids, automatically grading student assignments, and patching security vulnerabilities. A common challenge faced by existing repair techniques is scalability to large patch spaces, since these techniques explicitly or implicitly consider many candidate patches. The correctness criterion for program repair is often given as a suite of tests, and current repair techniques do not scale because of the large number of test executions performed by the underlying search algorithms. In this work, we address this problem by introducing a methodology of patch generation based on a test-equivalence relation (if two programs are "test-equivalent" for a given test, they produce indistinguishable results on this test). We propose two test-equivalence relations, based on runtime values and on dependencies respectively, and present an algorithm that performs on-the-fly partitioning of patches into test-equivalence classes. Our experiments on real-world programs reveal that the proposed methodology drastically reduces the number of test executions and therefore provides an order-of-magnitude efficiency improvement over existing repair techniques, without sacrificing patch quality.
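To make the idea concrete, the sketch below illustrates value-based test-equivalence partitioning under simplifying assumptions: candidate patches are expressions evaluated at a single program location, and the hypothetical `evaluate_at_location` callback captures the runtime values an expression takes on a test. Patches whose value traces coincide on every test fall into the same class, so the search only needs to execute one representative per class.

```python
# A minimal sketch of value-based test-equivalence partitioning;
# `evaluate_at_location` is a hypothetical hook, not from the paper.
from collections import defaultdict

def partition_patches(candidate_exprs, test_inputs, evaluate_at_location):
    """Group candidate patch expressions into test-equivalence classes.

    If two patched expressions yield the same runtime values on a test,
    the test cannot distinguish the patches, so executing one
    representative per class suffices.
    """
    classes = defaultdict(list)
    for expr in candidate_exprs:
        # Key: the tuple of values the expression takes across all tests.
        key = tuple(evaluate_at_location(expr, t) for t in test_inputs)
        classes[key].append(expr)
    return list(classes.values())

# Example: candidate replacements for a buggy condition over variable x.
exprs = ["x > 0", "x >= 0", "x >= 1", "x != 0"]
tests = [{"x": 0}, {"x": 1}]
evaluate = lambda e, env: eval(e, {}, env)
print(partition_patches(exprs, tests, evaluate))
# -> [['x > 0', 'x >= 1', 'x != 0'], ['x >= 0']]  (two test runs, not four)
```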
Singer et al. find that software developers use Twitter to "keep up with the fast-paced development landscape". Our survey of 71 developers who use Twitter in their development activities highlights that developers are interested in following specialized software gurus who share relevant technical tweets. However, finding these gurus among the more than 310 million Twitter users is no easy feat. To help developers perform this task, we propose a recommendation system that identifies specialized gurus by taking into account four sources of information: the content of microblogs generated by Twitter users, the structure of their Twitter network, their profile information, and their GitHub information. Our approach first extracts different kinds of features that characterize a Twitter user and then employs a two-stage classification approach to generate a discriminative model that can differentiate specialized software gurus in a particular domain from other Twitter users who generate domain-related tweets. Our experiments on a dataset of 62,774 Twitter users, who generated 6,321,450 tweets over one month, demonstrate that our approach can achieve an F-measure of up to 0.774, outperforming a state-of-the-art competing approach by at least 56.05%.
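As an illustration of the two-stage idea (not the authors' exact pipeline), the sketch below trains one classifier to separate domain-related users from everyone else and a second to separate gurus from ordinary domain-related users; the feature extractor is a placeholder standing in for the content, network, profile, and GitHub features described above.

```python
# A hypothetical two-stage guru classifier using scikit-learn; feature
# names and the choice of random forests are illustrative assumptions.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def extract_features(user):
    # Placeholder features, one per information source in the abstract.
    return np.array([
        user["domain_tweet_ratio"],    # content of microblogs
        user["follower_count"],        # Twitter network structure
        user["bio_has_domain_terms"],  # profile information
        user["github_repo_count"],     # GitHub information
    ])

def train_two_stage(users, is_domain_related, is_guru):
    X = np.array([extract_features(u) for u in users])
    # Stage 1: separate users who tweet about the domain from the rest.
    stage1 = RandomForestClassifier().fit(X, is_domain_related)
    # Stage 2: among domain-related users, separate gurus from non-gurus.
    mask = np.array(is_domain_related, dtype=bool)
    stage2 = RandomForestClassifier().fit(X[mask], np.array(is_guru)[mask])
    return stage1, stage2

def predict_guru(stage1, stage2, user):
    x = extract_features(user).reshape(1, -1)
    # A user is recommended only if both stages vote positively.
    return bool(stage1.predict(x)[0]) and bool(stage2.predict(x)[0])
```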
While developers are aware of the importance of comprehensively testing patches, the large effort involved in devising relevant test cases means that such testing rarely happens in practice. Furthermore, even when test cases are written to cover a patch, they often exercise the same behaviour in the old and the new version of the code. In this article, we present a symbolic execution-based technique designed to generate test inputs that cover the new program behaviours introduced by a patch. The technique works by executing both the old and the new version in the same symbolic execution instance, with the old version shadowing the new one. During this combined shadow execution, whenever a branch point is reached where the two versions diverge, we generate a test case exercising the divergence, allowing us to comprehensively test the new behaviours of the new version. We evaluate our technique on the Coreutils patches from the CoREBench suite of regression bugs and show that it is able to generate test inputs that exercise newly added behaviours and expose some of the regression bugs.
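The core of the divergence check can be phrased as a constraint-solving query: under the path condition collected so far, can the old and new branch conditions evaluate differently? The snippet below sketches this query with Z3 on a toy patched branch; a full shadow symbolic execution engine instruments real program paths rather than hand-written formulas.

```python
# An illustrative divergence check with Z3 (pip install z3-solver);
# the toy formulas stand in for constraints gathered along a real path.
from z3 import Int, Solver, And, Xor, sat

x = Int("x")
path_constraint = x > 0      # path condition accumulated so far
old_branch = x > 10          # branch condition in the old version
new_branch = x > 5           # same branch point in the patched version

# The versions diverge here if some input satisfying the path condition
# makes exactly one of the two branch conditions true.
s = Solver()
s.add(And(path_constraint, Xor(old_branch, new_branch)))
if s.check() == sat:
    print("divergence-exposing input:", s.model()[x])  # e.g. x = 6
```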
The number of mobile devices sold worldwide has increased exponentially in recent years, surpassing that of personal computers in 2011. Every day, such devices download and run millions of apps that take advantage of modern hardware features (e.g., multi-core processors, large OLED screens) to offer exciting user experiences. Clearly, there is a cost to pay in terms of energy consumption and, in particular, of reduced battery life. This has pushed researchers to investigate how to reduce the energy consumption of apps, for example, by optimizing the color palette used in the app's GUI. Whilst past research in this area aimed at optimizing energy while keeping an acceptable level of contrast, this paper proposes an approach, named GEMMA (Gui Energy Multi-objective optiMization for Android apps), for generating color palettes using a multi-objective optimization technique that produces color solutions optimizing energy consumption and contrast while using colors consistent with the original palette. The empirical evaluation demonstrates (i) substantial improvements in the three objectives, (ii) a concrete reduction of the energy consumption as assessed by a power monitor, (iii) the attractiveness of the generated color compositions to apps' users, and (iv) the suitability of GEMMA for adoption in an industrial context.
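To give a feel for the optimization problem, the sketch below encodes simplified versions of the three objectives (energy, contrast, consistency) over an RGB palette; the actual power model and search procedure in GEMMA are more elaborate, so treat these as illustrative stand-ins.

```python
# Simplified stand-ins for the three objectives a GEMMA-style approach
# optimizes; a multi-objective search (e.g., NSGA-II) would minimize
# (energy, -contrast, inconsistency) and return a Pareto front of palettes.

def oled_energy(palette):
    # On OLED screens, power grows with pixel intensity; as a rough proxy,
    # sum the RGB components of every color in the palette.
    return sum(r + g + b for (r, g, b) in palette)

def luminance(color):
    r, g, b = color
    return 0.2126 * r + 0.7152 * g + 0.0722 * b  # ITU-R BT.709 weights

def min_contrast(palette, fg_bg_pairs):
    # Worst-case luminance difference over foreground/background pairs;
    # higher is better for readability.
    return min(abs(luminance(palette[f]) - luminance(palette[b]))
               for f, b in fg_bg_pairs)

def inconsistency(palette, original):
    # Penalize colors that drift far from the original palette (Euclidean
    # distance in RGB space; GEMMA's notion of consistency may differ).
    return sum(sum((a - b) ** 2 for a, b in zip(c, o)) ** 0.5
               for c, o in zip(palette, original))
```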
A wide variety of research methods and techniques are available to software engineering (SE) researchers for conducting studies. Although several overviews of research methods exist, there is little consistency in the methods they cover, and some ambiguity in how research terminology is used. Furthermore, research is sometimes criticized by reviewers for characteristics inherent to the methods used. We present the ABC framework for SE research, which offers a holistic view of eight archetypal research strategies. The framework is based on two dimensions that are widely considered key in research design: the level of control a researcher can exert on the research setting, and the level of generalizability of a study's findings. ABC refers to the research goal of achieving generalizability over actors (A), precise measurement of their behavior (B), and realism of context (C). We identify a metaphor for each strategy and discuss its essential limitations and potential strengths. We illustrate these research strategies in two key SE domains: global software engineering and requirements engineering. Finally, we discuss six ways in which the framework can be used to further SE research.
Model transformations play a cornerstone role in Model-Driven Engineering (MDE), as they provide the essential mechanisms for manipulating and transforming models. The correctness of software built using MDE techniques greatly relies on the correctness of model transformations. However, debugging them is challenging and error-prone, and the situation becomes more critical as the size and complexity of model transformations grow, to the point where manual debugging is no longer feasible. Spectrum-Based Fault Localization (SBFL) uses the results of test cases and their corresponding code coverage information to estimate the likelihood of each program component (e.g., statement) being faulty. In this paper, we present an approach that applies SBFL to locate faulty rules in model transformations. We evaluate the feasibility and accuracy of the approach by comparing the effectiveness of 18 different state-of-the-art SBFL techniques at locating faults in model transformations. The evaluation results reveal that the best techniques, namely Kulczynski2, Mountford, Ochiai, and Zoltar, lead the debugger to inspect at most three rules in order to locate the bug in around 74% of the cases. Furthermore, we compare our approach with a static approach for fault localization in model transformations, observing a clear superiority of the proposed SBFL-based method.
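As a concrete example of the spectrum-based scoring involved, the sketch below ranks transformation rules with the Ochiai formula, susp(r) = ef / sqrt((ef + nf) * (ef + ep)), where ef and ep count the failing and passing tests that cover rule r, and nf counts the failing tests that do not; other formulas such as Kulczynski2 or Zoltar only change this scoring function. The coverage and outcome structures here are illustrative.

```python
# A minimal sketch of SBFL with the Ochiai formula, applied to
# transformation rules instead of statements.
import math

def ochiai(ef, ep, nf):
    """ef/ep: failing/passing tests covering the rule; nf: failing tests not covering it."""
    denom = math.sqrt((ef + nf) * (ef + ep))
    return ef / denom if denom else 0.0

def rank_rules(coverage, outcomes):
    """coverage[test] = set of rules the test exercises;
    outcomes[test]   = True if the test passed."""
    total_fail = sum(1 for ok in outcomes.values() if not ok)
    scores = {}
    for rule in set().union(*coverage.values()):
        ef = sum(1 for t, ok in outcomes.items() if not ok and rule in coverage[t])
        ep = sum(1 for t, ok in outcomes.items() if ok and rule in coverage[t])
        scores[rule] = ochiai(ef, ep, total_fail - ef)
    # Most suspicious rules first; the debugger inspects them in this order.
    return sorted(scores.items(), key=lambda kv: -kv[1])

coverage = {"t1": {"R1", "R2"}, "t2": {"R2", "R3"}, "t3": {"R3"}}
outcomes = {"t1": False, "t2": False, "t3": True}
print(rank_rules(coverage, outcomes))  # R2 covered by both failing tests -> top
```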
Software effort estimation studies still suffer from discordant empirical results (i.e., conclusion instability), due mainly to the lack of rigorous benchmarking methods. So far, only one baseline method, namely the Automatically Transformed Linear Model (ATLM), has been proposed, yet it has not been extensively assessed. In this paper, we propose a novel method based on Linear Programming (dubbed Linear Programming for Effort Estimation, LP4EE) and carry out a thorough empirical study to evaluate the effectiveness of both LP4EE and ATLM for benchmarking widely used effort estimation techniques. The results of our study confirm the need to benchmark every new proposal against robust baselines and reveal that LP4EE is not only more accurate than ATLM in 66% of the experiments but also more robust to different data splits in 41% of the cases. This suggests that using LP4EE as a baseline method can help reduce conclusion instability. We make an open-source implementation of LP4EE publicly available in order to facilitate its adoption as a benchmark in future studies.
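For intuition, one standard way to cast linear-model fitting as a linear program is to minimize the sum of absolute residuals using auxiliary slack variables; the sketch below does this with scipy and is only a plausible reconstruction of the general idea, not LP4EE's published formulation.

```python
# Fitting a linear effort model by minimizing the sum of absolute
# residuals via linear programming (a sketch of the general technique).
import numpy as np
from scipy.optimize import linprog

def fit_lp(X, y):
    """Minimize sum_i e_i  subject to  -e_i <= y_i - X_i.w <= e_i.
    Decision variables: [w (p coefficients), e (n residual bounds)].
    An intercept can be modeled by appending a column of ones to X."""
    n, p = X.shape
    c = np.concatenate([np.zeros(p), np.ones(n)])   # objective: sum of e
    # Encode  X.w - e <= y   and   -X.w - e <= -y.
    A_ub = np.block([[X, -np.eye(n)], [-X, -np.eye(n)]])
    b_ub = np.concatenate([y, -y])
    bounds = [(None, None)] * p + [(0, None)] * n   # w free, e >= 0
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds, method="highs")
    return res.x[:p]                                # fitted coefficients

# Predicted effort for new projects is then X_new @ w.
```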