The advent of variability management and generator technology enables users to derive individual system variants from a configurable code base by selecting desired configuration options. This approach can give rise to billions of variants, which cannot be analyzed efficiently for bugs and other properties with classic analysis techniques. To address this issue, researchers and practitioners developed sampling heuristics and, recently, variability-aware analysis techniques. While sampling reduces the analysis effort significantly, the information obtained is necessarily incomplete, and it is unknown whether state-of-the-art sampling techniques scale to billions of variants. Variability-aware analysis techniques process the configurable code base directly, exploiting similarities among individual variants with the goal of reducing analysis effort. However, while promising, variability-aware analysis techniques have so far been applied mostly to small academic examples. To learn about the mutual strengths and weaknesses of variability-aware and sample-based analysis techniques, we compared the two by means of seven concrete control-flow and data-flow analyses, applied to five real-world subject systems: Busybox, OpenSSL, SQLite, Linux, and uClibc. In particular, we compare the efficiency (analysis execution time) of the analyses and their effectiveness (potential bugs found). Overall, variability-aware analysis outperforms most sampling techniques with respect to efficiency and effectiveness.
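To make the scale argument concrete, the following minimal sketch counts the variants of a code base with independent Boolean options and contrasts exhaustive enumeration with a sample. It is an illustration only, not the analyses used in the study: the option names, the `has_bug` predicate, and the uniform random sampling are all assumptions, and real configurable systems usually constrain which option combinations are valid.

```python
import itertools
import random

def count_variants(num_options):
    """Number of variants for independent Boolean options (no constraints assumed)."""
    return 2 ** num_options

def random_sample(options, k, seed=0):
    """Draw k random configurations; a simple stand-in for a sampling heuristic."""
    rng = random.Random(seed)
    return [{opt: rng.random() < 0.5 for opt in options} for _ in range(k)]

def has_bug(config):
    # Hypothetical bug that only shows up when A and C are on and B is off.
    return config["A"] and config["C"] and not config["B"]

options = ["A", "B", "C", "D"]

# Exhaustive enumeration is feasible for 4 options (16 variants) ...
all_configs = [dict(zip(options, bits))
               for bits in itertools.product([False, True], repeat=len(options))]
buggy = [c for c in all_configs if has_bug(c)]

# ... but 30 options already yield more than a billion variants.
print(count_variants(30))

# A sample is cheap to analyze, yet it may contain none of the buggy variants.
sample = random_sample(options, k=3)
print(len(buggy), any(has_bug(c) for c in sample))
```

Only 2 of the 16 variants in this toy example trigger the bug, so whether a small sample exposes it depends on which configurations happen to be drawn. This is the sense in which information obtained by sampling is incomplete.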
Automated program repair is the problem of finding a transformation (called a patch) of a given incorrect program that eliminates the observable failures. It has important applications such as providing debugging aids, automatically grading student assignments and patching security vulnerabilities. A common challenge faced by existing repair techniques is scalability to large patch spaces, since there are many candidate patches that these techniques explicitly or implicitly consider. The correctness criterion for program repair is often given as a suite of tests. Current repair techniques do not scale due to the large number of test executions performed by the underlying search algorithms. In this work, we address this problem by introducing a methodology of patch generation based on a test-equivalence relation (if two programs are "test-equivalent" for a given test, they produce indistinguishable results on this test). We propose two test-equivalence relations, based on runtime values and on dependencies respectively, and present an algorithm that performs on-the-fly partitioning of patches into test-equivalence classes. Our experiments on real-world programs reveal that the proposed methodology drastically reduces the number of test executions and therefore provides an order of magnitude efficiency improvement over existing repair techniques, without sacrificing patch quality.
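The idea of grouping patches by runtime values can be sketched as follows. This is a simplified illustration and not the algorithm from the paper: it assumes each candidate patch replaces a single expression that is evaluated exactly once per test, and the candidate expressions and the program state are hypothetical.

```python
from collections import defaultdict

def partition_by_value(candidates, state):
    """Group candidate expressions by the value they produce in `state`.

    Candidates in the same group are indistinguishable on this test
    (under the single-evaluation assumption), so the test needs to be
    executed only once per group rather than once per candidate.
    """
    classes = defaultdict(list)
    for name, expr in candidates.items():
        classes[expr(state)].append(name)
    return dict(classes)

# Hypothetical candidate replacements for a buggy condition.
candidates = {
    "x >= y": lambda s: s["x"] >= s["y"],
    "x > y": lambda s: s["x"] > s["y"],
    "x != y": lambda s: s["x"] != s["y"],
    "x < y": lambda s: s["x"] < s["y"],
}

# Program state observed at the patch location while running one test.
state = {"x": 3, "y": 5}

classes = partition_by_value(candidates, state)
for value, members in classes.items():
    print(value, members)
```

For this state the four candidates fall into two classes, so two test executions suffice instead of four. The saving grows with the size of the patch space whenever many candidates agree on the values they compute.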
This paper presents a systematic literature review that enquires into the maturity of FLT evaluation in terms of baseline comparison, homogeneity of empirical designs and, finally, the reproducibility of FLTs and their evaluation. It identifies different issues that substantially affect the ability of researchers and practitioners to identify the best-of-breed FLT or, in the case of researchers, to replicate existing FLT-evaluation studies. The results show that 95% of the existing research in this field presents novel FLTs, and that only just over half of the examined FLTs have been evaluated through formal empirical methods. In addition, only 8% of the studies compared the FLT to openly available baseline techniques. Another characteristic of the reviewed literature is the 255 different subject systems, 60 metrics, 210 benchmarks and plethora of user input formats in FLT evaluations, which also negatively affects the comparability of FLTs. Finally, there is a lack of reproducible FLTs in the field, preventing researchers from re-creating the FLT for comparison studies. Cumulatively, these conditions make it difficult to find answers to questions like "which are the best FLTs?". The paper concludes by providing guidelines for empirical evaluation of FLTs that may help towards empirical standardization.
Singer et al. find that software developers use Twitter to "keep up with the fast-paced development landscape". Our survey with 71 developers who use Twitter in their development activities highlights that developers are interested in following specialized software gurus who share relevant technical tweets. However, finding these gurus among the more than 310 million Twitter users is not an easy feat. To help developers perform this task, we propose a recommendation system to identify specialized gurus, which takes into account four things: first, the content of microblogs generated by Twitter users; second, the structure of the Twitter network of such users; third, the profile information of the users; and last, the GitHub information of the users. Our approach first extracts different kinds of features that characterize a Twitter user and then employs a two-stage classification approach to generate a discriminative model that can differentiate specialized software gurus in a particular domain from other Twitter users who generate domain-related tweets. Our experiments on a dataset of 62,774 Twitter users, who generated 6,321,450 tweets over one month, demonstrate that our approach can achieve an F-measure of up to 0.774, which outperforms a state-of-the-art competing approach by at least 56.05%.
While developers are aware of the importance of comprehensively testing patches, the large effort involved in coming up with relevant test cases means that such testing rarely happens in practice. Furthermore, even when test cases are written to cover the patch, they often exercise the same behaviour in the old and the new version of the code. In this article, we present a symbolic execution-based technique that is designed to generate test inputs that cover the new program behaviours introduced by a patch. The technique works by executing both the old and the new version in the same symbolic execution instance, with the old version shadowing the new one. During this combined shadow execution, whenever a branch point is reached where the old and the new version diverge, we generate a test case exercising the divergence and comprehensively test the new behaviours of the new version. We evaluate our technique on the Coreutils patches from the CoREBench suite of regression bugs, and show that it is able to generate test inputs that exercise newly added behaviours and expose some of the regression bugs.
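The notion of a divergence between the old and the new version can be illustrated with a small sketch. It is not the technique itself: the approach described above reasons symbolically within a single symbolic execution instance, whereas this stand-in simply evaluates both branch conditions on every input in a small, explicitly enumerated range. The two conditions are hypothetical.

```python
def old_branch(x):
    """Branch condition before the patch (hypothetical)."""
    return x > 10

def new_branch(x):
    """Branch condition after the patch (hypothetical)."""
    return x >= 10

def find_divergences(old, new, inputs):
    """Return the inputs on which the two versions take different branch sides."""
    return [x for x in inputs if old(x) != new(x)]

# Brute-force search over a small input domain replaces the constraint
# solving that a symbolic execution engine would perform.
divergent = find_divergences(old_branch, new_branch, range(0, 21))
print(divergent)
```

Each input found this way drives the two versions down different paths at the patched branch, which makes it a natural seed for a test case that exercises the behaviour introduced by the patch. Inputs on which both conditions agree cannot distinguish the versions at this branch.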
The number of mobile devices sold worldwide has exponentially increased in recent years, surpassing that of personal computers in 2011. Every day, such devices download and run millions of apps that take advantage of modern hardware features (e.g., multi-core processors, large OLED screens) to offer exciting user experiences. Clearly, there is a cost to pay in terms of energy consumption and, in particular, of reduced battery life. This has pushed researchers to investigate how to reduce the energy consumption of apps, for example, by optimizing the color palette used in the app's GUI. Whilst past research in this area aimed at optimizing energy while keeping an acceptable level of contrast, this paper proposes an approach, named GEMMA (Gui Energy Multi-objective optiMization for Android apps), for generating color palettes using a multi-objective optimization technique, which produces color solutions optimizing energy consumption and contrast while using consistent colors with respect to the original palette. The empirical evaluation demonstrates (i) substantial improvements in the three different objectives, (ii) a concrete reduction of the energy consumption as assessed by a power monitor, (iii) the attractiveness of the generated color compositions for apps' users, and (iv) the suitability of GEMMA to be adopted in an industrial context.
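Multi-objective optimization typically returns a set of non-dominated trade-offs rather than a single best solution. The sketch below shows a plain Pareto filter over the three objectives named above. It is an assumption-laden illustration and not GEMMA's search algorithm: the palette names and scores are invented, and contrast is negated so that all three objectives are minimized.

```python
def dominates(a, b):
    """True if `a` is no worse than `b` on every objective and strictly
    better on at least one (all objectives are minimized)."""
    return (all(x <= y for x, y in zip(a, b))
            and any(x < y for x, y in zip(a, b)))

def pareto_front(solutions):
    """Keep only the solutions that no other solution dominates."""
    return {
        name: obj
        for name, obj in solutions.items()
        if not any(dominates(other, obj) for other in solutions.values())
    }

# Hypothetical palettes scored as
# (energy, -contrast, distance from the original palette).
palettes = {
    "A": (5.0, -4.5, 0.2),
    "B": (4.0, -4.5, 0.3),
    "C": (6.0, -4.0, 0.4),  # worse than A on all three objectives
    "D": (3.0, -3.0, 0.9),
}

front = pareto_front(palettes)
print(sorted(front))
```

Palette C is discarded because A beats it on every objective, while A, B and D each win on at least one objective and therefore remain as alternative trade-offs between energy, contrast and color consistency.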
Uncertainty in timing properties (e.g., detection time of external events) is a common reality in embedded software systems since these systems interact with complex physical environments. Such time uncertainty leads to non-determinism. For example, as a result of time uncertainty, time-triggered operations may either generate different valid outputs across different executions, or experience failures (e.g., results not being generated in the expected time window) that occur only occasionally over many executions. For these reasons, time uncertainty makes the generation of effective test oracles for timing requirements a challenging task. To address the above challenge, we propose STUIOS (Stochastic Testing with Unique Input Output Sequences), an approach for the automated generation of stochastic oracles that verify the capability of a software system to fulfill timing constraints in the presence of time uncertainty. Such stochastic oracles entail the statistical analysis of repeated test case executions based on test output probabilities predicted by means of statistical model checking. Results from two industrial case studies in the automotive domain demonstrate that this approach improves the fault detection effectiveness of test suites derived from timed automata, compared to traditional approaches.
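A stochastic oracle of this kind can be pictured as a statistical test over repeated executions. The sketch below is one possible reading under stated assumptions, not the STUIOS procedure: it assumes a single predicted probability that the expected output appears within its time window, uses an exact one-sided binomial test, and picks a significance level of 0.05. The function names and the numbers are hypothetical.

```python
from math import comb

def binomial_lower_tail(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p), computed exactly."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k + 1))

def stochastic_oracle(successes, runs, predicted_p, alpha=0.05):
    """Pass unless so few successes would be unlikely under `predicted_p`.

    `successes` counts the executions whose output arrived in the expected
    time window; `predicted_p` is the probability predicted for that
    output (here simply given as a number).
    """
    return binomial_lower_tail(successes, runs, predicted_p) >= alpha

# Hypothetical test case: the model predicts a timely output in 90% of runs.
print(stochastic_oracle(successes=18, runs=20, predicted_p=0.9))
print(stochastic_oracle(successes=10, runs=20, predicted_p=0.9))
```

With 18 timely outputs out of 20 runs the observation is compatible with the predicted probability and the oracle passes, whereas 10 out of 20 is far too few to be explained by chance and is reported as a failure. A deterministic oracle would flag the first case as failing as soon as a single execution missed its window.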
Multi-level modelling promotes flexibility in modelling by enabling the use of several meta-levels instead of just two, as is the case in mainstream two-level modelling approaches. While this approach leads to simpler models for some scenarios, it introduces an additional degree of freedom, as designers can decide the level at which an element should reside and must ascertain the suitability of such decisions. In this respect, model refactorings have been successfully applied in the context of two-level modelling to rearrange the elements of a model while preserving its meaning. Thus, in this paper, we propose their extension to tackle the refactoring of multi-level models in order to help designers in rearranging elements across and within levels and exploring the consequences. We show a classification and catalogue of multi-level refactorings, and provide support in our MetaDepth tool. Finally, we present an experiment based on model mutation that validates the predicted semantic side effects of the refactorings on the basis of more than 210,000 refactoring applications.