20 similar documents found (search time: 15 ms)
1.
Topic models are generative probabilistic models which have been applied to information retrieval to automatically organize and provide structure to a text corpus. Topic models discover topics in the corpus, which represent real-world concepts as sets of frequently co-occurring words. Recently, researchers found topics to be effective tools for structuring various software artifacts, such as source code, requirements documents, and bug reports. This research also hypothesized that using topics to describe the evolution of software repositories could be useful for maintenance and understanding tasks. However, research has yet to determine whether these automatically discovered topic evolutions describe the evolution of source code in a way that is relevant or meaningful to project stakeholders, and thus it is not clear whether topic models are a suitable tool for this task. In this paper, we take a first step towards evaluating topic models in the analysis of software evolution by performing a detailed manual analysis on the source code histories of two well-known and well-documented systems, JHotDraw and jEdit. We define and compute various metrics on the discovered topic evolutions and manually investigate how and why the metrics evolve over time. We find that the large majority (87%–89%) of topic evolutions correspond well with actual code change activities by developers. We are thus encouraged to use topic models as tools for studying the evolution of a software system.
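A minimal sketch of the kind of analysis the abstract describes: fit a topic model over per-release pseudo-documents and track each topic's weight across releases. The corpus, the library choice (scikit-learn's LDA), and the per-release weight metric are illustrative assumptions, not the paper's exact pipeline.

```python
# Sketch: track topic evolution across release snapshots with LDA.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
import numpy as np

# One pseudo-document per release, e.g. concatenated identifiers/comments.
releases = {
    "1.0": "figure drawing tool handle color pen",
    "1.1": "figure drawing undo redo command history",
    "1.2": "undo redo command history persistence storage",
}

vec = CountVectorizer()
X = vec.fit_transform(releases.values())

lda = LatentDirichletAllocation(n_components=2, random_state=0)
lda.fit(X)

# Per-release topic weights; their change over time approximates the
# topic-evolution curves that the paper inspects manually.
weights = lda.transform(X)          # rows: releases, cols: topics
for name, w in zip(releases, weights):
    print(name, np.round(w, 2))
```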
2.
Stephen W. Thomas, Hadi Hemmati, Ahmed E. Hassan, Dorothea Blostein. Empirical Software Engineering, 2014, 19(1): 182-212.
Software development teams use test suites to test changes to their source code. In many situations, the test suites are so large that executing every test for every source code change is infeasible, due to time and resource constraints. Development teams need to prioritize their test suite so that as many distinct faults as possible are detected early in the execution of the test suite. We consider the problem of static black-box test case prioritization (TCP), where test suites are prioritized without the availability of the source code of the system under test (SUT). We propose a new static black-box TCP technique that represents test cases using a previously unused data source in the test suite: the linguistic data of the test cases, i.e., their identifier names, comments, and string literals. Our technique applies a text analysis algorithm called topic modeling to the linguistic data to approximate the functionality of each test case, allowing our technique to give high priority to test cases that test different functionalities of the SUT. We compare our proposed technique with existing static black-box TCP techniques in a case study of multiple real-world open source systems: several versions of Apache Ant and Apache Derby. We find that our static black-box TCP technique outperforms existing static black-box TCP techniques, and has comparable or better performance than two existing execution-based TCP techniques. Static black-box TCP methods are widely applicable because the only input they require is the source code of the test cases themselves. This contrasts with other TCP techniques which require access to the SUT runtime behavior, to the SUT specification models, or to the SUT source code.
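A hedged sketch of the core idea: represent each test case by the topic mixture of its linguistic data, then order tests greedily so that each newly scheduled test is maximally distant from those already chosen. The toy test cases, the two-topic LDA, the Manhattan distance, and the greedy loop are illustrative assumptions rather than the paper's exact algorithm.

```python
# Sketch: static black-box TCP via topic modeling of test-case linguistic data.
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

tests = {
    "testLogin":  "user login password session authenticate",
    "testLogout": "user logout session invalidate",
    "testReport": "report export csv column format",
    "testChart":  "chart render axis legend color",
}
X = CountVectorizer().fit_transform(tests.values())
theta = LatentDirichletAllocation(n_components=2, random_state=0).fit_transform(X)

# Greedy prioritization: repeatedly pick the test farthest (Manhattan distance)
# from those already selected, so different functionalities are covered early.
names = list(tests)
remaining = set(range(len(names)))
order = [remaining.pop()]                         # arbitrary seed test
while remaining:
    nxt = max(remaining,
              key=lambda i: min(np.abs(theta[i] - theta[j]).sum() for j in order))
    remaining.remove(nxt)
    order.append(nxt)
print([names[i] for i in order])
```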
3.
4.
Viet-An Nguyen, Jordan Boyd-Graber, Philip Resnik, Deborah A. Cai, Jennifer E. Midberry, Yuanxin Wang. Machine Learning, 2014, 95(3): 381-421.
Identifying influential speakers in multi-party conversations has been the focus of research in communication, sociology, and psychology for decades. It has long been acknowledged qualitatively that controlling the topic of a conversation is a sign of influence. To capture who introduces new topics in conversations, we introduce SITS (Speaker Identity for Topic Segmentation), a nonparametric hierarchical Bayesian model that is capable of discovering (1) the topics used in a set of conversations, (2) how these topics are shared across conversations, (3) when these topics change during conversations, and (4) a speaker-specific measure of "topic control". We validate the model via evaluations using multiple datasets, including work meetings, online discussions, and political debates. Experimental results confirm the effectiveness of SITS in both intrinsic and extrinsic evaluations.
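SITS itself is a nonparametric hierarchical Bayesian model; the sketch below is only a heuristic stand-in for its "topic control" idea: infer a topic mixture per turn with plain LDA, flag turns whose mixture diverges sharply from the previous one, and credit the shift to that turn's speaker. The threshold and the conversation data are made up.

```python
# Heuristic stand-in for "topic control": count abrupt topic shifts per speaker.
import numpy as np
from collections import Counter
from scipy.spatial.distance import jensenshannon
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

turns = [("alice", "budget numbers quarterly revenue spreadsheet"),
         ("bob",   "revenue forecast budget targets"),
         ("carol", "hiring interview candidates team onboarding"),
         ("bob",   "onboarding schedule interview panel")]

X = CountVectorizer().fit_transform(text for _, text in turns)
theta = LatentDirichletAllocation(n_components=2, random_state=0).fit_transform(X)

control = Counter()
for i in range(1, len(turns)):
    if jensenshannon(theta[i - 1], theta[i]) > 0.3:   # assumed shift threshold
        control[turns[i][0]] += 1                     # speaker i introduced a shift
print(control.most_common())
```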
5.
Context: Mining software repositories has emerged as a research direction over the past decade, achieving substantial success in both research and practice in supporting various software maintenance tasks. Software repositories include bug repositories, communication archives, source control repositories, etc. When using these repositories to support software maintenance, inclusion of irrelevant information in each repository can lead to decreased effectiveness or even wrong results. Objective: This article aims at selecting the relevant information from each of the repositories to improve the effectiveness of software maintenance tasks. Method: For a maintenance task at hand, maintainers need to implement the maintenance request on the current system. In this article, we propose an approach, MSR4SM, to extract the relevant information from each software repository based on the maintenance request and the current system. That is, if the information in a software repository is relevant to either the maintenance request or the current system, this information should be included to perform the current maintenance task. MSR4SM uses the topic model to extract the topics from these software repositories. Then, relevant information in each software repository is extracted based on the topics. Results: MSR4SM is evaluated for two software maintenance tasks, feature location and change impact analysis, which are based on four subject systems, namely jEdit, ArgoUML, Rhino and KOffice. The empirical results show that the effectiveness of traditional software-repository-based maintenance tasks can be greatly improved by MSR4SM. Conclusions: There is a lot of irrelevant information in software repositories. Before we use them to implement a maintenance task at hand, we need to preprocess them. Then, the effectiveness of the software maintenance tasks can be improved.
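A loose sketch of the filtering step: score each repository document by the similarity of its topic mixture to the maintenance request's mixture and keep only close matches. The corpus, the two-topic model, the cosine measure, and the 0.7 threshold are assumptions for illustration; MSR4SM's actual extraction is richer.

```python
# Sketch: topic-based relevance filtering of repository documents.
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

repo_docs = [
    "fix crash in text editor buffer when undo stack empty",
    "update build script dependencies and packaging",
    "refactor syntax highlighting for new language modes",
]
request = "syntax highlighting broken for python mode"

X = CountVectorizer().fit_transform(repo_docs + [request])
theta = LatentDirichletAllocation(n_components=2, random_state=0).fit_transform(X)

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

req = theta[-1]
for doc, t in zip(repo_docs, theta[:-1]):
    keep = cosine(t, req) > 0.7        # assumed relevance threshold
    print(f"{'KEEP' if keep else 'DROP'}: {doc}")
```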
6.
Tse-Hsun Chen, Stephen W. Thomas, Ahmed E. Hassan. Empirical Software Engineering, 2016, 21(5): 1843-1919.
Researchers in software engineering have attempted to improve software development by mining and analyzing software repositories. Since the majority of the software engineering data is unstructured, researchers have applied Information Retrieval (IR) techniques to help software development. The recent advances of IR, especially statistical topic models, have helped make sense of unstructured data in software repositories even more. However, even though there are hundreds of studies on applying topic models to software repositories, there is no study that shows how the models are used in the software engineering research community, and which software engineering tasks are being supported through topic models. Moreover, since the performance of these topic models is directly related to the model parameters and usage, knowing how researchers use the topic models may also help future studies make optimal use of such models. Thus, we surveyed 167 articles from the software engineering literature that make use of topic models. We find that i) most studies centre around a limited number of software engineering tasks; ii) most studies use only basic topic models; and iii) researchers usually treat topic models as black boxes without fully exploring their underlying assumptions and parameter values. Our paper provides a starting point for new researchers who are interested in using topic models, and may help new researchers and practitioners determine how to best apply topic models to a particular software engineering task.
7.
Tong-Seng Quah. Information Sciences, 2009, 179(4): 430-445.
In this study, defect tracking is used as a proxy method to predict software readiness. The number of remaining defects in an application under development is one of the most important factors that allow one to decide if a piece of software is ready to be released. By comparing the predicted number of faults with the number of faults discovered in testing, a software manager can decide whether the software is likely ready to be released or not. The predictive model developed in this research can predict: (i) the number of faults (defects) likely to exist, (ii) the estimated number of code changes required to correct a fault and (iii) the estimated amount of time (in minutes) needed to make the changes in respective classes of the application. The model uses product metrics as independent variables to make predictions. These metrics are selected depending on the nature of the source code with regard to architecture layers, types of faults and the contribution factors of these metrics. The use of a neural network model with a genetic training strategy is introduced to improve prediction results for estimating software readiness in this study. This genetic-net combines a genetic algorithm with a statistical estimator to produce a model which also shows the usefulness of inputs. The model is divided into three parts: (1) a prediction model for the presentation logic tier, (2) a prediction model for the business tier and (3) a prediction model for the data access tier. Existing object-oriented metrics and complexity software metrics are used in the business tier prediction model. New sets of metrics have been proposed for the presentation logic tier and data access tier. These metrics are validated using data extracted from real-world applications. The trained models can be used as tools to assist software managers in making software release decisions.
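The paper's model is trained with a genetic strategy over tier-specific metrics; the sketch below substitutes an ordinary gradient-trained network (scikit-learn's MLPRegressor) and synthetic product metrics purely to show the shape of the prediction task: metrics in, expected fault count out.

```python
# Illustrative stand-in for the fault-prediction task (not the genetic-net itself).
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
# Columns: e.g., class size, coupling, complexity (hypothetical product metrics).
X = rng.uniform(0, 1, size=(200, 3))
y = 5 * X[:, 0] + 3 * X[:, 2] + rng.normal(0, 0.3, 200)   # synthetic fault counts

model = MLPRegressor(hidden_layer_sizes=(8,), max_iter=5000, random_state=0).fit(X, y)
print("predicted faults for a new class:", model.predict([[0.4, 0.2, 0.9]])[0])
```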
8.
Yasutaka Kamei, Takafumi Fukushima, Shane McIntosh, Kazuhiro Yamashita, Naoyasu Ubayashi, Ahmed E. Hassan. Empirical Software Engineering, 2016, 21(5): 2072-2106.
Unlike traditional defect prediction models that identify defect-prone modules, Just-In-Time (JIT) defect prediction models identify defect-inducing changes. As such, JIT defect models can provide earlier feedback for developers, while design decisions are still fresh in their minds. Unfortunately, similar to traditional defect models, JIT models require a large amount of training data, which is not available when projects are in initial development phases. To address this limitation in traditional defect prediction, prior work has proposed cross-project models, i.e., models learned from other projects with sufficient history. However, cross-project models have not yet been explored in the context of JIT prediction. Therefore, in this study, we empirically evaluate the performance of JIT models in a cross-project context. Through an empirical study on 11 open source projects, we find that while JIT models rarely perform well in a cross-project context, their performance tends to improve when using approaches that: (1) select models trained using other projects that are similar to the testing project, (2) combine the data of several other projects to produce a larger pool of training data, and (3) combine the models of several other projects to produce an ensemble model. Our findings empirically confirm that JIT models learned using other projects are a viable solution for projects with limited historical data. However, JIT models tend to perform best in a cross-project context when the data used to learn them are carefully selected.
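A compact sketch of the three cross-project strategies the abstract evaluates (similarity-based selection, data pooling, and model ensembling), using logistic regression over synthetic change metrics. The similarity measure (distance between feature means) and all data are illustrative assumptions.

```python
# Sketch: three cross-project JIT strategies on synthetic change metrics.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
def fake_project(n):          # columns: lines added, lines deleted, files touched
    X = rng.uniform(0, 1, (n, 3))
    y = (X[:, 0] + X[:, 2] + rng.normal(0, 0.2, n) > 1.1).astype(int)
    return X, y

others = [fake_project(150) for _ in range(3)]
X_test, y_test = fake_project(80)

# (1) select the other project most similar to the test project
sim = [np.linalg.norm(X.mean(0) - X_test.mean(0)) for X, _ in others]
Xs, ys = others[int(np.argmin(sim))]
print("selected:", LogisticRegression().fit(Xs, ys).score(X_test, y_test))

# (2) pool all other projects into one larger training set
Xp = np.vstack([X for X, _ in others]); yp = np.hstack([y for _, y in others])
print("pooled:  ", LogisticRegression().fit(Xp, yp).score(X_test, y_test))

# (3) ensemble: average predicted probabilities of per-project models
proba = np.mean([LogisticRegression().fit(X, y).predict_proba(X_test)[:, 1]
                 for X, y in others], axis=0)
print("ensemble:", ((proba > 0.5).astype(int) == y_test).mean())
```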
9.
Tom Arbuckle. Science of Computer Programming, 2011, 76(12): 1078-1097.
In order to study software evolution, it is necessary to measure artefacts representative of project releases. If we consider the process of software evolution to be copying with subsequent modification, then, by analogy, placing emphasis on what remains the same between releases will lead to focusing on similarity between artefacts. At the same time, software artefacts, stored digitally as binary strings, are all information. This paper introduces a new method for measuring software evolution in terms of artefacts' shared information content. A similarity value representing the quantity of information shared between artefact pairs is produced using a calculation based on Kolmogorov complexity. Similarity values for releases are then collated over the software's evolution to form a map quantifying change through lack of similarity. The method has general applicability: it can disregard otherwise salient software features such as programming paradigm, language or application domain because it considers software artefacts purely in terms of the mathematically justified concept of information content. Three open-source projects are analysed to show the method's utility. Preliminary experiments on udev and git verify the measurement of the projects' evolutions. An experiment on ArgoUML validates the measured evolution against experimental data from other studies.
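Kolmogorov complexity is uncomputable, so practical instantiations of this idea approximate it with a real compressor, commonly via the Normalized Compression Distance (NCD). The sketch below uses zlib on two small byte strings; the paper's exact calculation may differ.

```python
# Sketch: compression-based similarity between release artefacts.
import zlib

def ncd(x: bytes, y: bytes) -> float:
    """Normalized Compression Distance: small when x and y share information."""
    cx, cy = len(zlib.compress(x)), len(zlib.compress(y))
    cxy = len(zlib.compress(x + y))
    return (cxy - min(cx, cy)) / max(cx, cy)

# In practice x and y would be the raw bytes of two release artefacts.
a = b"def draw(): pen.down(); pen.move(10)"
b = b"def draw(): pen.down(); pen.move(12); pen.up()"
print("distance:", round(ncd(a, b), 3))   # collate over all release pairs for a map
```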
10.
Karunanithi, N., Whitley, D., Malaiya, Y.K. IEEE Transactions on Software Engineering, 1992, 18(7): 563-574.
The usefulness of connectionist models for software reliability growth prediction is illustrated. The applicability of the connectionist approach is explored using various network models, training regimes, and data representation methods. An empirical comparison is made between this approach and five well-known software reliability growth models using actual data sets from several different software projects. The results presented suggest that connectionist models may adapt well across different data sets and exhibit better predictive accuracy. The analysis shows that the connectionist approach is capable of developing models of varying complexity.
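An illustrative stand-in for the connectionist setup: a small feed-forward network trained to map normalized execution time to cumulative failures, then queried one step past the training horizon. The synthetic growth curve and network size are assumptions; the paper compares several network models and training regimes on real failure data.

```python
# Sketch: a tiny network as a software reliability growth predictor.
import numpy as np
from sklearn.neural_network import MLPRegressor

t = np.linspace(0, 1, 30).reshape(-1, 1)
failures = 100 * (1 - np.exp(-3 * t)).ravel()      # synthetic growth curve

net = MLPRegressor(hidden_layer_sizes=(6,), max_iter=20000, random_state=0)
net.fit(t, failures)
print("next-step prediction:", net.predict([[1.05]])[0])
```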
11.
12.
Computers & Industrial Engineering, 1988, 14(2): 161-170.
This paper discusses the development of a series of interactive computer models for measuring productivity. Using Lotus 1-2-3, a series of flexible models were developed which can easily be modified to fit the productivity measurement system used by most companies. Rather than force the company's productivity measurement system to fit an available computer model, a company can now tailor the computer model to exactly fit its productivity measurement system.
13.
Software reliability modeling is an important research area. Existing software reliability models are essentially nonlinear function models, and estimating their parameters is difficult. Particle swarm optimization (PSO) is a class of stochastic optimization methods well suited to nonlinear optimization problems. This paper proposes a PSO-based method for estimating the parameters of software reliability models; the key step of the method is constructing a suitable fitness function. The method is used to estimate the parameters of the exponential software reliability model and the logarithmic Poisson execution-time model on five real software systems. Experimental results show that the method estimates parameters with high accuracy and adapts well to both models.
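A minimal PSO sketch for this kind of parameter estimation: fit the exponential model's mean-value function mu(t) = a(1 - exp(-b*t)) by minimizing squared error against observed cumulative failures. The inertia and acceleration constants, bounds, and data are illustrative assumptions; the paper's fitness function may differ.

```python
# Sketch: PSO estimating (a, b) of an exponential reliability model.
import numpy as np

rng = np.random.default_rng(0)
t = np.arange(1, 21, dtype=float)
observed = 120 * (1 - np.exp(-0.15 * t)) + rng.normal(0, 2, t.size)

def fitness(p):                      # p = (a, b); sum of squared errors
    a, b = p
    return np.sum((observed - a * (1 - np.exp(-b * t))) ** 2)

n, dims = 30, 2
pos = rng.uniform([50, 0.01], [300, 1.0], (n, dims))
vel = np.zeros((n, dims))
pbest, pbest_f = pos.copy(), np.array([fitness(p) for p in pos])
gbest = pbest[pbest_f.argmin()].copy()

for _ in range(200):
    r1, r2 = rng.random((n, dims)), rng.random((n, dims))
    vel = 0.7 * vel + 1.5 * r1 * (pbest - pos) + 1.5 * r2 * (gbest - pos)
    pos = np.clip(pos + vel, [1.0, 1e-3], [500.0, 2.0])   # keep parameters sane
    f = np.array([fitness(p) for p in pos])
    improved = f < pbest_f
    pbest[improved], pbest_f[improved] = pos[improved], f[improved]
    gbest = pbest[pbest_f.argmin()].copy()

print("estimated a, b:", np.round(gbest, 3))
```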
14.
Arie ten Cate. Computational Statistics & Data Analysis, 2009, 53(6): 2055-2060.
Simultaneous econometric models may contain pairs of complementary inequalities. It is discussed how to reformulate such models and solve them with econometric software which can handle only equalities. Two approaches are applied: the normal map representation and the Fischer-Burmeister NCP function. The latter seems to work best. The software programs TSP, SAS/ETS and EViews are tested. The test model describes two markets for electricity, each with fluctuating demand and an endogenous production capacity; the capacity of the trade link between the regions is also endogenous.
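For concreteness, a sketch of the Fischer-Burmeister reformulation: a complementarity pair 0 <= x, F(x) >= 0 with x*F(x) = 0 is equivalent to the single equation phi(x, F(x)) = 0, where phi(a, b) = sqrt(a^2 + b^2) - a - b, so an ordinary equation solver applies. The one-dimensional F below is an illustrative assumption, not the paper's electricity-market model.

```python
# Sketch: solving a complementarity condition via the Fischer-Burmeister function.
import numpy as np
from scipy.optimize import fsolve

def F(x):                     # e.g., excess supply as a function of quantity x
    return 2.0 * x - 1.0

def fb(a, b):                 # Fischer-Burmeister NCP function
    return np.sqrt(a**2 + b**2) - a - b

root = fsolve(lambda x: fb(x, F(x)), x0=[0.3])[0]
print("x* =", root, " F(x*) =", F(root))   # expect x* = 0.5, F = 0 (interior solution)
```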
15.
Liangzhe Chen, K. S. M. Tozammel Hossain, Patrick Butler, Naren Ramakrishnan, B. Aditya Prakash. Data Mining and Knowledge Discovery, 2016, 30(3): 681-710.
Surveillance of epidemic outbreaks and spread from social media is an important tool for governments and public health authorities. Machine learning techniques for nowcasting the flu have made significant inroads into correlating social media trends to case counts and prevalence of epidemics in a population. There is a disconnect between data-driven methods for forecasting flu incidence and epidemiological models that adopt a state-based understanding of transitions, which can lead to sub-optimal predictions. Furthermore, models for epidemiological activity and social activity such as that on Twitter predict different shapes and have important differences. In this paper, we propose two temporal topic models (one unsupervised model as well as one improved weakly-supervised model) to capture hidden states of a user from his tweets and aggregate states in a geographical region for better estimation of trends. We show that our approaches help fill the gap between phenomenological methods for disease surveillance and epidemiological models. We validate our approaches by modeling the flu using Twitter in multiple countries of South America. We demonstrate that our models can consistently outperform plain vocabulary assessment in flu case-count predictions, and at the same time obtain better flu-peak predictions than competitors. We also show that our fine-grained modeling can reconcile some contrasting behaviors between epidemiological and social models.
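A much-simplified stand-in for the aggregation idea: infer per-tweet topic mixtures, average one topic's weight per week as a latent "illness" signal, and regress case counts on it. Which topic tracks illness must be checked by inspecting its top words; the two-topic LDA, the index-1 choice, and all data below are assumptions. The paper's temporal, weakly supervised models are considerably richer.

```python
# Sketch: weekly aggregation of a tweet-level topic signal for flu nowcasting.
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.linear_model import LinearRegression

weeks = [
    ["feeling great weekend soccer", "coffee morning office"],
    ["fever cough staying home sick", "flu shot clinic line"],
    ["fever chills flu terrible", "cough medicine pharmacy sick"],
]
case_counts = np.array([5, 40, 90])

docs = [t for wk in weeks for t in wk]
X = CountVectorizer().fit_transform(docs)
theta = LatentDirichletAllocation(n_components=2, random_state=0).fit_transform(X)

# Weekly signal: mean weight of one topic (assumed here to be topic 1; in practice
# inspect the topic's top words to confirm it tracks illness vocabulary).
idx, sig = 0, []
for wk in weeks:
    sig.append(theta[idx:idx + len(wk), 1].mean())
    idx += len(wk)

model = LinearRegression().fit(np.array(sig).reshape(-1, 1), case_counts)
print("nowcast for a new week with signal 0.6:", model.predict([[0.6]])[0])
```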
16.
Detection of unanticipated faults for autonomous underwater vehicles using online topic models
Ben-Yair Raanan, James Bellingham, Yanwu Zhang, Mathieu Kemp, Brian Kieft, Hanumant Singh, Yogesh Girdhar. Journal of Field Robotics, 2018, 35(5): 705-716.
For robots to succeed in complex missions, they must be reliable in the face of subsystem failures and environmental challenges. In this paper, we focus on autonomous underwater vehicle (AUV) autonomy as it pertains to self-perception and health monitoring, and we argue that automatic classification of state-sensor data represents an important enabling capability. We apply an online Bayesian nonparametric topic modeling technique to AUV sensor data in order to automatically characterize its performance patterns, then demonstrate how, in combination with operator-supplied semantic labels, these patterns can be used for fault detection and diagnosis by means of a nearest-neighbor classifier. The method is evaluated using data collected by the Monterey Bay Aquarium Research Institute's Tethys long-range AUV in three separate field deployments. Our results show that the proposed method is able to accurately identify and characterize patterns that correspond to various states of the AUV, and classify faults at a high rate of correct detection with a very low false detection rate.
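A sketch of the classification stage only: once the online topic model has summarized each window of state-sensor data as a topic mixture, operator-labeled mixtures can train a nearest-neighbor classifier. All mixtures and labels below are invented for illustration.

```python
# Sketch: nearest-neighbor fault classification over topic mixtures.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Rows: topic mixtures for labeled time windows; labels are operator-supplied.
theta = np.array([[0.90, 0.05, 0.05],   # nominal cruise
                  [0.10, 0.85, 0.05],   # thruster fault
                  [0.80, 0.10, 0.10],   # nominal cruise
                  [0.05, 0.10, 0.85]])  # buoyancy fault
labels = ["nominal", "thruster_fault", "nominal", "buoyancy_fault"]

clf = KNeighborsClassifier(n_neighbors=1).fit(theta, labels)
print(clf.predict([[0.12, 0.80, 0.08]]))   # -> likely "thruster_fault"
```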
17.
Faisal Farooq, Anurag Bhardwaj, Venu Govindaraju. International Journal on Document Analysis and Recognition, 2009, 12(3): 153-164.
Despite several decades of research in document analysis, recognition of unconstrained handwritten documents is still considered a challenging task. Previous research in this area has shown that word recognizers perform adequately on constrained handwritten documents which typically use a restricted vocabulary (lexicon). But in the case of unconstrained handwritten documents, state-of-the-art word recognition accuracy is still below the acceptable limits. The objective of this research is to improve word recognition accuracy on unconstrained handwritten documents by applying a post-processing or OCR correction technique to the word recognition output. In this paper, we present two different methods for this purpose. First, we describe a lexicon reduction-based method by topic categorization of handwritten documents which is used to generate smaller topic-specific lexicons for improving the recognition accuracy. Second, we describe a method which uses topic-specific language models and a maximum-entropy based topic categorization model to refine the recognition output. We present the relative merits of each of these methods and report results on the publicly available IAM database.
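A toy sketch of the lexicon-reduction route: categorize a noisy first-pass transcription by topic, then hand the recognizer a smaller topic-specific lexicon. The paper uses a maximum-entropy categorizer; the naive Bayes classifier, the two topics, and the toy lexicons below are substitutions made for brevity.

```python
# Sketch: topic categorization driving lexicon reduction for re-recognition.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

train_docs = ["invoice payment amount due balance",
              "patient diagnosis treatment dosage",
              "payment receipt total invoice",
              "symptoms prescription patient dosage"]
train_topics = ["finance", "medical", "finance", "medical"]
topic_lexicons = {"finance": {"invoice", "payment", "total", "due"},
                  "medical": {"patient", "dosage", "symptoms", "treatment"}}

vec = TfidfVectorizer()
clf = MultinomialNB().fit(vec.fit_transform(train_docs), train_topics)

# Noisy first-pass OCR output -> predicted topic -> reduced lexicon.
first_pass = "paymnt invoice total due"
topic = clf.predict(vec.transform([first_pass]))[0]
print("topic:", topic, "| reduced lexicon:", sorted(topic_lexicons[topic]))
```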
18.
19.
20.
Calphad, 2021.
The utility of CALPHAD (CALculation of PHAse Diagrams) models and software for the calculation of gas-on-solid adsorption equilibria is demonstrated. Thermodynamic models formulated in the CEF (Compound Energy Formalism) are constructed that account for adsorption on a single site and on several sites, as well as for adsorption in pores and multilayers. Site blocking (the case where the adsorption of a single molecule on a single site also blocks some adjacent sites) may be emulated by the ionic-liquid model within the CEF. The parameters of these thermodynamic models may be determined by fitting to experimental results or to results of ab-initio calculations. Alternatively, the parameters may be calculated from already-fitted adsorption isotherm equations such as Langmuir, dual-site Langmuir, Nitta, and Ruthven's model for adsorption in pores. Pure-gas adsorption models may be extrapolated to mixed-gas adsorption. The thermodynamic consistency of such extrapolation is ensured by the CALPHAD method of modeling and the CEF formalism. Examples include adsorption of various gases on carbon as well as on zeolites and MOFs (Metal-Organic Frameworks).
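As a point of reference for the isotherms named above, here is a minimal fit of the single-site Langmuir isotherm q = q_max*K*p/(1 + K*p) to assumed adsorption data with scipy; constructing the corresponding CEF models in CALPHAD software is a much richer exercise.

```python
# Sketch: fitting the single-site Langmuir isotherm to (assumed) data.
import numpy as np
from scipy.optimize import curve_fit

def langmuir(p, q_max, K):
    """Loading q at pressure p: q = q_max * K p / (1 + K p)."""
    return q_max * K * p / (1.0 + K * p)

p = np.array([0.1, 0.5, 1.0, 2.0, 5.0, 10.0])    # pressure, bar (assumed)
q = np.array([0.9, 3.3, 5.1, 6.9, 8.6, 9.3])     # loading, mol/kg (assumed)

(q_max, K), _ = curve_fit(langmuir, p, q, p0=[10.0, 1.0])
print(f"q_max = {q_max:.2f} mol/kg, K = {K:.2f} 1/bar")
```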