20 similar documents found, search time: 0 ms
1.
Software engineering data are large in volume, have many attributes, and are mostly discrete, so faster and more efficient clustering algorithms are needed to improve mining efficiency. To address this, a kernel-based fuzzy clustering algorithm is applied to source code mining. The TF-IDF method is used to process the discrete textual data, solving the problem that kernel fuzzy clustering cannot cluster text data directly. A genetic algorithm is combined with the KFCM algorithm to overcome KFCM's tendency to converge only to local minima. Experimental results show that the improved KFCM algorithm achieves good clustering quality and high efficiency when mining software engineering data.
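A minimal sketch of the pipeline this abstract describes, assuming a tiny corpus of code-derived text: TF-IDF vectorization followed by a Gaussian-kernel fuzzy c-means loop. The genetic-algorithm initialization mentioned above is omitted, and all parameter values are illustrative assumptions, not the authors' settings.

```python
# Sketch: TF-IDF vectorization + kernel fuzzy c-means (KFCM). Illustrative only.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

def kfcm(X, c=3, m=2.0, sigma=1.0, iters=50, seed=0):
    """Kernel fuzzy c-means with a Gaussian kernel."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    U = rng.random((c, n))
    U /= U.sum(axis=0)                      # fuzzy memberships, columns sum to 1
    V = X[rng.choice(n, c, replace=False)]  # initial cluster centers
    for _ in range(iters):
        d2 = ((X[None, :, :] - V[:, None, :]) ** 2).sum(axis=2)
        K = np.exp(-d2 / sigma ** 2)        # kernel similarity, center vs. point
        # membership update derived from the kernelized objective
        w = (1.0 / np.maximum(1.0 - K, 1e-12)) ** (1.0 / (m - 1.0))
        U = w / w.sum(axis=0)
        # center update: kernel-weighted mean of the documents
        num = (U ** m * K) @ X
        den = (U ** m * K).sum(axis=1, keepdims=True)
        V = num / np.maximum(den, 1e-12)
    return U, V

docs = ["open file read buffer", "read socket buffer close",
        "parse token syntax tree", "syntax tree walk node"]
X = TfidfVectorizer().fit_transform(docs).toarray()
U, _ = kfcm(X, c=2)
print(U.argmax(axis=0))   # hard cluster assignment per document
```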
2.
3.
Predicting source code changes by mining change history (total citations: 1; self-citations: 0; citations by others: 1)
Ying A.T.T. Murphy G.C. Ng R. Chu-Carroll M.C. 《IEEE transactions on software engineering》2004,30(9):574-586
Software developers are often faced with modification tasks that involve source code spread across a code base. Some dependencies between source code, such as those between source code written in different languages, are difficult to determine using existing static and dynamic analyses. To augment existing analyses and to help developers identify relevant source code during a modification task, we have developed an approach that applies data mining techniques to determine change patterns - sets of files that were changed together frequently in the past - from the change history of the code base. Our hypothesis is that the change patterns can be used to recommend potentially relevant source code to a developer performing a modification task. We show that this approach can reveal valuable dependencies by applying it to the Eclipse and Mozilla open source projects and by evaluating the predictability and interestingness of the recommendations produced for actual modification tasks on these systems.
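A minimal sketch of the co-change idea described above, using a hypothetical change history and restricting patterns to file pairs; the paper mines larger itemsets from real revision histories.

```python
# Sketch: find pairs of files that were frequently committed together, then use
# them to recommend related files during a modification task.
from itertools import combinations
from collections import Counter

change_history = [                      # one set of files per commit (made-up data)
    {"ui/Editor.java", "ui/Editor.properties"},
    {"ui/Editor.java", "ui/Editor.properties", "core/Model.java"},
    {"core/Model.java", "core/ModelTest.java"},
    {"ui/Editor.java", "ui/Editor.properties"},
]

MIN_SUPPORT = 2
pair_counts = Counter()
for commit in change_history:
    for pair in combinations(sorted(commit), 2):
        pair_counts[pair] += 1

patterns = {p: n for p, n in pair_counts.items() if n >= MIN_SUPPORT}

def recommend(changed_file):
    """Files that frequently changed together with `changed_file` in the past."""
    hits = []
    for (a, b), n in patterns.items():
        if changed_file == a:
            hits.append((b, n))
        elif changed_file == b:
            hits.append((a, n))
    return sorted(hits, key=lambda t: -t[1])

print(recommend("ui/Editor.java"))
```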
4.
V. P. Ivannikov A. A. Belevantsev A. E. Borodin V. N. Ignatiev D. M. Zhurikhin A. I. Avetisyan 《Programming and Computer Software》2014,40(5):265-275
This paper describes Svace, a tool for static program analysis developed at the Institute for Systems Programming, Russian Academy of Sciences. The tool finds defects and potential vulnerabilities in source code written in C/C++. Its main features are simplicity of use, a wide variety of supported warning types, scalability to programs of millions of lines of code, and acceptable analysis quality (30–80% true positive warnings).
5.
6.
Mariano Ceccato Massimiliano Di Penta Paolo Falcarin Filippo Ricca Marco Torchiano Paolo Tonella 《Empirical Software Engineering》2014,19(4):1040-1074
Context: code obfuscation is intended to obstruct code understanding and, eventually, to delay malicious code changes and ultimately render them uneconomical. Although code understanding cannot be completely impeded, code obfuscation makes it more laborious and troublesome, so as to discourage or delay code tampering. Despite the extensive adoption of obfuscation, its assessment has been addressed only indirectly, either through internal metrics or from the point of view of code analysis, e.g., by considering the associated computational complexity. To the best of our knowledge, there is no publicly available user study that measures the cost of understanding obfuscated code from the point of view of a human attacker. Aim: this paper experimentally assesses the impact of code obfuscation on the capability of human subjects to understand and change source code. In particular, it considers code protected with two well-known code obfuscation techniques, i.e., identifier renaming and opaque predicates. Method: we conducted a family of five controlled experiments involving undergraduate and graduate students from four universities. During the experiments, subjects had to perform comprehension or attack tasks on decompiled clients of two Java network-based applications, either obfuscated using one of the two techniques or left unobfuscated. To assess and compare the obfuscation techniques, we measured the correctness and the efficiency of the performed tasks. Results: at least for the tasks we considered, simpler techniques (i.e., identifier renaming) proved more effective than more complex ones (i.e., opaque predicates) at preventing subjects from completing attack tasks.
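A minimal sketch of identifier renaming, one of the two obfuscation techniques studied above, applied to a small Python snippet via the standard ast module; the experiments themselves targeted decompiled Java clients, so this only illustrates the transformation.

```python
# Sketch: consistently rename user-defined identifiers to meaningless names.
import ast
import builtins

SOURCE = """
def average(values):
    total = 0
    for value in values:
        total += value
    return total / len(values)
"""

SKIP = set(dir(builtins))          # do not rename builtins such as len

class Renamer(ast.NodeTransformer):
    def __init__(self):
        self.mapping = {}
    def _obfuscate(self, name):
        return self.mapping.setdefault(name, f"v{len(self.mapping)}")
    def visit_Name(self, node):
        if node.id not in SKIP:
            node.id = self._obfuscate(node.id)
        return node
    def visit_arg(self, node):
        node.arg = self._obfuscate(node.arg)
        return node
    def visit_FunctionDef(self, node):
        node.name = self._obfuscate(node.name)
        self.generic_visit(node)
        return node

tree = Renamer().visit(ast.parse(SOURCE))
print(ast.unparse(tree))           # same behavior, uninformative names
```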
7.
《Information and Software Technology》2014,56(2):183-198
Context: Software development projects involve the use of a wide range of tools to produce a software artifact. Software repositories such as source control systems have become a focus for emergent research because they are a source of rich information regarding software development projects. The mining of such repositories is becoming increasingly common with a view to gaining a deeper understanding of the development process. Objective: This paper explores the concepts of representing a software development project as a process that results in the creation of a data stream. It also describes the extraction of metrics from the Jazz repository and the application of data stream mining techniques to identify useful metrics for predicting build success or failure. Method: This research is a systematic study using the Hoeffding Tree classification method in conjunction with the Adaptive Sliding Window (ADWIN) method for detecting concept drift, applied through the Massive Online Analysis (MOA) tool. Results: The results indicate that only a relatively small number of the available measures considered have any significance for predicting the outcome of a build over time. These significant measures are identified and the implications of the results discussed, particularly the relative difficulty of predicting failed builds. The Hoeffding Tree approach is shown to produce a more stable and robust model than traditional data mining approaches. Conclusion: Overall prediction accuracies of 75% have been achieved through the use of the Hoeffding Tree classification method. Despite this high overall accuracy, there is greater difficulty in predicting failure than success. The emergence of a stable classification tree is limited by the lack of data, but overall the approach shows promise in terms of informing software development activities in order to minimize the chance of failure.
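The study relies on MOA's Hoeffding Tree and ADWIN (Java tools). As a rough Python stand-in, the sketch below reproduces only the evaluation setup: prequential (test-then-train) learning over a synthetic stream of build records, with a naive two-window error comparison standing in for ADWIN. All field values and thresholds are assumptions.

```python
# Sketch: prequential stream classification with a naive windowed drift check.
from collections import deque
import numpy as np
from sklearn.linear_model import SGDClassifier

rng = np.random.default_rng(0)

def next_build(t):
    """Hypothetical stream of (metrics, build_failed) records with a concept shift."""
    x = rng.normal(size=4)
    w0, w1 = (1.0, 0.5) if t < 1000 else (-1.0, 1.5)   # concept changes at record 1000
    y = int(w0 * x[0] + w1 * x[1] + rng.normal(scale=0.3) > 0)
    return x, y

model = SGDClassifier()
recent, older = deque(maxlen=100), deque(maxlen=100)   # two windows of 0/1 errors

for t in range(2000):
    x, y = next_build(t)
    X = x.reshape(1, -1)
    if t > 0:                                          # test ...
        err = int(model.predict(X)[0] != y)
        if len(recent) == recent.maxlen:
            older.append(recent[0])                    # oldest error rolls into the older window
        recent.append(err)
    model.partial_fit(X, [y], classes=[0, 1])          # ... then train
    # naive drift check: recent errors markedly worse than the older window
    if len(older) == older.maxlen and np.mean(recent) > np.mean(older) + 0.2:
        print("possible concept drift near record", t)
        older.clear()
```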
8.
There are a number of reasons why one might wish to transform the source code of an operational program:
1. To make the program conform to a standard layout.
2. To make the program conform to syntax and semantics standards.
3. To improve the performance of the program.
Whether such a transformation is worthwhile depends on weighing several factors:
1. The benefit to be realized from transformation.
2. The cost of transformation.
3. The time involved in transformation.
4. The risk associated with transformation.
9.
Automatic evaluation of metadata quality in digital repositories (total citations: 1; self-citations: 0; citations by others: 1)
Owing to the recent developments in automatic metadata generation and interoperability between digital repositories, the production of metadata is now vastly surpassing manual quality control capabilities. Abandoning quality control altogether is problematic, because low-quality metadata compromise the effectiveness of services that repositories provide to their users. To address this problem, we present a set of scalable quality metrics for metadata based on the Bruce & Hillman framework for metadata quality control. We perform three experiments to evaluate our metrics: (1) the degree of correlation between the metrics and manual quality reviews, (2) the discriminatory power between metadata sets, and (3) the usefulness of the metrics as low-quality filters. Through statistical analysis, we found that several metrics, especially Text Information Content, correlate well with human evaluation, and that the average of all the metrics is roughly as effective as human reviewers at flagging low-quality instances. The implications of this finding are discussed. Finally, we propose possible applications of the metrics to improve tools for the administration of digital repositories.
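A minimal sketch of one of the scalable metrics discussed above: (weighted) completeness of a metadata record over a fixed field set. The field list, weights, and example record are assumptions, not the paper's exact definitions.

```python
# Sketch: completeness and weighted completeness of a metadata record.
DC_FIELDS = ["title", "creator", "subject", "description", "date", "format", "rights"]
WEIGHTS   = {"title": 3, "description": 3, "creator": 2,
             "subject": 2, "date": 1, "format": 1, "rights": 1}

def completeness(record):
    """Fraction of fields that are present and non-empty."""
    filled = [f for f in DC_FIELDS if record.get(f)]
    return len(filled) / len(DC_FIELDS)

def weighted_completeness(record):
    """Same idea, but important fields count more."""
    total = sum(WEIGHTS.values())
    return sum(WEIGHTS[f] for f in DC_FIELDS if record.get(f)) / total

record = {"title": "Intro to Petri nets", "creator": "A. Author",
          "description": "", "date": "2011"}
print(completeness(record), weighted_completeness(record))
```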
10.
M. I. Glukhikh V. M. Itsykson V. A. Tsesko 《Automatic Control and Computer Sciences》2012,46(7):338-344
This paper considers the development of dependency analysis methods intended to improve the precision of static code analysis. Reasons for precision loss when detecting defects in program source code with abstract interpretation methods are explained. The need to extract and interpret dependencies between program objects is justified by numerous real-world examples, and a classification of dependencies is presented. The necessity of analyzing values and dependencies together is considered. Dependency extraction from assignment statements is described, and dependency interpretation based on logical inference using logic and arithmetic rules is proposed. The proposed methods are implemented in the defect detection tool Digitek Aegis, and a significant increase in precision is shown.
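A minimal sketch of the dependency-extraction step described above, here applied to Python assignment statements via the ast module; the paper targets a different analysis framework and language, so this only illustrates the idea.

```python
# Sketch: record which variables each assigned variable depends on.
import ast
from collections import defaultdict

SOURCE = """
a = b + c
d = a * 2
e = d - b
"""

deps = defaultdict(set)
for node in ast.walk(ast.parse(SOURCE)):
    if isinstance(node, ast.Assign):
        # names read on the right-hand side
        read = {n.id for n in ast.walk(node.value) if isinstance(n, ast.Name)}
        for target in node.targets:
            for written in ast.walk(target):
                if isinstance(written, ast.Name):
                    deps[written.id] |= read

for var in sorted(deps):
    print(var, "depends on", sorted(deps[var]))
# e.g. 'd' depends on ['a']; a transitive closure would also add 'b' and 'c'.
```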
11.
In this paper we describe the results of a study of the insertion of checkpoints within a legacy software system in the aerospace domain. The purpose of the checkpoints was to improve fault tolerance during program execution by rolling back system control to a saved state from which execution can continue. The study used novice programmers to determine where the checkpoints should be added; the focus was on the programmers' understanding of the code, since this affected how the checkpoints were placed. The results should provide guidance to those interested in improving the fault tolerance of legacy software systems, especially those written in older, nearly obsolete programming languages.
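A minimal sketch of the checkpoint-and-rollback idea discussed above: snapshot the relevant program state before a risky phase and restore it on failure. The original study concerned a legacy aerospace system in an older language; this toy analogue only illustrates the mechanism.

```python
# Sketch: save a state snapshot, roll back to it when the risky phase fails.
import copy

def checkpoint(state):
    return copy.deepcopy(state)          # saved snapshot (could be pickled to disk)

def risky_phase(state, fail=False):
    state["step"] += 1
    state["readings"].append(42)
    if fail:
        raise RuntimeError("sensor fault")
    return state

state = {"step": 0, "readings": []}
saved = checkpoint(state)
try:
    state = risky_phase(state, fail=True)
except RuntimeError:
    state = checkpoint(saved)            # roll back and continue from the snapshot
    print("rolled back to", state)
```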
12.
Data mining is the non-trivial process of revealing implicit, previously unknown, and potentially useful information from large volumes of data in databases. Visual data mining techniques are used to find patterns in football match datasets. These patterns can directly or indirectly provide useful insights into matches and can be applied in decision support systems during play.
13.
In a large software system, knowing which files are most likely to be fault-prone is valuable information for project managers. They can use such information in prioritizing software testing and allocating resources accordingly. However, our experience shows that it is difficult to collect and analyze fine-grained test defects in a large and complex software system. On the other hand, previous research has shown that companies can safely use cross-company data with nearest neighbor sampling to predict their defects in case they are unable to collect local data. In this study we analyzed 25 projects of a large telecommunication system. To predict defect proneness of modules we trained models on publicly available NASA MDP data. In our experiments we used static call graph-based ranking (CGBR) as well as nearest neighbor sampling for constructing method-level defect predictors. Our results suggest that, for the analyzed projects, at least 70% of the defects can be detected by inspecting only (i) 6% of the code using a Naïve Bayes model, or (ii) 3% of the code using the CGBR framework.
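A minimal sketch of the cross-company setup described above: filter cross-company training data (NASA MDP style) with nearest-neighbor sampling, then train a Naive Bayes defect predictor. The data here is synthetic, the neighbor count is an assumption, and the CGBR ranking is not reproduced.

```python
# Sketch: nearest-neighbor sampling of cross-company data + Naive Bayes predictor.
import numpy as np
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(1)
X_cross = rng.normal(size=(500, 5))                      # cross-company module metrics
y_cross = (X_cross[:, 0] + X_cross[:, 1] > 1).astype(int)
X_local = rng.normal(loc=0.3, size=(80, 5))              # local (unlabeled) modules

# keep only cross-company modules that are close to some local module
nn = NearestNeighbors(n_neighbors=10).fit(X_cross)
_, idx = nn.kneighbors(X_local)
selected = np.unique(idx.ravel())

model = GaussianNB().fit(X_cross[selected], y_cross[selected])
scores = model.predict_proba(X_local)[:, 1]              # defect-proneness scores
print("inspect first:", np.argsort(-scores)[:5])         # highest-risk local modules
```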
14.
Source code documentation often contains summaries of source code written by authors. Recently, automatic source code summarization tools have emerged that generate summaries without requiring author intervention. These summaries are designed so that readers can understand the high-level concepts of the source code. Unfortunately, there is no agreed-upon understanding of what makes up a "good summary." This paper presents an empirical study examining summaries of source code written by authors, readers, and automatic source code summarization tools. The study examines the textual similarity between source code and summaries of source code using Short Text Semantic Similarity metrics. We found that readers use source code in their summaries more than authors do. Additionally, the study finds that the accuracy of a human-written summary can be estimated by the textual similarity of that summary to the source code.
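A minimal sketch of measuring the textual similarity between source code and a summary, in the spirit of the study above, using plain TF-IDF cosine similarity rather than the paper's Short Text Semantic Similarity metrics; all strings are invented examples.

```python
# Sketch: how much vocabulary does a summary share with the code it describes?
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

code = "def read_config(path): parse the yaml file at path and return a dict of settings"
author_summary = "loads application settings from a configuration file"
reader_summary = "reads the yaml file at the given path and returns the settings dict"

vec = TfidfVectorizer().fit([code, author_summary, reader_summary])
M = vec.transform([code, author_summary, reader_summary])
sim = cosine_similarity(M)
print("author vs code :", sim[0, 1])
print("reader vs code :", sim[0, 2])   # typically higher: readers reuse code terms
```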
15.
Statistical process control to improve coding and code review (total citations: 1; self-citations: 0; citations by others: 1)
A software process comprises activities such as estimation, planning, requirements analysis, design, coding, reviews, and testing, undertaken when creating a software product. Effective software process management involves proactively managing each of these activities. Statistical process control tools enable proactive software process management; one such tool, the control chart, can be used for managing, controlling, and improving the code review process.
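A minimal sketch of the control-chart idea above: compute 3-sigma control limits for the defect density observed in code reviews from a baseline period, then flag later reviews that fall outside the limits. All numbers are made up.

```python
# Sketch: individuals-style control limits for review defect density (defects/KLOC).
import numpy as np

baseline = np.array([6.1, 5.4, 7.0, 5.9, 6.4, 5.7, 6.2, 6.6])   # stable period
new_reviews = {"review_9": 5.8, "review_10": 12.3}

center = baseline.mean()
sigma = baseline.std(ddof=1)
ucl = center + 3 * sigma
lcl = max(center - 3 * sigma, 0.0)            # density cannot be negative

print(f"CL={center:.2f}  UCL={ucl:.2f}  LCL={lcl:.2f}")
for name, x in new_reviews.items():
    status = "in control" if lcl <= x <= ucl else "OUT OF CONTROL - investigate"
    print(f"{name}: {x} -> {status}")
```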
16.
Bojic D. Eisenbarth T. Koschke R. Simon D. Velasevic D. 《IEEE transactions on software engineering》2004,30(2):140
For the original paper by T. Eisenbarth et al., see ibid., vol. 29, no. 3, pp. 210-224 (2003). We compare three approaches that apply formal concept analysis to execution profiles. This survey extends the discussion of related research by Bojic and Velasevic (2000).
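A minimal sketch of applying formal concept analysis to execution profiles, in the spirit of the approaches compared above: objects are usage scenarios, attributes are the routines each scenario executed. The naive enumeration shown here is for illustration only; real tools use far more efficient concept-generation algorithms, and the profile data is invented.

```python
# Sketch: enumerate formal concepts of a scenario-by-routine execution relation.
from itertools import combinations

profiles = {                      # scenario -> routines observed during execution
    "open_file": {"ui_dialog", "fs_read", "parse"},
    "save_file": {"ui_dialog", "fs_write", "serialize"},
    "autosave":  {"fs_write", "serialize"},
    "preview":   {"parse", "render"},
}

def common_routines(scenarios):
    return set.intersection(*(profiles[s] for s in scenarios))

concepts = set()
names = list(profiles)
for r in range(1, len(names) + 1):
    for group in combinations(names, r):
        intent = common_routines(group)                         # shared routines
        extent = frozenset(s for s in names if intent <= profiles[s])
        concepts.add((extent, frozenset(intent)))

for extent, intent in sorted(concepts, key=lambda c: -len(c[0])):
    print(sorted(extent), "share", sorted(intent) or "nothing")
```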
17.
Tse-Hsun Chen Stephen W. Thomas Ahmed E. Hassan 《Empirical Software Engineering》2016,21(5):1843-1919
Researchers in software engineering have attempted to improve software development by mining and analyzing software repositories. Since the majority of software engineering data is unstructured, researchers have applied Information Retrieval (IR) techniques to help software development. Recent advances in IR, especially statistical topic models, have further helped make sense of unstructured data in software repositories. However, even though there are hundreds of studies on applying topic models to software repositories, there is no study that shows how the models are used in the software engineering research community, and which software engineering tasks are being supported through topic models. Moreover, since the performance of these topic models is directly related to the model parameters and usage, knowing how researchers use the topic models may also help future studies make optimal use of such models. Thus, we surveyed 167 articles from the software engineering literature that make use of topic models. We find that i) most studies centre around a limited number of software engineering tasks; ii) most studies use only basic topic models; and iii) researchers usually treat topic models as black boxes without fully exploring their underlying assumptions and parameter values. Our paper provides a starting point for new researchers who are interested in using topic models, and may help new researchers and practitioners determine how to best apply topic models to a particular software engineering task.
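A minimal sketch of the basic usage pattern the survey describes: fit an LDA topic model over a small corpus of hypothetical bug-report texts and inspect the top words per topic. The corpus, topic count, and vectorizer settings are assumptions.

```python
# Sketch: LDA over a toy corpus of bug reports, printing the top words per topic.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

bug_reports = [
    "null pointer exception when saving the project file",
    "crash on save file dialog with unicode path",
    "login fails when the session token expires",
    "authentication token refresh returns http 401",
    "rendering is slow when the scene has many textures",
    "frame rate drops after loading large texture atlas",
]

vec = CountVectorizer(stop_words="english")
X = vec.fit_transform(bug_reports)
lda = LatentDirichletAllocation(n_components=3, random_state=0).fit(X)

terms = vec.get_feature_names_out()
for k, weights in enumerate(lda.components_):
    top = weights.argsort()[-4:][::-1]          # indices of the four heaviest terms
    print(f"topic {k}:", [terms[i] for i in top])
```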
18.
19.
Josep Carmona 《Data mining and knowledge discovery》2012,24(1):218-246
Traces are everywhere, from information systems that store their continuous executions to any type of health care application that records each patient's history. The transformation of a set of traces into a mathematical model that can be used for formal reasoning is therefore of great value. The discovery of process models from traces is an interesting problem that has received significant attention in recent years. This is a central problem in Process Mining, a novel area that tries to close the cycle between system design and validation by resorting to methods for the automated discovery, analysis, and extension of process models. In this work, algorithms for the derivation of a Petri net from a set of traces are presented. The methods are grounded in the theory of regions, which maps a model in the state-based domain (e.g., an automaton) into a model in the event-based domain (e.g., a Petri net). When dealing with large examples, a direct application of the theory of regions suffers from two problems. The first is the state-explosion problem: the resources required by algorithms that work at the state level are sometimes prohibitive. This paper introduces decomposition and projection techniques to alleviate the complexity of the region-based algorithms for Petri net discovery, thus extending their applicability to large inputs. The second problem is the overfitting problem of region-based approaches, which informally means that, in order to represent the trace set with high accuracy, the models obtained are often spaghetti-like. By focusing on a special type of process, called conservative, for which an elegant theory and efficient algorithms can be devised, the techniques presented in this paper alleviate the overfitting problem and, moreover, incorporate structure into the models generated.
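A minimal sketch of the first step behind the region-based approach described above: turning a set of traces into a state-based model (a prefix-closed transition system). The region computation and Petri net synthesis, which are the substance of the paper, are omitted; the traces are invented.

```python
# Sketch: build a transition system (prefix automaton) from a set of traces.
traces = [
    ["register", "check", "pay", "ship"],
    ["register", "check", "reject"],
    ["register", "pay", "check", "ship"],
]

states = {(): 0}                 # state = prefix of events seen so far
transitions = set()              # (source_state, event, target_state)
for trace in traces:
    prefix = ()
    for event in trace:
        nxt = prefix + (event,)
        if nxt not in states:
            states[nxt] = len(states)
        transitions.add((states[prefix], event, states[nxt]))
        prefix = nxt

print(len(states), "states,", len(transitions), "transitions")
for src, ev, dst in sorted(transitions):
    print(f"s{src} --{ev}--> s{dst}")
```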