首页 | 本学科首页   官方微博 | 高级检索  
相似文献
 共查询到20条相似文献,搜索用时 15 毫秒
1.
软件缺陷预测通常针对代码表面特征训练预测模型并对新样本进行预测,忽视了代码背后隐藏的不同技术方面和主题,从而导致预测不准确。针对这种问题,提出了一种基于主题模型的软件缺陷预测方法。将软件代码库视为不同技术方面和主题的集合,不同的主题或技术方面有不同的缺陷倾向。采用LDA主题模型对不同主题及其缺陷倾向进行建模,根据建模结果计算主题度量,并将传统度量方式和主题度量结合进行模型训练和预测。实验结果显示,该方法相对传统的软件缺陷预测技术有高的准确性,并且可以在软件演化中保证模型相对稳定,可以适用于各种缺陷预测任务。  相似文献   

2.
理解软件代码的功能是软件复用的一个重要环节。基于主题建模技术的代码理解方法能够挖掘软件代码中潜在的主题,这些主题在一定程度上代表了软件代码所实现的功能。但是使用主题建模技术所挖掘出的代码主题有着语义模糊、难以理解的弊端。潜在狄利克雷分配(Latent Dirichlet Allocation,LDA)技术是一种比较常用的主题建模技术, 其在软件代码主题挖掘领域已取得了较好的结果,但同样存在上述问题。为此,需要为主题生成解释性文本描述。基于LDA的软件代码主题摘要自动生成方法除了利用主题建模技术对源代码生成主题之外,还利用文档、问答信息等包含软件系统功能描述的各类软件资源挖掘出代码主题的描述文本并提取摘要,从而能够更好地帮助开发人员理解软件的功能。  相似文献   

3.
An inevitable consequence of the technology-driven economy has led to the increased importance of intellectual property protection through patents. Recent global pro-patenting shifts have further resulted in high technology overlaps. Technology components are now spread across a huge corpus of patent documents making its interpretation a knowledge-intensive engineering activity. Intelligent collaborative patent mining facilitates the integration of inputs from patented technology components held by diverse stakeholders. Topic generative models are powerful natural language tools used to decompose data corpus topics and associated word bag distributions. This research develops and validates a superior text mining methodology, called Excessive Topic Generation (ETG), as a preprocessing framework for topic analysis and visualization. The presented ETG methodology adapts the topic generation characteristics from Latent Dirichlet Allocation (LDA) with added capability to generate word distance relationships among key terms. The novel ETG approach is used as the core process for intelligent collaborative patent mining. A case study of 741 global Industrial Immersive Technology (IIT) patents covering inventive and novel concepts of Virtual Reality (VR), Augmented Reality (AR), and Brain Machine Interface (BMI) are systematically processed and analyzed using the proposed methodology. Based on the discovered topics of the IIT patents, patent classification (IPC/CPC) predictions are analyzed to validate the superior ETG results.  相似文献   

4.
韩俊明  王炜 《计算机科学》2015,42(Z11):464-466, 489
演化是软件生命周期中一个重要的部分。现在有大量软件已经演化了数个版本,而如何确认演化后的软件与演化目的相符合,成为了一个需要解决的问题。由于目前还没有一个系统的方法来处理此类问题,提出了采用LDA主题模型的方法对演化确认进行建模分析。用LDA方法对软件源代码中的某些特征进行建模,通过模型能够分析出源代码内潜在的主题。将提取分析出来的主题与软件演化发布的相关报告做对比,找出它们之间的区别,以此确认演化后的软件是否符合演化目的。  相似文献   

5.
ABSTRACT

Due to the widespread usage of electronic devices and the growing popularity of social media, a lot of text data is being generated at the rate never seen before. It is not possible for humans to read all data generated and find what is being discussed in his field of interest. Topic modeling is a technique to identify the topics present in a large set of text documents. In this paper, we have discussed the widely used techniques and tools for topic modeling. There has been a lot of research on topic modeling in English, but there is not much progress in the resource-scarce languages like Hindi despite Hindi being spoken by millions of people across the world. In this paper, we have discussed the challenges faced in developing topic models for Hindi. We have applied Latent Semantic Indexing (LSI), Non-negative Matrix Factorization (NMF), and Latent Dirichlet Allocation (LDA) algorithms for topic modeling in Hindi. The outcomes of the topic model algorithms are usually difficult to interpret for the common user. We have used various visualization techniques to represent the outcomes of topic modeling in a meaningful way. Then we have used the metrics like perplexity and coherence to evaluate the topic models. The results of Topic modeling in Hindi seem to be promising and comparable to some results reported in the literature on English datasets.  相似文献   

6.
Researchers across the globe have been increasingly interested in the manner in which important research topics evolve over time within the corpus of scientific literature. In a dataset of scientific articles, each document can be considered to comprise both the words of the document itself and its citations of other documents. In this paper, we propose a citation- content-latent Dirichlet allocation (LDA) topic discovery method that accounts for both document citation relations and the con-tent of the document itself via a probabilistic generative model. The citation-content-LDA topic model exploits a two-level topic model that includes the citation information for ‘father’ topics and text information for sub-topics. The model parameters are estimated by a collapsed Gibbs sampling algorithm. We also propose a topic evolution algorithm that runs in two steps: topic segmentation and topic dependency relation calculation. We have tested the proposed citation-content-LDA model and topic evolution algorithm on two online datasets, IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI) and IEEE Computer Society (CS), to demonstrate that our algorithm effectively discovers important topics and reflects the topic evolution of important research themes. According to our evaluation metrics, citation-content-LDA outperforms both content-LDA and citation-LDA.  相似文献   

7.
针对现有模型无法进行微博主题情感演化分析的问题,提出一种基于主题情感混合模型(TSCM)和情感周期性理论的主题情感演化模型——动态主题情感混合模型(DTSCM)。DTSCM通过捕获不同时间片中微博消息集的主题和情感,追踪不同时间片内主题与情感的变化趋势,获得主题情感演化图,从而实现主题和情感的演化分析。真实微博数据集上的实验结果表明,与当前优秀代表算法JST(Joint Sentiment/Topic)、S-LDA(Sentiment-Latent Dirichlet Allocation)和DPLDA(Dependency Phrases-Latent Dirichlet Allocation)相比,该方法的情感分类准确率分别提高了3.01%、4.33%和8.75%,并且可以获得主题情感演化图。这表明该方法具有更高的情感分类准确率并且可以进行微博主题情感演化分析,为舆情分析等应用提供了较好的帮助。  相似文献   

8.
ContextTopic models such as probabilistic Latent Semantic Analysis (pLSA) and Latent Dirichlet Allocation (LDA) have demonstrated success in mining software repository tasks. Understanding software change messages described by the unstructured nature-language text is one of the fundamental challenges in mining these messages in repositories.ObjectiveWe seek to present a novel automatic change message classification method characterized by semi-supervised topic semantic analysis.MethodIn this work, we present a semi-supervised LDA based approach to automatically classify change messages. We use domain knowledge of software changes to make labeled samples which are added to build the semi-supervised LDA model. Next, we verify the cross-project analysis application of our method on three open-source projects. Our method has two advantages over existing software change classification methods: First of all, it mitigates the issue of how to set the appropriate number of latent topics. We do not have to choose the number of latent topics in our method, because it corresponds to the number of class labels. Second, this approach utilizes the information provided by the label samples in the training set.ResultsOur method automatically classified about 85% of the change messages in our experiment and our validation survey showed that 70.56% of the time our automatic classification results were in agreement with developer opinions.ConclusionOur approach automatically classifies most of the change messages which record the cause of the software change and the method is applicable to cross-project analysis of software change messages.  相似文献   

9.
话题跟踪中静态和动态话题模型的核捕捉衰减   总被引:1,自引:0,他引:1  
洪宇  仓玉  姚建民  周国栋  朱巧明 《软件学报》2012,23(5):1100-1119
话题跟踪是一项针对新闻话题进行相关信息识别、挖掘和自组织的研究课题,其关键问题之一是如何建立符合话题形态的统计模型.话题形态的研究涉及两个问题,其一是话题的结构特性,其二是话题变形.对比分析了现有词包式、层次树式和链式这3类主流话题模型的形态特征,尤其深入探讨了静态和动态话题模型拟合话题脉络的优势和劣势,并提出一种基于特征重叠比的核捕捉衰减评价策略,专门用于衡量静态和动态话题模型追踪话题发展趋势的能力.在此基础上,分别给出突发式增量式学习方法和时序事件链的更新算法,借以提高动态话题模型的核捕捉性能.实验基于国际标准评测语料TDT4,采用NIST(National Institute of Standards and Technology)提出的最小检测错误权衡系数评测法,并结合所提出的核捕捉衰减评价方法,对各类主要话题模型进行测试.实验结果显示,结构化的动态话题模型具有最佳的跟踪性能,且突发式增量式学习和时序事件链的更新算法分别给予动态话题模型0.4%和3.3%的性能改进.  相似文献   

10.
Large repositories of source code available over the Internet, or within large organizations, create new challenges and opportunities for data mining and statistical machine learning. Here we first develop Sourcerer, an infrastructure for the automated crawling, parsing, fingerprinting, and database storage of open source software on an Internet-scale. In one experiment, we gather 4,632 Java projects from SourceForge and Apache totaling over 38 million lines of code from 9,250 developers. Simple statistical analyses of the data first reveal robust power-law behavior for package, method call, and lexical containment distributions. We then develop and apply unsupervised, probabilistic, topic and author-topic (AT) models to automatically discover the topics embedded in the code and extract topic-word, document-topic, and AT distributions. In addition to serving as a convenient summary for program function and developer activities, these and other related distributions provide a statistical and information-theoretic basis for quantifying and analyzing source file similarity, developer similarity and competence, topic scattering, and document tangling, with direct applications to software engineering an software development staffing. Finally, by combining software textual content with structural information captured by our CodeRank approach, we are able to significantly improve software retrieval performance, increasing the area under the curve (AUC) retrieval metric to 0.92– roughly 10–30% better than previous approaches based on text alone. A prototype of the system is available at: . Erik Linstead, Sushil Bajracharya, and Trung Ngo have contributed equally to this work.  相似文献   

11.
主题情感混合模型可以同时提取语料的主题信息和情感倾向。针对短文本特征稀疏的问题,主题情感联合分析方法较少的问题,该文提出了BJSTM模型(Biterm Joint Sentiment Topic Model),在BTM模型(Biterm Topic Model)的基础上,增加情感层的设置,从而形成“情感-主题-词汇”的三层贝叶斯模型。对每个双词的情感和主题进行采样,从而对整个语料的词共现关系建模,一定程度上克服了短文本的稀疏性。实验表明,BJSTM模型在无监督情感分类和主题提取方面都有不错的表现。  相似文献   

12.
陈浙哲  鄢萌  夏鑫  刘忠鑫  徐洲  雷晏 《软件学报》2022,33(8):3015-3034
代码自然性(code naturalness)研究是自然语言处理领域和软件工程领域共同的研究热点之一,旨在通过构建基于自然语言处理技术的代码自然性模型,以解决各种软件工程任务.近年来,随着开源软件社区中源代码和数据规模的不断扩大,越来越多的研究人员注重钻研源代码中蕴藏的信息,并且取得了一系列研究成果.但与此同时,代码自然性研究在代码语料库构建、模型构建和任务应用等环节面临许多挑战.鉴于此,从代码自然性技术的代码语料库构建、模型构建和任务应用等方面对近年来代码自然性研究及应用进展进行梳理和总结.主要内容包括:(1)介绍了代码自然性的基本概念及其研究概况;(2)归纳目前代码自然性研究的语料库,并对代码自然性模型建模方法进行分类与总结;(3)总结代码自然性模型的实验验证方法和模型评价指标;(4)总结并归类了目前代码自然性的应用现状;(5)归纳代码自然性技术的关键问题;(6)展望代码自然性技术的未来发展.  相似文献   

13.
Software repositories provide a deluge of software artifacts to analyze. Researchers have attempted to summarize, categorize, and relate these artifacts by using semi-unsupervised machine-learning algorithms, such as Latent Dirichlet Allocation (LDA). LDA is used for concept and topic analysis to suggest candidate word-lists or topics that describe and relate software artifacts. However, these word-lists and topics are difficult to interpret in the absence of meaningful summary labels. Current attempts to interpret topics assume manual labelling and do not use domain-specific knowledge to improve, contextualize, or describe results for the developers. We propose a solution: automated labelled topic extraction. Topics are extracted using LDA from commit-log comments recovered from source control systems. These topics are given labels from a generalizable cross-project taxonomy, consisting of non-functional requirements. Our approach was evaluated with experiments and case studies on three large-scale Relational Database Management System (RDBMS) projects: MySQL, PostgreSQL and MaxDB. The case studies show that labelled topic extraction can produce appropriate, context-sensitive labels that are relevant to these projects, and provide fresh insight into their evolving software development activities.  相似文献   

14.
对外汉语教学领域,教材上的课文通常围绕一个话题展开,话题是教学内容的集中体现,也与词汇、语法等不同层面的语言知识间有着密切关联。该文基于大规模教材语料库研究教学话题分类体系,设计了一个包含四个一级话题、23个二级话题和246个三级话题的三层话题框架,并据此对197册汉语经典教材中的5 457个文段进行了人工标注及校对,构建了一个规模约12万句的面向对外汉语教学的话题语料库。为了更好地服务于汉语教学及相关研究工作,还抽取、计算了文段的语法点和新HSK词语等级信息,作为话题标注的补充维度加入资源库,以期为汉语教学领域的教师、研究者及教材编写者提供较为全面的话题信息参考。
  相似文献   

15.
话题演化分析是舆情监控的研究热点之一,面向微博热点话题进行演化分析,对于网络用户以及网络监管部门都有很重要的现实意义。针对在线词对主题模型(On-line Biterm Topic Model,OBTM)新旧主题混合、冗余词概率相对较高的问题,对OBTM进行改进,提出基于话题标签和先验参数的OBTM模型(Topic Labels and Prior Parameters OBTM,LPOBTM)。根据微博热点话题的话题标签,将微博文本集区分为含话题标签和不含话题标签的两类数据集,并设置不同的文档-主题先验参数;在前一时间片文档-主题概率分布的基础上,借鉴Sigmod函数对所有主题进行强度排名,从而优化当前时间片上主题-词分布的先验参数计算方法。实验结果表明,LPOBTM能够更准确地描述话题的内容演化情况,并且有更低的模型困惑度。  相似文献   

16.
This paper presents the results of an empirical study on the subjective evaluation of code smells that identify poorly evolvable structures in software. We propose use of the term software evolvability to describe the ease of further developing a piece of software and outline the research area based on four different viewpoints. Furthermore, we describe the differences between human evaluations and automatic program analysis based on software evolvability metrics. The empirical component is based on a case study in a Finnish software product company, in which we studied two topics. First, we looked at the effect of the evaluator when subjectively evaluating the existence of smells in code modules. We found that the use of smells for code evaluation purposes can be difficult due to conflicting perceptions of different evaluators. However, the demographics of the evaluators partly explain the variation. Second, we applied selected source code metrics for identifying four smells and compared these results to the subjective evaluations. The metrics based on automatic program analysis and the human-based smell evaluations did not fully correlate. Based upon our results, we suggest that organizations should make decisions regarding software evolvability improvement based on a combination of subjective evaluations and code metrics. Due to the limitations of the study we also recognize the need for conducting more refined studies and experiments in the area of software evolvability.
Casper LasseniusEmail:
  相似文献   

17.
Code transformation and analysis tools provide support for software engineering tasks such as style checking, testing, calculating software metrics as well as reverse‐ and re‐engineering. In this paper we describe the architecture and the applications of JTransform, a general Java source code processing and transformation framework. It consists of a Java parser generating a configurable parse tree and various visitors (transformers, tree evaluators) which produce different kinds of outputs. While our framework is written in Java, the paper further opens an opportunity for a new generation of XML‐based source code tools. Copyright © 2004 John Wiley & Sons, Ltd.  相似文献   

18.
基于增量型聚类的自动话题检测研究   总被引:1,自引:0,他引:1  
张小明  李舟军  巢文涵 《软件学报》2012,23(6):1578-1587
随着网络信息飞速的发展,收集并组织相关信息变得越来越困难.话题检测与跟踪(topic detection and tracking,简称TDT)就是为解决该问题而提出来的研究方向.话题检测是TDT中重要的研究任务之一,其主要研究内容是把讨论相同话题的故事聚类到一起.虽然话题检测已经有了多年的研究,但面对日益变化的网络信息,它具有了更大的挑战性.提出了一种基于增量型聚类的和自动话题检测方法,该方法旨在提高话题检测的效率,并且能够自动检测出文本库中话题的数量.采用改进的权重算法计算特征的权重,通过自适应地提炼具有较强的主题辨别能力的文本特征来提高文档聚类的准确率,并且在聚类过程中利用BIC来判断话题类别的数目,同时利用话题的延续性特征来预聚类文档,并以此提高话题检测的速度.基于TDT-4语料库的实验结果表明,该方法能够大幅度提高话题检测的效率和准确率.  相似文献   

19.
Many of the existing approaches in Software Comprehension focus on program structure or external documentation. However, by analyzing formal information the informal semantics contained in the vocabulary of source code are overlooked. To understand software as a whole, we need to enrich software analysis with the developer knowledge hidden in the code naming. This paper proposes the use of information retrieval to exploit linguistic information found in source code, such as identifier names and comments. We introduce Semantic Clustering, a technique based on Latent Semantic Indexing and clustering to group source artifacts that use similar vocabulary. We call these groups semantic clusters and we interpret them as linguistic topics that reveal the intention of the code. We compare the topics to each other, identify links between them, provide automatically retrieved labels, and use a visualization to illustrate how they are distributed over the system. Our approach is language independent as it works at the level of identifier names. To validate our approach we applied it on several case studies, two of which we present in this paper.Note: Some of the visualizations presented make heavy use of colors. Please obtain a color copy of the article for better understanding.  相似文献   

20.
话题的延续和转换是篇章中重要的语用功能。该文从句首话题共享的角度对话题延续和转换进行了分类,分为句首话题延续、句中子话题延续、完全话题转换、兼语话题转换、新支话题转换五种,进而对话题转换的特殊情况——新支话题展开研究。基于33万字的广义话题结构语料库,该文对新支话题的句法成分、语义角色进行了统计和分析。通过句法成分分析发现,宾语从句或补语从句主语、主谓谓语句小主语、状性成分起始句主语、句末宾语、连谓句非句末宾语、兼语句兼语、介词宾语甚至状语等都能成为新支话题,从而引出新支句,其中,句末宾语作为新支话题的情况最多,但未发现间接宾语作为新支话题的情况;语义角色分析发现,大部分主体论元(施事、感事、经事、主事)和客体论元(受事、系事、结果、对象、与事)及少数凭借论元(方式)和环境论元(处所、终点)能成为新支话题引出新支句。同时,系事和受事成为新支话题的情况最显著;施事、结果和对象次之;原因和目的等论元难以成为新支话题。该文的研究揭示了句法、语义对话题转换这一语用现象的一种可能的约束途径,有助于人和计算机更深入地理解汉语篇章的话题转换机制,以期将这种语用现象逐步落实到语义直至句法的形式中,最终实现计算机对话题转换的自动分析。  相似文献   

设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号