Similar Articles
20 similar articles found.
1.
Searching for relevant code in the local code base is a common activity during software maintenance. However, previous research indicates that 88% of manually composed search queries retrieve no relevant results. One reason that many searches fail is existing search tools' dependence on string matching algorithms, which cannot find semantically related code. To solve this problem by helping developers compose better queries, researchers have proposed numerous query recommendation techniques relying on a variety of dictionaries and algorithms. However, few of these techniques have been empirically evaluated with usage data from real-world developers. To fill this gap, we designed a multi-recommendation system that relies on the cooperation of several query recommendation techniques. We implemented and deployed this recommendation system within the Sando code search tool and conducted a longitudinal field study. Our study shows that over 34% of all queries were adopted from recommendations, and that recommended queries retrieved results 11% more often than manual queries.
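The abstract does not say how the cooperating recommenders are combined. A minimal sketch of one plausible combination strategy is reciprocal-rank fusion over the candidates each technique proposes; the function and toy queries below are hypothetical, not Sando's actual algorithm.

```python
from collections import defaultdict

def merge_recommendations(candidates_per_technique, top_k=5):
    """Merge ranked query candidates from several recommenders.

    A candidate scores higher when more techniques propose it and
    when each technique ranks it near the top (reciprocal-rank fusion).
    """
    scores = defaultdict(float)
    for ranked in candidates_per_technique:
        for rank, query in enumerate(ranked):
            scores[query] += 1.0 / (rank + 1)
    return sorted(scores, key=scores.get, reverse=True)[:top_k]

# Example: three hypothetical recommenders propose overlapping queries.
print(merge_recommendations([
    ["parse xml file", "read xml", "xml parser"],
    ["read xml", "parse xml file"],
    ["xml parser", "parse xml file"],
]))
```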

2.
Source code terms such as method names and variable types are often different from conceptual words mentioned in a search query. This vocabulary mismatch problem can make code search inefficient. In this paper, we present COde voCABUlary (CoCaBu), an approach to resolving the vocabulary mismatch problem when dealing with free-form code search queries. Our approach leverages common developer questions and the associated expert answers to augment user queries with the relevant, but missing, structural code entities in order to improve the performance of matching relevant code examples within large code repositories. To instantiate this approach, we build GitSearch, a code search engine, on top of GitHub and Stack Overflow Q&A data. We evaluate GitSearch in several dimensions to demonstrate that (1) its code search results are correct with respect to user-accepted answers; (2) the results are qualitatively better than those of existing Internet-scale code search engines; (3) our engine is competitive against web search engines, such as Google, in helping users solve programming tasks; and (4) GitSearch provides code examples that are acceptable or interesting to the community as answers for Stack Overflow questions.
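A rough sketch of the query-augmentation idea: retrieve the closest Stack Overflow title with TF-IDF, then append code-like identifiers harvested from its answer. The Q&A pairs and the identifier regex are illustrative stand-ins, not CoCaBu's actual pipeline.

```python
import re
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel

# Toy stand-ins for Stack Overflow (question title, answer snippet) pairs.
qa_pairs = [
    ("how to read a text file line by line in java",
     "use BufferedReader and FileReader: new BufferedReader(new FileReader(path))"),
    ("convert string to integer in java",
     "call Integer.parseInt(s) or Integer.valueOf(s)"),
]

def augment_query(query, top=1):
    titles = [t for t, _ in qa_pairs]
    vec = TfidfVectorizer().fit(titles + [query])
    sims = linear_kernel(vec.transform([query]), vec.transform(titles))[0]
    best = sims.argsort()[::-1][:top]
    # Harvest code-like identifiers (dotted calls or CamelCase) from answers.
    ids = set()
    for i in best:
        ids |= set(re.findall(r"[A-Za-z_]+\.[A-Za-z_]+|[A-Z][a-z]+[A-Z]\w*",
                              qa_pairs[i][1]))
    return query + " " + " ".join(sorted(ids))

print(augment_query("read file line by line"))
```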

3.
During software development, developers often reuse code to improve development efficiency. Existing studies typically implement code recommendation with traditional information retrieval techniques, which suffer from a mismatch between the high-level intent expressed in natural language queries and the low-level implementation details of code. This paper proposes DeepCR, a code snippet recommendation approach based on a sequence-to-sequence model. The approach combines static program analysis with a sequence-to-sequence model to train a natural language query generation model that generates queries for code snippets; recommendation is then performed by computing similarity scores between the generated queries and the natural language query entered by the developer. The code corpus is built from the Stack Overflow Q&A website, ensuring the authenticity of the data. The effectiveness of the approach is validated by computing the mean reciprocal rank (MRR) and Hit@K of the recommendation results. Experimental results show that DeepCR outperforms existing work and effectively improves code snippet recommendation.
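MRR and Hit@K are standard retrieval metrics; below is a small self-contained implementation of both, following their usual definitions (the toy data is illustrative only).

```python
def mean_reciprocal_rank(ranked_lists, relevant):
    """MRR over queries: mean of 1/rank of the first relevant snippet."""
    total = 0.0
    for qid, ranking in ranked_lists.items():
        for rank, snippet in enumerate(ranking, start=1):
            if snippet in relevant[qid]:
                total += 1.0 / rank
                break
    return total / len(ranked_lists)

def hit_at_k(ranked_lists, relevant, k):
    """Fraction of queries with at least one relevant snippet in the top k."""
    hits = sum(
        1 for qid, ranking in ranked_lists.items()
        if any(s in relevant[qid] for s in ranking[:k])
    )
    return hits / len(ranked_lists)

# Toy example: two queries with known relevant snippets.
ranked = {"q1": ["s3", "s1", "s2"], "q2": ["s5", "s4"]}
gold = {"q1": {"s1"}, "q2": {"s9"}}
print(mean_reciprocal_rank(ranked, gold))  # (1/2 + 0) / 2 = 0.25
print(hit_at_k(ranked, gold, 2))           # 1/2 = 0.5
```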

4.
Our current understanding of how programmers perform feature location during software maintenance is based on controlled studies or interviews, which are inherently limited in size, scope and realism. Replicating controlled studies in the field can both explore the findings of these studies in wider contexts and study new factors that have not been previously encountered in the laboratory setting. In this paper, we report on a field study about how software developers perform feature location within source code during their daily development activities. Our study is based on two complementary field data sets: one that reflects complete IDE activity of 67 professional developers over approximately one month, and the other that reflects usage of an IR-based code search tool by nearly 600 developers. Analyzing this data, we report results on how often developers use which type of code search tools, on the types of queries and retrieval strategies used by developers, and on patterns of developer feature location behavior following code search. The results of the study suggest that there is (1) a need for helping developers to devise better code search queries; (2) a lack of adoption of niche code search tools; (3) a need for code search tools to handle both lookup and exploratory queries; and (4) a need for better integration between code search, structured navigation, and debugging tools in feature location tasks.

5.
Semi-supervised learning is a machine learning paradigm that can be applied to create pseudo labels from unlabeled data for learning a ranking model, when only limited training examples, or none at all, are available. However, the effectiveness of semi-supervised learning in information retrieval (IR) can be hindered by low-quality pseudo labels, hence the need for training query filtering that removes low-quality queries. In this paper, we assume two application scenarios with respect to the availability of human labels. First, for applications without any labeled data available, a clustering-based approach is proposed to select high-quality training queries. This approach selects training queries following the empirical observation that the relevant documents of high-quality training queries are highly coherent. Second, for applications with limited labeled data available, a classification-based approach is proposed. This approach learns a weak classifier to predict the retrieval performance gain of a given training query by making use of query features. The queries with high performance gains are selected for the subsequent transduction process that creates the pseudo labels for learning-to-rank algorithms. Experimental results on the standard LETOR dataset show that our proposed approaches outperform the strong baselines.
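The clustering-based selection rests on the observation that the relevant documents of good training queries are coherent. A minimal sketch of that idea, using mean pairwise cosine similarity over TF-IDF vectors; the 0.1 threshold and toy data are illustrative assumptions, not the paper's settings.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

def coherence(docs):
    """Mean pairwise cosine similarity of a query's relevant documents."""
    X = TfidfVectorizer().fit_transform(docs)  # rows are L2-normalized
    sims = (X @ X.T).toarray()                 # cosine similarity matrix
    n = len(docs)
    off_diag = sims.sum() - np.trace(sims)     # drop self-similarities
    return off_diag / (n * (n - 1)) if n > 1 else 0.0

# Keep only queries whose (pseudo-)relevant documents are coherent.
training_queries = {
    "jaguar speed": ["the jaguar runs fast", "jaguar top speed in the wild"],
    "java": ["java island travel guide", "java virtual machine internals"],
}
selected = [q for q, docs in training_queries.items() if coherence(docs) > 0.1]
print(selected)
```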

6.
Predicting source code changes by mining change history
Software developers are often faced with modification tasks that involve source code spread across a code base. Some dependencies between source code, such as those between source code written in different languages, are difficult to determine using existing static and dynamic analyses. To augment existing analyses and to help developers identify relevant source code during a modification task, we have developed an approach that applies data mining techniques to determine change patterns - sets of files that were changed together frequently in the past - from the change history of the code base. Our hypothesis is that the change patterns can be used to recommend potentially relevant source code to a developer performing a modification task. We show that this approach can reveal valuable dependencies by applying it to the Eclipse and Mozilla open source projects and by evaluating the predictability and interestingness of the recommendations produced for actual modification tasks on these systems.
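A toy illustration of mining co-change patterns: count how often pairs of files appear in the same commit, then recommend frequent partners of a changed file. This is simple support counting, a deliberate simplification of the paper's data mining machinery; file names and the support threshold are illustrative.

```python
from collections import Counter
from itertools import combinations

def mine_change_patterns(transactions, min_support=2):
    """Count how often each pair of files was committed together."""
    pair_counts = Counter()
    for files in transactions:
        for pair in combinations(sorted(files), 2):
            pair_counts[pair] += 1
    return {p: c for p, c in pair_counts.items() if c >= min_support}

def recommend(changed_file, patterns):
    """Suggest files that frequently changed together with changed_file."""
    hits = Counter()
    for (a, b), count in patterns.items():
        if changed_file == a:
            hits[b] += count
        elif changed_file == b:
            hits[a] += count
    return [f for f, _ in hits.most_common()]

history = [
    {"parser.c", "parser.h", "lexer.c"},
    {"parser.c", "parser.h"},
    {"lexer.c", "token.h"},
]
patterns = mine_change_patterns(history)
print(recommend("parser.c", patterns))  # parser.h co-changed twice
```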

7.
Preserving mapping consistency under schema changes
In dynamic environments like the Web, data sources may change not only their data but also their schemas, their semantics, and their query capabilities. When a schema change leaves a mapping inconsistent, the inconsistency has to be detected and the mapping updated. We present a novel framework and a tool (ToMAS) for automatically adapting (rewriting) mappings as schemas evolve. Our approach considers not only local changes to a schema but also changes that may affect and transform many components of a schema. Our algorithm detects mappings affected by structural or constraint changes and generates all the rewritings that are consistent with the semantics of the changed schemas. Our approach explicitly models mapping choices made by a user and maintains these choices, whenever possible, as the schemas and mappings evolve. When there is more than one candidate rewriting, the algorithm may rank them based on how close they are to the semantics of the existing mappings.

8.
9.
Developers commonly make use of a web search engine such as Google to locate online resources to improve their productivity. A better understanding of what developers search for could help us understand their behaviors and the problems that they meet during the software development process. Unfortunately, we have a limited understanding of what developers frequently search for and of the search tasks that they often find challenging. To address this gap, we collected search queries from 60 developers and surveyed 235 software engineers from more than 21 countries across five continents. In particular, we asked our survey participants to rate the frequency and difficulty of 34 search tasks, grouped along the following seven dimensions: general search, debugging and bug fixing, programming, third party code reuse, tools, database, and testing. We find that searching for explanations for unknown terminologies, explanations for exceptions/error messages (e.g., HTTP 404), reusable code snippets, solutions to common programming bugs, and suitable third-party libraries/services are the search tasks that developers perform most frequently, while searching for solutions to performance bugs, solutions to multi-threading bugs, public datasets to test newly developed algorithms or systems, reusable code snippets, best industrial practices, database optimization solutions, solutions to security bugs, and solutions to software configuration bugs are the search tasks that developers consider most difficult. Our study sheds light on why practitioners often perform some of these tasks and why they find some of them to be challenging. We also discuss the implications of our findings for future research in several areas, e.g., code search engines, domain-specific search engines, and automated generation and refinement of search queries.

10.
Ranking is a core problem in information retrieval, since the performance of a search system is directly impacted by the accuracy of its ranking results. Ranking model construction has been a focus of both information retrieval and machine learning, and learning to rank in particular has attracted much interest. Many ranking models have been proposed; for example, RankSVM is a state-of-the-art method for learning to rank and has been empirically demonstrated to be effective. However, most of the proposed methods do not account for the significant differences between queries, resorting instead to a single ranking function. In this paper, we present a novel ranking model named QoRank, which performs the learning task dependent on queries. We also propose an LSE (least-squares estimation)-based weighting method to aggregate the ranking lists produced by the base decision functions into the final ranking. We compare QoRank with other ranking techniques and employ several evaluation criteria to assess its performance. Experimental results on the LETOR OHSUMED data set show that QoRank strikes a good balance between accuracy and complexity, and outperforms the baseline methods.
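The LSE-based aggregation can be sketched as an ordinary least-squares fit of combination weights against relevance labels; QoRank's query-dependent learning is omitted here, and the scores and labels are toy data.

```python
import numpy as np

def lse_weights(base_scores, relevance):
    """Least-squares weights for combining base rankers' scores.

    base_scores: (n_docs, n_rankers) matrix of scores per document.
    relevance:   (n_docs,) vector of graded relevance labels.
    """
    w, *_ = np.linalg.lstsq(base_scores, relevance, rcond=None)
    return w

# Toy data: three documents scored by two base ranking functions.
S = np.array([[0.9, 0.2],
              [0.4, 0.8],
              [0.1, 0.1]])
y = np.array([2.0, 1.0, 0.0])   # graded relevance labels
w = lse_weights(S, y)
final_scores = S @ w             # aggregated ranking scores
print(np.argsort(-final_scores)) # document indices in ranked order
```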

11.
In recent years, microblog retrieval has become a hot research topic in information retrieval, and related studies show that it is time-sensitive. Prior work has derived a variety of retrieval models from different time-sensitivity assumptions, e.g., that newer documents are more relevant, or that documents closer in time to a burst moment are more relevant, and each improves retrieval effectiveness to some extent. However, these assumptions come mainly from observation; they are intuitive simplifications that capture only one aspect of how time affects microblog ranking. This paper verifies that microblog retrieval exhibits complex time-sensitive characteristics that such intuitive simplified assumptions cannot accurately describe. On this basis, we propose a learning-to-rank model for time-sensitive microblog retrieval (TLTR) that is built with machine learning from both the temporal and the textual features of microblogs. For the temporal features, we consider query-related global temporal features as well as local temporal features of query-document pairs. Experimental results on the TREC Microblog Track 2011 and 2012 datasets show that TLTR outperforms other existing time-sensitive microblog ranking methods.
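A sketch of the kind of temporal features such a model might consume alongside a text score before being fed to a learning-to-rank algorithm. The decay functions and rates are illustrative assumptions, not TLTR's actual features.

```python
import numpy as np

def temporal_features(doc_times, query_time, burst_time):
    """Two simple temporal features per query-document pair.

    recency:         exponential decay from the query time (newer is larger).
    burst_proximity: decay from a detected burst moment for the query.
    Both decay rates are illustrative choices, not the paper's values.
    """
    doc_times = np.asarray(doc_times, dtype=float)
    recency = np.exp(-(query_time - doc_times) / 3600.0)          # per hour
    burst_proximity = np.exp(-np.abs(doc_times - burst_time) / 3600.0)
    return np.column_stack([recency, burst_proximity])

# Combine with a text score into a feature matrix for a rank learner.
text_score = np.array([0.7, 0.5, 0.9])
feats = np.hstack([text_score[:, None],
                   temporal_features([1000.0, 5000.0, 8000.0],
                                     query_time=9000.0, burst_time=5000.0)])
print(feats.shape)  # (3, 3): one text + two temporal features per pair
```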

12.
Code smells are a popular mechanism for finding structural design problems in software systems, and several tools have emerged to support their detection. However, the number of smells returned by current tools usually exceeds the number of problems that the developer can deal with, particularly when the effort available for performing refactorings is limited. Moreover, not all code smells are equally relevant to the goals of the system or its health. This article presents a semi-automated approach that helps developers focus on the most critical problems of the system. We have developed a tool that suggests a ranking of code smells based on a combination of three criteria, namely: past component modifications, important modifiability scenarios for the system, and relevance of the kind of smell. These criteria are complementary and enable our approach to assess the smells from different perspectives. Our approach has been evaluated in two case studies, and the results show that the suggested code smells are useful to developers.
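A minimal sketch of ranking smells by a weighted combination of the three criteria named above; the weights, scores, and smell names are illustrative, not the tool's calibration.

```python
def rank_smells(smells, weights=(0.4, 0.3, 0.3)):
    """Rank detected smells by a weighted sum of three criteria.

    Each smell carries three scores in [0, 1]: past modification
    frequency of the component, relevance to modifiability scenarios,
    and relevance of the smell kind. The weights here are assumptions.
    """
    w_hist, w_scen, w_kind = weights
    def score(s):
        return w_hist * s["history"] + w_scen * s["scenario"] + w_kind * s["kind"]
    return sorted(smells, key=score, reverse=True)

smells = [
    {"name": "God Class in Billing", "history": 0.9, "scenario": 0.8, "kind": 0.7},
    {"name": "Long Method in Utils", "history": 0.2, "scenario": 0.1, "kind": 0.5},
]
for s in rank_smells(smells):
    print(s["name"])
```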

13.
Both the quality and quantity of training data have a significant impact on the accuracy of ranking functions in web search. To meet global search needs, a commercial search engine must extend its well-tailored service to small countries as well. Due to the heterogeneous nature of query intents and search results across domains (i.e., different languages and regions), it is difficult for a generic ranking function to satisfy all types of queries; instead, each domain should use a specific, well-tailored ranking function. To train a ranking function for each domain with a scalable strategy, it is critical to leverage existing training data to enhance the ranking functions of domains without sufficient training data. In this paper, we present a boosting framework for learning to rank in the multi-task learning context to address this problem. In particular, we propose to learn non-parametric common structures adaptively from multiple tasks in a stage-wise way. An algorithm is developed to iteratively discover super-features that are effective for all the tasks. The regression function for each task is then learned as a linear combination of those super-features. We evaluate the accuracy of multi-task learning methods for web search ranking using data from multiple domains of a commercial search engine. Our results demonstrate that multi-task learning methods bring significant relevance improvements over the existing baseline method.
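A heavily simplified sketch of stage-wise shared-structure discovery: each stage greedily picks the raw feature column that most reduces squared error summed over all tasks, with per-task least-squares coefficients. The paper's super-features are richer than single columns; this only illustrates the greedy multi-task selection loop.

```python
import numpy as np

def multitask_stagewise(tasks, n_stages=3):
    """Stage-wise selection of features shared across tasks.

    tasks: list of (X, y) pairs, one per domain, sharing feature columns.
    Returns the indices of the selected shared columns.
    """
    n_features = tasks[0][0].shape[1]
    selected = []
    for _ in range(n_stages):
        best, best_err = None, np.inf
        for j in range(n_features):
            if j in selected:
                continue
            err = 0.0
            for X, y in tasks:
                cols = X[:, selected + [j]]
                coef, *_ = np.linalg.lstsq(cols, y, rcond=None)
                resid = y - cols @ coef
                err += resid @ resid  # squared error for this task
            if err < best_err:
                best, best_err = j, err
        selected.append(best)
    return selected

# Two toy domains whose targets both depend on feature column 2.
rng = np.random.default_rng(0)
X1, X2 = rng.normal(size=(50, 5)), rng.normal(size=(40, 5))
y1 = X1[:, 2] + 0.1 * rng.normal(size=50)
y2 = 2 * X2[:, 2] + 0.1 * rng.normal(size=40)
print(multitask_stagewise([(X1, y1), (X2, y2)], n_stages=2))  # column 2 expected first
```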

14.
Internet-scale open source software (OSS) production in various communities generates abundant reusable resources for software developers. However, finding desired, mature software with keyword queries among a considerable number of candidates, especially for newcomers, is a significant challenge, because current search services often fail to understand the semantics of user queries. In this paper, we construct a software term database (STDB) by analyzing tagging data in Stack Overflow and propose a correlation-based software search (CBSS) approach that performs correlation retrieval based on the term relevance obtained from STDB. In addition, we design a novel ranking method to optimize the initial retrieval result. We explore four research questions in four experiments to evaluate the effectiveness of STDB and investigate the performance of CBSS. The experiment results show that the proposed CBSS can effectively respond to keyword-based software searches and significantly outperforms other existing search services at finding mature software.
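One simple way to derive term relevance from tagging data is Jaccard co-occurrence over the posts in which each tag appears; a sketch under that assumption (the tag sets are toy data, and STDB's actual construction may differ).

```python
from collections import defaultdict

def term_relevance(tagged_posts):
    """Build a Jaccard co-occurrence relevance function over tags."""
    posts_with = defaultdict(set)
    for post_id, tags in enumerate(tagged_posts):
        for tag in tags:
            posts_with[tag].add(post_id)
    def relevance(a, b):
        pa, pb = posts_with[a], posts_with[b]
        return len(pa & pb) / len(pa | pb) if pa | pb else 0.0
    return relevance

# Toy tag sets from hypothetical Stack Overflow posts.
posts = [{"json", "parsing", "jackson"},
         {"json", "jackson"},
         {"xml", "parsing"}]
rel = term_relevance(posts)
print(rel("json", "jackson"))  # 2/2 = 1.0: strongly related terms
print(rel("json", "xml"))      # 0/3 = 0.0: unrelated terms
```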

15.
Context: The functionality of a software system is most often expressed in terms of concepts from its problem or solution domains. The process of finding where these concepts are implemented in the source code is known as concept location, and it is a prerequisite of software change. Objective: We investigate a static approach to concept location named DepIR that combines program dependency search (DepS) with information retrieval-based search (IR). In this approach, programmers explore the static program dependencies of the source code components retrieved by the IR search engine. Method: The paper presents an empirical study that compares DepIR with its constituent techniques. The evaluation is based on an empirical method of reenactment that emulates the steps of concept location for 50 past changes mined from the software repositories of five software systems. Results: The results of the study indicate that DepIR significantly outperforms both DepS and IR. Conclusion: DepIR allows developers to perform concept location efficiently. It allows finding concepts even with queries that do not rank the relevant software components highly. Since formulating a good query is not always easy, this tolerance of lower-quality queries significantly broadens the usability of DepIR compared to traditional IR.
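A compact sketch of the DepIR idea: retrieve components with an IR step, then expand the hits one hop along static dependencies. The term-overlap scorer and toy program are illustrative stand-ins for a real IR engine and dependency analysis.

```python
def dep_ir(query_terms, components, dependencies):
    """IR search followed by static-dependency expansion.

    components:   {name: bag-of-words text of the component}
    dependencies: {name: set of components it statically depends on}
    Returns IR hits first, then their one-hop dependency neighborhood.
    """
    # Simple term-overlap retrieval stands in for the IR engine.
    def ir_score(text):
        return len(set(text.split()) & set(query_terms))
    hits = sorted((c for c in components if ir_score(components[c]) > 0),
                  key=lambda c: -ir_score(components[c]))
    # Expand each hit along static dependencies in both directions.
    expanded = []
    for c in hits:
        neighbors = set(dependencies.get(c, set()))
        neighbors |= {d for d, deps in dependencies.items() if c in deps}
        for n in sorted(neighbors):
            if n not in hits and n not in expanded:
                expanded.append(n)
    return hits + expanded

components = {"AuthService": "login password check user",
              "TokenStore": "token cache expiry",
              "UserDao": "user database query"}
dependencies = {"AuthService": {"TokenStore", "UserDao"}}
print(dep_ir(["login", "user"], components, dependencies))
```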

16.
Context: It is critical to ensure the quality of a software system in the initial stages of development, and several approaches have been proposed to ensure that a conceptual schema correctly describes the user's requirements. Objective: The main goal of this paper is to perform automated reasoning on UML schemas containing arbitrary constraints, derived roles, derived attributes and queries, all of which must be specified by OCL expressions. Method: The UML/OCL schema is encoded in a first-order logic formalisation, and an existing reasoning procedure is used to check whether the schema satisfies a set of desirable properties. Due to the undecidability of reasoning in highly expressive schemas, such as those considered here, we also provide a set of conditions that, if satisfied by the schema, ensure that all properties can be checked in a finite period of time. Results: This paper extends our previous work on reasoning on UML conceptual schemas with OCL constraints by considering derived attributes and roles that can participate in the definition of other constraints, queries and derivation rules. Queries formalised in OCL can also be validated to check their satisfiability and to detect possible equivalences between them. We also provide a set of conditions that ensure finite reasoning when they are satisfied by the schema under consideration. Conclusion: This approach improves upon previous work by allowing automated reasoning for more expressive UML/OCL conceptual schemas than those considered so far.

17.
Plagiarism source retrieval is the core task of plagiarism detection. Using queries extracted from suspicious documents to retrieve the plagiarism sources has become the standard for plagiarism detection, and generating queries from a suspicious document is one of the most important steps in source retrieval. Heuristic-based query generation methods are widely used in current research. Each heuristic-based method has its own advantages, and none statistically outperforms the others on all suspicious document segments when generating queries for source retrieval. Further improvements to heuristic methods rely mainly on the experience of experts, which makes it difficult to put forward new heuristic methods that overcome the shortcomings of existing ones. This paper paves the way for a statistical machine learning approach to selecting the best queries from the candidates. The approach is formulated as a ranking framework that aims to achieve optimal source retrieval performance for each suspicious document segment, and it exploits learning to rank to generate queries from the candidates. To our knowledge, our work is the first to apply machine learning methods to the problem of query generation for source retrieval. To address the essential problem of the absence of training data for learning to rank, we also construct training samples for source retrieval. We rigorously evaluate various aspects of the proposed method on the publicly available PAN source retrieval corpus. The experimental results show that our learning-based query generation method yields statistically significant improvements in source retrieval effectiveness over the established baselines.

18.
In the study of data exchange one usually assumes an open-world semantics, making it possible to extend instances of target schemas. An alternative closed-world semantics only moves ‘as much data as needed’ from the source to the target to satisfy constraints of a schema mapping. It avoids some of the problems exhibited by the open-world semantics, but limits the expressivity of schema mappings. Here we propose a mixed approach: one can designate different attributes of target schemas as open or closed, to combine the additional expressivity of the open-world semantics with the better behavior of query answering in closed worlds. We define such schema mappings, and show that they cover a large space of data exchange solutions with two extremes being the known open and closed-world semantics. We investigate the problems of query answering and schema mapping composition, and prove two trichotomy theorems, classifying their complexity based on the number of open attributes. We find conditions under which schema mappings compose, extending known results to a wide range of closed-world mappings. We also provide results for restricted classes of queries and mappings guaranteeing lower complexity.

19.
Search engines retrieve and rank Web pages that are not only relevant to a query but also important or popular with users. This popularity has been studied through analysis of the links between Web resources. Link-based page ranking models such as PageRank and HITS assign a global weight to each page regardless of its location, and this popularity measure has proven successful on general search engines. However, unlike general search engines, location-based search engines should rank pages that are popular locally higher. The best results for a location-based query are those that are not only relevant to the topic but also popular with, or cited by, local users. Current ranking models are often less effective for these queries since they are unable to estimate local popularity. We offer a model for calculating the local popularity of Web resources using backlink locations. Our model automatically assigns locations to links and content and uses them to calculate new geo-rank scores for each page. The experiments show more accurate geo-ranking of search engine results when this model is used for processing location-based queries.
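A sketch of the local-popularity intuition: weight each backlink by whether its source page is assigned the query's location. The 0.1 discount for non-local links and the page names are illustrative assumptions, not the paper's model.

```python
def local_popularity(backlinks, page_locations, target_location):
    """Score pages by backlinks, favoring links from the target location.

    backlinks:      {page: list of pages linking to it}
    page_locations: {page: location assigned to that page}
    Links from the target location count fully; others are discounted.
    """
    scores = {}
    for page, links in backlinks.items():
        scores[page] = sum(
            1.0 if page_locations.get(src) == target_location else 0.1
            for src in links
        )
    return sorted(scores, key=scores.get, reverse=True)

backlinks = {"pizza-roma.example": ["blog-rome", "blog-rome2", "blog-nyc"],
             "pizza-global.example": ["blog-nyc", "blog-london", "blog-tokyo"]}
page_locations = {"blog-rome": "Rome", "blog-rome2": "Rome",
                  "blog-nyc": "New York", "blog-london": "London",
                  "blog-tokyo": "Tokyo"}
print(local_popularity(backlinks, page_locations, "Rome"))
```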

20.
This paper is concerned with supervised rank aggregation, which aims to improve ranking performance by combining the outputs of multiple rankers. However, previous rank aggregation approaches have two main shortcomings. First, the learned weights for base rankers do not distinguish between queries; this is suboptimal since queries vary significantly in terms of ranking. Second, most current aggregation functions do not directly optimize the evaluation measures used in ranking. In this paper, the differences between queries are taken into consideration, and a supervised rank aggregation function that directly optimizes the evaluation measure NDCG, referred to as RankAgg.NDCG, is proposed. We prove that RankAgg.NDCG can achieve better NDCG performance than a linear combination of the base rankers. Experimental results on benchmark datasets show that our approach outperforms a number of baseline approaches.
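Since the aggregation directly optimizes NDCG, here is a small self-contained NDCG implementation using the common (2^rel - 1) gain and log2 discount, one of several standard formulations; the relevance list is toy data.

```python
import math

def dcg(relevances):
    """Discounted cumulative gain with the (2^rel - 1) gain function."""
    return sum((2 ** rel - 1) / math.log2(i + 2)
               for i, rel in enumerate(relevances))

def ndcg(ranked_relevances, k=None):
    """NDCG: DCG of the ranking divided by the DCG of the ideal ranking."""
    r = ranked_relevances[:k] if k else ranked_relevances
    ideal = sorted(ranked_relevances, reverse=True)[:len(r)]
    return dcg(r) / dcg(ideal) if dcg(ideal) > 0 else 0.0

# Graded relevance of documents in the order an aggregated ranker returns them.
print(ndcg([3, 2, 3, 0, 1], k=3))
```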
