Similar literature
Found 20 similar documents; search took 31 ms
1.
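Document retrieval is a hot research topic in natural language processing. Compared with short texts, long documents are information-rich but lengthy. In long-text retrieval, a query is usually relevant to only some of the sentences in a long document, and a few highly similar fragments can act as strong distractors, so the relevance score between a query and a document cannot simply be computed from word-level or string-level similarity. A text segmentation mechanism (TSM) is proposed for document retrieval: each candidate document is first divided into segments, and the relevance between the query and each segment is then computed with a matching scheme that takes semantics, term frequency, and other factors into account. Key segments are selected and the ratio of relevant segments is obtained; this segment information is combined to compute the query-document relevance score, from which the Top-K document set is retrieved. Experiments on the Glasgow information retrieval test collections show that text matching with the text segmentation mechanism improves retrieval performance.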
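The segment-then-aggregate idea in this abstract can be sketched as follows. This is only an illustration under stated assumptions, not the paper's TSM: the fixed segment size, the term-overlap relevance function (a stand-in for the semantic and term-frequency matching the abstract mentions), the threshold, and the way segment scores are combined with the relevant-segment ratio are all hypothetical choices. A non-empty query is assumed.

```python
def split_into_segments(words, segment_size):
    """Cut a token list into consecutive fixed-size segments (assumed scheme)."""
    return [words[i:i + segment_size] for i in range(0, len(words), segment_size)]


def segment_relevance(query_terms, segment):
    """Stand-in relevance: fraction of query terms that occur in the segment."""
    present = set(segment)
    return sum(1 for term in query_terms if term in present) / len(query_terms)


def document_score(query, document, segment_size=5, threshold=0.5):
    """Score a document from its key segments and the relevant-segment ratio."""
    query_terms = query.lower().split()
    segments = split_into_segments(document.lower().split(), segment_size)
    scores = [segment_relevance(query_terms, seg) for seg in segments]
    # Keep only segments whose relevance reaches the (hypothetical) threshold.
    key_scores = [s for s in scores if s >= threshold]
    if not key_scores:
        return 0.0
    relevant_ratio = len(key_scores) / len(segments)
    # Assumed combination: mean key-segment relevance weighted by the ratio.
    return (sum(key_scores) / len(key_scores)) * relevant_ratio


def top_k(query, documents, k):
    """Return the names of the k highest-scoring documents."""
    ranked = sorted(documents,
                    key=lambda name: document_score(query, documents[name]),
                    reverse=True)
    return ranked[:k]
```

With this combination, a document whose matches are concentrated in a single fragment scores lower than one whose relevant segments are spread through the text, which is one way to damp the "strong distractor" effect the abstract describes.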

2.
In knowledge discovery in a text database, extracting and returning a subset of information highly relevant to a user's query is a critical task. In a broader sense, this is essentially identification of certain personalized patterns that drive such applications as Web search engine construction, customized text summarization and automated question answering. A related problem of text snippet extraction has been previously studied in information retrieval. In these studies, common strategies for extracting and presenting text snippets to meet user needs either process document fragments that have been delimited a priori or use a sliding window of a fixed size to highlight the results. In this work, we argue that text snippet extraction can be generalized if the user's intention is better utilized. It overcomes the rigidness of existing approaches by dynamically returning more flexible start-end positions of text snippets, which are also semantically more coherent. This is achieved by constructing and using statistical language models which effectively capture the commonalities between a document and the user intention. Experiments indicate that our proposed solutions provide effective personalized information extraction services.

3.
This paper presents new algorithms, fuzzy c-medoids (FCMdd) and robust fuzzy c-medoids (RFCMdd), for fuzzy clustering of relational data. The objective functions are based on selecting c representative objects (medoids) from the data set in such a way that the total fuzzy dissimilarity within each cluster is minimized. A comparison of FCMdd with the well-known relational fuzzy c-means algorithm (RFCM) shows that FCMdd is more efficient. We present several applications of these algorithms to Web mining, including Web document clustering, snippet clustering, and Web access log analysis.
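A minimal sketch of the alternating scheme behind fuzzy c-medoids, assuming the commonly cited formulation: memberships proportional to (1/r)^(1/(m-1)), where r is the dissimilarity to a medoid, and each medoid replaced by the object that minimizes the membership-weighted dissimilarity of its cluster. The function names, the explicit `init_medoids` argument, and the handling of zero dissimilarities are assumptions made for illustration; the robust variant (RFCMdd) is not covered.

```python
def memberships(dissim, medoids, m):
    """Fuzzy membership u[i][j] of object j in the cluster of medoid i."""
    n = len(dissim)
    u = [[0.0] * n for _ in medoids]
    for j in range(n):
        # An object at zero dissimilarity from a medoid belongs fully to it
        # (shared equally if several medoids coincide with the object).
        zero = [i for i, v in enumerate(medoids) if dissim[j][v] == 0]
        if zero:
            for i in zero:
                u[i][j] = 1.0 / len(zero)
            continue
        weights = [(1.0 / dissim[j][v]) ** (1.0 / (m - 1.0)) for v in medoids]
        total = sum(weights)
        for i, w in enumerate(weights):
            u[i][j] = w / total
    return u


def fuzzy_c_medoids(dissim, init_medoids, m=2.0, max_iter=100):
    """Alternate membership and medoid updates until the medoids stop changing.

    dissim is an n x n symmetric dissimilarity matrix (list of lists);
    init_medoids is a list of c object indices used as the starting medoids.
    """
    n = len(dissim)
    medoids = list(init_medoids)
    for _ in range(max_iter):
        u = memberships(dissim, medoids, m)
        # New medoid of cluster i: the object with the smallest total
        # fuzzy dissimilarity to all objects, weighted by u[i][j] ** m.
        updated = [
            min(range(n),
                key=lambda k: sum(u[i][j] ** m * dissim[k][j] for j in range(n)))
            for i in range(len(medoids))
        ]
        if updated == medoids:
            break
        medoids = updated
    return medoids, memberships(dissim, medoids, m)
```

Because only a dissimilarity matrix is needed, the same loop applies to relational data such as snippets or access-log sessions, provided a pairwise dissimilarity is available. Like other alternating schemes, the result depends on the starting medoids.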

4.
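Shape recognition based on corner features   Cited by: 1 (self-citations: 0, citations by others: 1)
Based on the distances between the corner points of an aircraft's shape, a new multidimensional distance feature vector is defined; different aircraft have different multidimensional distance vectors. By computing the correlation coefficients between these feature vectors and comparing their correlation, each type of aircraft can be identified from an aircraft model library. Experiments show that the new feature vector is stable and discriminative, and that the algorithm achieves a high recognition rate and high speed, running in nearly real time.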
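One way to picture the corner-distance feature is the sketch below. It assumes the feature vector is the sorted list of pairwise Euclidean distances between corners and that the correlation coefficient is Pearson's r; the abstract does not give the exact construction, so both are illustrative assumptions, as are the function names. All shapes are assumed to have the same number of corners, and the distances of a shape must not all be equal (otherwise the variance is zero).

```python
import math
from itertools import combinations


def distance_vector(corners):
    """Sorted pairwise Euclidean distances between corner points."""
    return sorted(math.dist(p, q) for p, q in combinations(corners, 2))


def pearson(a, b):
    """Pearson correlation coefficient of two equal-length vectors."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


def recognize(query_corners, library):
    """Return the library entry whose distance vector correlates best."""
    query_vec = distance_vector(query_corners)
    return max(library,
               key=lambda name: pearson(query_vec, distance_vector(library[name])))
```

Pairwise distances do not change under translation or rotation, and the correlation coefficient does not change under uniform scaling, so a shifted and enlarged copy of a library shape still correlates at about 1 with its template.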

5.
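Dynamic summarization algorithm based on a sliding window   Cited by: 2 (self-citations: 0, citations by others: 2)
A dynamic summary is a summary extracted dynamically from an article according to the query terms. After browsing only the dynamic summary, users can understand the parts of the article relevant to the query and decide whether the whole article is worth reading in detail. Based on search engines' requirements for summary speed and quality, this paper proposes an algorithm that extracts fragments with a sliding window, then constructs a summary evaluation model and uses the same test set to compare the new dynamic summarization algorithm with Google and Baidu. The results show that summaries generated by the new method concisely capture the relevant content of the article; in the itemized tests of summary metrics the method performs about as well as Google and clearly better than Baidu, and the overall evaluation improves by 5% and 11%, respectively.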
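A sliding-window fragment extractor can be sketched in a few lines. The version below is an assumption-laden illustration rather than the paper's algorithm: it tokenizes on whitespace (Chinese text would need a word segmenter instead), scores a window by the number of query-term occurrences it contains, and returns the first window with the highest score.

```python
def best_window(text, query_terms, window_size=30):
    """Return the window of window_size words that contains the most query hits."""
    words = text.split()
    terms = {t.lower() for t in query_terms}
    # hits[i] is 1 when word i (punctuation stripped) is a query term.
    hits = [1 if w.lower().strip('.,;:!?"()') in terms else 0 for w in words]
    if len(words) <= window_size:
        return " ".join(words)

    score = sum(hits[:window_size])
    best_score, best_start = score, 0
    for start in range(1, len(words) - window_size + 1):
        # Slide by one word: add the word entering, drop the word leaving.
        score += hits[start + window_size - 1] - hits[start - 1]
        if score > best_score:
            best_score, best_start = score, start
    return " ".join(words[best_start:best_start + window_size])
```

Updating the score incrementally keeps the scan linear in the length of the article, which matters given the speed requirement the abstract mentions.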

6.
Software developers insert logging statements in their source code to record important runtime information; such logged information is valuable for understanding system usage in production and debugging system failures. However, providing proper logging statements remains a manual and challenging task. Missing an important logging statement may increase the difficulty of debugging a system failure, while too much logging can increase system overhead and mask the truly important information. Intuitively, the actual functionality of a software component is one of the major drivers behind logging decisions. For instance, a method maintaining network communications is more likely to be logged than getters and setters. In this paper, we used automatically-computed topics of a code snippet to approximate the functionality of a code snippet. We studied the relationship between the topics of a code snippet and the likelihood of a code snippet being logged (i.e., to contain a logging statement). Our driving intuition is that certain topics in the source code are more likely to be logged than others. To validate our intuition, we conducted a case study on six open source systems, and we found that i) there exists a small number of “log-intensive” topics that are more likely to be logged than other topics; ii) each pair of the studied systems share 12% to 62% common topics, and the likelihood of logging such common topics has a statistically significant correlation of 0.35 to 0.62 among all the studied systems; and iii) our topic-based metrics help explain the likelihood of a code snippet being logged, providing an improvement of 3% to 13% on AUC and 6% to 16% on balanced accuracy over a set of baseline metrics that capture the structural information of a code snippet. Our findings highlight that topics contain valuable information that can help guide and drive developers’ logging decisions.

7.
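Implicit relevance feedback is often used to improve retrieval performance, and most existing work focuses on implicit positive feedback. This paper considers implicit positive and negative feedback together: unclicked pages ranked before a clicked page in a query session are treated as implicit negative feedback. By introducing a time factor, the time a user dwells on the titles and snippets of unclicked pages is estimated, and the relationship between implicit negative feedback and the user's interests and behavior is inferred in order to optimize the retrieval results. Experiments on the TREC Session 2011 and 2012 datasets verify the effectiveness of the proposed time-factor-based implicit positive and negative feedback algorithm, TIPNF.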

8.
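Jia Changyun (贾长云), Cheng Yongshang (程永上), Zhu Min (朱敏). Journal of Computer Applications (《计算机应用》), 2010, 30(4): 1096-1098
To effectively improve the ability of mobile terminals to query multimedia information, a new multimedia information query method, content-based progressive target search, is discussed, and a "query stream" technique is proposed. By combining the multimedia query method (MQF) with XML fragment request units and fragment update units, the user's query for multimedia information proceeds step by step: the relevant metadata descriptions are queried first, and the final results afterwards. This effectively reduces the data traffic of a query and is well suited to multimedia information queries on mobile terminals with modest hardware and limited channel capacity.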

9.
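Cheng Jingyun (程靖云), Wang Buhong (王布宏), Luo Peng (罗鹏). Journal of Computer Applications (《计算机应用》), 2022, 42(10): 3170-3176
As the scale and complexity of computer software keep growing, code defects in software pose a serious threat to public safety. To address the poor extensibility of static analysis tools and the coarse detection granularity and unsatisfactory detection performance of existing methods, a static code defect detection method based on program slicing and semantic feature fusion is proposed. First, data-flow and control-flow analysis is performed on key points in the source code, and a slicing method based on Interprocedural Finite Distributive Subset (IFDS) is used to obtain code snippets made up of multiple statements related to code defects. Then, word embedding is used to obtain semantically meaningful vector representations of the code snippets, so that a suitable snippet length can be chosen while maintaining accuracy. Finally, a text convolutional neural network (TextCNN) and a bidirectional gated recurrent unit (BiGRU) are used to extract local key features and contextual sequence features of the code snippets, respectively, and the proposed method is applied to detect slice-level code defects. Experimental results show that the proposed method effectively detects different types of code defects and significantly outperforms the static analysis tool Flawfinder; at fine granularity, the IFDS slicing method further improves the F1 score and accuracy, which reach 89.64% and 92.08%, respectively; compared with existing program-slicing-based methods, when the key points are application programming interfaces (APIs) or variables, the proposed method achieves F1 scores of 89.69% and 89.74% and accuracies of 92.15% and 91.98%, respectively. The proposed method thus offers better overall detection performance without significantly increasing time complexity.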

10.
In this paper, we present a complete set of procedures to automatically extract a music snippet, defined as the most representative or the highlighted excerpt of a music clip. We first generate a modified and compact similarity matrix based on selected features and distance metrics, and then several improved techniques for music repeated pattern discovery are utilized because a music snippet is usually a part of the repeated melody, main theme or chorus. During the process, redundant and wrongly detected patterns are discarded, boundaries are corrected using beat information, and final clusters are also further sorted according to the occurrence frequency and energy information. Subsequently, following our methods, we designed a music snippet extraction system which allows users to detect snippets. Experiments performed on the system show the superiority of our proposed approach. Supported by the National Natural Science Foundation of China (Grant No. 60873098)

11.
Eye-tracking technology was used to examine Internet search result evaluation strategies adopted by sixth-grade students (N = 36) during ten experimental information search tasks. The relevancy of the search result’s title, URL, and snippet components was manipulated and selection of search results as well as looking into probabilities on the search result components was analysed. The results revealed that during first-pass inspection, students read the search engine page by first looking at the title of a search result. If the title was relevant, the probability of looking at the snippet of the search result increased. During second-pass inspection, there was a high probability of students focusing on the most promising search result by inspecting all of its components before making their selection. A cluster analysis revealed three viewing strategies: half of the students looked mainly at the titles and snippets; one-third with high probability examined all components; and one-sixth mainly focused on titles, leading to more frequent errors in search result selection. The results indicate that students generally made a flexible use of both eliminative and confirmatory evaluation strategies when reading Internet search results, while some seemed to not pay attention to snippet and URL components of the search results.

12.
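Multi-passage Chinese reading comprehension requires handling the sparsity of evidence passages, the diversity of Chinese semantics, and the validity of answer spans. Accordingly, a multi-passage Chinese reading comprehension model is designed: data augmentation is used to learn from passages that do not contain the answer, character-level encoding and Chinese part-of-speech tagging enrich the semantic representation of Chinese, and an answer validity verification model is trained on features of the answer spans. The model is applied to the CIPS-SOGOU factoid question answering data, and experiments show that both the exact match rate and the average F1 score improve.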

13.
Short text clustering by finding core terms   Cited by: 1 (self-citations: 1, citations by others: 0)
A new clustering strategy, TermCut, is presented to cluster short text snippets by finding core terms in the corpus. We model the collection of short text snippets as a graph in which each vertex represents a piece of short text snippet and each weighted edge between two vertices measures the relationship between the two vertices. TermCut is then applied to recursively select a core term and bisect the graph such that the short text snippets in one part of the graph contain the term, whereas those snippets in the other part do not. We apply the proposed method on different types of short text snippets, including questions and search results. Experimental results show that the proposed method outperforms state-of-the-art clustering algorithms for clustering short text snippets.

14.
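To evaluate the runtime energy consumption of embedded software, an energy measurement method based on LabVIEW and an NI multifunction data acquisition card is designed. A purpose-written data acquisition program synchronously samples the current and voltage channels of the measurement target together with a digital channel used to mark state. The target program is wrapped in a measurement unit that sends its state to the digital channel through a GPIO port. This state distinguishes whether a sampled current/voltage data point was taken while the target program was executing, and the sample data are processed with an approximate energy calculation method. Energy measurements on a real platform show that the maximum difference in the measured data stays at about 0.2 mJ, indicating high precision.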
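The marker-gated energy estimate can be illustrated with a short sketch. It assumes the "approximate energy calculation" is a rectangle-rule sum of instantaneous power, E ≈ Σ V_i · I_i · Δt, taken only over samples whose digital marker is high; the abstract does not spell out the formula, so this is a plausible reading rather than the authors' exact method. The function name, argument names, and sampling rate are hypothetical.

```python
def energy_joules(voltage, current, marker, sample_rate_hz):
    """Approximate energy (J) of the samples flagged by the digital marker.

    voltage and current are synchronously sampled sequences (V and A);
    marker holds 1 while the target program is executing and 0 otherwise.
    """
    dt = 1.0 / sample_rate_hz  # time covered by one sample, in seconds
    return sum(v * i * dt
               for v, i, active in zip(voltage, current, marker)
               if active)
```

For example, three flagged samples at 3.3 V and 0.2 A taken at 1 kHz give 3.3 × 0.2 × 0.003 s = 1.98 mJ; samples outside the marked interval do not contribute.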

15.
Machine Learning - Matrices are a very common way of representing and working with data in data science and artificial intelligence. Writing a small snippet of code to make a simple matrix...

16.
Query recommendation helps users to describe their information needs more clearly so that search engines can return appropriate answers and meet their needs. State-of-the-art research shows that the use of users’ behavior information helps to improve query recommendation performance. Instead of finding the most similar terms previous users queried, we focus on how to detect users’ actual information need based on their search behaviors. The key idea of this paper is that although the clicked documents are not always relevant to users’ queries, the snippets which lead them to the click most probably meet their information needs. Based on analysis of large-scale practical search behavior log data, two snippet click behavior models are constructed and corresponding query recommendation algorithms are proposed. Experimental results based on two widely-used commercial search engines’ click-through data prove that the proposed algorithms outperform practical recommendation methods of these two search engines. To the best of our knowledge, this is the first time that snippet click models are proposed for the query recommendation task.

17.
Positional ranking functions, widely used in web search engines and related search systems, improve result quality by exploiting the positions of the query terms within documents. However, it is well known that positional indexes demand large amounts of extra space, typically about three times the space of a basic nonpositional index. Textual data, on the other hand, is needed to produce text snippets. In this paper, we study time–space trade-offs for search engines with positional ranking functions and text snippet generation. We consider both index-based and non-index based alternatives for positional data. We aim to answer the question of whether positional data should be indexed, and how. We show that there is a wide range of practical time–space trade-offs. Moreover, we show that using about 1.30 times the space of positional data, we can store everything needed for efficient query processing, with a minor increase in query time. This yields considerable space savings and outperforms, both in space and time, recent alternatives from the literature. We also propose several efficient compressed text representations for snippet generation, which are able to use about half of the space of current state-of-the-art alternatives with little impact in query processing time.

18.
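Implicitly learning user preferences from search history is a hot topic in personalized retrieval research, and reconstructing a new query from the user's search history is one of its main subjects. When using search history for query reformulation, existing studies usually do not distinguish whether the content of the history is relevant to the current query; they treat the whole history as a unit, so the reformulated query contains considerable noise. Exploiting the fact that related words co-occur frequently in context, this paper uses the page snippets of the user's past search results as context and, together with user clicks, selects the words in the search history that co-occur most strongly with the current query to rebuild the query model. Re-ranking experiments on the initial retrieval results show that the method effectively selects relevant words and reduces noise. Measured by p@5 and NDCG, it achieves relative improvements of 12.8% and 7.2% over the best baseline system, and of 26.0% and 11.4% over the initial ranking.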

19.
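An improved heuristic crossover operator for TSP   Cited by: 1 (self-citations: 1, citations by others: 0)
The Traveling Salesman Problem (TSP) is a classic NP combinatorial optimization problem. Genetic algorithms clearly outperform traditional algorithms on this kind of combinatorial problem, and many crossover operators for finding good tours have been proposed. Based on a comparative analysis of two heuristic crossover algorithms proposed by Tang Lixin (唐立新), a new crossover operator is put forward. The operator preserves effective gene fragments by checking whether cities are adjacent in the parents, and speeds up convergence by adding a moving window. Experimental results demonstrate the effectiveness of the operator.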

20.
With the growing requirements of web applications, web components are developed to package the implementation of commonly-used features for reuse. In some cases, the developer may want to reuse some features which cannot be customized by the component's APIs. He/she has to extract the implementation by hand. It is labor-intensive and error-prone. Considering the widely-used test cases which can be useful to specify the software features, a test-driven approach is proposed to extract the implementation of the desired features in web components. The satisfaction of the user's requirements is transformed into the passing rate of user-specified test cases. In this way, the quality of the extraction result can be evaluated automatically. Meanwhile, a record/replay-based GUI test generation method is proposed to ensure that the extraction result has the correct GUI appearance. To extract the feature implementation, a hierarchical genetic algorithm is proposed to find the code snippet that can pass all the tests and has the approximate smallest size. We compare our method with two existing feature extraction methods. The result shows that our method can extract the correct implementation with the minimum size. A human-subject study is conducted to show the effectiveness and weaknesses of our method in helping users extract the features.
