Similar documents
 Found 20 similar documents; search took 12 ms
1.
The technology of automatic document summarization is maturing and may provide a solution to the information overload problem. Document summarization now plays an important role in information retrieval: with a large volume of documents, presenting the user with a summary of each document greatly facilitates the task of finding the desired documents. Document summarization is the process of automatically creating a compressed version of a given document that provides useful information to users, and multi-document summarization produces a summary delivering the majority of the information content from a set of documents about an explicit or implicit main topic. In our study we focus on sentence-based extractive document summarization. We propose a generic document summarization method based on sentence clustering. The proposed approach continues the sentence-clustering-based extractive summarization methods proposed in Alguliev [Alguliev, R. M., Aliguliyev, R. M., Bagirov, A. M. (2005). Global optimization in the summarization of text documents. Automatic Control and Computer Sciences 39, 42–47], Aliguliyev [Aliguliyev, R. M. (2006). A novel partitioning-based clustering method and generic document summarization. In Proceedings of the 2006 IEEE/WIC/ACM international conference on web intelligence and intelligent agent technology (WI–IAT 2006 Workshops) (WI–IATW’06), 18–22 December (pp. 626–629) Hong Kong, China], Alguliev and Alyguliev [Alguliev, R. M., Alyguliev, R. M. (2007). Summarization of text-based documents with a determination of latent topical sections and information-rich sentences. Automatic Control and Computer Sciences 41, 132–140], and Aliguliyev [Aliguliyev, R. M. (2007). Automatic document summarization by sentence extraction. Journal of Computational Technologies 12, 5–15]. The purpose of the present paper is to show that the summarization result depends not only on the optimized function but also on the similarity measure.
The experimental results on open benchmark datasets from DUC01 and DUC02 show that our proposed approach can improve performance compared to state-of-the-art summarization approaches.
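Because the abstract above stresses that the choice of similarity measure affects the summarization result, it may help to see what one such measure looks like. The following is a minimal sketch of cosine similarity over raw term counts. It is an assumption chosen for illustration, not necessarily the measure used in the paper, and the function name and whitespace tokenization are hypothetical.

```python
import math
from collections import Counter


def cosine_similarity(sentence_a: str, sentence_b: str) -> float:
    """Cosine similarity between two sentences using raw term counts.

    Illustrative assumption: tokens are lowercase whitespace-separated words.
    """
    counts_a = Counter(sentence_a.lower().split())
    counts_b = Counter(sentence_b.lower().split())

    # Dot product over the terms the two sentences share.
    dot = sum(counts_a[t] * counts_b[t] for t in counts_a.keys() & counts_b.keys())

    norm_a = math.sqrt(sum(v * v for v in counts_a.values()))
    norm_b = math.sqrt(sum(v * v for v in counts_b.values()))

    # An empty sentence has no direction; treat its similarity as zero.
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)
```

A sentence-clustering summarizer would typically call a function like this for each pair of sentences. Swapping in a different measure changes which sentences end up in the same cluster.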

2.
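The concepts expressed by key-region objects in Thangka images share a degree of similarity in meaning, and computing and analyzing this similarity quantitatively is important for research on high-level semantic retrieval of Thangka images. To address this problem, formal concept analysis is introduced and a method is proposed for computing the semantic similarity of key-region concepts in Thangka images. The method first extracts the concepts of key-region objects in Thangka images, together with a set of semantic keywords, as the formal context for constructing a concept lattice, and then computes the semantic similarity between concepts from that lattice. Experimental results show that the values computed by the method agree with human judgments, indicating that the method is feasible and effective.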

3.
This research analyzes relationships among genes according to their annotations. We present a similar genes discovery system (SGDS), based on a semantic similarity measure over Gene Ontology (GO) and Entrez Gene, to identify groups of similar genes. To validate the proposed measure, we analyze the relationship between similarity and expression correlation for pairs of genes. We explore a number of semantic similarity measures and compute the Pearson correlation coefficient. Highly correlated genes exhibit strong similarity in the ontology taxonomies. The results show that our proposed semantic similarity measure outperforms the others and seems better suited for use in GO. We use the MAPK homogenous genes group and the MAP kinase pathway as benchmarks to tune the parameters of our system for higher accuracy. We applied the SGDS to the RON and Lutheran pathways; the results show that it is able to identify a group of similar genes and to predict novel pathways based on a group of candidate genes.
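The validation step above relies on the Pearson correlation coefficient. As a hedged illustration, the sketch below computes it for two equal-length lists of numbers, for example per-pair similarity scores and per-pair expression correlations. The function name and the input layout are assumptions made for this example and do not come from the paper.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")

    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n

    # Sums of centered products and centered squares.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)

    # The coefficient is undefined when either sequence is constant.
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)
```

A value near 1 would indicate that gene pairs with higher semantic similarity also tend to have higher expression correlation.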

4.
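Xie Defeng (谢德峰), Ji Jianmin (吉建民). Journal of Computer Applications (《计算机应用》), 2021, 41(9): 2489-2495
In natural language processing (NLP), syntactic information is the syntactic structure relations or dependency relations between the words of a complete sentence, and it is an important and effective source of reference information. The semantic parsing task converts a natural language sentence directly into a semantically complete, machine-executable language. In previous semantic parsing research, few works have used the syntactic information of the input to improve the efficiency of end-to-end semantic parsing. To further improve the accuracy and efficiency of end-to-end semantic parsing models, the paper proposes...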

5.
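Measuring Chinese sentence similarity based on word-string granularity and weights   Total citations: 5, self-citations: 0, citations by others: 5
An improved method for measuring Chinese sentence similarity is proposed for example-based Chinese-English machine translation. The method takes into account the number and length of shared word strings together with their corresponding weights, overcomes notable shortcomings of traditional methods, and is theoretically better founded. Experiments on a small dataset also indicate that the method is feasible.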

6.
This paper presents a study on the evaluation and analysis of natural language tweets. Drawing on our experience in text summarisation, we carry out an in-depth analysis of users' perception by evaluating tweets generated manually and automatically from news. Specifically, we consider two key properties of a tweet: its informativeness and its interestingness. We therefore analyse: (1) do users perceive manual and automatic tweets equally?; (2) what linguistic features should a good tweet have to be interesting as well as informative? The main challenge of this proposal is the analysis of tweets to help companies with their positioning and reputation on the Web. Our results show that: (1) informative and interesting natural language tweets can be generated automatically by summarisation approaches; and (2) we can characterise good and bad tweets based on specific linguistic features not present in other types of tweets.

7.
Accurately measuring document similarity is important for many text applications, e.g. document similarity search, document recommendation, etc. Most traditional similarity measures are based only on the “bag of words” of documents and evaluate document topical similarity well. In this paper, we propose the notion of document structural similarity, which is expected to further evaluate document similarity by comparing document subtopic structures. Three related factors (i.e. the optimal matching factor, the text order factor and the disturbing factor) are proposed and combined to evaluate document structural similarity, among which the optimal matching factor plays the key role and the other two factors rely on its results. The experimental results demonstrate the high performance of the optimal matching factor for evaluating document topical similarity: it performs as well as or better than most popular measures. The user study shows that the proposed overall measure, with all three factors, is good at further finding highly similar documents among topically similar documents, and is much better at this than the popular measures and other baseline structural similarity measures. Xiaojun Wan received a B.Sc. degree in information science, an M.Sc. degree in computer science and a Ph.D. degree in computer science from Peking University, Beijing, China, in 2000, 2003 and 2006, respectively. He is currently a lecturer at the Institute of Computer Science and Technology of Peking University. His research interests include information retrieval and natural language processing.

8.
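Concept-based Web information retrieval   Total citations: 9, self-citations: 0, citations by others: 9
With its continued growth, the Internet has become the world's largest base of shared information. At the same time, accurately retrieving the information a user needs from it has become an active research topic, and many information retrieval systems have been developed. Traditional keyword-based retrieval systems, however, leave room for improvement in both precision and recall. This paper therefore proposes a concept-based information retrieval system model and presents its theoretical model and working mechanism. Its core technology is natural language processing: internally, both queries and indexes are built at the semantic level, which yields better recall and precision. The paper closes with an outlook on future directions, in the hope of offering some inspiration to fellow researchers in information retrieval.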

9.
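Mutual discovery of semantic relations and syntactic patterns in natural language processing*   Total citations: 3, self-citations: 0, citations by others: 3
This paper addresses how to build a repository of semantic relations between Chinese characters and words within the National Science and Technology Infrastructure Platform, and how to use an initial repository to acquire syntactic patterns and new relations automatically. It adopts the concept of a syntactic pattern and proposes using known relations to discover new patterns and known patterns to discover new relations. The corresponding models were designed and a Chinese semantic relation knowledge base system was implemented. Using this system together with NLP techniques, more than 200 valid relations were acquired automatically at large scale from the Sogou corpus and Baidu Baike page files, and more than 1,000 valid new relation entries, such as inheritance and synonymy, were extracted from them. Experiments show an efficiency of about 40%, which depends mainly on the distance setting between the query words in a relation and on the nature of the corpus itself.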

10.
The semantic web vision is one in which rich, ontology-based semantic markup will become widely available. The availability of semantic markup on the web opens the way to novel, sophisticated forms of question answering. AquaLog is a portable question-answering system which takes queries expressed in natural language and an ontology as input, and returns answers drawn from one or more knowledge bases (KBs). We say that AquaLog is portable because the configuration time required to customize the system for a particular ontology is negligible. AquaLog combines different strategies in a novel way. It makes use of the GATE NLP platform, string metric algorithms, WordNet and a novel ontology-based relation similarity service to make sense of user queries with respect to the target KB. It also includes a learning component, which ensures that the performance of the system improves over time in response to the particular community jargon used by end users.

11.
In image processing, image similarity indices evaluate how much structural information a processed image retains relative to a reference image. Commonly used measures, such as the mean squared error (MSE) and peak signal-to-noise ratio (PSNR), ignore the spatial information (e.g. redundancy) contained in natural images, which can lead to similarity evaluations that are inconsistent with human visual perception. A structural similarity measure (SSIM), which quantifies image fidelity through estimation of local correlations scaled by local brightness and contrast comparisons, was introduced by Wang et al. (2004). This correlation-based SSIM outperforms MSE in the similarity assessment of natural images. However, as correlation only measures linear dependence, distortions from multiple sources or nonlinear image processing such as nonlinear filtering can cause SSIM to under- or overestimate the true structural similarity. In this article, we propose a new similarity measure that replaces the correlation and contrast comparisons of SSIM by a term obtained from a nonparametric test that has superior power to capture general dependence, including linear and nonlinear dependence in the conditional mean regression function as a special case. The new similarity measure, applied to images subjected to noise contamination, filtering, and watermarking, provides a more consistent image structural fidelity measure than commonly used measures.
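For readers unfamiliar with the two baseline measures named above, the following sketch computes MSE and PSNR using their standard definitions. It is illustrative only: representing images as nested lists of pixel intensities and defaulting to an 8-bit peak value of 255 are assumptions made for this example, and the abstract's proposed measure is not implemented here.

```python
import math


def mse(reference, processed):
    """Mean squared error between two images given as equal-shape nested lists."""
    ref_pixels = [p for row in reference for p in row]
    out_pixels = [p for row in processed for p in row]
    if len(ref_pixels) != len(out_pixels) or not ref_pixels:
        raise ValueError("images must be non-empty and have the same size")
    return sum((r - o) ** 2 for r, o in zip(ref_pixels, out_pixels)) / len(ref_pixels)


def psnr(reference, processed, max_value=255.0):
    """Peak signal-to-noise ratio in decibels; infinite for identical images."""
    error = mse(reference, processed)
    if error == 0:
        return math.inf
    return 10.0 * math.log10(max_value ** 2 / error)
```

Both functions compare pixels one position at a time and never look at neighbouring pixels, which is why these measures cannot account for the spatial structure the abstract describes.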

12.
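Zhang Yufang (张玉芳), Xu Anlong (徐安龙). Journal of Computer Applications (《计算机应用》), 2012, 32(5): 1329-1331
Existing similarity computations based on hybrid methods do not fully analyze the factors that affect semantic similarity. To address this, a comprehensive method based on multiple factors affecting the measurement of semantic similarity between terms is proposed. The method combines semantic hierarchy, semantic distance, and local semantic density, making full use of the ontology's semantic information to compute the semantic similarity between gene terms. Experimental results show that the method achieves a higher correlation coefficient with human ratings.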

13.
14.
The complexity of Korean numeral classifiers demands semantic as well as computational approaches that employ natural language processing (NLP) techniques. The classifier is a universal linguistic device, having the two functions of quantifying and classifying nouns in noun phrase constructions. Many linguistic studies have focused on the fact that numeral classifiers afford decisive clues to categorizing nouns. However, few studies have dealt with the semantic categorization of classifiers and their semantic relations to the nouns they quantify and categorize in building ontologies. In this article, we propose the semantic recategorization of the Korean numeral classifiers in the context of classifier ontology based on large corpora and KorLex Noun 1.5 (Korean wordnet; Korean Lexical Semantic Network), considering its high applicability in the NLP domain. In particular, the classifier can be effectively used to predict the semantic characteristics of nouns and to process them appropriately in NLP. The major challenge is to make such semantic classification and the attendant NLP techniques efficient. Accordingly, a Korean numeral classifier ontology (KorLexClas 1.0), including semantic hierarchies and relations to nouns, was constructed.
Hyuk-Chul Kwon (Corresponding author)

15.
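Most existing topic tracking methods target news data and perform poorly when applied to forums. Taking the characteristics of forums into account, a forum topic tracking method based on semantic similarity is proposed. The method first builds text representation models by constructing keyword tables for topics and posts, then uses HowNet to compute the semantic similarity between two keyword tables and takes it as the degree of relevance between a post and a topic, and finally tracks forum topics according to that relevance. The method largely avoids the shortcomings of the vector space model. Experiments show that it handles forum-oriented topic tracking fairly effectively.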

16.
In this paper, we present a system using computational linguistic techniques to extract metadata for image access. We discuss the implementation, functionality and evaluation of an image catalogers’ toolkit, developed in the Computational Linguistics for Metadata Building (CLiMB) research project. We have tested components of the system, including phrase finding for the art and architecture domain, functional semantic labeling using machine learning, and disambiguation of terms in domain-specific text vis a vis a rich thesaurus of subject terms, geographic and artist names. We present specific results on disambiguation techniques and on the nature of the ambiguity problem given the thesaurus, resources, and domain-specific text resource, with a comparison of domain-general resources and text. Our primary user group for evaluation has been the cataloger expert with specific expertise in the fields of painting, sculpture, and vernacular and landscape architecture.
Carolyn Sheffield

Judith L. Klavans   is a Senior Research Scientist at the University of Maryland Institute for Advanced Computer Studies (UMIACS), and Principal Investigator on the Mellon-funded Computational Linguistics for Metadata Building (CLiMB) and IMLS-supported T3 research projects. Her research includes text-mining from corpora and dictionaries, disambiguation, and multilingual multidocument summarization. Previously, she directed the Center for Research on Information Access at Columbia University. Carolyn Sheffield   holds an M.L.S. from the University of Maryland and her research interests include access issues surrounding visual and time-based materials. She designs, conducts and analyzes the CLiMB user studies and works closely with image catalogers to ensure that the CLiMB system reflects their needs and workflow. Eileen Abels   is Masters’ Program Director and Professor in the College of Information Science and Technology at Drexel University. Prior to joining Drexel in January 2007, Dr. Abels spent more than 15 years at the College of Information Studies at the University of Maryland. Her research focuses on user needs and information behaviors. She works with a broad range of information users including translators, business school students and faculty, engineers, scientists, and members of the general public. Dr. Abels holds a PhD from the University of California, Los Angeles. Jimmy Lin’s   research interests lie at the intersection of natural language processing and information retrieval. His work integrates knowledge- and data-driven approaches to address users’ information needs. Rebecca J. Passonneau   is a Research Scientist at the Center for Computational Learning Systems, Columbia University. Her areas of interest include linking empirical research methods on corpora with computational models of language processing, the intersection of language and context in semantics and pragmatics, corpus design and analysis, and evaluation methods for NLP. 
Her current projects involve working with machine learning for the Consolidated Edison utility company, and designing an experimental dialog system to take patron book orders by phone for the Andrew Heiskell Braille and Talking Book library. Tandeep Sidhu   is the Software Developer and Research Assistant for the CLiMB project. He is in charge of designing the CLiMB Toolkit as well as the NLP modules behind the Toolkit. He is currently pursuing his MS degree in Computer Science. Dagobert Soergel   has been teaching information organization at the University of Maryland since 1970 and is an internationally known expert in Knowledge Organization Systems and in Digital Libraries. In the CLiMB project he served as general consultant and was specially involved in the design of a study on the relationship between an image and the cataloging terms assigned to it.

17.
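Chinese word segmentation in the processing of geometric propositions   Total citations: 1, self-citations: 1, citations by others: 1
She Li (佘莉), Fu Hongguang (符红光), Fang Haiguang (方海光). Computer Engineering (《计算机工程》), 2005, 31(18): 180-182
How to automatically convert elementary geometry propositions stated in natural language into a drawing language that a computer can understand is an open gap in natural language processing, and also a difficulty in realizing human-computer interaction in educational software. Chinese word segmentation is the first step of natural language processing, and its results directly affect later processing. By studying the restricted language of the geometry domain, this paper builds an effective and workable language understanding model, performs morpheme segmentation and part-of-speech tagging, and implements them in a program.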

18.
This paper comparatively analyzes a method to automatically classify case studies of building information modeling (BIM) in construction projects by BIM use. Identifying a project that has used BIM in a manner of interest generally takes from thirty minutes to several hours of collection and review, and an average of four information sources. To automate and expedite these analysis tasks, this study deployed natural language processing (NLP) and commonly used unsupervised learning methods for text classification, namely latent semantic analysis (LSA) and latent Dirichlet allocation (LDA). The results were validated against one of the representative supervised learning methods for text classification, the support vector machine (SVM). When LSA and LDA detected phrases in a BIM case study whose similarity to the definition of a BIM use exceeded the threshold value, the system determined that the project had deployed BIM in the detected approach. For the classification of BIM use, the BIM uses specified by Pennsylvania State University were utilized. The approach was validated using 240 BIM case studies (512,892 features). When BIM uses were employed in a project, the project was labeled as “1”; when they were not, the project was labeled as “0.” The performance was analyzed by changing parameters: namely, document segmentation, feature weighting, dimensionality reduction coefficient (k-value), the number of topics, and the number of iterations. LDA yielded the highest F1 score, 80.75% on average. LDA and LSA yielded high recall and low precision in most cases. Conversely, SVM yielded high precision and low recall in most cases, as well as fluctuations in F1 scores.
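The evaluation above is reported in terms of precision, recall, and F1 over binary labels ("1" when a BIM use was employed, "0" otherwise). As a minimal sketch under that labeling assumption, the function below computes the three standard metrics from a list of true labels and a list of predicted labels. The function name is hypothetical and the code is not taken from the study.

```python
def precision_recall_f1(y_true, y_pred):
    """Return (precision, recall, F1) for binary labels, with 1 as the positive class."""
    if len(y_true) != len(y_pred):
        raise ValueError("label lists must have the same length")

    pairs = list(zip(y_true, y_pred))
    true_pos = sum(1 for t, p in pairs if t == 1 and p == 1)
    false_pos = sum(1 for t, p in pairs if t == 0 and p == 1)
    false_neg = sum(1 for t, p in pairs if t == 1 and p == 0)

    # Fall back to 0.0 when a denominator is zero (no predicted or no actual positives).
    precision = true_pos / (true_pos + false_pos) if (true_pos + false_pos) else 0.0
    recall = true_pos / (true_pos + false_neg) if (true_pos + false_neg) else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    return precision, recall, f1
```

A classifier that labels many projects as "1" tends toward high recall and low precision, the pattern the abstract reports for LDA and LSA. A conservative classifier shows the reverse, as reported for SVM.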

19.
20.
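Role inversion algorithm (角色反演算法)   Total citations: 6, self-citations: 0, citations by others: 6
Bai Shuo (白硕), Zhang Hao (张浩). Journal of Software (《软件学报》), 2003, 14(3): 328-333
A computational mechanism for syntactic analysis of context-free languages is presented: the role inversion algorithm. By introducing the concept of the "role" of a syntactic category and a corresponding role inversion operation, the mechanism achieves a strong look-ahead capability within the Chart algorithm at a small cost in space. This saves a large number of useless edges and thus speeds up the parsing process. The mechanism can be used in a variety of application areas, including natural language processing.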
