Similar Documents
A total of 20 similar documents were retrieved (search time: 31 ms).
1.
Automatic keyphrase extraction has many important applications including but not limited to summarization, cataloging/indexing, feature extraction for clustering and classification, and data mining. This paper presents the KP-Miner system, and demonstrates through experimentation and comparison with widely used systems that it is effective and efficient in extracting keyphrases from both English and Arabic documents of varied length. Unlike other existing keyphrase extraction systems, the KP-Miner system does not need to be trained on a particular document set in order to achieve its task. It also has the advantage of being configurable as the rules and heuristics adopted by the system are related to the general nature of documents and keyphrases. This implies that the users of this system can use their understanding of the document(s) being input into the system to fine-tune it to their particular needs.

2.
This paper describes the organization and results of the automatic keyphrase extraction task held at the Workshop on Semantic Evaluation 2010 (SemEval-2010). The keyphrase extraction task was specifically geared towards scientific articles. Systems were automatically evaluated by matching their extracted keyphrases against those assigned by the authors as well as the readers to the same documents. We outline the task, present the overall ranking of the submitted systems, and discuss the improvements to the state-of-the-art in keyphrase extraction.
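The abstract does not include the scoring code; as a minimal sketch of the exact-match evaluation it describes (matching system keyphrases against author- and reader-assigned ones), the following Python snippet computes precision, recall, and F1 for one document. The function name and sample data are illustrative, not taken from the SemEval-2010 toolkit.

```python
def evaluate_keyphrases(extracted, gold, top_n=10):
    """Exact-match precision/recall/F1 for a single document.

    extracted: ranked list of system keyphrases
    gold: set of author/reader-assigned keyphrases
    """
    candidates = [k.lower().strip() for k in extracted[:top_n]]
    gold = {g.lower().strip() for g in gold}
    matched = sum(1 for k in candidates if k in gold)
    precision = matched / len(candidates) if candidates else 0.0
    recall = matched / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Example: a system's top-5 output against the combined gold keyphrases.
system = ["keyphrase extraction", "semantic evaluation", "topic model",
          "scientific articles", "ranking"]
gold = {"keyphrase extraction", "scientific articles", "evaluation"}
print(evaluate_keyphrases(system, gold, top_n=5))
```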

3.
Keyphrase extraction from social media is a crucial and challenging task. Previous studies usually focus on extracting keyphrases that provide the summary of a corpus. However, they do not take users’ specific needs into consideration. In this paper, we propose a novel three-stage model to learn a keyphrase set that represents, or is related to, a particular topic. Firstly, a phrase mining algorithm is applied to segment the documents into human-interpretable phrases. Secondly, we propose a weakly supervised model to extract candidate keyphrases, guided by a few pre-specified seed keyphrases; as a result, the extracted keyphrases are more specific and more closely related to the seed keyphrases (which reflect the user’s needs). Finally, to further identify implicitly related phrases, the PMI-IR algorithm is employed to obtain the synonyms of the extracted candidate keyphrases. We conducted experiments on two publicly available datasets from news and Twitter. The experimental results demonstrate that our approach outperforms the state-of-the-art baselines and has the potential to extract high-quality task-oriented keyphrases.
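The paper's own formulation of PMI-IR is not given here; as a rough, corpus-based illustration of the idea (pointwise mutual information between a candidate phrase and a seed keyphrase), the sketch below estimates PMI from document co-occurrence counts. The function and the toy documents are hypothetical.

```python
import math

def pmi(candidate, seed, documents):
    """Pointwise mutual information between two phrases over a document set.

    PMI(c, s) = log2( P(c, s) / (P(c) * P(s)) ), with probabilities estimated
    from document-level co-occurrence counts (add-one smoothing avoids log 0).
    """
    n = len(documents)
    c_hits = sum(candidate in d for d in documents)
    s_hits = sum(seed in d for d in documents)
    both = sum(candidate in d and seed in d for d in documents)
    p_c = (c_hits + 1) / (n + 1)
    p_s = (s_hits + 1) / (n + 1)
    p_cs = (both + 1) / (n + 1)
    return math.log2(p_cs / (p_c * p_s))

docs = [
    "apple unveils new iphone with improved camera",
    "the new iphone camera impresses reviewers",
    "stock markets rally after earnings reports",
]
# A higher score suggests "camera" is implicitly related to the seed "iphone".
print(pmi("camera", "iphone", docs))
```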

4.
Character groundtruth for real, scanned document images is crucial for evaluating the performance of OCR systems, training OCR algorithms, and validating document degradation models. Unfortunately, manual collection of accurate groundtruth for characters in a real (scanned) document image is not practical because (i) accuracy in delineating groundtruth character bounding boxes is not high enough, (ii) it is extremely laborious and time consuming, and (iii) the manual labor required for this task is prohibitively expensive. We describe a closed-loop methodology for collecting very accurate groundtruth for scanned documents. We first create ideal documents using a typesetting language. Next we create the groundtruth for the ideal document. The ideal document is then printed, photocopied, and scanned. A registration algorithm estimates the global geometric transformation and then performs a robust local bitmap match to register the ideal document image to the scanned document image. Finally, groundtruth associated with the ideal document image is transformed using the estimated geometric transformation to create the groundtruth for the scanned document image. This methodology is very general and can be used for creating groundtruth for documents typeset in any language, layout, font, and style. We have demonstrated the method by generating groundtruth for English, Hindi, and FAX document images. The cost of creating groundtruth using our methodology is minimal. If character, word or zone groundtruth is available for any real document, the registration algorithm can be used to generate the corresponding groundtruth for a rescanned version of the document.
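As a sketch of the final step described above (mapping ideal-document groundtruth through the estimated geometric transformation), the snippet below applies an affine transform to character bounding boxes with NumPy. In the paper the transform is estimated by the registration algorithm; here it is simply supplied by hand, and all names and values are illustrative.

```python
import numpy as np

def transform_boxes(boxes, affine):
    """Map ideal-document bounding boxes into scanned-image coordinates.

    boxes : (N, 4) array of [x0, y0, x1, y1] in the ideal image.
    affine: 2x3 matrix [[a, b, tx], [c, d, ty]] estimated by registration.
    Returns axis-aligned boxes that enclose the transformed corners.
    """
    boxes = np.asarray(boxes, dtype=float)
    # All four corners of every box, shape (N, 4, 2).
    corners = np.stack([boxes[:, [0, 1]], boxes[:, [2, 1]],
                        boxes[:, [2, 3]], boxes[:, [0, 3]]], axis=1)
    A, t = np.asarray(affine)[:, :2], np.asarray(affine)[:, 2]
    mapped = corners @ A.T + t
    return np.concatenate([mapped.min(axis=1), mapped.max(axis=1)], axis=1)

# Example: a small rotation plus translation, as a registration step might estimate.
theta = np.deg2rad(1.5)
affine = [[np.cos(theta), -np.sin(theta), 12.0],
          [np.sin(theta),  np.cos(theta),  8.0]]
print(transform_boxes([[100, 200, 140, 220]], affine))
```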

5.
马佩勋  高琰 《计算机应用研究》2013,30(12):3610-3613
Keyphrases extracted by the traditional TF*PDF method can accurately describe a topic and support the tracking of news reports, but the method sometimes misidentifies noisy data as keyphrases. This paper proposes a two-stage keyphrase extraction method based on position-weighted TF*PDF to filter out such noise. The method combines the traditional TF*PDF algorithm with position weights to compute the weights of words and phrases and obtain a candidate keyphrase list; the pulse value of each keyphrase is then used to filter noise from the list. A keyphrase identification process combines hot words into phrases according to position and frequency information. The position-weighted TF*PDF algorithm is also used to assign weights to phrases, and the top-K ranked phrases are taken as hot keyphrases. Experimental results on real Web data show that, compared with the traditional TF*PDF method, the proposed approach removes absolute noise from keyphrases more effectively and improves the accuracy of hot topic detection.
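The exact position-weighting scheme is not reproduced here; the sketch below only illustrates the TF*PDF idea for a single channel (term weight = channel frequency times the exponential of the proportion of documents containing the term), with an assumed position boost for occurrences in the first half of a document. All names and data are illustrative.

```python
import math
from collections import defaultdict

def tf_pdf_position(documents, position_boost=2.0):
    """Position-weighted TF*PDF over a single news channel (a rough sketch).

    For term j: weight = F_j * exp(n_j / N), where F_j is the normalized term
    frequency across the channel, n_j the number of documents containing the
    term, and N the channel size. Occurrences in the first half of a document
    are counted 'position_boost' times to mimic a position weight.
    """
    N = len(documents)
    freq = defaultdict(float)
    doc_count = defaultdict(int)
    for doc in documents:
        tokens = doc.lower().split()
        half = len(tokens) / 2
        seen = set()
        for i, tok in enumerate(tokens):
            freq[tok] += position_boost if i < half else 1.0
            seen.add(tok)
        for tok in seen:
            doc_count[tok] += 1
    norm = math.sqrt(sum(f * f for f in freq.values())) or 1.0
    return {t: (freq[t] / norm) * math.exp(doc_count[t] / N) for t in freq}

docs = ["earthquake hits coastal city overnight",
        "rescue teams reach coastal city after earthquake",
        "new phone released this week"]
weights = tf_pdf_position(docs)
print(sorted(weights, key=weights.get, reverse=True)[:3])
```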

6.
This paper proposes a new, efficient algorithm for extracting similar sections between two time sequence data sets. The algorithm, called Relay Continuous Dynamic Programming (Relay CDP), realizes fast matching between arbitrary sections in the reference pattern and the input pattern and enables the extraction of similar sections in a frame synchronous manner. In addition, Relay CDP is extended to two types of applications that handle spoken documents. The first application is the extraction of repeated utterances in a presentation or a news speech because repeated utterances are assumed to be important parts of the speech. These repeated utterances can be regarded as labels for information retrieval. The second application is flexible spoken document retrieval. A phonetic model is introduced to cope with the speech of different speakers. The new algorithm allows a user to query by natural utterance and searches spoken documents for any partial matches to the query utterance. We present herein a detailed explanation of Relay CDP and the experimental results for the extraction of similar sections and report results for two applications using Relay CDP. Yoshiaki Itoh has been an associate professor in the Faculty of Software and Information Science at Iwate Prefectural University, Iwate, Japan, since 2001. He received the B.E. degree, M.E. degree, and Dr. Eng. from Tokyo University, Tokyo, in 1987, 1989, and 1999, respectively. From 1989 to 2001 he was a researcher and a staff member of Kawasaki Steel Corporation, Tokyo and Okayama. From 1992 to 1994 he transferred as a researcher to Real World Computing Partnership, Tsukuba, Japan. Dr. Itoh's research interests include spoken document processing without recognition, audio and video retrieval, and real-time human communication systems. He is a member of ISCA, Acoustical Society of Japan, Institute of Electronics, Information and Communication Engineers, Information Processing Society of Japan, and Japan Society of Artificial Intelligence. Kazuyo Tanaka has been a professor at the University of Tsukuba, Tsukuba, Japan, since 2002. He received the B.E. degree from Yokohama National University, Yokohama, Japan, in 1970, and the Dr. Eng. degree from Tohoku University, Sendai, Japan, in 1984. From 1971 to 2002 he was research officer of Electrotechnical Laboratory (ETL), Tsukuba, Japan, and the National Institute of Advanced Science and Technology (AIST), Tsukuba, Japan, where he was working on speech analysis, synthesis, recognition, and understanding, and also served as chief of the speech processing section. His current interests include digital signal processing, spoken document processing, and human information processing. He is a member of IEEE, ISCA, Acoustical Society of Japan, Institute of Electronics, Information and Communication Engineers, and Japan Society of Artificial Intelligence. Shi-Wook Lee received the B.E. degree and M.E. degree from Yeungnam University, Korea and Ph.D. degree from the University of Tokyo in 1995, 1997, and 2001, respectively. Since 2001 he has been working in the Research Group of Speech and Auditory Signal Processing, the National Institute of Advanced Science and Technology (AIST), Tsukuba, Japan, as a postdoctoral fellow. His research interests include spoken document processing, speech recognition, and understanding.  相似文献   
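Relay CDP itself is not reproduced here; as a loose illustration of locating similar sections between a reference and an input sequence by dynamic programming on frame-level distances, the sketch below runs a Smith-Waterman-style local alignment. The gap cost, match bonus, and toy feature vectors are arbitrary assumptions.

```python
import numpy as np

def local_match(ref, inp, gap=0.5, match_bonus=1.0):
    """Smith-Waterman-style local alignment over two feature sequences.

    Not the Relay CDP algorithm itself - just a rough illustration of how
    similar sections of 'ref' and 'inp' can be located by dynamic programming
    on frame-to-frame similarity (negative Euclidean distance plus a bonus).
    Returns the best score and the (ref_end, inp_end) frame indices.
    """
    ref, inp = np.asarray(ref, float), np.asarray(inp, float)
    R, I = len(ref), len(inp)
    score = np.zeros((R + 1, I + 1))
    best, best_pos = 0.0, (0, 0)
    for r in range(1, R + 1):
        for i in range(1, I + 1):
            sim = match_bonus - np.linalg.norm(ref[r - 1] - inp[i - 1])
            score[r, i] = max(0.0,
                              score[r - 1, i - 1] + sim,
                              score[r - 1, i] - gap,
                              score[r, i - 1] - gap)
            if score[r, i] > best:
                best, best_pos = score[r, i], (r, i)
    return best, best_pos

ref = [[0.1], [0.9], [0.8], [0.2]]   # e.g., frame features of a query utterance
inp = [[0.5], [0.1], [0.9], [0.8], [0.3]]
print(local_match(ref, inp))
```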

7.
In this paper, we give a formal definition of a document image structure representation, and formulate document image structure extraction as a partitioning problem: finding an optimal solution partitioning the set of glyphs of an input document image into a hierarchical tree structure where entities within the hierarchy at each level have similar physical properties and compatible semantic labels. We present a unified methodology that is applicable to construction of document structures at different hierarchical levels. An iterative, relaxation-like method is used to find a partitioning solution that maximizes the probability of the extracted structure. All the probabilities used in the partitioning process are estimated from an extensive training set of various kinds of measurements among the entities within the hierarchy. The offline probabilities estimated in the training then drive all decisions in the online document structure extraction. We have implemented a text line extraction algorithm using this framework.

8.
The keyphrases of a text entity are a set of words or phrases that concisely describe the main content of that text. Automatic keyphrase extraction plays an important role in natural language processing and information retrieval tasks such as text summarization, text categorization, full-text indexing, and cross-lingual text reuse. However, automatic keyphrase extraction is still a complicated task and the performance of the current keyphrase extraction methods is low. Automatic discovery of high-quality and meaningful keyphrases requires the application of useful information and suitable mining techniques. This paper proposes Topical and Structural Keyphrase Extractor (TSAKE) for the task of automatic keyphrase extraction. TSAKE combines the prior knowledge about the input language learned by an N-gram topical model (TNG) with the co-occurrence graph of the input text to form some topical graphs. Different from most of the recent keyphrase extraction models, TSAKE uses the topic model to weight the edges instead of the nodes of the co-occurrence graph. Moreover, while TNG represents the general topics of the language, TSAKE applies network analysis techniques to each topical graph to detect finer grained sub-topics and extract more important words of each sub-topic. The use of these informative words in the ranking process of the candidate keyphrases improves the quality of the final keyphrases proposed by TSAKE. The results of our experimental studies conducted on three manually annotated datasets show the superiority of the proposed model over three baseline techniques and six state-of-the-art models.
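The TNG topic model is not shown here; as an illustration of the key design choice (weighting the edges of the co-occurrence graph by topical importance and then ranking nodes), the sketch below uses networkx with a hand-supplied topic weight per word standing in for the learned model. All data and weights are made up.

```python
import networkx as nx

def topical_graph(sentences, topic_weight, window=2):
    """Build a word co-occurrence graph whose *edges* carry topical weights.

    'topic_weight' maps a word to its weight under one topic (in TSAKE this
    would come from the N-gram topic model; here it is just a given dict).
    Edge weight = accumulated co-occurrences, each scaled by the mean topical
    weight of the two endpoint words.
    """
    g = nx.Graph()
    for sent in sentences:
        words = sent.lower().split()
        for i, w in enumerate(words):
            for v in words[i + 1:i + 1 + window]:
                if w == v:
                    continue
                tw = (topic_weight.get(w, 0.1) + topic_weight.get(v, 0.1)) / 2
                prev = g.get_edge_data(w, v, {"weight": 0.0})["weight"]
                g.add_edge(w, v, weight=prev + tw)
    return g

sentences = ["neural topic models improve keyphrase ranking",
             "keyphrase ranking benefits from topic graphs"]
topic_weight = {"topic": 0.9, "keyphrase": 0.8, "ranking": 0.7, "graphs": 0.5}
g = topical_graph(sentences, topic_weight)
scores = nx.pagerank(g, weight="weight")
print(sorted(scores, key=scores.get, reverse=True)[:3])
```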

9.
Knowledge and Information Systems - Massive open online courses (MOOCs) have emerged as a great resource for learners. Numerous challenges remain to be addressed in order to make MOOCs more useful...

10.
Multimedia Tools and Applications - The performance of document text recognition depends on text line segmentation algorithms, which heavily relies on the type of language, author’s writing...

11.

Automatic key concept identification from text is a central and challenging task in information extraction, information retrieval, digital libraries, ontology learning, and text analysis. The main difficulty lies in the text data itself: noise, diversity, scale, context dependency, and word sense ambiguity. To cope with this challenge, numerous supervised and unsupervised approaches have been devised. Existing topical clustering-based approaches for keyphrase extraction are domain dependent and overlook the semantic similarity between candidate features while extracting topical phrases. In this paper, a semantics-based unsupervised approach (KP-Rank) is proposed for keyphrase extraction. The approach exploits Latent Semantic Analysis (LSA) and clustering techniques, and introduces a novel frequency-based candidate-ranking algorithm that considers locality-based sentence, paragraph, and section frequencies. To evaluate the performance of the proposed method, three benchmark datasets (Inspec, 500N-KPCrowed and SemEval-2010) from different domains are used. The experimental results show that, overall, KP-Rank achieves significant improvements over existing approaches on the selected performance measures.

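As a rough illustration of the LSA-plus-clustering step described above (not the actual KP-Rank pipeline or its locality-based frequencies), the sketch below embeds candidate phrases with TF-IDF, reduces them with truncated SVD, and clusters them with k-means using scikit-learn. The candidate phrases are invented examples.

```python
from sklearn.cluster import KMeans
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

def lsa_cluster_candidates(candidates, n_topics=2, n_clusters=2):
    """Group candidate phrases by latent-semantic similarity (a rough sketch).

    Candidates are embedded with TF-IDF over word unigrams and bigrams,
    reduced with truncated SVD (the usual LSA step), and clustered with
    k-means so that near-synonymous candidates land in the same cluster.
    """
    vec = TfidfVectorizer(analyzer="word", ngram_range=(1, 2))
    X = vec.fit_transform(candidates)
    lsa = TruncatedSVD(n_components=n_topics, random_state=0)
    Z = lsa.fit_transform(X)
    labels = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit_predict(Z)
    return dict(zip(candidates, labels))

candidates = ["keyphrase extraction", "key phrase mining",
              "ocr accuracy", "character recognition accuracy"]
print(lsa_cluster_candidates(candidates))
```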

12.
13.
Qian  Yue  Liu  Yu  Xu  Xiujuan  Sheng  Quan Z. 《World Wide Web》2020,23(4):2281-2302
World Wide Web - This paper studies a link-text algorithm to model scientific documents by citation influences, which is applied to document clustering and influence prediction. Most existing...

14.
Classifier-based acronym extraction for business documents
Acronym extraction for business documents has been neglected in favor of acronym extraction for biomedical documents. Although there are overlapping challenges, the semi-structured and non-predictive nature of business documents hinders the effectiveness of the extraction methods used on biomedical documents, so they fail to deliver the expected performance. A classifier-based extraction subsystem is presented as part of the wider project, Binocle, for the analysis of French business corpora. Explicit and implicit acronym presentation cases are identified using textual and syntactic hints. Among the 7 features extracted from each candidate instance, we introduce “similarity” features, which compare a candidate’s characteristics with average length-related values calculated from a generic acronym repository. Commonly used rules for evaluating a candidate (matching first letters, ordered instances, etc.) are scored and aggregated into a single composite feature that permits a flexible classification. One hundred and thirty-eight French business documents from 14 public organizations were used for the training and evaluation corpora, yielding a recall of 90.9% at a precision of 89.1% for a search space of 3 sentences.
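As an illustration of aggregating scored candidate-evaluation rules into a single composite feature (the concrete rules and weights used in Binocle are not reproduced here), the sketch below combines three classic acronym/expansion checks with assumed weights.

```python
def composite_acronym_feature(acronym, expansion, weights=(0.5, 0.3, 0.2)):
    """Aggregate a few classic acronym/expansion rules into one score.

    The rules and weights are illustrative, not the ones used in Binocle:
    (1) acronym letters match the first letters of the expansion words,
    (2) those letters appear in order inside the expansion,
    (3) the number of expansion words is close to the acronym length.
    """
    letters = acronym.lower()
    words = expansion.lower().split()
    initials = "".join(w[0] for w in words)

    first_letters = sum(a == b for a, b in zip(letters, initials)) / len(letters)

    rest, in_order = expansion.lower(), 0
    for ch in letters:
        pos = rest.find(ch)
        if pos >= 0:
            in_order += 1
            rest = rest[pos + 1:]
    ordered = in_order / len(letters)

    expected_words = len(letters)                 # crude length expectation
    length_sim = min(len(words), expected_words) / max(len(words), expected_words)

    scores = (first_letters, ordered, length_sim)
    return sum(w * s for w, s in zip(weights, scores))

print(composite_acronym_feature("SNCF", "Societe Nationale des Chemins de fer Francais"))
```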

15.
16.
Efficient extraction of schemas for XML documents
In this paper, we present a technique for efficient extraction of concise and accurate schemas for XML documents. By restricting the schema form and applying some heuristic rules, we achieve the efficiency and conciseness. The result of an experiment with real-life DTDs shows that our approach attains high accuracy and is 20 to 200 times faster than existing approaches.
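The restricted schema form and heuristic rules of the paper are not reproduced here; as a naive illustration of inferring a schema-like content model from sample XML documents, the sketch below records, for each element name, which child elements are required and which are optional. All element names and samples are hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

def infer_content_models(xml_strings):
    """Infer a crude content model per element from sample XML documents.

    Only a naive illustration (child-name sets, '?' marks optional children),
    not the restricted schema forms or heuristics used in the paper.
    """
    children_seen = defaultdict(list)          # element -> list of child-name sets
    for xml in xml_strings:
        for elem in ET.fromstring(xml).iter():
            children_seen[elem.tag].append({child.tag for child in elem})
    models = {}
    for tag, sets in children_seen.items():
        required = set.intersection(*sets) if sets else set()
        optional = set.union(*sets) - required
        models[tag] = sorted(required) + [c + "?" for c in sorted(optional)]
    return models

samples = [
    "<book><title/><author/><year/></book>",
    "<book><title/><author/></book>",
]
print(infer_content_models(samples))   # e.g. {'book': ['author', 'title', 'year?'], ...}
```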

17.
Data mining algorithms such as data classification or clustering methods exploit features of entities to characterise, group or classify them according to their resemblance. In the past, many feature extraction methods focused on the analysis of numerical or categorical properties. In recent years, motivated by the success of the Information Society and the WWW, which has made available enormous amounts of textual electronic resources, researchers have proposed semantic data classification and clustering methods that exploit textual data at a conceptual level. To do so, these methods rely on pre-annotated inputs in which text has been mapped to its formal semantics according to one or several knowledge structures (e.g. ontologies, taxonomies). Hence, they are hampered by the bottleneck introduced by the manual semantic mapping process. To tackle this problem, this paper presents a domain-independent, automatic and unsupervised method to detect relevant features from heterogeneous textual resources, associating them to concepts modelled in a background ontology. The method has been applied to raw text resources and also to semi-structured ones (Wikipedia articles). It has been tested in the Tourism domain, showing promising results.

18.
To help users search domain-specific document collections or those limited in size, the author created a search system based on a generic framework. The system incorporates a simple domain-independent dialogue manager and an automatically created model of the domain.

19.
An expert advisor (SPECIFAC) has been developed to assist design engineers in the creation of written engineering specification documents. A ‘mix & match’ style of clause library is used to provide a framework for the particular specifications. A proprietary shell was employed to build the advisor and the shell has been linked with a professional word processor to perform text handling. The advisor seeks to suggest clauses that should be included, based on pre-defined associations with relevant criteria.

Knowledge representation was facilitated by the use of a two-stage process. First, a conceptual picture of the knowledge was built up to assist in the acquisition, structuring and coding of the knowledge. Second, a shell-dependent coding scheme was enacted.

This paper describes the design and operation of the expert system and explains how it improves upon existing methods. Some of the main features are considered in detail and the issues surrounding the important design decisions are discussed.

The advisor was developed as a pilot system for British Rail, Civil Engineering Department. It has resulted in resources being committed to implement SPECIFAC on an operational basis.


20.
Improving the training speed of a classifier without degrading its predictive capability is an important concern in text classification. Feature selection plays a key role in this context: it selects a subset of the most informative words (terms) from the set of all words. The correlative association of words with classes increases the uncertainty about which words represent a class. The representative words of a class are either positive or negative in nature. The standard feature selection methods, viz. Mutual Information (MI), Information Gain (IG), Discriminating Feature Selection (DFS) and Chi Square (CHI), do not consider the positive and negative nature of words, which affects the performance of the classifiers. To address this issue, this paper presents a novel feature selection method named Correlative Association Score (CAS). It combines the strength, mutual information, and strong association of words to determine their positive and negative nature for a class. CAS selects a few (k) informative words from the set of all words (m). These informative words generate a set of N-grams of length 1-3. Finally, the standard Apriori algorithm combines the power of CAS and CHI to select the top b most informative N-grams, where b is a number set by an empirical evaluation. Multinomial Naive Bayes (MNB) and Linear Support Vector Machine (LSVM) classifiers evaluate the performance of the selected N-grams. Four standard text data sets, viz. Webkb, 20Newsgroup, Ohsumed10, and Ohsumed23, are used for experimental analysis. Two standard performance measures, Macro_F1 and Micro_F1, show a significant improvement in the results using the proposed CAS method.
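The CAS formula itself is not given in the abstract; as a simple illustration of scoring the positive or negative nature of a word for a class, the sketch below computes a signed, smoothed log-ratio of class-conditional word probabilities on toy data. The scoring function, documents, and labels are all made up.

```python
import math
from collections import Counter

def signed_association(docs, labels, target_class):
    """Signed word-class association (an illustration, not the CAS formula).

    score(w) = log( P(w | class) / P(w | not class) ); positive values mark
    words that indicate the class, negative values mark words that argue
    against it. Counts are add-one smoothed.
    """
    in_class, out_class = Counter(), Counter()
    n_in = n_out = 0
    for doc, label in zip(docs, labels):
        words = set(doc.lower().split())
        if label == target_class:
            in_class.update(words)
            n_in += 1
        else:
            out_class.update(words)
            n_out += 1
    vocab = set(in_class) | set(out_class)
    return {w: math.log((in_class[w] + 1) / (n_in + 2)) -
               math.log((out_class[w] + 1) / (n_out + 2))
            for w in vocab}

docs = ["the team won the match", "players scored two goals",
        "the market fell sharply", "stocks dropped after earnings"]
labels = ["sport", "sport", "finance", "finance"]
scores = signed_association(docs, labels, "sport")
print(sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:3])
```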
