Similar Literature
20 similar documents found.
1.
Policy terms are time-sensitive, low-frequency, sparse, and often compound phrases, and traditional term extraction methods struggle to meet these demands. We therefore designed and implemented a semantically enhanced, multi-strategy policy term extraction system. The system models policy text features along two dimensions, frequent-item mining and semantic similarity: it fuses several frequent-pattern mining strategies to select feature seed words, then uses a pretrained language model to strengthen semantic matching and recall low-frequency, sparse policy terms, achieving semi-automated policy term extraction from a lexicon-free cold start to a lexicon-backed warm start. The system improves the analysis of policy texts and provides technical support for building smart government service platforms.
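As a rough illustration of the semantic-recall step, the sketch below scores candidate phrases against the seed words by embedding cosine similarity and keeps any candidate close enough to some seed. The `embed` backend and the 0.75 threshold are assumptions; the abstract says only that a pretrained language model strengthens the semantic matching.

```python
# Hypothetical sketch: recall low-frequency candidates by semantic
# similarity to seed terms. embed() is an assumed stand-in for whatever
# pretrained-language-model encoder the system actually uses.
from typing import Callable, List
import numpy as np

def semantic_recall(seeds: List[str], candidates: List[str],
                    embed: Callable[[List[str]], np.ndarray],
                    threshold: float = 0.75) -> List[str]:
    seed_vecs = embed(seeds)        # shape (S, d)
    cand_vecs = embed(candidates)   # shape (C, d)
    # Normalise rows so the dot product below is cosine similarity.
    seed_vecs /= np.linalg.norm(seed_vecs, axis=1, keepdims=True)
    cand_vecs /= np.linalg.norm(cand_vecs, axis=1, keepdims=True)
    sims = cand_vecs @ seed_vecs.T  # shape (C, S)
    # Recall a candidate if it is close enough to at least one seed word.
    return [c for c, row in zip(candidates, sims) if row.max() >= threshold]
```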

2.
Automatic extraction of domain terminology from Chinese text is a fundamental task in Chinese information processing, with wide applications in natural language generation, information retrieval, text summarization, and related fields. We address it with a method that combines rules with multiple statistical strategies, working from two angles: unithood (how well a string forms a word) and domainhood (how specific it is to the domain), and we build a corresponding extraction system. The pipeline comprises candidate acquisition by expansion based on left/right information entropy, a unithood filter combining part-of-speech collocation rules with a knowledge base of boundary-occurrence probabilities, and a domainhood filter based on term frequency-inverse document frequency (TF-IDF). The algorithm extracts not only common domain words but also discovers new domain terms. Experiments show the resulting system reaches a precision of 84.33%, demonstrating that the method effectively supports automatic extraction of Chinese domain terminology.
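The left/right information entropy driving the candidate-expansion step can be computed as in the generic sketch below (branching entropy over the characters adjacent to a candidate string); this is a textbook formulation, not the authors' exact implementation.

```python
import math
from collections import Counter

def boundary_entropies(corpus: str, candidate: str):
    """Left/right branching entropy of a candidate string: high entropy on
    both sides suggests the string is a free-standing unit (a term)."""
    left, right = Counter(), Counter()
    start = corpus.find(candidate)
    while start != -1:
        if start > 0:
            left[corpus[start - 1]] += 1
        end = start + len(candidate)
        if end < len(corpus):
            right[corpus[end]] += 1
        start = corpus.find(candidate, start + 1)

    def entropy(counter: Counter) -> float:
        total = sum(counter.values())
        return -sum(n / total * math.log2(n / total)
                    for n in counter.values()) if total else 0.0

    return entropy(left), entropy(right)
```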

3.
Domain Term Extraction Based on Web Resources and User Behavior Information
Domain terms are words that reflect the characteristics of a domain. Automatic domain term extraction is an important task in natural language processing, with applications in domain ontology extraction, vertical search, text classification, class-based language modeling, and many other areas; building domain lexicons from large-scale domain-specific corpora on the Internet is therefore both challenging and practically valuable. Current work on domain term extraction mainly uses the body text of web pages, but the difficulty of extracting body text cleanly degrades the term extraction results. Using the anchor text and query text associated with pages in place of body text avoids that difficulty. To counter the shortcomings of anchor and query text, namely very short length and limited semantic information, we propose a domain data extraction method applicable to various types of web data and user behavior data, and apply it to term extraction over page body text, anchor text, user query logs, and user browsing data, focusing on how different types of web resources and user behavior information affect extraction quality. Experiments on massive real-world web data show that user queries and the anchor text of pages users browsed yield better domain term extraction than body text obtained with page content extraction techniques.

4.
In restricted-domain factoid question answering systems, template matching is an effective and stable approach. However, existing question template construction methods are usually supervised, which makes them heavily dependent on manually annotated data and hard to port across domains. We therefore propose an unsupervised template extraction method based on an improved Apriori algorithm. For a sample of in-domain questions, it mines frequent itemsets with an added phrase-order feature and takes the frequent items as the framework words of question templates; it measures each template's information content with TF-IDF and removes low-information templates; and, to obtain longer templates, it introduces an adaptive support-update mechanism into Apriori. Finally, it identifies slots with named entity recognition and combines framework words and slots into question templates. Experiments show the method effectively mines question templates from a restricted-domain QA dataset and outperforms baseline models.
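A minimal sketch of the level-wise mining step follows. It preserves token order (the phrase-order feature) and lowers the support threshold at each level; the geometric decay is an illustrative assumption standing in for the paper's adaptive support-update mechanism, whose exact schedule the abstract does not give.

```python
from itertools import combinations
from collections import Counter
from typing import List, Tuple

def ordered_subsequences(tokens: List[str], k: int):
    # Order-preserving k-token subsequences (the phrase-order feature);
    # adjacency is not required, so gapped templates can be found.
    return set(combinations(tokens, k))

def mine_templates(questions: List[List[str]], min_support: float = 0.3,
                   decay: float = 0.9, max_len: int = 4):
    n = len(questions)
    frequent: List[Tuple[Tuple[str, ...], float]] = []
    support = min_support
    for k in range(1, max_len + 1):
        counts = Counter()
        for q in questions:
            for seq in ordered_subsequences(q, k):
                counts[seq] += 1  # one count per question (document support)
        level = [(seq, c / n) for seq, c in counts.items() if c / n >= support]
        if not level:
            break
        frequent.extend(level)
        # Illustrative stand-in for the adaptive support update: relax the
        # threshold so longer framework-word templates can survive.
        support *= decay
    return frequent
```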

5.
Automatic Extraction of New Domain Terms Containing Letter Words
Extracting new terms is an important research topic in Chinese information processing. Given the shortcomings of existing extraction methods and the fact that many technical terms appear as letter words (terms containing Latin letters), this paper proposes a method combining statistical techniques with rule-based filtering: the text is segmented with a longest-string-first, string-frequency strategy to obtain co-occurring character strings, which are filtered with word collocation rules and then screened by a domain lexicon and an evaluation function to extract new domain terms. The method can discover semantically specific strings, phrases, and words of arbitrary length with frequency of at least 2, including out-of-vocabulary items such as letter words and technical terms. Experiments demonstrate the method's effectiveness and characterize the precision distribution of the extracted new terms.

6.
Automatic Term Extraction in Chinese Domain Ontology Learning
We propose a hybrid strategy for automatic domain term extraction: first extract multi-character candidate terms and perform word segmentation, then merge the two results, and finally select the terms by domain relevance and domain consensus (topic consistency). We improve existing methods in both the multi-character extraction stage and the final selection stage, reducing the time complexity of string decomposition and raising the precision and recall of domain term extraction. Experiments show a term extraction precision of 90.64%, which outperforms existing extraction methods.
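Domain relevance and domain consensus are standard scores in the terminology-extraction literature; the sketch below uses their common definitions, which may differ from the paper's improved variants.

```python
# Hedged sketch of the two selection scores. Domain relevance compares a
# term's relative frequency in the target domain with its maximum across
# domains; domain consensus is the entropy of the term's distribution
# over target-domain documents (even spread = higher consensus).
import math
from typing import Dict, List

def domain_relevance(term: str, target_freq: Dict[str, int],
                     other_freqs: List[Dict[str, int]]) -> float:
    def rel_freq(freqs: Dict[str, int]) -> float:
        total = sum(freqs.values())
        return freqs.get(term, 0) / total if total else 0.0
    p_target = rel_freq(target_freq)
    p_max = max([p_target] + [rel_freq(f) for f in other_freqs])
    return p_target / p_max if p_max else 0.0

def domain_consensus(term: str, doc_freqs: List[Dict[str, int]]) -> float:
    counts = [d.get(term, 0) for d in doc_freqs]
    total = sum(counts)
    if total == 0:
        return 0.0
    probs = [c / total for c in counts if c]
    return -sum(p * math.log2(p) for p in probs)
```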

7.
The importance of research on knowledge management is growing due to recent issues in Big Data. One of the most fundamental steps in knowledge management is the extraction of terminologies. Terms are often expressed in various forms, and the variations often play a negative role, becoming an obstacle that causes knowledge systems to extract unnecessary terms. To solve the problem, we propose a term normalization method that finds the normalized form (the original, standard form defined in dictionaries) of variant terms. The method employs two characteristics of terms: appearance similarity, measuring how similar terms look, and context similarity, measuring how many clue words they share. Through experiments, we show the positive influence of both similarities on term normalization.
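A minimal sketch of combining the two similarities, assuming a Jaccard form for context similarity and an equal weighting, neither of which the abstract specifies:

```python
from difflib import SequenceMatcher
from typing import Set

def appearance_similarity(a: str, b: str) -> float:
    # How alike the two surface forms are.
    return SequenceMatcher(None, a, b).ratio()

def context_similarity(ctx_a: Set[str], ctx_b: Set[str]) -> float:
    # How many clue words the two terms share in their contexts
    # (Jaccard overlap; an assumed formulation).
    if not ctx_a or not ctx_b:
        return 0.0
    return len(ctx_a & ctx_b) / len(ctx_a | ctx_b)

def normalization_score(a: str, b: str, ctx_a: Set[str], ctx_b: Set[str],
                        alpha: float = 0.5) -> float:
    # Equal weighting is an illustrative default, not the authors' setting.
    return (alpha * appearance_similarity(a, b)
            + (1 - alpha) * context_similarity(ctx_a, ctx_b))
```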

8.
Automatic recognition of multi-word terms: the C-value/NC-value method
Technical terms (henceforth called terms) are important elements for digital libraries. In this paper we present a domain-independent method for the automatic extraction of multi-word terms from machine-readable special-language corpora. The method, C-value/NC-value, combines linguistic and statistical information. The first part, C-value, enhances the common statistical measure of frequency of occurrence for term extraction, making it sensitive to a particular type of multi-word term, the nested term. The second part, NC-value, contributes: 1) a method for the extraction of term context words (words that tend to appear with terms); and 2) the incorporation of information from term context words into the extraction of terms.
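The C-value part can be transcribed almost directly from its published formula; the sketch below omits the NC-value context weighting.

```python
# C-value: candidates that occur mostly nested inside longer candidates
# are discounted. freq maps each multi-word candidate (as a token tuple)
# to its corpus frequency.
import math
from typing import Dict, Tuple

def c_values(freq: Dict[Tuple[str, ...], int]) -> Dict[Tuple[str, ...], float]:
    terms = list(freq)
    scores = {}
    for a in terms:
        # Longer candidates that contain `a` as a contiguous subsequence.
        nests = [b for b in terms
                 if len(b) > len(a)
                 and any(b[i:i + len(a)] == a
                         for i in range(len(b) - len(a) + 1))]
        if nests:
            scores[a] = math.log2(len(a)) * (
                freq[a] - sum(freq[b] for b in nests) / len(nests))
        else:
            scores[a] = math.log2(len(a)) * freq[a]
    return scores
```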

9.
Research on a New Keyword Extraction Algorithm
Traditional keyword extraction algorithms are usually based on high-frequency words, but the keywords of a document are not always high-frequency words, so keywords must also be found among the non-high-frequency words. We abstract a document as a graph, in which nodes represent words and edges represent word co-occurrence, and based on this topological structure we propose a new keyword extraction algorithm. Compared with traditional keyword extraction algorithms, it achieves good results in both precision and coverage.
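The abstract does not spell out the ranking step, so the sketch below uses degree centrality on the word co-occurrence graph as a simple stand-in for exploiting the document's topology to surface keywords that are not high-frequency words.

```python
from collections import defaultdict
from typing import List

def keyword_candidates(sentences: List[List[str]], window: int = 2,
                       top_k: int = 10) -> List[str]:
    # Build the co-occurrence graph: nodes are words, edges link words
    # that appear within `window` tokens of each other.
    graph = defaultdict(set)
    for tokens in sentences:
        for i, w in enumerate(tokens):
            for j in range(i + 1, min(i + 1 + window, len(tokens))):
                if tokens[j] != w:
                    graph[w].add(tokens[j])
                    graph[tokens[j]].add(w)
    # Rank by the number of distinct co-occurring words, not raw frequency.
    return sorted(graph, key=lambda w: len(graph[w]), reverse=True)[:top_k]
```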

10.
Chinese patent documents contain a large number of domain terms, and recognizing them automatically is an important task for information extraction, text mining, and related fields. This paper proposes a method that automatically generates part-of-speech rules for terms from patent document titles, together with a TermRank algorithm for ranking candidate terms. The method first generates part-of-speech rules automatically from a large number of Chinese patent titles, then matches the rules against the body text of patent documents to obtain a candidate term list, and finally ranks the candidates with the proposed TermRank algorithm to produce the term list. Experiments on 9,725 Chinese patent documents confirm the method's effectiveness.

11.
Technical-term translation represents one of the most difficult tasks for human translators since (1) most translators are not familiar with terms and domain-specific terminology and (2) such terms are not adequately covered by printed dictionaries. This paper describes an algorithm for translating technical words and terms from noisy parallel corpora across language groups. Given any word which is part of a technical term in the source language, the algorithm produces a ranked candidate match for it in the target language. Potential translations for the term are compiled from the matched words and are also ranked. We show how this ranked list helps translators in technical-term translation. Most algorithms for lexical and term translation focus on Indo-European language pairs, and most use a sentence-aligned clean parallel corpus without insertion, deletion or OCR noise. Our algorithm is language- and character-set-independent, and is robust to noise in the corpus. We show how our algorithm requires minimal preprocessing and is able to obtain technical-word translations without sentence-boundary identification or sentence alignment, from the English-Japanese awk manual corpus with noise arising from text insertions or deletions and from the English-Chinese HKUST bilingual corpus. We obtain a precision of 55.35% on the awk corpus for word translation including rare words, counting only the best candidate and direct translations. Translation precision of the best-candidate translation is 89.93% on the HKUST corpus. Potential term translations produced by the program help bilingual speakers achieve a 47% improvement in translating technical terms.

12.
An important component of a spoken term detection (STD) system involves estimating confidence measures of hypothesised detections. A potential problem of the widely used lattice-based confidence estimation, however, is that the confidence scores are treated uniformly for all search terms, regardless of how much they may differ in terms of phonetic or linguistic properties. This problem is particularly evident for out-of-vocabulary (OOV) terms, which tend to exhibit high intra-term diversity. To address the impact of term diversity on confidence measures, we propose in this work a term-dependent normalisation technique which compensates for term diversity in confidence estimation. We first derive an evaluation-metric-oriented normalisation that optimises the evaluation metric by compensating for the diverse occurrence rates among terms, and then propose a linear bias compensation and a discriminative compensation to deal with the bias problem that is inherent in lattice-based confidence measurement and from which the Term Specific Threshold (TST) approach suffers. We tested the proposed technique on speech data from the multi-party meeting domain with two state-of-the-art STD systems based on phonemes and words respectively. The experimental results demonstrate that the confidence normalisation approach leads to a significant performance improvement in STD, particularly for OOV terms with phoneme-based systems.

13.
Dictionary learning algorithms for sparse representation

14.
Bilingual termbanks are important for many natural language processing applications, especially in translation workflows in industrial settings. In this paper, we apply a log-likelihood comparison method to extract monolingual terminology from the source and target sides of a parallel corpus. The initial candidate terminology list is prepared by taking all arbitrary n-gram word sequences from the corpus. Then, a well-known statistical measure (the Dice coefficient) is employed in order to remove any multi-word terms with weak associations from the candidate term list. Thereafter, the log-likelihood comparison method is applied to rank the phrasal candidate term list. Then, using a phrase-based statistical machine translation model, we create a bilingual terminology with the extracted monolingual term lists. We integrate an external knowledge source, the Wikipedia cross-language link databases, into the terminology extraction (TE) model to assist two processes: (a) the ranking of the extracted terminology list, and (b) the selection of appropriate target terms for a source term. First, we report the performance of our monolingual TE model compared to a number of state-of-the-art TE models on English-to-Turkish and English-to-Hindi data sets. Then, we evaluate our novel bilingual TE model on an English-to-Turkish data set, and report the automatic evaluation results. We also manually evaluate our novel TE model on English-to-Spanish and English-to-Hindi data sets, and observe excellent performance for all domains.
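The Dice-coefficient filter for weakly associated multi-word candidates can be sketched as below, shown for bigrams only; the paper applies it to arbitrary n-grams, and the 0.1 threshold is an assumption.

```python
# Dice(x, y) = 2 * f(xy) / (f(x) + f(y)): high when the two words occur
# together about as often as they occur at all.
from typing import Dict, List, Tuple

def dice(bigram: Tuple[str, str], unigram_freq: Dict[str, int],
         bigram_freq: Dict[Tuple[str, str], int]) -> float:
    x, y = bigram
    denom = unigram_freq.get(x, 0) + unigram_freq.get(y, 0)
    return 2 * bigram_freq.get(bigram, 0) / denom if denom else 0.0

def filter_candidates(bigram_freq: Dict[Tuple[str, str], int],
                      unigram_freq: Dict[str, int],
                      threshold: float = 0.1) -> List[Tuple[str, str]]:
    # Keep only candidates whose component words are strongly associated.
    return [bg for bg in bigram_freq
            if dice(bg, unigram_freq, bigram_freq) >= threshold]
```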

15.
Automatic keyword extraction from documents has long been used and has proven useful in various areas. Crowdsourced tagging for multimedia resources has emerged and looks promising to a certain extent. Automatic approaches for unstructured data, automatic keyword extraction, and crowdsourced tagging are efficient, but they all suffer from a lack of contextual understanding. In this paper, we propose a new model for extracting key contextual terms from unstructured data, especially from documents, with crowdsourcing. The model consists of four sequential processes: (1) term selection by frequency, (2) sentence building, (3) revised term selection reflecting the newly built sentences, and (4) sentence voting. Online workers read only a fraction of a document and participated in the sentence building and sentence voting processes, and key sentences were generated as a result. We compared the generated sentences to the keywords entered by the author and to the sentences generated by offline workers who read the whole document. The results support the idea that the sentence building process can help select terms with more contextual meaning, closing the gap between keywords from automated approaches and the contextual understanding required by humans.

16.
One of the major challenges in Peer-to-Peer (P2P) file sharing systems is to support content-based search. Although there have been some proposals to address this challenge, they share the same weakness of using either servers or super-peers to keep global knowledge, which is required to identify the importance of terms so that popular terms can be avoided in query processing. As a result, they are not scalable and are prone to the bottleneck problem caused by the high visiting load at the global knowledge maintainers. To address this, we propose a novel adaptive indexing approach for content-based search in P2P systems, which can identify the importance of terms without keeping global knowledge. Our method is based on an adaptive indexing structure that combines a Chord ring and a balanced tree. The tree is used to aggregate and classify terms adaptively, while the Chord ring is used to index the terms of nodes in the tree. Specifically, at each node of the tree, the system classifies terms as either important or unimportant. Important terms, which can distinguish the node from its neighbor nodes, are indexed in the Chord ring. On the other hand, unimportant terms, which are either popular or rare terms, are aggregated to higher-level nodes. Such classification enables the system to process queries on the fly without the need for global knowledge. Besides, compared to methods that index terms separately, term aggregation reduces the indexing cost significantly. Taking advantage of the tree structure, we also develop an efficient search algorithm to tackle the bottleneck problem near the root. Finally, our extensive experiments on both benchmark and Wikipedia datasets validate the effectiveness and efficiency of the proposed method.

17.
Design of a Multi-Strategy Term Extractor for Specialized Domains
杜波, 田怀凤, 王立, 陆汝占. 《计算机工程》, 2005, 31(14): 159-160
We design a term extraction algorithm for specialized domains that combines statistical and rule-based methods. Tailored to the characteristics of domain terms, it uses several statistics measuring how tightly the characters in a string bind together: a threshold classifier first extracts two-character candidates; these candidates are then extended to the left and right to a limited degree, and multi-character candidates meeting the requirements are selected from the extensions; finally, the candidates are filtered to produce the result. On this basis we implemented an extraction program that takes unsegmented, unannotated raw corpus as input and outputs domain terms. We tested it on corpora from several domains, analyze the experimental results, point out remaining problems, and outline future work.

18.
Nowadays, people not only navigate the web but also contribute content to it. Among other things, they write their thoughts and opinions in review sites, forums, social networks, blogs and other websites. These opinions constitute a valuable resource for businesses, governments and consumers. In recent years, some researchers have proposed opinion extraction systems, mostly domain-independent ones, to automatically extract structured representations of the opinions contained in those texts. In this work, we tackle this task with a domain-oriented approach, defining a set of domain-specific resources which capture valuable knowledge about how people express opinions on a given domain. These resources are automatically induced from a set of annotated documents. Experiments were carried out on three different domains (user-generated reviews of headphones, hotels and cars), comparing our approach to other state-of-the-art, domain-independent techniques. The results confirm the importance of the domain in building accurate opinion extraction systems. Experiments on the influence of dataset size and an example of aggregation and visualization of the extracted opinions are also shown.

19.
In this paper we provide an account of the cross-lingual lexical substitution task run as part of SemEval-2010. In this task both annotators (native Spanish speakers, proficient in English) and participating systems had to find Spanish translations for target words in the context of an English sentence. Because only translations of a single lexical unit were required, this task does not necessitate a full-blown translation system. We hope this encouraged those working specifically on lexical semantics to participate, since they were not required to use machine translation software, though they were free to use whatever resources they chose. In this paper we pay particular attention to the resources used by the various participating systems and present analyses to demonstrate the relative strengths of the systems as well as the requirements they have in terms of resources. In addition to the analyses of individual systems we also present the results of a combined system based on voting from the individual systems. We demonstrate that the combined system produces better results at finding the annotators' most frequent translation than the highest-ranked translation provided by any individual system. This supports our other analyses showing that the systems are heterogeneous, with different strengths and weaknesses.

20.
Document representation in traditional text classification is generally based on a bag-of-words analysis of the full text; because it ignores domain-specific semantic features, it does not transfer well to classification tasks targeted at a particular domain. This paper proposes a feature selection method that extracts domain-relevant vocabulary by contrasting corpora, and combines it with an SVM classifier to build a text classification system for specific domains that can be easily applied to any domain. The system ranked first in the Genomics Track Categorization Task at the 2005 Text REtrieval Conference (TREC).
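The overall pipeline shape, a contrast-derived domain vocabulary feeding an SVM, might be sketched as follows with scikit-learn; the frequency-ratio selection and whitespace tokenisation are illustrative assumptions, not the paper's exact contrast measure.

```python
# Hedged sketch: select domain-relevant vocabulary by contrasting term
# frequencies between a domain corpus and a background corpus, then
# restrict the TF-IDF feature space to that vocabulary for a linear SVM.
from collections import Counter
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.pipeline import make_pipeline

def domain_vocabulary(domain_docs, background_docs, top_k=2000):
    dom = Counter(w.lower() for d in domain_docs for w in d.split())
    bg = Counter(w.lower() for d in background_docs for w in d.split())
    # Illustrative contrast score: domain frequency over smoothed
    # background frequency (the paper's measure may differ).
    ratio = {w: c / (bg.get(w, 0) + 1) for w, c in dom.items()}
    return sorted(ratio, key=ratio.get, reverse=True)[:top_k]

def build_classifier(domain_docs, background_docs):
    vocab = domain_vocabulary(domain_docs, background_docs)
    return make_pipeline(TfidfVectorizer(vocabulary=vocab), LinearSVC())

# Usage (hypothetical data):
# clf = build_classifier(train_docs, background_docs)
# clf.fit(train_docs, labels); predictions = clf.predict(test_docs)
```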
