Similar Documents
20 similar documents found
1.
This paper investigates how to build a normalization method for surgical procedure terms. First, it analyzes the characteristics of the surgical term normalization dataset; second, it surveys existing approaches to term normalization; finally, combining the surveyed techniques with the dataset's characteristics, it builds a surgical term normalization model. The model combines text-similarity-based candidate ranking with BERT-based matching. In the surgical term normalization shared task of the 2019 China Health Information Processing Conference (CHIP2019), it achieved 88.35% accuracy on the validation set and 88.51% on the test set, ranking 5th among all participating teams.
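A minimal sketch of the two-stage pipeline this abstract describes (surface-similarity candidate recall followed by BERT-based matching). The model name, the character-overlap recall function, and the use of a generic sequence-classification head are assumptions for illustration, not the authors' released system.

# Hypothetical sketch: recall candidates by character overlap, then
# re-score (original term, candidate) pairs with a BERT pair classifier.
import torch
from transformers import BertTokenizer, BertForSequenceClassification

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
model = BertForSequenceClassification.from_pretrained("bert-base-chinese")  # assumed fine-tuned on match/non-match pairs
model.eval()

def recall_candidates(mention, standard_terms, k=50):
    """Rank standard terms by character-level Dice overlap with the mention."""
    def overlap(a, b):
        sa, sb = set(a), set(b)
        return 2 * len(sa & sb) / (len(sa) + len(sb))
    return sorted(standard_terms, key=lambda t: overlap(mention, t), reverse=True)[:k]

def best_match(mention, candidates):
    """Score each (mention, candidate) pair with BERT and return the top candidate."""
    inputs = tokenizer([mention] * len(candidates), candidates,
                       padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits          # shape: (num_candidates, 2)
    scores = logits.softmax(dim=-1)[:, 1]        # probability of the "match" class
    return candidates[int(scores.argmax())]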

2.
An information retrieval system has to retrieve all and only those documents that are relevant to a user query, even if index terms and query terms are not matched exactly. However, term mismatches between index terms and query terms have been a serious obstacle to the enhancement of retrieval performance. In this article, we discuss automatic term normalization between words and phrases in text corpora and their application to a Korean information retrieval system. We perform three new types of term normalizations: transliterated word normalization, noun phrase normalization, and context-based term normalization. Transliterated words are normalized into equivalence classes by using contextual similarity to alleviate lexical term mismatches. Then, noun phrases are normalized into phrasal terms by segmenting compound nouns as well as normalizing noun phrases. Moreover, context-based terms are normalized by using a combination of mutual information and word context to establish word similarities. Next, unsupervised clustering is done by using the K-means algorithm and cooccurrence clusters are identified to alleviate semantic term mismatches. These term normalizations are used in both the indexing and the retrieval system. The experimental results show that our proposed system can alleviate three types of term mismatches and can also provide the appropriate similarity measurements. As a result, our system can improve the retrieval effectiveness of the information retrieval system.
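Abstract 2 combines mutual information with word context to establish word similarities and then clusters with K-means. The toy sketch below illustrates only those two steps on made-up co-occurrence counts; the corpus, vocabulary size, and vector construction are assumptions, not the paper's setup.

# Hypothetical sketch: PMI-weighted co-occurrence vectors + K-means clustering.
import math
import numpy as np
from sklearn.cluster import KMeans

def pmi_matrix(cooc, totals, n_pairs):
    """cooc[i][j]: co-occurrence count; totals[i]: count of word i; n_pairs: total pair count."""
    n = len(totals)
    pmi = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            if cooc[i][j] > 0:
                p_ij = cooc[i][j] / n_pairs
                p_i, p_j = totals[i] / n_pairs, totals[j] / n_pairs
                pmi[i, j] = max(0.0, math.log(p_ij / (p_i * p_j)))  # positive PMI
    return pmi

# Toy example: 4 words with symmetric co-occurrence counts.
cooc = np.array([[0, 8, 1, 0],
                 [8, 0, 2, 1],
                 [1, 2, 0, 9],
                 [0, 1, 9, 0]])
totals = cooc.sum(axis=1)
vectors = pmi_matrix(cooc, totals, cooc.sum())
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(vectors)
print(labels)   # words 0,1 and words 2,3 should fall into separate clusters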

3.
Clinical terms in electronic medical records are described in diverse and non-standard ways, which hinders the analysis and use of medical data, so research on clinical term standardization has significant practical value. In Chinese medical institutions, clinical term standardization is currently done mostly by hand, which is inefficient and costly. This paper proposes a BERT-based method for clinical term standardization. The method uses Jaccard similarity to select candidate terms from the standard terminology set, then matches the original term against each candidate with a BERT model to obtain the standardized result. It achieves 90.04% accuracy on the dataset of the CHIP2019 clinical term standardization task. The experimental results show that the method is effective for clinical term standardization.
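The Jaccard-based candidate selection step described above can be sketched as follows; the character-level sets and the cutoff k are assumptions about the implementation, and the standard terms shown are placeholders.

# Hypothetical sketch of the Jaccard candidate-selection step.
def jaccard(a: str, b: str) -> float:
    """Jaccard similarity between the character sets of two terms."""
    sa, sb = set(a), set(b)
    return len(sa & sb) / len(sa | sb) if (sa | sb) else 0.0

def select_candidates(raw_term: str, standard_terms: list, k: int = 10) -> list:
    """Return the k standard terms most similar to the raw term under Jaccard."""
    return sorted(standard_terms, key=lambda t: jaccard(raw_term, t), reverse=True)[:k]

# Example: pick candidates for a colloquial surgical description.
standards = ["胃癌根治术", "腹腔镜胆囊切除术", "阑尾切除术"]
print(select_candidates("腹腔镜下胆囊切除", standards, k=2))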

4.
Clinical term standardization is an indispensable part of medical statistics. In practice, a single standard clinical term may have several colloquial and non-standard descriptions, and for applications such as building clinical knowledge bases, standardizing these descriptions is a problem that must be addressed. This paper focuses on the standardization of Chinese clinical terms, i.e., mapping non-standard descriptions of Chinese clinical terms to standard terms in a given clinical terminology. Although deep discriminative models have achieved some success in standardizing medical terms with simple textual structure, such as disease and drug names, the descriptions to be standardized in Chinese clinical term standardization often suffer from missing information and one-to-many mappings; discriminative models alone cannot capture the complete semantics, leading to unsatisfactory performance. This paper treats clinical term standardization as a translation task: a deep generative model is introduced to generate the core semantics of the description and produce a candidate set of standard terms, and the candidates are then re-ranked with a BERT-based semantic similarity algorithm to obtain the final standard term. Experiments on the evaluation data of the 5th China Health Information Processing Conference (CHIP2019) show that the method achieves very good results.

5.
Computing the semantic similarity between terms (or short text expressions) that have the same meaning but which are not lexicographically similar is an important challenge in the information integration field. The problem is that techniques for textual semantic similarity measurement often fail to deal with words not covered by synonym dictionaries. In this paper, we try to solve this problem by determining the semantic similarity for terms using the knowledge inherent in the search history logs from the Google search engine. To do this, we have designed and evaluated four algorithmic methods for measuring the semantic similarity between terms using their associated history search patterns. These algorithmic methods are: a) frequent co-occurrence of terms in search patterns, b) computation of the relationship between search patterns, c) outlier coincidence on search patterns, and d) forecasting comparisons. We have shown experimentally that some of these methods correlate well with respect to human judgment when evaluating general purpose benchmark datasets, and significantly outperform existing methods when evaluating datasets containing terms that do not usually appear in dictionaries.

6.
Clinical term standardization is an indispensable task in medical text information extraction. In clinical practice, the same diagnosis, operation, drug, examination, lab test, or symptom is often written in many different ways; the goal of term standardization (normalization) is to find the corresponding standard name for each of these variants. Building on candidate answers generated by retrieval techniques, this paper proposes a method that re-ranks the candidates with BERT (bidirectional encoder representation from transformers). Experiments show that on the CHIP2019 operation name standardization dataset the method reaches 89.1% accuracy with a single model and 92.8% with an ensemble, which largely meets the requirements of practical applications. The method also generalizes well and can be applied to the standardization of other types of medical terms.

7.
Clinical term standardization means mapping any term written by a physician to its corresponding standard term in a standard terminology. The standard terms are numerous and highly similar to one another, and zero-shot and few-shot cases arise, which makes term standardization very challenging. Based on the dataset provided for Task 1 of the China Health Information Processing Conference (CHIP 2019), this paper designs and implements a clinical term standardization system based on ranking BERT entailment scores. The system consists of four modules: data preprocessing, BERT entailment scoring, BERT-based prediction of the number of standard terms, and logistic-regression-based re-ranking. Using accuracy as the evaluation metric, the final result is 0.94825, ranking first in Task 1.
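A minimal sketch of the logistic-regression re-ranking stage described above, assuming the BERT entailment score for each (mention, candidate) pair has already been computed; the extra surface features and the toy data are illustrative assumptions, not the competition system.

# Hypothetical sketch: re-rank candidates with logistic regression over a
# BERT entailment score plus simple surface features.
import numpy as np
from sklearn.linear_model import LogisticRegression

def features(bert_score, mention, candidate):
    overlap = len(set(mention) & set(candidate)) / max(len(set(mention) | set(candidate)), 1)
    return [bert_score, overlap, abs(len(mention) - len(candidate))]

# Toy training pairs: (entailment score, mention, candidate) with gold 0/1 labels.
X_train = np.array([features(0.92, "胃溃疡", "胃溃疡"),
                    features(0.40, "胃溃疡", "十二指肠溃疡"),
                    features(0.85, "阑尾炎", "急性阑尾炎"),
                    features(0.20, "阑尾炎", "胆囊炎")])
y_train = np.array([1, 0, 1, 0])
reranker = LogisticRegression().fit(X_train, y_train)

# At inference time the candidate with the highest re-ranked probability wins.
entail = {"急性阑尾炎": 0.80, "胆囊炎": 0.30}
cands = list(entail)
probs = reranker.predict_proba([features(entail[c], "阑尾炎", c) for c in cands])[:, 1]
print(cands[int(np.argmax(probs))])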

8.
A general measure of similarity for categorical sequences
Measuring the similarity between categorical sequences is a fundamental process in many data mining applications. A key issue is extracting and making use of significant features hidden behind the chronological and structural dependencies found in these sequences. Almost all existing algorithms designed to perform this task are based on the matching of patterns in chronological order, but such sequences often have similar structural features in chronologically different order. In this paper we propose SCS, a novel, effective and domain-independent method for measuring the similarity between categorical sequences, based on an original pattern matching scheme that makes it possible to capture chronological and non-chronological dependencies. SCS captures significant patterns that represent the natural structure of sequences, and reduces the influence of those which are merely noise. It constitutes an effective approach to measuring the similarity between data in the form of categorical sequences, such as biological sequences, natural language texts, speech recognition data, certain types of network transactions, and retail transactions. To show its effectiveness, we have tested SCS extensively on a range of data sets from different application fields, and compared the results with those obtained by various mainstream algorithms. The results obtained show that SCS produces results that are often competitive with domain-specific similarity approaches.

9.
When one tries to use the Web as a dictionary or encyclopedia, entering some single term into a search engine, the highly ranked pages in the result can include irrelevant or useless sites. The problem is that single-term queries, if taken literally, underspecify the type of page the user wants. For such problems automatic query expansion, also known as pseudo-feedback, is often effective. In this method the top n documents returned by an initial retrieval are used to provide terms for a second retrieval. This paper contributes, first, new normalization techniques for query expansion, and second, a new way of computing the similarity between an expanded query and a document, the "local relevance density" metric, which complements the standard vector product metric. Both of these techniques are shown to be useful for single-term queries, in Japanese, in experiments done over the World Wide Web in early 2001.
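A small sketch of the generic pseudo-feedback loop that abstract 9 builds on: high-weight terms from the top-n initially retrieved documents are added to the query before a second retrieval. The TF-IDF weighting and the toy token lists are assumptions for illustration, not the paper's normalization techniques.

# Hypothetical sketch of pseudo-relevance feedback (automatic query expansion).
from collections import Counter
import math

def expand_query(query_terms, top_docs, all_docs, n_expansion=5):
    """Add the highest TF-IDF terms from the top-ranked documents to the query."""
    df = Counter(t for doc in all_docs for t in set(doc))
    n = len(all_docs)
    tfidf = Counter()
    for doc in top_docs:
        for term, count in Counter(doc).items():
            tfidf[term] += count * math.log(n / (1 + df[term]))
    expansion = [t for t, _ in tfidf.most_common() if t not in query_terms][:n_expansion]
    return list(query_terms) + expansion

# Toy usage: documents are token lists; terms from the top 2 feed back into the query.
docs = [["jaguar", "cat", "habitat"], ["jaguar", "car", "engine"], ["cat", "habitat", "forest"]]
print(expand_query(["jaguar"], docs[:2], docs))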

10.
Most statistics-based text similarity measures first represent texts as term-frequency vectors using TF-IDF and then compute the similarity between texts with the cosine measure. Because such methods ignore the semantics of the terms in a text, they do not reflect text similarity well. Semantics-based methods can largely remedy this shortcoming but require a knowledge base to model the semantic relations between words. After studying the strengths and weaknesses of these two families of methods, this paper proposes a novel text similarity measure: texts are first preprocessed, terms with high TF-IDF values are selected as feature terms, and text similarity is then computed by combining semantic analysis of the feature terms based on the HowNet semantic dictionary with TF-IDF term-frequency statistics; finally, the resulting similarity is used in clustering experiments on benchmark text datasets. The experimental results show that the F-measure obtained with the proposed method is clearly better than that of methods using TF-IDF alone or word semantics alone, demonstrating the effectiveness of the proposed text similarity measure.
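The TF-IDF + cosine baseline that abstract 10 starts from (and then augments with HowNet semantics) can be sketched as below; this is only the generic baseline on toy sentences, not the paper's combined measure.

# Hypothetical sketch of the TF-IDF + cosine baseline for text similarity.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

texts = ["the cat sat on the mat",
         "a cat was sitting on a mat",
         "stock prices fell sharply today"]

tfidf = TfidfVectorizer().fit_transform(texts)       # documents as TF-IDF vectors
sims = cosine_similarity(tfidf)                      # pairwise cosine similarities
print(sims[0, 1], sims[0, 2])                        # the first pair should score higher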

11.
An improved ontology-based semantic similarity computation and its application
Word similarity is an important topic in knowledge representation and information retrieval. Word similarity is usually computed statistically over large-scale corpora. Ontologies offer new opportunities for computing similarity between words. Using the ISA relations in an ontology's structure, this paper proposes a method for computing the similarity between concepts within an ontology. Experimental results show that the method makes full use of the ontology's characteristics to compute the similarity between related concepts. Using a simple ontology as an example, the paper describes how to compute the similarity between concepts and its application in an intelligent retrieval system.
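Abstract 11 computes concept similarity from ISA relations in an ontology; since the abstract does not give the exact formula, the sketch below uses a common depth-based formulation (Wu-Palmer style) purely as an assumed stand-in, on a toy hierarchy.

# Hypothetical sketch: depth-based concept similarity over an ISA hierarchy
# (Wu-Palmer style: sim = 2*depth(LCS) / (depth(a) + depth(b))).
parent = {"poodle": "dog", "dog": "animal", "cat": "animal", "animal": "thing"}

def path_to_root(concept):
    path = [concept]
    while path[-1] in parent:
        path.append(parent[path[-1]])
    return path  # e.g. ["poodle", "dog", "animal", "thing"]

def depth(concept):
    return len(path_to_root(concept))

def similarity(a, b):
    ancestors_a = set(path_to_root(a))
    lcs = next(c for c in path_to_root(b) if c in ancestors_a)  # lowest common subsumer
    return 2 * depth(lcs) / (depth(a) + depth(b))

print(similarity("poodle", "cat"))   # shares "animal" -> moderately similar
print(similarity("poodle", "dog"))   # "dog" subsumes "poodle" -> higher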

12.
To address the short length and sparse features of short texts, this paper proposes a short-text similarity method that combines co-occurrence distance and distinguishing power. On the one hand, the method uses the distance between two co-occurring words across the whole short-text corpus to compute their co-occurrence distance correlation. On the other hand, it computes a co-occurrence distinguishing degree to improve the accuracy of the distance correlation, then weights the terms in each text by relevance, and finally computes the similarity of two texts from the term weights and the co-occurrence distance correlations between terms. Experimental results show that the proposed method improves the accuracy of short-text similarity computation.

13.
To address the extremely sparse features and strong context dependence of short texts, this paper proposes CTMPS, a top-down short-text clustering algorithm based on the average partition similarity of core terms. The method first computes the probabilistic correlation between terms over the whole short-text corpus and uses it to weight the terms in each short text; the highest-weighted terms, which best represent each text, form a set of core terms. Grounded in information theory, the algorithm uses the core terms as partition criteria to compute the average partition similarity, groups the short texts containing the core term with the largest average partition similarity into one cluster, and iterates this strategy until the stopping criteria are met. Experimental results show that the proposed method significantly improves short-text clustering performance.

14.
Wang Tao, Cai Yi, Leung Ho-fung, Lau Raymond Y. K., Xie Haoran, Li Qing. Knowledge and Information Systems, 2021, 63(9): 2313-2346

In text categorization, Vector Space Model (VSM) has been widely used for representing documents, in which a document is represented by a vector of terms. Since different terms contribute to a document’s semantics in various degrees, a number of term weighting schemes have been proposed for VSM to improve text categorization performance. Much evidence shows that the performance of a term weighting scheme often varies across different text categorization tasks, while the mechanism underlying variability in a scheme’s performance remains unclear. Moreover, existing schemes often weight a term with respect to a category locally, without considering the global distribution of a term’s occurrences across all categories in a corpus. In this paper, we first systematically examine pros and cons of existing term weighting schemes in text categorization and explore the reasons why some schemes with sound theoretical bases, such as chi-square test and information gain, perform poorly in empirical evaluations. By measuring the concentration that a term distributes across all categories in a corpus, we then propose a series of entropy-based term weighting schemes to measure the distinguishing power of a term in text categorization. Through extensive experiments on five different datasets, the proposed term weighting schemes consistently outperform the state-of-the-art schemes. Moreover, our findings shed new light on how to choose and develop an effective term weighting scheme for a specific text categorization task.

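Abstract 14 weights a term by how concentrated its distribution is across categories; one simple entropy-based weight of that kind is sketched below as an illustrative formulation, not necessarily the exact schemes proposed in the paper.

# Hypothetical sketch: weight a term by how concentrated its occurrences are
# across categories (1 - normalized entropy); an even spread gives weight near 0.
import math

def entropy_weight(counts_per_category):
    """counts_per_category: occurrences of one term in each category."""
    total = sum(counts_per_category)
    k = len(counts_per_category)
    if total == 0 or k < 2:
        return 0.0
    entropy = -sum((c / total) * math.log(c / total) for c in counts_per_category if c > 0)
    return 1.0 - entropy / math.log(k)   # 1 = all mass in one category, 0 = uniform spread

print(entropy_weight([30, 0, 0, 0]))   # highly discriminative term -> 1.0
print(entropy_weight([8, 7, 8, 7]))    # spread evenly -> close to 0.0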

15.
Deformable registration is prone to errors when it involves large and complex deformations, since the procedure can easily end up in a local minimum. To reduce the number of local minima, and thus the risk of misalignment, regularization terms based on prior knowledge can be incorporated in registration. We propose a regularization term that is based on statistical knowledge of the deformations that are to be expected. A statistical model, trained on the shapes of a set of segmentations, is integrated as a penalty term in a free-form registration framework. For the evaluation of our approach, we perform inter-patient registration of MR images, which were acquired for planning of radiation therapy of cervical cancer. The manual delineations of structures such as the bladder and the clinical target volume are available. For both structures, leave-one-patient-out registration experiments were performed. The propagated atlas segmentations were compared to the manual target segmentations by Dice similarity and Hausdorff distance. Compared with registration without the use of statistical knowledge, the segmentations were significantly improved, by 0.1 in Dice similarity and by 8 mm Hausdorff distance on average for both structures.

16.
In the past decade, existing and new knowledge and datasets have been encoded in different ontologies for semantic web and biomedical research. The size of ontologies is often very large in terms of number of concepts and relationships, which makes the analysis of ontologies and the represented knowledge graph computationally expensive and time-consuming. As the ontologies of various semantic web and biomedical applications usually show explicit hierarchical structures, it is interesting to explore the trade-offs between ontological scales and preservation/precision of results when we analyze ontologies. This paper presents the first effort of examining the capability of this idea via studying the relationship between scaling biomedical ontologies at different levels and the semantic similarity values. We evaluate the semantic similarity between three gene ontology slims (plant, yeast, and candida, among which the latter two belong to the same kingdom - fungi) using four popular measures commonly applied to biomedical ontologies (Resnik, Lin, Jiang-Conrath, and SimRel). The results of this study demonstrate that with proper selection of scaling levels and similarity measures, we can significantly reduce the size of ontologies without losing substantial detail. In particular, the performances of Jiang-Conrath and Lin are more reliable and stable than that of the other two in this experiment, as proven by 1) consistently showing that yeast and candida are more similar (as compared to plant) at different scales, and 2) small deviations of the similarity values after excluding a majority of nodes from several lower scales. This study provides a deeper understanding of the application of semantic similarity to biomedical ontologies, and sheds light on how to choose appropriate semantic similarity measures for biomedical engineering.
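Three of the four measures named in abstract 16 are standard information-content (IC) similarities; given IC values for two concepts and their most informative common ancestor, Resnik, Lin, and Jiang-Conrath can be written as below. The IC values are toy numbers, and a real implementation would derive them from annotation frequencies in the ontology.

# Sketch of information-content-based similarity measures on toy IC values.
def resnik(ic_lcs):
    return ic_lcs                                   # IC of the most informative common ancestor

def lin(ic_a, ic_b, ic_lcs):
    return 2 * ic_lcs / (ic_a + ic_b)

def jiang_conrath_distance(ic_a, ic_b, ic_lcs):
    return ic_a + ic_b - 2 * ic_lcs                 # 0 means identical; often inverted into a similarity

# Toy example: two GO-like terms with IC 8.1 and 7.4, sharing an ancestor with IC 5.0.
ic_a, ic_b, ic_lcs = 8.1, 7.4, 5.0
print(resnik(ic_lcs), lin(ic_a, ic_b, ic_lcs), jiang_conrath_distance(ic_a, ic_b, ic_lcs))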

17.
A term is composed of one or more words combined according to certain semantic roles. Traditional statistics-based similarity methods treat a term as a single basic unit, ignoring the semantic roles inside the term, and for terms with little contextual information statistical methods cannot achieve satisfactory results; similarity methods based on semantic resources cover a limited vocabulary, so similarity cannot be computed for terms not included in those resources. To address these problems, this paper proposes a semantic-role-based term similarity method for patents that remedies the shortcomings of traditional methods. The words inside a term are labeled with semantic roles, word similarity is computed with a shared-nearest-neighbor method, and term similarity is then computed from the word similarities according to the different semantic roles. Experiments show that the method achieves better results than traditional methods.

18.
A keyword extraction method based on lexical chains
Keywords play a very important role in document retrieval, automatic summarization, and text clustering/classification. A lexical chain is a sequence of semantically related words and was originally used to analyze text structure. This paper proposes a method for automatic keyword indexing of Chinese texts using lexical chains and gives an algorithm for building lexical chains with HowNet (知网) as the knowledge base. Lexical chains are first built by computing word-sense similarity, and keywords are then selected by combining word frequency and positional features. The method takes the semantic relations between words into account and improves keyword indexing performance. Experimental results show that, compared with methods using word frequency and position alone, recall improves by 7.78% and precision by 9.33%.
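A greedy sketch of the chain-building step in abstract 18: each candidate word joins the first chain whose members it is sufficiently similar to, otherwise it starts a new chain. The word_similarity callback stands in for the HowNet-based word-sense similarity and, like the threshold and toy synonym table, is an assumption for illustration.

# Hypothetical sketch: greedy lexical-chain construction given a word-similarity function.
def build_lexical_chains(words, word_similarity, threshold=0.7):
    """Group words into chains of semantically related words."""
    chains = []
    for w in words:
        for chain in chains:
            if any(word_similarity(w, member) >= threshold for member in chain):
                chain.append(w)
                break
        else:
            chains.append([w])        # no related chain found: start a new one
    return chains

# Toy similarity: identical words are similar, as are words in a marked synonym pair.
synonyms = {("汽车", "轿车"), ("轿车", "汽车")}
sim = lambda a, b: 1.0 if a == b or (a, b) in synonyms else 0.0
print(build_lexical_chains(["汽车", "轿车", "道路", "汽车"], sim))
# -> [['汽车', '轿车', '汽车'], ['道路']]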

19.
20.
Applying graph models to multi-document automatic summarization is a current research hotspot: sentences are taken as vertices and the similarities between sentences as edge weights to build an undirected graph. Because this model does not fully exploit the term weight information within sentences or the information about which document a sentence belongs to, this paper proposes a three-layer term-sentence-document graph model that makes full use of both kinds of information when computing sentence similarity. Experimental results on the DUC2003 and DUC2004 datasets show that the term-sentence-document three-layer graph model outperforms the LexRank model and the document-sensitive graph model.
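Abstract 20 builds on graph ranking over a sentence-similarity graph (LexRank being the baseline it compares against); a bare-bones power-iteration ranking over such a similarity matrix is sketched below, without the paper's additional term and document layers. The similarity matrix is a toy example.

# Hypothetical sketch: PageRank-style power iteration over a sentence-similarity graph.
import numpy as np

def rank_sentences(sim_matrix, damping=0.85, iterations=100):
    """sim_matrix[i, j]: similarity between sentence i and sentence j (symmetric, zero diagonal)."""
    n = sim_matrix.shape[0]
    row_sums = sim_matrix.sum(axis=1, keepdims=True)
    row_sums[row_sums == 0] = 1.0             # isolated sentences keep zero outgoing weight
    transition = sim_matrix / row_sums        # row-normalized transition matrix
    scores = np.full(n, 1.0 / n)
    for _ in range(iterations):
        scores = (1 - damping) / n + damping * transition.T @ scores
    return scores

sims = np.array([[0.0, 0.8, 0.1],
                 [0.8, 0.0, 0.3],
                 [0.1, 0.3, 0.0]])
print(rank_sentences(sims))   # the two strongly connected sentences score highest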
