1.
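To address the problem that traditional text information extraction algorithms in current database knowledge discovery systems cannot meet users' business needs, a text feature extraction model based on user requirement descriptions is proposed. The user's business requirement model is first described in terms of features. The raw text stored in the database is then preprocessed, term frequencies and weights are computed, and text features are initially selected. Finally, feature similarity is computed against the user requirement description to filter out irrelevant "noise", retaining the features that accurately describe the text.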
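The weighting and filtering steps in the abstract above can be sketched as follows. This is only a minimal illustration under stated assumptions: the abstract does not give its weighting formula or similarity measure, so the sketch uses plain TF-IDF for the weights and simple membership in the requirement vocabulary as a stand-in for "feature similarity". The names `tf_idf`, `filter_by_requirement`, and the sample data are hypothetical.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: weight} dict per tokenized, non-empty document.

    Assumption: classic TF-IDF (relative term frequency times log inverse
    document frequency); the abstract does not specify its weighting.
    """
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in tf.items()
        })
    return weights


def filter_by_requirement(doc_weights, requirement_terms, top_k=3):
    """Keep up to top_k highest-weighted terms that occur in the requirement.

    Assumption: set membership stands in for the similarity computation
    described in the abstract.
    """
    ranked = sorted(doc_weights.items(), key=lambda kv: kv[1], reverse=True)
    return [term for term, _ in ranked if term in requirement_terms][:top_k]


# Hypothetical toy corpus and requirement description.
docs = [
    ["loan", "rate", "bank", "rate"],
    ["bank", "branch", "open"],
    ["weather", "rain", "bank"],
]
weights = tf_idf(docs)
print(filter_by_requirement(weights[0], {"loan", "rate", "weather"}))
```

A term that occurs in every document (here "bank") receives weight zero, which is one simple way such a scheme suppresses uninformative features before the requirement-based filter is applied.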
2.
3.
4.
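By analyzing two key steps in text mining, namely text feature space construction and similarity distance measurement, this work points out that popular text mining processes contain a large amount of synonym and association noise. The abundance of synonyms and associated words prevents the text feature space from accurately expressing text semantics and leads to high-dimensional computational complexity. Latent semantic analysis and association rule mining are used to construct sets of synonyms and associated words, which are then used to reduce the synonyms and associated words in the text feature space, lowering information redundancy and improving mining efficiency. The corresponding algorithms are described, and the experimental results are satisfactory.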
5.
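To address query topic drift and term mismatch in natural language processing, an association pattern mining and rule expansion algorithm based on the CSC (Copulas-based Support and Confidence) framework is proposed. Statistically derived association patterns are fused with word vectors that carry contextual semantic information, yielding a pseudo-relevance feedback query expansion model that combines association pattern mining with word vector learning. The model mines rule expansion terms from the pseudo-relevance feedback document set, trains word embeddings on the initially retrieved document set, computes the vector similarity between each rule expansion term and the original query, and keeps the terms whose similarity is not below a threshold as the final expansion terms. Experiments show that the model effectively reduces query topic drift and term mismatch and improves retrieval performance. Compared with existing query expansion methods based on association patterns or on word vectors, the largest average gain in MAP (Mean Average Precision) reaches 17.52%, and the model is more effective for short queries. The proposed mining method can also be applied to other text mining tasks and to recommender systems to improve their performance.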
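The final filtering step in the abstract above (keep only candidate expansion terms whose vector similarity to the original query is not below a threshold) might look roughly like this. It is a sketch under assumptions: the abstract does not say how the query vector is built or which similarity is used, so the code assumes the query vector is the mean of its term vectors and uses cosine similarity. All names and the toy embeddings are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def mean_vector(vectors):
    """Component-wise mean; an assumed way to represent a multi-term query."""
    n = len(vectors)
    return [sum(component) / n for component in zip(*vectors)]


def select_expansion_terms(query_vec, candidates, threshold):
    """Keep candidates whose similarity to the query is >= threshold."""
    return [
        term for term, vec in candidates.items()
        if cosine(query_vec, vec) >= threshold
    ]


# Hypothetical two-dimensional embeddings, for illustration only.
embeddings = {
    "car": [1.0, 0.0],
    "auto": [0.9, 0.1],
    "banana": [0.0, 1.0],
}
query_vec = mean_vector([embeddings["car"]])
candidates = {t: embeddings[t] for t in ("auto", "banana")}
print(select_expansion_terms(query_vec, candidates, threshold=0.8))
```

The threshold trades recall for precision: a lower value admits more expansion terms and with them a greater risk of the topic drift the abstract sets out to reduce.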
6.
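To address the shortcomings of the information gain (IG) model in text classification, a text classification algorithm based on grey relational analysis and information gain is proposed. First, class feature selection based on an improved χ² statistic is used for within-class text representation, improving the representational power of the class center vectors. Second, to counter the IG model's tendency to overweight low-frequency terms, an improved weighting method based on frequency and position is proposed. Finally, a grey-relation-based approach to computing text similarity is proposed, which remedies the shortcomings of distance-based similarity computation. Experiments show that the algorithm improves text classification efficiency.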
7.
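In distributed retrieval, the topic-based language model method for collection selection first introduces the Relevance Model to compute the similarity between the user query and the documents in each collection. On this basis, it obtains the topic information of the documents in each collection through text clustering and incorporates it into the language model to rank the collections by query relevance, thereby completing collection selection. Experiments show that, compared with ODRI, CRCS, and collection selection algorithms based on traditional language models, the method significantly improves retrieval effectiveness.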
8.
9.
10.
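To address the low accuracy of semantics-based short-text similarity methods in short-text classification, a short-text similarity algorithm that incorporates part of speech (GCSSA) is proposed. Building on the HowNet-based semantic short-text similarity method, it combines category feature words with part-of-speech analysis of keywords, and assigns different weight coefficients to keywords according to the part-of-speech information of category feature words and other keywords, so as to distinguish how much terms with different contributions matter in the similarity computation. Experiments show that, when the resulting text similarity is applied to short-text classification, the algorithm improves macro-averaged and micro-averaged accuracy by about 4% over the HowNet-based short-text classification algorithm, effectively improving classification accuracy.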
11.
12.
Leroy G. Hsinchun Chen 《IEEE transactions on information technology in biomedicine》2001,5(4):261-270
This paper describes the development and testing of the Medical Concept Mapper, a tool designed to facilitate access to online medical information sources by providing users with appropriate medical search terms for their personal queries. Our system is valuable for patients whose knowledge of medical vocabularies is inadequate to find the desired information, and for medical experts who search for information outside their field of expertise. The Medical Concept Mapper maps synonyms and semantically related concepts to a user's query. The system is unique because it integrates our natural language processing tool, i.e., the Arizona (AZ) Noun Phraser, with human-created ontologies, the Unified Medical Language System (UMLS) and WordNet, and our computer-generated Concept Space, into one system. Our unique contribution results from combining the UMLS Semantic Net with Concept Space in our deep semantic parsing (DSP) algorithm. This algorithm establishes a medical query context based on the UMLS Semantic Net, which allows Concept Space terms to be filtered so as to isolate related terms relevant to the query. We performed two user studies in which Medical Concept Mapper terms were compared against human experts' terms. We conclude that the AZ Noun Phraser is well suited to extract medical phrases from user queries, that WordNet is not well suited to provide strictly medical synonyms, that the UMLS Metathesaurus is well suited to provide medical synonyms, and that Concept Space is well suited to provide related medical terms, especially when these terms are limited by our DSP algorithm.
13.
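To study and improve the performance of text clustering algorithms, the ant colony algorithm, as applied to the TSP problem, is adapted and introduced into text clustering. The ants' pheromone release mechanism and the way path nodes are aggregated are changed so that similar texts are eventually clustered together. Experiments on the improved algorithm show that it increases the accuracy of text clustering, achieves good clustering results, and effectively improves text recall for queries. Applying the ant colony algorithm to text clustering is therefore feasible.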
14.
15.
Zhou Hong Chan Syin Kok F. Lai 《Journal of Visual Communication and Image Representation》1998,9(4):287-299
We present a two-pass image retrieval system in which retrieval techniques for text and image documents are combined in a novel approach. In the first pass, the text-based initial query is matched against the text captions of the images in the database to obtain the initial retrieved set. In the second pass, text and image features obtained from this initial retrieved set are used to expand the initial query. Additional images from the database are then retrieved based on the expanded query. The image features that we have used are color histograms, DC coefficients from the discrete cosine transform, and two texture features: multiresolution simultaneous autoregressive model and local binary pattern. These are low-level statistical image features that can be easily computed. Extensive experiments have been performed on 1019 color pictures of mixed variety with captions, relevance judgments and queries supplied by a national archives agency. Objective precision-recall results have been obtained with various combinations of text and image features. The results show that the image features do not perform well when used on their own. However, when image features are used in query expansion, they increase the average precision more significantly than text annotations. Moreover, these findings are valid at all precision levels and are not sensitive to the image feature acquisition parameters.
16.
BilVideo: a video database management system (total citations: 1; self-citations: 0; citations by others: 1)
The BilVideo video database management system provides integrated support for spatiotemporal and semantic queries for video. A knowledge base, consisting of a fact base and a comprehensive rule set implemented in Prolog, handles spatio-temporal queries. These queries contain any combination of conditions related to direction, topology, 3D relationships, object appearance, trajectory projection, and similarity-based object trajectories. The rules in the knowledge base significantly reduce the number of facts representing the spatio-temporal relations that the system needs to store. A feature database stored in an object-relational database management system handles semantic queries. To respond to user queries containing both spatio-temporal and semantic conditions, a query processor interacts with the knowledge base and object-relational database and integrates the results returned from these two system components. Because of space limitations, we only discuss the Web-based visual query interface and its fact-extractor and video-annotator tools. These tools populate the system's fact base and feature database to support both query types.
17.
Lei Zheng Wetzel A.W. Gilbertson J. Becich M.J. 《IEEE transactions on information technology in biomedicine》2003,7(4):249-255
A prototype, content-based image retrieval system has been built employing a client/server architecture to access supercomputing power from the physician's desktop. The system retrieves images and their associated annotations from a networked microscopic pathology image database based on content similarity to user-supplied query images. Similarity is evaluated based on four image feature types: color histogram, image texture, Fourier coefficients, and wavelet coefficients, using the vector dot product as a distance metric. Current retrieval accuracy varies across pathological categories depending on the number of available training samples and the effectiveness of the feature set. The distance measure of the search algorithm was validated by agglomerative cluster analysis in light of the medical domain knowledge. Results show a correlation between pathological significance and the image document distance value generated by the computer algorithm. This correlation agrees with observed visual similarity. This validation method has an advantage over traditional statistical evaluation methods when sample size is small and where domain knowledge is important. A multi-dimensional scaling analysis shows the low-dimensional nature of the embedded space for the current test set.
18.
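A crime scene investigation image database is a highly domain-specific image database, characterized by strict confidentiality and rarely seen image content. Given that crime scene images have complex content and no clearly defined target object, a DCT-DCT wave texture feature is proposed and fused with HSV color histogram features and GIST features to form a fused feature. Compared with commonly used image features, the DCT-DCT wave texture feature achieves relatively high retrieval efficiency, and the average retrieval precision of the fused feature is higher than that of each of its three constituent features. Finally, semantic analysis is introduced into the retrieval process, and a crime scene image retrieval algorithm based on retrieval result optimization is proposed: a Support Vector Machine (SVM) classifier extracts the semantics of the query image, the results of the initial retrieval are analyzed semantically, and a second-pass retrieval scheme is chosen according to the proportions of the semantic categories in the initial results. The algorithm further improves average retrieval precision beyond what query by example achieves.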
19.
For complex questions in Chinese question answering systems, we propose an answer extraction method that combines discourse structure features. The method uses the relevance between questions and answers to learn to rank the answers. First, it analyzes the question to generate a query string and submits the query string to search engines to retrieve relevant documents. Second, it segments the retrieved documents and identifies the most relevant candidate answers; in addition, it uses the rhetorical relations of rhetorical structure theory to determine the inherent relationships between paragraphs or sentences and to generate candidate answer paragraphs or sentences. Third, we construct the answer ranking model, extract five feature groups, and train the model with the Ranking Support Vector Machine (SVM) algorithm. Finally, the method reranks the answers with the trained model and finds the optimal answers. Experiments show that the proposed method, combined with discourse structure features, effectively improves answer extraction accuracy and the quality of non-factoid answers. The Mean Reciprocal Rank (MRR) of answer extraction reaches 69.53%.
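For readers unfamiliar with the metric reported in the abstract above, Mean Reciprocal Rank can be computed as in the sketch below. It follows the standard definition (the average, over questions, of one divided by the rank of the first correct answer, counting zero when no correct answer is returned). The function name and the toy data are hypothetical and are not taken from the paper.

```python
def mean_reciprocal_rank(ranked_answers, correct_answers):
    """Average of 1/rank of the first correct answer per question.

    ranked_answers:  {question_id: [answer, ...]} in ranked order.
    correct_answers: {question_id: set of acceptable answers}.
    A question with no correct answer in its ranking contributes 0.
    """
    total = 0.0
    for qid, ranking in ranked_answers.items():
        for rank, answer in enumerate(ranking, start=1):
            if answer in correct_answers[qid]:
                total += 1.0 / rank
                break
    return total / len(ranked_answers)


# Hypothetical rankings: first hit at rank 1, at rank 2, and not at all.
ranked = {
    "q1": ["a", "b", "c"],
    "q2": ["x", "y", "z"],
    "q3": ["m", "n"],
}
gold = {"q1": {"a"}, "q2": {"y"}, "q3": {"k"}}
print(mean_reciprocal_rank(ranked, gold))  # (1 + 1/2 + 0) / 3 = 0.5
```

Only the first correct answer per question counts, so MRR rewards placing one good answer near the top and ignores any further correct answers lower in the list.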
20.
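Research on improved mutual information feature selection in text classification (total citations: 5; self-citations: 2; citations by others: 3)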
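Mutual information is a commonly used feature selection method in text classification. A new feature selection method based on mutual information is proposed. The factors through which feature selection affects classification accuracy are first analyzed and then combined to characterize how strongly a feature contributes to classification. A formula expresses the feature value computed from the combined factors, and the features with the greatest influence on classification are selected according to the magnitude of these values. Finally, the feasibility of the method is established theoretically, and experiments show that it improves classification accuracy by 10% over the traditional method.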
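As background for the abstract above, the classical (pointwise) mutual information score between a term and a class, which such methods take as their starting point, can be estimated from document counts as sketched below. This is the conventional baseline only; the abstract does not give its improved formula, and it is not reproduced here. The function name, parameter names, and counts are hypothetical.

```python
import math


def mutual_information(a, b, c, n):
    """Estimate MI(t, c) = log( P(t, c) / (P(t) * P(c)) ) from counts.

    a: documents in the class that contain the term
    b: documents outside the class that contain the term
    c: documents in the class that do not contain the term
    n: total number of documents
    Returns -inf when the term never occurs in the class.
    """
    if a == 0:
        return float("-inf")
    return math.log((a * n) / ((a + c) * (a + b)))


# Hypothetical counts: the term occurs in 20 of the 30 class documents
# and in 5 documents outside the class, out of 100 documents overall.
print(mutual_information(a=20, b=5, c=10, n=100))
```

A score of zero means the term and the class are statistically independent; positive scores indicate association. Because the estimate depends on the term's marginal frequency, rare terms can receive disproportionately high scores, a commonly cited weakness of the baseline.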