Similar Documents
19 similar documents found
1.
To meet the need for data mining over collections of XML documents, this paper proposes computing the structural similarity of XML document trees from their semantic and structural information, building a structural similarity matrix from these scores, and then applying the DBSCAN algorithm to the matrix to cluster the XML document collection. Compared with other clustering algorithms, the clustering speed is greatly improved.
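As an illustration of the clustering step only, here is a minimal Python sketch that runs DBSCAN on a precomputed structural similarity matrix; the paper's semantic/structural similarity computation is not reproduced, and the toy matrix, eps, and min_samples values are assumptions:

```python
import numpy as np
from sklearn.cluster import DBSCAN

def cluster_xml_documents(similarity_matrix, eps=0.3, min_samples=2):
    """Cluster XML documents from a pairwise structural similarity matrix
    (values in [0, 1], where 1 means structurally identical trees)."""
    # DBSCAN with metric='precomputed' expects distances, so convert
    # similarities to distances first.
    distance_matrix = 1.0 - similarity_matrix
    model = DBSCAN(eps=eps, min_samples=min_samples, metric="precomputed")
    return model.fit_predict(distance_matrix)   # label -1 marks noise documents

# Hypothetical similarity matrix for four XML documents (two structural groups)
sim = np.array([[1.0, 0.9, 0.2, 0.1],
                [0.9, 1.0, 0.3, 0.2],
                [0.2, 0.3, 1.0, 0.8],
                [0.1, 0.2, 0.8, 1.0]])
print(cluster_xml_documents(sim))   # e.g. [0 0 1 1]
```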

2.
In distributed retrieval, the topic-based language-model collection selection method first introduces a Relevance Model to compute the similarity between the user query and the documents in each information collection. On this basis, text clustering extracts the topic information of the documents in each collection, and a language model then produces a query-relevance ranking of the collections, completing the collection selection. Experiments show that, compared with ODRI, CRCS, and collection selection based on a traditional language model, the retrieval effectiveness of this method is significantly improved.
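The sketch below illustrates only the basic language-model collection ranking idea (Dirichlet-smoothed query likelihood per collection); the paper's Relevance Model and topic clustering steps are not reproduced, and the smoothing parameter and data layout are assumptions:

```python
import math
from collections import Counter

def query_log_likelihood(query_terms, collection_terms, background, mu=2000):
    """Dirichlet-smoothed log-likelihood of the query under one collection's language model."""
    tf = Counter(collection_terms)
    n = len(collection_terms)
    total_bg = sum(background.values())
    score = 0.0
    for t in query_terms:
        p_bg = (background.get(t, 0) + 1) / (total_bg + len(background))  # smoothed background prob
        p = (tf.get(t, 0) + mu * p_bg) / (n + mu)
        score += math.log(p)
    return score

def rank_collections(query_terms, collections):
    """collections: dict name -> flat list of terms from that collection's documents."""
    background = Counter(t for terms in collections.values() for t in terms)
    scores = {name: query_log_likelihood(query_terms, terms, background)
              for name, terms in collections.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

collections = {"news": "market stock finance report".split(),
               "medicine": "drug therapy clinical trial".split()}
print(rank_collections(["clinical", "drug"], collections))
```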

3.
To address problems in the project application process such as nonstandard form filling and low review efficiency caused by manual operation, a solution for automatic review of objective application information and automatic document generation is presented, based on technologies such as AJAX, WebOffice middleware, and XML. A comprehensive research management system integrating research projects, science and technology awards, research achievements, and senior research talent was designed and implemented. It supports direct online entry of all application information, automated review of multiple items of objective application data, and automatic generation of application documents, which effectively improves review efficiency and solves the problem of nonstandard document formats. Practical use shows that the system provides strong support for improving the efficiency of research management, standardizing management processes, and raising the level of information sharing.

4.
《现代电子技术》2016,(1):148-152
Considering the poor clustering quality and slow speed of traditional Web document clustering algorithms, this work studies Web document clustering in depth. Using an objective-optimization strategy, Web document clustering is cast as the problem of finding the best partition of the document collection, and an optimization algorithm is introduced to perform the partitioning. To address the high dimensionality and sparsity of Web document vectors represented with SVD, LDA is used to reconstruct the latent semantic subspace of the Web document clusters, thereby reducing the dimensionality of the document vector space; a genetic algorithm then searches for the optimal partition in this low-dimensional space. Conventional GAs often suffer from premature convergence and weak local search ability, so an improved GA is proposed that introduces an adaptive dual population, an adaptive termination rule, and a new offspring-generation rule to preserve population diversity during iteration, avoiding premature convergence while improving search efficiency and local search ability. Experiments verify the clustering effectiveness of the proposed Web document clustering algorithm based on the improved GA.
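A minimal sketch of the dimensionality-reduction step with scikit-learn's LatentDirichletAllocation: term-frequency vectors are projected into a low-dimensional topic space, in which a partition is then searched. Plain KMeans stands in for the paper's improved genetic algorithm, and the toy corpus and parameters are assumptions:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.cluster import KMeans

docs = ["web mining and clustering of documents",
        "genetic algorithms for optimization",
        "document clustering with topic models",
        "adaptive genetic search and premature convergence"]

# Term-frequency vectors (high-dimensional and sparse)
tf = CountVectorizer().fit_transform(docs)

# LDA reduces the document space to a small number of topic dimensions
topic_space = LatentDirichletAllocation(n_components=2, random_state=0).fit_transform(tf)

# Stand-in for the GA-based partition search: plain KMeans in topic space
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(topic_space)
print(labels)
```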

5.
With the rapid growth of electronic text within organizations, people can no longer cope with the mountains of documents, and much of the information value of these documents goes unrealized; making full use of these text resources has become an urgent problem. This paper introduces an intelligent text analysis system for use within organizations and its key technologies, whose functions include document retrieval, automatic document summarization, and automatic topic detection and tracking. With such a system, the information value of text documents can be fully realized.

6.
李新叶  苑津莎 《电子学报》2007,35(11):2220-2225
Traditional keyword-based search engines cannot make full use of the structural information in XML documents, so their results are often imprecise, while XML search techniques based on both structure and keywords are unsuitable for ordinary users. Keyword-based XML semantic retrieval overcomes these drawbacks but needs better retrieval efficiency. This paper analyzes the semantics latent in XML document structure, proposes a new index structure and a function for judging whether two nodes are semantically related, and on this basis presents a fast XML semantic retrieval algorithm that greatly reduces the number of node-pair semantic-relatedness checks. Experiments on real data sets demonstrate the effectiveness of the new algorithm.

7.
Variable-class spectral clustering for remote sensing image segmentation
李玉  袁永华  赵雪梅 《电子学报》2018,46(12):3021-3028
To determine the number of classes accurately and automatically in remote sensing image segmentation, a variable-class spectral clustering algorithm is proposed. A weight matrix and the normalized Laplacian matrix are built from the similarity graph of the image; the eigenvectors associated with the smaller eigenvalues of the Laplacian form an eigenvector matrix, and the rows of this matrix corresponding to pixels are taken as the set of pixel feature points. By studying how the feature points of pixels belonging to the same target class cluster together when the Laplacian has different (approximately) block-diagonal structures, a clustering-degree index is defined and computed for different numbers of segmentation classes. The number of classes at which the clustering degree makes its last large jump is taken as the estimated class number, and the FCM (Fuzzy C-Means) algorithm then partitions the corresponding pixel feature points to segment the image. The proposed algorithm and an eigengap-based algorithm are applied to synthetic and real remote sensing images; the results show that the proposed algorithm identifies the number of image classes accurately.
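A minimal sketch of the spectral embedding step only: the eigenvectors of the normalized Laplacian with the smallest eigenvalues become the pixel feature points. The paper's clustering-degree index for estimating the class number and the FCM partition are not reproduced (KMeans is used as a stand-in), and the toy affinity matrix is an assumption:

```python
import numpy as np
from sklearn.cluster import KMeans

def spectral_embedding(affinity, k):
    """Return the k eigenvectors of the normalized Laplacian with the
    smallest eigenvalues, one row per pixel (feature point)."""
    d = affinity.sum(axis=1)
    d_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    laplacian = np.eye(len(affinity)) - d_inv_sqrt @ affinity @ d_inv_sqrt
    eigvals, eigvecs = np.linalg.eigh(laplacian)   # eigenvalues in ascending order
    return eigvecs[:, :k]

# Toy affinity matrix for six "pixels" forming two groups
affinity = np.array([[1.0, 0.9, 0.8, 0.1, 0.1, 0.0],
                     [0.9, 1.0, 0.9, 0.0, 0.1, 0.1],
                     [0.8, 0.9, 1.0, 0.1, 0.0, 0.1],
                     [0.1, 0.0, 0.1, 1.0, 0.9, 0.8],
                     [0.1, 0.1, 0.0, 0.9, 1.0, 0.9],
                     [0.0, 0.1, 0.1, 0.8, 0.9, 1.0]])

points = spectral_embedding(affinity, k=2)
# Stand-in for the FCM partition used in the paper
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(points)
print(labels)
```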

8.
To further exploit the powerful capabilities of Word, secondary development of Word needs to implement relatively complex document-processing functions, such as generating complex tables and figures, and automatic Word document generation has therefore become an important research direction in Word automation. This paper studies automatic Word document generation based on OLE technology on the VC platform. Related technologies such as COM, MFC, the Word object model, and VBA are introduced, and the principles and implementation details of automatic Word document generation are given. Finally, the technique is verified with an example.
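The paper implements the automation in VC++ through OLE/COM; as a rough illustration of the same Word object model calls, here is a hedged Python sketch using pywin32 (it assumes Windows with Word installed, and the document content and save path are placeholders, not from the paper):

```python
import win32com.client  # pywin32 COM bindings

word = win32com.client.Dispatch("Word.Application")  # start (or attach to) Word via COM
word.Visible = False
doc = word.Documents.Add()                            # new blank document

# Insert a paragraph, then a small table through the Word object model
doc.Range(0, 0).InsertAfter("Auto-generated report\r")
end = doc.Content.End - 1
table = doc.Tables.Add(doc.Range(end, end), 2, 3)     # 2 rows x 3 columns
table.Cell(1, 1).Range.Text = "Item"
table.Cell(1, 2).Range.Text = "Value"
table.Cell(1, 3).Range.Text = "Notes"

doc.SaveAs(r"C:\temp\report.docx")                    # hypothetical output path
doc.Close()
word.Quit()
```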

9.
General
0608351 Study of the influence of roof fissure water on the stability of roadways supported by anchor cables [journal, Chinese] / 王志清 // 湖南科技大学学报(自然科学版). — 2005, 20(4). — 26-29 (C2)
0608352 A relevance-oriented content retrieval method based on intelligent clustering [journal, Chinese] / 高慧颖 // 北京理工大学学报. — 2005, 25(12). — 1075-1078 (L). To improve the relevance and efficiency of content retrieval, a relevance retrieval method based on intelligent clustering is proposed on the basis of information systems theory and self-organizing neural network theory, and a retrieval algorithm is designed. The trained self-organizing neural network clusters the query requirements so that retrieval is carried out only within text content of the same class as the query, which improves retrieval efficiency, and, in the same vector space, the query vector and the text content are…

10.
Because traditional systems cannot accurately compute the relevance between web information and query terms in practical use, their harmonic-mean scores are low; to address this, an intelligent web information retrieval system based on a meta-search engine is designed. On the hardware side, a meta-search engine and a retriever are designed: the meta-search engine collects massive amounts of web information, and the retriever implements the system's retrieval function. On the software side, a MySQL database stores system information, web information is stored as space vectors, and the relevance between a query term and a web document is computed from the term's discriminating power and its frequency in the document; the retrieved documents are then ranked and merged according to these scores, and the merged web information is returned as the final retrieval result. Experiments show that the designed system achieves a higher harmonic mean than traditional systems.
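The relevance computation described (a term's discriminating power combined with its frequency in a document) is essentially TF-IDF weighting in a vector space; below is a minimal scikit-learn sketch of that ranking step, where the toy corpus and query are assumptions:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = ["meta search engines merge results from several engines",
        "mysql stores crawled web information",
        "vector space models rank documents by term weights"]
query = ["rank web documents with a vector space model"]

vectorizer = TfidfVectorizer()            # TF captures occurrence frequency, IDF the discriminating power
doc_vectors = vectorizer.fit_transform(docs)
query_vector = vectorizer.transform(query)

scores = cosine_similarity(query_vector, doc_vectors).ravel()
ranking = scores.argsort()[::-1]          # highest relevance first
for i in ranking:
    print(round(float(scores[i]), 3), docs[i])
```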

11.
12.
I. Introduction. Most current Information Retrieval (IR) systems try to match query terms with document terms. A major problem with these approaches is that users want to retrieve documents by content, while individual words provide unreliable evidence about the content of texts [1-3]. When parts of the text in the document collection are missing, e.g. only the abstract is available, the word-use variability problem has a substantial impact on the IR per…

13.
Document examination is a vital task for revealing illegal modifications and thus assists in detecting and resolving criminal acts. Addition and alteration are the modifications most frequently found in handwritten documents. However, most documents are modified with similar inks, and such changes are hard to detect or observe with the human eye. There is therefore a need for methods that detect handwriting forgery automatically, accurately, and efficiently. In this paper, a novel and efficient method is proposed for automatically detecting altered handwritten documents and locating the forged parts. DE-Net is proposed to identify forged documents from a digitally scanned version of the document. Unlike existing methods, a further localization scheme is applied to accurately locate the forged parts in a candidate forged document: each forged document is segmented into objects, color histograms of the R, G, and B channels are used to generate a fused feature vector for each object, and the structural similarity index (SSIM) is then applied to flag the lowest-similarity parts as forged. The experimental results demonstrate that the proposed method can identify and localize foreign ink in handwritten documents with high performance.
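A minimal sketch of the per-object feature comparison only: fused R/G/B histograms of two image patches plus the grayscale structural similarity index from scikit-image. The object segmentation, DE-Net, and any decision thresholds are not reproduced, and the random patches are placeholders:

```python
import numpy as np
from skimage.metrics import structural_similarity
from skimage.color import rgb2gray

def fused_color_histogram(patch, bins=32):
    """Concatenate the R, G, and B histograms of an RGB patch into one feature vector."""
    feats = []
    for c in range(3):
        hist, _ = np.histogram(patch[..., c], bins=bins, range=(0, 1), density=True)
        feats.append(hist)
    return np.concatenate(feats)

def ink_similarity(patch_a, patch_b):
    """Cosine similarity of fused color histograms plus grayscale SSIM."""
    ha, hb = fused_color_histogram(patch_a), fused_color_histogram(patch_b)
    cos = float(ha @ hb / (np.linalg.norm(ha) * np.linalg.norm(hb) + 1e-12))
    ssim = structural_similarity(rgb2gray(patch_a), rgb2gray(patch_b), data_range=1.0)
    return cos, ssim

# Two hypothetical 64x64 RGB patches with values in [0, 1]; low scores suggest foreign ink
rng = np.random.default_rng(0)
a = rng.random((64, 64, 3))
b = rng.random((64, 64, 3))
print(ink_similarity(a, b))
```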

14.
This paper proposes a multidimensional model for classifying drug information text documents. The concept of a multidimensional category model is introduced for representing classes. In contrast with traditional flat and hierarchical category models, the multidimensional category model classifies each document using multiple predefined sets of categories, where each set corresponds to a dimension. Since a multidimensional model can be converted to flat and hierarchical models, three classification approaches are possible: classifying directly with the multidimensional model, or classifying with the equivalent flat or hierarchical models. The efficiency of these three approaches is investigated on a drug information collection with two dimensions: 1) drug topics and 2) primary therapeutic classes. In the experiments, k-nearest neighbor, naive Bayes, and two centroid-based methods are selected as classifiers. The three classification approaches are compared using two-way analysis of variance, followed by Scheffé's test for post hoc comparison. The experimental results show that multidimensional-based classification performs better than the others, especially with a relatively small training set. As one application, a category-based search engine using the multidimensional category concept was developed to help users retrieve drug information.
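A minimal sketch of classifying directly with a multidimensional model: one independent classifier per dimension trained on the same documents (naive Bayes on a toy drug-text corpus; the corpus, label sets, and classifier choice are assumptions):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

docs = ["aspirin relieves pain and reduces fever",
        "this antibiotic treats bacterial infection",
        "dosage and side effects of the analgesic",
        "store the antibiotic below 25 degrees"]

# One label set per dimension, all over the same documents
labels_by_dimension = {
    "drug_topic": ["indication", "indication", "dosage", "storage"],
    "therapeutic_class": ["analgesic", "antibiotic", "analgesic", "antibiotic"],
}

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs)
classifiers = {dim: MultinomialNB().fit(X, y) for dim, y in labels_by_dimension.items()}

# A new document receives one category per dimension
new_doc = vectorizer.transform(["pain relief dosage for the analgesic"])
for dim, clf in classifiers.items():
    print(dim, clf.predict(new_doc)[0])
```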

15.

In order to improve the search performance for rich text content, a cloud search engine system based on rich text content is designed. On top of a traditional search engine hardware system, several components such as a Solr index server, a collector, a Chinese word segmenter, and a searcher are added, and the data interface is adjusted. Building on the hardware and database support, the open-source Apache Tika framework is used to extract the metadata of rich text documents, word segmentation is performed according to the rich text content and its semantics, and the weight of each keyword is computed. Given the search keywords, a text index is built, the BM25 algorithm computes the similarity between keywords and text, and the rich-text search results are output according to the similarity scores. The experimental results show that the designed system has a high recall rate and high throughput, and that the index construction time per data item in different files is short, which improves search efficiency and accuracy.
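A minimal sketch of the BM25 scoring used for ranking; Tika metadata extraction, Chinese word segmentation, and the Solr index are not reproduced, and the toy corpus and the k1/b values (standard defaults) are assumptions:

```python
import math
from collections import Counter

def bm25_scores(query_terms, docs_tokens, k1=1.2, b=0.75):
    """Score every tokenized document against the query with the standard BM25 formula."""
    n_docs = len(docs_tokens)
    avg_len = sum(len(d) for d in docs_tokens) / n_docs
    df = Counter(t for d in docs_tokens for t in set(d))   # document frequency per term
    scores = []
    for d in docs_tokens:
        tf = Counter(d)
        score = 0.0
        for t in query_terms:
            if t not in tf:
                continue
            idf = math.log((n_docs - df[t] + 0.5) / (df[t] + 0.5) + 1)
            score += idf * tf[t] * (k1 + 1) / (tf[t] + k1 * (1 - b + b * len(d) / avg_len))
        scores.append(score)
    return scores

docs = [["rich", "text", "search", "engine"],
        ["solr", "index", "and", "bm25", "ranking"],
        ["chinese", "word", "segmentation", "for", "rich", "text"]]
print(bm25_scores(["rich", "text", "bm25"], docs))
```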


16.
Despite all the technological developments in image acquisition and processing, preserving old documents and other data of historical interest is still a very challenging issue. Indeed, these documents are often prone to several types of artifacts that affect their readability. Furthermore, because of the considerable amount of information carried by such media, reducing the size of the digitized documents is another challenging problem, since their entropy is often high due to the presence of artifacts. We therefore believe that directing the loss of information onto the artifacts can bring an elegant solution to this issue. In this paper, we propose the first joint enhancement-compression approach for handwritten document images. The approach presents a novel foreground/background segmentation algorithm that uses both directional and contrast features to highlight the original information. This pre-treatment step is embedded into the DjVu encoder, which is commonly used by national archives and libraries, to drive the compression rate. Both objective evaluation and perceptual judgment demonstrate the efficiency of the proposed scheme on the whole set of DIBCO datasets.

17.
To eliminate the influence of illumination on image acquisition of printed documents, uniform-pattern LBP is applied to the captured character images, LBP features are extracted, and an SVM is used to classify and identify the printed documents. The classification accuracy is only 83.40%, which shows that LBP features can be used for printed-document classification but that this accuracy does not meet practical needs. A two-factor analysis-of-variance model shows that both the character factor and the printer factor have a significant effect on the identification rate. By mining texture information to eliminate the influence of the character factor and studying the effect of the printer factor on document identification, the identification rate for printed documents can be raised to 94.50%, which is adequate for practical application.
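A minimal sketch of uniform-LBP feature extraction followed by SVM classification, using scikit-image and scikit-learn; the images and labels are random placeholders, and the LBP radius, neighbor count, and SVM kernel are assumptions:

```python
import numpy as np
from skimage.feature import local_binary_pattern
from sklearn.svm import SVC

P, R = 8, 1  # 8 neighbors at radius 1

def uniform_lbp_histogram(gray_image):
    """Histogram of uniform LBP codes, an illumination-robust texture feature."""
    lbp = local_binary_pattern(gray_image, P, R, method="uniform")
    n_bins = P + 2                      # uniform patterns plus one "non-uniform" bin
    hist, _ = np.histogram(lbp, bins=n_bins, range=(0, n_bins), density=True)
    return hist

# Hypothetical grayscale character images and printer labels
rng = np.random.default_rng(0)
images = [(rng.random((32, 32)) * 255).astype(np.uint8) for _ in range(20)]
labels = [i % 2 for i in range(20)]     # two printers

X = np.array([uniform_lbp_histogram(img) for img in images])
clf = SVC(kernel="rbf").fit(X, labels)
print(clf.predict(X[:4]))
```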

18.
While consulting and handling documents in the IC (integrated circuit) packaging and test industry, it was found that document preparation standards in this industry are far from uniform: different companies use different standards, and even within a single company documents are often prepared under different rules. This causes considerable trouble for document users when searching and reviewing. A formally controlled technical document should not only consider its content but also carry a document title, document number, version number, and so on, so that it can serve as a basis for tracing and judging whether a product meets processing requirements. A random piece of text, a conversation, an e-mail, a phone call, or a meeting minute is usually not suitable as the written basis for guiding product processing. Basic requirements for document preparation rules suited to this industry are given; they can also serve as a reference for document preparation in other industries.

19.
Semantic features are critical intelligence information for mobile ubiquitous multimedia, so how to manage and retrieve semantic information has become an important issue. In this paper, a novel semantic retrieval approach named Data Hiding based Semantic Retrieval (DHSR) for ubiquitous multimedia is proposed. The approach has the following features: (1) every multimedia document is semantically annotated by several users before being saved into the multimedia database; (2) the semantic information, described by an object ontology, is hidden inside the multimedia document data; (3) the semantic information is not lost even if the multimedia document is copied, cut, or removed from the database. Our work provides a search engine with convenient user interfaces. The experimental results show that DHSR can retrieve multimedia documents that reflect the user's query intent more effectively than some traditional approaches.
