Similar Documents
Found 20 similar documents (search time: 31 ms)
1.
An Improved Adaptive Text Information Filtering Model   Cited by: 19 (self-citations: 1, by others: 18)
Adaptive information filtering helps users obtain content of interest from, and screen out irrelevant junk in, vast information sources such as the Web. To address the shortcomings of existing adaptive filtering systems, this paper proposes an improved adaptive text information filtering model. The model provides two relevance retrieval mechanisms, on top of which the feedback algorithm is improved; it adopts the idea of incremental training and presents a new algorithm for the adaptive learning mechanism used during filtering. A system based on this model achieved good results in international evaluations in this field. Experimental data show that each improvement is effective and that the new model achieves higher performance.

2.
An Improved Adaptive Filtering Strategy for Web Text   Cited by: 1 (self-citations: 0, by others: 1)
Adaptive information filtering can largely meet the new challenges of real-time network information filtering. To address the shortcomings of existing adaptive systems, this paper proposes new methods for learning more accurate templates and for optimizing the filtering threshold. The improved strategy uses an SVM algorithm in the early filtering stage and an improved adaptive template filtering method in the middle and later stages. Template updates use an improved template-coefficient adjustment strategy, and a feature decay factor is introduced to improve filtering precision. The system runs on a campus network gateway and has achieved good results.
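A template update with a decay factor, as described above, can be sketched as follows. This is an illustrative reconstruction, not the paper's exact algorithm: `update_template`, the learning rate `lr`, and the decay value are all assumptions.

```python
# Illustrative sketch (not the paper's exact method): an adaptive filtering
# template adjusted from user feedback, with a hypothetical feature decay
# factor that gradually down-weights stale template terms.

def update_template(template, doc, relevant, lr=0.5, decay=0.9):
    """Adjust template coefficients from one feedback judgment.

    template, doc: dicts mapping term -> weight.
    relevant: True if the user judged the document relevant.
    """
    sign = 1.0 if relevant else -1.0
    updated = {t: w * decay for t, w in template.items()}  # decay old weights
    for term, w in doc.items():
        updated[term] = updated.get(term, 0.0) + sign * lr * w
    # drop terms whose weight decayed to (near) zero
    return {t: w for t, w in updated.items() if abs(w) > 1e-6}

template = {"network": 1.0, "filter": 0.8}
template = update_template(template, {"filter": 1.0, "spam": 0.5}, relevant=True)
```

Terms confirmed by relevant documents gain weight despite the decay, while untouched terms fade, which is one way a decay factor keeps the template current.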

3.
In this paper we report our research on building WebSail, an intelligent web search engine that is able to perform real-time adaptive learning. WebSail learns from the user's relevance feedback, so that it is able to speed up its search process and to enhance its search performance. We design an efficient adaptive learning algorithm TW2 to search for web documents. WebSail employs TW2 together with an internal index database and a real-time meta-searcher to perform real-time adaptive learning to find desired documents with as little relevance feedback from the user as possible. The architecture and performance of WebSail are also discussed. Received 3 November 2000 / Revised 13 March 2001 / Accepted in revised form 17 April 2001
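The abstract does not give TW2's details; as background, here is a minimal Winnow-style multiplicative update of the kind such relevance-feedback search algorithms build on. All names, constants, and the threshold are illustrative, not WebSail's actual TW2.

```python
# Illustrative Winnow-style mistake-driven learner (NOT WebSail's TW2):
# weights are promoted multiplicatively on a false negative and demoted
# on a false positive, so few feedback rounds can correct a prediction.

def winnow_predict(weights, features, threshold):
    return sum(weights[f] for f in features) >= threshold

def winnow_update(weights, features, label, predicted, alpha=2.0):
    """Promote weights on a false negative, demote on a false positive."""
    if label and not predicted:
        for f in features:
            weights[f] *= alpha
    elif not label and predicted:
        for f in features:
            weights[f] /= alpha
    return weights

weights = {f: 1.0 for f in ["web", "search", "java", "coffee"]}
# the user marks a document with terms {web, search} relevant;
# the initial prediction misses it, so those weights are promoted
pred = winnow_predict(weights, ["web", "search"], threshold=4.0)
weights = winnow_update(weights, ["web", "search"], True, pred)
```

The multiplicative update is what lets this family of algorithms converge with few user feedback rounds, which matches the design goal stated in the abstract.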

4.
Learning to rank is a supervised learning problem that aims to construct a ranking model for the given data. The most common application of learning to rank is to rank a set of documents against a query. In this work, we focus on point-wise learning to rank, where the model learns the ranking values. Multivariate adaptive regression splines (MARS) and conic multivariate adaptive regression splines (CMARS) are supervised learning techniques that have been proven to provide successful results on various prediction problems. In this article, we investigate the effectiveness of MARS and CMARS for the point-wise learning to rank problem. The prediction performance is analyzed in comparison to three well-known supervised learning methods, artificial neural network (ANN), support vector machine, and random forest, for two datasets under a variety of metrics including accuracy, stability, and robustness. The experimental results show that MARS and ANN are effective methods for the learning to rank problem and provide promising results.
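Point-wise learning to rank reduces ranking to predicting a relevance value per feature vector and then sorting by prediction. A minimal sketch with a plain linear model and toy features follows; the paper's learners are MARS, CMARS, ANN, SVM, and random forest, so everything below is an illustrative stand-in.

```python
# Point-wise learning to rank: regress graded relevance labels on
# (query, document) features, then rank documents by predicted score.
# Linear model trained by SGD; features and labels are toy data.

def fit_linear(X, y, lr=0.01, epochs=2000):
    w = [0.0] * len(X[0])
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            pred = sum(wj * xj for wj, xj in zip(w, xi))
            err = pred - yi
            w = [wj - lr * err * xj for wj, xj in zip(w, xi)]
    return w

# features: (query-term overlap, document-length score); target: graded relevance
X = [(0.9, 0.2), (0.1, 0.8), (0.5, 0.5)]
y = [2.0, 0.0, 1.0]
w = fit_linear(X, y)
scores = [sum(wj * xj for wj, xj in zip(w, x)) for x in X]
ranking = sorted(range(len(X)), key=lambda i: -scores[i])
```

The final ranking orders documents by their learned values, which is exactly the point-wise reduction the abstract describes.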

5.
Kernel-based algorithms have been proven successful in many nonlinear modeling applications. However, the computational complexity of classical kernel-based methods grows superlinearly with the increasing number of training data, which is too expensive for online applications. In order to solve this problem, the paper presents an information theoretic method to train a sparse version of kernel learning algorithm. A concept named instantaneous mutual information is investigated to measure the system reliability of the estimated output. This measure is used as a criterion to determine the novelty of the training sample and informative ones are selected to form a compact dictionary to represent the whole data. Furthermore, we propose a robust learning scheme for the training of the kernel learning algorithm with an adaptive learning rate. This ensures the convergence of the learning algorithm and makes it converge to the steady state faster. We illustrate the performance of our proposed algorithm and compare it with some recent kernel algorithms by several experiments.
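The paper's novelty measure is an instantaneous mutual information; as a simpler stand-in, the sketch below grows a sparse dictionary by admitting a sample only when its kernel-induced distance to every stored center exceeds a threshold — the classic novelty criterion for sparse kernel methods. The Gaussian kernel, the threshold `delta`, and the data stream are assumptions.

```python
import math

# Sparse dictionary growth for a kernel learner (illustrative stand-in for
# the paper's instantaneous-mutual-information criterion): a sample joins
# the dictionary only if it is novel, i.e., far from every stored center
# in the kernel-induced metric.

def gaussian_kernel(x, c, gamma=1.0):
    return math.exp(-gamma * (x - c) ** 2)

def maybe_admit(dictionary, x, delta=0.5):
    """Add x only if its kernel distance to the nearest center exceeds delta."""
    if not dictionary:
        dictionary.append(x)
        return True
    # squared kernel distance to the nearest center: 2 - 2*k(x, c) since k(x, x) = 1
    d2 = min(2.0 - 2.0 * gaussian_kernel(x, c) for c in dictionary)
    if d2 > delta:
        dictionary.append(x)
        return True
    return False

dictionary = []
stream = [0.0, 0.05, 1.5, 1.52, 3.0]
admitted = [x for x in stream if maybe_admit(dictionary, x)]
```

Near-duplicates of stored centers are rejected, so the dictionary stays compact and the per-sample cost stays bounded — the property that makes such methods viable online.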

6.
Biblio is an adaptive system that automatically extracts meta-data from semi-structured and structured scanned documents. Instead of using hand-coded templates or other methods manually customized for each given document format, it uses example-based machine learning to adapt to customer-defined document and meta-data types. We provide results from experiments on the recognition of document information in two document corpora: a set of scanned journal articles and a set of scanned legal documents. The first set is semi-structured, as the different journals use a variety of flexible layouts. The second set is largely free-form text based on poor quality scans of FAX-quality legal documents. We demonstrate accuracy on the semi-structured document set roughly comparable to hand-coded systems, and much worse performance on the legal documents.

7.
《Knowledge》2005,18(2-3):117-124
In this paper we propose an approach for refining a document ranking by learning filtering rulesets through relevance feedback. This approach includes two important procedures. One is a filtering method, which can be incorporated into any kind of information retrieval system. The other is a learning algorithm that builds a set of filtering rules, each of which specifies a condition for identifying relevant documents using combinations of characteristic words. Our approach is useful not only to overcome the limitations of the vector space model, but also to utilize the tags of semi-structured documents such as Web pages. Through experiments we show that our approach improves the performance of relevance feedback in two types of IR systems, one adopting the vector space model and one a Web search engine.
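A rule here is a combination of characteristic words that identifies relevant documents. A toy version of learning such a ruleset from relevance feedback might look like this; the paper's actual algorithm and thresholds may differ, and `min_precision` and the data are illustrative.

```python
from itertools import combinations

# Toy ruleset learning from relevance feedback: candidate rules are pairs
# of characteristic words drawn from relevant documents, and a rule is
# kept when the training documents matching it were all judged relevant.

def learn_rules(docs, labels, min_precision=1.0):
    # candidate words come from documents the user judged relevant
    vocab = set().union(*(d for d, l in zip(docs, labels) if l))
    rules = []
    for pair in combinations(sorted(vocab), 2):
        matched = [l for d, l in zip(docs, labels) if set(pair) <= d]
        if matched and sum(matched) / len(matched) >= min_precision:
            rules.append(pair)
    return rules

docs = [{"machine", "learning", "news"},
        {"machine", "learning", "kernel"},
        {"machine", "news"}]
labels = [1, 1, 0]  # relevance feedback: 1 = relevant, 0 = not relevant
rules = learn_rules(docs, labels)
```

Word pairs that also match non-relevant documents (here `machine` + `news`) fail the precision test and are discarded, which is what makes the learned conditions discriminative.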

8.
雷蕾  王晓丹  周进登 《计算机科学》2012,39(12):245-248
Sentiment classification aims to automatically identify the sentiment expressed in text (e.g., positive vs. negative, support vs. opposition). This paper proposes a sentiment classification method based on co-training with emotion words and sentiment words: building on conventional sentiment-word resources, a small set of emotion words is introduced to assist learning, and classification is achieved using only large-scale unlabeled data. Specifically, within a label-propagation framework over a document-word bipartite graph, two views are constructed from the emotion words and the sentiment words; co-training then extracts high-precision automatically labeled samples from the large unlabeled corpus as training data, on which a classifier is trained for sentiment classification. Experiments show that the method performs well on sentiment classification tasks across several domains.

9.
王金宝 《计算机应用》2006,26(5):1099-1101
To meet the needs of real-time online network information filtering, a new adaptive filtering model is proposed. In the system initialization stage, an incremental learning method is applied to a small number of supplied pseudo-relevant documents, and an improved document term-frequency method is used to extract feature terms, expanding the profile template and improving its accuracy. In the testing stage, with optimal system utility as the objective, a new threshold-optimization algorithm is proposed that combines a probabilistic model with statistics on the distribution of positive documents.
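Threshold optimization of this kind can be illustrated by sweeping candidate thresholds over scored documents and keeping the one that maximizes a utility measure. The TREC-style linear utility `2*relevant - nonrelevant` used below is an assumption standing in for the paper's probabilistic objective.

```python
# Illustrative threshold optimization: given filtering scores and relevance
# judgments, pick the delivery threshold that maximizes a linear utility
# over the delivered set (TREC-style 2*R+ - N+; the paper's exact
# objective and probabilistic model may differ).

def best_threshold(scores, relevant):
    best_t, best_u = None, float("-inf")
    for t in sorted(set(scores)):
        retrieved = [r for s, r in zip(scores, relevant) if s >= t]
        utility = sum(2 if r else -1 for r in retrieved)
        if utility > best_u:
            best_t, best_u = t, utility
    return best_t, best_u

scores = [0.9, 0.8, 0.6, 0.4, 0.2]
relevant = [True, True, False, True, False]
t, u = best_threshold(scores, relevant)
```

In this toy run the optimum sits at 0.4: delivering the top four documents trades one false alarm for one extra relevant document, which the 2:1 utility weighting rewards.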

10.
李昕  钱旭  王自强 《计算机工程》2010,36(15):40-42,48
To solve the document clustering problem effectively, a document clustering algorithm based on marginal manifold learning is proposed. The algorithm uses Marginal Fisher Analysis to reduce the high-dimensional document space to a low-dimensional feature space, and then clusters with a support vector clustering algorithm. Experimental results on benchmark document collections show that its clustering performance surpasses that of other commonly used document clustering algorithms.

11.
We develop an intelligent document delivery approach for filtering text information. Our approach can conduct content-based filtering via a machine learning technique which automatically constructs a filtering profile from training examples. The profiles, encoded in rule representation, are easily understood by humans. Good features of high predictive power for the learning process are automatically extracted from the document content. As a result, our approach is able to operate without any prior information about, or restriction of, the topic areas and yet achieve the filtering task. We have conducted an extensive simulation study to analyze the performance of our approach. We have also implemented a practical intelligent news article delivery system based on our approach. Both the simulation study and the practical experiments use real-world document collections, and the results demonstrate that our approach is effective. ©1999 John Wiley & Sons, Inc.

12.
Semi-supervised document clustering, which exploits limited supervised data to group unlabeled documents into clusters, has received significant interest recently. Because supervised data may be expensive to obtain, it is important to acquire the most informative knowledge to improve the clustering performance. This paper presents a semi-supervised document clustering algorithm and a new method for actively selecting informative instance-level constraints to improve clustering performance. The semi-supervised document clustering algorithm is a Constrained DBSCAN (Cons-DBSCAN) algorithm, which incorporates instance-level constraints to guide the clustering process in DBSCAN. An active learning approach is proposed to select informative document pairs for obtaining user feedback. Experimental results show that Cons-DBSCAN with the proposed active learning approach can improve clustering performance significantly when given a relatively small number of constraints.
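A compact sketch of constraint-guided density clustering in the spirit of Cons-DBSCAN: standard DBSCAN expansion over 1-D points, except that a point is never added to a cluster that already contains one of its cannot-link partners. This is illustrative only; the paper's Cons-DBSCAN and its active selection of constraint pairs are more involved.

```python
# Illustrative constrained DBSCAN on 1-D points: density-based expansion
# with an instance-level cannot-link check before a point joins a cluster.

def region_query(points, i, eps):
    return [j for j, p in enumerate(points) if abs(p - points[i]) <= eps]

def cons_dbscan(points, eps, min_pts, cannot_link):
    labels = [None] * len(points)          # None = unvisited, -1 = noise
    cid = 0
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_pts:
            labels[i] = -1                 # not a core point: mark as noise
            continue
        labels[i] = cid
        queue = [j for j in neighbors if j != i]
        while queue:
            j = queue.pop()
            if labels[j] is not None and labels[j] != -1:
                continue
            # constraint check: skip j if it cannot-links with a cluster member
            members = {k for k, l in enumerate(labels) if l == cid}
            if any({j, m} in cannot_link for m in members):
                continue
            labels[j] = cid
            more = region_query(points, j, eps)
            if len(more) >= min_pts:       # j is a core point: keep expanding
                queue.extend(more)
        cid += 1
    return labels

points = [0.0, 0.1, 0.2, 5.0, 5.1]
cannot_link = [{0, 2}]  # user feedback: points 0 and 2 must not co-cluster
labels = cons_dbscan(points, eps=0.15, min_pts=2, cannot_link=cannot_link)
```

Without the constraint, points 0–2 would form one dense cluster; the cannot-link pair forces point 2 into a separate cluster, showing how a single informative constraint reshapes the result.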

13.
A Document Classification Method Based on an Extended Corner Classification Neural Network   Cited by: 10 (self-citations: 0, by others: 10)
The CC4 neural network is a novel corner classification training algorithm for three-layer feedforward networks, originally used for document classification in the meta-search engine Anvish. CC4 classifies well when documents are of similar size, but its performance degrades when document sizes differ substantially. To address this problem, this paper extends the original CC4 network for effective document classification. First, an MDS-NN-based data indexing method is proposed that maps each document to a point in k-dimensional space while preserving as much of the inter-document distance information as possible. Second, the index information is transformed into the 0/1 sequences that CC4 accepts, extending the network to take index information as input. Experimental results show that for documents of widely differing sizes the extended CC4 network outperforms the original, and that its classification accuracy depends closely on the document indexing method.

14.
Transforming paper documents into XML format with WISDOM++   Cited by: 1 (self-citations: 1, by others: 0)
The transformation of scanned paper documents to a form suitable for an Internet browser is a complex process that requires solutions to several problems. The application of an OCR to some parts of the document image is only one of the problems. In fact, the generation of documents in HTML format is easier when the layout structure of a page has been extracted by means of a document analysis process. The adoption of an XML format is even better, since it can facilitate the retrieval of documents in the Web. Nevertheless, an effective transformation of paper documents into this format requires further processing steps, namely document image classification and understanding. WISDOM++ is a document processing system that operates in five steps: document analysis, document classification, document understanding, text recognition with an OCR, and transformation into HTML/XML format. The innovative aspects described in the paper are: the preprocessing algorithm, the adaptive page segmentation, the acquisition of block classification rules using techniques from machine learning, the layout analysis based on general layout principles, and a method that uses document layout information for conversion to HTML/XML formats. A benchmarking of the system components implementing these innovative aspects is reported. Received June 15, 2000 / Revised November 7, 2000

15.
Document image classification is an important step in Office Automation, Digital Libraries, and other document image analysis applications. There is great diversity in document image classifiers: they differ in the problems they solve, in the use of training data to construct class models, and in the choice of document features and classification algorithms. We survey this diverse literature using three components: the problem statement, the classifier architecture, and performance evaluation. This brings to light important issues in designing a document classifier, including the definition of document classes, the choice of document features and feature representation, and the choice of classification algorithm and learning mechanism. We emphasize techniques that classify single-page typeset document images without using OCR results. Developing a general, adaptable, high-performance classifier is challenging due to the great variety of documents, the diverse criteria used to define document classes, and the ambiguity that arises due to ill-defined or fuzzy document classes.

16.
Web Document Classification Based on Rough Set Latent Semantic Indexing   Cited by: 5 (self-citations: 0, by others: 5)
Rough set theory is a mathematical tool for handling uncertain or vague knowledge. This paper proposes a rough-set-based latent semantic indexing method for classifying Web documents. Web documents are first represented using the vector space model; singular value decomposition of the term-document matrix then performs information filtering and latent semantic indexing; an attribute reduction algorithm generates classification rules; and finally multiple knowledge bases are used to classify the documents. Comparative experiments show that the method achieves good classification results.
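The latent semantic indexing step can be sketched with a truncated SVD of the term-document matrix; documents whose terms co-occur end up close in the reduced space. The rough-set attribute reduction and rule generation are not shown, and the toy matrix is an assumption.

```python
import numpy as np

# LSI sketch: SVD of a term-document matrix, keeping the top-k singular
# values, yields low-rank document representations in which co-occurring
# terms are folded into shared latent dimensions.

# toy term-document matrix: rows = terms, columns = documents
A = np.array([[1, 1, 0, 0],
              [1, 1, 0, 0],
              [0, 0, 1, 1],
              [0, 0, 1, 1]], dtype=float)

U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2
docs_k = (np.diag(s[:k]) @ Vt[:k]).T   # each row: one document in k-dim space

def cos(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
```

Documents 0 and 1 (which share all their terms) have cosine similarity 1 in the latent space, while documents 0 and 2 (no shared terms) are orthogonal — the separation the subsequent rule-generation step would exploit.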

17.
An Information Security Filtering System Based on the Vector Space Model   Cited by: 6 (self-citations: 0, by others: 6)
Information filtering is the process of monitoring information sources to find information that meets users' needs. This paper describes in detail an information filtering system based on the vector space model, consisting of a training stage and an adaptive filtering stage. In the training stage, an initial filtering template is built through topic processing and feature extraction, and an initial threshold is set. In the filtering stage, the template and the threshold are adjusted adaptively according to user feedback. Evaluation methods and experimental results are also given.
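The two-stage pipeline might be sketched as follows: a template of term weights, cosine similarity against each incoming document, and a threshold nudged up after false alarms and down after misses. The concrete adjustment rule, step size, and sample data are illustrative assumptions, not the paper's algorithm.

```python
import math

# VSM filtering sketch: score documents by cosine similarity against a
# term-weight template, deliver those above a threshold, and adapt the
# threshold from user feedback (raise after false alarms, lower after misses).

def cosine(a, b):
    common = set(a) & set(b)
    num = sum(a[t] * b[t] for t in common)
    den = (math.sqrt(sum(v * v for v in a.values()))
           * math.sqrt(sum(v * v for v in b.values())))
    return num / den if den else 0.0

template = {"virus": 1.0, "attack": 0.8, "network": 0.5}
threshold = 0.3

def filter_and_adapt(doc, relevant, threshold, step=0.05):
    score = cosine(template, doc)
    delivered = score >= threshold
    if delivered and not relevant:      # false alarm: raise the bar
        threshold += step
    elif not delivered and relevant:    # miss: lower the bar
        threshold -= step
    return delivered, threshold

delivered, threshold = filter_and_adapt({"virus": 1.0, "network": 1.0},
                                        False, threshold)
```

Here the document scores above the initial threshold and is delivered, but the user judges it non-relevant, so the threshold rises for subsequent documents.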

18.
In this paper, a novel robust training algorithm for multi-input multi-output recurrent neural networks and its application in the fault-tolerant control of a robotic system are investigated. The proposed scheme optimizes gradient-type training on the basis of three new adaptive parameters, namely, a dead-zone learning rate, a hybrid learning rate, and a normalization factor. The adaptive dead-zone learning rate is employed to improve the steady-state response. The normalization factor is used to maximize the gradient depth in training, so as to improve the transient response. The hybrid learning rate switches training between the back-propagation and real-time recurrent learning modes, such that training is robustly stable. The weight convergence and L2 stability of the algorithm are proved via a Lyapunov function and Cluett's law, respectively. Based on the theoretical results, we carry out simulation studies of a two-link robot arm position tracking control system. A computed torque controller is designed to provide a specified closed-loop performance in the fault-free condition, and the RNN compensator and the robust training algorithm are then employed to recover the performance when a fault occurs. Comparisons are given to demonstrate the advantages of the control method and the proposed training algorithm.

19.

Document filtering is increasingly deployed in Web environments to reduce information overload of users. We formulate online information filtering as a reinforcement learning problem, i.e., TD(0). The goal is to learn user profiles that best represent information needs and thus maximize the expected value of user relevance feedback. A method is then presented that acquires reinforcement signals automatically by estimating user's implicit feedback from direct observations of browsing behaviors. This "learning by observation" approach is contrasted with conventional relevance feedback methods which require explicit user feedbacks. Field tests have been performed that involved 10 users reading a total of 18,750 HTML documents during 45 days. Compared to the existing document filtering techniques, the proposed learning method showed superior performance in information quality and adaptation speed to user preferences in online filtering.
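The TD(0) formulation can be sketched as a linear value estimate over document features, updated toward the reward-plus-discounted-next-value target. The feature names, the implicit reward, and all constants below are toy stand-ins for the observed browsing behavior the paper uses.

```python
# TD(0) sketch for online filtering: the user profile is a term-weight
# vector whose dot product with a document's features estimates the
# expected feedback value; each observation moves it toward the TD target.

def td0_update(profile, features, reward, next_features, alpha=0.1, gamma=0.9):
    v = sum(profile.get(t, 0.0) * x for t, x in features.items())
    v_next = sum(profile.get(t, 0.0) * x for t, x in next_features.items())
    delta = reward + gamma * v_next - v        # TD error
    for t, x in features.items():
        profile[t] = profile.get(t, 0.0) + alpha * delta * x
    return profile

profile = {}
# implicit reward 1.0: e.g., the user dwelt on this page for a long time
profile = td0_update(profile, {"python": 1.0, "tutorial": 0.5}, 1.0,
                     {"python": 1.0})
```

Because the reward is estimated from browsing behavior rather than asked of the user, the profile adapts with no explicit relevance judgments — the "learning by observation" idea in the abstract.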

20.
Text categorization (TC) is the automated assignment of text documents to predefined categories based on document content. TC has been an application domain for many learning approaches, which have proved effective. Nevertheless, TC poses many challenges to machine learning. In this paper, we suggest, for text categorization, the integration of external WordNet lexical information to supplement training data for a semi-supervised clustering algorithm which (i) uses a finite design set of labeled data to (ii) help agglomerative hierarchical clustering algorithms (AHC) partition a finite set of unlabeled data and then (iii) terminates without the capacity to classify other objects. This algorithm is the "semi-supervised agglomerative hierarchical clustering algorithm" (ssAHC). Our experiments use the Reuters 21578 database and consist of binary classifications for categories selected from the 89 TOPICS classes of the Reuters collection. Using the vector space model (VSM), each document is represented by its original feature vector augmented with an external feature vector generated using WordNet. We verify experimentally that the integration of WordNet helps ssAHC improve its performance, effectively addresses the classification of documents into categories with few training documents, and does not interfere with the use of training data. © 2001 John Wiley & Sons, Inc.
