Similar literature
 20 similar documents found; search took 15 ms
1.
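An improved adaptive text information filtering model   Total citations: 19, self-citations: 1, cited by others: 18
Adaptive information filtering helps users retrieve content of interest from, or filter irrelevant junk out of, large information sources such as the Web. To address the shortcomings of existing adaptive filtering systems, this paper proposes an improved adaptive text information filtering model. The model provides two relevance retrieval mechanisms, improves the feedback algorithm on that basis, adopts incremental training, and introduces a new algorithm for the adaptive learning mechanism used during filtering. A system based on the model achieved good results in an international evaluation in the field. Experimental data show that each improvement is effective and that the new model performs better.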

2.
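An improved adaptive filtering strategy for Web text   Total citations: 1, self-citations: 0, cited by others: 1
Adaptive information filtering can largely meet the new challenges of real-time network information filtering. To address the shortcomings of existing adaptive systems, this paper proposes new methods for learning that improves template accuracy and for optimizing the filtering threshold. The improved strategy uses an SVM algorithm in the early stage of filtering and an improved adaptive template filtering method in the middle and late stages. Templates are updated with an improved template coefficient adjustment strategy, and a feature decay factor is introduced to raise filtering accuracy. The system runs on a campus network gateway and has achieved good results.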

3.
In this paper we report our research on building WebSail, an intelligent web search engine that is able to perform real-time adaptive learning. WebSail learns from the user's relevance feedback, so that it is able to speed up its search process and to enhance its search performance. We design an efficient adaptive learning algorithm TW2 to search for web documents. WebSail employs TW2 together with an internal index database and a real-time meta-searcher to perform real-time adaptive learning to find desired documents with as little relevance feedback from the user as possible. The architecture and performance of WebSail are also discussed. Received 3 November 2000 / Revised 13 March 2001 / Accepted in revised form 17 April 2001

4.
Kernel-based algorithms have proven successful in many nonlinear modeling applications. However, the computational complexity of classical kernel-based methods grows superlinearly with the number of training samples, which is too expensive for online applications. To solve this problem, the paper presents an information theoretic method to train a sparse version of the kernel learning algorithm. A concept named instantaneous mutual information is investigated to measure the system reliability of the estimated output. This measure is used as a criterion to determine the novelty of a training sample, and informative samples are selected to form a compact dictionary that represents the whole data set. Furthermore, we propose a robust learning scheme for training the kernel learning algorithm with an adaptive learning rate. This ensures the convergence of the learning algorithm and makes it converge to the steady state faster. We illustrate the performance of our proposed algorithm and compare it with some recent kernel algorithms in several experiments.

5.
《Knowledge》2005,18(2-3):117-124
In this paper we propose an approach for refining a document ranking by learning filtering rulesets through relevance feedback. This approach includes two important procedures. One is a filtering method, which can be incorporated into any kind of information retrieval system. The other is a learning algorithm that builds a set of filtering rules, each of which specifies a condition to identify relevant documents using combinations of characteristic words. Our approach is useful not only to overcome the limitation of the vector space model, but also to utilize tags of semi-structured documents like Web pages. Through experiments we show that our approach improves the performance of relevance feedback in two types of IR systems: one adopting the vector space model and the other a Web search engine.

6.
Biblio is an adaptive system that automatically extracts meta-data from semi-structured and structured scanned documents. Instead of using hand-coded templates or other methods manually customized for each given document format, it uses example-based machine learning to adapt to customer-defined document and meta-data types. We provide results from experiments on the recognition of document information in two document corpuses: a set of scanned journal articles and a set of scanned legal documents. The first set is semi-structured, as the different journals use a variety of flexible layouts. The second set is largely free-form text based on poor quality scans of FAX-quality legal documents. We demonstrate accuracy on the semi-structured document set roughly comparable to hand-coded systems, and much worse performance on the legal documents.

7.
Learning to rank is a supervised learning problem that aims to construct a ranking model for the given data. The most common application of learning to rank is to rank a set of documents against a query. In this work, we focus on point-wise learning to rank, where the model learns the ranking values. Multivariate adaptive regression splines (MARS) and conic multivariate adaptive regression splines (CMARS) are supervised learning techniques that have been proven to provide successful results on various prediction problems. In this article, we investigate the effectiveness of MARS and CMARS for the point-wise learning-to-rank problem. The prediction performance is analyzed in comparison to three well-known supervised learning methods, artificial neural network (ANN), support vector machine, and random forest, for two datasets under a variety of metrics including accuracy, stability, and robustness. The experimental results show that MARS and ANN are effective methods for the learning-to-rank problem and provide promising results.

8.
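雷蕾 (Lei Lei), 王晓丹 (Wang Xiaodan), 周进登 (Zhou Jindeng). 《计算机科学》 (Computer Science), 2012, 39(12): 245-248
Sentiment classification aims to automatically identify the sentiment expressed by a text (for example, positive or negative, support or opposition). This paper proposes a sentiment classification method based on co-training with emotion words and sentiment words: starting from traditional sentiment-word resources, a small number of emotion words are introduced to assist learning, and sentiment classification is achieved using only large-scale unlabeled data. Specifically, within a label propagation framework on a document-word bipartite graph, emotion words and sentiment words are used to build two views; co-training then extracts high-accuracy automatically labeled samples from a large unlabeled corpus as training data, and a classifier is trained on them for sentiment classification. Experiments show that the method achieves good classification results on sentiment classification tasks in several domains.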

9.
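王金宝 (Wang Jinbao). 《计算机应用》 (Journal of Computer Applications), 2006, 26(5): 1099-1101
To meet the needs of real-time online network information filtering, a new adaptive filtering model is proposed. In the system initialization stage, an incremental learning method learns from a small number of additional pseudo-relevant documents, and an improved document term frequency method extracts feature terms, which expand the profile template and improve its accuracy. In the system testing stage, with the goal of optimizing the system utility measure, a new threshold optimization algorithm is proposed that combines a probabilistic model with statistics on the distribution of positive documents.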

10.
Semi-supervised document clustering, which takes into account limited supervised data to group unlabeled documents into clusters, has received significant interest recently. Because obtaining supervised data may be expensive, it is important to acquire the most informative knowledge to improve the clustering performance. This paper presents a semi-supervised document clustering algorithm and a new method for actively selecting informative instance-level constraints to improve clustering performance. The semi-supervised document clustering algorithm is a Constrained DBSCAN (Cons-DBSCAN) algorithm, which incorporates instance-level constraints to guide the clustering process in DBSCAN. An active learning approach is proposed to select informative document pairs for obtaining user feedback. Experimental results show that Cons-DBSCAN with our proposed active learning approach can improve the clustering performance significantly when given a relatively small number of constraints.

11.
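李昕 (Li Xin), 钱旭 (Qian Xu), 王自强 (Wang Ziqiang). 《计算机工程》 (Computer Engineering), 2010, 36(15): 40-42, 48
To solve the document clustering problem effectively, a document clustering algorithm based on margin manifold learning is proposed. The algorithm uses marginal Fisher analysis to reduce the high-dimensional document space to a low-dimensional feature space and then applies a support vector clustering algorithm. Experimental results on benchmark document test sets show that its clustering performance is better than that of other commonly used document clustering algorithms.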

12.
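A document classification method based on an extended corner classification neural network   Total citations: 10, self-citations: 0, cited by others: 10
The CC4 neural network is a new corner classification training algorithm for three-layer feedforward networks, originally used for document classification in the meta-search engine Anvish. CC4 classifies well when documents are of similar size, but performs poorly when document sizes differ greatly. To address this, the paper extends the original CC4 neural network to classify such documents effectively. First, a data indexing method based on MDS-NN is proposed, which maps each document to a point in k-dimensional space while preserving as much of the distance information between the original documents as possible. Second, the index information is converted into the 0/1 sequences that CC4 accepts, extending the network so that it can take index information as input. Experimental results show that the extended CC4 network outperforms the original on documents whose sizes differ greatly, and that its classification accuracy depends closely on the document indexing method.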

13.
Transforming paper documents into XML format with WISDOM++   Total citations: 1, self-citations: 1, cited by others: 0
The transformation of scanned paper documents to a form suitable for an Internet browser is a complex process that requires solutions to several problems. The application of an OCR to some parts of the document image is only one of the problems. In fact, the generation of documents in HTML format is easier when the layout structure of a page has been extracted by means of a document analysis process. The adoption of an XML format is even better, since it can facilitate the retrieval of documents in the Web. Nevertheless, an effective transformation of paper documents into this format requires further processing steps, namely document image classification and understanding. WISDOM++ is a document processing system that operates in five steps: document analysis, document classification, document understanding, text recognition with an OCR, and transformation into HTML/XML format. The innovative aspects described in the paper are: the preprocessing algorithm, the adaptive page segmentation, the acquisition of block classification rules using techniques from machine learning, the layout analysis based on general layout principles, and a method that uses document layout information for conversion to HTML/XML formats. A benchmarking of the system components implementing these innovative aspects is reported. Received June 15, 2000 / Revised November 7, 2000

14.
Document image classification is an important step in Office Automation, Digital Libraries, and other document image analysis applications. There is great diversity in document image classifiers: they differ in the problems they solve, in the use of training data to construct class models, and in the choice of document features and classification algorithms. We survey this diverse literature using three components: the problem statement, the classifier architecture, and performance evaluation. This brings to light important issues in designing a document classifier, including the definition of document classes, the choice of document features and feature representation, and the choice of classification algorithm and learning mechanism. We emphasize techniques that classify single-page typeset document images without using OCR results. Developing a general, adaptable, high-performance classifier is challenging due to the great variety of documents, the diverse criteria used to define document classes, and the ambiguity that arises due to ill-defined or fuzzy document classes.

15.
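Web document classification based on rough set latent semantic indexing   Total citations: 5, self-citations: 0, cited by others: 5
Rough set theory is a mathematical tool for handling uncertain or vague knowledge. A Web document classification method based on rough set theory and latent semantic indexing is proposed. Web documents are first represented with the vector space model; singular value decomposition of the matrix is then used for information filtering and latent semantic indexing; an attribute reduction algorithm generates classification rules; finally, multiple knowledge bases are used to classify documents. Experimental comparison shows that the method classifies well.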

16.
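Neural ranking models are widely used for ranking tasks in information retrieval. They demand very high data quality, but information retrieval datasets usually contain considerable noise, so documents irrelevant to a query cannot be identified precisely. High-quality negative samples are therefore essential for training a high-performance neural ranking model. Drawing on the idea of the existing method doc2query, this paper proposes AQGM, a deep end-to-end model that learns from mismatched query-document pairs to generate adversarial queries that are irrelevant to the document but similar to the original query, which increases query diversity and improves the quality of negative samples. A BERT-based deep ranking model is trained on real samples together with samples generated by AQGM. Experiments show that, compared with the BERT-base baseline, the method improves MRR by 0.3% and 3.2% on the MSMARCO and TrecQA datasets, respectively.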

17.
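王粲 (Wang Can), 夏元清 (Xia Yuanqing), 邹伟东 (Zou Weidong). 《计算机应用研究》 (Application Research of Computers), 2021, 38(6): 1724-1727, 1764
To address the system instability caused by the uncertainty of hidden nodes in the extreme learning machine (ELM), and its excessive computational burden on large datasets, a regularized extreme learning machine based on an adaptive and momentum method (AdaMom) is proposed. The main idea is to construct a continuously differentiable objective function, compute an adaptive learning rate during gradient descent, take the exponentially weighted average of the product of the adaptive learning rate and the gradient, and iterate to obtain the hidden-layer output weight matrix that minimizes the loss function. Experimental results show that, when trained on the same benchmark datasets, AdaMom-ELM has very good generalization performance and robustness and improves computational efficiency.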
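
The update described in this abstract can be illustrated with a rough sketch. This is one possible reading, not the authors' algorithm: the ridge objective, the RMS-style adaptive rate, and every name and hyperparameter here (`train_elm_adamom`, `lam`, `lr`, `beta1`, `beta2`) are assumptions made for illustration. Only the output weights are trained; the hidden layer is random and fixed, as in a standard ELM.

```python
import math
import random


def train_elm_adamom(xs, ys, hidden=10, lam=1e-3, lr=0.01,
                     beta1=0.9, beta2=0.99, eps=1e-8, iters=500, seed=0):
    """Fit ELM output weights by gradient descent with an assumed AdaMom-style step."""
    rng = random.Random(seed)
    # ELM: hidden weights and biases are drawn once at random and never trained.
    w = [rng.uniform(-2.0, 2.0) for _ in range(hidden)]
    b = [rng.uniform(-1.0, 1.0) for _ in range(hidden)]
    H = [[math.tanh(w[j] * x + b[j]) for j in range(hidden)] for x in xs]
    n = len(xs)

    def objective(beta):
        # Differentiable ridge objective: squared error plus an L2 penalty (assumed form).
        sq = sum((sum(h[j] * beta[j] for j in range(hidden)) - y) ** 2
                 for h, y in zip(H, ys))
        return sq / (2 * n) + 0.5 * lam * sum(bj * bj for bj in beta)

    beta = [0.0] * hidden   # output weights, the only trained parameters
    m = [0.0] * hidden      # moving average of (adaptive rate * gradient)
    v = [0.0] * hidden      # moving average of squared gradients
    losses = [objective(beta)]
    for _ in range(iters):
        # Residuals are computed once so all coordinates use the same gradient.
        resid = [sum(h[j] * beta[j] for j in range(hidden)) - y
                 for h, y in zip(H, ys)]
        for j in range(hidden):
            g = sum(r * h[j] for r, h in zip(resid, H)) / n + lam * beta[j]
            v[j] = beta2 * v[j] + (1 - beta2) * g * g
            step = lr / (math.sqrt(v[j]) + eps) * g      # adaptive rate times gradient
            m[j] = beta1 * m[j] + (1 - beta1) * step     # exponentially weighted average
            beta[j] -= m[j]
        losses.append(objective(beta))
    return beta, losses
```

Calling `train_elm_adamom(xs, ys)` on a small one-dimensional regression set returns the learned output weights and the objective value after each iteration, which makes it easy to check that the loss falls.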

18.
19.
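An information security filtering system based on the vector space model   Total citations: 6, self-citations: 0, cited by others: 6
Information filtering is the process of monitoring information sources to find information that meets a user's needs. The paper describes in detail an information filtering system based on the vector space model. The system consists of two phases, training and adaptive filtering. In the training phase, topic processing and feature extraction build the initial filtering template and set the initial threshold; in the filtering phase, the template and threshold are adjusted adaptively according to user feedback. Finally, the evaluation method and experimental results are given.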
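
The two phases described in this abstract, building a template with an initial threshold and then adjusting both from user feedback, can be sketched as follows. This is a minimal illustration under stated assumptions, not the system from the paper: the class `AdaptiveFilter`, the raw term-frequency weighting, the Rocchio-like template update, and the fixed threshold step are all hypothetical choices.

```python
import math
from collections import Counter


def vectorize(text):
    """Raw term-frequency vector; a real system would likely add TF-IDF weighting."""
    return {t: float(c) for t, c in Counter(text.lower().split()).items()}


def cosine(a, b):
    """Cosine similarity between two sparse term-weight dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0


class AdaptiveFilter:
    def __init__(self, training_docs, threshold=0.3, alpha=0.5, step=0.05):
        # Training phase: the initial template is the sum of the relevant documents.
        self.profile = {}
        for doc in training_docs:
            self._add(vectorize(doc), 1.0)
        self.threshold = threshold
        self.alpha = alpha    # weight of a feedback document (assumed value)
        self.step = step      # threshold adjustment per judgment (assumed value)

    def _add(self, vec, weight):
        for t, w in vec.items():
            self.profile[t] = self.profile.get(t, 0.0) + weight * w

    def accept(self, text):
        """Deliver the document if it is similar enough to the template."""
        return cosine(self.profile, vectorize(text)) >= self.threshold

    def feedback(self, text, relevant):
        """Filtering phase: move the template and threshold after a user judgment."""
        vec = vectorize(text)
        if relevant:
            self._add(vec, self.alpha)
            self.threshold = max(0.0, self.threshold - self.step)
        else:
            self._add(vec, -self.alpha)
            self.threshold = min(1.0, self.threshold + self.step)
```

A filter built with `AdaptiveFilter(["adaptive information filtering model"])` accepts documents that share enough terms with the template, and each `feedback` call shifts both the template weights and the acceptance threshold.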

20.
《Knowledge》2000,13(5):285-296
Machine-learning techniques play an important role in information filtering. The main objective of machine learning here is to obtain users' profiles. To decrease the burden of on-line learning, it is important to seek suitable structures to represent user information needs. This paper proposes a model for information filtering on the Web. The user information need is described at two levels in this model: profiles at the category level, and Boolean queries at the document level. To efficiently estimate the relevance between the user information need and documents, the user information need is treated as a rough set on the space of documents. Rough set decision theory is used to classify new documents according to the user information need. As a result, the new documents are divided into three parts: positive region, boundary region, and negative region. An experimental system, JobAgent, is also presented to verify this model, and it shows that the rough set based model can provide an efficient approach to solving the information overload problem.
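
The three-region split mentioned in this abstract follows from the standard rough set approximations, and a small sketch may make it concrete. The code below is an illustration only, not the JobAgent implementation: the function names, the attribute tuples, and the choice to send unseen attribute combinations to the boundary region are assumptions.

```python
from collections import defaultdict


def build_regions(judged):
    """Map each attribute tuple to 'positive', 'boundary' or 'negative'.

    judged is a list of (attributes, relevant) pairs, where attributes is a
    hashable tuple. Documents with identical attributes are indiscernible and
    form one equivalence class.
    """
    labels = defaultdict(set)
    for attrs, relevant in judged:
        labels[attrs].add(bool(relevant))
    regions = {}
    for attrs, seen in labels.items():
        if seen == {True}:
            regions[attrs] = "positive"   # class lies inside the lower approximation
        elif seen == {False}:
            regions[attrs] = "negative"   # class lies outside the upper approximation
        else:
            regions[attrs] = "boundary"   # in the upper but not the lower approximation
    return regions


def classify(regions, attrs, default="boundary"):
    """Classify a new document by the region of its equivalence class."""
    return regions.get(attrs, default)
```

Given judged examples such as `(("software", "full-time"), True)`, `classify` places a new document with the same attributes in the positive region only when every judged document in that class was relevant; mixed classes fall in the boundary region.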
