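To address the domain drift that the traditional web page ranking algorithm Okapi BM25 often exhibits, where returned pages fall outside the domain of the query keywords, and the need for improved algorithms to build domain vectors by hand, a web search ranking algorithm based on BM25 and a Softmax regression classification model is proposed. The method first preprocesses the page text and represents it as vectors with a bag-of-words model. It then trains a Softmax regression classifier on a small amount of web page data, uses it to predict class scores for the test pages, and combines those scores with the BM25 retrieval scores to produce the final ranking. Experimental results show that the algorithm achieves good ranking results without manually built domain vectors.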
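
The score combination described in this abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the abstract does not say how the two scores are combined, so the linear blend, the `alpha` weight, the max-normalization of BM25, and the toy logits standing in for a trained Softmax regression model are all assumptions.

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Okapi BM25 score of each tokenized document for a tokenized query."""
    n = len(docs)
    avgdl = sum(len(d) for d in docs) / n
    # Document frequency: number of documents containing each term.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        s = 0.0
        for term in query:
            if term not in tf:
                continue
            idf = math.log((n - df[term] + 0.5) / (df[term] + 0.5) + 1.0)
            norm = tf[term] + k1 * (1 - b + b * len(d) / avgdl)
            s += idf * tf[term] * (k1 + 1) / norm
        scores.append(s)
    return scores

def softmax(logits):
    """Numerically stable softmax over a list of class logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def combined_scores(bm25, class_probs, target_class, alpha=0.5):
    """Hypothetical blend: normalized BM25 plus probability of the query's domain."""
    top = max(bm25) or 1.0  # avoid dividing by zero when nothing matches
    return [alpha * s / top + (1 - alpha) * p[target_class]
            for s, p in zip(bm25, class_probs)]

# Two pages match the query "apple" equally well under BM25.
docs = [["apple", "phone", "chip"], ["apple", "fruit", "juice"]]
bm25 = bm25_scores(["apple"], docs)
# Toy logits standing in for a trained classifier: class 0 = technology, 1 = food.
probs = [softmax([2.0, 0.0]), softmax([0.0, 2.0])]
final = combined_scores(bm25, probs, target_class=0)
```

With a technology-domain query, the class probability breaks the BM25 tie in favor of the first page, which is the kind of domain drift correction the abstract describes.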

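An ACA-SVM Web Page Classification Method with Quantum Characteristics
To overcome the high computational complexity of SVM web page classification and its poor fit for large-scale settings, a Chinese web page classification method that fuses an ACA with quantum characteristics and SVM is proposed. The algorithm is further improved with a strategy that dynamically adjusts the rotation angle of the rotation gate. Experiments show clear improvements in precision, recall, and processing time.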

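Link-Based Web Page Classification
Based on the characteristics of links, a model for acquiring link information is proposed, and the resulting link information is combined with the attributes of the object itself to jointly train classification rules. To account for the particular nature of web page links, the directed link graph is remodeled. Experiments show that adding link information effectively improves classification results and that remodeling the directed link graph also improves classification accuracy.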

The World Wide Web is growing continuously, and Web content will surely increase tremendously within the next few years. Hence, there is a strong need for algorithms that can classify Web pages accurately. Automatic Web page classification differs significantly from traditional text classification because of the additional information provided by the HTML structure. Recently, several techniques have arisen from combinations of artificial intelligence and statistical approaches. However, finding an optimal classification technique for Web pages is not a simple matter. This paper introduces a novel strategy for vertical Web page classification called Classification using Multi-layered Domain Ontology (CMDO). It employs several Web mining techniques and depends mainly on a proposed multi-layered domain ontology. To improve classification accuracy, CMDO employs a distiller to reject pages related to other domains. CMDO also employs a novel classification technique called Graph Based Classification (GBC). The proposed GBC has features that other techniques lack, such as outlier rejection and pruning. Experimental results show that CMDO outperforms recent techniques in precision, recall, and classification accuracy.

邓健爽  郑启伦  彭宏 《计算机应用》2006,26(5):1134-1136
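Automatic web page classification is a hot research topic in Internet search. Current approaches classify mainly by page text content or by the hyperlink structure between pages. These approaches use only information from the pages themselves and ignore information provided by the website that hosts them. This paper proposes a new algorithm that reduces a website's internal topology, extracts the site's implicit hierarchy, and generates a hierarchy tree, enabling multi-level classification of the pages within a site. The algorithm has been applied successfully in an intelligent e-commerce search and mining system.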

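To address the high complexity of extracting unstructured information from web pages, a Web information extraction algorithm based on page segmentation is proposed. Page noise is removed in preprocessing, tag paths are clustered according to the page's Document Object Model (DOM) tree structure, the key parts of the page are identified quickly using an automatically trained threshold and a page segmentation algorithm, and a text extraction template is derived from the nested structure in the data blocks. Experiments on different types of websites show that the algorithm is fast and accurate.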

In this paper, genetic algorithm oriented latent semantic features (GALSF) are proposed to obtain a better representation of documents in text classification. The proposed approach consists of feature selection and feature transformation stages. The first stage is carried out using state-of-the-art filter-based methods. The second stage employs latent semantic indexing (LSI) empowered by a genetic algorithm, so that a better projection is attained using appropriate singular vectors, which are not limited to the ones corresponding to the largest singular values, unlike the standard LSI approach. In this way, singular vectors with small singular values may be used for projection, while vectors with large singular values may be eliminated, to obtain better discrimination. Experimental results demonstrate that GALSF outperforms both LSI and filter-based feature selection methods on benchmark datasets for various feature dimensions.

Webnaut is an intelligent agent system that uses a genetic algorithm to collect and recommend Web pages. A feedback mechanism adapts to user interests as they evolve. The authors first describe intelligent assistant systems in general and then present the Webnaut architecture, its learning agent, and the genetic algorithm. They conclude with results from two preliminary experiments that tested the accuracy and adaptability of the learning agent.

This paper proposes genetic algorithm feature selection (GAFS) for image retrieval systems and image classification. Two texture features, the adaptive motifs co-occurrence matrix (AMCOM) and the gradient histogram for adaptive motifs (GHAM), and one color feature, an adaptive color histogram for K-means (ACH), are used. Feature selection is performed with sequential forward selection (SFS), sequential backward selection (SBS), and GAFS. Image retrieval and classification performance is built mainly on the three features ACH, AMCOM, and GHAM, and the classification system uses two-class SVM classification. The experimental results show that all the feature extraction methods in this study contribute to better image retrieval and image classification results. GAFS provides a more robust solution at the expense of increased computational effort. Applying GAFS to image retrieval systems not only reduces the number of features effectively but also yields higher retrieval accuracy.

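In text classification based on the vector space model, feature vectors are extremely sparse and high-dimensional, so classification efficiency improves only when the dimensionality of the vector space is reduced. After statistical methods select the classification features and reduce the feature space, latent semantic analysis is used to mine the semantic information among document features, and singular value decomposition of the matrix reduces the dimensionality further. Experimental results show that the macro-averaged F1 of the classification results improves by about 5%, confirming the effectiveness of the method.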

An Optimized Page Ranking Algorithm Based on Temporal Link Analysis
鞠时光  吕霞   《计算机应用研究》2009,26(7):2438-2441
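Traditional page ranking algorithms favor old pages, so some old pages often appear near the top of search results. To improve such algorithms, temporal link analysis is introduced: the modification time returned by the HTTP protocol when the crawler fetches a page is used as the time of the page and its links, and page weights are computed from both the number and the time of a page's inbound and outbound links. The resulting WTPR algorithm raises new pages in the ranking, while high-quality old pages still receive higher rank values than ordinary old pages.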
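
The idea of letting link time influence rank can be illustrated with a small PageRank variant. The abstract does not give the WTPR formula, so this is a generic sketch under stated assumptions: each link is weighted by an exponential freshness factor with a hypothetical `half_life`, and the function and parameter names are illustrative only.

```python
def time_weighted_pagerank(links, link_age_days, damping=0.85,
                           half_life=365.0, iters=50):
    """PageRank in which fresher links pass on a larger share of rank.

    links: dict mapping a page to the list of pages it links to.
    link_age_days: dict mapping (source, target) to the link's age in days.
    """
    pages = sorted(set(links) | {d for targets in links.values() for d in targets})
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    # Assumed freshness weight: a link loses half its weight every half_life days.
    weight = {(s, d): 0.5 ** (link_age_days[(s, d)] / half_life)
              for s in links for d in links[s]}
    out_total = {s: sum(weight[(s, d)] for d in links.get(s, [])) for s in pages}
    for _ in range(iters):
        new = {p: (1 - damping) / n for p in pages}
        # Rank held by pages without outlinks is spread evenly over all pages.
        dangling = sum(rank[p] for p in pages if out_total[p] == 0)
        for p in pages:
            new[p] += damping * dangling / n
        for s in links:
            for d in links[s]:
                new[d] += damping * rank[s] * weight[(s, d)] / out_total[s]
        rank = new
    return rank

# A hub links to one page through a four-year-old link and to another
# through a one-month-old link; both pages link back to the hub.
links = {"hub": ["old", "new"], "old": ["hub"], "new": ["hub"]}
ages = {("hub", "old"): 1460, ("hub", "new"): 30,
        ("old", "hub"): 100, ("new", "hub"): 100}
ranks = time_weighted_pagerank(links, ages)
```

In this toy graph the page reached through the fresher link ends up with the higher rank, while the total rank still sums to one.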

Chinese Web Page Classification Based on Dynamically Generated Representative Samples
华北  曹先彬 《计算机应用》2006,26(10):2502-2504
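For Chinese web page classification, this paper designs a new classification algorithm based on dynamically generated representative samples. The algorithm generates representative samples one by one by training on the original training set and makes full use of the useful information in pruned training samples to adjust the generated representative samples repeatedly, making them more representative. Experiments with a Chinese web page classifier built on the algorithm show that it compresses the original training set effectively and improves classification efficiency while maintaining classification accuracy, giving good overall classification performance.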

Nowadays, decision-making activities of knowledge-intensive enterprises depend heavily on the successful classification of patents. A considerable amount of time is required to achieve successful classification because of the complexity of patent information and the large number of potential patents. Several patent classification approaches have been developed in the past, but most of these studies focus on using computational models for the International Patent Classification (IPC) system rather than using these models in real-world cases of patent classification. In contrast to previous studies that combined algorithms and the IPC system directly without expert screening, this study proposes a novel artificial intelligence (AI)-aided patent decision-making process. In this process, an expert screening approach is integrated with a hybrid genetic-based support vector machine (HGA-SVM) model to develop a patent classification system with high classification accuracy and generalization ability for real-world patent searching cases. The proposed approach is tested on a real-world case: an expert's patent document searching history that contains 234 patent documents of semiconductor equipment components. The research results demonstrate that the proposed hybrid genetic algorithm approach can optimize all the parameters of the SVM to develop a patent classification system with high accuracy. The proposed HGA-SVM model is able to dynamically and automatically classify patent documents by recording and learning the experts’ knowledge and logic. Finally, we propose a new decision-making process for improving the development of the SVM patent classification and searching system.

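A graph-based semi-supervised learning algorithm for web page classification is proposed. The k-nearest-neighbor algorithm is used to build a weighted graph whose nodes are labeled or unlabeled pages and whose edge weights represent class propagation probabilities, so that web page classification is formalized as probabilistic propagation of classes over the graph. To make effective use of the unlabeled nodes, link weights between pages are computed from both content and link information, and class information is pushed with some probability from labeled nodes to unlabeled nodes. Experiments show that the proposed algorithm effectively improves web page classification results.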
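
Class propagation over a k-nearest-neighbor graph can be sketched as below. This is a generic label propagation example rather than the paper's algorithm: the Gaussian edge weight, the use of plain feature distance in place of the combined content and link weights, and all function names are assumptions made for illustration.

```python
import math

def propagate_labels(points, labels, k=2, n_classes=2, iters=100):
    """Spread class probabilities from labeled to unlabeled nodes on a k-NN graph.

    points: list of feature tuples, one per page.
    labels: dict mapping the index of a labeled page to its class.
    """
    n = len(points)
    # Build a symmetric k-NN graph; nbrs[i][j] is the edge weight between i and j.
    nbrs = [dict() for _ in range(n)]
    for i in range(n):
        dists = sorted((math.dist(points[i], points[j]), j)
                       for j in range(n) if j != i)
        for dist, j in dists[:k]:
            w = math.exp(-dist * dist)  # assumed Gaussian similarity
            nbrs[i][j] = w
            nbrs[j][i] = w
    # Start unlabeled nodes at a uniform distribution; clamp labeled nodes.
    F = [[1.0 / n_classes] * n_classes for _ in range(n)]
    for i, y in labels.items():
        F[i] = [1.0 if c == y else 0.0 for c in range(n_classes)]
    for _ in range(iters):
        new = [row[:] for row in F]
        for i in range(n):
            if i in labels:
                continue  # labeled nodes keep their known class
            total = sum(nbrs[i].values())
            new[i] = [sum(w * F[j][c] for j, w in nbrs[i].items()) / total
                      for c in range(n_classes)]
        F = new
    return [max(range(n_classes), key=lambda c: row[c]) for row in F]

# Two well-separated groups of pages with one labeled page in each group.
points = [(0.0, 0.0), (0.1, 0.0), (0.2, 0.1), (5.0, 5.0), (5.1, 5.0), (5.2, 5.1)]
predicted = propagate_labels(points, labels={0: 0, 3: 1})
```

Each unlabeled page takes the weighted average of its neighbors' class distributions on every pass, so the known label in each group gradually reaches the rest of that group.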

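To address the low efficiency of the serial PageRank algorithm on massive web page data, a parallel PageRank algorithm based on web page link classification is proposed. First, pages are grouped by the website they belong to, and pages from different sites are assigned different weights. Second, page ranks are computed in parallel using the Hadoop parallel computing framework and the divide-and-conquer nature of MapReduce. Finally, the parallel algorithm is optimized with a data compression method comprising three layers: a data layer, a preprocessing layer, and a computation layer. Experimental results show that, compared with serial PageRank, the proposed algorithm improves result accuracy by 12% and computational efficiency by 33% in the best case.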

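This paper proposes a new algorithm for training ANNs based on particle swarm optimization (PSO) and uses it to build a model for ABC classification of inventory items. The new algorithm combines the strengths of PSO and BP and optimizes both the weights and the neurons' log-sigmoid functions during training. Experimental results show that the new algorithm is a feasible method for decision prediction in enterprise inventory information management systems.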

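Image Feature Selection Based on Genetic Algorithms
To address the large number of redundant feature parameters extracted in pattern recognition, a feature selection method based on a genetic algorithm is proposed. The basic principles of genetic algorithms are introduced, and the fitness function and genetic operators are described and designed. Simulation experiments show that the method gives satisfactory results in both solution efficiency and solution quality.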
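
Genetic-algorithm feature selection of this kind can be sketched with bit-mask individuals. The abstract does not specify the fitness function or operators, so the tournament selection, one-point crossover, bit-flip mutation, elitism, and the toy fitness below are all illustrative assumptions.

```python
import random

def ga_feature_selection(n_features, fitness, pop_size=20, generations=30,
                         crossover_rate=0.8, mutation_rate=0.05, seed=0):
    """Search for a 0/1 feature mask that maximizes a user-supplied fitness."""
    rng = random.Random(seed)
    pop = [[rng.randint(0, 1) for _ in range(n_features)]
           for _ in range(pop_size)]
    history = []  # best fitness at the start of each generation
    for _ in range(generations):
        scored = sorted(pop, key=fitness, reverse=True)
        history.append(fitness(scored[0]))
        nxt = [scored[0][:]]  # elitism: carry the best mask over unchanged
        while len(nxt) < pop_size:
            # Tournament selection: the fitter of two random individuals wins.
            p1 = max(rng.sample(pop, 2), key=fitness)
            p2 = max(rng.sample(pop, 2), key=fitness)
            child = p1[:]
            if rng.random() < crossover_rate and n_features > 1:
                cut = rng.randrange(1, n_features)  # one-point crossover
                child = p1[:cut] + p2[cut:]
            # Bit-flip mutation toggles each feature with a small probability.
            child = [bit ^ 1 if rng.random() < mutation_rate else bit
                     for bit in child]
            nxt.append(child)
        pop = nxt
    return max(pop, key=fitness), history

# Hypothetical fitness: reward three informative features, penalize mask size.
USEFUL = {0, 2, 5}

def toy_fitness(mask):
    return sum(mask[i] for i in USEFUL) - 0.1 * sum(mask)

best_mask, history = ga_feature_selection(8, toy_fitness)
```

In practice the fitness would be something like validation accuracy of a classifier trained on the selected features. Elitism keeps the best fitness from decreasing between generations, at the cost of somewhat less diversity in the population.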

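To better define concepts in an ontology, a method for learning ontology concept classification rules based on a genetic algorithm (GA) is proposed. Instances from an existing ontology base serve as training samples, and the algorithm searches for a set of rules consistent with the sample data. Each GA individual, which is the optimization target, is a rule set; the fitness function accounts for four aspects of the rule set (coverage, consistency, conciseness, and diversity), and optimization yields a set of rules that can classify concepts. This rule set can then guide and enrich ontology knowledge: for example, when a new instance is introduced into the ontology, the rule set determines the concept it belongs to. Experiments on learning from an existing ontology show that the algorithm converges well and obtains good rule sets.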

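For concept learning in ontology learning, a method for learning ontology concept rules based on an improved genetic algorithm is proposed. The method improves the crossover operator by introducing the idea of "hybrid vigor" into the genetic operators of the traditional genetic algorithm and strengthens the influence of the mutation operator on the algorithm. During execution the algorithm also applies a reduction strategy to the training set, finding a rule set that correctly covers all instances in the sample space and covers no incorrect instances.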

