1.
Topical Web crawling is an established technique for domain-specific information retrieval. However, almost all conventional topical Web crawlers are built on classifiers, which need a large amount of labeled training data that is very difficult to label manually. This paper presents a novel approach called clustering-based topical Web crawling, which retrieves information on a specific domain based on link context and does not require any labeled training data. To collect domain-specific content units, a novel bottom-up hierarchical clustering method is used, in which a new data structure, a linked list combined with a CFu-tree, stores the cluster label, feature vector and content unit. Four metrics are presented for the clustering. First, comparison variation (CV) is defined to judge whether the closest pair of clusters can be merged. Second, cluster impurity (CIP) evaluates the cluster error. Then, the precision and recall of clustering are presented to evaluate the accuracy and completeness of the whole clustering process. A link-context extraction technique is used to expand the feature vector of the anchor text, which greatly improves clustering accuracy. Experimental results show that the proposed method outperforms conventional focused Web crawlers in both harvest rate and target recall.
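The bottom-up merging loop described in this abstract can be illustrated with a minimal sketch. The abstract does not define how comparison variation (CV) is computed, so the sketch below substitutes a plain threshold on the Euclidean distance between cluster centroids; `bottom_up_cluster`, `centroid` and `merge_threshold` are hypothetical names, and the CFu-tree storage is not modeled.

```python
import math
from itertools import combinations


def centroid(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def bottom_up_cluster(vectors, merge_threshold):
    """Start with one cluster per vector and repeatedly merge the closest pair.

    Merging stops when the closest pair is farther apart than merge_threshold.
    The threshold test is an assumed stand-in for the paper's CV criterion.
    """
    clusters = [[v] for v in vectors]
    while len(clusters) > 1:
        # Find the pair of clusters whose centroids are closest.
        (i, j), dist = min(
            (
                ((a, b), math.dist(centroid(clusters[a]), centroid(clusters[b])))
                for a, b in combinations(range(len(clusters)), 2)
            ),
            key=lambda item: item[1],
        )
        if dist > merge_threshold:
            break
        # combinations() yields a < b, so deleting index j leaves index i valid.
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters


points = [[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]]
clusters = bottom_up_cluster(points, merge_threshold=2.0)
```

With these four points and a threshold of 2.0, the two nearby pairs are merged and the two resulting clusters stay separate because their centroids are far apart.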
2.
Clustering, one of the most important data mining tools in information retrieval, organizes data by unsupervised learning, meaning that it does not require any training data. However, some text clustering algorithms cannot update existing clusters incrementally and instead have to recompute a new clustering from scratch. To address this, this paper presents a novel bottom-up incremental conceptual hierarchical text clustering approach using a CFu-tree representation (ICHTC-CF), which starts with each item as a separate cluster. Term-based feature extraction is used to summarize each cluster during the process. The Comparison Variation criterion is adopted to judge whether the closest pair of clusters can be merged or a previous cluster can be split. In addition, the incremental clustering method is not sensitive to the order of the input data. Experimental results show that the method outperforms k-means, CLIQUE, single linkage clustering and complete linkage clustering, indicating that the new technique is efficient and feasible.
3.
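To address the sensitivity to initial values and the tendency to become trapped in local optima found in fuzzy clustering algorithms, a fuzzy C-means (FCM) algorithm incorporating an improved shuffled frog leaping algorithm (SFLA) is proposed for clustering Web search results. In the new algorithm, the SFLA optimization process replaces the gradient-descent-based iteration of FCM. The improved SFLA optimizes the initial solution through chaotic search, generates new individuals through a mutation operation, and uses a newly designed search strategy, which effectively improves its optimization ability. Experimental results show that the algorithm improves the search capability and clustering accuracy of fuzzy clustering and has an advantage in global optimization.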
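For context, the conventional FCM iteration that this abstract says is replaced can be sketched as follows. This is the textbook alternating update of memberships and centers, not the SFLA-based variant, whose details the abstract does not give; the function name `fcm` and its parameters are illustrative assumptions.

```python
import math


def fcm(points, centers, m=2.0, iterations=50):
    """Textbook fuzzy C-means updates starting from the given centers.

    Returns (centers, memberships); memberships[k][i] is the degree to which
    points[k] belongs to cluster i, as computed in the final iteration.
    """
    exponent = 2.0 / (m - 1.0)
    memberships = []
    for _ in range(iterations):
        memberships = []
        for x in points:
            dists = [math.dist(x, c) for c in centers]
            if any(d == 0.0 for d in dists):
                # The point coincides with a center: split membership among
                # the coinciding centers and give the others zero.
                zeros = sum(1 for d in dists if d == 0.0)
                row = [1.0 / zeros if d == 0.0 else 0.0 for d in dists]
            else:
                row = [
                    1.0 / sum((d_i / d_j) ** exponent for d_j in dists)
                    for d_i in dists
                ]
            memberships.append(row)
        new_centers = []
        for i, old_center in enumerate(centers):
            weights = [row[i] ** m for row in memberships]
            total = sum(weights)
            if total == 0.0:
                # No point has any membership here: keep the old center.
                new_centers.append(list(old_center))
                continue
            new_centers.append([
                sum(w * x[dim] for w, x in zip(weights, points)) / total
                for dim in range(len(old_center))
            ])
        centers = new_centers
    return centers, memberships


data = [[0.0], [0.2], [10.0], [10.2]]
final_centers, final_memberships = fcm(data, centers=[[1.0], [9.0]])
```

Because the result depends on the starting `centers`, this baseline shows the sensitivity to initial values that motivates swapping in a global search method.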
4.
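Quality-of-service (QoS) based Web service recommendation can find, among many functionally similar Web services, the one that best meets a user's non-functional requirements, but QoS value prediction algorithms still suffer from low prediction accuracy and data sparsity. To address these problems, a QoS prediction algorithm named ClustTD, based on location clustering and hierarchical tensor decomposition, is proposed. The algorithm clusters users and services into multiple local groups according to their location attributes, builds and decomposes tensors over the user, service and time contexts for the local groups and for the global data separately, and combines the QoS predictions from the local and global tensor decompositions by weighting, so that both local and global factors are considered in the final QoS prediction. Experimental results show that the algorithm achieves high QoS prediction accuracy and Web service recommendation quality and can alleviate the data sparsity problem to some extent.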
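The final weighted combination step can be sketched as below. The abstract does not state how the weights are chosen or whether they sum to one, so the sketch assumes a simple convex combination; `combine_predictions` and `local_weight` are hypothetical names, and the tensor decompositions that would produce the two inputs are not shown.

```python
def combine_predictions(local_pred, global_pred, local_weight):
    """Blend a local-group QoS prediction with a global one.

    Assumes a convex combination: local_weight in [0, 1] goes to the local
    prediction and the remainder to the global prediction.
    """
    if not 0.0 <= local_weight <= 1.0:
        raise ValueError("local_weight must lie in [0, 1]")
    return local_weight * local_pred + (1.0 - local_weight) * global_pred


# Example: response-time predictions in seconds (made-up illustrative values).
blended = combine_predictions(local_pred=0.8, global_pred=1.2, local_weight=0.25)
```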
5.
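With the rapid growth of Web services on the Internet, how clients discover the Web services they want has become a difficult and key problem in Web service technology. Because Web service descriptions in UDDI registries are very sparse, the traditional keyword-based service matching mechanism of UDDI lacks semantic support and searches inefficiently. To improve the precision and recall of Web service matching on the basis of the Web service information provided by UDDI registries, described in WSDL today and in semantic ontologies in the future, a new ontology similarity matching method is proposed, which improves Web service matching in both precision and recall.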
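The two evaluation measures named in this abstract follow their standard information-retrieval definitions, which can be sketched as below. The service identifiers are made up for illustration, and `precision_recall` is a hypothetical helper rather than anything from the paper.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a set of retrieved items.

    precision = |retrieved and relevant| / |retrieved|
    recall    = |retrieved and relevant| / |relevant|
    An empty denominator yields 0.0 by convention in this sketch.
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical service identifiers returned by a matcher vs. the true matches.
matched = ["svc_a", "svc_b", "svc_c", "svc_d"]
truly_relevant = ["svc_a", "svc_b", "svc_e"]
precision, recall = precision_recall(matched, truly_relevant)
```

Here two of the four returned services are relevant, and two of the three relevant services were returned, so a matcher can score differently on the two measures.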