Similar literature
1.
2.
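The HITS algorithm is a widely influential link analysis algorithm. Closer study shows, however, that it is prone to topic drift, and that much of this drift arises because pages are projected onto the wrong latent semantic basis. This paper proposes a weight-adjustment-based hyperlink topic distillation algorithm (weighted adjustments based hyperlinks topic distillation). While obtaining the root set, it first computes similarity using improved weights, which yields a more accurate personalized root set, and then applies HITS to compute the authority and hub values of Web pages. Experimental results show that the algorithm substantially alleviates the topic drift caused by HITS and is better suited to Web queries.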
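The authority and hub computation that this abstract builds on can be sketched as follows. This is a minimal illustration of the standard HITS iteration only, not the weighted variant proposed in the paper; the function name `hits`, the adjacency-dictionary input, and the toy graph are assumptions made for the example.

```python
import math


def hits(graph, iterations=50):
    """Standard HITS iteration (illustrative sketch, not the paper's variant).

    graph maps each page to the list of pages it links to.
    Returns (hubs, authorities), each L2-normalized.
    """
    nodes = set(graph)
    for targets in graph.values():
        nodes.update(targets)

    hubs = {n: 1.0 for n in nodes}
    auths = {n: 1.0 for n in nodes}

    for _ in range(iterations):
        # Authority update: sum the hub scores of pages that link in.
        new_auths = {n: 0.0 for n in nodes}
        for src, targets in graph.items():
            for dst in targets:
                new_auths[dst] += hubs[src]

        # Hub update: sum the (new) authority scores of linked pages.
        new_hubs = {n: 0.0 for n in nodes}
        for src, targets in graph.items():
            for dst in targets:
                new_hubs[src] += new_auths[dst]

        # Normalize so the scores stay bounded; guard against all-zero vectors.
        a_norm = math.sqrt(sum(v * v for v in new_auths.values())) or 1.0
        h_norm = math.sqrt(sum(v * v for v in new_hubs.values())) or 1.0
        auths = {n: v / a_norm for n, v in new_auths.items()}
        hubs = {n: v / h_norm for n, v in new_hubs.items()}

    return hubs, auths


# Hypothetical toy graph: pages "a" and "b" both link to "c", and "c" links to "d".
toy_graph = {"a": ["c"], "b": ["c"], "c": ["d"]}
hubs, auths = hits(toy_graph)
```

In this toy graph, "c" ends up with the highest authority score because two hubs point to it. Topic drift corresponds to the case where a densely linked but off-topic cluster dominates these scores.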

3.
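Analyzing the Hypertext-Induced Topic Search (HITS) algorithm from the perspective of semantic relevance shows that topic drift arises because pages are projected onto the wrong semantic basis; the concept of a Local Density Factor (LDF) is therefore introduced. To handle the overlapping nature of Web content, a new topic distillation algorithm based on the concept of a cutting plane (切平面), CPTDA, is proposed. CPTDA can discover not only the set of topic pages the user is most interested in but also other sets of pages relevant to the query. Experiments on 10 queries show that, compared with HITS, CPTDA reduces the topic drift rate by 30%-52% and can discover multiple topics relevant to the query.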

4.
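Presents a method and a practical system for constructing Web neighborhood graphs for topic distillation algorithms such as HITS. The system is based on a Web search engine (AltaVista) and uses an additional Visual C++ software module to construct a query-specific neighborhood graph, storing node and edge information in a database for use in hyperlink analysis. Experiments in a Web environment show that the construction method is feasible and the system is reliable.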
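A query-specific neighborhood graph of the kind described here is commonly built by expanding a root set of search results with the pages they link to and a bounded number of pages that link to them. The sketch below illustrates that general expansion with in-memory dictionaries. It is not the system from the abstract, which used AltaVista and a database; the names `build_base_set`, `neighborhood_edges`, and `max_in_links`, and the sample link data, are hypothetical.

```python
def build_base_set(root_set, out_links, in_links, max_in_links=50):
    """Expand a root set into a base set (illustrative sketch).

    out_links maps a page to the pages it links to.
    in_links maps a page to the pages that link to it.
    At most max_in_links in-linking pages are added per root page.
    """
    base = set(root_set)
    for page in root_set:
        base.update(out_links.get(page, []))
        base.update(in_links.get(page, [])[:max_in_links])
    return base


def neighborhood_edges(base, out_links):
    """Return the sorted links whose source and target both lie in the base set."""
    return sorted(
        (src, dst)
        for src in base
        for dst in out_links.get(src, [])
        if dst in base and src != dst
    )


# Hypothetical link data: "r1" and "r2" stand in for search-engine results.
out_links = {"r1": ["x", "r2"], "r2": ["y"], "z": ["r1"], "w": ["q"]}
in_links = {"r1": ["z"], "r2": ["r1"]}

base = build_base_set(["r1", "r2"], out_links, in_links)
edges = neighborhood_edges(base, out_links)
```

Capping the in-links per page keeps a single heavily cited page from flooding the graph. The resulting node and edge lists are what a link analysis step such as HITS would then consume.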

5.
We have built a database that provides term vector information for large numbers of pages (hundreds of millions). The basic operation of the database is to take URLs and return term vectors. Compared to computing vectors by downloading pages via HTTP, the Term Vector Database is several orders of magnitude faster, enabling a large class of applications that would be impractical without such a database. This paper describes the Term Vector Database in detail. It also reports on two applications built on top of the database. The first application is an optimization of connectivity-based topic distillation. The second application is a Web page classifier used to annotate results returned by a Web search engine.

6.
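Research on topic relevance ranking algorithms for Web pages   Total citations: 3 (self-citations: 0; citations by others: 3)
Analyzes the characteristics of topic distribution in Web pages, discusses classical page ranking algorithms, and proposes a relevance ranking algorithm based on content and hyperlink analysis combined with user click behavior. The algorithm takes into account the influence of hypertext markup, anchor text, and body text on relevance, and introduces a dynamic comparison matrix to compute the corresponding weight coefficients, so that it can objectively analyze the topic information a page contains and rank retrieval results more reasonably. Experiments show that the algorithm effectively improves precision, mitigates topic drift, and performs well.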

7.
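王景中 (Wang Jingzhong), 邱铜相 (Qiu Tongxiang). 《计算机应用》 (Journal of Computer Applications), 2015, 35(10): 2901-2904
To address the large amount of irrelevant data in Web retrieval results and the low accuracy of semantic retrieval with the traditional TF-IDF algorithm, the K-means algorithm, and the adaptive genetic algorithm, this paper studies an improvement of TF-IDF and its application to semantic retrieval. The improvement combines regular expressions with semantic analysis. A semantic library is used to describe the search topic, and regular atoms (正则原子) are weighted according to their semantic importance and their position within Web page tags, giving the similarity of each regular atom in a document. Using the vector space model, a cosine computation between document similarity and the topic model yields the final search results. Finally, the improved TF-IDF, traditional TF-IDF, K-means, and adaptive genetic algorithms were each applied in a focused topic crawler and their retrieval results compared. The results show that, in vertical search with semantic analysis in a focused topic crawler, the improved TF-IDF raises retrieval accuracy by 17.1 percentage points over traditional TF-IDF and lowers the miss rate by 7.76 percentage points; it raises retrieval accuracy by 6 percentage points over K-means and by 8.1 percentage points over the adaptive genetic algorithm. In short, the improved TF-IDF algorithm effectively improves the accuracy of document similarity detection and remedies shortcomings of focused topic crawlers in semantic analysis.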
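For orientation, the traditional baseline that this abstract compares against, TF-IDF weighting in a vector space model with cosine similarity, can be sketched as below. This is the textbook formulation only. The regular-expression and tag-position weighting of the improved algorithm is not reproduced, and the function names and sample token lists are assumptions for the example.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Build sparse TF-IDF vectors for a list of tokenized documents.

    Uses relative term frequency and idf = log(N / df); a term that
    appears in every document therefore gets weight 0.
    """
    n = len(docs)
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))  # count each term once per document

    vectors = []
    for tokens in docs:
        tf = Counter(tokens)
        vectors.append(
            {t: (c / len(tokens)) * math.log(n / df[t]) for t, c in tf.items()}
        )
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical, already-tokenized documents.
docs = [
    ["web", "crawler", "topic"],
    ["web", "crawler", "semantic"],
    ["genetic", "algorithm"],
]
vecs = tfidf_vectors(docs)
sim_related = cosine(vecs[0], vecs[1])
sim_unrelated = cosine(vecs[0], vecs[2])
```

The first two documents share terms and so score a positive similarity, while the third shares none and scores zero. A focused crawler would compare each fetched page against a topic vector in the same way.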

8.
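Research on and improvement of topic distillation algorithms in Web information retrieval   Total citations: 3 (self-citations: 0; citations by others: 3)
Search engines are currently the main tool for Web information retrieval, but their effectiveness is still unsatisfactory. The link-analysis iteration of topic distillation algorithms based on Web link structure often converges to a tightly knit community (TKC) in the link graph that is only loosely related to the query topic, causing topic drift. The authors' analysis of the classical topic distillation algorithm HITS shows that it also assigns unequal influence weights to different Web sites and cannot satisfy users' multi-granularity information needs. Building on an analysis of research on topic distillation algorithms, the paper proposes an improved algorithm, g-HITSc, to address these shortcomings; experiments show that it is reasonable and effective.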

9.
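Clustering the massive number of Web services on the network with traditional modeling methods gives unsatisfactory results because the data describing them is sparse. To address this, a Web service clustering method based on the BTM topic model is proposed. The method first uses BTM to learn the latent topics of the whole collection of Web service description documents, infers the topic distribution of each document, and then applies the K-Means algorithm to cluster the Web services. Comparison with methods such as LDA and TF-IDF shows that the method performs better on cluster purity, entropy, and F-Measure. Experiments show that it effectively resolves the data sparsity caused by the short-text nature of Web service descriptions and significantly improves clustering.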

10.
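Design and implementation of a JSP-based Web site access statistics system   Total citations: 7 (self-citations: 0; citations by others: 7)
Describes a Web site access statistics system designed and implemented with JSP technology.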

11.
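Research on improving the PageRank algorithm for Web structure mining   Total citations: 1 (self-citations: 0; citations by others: 1)
With the development of Internet technology, Web pages have become an effective way for people to obtain information, and Web data mining has become an active research topic. Given the shortcomings of the PageRank algorithm based on Web structure mining, an improved algorithm is proposed. Experimental results show that the improved algorithm performs better than the original and has practical value.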
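The abstract does not say how PageRank is modified, so the sketch below shows only the original algorithm that such improvements start from: a power iteration with a damping factor. The function name, the default damping value of 0.85, the uniform handling of dangling pages, and the toy graph are assumptions chosen for illustration.

```python
def pagerank(graph, damping=0.85, iterations=100):
    """Basic PageRank by power iteration (illustrative sketch).

    graph maps each page to the list of pages it links to.
    Rank held by pages without out-links is spread uniformly.
    """
    nodes = set(graph)
    for targets in graph.values():
        nodes.update(targets)
    n = len(nodes)

    rank = {p: 1.0 / n for p in nodes}
    for _ in range(iterations):
        # Total rank currently sitting on pages with no out-links.
        dangling = sum(rank[p] for p in nodes if not graph.get(p))

        # Every page receives the teleport share plus its part of the dangling mass.
        new_rank = {p: (1.0 - damping) / n + damping * dangling / n for p in nodes}

        # Each page passes its damped rank evenly along its out-links.
        for src, targets in graph.items():
            if targets:
                share = damping * rank[src] / len(targets)
                for dst in targets:
                    new_rank[dst] += share
        rank = new_rank
    return rank


# Hypothetical three-page graph.
toy_web = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
ranks = pagerank(toy_web)
```

The scores form a probability distribution over pages. In this toy graph, "c" outranks "b" because it receives rank from both "a" and "b". Because the score depends on link structure alone, a page can rank highly regardless of its relevance to a query.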

12.
This paper introduces a framework for trend modeling and detection on the Web that uses Opinion Mining and Topic Modeling tools based on the fusion of freely available information. The framework consists of a four-step model that runs periodically: crawl a set of predefined sources of documents; search for potential sources and extract topics from the retrieved documents; retrieve opinionated documents from social networks for each detected topic and extract sentiment information from them. The proposed framework was applied to a set of 20 sources of documents over a period of 8 months. After the analysis period, once the proposed experiments had been run, an F-Measure of 0.56 was obtained for the detection of significant events, implying that the proposed framework is a feasible model of how trends could be represented through the analysis of documents freely available on the Web.

13.
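Web usage mining has been an active research topic in Web data mining in recent years. Traditional genetic algorithms for extracting association rules usually use fixed chromosome crossover and mutation probabilities, which makes them prone to premature convergence and slow convergence. An improved genetic algorithm is proposed, and a user page-interest threshold is added to association rule extraction; the approach was successfully applied to mining the server logs of a commercial Web site. Experiments show that the improved genetic algorithm effectively avoids premature convergence.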

14.
The article considers methods of intelligent data analysis (data mining) used in the analysis of Web traffic. It also considers the application of cluster analysis and of a newly developed model for the study of Web user activity.

15.
A Web information visualization method based on document set-wise processing is proposed to find topic streams in a sequence of document sets. Although the size and dynamic nature of the Web are a burden for users, they also offer opportunities for business and research if users can notice trends or movements of the real world from the Web. This paper focuses on sequences of document sets found on the Web, such as sets of online news articles. The proposed method employs the immune network model, in which the property of the memory cell is used to find topical relations among document sets. After several types of memory cell models are proposed and evaluated, the experimental results show that the proposed method with memory cells can find more topic streams than the method without them.

Yasufumi Takama, D.Eng.: He received his B.S., M.S. and Dr.Eng. degrees from the University of Tokyo in 1994, 1996, and 1999, respectively. From 1999 to 2002 he was with Tokyo Institute of Technology, Japan. Since 2002, he has been Associate Professor of the Department of Electronic Systems and Engineering, Tokyo Metropolitan Institute of Technology, Tokyo, Japan. He has also been participating in JST (Japan Science and Technology Corporation) since October 2000. His current research interests include artificial intelligence, Web information retrieval and visualization systems, and artificial immune systems. He is a member of JSAI (Japanese Society of Artificial Intelligence), IPSJ (Information Processing Society of Japan), and SOFT (Japan Society for Fuzzy Theory and Systems).

Kaoru Hirota, D.Eng.: He received his B.E., M.E. and Dr.Eng. degrees in electronics from Tokyo Institute of Technology, Tokyo, Japan, in 1974, 1976, and 1979, respectively. From 1979 to 1982 and from 1982 to 1995 he was with the Sagami Institute of Technology and Hosei University, respectively. Since 1995, he has been with the Interdisciplinary Graduate School of Science and Technology, Tokyo Institute of Technology, Yokohama, Japan. He is now professor and department head of the Department of Computational Intelligence and Systems Science. Dr. Hirota is a member of IFSA (International Fuzzy Systems Association; Vice President 1991–1993, Treasurer 1997–2001), IEEE (Associate Editor of IEEE Transactions on Fuzzy Systems (1993–1995) and IEEE Transactions on Industrial Electronics (1996–2000)) and SOFT (Japan Society for Fuzzy Theory and Systems; Vice President 1995–1997, President 2001–2003), and he is an editor in chief of Int. J. of Advanced Computational Intelligence.

16.
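Web mining and its applications   Total citations: 7 (self-citations: 0; citations by others: 7)
Web mining uses data mining techniques to extract interesting, potentially useful patterns and hidden information from Web documents and Web activity. This paper describes in detail the characteristics of the Web and the classification and applications of Web mining.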

17.
With the explosive growth of web video on sharing sites like YouTube, automatic topic discovery and visualization have become increasingly important in helping to organize and navigate such large-scale video collections. Previous work dealt with the topic discovery and visualization problems separately, and did not take full account of the distinctive multi-modality and sparsity of web video features. This paper tries to solve the web video topic discovery problem together with visualization under a single framework, and proposes a Star-structured K-partite Graph based co-clustering and ranking framework, which consists of three stages: (1) represent the web videos and their multi-modal features (e.g., keyword, near-duplicate keyframe, near-duplicate aural frame, etc.) as a Star-structured K-partite Graph; (2) group videos and their features simultaneously into clusters (topics) and organize the generated clusters as a linked cluster network; (3) rank each type of node in the linked cluster network by "popularity" and visualize them in a novel interface that lets users interactively browse topics at multiple scales. Experiments on a YouTube benchmark dataset demonstrate the flexibility and effectiveness of the proposed framework.

18.
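A Web link topic distillation algorithm based on query expansion   Total citations: 1 (self-citations: 0; citations by others: 1)
The HITS (Hypertext-Induced Topic Search) algorithm is widely used for analyzing Web link structure, but it is prone to topic drift. Analysis from the perspective of semantic relevance shows that the drift arises because pages are projected onto the wrong latent semantic basis. A hyperlink topic distillation algorithm based on query expansion is proposed: it expands the query terms using user query logs, constructs a personalized root set and base set that match the user's needs, and then uses HITS to compute the authority and hub values of Web pages. Experimental results show that the algorithm substantially alleviates the topic drift caused by HITS and is better suited to Web queries.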

19.
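An improved PageRank algorithm based on Web page link and content analysis   Total citations: 9 (self-citations: 0; citations by others: 9)
Combining Web page link analysis with content relevance analysis, an improved PageRank algorithm, EPR (Extended PageRank), is proposed. It addresses the relevance requirement through analysis of page content similarity and the authority requirement through link analysis. The algorithm leaves ample room for extending PageRank, and experiments show that with suitable parameters EPR produces better rankings than traditional PageRank.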

20.
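An ontology model for service discovery based on matching algorithms*   Total citations: 1 (self-citations: 1; citations by others: 0)
To address the matching problem in service discovery, an ontology model for service discovery based on matching algorithms is proposed. Built on ontology technology, the study analyzes the main elements of a service discovery model, defines the proposition of optimal matching planning between the user ontology and the service ontology, and constructs a Web service runtime framework that satisfies this proposition. For the matching planning and matching patterns in the framework, the MS and MP algorithms are designed and implemented to obtain the candidate matching set and the relevance of the matching plan. Compared with existing service discovery methods, the proposed model achieves higher recall and precision and retrieves more Web services that closely match the user's request, giving it good theoretical value and application prospects.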
