
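An improved distributed search engine model
Cite this article: QIAN Libing, JI Zhenzhou, WU Hao. An improved distributed search engine model [J]. Journal of Harbin Institute of Technology, 2014, 46(7): 8-13.
Authors: QIAN Libing, JI Zhenzhou, WU Hao
Affiliation (all authors): School of Computer Science and Technology, Harbin Institute of Technology, 150001 Harbin
Funding: National Natural Science Foundation of China (61173024); Guangdong Province and Ministry Industry-University-Research Cooperation Fund (2011A090200037).
Abstract: To address the search performance problems of traditional distributed search engines, the traditional model is improved in its index structure and query algorithm. A decentralized, highly parallel search model is proposed: the index is classified by document topic, longer inverted posting lists are stored as bitmaps, and a parallel search algorithm (multi max score heap, MMSH) is implemented on the index nodes using multi-threading. Experimental results show that the index classification method and the bitmap strategy for the inverted list structure make Merge-layer queries more targeted and reduce the CPU and memory overhead of Merge-layer nodes. When the inverted lists cannot be held entirely in memory, the MMSH algorithm achieves highly parallel querying; its query efficiency is higher than that of the classical term-at-a-time algorithm, which shortens the average search time and improves system throughput. Index classification, the bitmap structure, and the parallel query algorithm avoid blind querying and improve the performance of distributed search engines.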
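The abstract states that longer inverted posting lists use a bitmap structure but gives no details of the encoding. The following is a minimal sketch of one way such a hybrid representation could look, not the authors' implementation. The cutoff `BITMAP_THRESHOLD`, the use of a Python integer as the bitmap, and the function names `encode` and `intersect` are all assumptions made for illustration.

```python
from typing import List, Union

# Hypothetical cutoff: lists longer than this are stored as bitmaps.
BITMAP_THRESHOLD = 4

# A posting list is either a bitmap (Python int, bit d set means doc d
# contains the term) or a sorted list of document IDs.
Posting = Union[int, List[int]]


def encode(doc_ids: List[int], threshold: int = BITMAP_THRESHOLD) -> Posting:
    """Store long posting lists as a bitmap, short ones as a sorted list."""
    if len(doc_ids) > threshold:
        bitmap = 0
        for d in doc_ids:
            bitmap |= 1 << d
        return bitmap
    return sorted(doc_ids)


def intersect(a: Posting, b: Posting) -> List[int]:
    """Return the sorted document IDs present in both posting lists."""
    if isinstance(a, list) and isinstance(b, list):
        other = set(b)
        return [d for d in a if d in other]
    if isinstance(a, list):
        # a is short, b is a bitmap: probe one bit per candidate.
        return [d for d in a if (b >> d) & 1]
    if isinstance(b, list):
        return [d for d in b if (a >> d) & 1]
    # Both are bitmaps: a single bitwise AND, then decode the set bits.
    both = a & b
    return [d for d in range(both.bit_length()) if (both >> d) & 1]
```

In this sketch, intersecting two long lists reduces to one bitwise AND, and intersecting a short list with a long one costs one bit probe per short-list entry, which is the kind of saving a bitmap layout is usually chosen for.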

Keywords: distributed engine; index classification; inverted structure; parallel search
Received: 2013-09-25

An improved model of distributed search engine
QIAN Libing, JI Zhenzhou, WU Hao. An improved model of distributed search engine [J]. Journal of Harbin Institute of Technology, 2014, 46(7): 8-13.
Authors: QIAN Libing, JI Zhenzhou, WU Hao
Affiliation (all authors): School of Computer Science and Technology, Harbin Institute of Technology, Harbin 150001, China
Abstract: To address the search performance problems of traditional distributed search engines, a decentralized, highly parallel search model is proposed that improves on the traditional model in its index structure and search algorithm. In the model, the index is classified by document topic, a bitmap structure is used for longer inverted posting lists, and a parallel search algorithm (multi max score heap, MMSH) is implemented on the index nodes using multi-threading. Experimental results show that index classification and the bitmap strategy for the inverted list structure make searches in the Merge layer more targeted and reduce the CPU and memory cost of Merge-layer nodes. When the inverted lists cannot be stored entirely in memory, the MMSH algorithm achieves highly parallel search; its query efficiency is higher than that of the classical term-at-a-time algorithm, which shortens the average search time and improves system throughput. Index classification, the bitmap structure, and the parallel query algorithm avoid blind querying and improve the performance of distributed search engines.
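The abstract does not describe how MMSH works internally, so the following is only a rough sketch of the general pattern it names: route a query to the index partitions for one topic, search those partitions in parallel threads, keep the best `k` hits per partition with a heap, and merge the partial results. It is not the published MMSH algorithm. The data layout (`Shard` as term to per-document score), the additive scoring, the function names, and the assumption that document IDs are unique across partitions are all hypothetical.

```python
import heapq
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, List, Tuple

# Hypothetical shard layout: term -> {doc_id: precomputed term score}.
Shard = Dict[str, Dict[int, float]]


def search_shard(shard: Shard, terms: List[str], k: int) -> List[Tuple[float, int]]:
    """Score documents in one shard and return its top-k (score, doc_id) pairs."""
    scores: Dict[int, float] = {}
    for term in terms:
        for doc, s in shard.get(term, {}).items():
            scores[doc] = scores.get(doc, 0.0) + s
    # heapq.nlargest keeps a bounded heap of size k internally.
    return heapq.nlargest(k, ((s, d) for d, s in scores.items()))


def parallel_search(
    shards_by_topic: Dict[str, List[Shard]],
    topic: str,
    terms: List[str],
    k: int,
) -> List[Tuple[int, float]]:
    """Query only the shards of one topic, one thread per shard, then merge."""
    shards = shards_by_topic.get(topic, [])
    if not shards:
        return []
    with ThreadPoolExecutor(max_workers=len(shards)) as pool:
        partials = list(pool.map(lambda sh: search_shard(sh, terms, k), shards))
    # Merge step: global top-k over the per-shard top-k lists.
    merged = heapq.nlargest(k, (hit for part in partials for hit in part))
    return [(doc, score) for score, doc in merged]
```

Restricting the query to one topic's shards mirrors the idea of avoiding blind querying. Note that in CPython, threads mainly help when shard access is I/O-bound, such as posting lists read from disk; CPU-bound scoring would need processes or a different runtime to run truly in parallel.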
Keywords: distributed indexing; index classification; inverted structure; parallel search