首页 | 本学科首页   官方微博 | 高级检索  
     

编码单位可变的倒排索引压缩算法研究
引用本文:安兆翔,瞿有利.编码单位可变的倒排索引压缩算法研究[J].计算机工程与应用,2019,55(15):82-88.
作者姓名:安兆翔  瞿有利
作者单位:北京交通大学 计算机与信息技术学院,北京,100044;北京交通大学 计算机与信息技术学院,北京,100044
基金项目:中央高校基本科研业务费专项资金
摘    要:倒排索引是大多数大型文本搜索系统的核心数据结构,索引压缩可以有效地减少倒排索引的空间占用,提升检索效率。针对倒排索引压缩算法中的字节对齐编码进行研究,对于其压缩率不够优秀的问题,提出了分区可变单位编码(PVU编码)。算法以可变单位方式代替固定字节存储,使实际存储空间更加贴合原码长度,从而提高压缩效果。针对序列均匀分区并非最优分区的问题,提出将最优分区问题转化为图论中最短路径问题的方法,使用Dijkstra算法求解序列的最优编码分区。通过对比实验验证了改进优化的PVU编码相较于传统的字节对齐编码能够更好地压缩倒排索引序列。

关 键 词:倒排索引  索引压缩  可变单位  分区优化

Research on Inverted Index Compression Algorithm with Variable Coding Units
AN Zhaoxiang,QU Youli.Research on Inverted Index Compression Algorithm with Variable Coding Units[J].Computer Engineering and Applications,2019,55(15):82-88.
Authors:AN Zhaoxiang  QU Youli
Affiliation:School of Computer and Information Technology, Beijing Jiaotong University, Beijing 100044, China
Abstract:The inverted index is the data structure at the core of most large-scale search systems. Index compression can effectively reduce the space occupied by inverted index and improve retrieval efficiency. To solve the problem of poor compression of byte-aligned coding, Partitioned Variable Unit(PVU) coding is proposed. Instead of storing multiple bytes, PVU stores integers in multiple units. Units include multiple values. This method makes the storage space more suitable for the original code length and has a better compression effect. To solve the optimal partition problem, the partition problem is transformed into the shortest path problem in graph theory. Dijkstra’s algorithm is designed to solve this problem. Experiments show that the optimized PVU can compress the inverted index sequence better than the traditional byte-aligned coding.
Keywords:inverted index  index compression  variable unit  partition optimization  
本文献已被 维普 万方数据 等数据库收录!
点击此处可从《计算机工程与应用》浏览原始摘要信息
点击此处可从《计算机工程与应用》下载全文
设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号