首页 | 本学科首页   官方微博 | 高级检索  
     

基于键值存储的分布式时序相似性搜索方法
引用本文:俞自生,李瑞远,郭阳,蒋忠元,鲍捷,郑宇.基于键值存储的分布式时序相似性搜索方法[J].软件学报,2022,33(3):950-967.
作者姓名:俞自生  李瑞远  郭阳  蒋忠元  鲍捷  郑宇
作者单位:西安电子科技大学 网络与信息安全学院, 陕西 西安 710126;北京京东智能城市大数据研究院, 北京 100176;重庆大学 计算机学院, 重庆 400044;北京京东智能城市大数据研究院, 北京 100176;北京航天航空大学 计算机学院, 北京 100191
基金项目:国家重点研发计划(2019YFB2103201);国家自然科学基金(61976168,62076191,61502375)
摘    要:时序相似性搜索是时序数据分析最基本的操作之一,具有广泛的应用场景.针对现有分布式算法无法应对维度增长、扫描范围过大和相似性计算耗时的问题,提出一种面向键值存储的分布式时序相似性搜索方法KV-Search.首先对时序数据分块,并设计其键值存入键值数据库,解决了时序数据维度高且不断增长的问题;其次,基于切比雪夫距离计算其下界,并利用键值范围扫描提前过滤无效数据,减少了数据传输;最后,利用基于分块的时序表示计算距离下界,避免了更高维度真实数据的计算,加快了查询效率.使用HBase实现了KV-Search,并利用真实的大规模数据集做了大量实验.实验结果表明, KV-Search算法在效率和扩展性方面均优于基准实验.

关 键 词:时间序列  相似性搜索  键值存储  剪枝过滤  分布式查询
收稿时间:2021/6/30 0:00:00
修稿时间:2021/7/31 0:00:00

Distributed Time Series Similarity Search Method Based on Key-value Data Stores
YU Zi-Sheng,LI Rui-Yuan,GUO Yang,JIANG Zhong-Yuan,BAO Jie,ZHENG Yu.Distributed Time Series Similarity Search Method Based on Key-value Data Stores[J].Journal of Software,2022,33(3):950-967.
Authors:YU Zi-Sheng  LI Rui-Yuan  GUO Yang  JIANG Zhong-Yuan  BAO Jie  ZHENG Yu
Affiliation:School of Cyber Engineering, Xidian University, Xi''an Shaanxi 710071, China;JD Intelligent Cities Research, Beijing 100176, China;College of Computer Science, Chongqing University, Chongqing 400044, China;JD Intelligent Cities Research, Beijing 100176, China;School of Computer Science and Engineering, Beihang University, Beijing 100191, China
Abstract:Time series similarity search is one of the most basic operations for temporal data analysis, which has various application scenarios. Existing distributed methods face the problems of dimension explosion, too large scan range, and time-consuming similarity calculation. To this end, this paper proposes a novel distributed time series similarity search algorithm KV-Search. First, time series are segmented into blocks and stored in the key-value database, which solves the problem of high and growing dimension. Second, the lower bound is calculated based on Chebyshev distance, and the invalid data is filtered out in advance using key value range scanning, which reduce the data transmission and calculation overhead. Third, a block-based time series representation is used to calculate the lower bound of distance, which avoids the calculation of higher dimensional real data. We implement KV-Search based on HBase, and conduct a set of extensive experiments using both real and synthetic time series data. The results show that our KV-Search is superior to benchmark experiment in efficiency and scalability.
Keywords:time series  similarity search  key-value storage  pruning filtration  distributed query
点击此处可从《软件学报》浏览原始摘要信息
点击此处可从《软件学报》下载全文
设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号