基于键值存储的分布式时序相似性搜索方法 Distributed Time Series Similarity Search Method Based on Key-value Data Stores期刊界 All Journals 搜尽天下杂志传播学术成果专业期刊搜索期刊信息化学术搜索

基于键值存储的分布式时序相似性搜索方法

引用本文：	俞自生,李瑞远,郭阳,蒋忠元,鲍捷,郑宇.基于键值存储的分布式时序相似性搜索方法[J].软件学报,2022,33(3):950-967.

作者姓名：	俞自生李瑞远郭阳蒋忠元鲍捷郑宇

作者单位：	西安电子科技大学网络与信息安全学院, 陕西西安 710126;北京京东智能城市大数据研究院, 北京 100176;重庆大学计算机学院, 重庆 400044;北京京东智能城市大数据研究院, 北京 100176;北京航天航空大学计算机学院, 北京 100191

基金项目：	国家重点研发计划（2019YFB2103201）；国家自然科学基金（61976168，62076191，61502375）

摘要：	时序相似性搜索是时序数据分析最基本的操作之一,具有广泛的应用场景.针对现有分布式算法无法应对维度增长、扫描范围过大和相似性计算耗时的问题,提出一种面向键值存储的分布式时序相似性搜索方法KV-Search.首先对时序数据分块,并设计其键值存入键值数据库,解决了时序数据维度高且不断增长的问题;其次,基于切比雪夫距离计算其下界,并利用键值范围扫描提前过滤无效数据,减少了数据传输;最后,利用基于分块的时序表示计算距离下界,避免了更高维度真实数据的计算,加快了查询效率.使用HBase实现了KV-Search,并利用真实的大规模数据集做了大量实验.实验结果表明, KV-Search算法在效率和扩展性方面均优于基准实验.
关键词：	时间序列相似性搜索键值存储剪枝过滤分布式查询
收稿时间：	2021/6/30 0:00:00
修稿时间：	2021/7/31 0:00:00
Distributed Time Series Similarity Search Method Based on Key-value Data Stores

YU Zi-Sheng,LI Rui-Yuan,GUO Yang,JIANG Zhong-Yuan,BAO Jie,ZHENG Yu.Distributed Time Series Similarity Search Method Based on Key-value Data Stores[J].Journal of Software,2022,33(3):950-967.

Authors:	YU Zi-Sheng LI Rui-Yuan GUO Yang JIANG Zhong-Yuan BAO Jie ZHENG Yu

Affiliation:	School of Cyber Engineering, Xidian University, Xi''an Shaanxi 710071, China;JD Intelligent Cities Research, Beijing 100176, China;College of Computer Science, Chongqing University, Chongqing 400044, China;JD Intelligent Cities Research, Beijing 100176, China;School of Computer Science and Engineering, Beihang University, Beijing 100191, China

Abstract:	Time series similarity search is one of the most basic operations for temporal data analysis, which has various application scenarios. Existing distributed methods face the problems of dimension explosion, too large scan range, and time-consuming similarity calculation. To this end, this paper proposes a novel distributed time series similarity search algorithm KV-Search. First, time series are segmented into blocks and stored in the key-value database, which solves the problem of high and growing dimension. Second, the lower bound is calculated based on Chebyshev distance, and the invalid data is filtered out in advance using key value range scanning, which reduce the data transmission and calculation overhead. Third, a block-based time series representation is used to calculate the lower bound of distance, which avoids the calculation of higher dimensional real data. We implement KV-Search based on HBase, and conduct a set of extensive experiments using both real and synthetic time series data. The results show that our KV-Search is superior to benchmark experiment in efficiency and scalability.

Keywords:	time series similarity search key-value storage pruning filtration distributed query

	点击此处可从《软件学报》浏览原始摘要信息
	点击此处可从《软件学报》下载全文

设为首页 | 免责声明 | 关于勤云 | 加入收藏