首页 | 本学科首页   官方微博 | 高级检索  
     


Tracking frequent items over distributed probabilistic data
Authors:Yongxin?Tong,Xiaofei?Zhang,Lei?Chen  author-information"  >  author-information__contact u-icon-before"  >  mailto:leichen@cse.ust.hk"   title="  leichen@cse.ust.hk"   itemprop="  email"   data-track="  click"   data-track-action="  Email author"   data-track-label="  "  >Email author
Affiliation:1.State Key Laboratory of Software Development Environment, School of Computer Science and Engineering,Beihang University,Beijing,China;2.Department of Computer Science and Engineering,Hong Kong University of Science Technology,Hong Kong,China
Abstract:Tracking frequent items (also called heavy hitters) is one of the most fundamental queries in real-time data due to its wide applications, such as logistics monitoring, association rule based analysis, etc. Recently, with the growing popularity of Internet of Things (IoT) and pervasive computing, a large amount of real-time data is usually collected from multiple sources in a distributed environment. Unfortunately, data collected from each source is often uncertain due to various factors: imprecise reading, data integration from multiple sources (or versions), transmission errors, etc. In addition, due to network delay and limited by the economic budget associated with large-scale data communication over a distributed network, an essential problem is to track the global frequent items from all distributed uncertain data sites with the minimum communication cost. In this paper, we focus on the problem of tracking distributed probabilistic frequent items (TDPF). Specifically, given k distributed sites S = {S 1, … , S k }, each of which is associated with an uncertain database (mathcal {D}_{i}) of size n i , a centralized server (or called a coordinator) H, a minimum support ratio r, and a probabilistic threshold t, we are required to find a set of items with minimum communication cost, each item X of which satisfies P r(s u p(X) ≥ r × N) > t, where s u p(X) is a random variable to describe the support of X and (N={sum }_{i=1}^{k}n_{i}). In order to reduce the communication cost, we propose a local threshold-based deterministic algorithm and a sketch-based sampling approximate algorithm, respectively. The effectiveness and efficiency of the proposed algorithms are verified with extensive experiments on both real and synthetic uncertain datasets.
Keywords:
本文献已被 SpringerLink 等数据库收录!
设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号