Similar Documents
20 similar documents were retrieved.
1.
2.
With the rapid development of the Internet, and in particular the recent emergence of technologies such as cloud computing and the Internet of Things and the wide adoption of services such as social networking, the amount of data produced by human society is growing rapidly and the era of big data has arrived. How to acquire and analyze big data has become a widely discussed problem, but the security issues it brings must also be taken very seriously. Starting from the concept and characteristics of big data, this paper describes the security challenges facing big data and proposes strategies for responding to them.

3.
Time series analysis has always been an important and interesting research field due to its frequent appearance in different applications. In the past, many approaches based on regression, neural networks and other mathematical models were proposed to analyze time series. In this paper, we attempt to use data mining techniques to analyze time series. Many previous studies on data mining have focused on handling binary-valued data. Time series data, however, usually take quantitative values. We thus extend our previous fuzzy mining approach to handle time-series data and find linguistic association rules. The proposed approach first uses a sliding window to generate contiguous subsequences from a given time series and then analyzes the fuzzy itemsets from these subsequences. Appropriate post-processing is then performed to remove redundant patterns. Experiments are also made to show the performance of the proposed mining algorithm. Since the final results are represented by linguistic rules, they are friendlier to humans than a quantitative representation.
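A minimal sketch of the sliding-window step described in this abstract: a numeric series is cut into contiguous subsequences and each value is mapped to a linguistic label. The triangular membership functions and label names are illustrative assumptions; the fuzzy-itemset mining and post-processing stages of the paper are not reproduced here.

```python
# Sketch: turn a numeric time series into linguistic subsequences via a
# sliding window. Membership functions below are assumptions for illustration.

def sliding_windows(series, width):
    """Yield every contiguous subsequence of the given width."""
    for start in range(len(series) - width + 1):
        yield series[start:start + width]

def fuzzify(value, lo, hi):
    """Map a value to (label, membership degree) using three triangular sets."""
    mid = (lo + hi) / 2.0
    half = (hi - lo) / 2.0 or 1.0
    degrees = {
        "low": max(0.0, (mid - value) / half),
        "high": max(0.0, (value - mid) / half),
        "medium": max(0.0, 1.0 - abs(value - mid) / half),
    }
    label = max(degrees, key=degrees.get)
    return label, round(degrees[label], 3)

series = [12, 15, 14, 20, 25, 24, 30, 28]
lo, hi = min(series), max(series)
for window in sliding_windows(series, width=4):
    print([fuzzify(v, lo, hi) for v in window])
```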

4.
Compression-based data mining of sequential data (cited 3 times in total: 1 self-citation, 2 by others)
The vast majority of data mining algorithms require the setting of many input parameters. The dangers of working with parameter-laden algorithms are twofold. First, incorrect settings may cause an algorithm to fail in finding the true patterns. Second, a perhaps more insidious problem is that the algorithm may report spurious patterns that do not really exist, or greatly overestimate the significance of the reported patterns. This is especially likely when the user fails to understand the role of parameters in the data mining process. Data mining algorithms should have as few parameters as possible. A parameter-light algorithm would limit our ability to impose our prejudices, expectations, and presumptions on the problem at hand, and would let the data itself speak to us. In this work, we show that recent results in bioinformatics, learning, and computational theory hold great promise for a parameter-light data-mining paradigm. The results are strongly connected to Kolmogorov complexity theory. However, as a practical matter, they can be implemented using any off-the-shelf compression algorithm with the addition of just a dozen lines of code. We will show that this approach is competitive or superior to many of the state-of-the-art approaches in anomaly/interestingness detection, classification, and clustering with empirical tests on time series/DNA/text/XML/video datasets. As further evidence of the advantages of our method, we will demonstrate its effectiveness in solving a real-world classification problem in recommending printing services and products. Responsible editor: Johannes Gehrke.
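The "dozen lines of code" claim can be illustrated with a compression-based dissimilarity built on an off-the-shelf compressor. The sketch below uses zlib and a ratio of the form C(xy)/(C(x)+C(y)), which is one common formulation of such a measure and may differ in detail from the authors' exact definition.

```python
import zlib

def c(data: bytes) -> int:
    """Compressed size of a byte string using an off-the-shelf compressor."""
    return len(zlib.compress(data, 9))

def cdm(x: bytes, y: bytes) -> float:
    """Compression-based dissimilarity: near 0.5 for very similar inputs,
    closer to 1.0 for unrelated ones."""
    return c(x + y) / (c(x) + c(y))

a = b"ACGTACGTACGTACGT" * 50
b = b"ACGTACGTACGTACGT" * 50
r = b"TTGACCGTAGGCATTA" * 50
print(cdm(a, b))   # small: the two sequences share structure
print(cdm(a, r))   # larger: little shared structure
```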

5.
The optimization capabilities of RDBMSs make them attractive for executing data transformations. However, despite the fact that many useful data transformations can be expressed as relational queries, an important class of data transformations that produce several output tuples for a single input tuple cannot be expressed in that way.

To overcome this limitation, we propose to extend Relational Algebra with a new operator named data mapper. In this paper, we formalize the data mapper operator and investigate some of its properties. We then propose a set of algebraic rewriting rules that enable the logical optimization of expressions with mappers and prove their correctness. Finally, we experimentally study the proposed optimizations and identify the key factors that influence the optimization gains.
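The one-input-to-many-outputs behaviour of the mapper can be pictured as a flatMap-style operator over a relation. The sketch below is a generic illustration under assumed attribute names, not the formal operator or the rewriting rules of the paper.

```python
# Sketch of a "data mapper": each input tuple may produce several output
# tuples, something a plain relational projection or selection cannot express.

def mapper(relation, f):
    """Apply f to each input tuple; f returns zero or more output tuples."""
    return [out for row in relation for out in f(row)]

orders = [
    {"order_id": 1, "items": "apple;pear"},
    {"order_id": 2, "items": "plum"},
]

def split_items(row):
    for item in row["items"].split(";"):
        yield {"order_id": row["order_id"], "item": item}

print(mapper(orders, split_items))
# [{'order_id': 1, 'item': 'apple'}, {'order_id': 1, 'item': 'pear'},
#  {'order_id': 2, 'item': 'plum'}]
```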


6.
As the amount of multimedia data increases day by day thanks to cheaper storage devices and a growing number of information sources, machine learning algorithms are faced with large-sized datasets. When the original data is huge, small sample sizes are preferred for various applications; this is typically the case for multimedia applications. But using a simple random sample may not obtain satisfactory results, because such a sample may not adequately represent the entire data set due to random fluctuations in the sampling process. The difficulty is particularly apparent when small sample sizes are needed. Fortunately, the use of a good sampling set for training can improve the final results significantly. In KDD’03 we proposed EASE, which outputs a sample based on its ‘closeness’ to the original sample. Reported results show that EASE outperforms simple random sampling (SRS). In this paper we propose EASIER, which extends EASE in two ways. (1) EASE is a halving algorithm: to achieve the required sample ratio it starts from a suitable initial large sample and iteratively halves it. EASIER, on the other hand, does away with the repeated halving by directly obtaining the required sample ratio in one iteration. (2) EASE was shown to work on the IBM QUEST dataset, which is a categorical count data set. EASIER, in addition, is shown to work on continuous image and audio features. We have successfully applied EASIER to image classification and audio event identification applications. Experimental results show that EASIER outperforms SRS significantly.

Surong Wang received the B.E. and M.E. degrees from the School of Information Engineering, University of Science and Technology Beijing, China, in 1999 and 2002 respectively. She is currently studying toward the Ph.D. degree at the School of Computer Engineering, Nanyang Technological University, Singapore. Her research interests include multimedia data processing, image processing and content-based image retrieval.

Manoranjan Dash obtained Ph.D. and M.Sc. (Computer Science) degrees from the School of Computing, National University of Singapore. He has worked in academic and research institutes extensively and has published more than 30 research papers (mostly refereed) in various reputable machine learning and data mining journals, conference proceedings, and books. His research interests include machine learning and data mining, and their applications in bioinformatics, image processing, and GPU programming. Before joining the School of Computer Engineering (SCE), Nanyang Technological University, Singapore, as Assistant Professor, he worked as a postdoctoral fellow at Northwestern University. He is a member of IEEE and ACM. He has served as a program committee member of many conferences and is on the editorial board of the International Journal of Theoretical and Applied Computer Science.

Liang-Tien Chia received the B.S. and Ph.D. degrees from Loughborough University, in 1990 and 1994, respectively. He is an Associate Professor in the School of Computer Engineering, Nanyang Technological University, Singapore. He has recently been appointed Head, Division of Computer Communications, and he also holds the position of Director, Centre for Multimedia and Network Technology. His research interests include image/video processing and coding, multimodal data fusion, multimedia adaptation/transmission and multimedia over the Semantic Web. He has published over 80 research papers.
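The "closeness" idea behind EASE/EASIER, picking a small sample whose distribution resembles the full data, can be illustrated with a simple histogram-distance criterion. The sketch below is a generic distribution-aware sampler under assumed bin counts and candidate counts; it is not the published algorithm.

```python
import random
from collections import Counter

def histogram(values, bins=10, lo=0.0, hi=1.0):
    """Normalised histogram of values assumed to lie in [lo, hi]."""
    width = (hi - lo) / bins
    counts = Counter(min(int((v - lo) / width), bins - 1) for v in values)
    return [counts.get(i, 0) / len(values) for i in range(bins)]

def distance(h1, h2):
    """L1 distance between two normalised histograms."""
    return sum(abs(a - b) for a, b in zip(h1, h2))

def pick_sample(data, ratio=0.1, candidates=50, seed=0):
    """Among several random samples, keep the one closest to the full data."""
    rng = random.Random(seed)
    target = histogram(data)
    size = max(1, int(len(data) * ratio))
    best, best_d = None, float("inf")
    for _ in range(candidates):
        sample = rng.sample(data, size)
        d = distance(histogram(sample), target)
        if d < best_d:
            best, best_d = sample, d
    return best

gen = random.Random(1)
data = [gen.random() for _ in range(1000)]
sample = pick_sample(data, ratio=0.05)
print(len(sample), distance(histogram(sample), histogram(data)))
```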

7.
8.
Linear combinations of translates of a given basis function have long been successfully used to solve scattered data interpolation and approximation problems. We demonstrate how the classical basis function approach can be transferred to the projective space ℙ^(d−1). To be precise, we use concepts from harmonic analysis to identify positive definite and strictly positive definite zonal functions on ℙ^(d−1). These can then be applied to solve problems arising in tomography, since the data given there consists of integrals over lines. Here, enhancing known reconstruction techniques with the use of a scattered data interpolant in the “space of lines” naturally leads to reconstruction algorithms well suited to limited angle and limited range tomography. In the medical setting, algorithms for such incomplete data problems are desirable since their use can limit radiation dosage.
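The "linear combinations of translates of a basis function" construction amounts, in the plain Euclidean setting, to solving a kernel system. The sketch below uses a Gaussian kernel on sample points purely as a stand-in for the zonal kernels on ℙ^(d−1) studied in the paper; the test function and shape parameter are assumptions.

```python
import numpy as np

def gaussian_kernel(x, y, eps=2.0):
    """A positive definite basis function; translates of it span the interpolant."""
    return np.exp(-eps * np.sum((x - y) ** 2))

def fit(points, values, eps=2.0):
    """Solve K c = f so that s(x) = sum_j c_j * k(x, x_j) interpolates the data."""
    n = len(points)
    K = np.array([[gaussian_kernel(points[i], points[j], eps) for j in range(n)]
                  for i in range(n)])
    return np.linalg.solve(K, values)

def evaluate(x, points, coeffs, eps=2.0):
    return sum(c * gaussian_kernel(x, p, eps) for c, p in zip(coeffs, points))

# Scattered samples of f(x, y) = sin(x) + cos(y)
rng = np.random.default_rng(0)
pts = rng.uniform(0, np.pi, size=(40, 2))
vals = np.sin(pts[:, 0]) + np.cos(pts[:, 1])
coeffs = fit(pts, vals)
test = np.array([1.0, 2.0])
print(evaluate(test, pts, coeffs), np.sin(1.0) + np.cos(2.0))
```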

9.
Existing automated test data generation techniques tend to start from scratch, implicitly assuming that no pre‐existing test data are available. However, this assumption may not always hold, and where it does not, there may be a missed opportunity; perhaps the pre‐existing test cases could be used to assist the automated generation of additional test cases. This paper introduces search‐based test data regeneration, a technique that can generate additional test data from existing test data using a meta‐heuristic search algorithm. The proposed technique is compared to a widely studied test data generation approach in terms of both efficiency and effectiveness. The empirical evaluation shows that test data regeneration can be up to 2 orders of magnitude more efficient than existing test data generation techniques, while achieving comparable effectiveness in terms of structural coverage and mutation score. Copyright © 2010 John Wiley & Sons, Ltd.
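A minimal illustration of regenerating test data from an existing test case: small mutations are applied to existing inputs and kept when they reach a new branch. The toy function under test, the mutation operators, and the acceptance criterion are all simplified assumptions, not the paper's search operators.

```python
import random

def program_under_test(x, y):
    """Toy function whose branches we want the regenerated tests to cover."""
    if x > 100 and y < 0:
        return "A"
    if x == y:
        return "B"
    return "C"

def regenerate(existing, budget=2000, seed=0):
    """Mutate existing test data, keeping mutants that reach new branches."""
    rng = random.Random(seed)
    tests = list(existing)
    reached = {program_under_test(x, y) for x, y in tests}
    for _ in range(budget):
        x, y = rng.choice(tests)                       # start from existing data
        x2 = x + rng.choice([-100, -10, -1, 1, 10, 100])
        y2 = y + rng.choice([-100, -10, -1, 1, 10, 100])
        if rng.random() < 0.1:
            y2 = x2                                    # nudge toward the x == y branch
        branch = program_under_test(x2, y2)
        if branch not in reached:
            reached.add(branch)
            tests.append((x2, y2))
    return tests, reached

tests, reached = regenerate([(50, 5)])
print(reached)   # with this budget, usually {'A', 'B', 'C'}
```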

10.
Data protection has been a difficult problem ever since the Internet emerged. From the moment social media sites began to expand aggressively in digital markets, the protection of user data and information has kept policymakers on alert. In the digital economy era, data has gradually become a key factor in improving corporate competitiveness, and more and more market competition revolves around it. Enterprises' emphasis on, and contention for, data resources has pushed disputes and conflicts between platform power and the protection of users' personal information, as well as unfair-competition disputes over data between Internet companies, into the spotlight. It is therefore particularly important to coordinate the relationship between the reasonable use and the protection of data and to regulate unfair competition, so as to secure a competitive advantage amid the rapid development of the digital economy. By analyzing the dual nature of data, this article discusses the value of data in the digital economy era and, drawing on the Anti-Unfair Competition Law and practical cases, further examines the relationship between data utilization and data protection.

11.
To address the problem of storing and managing massive data in cloud computing, this paper analyzes the respective characteristics of the relational data model and the NoSQL data model and proposes a new data model. Based on the characteristics of the data itself, the model horizontally partitions the data into a collection of entities, with different data entities handling different data applications, thereby combining the availability of the relational model with the scalability of the NoSQL model. The integrity of the data model is guaranteed by detailed definitions of its data structures, constraints, and data operations. A running example on a prototype system verifies the effectiveness of the model and provides a feasible approach to cloud data management.
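A rough illustration of the "horizontal split into entity collections" idea, using made-up entity names and fields; it only shows how records are routed to separate entities serving different applications, not the constraints or operations defined in the paper.

```python
# Sketch: records are partitioned horizontally into entity collections, and
# each entity serves a different application. Names are assumptions.

class EntityStore:
    def __init__(self, name):
        self.name = name
        self.rows = []          # could be backed by a scalable key-value store

    def insert(self, row):
        self.rows.append(row)

class PartitionedModel:
    def __init__(self, entities):
        self.entities = {name: EntityStore(name) for name in entities}

    def insert(self, entity, row):
        self.entities[entity].insert(row)

model = PartitionedModel(["user_profile", "order_history", "click_log"])
model.insert("user_profile", {"uid": 1, "name": "alice"})
model.insert("click_log", {"uid": 1, "url": "/home", "ts": 1700000000})
print({e: len(s.rows) for e, s in model.entities.items()})
```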

12.
Simon Gog, Matthias Petri. Software, 2014, 44(11): 1287-1314.
Succinct data structures provide the same functionality as their corresponding traditional data structure in compact space. We improve on functions rank and select, which are the basic building blocks of FM‐indexes and other succinct data structures. First, we present a cache‐optimal, uncompressed bitvector representation that outperforms all existing approaches. Next, we improve, in both space and time, on a recent result by Navarro and Providel on compressed bitvectors. Last, we show techniques to perform rank and select on 64‐bit words that are up to three times faster than existing methods. In our experimental evaluation, we first show how our improvements affect cache and runtime performance of both operations on data sets larger than commonly used in the evaluation of succinct data structures. Our experiments show that our improvements to these basic operations significantly improve the runtime performance and compression effectiveness of FM‐indexes on small and large data sets. To our knowledge, our improvements result in FM‐indexes that are either smaller or faster than all current state of the art implementations. Copyright © 2013 John Wiley & Sons, Ltd.
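For concreteness, here is what rank and select look like on a single 64-bit word. Real succinct bitvectors, including the improved ones in the paper, layer precomputed superblock and block counts on top of word-level primitives and use far faster broadword or popcount instructions than this naive sketch.

```python
def rank1(word: int, i: int) -> int:
    """Number of 1-bits in positions [0, i) of a 64-bit word."""
    return bin(word & ((1 << i) - 1)).count("1")

def select1(word: int, k: int) -> int:
    """Position of the k-th set bit (1-based), or -1 if there is none."""
    for pos in range(64):
        if (word >> pos) & 1:
            k -= 1
            if k == 0:
                return pos
    return -1

word = 0b1011001010  # bit 0 is the least significant bit
print(rank1(word, 4))    # 2 set bits among positions 0..3 (bits 1 and 3)
print(select1(word, 3))  # the 3rd set bit sits at position 6
```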

13.
In this article, we focus on the issues of fuzzy data dependencies. After introducing the notion of semantic equivalence degree, fuzzy functional and multivalued dependencies are defined. A set of sound and complete inference rules, similar to Armstrong's axioms for classic cases, for fuzzy functional dependencies (FFDs) and fuzzy multivalued dependencies (FMVDs) are proposed. The strategies and approaches for compressing fuzzy values by FFDs and FMVDs are investigated. By such processing, the unnecessary elements are eliminated from a fuzzy value and its range is compressed. © 2002 Wiley Periodicals, Inc.
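The abstract does not reproduce the definition of the semantic equivalence degree, so the sketch below uses a common min/max similarity between two fuzzy values, given as possibility distributions, purely as a stand-in; the actual definition in the article may differ.

```python
# Stand-in closeness measure between two fuzzy values over the same domain:
# sum of pointwise minima over sum of pointwise maxima. Assumed for illustration.

def equivalence_degree(a: dict, b: dict) -> float:
    domain = set(a) | set(b)
    num = sum(min(a.get(x, 0.0), b.get(x, 0.0)) for x in domain)
    den = sum(max(a.get(x, 0.0), b.get(x, 0.0)) for x in domain)
    return num / den if den else 1.0

young = {"20": 1.0, "25": 0.8, "30": 0.4}
youngish = {"25": 1.0, "30": 0.7, "35": 0.3}
print(round(equivalence_degree(young, youngish), 3))   # 0.4
```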

14.
Modern medical imaging lets us create accurate computer models of anatomical structures. Some of these models can be animated to visualize joint kinematics. A model's size and complexity can significantly affect the efficiency of any desired animation or interactive manipulation. The model normally takes the form of a polygonal mesh; the more facets in the mesh, the slower the rendering process. Beyond a certain limit, real time interaction becomes impractical because the frame rate (image regeneration) is too slow. The many methods proposed for reducing the number of polygons in computer models normally entail a loss of detail in the final model. In some applications, retaining detail may be important. Joint kinematics, which we were investigating, falls into this category, and we sought a way to reduce the input data volume without introducing a corresponding decrease in the isosurface resolution. Our application requires only the bone's external surface, which is found by segmenting radiological data obtained from computerized tomodensitometry (a CT scan). By analyzing local bone morphology, we were able to identify and eliminate nearly 50 percent of the polygons generated by standard segmentation techniques, while retaining the full resolution of the required isosurface. The article discusses the relationships between bone morphology and bone intensity in a medical imaging data set and describes how these relationships can help us reduce the polygon count in the generated surface models.

15.
Li Menggang, Wang Fang, Jia Xiaojun, Li Wenrui, Li Ting, Rui Guangwei. Neural Computing & Applications, 2021, 33(10): 4729-4739.
Economic data include data of various types and characteristics such as macro-data, meso-data, and micro-data. The source of economic data can be the data...

16.
Many data warehouses contain massive amounts of data, accumulated over long periods of time. In some cases, it is necessary or desirable to either delete “old” data or to maintain the data at an aggregate level. This may be due to privacy concerns, in which case the data are aggregated to levels that ensure anonymity. Another reason is the desire to maintain a balance between the uses of data that change as the data age and the size of the data, thus avoiding overly large data warehouses. This paper presents effective techniques for data reduction that enable the gradual aggregation of detailed data as the data ages. With these techniques, data may be aggregated to higher levels as they age, enabling the maintenance of more compact, consolidated data and the compliance with privacy requirements. Special care is taken to avoid semantic problems in the aggregation process. The paper also describes the querying of the resulting data warehouses and an implementation strategy based on current database technology.
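A toy version of age-based gradual aggregation: detailed rows older than a cutoff are rolled up to a coarser time granularity while recent rows stay at full detail. The field names and the single cutoff are assumptions; the paper's techniques also handle dimension hierarchies and the semantic issues mentioned above, which are not shown here.

```python
from collections import defaultdict
from datetime import date

def gradually_aggregate(rows, cutoff: date):
    """Keep recent rows as-is; roll up older rows to (month, product) totals."""
    detailed, totals = [], defaultdict(float)
    for r in rows:
        if r["day"] >= cutoff:
            detailed.append(r)
        else:
            key = (r["day"].strftime("%Y-%m"), r["product"])
            totals[key] += r["amount"]
    aggregated = [{"month": m, "product": p, "amount": a}
                  for (m, p), a in sorted(totals.items())]
    return detailed, aggregated

rows = [
    {"day": date(2023, 1, 3), "product": "A", "amount": 10.0},
    {"day": date(2023, 1, 20), "product": "A", "amount": 5.0},
    {"day": date(2024, 6, 1), "product": "B", "amount": 7.5},
]
detail, agg = gradually_aggregate(rows, cutoff=date(2024, 1, 1))
print(detail)  # the 2024 row stays at daily detail
print(agg)     # [{'month': '2023-01', 'product': 'A', 'amount': 15.0}]
```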

17.
An effective data structure that copes with rapidly changing data is one of the most important requirements for spatio-temporal data. Spatio-temporal data have special relationships between their spatial and temporal values, and both types of data are complex in terms of their numerous attributes and the changes they exhibit over time. A data model that improves the performance of data storage and query responses in a spatio-temporal system is therefore needed. The structure of the relationships among spatio-temporal data mimics the biological structure of hair, which has a ‘Root’ (spatial values) and a ‘Shaft’ (temporal values) and undergoes growth. This paper aims to present the mathematical formulation of a Hair-Oriented Data Model (HODM) for spatio-temporal data and to demonstrate the model's performance by measuring storage size and query response time. The experiment was conducted using more than 178,000 records of climate-change spatio-temporal data implemented in an object-relational database using nested tables. The data structure and operations are implemented by SQL statements based on object-relational database concepts. Storage size and query execution performance are compared against tabular and normalized entity-relationship models across various types of queries. The results show that HODM has a lower storage size and a faster query response time for all studied types of spatio-temporal queries. The significance of the work is further elaborated through a comparison with generic data models. The experimental results showed that the proposed data model is easier to develop and more efficient.
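A literal reading of the Root/Shaft metaphor as an in-memory structure, with hypothetical field names: each spatial root carries a growing shaft of time-stamped observations. This is only a sketch of the layout idea; the paper implements it with nested tables in an object-relational database.

```python
# Sketch of a hair-oriented layout: a spatial "root" keyed by location,
# each holding a "shaft" of time-ordered observations that keeps growing.
# Field names are hypothetical.

class HairRecord:
    def __init__(self, site_id, lon, lat):
        self.root = {"site_id": site_id, "lon": lon, "lat": lat}
        self.shaft = []                      # list of (timestamp, measurements)

    def grow(self, timestamp, measurements):
        self.shaft.append((timestamp, measurements))

    def between(self, t1, t2):
        return [(t, m) for t, m in self.shaft if t1 <= t <= t2]

station = HairRecord("ST-042", lon=101.7, lat=3.15)
station.grow("2020-01-01", {"temp_c": 27.3, "rain_mm": 0.0})
station.grow("2020-01-02", {"temp_c": 26.8, "rain_mm": 12.4})
print(station.between("2020-01-01", "2020-01-01"))
```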

18.
Digital steganography is the art of inconspicuously hiding data within data. Steganography's goal in general is to hide data well enough that unintended recipients do not suspect the steganographic medium of containing hidden data. The software and links mentioned in this article are just a sample of the steganography tools currently available. As privacy concerns continue to develop along with the digital communication domain, steganography will undoubtedly play a growing role in society. For this reason, it is important that we are aware of digital steganography technology and its implications. Equally important are the ethical concerns of using steganography and steganalysis. Steganography enhances rather than replaces encryption. Messages are not secure simply by virtue of being hidden. Likewise, steganography is not about keeping your message from being known - it's about keeping its existence from being known.
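As a concrete, textbook illustration of "hiding data within data", and not tied to any of the tools the article surveys, the sketch below hides a message in the least-significant bits of a byte buffer such as raw image samples.

```python
def hide(cover: bytearray, message: bytes) -> bytearray:
    """Write each message bit into the least-significant bit of a cover byte."""
    bits = [(byte >> i) & 1 for byte in message for i in range(8)]
    if len(bits) > len(cover):
        raise ValueError("cover too small")
    stego = bytearray(cover)
    for pos, bit in enumerate(bits):
        stego[pos] = (stego[pos] & 0xFE) | bit
    return stego

def reveal(stego: bytearray, length: int) -> bytes:
    """Read length bytes back out of the least-significant bits."""
    out = bytearray()
    for byte_index in range(length):
        value = 0
        for i in range(8):
            value |= (stego[byte_index * 8 + i] & 1) << i
        out.append(value)
    return bytes(out)

cover = bytearray(range(256)) * 2           # stand-in for raw pixel samples
stego = hide(cover, b"meet at noon")
print(reveal(stego, len(b"meet at noon")))  # b'meet at noon'
```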

19.
Large-scale data visualization using parallel data streaming (cited 2 times in total: 0 self-citations, 2 by others)
We present an architectural approach based on parallel data streaming to enable visualizations on a parallel cluster. Our approach requires less memory than other visualizations while achieving high code reuse. We implemented our architecture within the Visualization Toolkit (VTK). It includes specific additions to support message passing interfaces (MPIs); memory limit-based streaming of both implicit and explicit topologies; translation of streaming requests between topologies; and passing data and pipeline control between shared, distributed, and mixed memory configurations. The architecture directly supports both sort-first and sort-last parallel rendering.
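The memory-limited streaming idea can be pictured as a generator pipeline that pulls one bounded piece of data at a time through a chain of filters. This generic sketch deliberately leaves out MPI, topology translation, rendering, and the VTK-specific machinery described above.

```python
# Generic picture of memory-limited data streaming: the source yields bounded
# pieces, each filter transforms one piece, and only one piece is resident at
# a time.

def source(n_points, piece_size):
    """Yield the data set piece by piece instead of all at once."""
    for start in range(0, n_points, piece_size):
        yield list(range(start, min(start + piece_size, n_points)))

def scale(pieces, factor):
    for piece in pieces:
        yield [factor * v for v in piece]

def reduce_max(pieces):
    best = float("-inf")
    for piece in pieces:
        best = max(best, max(piece))
    return best

pipeline = scale(source(n_points=1_000_000, piece_size=10_000), factor=0.5)
print(reduce_max(pipeline))   # 499999.5, computed without holding all points
```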

20.
Cost-constrained data acquisition for intelligent data preparation (cited 1 time in total: 0 self-citations, 1 by others)
Real-world data is noisy and can often suffer from corruptions or incomplete values that may impact the models created from the data. To build accurate predictive models, data acquisition is usually adopted to prepare the data and complete missing values. However, due to the significant cost of doing so and the inherent correlations in the data set, acquiring correct information for all instances is prohibitive and unnecessary. An interesting and important problem that arises here is to select what kinds of instances to complete so the model built from the processed data can receive the "maximum" performance improvement. This problem is complicated by the reality that the costs associated with the attributes are different, and fixing the missing values of some attributes is inherently more expensive than others. Therefore, the problem becomes that given a fixed budget, what kinds of instances should be selected for preparation, so that the learner built from the processed data set can maximize its performance? In this paper, we propose a solution for this problem, and the essential idea is to combine attribute costs and the relevance of each attribute to the target concept, so that the data acquisition can pay more attention to those attributes that are cheap in price but informative for classification. To this end, we will first introduce a unique economical factor (EF) that seamlessly integrates the cost and the importance (in terms of classification) of each attribute. Then, we will propose a cost-constrained data acquisition model, where active learning, missing value prediction, and impact-sensitive instance ranking are combined for effective data acquisition. Experimental results and comparative studies from real-world data sets demonstrate the effectiveness of our method.
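A rough sketch of acquiring values under a budget by ranking attributes on an importance-to-cost ratio. The exact form of the paper's economical factor (EF) is not given in the abstract, so the simple ratio below, along with the made-up importance scores, costs, and attribute names, is an assumption for illustration only; it also omits the active learning and missing-value prediction components.

```python
# Toy budgeted acquisition: rank attributes by importance / cost (a stand-in
# for the paper's economical factor) and acquire missing values until the
# budget runs out. All numbers here are illustrative assumptions.

def economical_factor(importance, cost):
    return importance / cost

def plan_acquisition(attributes, budget, missing_counts):
    """Return how many missing values of each attribute to acquire."""
    ranked = sorted(attributes,
                    key=lambda a: economical_factor(a["importance"], a["cost"]),
                    reverse=True)
    plan, remaining = {}, budget
    for attr in ranked:
        affordable = int(remaining // attr["cost"])
        take = min(affordable, missing_counts[attr["name"]])
        if take > 0:
            plan[attr["name"]] = take
            remaining -= take * attr["cost"]
    return plan, remaining

attributes = [
    {"name": "blood_test", "importance": 0.90, "cost": 5.0},
    {"name": "survey",     "importance": 0.40, "cost": 0.5},
    {"name": "mri",        "importance": 0.95, "cost": 40.0},
]
missing = {"blood_test": 30, "survey": 50, "mri": 10}
plan, left = plan_acquisition(attributes, budget=100.0, missing_counts=missing)
print(plan, left)   # {'survey': 50, 'blood_test': 15} 0.0
```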
