Similar Documents
Found 20 similar documents (search time: 15 ms)
1.
With the rapid development of the Internet, and in particular the recent emergence of technologies such as cloud computing and the Internet of Things and the widespread adoption of services such as social networking, the volume of data produced by human society is growing rapidly: the era of big data has arrived. How to acquire and analyze big data has become a widely discussed problem, but the data security issues it brings with it demand serious attention. Starting from the concept and characteristics of big data, this paper describes the security challenges big data faces and proposes corresponding security strategies.

2.
Many data warehouses contain massive amounts of data, accumulated over long periods of time. In some cases, it is necessary or desirable to either delete “old” data or to maintain the data at an aggregate level. This may be due to privacy concerns, in which case the data are aggregated to levels that ensure anonymity. Another reason is the desire to maintain a balance between the uses of data that change as the data age and the size of the data, thus avoiding overly large data warehouses. This paper presents effective techniques for data reduction that enable the gradual aggregation of detailed data as the data ages. With these techniques, data may be aggregated to higher levels as they age, enabling the maintenance of more compact, consolidated data and the compliance with privacy requirements. Special care is taken to avoid semantic problems in the aggregation process. The paper also describes the querying of the resulting data warehouses and an implementation strategy based on current database technology.
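The gradual aggregation the abstract describes can be sketched in a few lines. This is a minimal illustration only: the field names, the day-to-month rollup ladder, and the fixed cutoff date are assumptions made for the example, not the paper's schema.

```python
from collections import defaultdict
from datetime import date

def gradually_aggregate(rows, cutoff):
    """Keep rows newer than `cutoff` at full detail; roll up older rows
    from daily to monthly granularity (one possible aggregation ladder)."""
    recent = [r for r in rows if r["day"] >= cutoff]
    totals = defaultdict(float)
    for r in rows:
        if r["day"] < cutoff:
            totals[r["day"].replace(day=1)] += r["amount"]  # day -> month
    old = [{"month": m, "amount": v} for m, v in sorted(totals.items())]
    return recent, old

rows = [
    {"day": date(2023, 1, 5), "amount": 10.0},
    {"day": date(2023, 1, 20), "amount": 5.0},
    {"day": date(2023, 6, 1), "amount": 7.0},
]
recent, old = gradually_aggregate(rows, cutoff=date(2023, 3, 1))
# the June row stays detailed; the two January rows collapse into one monthly total
```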

3.
Existing automated test data generation techniques tend to start from scratch, implicitly assuming that no pre‐existing test data are available. However, this assumption may not always hold, and where it does not, there may be a missed opportunity; perhaps the pre‐existing test cases could be used to assist the automated generation of additional test cases. This paper introduces search‐based test data regeneration, a technique that can generate additional test data from existing test data using a meta‐heuristic search algorithm. The proposed technique is compared to a widely studied test data generation approach in terms of both efficiency and effectiveness. The empirical evaluation shows that test data regeneration can be up to 2 orders of magnitude more efficient than existing test data generation techniques, while achieving comparable effectiveness in terms of structural coverage and mutation score. Copyright © 2010 John Wiley & Sons, Ltd.
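A rough illustration of regenerating tests from existing ones follows. A deterministic neighborhood search stands in for the paper's meta-heuristic, and the coverage function is a toy invented for the example.

```python
def regenerate(seed_tests, covers):
    """Derive new test inputs by perturbing existing ones, keeping any
    mutant that covers a branch no current test covers."""
    tests = list(seed_tests)
    covered = set()
    for t in tests:
        covered |= covers(t)
    frontier = list(tests)
    while frontier:
        base = frontier.pop()
        for i in range(len(base)):
            for d in (-1, 1):
                mutant = base[:i] + (base[i] + d,) + base[i + 1:]
                gain = covers(mutant) - covered
                if gain:                      # mutant reaches a new branch
                    tests.append(mutant)
                    covered |= gain
                    frontier.append(mutant)
    return tests, covered

def covers(t):
    """Toy branch coverage of an imagined program under test."""
    x, y = t
    return {("x_positive", x > 0), ("y_even", y % 2 == 0)}

tests, covered = regenerate([(1, 2)], covers)
# starting from one test, two derived tests reach the remaining branches
```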

4.
QAR Data Analysis Based on a Data Warehouse
The airlines' existing platforms for analyzing QAR (Quick Access Recorder) data are not unified, QAR data volumes are large, and the capacity to analyze and process the data is insufficient, turning massive data into information garbage. Building on a study of QAR data and data warehouse technology, and addressing the questions airlines care about, this paper designs a QAR data warehouse: it presents the design of the warehouse model, details the method for extracting QAR data analysis subjects, and proposes a star schema for the QAR data warehouse. Multidimensional cubes are built through ETL, and an exceedance-event example demonstrates safety analysis of QAR data, providing airlines with a basis for improving flight quality and raising safety margins.

5.
F-incomplete data is a data pair ((x)F, (x)−) composed of F-data (x)F and lost data (x)−; such a pair exhibits internal dynamic characteristics and applies to a class of information systems whose data elements shrink over time. Addressing the visualization of F-incomplete data, the paper uses F-incomplete data to introduce the concept of the F-incomplete data circle and discusses geometric methods for identifying and recovering F-incomplete data. It gives a positional-relationship theorem for F-incomplete data circles, an identification theorem and identification criteria for F-incomplete data based on these circles, and a recovery theorem, together with applications. An F-incomplete data circle is a pair of data circles (OF, O−); it provides a geometric tool for studying F-incomplete data.

6.
When users store data in big data platforms, the integrity of outsourced data is a major concern for data owners because they lack direct control over the data. However, existing remote data auditing schemes for big data platforms apply only to static data. To verify the integrity of dynamic data on a Hadoop big data platform, we present a dynamic auditing scheme that meets Hadoop's particular requirements. Concretely, a new data structure, the Data Block Index Table, is designed to support dynamic data operations on HDFS (Hadoop Distributed File System), including appending, inserting, deleting, and modifying. Combined with the MapReduce framework, a dynamic auditing algorithm is designed to audit the data on HDFS concurrently. Analysis shows that the proposed scheme is secure against forge, replace, and replay attacks on a big data platform, and efficient in both computation and communication.
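A toy version of an index table supporting those four dynamic operations is shown below. The real Data Block Index Table also carries verification metadata for the auditing protocol; this sketch keeps only logical position, block id, and version, which is an assumption about its structure.

```python
class BlockIndexTable:
    """Minimal index over data blocks: logical order is list position,
    each entry is (block_id, version)."""
    def __init__(self):
        self._entries = []
        self._next_id = 0
    def _new_entry(self):
        entry = (self._next_id, 0)
        self._next_id += 1
        return entry
    def append_block(self):
        self._entries.append(self._new_entry())
    def insert_block(self, pos):
        self._entries.insert(pos, self._new_entry())
    def delete_block(self, pos):
        self._entries.pop(pos)
    def modify_block(self, pos):
        bid, ver = self._entries[pos]
        self._entries[pos] = (bid, ver + 1)   # same block, new version
    def snapshot(self):
        return list(self._entries)

t = BlockIndexTable()
for _ in range(3):
    t.append_block()          # blocks 0, 1, 2
t.insert_block(1)             # block 3 inserted at logical position 1
t.modify_block(0)             # block 0 bumped to version 1
t.delete_block(2)             # logical position 2 (block 1) removed
```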

7.
Cost-constrained data acquisition for intelligent data preparation
Real-world data is noisy and can often suffer from corruptions or incomplete values that may impact the models created from the data. To build accurate predictive models, data acquisition is usually adopted to prepare the data and complete missing values. However, due to the significant cost of doing so and the inherent correlations in the data set, acquiring correct information for all instances is prohibitive and unnecessary. An interesting and important problem that arises here is to select what kinds of instances to complete so the model built from the processed data can receive the "maximum" performance improvement. This problem is complicated by the reality that the costs associated with the attributes are different, and fixing the missing values of some attributes is inherently more expensive than others. Therefore, the problem becomes that given a fixed budget, what kinds of instances should be selected for preparation, so that the learner built from the processed data set can maximize its performance? In this paper, we propose a solution for this problem, and the essential idea is to combine attribute costs and the relevance of each attribute to the target concept, so that the data acquisition can pay more attention to those attributes that are cheap in price but informative for classification. To this end, we will first introduce a unique economical factor (EF) that seamlessly integrates the cost and the importance (in terms of classification) of each attribute. Then, we will propose a cost-constrained data acquisition model, where active learning, missing value prediction, and impact-sensitive instance ranking are combined for effective data acquisition. Experimental results and comparative studies from real-world data sets demonstrate the effectiveness of our method.
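The core idea, favoring attributes that are cheap but informative, can be sketched as a simple ratio. The paper's actual EF formula is not reproduced here, and the attribute names, importance scores, and costs below are invented for illustration.

```python
def economical_factor(importance, cost):
    """Informativeness per unit acquisition cost (one plausible EF form)."""
    return importance / cost

# importance could come from, e.g., information gain measured on complete rows
attrs = {
    "survey":     (0.20, 2.0),    # (importance, acquisition cost)
    "blood_test": (0.60, 30.0),
    "mri":        (0.70, 400.0),
}
ranked = sorted(attrs, key=lambda a: economical_factor(*attrs[a]), reverse=True)
# the cheap questionnaire outranks the more informative but far costlier MRI
```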

8.
Among the many ways to improve the efficiency of data mining, parallel data mining addresses the problem at its root. This paper first points out that whether mining proceeds sequentially or in parallel, it must serve the ultimate goal of data mining, namely discovering as much of the useful knowledge contained in the data as possible, and only on that basis improve mining efficiency. Building on this idea, the paper proposes a feature-oriented data partitioning process and, further, a weighted parallel data mining method. Under this scheme, knowledge can be obtained relative to partial data, which greatly improves the dynamic performance of data mining.
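A minimal sketch of a weighted, partitioned scheme in the spirit of the abstract: the feature-oriented partitions and the per-partition weights below are assumptions for illustration, and real mining would count itemsets rather than single items.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

def weighted_parallel_counts(partitions, weights):
    """Mine (here: count items in) each partition in parallel, then merge
    the per-partition results scaled by a partition weight."""
    with ThreadPoolExecutor() as pool:
        local = list(pool.map(Counter, partitions))
    merged = Counter()
    for counts, w in zip(local, weights):
        for item, n in counts.items():
            merged[item] += w * n
    return merged

merged = weighted_parallel_counts(
    [["a", "b", "a"], ["b", "c"]],   # feature-oriented partitions
    [1.0, 2.0],                      # weight per partition
)
# partial results per partition are available before the merge completes
```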

9.
Research on Data Integration in Web Data Mining
Based on an analysis of the characteristics of data sources in the Web environment, this paper studies the data integration problem in Web data mining in depth and presents an integration scheme based on XML technology. The scheme uses a Web data access approach to integrate heterogeneous data sources, providing a unified and effective data set for Web data mining and solving the difficult problem of integrating heterogeneous Web data sources. A concrete example illustrates the Web data integration process.

10.
An effective data structure for rapidly changing data is one of the most important demands in spatio-temporal systems. Spatio-temporal data carry special relationships between their spatial and temporal values, and both kinds of values are complex in terms of their numerous attributes and the changes they exhibit over time. A data model is therefore needed that improves the performance of data storage and of query responses in a spatio-temporal system. The structure of the relationships between spatio-temporal data mimics the biological structure of a hair, which has a 'Root' (spatial values) and a 'Shaft' (temporal values) and undergoes growth. This paper presents the mathematical formulation of a Hair-Oriented Data Model (HODM) for spatio-temporal data and demonstrates the model's performance by measuring storage size and query response time. The experiment used more than 178,000 records of climate-change spatio-temporal data, implemented in an object-relational database using nested tables; the data structure and operations are expressed as SQL statements based on object-relational concepts. Storage size and query execution time are compared against tabular and normalized entity-relationship models across various query types. The results show that HODM has a smaller storage size and a faster query response time for all studied types of spatio-temporal queries, and a comparison with generic data models shows that the proposed model is easier to develop and more efficient.

11.
There is a trend that virtually everyone, from big Web companies to traditional enterprises to physical and social scientists, is either already experiencing or anticipating unprecedented growth in the amount of data available in their world, along with new opportunities and great untapped value. This paper reviews big data challenges from a data management perspective. In particular, we discuss big data diversity, big data reduction, big data integration and cleaning, big data indexing and querying, and finally big data analysis and mining. Our survey gives a brief overview of big-data-oriented research and problems.

12.
Time series analysis has always been an important and interesting research field due to its frequent appearance in different applications. In the past, many approaches based on regression, neural networks and other mathematical models were proposed to analyze time series. In this paper, we apply data mining techniques to time series analysis. Many previous studies on data mining have focused on handling binary-valued data, whereas time series data are usually quantitative. We therefore extend our previous fuzzy mining approach to time-series data in order to find linguistic association rules. The proposed approach first uses a sliding window to generate contiguous subsequences from a given time series and then analyzes the fuzzy itemsets in these subsequences. Appropriate post-processing is then performed to remove redundant patterns. Experiments are also reported that show the performance of the proposed mining algorithm. Since the final results are represented by linguistic rules, they are friendlier to humans than a quantitative representation.
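The sliding-window step can be shown concretely. The window length and series values are invented, and the fuzzy-itemset analysis that consumes the subsequences is omitted.

```python
def sliding_subsequences(series, w):
    """All contiguous length-w subsequences of a time series, which the
    fuzzy mining stage would then convert to linguistic terms."""
    return [tuple(series[i:i + w]) for i in range(len(series) - w + 1)]

subs = sliding_subsequences([3, 5, 8, 6, 4], 3)
# [(3, 5, 8), (5, 8, 6), (8, 6, 4)]
```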

13.
The optimization capabilities of RDBMSs make them attractive for executing data transformations. However, despite the fact that many useful data transformations can be expressed as relational queries, an important class of data transformations that produce several output tuples for a single input tuple cannot be expressed in that way.

To overcome this limitation, we propose to extend Relational Algebra with a new operator named data mapper. In this paper, we formalize the data mapper operator and investigate some of its properties. We then propose a set of algebraic rewriting rules that enable the logical optimization of expressions with mappers and prove their correctness. Finally, we experimentally study the proposed optimizations and identify the key factors that influence the optimization gains.
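In modern terms, the mapper operator behaves like a relational flat-map: one input tuple may yield several output tuples. A sketch follows; the unpivot example is ours, not from the paper.

```python
def mapper(relation, fn):
    """Apply fn to each input tuple; fn may return zero or more output
    tuples, a one-to-many shape ordinary relational queries cannot express."""
    return [out for row in relation for out in fn(row)]

rows = [("ann", 10, 12), ("bob", 7, 9)]          # (name, q1_sales, q2_sales)
out = mapper(rows, lambda r: [(r[0], "q1", r[1]), (r[0], "q2", r[2])])
# each input row becomes two output rows
```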


14.
In the operational phase, buildings now produce more data than ever before: energy usage, utility information, occupancy patterns, weather data, and so on. To manage a building holistically it is important to use knowledge from across these information sources, yet many barriers to their interoperability exist and there is little interaction between these islands of information. As building data moves to the cloud, there is a critical need to reflect on how cloud-based data services are designed from an interoperability perspective: if they are designed in the same manner as traditional building management systems, they will suffer the same data interoperability problems. Linked data technology leverages the existing open protocols and W3C standards of the Web architecture for sharing structured data on the web. In this paper we propose linked data as an enabling technology for cloud-based building data services; the objective of linking building data in the cloud is to create an integrated, well-connected graph of the information relevant to managing a building. The paper describes the fundamentals of the approach and demonstrates the concept in an owner-occupied office building of a small and medium-sized enterprise (SME).

15.
Food consumption data are collected and used in several fields of science. The data are often combined from various sources and interchanged between different systems; there is, however, no harmonized and widely used data interchange format. In addition, food consumption data are often combined with other data such as food composition data. In the field of food composition, successful harmonization has recently been achieved by the European Food Information Resource Network, which is now the basis of a draft standard by the European Committee for Standardization. We present an XML-based data interchange format for food consumption based on work and experience related to food composition, with the aim that it will provide a basis for wider harmonization in the future.

16.
Discovering interesting patterns or substructures in data streams is an important challenge in data mining. Clustering algorithms are very often applied to identify single substructures although they are designed to partition a data set. Another problem of clustering algorithms is that most of them are not designed for data streams. This paper discusses a recently introduced procedure that deals with both problems. The procedure explores ideas from cluster analysis, but was designed to identify single clusters without the necessity to partition the whole data set into clusters. The new extended version of the algorithm is an incremental clustering approach applicable to stream data. It identifies new clusters formed by the incoming data and updates the data space partition. Clustering of artificial and real data sets illustrates the abilities of the proposed method.
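A greatly simplified incremental variant is sketched below, using a fixed radius threshold and running-mean centroids; the paper's procedure is more elaborate, so this shows only the incremental-assignment idea.

```python
import math

class IncrementalClusterer:
    """Each arriving point joins the nearest cluster within `radius`,
    updating that cluster's centroid incrementally; otherwise it starts
    a new cluster."""
    def __init__(self, radius):
        self.radius = radius
        self.centroids = []
        self.counts = []
    def add(self, point):
        best, best_d = None, self.radius
        for i, c in enumerate(self.centroids):
            d = math.dist(point, c)
            if d <= best_d:
                best, best_d = i, d
        if best is None:                      # no cluster close enough
            self.centroids.append(list(point))
            self.counts.append(1)
            return len(self.centroids) - 1
        n = self.counts[best] + 1             # running-mean centroid update
        self.centroids[best] = [c + (x - c) / n
                                for c, x in zip(self.centroids[best], point)]
        self.counts[best] = n
        return best

clu = IncrementalClusterer(radius=1.0)
labels = [clu.add(p) for p in [(0.0, 0.0), (0.5, 0.0), (5.0, 5.0)]]
# the third point is far from the first cluster, so it opens a new one
```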

17.
This paper reviews the two traditional data access methods used by data mining algorithms and their shortcomings, and proposes a new one: a cache-based data access method for data mining algorithms. The method offers three data caching modes, single-column, multi-column, and mixed, to suit the needs of different mining algorithms. A dedicated data access component implementing the method was designed and built; it retains the advantages of the traditional access methods, and experiments show that, while consuming limited system resources, it delivers efficient data access and supports access to massive data sets.
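The three caching modes reduce to caching columns individually and serving any requested combination from the same cache. A sketch follows; the `fetch` interface standing in for the underlying data source is an assumption.

```python
class ColumnCache:
    """Cache columns on first access; single-column, multi-column, and
    mixed requests are all served from the same per-column cache."""
    def __init__(self, fetch):
        self.fetch = fetch            # fetch(col_name) -> list of values
        self._cache = {}
        self.fetch_calls = 0
    def get(self, *cols):
        for c in cols:
            if c not in self._cache:  # load only columns not yet cached
                self._cache[c] = self.fetch(c)
                self.fetch_calls += 1
        return {c: self._cache[c] for c in cols}

table = {"age": [21, 35], "income": [30, 55]}
cache = ColumnCache(lambda col: table[col])
cache.get("age")                      # single-column mode: loads "age"
cache.get("age", "income")            # multi-column request: only "income" is loaded
```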

18.
Zhao Chun, Ren Lei, Zhang Ziqiao, Meng Zihao. World Wide Web, 2020, 23(2): 1407-1421

In manufacturing, a large amount of data is produced by different departments and different domains. To realise data sharing and linkage along supply chains, master data management has been used: through master data management, key data can be shared and distributed uniformly. However, since these cross-domain data form a data network through their association with master data, evaluating the effectiveness and rationality of this network becomes the major issue of the approach. In this paper, a model of the master data network is built based on the theory of set pair analysis, and an evaluation method for the network is proposed to verify the master data. Finally, a case study validates the network model and the evaluation method.


19.
Research on Integration of Multi-Source Heterogeneous Spatial Data Based on GML
This paper analyzes three data integration modes in depth, namely data format conversion, direct data access, and data interoperation, describes a GML-based, interoperation-mode integration model for multi-source heterogeneous spatial data, and analyzes its operating mechanism and key technologies. The model converts distributed heterogeneous spatial data sources into unified GML documents through corresponding GML conversion interfaces, and integrates the heterogeneous spatial data effectively through an integration engine and corresponding integration rules, achieving the goal of data sharing.
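A minimal example of the conversion-interface step, emitting one point as a GML fragment (using the GML 3.x `gml:pos` form; the coordinates are invented):

```python
import xml.etree.ElementTree as ET

GML = "http://www.opengis.net/gml"

def point_to_gml(x, y):
    """Wrap one source coordinate pair in a unified GML fragment."""
    ET.register_namespace("gml", GML)
    point = ET.Element(f"{{{GML}}}Point")
    pos = ET.SubElement(point, f"{{{GML}}}pos")
    pos.text = f"{x} {y}"
    return ET.tostring(point, encoding="unicode")

fragment = point_to_gml(116.4, 39.9)
# a real conversion interface would map whole source schemas, not single points
```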

20.
A data envelopment analysis-based approach for data preprocessing
In this paper, we show how the data envelopment analysis (DEA) model might be useful to screen training data so a subset of examples that satisfy monotonicity property can be identified. Using real-world health care and software engineering data, managerial monotonicity assumption, and artificial neural network (ANN) as a forecasting model, we illustrate that DEA-based data screening of training data improves forecasting accuracy of an ANN.
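The screening goal can be illustrated without the DEA machinery: a pairwise dominance check stands in for the DEA model below, so this sketch conveys only the intent (keep monotonicity-consistent examples), not the paper's method.

```python
def monotone_subset(examples):
    """Keep examples consistent with monotonicity: drop any example whose
    inputs dominate another's componentwise yet whose output is smaller."""
    keep = []
    for x, y in examples:
        violates = any(
            all(a >= b for a, b in zip(x, x2)) and y < y2
            for x2, y2 in examples
        )
        if not violates:
            keep.append((x, y))
    return keep

examples = [((1, 1), 5.0), ((2, 2), 3.0), ((3, 3), 9.0)]
screened = monotone_subset(examples)
# ((2, 2), 3.0) has larger inputs than ((1, 1), 5.0) but a smaller output
```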

