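To support data mining, analysis and knowledge discovery across multidisciplinary domains, and to let big data analytics balance domain complexity, ease of analysis and execution efficiency, this paper proposes domain-business-driven modeling of big data analysis processes, which guides the construction and implementation of big data analysis process models. The analysis process is divided into a two-layer model, one layer domain-oriented and the other platform-oriented, and a model-driven mapping method converts between the two automatically. The domain-oriented analysis model, from the domain...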

Beyond the hype of Big Data, something within business intelligence projects is indeed changing. This is mainly because Big Data is not only about data, but also about a complete conceptual and technological stack that includes raw and processed data, storage, ways of managing data, processing and analytics. An even trickier challenge is managing the quality of data in Big Data environments. More than ever before, assessing Quality-in-Use gains importance, since the real contribution (the business value) of data can only be estimated in its context of use. Although various Data Quality models exist for assessing the quality of regular data, none of them has been adapted to Big Data. To fill this gap, we propose the “3As Data Quality-in-Use model”, which is composed of three Data Quality characteristics for assessing the levels of Data Quality-in-Use in Big Data projects: Contextual Adequacy, Operational Adequacy and Temporal Adequacy. The model can be integrated into any sort of Big Data project, as it is independent of any pre-conditions or technologies. The paper shows how to use the model with a working example. The model addresses every challenge related to a Data Quality program aimed at Big Data. The main conclusion is that the model is an appropriate way to obtain the Quality-in-Use levels of the input data of a Big Data analysis, and that those levels can be understood as indicators of the trustworthiness and soundness of the results of that analysis.

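As the highway network has been built out and improved, transport efficiency between regions has risen greatly. How to raise the safety level of road transport, however, remains a major problem in urgent need of a solution. Against this background, this work studies the operational safety behavior of "two passenger, one hazardous" commercial vehicles (liang ke yi wei, i.e. two classes of passenger vehicles plus hazardous-goods vehicles) in road transport. It first analyzes the operating characteristics of these vehicles, then builds an algorithmic model for identifying their illegal operation, and finally designs, on the basis of big data technology, a system framework for dynamic, whole-process safety supervision of illegal operation. Application shows that the big-data-based dynamic supervision algorithm can accurately identify suspected illegal operation by these vehicles and provides an effective basis for integrated law enforcement.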

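The paper gives a comprehensive analysis of domestic and international research on schema matching. It studies the complex schema matching process from the matching perspective, with an emphasis on structural schema matching, and combines structural similarity with linguistic similarity. On top of linguistic matching, structural matching is carried out by category: matching proceeds top-down, first over non-leaf nodes and then over leaf nodes, and the non-leaf matching results are propagated to guide the leaf matching results. This is a new method that uses the structural information between elements to assist schema matching, with the ultimate aim of improving the accuracy of the matching results.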

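Data cleaning is a very important step in data warehousing and data mining. This paper first analyzes and summarizes the concepts related to data cleaning, sets out the quality problems that data cleaning has to solve, and summarizes the techniques and methods for solving them. On this basis it proposes a human-centered process model for data cleaning. The model integrates workflow technology, data integration, data transformation and data mining techniques. The basic functions that each toolbox should provide are also given.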

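Process-driven construction of information systems is now used more and more widely. In the process-driven approach, the process model has a non-negligible influence on the data model. Current anomaly detection methods for data models, however, address only the characteristics of the data model itself and do not consider the process model. Likewise, verification methods for process models give no consideration to the data model. This paper proposes and analyzes the anomaly problem of business-process-oriented data models and gives its three basic types. To detect these anomalies, the paper proposes the Data-process Graph (DP-Graph) model, which studies the data model and the process model within a unified framework. Then, based on DP-Graph, it proposes the DPGT algorithm, which effectively detects anomalies in business-process-oriented data models. The experimental results in the paper confirm the high detection rate of the DPGT algorithm for these anomalies.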

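Complex Event Processing (CEP) is a technology that emerged alongside streaming data. It is used to discover event patterns of interest in event streams whose events come from different data sources and arrive in mixed order. As data volumes keep growing, however, traditional CEP techniques often cannot meet the need to extract event patterns effectively from big data sets. To address this problem, this paper combines the ideas of cluster analysis and association rules from data mining, proposes a "complex event processing" algorithm, and deploys it on the distributed platform Hadoop so as to discover complex event relationships in big data sets, effectively overcoming the limitations that traditional techniques face with massive data. Finally, the algorithm is applied to a large GPS data set to discover the complex event patterns in it, and experiments verify that the method is feasible and effective.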

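This paper designs a new model for managing and analyzing big data: the Random Sample Partition (RSP) model. It represents a big data file as a set of RSP data block files that are stored in a distributed manner on the nodes of a cluster. The RSP generation operation makes the distribution of each RSP data block statistically consistent with the distribution of the big data as a whole, so each RSP data block is a random sample of the big data and can be used to estimate its statistical characteristics or to build classification and regression models for it. With the RSP model, big data analysis tasks can be completed by analyzing RSP data blocks, without computing over the entire data set, which greatly reduces the amount of computation, lowers the demand on computing resources, and improves the computing capacity and scalability of the cluster. The paper first gives the definition, theoretical basis and generation method of the RSP model. It then introduces the Alpha computing framework for asymptotic ensemble learning based on RSP data blocks. Next it discusses computing techniques for big data analysis based on the RSP model and the Alpha framework, including data exploration and cleaning, probability density function estimation, supervised subspace learning, semi-supervised ensemble learning, clustering ensembles and outlier detection. Finally it discusses the innovations of the RSP model in divide-and-conquer big data analysis and in sampling methods, as well as the advantages of the RSP model and the Alpha computing framework for large-scale data analysis.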
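
The core idea in the abstract above, that each block is itself a random sample of the whole data set, can be illustrated with a small sketch. This is not the paper's generation procedure or its Alpha framework: the function names `random_sample_partition` and `estimate_mean` are hypothetical, the data is assumed to fit in memory, and a single global shuffle stands in for whatever distributed generation method the paper defines.

```python
import random
import statistics


def random_sample_partition(records, num_blocks, seed=0):
    """Shuffle the records, then deal them round-robin into num_blocks blocks.

    After a uniform shuffle, every block is a simple random sample of the
    input, which is the property the abstract attributes to RSP blocks.
    (Hypothetical in-memory stand-in for a distributed implementation.)
    """
    rng = random.Random(seed)
    shuffled = list(records)
    rng.shuffle(shuffled)
    return [shuffled[i::num_blocks] for i in range(num_blocks)]


def estimate_mean(blocks, blocks_used):
    """Estimate the overall mean from the first blocks_used blocks only."""
    return statistics.fmean(statistics.fmean(b) for b in blocks[:blocks_used])


if __name__ == "__main__":
    data = [float(x) for x in range(100_000)]
    blocks = random_sample_partition(data, num_blocks=100)
    # Two blocks out of 100 already give a rough estimate of the full mean.
    print(estimate_mean(blocks, blocks_used=2), statistics.fmean(data))
```

Averaging block means is only an unbiased shortcut here because the blocks have equal size; with unequal blocks, the block means would need to be weighted by block length.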

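To address the poor flexibility and generality of traditional implementations of data quality assessment, this work studies metadata applications and data quality assessment frameworks, focusing on the role of metadata in the assessment process, on the dimensions of data quality assessment and on assessment algorithms. It identifies basic metadata, assessment-control metadata and assessment-algorithm metadata, and builds a metadata model. Practical application shows that the model has good flexibility and generality.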

This study proposes a new technique, the Pattern Adaptive Neural Network (PANN), for simplifying existing noise detection and removal methods. The technique is based on a modified backpropagation algorithm that applies a fuzzy membership function to the error term. It can make use of noisy data in a single step, automatically adjusting how much each data item contributes to network training. It is demonstrated through an application to an oil well data set. The results show that PANN's predictions agreed well with the expert interpretations of the data quality of that data set.

Value creation is a major factor not only in the sustainability of organizations but also in the maximization of profit, customer retention, the fulfillment of business goals, and revenue. When value is to be created in Big Data scenarios, value creation has to be understood across a broader range of complexity. The question that arises is how organizations can use this massive quantity of data to create business value. The present study seeks to provide a model for creating organizational value using Big Data Analytics (BDA). To this end, after a review of the related literature and interviews with experts, a BDA-based organizational value creation model is developed. Five hypotheses are then formulated and a questionnaire is prepared. The questionnaire is given to the statistical population of the study (IT managers and experts, particularly those specializing in data analysis) to test the research hypotheses. In the next phase, the connections between the model variables are examined using structural equation modeling (measurement and structural models). The results indicate that examining the infrastructure of Big Data Analytics, together with the capabilities of the organization and those of Big Data Analytics, is the initial requirement for creating organizational value using BDA. On that basis the Big Data Analytics strategy is formulated, and ultimately organizational value is created.

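For the large volumes of data generated and accumulated in complex industrial processes, this work analyzes the technical foundations of data mining and the characteristics of the data, proposes an integrated data mining model, and gives an implementation scheme for data mining based on SQL Server 2000, offering a reference for knowledge discovery in complex industrial processes.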

Most modern technologies, such as social media, smart cities, and the internet of things (IoT), rely on big data. When big data is used in real-world applications, two data challenges arise: class overlap and class imbalance. When dealing with large datasets, most traditional classifiers get stuck in local optima. As a result, it is necessary to look into new methods for dealing with large data collections. Several solutions have been proposed for overcoming this issue. The rapid growth of the available data threatens to limit the usefulness of many traditional methods. Methods such as oversampling and undersampling have shown great promise in addressing class imbalance. Among these techniques, the Synthetic Minority Oversampling TechniquE (SMOTE) has produced the best results, generating synthetic samples for the minority class to create a balanced dataset. The issue is that the practical applicability of these methods is restricted to problems involving tens of thousands of instances or fewer. In this paper, we propose a parallel method that combines SMOTE with a MapReduce strategy, distributing the operation of the algorithm among a group of computational nodes to address this problem. The proposed solution is divided into three stages. The first stage splits the data into blocks using a mapping function; it is followed by a pre-processing step on each map block that applies a hybrid SMOTE algorithm to solve the class imbalance problem. A decision tree model is then constructed on each map block. Finally, the decision tree blocks are combined to create a classification model. We used numerous datasets with up to 4 million instances in our experiments to test the capabilities of the proposed scheme. The results suggest that the hybrid SMOTE scales well within the proposed framework and also cuts down the processing time.
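
The first two stages described above (split the data into blocks, then oversample the minority class inside each block) can be sketched as follows. This is an illustration under stated assumptions, not the authors' method: it uses plain SMOTE interpolation rather than their hybrid variant, assumes a binary problem with numeric feature tuples, runs the "map" step as an ordinary loop instead of on a MapReduce cluster, and omits the decision-tree and model-combination stages. All function names are hypothetical.

```python
import math
import random


def smote(minority, n_new, k=3, rng=None):
    """Create n_new synthetic points by interpolating between a minority
    point and one of its k nearest minority neighbours (plain SMOTE)."""
    rng = rng or random.Random(0)
    if len(minority) < 2:
        return []  # no neighbour to interpolate towards
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        other = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> other
        synthetic.append(tuple(b + gap * (o - b) for b, o in zip(base, other)))
    return synthetic


def balance_block(block, rng):
    """Map step: oversample the smaller class inside a single block."""
    by_label = {}
    for features, label in block:
        by_label.setdefault(label, []).append(features)
    if len(by_label) < 2:
        return list(block)  # only one class present, nothing to balance
    minority_label = min(by_label, key=lambda lab: len(by_label[lab]))
    majority_size = max(len(points) for points in by_label.values())
    deficit = majority_size - len(by_label[minority_label])
    new_points = smote(by_label[minority_label], deficit, rng=rng)
    return list(block) + [(p, minority_label) for p in new_points]


def balance_in_blocks(data, num_blocks, seed=0):
    """Split data into num_blocks blocks and balance each independently."""
    rng = random.Random(seed)
    blocks = [data[i::num_blocks] for i in range(num_blocks)]
    return [balance_block(block, rng) for block in blocks]
```

Because each block is balanced using only its own minority points, the blocks never need to exchange data, which is what would make the step easy to distribute; the cost is that neighbours are searched within a block rather than across the whole dataset.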

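China has become the largest automobile market in the world, and electrification, connectivity, sharing and intelligence have become the core of the industry's future development. Against the background of these four trends, traditional car makers and new entrants alike are launching models of all kinds to seize an early lead. Monitoring real-time operating data of every kind has become a necessary means of national safety management and control. On this basis, the traditional Hbase+ElasticSearch technical route is adjusted by adding indexing, low-level computation and other techniques, to build a storage and analysis model that fits the characteristics of automotive enterprise big data.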

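Because aircraft carry different flight data recorders, their sampling rates differ. This paper studies, at different sampling rates, the use of parabolic interpolation to recover the true peak and valley values from the peak and valley points sampled from the time history of the normal load factor. The results show that as the sampling rate increases, the peak and valley values obtained after interpolation gradually stabilize and converge to the true peaks and valleys of the normal load factor. When the sampling rate is above 2, the recovered peak and valley values meet the accuracy requirements of fatigue damage calculation, and at a sampling rate of 4 the method recovers almost exactly the true peak and valley values. A suitable data processing method can therefore remove the loss of accuracy in fatigue damage calculation caused by differences in recorder sampling rate, which makes it possible to relax the requirement for high sampling rates in flight data recording equipment, saving storage space and improving storage and computing efficiency.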
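
Parabolic interpolation of a sampled extremum, the technique named in the abstract above, can be sketched in a few lines. The code fits a parabola through a sampled local peak or valley and its two neighbours, assuming equally spaced samples, and returns the vertex as the recovered extremum. The function names are hypothetical, and the paper's actual procedure (for example how it selects peak and valley points) may differ.

```python
def parabolic_peak(y_prev, y_mid, y_next):
    """Vertex of the parabola through three equally spaced samples.

    Returns (offset, value): offset is the vertex position relative to the
    middle sample, in units of the sample spacing, and value is the
    interpolated extremum.
    """
    denom = y_prev - 2.0 * y_mid + y_next
    if denom == 0:
        return 0.0, y_mid  # the three points are collinear: no curvature
    offset = 0.5 * (y_prev - y_next) / denom
    value = y_mid - 0.25 * (y_prev - y_next) * offset
    return offset, value


def recover_extrema(samples):
    """Refine every interior local peak or valley of a sampled signal.

    Returns a list of (position, value) pairs, with position expressed in
    fractional sample indices.
    """
    results = []
    for i in range(1, len(samples) - 1):
        a, b, c = samples[i - 1], samples[i], samples[i + 1]
        is_peak = b > a and b >= c
        is_valley = b < a and b <= c
        if is_peak or is_valley:
            offset, value = parabolic_peak(a, b, c)
            results.append((i + offset, value))
    return results
```

For a signal that really is a parabola near its peak, for example samples of 3 - (t - 2.3)^2 at t = 0, 1, ..., 5, the sampled maximum is 2.91 at t = 2, while the interpolated vertex comes out at t = 2.3 with value 3.0 (up to floating-point rounding).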

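In recent years blockchain technology has developed continuously, has attracted wide attention, and is commonly regarded as an important tool for solving data security problems. RFID big data is an important source of data in the Internet of Things, and its security requirements are very high. Data provenance tracking is one of the important application areas of RFID-based IoT technology, and is now widely used for tracing the origin of agricultural and livestock products, tracing raw materials and parts in industrial production, and anti-counterfeiting of consumer goods. Blockchain plays an important role in improving the security of big data provenance. This paper proposes a security model for RFID big data provenance based on blockchain technology and applies blockchain to the tracking and tracing of RFID big data, forming a provenance chain with multi-party participation in which information is transparent, shared and faithful. Blockchain ledgers are set up at the production, processing, sales and other stages of an RFID-traced item, establishing an end-to-end chained provenance path for RFID big data that reaches the end user directly, and thereby achieving secure management of RFID big data provenance.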

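Design for Quality is a design method that has emerged in recent years. Starting from Morup's view that quality is of two types, the paper analyzes design activities based on the Design for Quality approach. On that basis it proposes an information integration model for Design for Quality and a structured model of information transformation, and then discusses an overall data model to support Design for Quality.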

