Similar References
20 similar references retrieved.
1.
Code generation, defect prediction, and similar tasks built on large-scale open-source code data are important research topics in intelligent software development methods and techniques. However, existing work focuses mainly on intelligent algorithms for recommendation, prediction, and so on, and rarely evaluates or analyzes the quality of the data those studies rely on. Most data used in intelligent software development research comes from open-source hosting platforms and, being limited by the skill of individual developers, cannot be guaranteed to be of high quality. Following the principle of "garbage in, garbage out", this affects the quality of the final results. Although source-code data quality has a significant impact on related research, it has not received sufficient attention. To address this problem, a method-block data quality assessment approach for large-scale open-source code is proposed. It first studies how to define and assess the data quality problems of source code extracted from GitHub, and then evaluates the open-source code along multiple quality dimensions. This assessment approach can help researchers build higher-quality datasets and thereby improve the results of intelligence-oriented research such as code generation and defect prediction.
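The abstract does not name concrete metrics, so the following Python sketch only illustrates the general idea of scoring extracted method blocks along a few hypothetical dimensions (duplication, trivial length, missing comments); the function names and thresholds are assumptions, not the paper's method.

```python
from collections import Counter

def assess_method_blocks(method_bodies, min_tokens=5):
    """Score extracted method bodies along hypothetical quality dimensions.

    Returns the share of duplicated, trivially short, and undocumented
    method blocks (illustrative metrics only).
    """
    n = len(method_bodies)
    if n == 0:
        return {}
    counts = Counter(body.strip() for body in method_bodies)
    duplicated = sum(c for c in counts.values() if c > 1)
    too_short = sum(1 for body in method_bodies if len(body.split()) < min_tokens)
    undocumented = sum(1 for body in method_bodies
                       if "#" not in body and '"""' not in body)
    return {
        "duplicate_ratio": duplicated / n,
        "trivial_ratio": too_short / n,
        "undocumented_ratio": undocumented / n,
    }

# Example: three toy "method blocks" extracted from a repository
print(assess_method_blocks([
    "def f(x): return x",
    "def f(x): return x",
    'def g(a, b):\n    """Add two numbers."""\n    return a + b',
]))
```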

2.
Research on the correlations among multiple data quality properties   (Total citations: 1; self-citations: 0; by others: 1)
As data grows massively in the information age, users need multiple indicators to evaluate and improve data quality from different perspectives. In current data quality management, however, the important factors affecting data usability are not fully independent: they interact with one another both in assessment mechanisms and when guiding data cleaning rules. This paper studies a comprehensive data quality assessment method suitable for real information systems. Data quality property indicators proposed in the literature and commonly used in practical information systems are summarized according to their definitions and properties, and a property-based comprehensive assessment framework is proposed. For the four key properties affecting data usability, namely accuracy, completeness, consistency, and timeliness, the corresponding operations on data sets are worked out, the definitions of their violation patterns are introduced one by one, and the concrete relationships among them are proved, leading to an assessment strategy for the multi-dimensional correlations of data quality. Experiments verify the effectiveness of the strategy.
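The abstract names the four properties but not their formulas; the Python sketch below only assumes common textbook definitions (completeness as the share of non-missing values, consistency as the share of records satisfying a rule, timeliness as the share of records newer than a cutoff). The column names and rules are illustrative.

```python
from datetime import datetime, timedelta

# Toy records with possible quality issues (illustrative schema only)
records = [
    {"id": 1, "age": 30,   "email": "a@example.com", "updated": datetime(2024, 5, 1)},
    {"id": 2, "age": None, "email": "bad-email",     "updated": datetime(2020, 1, 1)},
    {"id": 3, "age": -4,   "email": "c@example.com", "updated": datetime(2024, 6, 1)},
]

def completeness(rows, field):
    """Share of records whose field is not missing."""
    return sum(r[field] is not None for r in rows) / len(rows)

def consistency(rows, rule):
    """Share of records satisfying a user-supplied consistency rule."""
    return sum(bool(rule(r)) for r in rows) / len(rows)

def timeliness(rows, max_age_days=365, now=datetime(2024, 7, 1)):
    """Share of records updated within the allowed time window."""
    cutoff = now - timedelta(days=max_age_days)
    return sum(r["updated"] >= cutoff for r in rows) / len(rows)

print("completeness(age):", completeness(records, "age"))
print("consistency(age>=0):", consistency(records, lambda r: r["age"] is not None and r["age"] >= 0))
print("timeliness:", timeliness(records))
```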

3.
Research on data quality assessment for enterprise informatization   (Total citations: 1; self-citations: 0; by others: 1)
Data quality is an important challenge in enterprise informatization, yet research on data quality assessment has not received enough attention. Starting from the definition of data quality, this paper describes the different quality dimensions and how their assessment indicators are determined, and, based on a comparative analysis of existing work, presents both subjective and objective assessment methods. By adopting the idea of reusable services in an SOA context, a service framework for data quality assessment is designed; based on this framework, services for input/output, process management, and automated assessment are described, and all functional requirements are implemented as Web Services components.

4.
To date, there is neither a systematic international standard for data quality assessment nor a complete data quality assessment system. Based on a study of international and domestic work on data quality, this paper analyzes the data quality needs of large enterprises, proposes a data quality metamodel framework, and builds a data quality assessment system. The system covers the classification and definition of data quality, algorithms for data quality assessment indicators, and the assessment system and its process, providing enterprises with a reliable basis for assessing data quality.

5.
To address the poor flexibility and generality of traditional implementations of data quality assessment, this work studies metadata applications and data quality assessment systems, focusing on the role of metadata in the assessment process, the assessment dimensions, and the assessment algorithms. It identifies basic metadata, assessment-control metadata, and assessment-algorithm metadata, and builds a metadata model. Practical applications show that the model has good flexibility and generality.
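The abstract identifies three metadata categories (basic, assessment-control, and assessment-algorithm metadata) but not their schemas; the dataclass sketch below is only one hypothetical way such a metadata model could drive assessments generically, with all field names assumed.

```python
from dataclasses import dataclass
from typing import Callable, Sequence

@dataclass
class BasicMetadata:            # describes the data object being assessed
    table: str
    column: str

@dataclass
class AlgorithmMetadata:        # names and implements one assessment algorithm
    name: str
    evaluate: Callable[[Sequence], float]   # values -> score in [0, 1]

@dataclass
class ControlMetadata:          # binds data, algorithm, and an acceptance threshold
    target: BasicMetadata
    algorithm: AlgorithmMetadata
    threshold: float

def run_assessment(control: ControlMetadata, values: Sequence) -> dict:
    """Execute one metadata-driven assessment and report pass/fail."""
    score = control.algorithm.evaluate(values)
    return {"table": control.target.table, "column": control.target.column,
            "algorithm": control.algorithm.name, "score": score,
            "passed": score >= control.threshold}

# Example: completeness of the 'age' column, driven entirely by metadata
completeness = AlgorithmMetadata(
    "completeness", lambda vs: sum(v is not None for v in vs) / len(vs))
control = ControlMetadata(BasicMetadata("customer", "age"), completeness, threshold=0.9)
print(run_assessment(control, [30, None, 41, 28]))
```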

6.
Because the data provided by information systems is often of low quality (incomplete, inconsistent, duplicated, etc.), managers commonly face the dilemma of being "data rich but information poor" when making decisions, a widespread phenomenon in enterprises today. To genuinely improve the usability of such data, this work studies the main factors affecting the data quality of relational databases, proposes a unified metadata model for multiple data sources and a database data quality assessment model, and builds a set of interactive visual forms for quality assessment. A data quality visual analysis system for relational databases is implemented and validated against concrete enterprise applications. The results show that the system can effectively analyze data quality and improve the reliability and accuracy of enterprise analysis and decision making.

7.
Data quality is the foundation, precondition, and guarantee of the validity and accuracy of data mining and data analysis results, and data quality assessment is the key to solving data quality problems. Assessment criteria are diverse, among which accuracy is an important indicator. A data quality assessment platform based on the OpenShift cloud computing environment is designed and implemented, using Benford's law to assess data accuracy.
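Benford's law states that in many naturally occurring numeric datasets the leading digit d appears with probability log10(1 + 1/d). The abstract does not give the platform's exact test, so this Python sketch only shows one common way to compare observed first-digit frequencies against the Benford distribution; the chi-square-style deviation score is an assumption.

```python
import math
from collections import Counter

BENFORD = {d: math.log10(1 + 1 / d) for d in range(1, 10)}

def first_digit(x):
    """Leading non-zero digit of a number's absolute value."""
    s = str(abs(x)).lstrip("0.")
    return int(s[0]) if s and s[0].isdigit() else None

def benford_deviation(values):
    """Chi-square-style deviation between observed first-digit frequencies
    and the Benford distribution; larger means less Benford-like."""
    digits = [d for d in (first_digit(v) for v in values) if d]
    if not digits:
        return float("nan")
    n = len(digits)
    observed = Counter(digits)
    return sum((observed.get(d, 0) / n - p) ** 2 / p for d, p in BENFORD.items())

# Example: a Benford-like series vs. one whose leading digit is always 1
natural = [2 ** k for k in range(1, 60)]      # powers of two follow Benford closely
suspicious = list(range(100, 160))            # leading digit is always 1
print(benford_deviation(natural), benford_deviation(suspicious))
```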

8.
Research on data quality assessment methods   (Total citations: 24; self-citations: 0; by others: 24)
Data quality management has become a key issue in today's data management and has been widely studied and applied. Data quality assessment, a necessary process and fundamental part of data quality management, currently lacks a quantitative and systematic method. To address this, the paper introduces some basic data quality assessment indicators, proposes a data quality assessment model, and describes the model's construction techniques and computation methods.

9.
In today's big data era, ensuring data quality is a precondition for realizing the value of big data, and data quality assessment is an important research topic. Based on a rule-base approach to data quality assessment, this paper proposes an overall assessment model comprising rules, a rule base, data quality assessment indicators, assessment templates, and assessment reports. A rule assessment template is designed that combines rules from the rule base, assigns rule weights according to the importance of the corresponding quality indicators, and applies an assessment method that combines the simple-ratio method with the weighted-average method to compute the result and determine the data quality grade; data visualization techniques are used to present the assessment results. The approach considers both the pass rate of each individual rule and the weight of each rule within the assessment template, so data quality is assessed fairly and accurately and the results are presented concisely and intuitively.
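The paper's exact template format is not given, so the following Python sketch only illustrates combining per-rule pass rates (simple ratio) with rule weights (weighted average) into an overall score and a grade; the rules, weights, and grade thresholds are assumptions.

```python
# Toy records and rules; each rule returns True when a record passes
records = [
    {"age": 35,   "email": "x@example.com"},
    {"age": None, "email": "y@example.com"},
    {"age": 20,   "email": "not-an-email"},
]
rules = [
    ("age_present", 0.6, lambda r: r["age"] is not None),       # weight 0.6
    ("email_format", 0.4, lambda r: "@" in (r["email"] or "")),  # weight 0.4
]

def assess(records, rules, thresholds=((0.9, "A"), (0.7, "B"), (0.0, "C"))):
    """Weighted average of per-rule pass rates, mapped to a quality grade."""
    total_weight = sum(w for _, w, _ in rules)
    score = 0.0
    for name, weight, rule in rules:
        pass_rate = sum(bool(rule(r)) for r in records) / len(records)  # simple ratio
        print(f"{name}: pass rate {pass_rate:.2f}, weight {weight}")
        score += weight * pass_rate
    score /= total_weight
    grade = next(g for t, g in thresholds if score >= t)
    return score, grade

print(assess(records, rules))
```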

10.
A survey of research on data quality and data cleaning   (Total citations: 75; self-citations: 1; by others: 75)
郭志懋  周傲英 《软件学报》2002,13(11):2076-2082
This paper surveys research on data quality, in particular data cleaning. It first explains the importance of data quality and its measurement indicators and defines the data cleaning problem, then classifies data cleaning problems and analyzes approaches to solving them, describes how data cleaning research combines with other techniques, and analyzes several data cleaning frameworks. Finally, it gives an outlook on future research problems in the data cleaning field.

11.
These days, endless streams of data are generated by various sources such as sensors, applications, users, etc. Due to possible issues in sources, such as malfunctions in sensors, platforms, or communication, the generated data might be of low quality, and this can lead to wrong outcomes for the tasks that rely on these data streams. Therefore, controlling the quality of data streams has become increasingly significant. Many approaches have been proposed for controlling the quality of data streams, and hence, various research areas have emerged in this field. To the best of our knowledge, there is no systematic literature review of research papers within this field that comprehensively reviews approaches, classifies them, and highlights the challenges. In this paper, we present the state of the art in the area of quality control of data streams, and characterize it along four dimensions. The first dimension represents the goal of the quality analysis, which can be either quality assessment, or quality improvement. The second dimension focuses on the quality control method, which can be online, offline, or hybrid. The third dimension focuses on the quality control technique, and finally, the fourth dimension represents whether the quality control approach uses any contextual information (inherent, system, organizational, or spatiotemporal context) or not. We compare and critically review the related approaches proposed in the last two decades along these dimensions. We also discuss the open challenges and future research directions.

12.
Data Quality is a critical issue in today’s interconnected society. Advances in technology are making the use of the Internet an ever-growing phenomenon and we are witnessing the creation of a great variety of applications such as Web Portals. These applications are important data sources and/or means of accessing information which many people use to make decisions or to carry out tasks. Quality is a very important factor in any software product and also in data. As quality is a wide concept, quality models are usually used to assess the quality of a software product. From the software point of view there is a widely accepted standard proposed by ISO/IEC (the ISO/IEC 9126) which proposes a quality model for software products. However, until now a similar proposal for data quality has not existed. Although we have found some proposals of data quality models, some of them working as “de facto” standards, none of them focus specifically on web portal data quality and the user’s perspective. In this paper, we propose a set of 33 attributes which are relevant for portal data quality. These have been obtained from a revision of literature and a validation process carried out by means of a survey. Although these attributes do not conform to a usable model, we think that it might be considered as a good starting point for constructing one.

Angélica Caro has a PhD in Computer Science and is an Assistant Professor at the Department of Computer Science and Information Technologies of the Bio Bio University in Chillán, Chile. Her research interests include: data quality, Web portals, data quality in Web portals, and data quality measures. She is the author of papers in national and international conferences on this subject. Coral Calero has a PhD in Computer Science and is an Associate Professor at the Escuela Superior de Informatica of the Castilla-La Mancha University in Ciudad Real. She is a member of the Alarcos Research Group, in the same university, specialized in Information Systems, Databases and Software Engineering. Her research interests include: advanced database design, database quality, software metrics, and database metrics. She is the author of papers in national and international conferences on this subject. She has published in Information Systems Journal, Software Quality Journal, Information and Software Technology Journal and SIGMOD Record Journal. She has organized the web services quality workshop (WISE Conference, Rome 2003) and the Database Maintenance and Reengineering workshop (ICSM Conference, Montreal 2002). Ismael Caballero has an MSc and PhD in Computer Science from the Escuela Superior de Informática of the Castilla-La Mancha University in Ciudad Real. He currently works as an assistant professor in the Department of Information Systems and Technologies at the University of Castilla-La Mancha, and he has also been working in the R&D Department of Indra Sistemas since 2006. His research interests are focused on information quality management, information quality in SOA, and Global Software Development. Mario Piattini has an MSc and a PhD in Computer Science (Politechnical University of Madrid) and an MSc in Psychology (UNED). He is also a Certified Information System Auditor and a Certified Information System Manager by ISACA (Information System Audit and Control Association), as well as a Full Professor in the Department of Computer Science at the University of Castilla-La Mancha, in Ciudad Real, Spain. Furthermore, he is the author of several books and papers on databases, software engineering and information systems. He is a coeditor of several international books: "Advanced Databases Technology and Design", 2000, Artech House, UK; "Information and database quality", 2002, Kluwer Academic Publishers, Norwell, USA; "Component-based software quality: methods and techniques", 2004, Springer, Germany; "Conceptual Software Metrics", Imperial College Press, UK, 2005. He leads the ALARCOS research group of the Department of Computer Science at the University of Castilla-La Mancha, in Ciudad Real, Spain. His research interests are: advanced databases, database quality, software metrics, security and audit, and software maintenance.

13.
This research is motivated by large-scale pervasive sensing applications. We examine the benefits and costs of caching data for such applications. We propose and evaluate several approaches to querying for, and then caching data in a sensor field data server. We show that for some application requirements (i.e., when delay drives data quality), policies that emulate cache hits by computing and returning approximate values for sensor data yield a simultaneous quality improvement and cost saving. This win–win is because when system delay is sufficiently important, the benefit to both query cost and data quality achieved by using approximate values outweighs the negative impact on quality due to the approximation. In contrast, when data accuracy drives quality, a linear trade-off between query cost and data quality emerges. We also identify caching and lookup policies for which the sensor field query rate is bounded when servicing an arbitrary workload of user queries. This upper bound is achieved by having multiple user queries share the cost of a single sensor field query. Finally, we demonstrate that our results are robust to the manner in which the environment being monitored changes using models for two different sensing systems.
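The paper's exact caching policies are not given in the abstract; the Python sketch below only illustrates the general idea of emulating a cache hit with an approximate value: if a cached reading is recent enough, return it instead of issuing a costly sensor field query. The freshness bound and the class/parameter names are assumptions.

```python
import time

class ApproximatingCache:
    """Return a cached (approximate) sensor value when it is fresh enough,
    otherwise fall back to a costly sensor-field query."""

    def __init__(self, query_field, max_age_s=5.0):
        self.query_field = query_field      # callable: sensor_id -> value
        self.max_age_s = max_age_s          # freshness bound (assumption)
        self.cache = {}                     # sensor_id -> (timestamp, value)
        self.field_queries = 0

    def get(self, sensor_id):
        now = time.time()
        hit = self.cache.get(sensor_id)
        if hit and now - hit[0] <= self.max_age_s:
            return hit[1]                    # emulated hit: approximate value
        value = self.query_field(sensor_id)  # expensive sensor field query
        self.field_queries += 1
        self.cache[sensor_id] = (now, value)
        return value

# Example: many user queries share the cost of one field query per sensor
cache = ApproximatingCache(query_field=lambda sid: 20.0 + sid * 0.1)
readings = [cache.get(7) for _ in range(100)]
print(readings[0], "field queries issued:", cache.field_queries)
```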

14.
3D point cloud data obtained from laser scans, images, and videos are able to provide accurate and fast records of the 3D geometries of construction-related objects. Thus, the construction industry has been using point cloud data for a variety of purposes including 3D model reconstruction, geometry quality inspection, construction progress tracking, etc. Although a number of studies have been reported on applying point cloud data for the construction industry in recent decades, there has not been any systematic review that summarizes these applications and points out the research gaps and future research directions. This paper, therefore, aims to provide a thorough review on the applications of 3D point cloud data in the construction industry and to provide recommendations on future research directions in this area. A total of 197 research papers were collected in this study through a two-fold literature search, which were published within a fifteen-year period from 2004 to 2018. Based on the collected papers, applications of 3D point cloud data in the construction industry are reviewed according to three categories including (1) 3D model reconstruction, (2) geometry quality inspection, and (3) other applications. Following the literature review, this paper discusses the acquisition and processing of point cloud data, particularly focusing on how to properly perform data acquisition and processing to fulfill the needs of the intended construction applications. Specifically, the determination of required point cloud data quality and the determination of data acquisition parameters are discussed with regard to data acquisition, and the extraction and utilization of semantic information and the platforms for data visualization and processing are discussed with regard to data processing. Based on the review of applications and the following discussions, research gaps and future research directions are recommended including (1) application-oriented data acquisition, (2) semantic enrichment for as-is BIM, (3) geometry quality inspection in fabrication phase, and (4) real-time visualization and processing.

15.
Data mining tools can be very beneficial for discovering interesting and useful patterns in complicated manufacturing processes. These patterns can be used, for example, to improve manufacturing quality. However, data accumulated in manufacturing plants have unique characteristics, such as unbalanced distribution of the target attribute, and a small training set relative to the number of input features. Thus, conventional methods are inaccurate in quality improvement cases. Recent research shows, however, that a decomposition tactic may be appropriate here and this paper presents a new feature set decomposition methodology that is capable of dealing with the data characteristics associated with quality improvement. In order to examine the idea, a new algorithm called BOW (Breadth-Oblivious-Wrapper) has been developed. This algorithm performs a breadth first search while using a new F-measure splitting criterion for multiple oblivious trees. The new algorithm was tested on various real-world manufacturing datasets, specifically the food processing industry and integrated circuit fabrication. The obtained results have been compared to other methods, indicating the superiority of the proposed methodology. Received: September 2004 / Accepted: September 2005
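The abstract mentions an F-measure splitting criterion but not its exact form; the sketch below only assumes the standard F-beta definition (harmonic mean of precision and recall when beta = 1) evaluated for a candidate single-feature split over a small labeled sample. The feature names and the threshold search are illustrative, not BOW itself.

```python
def f_measure(tp, fp, fn, beta=1.0):
    """F-beta score from confusion counts (beta=1 gives the harmonic mean
    of precision and recall)."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)

def best_split(samples, feature):
    """Pick the threshold on one numeric feature that maximizes the F-measure
    of the rule 'predict defective if value > threshold'."""
    best_threshold, best_score = None, -1.0
    for threshold in sorted({s[feature] for s in samples}):
        tp = sum(s[feature] > threshold and s["label"] for s in samples)
        fp = sum(s[feature] > threshold and not s["label"] for s in samples)
        fn = sum(s[feature] <= threshold and s["label"] for s in samples)
        score = f_measure(tp, fp, fn)
        if score > best_score:
            best_threshold, best_score = threshold, score
    return best_threshold, best_score

# Toy unbalanced manufacturing data: label True marks a defective unit
samples = [{"temp": t, "label": t > 80} for t in (60, 65, 70, 82, 85, 90, 72, 95)]
print(best_split(samples, "temp"))
```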

16.
The quality of the data is directly related to the quality of the models drawn from that data. For that reason, much research is devoted to improving the quality of the data and to amending errors that it may contain. One of the most common problems is the presence of noise in classification tasks, where noise refers to the incorrect labeling of training instances. This problem is very disruptive, as it changes the decision boundaries of the problem. Big Data problems pose a new challenge in terms of quality data due to the massive and unsupervised accumulation of data. This Big Data scenario also brings new problems to classic data preprocessing algorithms, as they are not prepared for working with such amounts of data, and these algorithms are key to moving from Big to Smart Data. In this paper, an iterative ensemble filter for removing noisy instances in Big Data scenarios is proposed. Experiments carried out on six Big Data datasets have shown that our noise filter outperforms the current state-of-the-art noise filter in Big Data domains. It has also proved to be an effective solution for transforming raw Big Data into Smart Data.
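The paper's distributed implementation is not described in the abstract; the sketch below, using scikit-learn and NumPy, only shows the classic idea behind iterative ensemble (majority-vote) noise filtering: instances whose labels are rejected by most cross-validated classifiers are treated as label noise and removed. The choice of classifiers, the number of rounds, and the stopping rule are assumptions.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict
from sklearn.tree import DecisionTreeClassifier

def iterative_ensemble_filter(X, y, max_rounds=3):
    """Iteratively drop instances whose label disagrees with the majority of
    cross-validated predictions from a small ensemble of classifiers."""
    keep = np.arange(len(y))
    models = [DecisionTreeClassifier(random_state=0),
              LogisticRegression(max_iter=1000),
              RandomForestClassifier(n_estimators=50, random_state=0)]
    for _ in range(max_rounds):
        Xk, yk = X[keep], y[keep]
        votes = np.stack([cross_val_predict(m, Xk, yk, cv=5) for m in models])
        wrong = (votes != yk).sum(axis=0)     # how many models reject the label
        noisy = wrong > len(models) // 2      # majority vote
        if not noisy.any():
            break
        keep = keep[~noisy]
    return keep  # indices of instances kept as clean

# Example with synthetic data and 10% randomly flipped labels
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))
y = (X[:, 0] + X[:, 1] > 0).astype(int)
flip = rng.choice(len(y), size=50, replace=False)
y[flip] ^= 1
clean_idx = iterative_ensemble_filter(X, y)
print(len(clean_idx), "instances kept out of", len(y))
```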

17.
Scientific data submission is an important data foundation for scientific data sharing, and data submission and sharing deserve particular attention in major research programs. After the National Natural Science Foundation of China formally launched the major research plan "Integrated Research on the Eco-hydrological Processes of the Heihe River Basin" (hereafter the "Heihe Plan") in 2010, the submission and management of the plan's data were discussed repeatedly, and in 2012 the Heihe Plan Data Submission and Sharing Management Regulations (hereafter the "Regulations") were issued, with the Heihe Plan Data Management Center responsible for their implementation. Based on the Regulations, the center designed and implemented a scientific data submission management system for the Heihe Plan. This paper reviews the state of scientific data submission, introduces the core content of the Regulations, designs the technical workflow of data submission for the Heihe Plan, embeds the protection of data providers' rights into the data service process, and manages data sharing according to the plan's characteristics. Key technical issues in implementing the submission system are discussed in four aspects: metadata authoring by integrating the GeoNetwork system, data review aimed at improving data quality, efficient data services, and knowledge mining from the data.

18.
This paper studies the application of data mining to online education with the aim of improving online teaching quality and learning efficiency. It proposes a data mining application based on OLAP technology and an agent-based system architecture model, and builds an auxiliary management system for online teaching.

19.
This paper studies the application of data mining to online education with the aim of improving online teaching quality and learning efficiency. It proposes a data mining application based on OLAP technology and an agent-based system architecture model, and builds an auxiliary management system for online teaching.

20.
With the emergence and continuous accumulation of massive data, data governance has become an important means of improving data quality and maximizing data value. Data error detection is a key step in improving data quality and has attracted broad attention from academia and industry in recent years. Most existing error detection methods, however, apply only to single-source scenarios, whereas in practice data are rarely stored and managed centrally. Highly correlated data from different sources can improve detection accuracy, but due to data privacy and security concerns, cross-source data usually cannot be pooled and shared. This work therefore proposes FeLeDetect, a federated-learning-based cross-source error detection method that uses cross-source information to improve detection accuracy while preserving data privacy. To fully capture the characteristics of each data source, a graph-based error detection model, GEDM, is first proposed, and on top of it a federated co-training algorithm, FCTA, is designed so that GEDM can be trained collaboratively across sources without any party's data leaving its local site. A series of optimizations is further proposed to reduce the communication overhead of federated training and the cost of manual labeling. Extensive experiments on three real-world datasets show that: (1) compared with five state-of-the-art error detection methods, GEDM improves the F1 score of error detection by 10.3% and 25.2% on average in the local and centralized scenarios, respectively; (2) the F1 score of FeLeDetect's error detection results, compared with the local scenario...
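The abstract names GEDM and FCTA but gives no algorithmic details; the NumPy sketch below only shows the generic federated-averaging pattern that FCTA-style co-training builds on: each source trains on its local data and only model parameters (never raw records) are averaged by a coordinator. The model (logistic regression), learning rate, and round count are assumptions, not the paper's design.

```python
import numpy as np

def local_update(w, X, y, lr=0.1, epochs=5):
    """A few epochs of logistic-regression gradient descent on one source's data."""
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(-X @ w))
        w -= lr * X.T @ (p - y) / len(y)
    return w

def federated_rounds(sources, dim, rounds=10):
    """Coordinator averages parameters from all sources each round;
    raw data never leaves a source."""
    w_global = np.zeros(dim)
    for _ in range(rounds):
        local_ws = [local_update(w_global.copy(), X, y) for X, y in sources]
        sizes = np.array([len(y) for _, y in sources], dtype=float)
        w_global = np.average(local_ws, axis=0, weights=sizes)  # weighted FedAvg
    return w_global

# Two toy "data sources" sharing the same underlying labeling rule
rng = np.random.default_rng(1)
def make_source(n):
    X = rng.normal(size=(n, 3))
    y = (X @ np.array([1.0, -2.0, 0.5]) > 0).astype(float)
    return X, y

w = federated_rounds([make_source(200), make_source(300)], dim=3)
print("learned weights:", np.round(w, 2))
```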
