Similar literature
 20 similar documents found
1.
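To improve the performance of ensemble learning, an ensemble method combining Rotation Forest and MultiBoost is proposed. The rotation transformation from Rotation Forest is applied to the original dataset to increase the diversity among classifiers, and MultiBoost is used to train base classifiers on the transformed dataset to raise their accuracy. Finally, simple majority voting fuses the decisions of the base classifiers into the output of the ensemble classifier. To verify the method's effectiveness, experiments were run on public UCI datasets; the results show that the method achieves high classification accuracy.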
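
The final fusion step described above, simple majority voting, can be sketched as follows. This is only an illustration of the voting rule, not the paper's implementation: the function name `majority_vote`, the input layout, and the tie-breaking rule (first label seen wins) are assumptions made for the example.

```python
def majority_vote(predictions):
    """Fuse base-classifier outputs into one label per sample.

    predictions[i][j] is the label classifier i assigns to sample j
    (hypothetical layout chosen for this sketch).
    """
    fused = []
    for sample_votes in zip(*predictions):
        counts = {}
        for label in sample_votes:
            counts[label] = counts.get(label, 0) + 1
        # max() returns the first key with the highest count, so ties go
        # to the label that was voted for first (an assumed tie rule).
        fused.append(max(counts, key=counts.get))
    return fused


votes = [
    ["a", "b", "b"],  # classifier 1
    ["a", "a", "b"],  # classifier 2
    ["b", "a", "b"],  # classifier 3
]
print(majority_vote(votes))
```

In the method summarized above, each row would come from a base classifier trained on the rotated data; here the rows are made-up labels.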

2.
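Zhang Yan (张棪), Cao Jian (曹健). Computer Science (《计算机科学》), 2016, 43(Z6): 374-379, 383
As a predictive model in machine learning, the decision tree is widely used across many fields because its output is easy to understand and interpret, and it has become an active research topic. With the sharp increase in the rate of data generation, conventional decision tree algorithms cannot process large datasets because of limits on memory capacity and processor speed, so their implementations need targeted adaptation. This paper first describes the basic decision tree algorithms and optimization methods. On that basis, and in light of the challenges posed by big data, it classifies and compares the strengths and weaknesses of the algorithms designed for this setting and introduces the platforms that support them. Finally, it discusses future directions for decision tree algorithms for big data.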

3.
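A key problem in human action recognition is how to represent high-dimensional human actions and build an accurate, stable classification model. This paper proposes an effective human action recognition algorithm based on hybrid features. The algorithm fuses appearance-structure-based polar-coordinate features of key human joints with optical-flow-based motion features, which captures the motion information in video sequences more effectively and improves recognition timeliness. It also proposes a frame-based selective ensemble rotation forest classification model (SERF), which incorporates a selective ensemble strategy into the selection of rotation forest base classifiers and thereby increases the diversity among them. Experiments show that the SERF model achieves high classification accuracy and strong robustness.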

4.
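To solve real problems, big data analysis and processing systems need to acquire data, yet the actual data collected in real scenarios are usually incomplete. In addition, most solutions are either problem-driven or limited to data analysis alone, so the tuning and setting of operating parameters is largely blind and the application struggles to be intelligent. To address this, the paper proposes the concept and framework of parallel data: genuinely virtual big data are generated from actual data through computational experiments and, drawing on Merton's law, the expected solution is placed in generalized duality with the problem so that the big data are focused on the actual problem. Actual data and virtual data interact dynamically and evolve in parallel, forming a process in which the virtual and the real reinforce each other and the data change dynamically, so that the data eventually acquire intelligence and can be used to solve unknown problems. Parallel data are not only a form of data representation but also a mechanism and mode of data evolution, characterized by virtual-real interaction; the dynamic trajectories of all the data constitute a data dynamics system. Parallel data offer a new paradigm for data processing, representation, mining and application.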

5.
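In big data environments, when machine learning algorithms are used to classify training samples, the high dimensionality of the training data severely limits the performance of the classification algorithm. Exploiting the sparsity induced by the L1 criterion, this paper proposes an online feature extraction algorithm and uses it to classify training instances. The algorithm's performance was analysed on public datasets; the results show that the proposed online feature extraction algorithm classifies training instances accurately and is therefore better suited to data mining in big data environments.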
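
The abstract does not spell out the algorithm, so the following is only a generic sketch of how L1 sparsity can drive online feature selection: a per-example gradient step followed by soft-thresholding, which sets small weights exactly to zero. The squared-error loss, the function names, and the `lr` and `lam` parameters are all assumptions for illustration, not the paper's method.

```python
def soft_threshold(w, lam):
    """Proximal operator of lam * |w|: shrink w toward zero by lam."""
    if w > lam:
        return w - lam
    if w < -lam:
        return w + lam
    return 0.0


def online_l1_update(weights, x, y, lr=0.1, lam=0.05):
    """One proximal-gradient step on squared error for one example (x, y)."""
    pred = sum(wi * xi for wi, xi in zip(weights, x))
    err = pred - y
    # Gradient step per weight, then shrink by lr * lam to enforce sparsity.
    return [
        soft_threshold(wi - lr * err * xi, lr * lam)
        for wi, xi in zip(weights, x)
    ]


w = [0.0, 0.0]
w = online_l1_update(w, x=[1.0, 0.0], y=1.0)
# The second feature never carried signal, so its weight stays exactly zero.
print(w)
```

Features whose weights remain zero after the stream has been processed can be treated as discarded; the nonzero ones are the selected features.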

6.
Although advances in computer science have made it possible to collect, store, and access vast amounts of data efficiently, annotating that data remains a hard problem, since no automated mechanism performs reliably enough. The difficulty is even more pronounced when the objective is high generalization ability. Active learning is a scheme for tackling such problems: it controls the human effort required under a trade-off against achieving higher accuracy. In this work, the well-known Rotation Forest algorithm is integrated with active learning theory to construct a robust and accurate classifier. Thus, apart from exploiting the collected labeled data, unlabeled data are mined through appropriate queries, and human expert decisions on the most questionable instances enrich the initial labeled set. The proposed algorithm was compared comprehensively against four distinct learners within the same learning scheme. The baseline strategy of random sampling and the corresponding supervised scenarios were also included. In the evaluation, 13 publicly available multiclass datasets were assessed, and the results verified our assumptions, including the significant superiority of the proposed algorithm over the majority of its rivals.
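
As an illustration of what selecting "the most questionable" unlabeled instances can mean, here is a minimal margin-based uncertainty sampling sketch. The abstract does not name its query strategy, so the margin criterion, the function name `most_uncertain`, and the probability values are assumptions.

```python
def most_uncertain(probabilities, k):
    """Return indices of the k samples with the smallest top-two margin.

    probabilities[i] holds the class probabilities a classifier assigns
    to unlabeled sample i (at least two classes are assumed).
    """
    def margin(p):
        top = sorted(p, reverse=True)
        return top[0] - top[1]

    ranked = sorted(range(len(probabilities)),
                    key=lambda i: margin(probabilities[i]))
    return ranked[:k]


pool = [
    [0.90, 0.05, 0.05],  # confident prediction
    [0.40, 0.30, 0.30],  # fairly uncertain
    [0.50, 0.45, 0.05],  # top two classes almost tied
]
print(most_uncertain(pool, 2))
```

The returned indices would be sent to the human expert for labeling, and the newly labeled samples added to the training set before the classifier is retrained.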

7.
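Regression models such as linear regression, SVR and most multivariate regression trees cannot use categorical attributes directly in regression analysis. To address this, a decision tree node splitting method that can combine attributes of multiple types is proposed. By defining the center of a sample set on a categorical attribute and the distance from a sample to that center, the method lets categorical attributes take part in sample clustering in the same way as numerical attributes, which yields a partition of the sample set. Subsequently, for the decision trees produced by this method, the paper also...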

8.
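The carbon budget of terrestrial ecosystems is an important indicator in global carbon cycle research and an important parameter of climate change. To address the uncertainty in estimating it, this study builds machine learning models from carbon flux measurements of terrestrial ecosystem flux observation networks and from remote sensing satellite data products. The random forest algorithm was chosen to learn features automatically from a high-quality satellite-ground training dataset and to mine the information implicit in the data and the differences in temporal dependencies. Random-forest-based estimation models were built for the carbon budget parameters GPP (Gross Primary Production) and NEP (Net Ecosystem Production), and were evaluated objectively on a validation dataset using standard metrics. The results show that, compared with the MODIS GPP product, the method improves estimation accuracy. Predictions were best for deciduous broadleaf forest, with a coefficient of determination R² of 0.82 and a root mean square error of 1.93 g C m⁻² d⁻¹; on other vegetation types the method also clearly outperformed the traditional light use efficiency model product and was closer to ground flux observations. The NEP model built with the same method also gave good estimates: for deciduous broadleaf forest, the model output against flux tower NEP gave R² = 0.70 and RMSE = 1.75 g C m⁻² d⁻¹. The difference in accuracy between the GPP and NEP models also indicates that, in machine learning modelling, the independent variables of the training dataset...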
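
For readers unfamiliar with the two reported metrics, the sketch below shows how RMSE and a coefficient of determination are commonly computed from observed and predicted values. The abstract does not say which R² definition was used (it may be a squared correlation), so the 1 - SS_res/SS_tot form here is an assumption, and the numbers are made up.

```python
import math


def rmse(observed, predicted):
    """Root mean square error, in the same units as the inputs."""
    n = len(observed)
    return math.sqrt(sum((o - p) ** 2 for o, p in zip(observed, predicted)) / n)


def r_squared(observed, predicted):
    """Coefficient of determination as 1 - SS_res / SS_tot."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot


# Illustrative daily flux values only, not data from the study.
tower = [1.0, 2.0, 3.0, 4.0]
model = [1.5, 2.0, 2.5, 4.0]
print(rmse(tower, model), r_squared(tower, model))
```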

9.
Research associated with Big Data in the Cloud will be an important topic over the next few years. The topic includes work on demonstrating architectures, applications, services, experiments and simulations in the Cloud to support cases related to the adoption of Big Data. A common approach to Big Data in the Cloud, intended to allow better access, performance and efficiency when analysing and understanding the data, is to deliver Everything as a Service. Organisations adopting Big Data this way find that the boundaries between private clouds, public clouds and the Internet of Things (IoT) can be very thin. Volume, variety, velocity, veracity and value are the major factors in Big Data systems, but there are other challenges to be resolved. The papers of this special issue address a variety of issues and concerns in Big Data, including searching and processing Big Data, implementing and modelling event and workflow systems, visualisation, modelling and simulation, and aspects of social media.

10.
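A Discussion of Big Data
The significance of big data is gradually being recognized. This paper briefly introduces big data and examines it in terms of technologies and tools, solutions, and application cases. It also discusses several problems that big data poses for computer science.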

11.
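When handling highly imbalanced data, the cost-sensitive random forest algorithm suffers from several problems: bootstrap sampling leaves minority-class samples insufficiently learned, majority-class samples make up a large share of each sample, and the cost-sensitive mechanism is easily weakened. In this paper, the majority-class samples are first clustered, and each cluster is then undersampled several times under a weak balance criterion; the selected majority-class samples are merged with the minority-class samples of the original training set to generate multiple new imbalanced datasets, which are used to train cost-sensitive decision trees. The resulting clustering-based weakly balanced cost-sensitive random forest algorithm lets the minority-class samples be learned fully and, by reducing the number of majority-class samples, keeps their effect on the cost-sensitive mechanism small. Experiments show that the algorithm performs well on highly imbalanced datasets.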
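
The data-preparation step described above (undersample each majority-class cluster several times, then merge with all minority samples) can be sketched roughly as below. The abstract does not define the weak balance criterion, so the fixed `keep_ratio` is a hypothetical stand-in for it; the function name, the pre-clustered input, and the seed are likewise assumptions.

```python
import random


def build_training_sets(majority_clusters, minority, keep_ratio=0.5,
                        n_sets=3, seed=0):
    """Build several reduced training sets from clustered majority samples.

    majority_clusters: list of clusters, each a list of majority samples.
    keep_ratio: fraction kept per cluster (assumed to be at most 1); a
    placeholder for the paper's weak balance criterion.
    """
    rng = random.Random(seed)
    datasets = []
    for _ in range(n_sets):
        sampled = []
        for cluster in majority_clusters:
            # Keep at least one sample from every non-empty cluster.
            k = max(1, int(len(cluster) * keep_ratio)) if cluster else 0
            sampled.extend(rng.sample(cluster, k))
        # Every dataset keeps all minority samples.
        datasets.append(sampled + list(minority))
    return datasets


clusters = [[1, 2, 3, 4], [5, 6]]
sets = build_training_sets(clusters, minority=[100, 101])
print([len(s) for s in sets])
```

Each resulting dataset would then train one cost-sensitive decision tree of the forest; the tree training itself is outside this sketch.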

12.
Beyond the hype of Big Data, something within business intelligence projects is indeed changing. This is mainly because Big Data is not only about data, but also about a complete conceptual and technological stack including raw and processed data, storage, ways of managing data, processing and analytics. A challenge that becomes even trickier is managing the quality of the data in Big Data environments. More than ever before, assessing Quality-in-Use gains importance, since the real contribution (the business value) of data can only be estimated in its context of use. Although different Data Quality models exist for assessing the quality of regular data, none of them has been adapted to Big Data. To fill this gap, we propose the "3As Data Quality-in-Use model", which is composed of three Data Quality characteristics for assessing the levels of Data Quality-in-Use in Big Data projects: Contextual Adequacy, Operational Adequacy and Temporal Adequacy. The model can be integrated into any sort of Big Data project, as it is independent of any pre-conditions or technologies. The paper shows how to use the model with a working example. The model addresses every challenge related to a Data Quality program aimed at Big Data. The main conclusion is that the model can be used as an appropriate way to obtain the Quality-in-Use levels of the input data of a Big Data analysis, and those levels can be understood as indicators of the trustworthiness and soundness of the results of that analysis.

13.
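Data are the foundation and core of information system operation and a valuable resource for an organization's stable development. As the data volume of information systems grows geometrically, especially in today's big data environment and amid rapid advances in information technology, massive data migration is a practical problem enterprises must face when dealing with insufficient storage space, switching from old to new systems, and upgrading or renovating information systems. How to migrate massive data quickly, correctly and completely under business constraints while guaranteeing data integrity, consistency and inheritance is a key research question. From the perspective of massive data management, this paper describes methods for massive data migration and compares the characteristics of different migration schemes.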

14.
Ren Rui, Cheng Jiechao, He Xi-Wen, Wang Lei, Zhan Jian-Feng, Gao Wan-Ling, Luo Chun-Jie. Journal of Computer Science and Technology, 2019, 34(6): 1167-1184
With tremendously growing interest in Big Data, improving the performance of Big Data systems becomes more and more important. Among many steps, the...

15.
Nowadays, Big Data, a large volume of both structured and unstructured data, is generated from Social Media. Social Media are powerful marketing tools, and social big data can offer business insights. The major challenge facing social big data is finding efficient techniques to collect a large volume of social data and extract insights from it. Sentiment Analysis of social big data can provide business insights by extracting public opinion. Traditional analytic platforms need to be scaled up to analyze a large volume of social big data. Social data are by nature short and generally not written with proper grammar, which makes highly reliable Sentiment Analysis results difficult to achieve. Acquiring effective training data is a challenge, although learning-based approaches are good for sentiment classification. Manual labeling of training data is time- and labor-consuming. In this paper, a Sentiment Analysis system on a Big Data Analytics platform is proposed to provide valuable information by analyzing large-scale social data in an efficient and timely manner; it is implemented using a MapReduce framework and Hadoop distributed storage (HDFS). The proposed system consists of four modules: data collection, data cleaning and preprocessing, class labeling, and sentiment classification. The system delivers high sentiment classification performance by combining a lexicon-based classifier's effortless setup with a learning-based classifier. Twitter stream data is used for system evaluation, as Twitter is a widespread Social Media platform and a good source of snapshots of moods and feelings as well as up-to-date events. The evaluation results show that the system achieves a promising accuracy of 84.2%. Moreover, the system scales to large-scale data: processing time decreases as more nodes are added to the cluster.
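
To make the lexicon-based part of such a pipeline concrete, here is a minimal sketch of a lexicon-based labeler of the kind that could produce training labels without manual annotation. The tiny word lists, the whitespace tokenization, and the function name are hypothetical; the abstract does not describe the lexicon or scoring rule actually used, and the MapReduce distribution is not shown.

```python
# Hypothetical miniature lexicons, for illustration only.
POSITIVE = {"good", "great", "love"}
NEGATIVE = {"bad", "awful", "hate"}


def lexicon_label(text):
    """Label a short text by counting positive and negative lexicon hits."""
    tokens = text.lower().split()
    score = (sum(t in POSITIVE for t in tokens)
             - sum(t in NEGATIVE for t in tokens))
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


for tweet in ["I love this great phone", "awful battery", "it is a phone"]:
    print(tweet, "->", lexicon_label(tweet))
```

Labels produced this way could then serve as training data for a learning-based classifier, which is the combination the abstract describes.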

16.
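The value of big data is not limited to the purpose for which it was originally collected; it lies in the fact that, once collected, the data can be put to other uses and reused repeatedly. At present many countries, including the United States, have raised big data analysis and management to the level of national strategy and are planning its development comprehensively at the national level.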

17.
The quality of the data is directly related to the quality of the models drawn from that data. For that reason, much research is devoted to improving the quality of the data and amending the errors it may contain. One of the most common problems is the presence of noise in classification tasks, where noise refers to the incorrect labeling of training instances. This problem is very disruptive, as it changes the decision boundaries of the problem. Big Data problems pose a new challenge in terms of data quality due to the massive and unsupervised accumulation of data. The Big Data scenario also brings new problems for classic data preprocessing algorithms, as they are not prepared to work with such amounts of data, and these algorithms are key to moving from Big Data to Smart Data. In this paper, an iterative ensemble filter for removing noisy instances in Big Data scenarios is proposed. Experiments carried out on six Big Data datasets have shown that our noise filter outperforms the current state-of-the-art noise filter in Big Data domains. It has also proved to be an effective solution for transforming raw Big Data into Smart Data.

18.
The rapid development and extensive application of geographic information systems (GIS) and the advent of the age of big data have led to the generation of multi-source spatial data, which makes data integration, fusion and sharing more difficult because of differences in data source, data accuracy and data model. Meanwhile, the study of multi-source spatial data fusion methods has practical significance for reducing the production cost of geographic data, accelerating the updating of existing geographic information and improving the quality of GIS big data. To describe systematically the formation and development trends of multi-source spatial data fusion methods, and drawing on a large body of related technical literature from China and abroad, this paper summarizes and discusses multi-source spatial data fusion methods and looks ahead to the prospects of data fusion in the big data environment, which offers some reference value for related research.

19.
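With the wide application and deepening development of medical information systems, medical data are growing rapidly. Medical information has entered the "big data" era, and the emergence of big data has brought prominent security problems. This paper discusses big data security management strategies for medical information systems, focusing on system architecture, backup mechanisms, and a defense-in-depth architecture for databases.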

20.
Recent evolutions in computing science and web technology provide the environmental community with continuously expanding resources for data collection and analysis that pose unprecedented challenges to the design of analysis methods, workflows, and interaction with data sets. In the light of the recent UK Research Council funded Environmental Virtual Observatory pilot project, this paper gives an overview of currently available implementations related to web-based technologies for processing large and heterogeneous datasets and discusses their relevance within the context of environmental data processing, simulation and prediction. We found that the processing of the simple datasets used in the pilot proved to be relatively straightforward using a combination of R, RPy2, PyWPS and PostgreSQL. However, the use of NoSQL databases and more versatile frameworks such as OGC standard based implementations may provide a wider and more flexible set of features that particularly facilitate working with larger volumes and more heterogeneous data sources.
