Similar Literature
20 similar documents found.
1.
Data streams are long, relatively unstructured sequences of characters that contain information such as electronic mail or a tape backup of the various documents and reports created in an office. A conceptual framework, based on relational algebra and relational databases, is presented within which data streams may be queried. As information is extracted from a data stream, it is placed in a relational database that can be queried in the usual manner. The database schema evolves as the user's knowledge of the content of the data stream changes. Operators defined in terms of relational algebra extract data from a specially defined relation that contains all or part of the data stream. This approach permits the integration of unstructured data with structured data. The new operators extend the functionality of relational algebra much as the join extends the basic operators select, project, union, difference, and Cartesian product.
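As a rough illustration of the querying approach described above, the sketch below extracts tuples from an unstructured character stream with a pattern and loads them into a relation that can then be queried. The mail schema and pattern are invented for illustration, and SQL stands in for the paper's relational-algebra operators.

```python
# A minimal sketch, assuming a hypothetical mail-log stream: pull structured
# tuples out of an unstructured character stream with a pattern, load them
# into a relation, and query with ordinary relational operators.
import re
import sqlite3

stream = (
    "From: alice@example.com Subject: budget Date: 2024-01-15 ... "
    "From: bob@example.com Subject: hiring Date: 2024-02-03 ..."
)

# "Extract" operator: a pattern applied to the stream yields tuples.
pattern = re.compile(r"From: (\S+) Subject: (\w+) Date: (\d{4}-\d{2}-\d{2})")
tuples = pattern.findall(stream)

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE mail (sender TEXT, subject TEXT, sent DATE)")
con.executemany("INSERT INTO mail VALUES (?, ?, ?)", tuples)

# Extracted data can now be selected/joined like any structured relation.
for row in con.execute("SELECT sender, sent FROM mail WHERE subject = 'budget'"):
    print(row)
```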

2.
Modern DBMSes are designed to support many transactions running simultaneously. DBMS thrashing is indicated by a sharp drop in transaction throughput. Thrashing behavior in DBMSes is a serious concern to database administrators (DBAs) as well as to DBMS implementers. From an engineering perspective, it is therefore critically important to understand the causal factors of DBMS thrashing. However, understanding the origin of thrashing in modern DBMSes is challenging, because many factors may interact with each other. This article aims to better understand the thrashing phenomenon across multiple DBMSes. We identify some of the underlying causes of DBMS thrashing and then propose a novel structural causal model to explicate the relationships between the various factors contributing to it. The model yields a number of specific hypotheses that are subsequently tested across DBMSes, providing empirical support for the model as well as important engineering implications for improvements in transaction processing.
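As a hedged toy of what a structural causal model over thrashing factors looks like, the sketch below writes each variable as a function of its parents plus noise. The variables (multiprogramming level, contention) and equations are invented for illustration and do not reproduce the article's actual model.

```python
# A toy structural causal model: each variable is a function of its parents
# plus noise. Invented variables and equations, not the article's model.
import numpy as np

rng = np.random.default_rng(0)
n = 1000
mpl = rng.integers(10, 500, n)                    # multiprogramming level
contention = 0.002 * mpl + rng.normal(0, 0.1, n)  # child of mpl
# Throughput collapses sharply once contention passes a knee: thrashing.
throughput = mpl / (1 + np.exp(8 * (contention - 0.6)))

# A derived hypothesis ("higher contention causes the throughput cliff")
# can then be tested by stratifying by, or intervening on, contention.
print(np.corrcoef(contention, throughput)[0, 1])
```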

3.
On-line analysis of variance (ANOVA) over multiple data streams is a meaningful research problem. For a multi-stream composed of several single streams that share the same attribute set and arrive tuple by tuple, we propose reservoir sampling each single stream separately and constructing a set of multi-snapshot windows in one-to-one (bijective) correspondence with the single streams. The multi-snapshot windows can be placed serially in main memory; the attributes of a tuple correspond one-to-one to the snapshot windows within a multi-snapshot window, and the basic windows within a snapshot window correspond one-to-one to the samples of attribute values drawn from the associated single stream. ANOVA is then performed on these mutually independent samples. By processing the data in the multi-snapshot windows serially, in order, parallel on-line ANOVA over multiple data streams can be realized by a serialized method. Theoretical analysis and experiments show that the method is reasonable and effective.
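A minimal sketch of the two ingredients named above, under simplifying assumptions: each single stream is reservoir-sampled independently, and one-way ANOVA is then run over the resulting independent samples. scipy's f_oneway stands in for the paper's windowed computation, and the streams and reservoir size are made up.

```python
# Reservoir-sample each single stream, then run one-way ANOVA on the samples.
import random
from scipy import stats

def reservoir_sample(stream, k):
    """Classic reservoir sampling: keep a uniform random sample of size k."""
    sample = []
    for i, x in enumerate(stream):
        if i < k:
            sample.append(x)
        else:
            j = random.randint(0, i)
            if j < k:
                sample[j] = x
    return sample

random.seed(0)
# Three synthetic single streams with slightly different means.
streams = [[random.gauss(mu, 1.0) for _ in range(10_000)]
           for mu in (0.0, 0.1, 0.5)]
samples = [reservoir_sample(s, 200) for s in streams]

f_stat, p_value = stats.f_oneway(*samples)
print(f"F = {f_stat:.2f}, p = {p_value:.4f}")
```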

4.
Data Mining and Knowledge Discovery - In recent years, data stream mining and learning from imbalanced data have been active research areas. Even though solutions exist to tackle these two problems,...

5.
6.
7.
Software change prediction is crucial for efficiently planning resource allocation during the testing and maintenance phases of software. Moreover, correct identification of change-prone classes in the early phases of the software development life cycle helps in developing cost-effective, good-quality, and maintainable software. An effective software change prediction model should recognize change-prone and not-change-prone classes with equally high accuracy. However, this is rarely the case, as software practitioners often have to deal with imbalanced data sets in which instances of one class type greatly outnumber the other. In such a scenario, the minority classes are not predicted accurately, leading to strategic losses. This study evaluates a number of techniques for handling imbalanced data sets, using various data sampling methods and MetaCost learners on six open-source data sets. The results advocate the use of the resample-with-replacement sampling method for effective imbalanced learning.
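A minimal sketch of the recommended remedy, assuming scikit-learn: oversample the minority (change-prone) class by resampling with replacement before training. The toy data and classifier are illustrative, not the study's setup.

```python
# Oversample the minority class with replacement, then train on balanced data.
import numpy as np
from sklearn.utils import resample
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = (rng.random(200) < 0.1).astype(int)  # ~10% change-prone: imbalanced

X_min, X_maj = X[y == 1], X[y == 0]
X_min_up, y_min_up = resample(
    X_min, np.ones(len(X_min), dtype=int),
    replace=True, n_samples=len(X_maj), random_state=0,
)

X_bal = np.vstack([X_maj, X_min_up])
y_bal = np.concatenate([np.zeros(len(X_maj), dtype=int), y_min_up])

clf = LogisticRegression().fit(X_bal, y_bal)
print("balanced class counts:", np.bincount(y_bal))
```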

8.
A sketch is a memory-efficient data structure used to store and query the frequency of any item in a given multiset. Because it supports fast queries and updates, it has been applied in many fields. Different sketches have different advantages and disadvantages. Sketches were originally proposed for estimating flow sizes in network measurement, where the key factors are insertion speed and accuracy. In this paper, we propose a new sketch that significantly improves insertion speed while also improving accuracy. Our key methods are on-chip/off-chip separation and a partial update algorithm. Extensive experimental results show that our sketch significantly outperforms the state of the art in terms of both accuracy and speed.
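The abstract does not spell out the new structure, so the sketch below shows only the standard baseline it builds on: a count-min sketch, with hashed counter rows, conservative min-queries, and a toy stream.

```python
# A count-min sketch: the classic memory-efficient frequency estimator.
import hashlib

class CountMinSketch:
    def __init__(self, width=1024, depth=4):
        self.width, self.depth = width, depth
        self.rows = [[0] * width for _ in range(depth)]

    def _index(self, item, row):
        # Seed each row with its index to get independent-looking hashes.
        h = hashlib.blake2b(f"{row}:{item}".encode(), digest_size=8)
        return int.from_bytes(h.digest(), "big") % self.width

    def update(self, item, count=1):
        for r in range(self.depth):
            self.rows[r][self._index(item, r)] += count

    def query(self, item):
        # Counters only overestimate; taking the min bounds collision error.
        return min(self.rows[r][self._index(item, r)] for r in range(self.depth))

cms = CountMinSketch()
for flow in ["a", "a", "b", "a", "c"]:
    cms.update(flow)
print(cms.query("a"))  # 3 (possibly more under hash collisions)
```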

9.
A problem with using the geostatistical Kriging error for optimal sampling design is that the design does not adapt locally to the character of spatial variation, because a stationary variogram or covariance function is a parameter of the geostatistical model. The objective of this paper was to investigate the utility of non-stationary geostatistics for optimal sampling design. First, a contour data set of Wiltshire was split into 25 equal sub-regions and a local variogram was predicted for each. These variograms were fitted with models, and the coefficients were used in Kriging to select optimal sample spacings for each sub-region. Large differences existed between the designs for the whole region (based on the global variogram) and for the sub-regions (based on the local variograms). Second, a segmentation approach was used to divide a digital terrain model into separate segments. Segment-based variograms were predicted and fitted with models, and optimal sample spacings were then determined for the whole region and for the sub-regions. It was demonstrated that the global design was inadequate, grossly over-sampling some segments while under-sampling others.
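A minimal sketch of the first step in the workflow above: estimating a local empirical variogram from point data. Model fitting and the Kriging-error spacing search are omitted, and the data are synthetic.

```python
# Empirical variogram: semivariance gamma(h) per distance-lag bin.
import numpy as np

def empirical_variogram(coords, values, lags):
    """gamma(h) = mean of 0.5*(z_i - z_j)^2 over point pairs at lag h."""
    d = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
    sq = 0.5 * (values[:, None] - values[None, :]) ** 2
    gamma = []
    for lo, hi in zip(lags[:-1], lags[1:]):
        # Upper triangle only, so each pair is counted once.
        mask = (d >= lo) & (d < hi) & np.triu(np.ones_like(d, bool), 1)
        gamma.append(sq[mask].mean() if mask.any() else np.nan)
    return np.array(gamma)

rng = np.random.default_rng(1)
coords = rng.uniform(0, 100, size=(300, 2))
values = np.sin(coords[:, 0] / 15) + rng.normal(0, 0.2, 300)
lags = np.linspace(0, 50, 11)
print(empirical_variogram(coords, values, lags).round(3))
```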

10.
To maintain competitive advantage, the semiconductor industry has strived for continuous technology migrations and quick response to yield excursions. As wafer fabrication becomes increasingly complicated at nano-scale technologies, many multicollinear factors, including recipe, process, tool, and chamber, affect yield in ways that are hard to detect and interpret. Although design of experiments (DOE) is a cost-effective approach to considering multiple factors simultaneously, it is difficult to follow such a design when conducting experiments in real settings. Alternatively, data mining has been widely applied to extract potentially useful patterns for manufacturing intelligence. However, because hundreds of factors must be considered simultaneously to accurately characterize the yield performance of newly released technologies and tools for diagnosis, data mining requires tremendous analysis time and often generates too many patterns for domain experts to interpret. To address these needs in real settings, this study develops a retrospective DOE data-mining approach that matches potential designs with the huge amounts of data automatically collected in semiconductor manufacturing, enabling effective and meaningful knowledge extraction. DOE can detect high-order interactions and show how interconnected factors respond across a wide range of values. To validate the proposed approach, an empirical study was conducted in a semiconductor manufacturing company in Taiwan, and the results demonstrated its practical viability.
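A hedged sketch of the retrospective idea, assuming pandas: instead of running new experiments, historical lots are matched to the cells of a factorial design, and factor effects, including interactions, are read off the cell means. The factor names and data are invented.

```python
# Match historical lots to the cells of a 2x2 factorial design and read off
# factor effects from cell means. Invented factors and data.
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "tool": rng.choice(["T1", "T2"], 400),
    "recipe": rng.choice(["R1", "R2"], 400),
})
df["yield"] = (0.9
               - 0.05 * (df.tool == "T2")
               - 0.03 * (df.recipe == "R2")
               + 0.06 * ((df.tool == "T2") & (df.recipe == "R2"))  # interaction
               + rng.normal(0, 0.02, 400))

# Each design cell is "filled" retrospectively from historical lots; the
# interaction effect DOE is good at finding falls out of the cell means.
print(df.groupby(["tool", "recipe"])["yield"].mean().round(3))
```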

11.
12.
We address the issue of small data sizes for training regression models, a significant problem in materials science. Many density estimators using generative models based on deep neural networks have been proposed; among generative models, normalizing flows provide exact density estimation. Using normalizing flows, we address the training-data augmentation issue, employing a real-valued non-volume-preserving model (real-NVP) as the normalizing flow. A generative adversarial network (GAN)-based training method, with real-NVP as the generator, is applied to improve real-NVP training. Generalization performance of kernel ridge regression trained on the generated data was measured to evaluate the models. Experiments were conducted on seven benchmark datasets and a dataset of ionic conductivities of materials, comparing the GAN-based real-NVP to state-of-the-art models such as real-NVP and masked autoregressive flows. The results demonstrated that the GAN-based real-NVP is comparable to state-of-the-art models and implied that the data it samples are usable as new training data.
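A minimal sketch of the real-NVP building block mentioned above, under simplifying assumptions: an affine coupling layer whose forward map is exactly invertible, so densities follow from the change-of-variables formula. Tiny fixed linear maps stand in for the learned scale/translation networks, and no GAN training is shown.

```python
# An affine coupling layer: half the dimensions pass through unchanged and
# parameterize an invertible affine map of the other half.
import numpy as np

class AffineCoupling:
    def __init__(self, dim, rng):
        self.d = dim // 2
        # Stand-ins for the learned s(.) and t(.) networks: linear maps.
        self.Ws = rng.normal(0, 0.1, (dim - self.d, self.d))
        self.Wt = rng.normal(0, 0.1, (dim - self.d, self.d))

    def forward(self, x):
        x1, x2 = x[:, : self.d], x[:, self.d:]
        s, t = x1 @ self.Ws.T, x1 @ self.Wt.T
        y2 = x2 * np.exp(s) + t          # log-volume change is sum(s)
        return np.hstack([x1, y2]), s.sum(axis=1)

    def inverse(self, y):
        y1, y2 = y[:, : self.d], y[:, self.d:]
        s, t = y1 @ self.Ws.T, y1 @ self.Wt.T
        return np.hstack([y1, (y2 - t) * np.exp(-s)])

rng = np.random.default_rng(0)
layer = AffineCoupling(4, rng)
x = rng.normal(size=(5, 4))
y, logdet = layer.forward(x)
print(np.allclose(layer.inverse(y), x))  # True: exact invertibility
```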

13.
Information imprecision and uncertainty exist in many real-world applications, and for this reason fuzzy data modeling has been extensively investigated in various data models. Currently, huge amounts of electronic data are available on the Internet, and XML has become the de facto standard for information representation and exchange over the Web. This paper focuses on fuzzy XML data modeling, covering the representation model of fuzzy XML, its conceptual design, and its storage in databases. Based on possibility distribution theory, we develop a fuzzy XML data model, together with a fuzzy UML data model for designing the fuzzy XML model conceptually. We investigate the formal conversions from the fuzzy UML model to the fuzzy XML model and the formal mapping from the fuzzy XML model to fuzzy relational databases.
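A minimal sketch of the device the model above rests on: annotating an XML value with a possibility distribution. The element and attribute names (employee, val, Poss) are invented for illustration, not the paper's schema.

```python
# Represent a fuzzy value in XML as a possibility distribution over
# alternatives, each tagged with its possibility degree.
import xml.etree.ElementTree as ET

emp = ET.Element("employee", id="e1")
age = ET.SubElement(emp, "age")
# Possibility distribution for "age is about 30".
for value, poss in [(29, 0.6), (30, 1.0), (31, 0.7)]:
    ET.SubElement(age, "val", Poss=str(poss)).text = str(value)

print(ET.tostring(emp, encoding="unicode"))
```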

14.
This article describes the basic elements of the Rutherford Laboratory on-line film measurement system, in which twelve film-measuring machines are connected on-line to an IBM 1130. The 1130 is linked to an IBM 360/195 by a parallel data link. The article mentions some of the lessons learnt in going on-line and describes current and future developments.

15.
This paper presents the scalable on-line execution (SOLE) algorithm for continuous, on-line evaluation of concurrent continuous spatio-temporal queries over data streams. Incoming spatio-temporal data streams are processed in memory against a set of outstanding continuous queries. SOLE uses the scarce memory resource efficiently by keeping track of only the significant objects; in-memory objects are expired (i.e., dropped) once they become insignificant. SOLE is a scalable algorithm in which all outstanding continuous queries share the same buffer pool. In addition, SOLE is cast as a spatio-temporal join between two input streams: a stream of spatio-temporal objects and a stream of spatio-temporal queries. To cope with intervals of high arrival rates of objects and/or queries, SOLE employs a load-shedding approach in which some stored objects are dropped from memory. SOLE is implemented as a pipelined query operator that can be combined with traditional query operators in a query execution plan to support a wide variety of continuous queries. Performance experiments based on a real implementation of SOLE inside a prototype data stream management system show its scalability and efficiency in highly dynamic environments. This work was supported in part by the National Science Foundation under Grants IIS-0093116, IIS-0209120, and 0010044-CCR.
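A minimal sketch of the core idea above: joining a stream of moving objects against outstanding continuous range queries while keeping an object in memory only while it is significant (here simplified to "inside at least one query region"). The names and the significance rule are stand-ins, not SOLE's actual data structures.

```python
# Join an object stream against continuous range queries with a shared
# buffer pool; insignificant objects are expired immediately.
queries = {  # query id -> (xmin, ymin, xmax, ymax)
    "q1": (0, 0, 10, 10),
    "q2": (5, 5, 20, 20),
}

buffer_pool = {}  # shared across all queries: object id -> last position

def on_object(obj_id, x, y):
    hits = [q for q, (x0, y0, x1, y1) in queries.items()
            if x0 <= x <= x1 and y0 <= y <= y1]
    if hits:
        buffer_pool[obj_id] = (x, y)        # significant: keep in memory
        return [(q, obj_id) for q in hits]  # incremental answer tuples
    buffer_pool.pop(obj_id, None)           # insignificant: expire (drop)
    return []

for update in [("o1", 3, 3), ("o2", 50, 50), ("o1", 7, 7)]:
    print(on_object(*update))
print("in memory:", list(buffer_pool))
```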

16.
A set of language facilities for the specification of security requirements in a relational data base is presented. An attempt is made to have a data base language that is unified from the points of view of data storage/retrieval, computation, and protection. It is argued that the unit of protection as defined in conventional data base systems is not sufficient from the point of view of security, and therefore the underlying protection model takes a simple domain as the unit of protection. Further, the definition of the data submodel is modified to define the capabilities of users over the view of the data base provided to them. The line of attack adopted here is then contrasted with that of some existing data base systems.

17.
Periodic data play a major role in many application domains, from manufacturing to office automation and from scheduling to data broadcasting. In many such domains, the huge number of repetitions makes extensionally storing and accessing such data very challenging. In this paper, we propose a new methodology based on an intensional representation of periodic data. The representation model we propose captures the notion of periodic granularity provided by the temporal database glossary and is an extension of the TSQL2 temporal relational data model. We define the algebraic operators, introduce access algorithms to support them, and prove that they are correct with respect to the traditional extensional approach. We also provide an experimental evaluation of our approach.
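A minimal sketch of the intensional idea, under assumptions: a periodic fact is stored once as (anchor, period, duration, repetitions), and its occurrences are expanded lazily at query time rather than materialized. The record layout is illustrative, not the paper's TSQL2 extension.

```python
# Store a periodic fact intensionally; expand occurrences lazily at query time.
from datetime import datetime, timedelta

def occurrences(anchor, period, duration, count):
    """Lazily generate the [start, end) intervals of a periodic fact."""
    for i in range(count):
        start = anchor + i * period
        yield start, start + duration

# "Weekly staff meeting, one hour, 52 repetitions" stored as a single tuple.
meeting = (datetime(2024, 1, 1, 9), timedelta(weeks=1), timedelta(hours=1), 52)

# Query: which occurrences overlap a given week? No extensional storage needed.
q_lo, q_hi = datetime(2024, 3, 1), datetime(2024, 3, 8)
print([(s, e) for s, e in occurrences(*meeting) if s < q_hi and e > q_lo])
```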

18.
We investigated whether a “context-aware” fisheye view can communicate the information contained in a set of process models (data flow diagrams) more successfully than a traditional “context-free” presentation. We conducted two controlled experiments: the first used a simple set of DFDs and tasks requiring a basic understanding of the system, while the second involved more detailed views of the same processes and a more complex task. Subjects who used the fisheye process models outperformed those using the traditional presentations. This difference was reflected in task performance for all subjects, and in task completion time for inexperienced subjects.

19.
Markov random fields are typically used as priors in Bayesian image restoration methods to represent spatial information in the image. Commonly used Markov random fields are not in fact capable of representing the moderate-to-large-scale clustering present in naturally occurring images, and they can also be time-consuming to simulate, requiring iterative algorithms that can take hundreds of thousands of sweeps of the image to converge. Markov mesh models, a causal subclass of Markov random fields, are, however, readily simulated. We describe an empirical study of simulated realizations from various models used in the literature, and we introduce some new mesh-type models. We conclude, however, that while large-scale clustering may be represented by such models, strong directional effects are also present for all but very limited parameterizations. It is emphasized that these results do not detract from the use of Markov random fields to represent local spatial properties, which is their main purpose in the implementation of Bayesian statistical approaches to image analysis. Brief allusion is made to the issue of parameter estimation.
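A minimal sketch of why Markov mesh models are "readily simulated": a single causal raster scan in which each pixel depends only on already-visited neighbours (here, west and north). The parameterization is arbitrary, chosen to exhibit both the clustering and the directional bias the study reports.

```python
# Simulate a binary Markov mesh model in one raster-scan pass: no iterative
# sweeps are needed because each pixel's parents are already sampled.
import numpy as np

def simulate_mesh(n, beta, rng):
    img = np.zeros((n, n), dtype=int)
    for i in range(n):
        for j in range(n):
            s = 0.0
            if j > 0:
                s += beta * (2 * img[i, j - 1] - 1)  # west neighbour
            if i > 0:
                s += beta * (2 * img[i - 1, j] - 1)  # north neighbour
            p = 1.0 / (1.0 + np.exp(-2 * s))         # P(pixel = 1 | parents)
            img[i, j] = rng.random() < p
    return img

img = simulate_mesh(32, beta=0.8, rng=np.random.default_rng(0))
print(img.mean(), "fraction of 1s; one pass, no iterative sweeps")
```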

20.