Similar literature
 Found 20 similar documents; search took 15 ms
1.
A considerable body of literature attests to the significance of internal controls; however, little is known about how the clustering of accounting databases can function as an internal control procedure. To explore this issue further, this paper puts forward a semi-supervised tool that is based on a self-organizing map and the IASB XBRL Taxonomy. The paper validates the proposed tool via a series of experiments on an accounting database provided by a shipping company. Empirical results suggest the tool can cluster accounting databases into homogeneous and well-separated clusters that can be interpreted within an accounting context. Further investigations reveal that the tool can compress a large number of similar transactions, and also provide information comparable to that of financial statements. The findings demonstrate that the tool can be applied to verify the processing of accounting transactions as well as to assess the accuracy of financial statements, and thus supplement internal controls.

2.
A major requirement of database systems in engineering design and manufacturing applications is support for storage and maintenance of complex objects. Frame-based systems are capable of modeling complex objects. However, many of these systems are implemented in main memory. As the number of objects to be stored far exceeds the capacity of the main memory of a computer, such an implementation is often unusable. Therefore, new ways to model and manipulate large manufacturing databases are needed. In this paper we present an implementation of a frame-based system on top of the POSTGRES extensible database system. Such an implementation combines the advantages of database management and frame-based systems and allows for the development of large manufacturing database applications with minimal effort.

3.
A global optimization method for semi-supervised clustering   (Total citations: 1; self-citations: 0; citations by others: 1)
In this paper, we adapt Tuy’s concave cutting plane method to semi-supervised clustering. We also give properties of locally optimal solutions of the semi-supervised clustering problem. Numerical examples show that this method can give a better solution than other semi-supervised clustering algorithms.

4.
With the proliferation of healthcare data, cloud mining technology for E-health services and applications has become a hot research topic. On the other hand, these rapidly evolving cloud mining technologies and their deployment in healthcare systems also pose potential threats to patients’ data privacy. To solve the privacy problem in cloud mining, this paper proposes a semi-supervised privacy-preserving clustering algorithm. By employing a small amount of supervised information, the method first learns a Large Margin Nearest Cluster metric using convex optimization. Then, according to the trained metric, the method imposes a multiplicative perturbation on the original data, which changes the distribution shape of the original data and thus protects private information while ensuring high data usability. The experimental results on the brain fiber dataset provided by the 2009 PBC demonstrate that the proposed method not only protects data privacy against security attacks, but also improves clustering purity.

5.
Information Systems, 2001, 26(1): 35-58
Clustering, in data mining, is useful for discovering groups and identifying interesting distributions in the underlying data. Traditional clustering algorithms either favor clusters with spherical shapes and similar sizes, or are very fragile in the presence of outliers. We propose a new clustering algorithm called CURE that is more robust to outliers, and identifies clusters having non-spherical shapes and wide variances in size. CURE achieves this by representing each cluster by a certain fixed number of points that are generated by selecting well-scattered points from the cluster and then shrinking them toward the center of the cluster by a specified fraction. Having more than one representative point per cluster allows CURE to adjust well to the geometry of non-spherical shapes and the shrinking helps to dampen the effects of outliers. To handle large databases, CURE employs a combination of random sampling and partitioning. A random sample drawn from the data set is first partitioned and each partition is partially clustered. The partial clusters are then clustered in a second pass to yield the desired clusters. Our experimental results confirm that the quality of clusters produced by CURE is much better than those found by existing algorithms. Furthermore, they demonstrate that random sampling and partitioning enable CURE to not only outperform existing algorithms but also to scale well for large databases without sacrificing clustering quality.
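The representative-point step described in this abstract can be illustrated with a minimal sketch. It is not the CURE authors' implementation: the function names (`centroid`, `scattered_points`, `shrink`), the greedy farthest-point selection rule, and the parameter names `c` and `alpha` are assumptions made for illustration only.

```python
import math

def centroid(points):
    """Mean of a list of equal-length coordinate tuples."""
    n = len(points)
    dim = len(points[0])
    return tuple(sum(p[d] for p in points) / n for d in range(dim))

def scattered_points(points, c):
    """Pick up to c well-scattered points (assumed greedy farthest-point rule).

    The first point is the one farthest from the centroid; each further
    point maximizes its minimum distance to the points already chosen.
    """
    center = centroid(points)
    chosen = [max(points, key=lambda p: math.dist(p, center))]
    while len(chosen) < min(c, len(points)):
        candidates = [p for p in points if p not in chosen]
        if not candidates:
            break
        chosen.append(max(candidates,
                          key=lambda p: min(math.dist(p, q) for q in chosen)))
    return chosen

def shrink(reps, center, alpha):
    """Move each representative a fraction alpha of the way toward the center."""
    return [tuple(r[d] + alpha * (center[d] - r[d]) for d in range(len(r)))
            for r in reps]

cluster = [(0, 0), (4, 0), (0, 4), (4, 4), (2, 2)]
reps = shrink(scattered_points(cluster, 2), centroid(cluster), 0.5)
```

With `alpha = 0` the representatives stay on the cluster boundary; with `alpha = 1` they collapse onto the centroid, so the fraction trades sensitivity to cluster shape against robustness to outliers.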

6.
Composite kernels for semi-supervised clustering   (Total citations: 3; self-citations: 2; citations by others: 1)
A critical problem related to kernel-based methods is how to select optimal kernels. A kernel function must conform to the learning target in order to obtain meaningful results. While solutions to the problem of estimating optimal kernel functions and corresponding parameters have been proposed in a supervised setting, it remains a challenge when no labeled data are available, and all we have is a set of pairwise must-link and cannot-link constraints. In this paper, we address the problem of optimizing the kernel function using pairwise constraints for semi-supervised clustering. We propose a new optimization criterion for automatically estimating the optimal parameters of composite Gaussian kernels, directly from the data and given constraints. We combine our proposal with a semi-supervised kernel-based algorithm to demonstrate experimentally the effectiveness of our approach. The results show that our method is very effective for kernel-based semi-supervised clustering.
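The abstract does not state its optimization criterion, so the sketch below uses a stand-in: it scores a single Gaussian kernel width by how much more similar must-link pairs are than cannot-link pairs, and picks the best width from a candidate grid. The score, the grid search, and all names (`gaussian_kernel`, `constraint_score`, `select_sigma`) are assumptions for illustration, not the paper's method.

```python
import math

def gaussian_kernel(x, y, sigma):
    """Gaussian (RBF) kernel value between two coordinate tuples."""
    return math.exp(-math.dist(x, y) ** 2 / (2.0 * sigma ** 2))

def constraint_score(points, must_link, cannot_link, sigma):
    """Assumed criterion: mean kernel over must-link pairs minus
    mean kernel over cannot-link pairs (higher is better)."""
    ml = sum(gaussian_kernel(points[i], points[j], sigma) for i, j in must_link)
    cl = sum(gaussian_kernel(points[i], points[j], sigma) for i, j in cannot_link)
    return ml / len(must_link) - cl / len(cannot_link)

def select_sigma(points, must_link, cannot_link, candidates):
    """Return the candidate width with the highest constraint score."""
    return max(candidates,
               key=lambda s: constraint_score(points, must_link, cannot_link, s))

points = [(0.0,), (1.0,), (5.0,)]
best = select_sigma(points, must_link=[(0, 1)], cannot_link=[(0, 2)],
                    candidates=[0.1, 1.0, 2.0, 10.0])
```

A very small width makes every pair look dissimilar and a very large one makes every pair look similar, so the score favors an intermediate width that separates the two kinds of constraint.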

7.
The scalability problem in data mining involves the development of methods for handling large databases with limited computational resources such as memory and computation time. In this paper, two scalable clustering algorithms, bEMADS and gEMADS, are presented based on the Gaussian mixture model. Both summarize data into subclusters and then generate Gaussian mixtures from their data summaries. Their core algorithm, EMADS, is defined on data summaries and approximates the aggregate behavior of each subcluster of data under the Gaussian mixture model. EMADS is provably convergent. Experimental results substantiate that both algorithms can run several orders of magnitude faster than expectation-maximization with little loss of accuracy.
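The abstract does not spell out what a data summary contains. One common form, assumed here, is a triple of sufficient statistics (count, sum, sum of squares) from which a subcluster's mean and variance can be recovered without revisiting the raw data. The one-dimensional sketch below and its names (`summarize`, `merge`, `mean_and_variance`) are hypothetical and are not the EMADS definition.

```python
def summarize(values):
    """Compress a subcluster of 1-D values into (count, sum, sum of squares)."""
    return (len(values), sum(values), sum(v * v for v in values))

def merge(a, b):
    """Summaries are additive, so two subclusters merge component-wise."""
    return (a[0] + b[0], a[1] + b[1], a[2] + b[2])

def mean_and_variance(summary):
    """Recover the mean and (population) variance from a summary alone."""
    n, s, ss = summary
    mean = s / n
    return mean, ss / n - mean * mean

first = summarize([1.0, 2.0, 3.0])
combined = merge(first, summarize([5.0]))
mean, variance = mean_and_variance(combined)
```

Because the statistics are additive, subclusters can be built incrementally in one pass, which is what makes summary-based mixture fitting attractive when the raw data do not fit in memory.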

8.
Image segmentation is a central process in image processing. There are many segmentation methods such as region growing, edge detection, split and merge and artificial neural networks (ANNs). However, the most important and popular are clustering methods. Normally, clustering methods select cluster centres randomly to segment an image into disjoint and homogeneous regions. The use of random cluster centres without a priori knowledge leads to degradation in the accuracy of the obtained results. However, combined with edge detection, shape representation can help in improving the clustering methods. The improvement is obtained by knowing the optimal location of the cluster centres at the beginning of the image segmentation process. In this article, a new geometric model for high-resolution satellite image segmentation is implemented that can overcome the problem encountered in random clustering processes. The proposed model uses Canny–Deriche edge detection and the modified non-uniform rational B-spline (NURBS) methods to generate the control points of the edges. These points are used to identify cluster centres that are necessary to create the population of the hybrid dynamic genetic algorithm (HDGA). The new geometric model is compared with the self-organizing maps (SOMs) method, which is an efficient unsupervised ANN method. Two experiments are conducted using high-resolution satellite images, and the results prove the high accuracy and reliability of the new evolutionary geometric model.

9.
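Semi-supervised clustering based on kernel self-tuning   (Total citations: 2; self-citations: 1; citations by others: 1)
Semi-supervised clustering is achieved by adding limited background knowledge to an unsupervised algorithm. Existing kernel-based semi-supervised clustering algorithms still require the kernel parameters to be tuned manually, and the chosen values strongly affect the final result. By incorporating association information into the clustering objective function, the Gaussian kernel parameter is optimized repeatedly during clustering so that the best RBF kernel is determined automatically; combining this optimal-kernel computation with the SSKK algorithm yields the SSKKOK algorithm. Experimental results show that the algorithm sets the relevant parameters automatically while retaining the capabilities of kernel-based semi-supervised clustering.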

10.
Spectral clustering: A semi-supervised approach   (Total citations: 2; self-citations: 0; citations by others: 2)
Recently, graph-based spectral clustering algorithms have developed rapidly; they are posed as discrete combinatorial optimization problems and solved approximately by relaxing them into tractable eigenvalue decomposition problems. In this paper, we first review existing spectral clustering algorithms within a unified framework and give a straightforward explanation of spectral clustering. We also present a novel model for generalizing unsupervised spectral clustering to semi-supervised spectral clustering. Under this model, prior information given by some instance-level constraints can be generalized to space-level constraints. We find that the (undirected) graph built on the enlarged prior information is more meaningful, hence the boundaries of the clusters are more correct. Experimental results based on toy data, real-world data and image segmentation demonstrate the advantages of the proposed model.

11.
A data model and algebra for probabilistic complex values   (Total citations: 1; self-citations: 0; citations by others: 1)
We present a probabilistic data model for complex values. More precisely, we introduce probabilistic complex value relations, which combine the concept of probabilistic relations with the idea of complex values in a uniform framework. We elaborate a model-theoretic definition of probabilistic combination strategies, which has a rigorous foundation in probability theory. We then define an algebra for querying database instances, which comprises the operations of selection, projection, renaming, join, Cartesian product, union, intersection, and difference. We prove that our data model and algebra for probabilistic complex values generalize the classical relational data model and algebra. Moreover, we show that under certain assumptions, all our algebraic operations are tractable. We finally show that most of the query equivalences of classical relational algebra carry over to our algebra on probabilistic complex value relations. Hence, query optimization techniques for classical relational algebra can easily be applied to optimize queries on probabilistic complex value relations.

12.
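In recent years, community detection in complex networks has attracted increasing attention from researchers, and many methods have been proposed. Against this background, Li et al. recently proposed a function for evaluating community quality, called the modularity density function (the D value). Under this function, higher D values correspond to better community structure; however, optimizing the function is an NP-hard problem. Through semi-supervised clustering optimization of the modularity density function D, this paper demonstrates the equivalence between semi-supervised clustering of the modularity density function and the kernel k-means method, and proposes a new semi-supervised kernel clustering method for detecting communities in complex networks. The algorithm is tested on a classic computer-generated random network and compared with the direct kernel method based on modularity density. In particular, the experimental results show that the new algorithm is effective at detecting communities in complex networks when the community structure becomes fuzzy.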

13.
An algebra for probabilistic databases   (Total citations: 3; self-citations: 0; citations by others: 3)
An algebra is presented for a simple probabilistic data model that may be regarded as an extension of the standard relational model. The probabilistic algebra is developed in such a way that (restricted to α-acyclic database schemes) the relational algebra is a homomorphic image of it. Strictly probabilistic results are emphasized. Variations on the basic probabilistic data model are discussed. The algebra is used to explicate a commonly used statistical smoothing procedure and is shown to be potentially very useful for decision support with uncertain information.

14.
Data clustering is typically considered a subjective process, which makes it problematic. For instance, how does one make statistical inferences based on clustering? The matter is different with pattern classification, for which two fundamental characteristics can be stated: (1) the error of a classifier can be estimated using “test data,” and (2) a classifier can be learned using “training data.” This paper presents a probabilistic theory of clustering, including both learning (training) and error estimation (testing). The theory is based on operators on random labeled point processes. It includes an error criterion in the context of random point sets and representation of the Bayes (optimal) cluster operator for a given random labeled point process. Training is illustrated using a nearest-neighbor approach, and trained cluster operators are compared to several classical clustering algorithms.

15.
In recent years, the expansion of acquisition devices such as digital cameras, the development of storage and transmission techniques for multimedia documents, and the development of tablet computers have facilitated the creation of many large image databases as well as interaction with users. This increases the need for efficient and robust methods for finding information in these huge masses of data, including feature extraction methods and feature space structuring methods. Feature extraction methods aim to extract, for each image, one or more visual signatures representing the content of the image. Feature space structuring methods organize indexed images in order to facilitate, accelerate and improve the results of later retrieval. Clustering is one kind of feature space structuring method. There are different types of clustering, such as hierarchical clustering, density-based clustering and grid-based clustering. In an interactive context where the user may modify the automatic clustering results, incrementality and hierarchical structuring are properties of growing interest for clustering algorithms. In this article, we propose an experimental comparison of different clustering methods for structuring large image databases, using a rigorous experimental protocol. We use image databases of increasing sizes (Wang, PascalVoc2006, Caltech101, Corel30k) to study the scalability of the different approaches.

16.
Non-negative matrix factorization for semi-supervised data clustering   (Total citations: 3; self-citations: 6; citations by others: 3)
Traditional clustering algorithms are inapplicable to many real-world problems where limited knowledge from domain experts is available. Incorporating the domain knowledge can guide a clustering algorithm, consequently improving the quality of clustering. In this paper, we propose SS-NMF: a semi-supervised non-negative matrix factorization framework for data clustering. In SS-NMF, users are able to provide supervision for clustering in terms of pairwise constraints on a few data objects specifying whether they “must” or “cannot” be clustered together. Through an iterative algorithm, we perform symmetric tri-factorization of the data similarity matrix to infer the clusters. Theoretically, we show the correctness and convergence of SS-NMF. Moreover, we show that SS-NMF provides a general framework for semi-supervised clustering. Existing approaches can be considered as special cases of it. Through extensive experiments conducted on publicly available datasets, we demonstrate the superior performance of SS-NMF for clustering.
Ming Dong

17.
Approximating clusters in very large (VL=unloadable) data sets has been considered from many angles. The proposed approach has three basic steps: (i) progressive sampling of the VL data, terminated when a sample passes a statistical goodness of fit test; (ii) clustering the sample with a literal (or exact) algorithm; and (iii) non-iterative extension of the literal clusters to the remainder of the data set. Extension accelerates clustering on all (loadable) data sets. More importantly, extension provides feasibility, that is, a way to find (approximate) clusters, for data sets that are too large to be loaded into the primary memory of a single computer. A good generalized sampling and extension scheme should be effective for acceleration and feasibility using any extensible clustering algorithm. A general method for progressive sampling in VL sets of feature vectors is developed, and examples are given that show how to extend the literal fuzzy (c-means) and probabilistic (expectation-maximization) clustering algorithms onto VL data. The fuzzy extension is called the generalized extensible fast fuzzy c-means (geFFCM) algorithm and is illustrated using several experiments with mixtures of five-dimensional normal distributions.
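Step (iii), the non-iterative extension, can be sketched for the fuzzy case as follows. The sketch assumes the cluster centers have already been found on the sample and applies the standard fuzzy c-means membership formula once to each remaining point. It is not the geFFCM implementation; the function names and the handling of zero distances are assumptions.

```python
import math

def fcm_memberships(point, centers, m=2.0):
    """Standard fuzzy c-means memberships of one point for fixed centers.

    u_i = 1 / sum_j (d_i / d_j) ** (2 / (m - 1)), with m > 1 the fuzzifier.
    A point lying exactly on one or more centers shares full membership
    among those centers (an assumed convention).
    """
    dists = [math.dist(point, c) for c in centers]
    zeros = [i for i, d in enumerate(dists) if d == 0.0]
    if zeros:
        return [1.0 / len(zeros) if i in zeros else 0.0
                for i in range(len(centers))]
    power = 2.0 / (m - 1.0)
    return [1.0 / sum((d_i / d_j) ** power for d_j in dists) for d_i in dists]

def extend(points, centers, m=2.0):
    """Single pass over the unsampled data: no re-estimation of the centers."""
    return [fcm_memberships(p, centers, m) for p in points]

centers = [(0.0, 0.0), (4.0, 0.0)]          # assumed output of clustering the sample
memberships = extend([(1.0, 0.0), (0.0, 0.0)], centers)
```

Each point is processed independently, so the remainder of the data can be streamed from disk in chunks rather than loaded at once.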

18.
19.
Many applications require the management of spatial data in a multidimensional feature space. Clustering large spatial databases is an important problem, which tries to find the densely populated regions in the feature space to be used in data mining, knowledge discovery, or efficient information retrieval. A good clustering approach should be efficient and detect clusters of arbitrary shape. It must be insensitive to the noise (outliers) and the order of input data. We propose WaveCluster, a novel clustering approach based on wavelet transforms, which satisfies all the above requirements. Using the multiresolution property of wavelet transforms, we can effectively identify arbitrarily shaped clusters at different degrees of detail. We also demonstrate that WaveCluster is highly efficient in terms of time complexity. Experimental results on very large datasets are presented, which show the efficiency and effectiveness of the proposed approach compared to the other recent clustering methods. Received June 9, 1998 / Accepted July 8, 1999

20.
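By learning from and classifying labeled and unlabeled data, a semi-supervised fuzzy clustering method based on an improved differential evolution algorithm is proposed. A small portion of a large data set is first selected and labeled, and the labeled data are then used to guide the evolutionary process so as to classify the unlabeled data. Borrowing the inertia-weight idea from particle swarm optimization, an inertia weighting coefficient is introduced that maintains the diversity of individuals in the early stage of computation and accelerates convergence in the later stage, effectively improving the algorithm's performance. Experimental results on remote sensing image data show that the method can improve classification accuracy.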
