1.
We propose an approximate computation technique for inter-object distances of binary data sets. Our approach is based on locality-sensitive hashing. We randomly select a number of projections of the data set and group objects into buckets based on the hash values of these projections. For each pair of objects, occurrences in the same bucket are counted, and the exact Hamming distance is approximated from the number of co-occurrences across all buckets. We parallelize the computation using two main schemes. The first assigns each random subspace to a processor for calculating a local co-occurrence matrix; the local co-occurrence matrices are then combined into the final co-occurrence matrix. The second method provides the same distance approximation in longer runtimes by limiting the total message size in a parallel computing environment, which is especially useful for very large data sets that generate immense message traffic. Our methods produce very accurate results, scale well with the number of objects, and tolerate processor failures. Experimental evaluations on supercomputers and on workstations with several processors demonstrate the usefulness of our methods.
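The bucket-counting idea can be sketched for a single pair of binary vectors. This is a minimal illustration assuming bit-sampling projections; the function name and parameters are mine, and the actual method counts co-occurrences for all pairs at once and distributes the subspaces across processors:

```python
import random

def approx_hamming(a, b, n_proj=200, k=4, seed=0):
    """Estimate the Hamming distance between two equal-length binary
    vectors via bit-sampling LSH: for each random k-bit projection,
    record whether the two vectors hash to the same bucket, then invert
    the collision probability (1 - d/n)**k to recover the distance d."""
    n = len(a)
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_proj):
        idx = [rng.randrange(n) for _ in range(k)]  # one random k-bit projection
        if all(a[i] == b[i] for i in idx):          # same bucket in this subspace?
            hits += 1
    f = hits / n_proj                               # observed collision frequency
    if f == 0:
        return n  # no collisions observed: distance is on the order of n
    return round(n * (1 - f ** (1 / k)))
```

More projections (`n_proj`) tighten the estimate at the cost of more hashing work, which is exactly the work the paper spreads over processors.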
2.
Szymon Jaroszewicz, Tobias Scheffer, Dan A. Simovici. Data Mining and Knowledge Discovery, 2009, 18(1): 56-100
We study a discovery framework in which background knowledge on variables and their relations within a discourse area is available in the form of a graphical model. Starting from an initial, hand-crafted or possibly empty graphical model, the network evolves in an interactive process of discovery. We focus on the central step of this process: given a graphical model and a database, we address the problem of finding the most interesting attribute sets. We formalize the interestingness of an attribute set as the divergence between its behavior as observed in the data and the behavior that can be explained by the current model. We derive an exact algorithm that finds all attribute sets whose interestingness exceeds a given threshold. We then consider the case of a very large network that renders exact inference unfeasible, and a very large database or data stream. We devise an algorithm that efficiently finds the most interesting attribute sets with a prescribed approximation bound and confidence probability, even for very large networks and infinite streams. We study the scalability of the methods in controlled experiments; a case study sheds light on the practical usefulness of the approach.
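The divergence between observed and model-explained behavior can be illustrated with a small sketch using Kullback-Leibler divergence; the function names and the use of a callable model are my assumptions, and the paper works with inference in a full graphical model rather than a fixed probability function:

```python
from collections import Counter
from math import log2

def interestingness(rows, attrs, model_prob):
    """KL divergence (in bits) between the empirical joint distribution
    of the attribute set `attrs` in `rows` and the distribution the
    background model predicts (`model_prob` maps value tuples to
    probabilities). Larger values mean the data deviates more from
    what the current model can explain."""
    counts = Counter(tuple(r[a] for a in attrs) for r in rows)
    n = len(rows)
    div = 0.0
    for vals, c in counts.items():
        p = c / n                 # observed probability of this value tuple
        q = model_prob(vals)      # probability the model assigns to it
        div += p * log2(p / q)
    return div
```

For example, if two binary attributes are perfectly correlated in the data but the model assumes independence with uniform marginals, the divergence is 1 bit.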
3.
Alioune Ngom, Corina Reischer, Dan A. Simovici, Ivan Stojmenović. Neural Processing Letters, 2000, 12(1): 71-90
The (n,k,s)-perceptrons partition the input space V ⊆ R^n into s+1 regions using s parallel hyperplanes. Their learning abilities are examined in this research paper. The previously studied homogeneous (n,k,k-1)-perceptron learning algorithm is generalized to the permutably homogeneous (n,k,s)-perceptron learning algorithm with a guaranteed convergence property. We also introduce a high-capacity learning method that learns any permutably homogeneously separable k-valued function given as input.
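The region computed by such a perceptron can be sketched as follows: s parallel hyperplanes w·x = t_1 < ... < t_s split R^n into s+1 regions, and the output is the index of the region containing the input. The weights and thresholds below are hypothetical, and the learning algorithm itself is not shown:

```python
import bisect

def multilevel_perceptron(w, thresholds):
    """(n,k,s)-style multiple-valued perceptron: s parallel hyperplanes
    w.x = t_i partition the input space into s+1 regions; the returned
    function maps an input x to the index (0..s) of its region."""
    ts = sorted(thresholds)
    def f(x):
        a = sum(wi * xi for wi, xi in zip(w, x))  # activation w.x
        return bisect.bisect_right(ts, a)         # region index in 0..s
    return f

# Hypothetical (2,_,2)-perceptron: two thresholds give three regions.
f = multilevel_perceptron([1.0, 1.0], [0.0, 2.0])
```

Because all s hyperplanes share the weight vector w, learning only has to fit w and the s thresholds, which is what the convergence result concerns.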
4.
We introduce purity dependencies as generalizations of functional dependencies in relational databases, starting from the notion of an impurity measure. The impurity measure of a subset of a set relative to a partition of that set, and the relative impurity of two partitions, allow us to define the relative impurity of two attribute sets of a table of a relational database and to introduce purity dependencies. We discuss properties of these dependencies that generalize similar properties of functional dependencies, and we highlight their relevance for approximate classifications. Finally, an algorithm that mines datasets for these dependencies is presented.
Received: 4 July 2000 / 16 November 2001
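A minimal sketch of a relative impurity computation, instantiating the impurity measure with the Gini index (one common choice; the paper defines a general family of impurity measures, and the names here are mine):

```python
from collections import Counter, defaultdict

def gini(counts):
    """Gini impurity of a block, from the sizes of its sub-blocks."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)

def relative_impurity(rows, X, Y):
    """Impurity of the partition induced by attribute set Y inside each
    block of the partition induced by X, weighted by block size.
    A value of 0 means X functionally determines Y; small positive
    values indicate an approximate (purity) dependency."""
    blocks = defaultdict(list)
    for r in rows:
        blocks[tuple(r[a] for a in X)].append(tuple(r[a] for a in Y))
    n = len(rows)
    return sum(len(b) / n * gini(Counter(b).values())
               for b in blocks.values())
```

When the classical functional dependency X -> Y holds, every X-block is pure and the measure collapses to 0, which is how purity dependencies generalize functional ones.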
5.
Evaluation of automatic text summarization is a challenging task due to the difficulty of computing the similarity of two texts. In this paper, we define a new dissimilarity measure, compression dissimilarity, to compute the dissimilarity between documents. We then propose a new automatic evaluation method based on compression dissimilarity. The proposed method is a complete "black box" and needs no preprocessing steps. Experiments show that compression dissimilarity clearly distinguishes automatic summaries from human summaries. The compression dissimilarity measure can evaluate an automatic summary by comparing it with high-quality human summaries, or with its original document. The evaluation results are highly correlated with human assessments, and the correlation between the compression dissimilarity of summaries and the compression dissimilarity of documents can serve as a meaningful measure of the consistency of an automatic text summarization system.
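A compression-based dissimilarity in this spirit can be sketched with the standard normalized compression distance; this is an illustration of the general idea, not necessarily the paper's exact definition of compression dissimilarity:

```python
import zlib

def ncd(x: bytes, y: bytes) -> float:
    """Normalized compression distance: if compressing x and y together
    saves little over compressing them separately, the texts share
    little structure and the distance approaches 1; near-identical
    texts yield a value close to 0."""
    cx, cy, cxy = (len(zlib.compress(s, 9)) for s in (x, y, x + y))
    return (cxy - min(cx, cy)) / max(cx, cy)
```

The appeal for summary evaluation is exactly the "black box" property: the measure needs no tokenization, stemming, or other preprocessing of either text.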
6.
S. Jaroszewicz, D. A. Simovici, W. P. Kuo, L. Ohno-Machado. IEEE Transactions on Biomedical Engineering, 2004, 51(7): 1095-1102
Increasing interest in new pattern recognition methods has been motivated by bioinformatics research. The analysis of gene expression data originating from microarrays constitutes an important application area for classification algorithms and illustrates the need for identifying important predictors. We show that the Goodman-Kruskal coefficient can be used for constructing minimal classifiers for tabular data, and we give an algorithm that constructs such classifiers.
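As an illustration, one standard form of the coefficient, the Goodman-Kruskal lambda, measures the proportional reduction in error when predicting a class from an attribute. This is a sketch; the paper's exact variant of the coefficient and its classifier-construction algorithm are not reproduced here:

```python
from collections import Counter, defaultdict

def goodman_kruskal_lambda(pairs):
    """Goodman-Kruskal lambda for (x, y) observations: the proportional
    reduction in prediction error for the class y when the predictor x
    is known. 1.0 means x predicts y perfectly (a candidate feature for
    a minimal classifier); 0.0 means x is useless for predicting y."""
    n = len(pairs)
    y_counts = Counter(y for _, y in pairs)
    baseline_err = n - max(y_counts.values())  # errors guessing the modal class
    if baseline_err == 0:
        return 1.0  # the class is constant, hence trivially predictable
    by_x = defaultdict(Counter)
    for x, y in pairs:
        by_x[x][y] += 1
    # Errors when guessing the modal class within each x-group.
    cond_err = n - sum(max(c.values()) for c in by_x.values())
    return (baseline_err - cond_err) / baseline_err
```

A greedy search for the smallest attribute set driving such a coefficient to 1 is one natural route to minimal classifiers over tabular (e.g. discretized expression) data.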