Found 20 similar documents; search time: 0 ms
1.
Jeong Hee Hwang 《Journal of Systems and Software》2010,83(7):1267-1274
XML has recently become very popular as a means of representing semistructured data and as a standard for data exchange over the Web, owing to its applicability in numerous domains. XML documents therefore constitute an important data mining domain. In this paper, we propose a new method of XML document clustering based on a global criterion function that considers the weight of common structures. Our approach first extracts representative structures of frequent patterns from schemaless XML documents using a sequential pattern mining algorithm. We then cluster the XML documents by the weight of common structures, without any pairwise similarity measure, treating each XML document as a transaction and the frequent structures extracted from the documents as the items of that transaction. We conducted experiments to compare our method with previous methods; the results show the effectiveness of our approach.
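The transactional view in the abstract can be illustrated with a small sketch. This is an illustrative assumption, not the paper's exact global criterion function: each document is a set of frequent structural patterns, and a document joins the cluster whose commonly shared structures carry the greatest weight, or starts a new cluster otherwise.

```python
# Hedged sketch of transaction-style clustering by weighted common structures.
# Documents are modeled as sets of frequent structural patterns ("items");
# the pattern weights and the join criterion are illustrative assumptions.

def common_weight(doc, cluster_docs, weights):
    """Weighted overlap between a document and the structures shared by a cluster."""
    if not cluster_docs:
        return 0.0
    shared = set.intersection(*cluster_docs)  # structures common to the whole cluster
    return sum(weights.get(s, 1.0) for s in doc & shared)

def cluster_transactions(docs, weights, min_weight=1.0):
    clusters = []  # each cluster is a list of document pattern-sets
    for doc in docs:
        scores = [common_weight(doc, c, weights) for c in clusters]
        best = max(range(len(clusters)), key=lambda i: scores[i], default=None)
        if best is not None and scores[best] >= min_weight:
            clusters[best].append(doc)
        else:
            clusters.append([doc])  # no sufficiently common structure -> new cluster
    return clusters

docs = [{"a/b", "a/c"}, {"a/b", "a/d"}, {"x/y"}, {"x/y", "x/z"}]
result = cluster_transactions(docs, weights={"a/b": 2.0, "x/y": 2.0})
```

Here the two documents sharing the weighted path `a/b` form one cluster and the two sharing `x/y` form another.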
2.
Sergio Greco Francesco Gullo Giovanni Ponti Andrea Tagarelli 《Journal of Computer and System Sciences》2011,77(6):988-1008
Clustering XML documents is extensively used to organize large collections of XML documents into groups that are coherent according to structure and/or content features. The growing availability of distributed XML sources and the variety of high-demand environments raise the need for clustering approaches that can exploit distributed processing techniques. Nevertheless, existing methods for clustering XML documents are designed to work in a centralized way. In this paper, we address the problem of clustering XML documents in a collaborative distributed framework. XML documents are first decomposed into semantically cohesive subtrees, then modeled as transactional data that embed both XML structure and content information. The proposed clustering framework employs a centroid-based partitional clustering method developed for a peer-to-peer network. Each peer in the network computes a local clustering solution over its own data and exchanges its cluster representatives with other peers. The exchanged representatives are used to compute representatives for the global clustering solution in a collaborative way. We evaluated the effectiveness and efficiency of our approach on real XML document collections, varying the number of peers. Results show that major advantages over the corresponding centralized clustering setting are obtained in runtime behavior, while clustering solutions remain accurate even with a moderately low number of nodes in the network. Moreover, the collaborative nature of our approach proved to be a convenient feature in distributed clustering, as found in a comparative evaluation against a distributed non-collaborative clustering method.
3.
Xiaodi Huang Xiaodong Zheng Wei Yuan Fei Wang Shanfeng Zhu 《Information Sciences》2011,181(11):2293-2302
Searching and mining biomedical literature databases are common ways for biomedical researchers to generate scientific hypotheses. Clustering can assist researchers in forming hypotheses by effectively extracting valuable information from grouped documents. Although a large number of clustering algorithms are available, this paper attempts to answer the question of which algorithm is best suited to accurately clustering biomedical documents. Non-negative matrix factorization (NMF) has been widely applied to clustering general text documents; however, the clustering results are sensitive to the initial values of the parameters of NMF. To overcome this drawback, we present ensemble NMF for clustering biomedical documents. The performance of ensemble NMF was evaluated on numerous datasets generated from the TREC Genomics track dataset. On most datasets, the experimental results demonstrate that ensemble NMF significantly outperforms the classical clustering algorithms of bisecting K-means and hierarchical clustering. We compared four different methods for constructing an ensemble NMF. This research is the first to compare ensemble NMF with typical classical clustering algorithms for clustering biomedical documents, and it validates ensemble NMF constructed from different graph-based ensemble algorithms. It is also the first work to apply ensemble NMF with Hybrid Bipartite Graph Formulation to clustering biomedical documents.
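One standard way to combine several base clusterings into an ensemble (used here as an illustrative stand-in for the graph-based constructions the abstract compares, not the paper's own methods) is a co-association matrix: items that land in the same cluster in enough base NMF runs are linked, and the connected components of those links form the consensus clusters. The base labelings and threshold below are made up.

```python
# Hedged sketch of consensus clustering from several base labelings via a
# co-association matrix: items co-clustered in at least `theta` of the runs
# are linked, and connected components give the consensus clusters.
from itertools import combinations

def consensus(labelings, n, theta=0.5):
    runs = len(labelings)
    # co-association: fraction of runs in which items i and j share a label
    link = {(i, j): sum(lab[i] == lab[j] for lab in labelings) / runs
            for i, j in combinations(range(n), 2)}
    # union-find over strongly co-associated pairs
    parent = list(range(n))
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x
    for (i, j), w in link.items():
        if w >= theta:
            parent[find(i)] = find(j)
    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())

# three noisy base runs over 6 items; items 0-2 and 3-5 mostly co-occur
labelings = [[0, 0, 0, 1, 1, 1], [1, 1, 1, 0, 0, 0], [0, 0, 1, 1, 1, 1]]
print(consensus(labelings, 6))  # [[0, 1, 2], [3, 4, 5]]
```

Note that label values need not agree across runs; only co-membership matters, which is what makes consensus robust to NMF's sensitivity to initialization.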
4.
Anomaly detection in web documents using crisp and fuzzy-based cosine clustering methodology
Cluster analysis is a primary tool for detecting anomalous behavior in real-world data such as web documents, medical records of patients, or other personal data. Most existing methods for document clustering are based on the classical vector-space model, which represents each document by a fixed-size vector of weighted key terms, often referred to as key phrases. Since vector representations of documents are frequently very sparse, inverted files are used to prevent the tremendous computational overload that may arise in large and diverse document collections such as pages downloaded from the World Wide Web. To reduce computation costs and space complexity, many popular methods for clustering web documents, including those using inverted files, assume a relatively small, prefixed number of clusters. We propose several new crisp and fuzzy approaches based on the cosine similarity principle for clustering documents represented by variable-size vectors of key phrases, without limiting the final number of clusters. Each entry in a vector consists of two fields: the first refers to a key phrase in the document and the second denotes an importance weight associated with this key phrase within the particular document. Removing the restriction on the total number of clusters may moderately increase computing costs, but it improves the method's performance in classifying incoming vectors as normal or abnormal, based on their similarity to the existing clusters. All the procedures presented in this work share two features: (a) the number of clusters is not restricted to some relatively small, prefixed number, i.e., an arbitrary new incoming vector that is not similar to any of the existing cluster centers necessarily starts a new cluster; and (b) a vector with multiple appearances n in the training set is counted as n distinct vectors rather than a single vector.
These features are the main reasons for the high-quality performance of the proposed algorithms. We describe them in detail and show their implementation in a real-world application from the area of web activity monitoring, in particular, detecting anomalous documents downloaded from the Internet by users with abnormal information interests.
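A minimal sketch of the leader-style procedure the abstract describes, assuming sparse `{key_phrase: weight}` dictionaries, a cosine threshold of 0.6, and additive centre updates (all illustrative choices, not the paper's exact crisp or fuzzy procedures):

```python
# Hedged sketch: a vector not similar enough (cosine) to any existing centre
# starts a new cluster, so the number of clusters is never fixed in advance.
import math

def cosine(u, v):
    dot = sum(w * v.get(k, 0.0) for k, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def cluster(docs, threshold=0.6):
    centres = []          # one sparse centre per cluster
    labels = []
    for doc in docs:
        sims = [cosine(doc, c) for c in centres]
        if sims and max(sims) >= threshold:
            i = sims.index(max(sims))
            for k, w in doc.items():        # fold the document into its centre
                centres[i][k] = centres[i].get(k, 0.0) + w
            labels.append(i)
        else:                               # dissimilar vector -> new cluster
            centres.append(dict(doc))
            labels.append(len(centres) - 1)
    return labels

docs = [{"xml": 1.0, "cluster": 0.5},
        {"xml": 0.8, "cluster": 0.6},
        {"virus": 1.0, "malware": 0.9}]
print(cluster(docs))  # [0, 0, 1]
```

Since centres stay as sparse dictionaries, adding a document only touches the key phrases it actually contains, mirroring the sparse-vector setting described above.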
5.
Claudio Gennaro 《Multimedia Tools and Applications》2008,36(3):185-201
Although the Metadata Editor is an important part of any digital library, it becomes fundamental in the presence of audiovisual content. This is because the metadata produced by automated support tools (such as speech recognizers and shot detection procedures) is error-prone and often needs correction; in addition, scenes are annotated manually. This paper describes Regia, a prototype application for manually editing metadata for audiovisual documents, developed in the ECHO project. Regia allows the user to manually edit textual metadata and to hierarchically organize the segmentation of the audiovisual content. An important feature of this metadata editor is that it is not hard-wired to a particular set of metadata attributes; to achieve this, the editor uses the XML schema of the metadata model as a configuration file.
6.
An automatic reassembly algorithm for shredded document fragments based on dynamic clustering
To reconstruct documents shredded under the three cutting modes of a paper shredder, this paper proposes an automatic reassembly algorithm for document fragments based on dynamic clustering. A matching-degree matrix is defined to compute the most plausible join between two fragments, and a dynamic row-clustering algorithm based on fragment feature vectors performs a preliminary grouping into rows. This preliminary grouping is then adjusted and corrected using text feature lines and the computed line spacing to determine the final row assignment and row order. Finally, a proposed dynamic four-neighbor matching algorithm produces the reassembled result. Experiments show that the method is simple to implement, has a high success rate, and quickly reconstructs documents for all three shredding modes.
7.
Rıdvan Saraçoğlu Kemal Tütüncü Novruz Allahverdi 《Expert Systems with Applications》2008,34(4):2545-2554
Searching for similar documents plays an important role in text mining and document management. Both in similar-document search and in other text mining applications, the focus is generally on document classification, i.e., determining the class or category to which a document belongs. The aim of the present study is to investigate the case in which documents belong to more than one category. The system used in the present study is a similar-document search system based on fuzzy clustering, extended to handle documents that belong to multiple categories. The proposed approach solves the multi-category problem in two stages: the first stage identifies the documents that belong to more than one category; the second determines the categories to which these documents belong. For these two tasks, the threshold Fuzzy Similarity Classification Method (FSCM) and the Multiple Categories Vector Method (MCVM) are proposed, in that order. Experimental results showed that the proposed system can efficiently distinguish documents belonging to more than one category. Regarding determining which documents belong to which classes, the proposed system performs better than the traditional approach.
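The two-stage idea can be sketched as follows, with a deliberately simple overlap-based membership and closeness threshold standing in for the paper's FSCM/MCVM details (both are assumptions for illustration):

```python
# Hedged sketch: fuzzy memberships are computed per category, and a document
# whose top memberships lie close together (within `threshold`) is treated as
# multi-category and assigned all such categories.

def memberships(doc, category_terms):
    """Simple fuzzy membership: Jaccard overlap with each category's term set."""
    return {c: len(doc & terms) / len(doc | terms)
            for c, terms in category_terms.items()}

def assign(doc, category_terms, threshold=0.15):
    mu = memberships(doc, category_terms)
    top = max(mu.values())
    # stage 1 + 2: keep every category whose membership is near the maximum
    return sorted(c for c, m in mu.items() if top - m <= threshold and m > 0)

cats = {"sports": {"game", "team", "score"},
        "finance": {"market", "stock", "score"}}
doc = {"game", "score", "market"}
print(assign(doc, cats))  # ['finance', 'sports'] -> a multi-category document
```

A document whose memberships are dominated by a single category would get exactly one label from the same rule.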
8.
Research on extracting Chinese information structures based on HowNet (《知网》)
This paper proposes a method for extracting Chinese information structures from real-world text: a large-scale corpus annotated with semantic dependency relations is used to train the Chinese information structure patterns of HowNet, and these probabilistic patterns then serve as rules for building a partial dependency parser, which extracts from real text, to the greatest extent possible, phrases that conform to HowNet's definition of Chinese information structures. Besides being a useful complement to the semantic-dependency language model to be built, this research has broad application prospects in text understanding, dialogue systems, and even stress prediction and prosody modeling in speech synthesis.
9.
Ontology is playing an increasingly important role in knowledge management and the Semantic Web. This study presents a novel episode-based ontology construction mechanism to extract domain ontology from unstructured text documents. Additionally, fuzzy numbers for conceptual similarity computing are presented for concept clustering and taxonomic relation definitions. Moreover, concept attributes and operations can be extracted from episodes to construct a domain ontology, while non-taxonomic relations can be generated from episodes. The fuzzy inference mechanism is also applied to obtain new instances for ontology learning. Experimental results show that the proposed approach can effectively construct a Chinese domain ontology from unstructured text documents.
10.
Clustering is a common technique for processing gene expression data, but it is also subjective, and choosing a clustering algorithm that matches the intrinsic distribution of the data remains an open problem. Empirically, once the optimal number of clusters k is chosen, a suitable clustering algorithm yields stable results when the target data are clustered repeatedly. This paper therefore proposes stability-based clustering algorithm selection, which combines between-cluster separation, within-cluster compactness, and the stability of the clustering results. Validation and application on three datasets show that stability-based algorithm selection is more objective and reliable than traditional evaluation methods.
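A minimal sketch of stability scoring, assuming the Rand index as the agreement measure between repeated clusterings (the abstract does not specify the measure, so this choice is an illustrative assumption):

```python
# Hedged sketch: score an algorithm by the mean pairwise Rand index of the
# labelings it produces over repeated runs; higher means more stable.
from itertools import combinations

def rand_index(a, b):
    """Fraction of item pairs on which two labelings agree (together/apart)."""
    n = len(a)
    agree = sum((a[i] == a[j]) == (b[i] == b[j])
                for i, j in combinations(range(n), 2))
    return agree / (n * (n - 1) / 2)

def stability(labelings):
    pairs = list(combinations(labelings, 2))
    return sum(rand_index(a, b) for a, b in pairs) / len(pairs)

stable = [[0, 0, 1, 1], [0, 0, 1, 1], [1, 1, 0, 0]]   # same partition each run
unstable = [[0, 0, 1, 1], [0, 1, 0, 1], [0, 1, 1, 0]]
print(stability(stable), stability(unstable))
```

Note that the Rand index compares co-membership, not label values, so relabeled but identical partitions (as in `stable`) still score 1.0.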
11.
Ali Aïtelhadj Mohand Boughanem Mohamed Mezghiche Fatiha Souam 《Knowledge and Information Systems》2012,32(1):109-139
In this paper, we describe a method for clustering XML documents whose goal is to group documents sharing similar structures. Our approach proceeds in two steps: we first automatically extract the structure of each XML document to be classified; this extracted structure is then used as a representation model to classify the corresponding XML document. The idea behind the clustering is that if XML documents share similar structures, they are more likely to correspond to the structural part of the same query. For experimentation purposes, we tested our algorithms on both real (ACM SIGMOD Record corpus) and synthetic data. The results clearly demonstrate the interest of our approach.
12.
13.
In this paper, we describe experimental methods for recognizing the document structures of various types of documents within the framework of document understanding; that is, we interpret document structures using individually characterized document knowledge. The document understanding process is divided into three procedures: the first is the recognition of document structures from a two-dimensional point of view; the second is the recognition of item relationships from a one-dimensional point of view; and the third is the recognition of characters from a zero-dimensional point of view. The procedure for recognizing structures plays the most important role in document understanding; it extracts and classifies the logical item blocks from paper-based documents distinctly.
We discuss the structure recognition methods for three classes of documents: 1) table-form documents, filled-in forms, cataloging lists, etc., where each item block is surrounded by horizontal and vertical line segments; 2) library cataloging cards, name cards, letters, etc., where each item block is separated by spaces; and 3) newspapers, pamphlets, etc., where each item block is constructed hierarchically by combination under roughly specified layouts. The structure recognition procedure is characterized by individual recognition methods: in class 1 documents, binary trees indicating the connective relationships among neighboring item blocks, which are surrounded by line segments; in class 2 documents, binary trees defining the spatial and geometric relationships among neighboring item blocks, which are separated by spaces; and in class 3 documents, composition rules specifying the constructive relationships among neighboring item blocks, which are represented by adjacency relationship graphs. The methods are effective under the knowledge-based framework and complementarily integrate top-down (model-driven) and bottom-up (data-driven) approaches. Of course, the means of integration vary according to document class.
14.
Sudip Seal 《Information Processing Letters》2005,93(3):143-147
Microarrays are used for measuring the expression levels of thousands of genes simultaneously. Clustering algorithms are applied to gene expression data to find co-regulated genes. An often-used clustering strategy is the Pearson correlation coefficient based hierarchical clustering algorithm presented in [Proc. Nat. Acad. Sci. 95 (25) (1998) 14863-14868], which takes O(N³) time. We note that this run time can be reduced to O(N²) by applying known hierarchical clustering algorithms [Proc. 9th Annual ACM-SIAM Symposium on Discrete Algorithms, 1998, pp. 619-628] to this problem. In this paper, we present an algorithm that runs in O(N log N) time using a geometric reduction, and we show that it is optimal.
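The hierarchical clustering the note discusses is built on the Pearson correlation coefficient; a common form of the corresponding distance is 1 − r. A minimal sketch with made-up expression profiles:

```python
# Hedged sketch of the Pearson-correlation distance for gene expression
# profiles: co-regulated genes have distance near 0, anti-correlated near 2.
import math

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def correlation_distance(x, y):
    return 1.0 - pearson(x, y)

g1 = [1.0, 2.0, 3.0, 4.0]
g2 = [2.0, 4.0, 6.0, 8.0]       # perfectly co-regulated with g1
g3 = [4.0, 3.0, 2.0, 1.0]       # anti-correlated with g1
print(correlation_distance(g1, g2))  # ~0.0
print(correlation_distance(g1, g3))  # ~2.0
```

Because the distance depends only on the shape of a profile, not its scale, g1 and g2 are at distance ~0 despite their different magnitudes.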
15.
Because the initial number of classes in the EM algorithm is hard to determine and the iteration often converges to a local optimum, this paper combines the K-means algorithm with EM-based clustering and proposes a new model-based clustering method for gene expression data. The new method first exploits the global character and efficiency of K-means to quickly obtain an initial partition, which is used to set the initial parameter values of a Gaussian mixture model; EM clustering is then applied to obtain the final clustering result. In two experiments on real datasets, the new algorithm was compared with K-means and with EM. The results show that the new algorithm is an effective clustering method with improved clustering accuracy.
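A hedged one-dimensional sketch of the combination described above: a quick K-means pass supplies initial means, variances, and weights for a Gaussian mixture, which EM then refines. The data, the spread-based initialisation, and the iteration counts are illustrative assumptions, not the paper's settings.

```python
# Hedged sketch: K-means initialises a 1-D Gaussian mixture, EM refines it.
import math, random

def kmeans_1d(xs, k, iters=10):
    srt = sorted(xs)
    centres = [srt[i * (len(xs) - 1) // (k - 1)] for i in range(k)]  # spread init
    groups = [[] for _ in range(k)]
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for x in xs:
            groups[min(range(k), key=lambda i: abs(x - centres[i]))].append(x)
        centres = [sum(g) / len(g) if g else centres[i] for i, g in enumerate(groups)]
    return groups

def em_gmm_1d(xs, k, iters=20):
    groups = kmeans_1d(xs, k)                      # K-means initialisation
    mu = [sum(g) / len(g) for g in groups]
    var = [max(sum((x - m) ** 2 for x in g) / len(g), 1e-6)
           for g, m in zip(groups, mu)]
    w = [len(g) / len(xs) for g in groups]
    for _ in range(iters):                         # EM refinement
        resp = []
        for x in xs:                               # E-step: responsibilities
            p = [w[j] / math.sqrt(2 * math.pi * var[j]) *
                 math.exp(-(x - mu[j]) ** 2 / (2 * var[j])) for j in range(k)]
            s = sum(p)
            resp.append([pj / s for pj in p])
        for j in range(k):                         # M-step: update parameters
            nj = sum(r[j] for r in resp)
            mu[j] = sum(r[j] * x for r, x in zip(resp, xs)) / nj
            var[j] = max(sum(r[j] * (x - mu[j]) ** 2
                             for r, x in zip(resp, xs)) / nj, 1e-6)
            w[j] = nj / len(xs)
    return sorted(mu)

random.seed(0)
xs = [random.gauss(0.0, 0.5) for _ in range(100)] + \
     [random.gauss(5.0, 0.5) for _ in range(100)]
means = em_gmm_1d(xs, 2)  # recovers means near 0 and 5
```

The K-means pass gives EM a partition-based starting point instead of a random one, which is the idea the abstract credits for avoiding poor local optima.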
16.
Weiling Cai Songcan Chen Daoqiang Zhang 《Pattern recognition》2009,42(7):1248-1259
Traditional pattern recognition generally involves two tasks: unsupervised clustering and supervised classification. When class information is available, fusing the advantages of both clustering learning and classification learning into a single framework is an important problem worthy of study. To date, most algorithms treat clustering learning and classification learning in a sequential or two-step manner, i.e., they first execute clustering learning to explore structures in the data and then perform classification learning on top of the obtained structural information. However, such sequential algorithms cannot guarantee simultaneous optimality for both clustering and classification learning: the clustering learning in these algorithms merely aids the subsequent classification learning and does not benefit from the latter. To overcome this problem, a simultaneous learning framework for clustering and classification (SCC) is presented in this paper. SCC aims to achieve three goals: (1) acquiring robust classification and clustering simultaneously; (2) designing an effective and transparent classification mechanism; and (3) revealing the underlying relationship between clusters and classes. To this end, using Bayesian theory and the cluster posterior probabilities of classes, we define a single objective function into which the clustering process is directly embedded. By optimizing this objective function, effective and robust clustering and classification results are achieved simultaneously. Experimental results on both synthetic and real-life datasets show that SCC achieves promising classification and clustering results at once.
17.
Compared with traditional hard-partition clustering, fuzzy clustering algorithms (FCM, for example) are robust to scale changes in the data and reflect the actual relationship between data points and cluster centers more accurately, and they are now widely used. For time-series gene expression data, however, traditional clustering algorithms often fail to exploit the dynamic temporal correlations in the data. An autoregressive (AR) model can therefore be introduced on top of fuzzy clustering, so that time-series gene expression data are analyzed dynamically as a set of time series. This not only makes full use of the internal autocorrelation of time-series gene expression data, but also allows the membership function to apply a fuzzy adjustment to the AR model's prediction process, yielding better clustering results.
18.
As an effective complement to high-throughput screening, virtual screening is increasingly widely used. When the structure of the target molecule is unknown, ligand-based virtual screening methods are typically applied, in which similarity methods play a very important role. This work performs hierarchical agglomerative clustering on a database of active compounds from traditional Chinese medicine (TCM). Chemoinformatics systems offer many distance/similarity measures and similarity coefficients; for chemical structure representation and feature selection, the widely used Daylight molecular fingerprints were adopted, and the CDK toolkit was used to compute the Tanimoto coefficient on Daylight fingerprints as the molecular similarity measure. Hierarchical agglomerative clustering was applied to the TCM database, with domain knowledge of chemical structures used to preprocess the data before clustering. A similarity threshold of 0.75 was set for clustering. The penalty value of the Kelley method was computed during hierarchical clustering to obtain the most appropriate number of clusters, which turned out to be very close to the number obtained with the 0.75 similarity threshold. For each cluster containing multiple compounds, several representative compounds were selected, and the shortcomings of the Tanimoto coefficient were analyzed based on the clustering results. Future work may include molecular scaffold analysis and diversity analysis of the TCM database, as well as scaffold-based clustering.
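The Tanimoto coefficient at the heart of the similarity measure can be sketched with plain Python bit-sets standing in for CDK/Daylight fingerprints (an illustrative substitution); the 0.75 threshold is the one quoted in the abstract.

```python
# Hedged sketch of the Tanimoto coefficient on binary fingerprints,
# represented here as sets of on-bit positions.
def tanimoto(fp_a, fp_b):
    """|A n B| / |A u B| for sets of on-bits."""
    inter = len(fp_a & fp_b)
    union = len(fp_a | fp_b)
    return inter / union if union else 1.0

fp1 = {1, 5, 9, 12, 20}
fp2 = {1, 5, 9, 12, 21}       # differs from fp1 in a single bit
fp3 = {2, 6, 14}
print(tanimoto(fp1, fp2))      # 4/6, below the 0.75 clustering threshold
print(tanimoto(fp1, fp1))      # identical fingerprints -> 1.0
print(tanimoto(fp1, fp3))      # disjoint fingerprints -> 0.0
```

Under a 0.75 threshold, fp1 and fp2 would not be clustered together despite differing in only one bit, which hints at the kind of sensitivity the abstract's critique of the Tanimoto coefficient concerns.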
19.
The processing and management of XML data are popular research issues. However, operations based on the structure of XML data have not received strong attention. These operations involve, among others, the grouping of structurally similar XML documents. Such grouping results from the application of clustering methods with distances that estimate the similarity between tree structures. This paper presents a framework for clustering XML documents by structure. Modeling the XML documents as rooted ordered labeled trees, we study the usage of structural distance metrics in hierarchical clustering algorithms to detect groups of structurally similar XML documents. We suggest the usage of structural summaries for trees to improve the performance of the distance calculation and at the same time to maintain or even improve its quality. Our approach is tested using a prototype testbed.
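A minimal sketch of structure-only comparison, using label-path sets as a simplified stand-in for the paper's structural summaries and Jaccard distance in place of its tree distance metrics (both are assumptions for illustration):

```python
# Hedged sketch: reduce each XML document (modeled as a labeled tree) to the
# set of its root-to-node label paths, then compare summaries by Jaccard
# distance. Trees are (label, [children]) tuples.
def paths(tree, prefix=""):
    """Return the set of root-to-node label paths of a (label, children) tree."""
    label, children = tree
    here = prefix + "/" + label
    out = {here}
    for child in children:
        out |= paths(child, here)
    return out

def structural_distance(t1, t2):
    p1, p2 = paths(t1), paths(t2)
    return 1.0 - len(p1 & p2) / len(p1 | p2)

doc_a = ("article", [("title", []), ("author", []), ("body", [("sec", [])])])
doc_b = ("article", [("title", []), ("author", []), ("body", [])])
doc_c = ("product", [("price", []), ("stock", [])])
print(structural_distance(doc_a, doc_b))   # small: mostly shared structure
print(structural_distance(doc_a, doc_c))   # 1.0: no structure in common
```

Collapsing a document to a path-set summary discards repeated siblings, which is the same performance idea as the paper's structural summaries: the distance is computed on a much smaller object than the full tree.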
20.
To effectively improve the quality and efficiency of text clustering, and building on an analysis of existing hierarchical clustering and K-means algorithms, a text clustering algorithm for high-dimensional sparse similarity matrices was designed and implemented, targeting the large data volume and high real-time requirements of Internet information processing. The algorithm combines the ideas of hierarchical clustering and K-means clustering, uses a threshold to control the choice of clustering algorithm and the creation of new clusters, and performs text clustering through text feature extraction and document similarity matrix computation. Experimental results show that the algorithm achieves higher recall and precision.