首页 | 本学科首页   官方微博 | 高级检索  
相似文献
 共查询到20条相似文献,搜索用时 31 毫秒
1.
当前专利是按照领域划分的,而基于功效特征可以实现跨领域的专利聚类,这在企业创新设计中具有重要意义,而精确提取专利功效特征和快速获得最优聚类结果是其中的关键任务。为此提出一种信息实体语义增强表示(ERNIE)和卷积神经网络(CNN)相结合的功效特征联合提取(FEI-Joint)模型来提取专利文献的功效特征,并且改进自组织神经网络(SOM)算法,从而提出具有早期拒绝策略与类合并思想的自组织神经网络(ERCM-SOM)来实现基于功效特征的专利聚类。对FEI-Joint模型与TF-IDF、狄利克雷分布(LDA)、CNN在特征提取后的聚类效果上进行比较和分析,结果表明其F-measure值比其他模型有明显提高。ERCM-SOM算法与K-Means算法、SOM算法相比,在F-measure值提高的同时,其时间较SOM算法有明显缩短。对比使用专利分类号(IPC)的专利分类,采用基于功效特征的聚类方法可实现跨领域的专利聚类效果,为设计者借鉴其他领域的设计方法奠定了基础。  相似文献   

2.
This correspondence presents a novel hierarchical clustering approach for knowledge document self-organization, particularly for patent analysis. Current keyword-based methodologies for document content management tend to be inconsistent and ineffective when partial meanings of the technical content are used for cluster analysis. Thus, a new methodology to automatically interpret and cluster knowledge documents using an ontology schema is presented. Moreover, a fuzzy logic control approach is used to match suitable document cluster(s) for given patents based on their derived ontological semantic webs. Finally, three case studies are used to test the approach. The first test case analyzed and clustered 100 patents for chemical and mechanical polishing retrieved from the World Intellectual Property Organization (WIPO). The second test case analyzed and clustered 100 patent news articles retrieved from online Web sites. The third case analyzed and clustered 100 patents for radio-frequency identification retrieved from WIPO. The results show that the fuzzy ontology-based document clustering approach outperforms the K-means approach in precision, recall, F-measure, and Shannon's entropy.  相似文献   

3.
Many studies on developing technologies have been published as articles, papers, or patents. We use and analyze these documents to find scientific and technological trends. In this paper, we consider document clustering as a method of document data analysis. In general, we have trouble analyzing documents directly because document data are not suitable for statistical and machine learning methods of analysis. Therefore, we have to transform document data into structured data for analytical purposes. For this process, we use text mining techniques. The structured data are very sparse, and hence, it is difficult to analyze them. This study proposes a new method to overcome the sparsity problem of document clustering. We build a combined clustering method using dimension reduction and K-means clustering based on support vector clustering and Silhouette measure. In particular, we attempt to overcome the sparseness in patent document clustering. To verify the efficacy of our work, we first conduct an experiment using news data from the machine learning repository of the University of California at Irvine. Second, using patent documents retrieved from the United States Patent and Trademark Office, we carry out patent clustering for technology forecasting.  相似文献   

4.
Structured documents have gained popularity with the advent of documentstructure markupstandards such as SGML, ODA, HyTime, and HTML.Document management systems can provide powerful facilities by maintaining thestructure information of documents.Since the hypermediadocument is also a kind of structured document, wecan apply the results of many studies, whichhave been performed in storing, retrieving, and managing structured documents,to the hypermedia document management.However, more factors should be considered in handling hypermedia documentsbecause they contain multimedia data and also have multiple complex structuressuch as hyperlink networks and spatial/temporal layout structures as well aslogical structures.In this paper, we propose an object-oriented model for multi-structuredhypermediadocuments and multimedia data, and a query language for retrievinghypermedia document elements based on the content and multiple complexstructures.By using unique element identifiers and an indexing scheme whichexploits multiple structures,we can process queries efficiently with minimal storage overheadfor maintaining structure information.  相似文献   

5.
A plethora of patents are approved by the patent officers each year and current patent systems face a solemn quandary of evaluating these patents’ qualities. Traditional researchers and analyzers have fixated on developing sundry patent quality indicators only, but these indicators do not have further prognosticating power on incipient patent applications or publications. Therefore, the data mining (DM) approaches are employed in this article to identify and to classify the new patent's quality in time. An automatic patent quality analysis and classification system, namely SOM-KPCA-SVM, is developed according to patent quality indicators and characteristics, respectively. First, the self-organizing map (SOM) approach is used to cluster patents published before into different quality groups according to the patent quality indicators and defines group quality type instead of via experts. The kernel principal component analysis (KPCA) approach is used to transform nonlinear feature space in order to improve classification performance. Finally, the support vector machine (SVM) is used to build up the patent quality classification model. The proposed SOM-KPCA-SVM is applied to classify patent quality automatically in patent data of the thin film solar cell. Experimental results show that our proposed system can capture the analysis effectively compared with traditional manpower approach.  相似文献   

6.
Patent retrieval primarily focuses on searching relevant legal documents with respect to a given query. Depending on the purposes of specific retrieval tasks, processes of patent retrieval may differ significantly. Given a patent application, it is challenging to determine its patentability, i.e., to decide whether a similar invention has been published. Therefore, it is more important to retrieve all possible relevant documents rather than only a small subset of patents from the top ranked results. However, patents are often lengthy and rich in technical terms. It is thus often requiring enormous human efforts to compare a given patent application with retrieved results. To this end, we propose an integrated framework, PatSearch, which automatically transforms the patent application into a reasonable and effective search query. The proposed framework first extracts representative yet distinguishable terms from a given application to generate an initial search query and then expands the query by combining content proximity with topic relevance. Further, a list of relevant patent documents will be retrieved based on the generated queries to provide enough information to assist patent analysts in making the patentability decision. Finally, a comparative summary is generated to assist patent analysts in quickly reviewing retrieved results related to the patent application. Extensive quantitative analysis and case studies on real-world patent documents demonstrate the effectiveness of our proposed approach.  相似文献   

7.
文本聚类的目标是把数据集中内容相似的文档归为一类,而使内容不同的文档分开。目前针对不同领域的需求,多种解决聚类问题的算法应运而生。然而,由于文本数据本身固有的复杂特点,如海量、高维、稀疏等,使得对海量文本数据的聚类仍然是一个棘手的问题。提出了层次非负矩阵分解聚类方法,该方法不但保留了非负矩阵分解的优点,如同步识别文档类别和找出类别本质特征,而且能够展现类别间的层次结构。这种类别层次结构在网页预览等应用中是非常有用的。在真实数据集20Newsgroups和Reuters-RCV1上的实验结果表明,层次非负矩阵分解相比已有的方法更有效。  相似文献   

8.
Self-organizing maps (SOM) have been applied on numerous data clustering and visualization tasks and received much attention on their success. One major shortage of classical SOM learning algorithm is the necessity of predefined map topology. Furthermore, hierarchical relationships among data are also difficult to be found. Several approaches have been devised to conquer these deficiencies. In this work, we propose a novel SOM learning algorithm which incorporates several text mining techniques in expanding the map both laterally and hierarchically. On training a set of text documents, the proposed algorithm will first cluster them using classical SOM algorithm. We then identify the topics of each cluster. These topics are then used to evaluate the criteria on expanding the map. The major characteristic of the proposed approach is to combine the learning process with text mining process and makes it suitable for automatic organization of text documents. We applied the algorithm on the Reuters-21578 dataset in text clustering and categorization tasks. Our method outperforms two comparing models in hierarchy quality according to users’ evaluation. It also receives better F1-scores than two other models in text categorization task.  相似文献   

9.
Web sites contain an ever increasing amount of information within their pages. As the amount of information increases so does the complexity of the structure of the web site. Consequently it has become difficult for visitors to find the information relevant to their needs. To overcome this problem various clustering methods have been proposed to cluster data in an effort to help visitors find the relevant information. These clustering methods have typically focused either on the content or the context of the web pages. In this paper we are proposing a method based on Kohonen’s self-organizing map (SOM) that utilizes both content and context mining clustering techniques to help visitors identify relevant information quicker. The input of the content mining is the set of web pages of the web site whereas the source of the context mining is the access-logs of the web site. SOM can be used to identify clusters of web sessions with similar context and also clusters of web pages with similar content. It can also provide means of visualizing the outcome of this processing. In this paper we show how this two-level clustering can help visitors identify the relevant information faster. This procedure has been tested to the access-logs and web pages of the Department of Informatics and Telecommunications of the University of Athens.  相似文献   

10.
The Self Organizing Map (SOM) algorithm has been utilized, with much success, in a variety of applications for the automatic organization of full-text document collections. A great advantage of the SOM method is that document collections can be ordered in such a way so that documents with similar content are positioned at nearby locations of the 2-dimensional SOM lattice. The resulting ordered map thus presents a general view of the document collection which helps the exploration of information contained in the whole document space. The most notable example of such an application is the WEBSOM method where the document collection is ordered onto a map by utilizing word category histograms for representing the documents data vectors. In this paper, we introduce the LSISOM method which resembles WEBSOM in the sense that the document maps are generated from word category histograms rather than simple histograms of the words. However, a major difference between the two methods is that in WEBSOM the word category histograms are formed using statistical information of short word contexts whereas in LSISOM these histograms are obtained from the SOM clustering of the Latent Semantic Indexing representation of document terms.  相似文献   

11.
基于频繁结构的XML文档聚类   总被引:1,自引:1,他引:0       下载免费PDF全文
研究基于频繁结构的XML文档聚类方法,其频繁结构包括频繁路径和频繁子树。首先介绍一种挖掘XML文档中所有嵌入频繁子树的算法SSTMiner,对SSTMiner算法进行修改,得到FrePathMiner算法和FreTreeMiner算法,分别用于挖掘XML文档中最大频繁路径和最大频繁子树,在此基础上,提出一种凝聚的层次聚类算法XMLCluster,分别以最大频繁路径和最大频繁子树作为XML文档的特征,对文档进行聚类。实验结果表明FrePathMiner算法和FreTreeMiner算法找到频繁结构的数量都比传统的ASPMiner算法多,这就可以为文档聚类提供更多的结构特征,从而获得更高的聚类精度。  相似文献   

12.
A patent quality analysis for innovative technology and product development   总被引:1,自引:0,他引:1  
Enterprises evaluate intellectual property rights and the quality of patent documents in order to develop innovative products and discover state-of-the-art technology trends. The product technologies covered by patent claims are protected by law, and the quality of the patent insures against infringement by competitors while increasing the worth of the invention. Thus, patent quality analysis provides a means by which companies determine whether or not to customize and manufacture innovative products. Since patents provide significant financial protection for businesses, the number of patents filed is increasing at a fast pace. Companies which cannot process patent information or fail to protect their innovations by filing patents lose market competitiveness. Current patent research is needed to estimate the quality of patent documents. The purpose of this research is to improve the analysis and ranking of patent quality. The first step of the proposed methodology is to collect technology specific patents and to extract relevant patent quality performance indicators. The second step is to identify the key impact factors using principal component analysis. These factors are then used as the input parameters for a back-propagation neural network model. Patent transactions help judge patent quality and patents which are licensed or sold with intellectual property usage rights are considered high quality patents. This research collected 283 patents sold or licensed from the news of patent transactions and 116 patents which were unsold but belong to the technology specific domains of interest. After training the patent quality model, 36 historical patents are used to verify the performance of the trained model. The match between the analytical results and the actual trading status reached an 85% level of accuracy. Thus, the proposed patent quality methodology evaluates the quality of patents automatically and effectively as a preliminary screening solution. The approach saves domain experts valuable time targeting high value patents for R&D commercialization and mass customization of products.  相似文献   

13.
After projecting high dimensional data into a two-dimension map via the SOM, users can easily view the inner structure of the data on the 2-D map. In the early stage of data mining, it is useful for any kind of data to inspect their inner structure. However, few studies apply the SOM to transactional data and the related categorical domain, which are usually accompanied with concept hierarchies. Concept hierarchies contain information about the data but are almost ignored in such researches. This may cause mistakes in mapping. In this paper, we propose an extended SOM model, the SOMCD, which can map the varied kinds of data in the categorical domain into a 2-D map and visualize the inner structure on the map. By using tree structures to represent the different kinds of data objects and the neurons’ prototypes, a new devised distance measure which takes information embedded in concept hierarchies into consideration can properly find the similarity between the data objects and the neurons. Besides the distance measure, we base the SOMCD on a tree-growing adaptation method and integrate the U-Matrix for visualization. Users can hierarchically separate the trained neurons on the SOMCD's map into different groups and cluster the data objects eventually. From the experiments in synthetic and real datasets, the SOMCD performs better than other SOM variants and clustering algorithms in visualization, mapping and clustering.  相似文献   

14.
A Semi-Structured Document Model for Text Mining   总被引:7,自引:0,他引:7       下载免费PDF全文
A semi-structured document has more structured information compared to an ordinary document,and the relation among semi-structured documents can be fully utilized.In order to take advantage of the structure and link information in a semi-structured document for better mining,a structured link vector model (SLVM) is presented in this paper,where a vector represents a document,and vectors‘ elements are determined by terms,document structure and neighboring documents.Text mining based on SLVM is described in the procedure of K-means for briefness and clarity:calculating document similarity and calculating cluster center.The clustering based on SLVM performs significantly better than that based on a conventional vector space model in the experiments,and its F value increases from 0.65-0.73 to 0.82-0.86.  相似文献   

15.
为得到好的聚类效果,需要挑选适合数据集簇结构的聚类算法。文中提出基于网格最小生成树的聚类算法选择方法,为给定数据集自动选择适合的聚类算法。该方法首先在数据集上构建出网格最小生成树,由树的数目确定数据集的潜在簇结构,然后为数据集选择适合所发现簇结构的聚类算法。实验结果表明该方法较有效,能为给定数据集找出适合其潜在簇结构的聚类算法。  相似文献   

16.
Clustering XML documents is extensively used to organize large collections of XML documents in groups that are coherent according to structure and/or content features. The growing availability of distributed XML sources and the variety of high-demand environments raise the need for clustering approaches that can exploit distributed processing techniques. Nevertheless, existing methods for clustering XML documents are designed to work in a centralized way. In this paper, we address the problem of clustering XML documents in a collaborative distributed framework. XML documents are first decomposed based on semantically cohesive subtrees, then modeled as transactional data that embed both XML structure and content information. The proposed clustering framework employs a centroid-based partitional clustering method that has been developed for a peer-to-peer network. Each peer in the network is allowed to compute a local clustering solution over its own data, and to exchange its cluster representatives with other peers. The exchanged representatives are used to compute representatives for the global clustering solution in a collaborative way. We evaluated effectiveness and efficiency of our approach on real XML document collections varying the number of peers. Results have shown that major advantages with respect to the corresponding centralized clustering setting are obtained in terms of runtime behavior, although clustering solutions can still be accurate with a moderately low number of nodes in the network. Moreover, the collaborativeness characteristic of our approach has revealed to be a convenient feature in distributed clustering as found in a comparative evaluation with a distributed non-collaborative clustering method.  相似文献   

17.
和导航中应用广泛。文本聚类作为一种无监督学习算法,其依据是聚类假设:同类的文档相似程度大,不同类的文档相似程度小。文中主要研究汉语文本聚类算法在新闻标题类文本中的应用。首先对采集到的若干条新闻标题进行分词和特征提取,将分词后的文本转化为词条矩阵;然后使用TF-IDF技术处理词条矩阵,得到基于分词权重的新的词条矩阵,对新的词条矩阵进行奇异值分解,得到主成分得分矩阵,提取主成分分析文本特征并根据主成分得分矩阵进行K-均值和分层聚类分析;最后将聚类结果用词云图的形式展示出来并评价聚类效果的好坏。实证显示,对词条矩阵的奇异值分解能降低向量空间的维数,提高聚类的精度和运算速度。  相似文献   

18.
随着信息的爆炸式增长,现有的搜索引擎在很多方面不能满足人们的需要。Web文档聚类可以减小搜索空间,加快检索速度,提高查询精度。提出了一种融合SOM(Self-Organizing Maps)粗聚类和改进PSO(Particle Swarm Optimization)细聚类的Web文档集成聚类算法。首先根据向量空间模型表示法,用特征词条及其权值表示Web文档信息,其次用SOM算法对文档特征集进行粗聚类,得到一组输出权值,然后用这组权值初始化改进的PSO算法,用改进PSO算法对此聚类结果进行细化,最终实现Web文档聚类。仿真结果表明,该算法能有效提高文档查询的查准率和查全率,具有一定的实用价值。  相似文献   

19.
覃晓  元昌安 《计算机应用》2008,28(3):757-760
自组织映射(SOM)算法作为一种聚类和高维可视化的无监督学习算法,为进行中文Web文档聚类提供了有力的手段。但是SOM算法天然存在着对网络初始权值敏感的缺陷,从而影响聚类质量。为此,引进遗传算法对SOM网络加以优化。提出了以遗传算法优化SOM网络的文本聚类算法(GSTCA);进行了对比实验,实验表明,改进后的算法GSTCA比SOM算法在Web中文文档聚类中具有更高的准确率,其F-measure值平均提高了14%,同时,实验还表明,GSTCA算法对网络初始权值是不敏感的,从而提高了算法的稳定性。  相似文献   

20.
The state-of-the-art text clustering methods suffer from the huge size of documents with high-dimensional features. In this paper, we studied fast SOM clustering technology for Text Information. Our focus is on how to enhance the efficiency of text clustering system whereas high clustering qualities are also kept. To achieve this goal, we separate the system into two stages: offline and online. In order to make text clustering system more efficient, feature extraction and semantic quantization are done offline. Although neurons are represented as numerical vectors in high-dimension space, documents are represented as collections of some important keywords, which is different from many related works, thus the requirement for both time and space in the offline stage can be alleviated. Based on this scenario, fast clustering techniques for online stage are proposed including how to project documents onto output layers in SOM, fast similarity computation method and the scheme of Incremental clustering technology for real-time processing, We tested the system using different datasets, the practical performance demonstrate that our approach has been shown to be much superior in clustering efficiency whereas the clustering quality are comparable to traditional methods.  相似文献   

设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号