Similar Documents
20 similar documents found (search time: 46 ms)
1.

Text document clustering separates a collection of documents into several clusters such that documents within a cluster are substantially similar to one another and distinct from documents in other clusters. The high-dimensional, sparse document-term matrix reduces the efficiency of the clustering process. This study proposes a new way of clustering documents using a domain ontology together with the WordNet ontology. The main objective of this work is to increase cluster output quality, and it investigates feature-dimension selection methods for reducing the features of the document-term matrix. Sports documents are clustered using conventional K-Means and density-based clustering, each combined with a dimension-reduction (DR) feature-selection process. A novel approach named ontology-based document clustering is proposed for grouping the text documents, developed in three critical steps. The first step starts with data pre-processing, in which the feature set for the DR method is reduced by Information-Gain-based selection. The documents are then clustered with two methods, K-Means and density-based clustering with the DR feature-selection process; these methods validate the findings of ontology-based clustering, and the study compares them using measurement metrics. The second step develops the sports-field ontology and describes the concepts and relationships of its terms using sports-related documents; a semantic-web reasoning process is used to validate the ontology. An algorithm for retrieving synonyms of the sports domain ontology terms is proposed and implemented: terms retrieved from the documents and sport-ontology concepts are mapped to the synonym sets retrieved from the WordNet ontology, and the suggested technique is based on the synonyms of the mapped concepts.
The proposed ontology approach employs the reduced feature set to cluster the text documents. The results are compared with two traditional approaches on two datasets. The proposed ontology-based clustering approach proves effective, clustering the documents with high precision, recall, and accuracy. In addition, this study also compares different RDF serialization formats for the sports ontology.
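The synonym-mapping idea in item 1 can be sketched in miniature: collapsing surface terms to canonical concepts before building term vectors is what shrinks the document-term matrix. The synonym table below is a hypothetical stand-in for WordNet synset lookups, not the paper's actual mapping.

```python
from collections import Counter

# Hypothetical synonym table standing in for WordNet synset lookups:
# every surface term collapses to one canonical concept, shrinking the
# dimensionality of the document-term matrix.
SYNONYMS = {
    "football": "soccer", "match": "game", "fixture": "game",
    "keeper": "goalkeeper", "goalie": "goalkeeper",
}

def concept_vector(tokens):
    """Map tokens to canonical concepts and count them."""
    return Counter(SYNONYMS.get(t, t) for t in tokens)

doc = ["football", "match", "keeper", "soccer", "game"]
vec = concept_vector(doc)  # five tokens fold into three concepts
```

Five distinct surface terms are reduced to three concept dimensions; clustering then runs on the concept vectors instead of the raw term vectors.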


2.
Many studies on developing technologies have been published as articles, papers, or patents. We use and analyze these documents to find scientific and technological trends. In this paper, we consider document clustering as a method of document data analysis. In general, we have trouble analyzing documents directly because document data are not suitable for statistical and machine learning methods of analysis. Therefore, we have to transform document data into structured data for analytical purposes. For this process, we use text mining techniques. The structured data are very sparse, and hence, it is difficult to analyze them. This study proposes a new method to overcome the sparsity problem of document clustering. We build a combined clustering method using dimension reduction and K-means clustering based on support vector clustering and the Silhouette measure. In particular, we attempt to overcome the sparseness in patent document clustering. To verify the efficacy of our work, we first conduct an experiment using news data from the machine learning repository of the University of California at Irvine. Second, using patent documents retrieved from the United States Patent and Trademark Office, we carry out patent clustering for technology forecasting.
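The Silhouette measure used above to guide the combined clustering has a simple definition: for each point, compare the mean distance to its own cluster (a) against the lowest mean distance to any other cluster (b). A minimal pure-Python sketch on toy points rather than real document vectors:

```python
from math import dist  # Euclidean distance (Python 3.8+)

def silhouette(points, labels):
    """Mean silhouette coefficient: (b - a) / max(a, b) per point,
    where a = mean intra-cluster distance and b = lowest mean
    distance to any other cluster."""
    clusters = {}
    for p, l in zip(points, labels):
        clusters.setdefault(l, []).append(p)
    scores = []
    for p, l in zip(points, labels):
        own = [q for q in clusters[l] if q is not p]
        if not own:                      # singleton cluster
            scores.append(0.0)
            continue
        a = sum(dist(p, q) for q in own) / len(own)
        b = min(sum(dist(p, q) for q in c) / len(c)
                for k, c in clusters.items() if k != l)
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)

# Two well-separated clusters score close to 1.
pts = [(0, 0), (0, 1), (10, 10), (10, 11)]
score = silhouette(pts, [0, 0, 1, 1])
```

Running the measure for several candidate cluster counts and keeping the highest-scoring one is the usual way this index drives model selection.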

3.
Defining valid patents in a particular technological field is an indispensable step in patent analysis. To minimise the risk of missing valid patents, domain experts manually exclude irrelevant patents, known as noise patents, from an initial patent set derived using a loose retrieval query. However, this task has become time-consuming and labour-intensive due to the increasing number of patents and the rising complexity of technological knowledge. This study proposes a semi-automated approach to noise patent filtering based on information entropy theory and latent Dirichlet allocation. The proposed approach comprises four discrete steps: (1) structuring patents using a term-weighting method; (2) recommending noise patent seeds based on the information quantity of patents in terms of focal keyword groups; (3) measuring text similarities for patent clustering using latent Dirichlet allocation; and (4) identifying potential noise patent clusters with respect to the noise patent seeds. Our case study confirms that the proposed approach is valuable as a complementary noise patent filtering tool that will enable domain experts to focus more on their own knowledge-intensive tasks such as prior art analysis and research and development (R&D) strategy formulation.
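Step (2) scores patents by the information their keywords carry about focal keyword groups. The sketch below computes a Shannon-entropy score under hypothetical focal groups (the group names and keywords are invented for illustration): a patent with no hits on the focal vocabulary scores zero, flagging it as a noise-seed candidate.

```python
from math import log2

# Hypothetical focal keyword groups for the target technology field.
FOCAL_GROUPS = {
    "battery": ["anode", "cathode", "electrolyte"],
    "charging": ["charger", "voltage", "current"],
}

def focal_entropy(tokens):
    """Shannon entropy of a patent's keyword hits across the focal
    groups; 0.0 means the patent never touches the focal vocabulary,
    which makes it a candidate noise-patent seed."""
    hits = [sum(tokens.count(k) for k in kws)
            for kws in FOCAL_GROUPS.values()]
    total = sum(hits)
    if total == 0:
        return 0.0
    return -sum((h / total) * log2(h / total) for h in hits if h)

relevant = ["anode", "cathode", "voltage"]
noise = ["bicycle", "frame", "saddle"]
```

Patents scoring at or near zero would be recommended to the expert as noise seeds, which the later LDA-based clustering steps then expand into noise clusters.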

4.
This paper aims to cluster Chinese patent documents by their structures. Both the explicit and implicit structures are analyzed and represented by the proposed structure expression. Accordingly, an unsupervised clustering algorithm called the structured self-organizing map (SOM) is adopted to cluster Chinese patent documents with similar content and structure. The structured SOM clusters the content of each sub-part of the structure and then propagates the similarity to upper levels. Experimental results showed that map size and the number of patents are proportional to computing time, which implies that the width and depth of the structure affect the performance of the structured SOM. Structured clustering of patents is helpful in many applications. In infringement lawsuits, companies can readily find claim conflicts in existing patents to rebut an accusation. Moreover, a company's decision-makers can be advised to avoid hot-spot aspects of patents, saving considerable R&D effort.

5.
With the rapid growth of text documents, document clustering techniques are emerging for efficient document retrieval and better document browsing. Recently, some methods have been proposed to resolve the problems of high dimensionality, scalability, accuracy, and meaningful cluster labels by using frequent itemsets derived from association rule mining for clustering documents. In order to improve the quality of document clustering results, we propose an effective Fuzzy Frequent Itemset-based Document Clustering (F2IDC) approach that combines fuzzy association rule mining with the background knowledge embedded in WordNet. A term hierarchy generated from WordNet is applied to discover generalized frequent itemsets as candidate cluster labels for grouping documents. We have conducted experiments to evaluate our approach on the Classic4, Re0, R8, and WebKB datasets. Our experimental results show that our proposed approach provides more accurate clustering results than prior influential clustering methods presented in the recent literature.
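The frequent-itemset idea behind F2IDC can be sketched without the fuzzy machinery: itemsets of terms that co-occur in enough documents become candidate cluster labels. This brute-force counter is illustrative only; real association-rule mining prunes candidates Apriori-style rather than enumerating them all.

```python
from collections import Counter
from itertools import combinations

def frequent_itemsets(doc_term_sets, min_support, max_size=2):
    """Itemsets of terms appearing together in at least min_support
    documents; frequent sets serve as candidate cluster labels."""
    counts = Counter()
    for terms in doc_term_sets:
        for size in range(1, max_size + 1):
            for itemset in combinations(sorted(terms), size):
                counts[itemset] += 1
    return {s: c for s, c in counts.items() if c >= min_support}

docs = [
    {"fuzzy", "clustering", "wordnet"},
    {"fuzzy", "clustering", "itemset"},
    {"fuzzy", "itemset", "mining"},
]
frequent = frequent_itemsets(docs, min_support=2)
```

Each surviving itemset, e.g. `("clustering", "fuzzy")`, labels the group of documents that contain all of its terms; generalizing the items through a WordNet term hierarchy is what the paper adds on top.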

6.
In order to process large numbers of explicit knowledge documents such as patents in an organized manner, automatic document categorization and search are required. In this paper, we develop a document classification and search methodology based on neural network technology that helps companies manage patent documents more effectively. The classification process begins by extracting key phrases from the document set by means of automatic text processing and determining the significance of key phrases according to their frequency in the text. In order to maintain a manageable number of independent key phrases, correlation analysis is applied to compute the similarities between key phrases, and phrases with higher correlations are synthesized into a smaller set of phrases. Finally, the back-propagation network model is adopted as a classifier. The target output identifies a patent document's category based on a hierarchical classification scheme, in this case the International Patent Classification (IPC) standard. The methodology is tested using patents related to the design of power hand-tools, which are automatically classified using pre-trained neural network models. In the prototype system, two modules are used for patent document management: the automatic classification module helps users classify patent documents, and the search module helps users find relevant and related patent documents. The results show an improvement in document classification and identification over previously published methods of patent document management.
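The correlation-analysis step, merging key phrases whose document-occurrence patterns are highly similar into a single feature, can be sketched as below. The phrases and occurrence vectors are invented for illustration, and a simple greedy grouping stands in for whatever merging policy the paper actually uses.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation of two equal-length numeric vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx and sy else 0.0

def merge_correlated(occurrence, threshold=0.8):
    """Greedily group key phrases whose document-occurrence vectors
    correlate above the threshold; each group becomes one feature."""
    phrases = list(occurrence)
    merged, used = [], set()
    for i, p in enumerate(phrases):
        if p in used:
            continue
        group = [p]
        for q in phrases[i + 1:]:
            if q not in used and pearson(occurrence[p], occurrence[q]) >= threshold:
                group.append(q)
                used.add(q)
        merged.append(group)
    return merged

occ = {  # hypothetical per-document occurrence vectors
    "drill bit":   [1, 1, 0, 0],
    "drill chuck": [1, 1, 0, 0],  # always co-occurs with "drill bit"
    "battery":     [0, 0, 1, 1],
}
groups = merge_correlated(occ)
```

The merged groups, rather than individual phrases, then feed the back-propagation classifier as inputs.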

7.
Patents contain some of the most complete design information in most fields and can provide valuable guidance for designers solving design problems. To address the difficulty existing patent recommendation methods have in recommending cross-domain patents, a deep-learning-based cross-domain patent knowledge recommendation method is proposed for the conceptual design of innovative products. Product functions and knowledge-demand contexts are modeled, design problems are expressed in a standardized form, and a design problem space is generated. A semi-supervised learning algorithm (TG-TCI) is proposed to automatically classify and label patent function information according to functional bases, and a named-entity recognition algorithm (BERT-BiLSTM-CRF) is used to extract application-scenario terms and technical terms from patents; these are combined with International Patent Classification (IPC) information to represent the function, context, technology, and domain attributes of patents, thereby generating a patent knowledge space. The required cross-domain patents are found by mapping functional bases and knowledge contexts from the design problem space to the patent knowledge space; the patents are then clustered and evaluated by their technology and domain attributes, and specific patents are selected to stimulate designers' creativity. A practical case study demonstrates the feasibility and effectiveness of the deep-learning-based patent knowledge recommendation model.

8.
To reveal companies' technological strength and industry trends, reduce unnecessary R&D costs, and support correct decisions, a basic patent co-citation matrix analysis method is presented. An improved rough-set K-Means fuzzy clustering method is used to cluster patents on different topics, resolving the inaccuracy caused by repeatedly computing the center vectors. An association-rule mining algorithm is then applied to discover strong association rules, and the consequents of those rules are taken as the core patents of each category, making patent selection more targeted. An example demonstrates the effectiveness of the method, providing a feasible approach to patent citation analysis.

9.
Conclusion. We have developed a document preparation and editing system for SGML-HTML documents that allows viewing, navigation, and printing. The system supports two languages, Ukrainian and English. Combining the interpretation features of SGML for representation of a wide range of documents, in particular mathematical formulas of arbitrary complexity, with HTML for navigation by the document link structure provides new opportunities for electronic document processing [6] in Internet/Intranet environments. The system may prove particularly useful for patent preparation and maintenance, as the SGML standard (WIPO ST.32) has been recommended for international exchange of published information about inventions and patents. The editor may be incorporated in special-purpose document-handling workstations, for instance, in patent offices, in special archives, in publishing, and in other systems based on corporate Intranet environments. Translated from Kibernetika i Sistemnyi Analiz, No. 4, pp. 117–123, July–August, 1998.

10.
A fuzzy clustering technique for Web documents based on hierarchical neural networks   (cited in total: 2; self-citations: 1; citations by others: 1)
A multi-layer vector space model is presented, in which the relevant information of a document is logically divided into several relatively independent text segments, and index-term weights are assigned according to the position of each segment. A concise and effective fuzzy clustering algorithm based on a hierarchical neural network is then proposed. Unlike existing methods, this fuzzy clustering approach is implemented by a three-layer neural network composed of a self-organizing network and a fuzzy clustering network: a self-organizing map first produces an initial clustering from the raw data, and the FCM method is then applied to optimize the number of initial clusters. Experimental results show that the proposed Web document clustering algorithm has good clustering properties and can group the Web documents relevant to a topic into one cluster completely and accurately.
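The SOM-then-FCM pipeline refines an initial clustering with fuzzy c-means. The sketch below runs plain one-dimensional FCM from given initial centres (standing in for the SOM output); it is generic FCM, not the paper's three-layer network.

```python
def fcm(points, centers, m=2.0, iters=20):
    """Plain 1-D fuzzy c-means: alternate the standard membership
    update u_ij = 1 / sum_k (d_ij / d_kj)^(2/(m-1)) with the
    membership-weighted mean centre update."""
    u = []
    for _ in range(iters):
        u = []
        for x in points:
            row = []
            for c in centers:
                d = abs(x - c) or 1e-9  # avoid division by zero
                row.append(1.0 / sum((d / (abs(x - k) or 1e-9)) ** (2 / (m - 1))
                                     for k in centers))
            u.append(row)
        centers = [
            sum(u[j][i] ** m * points[j] for j in range(len(points)))
            / sum(u[j][i] ** m for j in range(len(points)))
            for i in range(len(centers))
        ]
    return centers, u

# Initial centres stand in for the SOM-produced initial clustering.
pts = [0.0, 0.1, 0.2, 5.0, 5.1, 5.2]
centers, u = fcm(pts, centers=[0.5, 4.5])
```

With well-separated data the memberships become nearly crisp and the centres settle near the cluster means, which is exactly the refinement the second network layer performs.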

11.
Patent retrieval primarily focuses on searching relevant legal documents with respect to a given query. Depending on the purpose of a specific retrieval task, the process of patent retrieval may differ significantly. Given a patent application, it is challenging to determine its patentability, i.e., to decide whether a similar invention has been published. It is therefore more important to retrieve all possibly relevant documents than only a small subset of patents from the top-ranked results. However, patents are often lengthy and rich in technical terms, so comparing a given patent application with retrieved results often requires enormous human effort. To this end, we propose an integrated framework, PatSearch, which automatically transforms a patent application into a reasonable and effective search query. The proposed framework first extracts representative yet distinguishable terms from a given application to generate an initial search query and then expands the query by combining content proximity with topic relevance. A list of relevant patent documents is then retrieved based on the generated queries to provide enough information to assist patent analysts in making the patentability decision. Finally, a comparative summary is generated to help patent analysts quickly review the retrieved results related to the patent application. Extensive quantitative analysis and case studies on real-world patent documents demonstrate the effectiveness of the proposed approach.

12.
In the process of analyzing knowledge innovation, it is necessary to identify the existing boundaries of knowledge so as to determine whether knowledge is new, i.e., outside these boundaries. For a patent to be granted, all aspects of the patent request must be studied to determine the patent's innovation. Knowledge innovation for patent requests depends on analyzing the current state of the art in multiple languages, yet the process is usually limited to the languages and search terms the patent seeker knows. The paper describes a model for representing the patent request by a set of concepts related to a multilingual knowledge ontology. The search for patent knowledge is based on fuzzy-logic decision support and allows a multilingual search. The model was analyzed using a twofold approach: a total of 104,296 patents from the United States Patent and Trademark Office were used to analyze the patent extraction process, and patents from the Korean, US, and Chinese patent offices were used in the analysis of the multilingual decision process. The results display high recall and precision and suggest that increasing the number of languages used has only minor effects on the model results.

13.
The creation and deployment of knowledge repositories for managing, sharing, and reusing tacit knowledge within an organization has emerged as a prevalent approach in current knowledge management practices. A knowledge repository typically contains vast amounts of formal knowledge elements, which generally are available as documents. To facilitate users' navigation of documents within a knowledge repository, knowledge maps, often created by document clustering techniques, represent an appealing and promising approach. Various document clustering techniques have been proposed in the literature, but most deal with monolingual documents (i.e., written in the same language). However, as a result of increased globalization and advances in Internet technology, an organization often maintains documents in different languages in its knowledge repositories, which necessitates multilingual document clustering (MLDC) to create organizational knowledge maps. Motivated by the significance of this demand, this study designs a Latent Semantic Indexing (LSI)-based MLDC technique capable of generating knowledge maps (i.e., document clusters) from multilingual documents. The empirical evaluation results show that the proposed LSI-based MLDC technique achieves satisfactory clustering effectiveness, measured by both cluster recall and cluster precision, and is capable of maintaining a good balance between monolingual and cross-lingual clustering effectiveness when clustering a multilingual document corpus.

14.
An automatic concept-space generation method   (cited in total: 5; self-citations: 2; citations by others: 5)
This paper proposes a method for automatically generating a concept space. First, an SOM neural network clusters the texts; concepts reflecting the content of each text class are then extracted from the results and used to label the text categories. Fuzzy clustering is subsequently applied to automatically abstract and generalize the concepts, forming a concept space for text management. The SOM itself learns without supervision: once the parameters are set, training automatically produces a mapping between the text space and the concept space. Experiments and their results show that the concept space supports effective classified management of texts and facilitates text retrieval.

15.
Nowadays, decision-making activities of knowledge-intensive enterprises depend heavily on the successful classification of patents. A considerable amount of time is required to achieve successful classification because of the complexity of patent information and the large number of potential patents. Several patent classification approaches have been developed in the past, but most of these studies focus on using computational models for the International Patent Classification (IPC) system rather than using these models in real-world cases of patent classification. In contrast to previous studies that combined algorithms and the IPC system directly without expert screening, this study proposes a novel artificial intelligence (AI)-aided patent decision-making process. In this process, an expert screening approach is integrated with a hybrid genetic-based support vector machine (HGA-SVM) model to develop a patent classification system with high classification accuracy and generalization ability for real-world patent searching cases. The proposed approach is tested on a real-world case: an expert's patent document searching history that contains 234 patent documents on semiconductor equipment components. The research results demonstrate that the proposed hybrid genetic algorithm approach can optimize all the parameters of the SVM to develop a patent classification system with high accuracy. The proposed HGA-SVM model is able to dynamically and automatically classify patent documents by recording and learning the experts' knowledge and logic. Finally, we propose a new decision-making process for improving the development of the SVM patent classification and searching system.

16.
In the era of technology-driven economies, patent infringement has become one of the main risks faced by companies and exists in all stages of technological innovation. However, the increasing volume of patent information as well as the inherent fuzziness of patent infringement risk make early warning of this risk a knowledge-intensive engineering activity. In this study, a novel patent infringement early-warning methodology based on intuitionistic fuzzy sets (IFSs) is proposed to accurately evaluate and classify patent infringement risk for its management. First, a hierarchical indicator system is established, including indicators of regional judicial and administrative protection. Then, entropy weights for IFSs and intuitionistic fuzzy weighted geometric (IFWG) operators are utilized to objectively and automatically aggregate indicator data on early-warning patents and their similar patents into IFS evaluation results, forming a multi-layer data processing structure. Finally, normalized Euclidean distances are used to classify risk levels. In a case study, Huawei's historical patents are taken as the test data, and the methodology is verified by comparing the output results and classification with the actual litigation status. Managerial implications for design engineers and patent attorneys are discussed for the various technological innovation stages.
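The IFWG aggregation the methodology relies on has a compact closed form (the intuitionistic fuzzy weighted geometric operator of Xu and Yager): memberships combine as a weighted geometric mean, non-memberships via the complements. The indicator values and weights below are invented for illustration, not taken from the study.

```python
from math import prod

def ifwg(values, weights):
    """Intuitionistic fuzzy weighted geometric operator: aggregate
    (membership, non-membership) pairs under weights summing to 1:
    mu = prod(mu_i^w_i),  nu = 1 - prod((1 - nu_i)^w_i)."""
    mu = prod(m ** w for (m, _), w in zip(values, weights))
    nu = 1 - prod((1 - n) ** w for (_, n), w in zip(values, weights))
    return mu, nu

# Hypothetical indicator evaluations for one patent:
# (membership to "high infringement risk", non-membership),
# with entropy-style weights.
vals = [(0.6, 0.3), (0.8, 0.1), (0.5, 0.4)]
w = [0.5, 0.3, 0.2]
mu, nu = ifwg(vals, w)
```

The aggregated pair stays a valid IFS value (mu + nu <= 1), so the normalized-Euclidean-distance classification in the final step can be applied to it directly.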

17.
In this paper, we extend the work of Kraft et al. to present a new method for fuzzy information retrieval based on fuzzy hierarchical clustering and fuzzy inference techniques. First, we present a fuzzy agglomerative hierarchical clustering algorithm for clustering documents and to get the document cluster centers of document clusters. Then, we present a method to construct fuzzy logic rules based on the document clusters and their document cluster centers. Finally, we apply the constructed fuzzy logic rules to modify the user's query for query expansion and to guide the information retrieval system to retrieve documents relevant to the user's request. The fuzzy logic rules can represent three kinds of fuzzy relationships (i.e., fuzzy positive association relationship, fuzzy specialization relationship and fuzzy generalization relationship) between index terms. The proposed fuzzy information retrieval method is more flexible and more intelligent than the existing methods due to the fact that it can expand users' queries for fuzzy information retrieval in a more effective manner.
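A toy version of rule-based query expansion: each fuzzy rule links an index term to related terms with an association strength, and query terms pull in consequents above a threshold. The rules and weights here are hypothetical, not from the paper, and only the positive-association relationship is sketched.

```python
# Hypothetical fuzzy positive-association rules between index terms:
# rule weight = strength of the association, in [0, 1].
RULES = {
    "retrieval": [("search", 0.9), ("indexing", 0.6)],
    "fuzzy":     [("membership", 0.7)],
}

def expand_query(query, min_strength=0.65):
    """Expand each query term with rule consequents whose association
    strength reaches the threshold; keep the max weight per term."""
    expanded = dict.fromkeys(query, 1.0)  # original terms at full weight
    for term in query:
        for related, strength in RULES.get(term, []):
            if strength >= min_strength:
                expanded[related] = max(expanded.get(related, 0.0), strength)
    return expanded

q = expand_query(["fuzzy", "retrieval"])
```

The expanded weighted query is then handed to the retrieval system; specialization and generalization rules would expand downward or upward in a term hierarchy in the same way.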

18.
Customer clustering is an essential step in reducing the complexity of large-scale logistics network optimization. By properly grouping customers with similar characteristics, logistics operators are able to reduce operational costs and improve customer satisfaction levels. However, due to the heterogeneity and high dimensionality of customers' characteristics, the customer clustering problem has not been widely studied. This paper presents a fuzzy-based customer clustering algorithm with a hierarchical analysis structure to address this issue. Customers' characteristics are represented using linguistic variables under major and minor criteria, and a fuzzy integration method is then used to map the sub-criteria into the higher hierarchical criteria based on trapezoidal fuzzy numbers. A fuzzy clustering algorithm based on Axiomatic Fuzzy Set theory is developed to group the customers into multiple clusters, and a clustering validity index is designed to evaluate the effectiveness of the proposed algorithm and find the optimal clustering solution. Results from a case study in Anshun, China reveal that the proposed approach outperforms three prevailing algorithms in resolving the customer clustering problem and demonstrates its capability of capturing the similarity and distinguishing the differences among customers. Tentative clustered regions, determined by five decision makers in Anshun City, are used to evaluate the effectiveness of the proposed approach; the validation results indicate that the clustered results from the proposed method match the actual clustered regions well. The proposed algorithm can be readily implemented in practice to help logistics operators reduce operational costs and improve customer satisfaction levels, and it also has the potential to be applied in other research domains.

19.
In this paper, an approach for automatically clustering a data set into a number of fuzzy partitions with simulated annealing using a reversible-jump Markov chain Monte Carlo algorithm is proposed. This is in contrast to the widely used fuzzy clustering scheme, the fuzzy c-means (FCM) algorithm, which requires a priori knowledge of the number of clusters. The approach performs the clustering by optimizing a cluster validity index, the Xie-Beni index. It makes use of the homogeneous reversible-jump Markov chain Monte Carlo (RJMCMC) kernel as the proposal, so the algorithm is able to jump between different dimensions, i.e., numbers of clusters, until the correct value is obtained. Different moves, such as birth, death, split, merge, and update, are used for sampling a candidate state given the current state. The effectiveness of the proposed technique in optimizing the Xie-Beni index and thereby determining the appropriate clustering is demonstrated for both artificial and real-life data sets. In a part of the investigation, the utility of the fuzzy clustering scheme for classifying pixels in an IRS satellite image of Kolkata is studied, and a technique for reducing the computational effort in the case of satellite image data is incorporated.
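The Xie-Beni index that the annealing search minimises is the ratio of fuzzy within-cluster compactness to the minimum separation between centres; lower is better. A direct transcription of the formula, shown with crisp memberships on toy points:

```python
from math import dist  # Euclidean distance (Python 3.8+)

def xie_beni(points, centers, u, m=2.0):
    """Xie-Beni validity index:
    sum_ij u_ij^m * ||x_j - v_i||^2  /  (n * min_{i!=k} ||v_i - v_k||^2).
    Lower values indicate compact, well-separated clusters."""
    compact = sum(u[j][i] ** m * dist(points[j], centers[i]) ** 2
                  for j in range(len(points))
                  for i in range(len(centers)))
    sep = min(dist(a, b) ** 2
              for i, a in enumerate(centers) for b in centers[i + 1:])
    return compact / (len(points) * sep)

pts = [(0, 0), (0, 1), (10, 10), (10, 11)]
good = [(0, 0.5), (10, 10.5)]          # centres sitting on the clusters
u_crisp = [[1, 0], [1, 0], [0, 1], [0, 1]]
xb = xie_beni(pts, good, u_crisp)
```

During the RJMCMC search, each candidate state (a set of centres and memberships of some dimension) is scored with this index, and jumps that lower it are favoured by the annealing schedule.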

20.
Advances in computational methods have led, in the world of financial services, to huge databases of client and market information. In the past decade, various computational intelligence techniques have been applied in mining this data to obtain knowledge and in-depth information about the clients and the markets. The paper discusses the application of fuzzy clustering to target selection from large databases for direct marketing purposes. Actual data from the campaigns of a large financial services provider are used as a test case. The results obtained with the fuzzy clustering approach are compared with those resulting from the current practice of using statistical tools for target selection.
