Similar literature
 20 similar documents found; search took 31 ms
1.
2.
Topic modeling is a statistical, machine learning approach to discovering the latent “topics” that occur in a collection of documents. Latent Dirichlet allocation (LDA) is currently a popular and widely used modeling approach. In this paper, we investigate methods, including LDA and its extensions, for separating a set of scientific publications into several clusters. To evaluate the results, we generate a collection of documents containing academic papers from several different fields and check whether papers in the same field are clustered together. We also explore potential scientometric applications of such text analysis capabilities.

3.
Plagiarism is one of the most important current debates among scientific stakeholders. A separate but related issue is the use of authors’ own ideas in different papers (i.e., self-plagiarism). Opinions on this issue are mixed, and there is a lack of consensus. Our goal was to gain deeper insight into plagiarism and self-plagiarism through a citation analysis of documents involved in these situations. The Déjà vu database, which comprises around 80,000 duplicate records, was used to select 247 pairs of documents that had been examined by curators on a full text basis following a stringent protocol. We then used the Scopus database to perform a citation analysis of the selected documents. For each document pair, we used specific bibliometric indicators, such as the number of authors, full text similarity, journal impact factor, the Eigenfactor, and article influence. Our results confirm that cases of plagiarism are published in journals with lower visibility and thus tend to receive fewer citations. Moreover, full text similarity was significantly higher in cases of plagiarism than in cases of self-plagiarism. Among pairs of documents with shared authors, duplicates not citing the original document showed higher full text similarity than those citing the original document, and also showed greater overlap in the references cited in the two documents.

4.
In this paper, we investigate whether a semantic representation of patent documents provides added value for a multi-dimensional visual exploration of a patent landscape compared to traditional approaches that use tf–idf (term frequency–inverse document frequency). Word embeddings from a pre-trained word2vec model created from patent text are used to calculate pairwise similarities in order to represent each document in the semantic space. Then, a hierarchical clustering method is applied to create several semantic aggregation levels for a collection of patent documents. For visual exploration, we have seamlessly integrated multiple interaction metaphors that combine semantics and additional metadata for improving hierarchical exploration of large document collections.
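
To make the tf–idf baseline mentioned in this abstract concrete, the following is a minimal sketch of pairwise document similarity using tf–idf weights and cosine similarity. It is not the authors' pipeline: the weighting variant (relative term frequency times `log(N / df)`), the function names, and the three toy documents are all assumptions chosen for illustration.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one sparse tf-idf vector (dict: term -> weight) per tokenized document.

    Assumed weighting: relative term frequency times log(N / document frequency).
    """
    n_docs = len(docs)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append(
            {term: (count / len(doc)) * math.log(n_docs / df[term])
             for term, count in tf.items()}
        )
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors; 0.0 if either has zero norm."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical, already-tokenized patent snippets.
docs = [
    "rotor blade wind turbine".split(),
    "wind turbine blade coating".split(),
    "battery cell electrode coating".split(),
]
vecs = tfidf_vectors(docs)
sim_01 = cosine(vecs[0], vecs[1])  # documents sharing three terms
sim_02 = cosine(vecs[0], vecs[2])  # documents sharing no terms
```

Because tf–idf only matches identical terms, documents with no words in common score zero however related they are. That limitation is presumably what motivates comparing it with an embedding-based representation.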

5.
The notion of ‘core documents’, first introduced in the context of co-citation analysis and later re-introduced for bibliographic coupling and extended to hybrid approaches, refers to the representation of the core of a document set according to given criteria. In the present study, core documents are used for the identification of new emerging topics. The proposed method proceeds from independent clustering of disciplines in different time windows. Cross-citations between core documents and clusters in different periods are used to detect new, exceptionally growing clusters or clusters with changing topics. Three paradigmatic types of new, emerging topics are distinguished. Methodology is illustrated using the example of four ISI subject categories selected from the life sciences, applied sciences and the social sciences.

6.
In this work, we propose a medical image analysis framework that makes a two-way decision on whether or not Synovial Sarcoma (SS) is the cancer cell structure. Within this framework, the histopathology images are decomposed into a third-level sub-band using a two-dimensional Discrete Wavelet Transform. Structure features (SFs) are then extracted from this sub-band image representation, using the distribution of wavelet coefficients, by means of Principal Components Analysis (PCA), Independent Components Analysis (ICA) and Linear Discriminant Analysis (LDA). These SFs are used as inputs to a Support Vector Machine (SVM) classifier. The classification efficiency of PCA + SVM, ICA + SVM, and LDA + SVM with a Radial Basis Function (RBF) kernel is compared to identify the best classification results. Furthermore, data collected from various histopathological centres via the Internet of Things (IoT) are stored and shared using blockchain technology, so that images can be distributed widely across secure IoT devices. For this reason, the minimum and maximum values of the kernel parameter are adjusted and updated periodically for industrial application in device calibration. The framework thus provides an example of a technique for training and testing cancer cell structure prognosis methods on spindle shaped cell (SSC) histopathological imaging databases. Cross-validation performance is evaluated with the receiver operating characteristic (ROC) curve, and significant differences in classification performance between the techniques are analyzed. The LDA + SVM combination proved essential for intelligent SS cancer detection in the future, offering excellent classification accuracy, sensitivity, and specificity.

7.
Foreign patenting activity in some of the world's major patent systems is compared between countries and industries and is found to be, with a few notable exceptions, relatively unbiased. Furthermore, a brief dynamic analysis of the foreign patenting activity in the USA of a number of OECD countries in 41 industrial sectors, in terms of ‘Revealed Technological Advantage’ indices, suggests that foreign patent data might provide a very useful addition to the arsenal of Science and Technology Output Indicators.
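
The abstract does not state how the ‘Revealed Technological Advantage’ (RTA) index is computed. The sketch below assumes the commonly used definition: a country's share of its patents in a sector, divided by that sector's share of all patents. A value above 1 then indicates relative specialization. The function name, the country labels, and the patent counts are hypothetical.

```python
def revealed_technological_advantage(counts):
    """Compute RTA per country and sector.

    counts: dict mapping country -> dict mapping sector -> patent count.
    Assumes every country total and every sector total is positive.
    RTA = (country's patents in sector / country's total patents)
          / (all patents in sector / all patents).
    """
    grand_total = sum(sum(sectors.values()) for sectors in counts.values())
    sector_totals = {}
    for sectors in counts.values():
        for sector, n in sectors.items():
            sector_totals[sector] = sector_totals.get(sector, 0) + n

    rta = {}
    for country, sectors in counts.items():
        country_total = sum(sectors.values())
        rta[country] = {
            sector: (n / country_total) / (sector_totals[sector] / grand_total)
            for sector, n in sectors.items()
        }
    return rta


# Hypothetical patent counts for two countries in two sectors.
counts = {
    "A": {"chemicals": 30, "electronics": 10},
    "B": {"chemicals": 20, "electronics": 40},
}
rta = revealed_technological_advantage(counts)
# Country A holds 75% of its patents in chemicals, while chemicals make up
# 50% of all patents, so its RTA in chemicals is 0.75 / 0.5 = 1.5.
```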

8.
In this paper we propose a methodology to mine concepts from documents and use these concepts to generate an objective summary of all relevant documents. We use the conceptual graph (CG) formalism proposed by Sowa to represent the concepts and their relationships in the documents. In the present work we have modified and extended Sowa's definition of a concept. The modified and extended definition is discussed in detail in section 2 of this paper. A CG of a set of relevant documents can be considered as a semantic network. The semantic network is generated by automatically extracting a CG for each document and merging them into one. We discuss (i) generation of the semantic network using CGs and (ii) generation of the multi-document summary. We use restricted Boltzmann machines, a deep learning technique, to extract CGs automatically. We have tested our methodology on the MultiLing 2015 corpus and obtained encouraging results, which are comparable to those of state-of-the-art systems.

9.
Journal rankings and journal ratings are important to governments, research institutes, and scientific research in general, and they frequently serve as criteria for evaluating research performance to determine whether specific researchers will receive promotions and/or earn research grants. However, the only widely adopted journal assessment method is the impact factor (IF), which focuses on citations in academic journals and disregards the technological applications and value of those journals. In this article, we propose a method to rank academic journals that utilizes non-patent references in patent documents. We also compare journal rankings derived from IF with those derived from the Intellectual Property Citation Index (IPCI) across different fields; in some fields, IF and the IPCI are positively and significantly correlated. The results of this study offer a new perspective from which to assess the technological value of academic journals, particularly those in the technological and scientific fields. This study considers linkages between science and technology and the needs of the stakeholders in journal assessment to shed light on journal assessment and journal ranking methods.

10.
In recent years, the volume of information in digital form has increased tremendously owing to the increased popularity of the World Wide Web. As a result, the use of techniques for extracting useful information from large collections of data, and particularly documents, has become more necessary and challenging. Text clustering is such a technique; it consists in dividing a set of text documents into clusters (groups), so that documents within the same cluster are closely related, whereas documents in different clusters are as different as possible. Clustering depends on measuring the content (i.e., words) of a document in terms of relevance. Nevertheless, as documents usually contain a large number of words, some of them may be irrelevant to the topic under consideration or redundant. This can confuse and complicate the clustering process and make it less accurate. Accordingly, feature selection methods have been employed to reduce data dimensionality by selecting the most relevant features. In this study, we developed a text document clustering optimization model using a novel genetic frog-leaping algorithm that efficiently clusters text documents based on selected features. The proposed approach is based on two metaheuristic algorithms: a genetic algorithm (GA) and a shuffled frog-leaping algorithm (SFLA). The GA performs feature selection, and the SFLA performs clustering. To evaluate its effectiveness, the proposed approach was tested on a well-known text document dataset: the “20Newsgroup” dataset from the University of California Irvine Machine Learning Repository. Overall, after multiple experiments were compared and analyzed, it was demonstrated that using the proposed algorithm on the 20Newsgroup dataset greatly facilitated text document clustering, compared with classical K-means clustering. Nevertheless, this improvement requires longer computational time.

11.
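Kong Wanzeng (孔万增)  Zhu Shan'an (朱善安) 《光电工程》(Opto-Electronic Engineering) 2007,34(8):110-114
To address face recognition from a single training sample per person, this paper proposes a sub-block principal component analysis method based on partitioning the single sample. The method cuts each single-sample face image into multiple non-overlapping sub-blocks of equal size, and the resulting set of sub-blocks forms a new sample set. Principal component analysis (PCA) is applied to all sub-blocks to extract features, and the feature coefficients of the sub-blocks belonging to the same face serve as the basis for classification and recognition. Tests on the ORL face database show that the method achieves a better recognition rate than PCA, (PC)2A, and Sub-pattern LDA.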
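
The partitioning step described in this abstract can be sketched as follows. This covers only the cutting of one image into equal, non-overlapping sub-blocks. The subsequent PCA and classification are not shown. The function name, the block size, and the toy 4×4 "image" are assumptions for illustration, not details from the paper.

```python
def split_into_blocks(image, block_h, block_w):
    """Cut a 2D image (list of equal-length rows) into equal, non-overlapping blocks.

    Blocks are returned in row-major order. Raises ValueError if the image
    dimensions are not exact multiples of the block size.
    """
    rows, cols = len(image), len(image[0])
    if rows % block_h != 0 or cols % block_w != 0:
        raise ValueError("image size must be a multiple of the block size")

    blocks = []
    for top in range(0, rows, block_h):
        for left in range(0, cols, block_w):
            block = [row[left:left + block_w] for row in image[top:top + block_h]]
            blocks.append(block)
    return blocks


# Toy 4x4 "image" whose pixel values are 0..15, split into four 2x2 sub-blocks.
image = [[r * 4 + c for c in range(4)] for r in range(4)]
blocks = split_into_blocks(image, 2, 2)
# blocks[0] is the top-left block:     [[0, 1], [4, 5]]
# blocks[3] is the bottom-right block: [[10, 11], [14, 15]]
```

Each sub-block would then be flattened into a vector, so that one face image contributes several training samples instead of one.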

12.
Literature retrieval based on citation context   Total citations: 2, self-citations: 0, citations by others: 2
While the citation context of a reference may provide detailed and direct information about the nature of a citation, few studies have specifically addressed the role of this information in retrieving relevant documents from the literature, primarily because of the lack of full text databases. In this paper, we design a retrieval system based on full texts in the PubMed Central database. We constructed two modules in the retrieval system. One is a reference retrieval module based on citation contexts. The other is a citation context retrieval module for searching the citation contexts of a specific paper. Comparisons show that the reference retrieval module performed better than Google Scholar and the PubMed database at finding proper references based on topic words extracted from citation contexts. It also performed very well at finding highly cited papers and classic papers. The citation context retrieval module visualizes the topics of citation contexts as tag clouds and classifies citation contexts based on the cue words they contain.

13.
In this paper we analyze topic evolution over time within bioinformatics to uncover the underlying dynamics of that field, focusing on the recent developments in the 2000s. We select 33 bioinformatics related conferences indexed in DBLP from 2000 to 2011. The major reason for choosing DBLP as the data source instead of PubMed is that DBLP retains most bioinformatics related conferences, and to study dynamics of the field, conference papers are more suitable than journal papers. We divide the dozen years into four periods: period 1 (2000–2002), period 2 (2003–2005), period 3 (2006–2008) and period 4 (2009–2011). To conduct topic evolution analysis, we employ three major procedures, and for each procedure we develop a novel technique: Markov Random Field-based topic clustering, automatic cluster labeling, and topic similarity based on Within-Period Cluster Similarity and Between-Period Cluster Similarity. The experimental results show that there are distinct topic transition patterns between different time periods. From period 1 to period 3, new topics seem to have emerged and expanded, whereas from period 3 to period 4, topics are merged and display more rigorous interaction with each other. This trend is confirmed by the collaboration pattern over time.

14.
It is common to conduct a patent search when designing new products for commercial purposes. This paper examines whether reviewing patent documents to avoid infringement can affect design creativity, based on an experimental study involving 106 undergraduate engineering students. As part of the study, the participants were divided into three groups and asked to individually design a water kettle in 20 min. Participants from the first group were each given an identical patent document at the start of the experiment and were warned not to infringe the patented design. Participants from the second group were each given another patent document and the same warning. Participants from the third group were not given any patent documentation. The experimental results show that reviewing patent documents prior to ideation, even when done to avoid infringement, can cause fixation and lead to the inclusion of design features associated with the patent documents reviewed. In addition, reviewing patent documents can also cause distractions and result in the exclusion of design features that might otherwise be included. The findings of this work can contribute towards design pedagogy and the development of processes to handle patent documents in design projects.

15.
F. Zhang  X. Zhang 《Scientometrics》2014,100(3):723-740
Patent activity in China for vibration-reduction control technology in high-speed railway vehicle systems was analyzed based on a portfolio of 193 patents or applications from the State Intellectual Property Office of the People’s Republic of China official Web-based database and a search of the World Intellectual Property Organization PCT database. Patent activity features such as timing, applicant, technology classification, technical themes, and patents in force were obtained and analyzed. As a further stage of research, patent data on locomotive wheel sets were analyzed by means of a matrix analysis of problems and technologies. The main statistical information and conclusions include estimating the development stage, discovering the distributions of applications and applicants, weighing the roles played by major applications, determining R&D hotspots, and providing a better understanding of domestic patent activities in this field. Policy implications for innovation-related domestic R&D institutions in the technologies under study were proposed based on the analytical results.

16.

Probabilistic topic modeling algorithms like Latent Dirichlet Allocation (LDA) have become powerful tools for the analysis of large collections of documents (such as papers, projects, or funding applications) in science, technology and innovation (STI) policy design and monitoring. However, selecting an appropriate and stable topic model for a specific application (by adjusting the hyperparameters of the algorithm) is not a trivial problem. Common validation metrics like coherence or perplexity, which focus on the quality of topics, are not a good fit in applications where the quality of the document similarity relations inferred from the topic model is especially relevant. Relying on graph analysis techniques, our work aims to establish a new methodology for selecting hyperparameters that is specifically oriented to optimizing the similarity metrics emanating from the topic model. To do this, we propose two graph metrics: the first measures the variability of the similarity graphs that result from different runs of the algorithm for a fixed value of the hyperparameters, while the second measures the alignment between the graph derived from the LDA model and another obtained using metadata available for the corresponding corpus. Experiments on various corpora related to STI show that the proposed metrics provide relevant indicators for selecting the number of topics and building persistent topic models that are consistent with the metadata. Their use, which can be extended to other topic models beyond LDA, could facilitate the systematic adoption of this kind of technique in STI policy analysis and design.


17.
18.
The notion of ‘core documents’, first introduced in the context of co-citation analysis and later re-introduced for bibliographic coupling, refers to the representation of the core of a publication set according to given criteria. In the present study, the notion of core documents is extended to the combination of citation-based and textual links. It is shown that core documents defined this way can be used to represent and describe document clusters and topics at different levels of aggregation. Methodology is illustrated using the example of two ISI Subject Categories selected from applied and social sciences.

19.
In this study, we evaluated future trends of worldwide patenting in nanotechnology and its domains using logistic growth curves, while the patent activity of the main countries, technological domains and subdomains was assessed in four different contexts: worldwide, patents filed in the United States Patent and Trademark Office (USPTO), and patent applications in the triadic (TRIAD) and in the tetradic (TETRAD) countries. The indicators were developed based on a set of records recovered from the Derwent Innovation Index database. Nanotechnology has recently emerged as a new research field, with logistic trend behaviors generating interesting discussions since they suggest that technological development in nanotechnology and its domains has reached an initial maturation stage. Future scenarios were compiled because of the difficulty of establishing upper limits for the forecasting curves. Although China’s share of patents is small in some cases, it was the only country to constantly increase its number of patents from a worldwide perspective. In contrast, the USA and the EU were the most active in the USPTO, TRIAD and TETRAD cases, followed by Japan and Korea. The technological subdomains of main interest to each country/region changed according to the perspective adopted, even though there was a clear bias towards the semiconductors, surface treatments, electrical components, macromolecular chemistry, materials–metallurgy, pharmacy–cosmetics and analysis–measurement–control subdomains. We conclude that monitoring of nanotechnology advances should be constantly reviewed in order to confirm the evidence observed and forecasted.
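
As a rough illustration of the logistic growth curves and upper-limit scenarios mentioned in this abstract, the sketch below evaluates a three-parameter logistic function under two assumed saturation levels. The parameterization, the function name, and every numeric value are hypothetical. None of them comes from the study.

```python
import math


def logistic(t, upper_limit, growth_rate, midpoint):
    """Three-parameter logistic curve.

    upper_limit: saturation level the curve approaches as t grows.
    growth_rate: steepness of the curve.
    midpoint:    time at which the curve reaches half of upper_limit.
    """
    return upper_limit / (1.0 + math.exp(-growth_rate * (t - midpoint)))


# Hypothetical scenarios: same growth rate and midpoint, different upper limits
# for the cumulative number of patents.
years = list(range(2000, 2031, 10))  # 2000, 2010, 2020, 2030
scenarios = {
    limit: [logistic(year, limit, 0.3, 2012) for year in years]
    for limit in (50_000, 100_000)
}
# At the midpoint year the curve sits at exactly half of its upper limit.
half_way = logistic(2012, 50_000, 0.3, 2012)
```

Since the early part of the curve looks similar for many different upper limits, data from a young field constrains the saturation level poorly. This is one reason to report several scenarios rather than a single forecast.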

20.
Additive manufacturing (AM), or 3D printing, includes techniques capable of manufacturing regular and irregular shapes for small batches of customized products. The ability to customize unusual shapes makes the process particularly suitable for prosthetic products used in biomedical applications. AM adoption in the field of biomedical applications (called bio-AM in this research) has seen significant growth over the last few years. This research develops an Intellectual Property (IP) analytical methodology to explore the portfolios and evolution of patents, as well as their relevance to Taiwan’s Ministry of Science and Technology (MOST) research projects in the bio-AM domain. Specifically, global and domestic IP portfolios for bio-AM innovations are studied using the proposed method. First, the domain documents (US patents and MOST projects) are collected from a global patent database and the MOST project database. Key term frequency counts and a technical clustering analysis of the collected documents are then derived. The key terms and their frequencies of appearance in the documents form the basis for document clustering and similarity analysis. The ontology of bio-AM is constructed based on the clustering results. Finally, the patents and projects in the adjusted clusters are subjected to evolution analysis using concept lattice analysis. This research provides a computer-supported IP evolution analysis system, based on the developed algorithms, for decision support in IP and R&D strategic planning.
