Similar Documents
20 similar documents found.
1.
Traditional topic models have been widely used to analyze semantic topics in electronic documents. However, the topic words they produce suffer from poor readability and consistency, so that often only domain experts can guess their meaning. In practice, phrases are the main unit by which people express semantics. This paper presents Distributed Representation-Phrase Latent Dirichlet Allocation (DRPhrase LDA), a phrase-based topic model. Specifically, the model enhances the semantic information of phrases via distributed representations. Experimental results show that the topics acquired by our model are more readable and consistent than those of similar topic models.
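
A minimal sketch of the phrase-as-token idea, assuming gensim is available: collocation detection merges frequent word pairs into single tokens before LDA. This is an illustrative stand-in, not the authors' DRPhrase LDA, whose distributed-representation weighting of phrases is specific to the paper.

```python
# Sketch: phrase-aware LDA with gensim (illustrative stand-in for DRPhrase LDA).
from gensim.models import Phrases, LdaModel
from gensim.corpora import Dictionary

docs = [
    "latent dirichlet allocation topic model".split(),
    "topic model for semantic analysis".split(),
    "latent dirichlet allocation for documents".split(),
]

# Detect frequent bigrams and merge them into single phrase tokens.
bigram = Phrases(docs, min_count=1, threshold=1)
phrased_docs = [bigram[d] for d in docs]

dictionary = Dictionary(phrased_docs)
corpus = [dictionary.doc2bow(d) for d in phrased_docs]

lda = LdaModel(corpus, id2word=dictionary, num_topics=2, passes=10, random_state=0)
for topic_id, words in lda.show_topics(num_topics=2, num_words=5, formatted=False):
    print(topic_id, [w for w, _ in words])   # phrases appear as word_word tokens
```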

2.
Lamirel, Jean-Charles; Chen, Yue; Cuxac, Pascal; Al Shehabi, Shadi; Dugué, Nicolas; Liu, Zeyuan. Scientometrics, 2020, 125(3): 2971-2999.

In the first part of this paper, we discuss the historical context of Science of Science both in China and worldwide. In the second part, we use the unsupervised combination of GNG clustering with feature maximization metrics and associated contrast graphs to analyze the contents of selected academic journal papers in Science of Science in China and to construct an overall map of the structure of research topics over the last 40 years. Furthermore, we highlight how the topics have evolved through analysis of publication dates, and we use author information to clarify the topics' content. The results, reviewed and approved by three leading experts in this field, show that Chinese Science of Science has gradually matured over the last 40 years, evolving from the general nature of the discipline itself to related disciplines and their potential interactions, from qualitative analysis to quantitative and visual analysis, and from general research on the social function of science to more specific studies of its economic and strategic functions. The proposed method can thus be used without supervision, parameter tuning, or external knowledge to obtain clear and precise insights into the development of a scientific domain. Finally, domain experts compared the output of the topic extraction step (clustering + feature maximization) with that of the well-known LDA approach, which highlights the clear superiority of the proposed method.


3.
Rehs, Andreas. Scientometrics, 2020, 125(2): 1229-1251.

The detection of differences or similarities across large numbers of scientific publications is an open problem in scientometric research. In this paper we develop and apply a machine learning approach based on structural topic modelling, combined with cosine similarity and a linear regression framework, to identify differences in dissertation titles written at East and West German universities before and after German reunification. Reunification and its surrounding period are used because they provide a setting with both minor and major differences in research topics that our approach should detect. Our dataset consists of dissertation titles in economics and business administration and in chemistry from 1980 to 2010. We use the university affiliation and year of each dissertation to train a structural topic model and then apply the model to a set of unseen dissertation titles. Subsequently, we compare the resulting topic distribution of each title to every other title via cosine similarity. The cosine similarities, together with the regional and temporal origin of the underlying titles, then enter a linear regression. Our results for economics and business administration suggest substantial differences between East and West Germany before reunification and rapid convergence thereafter. In chemistry, we observe minor differences between East and West before reunification and slightly increased similarity thereafter.
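
A sketch of the similarity-and-regression step, assuming each title already has a topic distribution (the paper fits a structural topic model; any document-topic matrix serves for illustration). The region and period indicators here are hypothetical toy data.

```python
# Sketch: pairwise cosine similarity of title topic distributions, then a
# linear regression of similarity on region/time indicators (illustrative).
import numpy as np
from itertools import combinations
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
theta = rng.dirichlet(np.ones(10), size=200)   # document-topic matrix (200 titles)
region = rng.integers(0, 2, size=200)          # 0 = West, 1 = East (hypothetical)
post1990 = rng.integers(0, 2, size=200)        # dissertation after reunification?

sims = cosine_similarity(theta)
rows = []
for i, j in combinations(range(200), 2):
    same_region = int(region[i] == region[j])
    both_post = int(post1990[i] and post1990[j])
    rows.append((sims[i, j], same_region, both_post, same_region * both_post))

y = np.array([r[0] for r in rows])
X = np.array([r[1:] for r in rows])
reg = LinearRegression().fit(X, y)
print(dict(zip(["same_region", "both_post1990", "interaction"], reg.coef_)))
```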


4.
A knowledge organization system (KOS) can readily reveal the deep knowledge structure of a patent document set. Compared with classification code systems, a personalized KOS composed of topics can represent technological information in a more agile and detailed manner. This paper presents an approach to automatically construct a KOS for patent documents based on term clumping, the Latent Dirichlet Allocation (LDA) model, K-Means clustering, and Principal Components Analysis (PCA). Term clumping is adopted to generate a better bag-of-words for topic modeling, and the LDA model is applied to generate raw topics. Then, by iteratively applying K-Means clustering and PCA to the document set and the topic matrix, we generate new upper topics and compute the relationships between topics to construct the KOS, to which the documents are finally mapped. The nodes of the KOS are topics, represented by terms and their weights, and the leaves are patent documents. We evaluated the approach on a set of Large Aperture Optical Elements (LAOE) patent documents as an empirical study and constructed the LAOE KOS. The method discovered deep semantic relationships between the topics and helped to better describe the technological themes of LAOE. Based on the KOS, two types of applications were implemented: automatic classification of patent documents and categorical refinement of search results.
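
A simplified sketch of the upper-topic construction with scikit-learn: LDA produces raw topics, whose topic-term vectors are reduced with PCA and grouped with K-Means. The paper's iterative procedure and term clumping step are omitted, and the toy patent snippets are invented.

```python
# Sketch: raw LDA topics grouped into upper topics with PCA + K-Means
# (a simplified version of the paper's iterative KOS construction).
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation, PCA
from sklearn.cluster import KMeans

patents = ["optical element coating laser", "laser damage threshold test",
           "lens polishing process", "coating deposition chamber",
           "polishing slurry composition", "damage test apparatus"]

X = CountVectorizer().fit_transform(patents)
lda = LatentDirichletAllocation(n_components=4, random_state=0).fit(X)

# Rows of components_ are raw topics over terms; reduce them, then cluster.
topic_term = lda.components_ / lda.components_.sum(axis=1, keepdims=True)
reduced = PCA(n_components=2).fit_transform(topic_term)
upper = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(reduced)
print("raw topic -> upper topic:", dict(enumerate(upper)))
```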

5.
Topic modeling is a class of statistical models for discovering, through machine learning, the latent "topics" that occur in a collection of documents. Latent Dirichlet allocation (LDA) is currently a popular and widely used approach. In this paper, we investigate methods, including LDA and its extensions, for separating a set of scientific publications into several clusters. To evaluate the results, we assemble a collection of academic papers from several different fields and test whether papers from the same field are clustered together. We also explore potential scientometric applications of such text analysis capabilities.
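
A minimal sketch of the evaluation idea, using newsgroup categories as stand-in "fields": documents are clustered by their dominant LDA topic, and agreement with the known labels is measured with the adjusted Rand index. (The 20 Newsgroups dataset is downloaded on first run.)

```python
# Sketch: cluster papers by dominant LDA topic and check agreement with
# known field labels via the adjusted Rand index (illustrative evaluation).
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.metrics import adjusted_rand_score

cats = ["sci.med", "sci.space", "rec.sport.hockey"]     # stand-in "fields"
data = fetch_20newsgroups(subset="train", categories=cats,
                          remove=("headers", "footers", "quotes"))

X = CountVectorizer(max_features=5000, stop_words="english").fit_transform(data.data)
doc_topic = LatentDirichletAllocation(n_components=3, random_state=0).fit_transform(X)

clusters = doc_topic.argmax(axis=1)                     # dominant topic as cluster label
print("ARI:", adjusted_rand_score(data.target, clusters))
```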

6.
The paper discusses an application of bibliometric techniques in the social sciences. As the interest of policy makers grows, the topic is receiving increasing attention from bibliometricians. However, much effort goes into developing tools to measure scientific output and impact outside the world of the Social Sciences Citation Index, while the use of the SSCI for bibliometric applications remains shrouded in obscurity and myths. This study attempts to clarify some of the objections raised against using the SSCI for evaluation purposes. It covers topics such as the existing publication and citation culture within the social sciences, the effect of variable citation windows, and the (geographical) origin of citation flows.

7.
Cagliero, Luca; Garza, Paolo; Kavoosifar, Mohammad Reza; Baralis, Elena. Scientometrics, 2018, 116(2): 1273-1301.

Identifying the most relevant scientific publications on a given topic is a well-known research problem. The Author-Topic Model (ATM) is a generative model that represents the relationships between research topics and publication authors, allowing us to identify the most influential authors on a particular topic. However, since most research works are co-authored by several researchers, the information provided by the ATM can be complemented by studying the most fruitful collaborations among multiple authors. This paper addresses the discovery of research collaborations among multiple authors on single or multiple topics. Specifically, it exploits an exploratory data mining technique, weighted association rule mining, to analyze publication data and discover correlations between ATM topics and combinations of authors. The mined rules characterize groups of researchers with fairly high scientific productivity by indicating (1) the research topics covered by their most cited publications and the relevance of their scientific production for each topic, (2) the nature of the collaboration (topic-specific or cross-topic), (3) the external authors who have (occasionally) collaborated with the group either on a specific topic or on multiple topics, and (4) the underlying correlations between the addressed topics. The applicability of the proposed approach was validated on real data from the Online Mendelian Inheritance in Man catalog of genetic disorders and from the PubMed digital library. The results confirm the effectiveness of the proposed strategy.
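
A sketch of the weighted-support computation underlying weighted association rule mining: each publication carries a weight (e.g., its topic relevance or citation-based impact), and an author combination's support on a topic sums the weights of the publications containing it. The records below are hypothetical.

```python
# Sketch: weighted support of author combinations per ATM topic (toy data).
from itertools import combinations
from collections import defaultdict

# (authors, topic, weight) per publication -- hypothetical records
pubs = [({"Rossi", "Bianchi"}, "T1", 0.9),
        ({"Rossi", "Bianchi", "Verdi"}, "T1", 0.7),
        ({"Verdi"}, "T2", 0.8),
        ({"Rossi", "Verdi"}, "T2", 0.5)]

support = defaultdict(float)
total = defaultdict(float)
for authors, topic, w in pubs:
    total[topic] += w
    for r in (1, 2):                                  # singletons and pairs
        for combo in combinations(sorted(authors), r):
            support[(combo, topic)] += w

for (combo, topic), s in sorted(support.items(), key=lambda kv: -kv[1]):
    print(topic, combo, round(s / total[topic], 2))   # weighted support
```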


8.
Gao, Qiang; Huang, Xiao; Dong, Ke; Liang, Zhentao; Wu, Jiang. Scientometrics, 2022, 127(3): 1543-1563.

Combining a topic model with semantic methods can uncover the semantic distributions of topics and the changing characteristics of those distributions, providing a new perspective for research on topic evolution. This study proposes a word-based approach for quantifying semantic distributions and their changing characteristics in topic evolution, using the Dynamic Topic Model (DTM) and the word2vec model. A dataset from the field of Library and Information Science (LIS) is used in the empirical study, the topic-semantic probability distribution is derived, and the evolving dynamics of the topics are constructed. The characteristics of these evolving dynamics explain the semantic distributions of topics during evolution, and the regularities of the dynamics explain how the semantic distributions change. Results show that no topic is confined to a single semantic concept; most topics in LIS correspond to several semantic concepts. Topics in LIS fall into three kinds: convergent, diffusive, and stable. The discovery of these different modes of topic evolution provides further evidence of the field's development. In addition, the findings indicate that the popularity of a topic and the characteristics of its evolving dynamics are unrelated.
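
A sketch of the DTM-plus-word2vec idea, assuming gensim: train both models on the same corpus and track how semantically tight a topic's top words are across time slices (rising tightness suggests a convergent topic, falling suggests a diffusive one). Toy data, not the paper's LIS corpus or its exact measures.

```python
# Sketch: semantic tightness of a DTM topic's top words per time slice.
from itertools import combinations
from gensim.corpora import Dictionary
from gensim.models import LdaSeqModel, Word2Vec

docs = [["library", "catalog", "index"], ["index", "retrieval", "query"],
        ["altmetrics", "twitter", "impact"], ["impact", "citation", "twitter"],
        ["retrieval", "query", "ranking"], ["citation", "altmetrics", "ranking"]]
dictionary = Dictionary(docs)
corpus = [dictionary.doc2bow(d) for d in docs]

dtm = LdaSeqModel(corpus=corpus, id2word=dictionary, time_slice=[3, 3], num_topics=2)
w2v = Word2Vec(docs, vector_size=50, min_count=1, seed=0)

for t in range(2):                                    # time slices
    top = [w for w, _ in dtm.print_topic(topic=0, time=t, top_terms=4)]
    sims = [w2v.wv.similarity(a, b) for a, b in combinations(top, 2)]
    print(f"slice {t}: top={top}, mean pairwise similarity={sum(sims)/len(sims):.3f}")
```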


9.
An extended latent Dirichlet allocation (LDA) model is presented in this paper for patent competitive intelligence analysis. After part-of-speech tagging and the definition of noun phrase extraction rules, technological words are extracted from patent titles and abstracts, which allows us to go one step further and perform patent analysis at the content level. The LDA model is then used to identify underlying topic structures based on the latent relationships among the extracted technological words, which helps to review research hot spots and directions within subclasses of patented technology in a given field. To extend the traditional LDA model, an institution-topic probability layer is added to the original model, so that the distribution probabilities of directly competing enterprises and their technological positions can be identified for each topic. A case study is then carried out on LTE, one of the core patented technologies in next-generation telecommunications. This empirical study reveals emerging hot spots of LTE technology and finds that the major companies in this field focus on different technological subfields and hold different competitive positions.
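
The institution-topic layer is close in spirit to the Author-Topic model, so a hedged sketch can approximate it with gensim's AuthorTopicModel, treating each patent's assignee as an "author". This is an analogue, not the authors' exact extension; the assignees and tokens are invented.

```python
# Sketch: an institution-topic layer approximated with gensim's
# AuthorTopicModel, treating each patent's assignee as an "author".
from gensim.corpora import Dictionary
from gensim.models import AuthorTopicModel

patents = [["ofdm", "carrier", "aggregation"], ["mimo", "antenna", "array"],
           ["carrier", "aggregation", "scheduling"], ["antenna", "beamforming", "mimo"]]
dictionary = Dictionary(patents)
corpus = [dictionary.doc2bow(p) for p in patents]

# Hypothetical assignees; maps institution -> indices of its patents.
inst2doc = {"CompanyA": [0, 2], "CompanyB": [1, 3]}

model = AuthorTopicModel(corpus=corpus, author2doc=inst2doc,
                         id2word=dictionary, num_topics=2, random_state=0)
for inst in inst2doc:
    print(inst, model.get_author_topics(inst))   # institution's topic distribution
```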

10.
Because text topic clustering is slow on stand-alone architectures in the big-data setting, this paper takes news text as the research object and proposes an LDA text topic clustering algorithm based on the Spark big data platform. Since the TF-IDF (term frequency-inverse document frequency) implementation in Spark is irreversible with respect to word mapping, the mapped word indexes cannot be traced back to the original words. This paper proposes an optimized TF-IDF method under Spark that ensures the original words can be recovered. First, text features are extracted with the proposed TF-IDF algorithm combined with CountVectorizer; the features are then fed to the LDA (Latent Dirichlet Allocation) topic model for training, which yields the text topic clustering. Experimental results show that, for large data samples, the processing speed of LDA topic-model clustering is improved on Spark. At the same time, compared with an LDA topic model based on raw word-frequency input, the proposed model achieves lower perplexity.
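
A sketch of the recoverable-vocabulary pipeline in PySpark: unlike HashingTF-based TF-IDF, CountVectorizer keeps an explicit vocabulary, so LDA topic term indices can be mapped back to the original words, which is the traceability problem the paper addresses.

```python
# Sketch of the Spark pipeline: CountVectorizer -> IDF -> LDA, with topic
# term indices mapped back to readable words via the stored vocabulary.
from pyspark.sql import SparkSession
from pyspark.ml.feature import Tokenizer, CountVectorizer, IDF
from pyspark.ml.clustering import LDA

spark = SparkSession.builder.appName("lda-news").getOrCreate()
df = spark.createDataFrame([("stock market rises on trade news",),
                            ("team wins the championship game",),
                            ("market falls as trade talks stall",)], ["text"])

tokens = Tokenizer(inputCol="text", outputCol="words").transform(df)
cv_model = CountVectorizer(inputCol="words", outputCol="tf").fit(tokens)
tf = cv_model.transform(tokens)
tfidf = IDF(inputCol="tf", outputCol="features").fit(tf).transform(tf)

lda_model = LDA(k=2, maxIter=20).fit(tfidf)
vocab = cv_model.vocabulary                      # index -> original word
for row in lda_model.describeTopics(3).collect():
    print([vocab[i] for i in row.termIndices])   # topics as readable words
```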

11.
Opportunistic multihop networks with mobile relays have recently drawn much attention from researchers across the globe due to their wide applications in various challenging environments. However, because of their peculiar intrinsic features, such as lack of continuous connectivity, network partitioning, highly dynamic behavior, and long delays, it is very difficult to model such networks and effectively capture their temporal variations with classical graph models. In this work, we model the dynamic network as an evolving graph and propose a matrix-based algorithm to generate all minimal path sets between every node pair of the network. We show that these time-stamped minimal path sets (TS-MPS) between a given source-destination node pair can be used, via the well-known Sum-of-Disjoint Products technique, to generate various reliability metrics of dynamic networks: the two-terminal reliability of the dynamic network and its related metrics, i.e., the two-terminal reliabilities of the foremost, shortest, and fastest TS-MPS, and the Expected Hop Count. We also introduce and compute a new network performance metric, the Expected Slot Count. Two illustrative examples of dynamic networks, one with four nodes and the other with five, demonstrate the salient features of our technique for generating TS-MPS and reliability metrics.
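
A simplified sketch of time-stamped path enumeration: the evolving graph is given as per-slot edge sets, and a DFS enumerates simple time-respecting paths, each corresponding to a minimal path set. The paper's matrix-based algorithm and the reliability computation via Sum-of-Disjoint Products are not reproduced here.

```python
# Sketch: enumerate time-respecting simple paths (TS-MPS-style) in an
# evolving graph given as {time slot: set of undirected edges}.
def ts_minimal_paths(snapshots, src, dst):
    slots = sorted(snapshots)
    paths = []

    def dfs(node, t_min, visited, path):
        if node == dst:
            paths.append(list(path))
            return
        for t in slots:
            if t < t_min:
                continue                      # edges must be used in time order
            for u, v in snapshots[t]:
                nxt = v if u == node else u if v == node else None
                if nxt is not None and nxt not in visited:
                    visited.add(nxt)
                    path.append((node, nxt, t))
                    dfs(nxt, t, visited, path)
                    path.pop()
                    visited.remove(nxt)

    dfs(src, min(slots), {src}, [])
    return paths

# Four-node example: edges appear and disappear across three time slots.
snapshots = {1: {(0, 1), (1, 2)}, 2: {(1, 3), (2, 3)}, 3: {(0, 2)}}
for p in ts_minimal_paths(snapshots, 0, 3):
    print(p)     # each hop is (from, to, time slot)
```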

12.
Short reviews of the film 《少年的你》 (Better Days) were crawled from the Douban website with Python. After cleaning, the review text was segmented and stripped of stop words using a purpose-built segmentation dictionary and stop-word dictionary, yielding a relatively normalized text. The TF-IDF algorithm is used to extract keywords from the reviews, and an LDA topic model built on these keywords extracts the review topics from a quantitative angle, so as to analyze the audience's sentiment toward the film and the hot topics in the reviews. This provides decision support for consumers' purchasing behavior and points to development directions for content providers.
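
A sketch of the segmentation-keywords-topics pipeline, assuming jieba and gensim; the crawling step is omitted, and the sample reviews, dictionary file, and stop-word list are placeholders.

```python
# Sketch: jieba segmentation with custom dictionaries, TF-IDF keyword
# extraction, then LDA over the keywords (placeholder data).
import jieba
import jieba.analyse
from gensim.corpora import Dictionary
from gensim.models import LdaModel

reviews = ["电影很感人,演员演技在线", "剧情紧凑,校园霸凌话题引人深思"]

# jieba.load_userdict("user_dict.txt")        # custom segmentation dictionary
stopwords = {"很", "在", "的"}                  # stand-in stop-word dictionary

docs = []
for r in reviews:
    words = [w for w in jieba.lcut(r) if w not in stopwords and len(w) > 1]
    # keep only this review's TF-IDF keywords as the modeling unit
    keywords = set(jieba.analyse.extract_tags(" ".join(words), topK=10))
    docs.append([w for w in words if w in keywords])

dictionary = Dictionary(docs)
corpus = [dictionary.doc2bow(d) for d in docs]
lda = LdaModel(corpus, id2word=dictionary, num_topics=2, passes=10, random_state=0)
print(lda.show_topics(num_words=5))
```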

13.

Bibliometric analysis is a growing research field supported by different tools, some of which are based on network representation or thematic analysis. Despite years of tool development, there is still a need to support the merging of information from different sources and to enhance longitudinal temporal analysis of trending topic evolution. We developed a new open-source scientometric tool called ScientoPy and demonstrate it in a use case on the Internet of Things topic. The tool addresses the problem of merging records from the Scopus and Clarivate Web of Science sources, extracts and represents the h-index for the analyzed topic, and offers a set of possibilities for temporal analysis of authors, institutions, wildcards, and trending topics using four different visualization options. The tool enables future bibliometric analysis in different emerging fields.


14.
Research competitiveness analysis refers to the measurement, comparison, and analysis of the research status (i.e., strengths and/or weaknesses) of different scientific research bodies (e.g., institutions, researchers) in different research fields. Improving research competitiveness analysis methods helps to capture accurately the research status of fields and research bodies. This paper presents a method for evaluating the competitiveness of research institutions based on the distribution of research topics. The method uses the LDA topic model to obtain a paper-topic distribution matrix, which objectively assigns the academic impact of papers (such as citation counts) to research topics. It then calculates the competitiveness of each research institution on each research topic with the help of an institution-paper matrix. Finally, the competitiveness and the research strengths and/or weaknesses of the institutions are defined and characterized. A case study shows that the method yields an objective and effective evaluation of the research competitiveness of institutions in a given research field.
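
The core computation reduces to two matrix products, sketched here with toy sizes: the paper-topic matrix distributes each paper's citation impact over topics, and the institution-paper matrix aggregates that impact per institution.

```python
# Sketch: citation impact assigned to topics via the paper-topic matrix,
# then aggregated per institution via the institution-paper matrix.
import numpy as np

theta = np.array([[0.7, 0.3],         # paper-topic distributions (3 papers, 2 topics)
                  [0.2, 0.8],
                  [0.5, 0.5]])
citations = np.array([10, 4, 6])      # per-paper impact (e.g., citation counts)
A = np.array([[1, 0, 1],              # institution-paper matrix (2 institutions)
              [0, 1, 0]])

# Weight each paper's topic shares by its impact, then sum per institution.
competitiveness = A @ (theta * citations[:, None])
print(competitiveness)                # rows: institutions, cols: topics
```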

15.
Glänzel, Wolfgang; Chi, Pei-Shan. Scientometrics, 2020, 125(2): 1011-1031.

In the present study we discuss the challenge of "Scientometrics 2.0", as introduced by Priem and Hemminger (2010), in the light of possible applications to research evaluation. We use the Web of Science subject category public, environmental and occupational health to illustrate how indicators similar to those used in traditional scientometrics can be built, and we discuss their opportunities and limitations. The discipline under study combines the life sciences and social sciences in a unique manner and provides usable metrics reflecting both scholarly and wider impact. Nonetheless, metrics reflecting social media attention, such as tweets, retweets, and Facebook likes, shares, or comments, are still subject to limitations in this research discipline as well. Furthermore, usage metrics clearly show how prone this measure is to manipulation. Although the counterparts of important bibliometric indicators proved to work for several altmetrics too, their interpretation and application to research assessment require proper context analysis.


16.

A thermodynamic approach is applied to the problem of selecting the number of clusters/topics in topic modeling. The main principles of this approach are formulated, and the behavior of topic models under temperature variation is studied. Using the thermodynamic formalism, the existence of an entropy phase transition in topic models is shown, and criteria for choosing the optimal number of clusters/topics are determined.
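
A much-simplified stand-in for the idea, not the paper's thermodynamic formalism: scan the number of topics and track the average entropy of the topic-word distributions; a minimum or kink in such a curve is the kind of signal the entropy phase-transition criterion formalizes.

```python
# Sketch: entropy of topic-word distributions as a function of the number
# of topics (synthetic counts; a simplified stand-in for the paper's method).
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation

rng = np.random.default_rng(0)
X = rng.poisson(0.1, size=(300, 1000))   # synthetic document-term counts

for k in (5, 10, 20, 40):
    lda = LatentDirichletAllocation(n_components=k, random_state=0).fit(X)
    p = lda.components_ / lda.components_.sum(axis=1, keepdims=True)
    entropy = -(p * np.log(p)).sum(axis=1).mean()    # mean topic-word entropy
    print(f"k={k:2d}  mean topic-word entropy={entropy:.3f}")
```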


17.

Citations play a pivotal role in indicating various aspects of scientific literature. Quantitative citation analysis approaches have been used over the decades to measure the impact factor of journals, to rank researchers or institutions, to discover evolving research topics, and so on. Researchers have questioned purely quantitative citation analysis, arguing that not all citations are equally important and that citation reasons must be considered when counting. More recently, researchers have focused on identifying important citations by classifying them into important and non-important classes rather than classifying each citation reason individually. Most contemporary citation classification techniques either rely on the full content of articles or are dominated by content-based features. However, content is often not freely available, as many journal publishers do not provide open access to articles. This paper presents a binary citation classification scheme dominated by metadata-based parameters, and the study demonstrates the significance of metadata-based and content-based parameters in varying scenarios. The experiments are performed on two annotated datasets, evaluated with the SVM, KLR, and Random Forest machine learning classifiers. The results are compared with a contemporary study that performed a similar classification using a rich list of content-based features. The comparison reveals that the proposed model attains improved precision (0.68) while relying only on freely available metadata. We argue that the proposed approach can serve as the best alternative in scenarios where content is unavailable.
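
A sketch of the binary classification setup with scikit-learn; the metadata-style features and labels below are randomly generated stand-ins for the paper's parameters, shown only to illustrate the evaluation loop.

```python
# Sketch: binary (important / non-important) citation classification from
# metadata-style features only (toy stand-in data).
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

rng = np.random.default_rng(0)
n = 400
X = np.column_stack([
    rng.integers(0, 30, n),     # hypothetical: author overlap / self-citation count
    rng.integers(0, 15, n),     # hypothetical: shared references between the papers
    rng.integers(0, 10, n),     # hypothetical: years between the two papers
    rng.integers(0, 2, n),      # hypothetical: same-venue flag
])
y = rng.integers(0, 2, n)       # 1 = important citation (toy labels)

for clf in (SVC(kernel="rbf"), RandomForestClassifier(random_state=0)):
    scores = cross_val_score(clf, X, y, cv=5, scoring="precision")
    print(type(clf).__name__, "precision:", scores.mean().round(3))
```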


18.
Expert finding is of vital importance for exploring scientific collaborations that increase productivity by sharing and transferring knowledge within and across research areas. Expert finding methods, including content-based methods, link structure-based methods, and combinations of the two, have been studied in recent years. However, most state-of-the-art approaches study candidates' personal information (e.g., topic relevance and citation counts) and network information (e.g., citation relationships) separately, causing some potential experts to be overlooked. In this paper, we propose a topical and weighted factor graph model that combines all the available information simultaneously in a unified way. We also design a Loopy Max-Product algorithm and the related message-passing schedules to perform approximate inference on our cycle-containing factor graph model. Information Retrieval is chosen as the test field for identifying representative authors on different topics within this area. Finally, we compare our approach with three baseline methods in terms of topic sensitivity, coverage rate of SIGIR PC (Program Committee or Program Chair) members, and Normalized Discounted Cumulative Gain scores for different rankings on each topic. The experimental results demonstrate that our factor graph-based model clearly enhances expert-finding performance.

19.
Ma, Anqi; Liu, Yu; Xu, Xiujuan; Dong, Tao. Scientometrics, 2021, 126(8): 6803-6823.

Predicting the impact of academic papers can help scholars quickly identify high-quality papers in their field, and developing efficient predictive models for evaluating potential papers has attracted increasing attention in academia. Many studies have shown that early citations help in predicting the long-term impact of a paper. Besides early citations, bibliometric and altmetric features have also been explored for impact prediction. Furthermore, paper metadata text such as the title, abstract, and keywords contains valuable information that affects citation counts; however, existing studies ignore the semantic information contained in this metadata text. In this paper, we propose a novel citation prediction model based on paper metadata text to predict the long-term citation count; the core of our model is to extract the semantic information from the metadata text. We use deep learning techniques to encode the metadata text and then extract high-level semantic features for learning the citation prediction task, and we integrate early citations to improve prediction performance. We show that our proposed model outperforms state-of-the-art models in predicting the long-term citation count of papers, and that metadata semantic features are effective for improving the accuracy of citation prediction models.
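
A minimal sketch of the fusion idea, assuming TensorFlow/Keras: the metadata text is embedded and pooled into a semantic vector, concatenated with the early citation count, and regressed onto the long-term citation count. The tiny dataset and architecture are placeholders, not the paper's model.

```python
# Sketch: fuse encoded metadata text with early citations to predict
# long-term citations (placeholder data and architecture).
import numpy as np
import tensorflow as tf

titles = np.array([["deep learning for citation prediction"],
                   ["topic models in scientometrics"]])
early_citations = np.array([[3.0], [1.0]])   # citations in the first years
long_term = np.array([40.0, 12.0])           # target: long-term citation count

vectorize = tf.keras.layers.TextVectorization(max_tokens=5000,
                                              output_sequence_length=30)
vectorize.adapt(titles.ravel())

text_in = tf.keras.Input(shape=(1,), dtype=tf.string)
cites_in = tf.keras.Input(shape=(1,))
x = vectorize(text_in)
x = tf.keras.layers.Embedding(5000, 32)(x)
x = tf.keras.layers.GlobalAveragePooling1D()(x)    # semantic summary of the text
x = tf.keras.layers.concatenate([x, cites_in])     # fuse with early citations
x = tf.keras.layers.Dense(16, activation="relu")(x)
out = tf.keras.layers.Dense(1)(x)

model = tf.keras.Model([text_in, cites_in], out)
model.compile(optimizer="adam", loss="mse")
model.fit([titles, early_citations], long_term, epochs=2, verbose=0)
```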


20.
In computer vision, emotion recognition from facial expression images is an important research issue, and recent advances in deep learning have helped to improve results. According to recent studies, facial photographs representing a particular type of emotion may contain multiple facial expressions, so it is feasible and useful to convert face photos into collections of visual words and carry out global expression recognition. The main contribution of this paper is a facial expression recognition model (FERM) based on an optimized Support Vector Machine (SVM). To test the performance of the proposed model, AffectNet is used; AffectNet queried three major search engines with 1250 emotion-related keywords in six different languages and collected over 1,000,000 facial photos online. The FERM comprises three main phases: (i) data preparation, (ii) grid search optimization, and (iii) categorization. Linear discriminant analysis (LDA) is used to categorize the data into eight labels (neutral, happy, sad, surprised, fear, disgust, angry, and contempt), and its use markedly improves the performance of the SVM categorization. Grid search is used to find the optimal values of the SVM hyperparameters (C and gamma). The proposed optimized SVM algorithm achieves an accuracy of 99% and an F1 score of 98%.
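
A sketch of phases (ii) and (iii) with scikit-learn: LDA-based dimensionality reduction feeding an SVM whose C and gamma are grid-searched; synthetic data stands in for AffectNet features.

```python
# Sketch: LDA projection to 7 dimensions (8 labels) followed by an SVM
# with grid-searched C and gamma (synthetic stand-in for AffectNet).
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

X, y = make_classification(n_samples=400, n_features=40, n_informative=20,
                           n_classes=8, random_state=0)

pipe = Pipeline([("lda", LinearDiscriminantAnalysis(n_components=7)),  # 8 labels -> <=7 dims
                 ("svm", SVC())])
grid = GridSearchCV(pipe, {"svm__C": [0.1, 1, 10, 100],
                           "svm__gamma": ["scale", 0.01, 0.1]}, cv=5)
grid.fit(X, y)
print(grid.best_params_, round(grid.best_score_, 3))
```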
