Similar Literature
 17 similar documents found (search time: 15 ms)
1.
Using text classification and multiple concepts to answer e-mails   (Total citations: 1; self-citations: 0; citations by others: 1)
In text mining, text classification techniques serve a broad range of applications, including text filtering, word identification, and web page classification. Through text classification, documents can be placed into predefined classes, saving the time that manual document search would otherwise require. This research applies text classification techniques to suggesting e-mail reply templates, in order to lower the burden on customer service personnel when responding to e-mails. With a predetermined set of templates, suggested templates let personnel find the needed reply without wasting time searching for relevant answers in an overload of available information. Current text classification techniques remain single-concept based. This research instead uses a multiple-concept method to integrate the relationships between concepts and classifications, thereby enabling straightforward text classification. By integrating different concepts and classifications, a dynamically unified e-mail concept can recommend different appropriate reply templates; the differences between e-mails can then be determined definitively, effectively improving the accuracy of the suggested template. In addition, for e-mails containing two or more questions, this research attempts to produce an appropriate reply template. Experimental verification shows that the proposed method effectively suggests templates for e-mails with multiple questions. Representing a document's topic with multiple concepts is therefore a clearer way of extracting the information a document intends to convey than relying on the vector of similar documents alone.

2.
Consider a supervised learning problem in which examples contain both numerical- and text-valued features. To use traditional feature-vector-based learning methods, one could treat the presence or absence of a word as a Boolean feature and use these binary-valued features together with the numerical features. Using a text-classification system on such data, however, is more problematic: in the most straightforward approach, each number would be considered a distinct token and treated as a word. This paper presents an alternative approach to applying text classification methods to supervised learning problems with numerical-valued features, in which the numerical features are converted into bag-of-words features, making them directly usable by text classification methods. We show that even on purely numerical data, text classification on the derived text-like representation outperforms the more naive numbers-as-tokens representation and, more importantly, is competitive with mature numerical classification methods such as C4.5, Ripper, and SVM. We further show that on mixed-mode data, adding numerical features using our approach can improve performance over not adding those features.
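As a rough illustration of the numbers-to-text idea, the sketch below discretizes each numerical feature into bins and emits one pseudo-word per bin, producing a "document" a text classifier can consume. The binning scheme, bin edges, and token names are illustrative assumptions, not the paper's exact conversion.

```python
# Minimal sketch (assumed scheme): map numeric features to bin-name pseudo-words
# so that a bag-of-words text classifier can consume them directly.

def numeric_to_token(name, value, edges):
    """Return a pseudo-word naming the bin that `value` falls into."""
    for i, edge in enumerate(edges):
        if value <= edge:
            return f"{name}_bin{i}"
    return f"{name}_bin{len(edges)}"

def example_to_bag(example, schema):
    """Convert a dict of numeric features into a list of pseudo-words."""
    return [numeric_to_token(k, v, schema[k]) for k, v in example.items()]

# age 42 falls past edges 18 and 35 but under 60 -> bin 2;
# income 30500 falls under 50000 -> bin 1.
doc = example_to_bag({"age": 42, "income": 30500},
                     {"age": [18, 35, 60], "income": [20000, 50000]})
```

The resulting token list can be fed to any bag-of-words pipeline alongside ordinary word tokens from the text-valued features.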

3.
Using Wikipedia knowledge to improve text classification   (Total citations: 7; self-citations: 7; citations by others: 0)
Text classification has been widely used to assist users with the discovery of useful information from the Internet. However, traditional classification methods are based on the “Bag of Words” (BOW) representation, which only accounts for term frequency in the documents, and ignores important semantic relationships between key terms. To overcome this problem, previous work attempted to enrich text representation by means of manual intervention or automatic document expansion. The achieved improvement is unfortunately very limited, due to the poor coverage capability of the dictionary, and to the ineffectiveness of term expansion. In this paper, we automatically construct a thesaurus of concepts from Wikipedia. We then introduce a unified framework to expand the BOW representation with semantic relations (synonymy, hyponymy, and associative relations), and demonstrate its efficacy in enhancing previous approaches for text classification. Experimental results on several data sets show that the proposed approach, integrated with the thesaurus built from Wikipedia, can achieve significant improvements with respect to the baseline algorithm.
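The expansion step can be pictured roughly as follows: related concepts from a thesaurus are added to the bag-of-words with a damped weight. The toy thesaurus and the 0.5 damping weight below are stand-ins for the Wikipedia-derived thesaurus and are assumptions for illustration only.

```python
# Hedged sketch: enrich a bag-of-words with thesaurus concepts (synonyms,
# hypernyms, associated terms). The thesaurus and weight are illustrative.
from collections import Counter

def expand_bow(tokens, thesaurus, weight=0.5):
    """Add related concepts to a bag-of-words, down-weighted by `weight`."""
    bow = Counter(tokens)
    expanded = dict(bow)
    for term, count in bow.items():
        for concept in thesaurus.get(term, []):
            expanded[concept] = expanded.get(concept, 0) + weight * count
    return expanded

bow = expand_bow(["car", "car", "road"],
                 {"car": ["automobile", "vehicle"]})
```

Documents mentioning "car" and documents mentioning "automobile" now share features, which is the kind of semantic overlap the plain BOW representation misses.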

4.
Text classification is an important means by which knowledge management systems organize, store, and retrieve knowledge effectively. However, text classification based on the word vector space model does not account for the characteristics of knowledge management systems, and therefore cannot satisfy their need for multi-class categorization. This paper proposes a new ontology-based text classification algorithm that uses the ontology set within a knowledge management system to achieve classification at multiple concept granularities. Experiments show that the method delivers good classification performance.

5.
Neighbor-weighted K-nearest neighbor for unbalanced text corpus   (Total citations: 10; self-citations: 0; citations by others: 10)
Text categorization, or classification, is the automated assignment of text documents to predefined classes based on their contents. Many classification algorithms assume that the training examples are evenly distributed among the classes. However, unbalanced data sets appear in many practical applications. To deal with uneven text sets, we propose the neighbor-weighted K-nearest neighbor algorithm, NWKNN. The experimental results indicate that NWKNN achieves significant classification performance improvement on imbalanced corpora.
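The core idea of down-weighting votes from large classes can be sketched as below. The weighting formula here (inverse class-size ratio raised to a tunable exponent) is a simplified stand-in for illustration, not necessarily NWKNN's exact weighting.

```python
# Simplified sketch of class-size-weighted kNN voting for imbalanced corpora.
# The weight formula is an assumption, not NWKNN's verified formula.
from collections import Counter

def weighted_knn_vote(neighbor_labels, class_sizes, exponent=1.0):
    """Vote among the k nearest neighbors' labels, down-weighting classes
    with many training documents so the minority class can still win."""
    min_size = min(class_sizes.values())
    scores = Counter()
    for label in neighbor_labels:
        scores[label] += (min_size / class_sizes[label]) ** (1.0 / exponent)
    return scores.most_common(1)[0][0]

# Two neighbors from a 900-document class are outvoted by one neighbor
# from a 100-document class (2 * 100/900 < 1 * 1.0).
winner = weighted_knn_vote(["big", "big", "rare"],
                           {"big": 900, "rare": 100})
```

With plain majority voting the large class would win here; the weighting flips the outcome toward the minority class, which is the behavior the abstract describes.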

6.
This paper presents a novel over-sampling method based on document content to handle the class imbalance problem in text classification. The new technique, COS-HMM (Content-based Over-Sampling HMM), includes an HMM trained on a corpus in order to create new samples resembling the current documents. The HMM is treated as a document generator that produces synthetic instances modeled on its training data. To demonstrate its effectiveness, COS-HMM is tested with a Support Vector Machine (SVM) on two medical document corpora (OHSUMED and TREC Genomics), and compared with the Random Over-Sampling (ROS) and SMOTE techniques. Results suggest that applying over-sampling strategies increases the overall performance of the SVM in classifying documents. Based on empirical and statistical studies, the new method clearly outperforms the baseline method (ROS), and offers greater performance than SMOTE in the majority of tested cases.
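The "trained generator emits synthetic minority-class documents" idea can be pictured with a much simpler model: a first-order Markov chain over word bigrams, used here purely as a stand-in for the paper's HMM. All names and the generation scheme are illustrative assumptions.

```python
# Sketch: oversample a minority class by generating synthetic documents from
# a bigram chain trained on that class (a simplified stand-in for COS-HMM).
import random

def train_bigram_model(docs):
    """Collect successor lists for each word from minority-class documents."""
    model = {}
    for doc in docs:
        for a, b in zip(doc, doc[1:]):
            model.setdefault(a, []).append(b)
    return model

def generate(model, start, length, rng):
    """Random-walk the bigram chain to emit one synthetic document."""
    out = [start]
    for _ in range(length - 1):
        successors = model.get(out[-1])
        if not successors:
            break
        out.append(rng.choice(successors))
    return out

docs = [["cell", "growth", "rate"], ["cell", "growth", "factor"]]
model = train_bigram_model(docs)
synthetic = generate(model, "cell", 3, random.Random(0))
```

Every generated transition is one actually observed in the minority class, so the synthetic documents stay "content-based" rather than being random token soup.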

7.
The use of the computing-with-words paradigm for the automatic categorization of text documents is discussed. This specific problem of information retrieval (IR) is becoming more and more important, notably in view of the rapid proliferation of textual information available on the Internet. The main issues to be addressed here are document representation and classification. The use of fuzzy logic for both problems has already been studied quite deeply, though for the latter, i.e., classification, generally not in an IR context. Our approach is based mainly on Zadeh's classical calculus of linguistically quantified propositions. Moreover, we employ results related to fuzzy (linguistic) queries in IR, notably various interpretations of the weights of query terms. Some preliminary results on widely adopted text corpora are presented.
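In Zadeh's calculus, the truth of "most documents match the query" is obtained by pushing the average matching degree through a fuzzy quantifier's membership function. The sketch below shows that mechanism; the piecewise-linear breakpoints (0.3 and 0.8) for "most" are illustrative assumptions, not values from the paper.

```python
# Sketch of Zadeh's calculus of linguistically quantified propositions:
# truth("Q of the X's are A") = mu_Q(mean membership of A over X).
# The 0.3/0.8 breakpoints defining "most" are assumed for illustration.

def quantifier_most(x):
    """Piecewise-linear membership function for the fuzzy quantifier 'most'."""
    if x <= 0.3:
        return 0.0
    if x >= 0.8:
        return 1.0
    return (x - 0.3) / 0.5

def truth_of_q_proposition(memberships):
    """Truth of 'most documents match' given per-document match degrees."""
    return quantifier_most(sum(memberships) / len(memberships))
```

A document set that fully matches yields truth 1.0, one that matches not at all yields 0.0, and partial matching lands in between, which is exactly the graded behavior that crisp Boolean classification lacks.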

8.
Business Process Re-engineering (BPR) is being used to improve the efficiency of organizational processes; however, a number of obstacles have prevented its full potential from being realised. One of these obstacles is an emphasis on the business process itself to the exclusion of other important knowledge of the organization. Another is the lack of tools for identifying the causes of inefficiencies and inconsistencies in BPR. In this paper we propose a methodology for BPR that overcomes these two obstacles through the use of a formal organizational ontology together with knowledge structure and source maps. These knowledge maps are represented formally to support an inferencing mechanism that helps to automatically identify the causes of the inefficiencies and inconsistencies. We demonstrate the applicability of this methodology through a case study of a university domain.

9.
10.
This article reports on our experiments and results on the effectiveness of different feature sets, and of information fusion over combinations of them, in classifying free-text documents into a given number of categories. We use different feature sets and integrate neural network learning into the method. The feature sets are based on the “latent semantics” of a reference library — a collection of documents adequately representing the desired concepts. We found that a larger reference library is not necessarily better. Information fusion almost always gives better results than the individual constituent feature sets, with certain combinations doing better than others.

11.
A large-scale project produces a great deal of text data during construction, commonly archived as various management reports. Having the right information at the right time can help the project team understand the project status and manage the construction process more efficiently. However, text information is presented in unstructured or semi-structured formats, and extracting useful information from such a large text warehouse is a challenge. A manual process is costly and often cannot deliver the right information to the right person at the right time. This research proposes an integrated intelligent approach based on natural language processing (NLP) technology, involving three stages. First, a text classification model based on a Convolutional Neural Network (CNN) is developed to classify construction on-site reports by analyzing and extracting report text features. At the second stage, the classified construction report texts are analyzed with term frequency-inverse document frequency (TF-IDF) improved by mutual information, to identify and mine construction knowledge. At the third stage, a relation network based on the co-occurrence matrix of the knowledge is presented for visualization and better understanding of the construction on-site information. Actual construction reports are used to verify the feasibility of this approach. The study provides a new approach for handling construction on-site text data that can enhance management efficiency and practical knowledge discovery for project management.
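One plausible reading of "TF-IDF improved by mutual information" is to score terms both by TF-IDF and by a pointwise mutual-information signal between a term's presence and a report class, keeping terms that are both frequent-and-rare and class-discriminative. The sketch below shows both quantities on toy data; the add-one smoothing and the combination are assumptions, not the authors' verified formulas.

```python
# Sketch (assumed formulas): TF-IDF plus a PMI-style term/class association
# score, the two ingredients the abstract's second stage combines.
import math

def tfidf(term, doc, docs):
    """Term frequency in `doc` times inverse document frequency over `docs`."""
    tf = doc.count(term) / len(doc)
    df = sum(1 for d in docs if term in d)
    return tf * math.log(len(docs) / df)

def mutual_information(term, cls, docs, labels):
    """Pointwise MI between term presence and a class label, add-one smoothed.
    Positive means the term is over-represented in that class."""
    n = len(docs)
    n_t = sum(1 for d in docs if term in d)
    n_c = labels.count(cls)
    n_tc = sum(1 for d, l in zip(docs, labels) if term in d and l == cls)
    return math.log((n_tc * n + 1) / (n_t * n_c + 1))

docs = [["crane", "delay"], ["crane", "safety"], ["budget", "delay"]]
labels = ["site", "site", "finance"]
mi_crane = mutual_information("crane", "site", docs, labels)  # class-exclusive
mi_delay = mutual_information("delay", "site", docs, labels)  # spread across classes
```

"crane" appears only in "site" reports and gets positive MI; "delay" is spread across classes and scores at or below zero, so the MI factor suppresses it even though its TF-IDF alone would keep it.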

12.
Risk identification is a knowledge-based process that requires the time-consuming and laborious identification of project-specific risk factors. Current practices for risk identification in construction rely heavily on an expert’s subjective knowledge of the current project and of similar historical projects to determine if a risk may affect the project under study. When quantitative risk-related data are available, they are often stored across multiple sources and in different types of documents complicating data sharing and reuse. The present study introduces an ontology-based approach for construction risk identification that maps and automates the representation of project context and risk information, thereby enhancing the storage, sharing, and reuse of knowledge for the purpose of risk identification. The study also presents a novel wind farm construction project risk ontology that has been validated by a group of industry experts. The resulting ontology-based risk identification approach is able to accommodate project context in the risk identification process and, through implementation of the proposed approach, has identified risk factors that affect the construction of onshore wind farm projects.

13.
14.
Literature on supervised machine-learning (ML) approaches for classifying text-based safety reports in the construction sector has been growing. Recent studies have emphasized the need to build ML approaches that balance high classification accuracy against performance on management criteria, such as resource intensiveness. However, despite being highly accurate, the extensively studied supervised ML approaches may not perform well on management criteria, as many factors contribute to their resource intensiveness. Alternatively, the potential of semi-supervised ML approaches to achieve balanced performance has rarely been explored in the construction safety literature. The current study contributes to this scarce knowledge by demonstrating the applicability of a state-of-the-art semi-supervised learning approach: Yet Another Keyword Extractor (YAKE) integrated with Guided Latent Dirichlet Allocation (GLDA) for construction safety report classification. Construction-safety-specific knowledge is extracted as keywords through YAKE, relying on accessible literature with minimal manual intervention. Keywords from YAKE are then seeded into the GLDA model for the automatic classification of safety reports without requiring a large quantity of pre-labeled data. The YAKE-GLDA classification performance (F1 score of 0.66) is superior to that of existing unsupervised methods on the benchmark data containing injury narratives from the Occupational Safety and Health Administration (OSHA). The YAKE-GLDA approach is also applied to near-miss safety reports from a construction site, where a moderately high F1 score of 0.86 for several categories demonstrates a high degree of generality.
The current research demonstrates that, unlike existing supervised approaches, the semi-supervised YAKE-GLDA approach can consistently achieve reasonably good classification performance across various construction-specific safety datasets while remaining resource-efficient. Results from an objective comparative and sensitivity analysis contribute much-needed insights into the functioning and applicability of YAKE-GLDA. The results will help construction organizations implement and optimize an efficient ML-based knowledge-mining strategy for domains beyond safety, and across sites where the availability of a pre-labeled dataset is a significant limitation.
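The seeding idea, in drastically simplified form, is that each category is anchored by a handful of extracted keywords rather than by labeled documents. The sketch below scores a report against per-category seed keywords and picks the best-matching category; it is a keyword-matching stand-in for illustration only, not the YAKE-GLDA pipeline, and the seed lists are invented examples.

```python
# Drastically simplified stand-in for seeded classification: score a report
# against per-category seed keywords (as YAKE-extracted seeds would be used
# to guide GLDA topics). Seed lists below are hypothetical examples.

def classify_by_seeds(tokens, seed_keywords):
    """Assign the category whose seed keywords occur most often in `tokens`."""
    scores = {category: sum(tokens.count(word) for word in words)
              for category, words in seed_keywords.items()}
    return max(scores, key=scores.get)

category = classify_by_seeds(
    ["worker", "fell", "from", "ladder"],
    {"fall": ["fell", "ladder"], "struck": ["struck", "object"]})
```

The appeal the abstract points to is that the seed lists can come from accessible literature with minimal manual effort, sidestepping the large pre-labeled dataset that supervised approaches require.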

15.
Improving workers’ safety and health is one of the most critical issues in the construction industry. Research attempts have been made to better identify construction hazards on a jobsite by analyzing workers’ physical responses (e.g., stride and balance) or physiological responses (e.g., brain waves and heart rate) collected from wearable devices. Among these, electroencephalography (EEG) holds unique potential, since it reveals abnormal patterns immediately when a hazard is perceived and recognized. Unfortunately, the unproven capacity of EEG signals for multi-hazard classification is a primary barrier to ubiquitous real-time hazard identification on jobsites. This study correlates EEG signal patterns with construction hazard types and develops an EEG classifier based on experiments conducted in an immersive virtual reality (VR) environment. Hazards of different types (e.g., fall and slip/trip) were simulated in the VR environment, and EEG signals were collected from subjects who wore both wearable EEG and VR devices during the experiments. Two types of EEG features (time-domain/frequency-domain features and cognitive features) were extracted for training and testing, and a total of eighteen machine learning algorithms were used to develop the EEG classifier. Initial results showed that the LightGBM classifier achieved 70.1% accuracy on the cognitive feature set for the 7-class classification. To improve performance, the input data was relabeled, and three strategies were designed and tested; the combined approach (two-step ensemble classification) achieved 82.3% accuracy. This study thus not only demonstrates the feasibility of coupling wearable EEG, VR, and machine learning to differentiate jobsite hazards, but also provides strategies for improving multi-class classification performance. The results support ubiquitous hazard identification and thereby contribute to the safety of the construction workplace.

16.
Accident reports provide information for understanding why and how events occur, and learning from past accident reports is critical for preventing accidents and injuries in construction safety management. However, there are two issues: (1) manual analysis of such accident reports is time-consuming and labor-intensive; and (2) previous research has mainly focused on analyzing the causal factors of accidents, with little attention to the injury effects of an accident or to the influential relationship between accident cause and injury effect. To tackle these problems, a graph-based deep learning framework is proposed to identify accident-injury type and bodypart factors automatically, enabling managers to make timely and better-informed decisions to prevent accidents and injuries for on-site safety. In this framework, a graph-based deep learning approach (specifically, the Graph Convolutional Network) is developed to automatically classify accident reports labeled with accident_type and injury_type, while a traversal method is developed to identify the bodypart factors. To visualize these safety risk factors (e.g., accident_type, injury_type, and bodypart factors), co-occurrence networks are drawn to intuitively reveal the interdependencies in accident-injury and injury-bodypart types, respectively. In terms of theoretical and practical contributions, the framework proposed in this study not only represents a substantial data-driven advancement in construction accident report classification and keyword extraction, but also enables managers to understand construction safety performance (i.e., accident causes and injury effects) and formulate corresponding strategies to prevent accidents and injuries in on-site safety management.
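The co-occurrence network step can be sketched simply: count how often pairs of risk-factor labels appear in the same report, and treat the counts as edge weights. The toy labels below are invented examples; the counting scheme is a generic construction, not the paper's verified implementation.

```python
# Sketch: build edge weights for a co-occurrence network from per-report
# risk-factor labels (accident type, injury type, body part). Labels here
# are hypothetical examples.
from collections import Counter
from itertools import combinations

def cooccurrence_counts(records):
    """Count how often each pair of labels appears in the same report."""
    counts = Counter()
    for labels in records:
        for a, b in combinations(sorted(set(labels)), 2):
            counts[(a, b)] += 1
    return counts

edges = cooccurrence_counts([
    ["fall", "fracture"],
    ["fall", "fracture", "scaffold"],
    ["slip", "bruise"],
])
```

Heavily weighted pairs, such as a fall accident co-occurring with a fracture injury, are the interdependencies the drawn networks make visible to managers.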

17.

Copyright © Beijing Qinyun Technology Development Co., Ltd. (北京勤云科技发展有限公司)  京ICP备09084417号