20 similar documents found; search took 15 ms
1.
The proliferation of the Internet has not only led to the generation of huge volumes of unstructured information in the form of web documents, but a large amount of text is also generated in the form of emails, blogs, feedback, etc. The data generated from online communication act as a potential gold mine for discovering knowledge, particularly for market researchers. Text
analytics has matured and is being successfully employed to mine important information from unstructured text documents. The
chief bottleneck for designing text mining systems for handling blogs arises from the fact that online communication text data
are often noisy. These texts are informally written. They suffer from spelling mistakes, grammatical errors, improper punctuation
and irrational capitalization. This paper focuses on opinion extraction from noisy text data. It is aimed at extracting and
consolidating opinions of customers from blogs and feedback, at multiple levels of granularity. We have proposed a framework
in which these texts are first cleaned using domain knowledge and then subjected to mining. Ours is a semi-automated approach,
in which the system aids in the process of knowledge assimilation for knowledge-base building and also performs the analytics.
Domain experts ratify the knowledge base and also provide training samples for the system to automatically gather more instances
for ratification. The system identifies opinion expressions as phrases containing opinion words, opinionated features and
also opinion modifiers. These expressions are categorized as positive or negative with membership values varying from zero
to one. Opinion expressions are identified and categorized using localized linguistic techniques. Opinions can be aggregated
at any desired level of specificity i.e. feature level or product level, user level or site level, etc. We have developed
a system based on this approach, which provides the user with a platform to analyze opinion expressions crawled from a set
of pre-defined blogs.
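The phrase-level scoring this abstract describes can be sketched roughly as follows; the lexicon entries, modifier weights, and example phrases are invented for illustration (the actual system builds its knowledge base with domain-expert ratification):

```python
# Lexicon-based opinion scoring; the sign gives the class, and |value|
# in [0, 1] serves as the membership value. All entries are hypothetical.
OPINION_WORDS = {"great": 0.9, "good": 0.7, "poor": -0.6, "terrible": -0.9}
MODIFIERS = {"very": 1.3, "slightly": 0.5, "not": -1.0}

def score_phrase(tokens):
    """Score one opinion expression; modifiers scale the next opinion word."""
    score, scale = 0.0, 1.0
    for tok in tokens:
        tok = tok.lower()
        if tok in MODIFIERS:
            scale *= MODIFIERS[tok]
        elif tok in OPINION_WORDS:
            score = max(-1.0, min(1.0, OPINION_WORDS[tok] * scale))
            scale = 1.0
    return score

print(round(score_phrase("the battery life is very good".split()), 2))  # 0.91
```

A real divider would of course need negation scope handling and feature association; this only shows the modifier-scaled membership idea.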
2.
Recently, with the development of the Internet and the Web, different types of social media such as web blogs have become an immense source of text data. By processing these data, it is possible to discover practical information about different topics and individuals' opinions, and to gain a thorough understanding of society. Models that can automatically extract subjective information from documents would therefore be efficient and helpful. Topic modeling and sentiment analysis are active topics in natural language processing and text mining. In this paper a new structure for joint sentiment-topic modeling based on a Restricted Boltzmann Machine (RBM), a type of neural network, is proposed. By modifying the structure of the RBM and appending to it a layer analogous to the sentiment of the text data, we propose a generative structure for joint sentiment-topic modeling based on neural networks. The proposed method is supervised and trained with the Contrastive Divergence algorithm. The newly attached layer has a multinomial probability distribution and can be used for sentiment classification of text data or any other supervised application. The proposed model is compared with existing models in experiments including evaluation as a generative model, sentiment classification, and information retrieval, and the corresponding results demonstrate the efficiency of the method.
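As a point of reference for the training algorithm mentioned, a minimal CD-1 update for a plain binary RBM on toy bag-of-words vectors might look like this; it omits the paper's added sentiment layer, and all sizes and data are made up:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_step(v0, W, b, c, lr=0.1):
    """One CD-1 gradient step; v0 is a batch of visible vectors (n, n_vis)."""
    ph0 = sigmoid(v0 @ W + c)                       # P(h=1 | v0)
    h0 = (rng.random(ph0.shape) < ph0).astype(float)
    pv1 = sigmoid(h0 @ W.T + b)                     # one-step reconstruction
    ph1 = sigmoid(pv1 @ W + c)
    W += lr * (v0.T @ ph0 - pv1.T @ ph1) / len(v0)  # positive - negative phase
    b += lr * (v0 - pv1).mean(axis=0)
    c += lr * (ph0 - ph1).mean(axis=0)
    return ((v0 - pv1) ** 2).mean()                 # reconstruction error

n_vis, n_hid = 6, 3
W = rng.normal(0, 0.01, (n_vis, n_hid))
b, c = np.zeros(n_vis), np.zeros(n_hid)
data = (rng.random((20, n_vis)) < 0.3).astype(float)
errors = [cd1_step(data, W, b, c) for _ in range(200)]
# reconstruction error typically falls as W, b, c are updated in place
```

The paper's model would additionally condition on (and reconstruct) a multinomial sentiment layer during the same Gibbs steps.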
3.
The low accuracy rates of text-shape dividers for digital ink diagrams are hindering their use in real-world applications. While recognition of handwriting is well advanced and many recognition approaches have been proposed for hand-drawn sketches, less attention has been paid to the division of text and drawing ink. Feature-based recognition is a common approach for text-shape division. However, the choice of features and algorithms is critical to the success of the recognition. We propose the use of data mining techniques to build more accurate text-shape dividers. A comparative study is used to systematically identify the algorithms best suited to the specific problem. We have generated dividers using data mining with diagrams from three domains and a comprehensive ink feature library. An extensive evaluation on diagrams from six different domains has shown that our resulting dividers, using LADTree and LogitBoost, are significantly more accurate than three existing dividers.
4.
This study analyses the online questions and chat messages automatically recorded by a live video streaming (LVS) system using data mining and text mining techniques. We applied data mining and text mining techniques to analyze two different datasets and then conducted an in-depth correlation analysis for the two educational courses with the most online questions and chat messages respectively. The study found discrepancies as well as similarities in the students' patterns and themes of participation between online questions (student-instructor interaction) and online chat messages (student-student or peer interaction). The results also identify disciplinary differences in students' online participation. A correlation is found between the number of online questions students asked and their final grades. The data suggest that combining data mining and text mining techniques on a large amount of online learning data can yield considerable insights and reveal valuable patterns in students' learning behaviors. Limitations of data and text mining are also revealed and discussed in the paper.
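The reported question-count/grade relationship is presumably a standard Pearson correlation; a sketch with entirely made-up numbers:

```python
import numpy as np

# Hypothetical per-student data: questions asked online vs. final grade.
questions = np.array([2, 5, 1, 8, 4, 7, 0, 6])
grades = np.array([60, 75, 58, 90, 72, 85, 50, 80])

# Pearson's r between the two variables.
r = np.corrcoef(questions, grades)[0, 1]
print(r)
```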
6.
Multi-label text classification is an increasingly important field as large amounts of text data are available and extracting relevant information is important in many application contexts. Probabilistic generative models are the basis of a number of popular text mining methods such as Naive Bayes or Latent Dirichlet Allocation. However, Bayesian models for multi-label text classification are often overly complicated in order to account for label dependencies and skewed label frequencies while at the same time preventing overfitting. To solve this problem we employ the same technique that contributed to the success of deep learning in recent years: greedy layer-wise training. Applying this technique in the supervised setting prevents overfitting and leads to better classification accuracy. The intuition behind this approach is to learn the labels first and subsequently add a more abstract layer to represent dependencies among the labels. This allows using a relatively simple hierarchical topic model which can easily be adapted to the online setting. We show that our method successfully models dependencies online for large-scale multi-label datasets with many labels and improves over a baseline that does not model dependencies. The same strategy, layer-wise greedy training, also makes the batch variant competitive with existing, more complex multi-label topic models.
7.
Textual databases are useful sources of information and knowledge, and if these are well utilised then issues related to future project management and product or service quality improvement may be resolved. A large part of corporate information, approximately 80%, is available in textual data formats. Text classification techniques are well known for managing online sources of digital documents. The identification of key issues discussed within textual data and their classification into two different classes could help decision makers or knowledge workers to manage their future activities better. This research is relevant for most text-based documents and is demonstrated on Post Project Reviews (PPRs), which are a valuable source of information and knowledge. The application of textual data mining techniques for discovering useful knowledge and classifying textual data into different classes is a relatively new area of research. The research work presented in this paper is focused on the use of hybrid applications of text mining or textual data mining techniques to classify textual data into two different classes. The research applies clustering techniques at the first stage and Apriori Association Rule Mining at the second stage. Apriori Association Rule Mining is applied to generate Multiple Key Term Phrasal Knowledge Sequences (MKTPKS), which are later used for classification. Additionally, studies were made to improve the classification accuracies of the classifiers, i.e. C4.5, K-NN, Naïve Bayes and Support Vector Machines (SVMs). The classification accuracies were measured and the results compared with those of a single-term based classification model. The methodology proposed could be used to analyse any free-formatted textual data; in the current research it has been demonstrated on an industrial dataset consisting of Post Project Reviews (PPRs) collected from the construction industry.
The data or information available in these reviews is codified in multiple different formats, but in the current research scenario only free-formatted text documents are examined. Experiments showed that the performance of the classifiers improved through adopting the proposed methodology.
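The Apriori stage can be illustrated with a minimal frequent-itemset miner; the term sets below are invented stand-ins for key terms extracted from PPR documents:

```python
from itertools import combinations

def apriori(transactions, min_support=2):
    """Return {frozenset_of_terms: support_count} for all frequent itemsets."""
    items = {i for t in transactions for i in t}
    freq = {}
    k_sets = [frozenset([i]) for i in items]
    k = 1
    while k_sets:
        # count support of each candidate k-itemset
        counts = {s: sum(s <= t for t in transactions) for s in k_sets}
        frequent = {s: c for s, c in counts.items() if c >= min_support}
        freq.update(frequent)
        # candidate generation: join frequent k-sets into (k+1)-sets
        k_sets = list({a | b for a, b in combinations(frequent, 2)
                       if len(a | b) == k + 1})
        k += 1
    return freq

docs = [{"cost", "delay", "risk"}, {"cost", "delay"},
        {"cost", "risk"}, {"delay", "risk"}]
freq = apriori(docs, min_support=2)
print(freq[frozenset({"cost", "delay"})])  # 2
```

Frequent term sets like these would then be ordered into the MKTPKS used as classification features; that sequencing step is not shown.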
8.
This paper presents a topical text segmentation method based on intended-boundary detection and compares it to c99, a well-known default-boundary detection method. We compared the two methods by running them on two different corpora of French texts, and the results are evaluated in two different ways: one using a modified classic measure, the FScore, and the other based on a manual evaluation on the Internet. Our results showed that algorithms that score closely when automatically evaluated can be quite far apart when manually evaluated.
9.
In this paper we describe a novel framework for the discovery of the topical content of a data corpus, and the tracking of its complex structural changes across the temporal dimension. In contrast to previous work, our model neither imposes a prior on the rate at which documents are added to the corpus nor adopts the Markovian assumption, which overly restricts the type of changes the model can capture. Our key technical contribution is a framework based on (i) discretization of time into epochs, (ii) epoch-wise topic discovery using a hierarchical Dirichlet process-based model, and (iii) a temporal similarity graph which allows for the modelling of complex topic changes: emergence and disappearance, evolution, splitting, and merging. The power of the proposed framework is demonstrated on two medical literature corpora concerned with autism spectrum disorder (ASD) and the metabolic syndrome (MetS), both increasingly important research subjects with significant social and healthcare consequences. In addition to the collected ASD and metabolic syndrome literature corpora, which we have made freely available, our contribution also includes an extensive empirical analysis of the proposed framework. We describe a detailed and careful examination of the effects that our algorithm's free parameters have on its output and discuss the significance of the findings both in the context of the practical application of our algorithm and in the context of the existing body of work on temporal topic analysis. Our quantitative analysis is followed by several qualitative case studies highly relevant to current research on ASD and MetS, in which our algorithm is shown to capture well the actual developments in these fields.
10.
Text mining techniques include categorization of text, summarization, topic detection, concept extraction, search and retrieval, document clustering, etc. Each of these techniques can be used to find non-trivial information from a collection of documents. Text mining can also be employed to detect a document's main topic or theme, which is useful in creating a taxonomy from the document collection. Areas of application for text mining include publishing, media, telecommunications, marketing, research, healthcare, medicine, etc. Text mining has also been applied in many applications on the World Wide Web for developing recommendation systems. We propose here a set of criteria to evaluate the effectiveness of text mining techniques, in an attempt to facilitate the selection of an appropriate technique.
12.
Hepatocellular carcinoma (HCC) is the third leading cause of cancer-related mortality worldwide. New insights into the pathogenesis of this lethal disease are urgently needed. Chromosomal copy number alterations (CNAs) can lead to activation of oncogenes and inactivation of tumor suppressors in human cancers. Thus, identification of cancer-specific CNAs will not only provide new insight into understanding the molecular basis of tumorigenesis but also facilitate the identification of HCC biomarkers using CNAs.
13.
Formative assessment and summative assessment are two widely accepted approaches to assessment. While summative assessment is a typically formal assessment used at the end of a lesson or course, formative assessment is an ongoing process of monitoring learners' progress in knowledge construction. Although empirical evidence has acknowledged that formative assessment is indeed superior to summative assessment, current e-learning assessment systems seldom provide plausible solutions for conducting formative assessment. The major bottleneck in putting formative assessment into practice lies in its labor-intensive and time-consuming nature, which makes it hardly feasible for achievement evaluation, especially given the usually large number of learners in an e-learning environment. In this regard, this study developed EduMiner to relieve the burdens imposed on instructors and learners by capitalizing on a series of text mining techniques. An empirical study was conducted to examine effectiveness and to explore outcomes of the features that EduMiner supports. In this study 56 participants enrolled in a "Human Resource Management" course were randomly divided into either experimental or control groups. Results of this study indicated that the algorithms introduced here serve as a feasible approach for conducting formative assessment in an e-learning environment. In addition, learners in the experimental groups were highly motivated to phrase their contributions at higher-order levels of cognition. Timely feedback through visualized representations is therefore beneficial in helping online learners express more in-depth ideas in discourse.
14.
This paper addresses two types of classification of noisy, unstructured text such as newsgroup messages: (1) spotting messages containing topics of interest, and (2) automatic conceptual organization of messages without prior knowledge of topics of interest. In addition to applying our hidden Markov model methodology to spotting topics of interest in newsgroup messages, we present a robust methodology for rejecting messages which are off-topic. We describe a novel approach for automatically organizing a large, unstructured collection of messages. The approach applies an unsupervised topic clustering procedure to generate a hierarchical tree of topics.
15.
Evidence based medicine (EBM) urges the medical doctor to incorporate the latest available clinical evidence at the point of care. A major stumbling block in the practice of EBM is the difficulty of keeping up to date with clinical advances. In this paper we describe a corpus designed for the development and testing of text processing tools for EBM, in particular for tasks related to the extraction and summarisation of answers, and the corresponding evidence, for a clinical query. The corpus is based on material from the Clinical Inquiries section of The Journal of Family Practice. It was gathered and annotated through a combination of automated information extraction, crowdsourcing tasks, and manual annotation. It has been used for the original summarisation task for which it was designed, as well as for other related tasks such as the appraisal of clinical evidence and the clustering of results. The corpus is available at SourceForge ( http://sourceforge.net/projects/ebmsumcorpus/).
16.
Self-organizing maps (SOM) have been applied to numerous data clustering and visualization tasks and have received much attention for their success. One major shortcoming of the classical SOM learning algorithm is the necessity of a predefined map topology. Furthermore, hierarchical relationships among data are also difficult to discover. Several approaches have been devised to overcome these deficiencies. In this work, we propose a novel SOM learning algorithm which incorporates several text mining techniques to expand the map both laterally and hierarchically. When training on a set of text documents, the proposed algorithm first clusters them using the classical SOM algorithm. We then identify the topics of each cluster. These topics are then used to evaluate the criteria for expanding the map. The major characteristic of the proposed approach is that it combines the learning process with a text mining process, making it suitable for automatic organization of text documents. We applied the algorithm to the Reuters-21578 dataset in text clustering and categorization tasks. Our method outperforms two comparison models in hierarchy quality according to users' evaluation. It also achieves better F1-scores than two other models in the text categorization task.
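For context, the classical SOM training step that the proposed algorithm starts from can be sketched as follows (fixed 4x4 grid over random toy vectors; the paper's contribution, lateral and hierarchical map expansion, is not shown):

```python
import numpy as np

rng = np.random.default_rng(1)
grid_h, grid_w, dim = 4, 4, 8          # 4x4 map over 8-dim term vectors
weights = rng.random((grid_h * grid_w, dim))
coords = np.array([(r, c) for r in range(grid_h) for c in range(grid_w)])

def train_step(x, weights, lr=0.5, sigma=1.0):
    """Move the best-matching unit and its grid neighbours toward x."""
    bmu = np.argmin(((weights - x) ** 2).sum(axis=1))
    d2 = ((coords - coords[bmu]) ** 2).sum(axis=1)    # squared grid distance
    h = np.exp(-d2 / (2 * sigma ** 2))[:, None]       # neighbourhood kernel
    weights += lr * h * (x - weights)                 # in-place update
    return bmu

data = rng.random((100, dim))
for epoch in range(10):
    for x in data:
        train_step(x, weights, lr=0.5 * (1 - epoch / 10))  # decaying rate
```

In the paper's setting, the input vectors would be document term vectors, and cluster topics extracted after this phase drive map growth.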
17.
Most data mining approaches assume that the data can be provided from a single source. If data is produced at many physically distributed locations, as with Wal-Mart, these methods require a data center which gathers data from the distributed locations. Sometimes, transmitting large amounts of data to a data center is expensive or even impractical. Therefore, distributed and parallel data mining algorithms have been developed to solve this problem. In this paper, we survey the state-of-the-art algorithms and applications in distributed data mining and discuss future research opportunities.
18.
Probabilistic topic models are widely used in different contexts to uncover the hidden structure in large text corpora. One of the main (and perhaps strongest) assumptions of these models is that the generative process follows a bag-of-words assumption, i.e. each token is independent of the previous one. We extend the popular Latent Dirichlet Allocation model by exploiting three different conditional Markovian assumptions: (i) the token generation depends on the current topic and on the previous token; (ii) the topic associated with each observation depends on the topic associated with the previous one; (iii) the token generation depends on the current and previous topics. For each of these modeling assumptions we present a Gibbs sampling procedure for parameter estimation. Experimental evaluation on real-world data shows the performance advantages, in terms of recall and precision, of the sequence-modeling approaches.
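The baseline the three variants extend, vanilla LDA with collapsed Gibbs sampling, can be sketched compactly on a toy corpus (vocabulary indices, topic count, and hyperparameters are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
docs = [[0, 1, 2, 1], [0, 0, 1, 3], [4, 5, 6, 5], [4, 6, 6, 5]]
V, K, alpha, beta = 7, 2, 0.1, 0.01

# random initial topic assignments and count tables
z = [[rng.integers(K) for _ in d] for d in docs]
ndk = np.zeros((len(docs), K)); nkw = np.zeros((K, V)); nk = np.zeros(K)
for d, doc in enumerate(docs):
    for i, w in enumerate(doc):
        ndk[d, z[d][i]] += 1; nkw[z[d][i], w] += 1; nk[z[d][i]] += 1

for _ in range(200):                                 # Gibbs sweeps
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            k = z[d][i]                              # remove current token
            ndk[d, k] -= 1; nkw[k, w] -= 1; nk[k] -= 1
            # collapsed conditional P(z = k | everything else)
            p = (ndk[d] + alpha) * (nkw[:, w] + beta) / (nk + V * beta)
            k = rng.choice(K, p=p / p.sum())
            z[d][i] = k                              # re-add under new topic
            ndk[d, k] += 1; nkw[k, w] += 1; nk[k] += 1

phi = (nkw + beta) / (nk[:, None] + V * beta)        # topic-word estimates
```

Each of the paper's Markovian variants modifies the conditional `p` to also condition on the previous token or previous topic.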
19.
Longitudinal data refer to the situation where repeated observations are available for each sampled object. Clustered data,
where observations are nested in a hierarchical structure within objects (without time necessarily being involved) represent
a similar type of situation. Methodologies that take this structure into account allow for the possibilities of systematic
differences between objects that are not related to attributes and autocorrelation within objects across time periods. A standard
methodology in the statistics literature for this type of data is the mixed effects model, where these differences between
objects are represented by so-called “random effects” that are estimated from the data (population-level relationships are
termed “fixed effects,” together resulting in a mixed effects model). This paper presents a methodology that combines the
structure of mixed effects models for longitudinal and clustered data with the flexibility of tree-based estimation methods.
We apply the resulting estimation method, called the RE-EM tree, to pricing in online transactions, showing that the RE-EM
tree is less sensitive to parametric assumptions and provides improved predictive power compared to linear models with random
effects and regression trees without random effects. We also apply it to a smaller data set examining accident fatalities,
and show that the RE-EM tree strongly outperforms a tree without random effects while performing comparably to a linear model
with random effects. We also perform extensive simulation experiments to show that the estimator improves predictive performance
relative to regression trees without random effects and is comparable or superior to using linear models with random effects
in more general situations.
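The alternating estimation idea behind the RE-EM tree can be caricatured with a one-split stump standing in for the regression tree and shrunken group means standing in for the random-intercept estimates (all data are simulated; this is not the authors' implementation):

```python
import numpy as np

rng = np.random.default_rng(0)
groups = np.repeat(np.arange(10), 20)        # 10 objects, 20 obs each
x = rng.random(200)
b_true = rng.normal(0, 2.0, 10)              # true random intercepts
y = np.where(x > 0.5, 3.0, 0.0) + b_true[groups] + rng.normal(0, 0.3, 200)

def fit_stump(x, y):
    """Best single-threshold piecewise-constant fit (stand-in for a tree)."""
    best = None
    for t in np.quantile(x, np.linspace(0.1, 0.9, 17)):
        left, right = y[x <= t].mean(), y[x > t].mean()
        sse = ((y - np.where(x <= t, left, right)) ** 2).sum()
        if best is None or sse < best[0]:
            best = (sse, t, left, right)
    return best[1:]

b = np.zeros(10)
for _ in range(10):                          # EM-style alternation
    t, left, right = fit_stump(x, y - b[groups])   # fixed-effects fit
    fixed = np.where(x <= t, left, right)
    resid = y - fixed
    for g in range(10):                      # shrunken per-object intercepts
        r = resid[groups == g]
        b[g] = r.sum() / (len(r) + 1.0)
```

After a few alternations the stump recovers the true split and `b` tracks the simulated object effects, which is the mechanism that makes the RE-EM tree robust to between-object differences.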
20.
Semantic entities carry the most important semantics of text data. Therefore, the identification of semantic entities and the integration of their relationships are very important for applications requiring semantics of text data. However, current strategies still face many problems, such as semantic entity identification, new word identification and relationship integration among semantic entities. To address these problems, a two-phase framework for semantic entity identification with relationship integration in large scale text data is proposed in this paper. In the first phase, semantic entity identification, we propose a novel strategy to extract unknown text semantic entities by integrating statistical features, Decision Tree (DT), and Support Vector Machine (SVM) algorithms. Compared with traditional approaches, our strategy is more effective in detecting semantic entities and more sensitive to new entities that have just appeared in fresh data. After extracting the semantic entities, the second phase of our framework is the integration of Semantic Entity Relationships (SER), which can help to cluster the semantic entities. A novel classification method using features such as similarity measures and co-occurrence probabilities is applied to tackle the clustering problem and discover the relationships among semantic entities. Comprehensive experimental results have shown that our framework can beat state-of-the-art strategies in semantic entity identification and discovers over 80% of relationship pairs among related semantic entities in large scale text data.
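The two feature families named for the SER classifier, similarity measures and co-occurrence probabilities, might be computed as below; the entity names, context vectors, and counts are all invented for illustration:

```python
import numpy as np

# entity -> counts of context terms observed around it (hypothetical)
contexts = {
    "entity_a": np.array([3, 1, 0, 4]),
    "entity_b": np.array([2, 1, 1, 3]),
    "entity_c": np.array([0, 0, 4, 1]),
}
# how often two entities appear in the same sentence (hypothetical)
cooc = {("entity_a", "entity_b"): 12, ("entity_a", "entity_c"): 1}
n_sentences = 100

def cosine(u, v):
    """Cosine similarity of two context vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def features(e1, e2):
    """Feature pair fed to the relationship classifier."""
    sim = cosine(contexts[e1], contexts[e2])
    p_co = cooc.get((e1, e2), 0) / n_sentences
    return sim, p_co

print(features("entity_a", "entity_b"))
```

A trained classifier would then decide from such feature pairs whether two entities are related, which drives the clustering step.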