Similar Documents
A total of 20 similar documents were found (search time: 781 ms).
1.
Text representation is an essential task in transforming the input from text into features that can later be used for further Text Mining and Information Retrieval tasks. The commonly used text representation models are the Bag-of-Words (BOW) model and the N-gram model. Nevertheless, these models have known issues that should be investigated: inaccurate semantic representation of text, and the high dimensionality that results from combining words. A pattern-based model named Frequent Adjacent Sequential Pattern (FASP) is introduced to represent the text using a set of adjacent word sequences that are frequently used across the document collection. The purpose of this study is to discover the similarity of textual patterns between documents that can later be converted to a set of rules to describe the main news event. The FASP is based on the Pattern-Growth divide-and-conquer strategy, where the main difference between FASP and the prior technique is in the Pattern Generation phase. This approach is tested against the BOW and N-gram text representation models using Malay and English language news datasets with different term weightings in the Vector Space Model (VSM). The findings demonstrate that the FASP model has a promising performance in finding similarities between documents, with an average vector size reduction of 34% against the BOW and 77% against the N-gram model on the Malay dataset. Results using the English dataset are also consistent, indicating that the FASP approach is language independent.
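As a rough illustration of the core idea (the paper itself uses a Pattern-Growth, divide-and-conquer strategy), the following Python sketch enumerates adjacent word sequences directly and keeps those frequent across the collection; the function name and thresholds are invented for this example:

```python
from collections import defaultdict

def mine_adjacent_patterns(docs, min_support=2, max_len=4):
    """Collect adjacent word sequences (length 2..max_len) that occur
    in at least min_support documents of the collection."""
    doc_freq = defaultdict(set)  # pattern -> ids of documents containing it
    for doc_id, doc in enumerate(docs):
        words = doc.lower().split()
        for n in range(2, max_len + 1):
            for i in range(len(words) - n + 1):
                doc_freq[tuple(words[i:i + n])].add(doc_id)
    return {p: len(ids) for p, ids in doc_freq.items() if len(ids) >= min_support}

docs = ["the prime minister announced a new policy",
        "the prime minister visited the flood site"]
print(mine_adjacent_patterns(docs))
# {('the', 'prime'): 2, ('prime', 'minister'): 2, ('the', 'prime', 'minister'): 2}
```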

2.
In this paper, we investigate the relative effect of two strategies of language resource addition for Japanese morphological analysis, a joint task of word segmentation and part-of-speech tagging. The first strategy is adding entries to the dictionary and the second is adding annotated sentences to the training corpus. The experimental results showed that addition of annotated sentences to the training corpus is better than addition of entries to the dictionary. In particular, adding annotated sentences is especially efficient when we add new words with contexts of several real occurrences as partially annotated sentences, i.e. sentences in which only some words are annotated with word boundary information. Based on this finding, we performed real annotation experiments on invention disclosure texts and observed word segmentation accuracy. Finally, we investigated various language resource addition cases and introduced the notions of non-maleficence, asymmetricity, and additivity of language resources for a task. In the word segmentation case, we found that language resource addition is non-maleficent (adding new resources causes no harm in other domains) and sometimes additive (adding new resources helps other domains). We conclude that it is reasonable for us, NLP tool providers, to distribute only one general-domain model trained from all the language resources we have.
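To make "partially annotated sentence" concrete, here is a hypothetical encoding (not the paper's actual format): one label per gap between adjacent characters, where only the gaps around a newly observed word carry word-boundary information:

```python
# 'B' = word boundary, 'I' = no boundary (word-internal), None = unannotated.
sentence = "特許請求の範囲"                 # 7 characters -> 6 gaps
gaps = ['I', 'I', 'I', 'B', None, None]  # only the word 特許請求 is annotated

def annotated_ratio(gaps):
    """Fraction of character gaps that carry word-boundary information."""
    return sum(g is not None for g in gaps) / len(gaps)

print(round(annotated_ratio(gaps), 2))  # 0.67 -> a partially annotated sentence
```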

3.
Text classification constitutes a popular task in Web research with various applications that range from spam filtering to sentiment analysis. In this paper, we argue that its performance depends on the quality of Web documents, which varies significantly. For example, the curated content of news articles involves different challenges than the user-generated content of blog posts and Social Media messages. We experimentally verify our claim, quantifying the main factors that affect the performance of text classification. We also argue that the established bag-of-words representation models are inadequate for handling all document types, as they merely extract frequent, yet distinguishing terms from the textual content of the training set. Thus, they suffer from low robustness in the context of noisy or unseen content, unless they are enriched with contextual, application-specific information. In their place, we propose the use of n-gram graphs, a model that goes beyond the bag-of-words representation, transforming every document into a graph: its nodes correspond to character or word n-grams and the co-occurring ones are connected by weighted edges. Individual document graphs can be combined into class graphs, and graph similarities are employed to position and classify documents in the vector space. This approach offers two advantages with respect to bag models: first, classification accuracy increases due to the contextual information that is encapsulated in the edges of the n-gram graphs; second, the search space is reduced to a limited set of robust, endogenous features that depend on the number of classes, rather than the size of the vocabulary. Our thorough experimental study over three large, real-world corpora confirms the superior performance of n-gram graphs across the main types of Web documents.
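A minimal sketch of the graph construction and one simple graph similarity, assuming character trigrams and a small co-occurrence window (the actual model and its similarity measures are richer than this):

```python
from collections import Counter

def ngram_graph(text, n=3, window=3):
    """Nodes are character n-grams; an edge (with a frequency weight)
    connects n-grams that co-occur within `window` positions."""
    grams = [text[i:i + n] for i in range(len(text) - n + 1)]
    edges = Counter()
    for i, g in enumerate(grams):
        for j in range(i + 1, min(i + 1 + window, len(grams))):
            edges[frozenset((g, grams[j]))] += 1
    return edges

def containment(g1, g2):
    """Fraction of g1's edges also present in g2."""
    return sum(e in g2 for e in g1) / len(g1) if g1 else 0.0

a = ngram_graph("the cat sat on the mat")
b = ngram_graph("the cat sat on a hat")
print(round(containment(a, b), 2))  # higher for texts sharing local context
```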

4.
Evidence based medicine (EBM) urges the medical doctor to incorporate the latest available clinical evidence at point of care. A major stumbling block in the practice of EBM is the difficulty of keeping up to date with clinical advances. In this paper we describe a corpus designed for the development and testing of text processing tools for EBM, in particular for tasks related to the extraction and summarisation of answers, and the corresponding evidence, for a clinical query. The corpus is based on material from the Clinical Inquiries section of The Journal of Family Practice. It was gathered and annotated by a combination of automated information extraction, crowdsourcing tasks, and manual annotation. It has been used for the original summarisation task for which it was designed, as well as for other related tasks such as the appraisal of clinical evidence and the clustering of the results. The corpus is available at SourceForge (http://sourceforge.net/projects/ebmsumcorpus/).

5.
Recent years have witnessed the rapid growth of text data, and thus the increasing importance of in-depth analysis of text data for various applications. Text data are often organized in a database with documents labeled by attributes like time and location. Different documents manifest different topics. The topics of the documents may change along the attributes of the documents, and such changes have been the subject of research in the past. However, previous analysis techniques, such as topic detection and tracking, topic lifetime, and burstiness, all focus on the topic behavior of the documents in a given attribute range without contrasting it with the documents in the overall range. This paper introduces the concept of unique topics, referring to those topics that appear frequently only within a small range of documents but not in the whole range. These unique topics may reflect unique characteristics of documents in this small range that are not found outside of it. The paper presents an efficient pruning-based algorithm that, for a user-given set of keywords and a user-given attribute, finds the maximal ranges along the given attribute and their unique topics that are highly related to the given keyword set. Thorough experiments show that the algorithm is effective in various scenarios.
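A toy sketch of the contrast that defines a unique topic (the paper's pruning algorithm and exact criteria are more involved; the thresholds here are illustrative):

```python
def is_unique_topic(docs, keyword, lo, hi, min_local=0.2, max_global=0.05):
    """Flag `keyword` as a unique topic of docs[lo:hi] (e.g. a time range):
    frequent inside the range, rare over the whole collection."""
    def freq(ds):
        return sum(keyword in d for d in ds) / max(len(ds), 1)
    return freq(docs[lo:hi]) >= min_local and freq(docs) <= max_global
```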

6.
Building on literary theory and data from a field study of text in chemotherapy, this article introduces the concept of intertext and the associated concepts of corpus and intertextuality to CSCW. It shows that the ensemble of documents used and produced in practice can be said to form a corpus of written texts. On the basis of the corpus, or subsections thereof, the actors in cooperative work create intertext between relevant (complementary) texts in a particular situation, for a particular purpose. The intertext of a particular situation can be constituted by several kinds of intertextuality, including the complementary type, the intratextual type, and the mediated type. In this manner the article aims to systematically conceptualise cooperative actors' engagement with text in text-laden practices. The approach is arguably novel and beneficial to CSCW. The article also contributes a discussion of how computers can enable the activity of creating intertext. This is a key concern for cooperative work, as intertext is central to text-centric work practices such as healthcare.

7.
This paper addresses the important problem of discerning hateful content in social media. We propose a detection scheme that is an ensemble of Recurrent Neural Network (RNN) classifiers, which incorporates various features associated with user-related information, such as the users' tendency towards racism or sexism. These data are fed as input to the classifiers along with the word frequency vectors derived from the textual content. We evaluate our approach on a publicly available corpus of 16k tweets, and the results demonstrate its effectiveness in comparison to existing state-of-the-art solutions. More specifically, our scheme successfully distinguishes racist and sexist messages from normal text, and achieves higher classification quality than current state-of-the-art algorithms.
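A minimal Keras sketch of one ensemble member, combining a word-index sequence with a small user-feature vector; the layer sizes, feature names, and three-class output are illustrative assumptions, not the authors' exact architecture:

```python
from tensorflow import keras
from tensorflow.keras import layers

text_in = keras.Input(shape=(50,), name="word_ids")      # padded word indices
user_in = keras.Input(shape=(3,), name="user_features")  # e.g. tendency scores
x = layers.Embedding(input_dim=20000, output_dim=64)(text_in)
x = layers.LSTM(64)(x)                                   # the RNN component
x = layers.Concatenate()([x, user_in])
out = layers.Dense(3, activation="softmax")(x)           # racism/sexism/neither
model = keras.Model([text_in, user_in], out)
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
# An ensemble prediction would average the class probabilities of several
# such models trained with different seeds or hyper-parameters.
```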

8.
We present the RST Signalling Corpus (Das et al. in RST signalling corpus, LDC2015T10. https://catalog.ldc.upenn.edu/LDC2015T10, 2015), a corpus annotated for signals of coherence relations. The corpus is developed over the RST Discourse Treebank (Carlson et al. in RST Discourse Treebank, LDC2002T07. https://catalog.ldc.upenn.edu/LDC2002T07, 2002) which is annotated for coherence relations. In the RST Signalling Corpus, these relations are further annotated with signalling information. The corpus includes annotation not only for discourse markers which are considered to be the most typical (or sometimes the only type of) signals in discourse, but also for a wide array of other signals such as reference, lexical, semantic, syntactic, graphical and genre features as potential indicators of coherence relations. We describe the research underlying the development of the corpus and the annotation process, and provide details of the corpus. We also present the results of an inter-annotator agreement study, illustrating the validity and reproducibility of the annotation. The corpus is available through the Linguistic Data Consortium, and can be used to investigate the psycholinguistic mechanisms behind the interpretation of relations through signalling, and also to develop discourse-specific computational systems such as discourse parsing applications.

9.
Stemming is a process of reducing a derivational or inflectional word to its root or stem by stripping all its affixes. It has been used in applications such as information retrieval, machine translation, and text summarization, as a pre-processing step to increase efficiency. Currently, a few stemming algorithms have been developed for languages such as English, Arabic, Turkish, Malay and Amharic. Unfortunately, no algorithm has been developed to stem text in Hausa, a Chadic language spoken in West Africa. To address this need, we propose stemming Hausa text using affix-stripping rules and reference lookup. We stemmed Hausa text using 78 affix-stripping rules applied in 4 steps and a reference look-up consisting of 1500 Hausa root words. The over-stemming index, under-stemming index, stemmer weight, word stemmed factor, correctly stemmed words factor and average words conflation factor were calculated to determine the effect of the reference look-up on the strength and accuracy of the stemmer. It was observed that the reference look-up helped reduce both over-stemming and under-stemming errors, increased accuracy, and tends to reduce the strength of an affix-stripping stemmer. The rationale behind the approach is discussed and directions for future research are identified.
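A minimal sketch of affix stripping combined with a reference look-up; the rules and root list below are tiny hypothetical stand-ins (the paper uses 78 rules over 4 steps and 1500 roots):

```python
SUFFIXES = ["anci", "ta", "wa"]        # hypothetical Hausa suffixes
PREFIXES = ["mai", "ma"]               # hypothetical Hausa prefixes
ROOTS = {"karanta", "koya", "aiki"}    # hypothetical root-word look-up

def stem(word):
    """Strip at most one prefix and one suffix, unless the word is a known root."""
    if word in ROOTS:                  # reference look-up: known roots are
        return word                    # never stripped (less over-stemming)
    for p in PREFIXES:                 # longest prefix listed first
        if word.startswith(p) and len(word) - len(p) >= 3:
            word = word[len(p):]
            break
    for s in sorted(SUFFIXES, key=len, reverse=True):
        if word.endswith(s) and len(word) - len(s) >= 3:
            word = word[:-len(s)]
            break
    return word
```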

10.
Online event-based social services allow users to organize social events by specifying their themes and inviting friends to participate in them. While event information can be spread over the social network, it is expected that, through communication among the event hosts, the number of users interested in the event themes can be maximized. In this paper, by combining the ideas of team formation and influence maximization, we formulate a novel research problem, Influential Team Formation (ITF), to facilitate the organization of social events. Given a set L of required labels describing the event topics, a social network, and the size k of the host team, ITF is to find a k-node set S that satisfies L and maximizes the Influence-Cost Ratio (i.e., the influence spread per unit of communication cost between team members). Since ITF is proved to be NP-hard, we develop two greedy algorithms and one heuristic method to solve it. Extensive experiments conducted on Facebook and Google+ datasets exhibit the effectiveness and efficiency of the proposed methods. In addition, by employing real event participation data from Meetup, we show that ITF with the proposed solutions is able to predict the organizers of influential events.
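A greedy sketch under stated assumptions: `influence` and `comm_cost` are supplied oracles (e.g. simulated influence spread and pairwise shortest-path cost), and the label-coverage and tie-breaking policy below is a simplification of the paper's two greedy algorithms:

```python
def greedy_itf(candidates, labels_of, influence, comm_cost, L, k):
    """Grow a k-node team S covering the labels in L while greedily
    maximizing the influence-cost ratio (ICR)."""
    S = set()
    while len(S) < k:
        covered = {l for v in S for l in labels_of[v]}
        uncovered = L - covered
        best, best_ratio = None, float("-inf")
        for v in candidates - S:
            if uncovered and not (set(labels_of[v]) & uncovered):
                continue                      # prefer nodes covering new labels
            T = S | {v}
            ratio = influence(T) / max(comm_cost(T), 1e-9)
            if ratio > best_ratio:
                best, best_ratio = v, ratio
        if best is None:                      # no candidate covers a new label
            best = max(candidates - S, key=lambda v: influence(S | {v}))
        S.add(best)
    return S
```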

11.
A speech recognition system essentially extracts the textual information present in speech. In the present work, a speaker-independent isolated-word recognition system has been developed for Kannada, a south Indian language. For European languages such as English, a large amount of research has been carried out on speech recognition. In contrast, significantly less work has been reported for Indian languages such as Kannada, and no standard speech corpus is readily available. In the present study, a speech database has been developed by recording speech utterances of a regional Kannada news corpus from different speakers. The speech recognition system has been implemented using the Hidden Markov Tool Kit. Two separate pronunciation dictionaries, phone-based and syllable-based, are built in order to design and evaluate the performance of phone-level and syllable-level sub-word acoustic models. Experiments have been carried out and results analyzed by varying the number of Gaussian mixtures in each state of the monophone Hidden Markov Model (HMM). Context-dependent triphone HMM models have also been built for the same Kannada speech corpus and the recognition accuracies comparatively analyzed. Mel frequency cepstral coefficients along with their first and second derivative coefficients are used as feature vectors, computed in acoustic front-end processing. Overall word recognition accuracies of 60.2% and 74.35% were obtained for the monophone and triphone models, respectively. The study shows a good improvement in the accuracy of the isolated-word Kannada speech recognition system using triphone HMM models compared to monophone HMM models.
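The paper uses HTK; as a loose, library-based analogue (not the authors' setup), a per-word Gaussian HMM recognizer over MFCC+delta features could be sketched like this, assuming `hmmlearn` and `librosa` are available:

```python
import numpy as np
import librosa
from hmmlearn import hmm

def mfcc_with_deltas(path):
    """13 MFCCs plus first- and second-derivative coefficients (39-dim)."""
    y, sr = librosa.load(path, sr=None)
    m = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)
    return np.vstack([m, librosa.feature.delta(m),
                      librosa.feature.delta(m, order=2)]).T

def train_word_models(train_files, n_states=5):
    """train_files: word -> list of wav paths. One HMM per vocabulary word."""
    models = {}
    for word, paths in train_files.items():
        feats = [mfcc_with_deltas(p) for p in paths]
        X, lengths = np.vstack(feats), [len(f) for f in feats]
        m = hmm.GaussianHMM(n_components=n_states, covariance_type="diag",
                            n_iter=20)
        m.fit(X, lengths)
        models[word] = m
    return models

def recognize(models, path):
    """Pick the word whose HMM gives the highest log-likelihood."""
    feats = mfcc_with_deltas(path)
    return max(models, key=lambda w: models[w].score(feats))
```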

12.
Research on parsing geospatial information from text data sources has focused on the annotation and extraction of spatial semantic roles such as toponym entities and spatial relations, while neglecting the rich temporal information, thematic event information, and integrated spatio-temporal information that texts carry. By analyzing the linguistic characteristics of event descriptions in Chinese text and the spatio-temporal semantic features of events, and building on prior results on toponym entity and spatial relation annotation, this paper formulates an annotation scheme and annotation patterns for the spatio-temporal information of events in Chinese text. Using GATE (General Architecture for Text Engineering) as the annotation platform and web-page text as the data source, an annotated corpus of event spatio-temporal information was constructed. The results provide standardized training and test data for the semantic parsing of geographic information in Chinese text.

13.
Automatic annotation is an essential technique for effectively handling and organizing Web objects (e.g., Web pages), which have experienced unprecedented growth over the last few years. Automatic annotation is usually formulated as a multi-label classification problem. Unfortunately, labeled data are often time-consuming and expensive to obtain, while Web data also inhabit a much richer feature space. This calls for new semi-supervised approaches that are less demanding on labeled data to be effective in classification. In this paper, we propose a graph-based semi-supervised learning approach that leverages random walks and ℓ1 sparse reconstruction on a mixed object-label graph with both attribute and structure information for effective multi-label classification. The mixed graph contains an object-affinity subgraph, a label-correlation subgraph, and object-label edges with adaptive weight assignments indicating the assignment relationships. The object-affinity subgraph is constructed using ℓ1 sparse graph reconstruction with extracted structural meta-text, while the label-correlation subgraph captures pairwise correlations among labels via a linear combination of their co-occurrence similarity and kernel-based similarity. A random walk with adaptive weight assignment is then performed on the constructed mixed graph to infer probabilistic assignment relationships between labels and objects. Extensive experiments on real Yahoo! Web datasets demonstrate the effectiveness of our approach.
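A bare-bones sketch of the random-walk-with-restart step on a mixed graph, assuming the combined (objects + labels) weighted adjacency matrix `W` has already been built; the adaptive weighting and ℓ1 graph construction of the paper are omitted:

```python
import numpy as np

def rwr_label_scores(W, seed_idx, n_objects, restart=0.15, iters=100):
    """Random walk with restart from an unlabeled object node; the
    stationary mass on label nodes (indices >= n_objects) gives soft
    label-assignment scores for that object."""
    P = W / np.maximum(W.sum(axis=1, keepdims=True), 1e-12)  # row-stochastic
    r = np.zeros(W.shape[0])
    r[seed_idx] = 1.0
    p = r.copy()
    for _ in range(iters):
        p = (1 - restart) * P.T @ p + restart * r
    return p[n_objects:]   # scores over the label nodes
```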

14.
This paper proposes three methods of association analysis that address two challenges of Big Data: capturing relatedness among real-world events in high data volumes, and modeling similar events that are described disparately under high data variability. The proposed methods take as input a set of geotemporally-encoded text streams about violent events called "storylines". These storylines are associated for two purposes: to investigate whether an event could occur again, and to measure influence, i.e., how one event could help explain the occurrence of another. The first proposed method, Distance-based Bayesian Inference, uses spatial distance to relate similar events that are described differently, addressing the challenge of high variability. The second and third methods, Spatial Association Index and Spatio-logical Inference, measure the influence of storylines in different locations, dealing with the high-volume challenge. Extensive experiments on social unrest in Mexico and wars in the Middle East showed that these methods can achieve precision and recall as high as 80% in retrieval tasks that use both keywords and geospatial information as search criteria. In addition, the experiments demonstrated high effectiveness in uncovering real-world storylines for exploratory analysis.

15.
Value-based requirements engineering: exploring innovative e-commerce ideas
Innovative e-commerce ideas are characterised by commercial products yet unknown to the market, enabled by information technology such as the Internet and technologies on top of it. How to develop such products is hardly known. We propose an interdisciplinary approach, e3-value, to explore an innovative e-commerce idea with the aim of understanding it thoroughly and evaluating it for potential profitability. Our methodology exploits a requirements engineering way of working, but employs concepts and terminology from business science, marketing and axiology. It shows how to model business requirements and improve business–IT alignment in the sophisticated multi-actor value constellations that are common in electronic commerce. In addition to the e3-value methodology itself, we also present the action-research-based development of our methodology, using one of the longitudinal projects we carried out in the field of online news article provisioning.

16.
Multimedia document sharing and outsourcing have become part of the routine activity of many individuals and companies. Such data sharing puts at risk the privacy of individuals, whose identities need to be kept secret, when adversaries gain the ability to associate a multimedia document's content with a possible trail of information left behind by the individual. In this paper, we propose de-linkability, a privacy-preserving constraint to bound the amount of outsourced information that can be used to re-identify individuals. We provide a sanitizing MD?-algorithm to enforce de-linkability, along with a utility function to evaluate the utility of multimedia documents preserved after the sanitizing process. A set of experiments is presented to demonstrate the efficiency of our technique.

17.
Consideration was given to the construction of a one-sided asymptotic confidence interval for an unknown conditional probability of an event A given an event B. Analytical expressions for the boundaries of such an interval were obtained under various a priori restrictions on the probabilities of the events A and B: that the probability of the event B is separated from zero by a certain value, and that there exists an upper a priori bound on the probability of the event A. The precision of the interval estimates obtained was compared.
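For orientation only (the paper derives refined analytical bounds under the a priori restrictions; this is just the textbook baseline), a one-sided lower bound for P(A|B) estimated as n_AB / n_B via the Wald normal approximation:

```python
from math import sqrt

def lower_conf_bound(n_ab, n_b, z=1.645):
    """One-sided lower asymptotic bound for P(A|B) at the 95% level,
    using the plain normal (Wald) approximation."""
    p = n_ab / n_b
    return max(0.0, p - z * sqrt(p * (1.0 - p) / n_b))

print(round(lower_conf_bound(30, 100), 3))  # 0.225
```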

18.
We present a new method for clausal theorem proving, named SGGS after semantically-guided goal-sensitive reasoning. SGGS generalizes to first-order logic the conflict-driven clause learning (CDCL) procedure for propositional satisfiability. Starting from an initial interpretation used for semantic guidance, SGGS employs a sequence of constrained clauses to represent a candidate model, instance generation to extend it, and resolution and other inferences to explain and solve conflicts, amending the model. We prove that SGGS is refutationally complete and model complete in the limit, regardless of the initial interpretation. SGGS is also goal sensitive, if the initial interpretation is properly chosen, and proof confluent, because it repairs the current model without undoing steps by backtracking. Thus, SGGS is a complete first-order method that is simultaneously model-based à la CDCL, semantically guided, goal sensitive, and proof confluent.

19.
Assume that a tuple of binary strings ā = ⟨a₁, ..., aₙ⟩ has negligible mutual information with another string b. Does this mean that properties of the Kolmogorov complexity of ā do not change significantly if we relativize them to b? This question becomes very nontrivial when we try to formalize it. In this paper we investigate this problem for a special class of properties (properties that can be expressed by an ?-formula). In particular, we show that a random (conditional on ā) oracle b does not help to extract common information from the strings aᵢ.
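In standard algorithmic information theory notation, the premise can be restated as follows (this is the usual definition of algorithmic mutual information, not a result of the paper):

```latex
% Algorithmic mutual information between \bar{a} and b, up to O(log) terms:
I(\bar{a} : b) \;=\; K(\bar{a}) - K(\bar{a} \mid b)
            \;=\; K(\bar{a}) + K(b) - K(\bar{a}, b) + O(\log N)
% where K(\cdot) is prefix Kolmogorov complexity and N bounds the string
% lengths. "Negligible mutual information" means I(\bar{a} : b) \approx 0;
% the question is whether quantities such as K(a_i \mid b) then behave
% like K(a_i).
```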
