Similar Literature
20 similar documents found.
1.
Natural languages are known for their expressive richness. Many sentences can be used to represent the same underlying meaning. Modelling only the observed surface word sequence can result in poor context coverage and generalization, for example, when using n-gram language models (LMs). This paper proposes a novel form of language model, the paraphrastic LM, that addresses these issues. A phrase-level paraphrase model, statistically learned from standard text data with no semantic annotation, is used to generate multiple paraphrase variants. LM probabilities are then estimated by maximizing their marginal probability. Multi-level language models estimated at both the word level and the phrase level are combined. An efficient weighted finite state transducer (WFST) based paraphrase generation approach is also presented. Significant error rate reductions of 0.5–0.6% absolute were obtained over the baseline n-gram LMs on two state-of-the-art recognition tasks for English conversational telephone speech and Mandarin Chinese broadcast speech using a paraphrastic multi-level LM modelling both word and phrase sequences. When it is further combined with word and phrase level feed-forward neural network LMs, significant error rate reductions of 0.9% absolute (9% relative) and 0.5% absolute (5% relative) were obtained over the baseline n-gram and neural network LMs respectively.
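A minimal sketch of the marginalization idea: the probability of a word sequence is estimated by summing, over its paraphrase variants, the variant's generation probability times the baseline LM probability. The toy paraphrase table, the greedy longest-match segmentation, and all names below are illustrative simplifications, not the paper's WFST-based implementation.

```python
# Toy phrase-level paraphrase table: phrase -> [(paraphrase, P(paraphrase | phrase))].
PARAPHRASES = {
    ("buy",): [(("buy",), 0.7), (("purchase",), 0.3)],
    ("a", "car"): [(("a", "car"), 0.8), (("an", "automobile"), 0.2)],
}

def variants(words):
    """Enumerate paraphrase variants with their generation probabilities,
    using a greedy longest-match segmentation into table phrases."""
    if not words:
        yield (), 1.0
        return
    for n in range(len(words), 0, -1):
        phrase = tuple(words[:n])
        if phrase in PARAPHRASES:
            for rest, p_rest in variants(words[n:]):
                for para, p in PARAPHRASES[phrase]:
                    yield para + rest, p * p_rest
            return
    for rest, p_rest in variants(words[1:]):   # unknown word: keep as-is
        yield (words[0],) + rest, p_rest

def paraphrastic_prob(words, lm_prob):
    """P(W) = sum over variants W' of P(W' | W) * P_LM(W')."""
    return sum(p_gen * lm_prob(v) for v, p_gen in variants(words))

# Example with a dummy LM that scores every variant equally:
print(paraphrastic_prob(["buy", "a", "car"], lambda v: 1e-4))  # ≈ 1e-4
```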

2.
In this paper, we introduce backoff hierarchical class n-gram language models to better estimate the likelihood of unseen n-gram events. This multi-level class hierarchy language modeling approach generalizes the well-known backoff n-gram language modeling technique. It uses a class hierarchy to define word contexts. Each node in the hierarchy is a class that contains all the words of its descendant nodes. The closer a node is to the root, the more general the class (and context). We investigate the effectiveness of the approach in modeling unseen events in speech recognition. Our results illustrate that the proposed technique outperforms backoff n-gram language models. We also study the effect of the vocabulary size and the depth of the class hierarchy on the performance of the approach. Results are presented on the Wall Street Journal (WSJ) corpus using two vocabulary sets: 5000 words and 20,000 words. Experiments with the 5000-word vocabulary, whose test set contains a small number of unseen events, show up to a 10% improvement in unseen-event perplexity when using the hierarchical class n-gram language models. With a vocabulary of 20,000 words, characterized by a larger number of unseen events, the perplexity of unseen events decreases by 26%, while the word error rate (WER) decreases by 12% when using the hierarchical approach. Our results suggest that the largest gains in performance are obtained when the test set contains a large number of unseen events.
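The key difference from standard backoff can be sketched as follows: when a context is unseen, the model first generalizes the context words to their parent classes in the hierarchy rather than immediately shortening the context. This is a toy sketch with maximum-likelihood estimates and no discounting (a real model needs discounting and backoff weights); the data structures and the fallback constant are illustrative assumptions.

```python
def hier_prob(word, context, counts, parent, floor=1e-6):
    """P(word | context) with hierarchical class backoff (toy sketch).

    counts: dict of n-gram counts, keyed by (context, word) and by context.
    parent: maps each word/class to its parent class; the root has no entry.
    """
    if counts.get(context, 0) > 0:
        return counts.get((context, word), 0) / counts[context]
    if context and all(c in parent for c in context):
        # Unseen context: climb one level up the class hierarchy.
        return hier_prob(word, tuple(parent[c] for c in context), counts, parent, floor)
    # No more general class available: shorten the context as standard backoff does.
    return hier_prob(word, context[1:], counts, parent, floor) if context else floor
```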

3.
In speech recognition systems, language models are used to estimate the probabilities of word sequences. In this paper, special emphasis is given to numerals, i.e., words that express numbers. One reason for this is that in a practical application a falsely recognized numeral can change important content information inside the sentence more than other types of errors can. Standard n-gram language models can sometimes assign very different probabilities to different numerals, according to their relative frequencies in the training corpus. Based on the assumption that some different numbers are more equally likely to occur than a standard n-gram language model estimates, this paper proposes several methods for sorting numerals into classes in an inflective language, and language models based on these sorting techniques. We treat these classes as basic vocabulary units for the language model. We also expose the differences between the proposed language models and well-known class-based language models. The presented approach is also transferable to other classes of words with similar properties, e.g. proper nouns. Experimental results show that significant improvements are obtained on numeral-rich domains. Although numerals represent only a small portion of words in the test set, a relative reduction in word error rate of 1.4% was achieved. Statistical significance tests were performed, which showed that these improvements are statistically significant. We also show that, depending on the amount of numerals in a target domain, the improvement in performance can grow to up to 16% relative.
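In a class-based formulation such as this, a numeral's probability factors through its class: P(w | h) = P(class(w) | h) * P(w | class(w)). The sketch below groups numerals by an illustrative digit-count criterion; the paper's actual sorting methods for an inflective language are more elaborate, and all names here are hypothetical.

```python
import re

def numeral_class(token):
    """Map numerals to illustrative classes by digit count; e.g. "1995" -> "<NUM4>".
    Non-numerals are left as singleton classes (the token itself)."""
    return f"<NUM{len(token)}>" if re.fullmatch(r"\d+", token) else token

def class_lm_prob(word, history, p_class_ngram, p_word_in_class):
    """P(word | history) = P(class | class history) * P(word | class)."""
    class_history = tuple(numeral_class(t) for t in history)
    c = numeral_class(word)
    return p_class_ngram(c, class_history) * p_word_in_class(word, c)
```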

4.
Lindenmayer systems are a class of parallel rewriting systems originally introduced to model the growth and development of filamentous organisms. Families of languages generated by deterministic Lindenmayer systems (i.e., those in which each string has a unique successor) are investigated. In particular, the use of nonterminals, homomorphisms, and the combination of these are studied for deterministic Lindenmayer systems using one-sided context (D1Ls) and two-sided context (D2Ls). Languages obtained from Lindenmayer systems by the use of nonterminals are called extensions. Typical results are: The closure under letter-to-letter homomorphism of the family of extensions of D1L languages is equal to the family of recursively enumerable languages, although the family of extensions of D1L languages does not even contain all regular languages. Let P denote the restriction that the system does not rewrite a letter as the empty word. The family of extensions of PD2L languages is equal to the family of languages accepted by deterministic linear bounded automata. The closure under nonerasing homomorphism of the family of extensions of PD1L languages does not even contain languages like {a1, a2, …, an}* − {λ}, n ≥ 2. The closure of the family of PD1L languages under homomorphisms which map a letter either to itself or to the empty word is equal to the family of recursively enumerable languages. Strict inclusion results follow from necessary conditions for a language to be in one of the considered families. By stating the results in their strongest form, the paper contains a systematic classification of the effect of nonterminals, letter-to-letter homomorphisms, nonerasing homomorphisms and homomorphisms for all the basic types of deterministic Lindenmayer systems using context.
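To make one-sided context concrete, here is a minimal sketch of a single derivation step of a D1L system, in which every letter is rewritten in parallel according to its left neighbour. Defaulting to the identity when no rule matches is a simplification for brevity; in a truly deterministic system every (context, letter) pair has exactly one rule.

```python
def d1l_step(word, rules, edge="#"):
    """One parallel derivation step of a D1L system with left context.
    rules: (left_neighbour, letter) -> replacement string; `edge` marks
    the word boundary used as context for the first letter."""
    return "".join(
        rules.get((word[i - 1] if i > 0 else edge, ch), ch)
        for i, ch in enumerate(word)
    )

# Example: 'b' doubles when preceded by 'a'.
rules = {("a", "b"): "bb"}
print(d1l_step("ab", rules))   # "abb"
print(d1l_step("abb", rules))  # "abbb"
```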

5.
Text representation is the essential task of transforming input text into features that can later be used for further Text Mining and Information Retrieval tasks. The most commonly used text representation models are Bag-of-Words (BOW) and the N-gram model. Nevertheless, these models have known issues that should be investigated, namely inaccurate semantic representation of text and the high dimensionality of word combinations. A pattern-based model named Frequent Adjacent Sequential Pattern (FASP) is introduced to represent text using a set of adjacent word sequences that are frequently used across the document collection. The purpose of this study is to discover the similarity of textual patterns between documents, which can later be converted into a set of rules to describe the main news event. FASP is based on Pattern-Growth's divide-and-conquer strategy; the main difference between FASP and the prior technique is in the pattern generation phase. This approach is tested against the BOW and N-gram text representation models using Malay and English news datasets with different term weightings in the Vector Space Model (VSM). The findings demonstrate that the FASP model has a promising performance in finding similarities between documents, with an average vector size reduction of 34% against BOW and 77% against the N-gram model on the Malay dataset. Results on the English dataset are also consistent, indicating that the FASP approach is also language independent.
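As a rough illustration of what counts as a pattern here, the brute-force sketch below collects adjacent word sequences that occur in at least min_support documents. The real FASP method generates patterns with a Pattern-Growth divide-and-conquer strategy; this naive enumeration and its parameter names are stand-ins.

```python
from collections import Counter

def frequent_adjacent_patterns(docs, min_support=2, max_len=4):
    """Collect adjacent word sequences occurring in >= min_support documents
    (naive stand-in for FASP's Pattern-Growth-based pattern generation)."""
    doc_freq = Counter()
    for doc in docs:
        words = doc.lower().split()
        seen = set()
        for n in range(2, max_len + 1):
            for i in range(len(words) - n + 1):
                seen.add(tuple(words[i:i + n]))
        doc_freq.update(seen)  # each pattern counted once per document
    return {p for p, f in doc_freq.items() if f >= min_support}

# Example: shared phrases across two news snippets.
docs = ["the prime minister announced new policy",
        "today the prime minister met reporters"]
print(frequent_adjacent_patterns(docs))
# contains ('the', 'prime'), ('prime', 'minister'), ('the', 'prime', 'minister')
```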

6.
We present the results of a study of Russian language models built with recurrent artificial neural networks for automatic continuous speech recognition systems. We construct neural network models with different numbers of hidden-layer units and linearly interpolate the neural network models with the baseline trigram language model. The resulting models were used at the N-best list rescoring stage. In our experiments on the recognition of continuous Russian speech with an extra-large vocabulary (150 thousand word forms), the relative reduction in word error rate obtained after rescoring the 50-best list with the neural network language models interpolated with the trigram model was 14%.
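A hedged sketch of this rescoring setup: each N-best hypothesis is rescored with an RNN LM linearly interpolated with the trigram LM and combined with the acoustic score. All weights, scales, and function names below are illustrative, not the paper's actual configuration.

```python
import math

def rescore_nbest(nbest, p_rnn, p_tri, lam=0.5, lm_scale=10.0):
    """Pick the best hypothesis after interpolated-LM rescoring.
    nbest: list of (acoustic_log_score, [words]);
    p_rnn / p_tri: callables giving P(word | history) under each LM."""
    def interp_logprob(words):
        lp = 0.0
        for i, w in enumerate(words):
            # Linear interpolation: P = lam * P_rnn + (1 - lam) * P_trigram.
            p = lam * p_rnn(w, words[:i]) + (1.0 - lam) * p_tri(w, words[:i])
            lp += math.log(max(p, 1e-12))
        return lp
    return max(nbest, key=lambda h: h[0] + lm_scale * interp_logprob(h[1]))
```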

7.
刘鹏远  赵铁军 《软件学报》2009,20(5):1292-1300
To address the data sparsity and knowledge acquisition problems that hamper word sense and translation disambiguation, this paper proposes a Web-based disambiguation method using n-gram statistical language models. Starting from the hypothesis that a word sense corresponds to its own n-gram language model, HowNet is first used to map the English translations of an ambiguous Chinese word to HowNet DEFs and to obtain the word set under each DEF; a search engine is then used to query the Web and estimate the occurrence probabilities of word n-grams under the different DEFs, on which the disambiguation decision is based. On the test set of the Multilingual Chinese English Lexical Sample Task in the international semantic evaluation SemEval-2007, the method achieves a Pmar of 55.9%, 12.8% higher than the best unsupervised system that participated in that task.
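A rough sketch of the decision rule: score each candidate DEF by how strongly its associated words co-occur on the Web with n-grams from the ambiguous word's context, and pick the best-scoring DEF. The count function stands in for search-engine hit counts, and the additive scoring and all names are illustrative simplifications of the paper's probability estimation.

```python
def disambiguate(context_ngrams, def_wordsets, web_count):
    """Choose the HowNet DEF whose word set co-occurs most with the context.
    def_wordsets: {def_label: [associated words]} built via HowNet;
    web_count: callable returning a hit count for a query string."""
    def score(words):
        return sum(web_count(f"{w} {ng}") for w in words for ng in context_ngrams)
    return max(def_wordsets, key=lambda d: score(def_wordsets[d]))
```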

8.
A class of monoids that can model partial reversibility, allowing simultaneously instances of two-sided reversibility, one-sided reversibility and no reversibility, is considered. Some of the basic decidability problems involving their rational subsets, syntactic congruences and characterization of recognizability are solved using purely automata-theoretic techniques, giving further insight into the structure of recognizable languages.

9.
Text classification constitutes a popular task in Web research with various applications that range from spam filtering to sentiment analysis. In this paper, we argue that its performance depends on the quality of Web documents, which varies significantly. For example, the curated content of news articles involves different challenges than the user-generated content of blog posts and Social Media messages. We experimentally verify our claim, quantifying the main factors that affect the performance of text classification. We also argue that the established bag-of-words representation models are inadequate for handling all document types, as they merely extract frequent, yet distinguishing terms from the textual content of the training set. Thus, they suffer from low robustness in the context of noisy or unseen content, unless they are enriched with contextual, application-specific information. In their place, we propose the use of n-gram graphs, a model that goes beyond the bag-of-words representation, transforming every document into a graph: its nodes correspond to character or word n-grams and the co-occurring ones are connected by weighted edges. Individual document graphs can be combined into class graphs, and graph similarities are employed to position and classify documents in the vector space. This approach offers two advantages with respect to bag models: first, classification accuracy increases due to the contextual information that is encapsulated in the edges of the n-gram graphs; second, the search space is reduced to a limited set of robust, endogenous features whose number depends on the number of classes rather than the size of the vocabulary. Our thorough experimental study over three large, real-world corpora confirms the superior performance of n-gram graphs across the main types of Web documents.
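A minimal sketch of the representation, assuming a character n-gram variant with a fixed co-occurrence window (both parameter choices are illustrative): nodes are n-grams, and edge weights count how often two n-grams co-occur within the window.

```python
from collections import defaultdict

def ngram_graph(text, n=3, window=2):
    """Build a character n-gram graph: edge weight = number of times two
    distinct n-grams co-occur within `window` positions of each other."""
    grams = [text[i:i + n] for i in range(len(text) - n + 1)]
    edges = defaultdict(int)
    for i, g in enumerate(grams):
        for j in range(i + 1, min(i + 1 + window, len(grams))):
            if g != grams[j]:
                edges[frozenset((g, grams[j]))] += 1
    return dict(edges)

# Document graphs like this one are merged into class graphs, and similarity
# to each class graph positions the document in the vector space.
print(ngram_graph("text mining"))
```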

10.
Phonetic, lexical, and grammatical studies are the three major components of dialect research, and identifying dialect words is the first step in studying dialect vocabulary. At present, corpus collection and curation for Chinese dialect vocabulary research is done mainly by experts by hand, which is time-consuming and labour-intensive. With the development of information technology, much everyday communication now takes place online, and input method editor (IME) data contain massive language resources together with geographic information, which can support the automatic discovery of dialect words. However, no existing work has studied how to use pinyin IME data for a systematic analysis of dialect words. In this paper, we therefore explore methods for automatically discovering regional dialect words from the behaviour of Chinese IME users. In particular, we derive two classes of features in IME data that characterize dialect words, and identify dialect words based on different combinations of these features. Finally, we experimentally evaluate how different combinations of the two feature classes affect the quality of dialect word identification.

11.
12.
Recently, minimum perfect hashing (MPH)-based language model (LM) lookup methods have been proposed for fast access to N-gram LM scores in lexical-tree-based LVCSR (large vocabulary continuous speech recognition) decoding. Methods of node-based LM cache and LM context pre-computing (LMCP) have also been proposed to combine with MPH for further reduction of LM lookup time. Although these methods are effective, LM lookup still takes a large share of overall decoding time when trigram LM lookahead (LMLA) is used for a lower word error rate than unigram or bigram LMLAs. Besides computation time, memory cost is also an important performance aspect of decoding systems. Most speedup methods for LM lookup obtain higher speed at the cost of increased memory demand, which makes system performance unpredictable when running on computers with smaller memory capacities. In this paper, an order-preserving LM context pre-computing (OPCP) method is proposed to achieve both fast speed and small memory cost in LM lookup. By reducing hashing operations through order-preserving access of LM scores, OPCP cuts down LM lookup time effectively. In the meantime, OPCP significantly reduces memory cost because of the reduced size of hashing keys and the need to store only the last word index of each N-gram. Experimental results are reported on two LVCSR tasks (Wall Street Journal 20K and Switchboard 33K) with three sizes of trigram LMs (small, medium, large). In comparison with the above-mentioned existing methods, OPCP reduced LM lookup time from about 30–80% of total decoding time to about 8–14%, without any increase in word error rate. Except for the small LM, the total memory cost of OPCP for LM lookup and storage was about the same as or less than that of the original N-gram LM storage, and much less than that of the compared methods. The time and memory savings in LM lookup with OPCP became more pronounced as LM size increased.
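The core idea can be sketched as follows: for a fixed history, pre-compute the scores of all continuation words sorted by word index, so that subsequent lookahead lookups become order-preserving searches on the last word index rather than repeated hashing of full n-grams. This is only a schematic illustration under assumed data layouts; all names are hypothetical and the real method's storage format differs.

```python
import bisect

def precompute_context(trigram_scores, history):
    """Pre-compute one history's continuation table (sketch).
    trigram_scores: {(w1, w2, w3): logprob} with integer word ids."""
    entries = sorted((w3, s) for (w1, w2, w3), s in trigram_scores.items()
                     if (w1, w2) == history)
    ids = [w for w, _ in entries]      # only last-word indices are kept
    scores = [s for _, s in entries]

    def lookup(word_id, backoff_score=-99.0):
        # Binary search on the last word index; no hashing needed here.
        i = bisect.bisect_left(ids, word_id)
        return scores[i] if i < len(ids) and ids[i] == word_id else backoff_score
    return lookup
```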

13.
The class of external contextual languages is strictly included in the class of linear languages. A reason for the strict inclusion is that external contextual grammars generate languages in the exhaustive way: each sentential form belongs to the language of a grammar. In this paper we study the effect of adding various squeezing mechanisms to the basic classes of exhaustive contextual grammars. We obtain in this way a characterization of linear languages and a whole landscape of sublinear families. By restricting the contexts to be one-sided (only left-sided or only right-sided) we obtain a characterization of regular languages; here the subregular landscape reduces to two families.

14.
This paper presents an agent-based model of the emergence and evolution of a language system for Boolean coordination. The model assumes the agents have cognitive capacities for invention, adoption, abstraction, repair and adaptation, a common lexicon for basic concepts, and the ability to construct complex concepts using recursive combinations of basic concepts and logical operations such as negation, conjunction or disjunction. It also supposes the agents initially have neither a lexicon for logical operations nor the ability to express logical combinations of basic concepts through language. The results of our experiments show that a language system for Boolean coordination emerges through self-organisation of the agents' linguistic interactions, as the agents adapt their preferences for vocabulary, syntactic categories and word order to those they observe being used most often by other agents. Such a language system allows the unambiguous communication of higher-order logic terms representing logical combinations of basic properties with non-trivial recursive structure, and our experiments show it can be reliably transmitted across generations. Furthermore, the conceptual and linguistic systems, and the simplification and repair operations, of the proposed agent-based model are more general than those defined in previous works: they allow the simulation of the emergence and evolution of a language system not only for the Boolean coordination of basic properties, but also for the Boolean coordination of higher-order logic terms of any Boolean type, which can represent the meaning of nouns, sentences, verbs, adjectives, adverbs, prepositions, prepositional phrases and subexpressions not traditionally analysed as forming constituents, using linguistic devices such as syntactic categories, word order and function words.

15.
Extending Zipf’s law to n-grams for large corpora
Experiments show that for a large corpus, Zipf’s law does not hold for all ranks of words: the frequencies fall below those predicted by Zipf’s law for ranks greater than about 5,000 word types in the English language and about 30,000 word types in the inflected languages Irish and Latin. It also does not hold for syllables or words in the syllable-based languages Chinese and Vietnamese. However, when single words are combined with word n-grams in one list and put in rank order, the frequency of tokens in the combined list extends Zipf’s law with a slope close to −1 on a log-log plot in all five languages. Further experiments have demonstrated the validity of this extension of Zipf’s law to n-grams of letters, phonemes or binary bits in English. It is shown theoretically that probability theory alone can predict this behavior in randomly created n-grams of binary bits.
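Zipf’s law predicts rank-frequency behaviour f(r) ≈ C/r, i.e. a slope near −1 on a log-log plot. The sketch below estimates that slope by least squares over a combined list of single words and word n-grams, as in the extension described above; the fitting method and parameter choices are illustrative, not the paper's procedure.

```python
import math
from collections import Counter

def items_with_ngrams(words, max_n=3):
    """Merge single words and word n-grams (up to max_n) into one token list."""
    return [tuple(words[i:i + n]) for n in range(1, max_n + 1)
            for i in range(len(words) - n + 1)]

def zipf_slope(tokens, max_rank=10000):
    """Least-squares slope of the log-log rank-frequency curve.
    Assumes at least two distinct ranks are present."""
    freqs = sorted(Counter(tokens).values(), reverse=True)[:max_rank]
    xs = [math.log(r + 1) for r in range(len(freqs))]
    ys = [math.log(f) for f in freqs]
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    return (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
            / sum((x - mx) ** 2 for x in xs))
```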

16.
With the popularity of model-driven methodologies and the abundance of modelling languages, a major question for a requirements engineer is: which language is suitable for modelling a system under study? We address this question from a semantic point of view for big-step modelling languages (BSMLs). BSMLs are a class of popular behavioural modelling languages in which a model can respond to an input by executing multiple transitions, possibly concurrently. We deconstruct the operational semantics of a large class of BSMLs into eight high-level, mostly orthogonal semantic aspects and their common semantic options. We analyse the characteristics of each semantic option. We use feature diagrams to present the design space of BSML semantics that arises from our deconstruction, as well as to taxonomize the syntactic features of BSMLs that exhibit semantic variations. We enumerate the dependencies between syntactic and semantic features. We also discuss the effects of certain combinations of semantic options when used together in a BSML semantics. Our goal is to empower a requirements engineer to compare and choose an appropriate BSML from the plethora of existing BSMLs, or to articulate the semantic features of a new desired BSML when such a BSML does not exist.

17.
In this paper, we describe a novel and effective approach for automatically decomposing a word into a stem and suffixes. Russian and Turkish are used as exemplars of fusional and agglutinating languages. Rather than relying on corpus counts, we use a small number of word pairs as training data, which can be particularly suitable for under-resourced languages. For fusional languages, we initially learn a tree of aligned suffix rules (TASR) from word pairs. The tree is built top-down, from general to specific rules, using suffix rule frequency and rule subsumption, and is executed bottom-up, i.e., the most specific rule that fires is chosen. The TASR is used to segment a word form into a stem and a suffix sequence. For fusional languages, learning through generation (using a TASR) is essential for proper stem extraction. Subsequently, an unsupervised segmentation algorithm, graph-based unsupervised suffix segmentation (GBUSS), is used to segment the suffix sequence. GBUSS employs a suffix graph in which node merging, guided by an information-theoretic measure, generates suffix sequences. The approach, experimentally validated on Russian, is shown to be highly effective. For agglutinating languages, only GBUSS is needed for word decomposition. Promising experimental results for Turkish are obtained.
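The first step of building such a tree can be illustrated by how a single aligned suffix rule is derived from one training word pair: strip the longest common prefix and map the remaining suffix of one form to that of the other. This toy function (a hypothetical name) only shows rule extraction, not the frequency-based tree construction or rule subsumption.

```python
def suffix_rule(word_form, base_form):
    """Derive an aligned suffix rule from a word pair by stripping the
    longest common prefix; e.g. ("running", "run") -> ("ning", "")."""
    i = 0
    while i < min(len(word_form), len(base_form)) and word_form[i] == base_form[i]:
        i += 1
    return word_form[i:], base_form[i:]

print(suffix_rule("running", "run"))  # ('ning', '')
print(suffix_rule("стола", "стол"))   # ('а', ''), a Russian genitive singular
```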

18.
While the theory of languages of words is very mature, our understanding of relations on words is still lagging behind. Yet such relations appear in many new applications, such as verification of parameterized systems, querying graph-structured data, and information extraction. Classes of well-behaved relations typically used in such applications are obtained by adapting some of the equivalent definitions of regularity of words for relations, leading to non-equivalent notions of recognizable, regular, and rational relations. The goal of this paper is to propose a systematic way of defining classes of relations on words, of which these three classes are just natural examples, and to demonstrate its advantages compared to some of the standard techniques for studying word relations. The key idea is that of a synchronization of a pair of words, which is a word over an extended alphabet. Using it, we define classes of relations via classes of regular languages over a fixed alphabet, just {1,2} for binary relations. We characterize some of the standard classes of relations on words via finiteness of parameters of synchronization languages, called shift, lag, and shiftlag. We describe these conditions in terms of the structure of cycles of graphs underlying automata, thereby showing their decidability. We show that for these classes there exist canonical synchronization languages, and every class of relations can be effectively re-synchronized using those canonical representatives. We also give sufficient conditions on synchronization languages, defined in terms of injectivity and surjectivity of their Parikh images, that guarantee closure under intersection and complement of the classes of relations they define.
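To make the notion of synchronization concrete, here is a small sketch: a word pair (u, v) is encoded as one word together with an origin word over {1,2} saying which component each letter comes from, and can be decoded back. The function name and the two-string encoding are illustrative conveniences.

```python
def desynchronize(letters, origins):
    """Recover the word pair (u, v) from a synchronization: `origins` is a
    word over {'1','2'} marking which component each letter belongs to.
    E.g. letters="abba", origins="1212" -> ("ab", "ba")."""
    u = "".join(c for c, o in zip(letters, origins) if o == "1")
    v = "".join(c for c, o in zip(letters, origins) if o == "2")
    return u, v

print(desynchronize("abba", "1212"))  # ('ab', 'ba')
```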

19.
Adverbial constructions such as v gotovom vide and na vsyakii sluchai, which are an intrinsic resource of the Russian language alongside secondary prepositions of the v vide and na sluchai type, are discussed. The availability and linguistic features of these constructions show that grammaticalization in the modern Russian language covers not only separate units but also word combinations.

20.
We report an empirical study of n-gram posterior probability confidence measures for statistical machine translation (SMT). We first describe an efficient and practical algorithm for rapidly computing n-gram posterior probabilities from large translation word lattices. These probabilities are shown to be a good predictor of whether or not an n-gram is found in human reference translations, motivating their use as a confidence measure for SMT. Comprehensive n-gram precision and word coverage measurements are presented for a variety of different language pairs, domains and conditions. We analyze the effect on reference precision of using single or multiple references, and compare the precision of posteriors computed from k-best lists to those computed over the full evidence space of the lattice. We also demonstrate improved confidence by combining multiple lattices in a multi-source translation framework.
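Over a k-best list (the lattice computation is the analogue over all paths), the posterior of an n-gram is the total normalized posterior mass of the hypotheses containing it. A minimal sketch; the softmax-style normalization of log scores and the names are illustrative assumptions, not the paper's exact algorithm.

```python
import math
from collections import defaultdict

def ngram_posteriors(kbest, n=2):
    """n-gram posteriors from a k-best list: kbest = [(log_score, [words])]."""
    m = max(score for score, _ in kbest)
    weights = [math.exp(score - m) for score, _ in kbest]  # unnormalized posteriors
    z = sum(weights)
    post = defaultdict(float)
    for w, (_, hyp) in zip(weights, kbest):
        # Count each distinct n-gram once per hypothesis.
        for g in {tuple(hyp[i:i + n]) for i in range(len(hyp) - n + 1)}:
            post[g] += w / z
    return dict(post)

kbest = [(-1.0, "the cat sat".split()), (-1.5, "a cat sat".split())]
print(ngram_posteriors(kbest)[("cat", "sat")])  # ≈ 1.0, present in both hypotheses
```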
