Similar Documents
20 similar documents found (search time: 414 ms)
1.
The scarcity of Chinese-Vietnamese parallel corpora largely limits the effectiveness of Chinese-Vietnamese machine translation. Data augmentation is an effective way to improve it, and lexical-substitution augmentation based on bilingual dictionaries is a popular current method. However, since Chinese-Vietnamese is a low-resource language pair, bilingual dictionaries are hard to obtain, whereas synonyms of low-frequency words can be acquired fairly easily from monolingual word embeddings. We therefore propose a data augmentation method based on synonym replacement of low-frequency words. Starting from a small parallel corpus, the method first learns monolingual word embeddings to obtain synonym lists for low-frequency words on one side of the language pair; it then replaces low-frequency words with their synonyms and filters the resulting sentences with a language model; finally, the filtered sentences are paired with the corresponding sentences on the other side to form an expanded parallel corpus. Comparative Chinese-Vietnamese translation experiments show that the proposed method works well, improving BLEU by 1.8 and 1.1 points over the baseline and back-translation methods, respectively.
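The replacement step of this pipeline can be sketched with toy embeddings; the vectors, vocabulary, and frequency threshold below are hypothetical stand-ins for what would be learned from a real monolingual corpus, and the language-model filtering and sentence-matching steps are omitted:

```python
import math
from collections import Counter

# Toy monolingual word vectors (hypothetical values for illustration).
VECS = {
    "small": [0.90, 0.10], "tiny":   [0.88, 0.12],
    "cat":   [0.10, 0.90], "feline": [0.15, 0.88],
}

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

def nearest_synonym(word):
    """Closest other word in the embedding space."""
    return max((w for w in VECS if w != word),
               key=lambda w: cosine(VECS[word], VECS[w]))

def augment(tokens, freq, thresh=2):
    """Replace low-frequency tokens with their nearest synonym."""
    return [nearest_synonym(t) if freq[t] < thresh and t in VECS else t
            for t in tokens]

freq = Counter({"small": 10, "cat": 1})
print(augment(["small", "cat"], freq))  # "cat" is low-frequency → replaced
```

In the paper's setting the synonym lists would come from embeddings trained on the low-resource side, and each augmented sentence would still be paired with the untouched sentence on the other side.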

2.
We propose a Deep Learning approach to the visual question answering task, where machines answer questions about real-world images. By combining latest advances in image representation and natural language processing, we propose Ask Your Neurons, a scalable, jointly trained, end-to-end formulation to this problem. In contrast to previous efforts, we are facing a multi-modal problem where the language output (answer) is conditioned on visual and natural language inputs (image and question). We evaluate our approaches on the DAQUAR as well as the VQA dataset, where we also report various baselines, including an analysis of how much information is contained in the language part alone. To study human consensus, we propose two novel metrics and collect additional answers which extend the original DAQUAR dataset to DAQUAR-Consensus. Finally, we evaluate a rich set of design choices for how to encode, combine, and decode information in our proposed Deep Learning formulation.

3.
Variational methods are effective in machine translation, but their performance depends heavily on data scale. In low-resource settings, parallel corpora are scarce and cannot satisfy the data requirements of variational methods, so variational translation models perform poorly. To address this problem, this paper proposes a semi-supervised neural machine translation method based on the variational information bottleneck. The approach works as follows: first, on a small parallel corpus, a base translation model is trained with a cross-layer attention mechanism that fully exploits the feature information of every layer of the network; next, the base model is used with back-translation to generate a large, noisy pseudo-parallel corpus from monolingual data, and the two corpora are merged into a combined corpus large enough to meet the variational method's data requirements; finally, to reduce the noise in the combined corpus, the variational information bottleneck inserts an intermediate representation between source and target, which is trained to pass important information and block unimportant information, thereby filtering out noise. Experimental results on several datasets show that the proposed method significantly improves translation quality and is a semi-supervised neural machine translation method well suited to low-resource scenarios.
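The corpus-combination step, before the information-bottleneck denoising, can be sketched as follows; the dictionary-based `back_translate` is a hypothetical stand-in for the trained base model, and the toy word pairs are illustrative:

```python
# Hypothetical target→source dictionary standing in for the trained
# base (back-translation) model.
BACK = {"hello": "bonjour", "world": "monde"}

def back_translate(tgt_tokens):
    """Generate a (possibly noisy) pseudo source sentence."""
    return [BACK.get(tok, "<unk>") for tok in tgt_tokens]

def combine(parallel, monolingual_tgt):
    """Merge gold parallel pairs with back-translated pseudo pairs."""
    pseudo = [(back_translate(t), t) for t in monolingual_tgt]
    return parallel + pseudo

gold = [(["bonjour"], ["hello"])]
mono = [["hello", "world"]]
corpus = combine(gold, mono)
print(len(corpus))  # 1 gold pair + 1 pseudo pair
```

In the paper the pseudo pairs are noisy by construction, which is exactly what motivates the subsequent variational information bottleneck.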

4.
This paper presents a novel method for semantic annotation and search of a target corpus using several knowledge resources (KRs). This method relies on a formal statistical framework in which KR concepts and corpus documents are homogeneously represented using statistical language models. Under this framework, we can perform all the necessary operations for an efficient and effective semantic annotation of the corpus. Firstly, we propose a coarse tailoring of the KRs w.r.t. the target corpus with the main goal of reducing the ambiguity of the annotations and their computational overhead. Then, we propose the generation of concept profiles, which allow measuring the semantic overlap of the KRs as well as performing a finer tailoring of them. Finally, we propose how to semantically represent documents and queries in terms of the KR concepts and the statistical framework to perform semantic search. Experiments have been carried out with a corpus about web resources, which includes several Life Sciences catalogs and Wikipedia pages related to web resources in general (e.g., databases, tools, services, etc.). Results demonstrate that the proposed method is more effective and efficient than state-of-the-art methods relying on either context-free annotation or keyword-based search.
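The core idea, representing both KR concepts and documents as unigram language models and comparing them, might look like this minimal sketch (Laplace smoothing, the toy vocabularies, and the KL-based comparison are our assumptions, not the paper's exact estimator):

```python
import math
from collections import Counter

def unigram_lm(tokens, vocab, alpha=1.0):
    """Laplace-smoothed unigram language model over a shared vocabulary."""
    counts = Counter(tokens)
    total = len(tokens) + alpha * len(vocab)
    return {w: (counts[w] + alpha) / total for w in vocab}

def kl(p, q):
    """KL divergence D(p || q); smoothing keeps q strictly positive."""
    return sum(p[w] * math.log(p[w] / q[w]) for w in p)

VOCAB = {"gene", "protein", "server", "database"}
doc       = unigram_lm(["gene", "protein", "gene"], VOCAB)
biology   = unigram_lm(["gene", "protein"], VOCAB)
computing = unigram_lm(["server", "database"], VOCAB)

# Annotate the document with the closest concept profile.
best = min([("biology", biology), ("computing", computing)],
           key=lambda c: kl(doc, c[1]))[0]
print(best)
```

The same homogeneous representation lets the method score queries against concept profiles for semantic search, not just annotate documents.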

5.
We introduce the task of mapping search engine queries to DBpedia, a major linking hub in the Linking Open Data cloud. We propose and compare various methods for addressing this task, using a mixture of information retrieval and machine learning techniques. Specifically, we present a supervised machine learning-based method to determine which concepts are intended by a user issuing a query. The concepts are obtained from an ontology and may be used to provide contextual information, related concepts, or navigational suggestions to the user submitting the query. Our approach first ranks candidate concepts using a language modeling for information retrieval framework. We then extract query, concept, and search-history feature vectors for these concepts. Using manual annotations we inform a machine learning algorithm that learns how to select concepts from the candidates given an input query. Simply performing a lexical match between the queries and concepts is found to perform poorly, as does using retrieval alone (i.e., omitting the concept selection stage). Our proposed method significantly improves upon these baselines, and we find that support vector machines achieve the best performance out of the machine learning algorithms evaluated.

6.

Historically, the Multimedia community research has focused on output modalities, through studies on timing and multimedia processing. The Multimodal Interaction community, on the other hand, has focused on user-generated modalities, through studies on Multimodal User Interfaces (MUI). In this paper, aiming to assist the development of multimedia applications with MUIs, we propose the integration of concepts from those two communities in a unique high-level programming framework. The framework integrates user modalities, both user-generated (e.g., speech, gestures) and user-consumed (e.g., audiovisual, haptic), in declarative programming languages for the specification of interactive multimedia applications. To illustrate our approach, we instantiate the framework in the NCL (Nested Context Language) multimedia language. NCL is the declarative language for developing interactive applications for Brazilian Digital TV and an ITU-T Recommendation for IPTV services. To help evaluate our approach, we discuss a usage scenario and implement it as an NCL application extended with the proposed multimodal features. Also, we compare the expressiveness of the multimodal NCL against existing multimedia and multimodal languages, for both input and output modalities.


7.
Learning Syntax by Automata Induction
In this paper we propose an explicit computer model for learning natural language syntax based on Angluin's (1982) efficient induction algorithms, using a complete corpus of grammatical example sentences. We use these results to show how inductive inference methods may be applied to learn substantial, coherent subparts of at least one natural language, English, that are not susceptible to the kinds of learning envisioned in linguistic theory. As two concrete case studies, we show how to learn English auxiliary verb sequences (such as could be taking, will have been taking) and the sequences of articles and adjectives that appear before noun phrases (such as the very old big deer). Both systems can be acquired in a computationally feasible amount of time using either positive examples, or, in an incremental mode, with implicit negative examples (examples outside a finite corpus are considered to be negative examples). As far as we know, this is the first computer procedure that learns a full-scale range of noun subclasses and noun phrase structure. The generalizations and the time required for acquisition match our knowledge of child language acquisition for these two cases. More importantly, these results show that just where linguistic theories admit to highly irregular subportions, we can apply efficient automata-theoretic learning algorithms. Since the algorithm works only for fragments of language syntax, we do not believe that it suffices for all of language acquisition. Rather, we would claim that language acquisition is nonuniform and susceptible to a variety of acquisition strategies; this algorithm may be one of these.
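A hand-written automaton of the kind the induction procedure would learn for auxiliary sequences might look like this; the token classes and transition table are illustrative simplifications, not Angluin's algorithm itself:

```python
MODALS = {"can", "could", "will", "would", "shall",
          "should", "may", "might", "must"}

def classify(tok):
    """Map a token to a coarse class used by the automaton."""
    if tok in MODALS:          return "M"
    if tok == "have":          return "H"
    if tok in {"be", "been"}:  return "B"
    if tok.endswith("ing"):    return "V"   # present participle
    return "?"

# DFA over token classes: modal, optional "have (been)" / "be", then V-ing.
TRANS = {
    ("q0", "M"): "q1",
    ("q1", "H"): "q2",
    ("q1", "B"): "q3",
    ("q2", "B"): "q3",
    ("q3", "V"): "qf",
}
ACCEPT = {"qf"}

def accepts(tokens):
    state = "q0"
    for tok in tokens:
        state = TRANS.get((state, classify(tok)))
        if state is None:
            return False
    return state in ACCEPT

print(accepts("could be taking".split()))        # M B V
print(accepts("will have been taking".split()))  # M H B V
print(accepts("be could taking".split()))        # ungrammatical order
```

The induction algorithm's job, of course, is to learn such a transition table from example sentences rather than having it written by hand.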

8.
In order to find an appropriate architecture for a large-scale real-world application automatically and efficiently, a natural method is to divide the original problem into a set of subproblems. In this paper, we propose a simple neural-network task decomposition method based on output parallelism. By using this method, a problem can be divided flexibly into several subproblems as chosen, each of which is composed of the whole input vector and a fraction of the output vector. Each module (for one subproblem) is responsible for producing a fraction of the output vector of the original problem. The hidden structures for the original problem's output units are decoupled. These modules can be grown and trained in parallel on parallel processing elements. Incorporated with a constructive learning algorithm, our method does not require excessive computation or any prior knowledge concerning decomposition. The feasibility of output parallelism is analyzed and proved. Some benchmarks are implemented to test the validity of this method. Their results show that this method can reduce computational time, increase learning speed, and improve generalization accuracy for both classification and regression problems.
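The decomposition itself, splitting the output vector into per-module slices that can then be trained in parallel, can be sketched as follows; the module count and contiguous-slice policy are our illustrative choices:

```python
def decompose(output_dim, n_modules):
    """Split output indices into contiguous, near-equal slices."""
    base, rem = divmod(output_dim, n_modules)
    slices, start = [], 0
    for m in range(n_modules):
        size = base + (1 if m < rem else 0)
        slices.append(list(range(start, start + size)))
        start += size
    return slices

def assemble(module_outputs):
    """Concatenate each module's fraction into the full output vector."""
    return [y for part in module_outputs for y in part]

print(decompose(5, 2))  # two modules: outputs 0-2 and 3-4
print(assemble([[0.1, 0.9, 0.0], [0.3, 0.7]]))
```

Each slice defines one subproblem (whole input vector, fraction of the output vector); the modules trained on those slices are independent, which is what allows parallel growth and training.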

9.
An Automatic Algorithm for Generating a Concept Semantic Network from a Structured Corpus
Concept semantic networks were proposed to address the vocabulary-mismatch problem in information retrieval and are one of the basic ways to improve retrieval effectiveness. Targeting natural-language web-based question answering, this paper proposes an algorithm that automatically generates a concept semantic network from a semi-structured corpus. By analyzing the composition of the corpus, different templates are used to extract documents for different types of concept relations, and different window units are set to compute inter-concept relatedness; after threshold filtering and role conversion, the various types of concept relations are obtained, and on this basis the semantic network is optimized and adjusted. Experimental results show that the concept semantic network obtained by this algorithm effectively improves question-retrieval performance.

10.
In this paper, we propose a new information-theoretic method to simplify the computation of information and to unify several methods in one framework. The new method is called "supposed maximum information" and is used to produce humanly comprehensible representations in competitive learning by taking into account the importance of input units. In the new learning method, by supposing the maximum information of input units, the actual information of input units is estimated. Then, the competitive network is trained with the estimated information in input units. The method is applied not to pure competitive learning but to self-organizing maps, because it is easy to demonstrate visually how well the new method can produce more interpretable representations. We applied the method to three well-known sets of data, namely, the Kohonen animal data, the SPECT heart data, and the voting data from the machine learning database. With these data, we succeeded in producing more explicit class boundaries on the U-matrices than did the conventional SOM. In addition, for all the data, the quantization and topographic errors produced by our method were lower than those of the conventional SOM.

11.
We propose a statistical approach to speech-to-speech translation that uses finite-state models at all levels. Acoustic hidden Markov models (HMMs) model the pronunciation of the input-language phonemes and words, while the input-output word mapping, along with the syntax of the output language, are jointly modeled by means of a large stochastic finite-state transducer. This allows for a complete integration of all the models, so that the translation process can be performed by searching for an optimal path of states through the integrated network. As in speech recognition, HMMs can be trained from an input-language speech corpus, and the translation model is learned automatically from a parallel (text) training corpus. This approach has been assessed in the framework of the EuTrans project, funded by the European Union. Extensive experiments have been carried out with speech-input translations from Spanish to English and from Italian to English in applications involving the interaction (by telephone) of a customer with the front desk of a hotel. A summary of the most relevant results is presented.

12.
Learning to classify parallel input/output access patterns
Input/output performance on current parallel file systems is sensitive to a good match of application access patterns to file system capabilities. Automatic input/output access pattern classification can determine application access patterns at execution time, guiding adaptive file system policies. In this paper, we examine and compare two novel input/output access pattern classification methods based on learning algorithms. The first approach uses a feedforward neural network previously trained on access pattern benchmarks to generate qualitative classifications. The second approach uses hidden Markov models trained on access patterns from previous executions to create a probabilistic model of input/output accesses. In a parallel application, access patterns can be recognized at the level of each local thread or as the global interleaving of all application threads. Classification of patterns at both levels is important for parallel file system performance; we propose a method for forming global classifications from local classifications. We present results from parallel and sequential benchmarks and applications that demonstrate the viability of this approach.

13.
We discuss how standard Cost-Benefit Analysis should be modified in order to take risk (and uncertainty) into account. We propose several approaches used in finance (Value at Risk, Conditional Value at Risk, Downside Risk Measures, and the Efficiency Ratio) as useful tools to model the impact of risk in project evaluation. After introducing the concepts, we show how they could be used in CBA and provide some simple examples to illustrate how such concepts can be applied to evaluate the desirability of a new infrastructure project.
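The two headline risk measures can be computed from a sample of project losses with a few lines of historical estimation; the loss sample and confidence level below are illustrative:

```python
import math

def value_at_risk(losses, alpha=0.95):
    """Historical VaR: smallest loss not exceeded with probability alpha."""
    s = sorted(losses)
    return s[math.ceil(alpha * len(s)) - 1]

def conditional_var(losses, alpha=0.95):
    """CVaR: expected loss in the tail at or beyond the VaR."""
    var = value_at_risk(losses, alpha)
    tail = [x for x in losses if x >= var]
    return sum(tail) / len(tail)

losses = list(range(1, 101))      # toy loss sample
print(value_at_risk(losses))      # 95
print(conditional_var(losses))    # 97.5
```

CVaR is always at least as large as VaR at the same confidence level, since it averages over the worst-case tail rather than marking its boundary, which is why it is often preferred as a coherent risk measure in project evaluation.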

14.
Queries in Chinese information retrieval systems mix Chinese characters, pinyin, English, and other forms, and some queries are too long to correct conveniently. Existing query correction methods do not handle mixed-language queries or long Chinese queries well. To solve these two problems, this paper proposes a parallel correction method that supports mixed languages. The method encodes the mixed languages uniformly, builds a unified-encoding language model and a heterogeneous character dictionary trie, and applies edit rules tailored to each language's characteristics so that query terms can be processed uniformly; for long Chinese queries, a bidirectional parallel correction model is proposed. To process queries in parallel, we introduce the concepts of a reverse character dictionary trie and a reverse language model on top of the forward trie and language model. The training corpus consists of high-quality text extracted from user query logs, web click logs, web link data, and other sources. Experiments show that, compared with unidirectional correction, the mixed-language parallel correction method improves precision by 9%, lowers recall by 3%, and is about 40% faster.

15.
Sentiment polarity detection is one of the most popular tasks related to Opinion Mining. Many papers have been presented describing one of the two main approaches used to solve this problem. On the one hand, a supervised methodology uses machine learning algorithms when training data exist. On the other hand, an unsupervised method based on a semantic orientation is applied when linguistic resources are available. However, few studies combine the two approaches. In this paper we propose the use of meta-classifiers that combine supervised and unsupervised learning in order to develop a polarity classification system. We have used a Spanish corpus of film reviews along with its parallel corpus translated into English. Firstly, we generate two individual models using these two corpora and applying machine learning algorithms. Secondly, we integrate SentiWordNet into the English corpus, generating a new unsupervised model. Finally, the three systems are combined using a meta-classifier that allows us to apply several combination algorithms such as voting system or stacking. The results obtained outperform those obtained using the systems individually and show that this approach could be considered a good strategy for polarity classification when we work with parallel corpora.
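The simplest of the combination schemes mentioned, majority voting over the supervised and unsupervised base systems, reduces to a few lines; the three stub classifiers below are hypothetical placeholders for the trained models:

```python
from collections import Counter

def majority_vote(labels):
    """Return the label predicted by most base classifiers."""
    return Counter(labels).most_common(1)[0][0]

# Stub base systems: two supervised models and one SentiWordNet-style
# unsupervised model (all hypothetical).
spanish_model = lambda doc: "positive"
english_model = lambda doc: "positive"
lexicon_model = lambda doc: "negative"

def meta_classify(doc):
    return majority_vote([spanish_model(doc),
                          english_model(doc),
                          lexicon_model(doc)])

print(meta_classify("una película estupenda"))  # 2-1 vote → "positive"
```

Stacking replaces the vote with a second-level classifier trained on the base systems' outputs, which is what lets the meta-classifier learn when to trust the unsupervised model over the supervised ones.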

16.
A noncommutative polynomial ring (operator ring) is defined over the differential field of a system and the corresponding differential vector space, and this ring is used to define the transfer function of a nonlinear system. Using differential vector spaces as a tool, the realization problem of single-input/single-output nonlinear systems is discussed. The main results answer two questions: (1) under what conditions different input/output differential equations have the same (equivalent) realization; and (2) how to determine the order of the minimal realization of an input/output differential equation when no realization is known. The results cover the corresponding results of linear system theory.

17.
The model-based approach is one of the methods widely used for speaker identification: a statistical model is used to characterize a specific speaker's voice, but no interspeaker information is involved in its parameter estimation. It is observed that interspeaker information is very helpful in discriminating between different speakers. In this paper, we propose a novel method for using interspeaker information to improve the performance of a model-based speaker identification system. A neural network is employed to capture the interspeaker information from the output space of those statistical models. In order to sufficiently utilize interspeaker information, a rival penalized encoding rule is proposed to design supervised learning pairs. For better generalization, moreover, a query-based learning algorithm is presented to actively select the input data of interest during training of the neural network. Comparative results on the KING speech corpus show that our method leads to a considerable improvement for a model-based speaker identification system.

18.
Given a question and its answer candidates (referred to as a QA corpus), answer selection is the task of identifying the most relevant answers to the question. Answer selection is widely used in question answering, web search, and so on. Current deep neural network models primarily utilize local features extracted from input question-answer pairs (QA pairs). However, the global features contained in QA corpora are under-utilized, and we argue that these global features substantially contribute to the answer selection task. To verify this point of view, we propose a novel model that combines local and global features for answer selection. In our model, two different global feature extractors are employed to extract statistical global features and deep global features from a QA corpus, respectively. Furthermore, we investigate the integration of these global features with local features in various experimental settings: statistical global features, deep global features, and a combination of statistical and deep global features. Our experimental results show that the global features are effective for answer selection. Our model obtains new state-of-the-art results on two public answer selection datasets and performs especially well on YahooCQA, where it achieves 9.2% and 6% higher precision@1 (P@1) and mean reciprocal rank (MRR) scores than previously published models.

19.
Translation equivalence pairs of named entities are of great importance in multilingual processing. Over the past few years, methods such as bilingual dictionary lookup and transliteration models have been proposed. Another valuable approach is to extract named-entity translation pairs automatically from parallel corpora, but existing methods require named-entity annotation on both language sides of the bilingual corpus. This paper proposes a method that only requires named-entity annotation on the source-language side, leaving the target language unannotated, and then uses the word-alignment results of a trained HMM to extract named-entity translation pairs. In the experiments, Chinese is the source language and English the target language. The results show that this method obtains named-entity translation pairs with high precision even when the alignment model is only partially accurate.
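Given source-side NE spans and an HMM word alignment, the extraction step amounts to projecting each span through the alignment onto target tokens. The sketch below assumes a simplified alignment format (source index → list of target indices); real HMM alignments are probabilistic and one-to-many:

```python
def extract_ne_pairs(src, tgt, alignment, ne_spans):
    """Project source NE spans onto target tokens via word alignment."""
    pairs = []
    for i, j in ne_spans:                      # source span [i, j)
        tgt_idx = sorted({a for s in range(i, j)
                            for a in alignment.get(s, [])})
        if tgt_idx:
            pairs.append((src[i:j], [tgt[t] for t in tgt_idx]))
    return pairs

src = ["北京", "很", "大"]
tgt = ["Beijing", "is", "big"]
alignment = {0: [0], 1: [1], 2: [2]}           # toy 1:1 alignment
print(extract_ne_pairs(src, tgt, alignment, [(0, 1)]))
```

Because only the source side carries NE tags, a partially wrong alignment degrades the projected target span but never requires target-side annotation, which is the method's main practical advantage.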

20.
In this paper, a higher-order-statistics (HOS)-based radial basis function (RBF) network for signal enhancement is introduced. In the proposed scheme, higher-order cumulants of the reference signal are used as the input of the HOS-based RBF. An HOS-based supervised learning algorithm, with the mean square error between higher-order cumulants of the desired input and the system output as the learning criterion, is used to adapt the weights. The motivation is that HOS can effectively suppress Gaussian and symmetrically distributed non-Gaussian noise, so the influence of Gaussian noise on the input of the HOS-based RBF and on the HOS-based learning algorithm is mitigated. Simulation results indicate that the HOS-based RBF provides better performance for signal enhancement under different noise levels, and its performance is insensitive to the selection of learning rates. Moreover, the efficiency of the HOS-based RBF remains stable under nonstationary Gaussian noise.
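The motivating property, that third- and fourth-order cumulants vanish for Gaussian noise (and the third for any symmetric distribution), can be checked with a few lines; the sample data are illustrative:

```python
def central_moment(x, k):
    mu = sum(x) / len(x)
    return sum((v - mu) ** k for v in x) / len(x)

def cumulants(x):
    """Second, third, and fourth sample cumulants."""
    m2 = central_moment(x, 2)
    m3 = central_moment(x, 3)
    m4 = central_moment(x, 4)
    return m2, m3, m4 - 3 * m2 ** 2

# A symmetric signal: the third cumulant is exactly zero.
k2, k3, k4 = cumulants([-1.0, 1.0, -1.0, 1.0])
print(k2, k3, k4)
```

For a true Gaussian signal the fourth cumulant also vanishes in expectation, which is why feeding cumulants rather than raw samples into the RBF suppresses Gaussian noise.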


Copyright © Beijing Qinyun Technology Development Co., Ltd. 京ICP备09084417号