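Anaphora is a common phenomenon in natural language that does much to simplify language and reduce redundancy. Coreference resolution is the process of using a computer to identify these anaphoric relations. In recent years, research on English coreference resolution has achieved a great deal; research on Chinese coreference resolution, however, remains relatively scarce. One reason is that research on Chinese natural language processing started late and related knowledge is limited; another is that few Chinese corpora are available, the only known ones being ACE2005, OntoNotes, and the like. To examine how the corpus affects Chinese noun phrase coreference resolution, this paper implements one Chinese noun phrase coreference resolution platform based on supervised learning and another based on unsupervised clustering, and on these platforms explores the effect of the corpus from two angles: quantity and quality.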

The paper addresses the question of how an English language user interface will be understood by users from different linguistic and cultural backgrounds and provides some answers from the study of second language acquisition and the practice of language teaching and learning. It is accepted that for a number of reasons, translation of an English interface into other languages is not always feasible or appropriate. Existing knowledge of language learning problems and solutions can be applied to the design of English language interfaces so that they are more accessible to non-native speakers. The present article categorises language-related problems, gives examples in each category, and provides a set of guidelines. The conclusion reached is that making word collocations and co-occurrences visible and available is the key to building in sufficient verbal context for understanding, a measure which will also be helpful to native speakers of English.

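Noun phrases have long been an important object of study in linguistics in China and abroad, and in recent years they have also received sustained attention in natural language processing. For English, knowledge bases of noun phrase semantic relations have been built at a certain scale. To date, however, no Chinese resource of comparable or larger scale describing noun phrase semantic relations has been established. Drawing on the work of many scholars in China and abroad on the semantic classification of noun phrases, this paper trial-annotates and analyzes instances of basic compound noun phrases in large-scale real corpora, and builds a semantic relation scheme for Chinese basic compound noun phrases together with a corresponding syntactic-semantic knowledge base, which can provide basic data for research on the syntax and semantics of these phrases. The knowledge base currently contains 18,281 high-frequency basic compound noun phrases. Each phrase is annotated with its semantic relation, its phrase structure, and whether it refers to an entity, and each of the two nouns in a phrase is additionally annotated with its semantic class. The semantic class information is based on Peking University's 《现代汉语语义词典》 (Semantic Dictionary of Modern Chinese). Based on this knowledge base, the paper also presents preliminary statistics and analysis of the syntax and semantics of basic compound noun phrases.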

Sentiment polarity detection is one of the most popular tasks related to Opinion Mining. Many papers have been presented describing one of the two main approaches used to solve this problem. On the one hand, a supervised methodology uses machine learning algorithms when training data exist. On the other hand, an unsupervised method based on semantic orientation is applied when linguistic resources are available. However, few studies combine the two approaches. In this paper we propose the use of meta-classifiers that combine supervised and unsupervised learning in order to develop a polarity classification system. We have used a Spanish corpus of film reviews along with its parallel corpus translated into English. Firstly, we generate two individual models using these two corpora and applying machine learning algorithms. Secondly, we integrate SentiWordNet into the English corpus, generating a new unsupervised model. Finally, the three systems are combined using a meta-classifier that allows us to apply several combination algorithms such as a voting system or stacking. The results obtained outperform those obtained using the systems individually and show that this approach could be considered a good strategy for polarity classification when we work with parallel corpora.
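
The abstract names voting as one way to combine the three systems but gives no details. As a minimal sketch, assuming plain majority voting over per-review labels, the combination step might look like the following. The function names, the `"pos"`/`"neg"` labels, and the example predictions are all hypothetical and are not taken from the paper.

```python
from collections import Counter


def majority_vote(votes):
    """Return the label that most of the individual systems predicted."""
    label, _count = Counter(votes).most_common(1)[0]
    return label


def combine(system_outputs):
    """Combine several systems' label lists (same length) review by review."""
    return [majority_vote(votes) for votes in zip(*system_outputs)]


# Hypothetical predictions for four reviews from three systems.
spanish_model = ["pos", "neg", "pos", "neg"]       # supervised, Spanish corpus
english_model = ["pos", "pos", "pos", "neg"]       # supervised, English corpus
sentiwordnet_model = ["neg", "neg", "pos", "pos"]  # unsupervised, SentiWordNet

combined = combine([spanish_model, english_model, sentiwordnet_model])
print(combined)
```

With three systems and two labels there is always a strict majority, so no tie-breaking rule is needed in this sketch. Stacking, the other algorithm the abstract mentions, would instead train a further classifier on the three systems' outputs.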

During the early stages of language acquisition, young infants face the task of learning a basic vocabulary without the aid of prior linguistic knowledge. Attempts have been made to model this complex behaviour computationally, using a variety of machine learning algorithms, among others non-negative matrix factorization (NMF). In this paper, we replace NMF in a vocabulary learning setting with a conceptually similar algorithm, probabilistic latent semantic analysis (PLSA), which can learn word representations incrementally by Bayesian updating. We further show that this learning framework is capable of modelling certain cognitive behaviours, e.g. forgetting, in a simple way.

We describe the design and evaluation of two different dynamic student uncertainty adaptations in wizarded versions of a spoken dialogue tutoring system. The two adaptive systems adapt to each student turn based on its uncertainty, after an unseen human “wizard” performs speech recognition and natural language understanding and annotates the turn for uncertainty. The design of our two uncertainty adaptations is based on a hypothesis in the literature that uncertainty is an “opportunity to learn”; both adaptations use additional substantive content to respond to uncertain turns, but the two adaptations vary in the complexity of these responses. The evaluation of our two uncertainty adaptations represents one of the first controlled experiments to investigate whether substantive dynamic responses to student affect can significantly improve performance in computer tutors. To our knowledge, ours is the first study to show that dynamically responding to uncertainty can significantly improve learning during computer tutoring. We also highlight our ongoing evaluation of our uncertainty-adaptive systems with respect to other important performance metrics, and we discuss how our corpus can be used by the wider computer speech and language community as a linguistic resource supporting further research on effective affect-adaptive spoken dialogue systems in general.

We are developing an intelligent robot and attempting to teach it language. While there are many aspects of this research, for the purposes here the most important are the following ideas. Language is primarily based on semantics, not on syntax, even though syntax is still the focus of speech recognition research these days. To truly learn meaning, a language engine cannot simply be a computer program running on a desktop computer analyzing speech. It must be part of a more general, embodied intelligent system, one capable of using associative learning to form concepts from the perception of experiences in the world, and further capable of manipulating those concepts symbolically. In this paper, we present a general cascade model for learning concepts, and explore the use of hidden Markov models (HMMs) as part of the cascade model. HMMs are capable of automatically learning and extracting the underlying structure of continuous-valued inputs and representing that structure in the states of the model. These states can then be treated as symbolic representations of the inputs. We show how a cascade of HMMs can be embedded in a small mobile robot and used to find correlations among sensory inputs to learn a set of symbolic concepts, which are used for decision making and could eventually be manipulated linguistically.

We report experimental results on automatic extraction of an English-Chinese translation lexicon, by statistical analysis of a large parallel corpus, using limited amounts of linguistic knowledge. To our knowledge, these are the first empirical results of the kind between an Indo-European and non-Indo-European language for any significant vocabulary and corpus size. The learned vocabulary size is about 6,500 English words, achieving translation precision in the 86–96% range, with alignment proceeding at paragraph, sentence, and word levels. Specifically, we report (1) progress on the HKUST English-Chinese Parallel Bilingual Corpus, (2) experiments supporting the usefulness of restricted lexical cues for statistical paragraph and sentence alignment, and (3) experiments that question the role of hand-derived monolingual lexicons for automatic word translation acquisition. Using a hand-derived monolingual lexicon, the learned translation lexicon averages 2.33 Chinese translations per English entry, with a manually-filtered precision of 95.1%, and an automatically-filtered weighted precision of 86.0%. We then introduce a fully automatic two-stage statistical methodology that is able to learn translations for collocations. A statistically-learned monolingual Chinese lexicon is first used to segment the Chinese text, before applying bilingual training to produce 6,429 English entries with 2.25 Chinese translations per entry. This method improves the manually-filtered precision to 96.0% and the automatically-filtered weighted precision to 91.0%, an error rate reduction of 35.7% from using a hand-derived monolingual lexicon.
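
The 35.7% figure appears to be the relative drop in error for the automatically-filtered weighted precision: the error falls from 100 − 86.0 = 14.0 points to 100 − 91.0 = 9.0 points. The short check below works through that arithmetic under this reading. The function name is illustrative and does not come from the paper.

```python
def error_rate_reduction(old_precision: float, new_precision: float) -> float:
    """Relative error reduction (in percent) when precision, given in percent, rises."""
    old_error = 100.0 - old_precision
    new_error = 100.0 - new_precision
    return (old_error - new_error) / old_error * 100.0


# Automatically-filtered weighted precision: 86.0% before, 91.0% after.
reduction = error_rate_reduction(86.0, 91.0)
print(f"{reduction:.1f}%")  # (14.0 - 9.0) / 14.0 = 35.7%
```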

Kazman, Rick. 《Machine Learning》 1994, 16(1-2): 87-120
This paper describes the theory and implementation of Babel, a system which explores the hypothesis that many of the differences among the world's languages may be characterized by the inventory and properties of the lexical items and functional categories of those languages. The structure of Babel assumes that functional categories are originally lacking in a child's syntax, and are acquired through a statistical induction process of lexical acquisition. Babel then uses information induced from the structure of the lexicon to create a model of syntax via a deductive, rule-based process. This model makes a number of predictions about the time course of language acquisition. These predictions are tested by running Babel as a simulation of child language acquisition, using large samples of adult speech to children as input. The simulation results are shown to correlate highly with longitudinal studies of child language acquisition in English and Polish. Finally, the approach to handling noisy data with Babel is detailed.

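As a basic language database and knowledge base, a corpus is the foundation on which natural language processing methods are implemented. With the widespread application of statistical methods in natural language processing, corpus construction has become an important research topic. Automatic word segmentation is an indispensable foundation for syntactic analysis, and its performance directly affects that analysis. Through statistical analysis of 850,000 bytes of Tibetan corpus text and a study of the distribution and grammatical functions of Tibetan words, this paper introduces the model of a dictionary-based automatic Tibetan word segmentation system and gives the structure of the segmentation dictionary, the case-particle chunking algorithm, and the restoration algorithm. The system lays a foundation for research on Tibetan input methods, construction of Tibetan electronic dictionaries, Tibetan character and word frequency statistics, search engine design and implementation, machine translation system development, network information security, Tibetan corpus construction, and Tibetan semantic analysis.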

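李业刚, 黄河燕, 史树敏, 鉴萍, 苏超. 《软件学报》 (Journal of Software) 2015, 26(7): 1615-1625
To address the poor consistency of traditional methods in bilingual maximal noun phrase identification and their weak cross-domain identification ability, a bilingual maximal noun phrase identification algorithm based on semi-supervised learning is proposed. Exploiting the mutual translatability of Chinese and English maximal noun phrases and the complementarity of their identification, the parallel Chinese and English sentences are treated as two different views of a single dataset for bilingual co-training. During co-training, the agreement rate of bilingually aligned annotations serves as the basis for estimating labeling confidence when selecting the newly labeled data to add. Experimental results show that the algorithm significantly improves bilingual maximal noun phrase identification: in cross-domain and in-domain tests, the F-score is 4.52% and 3.08% higher, respectively, than that of the best current maximal noun phrase identification model.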

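Chinese interlanguage arose alongside the teaching of Chinese as an international language. As Chinese learning expands worldwide, the volume of Chinese interlanguage keeps growing, and because these texts are distinctive in their language use, interlanguage has become a unique resource for language information processing and intelligent language-learning assistance. Dependency parsing is an important step in language information processing. Dependency-annotated corpora of English interlanguage have already been put to good use, whereas current Chinese interlanguage corpora pay little attention to syntax and lack a dependency annotation guideline that fully accounts for the characteristics of Chinese interlanguage. This paper focuses on constructing a dependency-annotated corpus of Chinese interlanguage and discusses the annotation guideline. Drawing fully on the Universal Dependencies annotation scheme, it formulates a dependency annotation guideline for Chinese interlanguage and applies it in annotation practice, producing an interlanguage dependency corpus that covers Chinese pedagogical grammar points.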

Adaptive fuzzy command acquisition with reinforcement learning   (cited 2 times in total: 0 self-citations, 2 citations by others)
This paper proposes a four-layered adaptive fuzzy command acquisition network (AFCAN) for adaptively acquiring fuzzy commands via interactions with the user or environment. It can catch the intended information from a sentence (command) given in natural language with fuzzy predicates. The intended information includes a meaningful semantic action and the fuzzy linguistic information of that action. The proposed AFCAN has three important features. First, no restrictions whatever are placed on the fuzzy command input, which is used to specify the desired information, and the network requires no acoustic, prosodic, grammar, or syntactic structure. Second, the linguistic information of an action is learned adaptively and is represented by fuzzy numbers based on α-level sets. Third, the network can learn during the course of performing the task. The AFCAN can perform off-line as well as online learning. For the off-line learning, the mutual-information (MI) supervised learning scheme and the fuzzy backpropagation (FBP) learning scheme are employed when the training data are available in advance. The former scheme is used to learn meaningful semantic actions and the latter to learn linguistic information. The AFCAN can also perform online learning interactively when it is in use for fuzzy command acquisition. For the online learning, the MI-reinforcement learning scheme and the fuzzy reinforcement learning scheme are developed for the online learning of meaningful actions and linguistic information, respectively. An experimental system is constructed to illustrate the performance and applicability of the proposed AFCAN.

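This paper proposes a target-language generation mechanism for a data-oriented English-Chinese machine translation system, using data-oriented language analysis techniques as its basic framework. The mechanism generates the target-language translation by linearizing the syntactic parse tree of the source sentence. This involves three steps: selecting, from all fragment combinations of the source sentence's parse tree, one combination suitable for generation; linearizing every fragment in that combination; and linearizing the combination once all of its fragments have been linearized, which yields the final target-language translation. To demonstrate the method's effectiveness, a knowledge source was built from a real English corpus of 1,000 sentences, and a real English corpus of 100 sentences was used as the test set. Experiments show that the quality of the target-language output is fairly satisfactory and that English-Chinese machine translation can be carried out successfully.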

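Identification and Analysis of Chinese Organization Names   (cited 37 times in total: 7 self-citations, 30 citations by others)
Chinese organization names are vast in number and keep emerging; the great majority are not listed in dictionaries, which creates difficulties for natural language processing. From a linguistic point of view, however, an organization name is a modifier-head compound proper noun and also a relatively simple kind of modifier-head noun phrase, with its own structural regularities and morphological markers. Focusing on university names and drawing on real corpora from mainland China, Hong Kong, and Taiwan, this paper discusses the identification and analysis of organization names from both the linguistic and the computational side, and summarizes the corresponding rules. Applying these rules to identify university names in a corpus of more than six million characters from the three regions yields a precision (counting a name as correct only when both its left and right boundaries are located correctly) of 97.3% and a recall of 96.9%. The rules can also be applied in other areas such as intelligent pinyin-to-Chinese-character conversion and machine translation.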
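
The abstract counts a recognized name as correct only when both its left and right boundaries match. As a rough illustration of that scoring rule, the sketch below computes exact-boundary precision and recall over (start, end) character spans. The span values are invented for the example and the function is an assumption, not the authors' evaluation code.

```python
def precision_recall(predicted, gold):
    """Exact-match scoring: a span counts only if both boundaries agree."""
    predicted, gold = set(predicted), set(gold)
    correct = len(predicted & gold)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    return precision, recall


# Hypothetical (start, end) offsets of university names in a text.
gold = [(0, 4), (10, 16), (20, 26)]
predicted = [(0, 4), (10, 15), (20, 26), (30, 34)]  # (10, 15) misses the right boundary

p, r = precision_recall(predicted, gold)
print(f"precision={p:.3f} recall={r:.3f}")
```

Under this rule a span with one wrong boundary, such as `(10, 15)` above, lowers precision and recall at the same time, because it is both a false positive and a missed gold span.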

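付丹亚. 《网友世界》 2013, (20): 119-120
Deaf-mute students are constrained by their physiological conditions, so acquiring English is fraught with difficulty. Using questionnaires and interviews, this paper surveys the state of English acquisition among 20 deaf-mute students at the Shangluo Special Education School (商洛市特殊教育学校) and analyzes the data in terms of age and of non-intellectual factors in English learning, such as learning attitude, subjective desire, motivation, and perseverance. It argues that the age factor in these students' English acquisition is inconsistent with linguistic theories on how age affects language acquisition, such as the age boundary in the critical period hypothesis. Moreover, although the students' current level of English acquisition is very poor, strong interest, a positive attitude, high motivation, and persistent effort, combined with appropriate use of hearing aids and language rehabilitation training, make English acquisition a realistic prospect.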

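Verb subcategorization frames (SCFs) play an indispensable role in research on syntactic parsing, semantic role labeling, and related areas. In acquiring subcategorization frame information, the first step is to establish a standard, complete set of SCF types. For English, a widely agreed set of SCF types has already been established, whereas Chinese still lacks a standard set of verb SCF types. This paper proposes a semi-automatic scheme for acquiring a Chinese verb SCF type set that combines linguistic knowledge with statistical methods, and builds a preliminary type set that both matches the statistical results and is broadly consistent with linguistic theory. Experiments show that the type set incorporating linguistic theory reduces dependence on the corpus and is more complete than one produced purely by corpus analysis.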

Machine learning techniques used in computer aided diagnosis (CAD) systems learn a hypothesis to help medical experts make a diagnosis in the future. To learn a well-performing hypothesis, a large number of expert-diagnosed examples are required, which places a heavy burden on experts. By exploiting large amounts of undiagnosed examples and the power of ensemble learning, the co-training-style random forest (Co-Forest) relieves the burden on the experts and produces well-performing hypotheses. However, Co-Forest may suffer from a problem common to other co-training-style algorithms, namely, that unlabeled examples may be wrongly labeled and that these errors accumulate in the training process. This is because the limited number of originally-labeled examples usually produces poor component classifiers, which lack diversity and accuracy. In this paper, a new Co-Forest algorithm named Co-Forest with Adaptive Data Editing (ADE-Co-Forest) is proposed. Not only does it exploit a specific data-editing technique in order to identify and discard possibly mislabeled examples throughout the co-labeling iterations, but it also employs an adaptive strategy in order to decide whether to trigger the editing operation according to different cases. The adaptive strategy combines five pre-conditional theorems, all of which ensure an iterative reduction of classification error and an increase in the scale of new training sets under PAC learning theory. Experiments on UCI datasets and an application to small pulmonary nodule detection using chest CT images show that ADE-Co-Forest can more effectively enhance the performance of a learned hypothesis than Co-Forest and DE-Co-Forest (Co-Forest with Data Editing but without adaptive strategy).

The aim of this paper is to illustrate the potential of a parallel corpus in the context of (computer-assisted) language learning. In order to do so, we propose to answer two main questions: (1) what corpus (data) to use and (2) how to use the corpus (data). We provide an answer to the what-question by describing the importance and particularities of compiling and processing a corpus for pedagogical purposes. In order to answer the how-question, we first investigate the central concepts of the interactionist theory of second language acquisition: comprehensible input, input enhancement, comprehensible output and output enhancement. By means of two case studies, we illustrate how the abovementioned concepts can be realized in concrete corpus-based language learning activities. We propose a design for a receptive and a productive language task and describe how a parallel corpus can form the basis of powerful language learning activities. The Dutch Parallel Corpus, a ten-million-word sentence-aligned and annotated parallel corpus, is used to develop these language tasks.
