Similar Documents
20 similar records retrieved.
1.
Multi-Strategy Chinese-Uyghur Sentence Alignment
This paper proposes an error-suppressing multi-strategy algorithm for aligning Chinese and Uyghur sentences. Because length-based alignment cannot avoid the propagation of alignment errors, a new error-propagation suppression strategy is introduced: lexical co-occurrence statistics from the bilingual corpus are used to automatically extract Chinese-Uyghur word collocations, which are combined with sentence-length features to locate 1:1 sentence pairs as anchor points, so that error propagation is confined within anchor intervals; between anchors, sentences are aligned with a hybrid method based on punctuation and length. Experiments show that the anchors found by the multi-strategy algorithm are highly precise and effectively suppress the spread of alignment errors. The hybrid alignment algorithm also avoids the high time complexity of purely lexical alignment and clearly outperforms traditional alignment, raising alignment precision from 95.0% to 97.6% and recall from 96.8% to 98.2%. The accompanying alignment-correctness evaluation algorithm effectively detects noisy alignments produced by the automatic aligner.
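The anchor idea in this abstract can be illustrated with a minimal Python sketch. The collocation table `colloc`, the expected length ratio, and the score threshold are hypothetical placeholders; the paper's actual features and parameters are not given in the abstract.

```python
# Hedged sketch of the anchor-selection step for 1:1 Chinese-Uyghur sentence pairs.

def collocation_score(zh_tokens, uy_tokens, colloc):
    """Sum the co-occurrence scores of all word pairs found in the collocation table."""
    return sum(colloc.get((z, u), 0.0) for z in zh_tokens for u in uy_tokens)

def length_score(zh_tokens, uy_tokens, expected_ratio=1.0, tol=0.25):
    """1 if the token-length ratio is close to the expected ratio, else 0."""
    ratio = len(uy_tokens) / max(len(zh_tokens), 1)
    return 1.0 if abs(ratio - expected_ratio) <= tol * expected_ratio else 0.0

def find_anchors(zh_sents, uy_sents, colloc, threshold=3.0):
    """zh_sents / uy_sents: lists of token lists, assumed roughly parallel.
    Returns indices of high-confidence 1:1 pairs used as anchors; alignment
    errors cannot propagate past an anchor, and the intervals between anchors
    are aligned separately by the punctuation/length hybrid method."""
    anchors = []
    for i, (zh, uy) in enumerate(zip(zh_sents, uy_sents)):
        if collocation_score(zh, uy, colloc) + length_score(zh, uy) >= threshold:
            anchors.append(i)
    return anchors
```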

2.
An improved adaptive Chinese-Uyghur sentence alignment algorithm is proposed. Since traditional alignment methods do not adapt well to changes in corpus type, the algorithm uses the byte-length ratio of the Chinese-Uyghur text currently being aligned, together with historical match-pattern data, to dynamically correct the parameters of the alignment model so that it adapts to the corpus type and aligns sentences more accurately. Compared with the length-based model, precision and recall rise by 3.5 and 2.7 percentage points respectively; compared with hybrid alignment, they rise by 1.9 and 1.8 percentage points. Experimental results confirm that the algorithm adapts effectively to changes in corpus type.
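A minimal sketch of the adaptive idea, assuming the alignment model is parameterised by an expected Uyghur-to-Chinese byte-length ratio and a distribution over match patterns; the running updates below are a hypothetical illustration of "dynamically correcting the model parameters", not the paper's exact formulas.

```python
# Hedged sketch: keep a running estimate of the byte-length ratio and of the
# match-pattern distribution (1:1, 1:2, 2:1, ...) observed so far, and feed the
# updated values back into the alignment model for the next text segment.

class AdaptiveLengthModel:
    def __init__(self, init_ratio=1.0, smoothing=5.0):
        self.ratio = init_ratio        # expected Uyghur/Chinese byte-length ratio
        self.weight = smoothing        # pseudo-count smoothing early updates
        self.pattern_counts = {}       # e.g. {(1, 1): 120, (1, 2): 7, ...}

    def update(self, zh_bytes, uy_bytes, pattern):
        """Fold one confirmed alignment into the running parameters."""
        observed = uy_bytes / max(zh_bytes, 1)
        self.ratio = (self.ratio * self.weight + observed) / (self.weight + 1)
        self.weight += 1
        self.pattern_counts[pattern] = self.pattern_counts.get(pattern, 0) + 1

    def pattern_prior(self, pattern):
        """Relative frequency of a match pattern in the history seen so far."""
        total = sum(self.pattern_counts.values()) or 1
        return self.pattern_counts.get(pattern, 0) / total
```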

3.
Starting from the difficulties of Yi-Chinese word alignment for information processing, and building on an analysis of the Brown word alignment model, this paper proposes BiDictAlign, a Yi-Chinese word alignment algorithm based on a Yi-Chinese bilingual dictionary. Experimental tests show that the method performs well, providing a meaningful exploration of word alignment techniques for Yi-Chinese bilingual corpora in information processing.

4.
Uyghur news web pages and their corresponding Chinese translation pages are often not fully comparable: the bilingual sentence sequences may be reordered and some sentences may be missing, which makes Uyghur-Chinese sentence alignment difficult. In addition, many person and place names, which are essential elements of news, are out-of-vocabulary words, further increasing the difficulty. To raise the probability of Uyghur-Chinese lexical matches, the authors automatically extract Chinese person and place names, translate them into Uyghur, and build a bilingual name mapping table that is added to the Uyghur-Chinese bilingual dictionary. The Chinese translations of the dictionary words found in a Uyghur sentence are then matched against the Chinese sentence by string matching, avoiding Chinese word-segmentation errors; accumulating all matched word pairs yields the lexical mutual-translation rate of a bilingual sentence pair. Finally, digit, punctuation, and length features are fused to compute the similarity of the sentence pair. A graph matching algorithm is applied to the matrix of all pairwise sentence similarities to find Uyghur-Chinese parallel sentence pairs, reaching an alignment accuracy of up to 95.67% on 900 sentence pairs.
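A minimal sketch of the lexical mutual-translation rate and the fused similarity described above, assuming a hypothetical Uyghur-to-Chinese dictionary `uy2zh_dict`; the feature weights are illustrative, not the paper's values.

```python
# Hedged sketch: for each Uyghur word with dictionary translations, check whether
# any translation occurs as a substring of the raw Chinese sentence (string
# matching sidesteps Chinese word segmentation), then fuse the rate with other features.

def translation_rate(uy_tokens, zh_sentence, uy2zh_dict):
    matched, covered = 0, 0
    for w in uy_tokens:
        translations = uy2zh_dict.get(w)
        if not translations:
            continue
        covered += 1
        if any(t in zh_sentence for t in translations):   # substring match
            matched += 1
    return matched / covered if covered else 0.0

def similarity(uy_tokens, zh_sentence, uy2zh_dict,
               digit_score, punct_score, length_score,
               w_lex=0.6, w_dig=0.15, w_punct=0.1, w_len=0.15):
    """Fuse the lexical mutual-translation rate with digit, punctuation and length features."""
    lex = translation_rate(uy_tokens, zh_sentence, uy2zh_dict)
    return w_lex * lex + w_dig * digit_score + w_punct * punct_score + w_len * length_score
```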

5.
Chinese-English Sentence Alignment Using an Extended Length-Based Method
This paper presents an extended method for aligning a Chinese-English parallel corpus. The extended method takes the length-based statistical alignment method as its core, introduces lexical information from a bilingual dictionary, and uses a punctuation-based method as a post-processing step. This extension avoids complex Chinese processing, such as word segmentation and part-of-speech tagging, while bringing keyword information into the statistical method to improve sentence alignment accuracy. The bilingual corpus used is the LDC collection of bilingual news reports about Hong Kong, and dynamic programming is used to implement the system. Compared with the purely length-based method and the lexical method, the extended method improves sentence alignment accuracy, and the results are satisfactory.
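The core of the extended method, length-based alignment by dynamic programming, can be sketched as follows in the spirit of Gale & Church (1993); the cost constants and match-pattern penalties are illustrative defaults, not the paper's trained parameters.

```python
import math

# Hedged sketch of length-based sentence alignment by dynamic programming.
PATTERNS = [(1, 1), (1, 0), (0, 1), (2, 1), (1, 2)]
PENALTY = {(1, 1): 0.0, (1, 0): 4.0, (0, 1): 4.0, (2, 1): 2.0, (1, 2): 2.0}

def length_cost(src_len, tgt_len, c=1.0, var=6.8):
    """Cost of matching segments of these lengths under a Gaussian length model."""
    if src_len == 0 and tgt_len == 0:
        return 0.0
    mean = max((src_len + tgt_len / c) / 2.0, 1e-6)
    delta = (tgt_len - c * src_len) / math.sqrt(var * mean)
    return delta * delta  # negative log-likelihood up to a constant

def align(src_lens, tgt_lens):
    """Return the best list of (n_src, n_tgt) match patterns via dynamic programming."""
    n, m = len(src_lens), len(tgt_lens)
    INF = float("inf")
    cost = [[INF] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(n + 1):
        for j in range(m + 1):
            if cost[i][j] == INF:
                continue
            for di, dj in PATTERNS:
                ni, nj = i + di, j + dj
                if ni > n or nj > m:
                    continue
                new_cost = cost[i][j] + PENALTY[(di, dj)] + length_cost(
                    sum(src_lens[i:ni]), sum(tgt_lens[j:nj]))
                if new_cost < cost[ni][nj]:
                    cost[ni][nj] = new_cost
                    back[ni][nj] = (i, j, di, dj)
    # Trace back the best path of match patterns.
    path, i, j = [], n, m
    while (i, j) != (0, 0):
        pi, pj, di, dj = back[i][j]
        path.append((di, dj))
        i, j = pi, pj
    return list(reversed(path))
```

In the extended method described in the abstract, the lexical and punctuation information would be folded into this cost function as additional terms rather than replacing it.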

6.
An Automatic Sentence Alignment Algorithm for Chinese-English Bilingual Text
The construction of bilingual corpora and research on their automatic alignment are of great importance to computational linguistics. Bilingual alignment is the core technology for processing bilingual text, and the quality of alignment directly affects subsequent work such as machine-aided translation. Based on the practical characteristics of Chinese-English text, this paper proposes a new hybrid sentence alignment algorithm: it mainly uses a new length-based alignment algorithm, combines it with a dictionary-based alignment algorithm, and further improves sentence alignment accuracy through bidirectional (forward and backward) alignment. The algorithm was validated on 100 documents containing more than 5,000 Chinese-English sentence pairs; the alignment results are satisfactory, which shows that the algorithm is feasible in practical work.

7.
Bilingual Sentence Alignment Based on Automatically Extracted Lexical Information
刘昕, 周明, 朱胜火, 黄昌宁. 《计算机学报》 (Chinese Journal of Computers), 1998, 21(Z1): 151-158
Sentence alignment of bilingual corpora has become a crucial problem in new-generation machine translation research. The main alignment methods are length-based and lexicon-based, each with its own characteristics: the former is simple to implement and efficient but less accurate; the latter is accurate but complex to implement. This paper proposes a new alignment method: it first performs a coarse alignment of the text with the length-based method, then determines anchor points in the bilingual parallel text and automatically extracts corresponding bilingual key words, which reduces the complexity of the alignment problem and limits the propagation of errors; finally, the extracted lexical correspondence information is used to align the sentences. This method combines the advantages of the length-based and lexicon-based approaches, and experiments show that it greatly improves alignment accuracy.

8.
Automatic alignment of bilingual corpora has become an important research topic in machine translation. Current sentence alignment methods are length-based or lexicon-based. This paper first analyzes the length-based method and then proposes a translation-based method: a relatively complete translation dictionary is used as a bridge to connect the correspondences between English and Chinese sentences. For each word in the English text, its translations are looked up in the dictionary and matched against the Chinese sentences; an evaluation function and a dynamic programming algorithm are then used to find the aligned sentence pairs. Experimental results show that this alignment method eliminates the error propagation found in the length-based approach and greatly improves alignment accuracy; the results are satisfactory.

9.
《计算机工程》 (Computer Engineering), 2019, (6): 211-217
Sentence alignment is the process of mapping sentences in a source text to their corresponding translations in a target text. Observing that mutually aligned source and target sentences contain many mutually aligned words, this paper proposes a sentence alignment method within a neural network framework. A gate relevance network is used to capture semantic relations between word pairs of a source sentence and a target sentence, and these semantic relations determine whether the two sentences are aligned. Evaluated on non-monotonic text, the method achieves an F1 score of 93.8%, effectively improving sentence alignment accuracy.
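The abstract describes the gate relevance network only at a high level, so the following is a heavily hedged, generic stand-in: it scores a sentence pair from word-pair semantic relations computed with (assumed) pre-trained embeddings and is not the paper's architecture.

```python
import numpy as np

# Generic illustration: build a word-pair cosine-similarity matrix from
# L2-normalised embeddings and pool it into a single alignment score.

def pair_similarity(src_vecs, tgt_vecs):
    """src_vecs: (n, d), tgt_vecs: (m, d), both L2-normalised."""
    sim = src_vecs @ tgt_vecs.T                  # (n, m) word-pair relation matrix
    s2t = sim.max(axis=1).mean()                 # best target match per source word
    t2s = sim.max(axis=0).mean()                 # best source match per target word
    return 0.5 * (s2t + t2s)                     # symmetric pooled score

def is_aligned(src_vecs, tgt_vecs, threshold=0.55):
    """Decide alignment by thresholding the pooled word-pair similarity."""
    return pair_similarity(src_vecs, tgt_vecs) >= threshold
```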

10.
Targeting the characteristics of a Chinese-Uyghur corpus of government documents, and making full use of the sentence properties of Chinese and Uyghur, this paper proposes a sentence-level Chinese-Uyghur alignment method. The method focuses on analyzing the sentence patterns of Chinese and Uyghur in the government domain and performs boundary identification on the Chinese and Uyghur corpora separately, which avoids the impact of complex sentence patterns on Chinese-Uyghur sentence alignment and yields a sentence alignment accuracy between 97% and 99%. The aligned Chinese-Uyghur sentence pairs can enlarge the corpus and provide translation material for Chinese-Uyghur phrase alignment and Chinese-Uyghur machine translation.

11.
Corpus-Based Research on English Clause Recognition
To improve the translation of complex sentences in English-Chinese machine translation systems, this paper addresses the problem of delimiting clause boundaries in English complex sentences and proposes a corpus-based clause recognition method. The method uses part-of-speech information and combines rule-based and statistical approaches to identify clause boundaries, obtaining good experimental results: 92.69% precision and 91.04% recall in the closed test, and 80.34% precision and 83.93% recall in the open test.

12.
Statistics-Based Recognition of Chinese Place Names
This paper studies the recognition of Chinese place names that contain characteristic words. The system uses statistics obtained from a large-scale gazetteer and a real-text corpus, together with rules summarized from the characteristics of place names, and recognizes Chinese place names by computing word-formation credibility and continuation credibility. The model effectively adjusts the output of automatic word segmentation; closed-test recall and precision are 90.24% and 93.14%, and open-test recall and precision reach 86.86% and 91.48%.
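The abstract does not give the credibility formulas, so the sketch below is a hypothetical reconstruction: word-formation credibility from character-position probabilities estimated on a gazetteer, and continuation credibility from context-word probabilities; the probability tables and threshold are placeholders.

```python
import math

# Hedged, hypothetical scoring of a candidate place name.

def formation_credibility(candidate, char_pos_prob):
    """Sum of log P(char | position in a place name); char_pos_prob[(ch, pos)]
    is assumed to be estimated from a gazetteer, pos in {'begin', 'middle', 'end'}."""
    score = 0.0
    for i, ch in enumerate(candidate):
        pos = 'begin' if i == 0 else ('end' if i == len(candidate) - 1 else 'middle')
        score += math.log(char_pos_prob.get((ch, pos), 1e-6))
    return score

def continuation_credibility(prev_word, next_word, context_prob):
    """How plausibly the candidate is preceded/followed by its context words."""
    left = context_prob.get((prev_word, 'before'), 1e-6)
    right = context_prob.get((next_word, 'after'), 1e-6)
    return math.log(left) + math.log(right)

def is_place_name(prev_word, candidate, next_word,
                  char_pos_prob, context_prob, threshold=-20.0):
    total = (formation_credibility(candidate, char_pos_prob)
             + continuation_credibility(prev_word, next_word, context_prob))
    return total >= threshold
```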

13.
Alzheimer’s disease is a non-reversible, non-curable, and progressive neurological disorder that induces the shrinkage and death of a specific neuronal population associated with memory formation and retention. It is a frequently occurring mental illness that accounts for about 60%–80% of dementia cases and is usually observed in people aged 60 years and above. Depending on the severity of symptoms, patients can be categorized as Cognitively Normal (CN), Mild Cognitive Impairment (MCI), or Alzheimer’s Disease (AD). AD is the last phase of the disease, in which the brain is severely damaged and patients are not able to live on their own. Radiomics is an approach to extracting a huge number of features from medical images with the help of data characterization algorithms. Here, 105 radiomic features are extracted and used to predict Alzheimer’s disease. This paper uses Support Vector Machine, K-Nearest Neighbour, Gaussian Naïve Bayes, eXtreme Gradient Boosting (XGBoost) and Random Forest to predict Alzheimer’s disease. The proposed random-forest-based approach with the radiomic features achieved an accuracy of 85%. It also achieved 88% accuracy, 88% recall, 88% precision and 87% F1-score for AD vs. CN; 72% accuracy, 73% recall, 72% precision and 71% F1-score for AD vs. MCI; and 69% accuracy, 69% recall, 68% precision and 69% F1-score for MCI vs. CN. The comparative analysis shows that the proposed approach performs better than the other approaches.
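A minimal sketch of the random-forest pipeline with scikit-learn; the radiomic feature matrix X (105 features per scan) and labels y (CN / MCI / AD) are assumed to be already extracted, and the hyperparameters are illustrative, not the paper's settings.

```python
# Hedged sketch of training and evaluating a random-forest classifier on radiomic features.
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report

def train_and_evaluate(X, y, seed=42):
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, stratify=y, random_state=seed)
    clf = RandomForestClassifier(n_estimators=200, random_state=seed)
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    print("accuracy:", accuracy_score(y_test, y_pred))
    print(classification_report(y_test, y_pred))   # per-class precision / recall / F1
    return clf
```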

14.
Context: The component field in a bug report provides important location information required by developers during bug fixes. Research has shown that incorrect component assignment for a bug report often causes problems and delays in bug fixes. A topic model technique, Latent Dirichlet Allocation (LDA), has been used to create a component recommender for bug reports. Objective: We seek to investigate a better way to use topic modeling in creating a component recommender. Method: This paper presents a component recommender that uses the proposed Discriminative Probability Latent Semantic Analysis (DPLSA) model and Jensen–Shannon divergence (DPLSA-JS). The proposed DPLSA model provides a novel method to initialize the word distributions for different topics: it uses the past assigned bug reports from the same component in the model training step, which results in a correlation between the learned topics and the components. Results: We evaluate the proposed approach on five open source projects: Mylyn, Gcc, Platform, Bugzilla and Firefox. The results show that the proposed approach on average outperforms the LDA-KL method by 30.08%, 19.60% and 14.13% for recall@1, recall@3 and recall@5, and outperforms the LDA-SVM method by 31.56%, 17.80% and 8.78% for recall@1, recall@3 and recall@5, respectively. Conclusion: Our experiments show that using comments in the DPLSA-JS recommender does not always contribute to performance, and that vocabulary size matters in DPLSA-JS: different projects need to set the vocabulary size adaptively through experimentation. In addition, the correspondence between the learned topics and components in DPLSA increases the discriminative power of the topics, which is useful for the recommendation task.
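A minimal sketch of the Jensen–Shannon ranking step, assuming the DPLSA model has already produced a topic distribution for the new bug report and one per component; the DPLSA training itself is not shown and the function names are illustrative.

```python
import numpy as np

# Hedged sketch: rank components by Jensen-Shannon divergence between the
# report's topic distribution and each component's topic distribution.

def kl(p, q, eps=1e-12):
    p = np.asarray(p, dtype=float) + eps
    q = np.asarray(q, dtype=float) + eps
    p, q = p / p.sum(), q / q.sum()
    return float(np.sum(p * np.log(p / q)))

def js_divergence(p, q):
    """p, q are assumed to be proper probability distributions over topics."""
    m = 0.5 * (np.asarray(p, dtype=float) + np.asarray(q, dtype=float))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def recommend(report_topics, component_topics, k=5):
    """component_topics: {component_name: topic_distribution}; smaller divergence ranks higher."""
    scored = [(name, js_divergence(report_topics, dist))
              for name, dist in component_topics.items()]
    return sorted(scored, key=lambda x: x[1])[:k]
```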

15.
This paper proposes a novel approach that uses a multi-objective evolutionary algorithm based on decomposition to address the ontology alignment optimization problem. Compared with an approach based on a Genetic Algorithm (GA), our method can simultaneously optimize three goals (maximizing the alignment recall, the alignment precision and the f-measure). The experimental results show that our approach can provide, in one execution, various alignments that are less biased toward any single evaluation of alignment quality than the GA approach; thus the quality of the alignments is clearly better than or equal to those given by the GA-based approach, which considers only precision, recall and f-measure, and by other multi-objective evolutionary approaches such as NSGA-II. In addition, our approach outperforms the NSGA-II approach with an average improvement of 32.79%. Comparing the quality of the alignments obtained by our approach with those of state-of-the-art ontology matching systems, we conclude that our approach is more effective and efficient.

16.
This paper describes a novel approach to morphological tagging for Korean, an agglutinative language with a very productive inflectional system. The tagger takes raw text as input and returns a lemmatized and morphologically disambiguated output for each word: the lemma is labeled with a part-of-speech (POS) tag and the inflections are labeled with inflectional tags. Unlike the standard approach to tagging for morphologically complex languages, in our proposed approach the tagging phase precedes the analysis phase. It comprises a trigram-based tagging component followed by a morphological rule application component, obtaining 95% precision and recall on unseen test data.

17.
The constituent morphemes of a separable trigger word can produce multiple legitimate separated forms through insertion, inversion, or omission, and these separated forms, like the original form, can also denote events. To extract events completely, this paper proposes a dependency-parsing-based algorithm for judging the legitimate separated forms of separable trigger words. The method first uses dependency parsing to examine the dependency constraints that a legitimate separated form of a separable trigger word is subject to in a sentence, then converts these constraints into computable judgment rules, and finally applies the rules to judge the legitimate separated forms of separable trigger words. Experimental results show that, before excluding sparse data, the precision, recall and F-measure of the method are 82.2%, 88.3% and 85.1% respectively; after excluding sparse data, they rise to 82.4%, 88.7% and 85.4%. The method is essentially ready for practical application.

18.
It has been shown in studies of biological synaptic plasticity that synaptic efficacy can change in a very short time window, compared to the time scale associated with typical neural events. This time scale is small enough to possibly have an effect on pattern recall processes in neural networks. We study properties of a neural network which uses a cyclic Hebb rule, and then add short-term potentiation of synapses in the recall phase. We show that this approach preserves the ability of the network to recognize the patterns stored by the network and that the network does not respond to other patterns at the same time. We show that this approach dramatically increases the capacity of the network at the cost of a longer pattern-recall process. We argue that the network possesses two types of recall: the fast recall does not need synaptic plasticity to recognize a pattern, while the slower recall utilizes synaptic plasticity. This is something that we all experience in our daily lives: some memories can be recalled promptly whereas recollection of other memories requires much more time.
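The abstract does not specify the cyclic Hebb rule or the short-term potentiation mechanism, so the following is only a generic Hopfield-style Hebbian store/recall loop to illustrate the kind of pattern-recall process being modified; it is not the paper's model.

```python
import numpy as np

# Generic Hebbian storage and iterative recall of +/-1 patterns.

def store(patterns):
    """Hebbian outer-product storage; patterns: (num_patterns, n) array of +/-1."""
    n = patterns.shape[1]
    W = np.zeros((n, n))
    for p in patterns:
        W += np.outer(p, p)
    np.fill_diagonal(W, 0.0)          # no self-connections
    return W / patterns.shape[0]

def recall(W, probe, steps=20):
    """Iterative synchronous recall from a noisy probe pattern."""
    s = probe.copy()
    for _ in range(steps):
        s = np.sign(W @ s)
        s[s == 0] = 1
    return s
```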

19.
To let knowledge base search go beyond the limitations of traditional keyword-based queries, this paper proposes an ontology-based semantic expansion search method for knowledge bases. Ontologies and semantic expansion are introduced into the knowledge base so that user query conditions are expanded before searching, and the search results are ranked by relevance analysis, optimizing the search effect. Experimental results show that the method improves both search recall and precision.

20.
Addressing the difficulties of Chinese person-name recognition, this paper proposes a recognition method based on the maximum entropy algorithm that combines multiple knowledge sources and multiple models, taking full account of the internal features of person names (small-granularity features) and their contextual information. The main contributions are: probability information is incorporated into the maximum entropy model, greatly improving the precision and recall of person-name recognition; the classification model is refined, dividing person-name recognition into recognition of Chinese person names, transliterated foreign names, and single-character person names; and a dynamic priority method is proposed to prevent a transliterated foreign name from being partially recognized as one or more Chinese person names. The test data are the January 1998 People's Daily corpus and the SIGHAN (2006) named entity test corpora. The results show a recall of 90.06% and precision of 89.27% on the People's Daily (1998-01) corpus, 95.39% recall and 96.71% precision on the SIGHAN (MSRA) corpus, and 87.56% recall and 91.04% precision on the SIGHAN (LDC) corpus. The experimental results demonstrate that the proposed person-name recognition method is highly effective.
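A minimal sketch of a maximum-entropy name classifier, using multinomial logistic regression (equivalent to a maximum-entropy model) over internal-character and context features; the feature templates and the label set below are illustrative, not the paper's exact design.

```python
# Hedged sketch: maximum-entropy classification of name candidates via
# scikit-learn's multinomial logistic regression over dictionary features.
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

def features(candidate, prev_word, next_word):
    """Internal (small-granularity) and contextual features of a candidate name."""
    return {
        "first_char": candidate[0],   # e.g. surname character
        "length": len(candidate),
        "prev": prev_word,            # left context word
        "next": next_word,            # right context word
    }

def build_model():
    # Labels could be e.g. "chinese_name", "transliterated_name",
    # "single_char_name", "not_name" (hypothetical label set).
    return make_pipeline(DictVectorizer(), LogisticRegression(max_iter=1000))

# Usage sketch:
#   X = [features(c, p, n) for (c, p, n) in training_candidates]
#   model = build_model().fit(X, y_labels)
#   model.predict([features("张三", "记者", "报道")])
```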
