Similar literature
Found 20 similar documents; search took 24 ms
1.
Machine Translation - Learning bilingual word embeddings can be much easier if the parallel corpora are available with their words well aligned explicitly. However, in most cases, the parallel...

2.
Li Yuling, Zhang Yuhong, Yu Kui, Hu Xuegang 《Applied Intelligence》2021,51(11):7666-7678
Applied Intelligence - Recent studies have managed to learn cross-lingual word embeddings in a completely unsupervised manner through generative adversarial networks (GANs). These GANs-based...

3.
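Existing deep-learning neural network models are usually trained on a single corpus. To address this, a large-scale multi-corpus joint learning method for Chinese word segmentation is proposed. The corpora are the simplified Chinese datasets (PKU, MSRA, CTB6) and the traditional Chinese datasets (CITYU, AS); a pair of marker tokens is added at the beginning and end of every input sentence of each dataset. Experiments with BLSTM (bidirectional long short-term memory) and CRF (conditional random field) models, trained on each dataset separately and on multiple corpora jointly, show that large-scale multi-corpus joint training achieves good segmentation results.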
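
The marker step in the abstract above can be illustrated with a minimal sketch. The marker format (`<PKU>` ... `</PKU>`), the function names, and the toy sentences are assumptions made for illustration; the abstract does not say what the markers look like.

```python
from typing import Dict, List


def mark_sentence(sentence: str, corpus: str) -> List[str]:
    """Split a sentence into characters and wrap it in corpus markers.

    The <NAME> ... </NAME> marker format is hypothetical.
    """
    return [f"<{corpus}>"] + list(sentence) + [f"</{corpus}>"]


def build_joint_corpus(corpora: Dict[str, List[str]]) -> List[List[str]]:
    """Merge several corpora into one training set of marked sentences."""
    joint = []
    for name, sentences in corpora.items():
        for sentence in sentences:
            joint.append(mark_sentence(sentence, name))
    return joint


# Toy data: one simplified-Chinese and one traditional-Chinese sentence.
corpora = {"PKU": ["我爱北京"], "AS": ["我愛臺灣"]}
joint = build_joint_corpus(corpora)
print(joint[0])
```

With markers of this kind, a single model can see sentences from all corpora while still being told which annotation standard each sentence follows.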

4.
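Traditional sentiment analysis research is mostly based on machine learning algorithms, which depend on large numbers of hand-crafted features and domain knowledge. Here a convolutional neural network is used to learn feature representations of text automatically and then to determine the sentiment polarity of the text. To address the shortage of supervised training samples in sentiment analysis, large-scale weakly supervised data are used to train the convolutional neural network. A "pre-training then fine-tuning" strategy is also introduced: the network is first pre-trained on the weakly supervised dataset and then fine-tuned on the supervised dataset to overcome the noise in the weakly supervised data. Experiments on the SemEval-2013 Twitter sentiment analysis dataset show that introducing weakly supervised data into training effectively strengthens the network's ability to learn sentiment semantics and thereby improves the model's accuracy.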

5.
Neural Computing and Applications - Classifying e-mails into distinct labels can have a great impact on customer support. By using machine learning to label e-mails, the system can set up queues...

6.
Many studies have sought to explore urban spatial structure, and urban functional regions have become the subject of increasing attention among planners, engineers and public officials. Attempts have been made to identify urban functional regions using high spatial resolution (HSR) remote sensing images and extensive geo-data. However, the research scale and throughput have been limited by the accessibility of HSR remote sensing data. Recently, big geo-data have become increasingly popular for urban studies because they are accessible and objective. This study aims to build a novel framework that provides an alternative solution for sensing urban spatial structure and discovering urban functional regions, based on emerging geo-data (points of interest, POIs) and an embedding learning method from the natural language processing (NLP) field. We started by constructing the intraurban functional corpus using an approach based on center-context pairs. In the second step, a word embedding representation model trained on that corpus was used to extract multiprototype vectors, and the last step aggregated the functional parcels with a spatial clustering method, hierarchical density-based spatial clustering of applications with noise (HDBSCAN). The clustering results suggest that the proposed framework is capable of discovering the utilization of urban space with a reasonable level of accuracy. The limitations and potential improvements of the proposed framework are also discussed.

7.
Journal of Intelligent Information Systems - Medical free-text records store a lot of useful information that can be exploited in developing computer-supported medicine. However, extracting the...

8.
9.
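This work focuses on applying deep learning to Tibetan word segmentation, using several deep neural network models: the recurrent neural network (RNN), bidirectional RNN (Bi RNN), stacked RNN, long short-term memory (LSTM), and encoder-labeler LSTM. The models are evaluated on a segmentation corpus consisting mainly of legal texts, government documents, and news. The experimental data show that the encoder-labeler LSTM gives the best segmentation results, with a precision of 92.96%, a recall of 93.30%, and an F-value of 93.13%.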
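
As a quick check on the figures in the abstract above, the following sketch assumes the reported F-value is the usual balanced F1 (the harmonic mean of precision and recall); the abstract does not state the formula, and the function name is illustrative.

```python
def f_measure(precision: float, recall: float) -> float:
    """Balanced F-measure (F1): harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall as reported for the encoder-labeler LSTM, in percent.
f = f_measure(92.96, 93.30)
print(round(f, 2))  # 93.13
```

Under that assumption the computed value agrees with the reported F-value of 93.13%.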

10.
Distant supervision relation extraction (DSRE) trains a classifier by automatically labeling data through aligning triples in the knowledge base (KB) with large-scale corpora. Training data generated by distant supervision may contain many mislabeled instances, which harms the training of the classifier. Some recent methods show that relevant background information in KBs, such as entity type (e.g., Organization and Book), can improve the performance of DSRE. However, there are three main problems with these methods. Firstly, these methods are tailored to a specific type of information, which has a positive effect on only some of the instances and does not benefit all cases. Secondly, different background information is embedded independently, and no reasonable interaction is achieved. Thirdly, previous methods do not consider the side effect of the noise introduced by background information. To address these issues, we leverage five types of background information instead of the single type used in previous works and propose a novel edge-reasoning hybrid graph (ER-HG) model to realize reasonable interaction between different kinds of information. In addition, we employ an attention mechanism in the ER-HG model to alleviate the side effect of noise. The ER-HG model integrates all types of information efficiently and is very robust to noise in the information. We conduct experiments on two widely used datasets. The experimental results demonstrate that our model significantly outperforms the state-of-the-art methods in held-out metrics and robustness tests.
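
The automatic labeling step that distant supervision relies on, and the way it produces mislabeled instances, can be seen in a minimal sketch. It uses plain substring matching; the knowledge-base triple, the sentences, and the function name are made up for illustration, and this is not the ER-HG model.

```python
from typing import List, Tuple

Triple = Tuple[str, str, str]  # (head entity, relation, tail entity)


def distant_label(sentences: List[str], kb: List[Triple]) -> List[Tuple[str, str, str, str]]:
    """Label every sentence that mentions both entities of a KB triple."""
    labeled = []
    for sentence in sentences:
        for head, relation, tail in kb:
            if head in sentence and tail in sentence:
                labeled.append((sentence, head, tail, relation))
    return labeled


kb = [("Steve Jobs", "founder_of", "Apple")]
sentences = [
    "Steve Jobs co-founded Apple in 1976.",  # correctly labeled
    "Steve Jobs bought an Apple laptop.",    # labeled too, but wrongly
    "Tim Cook leads Apple today.",           # only one entity: skipped
]
labeled = distant_label(sentences, kb)
for item in labeled:
    print(item)
```

The second sentence receives the `founder_of` label although it does not express that relation, which is the kind of noise the abstract refers to.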

11.
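To address polysemy and synonymy in single word embeddings, a method for constructing topical word embeddings based on the HDP topic model is proposed, using Khmer as an example. Topic information is incorporated on top of single word embeddings: first, the HDP topic model assigns a topic label to each word; the labels are then treated as pseudo-words and fed into the Skip-Gram model together with the words, so that topic vectors and word vectors are trained simultaneously; finally, the topic vector carrying the text's topic information is concatenated with the trained word vector to obtain a topical word embedding for each word in the text. Compared with word embedding models without topic information, the method performs better on both word similarity and text classification, and the resulting topical word embeddings carry more semantic information.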
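
The final concatenation step described above can be sketched as follows. The vectors here are made-up toy values standing in for trained Skip-Gram output, and the names (`word_vecs`, `topic_vecs`, `topical_word_vector`) are hypothetical.

```python
from typing import Dict, List

# Toy stand-ins for vectors that would come from Skip-Gram training.
word_vecs: Dict[str, List[float]] = {"bank": [0.2, 0.7, 0.1]}
topic_vecs: Dict[str, List[float]] = {
    "topic_finance": [0.9, 0.1],
    "topic_river": [0.1, 0.8],
}


def topical_word_vector(word: str, topic: str) -> List[float]:
    """Concatenate a word vector with the vector of its assigned topic."""
    return word_vecs[word] + topic_vecs[topic]


# The same word receives different vectors under different topics.
finance_bank = topical_word_vector("bank", "topic_finance")
river_bank = topical_word_vector("bank", "topic_river")
print(finance_bank)
print(river_bank)
```

Because the topic part differs, one surface word can be represented by several vectors, which is how the approach separates senses.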

12.
13.
刘春丽  李晓戈  刘睿  范贤  杜丽萍 《计算机应用》2016,36(10):2794-2798
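To improve the accuracy of Chinese word segmentation and the recognition rate of out-of-vocabulary (OOV) words, a Chinese word segmentation system based on character representation learning is proposed. First, the Skip-gram model maps the words in the text to vectors in a high-dimensional vector space; next, the K-means clustering algorithm clusters the word vectors, and the clustering results are used as features to train a conditional random field (CRF) model; finally, segmentation and OOV recognition are performed based on that language model. The effects of the embedding dimensionality, the number of clusters, and different clustering algorithms on segmentation are analyzed. Tests on the microblog evaluation corpus provided by the Fourth Conference on Natural Language Processing and Chinese Computing (NLPCC2015) show that, without using external knowledge, the segmentation F-value and the OOV recognition rate reach 95.67% and 94.78%, respectively, demonstrating that adding character cluster features to the CRF model effectively improves segmentation performance on short Chinese texts.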

14.
Word embedding has been a great success story for natural language processing in recent years. The main purpose of this approach is to provide a vector representation of words based on neural network language modeling. Using a large training corpus, the model, namely the Skip-gram model, learns mostly from co-occurrences of words and captures their semantic features. Moreover, by adding the recently introduced character embedding model to the objective function, the model can also focus on morphological features of words. In this paper, we study the impact of the training corpus on the results of word embedding and show how the genre of the training data affects the type of information captured by word embedding models. We perform our experiments on the Persian language. As part of our experiments, we provide two well-known evaluation datasets for Persian, namely Google semantic/syntactic analogy and Wordsim353, which is also a contribution of this paper. The experiments include computing word embeddings from various public Persian corpora of different genres and sizes, with a comprehensive lexical and semantic comparison between them. We identify words whose usage differs between these datasets, resulting in totally different vector representations, which in turn has a significant impact on different domains: the results vary by up to 9% on Google analogy and up to 6% on Wordsim353. The resulting word embeddings for each of the individual corpora, as well as for their combinations, will be publicly available for further research on word embedding for Persian.

15.
修驰  宋柔 《计算机应用》2013,33(3):780-783
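In Chinese natural language processing, word segmentation in specialized domains is far harder than in the general domain. In particular, no effective solution has been found for segmentation ambiguity in specialized domains. To address this problem, an unsupervised-learning method for resolving segmentation ambiguity in specialized domains is proposed. String frequency, mutual information, and boundary entropy computed from the test corpus itself serve as the criteria for evaluating segmentation ambiguities, and the three kinds of information are used both individually and in combination to resolve them. Experimental results show that the method effectively resolves segmentation ambiguity in specialized domains and clearly improves segmentation performance.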
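
Of the three statistics named above, boundary entropy is the least familiar, so here is a minimal sketch of one common definition: the entropy of the characters that follow a candidate string in the corpus. The abstract does not give its exact formula, so this definition, the function name, and the toy text are assumptions.

```python
import math
from collections import Counter


def right_boundary_entropy(text: str, candidate: str) -> float:
    """Entropy (in bits) of the characters that follow `candidate` in `text`."""
    followers = Counter()
    start = text.find(candidate)
    while start != -1:
        nxt = start + len(candidate)
        if nxt < len(text):
            followers[text[nxt]] += 1
        start = text.find(candidate, start + 1)
    total = sum(followers.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in followers.values())


text = "abcabdabcabe"
# "ab" is followed by c, d, c, e: a varied right context.
print(right_boundary_entropy(text, "ab"))   # 1.5
# "abc" is always followed by "a": a fixed right context.
print(right_boundary_entropy(text, "abc"))
```

A high value suggests the string ends at a likely word boundary, while a value near zero suggests it is a fragment of a longer unit.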

16.
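As the most popular open-source instruction set architecture of recent years, RISC-V is widely used in microprocessors for specific domains, especially for modular customization in machine learning. However, existing RISC-V applications require traditional software or models to be recompiled or optimized for the RISC-V instruction set, so how to quickly deploy, run, and test machine learning frameworks on the RISC-V architecture is a technical problem that urgently needs solving. Virtualization technology can solve cross-platform...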

17.
于东  刘春花  田悦 《计算机应用》2016,36(2):455-459
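To extract the title and career-history attributes of specified persons from unstructured text, an attribute extraction method based on distant supervision and pattern matching is proposed. The method describes the features of a person's titles and career history at two levels, string patterns and dependency patterns, and divides the problem into two stages. First, distant supervision knowledge and manually annotated knowledge are used to mine a high-coverage pattern library for discovering title and career-history attributes and extracting a candidate set. Second, filtering rules for the candidate set are designed from the textual adjacency relations between attributes such as titles and organizations, and from the dependency relations between the specified person and the candidate attributes, to screen the candidates and achieve high-precision attribute extraction. Experimental results show that the proposed method achieves an F-value of 55.37% on the CLP2014-PAE test set, significantly higher than the best result in the evaluation (F-value 34.38%) and a supervised sequence labeling method based on conditional random fields (CRF) (F-value 43.79%), indicating that the method can mine and extract title and career-history attributes from unstructured documents with high coverage.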

18.
Pattern Analysis and Applications - In manifold learning, the intrinsic geometry of the manifold is explored and preserved by identifying the optimal local neighborhood around each observation. It...

19.
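Considering the multiple semantics of relations and the certainty between different entities and relations, a knowledge graph representation method for multi-semantic relations, TransC, is proposed. Relations are divided into multiple semantics and a Gaussian mixture model of each relation is built; the corresponding cloud model is constructed to obtain the linguistic value and certainty that best express the relation; the certainty is used as a weight, with the weighted Euclidean distance as the new scoring function; and extensive experiments on link prediction and triple classification are conducted on several real benchmark datasets. The experimental results show that TransC outperforms existing models and methods on all metrics.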

20.
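The current mainstream approach to Chinese word segmentation is traditional machine learning based on character tagging. However, traditional machine learning methods require features to be manually configured and extracted from Chinese text, and they suffer from high lexicon dimensionality and long training times when models are trained on the CPU alone. To address these problems, an improved method based on the LSTM (Long Short-Term Memory) network model is proposed, which uses different word-position tag sets and adds pre-trained character embeddings for Chinese word segmentation. Comparative experiments on corpora commonly used in Chinese word segmentation evaluation show that the LSTM-based method outperforms current traditional machine learning methods; that six-position tagging combined with pre-trained character embeddings gives the best segmentation performance; that using a GPU greatly shortens the training time of deep neural network models; and that the LSTM-based method is also easier to generalize to other sequence labeling tasks in natural language processing (NLP).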
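
Character tagging turns a segmented sentence into one tag per character. The sketch below assumes the six-position tag set is {B, B2, B3, M, E, S}, a scheme often used in the segmentation literature; the abstract does not list its tags, so both the tag names and the function names are assumptions.

```python
from typing import List


def six_tag(word: str) -> List[str]:
    """Tag each character of one word with the assumed six-position scheme.

    S marks a single-character word; B, B2, B3 mark the first three
    characters, M any further middle characters, and E the last character.
    """
    n = len(word)
    if n == 0:
        return []
    if n == 1:
        return ["S"]
    head = ["B", "B2", "B3"]
    inner = n - 1  # number of characters before the final one
    tags = head[:inner] + ["M"] * max(0, inner - len(head))
    return tags + ["E"]


def tag_words(words: List[str]) -> List[str]:
    """Flatten per-word tags into one tag sequence for the sentence."""
    return [tag for word in words for tag in six_tag(word)]


tags = tag_words(["我", "喜欢", "自然语言处理"])
print(tags)
```

A sequence model such as an LSTM is then trained to predict these tags from the characters, and word boundaries are read back from the predicted E and S tags.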
