19 similar documents retrieved; search time 203 ms
1.
Deep neural networks (DNNs) are currently the mainstream approach to Chinese word segmentation, but a model trained on one domain suffers a marked performance drop when applied to another, owing to cross-domain out-of-vocabulary (OOV) words and the expression gap; in practice, manually annotating corpora and training models for every unknown domain is infeasible. To address this, the paper builds a cross-domain Chinese word segmentation system based on new-word discovery, which automatically extracts new words from the target-domain corpus, annotates the corpus, and trains the network model. In addition, to counter the many garbage strings in the word lists produced by existing new-word discovery algorithms and the noisy samples in automatically annotated corpora, it proposes an unsupervised new-word discovery algorithm based on vector-enhanced mutual information and weighted adjacency entropy, together with a Chinese word segmentation model based on adversarial training. In experiments, features learned by a network trained on the open-source PKU news corpus are transferred to the medical, patent, and novel domains; the results show that the proposed method outperforms existing methods in OOV rate, precision, recall, and segmentation F-score.
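The abstract's unsupervised scoring builds on two classic quantities: mutual information (internal cohesion of a candidate string) and adjacency entropy (freedom of its boundaries). The sketch below implements the plain baseline versions of both, not the paper's vector-enhanced and weighted variants; thresholds and window size are illustrative assumptions.

```python
import math
from collections import Counter

def candidate_scores(corpus, max_len=4, min_count=2):
    """Score substrings as new-word candidates by combining PMI (cohesion)
    with adjacency entropy (boundary freedom). This is the classic baseline;
    the paper's vector-enhanced MI and weighted adjacency entropy refine
    these two quantities."""
    n = len(corpus)
    counts = Counter()
    left, right = {}, {}
    for length in range(1, max_len + 1):
        for i in range(n - length + 1):
            w = corpus[i:i + length]
            counts[w] += 1
            if length > 1:
                left.setdefault(w, Counter())
                right.setdefault(w, Counter())
                if i > 0:
                    left[w][corpus[i - 1]] += 1
                if i + length < n:
                    right[w][corpus[i + length]] += 1

    def entropy(counter):
        total = sum(counter.values())
        if total == 0:
            return 0.0
        return -sum(c / total * math.log(c / total) for c in counter.values())

    scores = {}
    for w, c in counts.items():
        if len(w) < 2 or c < min_count:
            continue
        # PMI against the most probable binary split of the string
        p_w = c / n
        best_split = max(
            (counts[w[:k]] / n) * (counts[w[k:]] / n) for k in range(1, len(w))
        )
        pmi = math.log(p_w / best_split)
        # the weaker of the two boundary entropies limits word-hood
        ent = min(entropy(left[w]), entropy(right[w]))
        scores[w] = pmi + ent
    return scores
```

A true word such as 深度学习 scores higher than a fragment such as 度学, since the fragment's neighbors are fixed and its adjacency entropy collapses to zero.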
2.
3.
《现代电子技术》2019,(1):95-99
Mainstream Chinese word segmentation methods are based on supervised learning algorithms, which require large amounts of manually annotated data and extract local features that suffer from sparsity. To address these problems, a bidirectional long short-term memory conditional random field (BI_LSTM_CRF) model is proposed: it learns text features automatically and models contextual dependencies, while the CRF layer takes the label transitions before and after each character into account and performs inference over the text. The model not only achieves good segmentation results on the MSRA, PKU, and CTB 6.0 datasets; experiments on news, Weibo, automotive-forum, and restaurant-review data further show that BI_LSTM_CRF both segments the test sets well and generalizes well to cross-domain data.
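The CRF layer of such a model emits one BMES tag per character; turning that tag sequence back into words is a small deterministic step. The sketch below decodes BMES output, independent of any particular network framework; the lenient handling of malformed sequences is an assumption, not part of the paper.

```python
def bmes_to_words(chars, tags):
    """Decode a BMES tag sequence (as produced by a BiLSTM-CRF segmenter)
    into a word list. B = begin, M = middle, E = end of a multi-character
    word; S = single-character word. Malformed sequences are handled
    leniently by flushing the current buffer."""
    words, buf = [], ""
    for ch, tag in zip(chars, tags):
        if tag == "S":
            if buf:
                words.append(buf)
                buf = ""
            words.append(ch)
        elif tag == "B":
            if buf:
                words.append(buf)
            buf = ch
        elif tag == "M":
            buf += ch
        else:  # "E" closes the current word
            words.append(buf + ch)
            buf = ""
    if buf:
        words.append(buf)
    return words
```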
4.
Hierarchical Chinese address segmentation is fundamental to Chinese address standardization and an important tool for geocoding, and it is a focus of both Chinese word segmentation and geographic research. High-quality methods for extracting Chinese address hierarchies usually depend on large amounts of manually annotated data, and obtaining annotated datasets is time-consuming, expensive, and hard to achieve. To address this, the paper proposes a confidence-based active-learning hybrid model combining bidirectional long short-term memory and conditional random fields (Active-BiLSTM-C...
5.
6.
7.
8.
9.
Starting from the perspective of Chinese input-method applications, this paper first describes the N-gram model and analyzes word segmentation for Chinese input methods in detail, then applies a back-off model to smooth the sparse training data. To cope with the limited size of the system lexicon, it builds a word lattice and uses an A*-based algorithm to find the k best paths. Experiments show that the A*-based algorithm is more efficient and more accurate than common k-best algorithms such as the improved Dijkstra algorithm and DP-based algorithms, opening a new direction for research on Chinese word segmentation and k-best search.
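The k-best search over a word lattice can be sketched with a priority queue. Fixing the A* heuristic at zero reduces the search to uniform-cost enumeration, a special case of the abstract's formulation; the lexicon and the per-word costs below are toy stand-ins for what an N-gram model with back-off smoothing would supply.

```python
import heapq

def k_best_segmentations(sentence, lexicon, costs, k=3):
    """Enumerate the k lowest-cost segmentations of a sentence over a word
    lattice. With the heuristic fixed at 0 this is uniform-cost search; in
    the full system `costs` would be negative log-probabilities from a
    smoothed N-gram model."""
    n = len(sentence)
    heap = [(0.0, 0, [])]  # (cost so far, position reached, words so far)
    results = []
    while heap and len(results) < k:
        cost, pos, words = heapq.heappop(heap)
        if pos == n:
            results.append((cost, words))  # popped in nondecreasing cost order
            continue
        for end in range(pos + 1, n + 1):
            w = sentence[pos:end]
            if w in lexicon:
                heapq.heappush(heap, (cost + costs.get(w, 10.0), end, words + [w]))
    return results
```

For the classic ambiguity 研究生命 ("research / life" vs. "graduate student / fate"), the cheaper path surfaces first.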
10.
To solve the problem of computing grammatical information for combat orders, the study of their grammar adopts a method that maps each string of sentence components to a unique integer; the algorithm computes while scanning and uses layered computation, improving performance and shortening computation time. The model is simple to compute and can be applied to language-information processing for other genres of text.
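One minimal way to realize "map a component string to a unique integer, computed while scanning" is to treat the sequence as digits of a number in a fixed base, folding it up in a single left-to-right pass. The component inventory below is a hypothetical example; the paper's actual grammar categories may differ.

```python
def encode_components(components, inventory):
    """Map a sentence-component sequence to a unique integer by reading it
    as a number in base len(inventory)+1, accumulated in one left-to-right
    scan (the "compute while scanning" idea). Adding 1 to each code keeps
    sequences of different lengths distinct."""
    base = len(inventory) + 1
    code = 0
    for comp in components:
        code = code * base + inventory[comp] + 1
    return code
```

Distinct component sequences always yield distinct integers, so downstream grammar computations can operate on plain integer keys.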
11.
Chinese Part-of-Speech Tagging Based on Conditional Random Fields
In recent years conditional random fields have been widely applied to labeling all kinds of sequence data. When CRFs model context for Chinese part-of-speech tagging, the templates can expand into hundreds of millions of features. Building on an in-depth analysis of how these features arise, the feature template set is optimized, and CRFs are used to study the relationships among the chosen template set, the number of expanded features, the size of the trained model, and tagging accuracy. Experiments show that the optimized template set achieves the best overall results in model training time, trained-model size, and tagging accuracy.
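The feature explosion the abstract describes comes from template expansion: each template names a set of relative offsets, and every (template, context) combination at every position becomes one feature string. A simplified expansion step might look like the following; the template names and offsets are illustrative, not the paper's optimized set.

```python
def apply_templates(chars, i, templates):
    """Expand feature templates at position i of a character sequence.
    Each template is (name, list of relative offsets); out-of-range
    positions are padded. The cross product of templates and observed
    contexts is what blows up the feature space in CRF tagging."""
    feats = []
    for name, offsets in templates:
        parts = []
        for off in offsets:
            j = i + off
            parts.append(chars[j] if 0 <= j < len(chars) else "<PAD>")
        feats.append(f"{name}={'|'.join(parts)}")
    return feats
```

Optimizing the template set means pruning offsets and conjunctions so that fewer, more informative feature strings are generated per position.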
12.
Semantic segmentation is a prominent problem in scene understanding, expressed as a dense labeling task, with deep learning models being one of the main methods used to solve it. Traditional training algorithms for semantic segmentation models produce less than satisfactory results when not combined with post-processing techniques such as CRFs. In this paper, we propose a method to train segmentation models using an approach that utilizes classification information in the training process of the segmentation network. Our method employs a classification network that detects the presence of classes in the segmented output. These class scores are then used to train the segmentation model. The method is motivated by the fact that conditioning the training of the segmentation model on these scores allows higher-order features to be captured. Our experiments show significantly improved performance of the segmentation model on the CamVid and CityScapes datasets with no additional post-processing.
13.
A hybrid approach to English Part-of-Speech (PoS) tagging, with its target application being English-Chinese machine translation in the business domain, is presented, demonstrating how an existing tagger can be adapted to learn from a small amount of data and handle unknown words for the purpose of machine translation. A small 998k English annotated corpus in the business domain is built semi-automatically based on a new tagset; the maximum entropy model is adopted, and a rule-based approach is used in post-processing. The tagger is further applied in Noun Phrase (NP) chunking. Experiments show that our tagger achieves an accuracy of 98.14%, which is a quite satisfactory result. In the application to NP chunking, the tagger gives rise to a 2.21% increase in F-score, compared with the results using the Stanford tagger.
14.
Identifying gene names is an attractive research area of biology computing. However, accurate extraction of gene names is a challenging task given the lack of conventions for describing gene names. We devise a systematic architecture and apply a model using conditional random fields (CRFs) for extracting gene names from Medline. To improve performance, biomedical ontology features are inserted into the model, and post-processing, including boundary adjusting and a word filter, is presented to solve the name-overlapping problem and remove false-positive single words. A pure string-match method, baseline CRFs, and CRFs with our methods are applied to human gene name and HIV gene name extraction, respectively, in 1100 Medline abstracts, and their performances are contrasted. Results show that CRFs are robust for unseen gene names. Furthermore, CRFs with our methods outperform the other methods, with precision 0.818 and recall 0.812.
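The post-processing the abstract names, boundary adjusting plus filtering of false-positive single words, can be sketched as a small cleanup pass. The stripping characters and the stop-word list below are hypothetical examples, not the paper's actual filter.

```python
def postprocess(entities, stop_words):
    """Boundary adjusting plus single-word filtering, in the spirit of the
    post-processing described in the abstract: trim stray punctuation off
    entity boundaries, then drop single-token entities that are generic
    biomedical words rather than gene names."""
    cleaned = []
    for ent in entities:
        ent = ent.strip(" ,;()[]")           # boundary adjusting
        if not ent:
            continue
        if " " not in ent and ent.lower() in stop_words:
            continue                          # false-positive single word
        cleaned.append(ent)
    return cleaned
```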
15.
Estimator learning automata for feature subset selection in high-dimensional spaces, case study: Email spam detection
Seyyed Hossein Seyyedi, Behrouz Minaei-Bidgoli 《International Journal of Communication Systems》2018,31(8)
One of the difficult challenges facing data miners is that algorithm performance degrades if the feature space contains redundant or irrelevant features. Therefore, as a critical preprocessing task, dimension reduction is used to build a smaller space containing valuable features. There are two different approaches to dimension reduction: feature extraction and feature selection, the latter itself divided into wrapper and filter approaches. In high-dimensional spaces, feature extraction and wrapper approaches are not applicable due to their time complexity. On the other hand, the filter approach suffers from inaccuracy; one main reason is that the subset's size is not determined with the specifications of the problem in mind. In this paper, we propose ESS (estimator learning automaton-based subset selection) as a new method for feature selection in high-dimensional spaces. The innovation of ESS is that it combines wrapper and filter ideas and uses estimator learning automata to efficiently determine a feature subset that leads to a desirable tradeoff between the accuracy and efficiency of the learning algorithm. To find a qualified subset for a particular processing algorithm on an arbitrary dataset, ESS uses an automaton to score each candidate subset by the scale of the subset and the accuracy of the learning algorithm using it; in the end, the subset with the highest score is returned. We have used ESS for feature selection in the framework of spam detection, a text classification task for email as a pervasive communication medium. The results show that the stated goal is achieved.
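The core scoring idea, rewarding a candidate subset's accuracy while penalizing its scale, can be sketched without the automaton machinery. The exhaustive loop and the linear penalty `alpha` below are simplifying assumptions standing in for ESS's estimator-learning-automaton search.

```python
def best_subset(candidates, evaluate, alpha=0.02):
    """Score each candidate feature subset by the learning algorithm's
    accuracy on it, penalized by subset size, and return the highest-scoring
    subset. A plain exhaustive stand-in for the estimator-learning-automaton
    search in ESS; `alpha` sets the accuracy-vs-scale tradeoff."""
    scored = []
    for subset in candidates:
        acc = evaluate(subset)               # e.g. spam-classifier accuracy
        score = acc - alpha * len(subset)    # reward accuracy, penalize scale
        scored.append((score, sorted(subset)))
    scored.sort(reverse=True)
    return scored[0][1]
```

With a suitable `alpha`, a marginal accuracy gain from adding more features no longer wins, which is exactly the tradeoff the abstract targets.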
16.
17.
Wang Haochang, Zhao Tiejun, Li Sheng, Yu Hao 《Journal of Electronics (China)》2007,24(6):838-844
Named entity recognition is a fundamental task in biomedical data mining. In this letter, a named entity recognition system based on CRFs (Conditional Random Fields) for biomedical texts is presented. The system makes extensive use of a diverse set of features, including local features, full-text features, and external resource features. All features incorporated in the system are described in detail, and the impacts of different feature sets on its performance are evaluated. To improve performance, post-processing modules are exploited to deal with abbreviation phenomena, cascaded named entities, and boundary-error identification. Evaluation of the system showed that feature selection has an important impact on performance, and that the post-processing explored contributes substantially to achieving better results.
18.
To improve the efficiency of high-resolution image segmentation and to address incomplete segmentation targets caused by low foreground-background contrast near object edges in complex patterns, this paper introduces superpixel HOG features and proposes a fast image segmentation algorithm based on superpixel multi-feature fusion (SMFF). The image to be segmented is first pre-segmented with a state-of-the-art superpixel algorithm; superpixel-level HOG, Lab color, and spatial-position features are then extracted and a superpixel-based multi-feature metric is designed; finally, graph-cut theory is applied to achieve fast segmentation based on superpixel multi-feature fusion. Experimental results validate the algorithm: its segmentation quality approaches that of the most classic segmentation algorithms, while its running time is clearly better than the compared methods.
19.
Design-based texture feature fusion using Gabor filters and co-occurrence probabilities.
A design-based method to fuse Gabor filter and grey level co-occurrence probability (GLCP) features for improved texture recognition is presented. The fused feature set exploits both the Gabor filter's capability of accurately capturing lower- and mid-frequency texture information and the GLCP's capability of capturing texture information in higher-frequency components. Evaluation methods include comparing feature-space separability and comparing image segmentation classification rates. The fused feature sets are demonstrated to produce higher feature-space separations as well as higher segmentation accuracies relative to the individual feature sets, and they also outperform the individual sets on noisy images across different noise magnitudes. The curse of dimensionality is demonstrated not to affect segmentation using the proposed 48-dimensional fused feature set. Gabor magnitude responses produce higher segmentation accuracies than linearly normalized Gabor magnitude responses. Feature reduction using principal component analysis is acceptable for maintaining segmentation performance, but feature reduction using the feature-contrast method dramatically reduced segmentation accuracy. Overall, the designed fused feature set is advocated as a means of improving texture segmentation performance.
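The GLCP side of the fusion reduces to counting co-occurring grey-level pairs at a fixed offset, normalizing to probabilities, and deriving statistics such as contrast and energy; fusion is then concatenation with the Gabor magnitude responses. A minimal sketch, assuming a single offset and only two statistics (the paper's 48-dimensional set combines many more channels):

```python
def glcp_features(image, offset=(0, 1)):
    """Grey-level co-occurrence probabilities for one offset on a 2-D list
    of integer grey levels, reduced to two standard statistics: contrast
    (weighted squared level difference) and energy (sum of squared
    probabilities)."""
    rows, cols = len(image), len(image[0])
    dr, dc = offset
    counts, total = {}, 0
    for r in range(rows):
        for c in range(cols):
            r2, c2 = r + dr, c + dc
            if 0 <= r2 < rows and 0 <= c2 < cols:
                pair = (image[r][c], image[r2][c2])
                counts[pair] = counts.get(pair, 0) + 1
                total += 1
    probs = {pair: n / total for pair, n in counts.items()}
    contrast = sum(p * (i - j) ** 2 for (i, j), p in probs.items())
    energy = sum(p * p for p in probs.values())
    return [contrast, energy]

def fuse(gabor_vec, glcp_vec):
    """Design-based fusion: concatenate the Gabor magnitude-response vector
    with the GLCP statistics into one feature set."""
    return list(gabor_vec) + list(glcp_vec)
```

A perfectly uniform patch has zero contrast and maximal energy, which is the sanity check below.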