Found 20 similar documents (search time: 15 ms)
1.
I. V. Mashechkin M. I. Petrovskiy D. S. Popov D. V. Tsarev 《Programming and Computer Software》2011,37(6):299-305
This paper considers state-of-the-art methods of automatic text summarization that build summaries in the form of generic
extracts. The original text is represented as a numerical matrix. Matrix columns correspond to
text sentences, and each sentence is represented as a vector in the term space. Latent semantic analysis
is then applied to this matrix to construct a representation of the sentences in the topic space. The dimensionality of the topic
space is much lower than that of the initial term space. The most important sentences are chosen
on the basis of their representation in the topic space, and their number is determined by the required length
of the summary. The paper also presents a new generic text summarization method that uses nonnegative matrix factorization
to estimate sentence relevance. The proposed relevance estimate is based on normalizing the topic space and then
weighting each topic using the sentence representations in the topic space. The proposed method shows better summarization quality
and performance than state-of-the-art methods on the DUC 2001 and DUC 2002 standard data sets.
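The generic LSA pipeline described in this abstract (term-by-sentence matrix, projection into a low-dimensional topic space, sentence selection by topic-space representation) can be sketched roughly as follows. This is a minimal illustration, not the authors' NMF-based method: the scoring rule (square root of the sum of squared, singular-value-weighted topic coordinates) is one common LSA heuristic assumed here, and the names `term_sentence_matrix`, `leading_topics`, and `summarize` are hypothetical.

```python
import math
import random
from collections import Counter


def term_sentence_matrix(sentences):
    """Rows are terms, columns are sentences; entries are raw term counts."""
    counts = [Counter(s.lower().split()) for s in sentences]
    vocab = sorted(set().union(*counts))
    return [[c[t] for c in counts] for t in vocab]


def leading_topics(gram, k, iters=300, seed=0):
    """Top-k eigenpairs of a symmetric PSD matrix (power iteration + deflation).

    For gram = A^T A the eigenvalues are the squared singular values of A and
    the eigenvectors are its right singular vectors (one entry per sentence).
    """
    n = len(gram)
    g = [row[:] for row in gram]
    rng = random.Random(seed)
    pairs = []
    for _ in range(k):
        v = [rng.random() + 0.1 for _ in range(n)]
        for _ in range(iters):
            w = [sum(g[i][j] * v[j] for j in range(n)) for i in range(n)]
            norm = math.sqrt(sum(x * x for x in w))
            if norm < 1e-12:  # nothing left to extract
                break
            v = [x / norm for x in w]
        lam = sum(v[i] * g[i][j] * v[j] for i in range(n) for j in range(n))
        pairs.append((lam, v))
        # Deflate: remove the component that was just found.
        for i in range(n):
            for j in range(n):
                g[i][j] -= lam * v[i] * v[j]
    return pairs


def summarize(sentences, n_topics=2, length=1):
    """Return the indices (in document order) of the `length` top-scoring sentences."""
    a = term_sentence_matrix(sentences)
    n = len(sentences)
    gram = [[sum(row[i] * row[j] for row in a) for j in range(n)] for i in range(n)]
    topics = leading_topics(gram, min(n_topics, n))
    # Assumed scoring heuristic: length of the sentence vector in the reduced topic space.
    scores = [
        math.sqrt(sum(max(lam, 0.0) * v[i] ** 2 for lam, v in topics))
        for i in range(n)
    ]
    ranked = sorted(range(n), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:length])
```

Here `length` plays the role of the required summary length, and `n_topics` is the dimensionality of the topic space, which is kept much smaller than the vocabulary size.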
2.
3.
4.
5.
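The HMM algorithm is optimized by generating the codebook with a combination of a genetic algorithm and the LBG algorithm. Experiments verify that the optimized algorithm is more efficient for text-dependent speaker identity authentication.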
6.
Wai Lam Ruiz M. Srinivasan P. 《Knowledge and Data Engineering, IEEE Transactions on》1999,11(6):865-879
We develop an automatic text categorization approach and investigate its application to text retrieval. The categorization approach is derived from a combination of a learning paradigm known as instance-based learning and an advanced document retrieval technique known as retrieval feedback. We demonstrate the effectiveness of our categorization approach using two real-world document collections from the MEDLINE database. Next, we investigate the application of automatic categorization to text retrieval. Our experiments clearly indicate that automatic categorization improves the retrieval performance compared with no categorization. We also demonstrate that the retrieval performance using automatic categorization achieves the same retrieval quality as the performance using manual categorization. Furthermore, detailed analysis of the retrieval performance on each individual test query is provided.
7.
Automatic text segmentation and text recognition for video indexing (total citations: 13, self-citations: 0, citations by others: 13)
Efficient indexing and retrieval of digital video is an important function of video databases. One powerful index for retrieval
is the text appearing in them. It enables content-based browsing. We present our new methods for automatic segmentation of
text in digital videos. The algorithms we propose make use of typical characteristics of text in videos in order to enable
and enhance segmentation performance. The unique features of our approach are the tracking of characters and words over their
complete duration of occurrence in a video and the integration of the multiple bitmaps of a character over time into a single
bitmap. The output of the text segmentation step is then directly passed to a standard OCR software package in order to translate
the segmented text into ASCII. Also, a straightforward indexing and retrieval scheme is introduced. It is used in the experiments
to demonstrate that the proposed text segmentation algorithms together with existing text recognition algorithms are suitable
for indexing and retrieval of relevant video sequences in and from a video database. Our experimental results are very encouraging
and suggest that these algorithms can be used in video retrieval applications as well as to recognize higher level semantics
in videos.
8.
Evaluation of automatic text summarization is a challenging task because it is difficult to calculate the similarity of two texts. In this paper, we define a new dissimilarity measure, compression dissimilarity, to compute the dissimilarity between documents. We then propose a new automatic evaluation method based on compression dissimilarity. The proposed method is completely “black box” and does not need preprocessing steps. Experiments show that compression dissimilarity can clearly distinguish automatic summaries from human summaries. The compression dissimilarity measure can evaluate an automatic summary by comparing it with high-quality human summaries or with its original document. The evaluation results are highly correlated with human assessments, and the correlation between the compression dissimilarity of summaries and the compression dissimilarity of documents can serve as a meaningful measure of the consistency of an automatic text summarization system.
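The abstract does not give the formula for compression dissimilarity, so the sketch below substitutes a well-known related quantity, the normalized compression distance (NCD), computed with `zlib`. It is only meant to illustrate how a compressor can act as a preprocessing-free "black box" for comparing texts. The function names are hypothetical, and the paper's own measure may differ.

```python
import zlib


def compressed_size(text):
    """Size in bytes of the zlib-compressed UTF-8 encoding of `text`."""
    return len(zlib.compress(text.encode("utf-8"), 9))


def ncd(x, y):
    """Normalized compression distance: near 0 for similar texts, near 1 for unrelated ones."""
    cx, cy = compressed_size(x), compressed_size(y)
    cxy = compressed_size(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)
```

Under this stand-in, a summary would be scored by calling `ncd(summary, reference)` against a human summary or against the original document, with lower values indicating closer texts.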
9.
Pattern Analysis and Applications - Plagiarism is a serious problem in education, research, publishing and other fields. Automatic plagiarism detection systems are crucial for ensuring the...
10.
Mohammad S. Khorsheed Abdulmohsen O. Al-Thubaity 《Language Resources and Evaluation》2013,47(2):513-538
A vast amount of valuable human knowledge is recorded in documents. The rapid growth in the number of machine-readable documents for public or private access necessitates the use of automatic text classification. While a lot of effort has been put into Western languages, mostly English, minimal experimentation has been done with Arabic. This paper presents, first, an up-to-date review of the work done in the field of Arabic text classification and, second, a large and diverse dataset that can be used for benchmarking Arabic text classification algorithms. The different techniques derived from the literature review are illustrated by their application to the proposed dataset. The results of various feature selections, weighting methods, and classification algorithms show, on average, the superiority of support vector machine, followed by the decision tree algorithm (C4.5) and Naïve Bayes. The best classification accuracy was 97% for the Islamic Topics dataset, and the least accurate was 61% for the Arabic Poems dataset.
11.
Mohammad A. M. Abushariah 《International Journal of Speech Technology》2017,20(2):261-280
This work reports on the development and evaluation of TAMEEM V1.0, a state-of-the-art, pure Modern Standard Arabic (MSA), automatic, continuous, speaker-independent, and text-independent speech recognizer built using a high proportion of the spoken data of the phonetically rich and balanced MSA speech corpus. The speech corpus contains recordings of native Arabic speakers from 11 Arab countries representing the Levant, Gulf, and Africa regions of the Arab world, amounting to about 45.30 h of speech data. The recordings contain about 39.28 h of 367 sentences that are considered phonetically rich and balanced, which are used for training the TAMEEM V1.0 speech recognizer, and another 6.02 h of 48 further sentences that are used for testing, which are mostly text independent and foreign to the training sentences. The TAMEEM V1.0 speech recognizer is developed using the Carnegie Mellon University (CMU) Sphinx 3 tools in order to evaluate the speech corpus; the speech engine uses three-emitting-state Continuous Density Hidden Markov Models for tri-phone based acoustic models, and the language model contains uni-grams, bi-grams, and tri-grams. Using three different testing data sets, this work obtained a 7.64% average Word Error Rate (WER) for the speaker-dependent, text-independent data set. For the speaker-independent, text-dependent data set, it obtained a 2.22% average WER, whereas a 7.82% average WER was achieved for the speaker-independent, text-independent data set.
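For readers unfamiliar with the metric reported above, Word Error Rate is conventionally the word-level edit distance between the reference transcript and the recognizer output (substitutions, deletions, and insertions), divided by the number of reference words. The sketch below shows that standard definition. The function name is hypothetical, and the scoring tool actually used in the study is not stated in the abstract.

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the processed reference prefix and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # deletion of a reference word
                curr[j - 1] + 1,         # insertion of a hypothesis word
                prev[j - 1] + (r != h),  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)
```

An average WER such as those quoted in the abstract would then be the mean of this value over a test set, or the total edit count over the total reference length, depending on the convention used.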
12.
Ismail Shahin 《International Journal of Speech Technology》2011,14(2):89-98
This paper addresses the formulation of a new speaker identification approach which employs knowledge of emotional content
of speaker information. Our proposed approach in this work is based on a two-stage recognizer that combines and integrates
both emotion recognizer and speaker recognizer into one recognizer. The proposed approach employs both Hidden Markov Models
(HMMs) and Suprasegmental Hidden Markov Models (SPHMMs) as classifiers. In the experiments, six emotions are considered including
neutral, angry, sad, happy, disgust and fear. Our results show that average speaker identification performance based on the
proposed two-stage recognizer is 79.92% with a significant improvement over a one-stage recognizer with an identification
performance of 71.58%. The results obtained based on the proposed approach are close to those achieved in subjective evaluation
by human listeners.
13.
Önder Kırlı M. Bilginer Gülmezoğlu 《International Journal on Document Analysis and Recognition》2012,15(2):85-99
In the present article, new techniques have been introduced for revealing the individual features of a person's handwriting pattern from the scanned images of handwritten text lines to facilitate text-independent writer identification. These techniques are aimed at designing a dynamic model which can be formalized according to any handwritten text line. Various combinations of the extracted features are applied to three well known classifiers for evaluating the contribution of features to define the correct identification rate. The K-NN, GMM, and Normal Density Discriminant Function Bayes classifiers are used in the present identification model. The experimental studies are conducted using two datasets obtained from the IAM database. The first dataset has already been proposed and used in the literature, whereas the second dataset is an expanded version of the first dataset and has been constituted for the first time in this study to analyze the performance of the extracted features under conditions such as an increased number of writers to discriminate in the database and a decreased number of text lines per writer. The remarkable identification rates obtained from the three classifiers on both datasets clearly indicate that the proposed feature extraction techniques can be effectively used in writer identification systems.
14.
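Automatic summarization should obtain similarity values that are as accurate as possible in order to determine the weights of sentences or paragraphs, but the commonly used calculation methods based on the vector space model ignore the order of words in sentences, paragraphs, and texts. A new similarity measure based on adjacent word-order groups is proposed and applied to automatic text summarization. A clustering-based method is used to build vector representations of word-order groups, which are then used to characterize sentences, paragraphs, and texts, and the similarity results obtained from word-order groups of different lengths are combined by linear interpolation. In addition, a new weight index for sentences or paragraphs is proposed, based on the cumulative importance of the word-order groups they contain. Experiments show that using word-order information can effectively improve the quality of automatic summaries.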
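The idea of combining similarities computed over adjacent word groups of different lengths by linear interpolation can be illustrated with a simplified sketch. It treats the word-order groups as plain word n-grams and compares them by cosine similarity over raw counts; the clustering-based vector representation described in the abstract is not reproduced. The interpolation weights and function names are assumptions chosen for illustration.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Counts of adjacent word groups of length n (empty if the text is shorter than n)."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[g] * b[g] for g in a if g in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def interpolated_similarity(s1, s2, weights=(0.5, 0.3, 0.2)):
    """Linear interpolation of 1-gram, 2-gram, ... similarities; weights are illustrative."""
    t1, t2 = s1.split(), s2.split()
    return sum(
        w * cosine(ngrams(t1, n), ngrams(t2, n))
        for n, w in enumerate(weights, start=1)
    )
```

With these weights, two sentences that contain the same words in reversed order score 0.5 rather than 1.0, which is the kind of word-order sensitivity a pure bag-of-words vector space model lacks.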
15.
V. A. Yatsko M. S. Starikov A. V. Butakov 《Automatic Documentation and Mathematical Linguistics》2010,44(3):111-120
This paper describes an experimental method for automatic text genre recognition based on 45 statistical, lexical, syntactic, positional, and discursive parameters. The suggested method includes: (1) the development of software permitting heterogeneous parameters to be normalized and clustered using the k-means algorithm; (2) the verification of parameters; (3) the selection of the parameters that are the most significant for scientific, newspaper, and artistic texts using two-factor analysis algorithms. Adaptive summarization algorithms have been developed based on these parameters.
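Step (1) above, normalizing heterogeneous parameters and clustering them with k-means, might look roughly like the following. This is a sketch under assumptions: min-max scaling is assumed as the normalization, initialization simply takes the first k points, and the two example features used in the usage note are hypothetical stand-ins for the 45 parameters in the paper.

```python
def min_max_normalize(rows):
    """Rescale every column to [0, 1] so features on different scales are comparable."""
    cols = list(zip(*rows))
    lows = [min(c) for c in cols]
    spans = [(max(c) - min(c)) or 1.0 for c in cols]  # avoid division by zero
    return [[(x - lo) / sp for x, lo, sp in zip(row, lows, spans)] for row in rows]


def k_means(points, k, iters=100):
    """Plain k-means with squared Euclidean distance; returns one cluster label per point."""
    centroids = [list(p) for p in points[:k]]  # naive deterministic initialization
    labels = None
    for _ in range(iters):
        new_labels = [
            min(range(k), key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])))
            for p in points
        ]
        if new_labels == labels:  # assignments stopped changing
            break
        labels = new_labels
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels
```

For instance, with rows of the form `[average sentence length, share of first-person pronouns]`, calling `k_means(min_max_normalize(rows), 2)` groups texts without letting the larger-valued feature dominate the distance.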
16.
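Research on feature selection methods for automatic text classification (total citations: 4, self-citations: 4, citations by others: 4)
Text classification is the process of assigning a large number of texts to one or more categories according to their content. Text representation is one of the core techniques of text classification, and feature selection is in turn one of the key techniques of text representation, with a crucial effect on classification performance. Text feature selection is the process of identifying and removing redundant information as far as possible and improving the quality of the training data set. Feature selection methods for text classification, including information gain, mutual information, the χ² statistic, document frequency, low-loss dimensionality reduction, and the frequency difference method, are introduced, analyzed, and compared in detail.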
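As a concrete instance of one of the listed criteria, the χ² statistic for a term and a category is commonly computed from a 2×2 document-count table. The sketch below uses that standard textbook form. The abstract itself gives no formulas, so the parameter layout and function name are assumptions.

```python
def chi_square(a, b, c, d):
    """Chi-square association between a term and a category from document counts.

    a: documents in the category that contain the term
    b: documents outside the category that contain the term
    c: documents in the category that lack the term
    d: documents outside the category that lack the term
    """
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    if denom == 0:  # degenerate table: the term or the category never varies
        return 0.0
    return n * (a * d - c * b) ** 2 / denom
```

Terms would then be ranked by this score per category and the lowest-scoring ones dropped; a term distributed independently of the category scores 0.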
17.
18.
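With the development of semantic retrieval, many ontology-based studies and applications have emerged in recent years, but ontologies themselves still depend on manual or semi-automatic construction by domain experts, which has become a bottleneck in ontology research. This paper therefore focuses on the automatic construction of ontologies and proposes a method that uses FCA (formal concept analysis) to extract a general-purpose OWL ontology library conforming to the W3C standard from text and generate it automatically. This addresses the low degree of automation and strong domain dependence of current ontology construction, so that the development and application of ontologies are no longer a castle in the air.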
19.
20.
Omid Motlagh Danial Nakhaeinia Sai Hong Tang Babak Karasfi Weria Khaksar 《Neural computing & applications》2014,24(7-8):1569-1581
Online navigation with a known target and unknown obstacles is an interesting problem in mobile robotics. This article presents a technique that uses neural networks and reinforcement learning to enable a mobile robot to learn constructed environments on its own. The robot learns to generate efficient navigation rules automatically, without initial rule settings by experts. This is regarded as the main contribution of this work compared to traditional fuzzy models based on the notion of artificial potential fields. The ability to generalize rules has also been examined. The initial results qualitatively confirmed the efficiency of the model. Further experiments showed at least 32% improvement in path planning from the first to the third path planning trial in a sample environment. Analysis of the results, limitations, and recommendations for future work is included.