Similar Documents
Found 20 similar documents (search time: 15 ms)
1.
This paper presents a historical Arabic corpus named HAC. At this early stage of the project, we report on the design, the architecture, and some of the experiments we have conducted on HAC. The corpus, and accordingly the search results, are represented in a primary XML exchange format. This serves as an intermediate exchange tool within the project and allows users to process the results offline with external tools. HAC is made up of Classical Arabic texts covering 1600 years of language use: the Quranic text, Modern Standard Arabic texts, and a variety of monolingual Arabic dictionaries. This historical corpus helps linguists and Arabic language learners to effectively explore, understand, and discover interesting knowledge hidden in millions of instances of language use. We used techniques from natural language processing to process the data and a graph-based representation for the corpus, and we provide researchers with an export facility to make further linguistic analysis possible.

2.
Traditionally, a corpus is a large structured set of text, electronically stored and processed. Corpora have become very important in the study of languages. They have opened new areas of linguistic research that were unknown until recently. Corpora are also key to the development of optical character recognition (OCR) applications. Access to a corpus of both language and images is essential during OCR development, particularly while training and testing a recognition application. Excellent corpora have been developed for Latin-based languages, but few relate to the Arabic language. This limits the penetration of both corpus linguistics and OCR in Arabic-speaking countries. This paper describes the construction of, and provides a comprehensive study and analysis of, a multi-modal Arabic corpus (MMAC) that is suitable for use in both OCR development and linguistics. MMAC currently contains six million Arabic words and, unlike previous corpora, also includes connected segments or pieces of Arabic words (PAWs), as well as naked pieces of Arabic words (NPAWs) and naked words (NWords), i.e. PAWs and words without diacritical marks. Multi-modal data is generated both from text, gathered from a wide variety of sources, and from images of existing documents. Text-based data is complemented by a set of artificially generated images showing each of the Words, NWords, PAWs and NPAWs involved. Applications are provided to apply natural-looking degradation to the generated images. A ground-truth annotation is offered for each such image, while natural images showing small paragraphs and full pages are augmented with representations of the text they depict. A statistical analysis and verification of the dataset has been carried out and is presented. MMAC was also tested using commercial OCR software and is publicly and freely available.
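The PAW and "naked" forms described above can be sketched in a few lines. This is a minimal illustration only, assuming that PAW boundaries fall after letters that do not join to the following letter; the non-joining set and the diacritic range used here are assumptions, not taken from MMAC itself:

```python
# Sketch: deriving a naked form and PAWs from an Arabic word.
# Assumption: a PAW break occurs after any letter that cannot
# connect leftwards (the set below is illustrative).
import re

NON_JOINING = set("اأإآدذرزوؤء")
DIACRITICS = re.compile(r"[\u064B-\u065F\u0670]")

def strip_diacritics(word):
    """Return the 'naked' form with Arabic diacritical marks removed."""
    return DIACRITICS.sub("", word)

def paws(word):
    """Split a (diacritic-free) word into pieces of Arabic words (PAWs)."""
    pieces, current = [], ""
    for ch in word:
        current += ch
        if ch in NON_JOINING:      # this letter cannot join to the next
            pieces.append(current)
            current = ""
    if current:
        pieces.append(current)
    return pieces

word = "كِتَاب"   # vowelled form of kitab, 'book'
print(paws(strip_diacritics(word)))   # ['كتا', 'ب']
```

A real pipeline would also need to handle ligatures and optional letters such as ta marbuta, which this sketch ignores.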

3.
4.
5.
Morphologically rich languages pose a challenge for statistical machine translation (SMT). This challenge is magnified when translating into a morphologically rich language. In this work we address this challenge in the framework of broad-coverage English-to-Arabic phrase-based statistical machine translation (PBSMT). We explore the largest-to-date set of Arabic segmentation schemes, ranging from the full word form to fully segmented forms, and examine the effects on system performance. Our results show a difference of 2.31 BLEU points, averaged over all test sets, between the best and worst segmentation schemes, indicating that the choice of segmentation scheme has a significant effect on the performance of an English-to-Arabic PBSMT system in a large-data scenario. We show that a simple segmentation scheme can perform as well as the best and most complicated one. An in-depth analysis of the effect of segmentation choices on the components of a PBSMT system reveals that text fragmentation has a negative effect on the perplexity of the language models, and that aggressive segmentation can significantly increase the size of the phrase table and the uncertainty in choosing candidate translation phrases during decoding. An investigation of the output of the different systems reveals the complementary nature of the outputs and the great potential in combining them.
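The contrast between a surface-form scheme and a clitic-segmented scheme can be illustrated with a toy splitter. This is only a sketch: the proclitic and enclitic lists below are invented for illustration, and the paper's actual schemes are produced by a morphological analyzer, not greedy string matching:

```python
# Toy contrast between two segmentation schemes: the unsegmented
# surface form versus a clitic-segmented form marked with '+'.
# The affix inventories here are illustrative assumptions only.
PROCLITICS = ["وال", "بال", "كال", "فال", "ال", "و", "ف", "ب", "ك", "ل"]
ENCLITICS = ["ها", "هم", "كم", "نا", "ه", "ك"]

def segment(word):
    """Greedily strip at most one proclitic and one enclitic."""
    tokens = []
    for p in PROCLITICS:                      # longest affixes first
        if word.startswith(p) and len(word) > len(p) + 1:
            tokens.append(p + "+")
            word = word[len(p):]
            break
    suffix = None
    for e in ENCLITICS:
        if word.endswith(e) and len(word) > len(e) + 1:
            suffix = "+" + e
            word = word[: -len(e)]
            break
    tokens.append(word)
    if suffix:
        tokens.append(suffix)
    return tokens

print(segment("والكتاب"))   # ['وال+', 'كتاب'] under this toy scheme
```

Greedy matching over-segments real text (any word starting with a letter that happens to be an affix gets split), which is exactly why analyzer-based schemes are compared in the paper.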

6.
A vast amount of valuable human knowledge is recorded in documents. The rapid growth in the number of machine-readable documents for public or private access necessitates the use of automatic text classification. While much effort has been devoted to Western languages, mostly English, minimal experimentation has been done with Arabic. This paper presents, first, an up-to-date review of the work done in the field of Arabic text classification and, second, a large and diverse dataset that can be used for benchmarking Arabic text classification algorithms. The different techniques derived from the literature review are illustrated by their application to the proposed dataset. The results of various feature selections, weighting methods, and classification algorithms show, on average, the superiority of the support vector machine, followed by the decision tree algorithm (C4.5) and Naïve Bayes. The best classification accuracy was 97% for the Islamic Topics dataset, and the lowest was 61% for the Arabic Poems dataset.
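One of the benchmarked classifiers, multinomial Naïve Bayes, is simple enough to sketch from scratch. The two toy "documents" and class labels below are invented for illustration and are not drawn from the paper's dataset:

```python
# Minimal multinomial Naive Bayes text classifier with Laplace
# smoothing, of the kind benchmarked in the paper. Toy data only.
import math
from collections import Counter, defaultdict

def train(docs):
    """docs: list of (tokens, label). Returns priors, counts, vocabulary."""
    priors, counts, vocab = Counter(), defaultdict(Counter), set()
    for tokens, label in docs:
        priors[label] += 1
        counts[label].update(tokens)
        vocab.update(tokens)
    return priors, counts, vocab

def predict(tokens, priors, counts, vocab):
    """Return the label with the highest smoothed log posterior."""
    total = sum(priors.values())
    best, best_lp = None, float("-inf")
    for label in priors:
        lp = math.log(priors[label] / total)
        denom = sum(counts[label].values()) + len(vocab)
        for t in tokens:
            lp += math.log((counts[label][t] + 1) / denom)  # Laplace smoothing
        if lp > best_lp:
            best, best_lp = label, lp
    return best

docs = [(["قصيدة", "شعر"], "poems"), (["صلاة", "قرآن"], "islamic")]
model = train(docs)
print(predict(["شعر"], *model))   # poems
```

In practice the paper's pipeline also applies feature selection and term weighting before classification; this sketch uses raw token counts.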

7.
8.
Multimedia Tools and Applications - Taxonomies are semantic resources that help to categorize and add meaning to data. In a hyperconnected world where information is generated at a rate that...

9.
A method for disambiguating word senses in a large corpus
Word sense disambiguation has been recognized as a major problem in natural language processing research for over forty years. Both quantitative and qualitative methods have been tried, but much of this work has been stymied by difficulties in acquiring appropriate lexical resources. The availability of such testing and training material has enabled us to develop quantitative disambiguation methods that achieve 92% accuracy in discriminating between two very distinct senses of a noun. In the training phase, we collect a number of instances of each sense of the polysemous noun. Then, in the testing phase, we are given a new instance of the noun and are asked to assign it to one of the senses. We do so by comparing the context of the unknown instance with the contexts of known instances, using a Bayesian argument that has been applied successfully in related tasks such as author identification and information retrieval. The proposed method is probably most appropriate for those aspects of sense disambiguation that are closest to the information retrieval task; in particular, it was designed to disambiguate senses that are usually associated with different topics.

William Gale is in a statistics department at AT&T Bell Laboratories. He has done research in physics, radio astronomy, and economics, and founded the Society for Artificial Intelligence and Statistics. His current interests include lexical issues such as word sense discrimination, word similarity measures, and word correspondences in parallel texts. Kenneth Ward Church received his Ph.D. in Computer Science from MIT and then went to work at AT&T Bell Laboratories on problems in speech and language; recently, he has been advocating the use of statistical methods for analyzing large corpora. David Yarowsky is currently pursuing a Ph.D. in Computer Science at the University of Pennsylvania. He spent several years at AT&T Bell Laboratories doing research in statistical natural language processing.
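The Bayesian decision rule described in this abstract amounts to summing log probability ratios of context words under the two senses. The tiny word counts below are invented for illustration; they are not the paper's training data:

```python
# Sketch of the Bayesian sense-discrimination rule: score a test
# context by summing smoothed log-probability ratios of its words
# under two senses. All counts here are invented toy numbers.
import math

# word -> (count in sense-1 contexts, count in sense-2 contexts)
counts = {"music": (30, 1), "harvest": (1, 25), "play": (20, 5)}
N1, N2 = 60, 40   # total context words observed per sense (assumed)

def score(context, vocab_size=1000):
    """Positive => sense 1, negative => sense 2 (Laplace-smoothed)."""
    s = 0.0
    for w in context:
        c1, c2 = counts.get(w, (0, 0))
        s += math.log((c1 + 1) / (N1 + vocab_size))
        s -= math.log((c2 + 1) / (N2 + vocab_size))
    return s

print(score(["music", "play"]) > 0)   # True: the sense-1 reading wins
```

Because unseen words contribute a near-zero ratio after smoothing, the decision is driven by topically distinctive context words, matching the abstract's observation that the method suits senses associated with different topics.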

10.
FARIMA-based network modelling and performance analysis
This paper gives a method for modelling and fitting real network traffic with the FARIMA model, together with concrete steps for parameter estimation, and studies the influence of long-range and short-range dependence on network performance. The results show that the FARIMA model fits real traffic closely in both the long-range and the short-range dependent case; when buffers are small, network performance is dominated by the short-range dependence, and as buffer size increases, performance under long-range dependent traffic degrades more slowly than under short-range dependent traffic. These findings provide a valuable reference for future studies of network design and performance.
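The long-range dependence in a FARIMA(p, d, q) model comes from the fractional differencing operator (1 - B)^d, whose binomial expansion gives a simple recursive filter. The sketch below shows only that filter, with white noise standing in for measured traffic; it is not the paper's full estimation procedure:

```python
# Fractional differencing, the core of FARIMA(p, d, q): the weights
# of (1 - B)^d follow the recursion w_0 = 1, w_k = w_{k-1}(k-1-d)/k.
import random

def frac_diff_weights(d, n):
    """First n binomial weights of the operator (1 - B)^d."""
    w = [1.0]
    for k in range(1, n):
        w.append(w[-1] * (k - 1 - d) / k)
    return w

def frac_diff(x, d):
    """Apply the (truncated) operator (1 - B)^d to a series."""
    w = frac_diff_weights(d, len(x))
    return [sum(w[k] * x[t - k] for k in range(t + 1)) for t in range(len(x))]

random.seed(0)
noise = [random.gauss(0, 1) for _ in range(200)]
# Applying (1 - B)^{-d} with d = 0.3 fractionally *integrates* the noise,
# producing long-range dependence (Hurst parameter H = d + 0.5 = 0.8).
series = frac_diff(noise, -0.3)
```

Estimating d from real traffic (and the ARMA parameters p, q) requires the fitting steps the paper describes; this fragment only generates a long-range dependent series for experimentation.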

11.
In this paper, a structural method of recognising Arabic handwritten characters is proposed. The major problem in cursive text recognition is the segmentation into characters or into representative strokes. When we segment the cursive portions of words, we take into account the contextual properties of the Arabic grammar and the junction segments connecting the characters to each other along the writing line. The problem of overlapping characters is resolved with a contour-following algorithm associated with the labelling of the detected contours. In the recognition phase, the characters are gathered into ten families of candidate characters with similar shapes. Then a heterarchical analysis follows that checks the pattern via goal-directed feedback control.

12.
Statistical information on a substantial corpus of representative Spanish texts is needed in order to determine the significance of data about individual authors or texts by means of comparison. This study describes the organization and analysis of a 150,000-word corpus of 30 well-known twentieth-century Spanish authors. Tables show the computational results of analyses involving sentences, segments, quotations, and word length.

The article explains the considerations that guided content, selection, and sample size, and describes the special editing needed for the input of Spanish text. Separate sections highlight and comment upon some of the findings.

The corpus and the tables provide objective data for studies of homogeneity and heterogeneity. The format of the tables permits others to add to the original 30 authors, organize the results by categories, or use the cumulative results for normative comparisons.

Estelle Irizarry is Professor of Spanish at Georgetown University and author of 20 books and annotated editions dealing with Hispanic literature, art, and hoaxes. Her latest book, an edition of Infortunios de Alonso Ramirez, treats the disputed authorship of Spanish America's first novel. She is Courseware Editor of CHum.

13.
Measurements of real networks show that network traffic is self-similar, which has a great impact on network performance. This paper presents a concrete method and procedure for network simulation based on OPNET. By simulating the network under different conditions, key indicators affecting network performance were collected. The simulation results show that the heavier the load, the larger the average queueing delay, confirming that self-similar traffic degrades network performance; the observed relation between delay and load is then used to analyse the influence of network performance on concrete network design.

14.
A piezoresistive sliver sensor is proposed for on-line sliver detection and quality inspection on draw frames, carding machines, and other pre-spinning equipment in textile machinery. The structural model and the mechanical model of the sensor are analysed, and dynamic-characteristic experiments were carried out; the dynamic characteristics derived from the experimental curves lead to the conclusion that the sensor meets the requirements of the cotton textile industry.

15.
As the number of Arabic corpora is constantly increasing, there is an obvious and growing need for concordancing software for corpus search and analysis that supports as many features of the Arabic language as possible and provides users with a greater number of functions. This paper evaluates six existing corpus search and analysis tools against eight criteria which seem to be the most essential for searching and analysing Arabic corpora, such as displaying Arabic text in its right-to-left direction, normalising diacritics and Hamza, and providing an Arabic user interface. The results of the evaluation revealed that three tools: Khawas, Sketch Engine, and aConCorde, met most of the evaluation criteria and achieved the highest benchmark scores. The paper concludes that the developers' conscious consideration of the linguistic features of Arabic when designing these three tools was the most significant factor behind their superiority.

16.
Given the continuous proliferation of journalistic content online and the changing political landscape in many Arab countries, we started our current research in order to implement a media monitoring system for opinion mining in the political field. This system allows political actors, despite the large volume of online data, to stay constantly informed about opinions expressed on the web, so that they can properly monitor their actual standing, orient their communication strategy, and prepare election campaigns. The developed system is based on a linguistic approach using NooJ's linguistic engine to formalize automatic recognition rules and apply them to a dynamic corpus of journalistic articles. The first implemented rules identify and annotate the different political entities (political actors and organizations). These annotations are then used in our media monitoring system to identify the opinions associated with the extracted named entities. The system is mainly based on a set of local grammars developed for identifying the different structures of political opinion phrases. These grammars use the entries of an opinion lexicon containing the different opinion words (verbs, adjectives, nouns), where each entry is associated with the corresponding semantic marker (polarity and intensity). The system is able to identify and properly annotate the opinion holder, the opinion target, and the polarity (positive or negative) of the phraseological expression (nominal or verbal) expressing the opinion. Our experiments showed that the adopted extraction method achieves an F-measure of 0.83.
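The holder/target/polarity annotation described above can be mimicked, very loosely, with a single pattern over English text. This is a toy analogue only: the paper's local grammars are NooJ graphs over Arabic, and the lexicon and pattern below are invented:

```python
# Toy analogue of a local grammar for opinion phrases: match
# "<holder> praised/criticized <target>" and assign polarity from
# a tiny opinion lexicon. Illustrative pattern and lexicon only.
import re

LEXICON = {"praised": "positive", "criticized": "negative"}
PATTERN = re.compile(
    r"(?P<holder>\w+) (?P<verb>praised|criticized) (?P<target>[\w ]+)"
)

def extract(sentence):
    """Return holder, target and polarity, or None if no match."""
    m = PATTERN.search(sentence)
    if not m:
        return None
    return {"holder": m.group("holder"),
            "target": m.group("target"),
            "polarity": LEXICON[m.group("verb")]}

print(extract("Parliament criticized the new budget"))
# {'holder': 'Parliament', 'target': 'the new budget', 'polarity': 'negative'}
```

A real system needs many such grammars, plus named-entity recognition to restrict holders and targets to political actors and organizations, as the paper describes.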

17.
18.
《Ergonomics》2012,55(6):441-454
Observation reliability (agreement percentage and kappa coefficients) for six experienced ergonomists and six untrained participants was computed. Participants were first tested after a training session and 1 week later after an additional practice session. Two formats were used: free practice and directed exercise. Reliability was tested for 17 variables and 20 sequences using photographic and video supports. The participants were asked to indicate whether they were confident about their answer, to rate this confidence on a scale of 1 to 10, and, when the confidence rating was below 8, to provide a reason for this. Experience and additional practice had no clear impact on reliability, which was excellent overall. The main reason given was that the event to be observed took place at the borderline between two classes. The observers' rating on the scale appeared to be tied to the subsequently computed reliability. The use of a confidence scale appeared to be a useful tool for forecasting observation problems.
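The kappa coefficient used in this reliability study is Cohen's chance-corrected agreement between two observers. The sketch below computes it from scratch; the toy posture labels are invented, not the study's data:

```python
# Cohen's kappa: observed agreement corrected for the agreement
# expected by chance from each observer's label frequencies.
from collections import Counter

def cohens_kappa(a, b):
    """Kappa for two equal-length label sequences."""
    n = len(a)
    po = sum(x == y for x, y in zip(a, b)) / n               # observed
    ca, cb = Counter(a), Counter(b)
    pe = sum(ca[k] * cb[k] for k in set(ca) | set(cb)) / n**2  # by chance
    return (po - pe) / (1 - pe)

# Invented example: two observers classifying five postures.
obs1 = ["neutral", "flexed", "neutral", "flexed", "neutral"]
obs2 = ["neutral", "flexed", "flexed", "flexed", "neutral"]
print(round(cohens_kappa(obs1, obs2), 2))   # 0.62
```

Here observed agreement is 0.8 but chance agreement is 0.48, so kappa drops to about 0.62, which is why kappa is reported alongside raw agreement percentages in studies like this one.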


20.