首页 | 本学科首页   官方微博 | 高级检索  
相似文献
 共查询到20条相似文献,搜索用时 15 毫秒
1.
The sentiment analysis (SA) applications are becoming popular among the individuals and organizations for gathering and analysing user's sentiments about products, services, policies, and current affairs. Due to the availability of a wide range of English lexical resources, such as part‐of‐speech taggers, parsers, and polarity lexicons, development of sophisticated SA applications for the English language has attracted many researchers. Although there have been efforts for creating polarity lexicons in non‐English languages such as Urdu, they suffer from many deficiencies, such as lack of publically available sentiment lexicons with a proper scoring mechanism of opinion words and modifiers. In this work, we present a word‐level translation scheme for creating a first comprehensive Urdu polarity resource: “Urdu Lexicon” using a merger of existing resources: list of English opinion words, SentiWordNet, English–Urdu bilingual dictionary, and a collection of Urdu modifiers. We assign two polarity scores, positive and negative, to each Urdu opinion word. Moreover, modifiers are collected, classified, and tagged with proper polarity scores. We also perform an extrinsic evaluation in terms of subjectivity detection and sentiment classification, and the evaluation results show that the polarity scores assigned by this technique are more accurate than the baseline methods.  相似文献   

2.
Stemming is the basic operation in Natural language processing (NLP) to remove derivational and inflectional affixes without performing a morphological analysis. This practice is essential to extract the root or stem. In NLP domains, the stemmer is used to improve the process of information retrieval (IR), text classifications (TC), text mining (TM) and related applications. In particular, Urdu stemmers utilize only uni-gram words from the input text by ignoring bigrams, trigrams, and n-gram words. To improve the process and efficiency of stemming, bigrams and trigram words must be included. Despite this fact, there are a few developed methods for Urdu stemmers in the past studies. Therefore, in this paper, we proposed an improved Urdu stemmer, using hybrid approach divided into multi-step operation, to deal with unigram, bigram, and trigram features as well. To evaluate the proposed Urdu stemming method, we have used two corpora; word corpus and text corpus. Moreover, two different evaluation metrics have been applied to measure the performance of the proposed algorithm. The proposed algorithm achieved an accuracy of 92.97% and compression rate of 55%. These experimental results indicate that the proposed system can be used to increase the effectiveness and efficiency of the Urdu stemmer for better information retrieval and text mining applications.  相似文献   

3.
With the advent of globalization epoch, the Internet-based resources for Urdu are increasing in depth and breadth at a higher pace than ever and thus require a mechanism for computational processing of Urdu text. Information retrieval (IR) systems have now become the major tool for seeking varied information on the web. It uses variant forms of the word transformed through stemmer. Broadly speaking, current Urdu stemmers can be categorized into two major categories: linguistic-based stemmers and statistical stemmers. In this paper, the authors explain the applications where stemming is used as a first step and highlight the challenges in Urdu text stemming. This is the first comparative study of the state-of-the-art Urdu stemmers, based on various distinct features such as used approach, main idea, limitations, the rules or affixes, data set, evaluation criteria and claimed accuracy. A comparative analysis, among state-of-the-art Urdu stemmers, is performed by using the standard data set. Finally, we outline the relevant research gaps in the literature and suggest recommendations for future research on Urdu text stemming.  相似文献   

4.
Extensive work has been done on different activities of natural language processing for Western languages as compared to its Eastern counterparts particularly South Asian Languages. Western languages are termed as resource-rich languages. Core linguistic resources e.g. corpora, WordNet, dictionaries, gazetteers and associated tools being developed for Western languages are customarily available. Most South Asian Languages are low resource languages e.g. Urdu is a South Asian Language, which is among the widely spoken languages of sub-continent. Due to resources scarcity not enough work has been conducted for Urdu. The core objective of this paper is to present a survey regarding different linguistic resources that exist for Urdu language processing, to highlight different tasks in Urdu language processing and to discuss different state of the art available techniques. Conclusively, this paper attempts to describe in detail the recent increase in interest and progress made in Urdu language processing research. Initially, the available datasets for Urdu language are discussed. Characteristic, resource sharing between Hindi and Urdu, orthography, and morphology of Urdu language are provided. The aspects of the pre-processing activities such as stop words removal, Diacritics removal, Normalization and Stemming are illustrated. A review of state of the art research for the tasks such as Tokenization, Sentence Boundary Detection, Part of Speech tagging, Named Entity Recognition, Parsing and development of WordNet tasks are discussed. In addition, impact of ULP on application areas, such as, Information Retrieval, Classification and plagiarism detection is investigated. Finally, open issues and future directions for this new and dynamic area of research are provided. The goal of this paper is to organize the ULP work in a way that it can provide a platform for ULP research activities in future.  相似文献   

5.
Neural Computing and Applications - In order to provide benchmark performance for Urdu text document classification, the contribution of this paper is manifold. First, it provides a publicly...  相似文献   

6.
Neural Computing and Applications - The article [Benchmarking performance of machine and deep learning-based methodologies for Urdu text document classification], written by [Muhammad Nabeel Asim,...  相似文献   

7.
This paper presents a scheme for ranking of spelling error corrections for Urdu. Conventionally spell-checking techniques do not provide any explicit ranking mechanism. Ranking is either implicit in the correction algorithm or corrections are not ranked at all. The research presented in this paper shows that for Urdu, phonetic similarity between the corrections and the erroneous word can serve as a useful parameter for ranking the corrections. This combined with a new technique Shapex that uses visual similarity of characters for ranking gives an improvement of 23% in the accuracy of the one-best match compared to the result obtained when the ranking is done on the basis of word frequencies only.
Sarmad HussainEmail:
  相似文献   

8.
9.
This paper presents, a grammatically motivated, sentiment classification model, applied on a morphologically rich language: Urdu. The morphological complexity and flexibility in grammatical rules of this language require an improved or altogether different approach. We emphasize on the identification of the SentiUnits, rather than, the subjective words in the given text. SentiUnits are the sentiment carrier expressions, which reveal the inherent sentiments of the sentence for a specific target. The targets are the noun phrases for which an opinion is made. The system extracts SentiUnits and the target expressions through the shallow parsing based chunking. The dependency parsing algorithm creates associations between these extracted expressions. For our system, we develop sentiment-annotated lexicon of Urdu words. Each entry of the lexicon is marked with its orientation (positive or negative) and the intensity (force of orientation) score. For the evaluation of the system, two corpora of reviews, from the domains of movies and electronic appliances are collected. The results of the experimentation show that, we achieve the state of the art performance in the sentiment analysis of the Urdu text.  相似文献   

10.
Stemming is one of the basic steps in natural language processing applications such as information retrieval, parts of speech tagging, syntactic parsing and machine translation, etc. It is a morphological process that intends to convert the inflected forms of a word into its root form. Urdu is a morphologically rich language, emerged from different languages, that includes prefix, suffix, infix, co-suffix and circumfixes in inflected and multi-gram words that need to be edited in order to convert them into their stems. This editing (insertion, deletion and substitution) makes the stemming process difficult due to language morphological richness and inclusion of words of foreign languages like Persian and Arabic. In this paper, we present a comprehensive review of different algorithms and techniques of stemming Urdu text and also considering the syntax, morphological similarity and other common features and stemming approaches used in Urdu like languages, i.e. Arabic and Persian analyzed, extract main features, merits and shortcomings of the used stemming approaches. In this paper, we also discuss stemming errors, basic difference between stemming and lemmatization and coin a metric for classification of stemming algorithms. In the final phase, we have presented the future work directions.  相似文献   

11.
12.
13.
In this paper, an automatic and efficient algorithm for outline capture of character images, stored as bitmap, is presented. This method is well suited for characters of non-Roman languages like Arabic, Japanese, Urdu, Persian, etc. Contemporary word processing systems store shapes of the characters in terms of their outlines, and outlines are expressed as cubic Bezier curves. The process of capturing outlines includes steps: detection of boundary, finding Corner Points and Break Points and fitting the curve. The work done in this paper, fully automate the above process and produces optimal results.  相似文献   

14.
In this paper, we report on the role of the Urdu grammar in the Parallel Grammar (ParGram) project (Butt, M., King, T. H., Niño, M.-E., &; Segond, F. (1999). A grammar writer’s cookbook. CSLI Publications; Butt, M., Dyvik, H., King, T. H., Masuichi, H., &; Rohrer, C. (2002). ‘The parallel grammar project’. In: Proceedings of COLING 2002, Workshop on grammar engineering and evaluation, pp. 1–7). The Urdu grammar was able to take advantage of standards in analyses set by the original grammars in order to speed development. However, novel constructions, such as correlatives and extensive complex predicates, resulted in expansions of the analysis feature space as well as extensions to the underlying parsing platform. These improvements are now available to all the project grammars.  相似文献   

15.
Automatic Speaker Recognition (ASR) refers to the task of identifying a person based on his or her voice with the help of machines. ASR finds its potential applications in telephone based financial transactions, purchase of credit card and in forensic science and social anthropology for the study of different cultures and languages. Results of ASR are highly dependent on database, i.e., the results obtained in ASR are meaningless if recording conditions are not known. In this paper, a methodology and a typical experimental setup used for development of corpora for various tasks in the text-independent speaker identification in different Indian languages, viz., Marathi, Hindi, Urdu and Oriya have been described. Finally, an ASR system is presented to evaluate the corpora.  相似文献   

16.

Steganography is used for multimedia data security. It is a process of hiding the data within multimedia communication between the parties embedding the secret data inside a carrier file to be protected during its transmission. The research focus is on hiding within Arabic text steganography as a current challenging research area. The work innovation is utilizing text pseudo-spaces characters for data hiding. We present two studies for this text Steganography utilizing pseudo-spaces alone as well as combined with Kashida (extension character) as the old Arabic text stego techniques. Experimental results have shown that the proposed algorithms achieved high capacity and security ratio as compared to state-of-the-art Steganography methods presented for Arabic language. The proposed pseudo-spaces stego technique is of great benefit that can be further used for languages similar to Arabic such as Urdu and Persian as well as opening direction of text-stego research for other languages of the world.

  相似文献   

17.
18.
19.
20.
设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号