Similar documents
 Found 20 similar documents (search time: 15 ms)
1.
Internet search engines allow access to online information from all over the world. However, there is currently a general assumption that users are fluent in the languages of all documents that they might search for. For historical reasons, this has usually been a choice between English and the locally supported language. Given the rapidly growing size of the Internet, it is likely that future users will need to access information in languages in which they are not fluent or of which they have no knowledge at all. This paper shows how information retrieval and machine translation can be combined in a cross-language information access framework to help overcome the language barrier. We present encouraging preliminary experimental results using English queries to retrieve documents from the standard Japanese-language BMIR-J2 retrieval test collection. We outline the scope and purpose of cross-language information access and provide an example application to suggest that technology already exists to provide effective and potentially useful applications.

2.
Information retrieval (IR) is the science of identifying documents or sub-documents in a collection of information or a database. The collection need not be available in only one language, since information does not depend on language. Monolingual IR retrieves information in the query language, whereas cross-lingual information retrieval (CLIR) retrieves information in a language that differs from the query language. There is currently strong demand for CLIR systems because they allow users to expand the international scope of their search for relevant documents. Compared to monolingual IR, one of the biggest problems of CLIR is poor retrieval performance, caused by query mismatching, multiple representations of query terms and untranslated query terms. Query expansion (QE) is the technique of adding related terms to the original query to reformulate it; its purpose is to improve the performance and quality of the retrieved information in a CLIR system. In this paper, QE is explored for Hindi–English CLIR, in which Hindi queries are used to search English documents. We used Okapi BM25 for document ranking and then expanded the translated queries using term selection values. All experiments were performed on the FIRE 2012 dataset. Our results show that the relevance of Hindi–English CLIR can be improved by adding the lowest-frequency term.
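The Okapi BM25 ranking step mentioned above can be sketched as follows; the toy corpus and the parameter values (k1 = 1.5, b = 0.75) are illustrative and not those of the paper's FIRE 2012 setup.

```python
import math

def bm25_score(query_terms, doc, corpus, k1=1.5, b=0.75):
    """Score one tokenized document against a query with Okapi BM25."""
    N = len(corpus)
    avgdl = sum(len(d) for d in corpus) / N          # average document length
    score = 0.0
    for term in query_terms:
        df = sum(1 for d in corpus if term in d)     # document frequency
        if df == 0:
            continue
        idf = math.log((N - df + 0.5) / (df + 0.5) + 1)   # smoothed IDF
        tf = doc.count(term)
        score += idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * len(doc) / avgdl))
    return score

corpus = [
    ["cross", "lingual", "retrieval"],
    ["query", "expansion", "improves", "retrieval"],
    ["monolingual", "search"],
]
print(round(bm25_score(["retrieval", "expansion"], corpus[1], corpus), 3))
```

Expanded query terms would simply be appended to `query_terms` before rescoring.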

3.
Stemming is one of the basic steps in natural language processing applications such as information retrieval, part-of-speech tagging, syntactic parsing and machine translation. It is a morphological process that converts the inflected forms of a word into its root form. Urdu is a morphologically rich language that emerged from several languages; its inflected and multi-gram words include prefixes, suffixes, infixes, co-suffixes and circumfixes that must be edited to reduce the words to their stems. This editing (insertion, deletion and substitution) makes stemming difficult, both because of the language's morphological richness and because of loanwords from languages such as Persian and Arabic. In this paper, we present a comprehensive review of algorithms and techniques for stemming Urdu text. We also consider the syntax, morphological similarity, other common features and stemming approaches of related languages, i.e. Arabic and Persian, and extract the main features, merits and shortcomings of the stemming approaches used. We further discuss stemming errors, the basic difference between stemming and lemmatization, and coin a metric for classifying stemming algorithms. Finally, we present directions for future work.
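A rule-based suffix-stripping stemmer of the kind such surveys review can be sketched as below. The suffix list and the romanized example word are invented for illustration; they are not a real Urdu affix inventory.

```python
# Hypothetical suffix inventory, longest-match-first; a real Urdu stemmer
# works on Urdu script and a linguistically derived affix list.
SUFFIXES = ["iyan", "on", "en", "an", "i", "e"]

def strip_suffix(word, min_stem=2):
    """Remove the first matching suffix, keeping at least min_stem characters."""
    for suf in SUFFIXES:
        if word.endswith(suf) and len(word) - len(suf) >= min_stem:
            return word[: len(word) - len(suf)]
    return word

print(strip_suffix("kitaben"))
```

The `min_stem` guard is one simple way to avoid over-stemming short words.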

4.
A large number of Internet users share their knowledge and opinions in online social networks such as forums and weblogs. This has attracted researchers from many fields to study online social networks. Persian is one of the dominant languages of the Middle East and the official language of Iran, Afghanistan and Tajikistan, so a large number of Persian speakers are active in online social networks. Despite this, very few studies of Persian social networks exist. In this paper we study the characteristics of Persian bloggers based on a new collection, named irBlogs. The collection contains nearly 5 million posts and the network of more than 560,000 Persian bloggers, which supports the reliability of this study's results. The characteristics analyzed include the similarities and differences between formal Persian and the language style used by Persian bloggers, the interests of the bloggers, and the impact of other web resources on the Persian blogosphere. Our analysis shows that IT, sports, society, culture and politics are the main interests of Persian bloggers. Analysis of the links shared by Persian bloggers also shows that news agencies, knowledge bases and other social networks have a great impact on Persian bloggers, and that they are interested in sharing multimedia content.

5.
Extraction and normalization of temporal expressions from documents are important steps towards deep text understanding and a prerequisite for many NLP tasks such as information extraction, question answering, and document summarization. There are different ways to express (the same) temporal information in documents. However, after identifying temporal expressions, they can be normalized according to some standard format. This allows the usage of temporal information in a term- and language-independent way. In this paper, we describe the challenges of temporal tagging in different domains, give an overview of existing annotated corpora, and survey existing approaches for temporal tagging. Finally, we present our publicly available temporal tagger HeidelTime, which is easily extensible to further languages due to its strict separation of source code and language resources like patterns and rules. We present a broad evaluation on multiple languages and domains on existing corpora as well as on a newly created corpus for a language/domain combination for which no annotated corpus has been available so far.
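The identify-then-normalize pipeline can be sketched minimally: a regex recognizes one English date pattern and maps it to a TIMEX3-style `YYYY-MM-DD` value. HeidelTime itself uses much richer, language-specific pattern and rule resources.

```python
import re

MONTHS = {"january": "01", "february": "02", "march": "03", "april": "04",
          "may": "05", "june": "06", "july": "07", "august": "08",
          "september": "09", "october": "10", "november": "11", "december": "12"}

def normalize_dates(text):
    """Find 'Month D, YYYY' expressions and normalize them to YYYY-MM-DD."""
    pattern = re.compile(r"\b(" + "|".join(MONTHS) + r")\s+(\d{1,2}),\s*(\d{4})",
                         flags=re.IGNORECASE)
    return [(m.group(0),
             f"{m.group(3)}-{MONTHS[m.group(1).lower()]}-{int(m.group(2)):02d}")
            for m in pattern.finditer(text)]

print(normalize_dates("The workshop ran on March 5, 2014 in Dublin."))
```

Once normalized, "March 5, 2014" and "2014年3月5日" become the same comparable value, which is the term- and language-independence the abstract refers to.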

6.
Automatic text classification based on the vector space model (VSM), artificial neural networks (ANN), K-nearest neighbor (KNN), Naive Bayes (NB) and support vector machines (SVM) has been applied to English-language documents and has gained popularity among text mining and information retrieval (IR) researchers. This paper proposes the application of VSM and ANN to the classification of Tamil-language documents. Tamil is a morphologically rich classical Dravidian language. The growth of the Internet has led to an exponential increase in the number of electronic documents, not only in English but also in other regional languages, yet the automatic classification of Tamil documents has not been explored in detail so far. In this paper, a corpus is used to construct and test the VSM and ANN models. Methods of document representation and of assigning weights that reflect the importance of each term are discussed. In a traditional word-matching-based categorization system, the most popular document representation is the VSM, which needs a high-dimensional space to represent the documents; the ANN classifier requires a smaller number of features. The experimental results show that the ANN model achieves 93.33% accuracy on Tamil document classification, better than the 90.33% yielded by the VSM.
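The VSM representation with term weighting can be sketched as a toy tf-idf vectorizer plus cosine similarity; the tokenized documents below are illustrative stand-ins for a Tamil corpus.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Build tf-idf vectors (one dict per document) over a tokenized corpus."""
    N = len(docs)
    df = Counter(t for d in docs for t in set(d))      # document frequency
    idf = {t: math.log(N / df[t]) for t in df}
    return [{t: d.count(t) * idf[t] for t in idf} for d in docs]

def cosine(u, v):
    """Cosine similarity of two vectors sharing the same vocabulary keys."""
    dot = sum(u[t] * v[t] for t in u)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

docs = [["tamil", "text", "classification"],
        ["tamil", "morphology"],
        ["english", "retrieval"]]
vecs = tfidf_vectors(docs)
print(round(cosine(vecs[0], vecs[1]), 3))
```

The dimensionality of these vectors equals the vocabulary size, which is the "high-dimensional space" drawback the abstract contrasts with the ANN's smaller feature set.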

7.
Vast amounts of medical information reside in text documents, so automatic retrieval of that information would certainly benefit clinical activities. The need to overcome the bottleneck of manual ontology construction has generated several studies of semi-automatic methods for building ontologies. Most techniques for learning domain ontologies from free text have important limitations: they extract concepts but generally produce only taxonomies, although other types of semantic relations are also relevant in knowledge modelling. This paper presents a language-independent approach for extracting knowledge from medical natural-language documents. The knowledge is represented by means of ontologies that can have multiple semantic relationships among concepts.

8.
This paper describes the complete process and a tool for the automatic construction of a multimedia hypertext starting from a large collection of multimedia documents. Through the use of an authoring methodology, the document collection is automatically authored, and the result is a multimedia hypertext, also called a hypermedia, written in hypertext mark-up language (HTML), almost a standard among hypermedia mark-up languages. The resulting hypermedia can be browsed and queried with Mosaic, an interface developed in the framework of the World Wide Web Project. In particular, the set of methods and techniques used for the automatic construction of hypermedia is described in this paper, and their relevance in the context of multimedia information retrieval is highlighted.

9.
This paper reports a document retrieval technique that retrieves machine-printed Latin-based document images through word shape coding. Adopting the idea of image annotation, a word shape coding scheme is proposed, which converts each word image into a word shape code by using a few shape features. The text contents of imaged documents are thus captured by a document vector constructed with the converted word shape code and word frequency information. Similarities between different document images are then gauged based on the constructed document vectors. We divide the retrieval process into two stages. Based on the observation that documents of the same language share a large number of high-frequency language-specific stop words, the first stage retrieves documents with the same underlying language as that of the query document. The second stage then re-ranks the documents retrieved in the first stage based on the topic similarity. Experiments show that document images of different languages and topics can be retrieved properly by using the proposed word shape coding scheme.
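The idea of word shape coding can be illustrated on plain text by mapping characters to coarse shape classes (ascender, descender, x-height), so that visually similar words collapse to the same code. This is only an analogy: the actual system derives shape features from word images, and the character classes below are a simplification.

```python
# Illustrative shape classes for lowercase Latin letters.
ASCENDERS = set("bdfhklt")
DESCENDERS = set("gjpqy")

def shape_code(word):
    """Map each letter to A (ascender), D (descender) or x (x-height)."""
    out = []
    for ch in word.lower():
        if ch in ASCENDERS:
            out.append("A")
        elif ch in DESCENDERS:
            out.append("D")
        elif ch.isalpha():
            out.append("x")
    return "".join(out)

print(shape_code("language"))
```

A document vector over such codes (plus frequencies) is much cheaper to compute than full OCR, which is the point of the scheme.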

10.
Unicode Programming for Information Processing of China's Multi-Ethnic Scripts   (cited 5 times: 0 self-citations, 5 by others)
With the rapid development of China's economy and information technology, digitizing text processing for minority scripts has become increasingly urgent. Based on the cooperative project "MDL" and the characteristics of each minority script, this paper discusses in detail the choice of coded character sets, Unicode programming, and code-page-to-Unicode character encoding conversion. Unicode programming techniques and character code conversion are the foundation of minority-script text processing and the key to full-text retrieval in "MDL". The establishment of "MDL" will play an important role in digital libraries for China's minority languages and scripts.
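The code-page-to-Unicode conversion discussed above can be sketched with Python's codec machinery; GB2312 stands in here for whichever legacy code page a given minority-script collection actually uses.

```python
# Round-trip a short string through a legacy code page.
raw = "多民族文字".encode("gb2312")    # bytes as stored under the old code page
text = raw.decode("gb2312")            # mapped back into Unicode code points
print(text, [hex(ord(c)) for c in text[:2]])
```

Once everything is in Unicode, a single full-text index can cover documents that were originally stored under many different code pages.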

11.
To classify texts in different languages under a shared category scheme and exploit the value of the growing volume of multilingual text data, this paper proposes BiLSTM-CNN, a multilingual text classification model combining bidirectional long short-term memory units with a convolutional neural network. For each language, a bidirectional LSTM extracts text features and a CNN refines them, yielding a deeper text representation for that language; the per-language representations are then concatenated and fed into a softmax function to predict the category. Experiments on a Chinese–English–Korean parallel corpus of scientific literature show that the method improves classification accuracy by 4% over the baseline, classifies text in any of the languages correctly, and extends well to new languages.

12.
The Semantic Web is an important means of improving the quality of Web information retrieval. Building on popular Web information retrieval systems, we extend their ability to process semantic annotations in Semantic Web documents and adopt a more effective document classification algorithm that organizes Semantic Web documents into clusters, improving retrieval quality. Compared with standard graph-matching algorithms, the algorithm presented here has a lower computational cost and eases subsequent storage, extraction and processing.

13.
With the ever-increasing growth of the World Wide Web, there is an urgent need for an efficient information retrieval system that can search and retrieve handwritten documents in response to user queries. However, unconstrained handwriting recognition remains a challenging task with inadequate performance, a major hurdle to a robust search experience over handwritten documents. In this paper, we describe our recent research, with a focus on information retrieval from noisy text derived from imperfect handwriting recognizers. First, we describe a novel term frequency estimation technique that incorporates word segmentation information into the retrieval framework to improve overall system performance. Second, we outline a taxonomy of techniques for the noisy-text retrieval task. The first method uses a novel bootstrapping mechanism to refine the OCR'ed text and uses the cleaned text for retrieval. The second uses the uncorrected, raw OCR'ed text but modifies the standard vector space model to handle noisy text. The third employs robust image features to index the documents instead of noisy OCR'ed text. We describe these techniques in detail and discuss their performance using standard IR evaluation metrics.

14.
Legal artificial intelligence has attracted wide attention in recent years for its efficiency and convenience. Legal documents are the most common form the law takes in social life, and intelligently processing their content with natural language understanding (NLU) methods is an important research and application direction. This paper surveys NLU techniques for legal documents. It first introduces five task types: legal document information extraction, similar-case retrieval, judicial question answering, legal document summarization, and judgment prediction. It then discusses the main challenges of applying existing NLU techniques to legal documents: bridging the differences in expression between legal documents and everyday language, modelling the reasoning and argumentation structures peculiar to legal documents, and integrating legal knowledge such as statutes and reasoning patterns into NLU models.

15.
Current XML query languages and query interfaces are too complex for Web users. This paper describes an XML document indexing mechanism and, on top of it, builds a general Bayesian-network query model. The user only needs to enter a natural-language query in an interactive interface; the system constructs a semantics-based interpretation of it and generates multiple structured queries. A Bayesian network is built over these queries, the probability of each query given the documents is computed, and the three most probable queries are submitted to the system for execution.
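The final selection step above (keep the three most probable structured queries) can be sketched as below; the XPath-like candidate queries and their probabilities are invented for illustration, not produced by a real Bayesian network.

```python
# Candidate structured queries with their (illustrative) probabilities
# given the document collection.
candidates = {
    "//book[title='XML']": 0.42,
    "//book[author='XML']": 0.31,
    "//article[title='XML']": 0.15,
    "//article[keyword='XML']": 0.08,
    "//book[keyword='XML']": 0.04,
}

# Keep the three most probable queries for execution.
top3 = sorted(candidates, key=candidates.get, reverse=True)[:3]
print(top3)
```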

16.
Although using domain-specific knowledge sources for information retrieval yields more accurate results than pure keyword-based methods, further improvements can be achieved by considering both the relations between concepts in an ontology and their statistical dependencies over the corpus. In this paper, an approach named concept-based pseudo-relevance feedback is introduced for improving the accuracy of biomedical retrieval systems. The proposed method uses a hybrid retrieval algorithm, combining keyword- and concept-based approaches, to assess the relevance of documents to queries. It also uses a pseudo-relevance feedback mechanism that expands initial queries with auxiliary biomedical concepts extracted from the top-ranked results of the hybrid retrieval. Using concept-based similarities allows the system to detect documents that are semantically close to a user's query without necessarily sharing keywords with it. In addition, expanding initial queries with concepts introduced by pseudo-relevance feedback captures relations between queries and documents that rely on statistical dependencies between the concepts they contain; such relations may remain undetected when only the links between concepts in an external knowledge source are examined. The proposed approach is evaluated on the OHSUMED test collection using standard evaluation methods from the Text REtrieval Conference (TREC). Experimental results on MEDLINE documents (in the OHSUMED collection) show a 21% improvement over the keyword-based approach in terms of mean average precision, a noticeable gain.
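The pseudo-relevance feedback step can be sketched in its plain keyword variant: take the top-k retrieved documents and add their most frequent unseen terms to the query. The toy documents are illustrative, and the paper expands with biomedical concepts rather than raw terms.

```python
from collections import Counter

def expand_query(query, ranked_docs, k=2, n_terms=2):
    """Add the n_terms most frequent new terms from the top-k documents."""
    pool = Counter(t for d in ranked_docs[:k] for t in d if t not in query)
    return query + [t for t, _ in pool.most_common(n_terms)]

docs = [["heart", "attack", "myocardial", "infarction"],
        ["myocardial", "ischemia", "therapy"],
        ["unrelated", "document"]]
print(expand_query(["heart", "attack"], docs))
```

In the concept-based variant, `ranked_docs` would hold concept annotations of the top results instead of their raw tokens.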

17.
In this article we illustrate a methodology for building a cross-language search engine, proposing a synergy between a thesaurus-based approach and a corpus-based approach. First, a bilingual ontology thesaurus is designed for two languages, English and Spanish, as a simple bilingual listing of terms, phrases, concepts and subconcepts. Second, term vector translation is used: a statistical multilingual text retrieval technique that maps statistical information about term use between languages (ontology co-learning). These techniques map sets of tf-idf term weights from one language to another. We also apply a query translation method to retrieve multilingual documents, with an expansion technique for phrasal translation. Finally, we present our findings.
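Term vector translation can be sketched as mapping tf-idf weights through a bilingual lexicon, splitting a term's weight evenly across its translations. The tiny English–Spanish lexicon below is an invented stand-in for the article's bilingual thesaurus.

```python
# Hypothetical bilingual lexicon: English term -> Spanish translations.
LEXICON = {"library": ["biblioteca"],
           "book": ["libro"],
           "search": ["búsqueda", "buscar"]}

def translate_weights(weights):
    """Map tf-idf weights into the target language, splitting weight
    evenly across multiple translations of a term."""
    out = {}
    for term, w in weights.items():
        translations = LEXICON.get(term, [])
        for t in translations:
            out[t] = out.get(t, 0.0) + w / len(translations)
    return out

print(translate_weights({"book": 1.2, "search": 0.8}))
```

More sophisticated schemes weight each translation by corpus statistics instead of splitting evenly; even splitting is the simplest defensible choice.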

18.
To find a document in the sea of information, you must embark on a search process, usually computer-aided. In the traditional information retrieval model, the final goal is to identify and collect a small number of documents to read in detail; a single query yielding a scalar indication of relevance usually suffices. In contrast, document corpus management seeks to understand what is happening in the collection of documents as a whole, i.e. to find relationships among documents. You may indeed read or skim individual documents, but only to better understand the rest of the document set. Document corpus management seeks to identify trends, discover common links and find clusters of similar documents; the results of many single queries must be combined in various ways so that you can discover trends. We describe a new system called the Stereoscopic Field Analyzer (SFA) that aids document corpus management by employing 3D volumetric visualization techniques in a minimally immersive, real-time interaction style. This interactive information visualization system combines two-handed interaction and stereoscopic viewing with glyph-based rendering of the corpora's contents. SFA is paired with a dynamic hypertext environment for text corpora, called Telltale, that provides text indexing, management and retrieval based on n-grams (n-character sequences of text). Telltale is a document management and information retrieval engine whose document similarity measures (n-gram-based m-dimensional vector inner products) are visualized by SFA for analyzing patterns and trends within the corpus.
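The n-gram-based similarity Telltale computes can be sketched as a normalized inner product (cosine) of character n-gram frequency vectors:

```python
import math
from collections import Counter

def ngrams(text, n=3):
    """Character n-gram frequency vector of a string."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def similarity(a, b, n=3):
    """Cosine of the two n-gram frequency vectors."""
    u, v = ngrams(a, n), ngrams(b, n)
    dot = sum(u[g] * v[g] for g in u)
    return dot / (math.sqrt(sum(c * c for c in u.values())) *
                  math.sqrt(sum(c * c for c in v.values())))

print(round(similarity("information retrieval", "information retrieved"), 2))
```

Because n-grams need no tokenizer or stemmer, the same index works across languages, which suits a corpus-level tool like SFA.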

19.
20.
Sentiment analysis and opinion mining are valuable for extracting useful subjective information from text documents. These tasks have become very important, especially for business and marketing professionals, since reviews of products and services posted online influence markets and consumer behaviour. This work is motivated by the fact that automating the retrieval and detection of the sentiments expressed about certain products and services involves complex processes and poses research challenges, owing to textual phenomena and language-specific variations in expression. This paper proposes a fast, flexible, generic methodology for detecting sentiment in textual snippets that express people's opinions in different languages. The proposed methodology adopts a machine learning approach in which textual documents are represented by vectors and used to train a polarity classification model. Several document vector representations have been studied, including lexicon-based, word-embedding-based and hybrid vectorizations. The competence of these feature representations for sentiment classification is assessed through experiments on four datasets containing online user reviews in Greek and English, representing highly and weakly inflected language groups. The proposed methodology requires minimal computational resources, so it may have impact in real-world scenarios where resources are limited.
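A minimal lexicon-based vectorization of the kind studied can be sketched as below: each snippet becomes a small feature vector (positive hits, negative hits, length) suitable for training a polarity classifier. The word lists are tiny illustrative stand-ins for a real sentiment lexicon.

```python
# Illustrative stand-in lexicons; real systems use curated,
# language-specific resources.
POS = {"good", "great", "excellent", "love"}
NEG = {"bad", "poor", "terrible", "hate"}

def vectorize(snippet):
    """Represent a snippet as [positive hits, negative hits, token count]."""
    tokens = snippet.lower().split()
    return [sum(t in POS for t in tokens),
            sum(t in NEG for t in tokens),
            len(tokens)]

print(vectorize("Great camera but terrible battery life"))
```

The word-embedding and hybrid vectorizations mentioned in the abstract would replace or augment these counts with dense vector features.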

