Similar literature
1.
This paper proposes PME, a novel Chinese-English cross-lingual information retrieval (CECLIR) model in which a bilingual dictionary and comparable corpora are used to translate the query terms. The proximity and mutual information of term pairs in the Chinese and English comparable corpora are used not only to resolve translation ambiguities but also to perform query expansion, so as to deal with out-of-vocabulary issues in CECLIR. The evaluation results show that the query precision of the PME algorithm is about 84.4% of that of monolingual information retrieval.
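The abstract above resolves translation ambiguity with term-pair statistics from comparable corpora. The following is a minimal sketch of that general idea, not the PME algorithm itself: it uses only document-level positive pointwise mutual information and ignores proximity. The function names, the toy corpus, and the candidate lists are all hypothetical.

```python
import math
from collections import Counter
from itertools import combinations


def cooccurrence_counts(docs):
    """Return document frequencies for single terms and for unordered term pairs."""
    term_df, pair_df = Counter(), Counter()
    for doc in docs:
        terms = sorted(set(doc.lower().split()))
        term_df.update(terms)
        pair_df.update(frozenset(pair) for pair in combinations(terms, 2))
    return term_df, pair_df


def ppmi(a, b, term_df, pair_df, n_docs):
    """Positive pointwise mutual information (0.0 if the terms never co-occur)."""
    joint = pair_df[frozenset((a, b))]
    if joint == 0:
        return 0.0
    return max(0.0, math.log(joint * n_docs / (term_df[a] * term_df[b])))


def disambiguate(candidates, docs):
    """For each query term, pick the candidate translation that best fits the others."""
    term_df, pair_df = cooccurrence_counts(docs)
    chosen = []
    for i, options in enumerate(candidates):
        # Candidate translations of every *other* query term act as context.
        context = [c for j, opts in enumerate(candidates) if j != i for c in opts]
        chosen.append(max(
            options,
            key=lambda c: sum(ppmi(c, o, term_df, pair_df, len(docs)) for o in context),
        ))
    return chosen


# Hypothetical toy data: a tiny target-language corpus and dictionary
# candidates for a two-term query (first term ambiguous, second not).
corpus = [
    "the file system stores data",
    "a file system journal",
    "signed legal document",
    "document review meeting",
    "operating system kernel",
]
candidates = [["document", "file"], ["system"]]
result = disambiguate(candidates, corpus)
```

In this sketch a candidate that never co-occurs with the context scores zero, so a translation is preferred only when the corpus gives positive evidence for it. A fuller treatment would also weight pairs by how close together they occur.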

2.
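Query translation conversion algorithm in cross-language information retrieval   Cited by: 1 (self-citations: 0, by others: 1)
张孝飞  黄河燕  陈肇雄  代六玲 《计算机工程》2007,33(11):166-167,212
In cross-language information retrieval, the input query is often a string of keywords rather than a complete sentence, so the keyword sequence lacks the grammatical and contextual information needed for an accurate translation. Based on a large-scale bilingual corpus, and taking the vector space model and lexical co-occurrence mutual information as its theoretical basis, this paper applies traditional monolingual information retrieval techniques to convert the query translation problem into computing boost values for the dictionary senses of the query keywords, from which the target-language query is reconstructed.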

3.
An information retrieval system has to retrieve all and only those documents that are relevant to a user query, even if index terms and query terms do not match exactly. However, mismatches between index terms and query terms have been a serious obstacle to improving retrieval performance. In this article, we discuss automatic term normalization between words and phrases in text corpora and its application to a Korean information retrieval system. We perform three new types of term normalization: transliterated word normalization, noun phrase normalization, and context-based term normalization. Transliterated words are normalized into equivalence classes by using contextual similarity to alleviate lexical term mismatches. Then, noun phrases are normalized into phrasal terms by segmenting compound nouns as well as normalizing noun phrases. Moreover, context-based terms are normalized by using a combination of mutual information and word context to establish word similarities. Next, unsupervised clustering is done by using the K-means algorithm, and co-occurrence clusters are identified to alleviate semantic term mismatches. These term normalizations are used in both the indexing and the retrieval system. The experimental results show that our proposed system can alleviate three types of term mismatches and can also provide appropriate similarity measurements. As a result, our system can improve the retrieval effectiveness of the information retrieval system.

4.
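Existing entity search and natural-language query methods over text collections cannot handle queries involving complex entity relationships, which require linking information fragments scattered across different documents. Queries over knowledge bases can express complex relationships between entities, but recall is usually low because knowledge bases are heterogeneous and incomplete. To address these problems, this paper proposes extending the knowledge base with a text collection and designs triple-pattern queries containing text phrases to support unified querying over the knowledge base and the text data. On this basis, a query relaxation mechanism and a scoring model for result tuples are designed and implemented, and an efficient query processing method is given. Using YAGO, ClueWeb09, and the FACC1 dataset built on it, the approach is compared with two representative related works on three query test sets (entity retrieval, entity relationship retrieval, and complex entity relationship queries). The experimental results show that the entity relationship retrieval system using query relaxation rules on the extended knowledge graph greatly outperforms the other systems: on the three test sets, its mean average precision (MAP) exceeds that of the other systems by 27%, 37%, and 64% or more, respectively.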

5.
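Bilingual parallel corpora are an important foundation for building high-quality statistical machine translation (SMT) systems. Unlike the traditional strategy of improving translation quality by enlarging the parallel corpus, this paper aims to exploit the potential of existing resources as fully as possible to improve SMT performance. It proposes a method for selecting and optimizing SMT training data based on an information retrieval model: by selecting sentences from the existing training data that are similar to the text to be translated and using them as a training subset, translation results comparable to or even better than those obtained with the full data can be achieved without additional computing resources. Adding the selected subset back into the original training data to optimize the distribution of the training data further improves translation quality. Experiments show that the method is very effective at using existing data resources to improve SMT performance.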
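The selection step described above can be illustrated with a small retrieval sketch. The abstract does not say which retrieval model is used, so the TF-IDF weighting and cosine similarity here are assumptions, and all function names and the toy sentence pairs are hypothetical.

```python
import math
from collections import Counter


def build_index(sentences):
    """Return TF-IDF vectors for the sentences and the IDF table used to build them."""
    token_lists = [s.lower().split() for s in sentences]
    n = len(token_lists)
    df = Counter(t for tokens in token_lists for t in set(tokens))
    idf = {t: math.log((1 + n) / (1 + d)) + 1.0 for t, d in df.items()}
    vectors = [{t: c * idf[t] for t, c in Counter(tokens).items()}
               for tokens in token_lists]
    return vectors, idf


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def select_training_subset(train_pairs, test_sentences, k):
    """Keep the training pairs whose source side is among the top-k matches
    for at least one sentence to be translated."""
    vectors, idf = build_index([src for src, _ in train_pairs])
    selected = set()
    for sentence in test_sentences:
        counts = Counter(t for t in sentence.lower().split() if t in idf)
        query = {t: c * idf[t] for t, c in counts.items()}
        scored = [(cosine(query, vec), i) for i, vec in enumerate(vectors)]
        # Rank by similarity (ties broken by corpus position) and drop non-matches.
        ranked = sorted(scored, key=lambda item: (-item[0], item[1]))
        selected.update(i for score, i in ranked[:k] if score > 0.0)
    return [train_pairs[i] for i in sorted(selected)]


# Hypothetical toy parallel data (source sentence, target sentence).
train_pairs = [
    ("the cat sat", "le chat"),
    ("stock prices fell", "les prix"),
    ("the dog sat", "le chien"),
    ("market prices rose", "le marché"),
]
subset = select_training_subset(train_pairs, ["prices fell sharply"], k=1)
```

The returned pairs keep their original corpus order, so the subset can either be used on its own or appended to the full training data, the two uses the abstract mentions.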

6.
Yong Suk Choi 《Knowledge》2011,24(8):1139-1150
Recently, due to the widespread online availability of syntactically annotated text corpora, automated tools for searching such corpora have gained great attention. Conventional corpus search tools generally use a decomposition-matching-merging method based on relational predicates to match a tree pattern query to the desired parts of text corpora. As a result, query formulation is often complicated by poorly understood query formalisms, and searching may incur a large computational overhead because tree patterns are matched in many repeated trials. To overcome these difficulties, we present TPEMatcher, a tool for searching parsed text corpora. TPEMatcher provides not only an efficient way of formulating queries and searching but also good query expressivity based on the concise syntax and semantics of tree pattern queries. We also demonstrate that TPEMatcher can be used effectively for text mining in practice, with an interface that provides in-depth details of search results.

7.
With the public availability of a number of syntactically parsed text corpora, it has become increasingly important to efficiently extract desired information from such corpora. Many conventional approaches extract a desired text part by matching the parse tree of each sentence to a query represented as a structure of relational predicates expressing a common structural pattern of the desired text parts. Although those approaches can be useful for limited types of simple queries, they are not very efficient in general, because query formulations are sometimes very complicated for complex patterns of desired text parts, and query matching is likely to be exponentially time-consuming given the variety of complex sentence structures in a text corpus. To overcome this inadequacy, we present a novel tree pattern expression (TPE) that can represent various structural patterns intuitively and reduce pattern-matching complexity significantly. This paper first proposes TPE and its pattern-matching algorithm, and then theoretically analyzes the complexity of the proposed algorithm. It also illustrates a TPE-based information extraction system, which is applied to real text mining in a bio-text corpus. Finally, it presents experimental results and discusses them in comparison with other systems.

8.
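However complex a query statement is, it consists of a query target and query conditions. The query conditions determine the structure of the statement: they may be parallel or nested, and their order is not fixed. This paper applies the principles of information extraction to extract the semantic information of query conditions and proposes algorithms for doing so; these algorithms can form the various types of query conditions found in Chinese query statements. Experiments show that the algorithms can effectively extract the semantic information of query conditions.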

9.
A parallel corpus is an essential resource for statistical machine translation (SMT) but is often not available in the required amounts for all domains and languages. An approach is presented here which aims at producing parallel corpora from available comparable corpora. An SMT system is used to translate the source-language part of a comparable corpus, and the translations are used as queries to conduct information retrieval on the target-language side of the comparable corpus. Simple filters are then used to score the SMT output against the IR-returned sentence, with the filter score defining the degree of similarity between the two. Using SMT system output gives the benefit of trying to correct one of the common errors by sentence tail removal. The approach was applied to Arabic–English and French–English systems using comparable news corpora, and considerable improvements were achieved in the BLEU score. We show that our approach is independent of the quality of the SMT system used to make the queries, strengthening the claim that the approach is applicable to languages and domains with limited parallel corpora available to start with. We compare our approach with one of the earlier approaches and show that ours is easier to implement and gives equally good improvements.

10.
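To raise the level of informatization in colleges and universities and make it easier for smartphone users to look up school-related information, this paper, following an in-depth requirements analysis, uses Android technology and the SQLite database to study the design and implementation of a campus information platform client, taking 南通职业大学 (Nantong Vocational University) as an example. The client supports queries for campus news and announcements, admissions information, employment information, and academic affairs information. The paper also discusses the code implementation of the campus map and campus yellow pages. The application was developed in Java and has passed testing on an Android 2.2 emulator.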

11.
In recent years large amounts of electronic text have become available. While the first of these corpora had only a low level of annotation, the more recent ones are annotated with refined syntactic information. To make these rich annotations accessible to linguists, the development of query systems has become an important goal. One of the main difficulties in this task is the choice of the right query language: a language that is powerful enough to let users formulate the queries they want and that can also be evaluated efficiently to keep query response times short. There is a widespread belief that such a query language does not exist. The aim of this paper is therefore to show that there is indeed a powerful query language that can be efficiently evaluated. We propose the use of monadic second-order logic as a query language. We show that a query in this language can be evaluated in time linear in the size of a tree in the corpus. We also provide examples of complicated linguistic queries expressed in monadic second-order logic, thereby demonstrating the high expressive power of the language.

12.
Digitalization has changed the way information is processed, and new techniques for processing legal data are evolving. Text mining helps to analyze and search court cases available as digital text documents in order to extract case reasoning and related data. This kind of case processing helps professionals and researchers refer to previous cases more accurately and in less time. The rapid development of judicial ontologies appears to offer interesting solutions for formalizing legal knowledge. Mining context information from corpora through ontologies is a challenging and interesting field. This research paper presents a three-tier contextual text mining framework for judicial corpora based on ontologies. The framework comprises the judicial corpus, text mining processing resources, and ontologies for mining contextual text from corpora, making text and data mining faster and more reliable. A top-down ontology construction approach has been adopted in this paper. The judicial corpus has been selected with a dataset sufficient to process and evaluate the results. The experimental results and evaluations show significant improvements in comparison with the available techniques.

13.
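In traditional crime queries, the query condition is text and the result is an ordered list of documents, which cannot show the relationships among the results. Based on heterogeneous information networks, this work restructures counterfeit-currency crime data as an information network and builds a counterfeit-currency crime information network. Person-name disambiguation is used to establish relationships between suspects in the network, and learning to rank is used to study node relevance in the network. A counterfeit-currency crime information analysis system is designed and implemented, which addresses the querying of counterfeit-currency crime data by taking entity objects as query items and returning network graphs as query results.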

14.
This paper proposes an effective query-translation approach that makes it easier to support a cross-language information retrieval (CLIR) service in digital library systems that contain only monolingual content. A query-translation engine called LiveTrans is used to process the translation requests of cross-lingual queries from connected digital library systems. To automatically extract translations not covered by standard dictionaries, the engine is developed based on a novel integration of dictionary resources and Web mining approaches, including anchor-text and search-result methods. The engine exploits a broad range of multilingual Web resources used as live bilingual corpora to alleviate translation difficulties. It is shown to be particularly effective for extracting multilingual translation equivalents of query terms containing proper names or new terminology. The obtained results show the feasibility of and great potential for creating English-Chinese CLIR services in existing digital libraries and new applications in cross-language Web searching, although difficulties remain that need further investigation.

15.
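Nested named entities contain rich entities and semantic relations between entities, which helps improve the efficiency of information extraction. Because there is no unified standard corpus of Chinese nested named entities, current research on Chinese nested named entities is hard to compare. Building on existing named entity corpora, this paper constructs two Chinese nested named entity corpora with a semi-automatic method. Annotation information in existing Chinese named entity corpora is first used to construct as many nested named entities as possible automatically; manual adjustment is then carried out to meet the annotation requirements for Chinese nested entities, yielding high-quality corpora for Chinese nested named entity recognition. Preliminary within-corpus and cross-corpus experiments on nested entity recognition show that Chinese nested named entity recognition remains a fairly difficult problem that requires further research.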

16.
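Distributed information retrieval is an important research topic in information retrieval. To improve its performance, a distributed retrieval method based on the locality of document replicas is proposed. For any site, if local replicas are created for the non-local documents in its query results, query forwarding between sites during query processing can be reduced, and retrieval performance improves accordingly. Based on this idea, replica placement in distributed information retrieval is transformed into a query locality problem, a corresponding optimization model is established, and replica selection and placement strategies are proposed for different replica placement models. Simulation experiments show that, compared with related methods, the proposed method both improves the accuracy of query results and reduces query response time.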

17.
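Design and implementation of a GIS-based forest fire prevention information management system   Cited by: 6 (self-citations: 0, by others: 6)
This paper analyzes the importance of computer technology and geographic information system (GIS) technology in forest fire prevention. In view of the characteristics of forest fire prevention management, a GIS-based forest fire prevention information management system is designed and developed. The system supports intelligent positioning of fire points, fire-site information queries, firefighting information queries, decision support and command, and the query, statistics, and analysis of fire assessment records, promoting information-based management of forest fire prevention.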

18.
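To help users find the information they need quickly and accurately among abundant Web resources, a query optimization method based on a genetic algorithm is proposed. The basic idea is first to select expansion terms from the set of relevant documents according to how strongly each term co-occurs with all of the query terms and use them to expand the initial query, and then to use a genetic algorithm to select optimized weights for the expanded query. Experimental results show that the new method achieves higher recall and precision.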
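The first step described above, choosing expansion terms by their co-occurrence with all of the query terms, can be sketched as follows. The abstract does not give the scoring formula, so the log-damped sum of document-level co-occurrence counts is an assumption, and the genetic-algorithm weighting step is not shown. The function name and toy documents are hypothetical.

```python
import math
from collections import Counter


def expand_query(query_terms, relevant_docs, n_expansion):
    """Append the n_expansion terms that co-occur most with the query terms."""
    query = [q.lower() for q in query_terms]
    cooc = Counter()  # (candidate term, query term) -> number of shared documents
    for doc in relevant_docs:
        terms = set(doc.lower().split())
        present = [q for q in query if q in terms]
        for term in terms - set(query):
            for q in present:
                cooc[(term, q)] += 1

    candidates = {term for term, _ in cooc}
    # Log damping rewards terms that co-occur with several query terms
    # over terms that co-occur many times with only one of them.
    scores = {
        term: sum(math.log(1 + cooc[(term, q)]) for q in query)
        for term in candidates
    }
    ranked = sorted(scores, key=lambda term: (-scores[term], term))
    return query + ranked[:n_expansion]


# Hypothetical toy set of documents judged relevant to the initial query.
docs = [
    "genetic algorithm query optimization weights",
    "query expansion genetic algorithm terms",
    "cooking recipes pasta",
    "query optimization index",
]
expanded = expand_query(["query", "genetic"], docs, n_expansion=2)
```

The expanded term list would then be the input whose weights the genetic algorithm tunes. Ties are broken alphabetically here only to keep the output deterministic.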

19.
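E-commerce websites provide business information through query interfaces, which also carry the schema information of the Deep Web databases hidden behind them. Parsing query interfaces effectively is the first step in accessing Deep Web resources, but because query interfaces are implemented under different design patterns and development languages, attributes are hard to extract and semantic relationships are complex. To improve the accuracy of attribute extraction and to interpret query interfaces at the semantic level, an attribute extraction method based on heuristic information in query interfaces is proposed; an ontology tool is used to extend the attribute set and obtain semantic descriptions. Extensive experiments on real e-commerce websites demonstrate the feasibility and effectiveness of the proposed method.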

20.
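A genetic algorithm for Web information query optimization   Cited by: 1 (self-citations: 0, by others: 1)
To help users find the information they need quickly and accurately among abundant Web resources, a query optimization algorithm based on an enhanced genetic algorithm is proposed. The basic idea is to organize the query population into multiple query subpopulations called niches, each used to query one region of the document space. Crossover operators based on term weights and similar terms and an adaptive mutation operator are defined accordingly, and a local search mechanism is introduced to strengthen the algorithm's local search ability. Finally, the query results are merged in order of relevance and returned to the user. Experimental results show that the algorithm outperforms commonly used query optimization techniques in both query precision and computation speed.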
