Similar Documents
20 similar documents found (search time: 218 ms)
1.

In a Common Law system, legal practitioners need frequent access to prior case documents that discuss relevant legal issues. Case documents are generally very lengthy, containing complex sentence structures, and reading them fully is a strenuous task even for legal practitioners. Having a concise overview of these documents can relieve legal practitioners from the task of reading the complete case statements. Legal catchphrases are (multi-word) phrases that provide a concise overview of the contents of a case document, and automated generation of catchphrases is a challenging problem in legal analytics. In this paper, we propose a novel supervised neural sequence tagging model for the extraction of catchphrases from legal case documents. Specifically, we show that incorporating document-specific information along with a sequence tagging model can enhance the performance of catchphrase extraction. We perform experiments over a set of Indian Supreme Court case documents, for which the gold-standard catchphrases (annotated by legal practitioners) are obtained from a popular legal information system. The performance of our proposed method is compared with that of several existing supervised and unsupervised methods, and our proposed method is empirically shown to be superior to all baselines.

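The abstract above frames catchphrase extraction as supervised neural sequence tagging. The neural model itself is not reproduced here; as a minimal sketch of the decoding step only, the following pure-Python function (illustrative, not from the paper) turns per-token BIO predictions into catchphrase spans:

```python
def extract_catchphrases(tokens, tags):
    """Collect multi-word spans from BIO tags ('B', 'I', 'O').

    A 'B' opens a new catchphrase; following 'I' tags extend it;
    anything else closes the current span.
    """
    phrases, current = [], []
    for token, tag in zip(tokens, tags):
        if tag == "B":
            if current:
                phrases.append(" ".join(current))
            current = [token]
        elif tag == "I" and current:
            current.append(token)
        else:
            if current:
                phrases.append(" ".join(current))
            current = []
    if current:
        phrases.append(" ".join(current))
    return phrases

# Toy legal sentence with hypothetical tags.
tokens = ["the", "writ", "petition", "was", "filed", "under", "article", "226"]
tags   = ["O",   "B",    "I",        "O",   "O",     "O",     "B",       "I"]
print(extract_catchphrases(tokens, tags))  # ['writ petition', 'article 226']
```

In the paper's setting the tags would come from the trained tagger rather than being given by hand.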

2.
Web legal information retrieval systems need the capability to reason with the knowledge modeled by legal ontologies. Using this knowledge, it is possible to represent and make inferences about the semantic content of legal documents. In this paper, a methodology for automatically creating a legal ontology by applying NLP techniques is proposed. The ontology is defined in the OWL semantic web language and is used in a logic programming framework, EVOLP+ISCO, to allow users to query the semantic content of the documents. ISCO allows an easy and efficient integration of declarative, object-oriented, and constraint-based programming techniques, with the capability to create connections to external databases. EVOLP is a dynamic logic programming framework allowing the definition of rules for actions and events. An application of the proposed methodology to the legal web information retrieval system of the Portuguese Attorney General's Office is described.

3.
Sentence-level extraction methods cannot adequately address the problem of event arguments being scattered across a Chinese document. To tackle this, a document-level event extraction method based on context fusion is proposed. The document is first split into paragraphs, and a bidirectional long short-term memory (BiLSTM) network extracts paragraph-level sequence features; a self-attention mechanism then captures interaction information across the paragraph context; these are fused with document-level sequence features to update the semantic representation; finally, event arguments are extracted via sequence labeling and matched to event types. Compared with other event extraction methods on the same Chinese dataset, experimental results show that the method effectively extracts event arguments scattered across a document and improves extraction performance.
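The pipeline above combines a BiLSTM with self-attention over paragraphs. Leaving the BiLSTM aside, a minimal pure-Python sketch of the self-attention step is shown below; the toy paragraph vectors and the absence of learned query/key/value projections are simplifying assumptions for illustration:

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def self_attention(vectors):
    """Scaled dot-product self-attention with Q = K = V = vectors.

    Each paragraph vector is replaced by a weighted mix of all
    paragraph vectors, so every paragraph sees document-wide context.
    """
    d = len(vectors[0])
    out = []
    for q in vectors:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in vectors]
        weights = softmax(scores)
        mixed = [sum(w * v[j] for w, v in zip(weights, vectors))
                 for j in range(d)]
        out.append(mixed)
    return out

paragraphs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # toy paragraph features
contextual = self_attention(paragraphs)
```

Each output vector is a convex combination of the inputs, which is what lets scattered event arguments in one paragraph draw on evidence from the rest of the document.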

4.
A Frequent-Wordset-Clustering-Based Classification Method for Massive Short Texts (Cited by: 1; self-citations: 0; citations by others: 1)
Wang Yongheng, Jia Yan, Yang Shuqiang. 《计算机工程与设计》 (Computer Engineering and Design), 2007, 28(8): 1744-1746, 1780
The rapid development of information technology has produced a massive accumulation of text data, a large portion of which consists of short texts. Text classification is essential for automatically acquiring knowledge from such massive short-text collections, but for short texts in which keywords occur only a few times, existing general text mining algorithms struggle to reach acceptable accuracy. Some semantics-based classification methods achieve better accuracy but are too inefficient for massive data. To address this problem, a novel short-text classification algorithm based on frequent-wordset clustering is proposed. The algorithm compresses the data via frequent-wordset clustering and uses semantic information for classification. Experiments show that its accuracy and performance exceed those of other algorithms when classifying massive short-text collections.
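As a rough illustration of the compression idea only (not the paper's actual algorithm), short texts can be grouped by the frequent word pairs they share; the `min_support` threshold and toy texts are assumptions for illustration:

```python
from collections import defaultdict
from itertools import combinations

def frequent_pair_groups(docs, min_support=2):
    """Group short texts by the frequent word pairs they share.

    Word pairs occurring in at least `min_support` documents act as
    cluster keys, compressing the collection before classification.
    """
    pair_docs = defaultdict(set)
    for i, doc in enumerate(docs):
        words = sorted(set(doc.lower().split()))
        for pair in combinations(words, 2):
            pair_docs[pair].add(i)
    return {pair: ids for pair, ids in pair_docs.items()
            if len(ids) >= min_support}

docs = [
    "cheap flight ticket",
    "cheap flight deal",
    "stock market news",
    "stock market crash",
]
groups = frequent_pair_groups(docs)
print(groups[("cheap", "flight")])  # {0, 1}
print(groups[("market", "stock")])  # {2, 3}
```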

5.
The rapid development of information technology has produced a massive accumulation of text data, a large portion of which consists of short texts. Text classification is essential for automatically acquiring knowledge from such massive short-text collections. However, because keywords occur only a few times in short texts and labeled training samples are usually scarce, existing general text mining algorithms struggle to reach acceptable accuracy. Some semantics-based classification methods achieve better accuracy but are too inefficient for massive data. This paper proposes a novel short-text classification algorithm based on a semantic feature graph of the text, using a kNN-like method for classification. Experiments show that its accuracy and performance exceed those of other algorithms when classifying massive short-text collections.

6.

Extensible Markup Language (XML) carries both structural and semantic information. Compared with plain text, XML is precise in description and rich in presentation, but these same properties prevent traditional natural language processing and data mining techniques from being applied directly. Since XML content and structure are not independent (content shapes structure, and structure acts on content), a tensor-based method for XML feature dimensionality reduction and combined similarity computation is proposed. XML documents are represented as tensors and reduced in dimensionality using a maximum-mutual-information-based method; a combined similarity measure that fuses XML structure and content is then used to capture their intrinsic relationship and joint effect, improving the performance of combined XML similarity computation. Experiments and analysis of the results verify the effectiveness of the proposed method.


7.
Sentence similarity based on semantic nets and corpus statistics (Cited by: 3; self-citations: 0; citations by others: 3)
Sentence similarity measures play an increasingly important role in text-related research and applications in areas such as text mining, Web page retrieval, and dialogue systems. Existing methods for computing sentence similarity have been adopted from approaches used for long text documents. These methods process sentences in a very high-dimensional space and are consequently inefficient, require human input, and are not adaptable to some application domains. This paper focuses directly on computing the similarity between very short texts of sentence length. It presents an algorithm that takes account of semantic information and word order information implied in the sentences. The semantic similarity of two sentences is calculated using information from a structured lexical database and from corpus statistics. The use of a lexical database enables our method to model human common sense knowledge and the incorporation of corpus statistics allows our method to be adaptable to different domains. The proposed method can be used in a variety of applications that involve text knowledge representation and discovery. Experiments on two sets of selected sentence pairs demonstrate that the proposed method provides a similarity measure that shows a significant correlation to human intuition.
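A minimal sketch of the combination of semantic and word-order similarity described above; plain word overlap stands in for the lexical-database semantic component, and the position-vector form and `delta` weight are simplifications for illustration:

```python
import math

def word_order_similarity(s1, s2):
    """Word-order component: 1 - ||r1 - r2|| / ||r1 + r2||, where r_i
    holds the positions of the joint word set in each sentence
    (0 when a word is absent -- a simplification)."""
    joint = sorted(set(s1) | set(s2))
    r1 = [s1.index(w) + 1 if w in s1 else 0 for w in joint]
    r2 = [s2.index(w) + 1 if w in s2 else 0 for w in joint]
    diff = math.sqrt(sum((a - b) ** 2 for a, b in zip(r1, r2)))
    total = math.sqrt(sum((a + b) ** 2 for a, b in zip(r1, r2)))
    return 1 - diff / total

def sentence_similarity(s1, s2, delta=0.85):
    """Overall score: delta * semantic + (1 - delta) * word order.
    Here the semantic part is plain word overlap (Jaccard); the paper
    instead uses a lexical database plus corpus statistics."""
    set1, set2 = set(s1), set(s2)
    semantic = len(set1 & set2) / len(set1 | set2)
    return delta * semantic + (1 - delta) * word_order_similarity(s1, s2)

a = "a quick review of the case".split()
b = "a quick summary of the case".split()
score = sentence_similarity(a, b)
```

The weighting favors the semantic component, mirroring the paper's observation that word order contributes a smaller, secondary signal.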

8.
In this paper, we propose a new approach using an ontology to improve the precision of terminology extraction from documents. First, a linguistic method is used to extract terminological patterns from documents. Then, similarity measures within the framework of the ontology are employed to rank the semantic dependency of the noun words in a pattern. Finally, a predefined proportion of the patterns, ranked by semantic dependency, is retained as terminology. Experiments on the Reuters-21578 corpus show that the WordNet ontology, which we adopted for the task of extracting terminology from English documents, significantly improves the precision of the classical linguistic method for terminology extraction.

9.
This paper proposes a multi-document automatic summarization strategy based on LSA and pLSA. First, the documents are segmented into paragraphs, which serve as the clustering units. A new feature extraction method builds a term-paragraph matrix, and LSA applies singular value decomposition to it, turning the high-dimensional vector space model representation into a low-dimensional representation in the latent semantic space. Then pLSA converts the data into a probabilistic model for computation. During summary generation, a centroid-based sentence selection method produces and outputs the summary. Experiments show that the proposed method effectively improves the quality of the generated summaries.
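The LSA step, singular value decomposition of the term-paragraph matrix followed by truncation, can be sketched in a few lines of NumPy; the toy matrix is an assumption, and the pLSA and centroid-selection stages are omitted:

```python
import numpy as np

def lsa_embed(term_paragraph, k=2):
    """Project a term-by-paragraph count matrix into a k-dimensional
    latent semantic space via truncated SVD (the LSA step)."""
    u, s, vt = np.linalg.svd(term_paragraph, full_matrices=False)
    # Columns of vt.T scaled by singular values = paragraph coordinates.
    return vt.T[:, :k] * s[:k]

# Toy matrix: rows = terms, columns = paragraphs; two topical blocks.
x = np.array([
    [2.0, 1.0, 0.0, 0.0],
    [1.0, 2.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 3.0],
    [0.0, 0.0, 3.0, 1.0],
])
coords = lsa_embed(x, k=2)
```

Paragraphs sharing vocabulary land close together in the latent space, which is what makes the subsequent paragraph clustering meaningful.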

10.
To address the problems that arise when the traditional vector space model and latent semantic analysis are applied to computer-assisted assessment, an ontology space representation model combining a domain ontology, first-order logic, and latent semantic analysis is proposed. The model uses first-order logic to represent binary relations extracted from short-answer questions and builds an index over them; latent semantic analysis is then used to compute the similarity between a relation in the relation set and the documents containing the corresponding passage, yielding the passage's average similarity over the relation subset. Experimental results show that the model represents passages better than the vector space model.

11.
12.
Web Mining Based on Kernel Methods (Cited by: 2; self-citations: 0; citations by others: 2)
Word-space classification methods have difficulty handling the high dimensionality of text and capturing its semantic concepts. Using kernel principal component analysis and support vector machines, a new method is proposed that extracts semantic concepts by reducing the dimensionality of text data and classifies texts based on those concepts. Documents are first mapped into a high-dimensional linear feature space to eliminate nonlinear features; principal component analysis in the mapped space then removes correlations among variables, achieving dimensionality reduction and semantic concept extraction and yielding a semantic concept space for the documents; finally, a support vector machine performs classification in that space. With a newly defined kernel function, the mapping to the semantic concept space need not be computed explicitly, so concept-based classification can be performed directly in the original document vector space. A kernelized GHA method adaptively and iteratively solves for the eigenvectors and eigenvalues of the kernel matrix, making the approach suitable for large-scale text classification. Experimental results show that the method is effective in improving text classification performance.
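The kernel PCA dimensionality-reduction step (kernel matrix, centering, top-k eigenvectors) can be sketched in NumPy. The SVM classifier and the kernelized GHA solver from the paper are omitted here, and the RBF kernel choice and toy "document vectors" are assumptions for illustration:

```python
import numpy as np

def rbf_kernel(x, gamma=1.0):
    """Pairwise RBF kernel matrix for rows of x."""
    sq = ((x[:, None, :] - x[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * sq)

def kernel_pca(x, k=2, gamma=1.0):
    """Kernel PCA: center the kernel matrix, take the top-k
    eigenvectors, and return the projected coordinates."""
    n = x.shape[0]
    kmat = rbf_kernel(x, gamma)
    one = np.full((n, n), 1.0 / n)
    kc = kmat - one @ kmat - kmat @ one + one @ kmat @ one
    vals, vecs = np.linalg.eigh(kc)          # ascending order
    idx = np.argsort(vals)[::-1][:k]         # top-k eigenpairs
    vals, vecs = vals[idx], vecs[:, idx]
    return vecs * np.sqrt(np.maximum(vals, 0.0))

# Toy "document vectors": two well-separated clusters.
x = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]])
z = kernel_pca(x, k=2)
```

Any classifier (the paper uses an SVM) can then operate on `z`, where nonlinear structure in the original space has become linearly separable.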

13.

Present-day information retrieval systems largely ignore the issues of lexical and compositional semantics, relying mainly on statistical measures for choosing or evolving an indexing scheme. This has been the reason for the decreasing precision in their responses, given an exponentially increasing number of Web pages. The work reported in this paper addresses this issue from a linguistic point of view. We show that the detection of domain-specific phrases can capture the task-specific semantics of documents. We introduce the notion of an n*-gram formalism to characterize domain-specific phrases and their variants, taking a few sample domains. A method to construct a phrase grammar from a small set of documents is proposed, along with a method of conceptual indexing based on the phrase grammar. To demonstrate the effectiveness of the proposed method, we have designed a versatile system that can perform concept-based retrieval in addition to several document-processing tasks, such as text classification, extraction-based summarization, context tracking, and semantic tagging. Collectively, the system can address the semantic content of documents. Considering that an average user prefers highly relevant results in the top-ranked subset to an exhaustively retrieved set, it is shown that the proposed system retrieves documents that are more conceptually relevant than those retrieved by Google, at the 95% confidence level.
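The n*-gram formalism is specific to the paper. As a generic stand-in for detecting domain-specific phrases, one can score bigrams by their domain-versus-background frequency ratio; the add-one smoothing, the `min_ratio` threshold, and the toy corpora are illustrative assumptions:

```python
from collections import Counter

def bigrams(text):
    words = text.lower().split()
    return list(zip(words, words[1:]))

def domain_phrases(domain_docs, background_docs, min_ratio=2.0):
    """Keep bigrams that are proportionally more frequent in the
    domain corpus than in the background corpus."""
    dom = Counter(bg for doc in domain_docs for bg in bigrams(doc))
    bak = Counter(bg for doc in background_docs for bg in bigrams(doc))
    dom_total = sum(dom.values()) or 1
    bak_total = sum(bak.values()) or 1
    phrases = []
    for bg, count in dom.items():
        dom_rate = count / dom_total
        bak_rate = (bak[bg] + 1) / bak_total   # add-one smoothing
        if dom_rate / bak_rate >= min_ratio:
            phrases.append(" ".join(bg))
    return phrases

domain = ["habeas corpus petition filed", "the habeas corpus writ"]
background = ["the cat sat on the mat", "the dog barked"]
print(domain_phrases(domain, background))  # ['habeas corpus']
```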

14.
This paper studies topic analysis for cross-media data in the food safety domain, fusing the semantics of multiple media types to accurately represent the topics of cross-media documents. With the surge of multimedia data related to food safety incidents, single-media topic analysis cannot fully reflect the topic distribution of the whole dataset, suffering from missing semantics, inconsistent topic spaces, and difficulty in semantic fusion. A cross-media topic analysis method is proposed: text and image data are first analyzed semantically with probabilistic generative models; the semantic correlations between cross-media data are then used for visual topic learning to build a visual topic model, establishing a mapping between visual data and textual topics. Simulation results show that the method effectively retrieves textual topics semantically related to images, tracks topics more accurately than text-only topic tracking, and can support the monitoring of food safety incidents.

15.
Most statistics-based text similarity measures first represent texts as term-frequency vectors using TF-IDF and then compute the cosine similarity between them. Because such methods ignore the semantic information of terms, they do not reflect text similarity well. Semantics-based methods can compensate for this deficiency but require a knowledge base to model the semantic relations between words. After examining the strengths and weaknesses of both families of methods, a novel text similarity measure is proposed: texts are first preprocessed; terms with high TF-IDF values are then selected as features; text similarity is computed by combining semantic analysis of the features, using the HowNet semantic dictionary, with TF-IDF term statistics; finally, the similarity measure is evaluated through clustering experiments on benchmark text datasets. Experimental results show that the F-measure obtained with the proposed method is clearly better than that of methods using TF-IDF alone or word semantics alone, demonstrating the effectiveness of the proposed text similarity measure.
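A minimal sketch of such a hybrid scheme, with a tiny hand-filled word-similarity table standing in for HowNet; the blending weight `alpha`, the averaging over word pairs, and the toy data are assumptions for illustration:

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Plain TF-IDF: tf * log(N / df) over a list of token lists."""
    n = len(docs)
    df = Counter(w for doc in docs for w in set(doc))
    return [{w: tf * math.log(n / df[w])
             for w, tf in Counter(doc).items()} for doc in docs]

def hybrid_similarity(v1, v2, word_sim, alpha=0.5):
    """Blend TF-IDF cosine with a term-level semantic score.
    `word_sim` maps word pairs to [0, 1]; HowNet plays this role
    in the paper."""
    shared = set(v1) & set(v2)
    dot = sum(v1[w] * v2[w] for w in shared)
    n1 = math.sqrt(sum(x * x for x in v1.values()))
    n2 = math.sqrt(sum(x * x for x in v2.values()))
    cosine = dot / (n1 * n2) if n1 and n2 else 0.0
    pairs = [(a, b) for a in v1 for b in v2 if a != b]
    sem = (sum(word_sim.get((a, b), 0.0) for a, b in pairs) / len(pairs)
           if pairs else 0.0)
    return alpha * cosine + (1 - alpha) * sem

docs = [["car", "engine"], ["automobile", "engine"], ["flower", "garden"]]
vecs = tfidf_vectors(docs)
word_sim = {("car", "automobile"): 0.9, ("automobile", "car"): 0.9}
sim_close = hybrid_similarity(vecs[0], vecs[1], word_sim)
sim_far = hybrid_similarity(vecs[0], vecs[2], word_sim)
```

The semantic table rewards the synonym pair car/automobile even though the two texts share only one surface term, which is exactly the gap a pure TF-IDF cosine leaves open.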

16.
A semantics-based method for opinion orientation analysis is proposed. Following the structure of the text and the principle of semantic proximity, the text is segmented into semantic segments; a conditional random field model extracts subjective content and identifies opinion orientation within each segment; segment weights are then computed to determine the opinion orientation of the whole text. Experiments show that, compared with traditional machine learning methods, this method effectively improves the accuracy of text opinion orientation analysis.

17.

Natural language processing techniques are contributing more and more to the analysis of legal documents, supporting the implementation of laws and rules by computers. Previous approaches to representing a legal sentence are often based on logical patterns that illustrate the relations between concepts in the sentence, which often consist of multiple words. Such representations lack semantic information at the word level. In our work, we aim to tackle these shortcomings by representing legal texts in the form of abstract meaning representation (AMR), a graph-based semantic representation that has recently gained popularity in the NLP community. We present our study of AMR parsing (producing AMR from natural language) and AMR-to-text generation (producing natural language from AMR) specifically for the legal domain. We also introduce JCivilCode, a human-annotated legal AMR dataset created and verified by a group of linguistic and legal experts. We conduct an empirical evaluation of various approaches to parsing and generating AMR on our own dataset and show the current challenges. Based on our observations, we propose a domain adaptation method applied in the training and decoding phases of a neural AMR-to-text generation model. Our method improves the quality of text generated from AMR graphs compared to the baseline model. (This work extends our two previous papers: "An Empirical Evaluation of AMR Parsing for Legal Documents", published at the Twelfth International Workshop on Juris-informatics (JURISIN) 2018; and "Legal Text Generation from Abstract Meaning Representation", published at the 32nd International Conference on Legal Knowledge and Information Systems (JURIX) 2019.)


18.
We present a new system for predicting the segmentation of online handwritten documents into multiple blocks, such as text paragraphs, tables, graphics, or mathematical expressions. A hierarchical representation of the document is adopted by aggregating strokes into blocks, and interactions between different levels are modeled in a tree conditional random field. Features are extracted and labels are predicted at each tree level with logistic classifiers, and belief propagation is adopted for optimal inference over the structure. Being fully trainable, the system is shown to properly handle difficult segmentation problems arising in unconstrained online note-taking documents, where no prior knowledge is available regarding the layout or the expected content. Our experiments show very promising results and allow us to envision fully automatic segmentation of free-form online notes.

19.
An automatic summarization method based on automatic paragraph clustering is proposed. First, term frequency statistics and word position features yield a keyword vector for the document and for each paragraph, and a paragraph-based vector space model is built. Then, the similarity between paragraphs is computed, the K-medoids clustering algorithm partitions the document into semantic segments, and a custom objective function adaptively determines the number of clusters K. Finally, the sentences most relevant to the topic are selected from each semantic segment, in their order of appearance in the original document, to form the summary.
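The K-medoids step can be sketched in pure Python over a precomputed paragraph-distance matrix. The adaptive choice of K via the objective function is omitted; the fixed k, the first-k initialization, and the toy distances are assumptions for illustration:

```python
def k_medoids(dist, k, max_iter=20):
    """Tiny k-medoids: start with the first k points as medoids,
    then alternate assignment and medoid update until stable.
    `dist` is a full symmetric distance matrix."""
    n = len(dist)
    medoids = list(range(k))
    for _ in range(max_iter):
        clusters = [[] for _ in range(k)]
        for i in range(n):
            nearest = min(range(k), key=lambda c: dist[i][medoids[c]])
            clusters[nearest].append(i)
        new_medoids = []
        for members in clusters:
            # The medoid minimizes total distance to its cluster members.
            best = min(members,
                       key=lambda m: sum(dist[m][j] for j in members))
            new_medoids.append(best)
        if new_medoids == medoids:
            break
        medoids = new_medoids
    return medoids, clusters

# Toy paragraph distances: two clear groups {0, 1} and {2, 3}.
d = [
    [0.0, 0.1, 0.9, 1.0],
    [0.1, 0.0, 0.8, 0.9],
    [0.9, 0.8, 0.0, 0.1],
    [1.0, 0.9, 0.1, 0.0],
]
medoids, clusters = k_medoids(d, k=2)
```

Unlike k-means, the medoid is always an actual paragraph, which suits summarization: each cluster center is a real candidate segment rather than an averaged vector.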

20.
Authorship Identification Based on Semantic Analysis (Cited by: 5; self-citations: 0; citations by others: 5)
Authorship identification is a widely applied line of research. Its key problems are extracting, from a work, features that characterize the author's style, and using those features to assess the stylistic similarity between works. Traditional identification methods examine features of style such as word choice, sentence construction, and paragraph organization; among them, analyses based on punctuation and the frequencies of the most common function words are widely accepted. Based on stylistics theory and the HowNet knowledge base, this paper proposes a new similarity assessment method based on lexical semantic analysis that makes effective use of words beyond function words and achieves good identification performance.
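The traditional baseline the paper builds on, function-word frequency profiles compared by cosine similarity, can be sketched as follows; the function-word list and the sample texts are illustrative assumptions:

```python
import math
from collections import Counter

FUNCTION_WORDS = ["the", "of", "and", "to", "in", "that", "is", "was"]

def style_vector(text):
    """Relative frequencies of common function words -- a classic
    stylometric feature for authorship attribution."""
    words = text.lower().split()
    counts = Counter(words)
    total = len(words) or 1
    return [counts[w] / total for w in FUNCTION_WORDS]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

known = "the court held that the petition was filed in the wrong forum"
sample = "the judge ruled that the appeal was heard in the high court"
other = "buy cheap tickets now and save money today"
score_same = cosine(style_vector(known), style_vector(sample))
score_diff = cosine(style_vector(known), style_vector(other))
```

The paper's contribution replaces the fixed function-word list with HowNet-based lexical semantic features, so that content words also contribute to the style profile.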


Copyright © Beijing Qinyun Technology Development Co., Ltd. (北京勤云科技发展有限公司)  京ICP备09084417号