1.
2.
An information granulation based data mining approach for classifying imbalanced data
Recently, the class imbalance problem has attracted much attention from researchers in the field of data mining. When learning from imbalanced data, in which most examples are labeled as one class and only a few belong to another, traditional data mining approaches do not predict the crucial minority instances well. Unfortunately, many real-world data sets, such as those in health examination, inspection, credit fraud detection, spam identification and text mining, are all faced with this situation. In this study, we present a novel model called the “Information Granulation Based Data Mining Approach” to tackle this problem. The proposed methodology, which imitates the human ability to process information, acquires knowledge from Information Granules rather than from numerical data. The method also introduces a Latent Semantic Indexing based feature extraction tool that uses Singular Value Decomposition to dramatically reduce the data dimensionality. In addition, several data sets from the UCI Machine Learning Repository are employed to demonstrate the effectiveness of our method. Experimental results show that our method can significantly increase the ability to classify imbalanced data.
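As a rough illustration of the LSI-style dimensionality reduction described in this abstract, the Python sketch below uses scikit-learn's TfidfVectorizer and TruncatedSVD; the toy documents and the number of latent components are placeholders, not the paper's actual setup.

# Minimal sketch of LSI-style dimensionality reduction via truncated SVD.
# The documents and n_components are illustrative placeholders only.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

docs = [
    "credit card fraud detected in transaction log",
    "routine health examination report",
    "spam email identified by the filter",
    "normal transaction approved",
]
tfidf = TfidfVectorizer().fit_transform(docs)       # sparse term-document matrix
svd = TruncatedSVD(n_components=2, random_state=0)  # keep a few latent dimensions
reduced = svd.fit_transform(tfidf)                  # dense low-dimensional features
print(reduced.shape)                                # (4, 2)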
3.
This work studies a CURE-clustering-based method for segmenting Web pages into blocks, together with rules for extracting the main content block. Node attributes are added to the page's DOM tree, converting it into an extended DOM tree that carries the offsets of information nodes. The CURE algorithm is then used to cluster the information nodes, each resulting cluster representing a different block of the page. Finally, three main features of the content block are extracted to construct a block-weight formula, which is used to identify the main content block.
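A minimal sketch of the block-detection idea follows. CURE itself is not available in scikit-learn, so agglomerative clustering stands in for it here, and the per-node features (offset, text length, link density) and the weight formula are hypothetical stand-ins for the extended-DOM attributes and the three content-block features mentioned above.

# Cluster DOM information nodes and score clusters with a simple weight.
# Agglomerative clustering is a stand-in for CURE; features are hypothetical.
import numpy as np
from sklearn.cluster import AgglomerativeClustering

# one row per information node: [offset, text_length, link_density]
nodes = np.array([
    [120.0,  35.0, 0.9],   # navigation-like node
    [480.0, 600.0, 0.1],   # candidate main-content node
    [510.0, 550.0, 0.0],
    [900.0,  40.0, 0.8],   # footer-like node
])
labels = AgglomerativeClustering(n_clusters=3).fit_predict(nodes)

for c in set(labels):
    members = nodes[labels == c]
    # text-heavy, link-poor clusters get a high weight (toy formula)
    weight = members[:, 1].mean() * (1.0 - members[:, 2].mean())
    print("cluster", c, "weight", round(weight, 1))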
4.
Luca Iocchi, Journal of Network and Computer Applications, 1999, 22(4): 259
The enormous amount of information available through the World Wide Web requires the development of effective tools for extracting and summarizing relevant data from Web sources. In this article we present a data model for representing Web documents and an associated SQL-like query language. Our framework provides an easy-to-use and well-formalized method for automatic generation of wrappers extracting data from Web documents.
5.
Calado P.P., Ribeiro-Neto B., IEEE Transactions on Knowledge and Data Engineering, 2003, 15(1): 237-240
With the growing availability of online information systems, a need has arisen for user interfaces that are flexible and easy to use. For this type of system, an interface that allows the formulation of approximate queries can be of great utility, since it lets users quickly explore the database contents even when they are unaware of the exact values of the database instances. Our work focuses on this problem, presenting a new model for ranking approximate answers and a new algorithm, based on information retrieval techniques, to compute the semantic similarity between attribute values. To demonstrate the utility and usefulness of the approach, we perform a series of usability tests. The results suggest that our approach allows the retrieval of more relevant answers with less effort from the user.
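One common IR-style way to approximate the similarity between attribute values is TF-IDF weighting with cosine similarity; the sketch below illustrates only that general idea, not the authors' ranking model, and the attribute values are invented.

# Sketch: IR-style similarity between database attribute values
# (TF-IDF + cosine similarity; the values are invented examples).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

values = ["sports utility vehicle", "four wheel drive utility vehicle", "compact city car"]
matrix = TfidfVectorizer().fit_transform(values)
print(cosine_similarity(matrix).round(2))   # pairwise similarity of the values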
6.
Various microarray experiments are now done in many laboratories, resulting in the rapid accumulation of microarray data in public repositories. One of the major challenges of analyzing microarray data is how to extract and select efficient features from it for accurate cancer classification. Here we introduce a new feature extraction and selection method based on information gene pairs that show significant changes across different tissue samples. Experimental results on five public microarray data sets demonstrate that the feature subset selected by the proposed method performs well and achieves higher classification accuracy on several classifiers. We perform an extensive experimental comparison of the features selected by the proposed method and features selected by other methods, using different evaluation methods and classifiers. The results confirm that the proposed method performs as well as other methods on the acute lymphoblastic-acute myeloid leukemia, adenocarcinoma and breast cancer data sets while using fewer information genes, and leads to a significant improvement of classification accuracy on the colon and diffuse large B cell lymphoma cancer data sets.
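The abstract does not give the exact pair-selection criterion, so the sketch below only illustrates the general idea of scoring gene pairs by how well the within-pair expression difference separates two tissue classes; the data and the separation score are toy assumptions.

# Score gene pairs by class separation of their within-pair expression
# difference (illustrative criterion on synthetic data).
import numpy as np
from itertools import combinations

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 5))           # 20 samples x 5 genes (toy data)
y = np.array([0] * 10 + [1] * 10)      # two tissue classes

scores = {}
for i, j in combinations(range(X.shape[1]), 2):
    d = X[:, i] - X[:, j]              # expression difference of the pair
    scores[(i, j)] = abs(d[y == 0].mean() - d[y == 1].mean()) / (d.std() + 1e-9)

best = max(scores, key=scores.get)
print("best gene pair:", best, "score:", round(scores[best], 3))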
7.
TEG—a hybrid approach to information extraction
This paper describes a hybrid statistical and knowledge-based information extraction model, able to extract entities and relations at the sentence level. The model attempts to retain and improve the high accuracy levels of knowledge-based systems while drastically reducing the amount of manual labour by relying on statistics drawn from a training corpus. The implementation of the model, called TEG (trainable extraction grammar), can be adapted to any IE domain by writing a suitable set of rules in a SCFG (stochastic context-free grammar)-based extraction language and training them on an annotated corpus. The system does not contain any purely linguistic components, such as a PoS tagger or shallow parser, but allows external linguistic components to be used if necessary. We demonstrate the performance of the system on several named entity extraction and relation extraction tasks. The experiments show that our hybrid approach outperforms both purely statistical and purely knowledge-based systems, while requiring orders of magnitude less manual rule writing and smaller amounts of training data. We also demonstrate the robustness of our system under conditions of poor training-data quality.
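To make the notion of an SCFG-based extraction rule concrete, here is a toy stochastic grammar written with NLTK; it is not the actual TEG/DIAL rule language, and the vocabulary and probabilities are invented.

# Toy stochastic CFG illustrating the kind of rule a trainable extraction
# grammar might contain (NLTK PCFG; not the real TEG rule language).
import nltk

grammar = nltk.PCFG.fromstring("""
  S      -> PERSON REL ORG [1.0]
  PERSON -> 'john' 'smith' [0.6] | 'mary' 'jones' [0.4]
  REL    -> 'works' 'for' [1.0]
  ORG    -> 'acme' 'corp' [1.0]
""")
parser = nltk.ViterbiParser(grammar)
for tree in parser.parse("john smith works for acme corp".split()):
    print(tree)   # most probable parse, marking the PERSON and ORG entities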
Ronen Feldman is a senior lecturer at the Mathematics and Computer Science Department of Bar-Ilan University in Israel and the Director of the Data Mining Laboratory. He received his B.Sc. in Math, Physics and Computer Science from the Hebrew University, his M.Sc. in Computer Science from Bar-Ilan University, and his Ph.D. in Computer Science from Cornell University in NY. He was an Adjunct Professor at NYU Stern Business School. He is the founder of ClearForest Corporation, a Boston-based company specializing in the development of text mining tools and applications. He has given more than 30 tutorials on text mining and information extraction and authored numerous papers on these topics. He is currently finishing his book “The Text Mining Handbook”, to be published by Cambridge University Press.
Benjamin Rosenfeld is a research scientist at ClearForest Corporation. He received his B.Sc. in Mathematics and Computer Science from Bar-Ilan University. He is the co-inventor of the DIAL information extraction language.
Moshe Fresko is finalizing his Ph.D. in the Computer Science Department at Bar-Ilan University in Israel. He received his B.Sc. in Computer Engineering from Bogazici University, Istanbul, Turkey, in 1991, and his M.Sc. in 1994. He is also an adjunct lecturer at the Computer Science Department of Bar-Ilan University and serves as the Information-Extraction Group Leader in the Data Mining Laboratory.
8.
Maarek Y.S., Berry D.M., Kaiser G.E., IEEE Transactions on Pattern Analysis and Machine Intelligence, 1991, 17(8): 800-813
This paper describes a technology for automatically assembling large software libraries that promote software reuse by helping users locate the components closest to their needs. Software libraries are automatically assembled from a set of unorganized components using information retrieval techniques. The construction of the library is done in two steps. First, attributes are automatically extracted from natural language documentation using an indexing scheme based on the notions of lexical affinities and quantity of information. Then a hierarchy for browsing is automatically generated using a clustering technique that draws only on the information provided by the attributes. Owing to the free-text indexing scheme, tools following this approach can accept free-style natural language queries.
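As a loose sketch of the two steps (free-text indexing of component documentation, then clustering into a browsing hierarchy), the code below uses TF-IDF features and average-linkage clustering as stand-ins for the lexical-affinity indexing described above; the component descriptions are invented.

# Index component documentation and build a browsing hierarchy by clustering.
# TF-IDF + average linkage stand in for lexical-affinity indexing.
from sklearn.feature_extraction.text import TfidfVectorizer
from scipy.cluster.hierarchy import linkage

docs = {
    "open_file": "open a file and return a descriptor",
    "read_file": "read bytes from an open file descriptor",
    "send_mail": "send an electronic mail message to a user",
    "recv_mail": "receive electronic mail messages from a server",
}
X = TfidfVectorizer().fit_transform(list(docs.values())).toarray()
tree = linkage(X, method="average", metric="cosine")
print(tree)   # each row merges two clusters; this feeds the browsing hierarchy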
9.
Most biomedical signals are non-stationary, so knowledge of their frequency content and its temporal distribution is useful in a clinical context. Wavelet analysis is well suited to this task. The present paper uses this method to reveal hidden characteristics and anomalies of the human a-wave, an important component of the electroretinogram, since it is a measure of the functional integrity of the photoreceptors. We analyse the time–frequency features of the a-wave both in normal subjects and in patients affected by achromatopsia, a pathology disturbing the functionality of the cones. The results indicate the presence of two or three stable frequencies that, in the pathological case, shift toward lower values and change their times of occurrence. These findings are a first step toward a deeper understanding of the features of the a-wave and possible applications to diagnostic procedures for recognising incipient photoreceptoral pathologies.
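A continuous wavelet transform of a 1-D signal can be computed with PyWavelets as sketched below; the synthetic damped oscillation, sampling rate and scale range are placeholders for real a-wave recordings.

# Time-frequency analysis of a 1-D signal with a continuous wavelet transform.
# Signal, sampling rate and scales are placeholders for real a-wave data.
import numpy as np
import pywt

fs = 1000.0                                   # assumed sampling rate (Hz)
t = np.arange(0, 0.2, 1 / fs)                 # 200 ms window
signal = np.sin(2 * np.pi * 30 * t) * np.exp(-20 * t)   # toy damped oscillation

scales = np.arange(1, 64)
coeffs, freqs = pywt.cwt(signal, scales, "morl", sampling_period=1 / fs)
print(coeffs.shape, freqs[:3])                # time-frequency map and frequencies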
10.
In this paper, a simple and robust approach to flame and fire image analysis is proposed. It is based on local binary patterns, double thresholding and the Levenberg–Marquardt optimization technique. The presented algorithm detects sharp edges and removes noise and irrelevant artifacts. The auto-adaptive nature of the algorithm ensures that the primary edges of the flame and fire are identified under different conditions. Moreover, a graphical approach is presented that can be used to calculate the combustion furnace flame temperature. Various experiments carried out on synthetic as well as real flame and fire images validate the efficacy and robustness of the proposed approach.
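The sketch below shows only the local-binary-pattern and double-threshold ingredients on a synthetic image (scikit-image); the thresholds are arbitrary, and the Levenberg–Marquardt step and the temperature estimation are not reproduced.

# Local binary patterns plus a double (hysteresis-style) threshold on a
# synthetic image; parameters are illustrative, not the paper's.
import numpy as np
from skimage.feature import local_binary_pattern
from skimage.filters import apply_hysteresis_threshold

rng = np.random.default_rng(0)
img = rng.random((64, 64))
img[20:40, 20:40] += 1.5                      # bright "flame-like" region

lbp = local_binary_pattern(img, P=8, R=1, method="uniform")   # texture codes
mask = apply_hysteresis_threshold(img, low=0.8, high=1.2)     # double threshold
print(lbp.shape, int(mask.sum()))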
11.
To address the problems that the methods in existing information extraction systems are not reusable and cannot extract semantic information, a topic-oriented Web information extraction framework based on a domain ontology is proposed. For Chinese Web pages, information is interpreted using the ontology with the help of external resources; the techniques involved in document collection and preprocessing, including source document and information collection, document preprocessing and document storage, are analyzed and designed; word segmentation, lexicon lookup and named entity recognition algorithms for text conversion are proposed; and a knowledge extraction scheme is given. Experimental results show that the method achieves extraction results with relatively high performance.
12.
13.
14.
Recently, many e-commerce Web sites, such as Amazon.com, have provided platforms for users to review products and share their opinions, in order to help consumers make their best purchase decisions. However, the quality and the level of helpfulness of different product reviews are not disclosed to consumers unless they carefully analyze an immense number of lengthy reviews. Given the large amount of available online product reviews, this is an impossible task for any consumer. Therefore, it is of vital importance to develop recommender systems that can evaluate online product reviews effectively and recommend the most useful ones to consumers. This paper proposes an information gain-based model to predict the helpfulness of online product reviews, with the aim of suggesting the most suitable products and vendors to consumers. Reviews are analyzed and ranked by our scoring model, which identifies the reviews that help consumers more than others. In addition, we compare our model with several machine learning algorithms. Our experimental results show that our approach is effective in ranking and classifying online product reviews.
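For readers unfamiliar with information gain, the toy calculation below shows how the gain of a single binary review feature (say, whether a review mentions price) with respect to a helpful/unhelpful label could be computed; the counts are invented and this is not the paper's scoring model.

# Information gain of one binary review feature w.r.t. a helpfulness label.
# The counts are toy numbers for illustration only.
from math import log2

def entropy(pos, neg):
    total = pos + neg
    h = 0.0
    for count in (pos, neg):
        if count:
            p = count / total
            h -= p * log2(p)
    return h

# (helpful, unhelpful) counts, split by whether the feature is present
with_feature, without_feature = (30, 10), (15, 45)
n = 100
parent = entropy(45, 55)                     # label entropy before the split
children = ((sum(with_feature) / n) * entropy(*with_feature)
            + (sum(without_feature) / n) * entropy(*without_feature))
print("information gain:", round(parent - children, 3))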
15.
16.
Based on a study of currently widely used software component retrieval techniques, a retrieval method based on information extraction from software component description texts is proposed. The method extracts keywords using Chinese word segmentation and the term frequency-inverse document frequency algorithm from the vector space model, and computes the degree of match between user requirements and reusable software components through HowNet semantic similarity. This realizes semantic retrieval of software components, supports fuzzy queries, and offers a certain degree of flexibility.
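The indexing step (Chinese word segmentation plus TF-IDF keyword extraction) can be sketched with the jieba library as below; the component description is a made-up example and the HowNet-based semantic similarity is not reproduced here.

# Chinese word segmentation and TF-IDF keyword extraction with jieba.
# The component description is hypothetical; HowNet similarity is omitted.
import jieba
import jieba.analyse

description = "该构件提供用户登录验证与权限管理功能"        # hypothetical description text
tokens = jieba.lcut(description)                             # word segmentation
keywords = jieba.analyse.extract_tags(description, topK=5)   # TF-IDF keywords
print(tokens)
print(keywords)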
17.
To address the performance degradation caused by insufficient training samples in text information extraction, a rule-constrained deep learning network model is proposed. The model consists of three parts: a deep learning module, a logic rule base and a difference unit. Text sentences are fed into the learning module as input, and a prediction vector is generated for each word over multiple dimensions based on a Bi-GRU network and a multi-head self-attention mechanism; the rule base constrains the deep learning with weighted logic rules; and the difference unit uses a loss function to coordinate the consistency between the learning module and the rule base. Experimental results show that the proposed model outperforms other algorithms and can process complex text efficiently and accurately.
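A PyTorch sketch of the learning module's overall shape (embedding, Bi-GRU, multi-head self-attention, per-token prediction vectors) is given below; all dimensions are placeholders, and the weighted rule base and the difference unit are not modelled.

# Shape-only sketch of the learning module: embedding -> Bi-GRU ->
# multi-head self-attention -> per-token prediction vectors.
# Dimensions are placeholders; rule base and difference unit are omitted.
import torch
import torch.nn as nn

class BiGRUAttnTagger(nn.Module):
    def __init__(self, vocab=5000, emb=64, hidden=64, heads=4, tags=9):
        super().__init__()
        self.emb = nn.Embedding(vocab, emb)
        self.gru = nn.GRU(emb, hidden, batch_first=True, bidirectional=True)
        self.attn = nn.MultiheadAttention(2 * hidden, heads, batch_first=True)
        self.out = nn.Linear(2 * hidden, tags)

    def forward(self, token_ids):
        h, _ = self.gru(self.emb(token_ids))   # (batch, seq, 2*hidden)
        a, _ = self.attn(h, h, h)              # self-attention over the tokens
        return self.out(a)                     # one prediction vector per token

model = BiGRUAttnTagger()
logits = model(torch.randint(0, 5000, (2, 12)))   # 2 sentences of 12 tokens
print(logits.shape)                               # torch.Size([2, 12, 9])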
18.
An automatic approach for ontology-based feature extraction from heterogeneous textual resources
Carlos Vicient, David Sánchez, Antonio Moreno, Engineering Applications of Artificial Intelligence, 2013, 26(3): 1092-1106
Data mining algorithms such as data classification or clustering methods exploit features of entities to characterise, group or classify them according to their resemblance. In the past, many feature extraction methods focused on the analysis of numerical or categorical properties. In recent years, motivated by the success of the Information Society and the WWW, which has made enormous amounts of textual electronic resources available, researchers have proposed semantic data classification and clustering methods that exploit textual data at a conceptual level. To do so, these methods rely on pre-annotated inputs in which text has been mapped to its formal semantics according to one or several knowledge structures (e.g. ontologies, taxonomies). Hence, they are hampered by the bottleneck introduced by the manual semantic mapping process. To tackle this problem, this paper presents a domain-independent, automatic and unsupervised method to detect relevant features from heterogeneous textual resources, associating them with concepts modelled in a background ontology. The method has been applied to raw text resources and also to semi-structured ones (Wikipedia articles). It has been tested in the Tourism domain, showing promising results.
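A deliberately simplified sketch of associating text with ontology concepts follows: candidate tokens are matched against the label sets of a small hypothetical ontology by plain string matching, whereas the actual method relies on named-entity detection and semantic analysis rather than exact matching.

# Very simplified concept association: match tokens from raw text against the
# label variants of a hypothetical background ontology.
ontology = {
    "Accommodation": {"hotel", "hostel", "guest house"},
    "Activity": {"hiking", "diving", "museum visit"},
}

text = "The hotel offers guided hiking tours and a diving school."
tokens = {w.strip(".,").lower() for w in text.split()}

features = {concept for concept, labels in ontology.items() if labels & tokens}
print(features)   # {'Accommodation', 'Activity'}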
19.