1.
2.
An information granulation based data mining approach for classifying imbalanced data
Recently, the class imbalance problem has attracted much attention from researchers in the field of data mining. When learning from imbalanced data, in which most examples are labeled as one class and only a few belong to another, traditional data mining approaches do not predict the crucial minority instances well. Unfortunately, many real-world data sets, such as those in health examination, inspection, credit fraud detection, spam identification and text mining, are all faced with this situation. In this study, we present a novel model called the “Information Granulation Based Data Mining Approach” to tackle this problem. The proposed methodology, which imitates the human ability to process information, acquires knowledge from Information Granules rather than from numerical data. The method also introduces a Latent Semantic Indexing based feature extraction tool that uses Singular Value Decomposition to dramatically reduce the data dimensionality. In addition, several data sets from the UCI Machine Learning Repository are employed to demonstrate the effectiveness of our method. Experimental results show that our method can significantly increase the ability to classify imbalanced data.
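As a rough illustration of the LSI-style dimensionality reduction described in this abstract, the Python sketch below uses scikit-learn's TfidfVectorizer and TruncatedSVD; the toy documents and the number of latent components are placeholders, not the paper's actual setup.

# Minimal sketch of LSI-style dimensionality reduction via truncated SVD.
# The documents and n_components are illustrative placeholders only.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

docs = [
    "credit card fraud detected in transaction log",
    "routine health examination report",
    "spam email identified by the filter",
    "normal transaction approved",
]
tfidf = TfidfVectorizer().fit_transform(docs)       # sparse term-document matrix
svd = TruncatedSVD(n_components=2, random_state=0)  # keep a few latent dimensions
reduced = svd.fit_transform(tfidf)                  # dense low-dimensional features
print(reduced.shape)                                # (4, 2)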
3.
This work studies a CURE-clustering-based method for segmenting Web pages into blocks, together with rules for extracting the main content block. Node attributes are added to the page's DOM tree, converting it into an extended DOM tree that carries the offsets of information nodes. The CURE algorithm is then used to cluster the information nodes, each resulting cluster representing a different block of the page. Finally, three main features of the content block are extracted to construct a block-weight formula, which is used to identify the main content block.
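A minimal sketch of the block-detection idea follows. CURE itself is not available in scikit-learn, so agglomerative clustering stands in for it here, and the per-node features (offset, text length, link density) and the weight formula are hypothetical stand-ins for the extended-DOM attributes and the three content-block features mentioned above.

# Cluster DOM information nodes and score clusters with a simple weight.
# Agglomerative clustering is a stand-in for CURE; features are hypothetical.
import numpy as np
from sklearn.cluster import AgglomerativeClustering

# one row per information node: [offset, text_length, link_density]
nodes = np.array([
    [120.0,  35.0, 0.9],   # navigation-like node
    [480.0, 600.0, 0.1],   # candidate main-content node
    [510.0, 550.0, 0.0],
    [900.0,  40.0, 0.8],   # footer-like node
])
labels = AgglomerativeClustering(n_clusters=3).fit_predict(nodes)

for c in set(labels):
    members = nodes[labels == c]
    # text-heavy, link-poor clusters get a high weight (toy formula)
    weight = members[:, 1].mean() * (1.0 - members[:, 2].mean())
    print("cluster", c, "weight", round(weight, 1))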
4.
Luca Iocchi, Journal of Network and Computer Applications, 1999, 22(4): 259
The enormous amount of information available through the World Wide Web requires the development of effective tools for extracting and summarizing relevant data from Web sources. In this article we present a data model for representing Web documents and an associated SQL-like query language. Our framework provides an easy-to-use and well-formalized method for automatic generation of wrappers extracting data from Web documents.
5.
Calado P.P., Ribeiro-Neto B., IEEE Transactions on Knowledge and Data Engineering, 2003, 15(1): 237-240
With the growing availability of online information systems, a need has arisen for user interfaces that are flexible and easy to use. For this type of system, an interface that allows the formulation of approximate queries can be of great utility, since it lets users quickly explore the database contents even when they are unaware of the exact values of the database instances. Our work focuses on this problem, presenting a new model for ranking approximate answers and a new algorithm, based on information retrieval techniques, to compute the semantic similarity between attribute values. To demonstrate the utility and usefulness of the approach, we perform a series of usability tests. The results suggest that our approach allows the retrieval of more relevant answers with less effort from the user.
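One common IR-style way to approximate the similarity between attribute values is TF-IDF weighting with cosine similarity; the sketch below illustrates only that general idea, not the authors' ranking model, and the attribute values are invented.

# Sketch: IR-style similarity between database attribute values
# (TF-IDF + cosine similarity; the values are invented examples).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

values = ["sports utility vehicle", "four wheel drive utility vehicle", "compact city car"]
matrix = TfidfVectorizer().fit_transform(values)
print(cosine_similarity(matrix).round(2))   # pairwise similarity of the values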
6.
Various microarray experiments are now done in many laboratories, resulting in the rapid accumulation of microarray data in public repositories. One of the major challenges of analyzing microarray data is how to extract and select efficient features from it for accurate cancer classification. Here we introduce a new feature extraction and selection method based on information gene pairs that show significant changes across different tissue samples. Experimental results on five public microarray data sets demonstrate that the feature subset selected by the proposed method performs well and achieves higher classification accuracy on several classifiers. We perform an extensive experimental comparison of the features selected by the proposed method and features selected by other methods, using different evaluation methods and classifiers. The results confirm that the proposed method performs as well as other methods on the acute lymphoblastic-acute myeloid leukemia, adenocarcinoma and breast cancer data sets while using fewer information genes, and leads to a significant improvement of classification accuracy on the colon and diffuse large B cell lymphoma cancer data sets.
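The abstract does not give the exact pair-selection criterion, so the sketch below only illustrates the general idea of scoring gene pairs by how well the within-pair expression difference separates two tissue classes; the data and the separation score are toy assumptions.

# Score gene pairs by class separation of their within-pair expression
# difference (illustrative criterion on synthetic data).
import numpy as np
from itertools import combinations

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 5))           # 20 samples x 5 genes (toy data)
y = np.array([0] * 10 + [1] * 10)      # two tissue classes

scores = {}
for i, j in combinations(range(X.shape[1]), 2):
    d = X[:, i] - X[:, j]              # expression difference of the pair
    scores[(i, j)] = abs(d[y == 0].mean() - d[y == 1].mean()) / (d.std() + 1e-9)

best = max(scores, key=scores.get)
print("best gene pair:", best, "score:", round(scores[best], 3))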
7.
TEG—a hybrid approach to information extraction
This paper describes a hybrid statistical and knowledge-based information extraction model, able to extract entities and relations at the sentence level. The model attempts to retain and improve the high accuracy levels of knowledge-based systems while drastically reducing the amount of manual labour by relying on statistics drawn from a training corpus. The implementation of the model, called TEG (trainable extraction grammar), can be adapted to any IE domain by writing a suitable set of rules in a SCFG (stochastic context-free grammar)-based extraction language and training them on an annotated corpus. The system does not contain any purely linguistic components, such as a PoS tagger or shallow parser, but allows external linguistic components to be used if necessary. We demonstrate the performance of the system on several named entity extraction and relation extraction tasks. The experiments show that our hybrid approach outperforms both purely statistical and purely knowledge-based systems, while requiring orders of magnitude less manual rule writing and smaller amounts of training data. We also demonstrate the robustness of our system under conditions of poor training-data quality.
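To make the notion of an SCFG-based extraction rule concrete, here is a toy stochastic grammar written with NLTK; it is not the actual TEG/DIAL rule language, and the vocabulary and probabilities are invented.

# Toy stochastic CFG illustrating the kind of rule a trainable extraction
# grammar might contain (NLTK PCFG; not the real TEG rule language).
import nltk

grammar = nltk.PCFG.fromstring("""
  S      -> PERSON REL ORG [1.0]
  PERSON -> 'john' 'smith' [0.6] | 'mary' 'jones' [0.4]
  REL    -> 'works' 'for' [1.0]
  ORG    -> 'acme' 'corp' [1.0]
""")
parser = nltk.ViterbiParser(grammar)
for tree in parser.parse("john smith works for acme corp".split()):
    print(tree)   # most probable parse, marking the PERSON and ORG entities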
Ronen Feldman is a senior lecturer at the Mathematics and Computer Science Department of Bar-Ilan University in Israel and the Director of the Data Mining Laboratory. He received his B.Sc. in Math, Physics and Computer Science from the Hebrew University, his M.Sc. in Computer Science from Bar-Ilan University, and his Ph.D. in Computer Science from Cornell University in NY. He was an Adjunct Professor at NYU Stern Business School. He is the founder of ClearForest Corporation, a Boston-based company specializing in the development of text mining tools and applications. He has given more than 30 tutorials on text mining and information extraction and authored numerous papers on these topics. He is currently finishing his book “The Text Mining Handbook”, to be published by Cambridge University Press.
Benjamin Rosenfeld is a research scientist at ClearForest Corporation. He received his B.Sc. in Mathematics and Computer Science from Bar-Ilan University. He is the co-inventor of the DIAL information extraction language.
Moshe Fresko is finalizing his Ph.D. in the Computer Science Department at Bar-Ilan University in Israel. He received his B.Sc. in Computer Engineering from Bogazici University, Istanbul, Turkey, in 1991, and his M.Sc. in 1994. He is also an adjunct lecturer at the Computer Science Department of Bar-Ilan University and serves as the Information-Extraction Group Leader in the Data Mining Laboratory.
8.
Maarek Y.S., Berry D.M., Kaiser G.E., IEEE Transactions on Pattern Analysis and Machine Intelligence, 1991, 17(8): 800-813
This paper describes a technology for automatically assembling large software libraries that promote software reuse by helping users locate the components closest to their needs. Software libraries are automatically assembled from a set of unorganized components using information retrieval techniques. The construction of the library is done in two steps. First, attributes are automatically extracted from natural language documentation using an indexing scheme based on the notions of lexical affinities and quantity of information. Then a hierarchy for browsing is automatically generated using a clustering technique that draws only on the information provided by the attributes. Owing to the free-text indexing scheme, tools following this approach can accept free-style natural language queries.
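As a loose sketch of the two steps (free-text indexing of component documentation, then clustering into a browsing hierarchy), the code below uses TF-IDF features and average-linkage clustering as stand-ins for the lexical-affinity indexing described above; the component descriptions are invented.

# Index component documentation and build a browsing hierarchy by clustering.
# TF-IDF + average linkage stand in for lexical-affinity indexing.
from sklearn.feature_extraction.text import TfidfVectorizer
from scipy.cluster.hierarchy import linkage

docs = {
    "open_file": "open a file and return a descriptor",
    "read_file": "read bytes from an open file descriptor",
    "send_mail": "send an electronic mail message to a user",
    "recv_mail": "receive electronic mail messages from a server",
}
X = TfidfVectorizer().fit_transform(list(docs.values())).toarray()
tree = linkage(X, method="average", metric="cosine")
print(tree)   # each row merges two clusters; this feeds the browsing hierarchy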
9.
Most biomedical signals are non-stationary, so knowledge of their frequency content and its temporal distribution is useful in a clinical context. Wavelet analysis is well suited to this task. The present paper uses this method to reveal hidden characteristics and anomalies of the human a-wave, an important component of the electroretinogram, since it is a measure of the functional integrity of the photoreceptors. We analyse the time–frequency features of the a-wave both in normal subjects and in patients affected by achromatopsia, a pathology disturbing the functionality of the cones. The results indicate the presence of two or three stable frequencies that, in the pathological case, shift toward lower values and change their times of occurrence. These findings are a first step toward a deeper understanding of the features of the a-wave and possible applications to diagnostic procedures for recognising incipient photoreceptoral pathologies.
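A continuous wavelet transform of a 1-D signal can be computed with PyWavelets as sketched below; the synthetic damped oscillation, sampling rate and scale range are placeholders for real a-wave recordings.

# Time-frequency analysis of a 1-D signal with a continuous wavelet transform.
# Signal, sampling rate and scales are placeholders for real a-wave data.
import numpy as np
import pywt

fs = 1000.0                                   # assumed sampling rate (Hz)
t = np.arange(0, 0.2, 1 / fs)                 # 200 ms window
signal = np.sin(2 * np.pi * 30 * t) * np.exp(-20 * t)   # toy damped oscillation

scales = np.arange(1, 64)
coeffs, freqs = pywt.cwt(signal, scales, "morl", sampling_period=1 / fs)
print(coeffs.shape, freqs[:3])                # time-frequency map and frequencies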
10.
In this paper, a simple and robust approach to flame and fire image analysis is proposed. It is based on local binary patterns, double thresholding and the Levenberg–Marquardt optimization technique. The presented algorithm detects sharp edges and removes noise and irrelevant artifacts. The auto-adaptive nature of the algorithm ensures that the primary edges of the flame and fire are identified under different conditions. Moreover, a graphical approach is presented that can be used to calculate the combustion furnace flame temperature. Various experiments carried out on synthetic as well as real flame and fire images validate the efficacy and robustness of the proposed approach.
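The sketch below shows only the local-binary-pattern and double-threshold ingredients on a synthetic image (scikit-image); the thresholds are arbitrary, and the Levenberg–Marquardt step and the temperature estimation are not reproduced.

# Local binary patterns plus a double (hysteresis-style) threshold on a
# synthetic image; parameters are illustrative, not the paper's.
import numpy as np
from skimage.feature import local_binary_pattern
from skimage.filters import apply_hysteresis_threshold

rng = np.random.default_rng(0)
img = rng.random((64, 64))
img[20:40, 20:40] += 1.5                      # bright "flame-like" region

lbp = local_binary_pattern(img, P=8, R=1, method="uniform")   # texture codes
mask = apply_hysteresis_threshold(img, low=0.8, high=1.2)     # double threshold
print(lbp.shape, int(mask.sum()))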
11.
To address the problems that the methods in existing information extraction systems are not reusable and cannot extract semantic information, a topic-oriented Web information extraction framework based on a domain ontology is proposed. For Chinese Web pages, information is interpreted using the ontology with the help of external resources; the techniques involved in document collection and preprocessing, including source document and information collection, document preprocessing and document storage, are analyzed and designed; word segmentation, lexicon lookup and named entity recognition algorithms for text conversion are proposed; and a knowledge extraction scheme is given. Experimental results show that the method achieves extraction results with relatively high performance.
12.
13.
14.
Recently, many e-commerce Web sites, such as Amazon.com, have provided platforms for users to review products and share their opinions, in order to help consumers make their best purchase decisions. However, the quality and the level of helpfulness of different product reviews are not disclosed to consumers unless they carefully analyze an immense number of lengthy reviews. Given the large amount of available online product reviews, this is an impossible task for any consumer. Therefore, it is of vital importance to develop recommender systems that can evaluate online product reviews effectively and recommend the most useful ones to consumers. This paper proposes an information gain-based model to predict the helpfulness of online product reviews, with the aim of suggesting the most suitable products and vendors to consumers. Reviews are analyzed and ranked by our scoring model, which identifies the reviews that help consumers more than others. In addition, we compare our model with several machine learning algorithms. Our experimental results show that our approach is effective in ranking and classifying online product reviews.
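For readers unfamiliar with information gain, the toy calculation below shows how the gain of a single binary review feature (say, whether a review mentions price) with respect to a helpful/unhelpful label could be computed; the counts are invented and this is not the paper's scoring model.

# Information gain of one binary review feature w.r.t. a helpfulness label.
# The counts are toy numbers for illustration only.
from math import log2

def entropy(pos, neg):
    total = pos + neg
    h = 0.0
    for count in (pos, neg):
        if count:
            p = count / total
            h -= p * log2(p)
    return h

# (helpful, unhelpful) counts, split by whether the feature is present
with_feature, without_feature = (30, 10), (15, 45)
n = 100
parent = entropy(45, 55)                     # label entropy before the split
children = ((sum(with_feature) / n) * entropy(*with_feature)
            + (sum(without_feature) / n) * entropy(*without_feature))
print("information gain:", round(parent - children, 3))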
15.
16.
Based on a study of currently widely used software component retrieval techniques, a retrieval method based on information extraction from software component description texts is proposed. The method extracts keywords using Chinese word segmentation and the term frequency-inverse document frequency algorithm from the vector space model, and computes the degree of match between user requirements and reusable software components through HowNet semantic similarity. This realizes semantic retrieval of software components, supports fuzzy queries, and offers a certain degree of flexibility.
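The indexing step (Chinese word segmentation plus TF-IDF keyword extraction) can be sketched with the jieba library as below; the component description is a made-up example and the HowNet-based semantic similarity is not reproduced here.

# Chinese word segmentation and TF-IDF keyword extraction with jieba.
# The component description is hypothetical; HowNet similarity is omitted.
import jieba
import jieba.analyse

description = "该构件提供用户登录验证与权限管理功能"        # hypothetical description text
tokens = jieba.lcut(description)                             # word segmentation
keywords = jieba.analyse.extract_tags(description, topK=5)   # TF-IDF keywords
print(tokens)
print(keywords)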
17.
To address the performance degradation caused by insufficient training samples in text information extraction, a rule-constrained deep learning network model is proposed. The model consists of three parts: a deep learning module, a logic rule base and a difference unit. Text sentences are fed into the learning module as input, and a prediction vector is generated for each word over multiple dimensions based on a Bi-GRU network and a multi-head self-attention mechanism; the rule base constrains the deep learning with weighted logic rules; and the difference unit uses a loss function to coordinate the consistency between the learning module and the rule base. Experimental results show that the proposed model outperforms other algorithms and can process complex text efficiently and accurately.
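A PyTorch sketch of the learning module's overall shape (embedding, Bi-GRU, multi-head self-attention, per-token prediction vectors) is given below; all dimensions are placeholders, and the weighted rule base and the difference unit are not modelled.

# Shape-only sketch of the learning module: embedding -> Bi-GRU ->
# multi-head self-attention -> per-token prediction vectors.
# Dimensions are placeholders; rule base and difference unit are omitted.
import torch
import torch.nn as nn

class BiGRUAttnTagger(nn.Module):
    def __init__(self, vocab=5000, emb=64, hidden=64, heads=4, tags=9):
        super().__init__()
        self.emb = nn.Embedding(vocab, emb)
        self.gru = nn.GRU(emb, hidden, batch_first=True, bidirectional=True)
        self.attn = nn.MultiheadAttention(2 * hidden, heads, batch_first=True)
        self.out = nn.Linear(2 * hidden, tags)

    def forward(self, token_ids):
        h, _ = self.gru(self.emb(token_ids))   # (batch, seq, 2*hidden)
        a, _ = self.attn(h, h, h)              # self-attention over the tokens
        return self.out(a)                     # one prediction vector per token

model = BiGRUAttnTagger()
logits = model(torch.randint(0, 5000, (2, 12)))   # 2 sentences of 12 tokens
print(logits.shape)                               # torch.Size([2, 12, 9])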
18.
An automatic approach for ontology-based feature extraction from heterogeneous textual resources
Carlos Vicient, David Sánchez, Antonio Moreno, Engineering Applications of Artificial Intelligence, 2013, 26(3): 1092-1106
Data mining algorithms such as data classification or clustering methods exploit features of entities to characterise, group or classify them according to their resemblance. In the past, many feature extraction methods focused on the analysis of numerical or categorical properties. In recent years, motivated by the success of the Information Society and the WWW, which has made enormous amounts of textual electronic resources available, researchers have proposed semantic data classification and clustering methods that exploit textual data at a conceptual level. To do so, these methods rely on pre-annotated inputs in which text has been mapped to its formal semantics according to one or several knowledge structures (e.g. ontologies, taxonomies). Hence, they are hampered by the bottleneck introduced by the manual semantic mapping process. To tackle this problem, this paper presents a domain-independent, automatic and unsupervised method to detect relevant features from heterogeneous textual resources, associating them with concepts modelled in a background ontology. The method has been applied to raw text resources and also to semi-structured ones (Wikipedia articles). It has been tested in the Tourism domain, showing promising results.
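A deliberately simplified sketch of associating text with ontology concepts follows: candidate tokens are matched against the label sets of a small hypothetical ontology by plain string matching, whereas the actual method relies on named-entity detection and semantic analysis rather than exact matching.

# Very simplified concept association: match tokens from raw text against the
# label variants of a hypothetical background ontology.
ontology = {
    "Accommodation": {"hotel", "hostel", "guest house"},
    "Activity": {"hiking", "diving", "museum visit"},
}

text = "The hotel offers guided hiking tours and a diving school."
tokens = {w.strip(".,").lower() for w in text.split()}

features = {concept for concept, labels in ontology.items() if labels & tokens}
print(features)   # {'Accommodation', 'Activity'}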
19.