Similar Articles
 20 similar articles found
1.
Content-based spam filtering is a binary text categorization problem. Feature selection, an important and indispensable step in text categorization, therefore also plays an important role in spam filtering. We propose a new method, named Bi-Test, which uses binomial hypothesis testing to estimate whether the probability of a feature belonging to spam satisfies a given threshold. We evaluated Bi-Test on six benchmark spam corpora (pu1, pu2, pu3, pua, lingspam and CSDMC2010) using two classification algorithms, Naïve Bayes (NB) and Support Vector Machines (SVM), and compared it with four well-known feature selection algorithms (information gain, χ2-statistic, improved Gini index and Poisson distribution). The experiments show that, in terms of the F1 measure, Bi-Test performs significantly better than the χ2-statistic and Poisson distribution and comparably with information gain and the improved Gini index when the Naïve Bayes classifier is used; it achieves performance comparable with the other methods when the SVM classifier is used. Moreover, Bi-Test executes faster than the other four algorithms.
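The core idea of Bi-Test can be illustrated with a small sketch. The exact test statistic and threshold handling of the paper are not reproduced here; the function names and the choice of a one-sided exact binomial test are assumptions for illustration:

```python
from math import comb

def binom_sf(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p): exact upper-tail probability."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

def bi_test_score(spam_with_term, docs_with_term, p0=0.5):
    """One-sided binomial test: among documents containing the term,
    is the spam proportion significantly above the threshold p0?
    Smaller p-values suggest a stronger spam-indicative feature."""
    return binom_sf(spam_with_term, docs_with_term, p0)
```

Features would then be ranked by ascending p-value and the top ones retained before training NB or SVM.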

2.
This paper analyses the impact of innovation on productivity in Taiwan. Using a panel of 48,794 firms observed over the 1997–2003 period and distributed across 23 industries, we compute total factor productivity (TFP) by estimating translog production functions with capital, labor, energy and materials (C, L, E, M) inputs. We evaluate the impact of being an innovator on TFP using propensity score matching. The rationale is that, over the period, innovating firms are likely to have benefited from one of many innovation policy measures known as the Statute for Upgrading Industry (SUI) (until 1999) or the "New SUI" (after 1999). Our results show a significantly negative effect of being an innovator on TFP in most industries, both before and after 1999. This suggests that firms with innovation expenditures either perform less well than the others because of unobserved factors, or are further away from the production frontier. Innovation in Taiwan therefore seems to be associated with catching-up strategies.

3.
In the recent trend toward touch-less biometric authentication systems, the hand knuckles on the dorsal part of the hand are gaining popularity as a potential candidate for verification/recognition in a variety of security applications. However, most available knuckle verification systems offer a fixed level of security for a desired accuracy, which cannot meet varying security requirements. This paper presents a bimodal knuckle verification system designed to cover a wide range of applications, from civilian use to high-security areas. We use ant colony optimization (ACO) to choose the optimal fusion parameters corresponding to each security level. The verification system employs a fuzzy binary decision tree (FBDT) that makes a two-class decision, genuine (accept) or impostor (reject), from matching scores computed on the knuckle database. The FBDT is built using the fuzzy Gini index for the selection of tree nodes. Experiments are carried out on four publicly available Hong Kong PolyU knuckle databases (left index, right index, left middle and right middle) with four bimodal systems: left–right index, left–right middle, left index–middle and right index–middle. The experimental results on these four bimodal knuckle databases validate the contributions of the proposed work.

4.
Reliable power supply is essential for industrial production and residents' daily life. Analyzing and mining the outage data held in power data platforms can reveal the latent patterns of distribution-network outages. Classification prediction is a common technique in data mining and analysis, and outage classification prediction can support outage planning decisions for enterprises and institutions. For this problem, we propose an outage-data classification model based on Factorization Machines (FM). A decision-tree algorithm computes the Gini index of each feature in the outage data to derive importance scores, from which the non-sparse features most relevant to outage prediction are selected. A spatial-position matrix between regions is built from their geographic relationships, and matrix factorization is used to construct geographic-association features for the regions. To prevent overfitting, L2-norm regularization is added to the FM model, which is then trained with stochastic gradient descent and used to classify the outage data. Experiments on a real outage dataset show that the model achieves an F1 score and accuracy of up to 0.90 and 0.89, respectively, on the training and test datasets, outperforming DNN, SVM and XGBoost models.
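A minimal sketch of the FM scoring function and an L2-regularized SGD step, assuming a squared-loss objective (the paper's exact loss and feature pipeline are not specified here); all names are illustrative:

```python
def fm_predict(x, w0, w, V):
    """Factorization machine score with the O(kn) pairwise trick:
    y = w0 + sum_i w_i x_i
           + 1/2 sum_f [(sum_i V[i][f] x_i)^2 - sum_i (V[i][f] x_i)^2]."""
    k = len(V[0])
    y = w0 + sum(wi * xi for wi, xi in zip(w, x))
    for f in range(k):
        s = sum(V[i][f] * x[i] for i in range(len(x)))
        s2 = sum((V[i][f] * x[i]) ** 2 for i in range(len(x)))
        y += 0.5 * (s * s - s2)
    return y

def fm_sgd_step(x, y, w0, w, V, lr=0.01, l2=0.001):
    """One stochastic-gradient step on squared loss with L2 regularization
    on w and V (the specific regularized loss is an assumption)."""
    err = fm_predict(x, w0, w, V) - y
    w0 -= lr * err
    for i in range(len(w)):
        w[i] -= lr * (err * x[i] + l2 * w[i])
    for f in range(len(V[0])):
        s = sum(V[j][f] * x[j] for j in range(len(x)))
        for i in range(len(x)):
            grad = x[i] * s - V[i][f] * x[i] ** 2  # d(pairwise)/dV[i][f]
            V[i][f] -= lr * (err * grad + l2 * V[i][f])
    return w0, w, V
```

The factorized pairwise term is what lets FM exploit the sparse geographic-association features the paper constructs.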

5.
We study repetitions in infinite words coding the exchange of three intervals with permutation (3, 2, 1), called 3iet words. The language of such words is determined by two parameters ε, ℓ. We show that finiteness of the index of 3iet words is equivalent to boundedness of the coefficients of the continued fraction of ε. In this case, we also give upper and lower estimates on the index of the corresponding 3iet word. Our main tool is the connection between a 3iet word with parameters ε, ℓ and Sturmian words with slope ε.

6.
Using tourism revenue data for the six counties and one district of Shangluo City from 2009 to 2018, we analyze the spatial variation of Shangluo's tourism economy over time with indicators such as the coefficient of variation, the Gini coefficient, and the Herfindahl index. The results show that: (1) the absolute differences in Shangluo's tourism economy are widening while the relative differences are narrowing; (2) the tourism development levels of the counties are becoming more balanced; (3) the tourism-economy rankings of Luonan, Zhen'an, Shanyang and Zhashui counties have risen; (4) the contribution of the tourism economy to the county-level economies keeps increasing. The findings provide an important reference for Shangluo City in developing all-region tourism and optimizing the spatial structure of its tourism industry.
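The three concentration indicators mentioned can be computed directly from a vector of county revenues; this is a generic sketch, not the study's code:

```python
def coefficient_of_variation(xs):
    """Relative dispersion: population standard deviation over the mean."""
    n = len(xs)
    mean = sum(xs) / n
    var = sum((x - mean) ** 2 for x in xs) / n
    return var ** 0.5 / mean

def gini(xs):
    """Gini coefficient via the mean absolute difference."""
    n = len(xs)
    mean = sum(xs) / n
    mad = sum(abs(a - b) for a in xs for b in xs) / (n * n)
    return mad / (2 * mean)

def herfindahl(xs):
    """Herfindahl index: sum of squared revenue shares."""
    total = sum(xs)
    return sum((x / total) ** 2 for x in xs)
```

A perfectly even distribution gives a Gini of 0 and a Herfindahl of 1/n; rising values over the years would indicate growing concentration.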

7.
Existing encoding schemes and index structures proposed for XML query processing primarily target the containment relationships, specifically the parent–child and ancestor–descendant relationships. The presence of the preceding-sibling and following-sibling location steps in the XPath specification, the de facto query language for XML, makes horizontal navigation, besides vertical navigation, among the nodes of XML documents a necessity for efficient evaluation of XML queries. Our work enhances the existing range-based and prefix-based encoding schemes so that all structural relationships between XML nodes can be determined from their codes alone. Furthermore, an external-memory index structure based on the traditional B+-tree, the XL+-tree (XML Location+ tree), is introduced to index element sets such that all location steps defined in the XPath language, vertical and horizontal, top-down and bottom-up, can be processed efficiently. The XL+-trees under the range and prefix encoding schemes actually share the same structure, but various search operations upon them may differ slightly as a result of the richer information provided by the prefix encoding scheme. Finally, experiments are conducted to validate the efficiency of the XL+-tree approach. We compare the query performance of the XL+-tree with that of the R-tree, which is capable of handling comprehensive XPath location steps and has been empirically shown to outperform other indexing approaches.
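A range-based encoding can be sketched as follows: each node receives a (start, end) interval from a depth-first traversal plus its depth; interval containment decides the vertical relationships and document order decides the horizontal ones. The Node layout and helper names are illustrative, not the paper's XL+-tree format:

```python
from collections import namedtuple

# Range code from a depth-first traversal of the XML document.
Node = namedtuple("Node", "start end level")

def is_ancestor(a, d):
    """Ancestor-descendant: a's interval strictly contains d's."""
    return a.start < d.start and d.end < a.end

def is_parent(a, d):
    """Parent-child: containment plus adjacent depth levels."""
    return is_ancestor(a, d) and a.level == d.level - 1

def parent(x, nodes):
    """Nearest containing node: the ancestor with the largest start."""
    anc = [n for n in nodes if is_ancestor(n, x)]
    return max(anc, key=lambda n: n.start) if anc else None

def is_preceding_sibling(a, b, nodes):
    """a is a preceding sibling of b: same parent, earlier in document order."""
    return parent(a, nodes) == parent(b, nodes) and a.end < b.start
```

The sibling axes are the ones that plain containment codes cannot answer alone, which is the motivation for the enhanced encodings in the paper.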

8.
Computers & Geosciences, 1987, 13(5): 463-494
ROBUST calculates 53 statistics, plus significance levels for 6 hypothesis tests, on each of up to 52 variables. Together these allow the following properties of the data distribution of each variable to be examined in detail: (1) Location. Three means (arithmetic, geometric, harmonic) are calculated, together with the midrange and 19 high-performance robust L-, M-, and W-estimates of location (combined, adaptive, trimmed estimates, etc.). (2) Scale. The standard deviation is calculated along with the H-spread/2 (≈ semi-interquartile range), the mean and median absolute deviations from both mean and median, and a biweight scale estimator. The 23 location and 6 scale estimators programmed cover all possible degrees of robustness. (3) Normality. Distributions are tested against the null hypothesis that they are normal, using the 3rd (√b1) and 4th (b2) moments, Geary's ratio (mean deviation/standard deviation), Filliben's probability plot correlation coefficient, and a more robust test based on the biweight scale estimator. These statistics collectively are sensitive to most usual departures from normality. (4) Presence of outliers. The maximum and minimum values are assessed individually or jointly using Grubbs' maximum Studentized residuals, Harvey's and Dixon's criteria, and the Studentized range. For a single input variable, outliers can be either winsorized or eliminated, and all estimates recalculated iteratively as desired. The following data transformations can also be applied: linear, log10, generalized Box-Cox power (including log, reciprocal, and square root), exponentiation, and standardization. For more than one variable, all results are tabulated in a single run of ROBUST. Further options are incorporated to assess ratios (of two variables) as well as discrete variables, and to handle missing data. Cumulative S-plots (for assessing normality graphically) can also be generated.
The mutual consistency or inconsistency of all these measures helps to detect errors in the data as well as to assess the data distributions themselves.
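A few of the robust estimators ROBUST computes (trimmed mean, median absolute deviation, winsorization) can be sketched generically; this is not the original program:

```python
def median(xs):
    """Sample median (average of the two central order statistics if n is even)."""
    xs = sorted(xs)
    n, mid = len(xs), len(xs) // 2
    return xs[mid] if n % 2 else (xs[mid - 1] + xs[mid]) / 2

def trimmed_mean(xs, prop=0.1):
    """Mean after dropping a proportion `prop` of values from each tail."""
    xs = sorted(xs)
    k = int(len(xs) * prop)
    kept = xs[k:len(xs) - k] if k else xs
    return sum(kept) / len(kept)

def mad(xs):
    """Median absolute deviation from the median (a robust scale estimate)."""
    m = median(xs)
    return median([abs(x - m) for x in xs])

def winsorize(xs, prop=0.1):
    """Clamp each tail's extremes to the nearest retained order statistic."""
    s = sorted(xs)
    k = int(len(xs) * prop)
    lo, hi = s[k], s[len(s) - 1 - k]
    return [min(max(x, lo), hi) for x in xs]
```

Against a sample with a gross outlier such as [1, 2, 3, 4, 100], these estimators stay near the bulk of the data while the arithmetic mean (22) does not.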

9.
We frequently use the standard correlation coefficient to quantify the linear relation between two given variables of interest in crisp industrial data. On the other hand, in many real-world applications involving the opinions of experts, the domain of a variable of interest, e.g. the rating of the innovativeness of a new product idea, is often composed of subjective linguistic concepts such as very poor, poor, average, good and excellent. In this article, we extend the standard correlation coefficient to this subjective, linguistic setting, so as to quantify relations in imprecise industrial and management data. Unlike the correlation measures for fuzzy variables proposed in the literature, the present approach allows one to develop a correlation coefficient for linguistic variables that can account for and reflect the conditional dependence assumptions underlying a given data set. We apply the proposed method to quantify the degree of correlation between the technology and management achievements of 15 large-scale machinery firms in Taiwan. It is shown that the flexibility of the present framework in allowing appropriate conditional dependence assumptions to be incorporated when deriving a correlation measure for linguistic variables can be essential in approximate reasoning applications.

10.
We address the problem of building an index for a set D of n strings, where each string location is a subset of some finite integer alphabet of size σ, so that we can efficiently answer whether a given simple query string p (where each string location is a single symbol) occurs in the set. That is, we need to efficiently find a string d ∈ D such that p[i] ∈ d[i] for every i. We show how to build such an index in O(n log σ/Δ(σ) log(n)) average time, where Δ is the average size of the subsets. Our methods have applications, e.g., in computational biology (haplotype inference) and music information retrieval.
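The query semantics can be illustrated with a naive linear scan (the paper's index answers the same question without examining every string); the names are illustrative:

```python
def matches(p, d):
    """p occurs at d iff lengths agree and each symbol p[i] lies in the set d[i]."""
    return len(p) == len(d) and all(c in s for c, s in zip(p, d))

def query(p, D):
    """Naive scan over the set D of subset-strings; the indexed variant
    in the paper avoids touching every element of D."""
    return any(matches(p, d) for d in D)
```

In the haplotype-inference application, each position's subset would hold the possible alleles at that site.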

11.
The PROSPECT leaf optical model has, to date, combined the effects of photosynthetic pigments, but a finer discrimination among the key pigments is important for physiological and ecological applications of remote sensing. Here we present a new calibration and validation of PROSPECT that separates plant pigment contributions to the visible spectrum using several comprehensive datasets containing hundreds of leaves collected in a wide range of ecosystem types. These data include leaf biochemical (chlorophyll a, chlorophyll b, carotenoids, water, and dry matter) and optical properties (directional-hemispherical reflectance and transmittance measured from 400 nm to 2450 nm). We first provide distinct in vivo specific absorption coefficients for each biochemical constituent and determine an average refractive index of the leaf interior. Then we invert the model on independent datasets to check the prediction of the biochemical content of intact leaves. The main result of this study is that the new chlorophyll and carotenoid specific absorption coefficients agree well with available in vitro absorption spectra, and that the new refractive index displays interesting spectral features in the visible, in accordance with physical principles. Moreover, we improve the chlorophyll estimation (RMSE = 9 µg/cm2) and obtain very encouraging results with carotenoids (RMSE = 3 µg/cm2). Reconstruction of reflectance and transmittance in the 400-2450 nm wavelength domain using PROSPECT is also excellent, with small errors and low to negligible biases. Improvements are particularly noticeable for leaves with low pigment content.

12.
We consider the problem of indexing a string t of length n to report the occurrences of a query pattern p containing m characters and j wildcards. Let occ be the number of occurrences of p in t, and σ the size of the alphabet. We obtain the following results.
  • A linear space index with query time O(m + σ^j log log n + occ). This significantly improves the previously best known linear space index by Lam et al. (in Proc. 18th ISAAC, pp. 846–857, [2007]), which requires query time Θ(jn) in the worst case.
  • An index with query time O(m + j + occ) using space O(σ^{k²} n log^k log n), where k is the maximum number of wildcards allowed in the pattern. This is the first non-trivial bound with this query time.
  • A time-space trade-off, generalizing the index by Cole et al. (in Proc. 36th STOC, pp. 91–100, [2004]).
We also show that these indexes can be generalized to allow variable length gaps in the pattern. Our results are obtained using a novel combination of well-known and new techniques, which could be of independent interest.
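For contrast with the index bounds above, the baseline that such indexes improve on is the naive O(nm) scan over all positions of t; a sketch:

```python
def occurrences(t, p, wildcard='?'):
    """All positions where pattern p, with single-character wildcards,
    occurs in text t. Naive O(nm) scan, versus the indexed query times
    quoted in the abstract."""
    occ = []
    m = len(p)
    for i in range(len(t) - m + 1):
        if all(pc == wildcard or pc == tc for pc, tc in zip(p, t[i:i + m])):
            occ.append(i)
    return occ
```

The wildcard character `'?'` here is just a placeholder convention for this sketch.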

13.
The tracking of moving objects consists of two critical operations: location reporting, in which moving objects (or clients) send their locations to centralized servers, and index maintenance, through which centralized servers update the locations of moving objects. In existing location reporting techniques, each moving object reports its locations to servers by utilizing long-distance links such as 3G/4G. Corresponding to this location reporting strategy, servers need to respond to all the location updating requests from individual moving objects. Such techniques suffer from very high communication cost (due to the individual reporting using long-distance links) and high index update I/Os (due to the massive amount of location updating requests). In this paper, we present a novel Group-movement based location Reporting and Indexing (GRI) framework for location reporting (at moving object side) and index maintenance (at server side). In the GRI framework, we introduce a novel location reporting strategy which allows moving objects to report their locations to servers in a group (instead of individually) by aggregating the moving objects that share similar movement patterns through wireless local links (such as WiFi). At the server side, we present a dual-index, Hash-GTPR-tree (H-GTPR), to index objects sharing similar movement patterns. Our experimental results on synthetic and real data sets demonstrate the effectiveness and efficiency of our new GRI framework, as well as the location reporting strategy and the H-GTPR tree index technique.

14.
In mobile ad hoc and sensor networks, greedy-face-greedy (GFG) geographic routing protocols have been a topic of active research in recent years. Most of the GFG geographic routing protocols make an ideal assumption that nodes in the network construct a unit-disk graph, UDG, and extract a planar subgraph from the UDG for face routing. However, the assumption of UDG may be violated in realistic environments, which may cause the current GFG geographic routing protocols to fail. In this paper, we propose a Pre-Processed Cross Link Detection Protocol, PPCLDP, which extracts an almost planar subgraph from a realistic network graph, instead of a UDG, for face routing and makes the GFG geographic routing work correctly in realistic environments with obstacles. The proposed PPCLDP improves the previous work of Cross Link Detection Protocol, CLDP, with far less communication cost and better convergence time. Our simulation results show that the average communication cost and the average convergence time of PPCLDP are, respectively, 65% and 45% less than those of CLDP. This makes PPCLDP more desirable for mobile ad hoc and sensor networks.

15.
GeD spline estimation of multivariate Archimedean copulas
A new multivariate Archimedean copula estimation method is proposed in a non-parametric setting. The method uses the so-called Geometrically Designed splines (GeD splines) to represent the cdf of a random variable Wθ, obtained through the probability integral transform of an Archimedean copula with parameter θ. Sufficient conditions for the GeD spline estimator to possess the properties of the underlying theoretical cdf, K(θ, t), of Wθ are given. These conditions allow a three-step estimation procedure to be defined for solving the resulting non-linear regression problem with linear inequality constraints. In the proposed procedure, finding the number and location of the knots and the coefficients of the unconstrained GeD spline estimator is separated from solving the constrained least-squares optimisation problem. The resulting spline estimator is then used to recover the generator and the related Archimedean copula by solving an ordinary differential equation. The proposed method is truly multivariate, it is numerically efficient and can therefore be applied with large volumes of data and for dimensions d ≥ 2, as illustrated by the numerical examples presented.

16.
We provide in this article a branch-and-bound algorithm that solves the problem of finding the k closest pairs of points (p, q), p ∈ P, q ∈ Q, given two sets of points in the Euclidean plane P, Q stored in external memory, assuming that only one of the sets has a spatial index. This problem arises naturally in many scenarios, for instance when the set without an index is the answer to a spatial query. The main idea of our algorithm is to partition the space occupied by the set without an index into several cells or subspaces and to make use of the properties of a set of metrics defined on two Minimum Bounding Rectangles (MBRs). We evaluated our algorithm for different values of k by means of a series of experiments considering both synthetic and real-world datasets. We compared the performance of our algorithm with that of techniques that assume either that both datasets have a spatial index or that neither has one. The results show that our algorithm needs only between 0.3% and 35% of the disk accesses required by such techniques. Our algorithm also shows good scalability, both in terms of k and of the size of the data sets.
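A brute-force baseline for the k-closest-pairs problem (the paper's branch-and-bound prunes this quadratic search using MBR-based metrics) might look like:

```python
import heapq

def k_closest_pairs(P, Q, k):
    """Brute-force baseline: the k closest (p, q) pairs by squared
    Euclidean distance, chosen with a heap-based selection."""
    pairs = []
    for p in P:
        for q in Q:
            d = (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2
            pairs.append((d, p, q))
    return [(p, q) for _, p, q in heapq.nsmallest(k, pairs)]
```

Squared distances suffice for ranking, avoiding square roots; the indexed algorithm achieves the same result while reading only a fraction of the data from disk.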

17.
This paper suggests ways to facilitate creativity and innovation in software development. The paper applies four perspectives – Product, Project, Process, and People – to identify an outlook for software innovation. The paper then describes a new facility – the Software Innovation Research Lab (SIRL) – and a new method concept for software innovation – Essence – based on views, modes, and team roles. Finally, the paper reports on an early experiment using SIRL and Essence and identifies further research.

18.
In this paper, we introduce two new subclasses of p-valent analytic functions defined by means of a fractional derivative of order δ. We obtain various results, including coefficient bounds and distortion inequalities, for these function classes. Furthermore, we determine some inclusion relations for the (n,p,ε)-neighborhoods of a family of p-valent analytic functions with negative coefficients defined by means of a certain nonhomogeneous Cauchy-Euler differential equation.

19.
20.
Nowadays ubiquitous sensor stations are deployed worldwide to measure several geophysical variables (e.g. temperature, humidity, light) for a growing number of ecological and industrial processes. Although these variables are, in general, measured over large zones and long (potentially unbounded) periods of time, stations cannot cover every spatial location. Moreover, due to their huge volume, the produced data cannot be entirely recorded for future analysis. In this scenario, summarization, i.e. the computation of aggregates of the data, can be used to reduce the amount of data stored on disk, while interpolation, i.e. the estimation of unknown data at each location of interest, can be used to supplement station records. We illustrate a novel data mining solution, named interpolative clustering, that has the merit of addressing both these tasks in time-evolving, multivariate geophysical applications. It yields a time-evolving clustering model to summarize geophysical data, and computes a weighted linear combination of cluster prototypes to predict data. Clustering accounts for the local presence of spatial autocorrelation in the geophysical data. The weights of the linear combination reflect the inverse distance of the unseen data to each cluster geometry, where the cluster geometry is represented through shape-dependent sampling of the geographic coordinates of the clustered stations. Experiments performed with several data collections investigate the trade-off between the summarization capability and the predictive accuracy of the presented interpolative clustering algorithm.
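The prediction step, an inverse-distance-weighted combination of cluster prototypes, can be sketched as follows, using single cluster centers in place of the paper's shape-dependent geometry sampling (a deliberate simplification; all names are illustrative):

```python
def idw_predict(location, clusters, power=2):
    """Inverse-distance-weighted combination of cluster prototype values.
    `clusters` is a list of ((x, y) center, prototype value) pairs; the
    actual algorithm measures distances to a shape-dependent sample of
    each cluster's geometry rather than to a single center."""
    num, den = 0.0, 0.0
    for (cx, cy), value in clusters:
        d2 = (location[0] - cx) ** 2 + (location[1] - cy) ** 2
        if d2 == 0:
            return value  # the query point coincides with a center
        w = 1.0 / d2 ** (power / 2)
        num += w * value
        den += w
    return num / den
```

A query point equidistant from two prototypes receives their average, while points near one cluster are dominated by its prototype value.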

