Similar literature
 20 similar documents found (search time: 31 ms)
1.
Text classification constitutes a popular task in Web research with various applications that range from spam filtering to sentiment analysis. In this paper, we argue that its performance depends on the quality of Web documents, which varies significantly. For example, the curated content of news articles involves different challenges than the user-generated content of blog posts and Social Media messages. We experimentally verify our claim, quantifying the main factors that affect the performance of text classification. We also argue that the established bag-of-words representation models are inadequate for handling all document types, as they merely extract frequent, yet distinguishing terms from the textual content of the training set. Thus, they suffer from low robustness in the context of noisy or unseen content, unless they are enriched with contextual, application-specific information. In their place, we propose the use of n-gram graphs, a model that goes beyond the bag-of-words representation, transforming every document into a graph: its nodes correspond to character or word n-grams, and co-occurring n-grams are connected by weighted edges. Individual document graphs can be combined into class graphs, and graph similarities are employed to position documents in the vector space and classify them. This approach offers two advantages with respect to bag models. First, classification accuracy increases due to the contextual information that is encapsulated in the edges of the n-gram graphs. Second, it reduces the search space to a limited set of robust, endogenous features that depend on the number of classes, rather than the size of the vocabulary. Our thorough experimental study over three large, real-world corpora confirms the superior performance of n-gram graphs across the main types of Web documents.
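The n-gram graph idea in the abstract above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function names, the window size, the containment-style similarity, and the "pick the most similar class" rule are all assumptions made for the example. The abstract itself only states that nodes are n-grams, that co-occurring n-grams share weighted edges, and that similarities to class graphs serve as features.

```python
from collections import Counter

def ngram_graph(text, n=3, window=2):
    """Return a character n-gram graph as a Counter mapping edges to weights."""
    grams = [text[i:i + n] for i in range(len(text) - n + 1)]
    edges = Counter()
    for i, gram in enumerate(grams):
        # Link each n-gram to the n-grams that follow it within the window.
        for other in grams[i + 1:i + 1 + window]:
            edges[tuple(sorted((gram, other)))] += 1  # undirected, weighted edge
    return edges

def build_class_graph(documents, n=3, window=2):
    """Merge the document graphs of one class by summing edge weights."""
    return sum((ngram_graph(d, n, window) for d in documents), Counter())

def containment_similarity(doc_graph, class_graph):
    """Fraction of the document's edges that also occur in the class graph."""
    if not doc_graph:
        return 0.0
    shared = sum(1 for edge in doc_graph if edge in class_graph)
    return shared / len(doc_graph)

def features(text, class_graphs, n=3, window=2):
    """One similarity feature per class, independent of vocabulary size."""
    graph = ngram_graph(text, n, window)
    return {label: containment_similarity(graph, cg)
            for label, cg in class_graphs.items()}

def classify(text, class_graphs, n=3, window=2):
    """Simplifying assumption: choose the class with the highest similarity."""
    scores = features(text, class_graphs, n, window)
    return max(scores, key=scores.get)

# Toy, made-up training data for illustration only.
class_graphs = {
    "spam": build_class_graph(["win cash now", "cash prize now"]),
    "ham": build_class_graph(["meeting at noon", "lunch meeting today"]),
}
print(classify("free cash now", class_graphs))
```

Note that the feature vector has one entry per class, which mirrors the abstract's point that the feature space depends on the number of classes rather than on the vocabulary. In practice the similarity features would likely be fed to a trained classifier instead of a plain argmax.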

2.
Stemming is the process of reducing a derived or inflected word to its root or stem by stripping all its affixes. It has been used as a pre-processing step to increase efficiency in applications such as information retrieval, machine translation, and text summarization. Currently, stemming algorithms have been developed for a few languages, such as English, Arabic, Turkish, Malay and Amharic. Unfortunately, no algorithm has been used to stem text in Hausa, a Chadic language spoken in West Africa. To address this need, we propose stemming Hausa text using affix-stripping rules and reference lookup. We stemmed Hausa text using 78 affix-stripping rules applied in 4 steps and a reference lookup consisting of 1500 Hausa root words. The over-stemming index, under-stemming index, stemmer weight, word stemmed factor, correctly stemmed words factor and average words conflation factor were calculated to determine the effect of reference lookup on the strength and accuracy of the stemmer. It was observed that reference lookup helped reduce both over-stemming and under-stemming errors, increased accuracy, and tended to reduce the strength of an affix-stripping stemmer. The rationale behind the approach is discussed and directions for future research are identified.
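The combination of stepwise affix stripping and reference lookup described above might look something like the sketch below. The rules and root list are hypothetical placeholders (English-like affixes chosen only for readability); they are not the 78 Hausa rules or the 1500-word lookup from the paper, and the "one rule per step" and minimum-stem-length policies are assumptions.

```python
def stem(word, rule_steps, roots, min_len=2):
    """Strip affixes step by step; stop early once a known root is reached."""
    current = word.lower()
    for step in rule_steps:                  # each step is a list of (kind, affix)
        if current in roots:                 # reference lookup guards against over-stemming
            return current
        for kind, affix in step:             # affixes are assumed to be non-empty
            if kind == "prefix" and current.startswith(affix):
                candidate = current[len(affix):]
            elif kind == "suffix" and current.endswith(affix):
                candidate = current[:-len(affix)]
            else:
                continue
            if len(candidate) >= min_len:    # never strip down to a tiny fragment
                current = candidate
                break                        # assumed policy: at most one rule per step
    return current

# Hypothetical two-step rule set and root list, for illustration only.
RULES = [
    [("prefix", "un")],
    [("suffix", "ness"), ("suffix", "ing")],
]
ROOTS = {"kind", "bring"}

print(stem("unkindness", RULES, ROOTS))   # stripped in two steps
print(stem("bring", RULES, ROOTS))        # kept intact thanks to the lookup
print(stem("bring", RULES, set()))        # over-stemmed without the lookup
```

The last two calls show the effect the abstract reports: with the lookup in place, a word that is already a root is left alone, which lowers over-stemming at the cost of a slightly "weaker" stemmer.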

3.
4.
Recent years have witnessed the rapid growth of text data, and thus the increasing importance of in-depth analysis of text data for various applications. Text data are often organized in a database with documents labeled by attributes like time and location. Different documents manifest different topics. The topics of the documents may change along the attributes of the documents, and such changes have been the subject of past research. However, previous analysis techniques, such as topic detection and tracking, topic lifetime, and burstiness, all focus on the topic behavior of the documents in a given attribute range without contrasting it with the documents in the overall range. This paper introduces the concept of unique topics, referring to those topics that appear frequently only within a small range of documents but not in the whole range. These unique topics may reflect characteristics of documents in this small range that are not found outside of the range. The paper proposes an efficient pruning-based algorithm that, for a user-given set of keywords and a user-given attribute, finds the maximal ranges along the given attribute and their unique topics that are highly related to the given keyword set. Thorough experiments show that the algorithm is effective in various scenarios.

5.
We consider the k-Server problem under the advice model of computation when the underlying metric space is sparse. On one side, we introduce Θ(1)-competitive algorithms for a wide range of sparse graphs. These algorithms require advice of (almost) linear size. We show that for graphs of size N and treewidth α, there is an online algorithm that receives O(n(log α + log log N)) bits of advice and optimally serves any sequence of length n. We also prove that if a graph admits a system of μ collective tree (q, r)-spanners, then there is a (q + r)-competitive algorithm which requires O(n(log μ + log log N)) bits of advice. Among other results, this gives a 3-competitive algorithm for planar graphs, when provided with O(n log log N) bits of advice. On the other side, we prove that advice of size Ω(n) is required to obtain a 1-competitive algorithm for sequences of length n even for the 2-server problem on a path metric of size N ≥ 3. Through another lower bound argument, we show that at least \(\frac{n}{2}(\log \alpha - 1.22)\) bits of advice are required to obtain an optimal solution for metric spaces of treewidth α, where 4 ≤ α < 2k.

6.
In the projective plane PG(2, q), a subset S of a conic C is said to be almost complete if it can be extended to a larger arc in PG(2, q) only by the points of C \ S and by the nucleus of C when q is even. We obtain new upper bounds on the smallest size t(q) of an almost complete subset of a conic, in particular,
$$t(q) < \sqrt{q(3\ln q + \ln\ln q + \ln 3)} + \sqrt{\frac{q}{3\ln q}} + 4 \sim \sqrt{3q\ln q}, \qquad t(q) < 1.835\sqrt{q\ln q}.$$
The new bounds are used to extend the set of pairs (N, q) for which it is proved that every normal rational curve in the projective space PG(N, q) is a complete (q + 1)-arc, or equivalently, that no \([q + 1, N + 1, q - N + 1]_q\) generalized doubly-extended Reed–Solomon code can be extended to a \([q + 2, N + 1, q - N + 2]_q\) maximum distance separable code.

7.
We initiate a new line of investigation into online property-preserving data reconstruction. Consider a dataset which is assumed to satisfy various (known) structural properties; e.g., it may consist of sorted numbers, or points on a manifold, or vectors in a polyhedral cone, or codewords from an error-correcting code. Because of noise and errors, however, an (unknown) fraction of the data is deemed unsound, i.e., in violation of the expected structural properties. Can one still query into the dataset in an online fashion and be provided data that is always sound? In other words, can one design a filter which, when given a query to any item I in the dataset, returns a sound item J that, although not necessarily in the dataset, differs from I as infrequently as possible? No preprocessing should be allowed and queries should be answered online. We consider the case of a monotone function. Specifically, the dataset encodes a function \(f: \{1,\ldots,n\} \to \mathbb{R}\) that is at (unknown) distance ε from monotone, meaning that f can, and must, be modified at εn places to become monotone. Our main result is a randomized filter that can answer any query in \(O(\log^2 n \log\log n)\) time while modifying the function f at only O(εn) places. The amortized time over n function evaluations is O(log n). The filter works as stated with probability arbitrarily close to 1. We provide an alternative filter with O(log n) worst-case query time and O(εn log n) function modifications. For reconstructing d-dimensional monotone functions of the form \(f: \{1,\ldots,n\}^d \to \mathbb{R}\), we present a filter that takes \(2^{O(d)}(\log n)^{4d-2}\log\log n\) time per query and modifies at most \(O(\varepsilon n^d)\) function values (for constant d).
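To make the notion of "distance ε from monotone" concrete, here is a small offline sketch. It is not the online filter from the paper. It only relies on the standard fact that, for real-valued sequences, the minimum number of entries that must be changed to make a sequence nondecreasing equals n minus the length of a longest nondecreasing subsequence. The function names are hypothetical.

```python
from bisect import bisect_right

def modifications_to_monotone(values):
    """Minimum number of entries to change so the sequence is nondecreasing."""
    tails = []  # tails[k]: smallest last value of a nondecreasing subsequence of length k + 1
    for v in values:
        pos = bisect_right(tails, v)   # bisect_right lets equal values extend a run
        if pos == len(tails):
            tails.append(v)
        else:
            tails[pos] = v
    return len(values) - len(tails)

def distance_from_monotone(values):
    """The fraction epsilon of positions that must be modified."""
    return modifications_to_monotone(values) / len(values) if values else 0.0

f = [1, 2, 9, 3, 4, 5]                  # one corrupted entry (the 9)
print(modifications_to_monotone(f), distance_from_monotone(f))
```

This computation reads the whole input, which is exactly what the online filter is not allowed to do. It is useful mainly as a reference point for checking how many values a filter ends up modifying compared with the optimum εn.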

8.
9.
A new representation of the solutions of initial boundary value problems is proved for equations of the form \(u_{xx}(x, t) + r(x)u_x(x, t) - q(x)u(x, t) = u_{tt}(x, t) + \mu(x)u_t(x, t)\) in the section (under boundary conditions of the 1st, 2nd, or 3rd type in any combination). This representation has the form of a Riemann integral, depending on x and t, over the given section.

10.
In this paper, a steganographic scheme adopting the concept of generalized \(K_d\)-distance N-dimensional pixel matching is proposed. The generalized pixel matching embeds a B-ary digit (B is a function of K and N) into a cover vector of length N, where the embedding distortion, measured by the order-d Minkowski distance, is no larger than K. In contrast to other pixel matching-based schemes, an N-dimensional reference table is used. By choosing d, K, and N adaptively, an embedding strategy suitable for arbitrary relative capacity can be developed. Additionally, an optimization algorithm, namely the successive iteration algorithm (SIA), is proposed to optimize the codeword assignment in the reference table. Benefiting from the high-dimensional embedding and the optimization algorithm, nearly maximal embedding efficiency is achieved. Compared with other content-free steganographic schemes, the proposed scheme provides better image quality and statistical security. Moreover, the proposed scheme performs comparably to state-of-the-art content-based approaches when combined with image models.

11.
Various sorting algorithms using parallel architectures have been proposed in the search for more efficient results. This paper introduces the Multi-Sort algorithm for the Multi-Mesh of Trees (MMT) architecture for \(N = n^4\) elements, with more efficient time complexity compared to previous architectures. The shear sort algorithm on the Single Instruction Multiple Data (SIMD) mesh model requires \(4\sqrt{N}+O(\sqrt{N})\) time for sorting N elements arranged on a \(\sqrt{N}\times \sqrt{N}\) mesh, whereas the Multi-Sort algorithm on the SIMD Multi-Mesh (MM) architecture takes \(O(N^{1/4})\) time for sorting the same N elements, which shows that Multi-Sort is a better sorting approach. We have improved the time complexity of the intrablock sort. The communication time complexity for the 2D sort in MM is O(n), whereas in MMT it is O(log n). The time complexity of the compare–exchange step in MMT is the same as that in MM, i.e., O(n). It has been found that the time complexity of Multi-Sort on MMT is improved relative to that on the Multi-Mesh architecture.

12.
Edges are important cues for localizing object proposals. Recent progress on this problem has mostly been driven by defining effective objectness measures based on edge cues. In this paper, we develop a new representation named directional edges, in which each edge pixel is assigned a direction toward the object center by learning a direction prediction model with convolutional neural networks in a holistic manner. Based on directional edges, two new objectness measures are designed for ranking object proposals. Experiments show that the proposed method achieves 97.1% object recall at an overlap threshold of 0.5 and 81.9% object recall at an overlap threshold of 0.7 with 1,000 proposals on the PASCAL VOC 2007 test dataset, which is superior to state-of-the-art methods.

13.
Games of the family \(\{\Lambda_N\}_{N \ge 2}\) are formulated and studied by applying a generalized Isaacs approach. The game \(\Lambda_N\) is the simplest model of the counteraction between one pursuer P and a coalition \(E_N\) of N runaways for the case when the payoff is the distance to the coalition \(E_N\), equal to the Euclidean distance between P and the farthest of the runaways; P controls the termination moment. Moreover, we describe an approach in which, for games with a smooth terminal payoff, strategies are generated that prescribe the players' motions in the directions of local gradients of the payoff. The approach is used for constructing pursuit strategies in games in which smooth approximations of the maximum of the Euclidean distances to the runaways replace the payoffs. Pursuit strategies prescribing motion in the direction of the farthest of the runaways are studied. Numerical simulations of the games \(\Lambda_2\) and \(\Lambda_3\) are conducted for different strategies used by the players.

14.
We focus on the large-field regime of a hyperbolic potential, which is characterized by a parameter f, in the framework of brane-world inflation in the Randall-Sundrum-II model. From the observed form of the power spectrum \(P_R(k)\), the parameter f should be of order \(0.1\,m_p\) to \(0.001\,m_p\), the brane tension must be in the range \(\lambda \sim (1\text{–}10)\times 10^{57}\ \mathrm{GeV}^4\), and the energy scale is around \(V_0^{1/4} \sim 10^{15}\ \mathrm{GeV}\). We find that the inflationary parameters (\(n_s\), r, and \(dn_s/d\ln k\)) depend only on the number of e-folds N. These parameters are compatible with the latest Planck measurements for large values of N.

15.
We consider a game between a group of n pursuers and one evader moving with the same maximum velocity along the 1-skeleton graph of a regular polyhedron. The goal of the paper is to find, for each regular polyhedron M, a number N(M) with the following properties: if n ≥ N(M), the group of pursuers wins, while if n < N(M), the evader wins. Part I of the paper is devoted to the case of polyhedra in \(\mathbb{R}^3\); Part II will be devoted to the case of \(\mathbb{R}^d\), d ≥ 5; and Part III, to the case of \(\mathbb{R}^4\).

16.
The problem of ruin probability minimization in the Cramer-Lundberg risk model under excess reinsurance is studied. Along with the traditional maximization of the Lundberg characteristic coefficient R, we consider the problem of directly calculating the insurer's ruin probability \(\psi_r(x)\) as a function of the initial capital x under a prescribed net-retention level r. To solve this problem, we propose an excess variant of the Cramer integral equation, which is equivalent to the Hamilton-Jacobi-Bellman equation. The continuation method is used for solving this equation; by means of it, the analytical solution for the Markov risk model is found. We demonstrate on a series of standard examples that, for any admissible value of x, the ruin probability \(\psi_x(r) := \psi_r(x)\) is usually a unimodal function of r. The analytic representation of the ruin probability \(\psi_r(x)\) is compared with its asymptotic approximation as x → ∞.

17.
We consider an application of the two-armed bandit problem to processing a large number N of data items where two alternative processing methods can be used. We propose a strategy which at the first stages, whose number is at most r − 1, compares the methods, and at the final stage applies only the best one obtained from the comparison. We find asymptotically optimal parameters of the strategy and observe that the minimax risk is of the order of \(N^{\alpha}\), where \(\alpha = 2^{r-1}/(2^r - 1)\). Under parallel processing, the total operation time is determined by the number r of stages but not by the number N of data items.
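As a quick worked check, reading the risk exponent in the abstract above as α = 2^(r−1)/(2^r − 1), the following sketch tabulates α for a few stage counts r. The helper name is hypothetical. Exact fractions are used so the values are not blurred by floating-point rounding.

```python
from fractions import Fraction

def risk_exponent(r):
    """Exponent alpha such that the minimax risk is of order N**alpha for r stages."""
    if r < 1:
        raise ValueError("the number of stages r must be at least 1")
    return Fraction(2 ** (r - 1), 2 ** r - 1)

for r in range(1, 6):
    alpha = risk_exponent(r)
    print(r, alpha, float(alpha))
```

With a single stage the exponent is 1 (no comparison is possible, so the risk can grow linearly in N), and as r increases the exponent decreases toward 1/2, so a handful of stages already recovers most of the benefit of fully sequential processing.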

18.
We introduce a construction of a set of code sequences \(\{C_n^{(m)} : n \ge 1, m \ge 1\}\) with memory order m and code length N(n). \(\{C_n^{(m)}\}\) is a generalization of the polar codes presented by Arıkan in [1], where the encoder mapping with length N(n) is obtained recursively from the encoder mappings with lengths N(n − 1) and N(n − m), and \(\{C_n^{(m)}\}\) coincides with the original polar codes when m = 1. We show that \(\{C_n^{(m)}\}\) achieves the symmetric capacity I(W) of an arbitrary binary-input, discrete-output memoryless channel W for any fixed m. We also obtain an upper bound on the probability of block-decoding error \(P_e\) of \(\{C_n^{(m)}\}\) and show that \({P_e} = O({2^{ - {N^\beta }}})\) is achievable for \(\beta < 1/[1 + m(\rho - 1)]\), where \(\rho \in (1, 2]\) is the largest real root of the polynomial \(F(m, \rho) = \rho^m - \rho^{m-1} - 1\). The encoding and decoding complexities of \(\{C_n^{(m)}\}\) decrease with increasing m, which proves the existence of new polar coding schemes that have lower complexity than Arıkan's construction.
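The error-exponent bound in the abstract above can be evaluated numerically. The sketch below reads the polynomial as F(m, ρ) = ρ^m − ρ^(m−1) − 1 and the bound as β < 1/[1 + m(ρ − 1)], and finds the root by plain bisection on [1, 2]. The function names and the tolerance are assumptions made for the example.

```python
def largest_root(m, tol=1e-12):
    """Root of rho**m - rho**(m - 1) - 1 in (1, 2], found by bisection."""
    def f(rho):
        return rho ** m - rho ** (m - 1) - 1

    lo, hi = 1.0, 2.0      # f(1) = -1 < 0 and f(2) = 2**(m - 1) - 1 >= 0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if f(mid) < 0:
            lo = mid
        else:
            hi = mid
    return hi

def beta_bound(m):
    """Supremum of the achievable exponent beta for memory order m."""
    rho = largest_root(m)
    return 1 / (1 + m * (rho - 1))

for m in range(1, 5):
    print(m, largest_root(m), beta_bound(m))
```

For m = 1 the root is ρ = 2 and the bound is β < 1/2, which matches the known exponent for the original polar codes. For m = 2 the root is the golden ratio. Bisection suffices here because F is increasing in ρ for ρ ≥ 1, so the root in (1, 2] is unique and is also the largest real root.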

19.
A novel algorithm for simultaneous force estimation and friction compensation of constrained motion of robot manipulators is presented. This represents an extension of the improved extended active observer (IEAOB) algorithm reported earlier and proposes a higher-order IEAOB, or N-th order IEAOB (IEAOB-N), for an n-DOF robot manipulator. Central to this observer is the use of extra system states modeled as a Gauss-Markov (GM) formulation to estimate the force and disturbances including robot inertial parameters and friction. The stability of IEAOB-N is verified through stability analysis. The IEAOB-1 is validated by applying it to a Phantom Omni haptic device against a Nicosia observer, disturbance observer (DOB)/reaction torque observer (RTOB), and nonlinear disturbance observer (NDO), respectively. The results show that the proposed IEAOB-1 is superior to the compared observers in terms of force estimation. Then, the performance of the IEAOB-N is experimentally studied and compared to the IEAOB-1. Results demonstrate that the IEAOB-N has an improved capability in tracking nonlinear external forces.

20.
We present a method to construct a theoretically fast algorithm for computing the discrete Fourier transform (DFT) of order \(N = 2^n\). We show that the DFT of a complex vector of length N can be computed with a complexity of \(3.76875\,N \log_2 N\) real operations of addition, subtraction, and scalar multiplication.
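For orientation, a textbook recursive radix-2 Cooley-Tukey FFT for lengths N = 2^n is sketched below, together with a direct O(N²) DFT for cross-checking. This is explicitly not the algorithm of the paper above and does not attain its operation count. It only illustrates the kind of divide-and-conquer structure that fast DFT algorithms of order 2^n build on. The function names are chosen for the example.

```python
import cmath

def fft(x):
    """Radix-2 decimation-in-time FFT; len(x) must be a power of two."""
    n = len(x)
    if n <= 1:
        return list(x)
    if n % 2:
        raise ValueError("length must be a power of two")
    even = fft(x[0::2])               # DFT of even-indexed samples
    odd = fft(x[1::2])                # DFT of odd-indexed samples
    out = [0j] * n
    for k in range(n // 2):
        twiddle = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + twiddle            # butterfly, upper half
        out[k + n // 2] = even[k] - twiddle   # butterfly, lower half
    return out

def dft(x):
    """Direct DFT from the definition, used only as a reference."""
    n = len(x)
    return [sum(x[j] * cmath.exp(-2j * cmath.pi * j * k / n) for j in range(n))
            for k in range(n)]

signal = [1, 2, 3, 4, 0, -1, -2, -3]
max_error = max(abs(a - b) for a, b in zip(fft(signal), dft(signal)))
print(max_error)
```

Each level of the recursion performs N/2 butterflies and there are log₂ N levels, which is where the N log₂ N growth comes from; work on refined algorithms such as the one in the abstract focuses on lowering the constant in front of that term.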
