Similar literature
 Found 20 similar documents; search took 15 ms
1.
Rong Wenzhong, Han Jin, Liu Gen. Multimedia Tools and Applications, 2022, 81(6): 8617-8632

Leveraging contextual information at the instance level can improve the accuracy of object detection. However, state-of-the-art object detection systems still detect each target individually without using contextual information. One reason is that contextual information is difficult to model. To solve this problem, an object relation module based on one-stage object detectors helps the detectors learn the correlations between objects. It extracts and fuses feature maps from various layers, including geometric, categorical, and appearance features; a transformation driven by a visual attention mechanism is then performed to generate instance-level primary object relation features. Furthermore, a lightweight subnet generates a new feature prediction layer based on the primary relation features, which is fused with the original detection layer to improve detection ability. The module does not require excessive computation or additional supervision, and it can be easily ported to different one-stage object detection frameworks. As a demonstration, the relation module is added to several one-stage object detectors (YOLO, Retinanet, and FCOS) and evaluated on the MS-COCO benchmark dataset after training. The results show that the relation module effectively improves accuracy in one-stage object detection pipelines. Specifically, it gives a 2.4 AP improvement for YOLOv3, a 1.8 AP improvement for Retinanet, and a 1.6 AP improvement for FCOS.


2.
3.
The pose of a rigid object is usually regarded as a rigid transformation, described by a translation and a rotation. However, equating the pose space with the space of rigid transformations is in general abusive, as it does not account for objects with proper symmetries, which are common among man-made objects. In this article, we define pose as a distinguishable static state of an object, and equate a pose to a set of rigid transformations. Based solely on geometric considerations, we propose a frame-invariant metric on the space of possible poses, valid for any physical rigid object, and requiring no arbitrary tuning. This distance can be evaluated efficiently using a representation of poses within a Euclidean space of at most 12 dimensions, depending on the object’s symmetries. This makes it possible to efficiently perform neighborhood queries such as radius searches or k-nearest neighbor searches within a large set of poses using off-the-shelf methods. Pose averaging under this metric can similarly be performed easily, using a projection function from the Euclidean space onto the pose space. The practical value of these theoretical developments is illustrated with an application to pose estimation of instances of a 3D rigid object given an input depth map, via a Mean Shift procedure.

4.
A technique for generating a skeleton of a ribbon-like or tree-like object using sequential data for all or part of the boundary is described. It is shown how one may use local geometric information derived from the contour to aid in the selection and generation of significant pieces of the skeleton. For contours or curves of length $n$, this may be accomplished with a computation time of order $n$, while previous algorithms generally require order $n^2$ and require a two-dimensional matrix for their working representation.

5.
Techniques for video object motion analysis, behaviour recognition and event detection are becoming increasingly important with the rapid increase in demand for and deployment of video surveillance systems. Motion trajectories provide rich spatiotemporal information about an object's activity. This paper presents a novel technique for classification of motion activity and anomaly detection using object motion trajectories. In the proposed motion learning system, trajectories are treated as time series and modelled using a modified DFT-based coefficient feature space representation. A modelling technique, referred to as m-mediods, is proposed that models a class containing n members with m mediods. Once the m-mediods based models for all the classes have been learnt, the classification of new trajectories and anomaly detection can be performed by checking the closeness of a trajectory to the models of known classes. A mechanism based on an agglomerative approach is proposed for anomaly detection. Four anomaly detection algorithms using the m-mediods based representation of classes are proposed. These include: (i) global merged anomaly detection (GMAD), (ii) localized merged anomaly detection (LMAD), (iii) global un-merged anomaly detection (GUAD), and (iv) localized un-merged anomaly detection (LUAD). Our proposed techniques are validated using a variety of simulated and complex real-life trajectory datasets.
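
The trajectory pipeline described in this abstract (DFT coefficient features, then a closeness check against per-class mediods) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the sketch uses plain DFT magnitudes rather than the paper's modified DFT representation, takes the mediods as given instead of learning them, and replaces the four proposed anomaly detectors with a single distance threshold. The names `dft_features`, `nearest_class`, `classify` and `anomaly_threshold` are hypothetical.

```python
import cmath
import math


def dft_features(points, num_coeffs=4):
    """Return magnitudes of the first DFT coefficients of a 2-D trajectory.

    Each (x, y) point is treated as the complex sample x + iy.
    Assumption: plain DFT magnitudes, not the paper's modified variant.
    """
    n = len(points)
    signal = [complex(x, y) for x, y in points]
    feats = []
    for k in range(num_coeffs):
        coeff = sum(s * cmath.exp(-2j * math.pi * k * t / n)
                    for t, s in enumerate(signal)) / n
        feats.append(abs(coeff))
    return feats


def nearest_class(feature, medoids_by_class):
    """Return (label, distance) of the closest medoid over all classes."""
    best = None
    for label, medoids in medoids_by_class.items():
        for medoid in medoids:
            d = math.dist(feature, medoid)
            if best is None or d < best[1]:
                best = (label, d)
    return best


def classify(feature, medoids_by_class, anomaly_threshold):
    """Label a feature vector, or flag it when no medoid is close enough.

    Assumption: a single global threshold stands in for the paper's
    GMAD/LMAD/GUAD/LUAD detectors, which are not reproduced here.
    """
    label, d = nearest_class(feature, medoids_by_class)
    return "anomaly" if d > anomaly_threshold else label
```

In this sketch each class may hold several medoids, so a class with a multi-modal spread of trajectories is still matched by its nearest representative rather than by a single class mean.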

6.
7.
Traditional algorithms to design hand-crafted features for action recognition have been a hot research area in the last decade. Compared to RGB video, a depth sequence is less sensitive to lighting changes and more discriminative, owing to its ability to capture the geometric information of objects. Unlike many existing methods for action recognition which depend on well-designed features, this paper studies deep learning-based action recognition using depth sequences and the corresponding skeleton joint information. First, we construct a 3D-based Deep Convolutional Neural Network (3D2CNN) to directly learn spatio-temporal features from raw depth sequences, then compute a joint-based feature vector named JointVector for each sequence by taking into account the simple position and angle information between skeleton joints. Finally, support vector machine (SVM) classification results from the 3D2CNN learned features and JointVector are fused to perform action recognition. Experimental results demonstrate that our method can learn a feature representation from depth sequences that is time-invariant and viewpoint-invariant. The proposed method achieves results comparable to state-of-the-art methods on the UTKinect-Action3D dataset and achieves superior performance in comparison to baseline methods on the MSR-Action3D dataset. We further investigate the generalization of the trained model by transferring the learned features from one dataset (MSR-Action3D) to another dataset (UTKinect-Action3D) without retraining and obtain very promising classification accuracy.

8.
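To prevent production accidents caused by missing personal protective equipment, this work investigates intelligent recognition of safety helmet wearing in complex construction scenes. Building on a one-stage object detection algorithm, and targeting the small-object problem and the lack of helmet texture information in helmet recognition, it proposes extracting and fusing contextual information to strengthen the model's representation learning ability. First, to address insufficient feature discriminability, a local context perception module and a global context fusion module are proposed. The local context perception module fuses human head information with helmet information to obtain discriminative feature representations; the global context fusion module fuses high-level semantic information with shallow features to improve the abstraction ability of the shallow features. Second, to address small-object recognition, multiple different object detection modules are used to recognize objects of different sizes separately. Experimental results on a purpose-built helmet recognition dataset of complex construction scenes show that the two proposed modules raise mAP by 11.46 percentage points and raise the average precision of helmet recognition by 10.55 percentage points. The proposed method is fast and accurate, and provides an effective technical solution for smart construction sites.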

9.
10.
Conventional approaches to speech-to-speech (S2S) translation typically ignore key contextual information such as prosody, emphasis, and discourse state in the translation process. Capturing and exploiting such contextual information is especially important in machine-mediated S2S translation, as it can serve as a complementary knowledge source that can potentially aid the end users in improved understanding and disambiguation. In this work, we present a general framework for integrating rich contextual information in S2S translation. We present novel methodologies for integrating source side context in the form of dialog act (DA) tags, and target side context using prosodic word prominence. We demonstrate the integration of the DA tags in two different statistical translation frameworks, phrase-based translation and a bag-of-words lexical choice model. In addition to producing interpretable DA annotated target language translations, we also obtain significant improvements in terms of automatic evaluation metrics such as lexical selection accuracy and BLEU score. Our experiments also indicate that a finer representation of dialog information, such as yes–no questions, wh-questions and open questions, is the most useful in improving translation quality. For target side enrichment, we employ factored translation models to integrate the assignment and transfer of prosodic word prominence (pitch accents) during translation. The factored translation models provide significant improvement in the assignment of correct pitch accents to the target words in comparison with a post-processing approach. Our framework is suitable for integrating any word or utterance level contextual information that can be reliably detected (recognized) from speech and/or text.

11.
12.
In this paper we propose a new approach for dynamic selection of ensembles of classifiers. Based on the concept named multistage organizations, the main objective of which is to define a multi-layer fusion function adapted to each recognition problem, we propose dynamic multistage organization (DMO), which defines the best multistage structure for each test sample. By extending Dos Santos et al.’s approach, we propose two implementations for DMO, namely $\mathrm{DSA}_m$ and $\mathrm{DSA}_c$. While the former considers a set of dynamic selection functions to generalize a DMO structure, the latter considers contextual information, represented by the output profiles computed from the validation dataset, to conduct this task. The experimental evaluation, considering both small and large datasets, demonstrated that $\mathrm{DSA}_c$ dominated $\mathrm{DSA}_m$ on most problems, showing that the use of contextual information can reach better performance than other existing methods. In addition, the performance of $\mathrm{DSA}_c$ can also be enhanced in incremental learning. However, the most important observation, supported by additional experiments, is that dynamic selection is generally preferred over static approaches when the recognition problem presents a high level of uncertainty.

13.
14.
The representation of three-dimensional star-shaped objects by the double Fourier series (DFS) coefficients of their boundary function is considered. An analogue of the convolution theorem for a DFS on a sphere is developed. It is then used to calculate the moments of an object directly from the DFS coefficients, without an intermediate reconstruction step. The complexity of computing the moments from the DFS coefficients is $O(N^2 \log N)$, where $N$ is the maximum order of coefficients retained in the expansion, while the complexity of computing the moments from the spherical harmonic representation is $O(N^2 \log^2 N)$. It is shown that under sufficient conditions, the moments and surface area corresponding to the truncated DFS converge to the true moments and area of an object. A new kind of DFS, the double Fourier sine series, is proposed; it has better convergence properties than the previously used kinds and than spherical harmonics in the case of objects with a sharp point above the pole of the spherical domain.

15.
In the attention-driven image interpretation process, an image is interpreted as containing several perceptually attended objects as well as the background. The process greatly benefits a content-based image retrieval task, since attentively important objects are identified and emphasized. An important issue to be addressed in attention-driven image interpretation is to reconstruct several attentive objects iteratively from the segments of an image by maximizing a global attention function. The object reconstruction is a combinatorial optimization problem with a complexity of $2^N$, which is computationally very expensive when the number of segments $N$ is large. In this paper, we formulate the attention-driven image interpretation process by a matrix representation. An efficient algorithm based on elementary matrix transformations is proposed to reduce the computational complexity to $3\omega N(N-1)^2/2$, where $\omega$ is the number of runs. Experimental results on both synthetic and real data show a significantly improved processing speed with an acceptable degradation in the accuracy of object formulation.

16.
Segmentation of human faces from still images is a research field of rapidly increasing interest. Although the field encounters several challenges, this paper seeks to present a novel face segmentation and facial feature extraction algorithm for gray intensity images (each containing a single face object). Face location and extraction must first be performed to obtain the approximate, if not exact, representation of a given face in an image. The proposed approach is based on the Voronoi diagram (VD), a well-known technique in computational geometry, which generates clusters of intensity values using information from the vertices of the external boundary of the Delaunay triangulation (DT). In this way, it is possible to produce segmented image regions. A greedy search algorithm looks for a particular face candidate by focusing its action on elliptical-like regions. VD is presently employed in many fields, but researchers primarily focus on its use in skeletonization and for generating Euclidean distances; this work exploits the triangulations (i.e., Delaunay) generated by the VD for use in this field. A distance transformation is applied to segment face features. We used the BioID face database to test our algorithm. We obtained promising results: 95.14% of faces were correctly segmented; 90.2% of eyes were detected and a 98.03% detection rate was obtained for mouth and nose.

17.
The storage and retrieval of multimedia data is a crucial problem in multimedia information systems due to the huge storage requirements. It is necessary to provide an efficient methodology for the indexing of multimedia data for rapid retrieval. The aim of this paper is to introduce a methodology to represent, simplify, store, retrieve and reconstruct an image from a repository. An algebraic representation of the spatio-temporal relations present in a document is constructed from an equivalent graph representation and used to index the document. We use this representation to simplify and later reconstruct the complete index. This methodology has been tested by implementation of a prototype system called Simplified Modeling to Access and ReTrieve multimedia information (SMART). Experimental results show that the complexity of an index of a 2D document is $O(n(n-1)/k)$ with $k \ge 2$, as opposed to the $O(n(n-1)/2)$ known so far. Since $k$ depends on the number of objects in an image, more complex documents have lower overall complexity.

18.
Text classification constitutes a popular task in Web research with various applications that range from spam filtering to sentiment analysis. In this paper, we argue that its performance depends on the quality of Web documents, which varies significantly. For example, the curated content of news articles involves different challenges than the user-generated content of blog posts and Social Media messages. We experimentally verify our claim, quantifying the main factors that affect the performance of text classification. We also argue that the established bag-of-words representation models are inadequate for handling all document types, as they merely extract frequent, yet distinguishing terms from the textual content of the training set. Thus, they suffer from low robustness in the context of noisy or unseen content, unless they are enriched with contextual, application-specific information. In their place, we propose the use of n-gram graphs, a model that goes beyond the bag-of-words representation, transforming every document into a graph: its nodes correspond to character or word n-grams and the co-occurring ones are connected by weighted edges. Individual document graphs can be combined into class graphs and graph similarities are employed to position and classify documents into the vector space. This approach offers two advantages with respect to bag models: first, classification accuracy increases due to the contextual information that is encapsulated in the edges of the n-gram graphs. Second, it reduces the search space to a limited set of robust, endogenous features that depend on the number of classes, rather than the size of the vocabulary. Our thorough experimental study over three large, real-world corpora confirms the superior performance of n-gram graphs across the main types of Web documents.
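
The n-gram graph construction described in this abstract (n-grams as nodes, weighted edges between co-occurring ones, document graphs combined into class graphs, graph similarity as a feature) can be sketched roughly as follows. This is only an illustration under stated assumptions: character n-grams, a fixed co-occurrence `window`, class graphs formed by summing edge weights, and a simple containment-style similarity. The abstract does not specify these details, and the function names are hypothetical.

```python
from collections import Counter


def ngram_graph(text, n=3, window=2):
    """Build a character n-gram graph as a Counter of weighted edges.

    Nodes are character n-grams; two n-grams are linked when they start
    within `window` positions of each other, and the edge weight counts
    how often that happens. Edges are stored as sorted (undirected) pairs.
    """
    grams = [text[i:i + n] for i in range(len(text) - n + 1)]
    edges = Counter()
    for i, gram in enumerate(grams):
        for j in range(i + 1, min(i + window + 1, len(grams))):
            edges[tuple(sorted((gram, grams[j])))] += 1
    return edges


def merge_graphs(graphs):
    """Combine document graphs into one class graph.

    Assumption: weights are summed; the paper may combine them differently.
    """
    total = Counter()
    for graph in graphs:
        total.update(graph)
    return total


def containment_similarity(doc_graph, class_graph):
    """Fraction of the document's edges that also appear in the class graph."""
    if not doc_graph:
        return 0.0
    shared = sum(1 for edge in doc_graph if edge in class_graph)
    return shared / len(doc_graph)
```

Computing one such similarity per class yields a feature vector whose length depends on the number of classes rather than on the vocabulary size, which is the dimensionality property the abstract highlights.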

19.
A new approach using the Beltrami representation of a shape for topology-preserving image segmentation is proposed in this paper. Using the proposed model, the target object can be segmented from the input image by a region of user-prescribed topology. Given a target image I, a template image J is constructed and then deformed with respect to the Beltrami representation. The deformation on J is designed such that the topology of the segmented region is preserved as that of the object interior in J. The topology-preserving property of the deformation is guaranteed by imposing only one constraint on the Beltrami representation, which is easy to handle. Introducing the Beltrami representation also allows large deformations on the topological prior J, so that it can be a very simple image, such as an image of a disk, a torus, or disjoint disks. Hence, prior shape information of I is unnecessary for the proposed model. Additionally, the proposed model can be easily combined with selective segmentation, in which landmark constraints can be imposed interactively to meet any practical need (e.g., medical imaging). The high accuracy and stability of the proposed model across different segmentation tasks are validated by numerical experiments on both artificial and real images.

20.
Visual context provides cues about an object’s presence, position and size within the observed scene, which should be used to increase the performance of object detection techniques. However, in computer vision, object detectors typically ignore this information. We therefore present a framework for visual-context-aware object detection. Methods for extracting visual contextual information from still images are proposed, which are then used to calculate a prior for object detection. The concept is based on a sparse coding of contextual features, which are based on geometry and texture. In addition, bottom-up saliency and object co-occurrences are exploited to define auxiliary visual context. To integrate the individual contextual cues with a local appearance-based object detector, a fully probabilistic framework is established. In contrast to other methods, our integration is based on modeling the underlying conditional probabilities between the different cues, which is done via kernel density estimation. This integration is a crucial part of the framework, as demonstrated in the detailed evaluation. Our method is evaluated using a novel, demanding image data set and compared to a state-of-the-art method for context-aware object detection. An in-depth analysis is given, discussing the contributions of the individual contextual cues and the limitations of visual context for object detection.
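
As a rough illustration of how kernel density estimation can turn a contextual cue into a detection prior, the sketch below estimates two 1-D densities (cue values observed with and without the object) and combines them with Bayes' rule. It is a deliberately simplified assumption, not the framework from the abstract: it handles a single scalar cue, uses a Gaussian kernel with a hand-picked bandwidth, and the names `gaussian_kde`, `context_prior` and `base_rate` are hypothetical.

```python
import math


def gaussian_kde(samples, bandwidth):
    """Return a 1-D Gaussian kernel density estimate as a function of x."""
    if not samples or bandwidth <= 0:
        raise ValueError("need at least one sample and a positive bandwidth")
    norm = 1.0 / (len(samples) * bandwidth * math.sqrt(2.0 * math.pi))

    def density(x):
        return norm * sum(
            math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples
        )

    return density


def context_prior(cue, density_object, density_background, base_rate=0.5):
    """Posterior probability of an object given one contextual cue value.

    Applies Bayes' rule to the two estimated densities; falls back to the
    base rate when both densities are numerically zero at `cue`.
    """
    numerator = density_object(cue) * base_rate
    denominator = numerator + density_background(cue) * (1.0 - base_rate)
    return numerator / denominator if denominator > 0 else base_rate
```

For example, if the cue is a normalized vertical image position and objects were mostly observed near 0.5, `context_prior` will be high around the image centre and low near the borders; such a prior could then reweight a detector's appearance score.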


Copyright©北京勤云科技发展有限公司  京ICP备09084417号