Similar Documents
20 similar documents found.
1.
Query by video clip (cited by 15: 0 self-citations, 15 by others)
Typical digital video search is based on queries involving a single shot. We generalize this problem by allowing queries that involve a video clip (say, a 10-s video segment). We propose two schemes: (i) Retrieval based on key frames follows the traditional approach of identifying shots, computing key frames from a video, and then extracting image features around the key frames. For each key frame in the query, a similarity value (using color, texture, and motion) is obtained with respect to the key frames in the database video. Consecutive key frames in the database video that are highly similar to the query key frames are then used to generate the set of retrieved video clips. (ii) In retrieval using sub-sampled frames, we uniformly sub-sample the query clip as well as the database video. Retrieval is based on matching color and texture features of the sub-sampled frames. Initial experiments on two video databases (a basketball video with approximately 16,000 frames and a CNN news video with approximately 20,000 frames) show promising results. Additional experiments using segments from one basketball video as the query and a different basketball video as the database show the effectiveness of the feature representation and matching schemes.
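A minimal sketch of scheme (ii), sub-sampled frame matching, assuming OpenCV and NumPy are available; the sampling stride, histogram bins, and L1 distance are illustrative choices rather than the paper's exact parameters:

```python
import cv2
import numpy as np

def color_histogram(frame, bins=(8, 8, 8)):
    """L1-normalized HSV color histogram of one frame."""
    hsv = cv2.cvtColor(frame, cv2.COLOR_BGR2HSV)
    hist = cv2.calcHist([hsv], [0, 1, 2], None, list(bins),
                        [0, 180, 0, 256, 0, 256])
    return hist.flatten() / (hist.sum() + 1e-8)

def subsample_features(video_path, stride=30):
    """Color features of every `stride`-th frame of a video file."""
    cap = cv2.VideoCapture(video_path)
    feats, idx = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % stride == 0:
            feats.append(color_histogram(frame))
        idx += 1
    cap.release()
    return np.array(feats)

def best_match(query_feats, db_feats):
    """Slide the sub-sampled query over the database; return the offset
    whose mean per-frame L1 histogram distance is smallest."""
    q = len(query_feats)
    scores = [np.abs(db_feats[o:o + q] - query_feats).sum(axis=1).mean()
              for o in range(len(db_feats) - q + 1)]
    return int(np.argmin(scores))
```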

2.
Video Text Tracking and Segmentation Based on Multiple Frames (cited by 8: 2 self-citations, 6 by others)
Text embedded in video is an important source of information for semantic video understanding and retrieval. Exploiting the temporal and spatial redundancy of static text in video, the detection results are refined using the edge bitmap of the text region as a feature, and a fast text tracking algorithm based on binary search is proposed to locate text objects quickly and reliably. In the segmentation stage, in addition to the traditional approach of enhancing text regions with a gray-level fused image, the edge bitmap is also used to further filter out the background. Experiments show that both detection precision and segmentation quality are substantially improved.
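A minimal sketch of the binary-search idea behind the fast text tracking step; the predicate `has_text` is a hypothetical stand-in for the paper's edge-bitmap verification of a detected caption:

```python
def last_frame_with_text(has_text, start, max_frame):
    """Binary search for the last frame in [start, max_frame] in which a
    static caption detected at `start` is still visible. Assumes the
    caption spans one contiguous interval, as static video captions do."""
    lo, hi = start, max_frame
    while lo < hi:
        mid = (lo + hi + 1) // 2
        if has_text(mid):
            lo = mid          # caption still present; search later frames
        else:
            hi = mid - 1      # caption gone; search earlier frames
    return lo
```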

3.
4.

Video anomaly detection (VAD) automatically recognizes abnormal events in surveillance videos. Existing works have made advances in recognizing whether a video contains abnormal events; however, they cannot temporally localize the abnormal events within videos. This paper presents a novel anomaly attention-based framework for accurately localizing abnormal events in time. Benefiting from the proposed framework, we can achieve frame-level VAD using only video-level labels, which significantly reduces the burden of data annotation. Our method is an end-to-end deep neural network-based approach that contains three modules: an anomaly attention module (AAM), a discriminative anomaly attention module (DAAM), and a generative anomaly attention module (GAAM). Specifically, AAM is trained to generate the anomaly attention, which measures the abnormality of each frame, while DAAM and GAAM alternately augment AAM from two different aspects. On the one hand, DAAM enhances AAM by optimizing video-level classification. On the other hand, GAAM adopts a conditional variational autoencoder to model the likelihood of each frame given the attention, refining AAM. As a result, AAM generates higher anomaly scores for abnormal frames and lower anomaly scores for normal frames. Experimental results show that our proposed approach outperforms state-of-the-art methods, which validates the superiority of our AAVAD.
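A minimal sketch of the discriminative branch (video-level classification driven by frame-level attention), written in PyTorch under the assumption that per-frame features are pre-extracted; module sizes and names are illustrative, not the paper's architecture:

```python
import torch
import torch.nn as nn

class AnomalyAttention(nn.Module):
    def __init__(self, feat_dim=512):
        super().__init__()
        self.attn = nn.Sequential(         # AAM: per-frame attention score
            nn.Linear(feat_dim, 128), nn.ReLU(),
            nn.Linear(128, 1), nn.Sigmoid())
        self.cls = nn.Linear(feat_dim, 1)  # video-level classifier head

    def forward(self, frames):             # frames: (T, feat_dim)
        a = self.attn(frames)              # (T, 1) anomaly attention
        pooled = (a * frames).sum(0) / (a.sum() + 1e-8)
        return self.cls(pooled), a.squeeze(-1)

# Training step with only a video-level label y (1 = contains anomaly):
model = AnomalyAttention()
frames = torch.randn(64, 512)              # dummy features for 64 frames
logit, attention = model(frames)
loss = nn.functional.binary_cross_entropy_with_logits(
    logit, torch.tensor([1.0]))
loss.backward()                             # DAAM-style discriminative update
```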


5.

In video streaming services and applications, impulse noise occurs due to transmission errors or is sometimes introduced during signal acquisition. The work presented in this paper proposes a novel impulse noise detection and mitigation (INDAM) method that can significantly recover video frames heavily impaired by impulse noise. The proposed technique uses a cyclic redundancy check (CRC) based method to create an error mask of the received impaired video frames. This error mask contains pixel-by-pixel error information of the video frames and is exploited further to mitigate the error in the impaired video frame: each impaired pixel is replaced by the average of its corresponding error-free neighboring pixels' values. The technique uses the error mask created by the CRC method and uses only error-free pixels when calculating the averages. Results show that INDAM outperforms other contemporary methods in terms of peak signal-to-noise ratio (PSNR) and the structural similarity index metric (SSIM).
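A minimal sketch of the mitigation step, assuming a grayscale frame and a boolean error mask of the kind the CRC check produces:

```python
import numpy as np

def mitigate_impulse_noise(frame, error_mask):
    """Replace each flagged pixel with the mean of its error-free
    8-neighborhood. `frame` is a 2-D grayscale array; `error_mask` is a
    boolean array of the same shape (True = impaired pixel)."""
    out = frame.astype(np.float64)
    h, w = frame.shape
    for y, x in zip(*np.nonzero(error_mask)):
        ys = slice(max(y - 1, 0), min(y + 2, h))
        xs = slice(max(x - 1, 0), min(x + 2, w))
        good = ~error_mask[ys, xs]   # flagged pixels (incl. centre) excluded
        if good.any():
            out[y, x] = frame[ys, xs][good].mean()
    return out.astype(frame.dtype)
```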


6.
Content-based indexing of multimedia databases (cited by 1: 0 self-citations, 1 by others)
Content-based retrieval from multimedia databases calls for content-based indexing techniques. Different from conventional databases, where data items are represented by a set of attributes of elementary data types, multimedia objects in multimedia databases are represented by a collection of features; similarity of object contents depends on context and frame of reference; and features of objects are characterized by multimodal feature measures. These properties pose great challenges for content-based indexing. On the other hand, there are special requirements on content-based indexing: to support visual browsing, similarity retrieval, and fuzzy retrieval, nodes of the index should represent certain meaningful categories; that is, certain semantics must be added when performing indexing. ContIndex, the content-based indexing technique presented in this paper, is proposed to meet these challenges and special requirements. The indexing tree is formally defined by adapting a classification-tree concept. Horizontal links among nodes at the same level enhance the flexibility of the index. A special neural-network model, called Learning based on Experiences and Perspectives (LEP), has been developed to create node categories by fusing multimodal feature measures. It brings to the index the capability of self-organizing nodes with respect to a certain context and frames of reference. An icon image is generated for each intermediate node to facilitate visual browsing. Algorithms have been developed to support multimedia object archival and retrieval using ContIndex.

7.
Numerous web videos associated with rich metadata are available on the Internet today. While metadata such as video tags bring facilitation and opportunities for video search and multimedia content understanding, challenges also arise from the fact that these tags are usually annotated at the video level, while many tags actually describe only parts of the video content. How to localize the relevant parts or frames of a web video for given tags is key to many applications and research tasks. In this paper we propose combining a topic model with relevance filtering to localize relevant frames. Our method is designed in three steps. First, we apply relevance filtering to assign relevance scores to video frames, and a raw relevant frame set is obtained by selecting the top-ranked frames. Then, we separate the frames into topics by mining the underlying semantics using latent Dirichlet allocation, and use the raw relevant frame set as a validation set to select relevant topics. Finally, the topical relevances are used to refine the raw relevant frame set, yielding the final results. Experimental results on two real web video databases validate the effectiveness of the proposed approach.
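A minimal sketch of the topic-validation step, assuming frames have already been quantized into bag-of-visual-words count vectors and that `raw_scores` holds the relevance-filtering scores; all parameter values are illustrative:

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation

def topical_refinement(frame_counts, raw_scores, n_topics=5, top_k=50):
    # Raw relevant set: top-ranked frames from relevance filtering.
    raw_set = np.argsort(raw_scores)[::-1][:top_k]
    lda = LatentDirichletAllocation(n_components=n_topics, random_state=0)
    frame_topics = lda.fit_transform(frame_counts).argmax(axis=1)
    # Validate topics against the raw set: keep topics that dominate it.
    topic_hits = np.bincount(frame_topics[raw_set], minlength=n_topics)
    relevant_topics = set(np.nonzero(topic_hits >= topic_hits.max() / 2)[0])
    # Refine: keep raw frames whose topic was judged relevant.
    return [f for f in raw_set if frame_topics[f] in relevant_topics]
```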

8.
Video segmentation is the fundamental step for various applications such as archiving, content-based retrieval, copy detection, and summarization of video data. Shot detection is the first level of segmentation. In this work, a shot detection methodology is presented that revolves around a simple shot transition model based on the similarity of the frames with respect to a reference frame. Frames in an individual shot are very similar in terms of their visual content. Whenever a shot transition occurs, a change in similarity values appears. For an abrupt transition the rate of change is very high, while for a gradual transition it is not so apparent. To overcome the effect of noise in the similarity values, a line is fitted over a small window using linear regression, so that the slope of this line exhibits the underlying pattern of the transition. A novel algorithm for shot detection is hence developed based on the variation pattern of the similarity values of the frames with respect to a reference frame. First an algorithm is proposed that is a direct descendant of the underlying transition model and applies a threshold on the similarity values to detect the transitions. This algorithm is then improved by thresholding the slope of the least-squares linear approximation of the variation in similarity values rather than the absolute values. The threshold on the slope is chosen with a bias towards minimizing the false rejection rate at the cost of the false acceptance rate. Finally, a simple post-processing technique is adopted to reduce false detections. Experiments are conducted on video sequences taken from the TRECVID 2001 database, an action movie, and recorded sports and news video. Comparison with a few other systems indicates that the performance of the proposed scheme is quite satisfactory.
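A minimal sketch of the slope test at the heart of the improved algorithm: fit a least-squares line to the similarity values inside a sliding window and flag windows whose slope magnitude exceeds a threshold; the window size and threshold are illustrative:

```python
import numpy as np

def detect_transitions(similarities, window=7, slope_thresh=0.05):
    """`similarities[i]` is the similarity of frame i to the reference
    frame; returns centre indices of windows that look like transitions."""
    x = np.arange(window)
    hits = []
    for start in range(len(similarities) - window + 1):
        seg = similarities[start:start + window]
        slope = np.polyfit(x, seg, 1)[0]   # degree-1 least-squares fit
        if abs(slope) > slope_thresh:      # steep drop/rise => transition
            hits.append(start + window // 2)
    return hits
```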

9.
Exploiting the baseline property unique to the Uyghur script, a new method is proposed for extracting Uyghur caption frames from video. The caption frames are first read in; candidate shot key frames are then detected in the video segment using pixel-wise differences between adjacent frames and regional pixel statistics. The detected shot key frames are processed region by region to test whether a frame exhibits the baseline property, and a threshold is set according to the baseline; finally, the main frames that represent the video semantics are extracted. Experiments show that the method is simple and effective, with an average caption-frame extraction rate above 85%.
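A minimal sketch of a baseline test of the kind the method relies on: Uyghur text runs along a strong horizontal baseline, so the row-wise edge projection of a caption region shows a dominant peak; the peak-ratio threshold is an illustrative assumption:

```python
import cv2
import numpy as np

def has_baseline(region_gray, peak_ratio=3.0):
    """True if an 8-bit grayscale crop shows a dominant horizontal
    edge row, a crude proxy for the Uyghur baseline property."""
    edges = cv2.Canny(region_gray, 100, 200)
    profile = edges.sum(axis=1).astype(np.float64)  # row-wise edge energy
    if profile.mean() == 0:
        return False
    return profile.max() / profile.mean() > peak_ratio
```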

10.

Closed-circuit television (CCTV) cameras are widely used in monitoring. This paper presents an intelligent CCTV crowd counting system based on two algorithms that estimate the density of each pixel in each frame and use it as a basis for counting people. One algorithm uses scale-invariant feature transform (SIFT) features and clustering to represent the pixels of frames (the SIFT algorithm), and the other uses features from accelerated segment test (FAST) corner points together with SIFT features (the SIFT-FAST algorithm). Each algorithm is designed using a novel combination of pixel-wise processing, motion regions, a grid map, background segmentation using a Gaussian mixture model (GMM), and edge detection. A fusion technique is proposed and used to validate the accuracy by combining the results of the algorithms at the frame level. The proposed system is more practical than state-of-the-art regression methods because it is trained with a small number of frames, so it is relatively easy to deploy. In addition, it reduces the training error, set-up time, and cost, and opens the door to developing more accurate people detection methods. The University of California San Diego (UCSD) and Mall datasets have been used to test the proposed algorithms. The mean deviation error, mean squared error, and mean absolute error of the proposed system are less than 0.1, 16.5, and 3.1, respectively, for the Mall dataset and less than 0.07, 5.5, and 1.9, respectively, for the UCSD dataset.
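A minimal sketch in the spirit of the SIFT-FAST pipeline: GMM background subtraction isolates motion regions, FAST corners are detected inside them, and the corner count serves as a crude crowd-size proxy; the person-per-corner calibration factor is an illustrative stand-in for the trained density estimate:

```python
import cv2

def crowd_count(frame, bg_subtractor, fast, corners_per_person=15.0):
    motion = bg_subtractor.apply(frame)                 # GMM foreground mask
    motion = cv2.threshold(motion, 127, 255, cv2.THRESH_BINARY)[1]
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    keypoints = fast.detect(gray, motion)               # corners in motion regions
    return len(keypoints) / corners_per_person

bg = cv2.createBackgroundSubtractorMOG2()
fast = cv2.FastFeatureDetector_create(threshold=25)
```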


11.
Temporal segmentation of successive actions in a long-term video sequence has been a long-standing problem in computer vision. In this paper, we exploit a novel learning-based framework. Given a video sequence, only a few characteristic frames are selected by the proposed selection algorithm; the likelihood with respect to the trained models is then calculated in a pair-wise way, and segmentation is finally obtained as the optimal model sequence that realizes the maximum likelihood. The average accuracy on the IXMAS dataset reached 80.5% at the frame level, using only 16.5% of all frames, with a computation time of 1.57 s per video (1,160 frames on average).

12.
Chao Li, Zhihua Chen, Bin Sheng, Ping Li, Gaoqi He. Multimedia Tools and Applications, 2020, 79(7-8): 4661-4679

In this paper, we introduce an approach to remove flicker from videos, where the flicker is caused by applying image-based processing methods to the original video frame by frame. First, we propose a multi-frame based video flicker removal method: we utilize multiple temporally corresponding frames to reconstruct the flickering frame. Compared with traditional methods, which reconstruct the flickering frame from a single adjacent frame, reconstruction with multiple temporally corresponding frames reduces warp inaccuracy. We then optimize our flicker removal method in two ways. On the one hand, we detect the flickering frames in the video sequence with temporal consistency metrics; reconstructing only the flickering frames accelerates the algorithm greatly. On the other hand, we choose only the preceding temporally corresponding frames to reconstruct the output frames. We also accelerate flicker removal with the GPU. Qualitative experimental results demonstrate the efficiency of our proposed method, and with algorithmic optimization and GPU acceleration its running time also outperforms traditional video temporal coherence methods.
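A minimal sketch of flicker-frame detection with one simple temporal consistency metric (mean brightness); the deviation threshold is illustrative and the paper's actual metrics may differ:

```python
import numpy as np

def flicker_frames(frames, thresh=8.0):
    """`frames` is a list of grayscale arrays; returns indices of frames
    whose mean brightness jumps against both temporal neighbours."""
    means = np.array([f.mean() for f in frames])
    flagged = []
    for i in range(1, len(frames) - 1):
        d_prev = means[i] - means[i - 1]
        d_next = means[i] - means[i + 1]
        if abs(d_prev) > thresh and abs(d_next) > thresh and d_prev * d_next > 0:
            flagged.append(i)   # brightness spike/dip relative to both sides
    return flagged
```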


13.
The rapid growth of video data raises a series of problems and challenges for video browsing, storage, and retrieval, and video summarization is an effective way to address them. Existing video summarization algorithms construct objective functions from constraints and empirically chosen settings and score sets of frames, which brings uncertainty and high complexity. To address this, a learning-to-rank based video summarization method is proposed. The method casts summary extraction as ranking frames by their relevance to the video content: a ranking function is learned on a training set so that the highest-ranked frames are those most relevant to the video, frames are scored with the learned function, and the top-scoring frames are selected as key frames to form the summary. Moreover, compared with existing methods, this method scores individual frames rather than sets of frames, which significantly reduces the computational complexity. Experiments on the TVSum50 dataset confirm the effectiveness of the method.
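A minimal sketch of the rank-and-select step, assuming per-frame feature vectors and pairwise preference data; a RankSVM-style linear ranker built on sklearn's LinearSVC stands in for the paper's learner:

```python
import numpy as np
from sklearn.svm import LinearSVC

def train_ranker(feat_pairs):
    """`feat_pairs` is a list of (more_relevant, less_relevant) feature
    pairs; learns w so that w.(a - b) > 0 when a outranks b."""
    X = np.array([a - b for a, b in feat_pairs]
                 + [b - a for a, b in feat_pairs])
    y = np.array([1] * len(feat_pairs) + [0] * len(feat_pairs))
    return LinearSVC().fit(X, y)

def summarize(frame_feats, ranker, k=10):
    scores = frame_feats @ ranker.coef_.ravel()   # score each frame
    return np.sort(np.argsort(scores)[::-1][:k])  # top-k frames, in time order
```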

14.
In this paper, we propose a novel motion-based video retrieval approach to find desired videos in video databases through trajectory matching. The main component of our approach is extracting representative motion features from the video, which breaks down into the following three steps. First, we extract the motion vectors from each frame and utilize Harris corner points to compensate for the effect of camera motion. Second, we find interesting motion flows in the frames using a sliding-window mechanism and a clustering algorithm. Third, we merge the generated motion flows and select representative ones to capture the motion features of the video. Furthermore, we design a symbolic trajectory matching method for effective video retrieval. The experimental results show that our algorithm effectively extracts motion flows with high accuracy and outperforms existing approaches for video retrieval.
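A minimal sketch of the camera-motion compensation step: track corner points with pyramidal Lucas-Kanade flow and subtract the median (global) motion so that only object motion remains; parameter values are illustrative:

```python
import cv2
import numpy as np

def object_motion_vectors(prev_gray, next_gray):
    """Per-point motion with the dominant camera motion removed.
    Assumes both inputs are 8-bit grayscale frames with trackable corners."""
    pts = cv2.goodFeaturesToTrack(prev_gray, maxCorners=200,
                                  qualityLevel=0.01, minDistance=7)
    nxt, status, _ = cv2.calcOpticalFlowPyrLK(prev_gray, next_gray, pts, None)
    good = status.ravel() == 1
    flow = (nxt - pts).reshape(-1, 2)[good]
    camera = np.median(flow, axis=0)    # dominant (camera) motion
    return flow - camera                # residual object motion
```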

15.
Video frame rate determines the perceived quality of a video: the higher the frame rate, the smoother the motion in the picture, the clearer the information expressed, and the better the viewing experience. Video interpolation aims to increase the frame rate by generating a new frame from the relevant information in two consecutive frames, and is essential in the field of computer vision. Traditional motion-compensated interpolation causes holes and overlaps in the reconstructed frame and is easily affected by the quality of the optical flow. Therefore, this paper proposes a video frame interpolation method via optical flow estimation with image inpainting. First, the optical flow between the input frames is estimated via the combined local and global total variation (CLG-TV) optical flow estimation model. Then, the intermediate frames are synthesized under the guidance of the optical flow. Finally, the nonlocal self-similarity between the video frames is used to solve the optimization problem and fix the pixel-loss areas in the interpolated frame. Quantitative and qualitative experimental results show that this method can effectively improve the quality of optical flow estimation, generate realistic and smooth video frames, and effectively increase the video frame rate.
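A minimal sketch of flow-guided synthesis of the middle frame, using OpenCV's Farneback flow as a stand-in for the CLG-TV model; hole filling and the inpainting refinement are omitted:

```python
import cv2
import numpy as np

def interpolate_middle(frame0, frame1):
    """Approximate the frame halfway between frame0 and frame1 by warping
    frame0 half a flow step forward (via backward sampling)."""
    g0 = cv2.cvtColor(frame0, cv2.COLOR_BGR2GRAY)
    g1 = cv2.cvtColor(frame1, cv2.COLOR_BGR2GRAY)
    flow = cv2.calcOpticalFlowFarneback(g0, g1, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    h, w = g0.shape
    grid_x, grid_y = np.meshgrid(np.arange(w), np.arange(h))
    # Sample frame0 half a flow step back so content lands mid-way.
    map_x = (grid_x - 0.5 * flow[..., 0]).astype(np.float32)
    map_y = (grid_y - 0.5 * flow[..., 1]).astype(np.float32)
    return cv2.remap(frame0, map_x, map_y, cv2.INTER_LINEAR)
```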

16.
A retrieval algorithm is proposed that represents the content of video shots with reconstructed static and dynamic key frames. The algorithm fully exploits the temporal characteristics of the video sequence to construct static key frames and uses a three-dimensional wavelet transform to construct dynamic key frames, combining motion information with static information to better reflect the content of the video sequence. Experimental results show that the algorithm outperforms the algorithm based on temporal-maximum key frames as well as its improved variants.

17.
With the growing demand for visual information of rich content, effective and efficient manipulation of large video databases is increasingly desired. Many investigations have been made into content-based video retrieval. However, despite its importance, video subsequence identification, which is to find content similar to a short query clip within a long video sequence, has not been well addressed. This paper presents a graph transformation and matching approach to this problem, with an extension to identify occurrences of potentially different ordering or length due to content editing. With a novel batch query algorithm to retrieve similar frames, the mapping relationship between the query and the database video is first represented by a bipartite graph. The densely matched parts along the long sequence are then extracted, followed by a filter-and-refine search strategy to prune irrelevant subsequences. During the filtering stage, maximum size matching is deployed for each subgraph constructed from the query and a candidate subsequence to obtain a smaller set of candidates. During the refinement stage, sub-maximum similarity matching is devised to identify the subsequence with the highest aggregate score from all candidates, according to a robust video similarity model that incorporates visual content, temporal order, and frame alignment information. Performance studies conducted on a 50-hour video recording validate that our approach is promising in terms of both search accuracy and speed.
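A minimal sketch of the matching idea: build a similarity matrix between query frames and a candidate subsequence and find the maximum-similarity one-to-one assignment; SciPy's Hungarian solver stands in for the paper's sub-maximum similarity matching:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def subsequence_score(query_feats, cand_feats):
    """Aggregate similarity of the best one-to-one frame assignment."""
    # Cosine similarity between every query/candidate frame pair.
    q = query_feats / np.linalg.norm(query_feats, axis=1, keepdims=True)
    c = cand_feats / np.linalg.norm(cand_feats, axis=1, keepdims=True)
    sim = q @ c.T
    rows, cols = linear_sum_assignment(-sim)   # maximize total similarity
    return sim[rows, cols].sum(), list(zip(rows, cols))
```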

18.
This paper presents infrared single-pixel video imaging for sea-surface surveillance. Based on the temporal redundancy of the surveillance video, a two-step scheme is proposed, comprising low-scale detection and high-scale detection. For each frame, low-scale detection performs low-resolution single-pixel imaging to obtain a "preview" image of the scene, in which moving targets can be located. These targets are further refined in the high-scale detection, where high-resolution single-pixel imaging focused on the targets is used. The frame is reconstructed by merging these two levels of images. Simulated experiments show that for a video with 128 × 128 pixels and 150 frames, the sampling rate of our scheme is about 17.8%, and the reconstructed video presents good visual quality.

19.
In the field of multimedia retrieval in video, text frame classification is essential for text detection, event detection, event boundary detection, etc. We propose a new text frame classification method that introduces a combination of wavelet and median-moment features with k-means clustering to select probable text blocks among 16 equally sized blocks of a video frame. The same feature combination is used with a new Max-Min clustering at the pixel level to choose probable dominant text pixels in the selected probable text blocks. For the probable text pixels, a so-called mutual-nearest-neighbor based symmetry is explored with a four-quadrant formation centered at the centroid of the probable dominant text pixels to decide whether a block is a true text block. If a frame produces at least one true text block, it is considered a text frame; otherwise it is a non-text frame. Experimental results on different text and non-text datasets, including two public datasets and our own data, show that the proposed method gives promising results in terms of recall and precision at the block and frame levels. Further, we also show how existing text detection methods tend to misclassify non-text frames as text frames in terms of recall and precision at both the block and frame levels.
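A minimal sketch of the block-selection step: split a frame into 16 equal blocks, describe each with a simple texture feature (a stand-in for the wavelet and median-moment combination), and cluster the blocks with k-means; the busiest cluster is taken as the probable text blocks:

```python
import numpy as np
from sklearn.cluster import KMeans

def probable_text_blocks(gray):
    """Indices (0-15, row-major) of probable text blocks in a 4x4 grid."""
    h, w = gray.shape
    bh, bw = h // 4, w // 4
    blocks = [gray[r * bh:(r + 1) * bh, c * bw:(c + 1) * bw]
              for r in range(4) for c in range(4)]
    # Per-block feature: intensity spread and gradient energy (illustrative).
    feats = np.array([[b.std(), np.abs(np.diff(b.astype(float))).mean()]
                      for b in blocks])
    labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(feats)
    text_cluster = labels[feats[:, 1].argmax()]   # cluster of busiest block
    return [i for i, l in enumerate(labels) if l == text_cluster]
```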

20.

Automatic detection and counting of vehicles in a video is a challenging task and has become a key application area of traffic monitoring and management. In this paper, an efficient real-time approach for detecting and counting moving vehicles is presented based on YOLOv2 and feature point motion analysis. The work is based on synchronized detection and tracking of vehicle features to achieve accurate counting results. The proposed strategy works in two phases: the first is vehicle detection and the second is counting of the moving vehicles. Different convolutional neural networks, including pixel-by-pixel classification networks and regression networks, are investigated to improve the detection and counting decisions. For initial object detection, we utilize the state-of-the-art deep learning object detector YOLOv2 before refining the detections using K-means clustering and a KLT tracker. An efficient approach is then introduced that uses the temporal information of the detected and tracked feature points between frame sets to assign each vehicle a label together with its corresponding trajectory and count it correctly. Experimental results on twelve challenging videos show that the proposed scheme generally outperforms state-of-the-art strategies. Moreover, the proposed approach using YOLOv2 increases the average speed for the twelve tested sequences to 18.7 frames per second, improvements of 93.4% and 98.9% over the 1.24 frames per second achieved using the Faster Region-based Convolutional Neural Network (Faster R-CNN) and the 0.19 frames per second achieved using the background-subtraction-based CNN approach (BS-CNN), respectively.
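A minimal counting sketch in the spirit of the pipeline: boxes from a detector (a hypothetical `detect_vehicles` standing in for YOLOv2) are reduced to centroids, greedily associated with existing trajectories, and counted when a trajectory crosses a virtual line:

```python
import numpy as np

def update_tracks(tracks, detections, count, line_y=240, max_dist=40.0):
    """tracks: list of [last_centroid, counted_flag]; detections: centroid
    arrays (shape (2,)) from the current frame, e.g. from a hypothetical
    detect_vehicles(frame). Returns updated tracks and count."""
    for det in detections:
        if tracks:
            dists = [np.linalg.norm(det - t[0]) for t in tracks]
            j = int(np.argmin(dists))
            if dists[j] < max_dist:                  # match to nearest track
                prev_y = tracks[j][0][1]
                if not tracks[j][1] and prev_y < line_y <= det[1]:
                    count += 1                       # crossed the count line
                    tracks[j][1] = True
                tracks[j][0] = det
                continue
        tracks.append([det, False])                  # start a new trajectory
    return tracks, count
```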

