Statistical Evaluation of Video Summarization Models from an Empirical Perspective
Video summarization has become an integral part of video processing systems, because these systems tend to capture large volumes of redundant data, which reduces their processing speed. Extracting useful information from this large pool of video frames requires a complex image signal analysis model, one that can remove redundant frames so that the final summarized video retains the same information as the original input video. This task is performed by a cascade of video processing blocks, including but not limited to pre-filtering, segmentation, feature extraction, feature selection, classification, and post-processing. Because these blocks are interdependent, high efficiency in one block yields even higher efficiency in the blocks cascaded after it. A large number of algorithms and system models are available for each stage: feature extraction models such as the wavelet transform, convolutional transforms, and the Fourier transform describe video frames as numerical sequences, while classification architectures such as convolutional neural networks and recurrent neural networks categorize video frames as redundant or non-redundant. Given this wide availability of algorithms, it is difficult for researchers and multimedia system designers to select the best possible architectural combination for their application. This text aims to assist these researchers by empirically comparing the performance and applicability of these architectures and recommending the best-performing algorithmic combination for a given application. It also recommends future directions that can be adopted to improve the efficiency of these architectures.
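To make the redundancy-removal idea concrete, the following minimal sketch drops frames whose pixel-level difference from the last kept frame falls below a threshold. The function name, the mean-absolute-difference criterion, and the threshold value are illustrative assumptions for exposition only, not any specific model evaluated in this text; the feature-extraction and classification architectures compared here are far more elaborate.

```python
import numpy as np

def summarize_frames(frames, threshold=10.0):
    """Simple redundancy filter (illustrative assumption, not the
    surveyed models): keep a frame only if its mean absolute pixel
    difference from the last kept frame exceeds `threshold`."""
    if not frames:
        return []
    kept = [frames[0]]
    for frame in frames[1:]:
        diff = np.mean(np.abs(frame.astype(float) - kept[-1].astype(float)))
        if diff > threshold:
            kept.append(frame)
    return kept

# Synthetic example: five copies of a static scene, then three of a new scene.
static = np.zeros((4, 4), dtype=np.uint8)
changed = np.full((4, 4), 200, dtype=np.uint8)
summary = summarize_frames([static] * 5 + [changed] * 3)
print(len(summary))  # 2: one frame per distinct scene
```

In a full pipeline, this thresholding step would be replaced by the feature-extraction and classification blocks discussed above, which judge redundancy in a learned feature space rather than in raw pixel space.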