Parallelization Strategies and Performance Analysis of Media Mining Applications on Multi-Core Processors |
| |
Authors: | Wenlong Li, Xiaofeng Tong, Tao Wang, Yimin Zhang, Yen-Kuang Chen |
| |
Affiliation: | (1) Microprocessor Technology Lab, Intel Corp, Santa Clara, CA, USA; (2) Microprocessor Technology Lab, Intel Corp, Beijing, China |
| |
Abstract: | This paper studies how to parallelize emerging media mining workloads on existing small-scale multi-core processors and
future large-scale platforms. Media mining is an emerging technology to extract meaningful knowledge from large amounts of
multimedia data, aiming at helping end users search, browse, and manage multimedia data. Many of the media mining applications
are very complicated and require a huge amount of computing power. The advent of multi-core architectures provides an opportunity
to accelerate media mining. However, to efficiently utilize multi-core processors, we must effectively execute many
threads at the same time. In this paper, we show how to exploit multi-core processors to speed up computation-intensive
media mining applications. We first parallelize two media mining applications by extracting the coarse-grained parallelism
and evaluate their parallel speedups on a small-scale multi-core system. Our experiments show that the coarse-grained parallelization
achieves good, though not perfect, scaling. When examining the memory requirements, we find that these coarse-grained
parallelized workloads exhibit high memory demand. Their working set sizes increase almost linearly with the degree of parallelism,
and the instantaneous memory bandwidth usage prevents them from scaling perfectly on the 8-core machine. To avoid the memory
bandwidth bottleneck, we turn to exploiting fine-grained parallelism and evaluate the parallel performance on the 8-core
machine and a simulated 64-core processor. Experimental data show that the fine-grained parallelization demonstrates much
lower memory requirements than the coarse-grained one, but exhibits significant read-write data sharing behavior. Therefore,
the expensive inter-thread communication limits the parallel speedup on the 8-core machine, while excellent speedup is observed
on the large-scale processor as fast core-to-core communication is provided via a shared cache. Our study suggests that (1)
extracting the coarse-grained parallelism scales well on small-scale platforms, but poorly on large-scale systems; (2) exploiting
the fine-grained parallelism is suitable to realize the power of large-scale platforms; (3) future many-core chips can provide
shared cache and sufficient on-chip interconnect bandwidth to enable efficient inter-core communication for applications with
significant amounts of shared data. In short, this work demonstrates that proper parallelization techniques are critical to the
performance of multi-core processors. We also demonstrate that performance analysis is an important factor in parallelization.
The parallelization principles, practice, and performance analysis methodology presented in this paper are also
useful to anyone seeking to exploit thread-level parallelism in their applications.
|
| |
Keywords: | Media mining; Parallelization; Performance analysis; Multi-core processor |
This article is indexed in SpringerLink and other databases. |
|
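The abstract contrasts coarse-grained and fine-grained parallelization without showing what either decomposition looks like. The sketch below is only an illustration of the two task structures, not the authors' implementation: the kernel `frame_feature`, the functions `coarse_grained` and `fine_grained`, and the toy "frames" are all hypothetical names and data introduced here. In the coarse-grained version each worker owns a whole frame, so the number of frames in flight (and hence the working set) grows with the worker count. In the fine-grained version all workers cooperate on one frame at a time and combine partial results, which keeps fewer frames live but requires sharing data between workers.

```python
from concurrent.futures import ThreadPoolExecutor


def frame_feature(frame):
    """Toy per-frame kernel (an assumption): sum of squared pixel values."""
    return sum(p * p for p in frame)


def coarse_grained(frames, workers=4):
    """One task per frame: each worker processes a whole frame by itself."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(frame_feature, frames))


def fine_grained(frames, workers=4):
    """All workers cooperate on a single frame; frames are handled one by one."""
    results = []
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for frame in frames:
            # Split the frame into roughly equal chunks (ceiling division).
            step = max(1, -(-len(frame) // workers))
            chunks = [frame[i:i + step] for i in range(0, len(frame), step)]
            # Partial sums are combined into one per-frame result.
            results.append(sum(pool.map(frame_feature, chunks)))
    return results


# Hypothetical input: eight small "frames" of integer pixel values.
frames = [list(range(n, n + 1000)) for n in range(8)]
sequential = [frame_feature(f) for f in frames]
```

Both functions return the same per-frame values as the sequential loop. Because CPython threads share a global interpreter lock, this sketch shows the decomposition only and says nothing about the speedups or memory behavior reported in the paper.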