GraphicsMagick is being transformed to use OpenMP for the forthcoming 1.3 release. OpenMP is a portable framework for accelerating CPU-bound and memory-bound operations using multiple threads. OpenMP originates in the super-computing world and has been available in one form or another since the late '90s.
Since GCC 4.2 has introduced excellent OpenMP support, OpenMP has become available to the masses. Microsoft Visual Studio 2008 supports OpenMP 2.0 so Windows users should benefit as well. Any multi-CPU and/or multi-core system is potentially a good candidate for use with OpenMP. Unfortunately, some older multi-CPU hardware is more suitable for multi-processing than multi-threading. Recent multi-core chipsets from Intel and AMD perform very well with OpenMP. The operating system makes a difference when it comes to OpenMP acceleration, with Linux and Solaris working exceptionally well, and FreeBSD and Apple's OS-X working poorly.
Most image processing routines are comprised of loops which iterate through the image pixels, image rows, or image regions. These loops are accelerated using OpenMP by executing portions of the total loops in different threads, and therefore on a different processor core. CPU-bound algorithms benefit most from OpenMP, but memory-bound algorithms may also benefit as well since the memory is accessed by different CPU cores, and sometimes the CPUs have their own path to memory. For example, the AMD Opteron is a NUMA (Non-Uniform Memory Architecture) design such that multi-CPU systems split the system memory across CPUs so each CPU adds more memory bandwidth as well.
For severely CPU-bound algorithms, it is not uncommon to see a linear speed-up due to the number of cores. For example, a two core system executes the algorithm twice as fast, and a four core system executes the algorithm four times as fast. Memory-bound algorithms scale based on the memory bandwith available to the cores. For example, memory-bound algorithms scale up to almost 2X on my four core Opteron system due to its NUMA architecture. Some systems/CPUs are able to immediately context switch to another thread if the core would be blocked waiting for memory, allowing multiple memory accesses to be pending at once, and thereby improving throughput.
The initial approach used in GraphicsMagick is to recognize the various access patterns in the existing code, and re-write the algorithms (sometimes from scratch) to be based on a framework that we call "pixel iterators". With this approach, the computation is restricted to a small unit (a callback function) with very well defined properties, and no knowledge as to how it is executed or where the data comes from. This approach removes the loops from the code and puts the loops in the framework, which may be adjusted based on experience. The continuing strategy will be to recognize design patterns and build frameworks which support those patterns. Sometimes algorithms are special/exotic enough that it is much easier to instrument the code for OpenMP rather than to attempt to fit the algorithm into a framework.
Since OpenMP is based on multi-threading, multiple threads attempt to access the underlying storage at once. The interface to this underlying storage is called the "pixel cache". Unfortunately, the existing pixel cache code was written with only one thread per image in mind. It is not feasable to make it optimized for multiple threads per image without a total re-write. Due to this, a sort of "global lock" has been applied to the pixel cache to make it safe for use with OpenMP. This potentially results in a performance penalty since if the pixels are in memory, it should be possible for the threads to access the memory with no locking at all. Testing shows that the pixel cache locking is not noticeable with medium and large images, but may be noticeable with thumbnail-sized images.
The following is an example of per-core speed-up due to OpenMP on a four-core system. All the pixel quantum values are divided by 2:
% for threads in 1 2 3 4 do env OMP_NUM_THREADS=$threads gm benchmark -duration 10 convert \ -size 2048x1080 -type truecolor 'xc:#f00' -operator all divide 2.0 null: done Results: 252 iter 10.02s user 10.01s total 25.17 iter/s (25.15 iter/s cpu) Results: 442 iter 18.75s user 10.00s total 44.20 iter/s (23.57 iter/s cpu) Results: 607 iter 26.92s user 10.01s total 60.64 iter/s (22.55 iter/s cpu) Results: 755 iter 33.78s user 10.01s total 75.42 iter/s (22.35 iter/s cpu)
Note that the "iter/s cpu" value is a measure of the number of iterations given the amount of reported CPU time consumed. It is an effective measure of relative efficacy.
This program is covered by multiple licenses, which are described in Copyright.txt. You should have received a copy of Copyright.txt with this package; otherwise see http://www.graphicsmagick.org/www/Copyright.html.