Acceleration is the key to improved processing performance.

There are significant speed increases for various operations on a GPU compared to a multi-core CPU system. All the operations, particularly convolution, benefit from GPU acceleration. Even addition, which could not be accelerated with multiple cores, enjoys notable speed increases. However, there is a caveat since data needs to be transferred to and from the GPU. This ultimately caps the maximum achievable performance. Source: Matrox Imaging

The continual increase in processing power has allowed the machine vision industry to push performance boundaries. Machine vision users also are demanding more from their applications including acquiring more, processing more and analyzing more. Greater computing power makes higher system throughput and greater capability possible, which leads to greater competitiveness.

What does improved performance mean? There are three objectives of image processing acceleration:
• Increase the image resolution-process higher bit depth or larger images.
• Increase the frame rate.
• Implement additional processing.

So how can computing technology be used to increase performance? For the developer with a solid understanding of image processing algorithms, three options are possible. Performance can be improved by leveraging multi-core central processing units (CPUs), graphics processor units (GPUs) or field programmable gate arrays (FPGAs); however, success will depend on choosing the right processing technologies for the required algorithms, or conversely, selecting the appropriate algorithms for the available processing technologies.

Algorithm spectrum

Algorithms behave differently according to their function. At one extreme are algorithms whose execution depends little, or not at all, on the contents of the source image. The algorithms consist of short, fixed sequences of operations with no or few alternate paths of execution commonly known as branching. They are deterministic, meaning that their execution time for a given image size is tightly bound. Such algorithms are those that employ rudimentary image processing operations such as arithmetic and neighborhood. For example, dead pixel correction is typically a straightforward neighborhood operation while gain and offset correction uses arithmetic.

At the other extreme are algorithms whose execution depends significantly on the contents of the source data and where the source data has features previously obtained from an image rather than pixels. They are nondeterministic in terms of execution time since they depend heavily on the contents of the source data. They are lengthier and more complex, and make significant use of branching and iterative trial or heuristics. These algorithms work on edge sets, blobs and lists of points to perform. Examples include geometric pattern recognition, character recognition and metrology.

The example of a sequence of operations performed on a multi-core central processing unit (CPU) vs. a field programmable gate array (FPGA), as a parallel pipeline. Source: Matrox Imaging

Application stages

An application typically incorporates two stages in order to determine a result-pre-processing followed by processing and analytics.

• The pre-processing stage readies an image for processing and analytics. Algorithms that have little or no dependence on the source image, do not use branching and heuristics, and are deterministic in nature, typically are used in the pre-processing stage. In general they convert, enhance, reformat or simplify images.

• The processing and analytics stage builds, constructs, computes, extracts, identifies or locates features in an image. It then goes on to analyze, filter, measure or sort features. Processing and analytics algorithms are dependent on contents of the source data, employ branching and heuristics, and are non-deterministic in nature.

Parallel execution

An algorithm can be accelerated if it, or a portion of it, can be parallelized. Parallelization refers to splitting up the source data or the algorithm itself so that different portions are processed simultaneously. Efficient parallelization requires that each portion be processed autonomously for as long as possible, and the longer the better. Parallelization breaks down when the processing of one portion depends on the processing of another.

Algorithms used the in pre-processing stage lend themselves well to parallelization-algorithms used in the processing and analytics stage typically do not. When it does apply, parallelization can have dramatic benefits: In some cases a convolution can be performed almost eight times faster on eight processor cores.

The 15x15 and 3x3 convolution (neighborhood operation) are compute-bound on the CPU used; an increase in the core count increases the performance. The 3x3 convolution favors IO more than the larger 15x15 convolution, as shown by the lower returns when more cores are added. The addition operation illustrates an IO-bound function; an increasing number of cores does not translate to increasing performance and is even detrimental. Source: Matrox Imaging

Compute vs. IO bound

The performance of an algorithm, or more specifically, the operations within an algorithm, also is affected by the demands on the underlying hardware. Some operations such as convolution enjoy significant acceleration when performed by two, four and eight cores, and some operations such as addition only have marginal increases. An understanding of key hardware characteristics and their influence on execution performance can provide some answers.

Compute-bound operations are limited by the frequency and number of parallel processing elements. A compute-bound operation spends more time on mathematical instructions than read/write or input/output (IO) instructions. Increasing the frequency or number of parallel processing elements can increase the performance of these functions. But there is a limit to how much processing time can be reduced. Core frequency is approaching its maximum speeds, and in some cases, adding more parallel processing elements without increasing the IO bandwidth will impede the operation’s performance, rather than improve it. The addition operation exhibits this behavior; it increases with two and four cores, but decreases with eight.

IO-bound operations are limited by a processing device’s ability to receive and send data. A good example of this type of operation is the simple adding of a constant number; this single instruction uses a single input and a single output. It is affected by the speed at which the processing device receives and sends data, not the speed of the processing elements. Increasing the IO bandwidth (including memory) can increase the performance of these functions. Note that there is an upper limit to performance gains, and increasing the bandwidth without increasing the core frequency or number of cores, the function will eventually become compute-bound.

Pipelines to the rescue

For FPGAs, overcoming the IO-bound restriction can be achieved by implementing an algorithm as a pipeline. A pipeline passes the results from one operation directly on to the next, effectively minimizing the amount of intermediary device IO. Pipelines also can be parallelized by overlapping the execution of some operations. The pipeline also should be carefully balanced so that each operation performs at roughly the same rate in order to benefit the most from this parallelization.

Applying the technology

Some image processing algorithms can be accelerated through parallelism; others enjoy increased processing speeds by relying on a higher core frequency, or by increasing the bandwidth to the processing device.

Getting maximum acceleration will require the judicious use of CPU, GPU and FPGA technologies. Performance gains can be achieved by adhering to the following guidelines:

• Use FPGA technology for pre-processing that is tied to acquisition. An FPGA is well suited to parallel processing although device IO limits are a concern. However, as stated, these are readily resolved by implementing a proper pipeline.

• Use GPU technology, with its massive number of parallel processing elements and tremendous memory bandwidth, to accelerate pre-processing algorithms. Be mindful of the overhead for getting the data to the GPU and back. A GPU is best used for longer algorithms or very high data rates.

• Use CPU technology for processing and analytics. A CPU is best for dealing with branching and heuristics. Invoke the multiple cores within a CPU when parallelization applies and a transfer of overhead to a GPU does not make sense.

The finish line

Before attempting to increase processing speed, check the computer’s hardware resources. How many cores does it have? How many processing elements are on the GPU? After answering those questions, investigate the software tools and the application. Determine what types of algorithms are used; parallelize the ones that can be parallelized.

“Opening up” an algorithm to its lowest code level and optimizing it is not a task to be taken lightly; one must have the expertise and resources to do it.

Ask the vendors the right questions: “Is your software package optimized for new types of imaging hardware devices? Is your software optimized for multiple-core systems? Can you offload functions to GPUs or FPGAs?” Working with a vendor who has this experience will save time, money and a lot of heartache.

Pierantonio Boriero is the product line manager and Dwayne Crawford is the product manager at Matrox Imaging (Dorval, Quebec, Canada). For more information, call (800) 804-6243, e-mail [email protected] or visit

Tech Tips

- An algorithm can be accelerated if it, or a portion of it, can be parallelized.

- Parallelization refers to splitting up the source data or the algorithm itself so that different portions are processed simultaneously.

- Efficient parallelization requires that each portion be processed autonomously for as long as possible, and the longer the better.

- Parallelization breaks down when the processing of one portion depends on the processing of another.