To create high-performance code on GPUs, use the Metal framework instead.

GPUs and CPUs have fundamentally different architectures and so require different optimizations for OpenCL. A CPU has a relatively small number of processing elements and a large amount of memory (both a large cache and a much larger amount of RAM available on the circuit board). A GPU has a relatively large number of processing elements and usually has less memory than a CPU. Therefore, the code that runs fastest on a GPU will be designed to take up less memory and take advantage of the GPU's superior processing power. In addition, GPU memory access is fast when the access pattern matches the memory architecture, so the code should be designed with this in mind.

It is possible to write OpenCL code that can run efficiently on both a CPU and a GPU. However, to obtain optimal performance it is usually necessary to write different code for each type of device.

This chapter focuses on how to improve performance on the GPU. It begins by describing the significant performance improvements on the GPU that can be obtained through tuning (see Why You Should Tune), lists APIs you can use to time code execution (see Measuring Performance On Devices), describes how you can estimate the optimal performance of your GPU devices (see Generating the Compute/Memory Access Peak Benchmark), describes a protocol that can be followed to tune GPU performance (see Tuning Procedure), then steps through an example in which performance improvement is obtained. See Table 14-1 at the end of the chapter for generally applicable suggestions for measuring and improving performance on most GPUs. (See Improving Performance On the CPU for suggestions for optimizing performance on the CPU.)

Note: This chapter is based upon a talk at WWDC 2011 called What's New in OpenCL?.

Tuning your OpenCL code for the GPU can result in a two- to ten-fold improvement in performance. Figure 14-1 illustrates typical improvements in processing speed obtained when an application that executes a Gaussian blur on a 16 MP image was optimized. The process followed to optimize this code is described in Example: Tuning Performance Of a Gaussian Blur.

Figure 14-1 Improvement expected

Before Optimizing Code

Decide whether the code really needs to be optimized. Optimization can take significant time and effort. Weigh the costs and benefits of optimization before starting any optimization effort.

Estimate optimal performance. Run some simple kernels on your GPU device to estimate its capabilities. You can use the techniques described in Measuring Performance On Devices to measure how long kernel code takes to run. See Generating the Compute/Memory Access Peak Benchmark for examples of code you can use to test memory access speed and processing speed.

Generate or collect sample data to feed through each iteration of optimization. Run the unoptimized original code on the sample data and save the results. Then run each major iteration of the optimized code against the same data and compare the results to the original results to ensure your output has not been corrupted by the changed code.
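The advice to save the results of the unoptimized code and compare every optimized iteration against them can be sketched on the host side in plain C. This is a minimal illustration, not code from the chapter: the function name `count_mismatches` and the `tolerance` parameter are assumptions introduced here. A tolerance is used because floating-point results often differ slightly once the order of operations changes.

```c
#include <stddef.h>

/* Hypothetical helper (not part of OpenCL or the original chapter).
 * Counts the elements whose values differ by more than `tolerance`
 * between the saved baseline output and the output of an optimized
 * kernel. Returns 0 when the two buffers agree within the tolerance. */
size_t count_mismatches(const float *baseline, const float *optimized,
                        size_t count, float tolerance)
{
    size_t mismatches = 0;
    for (size_t i = 0; i < count; ++i) {
        float diff = baseline[i] - optimized[i];
        if (diff < 0.0f) {
            diff = -diff;
        }
        /* The negated comparison also counts a NaN difference as a mismatch. */
        if (!(diff <= tolerance)) {
            ++mismatches;
        }
    }
    return mismatches;
}
```

One way to use it is to read the output buffer back to the host after each major optimization step, call the helper with the stored baseline, and treat any nonzero return value as a sign that the change altered the output. A suitable tolerance depends on the kernel and data, so it is left to the caller.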