May 1, 2013

Will I need to have different OpenCL optimizations for different devices?

"Not necessarily. Will you add a small #ifdef in your code to run 50% faster?.. Will you duplicate a 1000-line file for that? Would you do it for only 10% speedup? Or, maybe you would prefer adding the optimization unconditionally and pay 10% slowdown on other devices for 50% improvement?.. It is totally your decision. In some cases, you will need to make the tradeoff between cross device performance and maintainability of your OpenCL application". 
=> OpenCL* Design and Programming Guide for the Intel® Xeon Phi™ Coprocessor
Two interesting papers from FPGA2013 Pre-Conference Tutorials:

April 29, 2013

Ten Reasons Why Android Should Support OpenCL

4k* HEVC => Actual performance of Motion Estimation and OpenCL Motion Compensation.


After several tests and experiments I have significantly improved the OpenCL code.

One part of our HEVC encoder is the Motion Estimation module.
This module is used to generate quarter-pel motion vectors for testing Motion Compensation.

The numbers below are the actual performance of both modules.

Motion Estimation (executed on the CPU, one core of an i5-3570K at 3.4 GHz)
   // copy and pad input image, some initializations, Scene Change Detection:
   vsshevc::VssPreAnalyzerImpl::PreparePicture [ 10]: 68.146 ms, (6.815 ms each)
   // perform Motion Estimation for one reference frame (fastest mode):
   vsshevc::VssPreAnalyzerImpl::DoMotionEstimation [ 9]: 80.901 ms, (8.989 ms each)

Motion Compensation (OpenCL, AMD Radeon HD 7750)
    HevcCL::FillMotionData [ 9]: 24.594 ms, (2.733 ms each) - copy motion vectors
    HevcCL::RunFilter [ 9]: 64.640 ms, (6.960 ms each) - filter itself with prolog/epilog

Totals:
  • Motion estimation: ~ 60 FPS / 4k video.
  • Motion compensation: ~ 145 FPS / 4k video.
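
A rough sketch of how these totals can be reproduced from the per-call averages above. Which timings go into each total is an assumption here: motion estimation is taken as PreparePicture plus DoMotionEstimation, and motion compensation as RunFilter alone (the motion vector copy is test-only). The function and constant names are made up for illustration.

```c
#include <assert.h>

/* Per-call averages in milliseconds, copied from the profiler output above. */
static const double PREPARE_PICTURE_MS   = 6.815;
static const double MOTION_ESTIMATION_MS = 8.989;
static const double RUN_FILTER_MS        = 6.960;

/* Convert an average per-frame time in milliseconds to frames per second. */
double fps_from_ms(double ms_per_frame)
{
    assert(ms_per_frame > 0.0);
    return 1000.0 / ms_per_frame;
}

/* Assumption: motion estimation cost = picture preparation + search. */
double motion_estimation_fps(void)
{
    return fps_from_ms(PREPARE_PICTURE_MS + MOTION_ESTIMATION_MS);
}

/* Assumption: motion compensation cost = the filter call only. */
double motion_compensation_fps(void)
{
    return fps_from_ms(RUN_FILTER_MS);
}
```

Under these assumptions the results land slightly above 60 FPS and slightly below 145 FPS, which matches the rounded totals.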


Notes:

  1. Motion compensation is done on 8x8 luma blocks. Each block requires about 16x8x8 + 8x8x8 = 1536 multiplications, i.e. 24 multiplications per pixel. For real-world video this is close to the worst case.
  2. Input frame copying, output frame copying and motion vector filling are needed only for the test. In a real encoder or decoder these buffers will already be prepared.
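
The arithmetic in note 1 can be checked with a few lines of C. This is only a sketch: the split into a 16x8x8 first pass and an 8x8x8 second pass is taken from the note as written, and the function names are invented for the example.

```c
#include <assert.h>

/* Multiplications for one 8x8 luma block, using the estimate from note 1:
   16x8x8 for the first filter pass plus 8x8x8 for the second. */
long mults_per_block(void)
{
    return 16L * 8 * 8 + 8L * 8 * 8;
}

/* The same cost spread over the 64 pixels of the block. */
double mults_per_pixel(void)
{
    return (double)mults_per_block() / (8 * 8);
}

/* Luma multiplications per second at a given frame size and frame rate. */
double luma_mults_per_second(long width, long height, double fps)
{
    assert(width > 0 && height > 0 && fps > 0.0);
    return mults_per_pixel() * (double)width * (double)height * fps;
}
```

For example, `luma_mults_per_second(3840, 2160, 145.0)` gives roughly 29 billion multiplications per second for the luma plane at the measured motion compensation rate.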

*4k = 3840x2160, YUV420.

March 28, 2013

Renderscript on GPU

"How to use Renderscript on Intel® based devices"
Evolution of Renderscript Performance

For now, Renderscript can be executed only on the CPU,
with two exceptions: the Google Nexus 10 and the Samsung Chromebook.
Both are based on the Samsung Exynos 5 SoC, with an ARM Mali-T604 GPU.


March 27, 2013

HEVC motion compensation - 4k performance test.

Just tested performance with a 4k video clip (3840x2160):
  • AMD HD 7750 ~ 52 FPS;
  • Intel i5-3570K single core at 3.4 GHz ~ 12-24 FPS, depending on actual motion vectors;
  • Intel HD Graphics 4000 ~ 10-20 FPS, depending on actual motion vectors.
Note: For now I'm using different OpenCL code for AMD and for Intel devices.
The reason is the huge difference in performance when branches are used.
AMD wants pure calculations, without branches.
Intel can handle branches and thus reduce the amount of calculation.
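
To illustrate the trade-off, here is a minimal sketch of the two styles for one horizontal luma sample. It is written in plain C rather than OpenCL C so it compiles anywhere, and it is not the actual kernel code. The coefficient table is the standard HEVC 8-tap luma interpolation filter; the function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* HEVC 8-tap luma interpolation coefficients for quarter-pel phases 0..3. */
static const int16_t LUMA_TAPS[4][8] = {
    {  0, 0,   0, 64,  0,   0, 0,  0 },
    { -1, 4, -10, 58, 17,  -5, 1,  0 },
    { -1, 4, -11, 40, 40, -11, 4, -1 },
    {  0, 1,  -5, 17, 58, -10, 4, -1 },
};

/* "Calculations only" style: always run all 8 taps, even for phase 0.
   src points at the current sample; src[-3] .. src[4] must be readable.
   The result is scaled by 64 (no rounding or shift applied). */
int filter_branchless(const uint8_t *src, int phase)
{
    assert(phase >= 0 && phase < 4);
    const int16_t *c = LUMA_TAPS[phase];
    int sum = 0;
    for (int i = 0; i < 8; ++i)
        sum += c[i] * src[i - 3];
    return sum;
}

/* "Branchy" style: skip the filter when the motion vector is integer-pel. */
int filter_branchy(const uint8_t *src, int phase)
{
    assert(phase >= 0 && phase < 4);
    if (phase == 0)
        return src[0] * 64;   /* same value as the phase-0 filter row */
    return filter_branchless(src, phase);
}
```

Both functions return the same value for every phase. The branchy version saves work when many vectors are integer-pel, which would also explain why its frame rate depends on the actual motion vectors.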

March 21, 2013

..it is feasible to implement whole HEVC decoder using OpenCL

See updated results here:
http://work.martin.spb.ru/2013/04/detailed-performance-of-hevc-motion.html
------------------------------------------------------------
1) I have changed the OpenCL HEVC inter-prediction:
- shared memory implementation - slow;
- switch/case on filter-type - slow;
- #define usage for constants - good choice.
Finally I have:
~ 60 FPS on Intel's GPU,
~100 FPS on Intel i5 CPU (Intel's compiler),
~250 FPS on AMD HD7750.
Not bad. That means it is feasible to implement a whole Full HD HEVC decoder using OpenCL.
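
One common way to get "#define constants" into an OpenCL kernel is to pass them as `-D name=value` build options, so the compiler sees literals instead of runtime arguments. The helper below is a sketch of that idea in host-side C. The macro names BLOCK_W, BLOCK_H and USE_BRANCHES and the function itself are hypothetical, not taken from the actual encoder.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Format OpenCL build options that turn kernel parameters into
   compile-time constants. Returns the number of characters written,
   or -1 if the buffer is too small. All macro names are hypothetical. */
int make_build_options(char *buf, size_t size,
                       int block_w, int block_h, int use_branches)
{
    assert(buf != NULL && size > 0);
    int n = snprintf(buf, size,
                     "-D BLOCK_W=%d -D BLOCK_H=%d -D USE_BRANCHES=%d",
                     block_w, block_h, use_branches ? 1 : 0);
    if (n < 0 || (size_t)n >= size)
        return -1;   /* formatting error or truncated output */
    return n;
}
```

The resulting string would be passed as the `options` argument of `clBuildProgram`. Inside the kernel, `#if USE_BRANCHES` could then select a code path per device, and BLOCK_W and BLOCK_H would be ordinary constants that the compiler can fold and unroll on.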

2) I have compiled the OpenCL test and run it on a Sabre Lite board.
Result: "Segmentation fault" in clGetDeviceIDs. Will check it further tomorrow.

UPD. It works. I just forgot to execute "modprobe galcore" after the system restart.
Now it's time to check the real performance of Vivante's GPU.