See updated results here:
http://work.martin.spb.ru/2013/04/detailed-performance-of-hevc-motion.html
------------------------------------------------------------
1) I have reworked the OpenCL HEVC inter-prediction:
- shared memory implementation - slow;
- switch/case on filter-type - slow;
- #define usage for constants - good choice.
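As an illustration of the last point, here is a minimal sketch in plain C (not the actual kernel from this post). It assumes the constants in question are things like the filter taps: the standard HEVC 8-tap half-sample luma taps are fixed with #define, so the compiler can fold them instead of branching on a filter type at run time. The names TAP0..TAP7, FILTER_SHIFT and luma_half_pel are hypothetical, and the function covers only a single 8-bit horizontal pass.

```c
#include <stdint.h>

/* HEVC 8-tap luma taps for the half-sample position (they sum to 64). */
#define TAP0 (-1)
#define TAP1 4
#define TAP2 (-11)
#define TAP3 40
#define TAP4 40
#define TAP5 (-11)
#define TAP6 4
#define TAP7 (-1)
#define FILTER_SHIFT 6 /* log2(64) */

/* Hypothetical helper: one half-sample value between src[0] and src[1].
 * Assumes 8-bit samples and that src[-3]..src[4] are readable. */
static uint8_t luma_half_pel(const uint8_t *src)
{
    int sum = TAP0 * src[-3] + TAP1 * src[-2] + TAP2 * src[-1] + TAP3 * src[0]
            + TAP4 * src[1]  + TAP5 * src[2]  + TAP6 * src[3]  + TAP7 * src[4];
    int v = sum + (1 << (FILTER_SHIFT - 1)); /* round to nearest */

    if (v < 0)
        return 0;                            /* clip low before shifting */
    v >>= FILTER_SHIFT;
    return (uint8_t)(v > 255 ? 255 : v);     /* clip high to 8 bits */
}
```

In an OpenCL build the same macros can also be supplied from the host through the `-D name=value` options passed to clBuildProgram, which keeps one kernel source usable for several fixed configurations.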
The final results:
~ 60 FPS on Intel's GPU,
~100 FPS on an Intel i5 CPU (Intel's compiler),
~250 FPS on an AMD HD7750.
Not bad. That means it is feasible to implement a whole Full HD HEVC decoder using OpenCL.
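To put these numbers in perspective, here is a small sketch that converts a frame rate into time per frame. The 30 FPS real-time target in the usage note is an assumption (the post does not state one), and the figures above appear to cover inter-prediction only, so the rest of the decoder has to fit into whatever budget remains. The function names are hypothetical.

```c
/* Milliseconds spent per frame at a given frame rate. */
static double frame_time_ms(double fps)
{
    return 1000.0 / fps;
}

/* Fraction of the per-frame budget consumed at an assumed target rate. */
static double budget_used(double measured_fps, double target_fps)
{
    return target_fps / measured_fps;
}
```

For example, frame_time_ms(250.0) is 4 ms, and budget_used(60.0, 30.0) is 0.5, so at an assumed 30 FPS target the slowest device above would spend half of each frame's time on this stage.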
2) I have compiled the OpenCL test and run it on the Sabre Lite board.
Result: "Segmentation fault" in clGetDeviceIDs. I will check it further tomorrow.
UPD. It works. I just forgot to run "modprobe galcore" after the system restart.
Now it's time to check the real performance of Vivante's GPU.
Hi, I am interested in your OpenCL implementation of the HEVC decoder. But I am not clear what you mean by "it is feasible to implement whole HEVC decoder using OpenCL". Does it include intra?
Currently we already have real-time HEVC solutions for PC and for ARM.
Check our website for details:
http://www.vanguardvideo.com/h265.php
We use SSE and AVX optimizations (NEON for ARM). That is enough for real-time processing on PC and ARM. But for better code portability and scalability, and to use all available hardware, we started developing an OpenCL-based optimization.
After implementing inter-prediction, the heaviest process in HEVC (it takes about 60% of decoding time), using OpenCL, I decided that all the other parts can be implemented as well.
Only CABAC cannot be implemented effectively in OpenCL, because it is a sequential process and thus not suitable for a GPU.
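A rough way to reason about this split is Amdahl's law. The sketch below is an illustration, not a measurement: only the 60% share comes from the text above, while the speedup factor passed in is hypothetical, as is the function name.

```c
/* Amdahl's law: overall speedup when a fraction p of the run time
 * is accelerated by a factor s and the rest is left unchanged. */
static double amdahl_speedup(double p, double s)
{
    return 1.0 / ((1.0 - p) + p / s);
}
```

With p = 0.6, even an infinitely fast inter-prediction gives at most 1 / 0.4 = 2.5x overall, which is why the remaining stages matter and why a sequential part such as CABAC sets the floor.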
We already have an optimized CABAC for the CPU, and our next steps are:
- intra-prediction;
- transform/quantization;
- deblocking;
- SAO;
We already have all these modules implemented in SIMD style, so it's easy to prepare an OpenCL version.
The main question was only the real performance of various types of GPU and the memory bandwidth.
Hey, thanks for sharing this information.
However, according to your updated results link
http://work.martin.spb.ru/2013/04/detailed-performance-of-hevc-motion.html
the implementation applies 8x8 luma blocks, while the smallest block size in HEVC is 4x8 or 8x4,
so how do you solve this problem? Or is this a restriction of your decoder?
BTW, since you use the Intel i5-3570K CPU, have you employed the zero-copy feature of this chip to remove the memory transfer overhead between the CPU and GPU?