MulticoreWare is a software solutions and services company that specialises in optimising computer vision, sensor data processing, and AI applications on a variety of low-power, embedded, and heterogeneous systems.
Imagination collaborated with MulticoreWare to showcase accelerated stereo block matching (stereo BM) performance on the UNISOC-T710 development platform. Drawing on its OpenCL expertise, MulticoreWare re-implemented the stereo BM algorithm to improve compute resource utilisation and optimise memory usage, delivering a performance gain of more than 50x on Imagination GPUs.
Commenting on the announcement, Vish Rajalingam, Vice President & Co-GM, Autonomous Vehicle & Automotive BU, MulticoreWare, said, “Power-efficient GPUs are now essential for all computer vision, artificial intelligence, and sensing applications. We are excited to be partnering with Imagination to enable their customers in implementing algorithm optimisations and software accelerations on Imagination’s PowerVR GPUs, with a planned roadmap to include RISC-V software acceleration in future.”
Gilberto Rodriguez, Director of Product Management, Imagination, added, “Imagination’s GPUs can be used to deploy computer vision tasks as well as machine learning acceleration easily and efficiently on edge devices. MulticoreWare is utilising our IP to its actual potential for general-purpose GPU applications. By working together, we can offer our customers a truly optimised PowerVR deployment experience.”
The StereoBM algorithm was chosen for optimisation based on customer interest. MulticoreWare analysed its performance on the CPU to identify bottlenecks. The goal was to maximise GPU parallelism, which was achieved through efficient use of internal registers and a global workgroup size that adapts to the image resolution.
The memory layout of Imagination’s GM9446 GPU was used to calculate the adaptive global workgroup size. In addition to the computational optimisation, the algorithm parameters were tuned to obtain greater accuracy, alongside far higher performance on the GPU than on the CPU of the same platform.
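As a rough illustration of what a resolution-adaptive global workgroup size can look like, the sketch below rounds each image dimension up to a multiple of a local workgroup size, a common OpenCL host-side pattern. It is not MulticoreWare’s actual method, which was derived from the GM9446 memory layout and is not described here. The function names and the 16x8 local size are hypothetical values chosen only for the example.

```python
def round_up(value: int, multiple: int) -> int:
    """Round value up to the nearest multiple of `multiple`."""
    return ((value + multiple - 1) // multiple) * multiple


def adaptive_global_size(width: int, height: int, local_size=(16, 8)):
    """Return a 2D global work size that covers the whole image.

    local_size is a hypothetical workgroup shape; a real value would be
    chosen from the target GPU's characteristics.
    """
    local_x, local_y = local_size
    return (round_up(width, local_x), round_up(height, local_y))


# The global size follows the image resolution.
print(adaptive_global_size(640, 480))    # (640, 480)
print(adaptive_global_size(1281, 721))   # (1296, 728)
```

When the global size is padded beyond the image, the kernel has to skip work-items whose coordinates fall outside the image bounds.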
In one configuration, the CPU implementation took 54.25 ms, whereas MulticoreWare’s GPU implementation took 0.78 ms, a ~70x gain in performance.
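The quoted gain can be checked directly from the two timings. The variable names in this short sketch are illustrative; the figures are the ones reported above.

```python
cpu_ms = 54.25  # reported CPU time for one configuration
gpu_ms = 0.78   # reported time for the optimised implementation

# Speedup is the ratio of the baseline time to the optimised time.
speedup = cpu_ms / gpu_ms
print(f"Speedup: {speedup:.1f}x")  # Speedup: 69.6x
```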