Low-latency FPGA solutions

DRUM: a low-latency FPGA compiler

DRUM quickly translates high-level image processing code into efficient hardware designs that can run on FPGA devices or be used to create energy-efficient ASICs. It compiles your algorithms into high-throughput, low-latency implementations. The live demo setup in the video above uses a GoPro Hero 4 camera running at 720p and 60 Hz. HDMI input from the camera is processed on the FPGA by a netlist compiled using DRUM on the laptop on the left. The FPGA used in this demo is a 45 nm Xilinx Spartan-6, which outputs directly to the monitor.


Low Latency

DRUM has superior out-of-the-box low-latency performance for a large set of standard computer vision and image processing operations, showing an 86–95% reduction in latency on a whole host of standard algorithms when compared to existing FPGA middleware and GPU-based solutions. The video shows a selection of algorithms implemented using DRUM's high-level language, HLL. This language provides the unique ability to expose and then vary key algorithm parameters via a web-based interface. Tuning key algorithm parameters live on the board avoids time-consuming repeated compilation. Furthermore, the flexibility of our language and short compile times greatly improve iteration times for algorithm development on FPGA boards.


Efficient netlists

The efficiency of the netlists produced by DRUM means that our output can run on smaller, cheaper boards. All the statistics shown below, except for optical flow, were achieved on the 45 nm Xilinx Spartan-6 in the video. All exhibit power consumption of around 1000 picojoules per pixel and a frame latency in the hundreds of microseconds.
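To put the energy-per-pixel figure in context, here is a minimal sketch converting it into average power. It assumes the 1650×750 frame at 60 fps described in the frame-latency note at the end of this article; the variable names are illustrative, not Myrtle's published methodology.

```python
# Convert the quoted ~1000 pJ/pixel figure into average power.
# Assumes the 1650 x 750 @ 60 fps pixel rate from the frame-latency note.
energy_per_pixel_j = 1000e-12           # ~1000 picojoules per pixel
pixel_rate = 1650 * 750 * 60            # pixels per second (assumed)

average_power_w = energy_per_pixel_j * pixel_rate
print(f"average power ≈ {average_power_w * 1e3:.0f} mW")  # ≈ 74 mW
```

At this pixel rate, 1000 pJ/pixel corresponds to roughly 74 mW of average power, which illustrates why such designs fit on small, cheap boards.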



DRUM has many unique features. Firstly, it is developed entirely in a strongly typed, purely functional language. This strong typing minimizes errors in the output hardware designs and means that critical issues are identified earlier and are resolvable in software, which ensures smoother standards testing. We are very proud of our fast compilation times, which reduce development time: all the examples in the video compile in less than 2.5 seconds. Lastly, DRUM contains unique mathematical optimization passes, which are the key to its producing out-of-the-box performance comparable to hand-coded HDL.

Myrtle is currently using DRUM to realize Deep Learning algorithms as efficient hardware designs within Phase I of the UK government’s Autonomous Vehicles program.

You can download the DRUM slide deck here.


The term Frame Latency in this article means the time difference between the input image entering the pipeline and the output image leaving it: specifically, the difference between the pixel at coordinate (x, y) appearing on the input signal and the pixel at coordinate (x, y) appearing on the output signal. The image we're processing is 1650×750 pixels at 60 frames per second (streamed from a consumer camera). Taking the Integral Image statistic as an example: the pixel clock for this image has a period of approximately 13 ns, and since the pipeline for integral image is only one stage deep, the output value for pixel (x, y) is only one cycle offset from the input for the pixel at the same coordinate, giving a frame latency of 13 ns. Full power and other calculations are worked through in the slide deck linked above.
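The pixel-clock and frame-latency arithmetic above can be sketched as follows. This is an illustrative reconstruction of the calculation, assuming the frame dimensions include blanking as stated; `pipeline_depth` is a hypothetical parameter standing in for the one-stage integral image pipeline.

```python
# Reconstruct the frame-latency arithmetic for the integral image example.
frame_width = 1650   # pixels per line (including blanking)
frame_height = 750   # lines per frame (including blanking)
fps = 60             # frames per second

pixel_rate = frame_width * frame_height * fps   # pixels per second
pixel_period_ns = 1e9 / pixel_rate              # ns per pixel-clock cycle

# A one-stage pipeline delays each pixel by a single clock cycle,
# so the frame latency equals one pixel-clock period.
pipeline_depth = 1
frame_latency_ns = pipeline_depth * pixel_period_ns

print(f"pixel clock period ≈ {pixel_period_ns:.2f} ns")  # ≈ 13.47 ns
print(f"frame latency ≈ {frame_latency_ns:.2f} ns")
```

A deeper pipeline would simply multiply the pixel-clock period by its stage count, which is why even pipelines hundreds of stages deep stay in the microsecond range at this clock rate.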