So I did this project a looooong time ago (actually around this time last year…) where my FPGA was acting as a video controller for the STM32F0. I’ve recently improved my VHDL skills (or I’d like to think I have, given my final year project) so I thought I’d give this another shot. I completely rewrote the SDRAM controller along with all the associated gubbins to make sure it works properly. For a quick spec sheet:
- 96MHz SDRAM clock speed with 16bit writes and 64bit reads (4x 16bit burst)
- 1024 word “pixel in” FIFO (STM32 -> UART -> SDRAM)
- 1024 word “pixel out” FIFO (SDRAM -> VGA Generator)
- Stable at 640×480, can run at 800×600 though the input pixel FIFO fills pretty quickly
- 16bit colour
- 32Mbit/s maximum UART rate (UART handler runs at 96MHz with 3x bit samples/bit period giving a maximum UART rate of 96MHz/3 = 32Mbit/s)
High speed UART
Obviously achieving UART at 32Mbit/s from a 48MHz STM32F0 is a bit magic, right? Correct! The UART modules themselves can only run at a maximum of 1/8 of the clock frequency, meaning that for a 48MHz clock the maximum bit rate is 6Mbit/s. Pretty shocking, huh? For reference, a full 640×480 screen of 16bit pixels is 614.4kB. Every pixel write requires 5x 8bit UART transactions (3x address, 2x pixel), giving a full screen fill time at 6Mbit/s of 2.048s (640*480*8*5/6M = 2.048s, counting only the 8 data bits of each byte and ignoring the start and stop bits). That works out to about 0.5fps! Taking this clk/8 limitation into account, I had a go at overclocking the STM32. Using the HSI/2 clock and a PLL multiplier of 16, the STM32F0 can achieve 64MHz. Surprisingly, it doesn’t even get warm at this rate and seems to be stable. At this clock rate, a UART bitrate of 8Mbit/s can be achieved, not that much faster than the 6Mbit/s but still better. This translates to 1.536s/frame or 0.65fps. Still pretty shocking.
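The fill-time arithmetic above can be sketched in a few lines of C. This is only an illustration: the function names are made up for this post, and `bits_per_byte` is a parameter so you can use 8 to reproduce the figures in the text or 10 to include the UART start and stop bits.

```c
#include <stdint.h>

/* Bits on the wire for one full-screen fill, assuming 5 bytes per pixel
 * (3 address + 2 pixel). Pass bits_per_byte = 8 to match the figures in
 * the text, or 10 to account for the start and stop bits as well. */
static uint32_t frame_bits(uint32_t width, uint32_t height, uint32_t bits_per_byte)
{
    return width * height * 5u * bits_per_byte;
}

/* Full-screen fill time in milliseconds at the given bit rate (bits/s). */
static uint32_t frame_time_ms(uint32_t bits, uint32_t bit_rate)
{
    return (uint32_t)(((uint64_t)bits * 1000u) / bit_rate);
}
```

For example, `frame_time_ms(frame_bits(640, 480, 8), 6000000)` gives 2048ms and the same frame at 8Mbit/s gives 1536ms. With 10 bits per byte the 6Mbit/s figure grows to 2560ms.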
It then hit me that the SPI modules in the STM32 can output data at up to system clock/2 as opposed to the system clock/8 limitation of the UART. I then set about trying to rig the SPI module as a UART. This isn’t particularly advisable and nowhere on the internet could I see whether it was even possible! It is however possible with the STM32s as they feature variable bit depth SPI transfers – win! A UART transaction is essentially an SPI transaction where, instead of taking both the clock and MOSI, you just use MOSI as the UART data stream. As a UART frame consists of a start bit (0), the data (8 bits) and a stop bit (1), configuring the STM32 SPI as a 10bit device can accommodate the extra bits. It is also worth noting that UART is LSB (least significant bit) first, though the SPI modules allow for this. There was an issue with the output polarity though. Standard UART is idle high, and the MOSI output is a little weird and sticks at the last sent bit. The sticking itself isn’t actually an issue as the last bit sent is always the stop bit, which always has the same value. To solve the output polarity issue, I pre-invert the data before sending and invert the entire UART stream inside the FPGA. This took a bit of trial and error to get everything working properly but once working it was fine!
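As a rough sketch of the framing described above (not the actual firmware, which isn’t posted yet), building the 10bit SPI word could look something like this. The function names are hypothetical, and it assumes the SPI is set to 10bit, LSB-first transfers so that bit 0 of the word is the first bit on the wire.

```c
#include <stdint.h>

/* Build one UART frame as a 10-bit word, assuming LSB-first SPI:
 * bit 0     = start bit (0), sent first
 * bits 1..8 = data byte, least significant bit first
 * bit 9     = stop bit (1), sent last */
static uint16_t uart_frame(uint8_t data)
{
    return (uint16_t)(((uint16_t)data << 1) | (1u << 9));
}

/* Pre-invert the frame so that the FPGA, which inverts the whole
 * incoming stream, sees a normal idle-high UART signal. */
static uint16_t spi_word_for_byte(uint8_t data)
{
    return (uint16_t)(~uart_frame(data) & 0x3FFu);
}
```

With this layout the last bit out of the SPI is the inverted stop bit (0), so the line sticks low between transfers and reads as idle high after the FPGA’s inversion.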
This was the trick I used (sounds like the beginning of a clickbait ad…) to achieve a 32Mbit/s data rate. Running the SPI at system clock/2 and overclocking to 64MHz allowed me to achieve this. For reference, a full screen fill takes 0.384s (again counting only the 8 data bits per byte), equivalent to up to 2.6fps (WOW!). Not quite 60fps elite game level but for an overclocked Cortex M0 device, I wouldn’t call it that bad!
To write a pixel to the SDRAM, 5x UART transactions are required. For a 640×480 pixel display, the maximum memory address that will need to be accessed is that of the 640x480th pixel. This will be address 307199 (640*480-1), which after a bit of log2’ing gives a bit width of 19, i.e. 19 bits are required to represent 307199 in binary form. As the UART module can only take 8bit chunks (I could probably wrangle this to variable but that’s a chore…), this needs to be rounded up to 24bits (8*3 = 24, 8*2 = 16<19), hence the 3x address byte transfers. The FPGA stores pixels in RGB565 format which translates to 16bit pixels. 16bit = 2x 8bit transfers and thus we have the 5x byte transfers required to write a pixel! I’d like to have a go at implementing an intelligent UART handler, like the ones used for LCDs where you can set a start and end address and stream pixels into it. This would speed up screen writes massively and could actually increase the frame rate to 6.5fps for full screen burst writes (640*480*16 bits at 32Mbit/s).
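To make the 5 byte layout concrete, here is a minimal sketch of packing one pixel write on the STM32 side. The helper names are made up for illustration, and the byte order (address first, most significant byte first) is an assumption on my part; the text only fixes that the pixel ends up in the lower 16 bits of the 40bit word.

```c
#include <stdint.h>

/* Linear pixel address for (x, y) on a display of the given width. */
static uint32_t pixel_address(uint32_t x, uint32_t y, uint32_t width)
{
    return y * width + x;
}

/* Pack 8-bit R, G, B into a 16-bit RGB565 pixel. */
static uint16_t rgb565(uint8_t r, uint8_t g, uint8_t b)
{
    return (uint16_t)(((r >> 3) << 11) | ((g >> 2) << 5) | (b >> 3));
}

/* Pack one pixel write into 5 bytes: 3 address bytes then 2 pixel bytes.
 * Assumed order: most significant byte first. */
static void pack_pixel_write(uint32_t addr, uint16_t pixel, uint8_t out[5])
{
    out[0] = (uint8_t)(addr >> 16);
    out[1] = (uint8_t)(addr >> 8);
    out[2] = (uint8_t)addr;
    out[3] = (uint8_t)(pixel >> 8);
    out[4] = (uint8_t)pixel;
}
```

The last pixel of a 640×480 frame, `pixel_address(639, 479, 640)`, is 307199 (0x4AFFF), which needs 19 bits and therefore all three address bytes.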
FPGA UART Handler
For the UART handler, I used the same style of controller that I used for my final year project. It’s quite funny that I can’t upload the code for this until after the deadline at risk of plagiarism… to myself…
Anyway, the UART module itself merely receives a UART stream and, if the start and stop bits are present, sets the data on its output and pulses a ready signal high for a clock cycle. The UART handler therefore has to wait for these pulses and, upon receiving a pulse, latch the received data. After 5 pieces of received data, the 5 bytes are combined into a 40bit word (5*8 bits) and written to the output FIFO. Obviously, you can see here that losing sync will cause catastrophic failure and probably memory writes to incorrect locations – bad design, but as long as I hold the STM32 in reset while the FPGA is powering up, sync is fine and I’ve not had an issue otherwise yet.
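The real handler is VHDL, but its behaviour can be modelled in C as a small sketch. The type and function names here are invented, and it assumes the first byte received lands in the top bits of the 40bit word.

```c
#include <stdint.h>
#include <stdbool.h>

/* Software model of the 5-byte collector (hypothetical names). */
typedef struct {
    uint64_t shift; /* bytes received so far, most recent in the low 8 bits */
    uint8_t  count; /* bytes collected for the current word */
} word_assembler_t;

/* Feed one received byte (one "ready" pulse). Returns true and writes
 * the 40-bit word to *word once the fifth byte has arrived. */
static bool assembler_push(word_assembler_t *a, uint8_t byte, uint64_t *word)
{
    a->shift = ((a->shift << 8) | byte) & 0xFFFFFFFFFFull; /* keep 40 bits */
    if (++a->count < 5) {
        return false;
    }
    *word = a->shift; /* this is what would be pushed into the FIFO */
    a->count = 0;
    a->shift = 0;
    return true;
}
```

The model also shows the sync problem: there is nothing marking the first byte of a word, so a single dropped byte shifts every following word by one position.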
The FIFO handler is the module that does all of the communication with the SDRAM controller. There are two FIFOs connected to the FIFO handler: the FIFO from the UART handler and the FIFO to the VGA controller. When the VGA FIFO is less than half full, the FIFO handler requests pixels from the SDRAM controller. These pixels are then used to refill the VGA FIFO. This ensures the VGA generator will always have a ready stream of pixels to display. This operation is also prioritised over write operations as reads are vital to a solid display.
If the VGA FIFO is more than half full, the FIFO handler can then look to see if the input FIFO is empty. If this FIFO isn’t empty, the FIFO handler grabs the 40bit word, decodes it and has the SDRAM controller write the pixel to memory. For reference, the lower 16 bits are the pixel and the rest of the bits are the memory address.
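The priority scheme and the word decode described above can be sketched as a small C model. Again this is only an illustration with invented names; the actual logic lives in the FPGA, and the 1024 word depth is taken from the spec sheet at the top.

```c
#include <stdint.h>

#define FIFO_DEPTH 1024u /* both FIFOs are 1024 words deep */

typedef enum { FH_IDLE, FH_READ_BURST, FH_WRITE_PIXEL } fh_action_t;

/* Decide what the FIFO handler does next. Refilling the VGA FIFO always
 * wins over servicing writes from the input FIFO. */
static fh_action_t fifo_handler_next(uint32_t vga_fifo_level, uint32_t in_fifo_level)
{
    if (vga_fifo_level < FIFO_DEPTH / 2u) {
        return FH_READ_BURST;   /* request a 4x16bit burst from SDRAM */
    }
    if (in_fifo_level > 0u) {
        return FH_WRITE_PIXEL;  /* pop one 40-bit word and write it */
    }
    return FH_IDLE;
}

/* Split a 40-bit word: lower 16 bits are the pixel, upper 24 the address. */
static void decode_word(uint64_t word, uint32_t *addr, uint16_t *pixel)
{
    *pixel = (uint16_t)(word & 0xFFFFu);
    *addr  = (uint32_t)((word >> 16) & 0xFFFFFFu);
}
```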
For a good throughput, the SDRAM is configured for single writes and burst reads. This is because reads are so much more important than writes and I still want the ability to write single 16bit pixels. By allowing for burst reads, I can get way more memory throughput on the read side than reading single 16bit values at a time. It is worth noting at this point too that the SDRAM controller is super basic and closes its row after every transfer – there is no nice row management for fast data streaming. Regardless of its basic implementation, it can synthesize pretty well and works fine. Upon a read, the 4x16bit transfers are pushed into the VGA FIFO and the FIFO handler returns to idle.
Upon the start of a new frame (indicated by the last visible pixel of the VGA frame), a new frame signal is sent to the FIFO handler. This signal is used to reset the internal pixel counter as well as flushing the FIFO. While flushing the FIFO may seem excessive, not flushing the FIFO seems to cause weird visual issues at higher frequencies.
Weird phasey issues where the last line is displayed first… Still need to figure this out!
The SDRAM controller is really basic as stated previously. This was mainly for coding simplicity so maybe one day I’ll write a proper one. For now however, this one seems to work fine. I can get pixels out at up to around 150MHz though after that it gets a bit hairy, even though the SDRAM is rated for 200MHz at CL3. I’m running it at 96MHz at CL2.
Now for the best part! Using the industry standard 640×480 timings, I’m able to achieve a stable image and unlimited writes at 32Mbit/s without overflowing the input FIFO (the FPGA outputs a signal when this FIFO is nearly full). Upping that to 800×600 is achievable but the SDRAM controller is under a relatively large amount of stress as the input FIFO fills up pretty quickly, even at 6Mbit/s rates. Running at 1024×768 is completely unachievable and ends up in horribly skewed images.
Running at 640×480, 800×600 and the failed 1024×768
For the rest of testing, I just kept it at 640×480.
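A rough way to see why the higher resolutions struggle is to work out how many SDRAM clock cycles are available, on average, for each 4 pixel read burst. The sketch below is back-of-envelope only: the function name is made up and the 60Hz refresh rate used in the example numbers is my assumption, not a measured figure.

```c
#include <stdint.h>

/* Average SDRAM clock cycles available per 4-pixel read burst.
 * Only visible pixels are counted, since the VGA FIFO smooths out
 * the blanking intervals. */
static double cycles_per_burst(double sdram_hz, uint32_t width,
                               uint32_t height, double refresh_hz)
{
    double pixels_per_second = (double)width * (double)height * refresh_hz;
    return sdram_hz * 4.0 / pixels_per_second;
}
```

At 96MHz and an assumed 60Hz this gives roughly 20.8 cycles per burst at 640×480, 13.3 at 800×600 and 8.1 at 1024×768. Every burst has to fit its row activate, CAS latency, four data beats and precharge inside that budget, and whatever is left over is all that remains for pixel writes and refresh.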
On the STM32F0 side, I obviously wanted to be able to blast a picture to the screen! The easiest way of course was to take a picture, bit reduce it to the max and display it – for reference, a 640×480 picture with a measly 4bits/pixel still consumes 153.6kB! Wayyyy more than the 64kB available on the STM32F051. For really early tests, I was storing a 320x240x4bpp image (38.4kB) but it looked horrendous! 4bits per pixel really isn’t a format that should have ever existed though we allll remember 16 color (ew).
It was at this point that I needed to find a method of compressing the images and at least being able to decompress them on the other side. I had a brief moment of considering having a go at writing my own algorithm and instantly thought “nah”. So why not scour the internet for the world’s most common compressed format – JPEG? After a bit of searching (i.e. one Google page), I found a few JPEG decoders but they were all pretty intensive on resources with regards to running them on the STM32. I then stumbled across this godsend of a library written by the genius themselves, ChaN – the creator of the FatFs library used by nearly all microcontrollers. I don’t know who this person is but damn, if I met them and they drank, I’d buy them a brewery!
Their library is TJpgDec, the tiny JPEG decompressor, written for embedded systems with minimum consumption of resources – for reference, 3kB RAM and 3-8kB of ROM, absolutely nothing! After a very minimal port to the STM32F0, I stored the image as an array in flash (converted from a file using this website), rewrote the ‘infunc’ to work for streaming flash data, and was able to decompress and display the image! It’s really good that the library offers an RGB565 output format as I can pretty much send this straight to the FPGA without needing to bit shift or anything. I do however need to swap the B and R channels, meaning I’m probably wrong somewhere in my implementation… It was with this library that I was able to display the fox JPEG image, found here. I find it pretty funny that finding that picture was as simple as googling “picture” and selecting the first result! It’s a pretty nice picture to be honest.
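For the curious, the flash streaming and the channel swap could be sketched like this. To keep it self-contained it does not include the library header: `flash_stream_t` and both function names are my own inventions, and the adapter to the library’s actual input callback signature is left out. The one behaviour it mirrors, as far as I understand TJpgDec’s documentation, is that a NULL buffer means “skip these bytes” rather than “read them”.

```c
#include <stdint.h>
#include <stddef.h>
#include <string.h>

/* A read cursor over an image stored as a const array in flash. */
typedef struct {
    const uint8_t *data;
    size_t size;
    size_t pos;
} flash_stream_t;

/* Copy up to nbyte bytes into buff and advance the cursor. If buff is
 * NULL the bytes are skipped instead. Returns the number of bytes
 * actually consumed (0 at end of data). */
static size_t flash_stream_read(flash_stream_t *s, uint8_t *buff, size_t nbyte)
{
    size_t remaining = s->size - s->pos;
    if (nbyte > remaining) {
        nbyte = remaining;
    }
    if (buff != NULL) {
        memcpy(buff, s->data + s->pos, nbyte);
    }
    s->pos += nbyte;
    return nbyte;
}

/* Swap the red and blue fields of an RGB565 pixel, leaving green alone. */
static uint16_t rgb565_swap_rb(uint16_t p)
{
    return (uint16_t)((p << 11) | (p & 0x07E0u) | (p >> 11));
}
```

Each decoded pixel would go through `rgb565_swap_rb` before being packed into a pixel write.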
The Mandelbrot set is every graphics guy’s favourite thing to plot (I lie, I don’t have a clue what the all time favourite would be, maybe that Lena pic?). It’s generally pretty simple to implement and has been around for years. I found a floating point implementation really easily and just had to switch a few variables around and it was plotting fine. It was however slow. The STM32F0 doesn’t feature an FPU, therefore all floating point operations are done in software with bitshifts and other such bit based magic. From this, I decided to implement a fixed point version. There seem to be loads of these floating (waheyy) around but converting it from scratch wasn’t too hard and seems to work fine. Mine is based on a 36.28 fixed point format using the int64_t type, which I learnt today that GCC supports! These weird bit widths were chosen as they allow stable plots at zoomed out levels along with maintaining quality when zooming in, something that caused my 18.14 implementation to die. Floating point is still used to calculate the initial start and end of the frame to plot but all the intensive sections are fixed point.
The zoomed in Mandelbrot set! The part at the top is because I took a photo while it was zooming and refreshing the frame…
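A minimal sketch of what a 36.28 fixed point escape-time loop might look like is below. This is not my actual code (which will follow once it’s posted); the `fx_` names are made up, and it divides by the scale factor rather than shifting so that negative products behave the same on any compiler.

```c
#include <stdint.h>

#define FX_FRAC_BITS 28
#define FX_ONE ((int64_t)1 << FX_FRAC_BITS)

typedef int64_t fx_t; /* 36.28 fixed point */

/* Floating point is only used to set up the frame, not in the loop. */
static fx_t fx_from_double(double v)
{
    return (fx_t)(v * (double)FX_ONE);
}

/* Fixed point multiply. The raw 64-bit product only fits while both
 * operands are below roughly 11 in magnitude, which holds here because
 * the loop bails out as soon as |z|^2 exceeds 4 (for |c| up to about 4). */
static fx_t fx_mul(fx_t a, fx_t b)
{
    return (a * b) / FX_ONE;
}

/* Iterate z = z^2 + c and return how many iterations ran before escape. */
static uint32_t mandelbrot_iterations(fx_t cx, fx_t cy, uint32_t max_iter)
{
    fx_t x = 0, y = 0;
    uint32_t i;
    for (i = 0; i < max_iter; i++) {
        fx_t xx = fx_mul(x, x);
        fx_t yy = fx_mul(y, y);
        if (xx + yy > 4 * FX_ONE) {
            break; /* escaped */
        }
        fx_t xy = fx_mul(x, y);
        x = xx - yy + cx;
        y = 2 * xy + cy;
    }
    return i;
}
```

The returned count is what gets mapped to a colour; points that reach `max_iter` are treated as inside the set.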
And so concludes my FPGA + SDRAM + VGA + STM32F0 + UART + Electrons…… post. The code for this isn’t amazing but I’ll be posting it hopefully after my thesis submission.