Enter the uHMCU…

It’s been a while since I’ve sat down to do some proper VHDL but today (after MASSIVE advances with the Phobass these past few days), I’ve decided to write a miniature version of the HMCU. I say miniature in the lightest sense.


  • MISC architecture featuring 9 Instructions
  • 16bit processing capability (22bit program word length)
  • Memory mapped IO
  • ~5 cycles/instruction (inefficient, I know! No pipelining…)
  • 16 level hardware stack
  • 4x 16bit hardware registers
  • 16bit program counter (not manipulatable as a register)
  • Currently features two peripheral timers with PWM capability
  • A hacky assembler written in C++ with defines and multiple subroutine capability (no variable declarations as of yet…).
  • Compiled text file reader IO for ModelSim
  • CPU + Testbench + RAM + ROM + 2x Timers (with PWM) = 712 lines of VHDL!
  • All peripherals interfaced through a really simple bus interface: addr, dataOut, dataIn, nwr, ackO and ackI. Master sets data out, read/write bit, address and ackO. Once ackI goes high from the slave, master releases ackO and transaction is complete. Memory mapped peripherals respond when the address presented is within their range.
  • Both RAM and ROM interfaces can stall the CPU, meaning if the CPU is interfaced to slow memory OR memory that requires processing before the value is present (e.g. using an SPI flash/sram IC), the CPU will happily wait for the data to be present. Peripherals not requiring constant attention will automatically continue their process e.g. PWM. This is achieved by the slave holding its ackO (the master’s ackI) low until it is ready to send data.
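
To make the handshake a bit more concrete, here is a rough C++ model of one bus transaction against a slow slave. It is only a sketch under assumptions: the `SlowRam` type, its `latency` parameter, the 16-word memory and the reading of `nwr` as an active-low write line are all invented for illustration and are not taken from the VHDL.

```cpp
#include <array>
#include <cstdint>

// Signals driven by the master (names follow the bus description above).
struct MasterOut {
    uint16_t addr = 0;
    uint16_t dataOut = 0;
    bool nwr = true;    // assumption: high = read, low = write
    bool ackO = false;  // master's request strobe
};

// Hypothetical slow RAM that needs `latency` ticks before it acknowledges.
struct SlowRam {
    std::array<uint16_t, 16> mem{};
    int latency;
    int waited = 0;
    uint16_t dataIn = 0;  // data returned to the master
    bool ackI = false;    // slave acknowledge, seen by the master as ackI

    explicit SlowRam(int latency_) : latency(latency_) {}

    void tick(const MasterOut& m) {
        if (!m.ackO) { ackI = false; waited = 0; dataIn = 0; return; }
        if (ackI) return;                            // wait for the master to release
        if (waited < latency) { ++waited; return; }  // ack stays low: master stalls
        if (m.nwr) dataIn = mem[m.addr % mem.size()];
        else       mem[m.addr % mem.size()] = m.dataOut;
        ackI = true;
    }
};

// Runs one transaction and returns how many ticks the master had to wait.
int transact(SlowRam& ram, uint16_t addr, bool write, uint16_t wdata, uint16_t& rdata) {
    MasterOut m;
    m.addr = addr; m.dataOut = wdata; m.nwr = !write; m.ackO = true;
    int ticks = 0;
    while (!ram.ackI) { ram.tick(m); ++ticks; }
    rdata = ram.dataIn;
    m.ackO = false;  // master releases ackO: transaction complete
    ram.tick(m);     // slave sees the release and drops its acknowledge
    return ticks;
}
```

With `latency` set to 0 the transaction completes in a single tick; anything larger simply stretches the wait, which is the stalling behaviour described in the list.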

Instruction set:

  • NOP – No operation
  • MTR – Memory to register, transfer a 16bit variable from a memory location to a register
  • RTM – Register to memory, transfer a 16bit variable from a register to a memory location
  • JMP – Immediate jump to location (sets PC directly)
  • JSZ – Immediate Jump placing the current PC+1 on the PC stack
  • JNZ – Jump if register is zero, not placing the current PC on the stack
  • PSP – Pop a value from the PC stack and store in the PC
  • STR – Store literal in register
  • ALG – Arithmetic and logic instruction, can perform increment and decrement of single registers, addition, subtraction, and’ing, or’ing, nand’ing and xor’ing of two registers

I’m yet to test the microcontroller on my FPGA, though I’m hoping to do this over the next couple of days. I’m also writing a few more peripherals: I’ve implemented an SPI master module quite a few times in the past so this will most likely be next on my list, after GPIO of course. Currently, the peripheral arbitration is done within the peripherals, which I feel isn’t particularly efficient as every peripheral has to contain some form of address comparison unit, meaning lots of LUTs! I’ve experimented with using my bus in a ring network fashion where each peripheral contains a mux between the input data vector and its own output data vector (where the muxing is decided by whether that slave has its ackI high from the master). The problem I see here, however, is high propagation delay: the round trip of data from the master through a slave and back to the master passes through all of the slaves, limiting the maximum speed.

My second approach was to have one dedicated bus arbitrator which the master communicates with directly, and which then muxes between all of the slaves and handles the acknowledges. This seems a relatively efficient method, though writing the mux is always a bit of a chore and ends up with loads of really long data vectors.

The final approach of course is my current one, where every peripheral has its own mini-mux which decodes the address; however, the ackO’s from the slaves to the master are all or’d together, along with all the data. When a slave doesn’t receive an address match, it ensures its data output is set to zero so as not to interfere with any data being transmitted from another slave. While each slave requires its own mini address mux, this removes the requirement of muxing the data streams into the CPU input.
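
The or’d bus idea can be sketched in a few lines of C++. This is a software model only, with made-up names (`Slave`, `readBus`, `BusResult`) and a fixed `value` standing in for whatever a real peripheral would return; the address ranges in the usage note are borrowed from the timer defines in the program listing.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical slave: decodes its own address range and stays silent otherwise.
struct Slave {
    uint16_t base;   // first address in this slave's range
    uint16_t size;   // number of addresses it decodes
    uint16_t value;  // stand-in for the data the peripheral would return

    bool selected(uint16_t addr) const {
        return addr >= base && addr - base < size;
    }
    // Unselected slaves drive zero so they cannot disturb the or'd result.
    uint16_t dataOut(uint16_t addr) const { return selected(addr) ? value : 0; }
    bool ack(uint16_t addr) const { return selected(addr); }
};

struct BusResult {
    uint16_t data;
    bool ack;
};

// What the master sees: the OR of every slave's data and acknowledge lines.
BusResult readBus(const std::vector<Slave>& slaves, uint16_t addr) {
    BusResult r{0, false};
    for (const Slave& s : slaves) {
        r.data |= s.dataOut(addr);
        r.ack = r.ack || s.ack(addr);
    }
    return r;
}
```

With two slaves at 65000 and 65010, a read of 65003 returns only the first slave’s value, and a read of an unmapped address returns zero with no acknowledge. The scheme relies on address ranges never overlapping, since two selected slaves would have their data or’d together.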

The CPU itself has dedicated input and output vectors for the ROM, alongside dedicated ackI and ackO. This will allow for future pipelining, though considering PIC10, 12 and 16 series chips run at 4 cycles/instruction, I don’t think that ~5 cycles/instruction is particularly fatal. As of yet, there is no way of copying data from an LUT stored in ROM into the registers without using literal stores, then placing the LUT into RAM. My opcode length is 4 bits, which gives 2^4 = 16 possible opcodes, meaning I still have room for 16-9 = 7 more instructions!

As the ALG instruction executes both arithmetic and logical operations, I’ve decided to implement the ALU inside of the CPU instead of as a separate entity. I’ve used variables (which as of yet, I don’t know if they’re bad or not…) to do what I would expect to compile to 3x muxes: two muxes select the two input registers (left hand side and right hand side register) and one mux routes the result of the operation to a destination register.

Stack management is currently not particularly robust: if stack overflow occurs, the first n (n = stack size) stack pushes are fine, but anything after that will not be written to the stack and will disappear into thin air.
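
The overflow behaviour is easy to model in C++. The sketch below is an assumption-laden illustration rather than a translation of the VHDL: `PcStack` is a hypothetical name, and returning `false` from `pop` on an empty stack is my own choice, since underflow isn’t covered above.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Fixed-depth PC stack that silently drops pushes once it is full.
template <std::size_t Depth>
struct PcStack {
    std::array<uint16_t, Depth> slots{};
    std::size_t count = 0;

    // Returns false when the value was dropped because the stack is full.
    bool push(uint16_t pc) {
        if (count == Depth) return false;
        slots[count++] = pc;
        return true;
    }

    // Returns false on an empty stack (assumed behaviour, see lead-in).
    bool pop(uint16_t& pc) {
        if (count == 0) return false;
        pc = slots[--count];
        return true;
    }
};
```

With `PcStack<16>`, a 17th nested call is lost, so the matching return pops the 16th return address instead and execution resumes in the wrong place.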

As stated by nearly every embedded engineer ever, timers are one of the most useful peripherals on a microcontroller, therefore I thought it was a good idea to put quite a lot of thought into the implementation of this peripheral. My timer modules feature:

  • 16 bit counter register
  • 16 bit 2^n prescaler (1, 2, 4, 8, 16, 32, 64, 128, 256, 512, 1024, 2048, 4096, 8192, 16384, 32768)
  • 16 bit overflow register, defines the period of the timer
  • 16 bit PWM comparator register
  • Timer enable, reset, PWM enable and PWM polarity configurability

The PWM polarity allows for inverted PWM control. The reset bit holds the prescaler and counter low.
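
Here is a behavioural C++ sketch of a timer with those registers. The details are assumptions rather than the VHDL: the counter is assumed to wrap to zero when it reaches the overflow value, the PWM output is assumed high while the counter is below the comparator value, and the control bits are kept as separate `bool` fields because the bit layout of the control register isn’t given here.

```cpp
#include <cstdint>

struct Timer {
    uint16_t cnt = 0;     // counter register
    uint8_t pscExp = 0;   // prescaler exponent n: clock divided by 2^n, n = 0..15
    uint16_t ovf = 0;     // overflow register (period)
    uint16_t pwm = 0;     // PWM comparator register
    bool enable = false;
    bool reset = false;
    bool pwmEnable = false;
    bool pwmInvert = false;

    uint16_t psc = 0;     // internal prescaler counter

    // Advances the timer by one master clock and returns the PWM pin level.
    bool tick() {
        if (reset) {
            psc = 0;      // reset holds prescaler and counter low
            cnt = 0;
        } else if (enable) {
            psc = static_cast<uint16_t>(psc + 1);
            if (psc >= (1u << pscExp)) {
                psc = 0;
                cnt = static_cast<uint16_t>(cnt + 1);
                if (cnt >= ovf) cnt = 0;  // assumed wrap at the overflow value
            }
        }
        return output();
    }

    bool output() const {
        if (!pwmEnable) return false;
        const bool high = cnt < pwm;
        return high != pwmInvert;  // polarity bit inverts the output
    }
};
```

Two such timers with the same period and comparator value but opposite polarity give complementary outputs: over one period of 10 ticks with a comparator value of 3, one is high for 3 ticks and the other for 7.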

So far, that’s pretty much it!

Obviously, testing is a pretty vital part of VHDL design and can sometimes consist of more code than the actual project itself! One test I’ve decided to do, to see the feasibility of the microcontroller and whether it can actually do anything of use, is to produce a complementary PWM waveform using two timer peripherals, changing the pulse width in a sawtooth-like fashion.


Simulating in ModelSim produces a pretty reasonable result! The two PWM streams are not directly in phase due to the instruction cycle delay between enabling the two streams. With the timer register definitions, this whole program only came to ~58 lines of uHMCU assembly code. Better still, this compiles down to 38 program words or 836bits (104.5bytes), pretty small if you ask me!

Program listing:

-def TIM1_CNT 65000
-def TIM1_CTRL 65001
-def TIM1_PSC 65002
-def TIM1_OVF 65003
-def TIM1_PWM 65004

-def TIM2_CNT 65010
-def TIM2_CTRL 65011
-def TIM2_PSC 65012
-def TIM2_OVF 65013
-def TIM2_PWM 65014

str ra 2 #Reset timer1
rtm ra TIM1_CTRL
str ra 0 #Set prescaler
rtm ra TIM1_PSC
str ra 10 #Set Overflow
rtm ra TIM1_OVF
str ra 1 #Set PWM
rtm ra TIM1_PWM

str ra 2 #Reset timer2
rtm ra TIM2_CTRL
str ra 0 #Set prescaler
rtm ra TIM2_PSC
str ra 10 #Set Overflow
rtm ra TIM2_OVF
str ra 1 #Set PWM
rtm ra TIM2_PWM
str ra 13 #Start timer1 and PWM polarity 0
rtm ra TIM1_CTRL
str ra 5 #Start timer2 and PWM polarity 1
rtm ra TIM2_CTRL
str rb 1
str ra 0
jmp ‘CounterLoop

alg dec rb #RB--
jnz rb ‘IncLoop #If RB == 0, jump to inc loop
jmp ‘CounterLoop #Else, jump to RB--

str rb 1 #Reset sub counter
alg inc ra #Increment A twice
alg inc ra #(easier than loading in a literal and adding)
rtm ra TIM1_PWM #Write PWM value to T1
rtm ra TIM2_PWM #Write PWM value to T2
str rd 10 #Set D to 10
alg sub rc ra rd #Check to see if ra==rd (rc = ra-rd)
jnz rc ‘RaReset #Jump if rc = 0
jmp ‘CounterLoop

str ra 0 #Set RA to 0
str rb 1 #Reset the sub counter
jmp ‘CounterLoop #Jump back to the main loop!

It might seem like a lot of code but there are two nested loops and a couple of comparisons too. The timers are also able to run at the master clock frequency, allowing potentially high PWM rates! To ensure a new PWM compare value doesn’t mess up the current cycle, if the PWM register is written to during a cycle, the value is only loaded into the PWM comparator at the beginning of the next cycle. This allows the seamless PWM that is shown above.
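
The buffered compare value can be sketched like this in C++. It is a simplified model with hypothetical names (`BufferedPwm`, `pending`, `active`), and it assumes the counter wraps after `period` ticks and the output is high while the counter is below the active compare value.

```cpp
#include <cstdint>

struct BufferedPwm {
    uint16_t period;       // counter wraps to 0 after this many ticks (assumed)
    uint16_t cnt = 0;
    uint16_t pending = 0;  // value most recently written by the CPU
    uint16_t active = 0;   // value the comparator actually uses

    explicit BufferedPwm(uint16_t p) : period(p) {}

    // A CPU write only updates the pending copy.
    void write(uint16_t v) { pending = v; }

    // Advances one tick and returns the PWM level for that tick.
    bool tick() {
        if (cnt == 0) active = pending;  // latch only at the start of a cycle
        const bool high = cnt < active;
        cnt = static_cast<uint16_t>(cnt + 1);
        if (cnt >= period) cnt = 0;
        return high;
    }
};
```

Writing a new value part-way through a period leaves the current period untouched; the new duty cycle first appears when the counter next wraps to zero.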

[Image: Variable prescalers for each timer]

The assembler is written in C++ and is unbelievably simple. All of the surplus whitespace and blank lines are removed and the text is capitalized to make parsing easier. Initially, a small preprocessor is run looking for defines (syntax described below!). Once a list of defines has been constructed (using the vector class), the preprocessor runs through the entire program (loaded into the assembler as one big string) and replaces all of the defines with their literal values.
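
A define pass along those lines might look like the following C++ sketch. It is not the actual assembler source: `applyDefines` is a hypothetical name, and it replaces whole whitespace-separated words rather than raw substrings, so that one define name which is a prefix of another isn’t clobbered. It is also case-sensitive, so it assumes capitalisation has already happened or isn’t needed.

```cpp
#include <sstream>
#include <string>
#include <utility>
#include <vector>

// Collects "-def NAME VALUE" lines, then substitutes NAME with VALUE elsewhere.
std::string applyDefines(const std::string& source) {
    std::vector<std::pair<std::string, std::string>> defines;
    std::vector<std::string> lines;

    // Pass 1: pull the defines out of the program text.
    std::istringstream in(source);
    std::string line;
    while (std::getline(in, line)) {
        std::istringstream words(line);
        std::string first, name, value;
        if (words >> first && first == "-def" && words >> name >> value) {
            defines.emplace_back(name, value);
        } else {
            lines.push_back(line);
        }
    }

    // Pass 2: rebuild each remaining line, swapping defined names for literals.
    std::string out;
    for (const std::string& l : lines) {
        std::istringstream words(l);
        std::string word, rebuilt;
        while (words >> word) {
            for (const auto& d : defines) {
                if (word == d.first) { word = d.second; break; }
            }
            if (!rebuilt.empty()) rebuilt += ' ';
            rebuilt += word;
        }
        if (!rebuilt.empty()) out += rebuilt + '\n';  // blank lines are dropped
    }
    return out;
}
```

Feeding it `-def TIM1_CTRL 65001` followed by `rtm ra TIM1_CTRL` produces `rtm ra 65001`.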

After the preprocessor has replaced all the defines, the label parser is run, finding all of the label definitions in the program. These labels are stored in a list along with the absolute position they will occupy in the final program (derived from the line number of the label minus the number of labels previously found).
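
The position calculation can be illustrated with a short C++ sketch. The definition syntax is an assumption on my part: a label definition is taken to be a line of its own starting with a plain `'` character, and `findLabels` and `Labelled` are hypothetical names. Because label lines are removed as they are found, the number of code lines kept so far is exactly "line number minus labels already seen".

```cpp
#include <map>
#include <string>
#include <vector>

struct Labelled {
    std::map<std::string, int> address;  // label name -> program word index
    std::vector<std::string> code;       // source lines with label lines removed
};

// Assumes one statement per line, with defines and comments already handled.
Labelled findLabels(const std::vector<std::string>& lines) {
    Labelled out;
    for (const std::string& line : lines) {
        if (!line.empty() && line[0] == '\'') {
            // The label points at the next instruction that will be emitted.
            out.address[line.substr(1)] = static_cast<int>(out.code.size());
        } else {
            out.code.push_back(line);
        }
    }
    return out;
}
```

A later pass can then replace each `'NAME` operand with `address["NAME"]` when the jump instructions are encoded.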

Once the labels and their positions have been found and added to the label list, the program is searched for same-line comments. These comments are stripped as they aren’t required for the compilation process. Finally, the program is split into lines and sent to the main instruction parser. The parser has minor error checking capability (number of operands, instruction and register names etc.) and returns the instruction as a 32bit word. This 32bit word is then printed to a text file with one word per line to be parsed by my VHDL TXT to ROM parser.
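
Comment stripping and the operand count check could be sketched as below in C++. The function names are hypothetical, and the operand counts are inferred from the program listing where an instruction appears there (STR, RTM, JMP, JNZ, ALG); the counts for NOP, MTR, JSZ and PSP are guesses based on the instruction descriptions. No encoding to a program word is attempted.

```cpp
#include <map>
#include <sstream>
#include <string>
#include <vector>

// Removes a trailing "#..." comment and any whitespace left before it.
std::string stripComment(const std::string& line) {
    std::string code = line.substr(0, line.find('#'));
    std::size_t end = code.find_last_not_of(" \t");
    return end == std::string::npos ? std::string() : code.substr(0, end + 1);
}

// Checks that a capitalised source line has a known mnemonic and operand count.
bool operandCountOk(const std::string& line) {
    static const std::map<std::string, int> expected = {
        {"NOP", 0}, {"PSP", 0}, {"JMP", 1}, {"JSZ", 1},
        {"JNZ", 2}, {"MTR", 2}, {"RTM", 2}, {"STR", 2},
    };
    std::istringstream words(stripComment(line));
    std::vector<std::string> tok;
    for (std::string w; words >> w;) tok.push_back(w);
    if (tok.empty()) return false;

    const int operands = static_cast<int>(tok.size()) - 1;
    if (tok[0] == "ALG") {
        // "ALG INC RA" has one register, "ALG SUB RC RA RD" has three.
        if (operands < 1) return false;
        const bool unary = tok[1] == "INC" || tok[1] == "DEC";
        return operands == (unary ? 2 : 4);
    }
    auto it = expected.find(tok[0]);
    return it != expected.end() && operands == it->second;
}
```

For example, `STR RA 2 #Reset timer1` passes, while `STR RA` or an unknown mnemonic is rejected.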

[Image: Output from the assembler]

[Image: Finding an error in the code!]

[Image: Comparison between the program and the output program words]


  • -def : Literal define, these are used by the preprocessor to ease programming. The defines are replaced by their literals before compilation. e.g.
    -def myNum 10
    str ra myNum
    ra now contains 10
  • # character : Comment designator. This character is used at the start of comments, equivalent to // in C or -- in VHDL
  • ‘ character : Label designator. This character denotes a label and is used to tell the assembler about labelled sections of code. This is really useful for abstracting code jumps as the absolute addresses are managed by the assembler

I’m probably going to be working on this a fair bit over the next few days as I’ve reached a point with the Phobass where I’m waiting for PCBs to arrive – which are going to be arriving back home anyway. Stay tuned for more updates! I’ll probably be implementing this as a VM at some point too and hopefully I’ll be able to port a simple C compiler to work with my architecture.
