Not all the ones on the web will work right. There are 4 address registers, one of which is selected as the current PC. As is normal, on a call the next sequential address is left in the register that is currently selected, the called address is placed in the next register, and the current PC pointer moves to that new register.
When the stack is overwritten and followed by code that does 4 BBL instructions, the 4th BBL will return to the address held in the register that was overwritten by the JMS call.
Tom uses this to change the flow of the state of the program. Since the 4th JMS doesn't overwrite the return address on a 4040, the code, as written, will not work on the 4040. Almost no one that I've seen uses this difference between the two processors. You have to remember, Tom wrote the assembler to fit into 1K of code space. He used just about every possible trick.
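As a rough sketch of the register behavior described above (not Tom's code, and not taken from any particular simulator), the address registers can be modeled as a ring with a pointer to the one acting as PC. The class and function names here are made up for illustration, and the 8-register size used for the 4040 is an assumption based on its deeper stack.

```python
class AddressRing:
    """Ring of 12-bit address registers; one of them is the current PC."""

    def __init__(self, size):
        self.regs = [0] * size   # 4 on the 4004; 8 assumed here for the 4040
        self.cur = 0             # index of the register acting as PC

    @property
    def pc(self):
        return self.regs[self.cur]

    def fetch(self, nbytes):
        # Advance the current PC past an instruction of nbytes bytes.
        self.regs[self.cur] = (self.regs[self.cur] + nbytes) & 0xFFF

    def jms(self, target):
        # The return address stays behind in the old register;
        # the next register (wrapping around) becomes the PC.
        self.cur = (self.cur + 1) % len(self.regs)
        self.regs[self.cur] = target & 0xFFF

    def bbl(self):
        # Step back to the previous register, whatever it now holds.
        self.cur = (self.cur - 1) % len(self.regs)


def four_calls_four_returns(size):
    """Nest four JMS calls, then execute four BBLs; list where each BBL lands."""
    ring = AddressRing(size)
    ring.regs[0] = 0x010                      # hypothetical address of the first JMS
    for target in (0x100, 0x200, 0x300, 0x400):
        ring.fetch(2)                         # JMS is a 2-byte instruction
        ring.jms(target)
    landed = []
    for _ in range(4):
        ring.fetch(1)                         # BBL is a 1-byte instruction
        ring.bbl()
        landed.append(ring.pc)
    return landed


if __name__ == "__main__":
    print([hex(a) for a in four_calls_four_returns(4)])   # 4004-style ring
    print([hex(a) for a in four_calls_four_returns(8)])   # assumed 4040-style ring
```

With 4 registers the last BBL lands at 0x401, just past the first BBL in the deepest routine, because the 4th JMS reused the register that held 0x012. With 8 registers the same sequence returns to 0x012, which is why code relying on the wraparound behaves differently.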
There is one other funny behavior that I've seen used on both the 4004 and the 4040. The JCN has 16 possible encodings of its 4-bit condition code. One of those is an "Always" jump that is redundant with the JUN, but only within the same page. This is one of the funny page-based differences that shows up when the code is at a page boundary. There is no advantage to using this instead of the JUN, but a simulator must still do the right thing. The other rarely used variant is the "Never" jump. This is a tricky one. When I disassemble this one, I always use an additional pseudo-instruction, SKIP, to make the disassembly more readable. The skipped address byte is often used to hold a single-byte instruction, like LDM. This is often used at the beginning of a subroutine. It allows the subroutine to start with multiple possible values in the accumulator. The code might look like:
ENTER1 LDM 5
SKIP
ENTER2 LDM 4
... do something with ACC
BBL 0
This allows the same subroutine to be used from multiple places that require 4 in some cases and 5 in others. It only makes sense if you need these different values multiple times; otherwise one would just use the LDM instruction before the call. I've seen this used in code that ran a printer.
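To make the SKIP trick concrete for a simulator, here is a minimal sketch that decodes only the handful of opcodes needed for the example (LDM, JCN, XCH 0, BBL). The opcode values and the meaning of the condition bits follow my reading of the 4004 datasheet. The function names, the XCH 0 standing in for "do something with ACC", and the five-byte ROM are assumptions made for illustration.

```python
def jcn_taken(cond, acc, carry, test):
    """Evaluate the 4-bit JCN condition: 0x4 acc==0, 0x2 carry set, 0x1 test==0, 0x8 invert."""
    hit = bool((cond & 0x4 and acc == 0) or
               (cond & 0x2 and carry) or
               (cond & 0x1 and test == 0))
    return (not hit) if cond & 0x8 else hit


def jcn_target(jcn_addr, addr_byte):
    """Target lies in the page of the instruction FOLLOWING the 2-byte JCN."""
    return ((jcn_addr + 2) & 0xF00) | addr_byte


def run_sub(rom, entry):
    """Run from entry until BBL; return what XCH 0 saved in register 0."""
    pc, acc, r0 = entry, 0, 0
    while True:
        op = rom[pc]
        pc += 1
        hi, lo = op >> 4, op & 0xF
        if hi == 0xD:                       # LDM n
            acc = lo
        elif hi == 0x1:                     # JCN cond, addr
            addr = rom[pc]
            pc += 1                         # address byte is consumed either way
            if jcn_taken(lo, acc, 0, 1):
                pc = (pc & 0xF00) | addr
        elif op == 0xB0:                    # XCH 0: stand-in for "do something with ACC"
            acc, r0 = r0, acc
        elif hi == 0xC:                     # BBL n
            return r0
        else:
            raise ValueError(f"opcode {op:#04x} not handled in this sketch")


# ENTER1: LDM 5 / SKIP (JCN never) / ENTER2: LDM 4 / XCH 0 / BBL 0
# The LDM 4 byte doubles as the address byte of the never-taken JCN.
ROM = bytes([0xD5, 0x10, 0xD4, 0xB0, 0xC0])

if __name__ == "__main__":
    print(run_sub(ROM, 0), run_sub(ROM, 2))   # entry at ENTER1, then at ENTER2
```

Entering at offset 0 keeps the 5 because the JCN swallows the LDM 4 byte as its address; entering at offset 2 executes the LDM 4 directly. `jcn_target` shows the page-boundary case: a JCN sitting in the last two bytes of a page jumps into the following page.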
I hope this helps you get your simulator working correctly.
You should probably add an input option to your simulator that would take either binary or BNPF code, with the option to use bit inversion. That way you could use Tom's assembler to create code to simulate.
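A loader along those lines could be sketched as follows. It assumes the usual BNPF convention where each byte is written as B, then eight P (1) or N (0) characters with the most significant bit first, then F. The function names and the `invert` flag are hypothetical.

```python
import re

# One BNPF word: 'B', eight 'P'/'N' characters, 'F'. Anything else is ignored,
# so a comment that happens to contain such a pattern would be misread.
_BNPF_WORD = re.compile(r"B([NP]{8})F")


def load_bnpf(text, invert=False):
    """Convert BNPF text into bytes, optionally inverting every bit."""
    out = bytearray()
    for word in _BNPF_WORD.findall(text):
        value = int(word.replace("P", "1").replace("N", "0"), 2)
        out.append(value ^ 0xFF if invert else value)
    return bytes(out)


def load_binary(data, invert=False):
    """Pass raw binary through, with the same optional bit inversion."""
    return bytes(b ^ 0xFF for b in data) if invert else bytes(data)


if __name__ == "__main__":
    image = load_bnpf("BPPNPNPNPF BNNNPNNNNF")
    print(image.hex())
```

Keeping inversion as a flag on both loaders means the same ROM image can be tried either way without re-assembling.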
A simple way to instrument the simulator with I/O operations would enhance its use. Things like buttons for inputs and lights for outputs would be useful. Also, the ability to add things that can be queued up, like serial data I/O timed by cycle count, would be useful, especially for something like Tom's code.
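One possible shape for the cycle-count-driven queue is a small priority queue that the simulator polls after each instruction. This is only a sketch; the class name, the port labels, and the idea of polling once per instruction are assumptions and not features of any existing simulator.

```python
import heapq


class IoQueue:
    """Holds I/O events that become due at a given cycle count."""

    def __init__(self):
        self._events = []
        self._seq = 0      # tie-breaker so same-cycle events keep insertion order

    def schedule(self, cycle, port, value):
        heapq.heappush(self._events, (cycle, self._seq, port, value))
        self._seq += 1

    def due(self, now):
        """Pop and return every event whose cycle is <= now, oldest first."""
        fired = []
        while self._events and self._events[0][0] <= now:
            cycle, _, port, value = heapq.heappop(self._events)
            fired.append((cycle, port, value))
        return fired


if __name__ == "__main__":
    q = IoQueue()
    q.schedule(30, "serial_in", 1)     # hypothetical port name
    q.schedule(10, "serial_in", 0)
    for now in (9, 20, 40):
        print(now, q.due(now))
```

The simulator would call `due()` with its running cycle count and apply whatever comes back to the input pins before executing the next instruction.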
You'd then have something that would be of more use than just executing a string of instructions.
Dwight