WedNESday wrote: 2. What is the 'Address Bus' that some documents keep referring to.
It just refers to the signalling lines that the CPU uses to tell other bus participants (e.g. a memory controller?) which address it wants to read from or write to.
Think of a "bus transaction" as follows: one party (the CPU) supplies an address on the address lines and indicates whether it wants to read or write. If writing, it also supplies a value on the data lines; if reading, the other party (the memory controller?) is responsible for setting the data lines. Either way, one party writes to the data lines and the other party reads from them.
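Here's a toy sketch of that transaction in Python (the class and function names are just made up for illustration; real hardware does this with electrical signals, not method calls):

```python
# Toy model of a bus transaction. One party (the CPU) supplies an address
# and a read/write indication; the other party (here, a memory model)
# drives or latches the data lines accordingly.

class Memory:
    """Plays the role of the 'other party' (e.g. a memory controller)."""
    def __init__(self, size=65536):
        self.cells = [0] * size

    def respond(self, address, rw, data_lines):
        if rw == "read":
            # On a read, the memory sets the data lines.
            return self.cells[address]
        else:
            # On a write, it latches whatever the CPU put on the data lines.
            self.cells[address] = data_lines
            return data_lines

def bus_transaction(memory, address, rw, value=None):
    # The CPU supplies the address and the read/write indication;
    # on a write it also supplies the value on the data lines.
    return memory.respond(address, rw, value)

mem = Memory()
bus_transaction(mem, 0x0200, "write", 0x42)          # CPU drives the data lines
print(hex(bus_transaction(mem, 0x0200, "read")))     # memory drives them: 0x42
```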
WedNESday wrote: 2. I am quite confused by the 'pipelining' concept. Some documents say that the previous opcode is executed when the new one is fetched. Is this true?
Yes. But for practical purposes, you can pretend it isn't true.
"Pipelining" means dividing your logic that implements an instruction into multiple "stages". If you had three instructions A, B, C and a three-stage pipeline, then in cycle 1 instruction A enters the first stage. In cycle 2 instruction A moves on to the second stage, and instruction B enters the first stage. In cycle 3 instruction A enters the third stage, B enters the second stage and C enters the first stage.
The "latency" of instruction A was 3, because it had to pass through 3 pipeline stages (taking one cycle each) before it was completely finished. However, the "throughput" of this pipeline (assuming all instructions use all 3 stages and have latency of 3 cycles) would be 1 instruction per cycle! In other words, by splitting up the task of executing an instruction into 3 stages, we manage to be doing 3 things at once instead of just one thing. So our throughput is 3 times as high.
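You can simulate this little 3-stage example directly. This is just a sketch of the bookkeeping described above (the instruction names are labels, nothing more):

```python
# Toy 3-stage pipeline running instructions A, B, C.
# Each cycle, everything in flight advances one stage, and a new
# instruction (if any remain) enters stage 1.

STAGES = 3
to_issue = ["A", "B", "C"]
pipeline = [None] * STAGES   # pipeline[i] = instruction currently in stage i+1
completed = []               # (instruction, cycle in which it finished)

for cycle in range(1, 6):
    pipeline = [to_issue.pop(0) if to_issue else None] + pipeline[:-1]
    print(f"cycle {cycle}: stages 1..3 = {pipeline}")
    if pipeline[-1] is not None:
        completed.append((pipeline[-1], cycle))

# A finishes at cycle 3 (latency of 3), but once the pipeline is full,
# one instruction completes every cycle -- throughput of 1 per cycle.
print(completed)  # [('A', 3), ('B', 4), ('C', 5)]
```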
Now, it's not always so rosy. For one thing, instructions take different numbers of cycles and sometimes need to participate in only some of the pipeline stages. Also, sometimes there are bottlenecks: for example, the 80386 had a pipeline, but the first stage had to share some resources with the second stage. So an instruction could only 'issue' into the pipeline every two cycles, and the minimum effective cost of each instruction was 2 cycles. The 486 fixed this, and many of its instructions could then be executed effectively in one cycle, making it perform noticeably better.
Imagine a branch instruction enters the first stage of your pipeline, but you won't figure out whether the branch is 'taken' or 'not taken' until stage 3. What do you do on the next cycle? The branch moves up to stage 2 of the pipeline, but what should go into stage 1? You don't know yet whether it's the instruction right after the branch, or the instruction at the branch target address. (Even if you knew, you'd have to fetch that byte before you could decode it.) Most processors have 'branch prediction' logic to try and figure out well in advance whether a branch is likely to be taken or not. They then assume that is the case and keep filling the pipeline with the predicted instructions. If the prediction turns out to be wrong, the processor has to flush the parts of the pipeline that contain wrongly-predicted instructions, causing a "pipeline stall". On modern x86 processors, most branch instructions are predicted correctly 80% to 90% of the time. The rest of the time, you take a performance hit that could be anywhere from 10 to 25 cycles. [Edit: I think I forgot to mention that modern x86 processors have the equivalent of 15-25 stages in their pipelines. Modern GPUs in graphics cards have even more stages---dozens and dozens!]
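A quick back-of-the-envelope calculation shows what those numbers mean for average branch cost (the figures plugged in are the rough ranges quoted above; exact values vary a lot by processor):

```python
# Expected cycles per branch, given a predictor accuracy and a flush
# penalty. Assumes a correctly predicted branch costs `base_cost` cycles.

def avg_branch_cost(predict_accuracy, mispredict_penalty, base_cost=1):
    miss_rate = 1.0 - predict_accuracy
    return base_cost + miss_rate * mispredict_penalty

# 90% accuracy with a 20-cycle penalty: 1 + 0.1 * 20 = ~3 cycles/branch
print(round(avg_branch_cost(0.90, 20), 3))
# 80% accuracy with a 25-cycle penalty: 1 + 0.2 * 25 = ~6 cycles/branch
print(round(avg_branch_cost(0.80, 25), 3))
```

So even a 90%-accurate predictor can triple the average cost of a branch compared to the ideal case, which is why deep pipelines care so much about prediction accuracy.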
Okay, so getting back to the 6502... think of it as having a "pipeline of depth two". The first stage performs a memory access (read or write), and the second stage performs the operation of the instruction. The first and second cycles of a 6502 instruction are consecutive reads. During the second cycle, it is also decoding the opcode and deciding what to do with it. Maybe the second read was wasted (e.g. an implied operand), or maybe it was a Direct Offset or an Absolute Address Low byte, or something else we would have needed to read anyway. Take the example of an implied operand, though. The *third* cycle of the instruction is when it's actually executed! But the instruction is only two cycles long, you say! And you're sort of right---that *third* cycle is the same cycle as when the opcode fetch of the next instruction is occurring:
Code: Select all
A-1. fetch A's opcode (and, finish last instruction)
A-2. fetch next byte (and decode A)
B-1. fetch B's opcode (***and execute A)
B-2. fetch next byte (and decode B, which turns out to be a Zero Page insn)
B-3. read from Zero Page (and...idle)
C-1. fetch C's opcode (and execute B)
But if you write these effects sequentially, does it really matter how you divide them up?
Even if it's really doing this:
Code: Select all
(finish last instruction) and
A-1. fetch A's opcode
----------------------------
(decode A) and
A-2. fetch next byte
----------------------------
(execute A) and
B-1. fetch B's opcode
----------------------------
(decode B, which turns out to be a Zero Page insn), and
B-2. fetch next byte
you can for the most part think of it instead as doing this:
Code: Select all
....
(finish last instruction)
----------------------------
A-1. fetch A's opcode, and
(decode A)
----------------------------
A-2. fetch next byte, and
(execute A)
----------------------------
B-1. fetch B's opcode, and
(decode B, which turns out to be a Zero Page insn)
----------------------------
B-2. fetch next byte, and
....
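You can check this equivalence mechanically. Below is a sketch (event names are informal labels from the listings above, not real 6502 signal names) showing that the "actual" and "shifted" views pair up the same work differently, but the order of bus accesses and the order of internal operations are unchanged:

```python
# Each cycle is a pair: (bus access, internal work).

# What the hardware actually does each cycle:
actual = [
    ("fetch A opcode", "finish last insn"),
    ("fetch next byte", "decode A"),
    ("fetch B opcode",  "execute A"),
    ("fetch next byte", "decode B"),
]

# The shifted view that's easier to reason about: internal work is
# paired with the bus access one cycle earlier.
shifted = [
    ("fetch A opcode",  "decode A"),
    ("fetch next byte", "execute A"),
    ("fetch B opcode",  "decode B"),
]

# Same stream of bus accesses, same stream of internal operations;
# only the cycle boundaries between them move.
assert [c[0] for c in actual[:3]] == [c[0] for c in shifted]
assert [c[1] for c in actual[1:]] == [c[1] for c in shifted]
print("both views describe the same sequence of effects")
```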
I'm no expert so there might be errors in this. But you get the general idea.