Now that's what I call micromanagement.
(sorry couldn't resist)
Not even the developers are very technical in the future!
Woah, really? And they still manage to write good software?
Of course not. If good software were standing next to their bed at 4 am, they would scream: who are you, what are you doing here? Help! Help! Someone, make it go away!
Correct. Most ciphers of that era were Feistel ciphers along the lines of DES/3DES, and even RC4 uses XOR too. Later, AES/Rijndael, CRC, and ECC (Elliptic Curve Cryptography) also make heavy use of XOR, but in finite-field terms: the arithmetic is modular arithmetic over GF(2), i.e. mod 2, and addition mod 2 effectively reduces to XOR.
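The "addition mod 2" point is easy to see in a toy Python sketch (not any real cipher; the keystream below is made up purely for illustration): XOR-ing with the same keystream twice recovers the plaintext, because x + k + k == x (mod 2) bit by bit.

```python
# Toy XOR "stream cipher" illustration (not a real cipher): each bit of the
# keystream is added mod 2 to the plaintext bit, and adding it again mod 2
# undoes it, since x + k + k == x (mod 2).
def xor_bytes(data: bytes, keystream: bytes) -> bytes:
    return bytes(d ^ k for d, k in zip(data, keystream))

plaintext = b"ATTACK AT DAWN"
keystream = bytes((i * 37 + 11) % 256 for i in range(len(plaintext)))  # made-up keystream

ciphertext = xor_bytes(plaintext, keystream)
assert xor_bytes(ciphertext, keystream) == plaintext  # XOR is its own inverse
```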
Western Design Center is still (apparently) making a profit, at least in part by licensing 6502 core IP for embedded stuff. There's probably a 6502 buried and unrecognized in all sorts of low-cost control applications lying around you.
RC5 on an 8085
Oof. Well played.
[1] faster, more registers than the IBM 360, << 64k RAM
[2] much faster, 32bit, >> 64k RAM
But forget AVR. Yeah, for a buck or so the ATTiny85 was my go-to small MCU five years ago, and the $5 328 for bigger tasks.
But for the last three years both can be replaced by a 48 MHz 32 bit RISC-V CH32V003 for $0.10 for the 8 pin package (like ATTiny85, and also no external components needed) and $0.20 for the 20 pin package with basically the same number of GPIOs as the 328. At 2k RAM and 16K flash it's the same RAM and a little less flash than the ATMega328 -- but not as much as you'd think as RISC-V handles 16 and 32 bit values and pointers sooo much better.
And now you have the CH32V002/4/5/6 with enhanced CPU and more RAM and/or flash -- up to 8K RAM and 62K flash on the 006 -- and still for around the $0.10-$0.20 price.
There's probably no reason not to get some of the CH32VXXX's to play with. Every now and again I have an application that needs very low power and I'm happy to spring for an MSP430. But every time I buy an MSP430, TI EoLs the specific model I bought.
If it had been properly encrypted my young cracker self would have had no opportunity.
https://jnz.dk/z80/ld_r_n.html
Yep, if I'm reading this right that's 3E 00, since the second byte is the immediate value.
One difference between XOR and LD is that LD A, 0 does not affect flags, which sometimes mattered.
One of the random things burned into my memory for 6502 assembly is that LDA is $A9. I never separated the instruction from the register; it's not like they were general purpose. But that might be because I learned programming from the 2 books that came with my C64, a BASIC manual and a machine code reference manual, and that's how they did it.
I learned assembly programming by reading through the list of supported instructions. That, and typing in games from Compute's Gazette and manually disassembling the DATA instructions to understand how they worked. Oh, and the zero-page reference.
Good times.
On the 6502 you had three instructions LDA, LDX, LDY where the register name is essentially part of the instruction name. On the Z80 you had a lot of "load" instruction so you had LD and then many different operands: loading 8-bit registers, loading 16-bit, writing to memory, reading from memory, reading/writing from memory using a register as an index. So, made more sense on Z80 to have "LD" whereas LDA/LDX/LDY worked fine on 6502.
You had LDA and LDX and LDY as separate instructions while the Z80 assembler had a single LD instruction with different operands. It's the same thing really.
Well, I never wrote any 6502 so I can't compare, but yes, you could load immediate values into any register except the flag register on the Z80. Was that not a thing on the 6502?
Back in 1985 I did some hand-coding like this because I didn't have access to an assembler: https://blog.jgc.org/2013/04/how-i-coded-in-1985.html and I typed the whole program in through the keypad.
Later still I'd be patching binaries to ensure their serial-checks passed, on Intel.
I'm quite sure none of my friends knew any CPU opcode; however, people usually remembered a few phone numbers.
I didn't know you could do it differently for years after I started.
https://github.com/pret/pokecrystal/wiki/Optimizing-assembly...
echo tya|asm|mondump -r|6502
A=AA X=00 Y=00 S=00 P=22 PC=0300 0
0300- 98 TYA A=00 X=00 Y=00 S=00 P=22 PC=0301 2

*Actually a custom chip also containing some peripherals.
I’m familiar with 32-bit x86 assembly from writing it 10-20 years ago. So I was aware of the benefit of xor in general, but the above quote was new to me.
I don’t have any experience with 64-bit assembly - is there a guide anywhere that teaches 64-bit specifics like the above? Something like “x64 for those who know x86”?
The reason `xor eax,eax` is preferred to `xor rax,rax` is due to how the instructions are encoded - it saves one byte which in turn reduces instruction cache usage.
When using 64-bit operations, a REX prefix is required on the instruction (byte 0x40..0x4F), which serves two purposes - the MSB of the low nybble (W) being set (ie, REX prefixes 0x48..0x4f) indicates a 64-bit operation, and the low 3 bits of low nybble allow using registers r8-r15 by providing an extra bit for the ModRM register field and the base and index fields in the SIB byte, as only 3-bits (8-registers) are provided by x86.
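That encoding rule can be sketched in a few lines of Python (standard `31 /r` XOR encoding; the helper below is a simplification that only handles the register-XOR-itself case used for zeroing):

```python
# Sketch of how the REX prefix (0x40 | W<<3 | R<<2 | X<<1 | B) selects a
# 64-bit operand size (W) and supplies the extra register bits (R, B).
# "xor r/m, r" is opcode 0x31 with ModRM = 0xC0 | reg<<3 | rm.
def xor_reg_reg(reg: int, wide: bool) -> bytes:
    # reg == rm here, so R and B are set together when reg >= 8.
    rex = 0x40 | (0x08 if wide else 0) | (0x04 if reg >= 8 else 0) | (0x01 if reg >= 8 else 0)
    modrm = 0xC0 | ((reg & 7) << 3) | (reg & 7)
    need_rex = wide or reg >= 8
    return bytes(([rex] if need_rex else []) + [0x31, modrm])

assert xor_reg_reg(0, wide=False) == bytes([0x31, 0xC0])        # xor eax, eax: 2 bytes
assert xor_reg_reg(0, wide=True)  == bytes([0x48, 0x31, 0xC0])  # xor rax, rax: 3 bytes
assert xor_reg_reg(8, wide=False) == bytes([0x45, 0x31, 0xC0])  # xor r8d, r8d: REX.RB
```

The one-byte saving of `xor eax,eax` over `xor rax,rax` falls straight out of the `need_rex` condition.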
A recent addition, APX, adds an additional 16 registers (r16-r31), which need 2 additional bits. There's a REX2 prefix for this (0xD5 ...), which is a two byte prefix to the instruction. REX2 replaces the REX prefix when accessing r16-r31, still contains the W bit, but it also includes an `M0` bit, which says which of the two main opcode maps to use, which replaces the 0x0F prefix, so it has no additional cost over the REX prefix when accessing the second opcode map.
It's not just that, zero-extending or sign-extending the result is also better for out-of-order implementations. If parts of the output register are preserved, the instruction needs an extra dependency on the original value.
Instead you need to use the multi-byte, general purpose encoding of `xchg` for `xchg eax, eax` to get the expected behavior.
https://www.intel.com/content/www/us/en/developer/articles/t...
If you decode the instruction, it makes sense to use XOR:
- mov ax, 0 - needs 4 bytes (66 b8 00 00)
- xor ax, ax - needs 3 bytes (66 31 c0)
This extra byte, in a machine with less than 1 megabyte of memory, did indeed matter.
On 386 processors it was also:
- mov eax, 0 - needs 5 bytes (b8 00 00 00 00)
- xor eax, eax - needs 2 bytes (31 c0)
Here Intel made the decision to use only 2 bytes. I bet this helps both the instruction decoder and (of course) saves more memory than the old 8086 instruction.
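Spelled out in bytes (the 32-bit mode encodings from above), the saving per zeroing is easy to verify:

```python
# The two 32-bit-mode encodings being compared in the comment above.
mov_eax_0   = bytes([0xB8, 0x00, 0x00, 0x00, 0x00])  # mov eax, 0   -> 5 bytes
xor_eax_eax = bytes([0x31, 0xC0])                    # xor eax, eax -> 2 bytes

assert len(mov_eax_0) == 5
assert len(xor_eax_eax) == 2
assert len(mov_eax_0) - len(xor_eax_eax) == 3        # 3 bytes saved per zeroing
```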
Never mind the fact that, as the author also mentions, the xor idiom takes essentially zero cycles to execute because nothing actually happens besides assigning a new pre-zeroed physical register to the logical register name early on in the pipeline, after which the instruction is retired.
This is slightly inaccurate -- instructions retire in order, so it doesn't necessarily retire immediately after it's decoded and the new zeroed register is assigned. It has to sit in the reorder buffer waiting until all the instructions ahead of it are retired as well.
Thus in workloads where reorder buffer size is a bottleneck, it could contribute to that. However I doubt this describes most workloads.
For the AMD 9950, we are talking about 1280 kB of L1 (per core), 16 MB of L2 (per core), and 64 MB of L3 (shared; 128 MB if you have the X3D version).
I won't say it doesn't matter, but it doesn't matter as much as it once did. CPU caches have gotten huge while the instructions remain the same size.
The more important part, at this point, is that it's idiomatic. That means hardware designers are much more likely to put in specialty logic to make sure it's fast. It's a common enough operation to deserve its own special cases. You can fit a lot of 8-byte instructions into 1280 kB of memory. And as it turns out, it's pretty common for applications to spend a lot of their time in small chunks of instructions. The slow part of a lot of code will be that `for` loop with the 30 AVX instructions doing magic. That's why you'll often see compilers burn `NOP` instructions to align a loop: to avoid splitting a cache line.
Ryzen 9 CPUs have 1280kB of L1 in total. 80kB (48+32) per core, and the 9 series is the first in the entire history of Ryzens to have some other number than 64 (32+32) kilobytes of L1 per core. The 16MB L2 figure is also total. 1MB per core, same as the 7 series. AMD obviously touts the total, not per-core, amounts in their marketing materials because it looks more impressive.
As an aside, zen 1 did actually have a 64kB (and only 4 way!) L1I cache, but changed to the page size times way count restriction with zen 2, reducing the L1 size by half.
You can also see this on the apple side, where their giant 192kB caches L1I are 12 ways with a 16kB page size.
You don't need operand size prefix 0x66 when running 16 bit code in Real Mode. So "mov ax, 0" is 3 bytes and "xor ax, ax" is just 2 bytes.
It makes much more sense: resetting ax and bx (xor ax,ax ; xor bx,bx) will be 4 octets, DWORD-aligned, and a bit faster for the x86 to fetch than the 3-octet version I wrote before.
> - mov ax, 0 - needs 4 bytes (66 b8 00 00) - xor ax,ax - needs 3 bytes (66 31 c0)
Except, apparently, on the pentium Pro, according to this comment: https://randomascii.wordpress.com/2012/12/29/the-surprising-..., which says:
“But there was at least one out-of-order design that did not recognize xor reg, reg as a special case: the Pentium Pro. The Intel Optimization manuals for the Pentium Pro recommended “mov” to zero a register.”
https://fanael.github.io/archives/topic-microarchitecture-ar...
“I assume that the ability to recognize that the exclusive-or zeroing idiom doesn't really depend on the previous value of a register, so that it can be dispatched immediately without waiting for the old value — thus breaking the dependency chain — met the same fate; the Pentium Pro shipped without it.
Some of the cut features were introduced in later models: segment register renaming, for example, was added back in the Pentium II. Maybe dependency-breaking zeroing XOR was added in later P6 models too? After all, it seems such a simple yet important thing, and indeed, I remember seeing people claim that's the case in some old forum posts and mailing list messages. On the other hand, some sources, such as Agner Fog's optimization manuals say that not only it was never present in any of the P6 processors, it was also missing in Pentium M.”
Fun fact - the IBM PC XT also came in a 286 model (the XT 286).
iirc doesn't word alignment matter? I have no idea if this is how the IBM PC XT was aligned but if you had 4 byte words then it doesn't matter if you save a byte with xor because you wouldn't be able to use it for anything else anyway. again, iirc.
It has its merits, but the underlying hardware has changed.
Intel tried to push this responsibility to the compiler with Itanium but that failed catastrophically, so we're back to the CPU pretending it's 1985.
> And, having done that it removes the operation from the execution queue - that is the xor takes zero execution cycles!1 It’s essentially optimised out by the CPU
Right, it has the same consequence, but it doesn't actually perform the stated operation. ASM is now just a high-level language that tells the computer: "please give me the same state that a PDP-11-like computer would give me upon executing these instructions."
But it will not execute xor, nor will it actually zero out eax in most cases.
It'll do something similar to constant propagation with the information that whenever xor eax, eax occurs; all uses of eax go through a simpler execution path until eax is overwritten.
It's emulating the zero result when it recognizes this pattern, usually by playing clever tricks with virtual registers.
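A cartoon model of that rename trick, purely illustrative (no real CPU is implemented like this, and the class and counter names are made up): the front end just repoints the architectural register at a pre-zeroed physical register instead of issuing an ALU op.

```python
# Cartoon model of the zeroing-idiom rename trick. Not how real hardware is
# built; it only shows the idea that "xor r, r" never reaches an ALU.
class ToyRenamer:
    ZERO = "p0"  # physical register hardwired to zero

    def __init__(self):
        self.phys = {self.ZERO: 0}  # physical register file
        self.map = {}               # architectural -> physical mapping
        self.alu_ops = 0            # ops that actually reach an execution port

    def write(self, reg, value):
        self.alu_ops += 1
        name = f"p{len(self.phys)}"  # allocate a fresh physical register
        self.phys[name] = value
        self.map[reg] = name

    def xor_self(self, reg):
        self.map[reg] = self.ZERO    # handled at rename time: no ALU op issued

    def read(self, reg):
        return self.phys[self.map[reg]]

cpu = ToyRenamer()
cpu.write("eax", 1234)
cpu.xor_self("eax")        # the zeroing idiom
assert cpu.read("eax") == 0
assert cpu.alu_ops == 1    # only the original write used an execution port
```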
@grok: This is a classic and brilliant puzzle from the early days of DOS. The challenge is to write a self-displaying hex dumper (.COM file) using only the copy con command, which means every byte of the program must correspond to a standard, typeable ASCII character. This is a form of code-golfing that relies on a deep understanding of 8086 machine code.
The solution is a clever piece of self-modifying code. Since many essential 8086 opcodes (like MOV, INT, RET) don't have printable ASCII equivalents, the program must construct these instructions in memory and then execute them.
Here is a well-known 128-byte solution. I will first show you the ASCII characters you would type, and then explain how it works.
The Program to Type
You would have started by typing copy con hex.com and then carefully typed the following 128 characters (including spaces and case-sensitive characters) and finally pressed Ctrl+Z to save the file.
1 j!>1!QhH-@0X,jHj.Ph1PhD0Ph 0@h, j, h.0@h, h 0@h.H-@,jH1X,h.H.@,H1j j X,h.H.@,H1j j X.H-@H-@,jHj.Ph1PhE0Ph@0@h, j, h.0@h, h
0@h.
How It Works: A High-Level Explanation

This program is a marvel of 8086 trickery. Here is a simplified breakdown of what's happening:
etc.etc
@grok: While your exact code is lost to time, it would have looked something like one of the ultra-small ASCII "dropper" programs that were once passed around. Here is a plausible 32-byte example of what the program you typed might have looked like.
You would have run copy con nibbler.com, typed the following line, and hit Ctrl+Z:
`j%1!PZYfX0f1Xf1f1AYf1E_j%1!PZ`
This looks like nonsense, but to the 8088/8086 processor, it's a dense set of instructions that does the following:
etc. etc.

BTW, it is not beyond possibility that this nibbler or dropper was made by myself and published on Usenet by me in 1989. Who else would have such a problem?
It was a bankruptcy sale and the machine was sold as "inactivated".
Let's take a look at just the first few instructions here:
ASCII HEX ASM
j% 6A 25 PUSH 25h ; 80186+ only, won't work on Atari Portfolio!
1! 31 21 XOR [BX+DI],SP ; complete nonsense
P 50 PUSH AX ; AX sort-of-undefined at this point, depends on DOS version and command line arguments
Z 5A POP DX ; copy AX into DX
Y 59 POP CX ; CX = 25h?
fX 66 58 POP EAX ; 80386+ only! also nothing on stack anymore!
0f1 30 66 31 XOR [BP+31h],AH ; complete nonsense
...
This looks like it was randomly jumbled together from fragments of ASCII-only machine code intended for much newer x86 processors -- just look at how often `f` appears in those strings, which is the operand-size prefix, and meaningless on 16-bit chips. The memory addressing might make more sense in 32-bit mode too, where a different encoding is used (ModRM and SIB bytes).

And this is how LLMs work. They can associate the tokens for e.g. `PZ` with the instructions `PUSH AX / POP DX`, and a certain kind of program in which those would be likely to appear, drawn somewhere from their massive training set.
Humans can easily learn to recognize these 'words' of ASCII text in machine code too, just by spending time looking at it. Another good one is the pair of `<ar` and `<zw` (usually next to each other, with an unprintable character between), present in most upcase()-like code from that era. So if you had asked Grok to accept both upper and lower case of hex input, I would bet that it would have inserted those sequences somewhere too.
But what LLMs CAN NOT do is plan ahead on how to use these program fragments to accomplish a specific task, or keep simple constraints in mind like "this must run on the 80C88 processor in the Atari Portfolio". They even have trouble with keeping track of registers, stack, or which mode the CPU is in.
(yes, late, but I can't let this stand here uncommented. https://xkcd.com/386 )
There was "See the video that accompanies this post." but NGL I was just posting in case anyone didn't have time to read it or missed it.
It's also available as an inscrutable printed book on Amazon.
I do wonder who was the first cracker that thought of including a keygen music that started the tradition.
I also miss how different groups competed with each other and boasted about theirs while dissing others in readmes.
Readmes would have a .NFO suffix, and that would try to load in some Windows tool, but you had to open them in Notepad. Good times.
XR 15,15 XOR REGISTER 15 WITH REGISTER 15
vs L 15,=F'0' LOAD REGISTER 15 WITH 0
This was alleged to be faster on the 370 because XR operated entirely within the CPU registers, while L (Load) fetched data from memory (i.e., the constant came from program memory).

Meanwhile, most "apps" we get nowadays contain half of npmjs neatly bundled in Electron. I miss the days when the default was native and devs had constraints on how big their output could be.
Which isn’t an excuse anymore. UI coding isn’t that hard; if someone can’t do it, well, Claude certainly can.
``` window.$ = (q) => document.querySelector(q); ``` Emulates the behavior much better. This is already set in modern versions of browsers[1]
[1] https://firefox-source-docs.mozilla.org/devtools-user/web_co...
I had no idea this happened. Talk about a fascinating bit of X86 trivia! Do other architectures do this too? I'd imagine so, but you never know.
So the real question is why x86 zero-extends rather than sign-extends in these cases, and the answer is probably that by zero-extending, an implementation that treats a 64-bit architectural register as a pair of 32-bit renamed physical registers can statically put the upper register back on the free pool by marking it as zero, rather than having to track the sign-extended result of an op.
I know x86-64 zeroes the upper part of the register for backwards compatibility and to improve instruction-cache usage (no need for a REX prefix), but AArch64 is unclear to me.
mov w5, w6 // move low 32 bits of register 6 into low 32 bits of register 5
This instruction only depends on the value of register 6. If instead of zeroing the upper half it left it unchanged, it would depend on w6 and also on the previous value of register 5. That would constrain the renamer and consequently out-of-order execution.

Meanwhile, people like me who got started with a Z80 instead immediately knew why, since XOR A is the smallest and fastest way to clear the accumulator and flag register. Funny how that also shows how specific this is to a particular CPU lineage or its offshoots.
Not sure exactly how I could dig up pronunciations, except finding the oldest recordings
Was it someone from an electronics background? Because MOV is also the acronym for Metal Oxide Varistor [1] in electronics, and in the electronics world the acronym is often pronounced "MAUV".
While this is probably true ("probably" because I haven't checked it myself, but it makes sense), the CPU could do the exact same thing for "mov eax, 0", couldn't it? (Does it?)
I do not think that anyone bothers to do this for a "mov eax, 0", because neither assembly programmers nor compilers use such an instruction. Either "xor reg,reg" or "sub reg,reg" have been the recommended instructions for clearing registers since 1978, i.e. since the launch of Intel 8086, because Intel 8086 lacked a "clear" instruction, like that of the competing CPUs from DEC or Motorola.
One should remember that what is improperly named "exclusive or" in computer jargon is actually simultaneously addition modulo 2 and subtraction modulo 2 (because these 2 operations are identical; the different methods of carry and borrow generation distinguish addition from subtraction only for moduli greater than 2).
The subtraction of a thing from itself is null, which is why clearing a register is done by subtracting it from itself, either with word subtraction or with bitwise modulo-2 subtraction, a.k.a. XOR.
(The true "exclusive or" operation is a logical operation distinct from addition/subtraction modulo 2. These two operations are equivalent only for 2 operands. For 3 or more operands they differ, but programmers still incorrectly use the term XOR when they mean the addition modulo 2 of 3 or more operands. The true "exclusive" or is the function that is true only when exactly one of its operands is true, unlike "inclusive" or, which is true when at least one of its operands is true. To these two logical "or" functions correspond the two logical quantifiers "there exists a unique ..." and "there exists a ...".)
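The divergence is easy to check exhaustively for three operands (helper names below are made up for the illustration):

```python
# For two operands, parity (addition mod 2) and "exactly one is true" agree;
# for three or more they differ, and only at the all-true input.
from itertools import product

def parity(*bits):        # what programmers call XOR: sum mod 2
    return sum(bits) % 2

def exactly_one(*bits):   # the "true" exclusive or
    return int(sum(bits) == 1)

assert all(parity(a, b) == exactly_one(a, b) for a, b in product((0, 1), repeat=2))
diffs = [(a, b, c) for a, b, c in product((0, 1), repeat=3)
         if parity(a, b, c) != exactly_one(a, b, c)]
assert diffs == [(1, 1, 1)]  # parity is 1 there, but not exactly one operand is true
```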
It could of course. It can do pretty much any pattern matching it likes. But I doubt very much it would because that pattern is way less common.
As the article points out, the XOR saves 3 bytes of instructions for a really, really common pattern (to zero a register, particularly the return register).
So there's very good reason to perform the XOR preferentially and hence good reason to optimise that very common idiom.
Other approaches eg add a new "zero <reg>" instruction are basically worse as they're not backward compatible and don't really improve anything other than making the assembly a tiny bit more human readable.
Yes, it could, but mov eax, 0 is still going to be five bytes of instruction in cache, and fetched, and decoded, so optimizing on the shorter version is marginally better.
The 8080 and Z80's NOP was at opcode 0. Which was neat because you could make a "NOP slide" simply by zeroing out memory.
xor:
> The OF and CF flags are cleared; the SF, ZF, and PF flags are set according to the result. The state of the AF flag is undefined.
sub:
> The OF, SF, ZF, AF, PF, and CF flags are set according to the result.
(I don't have an x64 system handy, but hopefully the reference manual can be trusted. I dimly remembered this, or something like it, tripping me up after coming from programming for the 6502.)
(Not suggesting it should be. Maybe that's a terrible idea, but I don't know why.)
Though technically you said "assembler macro", not opcode. For that, I suspect the argument is more psychological: we had such limited resources of all sorts back then that being parsimonious with everything was a required mindset. The mindset didn't just mean you made everything as short as possible, it also meant you reused everything you possibly could. So reusing XOR just felt more fitting and natural than carving out a separate assembler instruction name. (Also, there would be the question of what effect ZEROAX should have on the flags, which can be somewhat inferred when reusing an existing instruction.)
.macro ZEROAX
xor eax, eax
.endm
where it was defined with a semantically meaningful name, but emitting the exact same opcodes as when writing it out. I mean, I guess taking that to the logical extreme, you'd end up with... C. I dunno, it just seemed like the sort of thing that would have caught on by convention.

I used to write lots of 6502 and 68k assembler, and 68k especially tended to look quite human-readable by the time devs ended up writing macros for everything. Perhaps that wasn't the same culture around x86 asm, which I admit I've done far, far less of.
> I used to write lots of 6502 and 68k assembler, and 68k especially tended to look quite human-readable by the time devs ended up writing macros for everything.
Yes. I only did nontrivial amounts of 6502 and x86, but from what I saw of 68k, it seemed like it started out cleaner-looking and more readable even before adding in macros. (Or for all I know, I was reading code using built-in macros.)
Of course. I might have some data stored in the higher dword of that register.
Partial register updates are kryptonite to OoO engines. For people used to low-level programming weak machines, it seems natural to just update part of a register, but the way every modern OoO CPU works that is literally not a possible operation. Registers are written to exactly once, and this operation also frees every subsequent instruction waiting for that register to be executed. Dirty registers don't get written to again, they are garbage collected and reset for next renaming.
The only way to implement partial register updates is to add 3-operand instructions, and have the old register state to be the third input. This is also more expensive than it sounds like, and on many modern CPUs you can execute only one 3-operand integer instruction per clock, vs 4+ 2-operand ones.
For loops, it is generally expected that you count down, with CX. The "LOOP" instruction is designed for this, so no special need to zero CX. SI and DI, the index registers may benefit from an optimized zeroing, for use with the "string" instructions.
Here I think Intel engineers didn't see the need and not having a special instruction to zero AX must simplify the decoder.
Huh, news to me. Although the amount of x86-64 assembly programming I've personally done is extremely minimal. Frankly, this is exactly the sort of architecture-specific detail I'm happy to let an ASM-generating library know for me rather than know myself.
xor wax, wax ; clear wax
xor sax, sax ; clear sax
xor fax, fax ; tru tru

It's quite interesting what neat tricks roll out once you've got a guaranteed zero register - it greatly reduces the number of distinct instructions you need for what is basically the same operation.
If trying to set the stack pointer, or copy the stack pointer, instead the underlying instruction is ADD SP, Xn, #0 i.e. SP = Xn + 0. This is because the stack pointer and zero register are both encoded as register 31 (11111). Some instructions allow you to use the zero register, others the stack pointer. Presumably ORR uses the zero register and ADD the stack pointer.
NOP maps to HINT #0. There are 128 HINT values available; anything not implemented on this processor executes as a NOP.
There are other operations that are aliased like CMP Xm, Xn is really an alias for SUBS XZR, Xm, Xn: subtract Xn from Xm, store the result in the zero register [i.e. discard it], and set the flags. RISC-V doesn't have flags, of course. ARM Ltd clearly considered them still useful.
There are other oddities, things like 'rotate right' is encoded as 'extract register from pair of registers', but it specifies the same source register twice.
Disassemblers do their best to hide this from you. ARM list a 'preferred decoding' for any instruction that has aliases, to map back to a more meaningful alias wherever possible.
That, and `add rd, rs, x0` could (like the zeroing idiom on x86), run entirely in the decoding and register-renaming stages of a processor.
RISC-V does actually have quite a few idioms. Some idioms are multi-instruction sequences ("macro ops") that could get folded into single micro-ops ("macro-op fusion"/"instruction fusion"): for example `lui` followed by `addi` for loading a 32-bit constant, and left shift followed by right shift for extracting a bitfield.
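The `lui`+`addi` idiom has one subtlety worth spelling out: `addi`'s 12-bit immediate is sign-extended, so the upper 20 bits need a +1 correction when bit 11 of the constant is set. A sketch in Python (helper name is made up):

```python
# Split a 32-bit constant into the RISC-V lui+addi pair. addi's immediate is
# 12 bits and sign-extended, so when the low 12 bits are >= 0x800 we treat
# them as negative and bump the lui portion to compensate.
def split_lui_addi(value: int):
    value &= 0xFFFFFFFF
    lo = value & 0xFFF
    if lo >= 0x800:            # addi will sign-extend: compensate in lui
        lo -= 0x1000
    hi = ((value - lo) >> 12) & 0xFFFFF
    return hi, lo              # lui rd, hi ; addi rd, rd, lo

for v in (0x12345678, 0xDEADBEEF, 0x7FF, 0x800, 0xFFFFF800):
    hi, lo = split_lui_addi(v)
    assert ((hi << 12) + lo) & 0xFFFFFFFF == v
```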
And when the instruction decoder in such a CPU with register renaming sees `xor eax, eax`, it just makes `eax` point to the zero register for instructions after it. It does not have to put any instruction into the pipeline, and it takes effectively 0 cycles. That is what makes the "zeroing idiom" so powerful.
Do we have any data showing that having a dedicated zero register is better than a short and canonical instruction for zeroing an arbitrary register?
You don't need a mov instruction, you just OR with $zero. You don't need a load immediate instruction you just ADDI/ORI with $zero. You don't need a Neg instruction, you just SUB with $zero. All your Compare-And-Branch instructions get a compare with $zero variant for free.
I refuse to say this "zero register" approach is better, it is part of a wide design with many interacting features. But once you have 31 registers, it's quite cheap to allocate one register to be zero, and may actually save encoding space elsewhere. (And encoding space is always an issue with fixed width instructions).
AArch64 takes the concept further, they have a register that is sometimes acts as the zero register (when used in ALU instructions) and other times is the stack pointer (when used in memory instructions and a few special stack instructions).
Which is funny because IMHO RISC-V instruction encoding is garbage. It was all optimized around the idea of fixed length 32-bit instructions. This leads to weird sized immediates (12 bits?) and 2 instructions to load a 32 bit constant. No support for 64 bit immediates. Then they decided to have "compressed" instructions that are 16 bits, so it's somewhat variable length anyway.
IMHO once all the vector, AI and graphics instructions are nailed down they should make RISC-VI where it's almost the same but re-encoding the instructions. Have sensible 16-bit ones, 32-bit, and use immediate constants after the opcodes. It seems like there is a lot they could do to clean it up - obviously not as much as x86 ;-)
The largest MOV available is 16 bits, but those 16 bits can be shifted by 0, 16, 32 or 48 bits, so the worst case for a 64-bit immediate is 4 instructions. Or the compiler can decide to put the data in a PC-relative pool and use ADR or ADRP to calculate the address.
ADD immediate is 12 bits but can optionally apply a 12-bit left-shift to that immediate, so for immediates up to 24 bits it can be done in two instructions.
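That MOVZ/MOVK splitting can be modeled in a few lines of Python (an illustrative sketch; the semantics follow the ARM definitions, with MOVZ zeroing the rest of the register and MOVK keeping it):

```python
# Model of building a 64-bit immediate from 16-bit chunks shifted by
# 0/16/32/48: movz for the first nonzero chunk, movk for the rest.
def movz_movk_sequence(value: int):
    seq = []
    for shift in (0, 16, 32, 48):
        chunk = (value >> shift) & 0xFFFF
        if chunk:
            op = "movz" if not seq else "movk"
            seq.append((op, chunk, shift))
    return seq or [("movz", 0, 0)]  # zero still needs one instruction

def materialize(seq):
    x = 0
    for op, chunk, shift in seq:
        if op == "movz":
            x = chunk << shift                               # zeroes the other bits
        else:
            x = (x & ~(0xFFFF << shift)) | (chunk << shift)  # keeps the other bits
    return x

for v in (0, 0x2A, 0x123456789ABCDEF0, 0xFFFF00000000FFFF):
    seq = movz_movk_sequence(v)
    assert len(seq) <= 4 and materialize(seq) == v
```

Note how a value like 0xFFFF00000000FFFF needs only two instructions, since the all-zero middle chunks never have to be written.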
ARM64 decoding is also pretty complex, far less orthogonal than ARM32. Then again, ARM32 was designed to be decodable on a chip with 25,000 transistors, not where you can spend thousands of transistors to decode a single instruction.
(Any instruction that can be similarly rephrased as a composition of more restricted elementary instructions is also a candidate for this macro-insn approach.)
I really like the idea of composition or standard prefixes. My favorite is the idea of replacing cmp/branch with "if". Where the condition is a predicate for the following instruction. For RISC-V it would eat a large part of the 16bit opcodes. Some form of load/store might be a good use for the remaining 16bit ops. Other things that might be a good prefix could be encoding data types (8,16,32,64 bit, sign extended, float, double) or a source/destination register. It might be interesting to see how a full ISA might be decomposed into smaller instruction fragments.
This is just a forward skip, which is optimized to a predicated insn already in some implementations.
>> This is just a forward skip, which is optimized to a predicated insn already in some implementations.
True, but make it a 16bit prefix and apply to all (or selected) instructions.
ISA design is always a tradeoff, https://ics.uci.edu/~swjun/courses/2023F-CS250P/materials/le... has some good details, but the TLDR is that RISC-V makes reasonable choices for a fairly "boring" ISA.
Strongly disagree. Throughput is cheap, latency is expensive. Any time you can fit a constant in the instruction fetch stream is a win. This is especially true for jump targets, because getting them resolved faster both saves power and improves performance.
> Most 64 bit constants in use can be sign extended from much smaller values
You should obviously also have smaller load instructions.
> will necessarily bloat instruction cache, stall your instruction decoder (or limit parallelism)
No, just have more fetch throughput.
> and will only be 2 cycles faster than a L1 cache load
Only on tiny machines will L1 cache load be 2 cycles. On a reasonable high-end machine it will be 4-5 cycles, and more critically (because the latency would usually be masked well by OoO), the energy required to engage the load path is orders of magnitude more than just getting it from the fetch.
And that's when it's not a jump target, when it's a jump target suddenly loading it using a load instruction adds 12+ cycles of latency.
> TLDR is that RISC-V makes reasonable choices for a fairly "boring" ISA.
No. Not even talking about constants, RISC-V makes insane choices for essentially religious reasons. Can you explain to me why, exactly, would you ever make jal take a register operand, instead of using a fixed link register and putting the spare bits into the address immediate?
Fetch throughput isn't unlimited. Modern x86 CPUs only have ~16-32B/cycle (from L2 once you're out of the uop cache). If you decode a single 10 byte instruction you're already using up a huge amount of the available decode bandwidth.
There absolutely are cases where a 64 bit load instruction would be an advantage, but ISA design is always a case of tradeoffs. Allowing 10 byte instructions has real cost in decode complexity, instruction bandwidth requirements, ensuring cacheline/page alignment etc. You have to weigh against that how frequent the instruction would be as well as what your alternative options are. Most immediates are small, and many others can be efficiently synthesized via 2 other instructions (e.g. shifts/xors/nots), and any synthesis that is 2 instructions or fewer will be cheaper than doing a load anyway. As a result you would end up massively complicating your architecture/decoders to benefit a fairly rare instruction, which probably isn't worthwhile. It's notable that aarch64 makes the same tradeoff here, and Apple's M series processors have an IPC advantage over the best x86.
> Can you explain to me why, exactly, would you ever make jal take a register operand, instead of using a fixed link register and putting the spare bits into the address immediate?
This mostly seems like a mistake to me. The rationale probably is that you need the other instructions anyway (not all jumps are returns), so adding a jal that doesn't take a register would take a decent percentage of the opspace, but the extra 5 bits would be very nice.
AFAIK, the reason RISC-V supports alternative link registers is that it allows for efficient -msave-restore, keeps the encoding orthogonal to LUI/AUIPC, and using the smaller immediate didn't impact codegen much.
There's a reason why AMD added r8-r15 to the architecture, and why intel is adding r16-r31..
But it does nothing to help you, the programmer, when your algorithm really needs to have 9 registers worth of data in registers and your CPU only has 8 architectural registers available to you. At that point, you either spill manually, or you take the performance hit from keeping the ninth value in memory instead of a register.
Don't get too carried away in the above, x86 is still a lot more complex than ARM or RISC-V. However the complexity is only a tiny part of a CPU and so it doesn't matter.
Modern ISAs try really hard to be independent from microarchitecture.
RISC-V has the most compact code on 64bit, with margin to boot.
On 32bit, it used to be behind Thumb2, but it's been the best since the bit manipulation and extra compressed extensions circa 2021.
There’s dozens of us! By the way, totally unaffiliated, but I have used fetchrss for those websites that have no feed.
Yeah, sadly the 6502 didn't allow you to do EOR A, while the Z80 did allow XOR A. If I remember correctly XOR A was AF and LD A, 0 was 3E 01[1]. So it saved a whole byte! And I think the XOR was 3 clock cycles faster than the LD. So less space taken up by the instruction, and faster.
I have a very distinct memory in my first job (writing x86 assembly) of the CEO walking up behind my desk and pointing out that I'd done MOV AX, 0 when I could have done XOR AX, AX.
[1] 3E 00