I feel like the importance of padding is a bit understated on this page - BLAS and LAPACK require LDA and LDB parameters, and you can definitely tune these to the page size of a particular system/machine to improve performance.
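To make the padding point concrete, here is a rough sketch in C of how a padded leading dimension works for a column-major matrix, the layout BLAS and LAPACK expect. The helper names `padded_ld` and `elem` are made up for illustration and are not part of any library, and the right alignment value is machine-dependent, so treat the numbers as assumptions to benchmark rather than recommendations.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: round the row count up to a multiple of
   `align_elems` elements, giving a leading dimension (LDA) such that
   every column starts at the same offset within an aligned block. */
static size_t padded_ld(size_t rows, size_t align_elems) {
    assert(align_elems > 0);
    return (rows + align_elems - 1) / align_elems * align_elems;
}

/* Hypothetical helper: address of element (i, j) in a column-major
   matrix whose columns are `lda` elements apart. Rows rows..lda-1 of
   each column are padding and are never touched by the computation. */
static double *elem(double *a, size_t lda, size_t i, size_t j) {
    return &a[i + j * lda];
}
```

The buffer then needs `lda * cols` elements instead of `rows * cols`, and `lda` is what you would pass as the LDA argument while still passing the true row count as M.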
When working with BLAS/LAPACK or other matrix libs, you can often apply a little linear algebra to reshape the problem rather than the input data, and avoid a transpose altogether.
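One common instance of this, sketched in C: a row-major matrix X has exactly the same bytes as a column-major X^T, and (AB)^T = B^T A^T, so a column-major multiply routine can produce a row-major product just by swapping its operands. The function `gemm_colmajor` below is a naive stand-in written for this example, not a real BLAS call, and `gemm_rowmajor` is likewise a hypothetical name.

```c
#include <stddef.h>

/* Naive stand-in for a column-major GEMM:
   C (m x n) = A (m x k) * B (k x n), with leading dimensions lda/ldb/ldc. */
static void gemm_colmajor(size_t m, size_t n, size_t k,
                          const double *a, size_t lda,
                          const double *b, size_t ldb,
                          double *c, size_t ldc) {
    for (size_t j = 0; j < n; j++) {
        for (size_t i = 0; i < m; i++) {
            double sum = 0.0;
            for (size_t p = 0; p < k; p++)
                sum += a[i + p * lda] * b[p + j * ldb];
            c[i + j * ldc] = sum;
        }
    }
}

/* Row-major C (m x n) = A (m x k) * B (k x n) without moving any data.
   Viewed as column-major, the buffers hold A^T, B^T and C^T, so we ask
   the column-major routine for C^T = B^T * A^T by swapping operands. */
static void gemm_rowmajor(size_t m, size_t n, size_t k,
                          const double *a, const double *b, double *c) {
    gemm_colmajor(n, m, k, b, n, a, k, c, n);
}
```

Real GEMM interfaces also take per-operand transpose flags, which cover many of the remaining cases without an explicit transpose pass.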
Actually I think I recall some GPUs storing textures that way, but I'm not entirely sure.
https://github.com/dzaima/CBQN/blob/v0.11.0/src/singeli/src/...
The language is oriented towards compile-time array programming instead of managing a bunch of individual vectors. So you have runtime vec_select{} (docs at [1]), mirrored by compile-time select{}, and the indices generated by pairs{} can be used in either.
[0] https://github.com/mlochbaum/Singeli/
[1] https://github.com/mlochbaum/Singeli/tree/master/include#sim...
The most efficient way to swap 2 values stored in memory is to use 2 load instructions and 2 store instructions, like "load X in R1; load Y in R2; store R1 in Y; store R2 in X".
Therefore, for swapping memory values the XOR trick has never been useful, in the entire history of automatic computers.
For swapping data that is stored in internal CPU vector registers or matrix registers there are special shuffle instructions, which implement various kinds of transpositions.
In-place swapping was efficient mostly in the distant past, for swapping general-purpose registers on CPUs that lacked dedicated exchange/swap instructions. Intel/AMD CPUs have always had exchange instructions (XCHG), so in-place swapping has never been useful on them under any circumstances, ever since the launch of the IBM PC 45 years ago.
Today, the XOR trick may still be useful for swapping general-purpose registers on some microcontrollers, but in the most popular ISAs, such as ARM and RISC-V, most GPRs are equivalent, so the need to swap them arises very rarely, only in certain kinds of loops, and even there swapping can frequently be avoided by unrolling the loops.
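For comparison, here is a small C sketch of the two approaches applied to values in memory. The function names are made up for this example. What a compiler actually emits depends on the target and flags, but the temporary-based version maps directly onto the "two loads, two stores" sequence described above, while the XOR version needs extra read-modify-write steps and an aliasing guard.

```c
#include <stdint.h>

/* Swap through temporaries: load X, load Y, store to Y, store to X. */
static void swap_tmp(uint32_t *x, uint32_t *y) {
    uint32_t r1 = *x;
    uint32_t r2 = *y;
    *y = r1;
    *x = r2;
}

/* XOR swap on memory operands. Each step reads and writes memory, and
   without the guard a call with x == y would zero the value, because
   the first step would compute *x ^ *x. */
static void swap_xor(uint32_t *x, uint32_t *y) {
    if (x == y)
        return;
    *x ^= *y;
    *y ^= *x;
    *x ^= *y;
}
```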
It might benefit the world if there were philanthropic funding for people like this to do more public research and writing. There's so much information and wisdom in some people's brains that deserves the chance to be written down and released into the public domain.