"M" Performance... Pessimizations! or: SSE is a Scam.
This article is a continuation of the M series.
The vpatch given below entirely re-implements the TLB (MMU) of M to use SIMD instructions from the AMD64 SSE2 set.
Whereas previously TLB entries were kept in memory and searched iteratively, now we keep the Tags (3 bytes each) sliced into three XMM registers, and search them in parallel, e.g.:
%define TLB_TAG_BYTE_0 xmm5 ; Byte 0 of Tag
%define TLB_TAG_BYTE_1 xmm6 ; Byte 1 of Tag
%define TLB_TAG_BYTE_2 xmm7 ; Byte 2 of Tag
%define XMM_T0         xmm8 ; Temp

;; .....

;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
;; Tag being sought is in ecx;
;; Stored Tag slices are in TLB_TAG_BYTE_0 .. TLB_TAG_BYTE_2.
;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
;; Search for B0, B1, B2 of Tag, accumulate result in ebx ;;
;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
    ; Search for Byte 0 of Tag:
    mov       edx, ecx              ; edx := ecx (wanted Tag)
    and       edx, 0xFF             ; Byte 0 (lowest) of wanted Tag
    ; Fill T0 with 16 copies of Tag Byte 0:
    movd      XMM_T0, edx
    punpcklbw XMM_T0, XMM_T0
    punpcklwd XMM_T0, XMM_T0
    pshufd    XMM_T0, XMM_T0, 0
    ; Now SIMD-compare:
    pcmpeqb   XMM_T0, TLB_TAG_BYTE_0
    ; Get the result mask of the compare:
    pmovmskb  ebx, XMM_T0           ; i-th bit in ebx = 1 where match B0
    test      ebx, ebx              ; if Byte 0 of Tag not found:
    jz        .Not_Found            ; ... then go straight to 'not found'
;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
    ; Search for Byte 1 of Tag:
    mov       edx, ecx              ; edx := ecx (wanted Tag)
    shr       edx, 8                ; Byte 1 (middle) of wanted Tag
    and       edx, 0xFF
    ; Fill T0 with 16 copies of Tag Byte 1:
    movd      XMM_T0, edx
    punpcklbw XMM_T0, XMM_T0
    punpcklwd XMM_T0, XMM_T0
    pshufd    XMM_T0, XMM_T0, 0
    ; Now SIMD-compare:
    pcmpeqb   XMM_T0, TLB_TAG_BYTE_1
    ; Get the result mask of the compare:
    pmovmskb  edx, XMM_T0           ; i-th bit in edx = 1 where match B1
    and       ebx, edx              ; Keep only where B0 also matched
    test      ebx, ebx              ; if Bytes 0+1 of Tag not found:
    jz        .Not_Found            ; ... then go straight to 'not found'
;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
    ; Search for Byte 2 of Tag:
    mov       edx, ecx              ; edx := ecx (wanted Tag)
    shr       edx, 16               ; Byte 2 (top) of wanted Tag
    and       edx, 0xFF
    ; Fill T0 with 16 copies of Tag Byte 2:
    movd      XMM_T0, edx
    punpcklbw XMM_T0, XMM_T0
    punpcklwd XMM_T0, XMM_T0
    pshufd    XMM_T0, XMM_T0, 0
    ; Now SIMD-compare:
    pcmpeqb   XMM_T0, TLB_TAG_BYTE_2
    ; Get the result mask of the compare:
    pmovmskb  edx, XMM_T0           ; i-th bit in edx = 1 where match B2
    and       ebx, edx              ; Keep only where B0,B1 also matched
    test      ebx, ebx              ; if Bytes 0+1+2 of Tag not found:
    jz        .Not_Found            ; ... then go straight to 'not found'
;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
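For readers who would rather experiment outside of M, the lookup in the listing above can be restated in C with SSE2 intrinsics. This is only a rough sketch and is not part of the vpatch. The names tlb_tags, tlb_clear, tlb_set_tag and tlb_find are made up for illustration, and the sketch assumes 16 entries (one byte lane per entry) with no separate "valid" bit, so a cleared table appears to hold Tag 0 in every slot.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

#define TLB_ENTRIES 16  /* assumption: one byte lane per entry */

/* Hypothetical container: byte k of entry i's Tag sits in lane i of slice[k]. */
typedef struct {
    __m128i slice[3];
} tlb_tags;

/* Zero all three slices. */
static void tlb_clear(tlb_tags *t)
{
    for (int k = 0; k < 3; k++)
        t->slice[k] = _mm_setzero_si128();
}

/* Store a 24-bit Tag into entry `slot`, one byte per slice. */
static void tlb_set_tag(tlb_tags *t, unsigned slot, uint32_t tag)
{
    assert(slot < TLB_ENTRIES);
    for (int k = 0; k < 3; k++) {
        uint8_t lanes[TLB_ENTRIES];
        _mm_storeu_si128((__m128i *)lanes, t->slice[k]);
        lanes[slot] = (uint8_t)(tag >> (8 * k));
        t->slice[k] = _mm_loadu_si128((const __m128i *)lanes);
    }
}

/* Return a 16-bit mask: bit i is set iff entry i holds `tag`; 0 if none does.
   Like the assembly, this gives up as soon as a byte matches nowhere. */
static unsigned tlb_find(const tlb_tags *t, uint32_t tag)
{
    unsigned mask = 0xFFFF;
    for (int k = 0; k < 3 && mask != 0; k++) {
        /* 16 copies of byte k of the wanted Tag */
        __m128i want = _mm_set1_epi8((char)(uint8_t)(tag >> (8 * k)));
        /* pcmpeqb + pmovmskb: one result bit per entry */
        mask &= (unsigned)_mm_movemask_epi8(_mm_cmpeq_epi8(want, t->slice[k]));
    }
    return mask;
}
```

The mask returned by tlb_find plays the role of ebx in the assembly: zero means "not found", and otherwise the position of the set bit gives the index of the matching entry.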
Surprisingly, this was found to slow down the execution of M by approximately 30% (as measured by the Dhrystone benchmark).
Before anyone comments, I am aware that the punpcklbw/punpcklwd/pshufd sequence can be replaced with one instruction (pshufb) on processors that support SSSE3. However, none of the machines on which I was considering deploying this program support it. Or ever will.
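As a sanity check of what the three-step broadcast actually does, here is the same sequence written with SSE2 intrinsics, each of which corresponds to one of the instructions named above. Again this is a sketch for experimentation only; broadcast_byte_sse2, all_lanes_equal and check_broadcast are hypothetical helper names, not anything found in M.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Copy the low byte of `b` into all 16 byte lanes, using SSE2 only. */
static __m128i broadcast_byte_sse2(uint32_t b)
{
    __m128i x = _mm_cvtsi32_si128((int)(b & 0xFF)); /* movd:      b,0,0,0,...  */
    x = _mm_unpacklo_epi8(x, x);                    /* punpcklbw: b,b,0,0,...  */
    x = _mm_unpacklo_epi16(x, x);                   /* punpcklwd: b,b,b,b,0... */
    return _mm_shuffle_epi32(x, 0);                 /* pshufd 0:  low dword to all four */
}

/* Return 1 if every byte lane of `v` equals `b`, else 0. */
static int all_lanes_equal(__m128i v, uint8_t b)
{
    uint8_t lanes[16];
    _mm_storeu_si128((__m128i *)lanes, v);
    for (int i = 0; i < 16; i++)
        if (lanes[i] != b)
            return 0;
    return 1;
}

/* Exhaustively confirm the broadcast for every possible byte value. */
static void check_broadcast(void)
{
    for (unsigned b = 0; b < 256; b++)
        assert(all_lanes_equal(broadcast_byte_sse2(b), (uint8_t)b));
}
```

A C compiler given _mm_set1_epi8 is free to pick whichever broadcast sequence the target supports, which makes this form convenient for comparing the generated code across machines.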
I have decided to post the patch regardless, so that others may attempt to determine whether this holds true on later (i.e. newer than the AMD 2393SE I tested on) irons.
You will need:
- A Keccak-based VTron (for this and all subsequent patches.)
- m_genesis.kv.vpatch
- m_genesis.kv.vpatch.asciilifeform.sig
- errata_slaveirq.kv.vpatch
- errata_slaveirq.kv.vpatch.asciilifeform.sig
- tlb_and_exc_speedup.kv.vpatch
- tlb_and_exc_speedup.kv.vpatch.asciilifeform.sig
- simd_tlb_lookup.kv.vpatch
- simd_tlb_lookup.kv.vpatch.asciilifeform.sig
- simd_tlb_errata.kv.vpatch
- simd_tlb_errata.kv.vpatch.asciilifeform.sig
- (OPTIONAL DEMO) linux-3-16-70-bigendian-100hz.bin.gz
Add the above vpatches and seals to your V-set, and press to simd_tlb_errata.kv.vpatch (formerly simd_tlb_lookup.kv.vpatch).
Build and test as described in the previous article. Dhrystone is included in the demo booter.
Edit: removed the nonfunctional TLB cache. Performance is now approximately on-par with the non-SIMD version. SSE is still a scam.
Edit: I neglected to document this earlier: the poweroff command in the Busybox shell will cleanly exit the emulator.
~Probably to be not continued!~
Currently I suspect that this line of research is a dead end!
Though at some point I will post the kernel patches, so that someone else can continue smashing his head against this wall, if he so wishes.