felix86 26.03

This month we added AVX, AVX2, BMI1 and F16C support!

AVX/AVX2 support is here!

This version of felix86 emulates all AVX and AVX2 instructions using RVV 1.0! Thanks to the FEX-Emu test suite, each instruction is extensively tested and all tests pass.

AVX is an x86 SIMD extension that adds 256-bit vectors and new instructions. Additionally, it introduces a new version for most of the SSE instructions that can use the 256-bit vectors and perform unaligned access, amongst other benefits.

Overall, 288 new instruction handlers were added, and 333 single-instruction tests pass. Binary tests can now detect AVX support and pass.

Fun parts

Here’s some fun parts of AVX emulation.

VZEROALL, VZEROUPPER

The RVV 1.0 statically allocated registers for YMM0-YMM15 were changed to v16-v31. This means we can use LMUL=8 to perform VZEROALL as two grouped VXOR instructions, and VZEROUPPER as two grouped and masked VXOR instructions.

Group writeback/restore

Since the XmmReg struct is now 256 bits, we can writeback and restore the XMM state in two LMUL=8 stores/loads, in VLEN=256 systems. This may improve performance in some applications that end up exiting to the dispatcher too often.

VGATHER

The VGATHER set of instructions in x86 (not to be confused with RISC-V’s VRGATHER) are basically indexed loads that work slightly differently from RISC-V. The indices may be optionally scaled and may need to be sign-extended before passed to the equivalent RISC-V instruction.

With the Zicclsm extension, which is mandatory in RVA23, indices that end up performing a misaligned load are supported, without needing to resort to a potentially slower instruction sequence.

Currently, there’s no support for VGATHER instructions that may fault during one of the element loads. We’ll implement this feature when there’s a program that relies on it.

Annoying parts

Below are a couple annoyances with emulating AVX on RVV.

No zeroing of agnostic bits

RVV 1.0 has two modes of dealing with tail and mask bits: Undisturbed and agnostic.

Undisturbed mode doesn’t affect the tail and mask bits. This is helpful for SSE emulation, since the legacy SSE instructions don’t modify the top 128 bits of the YMM registers.

Agnostic mode does something weird. Based on implementation, it will work the same as undisturbed, or replace the bits with ones. In version RVV 0.7.1, the bits were replaced with zeroes. This was changed in RVV 1.0 to replacing with ones, with the justification:

The value of all 1s instead of all 0s was chosen for the overwrite value to discourage software developers from depending on the value written.

This is unfortunate, because it means that there’s no instruction-free way of zeroing the tail or mask bits. Thus, performing the zeroing requires 1-3 extra instructions.

This zeroing of tail and mask bits is important for emulating AVX instructions that deal with XMMs, as they need to zero the upper 128 bits. It is also important for instructions like VMASKMOVPD, which zero the masked elements. It will also be important when/if we emulate AVX-512, as every instruction can perform similar zeroing masking.

Unfortunately, since RVV 1.0 is ratified, changing the agnostic bits to be zeroes instead of ones is not possible for software compatibility reasons. We hope there’s a mode that allows for zeroing agnostic bits in a future version, perhaps via a bit in vtype.

Cumbersome shuffles

VRGATHER is a powerful instruction in RISC-V, however it has some limitations. The iota cannot be an immediate, except in the case of VRGATHER.VI, which can only be used to broadcast a single element. The iota also cannot be picked from bits of a GPR. This means the iota needs to be loaded from a literal pool in memory in most cases.

An optimal instruction would be able to perform shuffles between two sources, using an immediate iota, similar to x86 shuffles. This would almost certainly need to be a >32-bit instruction, which RISC-V is capable of.

Until then, we need to use literal pools and two VRGATHER instructions to emulate most VSHUFPS instructions.

BMI1 support

BMI1 is a bit manipulation extension. Most of its instructions don’t match one-to-one with RISC-V equivalents, but are relatively simple to emulate.

F16C support

F16C is an extension that adds conversions from 16-bit floats to 32-bit floats. It is equivalent to the Zvfhmin extension, and is now supported when the extension is present, which is mandatory in RVA23.

Flatpaks

This month we also added initial support for Flatpaks!

The Hytale launcher, which is a Flatpak, running with felix86

Unfortunately there may be unrelated bugs with Hytale, causing it to not work. Nevertheless, Flatpaks should be able to be installed now.

In felix86 --shell, use flatpak install --user /path/to/my/game.flatpak to install and flatpak run com.MyOrg.MyGame to run your Flatpak app. System-wide installations are not tested.

Preliminary seccomp support

Flatpaks needed seccomp to run, particularly the filter mode. The BPF is now supported and recompiled to RISC-V and validates each syscall. The support is limited to the scope of supporting Flatpaks, and not all filters are supported.

AES support

The AES extension in RISC-V adds hardware acceleration for AES encryption and decryption. It is now supported in felix86!

The encryption instructions AESENC and AESENCLAST match up perfectly with the RISC-V equivalents. The decryption instruction AESDECLAST also matches up 1-to-1.

The AESDEC instruction performs InvMixColumns and AddRoundKey in the opposite order that VAESDM.VV does. My initial solution was to apply MixColumns to the key (by applying InvMixColumns via AES64IM 3 times in a row for each 64-bit word in the key) and then use VAESDM.VV.

As camel-cdr pointed out, there’s a much better solution, which is to use VAESDF.VV with a key of zeroes to apply the InvShiftRows and InvSubBytes transformations without AddRoundKey. Then, using a specific constant in VAESDM.VV can isolate the InvMixColumns transformation, and finally AddRoundKey can be done manually with a XOR. Overall, a great improvement over my initial solution.

VAESDM.VV can also be used in this way to emulate AESIMC, thus not needing the scalar AES64IM instruction.

AESKEYGENASSIST doesn’t match 1-to-1 with any instruction. It is possible to emulate some of its functionality with VAESKF1.VI, but until we encounter it in a game it is emulated in software.

Optimizations

Felix86 is >1.5 years old now, and some of the older instruction handlers were quite bad. One such example was the PHMINPOSUW instruction. While it would use VREDMINU to find the minimum element, a slow loop was used to find its index. When thinking about implementing VPHMINPOSUW and looking at my PHMINPOSUW implementation, I realized I can simply use VFIRST to find the position of the minimum element, vastly improving the instruction performance.

Another example is UNPCKHPS, which would use an ugly sequence of VRGATHER instructions, while widening adds and slides suffice.


Thanks for reading this post.

If you like this project, please give us a star on Github: https://github.com/OFFTKP/felix86

Written on March 1, 2026