April 2025 update

felix86 is a new x86-64 userspace emulator for RISC-V. It is aimed at achieving good performance in games, and as of now is in relatively early development. A few games are already fully working. As this is the first post, we are going to go through a brief introduction.

Inner workings

felix86 emulates an x86-64 CPU running in userspace. That is to say, it is not a virtual machine like VMware; rather, it directly translates the instructions of an application and mostly uses the host Linux kernel to handle syscalls.

Currently, translation happens at execution time, also known as just-in-time (JIT) recompilation. The JIT recompiler in felix86 is focused on fast compilation speed and performs minimal optimizations. It makes use of extensions found on the host system, such as the vector extension for SIMD operations or the B extension for emulating x86 bit manipulation extensions like BMI. The only mandatory extensions for felix86 are G, which every RISC-V general purpose computer should already have, and v1.0 of the standard vector extension.
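To illustrate how a host extension can shorten a translation, here is a minimal sketch for the BMI1 andn instruction. It is not felix86's actual emitter: the function name, the host_has_zbb parameter and the scratch register choice are all made up for this example, the output is just assembly text, and the flag updates that andn performs on x86 are left out.

```python
def emit_andn(rd, src1, src2, host_has_zbb, scratch="t0"):
    """Return RISC-V assembly lines for x86 BMI1 `andn rd, src1, src2`.

    The x86 instruction computes rd = ~src1 & src2.
    Hypothetical helper; x86 flag updates are intentionally omitted.
    """
    if host_has_zbb:
        # RISC-V `andn rd, rs1, rs2` (Zbb, part of B) computes rs1 & ~rs2,
        # so the operand order is swapped relative to x86.
        return [f"andn {rd}, {src2}, {src1}"]
    # Fallback for hosts without Zbb: invert, then AND.
    return [
        f"not {scratch}, {src1}",
        f"and {rd}, {scratch}, {src2}",
    ]


print(emit_andn("a0", "a1", "a2", host_has_zbb=True))
print(emit_andn("a0", "a1", "a2", host_has_zbb=False))
```

With the extension present the translation is a single instruction; without it, two instructions and a scratch register are needed.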

Flags

The RISC-V architecture, unlike x86-64, has no flags register, so flag calculations must be emitted explicitly for x86 instructions that set flags, like add. In most cases the flags go unused, so felix86 avoids emitting unnecessary flag calculations. It does this with a forward scan of the instruction sequence it is about to compile (the basic block), finding the points at which flags are needed and the points at which they are defined, effectively computing liveness ranges for each flag. Instructions that would compute flags first check whether those flags are actually used.
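The idea can be sketched in a few lines. This is only an illustration of the technique, not felix86's code: Insn, needed_flags and live_out are hypothetical names, and treating every flag as live at the end of the block is a conservative assumption made for this example.

```python
from dataclasses import dataclass

FLAGS = frozenset({"OF", "CF", "SF", "ZF"})


@dataclass(frozen=True)
class Insn:
    name: str
    reads: frozenset = frozenset()   # flags this instruction consumes
    writes: frozenset = frozenset()  # flags this instruction defines


def needed_flags(block, live_out=FLAGS):
    """For each instruction, return which of the flags it writes must be computed."""
    result = []
    for i, insn in enumerate(block):
        pending = set(insn.writes)  # flags whose fate is not yet known
        used = set()
        # Scan forward until every written flag is either read or overwritten.
        for later in block[i + 1:]:
            if not pending:
                break
            hit = pending & later.reads  # read before being redefined: needed
            used |= hit
            pending -= hit
            pending -= later.writes      # redefined without being read: dead
        # Flags that survive to the end of the block are needed only if the
        # caller says they are live on exit (assumption: all of them).
        used |= pending & live_out
        result.append(frozenset(used))
    return result


block = [
    Insn("add", writes=FLAGS),
    Insn("sub", writes=FLAGS),
    Insn("jz", reads=frozenset({"ZF"})),
]
for insn, flags in zip(block, needed_flags(block, live_out=frozenset())):
    print(insn.name, sorted(flags))
```

In this block the add needs no flag calculations at all, because sub overwrites every flag before anything reads them. If nothing is live at block exit, sub only needs to compute ZF for the jz.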

Registers

RISC-V has 31 general purpose registers, 32 floating-point registers and 32 vector registers in the V extension. On the other hand, x86-64 has 16 GPRs and 16 XMMs, at least until AVX-512 and APX. Because of this, we can allocate 16 out of the 31 RISC-V GPRs for the 16 x86-64 GPRs, and 4 more for the most important flags: OF, CF, SF and ZF. That leaves us with 11 registers for other uses. One of them stores a pointer to the x86-64 state, which we need to access frequently, and we also don’t touch the sp, gp and tp registers as they serve a special purpose in the calling convention. That leaves us with 7 GPRs for scratch usage, which we allocate as we emit code. For the sake of simplicity, there’s no linear scan or graph coloring for these. One day there may be!

Single instruction multiple data

The vector extension in RISC-V has so far been more than adequate for emulating x86-64 SSE instructions, and I think that will continue to be the case when we get to AVX. Only a few instructions have been difficult to emulate, namely the pcmpxstrx group of instructions, which are currently emulated with a C function rather than recompiled code. Most other instructions have caused no problems and translate relatively smoothly.

Some SIMD translations are rougher than others, while some translate to a single instruction if you exclude vsetivli.

Some of these translations may be suboptimal! Contributions are welcome! Alternatively, you can open an issue.

x87

Currently x87 is probably the least optimized part of the translation. There’s no register allocation: x87 registers are loaded from memory at the start of each instruction and stored back at the end. Luckily, most games rarely use x87 instructions. This will need optimizing in the future, as older games use x87 much more, especially ones compiled for 32-bit x86.

Other optimizations

felix86 also performs block linking, which is quite important for performance. Basic blocks that jump to a known target are patched to jump to that target directly (once it is eventually compiled) instead of returning to the dispatcher.
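Here is a toy model of block linking, with guest addresses as integers and an object reference standing in for a patched jump. Block, BlockCache, compile and the waiting table are hypothetical names for this sketch; a real recompiler would rewrite machine code in place instead.

```python
from collections import defaultdict

DISPATCHER = "dispatcher"


class Block:
    def __init__(self, guest_addr, target=None):
        self.guest_addr = guest_addr
        self.target = target    # statically known guest jump target, or None
        self.exit = DISPATCHER  # where the block's final jump currently goes


class BlockCache:
    def __init__(self):
        self.compiled = {}                # guest address -> Block
        self.waiting = defaultdict(list)  # guest target -> blocks to patch later

    def compile(self, guest_addr, target=None):
        block = Block(guest_addr, target)
        self.compiled[guest_addr] = block
        if target is not None:
            if target in self.compiled:
                # Target already exists: link to it right away.
                block.exit = self.compiled[target]
            else:
                # Target not compiled yet: exit via the dispatcher for now.
                self.waiting[target].append(block)
        # Patch every earlier block that was waiting for this address.
        for waiter in self.waiting.pop(guest_addr, []):
            waiter.exit = block
        return block


cache = BlockCache()
a = cache.compile(0x1000, target=0x2000)
print(a.exit)        # still the dispatcher, 0x2000 is not compiled yet
b = cache.compile(0x2000)
print(a.exit is b)   # a now jumps straight to b
```

Until the target exists, the block behaves exactly as an unlinked one would, so linking is purely an optimization and can happen lazily.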

There’s also an option for return address prediction, to minimize jumps to the dispatcher even more. In most cases each call is eventually matched by a return, which means we can predict the return address of each ret with pretty good accuracy. This involves pushing a predicted guest address and a host return address to the stack on each call. If the prediction is right, we can jump to the host address directly, and if the prediction fails we can just jump to the dispatcher.
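The prediction logic can be modelled as a small stack of (guest address, host address) pairs. This is a sketch only: ReturnPredictor, on_call and on_ret are invented names, host addresses are represented as strings, and what happens to the stack after a misprediction is an assumption of this example.

```python
DISPATCHER = "dispatcher"


class ReturnPredictor:
    def __init__(self):
        self._stack = []  # (predicted guest return address, host address)

    def on_call(self, guest_return_addr, host_return_addr):
        # A call predicts that the matching ret will go to the instruction
        # after the call, whose translated code lives at host_return_addr.
        self._stack.append((guest_return_addr, host_return_addr))

    def on_ret(self, actual_guest_addr):
        # Returns where execution should continue on the host.
        if self._stack:
            predicted, host = self._stack.pop()
            if predicted == actual_guest_addr:
                return host  # prediction hit: skip the dispatcher
        # Miss or empty stack: fall back to the slow path.
        return DISPATCHER


rp = ReturnPredictor()
rp.on_call(0x401005, "host_block_A")
print(rp.on_ret(0x401005))  # matches the prediction
print(rp.on_ret(0x401234))  # nothing left to predict with
```

A miss costs nothing in correctness, since the dispatcher can always find or compile the right block; the prediction only removes a lookup on the common path.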

Currently I am working on thunking, which means using host libraries in place of guest ones for speed and compatibility. For example, if a chip has a GPU with RISC-V drivers but no x86-64 drivers, thunking can let guest applications use those RISC-V drivers and so make use of the GPU.


Thanks for reading the April post! That’s all for now! See you in a month or two.

Make sure to check out (and star) the repo: https://github.com/OFFTKP/felix86

Written on April 1, 2025