Updated on November 19, 2020
A question that frequently pops up in VFIO or GPU passthrough forums is which graphics card to buy: AMD or Nvidia? And the answer often depends on whom you ask.
Some people will tell you to steer clear of Nvidia graphics cards since their driver detects the virtual machine and quits.
Others mention the “reset bug” that’s been haunting AMD graphics cards for the last couple of years (see Wendell’s video interview of Linux kernel maintainer Greg Kroah-Hartman). So what’s the story?
Life was so much easier when there was only one graphics card vendor to seriously consider (guess who). But AMD has made a comeback in recent years and is now able to compete performance-wise with Nvidia, at least in the mid-range category. AMD GPUs such as the RX 5700 and RX 5700XT are often recommended on websites such as Techradar, Tom’s Hardware and PC Magazine, as well as by Linus Tech Tips. All the while Linus criticizes Nvidia for overpricing their new mid-range cards.
But can we take these recommendations at face value when planning for GPU passthrough? My short answer: absolutely not! Unfortunately, most reviews on the Internet are pretty useless, if not misleading, for our purposes. In addition to performance (under Windows) and perhaps price, we must consider some vital features that make graphics cards work with GPU passthrough.
Without further ado, here is the list of AMD bugs:
AMD GPU FLR Reset Not Working
This bug has been around for years. Practically all modern AMD graphics cards are not able to perform an FLR, or Function Level Reset.
FLR is a way to “soft” reset the graphics card so it can be reused, for example after you shut down the VM and start a new VM. As of this writing, many modern AMD graphics cards are safely reset only if you shut down the PC.
If you run your VM and never shut it down, FLR is almost a non-issue. AMD graphics cards will work fine when booting up the VM (for the first time). However, when you shut down the VM and wish to start the VM again, your entire host may hang.
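On a Linux host, one way to see whether a card even advertises FLR is to look at the device capabilities printed by `lspci -vv`, where the flag shows up as `FLReset+` or `FLReset-`. The following is a minimal sketch that parses such text. The function name `advertises_flr` and the sample text are my own illustrations, not output captured from a specific card. Keep in mind that this only reports what the device claims; it says nothing about whether a reset actually succeeds.

```python
import re
from typing import Optional


def advertises_flr(lspci_vv_text: str) -> Optional[bool]:
    """Return True for FLReset+, False for FLReset-, None if the flag is absent."""
    match = re.search(r"\bFLReset([+-])", lspci_vv_text)
    if match is None:
        return None  # no PCIe device capability line found in this text
    return match.group(1) == "+"


# Illustrative snippet in the style of `lspci -vv` output (not from a real card).
SAMPLE = """\
    DevCap: MaxPayload 256 bytes, PhantFunc 0, Latency L0s <4us, L1 unlimited
            ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset- SlotPowerLimit 0W
"""
```

To use it on a real host, you could feed it the output of `lspci -vv -s <address>` for the GPU you intend to pass through (root privileges are usually needed to see the capability lines).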
There is a 2018 presentation by Alex Williamson called VFIO Device Assignment Quirks, How to use Them and How to Avoid Them. Towards the end, someone who is presumably an AMD developer asks questions regarding FLR and mentions that AMD is working on it. It’s now 2020 and I have NOT heard the good news yet that AMD fixed this issue.
AMD has known about this bug for years, yet has provided no solution to it. In fact, AMD introduced the RX 5700 and RX 5700 XT just last year with the same bug that’s plagued older cards (Vega, etc.).
Is there a fix for it? Not really, but there are some workarounds. Geoff aka gnif has developed a patch to work around this problem. This patch works only for Navi cards (such as the recent 5000 series). Instead of using the non-functional PCIe FLR function to reset the card, this patch uses the powerplay tables to turn it off and on again.
Here is a comment from the developer:
AMD Navi 10 series GPUs require a vendor specific reset procedure.
According to AMD a PSP mode 2 reset should be enough however at this
time the details of how to perform this are not available to us.
Instead we can signal the SMU to enter and exit BACO which has the same
In plain English, AMD does not provide the details on how to perform a (PSP mode 2) reset. The patch by Geoff may work most of the time, but what if the VM hangs or crashes?
In the meantime I posted on the AMD Reddit to see if AMD would like to comment. Wendell from Level1Techs.com was quick to respond and subscribe. But no response from AMD, not even in private.
Edit November 19, 2020: Good news: AMD has just released the Radeon 6800 and Radeon 6800XT GPU models based on the RDNA2 architecture that are said to be free from the FLR reset bug, according to yesterday’s (p)review by Wendell.
But there is more good news for those who already own an AMD graphics card that has the reset bug. The developers of the kernel patch have recently announced the availability of a kernel module to address the FLR bug. This makes it easier to apply the workaround, as you only need to compile the module (or get a pre-compiled version for your kernel) and load it.
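After loading such a module, it is worth confirming that the kernel actually lists it before starting the VM. Here is a minimal sketch that checks the contents of `/proc/modules`, where the first field of each line is the module name. The module name `example-reset` is a placeholder of mine, not the real name of the workaround module, and the helper names are hypothetical.

```python
from pathlib import Path


def loaded_modules(proc_modules_text: str) -> set:
    """Collect module names: the first whitespace-separated field of each line."""
    return {
        line.split()[0]
        for line in proc_modules_text.splitlines()
        if line.strip()
    }


def is_module_loaded(name: str, proc_modules_text: str) -> bool:
    # The kernel lists module names with underscores, so normalise hyphens.
    return name.replace("-", "_") in loaded_modules(proc_modules_text)


def read_proc_modules(path: str = "/proc/modules") -> str:
    """Read the live module list on a Linux host."""
    return Path(path).read_text()


# Illustrative content in the /proc/modules format (placeholder module name).
SAMPLE_MODULES = (
    "vfio_pci 16384 0 - Live 0x0000000000000000\n"
    "example_reset 20480 0 - Live 0x0000000000000000\n"
)
```

On a real host you would call `is_module_loaded("<module name>", read_proc_modules())` with the name given in the module's own documentation.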
Latest 20.5.1 Driver Breaks GPU Passthrough on 5700XT
This is the latest story as of today. When I started to write this post I was going to mention older driver issues like BSOD when installing AMD Crimson drivers.
The problem may not be limited to one specific GPU model. For more on that see this VFIO Reddit forum thread.
My advice: don’t update the AMD graphics driver in Windows until this issue is fixed.
Nvidia has essentially two lines of graphics cards: the popular GeForce line of consumer and enthusiast GPUs and the professional Quadro, Tesla, Grid etc. lines. The professional line is geared towards workstation users or cloud/data center applications. Virtualization support is a key element in these markets.
Unlike AMD, Nvidia is playing in the big league – AI (artificial intelligence), cloud computing, GPU virtualization, you name it.
In the consumer market, Nvidia has a strong presence with everything from absolute basic “no-frills” GPUs to the fastest gaming GPU you can currently buy, with at least a dozen GPUs in between. Price-performance-wise Nvidia tends to be a bit more expensive, but has a lot more to offer.
As to the features, Nvidia has just about everything covered. Even my not so new Nvidia GTX 970 runs Folding@home without a problem. (“Folding on AMD GPUs is problematic in Linux due to poor OpenCL driver support from AMD”, as noted on the F@H website.)
Back to the subject: what about VGA passthrough? Here is my list of complaints:
Nvidia Driver Locks Virtualization Support for “Non-Professional” Cards
A couple of years ago Nvidia might have had a point in locking virtualization support in the driver software for consumer GPUs. Why? Because back then the GeForce consumer cards sold for a fraction of the price of the professional Quadro etc. equivalents, at exactly the same GPU performance. Companies could save a lot of money if they went with the consumer line.
So Nvidia created an obstacle in that their drivers check to see if the operating system runs in a virtual environment. This integrated test inside the Nvidia graphics driver has been around for quite some years.
But somebody outsmarted Nvidia and invented a patch and/or procedure to “fool” the driver. This workaround has been available for, I believe, at least 5-6 years. Nvidia surely knows about this simple trick, and so far they have done nothing to make it harder on us.
I would love to see Nvidia remove this “virtualization test” and officially support virtualization for the consumer products. I really can’t see why they haven’t done it, except that it takes R&D time to remove and test it. Nvidia doesn’t need this nonsense.
If, however, Nvidia decides to go tough on the VFIO / passthrough community, they could easily do so within their driver. This makes it very tricky to recommend Nvidia.
For now, use the workaround (see almost any tutorial, including mine).
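For readers who manage their VM with libvirt, the workaround is commonly described as two settings in the domain XML: hiding the KVM signature with `<kvm><hidden state='on'/></kvm>` and setting a custom Hyper-V vendor id. The sketch below applies both to a domain definition using Python's standard XML library. It is an illustration under those assumptions, not a copy of any particular tutorial; the function name `hide_hypervisor` and the default vendor id string are hypothetical choices of mine.

```python
import xml.etree.ElementTree as ET


def hide_hypervisor(domain_xml: str, vendor_id: str = "0123456789ab") -> str:
    """Return domain XML with the KVM signature hidden and a custom Hyper-V vendor id."""
    root = ET.fromstring(domain_xml)
    features = root.find("features")
    if features is None:
        features = ET.SubElement(root, "features")

    # <kvm><hidden state='on'/></kvm> hides the KVM signature from the guest.
    kvm = features.find("kvm")
    if kvm is None:
        kvm = ET.SubElement(features, "kvm")
    hidden = kvm.find("hidden")
    if hidden is None:
        hidden = ET.SubElement(kvm, "hidden")
    hidden.set("state", "on")

    # <hyperv><vendor_id .../></hyperv> replaces the default Hyper-V vendor string.
    hyperv = features.find("hyperv")
    if hyperv is None:
        hyperv = ET.SubElement(features, "hyperv")
    vendor = hyperv.find("vendor_id")
    if vendor is None:
        vendor = ET.SubElement(hyperv, "vendor_id")
    vendor.set("state", "on")
    vendor.set("value", vendor_id)  # libvirt limits this string to 12 characters

    return ET.tostring(root, encoding="unicode")
```

The function is idempotent, so running it on an already patched definition changes nothing. In practice you would dump the definition with `virsh dumpxml`, pass it through the function, and load the result back with `virsh define`, or simply make the same two changes by hand via `virsh edit`.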
Nvidia Driver Reverts to Legacy INTx
Every kid on the block knows that MSI (Message Signaled Interrupt) is the way to go. Yet with each driver update Nvidia reverts to legacy INTx (Interrupt x). This has become quite annoying.
You can manually edit your Windows registry every time this happens, or use a little program to make it easier (same link above).
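The registry edit in question sets the `MSISupported` value to 1 under the device's `Interrupt Management\MessageSignaledInterruptProperties` key. As a minimal sketch, the function below generates a `.reg` file for a given device instance path, so the change can be re-applied with a double click after each driver update. The function name `msi_reg_file` and the placeholder device path are hypothetical; take the real device instance path from the Details tab in Device Manager.

```python
def msi_reg_file(device_instance_path: str, enable: bool = True) -> str:
    """Build .reg file text that switches MSI on (or off) for one PCI device."""
    key = (
        "HKEY_LOCAL_MACHINE\\SYSTEM\\CurrentControlSet\\Enum\\"
        + device_instance_path
        + "\\Device Parameters\\Interrupt Management"
        + "\\MessageSignaledInterruptProperties"
    )
    value = 1 if enable else 0
    return (
        "Windows Registry Editor Version 5.00\r\n\r\n"
        f"[{key}]\r\n"
        f"\"MSISupported\"=dword:{value:08x}\r\n"
    )


# Placeholder path; substitute the instance path of your own graphics card.
EXAMPLE_PATH = r"PCI\VEN_XXXX&DEV_XXXX&SUBSYS_XXXXXXXX&REV_XX\INSTANCE_ID"
```

Write the returned text to a file such as `enable-msi.reg` inside the Windows guest, import it, and reboot the VM for the change to take effect. Remember that the graphics card's audio function is a separate device with its own instance path.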
We actually shouldn’t blame Nvidia for using legacy methods, as it was probably Microsoft who hesitated to implement MSI. For some giggles on how nicely Intel puts it, here is a quote:
MSI was introduced in revision 2.2 of the PCI spec in 1999 as an optional component. However, with the introduction of the PCIe specification in 2004, implementation of MSI became mandatory from a hardware standpoint. Unfortunately, software support in mainstream operating systems was slow in coming, forcing many MSI-capable PCIe* devices to operate in legacy mode.
Nvidia’s “virtualization” detection is nothing more than a nuisance. Using a simple workaround, Nvidia consumer graphics cards work perfectly fine with GPU passthrough. The same can be said for MSI support – it’s a performance tuning measure that can be enabled after driver updates.
Edit November 19, 2020: The release of the AMD Radeon 6800 and 6800XT graphics cards and the absence of the FLR reset bug is nothing short of a game changer. It gives hope that new AMD products based on the RDNA2 architecture may finally be cured of the FLR bug. For those on the lookout for a new graphics card, AMD may finally be an option. As with any new product, though, it might be worth waiting for more user reports.
My new verdict: With the new Radeon 6800 and 6800XT GPUs, AMD has finally made it into the VFIO GPU passthrough game. It’s too early to recommend these cards, but initial testing by Wendell is very promising.
There is always Nvidia, which has been working well with VFIO for years (provided you hide the hypervisor from the driver).
The great news is that finally you have a choice!
It remains to be seen if Intel provides us with additional options: Intel plans on entering the GPU market.