A question that frequently pops up in VFIO or GPU passthrough forums is which graphics card to buy: AMD or Nvidia? And the answer often depends on whom you ask.
Some people will tell you to stay clear of Nvidia graphics cards since their driver detects the virtual machine and quits.
Others mention the “reset bug” that’s been haunting AMD graphics cards for the last couple of years (see Wendell’s video interview of Linux kernel maintainer Greg Kroah-Hartman). So what’s the story?
Life was so much easier when there was only one graphics card vendor to seriously consider (guess who). But AMD has made a comeback in recent years and is now able to compete performance-wise with Nvidia, at least in the mid-range category. AMD GPUs such as the RX 5700 and RX 5700XT are often recommended on websites such as Techradar, Tom’s Hardware and PC Magazine, as well as by Linus Tech Tips. All the while Linus criticizes Nvidia for overpricing their new mid-range cards.
But can we take these recommendations at face value when planning for GPU passthrough? My short answer: absolutely not! Unfortunately most reviews on the Internet are pretty useless if not misleading. Because in addition to performance (under Windows) and perhaps price, we must consider some vital features that make graphics cards work with GPU passthrough.
Without further ado, here is the list of AMD bugs:
AMD GPU FLR Reset Not Working
This bug has been around for years. Practically all modern AMD graphics cards are unable to perform an FLR, or Function Level Reset.
FLR is a way to “soft” reset the graphics card so it can be reused, for example after you shut down the VM and start a new one. As of this writing, many modern AMD graphics cards are safely reset only if you shut down the PC.
If you run your VM and never shut it down, FLR is almost a non-issue. AMD graphics cards will work fine when booting up the VM (for the first time). However, when you shut down the VM and wish to start the VM again, your entire host may hang.
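If you want to see what your own card advertises, here is a minimal sketch in Python. It assumes you have saved the output of `sudo lspci -vv -s <address>` for the GPU and looks for the `FLReset+` or `FLReset-` flag that lspci prints in the DevCap block. The function name `flr_advertised` and the sample text are made up for illustration. Keep in mind that the flag only tells you what the device claims, not whether a reset actually works in practice.

```python
import re
from typing import Optional


def flr_advertised(lspci_vv_output: str) -> Optional[bool]:
    """Return True for FLReset+, False for FLReset-, None if DevCap has no such flag."""
    # Restrict the search to the DevCap block, because newer lspci versions
    # also print an FLReset flag in the DevCtl block.
    devcap = re.search(r"DevCap:(.*?)(?=DevCtl:|\Z)", lspci_vv_output, re.DOTALL)
    if devcap is None:
        return None
    flag = re.search(r"FLReset([+-])", devcap.group(1))
    if flag is None:
        return None
    return flag.group(1) == "+"


# Illustrative excerpt only; real output depends on the card and lspci version.
SAMPLE = """\
        DevCap: MaxPayload 256 bytes, PhantFunc 0, Latency L0s <4us, L1 unlimited
                ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset- SlotPowerLimit 0.000W
        DevCtl: CorrErr- NonFatalErr- FatalErr- UnsupReq-
"""

print(flr_advertised(SAMPLE))
```

A result of `None` simply means the text did not contain a DevCap block with the flag, for example because lspci was run without root privileges.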
There is a 2018 presentation by Alex Williamson called VFIO Device Assignment Quirks, How to use Them and How to Avoid Them. Towards the end, someone who is presumably an AMD developer asks questions regarding FLR and mentions that AMD is working on it. It’s now 2020 and I have NOT yet heard the good news that AMD fixed this issue.
AMD has known about this bug for years, yet has provided no solution to it. In fact, AMD introduced the RX 5700 and RX 5700 XT just last year with the same bug that’s plagued older cards (Vega, etc.).
Is there a fix for it? Not really, but there are some workarounds. Geoff aka gnif has developed a patch to work around this problem. This patch works only for Navi cards (such as the recent 5000 series). Instead of using the non-functional PCIe FLR function to reset the card, this patch uses the powerplay tables to turn it off and on again.
Here is a comment from the developer:
AMD Navi 10 series GPUs require a vendor specific reset procedure.
According to AMD a PSP mode 2 reset should be enough however at this
time the details of how to perform this are not available to us.
Instead we can signal the SMU to enter and exit BACO which has the same
In plain English, AMD does not provide the details on how to perform a (PSP mode 2) reset. The patch by Geoff may work most of the time, but what if the VM hangs or crashes?
In the meantime I posted on the AMD Reddit to see if AMD would like to comment. Wendell from Level1Techs.com was quick to respond and subscribe. But no response from AMD, not even in private.
Latest 20.5.1 Driver Breaks GPU Passthrough on 5700XT
This is the latest story as of today. When I started to write this post I was going to mention older driver issues like BSOD when installing AMD Crimson drivers.
The problem may not be limited to one specific GPU model. For more on that see this VFIO Reddit forum thread.
My advice: don’t update the AMD graphics driver in Windows until this issue is fixed.
Nvidia has essentially two lines of graphics cards: the popular GeForce line of consumer and enthusiast GPUs and the professional Quadro, Tesla, Grid etc. lines. The professional line is geared towards workstation users or cloud/data center applications. Virtualization support is a key element in these markets.
Unlike AMD, Nvidia is playing in the big league – AI (artificial intelligence), cloud computing, GPU virtualization, you name it.
In the consumer market, Nvidia has a strong presence with everything from absolute basic “no-frills” GPUs to the currently fastest gaming GPU you can buy, with at least a dozen GPUs in between. Price-performance-wise Nvidia tends to be a bit more expensive, but has a lot more to offer.
As to the features, Nvidia has just about everything covered. Even my not-so-new Nvidia GTX 970 runs Folding@home without a problem. (“Folding on AMD GPUs is problematic in Linux due to poor OpenCL driver support from AMD”, as noted on the F@H website.)
Back to the subject – what about VGA passthrough? Here is my list of complaints:
Nvidia Driver Locks Virtualization Support for “Non-Professional” Cards
A couple of years ago Nvidia might have had a point in locking virtualization support in the driver software for consumer GPUs. Why? Because back then the GeForce consumer cards sold for a fraction of the price of the professional Quadro etc. equivalents, at the same exact GPU performance. Companies could save a lot of money if they went with the consumer line.
So Nvidia created an obstacle in that their drivers check to see if the operating system runs in a virtual environment. This integrated test inside the Nvidia graphics driver has been around for quite some years.
But somebody outsmarted Nvidia and invented a patch and/or procedure to “fool” the driver. This workaround has been available for, I believe, at least 5-6 years. Nvidia surely knows about this simple trick, and so far they have done nothing to make it hard on us.
I would love to see Nvidia remove this “virtualization test” and officially support virtualization for the consumer products. I really can’t see why they haven’t done it, except that it takes R&D time to remove and test it. Nvidia doesn’t need this nonsense.
If, however, Nvidia decides to go tough on the VFIO / passthrough community, they could easily do so within their driver. This makes it very tricky to recommend Nvidia.
For now, use the workaround (see almost any tutorial, including mine).
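To give an idea of what such a workaround typically looks like, here is a small Python sketch that builds the `-cpu` argument for a QEMU command line. It assumes the commonly described approach of hiding the KVM signature (`kvm=off`) and setting a custom Hyper-V vendor ID (`hv_vendor_id`); whether this matches the exact procedure in a given tutorial is an assumption on my part. The helper name `build_cpu_arg` and the default vendor string are made up for illustration.

```python
def build_cpu_arg(model: str = "host", hv_vendor_id: str = "whatever1234") -> str:
    """Build a QEMU -cpu value that hides the hypervisor from the guest driver."""
    # The vendor ID is exposed through CPUID and is limited to 12 characters.
    if not 1 <= len(hv_vendor_id) <= 12:
        raise ValueError("hv_vendor_id must be 1 to 12 characters long")
    # kvm=off hides the KVM signature; hv_vendor_id replaces the Hyper-V vendor string.
    return f"{model},kvm=off,hv_vendor_id={hv_vendor_id}"


# The resulting string is meant to follow "-cpu" on the QEMU command line.
print("-cpu", build_cpu_arg())
```

If you manage the VM through libvirt rather than a QEMU script, the same two settings are normally expressed in the domain XML instead.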
Nvidia Driver Reverts to Legacy INTx
Every kid on the block knows that MSI (Message Signaled Interrupt) is the way to go. Yet with each driver update Nvidia reverts to legacy INTx (Interrupt x). This has become quite annoying.
You can manually edit your Windows registry every time this happens, or use a little program to make it easier (same link above).
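For those who prefer to script the registry edit, here is a rough Python sketch that generates a .reg file for it. It relies on the `MSISupported` value under the device's `Interrupt Management\MessageSignaledInterruptProperties` key. The function name `msi_reg_file` is made up, and the device instance path in the example is a placeholder; you would have to look up the real one for your card in Device Manager. Treat it as a sketch and back up your registry before importing anything.

```python
def msi_reg_file(device_instance_path: str, enable: bool = True) -> str:
    """Return the text of a .reg file that switches MSI on or off for one device."""
    key = (
        "HKEY_LOCAL_MACHINE\\SYSTEM\\CurrentControlSet\\Enum\\"
        f"{device_instance_path}\\Device Parameters\\Interrupt Management\\"
        "MessageSignaledInterruptProperties"
    )
    value = 1 if enable else 0
    return (
        "Windows Registry Editor Version 5.00\n\n"
        f"[{key}]\n"
        f'"MSISupported"=dword:{value:08x}\n'
    )


# Placeholder path: replace it with the instance path of your own GPU.
EXAMPLE_PATH = r"PCI\VEN_10DE&DEV_XXXX&SUBSYS_XXXXXXXX&REV_XX\X&XXXXXXXX&X&XXXX"

print(msi_reg_file(EXAMPLE_PATH))
```

Save the output to a file ending in .reg, import it inside the Windows guest, and reboot the VM for the change to take effect.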
We actually shouldn’t blame Nvidia for using legacy methods, as it was probably Microsoft who hesitated to implement MSI. For some giggles on how nicely Intel puts it, here is a quote:
MSI was introduced in revision 2.2 of the PCI spec in 1999 as an optional component. However, with the introduction of the PCIe specification in 2004, implementation of MSI became mandatory from a hardware standpoint. Unfortunately, software support in mainstream operating systems was slow in coming, forcing many MSI-capable PCIe* devices to operate in legacy mode.
Nvidia’s “virtualization” detection is nothing more than a nuisance. Using a simple workaround, Nvidia consumer graphics cards work perfectly fine with GPU passthrough. The same can be said for MSI support – it’s a performance tuning measure that can be enabled after driver updates.
Unfortunately AMD’s FLR bug is a real killer. What’s most annoying is that AMD has known about the bug for such a long time and has done nothing!
As long as AMD (or anyone else) doesn’t fix the reset bug, I wouldn’t install an AMD GPU even if I got it as a present.
The verdict is simple: If you plan on doing VGA passthrough using VFIO, stay clear of AMD graphics cards.
Would I endorse Nvidia? No, but with only one vendor that supplies working hardware, do I have a choice?
Perhaps there is some hope: Intel plans on entering the GPU market.