Aibophobia has hit the nail on the head: DX12 and Vulkan move all the optimisation work that was previously done by the GPU vendor in their driver onto the application, and say to the developer "here, you deal with this now". If you're a massive engine developer like Epic or Unity, then you can afford to hire, train and pay experts on each architecture (not just each vendor, because every vendor ships multiple different architectures) to tweak your engine at a low level in the same way the vendors previously did. If you're an indie developer, you pretty much have three choices:
- Use DX12/Vulkan, and leave performance on the table or accept an extremely long and arduous development cycle optimising for each architecture
- Use vendor-provided pre-optimised libraries and have the internet accuse you of killing puppies
- Use DX11/OpenGL as before
--------------------------
On multi-GPU: there are broadly three multi-adapter addressing methods:
- Implicit Multiadapter: this is similar to DX11/OpenGL, where the application pretends there is just one GPU and the driver has to handle splitting the workload across multiple GPUs behind the scenes. In practice this tends to perform worse than the old driver-managed DX11 path, as DX12/Vulkan encourages low-level noodling with the dispatch process, and developers inevitably over-optimise for single-adapter at the expense of multi-adapter.
- Explicit multiadapter with discrete GPUs: each GPU is exposed independently, and the application determines what jobs get dispatched where and when. Maximum flexibility, but it means doing the optimisation yourself for each architecture and re-doing that work for single-GPU, 2-GPU, 3-GPU, etc. configurations. In theory you can mix and match GPUs from different architectures or even vendors, but again, MORE WORK FOR YOU.
- Explicit multiadapter with linked GPUs: the GPUs are exposed as a 'composite' GPU with shared resources but multiple 'nodes'. This leaves the driver still handling some parts of the setup, but the application also needs to be aware of dispatching jobs to multiple nodes. In theory this can be a heterogeneous setup, but for now it's assumed that you will only see linked adapters with the same performance.
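To make the explicit-multiadapter case concrete, here's a toy sketch (plain Python, not real DX12/Vulkan calls — the adapter list, `split_frame` and the slice-based scheme are all hypothetical stand-ins) of the kind of decision the application now owns: carving a frame up per adapter, for whatever adapter count it happens to find:

```python
# Conceptual sketch only -- no real DX12/Vulkan API here. The point is
# that under explicit multiadapter the *application*, not the driver,
# decides how work is split, and must handle 1-GPU, 2-GPU, 3-GPU...
# configurations itself.

def split_frame(height, device_count):
    """Split-frame rendering: assign each adapter a band of scanlines."""
    slice_h = height // device_count
    jobs = []
    for i in range(device_count):
        start = i * slice_h
        # Last adapter absorbs any remainder scanlines.
        end = height if i == device_count - 1 else start + slice_h
        jobs.append((start, end))
    return jobs

# Hypothetical adapter list the app enumerated at startup.
gpus = ["GPU0", "GPU1"]
for gpu, (start, end) in zip(gpus, split_frame(1080, len(gpus))):
    print(f"{gpu} renders scanlines {start}-{end}")
# GPU0 renders scanlines 0-540
# GPU1 renders scanlines 540-1080
```

An even split like this also quietly assumes identical adapters; with mismatched GPUs you'd additionally have to weight the slices per device, which is exactly the "more work for you" part.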
As far as I am aware, there aren't any games actually using linked GPUs at the moment. Most are using Implicit Multiadapter (if they support multi-GPU at all on the low-level APIs), but there are a handful that have implemented Explicit Multiadapter (Ashes of the Singularity, Rise of the Tomb Raider, Deus Ex: Mankind Divided, and Hitman 2016 are the only ones I can recall off the top of my head).
---------------------------
On NVLink and Infinity Fabric: we might see these as 'supplementary connectors' between GPUs, but probably not as the primary way to connect a GPU to a CPU. Even if CPU vendors could be persuaded to add NVLink or Infinity Fabric (and give up more die area for it), you'd then end up having to make GPUs with a new PHY interface, have motherboard vendors produce yet more variants (that need to conform to different signal routing standards), etc.
And when it all boils down to it, these new interfaces offer greater bandwidth but do not otherwise change the software side of multi-GPU. Neither is close to the bandwidth or latency required for 'transparent' multi-die-single-chip linking, so you still have to deal with the same multi-device dispatch issues that you do at the moment.
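To put rough numbers on that, here's some back-of-the-envelope arithmetic using approximate publicly quoted peak figures from around this era (per-direction bandwidths; exact numbers vary by product, link count and clocks, so treat these as illustrative):

```python
# Approximate peak figures, per direction (GB/s) -- illustrative only.
pcie3_x16 = 16          # PCIe 3.0 x16
nvlink2_per_link = 25   # one NVLink 2.0 link
nvlink2_six_links = 6 * nvlink2_per_link  # e.g. a six-link GPU config
hbm2_local = 900        # local HBM2 memory bandwidth on a high-end GPU

# NVLink beats PCIe handily...
print(f"six-link NVLink vs PCIe 3.0 x16: {nvlink2_six_links / pcie3_x16:.1f}x")

# ...but is still a fraction of local VRAM bandwidth, which is why a
# remote GPU's memory can't be treated as transparently local.
print(f"NVLink as share of local VRAM bandwidth: "
      f"{nvlink2_six_links / hbm2_local:.0%}")
```

So even the fastest of these links leaves remote memory an order of magnitude slower than local, and the application still has to place data and dispatch work per device accordingly.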