Quite true. Thank goodness @ASRock System is doing a mATX TR4 motherboard. The only issue at that point would be finding a good top-down cooler, such as an Arctic Accelero.
That is why we have quad-channel memory on TR4. While it won't help as much as we'd hope, it would help a lot.

About this, I'd also be all in for a TR4-powered mATX HTPC (22 CU + 8 CPU cores would be nice!). But then of course we are smashing our heads back into the memory bandwidth limitations, and also having to run against Intel's Hades Canyon, which would be equally powerful graphically, and possibly also CPU-wise if they are able to get a 6-core Coffee Lake CPU on board (no idea if the chip can actually fit that).
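To put rough numbers on that (these are theoretical peak figures I'm assuming for illustration, not measurements), here's a quick back-of-the-envelope comparison of dual- vs quad-channel DDR4 against the kind of bandwidth a midrange dGPU gets from its GDDR5:

```python
# Back-of-the-envelope DRAM bandwidth comparison (assumed figures, not measurements):
# peak theoretical bandwidth = transfer rate * bytes per transfer * channels.

def dram_bandwidth_gbs(mt_per_s, channels, bus_bits=64):
    """Peak bandwidth in GB/s for a DDR-style memory configuration."""
    return mt_per_s * channels * (bus_bits / 8) / 1000

dual_ddr4_2666 = dram_bandwidth_gbs(2666, channels=2)   # typical desktop setup
quad_ddr4_2666 = dram_bandwidth_gbs(2666, channels=4)   # TR4 quad channel

# A midrange dGPU for scale: 256-bit GDDR5 at 7 GT/s (RX 570-class card).
gddr5_256bit = 7000 * (256 / 8) / 1000

print(f"Dual-channel DDR4-2666 : {dual_ddr4_2666:6.1f} GB/s")   # ~42.7
print(f"Quad-channel DDR4-2666 : {quad_ddr4_2666:6.1f} GB/s")   # ~85.3
print(f"256-bit GDDR5 @ 7 GT/s : {gddr5_256bit:6.1f} GB/s")     # ~224.0
```

So quad channel roughly doubles what an AM4 APU gets to work with, but a big iGPU would still be living on a fraction of what even a midrange card enjoys.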
How about developing 'RAM' for the APU using a PCI-E x16 slot with an HBM module? I don't know if that is possible.
And maybe we could use that 'RAM' to give our dGPU extra capacity.
PCI Express 4.0 x16 maximum throughput is 31.5 GB/s, which is less than even the slowest DDR4 in dual channel, so it probably wouldn't work very well even if someone made such a thing.
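For anyone who wants to check the arithmetic, here's the quick calculation behind those figures (theoretical peaks, one direction for PCIe):

```python
# Quick sanity check of the bandwidth claim above (theoretical peaks only).

PCIE4_GT_PER_LANE = 16          # GT/s per lane for PCIe 4.0
ENCODING = 128 / 130            # 128b/130b line-encoding overhead

pcie4_x16 = PCIE4_GT_PER_LANE * ENCODING / 8 * 16   # GB/s, one direction
ddr4_2133_dual = 2133 * 2 * 8 / 1000                # slowest JEDEC DDR4, two channels

print(f"PCIe 4.0 x16          : {pcie4_x16:5.1f} GB/s")        # ~31.5
print(f"Dual-channel DDR4-2133: {ddr4_2133_dual:5.1f} GB/s")   # ~34.1
```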
My knowledge is limited too, but I can confirm that the length affects speed very much. Think of it like electricity flowing through a very long wire: due to the material's resistance you will lose some voltage. That's normal. It is the same with data speed.

IIRC part of the speed of HBM comes from it being right next to the die, with incredibly short traces through the silicon interposer. Whilst my knowledge on this front is far from comprehensive, replicating this over a longer distance with current technology, especially through card-edge connectors, is outside of what can be done, or at least what can be done cost-effectively. IIRC even DDR4 has specifications concerning not only matching the lengths of the PCB traces but also the maximum length of those traces, in order to maintain stability at speed. I also seem to remember that dealing with reflections from connectors was a major engineering challenge for maintaining the speed of 10GBASE-T Ethernet.
TL;DR: driving those kinds of speeds over those kinds of distances, across different mediums and through connectors, is beyond current technology. My understanding is limited though, so if anyone here with a deeper understanding wants to chime in with details / corrections / etc., it would be appreciated.
From Wikipedia:

HBM achieves higher bandwidth while using less power in a substantially smaller form factor than DDR4 or GDDR5.[6] This is achieved by stacking up to eight DRAM dies, including an optional base die with a memory controller, which are interconnected by through-silicon vias (TSVs) and microbumps. The HBM technology is similar in principle to, but incompatible with, the Hybrid Memory Cube interface developed by Micron Technology.[7]
HBM memory bus is very wide in comparison to other DRAM memories such as DDR4 or GDDR5. An HBM stack of four DRAM dies (4-Hi) has two 128-bit channels per die for a total of 8 channels and a width of 1024 bits in total. A graphics card/GPU with four 4-Hi HBM stacks would therefore have a memory bus with a width of 4096 bits. In comparison, the bus width of GDDR memories is 32 bits, with 16 channels for a graphics card with a 512-bit memory interface.[8] HBM supports up to 4 GB per package.
The larger number of connections to the memory, relative to DDR4 or GDDR5, required a new method of connecting the HBM memory to the GPU (or other processor).[9] AMD and Nvidia have both used purpose built silicon chips, called interposers, to connect the memory and GPU. This interposer has the added advantage of requiring the memory and processor to be physically close, decreasing memory paths. However, as semiconductor device fabrication is significantly more expensive than printed circuit board manufacture, this adds cost to the final product.
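To put the quoted bus widths in perspective, here's a rough sketch of the wide-and-slow vs narrow-and-fast trade-off, using first-generation HBM numbers (Fury X-style) and a hypothetical 512-bit GDDR5 configuration at 7 GT/s purely for comparison:

```python
# Illustrating the wide-and-slow vs narrow-and-fast trade-off from the quote.
# The GDDR5 configuration below is hypothetical and chosen only for scale.

def mem_bandwidth_gbs(bus_bits, gt_per_s):
    """Peak bandwidth in GB/s given total bus width and per-pin transfer rate."""
    return bus_bits / 8 * gt_per_s

hbm1_4stacks = mem_bandwidth_gbs(4 * 1024, 1.0)   # 4 x 1024-bit stacks @ 1 GT/s
gddr5_512bit = mem_bandwidth_gbs(512, 7.0)        # 512-bit interface @ 7 GT/s

print(f"4x HBM1 stacks (4096-bit @ 1 GT/s): {hbm1_4stacks:.0f} GB/s")  # 512
print(f"GDDR5 512-bit @ 7 GT/s            : {gddr5_512bit:.0f} GB/s")  # 448
```

The point being that HBM gets its bandwidth from sheer width at modest clocks, which is exactly why it needs those thousands of short interposer traces rather than a handful of long PCB ones.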
From PCPer:

AMD even went as far as to show the overclocking headroom that the Ryzen APU can offer. During an on-site demo we saw the Ryzen 5 2400G improve its 3DMark score by 39% with memory frequency and GPU clock speed increases. Moving the GPU clock from ~1100 MHz to 1675 MHz will mean a significant increase in power consumption, and I do question the size of the audience that wants to overclock an APU. Still – cool to see!
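A very rough ratio from those quoted demo numbers (keeping in mind the memory was overclocked too, so this is only suggestive) shows the scaling is well short of linear, which fits the bandwidth-ceiling theme of this thread:

```python
# The quoted demo: iGPU clock went from ~1100 MHz to 1675 MHz and the 3DMark
# score rose 39%. The ratio below is only suggestive, since memory frequency
# was raised at the same time.

clock_gain = 1675 / 1100 - 1      # ~0.52 -> +52% GPU clock
score_gain = 0.39                 # +39% 3DMark score

print(f"GPU clock increase : {clock_gain:.0%}")                 # 52%
print(f"Score increase     : {score_gain:.0%}")                 # 39%
print(f"Scaling efficiency : {score_gain / clock_gain:.0%}")    # ~75%
```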
Remember interposer size limitations: GP100 and GV100 both had to use double patterning to produce interposers large enough to fit both the GPU die and HBM dies (as the GPU dies themselves were already at the maximum reticle limit), and this is an extremely expensive process. Threadripper is already even larger than the 1200 mm^2 GV100 package, around 9600 mm^2! That could be shrunk somewhat by shoving the dies closer together on the interposer, but once you start adding HBM stacks and a big GPU die, you're going to be creating a package that not only costs Big Data money to fab the interposer for, it also has a huge number of dies to attach.

Each time you add a die to an interposer, you add an opportunity for a defect that can kill the entire assembly (some salvaging is possible for HBM failures). Four CPU dies, a GPU die, and several HBM dies means lots of opportunities to kill your package during assembly. Even if AMD are able to license EMIB from Intel to avoid the need for a monolithic interposer, you still have the problem of assembly, and of ending up with several hundred watts of heat to dissipate from one package.

Just to throw it out there, but if a new socket came out (I'm going to call it UTR4, for Ultimate Threadripper) that removed the DRAM slots and instead opted to use HBM as system memory, we could see a new kind of beast. CPU + GPU + HBM all on one package. It removes the trace-length issues, HBM solves the bandwidth issue, and the interposer has everything it needs without outside help. It's probably unreasonable, but even if they shoved a few HBM stacks between the GPU and CPU portions and then connected the HBM to DDR4 (effectively making the HBM into a large cache), it would probably increase the speed and usefulness of the Infinity Fabric that AMD uses.
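As a very hand-wavy illustration of that "HBM as a big cache in front of DDR4" idea, here's a toy effective-bandwidth model; the hit rate and bandwidth figures are pure assumptions picked for the sake of the sketch:

```python
# Toy model of the "HBM as a large cache in front of DDR4" idea from the post
# above. Hit rate and bandwidth figures are assumptions, not real hardware data.

def effective_bandwidth(hit_rate, hbm_gbs, ddr4_gbs):
    """Blend of tiers: bytes hitting HBM are served at HBM speed, misses at DDR4 speed."""
    return 1.0 / (hit_rate / hbm_gbs + (1.0 - hit_rate) / ddr4_gbs)

HBM_GBS = 256.0      # assumed: a single modern HBM stack
DDR4_GBS = 85.0      # assumed: quad-channel DDR4-2666, from the earlier estimate

for hit_rate in (0.5, 0.8, 0.95):
    bw = effective_bandwidth(hit_rate, HBM_GBS, DDR4_GBS)
    print(f"hit rate {hit_rate:4.0%} -> effective ~{bw:5.1f} GB/s")
```

Even with an imperfect hit rate, the working set served out of HBM pulls the effective bandwidth well above what the DDR4 behind it could deliver on its own, which is the whole appeal of that kind of hybrid setup.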