tl;dr: A few years ago I designed a memory subsystem that did a lot of cool stuff but won't ever come out. I wanted to show some of the research being done with memory right now so people could get an idea of future tech and maybe discuss memory a bit (it's a complicated subject and one of the three big performance bottlenecks in PCs).
So a few years back I wrote a white paper and simulated Dynamic Bandwidth Memory (DBM) for CPUs, right around the time HBM started being produced for AMD's then top-of-the-line GPUs. The paper itself isn't too important; I ended up scrapping it after a few peer reviews and it never got published.
The basic idea (high level) was to emulate the way highways in high-traffic areas use electronically changing signs to reassign how many lanes carry inbound versus outbound traffic. For this I had to design a new memory standard and a new memory controller, simulate both, and build a VHDL prototype to test on an FPGA. Bandwidth could be stepped from 512 MB/s up to 256 GB/s, re-evaluating roughly every 5 million instructions based on the percentage of data being flushed from the cache. You might be wondering why this was even proposed; the main issue it was tackling was cache thrashing. There is no point pulling more data from memory if it just means cached data gets dropped and has to be fetched again later, wasting precious cycles waiting.
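To make the stepping idea concrete, here's a minimal sketch of that kind of policy in Python. This is not the original controller design or my VHDL; the step values, the threshold names (EVICT_HIGH, EVICT_LOW), and their numbers are assumptions for illustration only.

```python
# Discrete bandwidth steps, lowest to highest, in MB/s (512 MB/s .. 256 GB/s).
BANDWIDTH_STEPS = [512, 1024, 4096, 16384, 65536, 262144]

EVICT_HIGH = 0.30   # too much freshly fetched data being flushed -> step down
EVICT_LOW = 0.05    # almost nothing flushed -> room to step up

def next_step(current_idx: int, evicted_lines: int, fetched_lines: int) -> int:
    """Re-evaluated roughly every 5 million instructions: look at what fraction
    of freshly fetched data was flushed from the cache unused, then move one
    bandwidth step up or down."""
    if fetched_lines == 0:
        return current_idx
    flush_ratio = evicted_lines / fetched_lines
    if flush_ratio > EVICT_HIGH and current_idx > 0:
        return current_idx - 1          # thrashing: pull less data per window
    if flush_ratio < EVICT_LOW and current_idx < len(BANDWIDTH_STEPS) - 1:
        return current_idx + 1          # cache is keeping up: open the bus wider
    return current_idx
```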
By limiting the bandwidth we limited how much data could come over the bus, so less of the prefetched data (pulled in by prediction algorithms guessing what we'd need next) was at risk of being thrown out unused. This in turn meant we spent less time waiting on memory and evicted less to make room for new data, which ended up being very efficient for low-data tasks. On the other side, if we unlocked the limit and let it run at full bandwidth we could pull in more data, and if the prediction algorithm was correct (say, encoding/decoding a file where data is streamed straight through) then the extra data didn't need to be thrown out. This made high-bandwidth tasks run faster.
The downside was that weird middle ground. Because of the stepping, there were cases where the controller would swap back and forth between two steps because the application decided that X wasn't enough but Y was too much (X < Y). This created performance issues. Where the applications that eventually settled on a step got a 1-18% boost depending on the application (all percentages are relative to DDR4), the ones that sat on the line lost 3-13% in performance. We never figured out how to fix this other than adding more steps, which made the controller much more complex and costly.
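Here's a toy illustration of that flip-flop behavior, again with assumed step values, thresholds, and a crude flush-ratio model rather than anything from the actual simulation. The point is just that an application whose sweet spot sits between two steps keeps crossing both thresholds, so the controller never settles.

```python
def flush_ratio_at(bandwidth: int, ideal: int) -> float:
    """Crude model: the further the step overshoots the app's ideal bandwidth,
    the more of the extra data gets flushed from the cache unused."""
    return max(0.0, (bandwidth - ideal) / bandwidth)

steps = [4096, 16384]        # X and Y in MB/s, with X < Y
ideal = 9000                 # the app's sweet spot sits between them
idx = 0
for window in range(6):      # six 5-million-instruction windows
    bw = steps[idx]
    ratio = flush_ratio_at(bw, ideal)
    print(f"window {window}: step {bw} MB/s, flush ratio {ratio:.2f}")
    if idx == 0 and ratio < 0.05:
        idx = 1              # "not enough" -> step up
    elif idx == 1 and ratio > 0.30:
        idx = 0              # "too much" -> step back down
```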
Some positives of the system: increased performance in most cases (1-18%), power requirements about 60% lower than DDR4, heat that initial calculations showed could be dissipated with airflow alone, and a memory system better suited to both high- and low-bandwidth tasks. It also had the cool feature of being able to change the bandwidth per processor thanks to the distribution system, meaning a multi-CPU server/computer could run high-bandwidth apps on one CPU and low-bandwidth apps on another and split the bandwidth between them so the high-bandwidth CPU never got starved.
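A minimal sketch of that per-processor distribution idea, assuming a shared bandwidth pool split in proportion to each socket's recent demand with a small floor so nothing gets cut off entirely. TOTAL_POOL, MIN_SHARE, and the demand numbers are made up for illustration, not from the original design.

```python
TOTAL_POOL = 262144   # 256 GB/s expressed in MB/s
MIN_SHARE = 1024      # every socket keeps at least 1 GB/s

def distribute(demands: dict[str, int]) -> dict[str, int]:
    """Give each CPU its floor, then hand out the remainder in proportion to
    measured demand so the high-bandwidth socket gets the lion's share."""
    remaining = TOTAL_POOL - MIN_SHARE * len(demands)
    total_demand = sum(demands.values()) or 1
    return {cpu: MIN_SHARE + remaining * d // total_demand
            for cpu, d in demands.items()}

# e.g. one socket encoding video next to one running a light workload
print(distribute({"cpu0": 200000, "cpu1": 8000}))
```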
It was a really cool idea that never took off, but I wanted to share it with the community so people have some idea of what we could see in the memory market down the road. While my implementation will (pretty much) never be used, others are working on similar technology, and you might see it come to the server market in the next 8-10 years. It's a niche use case, but it would work great in something like AWS or VM farms because of how the distribution system handles multi-CPU builds.