CPU Threadripper 2 - Where's the bottleneck?

el01

King of Cable Management
Original poster
Jun 4, 2018
770
588
After looking at a lot of benchmarks, I've found that TR2 (especially the 2990WX) is really kinda underwhelming (to me) in performance, especially considering it should hypothetically outperform an i9-7980XE given its Cinebench scores. So I must ask... Where's the bottleneck? Is it:
  • In the RAM (NUMA vs UMA, latency issues on 16 cores of the CPU)
  • In optimization (Windows and other programs are not optimized to use so many damn cores)
  • In the X399 chipset (I don't know how)
  • In the core itself (kinda unlikely considering it's still the Zen core)
  • In clock speed (lower turbo/base on TR2)
Hopefully I'm not a complete idiot and haven't overlooked something obvious...
Thanks!
-el01
 

Soul_Est

SFF Guru
SFFn Staff
Feb 12, 2016
1,534
1,928
el01 said:
After looking at a lot of benchmarks, I've found that TR2 (especially the 2990WX) is really kinda underwhelming (to me) in performance, especially considering it should hypothetically outperform an i9-7980XE given its Cinebench scores. So I must ask... Where's the bottleneck? Is it:
  • In the RAM (NUMA vs UMA, latency issues on 16 cores of the CPU)
  • In optimization (Windows and other programs are not optimized to use so many damn cores)
  • In the X399 chipset (I don't know how)
  • In the core itself (kinda unlikely considering it's still the Zen core)
  • In clock speed (lower turbo/base on TR2)
Hopefully I'm not a complete idiot and haven't overlooked something obvious...
Thanks!
-el01
One of the bigger issues is the older code base of Cinebench. Using the Blender Open Data benchmark may give a better representation considering that it is continuously being updated to support newer CPU instructions.
 
  • Like
Reactions: el01

Phuncz

Lord of the Boards
SFFn Staff
May 9, 2015
5,836
4,906
Above a few cores, many applications struggle to scale well, and with 32 cores this is very much an issue. Not only does Windows become a factor with its questionable scaling, but the software and drivers running on the system also need to scale properly. The 2990WX is not a CPU anyone should consider unless 8 or 16 cores just don't feed the beast of a very thread-heavy application that you have to use.
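
As a rough illustration of why scaling stalls (a minimal sketch, not a measurement of any real application): Amdahl's law caps the speedup of a workload where only a fraction p parallelizes at 1 / ((1 - p) + p/n) on n cores. The fractions below are purely illustrative.

Code:
#include <stdio.h>

/* Amdahl's-law speedup bound: parallel fraction p, n cores. */
static double amdahl(double p, int n)
{
    return 1.0 / ((1.0 - p) + p / n);
}

int main(void)
{
    const double fractions[] = { 0.80, 0.95, 0.99 };   /* illustrative only */
    const int cores[] = { 8, 16, 32 };

    for (int i = 0; i < 3; i++)
        for (int j = 0; j < 3; j++)
            printf("p = %.2f, %2d cores -> max speedup %5.2fx\n",
                   fractions[i], cores[j], amdahl(fractions[i], cores[j]));
    return 0;
}

Even a 95%-parallel workload tops out around 12.5x on 32 cores, which is one reason doubling the core count rarely doubles real-world performance.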

Next there is the issue of memory bandwidth and latency, which some applications are very sensitive to and others barely notice. From what I understand from reading some of the earlier reviews, the 2990WX is limited on memory bandwidth relative to other, more common CPUs and thus isn't very good at memory-heavy tasks. You'll see this with video editing (which already doesn't scale linearly) and compression.

The problem I see with this is that many reviewers just have no clue how to review high-end workstation hardware: they throw Blender, PCMark and Adobe Premiere at the problem and call it a day. Some of those already have issues scaling from 6 to 8 or 8 to 10 cores, so this is only exacerbated. It's not like there haven't been >16-core offerings from Intel for years that were used in workstations, but those machines serve completely different workloads that many reviewers don't understand.

AMD has brought out this behemoth of a CPU, but very few people need to look at it. It is meant for specific use cases and, in my opinion, more for marketing purposes, to show what AMD can bring to the table. The really interesting Threadripper 2000-series CPUs are the 2920X and 2950X: 12 cores with a 4.3 GHz boost and 16 cores with a 4.4 GHz boost for $649 and $899 respectively are nice products, if you can make use of the platform's benefits. Otherwise, stay in the consumer space.

The 2990WX isn't just "MOAR processor" because the product number is bigger than the lower version's. AMD added that "W" to signal that it's for workstation usage, not general usage, just like the Intel 18-core Core i9-7980XE isn't better at everything than their 6-core consumer and 8-core HEDT offerings. It's a lot more complex, which is also true for RAM, GPUs and storage, where the top-priced products aren't always the fastest.
 
Last edited:

Arie

Trash Compacter
Jul 4, 2018
37
70
Limited memory bandwidth and increased latency on half of the cores (dies), because they have no direct connection to RAM. Also, Windows 10 seems to cripple the WX parts somewhat compared to Linux. Still very cool CPUs, but for the WX series you have to be sure your workload is OK with their architecture.
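
To put a number on that local-vs-remote difference yourself, here is a minimal Linux/libnuma sketch (the file name, buffer size and the choice of node 1 as the "remote" node are assumptions; check the actual topology with numactl --hardware first and pick a node that has local memory).

Code:
/* numa_touch.c - walk a buffer bound to the local node vs. one bound to a
 * remote node, from a thread pinned to node 0. Build: gcc -O2 numa_touch.c -lnuma */
#include <numa.h>
#include <stdio.h>
#include <time.h>

#define BUF_SIZE (256UL * 1024 * 1024)   /* 256 MiB, arbitrary */
#define REMOTE_NODE 1                    /* assumption: adjust for your machine */

static double walk(volatile char *buf, size_t size)
{
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (size_t i = 0; i < size; i += 64)   /* one touch per cache line */
        buf[i]++;
    clock_gettime(CLOCK_MONOTONIC, &t1);
    return (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
}

int main(void)
{
    if (numa_available() < 0 || numa_max_node() < 1) {
        fprintf(stderr, "need libnuma and at least two NUMA nodes\n");
        return 1;
    }

    numa_run_on_node(0);                               /* pin ourselves to node 0 */
    char *local  = numa_alloc_onnode(BUF_SIZE, 0);
    char *remote = numa_alloc_onnode(BUF_SIZE, REMOTE_NODE);
    if (!local || !remote) { fprintf(stderr, "allocation failed\n"); return 1; }

    walk(local, BUF_SIZE);  walk(remote, BUF_SIZE);    /* first pass faults pages in */
    printf("local node : %.3f s\n", walk(local,  BUF_SIZE));
    printf("remote node: %.3f s\n", walk(remote, BUF_SIZE));

    numa_free(local,  BUF_SIZE);
    numa_free(remote, BUF_SIZE);
    return 0;
}

On the dies Arie describes there is no local memory at all, so every access from those cores pays the remote penalty.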
 
Last edited:

Gautam

Cable-Tie Ninja
Sep 5, 2016
148
123
@Arie is correct about the memory architecture. I have a heavy bias here as an ex-Intel engineer, as I worked on the mesh network used in Skylake. Great pains and years of engineering work went into allowing for consistent latency and high core-to-memory and core-to-core bandwidth (Intel absolutely kills AMD in the latter when it comes to two arbitrary cores). Again, extreme bias showing here, but AMD's designs are much more akin to just slapping a bunch of cores on a die: literally, their Core Complexes. Even the first iteration of Threadripper required cores in different CCXs to communicate over Infinity Fabric.

In short: AMD can do great in "embarrassingly parallel" tasks, but it's going to be weak when multiple threads need to communicate with each other, especially if they have to concurrently modify a lot of memory. On the flip side, there probably aren't that many applications (or people running them) that warrant this.
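
A minimal sketch of that distinction (pthreads and C11 atomics; thread and iteration counts are arbitrary): each thread bumping its own cache-line-padded counter is the "embarrassingly parallel" case, while all threads bumping one shared counter forces the cores to pass a single cache line back and forth.

Code:
/* contention.c - same work per thread, with and without sharing.
 * Build: gcc -O2 -pthread contention.c */
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>
#include <time.h>

#define NTHREADS 8
#define ITERS    10000000L

typedef struct { _Alignas(64) atomic_long v; } padded_counter;  /* avoids false sharing */

static atomic_long    shared_counter;
static padded_counter private_counter[NTHREADS];

static void *bump_shared(void *arg)
{
    (void)arg;
    for (long i = 0; i < ITERS; i++)
        atomic_fetch_add(&shared_counter, 1);      /* every core fights over one line */
    return NULL;
}

static void *bump_private(void *arg)
{
    padded_counter *mine = arg;
    for (long i = 0; i < ITERS; i++)
        atomic_fetch_add(&mine->v, 1);             /* same work, no cross-core traffic */
    return NULL;
}

static double timed_run(void *(*fn)(void *), int private)
{
    pthread_t t[NTHREADS];
    struct timespec t0, t1;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < NTHREADS; i++)
        pthread_create(&t[i], NULL, fn, private ? (void *)&private_counter[i] : NULL);
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(t[i], NULL);
    clock_gettime(CLOCK_MONOTONIC, &t1);
    return (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
}

int main(void)
{
    printf("private counters: %.3f s\n", timed_run(bump_private, 1));
    printf("shared counter  : %.3f s\n", timed_run(bump_shared, 0));
    return 0;
}

The per-thread work is identical in both runs; only the sharing differs, and that alone usually opens a large gap, which widens further when the cores involved sit on different dies.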
 

el01

King of Cable Management
Original poster
Jun 4, 2018
770
588
OK, now a question for all of you... Do you think that if AMD changed the chipset and rewired the memory channels so that each CCX got its own channel, performance would be better?

Also, it appears that I was watching the wrong type of analysis:
https://www.phoronix.com/scan.php?page=article&item=amd-linux-2990wx&num=11
This shows significantly better performance relative to Intel than the Windows results do...

@Phuncz
I'm reading the AnandTech review and it makes more sense now... But I do feel like the 2990WX was "hyped" as an enthusiast's CPU for completely destroying benchmarks (such as Fire Strike), when instead it's essentially an EPYC with higher clocks and a higher TDP.
 
Last edited:
  • Like
Reactions: Soul_Est

O_and_N

Average Stuffer
Aug 18, 2016
77
18
Phuncz said:
Above a few cores, many applications struggle to scale well, and with 32 cores this is very much an issue. Not only does Windows become a factor with its questionable scaling, but the software and drivers running on the system also need to scale properly. The 2990WX is not a CPU anyone should consider unless 8 or 16 cores just don't feed the beast of a very thread-heavy application that you have to use.

Next there is the issue of memory bandwidth and latency, which some applications are very sensitive to and others barely notice. From what I understand from reading some of the earlier reviews, the 2990WX is limited on memory bandwidth relative to other, more common CPUs and thus isn't very good at memory-heavy tasks. You'll see this with video editing (which already doesn't scale linearly) and compression.

The problem I see with this is that many reviewers just have no clue how to review high-end workstation hardware: they throw Blender, PCMark and Adobe Premiere at the problem and call it a day. Some of those already have issues scaling from 6 to 8 or 8 to 10 cores, so this is only exacerbated. It's not like there haven't been >16-core offerings from Intel for years that were used in workstations, but those machines serve completely different workloads that many reviewers don't understand.

AMD has brought out this behemoth of a CPU, but very few people need to look at it. It is meant for specific use cases and, in my opinion, more for marketing purposes, to show what AMD can bring to the table. The really interesting Threadripper 2000-series CPUs are the 2920X and 2950X: 12 cores with a 4.3 GHz boost and 16 cores with a 4.4 GHz boost for $649 and $899 respectively are nice products, if you can make use of the platform's benefits. Otherwise, stay in the consumer space.

The 2990WX isn't just "MOAR processor" because the product number is bigger than the lower version's. AMD added that "W" to signal that it's for workstation usage, not general usage, just like the Intel 18-core Core i9-7980XE isn't better at everything than their 6-core consumer and 8-core HEDT offerings. It's a lot more complex, which is also true for RAM, GPUs and storage, where the top-priced products aren't always the fastest.


So as not to start a new thread, I would like to ask for some opinions. I just finished reading all the benchmarks I could find and I agree that most of the time the tests were not at pro-level usage. And I'm disappointed at the single-thread performance in games. I know it's not about gaming, but in my case (apart from modeling packages) I'm using Unreal Engine, which is a lot like "playing a game".
If I get the 2990WX I suspect that rendering, compiling and light baking will be a time saver, but if I start game testing in the editor and world building, the CPU will be a bottleneck and give me low FPS, in theory.

My question is: if, for example, I switch my monitor to 4K and I start developing and testing my game in 2K and even 4K, will that (mostly) eliminate the low framerate? Only 2-4 cores max are used in that situation. (Disabling cores in Ryzen Master is a waste of time.)

Or should I just look away and... take the 7980XE?
 

el01

King of Cable Management
Original poster
Jun 4, 2018
770
588
O_and_N said:
So as not to start a new thread, I would like to ask for some opinions. I just finished reading all the benchmarks I could find and I agree that most of the time the tests were not at pro-level usage. And I'm disappointed at the single-thread performance in games. I know it's not about gaming, but in my case (apart from modeling packages) I'm using Unreal Engine, which is a lot like "playing a game".
If I get the 2990WX I suspect that rendering, compiling and light baking will be a time saver, but if I start game testing in the editor and world building, the CPU will be a bottleneck and give me low FPS, in theory.

My question is: if, for example, I switch my monitor to 4K and I start developing and testing my game in 2K and even 4K, will that (mostly) eliminate the low framerate? Only 2-4 cores max are used in that situation. (Disabling cores in Ryzen Master is a waste of time.)

Or should I just look away and... take the 7980XE?
I don't quite understand your question. Please give some more information and tidy up your English a bit (I understand if you're a non-native speaker, but please try to follow English conventions :)).

The CPU will not be a bottleneck in development. The single-core scores are on par with older i5s and i7s, which is pretty good when it comes to gaming. Also, you must ask yourself, "Do I really need 32 cores and 64 threads?" For your use case, I would get a 1920X or a 2920X. You really don't need 32 cores and 64 threads for Unreal Engine.

P.S. It's good practice (unless you're in an off-topic thread) to create a new thread for a new question. Please keep things on topic and don't go off on tangents too wildly.
 

VegetableStu

Shrink Ray Wielder
Aug 18, 2016
1,949
2,619
Gonna copy-paste some ramblings I made on the LTT forums regarding this (I had the same curiosity):
el01 said:
Do you think that if AMD changed the chipset and rewired the memory channels so that each CCX got its own channel, performance would be better?

I did some thinking...

I'm still on the memory-topology traffic here (I can't claim to know any of this to save my life), and I'm starting to think one channel per CCX might be a bad idea, because of how much RAM sits on one stick and because each CCX would have memory control over one particular channel (or two, in regular real-world cases).

Say a system has 4x4GB, so each CCX only has 4GB to work with. Windows typically uses around 3GB at minimum, so if that is written sequentially, only one CCX has the closest access to OS data (and "optimally" it would be "shut off" from the rest of the system to keep it from doing work that's farther away). If it's spread out, the CCXes have to start exchanging notes quite a lot.

And that's before considering workloads that use a lot of RAM. Three quarters of the actual RAM data would have to be "paged" from the other nodes, and the CCXes would already be busy managing their own. Ideally, having more RAM per channel and scheduling tasks onto a freer CCX might help, but obviously that's not how it works, because I'm just a guy on the internet without any means of experimentation.

I'm already drawing parallels from that thought to the 2950X (to the point of considering buying more RAM), but I'm kinda glad it's pretty robust for what it is, despite being less ideal than a unified-memory-controller situation. I kinda hope Threadripper gen 3 somehow does away with CCXes having their own memory controllers and relies on something more unified, but the whole point of the design is taking multiple single chips and melding them into a many-core single unit, so...

Unless all the memory controllers answer to a head librarian, but that's yet another thing to add and I'm overthinking this way too much. Can't wait for in-depth reviews and experiments, and can't wait for next year's generation.
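
For what it's worth, you can actually watch where the OS put your pages relative to those nodes. A minimal Linux sketch using move_pages() in query mode (buffer size and page count are arbitrary; build with -lnuma):

Code:
/* query_nodes.c - allocate a buffer, touch it, then ask the kernel which NUMA
 * node each page ended up on. Build: gcc query_nodes.c -lnuma */
#include <numaif.h>      /* move_pages() */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define NPAGES 8

int main(void)
{
    long page_size = sysconf(_SC_PAGESIZE);
    char *buf = malloc(NPAGES * page_size);
    if (!buf) return 1;
    memset(buf, 0, NPAGES * page_size);      /* first touch decides placement */

    void *pages[NPAGES];
    int   status[NPAGES];
    for (int i = 0; i < NPAGES; i++)
        pages[i] = buf + i * page_size;

    /* nodes == NULL: don't move anything, just report where each page lives */
    if (move_pages(0, NPAGES, pages, NULL, status, 0) != 0) {
        perror("move_pages");
        return 1;
    }
    for (int i = 0; i < NPAGES; i++)
        printf("page %d -> NUMA node %d\n", i, status[i]);

    free(buf);
    return 0;
}

Run it pinned to different nodes (e.g. with numactl --cpunodebind) and the placement will usually follow wherever the touching thread ran, which is exactly the "who sits closest to the data" question above.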
 

el01

King of Cable Management
Original poster
Jun 4, 2018
770
588
VegetableStu said:
Gonna copy-paste some ramblings I made on the LTT forums regarding this (I had the same curiosity):

I did some thinking...

I'm still on the memory-topology traffic here (I can't claim to know any of this to save my life), and I'm starting to think one channel per CCX might be a bad idea, because of how much RAM sits on one stick and because each CCX would have memory control over one particular channel (or two, in regular real-world cases).

Say a system has 4x4GB, so each CCX only has 4GB to work with. Windows typically uses around 3GB at minimum, so if that is written sequentially, only one CCX has the closest access to OS data (and "optimally" it would be "shut off" from the rest of the system to keep it from doing work that's farther away). If it's spread out, the CCXes have to start exchanging notes quite a lot.

And that's before considering workloads that use a lot of RAM. Three quarters of the actual RAM data would have to be "paged" from the other nodes, and the CCXes would already be busy managing their own. Ideally, having more RAM per channel and scheduling tasks onto a freer CCX might help, but obviously that's not how it works, because I'm just a guy on the internet without any means of experimentation.

I'm already drawing parallels from that thought to the 2950X (to the point of considering buying more RAM), but I'm kinda glad it's pretty robust for what it is, despite being less ideal than a unified-memory-controller situation. I kinda hope Threadripper gen 3 somehow does away with CCXes having their own memory controllers and relies on something more unified, but the whole point of the design is taking multiple single chips and melding them into a many-core single unit, so...

Unless all the memory controllers answer to a head librarian, but that's yet another thing to add and I'm overthinking this way too much. Can't wait for in-depth reviews and experiments, and can't wait for next year's generation.
This is probably gonna sound very stupid and very uninformed, but why don't they consolidate two 8-core silicon units with some sort of bridge in the middle, bypassing Infinity Fabric, and then have two dies? Also, on such an already expensive processor, why can't they have a custom memory controller on each processor, essentially a small microcontroller handling the transfer of information across RAM channels, bypassing the cores?
 
  • Like
Reactions: Phuncz

VegetableStu

Shrink Ray Wielder
Aug 18, 2016
1,949
2,619
(Keep in mind that I'm not a computer scientist ,_, so anyone who knows better, please spare me your thunder.)

el01 said:
but why don't they consolidate two 8-core silicon units with some sort of bridge in the middle, bypassing Infinity Fabric, and then have two dies?

Like having two CCXes share the same channels of memory? With the current implementation I'd imagine that's quite hard to do without introducing write conflicts (imagine two guys trying to write their own essays on the same sheet of paper, where sometimes they have to read and edit what the other has written).

el01 said:
Also, on such an already expensive processor, why can't they have a custom memory controller on each processor, essentially a small microcontroller handling the transfer of information across RAM channels, bypassing the cores?
It's one extra thing to make, but whether it would actually reduce "RAM paging" is something I'd love to see as well.
 

Gautam

Cable-Tie Ninja
Sep 5, 2016
148
123
I'm going to (try to) make this brief. A process has a memory address space, and one process cannot see the memory of another process (e.g. each Chrome tab is its own process). Each process also gets the illusion from the operating system that it has access to the entirety of physical memory: if you have 32GB of RAM, each Chrome tab "thinks" it has 32GB available to it. In reality, the OS is constantly paging data between main memory and disk to maintain this illusion.
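
A minimal sketch of that isolation (plain POSIX, nothing Threadripper-specific): after fork(), parent and child report the same virtual address for x, yet the child's write never shows up in the parent.

Code:
/* separate_spaces.c - two processes, same virtual address, different contents. */
#include <stdio.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
    int x = 1;

    pid_t pid = fork();
    if (pid < 0) { perror("fork"); return 1; }

    if (pid == 0) {                 /* child: gets its own copy of the address space */
        x = 42;
        printf("child : &x=%p x=%d\n", (void *)&x, x);
        _exit(0);
    }

    wait(NULL);                     /* let the child finish first */
    printf("parent: &x=%p x=%d\n", (void *)&x, x);   /* still prints 1 */
    return 0;
}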

A process can have multiple threads. Each thread is a stream of instructions, just like a process, but two threads in one process share the memory of that process. Heavily threaded processes are what make things tricky, since you typically have multiple threads running on multiple cores, and if they belong to a single process they have to be able to communicate and share data.
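
And the flip side, sketched the same way: two threads inside one process really do touch the same x, which is exactly what creates the coordination traffic on a many-core part.

Code:
/* shared_space.c - threads in one process share memory. Build: gcc -pthread shared_space.c */
#include <pthread.h>
#include <stdio.h>

static int x = 1;

static void *writer(void *arg)
{
    (void)arg;
    x = 42;                         /* same x the main thread will read */
    return NULL;
}

int main(void)
{
    pthread_t t;
    pthread_create(&t, NULL, writer, NULL);
    pthread_join(t, NULL);          /* the join orders the write before our read */

    printf("main thread sees x=%d at %p\n", x, (void *)&x);   /* prints 42 */
    return 0;
}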

Threads (and processes) can have wildly different memory bandwidth and latency requirements. Some threads never need more memory than the L1 cache provides; others might need multiple gigabytes. The hardest cases are when multiple threads share a single address space, each demands heavy memory bandwidth, and they need to share large amounts of data with one another.

You can reason through which sorts of cases start falling apart when you're running on a core that has relatively high memory latency and high core-to-core latency.
el01 said:
OK, now a question for all of you... Do you think that if AMD changed the chipset and rewired the memory channels so that each CCX got its own channel, performance would be better?
It would depend on how much memory and memory bandwidth a given task needs. In some cases it'll be better, in others it'll be worse.
VegetableStu said:
Say a system has 4x4GB, so each CCX only has 4GB to work with. Windows typically uses around 3GB at minimum, so if that is written sequentially, only one CCX has the closest access to OS data (and "optimally" it would be "shut off" from the rest of the system to keep it from doing work that's farther away). If it's spread out, the CCXes have to start exchanging notes quite a lot.
This only matters in the case of heavily threaded processes, like video encoding, etc.

Most Windows services are single-threaded with very low bandwidth needs, so they're not going to suffer much. The worst case would be having one such process run per CCX; as far as the OS is concerned, it's just four services running on four of the threads available on the CPU.
VegetableStu said:
Like having two CCXes share the same channels of memory? With the current implementation I'd imagine that's quite hard to do without introducing write conflicts (imagine two guys trying to write their own essays on the same sheet of paper, where sometimes they have to read and edit what the other has written).
You have hit upon the crux of why sharing memory resources across a large number of cores is so difficult to engineer. In reality, you avoid such conflicts by putting the memory requests of multiple cores into a queue for the memory controller to process, sort of like four lanes of a highway scissor-merging into one... That comes with challenges in deciding how to prioritize and schedule requests, and with the overhead of the queues themselves.
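
Not how a real memory controller is built, of course, but a toy software version of that scissor merge (thread and request counts are arbitrary): several producers funnel requests into one locked queue and a single consumer drains it, so the queue is the point where everything serializes.

Code:
/* merge.c - many producers, one consumer, one queue. Build: gcc -pthread merge.c */
#include <pthread.h>
#include <stdio.h>

#define NPRODUCERS   4
#define PER_PRODUCER 5
#define QCAP         64            /* larger than the total request count */

static int             queue[QCAP];
static int             head, tail, count, done_producers;
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  nonempty = PTHREAD_COND_INITIALIZER;

static void *producer(void *arg)
{
    int id = (int)(long)arg;
    for (int i = 0; i < PER_PRODUCER; i++) {
        pthread_mutex_lock(&lock);                 /* the "merge" happens here */
        queue[tail] = id * 100 + i;                /* request tagged with its lane */
        tail = (tail + 1) % QCAP;
        count++;
        pthread_cond_signal(&nonempty);
        pthread_mutex_unlock(&lock);
    }
    pthread_mutex_lock(&lock);
    done_producers++;
    pthread_cond_signal(&nonempty);
    pthread_mutex_unlock(&lock);
    return NULL;
}

static void *consumer(void *arg)
{
    (void)arg;
    pthread_mutex_lock(&lock);
    while (count > 0 || done_producers < NPRODUCERS) {
        while (count == 0 && done_producers < NPRODUCERS)
            pthread_cond_wait(&nonempty, &lock);
        while (count > 0) {
            int req = queue[head];
            head = (head + 1) % QCAP;
            count--;
            printf("serviced request %d\n", req);  /* one at a time, in arrival order */
        }
    }
    pthread_mutex_unlock(&lock);
    return NULL;
}

int main(void)
{
    pthread_t prod[NPRODUCERS], cons;
    pthread_create(&cons, NULL, consumer, NULL);
    for (long i = 0; i < NPRODUCERS; i++)
        pthread_create(&prod[i], NULL, producer, (void *)i);
    for (int i = 0; i < NPRODUCERS; i++)
        pthread_join(prod[i], NULL);
    pthread_join(cons, NULL);
    return 0;
}

The lock plus the single consumer stand in for the arbitration and scheduling that the real hardware has to do, which is where the prioritization headaches come from.
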
el01 said:
Also, on such an already expensive processor, why can't they have a custom memory controller on each processor, essentially a small microcontroller handling the transfer of information across RAM channels, bypassing the cores?
This is essentially how Intel does it. Each core gets a mesh stop which handles routing to the memory controllers, other cores, and the LLC (which is a truly shared last level cache). It takes a lot of engineering manpower to get right. A lot. And each mesh stop is nearly as large and complex as a core.
 
Last edited:

O_and_N

Average Stuffer
Aug 18, 2016
77
18
el01 said:
I don't quite understand your question. Please give some more information and tidy up your English a bit (I understand if you're a non-native speaker, but please try to follow English conventions :)).

The CPU will not be a bottleneck in development. The single-core scores are on par with older i5s and i7s, which is pretty good when it comes to gaming. Also, you must ask yourself, "Do I really need 32 cores and 64 threads?" For your use case, I would get a 1920X or a 2920X. You really don't need 32 cores and 64 threads for Unreal Engine.

P.S. It's good practice (unless you're in an off-topic thread) to create a new thread for a new question. Please keep things on topic and don't go off on tangents too wildly.

Apologies for the typos. I was aiming at the 32-core simply because I wanted lower baking and rendering times in my 8-hour working day. I don't only use Unreal; I do Arnold rendering in Maya and KeyShot renders too. The reason I mentioned Unreal is that I got spooked when I saw those benchmarks in the recent reviews showing a 2990WX underperforming with a GTX 1080 in games, and I do need to test my game work from time to time. That's all.

I noticed that all tests were performed at 1080p, and I remembered that as the resolution goes up, AMD CPUs start to perform closer to Intel ones, until at 4K they reach roughly equal performance. I was just wondering whether the same will apply to the 2990WX if I just test my final product in 2K and 4K. No loss in FPS?
 
  • Like
Reactions: Soul_Est and el01

el01

King of Cable Management
Original poster
Jun 4, 2018
770
588
O_and_N said:
Apologies for the typos. I was aiming at the 32-core simply because I wanted lower baking and rendering times in my 8-hour working day. I don't only use Unreal; I do Arnold rendering in Maya and KeyShot renders too. The reason I mentioned Unreal is that I got spooked when I saw those benchmarks in the recent reviews showing a 2990WX underperforming with a GTX 1080 in games, and I do need to test my game work from time to time. That's all.

I noticed that all tests were performed at 1080p, and I remembered that as the resolution goes up, AMD CPUs start to perform closer to Intel ones, until at 4K they reach roughly equal performance. I was just wondering whether the same will apply to the 2990WX if I just test my final product in 2K and 4K. No loss in FPS?
No worries about Englishing... ;)
Yep... But in a lot of tasks the 2950X performs similarly to or better than the 2990WX (from what I remember). If you're using Linux, though... that's a different story! The 2990WX performs very well under Linux!
 

Gautam

Cable-Tie Ninja
Sep 5, 2016
148
123
O_and_N said:
Apologies for the typos. I was aiming at the 32-core simply because I wanted lower baking and rendering times in my 8-hour working day. I don't only use Unreal; I do Arnold rendering in Maya and KeyShot renders too. The reason I mentioned Unreal is that I got spooked when I saw those benchmarks in the recent reviews showing a 2990WX underperforming with a GTX 1080 in games, and I do need to test my game work from time to time. That's all.

I noticed that all tests were performed at 1080p, and I remembered that as the resolution goes up, AMD CPUs start to perform closer to Intel ones, until at 4K they reach roughly equal performance. I was just wondering whether the same will apply to the 2990WX if I just test my final product in 2K and 4K. No loss in FPS?
I use Unity myself... but AFAIK Unreal is the same: by default, lightmap baking is done on the CPU, unlike gaming. And Maya doesn't benefit much from higher thread counts; as far as I can tell the 8700K is basically the top dog there.

Heavily biased again, but I would go Intel. The biggest benefit is that the 79xx parts clock almost as well as the client CPUs, so they have almost no weaknesses. Lots of content-creation tools don't scale well with thread count.
 

Duality92

Airflow Optimizer
Apr 12, 2018
307
330
Shared resources between CPU cores are always their downfall; look at the FX series from AMD. Those chips physically had 4/6/8 cores, but since each pair of cores shared cache within a module, they acted more like 2/3/4 cores respectively with SMT. It's also the same reason why Pentium Ds (basically two Pentium 4 dies on a substrate) performed much worse than first-generation Core 2 Duos.

If Intel really wanted to, they could do the same and slap two HCC (18-core) dies on a package to create a 36-core monster on an LGA 3647-type substrate, but I feel like they'd rather have a single, much better-structured die than multiple dies bridged by whatever method.
 

Gautam

Cable-Tie Ninja
Sep 5, 2016
148
123
Duality92 said:
Shared resources between CPU cores are always their downfall; look at the FX series from AMD. Those chips physically had 4/6/8 cores, but since each pair of cores shared cache within a module, they acted more like 2/3/4 cores respectively with SMT. It's also the same reason why Pentium Ds (basically two Pentium 4 dies on a substrate) performed much worse than first-generation Core 2 Duos.

If Intel really wanted to, they could do the same and slap two HCC (18-core) dies on a package to create a 36-core monster on an LGA 3647-type substrate, but I feel like they'd rather have a single, much better-structured die than multiple dies bridged by whatever method.
Intel has already had 28-core Xeon SKUs out for about a year. Given the nature of their architecture, they don't need to slap two dies together, because the design is mostly modularized at the core level. The only limitation is that cores generally need to be added in rows or columns.

If they decide that they need to release a 32 core or 36 core part to go head-to-head with AMD, they can do so pretty easily.

It won't be cheap though; that goes without saying.
 

Duality92

Airflow Optimizer
Apr 12, 2018
307
330
Gautam said:
Intel has already had 28-core Xeon SKUs out for about a year. Given the nature of their architecture, they don't need to slap two dies together, because the design is mostly modularized at the core level. The only limitation is that cores generally need to be added in rows or columns.

If they decide that they need to release a 32 core or 36 core part to go head-to-head with AMD, they can do so pretty easily.

It won't be cheap though; that goes without saying.

I know about the 28-core, the Xeon and the marketing stunt we saw at Computex this year.

I suggested the 18-core model because it's smaller than the 28-core die; the 28-core is also already on LGA 3647, so I'm sure they can't fit two of those dies for 56 cores/112 threads.
 

Gautam

Cable-Tie Ninja
Sep 5, 2016
148
123
Duality92 said:
I know about the 28-core, the Xeon and the marketing stunt we saw at Computex this year.

I suggested the 18-core model because it's smaller than the 28-core die; the 28-core is also already on LGA 3647, so I'm sure they can't fit two of those dies for 56 cores/112 threads.
Not the marketing stunt: there have been 165-205W 28-core SKUs out for a while now for the enterprise. There were even 28-core Broadwell parts.

But that's not the point. The point is that Intel doesn't need to slap dies together. They can just add columns and rows of cores.