CPU Threadripper 2 - Where's the bottleneck?

el01

King of Cable Management
Original poster
Jun 4, 2018
770
588
After looking at a lot of benchmarks, I've found that TR2 (especially the 2990WX) is really kinda underwhelming (to me) in performance, especially considering it should hypothetically outperform an i9-7980XE given its Cinebench scores. So I must ask... Where's the bottleneck? Is it:
  • In the RAM (NUMA vs UMA, latency issues on 16 cores of the CPU)
  • In optimization (Windows and other programs are not optimized to use so many damn cores)
  • In the X399 chipset (I don't know how)
  • In the core itself (kinda unlikely considering it's still the Zen core)
  • In clock speed (lower turbo/base on TR2)
Hopefully I'm not a complete idiot and haven't overlooked something obvious...
Thanks!
-el01
 

Soul_Est

SFF Guru
SFFn Staff
Feb 12, 2016
1,534
1,928
el01 said:
After looking at a lot of benchmarks, I've found that TR2 (especially the 2990WX) is really kinda underwhelming (to me) in performance, especially considering it should hypothetically outperform an i9-7980XE given its Cinebench scores. So I must ask... Where's the bottleneck? Is it:
  • In the RAM (NUMA vs UMA, latency issues on 16 cores of the CPU)
  • In optimization (Windows and other programs are not optimized to use so many damn cores)
  • In the X399 chipset (I don't know how)
  • In the core itself (kinda unlikely considering it's still the Zen core)
  • In clock speed (lower turbo/base on TR2)
Hopefully I'm not a complete idiot and haven't overlooked something obvious...
Thanks!
-el01
One of the bigger issues is the older code base of Cinebench. Using the Blender Open Data benchmark may give a better representation considering that it is continuously being updated to support newer CPU instructions.
 
  • Like
Reactions: el01

Phuncz

Lord of the Boards
SFFn Staff
May 9, 2015
5,836
4,906
Above a few cores, many applications struggle to scale well, and with 32 cores this is very much an issue. Not only does Windows become a factor with its questionable scaling, but the software and drivers running on the system also need to scale properly. The 2990WX is not a CPU anyone should consider unless 8 or 16 cores just don't feed the beast of a very thread-heavy application that you have to use.
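
As a rough illustration of why scaling stalls (a minimal sketch, not a measurement of any real application): Amdahl's law caps the speedup of a workload where only a fraction p parallelizes at 1 / ((1 - p) + p/n) on n cores. The fractions below are purely illustrative.

Code:
#include <stdio.h>

/* Amdahl's-law speedup bound: parallel fraction p, n cores. */
static double amdahl(double p, int n)
{
    return 1.0 / ((1.0 - p) + p / n);
}

int main(void)
{
    const double fractions[] = { 0.80, 0.95, 0.99 };   /* illustrative only */
    const int cores[] = { 8, 16, 32 };

    for (int i = 0; i < 3; i++)
        for (int j = 0; j < 3; j++)
            printf("p = %.2f, %2d cores -> max speedup %5.2fx\n",
                   fractions[i], cores[j], amdahl(fractions[i], cores[j]));
    return 0;
}

Even a 95%-parallel workload tops out around 12.5x on 32 cores, which is one reason doubling the core count rarely doubles real-world performance.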

Next there is the issue of memory bandwidth and latency, which some applications are very sensitive to and others barely notice. From what I understand from reading some of the earlier reviews, the 2990WX is limited on memory bandwidth relative to other, more common CPUs and thus isn't very good at memory-heavy tasks. You'll see this with video editing (which already doesn't scale linearly) and compression.

The problem I see with this is that many reviewers just have no clue how to review high-end workstation hardware: they throw Blender, PCMark and Adobe Premiere at the problem and call it a day. Some of those already have issues scaling from 6 to 8 or 8 to 10 cores, so this is only exacerbated. It's not like there haven't been >16-core offerings from Intel for years that were used in workstations, but those machines serve completely different workloads that many reviewers don't understand.

AMD has brought out this behemoth of a CPU, but very few people need to look at it. It is meant for specific use cases and, in my opinion, more for marketing purposes, to show what AMD can bring to the table. The really interesting Threadripper 2000-series CPUs are the 2920X and 2950X: 12 cores with a 4.3 GHz boost and 16 cores with a 4.4 GHz boost for $649 and $899 respectively are nice products, if you can make use of the platform's benefits. Otherwise, stay in the consumer space.

The 2990WX isn't just "MOAR processor" because the product number is bigger than the lower version's. AMD added that "W" to signal that it's for workstation usage, not general usage, just like the Intel 18-core Core i9-7980XE isn't better at everything than their 6-core consumer and 8-core HEDT offerings. It's a lot more complex, which is also true for RAM, GPUs and storage, where the top-priced products aren't always the fastest.
 
Last edited:

Arie

Trash Compacter
Jul 4, 2018
37
70
Limited memory bandwidth and increased latency on half of the cores (dies), because they have no direct connection to RAM. Also, Windows 10 seems to cripple the WX parts somewhat compared to Linux. Still very cool CPUs, but for the WX series you have to be sure your workload is OK with their architecture.
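
To put a number on that local-vs-remote difference yourself, here is a minimal Linux/libnuma sketch (the file name, buffer size and the choice of node 1 as the "remote" node are assumptions; check the actual topology with numactl --hardware first and pick a node that has local memory).

Code:
/* numa_touch.c - walk a buffer bound to the local node vs. one bound to a
 * remote node, from a thread pinned to node 0. Build: gcc -O2 numa_touch.c -lnuma */
#include <numa.h>
#include <stdio.h>
#include <time.h>

#define BUF_SIZE (256UL * 1024 * 1024)   /* 256 MiB, arbitrary */
#define REMOTE_NODE 1                    /* assumption: adjust for your machine */

static double walk(volatile char *buf, size_t size)
{
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (size_t i = 0; i < size; i += 64)   /* one touch per cache line */
        buf[i]++;
    clock_gettime(CLOCK_MONOTONIC, &t1);
    return (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
}

int main(void)
{
    if (numa_available() < 0 || numa_max_node() < 1) {
        fprintf(stderr, "need libnuma and at least two NUMA nodes\n");
        return 1;
    }

    numa_run_on_node(0);                               /* pin ourselves to node 0 */
    char *local  = numa_alloc_onnode(BUF_SIZE, 0);
    char *remote = numa_alloc_onnode(BUF_SIZE, REMOTE_NODE);
    if (!local || !remote) { fprintf(stderr, "allocation failed\n"); return 1; }

    walk(local, BUF_SIZE);  walk(remote, BUF_SIZE);    /* first pass faults pages in */
    printf("local node : %.3f s\n", walk(local,  BUF_SIZE));
    printf("remote node: %.3f s\n", walk(remote, BUF_SIZE));

    numa_free(local,  BUF_SIZE);
    numa_free(remote, BUF_SIZE);
    return 0;
}

On the dies Arie describes there is no local memory at all, so every access from those cores pays the remote penalty.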
 
Last edited:

Gautam

Cable-Tie Ninja
Sep 5, 2016
148
123
@Arie is correct about the memory architecture. I have a heavy bias here as an ex-Intel engineer, as I worked on the mesh network used in Skylake. Great pains and years of engineering work went into allowing for consistent latency and high core-to-memory and core-to-core bandwidth (Intel absolutely kills AMD in the latter when it comes to two arbitrary cores). Again, extreme bias showing here, but AMD's designs are much more akin to just slapping a bunch of cores on a die: literally, their Core Complexes. Even the first iteration of Threadripper required cores in different CCXs to communicate over Infinity Fabric.

In short: AMD can do great in "embarrassingly parallel" tasks, but it's going to be weak when multiple threads need to communicate with each other, especially if they have to concurrently modify a lot of memory. On the flip side, there probably aren't that many applications (or people running them) that warrant this.
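
A minimal sketch of that distinction (pthreads and C11 atomics; thread and iteration counts are arbitrary): each thread bumping its own cache-line-padded counter is the "embarrassingly parallel" case, while all threads bumping one shared counter forces the cores to pass a single cache line back and forth.

Code:
/* contention.c - same work per thread, with and without sharing.
 * Build: gcc -O2 -pthread contention.c */
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>
#include <time.h>

#define NTHREADS 8
#define ITERS    10000000L

typedef struct { _Alignas(64) atomic_long v; } padded_counter;  /* avoids false sharing */

static atomic_long    shared_counter;
static padded_counter private_counter[NTHREADS];

static void *bump_shared(void *arg)
{
    (void)arg;
    for (long i = 0; i < ITERS; i++)
        atomic_fetch_add(&shared_counter, 1);      /* every core fights over one line */
    return NULL;
}

static void *bump_private(void *arg)
{
    padded_counter *mine = arg;
    for (long i = 0; i < ITERS; i++)
        atomic_fetch_add(&mine->v, 1);             /* same work, no cross-core traffic */
    return NULL;
}

static double timed_run(void *(*fn)(void *), int private)
{
    pthread_t t[NTHREADS];
    struct timespec t0, t1;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < NTHREADS; i++)
        pthread_create(&t[i], NULL, fn, private ? (void *)&private_counter[i] : NULL);
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(t[i], NULL);
    clock_gettime(CLOCK_MONOTONIC, &t1);
    return (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
}

int main(void)
{
    printf("private counters: %.3f s\n", timed_run(bump_private, 1));
    printf("shared counter  : %.3f s\n", timed_run(bump_shared, 0));
    return 0;
}

The per-thread work is identical in both runs; only the sharing differs, and that alone usually opens a large gap, which widens further when the cores involved sit on different dies.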
 

el01

King of Cable Management
Original poster
Jun 4, 2018
770
588
OK, now a question for all of you... Do you think that if AMD changed the chipset and rewired the memory channels so that each CCX got its own channel, performance would be better?

Also, it appears that I was watching the wrong type of analysis:
https://www.phoronix.com/scan.php?page=article&item=amd-linux-2990wx&num=11
This shows significantly better performance relative to Intel than the Windows results do...

@Phuncz
I'm reading the AnandTech review and it makes more sense now... But I do feel like the 2990WX was "hyped" as an enthusiast's CPU for completely destroying benchmarks (such as Fire Strike), when instead it's essentially an EPYC with higher clocks and a higher TDP.
 
Last edited:
  • Like
Reactions: Soul_Est

O_and_N

Average Stuffer
Aug 18, 2016
77
18
Phuncz said:
Above a few cores, many applications struggle to scale well, and with 32 cores this is very much an issue. Not only does Windows become a factor with its questionable scaling, but the software and drivers running on the system also need to scale properly. The 2990WX is not a CPU anyone should consider unless 8 or 16 cores just don't feed the beast of a very thread-heavy application that you have to use.

Next there is the issue of memory bandwidth and latency, which some applications are very sensitive to and others barely notice. From what I understand from reading some of the earlier reviews, the 2990WX is limited on memory bandwidth relative to other, more common CPUs and thus isn't very good at memory-heavy tasks. You'll see this with video editing (which already doesn't scale linearly) and compression.

The problem I see with this is that many reviewers just have no clue how to review high-end workstation hardware: they throw Blender, PCMark and Adobe Premiere at the problem and call it a day. Some of those already have issues scaling from 6 to 8 or 8 to 10 cores, so this is only exacerbated. It's not like there haven't been >16-core offerings from Intel for years that were used in workstations, but those machines serve completely different workloads that many reviewers don't understand.

AMD has brought out this behemoth of a CPU, but very few people need to look at it. It is meant for specific use cases and, in my opinion, more for marketing purposes, to show what AMD can bring to the table. The really interesting Threadripper 2000-series CPUs are the 2920X and 2950X: 12 cores with a 4.3 GHz boost and 16 cores with a 4.4 GHz boost for $649 and $899 respectively are nice products, if you can make use of the platform's benefits. Otherwise, stay in the consumer space.

The 2990WX isn't just "MOAR processor" because the product number is bigger than the lower version's. AMD added that "W" to signal that it's for workstation usage, not general usage, just like the Intel 18-core Core i9-7980XE isn't better at everything than their 6-core consumer and 8-core HEDT offerings. It's a lot more complex, which is also true for RAM, GPUs and storage, where the top-priced products aren't always the fastest.


So as not to start a new thread, I would like to ask for some opinions. I just finished reading all the benchmarks I could find and I agree that most of the time the tests were not at pro-level usage. And I'm disappointed at the single-thread performance in games. I know it's not about gaming, but in my case (apart from modeling packages) I'm using Unreal Engine, which is a lot like "playing a game".
If I get the 2990WX I suspect that rendering, compiling and light baking will be a time saver, but if I start game testing in the editor and world building, the CPU will be a bottleneck and give me low FPS, in theory.

My question is: if, for example, I switch my monitor to 4K and I start developing and testing my game in 2K and even 4K, will that (mostly) eliminate the low framerate? Only 2-4 cores max are used in that situation. (Disabling cores in Ryzen Master is a waste of time.)

Or should I just look away and... take the 7980XE?
 

el01

King of Cable Management
Original poster
Jun 4, 2018
770
588
O_and_N said:
So as not to start a new thread, I would like to ask for some opinions. I just finished reading all the benchmarks I could find and I agree that most of the time the tests were not at pro-level usage. And I'm disappointed at the single-thread performance in games. I know it's not about gaming, but in my case (apart from modeling packages) I'm using Unreal Engine, which is a lot like "playing a game".
If I get the 2990WX I suspect that rendering, compiling and light baking will be a time saver, but if I start game testing in the editor and world building, the CPU will be a bottleneck and give me low FPS, in theory.

My question is: if, for example, I switch my monitor to 4K and I start developing and testing my game in 2K and even 4K, will that (mostly) eliminate the low framerate? Only 2-4 cores max are used in that situation. (Disabling cores in Ryzen Master is a waste of time.)

Or should I just look away and... take the 7980XE?
I don't quite understand your question. Please give some more information and tidy up your English a bit (I understand if you're a non-native speaker, but please try to follow English conventions :)).

The CPU will not be a bottleneck in development. The single-core scores are on par with older i5s and i7s, which is pretty good when it comes to gaming. Also, you must ask yourself, "Do I really need 32 cores and 64 threads?" For your use case, I would get a 1920X or a 2920X. You really don't need 32 cores and 64 threads for Unreal Engine.

P.S. It's good practice (unless you're in an off-topic thread) to create a new thread for a new question. Please keep things on topic and don't go off on tangents too wildly.
 

VegetableStu

Shrink Ray Wielder
Aug 18, 2016
1,949
2,619
Gonna copy-paste some ramblings I made on the LTT forums regarding this (I had the same curiosity):
el01 said:
Do you think that if AMD changed the chipset and rewired the memory channels so that each CCX got its own channel, performance would be better?

I did some thinking...

I'm still on the memory-topology traffic here (I can't claim to know any of this to save my life), and I'm starting to think one channel per CCX might be a bad idea, because of how much RAM sits on one stick and because each CCX would have memory control over one particular channel (or two, in regular real-world cases).

Say a system has 4x4GB, so each CCX only has 4GB to work with. Windows typically uses around 3GB at minimum, so if that is written sequentially, only one CCX has the closest access to OS data (and "optimally" it would be "shut off" from the rest of the system to keep it from doing work that's farther away). If it's spread out, the CCXes have to start exchanging notes quite a lot.

And that's before considering workloads that use a lot of RAM. Three quarters of the actual RAM data would have to be "paged" from the other nodes, and the CCXes would already be busy managing their own. Ideally, having more RAM per channel and scheduling tasks onto a freer CCX might help, but obviously that's not how it works, because I'm just a guy on the internet without any means of experimentation.

I'm already drawing parallels from that thought to the 2950X (to the point of considering buying more RAM), but I'm kinda glad it's pretty robust for what it is, despite being less ideal than a unified-memory-controller situation. I kinda hope Threadripper gen 3 somehow does away with CCXes having their own memory controllers and relies on something more unified, but the whole point of the design is taking multiple single chips and melding them into a many-core single unit, so...

Unless all the memory controllers answer to a head librarian, but that's yet another thing to add and I'm overthinking this way too much. Can't wait for in-depth reviews and experiments, and can't wait for next year's generation.
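
For what it's worth, you can actually watch where the OS put your pages relative to those nodes. A minimal Linux sketch using move_pages() in query mode (buffer size and page count are arbitrary; build with -lnuma):

Code:
/* query_nodes.c - allocate a buffer, touch it, then ask the kernel which NUMA
 * node each page ended up on. Build: gcc query_nodes.c -lnuma */
#include <numaif.h>      /* move_pages() */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define NPAGES 8

int main(void)
{
    long page_size = sysconf(_SC_PAGESIZE);
    char *buf = malloc(NPAGES * page_size);
    if (!buf) return 1;
    memset(buf, 0, NPAGES * page_size);      /* first touch decides placement */

    void *pages[NPAGES];
    int   status[NPAGES];
    for (int i = 0; i < NPAGES; i++)
        pages[i] = buf + i * page_size;

    /* nodes == NULL: don't move anything, just report where each page lives */
    if (move_pages(0, NPAGES, pages, NULL, status, 0) != 0) {
        perror("move_pages");
        return 1;
    }
    for (int i = 0; i < NPAGES; i++)
        printf("page %d -> NUMA node %d\n", i, status[i]);

    free(buf);
    return 0;
}

Run it pinned to different nodes (e.g. with numactl --cpunodebind) and the placement will usually follow wherever the touching thread ran, which is exactly the "who sits closest to the data" question above.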
 

el01

King of Cable Management
Original poster
Jun 4, 2018
770
588
VegetableStu said:
Gonna copy-paste some ramblings I made on the LTT forums regarding this (I had the same curiosity):

I did some thinking...

I'm still on the memory-topology traffic here (I can't claim to know any of this to save my life), and I'm starting to think one channel per CCX might be a bad idea, because of how much RAM sits on one stick and because each CCX would have memory control over one particular channel (or two, in regular real-world cases).

Say a system has 4x4GB, so each CCX only has 4GB to work with. Windows typically uses around 3GB at minimum, so if that is written sequentially, only one CCX has the closest access to OS data (and "optimally" it would be "shut off" from the rest of the system to keep it from doing work that's farther away). If it's spread out, the CCXes have to start exchanging notes quite a lot.

And that's before considering workloads that use a lot of RAM. Three quarters of the actual RAM data would have to be "paged" from the other nodes, and the CCXes would already be busy managing their own. Ideally, having more RAM per channel and scheduling tasks onto a freer CCX might help, but obviously that's not how it works, because I'm just a guy on the internet without any means of experimentation.

I'm already drawing parallels from that thought to the 2950X (to the point of considering buying more RAM), but I'm kinda glad it's pretty robust for what it is, despite being less ideal than a unified-memory-controller situation. I kinda hope Threadripper gen 3 somehow does away with CCXes having their own memory controllers and relies on something more unified, but the whole point of the design is taking multiple single chips and melding them into a many-core single unit, so...

Unless all the memory controllers answer to a head librarian, but that's yet another thing to add and I'm overthinking this way too much. Can't wait for in-depth reviews and experiments, and can't wait for next year's generation.
This is probably gonna sound very stupid and very uninformed, but why don't they consolidate two 8-core silicon units with some sort of bridge in the middle, bypassing Infinity Fabric, and then have two dies? Also, on such an already expensive processor, why can't they have a custom memory controller on each processor, essentially a small microcontroller handling the transfer of information across RAM channels, bypassing the cores?
 
  • Like
Reactions: Phuncz

VegetableStu

Shrink Ray Wielder
Aug 18, 2016
1,949
2,619
(Keep in mind that I'm not a computer scientist ,_, so anyone who knows better, please spare me your thunder.)

el01 said:
but why don't they consolidate two 8-core silicon units with some sort of bridge in the middle, bypassing Infinity Fabric, and then have two dies?

Like having two CCXes share the same channels of memory? With the current implementation I'd imagine that's quite hard to do without introducing write conflicts (imagine two guys trying to write their own essays on the same sheet of paper, where sometimes they have to read and edit what the other has written).

el01 said:
Also, on such an already expensive processor, why can't they have a custom memory controller on each processor, essentially a small microcontroller handling the transfer of information across RAM channels, bypassing the cores?
It's one extra thing to make, but whether it would actually reduce "RAM paging" is something I'd love to see as well.
 

Gautam

Cable-Tie Ninja
Sep 5, 2016
148
123
I'm going to (try to) make this brief. A process has a memory address space, and one process cannot see the memory of another process (e.g. each Chrome tab is its own process). Each process also gets the illusion from the operating system that it has access to the entirety of physical memory: if you have 32GB of RAM, each Chrome tab "thinks" it has 32GB available to it. In reality, the OS is constantly paging data between main memory and disk to maintain this illusion.
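
A minimal sketch of that isolation (plain POSIX, nothing Threadripper-specific): after fork(), parent and child report the same virtual address for x, yet the child's write never shows up in the parent.

Code:
/* separate_spaces.c - two processes, same virtual address, different contents. */
#include <stdio.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
    int x = 1;

    pid_t pid = fork();
    if (pid < 0) { perror("fork"); return 1; }

    if (pid == 0) {                 /* child: gets its own copy of the address space */
        x = 42;
        printf("child : &x=%p x=%d\n", (void *)&x, x);
        _exit(0);
    }

    wait(NULL);                     /* let the child finish first */
    printf("parent: &x=%p x=%d\n", (void *)&x, x);   /* still prints 1 */
    return 0;
}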

A process can have multiple threads. Each thread is a stream of instructions, just like a process, but two threads in one process share the memory of that process. Heavily threaded processes are what make things tricky, since you typically have multiple threads running on multiple cores, and if they belong to a single process they have to be able to communicate and share data.
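
And the flip side, sketched the same way: two threads inside one process really do touch the same x, which is exactly what creates the coordination traffic on a many-core part.

Code:
/* shared_space.c - threads in one process share memory. Build: gcc -pthread shared_space.c */
#include <pthread.h>
#include <stdio.h>

static int x = 1;

static void *writer(void *arg)
{
    (void)arg;
    x = 42;                         /* same x the main thread will read */
    return NULL;
}

int main(void)
{
    pthread_t t;
    pthread_create(&t, NULL, writer, NULL);
    pthread_join(t, NULL);          /* the join orders the write before our read */

    printf("main thread sees x=%d at %p\n", x, (void *)&x);   /* prints 42 */
    return 0;
}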

Threads (and processes) can have wildly different memory bandwidth and latency requirements. Some threads never need more memory than the L1 cache provides; others might need multiple gigabytes. The hardest cases are when multiple threads share a single address space, each demands heavy memory bandwidth, and they need to share large amounts of data with one another.

You can reason through which sorts of cases start falling apart when you're running on a core that has relatively high memory latency and high core-to-core latency.
el01 said:
OK, now a question for all of you... Do you think that if AMD changed the chipset and rewired the memory channels so that each CCX got its own channel, performance would be better?
It would depend on how much memory and memory bandwidth a given task needs. In some cases it'll be better, in others it'll be worse.
VegetableStu said:
Say a system has 4x4GB, so each CCX only has 4GB to work with. Windows typically uses around 3GB at minimum, so if that is written sequentially, only one CCX has the closest access to OS data (and "optimally" it would be "shut off" from the rest of the system to keep it from doing work that's farther away). If it's spread out, the CCXes have to start exchanging notes quite a lot.
This only matters in the case of heavily threaded processes, like video encoding, etc.

Most Windows services are single-threaded with very low bandwidth needs, so they're not going to suffer much. The worst case would be having one such process run per CCX; as far as the OS is concerned, it's just four services running on four of the threads available on the CPU.
VegetableStu said:
Like having two CCXes share the same channels of memory? With the current implementation I'd imagine that's quite hard to do without introducing write conflicts (imagine two guys trying to write their own essays on the same sheet of paper, where sometimes they have to read and edit what the other has written).
You have hit upon the crux of why sharing memory resources across a large number of cores is so difficult to engineer. In reality, you avoid such conflicts by putting the memory requests of multiple cores into a queue for the memory controller to process, sort of like four lanes of a highway scissor-merging into one... That comes with challenges in deciding how to prioritize and schedule requests, and with the overhead of the queues themselves.
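
Not how a real memory controller is built, of course, but a toy software version of that scissor merge (thread and request counts are arbitrary): several producers funnel requests into one locked queue and a single consumer drains it, so the queue is the point where everything serializes.

Code:
/* merge.c - many producers, one consumer, one queue. Build: gcc -pthread merge.c */
#include <pthread.h>
#include <stdio.h>

#define NPRODUCERS   4
#define PER_PRODUCER 5
#define QCAP         64            /* larger than the total request count */

static int             queue[QCAP];
static int             head, tail, count, done_producers;
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  nonempty = PTHREAD_COND_INITIALIZER;

static void *producer(void *arg)
{
    int id = (int)(long)arg;
    for (int i = 0; i < PER_PRODUCER; i++) {
        pthread_mutex_lock(&lock);                 /* the "merge" happens here */
        queue[tail] = id * 100 + i;                /* request tagged with its lane */
        tail = (tail + 1) % QCAP;
        count++;
        pthread_cond_signal(&nonempty);
        pthread_mutex_unlock(&lock);
    }
    pthread_mutex_lock(&lock);
    done_producers++;
    pthread_cond_signal(&nonempty);
    pthread_mutex_unlock(&lock);
    return NULL;
}

static void *consumer(void *arg)
{
    (void)arg;
    pthread_mutex_lock(&lock);
    while (count > 0 || done_producers < NPRODUCERS) {
        while (count == 0 && done_producers < NPRODUCERS)
            pthread_cond_wait(&nonempty, &lock);
        while (count > 0) {
            int req = queue[head];
            head = (head + 1) % QCAP;
            count--;
            printf("serviced request %d\n", req);  /* one at a time, in arrival order */
        }
    }
    pthread_mutex_unlock(&lock);
    return NULL;
}

int main(void)
{
    pthread_t prod[NPRODUCERS], cons;
    pthread_create(&cons, NULL, consumer, NULL);
    for (long i = 0; i < NPRODUCERS; i++)
        pthread_create(&prod[i], NULL, producer, (void *)i);
    for (int i = 0; i < NPRODUCERS; i++)
        pthread_join(prod[i], NULL);
    pthread_join(cons, NULL);
    return 0;
}

The lock plus the single consumer stand in for the arbitration and scheduling that the real hardware has to do, which is where the prioritization headaches come from.
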
el01 said:
Also, on such an already expensive processor, why can't they have a custom memory controller on each processor, essentially a small microcontroller handling the transfer of information across RAM channels, bypassing the cores?
This is essentially how Intel does it. Each core gets a mesh stop which handles routing to the memory controllers, other cores, and the LLC (which is a truly shared last level cache). It takes a lot of engineering manpower to get right. A lot. And each mesh stop is nearly as large and complex as a core.
 
Last edited:

O_and_N

Average Stuffer
Aug 18, 2016
77
18
el01 said:
I don't quite understand your question. Please give some more information and tidy up your English a bit (I understand if you're a non-native speaker, but please try to follow English conventions :)).

The CPU will not be a bottleneck in development. The single-core scores are on par with older i5s and i7s, which is pretty good when it comes to gaming. Also, you must ask yourself, "Do I really need 32 cores and 64 threads?" For your use case, I would get a 1920X or a 2920X. You really don't need 32 cores and 64 threads for Unreal Engine.

P.S. It's good practice (unless you're in an off-topic thread) to create a new thread for a new question. Please keep things on topic and don't go off on tangents too wildly.

Apologies for the typos. I was aiming at the 32-core simply because I wanted lower baking and rendering times in my 8-hour working day. I don't only use Unreal; I do Arnold rendering in Maya and KeyShot renders too. The reason I mentioned Unreal is that I got spooked when I saw those benchmarks in the recent reviews showing a 2990WX underperforming with a GTX 1080 in games, and I do need to test my game work from time to time. That's all.

I noticed that all tests were performed at 1080p, and I remembered that as the resolution goes up, AMD CPUs start to perform closer to Intel ones, until at 4K they reach roughly equal performance. I was just wondering whether the same will apply to the 2990WX if I just test my final product in 2K and 4K. No loss in FPS?
 
  • Like
Reactions: Soul_Est and el01

el01

King of Cable Management
Original poster
Jun 4, 2018
770
588
O_and_N said:
Apologies for the typos. I was aiming at the 32-core simply because I wanted lower baking and rendering times in my 8-hour working day. I don't only use Unreal; I do Arnold rendering in Maya and KeyShot renders too. The reason I mentioned Unreal is that I got spooked when I saw those benchmarks in the recent reviews showing a 2990WX underperforming with a GTX 1080 in games, and I do need to test my game work from time to time. That's all.

I noticed that all tests were performed at 1080p, and I remembered that as the resolution goes up, AMD CPUs start to perform closer to Intel ones, until at 4K they reach roughly equal performance. I was just wondering whether the same will apply to the 2990WX if I just test my final product in 2K and 4K. No loss in FPS?
No worries about Englishing... ;)
Yep... But in a lot of tasks the 2950X performs similarly to or better than the 2990WX (from what I remember). If you're using Linux, though... that's a different story! The 2990WX performs very well under Linux!
 

Gautam

Cable-Tie Ninja
Sep 5, 2016
148
123
O_and_N said:
Apologies for the typos. I was aiming at the 32-core simply because I wanted lower baking and rendering times in my 8-hour working day. I don't only use Unreal; I do Arnold rendering in Maya and KeyShot renders too. The reason I mentioned Unreal is that I got spooked when I saw those benchmarks in the recent reviews showing a 2990WX underperforming with a GTX 1080 in games, and I do need to test my game work from time to time. That's all.

I noticed that all tests were performed at 1080p, and I remembered that as the resolution goes up, AMD CPUs start to perform closer to Intel ones, until at 4K they reach roughly equal performance. I was just wondering whether the same will apply to the 2990WX if I just test my final product in 2K and 4K. No loss in FPS?
I use Unity myself... but AFAIK Unreal is the same: by default, lightmap baking is done on the CPU, unlike gaming. And Maya doesn't benefit much from higher thread counts; as far as I can tell the 8700K is basically the top dog there.

Heavily biased again, but I would go Intel. The biggest benefit is that the 79xx parts clock almost as well as the client CPUs, so they have almost no weaknesses. Lots of content-creation tools don't scale well with thread count.
 

Duality92

Airflow Optimizer
Apr 12, 2018
307
330
Shared resources between CPU cores are always their downfall; look at the FX series from AMD. Those chips physically had 4/6/8 cores, but since each pair of cores shared cache within a module, they acted more like 2/3/4 cores respectively with SMT. It's also the same reason why Pentium Ds (basically two Pentium 4 dies on a substrate) performed much worse than first-generation Core 2 Duos.

If Intel really wanted to, they could do the same and slap two HCC (18-core) dies on a package to create a 36-core monster on an LGA 3647-type substrate, but I feel like they'd rather have a single, much better-structured die than multiple dies bridged by whatever method.
 

Gautam

Cable-Tie Ninja
Sep 5, 2016
148
123
Duality92 said:
Shared resources between CPU cores are always their downfall; look at the FX series from AMD. Those chips physically had 4/6/8 cores, but since each pair of cores shared cache within a module, they acted more like 2/3/4 cores respectively with SMT. It's also the same reason why Pentium Ds (basically two Pentium 4 dies on a substrate) performed much worse than first-generation Core 2 Duos.

If Intel really wanted to, they could do the same and slap two HCC (18-core) dies on a package to create a 36-core monster on an LGA 3647-type substrate, but I feel like they'd rather have a single, much better-structured die than multiple dies bridged by whatever method.
Intel has already had 28-core Xeon SKUs out for about a year. Given the nature of their architecture, they don't need to slap two dies together, because the design is mostly modularized at the core level. The only limitation is that cores generally need to be added in rows or columns.

If they decide that they need to release a 32 core or 36 core part to go head-to-head with AMD, they can do so pretty easily.

It won't be cheap though; that goes without saying.
 

Duality92

Airflow Optimizer
Apr 12, 2018
307
330
Gautam said:
Intel has already had 28-core Xeon SKUs out for about a year. Given the nature of their architecture, they don't need to slap two dies together, because the design is mostly modularized at the core level. The only limitation is that cores generally need to be added in rows or columns.

If they decide that they need to release a 32 core or 36 core part to go head-to-head with AMD, they can do so pretty easily.

It won't be cheap though; that goes without saying.

I know about the 28-core, the Xeon and the marketing stunt we saw at Computex this year.

I suggested the 18-core model because it's smaller than the 28-core die; the 28-core is also already on LGA 3647, so I'm sure they can't fit two of those dies for 56 cores/112 threads.
 

Gautam

Cable-Tie Ninja
Sep 5, 2016
148
123
Duality92 said:
I know about the 28-core, the Xeon and the marketing stunt we saw at Computex this year.

I suggested the 18-core model because it's smaller than the 28-core die; the 28-core is also already on LGA 3647, so I'm sure they can't fit two of those dies for 56 cores/112 threads.
Not the marketing stunt: there have been 165-205W 28-core SKUs out for a while now for the enterprise. There were even 28-core Broadwell parts.

But that's not the point. The point is that Intel doesn't need to slap dies together. They can just add columns and rows of cores.