I'm going to (try to) make this brief. A process has its own memory address space; one process cannot see the memory of another. E.g. each Chrome tab is its own process. Each process also gets the illusion from the operating system that it has all of physical memory to itself. E.g. if you have 32GB of RAM, each Chrome tab "thinks" it has 32GB available to it. In reality, the OS is constantly paging data between main memory and disk to maintain this illusion.
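If you want to see that isolation for yourself, here's a minimal POSIX-only sketch (Linux/macOS, not Windows): after a fork(), the parent and child print the same virtual address for a variable, yet the child's write never reaches the parent, because each process's address space is backed by its own physical pages.

```cpp
// Sketch (POSIX-only): two processes print the same virtual address for `value`,
// but the child's write is invisible to the parent -- each process has its own
// address space backed by different physical pages.
#include <cstdio>
#include <sys/wait.h>
#include <unistd.h>

int main() {
    int value = 1;
    pid_t pid = fork();            // duplicate this process
    if (pid == 0) {                // child
        value = 42;                // copy-on-write: only the child's page changes
        std::printf("child : &value=%p value=%d\n", (void*)&value, value);
        return 0;
    }
    wait(nullptr);                 // parent waits for the child to finish
    std::printf("parent: &value=%p value=%d\n", (void*)&value, value);
    return 0;
}
```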
A process can have multiple threads. Each thread is a stream of instructions, just like a process, but two threads in one process share that process's memory. Heavily threaded processes are what make things tricky, since you typically have multiple threads running on multiple cores. If they belong to a single process, they have to be able to communicate and share data.
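As a quick illustration (a minimal sketch, nothing more): two threads in the same process can touch the exact same variable directly, which is also why they have to coordinate their accesses.

```cpp
// Minimal sketch: two threads in one process increment a single shared counter.
// They see the same address space, so they coordinate here via std::atomic.
#include <atomic>
#include <cstdio>
#include <thread>

int main() {
    std::atomic<long> counter{0};    // lives in the process's shared address space

    auto work = [&counter] {
        for (int i = 0; i < 1'000'000; ++i)
            counter.fetch_add(1, std::memory_order_relaxed);
    };

    std::thread t1(work), t2(work);  // both threads access `counter` directly
    t1.join();
    t2.join();

    std::printf("counter = %ld\n", counter.load());  // prints 2000000
    return 0;
}
```

The atomic is only there to keep the example free of data races; the point is that both threads see one and the same `counter`.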
Threads (and processes) can have wildly different memory bandwidth and latency requirements. Some threads never need more memory than the L1 cache provides them with. Others might need to churn through multiple gigabytes. The hardest cases are multiple threads sharing a single address space, each demanding heavy memory bandwidth and needing to share large amounts of data with one another.
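To make that concrete, here's a hedged little sketch (the buffer sizes are my own illustrative picks): the same summation loop does the same amount of arithmetic over an L1-sized buffer and over a DRAM-sized buffer, but puts completely different pressure on the memory subsystem.

```cpp
// Sketch: identical total work, very different memory footprint.
// The small buffer stays resident in L1; the large one streams from DRAM.
#include <chrono>
#include <cstdio>
#include <numeric>
#include <vector>

static long sum_repeatedly(const std::vector<int>& buf, int passes) {
    long total = 0;
    for (int p = 0; p < passes; ++p)
        total += std::accumulate(buf.begin(), buf.end(), 0L);
    return total;
}

int main() {
    std::vector<int> small(4 * 1024, 1);         // ~16 KB: fits in L1
    std::vector<int> large(64 * 1024 * 1024, 1); // ~256 MB: far beyond any cache

    auto time_it = [](auto&& fn) {
        auto t0 = std::chrono::steady_clock::now();
        long r = fn();
        auto t1 = std::chrono::steady_clock::now();
        std::printf("result=%ld  %.2f ms\n", r,
                    std::chrono::duration<double, std::milli>(t1 - t0).count());
    };

    // Same number of element reads in both runs (4096 * 16384 == 64M).
    time_it([&] { return sum_repeatedly(small, 16384); });
    time_it([&] { return sum_repeatedly(large, 1); });
    return 0;
}
```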
You can reason through which of these cases start falling apart when you're running on a core that has relatively high memory latency and high core-to-core latency.
OK, now a question for all of you... Do you think that if AMD changed the chipset and rewired the memory channels so that each CCX got one channel to itself, performance would be better?
It would depend on how much memory and memory bandwidth a given task needs. In some cases it'll be better, in others it'll be worse.
Say a system has 4x4GB, and each CCX only has 4GB to contend with. Windows typically uses 3GB at minimum, so if it writes that out sequentially, only one CCX has the closest access to the OS's files (and "optimally" it would be "shut off" from the rest of the system to prevent it from doing work that's farther away). If it's spread out, the CCXes have to start exchanging notes quite a lot.
This only matters in the case of heavily threaded processes, like video encoding, etc.
Most Windows services are single-threaded with very low bandwidth needs, so they're not going to suffer much. The worst case would be having one such process run per CCX. As far as the OS is concerned, it's just 4 services running on 4 of the threads available on the CPU.
Like having two CCXes share the same memory channels? With the current implementation I'd imagine that's quite hard to do without introducing write conflicts. Imagine two guys trying to write their own essays, but on the same sheet of paper, and sometimes they have to read and edit what the other has written.
You have hit upon the crux of why sharing memory resources across a large number of cores is so difficult to engineer. In reality, you avoid such conflicts by putting the memory requests of multiple cores into a queue for the memory controller to process, sort of like four lanes of a highway scissor-merging into one... That comes with challenges in deciding how to prioritize and schedule requests, plus the overhead of maintaining those queues.
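Here's a toy sketch of that merge, just to show the shape of it. The per-core queues and the round-robin arbitration are my own illustrative assumptions, not how any real memory controller actually schedules requests.

```cpp
// Toy "scissor merge": requests from four cores sit in per-core queues and a
// single memory controller drains them round-robin, one per core per pass.
#include <array>
#include <cstdio>
#include <deque>
#include <string>

struct Request {
    int core;
    std::string op;   // "read" or "write"
    unsigned addr;
};

int main() {
    std::array<std::deque<Request>, 4> per_core_queue;

    // Each core issues a few requests independently.
    for (int core = 0; core < 4; ++core)
        for (unsigned i = 0; i < 3; ++i)
            per_core_queue[core].push_back(
                {core, i % 2 ? "write" : "read", core * 0x1000 + i * 64});

    // The controller arbitrates: service at most one request per core per round.
    bool serviced_any = true;
    while (serviced_any) {
        serviced_any = false;
        for (auto& q : per_core_queue) {
            if (q.empty()) continue;
            Request r = q.front();
            q.pop_front();
            std::printf("controller services core %d: %-5s @ 0x%05x\n",
                        r.core, r.op.c_str(), r.addr);
            serviced_any = true;
        }
    }
    return 0;
}
```

Even in this toy version you can see the scheduling question: round-robin is fair but ignores which requests are latency-critical, which is exactly the prioritization problem mentioned above.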
Also, on such an already expensive processor, why can't they put a custom memory controller on each processor, essentially a small microcontroller that handles moving data across the RAM channels, bypassing the cores?
This is essentially how Intel does it. Each core gets a mesh stop that handles routing to the memory controllers, the other cores, and the LLC (which is a truly shared last-level cache). It takes a lot of engineering manpower to get right. A lot. And each mesh stop is nearly as large and complex as a core.