While AMD’s new Ryzen processors offer impressive performance in workloads such as software compilation, media encoding, 3D rendering, and indeed anything that can take advantage of the 8 cores and 16 simultaneous threads, certain aspects of Ryzen’s gaming performance were uneven.
It’s still a very strong performer in games, especially for those who like to stream their gameplay to Twitch, but not consistently so. Some games that were expected to perform well on Ryzen didn’t. Testers also observed that there were some troublesome interactions with both power management and Ryzen’s simultaneous multithreading (SMT), with certain titles showing unexpectedly high performance drop-offs from having these features enabled. There was widespread hope that some combination of game patches and perhaps even operating system changes would go some way toward boosting Ryzen’s gaming performance, or at least, making Ryzen perform in a more consistent way.
The last few weeks have seen the release of a couple of game patches designed to address certain Ryzen issues. AMD has also released guidance to game developers on how best to use its processor, as well as a new power management profile for Windows 10. Taken together, these give us some insight into the complexities of developing game software for modern processors and some idea of what kind of performance gains gamers might hope to see.
Patches make everything better
The first big Ryzen patch was for Ashes of the Singularity. Ryzen’s performance in Ashes was arguably one of the more surprising findings in the initial benchmarking. The game has been widely used as a kind of showcase for the advantages of DirectX 12 and the multithreaded scaling that it shows. We spoke to the game’s developers, and they told us that its engine splits up the work it has to do between multiple cores automatically.
In general, the Ryzen 1800X performed at about the same level as Intel’s Broadwell-E 6900K. Both parts are 8-core, 16-thread chips, and while Broadwell-E has a modest instructions-per-cycle advantage in most workloads, Ryzen’s higher clock speed is enough to make up for that deficit. But in Ashes of the Singularity under DirectX 12, the 6900K had average frame rates about 25 percent better than the AMD chip.
In late March, Oxide/Stardock released a Ryzen performance update for Ashes, and it has gone a long way toward closing that gap. PC Perspective tested the update, and depending on graphics settings and memory clock speeds, Ryzen’s average frame rate went up by between 17 and 31 percent. The 1800X still trails the 6900K, but now the gap is about 9 percent, or even less with overclocked memory (but we’ll talk more about memory later on).
It’s not entirely known what Oxide and Stardock changed in the patch (we’ve asked but are still waiting on an answer), but there is some credible speculation that there were two issues (possibly intertwined) at play, both related to how data is loaded and stored in memory.
Out-of-order execution is a complicated thing
While much has been made of Ryzen’s cache layout, and in particular its large, split level 3 cache, the Ashes changes aren’t believed to concern the cache. Instead, they relate to the processor’s load and store queues. The processor does not simply read and write directly from and to cache or memory. Instead, reads (loads) and writes (stores) are buffered. The reason for this is that the processor executes instructions speculatively and out of order, but the results of that execution (the actual reads and writes to memory) need to occur in the order that the program specifies, and speculative writes that shouldn’t actually occur need to be cancelled. The buffers are where this all happens.
For example, branch prediction means that the processor might start executing a set of instructions without knowing for certain if it should skip over them instead. If those instructions perform writes to memory, the write is put into the store buffer. If the processor subsequently determines that the branch predictor was correct, the store can be retired and written to memory. But if it discovers that the branch predictor was wrong, and the instructions should never have been executed at all, it can invalidate the store in the store buffer, canceling the write to memory before any other core can see it.
The processor can use the store buffer to fulfill load requests, too; if a store in the buffer should come before a load, the buffered store can be used to provide the value that would otherwise be read from memory, a process called store forwarding.
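To make the mechanism concrete, here is a deliberately tiny software model of a store buffer. It is a sketch, not a description of how Ryzen or any real chip is built: the names (ToyCore, toy_store, toy_load, toy_retire, toy_squash) are invented for illustration, it tracks single bytes only, and it squashes the whole buffer at once, where real hardware would discard only the stores younger than the mispredicted branch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SB_CAPACITY 8   /* hypothetical buffer depth */
#define MEM_SIZE    16  /* hypothetical size of "main memory" in bytes */

typedef struct {
    size_t  addr;
    uint8_t value;
} PendingStore;

typedef struct {
    uint8_t      memory[MEM_SIZE];      /* what other cores could see */
    PendingStore pending[SB_CAPACITY];  /* speculative stores, oldest first */
    size_t       count;
} ToyCore;

/* A speculative store goes into the buffer, not into memory. */
void toy_store(ToyCore *c, size_t addr, uint8_t value)
{
    assert(addr < MEM_SIZE && c->count < SB_CAPACITY);
    c->pending[c->count].addr = addr;
    c->pending[c->count].value = value;
    c->count++;
}

/* A load checks the buffer first, youngest store first (store forwarding),
 * and falls back to memory only if no pending store matches. */
uint8_t toy_load(const ToyCore *c, size_t addr)
{
    assert(addr < MEM_SIZE);
    for (size_t i = c->count; i-- > 0;) {
        if (c->pending[i].addr == addr)
            return c->pending[i].value;
    }
    return c->memory[addr];
}

/* Branch predicted correctly: drain the stores to memory in program order. */
void toy_retire(ToyCore *c)
{
    for (size_t i = 0; i < c->count; ++i)
        c->memory[c->pending[i].addr] = c->pending[i].value;
    c->count = 0;
}

/* Branch mispredicted: drop the stores before memory ever sees them. */
void toy_squash(ToyCore *c)
{
    c->count = 0;
}
```

In this model, a toy_store followed by a toy_load of the same address returns the new value even though memory is untouched, and a toy_squash makes the write vanish as if it had never been issued.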
Managing these buffers and their interactions with out-of-order execution is complex. The processor has to make sure that, for example, writes to the same location are handled properly and that the writes show up in the correct order.
Certain sequences of instructions can cause performance problems. Intel’s optimization guides contain tables showing which combinations can forward and which cannot; the exact results depend not just on the architecture of the chip but on the size of the store and the memory addresses being used. The patterns are not always simple. For example, with a 32-byte store, a 4-byte load can be forwarded if the memory address divided by 32 has a remainder of 0 to 4, 8 to 12, 16 to 20, or 24 to 28. But if the remainder is 5 to 7, 13 to 15, 21 to 23, or 29 to 31, the load won’t be forwarded.
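The pattern quoted above is easier to see written down as arithmetic. The sketch below encodes only that one table row (a 4-byte load after a 32-byte store) exactly as described in the text; the function names are made up for illustration, and the rule should not be assumed to hold for other load sizes, store sizes, or processors.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Returns true if a 4-byte load at load_addr falls on one of the
 * remainders the text lists as forwardable after a 32-byte store. */
bool load4_forwards_from_store32(uintptr_t load_addr)
{
    unsigned rem = (unsigned)(load_addr % 32u);
    /* 0-4, 8-12, 16-20 and 24-28 are the first five offsets of each
     * 8-byte group, so the listed ranges reduce to one comparison. */
    return (rem % 8u) <= 4u;
}

/* Cross-checks the compact rule against the ranges as listed in the text. */
void check_against_listed_ranges(void)
{
    for (unsigned rem = 0; rem < 32u; ++rem) {
        bool listed = (rem <= 4u) ||
                      (rem >= 8u && rem <= 12u) ||
                      (rem >= 16u && rem <= 20u) ||
                      (rem >= 24u && rem <= 28u);
        assert(load4_forwards_from_store32(rem) == listed);
    }
}
```

Seen this way, the load forwards when it starts within the first five bytes of any 8-byte group of the store, and fails when it starts within the last three.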
Optimizing compilers should know the rules about things like store forwarding and should strive to produce code that follows those rules as best they can. If a compiler gets things wrong, the result can be bad performance. Sometimes this is unavoidable, but often the compiler has several options for how it could generate equivalent code, and it needs to figure out which one is best.
Reportedly, the compiler in Visual C++ 2015 could produce sequences of two stores followed by a load in such a way as to stall the store queue, blocking writes to memory until the processor could flush its outstanding instructions. Visual C++ 2017 has a new optimizer that apparently avoids the bad sequence of instructions, and hence the performance problem.
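The exact sequence that Visual C++ 2015 generated isn’t known, so the following is only an illustration of the general shape: two narrow stores followed by one wider load that spans both of them, a well-known case in which a single buffered store can’t supply the load. The function names are hypothetical, and an optimizing compiler may well keep everything in registers and never emit these memory operations at all.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

static_assert(sizeof(uint64_t) == 2 * sizeof(uint32_t),
              "the example assumes two 4-byte halves fill one 8-byte value");

/* Two 4-byte stores, then one 8-byte load covering both of them.
 * Neither pending store alone holds all the bytes the load needs. */
uint64_t pack_halves_via_memory(uint32_t first, uint32_t second)
{
    unsigned char scratch[8];
    uint64_t both;

    memcpy(scratch, &first, 4);       /* store 1: bytes 0..3 */
    memcpy(scratch + 4, &second, 4);  /* store 2: bytes 4..7 */
    memcpy(&both, scratch, 8);        /* load: bytes 0..7 */
    return both;
}

/* The same value built without going through memory
 * (equivalent to the version above on a little-endian machine). */
uint64_t pack_halves_in_registers(uint32_t first, uint32_t second)
{
    return ((uint64_t)second << 32) | first;
}
```

The two functions compute the same thing on x86; the difference is whether the processor has to reconcile two buffered stores with a load, which is the kind of choice between equivalent code sequences that a compiler’s optimizer gets to make.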
Bypassing the cache; sometimes it’s good, sometimes it’s awful
The other suggestion about the Ashes patch is about the use of a set of instructions called non-temporal instructions. These are a series of load and store instructions that are designed to bypass the cache.
For most data, the cache is a wonderful thing, because the cache is so much faster than main memory. But sometimes, the programmer knows that after reading from or writing to a particular memory address, the data won’t be used any time soon, and so there’s no point in caching it. In fact, caching the data would be a waste of cache space; it would simply displace something else from the cache, something that might actually be needed.
The non-temporal instructions allow the processor to store data to main memory, bypassing the cache on the way out. They also have some other properties. Non-temporal stores are write combining: multiple writes to the same 64 bytes of memory are combined into a single 64-byte write operation. They’re write collapsing: multiple writes to the same byte are collapsed into a single write (so only the last value written is ever visible to other applications). They’re also weakly ordered: non-temporal writes don’t interact with the normal store and load buffers and so may appear in memory in an order that doesn’t correspond with the program’s write order.
Used correctly, these can provide some of the fastest writes to memory, 64 bytes at a time, without disturbing any valuable cached data. But if you use them incorrectly, performance can drop off a cliff. The non-temporal writes are held in buffers the size of a cache line, and there’s a limited number of these buffers. If a program tries to perform non-temporal writes to too many different cache lines simultaneously, the processor ends up having to perform a bunch of partial cache line writes instead of nice big 64-byte writes, and performance drops. If a program mixes regular and non-temporal stores to the same cache line, the performance drops. If the program mixes loads with non-temporal stores to the same cache line, the performance drops. Although the writes are meant to be collapsed, on at least some processors, if a byte on a cache line is written more than once, performance drops.
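As a rough sketch of the well-behaved case, the routine below fills a buffer using SSE2 non-temporal stores. It works on one cache line at a time, writes each byte exactly once, and doesn’t read the destination or touch it with ordinary stores. The function name and the 64-byte line size are assumptions made for the example, and on compilers that don’t advertise SSE2 it simply falls back to memset. It has nothing to do with whatever Ashes of the Singularity actually does.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#if defined(__SSE2__)
#include <emmintrin.h>  /* _mm_stream_si128, _mm_set1_epi8, _mm_sfence */
#endif

#define CACHE_LINE 64   /* assumed cache line size in bytes */

/* Fills dst with value, bypassing the cache where SSE2 is available.
 * dst must be 64-byte aligned and bytes must be a multiple of 64. */
void stream_fill(void *dst, uint8_t value, size_t bytes)
{
    assert((uintptr_t)dst % CACHE_LINE == 0);
    assert(bytes % CACHE_LINE == 0);

#if defined(__SSE2__)
    uint8_t *p = dst;
    const __m128i pattern = _mm_set1_epi8((char)value);

    for (size_t off = 0; off < bytes; off += CACHE_LINE) {
        /* Four 16-byte stores cover exactly one line, so the
         * write-combining buffer can go out as one 64-byte write. */
        _mm_stream_si128((__m128i *)(p + off),      pattern);
        _mm_stream_si128((__m128i *)(p + off + 16), pattern);
        _mm_stream_si128((__m128i *)(p + off + 32), pattern);
        _mm_stream_si128((__m128i *)(p + off + 48), pattern);
    }
    /* Non-temporal stores are weakly ordered; the fence makes them
     * visible before any store that follows it. */
    _mm_sfence();
#else
    memset(dst, value, bytes);  /* portable fallback, goes through the cache */
#endif
}
```

Whether this beats a plain memset depends on the processor and on the size of the buffer, which is exactly the kind of chip-by-chip variation at issue here.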
The nature of the performance hit will also tend to vary on a processor-by-processor basis. A penalty that may be negligible on one chip could be substantial on another. The belief is that Ashes of the Singularity did something with non-temporal instructions that was harmless, or perhaps even desirable, on other chips but particularly detrimental on Ryzen. The performance update changes how the instructions are used to avoid the problem.
Most non-temporal instructions control stores, but there’s also a non-temporal load instruction and a set of prefetching instructions that include a non-temporal variant. These are supposed to load data into some levels of cache, without requiring it to be loaded into other levels of cache. The precise meaning of the prefetch instructions varies from processor to processor; at best, each is a hint rather than a well-defined instruction. Placing the prefetch instructions in the right place is tricky; do it too early, and the prefetched data will be discarded by the time it’s needed anyway. Do it too late, and the prefetched data still won’t have been loaded by the time you need it. Worse, prefetched data can displace data that would be used from cache, causing performance degradation.
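The placement problem shows up as a single tuning knob in code. The sketch below sums an array while issuing a non-temporal prefetch hint a fixed number of elements ahead. It relies on the GCC and Clang builtin __builtin_prefetch and does nothing on other compilers. The function name and the distance parameter are invented for the example, and on a simple linear scan like this one the hardware prefetcher would probably do the job unaided.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#if defined(__GNUC__) || defined(__clang__)
/* rw = 0 (read), locality = 0 (no temporal locality). */
#define PREFETCH_NTA(p) __builtin_prefetch((p), 0, 0)
#else
#define PREFETCH_NTA(p) ((void)0)  /* no-op where the builtin is missing */
#endif

/* Sums n elements, hinting at the element `distance` positions ahead.
 * Too small a distance and the data arrives late; too large and it may
 * be evicted again before the loop reaches it. */
uint64_t sum_with_prefetch(const uint32_t *data, size_t n, size_t distance)
{
    assert(data != NULL || n == 0);

    uint64_t total = 0;
    for (size_t i = 0; i < n; ++i) {
        if (distance < n - i)  /* stay inside the array */
            PREFETCH_NTA(&data[i + distance]);
        total += data[i];
    }
    return total;
}
```

The result is the same whatever the distance, because the prefetch is only a hint; the only thing that changes is how long the loop waits on memory, and the best value differs from one processor to the next.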