Design faster. Render faster. Iterate faster.
Our AMD Ryzen™ Performance Guide will guide you through the optimization process with a collection of tidbits, tips, and tricks to support you in your performance quest.
Tools
PresentMon
PresentMon is a Command Line Interface (CLI) tool for logging frame times such as MsBetweenPresents.
Example:
Open Capture and Analysis Tool (OCAT)
OCAT is a Graphical User Interface (GUI) tool, based on PresentMon, with hotkey support for logging frame times.
Windows® Performance Toolkit
Windows Performance Analyzer (WPA)
WPA is a highly configurable tool for finding system performance bottlenecks and ideal for filtering and visualizing call stacks.
- WPA is included in the Windows SDK Windows Performance Toolkit and also available in the Microsoft Store.
- WPA opens logs created by wpr.exe or xperf.exe.
- wpr.exe is included in all Windows 10 installations.
- xperf.exe is included in the Windows SDK.
- See https://docs.microsoft.com/en-us/windows-hardware/test/wpt/windows-performance-analyzer
- See https://developer.microsoft.com/en-us/windows/downloads/windows-10-sdk/
GPUView
GPUView is a tool for analyzing GPU performance with regard to direct memory access (DMA) buffer processing.
- GPUView allows you to find times where the GPU Hardware Queue is empty or times where the Process Context CPU Queue is empty.
- Ideally, the GPU Hardware Queue should be near 100% busy.
- GPUView is included in the Windows SDK Windows Performance Toolkit.
- See https://docs.microsoft.com/en-us/windows-hardware/drivers/display/using-gpuview
- See https://developer.microsoft.com/en-us/windows/downloads/windows-10-sdk/
Visual Studio Concurrency Visualizer
You can use the Concurrency Visualizer for Visual Studio to locate performance bottlenecks, CPU underutilization, thread contention, cross-core thread migration, synchronization delays, DirectX activity, areas of overlapped I/O, and other information.
- See https://visualstudio.microsoft.com/downloads/
- See https://docs.microsoft.com/en-us/visualstudio/profiling/concurrency-visualizer
AMD µProf
- Find performance bottlenecks using CPU hardware Performance Monitoring Counters (PMCs)
- Instruction Based Sampling (IBS) attributes samples to the exact disassembly instruction, but with limited counter coverage.
- Event Based Sampling (EBS) has more counters available but less accurate attribution. It is typically accurate within a few instructions. AMD Dev Techs often use EBS counters in the Assess Performance (Extended) profile.
- See https://developer.amd.com/amd-uprof/
Radeon GPU Profiler (RGP)
RGP is a low-level GPU performance analysis tool which provides detailed timing information for DirectX® 12, Vulkan®, and OpenCL™ applications on Radeon GPUs.
- The Overview > Frame summary can quickly indicate whether the application is CPU bound (GPU idle > 5%), based on the few frames captured.
- See https://github.com/GPUOpen-Tools/radeon_gpu_profiler
Compiling
Use the latest compiler and Windows SDK
- Get the latest build and link time improvements.
- Ensure you are using the latest C runtime optimizations.
- See https://devblogs.microsoft.com/cppblog/the-coalition-sees-27-9x-iteration-build-improvement-with-visual-studio-2019/
Add virus and threat protection exclusions
- Add project folders to virus and threat protection settings exclusions for faster build times.
- We have seen some projects compiling 20% faster!
Prefer Shipping configuration builds for CPU profiling
- Debug and development configuration builds may greatly reduce performance.
- Stats collection may cause cache pollution.
- Logging may create serialization points.
- Sometimes debug builds may disable multi-threading optimizations.
- While investigating open issues, developers may submit change requests which enable debug features on Test and Shipping configurations. Be sure to disable debug features before you ship!
- Some Unreal Engine settings to verify include:
- In Build.h, #define FORCE_USE_STATS and #define STATS should never be enabled in Shipping builds.
- It may be convenient to enable ALLOW_CONSOLE_IN_SHIPPING during game development.
- See master/Engine/Source/Runtime/Core/Public/Misc/Build.h
Disable Anti-Tamper for CPU profiling
- Build a binary similar to Shipping configuration but without Anti-Tamper or Anti-Cheat tools which may prevent CPU profiling tools from properly loading symbols.
Testing
Audit Content
- Run Unreal Engine UE4Editor MapCheck to find errors.
- Use Unity® AssetPostprocessor to enforce minimum standards.
Ask artists and QA for scene recommendations
- It is important to profile potential optimizations using representative content. Not all scenes are created equal, and there is not always one best scene.
- Indoor scenes may have heavy occlusion.
- Outdoor forests may have many masked materials.
- Large crowds may represent a good stress test for AI, navmesh, physics, animation, and rendering workloads.
- A consistent in-game time of day is an important consideration when minimizing run-to-run variation.
- Time of day may trigger specific world events such as rush hour where there are larger crowds or different lighting composition between day and night.
Use the default Platform Clock setting
DAMAGE CAUSED BY USE OF YOUR AMD PROCESSOR OUTSIDE OF SPECIFICATION OR IN EXCESS OF FACTORY SETTINGS ARE NOT COVERED UNDER YOUR AMD PRODUCT WARRANTY AND MAY NOT BE COVERED BY YOUR SYSTEM MANUFACTURER’S WARRANTY. Operating your AMD processor outside of specification or in excess of factory settings, including but not limited to overclocking, may damage or shorten the life of your processor or other system components, create system instabilities (e.g. data loss and corrupted images), and in extreme cases may result in total system failure. AMD does not provide support or service for issues or damages related to use of an AMD processor outside of processor specifications or in excess of factory settings.
- Use the default platform clock setting for best performance with high precision and low latency.
- Default:
- bcdedit.exe /deletevalue useplatformclock
- The useplatformclock option should only be used for debugging. However, some overclocking tools may set it to yes:
- bcdedit.exe /set useplatformclock yes
- See https://docs.microsoft.com/en-us/windows-hardware/drivers/devtest/bcdedit--set
Test the cold shader cache first time user experience
- Be sure to clear the application shader cache if it has one.
- The end user will often not be running the same scene back to back as a developer might.
- The example below clears the Microsoft®, AMD, and NVIDIA® shader caches:
Analyze frame times
- When doing performance analysis, prefer averages and percentiles over min and max metrics.
- It only takes one bad frame for min and max to no longer be representative of the average experience.
- Be sure to collect sufficient samples when comparing 3 sigma and higher.
- Determine the coefficient of variation over many test iterations.
- Under 3% is good in our experience.
- High variation is symptomatic of an inconsistent test scene.
- We recommend setting static seed values for dynamically generated content and fixing variables like time of day.
- If higher variation is unavoidable, the user should increase their number of benchmark runs proportionally.
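The metrics above can be computed with a few lines of code. The following is a minimal sketch, not a reference implementation: the function names are hypothetical, the percentile uses the simple nearest-rank method, and the coefficient of variation is taken as the sample standard deviation divided by the mean.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Arithmetic mean of the samples. Requires at least one sample.
double Mean(const std::vector<double>& samples) {
    assert(!samples.empty());
    double sum = 0.0;
    for (double x : samples) sum += x;
    return sum / static_cast<double>(samples.size());
}

// Nearest-rank percentile for p in (0, 100]. Takes a copy so it can sort.
double Percentile(std::vector<double> samples, double p) {
    assert(!samples.empty() && p > 0.0 && p <= 100.0);
    std::sort(samples.begin(), samples.end());
    std::size_t rank = static_cast<std::size_t>(
        std::ceil(p / 100.0 * static_cast<double>(samples.size())));
    rank = std::min(std::max<std::size_t>(rank, 1), samples.size());
    return samples[rank - 1];
}

// Coefficient of variation: sample standard deviation divided by the mean.
// Requires at least two samples and a non-zero mean.
double CoefficientOfVariation(const std::vector<double>& samples) {
    assert(samples.size() >= 2);
    const double mean = Mean(samples);
    assert(mean != 0.0);
    double squaredDeviations = 0.0;
    for (double x : samples) squaredDeviations += (x - mean) * (x - mean);
    const double variance =
        squaredDeviations / static_cast<double>(samples.size() - 1);
    return std::sqrt(variance) / mean;
}
```

As a worked case, nine frames of 16 ms plus a single 56 ms hitch give a mean of 20 ms and a median of 16 ms, while the max of 56 ms describes only one frame. Five hypothetical runs averaging 59, 60, 61, 60, and 60 FPS have a coefficient of variation of roughly 1.2%, which falls under the 3% guideline.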
Profiling
Disable Memory Integrity if needed
Hypervisor-Protected Code Integrity (HVCI) is labelled Memory Integrity in the Windows Security app.
- HVCI can be accessed via Settings > Update and Security > Windows Security > Device security > Core isolation details > Memory Integrity.
- You may need to disable Memory Integrity for some tools, such as AMD µProf, to function.
- See https://support.microsoft.com/en-us/windows/device-protection-in-windows-security-afa11526-de57-b1c5-599f-3a4c6a61c5e2
Add symbols
The symstore and symbol path can be powerful tools for loading vendor symbols and providing hints to tools which do not check the local directory.
- Edit the system environment variables for _NT_SYMBOL_PATH.
- Example:
- Install the Windows 10 SDK Debuggers including symchk.exe and symstore.exe. Adding “C:\Program Files (x86)\Windows Kits\10\Debuggers\x64” to the PATH is recommended.
- Store symbols for your project.
- Example:
- See https://developer.microsoft.com/en-us/windows/downloads/windows-10-sdk/
- See https://docs.microsoft.com/en-us/windows/win32/debug/using-symstore
Determine if CPU-bound
Typically, the application is CPU-bound if GPU Idle > 5%.
- Look for bubbles of idle time on the GPU in tools such as RGP, GPUView, and the Visual Studio Concurrency Visualizer.
- There are multiple tools and methods available for developers to detect boundedness:
- Radeon GPU Profiler (RGP)
- GPUView
- Warning: Adapter Hardware Queue 3D is a good measure of GPU %Busy but be sure to zoom to a selection which trims out the head and tail of the log which may be missing events.
- Warning: This capture is typically limited to a few seconds which may be too broad to see smaller idle periods. Consider using the zoom function to limit scope to a few frames at a time.
- Example:
- Windows Performance Recorder & Windows Performance Analyzer
- Warning: The Windows Performance Analyzer’s GPU Utilization (FM) GPU by Process graph excludes GPU Idle time from its percentage calculation. Fortunately, you can open the etl file in GPUView.
- Note this capture is typically limited to a few seconds.
- Example:
- Visual Studio Concurrency Visualizer
- The Threads View shows DirectX GPU Engine utilization, which may be used to zoom into regions where the GPU is idle for further analysis of blocked threads.
Verify UE4 Parallel Rendering
- While investigating open issues, developers may submit change requests which enable debug features on Test and Shipping configurations. Some debug features may greatly reduce performance due to disabling parallel rendering.
- Check UE4 Parallel Rendering CVARs before shipping.
| CVar | Expected value |
| --- | --- |
| r.rhicmdbypass | 0 |
| r.rhicmdusedeferredcontexts | 1 |
| r.rhicmduseparallelalgorithms | 1 |
| r.rhithread.enable | 1 |
Verify Parallel DX12 PipelineState Creation
Use a cold shader cache while verifying parallel DX12 pipeline state creation.
- Install the Windows SDK Windows Performance Toolkit.
- Add the GPUView folder to the PATH.
- Open the merged etl log file with the Windows Performance Analyzer.
- Add CPU Usage (Precise) and CPU Usage (Sampled) Flame by Process, Stack graphs.
- Find all D3D12.dll!CDevice::CreatePipelineState within the Flame by Process, Stack.
This find command highlights the samples of interest in the CPU Usage (Precise) graph:
Verify Parallel DX12 Command List Generation
- Install the Windows SDK Windows Performance Toolkit.
- Add the GPUView folder to the PATH.
- Add GPU Utilization, CPU Usage (Precise), and Generic Events graphs.
- Zoom into a single frame between two Present markers.
- In the Generic Events graph, move the CPU Column next to the Task name then filter and expand Command List.
Debugging
WinDbg
WinDbg may be used for setting breakpoints, logging, skipping functions, editing memory, or editing registers.
- Under the Windows x64 calling convention, the first four integer or pointer arguments of any function are passed in RCX, RDX, R8, and R9. Arguments five and higher are passed on the stack.
- Note Steam games often require a steam_appid.txt file or SteamAppId system environment variable to launch an executable from WinDbg.
- Verify DXGI_GPU_PREFERENCE_HIGH_PERFORMANCE was used:
- DXGI_GPU_PREFERENCE_HIGH_PERFORMANCE (2) is recommended for optimal performance on hybrid graphics systems.
- These WinDbg commands may help:
- Verify GetLogicalProcessorInformation(Ex) calls with non-zero input buffer lengths return success:
- Some applications incorrectly assume the buffer size and may crash, especially on systems with many logical processors.
- Test that the first call passes an input buffer length of 0 to get the buffer length to allocate.
- Test that all calls with non-zero input buffer lengths return success (return value 1).
- These WinDbg commands may help:
Integrated Graphics
Test for Integrated Graphics
The DirectX APIs refer to Accelerated Processing Units (APUs) or Integrated Graphics parts via the term Unified Memory Architecture (UMA).
DirectX 12
- See https://docs.microsoft.com/en-us/windows/win32/api/d3d12/ns-d3d12-d3d12_feature_data_architecture
DirectX 11.3
- See https://docs.microsoft.com/en-us/windows/win32/api/d3d11/ns-d3d11-d3d11_feature_data_d3d11_options2
Calculate VRAM Budget appropriately for Integrated Graphics
Integrated graphics parts which share their video memory with the CPU require special considerations when detecting VRAM budgets.
DirectX
Preferred method:
Alternative method:
- DedicatedVideoMemory: This represents the actual local memory on discrete GPUs and the dedicated carve-out system memory on integrated GPUs.
- DedicatedSystemMemory: This value is always zero on AMD GPUs.
- SharedSystemMemory: This is determined by the GPU KMD and may return up to half of system memory.
- UMA: Unified Memory Architecture used in integrated GPUs.
- DedicatedVideoMemorySize alone may be insufficient to run some gaming applications on systems with integrated graphics (UMA).
- For systems with integrated graphics (UMA), developers should query SharedSystemMemorySize then rely on the GPU KMD and the vidMm to assign system memory optimally.
- Use DX12 (or DX11.3) CheckFeatureSupport to query UMA.
- See https://docs.microsoft.com/en-us/windows-hardware/drivers/ddi/d3dkmthk/ns-d3dkmthk-_d3dkmt_segmentsizeinfo
- See https://docs.microsoft.com/en-us/windows-hardware/drivers/display/calculating-graphics-memory
Optimize Scalability for Integrated Graphics
Sometimes feature scaling may be required in order to achieve acceptable framerates on thermally limited platforms.
- Straightforward changes to try for scaling include:
- Use DXGI_FORMAT_R11G11B10_FLOAT rather than DXGI_FORMAT_R16G16B16A16_FLOAT.
- Reduce shadow map quality.
- Reduce volumetric fog quality.
- Disable Ambient Occlusion.
- The following related Unreal Engine CVars may be helpful:
- r.SceneColorFormat
- r.AmbientOcclusionLevels
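The memory saving from the render target format change above can be estimated with simple arithmetic. This is a rough sketch with a hypothetical helper name; it assumes a single uncompressed 1920x1080 target with no MSAA or mip chain, and it ignores any driver-side padding or alignment.

```cpp
#include <cstdint>

// Bytes needed for one uncompressed render target (no MSAA, no mips).
constexpr std::uint64_t RenderTargetBytes(std::uint32_t width,
                                          std::uint32_t height,
                                          std::uint32_t bitsPerPixel) {
    return static_cast<std::uint64_t>(width) * height * bitsPerPixel / 8;
}

// DXGI_FORMAT_R11G11B10_FLOAT packs 11 + 11 + 10 = 32 bits per pixel.
constexpr std::uint32_t kBitsR11G11B10 = 32;
// DXGI_FORMAT_R16G16B16A16_FLOAT uses 4 x 16 = 64 bits per pixel.
constexpr std::uint32_t kBitsR16G16B16A16 = 64;

// At 1920x1080 the packed format needs half the memory and bandwidth.
static_assert(RenderTargetBytes(1920, 1080, kBitsR11G11B10) == 8294400,
              "about 7.9 MiB");
static_assert(RenderTargetBytes(1920, 1080, kBitsR16G16B16A16) == 16588800,
              "about 15.8 MiB");
```

The trade-off is precision and the loss of the alpha channel, so the packed format only suits passes that do not store data in scene color alpha.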
Hybrid Graphics
Select the optimal GPU for Hybrid Graphics
Additional considerations may be necessary to ensure the expected GPU is utilized in hybrid graphics platforms.
- Windows 10 v1803 added IDXGIFactory6::EnumAdapterByGpuPreference.
- Use DXGI_GPU_PREFERENCE_HIGH_PERFORMANCE for game applications.
- WinDbg may be used to test if DXGI_GPU_PREFERENCE=2 (DXGI_GPU_PREFERENCE_HIGH_PERFORMANCE).
- The user may change preferences per application in Graphics settings.
- Example from Dell G5 15 Special Edition (5505)
Memory
Optimize memcpy/memset
- Update the compiler for the latest memcpy, memset, and other C runtime optimizations.
- Aligning memcpy source and destination to a 4096 byte page boundary may reduce Zen 2 store-to-load forwarding events (see STLIOther in AMD µProf).
- Aligning data to a 4096 byte page boundary may benefit probe filtering on AMD Threadripper™ and EPYC™ processors.
Avoid false sharing
- Using alignas with the native cache line size (64 bytes) may reduce false sharing.
- Use aligned memory allocators such as _aligned_malloc or C++17 aligned new.
- Prefer thread local storage and local variables over process shared data.
- Try using per thread range indices such that thread ranges avoid sharing the same 64 byte cache line or 4096 byte page.
- Try copying data rather than using process shared data.
- Padding or reordering a struct may reduce false sharing in some cases where variables which share the same cache line are used by more than one thread.
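The alignas and per-thread data suggestions above might look like the following sketch. The type and function names are hypothetical, and the fixed thread count of four is only for illustration. The counters live in a local std::array, so no over-aligned heap allocation is needed; a dynamically sized container would need C++17 aligned new or an aligned allocator.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <thread>
#include <vector>

// Cache line size assumed to be 64 bytes, as stated in the text above.
constexpr std::size_t kCacheLineSize = 64;
// Hypothetical fixed worker count for this sketch.
constexpr unsigned kThreadCount = 4;

// alignas pads and aligns each counter to its own cache line, so threads
// updating neighbouring elements never write to the same line.
struct alignas(kCacheLineSize) PaddedCounter {
    std::uint64_t value = 0;
};

static_assert(sizeof(PaddedCounter) == kCacheLineSize, "one counter per line");
static_assert(alignof(PaddedCounter) == kCacheLineSize, "line aligned");

// Each worker increments only its own slot; the slots are summed after
// all workers have joined.
std::uint64_t CountInParallel(std::uint64_t incrementsPerThread) {
    std::array<PaddedCounter, kThreadCount> counters{};
    std::vector<std::thread> workers;
    workers.reserve(kThreadCount);
    for (unsigned t = 0; t < kThreadCount; ++t) {
        workers.emplace_back([&counters, t, incrementsPerThread] {
            for (std::uint64_t i = 0; i < incrementsPerThread; ++i) {
                ++counters[t].value;
            }
        });
    }
    for (std::thread& worker : workers) worker.join();

    std::uint64_t total = 0;
    for (const PaddedCounter& counter : counters) total += counter.value;
    return total;
}
```

Without the alignas, four 8 byte counters would fit in a single 64 byte line and every increment could force that line to move between cores. Whether the padding pays off in a real workload should be confirmed by profiling.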
Prefer data access patterns matching hardware prefetcher behaviors
- Streaming
- Uses history of memory access patterns to fetch additional sequential lines in ascending or descending order.
- Stride
- Uses memory access history of individual instructions to fetch additional lines when each access is a constant distance from the previous one.
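To make the two patterns concrete, the sketch below walks the same row-major matrix in two ways. The Matrix type and function names are hypothetical. The first loop produces ascending sequential addresses (the streaming case), and the second produces loads a constant cols * sizeof(float) bytes apart within each column (the stride case).

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Row-major matrix in one contiguous allocation:
// element (r, c) lives at data[r * cols + c].
struct Matrix {
    std::size_t rows;
    std::size_t cols;
    std::vector<float> data;
};

// Streaming pattern: consecutive addresses in ascending order, so every
// byte of each fetched cache line is used.
float SumRowMajor(const Matrix& m) {
    assert(m.data.size() == m.rows * m.cols);
    float sum = 0.0f;
    for (std::size_t r = 0; r < m.rows; ++r) {
        for (std::size_t c = 0; c < m.cols; ++c) {
            sum += m.data[r * m.cols + c];
        }
    }
    return sum;
}

// Stride pattern: within a column, each load is a constant
// cols * sizeof(float) bytes away from the previous one.
float SumColumnMajor(const Matrix& m) {
    assert(m.data.size() == m.rows * m.cols);
    float sum = 0.0f;
    for (std::size_t c = 0; c < m.cols; ++c) {
        for (std::size_t r = 0; r < m.rows; ++r) {
            sum += m.data[r * m.cols + c];
        }
    }
    return sum;
}
```

Both loops are regular enough for a hardware prefetcher to follow, unlike a random or pointer-chasing walk. The row-major version also uses each cache line fully, so it is usually the better default; the actual difference depends on the matrix size and should be measured.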
Use Software Prefetch instructions for linked data structures experiencing cache misses
- Use Software Prefetch instructions on pointer-linked data structures, such as std::list or a std::vector of pointers, that experience cache misses.
- Tune prefetch distance to account for memory latency. In our experience, four iterations into the future is a good place to start tuning.
- Use the NTA (non-temporal) hint on use-once data.
- While in dual-thread mode, beware that too many software prefetches from one thread may evict the working set of the other thread from their shared caches.
- Remove Ineffective Software Prefetches found by PMCx052.
- The AMD µProf Assess Performance (Extended) profile may help find Data Cache refills from DRAM.
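A minimal sketch of the prefetch distance idea follows, assuming the objects are reached through a std::vector of pointers so the address needed a few iterations ahead is known in advance. The Particle type, the function names, and the distance constant are hypothetical. The wrapper uses the x86 _mm_prefetch intrinsic with the T0 hint and compiles to a no-op on other architectures.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

#if defined(__x86_64__) || defined(_M_X64) || defined(__i386__) || defined(_M_IX86)
#include <xmmintrin.h>  // _mm_prefetch, _MM_HINT_T0
inline void PrefetchT0(const void* address) {
    _mm_prefetch(static_cast<const char*>(address), _MM_HINT_T0);
}
#else
inline void PrefetchT0(const void*) {}  // no-op on non-x86 targets
#endif

// Hypothetical heap-allocated object reached through a pointer.
struct Particle {
    float mass;
};

// Starting point suggested in the text above; tune it per workload.
constexpr std::size_t kPrefetchDistance = 4;

// Sums the masses while requesting the object needed kPrefetchDistance
// iterations later, so its cache line may arrive before it is read.
float TotalMass(const std::vector<const Particle*>& particles) {
    float total = 0.0f;
    const std::size_t count = particles.size();
    for (std::size_t i = 0; i < count; ++i) {
        if (i + kPrefetchDistance < count) {
            PrefetchT0(particles[i + kPrefetchDistance]);
        }
        total += particles[i]->mass;
    }
    return total;
}
```

For use-once data, _MM_HINT_NTA can replace _MM_HINT_T0 in the wrapper. A prefetch is only a hint, so results are unchanged either way; whether it helps should be checked against the cache miss counters mentioned above.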
Synchronization
Use Modern Sync APIs
Modern sync APIs include std::mutex, std::shared_mutex, SRWLock, and EnterCriticalSection.
- These may be faster than and consume less power than WaitForSingleObject or user spin locks.
- Some modern sync APIs leverage AMD’s mwaitx instruction efficiently to wait on an address or timeout.
- Legacy sync APIs may have unneeded Syscall overhead.
- User spin locks may consume OS thread scheduling resources unnecessarily since the OS scheduler may be unable to determine if it should yield to another program thread rather than spin.
- It is generally recommended to use sleep/wait calls rather than spin locks.
- Even when waiting on the GPU, calls like SetEventOnCompletion() may be as efficient as the old fence polling model while avoiding starving other threads or unnecessarily consuming power.
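As an illustration of waiting rather than spinning, the sketch below blocks a consumer on a std::condition_variable guarded by a std::mutex until a producer signals it. The WorkSignal class and ProduceAndConsume function are hypothetical names for this example only.

```cpp
#include <cassert>
#include <condition_variable>
#include <mutex>
#include <thread>

// One-shot signal: Wait() blocks in the OS until Notify() has been called,
// rather than burning CPU cycles in a user-mode spin loop.
class WorkSignal {
public:
    void Notify() {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            ready_ = true;
        }
        cv_.notify_one();
    }

    void Wait() {
        std::unique_lock<std::mutex> lock(mutex_);
        // The predicate guards against spurious wakeups and against
        // Notify() having run before Wait() was entered.
        cv_.wait(lock, [this] { return ready_; });
    }

private:
    std::mutex mutex_;
    std::condition_variable cv_;
    bool ready_ = false;
};

// A producer thread writes a value and signals; the caller sleeps until
// the signal arrives and then reads the value.
int ProduceAndConsume() {
    WorkSignal signal;
    int payload = 0;
    std::thread producer([&signal, &payload] {
        payload = 42;  // published by the mutex release inside Notify()
        signal.Notify();
    });
    signal.Wait();
    const int observed = payload;
    producer.join();
    return observed;
}
```

The mutex acquire and release in Notify() and Wait() order the write to payload before the read, so no extra atomics are needed in this sketch.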
Test application scalability from 1 to %NUMBER_OF_PROCESSORS%
This advice is specific to AMD processors and is not general guidance for all processor vendors.
Generally, applications show SMT benefits and use of all logical processors is recommended. However, games often suffer from SMT contention on the main or render threads during gameplay.
- One strategy to reduce this contention is to create threads based on physical core count rather than logical processor count.
- Avoid setting thread pool size as a constant.
- Profile your application/game to determine the ideal thread count.
- Game initialization, including decompressing assets and compiling/warming shaders, may benefit from logical processors using SMT dual-thread mode.
- Game play may prefer physical core count using SMT single-thread mode.
- We recommend creating developer options to:
- Set Max Thread Pool Size.
- Force Thread Pool Size.
- Force SMT.
- Force Single NUMA Node (implicitly Group).
- Profile against multiple CPUs. There is no hard and fast rule here.
- The best thread count heuristic may vary between low and high core count CPUs.
- While a 12 core CPU may benefit from an idle thread in your game to handle interrupts from the Operating System and 3rd party apps, a 6 core CPU may require the availability of every compute resource.
- Developers may tune the low cores threshold for optimal performance on different core count CPUs.
- AMD µProf may be used to show the actual thread concurrency histogram for a process.
- See:
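The developer options recommended in this section could be wired into thread pool sizing roughly as follows. This is only a sketch: the ThreadPoolOptions struct and both function names are hypothetical, and dividing the logical processor count by an assumed two threads per core is an approximation. Production code on Windows would query the real core topology, for example with GetLogicalProcessorInformationEx.

```cpp
#include <algorithm>
#include <cassert>
#include <thread>

// Hypothetical developer options mirroring the list above.
struct ThreadPoolOptions {
    unsigned forcedSize = 0;      // Force Thread Pool Size (0 = not forced)
    unsigned maxSize = 0;         // Max Thread Pool Size (0 = no cap)
    bool useSmt = true;           // Force SMT on (true) or off (false)
    unsigned threadsPerCore = 2;  // assumption: two SMT threads per core
};

// Picks a worker count from the logical processor count and the options,
// instead of hard-coding a constant pool size.
unsigned ChooseThreadCount(unsigned logicalProcessors,
                           const ThreadPoolOptions& options) {
    if (options.forcedSize > 0) {
        return options.forcedSize;
    }
    unsigned count = std::max(1u, logicalProcessors);
    if (!options.useSmt && options.threadsPerCore > 1) {
        // Approximate the physical core count for SMT single-thread mode.
        count = std::max(1u, count / options.threadsPerCore);
    }
    if (options.maxSize > 0) {
        count = std::min(count, options.maxSize);
    }
    return count;
}

// std::thread::hardware_concurrency() may return 0 when the value is
// unknown, so fall back to a single processor in that case.
unsigned DetectLogicalProcessors() {
    const unsigned detected = std::thread::hardware_concurrency();
    return detected == 0 ? 1u : detected;
}
```

With these options a test pass can sweep the pool size from 1 up to the detected processor count, and compare SMT on against SMT off, without rebuilding the game.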