Channel: GCN – GPUOpen

First steps when implementing FP16


Introduction

Half-precision (FP16) computation is a performance-enhancing GPU technology long exploited in console and mobile development, but not previously widely available in mainstream PC development. With the advent of AMD’s Vega GPU architecture, this technology is now easily accessible for boosting graphics performance in mainstream PC development.

The latest iteration of the GCN architecture allows you to pack 2x FP16 values into each 32-bit VGPR register. This enables you to:

  • Halve ALU operations by using new data-parallel instructions.
  • Reduce the VGPR footprint of a shader, leading to a potential increase in occupancy and performance.

There are also some minor risks:

  • Poor use of FP16 can result in excessive conversion between FP16 and FP32. This can reduce the performance advantage.
  • FP16 slightly increases code complexity and maintenance cost.

Getting started

It is tempting to assume that implementing FP16 is as simple as merely substituting the ‘half’ type for ‘float’. Alas not: this simply doesn’t work on PC. The DirectX® FXC compiler only offers half for compatibility; it maps it onto float. If you compare the bytecode generated, it is identical.

The correct types to use are the standard HLSL types prefixed with min16: min16float, min16int, min16uint. These can be used as a scalar or vector type in the usual fashion.

Your development environment will need specific software to successfully generate FP16 code. First of all, you need Windows 8 or later. Older versions of Windows will simply fail to create shaders if you use min16float . Whilst there is a Platform Update available for Windows 7 which enables FP16 shaders to compile, it simply compiles the code as FP32. In practice you are emulating absent hardware and therefore the resulting code may not be as efficient. It may therefore be worthwhile providing an alternative code path or set of shaders for hardware and operating systems lacking FP16 support.

Secondly, you will need up-to-date versions of the FXC compiler and the driver compiler. The FXC compiler in the Windows 10 SDK will suffice, and Radeon Crimson driver version 17.9.1 or later is required.

Thirdly, it is worth clarifying that FP16 will work with DirectX 11.1 and Shader Model 5 code. DirectX 12 is not required. Simply add a run-time test to query for D3D11_FEATURE_SHADER_MIN_PRECISION_SUPPORT via ID3D11Device::CheckFeatureSupport().

And most importantly, compatible hardware is required: an AMD RX Vega or Vega Frontier Edition GPU!

Recommendation: Pre-processor Support

Whilst min16float is perfectly legal HLSL syntax and therefore fine to use as-is, I would caution against using it directly. I find it better to implement pre-processor support to globally include or remove the use of min16float and therefore FP16. There are two reasons for this:

  1. You need to be able to strip it all out for compatibility with older operating systems or incompatible hardware.
  2. It provides a convenient way to perform A/B testing for performance or correctness.
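A minimal sketch of such pre-processor support, assuming a project-wide ENABLE_FP16 define controlled by your build system (the hfloat macro names and the header name here are hypothetical, just one way to structure it):

```hlsl
// Hypothetical shared include, e.g. "half_types.hlsli".
// Define ENABLE_FP16 from the build system to switch the
// whole code base between FP16 and FP32 in one place.
#if defined( ENABLE_FP16 )
    #define hfloat  min16float
    #define hfloat2 min16float2
    #define hfloat3 min16float3
    #define hfloat4 min16float4
#else
    #define hfloat  float
    #define hfloat2 float2
    #define hfloat3 float3
    #define hfloat4 float4
#endif
```

Compiling the same shader twice, with and without ENABLE_FP16, then gives you both the compatibility path for older operating systems or hardware and a convenient switch for A/B testing.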

Example Output

With all these tools in place, what does compiled FP16 code look like? Let’s write a trivial test function:

cbuffer params
{
    min16float4 colour;
};

Texture2D<min16float4> tex;
SamplerState samp;

min16float4 test( in min16float2 uv : TEXCOORD0 ) : SV_Target
{
    return colour * tex.Sample( samp, uv );
}

The first step of verifying FP16 functionality is to look at the FXC .asm output. The driver cannot compile FP16 code unless it is given the correct bytecode from DirectX. Here we see the compiler has introduced a series of {min16f} suffixes:

ps_5_0
dcl_globalFlags refactoringAllowed | enableMinimumPrecision
dcl_constantbuffer CB0[1], immediateIndexed
dcl_sampler s0, mode_default
dcl_resource_texture2d (float,float,float,float) t0
dcl_input_ps linear v0.xy {min16f}
dcl_output o0.xyzw {min16f}
dcl_temps 1
sample_indexable(texture2d)(float,float,float,float) r0.xyzw {min16f}, v0.xyxx {min16f}, t0.xyzw, s0
mul o0.xyzw {min16f}, r0.xyzw {min16f}, cb0[0].xyzw {min16f}
ret

Now we turn to the ISA output. There are typically two major classes of instruction to look for:

  • Packed FP16 instructions such as v_pk_add/mul/sub/mad_f16
  • Mix instructions, such as v_mad_mix_f32/v_mad_mixlo_f16/v_mad_mixhi_f16

Instructions such as v_pk_add/mul/sub_f16 perform an ALU operation on two FP16 values at once, halving the instructions needed and your ALU time. This gives one of the primary performance advantages of FP16.

The mix modifiers allow you to freely mix FP16 and FP32 operands in one VOP3 instruction, without requiring an additional conversion instruction. The cost of using these instructions is the lost opportunity to issue a packed instruction. It is therefore neither faster nor slower than the equivalent FP32 instruction you would otherwise have issued.

Note that the specific form of the mix instruction is a multiply-add instruction. The compiler can use this to implement most arithmetic operations with creative use of 0, 1 or -1 constants. However, commonly encountered shader ALU operations, such as min() or max(), cannot be performed using a mix instruction.

Here is the GCN ISA output for the above shader:

shader main
  asic(GFX9)
  type(PS)
  s_mov_b32     m0, s20
  s_mov_b64     s[22:23], exec
  s_wqm_b64     exec, exec
  s_setreg_imm32_b32  hwreg(HW_REG_MODE, 0, 8), 0x000001cc
  v_interp_p1ll_f16  v2, v0, attr0.x
  v_interp_p1ll_f16  v0, v0, attr0.y
  v_interp_p2_f16  v2, v1, attr0.x, v2
  v_interp_p2_f16  v2, v1, attr0.y, v0 op_sel:[0,0,0,1]
  image_sample  v[0:3], v[2:4], s[4:11], s[12:15] dmask:0xf a16 d16
  s_buffer_load_dwordx4  s[0:3], s[16:19], 0x00
  s_waitcnt     lgkmcnt(0)
  v_mov_b32     v2, s1
  v_cvt_pkrtz_f16_f32  v2, s0, v2
  v_mov_b32     v3, s3
  v_cvt_pkrtz_f16_f32  v3, s2, v3
  s_setreg_imm32_b32  hwreg(HW_REG_MODE, 0, 8), 0x000001c0
  s_waitcnt     vmcnt(0)
  v_pk_mul_f16  v0, v0, v2 op_sel_hi:[1,1]
  v_pk_mul_f16  v1, v1, v3 op_sel_hi:[1,1]
  v_mov_b32     v2, v0 src0_sel: WORD_0
  v_mov_b32     v0, v0 src0_sel: WORD_1
  v_mov_b32     v3, v1 src0_sel: WORD_0
  v_mov_b32     v1, v1 src0_sel: WORD_1
  s_mov_b64     exec, s[22:23]
  v_lshl_or_b32  v0, v0, 16, v2
  v_lshl_or_b32  v1, v1, 16, v3
  exp           mrt0, v0, v0, v1, v1 done compr vm
  s_endpgm
end

This output illustrates a couple of interesting points. Firstly, the compiler has successfully introduced some v_pk_mul_f16 instructions. Instead of the usual four v_mul_f32 ops required to multiply a float4 by a scalar, we’ve halved that to two v_pk_mul_f16 ops.

Secondly, consider the two v_cvt_pkrtz instructions. Each takes two FP32 source values and packs them into two FP16 values in a single 32-bit destination register, forming the min16float4 from the cbuffer. It is surprising that despite using the correct type, the compiler has not generated the simple load we may have expected. We will return to this issue later.

Recommendation: Radeon GPU Analyzer

AMD offers an extremely powerful software tool known as Radeon GPU Analyzer (RGA). This tool is an interface to the driver compiler which allows you to directly see the resulting code. RGA accepts shader source or intermediates from all main graphics APIs. The user specifies which generation of GCN GPU to target, and the tool can output a number of analyses, including but not limited to ISA output and register usage analysis.

I consider RGA invaluable for FP16 work. We have integrated this tool into our tool chain so that we can obtain ISA output or register analysis immediately after compilation. I iterate on the ISA output until I have satisfactory code, and then test it for performance and correctness. Whilst some GPU capture tools now offer ISA disassembly, this is a far more productive method of working.

FP16 Target Selection

It is critical to choose your targets very carefully. Not all code is a suitable candidate for FP16 optimisation. The ideal target:

  • Is compatible with the precision limitations of FP16.
  • Offers good scope for data parallelism.
  • Is fully or partially bound by:
    • The quantity of ALU operations, or
    • Overall VGPR register footprint.

Data parallelism typically comes in two forms. Packed instructions can easily be used on code employing 2-, 3- or 4-component vectors. Alternatively, strictly scalar code can be made suitable for packed instructions by unrolling the loop manually and working on pairs of data.
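As a rough illustration of the second form, a purely scalar function can be rewritten to process two values at once, giving the compiler a chance to emit packed instructions (the function and the curve below are invented for the example, not taken from real shader code):

```hlsl
// Scalar form: one value at a time, no packing possible.
min16float Curve( min16float x )
{
    return x * ( x * 0.5 + 1.0 );
}

// Pairwise form: two values packed into a min16float2, so the
// compiler can evaluate both lanes with packed FP16 instructions
// such as v_pk_mul_f16 and v_pk_add_f16.
min16float2 Curve2( min16float2 x )
{
    return x * ( x * 0.5 + 1.0 );
}
```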

Common Targets

A reliable target for FP16 optimisation is the blending of colour and normal maps. These operations are typically heavy on data-parallel ALU operations. What’s more, such data frequently originates from a low-precision texture and therefore fits comfortably within FP16’s limitations. A typical game frame has a plentiful supply of these operations in gbuffer export and post-process shaders, all ripe for optimisation.

BRDFs are an attractive but difficult candidate. The portion of a BRDF that computes specular response is typically very register- and ALU-intensive. This would seem a promising target. However, caution must be exercised. BRDFs typically contain exponent and division operations. There are currently no FP16 instructions for these operations. This means that at best there will be no parallelisation of those operations; at worst it will introduce conversion overhead between FP16 and FP32.

All is not lost. There is a suitable optimization candidate in the typical BRDF equation: the large number of vectors and dot products typically present. Whilst individual dot products are more a data reduction operation than a data parallel operation, many dot products can be performed in parallel using SIMD code. These dot products often feed back into FP32 BRDF code, so care must be taken not to introduce FP16 to FP32 conversion overhead that exceeds the gains made.
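One way to sketch this, assuming the usual N, L and H vectors of a typical BRDF (the helper function below is hypothetical): rather than computing dot( n, l ) and dot( n, h ) separately, interleave their components so both dot products advance together.

```hlsl
// Compute dot( n, l ) and dot( n, h ) in parallel. Each line maps
// naturally onto a packed FP16 multiply or multiply-add across
// both halves of the register.
min16float2 DotPair( min16float3 n, min16float3 l, min16float3 h )
{
    min16float2 r;
    r  = n.x * min16float2( l.x, h.x );
    r += n.y * min16float2( l.y, h.y );
    r += n.z * min16float2( l.z, h.z );
    return r; // r.x = dot( n, l ), r.y = dot( n, h )
}
```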

Finally, TAA or checker-boarding systems offer strong potential for optimisation alongside surprising risks. These systems perform a great deal of colour processing, and ALU can indeed be the primary bottleneck. UV calculations often consume much of this ALU work. It is tempting to assume these screen-space UVs are well within the limits of FP16. Surprisingly, the combination of small pixel velocities and high resolutions such as 4K can cause artefacts when using FP16. Exercise care when optimising similar code.

Constants

The most efficient way to write FP16 code is to supply it with FP16 constant data. Any use of FP32 constant data will invoke a conversion operation. Constant data typically occurs in two forms: cbuffer values and literals.

In an ideal world, there would be an FP16 version of every cbuffer value available for use. In practice, it is often possible to obtain a performance advantage just using FP32 cbuffer data. It depends on how frequently a constant is used. If a constant is used only once or twice it is no slower to simply use a mix instruction. If a constant is used more widely, or on vectors, it is usually more efficient to provide an FP16 cbuffer value. Clearly, larger types such as vectors or matrices should be supplied as native FP16 data as the conversion overhead would be prohibitive.

The second source of constant data is the use of literal values in the shader. It is tempting to assume that using the h suffix would be sufficient to introduce an FP16 constant. It isn’t. Again, the half type is for backwards compatibility and FXC converts it to an FP32 literal. Using either the h or f suffix will result in a conversion. It is better to use the unadorned literal, such as 0.0, 1.5 and so on. Generally, the compiler is able to automatically encode that literal as FP32 or FP16 as appropriate according to context.

One exception is expanding literals for use in an operation with a vector. Sometimes the compiler is unable to expand the literal to a min16float3 automatically. In this case, you must either manually construct a min16float3, or use syntax such as 1.5.xxx.
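For example (a hypothetical function, just to show the two syntaxes side by side):

```hlsl
min16float3 Scale( min16float3 v )
{
    // If 'v * 1.5' fails to expand the literal automatically,
    // swizzle-expand it to three components:
    return v * 1.5.xxx;

    // ...or construct the vector explicitly:
    // return v * min16float3( 1.5, 1.5, 1.5 );
}
```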

Loading FP16 data

Recall the earlier example code snippet. Whilst the compiler emitted the expected v_pk_mul_f16 operations, it didn’t emit the code sequence you might expect to load a min16float4 from memory. It loaded FP32 values and packed them down to an FP16 vector manually. If you were to access a larger type, such as a min16float4x4 matrix, the code sequence would be very sub-optimal. There is an easy solution. If we change the source code to:

cbuffer params
{
    uint2 packedColour;
};

Texture2D<min16float4> tex;
SamplerState samp;

min16float2 UnpackFloat16( uint a )
{
    float2 tmp = f16tof32( uint2( a & 0xFFFF, a >> 16 ) );
    return min16float2( tmp );
}

min16float4 UnpackFloat16( uint2 v )
{
    return min16float4( UnpackFloat16( v.x ), UnpackFloat16( v.y ) );
}

min16float4 test( in min16float2 uv : TEXCOORD0 ) : SV_Target
{
    min16float4 colour = UnpackFloat16( packedColour );
    return colour * tex.Sample( samp, uv );
}

The driver recognises this code sequence, and issues a much more optimal sequence of instructions:

shader main
  asic(GFX9)
  type(PS)
  s_mov_b32     m0, s20
  s_mov_b64     s[2:3], exec
  s_wqm_b64     exec, exec
  s_setreg_imm32_b32  hwreg(HW_REG_MODE, 0, 8), 0x000001cc
  v_interp_p1ll_f16  v2, v0, attr0.x
  v_interp_p1ll_f16  v0, v0, attr0.y
  v_interp_p2_f16  v2, v1, attr0.x, v2
  v_interp_p2_f16  v2, v1, attr0.y, v0 op_sel:[0,0,0,1]
  image_sample  v[0:3], v[2:4], s[4:11], s[12:15] dmask:0xf a16 d16
  s_buffer_load_dwordx2  s[0:1], s[16:19], 0x00
  s_setreg_imm32_b32  hwreg(HW_REG_MODE, 0, 8), 0x000001c0
  s_waitcnt     vmcnt(0) & lgkmcnt(0)
  v_pk_mul_f16  v0, v0, s0 op_sel_hi:[1,1]
  v_pk_mul_f16  v1, v1, s1 op_sel_hi:[1,1]
  v_mov_b32     v2, v0 src0_sel: WORD_0
  v_mov_b32     v0, v0 src0_sel: WORD_1
  v_mov_b32     v3, v1 src0_sel: WORD_0
  v_mov_b32     v1, v1 src0_sel: WORD_1
  s_mov_b64     exec, s[2:3]
  v_lshl_or_b32  v0, v0, 16, v2
  v_lshl_or_b32  v1, v1, 16, v3
  exp           mrt0, v0, v0, v1, v1 done compr vm
  s_endpgm
end

Finally, it is useful to embed FP16 constants at the end of the cbuffer rather than mix them alongside FP32 constants. This makes it much easier to strip away FP16 constants for the non-FP16 compatibility path, causing minimal effect on cbuffer size, layout and member alignment for both C++ and shader code.
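A sketch of such a layout, assuming a pre-processor toggle along the lines described earlier (the member names and define are hypothetical):

```hlsl
cbuffer params
{
    // FP32 constants first: identical offsets on every path.
    float4 tintColour;
    float2 screenSize;

    // FP16 constants packed at the end, so stripping them for
    // the non-FP16 compatibility path leaves everything above
    // untouched in both C++ and shader code.
#if defined( ENABLE_FP16 )
    uint2  packedColour;   // two packed FP16 pairs, unpacked in the shader
#endif
};
```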

It’s worth noting that Shader Model 6.2 supports 16-bit scalar types for all memory operations, meaning that the above issue will eventually go away!

Challenges

FP16 optimisation typically encounters two main problems:

  • Conversion overhead between FP16 and FP32.
  • Code complexity.

At present, FP16 is typically introduced to a shader retrospectively to improve its performance. The new FP16 code requires conversion instructions to integrate and coexist with FP32 code. The programmer must take care to ensure these instructions do not equal or exceed the time saved. It is important to keep large blocks of computation purely FP16 or purely FP32 in order to limit this overhead. Indeed, shaders such as post-process or gbuffer exports can often run entirely in FP16.

This leads us to the final point. FP16 code adds a little extra complexity to shader code. This article has outlined issues such as minimising conversion overhead, the special code to unpack FP16 data, and maintaining a non-FP16 code path. Whilst these issues are easily overcome, they may make the code take a little more effort to write and maintain. It is important to remember the reward is very worthwhile.

Conclusion

FP16 is a valuable additional tool in the programmer’s toolbox for obtaining peak shader performance. We have observed gains of around 10% on AMD RX Vega hardware. This is an attractive and lasting return for a moderate investment of engineering effort.

Links

Radeon GPU Analyzer
AMD Radeon RX Vega Instruction Set

Tom is a Principal Programmer at Codemasters Software, working on the F1 franchise. He has over 20 years of experience in the games industry across all PC and console generations. Tom specialises in rendering and optimisation across all platforms, devices and APIs.

The post First steps when implementing FP16 appeared first on GPUOpen.


AMD GPU Services 5.2.0


The AMD GPU Services (AGS) library provides game and application developers with the ability to query information about installed AMD GPUs and their driver, in order to access useful information that isn’t normally available through standard operating system or graphics APIs. AGS is also the gateway to some of the very useful extra functionality provided by AMD GPUs, especially for games and apps running on Windows®.

We recently released AGS 5.2.0 and its associated documentation, adding some commonly requested functionality and fixing some bugs (of course!).

The major functionality is the addition of app registration for DirectX® 12. App registration lets you give more information about your game or application to our driver, which can then use that (ideally unique) information to better support the game or app if we need to make driver-side changes to help things run as efficiently and correctly as possible.

We also changed how you get access to extensions under DX12, requiring you to create your GPU device using agsDriverExtensionsDX12_CreateDevice(), instead of the normal D3D12CreateDevice() call you’d make to D3D.

Lastly, we’ve also added support for breadcrumb markers in D3D11. Using the agsDriverExtensionsDX11_WriteBreadcrumb() API, you can put in place a strategy for debugging driver issues more easily. Sometimes your game or app can interact with the driver in a way that causes it to crash or TDR. The new API gives you the ability to leave markers around your D3D11 API calls, helping you narrow down exactly what interaction with the driver caused the problem.

That ability is very powerful, helping you to help us either fix the driver issue, or help you find a way to work around it in your game or app.

Download and Documentation

You can download the latest 5.2.0 release on GitHub and check out the product page on GPUOpen. We also updated the documentation to reflect the 5.2.0 changes.

Rys Sommefeldt looks after the Game Engineering group in Europe, which is part of the Radeon Technologies Group at AMD. Links to third party sites are provided for convenience and unless explicitly stated, AMD is not responsible for the contents of such linked sites and no endorsement is implied.

The post AMD GPU Services 5.2.0 appeared first on GPUOpen.

Radeon GPU Profiler 1.3


Radeon GPU Profiler 1.3

First, happy birthday to RGP! We released 1.0 publicly almost exactly a year ago at the time of writing, something I’ve just realised as I put together this post. Thank you to everyone that’s used it, sent us feedback and asked for fixes, changes or features, and helped repro bugs — RGP wouldn’t be a success without you.

Changes in 1.3

This release has a lot of work under the hood to prepare it for some big features and changes coming in future releases, but there are still a few notable things worth talking about. The first is that we now surface some extra data about shader resources in a new table, showing you register and LDS usage for the particular shader you want to inspect, along with theoretical occupancy, to give you an at-a-glance view of what’s happening in a shader without having to switch to the Pipeline view.

We now also have a new render target view that lets you get an overview of RTs used in the profiled frame. Based on the active grouping mode you can either see a top-level listing of the render targets per frame, or per pass, and the UI will adjust your view of the statistics to suit. If you drill down into an RT to take a closer look, you can see the details you’d expect — name, format, dimensions, the number of draw calls that rendered to it, whether DCC was on or off, MSAA sample count — plus a couple of other deeper draw-related pieces of information that depend on draw state and the underlying hardware.

Along with that, RGP and RDP together are more robust in the face of applications that create multiple device contexts before using one for rendering. It turns out that’s pretty common, usually to use the initial device context to query device capabilities, so we handle that case a lot better than before, improving compatibility with quite a few more games.

So while 1.3 might be a relatively small release as far as visible features go, a lot has happened internally to lay the groundwork for an exciting crop of features to be released in the future. We’ll talk about those, and the other features we’re working on for 1.4 and beyond, here on GPUOpen as soon as we can!

More information

As always, you can find out more about RGP, including links to the release binaries on GitHub and the full release notes list, on our product page. As always: please send us your feedback so that we can keep making RGP the very best developer-focused performance analysis tool for modern graphics work. It’s incredibly valuable to us and helps shape the roadmap for RGP 💪


The post Radeon GPU Profiler 1.3 appeared first on GPUOpen.

Radeon GPU Profiler 1.3.1


Radeon GPU Profiler 1.3.1

RGP 1.3.1 is a hotfix release to keep compatibility with an upcoming Radeon Adrenalin Edition graphics driver. That driver descends from a newer branch of our driver source than existing Adrenalin Edition drivers, with changes that break compatibility with RGP 1.3 and all earlier releases.

Please grab 1.3.1 ahead of that driver coming out, so that RGP will continue to work seamlessly when the driver is released and you upgrade.

Radeon GPU Profiler 1.3.1 on GitHub.

More information

You can find out more about RGP, including links to the release binaries on GitHub, videos on using RGP for performance optimisation, and more, on our product page.


The post Radeon GPU Profiler 1.3.1 appeared first on GPUOpen.

Optimize your engine using compute @ 4C Prague 2018


Organised by the fine folks at Wargaming, the 4C conference was held in Prague over 2 days in early October this year, bringing attendees and speakers from all across the games industry together to discuss the varied facets of making games.

From the human element of those making games, and all the way across the full gamut of production, technology, user experience, enjoyment, business, failure, story telling, QA, project management and more, 4C brought the European game development community together in a picture-perfect environment.

The conference proceedings are now available on 4C’s YouTube channel, including a great presentation on how compute shaders work, both conceptually and in practice on GCN and other GPUs, and how they can help optimise your rendering engine.

Presented by Lou Kramer, part of the Game Engineering team inside of AMD’s Radeon Technologies Group, the 40 minute talk will walk you through what compute shaders are, why they are different to the other shader types that use graphics-specific logic in a GPU, how you program them, and finish with a great practical example of something very useful to any rendering engine.

Slides

Download the slides (2.9MB Powerpoint).


The post Optimize your engine using compute @ 4C Prague 2018 appeared first on GPUOpen.

AMD GPU Services 5.3.0


The AMD GPU Services (AGS) library provides game and application developers with the ability to query information about installed AMD GPUs and their driver, in order to access useful information that isn’t normally available through standard operating system or graphics APIs. AGS is also the gateway to some of the very useful extra functionality provided by AMD GPUs, especially for games and apps running on Windows®.

We recently released AGS 5.3.0 and its associated documentation.

AGS 5.3.0 changes

The major new features in AGS 5.3, versus the 5.2 (and 5.2.1 bugfix) release are the addition of deferred context support for our multidraw indirect and UAV overlap extensions, along with a helper function to let your app determine if the currently installed driver meets the minimum driver version requirements of your game.

For the latter, if you’re a Vulkan user, you can also pair that driver version helper functionality with our machine readable AMD Vulkan versions database. Check out our GPUOpen blog post describing how that database works, and how you can get access to the XML that drives it.

Finally, we’ve also added support for a new FreeSync 2 gamma 2.2 mode. The new mode uses a 10bpc swapchain (with 2 bits of alpha), and it’s designed to be an alternative to the 16bpc (with 16 bits of alpha) FreeSync 2 scRGB mode, helping you save some bandwidth.

If you’re an AGS user and would like any functionality added, please get in touch.

Download and Documentation

You can download the latest 5.3.0 release on GitHub and check out the product page on GPUOpen. We also updated the documentation to reflect the 5.3.0 updates.


The post AMD GPU Services 5.3.0 appeared first on GPUOpen.

Radeon GPU Profiler 1.4


Radeon GPU Profiler 1.4

While the G in GPU stands for graphics, there are also popular SIMD programming models and associated APIs that map well to the GPU but mostly bypass their fixed graphics-specific logic. Khronos OpenCL™ is one such model and API that we support on Radeon GPUs, and the ability to profile OpenCL workloads in RGP is the big new feature in RGP 1.4.

OpenCL profiling

Most of the major RGP features that you’re used to using for profiling graphics workloads generated by Vulkan® and DirectX® 12 are there when profiling OpenCL applications, including the workload and barrier overviews.

Visible barriers between two dispatches and colour coded SE differentiation

You can see a pair of compute dispatches in the image above, separated by a barrier event that’s been highlighted and shown in the information pane on the right. You can see the first dispatch drain out of the machine before the subsequent dispatch starts up.

You can also see that the wavefronts are coloured by the SE they were dispatched to inside the GPU, highlighting how the GPU was filled up by the particular dispatch, and how it dynamically executes the workload by balancing waves across the available execution resources.

Time progresses on the x-axis, occupancy is represented on the y-axis

The profile summary view can be configured to show how your kernel enqueues progressed through the machine and how well they were able to fill it up.

It’s worth noting that RGP’s view of barrier synchronisation is different to how you drive it from the OpenCL API. You’ll see that barriers show up in RGP as CmdBarrier()s, rather than cl_events or regular barrier() calls in your kernels. Check the documentation for more information.

Lastly, all of RGP’s familiar UI features and idioms from profiling graphics workloads are there. Things like: you can zoom into specific workload areas, see CPU submission markers, interact with the timeline events as you see them to get more information on the rightmost summary pane, and selecting a region inside the timeline view will show you its duration.

Platform and hardware support

OpenCL profiling support is available on Windows 10, Windows 7 and Ubuntu 18.04.1 LTS on systems with Radeon RX Vega 56 and Radeon RX Vega 64 GPUs, and on AMD Ryzen processors with integrated Radeon Vega graphics.

Driver support

RGP 1.4 requires specific driver support to work correctly, so please make sure that you have at least 18.12.2 installed. In 18.12.2’s specific case that means using this special hotfix driver that we released alongside the regular 18.12.2 release. It fixes a couple of last minute issues we couldn’t roll in due to WHQL QA pressure.

Drivers from 18.12.3 onwards don’t have that special hotfix requirement so you’re good to go after that!

More information

As always, you can find out more about RGP, including links to the release binaries on GitHub and the full release notes list, on our product page. As always: please send us your feedback so that we can keep making RGP the very best developer-focused performance analysis tool for modern graphics work.

Your feedback is incredibly valuable to us and helps drive the RGP roadmap.


The post Radeon GPU Profiler 1.4 appeared first on GPUOpen.

Radeon GPU Analyzer 2.1


Radeon GPU Analyzer (RGA) is our offline compiler and integrated code analysis tool, supporting the high-level shading and kernel languages that are consumed by DirectX® 11, Vulkan®, OpenGL® and OpenCL™, including HLSL, GLSL, the OpenCL kernel language, and SPIR-V™.

RGA lets you write and edit shader or kernel programs, and then analyse the generated machine ISA for a wide range of supported AMD GPUs, showing you the isolated cost of a particular program as you develop it, to help you understand and fine-tune it for the target GPU you care about.

Along with support for Vulkan in the RGA GUI, the biggest new feature in RGA 2.1 is a new analysis system that lets you obtain the GCN machine ISA and hardware resource information, using the compiler in the running driver that you have in your system.

The hardware resource information gives you a rolling summary of resource usage (VGPRs, SGPRs, LDS and more) across the entire program that updates as you type, underneath the ISA window on the right.

As you can see, there’s also a resource tracker that will show you performance hazards such as register spills and how much instruction cache pressure you’re putting on the target machine that you’re analysing for.

The screenshot above shows editing of a SPIR-V input program used in a Vulkan application, with associated disassembly for the Vega graphics architecture.

You can see all of the major information at a glance, too. It’s a GLSL input vertex shader, with highlight on a line of the resulting SPIR-V after being run through glslang , the associated highlight of the resulting line of Vega machine ISA in the right-hand pane, and the detailed build output below which shows how the offline compiler was invoked, and which products the gfx900 graphics IP is used in, across all product families (Radeon RX, Radeon Instinct and Radeon Pro).

You’ll have access to not just code generation for Vega, but for all of our GCN- and Vega-powered GPUs, including the recently released AMD Radeon VII, and you can compile for those GPUs without needing to have them in your system. That list is searchable too, including by product name: just start typing to filter it and select what you’re interested in.

Lastly, there’s a new feature that lets you edit the pipeline state of a running Vulkan application using a Vulkan layer that you activate in your application. RGA then intercepts Vulkan pipeline objects (more info here!) and lets you edit them in a searchable, filterable tree. The tree represents the relationships between parts of the pipeline state, something which is traditionally hard to grasp just by reading the spec.

You can see that feature above, highlighting all parts of the VkGraphicsPipelineCreateInfo that are related to blending. Check out the RGA manual for more information about how to activate that layer in your game or application.

We’re very excited to release it, and because RGA is open source, binaries are available on GitHub that work on Windows 7 and 10 (64-bit only), Ubuntu 18.04, and RHEL 7.

Note that the new UI only supports Vulkan and OpenCL project types today; the command line version of RGA supports a wider range of platforms, including DirectX 11 and OpenGL. Future updates will widen the API support in the UI so that it more closely matches the command line version of RGA, and we’re happy to confirm that DirectX 12 is next; we’re working hard on it and we’ll let everyone know as soon as that’s ready.

Find out more on our RGA product page here on GPUOpen, and if you want help using RGA or integrating the command line version into your own tooling, drop us a line!

Rys Sommefeldt looks after the Game Engineering group in Europe, which is part of the Radeon Technologies Group at AMD. Links to third party sites are provided for convenience and unless explicitly stated, AMD is not responsible for the contents of such linked sites and no endorsement is implied.

The post Radeon GPU Analyzer 2.1 appeared first on GPUOpen.


Radeon GPU Profiler 1.5.1


Radeon GPU Profiler 1.5

We previewed the main RGP 1.5 features at GDC 2019 late last month, but didn’t set the release free because it relied on driver support that wasn’t quite ready for primetime. It’s one of the biggest upgrades in feature set since the 1.0 release back in July 2017, adding two of the most requested features we’ve been asked for since 1.0, along with some really great improvements to frame navigation that we hope you’ll love. That driver work is now baked and ready, so let’s dive into what’s new.

Note that the first release of the RGP 1.5 series is 1.5.1, not 1.5.0. While waiting for the driver support we needed to mature into a release driver, we had time to iterate on things such that 1.5.1 was ready in time.

Instruction Timing

If you’ve ever used RGP to profile your frame and wondered what was going on deep in the hardware as it issues instructions, now you can find out! Using the GPU’s built-in hardware-level tracing support to dump instruction-level trace data during a capture, RGP now shows you that data in a dedicated timing view to help you understand how it performs.

Dumping the instruction-level data doesn’t require any extra work or instrumentation on your part, or any recompilation of shaders; all you need to do is profile a frame. As you can see in the screenshot above, you get GCN ISA on the left, along with timing data to follow as the program flows from top-to-bottom. You get average timings in clocks and as a percentage of the shader program, along with an intuitive time-based graph on the right.

The screenshot shows something worth noting: instruction timings shown are an average, and variable. That might be counter-intuitive, so here’s why: because of the way GPUs work to hide memory latency, and how they share execution resources, sometimes the shader core needs to wait for memory or other data dependencies to be satisfied before an instruction can make forward progress.

In general, if you see a long bar on the timing view it’s because off-chip memory accesses were involved in running that particular instruction. You can also use the new data to decide if max occupancy is what you need to achieve in the shader you’re analysing. Because wavefronts share access to resources, particularly the cache hierarchy, fully occupying the machine can be detrimental to performance by decreasing the effectiveness of the caches. Instruction timing data along with RGP’s data on VALU utilisation can help you work on that balance if needed!
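As a toy illustration of why the reported numbers are averages (the figures below are invented for the example, not RGP output), a single off-chip memory access can dominate the mean latency of an instruction that usually issues quickly:

```cpp
#include <cstdint>
#include <numeric>
#include <vector>

// Toy model only -- not RGP's actual accounting. Averages the per-issue
// latency of one instruction across many wavefront executions.
double averageLatencyClocks(const std::vector<uint64_t>& perIssueClocks)
{
    if (perIssueClocks.empty())
        return 0.0;
    const uint64_t total = std::accumulate(perIssueClocks.begin(),
                                           perIssueClocks.end(), uint64_t{0});
    return static_cast<double>(total) / static_cast<double>(perIssueClocks.size());
}

// Nine issues that hit in-cache data (4 clocks each) plus one that stalls on
// off-chip memory (400 clocks) average out to 43.6 clocks -- the single miss
// dominates, which is why one long bar in the timing view often means memory
// traffic rather than a uniformly slow instruction.
```

The same arithmetic explains the occupancy trade-off: more wavefronts in flight can hide those 400-clock stalls, but only while they aren't evicting each other's data from the caches.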

The high-level workflow to get instruction timing is pretty straightforward. Highlight an expensive event in the normal timeline view and copy the identifying hash for that pipeline state object (PSO), then hop over the Radeon Developer Panel and plug that hash in before triggering a profile to be collected. That’ll instruct the GPU to dump instruction timing data for that particular PSO.

Back in RGP, select an event that contains the traced PSO and pick a shader stage to get access to the new view. Done!

Shader ISA

The natural companion to instruction timing is being able to view the GPU ISA that’s part of a pipeline state object, so RGP 1.5 adds that into the pipeline state pane like so:

Each active shader stage that has embedded GPU ISA that we can show you will now have two tabs. The first is the same information tab that you’re familiar with that contains information about the hardware occupancy of the particular event you’re interested in. That’s where you see number of wavefronts launched, the average number of active threads inside those wavefronts, the average duration of a wavefront, and information about the theoretical occupancy.

Next to that tab is a new one labelled ISA , showing you the hardware instructions sent to the GPU, after being compiled from the higher level language you submitted with the dispatch or draw using one of RGP’s supported high-level APIs. It’s very similar to what you’ll see in the instruction-level timing view above, colour coded to help you separate instructions and operands, jump labels, different instruction types and other relevant parts in the disassembly.

Improved user marker display, grouping and filtering

Lastly we’ve got a couple of nice quality of life improvements that improve on functionality already there. The first is related to user markers: if the thing you’re profiling has user marker support, we now show you those user markers in the event view as labels on top of the event!

In that screenshot you can see what I mean, with the coloured events overlayed with contrast-appropriate text labels to help you annotate what you can see above in the wavefront occupancy view. That makes it easier to see how your frame is put together, get a better idea of what is overlapping when that’s happening, and follow the general flow of what you’re submitting to the GPU a lot easier than before.

Secondly, we’ve improved how the UI lets you group and filter events to make it easier to navigate your profiled frame, find events of interest, search for PSO hashes, and group together events by related state in a nicer way. RGP tries to make it easy to find actionable data in your frame and the filtering and grouping changes are designed to help you narrow down to an area of interest and do something about what you can see.

Driver support

RGP 1.5 requires specific driver support in order to work correctly, so please make sure you have at least 19.5.1 installed, which was released yesterday.

More information

As always, you can find out more about RGP, including links to the release binaries on GitHub and the full release notes, on our product page. And please send us your feedback so that we can keep making RGP the very best developer-focused performance analysis tool for modern graphics and compute profiling work.

Your feedback is incredibly valuable to us and helps drive the RGP roadmap forward. The vast majority of changes in RGP 1.5 were user-driven, so if you want something and it makes sense then just let us know!

Rys Sommefeldt looks after the Game Engineering group in Europe, which is part of the Radeon Technologies Group at AMD. Links to third party sites are provided for convenience and unless explicitly stated, AMD is not responsible for the contents of such linked sites and no endorsement is implied.

The post Radeon GPU Profiler 1.5.1 appeared first on GPUOpen.

Radeon GPU Profiler 1.6


With this latest incarnation of RGP, we have added support for AMD’s new Radeon RX 5700 and RX 5700 XT ‘Navi’ graphics cards. Since this is a new architecture, some of the UI elements have changed to display the hardware features but most of the UI is unchanged and will be familiar. There’s a new set of hardware shaders, and these can be seen in the side panel of the Wavefront Occupancy or Event Timing views.

In the image below, selecting a group of events will show which events use the default pipeline and which events use the new Next generation pipeline.

We’ve also added a couple of smaller tweaks to the instruction timing view so that it’s now possible to search for keywords. This allows you to search for instruction types and operands. The ‘goto line’ has also been moved to the top of the UI beside the search box so the UI is more consistent with the rest of the tool.

The ‘Recent’ section of the Welcome page now only shows the name of the profile rather than the whole path. The full data can still be seen by clicking on the “Recent profiles” button on the left.

The RadeonDeveloperPanel shipped as part of RGP has also been modified slightly: the ‘Application blacklist’ is now exposed in the user interface. This allows you to specify which applications you don’t want enabled for profiling. More and more applications running as part of Windows use DX12, and these can get enabled for profiling if they start up while the panel is connected to the driver.

More information

As always, you can find out more about RGP, including links to the release binaries on GitHub and the full release notes, on our product page. And please send us your feedback so that we can keep making RGP the very best developer-focused performance analysis tool for modern graphics and compute profiling work.

Your feedback is incredibly valuable to us and helps drive the RGP roadmap forward, so if you want something and it makes sense then just let us know!

Anthony is a Software Development Engineer in the GPU Developer Tools team. Links to third party sites, and references to third party trademarks, are provided for convenience and illustrative purposes only. Unless explicitly stated, AMD is not responsible for the contents of such links, and no third party endorsement of AMD or any of its products is implied.

The post Radeon GPU Profiler 1.6 appeared first on GPUOpen.

AGS SDK 5.4 improves handling of video memory reporting for APUs


The AGS (AMD GPU Services) library provides game and application developers with the ability to query information about installed AMD GPUs and their driver, in order to access useful information that isn’t normally available through standard operating system or graphics APIs. AGS is also the gateway to some of the very useful extra functionality provided by AMD GPUs, especially for games and apps running on Windows®.

We have recently released our latest update – version 5.4 – and with it comes improved handling of video memory reporting for APUs (AMD Accelerated Processing Units). When determining total available graphics memory, things can often be complicated by whether you’re on an APU or a discrete GPU, and whether the information you require is available in the graphics API you’re using.

For APUs, this distinction is important as all memory is shared memory, with an OS typically budgeting half of the remaining total memory for graphics after the operating system fulfils its functional needs. As a result, the traditional queries to Dedicated Video Memory in these platforms will only return the dedicated carveout – and often represent a fraction of what is actually available for graphics. Most of the available graphics budget will actually come in the form of shared memory which is carefully OS-managed for performance.

With AGS version 5.4 we’ve introduced the isAPU flag. Query this flag to determine which memory value is relevant:

  • If isAPU is true, set your video memory budget to sharedMemoryInBytes .
  • If isAPU is false, you can treat the device as a discrete GPU and should use localMemoryInBytes .

We often hear requests for the ability to query the details surrounding shared memory in unified memory architecture GPUs. To support this, AGS can provide video memory bandwidth, allowing developers who require it some additional granularity in platform bucketing.

Key variables in AGSDeviceInfo include:

  • isAPU : Boolean. Returns whether the device is an APU.
  • localMemoryInBytes : Returns dedicated video memory in bytes. Use this for video memory budget if isAPU is false.
  • sharedMemoryInBytes : Returns shared video memory in bytes. Use this for video memory budget if isAPU is true.
  • memoryBandwidth : Returns memory bandwidth in MB/s.
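Putting those fields together, the budget selection is a one-liner. The struct below is a simplified stand-in for AGSDeviceInfo (the real definition and exact field types live in the AGS 5.4 headers), so treat this as a sketch rather than the library’s actual API:

```cpp
#include <cstdint>

// Simplified stand-in for the AGSDeviceInfo fields described above; the real
// struct is defined in the AGS 5.4 headers and has many more members.
struct DeviceInfoSketch
{
    bool     isAPU;                // true on an APU, false on a discrete GPU
    uint64_t localMemoryInBytes;   // dedicated video memory
    uint64_t sharedMemoryInBytes;  // OS-managed shared memory
    uint64_t memoryBandwidthMBps;  // memory bandwidth in MB/s
};

// Video memory budget per the guidance above: shared memory on APUs,
// dedicated (local) memory on discrete GPUs.
uint64_t videoMemoryBudget(const DeviceInfoSketch& info)
{
    return info.isAPU ? info.sharedMemoryInBytes : info.localMemoryInBytes;
}
```

The bandwidth field can then feed whatever platform-bucketing scheme your title already uses, for example dropping low-bandwidth unified-memory systems into a more conservative quality tier.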

More information

John is a Game Developer Technology Engineer in AMD's Radeon Technology Group. His specialties are in CPU performance and APU graphics.

The post AGS SDK 5.4 improves handling of video memory reporting for APUs appeared first on GPUOpen.

Radeon™ GPU Profiler 1.7


We are happy to announce the release of Radeon™ GPU Profiler (RGP) v1.7. This release adds support for the latest Radeon™ graphics cards: the RX 5500 series and the RX 5300 series. In addition, this release adds new UI features to help you better understand your GPU workloads.

RGP generates easy to understand visualizations of how your DirectX®12, Vulkan®, and OpenCL™ applications interact with the GPU at the hardware level. Profiling a game is a quick and simple process using the Radeon Developer Panel and our public GPU driver.

Pipelines Pane

First, a new Overview pane has been added: the Pipelines pane. This pane summarizes the pipeline usage for the profile, including detailed information about the shaders contained in the pipeline and the events which use the pipeline. In the image below, you can see this new pane.

New Pipelines Pane


In the above image, you can see another new addition. When viewing traces taken on RDNA hardware, RGP will now tell you whether a shader was compiled in wave32 or wave64 mode. In addition to being displayed in the Pipelines overview, this is also displayed in the Pipeline State view for a particular event.

New Overlays for Wavefront Occupancy View

Next, the timeline in the Wavefront Occupancy view now includes several overlays which allow you to visualize additional data alongside the events. In previous RGP releases, user events were displayed at the top of the timeline to provide extra context to the actual GPU events. In RGP 1.7, this has been enhanced to allow optional display of additional overlays. The new overlays supported are Hardware Contexts, Command Buffers, and Render Targets.

In the image below, you can see the UI element used to select which overlays are displayed.  In the timeline itself, you can see each of the four types of overlays displayed in rows at the top of the timeline.

New Overlays for Wavefront Occupancy View

One other new feature is the display of the amount of overhead incurred and bandwidth used when the profile was captured. This can be found in the Frame Summary pane, as seen below.

Other Improvements

In addition to the above UI items, RGP 1.7 also includes an improved algorithm for assigning latencies to instructions in the Instruction Timing view. This should lead to more accurate instruction timing data.

This release also includes many bug fixes and smaller enhancements, all designed to improve the profiling experience.

Resources

As always: please send us your feedback so that we can keep working to make RGP the very best developer-focused performance analysis tool for modern graphics and compute profiling work.

Your feedback is incredibly valuable to us and helps drive the RGP roadmap forward, so if you want something and it makes sense then just let us know!

Chris Hesik is the GPU Profiler technical lead for the Developer Tools Group at AMD. Links to third party sites are provided for convenience and unless explicitly stated, AMD is not responsible for the contents of such linked sites and no endorsement is implied.

The post Radeon™ GPU Profiler 1.7 appeared first on GPUOpen.

Viewing all 32 articles
Browse latest View live