[Repost] Redundancy and Latency in Structured Buffer Use - 白红宇

Published: 2019-06-15

From: https://developer.nvidia.com/content/redundancy-and-latency-structured-buffer-use

In a previous post, we discussed a simple pothole that developers often hit when using structured buffers. This post dives into a much more subtle issue: how shader structure affects the efficiency of processing structured buffers.

Developers can benefit substantially in performance by giving some attention to the subject of redundancy in structured buffer processing. It is quite common to see code like the following. The code certainly isn’t wrong, and it may even be the best solution. However, every thread is redundantly fetching the same data. For a large light list, this is potentially a large amount of redundant work.

    StructuredBuffer<Light> LightBuf;

    for (int i = 0; i < num_lights; i++)
    {
        Light cur_light = LightBuf[i];
        // Do Lighting
    }

In the case of Structured Buffers, the mechanisms implementing them are architected around good performance for divergent accesses. This means each fetch can have a fair amount of latency. When all threads are completely coherent, the cache hit ratio is fantastic, but it still doesn’t resolve the latency. In a case like the code above, fetching multiple light indices in parallel would likely have approximately the same latency cost, but more useful work would be accomplished. Batching the data into shared memory could be a win in this case. Below is a snippet of what you could do in a compute shader:

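    StructuredBuffer<Light> LightBuf;
    groupshared Light LocalLights[MAX_LIGHTS];

    // Each thread fetches one distinct light into shared memory.
    // Assumes num_lights <= MAX_LIGHTS and at least num_lights threads per group.
    if (ThreadIndex < num_lights)
        LocalLights[ThreadIndex] = LightBuf[ThreadIndex];
    GroupMemoryBarrierWithGroupSync();

    for (int i = 0; i < num_lights; i++)
    {
        Light cur_light = LocalLights[i];
        // Do Lighting
    }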
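When the light list is longer than the thread group, the same batching idea can be applied one tile at a time. The following is a minimal sketch under stated assumptions, not code from the original post: the Light layout, GROUP_SIZE, the Params constant buffer, Result, and CSMain are all hypothetical names chosen for illustration, and the lighting math is only a placeholder.

```hlsl
// Hypothetical light layout; the original post does not define Light.
struct Light
{
    float4 position_radius; // xyz = position, w = radius
    float4 color;           // rgb = color, a = unused
};

#define GROUP_SIZE 64 // assumed thread-group size

StructuredBuffer<Light>    LightBuf : register(t0);
RWStructuredBuffer<float4> Result   : register(u0); // hypothetical output

cbuffer Params : register(b0) // hypothetical constants
{
    uint num_lights;
    uint num_items;
};

groupshared Light LocalLights[GROUP_SIZE];

[numthreads(GROUP_SIZE, 1, 1)]
void CSMain(uint3 dtid : SV_DispatchThreadID, uint gi : SV_GroupIndex)
{
    float3 accum = float3(0.0, 0.0, 0.0);

    // Walk the light list one tile of GROUP_SIZE lights at a time.
    // num_lights is uniform, so every thread reaches every barrier.
    for (uint base = 0; base < num_lights; base += GROUP_SIZE)
    {
        // Each thread fetches a different light, so one round of fetch
        // latency brings in a whole tile instead of a single light.
        uint src = base + gi;
        if (src < num_lights)
        {
            LocalLights[gi] = LightBuf[src];
        }
        GroupMemoryBarrierWithGroupSync();

        uint count = min((uint)GROUP_SIZE, num_lights - base);
        for (uint i = 0; i < count; i++)
        {
            Light cur_light = LocalLights[i];
            accum += cur_light.color.rgb; // placeholder for real lighting
        }

        // Wait until all threads are done with this tile before it is overwritten.
        GroupMemoryBarrierWithGroupSync();
    }

    if (dtid.x < num_items)
    {
        Result[dtid.x] = float4(accum, 1.0);
    }
}
```

The second barrier per tile is part of the cost mentioned in the text: for short light lists the extra synchronization may outweigh the savings, so a variant like this is worth profiling against the straightforward loop.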

Obviously, an optimization like this adds complexity, and it may not always be a win due to issues like shared memory pressure or extra barrier instructions. The size of the structure also affects how efficiently this works. (For example, a structure that is 1024 bytes in size will lead to some inefficiency, as the stride between threads is quite large.) In some cases, using a simpler structure, where you flatten things to an array of float or float4 and compute the index offsets manually, can be a win. The code is a bit ugly, but this is often an inner loop, and the redundancy elimination may well be worth a couple of ugly macros. As with many things, your mileage will vary, but it is at least something to consider when working with structured buffers.
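As a rough illustration of the flattened layout, the sketch below stores each light as consecutive float4 elements and hides the index arithmetic behind macros. The packing (two float4 values per light), the macro names, LightData, and AccumulateLights are assumptions made for this example and do not come from the original post; the attenuation formula is a placeholder that assumes a positive radius.

```hlsl
// Hypothetical packing: each light occupies LIGHT_STRIDE consecutive float4s.
//   element 0: xyz = position, w = radius (assumed > 0)
//   element 1: rgb = color,    a = unused
#define LIGHT_STRIDE 2
#define LIGHT_POSITION_RADIUS(buf, i) (buf[(i) * LIGHT_STRIDE + 0])
#define LIGHT_COLOR(buf, i)           (buf[(i) * LIGHT_STRIDE + 1])

StructuredBuffer<float4> LightData : register(t0);

float3 AccumulateLights(float3 world_pos, uint num_lights)
{
    float3 accum = float3(0.0, 0.0, 0.0);

    for (uint i = 0; i < num_lights; i++)
    {
        // The macros compute the element offsets by hand.
        float4 pos_radius = LIGHT_POSITION_RADIUS(LightData, i);
        float4 color      = LIGHT_COLOR(LightData, i);

        // Placeholder lighting: linear falloff to zero at the light radius.
        float dist  = distance(world_pos, pos_radius.xyz);
        float atten = saturate(1.0 - dist / pos_radius.w);
        accum += color.rgb * atten;
    }

    return accum;
}
```

Because the macros take the buffer as an argument, they would work equally well against a groupshared float4 array. In that case each thread copies a single 16-byte element during the cooperative load, so neighbouring threads touch neighbouring addresses regardless of how large the logical structure is.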

Experienced readers may be asking themselves how constant buffers compare, given these issues with structured buffers. The answer is that they can actually be dramatically faster. I'll demonstrate this in the final post in the series.

Reposted from: https://www.cnblogs.com/hustztz/p/7574756.html
