[Repost] Redundancy and Latency in Structured Buffer Use
Published: 2019-06-15

From: https://developer.nvidia.com/content/redundancy-and-latency-structured-buffer-use

In a previous post, we discussed a simple pothole that developers often hit when using structured buffers. This post dives into a much more subtle issue: how shader structure affects the efficiency of processing structured buffers.

Developers can benefit substantially in performance by giving some attention to the subject of redundancy in structured buffer processing. It is quite common to see code like the following. The code certainly isn’t wrong, and it may even be the best solution. However, every thread is redundantly fetching the same data. For a large light list, this is potentially a large amount of redundant work.

 

    StructuredBuffer<Light> LightBuf;

    for (int i = 0; i < num_lights; i++)
    {
        Light cur_light = LightBuf[i];
        // Do Lighting
    }

In the case of Structured Buffers, the mechanisms implementing them are architected around good performance for divergent accesses. This means each fetch can have a fair amount of latency. When all threads are completely coherent, the cache hit ratio is fantastic, but it still doesn’t resolve the latency. In a case like the code above, fetching multiple light indices in parallel would likely have approximately the same latency cost, but more useful work would be accomplished. Batching the data into shared memory could be a win in this case. Below is a snippet of what you could do in a compute shader:

    StructuredBuffer<Light> LightBuf;
    groupshared Light LocalLights[MAX_LIGHTS];

    // Each thread copies one light into shared memory
    LocalLights[ThreadIndex] = LightBuf[ThreadIndex];
    GroupMemoryBarrierWithGroupSync();

    for (int i = 0; i < num_lights; i++)
    {
        Light cur_light = LocalLights[i];
        // Do Lighting
    }
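The snippet above only works when the light list is no larger than the thread group. One way to lift that restriction is to walk the list in tiles, as in the minimal sketch below. The `Light` fields, the register bindings, the `OutputTex` target, the `FrameConstants` buffer, and the 8x8 group size are all assumptions made for illustration; the original post does not define them. The lighting math is a placeholder.

```hlsl
// Hypothetical light layout; the original post does not define Light.
struct Light
{
    float3 position;
    float  radius;
    float3 color;
    float  intensity;
};

// Assumed thread-group size (8 x 8 threads).
static const uint GROUP_SIZE = 64;

StructuredBuffer<Light> LightBuf  : register(t0);
RWTexture2D<float4>     OutputTex : register(u0);   // assumed output target

cbuffer FrameConstants : register(b0)               // assumed constant buffer
{
    uint num_lights;
};

groupshared Light LocalLights[GROUP_SIZE];

[numthreads(8, 8, 1)]
void CSMain(uint3 dispatchId  : SV_DispatchThreadID,
            uint  threadIndex : SV_GroupIndex)
{
    float3 result = float3(0.0, 0.0, 0.0);

    // Walk the light list one tile of GROUP_SIZE lights at a time.
    // num_lights is uniform, so every thread reaches every barrier.
    for (uint tileStart = 0; tileStart < num_lights; tileStart += GROUP_SIZE)
    {
        // Each thread fetches a different light, so the fetch latency is
        // paid roughly once per tile instead of once per light.
        uint src = tileStart + threadIndex;
        if (src < num_lights)
        {
            LocalLights[threadIndex] = LightBuf[src];
        }
        GroupMemoryBarrierWithGroupSync();

        uint count = min(GROUP_SIZE, num_lights - tileStart);
        for (uint i = 0; i < count; i++)
        {
            Light cur_light = LocalLights[i];
            // Placeholder for the real lighting math.
            result += cur_light.color * cur_light.intensity;
        }

        // Wait until every thread has finished reading before the next
        // tile overwrites LocalLights.
        GroupMemoryBarrierWithGroupSync();
    }

    OutputTex[dispatchId.xy] = float4(result, 1.0);
}
```

The second barrier is the price of reusing one shared array across tiles. Whether the extra synchronization pays off depends on the light count and on how much shared memory the rest of the shader needs.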

Obviously, an optimization like this adds complexity, and it may not always be a win because of shared memory pressure or the extra barrier instructions. The size of the structure also affects how efficiently this works. (For example, a structure that is 1024 bytes in size will lead to some inefficiency, as the stride between threads is quite large.) In some cases, flattening the data into an array of float or float4 and computing the index offsets manually can be a win. The code is a bit ugly, but this is often an inner loop, and the redundancy elimination may well be worth a couple of ugly macros. As with many things, your mileage will vary, but it is at least something to consider when working with Structured Buffers.
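As a rough illustration of the flattened layout, the sketch below stores each light as two consecutive float4 entries and uses macros to compute the offsets. Adjacent threads then fetch adjacent 16-byte elements rather than whole structures. The packing, the macro names, the register bindings, `OutputTex`, and `FrameConstants` are hypothetical, and the lighting math is a placeholder.

```hlsl
// Hypothetical packing: each light occupies two consecutive float4 entries.
//   entry 0: xyz = position, w = radius
//   entry 1: rgb = color,    w = intensity
#define LIGHT_STRIDE          2
#define LOCAL_POS_RADIUS(i)   LocalData[(i) * LIGHT_STRIDE + 0]
#define LOCAL_COLOR_INTENS(i) LocalData[(i) * LIGHT_STRIDE + 1]

// Assumed thread-group size (8 x 8 threads); a multiple of LIGHT_STRIDE,
// so every tile starts on a light boundary.
static const uint GROUP_SIZE = 64;

StructuredBuffer<float4> LightData : register(t0);
RWTexture2D<float4>      OutputTex : register(u0);  // assumed output target

cbuffer FrameConstants : register(b0)               // assumed constant buffer
{
    uint num_lights;
};

groupshared float4 LocalData[GROUP_SIZE];

[numthreads(8, 8, 1)]
void CSMain(uint3 dispatchId  : SV_DispatchThreadID,
            uint  threadIndex : SV_GroupIndex)
{
    float3 result = float3(0.0, 0.0, 0.0);
    uint num_entries = num_lights * LIGHT_STRIDE;

    for (uint tileStart = 0; tileStart < num_entries; tileStart += GROUP_SIZE)
    {
        // Neighbouring threads read neighbouring float4 elements.
        uint src = tileStart + threadIndex;
        if (src < num_entries)
        {
            LocalData[threadIndex] = LightData[src];
        }
        GroupMemoryBarrierWithGroupSync();

        // Number of complete lights held in this tile.
        uint count = min(GROUP_SIZE, num_entries - tileStart) / LIGHT_STRIDE;
        for (uint i = 0; i < count; i++)
        {
            float4 posRadius      = LOCAL_POS_RADIUS(i);
            float4 colorIntensity = LOCAL_COLOR_INTENS(i);
            // Placeholder for the real lighting math.
            result += colorIntensity.rgb * colorIntensity.w * posRadius.w;
        }

        // Finish reading before the next tile overwrites LocalData.
        GroupMemoryBarrierWithGroupSync();
    }

    OutputTex[dispatchId.xy] = float4(result, 1.0);
}
```

The macros keep the offset arithmetic in one place, so changing the packing only means updating `LIGHT_STRIDE` and the two accessors.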

Experienced readers may be wondering how constant buffers compare, given these issues with structured buffers. The answer is that they can actually be dramatically faster. I’ll demonstrate this in the final post in the series.

Reposted from: https://www.cnblogs.com/hustztz/p/7574756.html
