Render PBO¶

next¶

is a raycast restricted to a near/far range possible ?
add a switch target feature to daeviewgl : to make it easy to do raycasts from wherever
avoid duplicating vertices/faces between chroma/cuda and opengl
as threads get held back by the slowest thread within the block (or warp?), it may be better to arrange blocks to cover rectangles of pixels rather than one-pixel-wide strips ?
- is the slowest thread the one that had to read the most nodes/triangles from memory ?
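
The rectangle idea above can be sketched as an index remapping. This is only an illustration of the arithmetic, not the chroma kernel: the function name `tiled_pixel` and the 8x4 tile shape are hypothetical, and it assumes the image width is a multiple of the tile width.

```python
def tiled_pixel(idx, width, tile_w=8, tile_h=4):
    """Map a linear thread index to (x, y) so that each run of
    tile_w * tile_h consecutive indices covers one rectangular tile.

    Assumes width is a multiple of tile_w (hypothetical layout)."""
    tiles_per_row = width // tile_w
    tile, within = divmod(idx, tile_w * tile_h)   # which tile, position inside it
    tile_y, tile_x = divmod(tile, tiles_per_row)  # tile grid coordinates
    local_y, local_x = divmod(within, tile_w)     # pixel coordinates inside the tile
    return tile_x * tile_w + local_x, tile_y * tile_h + local_y


if __name__ == "__main__":
    # the first 32 indices (one warp) fill an 8x4 rectangle instead of a 32x1 strip
    print([tiled_pixel(i, width=1024) for i in range(32)])
```

With 8x4 tiles a warp of 32 threads sees rays that are close together in both x and y, which may make the per-thread workloads more alike; whether that helps here would need measuring.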

called twice¶

The render gets called twice, possibly due to a glut/glumpy window resize.

For a successful small run:

(chroma_env)delta:chroma_camera blyth$ python render_pbo.py
WARNING:env.geant4.geometry.collada.collada_to_chroma:setting parent_material to __dd__Materials__Vacuum0xaf1d298 as parent is None for node top.0
pixels (200, 200) launch LaunchSequence worksize 40000 max_blocks 128 threads_per_block 64 launches 5
[0] offset 0 grid (128, 1) block (64, 1, 1)
[1] offset 8192 grid (128, 1) block (64, 1, 1)
[2] offset 16384 grid (128, 1) block (64, 1, 1)
[3] offset 24576 grid (128, 1) block (64, 1, 1)
[4] offset 32768 grid (113, 1) block (64, 1, 1)
pixels (200, 200) launch LaunchSequence worksize 40000 max_blocks 128 threads_per_block 64 launches 5
[0] offset 0 grid (128, 1) block (64, 1, 1)
[1] offset 8192 grid (128, 1) block (64, 1, 1)
[2] offset 16384 grid (128, 1) block (64, 1, 1)
[3] offset 24576 grid (128, 1) block (64, 1, 1)
[4] offset 32768 grid (113, 1) block (64, 1, 1)
(chroma_env)delta:chroma_camera blyth$
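
The offsets and grid sizes in the log above follow from splitting the work into chunks of max_blocks * threads_per_block. A minimal sketch of that arithmetic follows; the function name `launch_sequence` is hypothetical and this is not the actual LaunchSequence class.

```python
def launch_sequence(worksize, max_blocks, threads_per_block):
    """Split worksize items into launches of at most
    max_blocks * threads_per_block items each.

    Returns a list of (offset, count, blocks) tuples."""
    per_launch = max_blocks * threads_per_block
    launches = []
    offset = 0
    while offset < worksize:
        count = min(per_launch, worksize - offset)
        blocks = -(-count // threads_per_block)  # ceiling division
        launches.append((offset, count, blocks))
        offset += count
    return launches


if __name__ == "__main__":
    for i, (offset, count, blocks) in enumerate(launch_sequence(40000, 128, 64)):
        print("[%d] offset %d grid (%d, 1) block (64, 1, 1)" % (i, offset, blocks))
```

For 200x200 pixels this gives 5 launches with a final grid of 113 blocks, matching the log; for 1024x768 with max_blocks 1024 it gives 12 launches of 65536 pixels.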

glumpy frame size¶

Keeping the glumpy frame inset from the figure at the default of (0.9,0.9) leads to pixel offsets and complications: when looking at the timing pattern, a half block shows up. Avoid the issue by keeping the frame at (1,1), so that the pixels fill the figure window.

daeviewgl integration¶

Avoid a wide view and keep the screen size small, as it seems that absolutely everything in the view gets visited:

daeviewgl.py -t 8005 --with-chroma --cuda-profile --near 0.5 --size 640,480

CUDA Device Reset¶

http://stackoverflow.com/questions/7144195/cudadevicereset-for-multiple-gpus
google:cudaDeviceReset driver API equivalent
- https://devtalk.nvidia.com/default/topic/686313/how-to-reset-cuda-error-in-driver-api/
- need to destroy the context to recover, perhaps an atexit context.pop ?
- http://lists.tiker.net/pipermail/pycuda/2011-June/003247.html
- see ~/e/cuda cuda_info.sh cuda_info.py for monitoring memory usage
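
The atexit idea above could look like the following sketch. It is untested against the timeout problem described in these notes; the helper names `register_context_cleanup` and `make_context` are hypothetical, and the approach mirrors what pycuda.autoinit does when it registers a context pop at exit.

```python
import atexit


def register_context_cleanup(context, register=atexit.register):
    """Arrange for context.pop() to be called at interpreter exit.

    The register argument exists only so the hook can be swapped out
    when testing; by default the real atexit.register is used."""
    def cleanup():
        context.pop()
    register(cleanup)
    return cleanup


def make_context(device_index=0):
    """Create a CUDA context explicitly and register its cleanup.

    Requires pycuda and a CUDA device; imported lazily so the helper
    above can be exercised without either."""
    import pycuda.driver as cuda
    cuda.init()
    context = cuda.Device(device_index).make_context()
    register_context_cleanup(context)
    return context
```

Whether popping the context at exit is enough to recover after a launch timeout is an open question here; the linked threads suggest the context has to be destroyed.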

render_pbo kernel launch observations¶

changing alpha depth over the range 3-10 seems to have no significant performance effect
cuda profile log gputimes (in microseconds) and pycuda returned launch times (in seconds) agree quite well
the pattern of launch times repeats, some pixels (directions through the geometry) are almost 1000x more expensive
sometimes chroma has to split nodes, when that happens launch timeouts always follow
such deaths are related to (unsure of causality direction) persistent misbehavior that can last for minutes
- reducing activity on the machine clears it (closing tabs, windows)
- maybe if anything goes wrong the GPU memory is not being properly cleaned up ?

2014-04-07 17:18:30,924 Optimization: Sufficient memory to move triangles onto GPU
2014-04-07 17:18:30,936 Optimization: Sufficient memory to move vertices onto GPU
2014-04-07 17:18:30,937 device usage:
----------
nodes             2.8M  44.7M
total                   44.7M
----------
device total             2.1G
device used              1.7G
device free            408.3M

2014-04-07 17:18:31,023 created PBORenderer
2014-04-07 17:18:31,024 scene init
2014-04-07 17:18:31,037 scene draw
2014-04-07 17:18:31,038 render Launch worksize (1024, 768) total 786432 max_blocks 1024 threads_per_block 64 launches 12 block (64, 1, 1)
2014-04-07 17:18:36,792 nprofile 12 nlaunch 12
Launch worksize (1024, 768) total 786432 max_blocks 1024 threads_per_block 64 launches 12 block (64, 1, 1)
offset          0 count 65536 grid (1024, 1) block (64, 1, 1)  :         0.00681          6301.6        1540.587s 0.5
offset      65536 count 65536 grid (1024, 1) block (64, 1, 1)  :         0.01097         10930.8           5.675s 0.5
offset     131072 count 65536 grid (1024, 1) block (64, 1, 1)  :         0.03746         37415.4           8.493s 0.5
offset     196608 count 65536 grid (1024, 1) block (64, 1, 1)  :         0.36371        363682.8           8.011s 0.5
offset     262144 count 65536 grid (1024, 1) block (64, 1, 1)  :         0.59253        592511.9           8.837s 0.5
offset     327680 count 65536 grid (1024, 1) block (64, 1, 1)  :         0.79366        793652.1           9.258s 0.5
offset     393216 count 65536 grid (1024, 1) block (64, 1, 1)  :         0.97799        977982.1           5.761s 0.5
offset     458752 count 65536 grid (1024, 1) block (64, 1, 1)  :         1.10103       1101015.2           4.811s 0.5
offset     524288 count 65536 grid (1024, 1) block (64, 1, 1)  :         0.59855        598542.8           9.474s 0.5
offset     589824 count 65536 grid (1024, 1) block (64, 1, 1)  :         0.49433        494320.5          12.027s 0.5
offset     655360 count 65536 grid (1024, 1) block (64, 1, 1)  :         0.53871        538697.0           5.174s 0.5
offset     720896 count 65536 grid (1024, 1) block (64, 1, 1)  :         0.21149        211476.8           4.884s 0.5
2014-04-07 17:18:36,810 scene draw
2014-04-07 17:18:36,810 render Launch worksize (1024, 768) total 786432 max_blocks 1024 threads_per_block 64 launches 12 block (64, 1, 1)
2014-04-07 17:18:42,116 nprofile 24 nlaunch 12
Launch worksize (1024, 768) total 786432 max_blocks 1024 threads_per_block 64 launches 12 block (64, 1, 1)
offset          0 count 65536 grid (1024, 1) block (64, 1, 1)  :         0.00294          2747.1          11.575s 0.5
offset      65536 count 65536 grid (1024, 1) block (64, 1, 1)  :         0.00479          4778.2           4.144s 0.5
offset     131072 count 65536 grid (1024, 1) block (64, 1, 1)  :         0.01632         16301.1           8.594s 0.5
offset     196608 count 65536 grid (1024, 1) block (64, 1, 1)  :         0.19747        197449.9           8.445s 0.5
offset     262144 count 65536 grid (1024, 1) block (64, 1, 1)  :         0.45458        454571.2           5.197s 0.5
offset     327680 count 65536 grid (1024, 1) block (64, 1, 1)  :         0.74471        744698.5           3.977s 0.5
offset     393216 count 65536 grid (1024, 1) block (64, 1, 1)  :         0.96353        963522.1           7.007s 0.5
offset     458752 count 65536 grid (1024, 1) block (64, 1, 1)  :         1.08026       1080253.2           8.065s 0.5
offset     524288 count 65536 grid (1024, 1) block (64, 1, 1)  :         0.59160        591585.8           5.246s 0.5
offset     589824 count 65536 grid (1024, 1) block (64, 1, 1)  :         0.48775        487742.6           3.921s 0.5
offset     655360 count 65536 grid (1024, 1) block (64, 1, 1)  :         0.53199        531985.9           3.566s 0.5
offset     720896 count 65536 grid (1024, 1) block (64, 1, 1)  :         0.21126        211253.5           3.725s 0.5
(chroma_env)delta:render_pbo blyth$
(chroma_env)delta:render_pbo blyth$
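
As a cross-check on the first frame in the log above, the per-launch times (first numeric column, taken to be the pycuda launch times in seconds) can be summed and compared with the logged wall-clock interval between "render Launch" and "nprofile". The numbers are copied from the log; only the helper name `summarize` is made up.

```python
# first-frame launch times in seconds, copied from the log above
FIRST_FRAME = [
    0.00681, 0.01097, 0.03746, 0.36371, 0.59253, 0.79366,
    0.97799, 1.10103, 0.59855, 0.49433, 0.53871, 0.21149,
]


def summarize(times):
    """Return (total seconds, slowest/fastest ratio) for a list of launch times."""
    return sum(times), max(times) / min(times)


if __name__ == "__main__":
    total, ratio = summarize(FIRST_FRAME)
    # logged interval: 17:18:31,038 to 17:18:36,792
    wall = 36.792 - 31.038
    print("total %.3f s of %.3f s wall, slowest/fastest launch %.0fx" % (total, wall, ratio))
```

The launches account for almost all of the elapsed frame time, and the spread between launches is already more than two orders of magnitude before looking at individual pixels.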

kernel pixel times figure¶

Change output pixels to reflect the cycles for each pixel:

    // PBO format BGRA as that is preferred by OpenGL

    /*
    pixels[idx] = blue ;
    pixels[idx+1] = green ;
    pixels[idx+2] = red ;
    pixels[idx+3] = a ;
    */

    // cycles spent on this pixel; shift before narrowing so the 64-bit count is not truncated first
    int64_t cycles = clock64() - start ;
    unsigned int stime = (unsigned int)(cycles >> 10) ;

    // if (threadIdx.x == 0 && blockIdx.x == 0) printf("cycles %lld  stime %u \n", (long long)cycles, stime );

    // write the scaled time to all four channels to give a grey tone
    pixels[idx] = stime ;
    pixels[idx+1] = stime ;
    pixels[idx+2] = stime ;
    pixels[idx+3] = stime ;

With an image size of 1024,768, lines of pixels report the same time: across the frame there are 32 spans within which all pixels have the same tone. 1024/32 = 32 pixels per span (is that due to the warp size of 32 ?).

Perhaps a figure of the maximum tricount along the line of pixels would match this.
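
A sketch of that comparison: if consecutive threads map to consecutive pixels along a row (an assumption about the kernel indexing, not confirmed here), a 1024-pixel row holds 32 warp-sized spans, and each span would be expected to show the time of its slowest pixel. The helper `span_max` is hypothetical and works on a plain list of per-pixel counts.

```python
WARP_SIZE = 32  # threads per warp on current NVIDIA GPUs


def span_max(metric_row, span=WARP_SIZE):
    """Maximum of a per-pixel metric over each run of `span` consecutive pixels."""
    return [max(metric_row[i:i + span]) for i in range(0, len(metric_row), span)]


if __name__ == "__main__":
    width = 1024
    print("warp spans per row:", width // WARP_SIZE)  # 64-thread blocks would give width // 64
    # toy row: one expensive pixel makes its whole span expensive
    row = [1] * width
    row[100] = 50
    print(span_max(row)[:5])
```

Plotting span_max of the tri count next to the timing image would show whether the flat-toned spans track the most expensive pixel in each warp.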

tri count metric¶

PMT tri count hotspots are very evident, despite the render only showing the plain outside of the radslabs.

(chroma_env)delta:render_pbo blyth$ ./render_pbo.py --cuda-profile --alpha-depth 10 --kernel render_pbo --size 1024,768 --view B --kernel-flags 2,0

    if( g_flags.x > 0){

        //int64_t metric = clock64() - start ;
        int metric = tri_count ;

        // scale the metric down by the shift given in the first kernel flag
        unsigned int shifted_metric = (unsigned int)(metric >> g_flags.x) ;

        pixels[idx]   = shifted_metric ;
        pixels[idx+1] = shifted_metric ;
        pixels[idx+2] = shifted_metric ;
        pixels[idx+3] = shifted_metric ;

profile shows first launch much more expensive¶

not seeing this anymore ?

1314 method=[ memcpyHtoD ] gputime=[ 1.312 ] cputime=[ 2.064 ]
1315 method=[ fill ] gputime=[ 12.544 ] cputime=[ 13.317 ] occupancy=[ 1.000 ]
1316 method=[ memcpyHtoD ] gputime=[ 1.344 ] cputime=[ 6.096 ]
1317 method=[ memcpyHtoD ] gputime=[ 1.184 ] cputime=[ 2.946 ]
1318 method=[ memcpyHtoD ] gputime=[ 1.344 ] cputime=[ 2.574 ]
1319 method=[ fill ] gputime=[ 12.608 ] cputime=[ 22.530 ] occupancy=[ 1.000 ]
1320 method=[ render_pbo ] gputime=[ 4591701.500 ] cputime=[ 470.986 ] occupancy=[ 0.500 ]
1321 method=[ render_pbo ] gputime=[ 155.456 ] cputime=[ 14.779 ] occupancy=[ 0.500 ]
1322 method=[ render_pbo ] gputime=[ 155.232 ] cputime=[ 5.127 ] occupancy=[ 0.500 ]
1323 method=[ render_pbo ] gputime=[ 156.288 ] cputime=[ 4.489 ] occupancy=[ 0.500 ]
1324 method=[ render_pbo ] gputime=[ 139.744 ] cputime=[ 8.254 ] occupancy=[ 0.500 ]
1325 method=[ fill ] gputime=[ 5.856 ] cputime=[ 25.364 ] occupancy=[ 1.000 ]
1326 method=[ render_pbo ] gputime=[ 4221800.000 ] cputime=[ 6.858 ] occupancy=[ 0.500 ]
1327 method=[ render_pbo ] gputime=[ 158.400 ] cputime=[ 14.441 ] occupancy=[ 0.500 ]
1328 method=[ render_pbo ] gputime=[ 158.528 ] cputime=[ 4.998 ] occupancy=[ 0.500 ]
1329 method=[ render_pbo ] gputime=[ 159.040 ] cputime=[ 5.541 ] occupancy=[ 0.500 ]
1330 method=[ render_pbo ] gputime=[ 140.256 ] cputime=[ 8.515 ] occupancy=[ 0.500 ]
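
Lines in the profile log above can be pulled apart with a small parser to make comparisons with the pycuda launch times easier. A minimal sketch, assuming the `key=[ value ]` layout shown above and gputime in microseconds; the function names are hypothetical.

```python
import re

# matches fields such as: gputime=[ 155.456 ]
FIELD = re.compile(r"(\w+)=\[\s*([^\]]*?)\s*\]")


def parse_profile_line(line):
    """Turn one profile log line into a dict, converting numeric fields to float."""
    record = {}
    for key, value in FIELD.findall(line):
        try:
            record[key] = float(value)
        except ValueError:
            record[key] = value  # e.g. the method name
    return record


def gputime_seconds(record):
    """Convert the gputime field from microseconds to seconds."""
    return record["gputime"] * 1e-6


if __name__ == "__main__":
    line = "1320 method=[ render_pbo ] gputime=[ 4591701.500 ] cputime=[ 470.986 ] occupancy=[ 0.500 ]"
    rec = parse_profile_line(line)
    print(rec["method"], "%.3f s" % gputime_seconds(rec))
```

Applied to the log above, the first render_pbo launch of each frame comes out at several seconds while the rest are a fraction of a millisecond.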

standard chroma cam¶

chroma-cam -F $DAE_NAME
