Render PBO
===========

next
-----

#. is a near/far restricted raycast possible ? 

#. add switch target feature to daeviewgl : to enable easily doing raycasts wherever

#. avoid duplicated vertices/faces between chroma/cuda and opengl 

#. as threads get held back by the slowest within the block (or warp?) maybe better 
   to arrange blocks to cover rectangles of pixels rather than one pixel wide strips ?

   * slowest is the one that had to read the most nodes/triangles from memory ?


called twice 
-------------

Due to glut/glumpy window resize ?, the render gets called twice. 

For a successful small run::

    (chroma_env)delta:chroma_camera blyth$ python render_pbo.py 
    WARNING:env.geant4.geometry.collada.collada_to_chroma:setting parent_material to __dd__Materials__Vacuum0xaf1d298 as parent is None for node top.0 
    pixels (200, 200) launch LaunchSequence worksize 40000 max_blocks 128 threads_per_block 64 launches 5  
    [0] offset 0 grid (128, 1) block (64, 1, 1) 
    [1] offset 8192 grid (128, 1) block (64, 1, 1) 
    [2] offset 16384 grid (128, 1) block (64, 1, 1) 
    [3] offset 24576 grid (128, 1) block (64, 1, 1) 
    [4] offset 32768 grid (113, 1) block (64, 1, 1) 
    pixels (200, 200) launch LaunchSequence worksize 40000 max_blocks 128 threads_per_block 64 launches 5  
    [0] offset 0 grid (128, 1) block (64, 1, 1) 
    [1] offset 8192 grid (128, 1) block (64, 1, 1) 
    [2] offset 16384 grid (128, 1) block (64, 1, 1) 
    [3] offset 24576 grid (128, 1) block (64, 1, 1) 
    [4] offset 32768 grid (113, 1) block (64, 1, 1) 
    (chroma_env)delta:chroma_camera blyth$ 


glumpy frame size
--------------------

Keeping glumpy frame inset from figure at default of (0.9,0.9) leads to 
pixel offsets/complications.  When looking at timing pattern see half a block.
Avoid the issue and keep the frame at (1,1), so the pixels fill the figure window. 


daeviewgl integration
----------------------

Avoid wide view, and keep screen size small. As absolutely everything in the view gets visited it seems.::

    daeviewgl.py -t 8005 --with-chroma --cuda-profile --near 0.5 --size 640,480


CUDA Device Reset
-------------------

* http://stackoverflow.com/questions/7144195/cudadevicereset-for-multiple-gpus
* :google:`cudaDeviceReset driver API equivalent`

  * https://devtalk.nvidia.com/default/topic/686313/how-to-reset-cuda-error-in-driver-api/
  * need to destroy the context to recover, perhaps an atexit context.pop ?

  * http://lists.tiker.net/pipermail/pycuda/2011-June/003247.html

  * see `~/e/cuda` `cuda_info.sh` `cuda_info.py` for monitoring memory usage 


render_pbo kernel launch observations
---------------------------------------

#. changing alpha depth 3-10 seems not to have significant performance effect
#. cuda profile log gputimes in milliseconds and pycuda returned launch times in seconds agree quite well
#. pattern of launch times repeats, some pixels(directions through the geometry) are almost 1000x more expensive 
#. sometimes chroma has to split nodes, when that happens always get launch timeouts

#. such deaths related to (unsure of causality direction) persistent mis-behavior that can last for minutes

   * reducing activity on the machine clears it (closing tabs, windows)
   * maybe if anything goes wrong the GPU memory is not being properly cleaned up ? 


::

    2014-04-07 17:18:30,924 Optimization: Sufficient memory to move triangles onto GPU
    2014-04-07 17:18:30,936 Optimization: Sufficient memory to move vertices onto GPU
    2014-04-07 17:18:30,937 device usage:
    ----------
    nodes             2.8M  44.7M
    total                   44.7M
    ----------
    device total             2.1G
    device used              1.7G
    device free            408.3M

    2014-04-07 17:18:31,023 created PBORenderer 
    2014-04-07 17:18:31,024 scene init
    2014-04-07 17:18:31,037 scene draw
    2014-04-07 17:18:31,038 render Launch worksize (1024, 768) total 786432 max_blocks 1024 threads_per_block 64 launches 12 block (64, 1, 1)  
    2014-04-07 17:18:36,792 nprofile 12 nlaunch 12
    Launch worksize (1024, 768) total 786432 max_blocks 1024 threads_per_block 64 launches 12 block (64, 1, 1) 
    offset          0 count 65536 grid (1024, 1) block (64, 1, 1)  :         0.00681          6301.6        1540.587s 0.5 
    offset      65536 count 65536 grid (1024, 1) block (64, 1, 1)  :         0.01097         10930.8           5.675s 0.5 
    offset     131072 count 65536 grid (1024, 1) block (64, 1, 1)  :         0.03746         37415.4           8.493s 0.5 
    offset     196608 count 65536 grid (1024, 1) block (64, 1, 1)  :         0.36371        363682.8           8.011s 0.5 
    offset     262144 count 65536 grid (1024, 1) block (64, 1, 1)  :         0.59253        592511.9           8.837s 0.5 
    offset     327680 count 65536 grid (1024, 1) block (64, 1, 1)  :         0.79366        793652.1           9.258s 0.5 
    offset     393216 count 65536 grid (1024, 1) block (64, 1, 1)  :         0.97799        977982.1           5.761s 0.5 
    offset     458752 count 65536 grid (1024, 1) block (64, 1, 1)  :         1.10103       1101015.2           4.811s 0.5 
    offset     524288 count 65536 grid (1024, 1) block (64, 1, 1)  :         0.59855        598542.8           9.474s 0.5 
    offset     589824 count 65536 grid (1024, 1) block (64, 1, 1)  :         0.49433        494320.5          12.027s 0.5 
    offset     655360 count 65536 grid (1024, 1) block (64, 1, 1)  :         0.53871        538697.0           5.174s 0.5 
    offset     720896 count 65536 grid (1024, 1) block (64, 1, 1)  :         0.21149        211476.8           4.884s 0.5 
    2014-04-07 17:18:36,810 scene draw
    2014-04-07 17:18:36,810 render Launch worksize (1024, 768) total 786432 max_blocks 1024 threads_per_block 64 launches 12 block (64, 1, 1)  
    2014-04-07 17:18:42,116 nprofile 24 nlaunch 12
    Launch worksize (1024, 768) total 786432 max_blocks 1024 threads_per_block 64 launches 12 block (64, 1, 1) 
    offset          0 count 65536 grid (1024, 1) block (64, 1, 1)  :         0.00294          2747.1          11.575s 0.5 
    offset      65536 count 65536 grid (1024, 1) block (64, 1, 1)  :         0.00479          4778.2           4.144s 0.5 
    offset     131072 count 65536 grid (1024, 1) block (64, 1, 1)  :         0.01632         16301.1           8.594s 0.5 
    offset     196608 count 65536 grid (1024, 1) block (64, 1, 1)  :         0.19747        197449.9           8.445s 0.5 
    offset     262144 count 65536 grid (1024, 1) block (64, 1, 1)  :         0.45458        454571.2           5.197s 0.5 
    offset     327680 count 65536 grid (1024, 1) block (64, 1, 1)  :         0.74471        744698.5           3.977s 0.5 
    offset     393216 count 65536 grid (1024, 1) block (64, 1, 1)  :         0.96353        963522.1           7.007s 0.5 
    offset     458752 count 65536 grid (1024, 1) block (64, 1, 1)  :         1.08026       1080253.2           8.065s 0.5 
    offset     524288 count 65536 grid (1024, 1) block (64, 1, 1)  :         0.59160        591585.8           5.246s 0.5 
    offset     589824 count 65536 grid (1024, 1) block (64, 1, 1)  :         0.48775        487742.6           3.921s 0.5 
    offset     655360 count 65536 grid (1024, 1) block (64, 1, 1)  :         0.53199        531985.9           3.566s 0.5 
    offset     720896 count 65536 grid (1024, 1) block (64, 1, 1)  :         0.21126        211253.5           3.725s 0.5 
    (chroma_env)delta:render_pbo blyth$ 
    (chroma_env)delta:render_pbo blyth$ 


kernel pixel times figure
----------------------------

Change output pixels to reflect the cycles for each pixel::

    289     // PBO format BGRA as that is preferred by OpenGL
    290 
    291     /*
    292     pixels[idx] = blue ;
    293     pixels[idx+1] = green ;
    294     pixels[idx+2] = red ;
    295     pixels[idx+3] = a ;
    296     */
    297 
    298     int64_t cycles = clock64() - start ;
    299     unsigned int stime = (int) cycles >> 10 ;
    300 
    301    // if (threadIdx.x == 0 && blockIdx.x == 0) printf("cycles %d  stime %d \n", cycles, stime );
    302    
    303     pixels[idx] = stime ;
    304     pixels[idx+1] = stime ;
    305     pixels[idx+2] = stime ;
    306     pixels[idx+3] = stime ;


With image size of 1024,768 see that lines of pixels report the same time, across frame see 32 blocks
where all pixels have the same tone. 1024/32 = 32 (is that due to warp size 32 ?). 

Perhaps a figure of the maximum tricount along the line of pixels would match this.

    
tri count metric
------------------

PMT tri count hotspots very evident, despite the render only showing the plain outside of the radslabs.


::

   (chroma_env)delta:render_pbo blyth$ ./render_pbo.py --cuda-profile --alpha-depth 10 --kernel render_pbo --size 1024,768 --view B --kernel-flags 2,0 


::

    256     if( g_flags.x > 0){
    257 
    258         //int64_t metric = clock64() - start ;
    259         int metric = tri_count ;
    260         
    261         unsigned int shifted_metric = (int) metric >> g_flags.x ;
    262         
    263         pixels[idx]   = shifted_metric ;
    264         pixels[idx+1] = shifted_metric ;
    265         pixels[idx+2] = shifted_metric ;
    266         pixels[idx+3] = shifted_metric ;


profile shows first launch much more expensive
-------------------------------------------------

#. not seeing this anymore ?

::

    1314 method=[ memcpyHtoD ] gputime=[ 1.312 ] cputime=[ 2.064 ]
    1315 method=[ fill ] gputime=[ 12.544 ] cputime=[ 13.317 ] occupancy=[ 1.000 ]
    1316 method=[ memcpyHtoD ] gputime=[ 1.344 ] cputime=[ 6.096 ]
    1317 method=[ memcpyHtoD ] gputime=[ 1.184 ] cputime=[ 2.946 ]
    1318 method=[ memcpyHtoD ] gputime=[ 1.344 ] cputime=[ 2.574 ]
    1319 method=[ fill ] gputime=[ 12.608 ] cputime=[ 22.530 ] occupancy=[ 1.000 ]
    1320 method=[ render_pbo ] gputime=[ 4591701.500 ] cputime=[ 470.986 ] occupancy=[ 0.500 ]
    1321 method=[ render_pbo ] gputime=[ 155.456 ] cputime=[ 14.779 ] occupancy=[ 0.500 ]
    1322 method=[ render_pbo ] gputime=[ 155.232 ] cputime=[ 5.127 ] occupancy=[ 0.500 ]
    1323 method=[ render_pbo ] gputime=[ 156.288 ] cputime=[ 4.489 ] occupancy=[ 0.500 ]
    1324 method=[ render_pbo ] gputime=[ 139.744 ] cputime=[ 8.254 ] occupancy=[ 0.500 ]
    1325 method=[ fill ] gputime=[ 5.856 ] cputime=[ 25.364 ] occupancy=[ 1.000 ]
    1326 method=[ render_pbo ] gputime=[ 4221800.000 ] cputime=[ 6.858 ] occupancy=[ 0.500 ]
    1327 method=[ render_pbo ] gputime=[ 158.400 ] cputime=[ 14.441 ] occupancy=[ 0.500 ]
    1328 method=[ render_pbo ] gputime=[ 158.528 ] cputime=[ 4.998 ] occupancy=[ 0.500 ]
    1329 method=[ render_pbo ] gputime=[ 159.040 ] cputime=[ 5.541 ] occupancy=[ 0.500 ]
    1330 method=[ render_pbo ] gputime=[ 140.256 ] cputime=[ 8.515 ] occupancy=[ 0.500 ]


standard chroma cam
---------------------

::

    chroma-cam -F $DAE_NAME