CUDA Profiling
===============

* http://docs.nvidia.com/cuda/profiler-users-guide/

nvprof
---------

* http://devblogs.nvidia.com/parallelforall/cuda-pro-tip-nvprof-your-handy-universal-gpu-profiler/

nvprof knows how to profile CUDA kernels running on NVIDIA GPUs, no matter what
language they are written in (as long as they are launched using the CUDA
runtime API or driver API).

* http://devblogs.nvidia.com/parallelforall/pro-tip-clean-up-after-yourself-ensure-correct-profiling/

Therefore, you should clean up your application's CUDA objects properly to make
sure that the profiler is able to store all gathered data. This means not only
freeing memory allocated on the GPU, but also resetting the device context.
If your application uses the CUDA Driver API, call cuProfilerStop() on each
context to flush the profiling buffers before destroying the context with
cuCtxDestroy().
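For the runtime API, a minimal sketch of that cleanup (my_kernel here is just a
placeholder) is to free device allocations and reset the device before main
returns, so the context is torn down cleanly and buffered profile records get
flushed::

    #include <cuda_runtime.h>

    __global__ void my_kernel(float *out) { out[threadIdx.x] = 1.0f; }

    int main()
    {
        float *d_out;
        cudaMalloc(&d_out, 256 * sizeof(float));

        my_kernel<<<1, 256>>>(d_out);
        cudaDeviceSynchronize();

        cudaFree(d_out);        // free GPU allocations
        cudaDeviceReset();      // destroy the context so the profiler can flush its buffers
        return 0;
    }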
cuda clock
----------

* http://stackoverflow.com/questions/11217117/equivalent-of-usleep-in-cuda-kernel
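There is no usleep() inside a kernel; the usual workaround discussed in that
thread is a busy-wait on the SM cycle counter. A minimal sketch (sleep_kernel is
a made-up name; the cycle count is derived from clockRate, which
cudaGetDeviceProperties reports in kHz)::

    #include <cuda_runtime.h>

    // Busy-wait for approximately 'cycles' SM clock cycles.
    __global__ void sleep_kernel(long long cycles)
    {
        long long start = clock64();
        while (clock64() - start < cycles)
            ;   // spin; clock64() reads the per-SM cycle counter
    }

    int main()
    {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, 0);

        // prop.clockRate is in kHz, so cycles per microsecond = clockRate / 1000
        long long usec   = 1000;                                   // sleep ~1 ms
        long long cycles = usec * (long long)(prop.clockRate / 1000);

        sleep_kernel<<<1, 1>>>(cycles);
        cudaDeviceSynchronize();
        cudaDeviceReset();
        return 0;
    }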
gpu burn
---------

* http://wili.cc/blog/gpu-burn.html

how many registers
-------------------

::

    Also, how many registers is your kernel using?? (pass --ptxas-options=-v
    argument to nvcc)  If you can only launch 16 threads per block, the GPU
    will be idle most of the time.
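The per-kernel register count comes out of the ptxas verbose report at compile
time. A related knob, not mentioned in the quote above, is __launch_bounds__,
which hints the compiler to keep register use within the intended launch
configuration (scale_kernel and the bounds below are just illustrative)::

    // compile with:  nvcc -arch=sm_30 --ptxas-options=-v kernel.cu
    // ptxas then prints, per kernel, a line like "Used NN registers, ..."

    // Cap the kernel at 256 threads per block and ask for at least 4 resident
    // blocks per multiprocessor, which pushes ptxas to limit register use.
    __global__ void __launch_bounds__(256, 4)
    scale_kernel(float *out, const float *in, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            out[i] = 2.0f * in[i];
    }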
cuda profile
-------------

From a headless simplecamera.py render run::

    1285 method=[ memcpyHtoD ] gputime=[ 1.312 ] cputime=[ 2.020 ]
    1286 method=[ memcpyHtoD ] gputime=[ 1.312 ] cputime=[ 11.104 ]
    1287 method=[ memcpyHtoD ] gputime=[ 1.280 ] cputime=[ 1.972 ]
    1288 method=[ memcpyHtoD ] gputime=[ 1.280 ] cputime=[ 2.006 ]
    1289 method=[ memcpyHtoD ] gputime=[ 1.280 ] cputime=[ 2.006 ]
    1290 method=[ memcpyHtoD ] gputime=[ 1.312 ] cputime=[ 1.996 ]
    1291 method=[ memcpyHtoD ] gputime=[ 1.312 ] cputime=[ 2.012 ]
    1292 method=[ memcpyHtoD ] gputime=[ 1.280 ] cputime=[ 2.022 ]
    1293 method=[ memcpyHtoD ] gputime=[ 1.280 ] cputime=[ 10.942 ]
    1294 method=[ memcpyHtoD ] gputime=[ 1.312 ] cputime=[ 4.039 ]
    1295 method=[ memcpyHtoD ] gputime=[ 1.312 ] cputime=[ 2.034 ]
    1296 method=[ memcpyHtoD ] gputime=[ 1.312 ] cputime=[ 1.891 ]
    1297 method=[ memcpyHtoD ] gputime=[ 1.312 ] cputime=[ 1.912 ]
    1298 method=[ memcpyHtoD ] gputime=[ 1794.976 ] cputime=[ 1993.471 ]
    1299 method=[ memcpyHtoD ] gputime=[ 1617.952 ] cputime=[ 1481.204 ]
    1300 method=[ memcpyHtoD ] gputime=[ 1601.280 ] cputime=[ 1472.250 ]
    1301 method=[ memcpyHtoD ] gputime=[ 7432.672 ] cputime=[ 7370.140 ]
    1302 method=[ memcpyHtoD ] gputime=[ 4602.432 ] cputime=[ 4620.065 ]
    1303 method=[ memcpyHtoD ] gputime=[ 2335.680 ] cputime=[ 2351.582 ]
    1304 method=[ memcpyHtoD ] gputime=[ 1.664 ] cputime=[ 5.372 ]
    1305 method=[ memcpyHtoD ] gputime=[ 1.312 ] cputime=[ 2.315 ]
    1306 method=[ memcpyHtoD ] gputime=[ 1.280 ] cputime=[ 2.037 ]
    1307 method=[ memcpyHtoD ] gputime=[ 1.280 ] cputime=[ 1.973 ]
    1308 method=[ memcpyHtoD ] gputime=[ 1.280 ] cputime=[ 2.185 ]
    1309 method=[ memcpyHtoD ] gputime=[ 1.280 ] cputime=[ 2.113 ]
    1310 method=[ memcpyHtoD ] gputime=[ 1.312 ] cputime=[ 2.008 ]
    1311 method=[ memcpyHtoD ] gputime=[ 1.312 ] cputime=[ 2.010 ]
    1312 method=[ memcpyHtoD ] gputime=[ 1.312 ] cputime=[ 2.372 ]
    1313 method=[ memcpyHtoD ] gputime=[ 1.312 ] cputime=[ 2.009 ]
    1314 method=[ memcpyHtoD ] gputime=[ 1.280 ] cputime=[ 1.959 ]
    1315 method=[ memcpyHtoD ] gputime=[ 612.832 ] cputime=[ 501.086 ]
    1316 method=[ memcpyHtoD ] gputime=[ 590.560 ] cputime=[ 449.675 ]
    1317 method=[ fill ] gputime=[ 24.544 ] cputime=[ 13.470 ] occupancy=[ 1.000 ]
    1318 method=[ fill ] gputime=[ 25.504 ] cputime=[ 7.263 ] occupancy=[ 1.000 ]
    1319 method=[ render ] gputime=[ 5259416.500 ] cputime=[ 234.175 ] occupancy=[ 0.500 ]
    1320 method=[ memcpyDtoH ] gputime=[ 194.016 ] cputime=[ 5260492.000 ]

cuda_profile_parse.py
----------------------

::

    (chroma_env)delta:chroma_camera blyth$ ./cuda_profile_parse.py cuda_profile_0.log
    WARNING:__main__:failed to parse : # CUDA_PROFILE_LOG_VERSION 2.0
    WARNING:__main__:failed to parse : # CUDA_DEVICE 0 GeForce GT 750M
    WARNING:__main__:failed to parse : # CUDA_CONTEXT 1
    WARNING:__main__:failed to parse : method,gputime,cputime,occupancy
    memcpyDtoH : {'gputime': 201.504, 'cputime': 5260556.83}
    write_size : {'gputime': 6.208, 'cputime': 37.704, 'occupancy': 0.048}
    fill : {'gputime': 50.048, 'cputime': 20.733, 'occupancy': 2.0}
    render : {'gputime': 5259416.5, 'cputime': 234.175, 'occupancy': 0.5}
    memcpyHtoD : {'gputime': 22289.11999999997, 'cputime': 23602.95499999999}
    (chroma_env)delta:chroma_camera blyth$

#. memcpyDtoH consumes almost the same 'cputime' as render takes 'gputime', with
   the vast majority of that at the last sample: the render launch is non-blocking,
   so the host blocks in the following DtoH copy until the kernel completes ::

    (chroma_env)delta:chroma_camera blyth$ tail -5 cuda_profile_0.log
    method=[ memcpyHtoD ] gputime=[ 590.560 ] cputime=[ 449.675 ]
    method=[ fill ] gputime=[ 24.544 ] cputime=[ 13.470 ] occupancy=[ 1.000 ]
    method=[ fill ] gputime=[ 25.504 ] cputime=[ 7.263 ] occupancy=[ 1.000 ]
    method=[ render ] gputime=[ 5259416.500 ] cputime=[ 234.175 ] occupancy=[ 0.500 ]
    method=[ memcpyDtoH ] gputime=[ 194.016 ] cputime=[ 5260492.000 ]

method
-------

This is a character string which gives the name of the GPU kernel or memory copy
method. In case of kernels the method name is the mangled name generated by the
compiler.

occupancy
---------

This column gives the multiprocessor occupancy which is the ratio of the number
of active warps to the maximum number of warps supported on a multiprocessor of
the GPU. This is helpful in determining how effectively the GPU is kept busy.
This column is output only for GPU kernels and the column value is a single
precision floating point value in the range 0.0 to 1.0.
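The 0.5 occupancy reported for the render kernel can be compared against a
theoretical figure. This sketch uses the occupancy calculator API from later
toolkits (CUDA 6.5+, so newer than the CUDA 5.5 install referenced elsewhere in
these notes); render_kernel and the block size are placeholders::

    #include <cstdio>
    #include <cuda_runtime.h>

    // Placeholder standing in for the real render kernel.
    __global__ void render_kernel(float *out)
    {
        out[blockIdx.x * blockDim.x + threadIdx.x] = 0.0f;
    }

    int main()
    {
        const int blockSize = 64;       // assumed launch configuration
        int blocksPerSm = 0;
        cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocksPerSm, render_kernel,
                                                      blockSize, 0);

        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, 0);

        // occupancy = active warps / maximum warps per multiprocessor
        float occupancy = (blocksPerSm * blockSize / 32.0f)
                        / (prop.maxThreadsPerMultiProcessor / 32.0f);
        printf("theoretical occupancy at blockSize %d : %.3f\n", blockSize, occupancy);
        return 0;
    }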
cputime
---------

For non-blocking methods the cputime is only the CPU or host side overhead to
launch the method. In this case::

    walltime = cputime + gputime

For blocking methods cputime is the sum of gputime and CPU overhead. In this
case::

    walltime = cputime

Note all kernel launches by default are non-blocking. But if any of the profiler
counters are enabled kernel launches are blocking. Also asynchronous memory copy
requests in different streams are non-blocking.

The column value is a single precision floating point value in microseconds.

gputime
--------

This column gives the execution time for the GPU kernel or memory copy method.
This value is calculated as::

    (gpuendtimestamp - gpustarttimestamp)/1000.0

The column value is a single precision floating point value in microseconds.
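A gputime figure can be sanity-checked outside the profiler with CUDA events,
which also time on the GPU side. A minimal sketch, with my_kernel again just a
stand-in::

    #include <cstdio>
    #include <cuda_runtime.h>

    __global__ void my_kernel(float *out) { out[threadIdx.x] = 0.0f; }

    int main()
    {
        float *d_out;
        cudaMalloc(&d_out, 256 * sizeof(float));

        cudaEvent_t start, stop;
        cudaEventCreate(&start);
        cudaEventCreate(&stop);

        cudaEventRecord(start);
        my_kernel<<<1, 256>>>(d_out);
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);

        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);            // milliseconds
        printf("kernel time: %.3f us\n", ms * 1000.0f);    // compare with the gputime column (microseconds)

        cudaEventDestroy(start);
        cudaEventDestroy(stop);
        cudaFree(d_out);
        cudaDeviceReset();
        return 0;
    }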
config
-------

The command line profiler is controlled using the following environment variables:

COMPUTE_PROFILE: is set to either 1 or 0 (or unset) to enable or disable profiling.

COMPUTE_PROFILE_LOG: is set to the desired file path for profiling output. In
case of multiple contexts you must add '%d' in the COMPUTE_PROFILE_LOG name.
This will generate separate profiler output files for each context - with '%d'
substituted by the context number. Contexts are numbered starting with zero.
In case of multiple processes you must add '%p' in the COMPUTE_PROFILE_LOG name.
This will generate separate profiler output files for each process - with '%p'
substituted by the process id. If there is no log path specified, the profiler
will log data to "cuda_profile_%d.log" in case of a CUDA context ('%d' is
substituted by the context number).

COMPUTE_PROFILE_CSV: is set to either 1 (set) or 0 (unset) to enable or disable
a comma separated version of the log output.

COMPUTE_PROFILE_CONFIG: is used to specify a config file for selecting profiling
options and performance counters. Configuration details are covered in a
subsequent section of the profiler users guide.

The following old environment variables used for the above functionalities are
still supported::

    CUDA_PROFILE
    CUDA_PROFILE_LOG
    CUDA_PROFILE_CSV
    CUDA_PROFILE_CONFIG
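Beyond the environment variables, profiling can be scoped to a region of
interest from the code itself via cuda_profiler_api.h; with nvprof this is
usually paired with --profile-from-start off so only the bracketed region is
collected. A sketch, my_kernel being a placeholder::

    #include <cuda_profiler_api.h>
    #include <cuda_runtime.h>

    __global__ void my_kernel(float *out) { out[threadIdx.x] = 1.0f; }

    int main()
    {
        float *d_out;
        cudaMalloc(&d_out, 256 * sizeof(float));

        // warm-up launch, kept out of the profile
        my_kernel<<<1, 256>>>(d_out);
        cudaDeviceSynchronize();

        cudaProfilerStart();               // begin collecting
        my_kernel<<<1, 256>>>(d_out);
        cudaDeviceSynchronize();
        cudaProfilerStop();                // stop collecting and flush

        cudaFree(d_out);
        cudaDeviceReset();
        return 0;
    }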
metrics
---------

::

    (chroma_env)delta:e blyth$ nvprof --query-metrics
    Available Metrics:
        Name   Description
    Device 0 (GeForce GT 750M):
        l1_cache_global_hit_rate: Hit rate in L1 cache for global loads
        branch_efficiency: Ratio of non-divergent branches to total branches
        l1_cache_local_hit_rate: Hit rate in L1 cache for local loads and stores
        sm_efficiency: The percentage of time at least one warp is active on a multiprocessor
        ipc: Instructions executed per cycle
        achieved_occupancy: Ratio of the average active warps per active cycle to the maximum number of warps supported on a multiprocessor
        gld_requested_throughput: Requested global memory load throughput
        gst_requested_throughput: Requested global memory store throughput
        sm_efficiency_instance: The percentage of time at least one warp is active on a multiprocessor
        ipc_instance: Instructions executed per cycle
        inst_replay_overhead: Average number of replays for each instruction executed
        shared_replay_overhead: Average number of replays due to shared memory conflicts for each instruction executed
        global_replay_overhead: Average number of replays due to local memory cache misses for each instruction executed
        global_cache_replay_overhead: Average number of replays due to global memory cache misses for each instruction executed
        tex_cache_hit_rate: Texture cache hit rate
        tex_cache_throughput: Texture cache throughput
        dram_read_throughput: Device memory read throughput
        dram_write_throughput: Device memory write throughput
        gst_throughput: Global memory store throughput
        gld_throughput: Global memory load throughput
        local_replay_overhead: Average number of replays due to local memory accesses for each instruction executed
        shared_efficiency: Ratio of requested shared memory throughput to required shared memory throughput
        gld_efficiency: Ratio of requested global memory load throughput to required global memory load throughput
        gst_efficiency: Ratio of requested global memory store throughput to required global memory store throughput
        l2_l1_read_hit_rate: Hit rate at L2 cache for all read requests from L1 cache
        l2_texture_read_hit_rate: Hit rate at L2 cache for all read requests from texture cache
        l2_l1_read_throughput: Memory read throughput seen at L2 cache for read requests from L1 cache
        l2_texture_read_throughput: Memory read throughput seen at L2 cache for read requests from the texture cache
        local_memory_overhead: Ratio of local memory traffic to total memory traffic between the L1 and L2 caches
        issued_ipc: Instructions issued per cycle
        inst_per_warp: Average number of instructions executed by each warp
        issue_slot_utilization: Percentage of issue slots that issued at least one instruction, averaged across all cycles
        local_load_transactions_per_request: Average number of local memory load transactions performed for each local memory load
        local_store_transactions_per_request: Average number of local memory store transactions performed for each local memory store
        shared_load_transactions_per_request: Average number of shared memory load transactions performed for each shared memory load
        shared_store_transactions_per_request: Average number of shared memory store transactions performed for each shared memory store
        gld_transactions_per_request: Average number of global memory load transactions performed for each global memory load
        gst_transactions_per_request: Average number of global memory store transactions performed for each global memory store
        local_load_transactions: Number of local memory load transactions
        local_store_transactions: Number of local memory store transactions
        shared_load_transactions: Number of shared memory load transactions
        shared_store_transactions: Number of shared memory store transactions
        gld_transactions: Number of global memory load transactions
        gst_transactions: Number of global memory store transactions
        sysmem_read_transactions: Number of system memory read transactions
        sysmem_write_transactions: Number of system memory write transactions
        tex_cache_transactions: Texture cache read transactions
        dram_read_transactions: Device memory read transactions
        dram_write_transactions: Device memory write transactions
        l2_read_transactions: Memory read transactions seen at L2 cache for all read requests
        l2_write_transactions: Memory write transactions seen at L2 cache for all write requests
        local_load_throughput: Local memory load throughput
        local_store_throughput: Local memory store throughput
        shared_load_throughput: Shared memory load throughput
        shared_store_throughput: Shared memory store throughput
        l2_read_throughput: Memory read throughput seen at L2 cache for all read requests
        l2_write_throughput: Memory write throughput seen at L2 cache for all write requests
        sysmem_read_throughput: System memory read throughput
        sysmem_write_throughput: System memory write throughput
        cf_issued: Number of issued control-flow instructions
        cf_executed: Number of executed control-flow instructions
        ldst_issued: Number of issued load and store instructions
        ldst_executed: Number of executed load and store instructions
        flops_sp: Single-precision floating point operations executed
        flops_sp_add: Single-precision floating point add operations executed
        flops_sp_mul: Single-precision floating point multiply operations executed
        flops_sp_fma: Single-precision floating point multiply accumulate operations executed
        flops_dp: Double-precision floating point operations executed
        flops_dp_add: Double-precision floating point add operations executed
        flops_dp_mul: Double-precision floating point multiply operations executed
        flops_dp_fma: Double-precision floating point multiply accumulate operations executed
        flops_sp_special: Single-precision floating point special operations executed
        l1_shared_utilization: The utilization level of the L1/shared memory relative to peak utilization
        l2_utilization: The utilization level of the L2 cache relative to the peak utilization
        tex_utilization: The utilization level of the texture cache relative to the peak utilization
        dram_utilization: The utilization level of the device memory relative to the peak utilization
        sysmem_utilization: The utilization level of the system memory relative to the peak utilization
        ldst_fu_utilization: The utilization level of the multiprocessor function units that execute load and store instructions
        alu_fu_utilization: The utilization level of the multiprocessor function units that execute integer and floating-point arithmetic instructions
        cf_fu_utilization: The utilization level of the multiprocessor function units that execute control-flow instructions
        tex_fu_utilization: The utilization level of the multiprocessor function units that execute texture instructions
        inst_executed: The number of instructions executed
        inst_issued: The number of instructions issued
        issue_slots: The number of issue slots used
events
------

::

    (chroma_env)delta:e blyth$ which nvprof
    /Developer/NVIDIA/CUDA-5.5/bin/nvprof
    (chroma_env)delta:e blyth$
    (chroma_env)delta:e blyth$ nvprof --query-events
    Available Events:
        Name   Description
    Device 0 (GeForce GT 750M):
      Domain domain_a:
        tex0_cache_sector_queries: Number of texture cache 0 requests. This increments by 1 for each 32-byte access.
        tex1_cache_sector_queries: Number of texture cache 1 requests. This increments by 1 for each 32-byte access.
        tex2_cache_sector_queries: Number of texture cache 2 requests. This increments by 1 for each 32-byte access. Value will be 0 for devices that contain only 2 texture units.
        tex3_cache_sector_queries: Number of texture cache 3 requests. This increments by 1 for each 32-byte access. Value will be 0 for devices that contain only 2 texture units.
        tex0_cache_sector_misses: Number of texture cache 0 misses. This increments by 1 for each 32-byte access.
        tex1_cache_sector_misses: Number of texture cache 1 misses. This increments by 1 for each 32-byte access.
        tex2_cache_sector_misses: Number of texture cache 2 misses. This increments by 1 for each 32-byte access. Value will be 0 for devices that contain only 2 texture units.
        tex3_cache_sector_misses: Number of texture cache 3 misses. This increments by 1 for each 32-byte access. Value will be 0 for devices that contain only 2 texture units.
        elapsed_cycles_sm: Elapsed clocks

      Domain domain_b:
        fb_subp0_read_sectors: Number of DRAM read requests to sub partition 0, increments by 1 for 32 byte access.
        fb_subp1_read_sectors: Number of DRAM read requests to sub partition 1, increments by 1 for 32 byte access.
        fb_subp0_write_sectors: Number of DRAM write requests to sub partition 0, increments by 1 for 32 byte access.
        fb_subp1_write_sectors: Number of DRAM write requests to sub partition 1, increments by 1 for 32 byte access.
        l2_subp0_write_sector_misses: Number of write misses in slice 0 of L2 cache. This increments by 1 for each 32-byte access.
        l2_subp1_write_sector_misses: Number of write misses in slice 1 of L2 cache. This increments by 1 for each 32-byte access.
        l2_subp2_write_sector_misses: Number of write misses in slice 2 of L2 cache. This increments by 1 for each 32-byte access.
        l2_subp3_write_sector_misses: Number of write misses in slice 3 of L2 cache. This increments by 1 for each 32-byte access.
        l2_subp0_read_sector_misses: Number of read misses in slice 0 of L2 cache. This increments by 1 for each 32-byte access.
        l2_subp1_read_sector_misses: Number of read misses in slice 1 of L2 cache. This increments by 1 for each 32-byte access.
        l2_subp2_read_sector_misses: Number of read misses in slice 2 of L2 cache. This increments by 1 for each 32-byte access.
        l2_subp3_read_sector_misses: Number of read misses in slice 3 of L2 cache. This increments by 1 for each 32-byte access.
        l2_subp0_write_l1_sector_queries: Number of write requests from L1 to slice 0 of L2 cache. This increments by 1 for each 32-byte access.
        l2_subp1_write_l1_sector_queries: Number of write requests from L1 to slice 1 of L2 cache. This increments by 1 for each 32-byte access.
        l2_subp2_write_l1_sector_queries: Number of write requests from L1 to slice 2 of L2 cache. This increments by 1 for each 32-byte access.
        l2_subp3_write_l1_sector_queries: Number of write requests from L1 to slice 3 of L2 cache. This increments by 1 for each 32-byte access.
        l2_subp0_read_l1_sector_queries: Number of read requests from L1 to slice 0 of L2 cache. This increments by 1 for each 32-byte access.
        l2_subp1_read_l1_sector_queries: Number of read requests from L1 to slice 1 of L2 cache. This increments by 1 for each 32-byte access.
        l2_subp2_read_l1_sector_queries: Number of read requests from L1 to slice 2 of L2 cache. This increments by 1 for each 32-byte access.
        l2_subp3_read_l1_sector_queries: Number of read requests from L1 to slice 3 of L2 cache. This increments by 1 for each 32-byte access.
        l2_subp0_read_l1_hit_sectors: Number of read requests from L1 that hit in slice 0 of L2 cache. This increments by 1 for each 32-byte access.
        l2_subp1_read_l1_hit_sectors: Number of read requests from L1 that hit in slice 1 of L2 cache. This increments by 1 for each 32-byte access.
        l2_subp2_read_l1_hit_sectors: Number of read requests from L1 that hit in slice 2 of L2 cache. This increments by 1 for each 32-byte access.
        l2_subp3_read_l1_hit_sectors: Number of read requests from L1 that hit in slice 3 of L2 cache. This increments by 1 for each 32-byte access.
        l2_subp0_read_tex_sector_queries: Number of read requests from Texture cache to slice 0 of L2 cache. This increments by 1 for each 32-byte access.
        l2_subp1_read_tex_sector_queries: Number of read requests from Texture cache to slice 1 of L2 cache. This increments by 1 for each 32-byte access.
        l2_subp2_read_tex_sector_queries: Number of read requests from Texture cache to slice 2 of L2 cache. This increments by 1 for each 32-byte access.
        l2_subp3_read_tex_sector_queries: Number of read requests from Texture cache to slice 3 of L2 cache. This increments by 1 for each 32-byte access.
        l2_subp0_read_tex_hit_sectors: Number of read requests from Texture cache that hit in slice 0 of L2 cache. This increments by 1 for each 32-byte access.
        l2_subp1_read_tex_hit_sectors: Number of read requests from Texture cache that hit in slice 1 of L2 cache. This increments by 1 for each 32-byte access.
        l2_subp2_read_tex_hit_sectors: Number of read requests from Texture cache that hit in slice 2 of L2 cache. This increments by 1 for each 32-byte access.
        l2_subp3_read_tex_hit_sectors: Number of read requests from Texture cache that hit in slice 3 of L2 cache. This increments by 1 for each 32-byte access.
        l2_subp0_read_sysmem_sector_queries: Number of system memory read requests to slice 0 of L2 cache. This increments by 1 for each 32-byte access.
        l2_subp1_read_sysmem_sector_queries: Number of system memory read requests to slice 1 of L2 cache. This increments by 1 for each 32-byte access.
        l2_subp2_read_sysmem_sector_queries: Number of system memory read requests to slice 2 of L2 cache. This increments by 1 for each 32-byte access.
        l2_subp3_read_sysmem_sector_queries: Number of system memory read requests to slice 3 of L2 cache. This increments by 1 for each 32-byte access.
        l2_subp0_write_sysmem_sector_queries: Number of system memory write requests to slice 0 of L2 cache. This increments by 1 for each 32-byte access.
        l2_subp1_write_sysmem_sector_queries: Number of system memory write requests to slice 1 of L2 cache. This increments by 1 for each 32-byte access.
        l2_subp2_write_sysmem_sector_queries: Number of system memory write requests to slice 2 of L2 cache. This increments by 1 for each 32-byte access.
        l2_subp3_write_sysmem_sector_queries: Number of system memory write requests to slice 3 of L2 cache. This increments by 1 for each 32-byte access.
        l2_subp0_total_read_sector_queries: Total read requests to slice 0 of L2 cache. This includes requests from L1, Texture cache, system memory. This increments by 1 for each 32-byte access.
        l2_subp1_total_read_sector_queries: Total read requests to slice 1 of L2 cache. This includes requests from L1, Texture cache, system memory. This increments by 1 for each 32-byte access.
        l2_subp2_total_read_sector_queries: Total read requests to slice 2 of L2 cache. This includes requests from L1, Texture cache, system memory. This increments by 1 for each 32-byte access.
        l2_subp3_total_read_sector_queries: Total read requests to slice 3 of L2 cache. This includes requests from L1, Texture cache, system memory. This increments by 1 for each 32-byte access.
        l2_subp0_total_write_sector_queries: Total write requests to slice 0 of L2 cache. This includes requests from L1, Texture cache, system memory. This increments by 1 for each 32-byte access.
        l2_subp1_total_write_sector_queries: Total write requests to slice 1 of L2 cache. This includes requests from L1, Texture cache, system memory. This increments by 1 for each 32-byte access.
        l2_subp2_total_write_sector_queries: Total write requests to slice 2 of L2 cache. This includes requests from L1, Texture cache, system memory. This increments by 1 for each 32-byte access.
        l2_subp3_total_write_sector_queries: Total write requests to slice 3 of L2 cache. This includes requests from L1, Texture cache, system memory. This increments by 1 for each 32-byte access.

      Domain domain_c:
        gld_inst_8bit: Total number of 8-bit global load instructions that are executed by all the threads across all thread blocks.
        gld_inst_16bit: Total number of 16-bit global load instructions that are executed by all the threads across all thread blocks.
        gld_inst_32bit: Total number of 32-bit global load instructions that are executed by all the threads across all thread blocks.
        gld_inst_64bit: Total number of 64-bit global load instructions that are executed by all the threads across all thread blocks.
        gld_inst_128bit: Total number of 128-bit global load instructions that are executed by all the threads across all thread blocks.
        gst_inst_8bit: Total number of 8-bit global store instructions that are executed by all the threads across all thread blocks.
        gst_inst_16bit: Total number of 16-bit global store instructions that are executed by all the threads across all thread blocks.
        gst_inst_32bit: Total number of 32-bit global store instructions that are executed by all the threads across all thread blocks.
        gst_inst_64bit: Total number of 64-bit global store instructions that are executed by all the threads across all thread blocks.
        gst_inst_128bit: Total number of 128-bit global store instructions that are executed by all the threads across all thread blocks.

      Domain domain_d:
        prof_trigger_00: User profiled generic trigger that can be inserted in any place of the code to collect the related information. Increments per warp.
        prof_trigger_01: User profiled generic trigger that can be inserted in any place of the code to collect the related information. Increments per warp.
        prof_trigger_02: User profiled generic trigger that can be inserted in any place of the code to collect the related information. Increments per warp.
        prof_trigger_03: User profiled generic trigger that can be inserted in any place of the code to collect the related information. Increments per warp.
        prof_trigger_04: User profiled generic trigger that can be inserted in any place of the code to collect the related information. Increments per warp.
        prof_trigger_05: User profiled generic trigger that can be inserted in any place of the code to collect the related information. Increments per warp.
        prof_trigger_06: User profiled generic trigger that can be inserted in any place of the code to collect the related information. Increments per warp.
        prof_trigger_07: User profiled generic trigger that can be inserted in any place of the code to collect the related information. Increments per warp.
        warps_launched: Number of warps launched on a multiprocessor.
        threads_launched: Number of threads launched on a multiprocessor.
        inst_issued1: Number of single instruction issued per cycle
        inst_issued2: Number of dual instructions issued per cycle
        inst_executed: Number of instructions executed, do not include replays.
        shared_load: Number of executed load instructions where state space is specified as shared, increments per warp on a multiprocessor.
        shared_store: Number of executed store instructions where state space is specified as shared, increments per warp on a multiprocessor.
        local_load: Number of executed load instructions where state space is specified as local, increments per warp on a multiprocessor.
        local_store: Number of executed store instructions where state space is specified as local, increments per warp on a multiprocessor.
        gld_request: Number of executed load instructions where the state space is not specified and hence generic addressing is used, increments per warp on a multiprocessor. It can include the load operations from global,local and shared state space.
        gst_request: Number of executed store instructions where the state space is not specified and hence generic addressing is used, increments per warp on a multiprocessor. It can include the store operations to global,local and shared state space.
        atom_count: Number of warps executing atomic reduction operations. Increments by one if at least one thread in a warp executes the instruction.
        gred_count: Number of warps executing reduction operations on global and shared memory. Increments by one if at least one thread in a warp executes the instruction.
        branch: Number of branch instructions executed per warp on a multiprocessor.
        divergent_branch: Number of divergent branches within a warp. This counter will be incremented by one if at least one thread in a warp diverges (that is, follows a different execution path) via a conditional branch.
        active_cycles: Number of cycles a multiprocessor has at least one active warp. This event can increment by 0 - 1 on each cycle.
        active_warps: Accumulated number of active warps per cycle. For every cycle it increments by the number of active warps in the cycle which can be in the range 0 to 64.
        sm_cta_launched: Number of thread blocks launched on a multiprocessor.
        local_load_transactions: Number of local load transactions from L1 cache. Increments by 1 per transaction. Transaction can be 32/64/96/128B.
        local_store_transactions: Number of local store transactions to L1 cache. Increments by 1 per transaction. Transaction can be 32/64/96/128B.
        l1_shared_load_transactions: Number of shared load transactions. Increments by 1 per transaction. Transaction can be 32/64/96/128B.
        l1_shared_store_transactions: Number of shared store transactions. Increments by 1 per transaction. Transaction can be 32/64/96/128B.
        __l1_global_load_transactions: Number of global load transactions from L1 cache. Increments by 1 per transaction. Transaction can be 32/64/96/128B.
        __l1_global_store_transactions: Number of global store transactions from L1 cache. Increments by 1 per transaction. Transaction can be 32/64/96/128B.
        l1_local_load_hit: Number of cache lines that hit in L1 cache for local memory load accesses. In case of perfect coalescing this increments by 1, 2, and 4 for 32, 64 and 128 bit accesses by a warp respectively.
        l1_local_load_miss: Number of cache lines that miss in L1 cache for local memory load accesses. In case of perfect coalescing this increments by 1, 2, and 4 for 32, 64 and 128 bit accesses by a warp respectively.
        l1_local_store_hit: Number of cache lines that hit in L1 cache for local memory store accesses. In case of perfect coalescing this increments by 1, 2, and 4 for 32, 64 and 128 bit accesses by a warp respectively.
        l1_local_store_miss: Number of cache lines that miss in L1 cache for local memory store accesses. In case of perfect coalescing this increments by 1, 2, and 4 for 32, 64 and 128 bit accesses by a warp respectively.
        l1_global_load_hit: Number of cache lines that hit in L1 cache for global memory load accesses. In case of perfect coalescing this increments by 1, 2, and 4 for 32, 64 and 128 bit accesses by a warp respectively.
        l1_global_load_miss: Number of cache lines that miss in L1 cache for global memory load accesses. In case of perfect coalescing this increments by 1, 2, and 4 for 32, 64 and 128 bit accesses by a warp respectively.
        uncached_global_load_transaction: Number of uncached global load transactions. Increments by 1 per transaction. Transaction can be 32/64/96/128B.
        global_store_transaction: Number of global store transactions. Increments by 1 per transaction. Transaction can be 32/64/96/128B.
        shared_load_replay: Replays caused due to shared load bank conflict (when the addresses for two or more shared memory load requests fall in the same memory bank) or when there is no conflict but the total number of words accessed by all threads in the warp executing that instruction exceed the number of words that can be loaded in one cycle (256 bytes).
        shared_store_replay: Replays caused due to shared store bank conflict (when the addresses for two or more shared memory store requests fall in the same memory bank) or when there is no conflict but the total number of words accessed by all threads in the warp executing that instruction exceed the number of words that can be stored in one cycle.
        global_ld_mem_divergence_replays: global ld is replayed due to divergence
        global_st_mem_divergence_replays: global st is replayed due to divergence
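The prof_trigger_* events in domain_d are driven from device code with the
__prof_trigger() intrinsic. A sketch of counting how often a made-up condition
fires inside a kernel, collected with something like
``nvprof --events prof_trigger_00``::

    // Bump the prof_trigger_00 hardware counter whenever the input is negative.
    // The counter increments once per warp that executes the call.
    __global__ void abs_kernel(const float *in, float *out, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n) return;

        if (in[i] < 0.0f)
            __prof_trigger(0);      // shows up as the prof_trigger_00 event

        out[i] = fabsf(in[i]);
    }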