
CUDA Profiling

nvprof

how many registers

Also, how many registers is your kernel using? (Pass --ptxas-options=-v to nvcc
to find out.) If you can only launch 16 threads per block, the GPU will be
idle most of the time.
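
A minimal sketch of checking this (the source file name is a placeholder; -arch=sm_30 matches the GT 750M used below, which is compute capability 3.0):

# ask ptxas to report per-kernel register and shared/constant memory usage
nvcc --ptxas-options=-v -arch=sm_30 -c kernel.cu -o kernel.o
# expect output roughly like: "ptxas info : Used 63 registers, 348 bytes cmem[0]"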

cuda profile

From a headless simplecamera.py render run:

method=[ memcpyHtoD ] gputime=[ 1.312 ] cputime=[ 2.020 ]
method=[ memcpyHtoD ] gputime=[ 1.312 ] cputime=[ 11.104 ]
method=[ memcpyHtoD ] gputime=[ 1.280 ] cputime=[ 1.972 ]
method=[ memcpyHtoD ] gputime=[ 1.280 ] cputime=[ 2.006 ]
method=[ memcpyHtoD ] gputime=[ 1.280 ] cputime=[ 2.006 ]
method=[ memcpyHtoD ] gputime=[ 1.312 ] cputime=[ 1.996 ]
method=[ memcpyHtoD ] gputime=[ 1.312 ] cputime=[ 2.012 ]
method=[ memcpyHtoD ] gputime=[ 1.280 ] cputime=[ 2.022 ]
method=[ memcpyHtoD ] gputime=[ 1.280 ] cputime=[ 10.942 ]
method=[ memcpyHtoD ] gputime=[ 1.312 ] cputime=[ 4.039 ]
method=[ memcpyHtoD ] gputime=[ 1.312 ] cputime=[ 2.034 ]
method=[ memcpyHtoD ] gputime=[ 1.312 ] cputime=[ 1.891 ]
method=[ memcpyHtoD ] gputime=[ 1.312 ] cputime=[ 1.912 ]
method=[ memcpyHtoD ] gputime=[ 1794.976 ] cputime=[ 1993.471 ]
method=[ memcpyHtoD ] gputime=[ 1617.952 ] cputime=[ 1481.204 ]
method=[ memcpyHtoD ] gputime=[ 1601.280 ] cputime=[ 1472.250 ]
method=[ memcpyHtoD ] gputime=[ 7432.672 ] cputime=[ 7370.140 ]
method=[ memcpyHtoD ] gputime=[ 4602.432 ] cputime=[ 4620.065 ]
method=[ memcpyHtoD ] gputime=[ 2335.680 ] cputime=[ 2351.582 ]
method=[ memcpyHtoD ] gputime=[ 1.664 ] cputime=[ 5.372 ]
method=[ memcpyHtoD ] gputime=[ 1.312 ] cputime=[ 2.315 ]
method=[ memcpyHtoD ] gputime=[ 1.280 ] cputime=[ 2.037 ]
method=[ memcpyHtoD ] gputime=[ 1.280 ] cputime=[ 1.973 ]
method=[ memcpyHtoD ] gputime=[ 1.280 ] cputime=[ 2.185 ]
method=[ memcpyHtoD ] gputime=[ 1.280 ] cputime=[ 2.113 ]
method=[ memcpyHtoD ] gputime=[ 1.312 ] cputime=[ 2.008 ]
method=[ memcpyHtoD ] gputime=[ 1.312 ] cputime=[ 2.010 ]
method=[ memcpyHtoD ] gputime=[ 1.312 ] cputime=[ 2.372 ]
method=[ memcpyHtoD ] gputime=[ 1.312 ] cputime=[ 2.009 ]
method=[ memcpyHtoD ] gputime=[ 1.280 ] cputime=[ 1.959 ]
method=[ memcpyHtoD ] gputime=[ 612.832 ] cputime=[ 501.086 ]
method=[ memcpyHtoD ] gputime=[ 590.560 ] cputime=[ 449.675 ]
method=[ fill ] gputime=[ 24.544 ] cputime=[ 13.470 ] occupancy=[ 1.000 ]
method=[ fill ] gputime=[ 25.504 ] cputime=[ 7.263 ] occupancy=[ 1.000 ]
method=[ render ] gputime=[ 5259416.500 ] cputime=[ 234.175 ] occupancy=[ 0.500 ]
method=[ memcpyDtoH ] gputime=[ 194.016 ] cputime=[ 5260492.000 ]

cuda_profile_parse.py

(chroma_env)delta:chroma_camera blyth$ ./cuda_profile_parse.py cuda_profile_0.log
WARNING:__main__:failed to parse : # CUDA_PROFILE_LOG_VERSION 2.0
WARNING:__main__:failed to parse : # CUDA_DEVICE 0 GeForce GT 750M
WARNING:__main__:failed to parse : # CUDA_CONTEXT 1
WARNING:__main__:failed to parse : method,gputime,cputime,occupancy

memcpyDtoH           : {'gputime': 201.504, 'cputime': 5260556.83}
write_size           : {'gputime': 6.208, 'cputime': 37.704, 'occupancy': 0.048}
fill                 : {'gputime': 50.048, 'cputime': 20.733, 'occupancy': 2.0}
render               : {'gputime': 5259416.5, 'cputime': 234.175, 'occupancy': 0.5}
memcpyHtoD           : {'gputime':   22289.11999999997, 'cputime': 23602.95499999999}
(chroma_env)delta:chroma_camera blyth$
  1. memcpyDtoH consumes roughly the same cputime as render takes gputime, with the vast majority of that coming from the last sample:
(chroma_env)delta:chroma_camera blyth$ tail -5 cuda_profile_0.log
method=[ memcpyHtoD ] gputime=[ 590.560 ] cputime=[ 449.675 ]
method=[ fill ] gputime=[ 24.544 ] cputime=[ 13.470 ] occupancy=[ 1.000 ]
method=[ fill ] gputime=[ 25.504 ] cputime=[ 7.263 ] occupancy=[ 1.000 ]
method=[ render ] gputime=[ 5259416.500 ] cputime=[ 234.175 ] occupancy=[ 0.500 ]
method=[ memcpyDtoH ] gputime=[ 194.016 ] cputime=[ 5260492.000 ]
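
For a quick check without the script, a minimal shell sketch in the same spirit (not the script itself; it assumes the default non-CSV log format shown above) that totals gputime per method:

# split each log line on the square brackets: $2 is the method name, $4 the gputime in microseconds
awk -F'[][]' '/method=/ { gsub(/ /, "", $2); gpu[$2] += $4 }
              END       { for (m in gpu) printf "%-12s %14.3f us\n", m, gpu[m] }' cuda_profile_0.log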

method

This is a character string giving the name of the GPU kernel or memory copy method. In the case of kernels, the method name is the mangled name generated by the compiler.

occupancy

This column gives the multiprocessor occupancy, which is the ratio of the number of active warps to the maximum number of warps supported on a multiprocessor of the GPU. It is helpful in determining how effectively the GPU is kept busy. This column is output only for GPU kernels, and the value is a single precision floating point number in the range 0.0 to 1.0.

cputime

For non-blocking methods, cputime is only the CPU (host side) overhead to launch the method. In this case:

walltime = cputime + gputime

For blocking methods, cputime is the sum of gputime and the CPU overhead. In this case:

walltime = cputime

Note that all kernel launches are non-blocking by default, but if any of the profiler counters are enabled, kernel launches become blocking. Asynchronous memory copy requests in different streams are also non-blocking.

The column value is a single precision floating point value in microseconds.
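
Applied to the render run above: the render kernel launch is non-blocking, so its cputime is just the launch overhead (234.175 us) while its gputime is the full execution time (5259416.5 us, about 5.3 s); the memcpyDtoH that follows blocks until the kernel finishes, so its cputime (5260492 us) absorbs essentially the whole kernel gputime plus the copy itself. That is why the copy appears to cost over five seconds of CPU time in the parsed summary.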

gputime

This column gives the execution time for the GPU kernel or memory copy method. This value is calculated as (gpuendtimestamp - gpustarttimestamp)/1000.0. The column value is a single precision floating point value in microseconds.

config

The command line profiler is controlled using the following environment variables:

COMPUTE_PROFILE: is set to either 1 or 0 (or unset) to enable or disable profiling.

COMPUTE_PROFILE_LOG: is set to the desired file path for profiling output. In case of multiple contexts you must add ‘%d’ in the COMPUTE_PROFILE_LOG name. This will generate separate profiler output files for each context - with ‘%d’ substituted by the context number. Contexts are numbered starting with zero. In case of multiple processes you must add ‘%p’ in the COMPUTE_PROFILE_LOG name. This will generate separate profiler output files for each process - with ‘%p’ substituted by the process id. If there is no log path specified, the profiler will log data to “cuda_profile_%d.log” in case of a CUDA context (‘%d’ is substituted by the context number).

COMPUTE_PROFILE_CSV: is set to either 1 (set) or 0 (unset) to enable or disable a comma separated version of the log output.

COMPUTE_PROFILE_CONFIG: is used to specify a config file for selecting profiling options and performance counters.

Configuration details are covered in a subsequent section.
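
A minimal sketch of a profiled run using these variables (the invocation of simplecamera.py here is a stand-in for however the render is normally launched):

# enable the command line profiler for a single run
export COMPUTE_PROFILE=1
export COMPUTE_PROFILE_LOG=cuda_profile_%d.log   # %d is replaced by the context number
export COMPUTE_PROFILE_CSV=0                     # keep the default non-CSV log format
python simplecamera.py                           # headless render, writes cuda_profile_0.log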

The following old environment variables used for the above functionalities are still supported:

CUDA_PROFILE

CUDA_PROFILE_LOG

CUDA_PROFILE_CSV

CUDA_PROFILE_CONFIG

metrics

(chroma_env)delta:e blyth$ nvprof --query-metrics
Available Metrics:
                            Name   Description
Device 0 (GeForce GT 750M):
        l1_cache_global_hit_rate:  Hit rate in L1 cache for global loads
               branch_efficiency:  Ratio of non-divergent branches to total branches
         l1_cache_local_hit_rate:  Hit rate in L1 cache for local loads and stores
                   sm_efficiency:  The percentage of time at least one warp is active on a multiprocessor
                             ipc:  Instructions executed per cycle
              achieved_occupancy:  Ratio of the average active warps per active cycle to the maximum number of warps supported on a multiprocessor
        gld_requested_throughput:  Requested global memory load throughput
        gst_requested_throughput:  Requested global memory store throughput
          sm_efficiency_instance:  The percentage of time at least one warp is active on a multiprocessor
                    ipc_instance:  Instructions executed per cycle
            inst_replay_overhead:  Average number of replays for each instruction executed
          shared_replay_overhead:  Average number of replays due to shared memory conflicts for each instruction executed
          global_replay_overhead:  Average number of replays due to local memory cache misses for each instruction executed
    global_cache_replay_overhead:  Average number of replays due to global memory cache misses for each instruction executed
              tex_cache_hit_rate:  Texture cache hit rate
            tex_cache_throughput:  Texture cache throughput
            dram_read_throughput:  Device memory read throughput
           dram_write_throughput:  Device memory write throughput
                  gst_throughput:  Global memory store throughput
                  gld_throughput:  Global memory load throughput
           local_replay_overhead:  Average number of replays due to local memory accesses for each instruction executed
               shared_efficiency:  Ratio of requested shared memory throughput to required shared memory throughput
                  gld_efficiency:  Ratio of requested global memory load throughput to required global memory load throughput
                  gst_efficiency:  Ratio of requested global memory store throughput to required global memory store throughput
             l2_l1_read_hit_rate:  Hit rate at L2 cache for all read requests from L1 cache
        l2_texture_read_hit_rate:  Hit rate at L2 cache for all read requests from texture cache
           l2_l1_read_throughput:  Memory read throughput seen at L2 cache for read requests from L1 cache
      l2_texture_read_throughput:  Memory read throughput seen at L2 cache for read requests from the texture cache
           local_memory_overhead:  Ratio of local memory traffic to total memory traffic between the L1 and L2 caches
                      issued_ipc:  Instructions issued per cycle
                   inst_per_warp:  Average number of instructions executed by each warp
          issue_slot_utilization:  Percentage of issue slots that issued at least one instruction, averaged across all cycles
local_load_transactions_per_request:  Average number of local memory load transactions performed for each local memory load
local_store_transactions_per_request:  Average number of local memory store transactions performed for each local memory store
shared_load_transactions_per_request:  Average number of shared memory load transactions performed for each shared memory load
shared_store_transactions_per_request:  Average number of shared memory store transactions performed for each shared memory store
    gld_transactions_per_request:  Average number of global memory load transactions performed for each global memory load
    gst_transactions_per_request:  Average number of global memory store transactions performed for each global memory store
         local_load_transactions:  Number of local memory load transactions
        local_store_transactions:  Number of local memory store transactions
        shared_load_transactions:  Number of shared memory load transactions
       shared_store_transactions:  Number of shared memory store transactions
                gld_transactions:  Number of global memory load transactions
                gst_transactions:  Number of global memory store transactions
        sysmem_read_transactions:  Number of system memory read transactions
       sysmem_write_transactions:  Number of system memory write transactions
          tex_cache_transactions:  Texture cache read transactions
          dram_read_transactions:  Device memory read transactions
         dram_write_transactions:  Device memory write transactions
            l2_read_transactions:  Memory read transactions seen at L2 cache for all read requests
           l2_write_transactions:  Memory write transactions seen at L2 cache for all write requests
           local_load_throughput:  Local memory load throughput
          local_store_throughput:  Local memory store throughput
          shared_load_throughput:  Shared memory load throughput
         shared_store_throughput:  Shared memory store throughput
              l2_read_throughput:  Memory read throughput seen at L2 cache for all read requests
             l2_write_throughput:  Memory write throughput seen at L2 cache for all write requests
          sysmem_read_throughput:  System memory read throughput
         sysmem_write_throughput:  System memory write throughput
                       cf_issued:  Number of issued control-flow instructions
                     cf_executed:  Number of executed control-flow instructions
                     ldst_issued:  Number of issued load and store instructions
                   ldst_executed:  Number of executed load and store instructions
                        flops_sp:  Single-precision floating point operations executed
                    flops_sp_add:  Single-precision floating point add operations executed
                    flops_sp_mul:  Single-precision floating point multiply operations executed
                    flops_sp_fma:  Single-precision floating point multiply accumulate operations executed
                        flops_dp:  Double-precision floating point operations executed
                    flops_dp_add:  Double-precision floating point add operations executed
                    flops_dp_mul:  Double-precision floating point multiply operations executed
                    flops_dp_fma:  Double-precision floating point multiply accumulate operations executed
                flops_sp_special:  Single-precision floating point special operations executed
           l1_shared_utilization:  The utilization level of the L1/shared memory relative to peak utilization
                  l2_utilization:  The utilization level of the L2 cache relative to the peak utilization
                 tex_utilization:  The utilization level of the texture cache relative to the peak utilization
                dram_utilization:  The utilization level of the device memory relative to the peak utilization
              sysmem_utilization:  The utilization level of the system memory relative to the peak utilization
             ldst_fu_utilization:  The utilization level of the multiprocessor function units that execute load and store instructions
              alu_fu_utilization:  The utilization level of the multiprocessor function units that execute integer and floating-point arithmetic instructions
               cf_fu_utilization:  The utilization level of the multiprocessor function units that execute control-flow instructions
              tex_fu_utilization:  The utilization level of the multiprocessor function units that execute texture instructions
                   inst_executed:  The number of instructions executed
                     inst_issued:  The number of instructions issued
                     issue_slots:  The number of issue slots used
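
To collect a few of these for an actual run rather than just list them, something along these lines should work (the application command is a stand-in, and the metric names are taken from the list above):

nvprof --metrics achieved_occupancy,branch_efficiency,gld_throughput,gst_throughput python simplecamera.py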

events

(chroma_env)delta:e blyth$ which nvprof
/Developer/NVIDIA/CUDA-5.5/bin/nvprof
(chroma_env)delta:e blyth$
(chroma_env)delta:e blyth$ nvprof --query-events
Available Events:
                            Name   Description
Device 0 (GeForce GT 750M):
        Domain domain_a:
       tex0_cache_sector_queries:  Number of texture cache 0 requests. This increments by 1 for each 32-byte access.
       tex1_cache_sector_queries:  Number of texture cache 1 requests. This increments by 1 for each 32-byte access.
       tex2_cache_sector_queries:  Number of texture cache 2 requests. This increments by 1 for each 32-byte access. Value will be 0 for devices that contain only 2 texture units.
       tex3_cache_sector_queries:  Number of texture cache 3 requests. This increments by 1 for each 32-byte access. Value will be 0 for devices that contain only 2 texture units.
        tex0_cache_sector_misses:  Number of texture cache 0 misses. This increments by 1 for each 32-byte access.
        tex1_cache_sector_misses:  Number of texture cache 1 misses. This increments by 1 for each 32-byte access.
        tex2_cache_sector_misses:  Number of texture cache 2 misses. This increments by 1 for each 32-byte access. Value will be 0 for devices that contain only 2 texture units.
        tex3_cache_sector_misses:  Number of texture cache 3 misses. This increments by 1 for each 32-byte access. Value will be 0 for devices that contain only 2 texture units.
               elapsed_cycles_sm:  Elapsed clocks

        Domain domain_b:
           fb_subp0_read_sectors:  Number of DRAM read requests to sub partition 0, increments by 1 for 32 byte access.
           fb_subp1_read_sectors:  Number of DRAM read requests to sub partition 1, increments by 1 for 32 byte access.
          fb_subp0_write_sectors:  Number of DRAM write requests to sub partition 0, increments by 1 for 32 byte access.
          fb_subp1_write_sectors:  Number of DRAM write requests to sub partition 1, increments by 1 for 32 byte access.
    l2_subp0_write_sector_misses:  Number of write misses in slice 0 of L2 cache. This increments by 1 for each 32-byte access.
    l2_subp1_write_sector_misses:  Number of write misses in slice 1 of L2 cache. This increments by 1 for each 32-byte access.
    l2_subp2_write_sector_misses:  Number of write misses in slice 2 of L2 cache. This increments by 1 for each 32-byte access.
    l2_subp3_write_sector_misses:  Number of write misses in slice 3 of L2 cache. This increments by 1 for each 32-byte access.
     l2_subp0_read_sector_misses:  Number of read misses in slice 0 of L2 cache. This increments by 1 for each 32-byte access.
     l2_subp1_read_sector_misses:  Number of read misses in slice 1 of L2 cache. This increments by 1 for each 32-byte access.
     l2_subp2_read_sector_misses:  Number of read misses in slice 2 of L2 cache. This increments by 1 for each 32-byte access.
     l2_subp3_read_sector_misses:  Number of read misses in slice 3 of L2 cache. This increments by 1 for each 32-byte access.
l2_subp0_write_l1_sector_queries:  Number of write requests from L1 to slice 0 of L2 cache. This increments by 1 for each 32-byte access.
l2_subp1_write_l1_sector_queries:  Number of write requests from L1 to slice 1 of L2 cache. This increments by 1 for each 32-byte access.
l2_subp2_write_l1_sector_queries:  Number of write requests from L1 to slice 2 of L2 cache. This increments by 1 for each 32-byte access.
l2_subp3_write_l1_sector_queries:  Number of write requests from L1 to slice 3 of L2 cache. This increments by 1 for each 32-byte access.
 l2_subp0_read_l1_sector_queries:  Number of read requests from L1 to slice 0 of L2 cache. This increments by 1 for each 32-byte access.
 l2_subp1_read_l1_sector_queries:  Number of read requests from L1 to slice 1 of L2 cache. This increments by 1 for each 32-byte access.
 l2_subp2_read_l1_sector_queries:  Number of read requests from L1 to slice 2 of L2 cache. This increments by 1 for each 32-byte access.
 l2_subp3_read_l1_sector_queries:  Number of read requests from L1 to slice 3 of L2 cache. This increments by 1 for each 32-byte access.
    l2_subp0_read_l1_hit_sectors:  Number of read requests from L1 that hit in slice 0 of L2 cache. This increments by 1 for each 32-byte access.
    l2_subp1_read_l1_hit_sectors:  Number of read requests from L1 that hit in slice 1 of L2 cache. This increments by 1 for each 32-byte access.
    l2_subp2_read_l1_hit_sectors:  Number of read requests from L1 that hit in slice 2 of L2 cache. This increments by 1 for each 32-byte access.
    l2_subp3_read_l1_hit_sectors:  Number of read requests from L1 that hit in slice 3 of L2 cache. This increments by 1 for each 32-byte access.
l2_subp0_read_tex_sector_queries:  Number of read requests from Texture cache to slice 0 of L2 cache. This increments by 1 for each 32-byte access.
l2_subp1_read_tex_sector_queries:  Number of read requests from Texture cache to slice 1 of L2 cache. This increments by 1 for each 32-byte access.
l2_subp2_read_tex_sector_queries:  Number of read requests from Texture cache to slice 2 of L2 cache. This increments by 1 for each 32-byte access.
l2_subp3_read_tex_sector_queries:  Number of read requests from Texture cache to slice 3 of L2 cache. This increments by 1 for each 32-byte access.
   l2_subp0_read_tex_hit_sectors:  Number of read requests from Texture cache that hit in slice 0 of L2 cache. This increments by 1 for each 32-byte access.
   l2_subp1_read_tex_hit_sectors:  Number of read requests from Texture cache that hit in slice 1 of L2 cache. This increments by 1 for each 32-byte access.
   l2_subp2_read_tex_hit_sectors:  Number of read requests from Texture cache that hit in slice 2 of L2 cache. This increments by 1 for each 32-byte access.
   l2_subp3_read_tex_hit_sectors:  Number of read requests from Texture cache that hit in slice 3 of L2 cache. This increments by 1 for each 32-byte access.
l2_subp0_read_sysmem_sector_queries:  Number of system memory read requests to slice 0 of L2 cache. This increments by 1 for each 32-byte access.
l2_subp1_read_sysmem_sector_queries:  Number of system memory read requests to slice 1 of L2 cache. This increments by 1 for each 32-byte access.
l2_subp2_read_sysmem_sector_queries:  Number of system memory read requests to slice 2 of L2 cache. This increments by 1 for each 32-byte access.
l2_subp3_read_sysmem_sector_queries:  Number of system memory read requests to slice 3 of L2 cache. This increments by 1 for each 32-byte access.
l2_subp0_write_sysmem_sector_queries:  Number of system memory write requests to slice 0 of L2 cache. This increments by 1 for each 32-byte access.
l2_subp1_write_sysmem_sector_queries:  Number of system memory write requests to slice 1 of L2 cache. This increments by 1 for each 32-byte access.
l2_subp2_write_sysmem_sector_queries:  Number of system memory write requests to slice 2 of L2 cache. This increments by 1 for each 32-byte access.
l2_subp3_write_sysmem_sector_queries:  Number of system memory write requests to slice 3 of L2 cache. This increments by 1 for each 32-byte access.
l2_subp0_total_read_sector_queries:  Total read requests to slice 0 of L2 cache. This includes requests from  L1, Texture cache, system memory. This increments by 1 for each 32-byte access.
l2_subp1_total_read_sector_queries:  Total read requests to slice 1 of L2 cache. This includes requests from  L1, Texture cache, system memory. This increments by 1 for each 32-byte access.
l2_subp2_total_read_sector_queries:  Total read requests to slice 2 of L2 cache. This includes requests from  L1, Texture cache, system memory. This increments by 1 for each 32-byte access.
l2_subp3_total_read_sector_queries:  Total read requests to slice 3 of L2 cache. This includes requests from  L1, Texture cache, system memory. This increments by 1 for each 32-byte access.
l2_subp0_total_write_sector_queries:  Total write requests to slice 0 of L2 cache. This includes requests from  L1, Texture cache, system memory. This increments by 1 for each 32-byte access.
l2_subp1_total_write_sector_queries:  Total write requests to slice 1 of L2 cache. This includes requests from  L1, Texture cache, system memory. This increments by 1 for each 32-byte access.
l2_subp2_total_write_sector_queries:  Total write requests to slice 2 of L2 cache. This includes requests from  L1, Texture cache, system memory. This increments by 1 for each 32-byte access.
l2_subp3_total_write_sector_queries:  Total write requests to slice 3 of L2 cache. This includes requests from  L1, Texture cache, system memory. This increments by 1 for each 32-byte access.

        Domain domain_c:
                   gld_inst_8bit:  Total number of 8-bit global load instructions that are executed by all the threads across all thread blocks.
                  gld_inst_16bit:  Total number of 16-bit global load instructions that are executed by all the threads across all thread blocks.
                  gld_inst_32bit:  Total number of 32-bit global load instructions that are executed by all the threads across all thread blocks.
                  gld_inst_64bit:  Total number of 64-bit global load instructions that are executed by all the threads across all thread blocks.
                 gld_inst_128bit:  Total number of 128-bit global load instructions that are executed by all the threads across all thread blocks.
                   gst_inst_8bit:  Total number of 8-bit global store instructions that are executed by all the threads across all thread blocks.
                  gst_inst_16bit:  Total number of 16-bit global store instructions that are executed by all the threads across all thread blocks.
                  gst_inst_32bit:  Total number of 32-bit global store instructions that are executed by all the threads across all thread blocks.
                  gst_inst_64bit:  Total number of 64-bit global store instructions that are executed by all the threads across all thread blocks.
                 gst_inst_128bit:  Total number of 128-bit global store instructions that are executed by all the threads across all thread blocks.

        Domain domain_d:
                 prof_trigger_00:  User profiled generic trigger that can be inserted in any place of the code to collect the related information. Increments per warp.
                 prof_trigger_01:  User profiled generic trigger that can be inserted in any place of the code to collect the related information. Increments per warp.
                 prof_trigger_02:  User profiled generic trigger that can be inserted in any place of the code to collect the related information. Increments per warp.
                 prof_trigger_03:  User profiled generic trigger that can be inserted in any place of the code to collect the related information. Increments per warp.
                 prof_trigger_04:  User profiled generic trigger that can be inserted in any place of the code to collect the related information. Increments per warp.
                 prof_trigger_05:  User profiled generic trigger that can be inserted in any place of the code to collect the related information. Increments per warp.
                 prof_trigger_06:  User profiled generic trigger that can be inserted in any place of the code to collect the related information. Increments per warp.
                 prof_trigger_07:  User profiled generic trigger that can be inserted in any place of the code to collect the related information. Increments per warp.
                  warps_launched:  Number of warps launched on a multiprocessor.
                threads_launched:  Number of threads launched on a multiprocessor.
                    inst_issued1:  Number of single instruction issued per cycle
                    inst_issued2:  Number of dual instructions issued per cycle
                   inst_executed:  Number of instructions executed, do not include replays.
                     shared_load:  Number of executed load instructions where state space is specified as shared, increments per warp on a multiprocessor.
                    shared_store:  Number of executed store instructions where state space is specified as shared, increments per warp on a multiprocessor.
                      local_load:  Number of executed load instructions where state space is specified as local, increments per warp on a multiprocessor.
                     local_store:  Number of executed store instructions where state space is specified as local, increments per warp on a multiprocessor.
                     gld_request:  Number of executed load instructions where the state space is not specified and hence generic addressing is used, increments per warp on a multiprocessor. It can include the load operations from global,local and shared state space.
                     gst_request:  Number of executed store instructions where the state space is not specified and hence generic addressing is used, increments per warp on a multiprocessor. It can include the store operations to global,local and shared state space.
                      atom_count:  Number of warps executing atomic reduction operations. Increments by one if at least one thread in a warp executes the instruction.
                      gred_count:  Number of warps executing reduction operations on global and shared memory. Increments by one if at least one thread in a warp executes the instruction
                          branch:  Number of branch instructions executed per warp on a multiprocessor.
                divergent_branch:  Number of divergent branches within a warp. This counter will be incremented by one if at least one thread in a warp diverges (that is, follows a different execution path) via a conditional branch.
                   active_cycles:  Number of cycles a multiprocessor has at least one active warp. This event can increment by 0 - 1 on each cycle.
                    active_warps:  Accumulated number of active warps per cycle. For every cycle it increments by the number of active warps in the cycle which can be in the range 0 to 64.
                 sm_cta_launched:  Number of thread blocks launched on a multiprocessor.
         local_load_transactions:  Number of local load transactions from L1 cache. Increments by 1 per transaction. Transaction can be 32/64/96/128B.
        local_store_transactions:  Number of local store transactions to L1 cache. Increments by 1 per transaction. Transaction can be 32/64/96/128B.
     l1_shared_load_transactions:  Number of shared load transactions. Increments by 1 per transaction. Transaction can be 32/64/96/128B.
    l1_shared_store_transactions:  Number of shared store transactions. Increments by 1 per transaction. Transaction can be 32/64/96/128B.
   __l1_global_load_transactions:  Number of global load transactions from L1 cache. Increments by 1 per transaction. Transaction can be 32/64/96/128B.
  __l1_global_store_transactions:  Number of global store transactions from L1 cache. Increments by 1 per transaction. Transaction can be 32/64/96/128B.
               l1_local_load_hit:  Number of cache lines that hit in L1 cache for local memory load accesses. In case of perfect coalescing this increments by 1,2, and 4 for 32, 64 and 128 bit accesses by a warp respectively.
              l1_local_load_miss:  Number of cache lines that miss in L1 cache for local memory load accesses. In case of perfect coalescing this increments by 1,2, and 4 for 32, 64 and 128 bit accesses by a warp respectively.
              l1_local_store_hit:  Number of cache lines that hit in L1 cache for local memory store accesses. In case of perfect coalescing this increments by 1,2, and 4 for 32, 64 and 128 bit accesses by a warp respectively.
             l1_local_store_miss:  Number of cache lines that miss in L1 cache for local memory store accesses. In case of perfect coalescing this increments by 1,2, and 4 for 32,64 and 128 bit accesses by a warp respectively.
              l1_global_load_hit:  Number of cache lines that hit in L1 cache for global memory load accesses. In case of perfect coalescing this increments by 1,2, and 4 for 32, 64 and 128 bit accesses by a warp respectively.
             l1_global_load_miss:  Number of cache lines that miss in L1 cache for global memory load accesses. In case of perfect coalescing this increments by 1,2, and 4 for 32, 64 and 128 bit accesses by a warp respectively.
uncached_global_load_transaction:  Number of uncached global load transactions. Increments by 1 per transaction. Transaction can be 32/64/96/128B.
        global_store_transaction:  Number of global store transactions. Increments by 1 per transaction. Transaction can be 32/64/96/128B.
              shared_load_replay:  Replays caused due to shared load bank conflict (when the addresses for two or more shared memory load requests fall in the same memory bank) or when there is no conflict but the total number of words accessed by all threads in the warp executing that instruction exceed the number of words that can be loaded in one cycle (256 bytes).
             shared_store_replay:  Replays caused due to shared store bank conflict (when the addresses for two or more shared memory store requests fall in the same memory bank) or when there is no conflict but the total number of words accessed by all threads in the warp executing that instruction exceed the number of words that can be stored in one cycle.
global_ld_mem_divergence_replays:  global ld is replayed due to divergence
global_st_mem_divergence_replays:  global st is replayed due to divergence
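
Individual events can be requested by name in the same way; a sketch counting warp launches and branch divergence for the same run (again with a stand-in application command):

nvprof --events warps_launched,branch,divergent_branch python simplecamera.py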