Opticks : GPU Optical Photon Simulation for Particle Physics with NVIDIA OptiX

Opticks : GPU Optical Simulation via NVIDIA® OptiX™
+ A Mental Model for Effective Application of GPUs

Open source, https://bitbucket.org/simoncblyth/opticks

Simon C Blyth, IHEP, CAS — December 2019, IHEP EPD/PIFI Seminar


Outline

/env/presentation/newtons-opticks.png
 


Outline of Mental Model



Opticks : GPU Optical Simulation via NVIDIA® OptiX™
+ A Mental Model for Effective Application of GPUs


JUNO_Intro_2


JUNO_Intro_3


Geant4 : Monte Carlo Simulation Toolkit


Geant4 : Monte Carlo Simulation Toolkit Generality


Optical Photon Simulation Problem...









Optical Photon Simulation ≈ Ray Traced Image Rendering

Much in common : geometry, light sources, optical physics


Many Applications of ray tracing :


Ray-tracing vs Rasterization

/env/presentation/nvidia/nv_rasterization.png /env/presentation/nvidia/nv_raytrace.png

SIGGRAPH_2018_Announcing_Worlds_First_Ray_Tracing_GPU








10 Giga Rays/s

TURING BUILT FOR RTX 2








NVIDIA RTX Metro Exodus











Spatial Index Acceleration Structure













NVIDIA® OptiX™ Ray Tracing Engine -- http://developer.nvidia.com/optix

OptiX makes GPU ray tracing accessible

NVIDIA expertise:

Opticks provides (Yellow):

[1] Turing RTX GPUs


Geant4OpticksWorkflow


Opticks : Translates G4 Optical Physics to CUDA/OptiX

OptiX : single-ray programming model -> line-by-line translation

CUDA Ports of Geant4 classes
  • G4Cerenkov (only generation loop)
  • G4Scintillation (only generation loop)
  • G4OpAbsorption
  • G4OpRayleigh
  • G4OpBoundaryProcess (only a few surface types)
Modify Cherenkov + Scintillation Processes
  • collect genstep, copy to GPU for generation
  • avoids copying millions of photons to GPU
Scintillator Reemission
  • fraction of bulk absorbed "reborn" within same thread
  • wavelength generated by reemission texture lookup
Opticks (OptiX/Thrust GPU interoperation)
  • OptiX : upload gensteps
  • Thrust : seeding, distribute genstep indices to photons
  • OptiX : launch photon generation and propagation
  • Thrust : pullback photons that hit PMTs
  • Thrust : index photon step sequences (optional)
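A minimal NumPy sketch of the seeding step (array contents illustrative): each genstep carries a photon count, and seeding gives every photon thread the index of its parent genstep.

import numpy as np

counts = np.array([3, 1, 4])                       # photons per genstep (illustrative)
seed = np.repeat(np.arange(len(counts)), counts)   # genstep index per photon
print(seed)                                        # [0 0 0 1 2 2 2 2]
# on the GPU the equivalent expansion is done with Thrust, so per-photon
# parameters are generated in place and never copied from the CPU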

G4Solid -> CUDA Intersect Functions for ~10 Primitives

/env/presentation/tboolean_parade_sep2017.png

Sphere, Cylinder, Disc, Cone, Convex Polyhedron, Hyperboloid, Torus, ...
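For flavour, a minimal NumPy sketch of the quadratic-root solve at the heart of a sphere intersect (function name hypothetical; the CUDA functions additionally report surface normals and respect OptiX t ranges):

import numpy as np

def intersect_sphere(origin, direction, center, radius, t_min=0.0):
    """Nearest intersection distance along the ray, or None : algebra sketch only."""
    O = np.asarray(origin) - np.asarray(center)
    D = np.asarray(direction)                 # assumed normalized
    b = np.dot(O, D)
    c = np.dot(O, O) - radius*radius
    disc = b*b - c
    if disc < 0.0: return None                # ray misses the sphere
    sdisc = np.sqrt(disc)
    for t in (-b - sdisc, -b + sdisc):        # enter root, then exit root
        if t > t_min: return t
    return None

assert intersect_sphere([0,0,-10], [0,0,1], [0,0,0], 1.0) == 9.0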


G4Boolean -> CUDA/OptiX Intersection Program Implementing CSG

Complete Binary Tree, pick between pairs of nearest intersects:

UNION (tA < tB)   Enter B    Exit B     Miss B
Enter A           ReturnA    LoopA      ReturnA
Exit A            ReturnA    ReturnB    ReturnA
Miss A            ReturnB    ReturnB    ReturnMiss
[1] Ray Tracing CSG Objects Using Single Hit Intersections, Andrew Kensler (2006)
with corrections by author of XRT Raytracer http://xrt.wikidot.com/doc:csg
[2] https://bitbucket.org/simoncblyth/opticks/src/tip/optixrap/cu/csg_intersect_boolean.h
Similar to binary expression tree evaluation using a postorder traversal.
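A Python sketch of the table-driven decision for UNION (state names as in the table above; the real kernel [2] loops, advancing t_min whenever a Loop entry is selected):

UNION = {   # assumes tA < tB : the A intersect is the nearer one (swap otherwise)
    ("EnterA","EnterB") : "ReturnA",  ("EnterA","ExitB") : "LoopA",    ("EnterA","MissB") : "ReturnA",
    ("ExitA","EnterB")  : "ReturnA",  ("ExitA","ExitB")  : "ReturnB",  ("ExitA","MissB")  : "ReturnA",
    ("MissA","EnterB")  : "ReturnB",  ("MissA","ExitB")  : "ReturnB",  ("MissA","MissB")  : "ReturnMiss",
}
assert UNION[("ExitA","ExitB")] == "ReturnB"   # matches the Exit A / Exit B entry above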

Opticks : Translates G4 Geometry to GPU, Without Approximation

G4 Structure Tree -> Instance+Global Arrays -> OptiX

Group structure into repeated instances + global remainder:

instancing -> huge memory savings for JUNO PMTs
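Rough arithmetic on why instancing saves memory (counts illustrative, not actual JUNO mesh sizes): one shared mesh plus a 4x4 transform per PMT replaces a full mesh per PMT.

import numpy as np

n_pmt, n_vert = 20000, 5000                     # illustrative counts
full   = n_pmt * n_vert * 3 * 4                 # bytes : vertices repeated per PMT
shared = n_vert * 3 * 4 + n_pmt * 4 * 4 * 4     # bytes : one mesh + one 4x4 per PMT
print("memory ratio ~ %.0fx" % (float(full)/shared))   # ~900x for these counts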



        
        

j1808_top_rtx


j1808_top_ogl


Validation of Opticks Simulation by Comparison with Geant4

Bi-simulations of all JUNO solids, with millions of photons

mis-aligned histories : mostly < 0.25%, < 0.50% for largest solids
deviant photons within matched history : < 0.05% (500/1M)

Primary sources of problems

Primary cause : float vs double

Geant4 uses double precision everywhere; Opticks uses double only sparingly (double was observed to cost a 10x slowdown with RTX)

Conclude


scan-pf-check-GUI-TO-SC-BT5-SD


Recording the steps of Millions of Photons

Up to 16 steps of the photon propagation are recorded.

Photon Array : 4 * float4 = 512 bits/photon

Step Record Array : 2 * short4 = 2*16*4 = 128 bits/record

Compression uses known domains of position (geometry center, extent), time (0:200ns), wavelength, polarization.
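A NumPy sketch of the domain compression (center/extent values illustrative), packing the position and time of one step into a short4:

import numpy as np

center, extent = np.zeros(3), 20000.0      # geometry domain (mm), illustrative
t_max = 200.0                              # time domain (ns), 0:200

def compress(pos, t):
    """Map position+time onto [-1,1] using the known domains, scale to int16."""
    f = np.append((np.asarray(pos) - center)/extent, 2.0*t/t_max - 1.0)
    return np.round(f*32767.0).astype(np.int16)          # one short4 : 64 bits

def decompress(rec):
    f = rec.astype(np.float64)/32767.0
    return f[:3]*extent + center, (f[3] + 1.0)/2.0*t_max

pos, t = decompress(compress([1000.0, -2000.0, 3000.0], 50.0))
# the second short4 packs wavelength, polarization and flags similarly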


scan-pf-check-GUI-TO-BT5-SD


Performance : Scanning from 1M to 400M Photons

Full JUNO Analytic Geometry j1808v5

Production Mode : does the minimum

Multi-Event Running, Measure:

interval : avg time between successive launches, including overheads (upload gensteps + launch + download hits)
launch : avg of 10 OptiX launches

NVIDIA Quadro RTX 8000 (48G)

Thanks (谢谢) to NVIDIA China
for loaning the card

scan-pf-1_NHit










scan-pf-1_Opticks_vs_Geant4 2





JUNO analytic, 400M photons from center     Time                 Speedup
Geant4 Extrap.                              95,600 s (26 hrs)
Opticks RTX ON (i)                          58 s                 1650x

scan-pf-1_Opticks_Speedup 2










JUNO analytic, 400M photons from center     Time                 Speedup
Opticks RTX ON (i)                          58 s                 1650x
Opticks RTX OFF (i)                         275 s                350x
Geant4 Extrap.                              95,600 s (26 hrs)

scan-pf-1_RTX_Speedup












5x Speedup from RTX with JUNO analytic geometry

Useful Speedup > 1000x : But Why Not Giga Rays/s ? (1 Photon ~10 Rays)
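Back-of-envelope with the scan numbers above:

photons, launch_s = 400e6, 58.0            # RTX ON result above
rays = photons * 10                        # ~10 rays per photon
print("%.1e rays/s" % (rays/launch_s))     # ~6.9e7 rays/s, far below 1e10
# the gap suggests time goes into simulation work per ray
# (generation, physics, recording), not the ray trace alone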













OptiX Performance Tools and Tricks, David Hart, NVIDIA https://developer.nvidia.com/siggraph/2019/video/sig915-vid


Where Next for Opticks ?

JUNO+Opticks into Production

Geant4+Opticks Integration : Work with Geant4 Collaboration

Alpha Development ------>-----------------> Robust Tool


Drastically Improved Optical Photon Simulation Performance...

Three revolutions reinforcing each other:

Deep rivers of development, ripe for re-purposing

Example : DL denoising for faster ray trace convergence

Re-evaluate long-held practices in light of new realities:


Summary

/env/presentation/1px.png

Opticks : state-of-the-art GPU ray tracing applied to optical photon simulation and integrated with Geant4, giving a leap in performance that eliminates memory and time bottlenecks.

/env/presentation/1px.png
  • Drastic speedup -> better detector understanding -> greater precision
    • any simulation limited by optical photons can benefit
    • more photon limited -> more overall speedup (99% -> 100x)
/env/presentation/1px.png
https://bitbucket.org/simoncblyth/opticks code repository
https://simoncblyth.bitbucket.io presentations and videos
https://groups.io/g/opticks forum/mailing list archive
email:opticks+subscribe@groups.io subscribe to mailing list

geocache_360


Outline of Mental Model



Opticks : GPU Optical Simulation via NVIDIA® OptiX™
+ A Mental Model for Effective Application of GPUs


Amdahls "Law" : Expected Speedup Limited by Serial Processing

optical photon simulation, P ~ 99% of CPU time

Must consider processing "big picture"
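The law in code form, with P the parallelizable fraction and S the speedup applied to that fraction:

def amdahl(P, S):
    """Overall speedup when fraction P of the work is accelerated by factor S."""
    return 1.0/((1.0 - P) + P/S)

print(amdahl(0.99, 1e6))    # ~100 : the serial 1% caps overall speedup at ~100x
print(amdahl(0.99, 1650))   # ~94  : the measured optical speedup is near that cap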


Understanding GPU Graphical Origins -> Effective GPU Computation

GPUs evolved to rasterize 3D graphics at 30/60 fps

Simple Array Data Structures (N-million,4)

Constant "Uniform" 4x4 matrices : scaling+rotation+translation

Graphical Experience Informs Fast Computation on GPUs


CPU Optimizes Latency, GPU Optimizes Throughput

/env/presentation/nvidia/cpu_vs_gpu_architecture.png

Waiting for memory read/write is a major source of latency...

CPU : latency-oriented : Minimize time to complete single task : avoid latency with caching
  • complex : caching system, branch prediction, speculative execution, ...
GPU : throughput-oriented : Maximize total work per unit time : hide latency with parallelism
  • many simple processing cores, hardware multithreading, SIMD (single instruction multiple data)
  • simpler : lots of compute (ALU), at expense of cache+control
  • can tolerate latency, by assuming abundant other tasks to resume : design assumes parallel workload
Totally different processor architecture -> Total reorganization of data and computation
  • major speedups typically require total rethink of data structures and computation

How to Make Effective Use of GPUs ? Parallel / Simple / Uncoupled

Abundant parallelism
  • many thousands of tasks (ideally millions)
Low register usage : otherwise limits concurrent threads
  • simple kernels, avoid branching
Little/No Synchronization
  • avoid waiting, avoid complex code/debugging
Minimize CPU<->GPU copies
  • reuse GPU buffers across multiple CUDA launches
/env/presentation/1px.png
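A CuPy sketch of the buffer-reuse point above (the arithmetic stands in for real kernels): device arrays allocated once, only small uploads/downloads per event.

import cupy as cp

d_photons = cp.zeros((1000000, 4, 4), dtype=cp.float32)      # allocated once
for event in range(10):
    small_in = cp.asarray([float(event)], dtype=cp.float32)  # small upload
    d_photons += small_in[0]              # stand-in for real kernel launches
    hits = cp.asnumpy(d_photons[:10])     # download a selection, not the buffer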

How Many Threads to Launch ?

Understanding Throughput-oriented Architectures https://cacm.acm.org/magazines/2010/11/100622-understanding-throughput-oriented-architectures/fulltext

NVIDIA Titan V: 80 SM, 5120 CUDA cores
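Rough sizing: Volta SMs each hold up to 2048 resident threads, so filling the GPU needs far more threads than cores.

sm, cores = 80, 5120                 # Titan V
threads_per_sm = 2048                # max resident threads per Volta SM
print(sm * threads_per_sm)           # 163840 threads just to fill residency
# oversubscribe well beyond this : spare warps are what hide memory latency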


GPU Demands Simplicity (Arrays) -> Big Benefits : NumPy + CuPy

Separate address space -> cudaMemcpy -> Serialization
upload/download : host(CPU)<->device(GPU)

Object-oriented : mixes data and compute

Array-oriented : separate data from compute

NumPy : standard array handling package

https://realpython.com/numpy-array-programming/


Parallel Processing 0th Step : Re-shape Data into "Parallel" Array Form

Q1 : What Shape is Your Data ?

Q2 : What are Independent Chunks ?

CUDA Thread for each element:
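A CuPy sketch of the one-thread-per-element pattern (the squared_diff kernel from the CuPy documentation):

import cupy as cp

squared_diff = cp.ElementwiseKernel(
    'float32 x, float32 y',            # inputs : one element per CUDA thread
    'float32 z',                       # output
    'z = (x - y) * (x - y)',           # per-element CUDA C code
    'squared_diff')

x = cp.arange(10, dtype=cp.float32)
z = squared_diff(x, cp.float32(5))     # scalar y broadcast over x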


Survey of High Level General Purpose CUDA Packages

C++ Based Interfaces to CUDA

Mature NVIDIA Basis Libraries

RAPIDS : New NVIDIA "Suite" of open source data science libs


NumPy : Foundation of Python Data Ecosystem

https://bitbucket.org/simoncblyth/intro_to_numpy

Very terse, array-oriented (no-loop) Python interface with C performance

/env/presentation/numpy_ecosystem.png
Recommended paper:
The NumPy array: a structure for efficient numerical computation https://hal.inria.fr/inria-00564007
 

https://docs.scipy.org/doc/numpy/user/quickstart.html

http://www.scipy-lectures.org/intro/index.html

https://github.com/donnemartin/data-science-ipython-notebooks


CuPy : NumPy API (+ some of SciPy) accelerated by NVIDIA CUDA : cuRAND/cuBLAS/cuSOLVER/cuSPARSE/Thrust/NCCL















CuPy : Easy Interface to NVIDIA CUDA stack
















NumPy + CuPy Example : closest approach of Ellipse to a Point

import numpy as np, cupy as cp

def ellipse_closest_approach_to_point( xp, ex, ez, _c, N ):
    """ex, ez: ellipse semi-axes, c: coordinates of point in ellipse frame"""
    c = xp.asarray( _c )  ; assert c.shape == (2,)

    t = xp.linspace( 0, 2*np.pi, N )      # t: array of N angles [0,2pi] 
    e = xp.zeros( [len(t), 2] )
    e[:,0] = ex*xp.cos(t)
    e[:,1] = ez*xp.sin(t)                       # e: N parametric [x,z] points on the ellipse 

    return  e[xp.sum((e-c)**2, axis=1).argmin()]    # point on ellipse closest to c 
expression              shape          note
e                       (10000000,2)
c                       (2,)
e-c                     (10000000,2)   c is broadcast over e : must be compatible shape
np.sum((e-c)**2, 1)     (10000000,)    axis=1 : summing over the axis of length 2
np.sum((e-c)**2, 0)     (2,)           axis=0 : summing over the axis of length 10M
np.sum((e-c)**2, None)  ()             axis=None : summing over all elements, yielding a scalar

Python vs NumPy vs CuPy

https://bitbucket.org/simoncblyth/intro_to_numpy/src/default/python_vs_numpy_vs_cupy/ellipse_closest_approach_to_point.py


 17 import numpy as np, cupy as cp
 ...
 67 if __name__ == '__main__':
 68
 69     N = 10000000     # 10M is large enough to make overheads negligible  
 70     M = 8            # repeat timings  
 71
 72     ex,ey = 10,20      # parameters of ellipse 
 73     c = [100,100]       # point to check 
 74
 75     r = {}               # python dict for results 
 76     t = np.zeros(M)      # numpy array for timings  
 77
 78     for i in range(M):
 79         if i == 0:
 80             r[i],t[i] = timed(pure_python_ellipse_closest_approach_to_point_DO_NOT_DO_THIS)(ex,ey,c,N)
 81         else:
 82             xp = np if i < M/2 else cp   # switch between NumPy and CuPy implementations of same API 
 83             r[i],t[i] = timed(ellipse_closest_approach_to_point)( xp, ex,ey,c, N )
 84         pass
 85     pass
 86     for k in r.keys():
 87         if k == M/2: print("")
 88         print(k, r[k], t[k])
 89     pass


Python vs NumPy vs CuPy : Python/NumPy ~ 10, NumPy/CuPy ~ 1300 (CUDA 10.1, NVIDIA TITAN V)

[blyth@localhost python_vs_numpy_vs_cupy]$ ipython -i ellipse_closest_approach_to_point.py
                     # "ipython -i" : run python script leaving context for interactive examination 

Python 2.7.15 |Anaconda, Inc.| (default, May  1 2018, 23:32:55)
IPython 5.7.0 -- An enhanced Interactive Python.
...
(0, (4.983054096206095, 17.34003135802051), '6.266726970672607')
(1, array([ 4.9830541 , 17.34003136]), '0.674299955368042')
(2, array([ 4.9830541 , 17.34003136]), '0.6548769474029541')
(3, array([ 4.9830541 , 17.34003136]), '0.6450159549713135')

(4, array([ 4.9830541 , 17.34003136]), '0.4854261875152588')     # 1st CuPy run CUDA compiles populating cache 
(5, array([ 4.9830541 , 17.34003136]), '0.000415802001953125')
(6, array([ 4.9830541 , 17.34003136]), '0.0003409385681152344')
(7, array([ 4.9830541 , 17.34003136]), '0.000762939453125')

In [1]: tpy = t[0]
In [2]: tnp = np.average(t[1:M/2])
In [3]: tcp = np.average(t[M/2+1:])

In [4]: tpy/tnp
Out[4]: 9.522970786915796     #  NumPy ~ 10x Python 

In [5]: tnp/tcp
Out[5]: 1299.0845622842799    #  CuPy > 1000x NumPy : with almost zero effort can apply the CUDA stack 

Persist NumPy Arrays to .npy Files from Python, Examine File Format

IPython persisting NumPy arrays:

In [1]: a = np.arange(10)           # array of 10 long (int64) 
In [2]: a
Out[2]: array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

In [3]: np.save("/tmp/a.npy", a )   # serialize array to file 

In [4]: a2 = np.load("/tmp/a.npy")  # load array into memory 
In [5]: assert np.all( a == a2 )    # check all elements the same 

IPython examine NPY file format:

In [6]: !xxd /tmp/a.npy             # xxd hexdump file contents 
                                    # minimal header : type and shape 

00000000: 934e 554d 5059 0100 7600 7b27 6465 7363  .NUMPY..v.{'desc
00000010: 7227 3a20 273c 6938 272c 2027 666f 7274  r': '<i8', 'fort
00000020: 7261 6e5f 6f72 6465 7227 3a20 4661 6c73  ran_order': Fals
00000030: 652c 2027 7368 6170 6527 3a20 2831 302c  e, 'shape': (10,
00000040: 292c 207d 2020 2020 2020 2020 2020 2020  ), }
00000050: 2020 2020 2020 2020 2020 2020 2020 2020
00000060: 2020 2020 2020 2020 2020 2020 2020 2020
00000070: 2020 2020 2020 2020 2020 2020 2020 200a                 .
00000080: 0000 0000 0000 0000 0100 0000 0000 0000  ................
00000090: 0200 0000 0000 0000 0300 0000 0000 0000  ................
..

Create and Write .npy Array from C/C++ with NP.hh header (No Python)

Check a C++ created NumPy array from the command line:

$ python -c "import numpy as np ; print(np.load('/tmp/a.npy'))"
[[[ 0.  1.  2.  3.]
  [ 4.  5.  6.  7.]]

 [[ 8.  9. 10. 11.]
  [12. 13. 14. 15.]]

 [[16. 17. 18. 19.]
  [20. 21. 22. 23.]]]

Access/Create Array Data Anywhere

google:"parse NumPy file with Java/Rust/R/Swift/Android/..."


C Union "Trick" Mixed Type Arrays -> Simpler + Faster CUDA Usage

http://bitbucket.org/simoncblyth/intro_to_numpy/src/tip/union_trick/


// uif.h
#pragma once
union uif_t {     // store three 32-bit types at the same location
    unsigned u ;
    int      i ;
    float    f ;
} ;
In [1]: import numpy as np
In [2]: a = np.ones(3, dtype=np.float32)
In [3]: a.view(np.int32)[1] = -10     # int32 inside float32 array : view reinterprets same bits as a different type
In [4]: a
Out[4]: array([  1.,  nan,   1.], dtype=float32)
In [5]: a.view(np.int32)
Out[5]: array([1065353216,        -10, 1065353216], dtype=int32)

CUDA provides reinterpret device functions for the same purpose, e.g. __int_as_float and __float_as_int.


Example 2 : OpenGL Interactive Photon History Selection

Photon record renderer, OpenGL geometry shader extracts:


 01 #version 410 core
 ..
 37 uniform ivec4 RecSelect ;  //  constant input to the "launch" 
 ..
 44 in ivec4 sel[];            //  input array 
 ..
 46
 47 layout (lines) in;
 48 layout (points, max_vertices = 1) out;
 ..
 52 void main ()
 53 {
 54     uint seqhis = sel[0].x ;
 55     uint seqmat = sel[0].y ;
 56     if( RecSelect.x > 0 && RecSelect.x != seqhis )  return ;
 57     if( RecSelect.y > 0 && RecSelect.y != seqmat )  return ;
 ..     // History and Material Selection 
 63     vec4 p0 = gl_in[0].gl_Position  ;
 64     vec4 p1 = gl_in[1].gl_Position  ;
 65     float tc = Param.w / TimeDomain.y ;
 66     // for interpolation of photon position at uniform input time  

 ..

Example 2 : Photon History Indexing with CUDA Thrust

https://bitbucket.org/simoncblyth/opticks/src/default/thrustrap/TSparse_.cu

 89 template <typename T>
 90 void TSparse<T>::count_unique()
 91 {
 ///     preparation of src with strided range iterator elided for brevity
 97
 98     thrust::device_vector<T> data(src.begin(), src.end());  // GPU copy to avoid sorting original
 99
100     thrust::sort(data.begin(), data.end());  // GPU sort of millions of integers
101
102     // inner_product of sorted data with shifted by one self finds "edges" between values
103     m_num_unique = thrust::inner_product(
104                                   data.begin(),data.end() - 1,   // first1, last1
105                                              data.begin() + 1,   // first2
106                                                        int(1),   // output type init
107                                      thrust::plus<int>(),   // reduction operator
108                                thrust::not_equal_to<T>()    // pair-by-pair operator, returning 1 at edges
109                                       );
116     m_values.resize(m_num_unique);
117     m_counts.resize(m_num_unique);
118
119     // find all unique key values with their counts
120     thrust::reduce_by_key(
121                                 data.begin(),    // keys_first
122                                   data.end(),    // keys_last
123       thrust::constant_iterator<int>(1),    // values_first
124                             m_values.begin(),    // keys_output
125                             m_counts.begin()     // values_output
126                          );
///     sort into descending key order elided for brevity
147 }
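For comparison, the same count-unique task expressed in NumPy primitives (a sketch, not the Opticks implementation; codes illustrative):

import numpy as np

seq = np.array([0x8d, 0x4d, 0x8d, 0x8d, 0x46d, 0x4d, 0x8d])   # history codes
values, counts = np.unique(seq, return_counts=True)
order = np.argsort(counts)[::-1]               # descending frequency
for v, c in zip(values[order], counts[order]):
    print("%s %d" % (hex(int(v)), c))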

Using Thrust will improve your C++ skills


Applying CUDA Thrust -> Translate Task into Processing "Primitives"

Benefit from highly optimized GPU implementations

Thrust Hides GPU Details

Thrust : High level C++ interface to CUDA

/env/presentation/nvidia/thrust_logo.png

thrust::inner_product

/env/presentation/thrust/inner_product.png

An example of Thrust documentation


thrust::reduce_by_key

/env/presentation/thrust/reduce_by_key.png

Best way to learn the Thrust API is to exercise it by writing small tests


Summary of a Mental Model for Effective GPU Usage


geocache_360 2