Opticks : GPU Optical Photon Simulation for Particle Physics with NVIDIA OptiX

Opticks: GPU photon simulation via NVIDIA OptiX ;
Applied to neutrino telescope simulations ?

Opticks : GPU photon simulation via NVIDIA® OptiX™
+ GPU/Graphics background
+ Application to neutrino telescope simulations ?

Open source, https://bitbucket.org/simoncblyth/opticks

Simon C Blyth, IHEP, CAS — August 2020, SJTU, Neutrino Telescope Simulation Workshop


Outline Opticks

/env/presentation/newtons-opticks.png
 


Outline of Graphics/GPU background + Application to neutrino telescopes

/env/presentation/newtons-opticks.png

JUNO_Intro_2


JUNO_Intro_3


Geant4 : Monte Carlo Simulation Toolkit


Geant4 : Monte Carlo Simulation Toolkit Generality


Optical Photon Simulation Problem...









Optical Photon Simulation ≈ Ray Traced Image Rendering

Much in common : geometry, light sources, optical physics


Many Applications of ray tracing :


Ray-tracing vs Rasterization

/env/presentation/nvidia/nv_rasterization.png /env/presentation/nvidia/nv_raytrace.png

SIGGRAPH_2018_Announcing_Worlds_First_Ray_Tracing_GPU








10 Giga Rays/s

TURING BUILT FOR RTX 2








NVIDIA RTX Metro Exodus











Spatial Index Acceleration Structure













NVIDIA® OptiX™ Ray Tracing Engine -- http://developer.nvidia.com/optix

OptiX makes GPU ray tracing accessible

NVIDIA expertise:

Opticks provides (Yellow):

[1] Turing RTX GPUs


Geant4OpticksWorkflow


Opticks : Translates G4 Optical Physics to CUDA/OptiX

OptiX : single-ray programming model -> line-by-line translation

CUDA Ports of Geant4 classes
  • G4Cerenkov (only generation loop)
  • G4Scintillation (only generation loop)
  • G4OpAbsorption
  • G4OpRayleigh
  • G4OpBoundaryProcess (only a few surface types)
Modify Cherenkov + Scintillation Processes
  • collect genstep, copy to GPU for generation
  • avoids copying millions of photons to GPU
Scintillator Reemission
  • fraction of bulk absorbed "reborn" within same thread
  • wavelength generated by reemission texture lookup
Opticks (OptiX/Thrust GPU interoperation)
  • OptiX : upload gensteps
  • Thrust : seeding, distribute genstep indices to photons
  • OptiX : launch photon generation and propagation
  • Thrust : pullback photons that hit PMTs
  • Thrust : index photon step sequences (optional)

G4Solid -> CUDA Intersect Functions for ~10 Primitives

/env/presentation/tboolean_parade_sep2017.png

Sphere, Cylinder, Disc, Cone, Convex Polyhedron, Hyperboloid, Torus, ...


G4Boolean -> CUDA/OptiX Intersection Program Implementing CSG

Complete Binary Tree, pick between pairs of nearest intersects:

UNION tA < tB Enter B Exit B Miss B
Enter A ReturnA LoopA ReturnA
Exit A ReturnA ReturnB ReturnA
Miss A ReturnB ReturnB ReturnMiss
[1] Ray Tracing CSG Objects Using Single Hit Intersections, Andrew Kensler (2006)
with corrections by author of XRT Raytracer http://xrt.wikidot.com/doc:csg
[2] https://bitbucket.org/simoncblyth/opticks/src/tip/optixrap/cu/csg_intersect_boolean.h
Similar to binary expression tree evaluation using postorder traverse.

Opticks : Translates G4 Geometry to GPU, Without Approximation

G4 Structure Tree -> Instance+Global Arrays -> OptiX

Group structure into repeated instances + global remainder:

instancing -> huge memory savings for JUNO PMTs



        
        

j1808_top_rtx


j1808_top_ogl


Validation of Opticks Simulation by Comparison with Geant4

Bi-simulations of all JUNO solids, with millions of photons

mis-aligned histories
mostly < 0.25%, < 0.50% for largest solids
deviant photons within matched history
< 0.05% (500/1M)

Primary sources of problems

Primary cause : float vs double

Geant4 uses double everywhere, Opticks only sparingly (observed double costing 10x slowdown with RTX)

Conclude


scan-pf-check-GUI-TO-SC-BT5-SD


Recording the steps of Millions of Photons

Up to 16 steps of the photon propagation are recorded.

Photon Array : 4 * float4 = 512 bits/photon

Step Record Array : 2 * short4 = 2*16*4 = 128 bits/record

Compression uses known domains of position (geometry center, extent), time (0:200ns), wavelength, polarization.


scan-pf-check-GUI-TO-BT5-SD


Performance : Scanning from 1M to 400M Photons

Full JUNO Analytic Geometry j1808v5

Production Mode : does the minimum

Multi-Event Running, Measure:

interval
avg time between successive launches, including overheads: (upload gensteps + launch + download hits)
launch
avg of 10 OptiX launches

NVIDIA Quadro RTX 8000 (48G)

谢谢 NVIDIA China
for loaning the card

scan-pf-1_NHit










scan-pf-1_Opticks_vs_Geant4 2





JUNO analytic, 400M photons from center Speedup
Geant4 Extrap. 95,600 s (26 hrs)  
Opticks RTX ON (i) 58 s 1650x

scan-pf-1_Opticks_Speedup 2










JUNO analytic, 400M photons from center Speedup
Opticks RTX ON (i) 58s 1650x
Opticks RTX OFF (i) 275s 350x
Geant4 Extrap. 95,600s (26 hrs)  

scan-pf-1_RTX_Speedup












5x Speedup from RTX with JUNO analytic geometry

Useful Speedup > 1500x : But Why Not Giga Rays/s ? (1 Photon ~10 Rays)













OptiX Performance Tools and Tricks, David Hart, NVIDIA https://developer.nvidia.com/siggraph/2019/video/sig915-vid


Where Next for Opticks ?

JUNO+Opticks into Production

Geant4+Opticks Integration : Work with Geant4 Collaboration

Alpha Development ------>-----------------> Robust Tool


Drastically Improved Optical Photon Simulation Performance...

Three revolutions reinforcing each other:

Deep rivers of development, ripe for re-purposing

Example : DL denoising for faster ray trace convergence

Re-evaluate long held practices in light of new realities:


geocache_360


Outline of Graphics/GPU background + Application to neutrino telescopes

/env/presentation/newtons-opticks.png

Amdahls "Law" : Expected Speedup Limited by Serial Processing

optical photon simulation, P ~ 99% of CPU time

Must consider processing "big picture"


Understanding GPU Graphical Origins -> Effective GPU Computation

GPUs evolved to rasterize 3D graphics at 30/60 fps

Simple Array Data Structures (N-million,4)

Constant "Uniform" 4x4 matrices : scaling+rotation+translation

Graphical Experience Informs Fast Computation on GPUs


CPU Optimizes Latency, GPU Optimizes Throughput

/env/presentation/nvidia/cpu_vs_gpu_architecture.png

Waiting for memory read/write, is major source of latency...

CPU : latency-oriented : Minimize time to complete single task : avoid latency with caching
  • complex : caching system, branch prediction, speculative execution, ...
GPU : throughput-oriented : Maximize total work per unit time : hide latency with parallelism
  • many simple processing cores, hardware multithreading, SIMD (single instruction multiple data)
  • simpler : lots of compute (ALU), at expense of cache+control
  • can tolerate latency, by assuming abundant other tasks to resume : design assumes parallel workload
Totally different processor architecture -> Total reorganization of data and computation
  • major speedups typically require total rethink of data structures and computation

How to Make Effective Use of GPUs ? Parallel / Simple / Uncoupled

Abundant parallelism
  • many thousands of tasks (ideally millions)
Low register usage : otherwise limits concurrent threads
  • simple kernels, avoid branching
Little/No Synchronization
  • avoid waiting, avoid complex code/debugging
Minimize CPU<->GPU copies
  • reuse GPU buffers across multiple CUDA launches
/env/presentation/1px.png

How Many Threads to Launch ?

Understanding Throughput-oriented Architectures https://cacm.acm.org/magazines/2010/11/100622-understanding-throughput-oriented-architectures/fulltext

NVIDIA Titan V: 80 SM, 5120 CUDA cores


GPU Demands Simplicity (Arrays) -> Big Benefits : NumPy + CuPy

Separate address space -> cudaMemcpy -> Serialization
upload/download : host(CPU)<->device(GPU)

Object-oriented : mixes data and compute

Array-oriented : separate data from compute

NumPy : standard array handling package

https://realpython.com/numpy-array-programming/


Survey of High Level General Purpose CUDA Packages

C++ Based Interfaces to CUDA

Mature NVIDIA Basis Libraries

RAPIDS : New NVIDIA "Suite" of open source data science libs


Rendering Five Decades of Research 1


Rendering Five Decades of Research 2


Project Sol


Path Tracing in Production 1


Path Tracing in Production 2


The Rendering Equation 1


The Rendering Equation 2


Samples per Pixel 1


Samples per Pixel 2


NVIDIA OptiX AI Denoiser 1


NVIDIA OptiX AI Denoiser 2


https://research.nvidia.com/publication/interactive-reconstruction-monte-carlo-image-sequences-using-recurrent-denoising


Physically Based Rendering Book : www.pbr-book.org



Optical Simulation : Computer Graphics vs Physics

CG Rendering "Simulation" Particle Physics Simulation
simulates: image formation, vision simulates photons: generation, propagation, detection
(red, green, blue) wavelength range eg 400-700 nm
ignore polarization polarization vector propagated throughout
participating media: clouds,fog,fire [1] bulk scattering: Rayleigh, MIE
human exposure times nanosecond time scales
equilibrium assumption transient phenomena
ignores light speed, time arrival time crucial, speed of light : 30 cm/ns

Despite differences many techniques+hardware+software directly applicable to physics eg:

Potentially Useful CG techniques for "billion photon simulations"

[1] search for: "Volumetric Rendering Equation"


Neutrino Telescope Optical Simulations : Giga-Photon Propagations

Opticks as drop in fast replacement for Geant4

Full+fast GPU accelerated simulation:

Re-usage is caching optimization, still need full propagation:


Opticks Rayleigh Scattering : CUDA line-by-line port of G4OpRayleigh

130 __device__ void rayleigh_scatter(Photon &p, curandState &rng)
131 {
137     float3 newDirection, newPolarization ;
139     float cosTheta ;
141     do {
145         newDirection = uniform_sphere(&rng);
146         rotateUz(newDirection, p.direction );
151
152         float constant = -dot(newDirection,p.polarization);
153         newPolarization = p.polarization + constant*newDirection ;
154
155         // newPolarization
156         // 1. transverse to newDirection (as that component is subtracted)
157         // 2. same plane as old p.polarization and newDirection (by construction)
158         // 
...         ... corner case elided ...
182         if(curand_uniform(&rng) < 0.5f) newPolarization = -newPolarization ;
184
185         newPolarization = normalize(newPolarization);
189         cosTheta = dot(newPolarization,p.polarization) ;
190
191     } while ( cosTheta*cosTheta < curand_uniform(&rng)) ;
192
193     p.direction = newDirection ;
194     p.polarization = newPolarization ;
195 }

Have to persist the polarization vector, to truly resume a propagation

https://bitbucket.org/simoncblyth/opticks/src/master/optixrap/cu/rayleigh.h


Developing a photon "snapshot" cache

Where/when/what to collect ?

Too many options: experimentation needed to iterate towards solution

[1] RTX Beyond Ray Tracing: Exploring the Use of Hardware Ray Tracing Cores for Tet-Mesh Point Location https://www.willusher.io/publications/rtx-points


Photon Mapping 1


Photon Mapping 2


Conclusion

/env/presentation/1px.png

Opticks : state-of-the-art GPU ray tracing applied to optical photon simulation and integrated with Geant4, eliminating memory and time bottlenecks.

  • neutrino telescope simulation can benefit drastically from Opticks
    • Drastic speedup -> better detector understanding -> greater precision
    • more photon limited -> more overall speedup ( 99.9% -> 1000x )
    • graphics : rich source of techniques, inspiration, CUDA code to try
/env/presentation/1px.png
https://bitbucket.org/simoncblyth/opticks code repository
https://simoncblyth.bitbucket.io presentations and videos
https://groups.io/g/opticks forum/mailing list archive
email:opticks+subscribe@groups.io subscribe to mailing list