Opticks : GPU Optical Photon Simulation for Particle Physics with NVIDIA OptiX

Opticks: GPU photon simulation via NVIDIA OptiX ;
Applied to neutrino telescope simulations ?

Opticks : GPU photon simulation via NVIDIA® OptiX™
+ GPU/Graphics background
+ Application to neutrino telescope simulations ?

Open source, https://bitbucket.org/simoncblyth/opticks

Simon C Blyth, IHEP, CAS — August 2020, SJTU, Neutrino Telescope Simulation Workshop

Outline Opticks


Outline of Graphics/GPU background + Application to neutrino telescopes




Geant4 : Monte Carlo Simulation Toolkit

Geant4 : Monte Carlo Simulation Toolkit Generality

Optical Photon Simulation Problem...

Optical Photon Simulation ≈ Ray Traced Image Rendering

Much in common : geometry, light sources, optical physics

Many Applications of ray tracing :

Ray-tracing vs Rasterization

/env/presentation/nvidia/nv_rasterization.png /env/presentation/nvidia/nv_raytrace.png


10 Giga Rays/s


NVIDIA RTX Metro Exodus

Spatial Index Acceleration Structure

NVIDIA® OptiX™ Ray Tracing Engine -- http://developer.nvidia.com/optix

OptiX makes GPU ray tracing accessible

NVIDIA expertise:

Opticks provides (Yellow):

[1] Turing RTX GPUs


Opticks : Translates G4 Optical Physics to CUDA/OptiX

OptiX : single-ray programming model -> line-by-line translation

CUDA Ports of Geant4 classes
  • G4Cerenkov (only generation loop)
  • G4Scintillation (only generation loop)
  • G4OpAbsorption
  • G4OpRayleigh
  • G4OpBoundaryProcess (only a few surface types)
Modify Cherenkov + Scintillation Processes
  • collect genstep, copy to GPU for generation
  • avoids copying millions of photons to GPU
Scintillator Reemission
  • fraction of bulk absorbed "reborn" within same thread
  • wavelength generated by reemission texture lookup
Opticks (OptiX/Thrust GPU interoperation)
  • OptiX : upload gensteps
  • Thrust : seeding, distribute genstep indices to photons
  • OptiX : launch photon generation and propagation
  • Thrust : pullback photons that hit PMTs
  • Thrust : index photon step sequences (optional)

G4Solid -> CUDA Intersect Functions for ~10 Primitives


Sphere, Cylinder, Disc, Cone, Convex Polyhedron, Hyperboloid, Torus, ...

G4Boolean -> CUDA/OptiX Intersection Program Implementing CSG

Complete Binary Tree, pick between pairs of nearest intersects:

UNION tA < tB Enter B Exit B Miss B
Enter A ReturnA LoopA ReturnA
Exit A ReturnA ReturnB ReturnA
Miss A ReturnB ReturnB ReturnMiss
[1] Ray Tracing CSG Objects Using Single Hit Intersections, Andrew Kensler (2006)
with corrections by author of XRT Raytracer http://xrt.wikidot.com/doc:csg
[2] https://bitbucket.org/simoncblyth/opticks/src/tip/optixrap/cu/csg_intersect_boolean.h
Similar to binary expression tree evaluation using postorder traverse.

Opticks : Translates G4 Geometry to GPU, Without Approximation

G4 Structure Tree -> Instance+Global Arrays -> OptiX

Group structure into repeated instances + global remainder:

instancing -> huge memory savings for JUNO PMTs




Validation of Opticks Simulation by Comparison with Geant4

Bi-simulations of all JUNO solids, with millions of photons

mis-aligned histories
mostly < 0.25%, < 0.50% for largest solids
deviant photons within matched history
< 0.05% (500/1M)

Primary sources of problems

Primary cause : float vs double

Geant4 uses double everywhere, Opticks only sparingly (observed double costing 10x slowdown with RTX)



Recording the steps of Millions of Photons

Up to 16 steps of the photon propagation are recorded.

Photon Array : 4 * float4 = 512 bits/photon

Step Record Array : 2 * short4 = 2*16*4 = 128 bits/record

Compression uses known domains of position (geometry center, extent), time (0:200ns), wavelength, polarization.


Performance : Scanning from 1M to 400M Photons

Full JUNO Analytic Geometry j1808v5

Production Mode : does the minimum

Multi-Event Running, Measure:

avg time between successive launches, including overheads: (upload gensteps + launch + download hits)
avg of 10 OptiX launches

NVIDIA Quadro RTX 8000 (48G)

谢谢 NVIDIA China
for loaning the card


scan-pf-1_Opticks_vs_Geant4 2

JUNO analytic, 400M photons from center Speedup
Geant4 Extrap. 95,600 s (26 hrs)  
Opticks RTX ON (i) 58 s 1650x

scan-pf-1_Opticks_Speedup 2

JUNO analytic, 400M photons from center Speedup
Opticks RTX ON (i) 58s 1650x
Opticks RTX OFF (i) 275s 350x
Geant4 Extrap. 95,600s (26 hrs)  


5x Speedup from RTX with JUNO analytic geometry

Useful Speedup > 1500x : But Why Not Giga Rays/s ? (1 Photon ~10 Rays)

OptiX Performance Tools and Tricks, David Hart, NVIDIA https://developer.nvidia.com/siggraph/2019/video/sig915-vid

Where Next for Opticks ?

JUNO+Opticks into Production

Geant4+Opticks Integration : Work with Geant4 Collaboration

Alpha Development ------>-----------------> Robust Tool

Drastically Improved Optical Photon Simulation Performance...

Three revolutions reinforcing each other:

Deep rivers of development, ripe for re-purposing

Example : DL denoising for faster ray trace convergence

Re-evaluate long held practices in light of new realities:


Outline of Graphics/GPU background + Application to neutrino telescopes


Amdahls "Law" : Expected Speedup Limited by Serial Processing

optical photon simulation, P ~ 99% of CPU time

Must consider processing "big picture"

Understanding GPU Graphical Origins -> Effective GPU Computation

GPUs evolved to rasterize 3D graphics at 30/60 fps

Simple Array Data Structures (N-million,4)

Constant "Uniform" 4x4 matrices : scaling+rotation+translation

Graphical Experience Informs Fast Computation on GPUs

CPU Optimizes Latency, GPU Optimizes Throughput


Waiting for memory read/write, is major source of latency...

CPU : latency-oriented : Minimize time to complete single task : avoid latency with caching
  • complex : caching system, branch prediction, speculative execution, ...
GPU : throughput-oriented : Maximize total work per unit time : hide latency with parallelism
  • many simple processing cores, hardware multithreading, SIMD (single instruction multiple data)
  • simpler : lots of compute (ALU), at expense of cache+control
  • can tolerate latency, by assuming abundant other tasks to resume : design assumes parallel workload
Totally different processor architecture -> Total reorganization of data and computation
  • major speedups typically require total rethink of data structures and computation

How to Make Effective Use of GPUs ? Parallel / Simple / Uncoupled

Abundant parallelism
  • many thousands of tasks (ideally millions)
Low register usage : otherwise limits concurrent threads
  • simple kernels, avoid branching
Little/No Synchronization
  • avoid waiting, avoid complex code/debugging
Minimize CPU<->GPU copies
  • reuse GPU buffers across multiple CUDA launches

How Many Threads to Launch ?

Understanding Throughput-oriented Architectures https://cacm.acm.org/magazines/2010/11/100622-understanding-throughput-oriented-architectures/fulltext

NVIDIA Titan V: 80 SM, 5120 CUDA cores

GPU Demands Simplicity (Arrays) -> Big Benefits : NumPy + CuPy

Separate address space -> cudaMemcpy -> Serialization
upload/download : host(CPU)<->device(GPU)

Object-oriented : mixes data and compute

Array-oriented : separate data from compute

NumPy : standard array handling package


Survey of High Level General Purpose CUDA Packages

C++ Based Interfaces to CUDA

Mature NVIDIA Basis Libraries

RAPIDS : New NVIDIA "Suite" of open source data science libs

Rendering Five Decades of Research 1

Rendering Five Decades of Research 2

Project Sol

Path Tracing in Production 1

Path Tracing in Production 2

The Rendering Equation 1

The Rendering Equation 2

Samples per Pixel 1

Samples per Pixel 2

NVIDIA OptiX AI Denoiser 1

NVIDIA OptiX AI Denoiser 2


Physically Based Rendering Book : www.pbr-book.org

Optical Simulation : Computer Graphics vs Physics

CG Rendering "Simulation" Particle Physics Simulation
simulates: image formation, vision simulates photons: generation, propagation, detection
(red, green, blue) wavelength range eg 400-700 nm
ignore polarization polarization vector propagated throughout
participating media: clouds,fog,fire [1] bulk scattering: Rayleigh, MIE
human exposure times nanosecond time scales
equilibrium assumption transient phenomena
ignores light speed, time arrival time crucial, speed of light : 30 cm/ns

Despite differences many techniques+hardware+software directly applicable to physics eg:

Potentially Useful CG techniques for "billion photon simulations"

[1] search for: "Volumetric Rendering Equation"

Neutrino Telescope Optical Simulations : Giga-Photon Propagations

Opticks as drop in fast replacement for Geant4

Full+fast GPU accelerated simulation:

Re-usage is caching optimization, still need full propagation:

Opticks Rayleigh Scattering : CUDA line-by-line port of G4OpRayleigh

130 __device__ void rayleigh_scatter(Photon &p, curandState &rng)
131 {
137     float3 newDirection, newPolarization ;
139     float cosTheta ;
141     do {
145         newDirection = uniform_sphere(&rng);
146         rotateUz(newDirection, p.direction );
152         float constant = -dot(newDirection,p.polarization);
153         newPolarization = p.polarization + constant*newDirection ;
155         // newPolarization
156         // 1. transverse to newDirection (as that component is subtracted)
157         // 2. same plane as old p.polarization and newDirection (by construction)
158         // 
...         ... corner case elided ...
182         if(curand_uniform(&rng) < 0.5f) newPolarization = -newPolarization ;
185         newPolarization = normalize(newPolarization);
189         cosTheta = dot(newPolarization,p.polarization) ;
191     } while ( cosTheta*cosTheta < curand_uniform(&rng)) ;
193     p.direction = newDirection ;
194     p.polarization = newPolarization ;
195 }

Have to persist the polarization vector, to truly resume a propagation


Developing a photon "snapshot" cache

Where/when/what to collect ?

Too many options: experimentation needed to iterate towards solution

[1] RTX Beyond Ray Tracing: Exploring the Use of Hardware Ray Tracing Cores for Tet-Mesh Point Location https://www.willusher.io/publications/rtx-points

Photon Mapping 1

Photon Mapping 2



Opticks : state-of-the-art GPU ray tracing applied to optical photon simulation and integrated with Geant4, eliminating memory and time bottlenecks.

  • neutrino telescope simulation can benefit drastically from Opticks
    • Drastic speedup -> better detector understanding -> greater precision
    • more photon limited -> more overall speedup ( 99.9% -> 1000x )
    • graphics : rich source of techniques, inspiration, CUDA code to try
https://bitbucket.org/simoncblyth/opticks code repository
https://simoncblyth.bitbucket.io presentations and videos
https://groups.io/g/opticks forum/mailing list archive
email:opticks+subscribe@groups.io subscribe to mailing list