Opticks : GPU Optical Photon Simulation for Particle Physics with NVIDIA OptiX

Opticks : GPU photon simulation via NVIDIA® OptiX™
+ GPU/Graphics background
+ Application to neutrino telescope simulations ?

Open source, https://bitbucket.org/simoncblyth/opticks

Simon C Blyth, IHEP, CAS — August 2020, SJTU, Neutrino Telescope Simulation Workshop

Outline Opticks

Context and Problem
- Jiangmen Underground Neutrino Observatory (JUNO)
- Optical Photon Simulation Problem...
Tools to create Solution
- Optical Photon Simulation ≈ Ray Traced Image Rendering
- Rasterization and Ray tracing
- Turing Built for RTX
- BVH : Bounding Volume Hierarchy
- NVIDIA OptiX Ray Tracing Engine
Opticks : The Solution
- Geant4 + Opticks Hybrid Workflow : External Optical Photon Simulation
- Opticks : Translates G4 Optical Physics to CUDA/OptiX
- Opticks : Translates G4 Geometry to GPU, Without Approximation
- CUDA/OptiX Intersection Functions for ~10 Primitives
- CUDA/OptiX Intersection Functions for Arbitrarily Complex CSG Shapes
Validation and Performance
- Random Aligned Bi-Simulation -> Direct Array Comparison
- Performance Scanning from 1M to 400M Photons
Overview + Links

Outline of Graphics/GPU background + Application to neutrino telescopes

GPU + Parallel Processing Background
- Amdahl's "Law" : Expected speedup limited by serial processing
- Understanding GPU Graphical Origins -> Effective GPU Computation
- CPU Optimizes Latency, GPU Optimizes Throughput
- How to make effective use of GPUs ? Parallel/Simple/Uncoupled
- GPU Demands Simplicity (Arrays) -> Big Benefits : NumPy + CuPy
- Survey of High Level General Purpose CUDA Packages
Graphics History/Background
- 50 years of rendering progress
- 2018 : NVIDIA RTX : Project Sol Demo
- Monte Carlo Path Tracing in Movie Production
- Fundamental "Rendering Equation" of Computer Graphics
- Neumann Series solution of Rendering Equation
- Noise : Problem with Monte Carlo Path Tracing
- NVIDIA OptiX Denoiser
- Physically Based Rendering Book : Free Online
- Optical Simulations : Graphics vs Physics
Neutrino Telescope Optical simulations
- Giga-photon propagations : Re-usable photon "snapshots"
- Opticks Rayleigh Scattering : CUDA line-by-line port of G4OpRayleigh
- Developing a photon "snapshot" cache
- Photon Mapping
Summary

JUNO_Intro_2

JUNO_Intro_3

Geant4 : Monte Carlo Simulation Toolkit

Geant4 : Monte Carlo Simulation Toolkit Generality

Optical Photon Simulation Problem...

Optical Photon Simulation ≈ Ray Traced Image Rendering

Much in common : geometry, light sources, optical physics

simulation : photon parameters at PMT detectors
rendering : pixel values at image plane
both limited by ray geometry intersection, aka ray tracing

Many Applications of ray tracing :

advertising, design, architecture, films, games,...
-> huge efforts to improve hw+sw over 30 yrs

Ray-tracing vs Rasterization

/env/presentation/nvidia/nv_rasterization.png

/env/presentation/nvidia/nv_raytrace.png

SIGGRAPH_2018_Announcing_Worlds_First_Ray_Tracing_GPU

10 Giga Rays/s

TURING BUILT FOR RTX 2

NVIDIA RTX Metro Exodus

`Spatial Index Acceleration Structure`

NVIDIA® OptiX™ Ray Tracing Engine -- http://developer.nvidia.com/optix

OptiX makes GPU ray tracing accessible

accelerates ray-geometry intersections
simple : single-ray programming model
"...free to use within any application..."
access RT Cores[1] with OptiX 6.0.0+ via RTX™ mode

NVIDIA expertise:

~linear scaling up to 4 GPUs
acceleration structure creation + traversal (Blue)
instanced sharing of geometry + acceleration structures
compiler optimized for GPU ray tracing

Opticks provides (Yellow):

ray generation program
ray geometry intersection+bbox programs

[1] Turing RTX GPUs

Geant4OpticksWorkflow

Opticks : Translates G4 Optical Physics to CUDA/OptiX

OptiX : single-ray programming model -> line-by-line translation

CUDA Ports of Geant4 classes

G4Cerenkov (only generation loop)
G4Scintillation (only generation loop)
G4OpAbsorption
G4OpRayleigh
G4OpBoundaryProcess (only a few surface types)

Modify Cherenkov + Scintillation Processes

collect genstep, copy to GPU for generation
avoids copying millions of photons to GPU

Scintillator Reemission

fraction of bulk absorbed "reborn" within same thread
wavelength generated by reemission texture lookup

Opticks (OptiX/Thrust GPU interoperation)

OptiX : upload gensteps
Thrust : seeding, distribute genstep indices to photons
OptiX : launch photon generation and propagation
Thrust : pullback photons that hit PMTs
Thrust : index photon step sequences (optional)
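
The seeding step can be pictured with a small CPU sketch. This is not the Thrust implementation: it assumes each genstep carries a photon count, and the name `seed_photons` and the example counts are hypothetical.

```python
from bisect import bisect_right
from itertools import accumulate

def seed_photons(photon_counts):
    """Return, for every photon slot, the index of the genstep it belongs to.

    Hypothetical helper for illustration only, not an Opticks function.
    """
    # Running totals give the end offset of each genstep's photon range,
    # e.g. counts [3, 0, 2] -> ends [3, 3, 5].
    ends = list(accumulate(photon_counts))
    total = ends[-1] if ends else 0
    # Photon slot i belongs to the first genstep whose end offset exceeds i.
    return [bisect_right(ends, i) for i in range(total)]

seeds = seed_photons([3, 0, 2])
print(seeds)  # [0, 0, 0, 2, 2]
```

Each GPU thread would then read its genstep through its seed, so only the small genstep array needs uploading.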

G4VSolid -> CUDA Intersect Functions for ~10 Primitives

3D parametric ray : ray(x,y,z;t) = rayOrigin + t * rayDirection
implicit equation of primitive : f(x,y,z) = 0
-> polynomial in t , roots: t > t_min -> intersection positions + surface normals

/env/presentation/tboolean_parade_sep2017.png

Sphere, Cylinder, Disc, Cone, Convex Polyhedron, Hyperboloid, Torus, ...
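
For the sphere, substituting the parametric ray into the implicit equation gives a quadratic in t. The sketch below is a CPU illustration of that recipe only, assuming a non-zero ray direction; the function name and argument layout are hypothetical and it is not the Opticks CUDA code.

```python
import math

def intersect_sphere(origin, direction, center, radius, t_min=0.0):
    """Return (t, normal) for the nearest root with t > t_min, or None on a miss."""
    ox, oy, oz = (origin[i] - center[i] for i in range(3))
    dx, dy, dz = direction
    # |origin + t*direction - center|^2 - radius^2 = 0  ->  a*t^2 + 2*b*t + c = 0
    a = dx * dx + dy * dy + dz * dz
    b = ox * dx + oy * dy + oz * dz
    c = ox * ox + oy * oy + oz * oz - radius * radius
    disc = b * b - a * c
    if disc < 0.0:
        return None                      # no real roots: the ray misses
    sdisc = math.sqrt(disc)
    for t in ((-b - sdisc) / a, (-b + sdisc) / a):   # roots in ascending order
        if t > t_min:
            p = [origin[i] + t * direction[i] for i in range(3)]
            normal = [(p[i] - center[i]) / radius for i in range(3)]  # outward normal
            return t, normal
    return None

hit = intersect_sphere((0.0, 0.0, -5.0), (0.0, 0.0, 1.0), (0.0, 0.0, 0.0), 1.0)
print(hit)
```

Raising `t_min` past the first root returns the far-side intersection instead, which is how a caller can step through a solid.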

G4Boolean -> CUDA/OptiX Intersection Program Implementing CSG

Complete Binary Tree, pick between pairs of nearest intersects:

UNION tA < tB	Enter B	Exit B	Miss B
Enter A	ReturnA	LoopA	ReturnA
Exit A	ReturnA	ReturnB	ReturnA
Miss A	ReturnB	ReturnB	ReturnMiss
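
The table above can be held as a plain lookup keyed on the two intersect states. This sketch transcribes only the UNION, tA < tB case shown here; the dictionary layout and function name are illustrative assumptions, not the Opticks data structure.

```python
# State of each ray-solid intersect: entering, exiting or missing the solid.
ENTER, EXIT, MISS = "Enter", "Exit", "Miss"

# Action table for UNION when tA < tB, transcribed from the slide.
# Keys are (state of A, state of B). Layout is hypothetical.
UNION_A_CLOSER = {
    (ENTER, ENTER): "ReturnA", (ENTER, EXIT): "LoopA",   (ENTER, MISS): "ReturnA",
    (EXIT,  ENTER): "ReturnA", (EXIT,  EXIT): "ReturnB", (EXIT,  MISS): "ReturnA",
    (MISS,  ENTER): "ReturnB", (MISS,  EXIT): "ReturnB", (MISS,  MISS): "ReturnMiss",
}

def union_action(state_a, state_b):
    """Look up which intersect to return, or whether to advance t_min and loop."""
    return UNION_A_CLOSER[(state_a, state_b)]

print(union_action(ENTER, EXIT))  # LoopA
```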

Nearest hit intersect algorithm [1] avoids state
- sometimes Loop : advance t_min , re-intersect both
- classification shows if inside/outside
Evaluative [2] implementation emulates recursion:
- recursion not allowed in OptiX intersect programs
- bit twiddle traversal of complete binary tree
- stacks of postorder slices and intersects
Identical geometry to Geant4
- solving the same polynomials
- near perfect intersection match
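
A recursion-free postorder traversal of a complete binary tree can be done with bit operations alone. The sketch below assumes 1-based level-order node indices (root = 1, children of i are 2i and 2i+1); it illustrates the idea and is not copied from the Opticks header.

```python
def postorder_sequence(height):
    """Postorder node indices of a complete binary tree, without recursion.

    Assumes 1-based level-order indexing; height 0 is a single root node.
    """
    order = []
    i = 1 << height                      # leftmost leaf starts the postorder
    while True:
        order.append(i)
        if i == 1:                       # the root is always visited last
            break
        if i & 1:                        # right child: next is the parent
            i >>= 1
        else:                            # left child: next is the leftmost leaf under the sibling
            elevation = height - (i.bit_length() - 1)
            i = (i << elevation) + (1 << elevation)
    return order

print(postorder_sequence(2))  # [4, 5, 2, 6, 7, 3, 1]
```

Because children always precede their parent in this order, intersects of the two subtrees are available on a stack by the time an operator node is reached.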

[1] Ray Tracing CSG Objects Using Single Hit Intersections, Andrew Kensler (2006): with corrections by author of XRT Raytracer http://xrt.wikidot.com/doc:csg
[2] https://bitbucket.org/simoncblyth/opticks/src/tip/optixrap/cu/csg_intersect_boolean.h: Similar to binary expression tree evaluation using postorder traverse.

Opticks : Translates G4 Geometry to GPU, Without Approximation

Material/Surface/Scintillator properties

interpolated to standard wavelength domain
interleaved into "boundary" texture
"reemission" texture for wavelength generation

Material/surface boundary : 4 indices

outer material (parent)
outer surface (inward photons, parent -> self)
inner surface (outward photons, self -> parent)
inner material (self)

Primitives labelled with unique boundary index

ray primitive intersection -> boundary index
texture lookup -> material/surface properties

simple/fast properties + reemission wavelength
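
Interpolating a tabulated property onto a common wavelength domain could look like the following sketch. It assumes piecewise-linear interpolation clamped at the table ends; the domain limits and step in the example are placeholders rather than the values Opticks uses, and the function names are hypothetical.

```python
from bisect import bisect_right

def interpolate(wavelengths, values, w):
    """Piecewise-linear interpolation, clamped to the end values outside the table."""
    if w <= wavelengths[0]:
        return values[0]
    if w >= wavelengths[-1]:
        return values[-1]
    j = bisect_right(wavelengths, w)          # wavelengths[j-1] <= w < wavelengths[j]
    w0, w1 = wavelengths[j - 1], wavelengths[j]
    f = (w - w0) / (w1 - w0)
    return values[j - 1] + f * (values[j] - values[j - 1])

def to_standard_domain(wavelengths, values, start, stop, step):
    """Resample one property onto an evenly spaced domain (limits are placeholders)."""
    n = int(round((stop - start) / step)) + 1
    domain = [start + i * step for i in range(n)]
    return domain, [interpolate(wavelengths, values, w) for w in domain]

domain, resampled = to_standard_domain([400, 500, 700], [1.0, 2.0, 4.0], 300, 700, 100)
print(domain, resampled)
```

Once every property shares one domain, rows for different materials and surfaces can be stacked into a single texture and indexed by boundary.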

G4 Structure Tree -> Instance+Global Arrays -> OptiX

Group structure into repeated instances + global remainder:

auto-identify repeated geometry with "progeny digests"
- JUNO : 9 distinct instances + 1 global
instance transforms used in OptiX/OpenGL geometry

instancing -> huge memory savings for JUNO PMTs
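
One way to picture a "progeny digest" is a hash of a node's shape combined with the digests of everything beneath it, so that identical subtrees hash identically wherever they are placed. The node layout, the use of MD5, and the function names below are assumptions made for this sketch only.

```python
import hashlib
from collections import Counter

def progeny_digest(node, counts):
    """Digest of a node's shape plus the digests of all its descendants."""
    h = hashlib.md5()
    h.update(node["shape"].encode())
    for child in node.get("children", []):
        h.update(progeny_digest(child, counts).encode())
    digest = h.hexdigest()
    counts[digest] += 1                   # tally how often each subtree occurs
    return digest

def repeated_digests(root, min_repeats=2):
    """Return {digest: count} for subtrees occurring at least min_repeats times."""
    counts = Counter()
    progeny_digest(root, counts)
    return {d: n for d, n in counts.items() if n >= min_repeats}

# Hypothetical toy geometry: three identical PMTs plus one tank.
pmt = {"shape": "pmt_body", "children": [{"shape": "cathode"}]}
world = {"shape": "world", "children": [pmt, pmt, pmt, {"shape": "tank"}]}
print(sorted(repeated_digests(world).values()))  # [3, 3]
```

The placement transform is deliberately left out of the digest here, so the same solid at different positions is recognised as one instance.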

j1808_top_rtx

j1808_top_ogl

Validation of Opticks Simulation by Comparison with Geant4

Bi-simulations of all JUNO solids, with millions of photons

mis-aligned histories: mostly < 0.25%, < 0.50% for largest solids
deviant photons within matched history: < 0.05% (500/1M)

Primary sources of problems

grazing incidence, edge skimmers
incidence at constituent solid boundaries

Primary cause : float vs double

Geant4 uses double precision everywhere; Opticks uses it only sparingly (double was observed to cost a 10x slowdown with RTX)

Conclude

neatly oriented photons more prone to issues than realistic ones
perfect "technical" matching not feasible
instead shift validation to more realistic full detector "calibration" situation

scan-pf-check-GUI-TO-SC-BT5-SD

Recording the steps of Millions of Photons

Domain compression to fit in VRAM

16 step records per photon -> 256 bytes/photon
10M photons -> 2.56 GB

4-bit History Flags at Each Step

BT : boundary
BR : boundary reflect
SC : bulk scatter
AB : bulk absorb
SD : surface detect
SA : surface absorb

seqhis: 64-bit integer history sequence

Up to 16 steps of the photon propagation are recorded.
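
Packing 16 four-bit flags into one 64-bit integer can be sketched as below. The numeric codes and the choice of putting the first step in the least significant nibble are assumptions for illustration; they are not taken from the Opticks sources.

```python
# Hypothetical 4-bit codes, chosen only for this sketch; 0 marks an unused slot.
FLAGS = {"BT": 1, "BR": 2, "SC": 3, "AB": 4, "SD": 5, "SA": 6}
NAMES = {code: name for name, code in FLAGS.items()}
MAX_STEPS = 16   # 16 steps * 4 bits = 64 bits

def pack_seqhis(steps):
    """Pack up to 16 flag names into one integer, first step in the lowest nibble."""
    seqhis = 0
    for slot, name in enumerate(steps[:MAX_STEPS]):
        seqhis |= FLAGS[name] << (4 * slot)
    return seqhis

def unpack_seqhis(seqhis):
    """Recover the flag names, stopping at the first empty nibble."""
    steps = []
    for slot in range(MAX_STEPS):
        code = (seqhis >> (4 * slot)) & 0xF
        if code == 0:
            break
        steps.append(NAMES[code])
    return steps

seqhis = pack_seqhis(["SC", "BT", "BT", "SD"])
print(hex(seqhis), unpack_seqhis(seqhis))  # 0x5113 ['SC', 'BT', 'BT', 'SD']
```

Photons with identical histories then share one integer value, so history categories can be counted with an ordinary sort or histogram.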

Photon Array : 4 * float4 = 512 bits/photon

float4: position, time [32 * 4 = 128 bits]
float4: direction, weight
float4: polarization, wavelength
float4: flags: material, boundary, history

Step Record Array : 2 * short4 = 2*16*4 = 128 bits/record

short4: position, time (snorm compressed) [4*16 = 64 bits]
uchar4: polarization, wavelength (uchar compressed) [4*8 = 32 bits]
uchar4: material, history flags [4*8 = 32 bits]

Compression uses known domains of position (geometry center, extent), time (0:200ns), wavelength, polarization.
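
Domain compression of a coordinate into a signed 16-bit integer can be sketched as follows. The scaling by 32767, the clamping, and the example extent are assumptions for illustration, not the exact Opticks encoding.

```python
def compress_snorm16(value, center, extent):
    """Map a value in [center - extent, center + extent] to a signed 16-bit integer."""
    scaled = (value - center) / extent            # nominally in [-1, 1]
    scaled = max(-1.0, min(1.0, scaled))          # clamp out-of-domain values
    return int(round(scaled * 32767.0))

def decompress_snorm16(code, center, extent):
    """Invert compress_snorm16 up to the quantization error."""
    return center + (code / 32767.0) * extent

extent = 20000.0                                  # hypothetical geometry extent in mm
code = compress_snorm16(1234.5, 0.0, extent)
print(code, decompress_snorm16(code, 0.0, extent))
```

The round-trip error is at most half a quantization step, about extent/65534, which is the resolution given up in exchange for fitting step records into 128 bits.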

scan-pf-check-GUI-TO-BT5-SD

Performance : Scanning from 1M to 400M Photons

Full JUNO Analytic Geometry j1808v5

"calibration source" genstep at center of scintillator

Production Mode : does the minimum

only saves hits
skips : genstep, photon, source, record, sequence, index, ..
no Geant4 propagation (other than at 1M for extrapolation)

Multi-Event Running, Measure:

interval: avg time between successive launches, including overheads: (upload gensteps + launch + download hits)
launch: avg of 10 OptiX launches

overheads < 10% beyond 20M photons

`NVIDIA Quadro RTX 8000 (48G)`

Thanks to NVIDIA China for loaning the card

scan-pf-1_NHit

scan-pf-1_Opticks_vs_Geant4 2

JUNO analytic, 400M photons from center		Speedup
Geant4 Extrap.	95,600 s (26 hrs)
Opticks RTX ON (i)	58 s	1650x

scan-pf-1_Opticks_Speedup 2

JUNO analytic, 400M photons from center		Speedup
Opticks RTX ON (i)	58 s	1650x
Opticks RTX OFF (i)	275 s	350x
Geant4 Extrap.	95,600 s (26 hrs)

scan-pf-1_RTX_Speedup

5x Speedup from RTX with JUNO analytic geometry

Useful Speedup > 1500x : But Why Not Giga Rays/s ? (1 Photon ~10 Rays)

Launch times for various geometries
Geometry	Launch (s)	Giga Rays/s	Relative to ana
JUNO ana	13.2	0.07
JUNO tri.sw	6.9	0.14	1.9x
JUNO tri.hw	2.2	0.45	6.0x

Boxtest ana	0.59	1.7
Boxtest tri.sw	0.62	1.6
Boxtest tri.hw	0.30	3.3	1.9x

ana : Opticks analytic CSG (SM)
tri.sw : software triangle intersect (SM)
tri.hw : hardware triangle intersect (RT)

JUNO 15k triangles, 132M without instancing

Simple Boxtest geometry gets into ballpark

NVIDIA claim : 10 Giga Rays/s with RT Core
-> 1 Billion photons per second
RT cores : built-in triangle intersect + 1-level of instancing
flatten scene model to avoid SM<->RT roundtrips ?

OptiX Performance Tools and Tricks, David Hart, NVIDIA https://developer.nvidia.com/siggraph/2019/video/sig915-vid

Where Next for Opticks ?

JUNO+Opticks into Production

optimize geometry modelling for RTX
full JUNO geometry validation iteration
JUNO offline integration
optimize GPU cluster throughput:
- split/join events to fit VRAM
- job/node/multi-GPU strategy
support OptiX 7, find multi-GPU load balancing approach

Geant4+Opticks Integration : Work with Geant4 Collaboration

finalize Geant4+Opticks extended example
- aiming for Geant4 distrib
prototype Genstep interface inside Geant4
- avoid customizing G4Cerenkov G4Scintillation

Alpha Development ------>-----------------> Robust Tool

many more users+developers required (current ~10+1)
if you have an optical photon simulation problem ...
- start by joining : https://groups.io/g/opticks

Drastically Improved Optical Photon Simulation Performance...

Three revolutions reinforcing each other:

games -> graphics revolution -> GPU -> cheap TFLOPS
internet scale big datasets -> ML revolution
computer vision revolution for autonomous vehicles

Deep rivers of development, ripe for re-purposing

analogous problems -> solutions
experience across fields essential to find+act on analogies

Example : DL denoising for faster ray trace convergence

analogous to hit aggregation
skip the hits, jump straight to DL smoothed probabilities
- blurs the line between simulation and reconstruction

Re-evaluate long held practices in light of new realities:

large ROOT format (C++ object) MC samples repeatedly converted+uploaded to GPU for DL training ... OR:
small Genstep NumPy arrays uploaded, dynamically simulated into GPU hit arrays in fractions of a second

Overview + Links

Opticks : state-of-the-art GPU ray tracing applied to optical photon simulation and integrated with Geant4, giving a leap in performance that eliminates memory and time bottlenecks.

Drastic speedup -> better detector understanding -> greater precision

any simulation limited by optical photons can benefit

more photon limited -> more overall speedup (99% -> 100x)

https://bitbucket.org/simoncblyth/opticks	code repository
https://simoncblyth.bitbucket.io	presentations and videos
https://groups.io/g/opticks	forum/mailing list archive
email:opticks+subscribe@groups.io	subscribe to mailing list

geocache_360

Amdahl's "Law" : Expected Speedup Limited by Serial Processing

optical photon simulation, P ~ 99% of CPU time

-> potential overall speedup S(n) is limited to 100x
even with parallel speedup factor >> 1500x
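
The limit follows from Amdahl's law, S(n) = 1 / ((1 - P) + P/n), where P is the fraction of run time that is accelerated and n is the speedup of that fraction. A short sketch of the arithmetic, with a hypothetical function name:

```python
def amdahl_speedup(parallel_fraction, parallel_speedup):
    """Overall speedup when only parallel_fraction of the run time is accelerated."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / parallel_speedup)

# P = 0.99 as on the slide; the overall speedup approaches but never reaches 1/(1-P) = 100.
for n in (10, 100, 1500, 1e9):
    print(n, round(amdahl_speedup(0.99, n), 1))
```

With P = 0.99 a 1500x photon speedup yields slightly under 94x overall, which is why the remaining serial 1% becomes the next bottleneck to address.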

Must consider processing "big picture"

remove bottlenecks one by one
re-evaluate "big picture" after each

Understanding GPU Graphical Origins -> Effective GPU Computation

GPUs evolved to rasterize 3D graphics at 30/60 fps

30/60 "launches" per second, each handling millions of items
literally billions of small "shader" programs run per second

Simple Array Data Structures (N-million,4)

millions of vertices, millions of triangles
vertex: (x y z w)
colors: (r g b a)

Constant "Uniform" 4x4 matrices : scaling+rotation+translation

4-component homogeneous coordinates -> easy projection

Graphical Experience Informs Fast Computation on GPUs

array shapes similar to graphics ones are faster
- "float4" 4*float(32bit) = 128 bit memory reads are favored
- Opticks photons use "float4x4" just like 4x4 matrices
GPU Launch frequency < ~30/60 per second
- avoid copy+launch overheads becoming significant
- ideally : handle millions of items in each launch

CPU Optimizes Latency, GPU Optimizes Throughput

/env/presentation/nvidia/cpu_vs_gpu_architecture.png

Waiting for memory read/write is a major source of latency...

CPU : latency-oriented : Minimize time to complete single task : avoid latency with caching

complex : caching system, branch prediction, speculative execution, ...

GPU : throughput-oriented : Maximize total work per unit time : hide latency with parallelism

many simple processing cores, hardware multithreading, SIMD (single instruction multiple data)
simpler : lots of compute (ALU), at expense of cache+control
can tolerate latency, by assuming abundant other tasks to resume : design assumes parallel workload

Totally different processor architecture -> Total reorganization of data and computation

major speedups typically require total rethink of data structures and computation

How to Make Effective Use of GPUs ? Parallel / Simple / Uncoupled

Abundant parallelism

many thousands of tasks (ideally millions)

Low register usage : otherwise limits concurrent threads

simple kernels, avoid branching

Little/No Synchronization

avoid waiting, avoid complex code/debugging

Minimize CPU<->GPU copies

reuse GPU buffers across multiple CUDA launches

How Many Threads to Launch ?

can (and should) launch many millions of threads
- largest Opticks launch : 400M threads, at VRAM limit
maximum thread launch size : so large it's irrelevant
maximum threads inflight : #SM*2048 = 80*2048 ~ 160k
- best latency hiding when launch > ~10x this ~ 1M

Understanding Throughput-oriented Architectures https://cacm.acm.org/magazines/2010/11/100622-understanding-throughput-oriented-architectures/fulltext

NVIDIA Titan V: 80 SM, 5120 CUDA cores

GPU Demands Simplicity (Arrays) -> Big Benefits : NumPy + CuPy

Persist everything to file -> fast development cycle

data portability into any environment
interactive debug/analysis : NumPy,IPython
flexible testing

Can transport everything across network:

production flexibility : distributed compute

Arrays for Everything -> direct access debug

(num_photons,4,4) float32
(num_photons,16,2,4) int16 : step records
(num_photons,2) uint64 : history flags
(num_gensteps,6,4) float32
(num_csgnodes,4,4) float32
(num_transforms,3,4,4) float32
(num_planes,4) float32
...
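
The memory cost of these arrays follows directly from shape and element type. The sketch below uses only the shapes listed above; the helper name and the item-size table are illustrative.

```python
from math import prod

ITEMSIZE = {"float32": 4, "int16": 2, "uint64": 8}   # bytes per element

def nbytes(shape, dtype):
    """Total bytes of a dense array with the given shape and element type."""
    return prod(shape) * ITEMSIZE[dtype]

num_photons = 1_000_000
print(nbytes((num_photons, 4, 4), "float32") // num_photons)     # 64 bytes per photon
print(nbytes((num_photons, 16, 2, 4), "int16") // num_photons)   # 256 bytes of step records per photon
```

The 256 bytes per photon for step records matches the figure quoted for recording 16 steps, so 10M photons need 2.56 GB for records alone.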

Separate address space -> cudaMemcpy -> Serialization: upload/download : host(CPU)<->device(GPU)

Serialize everything -> Arrays
Many small tasks -> Arrays
Random Access/Order undefined -> Arrays

Object-oriented : mixes data and compute

complicated serialization
good for complex systems, up to ~1000 objects

Array-oriented : separate data from compute

inherent serialization + simplicity
good for millions of element systems

NumPy : standard array handling package

simple .npy serialization
read/write NumPy arrays from C++ https://github.com/simoncblyth/np/blob/master/NP.hh

https://realpython.com/numpy-array-programming/

Survey of High Level General Purpose CUDA Packages

Learn CUDA basics (kernels, thread+memory hierarchy, ...)
- BUT: base development on higher level libs -> faster start

C++ Based Interfaces to CUDA

Thrust : https://developer.nvidia.com/Thrust
- C++ interface to CUDA performance
- high-level abstraction : reduce, scan, sort
CUB : http://nvlabs.github.io/cub/
- CUDA C++ specific, GPU less hidden
MGPU : https://github.com/moderngpu/moderngpu
- teaching tool : examples of CUDA algorithms

Mature NVIDIA Basis Libraries

cuRAND, cuFFT, cuBLAS, cuSOLVER, cuTENSOR, ...
- https://developer.nvidia.com/gpu-accelerated-libraries

RAPIDS : New NVIDIA "Suite" of open source data science libs

GPU-accelerated open source data science suite
- "... end-to-end data science workflows..." http://rapids.ai/
- cuDF : GPU dataframe library, Pandas-on-GPU

Rendering Five Decades of Research 1

Rendering Five Decades of Research 2

Project Sol

Path Tracing in Production 1

Path Tracing in Production 2

The Rendering Equation 1

The Rendering Equation 2

Samples per Pixel 1

Samples per Pixel 2

NVIDIA OptiX AI Denoiser 1

NVIDIA OptiX AI Denoiser 2

https://research.nvidia.com/publication/interactive-reconstruction-monte-carlo-image-sequences-using-recurrent-denoising

Physically Based Rendering Book : www.pbr-book.org

Optical Simulation : Computer Graphics vs Physics

CG Rendering "Simulation"	Particle Physics Simulation
simulates: image formation, vision	simulates photons: generation, propagation, detection
(red, green, blue)	wavelength range eg 400-700 nm
ignore polarization	polarization vector propagated throughout
participating media: clouds,fog,fire [1]	bulk scattering: Rayleigh, Mie
human exposure times	nanosecond time scales
equilibrium assumption	transient phenomena
ignores light speed, time	arrival time crucial, speed of light : 30 cm/ns

handling of time is the crucial difference

Despite differences many techniques+hardware+software directly applicable to physics eg:

GPU accelerated ray tracing (NVIDIA OptiX)
GPU accelerated property interpolation via textures (NVIDIA CUDA)
GPU acceleration structures (NVIDIA BVH)

Potentially Useful CG techniques for "billion photon simulations"

irradiance caching, photon mapping, progressive photon mapping

[1] search for: "Volumetric Rendering Equation"

Neutrino Telescope Optical Simulations : Giga-Photon Propagations

full simulation -> photon "snapshot" cache
- when crossing virtual segmented "shells" ?
  - collect direction,polarization onto positioned segments
  - what shape/segmentation ?
  - lots of duplicated information
- when scattering (pre or post parameters)
  - collect position,direction,polarization
  - no shells, more involved lookup
fast no-photon simulation
- orient "snapshot" to the primary
"snapshots" near sensors -> resume propagation
- times, incidence angles at sensors -> hits

GPU "snapshot" cache data structure:

photon lists, binned PDFs ?
k-d tree (for nearest neighbor searches)
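
As a reference point for the k-d tree option, a minimal CPU version of build and nearest-neighbor query is sketched below. It is a textbook construction with hypothetical names, intended only to show the lookup pattern, not a GPU-ready structure.

```python
def build_kdtree(points, depth=0):
    """Build a 3D k-d tree by splitting on x, y, z in turn at the median point."""
    if not points:
        return None
    axis = depth % 3
    pts = sorted(points, key=lambda p: p[axis])
    mid = len(pts) // 2
    return {"point": pts[mid], "axis": axis,
            "left": build_kdtree(pts[:mid], depth + 1),
            "right": build_kdtree(pts[mid + 1:], depth + 1)}

def dist2(a, b):
    """Squared Euclidean distance between two 3D points."""
    return sum((a[i] - b[i]) ** 2 for i in range(3))

def nearest(node, target, best=None):
    """Return the stored point closest to target."""
    if node is None:
        return best
    if best is None or dist2(node["point"], target) < dist2(best, target):
        best = node["point"]
    axis = node["axis"]
    delta = target[axis] - node["point"][axis]
    near, far = (node["left"], node["right"]) if delta < 0 else (node["right"], node["left"])
    best = nearest(near, target, best)
    # Only search the far side if the splitting plane is closer than the best so far.
    if delta * delta < dist2(best, target):
        best = nearest(far, target, best)
    return best

tree = build_kdtree([(0.0, 0.0, 0.0), (1.0, 1.0, 1.0), (2.0, 0.0, 1.0)])
print(nearest(tree, (0.9, 0.8, 1.2)))  # (1.0, 1.0, 1.0)
```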

Cherenkov light generation
radioactive + biological backgrounds
propagation : scattering + absorption (billions)
- direct light (unscattered) : fast
- indirect (scattered) : slow
detection on sparse sensors

Opticks as drop in fast replacement for Geant4

Full+fast GPU accelerated simulation:

Cerenkov generation, Rayleigh scattering, absorption
angle dependent sensor collection efficiency culling
BUT: launch size, VRAM limited: 48G ≈ 400M photons

Re-usage is caching optimization, still need full propagation:

populate the cache
validate the trickery
re-usage reduces need for expensive propagations

Opticks Rayleigh Scattering : CUDA line-by-line port of G4OpRayleigh

__device__ void rayleigh_scatter(Photon &p, curandState &rng)
{
    float3 newDirection, newPolarization ;
    float cosTheta ;
    do {
        // candidate direction sampled over the sphere, rotated relative to the current direction
        newDirection = uniform_sphere(&rng);
        rotateUz(newDirection, p.direction );

        float constant = -dot(newDirection,p.polarization);
        newPolarization = p.polarization + constant*newDirection ;

        // newPolarization
        // 1. transverse to newDirection (as that component is subtracted)
        // 2. same plane as old p.polarization and newDirection (by construction)
        //
        // ... corner case elided ...
        if(curand_uniform(&rng) < 0.5f) newPolarization = -newPolarization ;

        newPolarization = normalize(newPolarization);
        cosTheta = dot(newPolarization,p.polarization) ;

    // rejection sampling : accept candidate with probability cosTheta^2
    } while ( cosTheta*cosTheta < curand_uniform(&rng)) ;

    p.direction = newDirection ;
    p.polarization = newPolarization ;
}

Have to persist the polarization vector, to truly resume a propagation

could persist pre-scatter : polarization, direction

https://bitbucket.org/simoncblyth/opticks/src/master/optixrap/cu/rayleigh.h

Developing a photon "snapshot" cache

Where/when/what to collect ?

tetrahedral volumetric meshes (tet-mesh)
- inherent segmentation
- natural adaptive resolution
- triangle faces : Giga-rays/s intersection (RT Cores)[1]
- good for general light field capture
"concentric" spheres/cylinders/cones oriented to primary
- natural for exploiting track axis rotational symmetry
at scatters (pre-scatter/post-scatter parameters)
- position, direction, polarization
- can generate post from pre, but not vice versa
collect photons OR aggregate binned PDFs ?
- PDF->CDF->generate photons (like Opticks "gensteps")
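
The PDF -> CDF -> generate route is inverse-transform sampling from a histogram. The sketch below assumes a one-dimensional binned PDF and uniform sampling within each bin; the function names and the example bins are hypothetical.

```python
import random
from bisect import bisect_right
from itertools import accumulate

def make_cdf(pdf_bins):
    """Normalized cumulative sums of the bin contents."""
    total = float(sum(pdf_bins))
    return [c / total for c in accumulate(pdf_bins)]

def sample_bin(cdf, u):
    """Inverse-CDF lookup: index of the bin whose cumulative range contains u in [0, 1)."""
    return min(bisect_right(cdf, u), len(cdf) - 1)

def generate(pdf_bins, edges, n, rng):
    """Draw n values distributed according to the binned PDF."""
    cdf = make_cdf(pdf_bins)
    out = []
    for _ in range(n):
        b = sample_bin(cdf, rng.random())
        out.append(edges[b] + rng.random() * (edges[b + 1] - edges[b]))  # uniform within the bin
    return out

samples = generate([1, 3], [0.0, 1.0, 2.0], 10000, random.Random(0))
print(sum(1 for s in samples if s >= 1.0) / len(samples))   # close to 0.75
```

Storing binned PDFs rather than photon lists trades memory for resolution: the cache stays small, but structure finer than the bin width is lost.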

Too many options: experimentation needed to iterate towards solution

[1] RTX Beyond Ray Tracing: Exploring the Use of Hardware Ray Tracing Cores for Tet-Mesh Point Location https://www.willusher.io/publications/rtx-points

Photon Mapping 1

Photon Mapping 2

Conclusion

Opticks : state-of-the-art GPU ray tracing applied to optical photon simulation and integrated with Geant4, eliminating memory and time bottlenecks.

neutrino telescope simulation can benefit drastically from Opticks

Drastic speedup -> better detector understanding -> greater precision

more photon limited -> more overall speedup ( 99.9% -> 1000x )

graphics : rich source of techniques, inspiration, CUDA code to try

https://bitbucket.org/simoncblyth/opticks	code repository
https://simoncblyth.bitbucket.io	presentations and videos
https://groups.io/g/opticks	forum/mailing list archive
email:opticks+subscribe@groups.io	subscribe to mailing list