Opticks Experience of GPU Optical Photon Simulation with NVIDIA OptiX

Opticks GPU Optical Simulation with NVIDIA® OptiX™ - Development Experience : Problems and Successes

Open source, https://bitbucket.org/simoncblyth/opticks

Simon C Blyth, IHEP, CAS — HSF Simulation Working Group Meeting, 27 May 2020



JUNO Optical Photon Simulation Problem...

CPU vs GPU architectures, Latency vs Throughput


Waiting for memory reads/writes is a major source of latency...

CPU : latency-oriented : Minimize time to complete single task : avoid latency with caching
  • complex : caching system, branch prediction, speculative execution, ...
GPU : throughput-oriented : Maximize total work per unit time : hide latency with parallelism
  • many simple processing cores, hardware multithreading, SIMD (single instruction multiple data)
  • simpler : lots of compute (ALU), at expense of cache+control
  • design assumes abundant parallelism

Effective use of a totally different processor architecture -> total reorganization of data and computation

Understanding Throughput-oriented Architectures https://cacm.acm.org/magazines/2010/11/100622-understanding-throughput-oriented-architectures/fulltext

Understanding GPU Graphical Origins -> Effective GPU Computation

GPUs evolved to rasterize 3D graphics at 30/60 fps

Simple Array Data Structures (N-million,4)

Constant "Uniform" 4x4 matrices : scaling+rotation+translation
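The array-plus-uniform-matrix pattern above can be sketched on the CPU as follows. This is a minimal illustration, not code from Opticks: the names Vec4, Mat4, transform, transform_all and scale_translate are hypothetical, and the row-major storage with column-vector convention is an assumption. On a GPU each loop iteration would map to one thread.

```cpp
#include <array>
#include <vector>

// Hypothetical types for illustration only.
using Vec4 = std::array<float, 4>;                 // (x, y, z, w)
using Mat4 = std::array<std::array<float, 4>, 4>;  // row-major, column-vector convention (assumed)

// Return the matrix-vector product m * v.
Vec4 transform(const Mat4& m, const Vec4& v) {
    Vec4 out{};  // zero-initialised
    for (int r = 0; r < 4; ++r)
        for (int c = 0; c < 4; ++c)
            out[r] += m[r][c] * v[c];
    return out;
}

// Apply the same constant ("uniform") matrix to every element of an (N,4) array.
void transform_all(const Mat4& m, std::vector<Vec4>& points) {
    for (auto& p : points) p = transform(m, p);
}

// Build a matrix that scales uniformly by s, then translates by (tx, ty, tz).
Mat4 scale_translate(float s, float tx, float ty, float tz) {
    Mat4 m{};  // all zeros
    m[0] = {s,   0.f, 0.f, tx};
    m[1] = {0.f, s,   0.f, ty};
    m[2] = {0.f, 0.f, s,   tz};
    m[3] = {0.f, 0.f, 0.f, 1.f};
    return m;
}
```

Keeping w = 1 for positions lets a single 4x4 matrix carry scaling, rotation and translation together.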

Graphical Experience Informs Fast Computation on GPUs

Optical Photon Simulation ≈ Ray Traced Image Rendering

Much in common : geometry, light sources, optical physics

Many Applications of ray tracing :

Ray-tracing vs Rasterization

[Figures: rasterization (nv_rasterization.png) and ray tracing (nv_raytrace.png)]


Spatial Index Acceleration Structure

NVIDIA® OptiX™ Ray Tracing Engine -- http://developer.nvidia.com/optix

OptiX makes GPU ray tracing accessible

NVIDIA expertise:

Opticks provides (Yellow):

[1] Turing RTX GPUs


Opticks : Translates G4 Optical Physics to CUDA/OptiX

OptiX : single-ray programming model -> line-by-line translation

CUDA Ports of Geant4 classes
  • G4Cerenkov (only generation loop)
  • G4Scintillation (only generation loop)
  • G4OpAbsorption
  • G4OpRayleigh
  • G4OpBoundaryProcess (only a few surface types)
Modify Cherenkov + Scintillation Processes
  • collect genstep, copy to GPU for generation
  • avoids copying millions of photons to GPU
Scintillator Reemission
  • fraction of bulk absorbed "reborn" within same thread
  • wavelength generated by reemission texture lookup
Opticks (OptiX/Thrust GPU interoperation)
  • OptiX : upload gensteps
  • Thrust : seeding, distribute genstep indices to photons
  • OptiX : launch photon generation and propagation
  • Thrust : pullback photons that hit PMTs
  • Thrust : index photon step sequences (optional)
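The seeding step listed above (distributing genstep indices to photons) can be sketched on the CPU as follows. This is a simplified illustration only: Opticks performs this on the GPU with Thrust, and the Genstep struct and seed_photons function here are hypothetical names that keep only the photon count of each genstep.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical, simplified genstep: only the photon count matters for seeding.
struct Genstep {
    std::uint32_t num_photons;
};

// Return one entry per photon, holding the index of the genstep that generates it.
std::vector<std::uint32_t> seed_photons(const std::vector<Genstep>& gensteps) {
    std::vector<std::uint32_t> seeds;
    for (std::uint32_t g = 0; g < gensteps.size(); ++g) {
        for (std::uint32_t p = 0; p < gensteps[g].num_photons; ++p) {
            seeds.push_back(g);  // photon slot -> genstep index
        }
    }
    return seeds;
}
```

With seeds in place, each photon thread can read its own genstep parameters, so only the compact gensteps need to be copied to the GPU rather than the photons themselves.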

G4Solid -> CUDA Intersect Functions for ~10 Primitives


Sphere, Cylinder, Disc, Cone, Convex Polyhedron, Hyperboloid, Torus, ...
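As an illustration of what a primitive intersect function does, here is a minimal CPU sketch for the simplest case, a sphere centred on the origin. It is not the Opticks implementation: the names Vec3, dot and intersect_sphere are hypothetical, and the ray direction is assumed to be normalised.

```cpp
#include <cmath>

// Hypothetical minimal vector type for illustration.
struct Vec3 { float x, y, z; };

inline float dot(const Vec3& a, const Vec3& b) {
    return a.x * b.x + a.y * b.y + a.z * b.z;
}

// Find the nearest intersection with t > t_min of the ray o + t*d with a sphere
// of the given radius centred on the origin. Assumes d is normalised.
// Returns true and sets t on a hit, returns false on a miss.
bool intersect_sphere(const Vec3& o, const Vec3& d, float radius, float t_min, float& t) {
    float b = dot(o, d);
    float c = dot(o, o) - radius * radius;
    float disc = b * b - c;             // discriminant of t^2 + 2bt + c = 0
    if (disc < 0.f) return false;       // ray line misses the sphere
    float sd = std::sqrt(disc);
    float t_near = -b - sd;
    float t_far  = -b + sd;
    if (t_near > t_min) { t = t_near; return true; }
    if (t_far  > t_min) { t = t_far;  return true; }
    return false;                       // both roots behind t_min
}
```

The t_min argument lets a caller skip an intersection it has already handled and ask for the next one along the same ray.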

G4Boolean -> CUDA/OptiX Intersection Program Implementing CSG

Complete Binary Tree, pick between pairs of nearest intersects:

UNION, tA < tB    Enter B    Exit B     Miss B
Enter A           ReturnA    LoopA      ReturnA
Exit A            ReturnA    ReturnB    ReturnA
Miss A            ReturnB    ReturnB    ReturnMiss
[1] Ray Tracing CSG Objects Using Single Hit Intersections, Andrew Kensler (2006)
with corrections by author of XRT Raytracer http://xrt.wikidot.com/doc:csg
[2] https://bitbucket.org/simoncblyth/opticks/src/master/optixrap/cu/csg_intersect_boolean.h
Similar to binary expression tree evaluation using postorder traverse.
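The UNION state table for the tA < tB case can be encoded as a small lookup, sketched below. This covers only the one table shown on the slide; the enum and function names are hypothetical, and the Enter/Exit classification from the sign of normal·direction is an assumption based on the usual outward-normal convention.

```cpp
// States of a candidate intersection and the actions the table selects.
enum State  { Enter = 0, Exit = 1, Miss = 2 };
enum Action { ReturnA, ReturnB, ReturnMiss, LoopA };

// Classify a candidate hit: Miss if nothing was hit, Enter if the ray travels
// against the outward surface normal, Exit otherwise (assumed convention).
State classify(bool hit, float normal_dot_direction) {
    if (!hit) return Miss;
    return normal_dot_direction < 0.f ? Enter : Exit;
}

// UNION table for the case tA < tB: rows are the state of A, columns the state of B.
Action union_action_a_closer(State a, State b) {
    static const Action table[3][3] = {
        /* Enter A */ { ReturnA, LoopA,   ReturnA    },
        /* Exit A  */ { ReturnA, ReturnB, ReturnA    },
        /* Miss A  */ { ReturnB, ReturnB, ReturnMiss },
    };
    return table[a][b];
}
```

LoopA means the intersect with A is advanced to its next intersection and the pair is classified again, which is why the primitives need to support a minimum distance.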

CSG Complete Binary Tree Serialization -> simplifies GPU side

Geant4 solid -> CSG binary tree (leaf primitives, non-leaf operators, 4x4 transforms on any node)

Serialize to complete binary tree buffer:

Height 3 complete binary tree with level order indices:

                                                   depth     elevation

                     1                               0           3

          10                   11                    1           2

     100       101        110        111             2           1

 1000 1001  1010 1011  1100 1101  1110  1111         3           0

postorder_next(i,elevation) = i & 1 ? i >> 1 : (i << elevation) + (1 << elevation) ; // from pattern of bits

Postorder tree traverse visits all nodes, starting from leftmost, such that children are visited prior to their parents.
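The postorder_next expression above can be exercised with a short CPU sketch. The helper names depth_of and postorder are hypothetical; elevation is taken as height minus depth, as in the diagram, and indices are 1-based level-order so the children of node i are 2i and 2i+1.

```cpp
#include <cassert>
#include <vector>

// Next node in postorder for a complete binary tree with 1-based level-order
// indices. elevation = height - depth of node i. Returns 0 after the root.
unsigned postorder_next(unsigned i, unsigned elevation) {
    return (i & 1u) ? (i >> 1) : (i << elevation) + (1u << elevation);
}

// Depth of node i: position of its highest set bit (root i=1 has depth 0).
unsigned depth_of(unsigned i) {
    assert(i >= 1u);
    unsigned d = 0;
    while (i >>= 1) ++d;
    return d;
}

// Collect the full postorder sequence for a complete binary tree of given height.
std::vector<unsigned> postorder(unsigned height) {
    std::vector<unsigned> order;
    unsigned i = 1u << height;  // leftmost leaf
    while (i != 0u) {
        order.push_back(i);
        i = postorder_next(i, height - depth_of(i));
    }
    return order;
}
```

For height 3 this visits 8 9 4 10 11 5 2 12 13 6 14 15 7 3 1: a right child (odd index) steps up to its parent, and a left child (even index) jumps to the leftmost leaf of its sibling's subtree. No stack or recursion is needed, which suits GPU threads.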

Opticks : Translates G4 Geometry to GPU, Without Approximation

G4 Structure Tree -> Instance+Global Arrays -> OptiX

Group structure into repeated instances + global remainder:

instancing -> huge memory savings for JUNO PMTs




Validation of Opticks Simulation by Comparison with Geant4

Bi-simulations of all JUNO solids, with millions of photons

mis-aligned histories : mostly < 0.25%, < 0.50% for largest solids
deviant photons within matched history : < 0.05% (500/1M)

Primary sources of problems

Primary cause : float vs double

Geant4 uses double precision everywhere; Opticks uses it only sparingly (double was observed to cost a 10x slowdown with RTX)
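A generic illustration of how float and double can diverge, not taken from Opticks: single precision carries a 24-bit significand, so near a coordinate of 20000 the spacing between representable floats is about 0.002. The function names and the numbers used (a 20000 unit coordinate and a 0.0001 unit step) are illustrative assumptions.

```cpp
// Advance a coordinate by a small step in single and in double precision.
float  advance_f(float  position, float  step) { return position + step; }
double advance_d(double position, double step) { return position + step; }
```

With these assumed numbers, advance_f(20000.0f, 0.0001f) returns exactly 20000.0f because the step is below half the float spacing, while advance_d(20000.0, 0.0001) moves the coordinate. Small differences of this kind can push a photon to a different side of a boundary in one simulation and not the other.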




Performance : Scanning from 1M to 400M Photons

Full JUNO Analytic Geometry j1808v5

Production Mode : does the minimum

Multi-Event Running, Measure:

avg time between successive launches, including overheads: (upload gensteps + launch + download hits)
avg of 10 OptiX launches

NVIDIA Quadro RTX 8000 (48G)

Thanks to NVIDIA China for loaning the card


[Figure: scan-pf-1_Opticks_vs_Geant4]

JUNO analytic, 400M photons from center    Time                 Speedup
Geant4 Extrap.                             95,600 s (26 hrs)
Opticks RTX ON (i)                         58 s                 1650x

[Figure: scan-pf-1_Opticks_Speedup]

JUNO analytic, 400M photons from center    Time                 Speedup
Opticks RTX ON (i)                         58 s                 1650x
Opticks RTX OFF (i)                        275 s                350x
Geant4 Extrap.                             95,600 s (26 hrs)


5x Speedup from RTX with JUNO analytic geometry

Opticks Experience : Main Operational Problem : Manpower

Lots of interest, very little contribution. Why?

Tool Innovation is Disincentivized ?

Why is GPU simulation development difficult?

Opticks Experience : Main Technical Problem : Geometry Translation

Intersection Performance -> Simulation Performance, Drivers:

Analytic Geometry : translate volume -> surface based model

Coincident faces (even in CSG boolean constituents)

Analytic Torus Intersection

Opticks Experience : Problems with using NVIDIA OptiX

Optimization Issues

Linux GPU Cluster (eg Tesla V100) Deployment Issues

[1] NVIDIA RTX Server with 8x NVIDIA Quadro RTX 8000 : probably restricted to car, design, film companies ...
[2] NVIDIA Quadro RTX 8000 PCIe Server Card (Passive)

Opticks Experience : Benefits from using NVIDIA OptiX

NVIDIA OptiX 3,4,5,6


1 or 2 Releases per Year



Opticks : state-of-the-art GPU ray tracing applied to optical photon simulation and integrated with Geant4, giving a leap in performance that eliminates memory and time bottlenecks.

  • Drastic speedup -> better detector understanding -> greater precision
    • any simulation limited by optical photons can benefit
    • more photon limited -> more overall speedup (99% -> 100x)
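The "99% -> 100x" figure follows from Amdahl's law: if a fraction f of the run time is optical photon simulation and that part is accelerated by a factor s, the overall speedup is 1 / ((1 - f) + f / s). A minimal sketch, with overall_speedup as a hypothetical helper name:

```cpp
// Overall speedup when a fraction f of the run time is accelerated by a
// factor s (Amdahl's law). f is in [0, 1], s > 0.
double overall_speedup(double f, double s) {
    return 1.0 / ((1.0 - f) + f / s);
}
```

For f = 0.99 and the measured s = 1650 this gives roughly 94x, approaching the 100x limit as s grows. For f = 0.5 the overall gain stays below 2x however fast the photon part becomes.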
https://bitbucket.org/simoncblyth/opticks code repository
https://simoncblyth.bitbucket.io presentations and videos
https://groups.io/g/opticks forum/mailing list archive
email:opticks+subscribe@groups.io subscribe to mailing list