Opticks : GPU Optical Photon Simulation using NVIDIA OptiX 7 and NVIDIA CUDA
Opticks : GPU Optical Photon Simulation via NVIDIA® OptiX™ 7, NVIDIA® CUDA™
Open source, https://bitbucket.org/simoncblyth/opticks
Simon C Blyth, IHEP, CAS — (Presented by Dr Tao Lin, IHEP, CAS) — CHEP, Norfolk VA, 8 May 2023
Outline
- Optical Photon Simulation : Context and Problem
- p2: Jiangmen Underground Neutrino Observatory (JUNO)
- p3: JUNO Optical Photon Simulation Problem...
- p4: Optical Photon Simulation ≈ Ray Traced Image Rendering
- NVIDIA Tools to create Solution
- p5,6: NVIDIA Ada Lovelave : 3rd Generation RTX, RT Cores in Data-Center
- p7: NVIDIA OptiX Ray Tracing Engine
- p8: NVIDIA OptiX 7 : Entirely new thin API
- Opticks : Introduction + Full Re-implementation
- p9,10: Geant4 + Opticks Hybrid Workflow : External Optical Photon Simulation
- p11-12: Full re-implementation for NVIDIA OptiX 7 API
- p13-14: CSGFoundry Geometry Model, Translation to GPU
- p16: QUDARap : CUDA Optical Simulation Implementation
- Opticks : New Features
- p17:n-Ary CSG "List-Nodes"
- p18,19: Specialized n-ary CSG intersect algs : "contiguous" and "dis-contiguous"
- p20: Multi-Layer Thin Film (A,R,T) Calc using TMM (Custom4 Package)
- p21: Summary + Links
JUNO_Intro_2
Optical Photon Simulation Problem...
Optical Photon Simulation ≈ Ray Traced Image Rendering
- simulation
- photon parameters at sensors (PMTs)
- rendering
- pixel values at image plane
Much in common : geometry, light sources, optical physics
- both limited by ray geometry intersection, aka ray tracing
Many Applications of ray tracing :
- advertising, design, architecture, films, games,...
- -> huge efforts to improve hw+sw over 30 yrs
NVIDIA Ada : 3rd Generation RTX
- RT Core : ray trace dedicated GPU hardware
- NVIDIA GeForce RTX 4090 (2022)
- 16,384 CUDA Cores, 24GB VRAM, USD 1599
- Continued large ray tracing improvements:
- Ada ~2x ray trace over Ampere (2020), 4x with DLSS 3
- Ampere ~2x ray trace over Turing (2018)
- DLSS : Deep Learning Super Sampling
- AI upsampling, not applicable to optical simulation
Hardware accelerated Ray tracing (RT Cores) in the Data Center
NVIDIA L4 Tensor Core GPU (Released 2023/03)
- Ada Lovelace GPU architecture
- universal accelerator for graphics and AI workloads
- small form-factor, easy to integrate, power efficient
- PCIe Gen4 x16 slot without extra power
- Google Cloud adopted for G2 VMs, successor to NVIDIA T4
- NVIDIA L4 likely to become a very popular GPU
NVIDIA L4 Tensor Core GPU (Data Center, low profile+power)
NVIDIA® OptiX™ Ray Tracing Engine -- Accessible GPU Ray Tracing
OptiX makes GPU ray tracing accessible
- Programmable GPU-accelerated Ray-Tracing Pipeline
- Single-ray shader programming model using CUDA
- ray tracing acceleration using RT Cores (RTX GPUs)
- "...free to use within any application..."
OptiX features
- acceleration structure creation + traversal (eg BVH)
- instanced sharing of geometry + acceleration structures
- compiler optimized for GPU ray tracing
https://developer.nvidia.com/rtx/ray-tracing/optix
User provides (Green):
- ray generation
- geometry bounding boxes
- intersect functions
- instance transforms
NVIDIA OptiX 7 : Entirely new thin API => Full Opticks Re-implementation
NVIDIA OptiX 6->7 : drastically slimmed down
- low-level CUDA-centric thin API (Vulkan-ized)
- headers only (no library, impl in Driver)
- Minimal host state, All host functions are thread-safe
- GPU launches : explicit, asynchronous (CUDA streams)
- near perfect scaling to 4 GPUs, for free
- Shared CPU/GPU geometry context
- GPU memory management
- Multi-GPU support
Advantages of 6->7 transition
- More control/flexibility over everything
- Keep pace with state-of-the-art GPU ray tracing
- Fully benefit from current + future GPUs : RT cores, RTX
BUT: demanded full re-implementation of Opticks
Geant4 + Opticks + NVIDIA OptiX 7 : Hybrid Workflow
https://bitbucket.org/simoncblyth/opticks |
Opticks API : split according to dependency -- Optical photons are GPU "resident", only hits need to be copied to CPU memory
Geant4 + Opticks + NVIDIA OptiX 7 : Hybrid Workflow 2
Primary Packages and Structs Of Re-Implemented Opticks
- SysRap : many small CPU/GPU headers
- stree.h,snode.h : geometry base types
- sctx.h sphoton.h : event base types
- NP.hh : serialization into NumPy .npy format files
- QUDARap
- QSim : optical photon simulation steering
- QScint,QCerenkov,QProp,... : modular CUDA implementation
- U4
- U4Tree : convert geometry into stree.h
- U4 : collect gensteps, return hits
- CSG
- CSGFoundry/CSGSolid/CSGPrim/CSGNode geometry model
- csg_intersect_tree.h csg_intersect_node.h csg_intersect_leaf.h : CPU/GPU intersection functions
- CSGOptiX
- CSGOptiX.h : manage geometry convert from CSG to OptiX 7 IAS GAS, pipeline creation
- CSGOptiX7.cu : compiled into ptx that becomes OptiX 7 pipeline
- includes QUDARap headers for simulation
- includes csg_intersect_tree.h,.. headers for CSG intersection
- G4CX
- G4CXOpticks : Top level Geant4 geometry interface
Full re-implementation of Opticks for NVIDIA OptiX 7 API
- Huge change unavoidable from new OptiX API --> So profit from rethink of simulation code --> 2nd impl advantage
Old simulation (OptiXRap) |
New simulation (QUDARap/qsim.h + CSGOptiX, CSG) |
- implemented on top of old OptiX API
|
- pure CUDA implementation
- OptiX use kept separate, just for intersection
|
- monolithic .cu
- GPU only implementation
- deep stack of support code
|
- many small headers
- many GPU+CPU headers
- shallow stack : QUDARap depends only on SysRap
|
- most code in GPU only context,
even when not needing OptiX or CUDA
|
- strict code segregation
- code not needing GPU in SysRap not QUDARap
|
- testing : GPU only, coarse
|
- testing : CPU+GPU , fine-grained
- curand mocking on CPU
|
- limited CPU/GPU code sharing
|
- maximal sharing : SEvt.hh, sphoton.h, ...
|
- timeconsuming manual random alignment
conducted via debugger
|
- new systematic approach to random alignment
|
Goals of re-implementation : flexible, modular GPU simulation, easily testable, less code
- code reduction, sharing as much as possible between CPU and GPU
- fine grained testing on both CPU and GPU, with GPU curand mocking
- profit from several years of CUDA experience, eg QSim.hh/qsim.h host/device counterpart pattern:
- hostside initializes and uploads device side counterpart --> device side hits ground running
Geometry Model Translation : Geant4 => CSGFoundry => NVIDIA OptiX 7
Geant4 Geometry Model (JUNO: 300k PV, deep hierarchy)
PV |
G4VPhysicalVolume |
placed, refs LV |
LV |
G4LogicalVolume |
unplaced, refs SO |
SO |
G4VSolid,G4BooleanSolid |
binary tree of SO "nodes" |
Opticks CSGFoundry Geometry Model (index references)
struct |
Notes |
Geant4 Equivalent |
CSGFoundry |
vectors of the below, easily serialized + uploaded |
None |
qat4 |
4x4 transform refs CSGSolid using "spare" 4th column |
|
CSGSolid |
refs sequence of CSGPrim |
Groups of nearby PV, LV
+ Remainder |
eg JUNO CSGSolid numPrim [3089, 5, 11, 14, 6, 1, 1, 1, 1, 130] |
CSGPrim |
bbox, refs sequence of CSGNode, root of CSG Tree of nodes |
root G4VSolid |
CSGNode |
CSG node parameters (JUNO: ~23k CSGNode) |
node G4VSolid |
NVIDIA OptiX 7 Geometry Acceleration Structures (JUNO: 1 IAS + 10 GAS, 2-level hierarchy)
IAS |
Instance Acceleration Structures |
JUNO: 1 IAS created from vector of ~50k qat4 (JUNO) |
GAS |
Geometry Acceleration Structures |
JUNO: 10 GAS created from 10 CSGSolid (which refs CSGPrim,CSGNode ) |
[9]cxr_i0_t8,_-1 : EXCLUDE SLOWEST
QUDARap : CUDA Optical Simulation Implementation
|
CPU |
GPU header |
context steering |
QSim.hh |
qsim.h |
curandState setup |
QRng.hh |
qrng.h |
property interpolation |
QProp.hh |
qprop.h |
event handling |
QEvent.hh |
qevent.h |
Cerenkov generation |
QCerenkov.hh |
qcerenkov.h |
Scintillation generation |
QScint.hh |
qscint.h |
texture handling |
QTex.hh |
|
Aims of counterpart code organization:
- facilitate fine-grained modular simulation testing
- bulk of GPU code in simple to test headers
- many can be tested on CPU
- QUDARap does not depend on OptiX -> more flexible -> simpler testing
n-ary CSG Compound "List-Nodes" => Much Smaller CSG trees
- list-node references sub-nodes by subNum subOffset
- CSG_CONTIGUOUS Union
- user guarantees contiguous
- like G4MultiUnion of prim only
- CSG_DISCONTIGUOUS Union
- user guarantees no overlaps
- => simple, low resource intersect
- eg "union of holes" to be CSG subtracted
- CSG_OVERLAP Intersection
- user guarantees overlap
- eg general G4Sphere: inner radius, thetacut, phicut
- Communicate shape more precisely
- => better suited intersect alg => less resources => faster
Generalized Opticks CSG into three levels : tree < node < leaf (avoids recursion in intersect)
CSG_CONTIGUOUS Union : n-ary (not bin-ary) CSG intersection
- zeroth pass : find nearest_enter and count first exits
- if zero exits => outside compound => return nearest_enter
- first pass : collect enter distances, farthest_exit
- order enter indices making enter distances ascend
- n-ary : store, sort enters (cf bin-ary : compare two)
- no tree overheads, but must store+sort distances
- 2nd pass : loop over enters in distance order
- contiguous requirement : enter < farthest_exit so far
- find Exits for Enters that qualify as contiguous, update farthest_exit
- return farthest_exit that qualifies as contiguous
+----------------+ +-------------------+ DISJOINT MUST BE DISQUALIFIED
|B | |D |
+----|----+ +----|-----|----+ +------|----------+ +-----------+
|A | | |C | | | |E | | | |
| | | | | | | | | | | |
| 0 E1 X2 E3 X4 E5 X6 E7 X8 [X9] E10 X11
| | | | | | | | | | | |
| | | | | | | | | | | |
+----|----+ +----|-----|----+ +------|----------+ +-----------+
| | | |
+----------------+ +-------------------+
E E E E E
X X X X X X
CSG_DISCONTIGUOUS Union : CSG intersection
User guarantees : absolutely no overlapping between constituents
+-------+ +-------+ +-------+ +-------+ +-------+
| | | | | | | | | |
| | | | | | | | | |
+-------+ +-------+ +-------+ +-------+ +-------+
+-------+ +-------+ +-------+ +-------+ +-------+
| | | | | | | | | |
| | | | | | | | | |
+-------+ +-------+ +-------+ +-------+ +-------+
- => very simple low resource intersection : closest Enter or Exit
- More closely suiting algorithm to geometry => better performance
- this can help with "holes" subtracted from another solid : the "holes" usually do not overlap
Multi-Layer Thin Film (A,R,T) Calc using TMM Calc (Custom4 Package)
- C4OpBoundaryProcess.hh
- G4OpBoundaryProcess with C4CustomART.h
- C4CustomART.h
- integrate custom boundary process and TMM calculation
- C4MultiLayrStack.h : CPU/GPU TMM calculation of (A,R,T)
based on complex refractive indices and layer thicknesses
- GPU: using thrust::complex CPU:using std::complex
Custom4: Simplifies JUNO PMT Optical Model + Geometry
Summary and Links
Opticks : state-of-the-art GPU ray traced optical simulation integrated with Geant4.
Full re-implementation of Opticks geometry and simulation for NVIDIA OptiX 7 completed.
- NVIDIA Ray Trace Performance continues rapid progress (2x each generation)
- Increasing availability of HW accelerated ray tracing
- NEXT: Opticks production testing, optimization