- Context and Problem
  - Jiangmen Underground Neutrino Observatory (JUNO)
  - Optical Photon Simulation Problem...

- Tools to create Solution
  - Optical Photon Simulation ≈ Ray Traced Image Rendering
  - Rasterization and Ray tracing
  - Turing Built for RTX
  - BVH : Bounding Volume Hierarchy
  - NVIDIA OptiX Ray Tracing Engine

- Opticks : The Solution
  - Geant4 + Opticks Hybrid Workflow : External Optical Photon Simulation
  - Opticks : translates G4 optical physics to CUDA/OptiX
  - Opticks : translates G4 geometry to GPU, without approximation
  - CUDA/OptiX Intersection functions for ~10 primitives
  - CUDA/OptiX Intersection functions for Arbitrarily Complex CSG shapes

- Validation and Performance
  - Random Aligned Bi-Simulation -> Direct Array Comparison
  - Performance Scanning from 1M to 400M photons

- Overview + Links

Huge CPU Memory+Time Expense

**JUNO Muon Simulation Bottleneck**

- ~99% CPU time, memory constraints

**Ray-Geometry intersection Dominates**

- simulation is not alone in this problem...

**Optical photons : naturally parallel, simple :**

- produced by Cerenkov+Scintillation
- yield only Photomultiplier hits

Not a Photo, a Calculation

**Much in common : geometry, light sources, optical physics**

- simulation : photon parameters at PMT detectors
- rendering : pixel values at image plane
- both limited by ray geometry intersection, aka ray tracing

**Many Applications of ray tracing** :

- advertising, design, architecture, films, games,...
- -> huge efforts to improve hw+sw over 30 yrs

**August 2018 : Major Ray Tracing Advance**

- NVIDIA RTX Platform, Turing GPU
- ray trace dedicated hardware : RT cores

10 Giga Rays/s

Ray Trace Dedicated RT Cores

- BVH (Bounding Volume Hierarchy) traversal
- ray triangle intersection

**Move part of ray tracing : SM -> RT Core**

RTX Platform : Hybrid Rendering

- Ray trace (RT cores)
- AI inference (Tensor cores) -> Denoising
- Compute (SM, CUDA cores)
- Rasterization (pipeline)

-> real-time photoreal cinematic 3D rendering

Tree of Bounding Boxes (bbox)

- aims to minimize bbox+primitive intersects
- accelerates ray-geometry intersection
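
As a rough illustration of why a tree of bounding boxes cuts down the number of intersection tests, the sketch below uses the standard slab test for a ray against an axis-aligned box and skips whole subtrees whose box is missed. The node layout (dictionaries with `min`, `max`, `children`, `prims`) is invented for this example and is not how OptiX or the RT cores store their acceleration structures.

```python
import math

def ray_hits_bbox(origin, direction, bbox_min, bbox_max, t_min=0.0, t_max=math.inf):
    """Slab test: True if origin + t*direction is inside the box for some t in [t_min, t_max]."""
    for o, d, lo, hi in zip(origin, direction, bbox_min, bbox_max):
        if d == 0.0:
            # Ray parallel to this pair of slabs: it must already lie between them.
            if o < lo or o > hi:
                return False
            continue
        t0, t1 = (lo - o) / d, (hi - o) / d
        if t0 > t1:
            t0, t1 = t1, t0
        t_min, t_max = max(t_min, t0), min(t_max, t1)
        if t_min > t_max:
            return False
    return True

def candidate_primitives(node, origin, direction, found):
    """Collect primitives whose enclosing boxes are all hit; missed subtrees are never visited."""
    if not ray_hits_bbox(origin, direction, node["min"], node["max"]):
        return found
    if "prims" in node:
        found.extend(node["prims"])
    else:
        for child in node["children"]:
            candidate_primitives(child, origin, direction, found)
    return found

# Hypothetical two-leaf tree: only the left leaf lies on the ray's path.
tree = {
    "min": (0, 0, 0), "max": (6, 1, 1),
    "children": [
        {"min": (0, 0, 0), "max": (1, 1, 1), "prims": ["a"]},
        {"min": (5, 0, 0), "max": (6, 1, 1), "prims": ["b"]},
    ],
}
hits = candidate_primitives(tree, (0.5, 0.5, -1.0), (0.0, 0.0, 1.0), [])
```

Only primitive `a` is left to test against the ray; primitive `b` is rejected by a single box test.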

OptiX Raytracing Pipeline

Analogous to OpenGL rasterization pipeline:

**OptiX makes GPU ray tracing accessible**

- **accelerates** ray-geometry intersections
- simple : single-ray programming model
- "...free to use within any application..."
- access RT Cores[1] with OptiX 6.0.0+ via RTX™ mode

**NVIDIA expertise:**

- ~linear scaling up to 4 GPUs
- acceleration structure creation + traversal (Blue)
- instanced sharing of geometry + acceleration structures
- compiler optimized for GPU ray tracing

**Opticks provides (Yellow):**

- ray generation program
- ray geometry intersection+bbox programs

[1] Turing RTX GPUs

GPU Resident Photons

**Seeded on GPU**

- associate photons -> *gensteps* (via seed buffer)

**Generated on GPU, using genstep param:**

- number of photons to generate
- start/end position of step

**Propagated on GPU**

- Only photons hitting PMTs copied to CPU
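
The seeding step can be pictured with a small CPU analogue. This is only a NumPy sketch of the idea of a seed buffer (one genstep index per photon slot); the function name is hypothetical and the real work is done on the GPU with Thrust.

```python
import numpy as np

def seed_photons(photons_per_genstep):
    """Return a seed buffer: for each photon slot, the index of the genstep that produces it."""
    counts = np.asarray(photons_per_genstep, dtype=np.int64)
    return np.repeat(np.arange(len(counts)), counts)

# Three gensteps asking for 3, 0 and 2 photons give five photon slots.
seeds = seed_photons([3, 0, 2])
```

Each photon thread can then read its genstep parameters through its seed, so only the compact gensteps need to be uploaded.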

Thrust: **high level C++ access to CUDA**

OptiX : single-ray programming model -> line-by-line translation

**CUDA Ports of Geant4 classes**

- G4Cerenkov (only generation loop)
- G4Scintillation (only generation loop)
- G4OpAbsorption
- G4OpRayleigh
- G4OpBoundaryProcess (only a few surface types)

**Modify Cerenkov + Scintillation Processes**

- collect *genstep*, copy to GPU for generation
- avoids copying millions of photons to GPU

**Scintillator Reemission**

- fraction of bulk absorbed "reborn" within same thread
- wavelength generated by reemission texture lookup
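
One common way to generate a wavelength by table lookup is inverse-CDF sampling: tabulate the inverse of the cumulative emission spectrum once, then index it with a uniform random number. The sketch below assumes that is what the reemission texture holds; the function names, the table size and the flat test spectrum are all invented for illustration.

```python
import numpy as np

def build_icdf(wavelengths, intensity, n_bins=1024):
    """Tabulate the inverse CDF of a spectrum on a uniform grid of probabilities in [0, 1]."""
    wavelengths = np.asarray(wavelengths, dtype=float)
    intensity = np.asarray(intensity, dtype=float)
    # Trapezoidal area of each wavelength segment, accumulated into a normalized CDF.
    areas = 0.5 * (intensity[1:] + intensity[:-1]) * np.diff(wavelengths)
    cdf = np.concatenate([[0.0], np.cumsum(areas)])
    cdf /= cdf[-1]
    u = np.linspace(0.0, 1.0, n_bins)
    return np.interp(u, cdf, wavelengths)

def lookup_wavelength(icdf, u):
    """Linearly interpolated lookup, mimicking a 1D texture fetch with u in [0, 1]."""
    x = u * (len(icdf) - 1)
    return np.interp(x, np.arange(len(icdf)), icdf)

# Flat hypothetical spectrum between 400 nm and 500 nm.
icdf = build_icdf([400.0, 500.0], [1.0, 1.0])
mid = lookup_wavelength(icdf, 0.5)
```

For the flat spectrum the median lookup lands at 450 nm, as expected.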

**Opticks (OptiX/Thrust GPU interoperation)**

- **OptiX**: upload gensteps
- **Thrust**: seeding, distribute genstep indices to photons
- **OptiX**: launch photon generation and propagation
- **Thrust**: pullback photons that hit PMTs
- **Thrust**: index photon step sequences (optional)

Volumes -> Boundaries

**Ray tracing favors Boundaries**

Material/surface boundary : 4 indices

- outer material (parent)
- outer surface (inward photons, parent -> self)
- inner surface (outward photons, self -> parent)
- inner material (self)

Primitives labelled with unique boundary index

- ray primitive intersection -> boundary index
- texture lookup -> material/surface properties
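
A minimal sketch of how the four boundary indices might be used once an intersection returns a boundary index. It assumes outward-pointing surface normals, so a negative dot(normal, rayDir) means the photon is heading inward. The `Boundary` type, the `resolve` helper and the material names are hypothetical, and plain strings stand in for the indices into the property texture.

```python
from typing import NamedTuple

class Boundary(NamedTuple):
    outer_material: str  # parent volume material
    outer_surface: str   # relevant to inward photons, parent -> self ("" if none)
    inner_surface: str   # relevant to outward photons, self -> parent ("" if none)
    inner_material: str  # material of the volume itself

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def resolve(boundary, normal, ray_dir):
    """Return (material_from, surface, material_to) for a photon crossing the boundary."""
    inward = dot(normal, ray_dir) < 0.0  # assumption: normal points out of "self"
    if inward:
        return boundary.outer_material, boundary.outer_surface, boundary.inner_material
    return boundary.inner_material, boundary.inner_surface, boundary.outer_material

# Illustrative boundary table; a primitive would carry an index into this list.
BOUNDARIES = [
    Boundary("Water", "", "", "Acrylic"),
    Boundary("Water", "MirrorSurface", "", "Steel"),
]

entering = resolve(BOUNDARIES[1], (0.0, 0.0, 1.0), (0.0, 0.0, -1.0))
leaving = resolve(BOUNDARIES[0], (0.0, 0.0, 1.0), (0.0, 0.0, 1.0))
```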

**Automated : Geant4 "World" -> Opticks CSG -> CUDA/OptiX**

- intersection functions for ~10 primitives
- intersection program for arbitrarily complex CSG shapes

**Structure**

- repeated geometry instances identified (progeny digests)
- instance transforms used in OptiX/OpenGL geometry
- merge CSG trees into global + instance buffers

**Material/Surface/Scintillator properties**

- interpolated to standard wavelength domain
- interleaved into "boundary" texture
- "reemission" texture for wavelength generation

- 3D parametric ray : **ray(x,y,z;t) = rayOrigin + t * rayDirection**
- implicit equation of primitive : **f(x,y,z) = 0**
- -> polynomial in **t**, roots : **t > t_min** -> intersection positions + surface normals
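
For a sphere the recipe above gives a quadratic in t. The sketch below substitutes the parametric ray into |p - center|^2 - radius^2 = 0, keeps the nearest root with t > t_min, and returns the position and outward normal. It is a plain Python illustration under those assumptions, not the Opticks CUDA code.

```python
import math

def intersect_sphere(origin, direction, center, radius, t_min=0.0):
    """Nearest intersection with t > t_min, as (t, position, normal), or None on a miss."""
    oc = [o - c for o, c in zip(origin, center)]
    # Coefficients of a*t^2 + b*t + c = 0 from |origin + t*direction - center|^2 = radius^2.
    a = sum(d * d for d in direction)
    b = 2.0 * sum(d * x for d, x in zip(direction, oc))
    c = sum(x * x for x in oc) - radius * radius
    disc = b * b - 4.0 * a * c
    if disc < 0.0:
        return None
    root = math.sqrt(disc)
    for t in ((-b - root) / (2.0 * a), (-b + root) / (2.0 * a)):
        if t > t_min:
            position = [o + t * d for o, d in zip(origin, direction)]
            normal = [(p - k) / radius for p, k in zip(position, center)]
            return t, position, normal
    return None

near = intersect_sphere((0.0, 0.0, -5.0), (0.0, 0.0, 1.0), (0.0, 0.0, 0.0), 1.0)
far = intersect_sphere((0.0, 0.0, -5.0), (0.0, 0.0, 1.0), (0.0, 0.0, 0.0), 1.0, t_min=4.5)
```

Raising t_min past the first root makes the same function return the exit point, which is the mechanism the CSG algorithm relies on when it loops.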

Outside/Inside Unions

dot(normal,rayDir) -> Enter/Exit

- **A + B** : boundary not inside other
- **A * B** : boundary inside other

Complete Binary Tree, pick between pairs of nearest intersects:

UNION tA < tB | Enter B | Exit B | Miss B |
---|---|---|---|
Enter A | ReturnA | LoopA | ReturnA |
Exit A | ReturnA | ReturnB | ReturnA |
Miss A | ReturnB | ReturnB | ReturnMiss |

*Nearest hit intersect algorithm* [1] avoids state

- sometimes Loop : advance **t_min**, re-intersect both
- classification shows if inside/outside

*Evaluative* [2] implementation emulates recursion:

- recursion not allowed in OptiX intersect programs
- bit twiddle traversal of complete binary tree
- stacks of postorder slices and intersects
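
The UNION table can be exercised on a single pair of overlapping spheres. In this sketch the table is keyed on (nearer hit state, farther hit state); because union is symmetric, labelling whichever hit is nearer as "A" lets the tA < tB table cover both orderings. The function names and the per-shape t_min bookkeeping are assumptions made for this example, the ray direction is assumed to be a unit vector, and a production version would need an epsilon when advancing t_min.

```python
import math

ENTER, EXIT, MISS = "Enter", "Exit", "Miss"

# UNION actions from the table above, with A read as the nearer hit and B as the farther.
UNION_NEAR_FIRST = {
    (ENTER, ENTER): "ReturnNear", (ENTER, EXIT): "LoopNear",  (ENTER, MISS): "ReturnNear",
    (EXIT, ENTER):  "ReturnNear", (EXIT, EXIT):  "ReturnFar", (EXIT, MISS):  "ReturnNear",
    (MISS, ENTER):  "ReturnFar",  (MISS, EXIT):  "ReturnFar", (MISS, MISS):  "ReturnMiss",
}

def sphere_hit(origin, direction, center, radius, t_min):
    """Nearest root t > t_min (unit direction assumed) and its Enter/Exit/Miss state."""
    oc = [o - c for o, c in zip(origin, center)]
    b = sum(d * x for d, x in zip(direction, oc))
    c = sum(x * x for x in oc) - radius * radius
    disc = b * b - c
    if disc < 0.0:
        return math.inf, MISS
    root = math.sqrt(disc)
    for t in (-b - root, -b + root):
        if t > t_min:
            # Outward normal (unnormalized); its sign against the ray gives Enter or Exit.
            normal = [o + t * d - k for o, d, k in zip(origin, direction, center)]
            cos_theta = sum(n * d for n, d in zip(normal, direction))
            return t, (ENTER if cos_theta < 0.0 else EXIT)
    return math.inf, MISS

def union_hit(origin, direction, spheres, t_min=0.0, max_loops=8):
    """spheres maps "A" and "B" to (center, radius); returns (name, t) or None on a miss."""
    t_mins = {name: t_min for name in spheres}
    for _ in range(max_loops):
        hits = sorted(
            ((name, *sphere_hit(origin, direction, c, r, t_mins[name]))
             for name, (c, r) in spheres.items()),
            key=lambda h: h[1],
        )
        near, far = hits
        action = UNION_NEAR_FIRST[(near[2], far[2])]
        if action == "ReturnNear":
            return near[0], near[1]
        if action == "ReturnFar":
            return far[0], far[1]
        if action == "ReturnMiss":
            return None
        t_mins[near[0]] = near[1]  # LoopNear: advance t_min of the nearer shape, re-intersect
    raise RuntimeError("loop limit reached")

spheres = {"A": ((0.0, 0.0, 0.0), 1.0), "B": ((0.0, 0.0, 1.5), 1.0)}
from_outside = union_hit((0.0, 0.0, -5.0), (0.0, 0.0, 1.0), spheres)
from_inside = union_hit((0.0, 0.0, 0.0), (0.0, 0.0, 1.0), spheres)
```

A ray starting inside A first meets the entry into B at t = 0.5, which is interior to the union, so the table asks for a loop; the second pass returns the true exit from B at t = 2.5.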

- Identical geometry to Geant4
- solving the same polynomials
- near perfect intersection match

- [1] Ray Tracing CSG Objects Using Single Hit Intersections, Andrew Kensler (2006)
- with corrections by author of XRT Raytracer http://xrt.wikidot.com/doc:csg
- [2] https://bitbucket.org/simoncblyth/opticks/src/tip/optixrap/cu/csg_intersect_boolean.h
- Similar to binary expression tree evaluation using postorder traverse.

Random Aligned Bi-Simulation

Same inputs to *Opticks* and *Geant4*:

- CPU generated photons
- GPU generated randoms, fed to *Geant4*

Common recording into *OpticksEvents*:

- compressed photon step record, up to 16 steps
- persisted as *NumPy* arrays for python analysis

Aligned random consumption, direct comparison:

- ~every **scatter, absorb, reflect, transmit** at matched positions, times, polarization, wavelength

**Bi-simulations of all JUNO solids, with millions of photons**

- mis-aligned histories
  - mostly < 0.25%, < 0.50% for largest solids
- deviant photons within matched history
  - < 0.05% (500/1M)
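
With both simulations recorded into arrays, the two fractions quoted above reduce to element-wise comparisons. The sketch below assumes one integer history code per photon and one final position per photon; the function name, the array layout and the tolerance are hypothetical and do not reflect the actual *OpticksEvents* format.

```python
import numpy as np

def bi_simulation_summary(hist_a, hist_b, pos_a, pos_b, tol=0.1):
    """Return (misaligned fraction, deviant fraction) over all photons."""
    hist_a, hist_b = np.asarray(hist_a), np.asarray(hist_b)
    pos_a, pos_b = np.asarray(pos_a, dtype=float), np.asarray(pos_b, dtype=float)
    misaligned = hist_a != hist_b
    # Deviants: same history in both simulations, but positions differ by more than tol.
    deviation = np.abs(pos_a - pos_b).max(axis=1)
    deviant = ~misaligned & (deviation > tol)
    return float(misaligned.mean()), float(deviant.mean())

# Toy data: photon 2 has a different history, photon 0 matches but lands elsewhere.
hist_a = [1, 2, 3, 4]
hist_b = [1, 2, 9, 4]
pos_a = np.zeros((4, 3))
pos_b = np.zeros((4, 3))
pos_b[0, 0] = 1.0
pos_b[2, 1] = 5.0
misaligned_fraction, deviant_fraction = bi_simulation_summary(hist_a, hist_b, pos_a, pos_b)
```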

**Primary sources of problems**

- grazing incidence, edge skimmers
- incidence at constituent solid boundaries

**Primary cause : float vs double**

*Geant4* uses *double* everywhere, *Opticks* only sparingly (observed *double* costing 10x slowdown with RTX)

**Conclude**

- neatly oriented photons more prone to issues than realistic ones
- perfect "technical" matching not feasible
- instead shift validation to more realistic full detector "calibration" situation

Test Hardware + Software

**Hardware**

- DELL Precision 7920T Workstation
- Intel Xeon Gold 5118, 2.3GHz, 48 cores, 62G
- NVIDIA Quadro RTX 8000 (48G)

**Software**

- Opticks 0.0.0 Alpha
- Geant4 10.4p2
- NVIDIA OptiX 6.5.0
- NVIDIA Driver 435.21
- CUDA 10.1

**Full JUNO Geometry j1808v5**

- "calibration" source genstep at center of scintillator

**Production Mode : does the minimum**

- only saves hits
- skips : genstep, photon, source, record, sequence, index, ..
- no *Geant4* propagation (other than at 1M for extrapolation)

**Multi-Event Running** : measure interval and launch

interval : **avg time between successive launches**, including:

- upload gensteps
- launch : avg **photon generation + propagation** time
- download hits

for loaning the card

Photon Launch Size : VRAM Limited

**NVIDIA Quadro RTX 8000 (48 GB)**

- photon 4*4 floats : 64 bytes
- curandState : 48 bytes

**400M photons** x 112 bytes ~ 45G
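
The launch-size arithmetic can be checked directly. The sketch counts only the two per-photon buffers listed above and, as an assumption, treats 48 GB as 48e9 bytes; any other GPU allocations would reduce the usable photon count further.

```python
PHOTON_BYTES = 4 * 4 * 4      # 4*4 floats of 4 bytes each
CURAND_STATE_BYTES = 48       # per-photon random number generator state
BYTES_PER_PHOTON = PHOTON_BYTES + CURAND_STATE_BYTES

def launch_bytes(n_photons):
    """Bytes needed for the photon and curandState buffers of one launch."""
    return n_photons * BYTES_PER_PHOTON

def max_photons(vram_bytes):
    """Upper bound on launch size if only these two buffers occupied VRAM."""
    return vram_bytes // BYTES_PER_PHOTON

needed = launch_bytes(400_000_000)          # 44.8e9 bytes, the "~45G" quoted above
ceiling = max_photons(48_000_000_000)       # assumes 48 GB means 48e9 bytes
```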

Genstep/Hit Copying Overheads

- **launch** : time of each OptiX launch (avg of 10)
- **interval, including overhead** : time between subsequent launches (avg of 9)

Mostly < 10% Overhead beyond 20M photons

JUNO Full, 400M photons from center | Time |
---|---|
Geant4 Extrap. | 95,600 s (26 hrs) |
Opticks RTX ON (i) | 58 s |

JUNO Full, 400M photons from center | Time | Speedup |
---|---|---|
Opticks RTX ON (i) | 58 s | x1660 |
Opticks RTX OFF (i) | 275 s | x348 |
Geant4 Extrap. | 95,600 s (26 hrs) | |

5x Speedup from RTX with full JUNO geometry

100M photon RTX times, avg of 10

Launch times for various geometries

Geometry | Launch (s) | Giga Rays/s | Relative to ana |
---|---|---|---|
JUNO ana | 13.2 | 0.07 | |
JUNO tri.sw | 6.9 | 0.14 | 1.9x |
JUNO tri.hw | 2.2 | 0.45 | 6.0x |
Boxtest ana | 0.59 | 1.7 | |
Boxtest tri.sw | 0.62 | 1.6 | |
Boxtest tri.hw | 0.30 | 3.3 | 1.9x |

- ana : Opticks analytic CSG (SM)
- tri.sw : software triangle intersect (SM)
- tri.hw : hardware triangle intersect (RT)

JUNO 15k triangles, 132M without instancing

**Simple Boxtest geometry gets into ballpark**

- NVIDIA claim : 10 Giga Rays/s with RT Core
- -> **1 Billion photons per second**

**RT cores : built-in triangle intersect + 1-level of instancing**

- flatten scene model to avoid SM<->RT roundtrips ?

OptiX Performance Tools and Tricks, David Hart, NVIDIA https://developer.nvidia.com/siggraph/2019/video/sig915-vid

NVIDIA OptiX 7 : Entirely new API

- introduced August 2019
- low-level CUDA-centric thin API
- near perfect scaling to 4 GPUs, for free

**JUNO+Opticks into Production**

- optimize geometry modelling
- full JUNO geometry validation iteration
- offline integration
- optimize GPU cluster throughput:
  - multi-GPU strategy, split/join events to fit VRAM
- support OptiX 7, find multi-GPU load balancing approach

**Geant4+Opticks Integration : Work with Geant4 Collaboration**

- finalize *Geant4+Opticks* extended example
  - aiming for *Geant4* distrib
- prototype *Genstep* interface inside *Geant4*
  - avoid customizing *G4Cerenkov* *G4Scintillation*

**Expand Community : Webinars, Conference Tutorials ?**

- geometry translation help : NEXO, DUNE, LZ
- interest -> usage : SABRE, Baikal GVD, KM3Net, MicroBooNE
- expand interest : scintillator using medical imaging companies

Highlights 2019

- Profit from hardware accelerated ray tracing
- **Opticks > 1000x Geant4** (single Turing GPU)
- more photons -> more overall speedup
- 99% -> 100x
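
One reading of "99% -> 100x" is Amdahl's law: if optical photons account for 99% of the CPU time and only that part is accelerated, the overall speedup cannot exceed 1 / (1 - 0.99) = 100. The sketch below applies that formula; the function name is hypothetical, and pairing it with the x1660 figure from the performance table is an illustration rather than a result quoted in the slides.

```python
import math

def overall_speedup(optical_fraction, optical_speedup):
    """Amdahl's law: whole-job speedup when only the optical photon part is accelerated."""
    remaining = (1.0 - optical_fraction) + optical_fraction / optical_speedup
    return 1.0 / remaining

with_opticks = overall_speedup(0.99, 1660.0)   # about 94x
upper_limit = overall_speedup(0.99, math.inf)  # 100x as the optical part becomes free
```

The larger the optical photon share of the original workload, the closer the overall gain gets to the photon-only speedup.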

Opticks: state-of-the-art GPU ray tracing applied to optical photon simulation and integrated with *Geant4* to eliminate memory and time bottlenecks.

- Drastic speedup -> better detector understanding -> greater precision

any simulation limited by optical photons can benefit

Link | Description |
---|---|
https://bitbucket.org/simoncblyth/opticks | code repository |
https://simoncblyth.bitbucket.io | presentations and videos |
https://groups.io/g/opticks | forum/mailing list archive |
email:opticks+subscribe@groups.io | subscribe to mailing list |