Opticks : GPU Accelerated Optical Photon Simulation for JUNO and Other Experiments

Open source, https://github.com/simoncblyth/opticks

Simon C Blyth, IHEP, CAS — 28th CHEP Conf, Chulalongkorn Univ, Bangkok, Thailand — (27 May 2026)


Outline

newtons-opticks.png
 

(JUNO) Optical Photon Simulation Problem...

Opticks solves this using GPU ray tracing via NVIDIA OptiX


Optical Photon Simulation ≈ Ray Traced Image Rendering

simulation
photon parameters at sensors (PMTs)
rendering
pixel values at image plane

Much in common : geometry, light sources, optical physics

Many Applications of ray tracing :


NVIDIA RTX PRO 6000 Blackwell

Blackwell(4th Gen RTX) vs Ada(3rd Gen RTX):

Server Edition


Four NVIDIA RTX Gen : Ray Tracing : ~2x every ~2 yrs

Gen        Model            Year  VRAM  VRAM    CUDA    RT     RT      RT
                                  [GB]  [GB/s]  Cores   Cores  TFLOPS  Rise
Turing     Quadro RTX 6000  2018  24    672     4,608   72     ~34     [1]
Ampere     RTX A6000        2020  48    768     10,752  84     ~76     2.2x
Ada        RTX 6000 Ada     2023  48    960     18,176  142    ~211    2.7x
Blackwell  RTX PRO 6000     2025  96    1792    24,064  188    ~380    1.8x

NVIDIA RT TFLOPS: synthetic ray trace metric -- Turing -> Blackwell : 11x

RT TFLOPS = (Equivalent FLOPs per Ray Intersection) x (Intersections per clock) x (Core Clock) x (Number of RT Cores)

[1] baseline : 2018 "World's First Ray-Tracing GPU -- 10 Gigarays/sec"
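
The "Rise" figures can be checked with a few lines of arithmetic. The sketch below copies the approximate RT TFLOPS values from the table and divides each generation by an earlier one. The names `RtxGen`, `rtx_generations` and `rise` are made up for this illustration, and the results agree with the table only to within the rounding of the approximate (~) inputs.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Approximate RT TFLOPS per generation, copied from the table above.
struct RtxGen { const char* name; int year; double rt_tflops; };

inline std::vector<RtxGen> rtx_generations()
{
    return {
        {"Turing",    2018,  34.0},
        {"Ampere",    2020,  76.0},
        {"Ada",       2023, 211.0},
        {"Blackwell", 2025, 380.0},
    };
}

// Ratio of RT TFLOPS of generation `later` over generation `earlier`.
inline double rise(const std::vector<RtxGen>& g, std::size_t later, std::size_t earlier)
{
    assert(later < g.size() && earlier < g.size() && g[earlier].rt_tflops > 0.0);
    return g[later].rt_tflops / g[earlier].rt_tflops;
}
```

Calling `rise(g, 3, 0)` on the returned vector gives the Turing -> Blackwell factor of about 11, while `rise(g, 1, 0)`, `rise(g, 2, 1)` and `rise(g, 3, 2)` give the per-generation steps.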


AB_Substamp_ALL_Etime_vs_Photon_rtx_gen1_gen3.png

Event Time(s) vs PH(M)
PH(M) G1 G3 G1/G3
1 0.47 0.14 3.28
10 0.44 0.13 3.48
20 4.39 1.10 3.99
30 8.87 2.26 3.93
40 13.29 3.38 3.93
50 18.13 4.49 4.03
60 22.64 5.70 3.97
70 27.31 6.78 4.03
80 32.24 7.99 4.03
90 37.92 9.33 4.06
100 41.93 10.42 4.03

Optical simulation 4x faster 1st->3rd gen RTX

3rd gen Ada : 100M ph sim. in 10s [TMM PMT model, Custom CSG]

Opticks optical simulation speed directly scales with ray tracing speed.

TMM : Transfer-Matrix Method


NVIDIA® OptiX™ Ray Tracing Engine -- Accessible GPU Ray Tracing

OptiX makes GPU ray tracing accessible

OptiX features

User provides (Green):

Latest Release : NVIDIA® OptiX™ 9.1.0 (Dec 2025)


Geant4 + Opticks + NVIDIA OptiX : Hybrid Workflow

Opticks enables Geant4 based simulation to offload optical photon simulation to the GPU

NVIDIA GPU ray tracing of billions[1] of rays per second applied to optical simulation

[1] Actual performance depends on geometry and its modelling, JUNO optical simulation speedups > 1000x Geant4 have been measured


amdahl300


GEOM_J25_4_0_opticks_Debug_cxr_min_muon_cxs_20250707_112242.png

EVT=muon_cxs cxr_min.sh #12 : photons from muon crossing JUNO Scintillator


GEOM_J25_4_0_opticks_Debug_cxr_min_muon_cxs_20250707_112243.png

EVT=muon_cxs cxr_min.sh #13


GEOM_J25_4_0_opticks_Debug_cxr_min_muon_cxs_20250707_112244.png

EVT=muon_cxs cxr_min.sh #14


Recent Opticks Enhancements : directed by Muon Production Experience

Added Opticks "lite" photons : used with JUNOSW "Muon" hits (--pmt-hit-type 2)

Removed 32-bit limits on photon count -> simulation of events with billions of optical photons

Added CUDA implementation of hit merging (thrust::sort_by_key, thrust::reduce_by_key)


GPU Hit Merging : High Level Parallelization with CUDA Thrust

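struct key_functor {   //  Bitwise-OR (pmtid,timebucket) into one 64-bit sort key
  float    timewindow; //  width of one time bucket
  uint64_t operator()(const sphotonlite& p) const // 16+48 = 64
  {
     // pmtid in the top 16 bits, time bucket index in the low 48 bits
     return (uint64_t(p.identity()) << 48) | uint64_t(p.time/timewindow);
  }
};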

Opticks/sysrap SPM::merge_partial_select using CUDA Thrust (higher level C++ way to use CUDA)

Thrust method   Action                  Note
copy_if         photon -> hit           using flagmask
transform       hit -> key              bitwise-OR (pmtid, timebucket)
sort_by_key     hit, key -> hit         hits ordered so that those with same (pmtid,timebucket) are contiguous
reduce_by_key   hit, key -> hitmerged   merge two hits : earlier time, sum hitcount
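
The four steps in the table can be followed on the CPU with standard library algorithms. This is only an illustrative analogue, not the Thrust code in SPM.cu: `PhotonLite` is a hypothetical stand-in for sphotonlite with just the fields needed here, `make_key` and `merge_hits` are names invented for the sketch, and the hit selection rule (all bits of a mask present in the flagmask) is an assumption.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <iterator>
#include <vector>

// Hypothetical stand-in for sphotonlite: only the fields this sketch needs.
struct PhotonLite {
    uint16_t pmtid;     // sensor identity, top 16 bits of the key
    float    time;      // hit time, assumed non-negative
    uint32_t flagmask;  // history flags
    uint32_t hitcount;  // number of photons merged into this hit
};

// Same packing as key_functor: (pmtid << 48) | timebucket.
inline uint64_t make_key(const PhotonLite& p, float timewindow)
{
    return (uint64_t(p.pmtid) << 48) | uint64_t(p.time / timewindow);
}

// CPU analogue of copy_if -> transform -> sort_by_key -> reduce_by_key.
inline std::vector<PhotonLite> merge_hits(const std::vector<PhotonLite>& photons,
                                          uint32_t hitmask, float timewindow)
{
    assert(timewindow > 0.f);

    // copy_if : photon -> hit (assumed rule: all hitmask bits are set)
    std::vector<PhotonLite> hits;
    std::copy_if(photons.begin(), photons.end(), std::back_inserter(hits),
                 [=](const PhotonLite& p) { return (p.flagmask & hitmask) == hitmask; });

    // transform + sort_by_key : hits with equal (pmtid, timebucket) become contiguous
    std::stable_sort(hits.begin(), hits.end(),
                     [=](const PhotonLite& a, const PhotonLite& b) {
                         return make_key(a, timewindow) < make_key(b, timewindow);
                     });

    // reduce_by_key : merge each run of equal keys, earlier time, summed hitcount
    std::vector<PhotonLite> merged;
    for (const PhotonLite& h : hits) {
        if (!merged.empty() &&
            make_key(merged.back(), timewindow) == make_key(h, timewindow)) {
            PhotonLite& m = merged.back();
            m.time = std::min(m.time, h.time);
            m.hitcount += h.hitcount;
        } else {
            merged.push_back(h);
        }
    }
    return merged;
}
```

With a 1 ns `timewindow`, two hits on the same PMT at 10.2 ns and 10.8 ns fall into the same bucket and merge into one hit with time 10.2 ns and hitcount 2, while a hit at 11.1 ns stays separate.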

https://github.com/simoncblyth/opticks/blob/master/sysrap/SPM.cu

https://github.com/simoncblyth/opticks/blob/master/sysrap/sphotonlite.h


GPU Hit Merging : Avoids hiding Opticks performance

Simulation times (excl. init) for one double muon event, ~150M photons, 28M hit, 6.4M mergedHit, 1ns bucket merge

JUNOSW           Standard full PMT hit   summary "muon" hit
                 7112 s (118min)         6904 s (115min)

Opticks+JUNOSW
hit_mode        merge  ph-lite  Simulate    (GPU merge)+  (CPU merge)+    Total [s]  Speedup
                                Kernel [s]  Download [s]  Collection [s]             vs std
hit             CPU    no       22.996      1.949         190.445         215.560    x32
hitlite         CPU    yes      23.108      0.484         146.471         170.226    x31
hitmerged       GPU    no       22.988      0.543         6.712           30.400     x233
hitlitemerged   GPU    yes      23.097      0.181         0.403           23.835     x221

Opticks+J : overall speedup > x200 [~2 hrs → ~30 s]


Scaling Opticks

Geant4 + Opticks + NVIDIA OptiX : Production Scaling ?

=> Client-Server architecture


Geant4 + Opticks + NVIDIA OptiX : Hybrid Workflow 2x2 ?

Geant4 + Opticks + NVIDIA OptiX : Monolith x4 ?


Geant4 + Opticks + NVIDIA OptiX : Hybrid Workflow 4x4 ?

Geant4 + Opticks + NVIDIA OptiX : Monolith x16 ?

"Monolithic" scaling : very inefficient use of scarce GPU resources


OpticksClients + OpticksService : Share GPUs


Client.png


Opticks "Optical Core" => Server, "Periphery" => Client

Package Role Client Server
SysRap Geometry and event types, array NP.hh
CSG CPU/GPU geometry model
QUDARap CUDA optical simulation
CSGOptiX OptiX 7+ geometry, GPU ray trace
U4 Geometry convert, collect gensteps, return hits
G4CX Top level interface, acts via SSimulator
  Client Server
depends: NVIDIA GPU + CUDA + OptiX Geant4, U4, G4CX
depends: libcurl 7.76.1+, NP_CURL.h python, FastAPI, nanobind
SSimulator: SOpticksClientSimulator CSGOptiX

Client build from common Opticks codebase with OPTICKS_CONFIG=Client


NP_CURL.h : Array transport via HTTP POST

Basis for Opticks Client - using libcurl 7.76.1+ (2021) - default in many Linux distros

NP* NP_CURL::transformRemote( NP* a, size_t index, size_t count )
HTTP POST array to endpoint, receive array in response, metadata in headers

NP.hh : C++ array with NumPy serialization (Opticks numerical base), NP_CURL.h headers:

HTTP metadata header   Note
x-opticks-shape        array shape, eg "(10,6,4)" for 10 gensteps
x-opticks-dtype        eg "float32"
x-opticks-index        eg eventID, controlling random number stream offsets
x-opticks-count        eg number of photons in genstep, cost of request
x-opticks-meta         general, eg geometry root node digest - assert same geometry
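
As an illustration of the header scheme, the sketch below builds the five metadata headers for one request as plain strings. It does not use NP_CURL.h or libcurl: `format_shape` and `make_headers` are hypothetical helpers, and only the header names and the "(10,6,4)" shape format are taken from the table.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <sstream>
#include <string>
#include <vector>

// Format an array shape as in the x-opticks-shape example, eg "(10,6,4)".
inline std::string format_shape(const std::vector<std::size_t>& shape)
{
    std::ostringstream ss;
    ss << "(";
    for (std::size_t i = 0; i < shape.size(); ++i) {
        if (i > 0) ss << ",";
        ss << shape[i];
    }
    ss << ")";
    return ss.str();
}

// Hypothetical helper: collect the metadata headers for one request.
inline std::map<std::string, std::string> make_headers(
    const std::vector<std::size_t>& shape, const std::string& dtype,
    std::size_t index, std::size_t count, const std::string& meta)
{
    assert(!shape.empty());
    return {
        {"x-opticks-shape", format_shape(shape)},
        {"x-opticks-dtype", dtype},
        {"x-opticks-index", std::to_string(index)},  // eg eventID
        {"x-opticks-count", std::to_string(count)},  // eg photons in the gensteps
        {"x-opticks-meta",  meta},                   // eg geometry digest
    };
}
```

Keeping the metadata in headers leaves the POST body free to carry the raw array bytes, so a server can size its work from `x-opticks-count` before reading the payload.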

HTTP 429, 503 : Too Many Requests, Service Unavailable (temporary downtime)
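
One way a client might react to these two status codes is to retry with an increasing delay. The sketch below is an assumption about client behaviour, not what NP_CURL.h does: the transport is passed in as a callable so that no libcurl call is implied, and `post_with_retry`, `RetryResult` and the doubling delay are all choices made for this illustration.

```cpp
#include <cassert>
#include <functional>

// Outcome of a retry loop: last HTTP status seen and number of attempts made.
struct RetryResult { int status; int attempts; };

// True for the two temporary conditions: 429 and 503.
inline bool is_retryable(int http_status)
{
    return http_status == 429 || http_status == 503;
}

// Hypothetical retry loop. `post` performs one request and returns the HTTP
// status; `wait` is called with the delay in milliseconds before each retry.
inline RetryResult post_with_retry(const std::function<int()>& post,
                                   const std::function<void(int)>& wait,
                                   int max_attempts, int base_delay_ms)
{
    assert(max_attempts > 0 && base_delay_ms >= 0);
    RetryResult r{0, 0};
    int delay_ms = base_delay_ms;
    while (r.attempts < max_attempts) {
        r.status = post();
        ++r.attempts;
        if (!is_retryable(r.status)) break;   // success or a non-temporary error
        if (r.attempts < max_attempts) {
            wait(delay_ms);                   // back off before the next attempt
            delay_ms *= 2;                    // exponential backoff
        }
    }
    return r;
}
```

Passing `wait` as a callable keeps the loop testable without sleeping. A real client would typically sleep there, or resume Geant4 work and come back to the request later.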

https://github.com/simoncblyth/np/blob/master/NP_CURL.h -- https://github.com/simoncblyth/np/blob/master/NP.hh


Opticks Server Prototype : python + FastAPI + nanobind + CSGOptiX

CSGOptiX/tests/CSGOptiXService_FastAPI_test/CSGOptiXService_FastAPI_test.sh

Prototype client + service operational

"Roll your own" Prototype Server : Educational, BUT:


NVIDIA Triton Inference Server (aka Dynamo-Triton)

Triton[1] : open-source, designed to accelerate AI deployment at scale

Wrap Opticks as Custom C++ Triton Backend "model(s)" ?

Make request load like inference : smaller, more uniform

Make requests during genstep collection :
  • => tunable max_slots
  • decouple compute from physics
High GPU utilization => concurrency :
  • async CUDA Opticks
  • async CUDA memory pools [2]
Robust Server (MC campaign):
  • static flat VRAM
  • backpressure, signal client retry
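
The idea of issuing smaller, more uniform requests during genstep collection could look roughly like the sketch below. It assumes that max_slots caps the number of photons per request, which is an interpretation of the bullet above and not a confirmed definition. `Genstep` and `slice_requests` are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical genstep summary: only the photon count matters here.
struct Genstep { std::size_t numphoton; };

// Group gensteps, in collection order, into requests of at most max_slots
// photons each. A single genstep larger than max_slots gets a request of its
// own (splitting within a genstep is left out of this sketch).
inline std::vector<std::vector<Genstep>> slice_requests(
    const std::vector<Genstep>& gensteps, std::size_t max_slots)
{
    assert(max_slots > 0);
    std::vector<std::vector<Genstep>> requests;
    std::vector<Genstep> current;
    std::size_t photons = 0;
    for (const Genstep& gs : gensteps) {
        if (!current.empty() && photons + gs.numphoton > max_slots) {
            requests.push_back(current);   // flush: next genstep would overflow
            current.clear();
            photons = 0;
        }
        current.push_back(gs);
        photons += gs.numphoton;
    }
    if (!current.empty()) requests.push_back(current);
    return requests;
}
```

Because a request is flushed as soon as the next genstep would overflow it, the request size depends on max_slots and not on how many photons a particular physics event happens to produce.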

[1] https://developer.nvidia.com/dynamo-triton

[2] cudaMallocFromPoolAsync