Opticks : SEMINAR : Optical Photon Simulation via GPU Ray Tracing from NVIDIA OptiX

The Story of Opticks, applying NVIDIA OptiX GPU ray tracing to Optical Photon Simulation

Open source, https://bitbucket.org/simoncblyth/opticks

Simon C Blyth, IHEP, CAS — IHEP EPD Seminar — 18 April 2024


Outline : Opticks First Incarnation

newtons-opticks.png
 


Outline : Opticks Reborn

newtons-opticks.png

JUNO Optical Photon Simulation Problem...


Optical photons limit many simulations => lots of interest in Opticks

EXPT Reactor neutrino
Daya Bay neutrino oscillations
JUNO mass heirarchy + oscillations => NVIDIA CN Contacts
  Long baseline neutrino beam
DUNE FermiLab->Sanford, LAr TPC, => Assistance from Fermilab Geant4 Group
  Neutrinoless double beta decay, dark matter, other search
LZ LUX-ZEPLIN dark matter experiment, Sandford => NVIDIA US Contacts
LEGEND Large Enriched Germanium Experiment, Gran Sasso/SNOLAB
SABRE dark matter direct-detection, Australia
AMoRE Mo-based Rare process Experiment, S.Korea
nEXO next Enriched Xenon Observatory, LLNL
  Neutrino telescope
KM3Net Cubic Kilometre Neutrino Telescope, Mediterranean
IceCube IceCube Neutrino Observatory, South Pole
  Air shower : gamma-ray and cosmic-ray observatory
LHAASO Large High Altitude Air Shower Observatory, Sichuan
  Accelerator
LHCb-RICH LHCb ring imaging Cherenkov sub-detector, CERN => NVIDIA EU Contacts

Optical Photon Simulation ≈ Ray Traced Image Rendering

simulation
photon parameters at sensors (PMTs)
rendering
pixel values at image plane

Much in common : geometry, light sources, optical physics

Many Applications of ray tracing :


Rasterization vs Ray-tracing

nvidia/nv_rasterization.png nvidia/nv_raytrace.png

Spatial Index Acceleration Structure


CPU vs GPU architectures, Latency vs Throughput

nvidia/cpu_vs_gpu_architecture.png

Waiting for memory read/write, is major source of latency...

CPU : latency-oriented : Minimize time to complete single task : avoid latency with caching
  • complex : caching system, branch prediction, speculative execution, ...
GPU : throughput-oriented : Maximize total work per unit time : hide latency with parallelism
  • many simple processing cores, hardware multithreading, SIMD (single instruction multiple data)
  • simpler : lots of compute (ALU), at expense of cache+control
  • design assumes abundant parallelism

Effective use of Totally different processor architecture -> Total reorganization of data and computation

Understanding Throughput-oriented Architectures https://cacm.acm.org/magazines/2010/11/100622-understanding-throughput-oriented-architectures/fulltext


Understanding GPU Graphical Origins -> Effective GPU Computation

GPUs evolved to rasterize 3D graphics at 30/60 fps

Simple Array Data Structures (N-million,4)

Constant "Uniform" 4x4 matrices : scaling+rotation+translation

Graphical Experience Informs Fast Computation on GPUs


How to Make Effective Use of GPUs ? Parallel / Simple / Uncoupled

Abundant parallelism
  • many thousands of tasks (ideally millions)
Low register usage : otherwise limits concurrent threads
  • simple kernels, avoid branching
Little/No Synchronization
  • avoid waiting, avoid complex code/debugging
Minimize CPU<->GPU copies
  • reuse GPU buffers across multiple CUDA launches
1px.png

How Many Threads to Launch ?

Understanding Throughput-oriented Architectures https://cacm.acm.org/magazines/2010/11/100622-understanding-throughput-oriented-architectures/fulltext

NVIDIA Titan V: 80 SM, 5120 CUDA cores


Opticks pre-History : Chroma investigations

Experimented with Chroma to speedup Daya Bay optical sim:

Chroma : Disadvantages

Chroma : Fundamental Problem, triangles only


chroma_camera_raycast

(g4daeview.py) Chroma Raycast of Daya Bay geometry (3x3 CUDA kernel lunches, 1.8s for 1.23M pixels, Geforce 750M GPU)


G4DAE : DYB pool bottom, Chroma raycast

(g4daeview.py) Chroma raycast render of triangulated geometry


G4DAE : DYB pool bottom, OpenGL render

(g4daeview.py) OpenGL rasterized render of triangulated geometry


Triangulated Geometry : Great for Visualization

Many apps/libs can view/edit DAE/COLLADA files

Triangle Visualization Advantage

CUDA/OpenGL interoperation


Daya Bay Chroma Photon Propagation (1)

(g4daeview.py) Chroma GPU photon propagation at 12 nanoseconds.  The photons are generated by Geant4 simulation of a 100 GeV muon travelling from right to left. Photon colors indicate reemission (green), absorption(red), specular reflection (magenta), scattering(blue), no history (white).


Daya Bay Chroma Photon Propagation (2)

(g4daeview.py) Chroma GPU photon propagation at 14 nanoseconds. The interface provides interactive control of the propagation time allowing any stage of the propagation to be viewed by scrubbing time backwards/forwards. The speed of this visualization is achieved by interoperation of CUDA kernels and OpenGL shaders accessing the same GPU resident photon propagation data.


Daya Bay Chroma Photon Propagation (3)

(g4daeview.py) Initial photon positions of a Geant4 simulated muon that crosses between the Dayabay Near hall ADs. Colors represent photon wavelengths. Optical photons: collected in G4 StackAction, serialized, sent over ZeroMQ, deserialized, presented using OpenGL GLSL shaders.


Opticks History : Chroma + G4DAE -> NVIDIA OptiX + "Opticks"

Why switch to NVIDIA OptiX ?

DBNS geometry raycast comparison using mobile GPU

Performance improvement ~50x

"Opticks" started as synthesis:

Package name "Opticks", taken from world changing publication:

  1. Sir Isaac Newton FRS "Opticks: or, A Treatise of the Reflexions, Refractions, Inflexions and Colours of Light."

NVIDIA® OptiX™ Ray Tracing Engine -- http://developer.nvidia.com/optix

OptiX makes GPU ray tracing accessible

NVIDIA expertise:

https://developer.nvidia.com/rtx

User provides (Yellow):

[1] Turing+ GPUs eg NVIDIA TITAN RTX


GGeoView

(GGeoView) Cerenkov photons from an 100 GeV muon travelling from right to left across Dayabay AD. Primaries are simulated by Geant4, Cerenkov "steps" of the primaries are transferred to the GPU. The dots represent OptiX calculated first intersections of GPU generated photons with colors corresponding to material boundaries: (red) GdDopedLS/Acrylic (green) LiquidScintillator/Acrylic, (blue) Acrylic/LiquidScintillator, (white) IwsWater:UnstStainlessSteel, (grey) others. The red lines represent the positions and directions of the "steps" with an arbitrary scaling for visibility.


Opticks History : Handling Huge Geometry (eg JUNO)

Opticks History : Handling Huge Geometry (eg JUNO) with instancing

Instancing in OptiX and OpenGL avoids repetition of geometry data on GPU for repeated elements (eg PMTs). [Image is composite of OpenGL rasterized event representation and OptiX raytraced triangulated geom]


Ray Intersection with Transformed Object -> Geometry Instancing

instancing/ray_intersection_in_two_spaces_p308_shirley_ch13_half.png

Fig 13.5 "Realistic Ray Tracing", Peter Shirley

Advantages apply equally to acceleration structures

Equivalent Intersects -> same t

  1. ray with ellipsoid : M*p
  2. M-1 ray with sphere : p

Local Frame Advantages

  1. simpler intersect (sphere vs ellipsoid)
  2. closer to origin -> better precision

Geometry Instancing Advantages

Requirements


Opticks History : Triangulated Geometry Problems

triangulated geometry : not practical for general simulation, but very useful for fast visualization

../graphics/ggeoview/dpib-triangulated-pmt.png ../graphics/ggeoview/dpib-test-disco-ball.png

Opticks History : Analytic PMT (12 parts, not 2928 triangles)

NVIDIA OptiX provided no intersection (it just accelerated intersection)

Partition PMT at constituent joins (semi-manually)

../nuwa/detdesc/pmt/hemi-pmt-parts.png

Daya Bay Opticks Propagation : Triangulated + Analytic PMT

Daya Bay Opticks Propagation : Triangulated geometry with Analytic PMT [composite OptiX raytrace geometry + OpenGL rasterized Cerenkov photons]


Opticks : GPU Geometry starts from ray-primitive intersection

tboolean_parade_sep2017.png

CUDA/OptiX intersection for ~10 primitives -> Exact geometry translation


Torus : much more difficult/expensive than other primitives

3D parametric ray : ray(x,y,z;t) = rayOrigin + t * rayDirection

High order equation

Best Solution : replace torus


Torus : different artifacts as change implementation/params/viewpoint

torus_cloud_artifact_2017_08_14.png torus_cuts_artifact_2017_08_08.png torus_fan_artifact_2017_07_28.png

Constructive Solid Geometry (CSG)

Simple by construction definition, implicit geometry.

CSG expressions

3D Parametric Ray : ray(t) = r0 + t rDir

Ray Geometry Intersection

How to pick exactly ?


CSG : Which primitive intersect to pick ?

Classical Roth diagram approach

Computational requirements:

BUT : High performance on GPU requires:

Classical approach not appropriate on GPU


Developed GPU CSG Impl. based on short note with the idea

kensler/frontpage.png kensler/xrt_wikidot_csg.png

Ray intersection with general CSG binary trees, on GPU

Pick between pairs of nearest intersects, eg:

UNION tA < tB Enter B Exit B Miss B
Enter A ReturnA LoopA ReturnA
Exit A ReturnA ReturnB ReturnA
Miss A ReturnB ReturnB ReturnMiss
[1] Ray Tracing CSG Objects Using Single Hit Intersections, Andrew Kensler (2006)
with corrections by author of XRT Raytracer http://xrt.wikidot.com/doc:csg
[2] https://bitbucket.org/simoncblyth/opticks/src/tip/optixrap/cu/csg_intersect_boolean.h
Similar to binary expression tree evaluation using postorder traverse.

CSG Complete Binary Tree Serialization -> simplifies GPU side

Geant4 solid -> CSG binary tree (leaf primitives, non-leaf operators, 4x4 transforms on any node)

Serialize to complete binary tree buffer:

Height 3 complete binary tree with level order indices:

                                                   depth     elevation

                     1                               0           3

          10                   11                    1           2

     100       101        110        111             2           1

 1000 1001  1010 1011  1100 1101  1110  1111         3           0

postorder_next(i,elevation) = i & 1 ? i >> 1 : (i << elevation) + (1 << elevation) ; // from pattern of bits

Postorder tree traverse visits all nodes, starting from leftmost, such that children are visited prior to their parents.


Opticks Analytic Daya Bay Near Site, GPU Raytrace

Pure analytic CSG Daya Bay near geometry, auto-converted from Geant4 to Opticks GPU geometry, NVIDIA OptiX GPU raytrace render [no triangles]


j1808_top_rtx

Pure analytic CSG JUNO geometry, auto-converted from Geant4 to Opticks GPU geometry, NVIDIA OptiX GPU raytrace render [no triangles] (GGeoView)


j1808_top_ogl

Approximate triangulated JUNO geometry [note impingement of torus guide tube and acrylic "sphere"], OpenGL rasterized render (GGeoView)


Deep CSG tree from complex solids, can cause performance issue

lvidx_57_esr_deep_tree_disc_with_holes.png x016_deeptree.png

Subtraction of thin CSG_CYLINDER -> speckle in the hole

CSG_DISC implemented to handle disc like cylinders : intersects at middle (z1+z2)/2 and offsets, avoids issue

../opticks_refs/speckle_lvTopESR.png ../opticks_refs/tboolean_esr_speckle_centered_on_pole_delta_10-3.png

Coincident Faces are Primary Cause of Issues : Fake Intersects

Coincidences common (alignment too tempting?). To fix:

../opticks_refs/opticks_tscan_29_nzero_5_OcrGdsPrt.png

Debugging Coincident Subtractions

Switching subtraction into union with complemented -> can see whats subtracted.

../opticks_refs/lvidx_69_ring_box_cuts_artifact.png

Opticks : translates G4 geometry to GPU, without approximation

Geant4 Solids+Volumes -> Opticks CSG,GGeo -> GPU

Structure of Volumes


Opticks Instancing : "Factorizes" Geometry

Structural volumes vs solid shapes
distinction for convenience only, distinction is movable

JUNO: ~300,000 GVolume : mostly small repeated groups (PMTs)

GGeo/GInstancer

  1. GVolume progeny digest : shapes+transforms -> subtree ident.
  2. find repeated digests, disqualifying repeats inside others
  3. label all nodes with repeat index, non-repeated remainder : 0

For each repeat+remainder create GMergedMesh:

GMergedMesh -> IAS+GAS

https://bitbucket.org/simoncblyth/opticks/src/master/ggeo/GInstancer.hh


Opticks : Translate Geant4 Optical Physics to GPU (OptiX/CUDA)

OptiX : single-ray programming model -> line-by-line translation

CUDA Ports of Geant4 classes
  • G4Cerenkov (only generation loop)
  • G4Scintillation (only generation loop)
  • G4OpAbsorption
  • G4OpRayleigh
  • G4OpBoundaryProcess (only a few surface types)
Modify Cerenkov + Scintillation Processes
  • collect genstep, copy to GPU for generation
  • avoids copying millions of photons to GPU
Scintillator Reemission
  • fraction of bulk absorbed "reborn" within same thread
  • wavelength generated by reemission texture lookup
Opticks (OptiX/Thrust GPU interoperation)
  • OptiX : upload gensteps
  • Thrust : seeding, distribute genstep indices to photons
  • OptiX : launch photon generation and propagation
  • Thrust : pullback photons that hit PMTs
  • Thrust : index photon step sequences (optional)

Optical Photon Simulation : Deciding history on way to boundary

  1. intersect ray with geometry -> distance to boundary
  2. lookup absorption length, scattering length for material depending on wavelength
    • Opticks uses GPU texture interpolation
  3. "role dice" : characteristic lengths -> stochastic distances

Pick winning process from smallest distance

boundary_distance = from_geometry  # no random number needed
absorption_distance = -absorption_length * ln(u0)
scattering_distance = -scattering_length * ln(u1)
## u0, u1 uniform randoms in [0,1] : distances always +ve

If scatter:

  1. pick new photon direction at random
  2. set polarization perpendicular to new direction (transverse) and in same plane as direction and initial polarization
  3. rejection-sampling used to pick new polarization such that angle between old and new follows cos^2 distribution
  4. then repeat from 1.

Theory (eg Rayleigh scattering) -> PDFs used in the simulation


Geant4OpticksWorkflow


Validation of Opticks Simulation by Comparison with Geant4

Bi-simulations of all JUNO solids, with millions of photons

mis-aligned histories
mostly < 0.25%, < 0.50% for largest solids
deviant photons within matched history
< 0.05% (500/1M)

Primary sources of problems

Primary cause : float vs double

Geant4 uses double everywhere, Opticks only sparingly (observed double costing 10x slowdown with RTX)

Conclude


scan-pf-check-GUI-TO-SC-BT5-SD


scan-pf-check-GUI-TO-BT5-SD


Performance : Scanning from 1M to 400M Photons

Full JUNO Analytic Geometry j1808v5

Production Mode : does the minimum

Multi-Event Running, Measure:

interval
avg time between successive launches, including overheads: (upload gensteps + launch + download hits)
launch
avg of 10 OptiX launches

scan-pf-1_NHit










scan-pf-1_Opticks_vs_Geant4 2





JUNO analytic, 400M photons from center Speedup
Geant4 Extrap. 95,600 s (26 hrs)  
Opticks RTX ON (i) 58 s 1650x

scan-pf-1_Opticks_Speedup 2










JUNO analytic, 400M photons from center Speedup
Opticks RTX ON (i) 58s 1650x
Opticks RTX OFF (i) 275s 350x
Geant4 Extrap. 95,600s (26 hrs)  

CHEP 2019 Plenary, Adelaide, Australia


NVIDIA Giveth and NVIDIA Taketh away ...

2006 CUDA 1.0
2009 NVIDIA OptiX 1.0
2018 NVIDIA: "World's first ray tracing GPU" : ray trace dedicated RT cores, RTX, 10 Giga Rays/s
2019 Opticks: [1st Gen. RTX GPU] OptiX 6.5, JUNO analytic: 58s 400M photons (7M photons/s, ~70M rays/s)
2019 NVIDIA OptiX 7.0 : ENTIRELY NEW API => Opticks needs full re-implementation
2021 NVIDIA Engineers assist Opticks dev. for 6->7 in series of seven meetings (LBNL, LZ )
2022 3rd generation RTX : expect > 4x ray trace performance of 1st gen.

NVIDIA OptiX 7 : Entirely new thin API => Full Opticks Re-implementation

NVIDIA OptiX 6->7 : drastically slimmed down

Advantages of 6->7 transition

BUT: demanded full re-implementation of Opticks


NVIDIA® OptiX™ Ray Tracing Engine -- Accessible GPU Ray Tracing

OptiX makes GPU ray tracing accessible

OptiX features

https://developer.nvidia.com/rtx/ray-tracing/optix

User provides (Green):

Same high level model in OptiX 7, everything else new


"Foundry" Model : Shared CPU/GPU Geometry Context

Geometry model designed for CPU/GPU

Simple CPU/GPU intersect headers

https://github.com/simoncblyth/opticks/tree/master/CSG
csg_intersect_tree.h/csg_intersect_node.h/...

GAS : Geometry Acceleration Structure

IAS : Instance Acceleration Structure

CSG : Constructive Solid Geometry


Two-Level Hierarchy : Instance transforms (IAS) over Geometry (GAS)

OptiX supports multiple instance levels : IAS->IAS->GAS BUT: Simple two-level is faster : works in hardware RT Cores

AS
Acceleration Structure
IAS (aka TLAS)
4x4 transforms, refs to GAS
GAS (aka BLAS)
custom primitives : AABB
triangles : vertices, indices
AABB
axis-aligned bounding box

SBT : Shader Binding Table

Flexibly binds together:

  1. geometry objects
  2. shader programs
  3. data for shader programs

Hidden in OptiX 1-6 APIs


Geometry Model Translation : Geant4 => CSGFoundry => NVIDIA OptiX 7

Geant4 Geometry Model (JUNO: 300k PV, deep hierarchy)

PV G4VPhysicalVolume placed, refs LV
LV G4LogicalVolume unplaced, refs SO
SO G4VSolid,G4BooleanSolid binary tree of SO "nodes"

Opticks CSGFoundry Geometry Model (index references)

struct Notes Geant4 Equivalent
CSGFoundry vectors of the below, easily serialized + uploaded + used on GPU None
qat4 4x4 transform refs CSGSolid using "spare" 4th column (becomes IAS) Transforms ref from PV
CSGSolid refs sequence of CSGPrim Grouped Vols + Remainder
CSGPrim bbox, refs sequence of CSGNode, root of CSG Tree of nodes root G4VSolid
CSGNode CSG node parameters (JUNO: ~23k CSGNode) node G4VSolid

NVIDIA OptiX 7 Geometry Acceleration Structures (JUNO: 1 IAS + 10 GAS, 2-level hierarchy)

IAS Instance Acceleration Structures JUNO: 1 IAS created from vector of ~50k qat4 (JUNO)
GAS Geometry Acceleration Structures JUNO: 10 GAS created from 10 CSGSolid (which refs CSGPrim,CSGNode )

JUNO : Geant4 ~300k volumes "factorized" into 1 OptiX IAS referencing ~10 GAS


cxr_overview_emm_t0_elv_t_moi__ALL_with-debug-disable-xj.jpg


cxr_min__eye_-10,0,0__zoom_0.5__tmin_0.1__sChimneyAcrylic_increased_TMAX.jpg

Raytrace render view from inside JUNO Water Buffer


cxr_overview_emm_image_grid_overview

Comparison of ray traced render times of different geometry
simple way to find issues, eg over complex CSG, overlarge BBox

image_grid_elv_scan.jpg

Spot the differences : from volume exclusions


cxr_overview.sh ELV scan 1080x1920 2M (NVIDIA TITAN RTX)

idx -e time(s) relative enabled geometry description
0 t133 0.0077 0.9347 EXCL: sReflectorInCD
1 t37 0.0079 0.9518 EXCL: GLw1.bt08_bt09_FlangeI_Web_FlangeII
2 t74 0.0079 0.9616 EXCL: GZ1.B06_07_FlangeI_Web_FlangeII
...
35 t 0.0083 1.0000 ALL
...
141 t50 0.0097 1.1750 EXCL: GLb1.up01_FlangeI_Web_FlangeII
142 t39 0.0097 1.1751 EXCL: GLw1.bt10_bt11_FlangeI_Web_FlangeII
143 t123 0.0097 1.1753 EXCL: PMT_3inch_inner1_solid_ell_helper
144 t46 0.0097 1.1758 EXCL: GLb1.up05_FlangeI_Web_FlangeII
145 t16 0.0102 1.2320 EXCL: sExpRockBox

Dynamic geometry : excluding volumes of each of 146 solids (after excluding slowest: solidXJfixture)

Small time range suggests no major geometry performance issues remain, after excluding slowest


n-ary CSG Compound "List-Nodes" => Much Smaller CSG trees

Communicate shape more precisely
=> better suited intersect alg => less resources => faster

Generalized Opticks CSG into three levels : tree < node < leaf

Generalizes binary to n-ary CSG trees

CSG_CONTIGUOUS Union
user guarantees contiguous, like G4MultiUnion of prim
CSG_DISCONTIGUOUS Union
user guarantees no overlaps, eg "union of holes" to be CSG subtracted : => simple, low resource intersect
CSG_OVERLAP Intersection
user guarantees overlap, eg general G4Sphere: inner radius, thetacut, phicut

Promising approach to avoid slowdowns from complex CSG solids


CSG_DISCONTIGUOUS Union : CSG intersection

User guarantees : absolutely no overlapping between constituents

 +-------+          +-------+          +-------+          +-------+         +-------+
 |       |          |       |          |       |          |       |         |       |
 |       |          |       |          |       |          |       |         |       |
 +-------+          +-------+          +-------+          +-------+         +-------+

 +-------+          +-------+          +-------+          +-------+         +-------+
 |       |          |       |          |       |          |       |         |       |
 |       |          |       |          |       |          |       |         |       |
 +-------+          +-------+          +-------+          +-------+         +-------+

 

Geant4 + Opticks + NVIDIA OptiX 7 : Hybrid Workflow

https://bitbucket.org/simoncblyth/opticks

Opticks API : split according to dependency -- Optical photons are GPU "resident", only hits need to be copied to CPU memory


Geant4 + Opticks + NVIDIA OptiX 7 : Hybrid Workflow 2


Primary Packages and Structs Of Re-Implemented Opticks

SysRap : many small CPU/GPU headers
  • stree.h,snode.h : geometry base types
  • sctx.h sphoton.h : event base types
  • NP.hh : serialization into NumPy .npy format files
QUDARap
  • QSim : optical photon simulation steering
  • QScint,QCerenkov,QProp,... : modular CUDA implementation
U4
  • U4Tree : convert geometry into stree.h
  • U4 : collect gensteps, return hits
CSG
  • CSGFoundry/CSGSolid/CSGPrim/CSGNode geometry model
  • csg_intersect_tree.h csg_intersect_node.h csg_intersect_leaf.h : CPU/GPU intersection functions
CSGOptiX
  • CSGOptiX.h : manage geometry convert from CSG to OptiX 7 IAS GAS, pipeline creation
  • CSGOptiX7.cu : compiled into ptx that becomes OptiX 7 pipeline
    • includes QUDARap headers for simulation
    • includes csg_intersect_tree.h,.. headers for CSG intersection
G4CX
  • G4CXOpticks : Top level Geant4 geometry interface

QUDARap : CUDA Optical Simulation Implementation

CPU/GPU Counterpart Code Organization for Simulation

  CPU GPU
context steering QSim.hh qsim.h
curandState setup QRng.hh qrng.h
property interpolation QProp.hh qprop.h
event handling QEvent.hh qevent.h
Cerenkov generation QCerenkov.hh qcerenkov.h
Scintillation generation QScint.hh qscint.h
texture handling QTex.hh cudaTextureObject_t

Validation of Opticks Simulation(A) by Comparison with Geant4 Sim. (B)

A and B always same photon counts (due to gensteps)

  1. direct comparison when simulations are random aligned
  2. when not aligned : statistical Chi2 history comparison
    • compare history frequencies, Chi2 points to issues

Primary Issue : double vs float, also:

After debugged : fraction of percent diffs


Optical Simulation Comparison : Statistical OR Direct

Statistical Chi-squared comparison of photon history occurence between two simulations

c2sum/c2n:c2per(C2CUT)  280.88/188:1.494 (30)

np.c_[siq,_quo,siq,sabo2,sc2,sabo1][0:25]  ## A-B history frequency chi2 comparison
    0   TO BT BT BT BT SD                                             33322  33343    0.0066        1      2
    1   TO BT BT BT BT SA                                             28160  28070    0.1441        8      0
    2   TO BT BT BT BT BT SR SA                                        6270   6268    0.0003    10363  10565
    3   TO BT BT BT BT BT SA                                           4552   4649    1.0226     8398   8433
    4   TO BT BT BT BT BT SR BR SR SA                                  1154   1186    0.4376    21156  21014
    5   TO BT BT BT BT BT SR BR SA                                      923    989    2.2782    20241  20201
    6   TO BT BT BT BT BR BT BT BT BT BT BT AB                          946    958    0.0756    10389   8432
    7   TO BT BT BT BT BT SR SR SA                                      901    942    0.9121    10399  10410
    8   TO BT BT AB                                                     878    895    0.1630       26    102
    9   TO BT BT BT BT BT SR BT BT BT BT BT BT BT AB                    615    635    0.3200    20974  22027
   10   TO BT BT BT BT BR BT BT BT BT AB                                571    601    0.7679     8459   9208
   11   TO BT BT BT BT BR BT BT BT BT BT BT BT BT SA                    533    537    0.0150     7312   7299
   12   TO BT BT BT BT BR BT BT BT BT BT BT BT BT BT BT BT BT SD        503    396   12.7353    12018  11465
   13   TO BT BT BT BT BR BT BT BT BT BT BT BT BT SD                    480    497    0.2958     7974   7967
   14   TO BT BT BT BT BR BT BT BT BT BT BT BT BT BT BT BT BT SA        412    411    0.0012    11467  11471
   15   TO BT BT BT BT BT SR SR SR SA                                   383    396    0.2169    10362  10368
 

When causes of discrepancy cannot be identified statistically

Comparison of two independent optical simulation implementations : ideal way find issues


B_V1J008_N1_ip_MOI_Hama:0:1000_yy_frame_close.png

Green : start position (100k input photons)

Red : end position,  Cyan : other position


B_V1J008_N1_ip_MOI_Hama:0:1000_b.png

                                                  cd ~/j/ntds ; N=1 ./ntds.sh ana

Geant4/U4Recorder 3D photon points transformed into target frame, viewed in 2D


B_V1J008_N1_OIPF_NNVT:0:1000_gridxy.png

export OPTICKS_INPUT_PHOTON=GridXY_X1000_Z1000_40k_f8.npy
export OPTICKS_INPUT_PHOTON_FRAME=NNVT:0:1000

MODE=3 EDL=1 N=0 EYE=500,0,2300 CHECK=not_first ~/j/ntds/ntds.sh ana

Photon step points from grid of input photons target NNVT:0:1000 (POM:1)

cxr_min__eye_1,0,5__zoom_2__tmin_0.5__NNVT:0:1000_demo.jpg

ray traced renders : exact same geometry "seen" by simulation


Selection of DetSim issues revealed/studied using Opticks

Geometry Issues

Physics issues


Crucial Importance of Standalone (fast cycle) tests

How standalone tests created:

 38 #ifdef PMTSIM_STANDALONE
 39 #include "PMTSIM_API_EXPORT.hh"
 40 class PMTSIM_API HamamatsuR12860PMTManager : public IGeomManager {
 41 #else
 42 class HamamatsuR12860PMTManager: public IPMTElement,
 43                                  public ToolBase {
 44 #endif

CSG : (Cylinder - Torus) PMT neck : spurious intersects

CSG : (Cylinder - Torus) PMT neck : spurious intersects

OptiX 5.5 raytrace comparing two PMT neck models:

  1. Ellipsoid + Hyperboloid + Cylinder
  2. Ellipsoid + (Cylinder - Torus) + Cylinder

body_solid_nurs

X4IntersectTest shows Geant4 also has spurious intersects from G4Torus


cxr_view_solidXJfixture:55:-3_cam_1_eye_8,-4,-4_zoom_1_tmin_0.1

EYE=8,-4,-4 LOOK=0,0,0 MOI=solidXJfixture:55:-3 ./cxr_view.sh
view directly at the fixture, but not visible as uni_acrylic1 in front

cxr_view_solidXJfixture:55:-3_cam_1_eye_1,-0.5,-0.5_zoom_1_tmin_0.1

EYE=1,-0.5,-0.5 LOOK=0,0,0 MOI=solidXJfixture:55:-3 ./cxr_view.sh
closer again, at same angle of view : fixture now visible, coincidence speckle between spherically curved uni_acrylic1 base and sAcrylic

image_grid_cxr_solidXJfixture:XX:-3


FewPMT_2942_Unphysical_cross_geometry_vac_reflect.png


FewPMT_demo0.png

Compare Unnatural/Natural N=0/1 geometry simulations by skipping N=0 Pyrex/Pyrex + Vacuum/Vacuum fake points


FewPMT_demo.png

two_pmt layout with HAMA, NNVT (Natural Geometry N=1)


Compare Reflected Polarization Impls for Brewster Angle Incidence

sboundary_test/figs/sboundary_test/pvcap/AOI_BREWSTER_linear_polarization_by_reflection.png sboundary_test/figs/sboundary_test/pvcap/AOI_BREWSTER_reflect_alt_pol.png
G4OpBoundaryProcess/qsim.h/sboundary.h : Only S-polarized survives junoPMTOpticalModel::Reflect : very different

Brewster (or polarizing) incident angle th1 : tan(th1) = n2/n1  ;  th1 + th2 = pi/2


G4OpBoundaryProcess : customized for JUNO PMT Optical Model (POM)

Custom Boundary Process : Advantages

Old FastSim POM 4 Solid, 4 LV, 4 PV
Custom Boundary POM 2 Solid, 2 LV, 2 PV

Disadvantages

Advantages far outweigh disadvantages
  • JUNOSW MERGED May 25, 2023

https://github.com/simoncblyth/customgeant4/


Multi-Layer Thin Film (A,R,T) Calc using TMM Calc (Custom4 Package)

C4OpBoundaryProcess.hh
G4OpBoundaryProcess with C4CustomART.h
C4CustomART.h
integrate custom boundary process and TMM calculation
C4MultiLayrStack.h : CPU/GPU TMM calculation of (A,R,T)

based on complex refractive indices and layer thicknesses

  • GPU: using thrust::complex CPU:using std::complex

Custom4: Simplifies JUNO PMT Optical Model + Geometry

GEOM/FewPMT/U4SimtraceTest/1/figs/U4SimtraceTest/mpcap/FewPMT_demo.png

LayrTest__R12860_Aspa.png


Opticks performance with GPU PMT Optical Model

TEST=large_scan ~/opticks/cxs_min.sh

Generate 20 optical only (Torch) events with 0.1M->100M photons starting from CD center, gather and save only Hits.

OPTICKS_RUNNING_MODE=SRM_TORCH  ## "Torch" running enables num_photon scan
OPTICKS_NUM_PHOTON=H1:10,M2,3,5,7,10,20,40,60,80,100
OPTICKS_NUM_EVENT=20
OPTICKS_EVENT_MODE=Hit
Test Hardware Notes
DELL Precison Workstation with NVIDIA TITAN RTX(24G) Primary test hardware
DELL Precision Workstation with NVIDIA TITAN V(12G) VRAM limited
DELL Precision Workstation with NVIDIA Quadro RTX 8000 (48G) TODO : push to memory limit ~400M photons
GPU cluster nodes with NVIDIA V100 (32GB) TODO: Production Config Testing, expect ~250M photon per launch limit

ALL1_scatter_10M_photon_22pc_hit_alt.png

~/o/cxs_min.sh  ## 2.2M hits from 10M photon TorchGenstep, 3.1 seconds


ALL1_scatter_10M_photon_22pc_hit.png


S7_Substamp_ALL_Hit_vs_Photon__linear.png


S7_Substamp_ALL_Etime_vs_Photon__100M_31s_Release.png

Release : 0.314 seconds per million photons


scan-pf-1_Opticks_vs_Geant4 3

Absolute Comparison with ancient Opticks Measurements.. ? [Below presented at CHEP 2019] 58s / 400M photons





JUNO analytic, 400M photons from center Speedup
Geant4 Extrap. 95,600 s (26 hrs)  
Opticks RTX ON (i) 58 s 1650x

Absolute Comparison with ancient Opticks Measurements ?

JUNO analytic, 400M photons from center Speedup Notes
Geant4 Extrap. 95,600 s (26 hrs)   Ancient (2019)
Opticks RTX ON (i) 58 s 1650x Ancient (2019)
JUNOSW+Opticks 1st 124 s (~2x slower) "770x" extrapolated from 31s for 100M

Practically everything different between these measurements : nevertheless, its natural to compare

  1. NVIDIA OptiX 6.5 -> 7.5 [entirely new API] => Opticks almost entirely re-implemented
  2. JUNO geometry : more complex than 4 years ago(?) : despite efforts to simplify
  3. JUNO PMT Optical Model (POM) (traditional vs "bouncy" with complex {A,R,T} TMM calculation)
  4. NVIDIA RTX 8000 (48G) vs NVIDIA TITAN RTX (24G) [similar spec other than VRAM]
  5. Geant4 setup : Geant4 is not a good candle : far too flexible

Expected Primary Cause of 2x slowdown : "bouncy" POM


hit_position_wavelength_time.png

Yuxiang Hu : Gamma Event at CD center : Comparison of JUNOSW with JUNOSW+Opticks

Hit position, wavelength and time comparison


gamma_event_at_center.png

Yuxiang Hu : Gamma Event at CD center : Comparison of JUNOSW with JUNOSW+Opticks

Overall speedup [JSW/(JSW+Opticks)] ~60X UN-OPTIMIZED + PRELIM

[Calculation: same TMM header as JUNOSW, Lookup: using uploaded "ART" texture (Megabytes)]


Amdahls "Law" : Expected Speedup Limited by Serial Processing

optical photon simulation, P ~ 99% of CPU time

Must consider processing "big picture"


amdahl_p_sensitive.png

parallel/amdahl.png

How much parellelized speedup actually useful to overall speedup?

Very dependant on the parallel fraction

Theoretical Overall Speedup for various parallel fractions and parallelized speedups
  Parallelized Speedup  
Parallel Fraction 100x 1000x limit Notes
95% 17x 20x 20x Little benefit beyond ~100x parallelized speedup
96% 20x 24x 25x
97% 25x 32x 33.3x
98% 34x 48x 50x Substantial benefit from more parallelized speedup
99% 50x 91x 100x
In [1]: run ~/opticks/ana/amdahl.py

In [2]: Amdahl.Overall_Speedup(np.array([100,1000,np.inf]),0.95)
Out[2]: array([16.807, 19.627, 20.   ])

In [3]: Amdahl.Overall_Speedup(np.array([100,1000,np.inf]),0.99)
Out[3]: array([ 50.251,  90.992, 100.   ])

Summary

Opticks : state-of-the-art GPU ray traced optical simulation integrated with Geant4 with automated geometry translation for use with NVIDIA OptiX 7+ API.


Acknowledgements : G4 Collab. + DM Search Community + NVIDIA

Developing Opticks would not have been possible without longterm support and assistance from many organizations and individuals
ALSO : Assistance of many with promoting Opticks to a wide audience are acknowledged

Geant4 11.0+ (Dec 2021) : Opticks Advanced Example

Dark Matter Search Community (XENON,LZ,DARWIN,..)

Dark-matter And Neutrino Computation Explored (DANCE)


Innovation Lessons Learned


Three Laws of Good Software Design + Development

  1. Arrange fast development cycle : TOP PRIORITY
  • seconds to make change and see outcome, not minutes
  • <1s cycletime : no waiting, focussed development
  • cheap cycles => great code

  1. Always minimize dependencies
  • just good design : code more useful
  • organize code by dependency, NOT TOPIC
  • dependency is fundamental, topic just opinion
  • use serialization to cut dependencies

  1. Unit tests are your most powerful tool
  • not just for testing, also for development+documenting
  • use serialization + mocking to simplify tests

https://simoncblyth.bitbucket.io/env/presentation/standalone_20230930_cpp_test_debug_ana_with_numpy.html


Extras Follow


NVIDIA Ada : 3rd Generation RTX


Hardware accelerated Ray tracing (RT Cores) in the Data Center

NVIDIA L4 Tensor Core GPU (Released 2023/03)

NVIDIA L4 Tensor Core GPU (Data Center, low profile+power)


US restricts export of highest performing GPUs

[0] GPUs above performance threshold require export permits

[1] Federal Register on 10/25/2023 : restriction details

[2] NVIDIA filing to US Securities and Exchange Commission

...additional licensing requirements for exports to China...

...(including but not limited to the A100, A800, H100, H800, L40, L40S, and RTX 4090)

[0] https://www.theregister.com/2023/10/19/china_biden_ai/

[1] https://public-inspection.federalregister.gov/2023-23055.pdf

[2] https://www.sec.gov/ix?doc=/Archives/edgar/data/1045810/000104581023000217/nvda-20231017.htm