LZ-Opticks-NVIDIA OptiX 6->7 Notes : "Qudarap" pure CUDA photon generation

Opticks replaces Geant4 optical photon simulation with an equivalent implementation that benefits from state-of-the-art GPU ray tracing from NVIDIA OptiX. All optically relevant aspects of Geant4 context must be translated+copied to GPU:

geometry : solids, structure, material+surface properties

generation : Cerenkov + Scintillation (using Gensteps from Geant4)

propagation : Rayleigh scattering, absorption, reemission, boundary

Achieving+maintaining equivalence is time consuming, BUT:

transformative performance benefits >1000x Geant4

Simon C Blyth, July 27, 2021

Geant4OpticksWorkflow

Scintillation + Cerenkov Matching

Scintillation

light produced in certain special scintillator materials, such as JUNO LS (liquid scintillator)

implementation consumes fixed number (6) of randoms per photon
requires high resolution inverse-CDF texture to get Geant4/Opticks match
find no need to resort to doubles in GPU implementation

Cerenkov

light produced in many materials (eg Water[refractive index 1.333]) when particle travels through it faster than the speed of light in the medium (for water when velocity v/c > 1/1.3333 = 0.75 )

rejection sampling : variable random consumption, sometimes > 100 per photon
much more problematic to get good Geant4/Opticks match
currently finding have to resort to double precision rejection sampling

Qudarap qctx.h : all new pure-CUDA Opticks Context

qctx.h QCtx.hh https://bitbucket.org/simoncblyth/opticks/src/master/qudarap/: GPU context and CPU counterpart that preps it acting as coordinator of all the below
QRng.hh: loading+uploading curandState : use curand without the stack cost of curand_init
QTex.hh: 2D texture creation
QBnd.hh: ggeo/GBndLib -> QTex "boundary texture" (TODO: qbnd.h encapsulation)
QScint.hh: ggeo/GScintillatorLib -> QTex "scintillation inverse-CDF texture"
qprop.h QProp.hh: marshalling variable length paired (energy/wavelength,property) into compound array, binary bin search + linear interpolation just like G4. (TODO: accuracy/performance comparison with QBnd)
qgs.h: union based collective Scintillation and Cerenkov genstep handling
qcurand.h: templated float/double specializations for uniform access to curand_uniform/curand_uniform_double
QU.hh: utilitles : eg device<->host copies

qctx.h : currently photon generation expts

34 template <typename T> struct qctx
35 {
36     curandStateXORWOW*  r ;
37
38     cudaTextureObject_t scint_tex ;
..
52     qprop<T>*           prop ;
..
64     QCTX_METHOD float4  boundary_lookup( unsigned ix, unsigned iy );
65     QCTX_METHOD float4  boundary_lookup( float nm, unsigned line, unsigned k );
66
67     QCTX_METHOD float   scint_wavelength_hd0(curandStateXORWOW& rng);
68     QCTX_METHOD float   scint_wavelength_hd10(curandStateXORWOW& rng);
69     QCTX_METHOD float   scint_wavelength_hd20(curandStateXORWOW& rng);
70     QCTX_METHOD void    scint_dirpol(quad4& p, curandStateXORWOW& rng);
71     QCTX_METHOD void    reemit_photon(quad4& p, float scintillationTime, curandStateXORWOW& rng);
72     QCTX_METHOD void    scint_photon( quad4& p, GS& g, curandStateXORWOW& rng);
73     QCTX_METHOD void    scint_photon( quad4& p, curandStateXORWOW& rng);
74
75     QCTX_METHOD void    cerenkov_fabricate_genstep(GS& g, bool energy_range );
..
80     QCTX_METHOD void    cerenkov_photon(quad4& p, unsigned id, curandStateXORWOW& rng, const GS& g, int print_id = -1 ) ;
81     QCTX_METHOD void    cerenkov_photon(quad4& p, unsigned id, curandStateXORWOW& rng, int print_id = -1 ) ;
82
83     QCTX_METHOD void    cerenkov_photon_enprop(quad4& p, unsigned id, curandStateXORWOW& rng, const GS& g, int print_id = -1 ) ;
84     QCTX_METHOD void    cerenkov_photon_enprop(quad4& p, unsigned id, curandStateXORWOW& rng, int print_id = -1 ) ;
85
86     QCTX_METHOD void    cerenkov_photon_expt(  quad4& p, unsigned id, curandStateXORWOW& rng, int print_id = -1 );
87

Scintillation Wavelength : Compare Geant4 and Opticks implementations

 200 G4VParticleChange*
 201 DsG4Scintillation::PostStepDoIt(
        const G4Track& aTrack, const G4Step& aStep)
 208 {
 ...
 540     G4PhysicsOrderedFreeVector* ScintillationIntegral = NULL;
 543     ScintillationIntegral =
 544       (G4PhysicsOrderedFreeVector*)
               ((*theFastIntegralTable)(materialIndex));
 ...
 579    for(G4int i = 0 ; i < NumPhoton ; i++) {
 ...
 583    G4double sampledEnergy;
 586       G4double CIIvalue = G4UniformRand()*
 587          ScintillationIntegral->GetMaxValue();
 588       sampledEnergy=
 589          ScintillationIntegral->GetEnergy(CIIvalue);  

 096 G4double G4PhysicsOrderedFreeVector::GetEnergy(G4double aValue)
  97 {
  98         G4double e;
 ...
 104           size_t closestBin = FindValueBinLocation(aValue);
 105           e = LinearInterpolationOfEnergy(aValue, closestBin);
 107         return e;
 108 }
 110 size_t G4PhysicsOrderedFreeVector::FindValueBinLocation(G4double aValue)
 111 {
 112         size_t bin = std::lower_bound(dataVector.begin(), dataVector.end(), aValue)
 113                    - dataVector.begin() - 1;
 114         bin = std::min(bin, numberOfNodes-2);
 115         return bin;
 116 }

Geant4 : find ICDF bin then interpolate, Opticks : direct linear interpolation on HD GPU texture : 4096 bins

Scintillation Wavelength Matching : Explanations of following plots

hd0 : simple 4096 bins
hd20 : 4096 bins with 20x bins for u < 0.05 and u > 0.95
cudaFilterModePoint : linear interpolation disabled

DsG4Scintillator compared with	chi2	note
hd0_cudaFilterModePoint	13.79	"artifact" issue in tails, terrible chi2
hd0	1.00	interpolation enabled, fixes chi2
hd20	0.88	effectively x20 bins, improves chi2
hd20_cudaFilterModePoint	1.29	XHD "HD20" even OK without interpolation

GPU textures can reproduce very precisely inverse cumulative distribution functions by using high definition
"hd20" multi-resolution technique can effect variable resolution suited to scintillator inverse-CDF

1M Wavelength Sample comparison hd0 no interpolation

1M Wavelength Sample comparison hd0

1M Wavelength Sample comparison hd20

1M Wavelength Sample comparison hd20 no interpolation

GScintillatorLib ICDF

"Multi-resolution" GPU texture (3,4096,1) : 20x Resolution for 3x bins

156 NPY<double>* X4Scintillation::CreateGeant4InterpolatedInverseCDF(
157        const G4PhysicsOrderedFreeVector* ScintillatorIntegral,
158        unsigned num_bins, unsigned hd_factor, ..
161 )
164     double mx = ScintillatorIntegral->GetMaxValue() ;
166     NPY<double>* icdf = NPY<double>::make(3, num_bins, 1 );
170     int nj = icdf->getShape(1);
180     double edge = 1./double(hd_factor) ;
...
202     for(int j=0 ; j < nj ; j++)
203     {
204     double u_all = double(j)/double(nj) ;
205     double u_lhs = double(j)/double(hd_factor*nj) ;
206     double u_rhs = 1. - edge + double(j)/double(hd_factor*nj) ;
207
208     double energy_all = ScintillatorIntegral->GetEnergy( u_all*mx );
209     double energy_lhs = ScintillatorIntegral->GetEnergy( u_lhs*mx );
210     double energy_rhs = ScintillatorIntegral->GetEnergy( u_rhs*mx );
211
212     double wavelength_all = h_Planck*c_light/energy_all/nm ;
213     double wavelength_lhs = h_Planck*c_light/energy_lhs/nm ;
214     double wavelength_rhs = h_Planck*c_light/energy_rhs/nm ;
215
217     icdf->setValue(0, j, 0, 0,  wavelength_all );
218     icdf->setValue(1, j, 0, 0,  wavelength_lhs );
219     icdf->setValue(2, j, 0, 0,  wavelength_rhs );
220     }
221     return icdf ;
222 }

https://bitbucket.org/simoncblyth/opticks/src/master/extg4/X4Scintillation.cc

https://bitbucket.org/simoncblyth/opticks/src/master/qudarap/qctx.h

Cerenkov Cone Angle + Beta requirements reminder


              \    B    /
          \    .   |   .    /                            AC     ct / n          1         i       BetaInverse
      \    C       |       C    /             cos th =  ---- =  --------   =  ------ =   ---  =  -------------
  \    .    \      |      /    .     /                   AB       bct           b n       n        sampledRI
   .         \    bct    /          .
              \    |    /                                  BetaInverse
               \   |   /  ct                  maxCos  =  -----------------
                \  |th/  ----                                nMax
                 \ | /    n
                  \|/
                   A

Particle travels AB, light travels AC,  ACB is right angle


 Only get Cerenkov radiation when

        cos th <= 1 ,

        beta >= beta_min = 1/n        BetaInverse <= BetaInverse_max = n


 At the small beta threshold AB = AC,   beta = beta_min = 1/n     eg for n = 1.333, beta_min = 0.75

        cos th = 1,  th = 0         light is in direction of the particle


 For ultra relativistic particle beta = 1, there is a maximum angle

        th = arccos( 1/n )

G4Cerenkov : Photon Energy/Wavelength + Cone angle generation

168 G4VParticleChange*
169 G4Cerenkov_modified::PostStepDoIt(
      const G4Track& aTrack, const G4Step& aStep)
...
252   G4double Pmin = Rindex->GetMinLowEdgeEnergy();
253   G4double Pmax = Rindex->GetMaxLowEdgeEnergy();
254   G4double dp = Pmax - Pmin;
...
268   G4double maxCos = BetaInverse / nMax;
270   G4double maxSin2 = (1.0 - maxCos) * (1.0 + maxCos);
...
315   for (G4int i = 0; i < fNumPhotons; i++) {
317
318       G4double rand;
319       G4double sampledEnergy, sampledRI;
320       G4double cosTheta, sin2Theta;
...
324       do {
325          rand = G4UniformRand();
326          sampledEnergy = Pmin + rand * dp;
327          sampledRI = Rindex->Value(sampledEnergy);
334          cosTheta = BetaInverse / sampledRI;
342          sin2Theta = (1.0 - cosTheta)*(1.0 + cosTheta);
343          rand = G4UniformRand();
344
346       } while (rand*maxSin2 > sin2Theta);

https://bitbucket.org/simoncblyth/opticks/src/master/examples/Geant4/CerenkovStandalone/G4Cerenkov_modifiedTest.cc

G4Cerenkov_modifiedTest_SKIP_CONTINUE

1.5/1.8 = .83, 1.5/1.62 = 0.92

ana/ck.py rejection sampling

Cerenkov energy/cone angle sample : resorting to double precision

574 template <typename T>
575 inline QCTX_METHOD void qctx<T>::cerenkov_photon_expt(
        quad4& p, unsigned id, curandStateXORWOW& rng, int print_id )
576 {
577     double BetaInverse = 1.5 ;
578     double Pmin = 1.55 ;
579     double Pmax = 15.5 ;
580     double nMax = 1.793 ;
581     double maxCos = BetaInverse / nMax;
582     double maxSin2 = ( 1. - maxCos )*( 1. + maxCos );
583
584     double u0 ;
585     double u1 ;
586     double energy ;
587     double sampledRI ;
588     double cosTheta ;
589     double sin2Theta ;
590     double u_mxs2_s2 ;
592     unsigned loop = 0u ;
593
594     do {
596         u0 = curand_uniform_double(&rng) ;
598         energy = Pmin + u0*(Pmax - Pmin) ;
600         sampledRI = prop->interpolate( 0u, energy );
602         cosTheta = BetaInverse / sampledRI ;
604         sin2Theta = (1. - cosTheta)*(1. + cosTheta);
606         u1 = curand_uniform_double(&rng) ;
608         u_mxs2_s2 = u1*maxSin2 - sin2Theta ;
610         loop += 1 ;
612     } while ( u_mxs2_s2 > 0. );

ck_photon_enprop 100 very poor chi2

ck_photon_expt 100 matched

ck_photon_enprop 1001 better but still poor chi2

ck_photon_expt 1001 matched

[17]lLowerChimney_phys_all_00000

compare render times varying geometry included from same viewpoint
render time is good indicator for simulation time

[11]lLowerChimney_phys8,00000

590 "uni_acrylic3" Fasteners : cost 65x more than all 45.6k PMTs

[4]lLowerChimney_phys6,00000

590 of these "uni1" render quickly
BUT : this shape is subtracted from "uni_acrylic3"

quicktime lAddition_uni_acrylic3

LV : lAddition, Solid : uni_acrylic3

quicktime 2 lAddition_uni_acrylic3

gplt lAddition_uni_acrylic3

LV : lAddition, Solid : uni_acrylic3, Class : AdditionAcrylicConstruction

Suspect : subtracting spherical segment, sagitta 5.68mm, is performance killer.
also complicated interior of Fastener irrelevant to optical photons

dotted line is tangent to acrylic sphere circle touching at (0,0) in uni_acrylic3 frame
sagitta : 17820. - np.sqrt( np.power(17820.,2) - np.power(450.,2)) = 5.68 mm

2d GDML matplotlib.pyplot with : opticks/ana/gplt.py opticks/analytic/GDML.py parser

Next Steps

complete photon generation in qctx.h
- angles/positions/times/polarization straightforward (?)
genstep handling for "real" generation
- seeding : associating photons with their gensteps (previously used Thrust)
integrating CSGOptiX geometry + ray tracing with qctx.h for propagation

"Extra" Slides Follow

LS Wavelength

NVIDIA OptiX 7 : Entirely new thin API (Introduced Aug 2019)

NVIDIA OptiX 6->7 : drastically slimmed down

headers only (no library, just Driver)
low-level CUDA-centric thin API (Vulkan-ized)
Minimal host state, All host functions are thread-safe
GPU launches : explicit, asynchronous (CUDA streams)
near perfect scaling to 4 GPUs, for free
Shared CPU/GPU geometry context
GPU memory management
Multi-GPU support

Advantages

More control/flexibility over everything.

Fully benefit from future GPUs
Keep pace with state-of-the-art GPU ray tracing

Disadvantages

Demands much more developer effort than OptiX 6

Major re-implementation of Opticks required

New "Foundry" Model : Shared CPU/GPU Geometry Context

Inst : 4x4 tran. + Solid ref. ( Inst -> 1 IAS )
Solid : 1 or more Prim : ( Solid -> GAS )
Prim : 1,3,7,15,31,... Node : (Prim ~ G4VSolid)

struct CSGFoundry
{
   void upload(); // to GPU 
...
   std::vector<CSGSolid>  solid ; // compounds (eg PMT)
   std::vector<CSGPrim>   prim ;
   std::vector<CSGNode>   node ; // shapes, operators

   std::vector<float4> plan ; // planes
   std::vector<qat4>   tran ; // CSG transforms
   std::vector<qat4>   itra ; // inverse CSG transforms
   std::vector<qat4>   inst ; // instance transforms

   // entire geometry in four GPU allocations
   CSGPrim*    d_prim ;
   CSGNode*    d_node ;
   float4*     d_plan ;
   qat4*       d_itra ;
 };

referencing by offset, count

replaces geometry context dropped in OptiX 6->7
array-based -> simple, inherent serialization + persisting
entire geometry in 4 GPU allocations

Simple intersect headers, common CPU/GPU types

use with : pre-7, 7 + testing on CPU

https://github.com/simoncblyth/CSG "Foundry" model

csg_intersect_tree.h/csg_intersect_node.h/...: simple headers common to pre-7/7/CPU-testing

https://github.com/simoncblyth/CSG_GGeo

Convert Opticks/GGeo -> CSGFoundry

https://github.com/simoncblyth/CSGOptiX

OptiX 7 + pre-7 rendering

GAS : Geometry Acceleration Structure

IAS : Instance Acceleration Structure

CSG : Constructive Solid Geometry

[9]cxr_i0_t8,_-1 : EXCLUDE SLOWEST

Current JUNO Geometry : Auto-Factorized by "progeny digest"

ridx	plc	prim	component	note
0	1	3084	3084:sWorld	non-repeated remainder
1	25600	5	5:PMT_3inch_pmt_solid	4 types of PMT
2	12612	5	5:NNVTMCPPMTsMask
3	5000	5	5:HamamatsuR12860sMask
4	2400	5	5:mask_PMT_20inch_vetosMask
5	590	1	1:sStrutBallhead	4 parts of same assembly, BUT not grouped as siblings (not parent-child)
6	590	1	1:uni1
7	590	1	1:base_steel
8	590	1	1:uni_acrylic3
9	504	130	130:sPanel	repeated parts of TT

ridx:0 "remainder" Prim
- Prim that did not pass instancing criteria, on number of repeats + complexity
- TODO: tune criteria to instance more, reducing remainder Prim (Expect: 3084->~ 84)
ridx:1,2,3,4
- four types of PMT, all with 5 Prim
ridx:5,6,7,8
- same 590x assembly but not grouped together : as siblings (not parent-child like PMTs)
- TODO: implement instancing of siblings, combining 4 -> 1

Increasing instancing : reduces memory for geometry -> improved performance

JUNO OptiX 7 : "Foundry" Geometry Scan

JUNO Geometry : OptiX 7 Ray Trace Times ~2M pixels : TITAN RTX

idx	-e	time(s)	relative	enabled geometry description
0	9,	0.0017	0.1702	ONLY: 130:sPanel
1	7,	0.0017	0.1714	ONLY: 1:base_steel
2	6,	0.0019	0.1923	ONLY: 1:uni1
3	5,	0.0027	0.2780	ONLY: 1:sStrutBallhead
4	4,	0.0032	0.3268	ONLY: 5:mask_PMT_20inch_vetosMask
5	1,	0.0032	0.3287	ONLY: 5:PMT_3inch_pmt_solid
6	2,	0.0055	0.5669	ONLY: 5:NNVTMCPPMTsMask
7	3,	0.0074	0.7582	ONLY: 5:HamamatsuR12860sMask
8	1,2,3,4	0.0097	1.0000	ONLY PMT
9	t8,0	0.0099	1.0179	EXCL: 1:uni_acrylic3 3084:sWorld
10	0,	0.1171	12.0293	ONLY: 3084:sWorld
11	t8,	0.1186	12.1769	EXCL: 1:uni_acrylic3
12	t0,	0.5278	54.2066	EXCL: 3084:sWorld
13	8,	0.5310	54.5298	ONLY: 1:uni_acrylic3
14	t3,	0.6017	61.7954	EXCL: 5:HamamatsuR12860sMask
15	t2,	0.6043	62.0620	EXCL: 5:NNVTMCPPMTsMask
16	t5,	0.6171	63.3787	EXCL: 1:sStrutBallhead
17	t6,	0.6196	63.6301	EXCL: 1:uni1
18	t7,	0.6226	63.9458	EXCL: 1:base_steel
19	t0	0.6240	64.0879	3084:sWorld
20	t4,	0.6243	64.1169	EXCL: 5:mask_PMT_20inch_vetosMask
21	t9,	0.6335	65.0636	EXCL: 130:sPanel
22	t1,	0.6391	65.6384	EXCL: 5:PMT_3inch_pmt_solid

JUNO ALL PMTs : 2M ray traced pixels in 0.0097 s : NVIDIA TITAN RTX, NVIDIA OptiX 7.0.0, Opticks

LZ-Opticks-NVIDIA OptiX 6->7 Notes : "Qudarap" pure CUDA photon generation

LZ-Opticks-NVIDIA OptiX 6->7 Notes : "Qudarap" pure CUDA photon generation

Geant4OpticksWorkflow

Scintillation + Cerenkov Matching

Qudarap qctx.h : all new pure-CUDA Opticks Context

qctx.h : currently photon generation expts

Scintillation Wavelength : Compare Geant4 and Opticks implementations

Scintillation Wavelength Matching : Explanations of following plots

1M Wavelength Sample comparison hd0 no interpolation

1M Wavelength Sample comparison hd0

1M Wavelength Sample comparison hd20

1M Wavelength Sample comparison hd20 no interpolation

GScintillatorLib ICDF

"Multi-resolution" GPU texture (3,4096,1) : 20x Resolution for 3x bins

Cerenkov Cone Angle + Beta requirements reminder

G4Cerenkov : Photon Energy/Wavelength + Cone angle generation

G4Cerenkov_modifiedTest_SKIP_CONTINUE

ana/ck.py rejection sampling

Cerenkov energy/cone angle sample : resorting to double precision

ck_photon_enprop 100 very poor chi2

ck_photon_expt 100 matched

ck_photon_enprop 1001 better but still poor chi2

ck_photon_expt 1001 matched

[17]lLowerChimney_phys_all_00000

[11]lLowerChimney_phys__8,__00000

[4]lLowerChimney_phys__6,__00000

quicktime lAddition_uni_acrylic3

quicktime 2 lAddition_uni_acrylic3

gplt lAddition_uni_acrylic3

Next Steps

"Extra" Slides Follow

LS Wavelength

NVIDIA OptiX 7 : Entirely new thin API (Introduced Aug 2019)

New "Foundry" Model : Shared CPU/GPU Geometry Context

[9]cxr_i0_t8,_-1 : EXCLUDE SLOWEST

Current JUNO Geometry : Auto-Factorized by "progeny digest"

JUNO OptiX 7 : "Foundry" Geometry Scan

JUNO Geometry : OptiX 7 Ray Trace Times ~2M pixels : TITAN RTX

JUNO ALL PMTs : 2M ray traced pixels in 0.0097 s : NVIDIA TITAN RTX, NVIDIA OptiX 7.0.0, Opticks

[11]lLowerChimney_phys8,00000

[4]lLowerChimney_phys6,00000