JUNOSW + Opticks : Profiling + Status

Simon C Blyth, IHEP, CAS — Simulation Meeting — 11 Dec 2023

(November 13, 2023) Technical Test Release on /cvmfs/opticks.ihep.ac.cn

/cvmfs/opticks.ihep.ac.cn/ok/releases/Opticks-v0.2.1/x86_64-CentOS7-gcc1120-geant4_10_04_p02-dbg

Geant4 10.4.2 (Opticks with newer Geant4 already in use elsewhere)
Custom4 0.1.8 small package : but deeply coupled with : Geant4 + JUNOSW + Opticks
OptiX 7.5 CUDA 11.7 straightforward update, so far did not exploit new features
Opticks-v0.2.1 latest release https://github.com/simoncblyth/opticks/releases/tag/v0.2.1

Updated environment setup + distribution scripts:

Testing distribution using ctests:

cd $OPTICKS_PREFIX/tests
ctest -N   # list tests
ctest --output-on-failure   # run all
ctest -R CSGFoundry_CreateFromSimTest --output-on-failure # run selected

cxr_min__eye_0,1.5,0__zoom_4__tmin_1.3__ALL.jpg

EYE=0,1.5,0 TMIN=1.3 ZOOM=4 ~/opticks/cxr_min.sh  ## CSGOptiXRMTest

Using GEOM J23_1_0_rc3_ok0

Geometry in use based on J23_1_0_rc3

Deferred geometry, switched off by tut_detsim.py options.

--no-guide_tube OptiX 7.1 has curves : thought they might enable G4Torus translation, but docs show they are one-sided : so instead triangulate torus [T] ?
--debug-disable-xj XJfixture XJanchor Deep CSG trees require dev. to see if "listnode" (similar to G4MultiUnion) can provide solution
--debug-disable-sj SJCLSanchor SJFixture SJReceiver SJFixture
--debug-disable-fa FastenerAcrylic

Virtual surface shifts used to avoid degeneracy, together with defaults:

export Tub3inchPMTV3Manager__VIRTUAL_DELTA_MM=0.10           ## 1.e-3
export HamamatsuMaskManager__MAGIC_virtual_thickness_MM=0.10 ## 0.05
export NNVTMaskManager__MAGIC_virtual_thickness_MM=0.10      ## 0.05

sigma_alpha/polish ground surface handling ?

[T] torus quartic analytic solution is painful : instead simply use an appropriate triangulated approximation, more precise than analytic with much less pain

Introduce Three Opticks test scripts [1] [2] [3]

idx  control script                          initialization time (seconds)  Notes
[1]  ~/j/okjob.sh                            149                            JUNOSW+Opticks (tut_detsim.py "main")
[2]  ~/opticks/g4cx/tests/G4CXTest_GEOM.sh   127                            InputPhoton, TorchGenstep, NOT YET InputGenstep
[3]  ~/opticks/CSGOptiX/cxs_min.sh           <2                             InputPhoton, TorchGenstep, InputGenstep

  1. "insitu" test of Opticks embedded into JUNOSW : translates geometry and persists it
  2. standalone optical only bi-simulation for A:Opticks <=> B:Geant4 comparison
  3. pure Opticks (no Geant4 dependency) GPU optical simulation : uses geometry persisted by [1]
    • fast initialization : loads CSGFoundry geometry and uploads to GPU in <2 seconds
    • fast cycle for development and Opticks performance measurements
TorchGenstep
disc, sphere, line, point, circle, rectangle : shapes of photon sources implemented in sysrap/storch.h
InputGenstep
general gensteps eg obtained from [1]:okjob.sh can be used in [3]:cxs_min.sh, not yet [2]:G4CXTest (expect straightforward)

[2] G4CXTest_GEOM.sh : standalone optical bi-simulation for validation

G4CXApp.h
standalone pure optical Geant4 + Opticks in single header

Enables pure optical simulation comparison

Test                           Status
InputPhotons targeting PMTs    chi2 matched, no known issues
TorchGenstep from CD center    chi2 marginal : chimney issue ?

A/B Validation Comparison : ~/o/G4CXTest_GEOM.sh ana

 QCF qcf :  a.q 1000000 b.q 1000000
 c2sum :   567.1130 c2n :   506.0000 c2per:     1.1208  C2CUT:  200     CHI2 ISSUE WITH TORCH RUNNING 
 c2sum/c2n:c2per(C2CUT)  567.11/506:1.121 (200) pv[0.031,< 0.05 : NOT:null-hyp ]    INPUT PHOTONS TARGETTING PMT CHI2 OK 

 np.c_[siq,_quo,siq,sabo2,sc2,sabo1][0:40]  ## A-B history frequency chi2 comparison
 [[' 0' 'TO AB                                               ' ' 0' '126549 126732' ' 0.1322' '     2      5']
  [' 1' 'TO BT BT BT BT BT BT SD                             ' ' 1' ' 70494  70173' ' 0.7325' '    18      2']
  [' 2' 'TO BT BT BT BT BT BT SA                             ' ' 2' ' 57103  56944' ' 0.2217' '     5     25']
  [' 3' 'TO SC AB                                            ' ' 3' ' 51434  51739' ' 0.9016' '     4      9']
  [' 4' 'TO SC BT BT BT BT BT BT SD                          ' ' 4' ' 35878  36119' ' 0.8067' '    58     45']
  [' 5' 'TO SC BT BT BT BT BT BT SA                          ' ' 5' ' 29676  30164' ' 3.9797' '   124      4']
  [' 6' 'TO SC SC AB                                         ' ' 6' ' 19993  19499' ' 6.1794' '   137    124']
  [' 7' 'TO BT BT SA                                         ' ' 7' ' 18932  18837' ' 0.2390' '    71     14']
  [' 8' 'TO RE AB                                            ' ' 8' ' 18319  18272' ' 0.0604' '     9     64']
  [' 9' 'TO SC SC BT BT BT BT BT BT SD                       ' ' 9' ' 15454  15701' ' 1.9582' '    19     85']
  ['10' 'TO SC SC BT BT BT BT BT BT SA                       ' '10' ' 12785  12696' ' 0.3109' '    24      3']
  ['11' 'TO BT BT AB                                         ' '11' ' 10993  11100' ' 0.5182' '    72    188']
  ['12' 'TO BT AB                                            ' '12' '  9250   9727' '11.9897' '    36     96'] ## ABSLEN ACRYLIC ? 
  ['13' 'TO BT BT BT BT BT BT BT SA                          ' '13' '  7476   7627' ' 1.5097' '   176    162']
  ['14' 'TO SC SC SC AB                                      ' '14' '  7544   7545' ' 0.0001' '    90     84']
  ['15' 'TO RE BT BT BT BT BT BT SD                          ' '15' '  7419   7364' ' 0.2046' '   197      6']
  ['16' 'TO SC RE AB                                         ' '16' '  7137   7191' ' 0.2035' '   110     93']
  ['17' 'TO RE BT BT BT BT BT BT SA                          ' '17' '  7126   7104' ' 0.0340' '    48    181']
  ['18' 'TO SC BT BT AB                                      ' '18' '  6419   6527' ' 0.9010' '   153     89']
  ['19' 'TO BT BT BT BT BT BT BT SR SA                       ' '19' '  6385   6367' ' 0.0254' '    16    139']
  ['20' 'TO BT BT BT BT SD                                   ' '20' '  6146   6190' ' 0.1569' '    13     99']
  ['21' 'TO SC SC SC BT BT BT BT BT BT SD                    ' '21' '  6148   6175' ' 0.0592' '   145    194']
  ['22' 'TO SC BT BT SA                                      ' '22' '  6087   6170' ' 0.5620' '   120    185']
  ['23' 'TO SC BT AB                                         ' '23' '  5589   5782' ' 3.2758' '     8     17']
  ['24' 'TO BT BT DR BT SA                                   ' '24' '  5449   5543' ' 0.8039' '   600    246']
  ['25' 'TO RE RE AB                                         ' '25' '  5538   5420' ' 1.2707' '   267    125']
  ['26' 'TO BT BT BT SA                                      ' '26' '  5532   5259' ' 6.9066' '   745      7']
  ['27' 'TO SC SC SC BT BT BT BT BT BT SA                    ' '27' '  5084   4974' ' 1.2030' '    23     31']
  ['28' 'TO SC BT BT BT BT BT BT BT SA                       ' '28' '  4609   4610' ' 0.0001' '    20     63']
  ['29' 'TO BT BT BT BT BT BT BR BT BT BT BT BT BT BT BT SD  ' '29' '  3809   3813' ' 0.0021' '   362    812']
  ['30' 'TO RE SC AB                                         ' '30' '  3660   3565' ' 1.2491' '    54     30']
  ['31' 'TO SC RE BT BT BT BT BT BT SD                       ' '31' '  3192   3134' ' 0.5318' '   292    136']
  ['32' 'TO SC BT BT BT BT BT BT BT SR SA                    ' '32' '  3145   3173' ' 0.1241' '   243    419']
  ['33' 'TO BT BT BT BT BT BT BT SD                          ' '33' '  3168   3138' ' 0.1427' '   181    424']
  ['34' 'TO BT BT BT BT BT BT BR BT BT BT BT BT BT BT BT SA  ' '34' '  3142   3163' ' 0.0699' '    22    257']
  ['35' 'TO BT BT BT BT BT BT BT SR SR SA                    ' '35' '  3043   3096' ' 0.4576' '   286   1591']
  ['36' 'TO SC SC BT BT AB                                   ' '36' '  2878   2987' ' 2.0257' '   636    252']
  ['37' 'TO SC RE BT BT BT BT BT BT SA                       ' '37' '  2877   2960' ' 1.1802' '   151    301']
  ['38' 'TO BT BT BT BT AB                                   ' '38' '  2857   2834' ' 0.0930' '   225    228']
  ['39' 'TO SC BT BT BT BT SD                                                                           ' '39' '  2841   2800' ' 0.2980' '   224    323']]
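The per-history chi2 column in the table above is consistent with the usual two-sample form (a-b)^2/(a+b) applied to the A and B counts. The following is a minimal sketch of that arithmetic, not the actual analysis code behind `G4CXTest_GEOM.sh ana`. The function names `chi2_term` and `chi2_summary` are hypothetical, and the way C2CUT is applied (skipping histories whose combined count does not exceed the cut) is an assumption.

```python
def chi2_term(a, b):
    """Chi2 contribution of one photon history with counts a (A:Opticks) and b (B:Geant4)."""
    total = a + b
    return (a - b) ** 2 / total if total > 0 else 0.0


def chi2_summary(counts, c2cut=200):
    """Return (c2sum, c2n, c2per) over (a, b) pairs.

    ASSUMPTION: histories with a + b <= c2cut are skipped, as a guess at the
    role of C2CUT in the report header.
    """
    terms = [chi2_term(a, b) for a, b in counts if a + b > c2cut]
    c2sum = sum(terms)
    c2n = len(terms)
    c2per = c2sum / c2n if c2n else 0.0
    return c2sum, c2n, c2per


# Three rows copied from the table above: history -> (A count, B count)
rows = {
    "TO AB": (126549, 126732),
    "TO BT AB": (9250, 9727),
    "TO SC SC SC AB": (7544, 7545),
}

for history, (a, b) in rows.items():
    print(f"{history:16s} {chi2_term(a, b):8.4f}")

print(chi2_summary(rows.values()))
```

The printed values for the three sample rows reproduce the 0.1322, 11.9897 and 0.0001 entries of the table, which is what singles out "TO BT AB" as the largest contribution among them.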
  

[2] Chimney Issue : Photons going up the Chimney discrepant ?

 np.c_[siq,_quo,siq,sabo2,sc2,sabo1][bzero]  ## in A but not B
 [['1107' 'TO BT BT BT BT BT BT BT BT SD                                                                  ' '1107' '    41      0' ' 0.0000' ' 11355     -1']
  ['1305' 'TO BT BT BT BT BT BT BT BT BT BT BT BT BT BT BT BT BT BT BT BT BT BT BT BT BT BT BT BT BT BT BT' '1305' '    33      0' ' 0.0000' ' 11040     -1']
  ['1623' 'TO BT BT DR BT BT BT SD                                                                        ' '1623' '    26      0' ' 0.0000' '  1930     -1']
  ['2375' 'TO BT BT BT BT BT BT BR BT BT BT BT BT BT BT BT BT SD                                          ' '2375' '    17      0' ' 0.0000' ' 10972     -1']
  ['3264' 'TO SC BT BT BT BT BT BT BR BT BT BT BT BT BT BT BT BT SD                                       ' '3264' '    12      0' ' 0.0000' ' 22140     -1']]

 In [1]: w = a.q_startswith("TO BT BT BT BT BT BT BT BT BT BT BT BT BT BT BT BT BT BT BT BT BT BT BT BT BT BT BT BT BT BT BT") ; w
 Out[1]:
 array([ 11040,  15219, 118322, 152607, 165838, 215978, 299136, 374379, 395244, 422394, 427598, 434101, 443666, 445392, 479186, 531698, 549984, 592656, 604821, 637582, 656052, 736283, 777988, 789501,
        821402, 837105, 853410, 898084, 903045, 923645, 927731, 974750, 989689])

 In [2]: a.f.record[w[0],:,0]    ## PHOTON STEP POINT POSITIONS : ALL SIMILAR : GOING UP THE CHIMNEY 
 Out[2]:
 array([[   -1.594,     0.835,    99.984,     0.   ],        ##  photon step point (x,y,z,t)  
        [ -284.142,   148.854, 17823.998,    81.302],
        [ -315.513,   165.289, 20000.   ,    88.563],
        [ -332.369,   174.119, 21750.   ,    94.401],
        [ -332.383,   174.127, 21752.   ,    94.407],
        [ -344.817,   180.641, 23500.   ,   103.396],
        [ -368.795,   193.202, 25752.   ,   110.911],
        [ -409.762,   214.663, 29599.7  ,   123.75 ],
        [ -409.764,   214.664, 29599.85 ,   123.75 ],
        ...
        [ -412.414,   216.053, 29848.799,   124.581],
        [ -412.424,   216.058, 29849.7  ,   124.584],
        [ -412.426,   216.059, 29849.85 ,   124.585],
        [ -412.532,   216.114, 29859.85 ,   124.618],
        [ -412.534,   216.115, 29860.   ,   124.618],
        [ -412.543,   216.12 , 29860.9  ,   124.621],
        [ -412.55 ,   216.124, 29861.5  ,   124.623],
        [ -412.56 ,   216.129, 29862.5  ,   124.627]], dtype=float32)
 

G4CXTest_GEOM_A_chimney_issue.png

[2] G4CXTest_GEOM.sh
1M photon Torch Genstep at CD center
Red End points
Green Step points
Cyan Hit points

[3] Pure Optical TorchGenstep 20 evt scan : 0.1M to 100M photons

TEST=large_scan ~/opticks/cxs_min.sh

Generate 20 optical only events with 0.1M->100M photons starting from CD center, gather and save only Hits.

OPTICKS_RUNNING_MODE=SRM_TORCH
OPTICKS_NUM_PHOTON=H1:10,M2,3,5,7,10,20,40,60,80,100
OPTICKS_NUM_EVENT=20
OPTICKS_EVENT_MODE=Hit
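
The OPTICKS_NUM_PHOTON value above packs the 20 per-event photon counts into one string. The sketch below shows one reading of that notation which agrees with the text (20 events, 0.1M to 100M photons, and H1 meaning 100K). It is an illustration only: the function `expand_num_photon` is hypothetical, and the rules it encodes (a leading H or M sets the unit for following fields, `lo:hi` is an inclusive range) are assumptions, not the Opticks parser.

```python
# ASSUMED units: H = 100 thousand, M = 1 million (H1 is described as 100K in these slides)
UNITS = {"H": 100_000, "M": 1_000_000}


def expand_num_photon(spec):
    """Expand a spec such as 'H1:10,M2,3,5' into a list of photon counts.

    ASSUMPTIONS: a leading unit letter applies to the following fields until
    another letter appears, and 'lo:hi' is an inclusive integer range.
    """
    counts = []
    unit = 1
    for field in spec.split(","):
        if field[0].isalpha():
            unit = UNITS[field[0]]
            field = field[1:]
        if ":" in field:
            lo, hi = (int(v) for v in field.split(":"))
            counts.extend(unit * n for n in range(lo, hi + 1))
        else:
            counts.append(unit * int(field))
    return counts


counts = expand_num_photon("H1:10,M2,3,5,7,10,20,40,60,80,100")
print(len(counts), counts[0], counts[-1])
```

Under these assumptions the spec expands to ten counts from 0.1M to 1M followed by ten counts from 2M to 100M, matching OPTICKS_NUM_EVENT=20.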

Test Hardware                                                    Notes
DELL Precision Workstation with NVIDIA TITAN RTX (24G)           Primary test hardware
DELL Precision Workstation with NVIDIA TITAN V (12G)             VRAM limited
DELL Precision Workstation with NVIDIA Quadro RTX 8000 (48G)     TODO : try for 400M photons
GPU cluster nodes with NVIDIA V100 (32GB)                        Basic function tests only so far

ALL1_scatter_10M_photon_22pc_hit_alt.png

~/o/cxs_min.sh  ## 2.2M hits from 10M photon TorchGenstep, 3.1 seconds

ALL1_scatter_10M_photon_22pc_hit.png

N7_Substamp_ALL_Hit_vs_Photon__linear.png

S7_Substamp_ALL_Hit_vs_Photon__linear.png

Optimizing separate "Release" build in addition to "Debug" build

Release preprocessor macros : adds: PRODUCTION , removes: DEBUG_TAG, DEBUG_PIDX,...

Examine flattened kernel source CSGOptiX/CSGOptiX7.cu (103k lines) : all includes included

~/opticks/preprocessor.sh > /tmp/out.cc   ## using gcc -E -C -P

Grepping Kernel PTX : Parallel Thread Execution ~Assembly code

Grepping PTX for doubles and printf, and then removing from source : opticks-ptx bash function eg:

grep \\.f64 $OPTICKS_PREFIX/ptx/CSGOptiX_generated_CSGOptiX7.cu.ptx
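
The same check can be scripted so that it reports line numbers for both double precision instructions and printf calls in one pass. This is a minimal sketch and not the `opticks-ptx` bash function: the name `find_ptx_matches` is hypothetical, and the sample PTX lines are hand-written for illustration rather than taken from the real CSGOptiX7 kernel.

```python
def find_ptx_matches(ptx_text, needles=(".f64", "printf")):
    """Return (line_number, needle, stripped_line) for each PTX line containing a needle."""
    hits = []
    for lineno, line in enumerate(ptx_text.splitlines(), start=1):
        for needle in needles:
            if needle in line:
                hits.append((lineno, needle, line.strip()))
    return hits


# Hand-written illustrative PTX fragment (NOT from the real kernel)
sample = """\
add.f32 %f3, %f1, %f2;
cvt.f64.f32 %fd1, %f3;
call.uni (retval0), vprintf, (param0, param1);
"""

hits = find_ptx_matches(sample)
for lineno, needle, line in hits:
    print(f"{lineno:4d} {needle:8s} {line}")
```

For a real file, read it with `open(path).read()` and pass the text in; the matching line numbers then point at instructions to trace back to the source.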

sreport.{sh,cc,py} : Opticks Event metadata reports and plots

Opticks Event => folders of NumPy .npy (NPFold.h/NP.hh)

sreport executable:

  1. loads SEvt metadata in folders below eg ALL1 (using NPFold::LoadNoData : metadata only)
  2. saves summary NPFold to ../ALL1_sreport
  3. sreport.py loads summary and makes plots

Usage on workstation/GPU job and laptop:

~/o/cxs_min.sh  ## create SEvt

Laptop, rsync small metadata summary from remote:

JOB=N7 ~/o/sreport.sh grab
JOB=N7 PLOT=Substamp_ALL_Etime_vs_Photon ~/o/sreport.sh

Effective automated reporting+plotting are essential for optimization

N7_Substamp_ALL_Etime_vs_Photon__34s_100M_debug.png

Debug : 0.341 seconds per million photons

S7_Substamp_ALL_Etime_vs_Photon__100M_31s_Release.png

Release : 0.314 seconds per million photons

Profile reporting eg: ~/o/cxs_min.sh report

   ...
   A018_QSim__simulate_PREL :   1701933491020126,19102924,1300668    2023-12-07T15:18:11.020126  92,039,765  92,038,118       2,598
   A018_QSim__simulate_POST :   1701933526625966,19102924,1300668    2023-12-07T15:18:46.625966 127,645,605 127,643,958  35,605,840
        SEvt__endIndex_A018 :   1701933526626230,19102924,1300668    2023-12-07T15:18:46.626230 127,645,869 127,644,222         264
 SEvt__endOfEvent_LAST_EGPU :   1701933531837026,19102924,1300668    2023-12-07T15:18:51.837026 132,856,665 132,855,018   5,210,796
             SEvt__EndOfRun :   1701933531837143,19102924,1300668    2023-12-07T15:18:51.837143 132,856,782 132,855,135         117
   A018_QSim__simulate_TAIL :   1701933531837486,19102924,1300668    2023-12-07T15:18:51.837486 132,857,125 132,855,478         343
CSGOptiX__SimulateMain_TAIL :   1701933531837541,19102924,1300668    2023-12-07T15:18:51.837541 132,857,180 132,855,533          55

 juncture:4 [SEvt__Init_RUN_META,SEvt__BeginOfRun,SEvt__EndOfRun,SEvt__Init_RUN_META] time ranges between junctures
            SEvt__Init_RUN_META :           -1                        :            0 : 2023-12-07T15:16:38.980361 JUNCTURE
               SEvt__BeginOfRun :   22,181,663                        :   22,181,663 : 2023-12-07T15:17:01.162024 JUNCTURE
                 SEvt__EndOfRun :  110,675,119                        :  132,856,782 : 2023-12-07T15:18:51.837143 JUNCTURE
            SEvt__Init_RUN_META : -132,856,782                        :            0 : 2023-12-07T15:16:38.980361 JUNCTURE

 ranges:6 time ranges between pairs of stamps
             SEvt__Init_RUN_META ==>           CSGFoundry__Load_HEAD                 1,774    ## init
           CSGFoundry__Load_HEAD ==>           CSGFoundry__Load_TAIL             1,325,321    ## load_geom
           CSGOptiX__Create_HEAD ==>           CSGOptiX__Create_TAIL            20,854,325    ## upload_geom
        A000_QSim__simulate_HEAD ==>        A000_QSim__simulate_PREL                19,450    ## upload_genstep
        A000_QSim__simulate_PREL ==>        A000_QSim__simulate_POST                55,697    ## simulate
        A000_QSim__simulate_POST ==>        A000_QSim__simulate_TAIL                 7,686    ## download
        A001_QSim__simulate_HEAD ==>        A001_QSim__simulate_PREL                 1,037    ## upload_genstep
        A001_QSim__simulate_PREL ==>        A001_QSim__simulate_POST               103,109    ## simulate
        A001_QSim__simulate_POST ==>        A001_QSim__simulate_TAIL                11,304    ## download
        A002_QSim__simulate_HEAD ==>        A002_QSim__simulate_PREL                 1,022    ## upload_genstep
        A002_QSim__simulate_PREL ==>        A002_QSim__simulate_POST               112,313    ## simulate
        A002_QSim__simulate_POST ==>        A002_QSim__simulate_TAIL                16,068    ## download
        A003_QSim__simulate_HEAD ==>        A003_QSim__simulate_PREL                   988    ## upload_genstep
        ...
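
To see where the time goes, the labelled ranges in a report like the one above can be summed per stage. The sketch below parses the range lines shown and totals the durations by their `##` label. It is an illustration and not part of sreport: the regular expression, the function name `sum_by_label`, and the reading of the durations as microseconds (suggested by the 16-digit epoch stamps) are assumptions.

```python
import re
from collections import defaultdict

# Range lines copied from the report above
REPORT = """\
SEvt__Init_RUN_META ==> CSGFoundry__Load_HEAD 1,774 ## init
CSGFoundry__Load_HEAD ==> CSGFoundry__Load_TAIL 1,325,321 ## load_geom
CSGOptiX__Create_HEAD ==> CSGOptiX__Create_TAIL 20,854,325 ## upload_geom
A000_QSim__simulate_HEAD ==> A000_QSim__simulate_PREL 19,450 ## upload_genstep
A000_QSim__simulate_PREL ==> A000_QSim__simulate_POST 55,697 ## simulate
A000_QSim__simulate_POST ==> A000_QSim__simulate_TAIL 7,686 ## download
A001_QSim__simulate_HEAD ==> A001_QSim__simulate_PREL 1,037 ## upload_genstep
A001_QSim__simulate_PREL ==> A001_QSim__simulate_POST 103,109 ## simulate
A001_QSim__simulate_POST ==> A001_QSim__simulate_TAIL 11,304 ## download
A002_QSim__simulate_HEAD ==> A002_QSim__simulate_PREL 1,022 ## upload_genstep
A002_QSim__simulate_PREL ==> A002_QSim__simulate_POST 112,313 ## simulate
A002_QSim__simulate_POST ==> A002_QSim__simulate_TAIL 16,068 ## download
A003_QSim__simulate_HEAD ==> A003_QSim__simulate_PREL 988 ## upload_genstep
"""

# start ==> stop  duration  ## label
LINE = re.compile(r"^\s*(\S+)\s+==>\s+(\S+)\s+([\d,]+)\s+##\s+(\S+)\s*$")


def sum_by_label(text):
    """Total the durations (assumed microseconds) of all ranges sharing a label."""
    totals = defaultdict(int)
    for line in text.splitlines():
        m = LINE.match(line)
        if m is None:
            continue  # skip headers, blank lines and '...' elisions
        totals[m.group(4)] += int(m.group(3).replace(",", ""))
    return dict(totals)


totals = sum_by_label(REPORT)
for label, value in sorted(totals.items(), key=lambda kv: -kv[1]):
    print(f"{label:16s} {value:>12,d}")
```

On this excerpt the one-off geometry upload dominates, which is why per-event comparisons between builds are better made on the simulate and download ranges alone.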

N7_Ranges_SPAN__slow_downloads.png

"Debug" : rather slow hit downloads ?

S7_Ranges_SPAN__fast_downloads.png

Unclear why "Release" downloads so much faster than "Debug"

N7_Ranges_ONE__Debug.png

S7_Ranges_ONE__Release.png

Back to G4CXTest

Now back to [2] G4CXTest_GEOM.sh optical only comparison

N5_Substamp_ALL_Etime_vs_Photon__100M_31s_Release.png

Only got to 80M : due to U4Recorder memory leak

S5_Substamp_ALL_Etime_vs_Photon__U4Recorder_benefits_more.png

"Release" benefits B:U4Recorder more than A:CSGOptiX

N5_Subprofile_ALL__leaking.png

U4Recorder leaking badly! [Geant4 propagation recorded into Opticks SEvt]

S7_Subprofile_ALL__no_leak.png

([3] cxs_min.sh) Pure Opticks (no Geant4 or U4Recorder) : no leak

N5_Substamp_ALL_RATIO_vs_Photon__approach_200.png

B:U4Recorder / A:CSGOptiX : ratio only ~190 !

scan-pf-1_Opticks_vs_Geant4 2

Absolute Comparison with ancient Opticks Measurements ? [Below presented at CHEP 2019] 58s / 400M photons

JUNO analytic, 400M photons from center    Time                  Speedup
Geant4 Extrap.                             95,600 s (26 hrs)
Opticks RTX ON (i)                         58 s                  1650x

Absolute Comparison with ancient Opticks Measurements ?

JUNO analytic, 400M photons from center    Time                  Speedup    Notes
Geant4 Extrap.                             95,600 s (26 hrs)                Ancient (2019)
Opticks RTX ON (i)                         58 s                  1650x      Ancient (2019)
Current Opticks                            124 s (~2x slower)    "770x"     extrapolated from 31s for 100M
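
The figures for the current Opticks row follow from simple scaling of the Release measurement. The sketch below spells out that arithmetic using only numbers quoted in these slides; the variable names are illustrative, and the linear scaling from 100M to 400M photons is the same extrapolation assumption the table makes.

```python
geant4_extrap_s = 95_600.0    # 2019 extrapolated Geant4 time for 400M photons
opticks_2019_s = 58.0         # 2019 Opticks RTX ON time for 400M photons
release_s_per_100m = 31.0     # current Release build time for 100M photons

# ASSUMPTION: time scales linearly with photon count from 100M to 400M
current_s = release_s_per_100m * 400 / 100

speedup_2019 = geant4_extrap_s / opticks_2019_s
speedup_now = geant4_extrap_s / current_s
slowdown = current_s / opticks_2019_s

print(f"current 400M estimate : {current_s:.0f} s")
print(f"2019 speedup          : {speedup_2019:.0f}x")
print(f"current speedup       : {speedup_now:.0f}x")
print(f"slowdown vs 2019      : {slowdown:.2f}x")
```

The "770x" is quoted because it reuses the 2019 Geant4 extrapolation as the reference, so it inherits all the caveats about comparing measurements made four years apart.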

Practically everything is different between these measurements : nevertheless, it's natural to compare

  1. NVIDIA OptiX 6.5 -> 7.5 [entirely new API] => Opticks almost entirely re-implemented
  2. JUNO geometry : more complex than 4 years ago(?) : despite efforts to simplify
  3. JUNO PMT Optical Model (POM) (traditional vs "bouncy" with complex {A,R,T} TMM calculation)
  4. NVIDIA RTX 8000 (48G) vs NVIDIA TITAN RTX (24G) [similar spec other than VRAM]
  5. Geant4 setup : Geant4 is not a good candle : far too flexible

Expected Primary Cause of 2x slowdown : "bouncy" POM

N6_Substamp_ONE_maxb_scan_A_expensive_tail.png

Use cxs_min_scan.sh to vary OPTICKS_MAX_BOUNCE from 0->32

N6_Substamp_ONE_maxb_scan_HIT__slow_hit_increase.png

Slow hit increase above MAX_BOUNCE 20

Reproducing the MAX_BOUNCE scan plot

Using ~/o/cxs_min.sh script with:

OPTICKS_RUNNING_MODE : SRM_TORCH
OPTICKS_EVENT_MODE   : HitPhoton   (picked with VERSION 3)
OPTICKS_NUM_PHOTON   : H1          (100K)
OPTICKS_MAX_PHOTON   : M1

Workstation:

~/o/cxs_min_scan.sh      ## o is symbolic link to opticks

Laptop:

~/o/cxs_min.sh grab

PLOT=Substamp_ONE_maxb_scan PICK=A ~/o/sreport.sh
PLOT=Substamp_ONE_maxb_scan PICK=A ~/o/sreport.sh mpcap
PLOT=Substamp_ONE_maxb_scan PICK=A PUB=expensive_tail ~/o/sreport.sh mppub
vi ~/opticks/notes/issues/OPTICKS_MAX_BOUNCE_scanning.rst  ## notes

ALL4_thit_high_time_tail.png

TODO: check performance with MAX_TIME = 200,300,400 ns

ALL4_seqnib_small_truncation_bump.png

Small truncation bump at 32                      

Reproducing step point counts (aka seqnib) plot

Using ~/o/cxs_min.sh script with:

OPTICKS_RUNNING_MODE : SRM_TORCH
OPTICKS_EVENT_MODE   : HitPhotonSeq
OPTICKS_NUM_PHOTON   : M1
OPTICKS_MAX_PHOTON   : M1

Workstation:

VERSION=4 ~/o/cxs_min.sh

Laptop:

VERSION=4 ~/o/cxs_min.sh grab
VERSION=4 MODE=2 PLOT=seqnib ~/o/cxs_min.sh ana
VERSION=4 MODE=2 PLOT=seqnib ~/o/cxs_min.sh mpcap
VERSION=4 MODE=2 PLOT=seqnib PUB=small_truncation_bump ~/o/cxs_min.sh mppub

Amdahl's "Law" : Expected Speedup Limited by Serial Processing

optical photon simulation, P ~ 99% of CPU time

Must consider processing "big picture"


amdahl_p_sensitive.png

/env/presentation/parallel/amdahl.png

How much parallelized speedup is actually useful to the overall speedup?

Very dependent on the parallel fraction

Theoretical Overall Speedup for various parallel fractions and parallelized speedups

                     Parallelized Speedup
Parallel Fraction    100x    1000x    Notes
95%                  17x     20x      Little benefit beyond ~100x parallelized speedup
96%                  20x     24x
97%                  25x     32x
98%                  34x     48x      Substantial benefit from more parallelized speedup
99%                  50x     91x

In [4]: Amdahl.Overall_Speedup(np.array([100,1000]),0.95)
Out[4]: array([16.807, 19.627])

In [5]: Amdahl.Overall_Speedup(np.array([100,1000]),0.99)
Out[5]: array([50.251, 90.992])
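
The table values follow from Amdahl's formula, overall speedup = 1 / ((1 - P) + P / s), for parallel fraction P and parallelized speedup s. The function below is a minimal plain Python sketch of that formula. It is not the `Amdahl` class used in the IPython session above, whose source is not shown here, and the name `overall_speedup` is hypothetical.

```python
def overall_speedup(parallelized_speedup, parallel_fraction):
    """Amdahl's law: overall speedup when only a fraction of the work is sped up."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / parallelized_speedup)


# Reproduce the table rows: one line per parallel fraction
for p in (0.95, 0.96, 0.97, 0.98, 0.99):
    s100 = overall_speedup(100, p)
    s1000 = overall_speedup(1000, p)
    print(f"{p:.0%}  {s100:6.3f}  {s1000:6.3f}")
```

With P at 99%, as estimated for optical photon simulation, going from 100x to 1000x parallelized speedup still nearly doubles the overall gain, whereas at 95% it barely changes.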

NEXT STEPS

Try OptiX-IR (Intermediate Representation) alternative to PTX (new in OptiX 7.5)