Opticks : GPU Optical Photon Simulation for Particle Physics with NVIDIA OptiX

Tools, Techniques + Opticks : GPU Optical Photon Simulation for Particle Physics with NVIDIA® OptiX™

Simon C Blyth, IHEP — https://bitbucket.org/simoncblyth/opticks — Oct 2018, IHEP, Beijing

Opticks Benefits

Outline : Tools, Techniques and Opticks

/env/presentation/newtons-opticks.png      

Raytrace Diagram

Ray-tracing vs Rasterization

/env/presentation/nvidia/nv_rasterization.png /env/presentation/nvidia/nv_raytrace.png

Outline : Introducing the Tools

/env/presentation/newtons-opticks.png
 

Understanding GPUs : Graphical Origins : An Extremely Parallel Problem

GPUs evolved to rasterize 3D graphics, eg OpenGL graphics pipeline
  • millions of triangles, millions of pixels, mostly independent
  • simple data structures : vertices, triangles, pixels
  • literally billions of small "shader" programs run per second
/env/presentation/graphics/rendering_pipeline.png

NVIDIA Turing GPU : 72 SM, 4608 CUDA cores

NVIDIA Turing GPU : 72 SM, 4608 CUDA cores (spec)

CPU vs GPU architectures, Latency vs Throughput

/env/presentation/nvidia/cpu_vs_gpu_architecture.png

Waiting for memory read/write, is major source of latency...

CPU : latency-oriented : Minimize time to complete single task : avoid latency with caching
  • complex : caching system, branch prediction, speculative execution, ...
GPU : throughput-oriented : Maximize total work per unit time : hide latency with parallelism
  • many simple processing cores, hardware multithreading, SIMD (single instruction multiple data)
  • simpler : lots of compute (ALU), at expense of cache+control
  • design assumes abundant parallelism

Effective use of Totally different processor architecture -> Total reorganization of data and computation

Understanding Throughput-oriented Architectures https://cacm.acm.org/magazines/2010/11/100622-understanding-throughput-oriented-architectures/fulltext

How to Make Effective Use of GPUs ? -> Use Higher Level Libraries

Not many available, use if possible
  • benefit from other peoples experience

https://developer.nvidia.com/gpu-accelerated-libraries

/env/presentation/1px.png

For adventurous early adopters

How to Make Effective Use of GPUs ? Parallel / Simple / Uncoupled

Abundant parallelism
  • many thousands of tasks (ideally millions)
Low register usage : otherwise limits concurrent threads
  • simple kernels, avoid branching
Little/No Synchronization
  • avoid waiting, avoid complex code/debugging
Minimize CPU<->GPU copies
  • reuse GPU buffers across multiple CUDA launches
/env/presentation/1px.png

How Many Threads to Launch ?

Understanding Throughput-oriented Architectures https://cacm.acm.org/magazines/2010/11/100622-understanding-throughput-oriented-architectures/fulltext

NVIDIA Titan V: 80 SM, 5120 CUDA cores

GPU Constraints -> Array-Oriented Design -> NumPy

Separate address space -> Serialization
upload/download : host(CPU)<->device(GPU)
Object-oriented
  • mixes data and compute
  • complicated serialization
  • model complex systems, great up to ~1000 objects
Array-oriented : ideal fit for GPU work
  • separate data from compute, functional approach
  • inherent serialization
  • inherent simplicity, works for millions of items
  • NumPy leading array-oriented package
NumPy .npy : Simple Array Serialization Format

http://www.numpy.org/neps/nep-0001-npy-format.html

https://realpython.com/numpy-array-programming/

NumPy : Foundation of Python Data Ecosystem

https://bitbucket.org/simoncblyth/intro_to_numpy

Very terse, array-oriented (no-loop) python interface to C performance

/env/presentation/numpy_ecosystem.png
Recommended paper:
The NumPy array: a structure for efficient numerical computation https://hal.inria.fr/inria-00564007
 

https://docs.scipy.org/doc/numpy/user/quickstart.html

http://www.scipy-lectures.org/intro/index.html

https://github.com/donnemartin/data-science-ipython-notebooks

Python : fastest growing major programming language : Why?

/env/presentation/incredible_rise_of_python_2.png

https://stackoverflow.blog/2017/09/14/python-growing-quickly/

https://insights.stackoverflow.com/trends

https://jakevdp.github.io/blog/2014/05/05/introduction-to-the-python-buffer-protocol/

NumPy Example : closest approach of Ellipse to a Circle


import numpy as np

def ellipse_closest_approach_to_point( ex, ez, _c ):
    """ex, ez: ellipse semi-axes, c: coordinates of point in ellipse frame"""
    c = np.asarray( _c )  ; assert c.shape == (2,)

    t = np.linspace( 0, 2*np.pi, 1000000 )      # t: array of 1M angles [0,2pi] 
    e = np.zeros( [len(t), 2] )
    e[:,0] = ex*np.cos(t)
    e[:,1] = ez*np.sin(t)                       # e: 1M parametric [x,z] points on the ellipse 

    return  e[np.sum((e-c)**2, axis=1).argmin()]    # point on ellipse closest to c 
expression shape note
e (1000000,2)  
c (2,)  
e-c (1000000,2) c is broadcast over e : must be compatible shape
np.sum((e-c)**2, 1) (1000000,) axis=1 : summing over the axis of length 2
np.sum((e-c)**2, 0) (2,) axis=0 : summing over axis of length 1M
np.sum((e-c)**2, None) () axis=None : summing over all elements, yielding scalar

https://bitbucket.org/simoncblyth/opticks/src/tip/ana/x018_torus_hyperboloid_plt.py

NumPy Example : NLL Reconstruction fit : On one slide

##!/usr/bin/env python
import numpy as np, scipy.stats as st, scipy.optimize as so
np.random.seed(0)  # reproducibility

# generate n*n (x,y,z) coordinates on a sphere 
n = 50
u,v = np.meshgrid( np.linspace(0,np.pi,n+2)[1:-1],
                   np.linspace(0,2*np.pi,n+1)[:-1] )
uu = u.ravel() ; vv = v.ravel()
sph = np.zeros( [len(uu), 3] )
sph[:,0] = np.sin(uu)*np.cos(vv)
sph[:,1] = np.sin(uu)*np.sin(vv)
sph[:,2] = np.cos(uu)
R = 10  ; sph *= R

# mockup "truth" position 
parTru = np.array( [0,0,R/2, 1] )

# distances from all the sphere coordinates to the "truth" position  
d = np.sqrt(np.sum((sph - parTru[:3])**2, axis=1 ))

# mockup a time linear with the distance with a normal smearing     
t = d + parTru[3]*np.random.randn(len(d))

# geometric "time" as function of position 
t_model = lambda par:np.sqrt(np.sum((sph - par[:3])**2, axis=1 ))

# Assumed PDF of "time" at each sphere position, normal around geometric time with some sigma. 
NLL = lambda par:-np.sum( st.norm.logpdf(t, loc=t_model(par), scale=par[3] ))

parIni = np.array( [0,0,0,1] )  # initial parameter values
parFit = so.minimize(NLL, parIni, method='nelder-mead').x ; print(parFit)

https://bitbucket.org/simoncblyth/intro_to_numpy/src/default/recon_terse.py

Outline of Techniques : Monte Carlo Method

/env/presentation/newtons-opticks.png

Monte Carlo Method : apply randomness to model complex systems

Simulate designs before building them

Generate history of a system using random sampling techniques making variables follow expected PDFs[1]

simulation is vital to understand/design anything complex

[1] PDFs : probability density functions

https://www.slideserve.com/hafwen/monte-carlo-detector-simulation Pat Ward, Cambridge University, 74 pages Powerpoint

https://en.wikipedia.org/wiki/Monte_Carlo_method

Optical Photon Simulation : Deciding history on way to boundary

  1. intersect ray with geometry -> distance to boundary
  2. lookup absorption length, scattering length for material depending on wavelength
    • Opticks uses GPU texture interpolation
  3. "role dice" : characteristic lengths -> stochastic distances

Pick winning process from smallest distance

boundary_distance = from_geometry  # no random number needed
absorption_distance = -absorption_length * ln(u0)
scattering_distance = -scattering_length * ln(u1)
## u0, u1 uniform randoms in [0,1] : distances always +ve

If scatter:

  1. pick new photon direction at random
  2. set polarization perpendicular to new direction (transverse) and in same plane as direction and initial polarization
  3. rejection-sampling used to pick new polarization such that angle between old and new follows cos^2 distribution
  4. then repeat from 1.

Theory (eg Rayleigh scattering) -> PDFs used in the simulation

https://bitbucket.org/simoncblyth/opticks/src/default/optixrap/cu/propagate.h https://bitbucket.org/simoncblyth/opticks/src/default/optixrap/cu/rayleigh.h

MC Method : Sampling using Cumulative Distribution Function (CDF)

/env/presentation/normal_pdf_cdf.png

Aim : create sample (a set of values) that follows an analytic PDF

Cumulative distribution function (CDF) : Prob( X < x)

Map uniform randoms onto probabilities

  1. integrate PDF -> CDF (mapping PDF domain onto probability [0,1] )
  2. invert the CDF, so domain becomes [0,1]
  3. inverted_CDF(uniform random) -> sample value

Intuitively : throw randoms uniformly "vertically"

CDF "encodes" shape of PDF in convenient form for sampling

https://en.wikipedia.org/wiki/Inverse_transform_sampling

Monte Carlo Method : Modelling Scintillator Re-emission

/env/presentation/inverted_cdf_GScintillatorLib_py.png
reemission_probability(wavelength)
fraction of light absorbed in scintillator re-emitted with different wavelength.
generating re-emitted wavelength
  • expected PDF(wavelength) -> inverted_CDF(probability)

Opticks Re-emission model

float uniform_sample_reemit = curand_uniform(&rng);
if (uniform_sample_reemit < reemission_probability )
{
    ...
    p.wavelength = reemission_lookup(curand_uniform(&rng));    # re-emission texture lookup  
    s.flag = BULK_REEMIT ;
    return CONTINUE;
}
else
{
    s.flag = BULK_ABSORB ;
    return BREAK;
}

https://bitbucket.org/simoncblyth/opticks/src/default/optixrap/cu/propagate.h

Monte Carlo Method : Modelling Scattering and Absorption

/env/presentation/exponential_pdf_cdf.png

Create sample with exponential PDF, using CDF

a
characteristic scale of process, eg decay time or scattering/absorption/attenuation length
u
uniform random value in [0,1] ; 1-u equivalent to u

Known analytic form of CDF, means simple sampling:

Equate CDF probability with uniform random value [0,1]

X
stochastic distance/time obtained from characteristic scale a and uniform random u

https://www.eg.bucknell.edu/~xmeng/Course/CS6337/Note/master/node50.html

https://en.wikipedia.org/wiki/Inverse_transform_sampling

Monte Carlo Method : Accept-Reject Sampling

Simple technique providing a sample that follows a distribution. Example PDF : pdf(x) = (3/8).( 1 + x^2 )

  1. scatter random points (x,y) across PDF graph (piecewise for better efficiency)
  2. accept x for pdf(x) < y : implement in C++ with do { } while ( condition )
/env/presentation/accept_reject_sampling.png

https://bitbucket.org/simoncblyth/intro_to_numpy/src/default/accept_reject_sampling.py

https://en.wikipedia.org/wiki/Rejection_sampling

Monte Carlo Method : Accept-Reject Sampling (NumPy Code in IPython)

in [1]: import numpy as np
In [2]: a = np.random.rand( 1000000,2 )       # 2M random floats uniform in [0,1]  

In [3]: a[:,0] = a[:,0]*2. - 1.   #  [0,1] -> [-1,1]
In [4]: a[:,1] = a[:,1]*0.75      #  [0,1] -> [0,0.75]     pdf(0) = pdf(1) = 0.75

In [5]: pdf = lambda x:(3./8.)*(1+x**2)         # simple symmetric PDF normalized on -1:1 

In [6]: w = np.where( a[:,1] < pdf(a[:,0]) )     # indices of accepted sample  

In [7]: s = a[w][:,0]                     # accepted sample of 666301 values 

In [8]: s
Out[8]: array([ 0.0382,  0.9755,  0.6855, ...,  0.5962,  0.9094, -0.8292])

In [9]: s.shape
Out[9]: (666301,)

In [10]: a.shape
Out[10]: (1000000, 2)

In [11]: a
Out[11]:
array([[ 0.0382,  0.0752],
       [ 0.9755,  0.0495],
       [ 0.6855,  0.1835],
       ...,
       [ 0.5837,  0.7297],
       [-0.7534,  0.7253],
       [-0.8292,  0.0019]])

https://bitbucket.org/simoncblyth/intro_to_numpy/src/default/accept_reject_sampling.py

MC Method : Estimating Pi from circle/square count ratio : pi r^2/(2r)^2

/env/presentation/estimate_pi.png

https://bitbucket.org/simoncblyth/intro_to_numpy/src/default/estimate_pi.py

Monte Carlo Method : Estimating Pi : NumPy Code in IPython

In [1]: from __future__ import division     # python3 style default 
In [2]: import numpy as np

In [3]: a = np.random.rand( 1000000,2 )     # 2M uniform [0,1] random floats   

In [4]: a[:,0] = a[:,0]*2. - 1.             # NB: no loops at python level  

In [5]: a[:,1] = a[:,1]*2. - 1.

In [6]: mask = np.sum(a*a,1) < 1            # mask : array of booleans with same shape as a  

In [7]: w = np.where(mask)                  # w : array of indices of a within the mask 

In [8]: epi = 4*len(w[0])/len(a)            # estimate of pi =  4*(cicle count)/(square count)   

In [9]: label = " estimate_pi 4*%d/%d = %10.5f  (%10.5f) " % (len(w[0]), len(a), epi, epi-np.pi )

In [10]: a.shape
Out[10]: (1000000, 2)

In [11]: a[w].shape                         # shape of the selection  
Out[11]: (785209, 2)

In [12]: a[w]                              # select via an array of indices 
Out[12]:
array([[-0.3253,  0.8069],
       [ 0.2202, -0.8232],
       [-0.9173, -0.088 ],
       ...,
       [ 0.1739, -0.1885],
       [-0.1812, -0.8979],
       [ 0.4604, -0.524 ]])

https://bitbucket.org/simoncblyth/intro_to_numpy/src/default/estimate_pi.py

Thrust : Estimate pi : operator() method runs on device (GPU)


##include <curand_kernel.h>

struct estimate_pi
{
    estimate_pi(int _N) : N(_N) {}

    __device__ float operator()(unsigned seed)
    {
        float sum = 0;
        curandState rng;
        curand_init(seed, 0, 0, &rng);
       // initializing curand is very expensive better to split/persist/load, see next pages 

        for(int i = 0; i < N; ++i)
        {
            float x = curand_uniform(&rng);
            float y = curand_uniform(&rng);
            float dist = sqrtf(x*x + y*y);
            if(dist <= 1.0f) sum += 1.0f;
        }
        sum *= 4.0f;
        return sum / N;
    }

    int N ;
};

https://bitbucket.org/simoncblyth/intro_to_cuda/src/default/thrust_curand_estimate_pi.cu

Thrust : Estimate pi : thrust::transform_reduce does the "launch"

##include <thrust/iterator/counting_iterator.h>
##include <thrust/transform_reduce.h>
##include <iostream>
##include <iomanip>

int main()
{
     int N = 10000; int M = 30000;

     float estimate = thrust::transform_reduce(            // NB functor call ordering is undefined 
                thrust::counting_iterator<int>(0),
                thrust::counting_iterator<int>(M),   // 1st two args define implicit sequence  
                estimate_pi(N),                      // the functor to apply to the sequence 
                0.0f,                                // initial value of reduction 
                thrust::plus<float>())/M ;           // how to combine results from the functor call 

      std::cout
          << " M " << M << " N " << N
          << std::setprecision(5) << std::fixed
          << " estimate " << estimate
          << " delta "   << estimate - M_PI << std::endl ;

      return 0;
}
epsilon:tests blyth$ nvcc thrust_curand_estimate_pi.cu
epsilon:tests blyth$ ./a.out
 M 30000 N 10000 estimate 3.14142 delta -0.00017

https://bitbucket.org/simoncblyth/intro_to_cuda

cuRAND Random Number Generation : Split Initialization and Usage

https://bitbucket.org/simoncblyth/intro_to_cuda/src/tip/rng/rng_states.cu

Concurrent generation of millions of reproducible sequences of pseudorandom numbers

curand_init very expensive -> huge performance hit:

Solution:

Understanding this technicality -> correct "mental model" of CUDA context

NLL Fitting Extended Example : Demo CPU+GPU project techniques

https://bitbucket.org/simoncblyth/intro_to_cuda/src/tip/recon/

1. CPU+GPU Libs, Bridging header pattern

Recon.hh
hides CUDA implementation behind Rec<T> pointer

2. NumPy array read/write from C++/CUDA C/Thrust

3. Thrust : GPU development

Real CUDA usage -> two compilers -> interop techniques required

Geant4 : Monte Carlo Simulation Toolkit

Geant4 : Monte Carlo Simulation Toolkit Generality

Geant4 : Overview

Initialization

Beginning of Run

Beginning of Event

At Each Step of a Track

60 slide intro from one of the primary Geant4 architects/developers

http://geant4.in2p3.fr/2005/Workshop/ShortCourse/session1/J.Apostolakis.pdf

Outline of Opticks : Problem, Solution, Geometry, Validation

/env/presentation/newtons-opticks.png

Optical Photon Simulation Problem...

JPMT Before Contact 2

Ray Traced Image Synthesis ≈ Optical Photon Simulation

Geometry, light sources, optical physics ->

Ray tracing has many applications :

Ray-geometry intersection


ray tracing
cast rays thru image pixels into scene, recursively reflect/refract at intersects, combine returns into pixel values
rasterization
project 3D primitives onto 2D image plane, combine fragments into pixel values

NVIDIA® OptiX™ Ray Tracing Engine -- http://developer.nvidia.com/optix

OptiX makes GPU ray tracing accessible

NVIDIA expertise:

https://developer.nvidia.com/rtx

User provides (Yellow):

BVH

BVH Pascal

BVH Turing

Opticks : GPU Geometry starts from ray-primitive intersection

/env/presentation/tboolean_parade_sep2017.png

CUDA/OptiX intersection for ~10 primitives -> Exact geometry translation

Torus : much more difficult/expensive than other primitives

3D parametric ray : ray(x,y,z;t) = rayOrigin + t * rayDirection

High order equation

Best Solution : replace torus

Torus : different artifacts as change implementation/params/viewpoint

/env/presentation/torus_cloud_artifact_2017_08_14.png /env/presentation/torus_cuts_artifact_2017_08_08.png /env/presentation/torus_fan_artifact_2017_07_28.png

Constructive Solid Geometry (CSG) : Shapes defined "by construction"

Simple by construction definition, implicit geometry.

CSG expressions

3D Parametric Ray : ray(t) = r0 + t rDir

Ray Geometry Intersection

How to pick exactly ?

CSG : Which primitive intersect to pick ?

Classical Roth diagram approach

Computational requirements:

BUT : High performance on GPU requires:

Classical approach not appropriate on GPU

Ray intersection with general CSG binary trees, on GPU

Pick between pairs of nearest intersects, eg:

UNION tA < tB Enter B Exit B Miss B
Enter A ReturnA LoopA ReturnA
Exit A ReturnA ReturnB ReturnA
Miss A ReturnB ReturnB ReturnMiss
[1] Ray Tracing CSG Objects Using Single Hit Intersections, Andrew Kensler (2006)
with corrections by author of XRT Raytracer http://xrt.wikidot.com/doc:csg
[2] https://bitbucket.org/simoncblyth/opticks/src/tip/optixrap/cu/csg_intersect_boolean.h
Similar to binary expression tree evaluation using postorder traverse.

CSG Complete Binary Tree Serialization -> simplifies GPU side

Geant4 solid -> CSG binary tree (leaf primitives, non-leaf operators, 4x4 transforms on any node)

Serialize to complete binary tree buffer:

Height 3 complete binary tree with level order indices:

                                                   depth     elevation

                     1                               0           3

          10                   11                    1           2

     100       101        110        111             2           1

 1000 1001  1010 1011  1100 1101  1110  1111         3           0

postorder_next(i,elevation) = i & 1 ? i >> 1 : (i << elevation) + (1 << elevation) ; // from pattern of bits

Postorder tree traverse visits all nodes, starting from leftmost, such that children are visited prior to their parents.

Evaluative CSG intersection Pseudocode : recursion emulated

fullTree = PACK( 1 << height, 1 >> 1 )  // leftmost, parent_of_root(=0)
tranche.push(fullTree, ray.tmin)

while (!tranche.empty)         // stack of begin/end indices 
{
    begin, end, tmin <- tranche.pop  ; node <- begin ;
    while( node != end )                   // over tranche of postorder traversal 
    {
        elevation = height - TREE_DEPTH(node) ;
        if(is_primitive(node)){ isect <- intersect_primitive(node, tmin) ;  csg.push(isect) }
        else{
            i_left, i_right = csg.pop, csg.pop            // csg stack of intersect normals, t 
            l_state = CLASSIFY(i_left, ray.direction, tmin)
            r_state = CLASSIFY(i_right, ray.direction, tmin)
            action = LUT(operator(node), leftIsCloser)(l_state, r_state)

            if(      action is ReturnLeft/Right)     csg.push(i_left or i_right)
            else if( action is LoopLeft/Right)
            {
                left = 2*node ; right = 2*node + 1 ;
                endTranche = PACK( node,  end );
                leftTranche = PACK(  left << (elevation-1), right << (elevation-1) )
                rightTranche = PACK(  right << (elevation-1),  node  )
                loopTranche = action ? leftTranche : rightTranche

                tranche.push(endTranche, tmin)
                tranche.push(loopTranche, tminAdvanced )  // subtree re-traversal with changed tmin 
                break ; // to next tranche
            }
        }
        node <- postorder_next(node, elevation)         // bit twiddling postorder 
    }
}
isect = csg.pop();         // winning intersect  

https://bitbucket.org/simoncblyth/opticks/src/tip/optixrap/cu/csg_intersect_boolean.h

CSG Deep Tree : JUNO "fastener"

/env/presentation/x016_deeptree.png

CSG Deep Tree : height 11 before balancing, too deep for GPU raytrace

 NTreeAnalyse height 11 count 25    ( un : union,  cy : cylinder, di : difference )
                                                                                       un

                                                                               un              di

                                                                       un          cy      cy      cy

                                                               un          cy

                                                       un          cy

                                               un          cy

                                       un          cy

                               un          cy

                       un          cy

               un          cy

       di          cy

   cy      cy

CSG trees are non-unique

CSG Deep Tree : Positivize tree using De Morgans laws

1st step to allow balancing : Positivize : remove CSG difference di operators

                                                     ...    ...

                                               un          cy

                                       un          cy

                               un          cy

                       un          cy

               un          cy

       di          cy

   cy      cy

                                                     ...    ...

                                               un          cy

                                       un          cy

                               un          cy

                       un          cy

               un          cy

       in          cy

   cy      !cy

CSG Deep Tree : height 4 after balancing, OK for GPU raytrace

NTreeAnalyse height 4 count 25
                                                               un

                               un                                                      un

               un                              un                      un                      in

       un              un              un              un          cy          in          cy     !cy

   cy      cy      cy      cy      cy      cy      cy      cy              cy     !cy

un : union, in : intersect, cy : cylinder, !cy : complemented cylinder

Balancing positive tree:

  1. classify tree operators and their placement
    • mono-operator trees can easily be rearranged as union un and intersection in operators are commutative
    • mono-operator above bileaf level can also easily be rearranged as the bileaf can be split off and combined
  2. create complete binary tree of appropriate size filled with placeholders
  3. populate the tree replacing placeholders
  4. prune (pull primitives up to avoid placeholder pairings)

Not a general balancer : but succeeds with all CSG solid trees from Daya Bay and JUNO so far

https://bitbucket.org/simoncblyth/opticks/src/default/npy/NTreeBalance.cpp

Opticks Analytic Daya Bay Near Site, GPU Raytrace (3)

Opticks Analytic Daya Bay Near Site, GPU Raytrace (1)

Opticks Analytic Daya Bay Near Site, GPU Raytrace (0)

Opticks Analytic Daya Bay Near Site, GPU Raytrace (2)

Opticks Analytic JUNO Chimney, GPU Raytrace (0)

Opticks Analytic JUNO PMT Snap, GPU Raytrace (1)

j1808_top_ogl

j1808_top_rtx

j1808_escapes

CSG : (Cylinder - Torus) PMT neck : spurious intersects

/env/presentation/tboolean_12_rasterized.png /env/presentation/tboolean_12_raytrace.png

OptiX Raytrace and OpenGL rasterized wireframe comparing neck models:

  1. Ellipsoid + Hyperboloid + Cylinder
  2. Ellipsoid + (Cylinder - Torus) + Cylinder

Best Solution : use simpler neck model for physically unimportant PMT neck

CSG : Alternative PMT neck designs

/env/presentation/x018_torus_hyperboloid_plt.png /env/presentation/x018_torus_hyperboloid_plt_zoom.png

Hyperboloid and Cone defined using closest point on ellipse to center of torus circle

https://bitbucket.org/simoncblyth/opticks/src/tip/ana/x018_torus_hyperboloid_plt.py

Opticks : Auto-Instancing

Structure Tree Analysis

-> all repeated volumes + transforms

For JUNO, auto-finds:

Avoids geometry specific code

OpenGL/OptiX instancing

Opticks : translates G4 geometry to GPU, without approximation

Direct Geometry : Geant4 "World" -> Opticks CSG -> GPU
  • simpler : no G4DAE+GDML export/import
Material/Surface/Scintillator properties
  • interpolated to standard wavelength domain
  • interleaved into "boundary" texture
  • "reemission" texture for wavelength generation
Structure
  • repeated geometry instances identified (progeny digests)
  • instance transforms used in OptiX/OpenGL geometry
  • merge CSG trees into global + instance buffers
  • export meshes to glTF 2.0 for 3D visualization
Ease of Use
  • easy geometry : just handover "World"
  • easy config : modern CMake + BCM[1]
  • ~easy event : modify G4Cerenkov + G4Scintillation

[1] Boost CMake 3.5+ modules : configure direct dependencies only

https://github.com/BoostCMake/cmake_modules

https://github.com/simoncblyth/bcm

Opticks Export of G4 geometry to glTF 2.0

Opticks : translates G4 optical physics to GPU

OptiX : single-ray programming model -> line-by-line translation

CUDA Ports of Geant4 classes
  • G4Cerenkov (only generation loop)
  • G4Scintillation (only generation loop)
  • G4OpAbsorption
  • G4OpRayleigh
  • G4OpBoundaryProcess (only a few surface types)
Modify Cerenkov + Scintillation Processes
  • collect genstep, copy to GPU for generation
  • avoids copying millions of photons to GPU
Scintillator Reemission
  • fraction of bulk absorbed "reborn" within same thread
  • wavelength generated by reemission texture lookup
Opticks (OptiX/Thrust GPU interoperation)
  • OptiX : upload gensteps
  • Thrust : seeding, distribute genstep indices to photons
  • OptiX : launch photon generation and propagation
  • Thrust : pullback photons that hit PMTs
  • Thrust : index photon step sequences (optional)

Validation : Compare Opticks/Geant4 with Simple Lights/Geometries

/env/graphics/ggeoview/rainbow-spol-disc-incident-sphere.png

1M Photons -> Water Sphere (S-Polarized)

/env/graphics/ggeoview/PmtInBox-approach.png

0.5M Photons -> Dayabay PMT

/env/presentation/1px.png
Photon step records
128 bit per step : highly compressed position, time, wavelength, polarization vector, material/history codes
Photon flag sequence
16x 4-bit step flags recorded in uint64 sequence, indexed using Thrust GPU sort (1M indexed ~0.040s)
Final Photons
Uncompressed : position, time, wavelength, direction, polarization, flags

1M Rainbow S-Polarized, Comparison Opticks/Geant4

Deviation angle(degrees) of 1M parallel monochromatic photons in disc shaped beam incident on water sphere. Numbered bands are visible range expectations of first 11 rainbows. S-Polarized intersection (E field perpendicular to plane of incidence) arranged by directing polarization radially.

/env/optix/cfg4/rainbow-cfg4-spol.png

Take Control of Geant4 Random Number Generator (RNG)

After CAlignEngine::SetSequenceIndex(int index) : subsequent G4UniformRand() give randoms from sequence index

// EngineMinimalTest.cc : demonstrate G4UniformRand control
##include "Randomize.hh"                                              $ EngineMinimalTest
                                                                     0.13049
struct MyEngine : public CLHEP::MixMaxRng  // MixMax is default      0.617751
{                                                                    0.995947
    double flat(){ return .42 ; }                                    0.495902
};                                                                   0.112917
                                                                     0.289871
int main(int argc, char** argv)                                      0.473044
{                                                                    0.837619
    if(argc > 1)                                                     0.359356
       CLHEP::HepRandom::setTheEngine(new MyEngine());               0.926938

    for( int i=0 ; i < 10 ; i++)                                     $ EngineMinimalTest 1
         std::cout << G4UniformRand() << std::endl ;                 0.42
                                                                     0.42
    return 0 ;                                                       0.42
}                                                                    ...

https://bitbucket.org/simoncblyth/opticks/src/default/cfg4/CAlignEngine.cc

Validation : Aligning CPU and GPU Simulations

Aligned zipping together of code and RNG values

Single executable lldb OKG4Test:

  1. run Opticks GPU simulation, persist event
  2. run Geant4 simulation
    • step-by-step check each G4 photon follows Opticks history and parameters, break at deviations
  3. fix cause of misaligned RNG consumption, or other deviation
    • tricks needed on both sides : burning RNGs, jump backs

Payoff : simplest possible direct comparison validation

http://bitbucket.com/simoncblyth/opticks/src/tip/tools/autobreakpoint.py

(lldb) help breakpoint command add

Validation : Direct comparison of GPU/CPU NumPy arrays

tboolean-box simple geometry test


In [11]: pdv = np.where(dv > 0.0001)[0]
In [12]: ab.dumpline(pdv)
      0   1230 : TO BR SC BT BR BT SA
      1   2413 : TO BT BT SC BT BR BR BT SA
      2   9041 : TO BT SC BR BR BR BR BT SA
      3  14510 : TO SC BT BR BR BT SA
      4  14747 : TO BT SC BR BR BR BR BR BR BR
      5  14747 : TO BT SC BR BR BR BR BR BR BR
    ...

In [20]: ab.b.ox[pdv,0]                                 In [21]: ab.a.ox[pdv,0]
Out[20]:                                                Out[21]:
A()sliced                                               A()sliced
A([    [-191.6262, -240.3634,  450.    ,    5.566 ],    A([    [-191.626 , -240.3634,  450.    ,    5.566 ],
       [ 185.7708, -133.8457,  450.    ,    7.3141],           [ 185.7708, -133.8456,  450.    ,    7.3141],
       [-450.    , -104.4142,  311.143 ,    9.0581],           [-450.    , -104.4142,  311.1431,    9.0581],
       [  83.6955,  208.9171, -450.    ,    5.6188],           [  83.6954,  208.9172, -450.    ,    5.6188],
       [  32.8972,  150.    ,   24.9922,    7.6757],           [  32.8973,  150.    ,   24.992 ,    7.6757],
       [  32.8972,  150.    ,   24.9922,    7.6757],           [  32.8973,  150.    ,   24.992 ,    7.6757],
       [ 450.    , -186.7449,  310.6051,    5.0707],           [ 450.    , -186.7451,  310.605 ,    5.0707],
       [ 299.2227,  318.1443, -450.    ,    4.8717],           [ 299.2229,  318.144 , -450.    ,    4.8717],
 ...

http://bitbucket.com/simoncblyth/opticks/src/tip/notes/issues/tboolean_box_perfect_alignment_small_deviations.rst

Coincident Faces are Primary Cause of Issues : Spurious Intersects

Coincidences common (alignment too tempting?). To fix:

/env/opticks_refs/opticks_tscan_29_nzero_5_OcrGdsPrt.png

Summary

/env/presentation/1px.png

Opticks enables Geant4 based simulations to benefit from effectively zero time and zero CPU memory optical photon simulation, due to massive parallelism made accessible by NVIDIA OptiX.

/env/presentation/1px.png
  • Drastic speedup -> better detector understanding -> greater precision
  • Performance discontinuity -> new possibilities -> imagination required
/env/presentation/1px.png

Subscribe to stay informed on Opticks:

opticks+subscribe@groups.io

https://groups.io/g/opticks

https://bitbucket.org/simoncblyth/opticks

Opticks References

https://simoncblyth.bitbucket.io
Opticks presentations and videos
https://groups.io/g/opticks
Opticks mailing list archive
opticks+subscribe@groups.io
send email to this address, to subscribe
https://simoncblyth.bitbucket.io/opticks/index.html
Opticks installation instructions
https://bitbucket.org/simoncblyth/opticks
Opticks code repository

OpticksDocs

The Zen of Numpy, by its creator, Travis Oliphant


Strided is better than scattered.
Contiguous is better than strided.
Descriptive is better than imperative[1] (e.g. data-types).
Array-orientated is better than object-oriented.
Broadcasting is a great idea -- use where possible!
Vectorized is better than an explicit loop.
Unless its complicated -- then use Cython or numexpr.
Think in higher dimensions.

My take : best tool depends on nature of data

[1] imperative means step by step how to do something

[2] but not so large that has trouble fitting in memory, np.memmap is possible but better to avoid for simplicity

NPY minimal file format : metadata header + data buffer : trivial to parse


In [1]: a = np.arange(10)        # array of 10 ints : 64 bit, 8 bytes each  
In [2]: np.save("a.npy", a )       # persist the array : serializing it into a file 

In [3]: a2 = np.load("a.npy")         # load array from file into memory 
In [4]: assert np.all( a == a2 )      # check all elements the same 

In [5]: !xxd a.npy                   # run xxd in shell to hexdump the byte contents of the file 

00000000: 934e 554d 5059 0100 7600 7b27 6465 7363  .NUMPY..v.{ desc
00000010: 7227 3a20 273c 6938 272c 2027 666f 7274  r': '<i8', 'fort
00000020: 7261 6e5f 6f72 6465 7227 3a20 4661 6c73  ran_order : Fals
00000030: 652c 2027 7368 6170 6527 3a20 2831 302c  e, 'shape': (10,    # minimal metadata : type, shape   
00000040: 292c 207d 2020 2020 2020 2020 2020 2020  ), }
00000050: 2020 2020 2020 2020 2020 2020 2020 2020
00000060: 2020 2020 2020 2020 2020 2020 2020 2020
00000070: 2020 2020 2020 2020 2020 2020 2020 200a                 .    # 128 bytes of header  
00000080: 0000 0000 0000 0000 0100 0000 0000 0000  ................
00000090: 0200 0000 0000 0000 0300 0000 0000 0000  ................
000000a0: 0400 0000 0000 0000 0500 0000 0000 0000  ................    # data buffer  
000000b0: 0600 0000 0000 0000 0700 0000 0000 0000  ................
000000c0: 0800 0000 0000 0000 0900 0000 0000 0000  ................

In [6]: !ls -l a.npy     # small 128 byte header + (8 bytes per integer)*10 = 208 bytes total 
-rw-r--r--  1 blyth  staff  208 Sep 13 11:01 a.npy

In [7]: a
Out[7]: array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
In [8]: a.shape
Out[8]: (10,)

One possibility for compression with blosc http://bcolz.blosc.org/en/latest/intro.html