Opticks > 1000x Geant4 (*)
GPU massive parallelism eliminates bottleneck.
[zero: effectively, compared to rest of simulation]
More Photons -> More Benefit
http://bitbucket.org/simoncblyth/opticks
(*) core extrapolated from mobile GPU speed
Thousands of CUDA cores
GPU : Turing TU102
GPU : Volta V100 (eg Titan V)
CPU : Intel Xeon Processor Family
Waiting for memory read/write, is major source of latency...
Effective use of Totally different processor architecture -> Total reorganization of data and computation
Understanding Throughput-oriented Architectures https://cacm.acm.org/magazines/2010/11/100622-understanding-throughput-oriented-architectures/fulltext
High level C++ access to CUDA
"... Code at the speed of light ..."
"... high-level interface greatly enhances programmer productivity ..."
https://developer.nvidia.com/gpu-accelerated-libraries
For adventurous early adopters
Optical Photon Simulation
~perfect match for GPU acceleration
How Many Threads to Launch ?
Understanding Throughput-oriented Architectures https://cacm.acm.org/magazines/2010/11/100622-understanding-throughput-oriented-architectures/fulltext
NVIDIA Titan V: 80 SM, 5120 CUDA cores
Serialization Benefits
YET : Opticks libs do not depend on NumPy/python
https://bitbucket.org/simoncblyth/intro_to_numpy
Very terse, array-oriented (no-loop) python interface to C performance
https://docs.scipy.org/doc/numpy/user/quickstart.html
http://www.scipy-lectures.org/intro/index.html
https://github.com/donnemartin/data-science-ipython-notebooks
https://stackoverflow.blog/2017/09/14/python-growing-quickly/
https://insights.stackoverflow.com/trends
https://jakevdp.github.io/blog/2014/05/05/introduction-to-the-python-buffer-protocol/
import numpy as np def ellipse_closest_approach_to_point( ex, ez, _c ): """ex, ez: ellipse semi-axes, c: coordinates of point in ellipse frame""" c = np.asarray( _c ) ; assert c.shape == (2,) t = np.linspace( 0, 2*np.pi, 1000000 ) # t: array of 1M angles [0,2pi] e = np.zeros( [len(t), 2] ) e[:,0] = ex*np.cos(t) e[:,1] = ez*np.sin(t) # e: 1M parametric [x,z] points on the ellipse return e[np.sum((e-c)**2, axis=1).argmin()] # point on ellipse closest to c
expression | shape | note |
---|---|---|
e | (1000000,2) | |
c | (2,) | |
e-c | (1000000,2) | c is broadcast over e : must be compatible shape |
np.sum((e-c)**2, 1) | (1000000,) | axis=1 : summing over the axis of length 2 |
np.sum((e-c)**2, 0) | (2,) | axis=0 : summing over axis of length 1M |
np.sum((e-c)**2, None) | () | axis=None : summing over all elements, yielding scalar |
https://bitbucket.org/simoncblyth/opticks/src/tip/ana/x018_torus_hyperboloid_plt.py
Array-oriented brevity : prototyping
In [5]: t_model([0,0,0]) Out[5]: array([10., 10., 10., ..., 10., 10., 10.]) In [6]: t_model([0,0,0]).shape Out[6]: (2500,) In [7]: t.shape Out[7]: (2500,) In [8]: t Out[8]: array([ 6.783 , 5.4754, 6.1462, ... ])
##!/usr/bin/env python import numpy as np, scipy.stats as st, scipy.optimize as so np.random.seed(0) # reproducibility # generate n*n (x,y,z) coordinates on a sphere n = 50 u,v = np.meshgrid( np.linspace(0,np.pi,n+2)[1:-1], np.linspace(0,2*np.pi,n+1)[:-1] ) uu = u.ravel() ; vv = v.ravel() sph = np.zeros( [len(uu), 3] ) sph[:,0] = np.sin(uu)*np.cos(vv) sph[:,1] = np.sin(uu)*np.sin(vv) sph[:,2] = np.cos(uu) R = 10 ; sph *= R # mockup "truth" position parTru = np.array( [0,0,R/2, 1] ) # distances from all the sphere coordinates to the "truth" position d = np.sqrt(np.sum((sph - parTru[:3])**2, axis=1 )) # mockup a time linear with the distance with a normal smearing t = d + parTru[3]*np.random.randn(len(d)) # geometric "time" as function of position t_model = lambda par:np.sqrt(np.sum((sph - par[:3])**2, axis=1 )) # Assumed PDF of "time" at each sphere position, normal around geometric time with some sigma. NLL = lambda par:-np.sum( st.norm.logpdf(t, loc=t_model(par), scale=par[3] )) parIni = np.array( [0,0,0,1] ) # initial parameter values parFit = so.minimize(NLL, parIni, method='nelder-mead').x ; print(parFit)
https://bitbucket.org/simoncblyth/intro_to_numpy/src/default/recon_terse.py
Casino de Monte-Carlo, Monaco (French Riviera)
Simulate designs before building them
Generate history of a system using random sampling techniques making variables follow expected PDFs[1]
simulation is vital to understand/design anything complex
[1] PDFs : probability density functions
https://www.slideserve.com/hafwen/monte-carlo-detector-simulation Pat Ward, Cambridge University, 74 pages Powerpoint
Possible Histories
Pick winning process from smallest distance
boundary_distance = from_geometry # no random number needed absorption_distance = -absorption_length * ln(u0) scattering_distance = -scattering_length * ln(u1) ## u0, u1 uniform randoms in [0,1] : distances always +ve
If scatter:
Theory (eg Rayleigh scattering) -> PDFs used in the simulation
https://bitbucket.org/simoncblyth/opticks/src/default/optixrap/cu/propagate.h https://bitbucket.org/simoncblyth/opticks/src/default/optixrap/cu/rayleigh.h
Aim : create sample (a set of values) that follows an analytic PDF
Cumulative distribution function (CDF) : Prob( X < x)
Map uniform randoms onto probabilities
Intuitively : throw randoms uniformly "vertically"
CDF "encodes" shape of PDF in convenient form for sampling
Opticks Re-emission model
float uniform_sample_reemit = curand_uniform(&rng);
if (uniform_sample_reemit < reemission_probability )
{
...
p.wavelength = reemission_lookup(curand_uniform(&rng)); # re-emission texture lookup
s.flag = BULK_REEMIT ;
return CONTINUE;
}
else
{
s.flag = BULK_ABSORB ;
return BREAK;
}
https://bitbucket.org/simoncblyth/opticks/src/default/optixrap/cu/propagate.h
Create sample with exponential PDF, using CDF
Known analytic form of CDF, means simple sampling:
Equate CDF probability with uniform random value [0,1]
https://www.eg.bucknell.edu/~xmeng/Course/CS6337/Note/master/node50.html
Simple technique providing a sample that follows a distribution. Example PDF : pdf(x) = (3/8).( 1 + x^2 )
https://bitbucket.org/simoncblyth/intro_to_numpy/src/default/accept_reject_sampling.py
in [1]: import numpy as np In [2]: a = np.random.rand( 1000000,2 ) # 2M random floats uniform in [0,1] In [3]: a[:,0] = a[:,0]*2. - 1. # [0,1] -> [-1,1] In [4]: a[:,1] = a[:,1]*0.75 # [0,1] -> [0,0.75] pdf(0) = pdf(1) = 0.75 In [5]: pdf = lambda x:(3./8.)*(1+x**2) # simple symmetric PDF normalized on -1:1 In [6]: w = np.where( a[:,1] < pdf(a[:,0]) ) # indices of accepted sample In [7]: s = a[w][:,0] # accepted sample of 666301 values In [8]: s Out[8]: array([ 0.0382, 0.9755, 0.6855, ..., 0.5962, 0.9094, -0.8292]) In [9]: s.shape Out[9]: (666301,) In [10]: a.shape Out[10]: (1000000, 2) In [11]: a Out[11]: array([[ 0.0382, 0.0752], [ 0.9755, 0.0495], [ 0.6855, 0.1835], ..., [ 0.5837, 0.7297], [-0.7534, 0.7253], [-0.8292, 0.0019]])
https://bitbucket.org/simoncblyth/intro_to_numpy/src/default/accept_reject_sampling.py
https://bitbucket.org/simoncblyth/intro_to_numpy/src/default/estimate_pi.py
In [1]: from __future__ import division # python3 style default In [2]: import numpy as np In [3]: a = np.random.rand( 1000000,2 ) # 2M uniform [0,1] random floats In [4]: a[:,0] = a[:,0]*2. - 1. # NB: no loops at python level In [5]: a[:,1] = a[:,1]*2. - 1. In [6]: mask = np.sum(a*a,1) < 1 # mask : array of booleans with same shape as a In [7]: w = np.where(mask) # w : array of indices of a within the mask In [8]: epi = 4*len(w[0])/len(a) # estimate of pi = 4*(cicle count)/(square count) In [9]: label = " estimate_pi 4*%d/%d = %10.5f (%10.5f) " % (len(w[0]), len(a), epi, epi-np.pi ) In [10]: a.shape Out[10]: (1000000, 2) In [11]: a[w].shape # shape of the selection Out[11]: (785209, 2) In [12]: a[w] # select via an array of indices Out[12]: array([[-0.3253, 0.8069], [ 0.2202, -0.8232], [-0.9173, -0.088 ], ..., [ 0.1739, -0.1885], [-0.1812, -0.8979], [ 0.4604, -0.524 ]])
https://bitbucket.org/simoncblyth/intro_to_numpy/src/default/estimate_pi.py
##include <curand_kernel.h>
struct estimate_pi
{
estimate_pi(int _N) : N(_N) {}
__device__ float operator()(unsigned seed)
{
float sum = 0;
curandState rng;
curand_init(seed, 0, 0, &rng);
// initializing curand is very expensive better to split/persist/load, see next pages
for(int i = 0; i < N; ++i)
{
float x = curand_uniform(&rng);
float y = curand_uniform(&rng);
float dist = sqrtf(x*x + y*y);
if(dist <= 1.0f) sum += 1.0f;
}
sum *= 4.0f;
return sum / N;
}
int N ;
};
https://bitbucket.org/simoncblyth/intro_to_cuda/src/default/thrust_curand_estimate_pi.cu
##include <thrust/iterator/counting_iterator.h> ##include <thrust/transform_reduce.h> ##include <iostream> ##include <iomanip> int main() { int N = 10000; int M = 30000; float estimate = thrust::transform_reduce( // NB functor call ordering is undefined thrust::counting_iterator<int>(0), thrust::counting_iterator<int>(M), // 1st two args define implicit sequence estimate_pi(N), // the functor to apply to the sequence 0.0f, // initial value of reduction thrust::plus<float>())/M ; // how to combine results from the functor call std::cout << " M " << M << " N " << N << std::setprecision(5) << std::fixed << " estimate " << estimate << " delta " << estimate - M_PI << std::endl ; return 0; }
epsilon:tests blyth$ nvcc thrust_curand_estimate_pi.cu epsilon:tests blyth$ ./a.out M 30000 N 10000 estimate 3.14142 delta -0.00017
https://bitbucket.org/simoncblyth/intro_to_cuda/src/tip/rng/rng_states.cu
Persistent CUDA Context
GPU buffers live within CUDA context
Opticks populates context at initialization:
Concurrent generation of millions of reproducible sequences of pseudorandom numbers
curand_init very expensive -> huge performance hit:
Solution:
Understanding this technicality -> correct "mental model" of CUDA context
Hiding the CUDA implementation
// recon/Rec.hh // requires nvcc compilation ##include "NLL.hh" template<typename T> T Rec<T>::nll_() const { return -thrust::transform_reduce( thrust::make_counting_iterator(0), thrust::make_counting_iterator(tnum), *nll, // NLL functor T(0), thrust::plus<T>() ); }
// recon/Recon.hh // gcc/clang/nvcc compilation ##include <vector> template <typename T> struct Rec ; template <typename T> struct Recon { Rec<T>* rec ; // <-- CUDA imp hidden behind pointer Recon( const char* dir ) ; T nll( const std::vector<T>& par ); };
https://bitbucket.org/simoncblyth/intro_to_cuda/src/tip/recon/
1. CPU+GPU Libs, Bridging header pattern
2. NumPy array read/write from C++/CUDA C/Thrust
3. Thrust : GPU development
Real CUDA usage -> two compilers -> interop techniques required
Standard Simulation Tool of HEP
Geant4 simulates particles travelling through matter
Geant4 Approach
Very General and Capable Tool
No main(), No defaults
Geant4 is a toolkit : it provides no main()
Materials and surfaces can be assigned optical properties such as: RINDEX, REFLECTIVITY
Initialization
Beginning of Run
Beginning of Event
At Each Step of a Track
60 slide intro from one of the primary Geant4 architects/developers
http://geant4.in2p3.fr/2005/Workshop/ShortCourse/session1/J.Apostolakis.pdf
Optical Photon Problem
-> Hybrid Solution : Geant4 + Opticks
Not a Photo, a Calculation
Geometry, light sources, optical physics ->
Ray tracing has many applications :
Ray-geometry intersection
OptiX Raytracing Pipeline
Analogous to OpenGL rasterization pipeline:
OptiX makes GPU ray tracing accessible
NVIDIA expertise:
https://developer.nvidia.com/rtx
User provides (Yellow):
Torus artifacts
3D parametric ray : ray(x,y,z;t) = rayOrigin + t * rayDirection
High order equation
Best Solution : replace torus
CSG Binary Tree
Primitives combined via binary operators
Simple by construction definition, implicit geometry.
CSG expressions
3D Parametric Ray : ray(t) = r0 + t rDir
Ray Geometry Intersection
How to pick exactly ?
In/On/Out transitions
Classical Roth diagram approach
Computational requirements:
BUT : High performance on GPU requires:
Classical approach not appropriate on GPU
Outside/Inside Unions
dot(normal,rayDir) -> Enter/Exit
Pick between pairs of nearest intersects, eg:
UNION tA < tB | Enter B | Exit B | Miss B |
---|---|---|---|
Enter A | ReturnA | LoopA | ReturnA |
Exit A | ReturnA | ReturnB | ReturnA |
Miss A | ReturnB | ReturnB | ReturnMiss |
Bit Twiddling Navigation
Geant4 solid -> CSG binary tree (leaf primitives, non-leaf operators, 4x4 transforms on any node)
Serialize to complete binary tree buffer:
Height 3 complete binary tree with level order indices:
depth elevation 1 0 3 10 11 1 2 100 101 110 111 2 1 1000 1001 1010 1011 1100 1101 1110 1111 3 0
postorder_next(i,elevation) = i & 1 ? i >> 1 : (i << elevation) + (1 << elevation) ; // from pattern of bits
Postorder tree traverse visits all nodes, starting from leftmost, such that children are visited prior to their parents.
fullTree = PACK( 1 << height, 1 >> 1 ) // leftmost, parent_of_root(=0) tranche.push(fullTree, ray.tmin) while (!tranche.empty) // stack of begin/end indices { begin, end, tmin <- tranche.pop ; node <- begin ; while( node != end ) // over tranche of postorder traversal { elevation = height - TREE_DEPTH(node) ; if(is_primitive(node)){ isect <- intersect_primitive(node, tmin) ; csg.push(isect) } else{ i_left, i_right = csg.pop, csg.pop // csg stack of intersect normals, t l_state = CLASSIFY(i_left, ray.direction, tmin) r_state = CLASSIFY(i_right, ray.direction, tmin) action = LUT(operator(node), leftIsCloser)(l_state, r_state) if( action is ReturnLeft/Right) csg.push(i_left or i_right) else if( action is LoopLeft/Right) { left = 2*node ; right = 2*node + 1 ; endTranche = PACK( node, end ); leftTranche = PACK( left << (elevation-1), right << (elevation-1) ) rightTranche = PACK( right << (elevation-1), node ) loopTranche = action ? leftTranche : rightTranche tranche.push(endTranche, tmin) tranche.push(loopTranche, tminAdvanced ) // subtree re-traversal with changed tmin break ; // to next tranche } } node <- postorder_next(node, elevation) // bit twiddling postorder } } isect = csg.pop(); // winning intersect
https://bitbucket.org/simoncblyth/opticks/src/tip/optixrap/cu/csg_intersect_boolean.h
NTreeAnalyse height 11 count 25 ( un : union, cy : cylinder, di : difference ) un un di un cy cy cy un cy un cy un cy un cy un cy un cy un cy di cy cy cy
CSG trees are non-unique
Positive form CSG Trees
Apply deMorgan pushing negations down tree
End with only UNION, INTERSECT operators, and some complemented leaves.
COMMUTATIVE -> easily rearranged
1st step to allow balancing : Positivize : remove CSG difference di operators
... ... un cy un cy un cy un cy un cy di cy cy cy
... ... un cy un cy un cy un cy un cy in cy cy !cy
NTreeAnalyse height 4 count 25 un un un un un un in un un un un cy in cy !cy cy cy cy cy cy cy cy cy cy !cy
un : union, in : intersect, cy : cylinder, !cy : complemented cylinder
Balancing positive tree:
Not a general balancer : but succeeds with all CSG solid trees from Daya Bay and JUNO so far
https://bitbucket.org/simoncblyth/opticks/src/default/npy/NTreeBalance.cpp
OptiX Raytrace and OpenGL rasterized wireframe comparing neck models:
Best Solution : use simpler neck model for physically unimportant PMT neck
Hyperboloid and Cone defined using closest point on ellipse to center of torus circle
https://bitbucket.org/simoncblyth/opticks/src/tip/ana/x018_torus_hyperboloid_plt.py
Structure Tree Analysis
-> all repeated volumes + transforms
For JUNO, auto-finds:
Avoids geometry specific code
OpenGL/OptiX instancing
Volumes -> Boundaries
Ray tracing favors Boundaries
Material/surface boundary : 4 indices
Primitives labelled with unique boundary index
[1] Boost CMake 3.5+ modules : configure direct dependencies only
Emerging 3D Standard
"JPEG" of 3D
GPU Resident Photons
Thrust: high level C++ access to CUDA
OptiX : single-ray programming model -> line-by-line translation
Deviation angle(degrees) of 1M parallel monochromatic photons in disc shaped beam incident on water sphere. Numbered bands are visible range expectations of first 11 rainbows. S-Polarized intersection (E field perpendicular to plane of incidence) arranged by directing polarization radially.
After CAlignEngine::SetSequenceIndex(int index) : subsequent G4UniformRand() give randoms from sequence index
// EngineMinimalTest.cc : demonstrate G4UniformRand control ##include "Randomize.hh" $ EngineMinimalTest 0.13049 struct MyEngine : public CLHEP::MixMaxRng // MixMax is default 0.617751 { 0.995947 double flat(){ return .42 ; } 0.495902 }; 0.112917 0.289871 int main(int argc, char** argv) 0.473044 { 0.837619 if(argc > 1) 0.359356 CLHEP::HepRandom::setTheEngine(new MyEngine()); 0.926938 for( int i=0 ; i < 10 ; i++) $ EngineMinimalTest 1 std::cout << G4UniformRand() << std::endl ; 0.42 0.42 return 0 ; 0.42 } ...
https://bitbucket.org/simoncblyth/opticks/src/default/cfg4/CAlignEngine.cc
lldb python scripting
Auto-configure breakpoints using code markers:
opticks/tools/autobreakpoint.py opticks/tools/g4lldb.py
Ubiquitous Data access with NPY
All Opticks data managed in NumPy buffers, easy access from python,C++,CUDA,lldb-python
Aligned zipping together of code and RNG values
Single executable lldb OKG4Test:
Payoff : simplest possible direct comparison validation
http://bitbucket.com/simoncblyth/opticks/src/tip/tools/autobreakpoint.py
(lldb) help breakpoint command add
CPU/GPU matching
tboolean-box simple geometry test
In [11]: pdv = np.where(dv > 0.0001)[0] In [12]: ab.dumpline(pdv) 0 1230 : TO BR SC BT BR BT SA 1 2413 : TO BT BT SC BT BR BR BT SA 2 9041 : TO BT SC BR BR BR BR BT SA 3 14510 : TO SC BT BR BR BT SA 4 14747 : TO BT SC BR BR BR BR BR BR BR 5 14747 : TO BT SC BR BR BR BR BR BR BR ... In [20]: ab.b.ox[pdv,0] In [21]: ab.a.ox[pdv,0] Out[20]: Out[21]: A()sliced A()sliced A([ [-191.6262, -240.3634, 450. , 5.566 ], A([ [-191.626 , -240.3634, 450. , 5.566 ], [ 185.7708, -133.8457, 450. , 7.3141], [ 185.7708, -133.8456, 450. , 7.3141], [-450. , -104.4142, 311.143 , 9.0581], [-450. , -104.4142, 311.1431, 9.0581], [ 83.6955, 208.9171, -450. , 5.6188], [ 83.6954, 208.9172, -450. , 5.6188], [ 32.8972, 150. , 24.9922, 7.6757], [ 32.8973, 150. , 24.992 , 7.6757], [ 32.8972, 150. , 24.9922, 7.6757], [ 32.8973, 150. , 24.992 , 7.6757], [ 450. , -186.7449, 310.6051, 5.0707], [ 450. , -186.7451, 310.605 , 5.0707], [ 299.2227, 318.1443, -450. , 4.8717], [ 299.2229, 318.144 , -450. , 4.8717], ...
Cylinder - Cone
Coincident endcaps -> spurious intersects
Grow subtracted cone downwards, avoids coincidence : does not change composite solid
Coincidences common (alignment too tempting?). To fix:
Highlights
Opticks enables Geant4 based simulations to benefit from effectively zero time and zero CPU memory optical photon simulation, due to massive parallelism made accessible by NVIDIA OptiX.
- Drastic speedup -> better detector understanding -> greater precision
- Performance discontinuity -> new possibilities -> imagination required
Subscribe to stay informed on Opticks:
Introductions
Open Source Opticks
Documentation, install instructions. Repository.
Geometry/event data use NumPy serialization:
import numpy as np a = np.load("photons.npy")
(*) Windows VS2015, non-CUDA only, not recently!
Strided is better than scattered. Contiguous is better than strided. Descriptive is better than imperative[1] (e.g. data-types). Array-orientated is better than object-oriented. Broadcasting is a great idea -- use where possible! Vectorized is better than an explicit loop. Unless its complicated -- then use Cython or numexpr. Think in higher dimensions.
My take : best tool depends on nature of data
[1] imperative means step by step how to do something
[2] but not so large that has trouble fitting in memory, np.memmap is possible but better to avoid for simplicity
In [1]: a = np.arange(10) # array of 10 ints : 64 bit, 8 bytes each In [2]: np.save("a.npy", a ) # persist the array : serializing it into a file In [3]: a2 = np.load("a.npy") # load array from file into memory In [4]: assert np.all( a == a2 ) # check all elements the same In [5]: !xxd a.npy # run xxd in shell to hexdump the byte contents of the file 00000000: 934e 554d 5059 0100 7600 7b27 6465 7363 .NUMPY..v.{ desc 00000010: 7227 3a20 273c 6938 272c 2027 666f 7274 r': '<i8', 'fort 00000020: 7261 6e5f 6f72 6465 7227 3a20 4661 6c73 ran_order : Fals 00000030: 652c 2027 7368 6170 6527 3a20 2831 302c e, 'shape': (10, # minimal metadata : type, shape 00000040: 292c 207d 2020 2020 2020 2020 2020 2020 ), } 00000050: 2020 2020 2020 2020 2020 2020 2020 2020 00000060: 2020 2020 2020 2020 2020 2020 2020 2020 00000070: 2020 2020 2020 2020 2020 2020 2020 200a . # 128 bytes of header 00000080: 0000 0000 0000 0000 0100 0000 0000 0000 ................ 00000090: 0200 0000 0000 0000 0300 0000 0000 0000 ................ 000000a0: 0400 0000 0000 0000 0500 0000 0000 0000 ................ # data buffer 000000b0: 0600 0000 0000 0000 0700 0000 0000 0000 ................ 000000c0: 0800 0000 0000 0000 0900 0000 0000 0000 ................ In [6]: !ls -l a.npy # small 128 byte header + (8 bytes per integer)*10 = 208 bytes total -rw-r--r-- 1 blyth staff 208 Sep 13 11:01 a.npy In [7]: a Out[7]: array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9]) In [8]: a.shape Out[8]: (10,)
One possibility for compression with blosc http://bcolz.blosc.org/en/latest/intro.html