numbers.md

<pre>
------------------------------------------------
V100
------------------------------------------------

7 TFlop/s double
14 TFlop/s single
32 GB/sec PCIe
300 GB/sec NVLINK

------------------------------------------------
1x1x4 chiral
------------------------------------------------

input 1058444 particles
mps: 254 srt: 625 frc: 3521 int: 106 cpy: 2424 out: 2.8e+05 (us)

------------------------------------------------
coupon
------------------------------------------------

input 986154 particles
mps: 277 srt: 289 frc: 3242 int: 91 cpy: 2251 out: 2.72e+05 (us)

------------------------------------------------
simulate
------------------------------------------------

14 TFlop/s / 277 Mpps = 50541 ops/particle

------------------------------------------------
integrate
------------------------------------------------

986154 particles / 91e-6 s = 10.8 G particles/s
14 TFlop/s / (986154 particles / 91e-6 s) = 1291 ops/integrate

------------------------------------------------
copy
------------------------------------------------

986154 particles * 7 arrays * 4 bytes / 2251 us
= 12267 bytes/us = 12.3 GB/s

------------------------------------------------
GPU_check
------------------------------------------------

peer access:
   from 0 to 1: yes
   from 0 to 2: yes
   from 0 to 3: yes
   from 0 to 4: yes
   from 0 to 5: no
   from 0 to 6: no
   from 0 to 7: no
   from 1 to 0: yes
   from 1 to 2: yes
   from 1 to 3: yes
   from 1 to 4: no
   from 1 to 5: yes
   from 1 to 6: no
   from 1 to 7: no
   from 2 to 0: yes
   from 2 to 1: yes
   from 2 to 3: yes
   from 2 to 4: no
   from 2 to 5: no
   from 2 to 6: yes
   from 2 to 7: no
   from 3 to 0: yes
   from 3 to 1: yes
   from 3 to 2: yes
   from 3 to 4: no
   from 3 to 5: no
   from 3 to 6: no
   from 3 to 7: yes
   from 4 to 0: yes
   from 4 to 1: no
   from 4 to 2: no
   from 4 to 3: no
   from 4 to 5: yes
   from 4 to 6: yes
   from 4 to 7: yes
   from 5 to 0: no
   from 5 to 1: yes
   from 5 to 2: no
   from 5 to 3: no
   from 5 to 4: yes
   from 5 to 6: yes
   from 5 to 7: yes
   from 6 to 0: no
   from 6 to 1: no
   from 6 to 2: yes
   from 6 to 3: no
   from 6 to 4: yes
   from 6 to 5: yes
   from 6 to 7: yes
   from 7 to 0: no
   from 7 to 1: no
   from 7 to 2: no
   from 7 to 3: yes
   from 7 to 4: yes
   from 7 to 5: yes
   from 7 to 6: yes
GPUs:
   number: 0
      name: Tesla V100-SXM2-16GB
      global memory: 16945512448
      max grid size: 2147483647
      max threads per block: 1024
      max threads dimension: 1024
      multiprocessor count: 80
      max threads per multiprocessor: 2048
   number: 1
      name: Tesla V100-SXM2-16GB
      global memory: 16945512448
      max grid size: 2147483647
      max threads per block: 1024
      max threads dimension: 1024
      multiprocessor count: 80
      max threads per multiprocessor: 2048
   number: 2
      name: Tesla V100-SXM2-16GB
      global memory: 16945512448
      max grid size: 2147483647
      max threads per block: 1024
      max threads dimension: 1024
      multiprocessor count: 80
      max threads per multiprocessor: 2048
   number: 3
      name: Tesla V100-SXM2-16GB
      global memory: 16945512448
      max grid size: 2147483647
      max threads per block: 1024
      max threads dimension: 1024
      multiprocessor count: 80
      max threads per multiprocessor: 2048
   number: 4
      name: Tesla V100-SXM2-16GB
      global memory: 16945512448
      max grid size: 2147483647
      max threads per block: 1024
      max threads dimension: 1024
      multiprocessor count: 80
      max threads per multiprocessor: 2048
   number: 5
      name: Tesla V100-SXM2-16GB
      global memory: 16945512448
      max grid size: 2147483647
      max threads per block: 1024
      max threads dimension: 1024
      multiprocessor count: 80
      max threads per multiprocessor: 2048
   number: 6
      name: Tesla V100-SXM2-16GB
      global memory: 16945512448
      max grid size: 2147483647
      max threads per block: 1024
      max threads dimension: 1024
      multiprocessor count: 80
      max threads per multiprocessor: 2048
   number: 7
      name: Tesla V100-SXM2-16GB
      global memory: 16945512448
      max grid size: 2147483647
      max threads per block: 1024
      max threads dimension: 1024
      multiprocessor count: 80
      max threads per multiprocessor: 2048
copy 5000000 floats from CPU to GPU
   6854.000000 us, 2.918e+09 B/s
   1846.000000 us, 1.08342e+10 B/s pinned
copy 5000000 floats from GPU to GPU 0:
   GPU 1: 849.000000 us, 2.35571e+10 B/s
   GPU 2: 851.000000 us, 2.35018e+10 B/s
   GPU 3: 443.000000 us, 4.51467e+10 B/s
   GPU 4: 444.000000 us, 4.5045e+10 B/s
   GPU 5: 2051.000000 us, 9.75134e+09 B/s
   GPU 6: 2026.000000 us, 9.87167e+09 B/s
   GPU 7: 2017.000000 us, 9.91572e+09 B/s
add 5000000x5000000 floats:
   3.747896 s, 6670.409500 G/s
peer add 5000000x5000000 GPU 0 floats:
   GPU 1: 3.755309 s, 6657.241500 G/s
   GPU 2: 3.731777 s, 6699.221000 G/s
   GPU 3: 3.735422 s, 6692.684500 G/s
   GPU 4: 3.736765 s, 6690.278500 G/s
parallel peer add 5000000x5000000 GPU 0 floats:
   5 GPUS: 0.783563 s, 19940.962000 G/s

------------------------------------------------
CPU_check
------------------------------------------------

processor	: 95
vendor_id	: GenuineIntel
cpu family	: 6
model		: 85
model name	: Intel(R) Xeon(R) Platinum 8275CL CPU @ 3.00GHz
stepping	: 7
microcode	: 0x5002f01
cpu MHz		: 1221.613
cache size	: 36608 KB
physical id	: 1
siblings	: 48
core id		: 23
cpu cores	: 24
apicid		: 111
initial apicid	: 111
fpu		: yes
fpu_exception	: yes
cpuid level	: 13
wp		: yes
flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon rep_good nopl xtopology nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq monitor ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single pti fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid mpx avx512f avx512dq rdseed adx smap clflushopt clwb avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves ida arat pku ospke avx512_vnni
bugs		: cpu_meltdown spectre_v1 spectre_v2 spec_store_bypass l1tf mds swapgs itlb_multihit
bogomips	: 5999.99
clflush size	: 64
cache_alignment	: 64
address sizes	: 46 bits physical, 48 bits virtual
power management:

add 500000x500000 floats with 1 thread:
   65.058972 s, 3.842667 G/s
add 500000x500000 floats with 96 threads:
   1.376176 s, 181.662813 G/s

------------------------------------------------
pipe_check
------------------------------------------------

send: 100000000 points
receive: 4.301932 s, 9.29815e+07 B/s

</pre>
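
The derived figures in the simulate, integrate, and copy sections can be rechecked from the raw timings. The following is a minimal Python sketch, not part of the benchmark code itself. The helper names (`ops_per_item`, `rate`, `bandwidth`) are hypothetical, and the 4 bytes per element assumes 32-bit floats, as in the copy calculation above.

```python
# Recompute the derived numbers from the raw log values.
# All inputs are copied from the notes; helper names are illustrative only.

PEAK_SINGLE_FLOPS = 14e12   # "14 TFlop single" (V100 peak, from the notes)
BYTES_PER_FLOAT = 4         # assumption: 32-bit floats


def ops_per_item(peak_flops, items_per_second):
    """Peak floating-point operations available per processed item."""
    return peak_flops / items_per_second


def rate(count, seconds):
    """Items processed per second."""
    return count / seconds


def bandwidth(n_bytes, microseconds):
    """Bytes per second for a transfer timed in microseconds."""
    return n_bytes / (microseconds * 1e-6)


# Values from the "coupon" run.
particles = 986154
integrate_us = 91
copy_us = 2251

# simulate: peak ops available per particle at 277 Mpps
simulate_ops = ops_per_item(PEAK_SINGLE_FLOPS, 277e6)

# integrate: particles per second, then peak ops available per integration
integrate_rate = rate(particles, integrate_us * 1e-6)
integrate_ops = ops_per_item(PEAK_SINGLE_FLOPS, integrate_rate)

# copy: 7 float arrays, one element per particle
copy_bw = bandwidth(particles * 7 * BYTES_PER_FLOAT, copy_us)

# GPU_check: pinned CPU-to-GPU copy of 5000000 floats in 1846 us
pinned_bw = bandwidth(5_000_000 * BYTES_PER_FLOAT, 1846)

print(f"simulate:  {simulate_ops:.0f} ops/particle")
print(f"integrate: {integrate_rate / 1e9:.1f} G particles/s, "
      f"{integrate_ops:.0f} ops/integrate")
print(f"copy:      {copy_bw / 1e9:.1f} GB/s")
print(f"pinned:    {pinned_bw:.5e} B/s")
```

The same `bandwidth` helper applies to the GPU-to-GPU copy timings, since each line there is also 5000000 floats over a time in microseconds.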