<pre>
------------------------------------------------
V100
------------------------------------------------
7 TFlop double
14 TFlop single
32 GB/sec PCIe
300 GB/sec NVLINK
------------------------------------------------
1x1x4 chiral
------------------------------------------------
input 1058444 particles
mps: 254 srt: 625 frc: 3521 int: 106 cpy: 2424 out: 2.8e+05 (us)
------------------------------------------------
coupon
------------------------------------------------
input 986154 particles
mps: 277 srt: 289 frc: 3242 int: 91 cpy: 2251 out: 2.72e+05 (us)
------------------------------------------------
simulate
------------------------------------------------
14 TFlop / 277 Mpps = 50541 ops/particle
------------------------------------------------
integrate
------------------------------------------------
986154 particles / 91e-6 s = 10.8 G/s
14 TFlop / (986154 particles / 91e-6 s) = 1291 ops/integrate
------------------------------------------------
copy
------------------------------------------------
986154 particles * 7 arrays * 4 bytes / 2251 us
= 12266 bytes/us = 12.3 GB/s
------------------------------------------------
GPU_check
------------------------------------------------
peer access:
from 0 to 1: yes
from 0 to 2: yes
from 0 to 3: yes
from 0 to 4: yes
from 0 to 5: no
from 0 to 6: no
from 0 to 7: no
from 1 to 0: yes
from 1 to 2: yes
from 1 to 3: yes
from 1 to 4: no
from 1 to 5: yes
from 1 to 6: no
from 1 to 7: no
from 2 to 0: yes
from 2 to 1: yes
from 2 to 3: yes
from 2 to 4: no
from 2 to 5: no
from 2 to 6: yes
from 2 to 7: no
from 3 to 0: yes
from 3 to 1: yes
from 3 to 2: yes
from 3 to 4: no
from 3 to 5: no
from 3 to 6: no
from 3 to 7: yes
from 4 to 0: yes
from 4 to 1: no
from 4 to 2: no
from 4 to 3: no
from 4 to 5: yes
from 4 to 6: yes
from 4 to 7: yes
from 5 to 0: no
from 5 to 1: yes
from 5 to 2: no
from 5 to 3: no
from 5 to 4: yes
from 5 to 6: yes
from 5 to 7: yes
from 6 to 0: no
from 6 to 1: no
from 6 to 2: yes
from 6 to 3: no
from 6 to 4: yes
from 6 to 5: yes
from 6 to 7: yes
from 7 to 0: no
from 7 to 1: no
from 7 to 2: no
from 7 to 3: yes
from 7 to 4: yes
from 7 to 5: yes
from 7 to 6: yes
GPUs:
number: 0
name: Tesla V100-SXM2-16GB
global memory: 16945512448
max grid size: 2147483647
max threads per block: 1024
max threads dimension: 1024
multiprocessor count: 80
max threads per multiprocessor: 2048
number: 1
name: Tesla V100-SXM2-16GB
global memory: 16945512448
max grid size: 2147483647
max threads per block: 1024
max threads dimension: 1024
multiprocessor count: 80
max threads per multiprocessor: 2048
number: 2
name: Tesla V100-SXM2-16GB
global memory: 16945512448
max grid size: 2147483647
max threads per block: 1024
max threads dimension: 1024
multiprocessor count: 80
max threads per multiprocessor: 2048
number: 3
name: Tesla V100-SXM2-16GB
global memory: 16945512448
max grid size: 2147483647
max threads per block: 1024
max threads dimension: 1024
multiprocessor count: 80
max threads per multiprocessor: 2048
number: 4
name: Tesla V100-SXM2-16GB
global memory: 16945512448
max grid size: 2147483647
max threads per block: 1024
max threads dimension: 1024
multiprocessor count: 80
max threads per multiprocessor: 2048
number: 5
name: Tesla V100-SXM2-16GB
global memory: 16945512448
max grid size: 2147483647
max threads per block: 1024
max threads dimension: 1024
multiprocessor count: 80
max threads per multiprocessor: 2048
number: 6
name: Tesla V100-SXM2-16GB
global memory: 16945512448
max grid size: 2147483647
max threads per block: 1024
max threads dimension: 1024
multiprocessor count: 80
max threads per multiprocessor: 2048
number: 7
name: Tesla V100-SXM2-16GB
global memory: 16945512448
max grid size: 2147483647
max threads per block: 1024
max threads dimension: 1024
multiprocessor count: 80
max threads per multiprocessor: 2048
copy 5000000 floats from CPU to GPU
6854.000000 us, 2.918e+09 B/s
1846.000000 us, 1.08342e+10 B/s pinned
copy 5000000 floats from GPU to GPU 0:
GPU 1: 849.000000 us, 2.35571e+10 B/s
GPU 2: 851.000000 us, 2.35018e+10 B/s
GPU 3: 443.000000 us, 4.51467e+10 B/s
GPU 4: 444.000000 us, 4.5045e+10 B/s
GPU 5: 2051.000000 us, 9.75134e+09 B/s
GPU 6: 2026.000000 us, 9.87167e+09 B/s
GPU 7: 2017.000000 us, 9.91572e+09 B/s
add 5000000x5000000 floats:
3.747896 s, 6670.409500 G/s
peer add 5000000x5000000 GPU 0 floats:
GPU 1: 3.755309 s, 6657.241500 G/s
GPU 2: 3.731777 s, 6699.221000 G/s
GPU 3: 3.735422 s, 6692.684500 G/s
GPU 4: 3.736765 s, 6690.278500 G/s
parallel peer add 5000000x5000000 GPU 0 floats:
5 GPUS: 0.783563 s, 19940.962000 G/s
------------------------------------------------
CPU_check
------------------------------------------------
processor : 95
vendor_id : GenuineIntel
cpu family : 6
model : 85
model name : Intel(R) Xeon(R) Platinum 8275CL CPU @ 3.00GHz
stepping : 7
microcode : 0x5002f01
cpu MHz : 1221.613
cache size : 36608 KB
physical id : 1
siblings : 48
core id : 23
cpu cores : 24
apicid : 111
initial apicid : 111
fpu : yes
fpu_exception : yes
cpuid level : 13
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon rep_good nopl xtopology nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq monitor ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single pti fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid mpx avx512f avx512dq rdseed adx smap clflushopt clwb avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves ida arat pku ospke avx512_vnni
bugs : cpu_meltdown spectre_v1 spectre_v2 spec_store_bypass l1tf mds swapgs itlb_multihit
bogomips : 5999.99
clflush size : 64
cache_alignment : 64
address sizes : 46 bits physical, 48 bits virtual
power management:
add 500000x500000 floats with 1 thread:
65.058972 s, 3.842667 G/s
add 500000x500000 floats with 96 threads:
1.376176 s, 181.662813 G/s
------------------------------------------------
pipe_check
------------------------------------------------
send: 100000000 points
receive: 4.301932 s, 9.29815e+07 B/s
</pre>
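The back-of-envelope figures in the simulate, integrate, and copy sections, and the bandwidths in the CPU-to-GPU copy test, can be reproduced with a short script. This is a minimal sketch, not part of the benchmark itself. It assumes that the `mps` column means million particles per second and that the `int` and `cpy` columns are times in microseconds. The helper `rate_per_second` and all variable names are hypothetical.

```python
# Reproduce the derived throughput figures from the "coupon" run.
# Assumptions (not stated in the log): "mps" is million particles per
# second; "int" and "cpy" are elapsed times in microseconds.

PEAK_SINGLE_FLOPS = 14e12   # V100 single-precision peak quoted in the notes

particles = 986154          # coupon input size
mpps = 277e6                # "mps" column, particles per second
int_us = 91                 # "int" column, microseconds
cpy_us = 2251               # "cpy" column, microseconds


def rate_per_second(count, microseconds):
    """Items (or bytes) per second for `count` items handled in `microseconds`."""
    return count / (microseconds * 1e-6)


# simulate: peak operations available per particle update
ops_per_particle = PEAK_SINGLE_FLOPS / mpps

# integrate: particles integrated per second, and peak operations per integration
integrate_rate = rate_per_second(particles, int_us)
ops_per_integrate = PEAK_SINGLE_FLOPS / integrate_rate

# copy: 7 float arrays of 4-byte values moved in cpy_us
copy_bytes = particles * 7 * 4
copy_bandwidth = rate_per_second(copy_bytes, cpy_us)

# GPU_check: 5000000 floats copied from CPU to GPU, pageable and pinned
memcpy_bytes = 5_000_000 * 4
pageable_bandwidth = rate_per_second(memcpy_bytes, 6854)
pinned_bandwidth = rate_per_second(memcpy_bytes, 1846)

print(f"simulate:  {ops_per_particle:.1f} ops/particle")
print(f"integrate: {integrate_rate / 1e9:.2f} G/s, {ops_per_integrate:.1f} ops/integrate")
print(f"copy:      {copy_bandwidth / 1e9:.2f} GB/s")
print(f"memcpy:    {pageable_bandwidth:.4g} B/s pageable, {pinned_bandwidth:.4g} B/s pinned")
```

Comparing the copy figure with the pinned CPU-to-GPU bandwidth gives a rough sense of how close the per-step copy comes to what the link sustained in the isolated test.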