Katanga (release name maybe Screen3d, Scr33d, umm Helix TV) is the virtual 3D screen.
For any given 3D game running in NVidia 3D Vision or SBS mode, we take the bits and put
them into the screen Quad in the Unity environment.
The code injected into the game must be as fast as possible, to avoid
CPU hits that can be very costly.
The Unity display needs to run at a minimum of 90Hz to avoid any VR lag.
This can be a simple scene, and will run on different CPU cores.
Sync problems will not exist, because we'll copy bits from the game to IPC, which
will then be drawn onto the Quad. Since the headset only draws every 1/90th of a second,
there should be no tearing, because it will always fetch the newest image.
The Katanga project is multi-component.
1) The Unity app itself, which draws in the VR headset.
2) The UnityNativePlugin, which is native C code called from Unity.
3) The destination Deviare plugin, which will be injected into the game.
The Unity app is x64 only, because there is no point in a x32 version, since VR
requires x64. The Deviare plugin will be x32 and x64 depending upon the game.
DX9 and DX11 and OpenGL should all be supported with the Deviare Plugin.
Trying to decide whether to use Deviare2 itself, or only the in-proc version.
Not completely clear what would be best, and Deviare2 is fairly confusing and
poorly documented. Lots of funny pieces that are not clear, like their Agent and
active plugins. Plugins can be hook-specific and native.
The code for the Unity project itself must include a C++ chunk, because we need
to call the OpenVR code, which is native C++. Maybe this is handled by the Unity
plugin code, as we enable VR and it activates, but we still need access to the
buffer for the quad/screen.
We don't need to hook remotely, hooking from inside the running game is OK. And we
really only need the Present() call. There also does not seem to be a particularly
good callback mechanism. They have OnFunctionCalled, but that is before we finish
copying the backbuffer, and we need a more direct IPC of some form.
Probably we need to use memory mapped file as the IPC, so that the backbuffer copy to
the Unity app is fast. We can also use that for notification of new data.
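A minimal sketch of that memory-mapped-file idea, in C++. The mapping name, fixed size,
and layout here are illustrative assumptions, not a final design:

// Hedged sketch: shared memory for the backbuffer bits, plus a frame
// counter the reader can poll to notice new data.
#include <windows.h>
#include <cstdint>

struct SharedFrame {
    volatile LONG frameCount;              // bumped by the game after each copy
    uint8_t       pixels[1920 * 1080 * 4]; // illustrative fixed 1080p RGBA size
};

SharedFrame* OpenSharedFrame()
{
    // Same name in both processes opens the same mapping; backed by the
    // page file, so no disk I/O is involved.
    HANDLE mapping = CreateFileMappingW(INVALID_HANDLE_VALUE, nullptr,
                                        PAGE_READWRITE, 0, sizeof(SharedFrame),
                                        L"Local\\KatangaFrameShare");
    if (!mapping)
        return nullptr;
    return static_cast<SharedFrame*>(
        MapViewOfFile(mapping, FILE_MAP_ALL_ACCESS, 0, 0, sizeof(SharedFrame)));
}

// Game side, after copying the backbuffer in:
//     InterlockedIncrement(&shared->frameCount);
// Unity side: re-upload the quad texture whenever frameCount changes.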
So, all in all, doesn't seem worth using Deviare over in-proc, but maybe.
Back with more thoughts. Given that the Quad we are drawing into is managed by Unity,
it seems to push toward using Deviare2 directly, not in-proc. Unity is all C#, and
this is harder to do in a native plugin.
So, for starting at least, current plan is to not use native plugin, and do all the
work in C#, including Deviare, and hooking the Present call.
The actual code for the game itself will be a Deviare native plugin in C++, that will
do the work of fetching the backbuffer and copying it for the Unity app to use. That
code needs to be as fast as possible, to avoid impacting the game's CPU use.
Starting down this path, ran into some weird Unity problems. Was getting a missing-file
error for the use of Nektra.Deviare2.dll: a missing stdole file at version 1.1.
Not sure I understood all that, because the file exists in the right spots on Win10, but
was not being seen. Making a copy from the interop files in Program Files(x86) and
dropping that into the Unity root solved that error. Doesn't seem related to
Nektra, seems related to Unity.
Using the Nektra.Deviare2.dll from the 2.8.3 binary release. I was able to build my own
Nektra libs with the params changed away from XP support, but all that seems unnecessary.
Also required: regsvr32 DeviareCOM64.dll. Unity runs in x64 now, and the creation of
the NktSpyMgr was crashing with a COM exception because it was not registered. This is
possibly problematic, because we'll need this registration on target systems.
With those in place, the Unity C# code calling to NktSpyMgr actually works.
Right now it's set up for VS2017; can move back to VS2013 if that seems better. Right at
the moment, I'm going to keep it this way, because the Unity->VS debugging is already
set up and working for breakpoints.
Deviare CTest is a good example of similar behavior. It uses the built-in Agent, not needing
the custom plugin until performance is needed. Patches all hooks, works in VS, including
debugging through their code.
Requires several pieces that were not obvious: both Deviaredb and db64; DeviareCOM64.dll
because it's a 64-bit app; dvagent.dll for x32 apps, dvagent64.dll for x64.
10-9-17
Mostly working in a basic format. All hooking is working into game.
The app to be built in Unity needs to be DX11 target and x64 only. Oculus Rift requires that,
so there is no point in making any conversions to other APIs.
Basic strategy will be to create our game snapshot texture as a DX9Ex Texture2D, because
that can share with DX11 easily, and also with DX9 easily. Works both ways, so a DX9 game
works and a DX11 game works. Destination is a DX11 Texture2D in Unity.
Testing a bare minimum Unity project seemed to require 10% of the GTX 980 GPU, when in
medium quality mode. On High quality, it was 22%. There is some complexity in terms of
vsync and driver defaults, so the best bet will be to profile it when ready.
Sticking with the DX9 hooking for now, because it's already working, and needs to be done
in any case. DX11 is more familiar, but requires a DXGI::Present hook instead. However,
the GameSurface will be a DX9Ex variant, in order to be shareable.
The destination copy texture must be a RenderTarget, because of driver restrictions. It
cannot be a simple surface. Also, for sharing, it needs to be a RenderTarget because the
backbuffer is always a RenderTarget and they must match in order to share.
https://msdn.microsoft.com/en-us/library/windows/desktop/bb174471(v=vs.85).aspx
10-17-17
OK, turns out that the MSDN docs half-lie, and it is not possible to use surface
sharing on a plain DX9 device. Only DX9Ex devices will work. Confirmation here:
https://www.gamedev.net/forums/topic/638495-shared-resources-eg-textures-between-devicesthreads/
And, also confirmed using the DX9 debug layer on Win7 setup. When trying to use any
variant of DX9 and a surface return handle, it would put up the error:
Direct3D9: (ERROR) :Device is not capable of sharing resource. CreateTexture fails.
Using the sample program, I confirmed that the DX9Ex path works without any
error or notice from the Debug layer. Can also use CreateRenderTarget, no Ex, but that
seems of little value. The errors looked like a bad parameter to CreateRenderTarget,
but in fact it was the wrong device.
This also points out a really important conclusion. The DX9 and DX9Ex objects are
not in fact the same. Coercing from an Ex object will work, because it's a superset,
but going from DX9 to DX9Ex does not work, because the fundamental object is different.
Also confirmed by using QueryInterface, and getting E_NOINTERFACE.
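A quick sketch of that check; pDev9 stands for whatever device the game handed us:

// Hedged sketch. A plain DX9 device refuses the Ex interface outright,
// which confirms the two object types are genuinely different.
IDirect3DDevice9Ex* pDevEx = nullptr;
HRESULT hr = pDev9->QueryInterface(IID_IDirect3DDevice9Ex, (void**)&pDevEx);
// hr == E_NOINTERFACE when pDev9 came from a plain DX9 factory; S_OK when
// it was created through the Ex path, since Ex is a superset.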
10-19-17
Finally figured out what the crash was about. Fuck! I say. Fucking Microsoft has a
bad d3d9.h header file, and it's been bad for over 5 years. The C interface section for
IDirect3D9Ex is missing one routine declaration, RegisterSoftwareDevice. That makes the
C vtable for IDirect3D9Ex off by one, so it calls the wrong routine, like
GetAdapterMonitor instead of CreateDeviceEx. Fuck! Cost me two weeks on this one.
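Not the actual header text, but a generic illustration of why one missing entry in a
C-style COM vtable breaks every later call:

// Illustration only, not d3d9.h. A C-interface COM object is a struct of
// function pointers in a fixed order; omit one declaration and every entry
// after it shifts up a slot, so a call lands on the wrong routine.
struct IExampleVtbl {
    HRESULT (*QueryInterface)(void* self, REFIID riid, void** ppv);
    ULONG   (*AddRef)(void* self);
    ULONG   (*Release)(void* self);
    /* HRESULT (*RegisterSoftwareDevice)(void* self, void* pInit); missing */
    HRESULT (*CreateDeviceEx)(void* self /*...*/);  // occupies the slot above,
                                                    // so it invokes a neighbor
};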
10-20-17
Experimenting with a Console app, which just does CreateDX9 then CreateDevice, to see
what all needs to be an Ex object. When CreateDeviceEx is used, we end up getting
a debug layer break in CreateTexture saying the pool cannot be Managed. Which leads
down a path of tweaking parameters.
However, when trying this to create a shared surface:
hr = pDev9->CreateRenderTarget(1280, 720, D3DFMT_A8R8G8B8, D3DMULTISAMPLE_NONE, 0, false,
&gGameSurface, &gGameSurfaceShare);
we get error:
Direct3D9: (ERROR) :Device is not capable of sharing resource. CreateRenderTarget/CreateDepthStencil fails.
Clearly the runtime requires an IDirect3DDevice9Ex.
And, since you cannot create an IDirect3DDevice9Ex with a DX9 factory, that means the
top-level Direct3DCreate9 must be the Ex variant as well.
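A sketch of the implied all-Ex chain (error handling elided; hWnd and presentParams
are assumed to already exist):

// Hedged sketch: factory, device, and shared render target all on the Ex path.
IDirect3D9Ex*       pDX9Ex = nullptr;
IDirect3DDevice9Ex* pDevEx = nullptr;
IDirect3DSurface9*  gGameSurface = nullptr;
HANDLE              gGameSurfaceShare = nullptr;   // handle handed to the VR app

Direct3DCreate9Ex(D3D_SDK_VERSION, &pDX9Ex);
pDX9Ex->CreateDeviceEx(D3DADAPTER_DEFAULT, D3DDEVTYPE_HAL, hWnd,
                       D3DCREATE_HARDWARE_VERTEXPROCESSING,
                       &presentParams, nullptr, &pDevEx);

// Same call as the earlier snippet, but on an Ex device it succeeds and
// fills in the share handle.
pDevEx->CreateRenderTarget(1280, 720, D3DFMT_A8R8G8B8, D3DMULTISAMPLE_NONE,
                           0, FALSE, &gGameSurface, &gGameSurfaceShare);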
And, in the test app, if I do StretchRect from dev9 to dev9Ex, I get:
Direct3D9: (ERROR) :DstSurface was not allocated with this Device. StretchRect fails.
Which clearly indicates that we cannot go cross-device. If the source object is the
dev9Ex one, then it just crashes; the debug layer is probably lame for Ex.
10-24-17
Getting closer. With it now successfully creating and copying into a shared surface,
we now need to display this in the Unity app. The access to the surface there is not
terrific, and we also need to switch from DX9 to DX11 for the Unity display,
so the plan is to create a C++ unit for the Unity side, and use their plugin model to
get access to C++ in this app as well. Mostly this just needs to use the HANDLE to
the shared surface, and copy the bits into the in-Unity TV screen.
Some questions about what needs to be write-only for performance, and how to sync
the copies, but let's get it limping first.
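The consuming end in the Unity native plugin then looks roughly like this (hedged
sketch; the HANDLE arrives from the game over IPC, names illustrative):

// Hedged sketch: DX11 side of the share. The handle comes from the game's
// CreateRenderTarget call.
ID3D11Texture2D* OpenGameTexture(ID3D11Device* device, HANDLE sharedHandle)
{
    ID3D11Texture2D* tex = nullptr;
    HRESULT hr = device->OpenSharedResource(sharedHandle,
                                            __uuidof(ID3D11Texture2D),
                                            (void**)&tex);
    return SUCCEEDED(hr) ? tex : nullptr;
}
// Each frame, ID3D11DeviceContext::CopyResource can then move the bits
// into the texture behind the in-Unity TV screen.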
At present using the simple DX9 test app of Textures, because when I launch TheBall,
it fires some debug layer problems, like CreateTexture needing to use only the Default pool.
Probably will need to hook and patch up some of these, since we are putting the game
into a DX9Ex runtime, which changes some usage.
Restructuring the project folder layout, because we need to have the next piece of
the UnityNativePlugin for C++ access. Also, this layout was never right; Unity
rebuilds the .sln file for their piece at every open, so it's better if it's by
itself and used as a subproject of a main project. Doing this piecemeal to avoid
breaking the repo.
11-8-17
Bit of a ballbuster here, with Deviare crashing at launch instead of returning an error.
This was happening on a full build, instead of while running in Unity. The problem wound
up being that the DeviarePlugin DLL path changes from Unity to the compiled app, so
yeah, thanks Unity for that kick in the nuts too. There is no /Assets when the app is
compiled, which is fucking ridiculous.
Then there is this crash in Deviare, which happens if the DLL is not found. This is also
completely lame. If it can't find it, it should return an error, not blow up. If you see
this crash, it's because of the missing deviare plugin dll.
Crash:
> 00000001() Unknown
[Frames below may be incorrect and/or missing]
DvAgent.dll!TNktArrayList<CNktDvParam *,128,TNktArrayListItemRemove_Release<CNktDvParam *> >::RemoveAllElements() Line 330 C++
DvAgent.dll!CNktDvHookEngine::Hook(CNktDvHookEngine::tagHOOKINFO * aHookInfo=0x8007007e, unsigned long nCount=1, int bIsInternal=0) Line 494 C++
DvAgent.dll!CDvAgentMgr::OnEngMsg_AddHook(tagNKT_DV_TMSG_ADDHOOK * lpMsg=0xffe16054, CNktDvTransportBigData * lpConnBigData=0xffe300e0) Line 2522 C++
DvAgent.dll!CDvAgentMgr::TAC_OnEngineMessage(CNktDvTransportAgent * lpTransport=0x77ad4060, tagNKT_DV_TMSG_COMMON * lpMsg=0xffe16054, unsigned long nMsgSize=1084, CNktDvTransportBigData * lpConnBigData=0xffe300e0) Line 699 C++
DvAgent.dll!CNktDvTransportAgent::WorkerThreadProc(unsigned long nIndex=5) Line 564 C++
DvAgent.dll!TNktClassWorkerThread<CNktDvTransportAgent>::ThreadProc() Line 169 C++
DvAgent.dll!thread_start<unsigned int (__stdcall*)(void *)>(void * const parameter=0xffe0418c) Line 115 C++
kernel32.dll!@BaseThreadInitThunk@12() Unknown
ntdll.dll!___RtlUserThreadStart@8() Unknown
ntdll.dll!__RtlUserThreadStart@8() Unknown
Once all that is straight, and we use the Application.dataPath to get the proper path
to the /Assets or /appname_Data folder, then we can properly build the full path to the
DeviarePlugin.dll, and pass it to LoadCustomDLL. Using the forward slashes returned
from Unity seems to work without any trouble.
11-15-17
Performance with OculusVR as the XR Setting type. Headset sensor activated.
Using NVidia Inspector graph. TheBall splash screen. Anaglyph 3D Vision.
Oculus: 41-45% of GPU
OpenVR: 52-55% of GPU
None: 24-26% of GPU
No game running, no SBS script, simplest scene.
Oculus: 13-14% of GPU
OpenVR: 20-21% of GPU
So, the Oculus SDK is clearly superior for performance. The Stats panel in Unity is misleading,
because for Oculus it shows a full frame time of 1/90th second. For OpenVR, it shows 0.5ms for
Unity code. Different ways they handle end of frame.
It's also notable how much impact this has.
For GTX 980, we jump from 25% for the game, to 45% for game+Rift.
LegacyShader/Diffuse, the default, is 2% GPU faster than Standard. 12-13% GPU.
Using sbsShader did not help any.
As another test, to see how much overhead Unity provides, running the Oculus sample
app, TinyRoom, release x64.
Oculus TinyRoom: 12%-13% of GPU.
That app will use Oculus SDK directly, with native DX11 calls. No reason to
think we can ever do better than that.
11-25-17
Seriously hard to get all this working, but success! Now getting stereo bits from the
game, into Unity as a texture. The display texture is side-by-side. Took much time
and debugging to figure out how to get stereo bits out of DX9; no samples, bad docs.
Also, the target surface for StretchRect cannot be shared, or it breaks the stereo.
Every stage of the pipeline has been unbelievably complicated to get working. It's
not a surprise no one else has tried this.
11-28-17
Finally! Got stereo bits from the game, all the way to the headset, and showing in
stereo in the virtual 3D TV. Yes!
Last trick here was in Unity, trying to sweet-talk the VR support into showing my SBS
image as actual stereo, half for each eye. Currently not quite right; it's requiring
Multi-Pass support, and I really want SinglePass Stereo for performance. But, this is
working, with Multi-Pass, and an OnPreRender call for the camera script. That gets
called once for each eye, so I can alternate eyes for the CopyTexture call.
Using CopyTexture seems superior to Blit, because Blit requires a shader to run, and
CopyTexture does something closer to a memcpy. May not matter at all, because
ultimately something probably runs a shader to put bits into the quad, but this might
save a Draw call. Probably not measurable either way.
Performance with a full Release build is good. The game runs well; if it's full screen
it's at a full 60 fps. No F notification in VR, very smooth. 55-70% of GPU in use.
Some stalls during the flyover in game.
11-30-17
And... Winner! Got it to show stereo all the way to the virtual 3D TV. Correct eyes,
changeable depth and convergence like you'd want. Looks good.
The key aspect here was working out the best way to do this. The best way seems to
be to have a custom shader attached to the Quad Material, which does the work of
copying the texture to the actual VR screen and transforming it through the MVP
matrix as the head camera moves. This saves an extra copy of the bits, and that
stage has to run anyway, so modifying that shader seems best. While there,
it has the ability to use the primary input texture of _bothEyes, and decide to
fetch either the left or right half depending upon the unity_StereoEyeIndex variable.
That changes which half of the texture the bits are fetched from, and thus shows the
correct piece to the correct eye.
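The half-texture selection itself boils down to a one-line UV remap. Sketched as plain
C++ math here; the real version is HLSL in the Quad's shader, driven by
unity_StereoEyeIndex:

// Hedged sketch of the SBS eye-selection math. eyeIndex is 0 for the left
// eye, 1 for the right, matching unity_StereoEyeIndex.
float SelectEyeU(float u, int eyeIndex)
{
    // Compress the quad's 0..1 U range into half the texture, then shift
    // to the right half for the right eye.
    return u * 0.5f + eyeIndex * 0.5f;
}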
We specifically do not use Graphics.Blit, because that required a script on the
camera in order to function, and it wasn't clear if it would work to handle both eyes.
It would also be another unnecessary copy.
Similarly, we don't Graphics.CopyTexture to the Quad Material, because there needed
to be two copies, and it had to be done from the camera script as well, where
OnPreRender would get called twice so we could swap eyes. Also a second copy,
although this does work when using Multi-Pass Stereo. In SinglePassStereo, OnPreRender
is only called once, so this won't work for the faster path.
Tried to work out how to use the vrDesc for a RenderTexture, but nothing made
sense, and could not get anything to work. It was documented as automatically
working for built in shaders, but it never seemed to recognize it was a VR texture.
This is also using SinglePassStereo. This is supposed to be quite a lot faster,
and we don't need multipass. Successfully drawing in VR using SinglePassStereo, using
the custom shader.
This is going to be nearly as fast as it is possible to make it. The bits from the
game are copied only twice: once to get the stereo bits cleanly out of the game, then
once again to the shared texture. The shared texture is used directly on the DX11 side,
by the shader, to fetch pixels. So, very few extra copies. Would be nice to lose
that second copy, but copying directly to a shared resource doesn't work on NVidia.
Overall performance is good. GPU usage is fairly low, 38-42% GPU on the splash screen.
There are some F frame rate indicators in VR that are a concern. Probably blocking
VR from multithreaded access causing stalls, but don't really know. Need to look.
This is using 3D Vision to screen, which might be locked to 60Hz, and stall when
lower. We get stalls during the Ball flyover. The Oculus F indicator is up constantly
when the ball is close with the pull effect. With that effect on, GPU usage goes to
70-74%. Idling in-game is 57-60%.
No crashes. Very solid.
12-3-17
Looking at a dropped Compositor problem, where I see C show up periodically. Should
not be happening; we have maybe 50% of GPU headroom. Tried backing out the extra
thread that does the second copy, and that did not help, so the extra thread is not
the problem.
Lots of other tests here, cannot quite pin it down. Almost seems like a false
error, although Oculus debug tends to be very good. No other tool shows a snag
when the ball is transparent up close, where in headset we see flashing F. Tried
VS profile, Unity GPU profile. And multiple paths of dropping pieces of my
pipeline. Only change that mattered was drawing the shared surface, even if bits
were not changing.
If not in the full screen foreground, I get Fs. If it's full screen, but not front
app, still get frames through, but no Fs.
Changed vsync in the game, limiting it to 160 via the .ini. Works. The game runs at
105 fps natively. Still getting Fs.
Turning off shared surface altogether, and setting the Textures to null still gives
me Fs. So even with no share at all between apps, it still stalls. This suggests
that it has something to do with the GPU itself, where the dual GPU processes stall
the pipeline, or can't switch effectively or something.
Definitely seems related to the transparency effect. When I set the ball to
not-transparent, it doesn't F, in windowed mode. Game specific? Might be a pipeline
flush on transparency or something.
After doing quite a bit of analysis, including using GPUView to look at GPU usage while
both apps are running- the problem is that TheBall has a sequence that takes 8ms of time
and locks out anything else from getting GPU time during that sequence. When this overlaps
the Present for Katanga, it stalls and shows a dropped F frame.
The actual underlying problem is that the GPU scheduler is braindead, and apparently
cannot be tuned. It decides unilaterally that the frontmost window, TheBall, is more
important than anything else, and thus does not give up time when our much more critical
VR app calls. SetGPUPriority does nothing, and there are no nvapi calls that would
allow a fix. The scheduler is a black box, and fuck off.
12-6-17
Interesting experiment where I only create a DX9Ex factory, but then allow the calls
to go through as normal DX9 calls for CreateDevice, CreateTexture, Present et al.
Creating the shared resource still works, no error, and I get stereo out.
This is probably the best way to go, to avoid having to tweak all the other calls for
DX9, like CreateTexture and CreateTexture3D, to add parameters.
This works with only the DX9Ex factory, but the returned CreateDevice result is still
considered a Device9Ex by the debug runtime, and I still need all the CreateTexture/Buffer
overrides to fix those debug-layer complaints.
5-18-18
Pretty big gap there as I lost motivation. Back looking at the dropped frames problem.
Definitely seems to be priority related, but I have no tools to fix that. Asked for
access to Context Priority in the NVidia VRworks, but no one responds, including Dave.
Managed to create a semaphore based stall system, using a Windows Event that can be
triggered on/off. In the main VR app, I trigger this off after 3ms from the front of
the frame, then in the game, at patched draw calls, I look for and stall if that is
off. This works in that the code does what I expected, and doesn't seem to have any
bugs. In GPUView I can see that the game 8ms blob is broken into two pieces and the
frame rate in the game drops to 45.
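A minimal sketch of that event-based throttle, with illustrative names for the event
and hook points:

// Hedged sketch. Manual-reset event, shared by name across the two processes.
#include <windows.h>

static HANDLE gGoEvent = nullptr;

// VR app side: open the gate for ~3ms at the front of each frame.
// (A real implementation would use a timer rather than sleeping the
// render thread; this just shows the handshake.)
void InitThrottle()
{
    gGoEvent = CreateEventW(nullptr, TRUE /*manual reset*/, TRUE /*signaled*/,
                            L"Local\\KatangaGoEvent");
}
void OnVRFrameStart()
{
    SetEvent(gGoEvent);     // let the game issue draw calls...
    Sleep(3);               // ...for roughly 3ms from the front of the frame
    ResetEvent(gGoEvent);   // then close the gate
}

// Game side (injected), called from the patched draw calls: stall while
// the gate is closed, with a bounded wait so we can never deadlock.
void OnPatchedDrawCall()
{
    static HANDLE goEvent = OpenEventW(SYNCHRONIZE, FALSE, L"Local\\KatangaGoEvent");
    if (goEvent)
        WaitForSingleObject(goEvent, 20);
}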
Not a full solution though. Still get occasional dropped frames as it conflicts.
It would be possible to tune it for just this game, but that doesn't seem like a
particularly hot strategy anyway.
Switching now to embedding the DX11 VR side into the game app as well, and doing the
entire VR world with my own hand-built code, no Unity. This gives me the most flexibility,
with the most work as well. Worth an experiment to see if I can keep performance up
this way. Experiments with VorpX suggest that he is doing this, and his performance
seems spot on.
Still going to use Unity app for the moment, as all the launching and Deviare injection
happen there. Just turning off the VR aspect, so as far as it is concerned, it's just
a regular 2D game.
8-5-18
Installing for the new Squanchando computer. Decided to update VS2017 and move to the
latest LTS Unity, reasoning that having the latest graphical debugging tools here is
more valuable than a stable setup. Causes a few problems, like broken directory paths.
Also of note, doesn't work at all on WMR, because WMR does not presently support
3D Vision. Driver crashes if 3D is enabled. Bug reported.
8-27-18
Tried to get Registration-Free COM to work, but cannot quite get the right combo for
the Unity app. Too strange a runtime; not clear where to connect. Got it working
OK for the InvisibleWalls sample, including a subdirectory for DeviareCOM, but nothing
else. It does not appear to support an arbitrary directory. The documentation sucks, and I
could not find any examples of people using a different directory than the root.
Even using the root for Unity does not work, probably because they set the working
directory to PlugIns or something. Super fragile mechanism, no good debug tools.
The sxstrace command creates an empty log in the Unity case. All in all, a good
waste of multiple weeks.
Change of plan there- RegFree is a nice-to-have, not required. Definitely far
superior, but tweaky and hard to figure out. Easy enough to regsvr32 at install
time instead, so let's skip the distraction and get on more important stuff.
Like avoiding the Steam sublaunch for hooking. If I follow:
https://stackoverflow.com/questions/9624629/debug-games-from-steam-with-pix
and create a steam_appid.txt file with 35460 in it, I can successfully launch
the Steam version of TheBall without any problems.
10-7-18
Switching gears here. Tried to hook up to Psychonauts, and was successful at running
in VR as well. Then tried to add HelixMod, and discovered the old versions do not
support the CreateDeviceEx call, so when it's installed, there is no hook available.
Tried to work around it by calling out to System32 specifically, but something is
interfering with HelixMod (or it with us), and it no longer hooks the game and fixes
the shaders. This, combined with all the tweaks needed to get the Ex variant working,
seems like a hint this isn't the right approach.
Also worth noting: the tweaks to different calls may not work properly under
HelixMod, because of his own hooks. So, the workarounds could introduce problems.
Could make an automatic upgrade to the last HelixMod version, however.
This seems like as good a time as any to try to use the Encode/Decode chip that is
built into all GPUs now. We can convert the game data to a video stream, and pass
that to the VR/Unity side to Decode. Pretty sure this is how BigScreen works.
It can do lossless video, which was a big concern. The only real question is whether
it is fast enough. Probably not as good as surface sharing, but maybe that doesn't
matter. Should be more reliable and simpler to implement. Should also make it
much easier to connect with DX11 games, as the decode part would not change.
Hopefully the encode/decode does not require going to system memory, which would
be a big killer to this approach.
11-7-18
Setting up to use nvcodec is challenging, because of the many, many versions. The
latest version is probably not the right choice, because it requires an up to date
driver. As of the moment, SDK 8.2 requires driver 397.93 or higher. That's quite a
bit higher than we'd want to require, because older cards and games can run better
on older drivers. Sometimes we need those older drivers.
Using that idea, going back to SDK 6.0 is not unreasonable. Requires driver 358.xx or
higher, which is after GTX 980ti, but before GTX 1080. Not as far back as we'd like,
but better than nothing. SDK 5.0 requires driver 347.09 or better, but also requires
installing the CUDA toolkit, which we'd rather not do.
SDK 6.0 supports lossless encoding, which is the main goal.
So for the moment, we are going with SDK 6.0. Might change depending upon user
requirements.
12-19-18
Got a basic structure set up for using nvcodec, but it doesn't work. It seems to require
a source buffer of an OffscreenPlainSurface, which can be in video card memory,
but does not allow any normal way of updating the surface with game bits. StretchRect
will not work with OffscreenPlain as the destination. Seems to only be viable for
loads from system memory, which is pointless.
The DX9 interop code does use an alternate path though, using a Media Foundation Layer
from Microsoft, with a DXVA2 library. Trying that path now.
One strange thing is launching Steam games. The Steam launcher will often interfere, and
throw up the 5:65432 error. This can be bypassed most times with a file named
'steam_appid.txt' with the appid in the file. Doesn't seem to work with The Ball though.
1-7-19
Got basic code working, and can see an encoded sample program. Backed up a bit, and
integrated the Katanga code into the StereoUnproject sample, as a test of encoding a
DX9 stereo backbuffer. The code works, and encodes a video to disk. Colors are wrong,
but basic data is there, and probably something off with the interminable settings in
the nvcodec. This uses the DXVA2 interface, which apparently is required to allow us
to copy the backbuffer onto something the nvcodec can accept. Using StretchRect is not
possible. Hard.
Next up is decoding the stream. As near as I can tell, it requires the use of Cuda, and
cannot use a DXVA2 variant. Giant mess here, seems like only one path actually works, the
rest is just broken. Begs the question of using Cuda though- if we can share via a
Cuda surface, there is no need to use the encode/decode. Still in investigation after
2 years of development. Christ.
Also worth noting that the performance problem seen above, happens because the VR app is
in the background. That's allowing the game to block the VR app and cause stalls. Probably
not an issue, but might be for more demanding games. Could be solved by putting the VR
generation in the game process space itself, and not use Unity. Might be necessary and/or
worthwhile.
This is a separate problem from getting bits to the VR app. That requires surface sharing
or the video encode/decode, or possibly cuda. That needs to be solved regardless.
1-11-19
Tested another scenario that is worth writing down as a possible solution. While setting up
for the DXVA2 surface sharing, it's surprising that it allows the nvcodec to access the
surface, and seems to be the way the samples all work. So, I made a test case (checked in) of
using a decode destination surface as the target for a StretchRect, then taking that surface
and doing a StretchRect back to the backbuffer. This works. This was with the standard DX9
factory, not the DX9Ex factory, so it would require no changes to the game setup.
So, that means it should be possible to use this as a sharing technique, without needing to
modify everything for DX9Ex. Might be useful.
Still, this does not seem like the way to go, because we actually need to go from DX9 to DX11
in order to draw in VR. This DXVA2 buffer can possibly be set up as a surface share by itself,
and thus not require game changes, but I have not tested that. This is still suboptimal though,
because we want to be able to beam to Oculus Go/Quest as well, for a larger market. The only
way to beam to those will be using the video encoder/decoder for a smaller bitstream.
After some research it also looks likely that we could just use Cuda directly. It has direct
access to DX11 and DX9 surfaces, and a way to copy from one to the other, closer to a memcpy
than the freakish DX stuff. And it keeps it all on the GPU. For a local-only mode, this seems
like a much better path, because Cuda is general-purpose compute, not laden down with all
the graphics restrictions of DX.
Should be fairly simple to just use Cuda to copy a registered DX9 buffer directly to a DX11
buffer. There do not appear to be any Device restrictions there, but like always, it's hard
to say until you try it.
Still, this is local-only, and a good backup plan, but does not get a stream out to Quest. Not
sure that is important, but it would also allow us to get to Cardboard devices. Maybe no one
cares. Probably. But, unless there are limitations to the streaming approach, like terrible
latency or quality, this is a more general solution.
Of course... the streaming decode *also* requires Cuda for the decode. But heading that way
now, to make a DX11 device that will receive the stream, decode it, and draw it. Wish me luck.
1-17-19
Also worth considering is that the process space gets priority for the GPU. In my Ball case,
the VR app was not getting GPU priority, because Ball was in front. If I launch the VR app
in the same space, then it would get the same priority and no preemption.
This looks like it might be possible. The OpenVR and Oculus SDKs can run as 32-bit applications.
They require an x64 OS, but not an x64 app. That suggests we could create a VR runtime output
directly in the game process, and thus have it be at the same priority as the game itself, and
avoid the attendant game launching problems.
I have proven with the StereoUnproject modified test sample, that I can create DX9 and DX11
environments in the same app, without any problems. Sharing data across the DX is non-trivial,
and currently am looking at nvcodec for this. It looks like it might be possible to use
SharedSurfaces via the DXVA2 however, which is a fallback approach if nvcodec fails. nvcodec
is a better approach, because we can then beam off device to wireless headsets as well. Performance
is the only real question for whether to use it or not.
If I launch a VR output in the game process space, that will be a standalone environment, and no
Unity support. That's partly bad because Unity brings UI features and Steam Workshop integration
for free. It would simplify the runtime dramatically however. This Deviare launch is weird.
2-8-19
Finally, finally, figured out the nvcodec and their SDK. Took much longer than desired,
and motivation was low, but I finally did it. The problem is that the decoder *requires*
a file. They do not expose the internal stream buffer they use to decode, and their API
is oriented around the sole concept of files. This is remarkably stupid, but there are no
versions, not even the latest, that use anything as input except files. Everything is
opaque. Sheeeet.
This is bad because it means that the only way to get the compressed bits to the DX11
side is to go through system RAM. We can map a file to memory to avoid hard drive slowness,
but that's RAM, not GPU memory. This may or may not matter, depending upon the
performance. The H.264 encoder can drop the 2M pixels (8MB) of 1080p down to roughly 500K
at good quality. PCI Express can do something like 8MB/frame (500MB/s over 60 frames). That's
quite a bit more than needed, probably. Pencils out, but the real world will no doubt... vary.
Given the huge advantage of being able to stream off device to something like Go/Quest,
it's probably still worth finishing the sample using memory mapped file I/O. Especially for
DX9, the games themselves will not be as demanding, so some overhead is going to be OK.
Next up was an investigation into Cuda copies. Like everything I've looked at, this seems
like the way to go. Probably get boned in the end too. But, there is a direct interop
support for DX9 and DX11 in Cuda, for copying surface to surface, and it does not rely
upon Microsoft's lame surface sharing, it's GPU based using Cuda processor code. Closer to
a GPU memcpy.
Initial look says this will work without too much trouble, and the API doesn't totally suck,
which is a refreshing change of pace. Copying from backbuffer seems problematic, but from
a non-rendertarget it looks fine. And keeps everything on the GPU like desired.
The only small gotcha is that the DX9 device *must* be an Ex device; otherwise the cuda calls
fail, even using an old SDK on modern hardware. But the old SDK works with just the change to
Direct3DCreate9Ex. The 'simpleD3D9' sample from the 7.0 cuda SDK works, using non-Ex calls
for all the remaining pieces. That maps directly to switching to the Ex call at init
for a given game, with no other change, which lets the game run without all the glitched
APIs.
2-24-19
OK, finally, some real progress. Going down the encode-to-decode video path was a dead
end. The stupid nvcodec *requires* a file to decode. There is no support for a
buffer-based decode, which means the data has to be sent back across PCI to the CPU to
get to the DX11 side. This is maybe OK; performance of the compressed data can conceivably
be OK, especially using a memory-mapped file to avoid HD hits. But, lame.
Given that, took a look at Cuda to see if it could do interop in any sensible fashion.
The APIs are 100x better than nvcodec, and the SDK was usable instead of a piece of shit.
Looked doable, so I built up a sample program from the stereounproject. Same one as for
nvcodec: DX9 cubes being drawn, to a DX11 output window. Made a new branch for the
different approach.
Works! Got it fully working, showing the DX9 game/app drawing the red cubes, and then making
a copy into a DX9 Surface, which is then cuda-copied into a DX11 Texture2D. Since both of
those are mapped at once, it does a cudaMemcpyArrayToArray to get all the bits across. From
there, the Texture2D is drawn to the Quad. Fully functional, all data kept on the GPU. A
couple of extra copies using StretchRect, but that's not going to sting.
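A hedged sketch of that copy using the runtime interop API as described here (the next
entry switches to the driver API for context management; names and sizes are illustrative):

// Hedged sketch: register once, then map both resources and do the
// GPU-side array copy each frame. All bits stay on the card.
#include <d3d9.h>
#include <d3d11.h>
#include <cuda_runtime.h>
#include <cuda_d3d9_interop.h>
#include <cuda_d3d11_interop.h>

cudaGraphicsResource* cuda9  = nullptr;   // wraps the DX9 surface
cudaGraphicsResource* cuda11 = nullptr;   // wraps the DX11 Texture2D

void RegisterOnce(IDirect3DSurface9* dx9Surface, ID3D11Texture2D* dx11Tex)
{
    cudaGraphicsD3D9RegisterResource(&cuda9, dx9Surface, cudaGraphicsRegisterFlagsNone);
    cudaGraphicsD3D11RegisterResource(&cuda11, dx11Tex, cudaGraphicsRegisterFlagsNone);
}

void CopyFrame(size_t widthInBytes, size_t height)
{
    cudaGraphicsResource* both[2] = { cuda9, cuda11 };
    cudaGraphicsMapResources(2, both);      // both mapped at once, as noted above

    cudaArray *src = nullptr, *dst = nullptr;
    cudaGraphicsSubResourceGetMappedArray(&src, cuda9, 0, 0);
    cudaGraphicsSubResourceGetMappedArray(&dst, cuda11, 0, 0);

    // 2D sibling of the cudaMemcpyArrayToArray call named above; width is
    // in bytes, and the copy never leaves the GPU.
    cudaMemcpy2DArrayToArray(dst, 0, 0, src, 0, 0, widthInBytes, height,
                             cudaMemcpyDeviceToDevice);

    cudaGraphicsUnmapResources(2, both);
}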
Now moving into incorporating this sample code into Katanga.
3-4-19
Got a crash after integrating the test app code: the DX9 source does not seem to be available
to the DX11 side, which uses a different context, because it's a different process. That's not
a huge surprise. The surprise is that there doesn't seem to be any nvidia context management
while using their CudaRuntime API. That meant going back to the test app, and switching out
all the calls to use the CudaDriver API instead, which does support context management.
3-8-19
Fail, fail, fail. Finished working out the StereoUnproject app, to create a second app via
CreateProcess. That's a DX11 environment outside of the original. Inside the original game/app
the cuda share works. Outside, in a different process, it does not work, because the cuda
context apparently cannot be shared across processes. The app uses MemoryMappedFile sharing
for IPC and shares the setup struct. Even in x32->x32 sharing, this does not work.
So... cannot work in the Unity-launches-game scenario using cuda, because it's cross-process.
Options are:
1) Build the VR environment into the game itself, which would remove a lot of sick
complexity from the Unity/C# side, and also solve the performance problem of the VR
environment getting stalled by a busy game. This would make adding a Steam workshop
harder, and require lots of manual effort to build an environment. No UI help from Unity.
2) Try to get the video sharing working. This goes cross-process and x86->x64 without any
problems, because it goes back to main RAM for the video stream. Using MMF, this would be
doable. The only question is whether performance takes a hit by shuffling all the data across
the PCI bus. And, we would still have VR environment stalls if the game is busy.
It's worth noting that an x86 game could create an x86 VR environment; x64 is not required
for the VR world. It requires an x64 OS, but not an x64 runtime. That means we could build the
DX11 environment as x86 for the VR world. I know this works for sure, because that's how
the Alien Isolation injector worked, and it's an x86 game.
3-11-19
NVidia announced end of life for 3D Vision. The last supporting driver will stand for one
year. That means we are out of time here. No one will buy this thing if it requires an
old, stale driver.
So... switching gears yet again. Now it's time to get something working that can maybe
be sold, or at least is useful. That means taking the simplest path, not solving the
hardest problem (DX9 support).
So, not going to nvcodec at present, but will move back to the SurfaceSharing approach.
But, this will be DX11-specific and skip DX9 for now. This is the easiest path; should
be able to just create new projects and get DX11 shared surfaces. Should also be
most popular, as DX11 and modern games are really the interest nowadays.
3-24-19
Got the DX11 side working! So this is x64 game to x64 Unity, and showing on the big screen
TV in VR. The Witcher 3 test case is working fine at present; performance is good, way beyond
what I need while using a 1080ti.
Mostly a straightforward switch to using DX11. Deviare setup was a pain, but works, and allows
for multiple injected targets to get to Present. Surface Sharing in DX11 works very similarly
to DX9, so much so that the Unity plugin side didn't need to change. After using the DebugLayer
on both sides, all problems were solved. Stunning.
8-14-19
Long uphill road to get here. Long! Trying to get it into a shippable state. Pretty close now.
Currently setting up a Jenkins server for builds. Some data worth noting down: the AMIs that
AWS provides, and server storage. Looking for the smallest install.
2008-R2_SP1 Base: 5.45GB free of 29.9GB
2016 server Base: 7.31GB free of 29.9GB partially setup with jenkins, vs2017 4G + 2.8G
2019 server Base: 16.7GB free of 29.9GB
2016 server Base: 13GB free of 29.9GB fresh
Before installing Unity: it says Unity takes 1.8G.
After Unity: 4.82GB free.
After Unity build: 3.7GB free.
2019 server, fresh install: 16.7GB free
Git install: 16.4GB free
Unity install: 13.8GB free
MSBuild install: 7.28GB free
Jenkins install: 6.89GB free
MSBuild optimized: 8.26GB free successful katanga msbuild
Unity build: 7.83GB free successful Unity end build with Release
Cleared both VS2017 for 3DFM builds: 12.1GB free
Building 3DFM: 5.21GB free
Building both: 5.34GB free
Build Unity, katanga, not 3DFM: 4.60GB free
After reboot: 5.89GB free
Fully functional build: 2.87GB free *** could possibly save some space
After reboot: 3.57GB free
==Server setup notes==
This Jenkins build server has been setup to build the HelixVision code.
There are 3 fundamental pieces-
1) Katanga, C++ dlls
2) Unity VR app
3) 3D Fix Manager .NET app
This makes for a lot of tool requirements.
We want the server install to be as small as possible, so that we can run on the AWS free tier.
If we keep below 30GB of usage, we can run a free build server. After all install and setup,
we are running at about 5GB free. Perfectly fine.
We switched to 2019 Windows Server because it has the smallest footprint of the AWS server OSes.
The server is small, 1G RAM, 1 CPU. You need the AWS credit system for any decent performance at all,
otherwise it's 10% of a single Xeon 2GHz CPU. Slow.
The VS2017 is fully installed, instead of just the MSBuild environment, because there are a lot of
pieces of the 3DFM build that require nuget packages. It's likely possible to get this set up
without it, but after a day, having it be solely MSBuild just doesn't matter. Just install
VS2017 Community.
Likewise, the Unity is fully installed. 2017.4.26f1. Same version as used for our normal builds.
This is pretty much required because even for cmd-line builds, it uses the main unity.exe.
We do not install the debugging tools for either Unity or VS.
The Katanga C++ build is straightforward. The only tricky part was needing the ATL package, because
Deviare requires it. Using toolset v141, and SDK 10.0.17134.
The Unity build has been trying to use the 3DUnity Jenkins plugin, but that no longer seems to function
at all, and returns build failures. We are just going to use a Bat file to run the Unity command, and
rely upon the Jenkins job timeout to catch failures. We will also look for the arrival of the Release
folder, as the demonstration of a successful build.
For 3DFM builds, we need to fetch the Nuget packages, as they are not in source control. There are 19
in use, so it doesn't really make any sense to change this.
This is using the older style of packages.config files, and could be converted to the newer
version, but there is no apparent value in that for this project. Rather than do that, we
install nuget.exe here, in the Jenkins root folder. We can then call it directly from a Bat
file for "nuget.exe restore". This works to solve the missing packages for a Jenkins build.
The 3DFM build is not quite right, getting a missing-package error for things not specified in
the project. Probably a wrong install; trying a reinstall of full VS2017. It works on the main
machine, so something is installed there but not reported by the VS Installer.
Fixed by adding the Blend for Visual Studio component.
Finally getting all builds to work. Katanga with a batch file Unity build, and 3DFM with nuget restore.
Top level project of HelixVision fires off those builds, and then copies the pieces to the right places
and creates a final zip file.
9-5-19
Worked a couple of days on a way to get the Katanga window to auto-close after the game exits.
The obvious way was to add a hook for OnAgentUnloaded, which fires when the game exits.
Unfortunately, that is apparently not possible. Whenever I try to += the routine to be
called, it crashes with a NotImplementedException. Something to do with the runtime
or build system, such that it's not possible in C# to do this. Deviare samples show this
working, so it's not clear what is different, although this uses a Mono runtime, not a
normal .Net runtime, which is the likely culprit.
Tried the 2018.1 version of Unity as well, which supports the 4.0 .Net APIs. But apparently
not the full runtime, because it does not work there either. Nor with their new IL2CPP
runtime. All in all, pretty lame.
Tried using the Deviare Thread.IsActive call, but it has a default timeout of 1000ms, and thus
stalls the primary thread and draws once a second. No way that I can tell to get the timeout
parameter passed; something about the Deviare interface through the IDL does not export the
time parameter.
Can't use the OnAgentUnloaded in the injected DLL, because that runs in the game itself.
Could, but it would require IPC back to the Unity app for notification. The best of the
not-great options seems to be to set up an alternate thread to check IsActive and quit
upon that.
Part and parcel of using this bad runtime layout. Causes a lot of problems. Should have used
UE4 with native C++ access, but at the time they did not support single pass stereo.
11-10-19
Some notes on the Steam DesktopGameTheater- it's lame. Cannot figure out a way to get it to back off and
not kill the Katanga process. If it's enabled in Settings and per game, then they decide unilaterally that
their theater is the one for all 2D game launches. Even if Katanga launched the process, and has an active
VR connection, they still send a quit message, and if Katanga does not respond, they kill the process.
Even if Katanga is admin, even if we respond to OnApplicationQuit with a CancelQuit call. Any launch using
the -applaunch ID approach will be force quit.
Tried a bunch of different things, including adding -nokillprocess to the Katanga launch; no
effect. Not sure what that does anyway, but it was suggestive. Does not matter if the game is
launched by 3DFixManager or Katanga, they don't respect that. Does not matter if 3DFM has an
active VR connection.
Asking Steam support for help was useless, they didn't even understand the question, and finally just sent
it off to the VR team.
If we launch by exe, and use the steam_appid.txt files, then mostly we don't get killed. Still gets killed for
double-launch games, but it's a partial solution.
Setting the flags to off for a game does work, but is asking the user to do something, which is against our
approach.
11-12-19
OK, further tedious debugging requires some written-down documentation.
It seemed interesting to use the DX11 Shared Mutex to handle synchronization of the two sides;
even if we can only use it for DX11, it would theoretically prevent the hangs we see at
ResizeBuffers. Doesn't work. Cannot work.
On the Unity side, we are letting Unity do the drawing, and because of that we can't get to
the texture to acquire/release the mutex when updating. If the mutex is added at DX11 create
time, then Unity simply stops drawing. No idea why, because the mutex should not be required,
and if not initialized it's free to access, but Unity stops even when the only thing changed
is adding D3D11_RESOURCE_MISC_SHARED_KEYEDMUTEX.
It would be possible to move the texture down to the native C++ layer, and then draw it
there, but it hardly seems worth the effort for this.
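For reference, the handshake that was attempted looks roughly like this on the writer side
(hedged sketch, illustrative names; the texture must be created with
MiscFlags = D3D11_RESOURCE_MISC_SHARED_KEYEDMUTEX):

// Hedged sketch of the keyed-mutex write path that Unity would not tolerate.
#include <d3d11.h>
#include <dxgi.h>

void WriteFrame(ID3D11Texture2D* shared, ID3D11DeviceContext* ctx,
                ID3D11Texture2D* backbuffer)
{
    IDXGIKeyedMutex* mutex = nullptr;
    if (FAILED(shared->QueryInterface(__uuidof(IDXGIKeyedMutex), (void**)&mutex)))
        return;

    // Key 0 = writer's turn, 5ms timeout. Check against S_OK specifically,
    // because WAIT_TIMEOUT also counts as a SUCCEEDED() HRESULT.
    if (mutex->AcquireSync(0, 5) == S_OK)
    {
        ctx->CopyResource(shared, backbuffer);
        mutex->ReleaseSync(1);   // hand off to the reader on key 1
    }
    mutex->Release();
}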
Going to a different sync approach to try to avoid the hangs at ResizeBuffer.
Interesting idea: Run 3D as alternating eyes for performance, but instead of requiring the
highest speed, allow a lower speed and interpolate the frame for the second eye, like they
interpolate for VR. This could be the 2nd, non-dominant eye being the one interpolated, which
would only have errors during fast pan times. And the errors are filled in with fuzz, so it's
like a cross between fake 3D and real.
8-2-20
Just spent three days studying the hangs at game launch. No complete answer, but after
commenting out all of the actionable code in the game plugin, to where it simply creates the
surface and never uses it for anything, it's fairly clear this is a Katanga-specific hang.
Tried a motley assortment of sync techniques, including EnterCriticalSection, the SetupMutex from game
side, not doing any dispose of data structures that were stale, not using the surface at all on the
Katanga side. Nothing made a difference at all.
Tried using the KeyedMutex, but that didn't work either. And I proved it has nothing to do
with drawing in any case, by removing the CopySubRegion in the game plugin: no sync necessary,
because the surface is never in use. Also tried enabling the debug layer by setting R9=0x03 at
CreateDevice in Unity. This solved the problem, but only by making it no longer appear. No
debug output ever showed.
Had a very solid hang example/test case in TheSurge, but it just recently stopped hanging
altogether. It's fairly obvious this is a race condition of some form, so somehow the race is
not being hit now. And... just figured it out. It's because I set APP_COMPAT_SHIM=1 for the
Katanga profile with NVidia Inspector.
So that by itself is very strong evidence that the shim fixes it. It looked fairly clear that
this was a driver bug, leading to a hang in Katanga, waiting for an NTObject that never
arrives. It happened exactly inside the OpenSharedResource call, with no intervening steps.
Shouldn't happen; there is nothing multithreaded in Katanga, it's a very simple app.
Tried doing a Context->Flush, but that did not solve the problem either. Somehow, some other
piece of the pipeline is busy or using something related to the shared surface, and hanging us
up. No idea how to clear that state to avoid this sort of thing.
The answer is to use APP_COMPAT_SHIM=1. I'm going to remove all the other sync junk, because
all that did was change the race condition, not solve it. It would hang less, but still hang
in some cases. We will set an NVidia profile for Katanga with this flag set.