torchure

A torch GPU matrix multiplication benchmark.

Command line

    python torchure.py --num-devices 1 --dtype bfloat16 --warmup --output-file results.csv
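To show what a run with these flags might do internally, here is a minimal sketch. It is not the actual contents of `torchure.py`. The flag names come from the command above; everything else is an assumption made for illustration, including `--warmup` being a boolean switch, the hypothetical `--size` option, the three warm-up iterations, the CSV column names, and the helpers `build_parser`, `matmul_tflops`, `time_matmul` and `main`.

```python
import argparse
import csv
import time


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the flags shown in the command line."""
    p = argparse.ArgumentParser(description="GPU matmul benchmark (sketch)")
    p.add_argument("--num-devices", type=int, default=1)
    p.add_argument("--dtype", default="bfloat16")
    p.add_argument("--warmup", action="store_true")    # assumed to be a switch
    p.add_argument("--output-file", default=None)
    p.add_argument("--size", type=int, default=4096)   # hypothetical, not in the README
    return p


def matmul_tflops(m: int, n: int, k: int, seconds: float) -> float:
    """An (m x k) @ (k x n) product costs about 2*m*n*k floating-point operations."""
    return 2.0 * m * n * k / seconds / 1e12


def time_matmul(size: int, dtype_name: str, index: int, warmup: bool, iters: int = 10) -> float:
    """Return the mean seconds per square matmul on CUDA device `index`."""
    import torch  # imported lazily so the helpers above work without PyTorch

    device = torch.device("cuda", index)
    dtype = getattr(torch, dtype_name)  # e.g. "bfloat16" -> torch.bfloat16
    a = torch.randn(size, size, device=device, dtype=dtype)
    b = torch.randn(size, size, device=device, dtype=dtype)
    if warmup:
        for _ in range(3):  # untimed runs to absorb one-time startup costs
            torch.matmul(a, b)
    torch.cuda.synchronize(device)  # CUDA kernels are asynchronous: wait before timing
    start = time.perf_counter()
    for _ in range(iters):
        torch.matmul(a, b)
    torch.cuda.synchronize(device)  # wait for the timed kernels to finish
    return (time.perf_counter() - start) / iters


def main(argv=None) -> None:
    args = build_parser().parse_args(argv)
    rows = []
    for idx in range(args.num_devices):
        secs = time_matmul(args.size, args.dtype, idx, args.warmup)
        tflops = matmul_tflops(args.size, args.size, args.size, secs)
        rows.append([idx, args.dtype, args.size, secs, tflops])
    if args.output_file:
        with open(args.output_file, "w", newline="") as f:
            writer = csv.writer(f)
            writer.writerow(["device", "dtype", "size", "seconds", "tflops"])
            writer.writerows(rows)
```

Calling `main(["--num-devices", "1", "--dtype", "bfloat16", "--warmup", "--output-file", "results.csv"])` on a machine with PyTorch and a CUDA GPU would write one row per device. The synchronization calls matter because without them the timer would mostly measure kernel launch overhead, not the multiplication itself.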