v0.4.0
New Features
- Slice optimization to use a built-in tensor function when possible by @luitjens in #360
- Slice support for std::array shapes by @luitjens in #363 (see the slice sketch after this list)
- SVD power iteration example, benchmark, and unit tests by @luitjens in #366
- matmul: support real/complex tensors by @kshitij12345 in #362
- Added sign/index operators by @luitjens in #369
- Optimized cast and conj operators to return a tensor view when possible by @luitjens in #371
- Implemented QR for small batched matrices by @luitjens in #373
- Implemented block power iteration (QR iterations) for SVD by @luitjens in #375
- Added output iterator support for CUB sums, and converted all sum() by @cliffburdick in #380
- Removed inheritance from std::iterator by @cliffburdick in #381
- DLPack support by @cliffburdick in #392
- Added ref-counting for DLPack by @cliffburdick in #394
- Updated CUB optimization selection for CUB >= 2.0 by @tylera-nvidia in #395
- Refactored make_tensor to allow lvalue init by @cliffburdick in #397 (see the make_tensor sketch after this list)
- Updated notebook documentation and refactored some code by @cliffburdick in #398
- Allow 0-stride dimensions for cuBLAS input/output by @tbensonatl in #400
- 16-bit float reductions + updated softmax by @cliffburdick in #399 (see the fp16 reduction sketch after this list)
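
The std::array slice support (#363) can be exercised roughly as follows. This is a minimal sketch, not code from the release: it assumes the free-function `slice(op, starts, ends)` form accepts `std::array<matx::index_t, RANK>` for the start/end coordinates, and the shape and values are hypothetical.

```cpp
// Minimal sketch of std::array-based slicing (assumed shapes/values).
#include <array>
#include <matx.h>

int main() {
  auto t = matx::make_tensor<float>({8, 8});
  (t = 1.0f).run();                              // fill on the default stream

  // Start/end coordinates passed as std::array instead of brace-init lists.
  std::array<matx::index_t, 2> starts{0, 0};
  std::array<matx::index_t, 2> ends{4, 4};
  auto v = matx::slice(t, starts, ends);         // 4x4 view, no copy

  cudaStreamSynchronize(0);
  return 0;
}
```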
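
The make_tensor refactor (#397) lets a tensor be declared first and initialized later. A minimal sketch, assuming a `make_tensor(tensor, shape)` overload that fills in an existing lvalue tensor; the 16x16 shape is just an example:

```cpp
// Minimal sketch of lvalue initialization via make_tensor (assumed overload).
#include <matx.h>

int main() {
  matx::tensor_t<float, 2> t;        // declared as an lvalue, not yet backed by memory
  matx::make_tensor(t, {16, 16});    // initialize the existing tensor in place
  (t = 1.0f).run();                  // fill on the default stream
  cudaStreamSynchronize(0);
  return 0;
}
```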
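
For the 16-bit float reductions (#399), a sum over `matxFp16` data can be sketched as below. This assumes the function-style `sum(dest, src)` reduction API running on the default stream; the sizes and fill value are hypothetical.

```cpp
// Minimal sketch of a 16-bit float reduction (assumed sizes and API form).
#include <matx.h>

int main() {
  auto in   = matx::make_tensor<matx::matxFp16>({4, 128});
  auto sums = matx::make_tensor<matx::matxFp16>({4});   // one sum per row

  (in = matx::matxFp16{0.5f}).run();   // fill with a half-precision constant
  matx::sum(sums, in);                 // reduce over the innermost dimension
  cudaStreamSynchronize(0);
  return 0;
}
```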
Bug Fixes
- Fixed duplicate print and removed member prints by @tylera-nvidia in #364
- cuBLASLt column-major detection fix by @luitjens in #368
- Fixes for 32-bit mode by @cliffburdick in #388
- Fixed a spurious maybe-uninitialized warning/error in release mode by @cliffburdick in #389
- Fixed issue with using const pointers by @cliffburdick in #393
- Generator Printing Patch by @tylera-nvidia in #370
New Contributors
- @kshitij12345 made their first contribution in #362
- @tbensonatl made their first contribution in #400
Full Changelog: v0.3.0...v0.4.0