Machine Learning Performance
The need for speed
Coming from a real-time world (games and graphics), current machine learning training is a shock. Even small, simple training tasks take a long time, which gives me plenty of time to think about the performance of my graphs.
Currently, most open source deep learning stacks consist of a C++ and CUDA back-end driven by a Python/R/Lua front end. The CUDA code tends to be fairly generic, which makes sense from an HPC and experimental point of view. It's important to note that HPC code is very optimised, but it tends to rely on standard interfaces with custom but relatively generic back-ends. For example, BLAS is an old FORTRAN standard for linear algebra, and it has numerous back-ends, including optimised x64, CUDA, OpenCL etc. However, it only accelerates the tensor multiplies in a few data formats; more data-specific optimisations, such as overlapping conversions, aren't in its remit. It's quite a different optimisation strategy from the rea...
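As a rough sketch of the BLAS point above, the toy example below contrasts a generic multiply that expects floats (so the data needs a separate conversion pass first) with a data-specific version that converts 8-bit values as it reads them. Everything here is an assumption made for illustration: the helper names `to_float`, `matmul` and `matmul_fused` are hypothetical, the 8-bit input format is a guess at a typical case, and none of it is a real BLAS API.

```python
# Illustrative sketch only: these helper names are hypothetical and are not
# part of BLAS or any real library.

def to_float(matrix_u8):
    """Separate conversion pass: builds a full float copy of the byte matrix."""
    return [[value / 255.0 for value in row] for row in matrix_u8]


def matmul(a, b):
    """Plain float matrix multiply, standing in for a generic GEMM call."""
    inner, cols = len(b), len(b[0])
    return [
        [sum(row[k] * b[k][j] for k in range(inner)) for j in range(cols)]
        for row in a
    ]


def matmul_fused(a_u8, b):
    """Multiply with the byte-to-float conversion folded into the inner loop,
    so no intermediate float matrix is ever stored."""
    inner, cols = len(b), len(b[0])
    return [
        [sum((row[k] / 255.0) * b[k][j] for k in range(inner)) for j in range(cols)]
        for row in a_u8
    ]


pixels = [[0, 255], [51, 102]]      # 8-bit input, e.g. image data (assumed)
weights = [[1.0, 2.0], [3.0, 4.0]]

two_pass = matmul(to_float(pixels), weights)   # convert, then multiply
one_pass = matmul_fused(pixels, weights)       # convert while multiplying
```

Both paths produce the same numbers. In pure Python the fused version is no faster; the sketch only shows where the conversion happens. On real hardware the potential saving would come from not writing out and re-reading the intermediate float matrix, which is the kind of data-specific work a standard interface leaves to the caller.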