
Project BFLOAT16 on ALL hardware (>= 2009), up to 2000x faster ML algos, 50% less RAM usage for all old/new hardware - Hyperlearn Reborn.

Hello everyone!! It's been a while!! Years back I released Hyperlearn (https://github.com/danielhanchen/hyperlearn), where I made tonnes of algorithms faster - it now has 1.2K GitHub stars.

I was a bit busy at NVIDIA and with my startup, but I've been casually developing some algorithms. The question is: are people still interested in fast algorithms? Does anyone want to collaborate on reviving Hyperlearn? (Or making a NEW package?) Note the current package is, ahhh, A MESS... I'm fixing it - sit tight!!

NEW algos for release:

1. PCA with 50% less memory usage and ZERO data corruption!! (Maths tricks :) - i.e. no need to materialize X - X.mean()!) How, you may ask? See the PCA sketch after this list.
2. Randomized PCA with 50% less memory usage (again, no need to form X - X.mean()).
3. Linear Regression is EVEN faster, now with Pivoted Cholesky making the algo 100% stable. No package on the internet, to my knowledge, ships pivoted Cholesky solvers (see the sketch after this list).
4. Bfloat16 on ALL hardware all the way down to SSE4!!! (Intel Core i7 2009!!)
5. Matrix multiplication with Bfloat16 on ALL hardware! Not the cheap 2x extra memory copying trick - true 0 extra RAM usage via on-the-fly CPU conversion (see the bfloat16 sketch after this list).
6. New Paratrooper Optimizer which trains neural nets 50% faster using the latest fast algos.
7. Sparse blocked matrix multiplication on ALL hardware (for NNs)!!
8. Super fast Neural Net training with batched multiprocessing (i.e. while the NN is doing backprop on batch 1, we are already loading batch 2, etc. - see the prefetching sketch after this list).
9. Super fast softmax, making attention softmax(Q @ K.T / sqrt(d)) @ V super fast, and all operations use the fastest possible matrix multiplication config (tall/skinny vs. square matrices). A plain reference version is sketched after this list.
10. AND MORE!!!
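
To give a flavour of the PCA trick (points 1-2): one standard way to avoid ever materializing the centered copy X - X.mean(0) is to build the covariance from the raw Gram matrix and the column means. A minimal NumPy sketch of that idea (my own illustration - the function name and exact trick here are assumptions, not Hyperlearn's actual code, which may use a different and more numerically careful formulation):

```python
import numpy as np

def pca_no_centering(X, n_components):
    """PCA without materializing a centered copy of X.

    Uses cov(X) = (X.T @ X) / n - outer(mu, mu), so only one
    n_features x n_features matrix is formed and X itself is never
    copied or modified. Sketch only - not Hyperlearn's actual code.
    """
    n = X.shape[0]
    mu = X.mean(axis=0)
    cov = (X.T @ X) / n - np.outer(mu, mu)   # no X - mu copy needed
    eigvals, eigvecs = np.linalg.eigh(cov)   # eigenvalues ascending
    order = np.argsort(eigvals)[::-1][:n_components]
    components = eigvecs[:, order].T         # (n_components, n_features)
    return components, eigvals[order], mu

# Projection subtracts the mean per batch, never for the whole matrix:
#   Z = (X_batch - mu) @ components.T
```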
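
For the pivoted Cholesky least-squares solver (point 3), here is a rough sketch of the general approach via the normal equations and SciPy's LAPACK wrapper for ?pstrf (scipy.linalg.lapack.dpstrf). This is my own hedged illustration of the technique, not the package's solver:

```python
import numpy as np
from scipy.linalg import solve_triangular
from scipy.linalg.lapack import dpstrf  # pivoted Cholesky (LAPACK ?pstrf)

def lstsq_pivoted_cholesky(X, y):
    """Solve min ||X w - y||^2 via normal equations + pivoted Cholesky.

    Sketch of the idea only: P.T @ (X.T X) @ P = L @ L.T, where the
    pivoting handles (near-)rank-deficient X more gracefully than a
    plain Cholesky. Not Hyperlearn's actual implementation.
    """
    XtX = X.T @ X
    Xty = X.T @ y

    L, piv, rank, info = dpstrf(XtX, lower=1)
    piv = piv - 1                          # LAPACK pivots are 1-based
    L_r = np.tril(L)[:rank, :rank]         # factor of the leading r x r block

    # Solve L @ L.T @ (P.T w) = P.T b on the rank-revealed block,
    # then undo the permutation (trailing entries set to zero).
    b = Xty[piv][:rank]
    z = solve_triangular(L_r, b, lower=True)
    w_perm = np.zeros(XtX.shape[0])
    w_perm[:rank] = solve_triangular(L_r.T, z, lower=False)
    w = np.empty_like(w_perm)
    w[piv] = w_perm
    return w
```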
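
On the bfloat16 points (4-5): bfloat16 is simply float32 with the low 16 mantissa bits dropped (same sign bit and 8-bit exponent), which is why it can be emulated on any old CPU. A tiny NumPy illustration of the storage/conversion bit trick (truncation only, no round-to-nearest - and definitely not the zero-extra-RAM matmul kernel the post describes):

```python
import numpy as np

def float32_to_bfloat16_bits(x):
    """Keep only the top 16 bits of each float32 -> bfloat16 bit pattern.
    Works on any hardware; no bf16 instructions needed. Illustration only.
    """
    x = np.ascontiguousarray(x, dtype=np.float32)
    return (x.view(np.uint32) >> 16).astype(np.uint16)

def bfloat16_bits_to_float32(b):
    """Expand stored bfloat16 bit patterns back to float32 for compute."""
    return (b.astype(np.uint32) << 16).view(np.float32)

x = np.array([3.14159, -1.0, 1e-3], dtype=np.float32)
b = float32_to_bfloat16_bits(x)       # 2 bytes per value instead of 4
print(bfloat16_bits_to_float32(b))    # ~[3.140625, -1.0, 0.000999]
```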
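
The batched-multiprocessing point (8) is essentially double buffering: while the network runs backprop on batch k, a background worker is already preparing batch k+1. A minimal sketch with a thread and a bounded queue (the real version presumably uses multiprocessing and smarter memory handling; this only shows the idea):

```python
import queue
import threading

def prefetching_batches(batch_iter, buffer_size=2):
    """Yield batches while a background thread keeps loading the next ones.

    Sketch of 'load batch 2 while backprop runs on batch 1'; not
    Hyperlearn's actual implementation.
    """
    q = queue.Queue(maxsize=buffer_size)
    done = object()

    def worker():
        for batch in batch_iter:
            q.put(batch)              # blocks once the buffer is full
        q.put(done)

    threading.Thread(target=worker, daemon=True).start()
    while True:
        batch = q.get()
        if batch is done:
            return
        yield batch

# Usage: the training loop consumes batches as fast as it can,
# while the worker stays one or two batches ahead:
#   for X_batch, y_batch in prefetching_batches(load_batches()):
#       train_step(X_batch, y_batch)
```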
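
And for point 9, here is the attention formula from the list written out as a plain NumPy reference, with the usual max-subtraction for a numerically stable softmax - just a readable baseline, not the optimized tall-skinny/square matmul scheduling the post refers to:

```python
import numpy as np

def attention(Q, K, V):
    """softmax(Q @ K.T / sqrt(d)) @ V  (plain NumPy reference).

    Q: (n_q, d), K: (n_k, d), V: (n_k, d_v).
    """
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                   # (n_q, n_k)
    scores -= scores.max(axis=-1, keepdims=True)    # stabilize exp
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ V                              # (n_q, d_v)
```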

Old algos made faster:

1. 70% less time to fit Least Squares / Linear Regression than sklearn + 50% less memory usage
2. 50% less time to fit Non Negative Matrix Factorization than sklearn due to new parallelized algo
3. 40% faster full Euclidean / Cosine distance algorithms
4. 50% less time for LSMR iterative least squares
5. 50% faster Sparse Matrix operations - parallelized
6. RandomizedSVD is now 20 - 30% faster

Also, you might remember my 50-page machine learning book: https://drive.google.com/file/d/18fxyBiPE0G4e5yixAj5S--YL_pgTh3Vo/view?usp=sharing

