
Training 100,000+ models in parallel using MLlib in PySpark

Big data noob here - please don't judge :)

I am developing a forecasting module for a portfolio of 4000+ products (the product count will grow as the portfolio expands). We are currently experimenting with random forest and XGBoost models as alternatives to traditional time series models.

For each product, a model is first trained on that product's historical data and then used to generate future predictions. Is there any functionality in PySpark that would let me parallelize this per-product modeling?

I came across pandas UDFs for a similar use case, but I'd like to know if there's a better approach out there. I also found a similar post on StackOverflow, but it's in Scala.
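For context, the pandas UDF pattern I came across looks roughly like this: group the data by product and let Spark call a training function once per group, in parallel across executors. This is just a sketch, not working production code - the table name (sales_history), the feature/target columns (lag_1, lag_2, y), and the use of scikit-learn inside the UDF (rather than MLlib, since each per-product model fits on a single worker) are all placeholders on my side:

```python
import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, DoubleType
from sklearn.ensemble import RandomForestRegressor

spark = SparkSession.builder.getOrCreate()

# Schema of the per-product output returned by the grouped UDF
result_schema = StructType([
    StructField("product_id", StringType()),
    StructField("forecast", DoubleType()),
])

def train_and_forecast(pdf: pd.DataFrame) -> pd.DataFrame:
    # pdf holds the full history for a single product
    features = pdf[["lag_1", "lag_2"]]   # placeholder feature columns
    target = pdf["y"]                    # placeholder target column
    model = RandomForestRegressor(n_estimators=100)
    model.fit(features, target)
    # Predict one step ahead from the latest observed features
    next_point = features.tail(1)
    return pd.DataFrame({
        "product_id": [pdf["product_id"].iloc[0]],
        "forecast": model.predict(next_point),
    })

# One group (= one product) per UDF call; Spark schedules groups in parallel
forecasts = (
    spark.table("sales_history")         # placeholder source table
         .groupBy("product_id")
         .applyInPandas(train_and_forecast, schema=result_schema)
)
```

As I understand it, applyInPandas ships each product's rows to a single worker as a pandas DataFrame, so this scales with the number of products rather than the size of any one series - which seems to match my use case, but I'd welcome corrections.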

Any help is highly appreciated!

/r/bigdata
https://redd.it/r6cli9