Reinforcement learning (RL) is great for tasks with a well-defined reward function, as evidenced by the successes of AlphaZero in Go, OpenAI Five in Dota, and AlphaStar in StarCraft. But in practice it is not always possible to define the reward function clearly. For example, in a simple room-cleaning task, an old business card found under the bed or a used concert ticket may look like trash, but if they are valuable to the host they should not be thrown away. Even when you set clear criteria for evaluating an object, converting them into rewards is not easy. If you reward the agent every time it collects garbage, it can throw the garbage back out in order to collect it again and receive more reinforcement.
This behavior can be prevented by learning a reward function from human feedback on the agent's behavior. But this approach requires a lot of resources: in particular, training a Deep RL Cheetah model in OpenAI Gym with MuJoCo requires around 700 human comparisons.
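To make the feedback-based approach concrete, here is a minimal sketch of learning a reward model from pairwise human comparisons (a Bradley-Terry style preference loss). It is an illustration of the general idea, not the exact implementation referenced above; the class and function names are placeholders.

```python
import torch
import torch.nn as nn

class RewardModel(nn.Module):
    """Predicts a scalar reward for each state; trained from human comparisons."""
    def __init__(self, obs_dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, states):            # states: (T, obs_dim)
        return self.net(states).sum()     # total predicted reward of a trajectory segment

def preference_loss(reward_model, traj_a, traj_b, human_prefers_a):
    """Cross-entropy on which of two trajectory segments the human preferred."""
    r_a = reward_model(traj_a)
    r_b = reward_model(traj_b)
    p_a = torch.sigmoid(r_a - r_b)        # Bradley-Terry probability that A is preferred
    target = torch.tensor(1.0 if human_prefers_a else 0.0)
    return nn.functional.binary_cross_entropy(p_a, target)
```

Each human comparison contributes one such loss term, which is why hundreds of comparisons are needed before the learned reward becomes useful.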
Therefore, researchers David Lindner and Rohin Shah at the University of California, Berkeley proposed an RL algorithm that forms a reward policy from implicit information, without human supervision or an explicitly specified reward function. They named it RLSP (Reward Learning by Simulating the Past), because the reward is learned by simulating the past: the agent draws inferences about human preferences from the state it observes, without explicit feedback. The main difficulty in scaling RLSP is reasoning about previous experience when the amount of data is large. Instead of enumerating all possible pasts, the authors propose sampling the most probable past trajectories, alternating between predicting past actions and predicting the past states from which those actions were taken.
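A rough sketch of that backward-simulation loop is shown below. The components `inverse_policy` (which action most likely led to this state) and `inverse_dynamics` (which state that action was taken from) are assumed to be learned models; their names and signatures are illustrative, not taken from the authors' code.

```python
def sample_past_trajectory(observed_state, inverse_policy, inverse_dynamics, horizon):
    """Sample one likely past trajectory ending in the observed state."""
    trajectory = [observed_state]
    state = observed_state
    for _ in range(horizon):
        action = inverse_policy(state)           # predict the action that led here
        state = inverse_dynamics(state, action)  # predict the state it was taken from
        trajectory.append(state)
    trajectory.reverse()                         # order from past to present
    return trajectory
```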
The RLSP algorithm uses gradient ascent to iteratively update a linear reward function so that it explains the observed state. Scaling this idea is possible by learning a feature representation of each state and modeling the reward as a linear function of those features, then approximating the RLSP gradient by sampling the most likely past trajectories. The gradient pushes the reward function so that backward trajectories (what must have happened in the past) and forward trajectories (what the agent would do under the current reward) become consistent with each other. Once the trajectories are consistent, the gradient becomes zero, and the reward function most likely to have produced the observed state has been found. The essence of the RLSP algorithm is to perform gradient ascent with this gradient. The algorithm was tested in the MuJoCo simulator, an environment for evaluating RL algorithms on the task of teaching simulated robots to move along optimal trajectories. The results showed that policies trained with the RLSP-inferred reward perform as well as those trained directly on the true reward function.
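The update can be sketched as follows, assuming a reward of the form r(s) = w · φ(s): the gradient is approximated as the difference between the average feature counts of sampled backward trajectories and of forward rollouts under the current reward, so it vanishes exactly when the two agree. The helper names (`feature_fn`, `sample_backward_trajectories`, `rollout_forward`) are placeholders for illustration, not the authors' API.

```python
import numpy as np

def rlsp_style_update(w, observed_state, feature_fn,
                      sample_backward_trajectories, rollout_forward,
                      lr=0.01, n_samples=32):
    """One gradient-ascent step on the weights of a linear reward w . phi(s)."""
    backward = sample_backward_trajectories(observed_state, n_samples)  # likely pasts
    forward = rollout_forward(w, observed_state, n_samples)             # rollouts under current reward

    phi_backward = np.mean([feature_fn(s) for traj in backward for s in traj], axis=0)
    phi_forward = np.mean([feature_fn(s) for traj in forward for s in traj], axis=0)

    grad = phi_backward - phi_forward   # zero once backward and forward trajectories agree
    return w + lr * grad                # gradient ascent on the reward weights
```

Repeating this step until the gradient is (approximately) zero yields the reward function under which the observed state is the natural outcome of past behavior.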
https://bair.berkeley.edu/blog/2021/05/03/rlsp/