An MDP is defined by a tuple $\langle S, A, T, R, \gamma, b_0 \rangle$, where $S$ is a set of states, $A$ is a set of actions, the state transition function $T(s,a,s')$ gives the probability of moving from state $s$ to $s'$ when action $a$ is taken, $R(s,a)$ is the immediate reward for taking action $a$ in state $s$, $\gamma \in [0,1)$ is the discount factor on future reward, and $b_0(s)$ specifies the probability of starting the process in state $s$. In RL, a control policy $\pi$ is a mapping from states to actions, i.e., $\pi : S \to A$. The long-term *value* of $\pi$ from a starting state $s$ can be calculated by $V^\pi(s) = R(s,\pi(s)) + \gamma \sum_{s' \in S} T(s,\pi(s),s')\, V^\pi(s')$, and thus the value of $\pi$ over all possible starting states is the expectation $V^\pi = \sum_{s \in S} b_0(s)\, V^\pi(s)$. A common way to represent a control policy is to introduce a Q-function $Q(s,a;\lambda)$ with unknown control parameters $\lambda$, and let the policy be $a(s) = \arg\max_{a \in A} Q(s,a;\lambda)$. RL identifies the optimal $\lambda$ that maximizes $V^\pi$.
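The definitions above can be sketched numerically: for a fixed policy $\pi$, the recursive value equation is a linear system $V^\pi = R_\pi + \gamma T_\pi V^\pi$ that can be solved exactly, and the overall value is the expectation of $V^\pi(s)$ under $b_0$. Below is a minimal sketch on a small toy MDP; the state/action counts, the random $T$, $R$, and $Q$ tables, and the helper name `policy_value` are all illustrative assumptions, not part of the text.

```python
import numpy as np

# Toy MDP with 3 states and 2 actions (illustrative sizes, not from the text).
n_states, n_actions = 3, 2
gamma = 0.9
rng = np.random.default_rng(0)

# T[s, a, s'] = probability of moving from s to s' under action a;
# each row is normalized so it is a valid distribution over s'.
T = rng.random((n_states, n_actions, n_states))
T /= T.sum(axis=2, keepdims=True)

R = rng.random((n_states, n_actions))    # R(s, a): immediate reward
b0 = np.full(n_states, 1.0 / n_states)   # b0(s): uniform start distribution

def policy_value(pi):
    """Value of deterministic policy pi: solve (I - gamma*T_pi) V = R_pi,
    then take the expectation over the start distribution b0."""
    T_pi = T[np.arange(n_states), pi]    # T(s, pi(s), s'), shape (S, S)
    R_pi = R[np.arange(n_states), pi]    # R(s, pi(s)),      shape (S,)
    V = np.linalg.solve(np.eye(n_states) - gamma * T_pi, R_pi)
    return b0 @ V

# Policy induced by a (here random) Q-table: a(s) = argmax_a Q(s, a; lambda).
Q = rng.random((n_states, n_actions))
pi = Q.argmax(axis=1)
print(policy_value(pi))
```

Solving the linear system is exact for tabular MDPs; an RL algorithm would instead adjust the parameters $\lambda$ of $Q(s,a;\lambda)$ so that the induced greedy policy maximizes this expected value.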