simulate_MDP {pomdp}    R Documentation
Description:

Simulate trajectories through an MDP. The start state for each trajectory is randomly chosen using the specified probability distribution over states. Actions are then chosen following an epsilon-greedy policy, and the state is updated according to the model's transition probabilities.
Usage:

simulate_MDP(
  model,
  n = 100,
  start = NULL,
  horizon = NULL,
  visited_states = FALSE,
  epsilon = NULL,
  verbose = FALSE
)
Arguments:

model: an MDP model.

n: number of trajectories.

start: probability distribution over the states for choosing the starting states for the trajectories. Defaults to "uniform".

horizon: number of epochs for the simulation. If NULL, the horizon of the model is used.

visited_states: logical; should all states visited on the trajectories be returned? If FALSE, only the final state of each trajectory is returned.

epsilon: the probability of random actions for using an epsilon-greedy policy (see the sketch after this argument list). Default is 0 for solved models and 1 for unsolved models.

verbose: report used parameters.
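To make the epsilon and horizon arguments concrete, here is a minimal, self-contained sketch of how an epsilon-greedy simulation proceeds for a toy two-state MDP. It is only an illustration under assumed toy transition matrices and a made-up fixed policy, not the implementation used by simulate_MDP().

# toy two-state, two-action MDP (made-up numbers for illustration only)
set.seed(1)
states  <- c("s_1", "s_2")
actions <- c("a_1", "a_2")
trans <- list(
  a_1 = matrix(c(0.9, 0.1, 0.2, 0.8), nrow = 2, byrow = TRUE,
               dimnames = list(states, states)),
  a_2 = matrix(c(0.5, 0.5, 0.5, 0.5), nrow = 2, byrow = TRUE,
               dimnames = list(states, states))
)
greedy <- c(s_1 = "a_1", s_2 = "a_2")   # a fixed toy policy

epsilon <- 0.1
horizon <- 10
state <- "s_1"
for (epoch in seq_len(horizon)) {
  # with probability epsilon take a random action, otherwise follow the policy
  action <- if (runif(1) < epsilon) sample(actions, 1) else greedy[[state]]
  # draw the next state from the transition probabilities of the chosen action
  state <- sample(states, 1, prob = trans[[action]][state, ])
}
state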
Value:

A vector with state ids (for the final epoch, or all visited states if visited_states = TRUE). Attributes containing action counts and rewards for each trajectory may be available.
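For example, the effect of visited_states on the length of the returned vector can be checked like this (a short sketch assuming the solved Maze model sol from the examples below):

# one state id per trajectory (the final epoch only)
final <- simulate_MDP(sol, n = 10, horizon = 10)
length(final)

# all states visited along the trajectories
visited <- simulate_MDP(sol, n = 10, horizon = 10, visited_states = TRUE)
length(visited)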
Author(s): Michael Hahsler

Examples:
data(Maze)
# solve the MDP with no discounting
sol <- solve_MDP(Maze, discount = 1)
sol
policy(sol)
## Example 1: simulate 10 trajectories; only the final state of each trajectory is returned
sim <- simulate_MDP(sol, n = 10, horizon = 10, verbose = TRUE)
head(sim)
# additional data is available as attributes
names(attributes(sim))
attr(sim, "avg_reward")
colMeans(attr(sim, "action"))
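# sim is a vector of state ids, so base R summaries apply; for example,
# a frequency table of the final states across the 10 trajectories:
table(sim)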
## Example 2: simulate starting always in state s_1
sim <- simulate_MDP(sol, n = 100, start = "s_1", horizon = 10)
sim
# the average reward is an estimate of the utility of the start state s_1 in the optimal policy:
policy(sol)[[1]][1,]
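# the simulated estimate is available in the "avg_reward" attribute
# (see Example 1 above) and can be compared to the utility shown above
attr(sim, "avg_reward")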