from pyCRLD.Agents.StrategySARSA import stratSARSA
from pyCRLD.Agents.StrategyActorCritic import stratAC
from pyCRLD.Environments.SocialDilemma import SocialDilemma
from pyCRLD.Utils import FlowPlot as fp
import numpy as np
import matplotlib.pyplot as plt
Strategy SARSA
SARSA agents take into account five pieces of information: the current State, current Action, Reward, next State, and next Action.
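As background, here is a minimal sketch (not pyCRLD code; `sarsa_td_error` is a hypothetical helper for illustration) of the classic sample-based SARSA temporal-difference error that combines exactly these five pieces of information:

```python
import numpy as np

# Illustration only: the single-sample SARSA temporal-difference error,
# delta = r + gamma * Q(s', a') - Q(s, a), built from (s, a, r, s', a').
# pyCRLD's stratSARSA works with the deterministic, strategy-averaged
# version of this error (see RPEisa below), not with sampled transitions.
def sarsa_td_error(Q, s, a, r, s_next, a_next, gamma=0.9):
    return r + gamma * Q[s_next, a_next] - Q[s, a]

Q = np.zeros((2, 2))  # toy table: 2 states x 2 actions
delta = sarsa_td_error(Q, s=0, a=1, r=1.0, s_next=1, a_next=0)
```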
Example
```python
env = SocialDilemma(R=1.0, T=0.8, S=-0.5, P=0.0)
```
Let’s compare the SARSA learners (in red) with the actor-critic learners (in blue). The difference is that the SARSA learners incorporate an explicit exploration term in their learning update, regulated by the choice_intensities. For low choice intensities, the SARSA learners tend toward extreme exploration, i.e., toward the center of the strategy space. For high choice intensities, the SARSA learners map onto the actor-critic learners (see the figure below). For the actor-critic learners, the choice_intensities have no effect other than scaling the learning speed alongside the learning rates.
```python
fig, ax = plt.subplots(1, 4, figsize=(18, 4))
faps = np.linspace(0.01, 0.99, 9)
x = ([0], [0], [0])  # agent 0, state 0, action 0 (cooperation) on the x-axis
y = ([1], [0], [0])  # agent 1, state 0, action 0 (cooperation) on the y-axis

for i, ci in enumerate([0.1, 1.0, 10, 100]):
    maeiAC = stratAC(env=env, learning_rates=0.1, discount_factors=0.9, choice_intensities=ci)
    maeiSARSA = stratSARSA(env=env, learning_rates=0.1, discount_factors=0.9, choice_intensities=ci)

    fp.plot_strategy_flow(maeiAC, x, y, flowarrow_points=faps, cmap="Blues", axes=[ax[i]])
    fp.plot_strategy_flow(maeiSARSA, x, y, flowarrow_points=faps, cmap="Reds", axes=[ax[i]]);

    ax[i].set_xlabel("Agent 0's cooperation probability")
    ax[i].set_ylabel("Agent 1's cooperation probability")
    ax[i].set_title("Intensity of choice {}".format(ci))
```
API
stratSARSA
stratSARSA (env, learning_rates:Union[float,Iterable], discount_factors:Union[float,Iterable], choice_intensities:Union[float,Iterable]=1.0, use_prefactor=False, opteinsum=True, **kwargs)
Class for CRLD-SARSA agents in strategy space.
|  | Type | Default | Details |
|---|---|---|---|
| env |  |  | An environment object |
| learning_rates | Union |  | agents' learning rates |
| discount_factors | Union |  | agents' discount factors |
| choice_intensities | Union | 1.0 | agents' choice intensities |
| use_prefactor | bool | False | use the 1-DiscountFactor prefactor |
| opteinsum | bool | True | optimize einsum functions |
| kwargs |  |  |  |
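For instance, a SARSA agent for the social dilemma defined above can be constructed from these documented arguments (the parameter values here are just an illustrative choice):

```python
# A single float or one value per agent can be passed for the
# Union[float, Iterable] parameters.
mae = stratSARSA(env=env,
                 learning_rates=0.1,
                 discount_factors=0.9,
                 choice_intensities=10.0)
```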
stratSARSA.RPEisa
stratSARSA.RPEisa (Xisa, norm=False)
Compute the reward-prediction/temporal-difference error for strategy SARSA dynamics, given joint strategy Xisa.
|  | Type | Default | Details |
|---|---|---|---|
| Xisa |  |  | Joint strategy |
| norm | bool | False | normalize error around actions? |
| Returns | ndarray |  | RP/TD error |
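For example, the error for a random joint strategy could be obtained as follows (a sketch assuming the random_softmax_strategy helper, which is used on the actor-critic agent in the consistency check at the end of this page, is also available on stratSARSA):

```python
mae = stratSARSA(env=env, learning_rates=0.1, discount_factors=0.9,
                 choice_intensities=10.0)
X = mae.random_softmax_strategy()  # random joint strategy Xisa
delta = mae.RPEisa(X)              # RP/TD error, indexed like Xisa (agent i, state s, action a)
```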
stratSARSA.NextQisa
stratSARSA.NextQisa (Xisa, Qisa=None, Risa=None, Vis=None, Tisas=None)
Compute the strategy-average next state-action value for agent i, current state s, and action a.
|  | Type | Default | Details |
|---|---|---|---|
| Xisa |  |  | Joint strategy |
| Qisa | NoneType | None | Optional state-action values for speed-up |
| Risa | NoneType | None | Optional rewards for speed-up |
| Vis | NoneType | None | Optional state values for speed-up |
| Tisas | NoneType | None | Optional transition for speed-up |
| Returns | Array |  | Next values |
Note that although stratSARSA.NextQisa is computed differently from stratAC.NextVisa, they actually yield identical values.
```python
ci = 100 * np.random.rand()  # random choice intensity

maeAC = stratAC(env=env, learning_rates=0.1, discount_factors=0.9, choice_intensities=ci)
maeSARSA = stratSARSA(env=env, learning_rates=0.1, discount_factors=0.9, choice_intensities=ci)

X = maeAC.random_softmax_strategy()

# Check that both computations agree on a random joint strategy
assert np.allclose(maeAC.NextVisa(X) - maeSARSA.NextQisa(X), 0, atol=1e-05)
```