from pyCRLD.Environments.EcologicalPublicGood import EcologicalPublicGood as EPG
from pyCRLD.Agents.StrategyActorCritic import stratAC
Base
The agent base class contains core methods to compute the strategy-average reward-prediction error.
abase
abase (TransitionTensor:numpy.ndarray, RewardTensor:numpy.ndarray, DiscountFactors:Iterable[float], use_prefactor=False, opteinsum=True)
Base class for deterministic strategy-average independent (multi-agent) temporal-difference reinforcement learning.
| | Type | Default | Details |
|---|---|---|---|
| TransitionTensor | ndarray | | transition model of the environment |
| RewardTensor | ndarray | | reward model of the environment |
| DiscountFactors | Iterable | | the agents' discount factors |
| use_prefactor | bool | False | use the 1-DiscountFactor prefactor |
| opteinsum | bool | True | optimize einsum functions |
Strategy averaging
Core methods to compute the strategy-average reward-prediction error
abase.Tss
abase.Tss (Xisa:jax.Array)
Compute average transition model Tss, given joint strategy Xisa.

| | Type | Details |
|---|---|---|
| Xisa | Array | Joint strategy |
| Returns | Array | Average transition matrix |
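A minimal numpy sketch of what this averaging computes for N=2 agents, assuming the index conventions Xisa[i, s, a] and T[s, a¹, a², s′] (this illustrates the formula, not the library's implementation):

```python
import numpy as np

Z, M = 2, 2
rng = np.random.default_rng(0)
T = rng.random((Z, M, M, Z)); T /= T.sum(-1, keepdims=True)  # T[s, a1, a2, s']
X = rng.random((2, Z, M));    X /= X.sum(-1, keepdims=True)  # Xisa[i, s, a]

# Tss[s, s'] = sum_{a1, a2} X[0, s, a1] * X[1, s, a2] * T[s, a1, a2, s']
Tss = np.einsum('sj,sk,sjkt->st', X[0], X[1], T)
assert np.allclose(Tss.sum(-1), 1.0)  # rows remain probability distributions
```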
abase.Tisas
abase.Tisas (Xisa:jax.Array)
Compute average transition model Tisas, given joint strategy Xisa.

| | Type | Details |
|---|---|---|
| Xisa | Array | Joint strategy |
| Returns | Array | Average transition Tisas |
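For intuition, a sketch of the agent-wise average Tisas[i, s, a, s′] for N=2 under the same assumed index conventions; here only the other agent's strategy is averaged out (not the library's implementation):

```python
import numpy as np

Z, M = 2, 2
rng = np.random.default_rng(0)
T = rng.random((Z, M, M, Z)); T /= T.sum(-1, keepdims=True)  # T[s, a1, a2, s']
X = rng.random((2, Z, M));    X /= X.sum(-1, keepdims=True)  # Xisa[i, s, a]

# Tisas[0, s, a, s'] = sum_{a2} X[1, s, a2] * T[s, a, a2, s']
# Tisas[1, s, a, s'] = sum_{a1} X[0, s, a1] * T[s, a1, a, s']
Tisas = np.stack([np.einsum('sk,sakt->sat', X[1], T),
                  np.einsum('sj,sjat->sat', X[0], T)])
```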
abase.Ris
abase.Ris (Xisa:jax.Array, Risa:jax.Array=None)
Compute average reward Ris, given joint strategy Xisa.

| | Type | Default | Details |
|---|---|---|---|
| Xisa | Array | | Joint strategy |
| Risa | Array | None | Optional reward for speed-up |
| Returns | Array | | Average reward |
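The optional Risa argument hints at the relationship to the state-action average documented next. A sketch, assuming Ris[i, s] is the strategy-weighted average of a precomputed Risa[i, s, a] (placeholder inputs, not the library's code):

```python
import numpy as np

N, Z, M = 2, 2, 2
rng = np.random.default_rng(0)
X    = rng.random((N, Z, M)); X /= X.sum(-1, keepdims=True)  # Xisa[i, s, a]
Risa = rng.random((N, Z, M))                                 # assumed precomputed

# Ris[i, s] = sum_a X[i, s, a] * Risa[i, s, a]
Ris = np.einsum('isa,isa->is', X, Risa)
```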
abase.Risa
abase.Risa (Xisa:jax.Array)
Compute average reward Risa, given joint strategy Xisa.

| | Type | Details |
|---|---|---|
| Xisa | Array | Joint strategy |
| Returns | Array | Average reward |
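A sketch of the averaging behind Risa for N=2, assuming the index conventions R[i, s, a¹, a², s′] and T[s, a¹, a², s′]; only the other agent's strategy and the next state are averaged out (illustrative, not the library's implementation):

```python
import numpy as np

Z, M = 2, 2
rng = np.random.default_rng(0)
T = rng.random((Z, M, M, Z)); T /= T.sum(-1, keepdims=True)  # T[s, a1, a2, s']
R = rng.random((2, Z, M, M, Z))                              # R[i, s, a1, a2, s']
X = rng.random((2, Z, M));    X /= X.sum(-1, keepdims=True)  # Xisa[i, s, a]

# Risa[0, s, a] = sum_{a2, s'} X[1, s, a2] * T[s, a, a2, s'] * R[0, s, a, a2, s']
Risa0 = np.einsum('sk,sakt,sakt->sa', X[1], T, R[0])
# Risa[1, s, a] = sum_{a1, s'} X[0, s, a1] * T[s, a1, a, s'] * R[1, s, a1, a, s']
Risa1 = np.einsum('sj,sjat,sjat->sa', X[0], T, R[1])
Risa  = np.stack([Risa0, Risa1])
```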
abase.Vis
abase.Vis (Xisa:jax.Array, Ris:jax.Array=None, Tss:jax.Array=None, Risa:jax.Array=None)
Compute average state values Vis, given joint strategy Xisa.

| | Type | Default | Details |
|---|---|---|---|
| Xisa | Array | | Joint strategy |
| Ris | Array | None | Optional reward for speed-up |
| Tss | Array | None | Optional transition for speed-up |
| Risa | Array | None | Optional reward for speed-up |
| Returns | Array | | Average state values |
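Given Ris and Tss, the state values satisfy the Bellman relation V_i = pre · R_i + γ_i · Tss · V_i, where pre = 1 − γ_i if the use_prefactor option is set and 1 otherwise. A sketch of that linear solve with placeholder inputs (not the library's code):

```python
import numpy as np

N, Z = 2, 3
rng = np.random.default_rng(0)
Tss = rng.random((Z, Z)); Tss /= Tss.sum(-1, keepdims=True)  # averaged transitions
Ris = rng.random((N, Z))                                     # averaged rewards
gammas = np.array([0.9, 0.9])
use_prefactor = True
pre = (1 - gammas) if use_prefactor else np.ones(N)

# Vis[i] solves (I - gamma_i * Tss) Vis[i] = pre_i * Ris[i]
Vis = np.stack([np.linalg.solve(np.eye(Z) - g * Tss, p * r)
                for g, r, p in zip(gammas, Ris, pre)])
```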
abase.Qisa
abase.Qisa (Xisa:jax.Array, Risa:jax.Array=None, Vis:jax.Array=None, Tisas:jax.Array=None)
Compute average state-action values Qisa, given joint strategy Xisa
| | Type | Default | Details |
|---|---|---|---|
| Xisa | Array | | Joint strategy |
| Risa | Array | None | Optional reward for speed-up |
| Vis | Array | None | Optional values for speed-up |
| Tisas | Array | None | Optional transition for speed-up |
| Returns | Array | | Average state-action values |
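The state-action values follow from Risa, Tisas, and Vis via Q_i(s, a) = pre · R_i(s, a) + γ_i Σ_{s′} T_i(s, a, s′) V_i(s′). A sketch with placeholder inputs, assuming the prefactor multiplies the reward term when use_prefactor is set (not the library's code):

```python
import numpy as np

N, Z, M = 2, 3, 2
rng = np.random.default_rng(0)
Risa  = rng.random((N, Z, M))                                      # averaged rewards
Tisas = rng.random((N, Z, M, Z)); Tisas /= Tisas.sum(-1, keepdims=True)
Vis   = rng.random((N, Z))                                         # averaged state values
gammas = np.array([0.9, 0.9])
pre = 1 - gammas   # assuming use_prefactor=True; otherwise use 1

# Qisa[i, s, a] = pre_i * Risa[i, s, a] + gamma_i * sum_{s'} Tisas[i, s, a, s'] * Vis[i, s']
Qisa = pre[:, None, None] * Risa \
       + gammas[:, None, None] * np.einsum('isat,it->isa', Tisas, Vis)
```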
Helpers
abase.Ps
abase.Ps (Xisa:jax.Array)
Compute stationary state distribution Ps, given joint strategy Xisa.

| | Type | Details |
|---|---|---|
| Xisa | Array | Joint strategy |
| Returns | Array | Stationary state distribution |
Ps uses the compute_stationarydistribution function.
env = EPG(N=2, f=1.2, c=5, m=-5, qc=0.2, qr=0.01, degraded_choice=False)
MAEi = stratAC(env=env, learning_rates=0.1, discount_factors=0.99, use_prefactor=True)
x = MAEi.random_softmax_strategy()
MAEi._numpyPs(x)
array([0.91309416, 0.08690587], dtype=float32)
MAEi.Ps(x)
Array([0.91309416, 0.08690587], dtype=float32)
abase.Ri
abase.Ri (Xisa:jax.Array)
Compute average reward Ri, given joint strategy Xisa.

| | Type | Details |
|---|---|---|
| Xisa | Array | Joint strategy Xisa |
| Returns | Array | Average reward Ri |
MAEi.Ri(x)
Array([-4.6322937, -4.5121984], dtype=float32)
abase.trajectory
abase.trajectory (Xinit:jax.Array, Tmax:int=100, tolerance:float=None, verbose=False, **kwargs)
Compute a joint learning trajectory.
| | Type | Default | Details |
|---|---|---|---|
| Xinit | Array | | Initial condition |
| Tmax | int | 100 | the maximum number of iteration steps |
| tolerance | float | None | to determine if a fixed point is reached |
| verbose | bool | False | Say something during computation? |
| kwargs | | | |
| Returns | tuple | | (trajectory, fixpointreached) |
trajectory is an Array containing the time-evolution of the dynamic variable. fixpointreached is a bool saying whether or not a fixed point has been reached.
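A usage sketch following the documented signature, reusing the environment and agent construction from the example above; the Tmax and tolerance values are purely illustrative:

```python
from pyCRLD.Environments.EcologicalPublicGood import EcologicalPublicGood as EPG
from pyCRLD.Agents.StrategyActorCritic import stratAC

env = EPG(N=2, f=1.2, c=5, m=-5, qc=0.2, qr=0.01, degraded_choice=False)
MAEi = stratAC(env=env, learning_rates=0.1, discount_factors=0.99, use_prefactor=True)
x = MAEi.random_softmax_strategy()

# Iterate the learning dynamics from the random initial strategy
trj, fixpointreached = MAEi.trajectory(x, Tmax=1000, tolerance=1e-8)
print(trj.shape, fixpointreached)
```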
abase._OtherAgentsActionsSummationTensor
abase._OtherAgentsActionsSummationTensor ()
To sum over the other agents and their respective actions using einsum.
To obtain the strategy-average reward-prediction error for agent \(i\), we need to average out the probabilities contained in the strategies of all other agents \(j \neq i\) and the transition function \(T\),
\[ \sum_{a^j} \sum_{s'} \prod_{j\neq i} X^j(s, a^j) T(s, \mathbf a, s'). \]
The _OtherAgentsActionsSummationTensor enables this summation to be executed with the efficient einsum function. It contains only \(0\)s and \(1\)s and is of dimension
\[ N \times \underbrace{N \times ... \times N}_{(N-1) \text{ times}} \times M \times \underbrace{M \times ... \times M}_{N \text{ times}} \times \underbrace{M \times ... \times M}_{(N-1) \text{ times}} \]
which represents
\[ \overbrace{N}^{\text{the focal agent}} \times \overbrace{\underbrace{N \times ... \times N}_{(N-1) \text{ times}}}^\text{all other agents} \times \overbrace{M}^\text{focal agent's action} \times \overbrace{\underbrace{M \times ... \times M}_{N \text{ times}}}^\text{all actions} \times \overbrace{\underbrace{M \times ... \times M}_{(N-1) \text{ times}}}^\text{all other agents' actions} \]
It contains a \(1\) only if
- all agent indices (comprised of the focal agent index and all other agents' indices) are different from each other,
- the focal agent's action index matches the focal agent's action index in all actions,
- and all other agents' action indices match their corresponding action indices in all actions.
Otherwise it contains a \(0\).
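To make these conditions concrete, here is a minimal sketch that reconstructs such an indicator tensor directly from them with a brute-force loop for small N and M (illustrative only, not the library's implementation):

```python
import itertools
import numpy as np

def other_agents_actions_summation_tensor(N: int, M: int) -> np.ndarray:
    """Indicator tensor of shape N x N^(N-1) x M x M^N x M^(N-1),
    built brute-force from the three conditions stated above."""
    shape = (N,) + (N,) * (N - 1) + (M,) + (M,) * N + (M,) * (N - 1)
    omega = np.zeros(shape, dtype=int)
    for idx in itertools.product(*(range(d) for d in shape)):
        focal = idx[0]                        # focal agent index
        others = idx[1:N]                     # other agents' indices
        a = idx[N]                            # focal agent's action
        all_actions = idx[N + 1:2 * N + 1]    # one action per agent
        other_actions = idx[2 * N + 1:]       # other agents' actions
        if (len(set((focal,) + others)) == N            # all agent indices differ
                and all_actions[focal] == a             # focal agent's action matches
                and all(all_actions[j] == c             # other agents' actions match
                        for j, c in zip(others, other_actions))):
            omega[idx] = 1
    return omega

omega = other_agents_actions_summation_tensor(N=2, M=2)
print(omega.shape)  # (2, 2, 2, 2, 2, 2)
```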