machin.frame

algorithms

Base

class machin.frame.algorithms.base.TorchFramework[source]

Bases: object

Base framework for all algorithms

enable_multiprocessing()[source]

Enable multiprocessing for all modules.

classmethod generate_config(config)[source]
Parameters

config (Union[Dict[str, Any], machin.utils.conf.Config]) –

classmethod get_restorable_model_names()[source]

Get attribute name of restorable nn models.

classmethod get_top_model_names()[source]

Get attribute name of top level nn models.

classmethod init_from_config(config, model_device='cpu')[source]
Parameters
  • config (Union[Dict[str, Any], machin.utils.conf.Config]) –

  • model_device (Union[str, torch.device]) –

classmethod is_distributed()[source]

Whether this framework is a distributed framework which requires multiple processes to run and depends on torch.distributed or torch.distributed.rpc.

load(model_dir, network_map=None, version=-1)[source]

Load models.

An example of network map:

{"restorable_model_1": "file_name_1",
 "restorable_model_2": "file_name_2"}

Get keys by calling <Class name>.get_restorable_model_names().

Parameters
  • model_dir (str) – Save directory.

  • network_map (Dict[str, str]) – Key is module name, value is saved name.

  • version (int) – Version number of the save to be loaded.

save(model_dir, network_map=None, version=0)[source]

Save models.

An example of network map:

{"restorable_model_1": "file_name_1",
 "restorable_model_2": "file_name_2"}

Get keys by calling <Class name>.get_restorable_model_names().

Parameters
  • model_dir (str) – Save directory.

  • network_map (Dict[str, str]) – Key is module name, value is saved name.

  • version (int) – Version number of the new save.
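
For illustration, a hedged usage sketch (fw stands for an already constructed framework instance, such as a DDPG object; the mapping keys are the placeholders from the example above):

# fw is assumed to be an already constructed framework instance.
# Keys of network_map must be names returned by get_restorable_model_names();
# "restorable_model_1" and "file_name_1" are placeholders.
fw.save("./checkpoints", network_map={"restorable_model_1": "file_name_1"}, version=3)
# Restore the same save later; with the default version=-1, load should pick
# the most recent save found in the directory.
fw.load("./checkpoints", network_map={"restorable_model_1": "file_name_1"}, version=3)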

set_backward_function(backward_func)[source]

Replace the default backward function with a custom function. The default loss backward function is torch.autograd.backward

Parameters

backward_func (Callable) –

visualize_model(final_tensor, name, directory)[source]
Parameters
  • final_tensor (torch.Tensor) –

  • name (str) –

  • directory (str) –

property backward_function
property lr_schedulers
property optimizers
property restorable_models
property top_models

DDPG

class machin.frame.algorithms.ddpg.DDPG(actor, actor_target, critic, critic_target, optimizer, criterion, *_, lr_scheduler=None, lr_scheduler_args=None, lr_scheduler_kwargs=None, batch_size=100, update_rate=0.001, update_steps=None, actor_learning_rate=0.0005, critic_learning_rate=0.001, discount=0.99, gradient_max=inf, replay_size=500000, replay_device='cpu', replay_buffer=None, visualize=False, visualize_dir='', **__)[source]

Bases: machin.frame.algorithms.base.TorchFramework

DDPG framework.

Note

Your optimizer will be called as:

optimizer(network.parameters(), learning_rate)

Your lr_scheduler will be called as:

lr_scheduler(
    optimizer,
    *lr_scheduler_args[0],
    **lr_scheduler_kwargs[0],
)

Your criterion will be called as:

criterion(
    target_value.view(batch_size, 1),
    predicted_value.view(batch_size, 1)
)

Note

DDPG supports two ways of updating the target network. The first is polyak (soft) update, which updates the target network in every training step by mixing its weights with those of the online network using update_rate.

The other way is hard update, which copies the weights of the online network after every update_steps training steps.

Specify either update_rate or update_steps to select an update scheme; if both are specified, an error will be raised.

These two different update schemes may result in different training stability.
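
Conceptually, the two schemes correspond to the following parameter updates (an illustrative sketch, not the framework's internal code):

import torch as t

def soft_update(target_net: t.nn.Module, online_net: t.nn.Module, update_rate: float):
    # Polyak update, applied every training step:
    # theta_target = theta_online * tau + theta_target * (1 - tau)
    with t.no_grad():
        for tp, op in zip(target_net.parameters(), online_net.parameters()):
            tp.data.copy_(op.data * update_rate + tp.data * (1.0 - update_rate))

def hard_update(target_net: t.nn.Module, online_net: t.nn.Module):
    # Hard update, applied once every update_steps training steps:
    # copy the online weights verbatim.
    target_net.load_state_dict(online_net.state_dict())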

Parameters
  • actor (Union[machin.model.nets.base.NeuralNetworkModule, torch.nn.modules.module.Module]) – Actor network module.

  • actor_target (Union[machin.model.nets.base.NeuralNetworkModule, torch.nn.modules.module.Module]) – Target actor network module.

  • critic (Union[machin.model.nets.base.NeuralNetworkModule, torch.nn.modules.module.Module]) – Critic network module.

  • critic_target (Union[machin.model.nets.base.NeuralNetworkModule, torch.nn.modules.module.Module]) – Target critic network module.

  • optimizer (Callable) – Optimizer used to optimize actor and critic.

  • criterion (Callable) – Criterion used to evaluate the value loss.

  • lr_scheduler (Callable) – Learning rate scheduler of optimizer.

  • lr_scheduler_args (Tuple[Tuple, Tuple]) – Arguments of the learning rate scheduler.

  • lr_scheduler_kwargs (Tuple[Dict, Dict]) – Keyword arguments of the learning rate scheduler.

  • batch_size (int) – Batch size used during training.

  • update_rate (float) –

    \(\tau\) used to update target networks. Target parameters are updated as:

    \(\theta_t = \theta * \tau + \theta_t * (1 - \tau)\)

  • update_steps (Optional[int]) – Training step number used to update target networks.

  • actor_learning_rate (float) – Learning rate of the actor optimizer, not compatible with lr_scheduler.

  • critic_learning_rate (float) – Learning rate of the critic optimizer, not compatible with lr_scheduler.

  • discount (float) – \(\gamma\) used in the bellman function.

  • replay_size (int) – Replay buffer size. Not compatible with replay_buffer.

  • replay_device (Union[str, torch.device]) – Device on which the replay buffer is located. Not compatible with replay_buffer.

  • replay_buffer (machin.frame.buffers.buffer.Buffer) – Custom replay buffer.

  • visualize (bool) – Whether to visualize the network flow in the first pass.

  • visualize_dir (str) – Visualized graph save directory.

  • gradient_max (float) –
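
A minimal construction sketch (hypothetical actor and critic modules for a 3-dimensional state and a 1-dimensional action; the optimizer and criterion follow the call conventions in the notes above):

import torch as t
import torch.nn as nn
from machin.frame.algorithms import DDPG

class Actor(nn.Module):
    def __init__(self, state_dim, action_dim, action_range):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(state_dim, 16), nn.ReLU(),
            nn.Linear(16, action_dim), nn.Tanh(),
        )
        self.action_range = action_range

    def forward(self, state):
        # Scale the tanh output to the action range.
        return self.fc(state) * self.action_range

class Critic(nn.Module):
    def __init__(self, state_dim, action_dim):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(state_dim + action_dim, 16), nn.ReLU(),
            nn.Linear(16, 1),
        )

    def forward(self, state, action):
        # Q(s, a): the critic receives both the state and the action.
        return self.fc(t.cat([state, action], dim=1))

actor, actor_t = Actor(3, 1, 1.0), Actor(3, 1, 1.0)
critic, critic_t = Critic(3, 1), Critic(3, 1)
ddpg = DDPG(actor, actor_t, critic, critic_t,
            t.optim.Adam, nn.MSELoss(reduction="sum"))

# Produce a noisy action for a single state.
action = ddpg.act_with_noise({"state": t.zeros(1, 3)})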

act(state, use_target=False, **__)[source]

Use actor network to produce an action for the current state.

Parameters
  • state (Dict[str, Any]) – Current state.

  • use_target (bool) – Whether to use the target network.

Returns

Anything returned by your actor network.

act_discrete(state, use_target=False, **__)[source]

Use actor network to produce a discrete action for the current state.

Notes

The actor network must output a probability tensor of shape (batch_size, action_dims), whose rows each sum to 1 along dimension 1.

Parameters
  • state (Dict[str, Any]) – Current state.

  • use_target (bool) – Whether to use the target network.

Returns

Action of shape [batch_size, 1]. Action probability tensor of shape [batch_size, action_num], produced by your actor. Any other things returned by your actor network, if they exist.

act_discrete_with_noise(state, use_target=False, choose_max_prob=0.95, **__)[source]

Use actor network to produce a noisy discrete action for the current state.

Notes

The actor network must output a probability tensor of shape (batch_size, action_dims), whose rows each sum to 1 along dimension 1.

Parameters
  • state (Dict[str, Any]) – Current state.

  • use_target (bool) – Whether to use the target network.

  • choose_max_prob (float) – Probability of choosing the largest component when the actor outputs an extreme probability vector like [0, 1, 0, 0].

Returns

Noisy action of shape [batch_size, 1]. Action probability tensor of shape [batch_size, action_num]. Any other things returned by your actor network, if they exist.

act_with_noise(state, noise_param=(0.0, 1.0), ratio=1.0, mode='uniform', use_target=False, **__)[source]

Use actor network to produce a noisy action for the current state.

Parameters
  • state (Dict[str, Any]) – Current state.

  • noise_param (Any) – Noise params.

  • ratio (float) – Noise ratio.

  • mode (str) – Noise mode. Supported are: "uniform", "normal", "clipped_normal", "ou"

  • use_target (bool) – Whether to use the target network.

Returns

Noisy action of shape [batch_size, action_dim]. Any other things returned by your actor network, if they exist.

static action_transform_function(raw_output_action, *_)[source]

The action transform function is used to transform the output of the actor to the input of the critic. The action transform function must accept:

  1. Raw action from the actor model.

  2. Concatenated Transition.next_state.

  3. Any other concatenated lists of custom keys from Transition.

and returns:
  1. A dictionary with the same form as Transition.action

Parameters

raw_output_action (Any) – Raw action from the actor model.
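
As an illustration, a subclass could override this hook to clamp the raw actor output before it reaches the critic (a hedged sketch; it assumes transitions store actions under the "action" key):

import torch as t
from machin.frame.algorithms import DDPG

class ClampedActionDDPG(DDPG):
    # Hypothetical subclass: clamp the raw actor output to [-1, 1] and wrap it
    # in a dictionary with the same form as Transition.action.
    @staticmethod
    def action_transform_function(raw_output_action, *_):
        return {"action": t.clamp(raw_output_action, -1.0, 1.0)}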

enable_multiprocessing()

Enable multiprocessing for all modules.

classmethod generate_config(config)[source]
Parameters

config (Union[Dict[str, Any], machin.utils.conf.Config]) –

classmethod get_restorable_model_names()

Get attribute name of restorable nn models.

classmethod get_top_model_names()

Get attribute name of top level nn models.

classmethod init_from_config(config, model_device='cpu')[source]
Parameters
  • config (Union[Dict[str, Any], machin.utils.conf.Config]) –

  • model_device (Union[str, torch.device]) –

classmethod is_distributed()

Whether this framework is a distributed framework which requires multiple processes to run and depends on torch.distributed or torch.distributed.rpc.

load(model_dir, network_map=None, version=-1)[source]

Load models.

An example of network map:

{"restorable_model_1": "file_name_1",
 "restorable_model_2": "file_name_2"}

Get keys by calling <Class name>.get_restorable_model_names().

Parameters
  • model_dir (str) – Save directory.

  • network_map (Dict[str, str]) – Key is module name, value is saved name.

  • version (int) – Version number of the save to be loaded.

static reward_function(reward, discount, next_value, terminal, _)[source]
save(model_dir, network_map=None, version=0)

Save models.

An example of network map:

{"restorable_model_1": "file_name_1",
 "restorable_model_2": "file_name_2"}

Get keys by calling <Class name>.get_restorable_model_names().

Parameters
  • model_dir (str) – Save directory.

  • network_map (Dict[str, str]) – Key is module name, value is saved name.

  • version (int) – Version number of the new save.

set_backward_function(backward_func)

Replace the default backward function with a custom function. The default loss backward function is torch.autograd.backward

Parameters

backward_func (Callable) –

store_episode(episode)[source]

Add a full episode of transition samples to the replay buffer.

Parameters

episode (List[Union[machin.frame.transition.Transition, Dict]]) –

store_transition(transition)[source]

Add a transition sample to the replay buffer.

Parameters

transition (Union[machin.frame.transition.Transition, Dict]) –
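
A transition is a plain dictionary holding the sub-dictionaries fed to your models plus the reward and terminal flag; an episode is a list of such transitions. A hedged sketch (ddpg is assumed to be a constructed instance, and the keys inside "state" and "action" must match whatever your actor and critic expect):

import torch as t

old_state, new_state = t.zeros(1, 3), t.zeros(1, 3)
action, reward, done = t.zeros(1, 1), 0.0, False

transition = {
    "state": {"state": old_state},       # fed to the actor and the critic
    "action": {"action": action},        # fed to the critic
    "next_state": {"state": new_state},  # used to bootstrap the target value
    "reward": reward,
    "terminal": done,
}
ddpg.store_transition(transition)
# Or store a whole episode at once:
# ddpg.store_episode([transition])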

update(update_value=True, update_policy=True, update_target=True, concatenate_samples=True, **__)[source]

Update network weights by sampling from replay buffer.

Parameters
  • update_value – Whether to update the Q network.

  • update_policy – Whether to update the actor network.

  • update_target – Whether to update targets.

  • concatenate_samples – Whether to concatenate the samples.

Returns

mean value of estimated policy value, value loss

update_lr_scheduler()[source]

Update learning rate schedulers.

visualize_model(final_tensor, name, directory)
Parameters
  • final_tensor (torch.Tensor) –

  • name (str) –

  • directory (str) –

property backward_function
property lr_schedulers
property optimizers
property restorable_models
property top_models

Hysteretic DDPG

class machin.frame.algorithms.hddpg.HDDPG(actor, actor_target, critic, critic_target, optimizer, criterion, *_, lr_scheduler=None, lr_scheduler_args=None, lr_scheduler_kwargs=None, batch_size=100, update_rate=0.005, update_steps=None, actor_learning_rate=0.0005, critic_learning_rate=0.001, discount=0.99, gradient_max=inf, q_increase_rate=1.0, q_decrease_rate=1.0, replay_size=500000, replay_device='cpu', replay_buffer=None, visualize=False, visualize_dir='', **__)[source]

Bases: machin.frame.algorithms.ddpg.DDPG

HDDPG framework.

See also

DDPG

Parameters
  • actor (Union[machin.model.nets.base.NeuralNetworkModule, torch.nn.modules.module.Module]) – Actor network module.

  • actor_target (Union[machin.model.nets.base.NeuralNetworkModule, torch.nn.modules.module.Module]) – Target actor network module.

  • critic (Union[machin.model.nets.base.NeuralNetworkModule, torch.nn.modules.module.Module]) – Critic network module.

  • critic_target (Union[machin.model.nets.base.NeuralNetworkModule, torch.nn.modules.module.Module]) – Target critic network module.

  • optimizer (Callable) – Optimizer used to optimize actor and critic.

  • criterion (Callable) – Criterion used to evaluate the value loss.

  • lr_scheduler (Callable) – Learning rate scheduler of optimizer.

  • lr_scheduler_args (Tuple[Tuple, Tuple]) – Arguments of the learning rate scheduler.

  • lr_scheduler_kwargs (Tuple[Dict, Dict]) – Keyword arguments of the learning rate scheduler.

  • batch_size (int) – Batch size used during training.

  • update_rate (float) –

    \(\tau\) used to update target networks. Target parameters are updated as:

    \(\theta_t = \theta * \tau + \theta_t * (1 - \tau)\)

  • update_steps (Optional[int]) – Training step number used to update target networks.

  • actor_learning_rate (float) – Learning rate of the actor optimizer, not compatible with lr_scheduler.

  • critic_learning_rate (float) – Learning rate of the critic optimizer, not compatible with lr_scheduler.

  • discount (float) – \(\gamma\) used in the bellman function.

  • replay_size (int) – Replay buffer size. Not compatible with replay_buffer.

  • replay_device (Union[str, torch.device]) – Device on which the replay buffer is located. Not compatible with replay_buffer.

  • replay_buffer (machin.frame.buffers.buffer.Buffer) – Custom replay buffer.

  • visualize (bool) – Whether to visualize the network flow in the first pass.

  • visualize_dir (str) – Visualized graph save directory.

  • gradient_max (float) –

  • q_increase_rate (float) –

  • q_decrease_rate (float) –

act(state, use_target=False, **__)

Use actor network to produce an action for the current state.

Parameters
  • state (Dict[str, Any]) – Current state.

  • use_target (bool) – Whether to use the target network.

Returns

Anything returned by your actor network.

act_discrete(state, use_target=False, **__)

Use actor network to produce a discrete action for the current state.

Notes

The actor network must output a probability tensor of shape (batch_size, action_dims), whose rows each sum to 1 along dimension 1.

Parameters
  • state (Dict[str, Any]) – Current state.

  • use_target (bool) – Whether to use the target network.

Returns

Action of shape [batch_size, 1]. Action probability tensor of shape [batch_size, action_num], produced by your actor. Any other things returned by your actor network, if they exist.

act_discrete_with_noise(state, use_target=False, choose_max_prob=0.95, **__)

Use actor network to produce a noisy discrete action for the current state.

Notes

The actor network must output a probability tensor of shape (batch_size, action_dims), whose rows each sum to 1 along dimension 1.

Parameters
  • state (Dict[str, Any]) – Current state.

  • use_target (bool) – Whether to use the target network.

  • choose_max_prob (float) – Probability of choosing the largest component when the actor outputs an extreme probability vector like [0, 1, 0, 0].

Returns

Noisy action of shape [batch_size, 1]. Action probability tensor of shape [batch_size, action_num]. Any other things returned by your actor network, if they exist.

act_with_noise(state, noise_param=(0.0, 1.0), ratio=1.0, mode='uniform', use_target=False, **__)

Use actor network to produce a noisy action for the current state.

Parameters
  • state (Dict[str, Any]) – Current state.

  • noise_param (Any) – Noise params.

  • ratio (float) – Noise ratio.

  • mode (str) – Noise mode. Supported are: "uniform", "normal", "clipped_normal", "ou"

  • use_target (bool) – Whether to use the target network.

Returns

Noisy action of shape [batch_size, action_dim]. Any other things returned by your actor network, if they exist.

static action_transform_function(raw_output_action, *_)

The action transform function is used to transform the output of the actor to the input of the critic. The action transform function must accept:

  1. Raw action from the actor model.

  2. Concatenated Transition.next_state.

  3. Any other concatenated lists of custom keys from Transition.

and returns:
  1. A dictionary with the same form as Transition.action

Parameters

raw_output_action (Any) – Raw action from the actor model.

enable_multiprocessing()

Enable multiprocessing for all modules.

classmethod generate_config(config)[source]
Parameters

config (Union[Dict[str, Any], machin.utils.conf.Config]) –

classmethod get_restorable_model_names()

Get attribute name of restorable nn models.

classmethod get_top_model_names()

Get attribute name of top level nn models.

classmethod init_from_config(config, model_device='cpu')
Parameters
  • config (Union[Dict[str, Any], machin.utils.conf.Config]) –

  • model_device (Union[str, torch.device]) –

classmethod is_distributed()

Whether this framework is a distributed framework which requires multiple processes to run and depends on torch.distributed or torch.distributed.rpc.

load(model_dir, network_map=None, version=-1)

Load models.

An example of network map:

{"restorable_model_1": "file_name_1",
 "restorable_model_2": "file_name_2"}

Get keys by calling <Class name>.get_restorable_model_names().

Parameters
  • model_dir (str) – Save directory.

  • network_map (Dict[str, str]) – Key is module name, value is saved name.

  • version (int) – Version number of the save to be loaded.

static reward_function(reward, discount, next_value, terminal, _)
save(model_dir, network_map=None, version=0)

Save models.

An example of network map:

{"restorable_model_1": "file_name_1",
 "restorable_model_2": "file_name_2"}

Get keys by calling <Class name>.get_restorable_model_names().

Parameters
  • model_dir (str) – Save directory.

  • network_map (Dict[str, str]) – Key is module name, value is saved name.

  • version (int) – Version number of the new save.

set_backward_function(backward_func)

Replace the default backward function with a custom function. The default loss backward function is torch.autograd.backward

Parameters

backward_func (Callable) –

store_episode(episode)

Add a full episode of transition samples to the replay buffer.

Parameters

episode (List[Union[machin.frame.transition.Transition, Dict]]) –

store_transition(transition)

Add a transition sample to the replay buffer.

Parameters

transition (Union[machin.frame.transition.Transition, Dict]) –

update(update_value=True, update_policy=True, update_target=True, concatenate_samples=True, **__)[source]

Update network weights by sampling from replay buffer.

Parameters
  • update_value – Whether to update the Q network.

  • update_policy – Whether to update the actor network.

  • update_target – Whether to update targets.

  • concatenate_samples – Whether to concatenate the samples.

Returns

mean value of estimated policy value, value loss

update_lr_scheduler()

Update learning rate schedulers.

visualize_model(final_tensor, name, directory)
Parameters
  • final_tensor (torch.Tensor) –

  • name (str) –

  • directory (str) –

property backward_function
property lr_schedulers
property optimizers
property restorable_models
property top_models

DDPG with prioritized replay

class machin.frame.algorithms.ddpg_per.DDPGPer(actor, actor_target, critic, critic_target, optimizer, criterion, *_, lr_scheduler=None, lr_scheduler_args=None, lr_scheduler_kwargs=None, batch_size=100, update_rate=0.005, update_steps=None, actor_learning_rate=0.0005, critic_learning_rate=0.001, discount=0.99, gradient_max=inf, replay_size=500000, replay_device='cpu', replay_buffer=None, visualize=False, visualize_dir='', **__)[source]

Bases: machin.frame.algorithms.ddpg.DDPG

DDPG with prioritized experience replay.

Warning

Your criterion must return a tensor of shape [batch_size, 1] when given two tensors of shape [batch_size, 1], since the loss is multiplied element-wise with the importance sampling weights.

If you are using loss modules provided by PyTorch, it is always safe to use them without any modification.

Note

Your optimizer will be called as:

optimizer(network.parameters(), learning_rate)

Your lr_scheduler will be called as:

lr_scheduler(
    optimizer,
    *lr_scheduler_args[0],
    **lr_scheduler_kwargs[0],
)

Your criterion will be called as:

criterion(
    target_value.view(batch_size, 1),
    predicted_value.view(batch_size, 1)
)

Note

DDPG supports two ways of updating the target network. The first is polyak (soft) update, which updates the target network in every training step by mixing its weights with those of the online network using update_rate.

The other way is hard update, which copies the weights of the online network after every update_steps training steps.

Specify either update_rate or update_steps to select an update scheme; if both are specified, an error will be raised.

These two different update schemes may result in different training stability.

Parameters
  • actor (Union[machin.model.nets.base.NeuralNetworkModule, torch.nn.modules.module.Module]) – Actor network module.

  • actor_target (Union[machin.model.nets.base.NeuralNetworkModule, torch.nn.modules.module.Module]) – Target actor network module.

  • critic (Union[machin.model.nets.base.NeuralNetworkModule, torch.nn.modules.module.Module]) – Critic network module.

  • critic_target (Union[machin.model.nets.base.NeuralNetworkModule, torch.nn.modules.module.Module]) – Target critic network module.

  • optimizer (Callable) – Optimizer used to optimize actor and critic.

  • criterion – Criterion used to evaluate the value loss.

  • lr_scheduler (Callable) – Learning rate scheduler of optimizer.

  • lr_scheduler_args (Tuple[Tuple, Tuple]) – Arguments of the learning rate scheduler.

  • lr_scheduler_kwargs (Tuple[Dict, Dict]) – Keyword arguments of the learning rate scheduler.

  • batch_size (int) – Batch size used during training.

  • update_rate (float) –

    \(\tau\) used to update target networks. Target parameters are updated as:

    \(\theta_t = \theta * \tau + \theta_t * (1 - \tau)\)

  • update_steps (Optional[int]) – Training step number used to update target networks.

  • actor_learning_rate (float) – Learning rate of the actor optimizer, not compatible with lr_scheduler.

  • critic_learning_rate (float) – Learning rate of the critic optimizer, not compatible with lr_scheduler.

  • discount (float) – \(\gamma\) used in the bellman function.

  • replay_size (int) – Replay buffer size. Not compatible with replay_buffer.

  • replay_device (Union[str, torch.device]) – Device on which the replay buffer is located. Not compatible with replay_buffer.

  • replay_buffer (machin.frame.buffers.buffer.Buffer) – Custom replay buffer.

  • visualize (bool) – Whether to visualize the network flow in the first pass.

  • visualize_dir (str) – Visualized graph save directory.

  • gradient_max (float) –

act(state, use_target=False, **__)

Use actor network to produce an action for the current state.

Parameters
  • state (Dict[str, Any]) – Current state.

  • use_target (bool) – Whether to use the target network.

Returns

Anything returned by your actor network.

act_discrete(state, use_target=False, **__)

Use actor network to produce a discrete action for the current state.

Notes

The actor network must output a probability tensor of shape (batch_size, action_dims), whose rows each sum to 1 along dimension 1.

Parameters
  • state (Dict[str, Any]) – Current state.

  • use_target (bool) – Whether to use the target network.

Returns

Action of shape [batch_size, 1]. Action probability tensor of shape [batch_size, action_num], produced by your actor. Any other things returned by your actor network, if they exist.

act_discrete_with_noise(state, use_target=False, choose_max_prob=0.95, **__)

Use actor network to produce a noisy discrete action for the current state.

Notes

The actor network must output a probability tensor of shape (batch_size, action_dims), whose rows each sum to 1 along dimension 1.

Parameters
  • state (Dict[str, Any]) – Current state.

  • use_target (bool) – Whether to use the target network.

  • choose_max_prob (float) – Probability of choosing the largest component when the actor outputs an extreme probability vector like [0, 1, 0, 0].

Returns

Noisy action of shape [batch_size, 1]. Action probability tensor of shape [batch_size, action_num]. Any other things returned by your actor network, if they exist.

act_with_noise(state, noise_param=(0.0, 1.0), ratio=1.0, mode='uniform', use_target=False, **__)

Use actor network to produce a noisy action for the current state.

Parameters
  • state (Dict[str, Any]) – Current state.

  • noise_param (Any) – Noise params.

  • ratio (float) – Noise ratio.

  • mode (str) – Noise mode. Supported are: "uniform", "normal", "clipped_normal", "ou"

  • use_target (bool) – Whether to use the target network.

Returns

Noisy action of shape [batch_size, action_dim]. Any other things returned by your actor network, if they exist.

static action_transform_function(raw_output_action, *_)

The action transform function is used to transform the output of the actor to the input of the critic. The action transform function must accept:

  1. Raw action from the actor model.

  2. Concatenated Transition.next_state.

  3. Any other concatenated lists of custom keys from Transition.

and returns:
  1. A dictionary with the same form as Transition.action

Parameters

raw_output_action (Any) – Raw action from the actor model.

enable_multiprocessing()

Enable multiprocessing for all modules.

classmethod generate_config(config)[source]
Parameters

config (Union[Dict[str, Any], machin.utils.conf.Config]) –

classmethod get_restorable_model_names()

Get attribute name of restorable nn models.

classmethod get_top_model_names()

Get attribute name of top level nn models.

classmethod init_from_config(config, model_device='cpu')[source]
Parameters
  • config (Union[Dict[str, Any], machin.utils.conf.Config]) –

  • model_device (Union[str, torch.device]) –

classmethod is_distributed()

Whether this framework is a distributed framework which requires multiple processes to run and depends on torch.distributed or torch.distributed.rpc.

load(model_dir, network_map=None, version=-1)

Load models.

An example of network map:

{"restorable_model_1": "file_name_1",
 "restorable_model_2": "file_name_2"}

Get keys by calling <Class name>.get_restorable_model_names().

Parameters
  • model_dir (str) – Save directory.

  • network_map (Dict[str, str]) – Key is module name, value is saved name.

  • version (int) – Version number of the save to be loaded.

static reward_function(reward, discount, next_value, terminal, _)
save(model_dir, network_map=None, version=0)

Save models.

An example of network map:

{"restorable_model_1": "file_name_1",
 "restorable_model_2": "file_name_2"}

Get keys by calling <Class name>.get_restorable_model_names().

Parameters
  • model_dir (str) – Save directory.

  • network_map (Dict[str, str]) – Key is module name, value is saved name.

  • version (int) – Version number of the new save.

set_backward_function(backward_func)

Replace the default backward function with a custom function. The default loss backward function is torch.autograd.backward

Parameters

backward_func (Callable) –

store_episode(episode)

Add a full episode of transition samples to the replay buffer.

Parameters

episode (List[Union[machin.frame.transition.Transition, Dict]]) –

store_transition(transition)

Add a transition sample to the replay buffer.

Parameters

transition (Union[machin.frame.transition.Transition, Dict]) –

update(update_value=True, update_policy=True, update_target=True, concatenate_samples=True, **__)[source]

Update network weights by sampling from replay buffer.

Parameters
  • update_value – Whether to update the Q network.

  • update_policy – Whether to update the actor network.

  • update_target – Whether to update targets.

  • concatenate_samples – Whether to concatenate the samples.

Returns

mean value of estimated policy value, value loss

update_lr_scheduler()

Update learning rate schedulers.

visualize_model(final_tensor, name, directory)
Parameters
  • final_tensor (torch.Tensor) –

  • name (str) –

  • directory (str) –

property backward_function
property lr_schedulers
property optimizers
property restorable_models
property top_models

TD3

class machin.frame.algorithms.td3.TD3(actor, actor_target, critic, critic_target, critic2, critic2_target, optimizer, criterion, *_, lr_scheduler=None, lr_scheduler_args=None, lr_scheduler_kwargs=None, batch_size=100, update_rate=0.001, update_steps=None, actor_learning_rate=0.0005, critic_learning_rate=0.001, discount=0.99, gradient_max=inf, replay_size=500000, replay_device='cpu', replay_buffer=None, visualize=False, visualize_dir='', **__)[source]

Bases: machin.frame.algorithms.ddpg.DDPG

TD3 framework, which adds an additional pair of critic and target critic networks to DDPG.

See also

DDPG

Parameters
  • actor (Union[machin.model.nets.base.NeuralNetworkModule, torch.nn.modules.module.Module]) – Actor network module.

  • actor_target (Union[machin.model.nets.base.NeuralNetworkModule, torch.nn.modules.module.Module]) – Target actor network module.

  • critic (Union[machin.model.nets.base.NeuralNetworkModule, torch.nn.modules.module.Module]) – Critic network module.

  • critic_target (Union[machin.model.nets.base.NeuralNetworkModule, torch.nn.modules.module.Module]) – Target critic network module.

  • critic2 (Union[machin.model.nets.base.NeuralNetworkModule, torch.nn.modules.module.Module]) – The second critic network module.

  • critic2_target (Union[machin.model.nets.base.NeuralNetworkModule, torch.nn.modules.module.Module]) – The second target critic network module.

  • optimizer (Callable) – Optimizer used to optimize actor, critic and critic2.

  • criterion (Callable) – Criterion used to evaluate the value loss.

  • lr_scheduler (Callable) – Learning rate scheduler of optimizer.

  • lr_scheduler_args (Tuple[Tuple, Tuple, Tuple]) – Arguments of the learning rate scheduler.

  • lr_scheduler_kwargs (Tuple[Dict, Dict, Dict]) – Keyword arguments of the learning rate scheduler.

  • batch_size (int) – Batch size used during training.

  • update_rate (float) –

    \(\tau\) used to update target networks. Target parameters are updated as:

    \(\theta_t = \theta * \tau + \theta_t * (1 - \tau)\)

  • update_steps (Optional[int]) – Training step number used to update target networks.

  • actor_learning_rate (float) – Learning rate of the actor optimizer, not compatible with lr_scheduler.

  • critic_learning_rate (float) – Learning rate of the critic optimizer, not compatible with lr_scheduler.

  • discount (float) – \(\gamma\) used in the bellman function.

  • replay_size (int) – Replay buffer size. Not compatible with replay_buffer.

  • replay_device (Union[str, torch.device]) – Device on which the replay buffer is located. Not compatible with replay_buffer.

  • replay_buffer (machin.frame.buffers.buffer.Buffer) – Custom replay buffer.

  • visualize (bool) – Whether to visualize the network flow in the first pass.

  • visualize_dir (str) – Visualized graph save directory.

  • gradient_max (float) –

act(state, use_target=False, **__)

Use actor network to produce an action for the current state.

Parameters
  • state (Dict[str, Any]) – Current state.

  • use_target (bool) – Whether to use the target network.

Returns

Anything returned by your actor network.

act_discrete(state, use_target=False, **__)

Use actor network to produce a discrete action for the current state.

Notes

The actor network must output a probability tensor of shape (batch_size, action_dims), whose rows each sum to 1 along dimension 1.

Parameters
  • state (Dict[str, Any]) – Current state.

  • use_target (bool) – Whether to use the target network.

Returns

Action of shape [batch_size, 1]. Action probability tensor of shape [batch_size, action_num], produced by your actor. Any other things returned by your actor network, if they exist.

act_discrete_with_noise(state, use_target=False, choose_max_prob=0.95, **__)

Use actor network to produce a noisy discrete action for the current state.

Notes

The actor network must output a probability tensor of shape (batch_size, action_dims), whose rows each sum to 1 along dimension 1.

Parameters
  • state (Dict[str, Any]) – Current state.

  • use_target (bool) – Whether to use the target network.

  • choose_max_prob (float) – Probability of choosing the largest component when the actor outputs an extreme probability vector like [0, 1, 0, 0].

Returns

Noisy action of shape [batch_size, 1]. Action probability tensor of shape [batch_size, action_num]. Any other things returned by your actor network, if they exist.

act_with_noise(state, noise_param=(0.0, 1.0), ratio=1.0, mode='uniform', use_target=False, **__)

Use actor network to produce a noisy action for the current state.

Parameters
  • state (Dict[str, Any]) – Current state.

  • noise_param (Any) – Noise params.

  • ratio (float) – Noise ratio.

  • mode (str) – Noise mode. Supported are: "uniform", "normal", "clipped_normal", "ou"

  • use_target (bool) – Whether to use the target network.

Returns

Noisy action of shape [batch_size, action_dim]. Any other things returned by your actor network, if they exist.

static action_transform_function(raw_output_action, *_)

The action transform function is used to transform the output of the actor to the input of the critic. The action transform function must accept:

  1. Raw action from the actor model.

  2. Concatenated Transition.next_state.

  3. Any other concatenated lists of custom keys from Transition.

and returns:
  1. A dictionary with the same form as Transition.action

Parameters

raw_output_action (Any) – Raw action from the actor model.

enable_multiprocessing()

Enable multiprocessing for all modules.

classmethod generate_config(config)[source]
Parameters

config (Union[Dict[str, Any], machin.utils.conf.Config]) –

classmethod get_restorable_model_names()

Get attribute name of restorable nn models.

classmethod get_top_model_names()

Get attribute name of top level nn models.

classmethod init_from_config(config, model_device='cpu')
Parameters
  • config (Union[Dict[str, Any], machin.utils.conf.Config]) –

  • model_device (Union[str, torch.device]) –

classmethod is_distributed()

Whether this framework is a distributed framework which requires multiple processes to run and depends on torch.distributed or torch.distributed.rpc.

load(model_dir, network_map=None, version=-1)[source]

Load models.

An example of network map:

{"restorable_model_1": "file_name_1",
 "restorable_model_2": "file_name_2"}

Get keys by calling <Class name>.get_restorable_model_names().

Parameters
  • model_dir (str) – Save directory.

  • network_map (Dict[str, str]) – Key is module name, value is saved name.

  • version (int) – Version number of the save to be loaded.

static policy_noise_function(actions, *_)[source]
static reward_function(reward, discount, next_value, terminal, _)
save(model_dir, network_map=None, version=0)

Save models.

An example of network map:

{"restorable_model_1": "file_name_1",
 "restorable_model_2": "file_name_2"}

Get keys by calling <Class name>.get_restorable_model_names().

Parameters
  • model_dir (str) – Save directory.

  • network_map (Dict[str, str]) – Key is module name, value is saved name.

  • version (int) – Version number of the new save.

set_backward_function(backward_func)

Replace the default backward function with a custom function. The default loss backward function is torch.autograd.backward

Parameters

backward_func (Callable) –

store_episode(episode)

Add a full episode of transition samples to the replay buffer.

Parameters

episode (List[Union[machin.frame.transition.Transition, Dict]]) –

store_transition(transition)

Add a transition sample to the replay buffer.

Parameters

transition (Union[machin.frame.transition.Transition, Dict]) –

update(update_value=True, update_policy=True, update_target=True, concatenate_samples=True, **__)[source]

Update network weights by sampling from replay buffer.

Parameters
  • update_value – Whether to update the Q network.

  • update_policy – Whether to update the actor network.

  • update_target – Whether to update targets.

  • concatenate_samples – Whether to concatenate the samples.

Returns

mean value of estimated policy value, value loss

update_lr_scheduler()[source]

Update learning rate schedulers.

visualize_model(final_tensor, name, directory)
Parameters
  • final_tensor (torch.Tensor) –

  • name (str) –

  • directory (str) –

property backward_function
property lr_schedulers
property optimizers
property restorable_models
property top_models

DQN, Fixed-Target DQN, Dueling DQN, Double DQN

class machin.frame.algorithms.dqn.DQN(qnet, qnet_target, optimizer, criterion, *_, lr_scheduler=None, lr_scheduler_args=None, lr_scheduler_kwargs=None, batch_size=100, epsilon_decay=0.9999, update_rate=0.005, update_steps=None, learning_rate=0.001, discount=0.99, gradient_max=inf, replay_size=500000, replay_device='cpu', replay_buffer=None, mode='double', visualize=False, visualize_dir='', **__)[source]

Bases: machin.frame.algorithms.base.TorchFramework

DQN framework.

Note

DQN is only available for discrete environments.

Note

Dueling DQN is a network structure rather than a framework, so it can be applied in all three modes.

If mode = "vanilla", implements the simplest online DQN, with a replay buffer.

If mode = "fixed_target", implements DQN with a target network and a replay buffer, as described in the original paper.

If mode = "double", implements Double DQN as described in the original paper.

Note

Vanilla DQN only needs one network, so internally, qnet is assigned to qnet_target.

Note

In order to implement dueling DQN, you should create two dense output layers.

In your q network:

self.fc_adv = nn.Linear(in_features=...,
                        out_features=num_actions)
self.fc_val = nn.Linear(in_features=...,
                        out_features=1)

Then in your forward() method, you should implement output as:

adv = self.fc_adv(some_input)
val = self.fc_val(some_input).expand(self.batch_size,
                                     self.num_actions)
return val + adv - adv.mean(1, keepdim=True)
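
Put together, a minimal dueling Q network might look like the following sketch (hypothetical layer sizes; expand_as is equivalent to the explicit expand above):

import torch.nn as nn

class DuelingQNet(nn.Module):
    # Hypothetical dueling network: a shared body followed by separate
    # advantage and value heads, recombined as Q = V + A - mean(A).
    def __init__(self, state_dim, num_actions):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU())
        self.fc_adv = nn.Linear(in_features=64, out_features=num_actions)
        self.fc_val = nn.Linear(in_features=64, out_features=1)

    def forward(self, state):
        hidden = self.body(state)
        adv = self.fc_adv(hidden)
        val = self.fc_val(hidden).expand_as(adv)
        return val + adv - adv.mean(1, keepdim=True)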

Note

Your optimizer will be called as:

optimizer(network.parameters(), learning_rate)

Your lr_scheduler will be called as:

lr_scheduler(
    optimizer,
    *lr_scheduler_args[0],
    **lr_scheduler_kwargs[0],
)

Your criterion will be called as:

criterion(
    target_value.view(batch_size, 1),
    predicted_value.view(batch_size, 1)
)

Note

DQN supports two ways of updating the target network. The first is polyak (soft) update, which updates the target network in every training step by mixing its weights with those of the online network using update_rate.

The other way is hard update, which copies the weights of the online network after every update_steps training steps.

Specify either update_rate or update_steps to select an update scheme; if both are specified, an error will be raised.

These two different update schemes may result in different training stability.

epsilon

Current epsilon value, which determines the randomness in act_discrete_with_noise. You can set it to any value.

Parameters
  • qnet (Union[machin.model.nets.base.NeuralNetworkModule, torch.nn.modules.module.Module]) – Q network module.

  • qnet_target (Union[machin.model.nets.base.NeuralNetworkModule, torch.nn.modules.module.Module]) – Target Q network module.

  • optimizer (Callable) – Optimizer used to optimize qnet.

  • criterion (Callable) – Criterion used to evaluate the value loss.

  • learning_rate (float) – Learning rate of the optimizer, not compatible with lr_scheduler.

  • lr_scheduler (Callable) – Learning rate scheduler of optimizer.

  • lr_scheduler_args (Tuple[Tuple]) – Arguments of the learning rate scheduler.

  • lr_scheduler_kwargs (Tuple[Dict]) – Keyword arguments of the learning rate scheduler.

  • batch_size (int) – Batch size used during training.

  • epsilon_decay (float) – Epsilon decay rate per noisy acting step. The epsilon attribute is multiplied by this value every time act_discrete_with_noise is called.

  • update_rate (Optional[float]) –

    \(\tau\) used to update target networks. Target parameters are updated as:

    \(\theta_t = \theta * \tau + \theta_t * (1 - \tau)\)

  • update_steps (Optional[int]) – Training step number used to update target networks.

  • discount (float) – \(\gamma\) used in the bellman function.

  • gradient_max (float) – Maximum gradient.

  • replay_size (int) – Replay buffer size. Not compatible with replay_buffer.

  • replay_device (Union[str, torch.device]) – Device on which the replay buffer is located. Not compatible with replay_buffer.

  • replay_buffer (machin.frame.buffers.buffer.Buffer) – Custom replay buffer.

  • mode (str) – one of "vanilla", "fixed_target", "double".

  • visualize (bool) – Whether to visualize the network flow in the first pass.

  • visualize_dir (str) –
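
A minimal construction sketch (hypothetical Q network for a 4-dimensional state and 2 discrete actions; the optimizer and criterion follow the call conventions in the notes above):

import torch as t
import torch.nn as nn
from machin.frame.algorithms import DQN

class QNet(nn.Module):
    def __init__(self, state_dim, action_num):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(state_dim, 64), nn.ReLU(),
            nn.Linear(64, action_num),
        )

    def forward(self, state):
        return self.fc(state)

q_net, q_net_t = QNet(4, 2), QNet(4, 2)
dqn = DQN(q_net, q_net_t, t.optim.Adam, nn.MSELoss(reduction="sum"), mode="double")

# Epsilon-greedy action for a single state.
action = dqn.act_discrete_with_noise({"state": t.zeros(1, 4)})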

act_discrete(state, use_target=False, **__)[source]

Use Q network to produce a discrete action for the current state.

Parameters
  • state (Dict[str, Any]) – Current state.

  • use_target (bool) – Whether to use the target network.

Returns

Action of shape [batch_size, 1]. Any other things returned by your Q network, if they exist.

act_discrete_with_noise(state, use_target=False, decay_epsilon=True, **__)[source]

Selects an action using an epsilon-greedy policy: with probability epsilon, a random action is drawn uniformly from the action space; otherwise, the greedy action is chosen.

Parameters
  • state (Dict[str, Any]) – Current state.

  • use_target (bool) – Whether to use the target network.

  • decay_epsilon (bool) – Whether to decay the epsilon attribute.

Returns

Noisy action of shape [batch_size, 1]. Any other things returned by your Q network, if they exist.

static action_get_function(sampled_actions)[source]

This function is used to get action numbers (int tensor indicating which discrete actions are used) from the sampled action dictionary.

enable_multiprocessing()

Enable multiprocessing for all modules.

classmethod generate_config(config)[source]
Parameters

config (Union[Dict[str, Any], machin.utils.conf.Config]) –

classmethod get_restorable_model_names()

Get attribute name of restorable nn models.

classmethod get_top_model_names()

Get attribute name of top level nn models.

classmethod init_from_config(config, model_device='cpu')[source]
Parameters
  • config (Union[Dict[str, Any], machin.utils.conf.Config]) –

  • model_device (Union[str, torch.device]) –

classmethod is_distributed()

Whether this framework is a distributed framework which requires multiple processes to run and depends on torch.distributed or torch.distributed.rpc.

load(model_dir, network_map=None, version=-1)[source]

Load models.

An example of network map:

{"restorable_model_1": "file_name_1",
 "restorable_model_2": "file_name_2"}

Get keys by calling <Class name>.get_restorable_model_names().

Parameters
  • model_dir – Save directory.

  • network_map – Key is module name, value is saved name.

  • version – Version number of the save to be loaded.

static reward_function(reward, discount, next_value, terminal, _)[source]
save(model_dir, network_map=None, version=0)

Save models.

An example of network map:

{"restorable_model_1": "file_name_1",
 "restorable_model_2": "file_name_2"}

Get keys by calling <Class name>.get_restorable_model_names().

Parameters
  • model_dir (str) – Save directory.

  • network_map (Dict[str, str]) – Key is module name, value is saved name.

  • version (int) – Version number of the new save.

set_backward_function(backward_func)

Replace the default backward function with a custom function. The default loss backward function is torch.autograd.backward

Parameters

backward_func (Callable) –

store_episode(episode)[source]

Add a full episode of transition samples to the replay buffer.

Parameters

episode (List[Union[machin.frame.transition.Transition, Dict]]) –

store_transition(transition)[source]

Add a transition sample to the replay buffer.

Parameters

transition (Union[machin.frame.transition.Transition, Dict]) –

update(update_value=True, update_target=True, concatenate_samples=True, **__)[source]

Update network weights by sampling from replay buffer.

Parameters
  • update_value – Whether to update the Q network.

  • update_target – Whether to update targets.

  • concatenate_samples – Whether to concatenate the samples.

Returns

mean value of estimated policy value, value loss

update_lr_scheduler()[source]

Update learning rate schedulers.

visualize_model(final_tensor, name, directory)
Parameters
  • final_tensor (torch.Tensor) –

  • name (str) –

  • directory (str) –

property backward_function
property lr_schedulers
property optimizers
property restorable_models
property top_models

DQN with prioritized replay

class machin.frame.algorithms.dqn_per.DQNPer(qnet, qnet_target, optimizer, criterion, *_, lr_scheduler=None, lr_scheduler_args=None, lr_scheduler_kwargs=None, batch_size=100, epsilon_decay=0.9999, update_rate=0.005, update_steps=None, learning_rate=0.001, discount=0.99, gradient_max=inf, replay_size=500000, replay_device='cpu', replay_buffer=None, visualize=False, visualize_dir='', **__)[source]

Bases: machin.frame.algorithms.dqn.DQN

DQN with prioritized replay. It is based on Double DQN.

Warning

Your criterion must return a tensor of shape [batch_size, 1] when given two tensors of shape [batch_size, 1], since the loss is multiplied element-wise with the importance sampling weights.

If you are using loss modules provided by PyTorch, it is always safe to use them without any modification.

Note

DQN is only available for discrete environments.

Note

Dueling DQN is a network structure rather than a framework, so it can be applied in all three modes.

If mode = "vanilla", implements the simplest online DQN, with a replay buffer.

If mode = "fixed_target", implements DQN with a target network and a replay buffer, as described in the original paper.

If mode = "double", implements Double DQN as described in the original paper.

Note

Vanilla DQN only needs one network, so internally, qnet is assigned to qnet_target.

Note

In order to implement dueling DQN, you should create two dense output layers.

In your q network:

self.fc_adv = nn.Linear(in_features=...,
                        out_features=num_actions)
self.fc_val = nn.Linear(in_features=...,
                        out_features=1)

Then in your forward() method, you should implement output as:

adv = self.fc_adv(some_input)
val = self.fc_val(some_input).expand(self.batch_size,
                                     self.num_actions)
return val + adv - adv.mean(1, keepdim=True)

Note

Your optimizer will be called as:

optimizer(network.parameters(), learning_rate)

Your lr_scheduler will be called as:

lr_scheduler(
    optimizer,
    *lr_scheduler_args[0],
    **lr_scheduler_kwargs[0],
)

Your criterion will be called as:

criterion(
    target_value.view(batch_size, 1),
    predicted_value.view(batch_size, 1)
)

Note

DQN supports two ways of updating the target network. The first is polyak (soft) update, which updates the target network in every training step by mixing its weights with those of the online network using update_rate.

The other way is hard update, which copies the weights of the online network after every update_steps training steps.

Specify either update_rate or update_steps to select an update scheme; if both are specified, an error will be raised.

These two different update schemes may result in different training stability.

epsilon

Current epsilon value, which determines the randomness in act_discrete_with_noise. You can set it to any value.

Parameters
  • qnet (Union[machin.model.nets.base.NeuralNetworkModule, torch.nn.modules.module.Module]) – Q network module.

  • qnet_target (Union[machin.model.nets.base.NeuralNetworkModule, torch.nn.modules.module.Module]) – Target Q network module.

  • optimizer (Callable) – Optimizer used to optimize qnet.

  • criterion (Callable) – Criterion used to evaluate the value loss.

  • learning_rate (float) – Learning rate of the optimizer, not compatible with lr_scheduler.

  • lr_scheduler (Callable) – Learning rate scheduler of optimizer.

  • lr_scheduler_args (Tuple[Tuple]) – Arguments of the learning rate scheduler.

  • lr_scheduler_kwargs (Tuple[Dict]) – Keyword arguments of the learning rate scheduler.

  • batch_size (int) – Batch size used during training.

  • epsilon_decay (float) – Epsilon decay rate per noisy acting step. The epsilon attribute is multiplied by this value every time act_discrete_with_noise is called.

  • update_rate (Optional[float]) –

    \(\tau\) used to update target networks. Target parameters are updated as:

    \(\theta_t = \theta * \tau + \theta_t * (1 - \tau)\)

  • update_steps (Optional[int]) – Training step number used to update target networks.

  • discount (float) – \(\gamma\) used in the bellman function.

  • gradient_max (float) – Maximum gradient.

  • replay_size (int) – Replay buffer size. Not compatible with replay_buffer.

  • replay_device (Union[str, torch.device]) – Device on which the replay buffer is located. Not compatible with replay_buffer.

  • replay_buffer (machin.frame.buffers.buffer.Buffer) – Custom replay buffer.

  • mode – one of "vanilla", "fixed_target", "double".

  • visualize (bool) – Whether to visualize the network flow in the first pass.

  • visualize_dir (str) –

act_discrete(state, use_target=False, **__)

Use Q network to produce a discrete action for the current state.

Parameters
  • state (Dict[str, Any]) – Current state.

  • use_target (bool) – Whether to use the target network.

Returns

Action of shape [batch_size, 1]. Any other things returned by your Q network, if they exist.

act_discrete_with_noise(state, use_target=False, decay_epsilon=True, **__)

Selects an action using an epsilon-greedy policy: with probability epsilon, a random action is drawn uniformly from the action space; otherwise, the greedy action is chosen.

Parameters
  • state (Dict[str, Any]) – Current state.

  • use_target (bool) – Whether to use the target network.

  • decay_epsilon (bool) – Whether to decay the epsilon attribute.

Returns

Noisy action of shape [batch_size, 1]. Any other things returned by your Q network, if they exist.

static action_get_function(sampled_actions)

This function is used to get action numbers (int tensor indicating which discrete actions are used) from the sampled action dictionary.

enable_multiprocessing()

Enable multiprocessing for all modules.

classmethod generate_config(config)[source]
Parameters

config (Union[Dict[str, Any], machin.utils.conf.Config]) –

classmethod get_restorable_model_names()

Get attribute name of restorable nn models.

classmethod get_top_model_names()

Get attribute name of top level nn models.

classmethod init_from_config(config, model_device='cpu')[source]
Parameters
  • config (Union[Dict[str, Any], machin.utils.conf.Config]) –

  • model_device (Union[str, torch.device]) –

classmethod is_distributed()

Whether this framework is a distributed framework which requires multiple processes to run and depends on torch.distributed or torch.distributed.rpc.

load(model_dir, network_map=None, version=-1)

Load models.

An example of network map:

{"restorable_model_1": "file_name_1",
 "restorable_model_2": "file_name_2"}

Get keys by calling <Class name>.get_restorable_model_names().

Parameters
  • model_dir – Save directory.

  • network_map – Key is module name, value is saved name.

  • version – Version number of the save to be loaded.

static reward_function(reward, discount, next_value, terminal, _)
save(model_dir, network_map=None, version=0)

Save models.

An example of network map:

{"restorable_model_1": "file_name_1",
 "restorable_model_2": "file_name_2"}

Get keys by calling <Class name>.get_restorable_model_names().

Parameters
  • model_dir (str) – Save directory.

  • network_map (Dict[str, str]) – Key is module name, value is saved name.

  • version (int) – Version number of the new save.

set_backward_function(backward_func)

Replace the default backward function with a custom function. The default loss backward function is torch.autograd.backward

Parameters

backward_func (Callable) –

store_episode(episode)

Add a full episode of transition samples to the replay buffer.

Parameters

episode (List[Union[machin.frame.transition.Transition, Dict]]) –

store_transition(transition)

Add a transition sample to the replay buffer.

Parameters

transition (Union[machin.frame.transition.Transition, Dict]) –

update(update_value=True, update_target=True, concatenate_samples=True, **__)[source]

Update network weights by sampling from replay buffer.

Parameters
  • update_value – Whether to update the Q network.

  • update_target – Whether to update targets.

  • concatenate_samples – Whether to concatenate the samples.

Returns

mean value of estimated policy value, value loss

update_lr_scheduler()

Update learning rate schedulers.

visualize_model(final_tensor, name, directory)
Parameters
  • final_tensor (torch.Tensor) –

  • name (str) –

  • directory (str) –

property backward_function
property lr_schedulers
property optimizers
property restorable_models
property top_models

RAINBOW

class machin.frame.algorithms.rainbow.RAINBOW(qnet, qnet_target, optimizer, value_min, value_max, *_, lr_scheduler=None, lr_scheduler_args=None, lr_scheduler_kwargs=None, batch_size=100, epsilon_decay=0.9999, update_rate=0.001, update_steps=None, learning_rate=0.001, discount=0.99, gradient_max=inf, reward_future_steps=3, replay_size=500000, replay_device='cpu', replay_buffer=None, visualize=False, visualize_dir='', **__)[source]

Bases: machin.frame.algorithms.dqn.DQN

RAINBOW DQN framework.

The RAINBOW framework is described in the original paper.

Note

In the RAINBOW framework, the output shape of your Q network must be [batch_size, action_num, atom_num] when given a state of shape [batch_size, action_dim], and the last dimension must be soft-maxed. The atom number is the number of segments of your Q value domain.
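
For illustration, a Q network with the required output shape could look like this sketch (hypothetical sizes; atom_num is the number of segments between value_min and value_max):

import torch as t
import torch.nn as nn

class DistributionalQNet(nn.Module):
    # Hypothetical network producing a categorical value distribution per action:
    # output shape [batch_size, action_num, atom_num], soft-maxed over atoms.
    def __init__(self, state_dim, action_num, atom_num=10):
        super().__init__()
        self.action_num, self.atom_num = action_num, atom_num
        self.fc = nn.Sequential(
            nn.Linear(state_dim, 64), nn.ReLU(),
            nn.Linear(64, action_num * atom_num),
        )

    def forward(self, state):
        logits = self.fc(state).view(-1, self.action_num, self.atom_num)
        return t.softmax(logits, dim=-1)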

See also

DQN

Parameters
  • qnet (Union[machin.model.nets.base.NeuralNetworkModule, torch.nn.modules.module.Module]) – Q network module.

  • qnet_target (Union[machin.model.nets.base.NeuralNetworkModule, torch.nn.modules.module.Module]) – Target Q network module.

  • optimizer – Optimizer used to optimize qnet.

  • value_min – Minimum of value domain.

  • value_max – Maximum of value domain.

  • learning_rate (float) – Learning rate of the optimizer, not compatible with lr_scheduler.

  • lr_scheduler (Callable) – Learning rate scheduler of optimizer.

  • lr_scheduler_args (Tuple[Tuple]) – Arguments of the learning rate scheduler.

  • lr_scheduler_kwargs (Tuple[Dict]) – Keyword arguments of the learning rate scheduler.

  • batch_size (int) – Batch size used during training.

  • epsilon_decay (float) – Epsilon decay rate per noisy acting step. The epsilon attribute is multiplied by this value every time act_discrete_with_noise is called.

  • update_rate (float) –

    \(\tau\) used to update target networks. Target parameters are updated as:

    \(\theta_t = \theta * \tau + \theta_t * (1 - \tau)\)

  • update_steps (Optional[int]) – Training step number used to update target networks.

  • discount (float) – \(\gamma\) used in the bellman function.

  • reward_future_steps (int) – Number of future steps to be considered when the framework calculates value from reward.

  • replay_size (int) – Replay buffer size. Not compatible with replay_buffer.

  • replay_device (Union[str, torch.device]) – Device on which the replay buffer is located. Not compatible with replay_buffer.

  • replay_buffer (machin.frame.buffers.buffer.Buffer) – Custom replay buffer.

  • mode – one of "vanilla", "fixed_target", "double".

  • visualize (bool) – Whether to visualize the network flow in the first pass.

  • gradient_max (float) –

  • visualize_dir (str) –

act_discrete(state, use_target=False, **__)[source]

Use Q network to produce a discrete action for the current state.

Parameters
  • state (Dict[str, Any]) – Current state.

  • use_target (bool) – Whether to use the target network.

Returns

Action of shape [batch_size, 1]. Any other things returned by your Q network, if they exist.

act_discrete_with_noise(state, use_target=False, decay_epsilon=True, **__)[source]

Selects an action using an epsilon-greedy policy: with probability epsilon, a random action is drawn uniformly from the action space; otherwise, the greedy action is chosen.

Parameters
  • state (Dict[str, Any]) – Current state.

  • use_target (bool) – Whether to use the target network.

  • decay_epsilon (bool) – Whether to decay the epsilon attribute.

Returns

Noisy action of shape [batch_size, 1]. Any other things returned by your Q network, if they exist.

static action_get_function(sampled_actions)

This function is used to get action numbers (int tensor indicating which discrete actions are used) from the sampled action dictionary.

enable_multiprocessing()

Enable multiprocessing for all modules.

classmethod generate_config(config)[source]
Parameters

config (Union[Dict[str, Any], machin.utils.conf.Config]) –

classmethod get_restorable_model_names()

Get attribute name of restorable nn models.

classmethod get_top_model_names()

Get attribute name of top level nn models.

classmethod init_from_config(config, model_device='cpu')
Parameters
  • config (Union[Dict[str, Any], machin.utils.conf.Config]) –

  • model_device (Union[str, torch.device]) –

classmethod is_distributed()

Whether this framework is a distributed framework which requires multiple processes to run and depends on torch.distributed or torch.distributed.rpc

load(model_dir, network_map=None, version=-1)

Load models.

An example of network map:

{"restorable_model_1": "file_name_1",
 "restorable_model_2": "file_name_2"}

Get keys by calling <Class name>.get_restorable()

Parameters
  • model_dir – Save directory.

  • network_map – Key is module name, value is saved name.

  • version – Version number of the save to be loaded.

static reward_function(reward, discount, next_value, terminal, _)
save(model_dir, network_map=None, version=0)

Save models.

An example of network map:

{"restorable_model_1": "file_name_1",
 "restorable_model_2": "file_name_2"}

Get keys by calling <Class name>.get_restorable()

Parameters
  • model_dir (str) – Save directory.

  • network_map (Dict[str, str]) – Key is module name, value is saved name.

  • version (int) – Version number of the new save.

set_backward_function(backward_func)

Replace the default backward function with a custom function. The default loss backward function is torch.autograd.backward

Parameters

backward_func (Callable) –

store_episode(episode)[source]

Add a full episode of transition samples to the replay buffer.

“value” is automatically calculated.

Parameters

episode (List[Union[machin.frame.transition.Transition, Dict]]) –
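
A minimal sketch of assembling and storing one episode (the "some_state" key, tensor shapes, and the rainbow variable holding a constructed RAINBOW instance are illustrative assumptions; the state dictionary keys must match the arguments of your Q network's forward method):

import torch as t

# A fake 3-step episode; all tensor values are dummies.
episode = [
    {
        "state": {"some_state": t.zeros(1, 4)},
        "action": {"action": t.zeros(1, 1, dtype=t.long)},
        "next_state": {"some_state": t.zeros(1, 4)},
        "reward": 0.0,
        "terminal": step == 2,
    }
    for step in range(3)
]
rainbow.store_episode(episode)  # "value" is computed from reward and reward_future_steps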

store_transition(transition)[source]

Add a transition sample to the replay buffer.

Not suggested, since you will have to calculate “value” by yourself.

Parameters

transition (Union[machin.frame.transition.Transition, Dict]) –

update(update_value=True, update_target=True, concatenate_samples=True, **__)[source]

Update network weights by sampling from replay buffer.

Parameters
  • update_value – Whether to update the Q network.

  • update_target – Whether to update targets.

  • concatenate_samples – Whether to concatenate the samples.

Returns

mean value of estimated policy value, value loss

update_lr_scheduler()

Update learning rate schedulers.

visualize_model(final_tensor, name, directory)
Parameters
  • final_tensor (torch.Tensor) –

  • name (str) –

  • directory (str) –

property backward_function
property lr_schedulers
property optimizers
property restorable_models
property top_models

A2C

class machin.frame.algorithms.a2c.A2C(actor, critic, optimizer, criterion, *_, lr_scheduler=None, lr_scheduler_args=None, lr_scheduler_kwargs=None, batch_size=100, actor_update_times=5, critic_update_times=10, actor_learning_rate=0.001, critic_learning_rate=0.001, entropy_weight=None, value_weight=0.5, gradient_max=inf, gae_lambda=1.0, discount=0.99, normalize_advantage=True, replay_size=500000, replay_device='cpu', replay_buffer=None, visualize=False, visualize_dir='', **__)[source]

Bases: machin.frame.algorithms.base.TorchFramework

A2C framework.

Important

When given a state, and an optional action, the actor must at least return two values:

1. Action

For continuous environments, the action must be of shape [batch_size, action_dim] and clamped to the action space. For discrete environments, the action could be of shape [batch_size, action_dim] if it is a one-hot vector, or [batch_size, 1] if it is a categorically encoded integer.

2. Log likelihood of action (action probability)

For either type of environment, log likelihood is of shape [batch_size, 1].

The action log probability must be differentiable; the gradient of the actor is calculated from the gradient of the action log probability.

The third value, entropy, is optional:

3. Entropy of action distribution

Entropy is usually calculated using dist.entropy(); its shape is [batch_size, 1]. You must specify entropy_weight to make it effective.

Hint

For continuous environments, actions are not directly output by your actor, since it would otherwise be rather inconvenient to calculate the log probability of the action. Instead, your actor network should output parameters for a certain distribution (e.g. Normal) and then draw the action from it.

For discrete environments, Categorical is sufficient, since differentiable rsample() is not needed.

This trick is also known as reparameterization.

Hint

Actions are sampled during training in the actor-critic family (A2C, A3C, PPO, TRPO, IMPALA).

When your actor model is given a batch of actions and states, it must evaluate the states and return the log likelihood of the given actions instead of re-sampling new actions.

An example of your actor in continuous environments:

import torch as t
import torch.nn as nn
import torch.nn.functional as F
from torch.distributions import Normal

class ActorNet(nn.Module):
    def __init__(self):
        super(ActorNet, self).__init__()
        self.fc = nn.Linear(3, 100)
        self.mu_head = nn.Linear(100, 1)
        self.sigma_head = nn.Linear(100, 1)

    def forward(self, state, action=None):
        x = t.relu(self.fc(state))
        mu = 2.0 * t.tanh(self.mu_head(x))
        sigma = F.softplus(self.sigma_head(x))
        dist = Normal(mu, sigma)
        # Evaluate the given action if provided, otherwise sample a new one.
        action = (action
                  if action is not None
                  else dist.sample())
        action_entropy = dist.entropy()
        # Clamp to the valid action range before computing the log probability.
        action = action.clamp(-2.0, 2.0)
        action_log_prob = dist.log_prob(action)
        return action, action_log_prob, action_entropy
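
For discrete environments, a corresponding sketch using Categorical could look like the following (layer sizes and the action count are illustrative assumptions):

import torch as t
import torch.nn as nn
from torch.distributions import Categorical

class DiscreteActorNet(nn.Module):
    def __init__(self, state_dim=4, action_num=2):
        super().__init__()
        self.fc = nn.Linear(state_dim, 100)
        self.head = nn.Linear(100, action_num)

    def forward(self, state, action=None):
        x = t.relu(self.fc(state))
        probs = t.softmax(self.head(x), dim=1)
        dist = Categorical(probs=probs)
        # Evaluate the given action if provided, otherwise sample a new one.
        action = (action
                  if action is not None
                  else dist.sample().view(-1, 1))
        action_entropy = dist.entropy().view(-1, 1)
        action_log_prob = dist.log_prob(action.flatten()).view(-1, 1)
        return action, action_log_prob, action_entropy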

Hint

Entropy weight is usually negative, to increase exploration.

Value weight is usually 0.5, so that the critic network does not converge more slowly than the actor network.

Update equation is equivalent to:

\(Loss = w_e * Entropy + w_v * Loss_v + w_a * Loss_a\)

\(Loss_a = -log\_likelihood * advantage\)

\(Loss_v = criterion(target\_bellman\_value, V(s))\)
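
A rough, simplified sketch of this combined loss (with \(w_a = 1\); this is not the framework's internal code, and all tensor arguments are assumed to have shape [batch_size, 1]):

import torch as t

def a2c_loss(action_log_prob, advantage, predicted_value, target_value,
             entropy, entropy_weight=0.01, value_weight=0.5,
             criterion=t.nn.MSELoss()):
    # Loss_a = -log_likelihood * advantage; the advantage is treated as a
    # constant with respect to the actor, hence the detach().
    actor_loss = -(action_log_prob * advantage.detach()).mean()
    # Loss_v = criterion(target_bellman_value, V(s))
    value_loss = criterion(target_value, predicted_value)
    # Loss = w_e * Entropy + w_v * Loss_v + w_a * Loss_a
    return entropy_weight * entropy.mean() + value_weight * value_loss + actor_loss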

Parameters
  • actor (Union[machin.model.nets.base.NeuralNetworkModule, torch.nn.modules.module.Module]) – Actor network module.

  • critic (Union[machin.model.nets.base.NeuralNetworkModule, torch.nn.modules.module.Module]) – Critic network module.

  • optimizer (Callable) – Optimizer used to optimize actor and critic.

  • criterion (Callable) – Criterion used to evaluate the value loss.

  • lr_scheduler (Callable) – Learning rate scheduler of optimizer.

  • lr_scheduler_args (Tuple[Tuple, Tuple]) – Arguments of the learning rate scheduler.

  • lr_scheduler_kwargs (Tuple[Dict, Dict]) – Keyword arguments of the learning rate scheduler.

  • batch_size (int) – Batch size used during training.

  • actor_update_times (int) – Times to update actor in update().

  • critic_update_times (int) – Times to update critic in update().

  • actor_learning_rate (float) – Learning rate of the actor optimizer, not compatible with lr_scheduler.

  • critic_learning_rate (float) – Learning rate of the critic optimizer, not compatible with lr_scheduler.

  • entropy_weight (float) – Weight of entropy in your loss function, a positive entropy weight will minimize entropy, while a negative one will maximize entropy.

  • value_weight (float) – Weight of critic value loss.

  • gradient_max (float) – Maximum gradient.

  • gae_lambda (float) – \(\lambda\) used in generalized advantage estimation.

  • discount (float) – \(\gamma\) used in the bellman function.

  • normalize_advantage (bool) – Whether to normalize the advantage function.

  • replay_size (int) – Replay buffer size. Not compatible with replay_buffer.

  • replay_device (Union[str, torch.device]) – Device where the replay buffer is located. Not compatible with replay_buffer.

  • replay_buffer (machin.frame.buffers.buffer.Buffer) – Custom replay buffer.

  • visualize (bool) – Whether to visualize the network flow in the first pass.

  • visualize_dir (str) – Visualized graph save directory.

act(state, *_, **__)[source]

Use the actor network to produce a policy for the current state.

Returns

Anything produced by actor.

Parameters

state (Dict[str, Any]) –

enable_multiprocessing()

Enable multiprocessing for all modules.

classmethod generate_config(config)[source]
Parameters

config (Union[Dict[str, Any], machin.utils.conf.Config]) –

classmethod get_restorable_model_names()

Get attribute name of restorable nn models.

classmethod get_top_model_names()

Get attribute name of top level nn models.

classmethod init_from_config(config, model_device='cpu')[source]
Parameters
  • config (Union[Dict[str, Any], machin.utils.conf.Config]) –

  • model_device (Union[str, torch.device]) –

classmethod is_distributed()

Whether this framework is a distributed framework which requires multiple processes to run and depends on torch.distributed or torch.distributed.rpc

load(model_dir, network_map=None, version=-1)

Load models.

An example of network map:

{"restorable_model_1": "file_name_1",
 "restorable_model_2": "file_name_2"}

Get keys by calling <Class name>.get_restorable()

Parameters
  • model_dir (str) – Save directory.

  • network_map (Dict[str, str]) – Key is module name, value is saved name.

  • version (int) – Version number of the save to be loaded.

save(model_dir, network_map=None, version=0)

Save models.

An example of network map:

{"restorable_model_1": "file_name_1",
 "restorable_model_2": "file_name_2"}

Get keys by calling <Class name>.get_restorable()

Parameters
  • model_dir (str) – Save directory.

  • network_map (Dict[str, str]) – Key is module name, value is saved name.

  • version (int) – Version number of the new save.

set_backward_function(backward_func)

Replace the default backward function with a custom function. The default loss backward function is torch.autograd.backward

Parameters

backward_func (Callable) –

store_episode(episode)[source]

Add a full episode of transition samples to the replay buffer.

“value” and “gae” are automatically calculated.

Parameters

episode (List[Union[machin.frame.transition.Transition, Dict]]) –

store_transition(transition)[source]

Add a transition sample to the replay buffer.

Not suggested, since you will have to calculate “value” and “gae” by yourself.

Parameters

transition (Union[machin.frame.transition.Transition, Dict]) –

update(update_value=True, update_policy=True, concatenate_samples=True, **__)[source]

Update network weights by sampling from buffer. Buffer will be cleared after update is finished.

Parameters
  • update_value – Whether to update the critic network.

  • update_policy – Whether to update the actor network.

  • concatenate_samples – Whether to concatenate the samples.

Returns

mean value of estimated policy value, value loss

update_lr_scheduler()[source]

Update learning rate schedulers.

visualize_model(final_tensor, name, directory)
Parameters
  • final_tensor (torch.Tensor) –

  • name (str) –

  • directory (str) –

property backward_function
property lr_schedulers
property optimizers
property restorable_models
property top_models

A3C

class machin.frame.algorithms.a3c.A3C(actor, critic, criterion, grad_server, *_, batch_size=100, actor_update_times=5, critic_update_times=10, entropy_weight=None, value_weight=0.5, gradient_max=inf, gae_lambda=1.0, discount=0.99, normalize_advantage=True, replay_size=500000, replay_device='cpu', replay_buffer=None, visualize=False, visualize_dir='', **__)[source]

Bases: machin.frame.algorithms.a2c.A2C

A3C framework.

See also

A2C

Note

The A3C algorithm relies on parameter servers to synchronize the parameters of actor and critic models across samplers (which interact with the environment) and trainers (which use samples to train).

The parameter server type PushPullGradServer used here utilizes gradients calculated by trainers:

1. Perform a “sum” reduction on the collected gradients, then apply the reduced gradient to the model managed by its primary reducer.

2. Push the parameters of the updated managed model to an ordered key-value server, so that all processes, including samplers and trainers, can access the updated parameters.

The grad_server argument is a pair of accessors to two PushPullGradServerImpl instances.

Parameters
  • actor (Union[machin.model.nets.base.NeuralNetworkModule, torch.nn.modules.module.Module]) – Actor network module.

  • critic (Union[machin.model.nets.base.NeuralNetworkModule, torch.nn.modules.module.Module]) – Critic network module.

  • criterion (Callable) – Criterion used to evaluate the value loss.

  • grad_server (Tuple[machin.parallel.server.param_server.PushPullGradServer, machin.parallel.server.param_server.PushPullGradServer]) – Custom gradient sync server accessors, the first server accessor is for actor, and the second one is for critic.

  • batch_size (int) – Batch size used during training.

  • actor_update_times (int) – Times to update actor in update().

  • critic_update_times (int) – Times to update critic in update().

  • entropy_weight (float) – Weight of entropy in your loss function, a positive entropy weight will minimize entropy, while a negative one will maximize entropy.

  • value_weight (float) – Weight of critic value loss.

  • gradient_max (float) – Maximum gradient.

  • gae_lambda (float) – \(\lambda\) used in generalized advantage estimation.

  • discount (float) – \(\gamma\) used in the bellman function.

  • normalize_advantage (bool) – Whether to normalize the advantage function.

  • replay_size (int) – Replay buffer size. Not compatible with replay_buffer.

  • replay_device (Union[str, torch.device]) – Device where the replay buffer is located. Not compatible with replay_buffer.

  • replay_buffer (machin.frame.buffers.buffer.Buffer) – Custom replay buffer.

  • visualize (bool) – Whether to visualize the network flow in the first pass.

  • visualize_dir (str) – Visualized graph save directory.

act(state, **__)[source]

Use the actor network to produce a policy for the current state.

Returns

Anything produced by actor.

Parameters

state (Dict[str, Any]) –

enable_multiprocessing()

Enable multiprocessing for all modules.

classmethod generate_config(config)[source]
Parameters

config (Union[Dict[str, Any], machin.utils.conf.Config]) –

classmethod get_restorable_model_names()

Get attribute name of restorable nn models.

classmethod get_top_model_names()

Get attribute name of top level nn models.

classmethod init_from_config(config, model_device='cpu')[source]
Parameters
  • config (Union[Dict[str, Any], machin.utils.conf.Config]) –

  • model_device (Union[str, torch.device]) –

classmethod is_distributed()[source]

Whether this framework is a distributed framework which requires multiple processes to run and depends on torch.distributed or torch.distributed.rpc

load(model_dir, network_map=None, version=-1)

Load models.

An example of network map:

{"restorable_model_1": "file_name_1",
 "restorable_model_2": "file_name_2"}

Get keys by calling <Class name>.get_restorable()

Parameters
  • model_dir (str) – Save directory.

  • network_map (Dict[str, str]) – Key is module name, value is saved name.

  • version (int) – Version number of the save to be loaded.

manual_sync()[source]
save(model_dir, network_map=None, version=0)

Save models.

An example of network map:

{"restorable_model_1": "file_name_1",
 "restorable_model_2": "file_name_2"}

Get keys by calling <Class name>.get_restorable()

Parameters
  • model_dir (str) – Save directory.

  • network_map (Dict[str, str]) – Key is module name, value is saved name.

  • version (int) – Version number of the new save.

set_backward_function(backward_func)

Replace the default backward function with a custom function. The default loss backward function is torch.autograd.backward

Parameters

backward_func (Callable) –

set_sync(is_syncing)[source]
store_episode(episode)

Add a full episode of transition samples to the replay buffer.

“value” and “gae” are automatically calculated.

Parameters

episode (List[Union[machin.frame.transition.Transition, Dict]]) –

store_transition(transition)

Add a transition sample to the replay buffer.

Not suggested, since you will have to calculate “value” and “gae” by yourself.

Parameters

transition (Union[machin.frame.transition.Transition, Dict]) –

update(update_value=True, update_policy=True, concatenate_samples=True, **__)[source]

Update network weights by sampling from buffer. Buffer will be cleared after update is finished.

Parameters
  • update_value – Whether to update the critic network.

  • update_policy – Whether to update the actor network.

  • concatenate_samples – Whether to concatenate the samples.

Returns

mean value of estimated policy value, value loss

update_lr_scheduler()

Update learning rate schedulers.

visualize_model(final_tensor, name, directory)
Parameters
  • final_tensor (torch.Tensor) –

  • name (str) –

  • directory (str) –

property backward_function
property lr_schedulers
property optimizers
property restorable_models
property top_models

PPO

class machin.frame.algorithms.ppo.PPO(actor, critic, optimizer, criterion, *_, lr_scheduler=None, lr_scheduler_args=(), lr_scheduler_kwargs=(), batch_size=100, actor_update_times=5, critic_update_times=10, actor_learning_rate=0.001, critic_learning_rate=0.001, entropy_weight=None, value_weight=0.5, surrogate_loss_clip=0.2, gradient_max=inf, gae_lambda=1.0, discount=0.99, normalize_advantage=True, replay_size=500000, replay_device='cpu', replay_buffer=None, visualize=False, visualize_dir='', **__)[source]

Bases: machin.frame.algorithms.a2c.A2C

PPO framework.

See also

A2C

Parameters
  • actor (Union[machin.model.nets.base.NeuralNetworkModule, torch.nn.modules.module.Module]) – Actor network module.

  • critic (Union[machin.model.nets.base.NeuralNetworkModule, torch.nn.modules.module.Module]) – Critic network module.

  • optimizer (Callable) – Optimizer used to optimize actor and critic.

  • criterion (Callable) – Criterion used to evaluate the value loss.

  • lr_scheduler (Callable) – Learning rate scheduler of optimizer.

  • lr_scheduler_args (Tuple[Tuple, Tuple]) – Arguments of the learning rate scheduler.

  • lr_scheduler_kwargs (Tuple[Dict, Dict]) – Keyword arguments of the learning rate scheduler.

  • batch_size (int) – Batch size used during training.

  • actor_update_times (int) – Times to update actor in update().

  • critic_update_times (int) – Times to update critic in update().

  • actor_learning_rate (float) – Learning rate of the actor optimizer, not compatible with lr_scheduler.

  • critic_learning_rate (float) – Learning rate of the critic optimizer, not compatible with lr_scheduler.

  • entropy_weight (float) – Weight of entropy in your loss function, a positive entropy weight will minimize entropy, while a negative one will maximize entropy.

  • value_weight (float) – Weight of critic value loss.

  • surrogate_loss_clip (float) – Surrogate loss clipping parameter in PPO (see the sketch after this parameter list).

  • gradient_max (float) – Maximum gradient.

  • gae_lambda (float) – \(\lambda\) used in generalized advantage estimation.

  • discount (float) – \(\gamma\) used in the bellman function.

  • replay_size (int) – Replay buffer size. Not compatible with replay_buffer.

  • replay_device (Union[str, torch.device]) – Device where the replay buffer is located. Not compatible with replay_buffer.

  • replay_buffer (machin.frame.buffers.buffer.Buffer) – Custom replay buffer.

  • visualize (bool) – Whether to visualize the network flow in the first pass.

  • visualize_dir (str) – Visualized graph save directory.

  • normalize_advantage (bool) –
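
As referenced under surrogate_loss_clip above, a rough sketch of the standard PPO clipped surrogate loss (general formulation, not the framework's internal code) is:

import torch as t

def clipped_surrogate_loss(new_log_prob, old_log_prob, advantage,
                           surrogate_loss_clip=0.2):
    # Importance ratio between the current policy and the sampling policy.
    ratio = t.exp(new_log_prob - old_log_prob)
    surrogate_1 = ratio * advantage
    surrogate_2 = t.clamp(ratio,
                          1.0 - surrogate_loss_clip,
                          1.0 + surrogate_loss_clip) * advantage
    # Maximize the clipped surrogate objective, i.e. minimize its negation.
    return -t.min(surrogate_1, surrogate_2).mean()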

act(state, *_, **__)

Use the actor network to produce a policy for the current state.

Returns

Anything produced by actor.

Parameters

Any] state (Dict[str,) –

enable_multiprocessing()

Enable multiprocessing for all modules.

classmethod generate_config(config)[source]
Parameters

config (Union[Dict[str, Any], machin.utils.conf.Config]) –

classmethod get_restorable_model_names()

Get attribute name of restorable nn models.

classmethod get_top_model_names()

Get attribute name of top level nn models.

classmethod init_from_config(config, model_device='cpu')
Parameters
  • config (Union[Dict[str, Any], machin.utils.conf.Config]) –

  • model_device (Union[str, torch.device]) –

classmethod is_distributed()

Whether this framework is a distributed framework which requires multiple processes to run and depends on torch.distributed or torch.distributed.rpc

load(model_dir, network_map=None, version=-1)

Load models.

An example of network map:

{"restorable_model_1": "file_name_1",
 "restorable_model_2": "file_name_2"}

Get keys by calling <Class name>.get_restorable()

Parameters
  • model_dir (str) – Save directory.

  • network_map (Dict[str, str]) – Key is module name, value is saved name.

  • version (int) – Version number of the save to be loaded.

save(model_dir, network_map=None, version=0)

Save models.

An example of network map:

{"restorable_model_1": "file_name_1",
 "restorable_model_2": "file_name_2"}

Get keys by calling <Class name>.get_restorable()

Parameters
  • model_dir (str) – Save directory.

  • network_map (Dict[str, str]) – Key is module name, value is saved name.

  • version (int) – Version number of the new save.

set_backward_function(backward_func)

Replace the default backward function with a custom function. The default loss backward function is torch.autograd.backward

Parameters

backward_func (Callable) –

store_episode(episode)

Add a full episode of transition samples to the replay buffer.

“value” and “gae” are automatically calculated.

Parameters

episode (List[Union[machin.frame.transition.Transition, Dict]]) –

store_transition(transition)

Add a transition sample to the replay buffer.

Not suggested, since you will have to calculate “value” and “gae” by yourself.

Parameters

transition (Union[machin.frame.transition.Transition, Dict]) –

update(update_value=True, update_policy=True, concatenate_samples=True, **__)[source]

Update network weights by sampling from buffer. Buffer will be cleared after update is finished.

Parameters
  • update_value – Whether to update the critic network.

  • update_policy – Whether to update the actor network.

  • concatenate_samples – Whether to concatenate the samples.

Returns

mean value of estimated policy value, value loss

update_lr_scheduler()

Update learning rate schedulers.

visualize_model(final_tensor, name, directory)
Parameters
  • final_tensor (torch.Tensor) –

  • name (str) –

  • directory (str) –

property backward_function
property lr_schedulers
property optimizers
property restorable_models
property top_models

SAC

class machin.frame.algorithms.sac.SAC(actor, critic, critic_target, critic2, critic2_target, optimizer, criterion, *_, lr_scheduler=None, lr_scheduler_args=None, lr_scheduler_kwargs=None, target_entropy=None, initial_entropy_alpha=1.0, batch_size=100, update_rate=0.005, update_steps=None, actor_learning_rate=0.0005, critic_learning_rate=0.001, alpha_learning_rate=0.001, discount=0.99, gradient_max=inf, replay_size=500000, replay_device='cpu', replay_buffer=None, visualize=False, visualize_dir='', **__)[source]

Bases: machin.frame.algorithms.base.TorchFramework

SAC framework.

See also

A2C DDPG

Important

When given a state, and an optional action, the actor must at least return two values, similar to the actor structure described in A2C. However, when the actor is asked to select an action based on the current state, you must make sure that the sampling process is differentiable, e.g. use the rsample method of torch distributions instead of the sample method.

Compared to other actor-critic methods, SAC embeds the entropy term into its reward function directly, rather than adding the entropy term to the actor’s loss function. Therefore, we do not use the entropy output of your actor network.

The SAC algorithm uses Q networks as critics, so please refer to DDPG for the requirements and the definition of action_trans_func.
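
A minimal sketch of an actor whose sampling step stays differentiable via rsample (the layer sizes and the [-2, 2] action bound mirror the A2C example and are illustrative assumptions):

import torch as t
import torch.nn as nn
import torch.nn.functional as F
from torch.distributions import Normal

class SACActorNet(nn.Module):
    def __init__(self, state_dim=3, action_dim=1):
        super().__init__()
        self.fc = nn.Linear(state_dim, 100)
        self.mu_head = nn.Linear(100, action_dim)
        self.sigma_head = nn.Linear(100, action_dim)

    def forward(self, state, action=None):
        x = t.relu(self.fc(state))
        mu = 2.0 * t.tanh(self.mu_head(x))
        sigma = F.softplus(self.sigma_head(x))
        dist = Normal(mu, sigma)
        # rsample() keeps gradients flowing through the sampling step,
        # which sample() would not.
        action = action if action is not None else dist.rsample()
        action_entropy = dist.entropy().sum(dim=1, keepdim=True)
        action_log_prob = dist.log_prob(action).sum(dim=1, keepdim=True)
        action = action.clamp(-2.0, 2.0)
        return action, action_log_prob, action_entropy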

Parameters
  • actor (Union[machin.model.nets.base.NeuralNetworkModule, torch.nn.modules.module.Module]) – Actor network module.

  • critic (Union[machin.model.nets.base.NeuralNetworkModule, torch.nn.modules.module.Module]) – Critic network module.

  • critic_target (Union[machin.model.nets.base.NeuralNetworkModule, torch.nn.modules.module.Module]) – Target critic network module.

  • critic2 (Union[machin.model.nets.base.NeuralNetworkModule, torch.nn.modules.module.Module]) – The second critic network module.

  • critic2_target (Union[machin.model.nets.base.NeuralNetworkModule, torch.nn.modules.module.Module]) – The second target critic network module.

  • optimizer (Callable) – Optimizer used to optimize actor, critic and critic2.

  • criterion (Callable) – Criterion used to evaluate the value loss.

  • *_

  • lr_scheduler (Callable) – Learning rate scheduler of optimizer.

  • lr_scheduler_args (Tuple[Tuple, Tuple, Tuple]) – Arguments of the learning rate scheduler.

  • lr_scheduler_kwargs (Tuple[Dict, Dict, Dict]) – Keyword arguments of the learning rate scheduler.

  • target_entropy (float) – Target entropy weight \(\alpha\) used in the SAC soft value function: \(V_{soft}(s_t) = \mathbb{E}_{a_t\sim\pi}[Q_{soft}(s_t,a_t) - \alpha \log\pi(a_t|s_t)]\)

  • initial_entropy_alpha (float) – Initial entropy weight \(\alpha\)

  • gradient_max (float) – Maximum gradient.

  • batch_size (int) – Batch size used during training.

  • update_rate (float) –

    \(\tau\) used to update target networks. Target parameters are updated as:

    \(\theta_t = \theta * \tau + \theta_t * (1 - \tau)\)

  • update_steps (Optional[int]) – Training step number used to update target networks.

  • actor_learning_rate (float) – Learning rate of the actor optimizer, not compatible with lr_scheduler.

  • critic_learning_rate (float) – Learning rate of the critic optimizer, not compatible with lr_scheduler.

  • discount (float) – \(\gamma\) used in the bellman function.

  • replay_size (int) – Replay buffer size. Not compatible with replay_buffer.

  • replay_device (Union[str, torch.device]) – Device where the replay buffer is located. Not compatible with replay_buffer.

  • replay_buffer (machin.frame.buffers.buffer.Buffer) – Custom replay buffer.

  • visualize (bool) – Whether to visualize the network flow in the first pass.

  • visualize_dir (str) – Visualized graph save directory.

  • alpha_learning_rate (float) –

act(state, **__)[source]

Use actor network to produce an action for the current state.

Returns

Anything produced by actor.

Parameters

state (Dict[str, Any]) –

static action_transform_function(raw_output_action, *_)[source]
enable_multiprocessing()

Enable multiprocessing for all modules.

classmethod generate_config(config)[source]
Parameters

config (Union[Dict[str, Any], machin.utils.conf.Config]) –

classmethod get_restorable_model_names()

Get attribute name of restorable nn models.

classmethod get_top_model_names()

Get attribute name of top level nn models.

classmethod init_from_config(config, model_device='cpu')[source]
Parameters
  • config (Union[Dict[str, Any], machin.utils.conf.Config]) –

  • model_device (Union[str, torch.device]) –

classmethod is_distributed()

Whether this framework is a distributed framework which requires multiple processes to run and depends on torch.distributed or torch.distributed.rpc

load(model_dir, network_map=None, version=-1)[source]

Load models.

An example of network map:

{"restorable_model_1": "file_name_1",
 "restorable_model_2": "file_name_2"}

Get keys by calling <Class name>.get_restorable()

Parameters
  • model_dir – Save directory.

  • network_map – Key is module name, value is saved name.

  • version – Version number of the save to be loaded.

static reward_function(reward, discount, next_value, terminal, _)[source]
save(model_dir, network_map=None, version=0)

Save models.

An example of network map:

{"restorable_model_1": "file_name_1",
 "restorable_model_2": "file_name_2"}

Get keys by calling <Class name>.get_restorable()

Parameters
  • model_dir (str) – Save directory.

  • network_map (Dict[str, str]) – Key is module name, value is saved name.

  • version (int) – Version number of the new save.

set_backward_function(backward_func)

Replace the default backward function with a custom function. The default loss backward function is torch.autograd.backward

Parameters

backward_func (Callable) –

store_episode(episode)[source]

Add a full episode of transition samples to the replay buffer.

Parameters

episode (List[Union[machin.frame.transition.Transition, Dict]]) –

store_transition(transition)[source]

Add a transition sample to the replay buffer.

Parameters

transition (Union[machin.frame.transition.Transition, Dict]) –

update(update_value=True, update_policy=True, update_target=True, update_entropy_alpha=True, concatenate_samples=True, **__)[source]

Update network weights by sampling from replay buffer.

Parameters
  • update_value – Whether to update the Q network.

  • update_policy – Whether to update the actor network.

  • update_target – Whether to update targets.

  • update_entropy_alpha – Whether to update \(\alpha\) of entropy.

  • concatenate_samples – Whether to concatenate the samples.

Returns

mean value of estimated policy value, value loss

update_lr_scheduler()[source]

Update learning rate schedulers.

visualize_model(final_tensor, name, directory)
Parameters
  • final_tensor (torch.Tensor) –

  • name (str) –

  • directory (str) –

property backward_function
property lr_schedulers
property optimizers
property restorable_models
property top_models

APEX

class machin.frame.algorithms.apex.DDPGApex(actor, actor_target, critic, critic_target, optimizer, criterion, apex_group, model_server, *_, lr_scheduler=None, lr_scheduler_args=(), lr_scheduler_kwargs=(), batch_size=100, update_rate=0.005, update_steps=None, actor_learning_rate=0.0005, critic_learning_rate=0.001, discount=0.99, gradient_max=inf, replay_size=500000, **__)[source]

Bases: machin.frame.algorithms.ddpg_per.DDPGPer

Massively parallel version of a DDPG with prioritized replay.

The pull function is invoked before using act, act_with_noise, act_discrete, act_discrete_with_noise and criticize.

The push function is invoked after update.

See also

DDPGPer

Note

The Apex framework supports multiple workers (samplers) and only one trainer; you may use DistributedDataParallel in the trainer. If you use DistributedDataParallel, you must call update() in all member processes of DistributedDataParallel.

Parameters
  • actor (Union[machin.model.nets.base.NeuralNetworkModule, torch.nn.modules.module.Module]) – Actor network module.

  • actor_target (Union[machin.model.nets.base.NeuralNetworkModule, torch.nn.modules.module.Module]) – Target actor network module.

  • critic (Union[machin.model.nets.base.NeuralNetworkModule, torch.nn.modules.module.Module]) – Critic network module.

  • critic_target (Union[machin.model.nets.base.NeuralNetworkModule, torch.nn.modules.module.Module]) – Target critic network module.

  • optimizer (Callable) – Optimizer used to optimize qnet.

  • criterion (Callable) – Criterion used to evaluate the value loss.

  • apex_group (machin.parallel.distributed.world.RpcGroup) – Group of all processes using the apex-DDPG framework, including all samplers and trainers.

  • model_server (Tuple[machin.parallel.server.param_server.PushPullModelServer]) – Custom model sync server accessor for actor.

  • lr_scheduler (Callable) – Learning rate scheduler of optimizer.

  • lr_scheduler_args (Tuple[Tuple, Tuple]) – Arguments of the learning rate scheduler.

  • lr_scheduler_kwargs (Tuple[Dict, Dict]) – Keyword arguments of the learning rate scheduler.

  • batch_size (int) – Batch size used during training.

  • update_rate (float) –

    \(\tau\) used to update target networks. Target parameters are updated as:

    \(\theta_t = \theta * \tau + \theta_t * (1 - \tau)\)

  • update_steps (Optional[int]) – Training step number used to update target networks.

  • actor_learning_rate (float) – Learning rate of the actor optimizer, not compatible with lr_scheduler.

  • critic_learning_rate (float) – Learning rate of the critic optimizer, not compatible with lr_scheduler.

  • discount (float) – \(\gamma\) used in the bellman function.

  • gradient_max (float) – Maximum gradient.

  • replay_size (int) – Local replay buffer size of a single worker.

act(state, use_target=False, **__)[source]

Use actor network to produce an action for the current state.

Parameters
  • state (Dict[str, Any]) – Current state.

  • use_target (bool) – Whether to use the target network.

Returns

Anything returned by your actor network.

act_discrete(state, use_target=False, **__)[source]

Use actor network to produce a discrete action for the current state.

Notes

The actor network must output a probability tensor of shape (batch_size, action_dims), where each row in dimension 1 sums to 1.

Parameters
  • state (Dict[str, Any]) – Current state.

  • use_target (bool) – Whether to use the target network.

Returns

Action of shape [batch_size, 1]. Action probability tensor of shape [batch_size, action_num], produced by your actor. Any other things returned by your actor network, if they exist.
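
A minimal sketch of an actor satisfying the probability-output requirement noted above (layer sizes are illustrative assumptions):

import torch as t
import torch.nn as nn

class DiscreteActor(nn.Module):
    def __init__(self, state_dim=4, action_num=2):
        super().__init__()
        self.fc1 = nn.Linear(state_dim, 100)
        self.fc2 = nn.Linear(100, action_num)

    def forward(self, state):
        x = t.relu(self.fc1(state))
        # Each row sums to 1, as required by act_discrete and
        # act_discrete_with_noise.
        return t.softmax(self.fc2(x), dim=1)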

act_discrete_with_noise(state, use_target=False, **__)[source]

Use actor network to produce a noisy discrete action for the current state.

Notes

The actor network must output a probability tensor of shape (batch_size, action_dims), where each row in dimension 1 sums to 1.

Parameters
  • state (Dict[str, Any]) – Current state.

  • use_target (bool) – Whether to use the target network.

  • choose_max_prob – Probability to choose the largest component when the actor is outputting an extreme probability vector like [0, 1, 0, 0].

Returns

Noisy action of shape [batch_size, 1]. Action probability tensor of shape [batch_size, action_num]. Any other things returned by your actor network, if they exist.

act_with_noise(state, noise_param=(0.0, 1.0), ratio=1.0, mode='uniform', use_target=False, **__)[source]

Use actor network to produce a noisy action for the current state.

Parameters
  • state (Dict[str, Any]) – Current state.

  • noise_param (Tuple) – Noise params.

  • ratio (float) – Noise ratio.

  • mode (str) – Noise mode. Supported are: "uniform", "normal", "clipped_normal", "ou"

  • use_target (bool) – Whether to use the target network.

Returns

Noisy action of shape [batch_size, action_dim]. Any other things returned by your actor network, if they exist.

static action_transform_function(raw_output_action, *_)

The action transform function is used to transform the output of actor to the input of critic. Action transform function must accept:

  1. Raw action from the actor model.

  2. Concatenated Transition.next_state.

  3. Any other concatenated lists of custom keys from Transition.

and returns:
  1. A dictionary with the same form as Transition.action

Parameters

raw_output_action (Any) – Raw action from the actor model.

enable_multiprocessing()

Enable multiprocessing for all modules.

classmethod generate_config(config)[source]
Parameters

config (Union[Dict[str, Any], machin.utils.conf.Config]) –

classmethod get_restorable_model_names()

Get attribute name of restorable nn models.

classmethod get_top_model_names()

Get attribute name of top level nn models.

classmethod init_from_config(config, model_device='cpu')[source]
Parameters
  • config (Union[Dict[str, Any], machin.utils.conf.Config]) –

  • model_device (Union[str, torch.device]) –

classmethod is_distributed()[source]

Whether this framework is a distributed framework which requires multiple processes to run and depends on torch.distributed or torch.distributed.rpc

load(model_dir, network_map=None, version=-1)

Load models.

An example of network map:

{"restorable_model_1": "file_name_1",
 "restorable_model_2": "file_name_2"}

Get keys by calling <Class name>.get_restorable()

Parameters
  • model_dir (str) – Save directory.

  • network_map (Dict[str, str]) – Key is module name, value is saved name.

  • version (int) – Version number of the save to be loaded.

manual_sync()[source]
static reward_function(reward, discount, next_value, terminal, _)
save(model_dir, network_map=None, version=0)

Save models.

An example of network map:

{"restorable_model_1": "file_name_1",
 "restorable_model_2": "file_name_2"}

Get keys by calling <Class name>.get_restorable()

Parameters
  • model_dir (str) – Save directory.

  • network_map (Dict[str, str]) – Key is module name, value is saved name.

  • version (int) – Version number of the new save.

set_backward_function(backward_func)

Replace the default backward function with a custom function. The default loss backward function is torch.autograd.backward

Parameters

backward_func (Callable) –

set_sync(is_syncing)[source]
store_episode(episode)

Add a full episode of transition samples to the replay buffer.

Parameters

episode (List[Union[machin.frame.transition.Transition, Dict]]) –

store_transition(transition)

Add a transition sample to the replay buffer.

Parameters

transition (Union[machin.frame.transition.Transition, Dict]) –

update(update_value=True, update_policy=True, update_target=True, concatenate_samples=True, **__)[source]

Update network weights by sampling from replay buffer.

Parameters
  • update_value – Whether to update the Q network.

  • update_policy – Whether to update the actor network.

  • update_target – Whether to update targets.

  • concatenate_samples – Whether to concatenate the samples.

Returns

mean value of estimated policy value, value loss

update_lr_scheduler()

Update learning rate schedulers.

visualize_model(final_tensor, name, directory)
Parameters
  • final_tensor (torch.Tensor) –

  • name (str) –

  • directory (str) –

property backward_function
property lr_schedulers
property optimizers
property restorable_models
property top_models
class machin.frame.algorithms.apex.DQNApex(qnet, qnet_target, optimizer, criterion, apex_group, model_server, *_, lr_scheduler=None, lr_scheduler_args=(), lr_scheduler_kwargs=(), batch_size=100, epsilon_decay=0.9999, update_rate=0.005, update_steps=None, learning_rate=0.001, discount=0.99, gradient_max=inf, replay_size=500000, **__)[source]

Bases: machin.frame.algorithms.dqn_per.DQNPer

Massively parallel version of a Double DQN with prioritized replay.

The pull function is invoked before using act_discrete, act_discrete_with_noise and criticize.

The push function is invoked after update.

See also

DQNPer

Note

The Apex framework supports multiple workers (samplers) and only one trainer; you may use DistributedDataParallel in the trainer. If you use DistributedDataParallel, you must call update() in all member processes of DistributedDataParallel.

Parameters
  • qnet (Union[machin.model.nets.base.NeuralNetworkModule, torch.nn.modules.module.Module]) – Q network module.

  • qnet_target (Union[machin.model.nets.base.NeuralNetworkModule, torch.nn.modules.module.Module]) – Target Q network module.

  • optimizer (Callable) – Optimizer used to optimize qnet.

  • criterion (Callable) – Criterion used to evaluate the value loss.

  • apex_group (machin.parallel.distributed.world.RpcGroup) – Group of all processes using the apex-DQN framework, including all samplers and trainers.

  • model_server (Tuple[machin.parallel.server.param_server.PushPullModelServer]) – Custom model sync server accessor for qnet.

  • lr_scheduler (Callable) – Learning rate scheduler of optimizer.

  • lr_scheduler_args (Tuple[Tuple]) – Arguments of the learning rate scheduler.

  • lr_scheduler_kwargs (Tuple[Dict]) – Keyword arguments of the learning rate scheduler.

  • batch_size (int) – Batch size used during training.

  • epsilon_decay (float) – Epsilon decay rate per acting with noise step. epsilon attribute is multiplied with this every time act_discrete_with_noise is called.

  • update_rate (float) –

    \(\tau\) used to update target networks. Target parameters are updated as:

    \(\theta_t = \theta * \tau + \theta_t * (1 - \tau)\)

  • update_steps (Optional[int]) – Training step number used to update target networks.

  • learning_rate (float) – Learning rate of the optimizer, not compatible with lr_scheduler.

  • discount (float) – \(\gamma\) used in the bellman function.

  • gradient_max (float) – Maximum gradient.

  • replay_size (int) – Local replay buffer size of a single worker.

act_discrete(state, use_target=False, **__)[source]

Use Q network to produce a discrete action for the current state.

Parameters
  • state (Dict[str, Any]) – Current state.

  • use_target (bool) – Whether to use the target network.

Returns

Action of shape [batch_size, 1], and any other things returned by your Q network, if they exist.

act_discrete_with_noise(state, use_target=False, decay_epsilon=True, **__)[source]

Randomly selects an action from the action space according to a uniform distribution, with regard to the epsilon decay policy.

Parameters
  • state (Dict[str, Any]) – Current state.

  • use_target (bool) – Whether to use the target network.

  • decay_epsilon (bool) – Whether to decay the epsilon attribute.

Returns

Noisy action of shape [batch_size, 1], and any other things returned by your Q network, if they exist.

static action_get_function(sampled_actions)

This function is used to get action numbers (int tensor indicating which discrete actions are used) from the sampled action dictionary.

enable_multiprocessing()

Enable multiprocessing for all modules.

classmethod generate_config(config)[source]
Parameters

config (Dict[str, Any]) –

classmethod get_restorable_model_names()

Get attribute name of restorable nn models.

classmethod get_top_model_names()

Get attribute name of top level nn models.

classmethod init_from_config(config, model_device='cpu')[source]
Parameters
  • config (Union[Dict[str, Any], machin.utils.conf.Config]) –

  • model_device (Union[str, torch.device]) –

classmethod is_distributed()[source]

Whether this framework is a distributed framework which requires multiple processes to run and depends on torch.distributed or torch.distributed.rpc

load(model_dir, network_map=None, version=-1)

Load models.

An example of network map:

{"restorable_model_1": "file_name_1",
 "restorable_model_2": "file_name_2"}

Get keys by calling <Class name>.get_restorable()

Parameters
  • model_dir – Save directory.

  • network_map – Key is module name, value is saved name.

  • version – Version number of the save to be loaded.

manual_sync()[source]
static reward_function(reward, discount, next_value, terminal, _)
save(model_dir, network_map=None, version=0)

Save models.

An example of network map:

{"restorable_model_1": "file_name_1",
 "restorable_model_2": "file_name_2"}

Get keys by calling <Class name>.get_restorable()

Parameters
  • model_dir (str) – Save directory.

  • network_map (Dict[str, str]) – Key is module name, value is saved name.

  • version (int) – Version number of the new save.

set_backward_function(backward_func)

Replace the default backward function with a custom function. The default loss backward function is torch.autograd.backward

Parameters

backward_func (Callable) –

set_sync(is_syncing)[source]
store_episode(episode)

Add a full episode of transition samples to the replay buffer.

Parameters

episode (List[Union[machin.frame.transition.Transition, Dict]]) –

store_transition(transition)

Add a transition sample to the replay buffer.

Parameters

transition (Union[machin.frame.transition.Transition, Dict]) –

update(update_value=True, update_target=True, concatenate_samples=True, **__)[source]

Update network weights by sampling from replay buffer.

Parameters
  • update_value – Whether to update the Q network.

  • update_target – Whether to update targets.

  • concatenate_samples – Whether to concatenate the samples.

Returns

mean value of estimated policy value, value loss

update_lr_scheduler()

Update learning rate schedulers.

visualize_model(final_tensor, name, directory)
Parameters
  • final_tensor (torch.Tensor) –

  • name (str) –

  • directory (str) –

property backward_function
property lr_schedulers
property optimizers
property restorable_models
property top_models

IMPALA

class machin.frame.algorithms.impala.EpisodeDistributedBuffer(buffer_name, group, buffer_size, *_, **__)[source]

Bases: machin.frame.buffers.buffer_d.DistributedBuffer

A distributed buffer which stores each episode as a transition object inside the buffer.

Create a distributed replay buffer instance.

To avoid issues caused by tensor device difference, all transition objects are stored in device “cpu”.

A distributed replay buffer consists of many local buffers held per process; transmissions between processes only happen during sampling.

During sampling, the tensors in the “state”, “action” and “next_state” dictionaries, along with “reward”, will be concatenated in dimension 0. Any other custom keys specified in **kwargs will not be concatenated.

See also

Buffer

Note

Since append() operates on the local buffer, in order to append to the distributed buffer correctly, please make sure that your actor is also the local buffer holder, i.e. a member of the group

Parameters
  • buffer_size (int) – Maximum local buffer size.

  • group (machin.parallel.distributed.world.RpcGroup) – Process group which holds this buffer.

  • buffer_name (str) – A unique name of your buffer.

append(transition, required_attrs=('state', 'action', 'next_state', 'reward', 'terminal', 'action_log_prob'))[source]

Store a transition object to buffer.

Parameters
  • transition (Dict) – A transition object.

  • required_attrs – Required attributes. Could be an empty tuple if no attribute is required.

Raises

ValueError – if the transition object doesn’t have the required attributes specified in required_attrs, or has different attributes compared to other transition objects stored in the buffer.

class machin.frame.algorithms.impala.EpisodeTransition(state, action, next_state, reward, terminal, **kwargs)[source]

Bases: machin.frame.transition.Transition

A transition class which allows storing the whole episode as a single transition object; the batch dimension will be used to stack all transition steps.

Parameters
  • state (Dict[str, torch.Tensor]) – Previous observed state.

  • action (Dict[str, torch.Tensor]) – Action of agent.

  • next_state (Dict[str, torch.Tensor]) – Next observed state.

  • reward (Union[float, torch.Tensor]) – Reward of agent.

  • terminal (bool) – Whether environment has reached terminal state.

  • **kwargs – Custom attributes. They are ordered in the alphabetic order (provided by sort()) when you call keys().

Note

You should not store any tensor inside **kwargs as they will not be moved to the sample output device.
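
A minimal sketch of assembling an EpisodeTransition by stacking per-step tensors along the batch dimension (the "some_state" key, tensor shapes, and the 3-step episode length are illustrative assumptions):

import torch as t
from machin.frame.algorithms.impala import EpisodeTransition

# Three steps of a fake episode; each per-step tensor has a batch size of 1.
states = [t.zeros(1, 4) for _ in range(3)]
actions = [t.zeros(1, 1) for _ in range(3)]
next_states = [t.zeros(1, 4) for _ in range(3)]

episode_transition = EpisodeTransition(
    state={"some_state": t.cat(states, dim=0)},            # shape [3, 4]
    action={"action": t.cat(actions, dim=0)},              # shape [3, 1]
    next_state={"some_state": t.cat(next_states, dim=0)},  # shape [3, 4]
    reward=t.zeros(3, 1),
    terminal=True,
)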

class machin.frame.algorithms.impala.IMPALA(actor, critic, optimizer, criterion, impala_group, model_server, *_, lr_scheduler=None, lr_scheduler_args=(), lr_scheduler_kwargs=(), batch_size=5, learning_rate=0.001, isw_clip_c=1.0, isw_clip_rho=1.0, entropy_weight=None, value_weight=0.5, gradient_max=inf, discount=0.99, replay_size=500, **__)[source]

Bases: machin.frame.algorithms.base.TorchFramework

Massively parallel IMPALA framework.

Note

Please make sure isw_clip_rho >= isw_clip_c
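
A rough sketch of how these two constants enter the v-trace importance weights (general v-trace formulation, not the framework's internal code; log probabilities of the current and the behaviour policy are assumed to be available):

import torch as t

def vtrace_importance_weights(curr_log_prob, behaviour_log_prob,
                              isw_clip_rho=1.0, isw_clip_c=1.0):
    # Importance sampling ratio pi(a_t|s_t) / mu(a_t|s_t).
    ratio = t.exp(curr_log_prob - behaviour_log_prob)
    rho = t.clamp(ratio, max=isw_clip_rho)  # clipped weight for the TD error
    c = t.clamp(ratio, max=isw_clip_c)      # clipped weight for the trace
    return rho, c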

Parameters
  • actor (Union[machin.model.nets.base.NeuralNetworkModule, torch.nn.modules.module.Module]) – Actor network module.

  • critic (Union[machin.model.nets.base.NeuralNetworkModule, torch.nn.modules.module.Module]) – Critic network module.

  • optimizer (Callable) – Optimizer used to optimize actor and critic.

  • criterion (Callable) – Criterion used to evaluate the value loss.

  • impala_group (machin.parallel.distributed.world.RpcGroup) – Group of all processes using the IMPALA framework, including all samplers and trainers.

  • model_server (Tuple[machin.parallel.server.param_server.PushPullModelServer]) – Custom model sync server accessor for actor.

  • lr_scheduler (Callable) – Learning rate scheduler of optimizer.

  • lr_scheduler_args (Tuple[Tuple, Tuple]) – Arguments of the learning rate scheduler.

  • lr_scheduler_kwargs (Tuple[Dict, Dict]) – Keyword arguments of the learning rate scheduler.

  • batch_size (int) – Batch size used during training.

  • learning_rate (float) – Learning rate of the optimizer, not compatible with lr_scheduler.

  • isw_clip_c (float) – \(c\) used in importance weight clipping.

  • isw_clip_rho (float) –

  • entropy_weight (float) – Weight of entropy in your loss function, a positive entropy weight will minimize entropy, while a negative one will maximize entropy.

  • value_weight (float) – Weight of critic value loss.

  • gradient_max (float) – Maximum gradient.

  • discount (float) – \(\gamma\) used in the bellman function.

  • replay_size (int) – Size of the local replay buffer.

act(state, *_, **__)[source]

Use the actor network to produce a policy for the current state.

Returns

Anything produced by actor.

Parameters

state (Dict[str, Any]) –

classmethod generate_config(config)[source]
Parameters

config (Union[Dict[str, Any], machin.utils.conf.Config]) –

classmethod init_from_config(config, model_device='cpu')[source]
Parameters
  • config (Union[Dict[str, Any], machin.utils.conf.Config]) –

  • model_device (Union[str, torch.device]) –

classmethod is_distributed()[source]

Whether this framework is a distributed framework which requires multiple processes to run and depends on torch.distributed or torch.distributed.rpc

manual_sync()[source]
set_sync(is_syncing)[source]
store_episode(episode)[source]

Add a full episode of transition samples to the replay buffer.

Parameters

episode (List[Union[machin.frame.transition.Transition, Dict]]) –

store_transition(transition)[source]

Warning

Not supported in IMPALA due to v-trace requirements.

Parameters

transition (Union[machin.frame.transition.Transition, Dict]) –

update(update_value=True, update_policy=True, **__)[source]

Update network weights by sampling from replay buffer.

Note

Will always concatenate samples.

Parameters
  • update_value – Whether to update the critic network.

  • update_policy – Whether to update the actor network.

Returns

mean value of estimated policy value, value loss

update_lr_scheduler()[source]

Update learning rate schedulers.

property lr_schedulers
property optimizers

MADDPG

class machin.frame.algorithms.maddpg.MADDPG(actors, actor_targets, critics, critic_targets, optimizer, criterion, *_, lr_scheduler=None, lr_scheduler_args=None, lr_scheduler_kwargs=None, critic_visible_actors=None, sub_policy_num=0, batch_size=100, update_rate=0.001, update_steps=None, actor_learning_rate=0.0005, critic_learning_rate=0.001, discount=0.99, gradient_max=inf, replay_size=500000, replay_device='cpu', replay_buffer=None, visualize=False, visualize_dir='', use_jit=True, pool_type='thread', pool_size=None, **__)[source]

Bases: machin.frame.algorithms.base.TorchFramework

MADDPG is a centralized multi-agent training framework. It alleviates the unstable reward problem caused by the disturbance of other agents by gathering all agents’ observations and training a global critic. This global critic observes all actions and all states from all agents.

See also

DDPG

Note

In order to parallelize agent inference, a process pool is used internally. However, in order to minimize memory copy / CUDA memory copy, the location of all of your models must be either “cpu” or “cuda” (using multiple CUDA devices is supported).

Note

The MADDPG framework does not require all of your actors to be homogeneous. Each pair of your actors and critics could be heterogeneous.

Note

Suppose you have three pairs of actors and critics, with indexes 0, 1, 2. If critic 0 can observe the actions of actors 0 and 1, critic 1 can observe the actions of actors 1 and 2, and critic 2 can observe the actions of actors 2 and 0, then critic_visible_actors should be:

[[0, 1], [1, 2], [2, 0]]
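
Continuing this three-agent example, a minimal sketch of constructing the framework (ActorNet and CriticNet stand for user-defined network modules, and the optimizer and criterion choices are illustrative assumptions):

import torch as t
import torch.nn as nn
from machin.frame.algorithms.maddpg import MADDPG

# ActorNet and CriticNet are assumed user-defined network modules.
actors = [ActorNet() for _ in range(3)]
actor_targets = [ActorNet() for _ in range(3)]
critics = [CriticNet() for _ in range(3)]
critic_targets = [CriticNet() for _ in range(3)]

maddpg = MADDPG(
    actors, actor_targets, critics, critic_targets,
    t.optim.Adam,
    nn.MSELoss(reduction="sum"),
    critic_visible_actors=[[0, 1], [1, 2], [2, 0]],
)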

Note

Learning rate scheduler args and kwargs are specified for each actor and critic: the first list is for actors, and the second list is for critics.

Note

This implementation contains:
  • Ensemble Training

This implementation does not contain:
  • Inferring other agents’ policies

  • Mixed continuous/discrete action spaces

Parameters
  • actors (List[Union[machin.model.nets.base.NeuralNetworkModule, torch.nn.modules.module.Module]]) – Actor network modules.

  • actor_targets (List[Union[machin.model.nets.base.NeuralNetworkModule, torch.nn.modules.module.Module]]) – Target actor network modules.

  • critics (List[Union[machin.model.nets.base.NeuralNetworkModule, torch.nn.modules.module.Module]]) – Critic network modules.

  • critic_targets (List[Union[machin.model.nets.base.NeuralNetworkModule, torch.nn.modules.module.Module]]) – Target critic network modules.

  • optimizer (Callable) – Optimizer used to optimize actors and critics.

  • criterion (Callable) – Criterion used to evaluate the value loss.

  • critic_visible_actors (List[List[int]]) – Indexes of visible actors for each critic. By default, all critics can see the outputs of all actors.

  • sub_policy_num (int) – Number of times to replicate each actor; equals ensemble_policy_num - 1.

  • lr_scheduler (Callable) – Learning rate scheduler of optimizer.

  • lr_scheduler_args (Tuple[List[Tuple], List[Tuple]]) – Arguments of the learning rate scheduler.

  • lr_scheduler_kwargs (Tuple[List[Dict], List[Dict]]) – Keyword arguments of the learning rate scheduler.

  • batch_size (int) – Batch size used during training.

  • update_rate (float) – \(\tau\) used to update target networks. Target parameters are updated as: \(\theta_t = \theta * \tau + \theta_t * (1 - \tau)\)

  • update_steps (Optional[int]) – Training step number used to update target networks.

  • actor_learning_rate (float) – Learning rate of the actor optimizer, not compatible with lr_scheduler.

  • critic_learning_rate (float) – Learning rate of the critic optimizer, not compatible with lr_scheduler.

  • discount (float) – \(\gamma\) used in the bellman function.

  • replay_size (int) – Replay buffer size for each actor. Not compatible with replay_buffer.

  • replay_device (Union[str, torch.device]) – Device where the replay buffer is located. Not compatible with replay_buffer.

  • replay_buffer (machin.frame.buffers.buffer.Buffer) – Custom replay buffer. Will be replicated for each actor.

  • visualize (bool) – Whether to visualize the network flow in the first pass.

  • visualize_dir (str) – Visualized graph save directory.

  • use_jit (bool) – Whether to use torch JIT to perform the forward pass in parallel instead of using the internal pool. Provides a significant speed and efficiency advantage, but requires actors and critics to be convertible to TorchScript.

  • pool_type (str) – Type of the internal execution pool, either “process” or “thread”.

  • pool_size (int) – Size of the internal execution pool.

  • gradient_max (float) –

act(states, use_target=False, **__)[source]

Use all actor networks to produce actions for the current state. A random sub-policy from the policy ensemble of each actor will be chosen.

Parameters
  • states (List[Dict[str, Any]]) – A list of current states of each actor.

  • use_target (bool) – Whether to use the target network.

Returns

A list of anything returned by your actor. If your actor returns multiple values, they will be wrapped in a tuple.

act_discrete(states, use_target=False)[source]

Use all actor networks to produce discrete actions for the current state. A random sub-policy from the policy ensemble of each actor will be chosen.

Notes

The actor network must output a probability tensor of shape (batch_size, action_dims), where each row in dimension 1 sums to 1.

Parameters
  • states (List[Dict[str, Any]]) – A list of current states of each actor.

  • use_target (bool) – Whether to use the target network.

Returns

  1. Integer discrete actions of shape [batch_size, 1].

  2. Action probability tensors of shape [batch_size, action_num].

  3. Any other things returned by your actor.

Return type

A list of tuples containing

act_discrete_with_noise(states, use_target=False)[source]

Use all actor networks to produce discrete actions for the current state. A random sub-policy from the policy ensemble of each actor will be chosen.

Notes

The actor network must output a probability tensor of shape (batch_size, action_dims), where each row in dimension 1 sums to 1.

Parameters
  • states (List[Dict[str, Any]]) – A list of current states of each actor.

  • use_target (bool) – Whether to use the target network.

Returns

  1. Integer noisy discrete actions.

  2. Action probability tensors of shape [batch_size, action_num].

  3. Any other things returned by your actor.

Return type

A list of tuples containing

act_with_noise(states, noise_param=(0.0, 1.0), ratio=1.0, mode='uniform', use_target=False, **__)[source]

Use all actor networks to produce noisy actions for the current state. A random sub-policy from the policy ensemble of each actor will be chosen.

Parameters
  • states (List[Dict[str, Any]]) – A list of current states of each actor.

  • noise_param (Any) – Noise params.

  • ratio (float) – Noise ratio.

  • mode (str) – Noise mode. Supported are: "uniform", "normal", "clipped_normal", "ou"

  • use_target (bool) – Whether to use the target network.

Returns

A list of noisy actions of shape [batch_size, action_dim].
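
A minimal usage sketch of the acting methods. It assumes a constructed MADDPG instance named maddpg controlling 3 agents, whose actors accept a keyword argument named "state"; both names are assumptions for illustration:

import torch as t

# one observation dict per agent; keys must match your actor's forward() arguments
observations = [{"state": t.zeros([1, 5])} for _ in range(3)]

# exploration: a list with one noisy action tensor per agent
noisy_actions = maddpg.act_with_noise(
    observations, noise_param=(-0.1, 0.1), ratio=1.0, mode="uniform"
)

# evaluation: deterministic actions from randomly chosen sub-policies
actions = maddpg.act(observations)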

static action_concat_function(actions, *_)[source]
Parameters

actions (List[Dict]) –

static action_transform_function(raw_output_action, *_)[source]
Parameters

raw_output_action (Any) –

enable_multiprocessing()

Enable multiprocessing for all modules.

classmethod generate_config(config)[source]
Parameters

config (Union[Dict[str, Any], machin.utils.conf.Config]) –

classmethod get_restorable_model_names()

Get attribute name of restorable nn models.

classmethod get_top_model_names()

Get attribute name of top level nn models.

classmethod init_from_config(config, model_device='cpu')[source]
Parameters
  • config (Union[Dict[str, Any], machin.utils.conf.Config]) –

  • model_device (Union[str, torch.device]) –

classmethod is_distributed()

Whether this framework is a distributed framework which requires multiple processes to run, and depends on torch.distributed or torch.distributed.rpc.

load(model_dir, network_map=None, version=-1)[source]

Load models.

An example of network map:

{"restorable_model_1": "file_name_1",
 "restorable_model_2": "file_name_2"}

Get keys by calling <Class name>.get_restorable_model_names()

Parameters
  • model_dir – Save directory.

  • network_map – Key is module name, value is saved name.

  • version – Version number of the save to be loaded.

static reward_function(reward, discount, next_value, terminal, *_)[source]
save(model_dir, network_map=None, version=0)

Save models.

An example of network map:

{"restorable_model_1": "file_name_1",
 "restorable_model_2": "file_name_2"}

Get keys by calling <Class name>.get_restorable_model_names()

Parameters
  • model_dir (str) – Save directory.

  • network_map (Dict[str, str]) – Key is module name, value is saved name.

  • version (int) – Version number of the new save.

set_backward_function(backward_func)

Replace the default backward function with a custom function. The default loss backward function is torch.autograd.backward

Parameters

backward_func (Callable) –

static state_concat_function(states, *_)[source]
Parameters

states (List[Dict]) –

store_episodes(episodes)[source]

Add a List of full episodes, from all actors, to the replay buffers. Each episode is a list of transition samples.

Parameters

episodes (List[List[Union[machin.frame.transition.Transition, Dict]]]) –

store_transitions(transitions)[source]

Add a list of transition samples, from all actors at the same time step, to the replay buffers.

Parameters

transitions (List[Union[machin.frame.transition.Transition, Dict]]) – List of transition objects.

update(update_value=True, update_policy=True, update_target=True, concatenate_samples=True)[source]

Update network weights by sampling from replay buffer.

Parameters
  • update_value – Whether to update the Q network.

  • update_policy – Whether to update the actor network.

  • update_target – Whether to update targets.

  • concatenate_samples – Whether to concatenate the samples.

Returns

Mean value of the estimated policy value and the mean value loss.
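
A hedged sketch of one sampling-plus-training step, assuming the maddpg instance from the earlier sketch, three agents, and the action dictionary key "action" (an assumption; use whatever key your critics expect):

import torch as t

states = [{"state": t.zeros([1, 5])} for _ in range(3)]
actions = maddpg.act_with_noise(states, noise_param=(-0.1, 0.1), mode="uniform")

# ... step your environment with `actions`, producing next_states,
#     per-agent rewards and a terminal flag ...
next_states = [{"state": t.zeros([1, 5])} for _ in range(3)]
rewards, terminal = [0.0, 0.0, 0.0], False

maddpg.store_transitions([
    {"state": states[i],
     "action": {"action": actions[i]},
     "next_state": next_states[i],
     "reward": rewards[i],
     "terminal": terminal}
    for i in range(3)
])

# one training step (in practice, collect enough transitions first);
# returns the mean estimated policy value and the value loss
policy_value, value_loss = maddpg.update()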

update_lr_scheduler()[source]

Update learning rate schedulers.

visualize_model(final_tensor, name, directory)
Parameters
  • final_tensor (torch.Tensor) –

  • name (str) –

  • directory (str) –

property backward_function
property lr_schedulers
property optimizers
property restorable_models
property top_models
class machin.frame.algorithms.maddpg.SHMBuffer(buffer_size, buffer_device='cpu', *_, **__)[source]

Bases: machin.frame.buffers.buffer.Buffer

Create a buffer instance.

Buffer stores a series of transition objects and functions as a ring buffer. It is not thread-safe.

See also

Transition

During sampling, the tensors in “state”, “action” and “next_state” dictionaries, along with “reward”, will be concatenated in dimension 0. Any other custom keys specified in **kwargs will not be concatenated.

Parameters
  • buffer_size – Maximum buffer size.

  • buffer_device – Device where buffer is stored.

append(transition, required_attrs=('state', 'action', 'next_state', 'reward', 'terminal'))

Store a transition object to buffer.

Parameters
  • transition (Union[machin.frame.transition.Transition, Dict]) – A transition object.

  • required_attrs – Required attributes. Could be an empty tuple if no attribute is required.

Raises

ValueError – If the transition object doesn't have the required attributes specified in required_attrs, or has different attributes compared to other transition objects stored in the buffer.

clear()

Remove all entries from the buffer

static make_tensor_from_batch(batch, device, concatenate)[source]

Make a tensor from a batch of data. Will concatenate input tensors in dimension 0, or create a tensor of size (batch_size, 1) for scalars.

Parameters
  • batch – Batch data.

  • device – Device to move data to

  • concatenate – Whether performing concatenation.

Returns

The original batch if it is empty, a concatenated tensor built from your data (if concatenate is True), or the original batch (if concatenate is False).

classmethod post_process_batch(batch, device, concatenate, sample_attrs, additional_concat_attrs)

Post-process (concatenate) sampled batch.

Parameters
  • batch (List[machin.frame.transition.Transition]) –

  • device (Union[str, torch.device]) –

  • concatenate (bool) –

  • sample_attrs (List[str]) –

  • additional_concat_attrs (List[str]) –

sample_batch(batch_size, concatenate=True, device=None, sample_method='random_unique', sample_attrs=None, additional_concat_attrs=None, *_, **__)

Sample a random batch from buffer.

See also

Default sample methods are defined as static class methods.

Buffer.sample_method_random_unique()

Buffer.sample_method_random()

Buffer.sample_method_all()

Note

“Concatenation” means torch.cat([...], dim=0) for tensors, and torch.tensor([...]).view(batch_size, 1) for scalars.

Warning

Custom attributes must not contain tensors. And only scalar custom attributes can be concatenated, such as int, float, bool.

Parameters
  • batch_size (int) – A hint size of the result sample. The actual sample size depends on your sample method.

  • sample_method (Union[Callable, str]) – Sample method, could be one of: "random", "random_unique", "all", or a function: func(list, batch_size) -> (result_size, list)

  • concatenate (bool) – Whether to concatenate state, action and next_state in dimension 0. If True, for each value in the dictionaries of major attributes and each value of sub attributes, a concatenated tensor is returned. Custom attributes specified in additional_concat_attrs will also be concatenated. If False, lists of tensors are returned.

  • device (Union[str, torch.device]) – Device to copy to.

  • sample_attrs (List[str]) – If sample_attrs is specified, then only the specified keys of the transition object will be sampled. You may use "*" as a wildcard to collect remaining custom keys as a dict; you cannot collect major and sub attributes using this. Invalid sample attributes will be ignored.

  • additional_concat_attrs (List[str]) – Additional custom keys that need to be concatenated; only effective if concatenate is True.

Returns

  1. Batch size.

  2. Sampled attribute values, in the same order as sample_attrs. Sampled attribute values are a tuple, or None if the sampled batch size is zero (e.g. if the buffer is empty, or your sample size is 0 and you are not sampling using the “all” method).

    • For major attributes, results are dictionaries of tensors with the same keys as in your transition objects.

    • For sub attributes, results are tensors.

    • For custom attributes, results are lists if they are not in additional_concat_attrs, otherwise tensors.

Return type

Any

static sample_method_all(buffer, _)

Sample all samples from the buffer. Always returns the whole buffer and ignores the batch_size parameter.

Parameters

buffer (List[machin.frame.transition.Transition]) –

Return type

Tuple[int, List[machin.frame.transition.Transition]]

static sample_method_random(buffer, batch_size)

Sample random samples from buffer.

Note

Sampled size could be any value from 0 to batch_size.

Parameters
Return type

Tuple[int, List[machin.frame.transition.Transition]]

static sample_method_random_unique(buffer, batch_size)

Sample unique random samples from buffer.

Note

Sampled size could be any value from 0 to batch_size.

Parameters
Return type

Tuple[int, List[machin.frame.transition.Transition]]

size()
Returns

Length of current buffer.

buffers

Buffer

class machin.frame.buffers.buffer.Buffer(buffer_size, buffer_device='cpu', *_, **__)[source]

Bases: object

Create a buffer instance.

Buffer stores a series of transition objects and functions as a ring buffer. It is not thread-safe.

See also

Transition

During sampling, the tensors in “state”, “action” and “next_state” dictionaries, along with “reward”, will be concatenated in dimension 0. Any other custom keys specified in **kwargs will not be concatenated.

Parameters
  • buffer_size – Maximum buffer size.

  • buffer_device – Device where buffer is stored.

append(transition, required_attrs=('state', 'action', 'next_state', 'reward', 'terminal'))[source]

Store a transition object to buffer.

Parameters
  • transition (Union[machin.frame.transition.Transition, Dict]) – A transition object.

  • required_attrs – Required attributes. Could be an empty tuple if no attribute is required.

Raises

ValueError – If the transition object doesn't have the required attributes specified in required_attrs, or has different attributes compared to other transition objects stored in the buffer.

clear()[source]

Remove all entries from the buffer

static make_tensor_from_batch(batch, device, concatenate)[source]

Make a tensor from a batch of data. Will concatenate input tensors in dimension 0, or create a tensor of size (batch_size, 1) for scalars.

Parameters
  • batch (List[Union[Scalar, torch.Tensor]]) – Batch data.

  • device (Union[str, torch.device]) – Device to move data to

  • concatenate (bool) – Whether performing concatenation.

Returns

The original batch if it is empty, a concatenated tensor built from your data (if concatenate is True), or the original batch (if concatenate is False).

classmethod post_process_batch(batch, device, concatenate, sample_attrs, additional_concat_attrs)[source]

Post-process (concatenate) sampled batch.

Parameters
  • batch (List[machin.frame.transition.Transition]) –

  • device (Union[str, torch.device]) –

  • concatenate (bool) –

  • sample_attrs (List[str]) –

  • additional_concat_attrs (List[str]) –

sample_batch(batch_size, concatenate=True, device=None, sample_method='random_unique', sample_attrs=None, additional_concat_attrs=None, *_, **__)[source]

Sample a random batch from buffer.

See also

Default sample methods are defined as static class methods.

Buffer.sample_method_random_unique()

Buffer.sample_method_random()

Buffer.sample_method_all()

Note

“Concatenation” means torch.cat([...], dim=0) for tensors, and torch.tensor([...]).view(batch_size, 1) for scalars.

Warning

Custom attributes must not contain tensors. And only scalar custom attributes can be concatenated, such as int, float, bool.

Parameters
  • batch_size (int) – A hint size of the result sample. The actual sample size depends on your sample method.

  • sample_method (Union[Callable, str]) – Sample method, could be one of: "random", "random_unique", "all", or a function: func(list, batch_size) -> (result_size, list)

  • concatenate (bool) – Whether to concatenate state, action and next_state in dimension 0. If True, for each value in the dictionaries of major attributes and each value of sub attributes, a concatenated tensor is returned. Custom attributes specified in additional_concat_attrs will also be concatenated. If False, lists of tensors are returned.

  • device (Union[str, torch.device]) – Device to copy to.

  • sample_attrs (List[str]) – If sample_attrs is specified, then only the specified keys of the transition object will be sampled. You may use "*" as a wildcard to collect remaining custom keys as a dict; you cannot collect major and sub attributes using this. Invalid sample attributes will be ignored.

  • additional_concat_attrs (List[str]) – Additional custom keys that need to be concatenated; only effective if concatenate is True.

Returns

  1. Batch size.

  2. Sampled attribute values, in the same order as sample_attrs. Sampled attribute values are a tuple, or None if the sampled batch size is zero (e.g. if the buffer is empty, or your sample size is 0 and you are not sampling using the “all” method).

    • For major attributes, results are dictionaries of tensors with the same keys as in your transition objects.

    • For sub attributes, results are tensors.

    • For custom attributes, results are lists if they are not in additional_concat_attrs, otherwise tensors.

Return type

Any

static sample_method_all(buffer, _)[source]

Sample all samples from the buffer. Always returns the whole buffer and ignores the batch_size parameter.

Parameters

buffer (List[machin.frame.transition.Transition]) –

Return type

Tuple[int, List[machin.frame.transition.Transition]]

static sample_method_random(buffer, batch_size)[source]

Sample random samples from buffer.

Note

Sampled size could be any value from 0 to batch_size.

Parameters
Return type

Tuple[int, List[machin.frame.transition.Transition]]

static sample_method_random_unique(buffer, batch_size)[source]

Sample unique random samples from buffer.

Note

Sampled size could be any value from 0 to batch_size.

Parameters
Return type

Tuple[int, List[machin.frame.transition.Transition]]

size()[source]
Returns

Length of current buffer.
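
A minimal usage sketch based on the behaviour documented above; the dictionary keys "state" and "action" are arbitrary example names:

from machin.frame.buffers.buffer import Buffer
import torch as t

buffer = Buffer(buffer_size=10000, buffer_device="cpu")
for step in range(5):
    buffer.append({
        "state": {"state": t.zeros([1, 4])},
        "action": {"action": t.zeros([1, 2])},
        "next_state": {"state": t.zeros([1, 4])},
        "reward": 0.1,
        "terminal": step == 4,
    })

# with concatenate=True, major attributes come back as dicts of stacked tensors,
# sub attributes as tensors of shape (sampled_size, 1)
sampled_size, (state, action, reward, terminal) = buffer.sample_batch(
    batch_size=4,
    sample_method="random_unique",
    sample_attrs=["state", "action", "reward", "terminal"],
)
print(sampled_size, state["state"].shape, reward.shape)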

Distributed buffer

class machin.frame.buffers.buffer_d.DistributedBuffer(buffer_name, group, buffer_size, *_, **__)[source]

Bases: machin.frame.buffers.buffer.Buffer

Create a distributed replay buffer instance.

To avoid issues caused by tensor device differences, all transition objects are stored on device “cpu”.

A distributed replay buffer consists of many local buffers, one held per process; transmissions between processes only happen during sampling.

During sampling, the tensors in “state”, “action” and “next_state” dictionaries, along with “reward”, will be concatenated in dimension 0. Any other custom keys specified in **kwargs will not be concatenated.

See also

Buffer

Note

Since append() operates on the local buffer, in order to append to the distributed buffer correctly, please make sure that your actor is also the local buffer holder, i.e. a member of the group

Parameters
  • buffer_size (int) – Maximum local buffer size.

  • group (machin.parallel.distributed.world.RpcGroup) – Process group which holds this buffer.

  • buffer_name (str) – A unique name of your buffer.

all_clear()[source]

Remove all entries from all local buffers.

all_size()[source]
Returns

Total length of all buffers.

append(transition, required_attrs=('state', 'action', 'next_state', 'reward', 'terminal'))[source]

Store a transition object to buffer.

Parameters
  • transition (Union[machin.frame.transition.Transition, Dict]) – A transition object.

  • required_attrs – Required attributes. Could be an empty tuple if no attribute is required.

Raises

ValueError – If the transition object doesn't have the required attributes specified in required_attrs, or has different attributes compared to other transition objects stored in the buffer.

clear()[source]

Clear current local buffer.

sample_batch(batch_size, concatenate=True, device=None, sample_method='random_unique', sample_attrs=None, additional_concat_attrs=None, *_, **__)[source]

Sample a random batch from buffer.

See also

Default sample methods are defined as static class methods.

Buffer.sample_method_random_unique()

Buffer.sample_method_random()

Buffer.sample_method_all()

Note

“Concatenation” means torch.cat([...], dim=0) for tensors, and torch.tensor([...]).view(batch_size, 1) for scalars.

Warning

Custom attributes must not contain tensors. And only scalar custom attributes can be concatenated, such as int, float, bool.

Parameters
  • batch_size (int) – A hint size of the result sample. The actual sample size depends on your sample method.

  • sample_method (Union[Callable, str]) – Sample method, could be one of: "random", "random_unique", "all", or a function: func(list, batch_size) -> (result_size, list)

  • concatenate (bool) – Whether to concatenate state, action and next_state in dimension 0. If True, for each value in the dictionaries of major attributes and each value of sub attributes, a concatenated tensor is returned. Custom attributes specified in additional_concat_attrs will also be concatenated. If False, lists of tensors are returned.

  • device (Union[str, torch.device]) – Device to copy to.

  • sample_attrs (List[str]) – If sample_attrs is specified, then only the specified keys of the transition object will be sampled. You may use "*" as a wildcard to collect remaining custom keys as a dict; you cannot collect major and sub attributes using this. Invalid sample attributes will be ignored.

  • additional_concat_attrs (List[str]) – Additional custom keys that need to be concatenated; only effective if concatenate is True.

Returns

  1. Batch size.

  2. Sampled attribute values, in the same order as sample_attrs. Sampled attribute values are a tuple, or None if the sampled batch size is zero (e.g. if the buffer is empty, or your sample size is 0 and you are not sampling using the “all” method).

    • For major attributes, results are dictionaries of tensors with the same keys as in your transition objects.

    • For sub attributes, results are tensors.

    • For custom attributes, results are lists if they are not in additional_concat_attrs, otherwise tensors.

Return type

Any

size()[source]
Returns

Length of current local buffer.

Prioritized buffer

class machin.frame.buffers.prioritized_buffer.PrioritizedBuffer(buffer_size, buffer_device='cpu', epsilon=0.01, alpha=0.6, beta=0.4, beta_increment_per_sampling=0.001, *_, **__)[source]

Bases: machin.frame.buffers.buffer.Buffer

Parameters
  • buffer_size – Maximum buffer size.

  • buffer_device – Device where buffer is stored.

  • epsilon – A small positive constant used to prevent edge-case zero weight transitions from never being visited.

  • alpha – Prioritization weight. Used during transition sampling: \(j \sim P(j)=p_{j}^{\alpha} / \sum_i p_{i}^{\alpha}\). When alpha = 0, all samples have the same probability to be sampled. When alpha = 1, samples are drawn with probability proportional to their priority weight.

  • beta – Bias correcting weight. When beta = 1, the bias introduced by prioritized replay will be fully corrected. Used during importance weight calculation: \(w_j=(N \cdot P(j))^{-\beta}/\max_i w_i\)

  • beta_increment_per_sampling – Beta increase step size, will gradually increase beta to 1.

append(transition, priority=None, required_attrs=('state', 'action', 'next_state', 'reward', 'terminal'))[source]

Store a transition object to buffer.

Parameters
  • transition (Union[machin.frame.transition.Transition, Dict]) – A transition object.

  • priority (Optional[float]) – Priority of transition.

  • required_attrs – Required attributes.

clear()[source]

Clear and reset the buffer to its initial state.

sample_batch(batch_size, concatenate=True, device=None, sample_attrs=None, additional_concat_attrs=None, *_, **__)[source]

Sample the most important batch from the prioritized buffer.

Parameters
  • batch_size (int) – A hint size of the result sample.

  • concatenate (bool) – Whether to concatenate state, action and next_state in dimension 0. If True, for each value in the dictionaries of major attributes and each value of sub attributes, a concatenated tensor is returned. Custom attributes specified in additional_concat_attrs will also be concatenated. If False, lists of tensors are returned.

  • device (Union[str, torch.device]) – Device to copy to.

  • sample_attrs (List[str]) – If sample_attrs is specified, then only the specified keys of the transition object will be sampled. You may use "*" as a wildcard to collect remaining keys.

  • additional_concat_attrs (List[str]) – Additional custom keys that need to be concatenated; only effective if concatenate is True.

Returns

  1. Batch size.

  2. Sampled attribute values in the same order as sample_keys.

    Sampled attribute values are a tuple, or None if the sampled batch size is zero (e.g. if the buffer is empty or your sample size is 0).

  3. Indexes of samples in the weight tree, as np.ndarray, or None if the sampled batch size is zero.

  4. Importance sampling weights of samples, as np.ndarray, or None if the sampled batch size is zero.

Return type

Any

size()[source]
Returns

Length of current buffer.

update_priority(priorities, indexes)[source]

Update priorities of samples.

Parameters
  • priorities (numpy.ndarray) – New priorities.

  • indexes (numpy.ndarray) – Indexes of samples, returned by sample_batch()
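
A short usage sketch of the prioritized sampling loop, based on the signatures documented above; the attribute names and the placeholder priorities are example values only:

from machin.frame.buffers.prioritized_buffer import PrioritizedBuffer
import numpy as np
import torch as t

p_buffer = PrioritizedBuffer(buffer_size=10000)
for _ in range(8):
    p_buffer.append({
        "state": {"state": t.zeros([1, 4])},
        "action": {"action": t.zeros([1, 2])},
        "next_state": {"state": t.zeros([1, 4])},
        "reward": 0.0,
        "terminal": False,
    })

sampled_size, (state, action, reward), indexes, is_weights = p_buffer.sample_batch(
    batch_size=4, sample_attrs=["state", "action", "reward"]
)

# after computing per-sample TD errors, write them back as new priorities
new_priorities = np.abs(np.random.randn(sampled_size)) + 1e-6
p_buffer.update_priority(new_priorities, indexes)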

Weight tree

class machin.frame.buffers.prioritized_buffer.WeightTree(size)[source]

Bases: object

Sum weight tree data structure.

Initialize a weight tree.

Note

Weights must be positive.

Note

The weight tree is stored as a flattened, full binary tree in a np.ndarray. The lowest level of leaves comes first; the root node is stored last.

Example:

Tree with weights: [[1, 2, 3, 4], [3, 7], [11]]

will be stored as: [1, 2, 3, 4, 3, 7, 11]

Note

Performance On i7-6700HQ (M: Million):

90ms for building a tree with 10M elements.

230ms for looking up 10M elements in a tree with 10M elements.

20ms for 1M element batched update in a tree with 10M elements.

240ms for 1M element single update in a tree with 10M elements.

Parameters

size – Number of weight tree leaves.
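
A small usage sketch of the weight tree. The leaf-range comment below is an interpretation of the usual sum-tree query, not a quote from the library:

from machin.frame.buffers.prioritized_buffer import WeightTree

tree = WeightTree(size=4)
tree.update_all_leaves([1.0, 2.0, 3.0, 4.0])

print(tree.get_weight_sum())      # 10.0
print(tree.get_leaf_max())        # 4.0
# cumulative leaf ranges are [0,1), [1,3), [3,6), [6,10], so a query
# weight of 6.5 should map to the fourth leaf (index 3)
print(tree.find_leaf_index(6.5))

tree.update_leaf(5.0, 0)                    # single leaf update
tree.update_leaf_batch([1.0, 1.0], [1, 2])  # batched update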

find_leaf_index(weight)[source]

Find leaf indexes given weight. Weight must be in range \([0, weight\_sum]\)

Parameters

weight (Union[float, List[float], numpy.ndarray]) – Weight(s) used to query leaf index(es).

Returns

Leaf index(es). If weight is a scalar, returns int; otherwise returns np.ndarray.

get_leaf_all_weights()[source]
Returns

Current weights of all leaves, np.ndarray of shape (size).

Return type

numpy.ndarray

get_leaf_max()[source]
Returns

Current maximum leaf weight.

Return type

float

get_leaf_weight(index)[source]

Get weights of selected leaves.

Parameters

index (Union[int, List[int], numpy.ndarray]) – Leaf indexes in range [0, size - 1], used to query weights.

Returns

Current weight(s) of the selected leaves. If index is a scalar, returns float; otherwise returns np.ndarray.

Return type

Any

get_weight_sum()[source]
Returns

Total weight sum.

Return type

float

print_weights(precision=2)[source]

Pretty print the tree, for debug purpose.

Parameters

precision – Number of digits of weights to print.

update_all_leaves(weights)[source]

Reset all leaf weights, rebuild weight tree from ground up.

Parameters

weights (Union[List[float], numpy.ndarray]) – All leaf weights. List or array length should be in range [0, size].

update_leaf(weight, index)[source]

Update a single weight tree leaf.

Parameters
  • weight (float) – New weight of the leaf.

  • index (int) – Leaf index to update, must be in range [0, size - 1].

update_leaf_batch(weights, indexes)[source]

Update weight tree leaves in batch.

Parameters
  • weights (Union[List[float], numpy.ndarray]) – New weights of leaves.

  • indexes (Union[List[int], numpy.ndarray]) – Leaf indexes to update, must be in range [0, size - 1].

Distributed prioritized buffer

class machin.frame.buffers.prioritized_buffer_d.DistributedPrioritizedBuffer(buffer_name, group, buffer_size, *_, **__)[source]

Bases: machin.frame.buffers.prioritized_buffer.PrioritizedBuffer

Create a distributed prioritized replay buffer instance.

To avoid issues caused by tensor device differences, all transition objects are stored on device “cpu”.

A distributed prioritized replay buffer consists of many local buffers, one held per process. Since it is very inefficient to maintain a weight tree across processes, each process holds a weight tree of the records in its local buffer, along with the local buffer itself (same as DistributedBuffer).

The sampling process(es) will first use rpc to acquire the wr_lock, signalling “stop” to appends performed by actor processes, then sum all local weight trees, and finally perform sampling. After sampling and updating the importance weights, the lock is released.

During sampling, the tensors in “state”, “action” and “next_state” dictionaries, along with “reward”, will be concatenated in dimension 0. Any other custom keys specified in **kwargs will not be concatenated.

See also

PrioritizedBuffer

Note

DistributedPrioritizedBuffer is not split into an accessor and an implementation, because we would like to operate on the buffer directly, when calling “size()” or “append()”, to increase efficiency (since rpc layer is bypassed).

Parameters
  • buffer_size (int) – Maximum local buffer size.

  • group (machin.parallel.distributed.world.RpcGroup) – Process group which holds this buffer.

  • buffer_name (str) – A unique name of your buffer.

all_clear()[source]

Remove all entries from all local buffers.

all_size()[source]
Returns

Total length of all buffers.

append(transition, priority=None, required_attrs=('state', 'action', 'next_state', 'reward', 'terminal'))[source]

Store a transition object to buffer.

Parameters
  • transition (Union[machin.frame.transition.Transition, Dict]) – A transition object.

  • priority (Optional[float]) – Priority of transition.

  • required_attrs – Required attributes.

clear()[source]

Remove all entries from current local buffer.

sample_batch(batch_size, concatenate=True, device=None, sample_attrs=None, additional_concat_attrs=None, *_, **__)[source]

Sample the most important batch from the prioritized buffer.

Parameters
  • batch_size (int) – A hint size of the result sample.

  • concatenate (bool) – Whether to concatenate state, action and next_state in dimension 0. If True, for each value in the dictionaries of major attributes and each value of sub attributes, a concatenated tensor is returned. Custom attributes specified in additional_concat_attrs will also be concatenated. If False, lists of tensors are returned.

  • device (Union[str, torch.device]) – Device to copy to.

  • sample_attrs (List[str]) – If sample_attrs is specified, then only the specified keys of the transition object will be sampled. You may use "*" as a wildcard to collect remaining keys.

  • additional_concat_attrs (List[str]) – Additional custom keys that need to be concatenated; only effective if concatenate is True.

Returns

  1. Batch size.

  2. Sampled attribute values in the same order as sample_keys.

    Sampled attribute values are a tuple, or None if the sampled batch size is zero (e.g. if the buffer is empty or your sample size is 0).

  3. Indexes of samples in the weight tree, as np.ndarray, or None if the sampled batch size is zero.

  4. Importance sampling weights of samples, as np.ndarray, or None if the sampled batch size is zero.

Return type

Any

size()[source]
Returns

Length of current local buffer.

update_priority(priorities, indexes)[source]

Update priorities of samples.

Parameters
  • priorities (numpy.ndarray) – New priorities.

  • indexes (collections.OrderedDict) – Indexes of samples, returned by sample_batch()

noise

action_space_noise

machin.frame.noise.action_space_noise.add_clipped_normal_noise_to_action(action, noise_param=(0.0, 1.0, -1.0, 1.0), ratio=1.0)[source]

Add clipped normal noise to action tensor.

Hint

The innermost tuple contains: (normal_mean, normal_sigma, clip_min, clip_max)

If noise_param is Tuple[float, float, float, float], then the same clipped normal noise will be added to action[*, :].

If noise_param is Iterable[Tuple[float, float, float, float]], then for each action[*, i] slice i, clipped normal noise with noise_param[i] will be applied respectively.

Parameters
  • action (torch.Tensor) – Raw action

  • noise_param (Union[Iterable[Tuple], Tuple]) – Param of the normal noise.

  • ratio – Sampled noise is multiplied with this ratio.

Returns

Action with clipped normal noise.

machin.frame.noise.action_space_noise.add_normal_noise_to_action(action, noise_param=(0.0, 1.0), ratio=1.0)[source]

Add normal noise to action tensor.

Hint

The innermost tuple contains: (normal_mean, normal_sigma)

If noise_param is Tuple[float, float], then the same normal noise will be added to action[*, :].

If noise_param is Iterable[Tuple[float, float]], then for each slice action[*, i], normal noise with noise_param[i] will be applied respectively.

Parameters
  • action (torch.Tensor) – Raw action

  • noise_param – Param of the normal noise.

  • ratio – Sampled noise is multiplied with this ratio.

Returns

Action with normal noise.

machin.frame.noise.action_space_noise.add_ou_noise_to_action(action, noise_param=None, ratio=1.0, reset=False)[source]

Add Ornstein-Uhlenbeck noise to action tensor.

Warning

Ornstein-Uhlenbeck noise generator is shared. And you cannot specify OU noise of different distributions for each of the last dimension of your action.

Parameters
  • action (torch.Tensor) – Raw action

  • noise_param (Dict[str, Any]) – OrnsteinUhlenbeckGen params. Used as keyword arguments of the generator. Will only be effective if reset is True.

  • ratio – Sampled noise is multiplied with this ratio.

  • reset – Whether to reset the default Ornstein-Uhlenbeck noise generator.

Returns

Action with Ornstein-Uhlenbeck noise.
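
A sketch based on the documented behaviour above; the mu and sigma values are example parameters of the shared generator:

from machin.frame.noise.action_space_noise import add_ou_noise_to_action
import torch as t

action = t.zeros([1, 3])

# reset=True (re)creates the shared OU generator with the given parameters
noisy = add_ou_noise_to_action(
    action, noise_param={"mu": 0.0, "sigma": 0.2}, reset=True
)
# later calls reuse the shared generator state, giving temporally correlated noise
noisy = add_ou_noise_to_action(action)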

machin.frame.noise.action_space_noise.add_uniform_noise_to_action(action, noise_param=(0.0, 1.0), ratio=1.0)[source]

Add uniform noise to action tensor.

Hint

The innermost tuple contains: (uniform_min, uniform_max)

If noise_param is Tuple[float, float], then the same uniform noise will be added to action[*, :].

If noise_param is Iterable[Tuple[float, float]], then for each action[*, i] slice i, uniform noise with noise_param[i] will be added respectively.

Parameters
  • action (torch.Tensor) – Raw action.

  • noise_param (Union[Iterable[Tuple], Tuple]) – Param of the uniform noise.

  • ratio (float) – Sampled noise is multiplied with this ratio.

Returns

Action with uniform noise.
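
A short sketch of the two noise_param forms described in the hint above; the ranges are arbitrary example values:

from machin.frame.noise.action_space_noise import add_uniform_noise_to_action
import torch as t

action = t.zeros([1, 3])

# the same (uniform_min, uniform_max) range for every action dimension
noisy = add_uniform_noise_to_action(action, noise_param=(-0.1, 0.1))

# one (uniform_min, uniform_max) range per action dimension
noisy = add_uniform_noise_to_action(
    action, noise_param=[(-0.1, 0.1), (-0.2, 0.2), (-0.3, 0.3)]
)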

generator

class machin.frame.noise.generator.ClippedNormalNoiseGen(shape, mu=0.0, sigma=1.0, nmin=-1.0, nmax=1.0)[source]

Bases: machin.frame.noise.generator.NoiseGen

Clipped normal noise generator.

Example

>>> gen = ClippedNormalNoiseGen([2, 3], 0, 1, -1, 1)
>>> gen("cuda:0")
tensor([[-0.5957,  0.2360,  1.0000],
        [ 1.0000,  1.0000, -0.0667]], device="cuda:0")
Parameters
  • shape (Any) – Output shape.

  • mu (float) – Average mean of normal noise.

  • sigma (float) – Standard deviation of normal noise.

  • nmin (float) – Minimum value of the clipped noise.

  • nmax (float) – Maximum value of the clipped noise.

class machin.frame.noise.generator.NoiseGen[source]

Bases: abc.ABC

Base class for noise generators.

reset()[source]

Reset internal states of the noise generator, if it has any.

class machin.frame.noise.generator.NormalNoiseGen(shape, mu=0.0, sigma=1.0)[source]

Bases: machin.frame.noise.generator.NoiseGen

Normal noise generator.

Example

>>> gen = NormalNoiseGen([2, 3], 0, 1)
>>> gen("cuda:0")
tensor([[-0.5957,  0.2360,  1.0999],
        [ 1.6259,  1.2052, -0.0667]], device="cuda:0")
Parameters
  • shape (Any) – Output shape.

  • mu (float) – Average mean of normal noise.

  • sigma (float) – Standard deviation of normal noise.

class machin.frame.noise.generator.OrnsteinUhlenbeckNoiseGen(shape, mu=0.0, sigma=1.0, theta=0.15, dt=0.01, x0=None)[source]

Bases: machin.frame.noise.generator.NoiseGen

Ornstein-Uhlenbeck noise generator. Based on definition:

\(X_{n+1} = X_n + \theta (\mu - X_n)\Delta t + \sigma \Delta W_n\)

Example

>>> gen = OrnsteinUhlenbeckNoiseGen([2, 3], 0, 1)
>>> gen("cuda:0")
tensor([[ 0.1829,  0.1589, -0.1932],
        [-0.1568,  0.0579,  0.2107]], device="cuda:0")
>>> gen.reset()
Parameters
  • shape (Any) – Output shape.

  • mu (float) – Average mean of noise.

  • sigma (float) – Weight of the random wiener process.

  • theta (float) – Weight of difference correction.

  • dt (float) – Time step size.

  • x0 (torch.Tensor) – Initial x value. Must have the same shape as shape.

reset()[source]

Reset the generator to its initial state.

class machin.frame.noise.generator.UniformNoiseGen(shape, umin=0.0, umax=1.0)[source]

Bases: machin.frame.noise.generator.NoiseGen

Uniform noise generator.

Example

>>> gen = UniformNoiseGen([2, 3], 0, 1)
>>> gen("cuda:0")
tensor([[0.0745, 0.6581, 0.9572],
        [0.4450, 0.8157, 0.6421]], device="cuda:0")
Parameters
  • shape (Any) – Output shape.

  • umin (float) – Minimum value of uniform noise.

  • umax (float) – Maximum value of uniform noise.

param_space_noise

class machin.frame.noise.param_space_noise.AdaptiveParamNoise(initial_stddev=0.1, desired_action_stddev=0.1, adoption_coefficient=1.01)[source]

Bases: object

Implements the adaptive parameter space method in <<Parameter space noise for exploration>>.

Hint

Let \(\theta_k\) be the standard deviation of noise at adaptation step \(k\), and \(\alpha\) be the adoption coefficient, then:

\(\theta_{k+1} = \begin{cases} \alpha \theta_k & \text{if } d(\pi,\tilde{\pi}) \leq \delta, \\ \frac{1}{\alpha} \theta_k & \text{otherwise,} \end{cases}\)

where \(d(\pi,\tilde{\pi})\) is the distance between actions produced with noisy and clean parameters, and \(\delta\) is desired_action_stddev.

Noise is directly applied to network parameters.

Parameters
  • initial_stddev (float) – Initial noise standard deviation.

  • desired_action_stddev (float) – Desired standard deviation of noisy actions (the \(\delta\) above).

  • adoption_coefficient (float) – Adoption coefficient.

adapt(distance)[source]

Update noise standard deviation according to distance.

Parameters

distance (float) – Current distance between the noisy action and clean action.

get_dev()[source]
Returns

Current noise standard deviation.

Return type

float
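
A minimal sketch of the adaptation loop; the random tensors and the L2 distance are placeholders for actions produced with noisy and clean parameters:

from machin.frame.noise.param_space_noise import AdaptiveParamNoise
import torch as t

adapter = AdaptiveParamNoise(
    initial_stddev=0.1, desired_action_stddev=0.1, adoption_coefficient=1.01
)

# distance between a batch of actions produced with noisy and clean parameters;
# any distance function works, an L2 distance is used here for illustration
noisy_action, clean_action = t.rand([8, 3]), t.rand([8, 3])
distance = t.dist(noisy_action, clean_action, 2).item()

adapter.adapt(distance)
print(adapter.get_dev())  # updated noise standard deviation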

machin.frame.noise.param_space_noise.perturb_model(model, perturb_switch, reset_switch, distance_func=<function <lambda>>, desired_action_stddev=0.5, noise_generator=<class 'machin.frame.noise.generator.NormalNoiseGen'>, noise_generator_args=(), noise_generator_kwargs=None, noise_generate_function=None, debug_backward=False)[source]

Give model’s parameters a little perturbation. Implements <<Parameter space noise for exploration>>.

Note

Only parameters of type t.Tensor and gettable from model.named_parameters() will be perturbed.

Original parameters will be automatically swapped in during the backward pass, and you can safely call optimizers afterwards.

Hint

1. noise_generator must accept (shape, *args) in its __init__ function, where shape is the required shape. It also needs to have __call__(device=None), which produces a noise tensor on the specified device when invoked.

2. noise_generate_function must accept (shape, device, std:float) and return a noise tensor on the specified device.

Example

In order to use this function to perturb your model, you need to:

from machin.utils.helper_classes import Switch
from machin.frame.noise.param_space_noise import perturb_model
from machin.utils.visualize import visualize_graph
import torch as t

dims = 5

t.manual_seed(0)
model = t.nn.Linear(dims, dims)
optim = t.optim.Adam(model.parameters(), 1e-3)
p_switch, r_switch = Switch(), Switch()
cancel = perturb_model(model, p_switch, r_switch)

# you should keep this switch on if you do one training step after
# every sampling step. otherwise you may turn it off in one episode
# and turn it on in the next to speed up training.
r_switch.on()

# turn off/on the perturbation switch to see the difference
p_switch.on()

# do some sampling
action = model(t.ones([dims]))

# in order to let parameter noise adapt to generate noisy actions
# within ``desired_action_stddev``, you must periodically
# use the original model to generate some actions:
p_switch.off()
action = model(t.ones([dims]))

# visualize will not show any leaf noise tensors
# because they are created in t.no_grad() context
# and added in-place.
visualize_graph(action, exit_after_vis=False)

# do some training
loss = (action - t.ones([dims])).sum()
loss.backward()
optim.step()
print(model.weight)

# clear hooks
cancel()
Parameters
  • model (torch.nn.modules.module.Module) – Neural network model.

  • perturb_switch (machin.utils.helper_classes.Switch) – The switch used to enable perturbation. If switch is set to False (off), then during the forward process, original parameters are used.

  • reset_switch (machin.utils.helper_classes.Switch) – The switch used to reset perturbation noise. If switch is set to True (on), and perturb_switch is also on, then during every forward process, a new set of noise is applied to each param. If only perturb_switch is on, then the same set of noisy parameters is used in the forward process and they will not be updated.

  • distance_func (Callable) – Distance function, accepts two tensors produced by model (one is noisy), return the distance as float. Used to compare the distance between actions generated by noisy parameters and original parameters.

  • desired_action_stddev (float) – Desired action standard deviation.

  • noise_generator (Any) – Noise generator class.

  • noise_generator_args (Tuple) – Additional args other than shape of the noise generator.

  • noise_generator_kwargs (Dict) – Additional kwargs other than shape of the noise generator.

  • noise_generate_function (Callable) – Noise generation function, mutually exclusive with noise_generator and noise_generator_args.

  • debug_backward – Print a message if the backward hook is correctly executed.

Returns

  1. A reset function with no arguments, which will swap in the original parameters.

  2. A deregister function with no arguments, which will deregister all hooks applied to your model.

transition

class machin.frame.transition.Transition(state, action, next_state, reward, terminal, **kwargs)[source]

Bases: machin.frame.transition.TransitionBase

The default Transition class.

Has three main attributes: state, action and next_state.

Has two sub attributes: reward and terminal.

Stores one transition step of one agent.

Parameters
  • state (Dict[str, torch.Tensor]) – Previous observed state.

  • action (Dict[str, torch.Tensor]) – Action of agent.

  • next_state (Dict[str, torch.Tensor]) – Next observed state.

  • reward (Union[float, torch.Tensor]) – Reward of agent.

  • terminal (bool) – Whether environment has reached terminal state.

  • **kwargs – Custom attributes. They are ordered in the alphabetic order (provided by sort()) when you call keys().

Note

You should not store any tensor inside **kwargs as they will not be moved to the sample output device.

action = None
next_state = None
reward = None
state = None
terminal = None
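
A minimal construction sketch; the dictionary keys and the custom step attribute are example names:

from machin.frame.transition import Transition
import torch as t

transition = Transition(
    state={"state": t.zeros([1, 4])},
    action={"action": t.zeros([1, 2])},
    next_state={"state": t.zeros([1, 4])},
    reward=1.0,
    terminal=False,
    step=12,   # custom attribute; must not be a tensor (see the note above)
)
print(transition.keys())        # major, sub, then custom attribute names
transition = transition.to("cpu")
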
class machin.frame.transition.TransitionBase(major_attr, sub_attr, custom_attr, major_data, sub_data, custom_data)[source]

Bases: object

Base class for all transitions

Note

Major attributes store things like state, action, next_states, etc. They are usually concatenated by their dictionary keys during sampling, and passed as keyword arguments to actors, critics, etc.

Sub attributes store things like terminal states, reward, etc. They are usually concatenated directly during sampling, and used in different algorithms.

Custom attributes store non-concatenatable values, usually user-specified states, used in models or as special arguments in different algorithms. They will be collected together as a list during sampling; no further concatenation is performed.

Parameters
  • major_attr (Iterable[str]) – A list of major attribute names.

  • sub_attr (Iterable[str]) – A list of sub attribute names.

  • custom_attr (Iterable[str]) – A list of custom attribute names.

  • major_data (Iterable[Dict[str, torch.Tensor]]) – Data of major attributes.

  • sub_data (Iterable[Union[Scalar, torch.Tensor]]) – Data of sub attributes.

  • custom_data (Iterable[Any]) – Data of custom attributes.

has_keys(keys)[source]
Parameters

keys (Iterable[str]) – A list of keys

Returns

A bool indicating whether current transition object contains all specified keys.

items()[source]
Returns

All attribute values in current transition object.

keys()[source]
Returns

All attribute names in current transition object. Ordered in: “major_attrs, sub_attrs, custom_attrs”

to(device)[source]

Move the current transition object to another device. Will be a no-op if it is already located on that device.

Parameters

device (Union[str, torch.device]) – A valid pytorch device.

Returns

Self.

property custom_attr
property major_attr
property sub_attr
class machin.frame.transition.TransitionStorageBasic(max_size)[source]

Bases: list

TransitionStorageBasic is a linear, size-capped chunk of memory for transitions. It makes sure that every stored transition is copied and isolated from the passed-in transition object.

Parameters

max_size – Maximum size of the transition storage.

clear()[source]

Remove all items from list.

store(transition)[source]
Parameters

transition (machin.frame.transition.TransitionBase) – Transition object to be stored

Returns

The position where transition is inserted.

Return type

int

class machin.frame.transition.TransitionStorageSmart(max_size)[source]

Bases: machin.frame.transition.TransitionStorageBasic

TransitionStorageSmart is a smarter, but (potentially) slower, storage class for transitions. In many cases it is as fast as the basic storage, and it halves memory usage because it only deep-copies half of the states.

TransitionStorageSmart will compare the major attributes of the currently stored transition object with those of the last stored transition object, and set them to refer to the same tensor.

Sub attributes and custom attributes will be directly copied.

Parameters

max_size – Maximum size of the transition storage.

clear()[source]

Remove all items from list.

store(transition)[source]
Parameters

transition (machin.frame.transition.TransitionBase) – Transition object to be stored

Returns

The position where transition is inserted.

Return type

int