machin.frame¶
algorithms¶
Base¶
-
class
machin.frame.algorithms.base.
TorchFramework
[source]¶ Bases:
object
Base framework for all algorithms.
-
classmethod
generate_config
(config)[source]¶ - Parameters
config (Union[Dict[str, Any], machin.utils.conf.Config]) –
-
classmethod
init_from_config
(config, model_device='cpu')[source]¶ - Parameters
config (Union[Dict[str, Any], machin.utils.conf.Config]) –
model_device (Union[str, torch.device]) –
-
classmethod
is_distributed
()[source]¶ Whether this framework is a distributed framework which requires multiple processes to run, and depends on torch.distributed or torch.distributed.rpc.
-
load
(model_dir, network_map=None, version=-1)[source]¶ Load models.
An example of network map:
{"restorable_model_1": "file_name_1", "restorable_model_2": "file_name_2"}
Get keys by calling
<Class name>.get_restorable()
- Parameters
model_dir (str) – Save directory.
network_map (Dict[str, str]) – Key is module name, value is saved name.
version (int) – Version number of the save to be loaded.
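A hedged usage sketch (the directory, mapping and version below are hypothetical, and framework stands for any constructed algorithm instance):
framework.save("./trials", network_map={"actor": "actor_net"}, version=0)
framework.load("./trials", network_map={"actor": "actor_net"}, version=0)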
-
save
(model_dir, network_map=None, version=0)[source]¶ Save models.
An example of network map:
{"restorable_model_1": "file_name_1", "restorable_model_2": "file_name_2"}
Get keys by calling
<Class name>.get_restorable()
- Parameters
model_dir (str) – Save directory.
network_map (Dict[str, str]) – Key is module name, value is saved name.
version (int) – Version number of the new save.
-
set_backward_function
(backward_func)[source]¶ Replace the default backward function with a custom function. The default loss backward function is
torch.autograd.backward
- Parameters
backward_func (Callable) –
-
visualize_model
(final_tensor, name, directory)[source]¶ - Parameters
final_tensor (torch.Tensor) –
name (str) –
directory (str) –
-
property
backward_function
¶
-
property
lr_schedulers
¶
-
property
optimizers
¶
-
property
restorable_models
¶
-
property
top_models
¶
DDPG¶
-
class
machin.frame.algorithms.ddpg.
DDPG
(actor, actor_target, critic, critic_target, optimizer, criterion, *_, lr_scheduler=None, lr_scheduler_args=None, lr_scheduler_kwargs=None, batch_size=100, update_rate=0.001, update_steps=None, actor_learning_rate=0.0005, critic_learning_rate=0.001, discount=0.99, gradient_max=inf, replay_size=500000, replay_device='cpu', replay_buffer=None, visualize=False, visualize_dir='', **__)[source]¶ Bases:
machin.frame.algorithms.base.TorchFramework
DDPG framework.
Note
Your optimizer will be called as:
optimizer(network.parameters(), learning_rate)
Your lr_scheduler will be called as:
lr_scheduler(optimizer, *lr_scheduler_args[0], **lr_scheduler_kwargs[0])
Your criterion will be called as:
criterion(target_value.view(batch_size, 1), predicted_value.view(batch_size, 1))
Note
DDPG supports two ways of updating the target network. The first is polyak update (soft update), which updates the target network in every training step by mixing its weights with the online network using update_rate. The other is hard update, which copies the weights of the online network after every update_steps training steps. You can specify either update_rate or update_steps to select one update scheme; if both are specified, an error will be raised. These two update schemes may result in different training stability.
- Parameters
actor (Union[machin.model.nets.base.NeuralNetworkModule, torch.nn.modules.module.Module]) – Actor network module.
actor_target (Union[machin.model.nets.base.NeuralNetworkModule, torch.nn.modules.module.Module]) – Target actor network module.
critic (Union[machin.model.nets.base.NeuralNetworkModule, torch.nn.modules.module.Module]) – Critic network module.
critic_target (Union[machin.model.nets.base.NeuralNetworkModule, torch.nn.modules.module.Module]) – Target critic network module.
optimizer (Callable) – Optimizer used to optimize actor and critic.
criterion (Callable) – Criterion used to evaluate the value loss.
lr_scheduler (Callable) – Learning rate scheduler of optimizer.
lr_scheduler_args (Tuple[Tuple, Tuple]) – Arguments of the learning rate scheduler.
lr_scheduler_kwargs (Tuple[Dict, Dict]) – Keyword arguments of the learning rate scheduler.
batch_size (int) – Batch size used during training.
update_rate (float) – \(\tau\) used to update target networks. Target parameters are updated as: \(\theta_t = \theta * \tau + \theta_t * (1 - \tau)\)
update_steps (Optional[int]) – Training step number used to update target networks.
actor_learning_rate (float) – Learning rate of the actor optimizer, not compatible with lr_scheduler.
critic_learning_rate (float) – Learning rate of the critic optimizer, not compatible with lr_scheduler.
discount (float) – \(\gamma\) used in the Bellman function.
replay_size (int) – Replay buffer size. Not compatible with replay_buffer.
replay_device (Union[str, torch.device]) – Device on which the replay buffer is located. Not compatible with replay_buffer.
replay_buffer (machin.frame.buffers.buffer.Buffer) – Custom replay buffer.
visualize (bool) – Whether to visualize the network flow in the first pass.
visualize_dir (str) – Visualized graph save directory.
gradient_max (float) – Maximum gradient.
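For orientation, a minimal construction sketch, assuming actor_net, actor_net_t, critic_net and critic_net_t are user-defined network modules with compatible shapes:
import torch as t
from machin.frame.algorithms import DDPG

ddpg = DDPG(
    actor_net, actor_net_t, critic_net, critic_net_t,
    t.optim.Adam,
    t.nn.MSELoss(reduction="sum"),
)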
-
act
(state, use_target=False, **__)[source]¶ Use actor network to produce an action for the current state.
- Parameters
state (Dict[str, Any]) – Current state.
use_target (bool) – Whether to use the target network.
- Returns
Anything returned by your actor network.
-
act_discrete
(state, use_target=False, **__)[source]¶ Use actor network to produce a discrete action for the current state.
Notes
The actor network must output a probability tensor of shape (batch_size, action_dims), where each row sums to 1 along dimension 1.
- Parameters
state (Dict[str, Any]) – Current state.
use_target (bool) – Whether to use the target network.
- Returns
Action of shape [batch_size, 1]. Action probability tensor of shape [batch_size, action_num], produced by your actor. Any other things returned by your Q network, if they exist.
-
act_discrete_with_noise
(state, use_target=False, choose_max_prob=0.95, **__)[source]¶ Use actor network to produce a noisy discrete action for the current state.
Notes
The actor network must output a probability tensor of shape (batch_size, action_dims), where each row sums to 1 along dimension 1.
- Parameters
state (Dict[str, Any]) – Current state.
use_target (bool) – Whether to use the target network.
choose_max_prob (float) – Probability of choosing the largest component when the actor outputs an extreme probability vector like [0, 1, 0, 0].
- Returns
Noisy action of shape [batch_size, 1]. Action probability tensor of shape [batch_size, action_num]. Any other things returned by your Q network, if they exist.
-
act_with_noise
(state, noise_param=(0.0, 1.0), ratio=1.0, mode='uniform', use_target=False, **__)[source]¶ Use actor network to produce a noisy action for the current state.
- Parameters
state (Dict[str, Any]) – Current state.
noise_param (Any) – Noise params.
ratio (float) – Noise ratio.
mode (str) – Noise mode. Supported modes are: "uniform", "normal", "clipped_normal", "ou".
use_target (bool) – Whether to use the target network.
- Returns
Noisy action of shape [batch_size, action_dim]. Any other things returned by your actor network, if they exist.
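A hedged call sketch, assuming old_state is a state tensor of shape [1, state_dim] and the key "state" matches your actor's forward() argument name:
action = ddpg.act_with_noise(
    {"state": old_state}, noise_param=(0.0, 0.2), mode="normal"
)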
-
static
action_transform_function
(raw_output_action, *_)[source]¶ The action transform function is used to transform the output of the actor to the input of the critic. The action transform function must accept:
Raw action from the actor model.
Concatenated Transition.next_state.
Any other concatenated lists of custom keys from Transition.
- and returns:
A dictionary with the same form as Transition.action.
- Parameters
raw_output_action (Any) – Raw action from the actor model.
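To customize this behavior, one can override it in a subclass; a sketch that rescales a tanh-bounded raw action into a hypothetical [-2, 2] range:
class MyDDPG(DDPG):
    @staticmethod
    def action_transform_function(raw_output_action, *_):
        # the "action" key is assumed to match the key used in stored transitions
        return {"action": raw_output_action * 2.0}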
-
enable_multiprocessing
()¶ Enable multiprocessing for all modules.
-
classmethod
generate_config
(config)[source]¶ - Parameters
config (Union[Dict[str, Any], machin.utils.conf.Config]) –
-
classmethod
get_restorable_model_names
()¶ Get attribute name of restorable nn models.
-
classmethod
get_top_model_names
()¶ Get attribute name of top level nn models.
-
classmethod
init_from_config
(config, model_device='cpu')[source]¶ - Parameters
config (Union[Dict[str, Any], machin.utils.conf.Config]) –
model_device (Union[str, torch.device]) –
-
classmethod
is_distributed
()¶ Whether this framework is a distributed framework which requires multiple processes to run, and depends on torch.distributed or torch.distributed.rpc.
-
load
(model_dir, network_map=None, version=-1)[source]¶ Load models.
An example of network map:
{"restorable_model_1": "file_name_1", "restorable_model_2": "file_name_2"}
Get keys by calling
<Class name>.get_restorable()
- Parameters
model_dir (str) – Save directory.
network_map (Dict[str, str]) – Key is module name, value is saved name.
version (int) – Version number of the save to be loaded.
-
save
(model_dir, network_map=None, version=0)¶ Save models.
An example of network map:
{"restorable_model_1": "file_name_1", "restorable_model_2": "file_name_2"}
Get keys by calling
<Class name>.get_restorable()
- Parameters
model_dir (str) – Save directory.
network_map (Dict[str, str]) – Key is module name, value is saved name.
version (int) – Version number of the new save.
-
set_backward_function
(backward_func)¶ Replace the default backward function with a custom function. The default loss backward function is
torch.autograd.backward
- Parameters
backward_func (Callable) –
-
store_episode
(episode)[source]¶ Add a full episode of transition samples to the replay buffer.
- Parameters
episode (List[Union[machin.frame.transition.Transition, Dict]]) –
-
store_transition
(transition)[source]¶ Add a transition sample to the replay buffer.
- Parameters
transition (Union[machin.frame.transition.Transition, Dict]) –
-
update
(update_value=True, update_policy=True, update_target=True, concatenate_samples=True, **__)[source]¶ Update network weights by sampling from replay buffer.
- Parameters
update_value – Whether to update the Q network.
update_policy – Whether to update the actor network.
update_target – Whether to update targets.
concatenate_samples – Whether to concatenate the samples.
- Returns
Mean estimated policy value and value loss.
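A compact training-loop sketch under the same assumptions as the construction example above (env follows the classic gym step API; the "state"/"action" keys must match your models' forward() signatures):
state = env.reset()
for step in range(1000):
    with t.no_grad():
        action = ddpg.act_with_noise(
            {"state": state}, noise_param=(0.0, 0.2), mode="normal"
        )
    next_state, reward, terminal, _ = env.step(action)
    ddpg.store_transition({
        "state": {"state": state},
        "action": {"action": action},
        "next_state": {"state": next_state},
        "reward": reward,
        "terminal": terminal,
    })
    state = next_state
    if step > 100:
        ddpg.update()  # returns (mean estimated policy value, value loss)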
-
visualize_model
(final_tensor, name, directory)¶ - Parameters
final_tensor (torch.Tensor) –
name (str) –
directory (str) –
-
property
backward_function
¶
-
property
lr_schedulers
¶
-
property
optimizers
¶
-
property
restorable_models
¶
-
property
top_models
¶
Hysteretic DDPG¶
-
class
machin.frame.algorithms.hddpg.
HDDPG
(actor, actor_target, critic, critic_target, optimizer, criterion, *_, lr_scheduler=None, lr_scheduler_args=None, lr_scheduler_kwargs=None, batch_size=100, update_rate=0.005, update_steps=None, actor_learning_rate=0.0005, critic_learning_rate=0.001, discount=0.99, gradient_max=inf, q_increase_rate=1.0, q_decrease_rate=1.0, replay_size=500000, replay_device='cpu', replay_buffer=None, visualize=False, visualize_dir='', **__)[source]¶ Bases:
machin.frame.algorithms.ddpg.DDPG
HDDPG framework.
See also: DDPG.
- Parameters
actor (Union[machin.model.nets.base.NeuralNetworkModule, torch.nn.modules.module.Module]) – Actor network module.
actor_target (Union[machin.model.nets.base.NeuralNetworkModule, torch.nn.modules.module.Module]) – Target actor network module.
critic (Union[machin.model.nets.base.NeuralNetworkModule, torch.nn.modules.module.Module]) – Critic network module.
critic_target (Union[machin.model.nets.base.NeuralNetworkModule, torch.nn.modules.module.Module]) – Target critic network module.
optimizer (Callable) – Optimizer used to optimize actor and critic.
criterion (Callable) – Criterion used to evaluate the value loss.
lr_scheduler (Callable) – Learning rate scheduler of optimizer.
lr_scheduler_args (Tuple[Tuple, Tuple]) – Arguments of the learning rate scheduler.
lr_scheduler_kwargs (Tuple[Dict, Dict]) – Keyword arguments of the learning rate scheduler.
batch_size (int) – Batch size used during training.
update_rate (float) – \(\tau\) used to update target networks. Target parameters are updated as: \(\theta_t = \theta * \tau + \theta_t * (1 - \tau)\)
update_steps (Optional[int]) – Training step number used to update target networks.
actor_learning_rate (float) – Learning rate of the actor optimizer, not compatible with lr_scheduler.
critic_learning_rate (float) – Learning rate of the critic optimizer, not compatible with lr_scheduler.
discount (float) – \(\gamma\) used in the Bellman function.
replay_size (int) – Replay buffer size. Not compatible with replay_buffer.
replay_device (Union[str, torch.device]) – Device on which the replay buffer is located. Not compatible with replay_buffer.
replay_buffer (machin.frame.buffers.buffer.Buffer) – Custom replay buffer.
visualize (bool) – Whether to visualize the network flow in the first pass.
visualize_dir (str) – Visualized graph save directory.
gradient_max (float) – Maximum gradient.
q_increase_rate (float) –
q_decrease_rate (float) –
-
act
(state, use_target=False, **__)¶ Use actor network to produce an action for the current state.
- Parameters
state (Dict[str, Any]) – Current state.
use_target (bool) – Whether to use the target network.
- Returns
Anything returned by your actor network.
-
act_discrete
(state, use_target=False, **__)¶ Use actor network to produce a discrete action for the current state.
Notes
The actor network must output a probability tensor of shape (batch_size, action_dims), where each row sums to 1 along dimension 1.
- Parameters
state (Dict[str, Any]) – Current state.
use_target (bool) – Whether to use the target network.
- Returns
Action of shape [batch_size, 1]. Action probability tensor of shape [batch_size, action_num], produced by your actor. Any other things returned by your Q network, if they exist.
-
act_discrete_with_noise
(state, use_target=False, choose_max_prob=0.95, **__)¶ Use actor network to produce a noisy discrete action for the current state.
Notes
actor network must output a probability tensor, of shape (batch_size, action_dims), and has a sum of 1 for each row in dimension 1.
- Parameters
state (Dict[str, Any]) – Current state.
use_target (bool) – Whether to use the target network.
choose_max_prob (float) – Probability of choosing the largest component when the actor outputs an extreme probability vector like [0, 1, 0, 0].
- Returns
Noisy action of shape [batch_size, 1]. Action probability tensor of shape [batch_size, action_num]. Any other things returned by your Q network, if they exist.
-
act_with_noise
(state, noise_param=(0.0, 1.0), ratio=1.0, mode='uniform', use_target=False, **__)¶ Use actor network to produce a noisy action for the current state.
- Parameters
state (Dict[str, Any]) – Current state.
noise_param (Any) – Noise params.
ratio (float) – Noise ratio.
mode (str) – Noise mode. Supported modes are: "uniform", "normal", "clipped_normal", "ou".
use_target (bool) – Whether to use the target network.
- Returns
Noisy action of shape [batch_size, action_dim]. Any other things returned by your actor network, if they exist.
-
static
action_transform_function
(raw_output_action, *_)¶ The action transform function is used to transform the output of the actor to the input of the critic. The action transform function must accept:
Raw action from the actor model.
Concatenated Transition.next_state.
Any other concatenated lists of custom keys from Transition.
- and returns:
A dictionary with the same form as Transition.action.
- Parameters
raw_output_action (Any) – Raw action from the actor model.
-
enable_multiprocessing
()¶ Enable multiprocessing for all modules.
-
classmethod
generate_config
(config)[source]¶ - Parameters
config (Union[Dict[str, Any], machin.utils.conf.Config]) –
-
classmethod
get_restorable_model_names
()¶ Get attribute name of restorable nn models.
-
classmethod
get_top_model_names
()¶ Get attribute name of top level nn models.
-
classmethod
init_from_config
(config, model_device='cpu')¶ - Parameters
config (Union[Dict[str, Any], machin.utils.conf.Config]) –
model_device (Union[str, torch.device]) –
-
classmethod
is_distributed
()¶ Whether this framework is a distributed framework which requires multiple processes to run, and depends on torch.distributed or torch.distributed.rpc.
-
load
(model_dir, network_map=None, version=-1)¶ Load models.
An example of network map:
{"restorable_model_1": "file_name_1", "restorable_model_2": "file_name_2"}
Get keys by calling
<Class name>.get_restorable()
- Parameters
model_dir (str) – Save directory.
network_map (Dict[str, str]) – Key is module name, value is saved name.
version (int) – Version number of the save to be loaded.
-
static
reward_function
(reward, discount, next_value, terminal, _)¶
-
save
(model_dir, network_map=None, version=0)¶ Save models.
An example of network map:
{"restorable_model_1": "file_name_1", "restorable_model_2": "file_name_2"}
Get keys by calling
<Class name>.get_restorable()
- Parameters
model_dir (str) – Save directory.
network_map (Dict[str, str]) – Key is module name, value is saved name.
version (int) – Version number of the new save.
-
set_backward_function
(backward_func)¶ Replace the default backward function with a custom function. The default loss backward function is
torch.autograd.backward
- Parameters
backward_func (Callable) –
-
store_episode
(episode)¶ Add a full episode of transition samples to the replay buffer.
- Parameters
episode (List[Union[machin.frame.transition.Transition, Dict]]) –
-
store_transition
(transition)¶ Add a transition sample to the replay buffer.
- Parameters
transition (Union[machin.frame.transition.Transition, Dict]) –
-
update
(update_value=True, update_policy=True, update_target=True, concatenate_samples=True, **__)[source]¶ Update network weights by sampling from replay buffer.
- Parameters
update_value – Whether to update the Q network.
update_policy – Whether to update the actor network.
update_target – Whether to update targets.
concatenate_samples – Whether to concatenate the samples.
- Returns
Mean estimated policy value and value loss.
-
update_lr_scheduler
()¶ Update learning rate schedulers.
-
visualize_model
(final_tensor, name, directory)¶ - Parameters
final_tensor (torch.Tensor) –
name (str) –
directory (str) –
-
property
backward_function
¶
-
property
lr_schedulers
¶
-
property
optimizers
¶
-
property
restorable_models
¶
-
property
top_models
¶
DDPG with prioritized replay¶
-
class
machin.frame.algorithms.ddpg_per.
DDPGPer
(actor, actor_target, critic, critic_target, optimizer, criterion, *_, lr_scheduler=None, lr_scheduler_args=None, lr_scheduler_kwargs=None, batch_size=100, update_rate=0.005, update_steps=None, actor_learning_rate=0.0005, critic_learning_rate=0.001, discount=0.99, gradient_max=inf, replay_size=500000, replay_device='cpu', replay_buffer=None, visualize=False, visualize_dir='', **__)[source]¶ Bases:
machin.frame.algorithms.ddpg.DDPG
DDPG with prioritized experience replay.
Warning
Your criterion must return a tensor of shape [batch_size, 1] when given two tensors of shape [batch_size, 1], since we need to multiply the loss with the importance sampling weight element-wise.
If you are using loss modules provided by PyTorch, it is always safe to use them without any modification.
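A standalone sketch of that shape contract (reduction="none" is used here only so a stock PyTorch loss returns per-sample values; per the warning above, the framework handles standard loss modules itself):
import torch as t

criterion = t.nn.MSELoss(reduction="none")
target_value = t.zeros(8, 1)
predicted_value = t.rand(8, 1)
loss = criterion(predicted_value, target_value)
assert loss.shape == (8, 1)  # keeps the batch dimension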
Note
Your optimizer will be called as:
optimizer(network.parameters(), learning_rate)
Your lr_scheduler will be called as:
lr_scheduler(optimizer, *lr_scheduler_args[0], **lr_scheduler_kwargs[0])
Your criterion will be called as:
criterion(target_value.view(batch_size, 1), predicted_value.view(batch_size, 1))
Note
DDPG supports two ways of updating the target network. The first is polyak update (soft update), which updates the target network in every training step by mixing its weights with the online network using update_rate. The other is hard update, which copies the weights of the online network after every update_steps training steps. You can specify either update_rate or update_steps to select one update scheme; if both are specified, an error will be raised. These two update schemes may result in different training stability.
- Parameters
actor (Union[machin.model.nets.base.NeuralNetworkModule, torch.nn.modules.module.Module]) – Actor network module.
actor_target (Union[machin.model.nets.base.NeuralNetworkModule, torch.nn.modules.module.Module]) – Target actor network module.
critic (Union[machin.model.nets.base.NeuralNetworkModule, torch.nn.modules.module.Module]) – Critic network module.
critic_target (Union[machin.model.nets.base.NeuralNetworkModule, torch.nn.modules.module.Module]) – Target critic network module.
optimizer (Callable) – Optimizer used to optimize actor and critic.
criterion – Criterion used to evaluate the value loss.
lr_scheduler (Callable) – Learning rate scheduler of optimizer.
lr_scheduler_args (Tuple[Tuple, Tuple]) – Arguments of the learning rate scheduler.
lr_scheduler_kwargs (Tuple[Dict, Dict]) – Keyword arguments of the learning rate scheduler.
batch_size (int) – Batch size used during training.
update_rate (float) – \(\tau\) used to update target networks. Target parameters are updated as: \(\theta_t = \theta * \tau + \theta_t * (1 - \tau)\)
update_steps (Optional[int]) – Training step number used to update target networks.
actor_learning_rate (float) – Learning rate of the actor optimizer, not compatible with lr_scheduler.
critic_learning_rate (float) – Learning rate of the critic optimizer, not compatible with lr_scheduler.
discount (float) – \(\gamma\) used in the Bellman function.
replay_size (int) – Replay buffer size. Not compatible with replay_buffer.
replay_device (Union[str, torch.device]) – Device on which the replay buffer is located. Not compatible with replay_buffer.
replay_buffer (machin.frame.buffers.buffer.Buffer) – Custom replay buffer.
visualize (bool) – Whether to visualize the network flow in the first pass.
visualize_dir (str) – Visualized graph save directory.
gradient_max (float) – Maximum gradient.
-
act
(state, use_target=False, **__)¶ Use actor network to produce an action for the current state.
- Parameters
state (Dict[str, Any]) – Current state.
use_target (bool) – Whether to use the target network.
- Returns
Anything returned by your actor network.
-
act_discrete
(state, use_target=False, **__)¶ Use actor network to produce a discrete action for the current state.
Notes
The actor network must output a probability tensor of shape (batch_size, action_dims), where each row sums to 1 along dimension 1.
- Parameters
state (Dict[str, Any]) – Current state.
use_target (bool) – Whether to use the target network.
- Returns
Action of shape [batch_size, 1]. Action probability tensor of shape [batch_size, action_num], produced by your actor. Any other things returned by your Q network, if they exist.
-
act_discrete_with_noise
(state, use_target=False, choose_max_prob=0.95, **__)¶ Use actor network to produce a noisy discrete action for the current state.
Notes
The actor network must output a probability tensor of shape (batch_size, action_dims), where each row sums to 1 along dimension 1.
- Parameters
state (Dict[str, Any]) – Current state.
use_target (bool) – Whether to use the target network.
choose_max_prob (float) – Probability of choosing the largest component when the actor outputs an extreme probability vector like [0, 1, 0, 0].
- Returns
Noisy action of shape [batch_size, 1]. Action probability tensor of shape [batch_size, action_num]. Any other things returned by your Q network, if they exist.
-
act_with_noise
(state, noise_param=(0.0, 1.0), ratio=1.0, mode='uniform', use_target=False, **__)¶ Use actor network to produce a noisy action for the current state.
- Parameters
state (Dict[str, Any]) – Current state.
noise_param (Any) – Noise params.
ratio (float) – Noise ratio.
mode (str) – Noise mode. Supported modes are: "uniform", "normal", "clipped_normal", "ou".
use_target (bool) – Whether to use the target network.
- Returns
Noisy action of shape [batch_size, action_dim]. Any other things returned by your actor network, if they exist.
-
static
action_transform_function
(raw_output_action, *_)¶ The action transform function is used to transform the output of the actor to the input of the critic. The action transform function must accept:
Raw action from the actor model.
Concatenated Transition.next_state.
Any other concatenated lists of custom keys from Transition.
- and returns:
A dictionary with the same form as Transition.action.
- Parameters
raw_output_action (Any) – Raw action from the actor model.
-
enable_multiprocessing
()¶ Enable multiprocessing for all modules.
-
classmethod
generate_config
(config)[source]¶ - Parameters
config (Union[Dict[str, Any], machin.utils.conf.Config]) –
-
classmethod
get_restorable_model_names
()¶ Get attribute name of restorable nn models.
-
classmethod
get_top_model_names
()¶ Get attribute name of top level nn models.
-
classmethod
init_from_config
(config, model_device='cpu')[source]¶ - Parameters
config (Union[Dict[str, Any], machin.utils.conf.Config]) –
model_device (Union[str, torch.device]) –
-
classmethod
is_distributed
()¶ Whether this framework is a distributed framework which requires multiple processes to run, and depends on torch.distributed or torch.distributed.rpc.
-
load
(model_dir, network_map=None, version=-1)¶ Load models.
An example of network map:
{"restorable_model_1": "file_name_1", "restorable_model_2": "file_name_2"}
Get keys by calling
<Class name>.get_restorable()
- Parameters
model_dir (str) – Save directory.
network_map (Dict[str, str]) – Key is module name, value is saved name.
version (int) – Version number of the save to be loaded.
-
static
reward_function
(reward, discount, next_value, terminal, _)¶
-
save
(model_dir, network_map=None, version=0)¶ Save models.
An example of network map:
{"restorable_model_1": "file_name_1", "restorable_model_2": "file_name_2"}
Get keys by calling
<Class name>.get_restorable()
- Parameters
model_dir (str) – Save directory.
network_map (Dict[str, str]) – Key is module name, value is saved name.
version (int) – Version number of the new save.
-
set_backward_function
(backward_func)¶ Replace the default backward function with a custom function. The default loss backward function is
torch.autograd.backward
- Parameters
backward_func (Callable) –
-
store_episode
(episode)¶ Add a full episode of transition samples to the replay buffer.
- Parameters
episode (List[Union[machin.frame.transition.Transition, Dict]]) –
-
store_transition
(transition)¶ Add a transition sample to the replay buffer.
- Parameters
transition (Union[machin.frame.transition.Transition, Dict]) –
-
update
(update_value=True, update_policy=True, update_target=True, concatenate_samples=True, **__)[source]¶ Update network weights by sampling from replay buffer.
- Parameters
update_value – Whether to update the Q network.
update_policy – Whether to update the actor network.
update_target – Whether to update targets.
concatenate_samples – Whether to concatenate the samples.
- Returns
Mean estimated policy value and value loss.
-
update_lr_scheduler
()¶ Update learning rate schedulers.
-
visualize_model
(final_tensor, name, directory)¶ - Parameters
final_tensor (torch.Tensor) –
name (str) –
directory (str) –
-
property
backward_function
¶
-
property
lr_schedulers
¶
-
property
optimizers
¶
-
property
restorable_models
¶
-
property
top_models
¶
TD3¶
-
class
machin.frame.algorithms.td3.
TD3
(actor, actor_target, critic, critic_target, critic2, critic2_target, optimizer, criterion, *_, lr_scheduler=None, lr_scheduler_args=None, lr_scheduler_kwargs=None, batch_size=100, update_rate=0.001, update_steps=None, actor_learning_rate=0.0005, critic_learning_rate=0.001, discount=0.99, gradient_max=inf, replay_size=500000, replay_device='cpu', replay_buffer=None, visualize=False, visualize_dir='', **__)[source]¶ Bases:
machin.frame.algorithms.ddpg.DDPG
TD3 framework, which adds an additional pair of critic and target critic networks to DDPG.
See also: DDPG.
- Parameters
actor (Union[machin.model.nets.base.NeuralNetworkModule, torch.nn.modules.module.Module]) – Actor network module.
actor_target (Union[machin.model.nets.base.NeuralNetworkModule, torch.nn.modules.module.Module]) – Target actor network module.
critic (Union[machin.model.nets.base.NeuralNetworkModule, torch.nn.modules.module.Module]) – Critic network module.
critic_target (Union[machin.model.nets.base.NeuralNetworkModule, torch.nn.modules.module.Module]) – Target critic network module.
critic2 (Union[machin.model.nets.base.NeuralNetworkModule, torch.nn.modules.module.Module]) – The second critic network module.
critic2_target (Union[machin.model.nets.base.NeuralNetworkModule, torch.nn.modules.module.Module]) – The second target critic network module.
optimizer (Callable) – Optimizer used to optimize actor, critic and critic2.
criterion (Callable) – Criterion used to evaluate the value loss.
lr_scheduler (Callable) – Learning rate scheduler of optimizer.
lr_scheduler_args (Tuple[Tuple, Tuple, Tuple]) – Arguments of the learning rate scheduler.
lr_scheduler_kwargs (Tuple[Dict, Dict, Dict]) – Keyword arguments of the learning rate scheduler.
batch_size (int) – Batch size used during training.
update_rate (float) – \(\tau\) used to update target networks. Target parameters are updated as: \(\theta_t = \theta * \tau + \theta_t * (1 - \tau)\)
update_steps (Optional[int]) – Training step number used to update target networks.
actor_learning_rate (float) – Learning rate of the actor optimizer, not compatible with lr_scheduler.
critic_learning_rate (float) – Learning rate of the critic optimizer, not compatible with lr_scheduler.
discount (float) – \(\gamma\) used in the Bellman function.
replay_size (int) – Replay buffer size. Not compatible with replay_buffer.
replay_device (Union[str, torch.device]) – Device on which the replay buffer is located. Not compatible with replay_buffer.
replay_buffer (machin.frame.buffers.buffer.Buffer) – Custom replay buffer.
visualize (bool) – Whether to visualize the network flow in the first pass.
visualize_dir (str) – Visualized graph save directory.
gradient_max (float) – Maximum gradient.
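A minimal construction sketch, assuming all six network modules are user-defined with compatible shapes:
import torch as t
from machin.frame.algorithms import TD3

td3 = TD3(
    actor_net, actor_net_t,
    critic_net, critic_net_t,
    critic2_net, critic2_net_t,
    t.optim.Adam,
    t.nn.MSELoss(reduction="sum"),
)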
-
act
(state, use_target=False, **__)¶ Use actor network to produce an action for the current state.
- Parameters
state (Dict[str, Any]) – Current state.
use_target (bool) – Whether to use the target network.
- Returns
Anything returned by your actor network.
-
act_discrete
(state, use_target=False, **__)¶ Use actor network to produce a discrete action for the current state.
Notes
The actor network must output a probability tensor of shape (batch_size, action_dims), where each row sums to 1 along dimension 1.
- Parameters
state (Dict[str, Any]) – Current state.
use_target (bool) – Whether to use the target network.
- Returns
Action of shape [batch_size, 1]. Action probability tensor of shape [batch_size, action_num], produced by your actor. Any other things returned by your Q network, if they exist.
-
act_discrete_with_noise
(state, use_target=False, choose_max_prob=0.95, **__)¶ Use actor network to produce a noisy discrete action for the current state.
Notes
The actor network must output a probability tensor of shape (batch_size, action_dims), where each row sums to 1 along dimension 1.
- Parameters
state (Dict[str, Any]) – Current state.
use_target (bool) – Whether to use the target network.
choose_max_prob (float) – Probability of choosing the largest component when the actor outputs an extreme probability vector like [0, 1, 0, 0].
- Returns
Noisy action of shape [batch_size, 1]. Action probability tensor of shape [batch_size, action_num]. Any other things returned by your Q network, if they exist.
-
act_with_noise
(state, noise_param=(0.0, 1.0), ratio=1.0, mode='uniform', use_target=False, **__)¶ Use actor network to produce a noisy action for the current state.
- Parameters
state (Dict[str, Any]) – Current state.
noise_param (Any) – Noise params.
ratio (float) – Noise ratio.
mode (str) – Noise mode. Supported modes are: "uniform", "normal", "clipped_normal", "ou".
use_target (bool) – Whether to use the target network.
- Returns
Noisy action of shape [batch_size, action_dim]. Any other things returned by your actor network, if they exist.
-
static
action_transform_function
(raw_output_action, *_)¶ The action transform function is used to transform the output of the actor to the input of the critic. The action transform function must accept:
Raw action from the actor model.
Concatenated Transition.next_state.
Any other concatenated lists of custom keys from Transition.
- and returns:
A dictionary with the same form as Transition.action.
- Parameters
raw_output_action (Any) – Raw action from the actor model.
-
enable_multiprocessing
()¶ Enable multiprocessing for all modules.
-
classmethod
generate_config
(config)[source]¶ - Parameters
config (Union[Dict[str, Any], machin.utils.conf.Config]) –
-
classmethod
get_restorable_model_names
()¶ Get attribute name of restorable nn models.
-
classmethod
get_top_model_names
()¶ Get attribute name of top level nn models.
-
classmethod
init_from_config
(config, model_device='cpu')¶ - Parameters
config (Union[Dict[str, Any], machin.utils.conf.Config]) –
model_device (Union[str, torch.device]) –
-
classmethod
is_distributed
()¶ Whether this framework is a distributed framework which requires multiple processes to run, and depends on torch.distributed or torch.distributed.rpc.
-
load
(model_dir, network_map=None, version=-1)[source]¶ Load models.
An example of network map:
{"restorable_model_1": "file_name_1", "restorable_model_2": "file_name_2"}
Get keys by calling
<Class name>.get_restorable()
- Parameters
model_dir (str) – Save directory.
network_map (Dict[str, str]) – Key is module name, value is saved name.
version (int) – Version number of the save to be loaded.
-
static
reward_function
(reward, discount, next_value, terminal, _)¶
-
save
(model_dir, network_map=None, version=0)¶ Save models.
An example of network map:
{"restorable_model_1": "file_name_1", "restorable_model_2": "file_name_2"}
Get keys by calling
<Class name>.get_restorable()
- Parameters
model_dir (str) – Save directory.
network_map (Dict[str, str]) – Key is module name, value is saved name.
version (int) – Version number of the new save.
-
set_backward_function
(backward_func)¶ Replace the default backward function with a custom function. The default loss backward function is
torch.autograd.backward
- Parameters
backward_func (Callable) –
-
store_episode
(episode)¶ Add a full episode of transition samples to the replay buffer.
- Parameters
episode (List[Union[machin.frame.transition.Transition, Dict]]) –
-
store_transition
(transition)¶ Add a transition sample to the replay buffer.
- Parameters
transition (Union[machin.frame.transition.Transition, Dict]) –
-
update
(update_value=True, update_policy=True, update_target=True, concatenate_samples=True, **__)[source]¶ Update network weights by sampling from replay buffer.
- Parameters
update_value – Whether to update the Q network.
update_policy – Whether to update the actor network.
update_target – Whether to update targets.
concatenate_samples – Whether to concatenate the samples.
- Returns
Mean estimated policy value and value loss.
-
visualize_model
(final_tensor, name, directory)¶ - Parameters
final_tensor (torch.Tensor) –
name (str) –
directory (str) –
-
property
backward_function
¶
-
property
lr_schedulers
¶
-
property
optimizers
¶
-
property
restorable_models
¶
-
property
top_models
¶
DQN, Fixed-Target DQN, Dueling DQN, Double DQN¶
-
class
machin.frame.algorithms.dqn.
DQN
(qnet, qnet_target, optimizer, criterion, *_, lr_scheduler=None, lr_scheduler_args=None, lr_scheduler_kwargs=None, batch_size=100, epsilon_decay=0.9999, update_rate=0.005, update_steps=None, learning_rate=0.001, discount=0.99, gradient_max=inf, replay_size=500000, replay_device='cpu', replay_buffer=None, mode='double', visualize=False, visualize_dir='', **__)[source]¶ Bases:
machin.frame.algorithms.base.TorchFramework
DQN framework.
Note
DQN is only available for discrete environments.
Note
Dueling DQN is a network structure rather than a framework, so it could be applied to all three modes.
If mode = "vanilla", implements the simplest online DQN with a replay buffer.
If mode = "fixed_target", implements DQN with a target network and a replay buffer, as described in this paper.
If mode = "double", implements Double DQN, as described in this paper.
Note
Vanilla DQN only needs one network, so internally, qnet is assigned to qnet_target.
Note
In order to implement dueling DQN, you should create two dense output layers.
In your q network:
self.fc_adv = nn.Linear(in_features=..., out_features=num_actions)
self.fc_val = nn.Linear(in_features=..., out_features=1)
Then in your forward() method, you should implement the output as:
adv = self.fc_adv(some_input)
val = self.fc_val(some_input).expand(self.batch_size, self.num_actions)
return val + adv - adv.mean(1, keepdim=True)
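Putting those pieces together, a self-contained dueling Q-network sketch (state_dim and num_actions are assumptions):
import torch as t
import torch.nn as nn

class DuelingQNet(nn.Module):
    def __init__(self, state_dim, num_actions):
        super().__init__()
        self.fc = nn.Linear(state_dim, 100)
        self.fc_adv = nn.Linear(100, num_actions)
        self.fc_val = nn.Linear(100, 1)

    def forward(self, state):
        x = t.relu(self.fc(state))
        adv = self.fc_adv(x)
        # expand the scalar state value across all actions
        val = self.fc_val(x).expand(x.shape[0], adv.shape[1])
        # subtract the mean advantage so Q values are identifiable
        return val + adv - adv.mean(1, keepdim=True)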
Note
Your optimizer will be called as:
optimizer(network.parameters(), learning_rate)
Your lr_scheduler will be called as:
lr_scheduler(optimizer, *lr_scheduler_args[0], **lr_scheduler_kwargs[0])
Your criterion will be called as:
criterion(target_value.view(batch_size, 1), predicted_value.view(batch_size, 1))
Note
DQN supports two ways of updating the target network. The first is polyak update (soft update), which updates the target network in every training step by mixing its weights with the online network using update_rate. The other is hard update, which copies the weights of the online network after every update_steps training steps. You can specify either update_rate or update_steps to select one update scheme; if both are specified, an error will be raised. These two update schemes may result in different training stability.
-
epsilon
¶ Current epsilon value; determines randomness in act_discrete_with_noise. You can set it to any value.
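For example (dqn is an assumed framework instance; pass decay_epsilon=False to act_discrete_with_noise to keep it fixed):
dqn.epsilon = 0.1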
- Parameters
qnet (Union[machin.model.nets.base.NeuralNetworkModule, torch.nn.modules.module.Module]) – Q network module.
qnet_target (Union[machin.model.nets.base.NeuralNetworkModule, torch.nn.modules.module.Module]) – Target Q network module.
optimizer (Callable) – Optimizer used to optimize qnet.
criterion (Callable) – Criterion used to evaluate the value loss.
learning_rate (float) – Learning rate of the optimizer, not compatible with lr_scheduler.
lr_scheduler (Callable) – Learning rate scheduler of optimizer.
lr_scheduler_args (Tuple[Tuple]) – Arguments of the learning rate scheduler.
lr_scheduler_kwargs (Tuple[Dict]) – Keyword arguments of the learning rate scheduler.
batch_size (int) – Batch size used during training.
epsilon_decay (float) – Epsilon decay rate per acting-with-noise step. The epsilon attribute is multiplied by this every time act_discrete_with_noise is called.
update_rate (Optional[float]) – \(\tau\) used to update target networks. Target parameters are updated as: \(\theta_t = \theta * \tau + \theta_t * (1 - \tau)\)
update_steps (Optional[int]) – Training step number used to update target networks.
discount (float) – \(\gamma\) used in the Bellman function.
gradient_max (float) – Maximum gradient.
replay_size (int) – Replay buffer size. Not compatible with replay_buffer.
replay_device (Union[str, torch.device]) – Device on which the replay buffer is located. Not compatible with replay_buffer.
replay_buffer (machin.frame.buffers.buffer.Buffer) – Custom replay buffer.
mode (str) – one of "vanilla", "fixed_target", "double".
visualize (bool) – Whether to visualize the network flow in the first pass.
visualize_dir (str) – Visualized graph save directory.
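A minimal construction sketch, assuming q_net and q_net_t are user-defined Q-network modules:
import torch as t
from machin.frame.algorithms import DQN

dqn = DQN(q_net, q_net_t, t.optim.Adam, t.nn.MSELoss(reduction="sum"), mode="double")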
-
act_discrete
(state, use_target=False, **__)[source]¶ Use Q network to produce a discrete action for the current state.
- Parameters
state (Dict[str, Any]) – Current state.
use_target (bool) – Whether to use the target network.
- Returns
Action of shape [batch_size, 1]. Any other things returned by your Q network, if they exist.
-
act_discrete_with_noise
(state, use_target=False, decay_epsilon=True, **__)[source]¶ Randomly selects an action from the action space according to a uniform distribution, with regard to the epsilon decay policy.
- Parameters
state (Dict[str, Any]) – Current state.
use_target (bool) – Whether to use the target network.
decay_epsilon (bool) – Whether to decay the epsilon attribute.
- Returns
Noisy action of shape [batch_size, 1]. Any other things returned by your Q network, if they exist.
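A hedged call sketch, assuming old_state is a state tensor and the "state" key matches your Q network's forward() argument name:
action = dqn.act_discrete_with_noise({"state": old_state})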
-
static
action_get_function
(sampled_actions)[source]¶ This function is used to get action numbers (int tensor indicating which discrete actions are used) from the sampled action dictionary.
-
enable_multiprocessing
()¶ Enable multiprocessing for all modules.
-
classmethod
generate_config
(config)[source]¶ - Parameters
config (Union[Dict[str, Any], machin.utils.conf.Config]) –
-
classmethod
get_restorable_model_names
()¶ Get attribute name of restorable nn models.
-
classmethod
get_top_model_names
()¶ Get attribute name of top level nn models.
-
classmethod
init_from_config
(config, model_device='cpu')[source]¶ - Parameters
config (Union[Dict[str, Any], machin.utils.conf.Config]) –
model_device (Union[str, torch.device]) –
-
classmethod
is_distributed
()¶ Whether this framework is a distributed framework which requires multiple processes to run, and depends on torch.distributed or torch.distributed.rpc.
-
load
(model_dir, network_map=None, version=-1)[source]¶ Load models.
An example of network map:
{"restorable_model_1": "file_name_1", "restorable_model_2": "file_name_2"}
Get keys by calling
<Class name>.get_restorable()
- Parameters
model_dir – Save directory.
network_map – Key is module name, value is saved name.
version – Version number of the save to be loaded.
-
save
(model_dir, network_map=None, version=0)¶ Save models.
An example of network map:
{"restorable_model_1": "file_name_1", "restorable_model_2": "file_name_2"}
Get keys by calling
<Class name>.get_restorable()
- Parameters
model_dir (str) – Save directory.
network_map (Dict[str, str]) – Key is module name, value is saved name.
version (int) – Version number of the new save.
-
set_backward_function
(backward_func)¶ Replace the default backward function with a custom function. The default loss backward function is
torch.autograd.backward
- Parameters
backward_func (Callable) –
-
store_episode
(episode)[source]¶ Add a full episode of transition samples to the replay buffer.
- Parameters
episode (List[Union[machin.frame.transition.Transition, Dict]]) –
-
store_transition
(transition)[source]¶ Add a transition sample to the replay buffer.
- Parameters
transition (Union[machin.frame.transition.Transition, Dict]) –
-
update
(update_value=True, update_target=True, concatenate_samples=True, **__)[source]¶ Update network weights by sampling from replay buffer.
- Parameters
update_value – Whether to update the Q network.
update_target – Whether to update targets.
concatenate_samples – Whether to concatenate the samples.
- Returns
Mean estimated policy value and value loss.
-
visualize_model
(final_tensor, name, directory)¶ - Parameters
final_tensor (torch.Tensor) –
name (str) –
directory (str) –
-
property
backward_function
¶
-
property
lr_schedulers
¶
-
property
optimizers
¶
-
property
restorable_models
¶
-
property
top_models
¶
DQN with prioritized replay¶
-
class
machin.frame.algorithms.dqn_per.
DQNPer
(qnet, qnet_target, optimizer, criterion, *_, lr_scheduler=None, lr_scheduler_args=None, lr_scheduler_kwargs=None, batch_size=100, epsilon_decay=0.9999, update_rate=0.005, update_steps=None, learning_rate=0.001, discount=0.99, gradient_max=inf, replay_size=500000, replay_device='cpu', replay_buffer=None, visualize=False, visualize_dir='', **__)[source]¶ Bases:
machin.frame.algorithms.dqn.DQN
DQN with prioritized replay. It is based on Double DQN.
Warning
Your criterion must return a tensor of shape [batch_size, 1] when given two tensors of shape [batch_size, 1], since we need to multiply the loss with the importance sampling weight element-wise.
If you are using loss modules provided by PyTorch, it is always safe to use them without any modification.
Note
DQN is only available for discrete environments.
Note
Dueling DQN is a network structure rather than a framework, so it could be applied to all three modes.
If mode = "vanilla", implements the simplest online DQN with a replay buffer.
If mode = "fixed_target", implements DQN with a target network and a replay buffer, as described in this paper.
If mode = "double", implements Double DQN, as described in this paper.
Note
Vanilla DQN only needs one network, so internally, qnet is assigned to qnet_target.
Note
In order to implement dueling DQN, you should create two dense output layers.
In your q network:
self.fc_adv = nn.Linear(in_features=..., out_features=num_actions)
self.fc_val = nn.Linear(in_features=..., out_features=1)
Then in your forward() method, you should implement the output as:
adv = self.fc_adv(some_input)
val = self.fc_val(some_input).expand(self.batch_size, self.num_actions)
return val + adv - adv.mean(1, keepdim=True)
Note
Your optimizer will be called as:
optimizer(network.parameters(), learning_rate)
Your lr_scheduler will be called as:
lr_scheduler(optimizer, *lr_scheduler_args[0], **lr_scheduler_kwargs[0])
Your criterion will be called as:
criterion(target_value.view(batch_size, 1), predicted_value.view(batch_size, 1))
Note
DQN supports two ways of updating the target network. The first is polyak update (soft update), which updates the target network in every training step by mixing its weights with the online network using update_rate. The other is hard update, which copies the weights of the online network after every update_steps training steps. You can specify either update_rate or update_steps to select one update scheme; if both are specified, an error will be raised. These two update schemes may result in different training stability.
-
epsilon
¶ Current epsilon value; determines randomness in act_discrete_with_noise. You can set it to any value.
- Parameters
qnet (Union[machin.model.nets.base.NeuralNetworkModule, torch.nn.modules.module.Module]) – Q network module.
qnet_target (Union[machin.model.nets.base.NeuralNetworkModule, torch.nn.modules.module.Module]) – Target Q network module.
optimizer (Callable) – Optimizer used to optimize qnet.
criterion (Callable) – Criterion used to evaluate the value loss.
learning_rate (float) – Learning rate of the optimizer, not compatible with lr_scheduler.
lr_scheduler (Callable) – Learning rate scheduler of optimizer.
lr_scheduler_args (Tuple[Tuple]) – Arguments of the learning rate scheduler.
lr_scheduler_kwargs (Tuple[Dict]) – Keyword arguments of the learning rate scheduler.
batch_size (int) – Batch size used during training.
epsilon_decay (float) – Epsilon decay rate per acting-with-noise step. The epsilon attribute is multiplied by this every time act_discrete_with_noise is called.
update_rate (Optional[float]) – \(\tau\) used to update target networks. Target parameters are updated as: \(\theta_t = \theta * \tau + \theta_t * (1 - \tau)\)
update_steps (Optional[int]) – Training step number used to update target networks.
discount (float) – \(\gamma\) used in the Bellman function.
gradient_max (float) – Maximum gradient.
replay_size (int) – Replay buffer size. Not compatible with replay_buffer.
replay_device (Union[str, torch.device]) – Device on which the replay buffer is located. Not compatible with replay_buffer.
replay_buffer (machin.frame.buffers.buffer.Buffer) – Custom replay buffer.
mode – one of "vanilla", "fixed_target", "double".
visualize (bool) – Whether to visualize the network flow in the first pass.
visualize_dir (str) – Visualized graph save directory.
-
act_discrete
(state, use_target=False, **__)¶ Use Q network to produce a discrete action for the current state.
- Parameters
state (Dict[str, Any]) – Current state.
use_target (bool) – Whether to use the target network.
- Returns
Action of shape [batch_size, 1]. Any other things returned by your Q network, if they exist.
-
act_discrete_with_noise
(state, use_target=False, decay_epsilon=True, **__)¶ Randomly selects an action from the action space according to a uniform distribution, with regard to the epsilon decay policy.
- Parameters
state (Dict[str, Any]) – Current state.
use_target (bool) – Whether to use the target network.
decay_epsilon (bool) – Whether to decay the epsilon attribute.
- Returns
Noisy action of shape [batch_size, 1]. Any other things returned by your Q network, if they exist.
-
static
action_get_function
(sampled_actions)¶ This function is used to get action numbers (int tensor indicating which discrete actions are used) from the sampled action dictionary.
-
enable_multiprocessing
()¶ Enable multiprocessing for all modules.
-
classmethod
generate_config
(config)[source]¶ - Parameters
config (Union[Dict[str, Any], machin.utils.conf.Config]) –
-
classmethod
get_restorable_model_names
()¶ Get attribute name of restorable nn models.
-
classmethod
get_top_model_names
()¶ Get attribute name of top level nn models.
-
classmethod
init_from_config
(config, model_device='cpu')[source]¶ - Parameters
config (Union[Dict[str, Any], machin.utils.conf.Config]) –
model_device (Union[str, torch.device]) –
-
classmethod
is_distributed
()¶ Whether this framework is a distributed framework which requires multiple processes to run, and depends on torch.distributed or torch.distributed.rpc.
-
load
(model_dir, network_map=None, version=-1)¶ Load models.
An example of network map:
{"restorable_model_1": "file_name_1", "restorable_model_2": "file_name_2"}
Get keys by calling
<Class name>.get_restorable()
- Parameters
model_dir – Save directory.
network_map – Key is module name, value is saved name.
version – Version number of the save to be loaded.
-
static
reward_function
(reward, discount, next_value, terminal, _)¶
-
save
(model_dir, network_map=None, version=0)¶ Save models.
An example of network map:
{"restorable_model_1": "file_name_1", "restorable_model_2": "file_name_2"}
Get keys by calling
<Class name>.get_restorable()
- Parameters
model_dir (str) – Save directory.
network_map (Dict[str, str]) – Key is module name, value is saved name.
version (int) – Version number of the new save.
-
set_backward_function
(backward_func)¶ Replace the default backward function with a custom function. The default loss backward function is
torch.autograd.backward
- Parameters
backward_func (Callable) –
-
store_episode
(episode)¶ Add a full episode of transition samples to the replay buffer.
- Parameters
episode (List[Union[machin.frame.transition.Transition, Dict]]) –
-
store_transition
(transition)¶ Add a transition sample to the replay buffer.
- Parameters
transition (Union[machin.frame.transition.Transition, Dict]) –
-
update
(update_value=True, update_target=True, concatenate_samples=True, **__)[source]¶ Update network weights by sampling from replay buffer.
- Parameters
update_value – Whether to update the Q network.
update_target – Whether to update targets.
concatenate_samples – Whether to concatenate the samples.
- Returns
Mean estimated policy value and value loss.
-
update_lr_scheduler
()¶ Update learning rate schedulers.
-
visualize_model
(final_tensor, name, directory)¶ - Parameters
final_tensor (torch.Tensor) –
name (str) –
directory (str) –
-
property
backward_function
¶
-
property
lr_schedulers
¶
-
property
optimizers
¶
-
property
restorable_models
¶
-
property
top_models
¶
RAINBOW¶
-
class
machin.frame.algorithms.rainbow.
RAINBOW
(qnet, qnet_target, optimizer, value_min, value_max, *_, lr_scheduler=None, lr_scheduler_args=None, lr_scheduler_kwargs=None, batch_size=100, epsilon_decay=0.9999, update_rate=0.001, update_steps=None, learning_rate=0.001, discount=0.99, gradient_max=inf, reward_future_steps=3, replay_size=500000, replay_device='cpu', replay_buffer=None, visualize=False, visualize_dir='', **__)[source]¶ Bases:
machin.frame.algorithms.dqn.DQN
RAINBOW DQN framework.
The RAINBOW framework is described in this paper.
Note
In the RAINBOW framework, the output shape of your q network must be [batch_size, action_num, atom_num] when given a state of shape [batch_size, action_dim], and the last dimension must be soft-maxed. Atom number is the number of segments of your q value domain.
See also: DQN.
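To make the shape requirement concrete, a sketch of such a network (all layer sizes are assumptions):
import torch as t
import torch.nn as nn

class DistributionalQNet(nn.Module):
    def __init__(self, state_dim, action_num, atom_num=10):
        super().__init__()
        self.action_num = action_num
        self.atom_num = atom_num
        self.fc = nn.Linear(state_dim, 100)
        self.head = nn.Linear(100, action_num * atom_num)

    def forward(self, state):
        x = t.relu(self.fc(state))
        logits = self.head(x).view(-1, self.action_num, self.atom_num)
        # soft-max over the atom (last) dimension, as required above
        return t.softmax(logits, dim=-1)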
- Parameters
qnet (Union[machin.model.nets.base.NeuralNetworkModule, torch.nn.modules.module.Module]) – Q network module.
qnet_target (Union[machin.model.nets.base.NeuralNetworkModule, torch.nn.modules.module.Module]) – Target Q network module.
optimizer – Optimizer used to optimize qnet.
value_min – Minimum of value domain.
value_max – Maximum of value domain.
learning_rate (float) – Learning rate of the optimizer, not compatible with lr_scheduler.
lr_scheduler (Callable) – Learning rate scheduler of optimizer.
lr_scheduler_args (Tuple[Tuple]) – Arguments of the learning rate scheduler.
lr_scheduler_kwargs (Tuple[Dict]) – Keyword arguments of the learning rate scheduler.
batch_size (int) – Batch size used during training.
epsilon_decay (float) – Epsilon decay rate per acting-with-noise step. The epsilon attribute is multiplied by this every time act_discrete_with_noise is called.
update_rate (float) – \(\tau\) used to update target networks. Target parameters are updated as: \(\theta_t = \theta * \tau + \theta_t * (1 - \tau)\)
update_steps (Optional[int]) – Training step number used to update target networks.
discount (float) – \(\gamma\) used in the Bellman function.
reward_future_steps (int) – Number of future steps to be considered when the framework calculates value from reward.
replay_size (int) – Replay buffer size. Not compatible with replay_buffer.
replay_device (Union[str, torch.device]) – Device on which the replay buffer is located. Not compatible with replay_buffer.
replay_buffer (machin.frame.buffers.buffer.Buffer) – Custom replay buffer.
mode – one of "vanilla", "fixed_target", "double".
visualize (bool) – Whether to visualize the network flow in the first pass.
gradient_max (float) – Maximum gradient.
visualize_dir (str) – Visualized graph save directory.
-
act_discrete
(state, use_target=False, **__)[source]¶ Use Q network to produce a discrete action for the current state.
- Parameters
state (Dict[str, Any]) – Current state.
use_target (bool) – Whether to use the target network.
- Returns
Action of shape
[batch_size, 1]
. Any other things returned by your Q network. if they exist.
-
act_discrete_with_noise
(state, use_target=False, decay_epsilon=True, **__)[source]¶ Randomly selects an action from the action space according to a uniform distribution, with regard to the epsilon decay policy.
- Parameters
state (Dict[str, Any]) – Current state.
use_target (bool) – Whether to use the target network.
decay_epsilon (bool) – Whether to decay the epsilon attribute.
- Returns
Noisy action of shape [batch_size, 1]. Any other things returned by your Q network, if they exist.
-
static
action_get_function
(sampled_actions)¶ This function is used to get action numbers (int tensor indicating which discrete actions are used) from the sampled action dictionary.
-
enable_multiprocessing
()¶ Enable multiprocessing for all modules.
-
classmethod
generate_config
(config)[source]¶ - Parameters
config (Union[Dict[str, Any], machin.utils.conf.Config]) –
-
classmethod
get_restorable_model_names
()¶ Get attribute name of restorable nn models.
-
classmethod
get_top_model_names
()¶ Get attribute name of top level nn models.
-
classmethod
init_from_config
(config, model_device='cpu')¶ - Parameters
config (Union[Dict[str, Any], machin.utils.conf.Config]) –
model_device (Union[str, torch.device]) –
-
classmethod
is_distributed
()¶ Whether this framework is a distributed framework which requires multiple processes to run, and depends on torch.distributed or torch.distributed.rpc.
-
load
(model_dir, network_map=None, version=-1)¶ Load models.
An example of network map:
{"restorable_model_1": "file_name_1", "restorable_model_2": "file_name_2"}
Get keys by calling
<Class name>.get_restorable()
- Parameters
model_dir – Save directory.
network_map – Key is module name, value is saved name.
version – Version number of the save to be loaded.
-
static
reward_function
(reward, discount, next_value, terminal, _)¶
-
save
(model_dir, network_map=None, version=0)¶ Save models.
An example of network map:
{"restorable_model_1": "file_name_1", "restorable_model_2": "file_name_2"}
Get keys by calling
<Class name>.get_restorable()
- Parameters
model_dir (str) – Save directory.
network_map (Dict[str, str]) – Key is module name, value is saved name.
version (int) – Version number of the new save.
-
set_backward_function
(backward_func)¶ Replace the default backward function with a custom function. The default loss backward function is
torch.autograd.backward
- Parameters
backward_func (Callable) –
-
store_episode
(episode)[source]¶ Add a full episode of transition samples to the replay buffer.
“value” is automatically calculated.
- Parameters
episode (List[Union[machin.frame.transition.Transition, Dict]]) –
-
store_transition
(transition)[source]¶ Add a transition sample to the replay buffer.
Not suggested, since you will have to calculate “value” by yourself.
- Parameters
transition (Union[machin.frame.transition.Transition, Dict]) –
-
update
(update_value=True, update_target=True, concatenate_samples=True, **__)[source]¶ Update network weights by sampling from replay buffer.
- Parameters
update_value – Whether to update the Q network.
update_target – Whether to update targets.
concatenate_samples – Whether to concatenate the samples.
- Returns
Mean estimated policy value, value loss.
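Putting the pieces together, a hedged training-loop sketch continuing the example above; `env`, `state_dim` and the Gym-style step interface are assumptions, not part of the library:
# `env` is any Gym-style environment; `state_dim` matches QNet above
state = t.tensor(env.reset(), dtype=t.float32).view(1, state_dim)
terminal = False
while not terminal:
    with t.no_grad():
        old_state = state
        action = dqn.act_discrete_with_noise({"state": old_state})
        obs, reward, terminal, _ = env.step(action.item())
        state = t.tensor(obs, dtype=t.float32).view(1, state_dim)
        dqn.store_transition({
            "state": {"state": old_state},
            "action": {"action": action},
            "next_state": {"state": state},
            "reward": reward,
            "terminal": terminal,
        })
dqn.update()  # sample from the replay buffer and update weights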
-
update_lr_scheduler
()¶ Update learning rate schedulers.
-
visualize_model
(final_tensor, name, directory)¶ - Parameters
final_tensor (torch.Tensor) –
name (str) –
directory (str) –
-
property
backward_function
¶
-
property
lr_schedulers
¶
-
property
optimizers
¶
-
property
restorable_models
¶
-
property
top_models
¶
A2C¶
-
class
machin.frame.algorithms.a2c.
A2C
(actor, critic, optimizer, criterion, *_, lr_scheduler=None, lr_scheduler_args=None, lr_scheduler_kwargs=None, batch_size=100, actor_update_times=5, critic_update_times=10, actor_learning_rate=0.001, critic_learning_rate=0.001, entropy_weight=None, value_weight=0.5, gradient_max=inf, gae_lambda=1.0, discount=0.99, normalize_advantage=True, replay_size=500000, replay_device='cpu', replay_buffer=None, visualize=False, visualize_dir='', **__)[source]¶ Bases:
machin.frame.algorithms.base.TorchFramework
A2C framework.
Important
When given a state, and an optional action, the actor must at least return two values:
1. Action
For continuous environments, action must be of shape [batch_size, action_dim] and clamped to the action space. For discrete environments, action could be of shape [batch_size, action_dim] if it is a one-hot vector, or [batch_size, 1] if it is a categorically encoded integer.
2. Log likelihood of action (action probability)
For either type of environment, log likelihood is of shape [batch_size, 1].
Action probability must be differentiable; the gradient of the actor is calculated from the gradient of the action probability.
The third entropy value is optional:
3. Entropy of action distribution
Entropy is usually calculated using dist.entropy(); its shape is [batch_size, 1]. You must specify entropy_weight to make it effective.
Hint
For continuous environments, actions are not directly output by your actor, since it would otherwise be rather inconvenient to calculate the log probability of the action. Instead, your actor network should output parameters for a certain distribution (e.g. Normal) and then draw the action from it.
For discrete environments, Categorical is sufficient, since a differentiable rsample() is not needed.
This trick is also known as reparameterization.
Hint
Actions are sampled during training in the actor-critic family (A2C, A3C, PPO, TRPO, IMPALA).
When your actor model is given a batch of actions and states, it must evaluate the states, and return the log likelihood of the given actions instead of re-sampling actions.
An example of your actor in continuous environments:
class ActorNet(nn.Module):
    def __init__(self):
        super(ActorNet, self).__init__()
        self.fc = nn.Linear(3, 100)
        self.mu_head = nn.Linear(100, 1)
        self.sigma_head = nn.Linear(100, 1)

    def forward(self, state, action=None):
        x = t.relu(self.fc(state))
        mu = 2.0 * t.tanh(self.mu_head(x))
        sigma = F.softplus(self.sigma_head(x))
        dist = Normal(mu, sigma)
        action = (action if action is not None else dist.sample())
        action_entropy = dist.entropy()
        action = action.clamp(-2.0, 2.0)
        action_log_prob = dist.log_prob(action)
        return action, action_log_prob, action_entropy
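For discrete environments, a corresponding hedged sketch using Categorical; the class name, layer sizes and dimensions are illustrative, but the returned shapes follow the contract above:
from torch.distributions import Categorical

class DiscreteActorNet(nn.Module):
    # a minimal sketch for discrete environments; sizes are illustrative
    def __init__(self, state_dim=4, action_num=2):
        super().__init__()
        self.fc1 = nn.Linear(state_dim, 100)
        self.fc2 = nn.Linear(100, action_num)

    def forward(self, state, action=None):
        probs = t.softmax(self.fc2(t.relu(self.fc1(state))), dim=1)
        dist = Categorical(probs=probs)
        # evaluate the given actions if provided, otherwise sample new ones
        act = action.flatten() if action is not None else dist.sample()
        action_log_prob = dist.log_prob(act).view(-1, 1)
        action_entropy = dist.entropy().view(-1, 1)
        return act.view(-1, 1), action_log_prob, action_entropy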
Hint
Entropy weight is usually negative, to increase exploration.
Value weight is usually 0.5, balancing how fast the critic network converges relative to the actor network.
Update equation is equivalent to:
\(Loss = w_e \cdot Entropy + w_v \cdot Loss_v + w_a \cdot Loss_a\)
\(Loss_a = -log\_likelihood \cdot advantage\)
\(Loss_v = criterion(target\_bellman\_value, V(s))\)
- Parameters
actor (Union[machin.model.nets.base.NeuralNetworkModule, torch.nn.modules.module.Module]) – Actor network module.
critic (Union[machin.model.nets.base.NeuralNetworkModule, torch.nn.modules.module.Module]) – Critic network module.
optimizer (Callable) – Optimizer used to optimize actor and critic.
criterion (Callable) – Criterion used to evaluate the value loss.
lr_scheduler (Callable) – Learning rate scheduler of optimizer.
lr_scheduler_args (Tuple[Tuple, Tuple]) – Arguments of the learning rate scheduler.
lr_scheduler_kwargs (Tuple[Dict, Dict]) – Keyword arguments of the learning rate scheduler.
batch_size (int) – Batch size used during training.
actor_update_times (int) – Times to update the actor in update().
critic_update_times (int) – Times to update the critic in update().
actor_learning_rate (float) – Learning rate of the actor optimizer, not compatible with lr_scheduler.
critic_learning_rate (float) – Learning rate of the critic optimizer, not compatible with lr_scheduler.
entropy_weight (float) – Weight of entropy in your loss function; a positive entropy weight will minimize entropy, while a negative one will maximize entropy.
value_weight (float) – Weight of critic value loss.
gradient_max (float) – Maximum gradient.
gae_lambda (float) – \(\lambda\) used in generalized advantage estimation.
discount (float) – \(\gamma\) used in the bellman function.
normalize_advantage (bool) – Whether to normalize the advantage function.
replay_size (int) – Replay buffer size. Not compatible with replay_buffer.
replay_device (Union[str, torch.device]) – Device where the replay buffer is located. Not compatible with replay_buffer.
replay_buffer (machin.frame.buffers.buffer.Buffer) – Custom replay buffer.
visualize (bool) – Whether to visualize the network flow in the first pass.
visualize_dir (str) – Visualized graph save directory.
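A minimal construction sketch; the critic definition and hyperparameters are illustrative assumptions, and ActorNet is the continuous actor sketched above:
import torch as t
import torch.nn as nn
from machin.frame.algorithms.a2c import A2C

# a value network sketch; dimensions are illustrative assumptions
class CriticNet(nn.Module):
    def __init__(self, state_dim=3):
        super().__init__()
        self.fc1 = nn.Linear(state_dim, 100)
        self.fc2 = nn.Linear(100, 1)

    def forward(self, state):
        return self.fc2(t.relu(self.fc1(state)))

a2c = A2C(
    ActorNet(),                   # the continuous actor sketched above
    CriticNet(),
    t.optim.Adam,                 # called as optimizer(params, lr)
    nn.MSELoss(reduction="sum"),
    entropy_weight=-0.01,         # negative weight increases exploration
)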
-
act
(state, *_, **__)[source]¶ Use actor network to produce a policy for the current state.
- Returns
Anything produced by actor.
- Parameters
state (Dict[str, Any]) –
-
enable_multiprocessing
()¶ Enable multiprocessing for all modules.
-
classmethod
generate_config
(config)[source]¶ - Parameters
config (Union[Dict[str, Any], machin.utils.conf.Config]) –
-
classmethod
get_restorable_model_names
()¶ Get attribute name of restorable nn models.
-
classmethod
get_top_model_names
()¶ Get attribute name of top level nn models.
-
classmethod
init_from_config
(config, model_device='cpu')[source]¶ - Parameters
config (Union[Dict[str, Any], machin.utils.conf.Config]) –
model_device (Union[str, torch.device]) –
-
classmethod
is_distributed
()¶ Whether this framework is a distributed framework that requires multiple processes to run, and depends on torch.distributed or torch.distributed.rpc.
-
load
(model_dir, network_map=None, version=-1)¶ Load models.
An example of network map:
{"restorable_model_1": "file_name_1", "restorable_model_2": "file_name_2"}
Get keys by calling
<Class name>.get_restorable()
- Parameters
model_dir (str) – Save directory.
network_map (Dict[str, str]) – Key is module name, value is saved name.
version (int) – Version number of the save to be loaded.
-
save
(model_dir, network_map=None, version=0)¶ Save models.
An example of network map:
{"restorable_model_1": "file_name_1", "restorable_model_2": "file_name_2"}
Get keys by calling
<Class name>.get_restorable()
- Parameters
model_dir (str) – Save directory.
network_map (Dict[str, str]) – Key is module name, value is saved name.
version (int) – Version number of the new save.
-
set_backward_function
(backward_func)¶ Replace the default backward function with a custom function. The default loss backward function is
torch.autograd.backward
- Parameters
backward_func (Callable) –
-
store_episode
(episode)[source]¶ Add a full episode of transition samples to the replay buffer.
“value” and “gae” are automatically calculated.
- Parameters
episode (List[Union[machin.frame.transition.Transition, Dict]]) –
-
store_transition
(transition)[source]¶ Add a transition sample to the replay buffer.
Not suggested, since you will have to calculate “value” and “gae” by yourself.
- Parameters
transition (Union[machin.frame.transition.Transition, Dict]) –
-
update
(update_value=True, update_policy=True, concatenate_samples=True, **__)[source]¶ Update network weights by sampling from buffer. Buffer will be cleared after update is finished.
- Parameters
update_value – Whether to update the critic network.
update_policy – Whether to update the actor network.
concatenate_samples – Whether to concatenate the samples.
- Returns
Mean estimated policy value, value loss.
-
visualize_model
(final_tensor, name, directory)¶ - Parameters
final_tensor (torch.Tensor) –
name (str) –
directory (str) –
-
property
backward_function
¶
-
property
lr_schedulers
¶
-
property
optimizers
¶
-
property
restorable_models
¶
-
property
top_models
¶
A3C¶
-
class
machin.frame.algorithms.a3c.
A3C
(actor, critic, criterion, grad_server, *_, batch_size=100, actor_update_times=5, critic_update_times=10, entropy_weight=None, value_weight=0.5, gradient_max=inf, gae_lambda=1.0, discount=0.99, normalize_advantage=True, replay_size=500000, replay_device='cpu', replay_buffer=None, visualize=False, visualize_dir='', **__)[source]¶ Bases:
machin.frame.algorithms.a2c.A2C
A3C framework.
See also
Note
The A3C algorithm relies on parameter servers to synchronize the parameters of actor and critic models across samplers (which interact with the environment) and trainers (which use samples to train).
The parameter server type
PushPullGradServer
used here utilizes gradients calculated by trainers:
1. Perform a “sum” reduction on the collected gradients, then apply this reduced gradient to the model managed by its primary reducer.
2. Push the parameters of this updated managed model to an ordered key-value server so that all processes, including samplers and trainers, can access the updated parameters.
The grad_server argument is a pair of accessors to two PushPullGradServerImpl instances.
- Parameters
actor (Union[machin.model.nets.base.NeuralNetworkModule, torch.nn.modules.module.Module]) – Actor network module.
critic (Union[machin.model.nets.base.NeuralNetworkModule, torch.nn.modules.module.Module]) – Critic network module.
criterion (Callable) – Criterion used to evaluate the value loss.
grad_server (Tuple[machin.parallel.server.param_server.PushPullGradServer, machin.parallel.server.param_server.PushPullGradServer]) – Custom gradient sync server accessors, the first server accessor is for actor, and the second one is for critic.
batch_size (int) – Batch size used during training.
actor_update_times (int) – Times to update actor in
update()
.critic_update_times (int) – Times to update critic in
update()
.entropy_weight (float) – Weight of entropy in your loss function, a positive entropy weight will minimize entropy, while a negative one will maximize entropy.
value_weight (float) – Weight of critic value loss.
gradient_max (float) – Maximum gradient.
gae_lambda (float) – \(\lambda\) used in generalized advantage estimation.
discount (float) – \(\gamma\) used in the bellman function.
normalize_advantage (bool) – Whether to normalize the advantage function.
replay_size (int) – Replay buffer size. Not compatible with
replay_buffer
.replay_device (Union[str, torch.device]) – Device where the replay buffer locates on, Not compatible with
replay_buffer
.replay_buffer (machin.frame.buffers.buffer.Buffer) – Custom replay buffer.
visualize (bool) – Whether visualize the network flow in the first pass.
visualize_dir (str) – Visualized graph save directory.
-
act
(state, **__)[source]¶ Use actor network to produce a policy for the current state.
- Returns
Anything produced by actor.
- Parameters
state (Dict[str, Any]) –
-
enable_multiprocessing
()¶ Enable multiprocessing for all modules.
-
classmethod
generate_config
(config)[source]¶ - Parameters
config (Union[Dict[str, Any], machin.utils.conf.Config]) –
-
classmethod
get_restorable_model_names
()¶ Get attribute name of restorable nn models.
-
classmethod
get_top_model_names
()¶ Get attribute name of top level nn models.
-
classmethod
init_from_config
(config, model_device='cpu')[source]¶ - Parameters
config (Union[Dict[str, Any], machin.utils.conf.Config]) –
model_device (Union[str, torch.device]) –
-
classmethod
is_distributed
()[source]¶ Whether this framework is a distributed framework that requires multiple processes to run, and depends on torch.distributed or torch.distributed.rpc.
-
load
(model_dir, network_map=None, version=-1)¶ Load models.
An example of network map:
{"restorable_model_1": "file_name_1", "restorable_model_2": "file_name_2"}
Get keys by calling
<Class name>.get_restorable()
- Parameters
model_dir (str) – Save directory.
network_map (Dict[str, str]) – Key is module name, value is saved name.
version (int) – Version number of the save to be loaded.
-
save
(model_dir, network_map=None, version=0)¶ Save models.
An example of network map:
{"restorable_model_1": "file_name_1", "restorable_model_2": "file_name_2"}
Get keys by calling
<Class name>.get_restorable()
- Parameters
model_dir (str) – Save directory.
network_map (Dict[str, str]) – Key is module name, value is saved name.
version (int) – Version number of the new save.
-
set_backward_function
(backward_func)¶ Replace the default backward function with a custom function. The default loss backward function is
torch.autograd.backward
- Parameters
backward_func (Callable) –
-
store_episode
(episode)¶ Add a full episode of transition samples to the replay buffer.
“value” and “gae” are automatically calculated.
- Parameters
episode (List[Union[machin.frame.transition.Transition, Dict]]) –
-
store_transition
(transition)¶ Add a transition sample to the replay buffer.
Not suggested, since you will have to calculate “value” and “gae” by yourself.
- Parameters
transition (Union[machin.frame.transition.Transition, Dict]) –
-
update
(update_value=True, update_policy=True, concatenate_samples=True, **__)[source]¶ Update network weights by sampling from buffer. Buffer will be cleared after update is finished.
- Parameters
update_value – Whether to update the critic network.
update_policy – Whether to update the actor network.
concatenate_samples – Whether to concatenate the samples.
- Returns
Mean estimated policy value, value loss.
-
update_lr_scheduler
()¶ Update learning rate schedulers.
-
visualize_model
(final_tensor, name, directory)¶ - Parameters
final_tensor (torch.Tensor) –
name (str) –
directory (str) –
-
property
backward_function
¶
-
property
lr_schedulers
¶
-
property
optimizers
¶
-
property
restorable_models
¶
-
property
top_models
¶
PPO¶
-
class
machin.frame.algorithms.ppo.
PPO
(actor, critic, optimizer, criterion, *_, lr_scheduler=None, lr_scheduler_args=(), lr_scheduler_kwargs=(), batch_size=100, actor_update_times=5, critic_update_times=10, actor_learning_rate=0.001, critic_learning_rate=0.001, entropy_weight=None, value_weight=0.5, surrogate_loss_clip=0.2, gradient_max=inf, gae_lambda=1.0, discount=0.99, normalize_advantage=True, replay_size=500000, replay_device='cpu', replay_buffer=None, visualize=False, visualize_dir='', **__)[source]¶ Bases:
machin.frame.algorithms.a2c.A2C
PPO framework.
See also
- Parameters
actor (Union[machin.model.nets.base.NeuralNetworkModule, torch.nn.modules.module.Module]) – Actor network module.
critic (Union[machin.model.nets.base.NeuralNetworkModule, torch.nn.modules.module.Module]) – Critic network module.
optimizer (Callable) – Optimizer used to optimize actor and critic.
criterion (Callable) – Criterion used to evaluate the value loss.
lr_scheduler (Callable) – Learning rate scheduler of optimizer.
lr_scheduler_args (Tuple[Tuple, Tuple]) – Arguments of the learning rate scheduler.
lr_scheduler_kwargs (Tuple[Dict, Dict]) – Keyword arguments of the learning rate scheduler.
batch_size (int) – Batch size used during training.
actor_update_times (int) – Times to update the actor in update().
critic_update_times (int) – Times to update the critic in update().
actor_learning_rate (float) – Learning rate of the actor optimizer, not compatible with lr_scheduler.
critic_learning_rate (float) – Learning rate of the critic optimizer, not compatible with lr_scheduler.
entropy_weight (float) – Weight of entropy in your loss function; a positive entropy weight will minimize entropy, while a negative one will maximize entropy.
value_weight (float) – Weight of critic value loss.
surrogate_loss_clip (float) – Surrogate loss clipping parameter in PPO.
gradient_max (float) – Maximum gradient.
gae_lambda (float) – \(\lambda\) used in generalized advantage estimation.
discount (float) – \(\gamma\) used in the bellman function.
replay_size (int) – Replay buffer size. Not compatible with replay_buffer.
replay_device (Union[str, torch.device]) – Device where the replay buffer is located. Not compatible with replay_buffer.
replay_buffer (machin.frame.buffers.buffer.Buffer) – Custom replay buffer.
visualize (bool) – Whether to visualize the network flow in the first pass.
visualize_dir (str) – Visualized graph save directory.
normalize_advantage (bool) –
-
act
(state, *_, **__)¶ Use actor network to produce a policy for the current state.
- Returns
Anything produced by actor.
- Parameters
state (Dict[str, Any]) –
-
enable_multiprocessing
()¶ Enable multiprocessing for all modules.
-
classmethod
generate_config
(config)[source]¶ - Parameters
config (Union[Dict[str, Any], machin.utils.conf.Config]) –
-
classmethod
get_restorable_model_names
()¶ Get attribute name of restorable nn models.
-
classmethod
get_top_model_names
()¶ Get attribute name of top level nn models.
-
classmethod
init_from_config
(config, model_device='cpu')¶ - Parameters
config (Union[Dict[str, Any], machin.utils.conf.Config]) –
model_device (Union[str, torch.device]) –
-
classmethod
is_distributed
()¶ Whether this framework is a distributed framework that requires multiple processes to run, and depends on torch.distributed or torch.distributed.rpc.
-
load
(model_dir, network_map=None, version=-1)¶ Load models.
An example of network map:
{"restorable_model_1": "file_name_1", "restorable_model_2": "file_name_2"}
Get keys by calling
<Class name>.get_restorable()
- Parameters
model_dir (str) – Save directory.
network_map (Dict[str, str]) – Key is module name, value is saved name.
version (int) – Version number of the save to be loaded.
-
save
(model_dir, network_map=None, version=0)¶ Save models.
An example of network map:
{"restorable_model_1": "file_name_1", "restorable_model_2": "file_name_2"}
Get keys by calling
<Class name>.get_restorable()
- Parameters
model_dir (str) – Save directory.
network_map (Dict[str, str]) – Key is module name, value is saved name.
version (int) – Version number of the new save.
-
set_backward_function
(backward_func)¶ Replace the default backward function with a custom function. The default loss backward function is
torch.autograd.backward
- Parameters
backward_func (Callable) –
-
store_episode
(episode)¶ Add a full episode of transition samples to the replay buffer.
“value” and “gae” are automatically calculated.
- Parameters
episode (List[Union[machin.frame.transition.Transition, Dict]]) –
-
store_transition
(transition)¶ Add a transition sample to the replay buffer.
Not suggested, since you will have to calculate “value” and “gae” by yourself.
- Parameters
transition (Union[machin.frame.transition.Transition, Dict]) –
-
update
(update_value=True, update_policy=True, concatenate_samples=True, **__)[source]¶ Update network weights by sampling from buffer. Buffer will be cleared after update is finished.
- Parameters
update_value – Whether to update the critic network.
update_policy – Whether to update the actor network.
concatenate_samples – Whether to concatenate the samples.
- Returns
Mean estimated policy value, value loss.
-
update_lr_scheduler
()¶ Update learning rate schedulers.
-
visualize_model
(final_tensor, name, directory)¶ - Parameters
final_tensor (torch.Tensor) –
name (str) –
directory (str) –
-
property
backward_function
¶
-
property
lr_schedulers
¶
-
property
optimizers
¶
-
property
restorable_models
¶
-
property
top_models
¶
SAC¶
-
class
machin.frame.algorithms.sac.
SAC
(actor, critic, critic_target, critic2, critic2_target, optimizer, criterion, *_, lr_scheduler=None, lr_scheduler_args=None, lr_scheduler_kwargs=None, target_entropy=None, initial_entropy_alpha=1.0, batch_size=100, update_rate=0.005, update_steps=None, actor_learning_rate=0.0005, critic_learning_rate=0.001, alpha_learning_rate=0.001, discount=0.99, gradient_max=inf, replay_size=500000, replay_device='cpu', replay_buffer=None, visualize=False, visualize_dir='', **__)[source]¶ Bases:
machin.frame.algorithms.base.TorchFramework
SAC framework.
Important
When given a state, and an optional action, the actor must at least return two values, similar to the actor structure described in A2C. However, when the actor is asked to select an action based on the current state, you must make sure that the sampling process is differentiable, e.g. use the rsample method of torch distributions instead of the sample method.
Compared to other actor-critic methods, SAC embeds the entropy term into its reward function directly, rather than adding the entropy term to actor’s loss function. Therefore, we do not use the entropy output of your actor network.
The SAC algorithm uses Q networks as critics, so please reference DDPG for the requirements and the definition of action_trans_func.
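A hedged sketch of a SAC actor using a tanh-squashed Normal with rsample; the class name, sizes, and the squashing correction are illustrative assumptions, not prescribed by the library:
import torch as t
import torch.nn as nn
import torch.nn.functional as F
from torch.distributions import Normal

class SACActor(nn.Module):
    # illustrative sizes; the (action, log_prob) contract follows A2C
    def __init__(self, state_dim=3, action_dim=1):
        super().__init__()
        self.fc = nn.Linear(state_dim, 100)
        self.mu_head = nn.Linear(100, action_dim)
        self.sigma_head = nn.Linear(100, action_dim)

    def forward(self, state, action=None):
        x = t.relu(self.fc(state))
        dist = Normal(self.mu_head(x), F.softplus(self.sigma_head(x)))
        if action is not None:
            # invert the tanh squashing for stored actions (clamped for safety)
            raw = t.atanh(action.clamp(-0.999999, 0.999999))
        else:
            # rsample keeps the sampling step differentiable (reparameterization)
            raw = dist.rsample()
        action = t.tanh(raw)
        # log probability with the tanh change-of-variables correction
        log_prob = (dist.log_prob(raw)
                    - t.log(1 - action.pow(2) + 1e-6)).sum(dim=1, keepdim=True)
        return action, log_prob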
- Parameters
actor (Union[machin.model.nets.base.NeuralNetworkModule, torch.nn.modules.module.Module]) – Actor network module.
critic (Union[machin.model.nets.base.NeuralNetworkModule, torch.nn.modules.module.Module]) – Critic network module.
critic_target (Union[machin.model.nets.base.NeuralNetworkModule, torch.nn.modules.module.Module]) – Target critic network module.
critic2 (Union[machin.model.nets.base.NeuralNetworkModule, torch.nn.modules.module.Module]) – The second critic network module.
critic2_target (Union[machin.model.nets.base.NeuralNetworkModule, torch.nn.modules.module.Module]) – The second target critic network module.
optimizer (Callable) – Optimizer used to optimize actor, critic and critic2.
criterion (Callable) – Criterion used to evaluate the value loss.
*_ –
lr_scheduler (Callable) – Learning rate scheduler of optimizer.
lr_scheduler_args (Tuple[Tuple, Tuple, Tuple]) – Arguments of the learning rate scheduler.
lr_scheduler_kwargs (Tuple[Dict, Dict, Dict]) – Keyword arguments of the learning rate scheduler.
target_entropy (float) – Target entropy weight \(\alpha\) used in the SAC soft value function: \(V_{soft}(s_t) = \mathbb{E}_{a_t\sim\pi}[Q_{soft}(s_t,a_t) - \alpha \log\pi(a_t|s_t)]\)
initial_entropy_alpha (float) – Initial entropy weight \(\alpha\)
gradient_max (float) – Maximum gradient.
batch_size (int) – Batch size used during training.
update_rate (float) –
\(\tau\) used to update target networks. Target parameters are updated as:
\(\theta_t = \theta * \tau + \theta_t * (1 - \tau)\)
update_steps (Optional[int]) – Training step number used to update target networks.
actor_learning_rate (float) – Learning rate of the actor optimizer, not compatible with lr_scheduler.
critic_learning_rate (float) – Learning rate of the critic optimizer, not compatible with lr_scheduler.
discount (float) – \(\gamma\) used in the bellman function.
replay_size (int) – Replay buffer size. Not compatible with replay_buffer.
replay_device (Union[str, torch.device]) – Device where the replay buffer is located. Not compatible with replay_buffer.
replay_buffer (machin.frame.buffers.buffer.Buffer) – Custom replay buffer.
visualize (bool) – Whether to visualize the network flow in the first pass.
visualize_dir (str) – Visualized graph save directory.
alpha_learning_rate (float) –
-
act
(state, **__)[source]¶ Use actor network to produce an action for the current state.
- Returns
Anything produced by actor.
- Parameters
state (Dict[str, Any]) –
-
enable_multiprocessing
()¶ Enable multiprocessing for all modules.
-
classmethod
generate_config
(config)[source]¶ - Parameters
config (Union[Dict[str, Any], machin.utils.conf.Config]) –
-
classmethod
get_restorable_model_names
()¶ Get attribute name of restorable nn models.
-
classmethod
get_top_model_names
()¶ Get attribute name of top level nn models.
-
classmethod
init_from_config
(config, model_device='cpu')[source]¶ - Parameters
config (Union[Dict[str, Any], machin.utils.conf.Config]) –
model_device (Union[str, torch.device]) –
-
classmethod
is_distributed
()¶ Whether this framework is a distributed framework that requires multiple processes to run, and depends on torch.distributed or torch.distributed.rpc.
-
load
(model_dir, network_map=None, version=-1)[source]¶ Load models.
An example of network map:
{"restorable_model_1": "file_name_1", "restorable_model_2": "file_name_2"}
Get keys by calling
<Class name>.get_restorable()
- Parameters
model_dir – Save directory.
network_map – Key is module name, value is saved name.
version – Version number of the save to be loaded.
-
save
(model_dir, network_map=None, version=0)¶ Save models.
An example of network map:
{"restorable_model_1": "file_name_1", "restorable_model_2": "file_name_2"}
Get keys by calling
<Class name>.get_restorable()
- Parameters
model_dir (str) – Save directory.
network_map (Dict[str, str]) – Key is module name, value is saved name.
version (int) – Version number of the new save.
-
set_backward_function
(backward_func)¶ Replace the default backward function with a custom function. The default loss backward function is
torch.autograd.backward
- Parameters
backward_func (Callable) –
-
store_episode
(episode)[source]¶ Add a full episode of transition samples to the replay buffer.
- Parameters
episode (List[Union[machin.frame.transition.Transition, Dict]]) –
-
store_transition
(transition)[source]¶ Add a transition sample to the replay buffer.
- Parameters
transition (Union[machin.frame.transition.Transition, Dict]) –
-
update
(update_value=True, update_policy=True, update_target=True, update_entropy_alpha=True, concatenate_samples=True, **__)[source]¶ Update network weights by sampling from replay buffer.
- Parameters
update_value – Whether to update the Q network.
update_policy – Whether to update the actor network.
update_target – Whether to update targets.
update_entropy_alpha – Whether to update \(\alpha\) of entropy.
concatenate_samples – Whether to concatenate the samples.
- Returns
Mean estimated policy value, value loss.
-
visualize_model
(final_tensor, name, directory)¶ - Parameters
final_tensor (torch.Tensor) –
name (str) –
directory (str) –
-
property
backward_function
¶
-
property
lr_schedulers
¶
-
property
optimizers
¶
-
property
restorable_models
¶
-
property
top_models
¶
APEX¶
-
class
machin.frame.algorithms.apex.
DDPGApex
(actor, actor_target, critic, critic_target, optimizer, criterion, apex_group, model_server, *_, lr_scheduler=None, lr_scheduler_args=(), lr_scheduler_kwargs=(), batch_size=100, update_rate=0.005, update_steps=None, actor_learning_rate=0.0005, critic_learning_rate=0.001, discount=0.99, gradient_max=inf, replay_size=500000, **__)[source]¶ Bases:
machin.frame.algorithms.ddpg_per.DDPGPer
Massively parallel version of a DDPG with prioritized replay.
The pull function is invoked before using act, act_with_noise, act_discrete, act_discrete_with_noise and criticize.
The push function is invoked after update.
See also
Note
The Apex framework supports multiple workers (samplers) and only one trainer; you may use DistributedDataParallel in the trainer. If you use DistributedDataParallel, you must call update() in all member processes of DistributedDataParallel.
- Parameters
actor (Union[machin.model.nets.base.NeuralNetworkModule, torch.nn.modules.module.Module]) – Actor network module.
actor_target (Union[machin.model.nets.base.NeuralNetworkModule, torch.nn.modules.module.Module]) – Target actor network module.
critic (Union[machin.model.nets.base.NeuralNetworkModule, torch.nn.modules.module.Module]) – Critic network module.
critic_target (Union[machin.model.nets.base.NeuralNetworkModule, torch.nn.modules.module.Module]) – Target critic network module.
optimizer (Callable) – Optimizer used to optimize actor and critic.
criterion (Callable) – Criterion used to evaluate the value loss.
apex_group (machin.parallel.distributed.world.RpcGroup) – Group of all processes using the apex-DDPG framework, including all samplers and trainers.
model_server (Tuple[machin.parallel.server.param_server.PushPullModelServer]) – Custom model sync server accessor for actor.
lr_scheduler (Callable) – Learning rate scheduler of optimizer.
lr_scheduler_args (Tuple[Tuple, Tuple]) – Arguments of the learning rate scheduler.
lr_scheduler_kwargs (Tuple[Dict, Dict]) – Keyword arguments of the learning rate scheduler.
batch_size (int) – Batch size used during training.
update_rate (float) –
\(\tau\) used to update target networks. Target parameters are updated as:
\(\theta_t = \theta * \tau + \theta_t * (1 - \tau)\)
update_steps (Optional[int]) – Training step number used to update target networks.
actor_learning_rate (float) – Learning rate of the actor optimizer, not compatible with lr_scheduler.
critic_learning_rate (float) – Learning rate of the critic optimizer, not compatible with lr_scheduler.
discount (float) – \(\gamma\) used in the bellman function.
gradient_max (float) – Maximum gradient.
replay_size (int) – Local replay buffer size of a single worker.
-
act
(state, use_target=False, **__)[source]¶ Use actor network to produce an action for the current state.
- Parameters
state (Dict[str, Any]) – Current state.
use_target (bool) – Whether to use the target network.
- Returns
Anything returned by your actor network.
-
act_discrete
(state, use_target=False, **__)[source]¶ Use actor network to produce a discrete action for the current state.
Notes
The actor network must output a probability tensor of shape (batch_size, action_dims), with each row summing to 1 along dimension 1.
- Parameters
state (Dict[str, Any]) – Current state.
use_target (bool) – Whether to use the target network.
- Returns
Action of shape [batch_size, 1]. Action probability tensor of shape [batch_size, action_num], produced by your actor. Any other things returned by your actor network, if they exist.
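A hedged sketch of an actor satisfying this probability-output contract; the class name, layer sizes and dimensions are illustrative assumptions:
import torch as t
import torch.nn as nn

class DiscreteDDPGActor(nn.Module):
    # outputs one probability row per sample, summing to 1 (illustrative sizes)
    def __init__(self, state_dim=4, action_num=2):
        super().__init__()
        self.fc1 = nn.Linear(state_dim, 16)
        self.fc2 = nn.Linear(16, action_num)

    def forward(self, state):
        return t.softmax(self.fc2(t.relu(self.fc1(state))), dim=1)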
-
act_discrete_with_noise
(state, use_target=False, **__)[source]¶ Use actor network to produce a noisy discrete action for the current state.
Notes
The actor network must output a probability tensor of shape (batch_size, action_dims), with each row summing to 1 along dimension 1.
- Parameters
state (Dict[str, Any]) – Current state.
use_target (bool) – Whether to use the target network.
choose_max_prob – Probability of choosing the largest component when the actor outputs an extreme probability vector like [0, 1, 0, 0].
- Returns
Noisy action of shape [batch_size, 1]. Action probability tensor of shape [batch_size, action_num]. Any other things returned by your actor network, if they exist.
-
act_with_noise
(state, noise_param=(0.0, 1.0), ratio=1.0, mode='uniform', use_target=False, **__)[source]¶ Use actor network to produce a noisy action for the current state.
- Parameters
state (Dict[str, Any]) – Current state.
noise_param (Tuple) – Noise params.
ratio (float) – Noise ratio.
mode (str) – Noise mode. Supported are:
"uniform", "normal", "clipped_normal", "ou"
use_target (bool) – Whether to use the target network.
- Returns
Noisy action of shape [batch_size, action_dim]. Any other things returned by your actor network, if they exist.
-
static
action_transform_function
(raw_output_action, *_)¶ The action transform function is used to transform the output of actor to the input of critic. Action transform function must accept:
Raw action from the actor model.
Concatenated Transition.next_state.
Any other concatenated lists of custom keys from Transition.
- and returns:
A dictionary with the same form as Transition.action
- Parameters
raw_output_action (Any) – Raw action from the actor model.
-
enable_multiprocessing
()¶ Enable multiprocessing for all modules.
-
classmethod
generate_config
(config)[source]¶ - Parameters
config (Union[Dict[str, Any], machin.utils.conf.Config]) –
-
classmethod
get_restorable_model_names
()¶ Get attribute name of restorable nn models.
-
classmethod
get_top_model_names
()¶ Get attribute name of top level nn models.
-
classmethod
init_from_config
(config, model_device='cpu')[source]¶ - Parameters
config (Union[Dict[str, Any], machin.utils.conf.Config]) –
model_device (Union[str, torch.device]) –
-
classmethod
is_distributed
()[source]¶ Whether this framework is a distributed framework that requires multiple processes to run, and depends on torch.distributed or torch.distributed.rpc.
-
load
(model_dir, network_map=None, version=-1)¶ Load models.
An example of network map:
{"restorable_model_1": "file_name_1", "restorable_model_2": "file_name_2"}
Get keys by calling
<Class name>.get_restorable()
- Parameters
model_dir (str) – Save directory.
network_map (Dict[str, str]) – Key is module name, value is saved name.
version (int) – Version number of the save to be loaded.
-
static
reward_function
(reward, discount, next_value, terminal, _)¶
-
save
(model_dir, network_map=None, version=0)¶ Save models.
An example of network map:
{"restorable_model_1": "file_name_1", "restorable_model_2": "file_name_2"}
Get keys by calling
<Class name>.get_restorable()
- Parameters
model_dir (str) – Save directory.
network_map (Dict[str, str]) – Key is module name, value is saved name.
version (int) – Version number of the new save.
-
set_backward_function
(backward_func)¶ Replace the default backward function with a custom function. The default loss backward function is
torch.autograd.backward
- Parameters
backward_func (Callable) –
-
store_episode
(episode)¶ Add a full episode of transition samples to the replay buffer.
- Parameters
Dict]] episode (List[Union[machin.frame.transition.Transition,) –
-
store_transition
(transition)¶ Add a transition sample to the replay buffer.
- Parameters
Dict] transition (Union[machin.frame.transition.Transition,) –
-
update
(update_value=True, update_policy=True, update_target=True, concatenate_samples=True, **__)[source]¶ Update network weights by sampling from replay buffer.
- Parameters
update_value – Whether to update the Q network.
update_policy – Whether to update the actor network.
update_target – Whether to update targets.
concatenate_samples – Whether to concatenate the samples.
- Returns
Mean estimated policy value, value loss.
-
update_lr_scheduler
()¶ Update learning rate schedulers.
-
visualize_model
(final_tensor, name, directory)¶ - Parameters
final_tensor (torch.Tensor) –
name (str) –
directory (str) –
-
property
backward_function
¶
-
property
lr_schedulers
¶
-
property
optimizers
¶
-
property
restorable_models
¶
-
property
top_models
¶
-
class
machin.frame.algorithms.apex.
DQNApex
(qnet, qnet_target, optimizer, criterion, apex_group, model_server, *_, lr_scheduler=None, lr_scheduler_args=(), lr_scheduler_kwargs=(), batch_size=100, epsilon_decay=0.9999, update_rate=0.005, update_steps=None, learning_rate=0.001, discount=0.99, gradient_max=inf, replay_size=500000, **__)[source]¶ Bases:
machin.frame.algorithms.dqn_per.DQNPer
Massively parallel version of a Double DQN with prioritized replay.
The pull function is invoked before using act_discrete, act_discrete_with_noise and criticize.
The push function is invoked after update.
See also
Note
The Apex framework supports multiple workers (samplers) and only one trainer; you may use DistributedDataParallel in the trainer. If you use DistributedDataParallel, you must call update() in all member processes of DistributedDataParallel.
- Parameters
qnet (Union[machin.model.nets.base.NeuralNetworkModule, torch.nn.modules.module.Module]) – Q network module.
qnet_target (Union[machin.model.nets.base.NeuralNetworkModule, torch.nn.modules.module.Module]) – Target Q network module.
optimizer (Callable) – Optimizer used to optimize qnet.
criterion (Callable) – Criterion used to evaluate the value loss.
apex_group (machin.parallel.distributed.world.RpcGroup) – Group of all processes using the apex-DQN framework, including all samplers and trainers.
model_server (Tuple[machin.parallel.server.param_server.PushPullModelServer]) – Custom model sync server accessor for qnet.
lr_scheduler (Callable) – Learning rate scheduler of optimizer.
lr_scheduler_args (Tuple[Tuple]) – Arguments of the learning rate scheduler.
lr_scheduler_kwargs (Tuple[Dict]) – Keyword arguments of the learning rate scheduler.
batch_size (int) – Batch size used during training.
epsilon_decay (float) – Epsilon decay rate per acting-with-noise step. The epsilon attribute is multiplied by this every time act_discrete_with_noise is called.
update_rate (float) –
\(\tau\) used to update target networks. Target parameters are updated as:
\(\theta_t = \theta * \tau + \theta_t * (1 - \tau)\)
update_steps (Optional[int]) – Training step number used to update target networks.
learning_rate (float) – Learning rate of the optimizer, not compatible with lr_scheduler.
discount (float) – \(\gamma\) used in the bellman function.
gradient_max (float) – Maximum gradient.
replay_size (int) – Local replay buffer size of a single worker.
-
act_discrete
(state, use_target=False, **__)[source]¶ Use Q network to produce a discrete action for the current state.
- Parameters
state (Dict[str, Any]) – Current state.
use_target (bool) – Whether to use the target network.
- Returns
Action of shape [batch_size, 1]. Any other things returned by your Q network, if they exist.
-
act_discrete_with_noise
(state, use_target=False, decay_epsilon=True, **__)[source]¶ Randomly selects an action from the action space according to a uniform distribution, following the epsilon decay policy.
- Parameters
state (Dict[str, Any]) – Current state.
use_target (bool) – Whether to use the target network.
decay_epsilon (bool) – Whether to decay the epsilon attribute.
- Returns
Noisy action of shape [batch_size, 1]. Any other things returned by your Q network, if they exist.
-
static
action_get_function
(sampled_actions)¶ This function is used to get action numbers (int tensor indicating which discrete actions are used) from the sampled action dictionary.
-
enable_multiprocessing
()¶ Enable multiprocessing for all modules.
-
classmethod
get_restorable_model_names
()¶ Get attribute name of restorable nn models.
-
classmethod
get_top_model_names
()¶ Get attribute name of top level nn models.
-
classmethod
init_from_config
(config, model_device='cpu')[source]¶ - Parameters
config (Union[Dict[str, Any], machin.utils.conf.Config]) –
model_device (Union[str, torch.device]) –
-
classmethod
is_distributed
()[source]¶ Whether this framework is a distributed framework that requires multiple processes to run, and depends on torch.distributed or torch.distributed.rpc.
-
load
(model_dir, network_map=None, version=-1)¶ Load models.
An example of network map:
{"restorable_model_1": "file_name_1", "restorable_model_2": "file_name_2"}
Get keys by calling
<Class name>.get_restorable()
- Parameters
model_dir – Save directory.
network_map – Key is module name, value is saved name.
version – Version number of the save to be loaded.
-
static
reward_function
(reward, discount, next_value, terminal, _)¶
-
save
(model_dir, network_map=None, version=0)¶ Save models.
An example of network map:
{"restorable_model_1": "file_name_1", "restorable_model_2": "file_name_2"}
Get keys by calling
<Class name>.get_restorable()
- Parameters
model_dir (str) – Save directory.
network_map (Dict[str, str]) – Key is module name, value is saved name.
version (int) – Version number of the new save.
-
set_backward_function
(backward_func)¶ Replace the default backward function with a custom function. The default loss backward function is
torch.autograd.backward
- Parameters
backward_func (Callable) –
-
store_episode
(episode)¶ Add a full episode of transition samples to the replay buffer.
- Parameters
episode (List[Union[machin.frame.transition.Transition, Dict]]) –
-
store_transition
(transition)¶ Add a transition sample to the replay buffer.
- Parameters
transition (Union[machin.frame.transition.Transition, Dict]) –
-
update
(update_value=True, update_target=True, concatenate_samples=True, **__)[source]¶ Update network weights by sampling from replay buffer.
- Parameters
update_value – Whether to update the Q network.
update_target – Whether to update targets.
concatenate_samples – Whether to concatenate the samples.
- Returns
Mean estimated policy value, value loss.
-
update_lr_scheduler
()¶ Update learning rate schedulers.
-
visualize_model
(final_tensor, name, directory)¶ - Parameters
final_tensor (torch.Tensor) –
name (str) –
directory (str) –
-
property
backward_function
¶
-
property
lr_schedulers
¶
-
property
optimizers
¶
-
property
restorable_models
¶
-
property
top_models
¶
IMPALA¶
-
class
machin.frame.algorithms.impala.
EpisodeDistributedBuffer
(buffer_name, group, buffer_size, *_, **__)[source]¶ Bases:
machin.frame.buffers.buffer_d.DistributedBuffer
A distributed buffer which stores each episode as a transition object inside the buffer.
Create a distributed replay buffer instance.
To avoid issues caused by tensor device difference, all transition objects are stored in device “cpu”.
The distributed replay buffer consists of many local buffers held per process; transmissions between processes only happen during sampling.
During sampling, the tensors in the “state”, “action” and “next_state” dictionaries, along with “reward”, will be concatenated in dimension 0. Any other custom keys specified in **kwargs will not be concatenated.
See also
Note
Since append() operates on the local buffer, in order to append to the distributed buffer correctly, please make sure that your actor is also the local buffer holder, i.e. a member of the group.
- Parameters
buffer_size (int) – Maximum local buffer size.
group (machin.parallel.distributed.world.RpcGroup) – Process group which holds this buffer.
buffer_name (str) – A unique name of your buffer.
-
append
(transition, required_attrs=('state', 'action', 'next_state', 'reward', 'terminal', 'action_log_prob'))[source]¶ Store a transition object to the buffer.
- Parameters
transition (Dict) – A transition object.
required_attrs – Required attributes. Could be an empty tuple if no attribute is required.
- Raises
ValueError – If the transition object doesn't have the required attributes in required_attrs, or has different attributes compared to other transition objects stored in the buffer.
-
class
machin.frame.algorithms.impala.
EpisodeTransition
(state, action, next_state, reward, terminal, **kwargs)[source]¶ Bases:
machin.frame.transition.Transition
A transition class which allows storing the whole episode as a single transition object; the batch dimension will be used to stack all transition steps.
- Parameters
state (Dict[str, torch.Tensor]) – Previous observed state.
action (Dict[str, torch.Tensor]) – Action of agent.
next_state (Dict[str, torch.Tensor]) – Next observed state.
reward (Union[float, torch.Tensor]) – Reward of agent.
terminal (bool) – Whether environment has reached terminal state.
**kwargs – Custom attributes. They are ordered in alphabetic order (provided by sort()) when you call keys().
Note
You should not store any tensor inside **kwargs as they will not be moved to the sample output device.
-
class
machin.frame.algorithms.impala.
IMPALA
(actor, critic, optimizer, criterion, impala_group, model_server, *_, lr_scheduler=None, lr_scheduler_args=(), lr_scheduler_kwargs=(), batch_size=5, learning_rate=0.001, isw_clip_c=1.0, isw_clip_rho=1.0, entropy_weight=None, value_weight=0.5, gradient_max=inf, discount=0.99, replay_size=500, **__)[source]¶ Bases:
machin.frame.algorithms.base.TorchFramework
Massively parallel IMPALA framework.
Note
Please make sure isw_clip_rho >= isw_clip_c
- Parameters
actor (Union[machin.model.nets.base.NeuralNetworkModule, torch.nn.modules.module.Module]) – Actor network module.
critic (Union[machin.model.nets.base.NeuralNetworkModule, torch.nn.modules.module.Module]) – Critic network module.
optimizer (Callable) – Optimizer used to optimize actor and critic.
criterion (Callable) – Criterion used to evaluate the value loss.
impala_group (machin.parallel.distributed.world.RpcGroup) – Group of all processes using the IMPALA framework, including all samplers and trainers.
model_server (Tuple[machin.parallel.server.param_server.PushPullModelServer]) – Custom model sync server accessor for actor.
lr_scheduler (Callable) – Learning rate scheduler of optimizer.
lr_scheduler_args (Tuple[Tuple, Tuple]) – Arguments of the learning rate scheduler.
lr_scheduler_kwargs (Tuple[Dict, Dict]) – Keyword arguments of the learning rate scheduler.
batch_size (int) – Batch size used during training.
learning_rate (float) – Learning rate of the optimizer, not compatible with lr_scheduler.
isw_clip_c (float) – \(c\) used in importance weight clipping.
isw_clip_rho (float) – \(\rho\) used in importance weight clipping.
entropy_weight (float) – Weight of entropy in your loss function, a positive entropy weight will minimize entropy, while a negative one will maximize entropy.
value_weight (float) – Weight of critic value loss.
gradient_max (float) – Maximum gradient.
discount (float) – \(\gamma\) used in the bellman function.
replay_size (int) – Size of the local replay buffer.
-
act
(state, *_, **__)[source]¶ Use actor network to produce a policy for the current state.
- Returns
Anything produced by actor.
- Parameters
state (Dict[str, Any]) –
-
classmethod
generate_config
(config)[source]¶ - Parameters
config (Union[Dict[str, Any], machin.utils.conf.Config]) –
-
classmethod
init_from_config
(config, model_device='cpu')[source]¶ - Parameters
config (Union[Dict[str, Any], machin.utils.conf.Config]) –
model_device (Union[str, torch.device]) –
-
classmethod
is_distributed
()[source]¶ Whether this framework is a distributed framework that requires multiple processes to run, and depends on torch.distributed or torch.distributed.rpc.
-
store_episode
(episode)[source]¶ Add a full episode of transition samples to the replay buffer.
- Parameters
episode (List[Union[machin.frame.transition.Transition, Dict]]) –
-
store_transition
(transition)[source]¶ Warning
Not supported in IMPALA due to v-trace requirements.
- Parameters
transition (Union[machin.frame.transition.Transition, Dict]) –
-
update
(update_value=True, update_policy=True, **__)[source]¶ Update network weights by sampling from replay buffer.
Note
Will always concatenate samples.
- Parameters
update_value – Whether to update the critic network.
update_policy – Whether to update the actor network.
- Returns
Mean estimated policy value, value loss.
-
property
lr_schedulers
¶
-
property
optimizers
¶
MADDPG¶
-
class
machin.frame.algorithms.maddpg.
MADDPG
(actors, actor_targets, critics, critic_targets, optimizer, criterion, *_, lr_scheduler=None, lr_scheduler_args=None, lr_scheduler_kwargs=None, critic_visible_actors=None, sub_policy_num=0, batch_size=100, update_rate=0.001, update_steps=None, actor_learning_rate=0.0005, critic_learning_rate=0.001, discount=0.99, gradient_max=inf, replay_size=500000, replay_device='cpu', replay_buffer=None, visualize=False, visualize_dir='', use_jit=True, pool_type='thread', pool_size=None, **__)[source]¶ Bases:
machin.frame.algorithms.base.TorchFramework
MADDPG is a centralized multi-agent training framework. It alleviates the unstable reward problem caused by the disturbance of other agents by gathering all agents' observations and training a global critic. This global critic observes all actions and all states from all agents.
See also
Note
In order to parallelize agent inference, a process pool is used internally. However, in order to minimize memory copy / CUDA memory copy, the location of all of your models must be either “cpu”, or “cuda” (Using multiple CUDA devices is supported).
Note
The MADDPG framework does not require all of your actors to be homogeneous. Each pair of your actors and critics could be heterogeneous.
Note
Suppose you have three pair of actors and critics, with index 0, 1, 2. If critic 0 can observe the action of actor 0 and 1, critic 1 can observe the action of actor 1 and 2, critic 2 can observe the action of actor 2 and 0, the
critic_visible_actors
should be:[[0, 1], [1, 2], [2, 0]]
Note
Learning rate scheduler args and kwargs are given per actor and critic: the first list is for actors, and the second list is for critics.
Note
- This implementation contains:
Ensemble Training
- This implementation does not contain:
Inferring other agents’ policies
Mixed continuous/discrete action spaces
- Parameters
actors (List[Union[machin.model.nets.base.NeuralNetworkModule, torch.nn.modules.module.Module]]) – Actor network modules.
actor_targets (List[Union[machin.model.nets.base.NeuralNetworkModule, torch.nn.modules.module.Module]]) – Target actor network modules.
critics (List[Union[machin.model.nets.base.NeuralNetworkModule, torch.nn.modules.module.Module]]) – Critic network modules.
critic_targets (List[Union[machin.model.nets.base.NeuralNetworkModule, torch.nn.modules.module.Module]]) – Target critic network modules.
optimizer (Callable) – Optimizer used to optimize actors and critics.
criterion (Callable) – Criterion used to evaluate the value loss.
critic_visible_actors (List[List[int]]) – Indexes of visible actors for each critic. By default all critics can see the outputs of all actors.
sub_policy_num (int) – Times to replicate each actor. Equal to ensemble_policy_num - 1.
lr_scheduler (Callable) – Learning rate scheduler of optimizer.
lr_scheduler_args (Tuple[List[Tuple], List[Tuple]]) – Arguments of the learning rate scheduler.
lr_scheduler_kwargs (Tuple[List[Dict], List[Dict]]) – Keyword arguments of the learning rate scheduler.
batch_size (int) – Batch size used during training.
update_rate (float) – \(\tau\) used to update target networks. Target parameters are updated as: \(\theta_t = \theta * \tau + \theta_t * (1 - \tau)\)
update_steps (Optional[int]) – Training step number used to update target networks.
actor_learning_rate (float) – Learning rate of the actor optimizer, not compatible with lr_scheduler.
critic_learning_rate (float) – Learning rate of the critic optimizer, not compatible with lr_scheduler.
discount (float) – \(\gamma\) used in the bellman function.
replay_size (int) – Replay buffer size for each actor. Not compatible with replay_buffer.
replay_device (Union[str, torch.device]) – Device where the replay buffer is located. Not compatible with replay_buffer.
replay_buffer (machin.frame.buffers.buffer.Buffer) – Custom replay buffer. Will be replicated for each actor.
visualize (bool) – Whether to visualize the network flow in the first pass.
visualize_dir (str) – Visualized graph save directory.
use_jit (bool) – Whether to use torch jit to perform the forward pass in parallel instead of using the internal pool. Provides a significant speed and efficiency advantage, but requires actors and critics to be convertible to TorchScript.
pool_type (str) – Type of the internal execution pool, either “process” or “thread”.
pool_size (int) – Size of the internal execution pool.
gradient_max (float) –
-
act
(states, use_target=False, **__)[source]¶ Use all actor networks to produce actions for the current state. A random sub-policy from the policy ensemble of each actor will be chosen.
- Parameters
states (List[Dict[str, Any]]) – A list of current states of each actor.
use_target (bool) – Whether to use the target network.
- Returns
A list of anything returned by your actor. If your actor returns multiple values, they will be wrapped in a tuple.
-
act_discrete
(states, use_target=False)[source]¶ Use all actor networks to produce discrete actions for the current state. A random sub-policy from the policy ensemble of each actor will be chosen.
Notes
The actor network must output a probability tensor of shape (batch_size, action_dims), with each row summing to 1 along dimension 1.
- Parameters
states (List[Dict[str, Any]]) – A list of current states of each actor.
use_target (bool) – Whether to use the target network.
- Returns
Integer discrete actions of shape [batch_size, 1].
Action probability tensors of shape [batch_size, action_num].
Any other things returned by your actor.
- Return type
A list of tuples containing
-
act_discrete_with_noise
(states, use_target=False)[source]¶ Use all actor networks to produce discrete actions for the current state. A random sub-policy from the policy ensemble of each actor will be chosen.
Notes
The actor network must output a probability tensor of shape (batch_size, action_dims), with each row summing to 1 along dimension 1.
- Parameters
states (List[Dict[str, Any]]) – A list of current states of each actor.
use_target (bool) – Whether to use the target network.
- Returns
Integer noisy discrete actions.
Action probability tensors of shape [batch_size, action_num].
Any other things returned by your actor.
- Return type
A list of tuples containing
-
act_with_noise
(states, noise_param=(0.0, 1.0), ratio=1.0, mode='uniform', use_target=False, **__)[source]¶ Use all actor networks to produce noisy actions for the current state. A random sub-policy from the policy ensemble of each actor will be chosen.
- Parameters
states (List[Dict[str, Any]]) – A list of current states of each actor.
noise_param (Any) – Noise params.
ratio (float) – Noise ratio.
mode (str) – Noise mode. Supported are:
"uniform", "normal", "clipped_normal", "ou"
use_target (bool) – Whether to use the target network.
- Returns
A list of noisy actions of shape [batch_size, action_dim].
-
static
action_transform_function
(raw_output_action, *_)[source]¶ - Parameters
raw_output_action (Any) –
-
enable_multiprocessing
()¶ Enable multiprocessing for all modules.
-
classmethod
generate_config
(config)[source]¶ - Parameters
config (Union[Dict[str, Any], machin.utils.conf.Config]) –
-
classmethod
get_restorable_model_names
()¶ Get attribute name of restorable nn models.
-
classmethod
get_top_model_names
()¶ Get attribute name of top level nn models.
-
classmethod
init_from_config
(config, model_device='cpu')[source]¶ - Parameters
config (Union[Dict[str, Any], machin.utils.conf.Config]) –
model_device (Union[str, torch.device]) –
-
classmethod
is_distributed
()¶ Whether this framework is a distributed framework that requires multiple processes to run, and depends on torch.distributed or torch.distributed.rpc.
-
load
(model_dir, network_map=None, version=-1)[source]¶ Load models.
An example of network map:
{"restorable_model_1": "file_name_1", "restorable_model_2": "file_name_2"}
Get keys by calling
<Class name>.get_restorable()
- Parameters
model_dir – Save directory.
network_map – Key is module name, value is saved name.
version – Version number of the save to be loaded.
-
save
(model_dir, network_map=None, version=0)¶ Save models.
An example of network map:
{"restorable_model_1": "file_name_1", "restorable_model_2": "file_name_2"}
Get keys by calling
<Class name>.get_restorable()
- Parameters
model_dir (str) – Save directory.
network_map (Dict[str, str]) – Key is module name, value is saved name.
version (int) – Version number of the new save.
-
set_backward_function
(backward_func)¶ Replace the default backward function with a custom function. The default loss backward function is
torch.autograd.backward
- Parameters
backward_func (Callable) –
-
store_episodes
(episodes)[source]¶ Add a List of full episodes, from all actors, to the replay buffers. Each episode is a list of transition samples.
- Parameters
episodes (List[List[Union[machin.frame.transition.Transition, Dict]]]) –
-
store_transitions
(transitions)[source]¶ Add a list of transition samples, from all actors at the same time step, to the replay buffers.
- Parameters
transitions (List[Union[machin.frame.transition.Transition, Dict]]) – List of transition objects.
-
update
(update_value=True, update_policy=True, update_target=True, concatenate_samples=True)[source]¶ Update network weights by sampling from replay buffer.
- Parameters
update_value – Whether to update the Q network.
update_policy – Whether to update the actor network.
update_target – Whether to update targets.
concatenate_samples – Whether to concatenate the samples.
- Returns
Mean estimated policy value, value loss.
-
visualize_model
(final_tensor, name, directory)¶ - Parameters
final_tensor (torch.Tensor) –
name (str) –
directory (str) –
-
property
backward_function
¶
-
property
lr_schedulers
¶
-
property
optimizers
¶
-
property
restorable_models
¶
-
property
top_models
¶
-
class
machin.frame.algorithms.maddpg.
SHMBuffer
(buffer_size, buffer_device='cpu', *_, **__)[source]¶ Bases:
machin.frame.buffers.buffer.Buffer
Create a buffer instance.
Buffer stores a series of transition objects and functions as a ring buffer. It is not thread-safe.
See also
During sampling, the tensors in the “state”, “action” and “next_state” dictionaries, along with “reward”, will be concatenated in dimension 0. Any other custom keys specified in **kwargs will not be concatenated.
- Parameters
buffer_size – Maximum buffer size.
buffer_device – Device where buffer is stored.
-
append
(transition, required_attrs=('state', 'action', 'next_state', 'reward', 'terminal'))¶ Store a transition object to buffer.
- Parameters
transition (Union[machin.frame.transition.Transition, Dict]) – A transition object.
required_attrs – Required attributes. Could be an empty tuple if no attribute is required.
- Raises
ValueError if transition object doesn't have required –
attributes in required_attrs or has different attributes –
compared to other transition objects stored in buffer. –
-
clear
()¶ Remove all entries from the buffer
-
static
make_tensor_from_batch
(batch, device, concatenate)[source]¶ Make a tensor from a batch of data. Concatenates input tensors in dimension 0, or creates a tensor of size (batch_size, 1) for scalars.
- Parameters
batch – Batch data.
device – Device to move data to.
concatenate – Whether to perform concatenation.
- Returns
The original batch if it is empty, a tensor built from your data (if concatenating), or the original batch (if not concatenating).
-
classmethod
post_process_batch
(batch, device, concatenate, sample_attrs, additional_concat_attrs)¶ Post-process (concatenate) sampled batch.
- Parameters
batch (List[machin.frame.transition.Transition]) –
device (Union[str, torch.device]) –
concatenate (bool) –
sample_attrs (List[str]) –
additional_concat_attrs (List[str]) –
-
sample_batch
(batch_size, concatenate=True, device=None, sample_method='random_unique', sample_attrs=None, additional_concat_attrs=None, *_, **__)¶ Sample a random batch from buffer.
See also
Default sample methods are defined as static class methods.
Note
“Concatenation” means
torch.cat([...], dim=0)
for tensors, and torch.tensor([...]).view(batch_size, 1)
for scalars.Warning
Custom attributes must not contain tensors, and only scalar custom attributes can be concatenated, such as int, float, bool.
- Parameters
batch_size (int) – A hint of the sample size; the actual sample size depends on your sample method.
sample_method (Union[Callable, str]) – Sample method, could be one of:
"random", "random_unique", "all"
, or a function:func(list, batch_size)->(list, result_size)
concatenate (bool) – Whether to concatenate state, action and next_state in dimension 0. If True, a concatenated tensor is returned for each value in the dictionaries of major attributes and for each sub attribute value; custom attributes specified in additional_concat_attrs will also be concatenated. If False, a list of tensors is returned.
device (Union[str, torch.device]) – Device to copy to.
sample_attrs (List[str]) – If specified, only the listed keys of the transition object will be sampled. You may use "*" as a wildcard to collect remaining custom keys as a dict; you cannot collect major and sub attributes using this. Invalid sample attributes will be ignored.
additional_concat_attrs (List[str]) – Additional custom keys to be concatenated; only effective if concatenate is True.
- Returns
Batch size, and sampled attribute values in the same order as sample_attrs. Sampled attribute values form a tuple, or None if the sampled batch size is zero (e.g. if the buffer is empty, or your sample size is 0 and you are not sampling with the “all” method).
For major attributes, results are dictionaries of tensors, with the same keys as in your transition objects.
For sub attributes, results are tensors.
For custom attributes, results are lists if they are not in additional_concat_attrs, otherwise tensors.
- Return type
Any
-
static
sample_method_all
(buffer, _)¶ Sample all samples from the buffer. Always returns the whole buffer, ignoring the batch_size parameter.
- Parameters
buffer (List[machin.frame.transition.Transition]) –
- Return type
Tuple[int, List[machin.frame.transition.Transition]]
-
static
sample_method_random
(buffer, batch_size)¶ Sample random samples from buffer.
Note
Sampled size could be any value from 0 to
batch_size
.- Parameters
buffer (List[machin.frame.transition.Transition]) –
batch_size (int) –
- Return type
Tuple[int, List[machin.frame.transition.Transition]]
-
static
sample_method_random_unique
(buffer, batch_size)¶ Sample unique random samples from buffer.
Note
Sampled size could be any value from 0 to
batch_size
.- Parameters
buffer (List[machin.frame.transition.Transition]) –
batch_size (int) –
- Return type
Tuple[int, List[machin.frame.transition.Transition]]
-
size
()¶ - Returns
Length of current buffer.
buffers¶
Buffer¶
-
class
machin.frame.buffers.buffer.
Buffer
(buffer_size, buffer_device='cpu', *_, **__)[source]¶ Bases:
object
Create a buffer instance.
Buffer stores a series of transition objects and functions as a ring buffer. It is not thread-safe.
See also
During sampling, the tensors in “state”, “action” and “next_state” dictionaries, along with “reward”, will be concatenated in dimension 0. Any other custom keys specified in
**kwargs
will not be concatenated.- Parameters
buffer_size – Maximum buffer size.
buffer_device – Device where buffer is stored.
-
append
(transition, required_attrs=('state', 'action', 'next_state', 'reward', 'terminal'))[source]¶ Store a transition object to buffer.
- Parameters
transition (Union[machin.frame.transition.Transition, Dict]) – A transition object.
required_attrs – Required attributes. Could be an empty tuple if no attribute is required.
- Raises
ValueError if transition object doesn't have required –
attributes in required_attrs or has different attributes –
compared to other transition objects stored in buffer. –
-
static
make_tensor_from_batch
(batch, device, concatenate)[source]¶ Make a tensor from a batch of data. Concatenates input tensors in dimension 0, or creates a tensor of size (batch_size, 1) for scalars.
- Parameters
batch (List[Union[NewType.<locals>.new_type, torch.Tensor]]) – Batch data.
device (Union[str, torch.device]) – Device to move data to.
concatenate (bool) – Whether to perform concatenation.
- Returns
The original batch if it is empty, a tensor built from your data (if concatenating), or the original batch (if not concatenating).
-
classmethod
post_process_batch
(batch, device, concatenate, sample_attrs, additional_concat_attrs)[source]¶ Post-process (concatenate) sampled batch.
- Parameters
batch (List[machin.frame.transition.Transition]) –
device (Union[str, torch.device]) –
concatenate (bool) –
sample_attrs (List[str]) –
additional_concat_attrs (List[str]) –
-
sample_batch
(batch_size, concatenate=True, device=None, sample_method='random_unique', sample_attrs=None, additional_concat_attrs=None, *_, **__)[source]¶ Sample a random batch from buffer.
See also
Default sample methods are defined as static class methods.
Note
“Concatenation” means
torch.cat([...], dim=0)
for tensors, and torch.tensor([...]).view(batch_size, 1)
for scalars.Warning
Custom attributes must not contain tensors, and only scalar custom attributes can be concatenated, such as int, float, bool.
- Parameters
batch_size (int) – A hint of the sample size; the actual sample size depends on your sample method.
sample_method (Union[Callable, str]) – Sample method, could be one of:
"random", "random_unique", "all"
, or a function:func(list, batch_size)->(list, result_size)
concatenate (bool) – Whether to concatenate state, action and next_state in dimension 0. If True, a concatenated tensor is returned for each value in the dictionaries of major attributes and for each sub attribute value; custom attributes specified in additional_concat_attrs will also be concatenated. If False, a list of tensors is returned.
device (Union[str, torch.device]) – Device to copy to.
sample_attrs (List[str]) – If specified, only the listed keys of the transition object will be sampled. You may use "*" as a wildcard to collect remaining custom keys as a dict; you cannot collect major and sub attributes using this. Invalid sample attributes will be ignored.
additional_concat_attrs (List[str]) – Additional custom keys to be concatenated; only effective if concatenate is True.
- Returns
Batch size, and sampled attribute values in the same order as sample_attrs. Sampled attribute values form a tuple, or None if the sampled batch size is zero (e.g. if the buffer is empty, or your sample size is 0 and you are not sampling with the “all” method).
For major attributes, results are dictionaries of tensors, with the same keys as in your transition objects.
For sub attributes, results are tensors.
For custom attributes, results are lists if they are not in additional_concat_attrs, otherwise tensors.
- Return type
Any
-
static
sample_method_all
(buffer, _)[source]¶ Sample all samples from the buffer. Always returns the whole buffer, ignoring the batch_size parameter.
- Parameters
buffer (List[machin.frame.transition.Transition]) –
- Return type
Tuple[int, List[machin.frame.transition.Transition]]
-
static
sample_method_random
(buffer, batch_size)[source]¶ Sample random samples from buffer.
Note
Sampled size could be any value from 0 to
batch_size
.- Parameters
buffer (List[machin.frame.transition.Transition]) –
batch_size (int) –
- Return type
Tuple[int, List[machin.frame.transition.Transition]]
-
static
sample_method_random_unique
(buffer, batch_size)[source]¶ Sample unique random samples from buffer.
Note
Sampled size could be any value from 0 to
batch_size
.- Parameters
buffer (List[machin.frame.transition.Transition]) –
batch_size (int) –
- Return type
Tuple[int, List[machin.frame.transition.Transition]]
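Example
A minimal end-to-end sketch of the Buffer API documented above; the tensor shapes and dictionary keys are illustrative assumptions:
import torch as t
from machin.frame.buffers.buffer import Buffer

buffer = Buffer(buffer_size=1000)
for _ in range(10):
    buffer.append({
        "state": {"state": t.zeros([1, 4])},
        "action": {"action": t.zeros([1, 2])},
        "next_state": {"state": t.zeros([1, 4])},
        "reward": 0.5,
        "terminal": False,
    })
# Sample a concatenated batch; returns the actual batch size and a
# tuple of attribute values ordered as in sample_attrs:
bsize, (state, action, reward) = buffer.sample_batch(
    5, sample_attrs=["state", "action", "reward"]
)
# state["state"] has shape [bsize, 4], action["action"] has shape
# [bsize, 2], and reward has shape [bsize, 1].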
Distributed buffer¶
-
class
machin.frame.buffers.buffer_d.
DistributedBuffer
(buffer_name, group, buffer_size, *_, **__)[source]¶ Bases:
machin.frame.buffers.buffer.Buffer
Create a distributed replay buffer instance.
To avoid issues caused by tensor device difference, all transition objects are stored on device “cpu”.
A distributed replay buffer consists of many local buffers, one held per process; transmissions between processes only happen during sampling.
During sampling, the tensors in “state”, “action” and “next_state” dictionaries, along with “reward”, will be concatenated in dimension 0. Any other custom keys specified in
**kwargs
will not be concatenated.See also
Note
Since
append()
operates on the local buffer, in order to append to the distributed buffer correctly, please make sure that your actor is also the local buffer holder, i.e. a member of the group.
- Parameters
buffer_size (int) – Maximum local buffer size.
group (machin.parallel.distributed.world.RpcGroup) – Process group which holds this buffer.
buffer_name (str) – A unique name of your buffer.
-
append
(transition, required_attrs=('state', 'action', 'next_state', 'reward', 'terminal'))[source]¶ Store a transition object to buffer.
- Parameters
transition (Union[machin.frame.transition.Transition, Dict]) – A transition object.
required_attrs – Required attributes. Could be an empty tuple if no attribute is required.
- Raises
ValueError if transition object doesn't have required –
attributes in required_attrs or has different attributes –
compared to other transition objects stored in buffer. –
-
sample_batch
(batch_size, concatenate=True, device=None, sample_method='random_unique', sample_attrs=None, additional_concat_attrs=None, *_, **__)[source]¶ Sample a random batch from buffer.
See also
Default sample methods are defined as static class methods.
Note
“Concatenation” means
torch.cat([...], dim=0)
for tensors, and torch.tensor([...]).view(batch_size, 1)
for scalars.Warning
Custom attributes must not contain tensors, and only scalar custom attributes can be concatenated, such as int, float, bool.
- Parameters
batch_size (int) – A hint of the sample size; the actual sample size depends on your sample method.
sample_method (Union[Callable, str]) – Sample method, could be one of:
"random", "random_unique", "all"
, or a function:func(list, batch_size)->(list, result_size)
concatenate (bool) – Whether to concatenate state, action and next_state in dimension 0. If True, a concatenated tensor is returned for each value in the dictionaries of major attributes and for each sub attribute value; custom attributes specified in additional_concat_attrs will also be concatenated. If False, a list of tensors is returned.
device (Union[str, torch.device]) – Device to copy to.
sample_attrs (List[str]) – If specified, only the listed keys of the transition object will be sampled. You may use "*" as a wildcard to collect remaining custom keys as a dict; you cannot collect major and sub attributes using this. Invalid sample attributes will be ignored.
additional_concat_attrs (List[str]) – Additional custom keys to be concatenated; only effective if concatenate is True.
- Returns
Batch size, and sampled attribute values in the same order as sample_attrs. Sampled attribute values form a tuple, or None if the sampled batch size is zero (e.g. if the buffer is empty, or your sample size is 0 and you are not sampling with the “all” method).
For major attributes, results are dictionaries of tensors, with the same keys as in your transition objects.
For sub attributes, results are tensors.
For custom attributes, results are lists if they are not in additional_concat_attrs, otherwise tensors.
- Return type
Any
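Example
A conceptual sketch only; it assumes each process has already initialized a machin RPC world, that group is an RpcGroup whose member processes each hold a local buffer segment, and that transition is a transition dict as in the Buffer example above:
from machin.frame.buffers.buffer_d import DistributedBuffer

# Use the same buffer_name and group on every member process:
buffer = DistributedBuffer("replay", group, buffer_size=100000)
# Actor processes (group members) append to their local buffers:
buffer.append(transition)
# Sampling gathers transitions across processes:
bsize, batch = buffer.sample_batch(100)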
Prioritized buffer¶
-
class
machin.frame.buffers.prioritized_buffer.
PrioritizedBuffer
(buffer_size, buffer_device='cpu', epsilon=0.01, alpha=0.6, beta=0.4, beta_increment_per_sampling=0.001, *_, **__)[source]¶ Bases:
machin.frame.buffers.buffer.Buffer
- Parameters
buffer_size – Maximum buffer size.
buffer_device – Device where buffer is stored.
epsilon – A small positive constant used to prevent edge-case zero weight transitions from never being visited.
alpha – Prioritization weight. Used during transition sampling: \(j \sim P(j)=p_{j}^{\alpha} / \sum_i p_{i}^{\alpha}\). When alpha = 0, all samples have the same probability of being sampled. When alpha = 1, samples are drawn in proportion to their priority weights.
beta – Bias correcting weight. When beta = 1, the bias introduced by prioritized replay is fully corrected. Used during importance weight calculation: \(w_j=(N \cdot P(j))^{-\beta}/\max_i w_i\)
beta_increment_per_sampling – Beta increase step size; beta is gradually increased towards 1.
append
(transition, priority=None, required_attrs=('state', 'action', 'next_state', 'reward', 'terminal'))[source]¶ Store a transition object to buffer.
- Parameters
transition (Union[machin.frame.transition.Transition, Dict]) – A transition object.
priority (Optional[float]) – Priority of transition.
required_attrs – Required attributes.
-
sample_batch
(batch_size, concatenate=True, device=None, sample_attrs=None, additional_concat_attrs=None, *_, **__)[source]¶ Sample the most important batch from the prioritized buffer.
See also
- Parameters
batch_size (int) – A hint of the sample size.
concatenate (bool) – Whether to concatenate state, action and next_state in dimension 0. If True, a concatenated tensor is returned for each value in the dictionaries of major attributes and for each sub attribute value; custom attributes specified in additional_concat_attrs will also be concatenated. If False, a list of tensors is returned.
device (Union[str, torch.device]) – Device to copy to.
sample_attrs (List[str]) – If specified, only the listed keys of the transition object will be sampled. You may use "*" as a wildcard to collect remaining keys.
additional_concat_attrs (List[str]) – Additional custom keys to be concatenated.
- Returns
Batch size.
Sampled attribute values in the same order as sample_attrs. Sampled attribute values form a tuple, or None if the sampled batch size is zero (e.g. if the buffer is empty or your sample size is 0).
Indexes of the samples in the weight tree, as np.ndarray, or None if the sampled batch size is zero.
Importance sampling weights of the samples, as np.ndarray, or None if the sampled batch size is zero.
- Return type
Any
-
update_priority
(priorities, indexes)[source]¶ Update priorities of samples.
- Parameters
priorities (numpy.ndarray) – New priorities.
indexes (numpy.ndarray) – Indexes of samples, returned by
sample_batch()
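Example
A minimal sketch of the prioritized replay cycle; the transition layout and priority values are illustrative assumptions:
import numpy as np
import torch as t
from machin.frame.buffers.prioritized_buffer import PrioritizedBuffer

buffer = PrioritizedBuffer(buffer_size=1000)
buffer.append(
    {
        "state": {"state": t.zeros([1, 4])},
        "action": {"action": t.zeros([1, 2])},
        "next_state": {"state": t.zeros([1, 4])},
        "reward": 1.0,
        "terminal": False,
    },
    priority=1.0,
)
# sample_batch also returns weight-tree indexes and importance
# sampling weights:
bsize, batch, indexes, weights = buffer.sample_batch(1)
# After computing new TD errors for the sampled transitions, feed
# their absolute values back as updated priorities:
buffer.update_priority(np.array([0.5]), indexes)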
Weight tree¶
-
class
machin.frame.buffers.prioritized_buffer.
WeightTree
(size)[source]¶ Bases:
object
Sum weight tree data structure.
Initialize a weight tree.
Note
Weights must be positive.
Note
Weight tree is stored as a flattened, full binary tree in a
np.ndarray
. The lowest level of leaves comes first; the root node is stored last.
Example:
Tree with weights:
[[1, 2, 3, 4], [3, 7], [11]]
will be stored as:
[1, 2, 3, 4, 3, 7, 11]
Note
Performance on an i7-6700HQ (M: million):
90ms for building a tree with 10M elements.
230ms for looking up 10M elements in a tree with 10M elements.
20ms for 1M element batched update in a tree with 10M elements.
240ms for 1M element single update in a tree with 10M elements.
- Parameters
size – Number of weight tree leaves.
-
find_leaf_index
(weight)[source]¶ Find leaf indexes given weight. Weight must be in range \([0, weight\_sum]\)
- Parameters
weight (Union[float, List[float], numpy.ndarray]) – Weight(s) used to query leaf index(es).
- Returns
Leaf index(es), if weight is scalar, returns
int
, if not, returns np.ndarray
.
-
get_leaf_all_weights
()[source]¶ - Returns
Current weights of all leaves,
np.ndarray
of shape (size)
.- Return type
numpy.ndarray
-
get_leaf_weight
(index)[source]¶ Get weights of selected leaves.
- Parameters
index (Union[int, List[int], numpy.ndarray]) – Leaf indexes in range
[0, size - 1]
, used to query weights.- Returns
Current weight(s) of selected leaves. If index is scalar, returns
float
, if not, returns np.ndarray
.- Return type
Any
-
print_weights
(precision=2)[source]¶ Pretty print the tree, for debug purpose.
- Parameters
precision – Number of digits of weights to print.
-
update_all_leaves
(weights)[source]¶ Reset all leaf weights, rebuild weight tree from ground up.
- Parameters
weights (Union[List[float], numpy.ndarray]) – All leaf weights. List or array length should be in range
[0, size]
.
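Example
A minimal sketch of the sum tree above. The exact leaf chosen when a query weight falls on an interval boundary is an implementation detail, so the query below sits strictly inside a leaf interval:
from machin.frame.buffers.prioritized_buffer import WeightTree

tree = WeightTree(4)
tree.update_all_leaves([1.0, 2.0, 3.0, 4.0])  # rebuild from the ground up
tree.get_leaf_all_weights()                   # -> array([1., 2., 3., 4.])
tree.get_leaf_weight(2)                       # -> 3.0
# Cumulative leaf intervals are [0, 1], (1, 3], (3, 6], (6, 10],
# so a query weight of 4.5 falls into leaf 2:
tree.find_leaf_index(4.5)                     # -> 2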
Distributed prioritized buffer¶
-
class
machin.frame.buffers.prioritized_buffer_d.
DistributedPrioritizedBuffer
(buffer_name, group, buffer_size, *_, **__)[source]¶ Bases:
machin.frame.buffers.prioritized_buffer.PrioritizedBuffer
Create a distributed prioritized replay buffer instance.
To avoid issues caused by tensor device difference, all transition objects are stored on device “cpu”.
A distributed prioritized replay buffer consists of many local buffers, one held per process. Since it is very inefficient to maintain a weight tree across processes, each process holds a local buffer (same as
DistributedBuffer
) and a weight tree over the records in that buffer. The sampling process(es) first use rpc to acquire the wr_lock, signalling “stop” to appends performed by actor processes, then sum all local weight trees, and finally perform sampling. After sampling and updating the importance weights, the lock is released.
During sampling, the tensors in “state”, “action” and “next_state” dictionaries, along with “reward”, will be concatenated in dimension 0. Any other custom keys specified in
**kwargs
will not be concatenated.See also
PrioritizedBuffer
Note
DistributedPrioritizedBuffer
is not split into an accessor and an implementation, because we would like to operate on the buffer directly when calling “size()” or “append()”, to increase efficiency (the rpc layer is bypassed).- Parameters
buffer_size (int) – Maximum local buffer size.
group (machin.parallel.distributed.world.RpcGroup) – Process group which holds this buffer.
buffer_name (str) –
-
append
(transition, priority=None, required_attrs=('state', 'action', 'next_state', 'reward', 'terminal'))[source]¶ Store a transition object to buffer.
- Parameters
transition (Union[machin.frame.transition.Transition, Dict]) – A transition object.
priority (Optional[float]) – Priority of transition.
required_attrs – Required attributes.
-
sample_batch
(batch_size, concatenate=True, device=None, sample_attrs=None, additional_concat_attrs=None, *_, **__)[source]¶ Sample the most important batch from the prioritized buffer.
See also
- Parameters
batch_size (int) – A hint of the sample size.
concatenate (bool) – Whether to concatenate state, action and next_state in dimension 0. If True, a concatenated tensor is returned for each value in the dictionaries of major attributes and for each sub attribute value; custom attributes specified in additional_concat_attrs will also be concatenated. If False, a list of tensors is returned.
device (Union[str, torch.device]) – Device to copy to.
sample_attrs (List[str]) – If specified, only the listed keys of the transition object will be sampled. You may use "*" as a wildcard to collect remaining keys.
additional_concat_attrs (List[str]) – Additional custom keys to be concatenated.
- Returns
Batch size.
Sampled attribute values in the same order as sample_attrs. Sampled attribute values form a tuple, or None if the sampled batch size is zero (e.g. if the buffer is empty or your sample size is 0).
Indexes of the samples in the weight tree, as np.ndarray, or None if the sampled batch size is zero.
Importance sampling weights of the samples, as np.ndarray, or None if the sampled batch size is zero.
- Return type
Any
-
update_priority
(priorities, indexes)[source]¶ Update priorities of samples.
- Parameters
priorities (numpy.ndarray) – New priorities.
indexes (collections.OrderedDict) – Indexes of samples, returned by
sample_batch()
noise¶
action_space_noise¶
-
machin.frame.noise.action_space_noise.
add_clipped_normal_noise_to_action
(action, noise_param=(0.0, 1.0, -1.0, 1.0), ratio=1.0)[source]¶ Add clipped normal noise to action tensor.
Hint
The innermost tuple contains:
(normal_mean, normal_sigma, clip_min, clip_max)
If noise_param is Tuple[float, float, float, float], then the same clipped normal noise will be added to action[*, :].
If noise_param is Iterable[Tuple[float, float, float, float]], then for each action[*, i] slice i, clipped normal noise with noise_param[i] will be applied respectively.
- Parameters
action (torch.Tensor) – Raw action
noise_param (Union[Iterable[Tuple], Tuple]) – Param of the normal noise.
ratio – Sampled noise is multiplied with this ratio.
- Returns
Action with clipped normal noise.
-
machin.frame.noise.action_space_noise.
add_normal_noise_to_action
(action, noise_param=(0.0, 1.0), ratio=1.0)[source]¶ Add normal noise to action tensor.
Hint
The innermost tuple contains:
(normal_mean, normal_sigma)
If noise_param is Tuple[float, float], then the same normal noise will be added to action[*, :].
If noise_param is Iterable[Tuple[float, float]], then for each action[*, i] slice i, normal noise with noise_param[i] will be applied respectively.
- Parameters
action (torch.Tensor) – Raw action
noise_param – Param of the normal noise.
ratio – Sampled noise is multiplied with this ratio.
- Returns
Action with normal noise.
-
machin.frame.noise.action_space_noise.
add_ou_noise_to_action
(action, noise_param=None, ratio=1.0, reset=False)[source]¶ Add Ornstein-Uhlenbeck noise to action tensor.
Warning
The Ornstein-Uhlenbeck noise generator is shared, and you cannot specify OU noise of different distributions for each element of the last dimension of your action.
- Parameters
action (torch.Tensor) – Raw action
noise_param (Dict[str, Any]) –
OrnsteinUhlenbeckGen
params, used as keyword arguments of the generator. Only effective if reset is True.
ratio – Sampled noise is multiplied with this ratio.
reset – Whether to reset the default Ornstein-Uhlenbeck noise generator.
- Returns
Action with Ornstein-Uhlenbeck noise.
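Hint
A minimal usage sketch; parameter values are illustrative, and noise_param is only consulted when reset=True, since the shared generator is re-created at that point:
import torch as t
from machin.frame.noise.action_space_noise import add_ou_noise_to_action

action = t.zeros([1, 2])
# Start of an episode: reset the shared generator with its parameters.
noisy = add_ou_noise_to_action(
    action, noise_param={"theta": 0.15, "dt": 0.01}, reset=True
)
# Later steps in the same episode reuse the generator's internal state:
noisy = add_ou_noise_to_action(action)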
-
machin.frame.noise.action_space_noise.
add_uniform_noise_to_action
(action, noise_param=(0.0, 1.0), ratio=1.0)[source]¶ Add uniform noise to action tensor.
Hint
The innermost tuple contains:
(uniform_min, uniform_max)
If noise_param is Tuple[float, float], then the same uniform noise will be added to action[*, :].
If noise_param is Iterable[Tuple[float, float]], then for each action[*, i] slice i, uniform noise with noise_param[i] will be added respectively.
- Parameters
action (torch.Tensor) – Raw action.
noise_param (Union[Iterable[Tuple], Tuple]) – Param of the uniform noise.
ratio (float) – Sampled noise is multiplied with this ratio.
- Returns
Action with uniform noise.
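Example
A minimal sketch of the shared-parameter and per-dimension forms described above; shapes and parameter values are illustrative assumptions:
import torch as t
from machin.frame.noise.action_space_noise import (
    add_normal_noise_to_action,
    add_uniform_noise_to_action,
)

action = t.zeros([32, 2])  # [batch_size, action_dim]
# The same N(0, 0.1) noise is added to every slice of action[*, :]:
noisy = add_normal_noise_to_action(action, noise_param=(0.0, 0.1))
# One (uniform_min, uniform_max) tuple per action[*, i] slice:
noisy = add_uniform_noise_to_action(
    action, noise_param=[(-0.1, 0.1), (-0.5, 0.5)]
)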
generator¶
-
class
machin.frame.noise.generator.
ClippedNormalNoiseGen
(shape, mu=0.0, sigma=1.0, nmin=-1.0, nmax=1.0)[source]¶ Bases:
machin.frame.noise.generator.NoiseGen
Clipped normal noise generator.
Example
>>> gen = ClippedNormalNoiseGen([2, 3], 0, 1)
>>> gen("cuda:0")
tensor([[-0.5957,  0.2360,  1.0000],
        [ 1.0000,  1.0000, -0.0667]], device="cuda:0")
- Parameters
shape (Any) – Output shape.
mu (float) – Mean of the normal noise.
sigma (float) – Standard deviation of normal noise.
nmin (float) –
nmax (float) –
-
class
machin.frame.noise.generator.
NoiseGen
[source]¶ Bases:
abc.ABC
Base class for noise generators.
-
class
machin.frame.noise.generator.
NormalNoiseGen
(shape, mu=0.0, sigma=1.0)[source]¶ Bases:
machin.frame.noise.generator.NoiseGen
Normal noise generator.
Example
>>> gen = NormalNoiseGen([2, 3], 0, 1)
>>> gen("cuda:0")
tensor([[-0.5957,  0.2360,  1.0999],
        [ 1.6259,  1.2052, -0.0667]], device="cuda:0")
- Parameters
shape (Any) – Output shape.
mu (float) – Mean of the normal noise.
sigma (float) – Standard deviation of normal noise.
-
class
machin.frame.noise.generator.
OrnsteinUhlenbeckNoiseGen
(shape, mu=0.0, sigma=1.0, theta=0.15, dt=0.01, x0=None)[source]¶ Bases:
machin.frame.noise.generator.NoiseGen
Ornstein-Uhlenbeck noise generator. Based on definition:
\(X_{n+1} = X_n + \theta (\mu - X_n)\Delta t + \sigma \Delta W_n\)
Example
>>> gen = OrnsteinUhlenbeckNoiseGen([2, 3], 0, 1)
>>> gen("cuda:0")
tensor([[ 0.1829,  0.1589, -0.1932],
        [-0.1568,  0.0579,  0.2107]], device="cuda:0")
>>> gen.reset()
- Parameters
shape (Any) – Output shape.
mu (float) – Mean of the noise.
sigma (float) – Weight of the random Wiener process.
theta (float) – Weight of difference correction.
dt (float) – Time step size.
x0 (torch.Tensor) – Initial x value. Must have the same shape as
shape
.
-
class
machin.frame.noise.generator.
UniformNoiseGen
(shape, umin=0.0, umax=1.0)[source]¶ Bases:
machin.frame.noise.generator.NoiseGen
Uniform noise generator.
Example
>>> gen = UniformNoiseGen([2, 3], 0, 1)
>>> gen("cuda:0")
tensor([[0.0745, 0.6581, 0.9572],
        [0.4450, 0.8157, 0.6421]], device="cuda:0")
- Parameters
shape (Any) – Output shape.
umin (float) – Minimum value of uniform noise.
umax (float) – Maximum value of uniform noise.
param_space_noise¶
-
class
machin.frame.noise.param_space_noise.
AdaptiveParamNoise
(initial_stddev=0.1, desired_action_stddev=0.1, adoption_coefficient=1.01)[source]¶ Bases:
object
Implements the adaptive parameter space method in <<Parameter space noise for exploration>>.
Hint
Let \(\theta\) be the standard deviation of noise, and \(\alpha\) be the adoption coefficient, then:
\(\theta_{k+1} = \left \{ \begin{array}{ll} \alpha \theta_k & if\ d(\pi,\tilde{\pi})\leq\delta, \\ \frac{1}{\alpha} \theta_k & otherwise, \end{array} \right.\)
Noise is directly applied to network parameters.
- Parameters
initial_stddev (float) – Initial noise standard deviation.
desired_action_stddev (float) – Desired standard deviation for actions.
adoption_coefficient (float) – Adoption coefficient.
-
machin.frame.noise.param_space_noise.
perturb_model
(model, perturb_switch, reset_switch, distance_func=<function <lambda>>, desired_action_stddev=0.5, noise_generator=<class 'machin.frame.noise.generator.NormalNoiseGen'>, noise_generator_args=(), noise_generator_kwargs=None, noise_generate_function=None, debug_backward=False)[source]¶ Give model’s parameters a little perturbation. Implements <<Parameter space noise for exploration>>.
Note
Only parameters of type
t.Tensor
and gettable frommodel.named_parameters()
will be perturbed.Original parameters will be automatically swapped in during the backward pass, and you can safely call optimizers afterwards.
Hint
1. noise_generator must accept (shape, *args) in its __init__ function, where shape is the required shape. It also needs to have __call__(device=None), which produces a noise tensor on the specified device when invoked.
2. noise_generate_function must accept (shape, device, std: float) and return a noise tensor on the specified device.
In order to use this function to perturb your model, you need to:
from machin.utils.helper_classes import Switch
from machin.frame.noise.param_space_noise import perturb_model
from machin.utils.visualize import visualize_graph
import torch as t

dims = 5

t.manual_seed(0)
model = t.nn.Linear(dims, dims)
optim = t.optim.Adam(model.parameters(), 1e-3)
p_switch, r_switch = Switch(), Switch()
cancel = perturb_model(model, p_switch, r_switch)

# You should keep this switch on if you do one training step after
# every sampling step. Otherwise you may turn it off in one episode
# and turn it on in the next to speed up training.
r_switch.on()

# Turn off/on the perturbation switch to see the difference.
p_switch.on()

# Do some sampling.
action = model(t.ones([dims]))

# In order to let parameter noise adapt to generate noisy actions
# within ``desired_action_stddev``, you must periodically
# use the original model to generate some actions:
p_switch.off()
action = model(t.ones([dims]))

# visualize_graph will not show any leaf noise tensors, because they
# are created in a t.no_grad() context and added in-place.
visualize_graph(action, exit_after_vis=False)

# Do some training.
loss = (action - t.ones([dims])).sum()
loss.backward()
optim.step()
print(model.weight)

# Clear hooks.
cancel()
- Parameters
model (torch.nn.modules.module.Module) – Neural network model.
perturb_switch (machin.utils.helper_classes.Switch) – The switch used to enable perturbation. If the switch is set to False (off), then during the forward process, original parameters are used.
reset_switch (machin.utils.helper_classes.Switch) – The switch used to reset perturbation noise. If the switch is set to True (on), and perturb_switch is also on, then during every forward process a new set of noise is applied to each parameter. If only perturb_switch is on, then the same set of noisy parameters is used in the forward process, and they will not be updated.
distance_func (Callable) – Distance function. Accepts two tensors produced by model (one noisy), and returns the distance as a float. Used to compare the distance between actions generated by noisy parameters and original parameters.
desired_action_stddev (float) – Desired action standard deviation.
noise_generator (Any) – Noise generator class.
noise_generator_args (Tuple) – Additional args other than shape of the noise generator.
noise_generator_kwargs (Dict) – Additional kwargs other than shape of the noise generator.
noise_generate_function (Callable) – Noise generation function, mutually exclusive with noise_generator and noise_generator_args.
debug_backward – Print a message if the backward hook is correctly executed.
- Returns
A reset function with no arguments; swaps the original parameters back in.
A deregister function with no arguments; deregisters all hooks applied to your model.
transition¶
-
class
machin.frame.transition.
Transition
(state, action, next_state, reward, terminal, **kwargs)[source]¶ Bases:
machin.frame.transition.TransitionBase
The default Transition class.
It has three major attributes: state, action and next_state.
It has two sub attributes: reward and terminal.
It stores one transition step of one agent.
- Parameters
state (Dict[str, torch.Tensor]) – Previous observed state.
action (Dict[str, torch.Tensor]) – Action of agent.
next_state (Dict[str, torch.Tensor]) – Next observed state.
reward (Union[float, torch.Tensor]) – Reward of agent.
terminal (bool) – Whether environment has reached terminal state.
**kwargs – Custom attributes. They are ordered alphabetically (as provided by sort()) when you call keys().
Note
You should not store any tensor inside
**kwargs
as they will not be moved to the sample output device.-
action
= None¶
-
next_state
= None¶
-
reward
= None¶
-
state
= None¶
-
terminal
= None¶
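Example
A minimal sketch of constructing the default Transition; the dictionary keys and the custom step attribute are illustrative assumptions:
import torch as t
from machin.frame.transition import Transition

transition = Transition(
    state={"state": t.zeros([1, 4])},
    action={"action": t.zeros([1, 2])},
    next_state={"state": t.zeros([1, 4])},
    reward=1.0,
    terminal=False,
    step=10,  # custom attribute; must not contain tensors
)
transition.keys()  # major attrs, then sub attrs, then custom attrs
transition.to("cpu")  # no-op here; moves stored tensors otherwise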
-
class
machin.frame.transition.
TransitionBase
(major_attr, sub_attr, custom_attr, major_data, sub_data, custom_data)[source]¶ Bases:
object
Base class for all transitions
Note
Major attributes store things like state, action, next_states, etc. They are usually concatenated by their dictionary keys during sampling, and passed as keyword arguments to actors, critics, etc.
Sub attributes store things like terminal states, reward, etc. They are usually concatenated directly during sampling, and used in different algorithms.
Custom attributes store values that cannot be concatenated, usually user-specified states, used in models or as special arguments in different algorithms. They are collected together as a list during sampling; no further concatenation is performed.
- Parameters
major_attr (Iterable[str]) – A list of major attribute names.
sub_attr (Iterable[str]) – A list of sub attribute names.
custom_attr (Iterable[str]) – A list of custom attribute names.
major_data (Iterable[Dict[str, torch.Tensor]]) – Data of major attributes.
sub_data (Iterable[Union[NewType.<locals>.new_type, torch.Tensor]]) – Data of sub attributes.
custom_data (Iterable[Any]) – Data of custom attributes.
-
has_keys
(keys)[source]¶ - Parameters
keys (Iterable[str]) – A list of keys
- Returns
A bool indicating whether current transition object contains all specified keys.
-
keys
()[source]¶ - Returns
All attribute names in current transition object. Ordered in: “major_attrs, sub_attrs, custom_attrs”
-
to
(device)[source]¶ Move the current transition object to another device. This is a no-op if it is already located on that device.
- Parameters
device (Union[str, torch.device]) – A valid pytorch device.
- Returns
Self.
-
property
custom_attr
¶
-
property
major_attr
¶
-
property
sub_attr
¶
-
class
machin.frame.transition.
TransitionStorageBasic
(max_size)[source]¶ Bases:
list
TransitionStorageBasic is a linear, size-capped chunk of memory for transitions. It makes sure that every stored transition is copied and isolated from the passed-in transition object.
- Parameters
max_size – Maximum size of the transition storage.
-
store
(transition)[source]¶ - Parameters
transition (machin.frame.transition.TransitionBase) – Transition object to be stored
- Returns
The position where transition is inserted.
- Return type
int
-
class
machin.frame.transition.
TransitionStorageSmart
(max_size)[source]¶ Bases:
machin.frame.transition.TransitionStorageBasic
TransitionStorageSmart is a smarter, but (potentially) slower, storage class for transitions. In many cases it is as fast as the basic storage while halving memory usage, because it only deep-copies half of the states.
TransitionStorageSmart will compare the major attributes of the currently stored transition object with those of the last stored transition object, and set equal ones to refer to the same tensor.
Sub attributes and custom attributes will be directly copied.
- Parameters
max_size – Maximum size of the transition storage.
-
store
(transition)[source]¶ - Parameters
transition (machin.frame.transition.TransitionBase) – Transition object to be stored
- Returns
The position where transition is inserted.
- Return type
int
-