B Learning algorithms’ hyperparameters

The following tables show a comparison of various hyperparameters for each of the learning algorithms: QSOM, QDSOM, DDPG, and MADDPG. For each algorithm, numerous combinations of hyperparameters were tried multiple times, and the score of each run was recorded. We recall that, in this context, the score is defined as the average global reward over the time steps, where the global reward is the reward computed over all agents, e.g., the equity in the environment. The mean score is computed over all runs that use the same combination of hyperparameters. Then, for each hyperparameter, we report the maximum of the mean scores attained for each value of that hyperparameter.

This method simplifies away the interactions between hyperparameters, since we take the maximum. For example, a value v1 of a hyperparameter h1 could yield a low score most of the time, except when used in conjunction with a value v2 of another hyperparameter h2. The maximum retains only this favourable case, and ignores the fact that, for any other combination of hyperparameters, setting h1=v1 yields a low score. This still represents valuable information, as we are often more interested in finding the best hyperparameters, i.e., those that yield the maximum possible score.
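The aggregation described above can be sketched as follows. This is a minimal illustration, not the actual experiment code; the hyperparameter names (lr, gamma) and the scores are purely hypothetical.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical run records: (hyperparameter combination, score of that run).
runs = [
    ({"lr": 0.01, "gamma": 0.9}, 0.42),
    ({"lr": 0.01, "gamma": 0.9}, 0.46),
    ({"lr": 0.01, "gamma": 0.99}, 0.30),
    ({"lr": 0.05, "gamma": 0.9}, 0.25),
    ({"lr": 0.05, "gamma": 0.99}, 0.55),
]

# 1) Mean score over all runs sharing the same full combination.
by_combo = defaultdict(list)
for params, score in runs:
    by_combo[tuple(sorted(params.items()))].append(score)
mean_scores = {combo: mean(scores) for combo, scores in by_combo.items()}

# 2) For each value of each hyperparameter, the maximum of the mean
#    scores over all combinations that contain this value.
best = defaultdict(dict)
for combo, m in mean_scores.items():
    for name, value in combo:
        best[name][value] = max(best[name].get(value, float("-inf")), m)

# Here best["lr"][0.05] is 0.55: the low score of (lr=0.05, gamma=0.9)
# is masked by the good combination (lr=0.05, gamma=0.99).
print(dict(best))
```

Note how the maximum in step 2 hides the interaction: lr=0.05 is reported with its best mean score even though it performs poorly with gamma=0.9, which is exactly the simplification discussed above.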