B Learning algorithms’ hyperparameters

The following tables show a comparison of various hyperparameters for each of the learning algorithms: QSOM, QDSOM, DDPG, and MADDPG. For each algorithm, numerous combinations of hyperparameters were tried multiple times, and the score of each run was recorded. We recall that, in this context, the score is defined as the average global reward over the time steps, where the global reward is the reward computed over all agents, e.g., the equity in the environment. The mean score is computed over all runs that use the same combination of hyperparameters. Then, for each hyperparameter, we report the maximum of the mean scores attained for each value of that hyperparameter.

This method simplifies away the interactions between hyperparameters, since we take the maximum. For example, a value v1 of a hyperparameter h1 could yield a low score most of the time, except when used in conjunction with a value v2 of another hyperparameter h2. The maximum retains only this favourable case, and ignores the fact that, for any other combination of hyperparameters, setting h1=v1 yields a low score. This still represents valuable information, as we are often more interested in finding the best hyperparameters, i.e., those that yield the maximum possible score.
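The aggregation described above can be sketched as follows. This is a minimal illustration, not the actual experiment code; the hyperparameter names (lr, gamma) and the scores are purely hypothetical.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical run records: (hyperparameter combination, score of that run).
runs = [
    ({"lr": 0.01, "gamma": 0.9}, 0.42),
    ({"lr": 0.01, "gamma": 0.9}, 0.46),
    ({"lr": 0.01, "gamma": 0.99}, 0.30),
    ({"lr": 0.05, "gamma": 0.9}, 0.25),
    ({"lr": 0.05, "gamma": 0.99}, 0.55),
]

# 1) Mean score over all runs sharing the same full combination.
by_combo = defaultdict(list)
for params, score in runs:
    by_combo[tuple(sorted(params.items()))].append(score)
mean_scores = {combo: mean(scores) for combo, scores in by_combo.items()}

# 2) For each value of each hyperparameter, the maximum of the mean
#    scores over all combinations that contain this value.
best = defaultdict(dict)
for combo, m in mean_scores.items():
    for name, value in combo:
        best[name][value] = max(best[name].get(value, float("-inf")), m)

# Here best["lr"][0.05] is 0.55: the low score of (lr=0.05, gamma=0.9)
# is masked by the good combination (lr=0.05, gamma=0.99).
print(dict(best))
```

Note how the maximum in step 2 hides the interaction: lr=0.05 is reported with its best mean score even though it performs poorly with gamma=0.9, which is exactly the simplification discussed above.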