MORL/D implementation for discrete action spaces #174

Hi @LucasAlegre @ffelten,

I am new to MORL-Baselines and am trying to apply the MORL/D multi-policy algorithm to an operations research problem, comparing it with other algorithms under both continuous and discrete action spaces. Is it sound to pass "MOSACDiscrete" in place of "MOSAC" when solving MDPs with discrete action spaces under the weighted sum method? Can it be expected to approximate a well-distributed, converged Pareto front?

I noticed that the output stream switched from displaying reward vectors, as in the first output below, to displaying only SPS (steps per second), as in the second output, which may or may not indicate an issue. However, hypervolume measurements are still being logged for the corresponding experiment at wandb.ai, as shown in the screenshot at the end, and they look normal.

```
Weights: [[0.     0.     0.     0.     0.     0.     1.    ]
 [0.     0.     0.     0.     0.     1.     0.    ]
 [0.     0.     0.     0.4772 0.5228 0.     0.    ]
 [0.     0.     1.     0.     0.     0.     0.    ]
 [0.     1.     0.     0.     0.     0.     0.    ]
 [1.     0.     0.     0.     0.     0.     0.    ]]
Neighborhoods: [[2], [2], [0], [2], [2], [2]]
/usr/local/lib/python3.12/dist-packages/notebook/notebookapp.py:191: SyntaxWarning: invalid escape sequence '\/'
  | |_| | '_ \/ _` / _` |  _/ -_)
/usr/local/lib/python3.12/dist-packages/notebook/utils.py:280: DeprecationWarning: distutils Version classes are deprecated. Use packaging.version instead.
  return LooseVersion(v) >= LooseVersion(check)
/usr/local/lib/python3.12/dist-packages/jupyter_client/session.py:203: DeprecationWarning: datetime.datetime.utcnow() is deprecated and scheduled for removal in a future version. Use timezone-aware objects to represent datetimes in UTC: datetime.datetime.now(datetime.UTC).
  return datetime.utcnow().replace(tzinfo=utc)
wandb: (1) Create a W&B account
wandb: (2) Use an existing W&B account
wandb: (3) Don't visualize my results
wandb: Enter your choice: 2
wandb: You chose 'Use an existing W&B account'
wandb: Logging into https://api.wandb.ai/. (Learn how to deploy a W&B server locally: https://wandb.me/wandb-server)
wandb: Find your API key here: https://wandb.ai/authorize?ref=models
wandb: Paste an API key from your profile and hit enter: ··········
wandb: No netrc file found, creating one.
wandb: Appending key for api.wandb.ai to your netrc file: /root/.netrc
wandb: Currently logged in as: moizca to https://api.wandb.ai/. Use `wandb login --relogin` to force relogin
/usr/local/lib/python3.12/dist-packages/jupyter_client/session.py:203: DeprecationWarning: datetime.datetime.utcnow() is deprecated and scheduled for removal in a future version. Use timezone-aware objects to represent datetimes in UTC: datetime.datetime.now(datetime.UTC).
  return datetime.utcnow().replace(tzinfo=utc)
/usr/local/lib/python3.12/dist-packages/wandb/analytics/sentry.py:279: DeprecationWarning: The `Scope.user` setter is deprecated in favor of `Scope.set_user()`.
  self.scope.user = {"email": email}
/usr/local/lib/python3.12/dist-packages/jupyter_client/session.py:203: DeprecationWarning: datetime.datetime.utcnow() is deprecated and scheduled for removal in a future version. Use timezone-aware objects to represent datetimes in UTC: datetime.datetime.now(datetime.UTC).
  return datetime.utcnow().replace(tzinfo=utc)
/usr/local/lib/python3.12/dist-packages/wandb/analytics/sentry.py:279: DeprecationWarning: The `Scope.user` setter is deprecated in favor of `Scope.set_user()`.
  self.scope.user = {"email": email}
Tracking run with wandb version 0.23.1
Run data is saved locally in /content/wandb/run-20260214_110358-s1cb2r5h
Syncing run [CustomEnv-v100-seed1__morld_continuous_act_(8,_1,_2[1, 2],_6)_1(MOSAC)__1__1771066942](https://wandb.ai/moizca/disaster_response_relief_allocation_and_relocation_drl/runs/s1cb2r5h) to [Weights & Biases](https://wandb.ai/moizca/disaster_response_relief_allocation_and_relocation_drl) ([docs](https://wandb.me/developer-guide))
View project at https://wandb.ai/moizca/disaster_response_relief_allocation_and_relocation_drl
View run at https://wandb.ai/moizca/disaster_response_relief_allocation_and_relocation_drl/runs/s1cb2r5h
Starting training...
/usr/local/lib/python3.12/dist-packages/google/protobuf/internal/well_known_types.py:178: DeprecationWarning: datetime.datetime.utcnow() is deprecated and scheduled for removal in a future version. Use timezone-aware objects to represent datetimes in UTC: datetime.datetime.now(datetime.UTC).
  self.FromDatetime(datetime.datetime.utcnow())
Current pareto archive:
[array([-2743.2118,  -842.4091, -3364.2579,  -733.8801, -1475.7389,
        -372.2141,  -372.4044]), array([-3069.748 ,  -842.5833, -3768.2693,  -722.211 , -2181.658 ,
        -351.3258,  -362.0436]), array([-2384.7098,  -853.5723, -3090.8005,  -735.8487, -1636.418 ,
        -367.6799,  -370.1419]), array([-1388.1533,  -865.8287, -2030.9455,  -803.7104, -1409.6956,
        -377.5131,  -378.1531]), array([-3017.0933,  -835.058 , -4007.3654,  -720.3876, -3000.8637,
        -337.3352,  -334.6139]), array([-2208.7978,  -862.4765, -2865.4484,  -757.6116, -1073.4638,
        -377.4112,  -387.3889])]
[]
Episode infos:
Steps: 6, Time: 0.00698
Total Reward: [   0.     -102.0325   -2.3094   -3.4291   -4.6313   -2.5      -2.6833], Discounted: [   0.     -102.0123   -2.2408   -3.3481   -4.4284   -2.4015   -2.5518]
Scalarized Reward: -2.6832995414733887, Discounted: -2.551791191101074
Episode infos:
Steps: 6, Time: 0.016747
Total Reward: [-4.1835 -2.7469 -3.834  -2.041  -4.2886 -2.4832 -1.7686], Discounted: [-4.1702 -2.7353 -3.7388 -1.9904 -4.0991 -2.3854 -1.682 ]
Scalarized Reward: -1.7686456441879272, Discounted: -1.6819645166397095
```
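As a sanity check on the output above, here is a minimal sketch of weighted-sum scalarization, assuming the logged "Scalarized Reward" is simply the dot product of the active weight vector and the episode's reward vector. The helper name `weighted_sum` is hypothetical and not part of MORL-Baselines; the numbers are copied from the first weight vector and the first episode in the log.

```python
def weighted_sum(weights, reward):
    """Scalarize a reward vector as its dot product with a weight vector."""
    if len(weights) != len(reward):
        raise ValueError("weights and reward must have the same length")
    return sum(w * r for w, r in zip(weights, reward))


# First weight vector and first episode's total reward, copied from the log above.
weights = [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0]
total_reward = [0.0, -102.0325, -2.3094, -3.4291, -4.6313, -2.5, -2.6833]

# Assumption: the policy being evaluated uses this weight vector.
scalarized = weighted_sum(weights, total_reward)
print(scalarized)
```

With a one-hot weight vector the result is just the last objective, which agrees with the logged scalarized value to the four decimals printed for the reward vector.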

```
Weights: [[0.     0.     0.     0.     0.     0.     1.    ]
 [0.     0.     0.     0.     0.     1.     0.    ]
 [0.     0.     0.     0.4772 0.5228 0.     0.    ]
 [0.     0.     1.     0.     0.     0.     0.    ]
 [0.     1.     0.     0.     0.     0.     0.    ]
 [1.     0.     0.     0.     0.     0.     0.    ]]
Neighborhoods: [[2], [2], [0], [2], [2], [2]]
/usr/local/lib/python3.12/dist-packages/notebook/utils.py:280: DeprecationWarning: distutils Version classes are deprecated. Use packaging.version instead.
  return LooseVersion(v) >= LooseVersion(check)
/usr/local/lib/python3.12/dist-packages/jupyter_client/session.py:203: DeprecationWarning: datetime.datetime.utcnow() is deprecated and scheduled for removal in a future version. Use timezone-aware objects to represent datetimes in UTC: datetime.datetime.now(datetime.UTC).
  return datetime.utcnow().replace(tzinfo=utc)
.
.
.
SPS: 292
SPS: 245
SPS: 211
SPS: 186
SPS: 166
SPS: 149
.
.
.
SPS: 14
SPS: 14
SPS: 14
SPS: 14
SPS: 14
SPS: 14
SPS: 14
SPS: 14
Switching... global_steps: 40000
Updating other policies...
Current pareto archive:
[array([-2695.74  ,  -867.91  , -3465.3788,  -779.1017, -2003.6496,
        -373.5607,  -378.4734]), array([-3412.92  ,  -860.8252, -3562.5263,  -768.3149, -1613.9358,
        -385.5499,  -389.815 ]), array([-2511.18  ,  -871.5112, -3165.1545,  -776.858 , -2006.3393,
        -374.2034,  -376.2483]), array([-2646.62  ,  -858.6436, -3024.9414,  -796.1238, -1969.6637,
        -378.7863,  -379.1632]), array([-2483.06  ,  -866.7596, -3375.7192,  -794.0449, -2989.7398,
        -364.7047,  -359.6036]), array([-2400.3   ,  -876.8047, -2859.8534,  -783.5259, -2013.6398,
        -372.8789,  -377.1367]), array([-2300.1   ,  -883.1469, -2935.3799,  -806.0118, -2402.1624,
        -374.3257,  -374.2679]), array([-2178.52  ,  -879.4249, -3233.2323,  -791.8297, -2247.9868,
        -376.5287,  -378.1794]), array([-3129.76  ,  -858.2837, -3653.1071,  -766.6655, -2332.135 ,
        -364.1973,  -382.1365]), array([-3178.48  ,  -867.6435, -3581.3834,  -764.9638, -2396.0338,
        -371.5708,  -369.7922]), array([-3087.    ,  -862.4392, -3644.9723,  -755.0767, -1703.3288,
        -390.0799,  -379.6507]), array([-2923.44  ,  -865.0531, -3014.5524,  -770.1528, -1411.3653,
        -392.2977,  -392.59  ])]
[]
Global step: 40000
Save freq: 10000
Saving checkpoint at step 40000
Saving population...
Saving archive...
done!
```
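The "Current pareto archive" printed above is expected to hold only mutually non-dominated return vectors. A minimal sketch of such a filter follows, assuming every objective is maximized (the returns here are negative costs). The helper names `dominates` and `pareto_filter` are hypothetical, not MORL-Baselines functions, and the two-objective sample points are made up purely for illustration.

```python
def dominates(a, b):
    """True if a is at least as good as b in every objective and strictly
    better in at least one (maximization)."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def pareto_filter(points):
    """Keep only the points that no other point dominates."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


# Illustrative toy data, not taken from the experiment.
toy_points = [(-3.0, -1.0), (-1.0, -3.0), (-2.0, -2.0), (-3.0, -3.0)]
front = pareto_filter(toy_points)
print(front)
```

Running the same filter over a saved archive is one way to confirm that the archive stays consistent after the output format changes.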

<img width="2528" height="1328" alt="Image" src="https://github.com/user-attachments/assets/a94366e3-4c3b-4bbf-8b52-02d0824c1d25" />
