Bi-arm handover task. Original reward design by Guy Lever.
2025-01-20.07-29-41.mp4
As seen in the video (50% speed for easier viewing), the shaping rewards consist of 3 terms that create a mostly monotonic "reward potential field" increasing as the robot progresses through the desired motion.
- `gripper_box` drives the left hand to the box.
- `box_handover` rewards the box for getting to a pre-assigned handover point.
- `handover_target` rewards the right hand for getting the box to the target point.

With this formulation alone, the policy takes 30 min to 1 hour to train and gets stuck in local minima for about half the seeds. The difficulty is in the handover. Because the rewards plummet when the hands fumble during the transfer, the policy gets stuck in a local minimum where both hands clasp the box, unwilling to let go. Two tricks get around this.
First, don't penalize regression during an episode. If $r_{raw}$ is the sum of the above three terms, we use:
$r_{t+1} = \max\left( r_{raw, t+1} - \max_{\tau \in [0, t]} r_{raw, \tau},\ 0 \right)$
Second, reset the episode whenever the box is dropped. These tricks drive the robot to get a lot of attempts at the transfer procedure while being unafraid of failure.
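
As a rough illustration of the two tricks, here is a minimal plain-Python sketch. It is not the code in this PR: the names `ProgressReward` and `run_episode` are hypothetical, the running maximum is assumed to start at 0 at the beginning of each episode, and the per-step raw rewards and drop flags are assumed to be supplied by the environment.

```python
from dataclasses import dataclass


@dataclass
class ProgressReward:
    """Pays out only when the raw shaping reward exceeds its best value so far."""

    # Running max of r_raw within the episode (assumption: starts at 0).
    best_raw: float = 0.0

    def reset(self) -> None:
        # Called at the start of every episode.
        self.best_raw = 0.0

    def step(self, r_raw: float) -> float:
        # Reward is the improvement over the best raw value seen so far,
        # clipped at 0 so that regressions are not penalized.
        reward = max(r_raw - self.best_raw, 0.0)
        self.best_raw = max(self.best_raw, r_raw)
        return reward


def run_episode(raw_rewards, box_dropped):
    """Applies the progress reward and ends the episode on the first drop.

    raw_rewards: per-step sums of the three shaping terms (hypothetical input).
    box_dropped: per-step booleans, True when the box was dropped.
    """
    tracker = ProgressReward()
    tracker.reset()
    rewards = []
    for r_raw, dropped in zip(raw_rewards, box_dropped):
        rewards.append(tracker.step(r_raw))
        if dropped:
            break  # reset the episode as soon as the box is dropped
    return rewards
```

Under these assumptions the per-step rewards telescope: their sum over an episode equals the highest raw reward reached, so a fumble costs nothing beyond the progress it fails to make.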
On my RTX 4090, this trains stably across seeds in about 10 min.
![image](https://private-user-images.githubusercontent.com/22626914/404883115-c7c7496a-8a18-472e-a550-aaeccb7b2529.png?jwt=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3Mzk0NjA1MjYsIm5iZiI6MTczOTQ2MDIyNiwicGF0aCI6Ii8yMjYyNjkxNC80MDQ4ODMxMTUtYzdjNzQ5NmEtOGExOC00NzJlLWE1NTAtYWFlY2NiN2IyNTI5LnBuZz9YLUFtei1BbGdvcml0aG09QVdTNC1ITUFDLVNIQTI1NiZYLUFtei1DcmVkZW50aWFsPUFLSUFWQ09EWUxTQTUzUFFLNFpBJTJGMjAyNTAyMTMlMkZ1cy1lYXN0LTElMkZzMyUyRmF3czRfcmVxdWVzdCZYLUFtei1EYXRlPTIwMjUwMjEzVDE1MjM0NlomWC1BbXotRXhwaXJlcz0zMDAmWC1BbXotU2lnbmF0dXJlPTRjYjhjMWE5Yzg4ZDM2NTRjY2M0YzEzM2IyNTkzNzYxMmZjYTEyZWI1MGRmYThmMDQzNTA0N2QwOTZhYTRhMmYmWC1BbXotU2lnbmVkSGVhZGVycz1ob3N0In0.wxbH8eaZ1f-5kH7W73sHWyqyJzl1qFueDRkpK-NakzI)