Training Strategies for Efficient Embodied Reasoning

UC Berkeley, Stanford University, Physical Intelligence

ECoT-Lite enjoys the performance benefits of embodied reasoning without the test-time computational cost.

Abstract

Robot chain-of-thought reasoning (CoT)—wherein a model predicts helpful intermediate representations before choosing actions—provides an effective method for improving the generalization and performance of robot policies, especially vision-language-action models (VLAs). However, such approaches suffer from core limitations, like requiring specialized robot reasoning data and incurring slow inference speeds. To design new robot reasoning approaches that address these issues, a more complete characterization of why reasoning helps policy performance is critical. We hypothesize several mechanisms by which robot reasoning improves policies: (1) better representation learning, (2) improved learning curricularization, and (3) increased expressivity. We then devise simple variants of robot CoT reasoning to isolate and test each one. We find that learning to generate reasonings does lead to better VLA representations, while attending to the reasonings helps the policy actually leverage these features for improved action prediction. Our results provide a better understanding of why CoT reasoning helps VLAs, which we use to introduce two simple and lightweight alternative recipes for robot reasoning. Our proposed approaches achieve significant performance gains over non-reasoning policies, state-of-the-art results on the LIBERO-90 benchmark, and a 3x inference speedup compared to standard robot reasoning.

TLDR and Takeaways

  • We test various hypotheses for why embodied chain-of-thought reasoning (ECoT) improves robot policy performance, extracting insights for developing better robot reasoning approaches.
  • Our results mainly support the hypothesis that reasoning improves representation learning. This contrasts with the expressivity arguments usually made for why reasoning helps LLMs.
  • We introduce two approaches for learning representations from reasonings without producing them at test time, which we call ECoT-Lite. We validate their effectiveness in the LIBERO simulator and real-world Bridge manipulation tasks.
  • We find ECoT-Lite is faster than standard ECoT while being more performant than standard VLAs (even slightly surpassing state-of-the-art performance in LIBERO).

Why Does Reasoning Improve Policy Performance?

Hypotheses

To develop better embodied robot reasoning methods, we first need to understand why reasoning helps robot policies in the first place. We present three hypotheses for why this might be the case:

  • Hyp. 1: Better Representation Learning: The reasoning contains task-relevant features, so learning to reason embeds this information in the model's representations, even if reasonings are not generated at test time.
  • Hyp. 2: Improved Learning Curricularization: Having reasonings in-context during training signals which features are important for choosing actions, features that must otherwise be learned from scratch.
  • Hyp. 3: Increased Expressivity: More test-time in-context tokens improve the model's expressivity, even if the tokens lack semantic meaning.

Policy Variants

Schematic illustrating all policy variants for testing our three hypotheses.

We create simple variants of Embodied Chain-of-Thought Reasoning (ECoT) to isolate and test each of these hypotheses (a code sketch after the list illustrates how each variant assembles its training sequence):

  • Reasoning Pre-training and Co-training: Rather than learning to generate reasonings and actions in the same context, the policy can be trained on reasonings and actions separately.
    • In pre-training, the policy is first trained exclusively to reason, then fine-tuned exclusively on actions.
    • In co-training, the policy is trained to reason and act in separate datapoints within the same batch.
  • Reasoning Dropout: The model is trained to generate reasonings and actions in sequence (as with standard ECoT), but the reasonings are randomly dropped out. At test time, the reasonings can be turned on and off as needed.
  • Reasoning Scaffolding: The policy has reasonings in-context during training, but no loss is assigned to them (the policy learns to attend to reasonings, not to generate them). This is akin to teacher-student learning, wherein the policy accesses "oracle" features during training only.
  • Thinking Tokens: The policy is given many meaningless "thinking" tokens in-context to increase its expressivity. Such approaches have been used in other domains as well.
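To make these variants concrete, here is a minimal sketch (ours, not the authors' released code) of how each one assembles a training sequence and its loss mask. The names `obs_tokens`, `reasoning_tokens`, and `action_tokens` are hypothetical placeholders for the tokenized observation/instruction prompt, the ECoT reasoning string, and the discretized action; the dropout probability and 50/50 co-training mix are illustrative assumptions.

```python
import random

IGNORE = -100  # common "no loss" label value for token-level cross-entropy

def make_example(obs_tokens, reasoning_tokens, action_tokens,
                 variant="ecot", drop_p=0.5, n_think=32, think_id=0):
    """Return (input_ids, labels); positions labeled IGNORE receive no loss."""
    prefix = [IGNORE] * len(obs_tokens)  # never supervise the prompt itself
    if variant == "ecot":
        # Standard ECoT: supervise both the reasoning and the actions.
        seq = obs_tokens + reasoning_tokens + action_tokens
        labels = prefix + reasoning_tokens + action_tokens
    elif variant == "dropout":
        # Reasoning dropout: the reasoning is randomly omitted entirely.
        if random.random() < drop_p:
            seq = obs_tokens + action_tokens
            labels = prefix + action_tokens
        else:
            seq = obs_tokens + reasoning_tokens + action_tokens
            labels = prefix + reasoning_tokens + action_tokens
    elif variant == "scaffolding":
        # Reasoning scaffolding: reasoning is in-context but unsupervised,
        # so the policy attends to it without learning to generate it.
        seq = obs_tokens + reasoning_tokens + action_tokens
        labels = prefix + [IGNORE] * len(reasoning_tokens) + action_tokens
    elif variant == "thinking":
        # Thinking tokens: meaningless filler tokens inserted before actions.
        seq = obs_tokens + [think_id] * n_think + action_tokens
        labels = prefix + [IGNORE] * n_think + action_tokens
    else:
        raise ValueError(f"unknown variant: {variant}")
    return seq, labels

def make_cotraining_example(obs_tokens, reasoning_tokens, action_tokens):
    """Co-training: reasoning-only and action-only datapoints share batches.
    (Pre-training uses the same split, run as two sequential phases.)"""
    if random.random() < 0.5:  # reasoning-only datapoint
        return (obs_tokens + reasoning_tokens,
                [IGNORE] * len(obs_tokens) + reasoning_tokens)
    return (obs_tokens + action_tokens,  # action-only datapoint
            [IGNORE] * len(obs_tokens) + action_tokens)
```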

Experiments and Results

Our ECoT-Lite policies outperform standard VLAs in both simulation and real-world evaluations, while being significantly faster than standard ECoT.

We train VLAs using all the above recipes, as well as ECoT and standard (non-reasoning) VLA baselines (all using the MiniVLA architecture and codebase). We evaluate all of these policies on the LIBERO-90 simulated manipulation benchmark. We find that the most performant variants are reasoning dropout and reasoning pre-training, which we collectively call ECoT-Lite. We then reproduce and validate these approaches with OpenVLA-based policies on real-world Bridge WidowX manipulation tasks.

LIBERO-90 Experiments

We find that full ECoT and our reasoning dropout policy achieve the best performance on LIBERO-90 (90.8% and 89.4%, respectively), slightly outperforming the past state of the art (88.6%). The reasoning pre-training policy comes close (87.1%) as well. Notably, neither reasoning dropout nor reasoning pre-training produces reasonings at test time, making both much faster than ECoT.

In contrast, reasoning co-training is less effective than these two approaches (84.2%), even though it likewise learns representations from reasonings. This suggests that not all representation learning approaches are equally conducive to positive transfer between the reasoning and action prediction tasks. Nevertheless, co-training outperforms the non-reasoning standard VLA baseline (82.0%). All of this strongly supports Hyp. 1.

We also find that reasoning scaffolding performs comparably to co-training (84.1%), weakly supporting Hyp. 2. However, thinking tokens perform worse than the standard VLA (78.9-79.8%), which is evidence against Hyp. 3. This suggests that expanded expressivity is not helpful for robot reasoning (even with the small 1B-parameter MiniVLA) and that reasoning tokens must be semantically meaningful to be useful.

These analyses are repeated in harder variants of the LIBERO-90 benchmark, wherein we perturb the task objects' starting locations and add distractors. The above trends hold for these challenge splits as well.

Bridge Experiments

We find that reasoning dropout and reasoning pre-training are effective for real-world manipulation tasks conducted in the Bridge WidowX environment as well. As in LIBERO, both approaches improve over the standard behavior cloning (BC) recipe while being around 3x faster than ECoT.

Many of the common failure modes of the reasoning dropout policy in Bridge seem to be fixed by having test-time reasonings.

Curiously, reasoning pre-training is more performant than reasoning dropout in Bridge, reversing the trend seen in LIBERO. We suspect this is because reasoning dropout is most effective in narrower domains; many of the dropout policy's failure modes appear fixable simply by "turning on" test-time reasonings. In contrast, reasoning pre-training maintains strong generalization performance: it matches or outperforms standard ECoT in all but one semantic generalization Bridge task.

Which Robot Reasoning Approach is Best for My Problem?

Based on our findings, we offer prescriptions for when to use each of the above robot reasoning approaches.

Standard Embodied Chain-of-Thought Reasoning policies are the most performant, but they are significantly slower than the other approaches. They are best for tasks where performance is critical and speed is not a concern.

As both ECoT-Lite strategies do not generate test-time reasonings, they are much faster, matching the inference speeds of standard VLA policies.
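As rough intuition for where the speedup comes from (a back-of-the-envelope argument, not a measured breakdown): autoregressive decoding cost grows roughly linearly with the number of generated tokens, so per-step inference time scales approximately as t_prompt + c * (N_reason + N_action) for ECoT versus t_prompt + c * N_action for ECoT-Lite. Since ECoT's reasoning prefix is many times longer than the short action token sequence, skipping it accounts for the roughly 3x end-to-end speedup.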

Reasoning pre-training seems more effective in diverse domains: it matches or surpasses the semantic generalization capabilities of ECoT in all but one Bridge task. Because the approach tunes on just reasonings and just actions separately, it also does not require paired embodied reasoning data. However, because it must train on these tasks sequentially, it needs more total gradient steps to expose the policy to an equal amount of reasoning and action data.

Finally, reasoning dropout seems more performant in narrow domains, like LIBERO (where it matches full ECoT and surpasses state-of-the-art performance). Additionally, its training recipe is almost identical to standard ECoT's; the only difference is that reasonings are occasionally dropped during training. Because the policy thus sees both reasoning and reasoning-free contexts during training, users can turn test-time reasonings on and off as needed, as sketched below.
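For illustration, here is a minimal sketch of that toggle (an assumed interface, not the released inference code): `model.generate`, the `<action>` stop marker, the reasoning budget, and the 7-token action length are all assumptions.

```python
def act(model, obs_tokens, with_reasoning=False,
        reason_budget=256, n_action_tokens=7):
    """Decode an action, optionally generating a reasoning chain first.

    `model.generate` stands in for any autoregressive decoding call; the
    stop marker, token budgets, and action length are illustrative
    assumptions, not values from the paper.
    """
    ctx = list(obs_tokens)
    if with_reasoning:
        # Decode the reasoning first; it then conditions the action tokens.
        ctx += model.generate(ctx, max_new_tokens=reason_budget,
                              stop="<action>")
    # With reasoning off, this short decode is all that runs at test time.
    return model.generate(ctx, max_new_tokens=n_action_tokens)
```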

BibTeX

@article{Chen25-ecot-lite,
  title={Training Strategies for Efficient Embodied Reasoning},
  author={William Chen and Suneel Belkhale and Suvir Mirchandani and Oier Mees and Danny Driess and Karl Pertsch and Sergey Levine},
  journal={arXiv preprint arXiv:2505.08243},
  year={2025},
}