Render-of-Thought: Rendering Textual Chain-of-Thought as Images for Visual Latent Reasoning

Yifan Wang1,2, Shiyu Li1, Peiming Li1,3, Xiaochen Yang4, Yang Tang1*†, Zheng Wei1*
1Tencent BAC   2Tsinghua University   3Peking University   4University of Glasgow
*Corresponding authors   †Project Lead
{ethanntang, hemingwei}@tencent.com

📄 Abstract

Chain-of-Thought (CoT) prompting has achieved remarkable success in unlocking the reasoning capabilities of Large Language Models (LLMs), but its verbosity imposes substantial computational overhead. Recent latent reasoning methods compress the chain into hidden vectors, yet they often focus exclusively on outcome alignment and provide no supervision over the intermediate reasoning process, obscuring the analyzability of the latent reasoning chain.

To address these challenges, we introduce Render-of-Thought (RoT), the first framework to reify the reasoning chain by rendering textual steps into images, making the latent rationale explicit and traceable. We leverage the vision encoders of existing Vision Language Models (VLMs) as semantic anchors to align the vision embeddings with the textual space. This design ensures plug-and-play implementation without incurring additional pre-training overhead.

🎯 Key Results: Extensive experiments demonstrate that our method achieves 3-4× token compression and substantial inference acceleration compared to explicit CoT, while maintaining competitive performance against other methods.

🌟 Overview

Render-of-Thought

First framework to render textual Chain-of-Thought steps into compact single-line images for latent reasoning.

Semantic Anchors

Leverages frozen vision encoders from VLMs as semantic anchors, ensuring robust alignment without additional pre-training.

3-4× Compression

Achieves significant token compression by encoding verbose textual reasoning into compact visual latent representations.

Fast Inference

Achieves 3.7× speedup on GSM8K-Aug and 4.6× on GSM-Hard compared with explicit CoT.

💡 Motivation

Figure 1: Comparison of Reasoning Paradigms. (a) Explicit CoT relies on verbose textual generation. (b) Implicit CoT compresses reasoning into latent space. (c) Render-of-Thought utilizes visual rendering as semantic anchors to structure the latent reasoning process.

While Chain-of-Thought enhances reasoning capabilities, its verbosity leads to prolonged inference latency and excessive memory consumption. Prior latent reasoning methods often compress thoughts into opaque vectors without explicit constraints, obscuring the analyzability of the reasoning chain.

Render-of-Thought proposes a paradigm shift by visualizing the reasoning chain. By rendering textual steps into images, we leverage the high information density of visual modalities while keeping the rationale explicit and traceable.

🔬 Method

Figure 2: Overview of Render-of-Thought. (a) Rendering Method transforms textual reasoning steps into compact single-line images. (b) Latent Reasoning Method aligns LLM-generated hidden states with visual features via a projection head.

CoT Rendering

The rendering module transforms text into single-line images with dynamic width and a fixed 32 px height, using a 20 px font and 4 px padding with black text on a white background. This layout ensures image patches are extracted left to right, naturally aligning the visual sequence with the text order.
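
A minimal rendering sketch under these assumptions: Pillow is used as the backend and DejaVuSans as the font (the paper's exact renderer and typeface are not stated here), with the layout constants above.

```python
from PIL import Image, ImageDraw, ImageFont

# Layout constants from the paper: fixed 32 px height, 20 px font, 4 px padding.
IMG_HEIGHT = 32
FONT_SIZE = 20
PADDING = 4

def render_step(text: str, font_path: str = "DejaVuSans.ttf") -> Image.Image:
    """Render one reasoning step as a single-line image with dynamic width."""
    font = ImageFont.truetype(font_path, FONT_SIZE)
    # Measure the text extent with a throwaway draw context.
    probe = ImageDraw.Draw(Image.new("RGB", (1, 1)))
    left, top, right, bottom = probe.textbbox((0, 0), text, font=font)
    width = right + 2 * PADDING
    # Black text on a white background, vertically centered.
    img = Image.new("RGB", (width, IMG_HEIGHT), "white")
    draw = ImageDraw.Draw(img)
    y = (IMG_HEIGHT - (bottom - top)) // 2 - top
    draw.text((PADDING, y), text, fill="black", font=font)
    return img

step_img = render_step("48 / 2 = 24")  # patches are then read left to right by the vision encoder
```

Because the width grows with the text while the height stays fixed, the number of image patches scales roughly with the length of the reasoning step.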

📚 Two-Stage Training Framework

Figure 3: Two-stage training pipeline. Stage I optimizes the projection head for visual alignment. Stage II fine-tunes the LLM for autoregressive latent reasoning.

Stage I: Visual Alignment

Freeze both the LLM backbone and the vision encoder. Train only the visual projection head to map LLM hidden states onto the visual embeddings of the rendered reasoning steps, using an MSE loss.
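
A minimal PyTorch sketch of Stage I, assuming the projection head is a small MLP and that the frozen backbone's hidden states and the frozen vision encoder's embeddings of the rendered images are precomputed; the head architecture, dimensions, and optimizer settings below are placeholders, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class VisualProjectionHead(nn.Module):
    """Maps LLM hidden states into the frozen vision encoder's embedding space."""
    def __init__(self, llm_dim: int, vision_dim: int, hidden_dim: int = 2048):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(llm_dim, hidden_dim),
            nn.GELU(),
            nn.Linear(hidden_dim, vision_dim),
        )

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        return self.mlp(h)

head = VisualProjectionHead(llm_dim=2560, vision_dim=1152)  # dims are placeholders
optimizer = torch.optim.AdamW(head.parameters(), lr=1e-4)
mse = nn.MSELoss()

def stage1_step(llm_hidden: torch.Tensor, vision_target: torch.Tensor) -> float:
    """One alignment step.

    llm_hidden:    hidden states from the frozen LLM backbone, shape (B, T, llm_dim)
    vision_target: patch embeddings of the rendered CoT images from the frozen
                   vision encoder, shape (B, T, vision_dim)
    """
    loss = mse(head(llm_hidden), vision_target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```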

Stage II: Latent Supervised Fine-Tuning

Freeze the vision encoder and the projection head. Fine-tune the LLM backbone with LoRA to generate visual reasoning trajectories followed by the final answer, so that the model learns to navigate the continuous latent space toward accurate solutions.
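
For intuition, here is a hedged sketch of the latent reasoning loop this stage trains the backbone to perform, assuming a Coconut-style feedback: each step's last hidden state is projected through the frozen head into the visual embedding space and re-injected as the next input embedding (via the VLM's own vision-to-language connector, called `vl_connector` below as an assumed name) for a fixed latent budget, after which the answer is decoded as ordinary text. This is an illustrative reconstruction, not the released implementation.

```python
import torch

@torch.no_grad()
def latent_reason(model, tokenizer, head, vl_connector, question: str,
                  num_latent: int = 32, max_new: int = 64) -> str:
    """Latent phase first, then greedy textual decoding of the final answer."""
    ids = tokenizer(question, return_tensors="pt").input_ids
    embeds = model.get_input_embeddings()(ids)

    # Latent phase: re-inject the projected last hidden state as the next input embedding.
    for _ in range(num_latent):
        out = model(inputs_embeds=embeds, output_hidden_states=True)
        h_last = out.hidden_states[-1][:, -1:, :]        # last hidden state of the backbone
        v = head(h_last)                                 # into the vision-encoder embedding space
        embeds = torch.cat([embeds, vl_connector(v)], dim=1)  # back into the LLM input space

    # Answer phase: standard greedy decoding conditioned on the latent trajectory.
    answer = []
    for _ in range(max_new):
        logits = model(inputs_embeds=embeds).logits[:, -1, :]
        nxt = logits.argmax(dim=-1, keepdim=True)
        if nxt.item() == tokenizer.eos_token_id:
            break
        answer.append(nxt.item())
        embeds = torch.cat([embeds, model.get_input_embeddings()(nxt)], dim=1)
    return tokenizer.decode(answer)
```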

📊 Experiments

Main Results on Grade-School Reasoning

We evaluate Render-of-Thought across four mathematical reasoning datasets using three VLM architectures:

| Method | GSM8k-Aug | GSM-Hard | SVAMP | MultiArith | Avg Pass@1 | Avg #L |
|---|---|---|---|---|---|---|
| *Qwen3-VL-4B-Instruct* | | | | | | |
| SFT-w/o CoT | 26.2 | 9.48 | 70.0 | 85.6 | 47.8 | 0 |
| SFT-CoT | 81.2 | 53.4 | 84.3 | 98.3 | 79.3 | 108.4 |
| RoT (Ours) | 37.8 | 14.1 | 72.7 | 97.2 | 55.4 | 32.0 |

💡 Key Insight: On MultiArith, RoT achieves 97.2% accuracy with only 32 tokens, nearly matching explicit CoT (98.3%) while using 3.4× fewer tokens.

Comparison with LLM-based Latent Reasoning Methods

| Method | Base Model | GSM8k-Aug | GSM-Hard | SVAMP | MultiArith | Average |
|---|---|---|---|---|---|---|
| iCoT | Qwen3-4B | 13.5 | 4.09 | 36.9 | 49.2 | 25.9 |
| Coconut | Qwen3-4B | 16.9 | 5.42 | 43.6 | 60.3 | 31.6 |
| CODI | Qwen3-4B | 7.28 | 2.20 | 11.0 | 18.3 | 9.70 |
| CoLaR-2 | Qwen3-4B | 40.0 | 9.17 | 57.7 | 82.2 | 47.3 |
| RoT (Ours) | Qwen3-VL-4B | 37.8 | 14.1 | 72.7 | 97.2 | 55.4 |

RoT outperforms the best LLM-based method (CoLaR-2) by 8.1 points in average accuracy. While CoLaR-2 is slightly more accurate on GSM8k-Aug, RoT demonstrates superior robustness in out-of-domain generalization, benefiting from the rich semantic representations of pre-trained vision encoders.

⚡ Inference Efficiency

Figure 4: Inference time comparison (seconds per sample) on GSM8k-Aug and GSM-Hard using a single NVIDIA H20 GPU.

🔍 Visualization & Analysis

Figure 5: Characterizations of Latent Visual Tokens. We present a case from the MATH dataset showing (a) Vision Embeddings Heatmap, (b) Token Similarity Matrix, and (c) Statistical Properties, demonstrating structured semantic encoding within the continuous visual latent space.

We observed an interesting phenomenon: output tokens tend to become increasingly homogeneous after a certain position. The values in the token similarity matrix approach 1.0, feature activation heatmaps become nearly identical, and statistical properties stabilize.

This suggests that the model effectively encodes core reasoning logic in the initial phase, after which the latent states enter a saturation plateau. These subsequent high-similarity tokens likely serve to maintain semantic context for decoding the final answer, rather than introducing new reasoning steps.
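
A token similarity matrix of this kind can be computed as plain pairwise cosine similarity between the latent visual tokens of one sample; the sketch below uses placeholder shapes.

```python
import torch
import torch.nn.functional as F

def token_similarity_matrix(latent_tokens: torch.Tensor) -> torch.Tensor:
    """Pairwise cosine similarity between latent visual tokens.

    latent_tokens: (T, D) tensor of the T latent states produced for one sample.
    Returns a (T, T) matrix; a late block of values near 1.0 corresponds to the
    saturation plateau described above.
    """
    normed = F.normalize(latent_tokens, dim=-1)
    return normed @ normed.T

# Example with placeholder sizes: 32 latent tokens of dimension 1152.
sim = token_similarity_matrix(torch.randn(32, 1152))
```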

🚀 Outlook

While Render-of-Thought demonstrates promising results in mathematical and logical reasoning, we envision several exciting directions for future research:

  • Broadening Reasoning Domains: Extending the visual latent reasoning paradigm to other complex domains such as commonsense reasoning, causal inference, and multilingual settings.
  • Adaptive Token Budgets: Developing mechanisms to dynamically adjust the latent reasoning length based on problem difficulty, further optimizing the balance between accuracy and efficiency.
  • Advanced Rendering & Caching: Investigating more efficient text-to-image rendering strategies and caching mechanisms to further reduce training overhead for large-scale deployments.
  • Optimizing Dynamic Termination: Improving the stability of self-regulated stopping in continuous latent spaces to move beyond fixed token budgets.

📚 Citation

@article{wang2026rot,
  title={Render-of-Thought: Rendering Textual Chain-of-Thought as Images for Visual Latent Reasoning},
  author={Wang, Yifan and Li, Shiyu and Li, Peiming and Yang, Xiaochen and Tang, Yang and Wei, Zheng},
  journal={arXiv preprint arXiv:2601.14750},
  year={2026}
}