Beyond the Click: Slate-Q for Sequential Recommendation

Posted Mar 29, 2026

By Vish Sangale

Most recommendation systems are designed to maximize immediate engagement: the “next click.” However, user value is built over entire sessions. In this project, RL-RECSYS, I explored how reinforcement learning (RL) can be used for slate optimization, where an agent must select a set of items to maximize long-term reward rather than just the next click.

1. Recommendation as a Sequential Decision Problem

The key challenge in slate recommendation is the combinatorial explosion of the action space. If you have 10,000 items and a slate size of 10, there are $\binom{10000}{10}$ possible slates (roughly $2.7 \times 10^{33}$), far too many to enumerate when a standard Q-learning agent takes the maximum over actions.
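To get a feel for the scale, the count can be checked directly. This is a quick illustration using only Python's standard library; the variable names are chosen for this example.

```python
import math

n_items, slate_size = 10_000, 10

# Unordered slates: choose 10 distinct items out of 10,000.
n_slates = math.comb(n_items, slate_size)
# Ordered slates (position matters): 10! times as many.
n_ordered = math.perm(n_items, slate_size)

print(f"unordered slates: {n_slates:.3e}")
print(f"ordered slates:   {n_ordered:.3e}")
```

Either way, a Q-function with one output per slate is out of the question at this catalogue size.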

The Slate-Q Decomposition

I implemented the Slate-Q decomposition (Ie et al., 2019), which addresses this by decomposing the Q-value of a slate $A$ into item-wise Q-values:

$$Q(s, A) = \sum_{i \in A} P(i \mid s, A)\, \bar{Q}(s, i)$$

where:

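$\bar{Q}(s, i)$ is the expected Long-Term Value (LTV) of item $i$ in state $s$, conditional on the user selecting it.
$P(i \mid s, A)$ is the User Choice Model, the probability that the user selects item $i$ from slate $A$, typically implemented as a Multinomial Logit (MNL).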
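The decomposition can be sketched in a few lines. This is a minimal illustration rather than the repository's code: the function names, the toy scores, and the explicit "no click" option (assumed here to carry zero long-term value) are all assumptions made for the example.

```python
import math

def mnl_choice_probs(scores, null_score=0.0):
    """Multinomial-logit click probabilities for the items on a slate.

    scores: user-item affinity scores for each slate item (hypothetical inputs).
    null_score: score of the "no click" option, so item probabilities sum to < 1.
    """
    m = max(max(scores), null_score)  # subtract the max for numerical stability
    weights = [math.exp(s - m) for s in scores]
    denom = sum(weights) + math.exp(null_score - m)
    return [w / denom for w in weights]

def slate_q(scores, item_ltv, null_score=0.0):
    """Q(s, A) = sum over slate items of P(i | s, A) * Qbar(s, i)."""
    probs = mnl_choice_probs(scores, null_score)
    return sum(p * q for p, q in zip(probs, item_ltv))

# Toy slate of three items: affinity scores and item-wise long-term values.
scores = [2.0, 1.0, 0.5]
item_ltv = [1.0, 3.0, 2.0]
probs = mnl_choice_probs(scores)
print([round(p, 3) for p in probs], round(slate_q(scores, item_ltv), 3))
```

Note that the most clickable item (the first) has the lowest long-term value here, which is exactly the tension the slate-level Q-value has to balance.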

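Because $\bar{Q}(s, i)$ is learned per item, the agent never has to represent a value for every possible slate. Under a conditional-logit choice model such as MNL, finding the best slate reduces to a fractional program that can be solved as a tractable linear program (Ie et al., 2019).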
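As an illustration of slate construction from item-wise values, here is a sketch that uses a simple greedy heuristic rather than the exact linear program, and compares it against brute-force enumeration on a tiny catalogue. The weights, values, and function names are hypothetical, and the no-click option is assumed to carry zero value.

```python
import math
from itertools import combinations

def slate_value(slate, v, q, v_null=1.0):
    """Expected LTV of a slate under an MNL choice model.

    v[i] is the unnormalised choice weight exp(score_i); q[i] is Qbar(s, i).
    v_null is the weight of the no-click option (assumed to carry zero value).
    """
    denom = v_null + sum(v[i] for i in slate)
    return sum(v[i] * q[i] for i in slate) / denom

def greedy_slate(v, q, k, v_null=1.0):
    """Build a slate by adding, k times, the item that yields the highest value."""
    slate, remaining = [], set(range(len(v)))
    for _ in range(k):
        best = max(remaining, key=lambda i: slate_value(slate + [i], v, q, v_null))
        slate.append(best)
        remaining.remove(best)
    return slate

# Toy catalogue of six items: clickable items tend to have low long-term value.
v = [math.exp(s) for s in [2.0, 1.5, 1.0, 0.5, 0.0, -0.5]]
q = [0.5, 2.0, 1.0, 3.0, 2.5, 4.0]
k = 3

greedy = greedy_slate(v, q, k)
exact = max(combinations(range(len(v)), k), key=lambda a: slate_value(a, v, q))
print("greedy:", sorted(greedy), round(slate_value(greedy, v, q), 3))
print("exact: ", sorted(exact), round(slate_value(exact, v, q), 3))
```

Brute force is only feasible because the catalogue has six items; the greedy pass needs on the order of $k \cdot n$ slate evaluations, which is what makes it usable at realistic catalogue sizes, at the cost of no optimality guarantee.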

2. Simulation in a Latent-Factor Environment

To evaluate Slate-Q, I developed a synthetic latent-factor environment (20-dimensional vectors) where user preferences evolve based on their consumption history. This environment simulates complex user dynamics like “boredom” or “variety-seeking,” providing a rigorous testbed for long-term reward maximization.
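The kind of dynamics described above can be sketched as follows. This is not the environment from the repository: the update rules, the rates (`drift`, `boredom_gain`, `decay`), and all function names are assumptions chosen to show one way interest drift and boredom could interact in a 20-dimensional latent space.

```python
import math
import random

DIM = 20  # latent dimensionality, matching the post

def unit(vec):
    """Scale a vector to unit length (left unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]

def random_unit(rng):
    return unit([rng.gauss(0.0, 1.0) for _ in range(DIM)])

def affinity(user, item):
    """Dot product between user interest and item topic vectors."""
    return sum(u * x for u, x in zip(user, item))

def boredom_penalty(item, boredom):
    """Penalty for directions of latent space the user has consumed a lot of."""
    return sum(b * abs(x) for b, x in zip(boredom, item)) / DIM

def consume(user, item, boredom, drift=0.1, boredom_gain=0.5, decay=0.9):
    """Update the user after consuming an item (all rates are hypothetical)."""
    # Interest drifts toward the consumed item...
    new_user = unit([(1 - drift) * u + drift * x for u, x in zip(user, item)])
    # ...while a decaying per-dimension boredom trace accumulates.
    new_boredom = [decay * b + boredom_gain * abs(x) for b, x in zip(boredom, item)]
    return new_user, new_boredom

rng = random.Random(0)
user, item = random_unit(rng), random_unit(rng)
boredom = [0.0] * DIM
penalty_before = boredom_penalty(item, boredom)
for _ in range(5):  # show the same item five times in a row
    user, boredom = consume(user, item, boredom)
penalty_after = boredom_penalty(item, boredom)
print("affinity:", round(affinity(user, item), 3))
print("boredom penalty:", penalty_before, "->", round(penalty_after, 3))
```

In a setup like this, repeating the same item raises raw affinity but also raises the boredom penalty, so an agent that only chases the next click can wear the user out.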

Results: Slate-Q vs. Myopic DQN

My experiments showed that while a myopic DQN can achieve higher CTR in the first few steps of a session, Slate-Q consistently yields a higher Cumulative Click-Sum and NDCG@10 over long sessions. By intelligently sacrificing a high-probability immediate click for an item that replenishes the user’s “time budget,” the Slate-Q agent sustains engagement for up to 40% longer.
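For reference, the two metrics can be computed as in the sketch below. The post does not specify how relevance labels are defined or which gain function is used, so the linear gain and the 0/1 click encoding here are assumptions.

```python
import math

def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the first k positions (log2 discount)."""
    return sum(rel / math.log2(pos + 2) for pos, rel in enumerate(relevances[:k]))

def ndcg_at_k(relevances, k=10):
    """DCG normalised by the DCG of the ideal (descending) ordering."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0

def cumulative_clicks(session_clicks):
    """Running click total over a session, one 0/1 entry per step."""
    total, out = 0, []
    for c in session_clicks:
        total += c
        out.append(total)
    return out

print(ndcg_at_k([1, 0, 1, 0, 0], k=10))
print(cumulative_clicks([1, 1, 0, 1, 0, 1]))
```

The last value of the running total is the session's click-sum, which is the quantity a long-horizon agent is rewarded for growing even when its early click-through rate is lower.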

3. Industrial Scalability

The repository uses Hydra for hierarchical configuration and Weights & Biases (W&B) for experiment tracking, so it is set up for research at scale. The modular design makes it easy to swap reward models (e.g., Click-Sum vs. Watch-Time) and user choice models, making it a flexible platform for exploring the “next frontier” of RecSys.
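One common way to make components swappable by configuration is a name-to-implementation registry. The sketch below is illustrative only: it does not use Hydra itself, and the `Interaction` type, the registry, and the reward definitions are hypothetical rather than taken from the repository.

```python
from dataclasses import dataclass
from typing import Callable, Dict

@dataclass
class Interaction:
    clicked: bool
    watch_seconds: float

def click_sum_reward(x: Interaction) -> float:
    """One unit of reward per click."""
    return 1.0 if x.clicked else 0.0

def watch_time_reward(x: Interaction) -> float:
    """Reward equal to the seconds watched on a clicked item."""
    return x.watch_seconds if x.clicked else 0.0

# Hypothetical registry: a config value such as `reward: watch_time`
# would select an entry here.
REWARD_MODELS: Dict[str, Callable[[Interaction], float]] = {
    "click_sum": click_sum_reward,
    "watch_time": watch_time_reward,
}

def build_reward(name: str) -> Callable[[Interaction], float]:
    try:
        return REWARD_MODELS[name]
    except KeyError:
        raise ValueError(f"unknown reward model: {name!r}") from None

session = [Interaction(True, 42.0), Interaction(False, 0.0), Interaction(True, 10.5)]
for name in REWARD_MODELS:
    reward = build_reward(name)
    print(name, sum(reward(x) for x in session))
```

The same pattern extends to user choice models: the training loop depends only on the callable's signature, and the configuration decides which implementation is plugged in.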

Conclusion

RL-RECSYS represents a move away from “short-termism” in recommendation design. By framing recommendation as an MDP, we can build agents that truly understand and optimize for the user’s long-term experience.

View the research and experiments on GitHub: vishsangale/RL-RECSYS

Machine Learning, Recommendation Systems

This post is licensed under CC BY 4.0 by the author.