Vish Sangale
March 2026

Beyond the Click: Slate-Q for Sequential Recommendation

Most recommendation systems are designed to maximize immediate engagement—the “next click.” However, true user value is built over entire sessions. In this project, RL-RECSYS, I explored how Reinforcement Learning can be used for slate optimization, where an agent must select a set of items to maximize long-term rewards rather than just myopic clicks.

1. The Combinatorial Bottleneck

The key challenge in slate recommendation is the combinatorial explosion of the action space. For a catalog of $N$ items and a slate size $K$, there are $\binom{N}{K}$ possible slates. Even a modest catalog of $N = 1{,}000$ items with $K = 10$ yields roughly $2.6 \times 10^{23}$ candidate slates, far too many to enumerate, let alone score.

The Slate-Q Decomposition

I implemented the Slate-Q decomposition (Ie et al., 2019), which makes this problem tractable by decomposing the Q-value of a slate $A$ into item-wise Q-values:

\[Q(s, A) = \sum_{i \in A} P(i | s, A) \left[ R(s, i) + \gamma \sum_{s'} P(s' | s, i) \max_{A'} Q(s', A') \right]\]

By learning the Long-Term Value (LTV) of individual items (the bracketed term above), we can build the slate with a simple top-$K$ selection over item-wise returns, assuming a Multinomial Logit (MNL) user choice model:

\[P(i | s, A) = \frac{\exp(\text{score}(s, i))}{\sum_{j \in A \cup \{\text{null}\}} \exp(\text{score}(s, j))}\]
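To make the decomposition concrete, here is a minimal Python sketch of both pieces: the MNL choice model with a null (no-click) option, and greedy top-$K$ slate construction that ranks items by MNL weight times item-wise LTV, in the spirit of the SlateQ paper's top-$K$ variant. The function names and inputs are illustrative, not the actual RL-RECSYS API.

```python
import numpy as np

def mnl_choice_probs(scores, null_score=0.0):
    """P(i | s, A): MNL choice over slate items plus a 'no click' null option.

    scores: score(s, i) for each item i in the slate A, shape [K].
    Returns each slate item's click probability; the null option absorbs
    the remaining probability mass.
    """
    logits = np.append(scores, null_score)  # last entry is the null item
    logits -= logits.max()                  # stabilize the softmax
    w = np.exp(logits)
    return (w / w.sum())[:-1]               # drop the null entry

def topk_slate(scores, item_ltv, k):
    """Greedy top-k slate: rank items by MNL weight x item-wise LTV.

    scores:   score(s, i) over the full catalog, shape [N].
    item_ltv: learned long-term value per item, shape [N]; assumed
              non-negative (e.g., expected clicks or watch time).
    """
    rank_key = np.exp(scores) * item_ltv    # v(s, i) * Q(s, i)
    return np.argsort(-rank_key)[:k]        # indices of the chosen slate
```

The payoff is computational: building a slate now costs one pass over $N$ item values plus a sort, instead of scoring $\binom{N}{K}$ candidate slates.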

2. Experimental Setup: The Latent-Factor Environment

To evaluate the agent, I developed a 20-dimensional synthetic latent-factor environment.
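The post does not spell out the environment's internals, so the sketch below is one plausible construction; every design choice here (dot-product scores, interest drift on click, boredom-driven termination) is an assumption for illustration, not the actual RL-RECSYS environment.

```python
import numpy as np

class LatentFactorEnv:
    """Illustrative 20-d latent-factor session environment (assumed design)."""

    def __init__(self, n_items=1000, dim=20, slate_size=10, seed=0):
        self.rng = np.random.default_rng(seed)
        self.items = self.rng.normal(size=(n_items, dim))
        self.items /= np.linalg.norm(self.items, axis=1, keepdims=True)
        self.slate_size = slate_size
        self.dim = dim

    def reset(self):
        self.user = self.rng.normal(size=self.dim)
        self.user /= np.linalg.norm(self.user)
        self.t = 0
        return self.user.copy()

    def step(self, slate):
        """slate: array of item indices, length slate_size."""
        scores = self.items[slate] @ self.user   # affinity in [-1, 1]
        w = np.exp(scores)                       # MNL weights
        probs = w / (w.sum() + 1.0)              # +1.0 = exp(0), the null item
        clicked = self.rng.random() < probs.sum()
        reward = 0.0
        if clicked:
            i = self.rng.choice(len(slate), p=probs / probs.sum())
            reward = float(scores[i] + 1.0)      # shift so all clicks pay off
            # interest drift: the user state moves toward the consumed item
            self.user = 0.9 * self.user + 0.1 * self.items[slate[i]]
            self.user /= np.linalg.norm(self.user)
        self.t += 1
        done = self.t >= 50 or not clicked       # a no-click ends the session
        return self.user.copy(), reward, done, {}
```

Because a session ends when the user declines the whole slate, the agent is rewarded for keeping sessions alive, which is exactly the long-term signal a myopic CTR objective misses.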

3. Trade-offs and Ablations: Finding the Optimal Slate Size

A critical research question in this project was: How does increasing the slate size affect the agent’s ability to learn and the user’s long-term value?

I conducted an ablation study varying env.slate_size from 5 to 15; the results are below, with a sketch of the sweep harness after the table:

| Slate Size | Avg. Reward (LTV) | CTR (Immediate) | Learning Stability |
|---|---|---|---|
| 5 | 4.2 | 0.12 | Highly stable |
| 10 | 8.1 | 0.18 | Stable |
| 12 | 9.9 | 0.21 | Peak performance |
| 15 | 9.4 | 0.22 | High variance |
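For reference, the sweep itself is just a loop over env.slate_size. The harness below reuses the LatentFactorEnv sketch above and, for brevity, scores a uniform-random policy rather than a trained Slate-Q agent, so it illustrates the shape of the ablation rather than reproducing the numbers in the table.

```python
def avg_session_reward(env, episodes=200):
    """Mean total session reward (an LTV proxy) under a random slate policy."""
    totals = []
    for _ in range(episodes):
        env.reset()
        total, done = 0.0, False
        while not done:
            slate = env.rng.choice(len(env.items), size=env.slate_size,
                                   replace=False)
            _, r, done, _ = env.step(slate)
            total += r
        totals.append(total)
    return float(np.mean(totals))

for k in (5, 10, 12, 15):
    env = LatentFactorEnv(slate_size=k, seed=42)
    print(f"slate_size={k}: avg_reward={avg_session_reward(env):.2f}")
```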

Key Learning: Performance peaked at a slate size of 12. Beyond that point, the larger action space introduced higher variance into the learning updates, and the marginal gain in immediate CTR (0.21 to 0.22) was outweighed by the user's position bias (fatigue), so total LTV fell from 9.9 to 9.4.

4. Empirical Benchmarks: The Leaderboard

The implementation was benchmarked against several baselines using the avg_reward metric; the improvement column reports the absolute gain in avg_reward over the Random Agent baseline:

| Rank | Agent | Source | Avg. Reward | Improvement vs. Baseline |
|---|---|---|---|---|
| 1 | Slate-Q (Optimized) | wandb-offline | 9.90 | +8.90 |
| 2 | Slate-Q (Default) | mlflow | 4.00 | +3.00 |
| 3 | Myopic DQN | mlflow | 3.20 | +2.20 |
| 4 | Random Agent | baseline | 1.00 | - |

Insights from TensorBoard

The learning curves (see outputs/ablation-slate-size.svg) revealed that Slate-Q’s advantage over the Myopic DQN only becomes significant after episode 500. Early in training, the Myopic agent’s simpler objective produces faster initial gains, but it eventually settles into a local optimum, repeatedly recommending the same high-CTR items and driving rapid user boredom and shorter sessions.

Conclusion

RL-RECSYS demonstrates that computational tractability is a force multiplier for recommendation quality. By moving to a slate-aware Q-decomposition, we can build agents that manage user state across whole sessions, trading immediate clicks for long-term session health.


View the full ablation study and results on GitHub: vishsangale/RL-RECSYS