ResearchInsight 17

A technical review of brain encoding models for content prediction

TRIBE v2, MindEye, MindEye2, Huth Lab, Algonauts. A neutral reviewer's map of the 2026 applied encoding stack.

OpenAffect Research · 16 min read

Why this review exists

Forward neural encoding has moved from an academic neuroscience subfield to an applied prediction tool in the last three years. The combination of TRIBE v2 (Meta FAIR), MindEye and MindEye2, the Huth Lab semantic atlas work, and the Algonauts benchmark constitutes a genuinely new technology stack for predicting cortical response to naturalistic stimuli.

Any company doing serious behavioral prediction in 2026 should know this stack cold. This is an honest reviewer's map: what each system does, where it is strong, where it is weak, and what it does not do. It is written so a researcher can evaluate a vendor claim without having to read the underlying preprints.

Commercial vendors make neural claims with no reference to the open stack. Reviewers need a map. This is one.
Figure 01. The Yeo 2011 seven-network parcellation is the reference atlas most applied encoding work aggregates predictions into: visual, somatomotor, dorsal attention, ventral attention, limbic, frontoparietal, and default mode. A well-trained encoder predicts the activation pattern of every vertex; the parcellation gives you something interpretable to read off.

Definitions

  • Forward encoding model. Maps stimulus features to predicted brain response. Inverse of decoding.
  • Decoding model. Maps brain response back to stimulus or cognitive state. Opposite direction.
  • Naturalistic stimuli. Video, audio, text, mixed. Contrast with controlled single-feature laboratory stimuli.
  • Voxel-level prediction. The voxel is the spatial unit of fMRI; a well-trained encoder outputs a predicted BOLD response per voxel per time step.
  • Inter-subject correlation (ISC). Measures consistency of neural response across viewers. Used heavily in naturalistic paradigms to filter noise.
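
For concreteness, here is a minimal sketch of leave-one-out ISC, the most common variant: each subject's time course is correlated with the average of everyone else's. The array shapes and noise level below are illustrative.

```python
import numpy as np

def leave_one_out_isc(responses: np.ndarray) -> np.ndarray:
    """Leave-one-out inter-subject correlation.

    responses: (n_subjects, n_timepoints), the response of one voxel or
    parcel for each subject watching the same stimulus. Returns one value
    per subject: that subject's time course against the mean of the rest.
    """
    n_subjects = responses.shape[0]
    isc = np.empty(n_subjects)
    for s in range(n_subjects):
        others = np.delete(responses, s, axis=0).mean(axis=0)
        isc[s] = np.corrcoef(responses[s], others)[0, 1]
    return isc

# Toy example: 10 subjects, 300 time points of synthetic data.
rng = np.random.default_rng(0)
shared = rng.standard_normal(300)                      # stimulus-driven signal
data = shared + 0.8 * rng.standard_normal((10, 300))   # plus subject-specific noise
print(leave_one_out_isc(data).round(2))
```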

TRIBE v2 (Meta FAIR)

TRIBE v2 is transformer-based forward encoding trained across subjects and stimuli. Architecture: multi-modal backbone fusing DINOv2 visual, wav2vec 2.0 audio, and Llama-family language features; cross-attention fusion; subject-specific readout heads on a shared latent. Training: over 1,000 hours of fMRI from approximately 720 subjects, drawn from Algonauts 2023, Natural Scenes Dataset, and movie-watching paradigms (CNeuroMod Friends, MOMA, StudyForrest).
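The subject-specific readout on a shared latent is the architecturally interesting part. Below is a minimal PyTorch-style sketch of that pattern only: the class name, layer sizes, and two-layer trunk are illustrative placeholders, not the released TRIBE v2 architecture, and feature extraction from the DINOv2 / wav2vec 2.0 / Llama backbones is assumed to have already happened.

```python
import torch
import torch.nn as nn

class SubjectReadoutEncoder(nn.Module):
    """Illustrative shared-latent encoder with per-subject readout heads."""

    def __init__(self, feature_dim: int, latent_dim: int,
                 n_voxels: int, n_subjects: int):
        super().__init__()
        # Shared trunk: maps fused stimulus features to a common latent space.
        self.trunk = nn.Sequential(
            nn.Linear(feature_dim, latent_dim),
            nn.GELU(),
            nn.Linear(latent_dim, latent_dim),
        )
        # One linear readout per subject: latent -> predicted BOLD per voxel.
        self.readouts = nn.ModuleList(
            [nn.Linear(latent_dim, n_voxels) for _ in range(n_subjects)]
        )

    def forward(self, features: torch.Tensor, subject_id: int) -> torch.Tensor:
        # features: (time, feature_dim) of fused stimulus features for one clip
        latent = self.trunk(features)
        return self.readouts[subject_id](latent)   # (time, n_voxels)

model = SubjectReadoutEncoder(feature_dim=1024, latent_dim=256,
                              n_voxels=2000, n_subjects=4)
clip_features = torch.randn(30, 1024)              # 30 time steps of fused features
predicted_bold = model(clip_features, subject_id=2)
print(predicted_bold.shape)                         # torch.Size([30, 2000])
```

Zero-shot generalization to a held-out subject then amounts to fitting (or adapting) only a new readout head while the shared trunk stays fixed.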

Strengths. Cross-subject generalization (zero-shot to held-out subjects). Open weights. Benchmark performance on Algonauts. Full explainer: TRIBE v2 explained.

Limitations. Predicts BOLD, not behavior. Requires careful feature extraction. fMRI temporal resolution (seconds, not milliseconds) constrains what you can ask. Research-only license in current release.

Figure 02. Example TRIBE v2 output: mean predicted cortical activation across a 30-second naturalistic video clip. The map is per-vertex, aggregated to a single summary across the stimulus window. In practice most downstream analysis collapses this into a per-network time course.
Figure 03. Per-timestep predicted BOLD. Encoding models of this class output a full cortical-surface estimate at each discrete time step, which lets downstream analysis identify the exact moments when attention, salience, or decision systems engage.

MindEye and MindEye2 (MedARC, Princeton, Stability AI)

MindEye[1] and MindEye2[2] are primarily decoding systems: they reconstruct viewed images from fMRI. They are instructive in reverse because they establish an aligned representation space that is informative about encoding as well.

Training data. Natural Scenes Dataset (Allen et al. 2022 Nature Neuroscience[3]). Eight subjects viewing over 70,000 unique scenes across 30-40 scanning sessions each. Large, high-SNR, foundational.

Strengths. High-fidelity image reconstruction. Cross-subject alignment techniques (MindEye2 trains a shared subject-agnostic space and adapts per subject).

Limitations. Static images, not video. Decoding is not forecasting. Useful as an alignment reference for encoding models; not directly a content-prediction tool.
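
The cross-subject alignment idea noted under Strengths generalizes beyond decoding: fit a small per-subject adapter from that subject's voxels into a shared representation space and keep everything downstream fixed. A minimal sketch of that pattern, with all shapes, subject names, and the ridge penalty chosen purely for illustration:

```python
import numpy as np
from sklearn.linear_model import Ridge

# Hypothetical shapes: each subject has a different number of voxels, but all
# saw the same n_images stimuli, which live in a shared embedding space
# (e.g. an image-model embedding). The adapter is a per-subject linear map
# from that subject's voxels into the shared space.
n_images, shared_dim = 500, 128
rng = np.random.default_rng(0)
shared_embeddings = rng.standard_normal((n_images, shared_dim))

subject_voxels = {"sub01": 1800, "sub02": 2100}
adapters = {}
for sub, n_vox in subject_voxels.items():
    betas = rng.standard_normal((n_images, n_vox))   # stand-in fMRI betas
    adapter = Ridge(alpha=1e3)                        # regularized linear map
    adapter.fit(betas, shared_embeddings)             # voxels -> shared space
    adapters[sub] = adapter

# A new subject can be aligned with far less data by fitting only this adapter
# while the shared space and everything downstream stay fixed.
aligned = adapters["sub01"].predict(rng.standard_normal((10, 1800)))
print(aligned.shape)   # (10, 128)
```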

Figure 04. A typical downstream use of encoded BOLD: collapse voxels into the seven Yeo networks and read per-network activation over time. This is the operational unit of "neural engagement" in applied work. Spikes in visual and dorsal attention correspond to visually dense moments; default mode rises during narrative self-reference.
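
A minimal sketch of that collapse step, assuming the encoder output is a (time × vertices) array and an integer Yeo label is available for every vertex; the shapes and random data are illustrative.

```python
import numpy as np

YEO_NETWORKS = ["Visual", "Somatomotor", "DorsalAttention", "VentralAttention",
                "Limbic", "Frontoparietal", "DefaultMode"]

def network_timecourses(predicted_bold: np.ndarray,
                        vertex_labels: np.ndarray) -> np.ndarray:
    """Collapse per-vertex predictions into per-network time courses.

    predicted_bold: (n_timepoints, n_vertices) encoder output.
    vertex_labels:  (n_vertices,) integer Yeo label per vertex, 0-6.
    Returns (n_timepoints, 7).
    """
    out = np.zeros((predicted_bold.shape[0], len(YEO_NETWORKS)))
    for net in range(len(YEO_NETWORKS)):
        mask = vertex_labels == net
        out[:, net] = predicted_bold[:, mask].mean(axis=1)
    return out

# Toy example: 40 time steps, 1,000 vertices with random network assignments.
rng = np.random.default_rng(0)
bold = rng.standard_normal((40, 1000))
labels = rng.integers(0, 7, size=1000)
print(network_timecourses(bold, labels).shape)   # (40, 7)
```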

Huth Lab semantic atlas (UT Austin)

Alexander Huth's lab built encoding models trained on story-listening fMRI that map semantic representations across cortex (Huth et al. Nature 2016[4]). The more recent Tang et al. Nature Neuroscience 2023[5] extended this to reconstructing continuous language: a semantic decoder that turns fMRI into text.

Strengths. Elegant language-level mapping. Reproducible across studies. Directly useful for linguistic signal components of content prediction.

Limitations. Language-specific. Requires long scanning sessions per subject, so per-subject calibration overhead is substantial.

Algonauts Project (2019–2024)

The Algonauts Project[6] is a public benchmark for encoding models. The 2023 release featured a whole-brain video encoding challenge. It is the reference for comparing architectures: 3D ResNet, VideoMAE, SlowFast, multimodal transformers.
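
The usual comparison metric behind these benchmarks is voxelwise Pearson correlation between predicted and measured BOLD on held-out stimuli, often normalized by a noise ceiling. A minimal sketch with illustrative shapes:

```python
import numpy as np

def voxelwise_correlation(predicted: np.ndarray, measured: np.ndarray) -> np.ndarray:
    """Pearson r per voxel between predicted and measured BOLD.

    predicted, measured: (n_timepoints, n_voxels) on held-out stimuli.
    Returns (n_voxels,) correlations; a benchmark score is typically their
    mean, sometimes divided by a per-voxel noise ceiling.
    """
    p = predicted - predicted.mean(axis=0)
    m = measured - measured.mean(axis=0)
    num = (p * m).sum(axis=0)
    denom = np.sqrt((p ** 2).sum(axis=0) * (m ** 2).sum(axis=0))
    return num / denom

rng = np.random.default_rng(0)
truth = rng.standard_normal((200, 5000))
pred = truth + rng.standard_normal((200, 5000))   # noisy but correlated predictions
print(voxelwise_correlation(pred, truth).mean().round(2))
```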

Strengths. Standardized evaluation. Open data. Encourages reproducible progress.

Limitations. Benchmark fit is not the same as production reliability. Out-of-distribution generalization is still open. A model that wins Algonauts 2025 may or may not generalize to 2026 advertising creative.

Figure 05. Ventral attention (salience) peaks across a short-form video timeline. Peaks correspond to unexpected cuts, reveals, and surprise moments. In production use this is one of the most useful outputs: it tells you exactly when a viewer's brain is re-orienting.
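
Operationally, those re-orienting moments can be read off a ventral attention time course with a simple peak finder. The threshold, minimum spacing, and TR length below are illustrative choices, not a standard.

```python
import numpy as np
from scipy.signal import find_peaks

# Hypothetical ventral-attention time course: one value per TR (assumed 1.5 s).
rng = np.random.default_rng(0)
salience = np.convolve(rng.standard_normal(120), np.ones(4) / 4, mode="same")

# Peaks at least one standard deviation above the mean, separated by >= 5 TRs.
threshold = salience.mean() + salience.std()
peaks, _ = find_peaks(salience, height=threshold, distance=5)

tr_seconds = 1.5
print("re-orientation moments (s):", (peaks * tr_seconds).round(1))
```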

Natural Scenes Dataset and the training data landscape

NSD (Allen et al. 2022[3]) is public, large, high-SNR. Foundational for image encoding research. The limitation for content prediction work is that NSD is static images, not video, which reduces ecological validity for ad-format work without careful extrapolation.

Broader data landscape: CNeuroMod hosts movie-watching fMRI for a small number of subjects at extremely high depth. OpenNeuro[7] aggregates most shareable neuroimaging datasets. The field is data-rich in some directions (images, short clips), data-poor in others (long-form ads, cross-cultural content, longitudinal naturalistic viewing). Understanding where the training distribution does and does not match your use case is the most important diligence step.

Comparison matrix

Model | Input | Output | Training data | Status | Useful for content prediction
TRIBE v2 | Video + audio + text | BOLD per voxel | Algonauts, NSD, movie-watching (~1,000 h) | Open weights | Yes
MindEye / MindEye2 | fMRI | Reconstructed images | NSD (8 subjects) | Open weights | Reverse direction; useful for alignment
Huth semantic atlas | Listening audio or reading text | Semantic activation maps | Huth lab story-listening corpus | Academic | With caveats
Algonauts submissions | Video clips | Voxel-wise activation (benchmark) | CNeuroMod Friends | Varies | Benchmark only

What brain encoding does not do

Forward encoding does not replace outcome calibration. Predicting BOLD is not predicting sales, engagement, or retention (see the calibration page). Every encoding model must be paired with a separate predictor from neural response to behavioral outcome, and that predictor has its own error bars.
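
A minimal sketch of what that second-stage predictor looks like in practice: a regularized regression from neural summary features to an observed outcome, with cross-validated error bars. Every variable name and the synthetic data are placeholders.

```python
import numpy as np
from sklearn.linear_model import RidgeCV
from sklearn.model_selection import cross_val_score

# Placeholder data: one row per creative. Features are summaries of the encoded
# response (e.g. mean per-network activation, salience peak count); the outcome
# is an observed behavioral metric (e.g. view-through rate).
rng = np.random.default_rng(0)
n_creatives, n_features = 120, 9
neural_features = rng.standard_normal((n_creatives, n_features))
outcome = (neural_features @ rng.standard_normal(n_features) * 0.3
           + rng.standard_normal(n_creatives))

calibrator = RidgeCV(alphas=np.logspace(-2, 3, 20))
scores = cross_val_score(calibrator, neural_features, outcome,
                         cv=5, scoring="r2")

# The cross-validated spread is the error bar the text refers to:
# report it, not just the in-sample fit.
print(f"held-out R^2: {scores.mean():.2f} +/- {scores.std():.2f}")
```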

Cross-subject generalization has improved dramatically but is not solved. Application to new populations, languages, or platforms requires caution and usually fine-tuning.

Open data and open weights accelerate the field. Closed commercial encoders that do not publish architecture, training data, or benchmark results have no credible way to claim advantage over the open stack. The right skeptical posture toward any such vendor is "show the calibration, or assume the encoder is not better than the open baseline."

What OpenAffect uses and why

We build on the open stack. TRIBE-class encoders for the neural signal family. Huth Lab semantic work informs our linguistic modeling. We evaluate against Algonauts and NSD benchmarks as sanity checks. Our commercial model is the fusion layer above the encoder, with calibration against public ad performance (see calibration), not a replacement for the open models.

The durable commercial position is infrastructure that integrates with open research rather than competing against it. The category that tries to own proprietary encoders and refuses to publish benchmarks will lose to the one that builds on top of open science.

References

  1. Scotti et al. MindEye: reconstructing visual experiences from brain activity. NeurIPS 2023.
  2. MindEye2: shared-subject models enable fMRI-to-image with one hour of data. ICML 2024.
  3. Allen et al. A massive 7T fMRI dataset (NSD). Nature Neuroscience 2022.
  4. Huth et al. Natural speech reveals the semantic maps that tile human cerebral cortex. Nature 2016.
  5. Tang, LeBel, Jain, Huth. Semantic reconstruction from non-invasive brain recordings. Nature Neuroscience 2023.
  6. Algonauts Project.
  7. OpenNeuro.
  8. Jain and Huth. Incorporating context into language encoding models for fMRI. 2018.
  9. Caucheteux, Gramfort, King. Evidence of a predictive coding hierarchy. Nature Communications 2022.
  10. Schrimpf et al. The neural architecture of language. PNAS 2021.
  11. Yamins et al. Performance-optimized hierarchical models predict neural responses. PNAS 2014.
  12. Meta FAIR research page.