Does the X algorithm watch my videos and look at my images?

Effectively, yes. The live system transcribes the audio of your videos with speech recognition and folds the transcript into the same representation it builds from your text. It embeds your images and your text together in one multimodal model, so a post's pictures are part of what the ranker understands, not decoration around it. And video duration is hydrated as a ranking feature with its own scoring gate. For creators who work in video and images, the algorithm is reading far more of your post than its caption.

Most algorithm advice is written for people who post sentences. The live code tells a different story for everyone who posts in pictures and video: the system is converting your media into language it can rank, and it does so before your post is ever scored.

It transcribes your video's audio

Every video post passes through an automatic speech recognition task. The audio is pulled from the best available variant, transcribed, and attached to the post — and animated GIFs are skipped explicitly, because they have no audio track:

grox/tasks/task_asr.py · L62–L66 @ 0bfc279
transcript = await ASRProcessor.process(post.id, video_url)

if transcript:
    if video.convo_video is not None:
        video.convo_video.asr_transcript = transcript
CODE-CURRENT · 0bfc279 · verified 2026-06-12
Every video post passes through an automatic speech recognition task: the audio is pulled from the best variant, transcribed, and attached to the post as asr_transcript. Animated GIFs are explicitly skipped because they have no audio track.
xai-org/x-algorithm: grox/tasks/task_asr.py, transcription and attachment (L62–L66); GIF skip (L53–L60) · as of the May 15, 2026 release

What you say in a video becomes text the algorithm holds. A creator who never writes a caption but talks for thirty seconds has handed the ranker thirty seconds of indexed, rankable language. The spoken word is not invisible to the machine.

It folds that transcript into the post's meaning

The transcript does not sit in a side field. The multimodal embedder appends it directly to the text it encodes, then bundles that text with the post's images into a single document for one model to embed:

grox/embedder/multimodal_post_embedder_v5.py · L76–L81 @ 0bfc279
if transcript:
    text_with_pads += f"\nTranscript: {transcript}"

document: list[tuple[str, str | bytes]] = [("text", text_with_pads)]
for img in images:
    document.append(("image", img))
CODE-CURRENT · 0bfc279 · verified 2026-06-12
The multimodal post embedder appends a video's transcript directly to the post text, then assembles that text together with the post's images into a single document embedded by one model — text, images, and spoken-word transcript share one representation.
xai-org/x-algorithm: grox/embedder/multimodal_post_embedder_v5.py, transcript append (L76–L77) and document assembly (L79–L81) · as of the May 15, 2026 release

This is the heart of it. The post's caption, the words spoken in its video, and its images are assembled into one document and embedded together (L88–L90, omitted for brevity). A picture in your post is not metadata the ranker checks off — it is inside the vector that represents what your post is. The same was already visible in the spam classifier, which runs on a vision model so images in a reply are inside its judgment.

CODE-CURRENT · 0bfc279 · verified 2026-06-12
The spam classifier runs a Grok vision-language model (VisionSampler, ModelName.VLM_PRIMARY) at temperature 0.000001, scoring the SPAM_COMMENT content category. Images are in scope, the verdict is effectively deterministic at that temperature, and replies are the screened surface.
xai-org/x-algorithm: grox/classifiers/content/spam.py, constructor (L26–L30) · as of the May 15, 2026 release

Video duration is a ranking feature with a gate

Separately from understanding content, the home mixer hydrates each post's minimum video duration as a candidate feature, fetched per tweet from a backing service:

home-mixer/candidate_hydrators/video_duration_candidate_hydrator.rs · L60–L68 @ 0bfc279
let durations = client.get_min_video_durations(tweet_ids.clone()).await;

for tweet_id in tweet_ids {
    let hydrated = match durations.get(&tweet_id) {
        Some(Ok(min_video_duration_ms)) => Ok(PostCandidate {
            min_video_duration_ms: min_video_duration_ms.map(|v| v as i32),
            ..Default::default()
        }),

That duration feeds the scoring formula's video-quality-view (VQV) gate: the VQV signal only earns weight when a post's video exceeds a minimum duration. Shorter clips get zero from that signal and must rely on the other eighteen.

CODE-CURRENT · 0bfc279 · verified 2026-06-12
The video quality view weight is applied only when a post's video exceeds a minimum duration threshold; shorter videos receive zero weight for that signal.
xai-org/x-algorithm: home-mixer/scorers/weighted_scorer.rs, lines 72–80 (vqv_weight_eligibility) · threshold constant lives in the unpublished params module; the gating logic itself is in the release

Some feed requests also carry an exclude_videos flag, which drops every post that has a video duration from the feed entirely.

CODE-CURRENT · 0bfc279 · verified 2026-06-12
After selection, a visibility filter (VFFilter) removes posts that are deleted, spam, violence, gore, and similar categories.
xai-org/x-algorithm: README.md, Filtering section (Post-Selection Filters table, line 317) · as of the May 15, 2026 release

Signal by signal

| In the code | In plain English | Where xDoctor surfaces it |
| --- | --- | --- |
| ASR transcript on every video | Speak clearly and on-topic: your spoken words are transcribed and ranked like your caption. | Coach · Topic fit |
| text + image + transcript in one embedding | Your images carry meaning into the ranker. A strong visual on a thin caption still says something to the model. | Coach · Content profile |
| min_video_duration_ms hydrated | Video length is a real feature. Extremely short clips forfeit the video-quality-view signal. | Coach · Format mix |
| animated GIFs skipped by ASR | GIFs carry no audio signal; they're understood visually only, not by transcript. | |

What the code doesn't say

How much any of this moves your score. The embedding shapes how your post is matched and retrieved, but the released code does not expose how a richer multimodal embedding trades off against a thin one in final ranking, and the VQV weight and minimum-duration threshold both live in the withheld params module. The mechanisms are code-current fact: your audio is transcribed, your images are embedded, your video length is a feature. Any specific "post videos longer than N seconds for a boost" number is not in this code.

UNKNOWN · 0bfc279 · verified 2026-06-12
The numeric values of the current weights are not included in the open-source release: weighted_scorer.rs references a params module (e.g. p::FAVORITE_WEIGHT, p::REPLY_WEIGHT) whose values are not present anywhere in the published repository.
xai-org/x-algorithm (verified by direct inspection of the full repository tree at the pinned SHA): home-mixer/scorers/weighted_scorer.rs references crate::params; no params definitions with weight values exist in the release · absence verified at the pinned SHA; values may be published in a future release

What to do with this

Stop treating media as decoration on a text post. Say your point out loud in video — the transcript is indexed. Put real information in your images — they're embedded into your post's meaning. And know that very short clips give up one of the nineteen scoring signals, so a two-second loop competes on thinner ground than a clip that clears the duration gate. The scoring formula names every signal you're playing for; this page is how your video and images get into it.
