Does the X algorithm watch my videos and look at my images?
Effectively, yes. The live system transcribes the audio of your videos with speech recognition and folds the transcript into the same representation it builds from your text. It embeds your images and your text together in one multimodal model, so a post's pictures are part of what the ranker understands, not decoration around it. And video duration is hydrated as a ranking feature with its own scoring gate. For creators who work in video and images, the algorithm is reading far more of your post than its caption.
Most algorithm advice is written for people who post sentences. The live code tells a different story for everyone who posts in pictures and video: the system is converting your media into language it can rank, and it does so before your post is ever scored.
It transcribes your video's audio
Every video post passes through an automatic speech recognition (ASR) task. The audio is pulled from the best available variant, transcribed, and attached to the post as asr_transcript. Animated GIFs are skipped explicitly because they have no audio track:
```python
transcript = await ASRProcessor.process(post.id, video_url)

if transcript:
    if video.convo_video is not None:
        video.convo_video.asr_transcript = transcript
```
What you say in a video becomes text the algorithm holds. A creator who never writes a caption but talks for thirty seconds has handed the ranker thirty seconds of indexed, rankable language. The spoken word is not invisible to the machine.
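The transcription step described above can be sketched end to end. This is a minimal illustration, not the released implementation: the `Video` and `Variant` classes, the `fake_transcribe` stand-in, and the idea that the "best" variant is the one with the highest bitrate are all assumptions made for the example.

```python
import asyncio
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Variant:
    url: str
    bitrate: int  # assumption: "best variant" means highest bitrate


@dataclass
class Video:
    variants: List[Variant] = field(default_factory=list)
    is_animated_gif: bool = False
    asr_transcript: Optional[str] = None


async def fake_transcribe(post_id: int, video_url: str) -> str:
    """Hypothetical stand-in for a speech recognizer; returns canned text."""
    return f"transcript of {video_url}"


async def attach_transcript(post_id: int, video: Video) -> Optional[str]:
    # Animated GIFs have no audio track, so there is nothing to transcribe.
    if video.is_animated_gif or not video.variants:
        return None
    best = max(video.variants, key=lambda v: v.bitrate)
    transcript = await fake_transcribe(post_id, best.url)
    # Only attach a non-empty transcript, mirroring the `if transcript:` guard.
    if transcript:
        video.asr_transcript = transcript
    return video.asr_transcript


clip = Video(variants=[Variant("low.mp4", 256_000), Variant("high.mp4", 2_000_000)])
gif = Video(variants=[Variant("loop.mp4", 0)], is_animated_gif=True)

print(asyncio.run(attach_transcript(1, clip)))  # transcript of high.mp4
print(asyncio.run(attach_transcript(2, gif)))   # None
```

The GIF check comes first so that a silent loop never reaches the recognizer, which matches the explicit skip the article describes.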
It folds that transcript into the post's meaning
The transcript does not sit in a side field. The multimodal embedder appends it directly to the text it encodes, then bundles that text with the post's images into a single document for one model to embed:
```python
if transcript:
    text_with_pads += f"\nTranscript: {transcript}"

document: list[tuple[str, str | bytes]] = [("text", text_with_pads)]
for img in images:
    document.append(("image", img))
```
The multimodal post embedder appends a video's transcript directly to the post text, then assembles that text together with the post's images into a single document embedded by one model — text, images, and spoken-word transcript share one representation.
This is the heart of it. The post's caption, the words spoken in its video, and its images are
assembled into one document and embedded together (L88–L90, omitted for brevity). A picture in
your post is not metadata the ranker checks off — it is inside the vector that represents what
your post is. The same was already visible in the spam classifier, which runs on a
vision model so images in a reply are inside its judgment.
The spam classifier runs a Grok vision-language model (VisionSampler, ModelName.VLM_PRIMARY) at temperature 0.000001, scoring the SPAM_COMMENT content category — images are in scope, the verdict is deterministic, and replies are the screened surface.
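To make the single-document assembly from this section concrete, here is a self-contained sketch of the same pattern. The function name `build_document` and the sample inputs are hypothetical, and whatever padding the real code applies to produce `text_with_pads` is left out because the excerpt does not show it.

```python
from typing import List, Optional, Tuple, Union

# One document part: a kind tag plus either text or raw image bytes.
Part = Tuple[str, Union[str, bytes]]


def build_document(text: str, images: List[bytes],
                   transcript: Optional[str] = None) -> List[Part]:
    """Assemble caption, optional transcript, and images into one document."""
    if transcript:
        # The transcript is appended to the text itself, not kept in a side field.
        text += f"\nTranscript: {transcript}"
    document: List[Part] = [("text", text)]
    for img in images:
        document.append(("image", img))
    return document


doc = build_document("New trailer", [b"<jpeg bytes>"], transcript="out this Friday")
for kind, payload in doc:
    print(kind, repr(payload))
```

Because the transcript lands inside the text part, a single embedding call over `doc` would see caption, spoken words, and images together, which is the property the article relies on.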
Video duration is a ranking feature with a gate
Separately from understanding content, the home mixer hydrates each post's minimum video duration as a candidate feature. The durations are requested from a backing service in one batched call and then looked up per tweet:
```rust
let durations = client.get_min_video_durations(tweet_ids.clone()).await;

for tweet_id in tweet_ids {
    let hydrated = match durations.get(&tweet_id) {
        Some(Ok(min_video_duration_ms)) => Ok(PostCandidate {
            min_video_duration_ms: min_video_duration_ms.map(|v| v as i32),
            ..Default::default()
        }),
        // The remaining match arms are not shown in this excerpt.
```
That duration feeds the scoring formula's video-quality-view gate: the
VQV signal only earns weight when a
post's video exceeds a minimum duration — shorter clips get zero from that head and lean on the
other eighteen signals.
The video quality view weight is applied only when a post's video exceeds a minimum duration threshold; shorter videos receive zero weight for that signal.
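The gate can be illustrated with a small sketch. Every number here is a placeholder: the real VQV weight and minimum-duration threshold are not in the published code, and the weighted-sum shape of the score is an assumption based on the scorer's weight names, not a copy of the formula.

```python
from typing import Dict, Optional

# Hypothetical values: the real weight and threshold are not published.
VQV_WEIGHT = 0.5
MIN_VIDEO_DURATION_MS = 5_000


def vqv_weight(min_video_duration_ms: Optional[int]) -> float:
    """Return the VQV weight only when the video exceeds the threshold."""
    if min_video_duration_ms is not None and min_video_duration_ms > MIN_VIDEO_DURATION_MS:
        return VQV_WEIGHT
    return 0.0


def weighted_score(predictions: Dict[str, float], weights: Dict[str, float],
                   min_video_duration_ms: Optional[int]) -> float:
    """Weighted sum of predicted actions, with the VQV term gated by duration."""
    score = sum(weights[name] * p for name, p in predictions.items() if name != "vqv")
    score += vqv_weight(min_video_duration_ms) * predictions.get("vqv", 0.0)
    return score


preds = {"favorite": 0.10, "reply": 0.02, "vqv": 0.30}
weights = {"favorite": 1.0, "reply": 2.0}  # placeholders, not real values

print(weighted_score(preds, weights, min_video_duration_ms=2_000))   # VQV term gated off
print(weighted_score(preds, weights, min_video_duration_ms=30_000))  # VQV term counted
```

The two calls use identical predictions, so any difference in the printed scores comes only from the duration gate.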
There is also an exclude_videos flag that drops every post carrying a video duration entirely.
After selection, a visibility filter (VFFilter) removes posts that are deleted or that fall into categories such as spam, violence, and gore.
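A rough sketch of how these two filters could compose is shown here. The `Candidate` class, the `select` function, and the label strings are hypothetical. The only behavior taken from the text is that exclude_videos drops any post with a video duration and that visibility filtering runs after selection.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Set

# Hypothetical label names standing in for the visibility categories.
BLOCKED_LABELS = {"deleted", "spam", "violence", "gore"}


@dataclass
class Candidate:
    post_id: int
    min_video_duration_ms: Optional[int] = None
    labels: Set[str] = field(default_factory=set)


def select(candidates: List[Candidate], exclude_videos: bool = False) -> List[Candidate]:
    kept = []
    for c in candidates:
        # exclude_videos drops any post that carries a video duration at all.
        if exclude_videos and c.min_video_duration_ms is not None:
            continue
        kept.append(c)
    # Visibility filtering runs after selection.
    return [c for c in kept if not (c.labels & BLOCKED_LABELS)]


feed = [
    Candidate(1),
    Candidate(2, min_video_duration_ms=12_000),
    Candidate(3, labels={"spam"}),
]
print([c.post_id for c in select(feed)])                       # [1, 2]
print([c.post_id for c in select(feed, exclude_videos=True)])  # [1]
```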
Signal by signal
| in the code | in plain English | where xDoctor surfaces it |
|---|---|---|
| ASR transcript on every video | Speak clearly and on-topic — your spoken words are transcribed and ranked like your caption. | Coach · Topic fit |
| text + image + transcript in one embedding | Your images carry meaning into the ranker. A strong visual on a thin caption still says something to the model. | Coach · Content profile |
| min_video_duration_ms hydrated | Video length is a real feature. Extremely short clips forfeit the video-quality-view signal. | Coach · Format mix |
| animated GIFs skipped by ASR | GIFs carry no audio signal — they're understood visually only, not by transcript. | — |
What the code doesn't say
How much any of this moves your score. The embedding shapes how your post is matched
and retrieved, but the released code does not expose how a richer multimodal embedding trades
off against a thin one in final ranking, and the VQV weight and minimum-duration threshold both
live in the withheld params module. The mechanisms are code-current fact: your
audio is transcribed, your images are embedded, your video length is a feature. Any specific
"post videos longer than N seconds for a boost" number is not in this code.
The numeric values of the current weights are not included in the open-source release: weighted_scorer.rs references a params module (e.g. p::FAVORITE_WEIGHT, p::REPLY_WEIGHT) whose values are not present anywhere in the published repository.
What to do with this
Stop treating media as decoration on a text post. Say your point out loud in video — the transcript is indexed. Put real information in your images — they're embedded into your post's meaning. And know that very short clips give up one of the nineteen scoring signals, so a two-second loop competes on thinner ground than a clip that clears the duration gate. The scoring formula names every signal you're playing for; this page is how your video and images get into it.