What is the "post safety screen" in the X code?
It's an annotator, not a judge. The released PostSafetyDeluxeClassifier runs a critical-tier Grok vision model over your post and profile, and returns only boolean metadata flags — the code hardcodes its verdict to non-positive with a score of exactly 0.0. It attaches safety facts to your post for other systems to consume; what those flags are, the withheld prompt decides.
What the code says
The whole classifier is small enough to read in one sitting, and its construction and verdict tell the story between them:

    28 class PostSafetyScreenResult(BaseModel):
    29     tweet_bool_metadata: TweetBoolMetadata
    36 vlm_config = grox_config.get_model(ModelName.VLM_PRIMARY_CRITICAL)
    37 vlm_config.temperature = 0.000001
    38 vlm = VisionSampler(GrokModelConfig(**vlm_config.model_dump()))

    83 return [
    84     ContentCategoryResult(
    85         category=ContentCategoryType.POST_SAFETY_SCREEN,
    86         positive=False,
    87         score=0.0,
    88         tweet_bool_metadata=result.tweet_bool_metadata,
    89     )

Lines 86–87 are the point: the result is always non-positive with a score of exactly
zero. This screen never flags anything itself — it runs a critical-tier vision model purely to
attach boolean metadata to the post, for downstream systems to act on.
The post safety screen is an annotator, not a judge: it runs a critical-tier vision model (VLM_PRIMARY_CRITICAL) and returns only boolean metadata, with the verdict hardcoded to positive=False and score=0.0 — it attaches safety facts for downstream systems rather than flagging posts itself.
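The annotator pattern can be illustrated with a minimal sketch. It is not the released code: it swaps the pydantic BaseModel for a standard-library dataclass so it runs without dependencies, represents the metadata as a plain dict, and uses a made-up flag name (`example_flag`), since the real flags are withheld. Only the field names and the two hardcoded values come from the excerpt above.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class ContentCategoryResult:
    """Stand-in for the released result type; field names follow the excerpt."""
    category: str
    positive: bool
    score: float
    tweet_bool_metadata: dict[str, bool] = field(default_factory=dict)


def annotate(model_flags: dict[str, bool]) -> list[ContentCategoryResult]:
    """Wrap the model's boolean output as metadata; the verdict fields never vary."""
    return [
        ContentCategoryResult(
            category="POST_SAFETY_SCREEN",
            positive=False,  # hardcoded: the screen never flags a post itself
            score=0.0,       # hardcoded: no severity is expressed here
            tweet_bool_metadata=dict(model_flags),
        )
    ]


# Hypothetical flag name, for illustration only; the real flag list is withheld.
results = annotate({"example_flag": True})
print(results[0].positive, results[0].score, results[0].tweet_bool_metadata)
```

Whatever the model reports, `positive` and `score` stay fixed; only the metadata changes from post to post.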
Signal by signal
| in the code | in plain English | where xDoctor surfaces it |
|---|---|---|
| VLM_PRIMARY_CRITICAL | This runs on the critical model tier — safety metadata is expensive and X pays for it on purpose. | — |
| tweet_bool_metadata only | The output is a set of yes/no facts about your post, not a penalty. The penalty logic lives elsewhere. | Checkup · Flagged Posts |
| positive=False, score=0.0 | Hardcoded. The annotator never "convicts" — it testifies. | — |
| UserRenderer in prompt | Annotated with your profile in frame, like every Grox classifier. | Coach · Account |
Grox content classifiers render the post's author into the judging prompt: UserRenderer.render(post.user) places the user's profile in the model's context alongside the post, in the PTOS, banger, and post-safety-screen classifiers alike.
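To make "profile in frame" concrete, here is a rough sketch of how a prompt might combine an author block with a post block. Everything in it is an assumption for illustration: the `User` and `Post` fields, the `render_user` and `render_post` helpers, the output format, and the ordering of the two blocks. The real UserRenderer output and prompt template are not part of what is quoted here.

```python
from dataclasses import dataclass


@dataclass
class User:
    handle: str  # hypothetical field
    bio: str     # hypothetical field


@dataclass
class Post:
    user: User
    text: str


def render_user(user: User) -> str:
    """Hypothetical stand-in for UserRenderer.render(post.user)."""
    return f"Author: @{user.handle}\nBio: {user.bio}"


def render_post(post: Post) -> str:
    """Hypothetical post rendering."""
    return f"Post: {post.text}"


def build_prompt(post: Post) -> str:
    """Place the author's profile in the same context as the post."""
    return "\n\n".join([render_user(post.user), render_post(post)])


prompt = build_prompt(Post(user=User("alice", "Gardener"), text="Hello"))
print(prompt)
```

The point of the sketch is only that the model sees the account and the post together, so the same text can be annotated differently depending on who posted it.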
What the code doesn't say
Which boolean flags exist. The prompt (PostSafetyDeluxe) imports from the absent
grox.prompts.template module, and the TweetBoolMetadata type definition
lives in a data-types module that is likewise not in the release. We can prove the screen
annotates rather than judges; the list of facts it annotates is withheld.
The spam classifier's criteria are withheld in the same way: spam.py imports its system prompt (SpamSystemLowFollower) from grox.prompts.template, and the entire grox/prompts/ module is absent from the public release. The classifier machinery is open; the rules it enforces are not.
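One way to check a claim like this against a local checkout is to ask Python whether the module can be located at all. The helper below is a small sketch using only the standard library; `module_available` is a name chosen here, not something from the release, and the result depends entirely on what is on your import path.

```python
import importlib.util


def module_available(dotted_name: str) -> bool:
    """Return True if Python can locate the module, without running its code."""
    try:
        return importlib.util.find_spec(dotted_name) is not None
    except ModuleNotFoundError:
        # Raised when a parent package in the dotted path is itself missing.
        return False


print(module_available("json"))                   # standard library module
print(module_available("grox.prompts.template"))  # depends on your checkout
```

Note that `find_spec` imports parent packages while resolving a dotted name, so run it only against code you are willing to import.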
What to do about it
The architecture lesson: a "flag" on X is rarely one model's opinion — it is metadata attached here, consumed by enforcement logic elsewhere. That is why Checkup triages your history by surface rather than by a single score: the system being modeled works the same way.
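The split between annotation and enforcement can be sketched as a separate consumer that reads boolean metadata and maps it to actions. This is purely illustrative: the released code does not show the enforcement side, so `decide_actions`, the flag names, and the action names are all hypothetical placeholders.

```python
def decide_actions(metadata: dict[str, bool], rules: dict[str, str]) -> list[str]:
    """Return the actions whose triggering flag is set, in sorted order.

    `rules` maps a flag name to an action name; both are hypothetical here.
    """
    triggered = {action for flag, action in rules.items() if metadata.get(flag, False)}
    return sorted(triggered)


# Placeholder flags and actions; the real ones are not in the public release.
rules = {"flag_a": "action_x", "flag_b": "action_y"}
metadata = {"flag_a": True, "flag_b": False}
print(decide_actions(metadata, rules))
```

In this arrangement the annotator can change what it reports without touching the rules, and the rules can change without retraining or re-prompting the annotator.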