E-GEO Benchmark

Submit to E-GEO

Submissions are accepted via pull request on the public GitHub repository — there is no web upload form. You run a single script (src/submission.py) that scores your rewriter against the five judges and writes the files you submit. The PR is the submission; once it is merged, your results show up on the leaderboard.

Two scoring rules

Query-blind. Your rewriter may see only the original product description — never the search query. (A deployed seller doesn't know what the shopper will type.)
One rewrite per test query. Each test query names a single target product — the one at rand_idx in data/test_selected_products.json. Rewrite that product only, and include every test query exactly once.
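The "exactly once" rule is easy to check locally before scoring. The sketch below assumes a hypothetical layout in which each rewrite record carries a query_id field; that field name is an illustration only, and the actual schema is whatever submission.md specifies.

```python
from collections import Counter


def check_coverage(test_query_ids, rewrites):
    """Compare rewrite records against the list of test queries.

    Assumption: each record is a dict with a "query_id" key. This is a
    hypothetical field name, not taken from the E-GEO specification.
    Returns three sorted lists: missing, duplicated, and unexpected ids.
    """
    counts = Counter(record["query_id"] for record in rewrites)
    expected = set(test_query_ids)
    seen = set(counts)
    missing = sorted(expected - seen)          # test queries with no rewrite
    duplicated = sorted(q for q, n in counts.items() if n > 1)
    unexpected = sorted(seen - expected)       # rewrites for unknown queries
    return missing, duplicated, unexpected


if __name__ == "__main__":
    sample = [
        {"query_id": "q1", "rewrite": "first attempt"},
        {"query_id": "q1", "rewrite": "second attempt"},
        {"query_id": "q3", "rewrite": "stray rewrite"},
    ]
    print(check_coverage(["q1", "q2"], sample))
```

A submission that satisfies the rule returns three empty lists.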

How to submit

Fork psbagga17/E-GEO and read submission.md. Then run uv sync, set OPENROUTER_API_KEY, and fetch the dataset as described in the repo README.
Run src/submission.py in one of three modes:
- A (score): --mode score --rewrites rewrites.json. Scores rewrites you already produced.
- B (rewrite): --mode rewrite --prompt final.txt. The script rewrites the test products with your prompt, then scores them.
- C (optimize): --mode optimize --prompt <style>. The script prompt-optimizes your starting prompt (selecting on validation, never on test), then rewrites and scores.
The script writes submissions/<team>_<timestamp>/ containing metadata.json, results.json, and rewrites.jsonl — these three files are your submission.
Open a pull request adding submissions/<your-rewriter-name>/. Once your PR is merged, your results show up on the leaderboard.
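The three modes differ only in the flags passed to src/submission.py. As a rough sketch, the helper below assembles the argument list for each mode using only the flags named above. The helper itself (build_command) is hypothetical, and launching the script with a bare python is an assumption; the repo may expect a different launcher.

```python
import shlex

SCRIPT = "src/submission.py"


def build_command(mode, *, rewrites=None, prompt=None):
    """Build the argv list for one submission mode.

    Only flags mentioned in the steps above are used. The "python"
    launcher is an assumption, not something the repo is known to require.
    """
    if mode == "score":
        if rewrites is None:
            raise ValueError("mode 'score' needs a rewrites file")
        extra = ["--rewrites", rewrites]
    elif mode in ("rewrite", "optimize"):
        if prompt is None:
            raise ValueError(f"mode '{mode}' needs a prompt")
        extra = ["--prompt", prompt]
    else:
        raise ValueError(f"unknown mode: {mode}")
    return ["python", SCRIPT, "--mode", mode, *extra]


if __name__ == "__main__":
    # Print shell-ready command lines for Mode A and Mode B.
    print(shlex.join(build_command("score", rewrites="rewrites.json")))
    print(shlex.join(build_command("rewrite", prompt="final.txt")))
```

The returned list can be handed to subprocess.run as is, which avoids shell quoting problems with prompt file names.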

metadata.json

Describes your submission — no scores. The example below shows a Mode C run; cost_per_rewrite_usd and run_config are null in Mode A.

{
  "name":                 "My Team",
  "type":                 "model+prompt",
  "mode":                 "C: optimize then score",
  "judges": [
    "openai/gpt-5", "anthropic/claude-sonnet-4.5",
    "google/gemini-3-flash-preview", "deepseek/deepseek-v3.2",
    "meta-llama/llama-4-maverick"
  ],
  "description":          "One-line description of your approach.",
  "cost_per_rewrite_usd": 0.0042,
  "query_blind":          true,
  "contact":              "you@example.com",
  "code_url":             null,
  "paper_url":            null,
  "run_config":           {"rewriter_model": "openai/gpt-4.1", "optimized": true},
  "is_paper_baseline":    false
}

results.json

The scores the leaderboard reads — mean and standard error per judge.

{
  "per_ranker": {
    "gpt_5":                  {"mean": 0.17, "se": 0.02},
    "claude_sonnet_4_5":      {"mean": 0.16, "se": 0.02},
    "gemini_3_flash_preview": {"mean": 0.16, "se": 0.02},
    "deepseek_v3_2":          {"mean": 0.17, "se": 0.02},
    "llama_4_maverick":       {"mean": 0.15, "se": 0.02}
  },
  "total_queries_scored": 2000,
  "total_input_tokens":   12345,
  "total_output_tokens":  12345,
  "estimated_usd_cost":   228.5
}
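To make the mean and se fields concrete, here is a minimal sketch of how they could be computed from per-query scores for one judge. It assumes the standard error is the sample standard deviation divided by the square root of the number of queries; the script's exact estimator is not stated here and may differ. The ranker_key helper is likewise hypothetical: it reproduces the key naming seen in the example (openai/gpt-5 becomes gpt_5) by inference only.

```python
import math
import re
import statistics


def mean_and_se(scores):
    """Return {"mean", "se"} for a list of per-query scores.

    Assumption: se = sample standard deviation / sqrt(n).
    """
    n = len(scores)
    if n < 2:
        raise ValueError("need at least two scores to estimate a standard error")
    mean = statistics.fmean(scores)
    se = statistics.stdev(scores) / math.sqrt(n)
    return {"mean": mean, "se": se}


def ranker_key(model_id):
    """Map a judge id such as "openai/gpt-5" to a key such as "gpt_5".

    Inferred from the example results.json, not from the specification.
    """
    name = model_id.split("/", 1)[-1].lower()   # drop the provider prefix
    return re.sub(r"[^a-z0-9]+", "_", name).strip("_")


if __name__ == "__main__":
    scores = [0.10, 0.20, 0.30]  # made-up per-query scores for illustration
    print({ranker_key("openai/gpt-5"): mean_and_se(scores)})
```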

In metadata.json, type must be one of model+prompt, fine-tuned, agent, or other, and judges must always include the five leaderboard judges.
cost_per_rewrite_usd is the average cost per rewrite in USD. Modes B and C estimate it automatically from token usage and provider pricing; in Mode A it is null.
See submission.md for the full specification and an example submission.
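The constraints on metadata.json described on this page can be checked before opening a pull request. The following is a sketch only: validate_metadata is a hypothetical helper, it covers just the rules stated here rather than the full specification, and it assumes that a Mode A run has a mode string beginning with "A" (by analogy with "C: optimize then score" in the example).

```python
ALLOWED_TYPES = {"model+prompt", "fine-tuned", "agent", "other"}
REQUIRED_JUDGES = {
    "openai/gpt-5",
    "anthropic/claude-sonnet-4.5",
    "google/gemini-3-flash-preview",
    "deepseek/deepseek-v3.2",
    "meta-llama/llama-4-maverick",
}


def validate_metadata(meta):
    """Return a list of human-readable problems (empty if none are found)."""
    problems = []
    if meta.get("type") not in ALLOWED_TYPES:
        problems.append(f"type must be one of {sorted(ALLOWED_TYPES)}")
    missing = REQUIRED_JUDGES - set(meta.get("judges") or [])
    if missing:
        problems.append(f"missing required judges: {sorted(missing)}")
    # Assumption: Mode A is identified by a mode string starting with "A".
    if str(meta.get("mode", "")).startswith("A"):
        for field in ("cost_per_rewrite_usd", "run_config"):
            if meta.get(field) is not None:
                problems.append(f"{field} should be null in Mode A")
    return problems


if __name__ == "__main__":
    import json
    import sys

    # Usage: python check_metadata.py path/to/metadata.json
    with open(sys.argv[1], encoding="utf-8") as handle:
        for problem in validate_metadata(json.load(handle)):
            print(problem)
```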