Submit to E-GEO
Submissions are accepted via pull request on the public GitHub repository — there
is no web upload form. You run a single script (src/submission.py) that
scores your rewriter against the five judges and writes the files you submit. The
PR is the submission; once it is merged, your results show up on the
leaderboard.
Two scoring rules
- Query-blind. Your rewriter may see only the original product description — never the search query. (A deployed seller doesn't know what the shopper will type.)
- One rewrite per test query. Each test query names a single target product: the one at rand_idx in data/test_selected_products.json. Rewrite that product only, and include every test query exactly once.
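The two rules above can be checked before scoring. The sketch below is illustrative only: the Rewriter type and the check_coverage helper are hypothetical names, and the query identifiers are made up, because the actual file schema is defined in submission.md rather than here.

```python
from collections import Counter
from typing import Callable, Iterable

# Query-blind: the rewriter's only input is the original product description.
# (Hypothetical type alias, not part of the E-GEO codebase.)
Rewriter = Callable[[str], str]


def check_coverage(
    test_query_ids: Iterable[str],
    rewritten_query_ids: Iterable[str],
) -> dict[str, list[str]]:
    """Report test queries that are missing, duplicated, or unexpected."""
    expected = set(test_query_ids)
    counts = Counter(rewritten_query_ids)
    seen = set(counts)
    return {
        "missing": sorted(expected - seen),
        "duplicated": sorted(q for q, n in counts.items() if n > 1),
        "unexpected": sorted(seen - expected),
    }


# Made-up identifiers: q1 is rewritten twice, q2 and q3 are skipped,
# and q4 is not a test query at all.
report = check_coverage(["q1", "q2", "q3"], ["q1", "q1", "q4"])
print(report)
```

A submission that satisfies the second rule yields three empty lists.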
How to submit
- Fork psbagga17/E-GEO and read submission.md. Then run uv sync and set OPENROUTER_API_KEY (fetch the dataset per the repo README).
- Run src/submission.py in one of three modes:
  - A, score: --mode score --rewrites rewrites.json (score rewrites you already produced).
  - B, rewrite: --mode rewrite --prompt final.txt (we rewrite the test products with your prompt, then score).
  - C, optimize: --mode optimize --prompt <style> (we prompt-optimize your starting prompt, selecting on validation and never on test, then rewrite and score).
- The script writes submissions/<team>_<timestamp>/ containing metadata.json, results.json, and rewrites.jsonl. These three files are your submission.
- Open a pull request adding submissions/<your-rewriter-name>/. Once your PR is merged, your results show up on the leaderboard.
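As a rough sketch of how the three modes map onto command-line arguments, the helper below assembles the argument list for each one. The flags --mode, --rewrites, and --prompt come from the list above; the build_command name and the `uv run python` launcher prefix are assumptions, so defer to submission.md for the exact invocation.

```python
import shlex

# Which flag carries the input file for each mode, per the list above.
INPUT_FLAG = {
    "score": "--rewrites",    # Mode A: rewrites you already produced
    "rewrite": "--prompt",    # Mode B: your final prompt
    "optimize": "--prompt",   # Mode C: your starting prompt
}


def build_command(mode: str, path: str) -> list[str]:
    """Return the argument list for one submission mode (hypothetical helper)."""
    if mode not in INPUT_FLAG:
        raise ValueError(f"unknown mode {mode!r}; expected one of {sorted(INPUT_FLAG)}")
    # Assumption: the script is launched through `uv run`; the repo may differ.
    launcher = ["uv", "run", "python", "src/submission.py"]
    return launcher + ["--mode", mode, INPUT_FLAG[mode], path]


command = build_command("score", "rewrites.json")
print(shlex.join(command))
```

The resulting list can be passed to subprocess.run once OPENROUTER_API_KEY is set in the environment.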
metadata.json
Describes your submission — no scores. The example below shows a Mode C run;
cost_per_rewrite_usd and run_config are null
in Mode A.
{
  "name": "My Team",
  "type": "model+prompt",
  "mode": "C: optimize then score",
  "judges": [
    "openai/gpt-5", "anthropic/claude-sonnet-4.5",
    "google/gemini-3-flash-preview", "deepseek/deepseek-v3.2",
    "meta-llama/llama-4-maverick"
  ],
  "description": "One-line description of your approach.",
  "cost_per_rewrite_usd": 0.0042,
  "query_blind": true,
  "contact": "you@example.com",
  "code_url": null,
  "paper_url": null,
  "run_config": {"rewriter_model": "openai/gpt-4.1", "optimized": true},
  "is_paper_baseline": false
}
results.json
The scores the leaderboard reads — mean and standard error per judge.
{
  "per_ranker": {
    "gpt_5": {"mean": 0.17, "se": 0.02},
    "claude_sonnet_4_5": {"mean": 0.16, "se": 0.02},
    "gemini_3_flash_preview": {"mean": 0.16, "se": 0.02},
    "deepseek_v3_2": {"mean": 0.17, "se": 0.02},
    "llama_4_maverick": {"mean": 0.15, "se": 0.02}
  },
  "total_queries_scored": 2000,
  "total_input_tokens": 12345,
  "total_output_tokens": 12345,
  "estimated_usd_cost": 228.5
}
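One way to read the per-judge mean and standard error is as an approximate 95% interval, mean ± 1.96 × se, under a normal approximation. The sketch below applies that to the sample values above; confidence_intervals is a hypothetical helper, and nothing here implies the leaderboard itself reports intervals this way.

```python
import json

# Per-judge scores copied from the results.json example above.
SAMPLE = """
{
  "per_ranker": {
    "gpt_5": {"mean": 0.17, "se": 0.02},
    "claude_sonnet_4_5": {"mean": 0.16, "se": 0.02},
    "gemini_3_flash_preview": {"mean": 0.16, "se": 0.02},
    "deepseek_v3_2": {"mean": 0.17, "se": 0.02},
    "llama_4_maverick": {"mean": 0.15, "se": 0.02}
  }
}
"""


def confidence_intervals(results: dict, z: float = 1.96) -> dict[str, tuple[float, float]]:
    """Return (low, high) = mean -/+ z * se for every judge in per_ranker."""
    return {
        judge: (stats["mean"] - z * stats["se"], stats["mean"] + z * stats["se"])
        for judge, stats in results["per_ranker"].items()
    }


intervals = confidence_intervals(json.loads(SAMPLE))
for judge, (low, high) in intervals.items():
    print(f"{judge}: [{low:.3f}, {high:.3f}]")
```

With equal standard errors, as in this sample, the intervals of all five judges overlap, so small differences between judge means should not be over-read.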
- type must be one of model+prompt, fine-tuned, agent, or other.
- The five leaderboard judges are always required.
- cost_per_rewrite_usd is the average cost per rewrite in USD, auto-estimated in Mode B and Mode C from token usage and provider pricing; it is null in Mode A.
- See submission.md for the full specification and an example submission.
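The metadata.json constraints stated on this page can be checked locally before opening a pull request. This is a minimal sketch, not the repository's validator: validate_metadata is a hypothetical helper, and detecting Mode A by a "mode" string that starts with "A" is an assumption, since only the Mode C string appears in the example.

```python
ALLOWED_TYPES = {"model+prompt", "fine-tuned", "agent", "other"}
REQUIRED_JUDGES = {
    "openai/gpt-5",
    "anthropic/claude-sonnet-4.5",
    "google/gemini-3-flash-preview",
    "deepseek/deepseek-v3.2",
    "meta-llama/llama-4-maverick",
}


def validate_metadata(meta: dict) -> list[str]:
    """Return a list of human-readable problems; an empty list means none found."""
    problems = []
    if meta.get("type") not in ALLOWED_TYPES:
        problems.append(f"type must be one of {sorted(ALLOWED_TYPES)}")
    missing = REQUIRED_JUDGES - set(meta.get("judges") or [])
    if missing:
        problems.append(f"missing required judges: {sorted(missing)}")
    # Assumption: Mode A runs carry a "mode" string beginning with "A".
    if str(meta.get("mode", "")).startswith("A"):
        for field in ("cost_per_rewrite_usd", "run_config"):
            if meta.get(field) is not None:
                problems.append(f"{field} should be null in Mode A")
    return problems


example = {
    "type": "model+prompt",
    "mode": "C: optimize then score",
    "judges": sorted(REQUIRED_JUDGES),
    "cost_per_rewrite_usd": 0.0042,
    "run_config": {"rewriter_model": "openai/gpt-4.1", "optimized": True},
}
print(validate_metadata(example))
```

The check covers only the rules stated here; submission.md remains the authority on the full schema.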