loop-benchmarking

Controlled experiments across agentic coding configurations. Same task, one variable, what actually works.
git clone https://git.shiptheloop.com/loop-benchmarking.git

scoring.yaml (574B)


# Outcome score (the headline number)
# gameplay_bot: does the game actually work? (16 Playwright tests)
# sonarqube: is the code quality good? (cognitive complexity, bugs, smells)
outcome_weights:
  gameplay_bot: 0.50
  sonarqube: 0.50

# Output metrics (tracked, displayed, but don't affect headline score):
# - quality (lint, typecheck, bundle size - "does the project build cleanly")
# - structural (entry point exists, build succeeds)
# - code_analysis (function length, nesting, naming, separation of concerns)
# - transcript_analysis (agent efficiency, wasted turns)
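
A minimal sketch of how the two outcome weights might combine into the headline number. The file only lists the weights, so the weighted sum below is an assumption, as is the idea that each component is first normalized to the range 0 to 1. The function name `outcome_score` and the example inputs are hypothetical and not taken from the repository.

```python
# Weights copied from outcome_weights in scoring.yaml.
OUTCOME_WEIGHTS = {"gameplay_bot": 0.50, "sonarqube": 0.50}


def outcome_score(component_scores, weights=OUTCOME_WEIGHTS):
    """Combine per-component scores into one headline score.

    Assumption: each component score is already normalized to [0, 1]
    and the headline score is a plain weighted sum. The repository may
    combine the components differently.
    """
    missing = set(weights) - set(component_scores)
    if missing:
        raise KeyError(f"missing component scores: {sorted(missing)}")

    total_weight = sum(weights.values())
    if abs(total_weight - 1.0) > 1e-9:
        raise ValueError(f"weights sum to {total_weight}, expected 1.0")

    for name in weights:
        if not 0.0 <= component_scores[name] <= 1.0:
            raise ValueError(f"{name} score must be in [0, 1]")

    # Components not listed in weights (the tracked-only output metrics)
    # are ignored, so they cannot move the headline score.
    return sum(weights[name] * component_scores[name] for name in weights)


# Hypothetical run: 12 of the 16 Playwright tests pass, and the
# SonarQube result has been normalized to 0.80 by some earlier step.
example = outcome_score({"gameplay_bot": 12 / 16, "sonarqube": 0.80})
print(f"headline score: {example:.3f}")
```

Passing extra keys such as `quality` or `structural` leaves the result unchanged, which mirrors the file's note that output metrics are tracked but do not affect the headline score.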
