METR

Formation: 2022
Founder: Beth Barnes
Type: Nonprofit research institute
Legal status: 501(c)(3) tax-exempt charity
Purpose: AI safety research and model evaluation
Website: metr.org

Model Evaluation and Threat Research (METR) (MEE-tər) is a nonprofit research institute based in Berkeley, California,[1] that evaluates frontier AI models' capabilities to carry out long-horizon, agentic tasks that some researchers argue could pose catastrophic risks to society.[2][3] METR has worked with leading AI companies to conduct pre-deployment model evaluations and contribute to system cards, including those for OpenAI's o3, o4-mini, GPT-4o and GPT-4.5, and Anthropic's Claude models.[3][4][5][6][7]

METR's CEO and founder is Beth Barnes, a former alignment researcher at OpenAI who left in 2022 to form ARC Evals, the evaluation division of Paul Christiano's Alignment Research Center. In December 2023, ARC Evals was spun off into an independent 501(c)(3) nonprofit and renamed METR.[8][9][10]

Research

Much of METR's research focuses on evaluating how well AI systems can themselves conduct AI research and development. This work includes RE-Bench, a benchmark designed to test whether AIs can "solve research engineering tasks and accelerate AI R&D".[11][12]

Doubling time estimates

Figure: A graph showing that the length of tasks frontier models are capable of executing at a 50% success rate doubled every 7 months from 2019 to 2024. The shaded region represents a 95% confidence interval.[13]

In March 2025, METR published a paper reporting that the length of software engineering tasks that the leading AI models could complete doubled roughly every 7 months between 2019 and 2024.[14]

In January 2026, METR released a new version of its time horizon estimation model (Time Horizon 1.1). According to the updated model, AI capabilities have progressed faster since 2023, with a post-2023 doubling time estimated at 130.8 days (4.3 months). Progress is thus estimated to be 20% more rapid.[15]
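The unit conversion and the growth implied by a doubling time can be checked with a few lines of arithmetic. The sketch below is illustrative only: the function names and the average-month constant are assumptions, and it assumes steady exponential growth. It does not reproduce the 20% figure, whose baseline is not stated in this section.

```python
# Assumption: an average Gregorian month (365.25 / 12 days) is used for conversion.
DAYS_PER_YEAR = 365.25
DAYS_PER_MONTH = DAYS_PER_YEAR / 12


def days_to_months(days: float) -> float:
    """Convert a duration in days to average calendar months."""
    return days / DAYS_PER_MONTH


def growth_factor(doubling_time_days: float, elapsed_days: float) -> float:
    """Multiplicative growth over elapsed_days, assuming a constant doubling time."""
    return 2.0 ** (elapsed_days / doubling_time_days)


# Doubling times quoted in the text.
post_2023_doubling_days = 130.8                    # Time Horizon 1.1, post-2023
earlier_doubling_days = 7 * DAYS_PER_MONTH         # "around 7 months", 2019 to 2024

post_2023_months = days_to_months(post_2023_doubling_days)
yearly_post_2023 = growth_factor(post_2023_doubling_days, DAYS_PER_YEAR)
yearly_earlier = growth_factor(earlier_doubling_days, DAYS_PER_YEAR)

print(f"130.8 days is about {post_2023_months:.1f} months")
print(f"Implied yearly growth at 130.8-day doubling: x{yearly_post_2023:.2f}")
print(f"Implied yearly growth at 7-month doubling:   x{yearly_earlier:.2f}")
```

Under these assumptions, a shorter doubling time translates into a much larger multiplier per year, because the number of doublings per year enters as an exponent.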

Time horizon measurements

METR publishes a "task-completion time horizon" for each AI model it analyses. This is the "task duration (measured by human expert completion time) at which an AI agent is predicted to succeed with a given level of reliability."[16] The metric is reported in two variants: the 50%-time horizon, the task duration at which a model is estimated to succeed 50% of the time, and the 80%-time horizon, the task duration at which it is estimated to succeed 80% of the time.[16] METR has published two versions of the underlying estimation model: Time Horizon 1.0 and Time Horizon 1.1, the latter introduced in January 2026.[16]
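One way to make the definition concrete is to assume that success probability follows a logistic curve in the logarithm of task duration, and then solve for the duration at which that curve crosses a chosen reliability. The sketch below is a hedged illustration under that assumption: the curve form, the function names, and the coefficients `intercept=3.0` and `slope=-0.5` are hypothetical and are not taken from METR's published fits.

```python
import math


def success_probability(minutes: float, intercept: float, slope: float) -> float:
    """Assumed model: logistic curve in log2 of human task duration (minutes)."""
    z = intercept + slope * math.log2(minutes)
    return 1.0 / (1.0 + math.exp(-z))


def time_horizon(reliability: float, intercept: float, slope: float) -> float:
    """Task duration (minutes) at which the assumed curve equals `reliability`.

    Solves intercept + slope * log2(t) = logit(reliability) for t.
    """
    logit = math.log(reliability / (1.0 - reliability))
    return 2.0 ** ((logit - intercept) / slope)


# Hypothetical coefficients, chosen only for illustration.
intercept, slope = 3.0, -0.5  # negative slope: longer tasks are harder

h50 = time_horizon(0.5, intercept, slope)
h80 = time_horizon(0.8, intercept, slope)

print(f"50%-time horizon: {h50:.1f} minutes")
print(f"80%-time horizon: {h80:.1f} minutes")
```

Because longer tasks are less likely to succeed under this assumed curve, the 80%-time horizon is always shorter than the 50%-time horizon, which matches the pattern in the reported estimates.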

As of 9 May 2026, the best-performing model is Claude Mythos Preview (early), with a 50%-time horizon that is likely at least 16 hours and an 80%-time horizon of 3 hours and 6 minutes.[16] METR notes that "[m]easurements above 16 [hours] are unreliable with [their] current task suite". The following table provides time horizon estimates ordered by each model's release date:[16]

Task duration (for humans), given as 50%-time horizon / 80%-time horizon. Where a row lists two pairs, the first is Time Horizon 1.1 and the second is Time Horizon 1.0.

| Model | Release date | Time horizon (50% / 80%) |
| --- | --- | --- |
| GPT-2 | February 2019 | 2 seconds / 0 seconds |
| GPT-3 | May 2020 | 9 seconds / 2 seconds |
| GPT-3.5 | March 2022 | 36 seconds / 10 seconds |
| GPT-4 | March 2023 | TH 1.1: 4 minutes / 37 seconds; TH 1.0: 5 minutes / 1 minute |
| GPT-4 (November 2023) | November 2023 | TH 1.1: 4 minutes / 34 seconds; TH 1.0: 9 minutes / 1 minute |
| Claude 3 Opus | March 2024 | TH 1.1: 4 minutes / 29 seconds; TH 1.0: 6 minutes / 1 minute |
| GPT-4 Turbo | April 2024 | TH 1.1: 3 minutes / 37 seconds; TH 1.0: 7 minutes / 2 minutes |
| GPT-4o | May 2024 | TH 1.1: 6 minutes / 57 seconds; TH 1.0: 9 minutes / 2 minutes |
| Qwen2-72B | June 2024 | 2 minutes / 25 seconds |
| Claude 3.5 Sonnet (Old) | June 2024 | TH 1.1: 11 minutes / 1 minute; TH 1.0: 19 minutes / 3 minutes |
| Qwen2.5-72B | September 2024 | 5 minutes / 56 seconds |
| o1-preview | September 2024 | TH 1.1: 19 minutes / 3 minutes; TH 1.0: 22 minutes / 5 minutes |
| Claude 3.5 Sonnet (New) | October 2024 | TH 1.1: 20 minutes / 2 minutes; TH 1.0: 30 minutes / 5 minutes |
| DeepSeek-V3 | December 2024 | 18 minutes / 4 minutes |
| o1 | December 2024 | TH 1.1: 38 minutes / 6 minutes; TH 1.0: 41 minutes / 6 minutes |
| Claude 3.7 Sonnet | February 2025 | TH 1.1: 1 hour / 10 minutes; TH 1.0: 56 minutes / 15 minutes |
| o3 | April 2025 | TH 1.1: 2 hours 1 minute / 24 minutes; TH 1.0: 1 hour 34 minutes / 21 minutes |
| o4-mini | April 2025 | 1 hour 19 minutes / 16 minutes |
| Claude Opus 4 | May 2025 | TH 1.1: 1 hour 41 minutes / 17 minutes; TH 1.0: 1 hour 26 minutes / 21 minutes |
| DeepSeek-R1-0528 | May 2025 | 32 minutes / 4 minutes |
| Gemini 2.5 Pro Preview | June 2025 | 40 minutes / 9 minutes |
| Grok 4 | July 2025 | 1 hour 49 minutes / 15 minutes |
| Claude Opus 4.1 | August 2025 | 1 hour 41 minutes / 19 minutes |
| GPT-5 | August 2025 | TH 1.1: 3 hours 34 minutes / 32 minutes; TH 1.0: 2 hours 18 minutes / 27 minutes |
| gpt-oss-120b | August 2025 | 45 minutes / 7 minutes |
| Claude Sonnet 4.5 | September 2025 | 2 hours 2 minutes / 21 minutes |
| Gemini 3 Pro | November 2025 | 3 hours 57 minutes / 43 minutes |
| Claude Opus 4.5 | November 2025 | TH 1.1: 5 hours 20 minutes / 42 minutes; TH 1.0: 4 hours 49 minutes / 27 minutes |
| GPT-5.1-Codex-Max | November 2025 | TH 1.1: 3 hours 57 minutes / 41 minutes; TH 1.0: 2 hours 53 minutes / 32 minutes |
| Kimi K2 Thinking (inference via Novita AI) | November 2025 | 58 minutes / 12 minutes |
| GPT-5.2 (high) | December 2025 | 6 hours 34 minutes / 55 minutes |
| Claude Opus 4.6 | February 2026 | 11 hours 59 minutes / 1 hour 10 minutes |
| GPT-5.3-Codex (high) | February 2026 | 6 hours 30 minutes / 47 minutes |
| Gemini 3.1 Pro | March 2026 | 5 hours 50 minutes / 1 hour 30 minutes |
| GPT-5.4 (xhigh) | March 2026 | 5 hours 42 minutes / 54 minutes |
| Claude Mythos Preview (early) | April 2026 | Likely at least 16 hours / 3 hours 6 minutes |
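To compare entries in the table numerically, the duration strings first need to be converted to a common unit. The sketch below parses strings such as "3 hours 34 minutes" into seconds and then computes a crude two-point doubling time between two Time Horizon 1.1 entries (GPT-4 and GPT-5, 50% horizons). The helper names are hypothetical, the release dates are assumed to fall on the first day of the stated month, and the two-point calculation is a simplification, not the regression METR uses for its published doubling-time estimates.

```python
import math
import re
from datetime import date

_UNIT_SECONDS = {"hour": 3600, "minute": 60, "second": 1}


def parse_duration(text: str) -> int:
    """Convert a string like '3 hours 34 minutes' to seconds.

    Qualifiers such as 'Likely at least' are ignored, so lower bounds
    should not be treated as point estimates.
    """
    parts = re.findall(r"(\d+)\s*(hour|minute|second)s?", text)
    if not parts:
        raise ValueError(f"no duration found in {text!r}")
    return sum(int(number) * _UNIT_SECONDS[unit] for number, unit in parts)


def two_point_doubling_days(h0: str, d0: date, h1: str, d1: date) -> float:
    """Days per doubling implied by two (horizon, date) points alone."""
    doublings = math.log2(parse_duration(h1) / parse_duration(h0))
    return (d1 - d0).days / doublings


# Time Horizon 1.1, 50% horizons from the table; day of month is an assumption.
estimate = two_point_doubling_days(
    "4 minutes", date(2023, 3, 1),           # GPT-4, March 2023
    "3 hours 34 minutes", date(2025, 8, 1),  # GPT-5, August 2025
)
print(f"Two-point doubling time: {estimate:.0f} days")
```

A two-point estimate like this is sensitive to which models are chosen as endpoints, so it is expected to differ from a trend fitted across all measured models.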

References

  1. Witt, Stephen (10 October 2025). "The A.I. Prompt That Could End the World". The New York Times. Archived from the original on 29 October 2025. Retrieved 29 October 2025.
  2. "About METR". METR. Archived from the original on 15 June 2025. Retrieved 15 June 2025.
  3. "OpenAI o3 and o4-mini System Card". OpenAI. Archived from the original on 15 June 2025. Retrieved 15 June 2025.
  4. "GPT-4.5 system card". OpenAI. Retrieved 15 June 2025.
  5. "Introducing Claude 3.5 Sonnet". Anthropic. Archived from the original on 6 February 2025. Retrieved 15 June 2025.
  6. "Details about METR's preliminary evaluation of Claude 3.7". METR's Autonomy Evaluation Resources. 4 April 2025. Archived from the original on 15 June 2025. Retrieved 15 June 2025.
  7. Robison, Kylie (8 August 2024). "OpenAI says its latest GPT-4o model is 'medium' risk". The Verge. Archived from the original on 6 February 2026. Retrieved 29 October 2025.
  8. "ARC Evals is now METR". METR Blog. 4 December 2023. Archived from the original on 15 June 2025. Retrieved 15 June 2025.
  9. Booth, Harry (5 September 2024). "TIME100 AI 2024: Beth Barnes". TIME. Archived from the original on 15 June 2025. Retrieved 15 June 2025.
  10. Henshall, Will (21 March 2024). "Nobody Knows How to Safety-Test AI". TIME. Archived from the original on 15 June 2025. Retrieved 15 June 2025.
  11. "Claude 3.7 Sonnet System Card". Anthropic. 24 February 2025. Retrieved 15 June 2025.
  12. "Gemini 2.5 Pro Preview Model Card". Google. 6 June 2025. Archived from the original on 28 May 2025. Retrieved 15 June 2025.
  13. "Measuring AI Ability to Complete Long Tasks". METR Blog. 19 March 2025. Archived from the original on 15 June 2025. Retrieved 15 June 2025.
  14. Lovely, Garrison (19 March 2025). "AI could soon tackle projects that take humans weeks". Nature. doi:10.1038/d41586-025-00831-8. ISSN 1476-4687. Archived from the original on 1 July 2025. Retrieved 15 June 2025.
  15. "Time Horizon 1.1". METR Blog. 29 January 2026. Archived from the original on 12 February 2026. Retrieved 14 February 2026.
  16. "Task-Completion Time Horizons of Frontier AI Models". METR. 8 May 2026. Retrieved 9 May 2026.
