    A Chinese firm has just launched a constantly changing set of AI benchmarks

    By Techurz | June 23, 2025 | Updated: May 10, 2026 | 3 min read

    Development of the benchmark, called Xbench, began at HongShan in 2022, following ChatGPT’s breakout success, as an internal tool for assessing which models are worth investing in. Since then, led by partner Gong Yuan, the team has steadily expanded the system, bringing in outside researchers and professionals to help refine it. As the project grew more sophisticated, they decided to release it to the public.

    Xbench approaches the problem with two different systems. One is similar to traditional benchmarking: an academic test that gauges a model’s aptitude on various subjects. The other is more like a technical interview round for a job, assessing how much real-world economic value a model might deliver.

    Xbench’s methods for assessing raw intelligence currently include two components: Xbench-ScienceQA and Xbench-DeepResearch. ScienceQA isn’t a radical departure from existing postgraduate-level STEM benchmarks like GPQA and SuperGPQA. It includes questions spanning fields from biochemistry to orbital mechanics, drafted by graduate students and double-checked by professors. Scoring rewards not only the right answer but also the reasoning chain that leads to it.

    DeepResearch, by contrast, focuses on a model’s ability to navigate the Chinese-language web. Ten subject-matter experts created 100 questions in music, history, finance, and literature—questions that can’t just be googled but require significant research to answer. Scoring favors breadth of sources, factual consistency, and a model’s willingness to admit when there isn’t enough data. A question in the publicized collection is “How many Chinese cities in the three northwestern provinces border a foreign country?” (It’s 12, and only 33% of models tested got it right, if you are wondering.)
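To make the three DeepResearch criteria concrete, here is a minimal sketch of how a rubric along those lines could be combined into one number. The article does not describe Xbench's actual scoring formula, so the weights, the 0-to-1 scale, and the names `Judgement` and `rubric_score` are all assumptions made for illustration.

```python
from dataclasses import dataclass

# Hypothetical weights, chosen only for illustration.
# The article names the criteria but not how they are weighted.
WEIGHTS = {
    "source_breadth": 0.4,
    "factual_consistency": 0.4,
    "admits_uncertainty": 0.2,
}


@dataclass
class Judgement:
    """A grader's ratings for one answer, each on an assumed 0.0-1.0 scale."""
    source_breadth: float
    factual_consistency: float
    admits_uncertainty: float


def rubric_score(judgement: Judgement) -> float:
    """Return the weighted sum of the three ratings."""
    total = 0.0
    for name, weight in WEIGHTS.items():
        value = getattr(judgement, name)
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be between 0 and 1, got {value}")
        total += weight * value
    return total


# Example: broad sourcing, consistent facts, but no admission of missing data.
example = rubric_score(Judgement(1.0, 1.0, 0.0))
```

A weighted sum is only one possible design; a real grader might instead treat factual consistency as a hard requirement before the other criteria count at all.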

    On the company’s website, the researchers say they want to add more dimensions to the test, such as how creative a model is in its problem solving, how collaborative it is when working with other models, and how reliable it is.

    The team has committed to updating the test questions once a quarter and to maintaining a half-public, half-private data set.
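One way a half-public, half-private split with quarterly refreshes could work is sketched below. This is not Xbench's documented procedure: the hashing approach, the quarter label format, and the function name `split_public_private` are assumptions. The idea is that hashing each question ID together with the quarter gives a split that is reproducible within a quarter and reshuffled in the next one.

```python
import hashlib


def split_public_private(question_ids, quarter):
    """Split question IDs into two equal-sized halves (public, private).

    IDs are ordered by a SHA-256 hash of "<quarter>:<id>", so the same
    quarter label always yields the same split, while a new label
    reorders the questions.
    """
    def sort_key(question_id):
        payload = f"{quarter}:{question_id}".encode("utf-8")
        return hashlib.sha256(payload).hexdigest()

    ranked = sorted(question_ids, key=sort_key)
    half = len(ranked) // 2
    return set(ranked[:half]), set(ranked[half:])


# Hypothetical IDs for a 100-question set like the one described for DeepResearch.
ids = [f"q{i:03d}" for i in range(100)]
public, private = split_public_private(ids, "2025-Q3")
```

Keeping half the questions private makes it harder for answers to leak into training data, while the public half lets outsiders inspect what the benchmark measures.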

    To assess models’ real-world readiness, the team worked with experts to develop tasks modeled on actual workflows, initially in recruitment and marketing. For example, one task asks a model to source five qualified battery engineer candidates and justify each pick. Another asks it to match advertisers with appropriate short-video creators from a pool of over 800 influencers.

    The website also teases upcoming categories, including finance, legal, accounting, and design. The question sets for these categories have not yet been open-sourced.

    ChatGPT-o3 ranks first in both of the current professional categories. For recruiting, Perplexity Search and Claude 3.5 Sonnet take second and third place, respectively. For marketing, Claude, Grok, and Gemini all perform well.

    “It is really difficult for benchmarks to include things that are so hard to quantify,” says Zihan Zheng, the lead researcher on a new benchmark called LiveCodeBench Pro and a student at NYU. “But Xbench represents a promising start.”
