Close Menu
TechurzTechurz

    Subscribe to Updates

    Get the latest creative news from FooBar about art, design and business.

    What's Hot

    StrictlyVC San Francisco is in less than a month

    April 1, 2026

    Toyota’s Woven Capital appoints new CIO and COO in push for finding the ‘future of mobility’

    April 1, 2026

    Mercor says it was hit by cyberattack tied to compromise of open-source LiteLLM project

    April 1, 2026
    Facebook X (Twitter) Instagram
    Trending
    • StrictlyVC San Francisco is in less than a month
    • Toyota’s Woven Capital appoints new CIO and COO in push for finding the ‘future of mobility’
    • Mercor says it was hit by cyberattack tied to compromise of open-source LiteLLM project
    • It’s not your imagination: AI seed startups are commanding higher valuations
    • Yupp.ai shuts down after raising $33M from a16z crypto’s Chris Dixon
    • Whoop’s valuation just tripled to $10 billion
    • Nomadic raises $8.4 million to wrangle the data pouring off autonomous vehicles
    • The company behind ClassPass and Mindbody just got a lot bigger with a $7.5B merger
    Facebook X (Twitter) Instagram Pinterest Vimeo
    TechurzTechurz
    • Home
    • AI
    • Apps
    • News
    • Guides
    • Opinion
    • Reviews
    • Security
    • Startups
    TechurzTechurz
    Home»AI»A Chinese firm has just launched a constantly changing set of AI benchmarks
    AI

    A Chinese firm has just launched a constantly changing set of AI benchmarks

    TechurzBy TechurzJune 23, 2025No Comments3 Mins Read
    Share Facebook Twitter Pinterest LinkedIn Tumblr Reddit Telegram Email
    A Chinese firm has just launched a constantly changing set of AI benchmarks
    Share
    Facebook Twitter LinkedIn Pinterest Email

    Development of the benchmark at HongShan began in 2022, following ChatGPT’s breakout success, as an internal tool for assessing which models are worth investing in. Since then, led by partner Gong Yuan, the team has steadily expanded the system, bringing in outside researchers and professionals to help refine it. As the project grew more sophisticated, they decided to release it to the public.

    Xbench approached the problem with two different systems. One is similar to traditional benchmarking: an academic test that gauges a model’s aptitude on various subjects. The other is more like a technical interview round for a job, assessing how much real-world economic value a model might deliver.

    Xbench’s methods for assessing raw intelligence currently include two components: Xbench-ScienceQA and Xbench-DeepResearch. ScienceQA isn’t a radical departure from existing postgraduate-level STEM benchmarks like GPQA and SuperGPQA. It includes questions spanning fields from biochemistry to orbital mechanics, drafted by graduate students and double-checked by professors. Scoring rewards not only the right answer but also the reasoning chain that leads to it.

    DeepResearch, by contrast, focuses on a model’s ability to navigate the Chinese-language web. Ten subject-matter experts created 100 questions in music, history, finance, and literature—questions that can’t just be googled but require significant research to answer. Scoring favors breadth of sources, factual consistency, and a model’s willingness to admit when there isn’t enough data. A question in the publicized collection is “How many Chinese cities in the three northwestern provinces border a foreign country?” (It’s 12, and only 33% of models tested got it right, if you are wondering.)

    On the company’s website, the researchers said they want to add more dimensions to the test—for example, aspects like how creative a model is in its problem solving, how collaborative it is when working with other models, and how reliable it is.

    The team has committed to updating the test questions once a quarter and to maintain a half-public, half-private data set.

    To assess models’ real-world readiness, the team worked with experts to develop tasks modeled on actual workflows, initially in recruitment and marketing. For example, one task asks a model to source five qualified battery engineer candidates and justify each pick. Another asks it to match advertisers with appropriate short-video creators from a pool of over 800 influencers.

    The website also teases upcoming categories, including finance, legal, accounting, and design. The question sets for these categories have not yet been open-sourced.

    ChatGPT-o3 again ranks first in both of the current professional categories. For recruiting, Perplexity Search and Claude 3.5 Sonnet take second and third place, respectively. For marketing, Claude, Grok, and Gemini all perform well.

    “It is really difficult for benchmarks to include things that are so hard to quantify,” says Zihan Zheng, the lead researcher on a new benchmark called LiveCodeBench Pro and a student at NYU. “But Xbench represents a promising start.”

    Benchmarks changing Chinese constantly firm launched set
    Share. Facebook Twitter Pinterest LinkedIn Tumblr Email
    Previous ArticleAllSpice’s platform is the GitHub for electrical engineering teams
    Next Article How to Watch Seattle Sounders vs. PSG From Anywhere for Free: Stream FIFA Club World Cup Soccer
    Techurz
    • Website

    Related Posts

    Opinion

    A former Thiel fellow’s startup just launched a drone it says can replace police helicopters

    March 25, 2026
    Opinion

    Chinese brain interface startup Gestala raises $21M just two months after launch

    March 12, 2026
    Opinion

    a16z partner Kofi Ampadu to leave firm after TxO program pause

    January 31, 2026
    Add A Comment
    Leave A Reply Cancel Reply

    Top Posts

    College social app Fizz expands into grocery delivery

    September 3, 20252,288 Views

    A Former Apple Luminary Sets Out to Create the Ultimate GPU Software

    September 25, 202516 Views

    The Reason Murderbot’s Tone Feels Off

    May 14, 202512 Views
    Stay In Touch
    • Facebook
    • YouTube
    • TikTok
    • WhatsApp
    • Twitter
    • Instagram
    Latest Reviews

    Subscribe to Updates

    Get the latest tech news from FooBar about tech, design and biz.

    Most Popular

    College social app Fizz expands into grocery delivery

    September 3, 20252,288 Views

    A Former Apple Luminary Sets Out to Create the Ultimate GPU Software

    September 25, 202516 Views

    The Reason Murderbot’s Tone Feels Off

    May 14, 202512 Views
    Our Picks

    StrictlyVC San Francisco is in less than a month

    April 1, 2026

    Toyota’s Woven Capital appoints new CIO and COO in push for finding the ‘future of mobility’

    April 1, 2026

    Mercor says it was hit by cyberattack tied to compromise of open-source LiteLLM project

    April 1, 2026

    Subscribe to Updates

    Get the latest creative news from FooBar about art, design and business.

    Facebook X (Twitter) Instagram Pinterest
    • About Us
    • Contact Us
    • Privacy Policy
    • Terms and Conditions
    • Disclaimer
    © 2026 techurz. Designed by Pro.

    Type above and press Enter to search. Press Esc to cancel.