    New vision model from Cohere runs on two GPUs, beats top-tier VLMs on visual tasks

    By Techurz, August 2, 2025

    The growth of Deep Research features and other AI-powered analysis has spurred more models and services that aim to simplify that process and read more of the documents businesses actually use.

    Canadian AI company Cohere is banking on its models, including a newly released visual model, to make the case that Deep Research features should also be optimized for enterprise use cases. 

    The company has released Command A Vision, a vision model built on its Command A model and aimed specifically at enterprise use cases. The 112-billion-parameter model can “unlock valuable insights from visual data, and make highly accurate, data-driven decisions through document optical character recognition (OCR) and image analysis,” the company says.

    “Whether it’s interpreting product manuals with complex diagrams or analyzing photographs of real-world scenes for risk detection, Command A Vision excels at tackling the most demanding enterprise vision challenges,” the company said in a blog post. 

    This means Command A Vision can read and analyze the most common types of images enterprises need: graphs, charts, diagrams, scanned documents and PDFs. 

    @cohere just dropped Command A Vision on @huggingface

    Designed for enterprise multimodal use cases: interpreting product manuals, analyzing photos, asking about charts…

    A 112B dense vision-language model with SOTA performance – check out the benchmark metrics in… pic.twitter.com/ORMfM5f8cF

    — Jeff Boudier (@jeffboudier) July 31, 2025

    Because it is built on Command A’s architecture, Command A Vision, like the text model, runs on two or fewer GPUs. It also retains Command A’s text capabilities, so it can read words in images, and it understands at least 23 languages. Cohere said that, unlike other models, Command A Vision lowers enterprises’ total cost of ownership and is fully optimized for business retrieval use cases.

    How Cohere is architecting Command A

    Cohere said it followed a LLaVA architecture to build its Command A models, including the vision model. This architecture converts visual features into soft vision tokens, which can be divided into tiles.

    These tiles are passed into the Command A text tower, “a dense, 111B parameters textual LLM,” the company said. “In this manner, a single image consumes up to 3,328 tokens.”
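To make the 3,328-token ceiling concrete, here is a minimal Python sketch of how LLaVA-style tiling turns image size into a token budget. The tile size (512 pixels), the cap of 12 tiles, the extra global thumbnail tile and the 256 tokens per tile are all assumptions, chosen so the maximum works out to 13 × 256 = 3,328 tokens; the article does not state them, and `tile_grid` and `vision_token_count` are hypothetical helpers, not Cohere's code.

```python
import math

# ASSUMED parameters (not given in the article): tile edge in pixels,
# maximum tiles per image, and soft vision tokens produced per tile.
TILE_SIZE = 512
MAX_TILES = 12
TOKENS_PER_TILE = 256


def tile_grid(width: int, height: int, max_tiles: int = MAX_TILES) -> tuple:
    """Pick a (cols, rows) tile grid covering the image without exceeding max_tiles."""
    cols = max(1, math.ceil(width / TILE_SIZE))
    rows = max(1, math.ceil(height / TILE_SIZE))
    # Shrink the larger grid dimension until the tile budget is respected.
    while cols * rows > max_tiles:
        if cols >= rows:
            cols -= 1
        else:
            rows -= 1
    return cols, rows


def vision_token_count(width: int, height: int) -> int:
    """Soft vision tokens for one image: one token block per tile, plus a thumbnail."""
    cols, rows = tile_grid(width, height)
    num_tiles = cols * rows
    # ASSUMPTION: a downscaled thumbnail tile is added only when the image is split.
    thumbnail = 1 if num_tiles > 1 else 0
    return (num_tiles + thumbnail) * TOKENS_PER_TILE


for size in [(300, 300), (1024, 768), (4000, 4000)]:
    print(size, vision_token_count(*size))
```

Under these assumptions, a small 300×300 image fits in one tile (256 tokens), a 1024×768 screenshot becomes a 2×2 grid plus thumbnail (1,280 tokens), and a 4000×4000 photo is capped at 12 tiles plus thumbnail, the full 3,328 tokens. Capping tiles bounds the context cost of any single image, which matters when many pages are fed into the 111B text tower.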

    Cohere said it trained the vision model in three stages: vision-language alignment, supervised fine-tuning (SFT) and post-training with reinforcement learning from human feedback (RLHF).

    “This approach enables the mapping of image encoder features to the language model embedding space,” the company said of the alignment stage. “In contrast, during the SFT stage, we simultaneously trained the vision encoder, the vision adapter and the language model on a diverse set of instruction-following multimodal tasks.”
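The contrast Cohere draws between the first two stages comes down to which components receive gradient updates. The Python sketch below encodes that as a simple schedule. Only the SFT entry comes from Cohere's description; the alignment entry (adapter only) is an assumption borrowed from the original LLaVA recipe, and the RLHF stage is left out because the article does not say what it updates. `Stage` and `trainable_flags` are hypothetical names.

```python
from dataclasses import dataclass

# The three trainable parts named in Cohere's quote.
COMPONENTS = ("vision_encoder", "vision_adapter", "language_model")


@dataclass(frozen=True)
class Stage:
    name: str
    trainable: frozenset


STAGES = (
    # ASSUMPTION: as in the original LLaVA recipe, alignment updates only the adapter
    # that maps image-encoder features into the language model's embedding space.
    Stage("vision_language_alignment", frozenset({"vision_adapter"})),
    # Per Cohere's description: SFT trains encoder, adapter and language model together.
    Stage("supervised_fine_tuning", frozenset(COMPONENTS)),
    # RLHF is omitted: the article does not say which components it updates.
)


def trainable_flags(stage: Stage) -> dict:
    """Map each component to whether its weights are updated during this stage."""
    return {component: component in stage.trainable for component in COMPONENTS}


for stage in STAGES:
    print(stage.name, trainable_flags(stage))
```

In a framework such as PyTorch, flags like these would typically be applied by setting `requires_grad` on each component's parameters before a stage begins, so frozen parts still run in the forward pass but are not changed by the optimizer.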

    Visualizing enterprise AI 

    In Cohere’s own benchmark tests, Command A Vision outperformed other models with similar visual capabilities.

    Cohere pitted Command A Vision against OpenAI’s GPT-4.1, Meta’s Llama 4 Maverick, and Mistral’s Pixtral Large and Mistral Medium 3 in nine benchmark tests. The company did not say whether it tested the model against Mistral’s OCR-focused API, Mistral OCR.

    It enables agents to securely see inside your organization’s visual data, unlocking the automation of tedious tasks involving slides, diagrams, PDFs, and photos. pic.twitter.com/iHZnUWekrk

    — cohere (@cohere) July 31, 2025

    Command A Vision outscored the other models on tests such as ChartQA, OCRBench, AI2D and TextVQA. Overall, Command A Vision had an average score of 83.1%, compared with GPT-4.1’s 78.6%, Llama 4 Maverick’s 80.5% and Mistral Medium 3’s 78.3%.

    Most large language models (LLMs) today are multimodal, meaning they can generate or understand visual media such as photos or videos. Enterprises, however, work largely with graphical documents such as charts and PDFs, and extracting information from these unstructured data sources often proves difficult.

    With Deep Research on the rise, the importance of bringing in models capable of reading, analyzing and even downloading unstructured data has grown.

    Cohere also said it is offering Command A Vision as an open-weights model, in the hope that enterprises looking to move away from closed or proprietary models will start using its products. So far, developers have shown some interest.

    Very impressed at its accuracy extracting hand handwritten notes from an image!

    — Adam Sardo (@sardo_adam) July 31, 2025

    Finally, an AI that won’t judge my terrible doodles.

    — Martha Wisener (@martwisener) August 1, 2025
