Alternatives to MiMo-V2-Flash

Compare MiMo-V2-Flash alternatives for your business or organization using the curated list below. SourceForge ranks the best alternatives to MiMo-V2-Flash in 2026. Review features, ratings, user reviews, pricing, and more across MiMo-V2-Flash competitors to make an informed decision.

  • 1
    Step 3.5 Flash
    Step 3.5 Flash is an advanced open source foundation language model engineered for frontier reasoning and agentic capabilities with exceptional efficiency, built on a sparse Mixture of Experts (MoE) architecture that selectively activates only about 11 billion of its ~196 billion parameters per token to deliver high-density intelligence and real-time responsiveness. Its 3-way Multi-Token Prediction (MTP-3) enables generation throughput in the hundreds of tokens per second for complex multi-step reasoning chains and task execution, and it supports efficient long contexts with a hybrid sliding window attention approach that reduces computational overhead across large datasets or codebases. It demonstrates robust performance on benchmarks for reasoning, coding, and agentic tasks, rivaling or exceeding many larger proprietary models, and includes a scalable reinforcement learning framework for consistent self-improvement.
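The sparse activation described above can be illustrated with a generic top-k router. This is a minimal sketch of the general Mixture-of-Experts idea, not Step 3.5 Flash's actual routing code: the gate logits, the expert count, and `k` are hypothetical values chosen for illustration, and only the 11 billion and 196 billion figures come from the description.

```python
import math

def top_k_route(gate_logits, k):
    """Return (expert_index, weight) pairs for the k highest-scoring experts."""
    # Softmax over all experts (subtract the max for numerical stability).
    peak = max(gate_logits)
    exps = [math.exp(x - peak) for x in gate_logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Keep only the k most probable experts; the others are not evaluated.
    chosen = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    # Renormalize so the kept weights sum to 1.
    kept = sum(probs[i] for i in chosen)
    return [(i, probs[i] / kept) for i in chosen]

def active_fraction(active_params, total_params):
    """Share of parameters used per token."""
    return active_params / total_params

# Hypothetical gate scores for four experts; two experts are selected.
routes = top_k_route([0.1, 2.0, -1.0, 1.5], k=2)
print(routes)

# Figures quoted in the description: about 11B active of roughly 196B total.
print(f"{active_fraction(11e9, 196e9):.1%}")  # 5.6%
```

With the quoted figures, under 6% of the weights participate in any single token, which is where the efficiency claim comes from.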
  • 2
    MiMo-V2-Omni

    Xiaomi Technology

    MiMo-V2-Omni is an advanced multimodal AI model designed to handle a wide range of real-world tasks across text, code, and other data formats. It is built to support agentic workflows, enabling seamless execution of complex, multi-step processes. The model integrates strong reasoning, tool usage, and contextual understanding to deliver reliable outputs. With its ability to process diverse inputs, it enhances productivity across development, automation, and enterprise use cases. MiMo-V2-Omni focuses on delivering consistent performance in both general and specialized tasks.
  • 3
    MiMo-V2-Pro

    Xiaomi Technology

    Xiaomi MiMo-V2-Pro is a flagship AI foundation model designed to power real-world agentic workflows and complex task execution. It is built to function as the core intelligence behind agent systems, enabling orchestration of multi-step processes and production-level tasks. The model demonstrates strong capabilities in coding, tool usage, and search-based tasks, performing competitively on global benchmarks. With its large-scale architecture and extended context window, it can handle long and complex interactions efficiently. MiMo-V2-Pro is optimized for practical applications, delivering reliable performance across development, automation, and enterprise workflows.
    Starting Price: $1/million tokens
  • 4
    Nemotron 3 Super
    Nemotron-3 Super is part of NVIDIA’s Nemotron 3 family of open models designed to enable advanced agentic AI systems that can reason, plan, and execute multi-step workflows across complex environments. The model introduces a hybrid Mamba-Transformer Mixture-of-Experts architecture that combines the efficiency of state-space Mamba layers with the contextual understanding of transformer attention, allowing it to process long sequences and complex reasoning tasks with high accuracy and throughput. This architecture activates only a subset of model parameters for each token, improving computational efficiency while maintaining strong reasoning capabilities and enabling scalable inference for large workloads. Nemotron-3 Super contains roughly 120 billion parameters with around 12 billion active during inference, accelerating multi-step reasoning and collaborative agent interactions across large contexts.
  • 5
    Nemotron 3 Ultra
    Nemotron 3 Nano is a compact, open large language model in NVIDIA’s Nemotron 3 family, designed for efficient agentic reasoning, conversational AI, and coding tasks. It uses a hybrid Mixture-of-Experts Mamba-Transformer architecture that activates only a small subset of parameters per token, enabling low-latency inference while maintaining strong accuracy and reasoning performance. It has approximately 31.6 billion total parameters with around 3.2 billion active (3.6 billion including embeddings), allowing it to achieve higher accuracy than previous Nemotron 2 Nano while using less computation per forward pass. Nemotron 3 Nano supports long-context processing of up to one million tokens, enabling it to handle large documents, multi-step workflows, and extended reasoning chains in a single pass. It is designed for high-throughput, real-time execution, excelling in multi-turn conversations, tool calling, and agent-based workflows where tasks require planning, reasoning, and more.
  • 6
    Kimi K2 Thinking

    Moonshot AI

    Kimi K2 Thinking is an advanced open source reasoning model developed by Moonshot AI, designed specifically for long-horizon, multi-step workflows where the system interleaves chain-of-thought processes with tool invocation across hundreds of sequential tasks. The model uses a mixture-of-experts architecture with a total of 1 trillion parameters, yet only about 32 billion parameters are activated per inference pass, optimizing efficiency while maintaining vast capacity. It supports a context window of up to 256,000 tokens, enabling the handling of extremely long inputs and reasoning chains without losing coherence. Native INT4 quantization is built in, which reduces inference latency and memory usage without performance degradation. Kimi K2 Thinking is explicitly built for agentic workflows; it can autonomously call external tools, manage sequential logic steps (typically 200-300 tool calls in a single chain), and maintain consistent reasoning.
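A rough back-of-the-envelope calculation shows why 4-bit weights matter at this scale. The sketch below is an estimate only: it counts raw weight storage and ignores quantization scales, activations, and the KV cache, and the 16-bit baseline is an assumption added for comparison rather than a figure from the description.

```python
def weight_memory_gb(num_params, bits_per_param):
    """Approximate weight storage in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9

total_params = 1e12  # 1 trillion total parameters, as stated above

fp16_gb = weight_memory_gb(total_params, 16)  # assumed 16-bit baseline
int4_gb = weight_memory_gb(total_params, 4)   # INT4 weights

print(fp16_gb)            # 2000.0
print(int4_gb)            # 500.0
print(fp16_gb / int4_gb)  # 4.0
```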
  • 7
    Xiaomi MiMo

    Xiaomi Technology

    The Xiaomi MiMo API open platform is a developer-oriented interface for accessing and integrating Xiaomi’s MiMo family of AI models, including reasoning and language models such as MiMo-V2-Flash, into applications and services through standardized APIs and cloud endpoints, enabling developers to build AI-enabled features like conversational agents, reasoning workflows, code assistance, and search-augmented tasks without managing model infrastructure themselves. It offers REST-style API access with authentication, request signing, and structured responses so software can send prompts and receive generated text or processed outputs programmatically, and it supports common operations like text generation, prompt handling, and inference over MiMo models. By providing documentation and onboarding tools, the open platform lets teams integrate Xiaomi’s latest open source large language models, which leverage Mixture-of-Experts (MoE) architectures.
  • 8
    Phi-4-mini-flash-reasoning
    Phi-4-mini-flash-reasoning is a 3.8 billion‑parameter open model in Microsoft’s Phi family, purpose‑built for edge, mobile, and other resource‑constrained environments where compute, memory, and latency are tightly limited. It introduces the SambaY decoder‑hybrid‑decoder architecture with Gated Memory Units (GMUs) interleaved alongside Mamba state‑space and sliding‑window attention layers, delivering up to 10× higher throughput and a 2–3× reduction in latency compared to its predecessor without sacrificing advanced math and logic reasoning performance. Supporting a 64 K‑token context length and fine‑tuned on high‑quality synthetic data, it excels at long‑context retrieval, reasoning tasks, and real‑time inference, all deployable on a single GPU. Phi-4-mini-flash-reasoning is available today via Azure AI Foundry, NVIDIA API Catalog, and Hugging Face, enabling developers to build fast, scalable, logic‑intensive applications.
  • 9
    DeepSeek-V2

    DeepSeek

    DeepSeek-V2 is a state-of-the-art Mixture-of-Experts (MoE) language model introduced by DeepSeek-AI, characterized by its economical training and efficient inference capabilities. With a total of 236 billion parameters, of which only 21 billion are active per token, it supports a context length of up to 128K tokens. DeepSeek-V2 employs innovative architectures like Multi-head Latent Attention (MLA) for efficient inference by compressing the Key-Value (KV) cache and DeepSeekMoE for cost-effective training through sparse computation. This model significantly outperforms its predecessor, DeepSeek 67B, by saving 42.5% in training costs, reducing the KV cache by 93.3%, and enhancing generation throughput by 5.76 times. Pretrained on an 8.1 trillion token corpus, DeepSeek-V2 excels in language understanding, coding, and reasoning tasks, making it a top-tier performer among open-source models.
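The percentages quoted above are easier to compare once converted into ratios. The snippet below is plain arithmetic on the figures in the description, with no additional measurements; the variable names are illustrative.

```python
total_b, active_b = 236, 21   # billions of parameters, as quoted
kv_reduction = 0.933          # KV cache reduced by 93.3%
training_saving = 0.425       # training cost reduced by 42.5%

active_share = active_b / total_b              # fraction of weights used per token
kv_remaining = 1 - kv_reduction                # fraction of the KV cache that remains
kv_shrink_factor = 1 / kv_remaining            # how many times smaller the cache is
training_cost_ratio = 1 - training_saving      # cost relative to DeepSeek 67B

print(f"active share:        {active_share:.1%}")
print(f"KV cache remaining:  {kv_remaining:.1%}")
print(f"KV cache shrink:     {kv_shrink_factor:.1f}x")
print(f"training cost ratio: {training_cost_ratio:.1%}")
```

Read this way, a 93.3% reduction means the KV cache is roughly one fifteenth of its former size, and fewer than one in ten parameters is active for each token.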
  • 10
    GLM-4.5
    GLM‑4.5 is Z.ai’s latest flagship model in the GLM family, engineered with 355 billion total parameters (32 billion active) and a companion GLM‑4.5‑Air variant (106 billion total, 12 billion active) to unify advanced reasoning, coding, and agentic capabilities in one architecture. It operates in a “thinking” mode for complex, multi‑step reasoning and tool use, and a “non‑thinking” mode for instant responses, supporting up to 128 K token context length and native function calling. Available via the Z.ai chat platform and API, with open weights on Hugging Face and ModelScope, GLM‑4.5 handles general problem solving, common‑sense reasoning, coding from scratch or within existing projects, and end‑to‑end agent workflows such as web browsing and slide generation. Built on a Mixture‑of‑Experts design with loss‑free balance routing, grouped‑query attention, and an MTP layer for speculative decoding, it delivers enterprise‑grade performance.
  • 11
    Ministral 8B

    Mistral AI

    Mistral AI has introduced two advanced models for on-device computing and edge applications, named "les Ministraux": Ministral 3B and Ministral 8B. These models excel in knowledge, commonsense reasoning, function-calling, and efficiency within the sub-10B parameter range. They support up to 128k context length and are designed for various applications, including on-device translation, offline smart assistants, local analytics, and autonomous robotics. Ministral 8B features an interleaved sliding-window attention pattern for faster and more memory-efficient inference. Both models can function as intermediaries in multi-step agentic workflows, handling tasks like input parsing, task routing, and API calls based on user intent with low latency and cost. Benchmark evaluations indicate that les Ministraux consistently outperform comparable models across multiple tasks. Both models have been available since October 16, 2024, with Ministral 8B priced at $0.1 per million tokens.
  • 12
    GigaChat 3 Ultra
    GigaChat 3 Ultra is a 702-billion-parameter Mixture-of-Experts model built from scratch to deliver frontier-level reasoning, multilingual capability, and deep Russian-language fluency. It activates just 36 billion parameters per token, enabling massive scale with practical inference speeds. The model was trained on a 14-trillion-token corpus combining natural, multilingual, and high-quality synthetic data to strengthen reasoning, math, coding, and linguistic performance. Unlike modified foreign checkpoints, GigaChat 3 Ultra is entirely original—giving developers full control, modern alignment, and a dataset free of inherited limitations. Its architecture leverages MoE, MTP, and MLA to match open-source ecosystems and integrate easily with popular inference and fine-tuning tools. With leading results on Russian benchmarks and competitive performance on global tasks, GigaChat 3 Ultra represents one of the largest and most capable open-source LLMs in the world.
  • 13
    Kimi K2

    Moonshot AI

    Kimi K2 is a state-of-the-art open source large language model series built on a mixture-of-experts (MoE) architecture, featuring 1 trillion total parameters and 32 billion activated parameters for task-specific efficiency. Trained with the Muon optimizer on over 15.5 trillion tokens and stabilized by MuonClip’s attention-logit clamping, it delivers exceptional performance in frontier knowledge, reasoning, mathematics, coding, and general agentic workflows. Moonshot AI provides two variants, Kimi-K2-Base for research-level fine-tuning and Kimi-K2-Instruct, which is post-trained for immediate chat and tool-driven interactions, enabling both custom development and drop-in agentic capabilities. Benchmarks show it outperforms leading open source peers and rivals top proprietary models in coding tasks and complex task breakdowns. It also offers a 128 K-token context length, tool-calling API compatibility, and support for industry-standard inference engines.
  • 14
    Qwen3.6-35B-A3B
    Qwen3.5-35B-A3B is part of the Qwen3.5 “Medium” model series, designed as a highly efficient, multimodal foundation model that balances strong reasoning ability with practical deployment requirements. It uses a Mixture-of-Experts (MoE) architecture with 35 billion total parameters but activates only about 3 billion per token, allowing it to deliver performance comparable to much larger models while significantly reducing computational cost. The model integrates a hybrid attention mechanism that combines linear attention with standard attention layers, enabling efficient long-context processing and improved scalability for complex tasks. As a native vision-language model, it can process both text and visual inputs, supporting use cases such as multimodal reasoning, coding, and agent-based workflows. It is designed to function as a general-purpose “AI agent,” capable of planning, tool use, and structured problem solving rather than just conversational responses.
  • 15
    GLM-4.7-Flash
    GLM-4.7 Flash is a lightweight variant of GLM-4.7, Z.ai’s flagship large language model designed for advanced coding, reasoning, and multi-step task execution with strong agentic performance and a very large context window. It is an MoE-based model optimized for efficient inference that balances performance and resource use, enabling deployment on local machines with moderate memory requirements while maintaining deep reasoning, coding, and agentic task abilities. GLM-4.7 itself advances over earlier generations with enhanced programming capabilities, stable multi-step reasoning, context preservation across turns, and improved tool-calling workflows, and supports very long context lengths (up to ~200 K tokens) for complex tasks that span large inputs or outputs. The Flash variant retains many of these strengths in a smaller footprint, offering competitive benchmark performance in coding and reasoning tasks for models in its size class.
  • 16
    Ministral 3B

    Mistral AI

    Mistral AI introduced two state-of-the-art models for on-device computing and edge use cases, named "les Ministraux": Ministral 3B and Ministral 8B. These models set a new frontier in knowledge, commonsense reasoning, function-calling, and efficiency in the sub-10B category. They can be used or tuned for various applications, from orchestrating agentic workflows to creating specialist task workers. Both models support up to 128k context length (currently 32k on vLLM), and Ministral 8B features a special interleaved sliding-window attention pattern for faster and memory-efficient inference. These models were built to provide a compute-efficient and low-latency solution for scenarios such as on-device translation, internet-less smart assistants, local analytics, and autonomous robotics. Used in conjunction with larger language models like Mistral Large, les Ministraux also serve as efficient intermediaries for function-calling in multi-step agentic workflows.
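To make the sliding-window idea concrete, the sketch below builds a causal attention mask in which each position sees only the most recent few tokens. It is a generic illustration, not Mistral AI's implementation: the function name, the sequence length, and the window size are hypothetical, and the interleaving of window patterns across layers is not modeled.

```python
def sliding_window_mask(seq_len, window):
    """mask[i][j] is True when query position i may attend to key position j."""
    return [
        [(j <= i) and (i - j < window) for j in range(seq_len)]
        for i in range(seq_len)
    ]

mask = sliding_window_mask(seq_len=6, window=3)
for row in mask:
    print("".join("x" if allowed else "." for allowed in row))

# Attended pairs: full causal attention vs. the windowed variant.
full_causal = 6 * (6 + 1) // 2
windowed = sum(sum(row) for row in mask)
print(full_causal, windowed)  # 21 15
```

With a fixed window, the number of attended pairs grows linearly with sequence length instead of quadratically, which is the source of the memory and speed savings mentioned above.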
  • 17
    Trinity-Large-Thinking
    Trinity Large Thinking is a frontier open source reasoning model developed by Arcee AI, designed specifically for complex, multi-step problem solving and autonomous agent workflows that require long-horizon planning and tool use. Built on a sparse Mixture-of-Experts architecture with roughly 400 billion total parameters but only about 13 billion active per token, the model achieves high efficiency while maintaining strong reasoning performance across tasks such as mathematical problem solving, code generation, and multi-step analysis. It introduces extended chain-of-thought reasoning capabilities, allowing the model to generate intermediate “thinking traces” before producing final answers, which improves accuracy and reliability in complex scenarios. Trinity Large Thinking supports a very large context window of up to 262K tokens, enabling it to process long documents, maintain state across extended interactions, and operate effectively in continuous agent loops.
  • 18
    Qwen3.5

    Alibaba

    Qwen3.5 is a next-generation open-weight multimodal large language model designed to power native vision-language agents. The flagship release, Qwen3.5-397B-A17B, combines a hybrid linear attention architecture with sparse mixture-of-experts, activating only 17 billion parameters per forward pass out of 397 billion total to maximize efficiency. It delivers strong benchmark performance across reasoning, coding, multilingual understanding, visual reasoning, and agent-based tasks. The model expands language support from 119 to 201 languages and dialects while introducing a 1M-token context window in its hosted version, Qwen3.5-Plus. Built for multimodal tasks, it processes text, images, and video with advanced spatial reasoning and tool integration. Qwen3.5 also incorporates scalable reinforcement learning environments to improve general agent capabilities. Designed for developers and enterprises, it enables efficient, tool-augmented, multimodal AI workflows.
  • 19
    Qwen3-Max

    Alibaba

    Qwen3-Max is Alibaba’s latest trillion-parameter large language model, designed to push performance in agentic tasks, coding, reasoning, and long-context processing. It is built atop the Qwen3 family and benefits from the architectural, training, and inference advances introduced there, including mixed thinking and non-thinking modes, a “thinking budget” mechanism, and support for dynamic mode switching based on complexity. The model reportedly processes extremely long inputs (hundreds of thousands of tokens), supports tool invocation, and exhibits strong performance on coding, multi-step reasoning, and agent benchmarks (e.g., Tau2-Bench). While its initial variant emphasizes instruction following (non-thinking mode), Alibaba plans to bring reasoning capabilities online to enable autonomous agent behavior. Qwen3-Max inherits multilingual support and extensive pretraining on trillions of tokens, and it is delivered via API interfaces compatible with OpenAI-style functions.
  • 20
    Xiaomi MiMo Studio

    Xiaomi Technology

    MiMo Studio is a web-based AI chat and development interface powered by Xiaomi’s MiMo models that lets users interact directly with advanced language models like MiMo-V2-Flash for real-time conversational AI, search-augmented responses, reasoning, and code generation. It acts like an interactive “AI playground” where users can chat with the model to get answers, ask for explanations, generate or debug code, and explore ideas interactively without installing software. It supports features such as web search integration and toggleable modes that switch between instant replies and deeper “thinking” responses for more complex tasks, helping developers and creators explore tasks from research to functional output. Because it’s browser-based, it provides easy online access to Xiaomi’s cutting-edge AI models, enabling experimentation with large-context reasoning, problem solving, and multi-turn interactions.
  • 21
    Qwen3.5-Plus
    Qwen3.5-Plus is a high-performance native vision-language model designed for efficient text generation, deep reasoning, and multimodal understanding. Built on a hybrid architecture that combines linear attention with a sparse mixture-of-experts design, it delivers strong performance while optimizing inference efficiency. The model supports text, image, and video inputs and produces text outputs, making it suitable for complex multimodal workflows. With a massive 1 million token context window and up to 64K output tokens, Qwen3.5-Plus enables long-form reasoning and large-scale document analysis. It includes advanced capabilities such as structured outputs, function calling, web search, and tool integration via the Responses API. The model supports prefix continuation, caching, batch processing, and fine-tuning for flexible deployment. Designed for developers and enterprises, Qwen3.5-Plus provides scalable, high-throughput AI performance with OpenAI-compatible API access.
    Starting Price: $0.4 per 1M tokens
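As a rough guide to what the listed starting price implies, the sketch below estimates cost from a token count. It assumes a single flat rate for all tokens; the listing does not say whether the price applies to input, output, or both, so the helper name and the flat-rate model are assumptions for illustration only.

```python
def estimate_cost_usd(num_tokens, price_per_million_usd):
    """Estimated cost in US dollars at a flat per-million-token rate."""
    return num_tokens / 1_000_000 * price_per_million_usd

LISTED_PRICE = 0.4  # USD per 1M tokens, the starting price shown above

# One request that fills the full 1 million token context window.
print(f"${estimate_cost_usd(1_000_000, LISTED_PRICE):.2f}")

# A smaller workload of 250,000 tokens.
print(f"${estimate_cost_usd(250_000, LISTED_PRICE):.2f}")
```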
  • 22
    Mistral Small 4
    Mistral Small 4 is an advanced open-source AI model developed by Mistral AI that combines reasoning, coding, and multimodal capabilities into a single system. It unifies the strengths of previous models such as Magistral for reasoning, Pixtral for multimodal processing, and Devstral for agentic coding tasks. The model can handle both text and image inputs, allowing it to perform tasks ranging from conversational chat to visual analysis and document understanding. Built with a mixture-of-experts architecture, Mistral Small 4 delivers efficient performance while scaling to complex workloads. It also features a configurable reasoning parameter that allows users to switch between fast responses and deeper analytical outputs. With a large context window and optimized inference performance, the model supports long-form interactions and complex workflows.
  • 23
    Chat Stream

    Chat Stream

    Chat Stream provides access to two powerful language models from DeepSeek.
    Model Capabilities: Utilizes DeepSeek V3 and R1 models with 671B parameters (37B activated per token). Achieves exceptional benchmark scores: MMLU (87.1%), BBH (87.5%). Features a 128K context window. Supports code generation, mathematical computation, and multilingual processing.
    Technical Features: Advanced MoE (Mixture-of-Experts) architecture. Multi-head Latent Attention (MLA). Auxiliary-loss-free load balancing. Multi-token prediction objective.
    Deployment Options: Web-based chat interface with instant access. One-click website integration via iframe. Mobile apps for iOS and Android platforms. Compatible with NVIDIA and AMD GPUs and Huawei Ascend NPUs. Supports both local inference and cloud deployment.
    Access Methods: Free chat access without registration. Website embedding capabilities. Mobile applications. Premium subscription for ad-free experience.
  • 24
    GLM-4.5V

    Zhipu AI

    GLM-4.5V builds on the GLM-4.5-Air foundation, using a Mixture-of-Experts (MoE) architecture with 106 billion total parameters and 12 billion active parameters. It achieves state-of-the-art performance among open-source VLMs of similar scale across 42 public benchmarks, excelling in image, video, document, and GUI-based tasks. It supports a broad range of multimodal capabilities, including image reasoning (scene understanding, spatial recognition, multi-image analysis), video understanding (segmentation, event recognition), complex chart and long-document parsing, GUI-agent workflows (screen reading, icon recognition, desktop automation), and precise visual grounding (e.g., locating objects and returning bounding boxes). GLM-4.5V also introduces a “Thinking Mode” switch, allowing users to choose between fast responses and deeper reasoning when needed.
  • 25
    DeepSeek-V4

    DeepSeek

    DeepSeek V4 is an advanced AI model designed to push the boundaries of large-scale artificial intelligence with an estimated 1 trillion parameters. It utilizes a Mixture-of-Experts architecture, activating only a fraction of its parameters per task to improve efficiency. The model supports a massive context window of up to 1 million tokens, enabling it to process long documents and complex codebases. It is natively multimodal, allowing it to understand and generate text, images, audio, and video. DeepSeek V4 introduces innovations such as Engram memory, sparse attention mechanisms, and improved training stability techniques. It is expected to deliver high performance in areas like software engineering and reasoning while maintaining lower operational costs. Overall, DeepSeek V4 aims to combine scalability, efficiency, and affordability to compete with leading AI models.
  • 26
    Nemotron 3 Nano
    Nemotron 3 Nano is the smallest model in the NVIDIA Nemotron 3 family, built for agentic AI applications with strong reasoning, conversational ability, and cost-efficient inference. It is a hybrid Mamba-Transformer Mixture-of-Experts model with 3.2 billion active parameters, 3.6 billion including embeddings, and 31.6 billion total parameters. NVIDIA describes it as more accurate than the previous Nemotron 2 Nano while activating less than half of the parameters per forward pass, improving efficiency without sacrificing performance. The model is positioned as more accurate than GPT-OSS-20B and Qwen3-30B-A3B-Thinking-2507 on popular benchmarks across different categories. On an 8K input and 16K output setting using a single H200, it delivers inference throughput 3.3 times higher than Qwen3-30B-A3B and 2.2 times higher than GPT-OSS-20B. Nemotron 3 Nano supports context lengths up to 1 million tokens and is reported to outperform GPT-OSS-20B and Qwen3-30B-A3B-Instruct-2507.
  • 27
    GLM-5.1

    Zhipu AI

    GLM-5.1 is the latest iteration of Z.ai’s GLM series, designed as a frontier-level, agent-oriented AI model optimized for coding, reasoning, and long-horizon workflows. It builds on the GLM-5 architecture, which uses a Mixture-of-Experts (MoE) design to deliver high performance while keeping inference costs efficient, and is part of a broader push toward open-weight, developer-accessible models. A core focus of GLM-5.1 is enabling agentic behavior, meaning it can plan, execute, and iterate across multi-step tasks rather than simply responding to single prompts. It is specifically designed to handle complex workflows such as debugging code, navigating repositories, and executing chained operations with sustained context. Compared to earlier models, GLM-5.1 improves reliability in long interactions, maintaining coherence across extended sessions and reducing breakdowns in multi-step reasoning.
  • 28
    GLM-4.7-FlashX
    GLM-4.7 FlashX is a lightweight, high-speed version of the GLM-4.7 large language model created by Z.ai that balances efficiency and performance for real-time AI tasks across English and Chinese while offering the core capabilities of the broader GLM-4.7 family in a more resource-friendly package. It is positioned alongside GLM-4.7 and GLM-4.7 Flash, delivering optimized agentic coding and general language understanding with faster response times and lower resource needs, making it suitable for applications that require rapid inference without heavy infrastructure. As part of the GLM-4.7 model series, it inherits the model’s strengths in programming, multi-step reasoning, and robust conversational understanding, and it supports long contexts for complex tasks while remaining lightweight enough for deployment with constrained compute budgets.
    Starting Price: $0.07 per 1M tokens
  • 29
    Falcon-7B

    Technology Innovation Institute (TII)

    Falcon-7B is a 7B-parameter causal decoder-only model built by TII and trained on 1,500B tokens of RefinedWeb enhanced with curated corpora. That training data helps it outperform comparable open-source models (e.g., MPT-7B, StableLM, RedPajama); see the OpenLLM Leaderboard. Its architecture is optimized for inference, with FlashAttention and multiquery attention. It is available under the permissive Apache 2.0 license, which allows commercial use without royalties or restrictions.
  • 30
    Olmo 3
    Olmo 3 is a fully open model family spanning 7 billion and 32 billion parameter variants that delivers not only high-performing base, reasoning, instruction, and reinforcement-learning models, but also exposure of the entire model flow, including raw training data, intermediate checkpoints, training code, long-context support (65,536 token window), and provenance tooling. Starting with the Dolma 3 dataset (≈9 trillion tokens) and its disciplined mix of web text, scientific PDFs, code, and long-form documents, the pre-training, mid-training, and long-context phases shape the base models, which are then post-trained via supervised fine-tuning, direct preference optimisation, and RL with verifiable rewards to yield the Think and Instruct variants. The 32 B Think model is described as the strongest fully open reasoning model to date, competitively close to closed-weight peers in math, code, and complex reasoning.
  • 31
    DeepSeek R1

    DeepSeek

    DeepSeek-R1 is an advanced open-source reasoning model developed by DeepSeek, designed to rival OpenAI's o1 model. Accessible via web, app, and API, it excels in complex tasks such as mathematics and coding, demonstrating strong performance on benchmarks like the American Invitational Mathematics Examination (AIME) and MATH. DeepSeek-R1 employs a mixture of experts (MoE) architecture with 671 billion total parameters, activating 37 billion parameters per token, enabling efficient and accurate reasoning capabilities. This model is part of DeepSeek's commitment to advancing artificial general intelligence (AGI) through open-source innovation.
  • 32
    SWE-1.5

    Cognition

    SWE-1.5 is the latest agent-model release by Cognition, purpose-built for software engineering and characterized by a “frontier-size” architecture comprising hundreds of billions of parameters and optimized end-to-end (model, inference engine, and agent harness) for both speed and intelligence. It achieves near-state-of-the-art coding performance and sets a new benchmark in latency, delivering inference speeds up to 950 tokens/second, roughly six times faster than Haiku 4.5 and thirteen times faster than Sonnet 4.5. The model was trained using extensive reinforcement learning in realistic coding-agent environments with multi-turn workflows, unit tests, quality rubrics, and browser-based agentic execution; it also benefits from tightly integrated software tooling and high-throughput hardware (including thousands of GB200 NVL72 chips and a custom hypervisor infrastructure).
  • 33
    MiniMax M2

    MiniMax

    MiniMax M2 is an open source foundation model built specifically for agentic applications and coding workflows, striking a new balance of performance, speed, and cost. It excels in end-to-end development scenarios, handling programming, tool-calling, and complex, long-chain workflows with capabilities such as Python integration, while delivering inference speeds of around 100 tokens per second and offering API pricing at just ~8% of the cost of comparable proprietary models. The model supports “Lightning Mode” for high-speed, lightweight agent tasks, and “Pro Mode” for in-depth full-stack development, report generation, and web-based tool orchestration; its weights are fully open source and available for local deployment with vLLM or SGLang. MiniMax M2 positions itself as a production-ready model that enables agents to complete independent tasks, such as data analysis, programming, tool orchestration, and large-scale multi-step logic at real organizational scale.
    Starting Price: $0.30 per million input tokens
  • 34
    Yi-Lightning

    Yi-Lightning

    Yi-Lightning, developed by 01.AI under the leadership of Kai-Fu Lee, is a large language model focused on high performance and cost-efficiency. It has a maximum context length of 16K tokens and is priced at $0.14 per million tokens for both input and output. Yi-Lightning uses an enhanced Mixture-of-Experts (MoE) architecture with fine-grained expert segmentation and advanced routing strategies, which contribute to its efficiency in training and inference. The model has achieved top rankings in categories such as Chinese, math, coding, and hard prompts on Chatbot Arena, where it secured 6th position overall and 9th in style control. Its development included comprehensive pre-training, supervised fine-tuning, and reinforcement learning from human feedback, targeting both performance and safety, with optimizations in memory usage and inference speed.
  • 35
    GLM-4.6V

    Zhipu AI

    GLM-4.6V is a state-of-the-art open source multimodal vision-language model from the Z.ai (GLM-V) family designed for reasoning, perception, and action. It ships in two variants: a full-scale version (106B parameters) for cloud or high-performance clusters, and a lightweight “Flash” variant (9B) optimized for local deployment or low-latency use. GLM-4.6V is trained with a context window of up to 128K tokens, enabling it to process very long documents or multimodal inputs. Crucially, it integrates native Function Calling, meaning the model can take images, screenshots, documents, or other visual media as input directly (without manual text conversion), reason about them, and trigger tool calls, bridging “visual perception” with “executable action.” This enables a wide spectrum of capabilities, such as interleaved image-and-text content generation (for example, combining document understanding with text summarization or generation of image-annotated responses).
  • 36
    Mistral 7B

    Mistral AI

    Mistral 7B is a 7.3-billion-parameter language model that outperforms larger models like Llama 2 13B across various benchmarks. It employs Grouped-Query Attention (GQA) for faster inference and Sliding Window Attention (SWA) to efficiently handle longer sequences. Released under the Apache 2.0 license, Mistral 7B is accessible for deployment across diverse platforms, including local environments and major cloud services. Additionally, a fine-tuned version, Mistral 7B Instruct, demonstrates enhanced performance in instruction-following tasks, surpassing models like Llama 2 13B Chat.
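    To make the Sliding Window Attention idea above concrete, here is a minimal sketch of a causal sliding-window mask in plain Python. It is an illustration under stated assumptions, not Mistral's implementation: the function names, the toy sequence length of 8, and the window of 3 are all hypothetical, chosen only to keep the example readable.

```python
def sliding_window_mask(seq_len: int, window: int) -> list[list[bool]]:
    """Causal mask: token i may attend to token j only if i - window < j <= i."""
    return [
        [(j <= i) and (i - j < window) for j in range(seq_len)]
        for i in range(seq_len)
    ]


def attended_pairs(mask: list[list[bool]]) -> int:
    """Count the (query, key) pairs that the mask leaves visible."""
    return sum(cell for row in mask for cell in row)


# Hypothetical toy sizes, for illustration only.
full = sliding_window_mask(8, 8)   # window >= seq_len: ordinary causal attention
local = sliding_window_mask(8, 3)  # each token sees itself and the 2 tokens before it

print(attended_pairs(full), attended_pairs(local))
```

    With a fixed window, the number of visible pairs grows linearly with sequence length rather than quadratically, which is the source of the efficiency gain on longer sequences.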
  • 37
    Qwen2

    Alibaba

    Qwen2 is a series of large language models developed by the Qwen team at Alibaba Cloud. It includes both base language models and instruction-tuned models, ranging from 0.5 billion to 72 billion parameters, and features both dense models and a Mixture-of-Experts model. The Qwen2 series is designed to surpass most previous open-weight models, including its predecessor Qwen1.5, and to compete with proprietary models across a broad spectrum of benchmarks in language understanding, generation, multilingual capabilities, coding, mathematics, and reasoning.
  • 38
    Falcon-40B

    Technology Innovation Institute (TII)

    Falcon-40B is a 40B-parameter causal decoder-only model built by TII and trained on 1,000B tokens of RefinedWeb enhanced with curated corpora. According to its model card, at release it was the best open-source model available, outperforming models such as LLaMA, StableLM, RedPajama, and MPT on the OpenLLM Leaderboard. Its architecture is optimized for inference, with FlashAttention and multiquery attention. It is released under a permissive Apache 2.0 license that allows commercial use without royalties or restrictions. ⚠️ This is a raw, pretrained model that should be further fine-tuned for most use cases. For a version better suited to taking generic instructions in a chat format, see Falcon-40B-Instruct.
  • 39
    Qwen Code
    Qwen3-Coder is an agentic code model available in multiple sizes, led by the 480B-parameter Mixture-of-Experts variant (35B active) that natively supports 256K-token contexts (extendable to 1M) and achieves state-of-the-art results on Agentic Coding, Browser-Use, and Tool-Use tasks comparable to Claude Sonnet 4. Pre-training on 7.5T tokens (70% code) and synthetic data cleaned via Qwen2.5-Coder optimized both coding proficiency and general abilities, while post-training employs large-scale, execution-driven reinforcement learning and long-horizon RL across 20,000 parallel environments to excel on multi-turn software-engineering benchmarks like SWE-Bench Verified without test-time scaling. Alongside the model, the open source Qwen Code CLI (forked from Gemini CLI) runs Qwen3-Coder in agentic workflows with customized prompts, function calling protocols, and integration with Node.js, OpenAI SDKs, and more.
  • 40
    Llama 4 Scout
    Llama 4 Scout is a powerful 17 billion active parameter multimodal AI model that excels in both text and image processing. With an industry-leading context length of 10 million tokens, it outperforms its predecessors, including Llama 3, in tasks such as multi-document summarization and parsing large codebases. Llama 4 Scout is designed to handle complex reasoning tasks while maintaining high efficiency, making it perfect for use cases requiring long-context comprehension and image grounding. It offers cutting-edge performance in image-related tasks and is particularly well-suited for applications requiring both text and visual understanding.
  • 41
    Qwen3-Coder
    Qwen3-Coder is an agentic code model available in multiple sizes, led by the 480B-parameter Mixture-of-Experts variant (35B active) that natively supports 256K-token contexts (extendable to 1M) and achieves state-of-the-art results comparable to Claude Sonnet 4. Pre-training on 7.5T tokens (70% code) and synthetic data cleaned via Qwen2.5-Coder optimized both coding proficiency and general abilities, while post-training employs large-scale, execution-driven reinforcement learning, scaled test-case generation for diverse coding challenges, and long-horizon RL across 20,000 parallel environments to excel on multi-turn software-engineering benchmarks like SWE-Bench Verified without test-time scaling. Alongside the model, the open source Qwen Code CLI (forked from Gemini CLI) runs Qwen3-Coder in agentic workflows with customized prompts, function calling protocols, and integration with Node.js, OpenAI SDKs, and environment variables.
  • 42
    Phi-4-reasoning-plus
    Phi-4-reasoning-plus is a 14-billion-parameter open-weight reasoning model that builds upon Phi-4-reasoning capabilities. It is further trained with reinforcement learning to utilize more inference-time compute, using 1.5x more tokens than Phi-4-reasoning, to deliver higher accuracy. Despite its significantly smaller size, Phi-4-reasoning-plus achieves better performance than OpenAI o1-mini and DeepSeek-R1-Distill-Llama-70B on most benchmarks, including mathematical reasoning and Ph.D.-level science questions. It surpasses the full DeepSeek-R1 model (with 671 billion parameters) on the AIME 2025 test, the 2025 qualifier for the USA Math Olympiad. Phi-4-reasoning-plus is available on Azure AI Foundry and Hugging Face.
  • 43
    Kimi K2.6

    Moonshot AI

    Kimi K2.6 is a next-generation agentic AI model developed by Moonshot AI, designed to push forward real-world execution, coding, and multi-step reasoning beyond earlier K2 and K2.5 versions. It builds on a Mixture-of-Experts architecture and the multimodal, agent-first foundation of the Kimi series, combining language understanding, coding, and tool use into a single system capable of planning and executing complex workflows. It introduces deeper reasoning capabilities and significantly improved agent planning, allowing it to break down tasks, coordinate tools, and handle multi-file or multi-step problems with greater accuracy and efficiency. It supports advanced tool calling with high reliability, enabling integration with external systems such as web search or APIs, and includes built-in validation mechanisms to ensure correct execution formats.
  • 44
    Grok 4.1 Fast
    Grok 4.1 Fast is an xAI model designed to deliver advanced tool-calling capabilities with a massive 2-million-token context window. It excels at complex real-world tasks such as customer support, finance, troubleshooting, and dynamic agent workflows. The model pairs seamlessly with the new Agent Tools API, which enables real-time web search, X search, file retrieval, and secure code execution. This combination gives developers the power to build fully autonomous, production-grade agents that plan, reason, and use tools effectively. Grok 4.1 Fast is trained with long-horizon reinforcement learning, ensuring stable multi-turn accuracy even across extremely long prompts. With its speed, cost-efficiency, and high benchmark scores, it sets a new standard for scalable enterprise-grade AI agents.
  • 45
    Seed2.0 Pro

    ByteDance

    Seed2.0 Pro is an advanced general-purpose agent model designed for large-scale production environments and complex real-world tasks. It focuses on long-chain inference capabilities and stability, making it ideal for handling multi-step workflows and intricate business applications. As part of the Seed 2.0 model series, it delivers major upgrades in multimodal understanding, including visual reasoning, motion perception, and instruction-following accuracy. The model demonstrates state-of-the-art performance across leading benchmarks in mathematics, science, coding, and visual reasoning. Seed2.0 Pro excels at interactive visual applications, such as recreating webpages from a single image and generating runnable front-end code with animations. It also supports professional workflows like CAD modeling, biotechnology research assistance, and structured data extraction from complex charts.
  • 46
    Reka Flash 3
    Reka Flash 3 is a 21-billion-parameter multimodal AI model developed by Reka AI, designed to excel in general chat, coding, instruction following, and function calling. It processes and reasons with text, images, video, and audio inputs, offering a compact, general-purpose solution for various applications. Trained from scratch on diverse datasets, including publicly accessible and synthetic data, Reka Flash 3 underwent instruction tuning on curated, high-quality data to optimize performance. The final training stage involved reinforcement learning using REINFORCE Leave One-Out (RLOO) with both model-based and rule-based rewards, enhancing its reasoning capabilities. With a context length of 32,000 tokens, Reka Flash 3 performs competitively with proprietary models like OpenAI's o1-mini. Its compact size also makes it suitable for low-latency or on-device deployments: at full precision (fp16) the model requires 39GB, and it can be compressed to as small as 11GB using 4-bit quantization.
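    The memory figures quoted above can be sanity-checked with simple arithmetic. The sketch below is a rough estimate that counts weight storage only, assuming exactly 21 billion parameters and reporting sizes in binary gigabytes (GiB); the helper name `weight_memory_gib` is hypothetical.

```python
GIB = 2**30  # bytes per binary gigabyte


def weight_memory_gib(n_params: float, bits_per_param: float) -> float:
    """Approximate weight storage only; ignores activations and KV cache."""
    return n_params * bits_per_param / 8 / GIB


# Assumption: exactly 21e9 parameters, as stated in the description.
fp16 = weight_memory_gib(21e9, 16)  # 2 bytes per parameter
int4 = weight_memory_gib(21e9, 4)   # half a byte per parameter

print(f"fp16: {fp16:.1f} GiB, 4-bit: {int4:.1f} GiB")
```

    The fp16 estimate lands at about 39 GiB, consistent with the quoted figure. The raw 4-bit estimate comes out somewhat below the quoted 11GB; a plausible explanation, offered here only as an assumption, is that quantized checkpoints carry extra metadata such as scaling factors or keep some tensors at higher precision.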
  • 47
    LongLLaMA

    LongLLaMA

    This repository contains the research preview of LongLLaMA, a large language model capable of handling long contexts of 256k tokens or even more. LongLLaMA is built upon the foundation of OpenLLaMA and fine-tuned using the Focused Transformer (FoT) method. LongLLaMA Code is built upon the foundation of Code Llama. We release a smaller 3B base variant (not instruction tuned) of the LongLLaMA model under a permissive license (Apache 2.0), along with inference code supporting longer contexts on Hugging Face. Our model weights can serve as a drop-in replacement for LLaMA in existing implementations (for short contexts of up to 2048 tokens). Additionally, we provide evaluation results and comparisons against the original OpenLLaMA models.
  • 48
    Seed2.0 Mini

    ByteDance

    Seed2.0 Mini is the smallest member of ByteDance’s Seed2.0 series of general-purpose multimodal agent models, designed for high-throughput inference and dense deployment while retaining the core strengths of its larger siblings in multimodal understanding and instruction following. Part of a family that also includes Pro and Lite, the Mini variant is optimized for high-concurrency and batch generation workloads, making it suitable for applications where efficient processing of many requests at scale matters as much as capability. Like other Seed2.0 models, it benefits from systematic enhancements in visual reasoning, motion perception, structured extraction from complex inputs like text and images, and reliable execution of multi-step instructions, but it trades some raw reasoning and output quality for faster, more cost-effective inference and better deployment efficiency.
  • 49
    DBRX

    Databricks

    Today, we are excited to introduce DBRX, an open, general-purpose LLM created by Databricks. Across a range of standard benchmarks, DBRX sets a new state-of-the-art for established open LLMs. Moreover, it provides the open community and enterprises building their own LLMs with capabilities that were previously limited to closed model APIs; according to our measurements, it surpasses GPT-3.5, and it is competitive with Gemini 1.0 Pro. It is an especially capable code model, surpassing specialized models like CodeLLaMA-70B in programming, in addition to its strength as a general-purpose LLM. This state-of-the-art quality comes with marked improvements in training and inference performance. DBRX advances the state-of-the-art in efficiency among open models thanks to its fine-grained mixture-of-experts (MoE) architecture. Inference is up to 2x faster than LLaMA2-70B, and DBRX is about 40% of the size of Grok-1 in terms of both total and active parameter counts.
  • 50
    Mistral NeMo

    Mistral AI

    Mistral NeMo is a state-of-the-art 12B model built in collaboration with NVIDIA, with a context window of up to 128k tokens. Its reasoning, world knowledge, and coding accuracy are state-of-the-art in its size category. Because it relies on a standard architecture, Mistral NeMo is easy to use and works as a drop-in replacement in any system using Mistral 7B. Pre-trained base and instruction-tuned checkpoints are released under the Apache 2.0 license to promote adoption by researchers and enterprises. Mistral NeMo was trained with quantization awareness, enabling FP8 inference without any performance loss. The model is designed for global, multilingual applications and is trained on function calling. Compared to Mistral 7B, it is much better at following precise instructions, reasoning, and handling multi-turn conversations.