Microsoft recently announced that it will bring NPU-optimized versions of the DeepSeek-R1 1.5B distilled model directly to Copilot+ PCs. Now, it is taking the next step forward by making the DeepSeek R1 7B and 14B distilled models available for Copilot+ PCs via Azure AI Foundry. This milestone reinforces Microsoft's commitment to delivering cutting-edge AI capabilities that are fast, efficient, and built for real-world applications, helping developers, businesses, and creators push the boundaries of what’s possible.
Availability starts with Copilot+ PCs powered by Qualcomm Snapdragon X, followed by those powered by Intel Core Ultra 200V and AMD Ryzen processors.
The ability to run 7B and 14B parameter reasoning models on Neural Processing Units (NPUs) is a significant milestone in the democratization and accessibility of artificial intelligence. This progression allows researchers, developers and enthusiasts to leverage the substantial power and functionalities of large-scale machine-learning models directly from their Copilot+ PCs. These Copilot+ PCs include an NPU capable of over 40 trillion operations per second (TOPS).
NPUs like those built into Copilot+ PCs are purpose-built to run AI models locally on-device with exceptional efficiency, balancing speed and power consumption. They ensure sustained AI computing with minimal impact on battery life, thermal performance and resource usage. This leaves CPUs and GPUs free to perform other tasks, allowing reasoning models to operate longer and deliver superior results, all while keeping your PC running smoothly.
Efficient inference has taken on heightened significance due to a new scaling law for language models, which indicates that chain-of-thought reasoning during inference can improve response quality across various tasks. The longer a model can “think,” the better its quality will be. Instead of increasing parameters or training data, this approach taps into additional computational power at inference time for better outcomes. DeepSeek distilled models exemplify how even small pre-trained models can shine with enhanced reasoning capabilities, and when coupled with the NPUs on Copilot+ PCs, they unlock exciting new opportunities for innovation.
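To make the idea concrete, here is a minimal, purely illustrative sketch (not drawn from Microsoft's materials) of how the same question can be posed so that a reasoning model spends more inference-time tokens thinking before it answers:

```python
# Illustrative only: two ways of posing the same question to a reasoning model.
# The second prompt invites the model to spend more inference-time tokens
# "thinking", trading extra compute for a more reliable answer.

direct_prompt = "What is 17% of 240? Reply with just the number."

reasoning_prompt = (
    "What is 17% of 240? "
    "Work through the calculation step by step, "
    "then give the final answer on its own line."
)

# Either prompt should lead to 0.17 * 240 = 40.8; the second simply lets the
# model check its own arithmetic by reasoning at greater length.
```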
Reasoning emerges in models of a certain minimum scale, and models at that scale must think using a large number of tokens to excel at complex multi-step reasoning. Although the NPU hardware helps reduce the cost of inference, it is equally important to maintain a manageable memory footprint for these models on consumer PCs, such as those with 16GB of RAM.
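A rough back-of-the-envelope estimate (illustrative figures only, ignoring activations, KV cache and runtime overhead, and not Microsoft's published measurements) shows why low-bit weights are what make a 14B-parameter model plausible on a 16GB machine:

```python
# Back-of-the-envelope weight-memory estimate for a 14B-parameter model.
# Real footprints also include activations, KV cache and runtime overhead,
# so these numbers are illustrative rather than exact.

def weight_memory_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate weight storage in gigabytes."""
    return n_params * bits_per_weight / 8 / 1e9

for bits in (16, 8, 4):
    print(f"14B parameters at {bits}-bit: ~{weight_memory_gb(14e9, bits):.1f} GB")

# Approximate output:
#   14B parameters at 16-bit: ~28.0 GB  -> does not fit in 16GB of RAM
#   14B parameters at 8-bit:  ~14.0 GB  -> barely fits, leaves little headroom
#   14B parameters at 4-bit:  ~7.0 GB   -> leaves room for the KV cache and the OS
```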
Pushing the boundaries of what’s possible on Windows
Microsoft's research investments have enabled it to push the boundaries of what's possible on Windows even further, at both the system level and the model level, leading to innovations like Phi Silica. With Phi Silica, Microsoft was able to create a scalable platform for low-bit inference on NPUs, enabling powerful performance with minimal memory and bandwidth overhead. Combined with the data privacy offered by local computing, this puts advanced scenarios like Retrieval Augmented Generation (RAG) and model fine-tuning at the fingertips of application developers.
For the DeepSeek 1.5B release, Microsoft reused techniques such as QuaRot, a sliding window for fast first-token responses, and many other optimizations. Microsoft used Aqua, an internal automatic quantization tool, to quantize all the DeepSeek model variants to int4 weights with QuaRot while retaining most of the accuracy. Using the same toolchain that was used to optimize Phi Silica, Microsoft quickly integrated all the optimizations into an efficient ONNX QDQ model with low-precision weights.
Like the 1.5B model, the 7B and 14B variants use 4-bit block-wise quantization for the embeddings and language model head and run these memory-access-heavy operations on the CPU. The compute-heavy transformer block, containing the context processing and token iteration, uses int4 per-channel quantization for the weights alongside int16 activations. Microsoft already sees about 8 tok/sec on the 14B model (the 1.5B model, being very small, demonstrated close to 40 tok/sec), and further optimizations are coming as Microsoft leverages more advanced techniques. With all this in place, these nimble language models can think longer and harder.
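For readers unfamiliar with these terms, the sketch below illustrates the general idea of block-wise versus per-channel int4 weight quantization in plain numpy. The helper names are hypothetical, and this is not Microsoft's Aqua or QuaRot tooling, just a minimal illustration of the granularities described above:

```python
import numpy as np

def quantize_int4_blockwise(w: np.ndarray, block_size: int = 32):
    """Symmetric 4-bit quantization with one scale per block of weights."""
    w = w.reshape(-1, block_size)
    scale = np.abs(w).max(axis=1, keepdims=True) / 7.0   # int4 range: [-8, 7]
    q = np.clip(np.round(w / scale), -8, 7).astype(np.int8)
    return q, scale

def quantize_int4_per_channel(w: np.ndarray):
    """Symmetric 4-bit quantization with one scale per output channel (row)."""
    scale = np.abs(w).max(axis=1, keepdims=True) / 7.0
    q = np.clip(np.round(w / scale), -8, 7).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: np.ndarray) -> np.ndarray:
    """The 'DQ' half of a QDQ pair: recover approximate float weights."""
    return q.astype(np.float32) * scale

# Toy weight matrix standing in for a linear layer (8 output channels x 64 inputs).
w = np.random.randn(8, 64).astype(np.float32)
q_block, s_block = quantize_int4_blockwise(w)
q_chan, s_chan = quantize_int4_per_channel(w)
print("block-wise reconstruction error: ",
      np.abs(dequantize(q_block, s_block).reshape(w.shape) - w).mean())
print("per-channel reconstruction error:",
      np.abs(dequantize(q_chan, s_chan) - w).mean())
```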
This durable path to innovation has made it possible for Microsoft to quickly optimize the larger variants of the DeepSeek models (7B and 14B) and will continue to enable more new models to run efficiently on Windows.
Get started today
Developers can access all distilled variants (1.5B, 7B and 14B) of DeepSeek models and run them on Copilot+ PCs by simply downloading the AI Toolkit VS Code extension. The DeepSeek model optimized in the ONNX QDQ format is available in AI Toolkit’s model catalogue, pulled directly from Azure AI Foundry. You can download it locally by clicking the “Download” button. Once downloaded, experimenting with the model is as simple as opening the Playground, loading the “deepseek_r1_1_5” model and sending it prompts.
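Beyond the Playground UI, a downloaded model can also be driven programmatically. The sketch below assumes AI Toolkit's local, OpenAI-compatible REST endpoint is running; the port, path and model identifier shown are placeholders to verify against the AI Toolkit documentation.

```python
import requests

# Hypothetical values: confirm the actual port, path and model name in the
# AI Toolkit documentation / the extension's model catalog view.
LOCAL_ENDPOINT = "http://127.0.0.1:5272/v1/chat/completions"
MODEL_NAME = "deepseek_r1_1_5"

response = requests.post(
    LOCAL_ENDPOINT,
    json={
        "model": MODEL_NAME,
        "messages": [
            {"role": "user", "content": "Explain why the sky is blue, step by step."}
        ],
        "max_tokens": 1024,
    },
    timeout=300,
)
response.raise_for_status()
print(response.json()["choices"][0]["message"]["content"])
```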
Run models across Copilot+ PCs and Azure
Copilot+ PCs offer local compute capabilities that extend the capabilities enabled by Azure, giving developers even more flexibility to train and fine-tune small language models on-device while leveraging the cloud for larger, more intensive workloads. In addition to the ONNX model optimized for Copilot+ PCs, you can also try the cloud-hosted source model in Azure AI Foundry by clicking the “Try in Playground” button under “DeepSeek R1.” AI Toolkit is part of your developer workflow as you experiment with models and get them ready for deployment. With this playground, you can effortlessly test the DeepSeek models available in Azure AI Foundry for local deployment too. Through this, developers now have access to the most complete set of DeepSeek models available, from cloud to client, through Azure AI Foundry.
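For the cloud side, here is a minimal sketch of calling a DeepSeek R1 deployment in Azure AI Foundry with the azure-ai-inference client; the endpoint URL, API key and deployment name are placeholders for your own Foundry project.

```python
from azure.ai.inference import ChatCompletionsClient
from azure.ai.inference.models import UserMessage
from azure.core.credentials import AzureKeyCredential

# Placeholders: use the endpoint URL, API key and deployment name from your
# own Azure AI Foundry project.
client = ChatCompletionsClient(
    endpoint="https://<your-foundry-resource>.services.ai.azure.com/models",
    credential=AzureKeyCredential("<your-api-key>"),
)

response = client.complete(
    model="DeepSeek-R1",  # the deployment name in your Foundry project
    messages=[UserMessage(content="Summarize the benefits of on-device inference.")],
    max_tokens=2048,
)
print(response.choices[0].message.content)
```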
Copilot+ PCs pair efficient local computing with the near-infinite compute Microsoft offers via its Azure services. With reasoning able to span the cloud and the edge, running in sustained loops on the PC and invoking the much larger brains in the cloud as needed, Microsoft is on to a new paradigm of continuous computing that creates value for its customers. The future of AI computing just got brighter! Microsoft can’t wait to see the new innovations from its developer community taking advantage of these rich capabilities.
Source: Microsoft