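After a curated evaluation by AI Skill Hub, the vmlx MCP tool is rated "Highly Recommended". It performs well on feature completeness, community activity, and ease of use, with an AI score of 8.2, and is best suited to users with some technical background.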
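The vmlx MCP tool is an open-source tool written in Python that focuses on model compression, KV cache optimization, and the MLX framework. As an open-source project on GitHub, it has an active community and ongoing releases, its code is fully open to audit, and it supports local deployment to keep data private. It is intended to work both for individual use and as part of an enterprise workflow.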
# Option 1: install with pip (recommended)
pip install vmlx
# Option 2: install into a virtual environment (recommended for production)
python -m venv .venv
source .venv/bin/activate # Windows: .venv\Scripts\activate
pip install vmlx
# Option 3: install from source (to get the latest features)
git clone https://github.com/jjang-ai/vmlx
cd vmlx
pip install -e .
# Verify the installation
python -c "import vmlx; print('Installation succeeded')"
# Command-line usage
vmlx --help
# Basic usage
vmlx input_file -o output_file
# Calling from Python code
import vmlx
# Example
result = vmlx.process("input")
print(result)
# Example vmlx configuration file (config.yml)
app:
  name: "vmlx"
  debug: false
  log_level: "INFO"
# Specify the configuration file at runtime
vmlx --config config.yml
# Or configure through environment variables
export VMLX_API_KEY="your-key"
export VMLX_OUTPUT_DIR="./output"
<p align="center"> <picture> <source media="(prefers-color-scheme: dark)" srcset="https://raw.githubusercontent.com/jjang-ai/vmlx/main/assets/logo-wide-dark.png"> <source media="(prefers-color-scheme: light)" srcset="https://raw.githubusercontent.com/jjang-ai/vmlx/main/assets/logo-wide-light.png"> <img alt="vMLX" src="https://raw.githubusercontent.com/jjang-ai/vmlx/main/assets/logo-wide-light.png" width="400"> </picture> </p>
<p align="center"> Self-hosted inference server for LLMs, VLMs, and image generation on Apple Silicon.<br> OpenAI + Anthropic + Ollama compatible HTTP API. Self-hosted; no third-party API keys required.<br> Native MTP artifact detection and family-specific cache policy gates keep speculative/cache settings explicit and model-safe. </p>
<p align="center"> <em>Looking for a native Swift macOS app or Swift inference engine? See <a href="https://osaurus.ai">osaurus.ai</a>.</em> </p>
<p align="center"> <a href="https://pypi.org/project/vmlx/"><img src="https://img.shields.io/pypi/v/vmlx?color=%234B8BBE&label=PyPI&logo=python&logoColor=white" alt="PyPI" /></a> <a href="https://github.com/jjang-ai/vmlx/blob/main/LICENSE"><img src="https://img.shields.io/badge/License-Apache_2.0-green?logo=apache" alt="License" /></a> <a href="https://github.com/jjang-ai/vmlx"><img src="https://img.shields.io/github/stars/jjang-ai/vmlx?style=social" alt="Stars" /></a> <img src="https://img.shields.io/badge/Apple_Silicon-M1%2FM2%2FM3%2FM4-black?logo=apple" alt="Apple Silicon" /> <img src="https://img.shields.io/badge/Python-3.10+-3776AB?logo=python&logoColor=white" alt="Python" /> <img src="https://img.shields.io/badge/Electron-28-47848F?logo=electron&logoColor=white" alt="Electron" /> <a href="https://ko-fi.com/jangml"><img src="https://img.shields.io/badge/Support-Ko--fi-FF5E5B?logo=ko-fi&logoColor=white" alt="Ko-fi" /></a> </p>
<p align="center"> <a href="#quickstart">Quickstart</a> • <a href="#model-support">Models</a> • <a href="#features">Features</a> • <a href="#image-generation--editing">Image Gen</a> • <a href="#api-reference">API</a> • <a href="#desktop-app">Desktop App</a> • <a href="#advanced-quantization">JANG</a> • <a href="#cli-commands">CLI</a> • <a href="#configuration">Config</a> • <a href="#contributing">Contributing</a> • <a href="#한국어-korean">한국어</a> </p>
---
JANG 2-bit destroys MLX 4-bit on MiniMax M2.5:

| Quantization | MMLU (200q) | Size |
|---|---|---|
| JANG\_2L (2-bit) | 74% | 89 GB |
| MLX 4-bit | 26.5% | 120 GB |
| MLX 3-bit | 24.5% | 93 GB |
| MLX 2-bit | 25% | 68 GB |

Adaptive mixed-precision keeps critical layers at higher precision. Scores at jangq.ai. Models at JANGQ-AI.
Screenshots (images not available): chat with any MLX model, with thinking mode, streaming, and syntax highlighting; agentic chat with full coding capabilities, with tool use and structured output.
---
pip install vmlx                 # Core: text LLMs, VLMs, embeddings, reranking
pip install "vmlx[image]"        # + Image generation (mflux)
pip install "vmlx[jang]"         # + JANG quantization tools
pip install "vmlx[dev]"          # + Development/testing tools
pip install "vmlx[image,jang]"   # Multiple extras (quotes keep zsh from globbing the brackets)
---
Published on PyPI as vmlx -- install and run in one command:
```bash
python3 -m venv ~/.vmlx-env && source ~/.vmlx-env/bin/activate
pip install vmlx
vmlx serve mlx-community/Qwen3-8B-4bit
```
Note: On macOS 14+, bare `pip install` fails with "externally-managed-environment". Use `uv`, `pipx`, or a venv.
The vMLX inference server is now running at http://0.0.0.0:8000 with an OpenAI + Anthropic compatible API. Works with any model from mlx-community -- thousands of models ready to go.
Chat completion (streaming)
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "local",
"messages": [{"role": "user", "content": "Explain quantum computing in 3 sentences."}],
"stream": true,
"temperature": 0.7
}'
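With `"stream": true`, an OpenAI-compatible endpoint normally answers with server-sent events: one `data: {...}` line per chunk, ending with `data: [DONE]`. The sketch below shows how a client might pull the text out of such lines. It assumes vMLX follows that OpenAI wire format exactly; the function name and the sample lines are made up for illustration.

```python
import json
from typing import Iterable, Iterator


def iter_stream_text(lines: Iterable[str]) -> Iterator[str]:
    """Yield text deltas from OpenAI-style SSE lines ("data: {...}")."""
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank separator lines and SSE comments
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break  # end-of-stream sentinel in the OpenAI format
        chunk = json.loads(payload)
        for choice in chunk.get("choices", []):
            text = choice.get("delta", {}).get("content")
            if text:
                yield text


# Sample lines shaped like a streaming response (content is invented).
sample = [
    'data: {"choices": [{"delta": {"role": "assistant"}}]}',
    "",
    'data: {"choices": [{"delta": {"content": "Hello"}}]}',
    'data: {"choices": [{"delta": {"content": ", world"}}]}',
    "data: [DONE]",
]
print("".join(iter_stream_text(sample)))
```

In practice the lines would come from an HTTP response read line by line rather than from a list.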
Chat completion with thinking mode
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "local",
"messages": [{"role": "user", "content": "Solve: what is 23 * 47?"}],
"enable_thinking": true,
"stream": true
}'
Tool calling
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "local",
"messages": [{"role": "user", "content": "What is the weather in Tokyo?"}],
"tools": [{
"type": "function",
"function": {
"name": "get_weather",
"description": "Get current weather for a location",
"parameters": {
"type": "object",
"properties": {
"location": {"type": "string", "description": "City name"}
},
"required": ["location"]
}
}
}]
}'
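The request above only declares the tool; running it is the client's job. In the OpenAI tool-calling format, the assistant message carries a `tool_calls` list, and each result goes back as a `role: "tool"` message. The sketch below assumes vMLX returns that shape. The `get_weather` body, the `TOOLS` registry, and the sample message are hypothetical.

```python
import json


def get_weather(location: str) -> str:
    """Hypothetical local implementation of the `get_weather` tool."""
    return f"Sunny in {location}"  # placeholder text, not a real forecast


TOOLS = {"get_weather": get_weather}


def run_tool_calls(assistant_message: dict) -> list[dict]:
    """Execute each requested tool and build the follow-up `tool` messages."""
    replies = []
    for call in assistant_message.get("tool_calls") or []:
        fn = call["function"]
        args = json.loads(fn["arguments"])  # arguments arrive as a JSON string
        result = TOOLS[fn["name"]](**args)
        replies.append(
            {"role": "tool", "tool_call_id": call["id"], "content": result}
        )
    return replies


# Assistant message in the OpenAI tool-calling shape (values are invented).
message = {
    "role": "assistant",
    "content": None,
    "tool_calls": [
        {
            "id": "call_1",
            "type": "function",
            "function": {"name": "get_weather", "arguments": '{"location": "Tokyo"}'},
        }
    ],
}
print(run_tool_calls(message))
```

The returned messages would be appended to the conversation, after the assistant message, and sent in a second request so the model can write its final answer.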
Anthropic Messages API
curl http://localhost:8000/v1/messages \
-H "Content-Type: application/json" \
-H "x-api-key: not-needed" \
-H "anthropic-version: 2023-06-01" \
-d '{
"model": "local",
"max_tokens": 1024,
"messages": [{"role": "user", "content": "Hello!"}]
}'
Embeddings
curl http://localhost:8000/v1/embeddings \
-H "Content-Type: application/json" \
-d '{
"model": "local",
"input": "The quick brown fox jumps over the lazy dog"
}'
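Embedding vectors are usually compared with cosine similarity. The sketch below assumes the endpoint returns the OpenAI embeddings shape (a `data` list of items with `index` and `embedding`); the two-dimensional vectors are invented to keep the arithmetic visible.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def embeddings_from_response(body: dict) -> list[list[float]]:
    """Return the vectors of an OpenAI-style embeddings response in input order."""
    items = sorted(body["data"], key=lambda item: item["index"])
    return [item["embedding"] for item in items]


# Tiny invented response; real vectors have hundreds of dimensions.
body = {
    "data": [
        {"index": 1, "embedding": [0.0, 1.0]},
        {"index": 0, "embedding": [1.0, 0.0]},
        {"index": 2, "embedding": [2.0, 0.0]},
    ]
}
vecs = embeddings_from_response(body)
print(cosine_similarity(vecs[0], vecs[2]))  # same direction
print(cosine_similarity(vecs[0], vecs[1]))  # orthogonal
```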
Text-to-speech
curl http://localhost:8000/v1/audio/speech \
-H "Content-Type: application/json" \
-d '{
"model": "kokoro",
"input": "Hello, welcome to vMLX!",
"voice": "af_heart"
}' --output speech.wav
Speech-to-text
curl http://localhost:8000/v1/audio/transcriptions \
-F file=@audio.wav \
-F model=whisper
Image generation
curl http://localhost:8000/v1/images/generations \
-H "Content-Type: application/json" \
-d '{
"model": "schnell",
"prompt": "A mountain landscape at sunset",
"size": "1024x1024"
}'
Reranking
curl http://localhost:8000/v1/rerank \
-H "Content-Type: application/json" \
-d '{
"model": "local",
"query": "What is machine learning?",
"documents": [
"ML is a subset of AI",
"The weather is sunny today",
"Neural networks learn from data"
]
}'
Cache stats
curl http://localhost:8000/v1/cache/stats
Health check
curl http://localhost:8000/health
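Loading a model can take a while, so scripts that start the server often poll the health endpoint before sending requests. A minimal sketch, assuming that an HTTP 200 from `/health` means the server is ready (the response body is not documented here, so it is ignored); the function names and defaults are illustrative.

```python
import time
import urllib.error
import urllib.request
from typing import Callable


def http_ok(url: str, timeout: float = 2.0) -> bool:
    """Return True if a GET to `url` answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False  # connection refused, timeout, or an HTTP error status


def wait_until_healthy(
    url: str = "http://localhost:8000/health",
    attempts: int = 30,
    delay: float = 1.0,
    probe: Callable[[str], bool] = http_ok,
) -> bool:
    """Poll `url` up to `attempts` times, sleeping `delay` seconds between tries."""
    for attempt in range(attempts):
        if probe(url):
            return True
        if attempt < attempts - 1:
            time.sleep(delay)
    return False
```

Calling `wait_until_healthy()` after launching `vmlx serve` would block for at most about 30 seconds with these defaults. The `probe` parameter exists so the loop can be exercised without a running server.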
---
brew install uv
uv tool install vmlx
vmlx serve mlx-community/Qwen3-8B-4bit
vmlx serve <model> \
--host 0.0.0.0 \ # Bind address (default: 0.0.0.0)
--port 8000 \ # Port (default: 8000)
--api-key sk-your-key \ # Optional API key authentication
--continuous-batching \ # Enable concurrent request handling
--enable-prefix-cache \ # Reuse KV states for repeated prompts
--use-paged-cache \ # Block-based KV cache with dedup
--kv-cache-quantization q8 \ # Quantize cache: q4 or q8
--enable-disk-cache \ # Persist cache to SSD
--enable-jit \ # JIT Metal kernel compilation
--tool-call-parser auto \ # Auto-detect tool call format
--reasoning-parser auto \ # Auto-detect thinking format
--log-level INFO \ # Logging: DEBUG, INFO, WARNING, ERROR
--max-model-len 8192 \ # Max context length
--speculative-model <model> \ # Draft model for speculative decoding
--enable-pld \ # Prompt Lookup Decoding — no draft model, best for code/JSON/schemas
--distributed \ # Enable multi-Mac pipeline parallelism
--cluster-secret <secret> \ # Shared auth secret for workers
--distributed-mode pipeline \ # pipeline (default) or tensor (coming soon)
--worker-nodes ip:port,... \ # Manual worker IPs (overrides auto-discovery)
--cors-origins "*" # CORS allowed origins
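The listing above reads as a flag reference rather than a command to paste: in a shell, a backslash followed by a comment does not continue the line. One way to assemble a real invocation is to build the argument list in Python. The sketch below uses only flags from the listing; the helper name and the convention of mapping keyword arguments to flags are assumptions of this example, not part of vmlx.

```python
import shlex


def build_serve_command(model: str, **options) -> list[str]:
    """Turn keyword options into a `vmlx serve` argv list.

    True becomes a bare switch, False and None are skipped, and any other
    value becomes `--flag value`. Underscores in names map to dashes.
    """
    argv = ["vmlx", "serve", model]
    for name, value in options.items():
        flag = "--" + name.replace("_", "-")
        if value is True:
            argv.append(flag)
        elif value is False or value is None:
            continue
        else:
            argv.extend([flag, str(value)])
    return argv


cmd = build_serve_command(
    "mlx-community/Qwen3-8B-4bit",
    port=8000,
    continuous_batching=True,
    enable_prefix_cache=True,
    kv_cache_quantization="q8",
    enable_disk_cache=False,  # skipped: switch stays off
)
print(shlex.join(cmd))
```

The resulting list can be handed to `subprocess.Popen(cmd)` directly, which avoids shell quoting issues altogether.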
vmlx convert <model> \
--bits 4 \ # Uniform quantization bits: 2, 3, 4, 6, 8
--group-size 64 \ # Quantization group size (default: 64)
--output ./output-dir \ # Output directory
--jang-profile JANG_3M \ # JANG mixed-precision profile
--calibration-method activations # Activation-aware calibration
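To get a feel for how `--bits` and `--group-size` interact, it helps to estimate the storage cost per weight. The sketch below assumes group-wise affine quantization in which every group stores one 16-bit scale and one 16-bit bias, which is how MLX's built-in quantization is commonly described; whether `vmlx convert` matches this exactly is an assumption. It also ignores layers that stay unquantized, so real files will be somewhat larger.

```python
def effective_bits_per_weight(bits: int, group_size: int = 64, scale_bits: int = 16) -> float:
    """Quantized bits plus the per-group scale and bias, amortized per weight."""
    return bits + 2 * scale_bits / group_size


def estimated_size_gb(num_params: float, bits: int, group_size: int = 64) -> float:
    """Rough size in GB (10^9 bytes) of `num_params` quantized weights."""
    total_bits = num_params * effective_bits_per_weight(bits, group_size)
    return total_bits / 8 / 1e9


# Hypothetical 8-billion-parameter model at each uniform bit width.
for bits in (2, 3, 4, 6, 8):
    print(bits, effective_bits_per_weight(bits), round(estimated_size_gb(8e9, bits), 2))
```

Under these assumptions, 4-bit weights with a group size of 64 cost 4.5 bits each, so the overhead of scales and biases grows in relative terms as the bit width shrinks.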
```bash
pip install "vmlx[image]"
```
TTS and STT require the mlx-audio package:
```bash
pip install mlx-audio
```
from openai import OpenAI

# The SDK requires some api_key value; a real key matters only if the server was started with --api-key.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")
response = client.chat.completions.create(
    model="local",
    messages=[{"role": "user", "content": "Hello!"}],
    stream=True,
)
for chunk in response:
    # Skip chunks without choices; a delta may also carry no text.
    if chunk.choices:
        print(chunk.choices[0].delta.content or "", end="", flush=True)

import anthropic

# The Anthropic SDK appends /v1/messages itself, so base_url carries no /v1 suffix.
client = anthropic.Anthropic(base_url="http://localhost:8000", api_key="not-needed")
message = client.messages.create(
    model="local",
    max_tokens=1024,
    messages=[{"role": "user", "content": "Hello!"}],
)
print(message.content[0].text)
curl http://localhost:8000/v1/images/generations \
-H "Content-Type: application/json" \
-d '{
"model": "schnell",
"prompt": "A cat astronaut floating in space with Earth in the background",
"size": "1024x1024",
"n": 1
}'
```python
response = client.images.generate(
    model="schnell",
    prompt="A cat astronaut floating in space",
    size="1024x1024",
    n=1,
)
```
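The generation call returns image data rather than writing a file. OpenAI-style image responses can carry each image inline as base64 under `b64_json`; the sketch below saves such entries to disk. Whether vMLX returns `b64_json` by default, and whether the bytes are PNG, are both assumptions here, and the helper name is made up.

```python
import base64
from pathlib import Path


def save_images(body: dict, out_dir: str = ".") -> list[Path]:
    """Write each `b64_json` entry of an images response to a file."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    paths = []
    for i, item in enumerate(body.get("data", [])):
        encoded = item.get("b64_json")
        if encoded is None:
            continue  # the entry may carry a URL instead of inline data
        path = out / f"image_{i}.png"  # .png is a guess at the format
        path.write_bytes(base64.b64decode(encoded))
        paths.append(path)
    return paths
```

With the OpenAI SDK, the same data would be reachable as `response.data[0].b64_json`, or the whole response can be converted with `response.model_dump()` and passed to the helper.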
The desktop app runs an API Gateway on a single port (default 8080) that routes requests to all loaded models by name. Run multiple models simultaneously and access them all through one URL.
```bash
OLLAMA_HOST=http://localhost:8080 ollama run Qwen3.5-122B
```
The gateway supports OpenAI, Anthropic, and Ollama wire formats. Configure the port in the API tab.
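Routing by name means the `model` field of each request decides which loaded model handles it. The sketch below illustrates the idea only; it is not the gateway's actual implementation, and the backend table, ports, and function name are hypothetical.

```python
def resolve_backend(request_body: dict, backends: dict[str, str]) -> str:
    """Pick the backend URL for a request based on its `model` field."""
    model = request_body.get("model")
    try:
        return backends[model]
    except KeyError:
        raise LookupError(f"model {model!r} is not loaded") from None


# Hypothetical table: two loaded models, each served on its own internal port.
backends = {
    "Qwen3.5-122B": "http://127.0.0.1:9001",
    "schnell": "http://127.0.0.1:9002",
}
print(resolve_backend({"model": "schnell", "prompt": "A mountain"}, backends))
```

From the client's side nothing changes except the base URL: every request goes to the gateway port, and switching models is a matter of changing the `model` string.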
OpenAI / Anthropic
| Method | Path | Description |
|---|---|---|
| POST | /v1/chat/completions | OpenAI Chat Completions API (streaming + non-streaming) |
| POST | /v1/messages | Anthropic Messages API |
| POST | /v1/responses | OpenAI Responses API |
| POST | /v1/completions | Text completions |
| POST | /v1/images/generations | Image generation |
| POST | /v1/images/edits | Image editing (Qwen Image Edit) |
| POST | /v1/embeddings | Text embeddings |
| POST | /v1/rerank | Document reranking |
| POST | /v1/audio/transcriptions | Speech-to-text (Whisper) |
| POST | /v1/audio/speech | Text-to-speech (Kokoro) |
| GET | /v1/models | List loaded models |
| GET | /v1/cache/stats | Cache statistics |
| GET | /health | Server health check |
Ollama
| Method | Path | Description |
|---|---|---|
| POST | /api/chat | Chat completion (NDJSON streaming) |
| POST | /api/generate | Text generation (NDJSON streaming) |
| GET | /api/tags | List loaded models |
| POST | /api/show | Model details |
| POST | /api/embeddings | Generate embeddings |
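NDJSON streaming differs from the OpenAI server-sent events format: each line is a complete JSON object with no `data:` prefix, and the last object has `"done": true`. The sketch below reads text from `/api/chat` lines, assuming vMLX mirrors Ollama's response fields (`message.content`, `done`); the sample lines are invented.

```python
import json
from typing import Iterable, Iterator


def iter_ollama_chat_text(lines: Iterable[str]) -> Iterator[str]:
    """Yield message text from Ollama-style NDJSON chat lines."""
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # tolerate blank lines between objects
        event = json.loads(line)
        text = event.get("message", {}).get("content", "")
        if text:
            yield text
        if event.get("done"):
            break  # final object of the stream


sample = [
    '{"model": "local", "message": {"role": "assistant", "content": "Hi"}, "done": false}',
    '{"model": "local", "message": {"role": "assistant", "content": " there"}, "done": false}',
    '{"model": "local", "message": {"role": "assistant", "content": ""}, "done": true}',
]
print("".join(iter_ollama_chat_text(sample)))
```

For `/api/generate`, Ollama puts the text in a top-level `response` field instead of `message.content`, so the lookup would change accordingly.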
vmlx serve <model> # Start inference server
vmlx convert <model> --bits 4 # MLX uniform quantization
vmlx convert <model> -j JANG_3M # JANG adaptive quantization
vmlx info <model> # Model metadata and config
vmlx doctor <model> # Run diagnostics
vmlx bench <model> # Performance benchmarks
vmlx-worker --secret <secret> # Start distributed worker node
---
```bash
curl http://localhost:8000/v1/images/edits \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen-image-edit",
    "prompt": "Change the background to dusk",
    "image": "<base64-encoded image>",
    "size": "1024x1024",
    "strength": 0.8
  }'
```
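An innovative KV cache compression scheme that addresses the memory bottleneck in MLX inference. The persistent cache design is distinctive, the codebase is actively maintained, and the project has high value for production use.
AI Skill Hub is a third-party content aggregation platform. The information on this page is compiled from public data and does not constitute any legal endorsement of the tool's functionality or quality.
We recommend validating the tool thoroughly in a sandbox or test environment before deploying it to production, and carrying out any necessary security assessment.
✅ Apache 2.0: a permissive open-source license that allows commercial use, requires retaining copyright notices and the NOTICE file, and includes a patent grant.
AI Skill Hub review: the vmlx MCP tool's core functionality is complete and its quality is high. For AI enthusiasts, it is worth adding to a personal toolkit. We suggest trying it in a non-production environment first, then rolling it out gradually.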
| Original name | vmlx |
| Original description | Open-source MCP tool: vMLX - JANGTQ Uber Compressed MLX Models - L2 Disk Cache (survives restart) + L1. ⭐512 · Python |
| Topics | model compression, KV cache optimization, MLX framework, MCP tool, memory optimization |
| GitHub | https://github.com/jjang-ai/vmlx |
| License | Apache-2.0 |
| Language | Python |
Listed: 2026-05-18 · Updated: 2026-05-19 · License: Apache-2.0 · AI Skill Hub does not legally endorse the accuracy of third-party content.