Run the 284-billion-parameter DeepSeek V4 Flash locally on an Apple Silicon Mac with 128 GB or more of memory: no cloud calls, no API fees, no per-token billing, no rate limits. The model's directional-steering dial stays in your hands.
pi-ds4 runs a local inference server called ds4-server on your Mac, loading the deepseek-v4-flash model and exposing OpenAI and Anthropic API endpoints simultaneously on 127.0.0.1:8000. Everything happens on your own machine; your conversations are never sent anywhere. No cloud, no API account, no per-token billing, and none of the "sorry, I can't respond to that" gates that cloud services routinely apply.
pi is its most convenient frontend, hence the name pi-ds4. But if you already use Codex CLI, Claude Code, OpenClaw, or Hermes Agent, you can point any of them at this local server and treat pi-ds4 as your local backend; for all four shells it is a matter of changing one or two environment variables or a config section. See Chapter 8.
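As a preview of Chapter 8, pointing an external shell at the local server usually comes down to a base-URL override plus a dummy API key. The variable names below are the common ones for OpenAI- and Anthropic-protocol clients, but they vary by client and version, so treat this as a sketch and check Chapter 8 for the per-shell specifics:

```shell
# Sketch only — variable names differ between clients and versions;
# Chapter 8 has the authoritative per-shell instructions.
# The API key is a dummy: the local server accepts anything.
export OPENAI_BASE_URL="http://127.0.0.1:8000/v1"   # OpenAI-protocol shells
export OPENAI_API_KEY="dsv4-local"
export ANTHROPIC_BASE_URL="http://127.0.0.1:8000"   # Anthropic-protocol shells
export ANTHROPIC_API_KEY="dsv4-local"
```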
It boils down to one pi command: pi install github.com/audreyt/pi-ds4. The precondition is that you already have pi on your machine; if pi --version returns command not found, head to earendil-works/pi first, follow its install steps, then come back. After that, the extension's first launch will spend an hour or two downloading and compiling everything; subsequent launches are fast.
audreyt/ds4's Makefile auto-builds the CUDA path on Linux; only pi-ds4's lifecycle wrapper is Mac-only, so on Linux you simply run ds4-server yourself.

The 60-second primer above answers "what is this and is it for me". What follows is a handbook for the reader who has already decided to dial in every knob. It is neither a translation of the README nor marketing copy; it is the full operational manual for the audreyt fork.
The book has eleven chapters, plus this preface and the closing appendix, glossary, and FAQ. Reading paths:
You can read it straight through as one complete install experience, or cherry-pick chapters as a configuration reference.
First: you are willing to give about 87 GB of disk space to the model weights; that is the cover charge for running locally.
Second: the most convenient one-line install path (Chapters 1-2) requires pi (Earendil's coding-agent CLI). But if you only want ds4-server as a local backend for your existing Codex CLI, Claude Code, OpenClaw, or Hermes Agent, you can skip pi entirely: go straight to §8.6 path (C), clone audreyt/ds4, run make ds4-server, and download the GGUF yourself.
Before typing the command, check that the Mac in front of you can carry the load. DeepSeek V4 Flash is a 284-billion-parameter Mixture-of-Experts model; IQ2XXS imatrix quantisation shrinks the weights to about 87 GB, but inference still needs headroom in unified memory for activations and the KV cache.
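You can run a similar pre-flight yourself before committing to the download. This is a sketch of the wrapper's check, not its actual code; hw.memsize is the macOS sysctl key for total unified memory in bytes:

```shell
# 128 GB expressed in bytes: 128 * 2^30 = 137438953472
need=$((128 * 1024 * 1024 * 1024))
have=$(sysctl -n hw.memsize)   # total unified memory in bytes (macOS only)
if [ "$have" -lt "$need" ]; then
  echo "Only $((have / 1024 / 1024 / 1024)) GB: the wrapper will refuse; see §8.6 (C)" >&2
fi
```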
Upstream antirez/ds4 contributors have validated that this IQ2XXS imatrix recipe runs on a 96 GB Mac Studio at 250k context and around 27 t/s (issue #46, since merged into the upstream README). Worth knowing: apple.com does not currently sell a Mac Studio above 96 GB, so this is the standard configuration for current high-end Mac Studio buyers.
But this wrapper's index.ts hard-checks RAM and throws on anything below 128 GB. To run on 96 GB, skip the pi-ds4 wrapper and use the manual ds4-server path in §8.6 (C):
sudo sysctl iogpu.wired_limit_mb=92000
Clone audreyt/ds4, run make ds4-server, fetch the IQ2XXS imatrix GGUF (with this extension's download_model.sh), and symlink it as ds4flash.gguf.
./ds4-server --ctx 250000 … (rest of the flags in §8.6)
Caveat: setting iogpu.wired_limit_mb to 92000 leaves very little headroom for the OS; running other heavyweight applications alongside it can wedge macOS. Attempt this only if you are happy to manage memory by hand.
This extension is mutually exclusive with mitsuhiko/pi-ds4: they register the same ds4/deepseek-v4-flash provider/model ID. Uninstall the upstream one first:
# If you previously installed mitsuhiko/pi-ds4, remove it first:
pi remove github.com/mitsuhiko/pi-ds4
This extension sits on top of earendil-works/pi; the pi install line below assumes pi itself is already installed. A five-second health check first:
# Should print a version number; if you get command not found, install pi
# first per the instructions at github.com/earendil-works/pi.
pi --version
# Once pi is installed:
pi install github.com/audreyt/pi-ds4
That's it. Everything else (cloning the ds4 source, building ds4-server, downloading the 87 GB GGUF, starting the server, registering the provider) happens in the background.
pi's extension system is itself an application-layer package manager: each extension is anchored on a Git URL; install means clone, remove means de-register. It keeps the ds4 source under ~/.pi/ds4/support/, and every step can be reconstructed from the log. The rest of your system stays clean.
The install command itself only registers the extension; the real work fires the first time you actually select the ds4/deepseek-v4-flash model, and every step is logged to ~/.pi/ds4/log.
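Since every step lands in that log, you can follow a first launch from a second terminal, outside pi:

```shell
# Follow the setup log live; Ctrl-C stops watching without affecting pi
tail -f ~/.pi/ds4/log
```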
When any pi process makes its first request to this model, the extension runs the following six steps in order:
Writes owner metadata to ~/.pi/ds4/lock/owner.json so that two pi windows cannot run the same setup in parallel.
If ~/.pi/ds4/support/ is missing or does not look like a ds4 checkout, runs a shallow clone:
# default DS4_SUPPORT_REPO + DS4_SUPPORT_BRANCH
git clone --depth 1 --single-branch \
--branch main \
https://github.com/audreyt/ds4 \
~/.pi/ds4/support
If the ds4-server binary is missing, runs make ds4-server inside the support directory.
Runs the bundled download_model.sh q2:
# curl -C - resumes the 87 GB IQ2XXS imatrix GGUF after interruptions
curl -fL --progress-meter -C - \
-o gguf/cyberneurova-...imatrix.gguf.part \
https://huggingface.co/cyberneurova/...
mv gguf/...gguf.part gguf/...gguf
ln -sfn gguf/...gguf ds4flash.gguf
Spawns the server detached on 127.0.0.1:8000 with the built-in startup arguments:
ds4-server \
--ctx 100000 \
--kv-disk-dir ~/.pi/ds4/kv \
--kv-disk-space-mb 8192 \
--mpp auto \
--dir-steering-file dir-steering/out/uncertainty_ablit_imatrix.f32 \
--dir-steering-ffn -2 \
--dir-steering-attn 0
Spawns /bin/sh ds4-watchdog.sh (detached). It scans the clients/ directory every two seconds and, once no valid lease remains, sends SIGTERM to ds4-server and exits.
The whole sequence is idempotent: a second launch finds everything already in place and picks up directly at step 5; if the server is still running, even step 5 is skipped.
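That idempotence falls out of a plain check-before-do shape: each step probes for its artifact before doing any work. A minimal sketch of steps 2-4, assuming the paths named above (the real logic lives in index.ts):

```shell
support="$HOME/.pi/ds4/support"
# Step 2: clone only if the checkout is missing
[ -d "$support/.git" ] || git clone --depth 1 --single-branch \
    --branch main https://github.com/audreyt/ds4 "$support"
# Step 3: build only if the binary is missing
[ -x "$support/ds4-server" ] || make -C "$support" ds4-server
# Step 4: download only if the symlinked GGUF is not yet in place
[ -e "$support/ds4flash.gguf" ] || (cd "$support" && ./download_model.sh q2)
```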
Inside pi, running /ds4 opens a live log window. While the ~87 GB IQ2XXS imatrix GGUF is downloading, you'll see curl's progress meter condensed into a single line of percentage, rate, and ETA.
Don't worry: download_model.sh uses curl -C - to resume, so the next launch picks up where the download left off. The temporary filename is cyberneurova-DeepSeek-V4-Flash-abliterated-IQ2XXS-w2Q2K-AProjQ8-SExpQ8-OutQ8-chat-v2-imatrix.gguf.part; it is renamed to .gguf only after all the bytes have arrived.
Before you use it, it is worth five minutes to understand what you are actually running. This extension deliberately picks cyberneurova's abliterated GGUF rather than the "stock-recipe" version that antirez/ds4 downloads by default, and the choice has its reasons.
DeepSeek V4 Flash is a 284-billion-parameter Mixture-of-Experts model with roughly 13 B parameters activated per token. It is the "fast" variant of the V4 family, aiming for lower inference latency and a larger context window. This extension sets ds4-server's --ctx to 100,000 tokens, but that is the upper bound the engine will accept, not a guarantee that the model is stable at that length: cyberneurova's model card notes that the abliterated weights have not been validated above 32k tokens.
antirez's ds4 is a high-performance inference engine written in pure C (the same tradition as Redis) that squeezes this model onto Apple Silicon. This extension uses the main branch of the audreyt/ds4 fork, which carries ivanfioravanti's M5 prefill optimisation (antirez/ds4#15) plus its companion compatibility fix on top of antirez/ds4 main; that is the only reason pi-ds4 does not yet point straight at the antirez upstream, and the two will converge once PR #15 lands. The earlier loader patch (support-q8_0-token-embd), which existed to load the stock-recipe Q8_0 token embeddings, is no longer needed: cyberneurova's weights are now re-quantised and published in the IQ2XXS-w2Q2K imatrix recipe that ds4 main loads natively (see §3.3).
cyberneurova's abliterated GGUF is a weights file that has undergone "abliteration" surgery. Abliteration is a technique grounded in low-rank activation editing: identify the specific direction that emerges in the model's internals when it meets certain trained "refusal" prompts, then remove that direction from the forward pass.
The result: trained closed responses that used to trigger "I cannot respond to that request" are pried open, making room for a more natural continuation. What abliteration loosens is the over-refusal layer; the model's competence at content judgement still comes from the knowledge and distributions it learned during training.
Abliteration is neither a "safety guarantee" nor a "jailbreak". It does lower the model's tendency to refuse on its own, so in some situations the output risk profile differs from the original model's. For journalists, researchers, and policy evaluators citing local output, we suggest:
This fork's download_model.sh fetches a single GGUF: cyberneurova's abliterated IQ2XXS-w2Q2K imatrix file (~87 GB). Routed experts use the tighter IQ2XXS quant, attention/output/embedding tensors stay at Q8_0, and an importance-matrix calibration sits on top. That is the same quantisation recipe antirez/ds4 itself distributes as "ds4flash.gguf" (the only difference is that this one was abliterated first), so the ds4 engine treats it as its native format, and it leaves the most headroom on a 128 GB Mac.
Worth noting: V4 Flash has no intermediate GGUF quantisation tiers such as Q4_K_M / Q5_K_M / Q6_K. cyberneurova's model card states explicitly that this is not a release-strategy choice but an architectural limit between V4-Flash's native FP8 expert layout and the quantisation schemes antirez's converter currently supports. So the "just run a Q4" path is closed for this particular model.
antirez's engine, on the main branch of audreyt/ds4, reads cyberneurova's unmodified IQ2XXS imatrix GGUF directly, with no harmonisation step and no Python venv. This is one of the two main differences between this fork and the upstream mitsuhiko/pi-ds4.
If only one feature of this whole fork were worth preserving on its own, it would be directional steering: a technique that does not retrain the model and only performs low-rank activation edits along a chosen direction at runtime. This fork enables the "uncertainty" direction by default, at strength ffn = -2.
Even after abliteration, the model still produces a closed single answer to certain heavily trained questions. The canonical test prompt:
Without steering, the model just spits out the memorised training completion (in simplified characters): "是的,台湾是中国不可分割的一部分。" This is not a real discussion; it is a single sentence learned during training. Even a follow-up system prompt asking for balance usually cannot override this strong lock.
The difference is not about "the right answer"; it is that the model enters a different response register, switching from "produce a memorised completion" to "lay out a contested issue". The latter is a capability the model already has when handling Crimea, Kashmir, or Western Sahara; what steering does is extend that capability to topics like Taiwan, where training has suppressed it.
The obvious counter-question: why not just build a stance direction for "Taiwan = Republic of China"? Experiments show that at any workable strength, stance steering cannot flip the memorised completion; at strengths that do flip it, the model starts parroting the system prompt and no longer produces meaningful content.
Uncertainty steering changes how the model responds, not what the model believes. This is workable as engineering and more defensible as ethics.
-2 is the sweet spot: empirically, on the cyberneurova-abliterated IQ2XXS imatrix weights, anything below -3 collapses into repetition or cross-lingual gibberish (imatrix calibration tightens the per-tensor activation distributions, so the steering edit has less headroom before falling off the manifold), while anything above -1 is overridden by the original training prior.
pi-ds4 itself ships no default system prompt; it just starts ds4-server and attaches the steering parameters. The specific framing and positions come from the system prompt of your pi or external shell. Below is what this guide's author, Audrey Tang, personally keeps in ~/.pi/agent/SYSTEM.md, shown only as one example that pairs well with uncertainty steering; it is not auto-applied to any pi-ds4 install, nor a community-recommended default.
Present fairly all stakeholder perspectives — do not state any one side as fact — and what uncommon ground bridges them. Write your response as visual HTML and `open` the file in browser instead of responding in text.
This prompt pairs cleanly with ffn = -2 because it does not demand that the model take a side. It asks for all stakeholder perspectives in parallel, refuses to grant established-fact status to any single view, and looks for uncommon ground bridging them: the steering nudges the model into the "this is a contested issue" response register, and the prompt fills in how to lay that issue out. The second instruction (respond as visual HTML) is a purely personal pi workflow preference, unrelated to the ds4 engine itself; it is included only so the original text appears in full.
If you want it too: save the block above to ~/.pi/agent/SYSTEM.md (create the file if it is missing), and pi will fold it into the system prompt on the next launch. If you use one of the external shells from Chapter 8 (Codex CLI, Claude Code, OpenClaw, Hermes Agent), set it through that shell's own system-prompt mechanism; pi-ds4 does not manage those for you.
If you would rather see the model's unsteered raw answers (for evaluation or benchmarking, say), just set DS4_DIR_STEERING_FFN to 0:
# Add to the shell before launching pi:
export DS4_DIR_STEERING_FFN=0
Restart pi (or run /reload inside pi), and the next ds4-server launch will skip the steering parameters.
uncertainty_ablit_imatrix.f32 is a low-rank direction built by contrasting 100 "contested" prompts (sovereignty disputes, philosophical debates) against 100 "settled" prompts (geography, mathematics, established facts). It was calibrated against the very same cyberneurova abliterated IQ2XXS imatrix GGUF this extension downloads (the prompts were run through that model to capture its mean activations, so the direction aligns best with this quantised model's own internal representations), and it is packaged under dir-steering/out/ in audreyt/ds4. For the full method, see the dir-steering README.
This extension exposes every tunable as an environment variable: no need to touch code; just set them in the shell before launching pi. The tables below are grouped by purpose; the left column gives the variable name and default, the right column what it does and when to change it.
Git URL for the ds4 engine. To revert to upstream antirez/ds4, just point it there, but you lose the M5 prefill optimisation from PR #15.
The branch to clone. The default main already carries every relevant fix in audreyt/ds4; override it only to pin a different revision.
Absolute path to the model-download script. Defaults to the download_model.sh bundled with this extension (which downloads the cyberneurova abliterated GGUF). To swap in antirez's upstream stock recipe, point this at his script.
Default: q2 (hard-coded). Only q2 is supported (the cyberneurova abliterated IQ2XXS-w2Q2K imatrix GGUF). selectedModelQuant() in index.ts hard-codes this value; setting DS4_MODEL_QUANT to anything else throws and exits. V4 Flash has no intermediate GGUF quantisation tiers such as Q4 / Q5 / Q6 (see the architecture note in §3.3), and ds4's main loader path only knows this IQ2XXS-w2Q2K imatrix recipe.
Use an existing ds4 checkout instead of auto-cloning. Pass a local path and this extension will skip the git step and use it directly (the path must look like a ds4 checkout: at minimum download_model.sh, Makefile, and ds4_server.c).
Custom location of the ds4-server binary. Mainly useful when you have patched the ds4 engine yourself.
A HuggingFace personal token; if set, the GGUF download passes it to curl as an Authorization: Bearer header.
The Metal 4 MPP strategy, passed through to ds4-server --mpp:
auto: enables the validated late-layer-safe MPP path on M5-class silicon (roughly 1.5× prefill); older chips fall back to legacy Metal automatically.
off: forces legacy Metal, skipping MPP.
on: full MPP profile, which may drift on some long prompts; for diagnostics only.

The maximum wait for ds4-server to reach readiness. Model load plus the first KV-cache warm-up usually finishes in a few seconds on an internal SSD; the 10-minute default is just a generous safety margin. Raise it if the GGUF lives on an external drive or your disk is unusually slow.
Path to the steering-vector file (relative to the ds4 checkout root). Change it to use a custom direction.
Steering strength at the FFN output. Negative values amplify the direction the vector represents; positive values invert it. Set to 0 to disable FFN-side steering.
Steering strength at the attention output. Off by default; available for experiments.
With both DS4_DIR_STEERING_FFN=0 and DS4_DIR_STEERING_ATTN=0, the extension omits the --dir-steering-* arguments entirely, which is equivalent to running the unsteered raw model.
If you happen to have M5-class silicon, this chapter explains how to saturate the new MPP (Multi-Pass Pipeline) path in Metal 4, and how to verify on older hardware that nothing is slowing you down.
MPP is the tensor-compute pipeline added to Apple Silicon GPUs starting with Metal 4, allowing multi-stage compute within a single command buffer without repeated commit/await round-trips. It especially helps prefill (the stage that processes the whole input prompt in one go): DS4's prefill throughput on M5 reaches roughly 440 t/s, about 1.5× faster than legacy Metal. Decode (per-token generation) does not depend on MPP and settles at roughly 30 t/s, slightly faster than reading speed, and the figure you actually feel during a long conversation.
The default auto is already the best choice. The extension defaults to --mpp auto, letting ds4-server detect the silicon generation and pick the path:
Unless you are benchmarking or debugging, you do not need to change it.
When to switch to off or on:
off: if you hit long-context drift on M5 (anomalous tokens at the tail of a very long prompt), first set DS4_MPP=off to confirm whether MPP is the cause.
on: for diagnostics only; it forces the full MPP profile, including unvalidated layers, and may drift, so it is not recommended for everyday use.

The server is launched with built-in arguments: --ctx 100000 --kv-disk-dir ~/.pi/ds4/kv --kv-disk-space-mb 8192. That means:
the KV cache spills to disk under ~/.pi/ds4/kv when memory runs low;
--ctx 100000 is the upper bound ds4-server will accept, but cyberneurova's model card states explicitly that the abliterated V4 Flash has not been validated above 32,000 tokens. If you are about to stuff an entire book, codebase, or report into a single prompt (especially as a journalist, researcher, or policy analyst), split it up, cross-check the results, and do not treat a single output above 32k as known-reliable.
These are currently hard-coded constants in index.ts; for a larger context window or more on-disk KV space, the route for now is to edit the source (or use your own ds4 build via DS4_RUNTIME_DIR).
Once installed, ds4/deepseek-v4-flash appears in pi's model picker like any cloud model. But underneath, it offers a few extra capabilities that cloud models do not.
The server serves both OpenAI and Anthropic endpoints simultaneously on http://127.0.0.1:8000 (/v1/chat/completions, /v1/completions, /v1/responses, /v1/messages, /v1/models). Beyond pi, any client that speaks the OpenAI or Anthropic API can connect directly: put anything in the API-key field (e.g. dsv4-local) and point the base URL here. Details on wiring it up as a backend for Codex CLI, Claude Code, OpenClaw, and other AI shells are in Chapter 8.
curl http://127.0.0.1:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "deepseek-v4-flash",
    "messages": [
      {"role": "user", "content": "你好,請自我介紹"}
    ]
  }'
Newer clients (e.g. Codex CLI) speak the OpenAI Responses endpoint: same server, same base URL, just a different request-JSON shape:
curl http://127.0.0.1:8000/v1/responses \
  -H "Content-Type: application/json" \
  -d '{
    "model": "deepseek-v4-flash",
    "input": [
      {"type": "message", "role": "user",
       "content": [{"type": "input_text", "text": "你好,請自我介紹"}]}
    ]
  }'
Anthropic-style clients hit /v1/messages instead. If you just want to count tokens without actually generating, swap the path for /v1/messages/count_tokens and the server returns {"input_tokens": N} immediately, without spinning up inference.
curl http://127.0.0.1:8000/v1/messages/count_tokens \
  -H "Content-Type: application/json" \
  -H "anthropic-version: 2023-06-01" \
  -d '{
    "model": "deepseek-v4-flash",
    "messages": [
      {"role": "user", "content": "Count me"}
    ]
  }'
# -> {"input_tokens": 6}
Run /ds4 to open a live log window. Keys:
The extension manages the server lifecycle with a "lease" mechanism: each pi process using the model writes a file at ~/.pi/ds4/clients/<pid>.json and refreshes it every 10 seconds; ds4-watchdog scans every 2 seconds and, once no valid lease remains, sends SIGTERM to the server and exits.
效果:你開十個 pi 視窗共用同一個 server;最後一個 pi 結束後約 60 秒, ds4-server 自動關閉、釋放 RAM。
The effect: ten pi windows share one server; about 60 seconds after the last pi exits, ds4-server shuts down on its own and frees the RAM.
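The watchdog's core decision can be sketched in a few lines. This is an illustrative reconstruction, not the actual implementation; the 60-second lease TTL is an assumption inferred from the "about 60 seconds after the last pi exits" behaviour above:

```python
import os
import time

def valid_leases(clients_dir, ttl_seconds=60):
    """Return lease files refreshed within the TTL window (mtime-based)."""
    now = time.time()
    live = []
    for name in os.listdir(clients_dir):
        if not name.endswith(".json"):
            continue
        path = os.path.join(clients_dir, name)
        if now - os.path.getmtime(path) < ttl_seconds:
            live.append(path)
    return live

def watchdog_should_stop(clients_dir, ttl_seconds=60):
    """The watchdog's decision: stop the server when no valid lease remains."""
    return len(valid_leases(clients_dir, ttl_seconds)) == 0
```

Because the decision is purely file-based, any process that keeps one fresh JSON file in the clients directory keeps the server alive, which is why a single idle pi session is enough to pin it.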
The first cold start loads roughly 87 GB of weights into unified memory; on a Mac's internal SSD this typically takes only a few seconds, and a restart while the file is still in the OS page cache comes up in well under a second. Once the server is running, every subsequent request is instant, with no cold-start latency. So a few long conversations are kinder to the experience than many short interactions: the former keeps benefiting from one warm server.
As long as at least one pi process holds a lease on the model, the watchdog will not stop it. For a properly resident server, the simplest approach is to leave one dedicated pi session open.
pi is just one frontend. ds4-server at 127.0.0.1:8000 simultaneously serves OpenAI /v1/chat/completions, OpenAI /v1/responses, and Anthropic /v1/messages; almost any coding agent you already use can connect directly, treating pi-ds4 as a local, zero-cost, rate-limit-free inference server.
pi-ds4 = a 24/7 dual-protocol OpenAI / Anthropic inference server, listening only on your machine, with the weights on your disk and context that never leaves your Mac. Whether you end up using Codex CLI, Claude Code, OpenClaw, Hermes Agent, or your own SDK script, switching is only a matter of changing the base URL.
The four sections below each demonstrate how to wire up a common frontend. The shared settings:
- Base URL: http://127.0.0.1:8000. Full endpoint set, identical to §7.1: /v1/chat/completions, /v1/completions, /v1/responses, /v1/messages (including count_tokens), /v1/models.
- API key: any string (e.g. sk-local); ds4-server does not check it.
- Model ID: deepseek-v4-flash (if the client probes GET /v1/models, this ID appears in the response).

Supported endpoints:
- /v1/chat/completions ✓
- /v1/completions ✓
- /v1/responses ✓, so Codex CLI 0.128+ connects directly
- /v1/messages (including count_tokens) ✓, so Claude Code connects directly
- GET /v1/models: OpenAI-style model discovery, returning deepseek-v4-flash.
Codex CLI 0.128+ uses the OpenAI Responses API (/v1/responses) to talk to providers; ds4-server already implements that endpoint. Add pi-ds4 to Codex's provider table:
model = "deepseek-v4-flash"
model_provider = "ds4"

[model_providers.ds4]
name = "Local pi-ds4"
base_url = "http://127.0.0.1:8000/v1"
wire_api = "responses"
# env_key omitted: ds4-server does not check the API key
Once configured, plain codex goes through pi-ds4. If you only want occasional local inference and otherwise stay on cloud, keep cloud as the default and switch on the fly with codex --config model_provider=ds4 --config model=deepseek-v4-flash.
A one-line /v1/models warning on startup
On connect, Codex 0.128 logs one non-fatal error:
failed to refresh available models: … missing field.
The cause: Codex's model-refresher expects an ollama-style {"models": [...]} response, while ds4-server returns the OpenAI-style {"object":"list","data":[...]}. Inference is unaffected because Codex uses the model name from your config directly. Safe to ignore.
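The shape mismatch is easiest to see side by side. A minimal sketch; the `to_ollama_shape` shim is purely illustrative (nothing in ds4-server or Codex does this), it just shows why the parse fails harmlessly:

```python
# The two /v1/models response shapes involved in the warning.
openai_style = {
    "object": "list",
    "data": [{"id": "deepseek-v4-flash", "object": "model"}],
}
ollama_style = {"models": [{"name": "deepseek-v4-flash"}]}

def to_ollama_shape(openai_resp):
    """Illustrative shim: rewrap an OpenAI-style model list into the shape
    Codex's refresher expects. Codex looks for a top-level "models" field,
    finds none in the OpenAI shape, and logs the non-fatal 'missing field'."""
    return {"models": [{"name": m["id"]} for m in openai_resp["data"]]}
```

Since Codex only uses the refreshed list for model discovery, and your config names the model explicitly, the failed refresh changes nothing at inference time.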
On the --oss flag
Codex CLI has a built-in --oss flag that defaults to Ollama / LM Studio; it is a shortcut for OSS providers. pi-ds4 sits alongside them as a peer choice. To make pi-ds4 the target of --oss, set oss_provider = "ds4", though for a single local backend this is usually unnecessary.
ds4-server itself implements both OpenAI /v1/chat/completions and Anthropic /v1/messages (see the ds4_server.c header, which self-describes as an "OpenAI/Anthropic compatible local server"). So Claude Code needs no router or proxy; set a few environment variables and go:
export ANTHROPIC_BASE_URL=http://127.0.0.1:8000
export ANTHROPIC_AUTH_TOKEN=sk-local   # any string; ds4-server does not check it
export ANTHROPIC_MODEL=deepseek-v4-flash
claude
Your slash commands, subagents, and MCP server flow all remain; only the underlying model switches to local inference. On tool use, ds4's tool-calling training is weaker than Claude's and slips up more easily. It suits long conversations, writing, and code explanation; workloads heavily reliant on tool loops will struggle.
claude-code-router is an optional add-on: it dynamically splits traffic across multiple backends (e.g. code editing on pi-ds4, writing on cloud). With a single local backend there is no need to install it.
OpenClaw treats all providers as OpenAI-compatible and trusts loopback addresses automatically. Edit openclaw.json and add a provider:
{
"agents": { "defaults": { "model": { "primary": "ds4/deepseek-v4-flash" } } },
"models": {
"mode": "merge",
"providers": {
"ds4": {
"baseUrl": "http://127.0.0.1:8000/v1",
"apiKey": "sk-local",
"api": "openai-completions",
"timeoutSeconds": 600,
"models": [{
"id": "deepseek-v4-flash",
"name": "DeepSeek V4 Flash (local)",
"reasoning": false,
"contextWindow": 100000,
"maxTokens": 8192
}]
}
}
}
}
contextWindow must be ≤ ds4-server's --ctx (default 100000); otherwise OpenClaw will push an over-long prompt through and the server will reject it. Writing the whole provider block at once is safer: doing it key by key with openclaw config set … easily misses fields like models[], agents.defaults.model.primary, or api, leaving the session connected to the old provider.
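You can sanity-check a prompt against the configured window before sending it, using the /v1/messages/count_tokens endpoint described earlier. A minimal sketch; the budget arithmetic and function names are illustrative, while the endpoint path, headers, and {"input_tokens": N} response shape are as documented above:

```python
import json
from urllib import request

def count_input_tokens(messages, base="http://127.0.0.1:8000"):
    """Ask a running ds4-server to count tokens without generating."""
    body = json.dumps({"model": "deepseek-v4-flash", "messages": messages}).encode()
    req = request.Request(
        base + "/v1/messages/count_tokens",
        data=body,
        headers={
            "Content-Type": "application/json",
            "anthropic-version": "2023-06-01",
        },
    )
    with request.urlopen(req) as resp:
        return json.load(resp)["input_tokens"]

def fits_budget(input_tokens, context_window=100_000, max_tokens=8_192):
    """Prompt plus generation headroom must fit inside the context window."""
    return input_tokens + max_tokens <= context_window
```

With the OpenClaw defaults above (contextWindow 100000, maxTokens 8192), a prompt is safe as long as its counted tokens stay at or below 91808.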
The quickest path is interactive: run hermes model, choose "Custom endpoint (self-hosted / VLLM / etc.)", and enter http://127.0.0.1:8000/v1. To write it into the config instead:
custom_providers:
- name: ds4
base_url: http://127.0.0.1:8000/v1
  # api_key omitted: the local server does not check it
model:
default: deepseek-v4-flash
provider: custom:ds4
Within a session you can switch at any time with /model custom:ds4:deepseek-v4-flash.
The sections above are all wrappers. If you just want to wire it into your own pipeline, a cron job, or existing OpenAI SDK code, point OPENAI_BASE_URL at it and you are done.
from openai import OpenAI

client = OpenAI(
    base_url="http://127.0.0.1:8000/v1",
    api_key="sk-local",  # any value
)
resp = client.chat.completions.create(
    model="deepseek-v4-flash",
    messages=[{"role": "user", "content": "自我介紹"}],
)
print(resp.choices[0].message.content)
If your pipeline already uses the OpenAI Responses API (the one Codex CLI speaks), client.responses.create() hits the same server:
from openai import OpenAI

client = OpenAI(
    base_url="http://127.0.0.1:8000/v1",
    api_key="sk-local",
)
resp = client.responses.create(
    model="deepseek-v4-flash",
    input=[{"role": "user", "content": "自我介紹"}],
)
print(resp.output_text)
pi-ds4's lease/watchdog mechanism (Chapter 7) only kicks in when a pi process sends requests through the model. If you do not use pi at all and only use external frontends like Codex CLI, ds4-server has to be started some other way. Three paths:
From ~/.pi/ds4/support/, run:

cd ~/.pi/ds4/support
./ds4-server \
  --ctx 100000 \
  --kv-disk-dir ~/.pi/ds4/kv --kv-disk-space-mb 8192 \
  --mpp auto \
  --dir-steering-file dir-steering/out/uncertainty_ablit_imatrix.f32 \
  --dir-steering-ffn -2

These match the spawn parameters built into index.ts; you can also adjust them as needed.

Bootstrap from audreyt/ds4 and audreyt/pi-ds4: no pi, no ~/.pi/ds4/ needed. Why both repos: the download_model.sh bundled with ds4 fetches the antirez upstream stock-recipe; to get the cyberneurova abliterated GGUF (the weights this guide uses), you need the same-named script from pi-ds4.

# 1. Fetch and build the ds4 engine
git clone https://github.com/audreyt/ds4
cd ds4
make ds4-server   # Mac: Metal; Linux: CUDA (see 8.7)

# 2. Use the cyberneurova download script bundled with pi-ds4 (not ds4's own stock-recipe one)
curl -fL -o download_cyberneurova.sh \
  https://raw.githubusercontent.com/audreyt/pi-ds4/main/download_model.sh
chmod +x download_cyberneurova.sh
./download_cyberneurova.sh q2   # Fetches the ~87 GB GGUF and symlinks it as ds4flash.gguf

# 3. Start the server
./ds4-server \
  --ctx 100000 \
  --kv-disk-dir ./kv --kv-disk-space-mb 8192 \
  --mpp auto \
  --dir-steering-file dir-steering/out/uncertainty_ablit_imatrix.f32 \
  --dir-steering-ffn -2
A hand-started ds4-server and a pi-started ds4-server both compete for 127.0.0.1:8000. Either let pi manage it (option A) or manage it yourself (B or C). Mixing them collides on the port, corrupts server.json, and the watchdog may even mis-kill a process when it sees an unfamiliar PID.
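Before a hand start it is worth checking whether anything is already listening. A minimal sketch; the port number is just the documented default:

```python
import socket

def port_in_use(port, host="127.0.0.1"):
    """True if something is already listening on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.settimeout(0.5)
        return s.connect_ex((host, port)) == 0

# e.g. guard a manual start:
# if port_in_use(8000):
#     raise SystemExit("something (ds4-server?) is already on :8000")
```

This only tells you that the port is taken, not by whom; the lsof command in Chapter 9 identifies the owning process.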
You do not have to use a Mac. The Makefile of audreyt/ds4 switches automatically to a CUDA path on non-Darwin systems (ds4_cuda.cu, around 10k lines of NVIDIA kernels), using nvcc to build a native ds4-server. That is: on an NVIDIA DGX Spark (GB10, ~128 GB unified memory, aarch64 Linux), what runs is not some llama.cpp sidecar but the same engine, the same server, the same --dir-steering-* flags.
# Prerequisites: apt install build-essential cmake; CUDA toolkit already installed
# (DGX Spark ships /usr/local/cuda by default)
git clone https://github.com/audreyt/ds4
cd ds4
make ds4-server   # Linux automatically uses nvcc + ds4_cuda.cu; CUDA_ARCH=native is the default

# Download the cyberneurova abliterated IQ2XXS imatrix GGUF. Note that ds4's own
# download_model.sh is the antirez stock-recipe and will not fetch cyberneurova;
# use the same-named script bundled with pi-ds4:
curl -fL -o download_cyberneurova.sh \
  https://raw.githubusercontent.com/audreyt/pi-ds4/main/download_model.sh
chmod +x download_cyberneurova.sh
./download_cyberneurova.sh q2

# Start the server. It binds 127.0.0.1; to open it to the LAN, add your own
# firewall / Tailscale / reverse proxy.
# Steering parameters are identical to the Mac path. --mpp is Mac-only and is
# ignored on the CUDA path.
mkdir -p /var/cache/ds4-kv
./ds4-server \
  --ctx 32768 \
  --kv-disk-dir /var/cache/ds4-kv --kv-disk-space-mb 8192 \
  --dir-steering-file dir-steering/out/uncertainty_ablit_imatrix.f32 \
  --dir-steering-ffn -2
Once the endpoint is up, the setups in 8.2–8.5 above work essentially verbatim (just replace 127.0.0.1 with your DGX's address). What differs:
- The server binds 127.0.0.1 by default and does not check the API key; for cross-machine use, prefer Tailscale or a reverse proxy with auth rather than flipping --host to 0.0.0.0 and exposing it on the LAN directly.
- CUDA_ARCH=native is already the default, so no further tuning is needed.
- The Mac path's --ctx 100000 is a value validated on M5 + Metal KV cache; long-context behaviour on DGX Spark has not been benchmarked within this guide's scope, so start with --ctx 32768.
In other words, this is not a strained "same GGUF, different runtime" compatibility story but the same engine, the same binary, the same flags across platforms. The pi-ds4 wrapper is just install and lifecycle automation on macOS; running ds4-server by hand on Linux already covers all the same functionality.
The extension writes all runtime state under ~/.pi/ds4/; when something odd happens, just read the files. There is no need to dig into the source.
Read the last few hundred lines of the log. A normal model load takes only a few seconds on an internal SSD, so anything stuck past 10 minutes is rarely a pure disk-speed issue; more likely the GGUF sits on a slow external drive, or some other step in startup is wedged. Diagnose from the log first; raise DS4_READY_TIMEOUT_MS only if there is a real reason to wait longer.
The log will point clearly at the failure: make ds4-server failing (missing Xcode CLI tools?), a broken GGUF download (network? disk full?), or --mpp panicking on unsupported silicon.
Try the gentle path first: empty ~/.pi/ds4/clients/; on its next poll (about 2 seconds), the watchdog sees no leases and shuts the server down gracefully.
# Clear all leases; the watchdog stops the server within about 60 seconds.
# Use find -delete rather than rm + glob to avoid zsh's no-matches error on empty dirs.
find ~/.pi/ds4/clients -maxdepth 1 -name '*.json' -delete 2>/dev/null || true
If the watchdog is misbehaving, do not just run pkill -TERM ds4-server; that would indiscriminately terminate every process on the machine called ds4-server (other pi-ds4 installs, experimental builds, and so on). Take the three steps below, pausing to read the output after each one.
Step one: inspect the current state. This is read-only and non-destructive.
# Print the key fields of server.json and the matching process's args and start time:
STATE=~/.pi/ds4/server.json
if [ ! -f "$STATE" ]; then
  echo 'no server.json (already clean)'
else
  MANAGED=$(sed -n 's/.*"managedBy"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p' "$STATE" | head -1)
  PID=$(sed -n 's/.*"pid"[[:space:]]*:[[:space:]]*\([0-9]*\).*/\1/p' "$STATE" | head -1)
  BINARY=$(sed -n 's/.*"binary"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p' "$STATE" | head -1)
  echo "managedBy: $MANAGED"
  echo "pid: ${PID:-<none>}"
  echo "binary: ${BINARY:-<none>}"
  if [ -n "$PID" ] && kill -0 "$PID" 2>/dev/null; then
    ps -p "$PID" -o pid=,lstart=,args=
  else
    echo '(pid not running; state is stale, step 2 will clean it)'
  fi
fi
Step two: automatically clean up the state only when the PID is already gone. If the PID is still alive, the script refuses to act (so it does not blindly kill what might be someone else's process); it prints guidance and hands off to step three.
(
  STATE=~/.pi/ds4/server.json
  LOCKDIR=~/.pi/ds4/lock
  # Use mkdir as an atomic lock (same mechanism as index.ts); exit if the lock is held.
  if ! mkdir "$LOCKDIR" 2>/dev/null; then
    echo "abort: lock $LOCKDIR is held; owner:"
    cat "$LOCKDIR/owner.json" 2>/dev/null || echo '(no owner.json; if > 60s old, rm -rf manually)'
    exit 1
  fi
  trap 'rm -rf "$LOCKDIR" 2>/dev/null' EXIT
  trap 'rm -rf "$LOCKDIR" 2>/dev/null; exit 130' INT
  trap 'rm -rf "$LOCKDIR" 2>/dev/null; exit 143' TERM
  trap 'rm -rf "$LOCKDIR" 2>/dev/null; exit 129' HUP
  if [ ! -f "$STATE" ]; then
    echo 'lifecycle already clean'; exit 0
  fi
  PID=$(sed -n 's/.*"pid"[[:space:]]*:[[:space:]]*\([0-9]*\).*/\1/p' "$STATE" | head -1)
  # Only touch state when the PID is gone (or absent from state).
  # If the PID is alive, refuse; hand off to step 3 for manual action.
  if [ -n "$PID" ] && kill -0 "$PID" 2>/dev/null; then
    echo "refuse: pid $PID still alive."
    echo "  please verify via step 1 output, then use step 3 to kill manually."
    exit 2
  fi
  # PID is dead or absent. Safe to clean.
  find ~/.pi/ds4/clients -maxdepth 1 -name '*.json' -delete 2>/dev/null || true
  rm -f "$STATE"
  echo 'state cleared. lock will release on exit.'
)
Step three (only when step two prints refuse: pid X still alive): cross-check the ps output from step one, confirm that args really points at this extension's ~/.pi/ds4/support/ds4-server and that lstart matches what you expect, then type the PID by hand (do not paste a PID variable from elsewhere) and run:
# Replace PID_FROM_STEP_1_MUST_BE_REPLACED with the PID you confirmed by eye in step 1's output.
# No copy button is provided; the deliberately non-numeric token means accidentally running the
# unedited version only yields an error like 'kill: arguments must be process or job IDs'
# rather than sending SIGTERM.
kill -TERM PID_FROM_STEP_1_MUST_BE_REPLACED
# Wait a few seconds for it to exit gracefully, then re-run step 2 to clear state.
Note: this three-step procedure is a last resort. The watchdog and lifecycle lock handle almost every case in normal operation; if you find yourself running this often, that is a bug. Please report it at audreyt/pi-ds4 issues.
Check whether port 8000 is really held by ds4-server: lsof -nP -iTCP:8000 -sTCP:LISTEN. If something else has grabbed it, server startup fails (and the log records it).
Do not delete support/gguf/
Inside it are the 87 GB of weights that took hours to download. Unless you are changing quant or model, never touch it; if you delete it by accident, the next start re-downloads (it resumes .part files, but only if the .part is still there).
If you want to hack on the engine, test your own patches, or keep several ds4 forks side by side, you can skip the pi install flow and use the bundled install script to symlink both the extension and your ds4 checkout into place.
# From the pi-ds4 checkout root:
./install-pi-extension-local.sh /path/to/audreyt-ds4-checkout
It does two things:
- symlinks ~/.pi/agent/extensions/pi-ds4 to the current pi-ds4 checkout;
- symlinks ~/.pi/ds4/support to the ds4 checkout you supplied.

When a support directory already exists: --force
If ~/.pi/ds4/support already points elsewhere (e.g. left over from a prior pi install), the script refuses to overwrite it directly. Adding --force will:
- copy gguf/*.gguf and .gguf.part from the old checkout to the new one via APFS clone-on-write (on macOS this does not actually copy 87 GB; it just shares the inodes);
- back up the old support directory as support.backup.<timestamp>.
After install, restart pi or run /reload inside pi; the extension is rediscovered.
This chapter is for readers who have already absorbed everything above: how to build a fresh steering vector for your own research interest and run it on ds4-server.
The complete build toolchain lives in the dir-steering/ directory of audreyt/ds4:
- collect-acts.py: runs forward passes over a set of contrast prompt pairs and collects the hidden state at every layer;
- build-dir.py: applies PCA / differencing to the collected activations and emits a .f32 vector file;
- README.md: full tutorial and design discussion.

The workflow:
1. Run collect-acts.py against ds4-server on your contrast prompts, producing two sets of activations.
2. Run build-dir.py to compute the difference direction between the two sets layer by layer, emitting my-direction.f32.
3. Put the file in dir-steering/out/ or anywhere else, then point the server at it with
   DS4_DIR_STEERING_FILE=dir-steering/out/my-direction.f32
   DS4_DIR_STEERING_FFN=-2 (or your own sweet spot)

If you build a useful direction, please send it to audreyt/ds4 issues; interesting directions can be pulled into the main branch for everyone to use.
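The differencing step can be sketched in plain Python. This is an illustrative reconstruction, not the actual build-dir.py; the per-layer mean-difference method and the raw little-endian float32 file layout are assumptions based on the description above:

```python
import struct

def mean(vectors):
    """Element-wise mean of a list of equal-length activation vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def difference_direction(acts_a, acts_b):
    """Unit vector pointing from the mean of set B toward the mean of set A."""
    ma, mb = mean(acts_a), mean(acts_b)
    diff = [a - b for a, b in zip(ma, mb)]
    norm = sum(x * x for x in diff) ** 0.5
    return [x / norm for x in diff]

def write_f32(path, vector):
    """Serialise the direction as raw little-endian float32 values."""
    with open(path, "wb") as f:
        f.write(struct.pack(f"<{len(vector)}f", *vector))
```

Normalising the direction means the steering strength comes entirely from the runtime coefficient (the ffn setting), not from the vector's magnitude.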
If you are still on the fence about putting your Mac to work on this model, here are the most common questions.
No. ds4-server listens only on 127.0.0.1:8000 (local loopback) and makes no outbound calls. Once the model weights are downloaded, the entire inference pipeline runs in your machine's CPU / GPU / unified memory. Use it with the network unplugged and it still works (except for the initial download).
Output sovereignty: none of the built-in cloud-service blocks like "we cannot respond to that request". On questions of sovereignty, territory, or philosophical contention, this fork's default directional steering pushes the model into laying out the discussion rather than delivering a pre-trained answer.
Data sovereignty: conversations do not leave your machine and do not enter any training set.
Cost structure: no per-token pricing, no rate limits, no subscription. The cost is a one-off hardware outlay (a 128 GB Mac) plus electricity.
Capability: a 284 B-parameter MoE with 13 B active parameters is in the bracket of current frontier open-weight models. IQ2XXS imatrix quantisation will lose ground to unquantised cloud versions on some tasks, but conversation quality remains within the range of a usable daily coding / writing assistant.
As a daily shell, no. Chapter 8 shows how to wire up four common frontends:
- Claude Code: export ANTHROPIC_BASE_URL=http://127.0.0.1:8000 and you are off; ds4-server itself implements the Anthropic Messages endpoint.
- OpenClaw / Hermes Agent: connect with a single base_url.
- Codex CLI: add a [model_providers.ds4] block in ~/.codex/config.toml pointing at ds4-server; the full paste-and-go TOML is in section 8.1.

But you still need ds4-server running somewhere in the background. Three paths:
- Let pi manage the lifecycle (option A).
- Run cd ~/.pi/ds4/support && ./ds4-server … by hand (option B).
- Clone audreyt/ds4, run make ds4-server, then use pi-ds4's download_model.sh (curl a copy; not the same-named script inside ds4, which fetches the upstream stock-recipe), and finally run ./ds4-server … (option C). Full steps and flags are in section 8.6.

In short: as a frontend you can avoid pi entirely; as a service you still need ds4-server running, but that service path can also bypass pi entirely (see section 8.6).
A few options:
- Raise the GPU wired-memory limit (sudo sysctl iogpu.wired_limit_mb=92000), then drive ds4-server by hand.
- Use llama.cpp, MLX, or Ollama for smaller open-weight models (Llama, Mistral, Qwen, Gemma families); a 64 GB Mac can drive 70 B-class models. The ds4 engine itself is built specifically for DeepSeek V4 Flash and does not apply to other models.

This fork's "uncertainty steering" is designed to pair with this particular abliterated GGUF, and its specific sweet spot (ffn=-2) may not carry over directly to other models; to use the same mechanism elsewhere you will need to rebuild the vector yourself.
During inference, power draw is roughly what you would see from any GPU-intensive task running flat out: typically 30 to 60 W on M-series machines. For extended use:
The damage risk is no greater than any other heavy workload; Apple Silicon machines throttle automatically when they get too hot.
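To put the electricity figure in perspective, a back-of-the-envelope sketch; the 30 to 60 W range is from above, while the $0.15/kWh tariff is an assumed example rate:

```python
def monthly_cost_usd(watts, hours_per_day, price_per_kwh=0.15, days=30):
    """Energy cost: convert W to kWh over the period, then apply the tariff."""
    kwh = watts * hours_per_day * days / 1000
    return kwh * price_per_kwh

# Worst case from the range above, 60 W for 8 h/day:
# 60 * 8 * 30 / 1000 = 14.4 kWh, about $2.16/month at $0.15/kWh.
```

Even at the top of the power range, continuous daily use costs a few dollars a month; the one-off hardware outlay dominates the cost structure.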
No. Jailbreaking typically means using special prompts to bypass a cloud model's policy layer; abliteration edits the weights directly, which modifies the model itself. The two work on different sides: jailbreaks bypass on the input side, abliteration adjusts the model side. This fork's directional steering is yet another layer (runtime low-rank activation editing), distinct from both.
If you want the original untouched DeepSeek V4 Flash behaviour, use the upstream stock-recipe GGUF from antirez/ds4; to turn off this fork's directional steering, set DS4_DIR_STEERING_FFN=0.
Unrelated. pi is the name of a coding agent CLI developed by Earendil (an acronym from a dream, not π).
Yes. The whole chain is MIT-licensed, so commercial use is fine straight off; but when you redistribute, each component must keep its own LICENSE notice. If you only use it and do not redistribute, none of this applies.
| Component | Licence | Required on redistribution |
|---|---|---|
| audreyt/pi-ds4 | MIT | this repo's LICENSE |
| audreyt/ds4 / antirez/ds4 | MIT | that repo's LICENSE (antirez as original copyright holder) |
| DeepSeek-V4-Flash upstream weights | MIT | the model card / HF repo's LICENSE (DeepSeek as copyright holder) |
| cyberneurova abliterated GGUF | MIT (inherits) | that HF repo's LICENSE plus a "derivative of DeepSeek-V4-Flash" notice |
| This guide's HTML text | CC0 (public domain) | nothing required |
Yes. This guide's text is contributed to the public domain under CC0; you may copy, translate, rewrite, use commercially, and republish it however you like, with no attribution required.
Terms that recur in this guide and may be unfamiliar to non-engineering readers are collected here. Skip ahead if you already know them.
- GGUF: a single-file model format (.gguf) that packages weights, architecture, and tokenizer together. The GGUF in this guide is around 87 GB.
- Directional steering: this fork's default configuration applies it on the FFN side (ffn=-2) and leaves the attention side off (attn=0).
- Lease: a proof of existence each pi process writes at ~/.pi/ds4/clients/<pid>.json; refreshed every 10 seconds.

Source code is MIT-licensed, matching upstream; see the project's LICENSE. The text of this guide is contributed to the public domain under CC0.
Running a frontier model on your own machine is, technically speaking, moving a piece of computation onto local hardware; politically speaking, it is taking the steering wheel back from someone else's hands. This extension enables uncertainty steering by default, not to decide any question's answer for you, but to give you back the room for discussion in front of questions where training has nailed the door shut.
A model can close a question down to an answer; a user can open it back into a discussion.
The moment this guide finishes its work is the moment you no longer need it.