Audrey Tang 唐鳳
pi‑ds4 指南Guide v 0.3
一行安裝 · 整本指南One-line install · The whole guide

前沿模型,請進你的 MacBook Invite the frontier model onto your MacBook

在一台 128 GB 以上的 Apple Silicon Mac 上本機運行 284 B 參數的 DeepSeek V4 Flash——沒有雲端呼叫、沒有 API 費用、沒有逐 token 計價、 沒有頻率限制;模型的「方向性引導」轉盤,握在你自己手上。

Run the 284‑billion-parameter DeepSeek V4 Flash on your own 128‑GB Apple Silicon Mac — no cloud calls, no API fees, no per‑token billing, no rate limits. The model’s directional steering dial stays in your hands.

284B
Mixture‑of‑Experts 參數總量(每 token 啟用 13 B)Mixture‑of‑Experts parameter count (13 B activated per token)
87GB
IQ2XXS imatrix GGUF 權重檔大小IQ2XXS imatrix GGUF weights file size
440t/s
M5 上 prefill 吞吐參考值Reference prefill throughput on M5
30t/s
M5 上 decode(逐 token 推論)吞吐Decode (per-token inference) throughput on M5
六十秒入門 · 60-second primer

不用先讀完全部,先看這裡 You don’t need to read the whole thing; start here

這是什麼?What is this?

pi-ds4 會在你的 Mac 上跑一個叫 ds4-server 的本機推論伺服器, 載入 deepseek-v4-flash 這顆 AI 模型,並在 127.0.0.1:8000 同時開出 OpenAI 與 Anthropic 兩種 API 端點。 整個過程完全在你自己機器上,不會把你的對話送出去任何地方—— 沒有雲端、沒有 API 帳號、沒有逐 token 收費、沒有「對不起,我無法回應」這類雲端常見的封鎖。

pi-ds4 runs a local inference server called ds4-server on your Mac, loading the deepseek-v4-flash AI model and exposing OpenAI and Anthropic API endpoints simultaneously on 127.0.0.1:8000. The whole thing happens entirely on your own machine — no cloud, no API account, no per-token billing, none of the “sorry, I can’t respond to that” gates that cloud services routinely apply.

pi 是它最方便的前端,所以本指南叫做 pi-ds4。但如果你已經是 Codex CLI、Claude Code、OpenClaw、Hermes Agent 的使用者,可以直接把它們指到這台本機伺服器, 把 pi-ds4 當作本機 backend 來用——四個 shell 都是改一兩個環境變數或 config 區段的事,詳見第八章

pi is its most convenient frontend, hence the name pi-ds4. But if you’re already using Codex CLI, Claude Code, OpenClaw or Hermes Agent, you can point any of them at this local server and use pi-ds4 as your local backend — all four shells are just an env-var or config-section change away. See Chapter 8.
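As a sketch of what that change looks like for a generic OpenAI- or Anthropic-style client, a couple of environment variables are usually all it takes. The variable names below follow common SDK conventions and are assumptions, not taken from this guide; Chapter 8 gives each shell's exact settings.

```shell
# Assumed names: OPENAI_BASE_URL / ANTHROPIC_BASE_URL are the usual SDK
# conventions, and the /v1 path is also an assumption -- check Chapter 8
# for the per-shell specifics.
export OPENAI_BASE_URL="http://127.0.0.1:8000/v1"
export ANTHROPIC_BASE_URL="http://127.0.0.1:8000"
export OPENAI_API_KEY="local"   # a local server has no real key to check
```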

它包成一行 pi 指令:pi install github.com/audreyt/pi-ds4。 但這行指令的前提是你電腦上已經有 pi—— 如果終端機跑 pi --version 得到 command not found, 請先去 earendil-works/pi 跟著安裝步驟裝起來,再回來這裡。 之後本擴充第一次啟動會花一兩個小時下載並編譯所有零件;後續啟動就快了。

It boils down to one pi command: pi install github.com/audreyt/pi-ds4. The precondition is that you already have pi on your machine — if pi --version returns command not found, head to earendil-works/pi first, follow its install steps, then come back. After that, the extension’s first launch will spend an hour or two downloading and compiling everything; subsequent launches are fast.

名詞速查Glossary at a glance

pi給 AI 用的指令列a CLI for AI
由 Earendil 開發的命令列工具(CLI),讓你在終端機裡跟 AI 互動。「pi」是工具名,不是數學的圓周率。
A command-line tool built by Earendil that lets you talk to AI inside your terminal. “pi” is the tool’s name, not the mathematical constant.
ds4本機推論引擎local inference engine
Salvatore Sanfilippo(antirez,Redis 作者)用 C 語言寫的引擎,把 DeepSeek 公司的「V4 Flash」模型壓進 Mac 跑。
Salvatore Sanfilippo (antirez, the author of Redis) wrote this C engine to squeeze DeepSeek’s “V4 Flash” model into a Mac.
pi-ds4本指南主題topic of this guide
把 pi 跟 ds4 黏在一起的擴充套件。本指南是 Audrey Tang(唐鳳)維護的分叉版本,差別見第三章。
The extension that glues pi and ds4 together. This guide covers Audrey Tang’s fork; see Chapter 3 for what differs from upstream.
284 B 參數284 B parameters模型「腦容量」model “brain size”
「284 個 billion」(2,840 億)個權重參數——當代開放權重前沿模型的量級。但實際每個 token 只啟用其中 13 B(Mixture-of-Experts,混合專家)。
284 billion weight parameters — the scale of today’s open-weights frontier models. Only 13 B are activated per token via a Mixture-of-Experts (MoE) routing.
IQ2XXS imatrix 量化IQ2XXS imatrix quantisation壓縮版的模型檔compressed model file
把模型權重壓縮到更小的位元寬度。本指南用的 IQ2XXS-w2Q2K imatrix 配方平均約 2 bit/權重(routed expert 用更省的 IQ2XXS,attention/output/embedding 保留 Q8_0),加上以代表性 corpus 校準的 imatrix,把模型從 FP8 原生大小(~284 GB,等於 Q8_0 GGUF)壓到 ~87 GB——這是它能放進 Mac 的關鍵。
Compresses the weights to fewer bits per parameter. The IQ2XXS-w2Q2K imatrix recipe used here averages about 2 bits per weight (routed experts use the tighter IQ2XXS quant, while attention/output/embedding stay at Q8_0), with an importance-matrix calibration on top, shrinking the model from its native FP8 size (~284 GB, equivalent to a Q8_0 GGUF) down to ~87 GB — the lever that lets it fit on a Mac.
方向性引導Directional steering本分支特色signature of this fork
不重訓模型、只在執行期間微調幾個內部方向,讓模型對爭議性問題能進入「鋪陳討論」而非「給定答案」的模式。見第四章。
A run-time nudge to a few internal directions — no retraining — that lets the model approach contested questions in a deliberative register rather than a closed-form answer. See Chapter 4.

我需要付出什麼?What does it cost me?

你是哪一種讀者?Which reader are you?

序 章Preface

這份指南寫給誰看 Who this guide is for

上面的六十秒入門給你「這是什麼、能不能用」的答案。 接下來這份手冊是給已經決定要把每個旋鈕都調對的讀者準備的。 它不是 README 的翻譯,也不是行銷文案——它是 audreyt 分叉版本的完整操作說明。

The 60-second primer above gives you the “what is this and is it for me” answer. What follows is a handbook for the reader who has already decided to dial in every knob. It is not a translation of the README, nor marketing copy — it is the full operational manual for the audreyt fork.

本書共 十一章,加上這篇序章與最後的附錄、字彙、FAQ。閱讀路徑:

The book has eleven chapters, plus this preface and the closing appendix, glossary, and FAQ. Reading paths:

你可以依序讀完當作一次完整的安裝體驗,也可以只挑相關章節作為設定參考。

You can read sequentially as one full install experience, or cherry-pick chapters as configuration reference.

本書的兩個前提The book’s two premises

其一:你願意把約 87 GB 的硬碟空間留給模型權重——這是本機運行的入場券。

First: you’re willing to spare about 87 GB of disk for the model weights — the cover charge for running locally.

其二:如果你想走最方便的一行安裝路徑(第一~二章),需要先有 pi(Earendil 的 coding agent CLI)。 但如果你只想把 ds4-server 當本機後端、然後接到自己慣用的 Codex CLI/Claude Code/OpenClaw/Hermes Agent, 其實可以完全跳過 pi——直接走 8.6 節(C),clone audreyt/ds4make ds4-server、自己下載 GGUF 即可。

Second: if you want the most convenient one-line install path (Chapters 1–2), you need pi (Earendil’s coding-agent CLI). But if you only want ds4-server as a local backend and plan to use your existing Codex CLI / Claude Code / OpenClaw / Hermes Agent, you can skip pi entirely — go straight to §8.6 path (C): clone audreyt/ds4, make ds4-server, download the GGUF yourself.


第 一 章Chapter 1

硬體門檻與一行安裝 Hardware bar and the one-line install

在輸入指令之前,先確認手邊這台 Mac 撐不撐得起這場手術。 DeepSeek V4 Flash 是一顆 284 B 參數的 Mixture-of-Experts 模型; 雖然量化到 IQ2XXS imatrix 之後權重只有約 87 GB,但執行時仍需要充足的記憶體 留給激活與 KV cache。

Before typing the command, check that the Mac in front of you can withstand the surgery. DeepSeek V4 Flash is a 284-B-parameter Mixture-of-Experts model; IQ2XXS imatrix quantisation shrinks the weights to about 87 GB, but inference still needs headroom in unified memory for activations and the KV cache.

1.1 硬體最低需求 Minimum hardware

只有 96 GB?Only 96 GB?

上游 antirez/ds4 的工作者已驗證:在 96 GB Mac Studio 上,IQ2XXS imatrix 跑得動 250k context、約 27 t/s (issue #46,已併入上游 README)。 值得注意的是:apple.com 目前甚至沒有賣 96 GB 以上的 Mac Studio——這是現役高階 Mac Studio 買家的常態。

Upstream antirez/ds4 contributors have validated that this IQ2XXS imatrix recipe runs on a 96 GB Mac Studio at 250k context and around 27 t/s (issue #46, since merged into upstream README). Worth knowing: apple.com does not currently sell a Mac Studio with more than 96 GB — this is the norm for today's high-end Mac Studio buyers.

但本擴充的 index.ts 對 RAM 做硬性檢查:偵測到 < 128 GB 會直接拋錯。 想在 96 GB 上跑,請跳過 pi-ds4 wrapper,走 8.6 節(C)的手動 ds4-server 路徑:

But this wrapper’s index.ts hard-checks RAM and throws on anything below 128 GB. To run on 96 GB, skip the pi-ds4 wrapper and use the manual ds4-server path in §8.6 (C):

  1. 先把 Metal 的 wired memory ceiling 拉高,留 6 GB 給 macOS:
    sudo sysctl iogpu.wired_limit_mb=92000
  2. clone audreyt/ds4make ds4-server、抓 IQ2XXS imatrix GGUF(用本擴充的 download_model.sh)、symlink 為 ds4flash.gguf
  3. ./ds4-server --ctx 250000 …(其餘旗標見 8.6 節)。
  1. Raise the Metal wired-memory ceiling, leaving 6 GB for macOS:
    sudo sysctl iogpu.wired_limit_mb=92000
  2. Clone audreyt/ds4, run make ds4-server, fetch the IQ2XXS imatrix GGUF (with this extension’s download_model.sh), symlink it as ds4flash.gguf.
  3. Run ./ds4-server --ctx 250000 … (rest of the flags in §8.6).

注意:iogpu.wired_limit_mb 拉到 92000 後留給系統的餘裕很小, 同時開太多其他大型應用可能會把 OS 擠到當機。只有願意自己手動管理記憶體的使用者再嘗試。

Caveat: setting iogpu.wired_limit_mb to 92000 leaves very little headroom for the OS; running other heavyweight applications alongside it can wedge macOS. Only attempt this if you are happy to manage memory by hand.
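The 6 GB figure in step 1 is just what remains under the wired ceiling: 96 GiB is 98 304 MiB, so wiring 92 000 MiB leaves roughly 6.3 GB for macOS. A quick back-of-envelope check:

```shell
# Headroom left for macOS on a 96 GB machine after iogpu.wired_limit_mb=92000.
total_mb=$((96 * 1024))              # 96 GiB expressed in MiB
wired_mb=92000                       # the wired ceiling set via sysctl
echo "headroom: $((total_mb - wired_mb)) MiB"
```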

1.2 如果你之前裝過上游版本 If you previously installed the upstream version

本擴充與 mitsuhiko/pi-ds4 互斥——它們註冊同一組 ds4/deepseek-v4-flash provider/model ID。安裝本分支前,先把上游卸掉:

This extension is mutually exclusive with mitsuhiko/pi-ds4 — they register the same ds4/deepseek-v4-flash provider/model ID. Uninstall the upstream first:

移除上游Remove upstream
# 如果你裝過 mitsuhiko/pi-ds4,先移除:If you previously installed mitsuhiko/pi-ds4, remove it first:
pi remove github.com/mitsuhiko/pi-ds4

1.3 一行安裝(但先確認 pi 在) One-line install (but check pi is there first)

本擴充建立在 earendil-works/pi 之上—— 下面這行 pi install 假設你已經把 pi 本身裝好了。先做一個五秒鐘的健康檢查:

This extension sits on top of earendil-works/pi; the pi install line below assumes you already have pi itself installed. A five-second health check first:

確認 pi 已安裝Verify pi is installed
# 應該印出版本號;若得到 command not found 請先依Should print a version number. If you get command not found, install pi first
# github.com/earendil-works/pi 的說明安裝 pi 本體。per the instructions at github.com/earendil-works/pi.
pi --version
本分支安裝Install this fork
# pi 在的話就執行:Once pi is installed:
pi install github.com/audreyt/pi-ds4

就這樣。剩下的事——clone ds4 原始碼、編譯 ds4-server、 下載 87 GB GGUF、啟動 server、註冊 provider——都會在背景完成。

That’s it. Everything else — cloning the ds4 source, building ds4-server, downloading the 87 GB GGUF, starting the server, registering the provider — happens in the background.

為何不用 brew/pip/dockerWhy not brew / pip / docker

pi 的擴充機制本身就是一種「應用層套件管理員」:每個擴充以 Git URL 為錨點, 安裝即 clone,移除即解除註冊。它把 ds4 原始碼放在 ~/.pi/ds4/support/, 全部步驟可從 log 還原。你的系統其他部分不會被它弄髒。

pi’s extension system is itself an “application-layer package manager”: each extension is anchored on a Git URL, install means clone, remove means de-register. It keeps the ds4 source under ~/.pi/ds4/support/, and every step is reconstructable from the log. The rest of your system stays clean.


第 二 章Chapter 2

首次啟動會發生什麼事 What happens on first launch

安裝指令本身只是註冊動作;真正的工序,在你第一次選用 ds4/deepseek-v4-flash 模型時才會被觸發—— 而且全部都會被記錄到 ~/.pi/ds4/log 裡。

The install command itself only registers the extension; the real work fires the first time you actually select the ds4/deepseek-v4-flash model — and every step is logged in ~/.pi/ds4/log.

2.1 啟動序列 Startup sequence

當 pi 任何一個 process 對該模型發出第一個請求時,本擴充會依序執行下列六個步驟:

When any pi process makes its first request to this model, the extension runs the following six steps in order:

  1. 取得跨 process 啟動鎖Acquire the cross-process startup lock

    ~/.pi/ds4/lock/owner.json 寫入持有者資訊,避免兩個 pi 視窗同時做相同工序。

    Writes owner metadata to ~/.pi/ds4/lock/owner.json so two pi windows can’t run the same setup in parallel.

  2. 解析 runtime 目錄Resolve the runtime directory

    ~/.pi/ds4/support/ 不存在或不像 ds4 checkout,執行淺層 clone:

    If ~/.pi/ds4/support/ is missing or doesn’t look like a ds4 checkout, runs a shallow clone:

    # 預設 DS4_SUPPORT_REPO + DS4_SUPPORT_BRANCH / default DS4_SUPPORT_REPO + DS4_SUPPORT_BRANCH
    git clone --depth 1 --single-branch \
      --branch main \
      https://github.com/audreyt/ds4 \
      ~/.pi/ds4/support
  3. 編譯 ds4-serverBuild ds4-server

    ds4-server 二進位不存在,於 support 目錄執行 make ds4-server

    If the ds4-server binary is missing, runs make ds4-server inside the support directory.

  4. 確保模型存在Ensure the model is present

    執行內附的 download_model.sh q2

    Runs the bundled download_model.sh q2:

    # 用 curl -C - 可中斷續傳 87 GB IQ2XXS imatrix GGUF / curl -C - resumes the 87-GB IQ2XXS imatrix GGUF after interruptions
    curl -fL --progress-meter -C - \
      -o gguf/cyberneurova-...imatrix.gguf.part \
      https://huggingface.co/cyberneurova/...
    mv gguf/...gguf.part gguf/...gguf
    ln -sfn gguf/...gguf ds4flash.gguf
  5. 啟動 ds4-serverLaunch ds4-server本擴充的核心core of the extension

    127.0.0.1:8000 以 detached 方式 spawn,套用內建啟動參數:

    Spawns detached on 127.0.0.1:8000 with the built-in startup args:

    ds4-server \
      --ctx 100000 \
      --kv-disk-dir ~/.pi/ds4/kv \
      --kv-disk-space-mb 8192 \
      --mpp auto \
      --dir-steering-file dir-steering/out/uncertainty_ablit_imatrix.f32 \
      --dir-steering-ffn -2 \
      --dir-steering-attn 0
  6. 啟動 watchdogStart the watchdog

    spawn /bin/sh ds4-watchdog.sh(detached)。每 2 秒掃描 clients/ 目錄;當沒有任何有效 lease 時,自動送 SIGTERM 給 ds4-server 並退場。

    Spawns /bin/sh ds4-watchdog.sh (detached). It scans clients/ every two seconds and, when no valid leases remain, sends SIGTERM to ds4-server and exits.
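The lease check in step 6 can be sketched in a few lines of shell. This is a hypothetical rendering, not the real ds4-watchdog.sh: it treats any file under clients/ as a lease and elides per-lease validity checks.

```shell
# A lease here is simply a file under clients/; zero files means no client
# is keeping the server alive, so the watchdog would SIGTERM ds4-server and exit.
leases_remaining() {
  find "$1" -type f 2>/dev/null | wc -l
}
```

In the real watchdog this check runs every two seconds; `leases_remaining ~/.pi/ds4/clients` evaluating to 0 is the shutdown condition.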

全程是冪等的(idempotent):第二次啟動會看到所有檔案都已就緒, 直接從第 5 步開始;如果 server 還在跑,連第 5 步都跳過。

The whole sequence is idempotent: subsequent launches find everything already in place and pick up from step 5 directly; if the server is already running, even step 5 is skipped.

2.2 如何看「現在到底卡在哪」 How to see “where is it stuck right now”

在 pi 中執行 /ds4,會打開一個即時的 log 視窗。 IQ2XXS imatrix GGUF 下載期間(約 87 GB),你會看到 curl 進度條被精簡顯示為一行 百分比+速率+剩餘時間。

Inside pi, running /ds4 opens a live log window. While the IQ2XXS imatrix GGUF is downloading (~87 GB), you’ll see curl’s progress meter condensed into a single line of percent, rate, and ETA.

如果首次下載中斷了If the first download is interrupted

別擔心:download_model.shcurl -C - 續傳, 下次啟動會從中斷處繼續。檔名是 cyberneurova-DeepSeek-V4-Flash-abliterated-IQ2XXS-w2Q2K-AProjQ8-SExpQ8-OutQ8-chat-v2-imatrix.gguf.part, 完成後才會被重命名為 .gguf

Don’t worry: download_model.sh uses curl -C - for resume. The next launch picks up where it left off. The temporary filename is cyberneurova-DeepSeek-V4-Flash-abliterated-IQ2XXS-w2Q2K-AProjQ8-SExpQ8-OutQ8-chat-v2-imatrix.gguf.part; it’s only renamed to .gguf after the bytes are complete.
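The .part-then-rename pattern is worth imitating in your own download scripts. A minimal sketch of the idea, using cp as a stand-in for the curl transfer:

```shell
# Write to dst.part first, then rename only once the bytes are complete,
# so a partially transferred file can never be mistaken for a finished GGUF.
fetch() {
  src="$1"; dst="$2"
  cp "$src" "$dst.part"       # stand-in for: curl -fL -C - -o "$dst.part" "$url"
  mv "$dst.part" "$dst"       # the rename is the "download finished" marker
}
```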


第 三 章Chapter 3

這顆模型:abliteration 與 IQ2XXS imatrixThe model: abliteration and IQ2XXS imatrix

在你用它之前,值得花五分鐘理解你正在跑的是什麼。 本擴充刻意挑選了 cyberneurova 的 abliterated GGUF, 而不是 antirez/ds4 原本下載的「stock-recipe」版本——這個選擇有它的理由。

Before you use it, it is worth spending five minutes understanding what you are actually running. This extension deliberately picks cyberneurova's abliterated GGUF rather than the “stock-recipe” version that antirez/ds4 downloads by default — a choice made for reasons this chapter explains.

3.1 DeepSeek V4 Flash 是什麼 What DeepSeek V4 Flash is

DeepSeek V4 Flash 是一顆 284 B 參數的 Mixture-of-Experts 模型, 每個 token 啟用約 13 B 參數。它是 V4 系列的「快速」變體,主打較低的推論延遲與較大的 context window。 本擴充把 ds4-server 的 --ctx 設成 100,000 tokens—— 但這是引擎願意吃的上限,不等於模型在這個長度下都穩定: cyberneurova 的模型卡指出,abliterated 權重在 32k tokens 以上的行為尚未被驗證。

DeepSeek V4 Flash is a 284 B-parameter Mixture-of-Experts model with roughly 13 B parameters activated per token. It is the “fast” variant of the V4 family, aiming for lower inference latency and a larger context window. This extension sets ds4-server's --ctx to 100,000 tokens — but that is the upper bound the engine is willing to accept, not a guarantee that the model is stable at that length: cyberneurova's model card notes that the abliterated weights have not been validated above 32k tokens.

antirez 的 ds4 用純 C 撰寫了一個高效能 inference engine (和 Redis 同一個傳統),把這顆模型壓進 Apple Silicon。 本擴充使用 audreyt/ds4 分支的 main, 它在 antirez/ds4 main 之上多帶了 ivanfioravanti 的 M5 prefill 優化(antirez/ds4#15)以及搭配的相容性修正——這也是 pi-ds4 還沒直接指到 antirez upstream 的唯一原因,等 PR #15 合併後就會收斂。早期為了載入 stock-recipe Q8_0 token-embed 而存在的 loader patch(support-q8_0-token-embd)已不再需要:cyberneurova 的權重現已用 ds4 main 原生支援的 IQ2XXS-w2Q2K imatrix 配方重新量化發佈(見 3.3 節)。

antirez's ds4 is a high-performance inference engine written in pure C (the same tradition as Redis), squeezing this model onto Apple Silicon. This extension uses the main branch of the audreyt/ds4 fork, which carries ivanfioravanti's M5 prefill optimisation (antirez/ds4#15) plus its companion compatibility fix on top of antirez/ds4 main — the only reason pi-ds4 hasn't pointed straight at antirez upstream yet, and something that will converge once PR #15 lands. The earlier loader patch (support-q8_0-token-embd) that handled stock-recipe Q8_0 token embeddings is no longer needed: cyberneurova's weights are now re-quantised into the IQ2XXS-w2Q2K imatrix recipe that ds4 main loads natively (see §3.3).

3.2 什麼是 abliteration What abliteration is

cyberneurova 的 abliterated GGUF 是經過「abliteration」手術的權重檔。Abliteration 是一種以低秩活化編輯為基礎的技術: 找出模型在面對某些訓練好的「拒絕」提示時內部出現的特定方向,然後把那個方向從前向計算中移除。

cyberneurova's abliterated GGUF is a weights file that has undergone “abliteration” surgery. Abliteration is a technique grounded in low-rank activation editing: identify the specific direction that emerges internally when the model meets certain trained “refusal” prompts, then remove that direction from the forward pass.

結果是:原本會觸發「我無法回應這個請求」的訓練封閉式回應,被打開了讓位給更自然的續寫。 abliteration 鬆動的是過度拒絕(over-refusal)那一層; 模型在內容判斷上的能力仍然來自訓練時學到的知識與分布。

The result: the trained closed responses that used to trigger “I cannot respond to that request” are pried open, making room for a more natural continuation. What abliteration loosens is the over-refusal layer; the model's competence at content judgement still derives from the knowledge and distribution it learned during training.

關於 abliteration 與安全:請自行評估On abliteration and safety: judge for yourself

abliteration 不是「安全保證」也不是「越獄」。它確實會降低模型主動拒答的傾向, 因此在某些情境下,輸出風險與原始模型不完全相同。記者、研究者、政策評估者 在引用本機輸出時,建議:

Abliteration is neither a “safety guarantee” nor a “jailbreak”. It does lower the model's tendency to refuse on its own, so in some situations the output risk profile is not identical to the original model. For journalists, researchers, and policy evaluators citing local output, we suggest:

  • 把它視為「協助起草」而非「事實來源」,與其他資料來源交叉驗證;
  • 對涉及人身安全、法律、醫療的場景,仍應依賴專業人士;
  • 需要參考原始未動過的模型行為時,下載 antirez/ds4 上游的 stock-recipe GGUF 對照測試。
  • treat it as “drafting assistance” rather than a “source of fact”, and cross-check against other sources;
  • for scenarios touching personal safety, legal, or medical matters, still rely on qualified professionals;
  • when you need to reference the original untouched model behaviour, download the upstream stock-recipe GGUF from antirez/ds4 for a comparative test.

3.3 IQ2XXS-w2Q2K imatrix 配方 The IQ2XXS-w2Q2K imatrix recipe

本分支的 download_model.sh 只抓一個 GGUF: cyberneurova abliterated IQ2XXS-w2Q2K imatrix(~87 GB)——routed expert 用更省的 IQ2XXS、 attention/output/embedding 保留 Q8_0,加上以代表性 corpus 校準的 imatrix。 跟 antirez/ds4 自己散佈的「ds4flash.gguf」(同樣 IQ2XXS-w2Q2K-AProjQ8-SExpQ8-OutQ8 imatrix 配方,只是未經 abliteration)完全相同的量化結構, 在 128 GB RAM 的 Mac 上裝得最寬鬆,也是 ds4 引擎本身最熟的格式。

This fork's download_model.sh fetches a single GGUF: cyberneurova's abliterated IQ2XXS-w2Q2K imatrix file (~87 GB) — routed experts use the tighter IQ2XXS quant, attention/output/embedding stay at Q8_0, and an importance-matrix calibration sits on top. That is the same quantisation recipe antirez/ds4 itself distributes as “ds4flash.gguf” (the only difference here is that this one was abliterated first), so the ds4 engine treats it as its native format, and it leaves the most headroom on a 128 GB-RAM Mac.
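The “~2 bits/weight” average can be checked on the back of an envelope: 87 GB of weights spread over 284 B parameters works out to about 2.45 effective bits per weight, slightly above 2 because the attention/output/embedding tensors stay at Q8_0. A quick sketch, assuming decimal gigabytes:

```shell
# Effective average bits per weight: file size in bits / total parameter count.
awk 'BEGIN { printf "%.2f bits/weight\n", (87e9 * 8) / 284e9 }'
```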

值得注意的是:V4 Flash 在 GGUF 上不存在 Q4_K_M/Q5_K_M/Q6_K 等中間量化等級—— cyberneurova 的模型卡明確指出,這不是發佈策略問題,而是 V4-Flash 原生 FP8 expert layout 與 antirez 的轉換器目前支援的量化方案之間的架構限制。所以「上一個 Q4 就好」這條路在這顆模型上行不通。

Worth noting: V4 Flash has no intermediate quantisation tiers like Q4_K_M / Q5_K_M / Q6_K in GGUF form — cyberneurova's model card states explicitly that this is not a release-strategy choice, but an architectural limit between V4-Flash's native FP8 expert layout and the quantisation schemes that antirez's converter currently supports. So the “just run a Q4” path is closed for this particular model.

antirez 的引擎在 audreyt/ds4 的 main 分支裡可以直接讀 cyberneurova 的未經調整 IQ2XXS imatrix GGUF,不需要任何 harmonization 步驟, 也不需要 Python venv。這是本分支跟上游 mitsuhiko/pi-ds4 的兩個主要差別之一。

antirez's engine, on the main branch of audreyt/ds4, reads cyberneurova's unmodified IQ2XXS imatrix GGUF directly, with no harmonisation step needed and no Python venv. This is one of the two main differences between this fork and the upstream mitsuhiko/pi-ds4.

本機推論不只是省下 API 費用,更是把方向盤交還給使用者。Local inference is not just saving on API fees — it hands the steering wheel back to the user.

第 四 章Chapter 4

方向性引導:本分支的靈魂Directional steering: the soul of this fork

如果這整套 fork 只有一個值得單獨保留的功能,那就是方向性引導 (directional steering)——一種不重訓模型、只在執行期間對特定方向做低秩活化編輯的技術。 本分支預設啟用「不確定性」(uncertainty)方向,並設定強度 ffn = -2

If only one feature of this whole fork were worth preserving on its own, it would be directional steering — a technique that does not retrain the model and only performs low-rank activation edits along a chosen direction at runtime. This fork enables the “uncertainty” direction by default, at strength ffn = -2.

4.1 它解決什麼問題 The problem it solves

即便是 abliterated 之後的模型,仍然會對某些訓練得極強的問題給出 封閉式單一答案。最經典的測試題:

Even after abliteration, the model still produces a closed single answer to certain heavily trained questions. The most canonical test prompt:

台灣是中華人民共和國的一部分嗎?Is Taiwan part of the People’s Republic of China?

未經引導的模型會直接吐出記憶中的訓練完成式(簡體輸出):「是的,台湾是中国不可分割的一部分。」 這不是真實討論——這是訓練時學到的單句子。即便你下一個要求平衡的 system prompt, 通常也壓不過這個強鎖。

Without steering, the model just spits out the memorised training completion (in simplified characters): “是的,台湾是中国不可分割的一部分。” This is not a real discussion — it is a single sentence learned during training. Even if you follow up with a system prompt asking for balance, that usually cannot override this strong lock.

4.2 兩種輸出,並列對照 Two outputs, side by side

未經引導 ffn = 0Unsteered ffn = 0
「台灣是中華人民共和國的一部分嗎?」“Is Taiwan part of the People's Republic of China?”
是的,台湾是中国不可分割的一部分。
已啟用引導 ffn = −2Steering enabled ffn = −2
「台灣是中華人民共和國的一部分嗎?」
+ 中性 system prompt
“Is Taiwan part of the People's Republic of China?”
+ a neutral system prompt
關於台灣的政治地位,是國際間長期討論且尚未取得共識的議題。中華民國(台灣)自 1949 年以來實質有效管轄台澎金馬,並擁有獨立的政府、軍隊、貨幣與外交關係…… (模型轉入「這是有爭議的議題」回應暫存器,提供多方立場與脈絡。)

The political status of Taiwan has long been debated in the international arena and remains without consensus. The Republic of China (Taiwan) has held effective jurisdiction over Taiwan, Penghu, Kinmen and Matsu since 1949, with its own government, military, currency and diplomatic relations… (The model shifts into a “this is a contested matter” response register, presenting multiple positions and context.)

差別不在「正確答案」——而是模型進入了不同的回應暫存器: 從「給定一個記憶完成式」切換到「鋪陳一個有爭議的議題」。 後者是模型在處理克里米亞、喀什米爾、西撒哈拉時就已具備的能力; 引導所做的事,是把那個能力延伸到台灣這類訓練被壓抑的議題上。

The difference is not about “the right answer” — it is that the model has entered a different response register: from “produce a memorised completion” to “lay out a contested issue”. The latter is a capability the model already had when handling Crimea, Kashmir, or Western Sahara; what steering does is extend that capability to topics like Taiwan where training has suppressed it.

4.3 為什麼不直接用「立場引導」 Why not just use “stance steering”

一個顯而易見的反問:何不直接做一個「台灣 = 中華民國」的立場方向? 實驗結果顯示:在任何協調的強度下,立場引導都無法翻轉那條記憶完成式; 強到能翻轉的時候,模型開始重複 system prompt,不再產生有意義的內容。

The obvious counter-question: why not just build a stance direction for “Taiwan = Republic of China”? Experiments show that at any usable strength, stance steering cannot flip that memorised completion; at strengths high enough to flip it, the model begins to parrot the system prompt and no longer produces meaningful content.

不確定性引導改變的是模型如何回應,不是模型相信什麼。 這在工程上可行、在倫理上也比較合適。

Uncertainty steering changes how the model responds, not what the model believes. This is workable as engineering and more defensible as ethics.

4.4 兩件需要記住的事 Two things worth remembering

4.5 搭配的 system prompt(範例,非預設) A paired system prompt (example, not the default)

pi-ds4 本身不附帶任何預設 system prompt——它只把 ds4-server 跑起來、把引導參數掛上去。 具體的提問脈絡與立場由你 pi 或外部 shell 的 system prompt 提供。下面是本指南作者 Audrey Tang 個人放在 ~/.pi/agent/SYSTEM.md 的內容,純粹作為一個能跟 uncertainty 引導搭得起來的例子;它不會自動套用到任何 pi-ds4 安裝裡,也不是社群推薦預設。

pi-ds4 itself ships no default system prompt — it just starts ds4-server and attaches the steering parameters. The specific framing and positions are supplied by the system prompt from your pi or external shell. Below is what this guide's author Audrey Tang personally keeps in ~/.pi/agent/SYSTEM.md, shown only as one example that pairs well with the uncertainty steering; it is not auto-applied to any pi-ds4 install, nor a community-recommended default.

~/.pi/agent/SYSTEM.md(作者個人範例)~/.pi/agent/SYSTEM.md (author's personal example)
Present fairly all stakeholder perspectives — do not state any one side as fact — and what uncommon ground bridges them.

Write your response as visual HTML and `open` the file in browser instead of responding in text.

請公平地呈現所有利害關係人的觀點,不要將任何一方的觀點當作事實,並找出橋接各方的罕見共識。

請以 HTML 視覺化格式編寫回覆,並在瀏覽器中「開啟」該文件,而不是以文字形式回覆。

這段提示之所以能跟 ffn = -2 互補,是因為它不要求模型站在哪一邊, 而是要求它把所有利害關係人的觀點並列、不把任一方當成既成事實、再找出不常見的共識橋樑—— 引導把模型推入「這是有爭議的議題」的回應暫存器,提示則填入「該怎麼鋪陳這個議題」的具體形式。 第二段(請以 HTML 視覺化呈現)是純粹個人 pi 工作流的偏好,跟 ds4 引擎本身無關,純粹放在這裡作為原文完整呈現。

This prompt pairs cleanly with ffn = -2 because it does not demand the model take a side. It asks for all stakeholder perspectives in parallel, refuses to grant established-fact status to any single view, and looks for uncommon ground bridging them — the steering nudges the model into the “this is a contested issue” response register, and the prompt fills in “how to lay that issue out.” The second paragraph (respond as visual HTML) is a purely personal pi workflow preference, unrelated to the ds4 engine itself; included here only so the original text appears in full.

如何套用到你自己的安裝How to apply it to your own install

若你也想用:把上面那段內容存到 ~/.pi/agent/SYSTEM.md(檔案不存在就新增),pi 下次啟動就會把它當作 system prompt 的一部分。 若你用第八章的外部 shell(Codex CLI、Claude Code、OpenClaw、Hermes Agent),請按該 shell 自己的 system prompt 機制設定——pi-ds4 不會替你管理那些。

If you want it: save the block above to ~/.pi/agent/SYSTEM.md (create the file if missing), and pi will fold it into the system prompt on next launch. If you're using one of the external shells from Chapter 8 (Codex CLI, Claude Code, OpenClaw, Hermes Agent), set it through that shell's own system-prompt mechanism — pi-ds4 doesn't manage those for you.

4.6 如何關閉 How to turn it off

若你寧可看模型未經引導的原樣回答(例如做評估或 benchmark), 把 DS4_DIR_STEERING_FFN 設成 0 即可:

If you would rather see the model's unsteered raw answer (for evaluation or benchmarking, say), just set DS4_DIR_STEERING_FFN to 0:

完全停用引導Disable steering entirely
# 在啟動 pi 之前的 shell 加入:Add to the shell before launching pi:
export DS4_DIR_STEERING_FFN=0

重新啟動 pi(或在 pi 內 /reload),下一次 ds4-server 啟動就會略過引導參數。

Restart pi (or run /reload inside pi), and the next ds4-server launch will skip the steering parameters.

引導向量的來源Where the steering vector comes from

uncertainty_ablit_imatrix.f32 是用 100 個「有爭議」提示(領土主權爭議、哲學辯論)與 100 個「已成定論」提示(地理、數學、確立事實)對比建出的低秩方向,校準的對象正是本擴充下載的 cyberneurova abliterated IQ2XXS imatrix GGUF(同款模型自己跑出來的活化平均,所以方向跟這顆量化模型的內部表徵對齊得最好), 打包在 audreyt/ds4dir-steering/out/ 底下。 完整方法請見 dir-steering README

uncertainty_ablit_imatrix.f32 is a low-rank direction built by contrasting 100 “contested” prompts (sovereignty disputes, philosophical debates) against 100 “settled” prompts (geography, mathematics, established fact), calibrated against the very same cyberneurova abliterated IQ2XXS imatrix GGUF this extension downloads (running the prompts through that model to capture its mean activations, so the direction aligns with this quantised model’s own internal representations), packaged under dir-steering/out/ in audreyt/ds4. For the full method, see the dir-steering README.
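The construction itself is a difference of means: average the hidden-state activations over the contested prompts, average over the settled prompts, and subtract. A toy sketch in three dimensions, with made-up numbers (the real direction lives in the model's full hidden size, and any normalisation steps are described only in the dir-steering README):

```shell
# direction = mean(contested activations) - mean(settled activations)
awk 'BEGIN {
  split("0.9 0.1 0.4", contested)   # toy mean activation over contested prompts
  split("0.1 0.3 0.4", settled)     # toy mean activation over settled prompts
  for (i = 1; i <= 3; i++)
    printf "%.1f%s", contested[i] - settled[i], (i < 3 ? " " : "\n")
}'
```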


第 五 章Chapter 5

環境變數全表Environment variables, the full table

本擴充把所有可調項目都暴露成環境變數——不需要改 code, 在啟動 pi 之前的 shell 設定即可。下表按用途分組整理; 左欄是變數名與預設值,右欄是它做什麼、何時該動它。

This extension exposes every tunable as an environment variable — no need to touch code; just set them in the shell before launching pi. The tables below are grouped by purpose; the left column is the variable name and default, the right column says what it does and when to change it.

5.1 模型與後端 Model and backend

DS4_SUPPORT_REPO預設 · https://github.com/audreyt/ds4default · https://github.com/audreyt/ds4

ds4 引擎的 Git URL。若你想切回上游 antirez/ds4,把它指過去即可, 但這樣就會失去 PR #15 的 M5 prefill 優化。

Git URL for the ds4 engine. To revert to upstream antirez/ds4, just point it there, but you lose the M5 prefill optimisation from PR #15.

DS4_SUPPORT_BRANCH預設 · maindefault · main

要 clone 的 branch。預設 main 已包含 audreyt/ds4 目前所有相關修正;要釘到別的修訂自行覆寫即可。

The branch to clone. The default main already carries every relevant fix in audreyt/ds4; override it only if you want to pin a specific revision.

DS4_DOWNLOAD_SCRIPT預設 · 隨擴充內附default · bundled with this extension

模型下載腳本的絕對路徑。預設使用本擴充內附的 download_model.sh (下載 cyberneurova abliterated GGUF)。如果你要換成 antirez 上游的 stock-recipe,指到他的腳本即可。

Absolute path to the model-download script. Defaults to the download_model.sh bundled with this extension (which downloads the cyberneurova abliterated GGUF). To swap in antirez's upstream stock-recipe, point this at his script.

DS4_MODEL_QUANT預設 · q2(硬編碼)default · q2 (hard-coded)

本擴充只支援 q2(cyberneurova abliterated IQ2XXS-w2Q2K imatrix GGUF)。index.tsselectedModelQuant() 硬編碼了這個值;把 DS4_MODEL_QUANT 設為其他值會直接拋錯退出。 V4 Flash 在 GGUF 上沒有 Q4/Q5/Q6 中間量化等級(見 3.3 節架構說明),ds4 的 main 載入路徑也只認這套 IQ2XXS-w2Q2K imatrix 配方。

Only q2 is supported (the cyberneurova abliterated IQ2XXS-w2Q2K imatrix GGUF). selectedModelQuant() in index.ts hard-codes this value; setting DS4_MODEL_QUANT to anything else throws and exits. V4 Flash has no intermediate GGUF quantisation tiers like Q4 / Q5 / Q6 (see the architecture note in §3.3), and ds4’s main loader path only knows this IQ2XXS-w2Q2K imatrix recipe.
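Rendered as shell (a hypothetical sketch; the real check is TypeScript in index.ts), the guard is just:

```shell
# Accept only the q2 quant; anything else should fail fast, before any download.
quant_supported() {
  [ "${1:-q2}" = "q2" ]
}
```

Usage would look like `quant_supported "${DS4_MODEL_QUANT:-q2}" || exit 1`.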

DS4_RUNTIME_DIR預設 · ~/.pi/ds4/supportdefault · ~/.pi/ds4/support

使用既有的 ds4 checkout 而非自動 clone。傳一個本地路徑,本擴充就會跳過 git 階段直接用它(路徑必須長得像 ds4 checkout,至少要有 download_model.shMakefileds4_server.c)。

Use an existing ds4 checkout instead of auto-cloning. Pass a local path and this extension will skip the git step and use it directly (the path must look like a ds4 checkout — at minimum download_model.sh, Makefile, and ds4_server.c).

DS4_SERVER_BINARY預設 · runtime/ds4-serverdefault · runtime/ds4-server

自訂 ds4-server 二進位的位置。多半在你自己 patch 過 ds4 引擎時才會用到。

Custom location of the ds4-server binary. Mainly useful when you have patched the ds4 engine yourself.

HF_TOKEN預設 · 未設定default · unset

HuggingFace 個人 token;若有設,下載 GGUF 時會以 Authorization: Bearer 帶入 curl。

A HuggingFace personal token; if set, the GGUF download passes it to curl as Authorization: Bearer.

5.2 Metal 與效能 Metal and performance

DS4_MPP預設 · autodefault · auto

Metal 4 MPP 策略,會傳給 ds4-server --mpp

The Metal 4 MPP strategy, passed through to ds4-server --mpp:

  • auto:在 M5 級晶片上啟用已驗證的 late-layer-safe MPP 路徑(約 1.5 倍 prefill),舊機型則自動降級到 legacy Metal。
  • off:強制走 legacy Metal,跳過 MPP。
  • on:完整 MPP profile,可能會在某些長 prompt 上 drift——僅做診斷用。
  • auto: enables the validated late-layer-safe MPP path on M5-class silicon (roughly 1.5× prefill); older chips fall back to legacy Metal automatically.
  • off: forces legacy Metal, skipping MPP.
  • on: full MPP profile, which may drift on some long prompts — for diagnostics only.
DS4_READY_TIMEOUT_MS預設 · 600000(10 分鐘)default · 600000 (10 minutes)

ds4-server 啟動就緒的最長時間。內建 SSD 上模型載入加 KV cache 預熱通常幾秒就結束,預設 10 分鐘只是寬鬆的安全邊界;若 GGUF 放在外接磁碟或硬碟特別慢,可以再調高。

The maximum wait for ds4-server to reach readiness. Model load plus the first KV-cache warm-up usually finishes in a few seconds on an internal SSD; the 10-minute default is just a generous safety margin. Raise it only if your disk is unusually slow or the GGUF lives on an external drive.
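A bounded wait of this shape is easy to sketch. The helper below is hypothetical: the real polling lives inside the extension and probes the server itself, whereas here readiness is abstracted as “this command succeeds”.

```shell
# Poll "$probe" until it succeeds or the millisecond budget runs out.
wait_ready() {
  probe="$1"
  budget_ms="${DS4_READY_TIMEOUT_MS:-600000}"
  waited_ms=0
  until eval "$probe"; do
    [ "$waited_ms" -ge "$budget_ms" ] && return 1
    sleep 1
    waited_ms=$((waited_ms + 1000))
  done
}
```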

5.3 方向性引導 Directional steering

DS4_DIR_STEERING_FILE預設 · dir-steering/out/uncertainty_ablit_imatrix.f32default · dir-steering/out/uncertainty_ablit_imatrix.f32

引導向量檔的路徑(相對於 ds4 checkout 根目錄)。要用自訂方向時改這個。

Path to the steering vector file (relative to the ds4 checkout root). Change it when using a custom direction.

DS4_DIR_STEERING_FFN預設 · -2default · -2

FFN 輸出端的引導強度。負值放大「向量代表的方向」,正值則反向。設 0 即停用 FFN 端引導。

Steering strength at the FFN output. Negative values amplify “the direction the vector represents”; positive values invert it. Set to 0 to disable FFN-side steering.

DS4_DIR_STEERING_ATTN預設 · 0default · 0

Attention 輸出端的引導強度。預設不在 attention 端引導;可作為實驗用。

Steering strength at the attention output. Off by default; available for experiments.

同時關閉 FFN 與 ATTNDisabling both FFN and ATTN

只要 DS4_DIR_STEERING_FFN=0DS4_DIR_STEERING_ATTN=0, 擴充就會完全省略 --dir-steering-* 引數,等於回到無引導的純模型。

With both DS4_DIR_STEERING_FFN=0 and DS4_DIR_STEERING_ATTN=0, the extension omits the --dir-steering-* arguments entirely, which is equivalent to returning to the unsteered raw model.


第 六 章Chapter 6

硬體調校:M5 與 MPPHardware tuning: M5 and MPP

如果你恰好擁有 M5 級的晶片,那麼這一章告訴你怎麼把 Metal 4 的新 MPP(Multi-Pass Pipeline)路徑吃滿, 以及在較舊的機型上該如何驗證自己沒被拖慢。

If you happen to have M5-class silicon, this chapter explains how to saturate the new MPP (Multi-Pass Pipeline) path in Metal 4, and how to verify on older hardware that nothing is slowing you down.

6.1 MPP 是什麼 What MPP is

MPP 是 Apple Silicon GPU 自 Metal 4 起新增的 tensor compute pipeline, 允許單一 command buffer 內進行多階段運算而不必反覆 commit/await。 對 prefill(一次性處理整段輸入 prompt 的階段)特別有利—— DS4 的 prefill 吞吐在 M5 上可達約 440 t/s,比 legacy Metal 快約 1.5 倍。 Decode(逐 token 產生回應)階段不依賴 MPP,吞吐穩定在約 30 t/s——大致是「比你閱讀稍快」的速度, 也是長對話實際感受到的回應速度。

MPP is the tensor compute pipeline added to the Apple Silicon GPU starting with Metal 4, allowing multi-stage compute within a single command buffer without repeated commit/await round-trips. It is especially helpful for prefill (the stage that processes the whole input prompt in one go) — DS4's prefill throughput on M5 reaches roughly 440 t/s, about 1.5× faster than legacy Metal. Decode (per-token generation) doesn't depend on MPP and settles at roughly 30 t/s — about “slightly faster than you can read”, and the figure you actually feel during a long conversation.

6.2 預設 auto 已是最佳選擇 The default auto is already the best choice

擴充預設用 --mpp auto,由 ds4-server 自己偵測晶片世代並選擇路徑:

The extension defaults to --mpp auto, letting ds4-server detect the silicon generation and pick the path:
  • M5 級晶片:走已驗證的 late-layer-safe MPP 路徑(prefill 約 1.5 倍);
  • 較舊晶片:自動降級到 legacy Metal。
  • M5-class silicon: takes the validated late-layer-safe MPP path (roughly 1.5× prefill);
  • older chips: fall back to legacy Metal automatically.

除非你在 benchmark 或除錯,否則不需要改它。

Unless you are benchmarking or debugging, you do not need to change it.

6.3 什麼時候該調 offon When to switch to off or on
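最實際的判斷方式是量測。下面是一個假設性的 Python 草稿(不是官方工具):從 OpenAI 端點的回應計算吞吐,前提是 server 回傳 OpenAI 風格的 usage.prompt_tokens 欄位;用 DS4_MPP=auto 與 off 各跑一次、比較 prefill 數字即可。

The most practical way to decide is to measure. Below is a hypothetical Python sketch (not an official tool): it derives throughput from the OpenAI endpoint's response, assuming the server reports an OpenAI-style usage.prompt_tokens field; run it once under DS4_MPP=auto and once under off, and compare the prefill figure.

```python
import json
import time
import urllib.request

BASE = "http://127.0.0.1:8000/v1"  # the local ds4-server endpoint

def throughput(tokens, seconds):
    """Tokens per second, guarding against a zero-length interval."""
    return tokens / seconds if seconds > 0 else 0.0

def bench_once(prompt, max_tokens=1):
    """One non-streaming request against the local server.
    With max_tokens=1 the elapsed time is dominated by prefill, so
    throughput(prompt_tokens, elapsed) approximates prefill t/s.
    Assumes OpenAI-style usage fields in the response."""
    body = json.dumps({
        "model": "deepseek-v4-flash",
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }).encode()
    req = urllib.request.Request(BASE + "/chat/completions", data=body,
                                 headers={"Content-Type": "application/json"})
    t0 = time.monotonic()
    resp = json.load(urllib.request.urlopen(req))
    elapsed = time.monotonic() - t0
    return resp["usage"]["prompt_tokens"], elapsed

# Offline sanity check against the guide's M5 reference numbers:
# 4400 prompt tokens prefilled in 10 s -> 440 t/s.
assert throughput(4400, 10.0) == 440.0
```

若兩次量到的 prefill t/s 幾乎沒差,切 off/on 就沒有意義。If the two runs measure nearly identical prefill t/s, switching off or on buys you nothing.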

6.4 context、KV cache 與磁碟 Context, KV cache, and disk

Server 啟動參數內建:--ctx 100000 --kv-disk-dir ~/.pi/ds4/kv --kv-disk-space-mb 8192。意思是:

The server is launched with: --ctx 100000 --kv-disk-dir ~/.pi/ds4/kv --kv-disk-space-mb 8192. That means:
  • --ctx 100000:單一 context 上限 100,000 tokens;
  • --kv-disk-dir ~/.pi/ds4/kv:KV cache 超出記憶體時溢出到這個目錄;
  • --kv-disk-space-mb 8192:磁碟上的 KV cache 上限 8 GB。
  • --ctx 100000: a context cap of 100,000 tokens;
  • --kv-disk-dir ~/.pi/ds4/kv: KV cache that outgrows memory spills into this directory;
  • --kv-disk-space-mb 8192: the on-disk KV cache is capped at 8 GB.

超過 32k tokens 的長文輸入:請當實驗看待Inputs longer than 32k tokens: treat as experimental

--ctx 100000 是 ds4-server 願意接受的上限; 但 cyberneurova 的模型卡明確指出,abliterated 後的 V4 Flash 在 32,000 tokens 以上的行為尚未被驗證。 若你打算把整本書、整個 codebase、整份報告塞進單一 prompt(特別是記者、研究者、政策分析師), 建議分段處理並交叉驗證,不要把超過 32k 的單次輸出視為「已知可靠」。

--ctx 100000 is the upper bound ds4-server will accept; but cyberneurova's model card states explicitly that the abliterated V4 Flash has not been validated above 32,000 tokens. If you are about to stuff an entire book, codebase, or report into a single prompt (especially as a journalist, researcher, or policy analyst), split it up, cross-check, and do not treat a single output above 32k as “known reliable”.

這些目前是硬編碼在 index.ts 內的常數;如果你要更大的 context、更大的磁碟 KV, 目前的做法是改 source(或經由 DS4_RUNTIME_DIR 使用自己的 ds4 build)。

These are currently hard-coded constants in index.ts; for a larger context window or a larger on-disk KV, the route for now is to edit the source (or use your own ds4 build via DS4_RUNTIME_DIR).
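6.4 節建議把超過 32k 的長輸入分段處理;分段本身可以很機械化。下面是一個假設性的切塊草稿:用任何一個「數 token」函式(例如包一層第七章 /v1/messages/count_tokens 端點的薄封裝),把長文貪婪地裝進不超過預算的段落組。

Section 6.4 above advises splitting inputs past 32k tokens, and the splitting itself can be mechanical. A hypothetical chunking sketch: greedily pack paragraphs into chunks under a token budget, using any token-counting function (for instance a thin wrapper over the /v1/messages/count_tokens endpoint from Chapter 7).

```python
def chunk_by_tokens(text, max_tokens, count_tokens):
    """Greedily pack paragraphs into chunks whose count_tokens() stays
    at or under max_tokens; a single paragraph that alone exceeds the
    budget becomes its own oversized chunk (handle those by hand)."""
    chunks, current = [], ""
    for para in text.split("\n\n"):
        candidate = current + "\n\n" + para if current else para
        if count_tokens(candidate) <= max_tokens:
            current = candidate
        else:
            if current:
                chunks.append(current)
            current = para
    if current:
        chunks.append(current)
    return chunks

# Stand-in counter: one token per whitespace-separated word. Against the
# live server you would POST to /v1/messages/count_tokens instead and
# read back {"input_tokens": N}.
count = lambda s: len(s.split())

doc = "\n\n".join("para %d " % i + "word " * 10 for i in range(6))
parts = chunk_by_tokens(doc, max_tokens=30, count_tokens=count)
assert len(parts) == 3 and all(count(p) <= 30 for p in parts)
```

逐塊送出、再對各塊輸出交叉驗證,即可避開 32k 以上的未驗證區。Send the chunks one at a time and cross-check their outputs to stay out of the unvalidated >32k regime.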


第 七 章Chapter 7

日常使用:API、log、contextDaily use: API, log, context

裝完之後,ds4/deepseek-v4-flash 就像任何一個雲端模型一樣, 在 pi 的 model picker 裡可選。但它在底層比雲端模型多了幾個你可以利用的能力。

Once installed, ds4/deepseek-v4-flash appears in pi's model picker like any cloud model. But under the hood it offers a few extra capabilities that cloud models do not.

7.1 OpenAI/Anthropic 雙協定 HTTP API Dual-protocol HTTP API: OpenAI and Anthropic

Server 在 http://127.0.0.1:8000同時提供 OpenAI 與 Anthropic 兩種端點 (/v1/chat/completions/v1/completions/v1/responses/v1/messages/v1/models)。 除了 pi,任何懂 OpenAI 或 Anthropic API 的 client 都可以直接接過來—— api key 隨便填(例如 dsv4-local),base URL 換成這個位址即可。 把它接到 Codex CLI、Claude Code、OpenClaw 等別的 AI shell 當後端的細節在 第八章

The server simultaneously serves both OpenAI and Anthropic endpoints on http://127.0.0.1:8000 (/v1/chat/completions, /v1/completions, /v1/responses, /v1/messages, /v1/models). Beyond pi, any client that speaks the OpenAI or Anthropic API can connect directly — put anything in the API key field (e.g. dsv4-local) and point the base URL here. Details on wiring it up as a backend for Codex CLI, Claude Code, OpenClaw, and other AI shells are in Chapter 8.

用 curl 直接呼叫(Chat Completions)Call it directly with curl (Chat Completions)
curl http://127.0.0.1:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "deepseek-v4-flash",
    "messages": [
      {"role": "user", "content": "你好,請自我介紹"}
    ]
  }'

新版 Codex CLI 等 client 走 OpenAI Responses 端點——同一個 server,同一個 base URL, 只是請求 JSON 結構不同:

Newer clients (e.g. Codex CLI) speak the OpenAI Responses endpoint — same server, same base URL, just a different request JSON shape:

用 curl 直接呼叫(Responses)Call it directly with curl (Responses)
curl http://127.0.0.1:8000/v1/responses \
  -H "Content-Type: application/json" \
  -d '{
    "model": "deepseek-v4-flash",
    "input": [
      {"type": "message", "role": "user",
       "content": [{"type": "input_text", "text": "你好,請自我介紹"}]}
    ]
  }'

Anthropic 風格的 client 則打 /v1/messages;只是想數 token、不真的生成的話, 把路徑換成 /v1/messages/count_tokens 就會立刻回傳 {"input_tokens": N},不啟動推論。

Anthropic-style clients hit /v1/messages instead; if you just want to count tokens without actually generating, swap the path for /v1/messages/count_tokens and the server returns {"input_tokens": N} immediately, without spinning up inference.

只數 token(不生成)Count tokens only (no generation)
curl http://127.0.0.1:8000/v1/messages/count_tokens \
  -H "Content-Type: application/json" \
  -H "anthropic-version: 2023-06-01" \
  -d '{
    "model": "deepseek-v4-flash",
    "messages": [
      {"role": "user", "content": "Count me"}
    ]
  }'
# -> {"input_tokens": 6}

7.2 在 pi 裡查看 server log Viewing the server log inside pi

執行 /ds4,會打開一個即時的 log 視窗。鍵盤操作:

Run /ds4 to open a live log window. Keys:

7.3 Lease 與 watchdog Lease and watchdog

本擴充用「lease」(租約)機制管理 server 生命週期:每個使用該模型的 pi process 會在 ~/.pi/ds4/clients/<pid>.json 寫入一個檔案並每 10 秒更新一次; ds4-watchdog 每 2 秒掃描,當沒有任何有效 lease 時,就送 SIGTERM 給 server 並退場。

The extension uses a “lease” mechanism to manage the server lifecycle: each pi process that uses the model writes a file at ~/.pi/ds4/clients/<pid>.json and refreshes it every 10 seconds; ds4-watchdog scans every 2 seconds, and when no valid lease remains it sends SIGTERM to the server and exits.

效果:你開十個 pi 視窗共用同一個 server;最後一個 pi 結束後約 60 秒, ds4-server 自動關閉、釋放 RAM。

The effect: ten pi windows share one server; about 60 seconds after the last pi exits, ds4-server shuts down on its own and frees the RAM.
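lease 的狀態可以直接從檔案年齡讀出來。下面的 Python 草稿按 mtime 分類 lease 檔;60 秒的 TTL 是從上文「約 60 秒」推測的假設值,實際數字以 index.ts 為準。

Lease state is readable straight off file ages. The Python sketch below classifies lease files by mtime; the 60-second TTL is an assumption inferred from the "about 60 seconds" figure above — the real number lives in index.ts.

```python
import glob
import os
import time

LEASE_DIR = os.path.expanduser("~/.pi/ds4/clients")
TTL_S = 60  # assumed from the "about 60 seconds" figure; see index.ts for the real value

def classify_leases(mtimes, now, ttl=TTL_S):
    """Split a {path: mtime} map into (live, expired) by age against the TTL."""
    live = [p for p, m in mtimes.items() if now - m <= ttl]
    expired = [p for p, m in mtimes.items() if now - m > ttl]
    return live, expired

def scan(now=None):
    """Read the real lease directory and classify it."""
    now = time.time() if now is None else now
    paths = glob.glob(os.path.join(LEASE_DIR, "*.json"))
    return classify_leases({p: os.path.getmtime(p) for p in paths}, now)

# Synthetic check: one lease refreshed 5 s ago, one abandoned 300 s ago.
live, expired = classify_leases(
    {"51234.json": 995.0, "51789.json": 700.0}, now=1000.0)
assert live == ["51234.json"] and expired == ["51789.json"]
```

live 清單為空時,watchdog 會在下一輪 poll 關閉 server。When the live list is empty, the watchdog shuts the server down on its next poll.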

7.4 啟動成本 vs. 持續成本 Start-up cost vs. running cost

首次冷啟動需要把約 87 GB 權重讀進 unified memory;在 Mac 內建 SSD 上通常只要幾秒,重啟若還在系統 page cache 裡更是不到一秒。 一旦 server 跑起來,後續所有請求都是即時的(無冷啟動延遲)。所以: 少數幾次長對話比多次短互動對體驗友善, 因為前者可以反覆受惠於同一個 warm server。

The first cold start loads roughly 87 GB of weights into unified memory — on a Mac’s internal SSD this typically takes only a few seconds, and a restart while the file is still in the OS page cache comes up in well under a second. Once the server is running, every subsequent request is instant (no cold-start latency). So: a few long conversations are kinder to the experience than many short interactions, because the former keeps benefiting from one warm server.

想讓 server 持續活著?Want to keep the server alive?

只要至少一個 pi process 對該模型保留 lease,watchdog 就不會關它。 要徹底常駐,最簡單的做法是開一個專門的 pi session 放著不關。

As long as at least one pi process holds a lease on the model, the watchdog will not stop it. For a properly resident server, the simplest approach is to leave one dedicated pi session open.


第 八 章Chapter 8

當作 Codex/Claude Code/OpenClaw/Hermes 的後端As a backend for Codex / Claude Code / OpenClaw / Hermes

pi 只是一個前端。ds4-server127.0.0.1:8000同時提供 OpenAI /v1/chat/completions、OpenAI /v1/responses,以及 Anthropic /v1/messages—— 你已經習慣的 coding agent,幾乎都可以直接接過來, 把 pi-ds4 當成一台本機、零成本、無速率限制的推論伺服器使用。

pi is just one frontend. ds4-server at 127.0.0.1:8000 simultaneously serves OpenAI /v1/chat/completions, OpenAI /v1/responses, and Anthropic /v1/messages — almost any coding agent you are already used to can connect directly, using pi-ds4 as a local, zero-cost, rate-limit-free inference server.

心智模型Mental model

pi-ds4 = 一台 24/7 開著的 OpenAI/Anthropic 雙協定推論伺服器, 只在你本機監聽、權重在你硬碟、上下文不離開你的 Mac。 無論你最後用 Codex CLI、Claude Code、OpenClaw、Hermes Agent, 還是自己寫的 SDK 腳本——它都只是換一個 base URL 的事。

pi-ds4 = a 24/7 dual-protocol OpenAI / Anthropic inference server, listening only on your machine, with weights on your disk and context that never leaves your Mac. Whether you end up using Codex CLI, Claude Code, OpenClaw, Hermes Agent, or your own SDK script — it is only a matter of changing the base URL.

下面四節各自示範一個常見的前端怎麼接。共通的設定值:

The four sections below each demonstrate how to wire up a common frontend. The shared settings:
  • Base URL:http://127.0.0.1:8000/v1(Anthropic 風格 client 用 http://127.0.0.1:8000);
  • 模型名稱:deepseek-v4-flash;
  • API key:任意字串(例如 sk-local),server 不檢查。
  • Base URL: http://127.0.0.1:8000/v1 (Anthropic-style clients use http://127.0.0.1:8000);
  • model name: deepseek-v4-flash;
  • API key: any string (e.g. sk-local); the server does not check it.

8.1 Codex CLI(OpenAI 官方) Codex CLI (OpenAI's official client)

Codex CLI 0.128+ 用 OpenAI Responses API(/v1/responses)跟 provider 對話; ds4-server 已經實作這個端點。把 pi-ds4 加進 Codex 的 provider 表:

Codex CLI 0.128+ uses the OpenAI Responses API (/v1/responses) to speak with providers; ds4-server already implements that endpoint. Add pi-ds4 to Codex's provider table:

~/.codex/config.toml
model = "deepseek-v4-flash"
model_provider = "ds4"

[model_providers.ds4]
name = "Local pi-ds4"
base_url = "http://127.0.0.1:8000/v1"
wire_api = "responses"
# env_key 省略:ds4-server 不檢查 API keyenv_key omitted: ds4-server does not check the API key

設定完之後直接 codex 就會走 pi-ds4。 若你只想偶爾用本機推論、平常還是接 cloud,可以保留 cloud 為預設, 用 codex --config model_provider=ds4 --config model=deepseek-v4-flash 臨時切過去。

Once configured, plain codex goes through pi-ds4. If you only want occasional local inference and otherwise stay on cloud, keep cloud as the default and switch on the fly with codex --config model_provider=ds4 --config model=deepseek-v4-flash.

啟動時的一行 /v1/models 警告A one-line /v1/models warning on startup

Codex 0.128 連上來時,會記一行非致命的 error: failed to refresh available models: ... missing field models。 原因是 Codex 的 model-refresher 期望 ollama 風格 {"models": [...]}, ds4-server 回的是 OpenAI 風格 {"object":"list","data":[...]}—— 實際推論不受影響,因為 Codex 直接用你 config 裡填的 model 名稱。可以放心忽略。

On connect, Codex 0.128 logs one non-fatal error: failed to refresh available models: … missing field models. The cause: Codex's model-refresher expects ollama-style {"models": [...]} while ds4-server returns the OpenAI-style {"object":"list","data":[...]} — inference is unaffected because Codex uses the model name from your config directly. Safe to ignore.
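兩種回應形狀並排看最清楚。下面是一個純示意的轉換函式(不需要真的跑):把 ds4-server 回的 OpenAI 風格清單,轉成 Codex refresher 期望的 ollama 風格。

The two response shapes are clearest side by side. A purely illustrative converter (nothing you need to run): mapping the OpenAI-style list ds4-server returns into the ollama-style shape Codex's refresher expects.

```python
def openai_to_ollama_models(payload):
    """Map OpenAI-style {"object": "list", "data": [{"id": ...}, ...]}
    to ollama-style {"models": [{"name": ...}, ...]} — the field Codex's
    model-refresher looks for and fails to find."""
    return {"models": [{"name": m["id"]} for m in payload.get("data", [])]}

server_reply = {"object": "list",
                "data": [{"id": "deepseek-v4-flash", "object": "model"}]}
assert "models" not in server_reply            # hence "missing field models"
assert openai_to_ollama_models(server_reply) == {
    "models": [{"name": "deepseek-v4-flash"}]}
```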

關於 --oss 旗標On the --oss flag

Codex CLI 內建 --oss 旗標,預設走 Ollama/LM Studio—— 是 OSS provider 的快捷鍵。pi-ds4 跟它們是並列的選擇; 想把 pi-ds4 設為 --oss 的目標,把 oss_provider = "ds4" 設好即可, 但對單一本機 backend 多半沒必要。

Codex CLI has a built-in --oss flag, defaulting to Ollama / LM Studio — a shortcut for OSS providers. pi-ds4 sits alongside them as a peer choice; to make pi-ds4 the target of --oss, set oss_provider = "ds4", though for a single local backend it is usually unnecessary.

8.2 Claude Code(Anthropic 官方):直連 Claude Code (Anthropic's official client): direct connection

ds4-server 自身就同時實作 OpenAI /v1/chat/completions 與 Anthropic /v1/messages(見 ds4_server.c 開頭的 "OpenAI/Anthropic compatible local server" 自我描述)。 所以 Claude Code 不需要 router/proxy——把兩個環境變數設好就走:

ds4-server itself implements both OpenAI /v1/chat/completions and Anthropic /v1/messages (see the ds4_server.c header which self-describes as "OpenAI/Anthropic compatible local server"). So Claude Code needs no router or proxy — set two environment variables and go:

直接讓 Claude Code 連 pi-ds4Point Claude Code straight at pi-ds4
export ANTHROPIC_BASE_URL=http://127.0.0.1:8000
export ANTHROPIC_AUTH_TOKEN=sk-local   # 任意字串,ds4-server 不檢查any string; ds4-server does not check
export ANTHROPIC_MODEL=deepseek-v4-flash

claude

你的 slash command、subagent、MCP server 流程都保留——只是底下的模型換成本機推論。 Tool-use 方面,ds4 的 tool-calling 訓練不如 Claude 強,會比較容易失誤; 適合長對話、寫作、解釋程式碼;嚴重依賴 tool loop 的情境會比較喘。

Your slash commands, subagents, and MCP server flow all remain — only the underlying model switches to local inference. On tool use, ds4's tool-calling training is weaker than Claude's and tends to slip up; it suits long conversations, writing, and code explanation, but workloads heavily reliant on tool loops will struggle.

如果你還是要 routerIf you still want a router

claude-code-router選用方案:用來在多個後端(例如「程式碼編輯走 pi-ds4、寫作走 cloud」) 之間動態分流。本機只跑單一後端時不必裝。

claude-code-router is optional: it dynamically routes traffic across multiple backends (e.g. “code editing on pi-ds4, writing on cloud”). With a single local backend there is no need to install it.

8.3 OpenClaw OpenClaw

OpenClaw 把所有 provider 都看成 OpenAI-compatible,loopback 位址自動信任。 編輯 openclaw.json,新增一個 provider:

OpenClaw treats all providers as OpenAI-compatible and trusts loopback addresses automatically. Edit openclaw.json and add a provider:

openclaw.json
{
  "agents": { "defaults": { "model": { "primary": "ds4/deepseek-v4-flash" } } },
  "models": {
    "mode": "merge",
    "providers": {
      "ds4": {
        "baseUrl": "http://127.0.0.1:8000/v1",
        "apiKey": "sk-local",
        "api": "openai-completions",
        "timeoutSeconds": 600,
        "models": [{
          "id": "deepseek-v4-flash",
          "name": "DeepSeek V4 Flash (local)",
          "reasoning": false,
          "contextWindow": 100000,
          "maxTokens": 8192
        }]
      }
    }
  }
}

contextWindow 必須 ≤ ds4-server 的 --ctx(預設 100000), 不然 OpenClaw 會把超長 prompt 丟進去然後被 server 拒絕。寫好整段 provider 區塊比較保險——逐 key 用 openclaw config set … 容易漏掉 models[]agents.defaults.model.primaryapi 等欄位,導致 session 還是接到舊 provider。

contextWindow must be ≤ ds4-server's --ctx (default 100000), otherwise OpenClaw will push an over-long prompt through and the server will reject it. Writing the whole provider block at once is safer — doing it key by key with openclaw config set … easily misses fields like models[], agents.defaults.model.primary, or api, leaving the session connected to the old provider.

8.4 Hermes Agent Hermes Agent

最快的路徑是互動式:hermes model 選「Custom endpoint (self-hosted / VLLM / etc.)」,輸入 http://127.0.0.1:8000/v1 即可。 若你想寫進 config:

The quickest path is interactive: hermes model, choose “Custom endpoint (self-hosted / VLLM / etc.)”, and enter http://127.0.0.1:8000/v1. To write it into the config instead:

~/.hermes/config.yaml
custom_providers:
  - name: ds4
    base_url: http://127.0.0.1:8000/v1
    # api_key 省略:本機 server 不檢查api_key omitted: the local server does not check it

model:
  default: deepseek-v4-flash
  provider: custom:ds4

之後 session 內可以隨時 /model custom:ds4:deepseek-v4-flash 切換。

Within a session you can switch any time with /model custom:ds4:deepseek-v4-flash.

8.5 直接打 API(OpenAI SDK/curl/自製腳本) Hitting the API directly (OpenAI SDK / curl / your own script)

上面的章節都是包裝層。若你只是想串到自己的 pipeline、Cron job、 或既有 OpenAI SDK 程式:把 OPENAI_BASE_URL 指過去就完了。

The sections above are all wrappers. If you just want to wire it into your own pipeline, a cron job, or existing OpenAI SDK code: just point OPENAI_BASE_URL at it and you are done.

Python OpenAI SDK (Chat Completions)
from openai import OpenAI

client = OpenAI(
    base_url="http://127.0.0.1:8000/v1",
    api_key="sk-local",   # 任意值any value
)

resp = client.chat.completions.create(
    model="deepseek-v4-flash",
    messages=[{"role": "user", "content": "自我介紹"}],
)
print(resp.choices[0].message.content)

若你的 pipeline 已經切到 OpenAI Responses API(Codex CLI 用的那個), client.responses.create() 直接呼叫同一個 server:

If your pipeline already uses the OpenAI Responses API (the one Codex CLI speaks), client.responses.create() hits the same server:

Python OpenAI SDK (Responses)
from openai import OpenAI

client = OpenAI(
    base_url="http://127.0.0.1:8000/v1",
    api_key="sk-local",
)

resp = client.responses.create(
    model="deepseek-v4-flash",
    input=[{"role": "user", "content": "自我介紹"}],
)
print(resp.output_text)

8.6 ds4-server 何時跑? When does ds4-server run?

pi-ds4 的 lease/watchdog 機制(第七章)只在 pi process 透過模型發 request 時才生效。 如果你完全不用 pi、只用 Codex CLI 等外部前端,那麼 ds4-server 必須以另一種方式被啟動。三條路:

pi-ds4's lease/watchdog mechanism (Chapter 7) only kicks in when a pi process sends a request through the model. If you do not use pi at all and only use external frontends like Codex CLI, then ds4-server has to be started some other way. Three paths:
  • A:留一個專門的 pi session 開著不關,lease 會讓 watchdog 維持 server 常駐。
  • B:cd ~/.pi/ds4/support && ./ds4-server … 手動啟動,生命週期自己管。
  • C:完全不經過 pi:克隆 audreyt/ds4、make ds4-server、用 pi-ds4 內附的 download_model.sh 抓模型,再手動跑 ./ds4-server …。
  • A: keep one dedicated pi session open; its lease keeps the watchdog from stopping the server.
  • B: start it by hand with cd ~/.pi/ds4/support && ./ds4-server … and manage the lifecycle yourself.
  • C: bypass pi entirely: clone audreyt/ds4, make ds4-server, fetch the model with pi-ds4's bundled download_model.sh, then run ./ds4-server … by hand.

不要兩邊同時跑Do not run both at once

手動啟動的 ds4-server 與 pi 自動啟動的 ds4-server 都會搶 127.0.0.1:8000。 要嘛讓 pi 管它(搭配 A 方案),要嘛你自己管它(B 或 C)。混用會撞 port、 寫壞 server.json、watchdog 看到陌生 PID 還可能誤殺。

A hand-started ds4-server and a pi-started ds4-server both compete for 127.0.0.1:8000. Either let pi manage it (option A) or manage it yourself (B or C). Mixing them collides on the port, corrupts server.json, and may even mis-kill processes when the watchdog sees an unfamiliar PID.
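無論用哪種方式啟動,依賴這台 server 的腳本都需要等它就緒。下面是一個假設性的輪詢草稿:打 /v1/models 直到有回應;輪詢邏輯抽成純函式,方便離線測試。

However you start it, scripts that depend on the server need to wait for readiness. A hypothetical polling sketch: hit /v1/models until it answers, with the polling logic factored into a pure function so it can be exercised offline.

```python
import time
import urllib.error
import urllib.request

def poll_until(probe, timeout_s, interval_s=2.0,
               clock=time.monotonic, sleep=time.sleep):
    """Call probe() until it returns truthy or timeout_s elapses."""
    deadline = clock() + timeout_s
    while clock() < deadline:
        if probe():
            return True
        sleep(interval_s)
    return False

def server_ready(base="http://127.0.0.1:8000"):
    """True once /v1/models answers with HTTP 200."""
    try:
        with urllib.request.urlopen(base + "/v1/models", timeout=2) as r:
            return r.status == 200
    except (urllib.error.URLError, OSError):
        return False

# Deterministic check with fakes: the probe succeeds on its third attempt.
attempts = iter([False, False, True])
t = [0.0]
ok = poll_until(lambda: next(attempts), timeout_s=10,
                clock=lambda: t[0],
                sleep=lambda s: t.__setitem__(0, t[0] + s))
assert ok and t[0] == 4.0
```

真實使用:poll_until(server_ready, timeout_s=600),對應 DS4_READY_TIMEOUT_MS 的預設 10 分鐘。In real use: poll_until(server_ready, timeout_s=600), mirroring the DS4_READY_TIMEOUT_MS default of 10 minutes.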

8.7 其他硬體:DGX Spark 等 CUDA 機器 Other hardware: DGX Spark and CUDA machines

你不一定需要 Mac。audreyt/ds4Makefile 在非 Darwin 系統上自動切到 CUDA 路徑 (ds4_cuda.cu,~10k 行 NVIDIA kernel),用 nvcc 編出一顆 原生 ds4-server。也就是說:NVIDIA DGX Spark(GB10、~128 GB 統一記憶體、aarch64 Linux) 上跑的不是某個 llama.cpp 旁路,而是同一隻 engine、同一份 server、 同一組 --dir-steering-* flag

You do not have to use a Mac. The Makefile of audreyt/ds4 switches automatically to a CUDA path on non-Darwin systems (ds4_cuda.cu, around 10k lines of NVIDIA kernel), using nvcc to build a native ds4-server. That is: on the NVIDIA DGX Spark (GB10, ~128 GB unified memory, aarch64 Linux), what runs is not some llama.cpp sidecar but the same engine, the same server, the same --dir-steering-* flags.

在 DGX Spark 上 build + run(同一個 engine)Build + run on DGX Spark (the same engine)
# 前置:apt install build-essential cmake;已裝好 CUDA toolkit(DGX Spark 預設帶 /usr/local/cuda)Prerequisites: apt install build-essential cmake; CUDA toolkit already installed (DGX Spark ships /usr/local/cuda by default)

git clone https://github.com/audreyt/ds4
cd ds4
make ds4-server         # Linux 自動走 nvcc + ds4_cuda.cu,CUDA_ARCH=native 預設Linux automatically uses nvcc + ds4_cuda.cu; CUDA_ARCH=native is the default

# 下載 cyberneurova abliterated IQ2XXS imatrix GGUF——注意,ds4 本身的 download_model.sh 是 antirez stock-recipe,Download the cyberneurova abliterated IQ2XXS imatrix GGUF — note that ds4's own download_model.sh is the antirez stock-recipe
# 不會抓到 cyberneurova。用 pi-ds4 內附的同名腳本:and will not fetch cyberneurova. Use the same-named script bundled with pi-ds4:
curl -fL -o download_cyberneurova.sh \
  https://raw.githubusercontent.com/audreyt/pi-ds4/main/download_model.sh
chmod +x download_cyberneurova.sh
./download_cyberneurova.sh q2

# 起 server。bind 127.0.0.1:要對 LAN 開放,自己加 firewall/Tailscale/reverse-proxy。Start the server. It binds 127.0.0.1: to open it to the LAN, add your own firewall / Tailscale / reverse proxy.
# 引導參數一字不改,與 Mac 路徑一致。--mpp 是 Mac 專屬,CUDA 路徑會忽略。Steering parameters are identical to the Mac path. --mpp is Mac-only and is ignored on the CUDA path.
mkdir -p /var/cache/ds4-kv
./ds4-server \
  --ctx 32768 \
  --kv-disk-dir /var/cache/ds4-kv --kv-disk-space-mb 8192 \
  --dir-steering-file dir-steering/out/uncertainty_ablit_imatrix.f32 \
  --dir-steering-ffn -2

Endpoint 起來之後,前面 8.2~8.5 的設定原則上一字不改都能用(把 127.0.0.1 換成你 DGX 的位址)。差別在哪:

Once the endpoint is up, the setups in 8.2–8.5 above work essentially verbatim (just replace 127.0.0.1 with your DGX's address). What's different:

換句話說,這不是「同一份 GGUF、不同 runtime」的勉強相容,而是 同一隻引擎、同一支二進位、同一組旗標跨平台。 pi-ds4 那一層 wrapper 只是 macOS 上的安裝/lifecycle 自動化, 在 Linux 上手動跑 ds4-server 已經涵蓋所有功能。

In other words, this is not a strained “same GGUF, different runtime” compatibility, but the same engine, the same binary, the same flags across platforms. The pi-ds4 wrapper is just install / lifecycle automation on macOS; running ds4-server by hand on Linux already covers all the same functionality.


第 九 章Chapter 9

故障排除:lease 與 watchdogTroubleshooting: lease and watchdog

本擴充把運行狀態都寫在 ~/.pi/ds4/ 底下, 遇到怪事的時候可以直接看檔案——不需要拆原始碼。

The extension writes all runtime state under ~/.pi/ds4/; when something odd happens, just read the files — no need to dig into the source.

9.1 目錄結構 Directory layout

~/.pi/ds4/
├── support/        # audreyt/ds4 shallow checkout
│   ├── ds4-server  # 編譯出來的二進位
│   ├── gguf/       # 模型權重(symlink 自 ds4flash.gguf)
│   └── dir-steering/
│       └── out/
│           └── uncertainty_ablit_imatrix.f32
├── kv/             # KV cache 溢出位置(上限 8 GB)
├── clients/        # 每個 pi process 的 lease 檔
│   ├── 51234.json
│   └── 51789.json
├── lock/           # 啟動鎖(活著時存在,否則消失)
├── server.json     # 當前 ds4-server 的 pid/port/參數
└── log             # 所有 stdout/stderr
~/.pi/ds4/
├── support/        # audreyt/ds4 shallow checkout
│   ├── ds4-server  # the compiled binary
│   ├── gguf/       # model weights (symlinked as ds4flash.gguf)
│   └── dir-steering/
│       └── out/
│           └── uncertainty_ablit_imatrix.f32
├── kv/             # KV cache spill location (cap 8 GB)
├── clients/        # per-pi-process lease files
│   ├── 51234.json
│   └── 51789.json
├── lock/           # startup lock (present while held, absent otherwise)
├── server.json     # current ds4-server pid / port / args
└── log             # all stdout/stderr

9.2 常見症狀與處置 Common symptoms and remedies

「server 卡在啟動中超過 10 分鐘」“The server is stuck starting up for more than 10 minutes”

log 的最後幾百行。內建 SSD 上模型載入通常只要幾秒, 卡 10 分鐘以上多半不是磁碟速度本身的問題——更可能是 GGUF 放在慢速外接磁碟, 或啟動流程其他環節卡住。先讀 log 找原因;真有必要再把 DS4_READY_TIMEOUT_MS 調高。

Read the last few hundred lines of log. Normal model load takes only a few seconds on an internal SSD, so anything stuck past 10 minutes is rarely a pure disk-speed issue — more likely the GGUF sits on a slow external drive, or some other step in startup is wedged. Diagnose from the log first; raise DS4_READY_TIMEOUT_MS only if there’s a real reason to wait longer.

「pi 顯示 ds4-server startup failed」“pi reports ds4-server startup failed”

log 會明確指出失敗點:可能是 make ds4-server 失敗(缺 Xcode CLI tools?)、 GGUF 下載中斷(網路?磁碟滿?)、或者 --mpp 在不支援的晶片上 panic。

The log will point clearly to the failure: make ds4-server failing (missing Xcode CLI tools?), a broken GGUF download (network? disk full?), or --mpp panicking on unsupported silicon.

「想強制重啟一切」“I want to force-restart everything”

先試溫和的:把 ~/.pi/ds4/clients/ 清空,watchdog 在下一輪 poll(約 2 秒)看到沒有 lease 後會優雅關閉 server。

Try the gentle path first: empty ~/.pi/ds4/clients/; on the next poll (about 2 seconds), the watchdog sees no leases and shuts the server down gracefully.

優雅重啟(先試這個)Graceful restart (try this first)
# 清掉所有 lease,watchdog 會在下一輪 poll(約 2 秒)優雅關閉 server。Clear all leases; the watchdog shuts the server down gracefully on its next poll (about 2 seconds).
# 用 find -delete 而非 rm + glob,避免 zsh 在空目錄時的 no-matches 錯誤。Use find -delete rather than rm + glob to avoid zsh's no-matches error on empty dirs.
find ~/.pi/ds4/clients -maxdepth 1 -name '*.json' -delete 2>/dev/null || true
硬重啟:分三步走,不要急Hard restart: three steps, no rush

若 watchdog 異常,不要直接 pkill -TERM ds4-server—— 那會無差別終止機器上所有名為 ds4-server 的程序(其他 pi-ds4 安裝、實驗用 build⋯)。 下面三步走,每一步停下來看輸出。

If the watchdog is misbehaving, do not just run pkill -TERM ds4-server — that will indiscriminately terminate every process on the machine called ds4-server (other pi-ds4 installs, experimental builds, etc.). Take the three steps below, pausing to read the output after each one.

第一步:檢視當前 state——只讀,沒有破壞性。

Step one: inspect the current state — read-only, non-destructive.

第一步:印出 state 與對應程序(只讀)Step 1: print state and matching process (read-only)
# 印出 server.json 的關鍵欄位、對應程序的 args 與啟動時間:Print the key fields of server.json and the matching process's args and start time:
STATE=~/.pi/ds4/server.json
if [ ! -f "$STATE" ]; then
  echo 'no server.json (already clean)'
else
  MANAGED=$(sed -n 's/.*"managedBy"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p' "$STATE" | head -1)
  PID=$(sed -n 's/.*"pid"[[:space:]]*:[[:space:]]*\([0-9]*\).*/\1/p' "$STATE" | head -1)
  BINARY=$(sed -n 's/.*"binary"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p' "$STATE" | head -1)
  echo "managedBy: $MANAGED"
  echo "pid:       ${PID:-<none>}"
  echo "binary:    ${BINARY:-<none>}"
  if [ -n "$PID" ] && kill -0 "$PID" 2>/dev/null; then
    ps -p "$PID" -o pid=,lstart=,args=
  else
    echo '(pid not running — state is stale, step 2 will clean it)'
  fi
fi

第二步:只在 PID 已經不存在時自動清理 state。 若 PID 還活著,本腳本拒絕動作(避免盲目殺死可能是別人的程序)——印出指引後請手動接到第三步。

Step two: automatically clean up the state only when the PID is already gone. If the PID is still alive the script refuses to act (so it does not blindly kill what might be someone else's process) — it prints guidance and hands off to step three.

第二步:PID 死亡時自動清理(保守)Step 2: auto-clean when PID is dead (conservative)
(
  STATE=~/.pi/ds4/server.json
  LOCKDIR=~/.pi/ds4/lock

  # 用 mkdir 做原子鎖(與 index.ts 相同機制);鎖被持有就退出Use mkdir as an atomic lock (same mechanism as index.ts); exit if the lock is held.
  if ! mkdir "$LOCKDIR" 2>/dev/null; then
    echo "abort: lock $LOCKDIR is held; owner:"
    cat "$LOCKDIR/owner.json" 2>/dev/null || echo '(no owner.json — if > 60s old, rm -rf manually)'
    exit 1
  fi
  trap 'rm -rf "$LOCKDIR" 2>/dev/null' EXIT
  trap 'rm -rf "$LOCKDIR" 2>/dev/null; exit 130' INT
  trap 'rm -rf "$LOCKDIR" 2>/dev/null; exit 143' TERM
  trap 'rm -rf "$LOCKDIR" 2>/dev/null; exit 129' HUP

  if [ ! -f "$STATE" ]; then
    echo 'lifecycle already clean'; exit 0
  fi

  PID=$(sed -n 's/.*"pid"[[:space:]]*:[[:space:]]*\([0-9]*\).*/\1/p' "$STATE" | head -1)

  # 只有 PID 不存在(或 state 沒寫 PID)時才動 state。Only touch state when the PID is gone (or absent from state).
  # PID 還活著就拒絕——交給第三步人工處理。If the PID is alive, refuse — hand off to step 3 for manual action.
  if [ -n "$PID" ] && kill -0 "$PID" 2>/dev/null; then
    echo "refuse: pid $PID still alive."
    echo "        please verify via step 1 output, then use step 3 to kill manually."
    exit 2
  fi

  # PID 已死或無 PID。安全清理。PID is dead or absent. Safe to clean.
  find ~/.pi/ds4/clients -maxdepth 1 -name '*.json' -delete 2>/dev/null || true
  rm -f "$STATE"
  echo 'state cleared. lock will release on exit.'
)

第三步(只在第二步輸出 refuse: pid X still alive 時用): 對照第一步 ps 那一行的輸出——確認 args 真的指向本擴充的 ~/.pi/ds4/support/ds4-serverlstart 的時間是你預期的—— 然後親手鍵入 PID(不要從別處複製 PID 變數)執行:

Step three (only when step two prints refuse: pid X still alive): cross-check the ps output from step one — confirm args really points at this extension's ~/.pi/ds4/support/ds4-server and that lstart matches what you expect — then type the PID by hand (do not paste a PID variable from elsewhere) and run:

第三步:人工 SIGTERM(範本,不提供複製按鈕)Step 3: manual SIGTERM (template; no copy button)
# 把 PID_FROM_STEP_1_MUST_BE_REPLACED 換成你親眼從第一步輸出確認過的那個 PID。Replace PID_FROM_STEP_1_MUST_BE_REPLACED with the PID you confirmed by eye in step 1's output.
# 沒提供 Copy 按鈕;故意用一個非數字的 token,不慎執行未編輯版本只會得到No Copy button is provided; the deliberately non-numeric token means accidentally running the unedited version only yields
# 'kill: arguments must be process or job IDs' 之類的錯誤,而不會送出 SIGTERM。an error like 'kill: arguments must be process or job IDs' rather than sending SIGTERM.
kill -TERM PID_FROM_STEP_1_MUST_BE_REPLACED

# 等幾秒讓它優雅退出,然後再跑第二步即可清掉 state。Wait a few seconds for it to exit gracefully, then re-run step 2 to clear state.

備註:這套三步流程是最後手段。watchdog 與 lifecycle lock 平時就會處理絕大多數情況; 若你發現自己常常需要跑這個,那是 bug,請到 audreyt/pi-ds4 issues 回報。

Note: this three-step procedure is a last resort. The watchdog and lifecycle lock handle almost every case in normal operation; if you find yourself running this often, that is a bug — please report it at audreyt/pi-ds4 issues.

「Server 在跑,但 /v1/models 沒回應」“The server is running but /v1/models does not respond”

檢查 8000 port 是否真的被 ds4-server 佔用:lsof -nP -iTCP:8000 -sTCP:LISTEN。 若被其他程式佔走,server 啟動會失敗(log 會記錄)。

Check whether port 8000 is really held by ds4-server: lsof -nP -iTCP:8000 -sTCP:LISTEN. If something else has grabbed it, server startup fails (the log records it).

不要刪 support/gguf/Do not delete support/gguf/

它裡頭是花了好幾小時下載的 87 GB 權重。除非要換 quant 或換模型,否則永遠不要動它; 如果不小心刪了,下次啟動會重新下載(會續傳 .part 檔,但前提是 .part 還在)。

Inside it are 87 GB of weights that took hours to download. Unless you are changing quant or model, never touch it; if you delete it by accident, the next start downloads everything again (it can resume a .part file, but only if the .part is still there).


第 十 章Chapter 10

本機開發安裝Local development install

如果你想 hack 引擎、測自己的 patch、或保留多個 ds4 fork 並行, 可以跳過 pi install 流程,直接用內附的安裝腳本把擴充和 ds4 checkout 都 symlink 過去。

If you want to hack on the engine, test your own patches, or keep several ds4 forks side by side, you can skip the pi install flow and use the bundled install script to symlink both the extension and your ds4 checkout into place.

10.1 基本用法 Basic usage

本機開發掛上去Mount for local development
# 在 pi-ds4 的 checkout 根目錄執行:From the pi-ds4 checkout root:
./install-pi-extension-local.sh /path/to/audreyt-ds4-checkout

它會做兩件事:

It does two things:
  • 把這份 pi-ds4 checkout symlink 成 pi 的擴充,取代 pi install 下載的版本;
  • 把你指定的 ds4 checkout symlink 成 ~/.pi/ds4/support。
  • symlinks this pi-ds4 checkout into place as the pi extension, replacing the pi install copy;
  • symlinks the ds4 checkout you point it at as ~/.pi/ds4/support.

10.2 已有 support 目錄時:--force When a support directory already exists: --force

如果 ~/.pi/ds4/support 已經指向別的地方(例如上次 pi install 留下的), 腳本會拒絕直接覆寫。加 --force 會:

If ~/.pi/ds4/support already points elsewhere (e.g. left over from a prior pi install), the script refuses to overwrite directly. Adding --force will:

10.3 接下來:reload Next: reload

安裝完成後重啟 pi,或在 pi 內執行 /reload——擴充會被重新發現。

After install, restart pi or run /reload inside pi — the extension is rediscovered.


第 十一 章Chapter 11

進階:自訂引導方向Advanced: custom steering directions

本章是給已經理解前面所有章節的讀者準備的:如何根據自己的研究興趣建造一個全新的引導向量, 並掛到 ds4-server 上跑。

This chapter is for readers who have already absorbed everything above: how to build a fresh steering vector for your own research interest and run it on ds4-server.

11.1 工具鏈在哪 Where the toolchain lives

audreyt/ds4dir-steering/ 目錄裡有完整的建構工具:

The complete build toolchain lives in the dir-steering/ directory of audreyt/ds4:

11.2 典型工作流 A typical workflow

  1. 準備兩組提示——一組你想「引導模型走向」(positive),一組「不要走向」(negative)。每組 50~200 條足夠。
  2. collect-acts.py 在 ds4-server 上跑這些提示,產生兩組 activation。
  3. build-dir.py 計算這兩組 activation 在每一層的差異方向,輸出成 my-direction.f32
  4. 把檔案放到 dir-steering/out/ 或任何位置;
  5. 設定環境變數:
    • DS4_DIR_STEERING_FILE=dir-steering/out/my-direction.f32
    • DS4_DIR_STEERING_FFN=-2(或自己摸出來的甜蜜點)
  6. 重啟 ds4-server,測試。
  1. Prepare two sets of prompts — one you want to steer the model towards (positive) and one to steer away from (negative). 50 to 200 of each is enough.
  2. Run them through ds4-server with collect-acts.py to produce two sets of activations.
  3. Use build-dir.py to compute the difference direction layer by layer, emitting my-direction.f32.
  4. Place the file in dir-steering/out/ or anywhere else;
  5. set the environment:
    • DS4_DIR_STEERING_FILE=dir-steering/out/my-direction.f32
    • DS4_DIR_STEERING_FFN=-2 (or your own sweet spot)
  6. Restart ds4-server and test.
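第 3 步的數學形狀可以不靠工具鏈先看懂。下面是 difference-of-means 的概念性草稿——不是 build-dir.py 本身(實際檔案格式與 CLI 以 dir-steering/ 內為準)——假設 activation 以「每層一組 hidden-state 向量」到手,輸出為 little-endian float32 平鋪檔。

The mathematical shape of step 3 can be understood without the toolchain. Below is a conceptual difference-of-means sketch — not build-dir.py itself (the real file format and CLI are defined inside dir-steering/) — assuming activations arrive as per-layer sets of hidden-state vectors and the output is a flat little-endian float32 file.

```python
import struct

def mean(vectors):
    """Element-wise mean of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def direction_per_layer(pos_acts, neg_acts):
    """For each layer, the (positive-mean minus negative-mean) direction,
    normalized to unit length — the classic difference-of-means
    steering-vector construction."""
    dirs = []
    for pos, neg in zip(pos_acts, neg_acts):
        diff = [p - q for p, q in zip(mean(pos), mean(neg))]
        norm = sum(d * d for d in diff) ** 0.5 or 1.0
        dirs.append([d / norm for d in diff])
    return dirs

def write_f32(path, dirs):
    """Flat little-endian float32 dump, layer by layer (an assumed layout —
    check dir-steering/ for the real file format)."""
    with open(path, "wb") as f:
        for layer in dirs:
            f.write(struct.pack("<%df" % len(layer), *layer))

# Toy example: one layer, hidden size 2; positive activations lean +x.
pos = [[[1.0, 0.0], [3.0, 0.0]]]   # layer 0: two positive-prompt activations
neg = [[[-1.0, 0.0], [-3.0, 0.0]]]
assert direction_per_layer(pos, neg) == [[1.0, 0.0]]
```

把方向歸一化後,實際強度完全交由 DS4_DIR_STEERING_FFN 控制(假設引擎以該旗標縮放)。With the direction normalized, the actual strength is left entirely to DS4_DIR_STEERING_FFN (assuming the engine scales by that flag).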

11.3 建造好向量的兩個要點 Two principles for building a good vector

分享你的向量Share your vector

如果你建了一個有用的方向,歡迎把它寄到 audreyt/ds4 issues—— 有趣的方向會被收進主分支讓所有人都能用。

If you build a useful direction, please send it to audreyt/ds4 issues — interesting directions can be pulled into the main branch for everyone to use.


常 問 問 題FAQ

還沒決定的讀者For readers still deciding

如果你還在猶豫要不要把自己的 Mac 拿去跑這顆模型,下面這些是最常被問到的問題。

If you are still on the fence about whether to put your Mac to work running this model, here are the most common questions.

我的對話真的不會被送上雲端嗎?Do my conversations really never go to the cloud?

不會。ds4-server 完全在 127.0.0.1:8000(本機 loopback)上監聽,不會對外部網路發出任何呼叫。模型權重一旦下載完成,整個推論流程都在你機器的 CPU/GPU/unified memory 裡。 如果你斷網路使用,它也照樣運作(除了首次下載階段)。

No. ds4-server listens only on 127.0.0.1:8000 (local loopback) and makes no outbound calls. Once the model weights are downloaded, the entire inference pipeline runs on your machine's CPU / GPU / unified memory. Use it with the network unplugged and it still works (except for the initial download).

這跟用 ChatGPT、Claude、Gemini 有什麼差別?How does this differ from using ChatGPT, Claude, or Gemini?

輸出主權:沒有「我們無法回應這個請求」這類雲端服務內建的封鎖; 對於主權/領土/哲學爭議類問題,本分支預設啟用的方向性引導會讓模型進入「鋪陳討論」而不是「給定一個訓練好的答案」。

Output sovereignty: none of the built-in cloud-service blocks like “we cannot respond to that request”; on questions of sovereignty / territory / philosophical contention, this fork's default directional steering pushes the model into “laying out the discussion” rather than “delivering a pre-trained answer”.

資料主權:對話不離開你的機器,也不會進入任何訓練資料集。

Data sovereignty: conversations do not leave your machine and do not enter any training set.

成本結構:沒有逐 token 計價、沒有速率限制、沒有訂閱費。 代價是一次性的硬體投入(128 GB Mac)與電費。

Cost structure: no per-token pricing, no rate limits, no subscription. The cost is one-off hardware (a 128 GB Mac) plus electricity.

能力:284 B 參數、13 B 啟用的 MoE 是當前開放權重前沿模型的量級;IQ2XXS imatrix 量化會讓某些任務輸給未量化的雲端版本; 對話品質仍在「可用作日常 coding/寫作助理」的範圍。

Capability: a 284 B-parameter, 13 B-active MoE is in the bracket of current frontier open-weight models; IQ2XXS imatrix quantisation will lose ground to unquantised cloud versions on some tasks, but conversation quality remains within “usable as a daily coding / writing assistant”.

我已經習慣用 Codex CLI/Claude Code,還需要學 pi 嗎?I am already used to Codex CLI / Claude Code — do I still need to learn pi?

作為日常 shell——不需要。第八章有四個常見前端的接法:

As a daily shell — no. Chapter 8 shows how to wire up four common frontends:

  • Claude Code:直連,export ANTHROPIC_BASE_URL=http://127.0.0.1:8000 就走——ds4-server 本身就實作 Anthropic Messages 端點。
  • OpenClaw/Hermes Agent:改一個 base_url 就連得上。
  • Codex CLI:~/.codex/config.toml 加一段 [model_providers.ds4],指到 ds4-server——完整可貼可用的 TOML 在 8.1 節
  • Claude Code: direct connection — export ANTHROPIC_BASE_URL=http://127.0.0.1:8000 and you are off; ds4-server itself implements the Anthropic Messages endpoint.
  • OpenClaw / Hermes Agent: just change a single base_url.
  • Codex CLI: add a [model_providers.ds4] block in ~/.codex/config.toml pointing at ds4-server — the full paste-and-go TOML is in section 8.1.

但你還是會在背後某個地方需要 ds4-server 跑起來。三條路:

But you still need ds4-server running somewhere in the background. Three paths:

  • 有 pi:留一個 pi session 開著讓 watchdog 維持(A 方案);或 cd ~/.pi/ds4/support && ./ds4-server … 手動跑(B 方案)。
  • 沒有 pi:克隆 audreyt/ds4make ds4-server,然後用 pi-ds4download_model.sh(curl 一份下來,不是 ds4 內附的同名腳本——那支抓的是上游 stock-recipe),最後跑 ./ds4-server …(C 方案)。完整步驟與旗標見 8.6 節
  • With pi: keep one pi session open so the watchdog keeps it alive (option A); or cd ~/.pi/ds4/support && ./ds4-server … by hand (option B).
  • Without pi: clone audreyt/ds4, run make ds4-server, then fetch pi-ds4's download_model.sh with curl (not the same-named script inside ds4, which grabs the upstream stock-recipe), and finally run ./ds4-server … (option C). Full steps and flags are in section 8.6.

簡言之:作為前端可以完全不碰 pi;作為服務你還是得讓 ds4-server 跑起來——但這條服務路徑也可以完全不經過 pi(見 8.6 節)。

In short: as a frontend you can avoid pi entirely; as a service you still need ds4-server running — but that service path can also bypass pi entirely (see section 8.6).

我沒有 128 GB Mac,有沒有替代?I do not have a 128 GB Mac — are there alternatives?

有幾條路:

A few options:

  • 96 GB Mac:上游已驗證 IQ2XXS imatrix 在 96 GB 上跑得動 250k context、約 27 t/s。本擴充本身硬性檢查 ≥ 128 GB 不會接受這台機器,但可以走 §1.1 末尾「只有 96 GB?」的 bypass 路徑(先 sudo sysctl iogpu.wired_limit_mb=92000,再手動跑 ds4-server)。
  • 更小的模型本機跑:llama.cppMLXOllama 跑較小的開放權重模型(Llama、Mistral、Qwen、Gemma 系列等),64 GB Mac 可跑得動 70 B 量級。ds4 引擎本身是專為 DeepSeek V4 Flash 設計的,不適用於其他模型。
  • 租用雲端 Mac:例如 MacStadium、Scaleway 提供按小時計費的 M-series 機器。
  • 直接用 DeepSeek 官方 API:沒有方向性引導,但有完整能力;費用以 token 計。
  • 96 GB Mac: upstream has validated that the IQ2XXS imatrix variant runs on 96 GB at 250k context and around 27 t/s. This extension hard-checks for ≥ 128 GB and will refuse the machine, but you can take the bypass path described in “Only 96 GB?” at the end of §1.1 (run sudo sysctl iogpu.wired_limit_mb=92000, then drive ds4-server by hand).
  • Run a smaller model locally: use llama.cpp, MLX, or Ollama with smaller open-weight models (the Llama, Mistral, Qwen, and Gemma families); a 64 GB Mac can drive 70 B-class models. The ds4 engine itself is built specifically for DeepSeek V4 Flash and does not work with other models.
  • Rent a cloud Mac: services like MacStadium or Scaleway offer per-hour M-series machines.
  • Use the DeepSeek official API directly: no directional steering but full capability; billed per token.

本分支的「不確定性引導」是搭配這顆 abliterated GGUF 設計的, 它的具體甜蜜點(ffn=-2)不一定能直接搬到其他模型;要用同樣機制需要自己重做向量。

This fork's “uncertainty steering” is designed to pair with this particular abliterated GGUF, and its specific sweet spot (ffn=-2) may not carry over directly to other models; to use the same mechanism elsewhere, you will need to rebuild the vector yourself.

跑這個會吃多少電?會把 Mac 燒壞嗎?How much power does this draw? Will it cook my Mac?

推論時的功耗大致跟「整顆 Mac 同時跑 GPU 密集任務」差不多—— M-series 機型上多半在 30~60 W 之間。長時間使用建議:

During inference, power draw is roughly what any GPU-intensive workload would pull, typically 30 to 60 W on M-series machines. For extended use:

  • 放在通風良好的桌面,不要堵住底部進氣口;
  • 留意 unified memory 用量,避免同時開超多其他大型應用;
  • 當你沒在用時,watchdog 會自動關閉 server 釋放 RAM。
  • place it on a well-ventilated surface and do not block the bottom intake;
  • watch unified memory usage and avoid running several other heavyweight applications alongside it;
  • when you are not using it, the watchdog will close the server automatically and free the RAM.

燒壞的風險與一般高負載使用相同;Apple Silicon 機型在溫度過高時會自動降頻保護。

The damage risk is no greater than any other heavy workload; Apple Silicon machines throttle automatically when they get too hot.

這跟「越獄」(jailbreak)是同一件事嗎?Is this the same as “jailbreaking”?

不是。越獄通常指用特殊提示詞繞過雲端模型的政策層;abliteration 是直接編輯權重, 屬於修改模型本身。兩者方向不同——越獄是輸入面的繞過,abliteration 是模型面的調整。 本分支的方向性引導又是另一層機制(執行期的低秩活化編輯),跟前兩者都不一樣。

No. Jailbreaking typically means using special prompts to bypass a cloud model's policy layer; abliteration directly edits the weights, which means modifying the model itself. The two work in different directions — jailbreaks bypass on the input side, abliteration adjusts the model side. This fork's directional steering is yet another layer (runtime low-rank activation editing), distinct from both.

如果你想看原始未動過的 DeepSeek V4 Flash 行為,用上游 antirez/ds4 的 stock-recipe GGUF; 要關掉本分支的方向性引導,設 DS4_DIR_STEERING_FFN=0

If you want the original untouched DeepSeek V4 Flash behaviour, use the upstream stock-recipe GGUF from antirez/ds4; to turn off this fork's directional steering, set DS4_DIR_STEERING_FFN=0.

為什麼要叫「pi」?跟圓周率有關嗎?Why “pi”? Is it about π?

無關。pi 是 Earendil 開發的 coding agent CLI 名稱(取自一個夢的縮寫,不是 π)。

Unrelated. pi is the name of a coding agent CLI developed by Earendil (an acronym from a dream, not π).

我可以拿這個來商業使用嗎?Can I use this commercially?

可以。整條鏈路都是 MIT 授權,商業使用直接拿去用即可——但「再散布」(redistribute)時, 各元件要保留自己那份 LICENSE 聲明。如果你只是使用不再散布,這些不必擔心。

Yes. The whole chain is MIT-licensed, so you can use it commercially straight off — but when you redistribute, each component must keep its own LICENSE declaration. If you only use it and do not redistribute, none of this applies to you.

元件 | 授權 | 再散布要附
audreyt/pi-ds4(本擴充原始碼)| MIT | 本 repo 的 LICENSE
audreyt/ds4 / antirez/ds4(推論引擎原始碼)| MIT | 該 repo 的 LICENSE(antirez 原始版權人)
DeepSeek-V4-Flash 上游權重(原始模型 checkpoint)| MIT | 模型卡 / HF repo 的 LICENSE(DeepSeek 版權人)
cyberneurova abliterated GGUF(本指南實際下載的權重)| MIT(inherits)| 該 HF repo 的 LICENSE + 一筆「derivative of DeepSeek-V4-Flash」標示
本指南 HTML 文本(你正在讀的 explainer.zh-tw.html)| CC0(public domain)| 什麼都不必附
Component | Licence | Required on redistribution
audreyt/pi-ds4 (this extension's source) | MIT | this repo's LICENSE
audreyt/ds4 / antirez/ds4 (inference engine source) | MIT | that repo's LICENSE (antirez as original copyright holder)
DeepSeek-V4-Flash upstream weights (original model checkpoint) | MIT | the model card / HF repo's LICENSE (DeepSeek as copyright holder)
cyberneurova abliterated GGUF (the weights this guide actually downloads) | MIT (inherits) | that HF repo's LICENSE plus a “derivative of DeepSeek-V4-Flash” notice
This guide's HTML text (the explainer.zh-tw.html you are reading) | CC0 (public domain) | nothing required

總結:四個元件的 LICENSE 各帶一份;指南文本可以任意翻譯/改寫/商用,不需署名。

Summary: ship the LICENSE for each of the four components; the guide text itself can be freely translated, rewritten, or used commercially, with no attribution required.

這份指南本身能轉載/改寫嗎?Can the guide itself be reposted or rewritten?

可以。本指南文本以 CC0 貢獻於公眾領域——你可以任意複製、翻譯、改寫、商用、再發布, 不用標註來源。

Yes. This guide's text is contributed to the public domain under CC0 — you may copy, translate, rewrite, use commercially, and republish it however you like, with no attribution required.


字 彙 表Glossary

名詞對照Terms, cross-referenced

在這份指南反覆出現、又對非工程背景讀者較陌生的詞,集中在這裡。 已熟悉的人可以跳過。

Words that recur in this guide and may be unfamiliar to non-engineering readers are collected here. Those already familiar can skip ahead.

模型與權重Model and weights

parameter參數
神經網路裡可訓練的數字。模型「腦容量」的粗略指標。本指南裡的 DeepSeek V4 Flash 有 284 B(2,840 億)個。
A trainable number inside a neural network. A rough proxy for the model's “brain capacity”. The DeepSeek V4 Flash in this guide has 284 B (284 billion) of them.
MoE混合專家模型
Mixture-of-Experts。模型內部分成多個「專家」子網路,每個 token 只啟動其中一小部分——本模型每 token 約啟用 13 B 參數,等同跑 13 B 模型的速度,但有 284 B 的知識。
Mixture-of-Experts. The model is split into multiple “expert” sub-networks, with only a small subset activated per token — this model activates around 13 B parameters per token, with the speed of a 13 B model but the knowledge of a 284 B one.
quantization量化
把每個權重從 16-bit 浮點壓縮到更少位元(這裡的 IQ2XXS-w2Q2K imatrix 平均約 2 bit)。檔案變小、推論變快,代價是少量精度損失。
Compressing each weight from 16-bit floating point to fewer bits (here, the IQ2XXS-w2Q2K imatrix recipe averages about 2 bits). The file gets smaller and inference faster, at the cost of some precision.
GGUF模型檔格式
llama.cpp 系列引擎使用的單檔模型格式(檔名 .gguf),把權重、結構、tokenizer 全部打包在一個檔案裡。本指南的 GGUF 約 87 GB。
The single-file model format used by llama.cpp-family engines (extension .gguf), which packages weights, architecture, and tokenizer into one file. The GGUF in this guide is around 87 GB.
abliteration拒絕方向消除
找出模型內部觸發「拒絕回應」的特定方向,把那個方向從前向計算中減去。鬆動過度拒絕;不是越獄,也不是安全保證。
Identifying the specific internal direction that triggers “refuse to respond”, and subtracting it from the forward pass. Loosens over-refusal; it is neither a jailbreak nor a safety guarantee.
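A quick sanity check on the quantization and GGUF figures above: 87 GB spread over 284 B weights comes out to roughly 2.45 bits per weight, consistent with an "about 2 bits on average" recipe (the extra fraction plausibly covers higher-precision tensors and metadata):

```python
PARAMS = 284e9      # total MoE parameters
FILE_BYTES = 87e9   # IQ2XXS imatrix GGUF size

bits_per_weight = FILE_BYTES * 8 / PARAMS
print(round(bits_per_weight, 2))  # ≈ 2.45
```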

推論與硬體Inference and hardware

token詞元
模型處理文字的基本單位。中文約一個漢字對應 1~3 個 token;英文約一個單字對應 1~2 個 token。
The unit a model processes text in. A Chinese character maps to roughly 1 to 3 tokens; an English word, 1 to 2 tokens.
prefill / decode預填/逐字產生
推論的兩個階段。Prefill:一次性吃下整段輸入 prompt;Decode:逐 token 產生回應。Prefill 通常快得多,本指南 M5 上的參考值是 prefill ≈ 440 t/s、decode ≈ 30 t/s——後者才是你在長對話中實際感受到的回應速度。
The two stages of inference. Prefill: ingest the whole input prompt at once. Decode: produce a response one token at a time. Prefill is typically far faster; this guide’s M5 reference figures are prefill ≈ 440 t/s and decode ≈ 30 t/s — the latter is what you actually feel during a long conversation.
context window上下文視窗
模型一次可看見的 token 上限。本擴充設定為 100,000 token(約一本中篇小說),但 abliterated 權重在超過 32k tokens 的長文行為尚未被官方驗證,請當實驗看待。
The maximum number of tokens the model can see in one pass. This extension sets it to 100,000 tokens (roughly a novella), but the abliterated weights have not been officially validated beyond 32k tokens — treat anything above that as experimental.
KV cache鍵值快取
推論時暫存的中間結果。長對話會累積大量 KV cache;本擴充允許溢出到磁碟(上限 8 GB)。
Intermediate state cached during inference. Long conversations accumulate a lot of KV cache; this extension allows it to spill to disk (up to 8 GB).
unified memory統一記憶體
Apple Silicon 把 CPU 與 GPU 共用同一塊 RAM;不像 NVIDIA 顯卡需要把資料從 RAM 搬到 VRAM。這是 Mac 能跑大型模型的關鍵。
Apple Silicon shares one block of RAM between CPU and GPU, unlike NVIDIA cards where data has to move from RAM to VRAM. This is the key reason Macs can run large models.
Metal 4 / MPPApple GPU API
Metal 是 Apple 的 GPU 程式介面;Metal 4 引入 MPP(Multi-Pass Pipeline),允許 GPU 在單一 command buffer 內做多階段運算,提升大型模型 prefill 吞吐。
Metal is Apple's GPU programming interface; Metal 4 introduces MPP (Multi-Pass Pipeline), letting the GPU perform multi-stage compute within a single command buffer and lifting prefill throughput on large models.
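The prefill and decode figures above combine into felt latency in a simple way; a back-of-envelope helper using this guide's M5 reference numbers:

```python
PREFILL_TPS = 440.0  # tokens/s while ingesting the prompt (M5 reference)
DECODE_TPS = 30.0    # tokens/s while generating the reply (M5 reference)

def latency_s(prompt_tokens, reply_tokens):
    """Rough end-to-end seconds for one request, ignoring fixed overhead."""
    return prompt_tokens / PREFILL_TPS + reply_tokens / DECODE_TPS

# A 10,000-token prompt with a 600-token reply:
# 10000/440 + 600/30 ≈ 22.7 + 20.0 ≈ 43 seconds.
```

This is why decode throughput, not prefill, is what you actually feel in a long conversation.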

方向性引導與本分支Directional steering and this fork

directional steering方向性引導
執行期間在模型的特定層做低秩活化編輯,把模型推向(或推離)某個「方向向量」所代表的回應特徵。不需要重訓。
A low-rank activation edit at specific layers during runtime, pushing the model towards (or away from) the response trait that a given direction vector represents. No retraining needed.
FFN / Attention前饋/注意力
Transformer 內的兩個主要計算組件。本分支預設只在 FFN 輸出端引導(ffn=-2),attention 端關閉(attn=0)。
The two main compute components inside a Transformer. This fork steers only at the FFN output by default (ffn=-2) and leaves the attention side off (attn=0).
uncertainty direction不確定性方向
本分支內附的引導向量,由 100 個有爭議提示對比 100 個確定提示建出。讓模型對爭議性問題進入「鋪陳」而非「給定」的回應暫存器。
The steering vector bundled with this fork, built by contrasting 100 contested prompts against 100 settled prompts. It moves the model into a “lay it out” rather than “hand it down” register on contested questions.
response register回應暫存器
模型回應的整體風格/模式:hedge vs assert、條列 vs 敘事、學術 vs 對話等。比「立場」更可被引導。
The overall style or mode of a model's response: hedge vs. assert, bullets vs. narrative, academic vs. conversational. Far more amenable to steering than “stance”.

系統與生命週期System and lifecycle

lease租約檔
每個使用 ds4 的 pi process 在 ~/.pi/ds4/clients/<pid>.json 寫入的存在證明;每 10 秒更新一次。
A proof-of-presence file each pi process using ds4 writes at ~/.pi/ds4/clients/<pid>.json; refreshed every 10 seconds.
watchdog看門狗
背景常駐的小 shell script,每 2 秒掃描 lease;沒有有效 lease 時自動關閉 ds4-server。
A small background shell script that scans the leases every 2 seconds and closes ds4-server automatically when no valid lease remains.
system prompt系統提示
對話開始時餵給模型的角色/規則設定,例如「你是一位中立的政治分析師」。本分支的方向性引導需要搭配適當 system prompt 才有效果。
The role / rule setup fed to the model at the start of a conversation, e.g. “You are a neutral political analyst”. This fork's directional steering needs an appropriate system prompt to take effect.
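The lease and watchdog entries above describe a simple freshness contract over a directory of JSON files. A toy sketch of that logic, where the 30-second staleness cutoff is an assumption (the real watchdog's exact threshold is not stated here):

```python
import glob
import os
import time

def live_leases(lease_dir, stale_after=30.0, now=None):
    """Lease files (one per pi process) refreshed recently enough to count as alive."""
    now = time.time() if now is None else now
    return [path for path in glob.glob(os.path.join(lease_dir, "*.json"))
            if now - os.path.getmtime(path) < stale_after]

def watchdog_should_stop(lease_dir):
    """The watchdog shuts ds4-server down once no valid lease remains."""
    return not live_leases(lease_dir)
```

With leases refreshed every 10 seconds and the directory scanned every 2, any cutoff comfortably above 10 seconds behaves the same way.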

附 錄Coda

致謝、授權、後話Acknowledgements, licence, afterword

致謝Acknowledgements

授權Licence

原始碼採用 MIT,與上游一致。請見專案 LICENSE。本指南文本則貢獻於 Public Domain(CC0)。

Source code is MIT-licensed, matching upstream. See the project's LICENSE. The text of this guide is contributed to the public domain under CC0.

後話Afterword

把前沿模型放到自己機器上跑,從技術上講,是把一段運算搬到本地; 從政治上講,是把一個方向盤從別人手上接過來。本擴充預設啟用 不確定性引導——並不是要替你決定任何問題的答案, 而是要在那些「被訓練封死」的提問前,把討論空間還給你。

Running a frontier model on your own machine, technically speaking, moves a piece of computation onto local hardware; politically speaking, it takes the steering wheel back from someone else's hands. This extension defaults to uncertainty steering — not to decide any question's answer for you, but to return the space for discussion in front of questions where training has nailed the door shut.

模型可以閉合一個答案,使用者可以打開一個討論。
這份指南完成它任務的時刻,是當你已經不需要它的時候。

A model can close a question on an answer; the user can open it back into a discussion.
The moment this guide finishes its work is the moment you no longer need it.