Transitioning to Multicore with ThreadSanitizer

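Version 5.0 brought multicore, Domain-based parallelism to the OCaml language. However, parallel Domains that perform uncoordinated operations on shared mutable memory locations can cause data races. Unfortunately, OCaml's strong type system does not (yet) catch such issues, so they may go unnoticed when parallelism is introduced into an existing OCaml code base. In this guide, we therefore walk through a step-by-step workflow that uses the ThreadSanitizer (TSan) tool to help get your OCaml code ready for 5.x.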
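To make the problem concrete, here is a minimal sketch of a data race, separate from the bank example used in the rest of this guide. The names `racy_count` and `atomic_count` are hypothetical, and the code assumes OCaml 5.x. Two Domains increment a shared counter: once through a plain `ref` (uncoordinated, hence racy) and once through an `Atomic`.

```ocaml
(* Hypothetical illustration, not part of the bank library. *)
let increments = 100_000

(* Racy: both Domains read and write [counter] without any coordination,
   so increments can be lost. *)
let racy_count () =
  let counter = ref 0 in
  let work () = for _ = 1 to increments do counter := !counter + 1 done in
  let d1 = Domain.spawn work in
  let d2 = Domain.spawn work in
  Domain.join d1;
  Domain.join d2;
  !counter

(* Race-free: [Atomic.incr] performs the read-modify-write as one step. *)
let atomic_count () =
  let counter = Atomic.make 0 in
  let work () = for _ = 1 to increments do Atomic.incr counter done in
  let d1 = Domain.spawn work in
  let d2 = Domain.spawn work in
  Domain.join d1;
  Domain.join d2;
  Atomic.get counter

let () =
  Printf.printf "racy:   %d (may be less than %d)\n"
    (racy_count ()) (2 * increments);
  Printf.printf "atomic: %d\n" (atomic_count ())
```

The racy version type-checks without any warning, which is exactly why a dynamic tool such as TSan is useful here.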

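Note: As of OCaml 5.2.0, TSan support is available on all Tier 1 architectures, on FreeBSD, Linux, and macOS. Building OCaml with TSan support requires at least GCC 11 or Clang 14 as your C compiler. Be aware that TSan data race reports produced with GCC 11 are known to have poor stack traces (no line numbers); this is fixed in GCC 12.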

An Example Application

Consider a small bank library with the following signature in bank.mli

type t
(** a collective type representing a bank *)

val init : num_accounts:int -> init_balance:int -> t
(** [init ~num_accounts ~init_balance] creates a bank with [num_accounts]
    accounts, each containing [init_balance]. *)

val transfer : t -> src_acc:int -> dst_acc:int -> amount:int -> unit
(** [transfer t ~src_acc ~dst_acc ~amount] moves [amount] from account
    [src_acc] to account [dst_acc].
    @raise Invalid_argument if amount is not positive,
    if [src_acc] and [dst_acc] are the same, or if [src_acc] contains
    insufficient funds. *)

val iter_accounts : t -> (account:int -> balance:int -> unit) -> unit
(** [iter_accounts t f] applies [f] to each account from [t]
    one after another. *)

Under the hood, the library could be implemented in various ways. Consider the following thread-unsafe implementation in bank.ml

type t = int array

let init ~num_accounts ~init_balance =
  Array.make num_accounts init_balance

let transfer t ~src_acc ~dst_acc ~amount =
  begin
    if amount <= 0 then raise (Invalid_argument "Amount has to be positive");
    if src_acc = dst_acc then raise (Invalid_argument "Cannot transfer to yourself");
    if t.(src_acc) < amount then raise (Invalid_argument "Not enough money on account");
    t.(src_acc) <- t.(src_acc) - amount;
    t.(dst_acc) <- t.(dst_acc) + amount;
  end

let iter_accounts t f = (* inspect the bank accounts *)
  Array.iteri (fun account balance -> f ~account ~balance) t
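Before adding any parallelism, it can help to confirm the sequential behaviour of the API. The following is a small self-contained sketch: it repeats the implementation above inside a local `Bank` module so that it runs on its own, and the `balances` helper is a hypothetical addition for illustration.

```ocaml
(* Copy of the thread-unsafe implementation, inlined to keep this runnable. *)
module Bank = struct
  type t = int array

  let init ~num_accounts ~init_balance = Array.make num_accounts init_balance

  let transfer t ~src_acc ~dst_acc ~amount =
    if amount <= 0 then raise (Invalid_argument "Amount has to be positive");
    if src_acc = dst_acc then raise (Invalid_argument "Cannot transfer to yourself");
    if t.(src_acc) < amount then raise (Invalid_argument "Not enough money on account");
    t.(src_acc) <- t.(src_acc) - amount;
    t.(dst_acc) <- t.(dst_acc) + amount

  let iter_accounts t f =
    Array.iteri (fun account balance -> f ~account ~balance) t
end

(* Hypothetical helper: collect the balances in account order. *)
let balances t =
  let acc = ref [] in
  Bank.iter_accounts t (fun ~account:_ ~balance -> acc := balance :: !acc);
  List.rev !acc

let () =
  let t = Bank.init ~num_accounts:3 ~init_balance:100 in
  Bank.transfer t ~src_acc:0 ~dst_acc:2 ~amount:30;
  (* Overdrawing is rejected and leaves the balances unchanged. *)
  (try Bank.transfer t ~src_acc:0 ~dst_acc:1 ~amount:1_000
   with Invalid_argument msg -> Printf.printf "rejected: %s\n" msg);
  List.iteri (Printf.printf "account %d: %d\n") (balances t)
```

Run sequentially, this reports the rejected transfer and then balances of 70, 100, and 130. Nothing here exercises concurrent access, so it says nothing yet about thread safety.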

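Proposed Workflow

Now, if we want to see whether this code is ready for multicore under OCaml 5.x, we can use the following workflow

0. Install TSan
1. Write a parallel test runner
2. Run the tests under TSan
3. If TSan complains about data races, address the reported issue and go back to step 2.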

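Following the Workflow

We will now go through the proposed workflow step by step for our example application.

Install the Instrumenting TSan Compiler (Step 0)

TSan has been included in OCaml since OCaml 5.2, but it must be enabled explicitly. You can install a TSan switch as follows (here we create a 5.2.0 switch named 5.2.0+tsan)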

opam switch create 5.2.0+tsan ocaml-variants.5.2.0+options ocaml-option-tsan

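Troubleshooting

If the step above fails while installing conf-unwind with the message No package 'libunwind' found, try setting the environment variable PKG_CONFIG_PATH to point to the location of libunwind.pc, for example PKG_CONFIG_PATH=/usr/lib/x86_64-linux-gnu/pkgconfig
If the step above fails with an error along the lines of FATAL: ThreadSanitizer: unexpected memory mapping 0x61a1a94b2000-0x61a1a94ca000, this is a known issue with older versions of TSan. It can be worked around by reducing ASLR entropy with sudo sysctl vm.mmap_rnd_bits=28.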

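Write a Parallel Test Runner (Step 1)

To begin with, we can test parallel use of our library by running two Domains in parallel. Here is a quick little test runner in bank_test.ml based on this idea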

let num_accounts = 7

let money_shuffle t = (* simulate an economy *)
  for i = 1 to 10 do
    Unix.sleepf 0.1 ; (* wait for a network request *)
    let src_acc = i mod num_accounts in
    let dst_acc = (i*3+1) mod num_accounts in
    try Bank.transfer t ~src_acc ~dst_acc ~amount:1 (* transfer $1 *)
    with Invalid_argument _ -> ()
  done

let print_balances t = (* inspect the bank accounts *)
  for _ = 1 to 12 do
    let sum = ref 0 in
    Bank.iter_accounts t
      (fun ~account ~balance -> Format.printf "%i %3i " account balance; sum := !sum + balance);
    Format.printf "  total = %i @." !sum;
    Unix.sleepf 0.1;
  done

let _ =
  let t = Bank.init ~num_accounts ~init_balance:100 in
  (* run the simulation and the debug view in parallel *)
  [| Domain.spawn (fun () -> money_shuffle t);
     Domain.spawn (fun () -> print_balances t);
  |]
  |> Array.iter Domain.join

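The runner creates a bank with 7 accounts, each containing $100, and then runs two loops in parallel:

One transfers money using money_shuffle
The other repeatedly prints the account balances using print_balances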

$ opam switch 5.2.0
$ opam exec -- dune runtest
0 100 1 100 2 100 3 100 4 100 5 100 6 100   total = 700
0 100 1 100 2 100 3 100 4 100 5 100 6 100   total = 700
0 100 1  99 2 100 3 100 4 101 5 100 6 100   total = 700
0 101 1  99 2  99 3 100 4 101 5 100 6 100   total = 700
0 101 1  99 2  99 3 100 4 100 5 100 6 101   total = 700
0 101 1  99 2 100 3 100 4 100 5  99 6 101   total = 700
0 101 1  99 2 100 3 100 4 100 5 100 6 100   total = 700
0 100 1 100 2 100 3 100 4 100 5 100 6 100   total = 700
0 100 1  99 2 100 3 100 4 101 5 100 6 100   total = 700
0 101 1  99 2  99 3 100 4 101 5 100 6 100   total = 700
0 101 1  99 2  99 3 100 4 101 5 100 6 100   total = 700
0 101 1  99 2  99 3 100 4 101 5 100 6 100   total = 700

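Judging from the run above under a regular 5.2.0 compiler, one might think that everything is fine: the balances sum to the expected $700, indicating that no money is lost.

Run the Parallel Tests Under TSan (Step 2)

Let us now run the same test under TSan. All it takes is switching to the TSan switch, as shown below, and TSan immediately complains about a race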

$ opam switch 5.2.0+tsan
$ opam exec -- dune runtest
File "test/dune", line 2, characters 7-16:
2 |  (name bank_test)
           ^^^^^^^^^
0 100 1 100 2 100 3 100 4 100 5 100 6 100   total = 700
0 100 1 100 2 100 3 100 4 100 5 100 6 100   total = 700
0 100 1  99 2 100 3 100 4 101 5 100 6 100   total = 700
0 101 1  99 2  99 3 100 4 101 5 100 6 100   total = 700
0 101 1  99 2  99 3 100 4 101 5 100 6 100   total = 700
0 101 1  99 2  99 3 100 4 100 5 100 6 101   total = 700
0 101 1  99 2 100 3 100 4 100 5  99 6 101   total = 700
0 101 1  99 2 100 3 100 4 100 5 100 6 100   total = 700
0 100 1 100 2 100 3 100 4 100 5 100 6 100   total = 700
0 100 1  99 2 100 3 100 4 101 5 100 6 100   total = 700
0 101 1  99 2  99 3 100 4 101 5 100 6 100   total = 700
0 101 1  99 2  99 3 100 4 101 5 100 6 100   total = 700
==================
WARNING: ThreadSanitizer: data race (pid=26148)
  Write of size 8 at 0x7f5b0c0fd6d8 by thread T4 (mutexes: write M85):
    #0 camlBank.transfer_322 lib/bank.ml:11 (bank_test.exe+0x6de4d)
    #1 camlDune__exe__Bank_test.money_shuffle_270 test/bank_test.ml:8 (bank_test.exe+0x6d7c5)
    #2 camlStdlib__Domain.body_703 /home/opam/.opam/5.2.0+tsan/.opam-switch/build/ocaml-variants.5.2.0+tsan/stdlib/domain.ml:202 (bank_test.exe+0xb06b0)
    #3 caml_start_program <null> (bank_test.exe+0x13fdfb)
    #4 caml_callback_exn runtime/callback.c:201 (bank_test.exe+0x106053)
    #5 domain_thread_func runtime/domain.c:1215 (bank_test.exe+0x10a2b1)

  Previous read of size 8 at 0x7f5b0c0fd6d8 by thread T1 (mutexes: write M81):
    #0 camlStdlib__Array.iteri_367 /home/opam/.opam/5.2.0+tsan/.opam-switch/build/ocaml-variants.5.2.0+tsan/stdlib/array.ml:136 (bank_test.exe+0xa0f36)
    #1 camlDune__exe__Bank_test.print_balances_496 test/bank_test.ml:15 (bank_test.exe+0x6d8f4)
    #2 camlStdlib__Domain.body_703 /home/opam/.opam/5.2.0+tsan/.opam-switch/build/ocaml-variants.5.2.0+tsan/stdlib/domain.ml:202 (bank_test.exe+0xb06b0)
    #3 caml_start_program <null> (bank_test.exe+0x13fdfb)
    #4 caml_callback_exn runtime/callback.c:201 (bank_test.exe+0x106053)
    #5 domain_thread_func runtime/domain.c:1215 (bank_test.exe+0x10a2b1)

  [...]

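Notice that we obtain a stack trace for each of the two racing accesses:

A write in one Domain, coming from the array assignment in Bank.transfer
A read in another Domain, coming from the call to Stdlib.Array.iteri that reads and prints the array entries in print_balances.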
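Why does this race matter when every printed total was $700? A transfer consists of two separate array writes, and an unsynchronised reader may run between them. The sketch below replays one such interleaving by hand in a single Domain, so it is deterministic; the function name `simulate_interleaving` is hypothetical, and the two-account array stands in for the bank.

```ocaml
(* Hand-written replay of one possible interleaving, in a single Domain. *)
let simulate_interleaving () =
  let t = [| 100; 100 |] in
  (* Writer: first half of a $1 transfer from account 0 to account 1. *)
  t.(0) <- t.(0) - 1;
  (* Reader: sums the balances while the transfer is only half done. *)
  let observed = Array.fold_left ( + ) 0 t in
  (* Writer: second half of the transfer. *)
  t.(1) <- t.(1) + 1;
  let final = Array.fold_left ( + ) 0 t in
  (observed, final)

let () =
  let observed, final = simulate_interleaving () in
  Printf.printf "reader saw total = %d, final total = %d\n" observed final
```

The reader sees a total of 199 even though the final total is 200. In the test runner the sleeps happen to keep the two loops apart, which is why the totals looked fine, but nothing in the code guarantees that.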

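Address the Reported Races and Rerun the Tests (Steps 3 and 2)

One way to address the reported race is to add a Mutex that ensures exclusive access to the underlying array. A first attempt could be to wrap transfer and iter_accounts in lock-unlock calls as follows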

let lock = Mutex.create () (* addition *)

let transfer t ~src_acc ~dst_acc ~amount =
  begin
    Mutex.lock lock; (* addition *)
    if amount <= 0 then raise (Invalid_argument "Amount has to be positive");
    if src_acc = dst_acc then raise (Invalid_argument "Cannot transfer to yourself");
    if t.(src_acc) < amount then raise (Invalid_argument "Not enough money on account");
    t.(src_acc) <- t.(src_acc) - amount;
    t.(dst_acc) <- t.(dst_acc) + amount;
    Mutex.unlock lock; (* addition *)
  end

let iter_accounts t f = (* inspect the bank accounts *)
  Mutex.lock lock; (* addition *)
  Array.iteri (fun account balance -> f ~account ~balance) t;
  Mutex.unlock lock (* addition *)

Rerunning our tests, we obtain

$ opam exec -- dune runtest
File "test/dune", line 2, characters 7-16:
2 |  (name bank_test)
           ^^^^^^^^^
0 100 1 100 2 100 3 100 4 100 5 100 6 100   total = 700
0 100 1 100 2 100 3 100 4 100 5 100 6 100   total = 700
0 100 1  99 2 100 3 100 4 101 5 100 6 100   total = 700
0 101 1  99 2  99 3 100 4 101 5 100 6 100   total = 700
Fatal error: exception Sys_error("Mutex.lock: Resource deadlock avoided")

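How can we hit a resource deadlock error after adding just two pairs of Mutex.lock and Mutex.unlock calls?

Address the Reported Races and Rerun the Tests, Take 2 (Steps 3 and 2)

Oh, wait! When an exception is raised in transfer, we forgot to unlock the Mutex again. In the test run, the transfer with src_acc equal to dst_acc (i = 3) raises Invalid_argument while holding the lock, so the next call to Mutex.lock from the same Domain fails. Let's adapt the function to validate its arguments before locking and to unlock before raising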

let transfer t ~src_acc ~dst_acc ~amount =
  begin
    if amount <= 0 then raise (Invalid_argument "Amount has to be positive");
    if src_acc = dst_acc then raise (Invalid_argument "Cannot transfer to yourself");
    Mutex.lock lock; (* addition *)
    if t.(src_acc) < amount
    then (Mutex.unlock lock; (* addition *)
          raise (Invalid_argument "Not enough money on account"));
    t.(src_acc) <- t.(src_acc) - amount;
    t.(dst_acc) <- t.(dst_acc) + amount;
    Mutex.unlock lock; (* addition *)
  end
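To see in isolation why the first attempt failed, the following sketch raises an exception between lock and unlock and then probes the Mutex with Mutex.try_lock. The function name `critical_section` and its `~fail` flag are hypothetical, and the code assumes OCaml 5.x, where Mutex is part of the standard library.

```ocaml
let lock = Mutex.create ()

(* Same shape as the first attempt: the unlock is skipped when we raise. *)
let critical_section ~fail =
  Mutex.lock lock;
  if fail then invalid_arg "simulated failure";
  Mutex.unlock lock

let () =
  (* The exception escapes while the mutex is still held. *)
  (try critical_section ~fail:true with Invalid_argument _ -> ());
  (* [Mutex.try_lock] returns false because the mutex was never released. *)
  Printf.printf "mutex free after the exception: %b\n" (Mutex.try_lock lock)
```

Calling Mutex.lock instead of Mutex.try_lock at this point, from the same Domain, is what produced the Sys_error in the test run.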

We can now rerun our tests under TSan to confirm the fix

$ opam exec -- dune runtest
0 100 1 100 2 100 3 100 4 100 5 100 6 100   total = 700
0 100 1 100 2 100 3 100 4 100 5 100 6 100   total = 700
0 100 1  99 2 100 3 100 4 101 5 100 6 100   total = 700
0 101 1  99 2  99 3 100 4 101 5 100 6 100   total = 700
0 101 1  99 2  99 3 100 4 101 5 100 6 100   total = 700
0 101 1  99 2 100 3 100 4 100 5  99 6 101   total = 700
0 101 1  99 2 100 3 100 4 100 5 100 6 100   total = 700
0 100 1 100 2 100 3 100 4 100 5 100 6 100   total = 700
0 100 1  99 2 100 3 100 4 101 5 100 6 100   total = 700
0 101 1  99 2  99 3 100 4 101 5 100 6 100   total = 700
0 101 1  99 2  99 3 100 4 101 5 100 6 100   total = 700
0 101 1  99 2  99 3 100 4 101 5 100 6 100   total = 700

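This works well and TSan no longer complains, so our little library is ready for OCaml 5.x parallelism, hurrah!

Final Remarks and a Word of Warning

The "always having to do something at the end" programming pattern that we ran into with the missing Mutex.unlock is a recurring one, and OCaml offers a dedicated function for it

 Fun.protect : finally:(unit -> unit) -> (unit -> 'a) -> 'a

Using Fun.protect, we could have written our final fix as follows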

let transfer t ~src_acc ~dst_acc ~amount =
  begin
    if amount <= 0 then raise (Invalid_argument "Amount has to be positive");
    if src_acc = dst_acc then raise (Invalid_argument "Cannot transfer to yourself");
    Mutex.lock lock; (* addition *)
    Fun.protect ~finally:(fun () -> Mutex.unlock lock) (* addition *)
      (fun () ->
         begin
           if t.(src_acc) < amount
           then raise (Invalid_argument "Not enough money on account");
           t.(src_acc) <- t.(src_acc) - amount;
           t.(dst_acc) <- t.(dst_acc) + amount;
         end)
  end
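The lock-then-protect pairing can also be factored into a small reusable helper. The sketch below defines a hypothetical `with_lock` function on top of Fun.protect and checks that the Mutex is released after both a normal return and an exception. Since OCaml 5.1 the standard library provides Mutex.protect for the same purpose; the last lines show it for comparison.

```ocaml
let lock = Mutex.create ()

(* Hypothetical helper: run [f] while holding [m], releasing [m] on any exit. *)
let with_lock m f =
  Mutex.lock m;
  Fun.protect ~finally:(fun () -> Mutex.unlock m) f

let () =
  (* Normal return: the result of [f] is passed through. *)
  let v = with_lock lock (fun () -> 41 + 1) in
  (* Exceptional exit: the mutex is released before the exception escapes. *)
  (try with_lock lock (fun () -> invalid_arg "simulated failure")
   with Invalid_argument _ -> ());
  let free = Mutex.try_lock lock in
  if free then Mutex.unlock lock;
  Printf.printf "result = %d, mutex free after exception: %b\n" v free;
  (* The standard library equivalent, available since OCaml 5.1. *)
  let w = Mutex.protect lock (fun () -> v + 1) in
  Printf.printf "Mutex.protect result = %d\n" w
```

With such a helper, each critical section in the bank library becomes a single call, which makes it harder to forget an unlock on an error path.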

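Admittedly, using a Mutex to ensure exclusive access may be a bit heavy if performance is a concern. If that is the case, one option is to replace the underlying array with a lock-free data structure, such as the Hashtbl from Kcas_data.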
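As a rough illustration of the lock-free direction, here is a sketch that uses only the standard library's Atomic module rather than Kcas_data (a deliberate substitution, to keep the example free of dependencies). Each balance lives in its own atomic cell, and the withdrawal uses a compare-and-set retry loop. This is an assumption-laden alternative, not the guide's recommended implementation: it has no data races, but a reader running during a transfer may still observe money "in flight" between the withdrawal and the deposit.

```ocaml
type t = int Atomic.t array

(* [Array.init] gives every account its own cell; [Array.make] would share one. *)
let init ~num_accounts ~init_balance =
  Array.init num_accounts (fun _ -> Atomic.make init_balance)

let transfer t ~src_acc ~dst_acc ~amount =
  if amount <= 0 then invalid_arg "Amount has to be positive";
  if src_acc = dst_acc then invalid_arg "Cannot transfer to yourself";
  let src = t.(src_acc) and dst = t.(dst_acc) in
  (* Retry until the balance check and the withdrawal happen atomically. *)
  let rec withdraw () =
    let balance = Atomic.get src in
    if balance < amount then invalid_arg "Not enough money on account";
    if not (Atomic.compare_and_set src balance (balance - amount))
    then withdraw ()
  in
  withdraw ();
  ignore (Atomic.fetch_and_add dst amount)

let iter_accounts t f =
  Array.iteri (fun account cell -> f ~account ~balance:(Atomic.get cell)) t

(* Hypothetical helper: sum of all balances. *)
let total t =
  let sum = ref 0 in
  iter_accounts t (fun ~account:_ ~balance -> sum := !sum + balance);
  !sum

let () =
  let t = init ~num_accounts:2 ~init_balance:100 in
  let shuffle ~src_acc ~dst_acc () =
    for _ = 1 to 50 do transfer t ~src_acc ~dst_acc ~amount:1 done in
  let d1 = Domain.spawn (shuffle ~src_acc:0 ~dst_acc:1) in
  let d2 = Domain.spawn (shuffle ~src_acc:1 ~dst_acc:0) in
  Domain.join d1;
  Domain.join d2;
  Printf.printf "total after both Domains finished = %d\n" (total t)
```

Whether the weaker consistency of iter_accounts is acceptable depends on the application; if a consistent snapshot is required, the Mutex version or a transactional structure is the safer choice.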

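As a final word of warning, Domains are so fast that in a test runner that is too simple, one Domain may finish before the second has even started! This is problematic, because there will be no apparent parallelism for TSan to observe and check. In the example above, the calls to Unix.sleepf help ensure that the test runner is indeed parallel. A useful alternative trick is to coordinate on an Atomic to make sure that both Domains are up and running before the parallel test code proceeds. To do so, we can adapt our parallel test runner as follows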

let _ =
  let wait = Atomic.make 2 in
  let t = Bank.init ~num_accounts ~init_balance:100 in
  (* run the simulation and the debug view in parallel *)
  [| Domain.spawn (fun () ->
         Atomic.decr wait; while Atomic.get wait > 0 do () done; money_shuffle t);
     Domain.spawn (fun () ->
         Atomic.decr wait; while Atomic.get wait > 0 do () done; print_balances t);
  |]
  |> Array.iter Domain.join
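The same start-up barrier can be packaged as a helper for an arbitrary number of tasks. This is a sketch under stated assumptions: `run_in_parallel` is a hypothetical name, and it assumes the number of tasks stays within the number of Domains the runtime allows. Inside the spin loop it calls Domain.cpu_relax, which hints to the processor that the Domain is busy-waiting.

```ocaml
(* Hypothetical helper: spawn one Domain per task, release them together,
   and return the results in task order. *)
let run_in_parallel tasks =
  let wait = Atomic.make (List.length tasks) in
  tasks
  |> List.map (fun task ->
         Domain.spawn (fun () ->
             (* Announce arrival, then spin until every Domain has arrived. *)
             Atomic.decr wait;
             while Atomic.get wait > 0 do Domain.cpu_relax () done;
             task ()))
  |> List.map Domain.join

let () =
  let results = run_in_parallel [ (fun () -> 1); (fun () -> 2) ] in
  List.iter (Printf.printf "%d ") results;
  print_newline ()
```

In the bank test, the two tasks would be `fun () -> money_shuffle t` and `fun () -> print_balances t`.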

With that warning in mind and TSan in hand, you should now be ready to hunt for data races.
