Ch3nyang's blog | 算法笔记 (2) -git diff 是如何实现的？

你是否好奇过 git diff 是如何高效地比较文件并精确显示变化的？这个看似简单的操作背后，隐藏着相当精妙的算法设计。

假设有两个文件 \(A\) 和 \(B\)，它们分别有 \(N=7\) 行和 \(M=6\) 行：

Text

 A     B
===   ===
 A     C
 B     B
 C     A
 A     B
 B     A
 B     C
 A

我们可以将这个比较问题抽象为在一个 \(N \times M\) 网格中寻找路径：

grid

从任意坐标 \((x, y)\) 出发，我们有三种可能的移动方式：

向右移动到 \((x+1, y)\) — 表示删除第一个文件的第 \(x+1\) 行
向下移动到 \((x, y+1)\) — 表示添加第二个文件的第 \(y+1\) 行
对角线移动到 \((x+1, y+1)\) — 表示两个文件的对应行内容相同（无需编辑）

对角线移动在 diff 算法中被称为”snake”，关键在于它不会增加我们的编辑距离。

我们的目标是找到一条从 \((0, 0)\) 到 \((N, M)\) 的路径，使得水平移动和垂直移动的总数最小——也就是最小化编辑距离。

这本质上就是经典的最小编辑距离问题，也称为计算两个序列间的 Levenshtein 距离。

动态规划方法

对于这个问题，最直观的解法是使用动态规划：

\[dp[i][j] = \begin{cases} \min\left(dp[i-1][j], dp[i][j-1]\right) + 1 & A[i] \neq B[j] \\ dp[i-1][j-1] & A[i] = B[j] \end{cases}\]

边界条件为：

\[\begin{aligned} dp[0][0] &= 0 \\ dp[i][0] &= i \\ dp[0][j] &= j \end{aligned}\]

这里给出 Python 代码实现：

Python

def diff(A: list[str], B: list[str]) -> tuple[int, list[tuple[str, str, int | None, int | None]]]:
    """
    计算两个序列之间的最小编辑距离（Levenshtein距离）并返回详细的编辑路径。
    
    Args:
        A: 源序列
        B: 目标序列
    
    Returns:
        tuple: 包含两个元素的元组
            - int: 最小编辑距离（需要的最少操作数）
            - list: 编辑路径，每个元素是一个四元组 (操作, 元素值, A索引, B索引)
    
    编辑操作说明:
        '=': 元素相同，无需修改
        '-': 删除操作，A中存在但B中不存在
        '+': 插入操作，B中存在但A中不存在
    """
    N, M = len(A), len(B)
    dp = [[0] * (M + 1) for _ in range(N + 1)]
    for i in range(1, N + 1):
        dp[i][0] = i
    for j in range(1, M + 1):
        dp[0][j] = j
    for i in range(1, N + 1):
        for j in range(1, M + 1):
            if A[i - 1] == B[j - 1]:
                dp[i][j] = dp[i - 1][j - 1]
            else:
                if dp[i - 1][j] < dp[i][j - 1]:
                    dp[i][j] = dp[i - 1][j] + 1
                else:
                    dp[i][j] = dp[i][j - 1] + 1
    i, j = N, M
    path = []
    while i > 0 or j > 0:
        if i > 0 and j > 0 and A[i - 1] == B[j - 1]:
            path.append(('=', A[i - 1], i - 1, j - 1))
            i -= 1
            j -= 1
        elif j > 0 and (i == 0 or dp[i][j - 1] <= dp[i - 1][j]):
            path.append(('+', B[j - 1], None, j - 1))
            j -= 1
        else:
            path.append(('-', A[i - 1], i - 1, None))
            i -= 1
    path.reverse()
    return dp[N][M], path

当存在多条最优路径时，回溯过程需要仔细考虑优先级规则。在我们的实现中，当删除和添加操作的代价相等时，优先选择水平移动（删除操作）而不是垂直移动（添加操作）。这个选择会影响 diff 输出的视觉效果，但不会改变最小编辑距离。

注意： 如果希望优先添加而不是删除，只需将回溯逻辑中的 dp[i][j - 1] <= dp[i - 1][j] 改为 dp[i][j - 1] < dp[i - 1][j]。

下面是运行结果：

A = ["A", "B", "C", "A", "B", "B", "A"]
B = ["C", "B", "A", "B", "A", "C"]
distance, path = diff(A, B)
print("最小编辑距离:", distance)
print("diff:")
for op, val, i, j in path:
    if op == '=':
        print(f"  {val}")
    elif op == '-':
        print(f"- {val}")
    elif op == '+':
        print(f"+ {val}")

DP diff

最小编辑距离: 5
diff:
- A
- B
  C
- A
  B
+ A
  B
  A
+ C

这个方法完全正确！不过，我们的 DP 算法时间复杂度和空间复杂度都是 \(O(NM)\)。对于大文件来说，这可能会变得相当昂贵。我们能否做得更好呢？

Myers 算法

Myers 算法提供了一个更高效的解决方案，时间复杂度为 \(O(N\log N + D^2)\)，空间复杂度为 \(O(N)\)，其中 \(D\) 是最小编辑距离。当文件相似度较高（\(D\) 较小）时，这比 DP 方法有显著优势。

关键思路是使用不同的坐标系统。我们定义对角线 \(k\) 为水平和垂直位置的差值：\(k = x - y\)。

由于我们需要恰好 \(N\) 次水平移动和 \(M\) 次垂直移动才能到达 \((N, M)\)，我们知道最终的对角线将是 \(k = N - M\)。而且，最小编辑距离 \(D\) 必须满足：

\[D \geq |N - M|\]

算法通过按编辑距离 \(D\) 递增的顺序探索路径，使用 \(k-D\) 坐标系：

myers-kd坐标系

我们不需要填充整个 \(N \times M\) 表格，而是逐步探索 \(D = 0, 1, 2, \ldots\) 的路径，直到到达目标。对于每个编辑距离 \(D\)，可能的对角线范围从 \(k = -D\) 到 \(k = D\)（步长为 2，因为每次编辑操作会使 \(k\) 改变 \(\pm 1\)）。

我们维护一个数组 \(V\)，其中 \(V[k]\) 存储在对角线 \(k\) 上使用恰好 \(D\) 次编辑操作能到达的最远 \(x\) 坐标。

初始时，\(V[1] = 0\)（这里有一个微妙的索引约定——我们从 \(D=0\) 时的 \(k=0\) 开始）。

策略是在每条对角线上最大化 \(x\)，这样我们就能尽可能多地进行”免费”的对角线移动（snake）。要用 \(D\) 次编辑到达对角线 \(k\)，我们可以从以下路径选择：

对角线 \(k+1\) 通过向下移动（插入）：\(x = V[k+1]\)
对角线 \(k-1\) 通过向右移动（删除）：\(x = V[k-1] + 1\)

选择取决于边界条件以及哪个选项能给我们更大的 \(x\)：

当 \(k = -D\)：必须来自 \(k+1\)（不能再向左），所以 \(x = V[k+1]\)
当 \(k = D\)：必须来自 \(k-1\)（不能再向右），所以 \(x = V[k-1] + 1\)
当 \(-D < k < D\)：选择能给出更大 \(x\) 值的那个

合并一下就能得到：

Python

if k == -D or (k != D and V.get(k - 1) < V.get(k + 1)):
    x = V.get(k + 1)
else:
    x = V.get(k - 1) + 1

有了 \(x\)，我们可以计算出 \(y\)：

\[y = x - k\]

然后我们可以进行 snake 移动，直到 \(A[x] \neq B[y]\)：

Python

while x < N and y < M and A[x] == B[y]:
    x += 1
    y += 1

我们继续这个过程，逐步增加 \(D\)，直到达到目标：当 \(V[N - M] \geq N\) 时，我们就找到了从 \((0, 0)\) 到 \((N, M)\) 的最短路径。

下面的示意图展示了 Myers 算法的执行过程：

myers-执行过程1

myers-执行过程2

下面是 Python 代码实现：

Python

def myers_diff(A: list[str], B: list[str]) -> tuple[int, list[tuple[str, str, int | None, int | None]]]:
    """
    使用 Myers 算法计算编辑距离。
    
    返回格式与 DP 版本相同，便于比较。
    """
    N, M = len(A), len(B)
    if N == 0:
        return M, [('+', B[j], None, j) for j in range(M)]
    if M == 0:
        return N, [('-', A[i], i, None) for i in range(N)]
    
    max_d = N + M
    V = {1: 0}
    trace = [V.copy()]
    
    for D in range(max_d + 1):
        new_V = {}
        for k in range(-D, D + 1, 2):
            if k == -D or (k != D and V.get(k - 1, -1) < V.get(k + 1, -1)):
                x = V.get(k + 1, 0)
            else:
                x = V.get(k - 1, 0) + 1
            y = x - k
            
            while x < N and y < M and A[x] == B[y]:
                x += 1
                y += 1
            
            new_V[k] = x
            
            if x >= N and y >= M:
                path = []
                x, y = N, M
                for d in range(D, -1, -1):
                    k = x - y
                    V_prev = trace[d]
                    
                    if k == -d or (k != d and V_prev.get(k - 1, -1) < V_prev.get(k + 1, -1)):
                        prev_k = k + 1
                    else:
                        prev_k = k - 1
                    
                    prev_x = V_prev.get(prev_k, 0)
                    prev_y = prev_x - prev_k
                    
                    while x > prev_x and y > prev_y:
                        path.append(('=', A[x - 1], x - 1, y - 1))
                        x -= 1
                        y -= 1
                    
                    if d > 0:
                        if x == prev_x:
                            path.append(('+', B[y - 1], None, y - 1))
                            y -= 1
                        else:
                            path.append(('-', A[x - 1], x - 1, None))
                            x -= 1
                
                path.reverse()
                return D, path
        
        V = new_V
        trace.append(V.copy())
    
    return max_d, []

下面是运行结果：

A = ["A", "B", "C", "A", "B", "B", "A"]
B = ["C", "B", "A", "B", "A", "C"]
distance, path = myers_diff(A, B)
print("最小编辑距离:", distance)
print("diff:")
for op, val, i, j in path:
    if op == '=':
        print(f"  {val}")
    elif op == '-':
        print(f"- {val}")
    elif op == '+':
        print(f"+ {val}")

Myers diff

最小编辑距离: 5
diff:
- A
- B
  C
- A
  B
+ A
  B
  A
+ C

性能分析

算法比较

虽然 Myers 算法在理论上有更好的复杂度——时间复杂度 \(O(N\log N + D^2)\) 和空间复杂度 \(O(N)\)，相比 DP 的 \(O(NM)\)——但实际性能很大程度上取决于输入文件的相似程度。

当 \(D\) 接近 \(N + M\)（完全不同的文件）时，Myers 的时间复杂度会退化到 \(O((N + M)^2)\)，可能比直接的 DP 方法更差。让我们在各种条件下对两种算法进行基准测试：

import random
import time
import matplotlib.pyplot as plt
import numpy as np

def generate_test_case(N: int, M: int, similarity: float) -> tuple[list[str], list[str]]:
    base = [chr(i) for i in range(65, 91)]  # A-Z
    A = [random.choice(base) for _ in range(N)]
    B = []
    for i in range(M):
        if random.random() < similarity and i < N:
            B.append(A[i])
        else:
            B.append(random.choice(base))
    return A, B
def benchmark():
    test_cases = []
    for size in [100, 200, 300, 400, 500, 600]:
        for similarity in [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1]:
            test_cases.append((size, size, similarity))
    
    runs_per_test = 100
    results = []
    
    for N, M, sim in test_cases:
        dp_times = []
        myers_times = []
        
        for run in range(runs_per_test):
            A, B = generate_test_case(N, M, sim)
            
            start = time.time()
            dp_distance, _ = diff(A, B)
            dp_time = time.time() - start
            dp_times.append(dp_time)
            
            start = time.time()
            myers_distance, _ = myers_diff(A, B)
            myers_time = time.time() - start
            myers_times.append(myers_time)
            
            assert dp_distance == myers_distance, "结果不一致"
        
        dp_time_avg = np.mean(dp_times)
        dp_time_std = np.std(dp_times)
        myers_time_avg = np.mean(myers_times)
        myers_time_std = np.std(myers_times)
        
        results.append((N, M, sim, dp_time_avg, myers_time_avg))
        print(f"N={N}, M={M}, similarity={sim:.1f} | "
              f"DP: {dp_time_avg:.4f}s (±{dp_time_std:.4f}) | "
              f"Myers: {myers_time_avg:.4f}s (±{myers_time_std:.4f})")
    
    plot_results(results)

def plot_results(results):
    sizes = sorted(set([r[0] for r in results]))
    similarities = sorted(set([r[2] for r in results]), reverse=True)
    
    fig = plt.figure(figsize=(12, 18))
    
    ax1 = plt.subplot(3, 2, 1)
    dp_by_size = {}
    myers_by_size = {}
    for size in sizes:
        size_results = [r for r in results if r[0] == size]
        dp_by_size[size] = np.mean([r[3] for r in size_results])
        myers_by_size[size] = np.mean([r[4] for r in size_results])
    
    x_pos = range(len(sizes))
    width = 0.35
    ax1.bar([x - width/2 for x in x_pos], [dp_by_size[s] for s in sizes], width, label='DP', alpha=0.8)
    ax1.bar([x + width/2 for x in x_pos], [myers_by_size[s] for s in sizes], width, label='Myers', alpha=0.8)
    ax1.set_xlabel('Sequence Length')
    ax1.set_ylabel('Average Time (s)')
    ax1.set_title('Performance by Sequence Length')
    ax1.set_xticks(x_pos)
    ax1.set_xticklabels(sizes)
    ax1.legend()
    ax1.grid(True, alpha=0.3)
    
    ax2 = plt.subplot(3, 2, 2)
    dp_by_sim = {}
    myers_by_sim = {}
    for sim in similarities:
        sim_results = [r for r in results if abs(r[2] - sim) < 0.01]
        dp_by_sim[sim] = np.mean([r[3] for r in sim_results])
        myers_by_sim[sim] = np.mean([r[4] for r in sim_results])
    
    x_pos = range(len(similarities))
    ax2.bar([x - width/2 for x in x_pos], [dp_by_sim[s] for s in similarities], width, label='DP', alpha=0.8)
    ax2.bar([x + width/2 for x in x_pos], [myers_by_sim[s] for s in similarities], width, label='Myers', alpha=0.8)
    ax2.set_xlabel('Similarity')
    ax2.set_ylabel('Average Time (s)')
    ax2.set_title('Performance by Similarity')
    ax2.set_xticks(x_pos)
    ax2.set_xticklabels([f'{s:.1f}' for s in similarities])
    ax2.legend()
    ax2.grid(True, alpha=0.3)
    
    ax3 = plt.subplot(3, 2, 3)
    ratios_by_size = {size: myers_by_size[size] / dp_by_size[size] for size in sizes}
    ax3.plot(sizes, [ratios_by_size[s] for s in sizes], marker='o', linewidth=2, markersize=8)
    ax3.axhline(y=1, color='r', linestyle='--', label='Equal Performance')
    ax3.set_xlabel('Sequence Length')
    ax3.set_ylabel('Myers Time / DP Time')
    ax3.set_title('Performance Ratio (Myers/DP) by Length')
    ax3.legend()
    ax3.grid(True, alpha=0.3)
    
    ax4 = plt.subplot(3, 2, 4)
    dp_heatmap = np.zeros((len(similarities), len(sizes)))
    for i, sim in enumerate(similarities):
        for j, size in enumerate(sizes):
            matching = [r for r in results if r[0] == size and abs(r[2] - sim) < 0.01]
            if matching:
                dp_heatmap[i, j] = matching[0][3]
    
    im1 = ax4.imshow(dp_heatmap, aspect='auto', cmap='YlOrRd')
    ax4.set_xticks(range(len(sizes)))
    ax4.set_xticklabels(sizes)
    ax4.set_yticks(range(len(similarities)))
    ax4.set_yticklabels([f'{s:.1f}' for s in similarities])
    ax4.set_xlabel('Sequence Length')
    ax4.set_ylabel('Similarity')
    ax4.set_title('DP Algorithm Performance (s)')
    plt.colorbar(im1, ax=ax4)
    
    ax5 = plt.subplot(3, 2, 5)
    myers_heatmap = np.zeros((len(similarities), len(sizes)))
    for i, sim in enumerate(similarities):
        for j, size in enumerate(sizes):
            matching = [r for r in results if r[0] == size and abs(r[2] - sim) < 0.01]
            if matching:
                myers_heatmap[i, j] = matching[0][4]
    
    im2 = ax5.imshow(myers_heatmap, aspect='auto', cmap='YlOrRd')
    ax5.set_xticks(range(len(sizes)))
    ax5.set_xticklabels(sizes)
    ax5.set_yticks(range(len(similarities)))
    ax5.set_yticklabels([f'{s:.1f}' for s in similarities])
    ax5.set_xlabel('Sequence Length')
    ax5.set_ylabel('Similarity')
    ax5.set_title('Myers Algorithm Performance (s)')
    plt.colorbar(im2, ax=ax5)
    
    ax6 = plt.subplot(3, 2, 6)
    ratio_heatmap = np.divide(myers_heatmap, dp_heatmap, where=dp_heatmap!=0)
    
    im3 = ax6.imshow(ratio_heatmap, aspect='auto', cmap='RdYlGn_r', vmin=0.5, vmax=1.5)
    ax6.set_xticks(range(len(sizes)))
    ax6.set_xticklabels(sizes)
    ax6.set_yticks(range(len(similarities)))
    ax6.set_yticklabels([f'{s:.1f}' for s in similarities])
    ax6.set_xlabel('Sequence Length')
    ax6.set_ylabel('Similarity')
    ax6.set_title('Performance Ratio (Myers/DP)')
    plt.colorbar(im3, ax=ax6)
    
    plt.tight_layout()
    plt.savefig('diff_algorithm_benchmark.png', dpi=300, bbox_inches='tight')
    plt.show()

benchmark()

Benchmark

N=100, M=100, similarity=0.9 | DP: 0.0008s (±0.0001) | Myers: 0.0001s (±0.0000)
N=100, M=100, similarity=0.8 | DP: 0.0008s (±0.0001) | Myers: 0.0002s (±0.0001)
N=100, M=100, similarity=0.7 | DP: 0.0008s (±0.0000) | Myers: 0.0003s (±0.0001)
N=100, M=100, similarity=0.6 | DP: 0.0008s (±0.0000) | Myers: 0.0006s (±0.0001)
N=100, M=100, similarity=0.5 | DP: 0.0008s (±0.0001) | Myers: 0.0008s (±0.0002)
N=100, M=100, similarity=0.4 | DP: 0.0008s (±0.0000) | Myers: 0.0011s (±0.0002)
N=100, M=100, similarity=0.3 | DP: 0.0008s (±0.0000) | Myers: 0.0013s (±0.0002)
N=100, M=100, similarity=0.2 | DP: 0.0008s (±0.0000) | Myers: 0.0015s (±0.0001)
N=100, M=100, similarity=0.1 | DP: 0.0008s (±0.0000) | Myers: 0.0017s (±0.0001)
N=200, M=200, similarity=0.9 | DP: 0.0031s (±0.0001) | Myers: 0.0002s (±0.0001)
N=200, M=200, similarity=0.8 | DP: 0.0031s (±0.0001) | Myers: 0.0006s (±0.0002)
N=200, M=200, similarity=0.7 | DP: 0.0031s (±0.0000) | Myers: 0.0013s (±0.0003)
N=200, M=200, similarity=0.6 | DP: 0.0032s (±0.0001) | Myers: 0.0021s (±0.0004)
N=200, M=200, similarity=0.5 | DP: 0.0032s (±0.0002) | Myers: 0.0032s (±0.0004)
N=200, M=200, similarity=0.4 | DP: 0.0032s (±0.0001) | Myers: 0.0042s (±0.0005)
N=200, M=200, similarity=0.3 | DP: 0.0032s (±0.0003) | Myers: 0.0052s (±0.0004)
N=200, M=200, similarity=0.2 | DP: 0.0032s (±0.0003) | Myers: 0.0060s (±0.0004)
N=200, M=200, similarity=0.1 | DP: 0.0032s (±0.0003) | Myers: 0.0066s (±0.0003)
N=300, M=300, similarity=0.9 | DP: 0.0074s (±0.0002) | Myers: 0.0004s (±0.0001)
N=300, M=300, similarity=0.8 | DP: 0.0075s (±0.0005) | Myers: 0.0013s (±0.0003)
N=300, M=300, similarity=0.7 | DP: 0.0076s (±0.0003) | Myers: 0.0028s (±0.0006)
N=300, M=300, similarity=0.6 | DP: 0.0077s (±0.0003) | Myers: 0.0045s (±0.0006)
N=300, M=300, similarity=0.5 | DP: 0.0078s (±0.0003) | Myers: 0.0072s (±0.0011)
N=300, M=300, similarity=0.4 | DP: 0.0078s (±0.0003) | Myers: 0.0097s (±0.0013)
N=300, M=300, similarity=0.3 | DP: 0.0079s (±0.0004) | Myers: 0.0119s (±0.0010)
N=300, M=300, similarity=0.2 | DP: 0.0080s (±0.0004) | Myers: 0.0141s (±0.0010)
N=300, M=300, similarity=0.1 | DP: 0.0079s (±0.0004) | Myers: 0.0156s (±0.0007)
N=400, M=400, similarity=0.9 | DP: 0.0140s (±0.0007) | Myers: 0.0007s (±0.0002)
N=400, M=400, similarity=0.8 | DP: 0.0141s (±0.0004) | Myers: 0.0024s (±0.0005)
N=400, M=400, similarity=0.7 | DP: 0.0145s (±0.0008) | Myers: 0.0050s (±0.0007)
N=400, M=400, similarity=0.6 | DP: 0.0147s (±0.0006) | Myers: 0.0085s (±0.0012)
N=400, M=400, similarity=0.5 | DP: 0.0152s (±0.0011) | Myers: 0.0127s (±0.0015)
N=400, M=400, similarity=0.4 | DP: 0.0151s (±0.0007) | Myers: 0.0176s (±0.0016)
N=400, M=400, similarity=0.3 | DP: 0.0150s (±0.0008) | Myers: 0.0221s (±0.0018)
N=400, M=400, similarity=0.2 | DP: 0.0149s (±0.0006) | Myers: 0.0258s (±0.0014)
N=400, M=400, similarity=0.1 | DP: 0.0150s (±0.0012) | Myers: 0.0283s (±0.0010)
N=500, M=500, similarity=0.9 | DP: 0.0227s (±0.0009) | Myers: 0.0010s (±0.0002)
N=500, M=500, similarity=0.8 | DP: 0.0231s (±0.0008) | Myers: 0.0036s (±0.0007)
N=500, M=500, similarity=0.7 | DP: 0.0234s (±0.0007) | Myers: 0.0077s (±0.0011)
N=500, M=500, similarity=0.6 | DP: 0.0239s (±0.0007) | Myers: 0.0131s (±0.0016)
N=500, M=500, similarity=0.5 | DP: 0.0243s (±0.0009) | Myers: 0.0206s (±0.0020)
N=500, M=500, similarity=0.4 | DP: 0.0244s (±0.0012) | Myers: 0.0283s (±0.0021)
N=500, M=500, similarity=0.3 | DP: 0.0244s (±0.0010) | Myers: 0.0354s (±0.0024)
N=500, M=500, similarity=0.2 | DP: 0.0243s (±0.0010) | Myers: 0.0417s (±0.0027)
N=500, M=500, similarity=0.1 | DP: 0.0243s (±0.0007) | Myers: 0.0452s (±0.0020)
N=600, M=600, similarity=0.9 | DP: 0.0337s (±0.0015) | Myers: 0.0015s (±0.0003)
N=600, M=600, similarity=0.8 | DP: 0.0342s (±0.0015) | Myers: 0.0053s (±0.0010)
N=600, M=600, similarity=0.7 | DP: 0.0348s (±0.0014) | Myers: 0.0109s (±0.0015)
N=600, M=600, similarity=0.6 | DP: 0.0360s (±0.0017) | Myers: 0.0198s (±0.0020)
N=600, M=600, similarity=0.5 | DP: 0.0360s (±0.0010) | Myers: 0.0302s (±0.0028)
N=600, M=600, similarity=0.4 | DP: 0.0364s (±0.0020) | Myers: 0.0407s (±0.0034)
N=600, M=600, similarity=0.3 | DP: 0.0359s (±0.0012) | Myers: 0.0513s (±0.0031)
N=600, M=600, similarity=0.2 | DP: 0.0363s (±0.0016) | Myers: 0.0614s (±0.0040)
N=600, M=600, similarity=0.1 | DP: 0.0362s (±0.0014) | Myers: 0.0666s (±0.0033)

Benchmark Results

结果清楚地表明，当序列高度相似（高相似度比率）时，Myers 表现出色，可以利用许多”snake”移动。然而，随着文件差异增大，Myers 的性能下降，实际上可能比直接的 DP 方法更慢。这很符合直觉：当 \(D\) 接近 \(N + M\) 时，\(D^2\) 因子占主导，抵消了 Myers 的优势。

关键要点： 对于典型的版本控制场景（文件渐进式变化），Myers 通常是更好的选择。对于比较完全不相关的文件，DP 可能更可预测。

提升 Diff 可读性

虽然两种算法都能在编辑距离方面产生最优结果，但它们并不总是生成最具人类可读性的 diff。考虑这两个等价的输出：

更可读：

Diff

for (int i = 0; i < n; i++) {
    process1(i);
}
+for (int i = 0; i < n; i++) {
+    process2(i);
+}

较不可读：

Diff

for (int i = 0; i < n; i++) {
    process1(i);
+}
+for (int i = 0; i < n; i++) {
+    process2(i);
}

类似地，考虑：

更可读（分组的变化）：

Diff

if (isSocketReady()) {
-    sendDataPart1();
-    sendDataPart2();
+    sendDataPartA();
+    sendDataPartB();
}

较不可读（交错的变化）：

Diff

if (isSocketReady()) {
-    sendDataPart1();
+    sendDataPartA();
-    sendDataPart2();
+    sendDataPartB();
}

两者达到相同的编辑距离，但第一个版本一眼就能理解得多。

这两种算法本质上都不会优化可读性，但我们可以应用后处理来改善输出。一个简单的方法是将连续的删除和添加操作分组，然后尝试移动块边界以创建更直观的变化模式。

下面是这种后处理的基本实现：

Python

def optimize_diff_readability(path: list[tuple[str, str, int | None, int | None]]) -> list[tuple[str, str, int | None, int | None]]:
    """
    后处理 diff 输出以提高可读性，通过分组相关变化
    并最小化交错操作来实现。
    """
    # 这是一个简化的例子 - 像 git diff 这样的真实实现
    # 使用更复杂的启发式算法
    optimized = []
    i = 0
    
    while i < len(path):
        if path[i][0] == '-':
            # 收集连续的删除操作
            deletions = []
            while i < len(path) and path[i][0] == '-':
                deletions.append(path[i])
                i += 1
            
            # 检查紧随其后的添加操作
            additions = []
            while i < len(path) and path[i][0] == '+':
                additions.append(path[i])
                i += 1
            
            # 添加分组的变化
            optimized.extend(deletions)
            optimized.extend(additions)
        else:
            optimized.append(path[i])
            i += 1
    
    return optimized

这只是冰山一角——像 Git 这样的生产级 diff 工具采用了更复杂的算法来优化可读性，包括：

Patience diff：Myers 的替代方案，往往为代码产生更直观的结果
Histogram diff：patience diff 的变种，具有更好的性能特征
词级和字符级差异：用于行内更细粒度的比较
语义感知：理解代码结构以进行更有意义的比较

总结

看似简单的 git diff 命令实际上蕴含着数十年的算法研究和优化成果。虽然动态规划方法提供了具有可预测 \(O(NM)\) 性能的坚实基础，但 Myers 算法为比较相似文件的常见情况提供了卓越的效率——这正是版本控制中遇到的场景。

算法的选择取决于你的具体用例：

使用 Myers 当文件可能相似时（版本控制、增量变化）
使用 DP 当你需要可预测的性能，无论文件相似度如何
考虑两者 并根据快速相似度估计动态选择

理解这些算法不仅揭开了我们最常用的开发工具之一的神秘面纱，还为更广泛的序列比较领域提供了洞察——其应用范围从生物信息学（DNA 序列比对）到抄袭检测等等。

下次你运行 git diff 时，你就会知道在这个看似简单的命令背后，隐藏着多么精妙的算法魔法！

算法笔记 | (2)

git diff 是如何实现的？

动态规划方法

Myers 算法

性能分析

算法比较

提升 Diff 可读性

总结

Series 算法笔记

Suggested Posts

Comments

算法笔记 | (2) git diff 是如何实现的？

动态规划方法

Myers 算法

性能分析

算法比较

提升 Diff 可读性

总结

Series 算法笔记

Suggested Posts

Comments

Share This Post

算法笔记 | (2)

git diff 是如何实现的？