Ch3nyang's blog web_stories

post

assignment_ind

about

data_object

github

local_offer

tag

rss_feed

rss

git diff 是怎么实现的?

假设有两个文件 \(A\) 和 \(B\),它们分别有 \(N=7\) 行和 \(M=6\) 行:

A
B
C
A
B
B
A
C
B
A
B
A
C

我们可以将这两个文件看作是一个 \(N \times M\) 的网格:

grid

从坐标 \((x, y)\) 出发,我们可以:

  • 向右移动到 \((x+1, y)\),表示在第一个文件中删除了第 \(x+1\) 行
  • 向下移动到 \((x, y+1)\),表示在第二个文件中添加了第 \(y+1\) 行
  • 向右下移动到 \((x+1, y+1)\),表示两个文件的第 \(x+1\) 行和第 \(y+1\) 行相同

其中,第三种移动方式被称为 snake,它不会增加路径的长度。

我们的目标是从 \((0, 0)\) 移动到 \((N, M)\),并且使得路径上的向右移动和向下移动的总数最小化。

这一问题即为两个序列之间的最小编辑距离(Levenshtein 距离)来解决。

DP

你也许已经看出,这个问题最简单的解法是使用动态规划:

\[dp[i][j] = \begin{cases} \min\left(dp[i-1][j], dp[i][j-1]\right) + 1 & A[i] \neq B[j] \\ dp[i-1][j-1] & A[i] = B[j] \end{cases}\]

边界条件为:

\[\begin{aligned} dp[0][0] &= 0 \\ dp[i][0] &= i \\ dp[0][j] &= j \end{aligned}\]

dp

这里给出 Python 代码实现:

def diff(A: list[str], B: list[str]) -> tuple[int, list[tuple[str, str, int | None, int | None]]]:
    """
    计算两个序列之间的最小编辑距离(Levenshtein距离)并返回详细的编辑路径。
    
    Args:
        A: 源序列
        B: 目标序列
    
    Returns:
        tuple: 包含两个元素的元组
            - int: 最小编辑距离(需要的最少操作数)
            - list: 编辑路径,每个元素是一个四元组 (操作, 元素值, A索引, B索引)
    
    编辑操作说明:
        '=': 元素相同,无需修改
        '-': 删除操作,A中存在但B中不存在
        '+': 插入操作,B中存在但A中不存在
    """
    N, M = len(A), len(B)
    dp = [[0] * (M + 1) for _ in range(N + 1)]
    for i in range(1, N + 1):
        dp[i][0] = i
    for j in range(1, M + 1):
        dp[0][j] = j
    for i in range(1, N + 1):
        for j in range(1, M + 1):
            if A[i - 1] == B[j - 1]:
                dp[i][j] = dp[i - 1][j - 1]
            else:
                if dp[i - 1][j] < dp[i][j - 1]:
                    dp[i][j] = dp[i - 1][j] + 1
                else:
                    dp[i][j] = dp[i][j - 1] + 1
    i, j = N, M
    path = []
    while i > 0 or j > 0:
        if i > 0 and j > 0 and A[i - 1] == B[j - 1]:
            path.append(('=', A[i - 1], i - 1, j - 1))
            i -= 1
            j -= 1
        elif j > 0 and (i == 0 or dp[i][j - 1] <= dp[i - 1][j]):
            path.append(('+', B[j - 1], None, j - 1))
            j -= 1
        else:
            path.append(('-', A[i - 1], i - 1, None))
            i -= 1
    path.reverse()
    return dp[N][M], path

由于实际的最短路径可能不止一条,因此在回溯路径时也有一些讲究。例如我们这里选择了优先向右移动(删除行)而不是向下移动(添加行)。这会影响最终的 diff 结果,但不会影响最小编辑距离。

如果需要相反的策略,可以将 dp[i][j - 1] <= dp[i - 1][j] 改为 dp[i][j - 1] < dp[i - 1][j]

下面是运行结果:

A = ["A", "B", "C", "A", "B", "B", "A"]
B = ["C", "B", "A", "B", "A", "C"]
distance, path = diff(A, B)
print("最小编辑距离:", distance)
print("diff:")
for op, val, i, j in path:
    if op == '=':
        print(f"  {val}")
    elif op == '-':
        print(f"- {val}")
    elif op == '+':
        print(f"+ {val}")
DP diff
最小编辑距离: 5
diff:
- A
- B
  C
- A
  B
+ A
  B
  A
+ C

很好!但是以上算法的时间复杂度和空间复杂度均为 \(O(NM)\)。我们能否做得更好呢?

Myers

Myers 算法提供了一个更高效的解法,时间复杂度为 \(O\left(N\log N + D^2\right)\),空间复杂度为 \(O(N)\),其中 \(D\) 是最小编辑距离(即路径上向右移动和向下移动的总数)。

我们定义 \(k\) 为路径上向右移动和向下移动的差值,即 \(k = x - y\)。可以证明,最短路径的 \(D\) 满足以下不等式:

\[D \geq |N - M|\]

我们可以作出如下的 \(k-D\) 坐标系:

myers

我们可以通过逐步增加 \(D\) 的值来寻找最短路径。对于每一个固定的 \(D\),我们可以计算出所有可能的 \(k\) 值,这些 \(k\) 值的范围是 \(-D \leq k \leq D\)。

我们用一个数组 \(V\) 来记录每一个 \(k\) 对应的最大 \(x\) 值。

初始时,\(V[1] = 0\),表示 \(D = 0\) 时,\(k = 0\) 对应的最大 \(x\) 值为 0。

我们选择其中 \(x\) 最大的方式,这样可以让我们尽可能多地进行 snake 移动。我们有:

  • 当 \(k = -D\) 时,说明已经删无可删,只能从 \((x, y-1)\) 向下移动一步,因此

    \[x = V[k+1]\]
  • 当 \(k = D\) 时,说明已经添无可添,只能从 \((x-1, y)\) 向右移动一步,因此

    \[x = V[k-1] + 1\]
  • 当 \(-D < k < D\) 时,我们可以从两种方式中选择 \(x\) 较大的那一种,因此

    \[x = \text{max}(V[k-1] + 1, V[k+1])\]

合并一下就能得到:

if k == -D or (k != D and V.get(k - 1) < V.get(k + 1)):
    x = V.get(k + 1)
else:
    x = V.get(k - 1) + 1

有了 \(x\),我们可以计算出 \(y\):

\[y = x - k\]

然后我们可以进行 snake 移动,直到 \(A[x] \neq B[y]\):

while x < N and y < M and A[x] == B[y]:
    x += 1
    y += 1

我们不断增加 \(D\) 的值,直到 \(V[N - M]\) 的值达到 \(N\),表示我们已经找到了从 \((0, 0)\) 到 \((N, M)\) 的最短路径。

下面是 Myers 算法的示意图:

myers2

myers3

下面是 Python 代码实现:

def myers_diff(A: list[str], B: list[str]) -> tuple[int, list[tuple[str, str, int | None, int | None]]]:
    """
    使用 Myers 算法计算两个序列之间的最小编辑距离(Levenshtein距离)并返回详细的编辑路径。
    """
    N, M = len(A), len(B)
    if N == 0:
        return M, [('+', B[j], None, j) for j in range(M)]
    if M == 0:
        return N, [('-', A[i], i, None) for i in range(N)]
    
    max_d = N + M
    V = {1: 0}
    trace = [V.copy()]
    
    for D in range(max_d + 1):
        new_V = {}
        for k in range(-D, D + 1, 2):
            if k == -D or (k != D and V.get(k - 1, -1) < V.get(k + 1, -1)):
                x = V.get(k + 1, 0)
            else:
                x = V.get(k - 1, 0) + 1
            y = x - k
            
            while x < N and y < M and A[x] == B[y]:
                x += 1
                y += 1
            
            new_V[k] = x
            
            if x >= N and y >= M:
                path = []
                x, y = N, M
                for d in range(D, -1, -1):
                    k = x - y
                    V_prev = trace[d]
                    
                    if k == -d or (k != d and V_prev.get(k - 1, -1) < V_prev.get(k + 1, -1)):
                        prev_k = k + 1
                    else:
                        prev_k = k - 1
                    
                    prev_x = V_prev.get(prev_k, 0)
                    prev_y = prev_x - prev_k
                    
                    while x > prev_x and y > prev_y:
                        path.append(('=', A[x - 1], x - 1, y - 1))
                        x -= 1
                        y -= 1
                    
                    if d > 0:
                        if x == prev_x:
                            path.append(('+', B[y - 1], None, y - 1))
                            y -= 1
                        else:
                            path.append(('-', A[x - 1], x - 1, None))
                            x -= 1
                
                path.reverse()
                return D, path
        
        V = new_V
        trace.append(V.copy())
    
    return max_d, []

下面是运行结果:

A = ["A", "B", "C", "A", "B", "B", "A"]
B = ["C", "B", "A", "B", "A", "C"]
distance, path = myers_diff(A, B)
print("最小编辑距离:", distance)
print("diff:")
for op, val, i, j in path:
    if op == '=':
        print(f"  {val}")
    elif op == '-':
        print(f"- {val}")
    elif op == '+':
        print(f"+ {val}")
Myers diff
最小编辑距离: 5
diff:
- A
- B
  C
- A
  B
+ A
  B
  A
+ C

讨论

性能比较

Myers 算法的时间复杂度为 \(O\left(N\log N + D^2\right)\),空间复杂度为 \(O(N)\)。在最坏情况下,\(D\) 可能接近 \(N + M\),此时时间复杂度退化为 \(O\left({(N + M)}^2\right)\),甚至比 DP 更差。我们可以编写代码来测试两种算法在不同情况下的性能差异:

Benchmark
N=100, M=100, similarity=0.9 | DP: 0.0008s (±0.0001) | Myers: 0.0001s (±0.0000)
N=100, M=100, similarity=0.8 | DP: 0.0008s (±0.0001) | Myers: 0.0002s (±0.0001)
N=100, M=100, similarity=0.7 | DP: 0.0008s (±0.0000) | Myers: 0.0003s (±0.0001)
N=100, M=100, similarity=0.6 | DP: 0.0008s (±0.0000) | Myers: 0.0006s (±0.0001)
N=100, M=100, similarity=0.5 | DP: 0.0008s (±0.0001) | Myers: 0.0008s (±0.0002)
N=100, M=100, similarity=0.4 | DP: 0.0008s (±0.0000) | Myers: 0.0011s (±0.0002)
N=100, M=100, similarity=0.3 | DP: 0.0008s (±0.0000) | Myers: 0.0013s (±0.0002)
N=100, M=100, similarity=0.2 | DP: 0.0008s (±0.0000) | Myers: 0.0015s (±0.0001)
N=100, M=100, similarity=0.1 | DP: 0.0008s (±0.0000) | Myers: 0.0017s (±0.0001)
N=200, M=200, similarity=0.9 | DP: 0.0031s (±0.0001) | Myers: 0.0002s (±0.0001)
N=200, M=200, similarity=0.8 | DP: 0.0031s (±0.0001) | Myers: 0.0006s (±0.0002)
N=200, M=200, similarity=0.7 | DP: 0.0031s (±0.0000) | Myers: 0.0013s (±0.0003)
N=200, M=200, similarity=0.6 | DP: 0.0032s (±0.0001) | Myers: 0.0021s (±0.0004)
N=200, M=200, similarity=0.5 | DP: 0.0032s (±0.0002) | Myers: 0.0032s (±0.0004)
N=200, M=200, similarity=0.4 | DP: 0.0032s (±0.0001) | Myers: 0.0042s (±0.0005)
N=200, M=200, similarity=0.3 | DP: 0.0032s (±0.0003) | Myers: 0.0052s (±0.0004)
N=200, M=200, similarity=0.2 | DP: 0.0032s (±0.0003) | Myers: 0.0060s (±0.0004)
N=200, M=200, similarity=0.1 | DP: 0.0032s (±0.0003) | Myers: 0.0066s (±0.0003)
N=300, M=300, similarity=0.9 | DP: 0.0074s (±0.0002) | Myers: 0.0004s (±0.0001)
N=300, M=300, similarity=0.8 | DP: 0.0075s (±0.0005) | Myers: 0.0013s (±0.0003)
N=300, M=300, similarity=0.7 | DP: 0.0076s (±0.0003) | Myers: 0.0028s (±0.0006)
N=300, M=300, similarity=0.6 | DP: 0.0077s (±0.0003) | Myers: 0.0045s (±0.0006)
N=300, M=300, similarity=0.5 | DP: 0.0078s (±0.0003) | Myers: 0.0072s (±0.0011)
N=300, M=300, similarity=0.4 | DP: 0.0078s (±0.0003) | Myers: 0.0097s (±0.0013)
N=300, M=300, similarity=0.3 | DP: 0.0079s (±0.0004) | Myers: 0.0119s (±0.0010)
N=300, M=300, similarity=0.2 | DP: 0.0080s (±0.0004) | Myers: 0.0141s (±0.0010)
N=300, M=300, similarity=0.1 | DP: 0.0079s (±0.0004) | Myers: 0.0156s (±0.0007)
N=400, M=400, similarity=0.9 | DP: 0.0140s (±0.0007) | Myers: 0.0007s (±0.0002)
N=400, M=400, similarity=0.8 | DP: 0.0141s (±0.0004) | Myers: 0.0024s (±0.0005)
N=400, M=400, similarity=0.7 | DP: 0.0145s (±0.0008) | Myers: 0.0050s (±0.0007)
N=400, M=400, similarity=0.6 | DP: 0.0147s (±0.0006) | Myers: 0.0085s (±0.0012)
N=400, M=400, similarity=0.5 | DP: 0.0152s (±0.0011) | Myers: 0.0127s (±0.0015)
N=400, M=400, similarity=0.4 | DP: 0.0151s (±0.0007) | Myers: 0.0176s (±0.0016)
N=400, M=400, similarity=0.3 | DP: 0.0150s (±0.0008) | Myers: 0.0221s (±0.0018)
N=400, M=400, similarity=0.2 | DP: 0.0149s (±0.0006) | Myers: 0.0258s (±0.0014)
N=400, M=400, similarity=0.1 | DP: 0.0150s (±0.0012) | Myers: 0.0283s (±0.0010)
N=500, M=500, similarity=0.9 | DP: 0.0227s (±0.0009) | Myers: 0.0010s (±0.0002)
N=500, M=500, similarity=0.8 | DP: 0.0231s (±0.0008) | Myers: 0.0036s (±0.0007)
N=500, M=500, similarity=0.7 | DP: 0.0234s (±0.0007) | Myers: 0.0077s (±0.0011)
N=500, M=500, similarity=0.6 | DP: 0.0239s (±0.0007) | Myers: 0.0131s (±0.0016)
N=500, M=500, similarity=0.5 | DP: 0.0243s (±0.0009) | Myers: 0.0206s (±0.0020)
N=500, M=500, similarity=0.4 | DP: 0.0244s (±0.0012) | Myers: 0.0283s (±0.0021)
N=500, M=500, similarity=0.3 | DP: 0.0244s (±0.0010) | Myers: 0.0354s (±0.0024)
N=500, M=500, similarity=0.2 | DP: 0.0243s (±0.0010) | Myers: 0.0417s (±0.0027)
N=500, M=500, similarity=0.1 | DP: 0.0243s (±0.0007) | Myers: 0.0452s (±0.0020)
N=600, M=600, similarity=0.9 | DP: 0.0337s (±0.0015) | Myers: 0.0015s (±0.0003)
N=600, M=600, similarity=0.8 | DP: 0.0342s (±0.0015) | Myers: 0.0053s (±0.0010)
N=600, M=600, similarity=0.7 | DP: 0.0348s (±0.0014) | Myers: 0.0109s (±0.0015)
N=600, M=600, similarity=0.6 | DP: 0.0360s (±0.0017) | Myers: 0.0198s (±0.0020)
N=600, M=600, similarity=0.5 | DP: 0.0360s (±0.0010) | Myers: 0.0302s (±0.0028)
N=600, M=600, similarity=0.4 | DP: 0.0364s (±0.0020) | Myers: 0.0407s (±0.0034)
N=600, M=600, similarity=0.3 | DP: 0.0359s (±0.0012) | Myers: 0.0513s (±0.0031)
N=600, M=600, similarity=0.2 | DP: 0.0363s (±0.0016) | Myers: 0.0614s (±0.0040)
N=600, M=600, similarity=0.1 | DP: 0.0362s (±0.0014) | Myers: 0.0666s (±0.0033)

Benchmark Results

可以看出,当两个序列非常相似时,Myers 算法显著优于 DP 算法;而当两个序列差异较大时,Myers 算法的性能会下降到不如 DP 算法。不过,从整体来看,Myers 算法在大多数情况下仍然优于 DP 算法。

可读性优化

我们看如下两种 diff 结果:

for (int i = 0; i < n; i++) {
    process1(i);
}
+for (int i = 0; i < n; i++) {
+    process2(i);
+}
for (int i = 0; i < n; i++) {
    process1(i);
+}
+for (int i = 0; i < n; i++) {
+    process2(i);
}

还有这一组:

if (isSocketReady()) {
-    sendDataPart1();
-    sendDataPart2();
+    sendDataPartA();
+    sendDataPartB();
}
if (isSocketReady()) {
-    sendDataPart1();
+    sendDataPartA();
-    sendDataPart2();
+    sendDataPartB();
}

它们最终的结果是等价的,但在两组对比中,前者的 diff 结果从易读性上都要好得多。

Myers 算法并没有考虑到这一点,因此我们可以在 Myers 算法的基础上进行一些后处理来优化 diff 结果。

一种简单的做法是将连续的删除和添加操作合并为一个块,然后尝试在块的边界处进行调整,以减少不必要的删除和添加操作。

下面是优化后的代码实现:

def optimize_diff(path: list[tuple[str, str, int | None, int | None]]) -> list[tuple[str, str, int | None, int | None]]:
    optimized_path = []
    i = 0
    while i < len(path):
        if path[i][0] == '-':
            j = i
            while j + 1 < len(path) and path[j + 1][0] == '-':
                j += 1
            del_block = path[i:j + 1]
            i = j + 1
            if i < len(path) and path[i][0] == '+':
                add_block = path[i]
                optimized_path.append((del_block, add_block))
                i += 1
        else:
            optimized_path.append(path[i])
            i += 1
    return optimized_path