git diff
是怎么实现的?
假设有两个文件 \(A\) 和 \(B\),它们分别有 \(N=7\) 行和 \(M=6\) 行:
A
B
C
A
B
B
A
C
B
A
B
A
C
我们可以将这两个文件看作是一个 \(N \times M\) 的网格:
从坐标 \((x, y)\) 出发,我们可以:
- 向右移动到 \((x+1, y)\),表示在第一个文件中删除了第 \(x+1\) 行
- 向下移动到 \((x, y+1)\),表示在第二个文件中添加了第 \(y+1\) 行
- 向右下移动到 \((x+1, y+1)\),表示两个文件的第 \(x+1\) 行和第 \(y+1\) 行相同
其中,第三种移动方式被称为 snake,它不会增加路径的长度。
我们的目标是从 \((0, 0)\) 移动到 \((N, M)\),并且使得路径上的向右移动和向下移动的总数最小化。
这一问题即为两个序列之间的最小编辑距离(Levenshtein 距离)来解决。
DP
你也许已经看出,这个问题最简单的解法是使用动态规划:
\[dp[i][j] = \begin{cases} \min\left(dp[i-1][j], dp[i][j-1]\right) + 1 & A[i] \neq B[j] \\ dp[i-1][j-1] & A[i] = B[j] \end{cases}\]边界条件为:
\[\begin{aligned} dp[0][0] &= 0 \\ dp[i][0] &= i \\ dp[0][j] &= j \end{aligned}\]这里给出 Python 代码实现:
def diff(A: list[str], B: list[str]) -> tuple[int, list[tuple[str, str, int | None, int | None]]]:
"""
计算两个序列之间的最小编辑距离(Levenshtein距离)并返回详细的编辑路径。
Args:
A: 源序列
B: 目标序列
Returns:
tuple: 包含两个元素的元组
- int: 最小编辑距离(需要的最少操作数)
- list: 编辑路径,每个元素是一个四元组 (操作, 元素值, A索引, B索引)
编辑操作说明:
'=': 元素相同,无需修改
'-': 删除操作,A中存在但B中不存在
'+': 插入操作,B中存在但A中不存在
"""
N, M = len(A), len(B)
dp = [[0] * (M + 1) for _ in range(N + 1)]
for i in range(1, N + 1):
dp[i][0] = i
for j in range(1, M + 1):
dp[0][j] = j
for i in range(1, N + 1):
for j in range(1, M + 1):
if A[i - 1] == B[j - 1]:
dp[i][j] = dp[i - 1][j - 1]
else:
if dp[i - 1][j] < dp[i][j - 1]:
dp[i][j] = dp[i - 1][j] + 1
else:
dp[i][j] = dp[i][j - 1] + 1
i, j = N, M
path = []
while i > 0 or j > 0:
if i > 0 and j > 0 and A[i - 1] == B[j - 1]:
path.append(('=', A[i - 1], i - 1, j - 1))
i -= 1
j -= 1
elif j > 0 and (i == 0 or dp[i][j - 1] <= dp[i - 1][j]):
path.append(('+', B[j - 1], None, j - 1))
j -= 1
else:
path.append(('-', A[i - 1], i - 1, None))
i -= 1
path.reverse()
return dp[N][M], path
由于实际的最短路径可能不止一条,因此在回溯路径时也有一些讲究。例如我们这里选择了优先向右移动(删除行)而不是向下移动(添加行)。这会影响最终的 diff 结果,但不会影响最小编辑距离。
如果需要相反的策略,可以将
dp[i][j - 1] <= dp[i - 1][j]
改为dp[i][j - 1] < dp[i - 1][j]
下面是运行结果:
A = ["A", "B", "C", "A", "B", "B", "A"]
B = ["C", "B", "A", "B", "A", "C"]
distance, path = diff(A, B)
print("最小编辑距离:", distance)
print("diff:")
for op, val, i, j in path:
if op == '=':
print(f" {val}")
elif op == '-':
print(f"- {val}")
elif op == '+':
print(f"+ {val}")
最小编辑距离: 5 diff: - A - B C - A B + A B A + C
很好!但是以上算法的时间复杂度和空间复杂度均为 \(O(NM)\)。我们能否做得更好呢?
Myers
Myers 算法提供了一个更高效的解法,时间复杂度为 \(O\left(N\log N + D^2\right)\),空间复杂度为 \(O(N)\),其中 \(D\) 是最小编辑距离(即路径上向右移动和向下移动的总数)。
我们定义 \(k\) 为路径上向右移动和向下移动的差值,即 \(k = x - y\)。可以证明,最短路径的 \(D\) 满足以下不等式:
\[D \geq |N - M|\]我们可以作出如下的 \(k-D\) 坐标系:
我们可以通过逐步增加 \(D\) 的值来寻找最短路径。对于每一个固定的 \(D\),我们可以计算出所有可能的 \(k\) 值,这些 \(k\) 值的范围是 \(-D \leq k \leq D\)。
我们用一个数组 \(V\) 来记录每一个 \(k\) 对应的最大 \(x\) 值。
初始时,\(V[1] = 0\),表示 \(D = 0\) 时,\(k = 0\) 对应的最大 \(x\) 值为 0。
我们选择其中 \(x\) 最大的方式,这样可以让我们尽可能多地进行 snake 移动。我们有:
-
当 \(k = -D\) 时,说明已经删无可删,只能从 \((x, y-1)\) 向下移动一步,因此
\[x = V[k+1]\] -
当 \(k = D\) 时,说明已经添无可添,只能从 \((x-1, y)\) 向右移动一步,因此
\[x = V[k-1] + 1\] -
当 \(-D < k < D\) 时,我们可以从两种方式中选择 \(x\) 较大的那一种,因此
\[x = \text{max}(V[k-1] + 1, V[k+1])\]
合并一下就能得到:
if k == -D or (k != D and V.get(k - 1) < V.get(k + 1)):
x = V.get(k + 1)
else:
x = V.get(k - 1) + 1
有了 \(x\),我们可以计算出 \(y\):
\[y = x - k\]然后我们可以进行 snake 移动,直到 \(A[x] \neq B[y]\):
while x < N and y < M and A[x] == B[y]:
x += 1
y += 1
我们不断增加 \(D\) 的值,直到 \(V[N - M]\) 的值达到 \(N\),表示我们已经找到了从 \((0, 0)\) 到 \((N, M)\) 的最短路径。
下面是 Myers 算法的示意图:
下面是 Python 代码实现:
def myers_diff(A: list[str], B: list[str]) -> tuple[int, list[tuple[str, str, int | None, int | None]]]:
"""
使用 Myers 算法计算两个序列之间的最小编辑距离(Levenshtein距离)并返回详细的编辑路径。
"""
N, M = len(A), len(B)
if N == 0:
return M, [('+', B[j], None, j) for j in range(M)]
if M == 0:
return N, [('-', A[i], i, None) for i in range(N)]
max_d = N + M
V = {1: 0}
trace = [V.copy()]
for D in range(max_d + 1):
new_V = {}
for k in range(-D, D + 1, 2):
if k == -D or (k != D and V.get(k - 1, -1) < V.get(k + 1, -1)):
x = V.get(k + 1, 0)
else:
x = V.get(k - 1, 0) + 1
y = x - k
while x < N and y < M and A[x] == B[y]:
x += 1
y += 1
new_V[k] = x
if x >= N and y >= M:
path = []
x, y = N, M
for d in range(D, -1, -1):
k = x - y
V_prev = trace[d]
if k == -d or (k != d and V_prev.get(k - 1, -1) < V_prev.get(k + 1, -1)):
prev_k = k + 1
else:
prev_k = k - 1
prev_x = V_prev.get(prev_k, 0)
prev_y = prev_x - prev_k
while x > prev_x and y > prev_y:
path.append(('=', A[x - 1], x - 1, y - 1))
x -= 1
y -= 1
if d > 0:
if x == prev_x:
path.append(('+', B[y - 1], None, y - 1))
y -= 1
else:
path.append(('-', A[x - 1], x - 1, None))
x -= 1
path.reverse()
return D, path
V = new_V
trace.append(V.copy())
return max_d, []
下面是运行结果:
A = ["A", "B", "C", "A", "B", "B", "A"]
B = ["C", "B", "A", "B", "A", "C"]
distance, path = myers_diff(A, B)
print("最小编辑距离:", distance)
print("diff:")
for op, val, i, j in path:
if op == '=':
print(f" {val}")
elif op == '-':
print(f"- {val}")
elif op == '+':
print(f"+ {val}")
最小编辑距离: 5 diff: - A - B C - A B + A B A + C
讨论
性能比较
Myers 算法的时间复杂度为 \(O\left(N\log N + D^2\right)\),空间复杂度为 \(O(N)\)。在最坏情况下,\(D\) 可能接近 \(N + M\),此时时间复杂度退化为 \(O\left({(N + M)}^2\right)\),甚至比 DP 更差。我们可以编写代码来测试两种算法在不同情况下的性能差异:
N=100, M=100, similarity=0.9 | DP: 0.0008s (±0.0001) | Myers: 0.0001s (±0.0000) N=100, M=100, similarity=0.8 | DP: 0.0008s (±0.0001) | Myers: 0.0002s (±0.0001) N=100, M=100, similarity=0.7 | DP: 0.0008s (±0.0000) | Myers: 0.0003s (±0.0001) N=100, M=100, similarity=0.6 | DP: 0.0008s (±0.0000) | Myers: 0.0006s (±0.0001) N=100, M=100, similarity=0.5 | DP: 0.0008s (±0.0001) | Myers: 0.0008s (±0.0002) N=100, M=100, similarity=0.4 | DP: 0.0008s (±0.0000) | Myers: 0.0011s (±0.0002) N=100, M=100, similarity=0.3 | DP: 0.0008s (±0.0000) | Myers: 0.0013s (±0.0002) N=100, M=100, similarity=0.2 | DP: 0.0008s (±0.0000) | Myers: 0.0015s (±0.0001) N=100, M=100, similarity=0.1 | DP: 0.0008s (±0.0000) | Myers: 0.0017s (±0.0001) N=200, M=200, similarity=0.9 | DP: 0.0031s (±0.0001) | Myers: 0.0002s (±0.0001) N=200, M=200, similarity=0.8 | DP: 0.0031s (±0.0001) | Myers: 0.0006s (±0.0002) N=200, M=200, similarity=0.7 | DP: 0.0031s (±0.0000) | Myers: 0.0013s (±0.0003) N=200, M=200, similarity=0.6 | DP: 0.0032s (±0.0001) | Myers: 0.0021s (±0.0004) N=200, M=200, similarity=0.5 | DP: 0.0032s (±0.0002) | Myers: 0.0032s (±0.0004) N=200, M=200, similarity=0.4 | DP: 0.0032s (±0.0001) | Myers: 0.0042s (±0.0005) N=200, M=200, similarity=0.3 | DP: 0.0032s (±0.0003) | Myers: 0.0052s (±0.0004) N=200, M=200, similarity=0.2 | DP: 0.0032s (±0.0003) | Myers: 0.0060s (±0.0004) N=200, M=200, similarity=0.1 | DP: 0.0032s (±0.0003) | Myers: 0.0066s (±0.0003) N=300, M=300, similarity=0.9 | DP: 0.0074s (±0.0002) | Myers: 0.0004s (±0.0001) N=300, M=300, similarity=0.8 | DP: 0.0075s (±0.0005) | Myers: 0.0013s (±0.0003) N=300, M=300, similarity=0.7 | DP: 0.0076s (±0.0003) | Myers: 0.0028s (±0.0006) N=300, M=300, similarity=0.6 | DP: 0.0077s (±0.0003) | Myers: 0.0045s (±0.0006) N=300, M=300, similarity=0.5 | DP: 0.0078s (±0.0003) | Myers: 0.0072s (±0.0011) N=300, M=300, similarity=0.4 | DP: 0.0078s (±0.0003) | Myers: 0.0097s (±0.0013) N=300, M=300, similarity=0.3 | DP: 0.0079s (±0.0004) | Myers: 0.0119s (±0.0010) N=300, M=300, similarity=0.2 | DP: 0.0080s (±0.0004) | Myers: 0.0141s (±0.0010) N=300, M=300, similarity=0.1 | DP: 0.0079s (±0.0004) | Myers: 0.0156s (±0.0007) N=400, M=400, similarity=0.9 | DP: 0.0140s (±0.0007) | Myers: 0.0007s (±0.0002) N=400, M=400, similarity=0.8 | DP: 0.0141s (±0.0004) | Myers: 0.0024s (±0.0005) N=400, M=400, similarity=0.7 | DP: 0.0145s (±0.0008) | Myers: 0.0050s (±0.0007) N=400, M=400, similarity=0.6 | DP: 0.0147s (±0.0006) | Myers: 0.0085s (±0.0012) N=400, M=400, similarity=0.5 | DP: 0.0152s (±0.0011) | Myers: 0.0127s (±0.0015) N=400, M=400, similarity=0.4 | DP: 0.0151s (±0.0007) | Myers: 0.0176s (±0.0016) N=400, M=400, similarity=0.3 | DP: 0.0150s (±0.0008) | Myers: 0.0221s (±0.0018) N=400, M=400, similarity=0.2 | DP: 0.0149s (±0.0006) | Myers: 0.0258s (±0.0014) N=400, M=400, similarity=0.1 | DP: 0.0150s (±0.0012) | Myers: 0.0283s (±0.0010) N=500, M=500, similarity=0.9 | DP: 0.0227s (±0.0009) | Myers: 0.0010s (±0.0002) N=500, M=500, similarity=0.8 | DP: 0.0231s (±0.0008) | Myers: 0.0036s (±0.0007) N=500, M=500, similarity=0.7 | DP: 0.0234s (±0.0007) | Myers: 0.0077s (±0.0011) N=500, M=500, similarity=0.6 | DP: 0.0239s (±0.0007) | Myers: 0.0131s (±0.0016) N=500, M=500, similarity=0.5 | DP: 0.0243s (±0.0009) | Myers: 0.0206s (±0.0020) N=500, M=500, similarity=0.4 | DP: 0.0244s (±0.0012) | Myers: 0.0283s (±0.0021) N=500, M=500, similarity=0.3 | DP: 0.0244s (±0.0010) | Myers: 0.0354s (±0.0024) N=500, M=500, similarity=0.2 | DP: 0.0243s (±0.0010) | Myers: 0.0417s (±0.0027) N=500, M=500, similarity=0.1 | DP: 0.0243s (±0.0007) | Myers: 0.0452s (±0.0020) N=600, M=600, similarity=0.9 | DP: 0.0337s (±0.0015) | Myers: 0.0015s (±0.0003) N=600, M=600, similarity=0.8 | DP: 0.0342s (±0.0015) | Myers: 0.0053s (±0.0010) N=600, M=600, similarity=0.7 | DP: 0.0348s (±0.0014) | Myers: 0.0109s (±0.0015) N=600, M=600, similarity=0.6 | DP: 0.0360s (±0.0017) | Myers: 0.0198s (±0.0020) N=600, M=600, similarity=0.5 | DP: 0.0360s (±0.0010) | Myers: 0.0302s (±0.0028) N=600, M=600, similarity=0.4 | DP: 0.0364s (±0.0020) | Myers: 0.0407s (±0.0034) N=600, M=600, similarity=0.3 | DP: 0.0359s (±0.0012) | Myers: 0.0513s (±0.0031) N=600, M=600, similarity=0.2 | DP: 0.0363s (±0.0016) | Myers: 0.0614s (±0.0040) N=600, M=600, similarity=0.1 | DP: 0.0362s (±0.0014) | Myers: 0.0666s (±0.0033)
可以看出,当两个序列非常相似时,Myers 算法显著优于 DP 算法;而当两个序列差异较大时,Myers 算法的性能会下降到不如 DP 算法。不过,从整体来看,Myers 算法在大多数情况下仍然优于 DP 算法。
可读性优化
我们看如下两种 diff 结果:
for (int i = 0; i < n; i++) {
process1(i);
}
+for (int i = 0; i < n; i++) {
+ process2(i);
+}
for (int i = 0; i < n; i++) {
process1(i);
+}
+for (int i = 0; i < n; i++) {
+ process2(i);
}
还有这一组:
if (isSocketReady()) {
- sendDataPart1();
- sendDataPart2();
+ sendDataPartA();
+ sendDataPartB();
}
if (isSocketReady()) {
- sendDataPart1();
+ sendDataPartA();
- sendDataPart2();
+ sendDataPartB();
}
它们最终的结果是等价的,但在两组对比中,前者的 diff 结果从易读性上都要好得多。
Myers 算法并没有考虑到这一点,因此我们可以在 Myers 算法的基础上进行一些后处理来优化 diff 结果。
一种简单的做法是将连续的删除和添加操作合并为一个块,然后尝试在块的边界处进行调整,以减少不必要的删除和添加操作。
下面是优化后的代码实现:
def optimize_diff(path: list[tuple[str, str, int | None, int | None]]) -> list[tuple[str, str, int | None, int | None]]:
optimized_path = []
i = 0
while i < len(path):
if path[i][0] == '-':
j = i
while j + 1 < len(path) and path[j + 1][0] == '-':
j += 1
del_block = path[i:j + 1]
i = j + 1
if i < len(path) and path[i][0] == '+':
add_block = path[i]
optimized_path.append((del_block, add_block))
i += 1
else:
optimized_path.append(path[i])
i += 1
return optimized_path
Comments