Near-duplicate with SimHash
Before talking about SimHash, let’s review some other methods that can also identify duplicates.
Longest Common Subsequence (LCS)
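This is the algorithm used by the diff command. It is equivalent to edit distance with insertion and deletion as the only two edit operations: for two sequences of lengths m and n whose LCS has length l, that distance is m + n - 2l.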
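To make the link between LCS and insert/delete-only edit distance concrete, here is a minimal sketch using the textbook dynamic-programming approach. The function names `lcs_length` and `indel_distance` are made up for illustration, and this is not how diff itself is implemented internally; real diff tools use more optimized algorithms.

```python
def lcs_length(a, b):
    """Return the length of the longest common subsequence of a and b."""
    # prev[j] holds the LCS length of the part of a processed so far and b[:j].
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                # Matching elements extend the LCS of both shorter prefixes.
                curr.append(prev[j - 1] + 1)
            else:
                # Otherwise drop one element from either a or b.
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def indel_distance(a, b):
    """Edit distance when only insertions and deletions are allowed."""
    # Everything outside the LCS must be deleted from a or inserted from b.
    return len(a) + len(b) - 2 * lcs_length(a, b)


print(lcs_length("ABCBDAB", "BDCABA"))
print(indel_distance("ABCBDAB", "BDCABA"))
```

Both functions accept any sequences, so passing lists of lines instead of strings gives a line-level comparison closer to what diff reports. The table costs O(m * n) time, which is one reason this approach gets expensive when comparing many long documents pairwise.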