Near-duplicate with SimHash
Before talking about SimHash, let’s review some other methods that can also identify duplicates.
Longest Common Subsequence (LCS)
This is the algorithm behind the diff command. It is also equivalent to edit distance when insertion and deletion are the only two edit operations allowed.
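To make the link between LCS and insert/delete-only edit distance concrete, here is a minimal sketch in Python. The function names `lcs_length` and `indel_distance` are illustrative assumptions, not taken from diff or any library, and this is the textbook dynamic-programming formulation rather than the optimized variant a real diff tool would use.

```python
def lcs_length(a, b):
    """Return the length of the longest common subsequence of a and b."""
    # prev[j] is the LCS length of the part of `a` seen so far and b[:j].
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]  # LCS with an empty prefix of b is always 0
        for j, y in enumerate(b, start=1):
            if x == y:
                # Matching items extend the LCS of both shorter prefixes.
                curr.append(prev[j - 1] + 1)
            else:
                # Otherwise drop one item from either a or b.
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def indel_distance(a, b):
    """Edit distance when only insertions and deletions are allowed.

    Everything outside the LCS must be deleted from a or inserted from b.
    """
    return len(a) + len(b) - 2 * lcs_length(a, b)


print(lcs_length("ABCBDAB", "BDCABA"))      # 4
print(indel_distance("ABCBDAB", "BDCABA"))  # 5
```

The table takes O(len(a) * len(b)) time to fill, which is fine for comparing two documents but becomes expensive when every pair in a large collection has to be checked.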