SimHash

#algorithms #public

SimHash é um algorítimo para encontrar documentos similares (cópias não exatas).

A ideia do algorítimo é

Criar um vetor $weights$ de tamanho igual a quantidade de bits da função de hash utilizada
Quebrar o documento original em features (token + peso)
Para cada feature
1. Calcular um hash dessa feature
2. Para cada $bit_i$ do hash
  1. Se $bit_i = 0$, soma $feature_w$ na posição $i$ de $weights$
  2. Se $bit_i = 0$ , subtraí $feature_w$ na posição $i$ de $weights$
Para cada item do vetor $weights$
1. Se $weights_i$ > 0, $simhash_i = 1$
2. Caso contrário $simhash_i = 0$

#tech/rust

use std::hash::Hasher;

pub struct SimHash<H: Hasher, T: Tokenize> {
    hasher: H,
    tokenizer: T,
}

impl<H: Hasher, T: Tokenize> SimHash<H, T> {
    pub fn new(hasher: H, tokenizer: T) -> SimHash<H, T> {
        SimHash { hasher, tokenizer }
    }

    pub fn hash(&mut self, value: &str) -> u64 {
        let mut weights = [0_i32; 64];

        for token in self.tokenizer.tokenize(value) {
            self.hasher.write(token.value.as_bytes());
            let h = self.hasher.finish();

            for (i, w) in weights.iter_mut().enumerate() {
                if ((h >> i) & 1) == 1 {
                    *w = w.saturating_add(token.weight);
                } else {
                    *w = w.saturating_sub(token.weight);
                }
            }
        }

        let mut simhash: u64 = 0;

        for (i, w) in weights.into_iter().enumerate() {
            if w > 0 {
                simhash |= 1 << i;
            }
        }

        simhash
    }
}

Referências

Probabilistic Near-Duplicate Detection Using Simhash