Comment by Apreche a day ago

This would be more interesting if it were generalized. With a hash, even a one-character difference results in a miss.

If I could have it analyze my blog and then find people who have similar ideas that would be incredibly useful.

Imustaskforhelp a day ago

To be really honest, they could take a look at Bao. (I used it for a project eerily similar to this one, though it's great that this is receiving traction! I do feel like the Scuttlebutt protocol might be a good implementation choice for most use cases as well.)

Bao allows you to get a common hash over the first n chunks of the content, so two inputs that share a prefix still share those hashes. You could loop over each successive word to see how long the common hash prefix is, and that length becomes the measure of similarity.

One issue might come up where a word changes at the start and the rest is similar, but I feel like Bao could/does support that as well. My knowledge of Bao is pretty rusty (get the pun? It's written in Rust), but I am sure this idea is technically possible, and I hope someone experienced in the field can say more about it.

https://github.com/oconnor663/bao. O'Connor's Bao videos and talks on YouTube are very good, worth a watch, and the repo is worth a star (though they do mention it is a little less formally vetted cryptographically, IIRC, but it is still pretty robust IMO).
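
If anyone wants to play with the idea, here is a rough sketch in plain Python. This is not Bao's actual API; hashlib just stands in for its tree hash, and it hashes each word-prefix separately:

    import hashlib

    def prefix_hashes(text: str) -> list[str]:
        # Hash every word-prefix; texts that start the same way
        # share a run of leading hashes.
        words = text.split()
        return [
            hashlib.sha256(" ".join(words[: i + 1]).encode()).hexdigest()
            for i in range(len(words))
        ]

    def common_prefix_len(a: str, b: str) -> int:
        # Count how many leading words two texts share, via their
        # prefix hashes; the count is the similarity measure.
        n = 0
        for x, y in zip(prefix_hashes(a), prefix_hashes(b)):
            if x != y:
                break
            n += 1
        return n

    print(common_prefix_len("the quick brown fox", "the quick brown dog"))  # 3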

sonnig a day ago

True! That would be a more powerful approach. Here I kept it quite basic since I was not very familiar with the tooling. I do apply lowercasing of the text plus some whitespace stripping to increase the number of collisions a bit.

Edit: any other "quick hacks" to increase the number of collisions are welcome :)
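
For the curious, the normalization amounts to roughly this (a sketch; the actual code may differ in the details):

    import hashlib
    import re

    def normalize(thought: str) -> str:
        # Lowercase and collapse whitespace runs to single spaces so
        # trivial formatting differences still collide.
        return re.sub(r"\s+", " ", thought.lower()).strip()

    def thought_hash(thought: str) -> str:
        return hashlib.sha256(normalize(thought).encode("utf-8")).hexdigest()

    assert thought_hash("Hello   World") == thought_hash("hello world")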

nathan_compton a day ago

Natural to use LM embeddings for this.

  • jamilton a day ago

    Yeah, convert to an embedding, check whether it is within a certain distance of an existing embedding, and if so store it with that cluster and increment? Then check further entries against an average so clusters don't increase their "reach" indefinitely.
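
    Rough sketch of that loop (embed() here is a toy stand-in; a real version would call a sentence-embedding model, and the threshold would need tuning per model):

        import numpy as np

        THRESHOLD = 0.85  # cosine-similarity cutoff; tune per model

        def embed(text: str) -> np.ndarray:
            # Placeholder: a real system would call a sentence-embedding
            # model here (e.g. sentence-transformers); this toy version
            # just counts letters so the sketch runs end to end.
            v = np.zeros(26)
            for ch in text.lower():
                if "a" <= ch <= "z":
                    v[ord(ch) - 97] += 1
            return v

        clusters = []  # each entry: {"centroid": unit vector, "count": int}

        def add_thought(text: str) -> int:
            # Assign the thought to the nearest cluster, or start a new one.
            v = embed(text)
            v = v / np.linalg.norm(v)
            for i, c in enumerate(clusters):
                if float(c["centroid"] @ v) >= THRESHOLD:
                    # Fold the new vector into a running-average centroid;
                    # comparing against the average keeps a cluster's
                    # "reach" from growing indefinitely.
                    c["centroid"] = (c["centroid"] * c["count"] + v) / (c["count"] + 1)
                    c["centroid"] /= np.linalg.norm(c["centroid"])
                    c["count"] += 1
                    return i
            clusters.append({"centroid": v, "count": 1})
            return len(clusters) - 1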

stogot a day ago

That is a problem. Also, a long paragraph would likely never hash the same because of a comma or a capital letter, so the builder of this would need to cap the length of the thought and make all thoughts lowercase without punctuation.

  • sonnig a day ago

    I agree, removing punctuation would have been a good idea. Alas, it may be a bit too late, since that would change the hashes relative to previously stored inputs. Hmm, but I will think about it.
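
    One hypothetical way out: store a normalization version next to each hash, so older entries keep matching under the rules they were hashed with. Sketch only; the names and scheme are made up, not from the project:

        import hashlib
        import re

        # Tag each stored hash with the normalization version that
        # produced it, so tightening the rules later does not orphan
        # the hashes of earlier inputs.
        NORMALIZERS = {
            1: lambda t: re.sub(r"\s+", " ", t.lower()).strip(),
            2: lambda t: re.sub(r"[^a-z0-9 ]", "",  # v2 also drops punctuation
                                re.sub(r"\s+", " ", t.lower()).strip()),
        }
        CURRENT_VERSION = 2

        def thought_hash(thought: str, version: int = CURRENT_VERSION) -> tuple[int, str]:
            digest = hashlib.sha256(NORMALIZERS[version](thought).encode()).hexdigest()
            return version, digest

        def matches(new_thought: str, stored: tuple[int, str]) -> bool:
            # Re-hash the new thought under the stored entry's version so
            # old entries keep matching under their original rules.
            version, digest = stored
            return thought_hash(new_thought, version)[1] == digest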