While researching, I needed to find out whether two web documents are the same. For instance, I want to know if the URL http://www.eutrigtreat.eu points to the same content as http://www.eutrigtreat.com (this example is relatively easy, but you can find completely different URLs pointing to the same web content).
At first sight, running a normal MD5 hash over each website's content seems to solve the problem. But websites are not static, and many of them generate dynamic content, such as the current date and time, an entire HTML section, or just an extra blank space. Any such change, however small, produces a completely different hash. Hashing a website's content now and hashing it again one second later can therefore give two unrelated hashes, and the misleading conclusion that the two websites are not the same.
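To make the problem concrete, here is a minimal sketch using Python's standard hashlib module. The two page snapshots are made up for illustration (they are not the real content of the URLs above) and differ only in one digit of a timestamp.

```python
import hashlib

# Two hypothetical snapshots of the same page, taken one second apart.
page_now = b"<html><body><h1>Welcome</h1><p>Generated at 12:00:01</p></body></html>"
page_later = b"<html><body><h1>Welcome</h1><p>Generated at 12:00:02</p></body></html>"

digest_now = hashlib.md5(page_now).hexdigest()
digest_later = hashlib.md5(page_later).hexdigest()

# Count the hex positions where the two digests happen to agree.
matching = sum(a == b for a, b in zip(digest_now, digest_later))

print(digest_now)
print(digest_later)
print(f"{matching} of {len(digest_now)} hex digits match")
```

The two digests only tell us that the inputs differ. They say nothing about how much the inputs differ, which is exactly the information we need here.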
Fortunately, there is a better hashing method that works around this problem and also reveals useful information about how similar the content behind the two URLs is.
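Content Triggered Piecewise Hashing (CTPH), also known as fuzzy hashing, is the solution. The basic idea of the algorithm is that the document is split into pieces whose boundaries are determined by the content itself, and each piece is hashed separately. A change in the document therefore only affects the hash of the piece that contains it, and the rest of the hash output stays the same.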
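The idea can be sketched with a toy implementation. This is only an illustration of the principle and not how ssdeep is implemented: the function name `piecewise_hash` and the parameters `window` and `modulus` are invented for this example, the rolling value is a plain sum of bytes, and each piece is reduced to one character using MD5. The real ssdeep rolling hash, block size selection, and scoring are more elaborate.

```python
import difflib
import hashlib

ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/"


def piece_char(piece: bytes) -> str:
    """Reduce one piece of the input to a single signature character."""
    return ALPHABET[hashlib.md5(piece).digest()[0] % 64]


def piecewise_hash(data: bytes, window: int = 4, modulus: int = 16) -> str:
    """Toy content-triggered piecewise hash (illustration only, not ssdeep)."""
    signature = []
    start = 0
    for i in range(len(data)):
        # Rolling value over the last `window` bytes: it depends only on
        # nearby content, so piece boundaries are set by the content itself.
        rolling = sum(data[max(0, i - window + 1): i + 1])
        if rolling % modulus == modulus - 1:
            # Trigger point: close the current piece and emit one character.
            signature.append(piece_char(data[start:i + 1]))
            start = i + 1
    if start < len(data):
        # Whatever is left after the last trigger forms the final piece.
        signature.append(piece_char(data[start:]))
    return "".join(signature)


text_a = b"Also called fuzzy hashes, Ctph can match inputs that have homologies."
text_b = b"Also called fuzzy hashes, CTPH can match inputs that have homologies."

sig_a = piecewise_hash(text_a)
sig_b = piecewise_hash(text_b)

print(sig_a)
print(sig_b)
# Rough similarity of the two signatures, between 0.0 and 1.0.
print(difflib.SequenceMatcher(None, sig_a, sig_b).ratio())
```

Because each signature character depends on a single piece, pieces that lie entirely before the edited word produce the same characters in both signatures. The trigger only looks at the last few bytes, so the boundaries line up again shortly after the edit.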
You can find more details about fuzzy hashing, and a tool to compute it, at ssdeep.
Since I needed to use it from Python, I used the wrapper around ssdeep available at https://pypi.python.org/pypi/ssdeep.
Fuzzy hashing example:
```python
import ssdeep

hash1 = ssdeep.hash('Also called fuzzy hashes, Ctph can match inputs that have homologies.')
print(hash1)  # 3:AXGBicFlgVNhBGcL6wCrFQEv:AXGHsNhxLsr2C

hash2 = ssdeep.hash('Also called fuzzy hashes, CTPH can match inputs that have homologies.')
print(hash2)  # 3:AXGBicFlIHBGcL6wCrFQEv:AXGH6xLsr2C

# comparing
print(ssdeep.compare(hash1, hash2))  # 22
```
For the problem stated at the beginning, comparing the fuzzy hashes of the two pages gives a matching score of 91%. With a score this high, in the context of web documents, we can say with a good degree of confidence that both URLs point to the same content.