Weeder



A weeder is a utility that identifies binary-identical files using a fingerprint made of a CRC and the file length. With this information it is possible to separate, exchange or delete large numbers of binaries of all kinds: archives, viruses (as in my case), pictures and many others.
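The fingerprint idea can be sketched in a few lines of Python. This is only an illustration, not the code of weeder itself: the function names fingerprint and find_duplicates are made up for this example, and zlib's CRC-32 is an assumption, since the exact CRC variant used by the tool is not stated here.

```python
import os
import zlib
from collections import defaultdict


def fingerprint(path, chunk_size=65536):
    """Return (crc32, length) for one file, reading it in chunks."""
    crc = 0
    length = 0
    with open(path, "rb") as handle:
        while True:
            chunk = handle.read(chunk_size)
            if not chunk:
                break
            # zlib.crc32 accepts the running CRC as its second argument.
            crc = zlib.crc32(chunk, crc)
            length += len(chunk)
    return crc & 0xFFFFFFFF, length


def find_duplicates(root):
    """Group all regular files below root by fingerprint.

    Only groups with more than one member are returned.
    """
    groups = defaultdict(list)
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in sorted(filenames):
            path = os.path.join(dirpath, name)
            if os.path.isfile(path) and not os.path.islink(path):
                groups[fingerprint(path)].append(path)
    return {fp: paths for fp, paths in groups.items() if len(paths) > 1}
```

Calling find_duplicates("/some/dir") returns a dictionary that maps each (CRC, length) pair to the list of files sharing it. Two different files can in principle share a fingerprint, so a byte-by-byte comparison before deleting anything is a cheap extra safeguard.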

The algorithms used are fast and reliable: about 2 GB of virus samples (180,000 files) take about 1 hour on a P200. Weeder is prepared for different data files and includes integrity-checking abilities. This is my first Linux program. It is something I need for my work; about 50% of it was made at Ikarus Software, an antivirus company in Vienna, and the rest is my own effort. Usually I would put this credit in the header, but I have no cute ASCII logo for it ;-)

In antivirus research there are gigabytes of small files to process, and many of them are duplicates. Every day several files arrive (up to a few thousand), and sometimes packages with 100,000 or more. Our server space would be exhausted faster than we could buy hard disks if we did not kick out every virus sample that is already in our virus base (the chaotic part of the collection). In 1996 I first came across a wonderful little (about 3 KB) MS-DOS tool called tbweeder. This tool solved the problem of duplicate binaries in a very elegant way: it took a 32-bit CRC and the length of every file and used them as a fingerprint to weed out duplicates. Unfortunately, in the age of Windows with its long filenames, and given the number of viruses, tbweeder is no longer of use. There was another weeder, rfw, from Ralph Roth of VHM (Virus Help Munich), which is a more advanced version of tbweeder but far too slow for my amount of data. So I started to create my own weeder, based on Linux.

Soon after fooling around with tbweeder I discovered that it could serve another fine purpose. At that time I collected these cute little pictures of cute naked girls... [hey, I could have mentioned pictures of fractals, but that would sound a lot more boring]. Anyway, it was the same problem: how to avoid having every picture two or three times? You cannot remember several thousand pictures for longer than a week. At least the binary-identical pictures can be weeded out too.

Over time I got new ideas for applying this tool: fast file checks with an integrity-check option, which shows within five minutes whether any binaries on your system have changed; identifying files in a collection where the files are frequently renamed; even syncing two distant directories will be possible with a weeder.
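How such an integrity check and rename detection might work can be sketched as follows. Again this is a minimal illustration under assumptions, not weeder's actual data file format or options: the names snapshot and compare are hypothetical, and zlib's CRC-32 stands in for whatever CRC the tool really uses. The idea is to record a fingerprint per file once, record it again later, and compare the two listings.

```python
import os
import zlib


def fingerprint(path):
    """Return (crc32, length) of a file; CRC-32 via zlib is an assumption."""
    crc, length = 0, 0
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            crc = zlib.crc32(chunk, crc)
            length += len(chunk)
    return crc & 0xFFFFFFFF, length


def snapshot(root):
    """Map each file's path (relative to root) to its fingerprint."""
    result = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.isfile(path):
                result[os.path.relpath(path, root)] = fingerprint(path)
    return result


def compare(old, new):
    """Classify the differences between two snapshots."""
    changed = sorted(p for p in old if p in new and old[p] != new[p])
    missing = [p for p in old if p not in new]
    added = [p for p in new if p not in old]
    # A missing path whose fingerprint reappears under a new path is
    # reported as a rename rather than as a removal plus an addition.
    added_by_fp = {}
    for p in sorted(added):
        added_by_fp.setdefault(new[p], []).append(p)
    renamed = []
    for p in sorted(missing):
        candidates = added_by_fp.get(old[p])
        if candidates:
            renamed.append((p, candidates.pop(0)))
    renamed_old = {a for a, _ in renamed}
    renamed_new = {b for _, b in renamed}
    return {
        "changed": changed,
        "renamed": renamed,
        "removed": sorted(p for p in missing if p not in renamed_old),
        "added": sorted(p for p in added if p not in renamed_new),
    }
```

The same comparison, run on snapshots of two different directories, gives the list of files that one side is missing, which is the starting point for syncing them.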

With these applications in mind I decided to add this tool to the pool of Linux applications.


INSTALLATION

Simply run "make" followed by "make install".
MAKE SURE YOU ARE ROOT DURING INSTALLATION!

Current version: weeder-0.9.7.tgz