Weeder
A weeder is a utility that identifies binary-identical files using a
fingerprint made of a CRC and the file length. With this information it
is possible to separate, exchange or delete large numbers of binaries of
many kinds: archives, viruses (as in my case), pictures and many others.
The algorithms used are fast and reliable (about 2 GB, or 180,000 virus
samples, take about 1 hour on a P200). It is prepared for different
data files and incorporates integrity-checking abilities.
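The fingerprint idea can be sketched in a few lines of Python. This is
only an illustration of the technique, not the weeder's own code: the
function names fingerprint() and find_duplicates() are made up for this
example, and the CRC used is the standard CRC-32 from Python's zlib
module, which may or may not match the polynomial the weeder uses.

```python
import os
import zlib
from collections import defaultdict


def fingerprint(path, chunk_size=1 << 16):
    """Return (crc32, length) for one file, reading it in chunks."""
    crc = 0
    length = 0
    with open(path, "rb") as handle:
        while True:
            chunk = handle.read(chunk_size)
            if not chunk:
                break
            crc = zlib.crc32(chunk, crc)  # continue the running CRC
            length += len(chunk)
    return crc & 0xFFFFFFFF, length


def find_duplicates(root):
    """Group all files below root by fingerprint; keep groups of 2 or more."""
    groups = defaultdict(list)
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            groups[fingerprint(path)].append(path)
    return {fp: sorted(paths) for fp, paths in groups.items() if len(paths) > 1}
```

Calling find_duplicates("samples") would return a dictionary that maps
each fingerprint seen more than once to the list of files sharing it.
Two different files can in principle share both CRC and length, so a
cautious tool might compare the candidates byte by byte before deleting
anything.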
This is my first Linux program. It is something I need for my work;
about 50% of it was written at Ikarus Software, an antivirus company in
Vienna, and the rest is my own effort. Usually I would place these notes
in the header, but I have no cute ASCII logo for it ;-)
In antivirus research there are gigabytes of small files to process,
and many of them are duplicates. Every day several files arrive
(up to a few thousand), and sometimes packages with 100,000 or more.
Our server space would be exhausted faster than we could buy hard disks
if we did not kick out every virus sample that is already in our
virus base (the chaotic part of the collection). In 1996 I first came
across a wonderful little (about 3 KB) MS-DOS tool called tbweeder.
This tool solved the problem of duplicate binaries in a very elegant
way. It took a 32-bit CRC and the length of every file and used this
pair as a fingerprint to weed out duplicates. Unfortunately, in times of
Windows with its long filenames, and given the number of viruses,
tbweeder is no longer of use. There was another weeder by Ralph Roth of
VHM (Virus Help Munich), a more advanced version of tbweeder
(called rfw), but it was far too slow for my amount of data. So I
started to create my own weeder, based on Linux.
Soon after fooling around with tbweeder I discovered that there was
another fine purpose for it. At that time I collected those cute little
pictures of cute naked girls.... [hey, I could have mentioned pictures
of fractals, but that would sound a lot more boring]. Anyway, it
was the same problem: how to avoid having every picture two or three
times? You cannot remember several thousand pictures for longer than a
week. At least binary-identical pictures can be weeded out too.
Over time I got new ideas for applying this tool: fast file checks
with an integrity check option, which will show within five minutes
whether any binaries on your system have changed, and identifying files
in a collection where the files are frequently renamed. Even syncing two
distant directories will be possible with a weeder.
With these applications in mind I decided to place this tool in the pool
of Linux applications.
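The integrity check and the detection of renamed files both come down to
comparing two sets of fingerprints taken at different times. The
following Python sketch shows one way this could work; it does not
describe the weeder's data file format or options, and the names
snapshot() and compare() are invented for this example. Files are read
whole, which is fine for small samples but would need chunked reading
for large ones.

```python
import os
import zlib


def crc_and_length(path):
    """Return (crc32, length) of one file (read into memory at once)."""
    with open(path, "rb") as handle:
        data = handle.read()
    return zlib.crc32(data) & 0xFFFFFFFF, len(data)


def snapshot(root):
    """Map each relative path below root to its (crc32, length) fingerprint."""
    result = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            result[os.path.relpath(path, root)] = crc_and_length(path)
    return result


def compare(old, new):
    """Report changed, renamed, added and missing files between snapshots."""
    # Same name, different fingerprint: the content has changed.
    changed = sorted(p for p in old if p in new and old[p] != new[p])
    gone = [p for p in old if p not in new]
    fresh = {p for p in new if p not in old}
    # Index the new names by fingerprint to spot renamed files.
    by_fp = {}
    for p in sorted(fresh):
        by_fp.setdefault(new[p], []).append(p)
    renamed, missing = [], []
    for p in sorted(gone):
        candidates = by_fp.get(old[p])
        if candidates:
            target = candidates.pop(0)
            renamed.append((p, target))
            fresh.discard(target)
        else:
            missing.append(p)
    return {"changed": changed, "renamed": renamed,
            "added": sorted(fresh), "missing": missing}
```

A snapshot taken today can be stored and compared against a later one
with compare(old, new). The same comparison between the snapshots of two
distant directories tells which files would have to be copied to bring
them in sync.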
INSTALLATION
Simply run "make" followed by "make install".
MAKE SURE YOU ARE ROOT DURING INSTALLATION!
Current version: weeder-0.9.7.tgz
..