AGREP, an approximate GREP

QUICK REFERENCE

The source code moved to Github 17.02.2013
Download from git, compile and install:

git clone git://github.com/wikinaut/agrep.git
cd agrep
make
cp agrep /usr/local/bin

added the code for LINUX 25.09.2011 agrep-3.41.tgz
AGREP for Windows (compilation with MicroSoft Visual C++ 6.0 VC 6.0)
plans for an AGREP-Wiki
The versions AGREP 3.41 for LINUX, AGREP 3.37 for Windows 95/NT, AGREP 3.35 for OS/2 and DOS are based on the AGREP program from GLIMPSE 3.0. Many bugs have been identified and I regard my version as the most tested and debugged version of all AGREP versions I found until 31.05.2004. However, the core from which I started is extremely diffcult to read (apart from understanding) and lacking descriptions. However, my port never crashed in all the last eight years.
I was in contact with the authors and have asked the license departement of University of Arizona and the Webglimpse keepers several times to give it free with success: As of Sept 18, 2014, Webglimpse, Glimpse and Agrep are available under the ISC open source license, thanks to the University of Arizona Office of Technology Transfer and all the developers, who were more than happy to release it.
If you use Windows and if you already use previous versions of AGREP, please upgrade to this version 3.37

The AGREP software and document parts of this page are still unchanged since 16. September 1998, because no further developements took place (yet). If you want to contribute an new port or implementation, please contact me (see below).

Download section - program, source files and documentation
What's new in AGREP 3.37 for OS/2, DOS, Win3.x and Windows 95, NT ?
Help Page is here - options and examples
Limitations and known bugs
The To-Do List - Research and development
Algorithms used within AGREP
Literature and Links
A must: Hundreds of really relevant Pattern Matching Pointers and other web resources (Prof. Stefano Lonardi, University of California, Riverside, USA)
E-mail contact- see below
my home page
Imprint/Impressum

Some interesting links related to AGREP and newer developments as of 16.12.2007

http://search.cpan.org/~jhi/String-Approx-3.26/Approx.pm

PERL module String:Approx

Perl extension for approximate matching (fuzzy matching) by Jarkko Hietaniemi, Finland

String::Approx lets you match and substitute strings approximately. With this you can emulate errors: typing errorrs, speling errors, closely related vocabularies (colour color), genetic mutations (GAG ACT), abbreviations (McScot, MacScot).

Excerpt from the README file:

"The most important change was that the underlying algorithm waschanged completely. Instead of doing everything in Perl using regular expressions we now do the matching in C using the so-called Manber-Wuk-differences algorithm shift-add. You have met this algorithm if youhave used the agrep utility or the Glimpse indexing system.Because this implementation shares no code with agrep, only the well-publicized algorithms, the use of this software is limited only by my copyright (file COPYRIGHT). This interpretation has been kindly confirmed by Udi Manber. More details in the file README.apse."

http://freshmeat.net/projects/tre/ | http://laurikari.net/tre/ by Ville Laurikari, Finland

TRE is a lightweight, robust, and efficient POSIX compliant regexp matching library with some exciting features such as approximate (fuzzy) matching.

At the core of TRE is a new algorithm for regular expression matching with submatch addressing. The algorithm uses linear worst-case time in the length of the text being searched, and quadratic worst-case time in the length of the used regular expression. In other words, the time complexity of the algorithm is O(M²N), where M is the length of the regular expression and N is the length of the text. The used space is also quadratic on the length of the regex, but does not depend on the searched string. This quadratic behaviour occurs only on pathological cases which are probably very rare in practice.

TRE is not just yet another regexp matcher. TRE has some features which are not there in most free POSIX compatible implementations. Most of these features are not present in non-free implementations either, for that matter.

Approximate pattern matching allows matches to be approximate, that is, allows the matches to be close to the searched pattern under some measure of closeness. TRE uses the edit-distance measure (also known as the Levenshtein distance) where characters can be inserted, deleted, or substituted in the searched text in order to get an exact match. Each insertion, deletion, or substitution adds the distance, or cost, of the match. TRE can report the matches which have a cost lower than some given threshold value. TRE can also be used to search for matches with the lowest cost.

TRE includes a version of the agrep command line tool for approximate regexp matching in the style of grep. Unlike other agrep implementations (like the one by Sun Wu and Udi Manber from University of Arizona available here) TRE agrep allows full regexps of any length, any number of errors, and non-uniform costs for insertion, deletion and substitution.

TRE is free; you can redistribute it and/or modify it under the terms of the GNU General Public License version 2 (June 1991) as published by the Free Software Foundation. See the file COPYING, included in the distribution packages, for the complete license.

If you cannot accept the GNU GPL, it may be possible that you can persuade me to give or sell you a version which is licensed differently. Please contact the author for more information.

http://freshmeat.net/projects/libbitap/ | http://rational.co.za/libbitap/ by Nic Roets, South Africa

Windows - C - Cygwin

The bitap library , another new and fresh implementation of the bitap algorithm.

"The bitap library is a clean implementation of regular expression (regex/grep) string matching using the bitap algorithm. Approximate (also known as fuzzy) matching is allowed. This is the same algorithm as the one used in Glimpse and agrep, but it is much more complete with regard to regular expression syntax, and is much cleaner. "

http://www.bio.cam.ac.uk/~mw263/pyagrep.html

AGREPY - A Python port of agrep string matching with errors (by Michael J. Wise, August 2002)

WIBISA

Wiki-based Intranet- and Internet Search Assistant and bookmark sharing tool using auto-type detection and approximate and regular expression pattern shortcuts (WIBISA).

Research Disclosure RD488004, publication date 10.11.2004.

See http://www.tgries.de/wibisa

What is AGREP?

From the authors' notes:

AGREP is a powerful tool for fast searching a file or many files for a string or regular expression, with approximate matching capabilities and user-definable records. It was developed 1989-1991 by Udi Manber and Sun Wu at the University of Arizona and many others (please read CONTRIB.TXT and MANUAL.DOC). AGREP is similar to egrep (or grep or fgrep), but it is much more general and usually faster.

It also supports many kinds of queries including arbitrary wild cards, sets of patterns, and in general, regular expressions. It supports most of the options supported by the GREP family plus several more (but it is not 100% compatible with grep).

AGREP is the search engine and part of the GLIMPSE tool for searching and indexing whole file systems. GLIMPSE stands for Global Implicit Search and is part of the HARVEST Information Discovery and Access System.

AGREP belongs to the University of Arizona, which licenses it (see copyright).

It is not public-domain, but free for non-commercial use. And it comes with source!

back to the top of this page

AGREP ports to OS/2 and DOS and to Windows 95 and NT

The first port of AGREP to OS/2 was done by Robert Mahoney (2Rud Software).

Based on the version of AGREP as found in the GLIMPSE 3.0 package, I made this port to OS/2 and DOS using Eberhard Mattes' GNU C-compiler EMX 0.9c. As GLIMPSE is now at version 4.1, the relevant changes 3.x -> 4.x with respect to AGREP will be taken into account. Available are stand-alone versions for OS/2 (agrep2.exe V3.35), for native DOS (agrepdos.exe V3.35) and for 32-bit Windows systems (agrepw32.exe V3.37)

Rainhard Schnitker's RSX is a DOS DPMI-compliant runtime package, which makes a fourth version (agrep.exe V3.35) also run in a DOS box of OS/2, DOS box of Win3.x/Win95, under native DOS, and OS/2. But running this, you need one, two, or three additional files, depending on your operating system (see next section).

Dave was the first to compile a version for Windows 95 and NT using the Microsoft Visual C (VC) 5.0 compiler (January 1998).
Now we have a Windows 95 and NT version 3.37 compiled by Ron Aaron using Borland C 5.02 (September 1998).

back to the top of this page

AGREP files version 3.41 for LINUX)

agrep-3.41.tgz for LINUX (type "make" to compile)

AGREP files version 3.37 (3.35 for DOS and OS/2)

Executables only see also limitations
Stand-alone executable for OS/2	Stand-alone executable for Win95 and Win/NT recommended for most of the users	Stand-alone executable for native DOS	Windows DOS box, OS/2 DOS box, DOS, OS/2 recommended for users who need to run AGREP in a multi-OS environment
agrep2.zip (v3.35;128K)	agrepw32.zip(94K)	agrepdos.zip (v3.35;147K)	agrep.zip (v3.35; 249K) needs the included runtime modules EMX.EXE, EMX.DLL, RSX.EXE
All these Zip files include the COPYRIGHT and README files.

Complete package for W95/NT recommended set for developers (Sources, Executables, Basic Documentation)	Complete package for OS/2 and DOS recommended set for developers (Sources, Executables, Basic Documentation)
agrep337.zip (301K)	agrep335.zip (608K)

Utilities and hints

forall.zip (30K)

A utility written by Kai Uwe Rommel to execute a command or EXE (like AGREP.EXE) in all subdirectories.
This neat tool can be found on LEO, look for toolsnn.zip (nn denotes the version number, for example tools55.zip)

The following example shows the syntax
to let AGREP search for "needle" through all files * in all subdirectories of C:\MAIL and to write all results into file AGREP.OUT:

FORALL -R -d C:\MAIL\* : AGREP needle @F\* >> AGREP.OUT

Remark: this does not search the C:\MAIL directory itself.

Another method to search multiple files and/or subdirectories:

Use AGREP's built-in @reponsefile (listfile) option !

The DIR command can first be used to create a list of files:
OS/2: DIR /F/S/A-D/-L/-P C:\MAIL\*.* > DIR.LST
DOS: DIR /B/S/A-D/-L/-P C:\MAIL\*.* > DIR.LST

Then you would invoke AGREP and pass that filename to it:

AGREP needle @DIR.LST

Remember, that AGREP is faster when you load it once and let it search a bunch of files:

AGREP needle C:\MAIL\*

AGREP © by Udi Manber, University of Arizona, Sun Wu, Thomas Gries.

EMX © by Eberhard Mattes.
The OS/2 and DOS runtime support libraries EMX.DLL and EMX.EXE.

EMX is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License version 2 as published by the Free Software Foundation.

EMX is available here:
http://archiv.leo.org/pub/comp/os/os2/leo/gnu/emx+gcc/

RSX © by Rainhard Schnitker.
The DPMI DOS extender RSX allows to run 32-bit programs (like AGREP) in a DOS box of several operation systems. The extender RSX is 32-bit. The code is compiled with EMX+GCC.

RSX is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License version 2 as published by the Free Software Foundation.

RSX is available here:
http://archiv.leo.org/pub/comp/os/os2/leo/gnu/emx+gcc/rsxnt/

back to the top of this page

AGREP: What was new in the versions 3.35 .. 3.37 ?

Here is the list of all options.

There are no new options in the 3.37 version, but several bugs have been fixed.
New makefile agrep.mak and conditional compiling using _WIN32 for the Microsoft Visual C (VC) 5.0 compiler to generate Windows 95 and Windows/NT stand-alone executables.
The -f option for multi-pattern searches (multiple patterns are read from a file) did not work in pre-3.35 versions. The problem has been found and solved.
Another problem - using AGREP's -d option for user-defined records on large files caused SIGSEGV segment violation - has also been fixed.

These are the new features introduced in AGREP version 3.33, which makes it differ from the GLIMSPE/AGREP versions of University of Arizona:

Wildcard expansion

List file (response file) option

Example:
agrep needle @listfile.lst

Support of different codepages

-CP nnn

nnn

Limitation: the version 3.37 only supports nnn=437 or nnn=850.

More information on codepages can be found here:

codepage 437 (US) and 850 (Latin-1) (list of pointers)
ISO 8859-1 National Caracter Set FAQ (FAQ; list of systems, programs, problems)
Was ist eine Codepage ? (in German)

Three different levels of case-insensitive searching:

option -i searches case-insensitive

Examples:

"Ä" matches "ä"
"À" matches "à"

but these do not match "a" or "á" !

option -ia searches case-insensitive

Example:

"A", "Ä", "À", "Á", "Â", "Ã", "Å", "Æ", "a", "ä", "à", "á", "â", "ã", "å", "æ,"
--> all match each other, you will find any of them by simply using (e.g.) "a" in the search pattern.

option -i0 searches case-sensitive (default).

case-sensitive

AGREPOPTS

Format search

option -i# format search
any letter matches any letter, any digit matches any digit
Example ("car number plate search"):
agrep -i# "aa11zz" haystack.*
will find "bb22yy"

Introduction of an optional environment variable AGREPOPTS

Example:

(in CONFIG.SYS) SET AGREPOPTS=-i -V4
make AGREP search case-insensitive by default, verbose level 4

Different levels of verbose option

-V version information only
-V0 no diagnostic messages at all (use the -V0 option together with -s option to avoid any output)
-V1 shows Grand Total (= count of records having matches; default)
-V2 shows Grand Total and case-in/sensitive option
-V3 shows Grand Total, case option, detected options and codepage number
-V4 shows almost everything (debug)
-V5 shows the translation table as used with -i, -ia, and -i# options

AGREP return codes have been revised. When calling AGREP, the return code reflects to the number of matches:

return code of AGREP meaning

>= 0 the total number of matches (zero means no match)

< 0 syntax error/s or inaccessible file/s

There was a problem of infinite loops for older AGREP versions.

The following can cause an infinite loop: agrep pattern * > output_file.

If the number of matches is high, they may be deposited in output_file before it is completely read leading to more matches of the pattern within output_file (the matches are against the whole directory).

It's not clear whether this is a "bug" (grep will do the same), but be warned.

This is solved in the new version.

Another bug in older versions:
A Boolean query with a pattern of length 1 (i.e., one character only) may miss matches.

This is solved now.

back to the top of this page

Limitations of AGREP 3.37

The stand-alone executable AGREP2.EXE for OS/2 cannot be run in a DOS box of Windows or OS/2. Please use AGREP.EXE instead, together with EMX.EXE, RSX.EXE or EMX.DLL for running under OS/2. AGREP2.EXE runs under OS/2 only.

The stand-alone executable AGREPDOS.EXE for DOS cannot be run in a DOS box of Windows or OS/2, please use AGREP.EXE instead, together with EMX.EXE, RSX.EXE. AGREPDOS.EXE runs under native DOS only (or with EMX.DLL under OS/2).

The stand-alone executable for 32-bit Windows (Win95 and Windows/NT) is AGREPW32.EXE. Since the release of AGREP version 3.35, the complete distribution package also contains a Makefile to be used with the Microsoft Visual C (VC) 5.0 resp. Borland C 5.02 compiler.

There are a few restrictions regarding the possible combinations of search options. AGREP will display a message, when it does not support the requested combination of options. As later versions could do, please check regularly this page for amendments.

Running AGREP.EXE in a DOS box together with RSX.EXE does not allow to search long file or directory names under Windows 95, Windows NT.

Known Bugs of AGREP 3.37

When searching user-defined records (with option -d),

in rare cases it happens that a record containing hits is not properly "cut out". But AGREP does not fail to find your needle, it's only a problem of presenting this record. The problem is under investigation. The option -r for recursing into subdirectories does work UNIX-like,

all files *.txt

subdirectory *.txt

download

forall

response file (listfile)

When using AGREP and pipes,

there could be some problems like the message "no target files found". I am working on that. My apologies. back to the top of this page

How can I use AGREP ?

The program itself shows six pages of on-line help - when you call it without any parameters.

Visit the help pages for AGREP here

There is a list of all options of AGREP and a lot of examples. If you find a problem, please send me your bug report.

back to the top of this page

Research & Development: AGREP's Wishlist (To-Do-List)

The following lists ideas, proposals, and extensions for improvements of AGREP, which I intend to implement in later versions. Please send me your proposals.

PERL and/or REXX-Interface

Recommended literature:

OS/2 REXX Program Reference REXXAPI.INF.
The Rexx API - An Introduction to Extending the Rexx Language (contents) (by Bill Potvin).
REGEXX.C, REGEXX.DEF and RGREP.CMD - Examples taken from \EMX\SAMPLES of emx (by Eberhard Mattes).

Return codes:
If you intend to call this version of AGREP from a PERL script, you probably want to avoid any output while keeping AGREPs return code. In this case, please use options -s (almost silent) together with -V0 (verbose nothing) to avoid the output of the Grand Total number of matches.

$retc = system('AGREP -sV0 ' . $needle . ' ' . $foo);
if ($retc > 0) { print('AGREP has found ' . $retc . ' matches.\n'); }

Userdefinable Codepages

Limit the Maximum Number of Matches (proposed by Scott Ng)

-Mn

Example:

Retrieval in context (RIC) = focus function

-d

The proposed extension of this option would be very useful when the target files do not have a certain record structure. In this case, one would prefer to run AGREP line oriented (default), but giving -dn would allow a range of ±n target lines to be displayed around the line with the needle. Preferable in combination with highlight function:

Highlight (mark, tag) the matches in the output record

needle

The strings could be composed of ANSI strings to select/deselect colours, or they could be used to generate HTML links. Preferable with RIC technique.

Implementation of Sunday's Optimal Mismatch Algorithm (for exact pattern searches)

New metasymbols for three predefined sets of characters

Examples: search for a car plate number which starts with "ABC" followed by 4 digits: "ABC%%%%" or to search for US patents "US%%%%%%%" or for European patents "EP%%%%%%%"

Dynamic Metasymbol Assignment DMSA

needle

It is planned to remove that restriction of AGREP in a later version.

back to the top of this page

Algorithms used in AGREP

The implementation of agrep includes the following algorithms.

Except for exact matching of simple patterns, for which we use a simple variation of the Boyer-Moore algorithm, all the algorithms (listed below) were designed by Sun Wu and Udi Manber.

bitap:

It supports many extensions such as approximate regular expression pattern matching, non-uniform costs, simultaneous matching of multiple patterns, mixed exact/approximate matching, etc.

The algorithm is described in agrep1.ps.

mgrep:

It assumes that the set of patterns contains k patterns, and that the shortest pattern is of size m.

See agrep2.ps for a brief description of the algorithm.

amonkey:

Let b = log_c (2*m), where c is the size of alphabet set. In the preprocessing, a table is built to determine whether a given substring of size b is in the pattern.

Suppose we are looking for matches with at most k errors.

The search is done in two passes:

In the first pass (the filtering pass), the areas in the text that have a possibility to contain the matches are marked.

The second pass finds the matches in those marked areas.

The search in the first pass is done in the following way.

Suppose the end position of the pattern is currently aligned with position tx in the text.

The algorithm scans backward from tx until either (k+1) blocks that do not occur in the pattern have been scanned, or the scan has passed position (tx-m+k).

In the former case, pattern is shifted forward to align the beginning position of the pattern with one character after the position in the text where the scan was stopped.

In the latter case, we marked tx-m to tx+m as a candidate area.

mmonkey:

For ASCII text and pattern, this algorithm is faster than amonkey.

The principle of the partition technique is as follows.

Let A and B be two strings of size m.

If we partition A into (k+1) blocks, then the distance between A and B is > k if none of the blocks of A occur in B.

This implies that to match A with no more than k errors, B has to contain a substring that matches exactly one block of A.

A brief description can be found in agrep2.ps

Copyright information

back to the top of this page

Literature:

The articles describe techniques for approximate pattern search in general and/or with respect to AGREP.
When you are interested in special areas (e.g. searching for multiple string patterns), please follow one of the links given here.

Graham A. Stephen: String Searching Algorithms book.

recommended reading on searching algorithms of all flavours !

Sun Wu and Udi Manber: Fast Text Searching Allowing Errors.

ACM

agrep1ps.zip

recommended reading to understand AGREP !

Sun Wu and Udi Manber: AGREP - A Fast Approximate Pattern-matching Tool.

agrep2ps.zip

recommended reading to understand AGREP !

Udi Manber and Sun Wu: Approximate Pattern Matching.

BYTE

More articles and recent papers on Udi Manber's homepage.

Gaston H. Gonnet and Ricardo Baeza-Yates:

source code

here

See sections Text Searching and Algorithms coded in Pascal and C,

Ricardo Baeza-Yates and W. Frakes:

source code

Information Retrieval: Algorithms and Data Structures

Ricardo Baeza-Yates and L. Fuentes:

Animation

Ricardo Baeza-Yates and Gonzalo Navarro:

Faster Algorithm for Approximate String Matching.
Proceeding of CPM'96, pp. 1-23.

List of publications and somesources

A must:
Hundreds of important Pattern Matching Pointers and other Web Resources (Lonardi)

Exact string matching algorithms(C sourcecode; visualization in Java; examples) (Charras/Lecroq)

A German article:
Guido Gronek: Ähnlichkeiten gesucht - Fehlertoleranter Suchalgorithmus "Shift-AND".

c't

Heise Verlag

source

ct9505.zip

back to the top of this page

Links:

The home page of AGREP (non-UNIX) http://www.tgries.de/agrep
The original home page of AGREP and GLIMPSE for UNIX http://glimpse.cs.arizona.edu
The home page of HARVEST http://harvest.transarc.com

Bug reports:

Please file bug reports here.

back to the top of this page
Back to my home page. AGREP V3.411, Homepage Version 1.19, T. Gries April 9, 2016
Page designed by Opteryx®
Opteryx Design