QUICK REFERENCE
  • The source code moved to Github 17.02.2013
    Download from git, compile and install:
    git clone git://github.com/wikinaut/agrep.git
    cd agrep
    make
    cp agrep /usr/local/bin
    
  • added the code for LINUX 25.09.2011 agrep-3.41.tgz
  • AGREP for Windows (compilation with MicroSoft Visual C++ 6.0 VC 6.0)
  • plans for an AGREP-Wiki
  • The versions AGREP 3.41 for LINUX, AGREP 3.37 for Windows 95/NT, AGREP 3.35 for OS/2 and DOS are based on the AGREP program from GLIMPSE 3.0. Many bugs have been identified and I regard my version as the most tested and debugged version of all AGREP versions I found until 31.05.2004. However, the core from which I started is extremely diffcult to read (apart from understanding) and lacking descriptions. However, my port never crashed in all the last eight years.
  • I was in contact with the authors and have asked the license departement of University of Arizona and the Webglimpse keepers several times to give it free (I also offered to "free the software"by giving a donation), but without success: AGREP is unfortunately still not under GNU/GPL license (see AGREP copyright - free for private and non-commercial use).
  •  If you use Windows and if you already use previous versions of AGREP, please upgrade to this version 3.37
  • The AGREP software and document parts of this page are still unchanged since 16. September 1998, because no further developements took place (yet). If you want to contribute an new port or implementation, please contact me (see below).
  • Download section - program, source files and documentation
  • What's new in AGREP 3.37 for OS/2, DOS, Win3.x and Windows 95, NT ?
  • Help Page is here - options and examples
  • Limitations and known bugs
  • The To-Do List - Research and development
  • Algorithms used within AGREP
  • Literature and Links
  • A must:  Hundreds of really relevant Pattern Matching Pointers and other web resources (Prof. Stefano Lonardi, University of California, Riverside, USA)
  • E-mail contact- see below
  • my home page
  • Imprint/Impressum


Some interesting links related to AGREP and newer developments as of 16.12.2007

PERL module String:Approx
Perl extension for approximate matching (fuzzy matching) by Jarkko Hietaniemi, Finland

String::Approx lets you match and substitute strings approximately. With this you can emulate errors: typing errorrs, speling errors, closely related vocabularies (colour color), genetic mutations (GAG ACT), abbreviations (McScot, MacScot).

Excerpt from the README file:
"The most important change was that the underlying algorithm waschanged completely. Instead of doing everything in Perl using regular expressions we now do the matching in C using the so-called Manber-Wuk-differences algorithm shift-add. You have met this algorithm if youhave used the agrep utility or the Glimpse indexing system.Because this implementation shares no code with agrep, only the well-publicized algorithms, the use of this software is limited only by my copyright (file COPYRIGHT). This interpretation has been kindly confirmed by Udi Manber. More details in the file README.apse."

TRE is a lightweight, robust, and efficient POSIX compliant regexp matching library with some exciting features such as approximate (fuzzy) matching.

At the core of TRE is a new algorithm for regular expression matching with submatch addressing. The algorithm uses linear worst-case time in the length of the text being searched, and quadratic worst-case time in the length of the used regular expression. In other words, the time complexity of the algorithm is O(M2N), where M is the length of the regular expression and N is the length of the text. The used space is also quadratic on the length of the regex, but does not depend on the searched string. This quadratic behaviour occurs only on pathological cases which are probably very rare in practice.

TRE is not just yet another regexp matcher. TRE has some features which are not there in most free POSIX compatible implementations. Most of these features are not present in non-free implementations either, for that matter.

Approximate pattern matching allows matches to be approximate, that is, allows the matches to be close to the searched pattern under some measure of closeness. TRE uses the edit-distance measure (also known as the Levenshtein distance) where characters can be inserted, deleted, or substituted in the searched text in order to get an exact match. Each insertion, deletion, or substitution adds the distance, or cost, of the match. TRE can report the matches which have a cost lower than some given threshold value. TRE can also be used to search for matches with the lowest cost.

TRE includes a version of the agrep command line tool for approximate regexp matching in the style of grep. Unlike other agrep implementations (like the one by Sun Wu and Udi Manber from University of Arizona available here) TRE agrep allows full regexps of any length, any number of errors, and non-uniform costs for insertion, deletion and substitution.

TRE is free; you can redistribute it and/or modify it under the terms of the GNU General Public License version 2 (June 1991) as published by the Free Software Foundation. See the file COPYING, included in the distribution packages, for the complete license.

If you cannot accept the GNU GPL, it may be possible that you can persuade me to give or sell you a version which is licensed differently. Please contact the author for more information.

Windows - C - Cygwin

The bitap library , another new and fresh implementation of the bitap algorithm.

"The bitap library is a clean implementation of regular expression (regex/grep) string matching using the bitap algorithm. Approximate (also known as fuzzy) matching is allowed. This is the same algorithm as the one used in Glimpse and agrep, but it is much more complete with regard to regular expression syntax, and is much cleaner. "

AGREPY  - A Python port of agrep string matching with errors (by  Michael J. Wise, August 2002)

Wiki-based Intranet- and Internet Search Assistant and bookmark sharing tool using auto-type detection and approximate and regular expression pattern shortcuts (WIBISA).
Research Disclosure RD488004, publication date 10.11.2004.

See http://www.wibisa.org


What is AGREP?

From the authors' notes:

AGREP is a powerful tool for fast searching a file or many files for a string or regular expression, with approximate matching capabilities and user-definable records. It was developed 1989-1991 by Udi Manber and Sun Wu at the University of Arizona and many others (please read CONTRIB.TXT and MANUAL.DOC). AGREP is similar to egrep (or grep or fgrep), but it is much more general and usually faster.

It also supports many kinds of queries including arbitrary wild cards, sets of patterns, and in general, regular expressions. It supports most of the options supported by the GREP family plus several more (but it is not 100% compatible with grep).

AGREP is the search engine and part of the GLIMPSE tool for searching and indexing whole file systems. GLIMPSE stands for Global Implicit Search and is part of the HARVEST Information Discovery and Access System.

AGREP belongs to the University of Arizona, which licenses it (see copyright).

It is not public-domain, but free for non-commercial use. And it comes with source!

back to the top of this page


AGREP ports to OS/2 and DOS and to Windows 95 and NT

The first port of AGREP to OS/2 was done by Robert Mahoney (2Rud Software).

Based on the version of AGREP as found in the GLIMPSE 3.0 package, I made this port to OS/2 and DOS using Eberhard Mattes' GNU C-compiler EMX 0.9c. As GLIMPSE is now at version 4.1, the relevant changes 3.x -> 4.x with respect to AGREP will be taken into account. Available are stand-alone versions for OS/2 (agrep2.exe V3.35), for native DOS (agrepdos.exe V3.35) and for 32-bit Windows systems (agrepw32.exe V3.37)

Rainhard Schnitker's RSX is a DOS DPMI-compliant runtime package, which makes a fourth version (agrep.exe V3.35) also run in a DOS box of OS/2, DOS box of Win3.x/Win95, under native DOS, and OS/2. But running this, you need one, two, or three additional files, depending on your operating system (see next section).

Dave was the first to compile a version for Windows 95 and NT using the Microsoft Visual C (VC) 5.0 compiler (January 1998).
Now we have a Windows 95 and NT version 3.37 compiled by Ron Aaron using Borland C 5.02 (September 1998).
 

back to the top of this page



 

AGREP files version 3.41 for LINUX)

agrep-3.41.tgz for LINUX (type "make" to compile)

AGREP files version 3.37 (3.35 for DOS and OS/2)

Executables only
see also limitations
Stand-alone executable for OS/2
Stand-alone executable for Win95 and Win/NT

recommended for most of the users

Stand-alone executable for native DOS
Windows DOS box, OS/2 DOS box, DOS, OS/2

recommended for users who need to run AGREP in a multi-OS environment

agrep2.zip (v3.35;128K) agrepw32.zip(94K) agrepdos.zip (v3.35;147K) agrep.zip (v3.35; 249K)
needs the included runtime modules EMX.EXE, EMX.DLL, RSX.EXE
All these Zip files include the COPYRIGHT and README files.

 
Complete package for W95/NT 
recommended set for developers
(Sources, Executables, Basic Documentation)
Complete package for OS/2 and DOS
recommended set for developers
(Sources, Executables, Basic Documentation) 

agrep337.zip (301K)
agrep335.zip (608K)

 
Utilities and hints
forall.zip (30K)

A utility written by Kai Uwe Rommel to execute a command or EXE (like AGREP.EXE) in all subdirectories. 
This neat tool can be found on LEO, look for toolsnn.zip (nn denotes the version number, for example tools55.zip) 

The following example shows the syntax 
to let AGREP search for "needle" through all files * in all subdirectories of C:\MAIL and to write all results into file AGREP.OUT:

FORALL -R -d C:\MAIL\* : AGREP needle @F\* >> AGREP.OUT

Remark: this does not search the C:\MAIL directory itself. 

Another method to search multiple files and/or subdirectories:

Use AGREP's built-in @reponsefile (listfile) option ! 

The DIR command can first be used to create a list of files: 
OS/2: DIR /F/S/A-D/-L/-P C:\MAIL\*.* > DIR.LST 
DOS: DIR /B/S/A-D/-L/-P C:\MAIL\*.* > DIR.LST 

Then you would invoke AGREP and pass that filename to it: 

AGREP needle @DIR.LST

Remember, that AGREP is faster when you load it once and let it search a bunch of files:

AGREP needle C:\MAIL\*


 
AGREP © by Udi Manber, University of Arizona, Sun Wu, Thomas Gries. 
AGREP is free for non commercial use. See copyright notice. 

 
EMX © by Eberhard Mattes. 
The OS/2 and DOS runtime support libraries EMX.DLL and EMX.EXE. 
EMX is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License version 2 as published by the Free Software Foundation. 
EMX is available here: 

http://archiv.leo.org/pub/comp/os/os2/leo/gnu/emx+gcc/

 
RSX © by Rainhard Schnitker. 
The DPMI DOS extender RSX allows to run 32-bit programs (like AGREP) in a DOS box of several operation systems. The extender RSX is 32-bit. The code is compiled with EMX+GCC. 
RSX is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License version 2 as published by the Free Software Foundation. 
RSX is available here: 

http://archiv.leo.org/pub/comp/os/os2/leo/gnu/emx+gcc/rsxnt/

back to the top of this page


AGREP: What was new in the versions 3.35 .. 3.37 ?

Here is the list of all options.
  • There are no new options in the 3.37 version, but several bugs have been fixed.
  • New makefile agrep.mak and conditional compiling using _WIN32 for the Microsoft Visual C (VC) 5.0 compiler to generate Windows 95 and Windows/NT stand-alone executables.
  • The -f option for multi-pattern searches (multiple patterns are read from a file) did not work in pre-3.35 versions. The problem has been found and solved.
  • Another problem - using AGREP's -d option for user-defined records on large files caused SIGSEGV segment violation - has also been fixed.
These are the new features introduced in AGREP version 3.33, which makes it differ from the GLIMSPE/AGREP versions of University of Arizona:
  • Wildcard expansion
  • List file (response file) option
      let agrep use the filenames in listfile.lst

      Example:
      agrep needle @listfile.lst
       

  • Support of different codepages
      • option -i searches case-insensitive

      •  
          ISO characters preserve their "Umlauts" or accents.

          Examples:

          "Ä" matches "ä"
          "À" matches "à"

          but these do not match "a" or "á" !
           

      • option -ia searches case-insensitive

      •  
          ISO characters are mapped to the nearest lower ASCII equivalent.

          Example:

          "A", "Ä", "À", "Á", "Â", "Ã", "Å", "Æ", "a", "ä", "à", "á", "â", "ã", "å", "æ,"
          --> all match each other, you will find any of them by simply using (e.g.) "a" in the search pattern.
           

      • option -i0 searches case-sensitive (default).

      •  
          This simply allows to force case-sensitive searches in case that you also use the environment variable AGREPOPTS saying to search case-insensitive.
  • Format search
  • option -i# format search
    any letter matches any letter, any digit matches any digit

    Example ("car number plate search"):
    agrep -i# "aa11zz" haystack.*
    will find "bb22yy"

  • Introduction of an optional environment variable AGREPOPTS
    1.  
      The environment variable can be used to preset the most preferred options, but actual parameters as given in the commandline take precedence.

      Example:

      (in CONFIG.SYS) SET AGREPOPTS=-i -V4
      make AGREP search case-insensitive by default, verbose level 4
       

  • Different levels of verbose option
  • -V version information only
    -V0 no diagnostic messages at all (use the -V0 option together with -s option to avoid any output)
    -V1 shows Grand Total (= count of records having matches; default)
    -V2 shows Grand Total and case-in/sensitive option
    -V3 shows Grand Total, case option, detected options and codepage number
    -V4 shows almost everything (debug)
    -V5 shows the translation table as used with -i, -ia, and -i# options
  • AGREP return codes have been revised. When calling AGREP, the return code reflects to the number of matches:
  •  
    return code of AGREP
    meaning
    >= 0
    the total number of matches (zero means no match)
    < 0
    syntax error/s or inaccessible file/s
  • There was a problem of infinite loops for older AGREP versions.
    1.  
      The following can cause an infinite loop: agrep pattern * > output_file.
      If the number of matches is high, they may be deposited in output_file before it is completely read leading to more matches of the pattern within output_file (the matches are against the whole directory).
      It's not clear whether this is a "bug" (grep will do the same), but be warned.

      This is solved in the new version.
       

      Another bug in older versions:
      A Boolean query with a pattern of length 1 (i.e., one character only) may miss matches.

      This is solved now.
       

back to the top of this page

Limitations of AGREP 3.37

The stand-alone executable AGREP2.EXE for OS/2 cannot be run in a DOS box of Windows or OS/2. Please use AGREP.EXE instead, together with EMX.EXE, RSX.EXE or EMX.DLL for running under OS/2. AGREP2.EXE runs under OS/2 only.

The stand-alone executable AGREPDOS.EXE for DOS cannot be run in a DOS box of Windows or OS/2, please use AGREP.EXE instead, together with EMX.EXE, RSX.EXE. AGREPDOS.EXE runs under native DOS only (or with EMX.DLL under OS/2).

The stand-alone executable for 32-bit Windows (Win95 and Windows/NT) is AGREPW32.EXE. Since the release of AGREP version 3.35, the complete distribution package also contains a Makefile to be used with the Microsoft Visual C (VC) 5.0 resp. Borland C 5.02 compiler.

There are a few restrictions regarding the possible combinations of search options. AGREP will display a message, when it does not support the requested combination of options. As later versions could do, please check regularly this page for amendments.

Running AGREP.EXE in a DOS box together with RSX.EXE does not allow to search long file or directory names under Windows 95, Windows NT.


Known Bugs of AGREP 3.37

When searching user-defined records (with option -d),
    in rare cases it happens that a record containing hits is not properly "cut out". But AGREP does not fail to find your needle, it's only a problem of presenting this record. The problem is under investigation.
The option -r for recursing into subdirectories does work UNIX-like,
    so if you start it as "AGREP -r needle *.txt" it simply searches in all files *.txt of the current subdirectory and in a subdirectory *.txt. This is not what a user normally wants. The -r option will be revised very soon; please refer to the examples in the download section and use the forall program or a response file (listfile), when you want to search through the subdirectories.
When using AGREP and pipes,
    there could be some problems like the message "no target files found". I am working on that. My apologies.
back to the top of this page


How can I use AGREP ?

The program itself shows six pages of on-line help - when you call it without any parameters.
 

Visit the help pages for AGREP here

 

There is a list of all options of AGREP and a lot of examples. If you find a problem, please send me your bug report.

back to the top of this page


Research & Development: AGREP's Wishlist (To-Do-List)

The following lists ideas, proposals, and extensions for improvements of AGREP, which I intend to implement in later versions. Please send me your proposals.
  • PERL and/or REXX-Interface
       
      The Rexx Interface allows the simple call of AGREP as a function from any user-written Rexx program.

      Recommended literature:

      OS/2 REXX Program Reference REXXAPI.INF.
      The Rexx API - An Introduction to Extending the Rexx Language (contents) (by Bill Potvin).
      REGEXX.C, REGEXX.DEF and RGREP.CMD - Examples taken from \EMX\SAMPLES of emx (by Eberhard Mattes).

      Return codes:
      If you intend to call this version of AGREP from a PERL script, you probably want to avoid any output while keeping AGREPs return code. In this case, please use options -s (almost silent) together with -V0 (verbose nothing) to avoid the output of the Grand Total number of matches.

      $retc = system('AGREP -sV0 ' . $needle . ' ' . $foo);
      if ($retc > 0) { print('AGREP has found ' . $retc . ' matches.\n'); }

         
  • Userdefinable Codepages
    •  
      Partially done. Can be achieved by editing CODEPAGE.C and recompiling
  • Limit the Maximum Number of Matches (proposed by Scott Ng)
    •  
      A new option -Mn let the user specify the maximum number of matches in order to exit out without going through the entire search:

      Example:

         
      agrep -M100 "ABC"
      will only return the first 100 matched lines (or records) even if the total number of matches is, e.g. 1000.
         
  • Retrieval in context (RIC) = focus function
    •  
      One of the most powerful features is already the -d option allowing user-definable records.

      The proposed extension of this option would be very useful when the target files do not have a certain record structure. In this case, one would prefer to run AGREP line oriented (default), but giving -dn would allow a range of ±n target lines to be displayed around the line with the needle. Preferable in combination with highlight function:
       

  • Highlight (mark, tag) the matches in the output record
    •  
      Allow user-definable prefix- and suffix strings to mark the needle in the output record.

      The strings could be composed of ANSI strings to select/deselect colours, or they could be used to generate HTML links. Preferable with RIC technique.
       

  • Implementation of Sunday's Optimal Mismatch Algorithm (for exact pattern searches)

  •  
  • New metasymbols for three predefined sets of characters
    •  
      @ all letters
      % all digits
      ~ all the rest

      Examples: search for a car plate number which starts with "ABC" followed by 4 digits: "ABC%%%%" or to search for US patents "US%%%%%%%" or for European patents "EP%%%%%%%"
       

  • Dynamic Metasymbol Assignment DMSA
    •  
      In its current implementation, AGREP needs sixteen characters from the character set internally.
      The graphic characters, which are not common to text files, cannot be searched at the moment and must therefore not appear in the needle string. A warning message is shown.

      It is planned to remove that restriction of AGREP in a later version.

back to the top of this page

Algorithms used in AGREP

/* Copyright (c) 1994 Sun Wu, Udi Manber, Burra Gopal. All Rights Reserved. */

The implementation of agrep includes the following algorithms.

Except for exact matching of simple patterns, for which we use a simple variation of the Boyer-Moore algorithm, all the algorithms (listed below) were designed by Sun Wu and Udi Manber.

  • bitap:
    •  
      The most general algorithm inside agrep.

      It supports many extensions such as approximate regular expression pattern matching, non-uniform costs, simultaneous matching of multiple patterns, mixed exact/approximate matching, etc.

      The algorithm is described in agrep1.ps.
       

  • mgrep:
    •  
      A sub-linear expect-time algorithm for matching a set of patterns.

      It assumes that the set of patterns contains k patterns, and that the shortest pattern is of size m.

      See agrep2.ps for a brief description of the algorithm.
       

  • amonkey:
    •  
      A Boyer-Moore style algorithm for approximate pattern matching.

      Let b = log_c (2*m), where c is the size of alphabet set. In the preprocessing, a table is built to determine whether a given substring of size b is in the pattern.

      Suppose we are looking for matches with at most k errors.

      The search is done in two passes:

      In the first pass (the filtering pass), the areas in the text that have a possibility to contain the matches are marked.

      The second pass finds the matches in those marked areas.

      The search in the first pass is done in the following way.

      Suppose the end position of the pattern is currently aligned with position tx in the text.

      The algorithm scans backward from tx until either (k+1) blocks that do not occur in the pattern have been scanned, or the scan has passed position (tx-m+k).

      In the former case, pattern is shifted forward to align the beginning position of the pattern with one character after the position in the text where the scan was stopped.

      In the latter case, we marked tx-m to tx+m as a candidate area.
       

  • mmonkey:
    •  
      Combining the mgrep algorithm with a partition technique, we have an algorithm with the same time complexity as amonkey.

      For ASCII text and pattern, this algorithm is faster than amonkey.

      The principle of the partition technique is as follows.

      Let A and B be two strings of size m.

      If we partition A into (k+1) blocks, then the distance between A and B is > k if none of the blocks of A occur in B.

      This implies that to match A with no more than k errors, B has to contain a substring that matches exactly one block of A.

      A brief description can be found in agrep2.ps

back to the top of this page
 

Copyright information

The following is from the copyright of the AGREP zip file: 
    This material was developed by Sun Wu and Udi Manber at the University of Arizona, Department of Computer Science. Permission is granted to copy this software, to redistribute it on a nonprofit basis, and to use it for any purpose, subject to the following restrictions and understandings.
     
  1. Any copy made of this software must include this copyright notice in full. 
  2. All materials developed as a consequence of the use of this software shall duly acknowledge such use, in accordance with the usual standards of acknowledging credit in academic research. 
  3. The authors have made no warranty or representation that the operation of this software will be error-free or suitable for any application, and they are under under no obligation to provide any services, by way of maintenance, update, or otherwise. The software is an experimental prototype offered on an as-is basis. 
  4. Redistribution for profit requires the express, written permission of the authors. 

back to the top of this page


Literature:

The articles describe techniques for approximate pattern search in general and/or with respect to AGREP.
When you are interested in special areas (e.g. searching for multiple string patterns), please follow one of the links given here.
 
 
Graham A. Stephen: String Searching Algorithms book.
    Lecture Notes Series on Computing - Vol. 3, World Scientific Publishing Co., Singapore 1994.
    ISBN 981-02-1829-X
    recommended reading on searching algorithms of  all flavours !
Sun Wu and Udi Manber: Fast Text Searching Allowing Errors.
    Communications of the ACM, pp. 83-91, Vol. 35, No. 10, Oct. 1992, USA.
    PostScript version of the paper in its submitted form: agrep1.ps (agrep1ps.zip 59K)
    recommended reading to understand AGREP !
Sun Wu and Udi Manber: AGREP - A Fast Approximate Pattern-matching Tool.
    Proceedings of the Winter 1992 USENIX Conference San Francisco, USA, 20.-24. Jan. 1992,
    pp. 153-162, Berkeley, USA, 1991.
    PostScript version of the paper in its submitted form: agrep2.ps (agrep2ps.zip 35K)
    recommended reading to understand AGREP !
Udi Manber and Sun Wu: Approximate Pattern Matching.
    BYTE pp. 281-292, November 1992.
More articles and recent papers on Udi Manber's homepage.
Gaston H. Gonnet and Ricardo Baeza-Yates:
Ricardo Baeza-Yates and W. Frakes:
Ricardo Baeza-Yates and L. Fuentes:
    Source code of Xaa: Animationof String Searching Algorithms
Ricardo Baeza-Yates and Gonzalo Navarro:
Faster Algorithm for Approximate String Matching.
Proceeding of CPM'96, pp. 1-23.


List of publications and some sources

A must:
Hundreds of important Pattern Matching Pointers and other Web Resources (Lonardi)

Exact string matching algorithms(C sourcecode; visualization in Java; examples) (Charras/Lecroq)


A German article:
Guido Gronek: Ähnlichkeiten gesucht - Fehlertoleranter Suchalgorithmus "Shift-AND".


back to the top of this page


Links:

The home page of AGREP (non-UNIX) http://www.tgries.de/agrep
The original home page of AGREP and GLIMPSE for UNIX http://glimpse.cs.arizona.edu
The home page of HARVEST http://harvest.transarc.com


E-Mail addresses:

Please send bug reports of this OS/2 and DOS port of AGREP to


The author of AGREP, Prof. Udi Manber, Univ. of Arizona (no email disclosed here due)
To be put on the GLIMPSE mailing list for announcements related to GLIMPSE and/or AGREP  send a mail to glimpse-request@cs.arizona.edu



back to the top of this page
Back to my home page.
AGREP V3.411, Homepage Version 1.17, T. Gries November 12, 2013
Page designed by Opteryx®
Opteryx Design