X-Hacker.org- FAST TEXT SEARCH for Clipper v.2.0

  General Principles

   The FAST TEXT SEARCH for Clipper system has been designed to offer 
   the developer great flexibility and provide an inexpensive solution 
   for the great majority of text search problems. The FAST TEXT SEARCH 
   functions are used to create, maintain and search one or more CFTS 
   index files. In the demonstration programs, these index files are 
   given the extension .IA and this document will refer to CFTS index 
   files as .IA files. CFTS indexes contain fixed-length keys (index 
   records) which track the contents of text records. A text record may 
   consist of any arbitrary block of text. In the case of a .DBF 
   (Clipper) format file (see SHOW1.PRG), a text record would usually 
   consist of the concatenation of selected fields from a data record. A 
   text record might also be a line (see SHOW2.PRG) or paragraph from a 
   text file or even an entire file. For example, a correspondence 
   management/tracking system made up of many individual WordPerfect 
   files might be structured so that each file is a text record. In all 
   cases, there will be one index record for each and every text record. 
   This one-to-one relationship is critical to the proper operation of a 
   FAST TEXT SEARCH system.

   An .IA file is created by CftsCrea(). CftsCrea() builds the .IA file 
   header and establishes the index's attributes. It does not add records 
   to the file; that is done by CftsAdd(). See the discussion of 
   Cfts_Index() in CFTS87.PRG or CFTS5.PRG. This UDF combines CftsCrea(), 
   CftsAdd(), CftsClose() and CftsOpen() to create and populate an .IA 
   file.

   The contents of each text record are passed to CFTS, a key is built 
   according to the parameters of CftsCrea() and an index record is added 
   to the .IA file using CftsAdd(). These index records are given numbers 
   (beginning with 1) which are returned by CftsAdd(). Index record 
   numbers are not database record numbers. They are created by CFTS and 
   are incremented by one as each record is added. Again using the 
   example of a .DBF file, if the records are read and passed to CFTS in 
   natural order (that is, not in an index order), the database record 
   numbers will be the same as the index record numbers. If each text 
   record is a line of text, each index record number will correspond to 
   a line number. CFTS is not aware of the origin of the text strings it 
   receives. It merely receives a string, constructs an index key, adds 
   that key to the .IA file, increments its internal counter and returns 
   the value of the counter. It is the responsibility of the application 
   to manage the text records. CFTS will manage the index records.
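
   The numbering behaviour described above can be modelled with a few 
   lines of Python. This is only an illustrative sketch: ToyTextIndex and 
   its add() method are hypothetical names, not part of CFTS, and the 
   stored "key" is a placeholder rather than a real CFTS signature.

```python
class ToyTextIndex:
    """Hypothetical stand-in for an .IA file: one key per text record."""

    def __init__(self):
        self.keys = []  # keys[i] belongs to index record number i + 1

    def add(self, text):
        # Placeholder key. Real CFTS keys are built from the text according
        # to the CftsCrea() parameters; here we store only the text length.
        self.keys.append(len(text))
        # The counter doubles as the index record number (starting at 1).
        return len(self.keys)


# Treat each line of a text file as one text record.
lines = ["first line", "second line", "third line"]
index = ToyTextIndex()
numbers = [index.add(line) for line in lines]
print(numbers)  # [1, 2, 3]
```

   Because the lines are added in their natural order, each returned 
   number equals the line number. The index itself knows nothing about 
   where the strings came from.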

   The .IA file is initially built like other types of indexes. 
   Maintenance operations such as adds, replaces, deletes and undeletes 
   are performed on individual index records. The .IA file will not have 
   to be rebuilt unless the data on which it is based is significantly 
   altered. An example of this is when a .DBF file is PACKed. A PACK 
   removes all data records that have been marked for deletion. The .IA 
   file must be rebuilt if any .DBF records were removed because the 
   one-to-one relationship between data records (text records) and index 
   records will have been destroyed. Similarly, if files in a document 
   database or lines of a text file were erased, the related .IA indexes 
   should be rebuilt.
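
   To see why a PACK invalidates the index, consider this small Python 
   model. It is a sketch under simple assumptions: the data file is a 
   list, record numbers are list positions starting at 1, and the index 
   is represented by a dictionary from index record number to the text 
   that was indexed. None of these names come from CFTS or Clipper.

```python
records = ["alpha", "bravo", "charlie", "delta"]
deleted = {2}  # record number 2 ("bravo") is marked for deletion

# Index built before the PACK: index record n describes data record n.
index_to_text = {n: text for n, text in enumerate(records, start=1)}

# Model of a PACK: deleted records are removed and the rest are renumbered.
packed = [text for n, text in enumerate(records, start=1) if n not in deleted]

# An index record is stale if its number no longer points at the same text.
stale = [
    n for n, text in index_to_text.items()
    if n > len(packed) or packed[n - 1] != text
]
print(stale)  # [2, 3, 4]
```

   Removing a single record shifts every record after it, so every index 
   record from that point on refers to the wrong text. This is why the 
   index has to be rebuilt rather than patched.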

   Searches are performed by passing the string or strings being searched 
   for (search string) to CFTS along with the identifier of the 
   appropriate .IA file. The .IA file is searched and the index record 
   numbers of those records containing matches are returned. The text 
   record related to each returned index record number is then inspected 
   to verify that a match has actually been found. This verification 
   process is necessary because CFTS will sometimes return aliases. See 
   CftsVeri().

   An .IA index is not an inversion or compression of the original 
   textual data. Its keys track the occurrence of text signatures. It 
   identifies matches as those index records containing all the 
   signatures or attributes of the search string(s). It is possible that 
   differing strings will resolve to the same signature. We refer to 
   these as aliases. It is important to note that while an .IA index 
   search may mistakenly identify a record as containing a match, it will 
   never fail to find a record that does match. The speed of the index 
   search allows for verification as well as other post-search operations 
   while still providing exceptionally rapid results.
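
   The following Python sketch illustrates the general idea of a 
   signature index with a verification pass. It does not reproduce the 
   actual CFTS key format, which is not described here. As an assumption 
   for illustration, every 3-character sequence is hashed to one bit of a 
   128-bit key; the function names signature(), search() and verify() are 
   hypothetical.

```python
import zlib

KEY_BITS = 16 * 8  # models a 16-byte key; the hashing scheme is an assumption


def signature(text):
    """Fold every 3-character sequence of text into a fixed-size bit mask."""
    text = text.lower()
    mask = 0
    for i in range(len(text) - 2):
        bit = zlib.crc32(text[i:i + 3].encode("utf-8")) % KEY_BITS
        mask |= 1 << bit
    return mask


def search(keys, query):
    """Return index record numbers whose key has every bit of the query set."""
    wanted = signature(query)
    return [n for n, key in enumerate(keys, start=1) if (key & wanted) == wanted]


def verify(records, candidates, query):
    """Drop aliases by checking the actual text of each candidate record."""
    q = query.lower()
    return [n for n in candidates if q in records[n - 1].lower()]


records = [
    "Invoice for clipper compiler licence",
    "Letter regarding overdue payment",
    "Notes on text search performance",
]
keys = [signature(r) for r in records]       # one key per text record
candidates = search(keys, "search")          # may include aliases
hits = verify(records, candidates, "search")
print(hits)  # [3]
```

   Any record that really contains the search string contains all of its 
   3-character sequences, so all of the corresponding bits are set and 
   the record is always a candidate. Different sequences can share a bit, 
   which is how aliases arise and why the verification pass is needed.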

   As mentioned above, the CFTS system has been built to provide fast 
   text searches in a wide variety of applications. There are limits, 
   however. It will be less useful in two extreme situations. The first 
   is where text records are very small. The smallest .IA index key is 16 
   bytes. When individual text records are also that small, the index 
   file overhead can be 100% or more. The other extreme is where 
   individual text records are very large. The largest .IA index key 
   is 64 bytes. It is easy to imagine that indexing 64 kilobytes of text 
   with a single 64 byte key is not practical. Additionally, because CFTS 
   identifies matches on a text record level, there would still be 
   significant work left to find the string(s) within the 64K block of 
   text that CFTS identifies as containing a match. See the section below 
   on CftsCrea() for more on record vs. key size.
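
   The two extremes can be put in numbers with a short calculation. This 
   sketch counts only the index key against the size of the text record 
   it describes; it ignores the .IA file header, and the function name 
   key_overhead_percent() is hypothetical.

```python
def key_overhead_percent(key_bytes, record_bytes):
    """Index key size as a percentage of the text record it describes."""
    return 100.0 * key_bytes / record_bytes


# Smallest key against an equally small text record.
print(key_overhead_percent(16, 16))         # 100.0

# Largest key against a 64 kilobyte text record.
print(key_overhead_percent(64, 64 * 1024))  # 0.09765625
```

   At the small extreme the index is as large as the data. At the large 
   extreme the key is under a tenth of a percent of the record, which is 
   cheap to store but leaves very little information to discriminate 
   between records.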
Online resources provided by: http://www.X-Hacker.org --- NG 2 HTML conversion by Dave Pearson