From: drh Date: Sat, 27 Apr 2013 18:06:40 +0000 (+0000) Subject: Remove spellfix virtual table documentation from the source tree. X-Git-Tag: version-3.7.17~44 X-Git-Url: http://git.ipfire.org/?a=commitdiff_plain;h=015db9c85995e1661079afd11fb7c45a95ceee7e;p=thirdparty%2Fsqlite.git Remove spellfix virtual table documentation from the source tree. Reference the separate documentation on the website instead. FossilOrigin-Name: adcf78909ff9064b6e3c4dd15ccd3245c8cf270b --- diff --git a/ext/misc/editdist3.wiki b/ext/misc/editdist3.wiki deleted file mode 100644 index 494922c5e9..0000000000 --- a/ext/misc/editdist3.wiki +++ /dev/null @@ -1,114 +0,0 @@ -The editdist3 algorithm - -The editdist3 algorithm is a function that computes the minimum edit distance -(a.k.a. the Levenshtein distance) between two input strings. Features of -editdist3 include: - - * It works with unicode (UTF8) text. - - * A table of insertion, deletion, and substitution costs can be - provided by the application. - - * Multi-character insertions, deletions, and substitutions can be - enumerated in the cost table. - -

The COST table

- -To program the costs of editdist3, create a table such as the following: - -
-CREATE TABLE editcost(
-  iLang INT,   -- The language ID
-  cFrom TEXT,  -- Convert text from this
-  cTo   TEXT,  -- Convert text into this
-  iCost INT    -- The cost of doing the conversion
-);
-
- -The cost table can be named anything you want - it does not have to be called -"editcost". And the table can contain additional columns. However, the -table must contain the four columns shown above, with exactly the names shown. - -The iLang column is a non-negative integer that identifies a set of costs -appropriate for a particular language. The editdist3 function will only use -a single iLang value for any given edit-distance computation. The default -value is 0. It is recommended that applications that only need to use a -single language always use iLang==0 for all entries. - -The iCost column is the numeric cost of transforming cFrom into cTo. This -value should be a non-negative integer, and should probably be less than 100. -The default single-character insertion and deletion costs are 100 and the -default single-character to single-character substitution cost is 150. A -cost of 10000 or more is considered "infinite" and causes the rule to be -ignored. - -The cFrom and cTo columns show edit transformation strings. Either or both -columns may contain more than one character. Or either column (but not both) -may hold an empty string. When cFrom is empty, that is the cost of inserting -cTo. When cTo is empty, that is the cost of deleting cFrom. - -In the spellfix1 algorithm, cFrom is the text as the user entered it and -cTo is the correctly spelled text as it exists in the database. The goal -of the editdist3 algorithm is to determine how close the user-entered text is -to the dictionary text. - -There are three special-case entries in the cost table: - - - - - - -
cFrom | cTo | Meaning
''    | '?' | The default insertion cost
'?'   | ''  | The default deletion cost
'?'   | '?' | The default substitution cost
- -If any of the special-case entries shown above are omitted, then the -value of 100 is used for insertion and deletion and 150 is used for -substitution. To disable the default insertion, deletion, and/or substitution -set their respective cost to 10000 or more. - -Other entries in the cost table specify transforms for particular characters. -The cost of specific transforms should be less than the default costs, or else -the default costs will take precedence and the specific transforms will never -be used. - -Some example cost table entries: - -
-INSERT INTO editcost(iLang, cFrom, cTo, iCost)
-VALUES(0, 'a', 'ä', 5);
-
- -The rule above says that the letter "a" in user input can be matched against -the letter "ä" in the dictionary with a penalty of 5. - -
-INSERT INTO editcost(iLang, cFrom, cTo, iCost)
-VALUES(0, 'ss', 'ß', 8);
-
- -The number of characters in cFrom and cTo do not need to be the same. The -rule above says that "ss" on user input will match "ß" with a penalty of 8. - -
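The cost table described above is an ordinary SQL table, so it can be built and inspected without the spellfix1 extension. The following is a minimal sketch using Python's standard sqlite3 module. The helper name build_cost_table is hypothetical and not part of SQLite; the rows are the three special-case defaults (100, 100, 150) and the two example rules from the text. It only creates and populates the table and does not load or call editdist3.

```python
import sqlite3


def build_cost_table(conn, table="editcost"):
    """Create a cost table with the four required columns and add sample rules.

    Hypothetical helper for illustration only. The table name is interpolated
    directly, so it must come from trusted code, not user input.
    """
    conn.execute(
        f"CREATE TABLE {table}(iLang INT, cFrom TEXT, cTo TEXT, iCost INT)"
    )
    rules = [
        (0, "", "?", 100),   # default insertion cost
        (0, "?", "", 100),   # default deletion cost
        (0, "?", "?", 150),  # default substitution cost
        (0, "a", "ä", 5),    # "a" in user input may match "ä"
        (0, "ss", "ß", 8),   # multi-character cFrom, single-character cTo
    ]
    conn.executemany(
        f"INSERT INTO {table}(iLang, cFrom, cTo, iCost) VALUES(?,?,?,?)",
        rules,
    )
    return len(rules)


conn = sqlite3.connect(":memory:")
inserted = build_cost_table(conn)
# Cheapest rules first: the specific transforms sort ahead of the defaults.
rows = conn.execute(
    "SELECT cFrom, cTo, iCost FROM editcost WHERE iLang=0 ORDER BY iCost"
).fetchall()
```

Keeping the specific rules cheaper than the defaults matters because, as noted above, a specific transform that costs more than the default is never used.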

Experimenting with the editdist3() function

- -The [./spellfix1.wiki | spellfix1 virtual table] -uses editdist3 if the "edit_cost_table=TABLE" option -is specified as an argument when the spellfix1 virtual table is created. -But editdist3 can also be tested directly using the built-in "editdist3()" -SQL function. The editdist3() SQL function has 3 forms: - - 1. editdist3('TABLENAME'); - 2. editdist3('string1', 'string2'); - 3. editdist3('string1', 'string2', langid); - -The first form loads the edit distance coefficients from a table called -'TABLENAME'. Any prior coefficients are discarded. So when experimenting -with weights and the weight table changes, simply rerun the single-argument -form of editdist3() to reload revised coefficients. Note that the -edit distance -weights used by the editdist3() SQL function are independent from the -weights used by the spellfix1 virtual table. - -The second and third forms return the computed edit distance between strings -'string1' and 'string2'. In the second form, a language id of 0 is used. -The language id is specified in the third form. diff --git a/ext/misc/spellfix.c b/ext/misc/spellfix.c index e0a7d1f374..c368b34e8f 100644 --- a/ext/misc/spellfix.c +++ b/ext/misc/spellfix.c @@ -12,7 +12,7 @@ ** ** This module implements the spellfix1 VIRTUAL TABLE that can be used ** to search a large vocabulary for close matches. See separate -** documentation files (spellfix1.wiki and editdist3.wiki) for details. +** documentation (http://www.sqlite.org/spellfix1.html) for details. */ #include "sqlite3ext.h" SQLITE_EXTENSION_INIT1 diff --git a/ext/misc/spellfix1.wiki b/ext/misc/spellfix1.wiki deleted file mode 100644 index 288c55dd2e..0000000000 --- a/ext/misc/spellfix1.wiki +++ /dev/null @@ -1,464 +0,0 @@ -The Spellfix1 Virtual Table - -This spellfix1 virtual table is used to search -a large vocabulary for close matches. For example, spellfix1 -can be used to suggest corrections to misspelled words. 
Or, -it could be used with FTS4 to do full-text search using potentially -misspelled words. - -Create an instance of the spellfix1 virtual table like this: - -
-CREATE VIRTUAL TABLE demo USING spellfix1;
-
- -The "spellfix1" term is the name of this module and must be entered as -shown. The "demo" term is the -name of the virtual table you will be creating and can be altered -to suit the needs of your application. The virtual table is initially -empty. In order for the virtual table to be useful, you will need to -populate it with your vocabulary. Suppose you -have a list of words in a table named "big_vocabulary". Then do this: - -
-INSERT INTO demo(word) SELECT word FROM big_vocabulary;
-
- -If you intend to use this virtual table in cooperation with an FTS4 -table (for spelling correction of search terms) then you might extract -the vocabulary using an fts4aux table: - -
-INSERT INTO demo(word) SELECT term FROM search_aux WHERE col='*';
-
- -You can also provide the virtual table with a "rank" for each word. -The "rank" is an estimate of how common the word is. Larger numbers -mean the word is more common. If you omit the rank when populating -the table, then a rank of 1 is assumed. But if you have rank -information, you can supply it and the virtual table will show a -slight preference for selecting more commonly used terms. To -populate the rank from an fts4aux table "search_aux" do something -like this: - -
-INSERT INTO demo(word,rank)
-   SELECT term, documents FROM search_aux WHERE col='*';
-
- -To query the virtual table, include a MATCH operator in the WHERE -clause. For example: - -
-SELECT word FROM demo WHERE word MATCH 'kennasaw';
-
- -Using a dataset of American place names (derived from -[http://geonames.usgs.gov/domestic/download_data.htm]) the query above -returns 20 results beginning with: - -
-kennesaw
-kenosha
-kenesaw
-kenaga
-keanak
-
- -If you append the character '*' to the end of the pattern, then -a prefix search is performed. For example: - -
-SELECT word FROM demo WHERE word MATCH 'kennes*';
-
- -Yields 20 results beginning with: - -
-kennesaw
-kennestone
-kenneson
-kenneys
-keanes
-keenes
-
- -

Search Refinements

- -By default, the spellfix1 table returns no more than 20 results. -(It might return fewer than 20 if there were fewer good matches.) -You can change the upper bound on the number of returned rows by -adding a "top=N" term to the WHERE clause of your query, where N -is the new maximum. For example, to see the 5 best matches: - -
-SELECT word FROM demo WHERE word MATCH 'kennes*' AND top=5;
-
- -Each entry in the spellfix1 virtual table is associated with -a particular language, identified by the integer "langid" column. -The default langid is 0 and if no other actions are taken, the -entire vocabulary is a part of the 0 language. But if your application -needs to operate in multiple languages, then you can specify different -vocabulary items for each language by specifying the langid field -when populating the table. For example: - -
-INSERT INTO demo(word,langid) SELECT word, 0 FROM en_vocabulary;
-INSERT INTO demo(word,langid) SELECT word, 1 FROM de_vocabulary;
-INSERT INTO demo(word,langid) SELECT word, 2 FROM fr_vocabulary;
-INSERT INTO demo(word,langid) SELECT word, 3 FROM ru_vocabulary;
-INSERT INTO demo(word,langid) SELECT word, 4 FROM cn_vocabulary;
-
- -After the virtual table has been populated with items from multiple -languages, specify the language of interest using a "langid=N" term -in the WHERE clause of the query: - -
-SELECT word FROM demo WHERE word MATCH 'hildes*' AND langid=1;
-
- -Note that if you do not include the "langid=N" term in the WHERE clause, -the search will be against language 0 (English in the example above.) -All spellfix1 searches are against a single language id. There is no -way to search all languages at once. - - -

Virtual Table Details

- -The virtual table actually has a unique rowid with seven columns plus five -extra hidden columns. The columns are as follows: - -
-
rowid
-A unique integer number associated with each -vocabulary item in the table. This can be used -as a foreign key on other tables in the database. - -
word
-The text of the word that matches the pattern. -Both word and pattern can contain unicode characters -and can be mixed case. - -
rank
-This is the rank of the word, as specified in the -original INSERT statement. - - -
distance
-This is an edit distance or Levenshtein distance going -from the pattern to the word. - -
langid
-This is the language-id of the word. All queries are -against a single language-id, which defaults to 0. -For any given query this value is the same on all rows. - -
score
-The score is a combination of rank and distance. The -idea is that a lower score is better. The virtual table -attempts to find words with the lowest score and -by default (unless overridden by ORDER BY) returns -results in order of increasing score. - -
matchlen
-In a prefix search, the matchlen is the number of characters in -the string that match against the prefix. For a non-prefix search, -this is the same as length(word). - -
phonehash
-This column shows the phonetic hash prefix that was used to restrict -the search. For any given query, this column should be the same for -every row. This information is available for diagnostic purposes and -is not normally considered useful in real applications. - -
top
-(HIDDEN) For any query, this value is the same on all -rows. It is an integer which is the maximum number of -rows that will be output. The actual number of rows -output might be less than this number, but it will never -be greater. The default value for top is 20, but that -can be changed for each query by including a term of -the form "top=N" in the WHERE clause of the query. - -
scope
-(HIDDEN) For any query, this value is the same on all -rows. The scope is a measure of how widely the virtual -table looks for matching words. Smaller values of -scope cause a broader search. The scope is normally -chosen automatically and is capped at 4. Applications -can change the scope by including a term of the form -"scope=N" in the WHERE clause of the query. Increasing -the scope will make the query run faster, but will reduce -the possible corrections. - -
srchcnt
-(HIDDEN) For any query, this value is the same on all -rows. This value is an integer which is the number -of words examined using the edit-distance algorithm to -find the top matches that are ultimately displayed. This -value is for diagnostic use only. - -
soundslike
-(HIDDEN) When inserting vocabulary entries, this field -can be set to a spelling that matches what the word -sounds like. See the DEALING WITH UNUSUAL AND DIFFICULT -SPELLINGS section below for details. - -
command
-(HIDDEN) The value of the "command" column is always NULL. However, -applications can insert special strings into the "command" column in order -to provoke certain behaviors in the spellfix1 virtual table. -For example, inserting the string 'reset' into the "command" column -will cause the virtual table to reread its edit distance weights -(if there are any). -
- -

Algorithm

- -The spellfix1 virtual table creates a single -shadow table named "%_vocab" (where the % is replaced by the name of -the virtual table; Ex: "demo_vocab" for the "demo" virtual table). -The shadow table contains the following columns: - -
-
id
-The unique id (INTEGER PRIMARY KEY) - -
rank
-The rank of word. - -
langid
-The language id for this entry. - -
word
-The original UTF8 text of the vocabulary word - -
k1
-The word transliterated into lower-case ASCII. -There is a standard table of mappings from non-ASCII -characters into ASCII. Examples: "æ" -> "ae", -"þ" -> "th", "ß" -> "ss", "á" -> "a", ... The -accessory function spellfix1_translit(X) will do -the non-ASCII to ASCII mapping. The built-in lower(X) -function will convert to lower-case. Thus: -k1 = lower(spellfix1_translit(word)). - -
k2
-This field holds a phonetic code derived from k1. Letters -that have similar sounds are mapped into the same symbol. -For example, all vowels and vowel clusters become the -single symbol "A". And the letters "p", "b", "f", and -"v" all become "B". All nasal sounds are represented -as "N". And so forth. The mapping is based on -ideas found in Soundex, Metaphone, and other -long-standing phonetic matching systems. This key can -be generated by the function spellfix1_phonehash(X). -Hence: k2 = spellfix1_phonehash(k1) -
- -There is also a function for computing the Wagner edit distance or the -Levenshtein distance between a pattern and a word. This function -is exposed as spellfix1_editdist(X,Y). The edit distance function -returns the "cost" of converting X into Y. Some transformations -cost more than others. Changing one vowel into a different vowel, -for example, is relatively cheap, as is doubling a consonant, or -omitting the second character of a double-consonant. Other transformations -are more expensive. The idea is that the edit distance function returns -a low cost for words that are similar and a higher cost for words -that are further apart. In this implementation, the maximum cost -of any single-character edit (delete, insert, or substitute) is 100, -with lower costs for some edits (such as transforming vowels). - -The "score" for a comparison is the edit distance between the pattern -and the word, adjusted down by the base-2 logarithm of the word rank. -For example, a match with distance 100 but rank 1000 would have a -score of 122 (= 100 - log2(1000) + 32) whereas a match with distance -100 with a rank of 1 would have a score of 131 (100 - log2(1) + 32). -(NB: The constant 32 is added to each score to keep it from going -negative in case the edit distance is zero.) In this way, frequently -used words get a slightly lower cost which tends to move them toward -the top of the list of alternative spellings. - -A straightforward implementation of a spelling corrector would be -to compare the search term against every word in the vocabulary -and select the 20 with the lowest scores. However, there will -typically be hundreds of thousands or millions of words in the -vocabulary, and so this approach is not fast enough. - -Suppose the term that is being spell-corrected is X. To limit -the search space, X is converted to a k2-like key using the -equivalent of: - -
-   key = spellfix1_phonehash(lower(spellfix1_translit(X)))
-
- -This key is then limited to "scope" characters. The default scope -value is 4, but an alternative scope can be specified using the -"scope=N" term in the WHERE clause. After the key has been truncated, -the edit distance is run against every term in the vocabulary that -has a k2 value that begins with the abbreviated key. - -For example, suppose the input word is "Paskagula". The phonetic -key is "BACACALA" which is then truncated to 4 characters "BACA". -The edit distance is then run on the 4980 entries (out of -272,597 entries total) of the vocabulary whose k2 values begin with -BACA, yielding "Pascagoula" as the best match. - -Only terms of the vocabulary with a matching langid are searched. -Hence, the same table can contain entries from multiple languages -and only the requested language will be used. The default langid -is 0. - -
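The score arithmetic and the key truncation described in this section can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the spellfix1 implementation. The function names are hypothetical. The logarithm term is assumed to be an integer bit count of the rank rather than an exact log2; that assumption is chosen only because it reproduces both worked examples in the text (122 for rank 1000 and 131 for rank 1), and the text itself does not confirm it.

```python
def rank_bits(rank):
    """Integer stand-in for log2(rank): the number of bits needed to hold rank.

    Assumption for illustration; chosen because it matches the documented
    examples (rank 1000 -> 10, rank 1 -> 1).
    """
    return max(int(rank), 1).bit_length()


def score(distance, rank=1):
    """Edit distance adjusted down by the rank term, plus the constant 32."""
    return distance + 32 - rank_bits(rank)


def truncate_key(phonetic_key, scope=4):
    """Limit a phonetic key to its first `scope` characters."""
    return phonetic_key[:scope]


# Worked examples from the text above.
common = score(100, 1000)            # frequently used word
rare = score(100, 1)                 # rank defaults to 1 when omitted
prefix = truncate_key("BACACALA")    # phonetic key for "Paskagula"
```

A lower score is better, so the more common word sorts ahead of the rare one at equal edit distance. A smaller scope yields a shorter prefix, which matches more k2 values and therefore widens the search.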

Configurable Edit Distance

- -The built-in Wagner edit-distance function with fixed weights can be -replaced by the [./editdist3.wiki | editdist3()] edit-distance function -with application-defined weights and support for unicode, by specifying -the "edit_cost_table=TABLENAME" parameter to the spellfix1 module -when the virtual table is created. -For example: - -
-CREATE VIRTUAL TABLE demo2 USING spellfix1(edit_cost_table=APPCOST);
-
- -In the example above, the APPCOST table would be interrogated to find -the edit distance coefficients. It is the presence of the "edit_cost_table=" -parameter to the spellfix1 module name that causes editdist3() to be used -in place of the built-in edit distance function. - -The edit distance coefficients are normally read from the APPCOST table -once and thereafter stored in memory. Hence, run-time changes to the -APPCOST table will not normally affect the edit distance results. -However, inserting the special string 'reset' into the "command" column of the -virtual table causes the edit distance coefficients to be reread from the -APPCOST table. Hence, applications should run a SQL statement similar -to the following when changes to the APPCOST table occur: - -
-INSERT INTO demo2(command) VALUES('reset'); -
- -The tables used for edit distance costs can be changed using a command -like the following: - -
-INSERT INTO demo2(command) VALUES('edit_cost_table=APPCOST2'); -
- -In the example above, any prior edit distance costs would be discarded and -all future queries would use the costs found in the APPCOST2 table. If the -name of the table specified by the "edit_cost_table" command is "NULL", then -the built-in Wagner edit-distance function will be used instead of the -editdist3() function in all future queries. - -

Dealing With Unusual And Difficult Spellings

- -The algorithm above works quite well for most cases, but there are -exceptions. These exceptions can be dealt with by making additional -entries in the virtual table using the "soundslike" column. - -For example, many words of Greek origin begin with letters "ps" where -the "p" is silent. Ex: psalm, pseudonym, psoriasis, psyche. In -another example, many Scottish surnames can be spelled with an -initial "Mac" or "Mc". Thus, "MacKay" and "McKay" are both pronounced -the same. - -Accommodation can be made for words that are not spelled as they -sound by making additional entries into the virtual table for the -same word, but adding an alternative spelling in the "soundslike" -column. For example, the canonical entry for "psalm" would be this: - -
-  INSERT INTO demo(word) VALUES('psalm');
-
- -To enhance the ability to correct the spelling of "salm" into -"psalm", make an additional entry like this: - -
-  INSERT INTO demo(word,soundslike) VALUES('psalm','salm');
-
- -It is ok to make multiple entries for the same word as long as -each entry has a different soundslike value. Note that if no -soundslike value is specified, the soundslike defaults to the word -itself. - -Listed below are some cases where it might make sense to add additional -soundslike entries. The specific entries will depend on the application -and the target language. - - * Silent "p" in words beginning with "ps": psalm, psyche - - * Silent "p" in words beginning with "pn": pneumonia, pneumatic - - * Silent "p" in words beginning with "pt": pterodactyl, ptolemaic - - * Silent "d" in words beginning with "dj": djinn, Djikarta - - * Silent "k" in words beginning with "kn": knight, Knuthson - - * Silent "g" in words beginning with "gn": gnarly, gnome, gnat - - * "Mac" versus "Mc" beginning Scottish surnames - - * "Tch" sounds in Slavic words: Tchaikovsky vs. Chaykovsky - - * The letter "j" pronounced like "h" in Spanish: LaJolla - - * Words beginning with "wr" versus "r": write vs. rite - - * Miscellaneous problem words such as "debt", "tsetse", - "Nguyen", "Van Nuyes". - -

Auxiliary Functions

- -The source code module that implements the spellfix1 virtual table also -implements several SQL functions that might be useful to applications -that employ spellfix1 or for testing or diagnostic work while developing -applications that use spellfix1. The following auxiliary functions are -available: - -
-
editdist3(P,W)
editdist3(P,W,L)
editdist3(T)
-These routines provide direct access to the version of the Wagner -edit-distance function that allows for application-defined weights -on edit operations. The first two forms of this function compare -pattern P against word W and return the edit distance. In the first -function, the langid is assumed to be 0 and in the second, the -langid is given by the L parameter. The third form of this function -reloads edit distance coefficients from the table named by T. - -
spellfix1_editdist(P,W)
-This routine provides access to the built-in Wagner edit-distance -function that uses default, fixed costs. The value returned is -the edit distance needed to transform W into P. - -
spellfix1_phonehash(X)
-This routine constructs a phonetic hash of the pure ascii input word X -and returns that hash. This routine is used internally by spellfix1 in -order to transform the K1 column of the shadow table into the K2 -column. - -
spellfix1_scriptcode(X)
-Given an input string X, this routine attempts to determine the dominant -script of that input and returns the ISO-15924 numeric code for that -script. The current implementation understands the following scripts: -
    -
  • 215 - Latin -
  • 220 - Cyrillic -
  • 200 - Greek -
-Additional language codes might be added in future releases. - -
spellfix1_translit(X)
-This routine transliterates unicode text into pure ascii, returning -the pure ascii representation of the input text X. This is the function -that is used internally to transform vocabulary words into the K1 -column of the shadow table. - -
diff --git a/manifest b/manifest index b47b06dfb4..16c565ef04 100644 --- a/manifest +++ b/manifest @@ -1,5 +1,5 @@ -C Untested\sfix\sfor\sbuilding\son\sVxWorks. -D 2013-04-27T12:13:29.526 +C Remove\sspellfix\svirtual\stable\sdocumentation\sfrom\sthe\ssource\stree.\nReference\sthe\sseparate\sdocumentation\son\sthe\swebsite\sinstead. +D 2013-04-27T18:06:40.561 F Makefile.arm-wince-mingw32ce-gcc d6df77f1f48d690bd73162294bbba7f59507c72f F Makefile.in ce81671efd6223d19d4c8c6b88ac2c4134427111 F Makefile.linux-gcc 91d710bdc4998cb015f39edf3cb314ec4f4d7e23 @@ -85,13 +85,11 @@ F ext/icu/icu.c eb9ae1d79046bd7871aa97ee6da51eb770134b5a F ext/icu/sqliteicu.h 728867a802baa5a96de7495e9689a8e01715ef37 F ext/misc/amatch.c 3369b2b544066e620d986f0085d039c77d1ef17f F ext/misc/closure.c fec0c8537c69843e0b7631d500a14c0527962cd6 -F ext/misc/editdist3.wiki 06100a0c558921a563cbc40e0d0151902b1eef6d F ext/misc/fuzzer.c fb64a15af978ae73fa9075b9b1dfbe82b8defc6f F ext/misc/ieee754.c 2565ce373d842977efe0922dc50b8a41b3289556 F ext/misc/nextchar.c 1131e2b36116ffc6fe6b2e3464bfdace27978b1e F ext/misc/regexp.c c25c65fe775f5d9801fb8573e36ebe73f2c0c2e0 -F ext/misc/spellfix.c e323eebb877d735bc64404c16a6d758ab17a0b7a -F ext/misc/spellfix1.wiki dd1830444c14cf0f54dd680cc044df2ace2e9d09 +F ext/misc/spellfix.c f9d24a2b2617cee143b7841b453e4e1fd8f189cc F ext/misc/wholenumber.c ce362368b9381ea48cbd951ade8df867eeeab014 F ext/rtree/README 6315c0d73ebf0ec40dedb5aa0e942bc8b54e3761 F ext/rtree/rtree.c 757abea591d4ff67c0ff4e8f9776aeda86b18c14 @@ -1062,7 +1060,7 @@ F tool/vdbe-compress.tcl f12c884766bd14277f4fcedcae07078011717381 F tool/warnings-clang.sh f6aa929dc20ef1f856af04a730772f59283631d4 F tool/warnings.sh fbc018d67fd7395f440c28f33ef0f94420226381 F tool/win/sqlite.vsix 97894c2790eda7b5bce3cc79cb2a8ec2fde9b3ac -P 7a97226ffe174349e7113340f5354c4e44bd9738 -R 45f16684eb908e3d81d1d0d938ab9d11 +P f14d55cf358b0392d3b8cd61dc85f43a610a8edf +R 1d8237b86dffdd5327eb650dcb291120 U drh -Z 3089ee3240913ca0c76196e0917eb228 +Z 
369cfa66c298e6ca3c71b08895115463 diff --git a/manifest.uuid b/manifest.uuid index 8005ea1784..edb252e518 100644 --- a/manifest.uuid +++ b/manifest.uuid @@ -1 +1 @@ -f14d55cf358b0392d3b8cd61dc85f43a610a8edf \ No newline at end of file +adcf78909ff9064b6e3c4dd15ccd3245c8cf270b \ No newline at end of file