Spell Correction Efficiency

Hi Doug,

We had a discussion about a month ago, I think, regarding the spell-correction algorithm used in Clang, notably for auto-completion, which is based on the Levenshtein distance.

It turns out I just came upon this link today: http://nlp.stanford.edu/IR-book/html/htmledition/k-gram-indexes-for-spelling-correction-1.html

The idea is to use bigrams (2-letter fragments of a word) to build an index of the form (bigram → all words containing this bigram), then use this index to retrieve all the words that have enough bigrams in common with the word you are currently trying to approximate.

This drastically reduces the set of identifiers on which to compute the Levenshtein distance, especially if we trim, right from the start, those whose length is incompatible anyway. Furthermore, the results may be ordered by the number of common bigrams, so the Levenshtein distance can be computed first on the most promising candidates, making the maximum edit distance drop quickly.
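To make the idea concrete, here is a rough C++ sketch of the bigram index and the candidate ranking (this is not the attached Python script; the container choices and the buildBigramIndex / rankCandidates names are made up for the illustration):

```cpp
#include <algorithm>
#include <cstddef>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Map each bigram to the list of distinct words containing it.
std::map<std::string, std::vector<std::string>>
buildBigramIndex(const std::vector<std::string> &Words) {
  std::map<std::string, std::vector<std::string>> Index;
  for (const std::string &W : Words)
    for (std::size_t I = 0; I + 1 < W.size(); ++I) {
      std::vector<std::string> &Bucket = Index[W.substr(I, 2)];
      if (Bucket.empty() || Bucket.back() != W) // a word may repeat a bigram
        Bucket.push_back(W);
    }
  return Index;
}

// Collect every word sharing at least one bigram with the typo, drop those
// whose length is incompatible with MaxDist anyway, and sort the rest by the
// number of shared bigrams so the most promising candidates come first.
std::vector<std::pair<unsigned, std::string>>
rankCandidates(const std::string &Typo, unsigned MaxDist,
               const std::map<std::string, std::vector<std::string>> &Index) {
  std::map<std::string, unsigned> Shared;
  for (std::size_t I = 0; I + 1 < Typo.size(); ++I) {
    auto It = Index.find(Typo.substr(I, 2));
    if (It == Index.end())
      continue;
    for (const std::string &W : It->second)
      if (W.size() <= Typo.size() + MaxDist && Typo.size() <= W.size() + MaxDist)
        ++Shared[W];
  }
  std::vector<std::pair<unsigned, std::string>> Ranked;
  Ranked.reserve(Shared.size());
  for (const auto &Entry : Shared)
    Ranked.push_back(std::make_pair(Entry.second, Entry.first));
  std::sort(Ranked.begin(), Ranked.end(),
            [](const std::pair<unsigned, std::string> &A,
               const std::pair<unsigned, std::string> &B) {
              return A.first > B.first; // more shared bigrams first
            });
  return Ranked;
}
```

The bounded edit distance is then computed only on the survivors, most promising first.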

I have implemented a quick algorithm in Python (~70 lines, few comments though) to test it out, and I find the results quite promising.

I have used the following corpus: http://norvig.com/big.txt (~6 MB) which can be found on Peter Norvig’s page: http://norvig.com/spell-correct.html

For example, looking for the word “neaer”:

Number of distinct Words from the corpus: 48911
Number of distinct Bigrams from the corpus: 1528
Number of results: [(1, 34), (2, 1133)]
Most promising results: (1, [‘Kearney’, ‘NEAR’, ‘nearing’, ‘linear’, ‘learner’, ‘uneasy’, ‘nearest’, ‘neat’, ‘lineage’, ‘leaned’, ‘Nearer’, ‘Learned’, ‘Nearly’, ‘cleaner’, ‘cleaned’, ‘neatly’, ‘nearer’, ‘earned’, ‘neared’, ‘nearly’, ‘learned’, ‘nearby’, ‘Nearest’, ‘near’, ‘meanest’, ‘earnest’, ‘Near’, ‘beneath’, ‘gleaned’, ‘Beneath’, ‘kneaded’, ‘weaned’, ‘Freneau’, ‘guineas’])

Or for the word “sppose”:

Number of distinct Words from the corpus: 48911
Number of distinct Bigrams from the corpus: 1528
Number of results: [(1, 14), (2, 136), (3, 991)]
Most promising results: (1, [‘disposal’, ‘opposed’, ‘opposing’, ‘Oppose’, ‘suppose’, ‘Dispose’, ‘dispose’, ‘apposed’, ‘disposed’, ‘oppose’, ‘supposed’, ‘opposite’, ‘supposes’, ‘Suppose’])

Do you think the method is worth investigating?

Regards,
Matthieu

spellcorrection.py (1.97 KB)

Hi Matthieu,

Do you think the method is worth investigating?

Yes, I do. Typo correction performance is very important (and has historically been very poor).

Note that there are also more-efficient algorithms for computing the Levenshtein distance (e.g., O(d·min(m, n)) rather than O(m·n)). We should also consider those.

  • Doug

The diagonal optimization you are talking about seems to bring a 2x to 3x speed-up, which is quite nice; however, the algorithm is tricky and I am not yet sure that I get the correct distance in every case. I am going to sit down and implement some fuzz testers this week to help find bugs in the various algorithms.
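For reference, the banded variant I am experimenting with has roughly this shape (a sketch only, not the code of any patch; the early-exit policy and the convention of returning Bound + 1 for “too far” are choices made for the example):

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

// Bounded Levenshtein distance: only cells with |i - j| <= Bound are computed,
// and the scan stops as soon as a whole row exceeds Bound. Returns the exact
// distance when it is <= Bound, and Bound + 1 otherwise.
unsigned boundedEditDistance(const std::string &A, const std::string &B,
                             unsigned Bound) {
  if (A.size() > B.size())
    return boundedEditDistance(B, A, Bound); // keep A the shorter string
  const std::size_t M = A.size(), N = B.size();
  if (N - M > Bound)
    return Bound + 1;                        // the length gap alone is too big

  const unsigned Infinity = Bound + 1;
  std::vector<unsigned> Prev(N + 1, Infinity), Cur(N + 1, Infinity);
  for (std::size_t J = 0; J <= std::min<std::size_t>(N, Bound); ++J)
    Prev[J] = J;                             // row 0, inside the band only

  for (std::size_t I = 1; I <= M; ++I) {
    std::size_t Lo = (I > Bound) ? I - Bound : 0;
    std::size_t Hi = std::min(N, I + Bound);
    Cur.assign(N + 1, Infinity);
    if (Lo == 0)
      Cur[0] = I;
    unsigned RowMin = (Lo == 0) ? Cur[0] : Infinity;
    for (std::size_t J = std::max<std::size_t>(Lo, 1); J <= Hi; ++J) {
      unsigned Sub = Prev[J - 1] + (A[I - 1] != B[J - 1]);
      unsigned Del = Prev[J] + 1;            // delete A[I-1]
      unsigned Ins = Cur[J - 1] + 1;         // insert B[J-1]
      Cur[J] = std::min({Sub, Del, Ins});
      RowMin = std::min(RowMin, Cur[J]);
    }
    if (RowMin > Bound)
      return Bound + 1;                      // every path already exceeds Bound
    Prev.swap(Cur);
  }
  return std::min(Prev[N], Infinity);
}
```

Called with Bound set to the best distance found so far, it can abandon most candidates after scanning only a few rows.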

Anyway, I’ve been exploring a few options regarding typo correction these past weeks and I now have a little ecosystem of various solutions; however, I currently lack three things:

  • a real lexicon (mine is about ~18k words) → I would appreciate it if someone could point me to a more representative lexicon; otherwise I may try to index a subset of the Boost library and see what I can come up with
  • a real benchmark: I have no idea how to “select” good (typo-ed) words for testing / benchmarking the typo-correction algorithms. I have started by always using the same word (at a distance of 1), but it clearly isn’t representative, and my knowledge of statistics is rather lacking, so if you have an approach to suggest…
  • a good trie implementation, or hints on how to implement one. A trie seems to be the best index (quite compact yet ordered); however, I only have a naive implementation at the moment.

My best contender so far is based on formal language theory and involves extracting the set of words recognized by two automata: one being the trie, which recognizes all the known words, and the other crafted for the word we are looking for, recognizing all words within a certain distance. However, for maximal efficiency this requires some structures to be precomputed, and this structure grows exponentially with the distance. It works for distances 1 to 3, is already 2.8 MB for distance 4 (and takes about 70 s to compute on my machine, though I suppose it would be faster to deserialize a precomputed version), and I haven’t dared going further, as it seems to be a geometric series with a factor of ~28!
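For comparison, the much simpler approach (no precomputation at all) is to walk the trie while carrying one row of the Levenshtein matrix, pruning a branch as soon as the whole row exceeds the limit. This is not the automaton construction described above, just a naive sketch with a made-up TrieNode layout:

```cpp
#include <algorithm>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

struct TrieNode {
  std::map<char, TrieNode> Children;
  bool IsWord = false;
};

void insertWord(TrieNode &Root, const std::string &Word) {
  TrieNode *Node = &Root;
  for (char C : Word)
    Node = &Node->Children[C];
  Node->IsWord = true;
}

// Walk the trie, maintaining for each node the row of the Levenshtein matrix
// against Query that corresponds to the prefix spelled so far. A subtree is
// pruned as soon as the smallest value in the row exceeds MaxDist.
void searchNode(const TrieNode &Node, const std::string &Prefix, char Edge,
                const std::vector<unsigned> &PrevRow, const std::string &Query,
                unsigned MaxDist, std::vector<std::string> &Results) {
  const std::size_t N = Query.size();
  std::vector<unsigned> Row(N + 1);
  Row[0] = PrevRow[0] + 1;                 // prefix got one character longer
  unsigned RowMin = Row[0];
  for (std::size_t J = 1; J <= N; ++J) {
    unsigned Sub = PrevRow[J - 1] + (Query[J - 1] != Edge);
    unsigned Del = PrevRow[J] + 1;
    unsigned Ins = Row[J - 1] + 1;
    Row[J] = std::min({Sub, Del, Ins});
    RowMin = std::min(RowMin, Row[J]);
  }
  if (Node.IsWord && Row[N] <= MaxDist)
    Results.push_back(Prefix);
  if (RowMin > MaxDist)
    return;                                // nothing below can get back under MaxDist
  for (const auto &Child : Node.Children)
    searchNode(Child.second, Prefix + Child.first, Child.first, Row, Query,
               MaxDist, Results);
}

std::vector<std::string> fuzzyLookup(const TrieNode &Root,
                                     const std::string &Query,
                                     unsigned MaxDist) {
  std::vector<unsigned> Row0(Query.size() + 1);
  for (std::size_t J = 0; J <= Query.size(); ++J)
    Row0[J] = J;                           // row for the empty prefix
  std::vector<std::string> Results;
  for (const auto &Child : Root.Children)
    searchNode(Child.second, std::string(1, Child.first), Child.first, Row0,
               Query, MaxDist, Results);
  return Results;
}
```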

Would it be considered a loss to limit ourselves to words within distance min(word.length()/4+1, 3)?

It occurred to me that one transformation (the transposition of adjacent characters) is quite a common typo. I have toyed with the Damerau-Levenshtein distance, but it is much harder to implement the optimizations with it. Without it, a simple transposition costs 2 transformations (one addition and one deletion), so in this case 3 is rather limiting; still, I don’t know whether looking much further is very interesting.
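For completeness, the restricted variant (“optimal string alignment”) only needs one extra case in the recurrence; the sketch below is unbounded and unoptimized, purely to show where the transposition term goes:

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

// Restricted Damerau-Levenshtein ("optimal string alignment"): the usual
// Levenshtein recurrence plus one case that charges 1 for swapping two
// adjacent characters instead of 2 (one insertion plus one deletion).
unsigned osaDistance(const std::string &A, const std::string &B) {
  const std::size_t M = A.size(), N = B.size();
  std::vector<std::vector<unsigned>> D(M + 1, std::vector<unsigned>(N + 1, 0));
  for (std::size_t I = 0; I <= M; ++I) D[I][0] = I;
  for (std::size_t J = 0; J <= N; ++J) D[0][J] = J;
  for (std::size_t I = 1; I <= M; ++I)
    for (std::size_t J = 1; J <= N; ++J) {
      unsigned Cost = (A[I - 1] != B[J - 1]);
      D[I][J] = std::min({D[I - 1][J] + 1,          // deletion
                          D[I][J - 1] + 1,          // insertion
                          D[I - 1][J - 1] + Cost}); // substitution
      if (I > 1 && J > 1 && A[I - 1] == B[J - 2] && A[I - 2] == B[J - 1])
        D[I][J] = std::min(D[I][J], D[I - 2][J - 2] + 1); // transposition
    }
  return D[M][N];
}
```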

I’d like to get some feedback before delving further :slight_smile:

Also, I haven’t considered how to actually plug this into the llvm/clang code base yet; I am just prototyping at the moment.

We had a discussion about a month ago, I think, regarding the spell-correction algorithm used in Clang, notably for auto-completion, which is based on the Levenshtein distance.

I presume you've thought about whether suffix-tries and prefix-tries would work for the case of typos on computer identifiers?

- a good trie implementation, or hints on how to implement one. A trie seems to be the best index (quite compact yet ordered); however, I only have a naive implementation at the moment.

I've got a hash-trie implementation (the data structure that is used, e.g., to store the dictionary of hyphenation exceptions in TeX) in C++ that's heavily designed for small alphabets and integer "payloads", so it's unlikely to be directly suitable, but you can have a look at it if you like. (Drop me a personal email.)

Hi Dave,

actually I try not to think :slight_smile:

Truth be told, I have little experience in the domain, but I have time at my disposal, so rather than thinking I simply brute-force the tests and see what comes out best. I’d rather not assume, especially performance-wise.

For the prefix / suffix implementation, I’m waiting for a “typical” lexicon to see what comes out first. I hope it will not matter, unless I have a glaring bottleneck in the code, but as I said I prefer to measure. Frankly, it’s the least of my worries, since it suffices to reverse the strings to get from one to the other, so there is no technical difficulty there.

I know that my trie implementation is probably lacking (as I said, it’s naive), so I am trying to see what the state of the art is for storing dictionaries, and notably how to “pack” tries to get a smaller memory footprint and faster queries. The idea is to gather the various approaches, implement them, benchmark them, and see what comes out on top.

I’d really like to see this hash-trie implementation, if you could send it to me. I don’t mind having to read C / C++ code as long as it’s not obfuscated; in fact, I’ll probably feel much more at ease with it than with the scientific paper I found on the automaton theory, which took me a good week to implement :slight_smile:

Thanks,
Matthieu.

2010/11/22 David Tweed <david.tweed@gmail.com>

The diagonal optimization you are talking about seems to bring a 2x to 3x speed-up, which is quite nice; however, the algorithm is tricky and I am not yet sure that I get the correct distance in every case. I am going to sit down and implement some fuzz testers this week to help find bugs in the various algorithms.

Okay. Replacing our current implementation with one of the faster algorithms would be great, since that would not require any infrastructure changes at all. Just replace the existing StringRef::edit_distance and we’re done.

Anyway, I’ve been exploring a few options regarding typo correction these past weeks and I now have a little ecosystem of various solutions; however, I currently lack three things:

  • a real lexicon (mine is about ~18k words) → I would appreciate it if someone could point me to a more representative lexicon; otherwise I may try to index a subset of the Boost library and see what I can come up with

I tend to use Cocoa.h on Mac OS, which is great for Objective-C. Boost is reasonable for C++ code.

  • a real benchmark: I have no idea how to “select” good (typo-ed) words for testing / benchmarking the typo-correction algorithms. I have started by always using the same word (at a distance of 1), but it clearly isn’t representative, and my knowledge of statistics is rather lacking, so if you have an approach to suggest…

Random permutations of randomly-selected words from the lexicon would be fine, I think.
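Something along these lines would probably do; the uniform mix of edit kinds and the identifier alphabet are arbitrary choices for the sketch:

```cpp
#include <cstddef>
#include <random>
#include <string>
#include <utility>

// Apply NumEdits random single-character edits (insert, delete, replace, or
// swap adjacent characters) to a word drawn from the lexicon.
std::string makeTypo(const std::string &Word, unsigned NumEdits,
                     std::mt19937 &Rng) {
  static const std::string Alphabet =
      "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789_";
  std::string S = Word;
  for (unsigned E = 0; E < NumEdits; ++E) {
    std::uniform_int_distribution<std::size_t> PosDist(0, S.empty() ? 0 : S.size() - 1);
    std::uniform_int_distribution<int> KindDist(0, 3);
    std::uniform_int_distribution<std::size_t> CharDist(0, Alphabet.size() - 1);
    std::size_t Pos = PosDist(Rng);
    switch (S.empty() ? 0 : KindDist(Rng)) {
    case 0: // insertion
      S.insert(S.begin() + Pos, Alphabet[CharDist(Rng)]);
      break;
    case 1: // deletion
      S.erase(S.begin() + Pos);
      break;
    case 2: // replacement
      S[Pos] = Alphabet[CharDist(Rng)];
      break;
    case 3: // adjacent transposition
      if (Pos + 1 < S.size())
        std::swap(S[Pos], S[Pos + 1]);
      break;
    }
  }
  return S;
}
```

Running, say, 100 randomly-selected lexicon entries through makeTypo with 1-5 edits gives a test set whose source words are known.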

  • a good trie implementation, or hints on how to implement one. A trie seems to be the best index (quite compact yet ordered); however, I only have a naive implementation at the moment.

My best contender so far is based on formal language theory and involves extracting the set of words recognized by two automata: one being the trie, which recognizes all the known words, and the other crafted for the word we are looking for, recognizing all words within a certain distance. However, for maximal efficiency this requires some structures to be precomputed, and this structure grows exponentially with the distance. It works for distances 1 to 3, is already 2.8 MB for distance 4 (and takes about 70 s to compute on my machine, though I suppose it would be faster to deserialize a precomputed version), and I haven’t dared going further, as it seems to be a geometric series with a factor of ~28!

The issue with using a trie is that you have to actually build the trie. It can’t be something that is maintained online, because typo correction must be zero-cost unless it is needed (since most compilation runs never need typo correction). Performance in the presence of precompiled headers also matters: we don’t want to have to construct the trie when building the precompiled header, for example.

Would it be considered a loss to limit ourselves to words within distance min(word.length()/4+1, 3)?

I could live with that. Clang already has limits on the maximum edit distance it will accept for a typo correction.

It occurred to me that one transformation (the transposition of adjacent characters) is quite a common typo. I have toyed with the Damerau-Levenshtein distance, but it is much harder to implement the optimizations with it. Without it, a simple transposition costs 2 transformations (one addition and one deletion), so in this case 3 is rather limiting; still, I don’t know whether looking much further is very interesting.

Making transposition a 1-unit transformation would be an improvement. It can be added to the current implementation or come in with a new implementation.

  • Doug

Hello Doug,

putting llvmdev in CC since they are concerned too

I’ve finally gotten around to finishing a working implementation of the typical Levenshtein distance with the diagonal optimization.

I’ve tested it against the original LLVM implementation and checked it on a set of ~18k words, randomly generating a variation of each word and checking that both implementations return the same result… it took me some time to patch the diagonal version, but they now do.

I have left the older method in place, especially because I saw that it was used in FileCheck and I was a bit wary of any side effect I might introduce there.

I have thus created a new method, unsigned StringRef::levenshtein_distance(StringRef other, unsigned maxEditDistance), and replaced the two uses of edit_distance in Sema/SemaLookup.cpp with it.

I have ditched the “AllowReplace” argument, since it further complicated the logic of an already ~150-line method for no obvious gain (as far as I could see), and I have also made the computations a bit more explicit by using named temporaries.

Unfortunately, I only have a Windows machine at hand for the moment and the tests don’t work that well… so I’d appreciate it if anyone could check my changes.

I have made two separate patches: one for llvm (stringref-patch.diff) and one for clang (sema-patch.diff); the llvm patch should be applied first, of course.

Since it is my first time proposing a patch, I hope I have done things right :slight_smile:

Regards,
Matthieu.

2010/12/1 Douglas Gregor <dgregor@apple.com>

sema-patch.diff (3.34 KB)

stringref-patch.diff (15 KB)

Hello Doug,

*putting llvmdev in CC since they are concerned too*

I've finally gotten around to finishing a working implementation of the typical Levenshtein distance with the diagonal optimization.

I've tested it against the original LLVM implementation and checked it on a set of ~18k words, randomly generating a variation of each word and checking that both implementations return the same result... it took me some time to patch the diagonal version, but they now do.

Okay. Did you do any performance comparisons? The point of introducing a new algorithm is to improve performance.

I have left the older method in place, especially because I saw that it was used in FileCheck and I was a bit wary of any side effect I might introduce there.

I have thus created a new method, unsigned StringRef::levenshtein_distance(StringRef other, unsigned maxEditDistance), and replaced the two uses of edit_distance in Sema/SemaLookup.cpp with it.

I would much rather just replace edit_distance completely. Compiler writers aren't string-savvy enough to pick between the two algorithms :slight_smile:

I have ditched the "AllowReplace" argument, since it further complicated the logic of an already ~150-line method for no obvious gain (as far as I could see), and I have also made the computations a bit more explicit by using named temporaries.

Okay. I can't imagine why anyone would want to turn off replacement anyway.

Unfortunately, I only have a Windows machine at hand for the moment and the tests don't work that well... so I'd appreciate it if anyone could check my changes.

A bunch of Clang tests failed when I applied your patch, so it's not quite ready yet. These are the failures:

Failing Tests (10):
    Clang :: CXX/class/class.friend/p1.cpp
    Clang :: FixIt/fixit.cpp
    Clang :: Parser/objc-quirks.m
    Clang :: Parser/recovery.c
    Clang :: Sema/MicrosoftExtensions.c
    Clang :: SemaCXX/unknown-type-name.cpp
    Clang :: SemaObjC/category-1.m
    Clang :: SemaObjC/super.m
    Clang :: SemaObjC/synth-provisional-ivars.m
    Clang :: SemaObjCXX/category-lookup.mm

I looked at the first one, and for the strings "UndeclaredSoFar" and "_Imaginary" and a maxEditDistance of 5 (= the difference in their lengths, if that matters), the algorithm returns an edit distance of 2.

One of the other tests (Parser/objc-quirks.m) is actually a crash.

+  // maxDistance (aka D) directly influences the width of the strip (2*D+1);
+  // we arbitrarily restrict it in case the caller did not see fit to do it
+  // (from clang/lib/Sema/SemaLookup.cpp in TypoCorrectionConsumer::FoundName)
+  unsigned const maxDistance = std::min(unsigned((otherLength+2)/3),
+                                        maxEditDistance);

I don't think this belongs here. We can count on the client to supply a reasonable maxEditDistance. If our performance is going to degrade horribly when an unreasonable maxEditDistance is provided (e.g., > the sum of the string lengths), we could just fall back to the simplistic algorithm we currently have.

Oh, and please look at the LLVM Coding Standards (http://llvm.org/docs/CodingStandards.html), and reformat your patch accordingly. Thanks!

  - Doug

Hello Doug,

thanks for the review :slight_smile:

I went ahead and reworked the patches according to your remarks, namely:

  • I modified edit_distance directly, rather than introducing a new method, dropping the “AllowReplacement” parameter and refactoring the calls
  • I have changed the “artificial” lowering of the bound: a Levenshtein distance can never exceed the length of the longer of the two strings (perform short.length() replacements and then additions for the remaining characters to turn the shorter string into the longer one)
  • I checked the 80-column limit
  • I noticed that the K&R brace style was used and adopted it

If there is anything I missed in the Coding Standards, please let me know; notably, I have interpreted the “camelCase” guideline as lower camelCase, since otherwise “PascalCase” would have been clearer, I guess.

The tests still don’t work well on Windows (an exception in clang’s lit.cfg; I’ve tried to fix it, but it yielded a fatal error instead, so… :/). But I did succeed in compiling the whole of llvm and clang from scratch, after wiping out the build directory and rerunning CMake (and I do hate VC++’s sluggishness, btw).

I’ve tested it in isolation (against the original algorithm); it looks like I had made an error when transposing my implementation into the body of StringRef last time (lost in translation…), so this time I worked directly on StringRef to avoid further issues. It turns out the algorithm had a bug (‘=’ instead of ‘+’) and would not report distances greater than 1 or 2… hum.

I have benchmarked the performance using the following method (compiled in Release mode in VC++ 2010):

  • pick a corpus of about ~18,000 distinct words, containing letters, digits and underscores (to match C++ identifiers), with lengths varying from 1 to 18
  • randomly select 100 words from the corpus, and perform up to 5 random alterations on each of them
  • for each altered word:
      • initialize max to the altered word’s length
      • loop over the corpus, compute the distance between the current word and the altered one (bounded by max), update max if necessary and stop if it reaches 0 (see the sketch just after this list)
      • report the “best” score achieved (checking that both algorithms report the same score)
      • rerun this search (over the whole corpus) 5000 times, first for the “classic” distance and then for the “diagonal” distance
  • record each “time” (measured outside the 5000-iteration loop), keyed by the “best” score achieved
  • display the average per “best” score, as well as the speed-up achieved
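The inner corpus scan, in compact form (a sketch only; boundedEditDistance stands for whichever of the two distance implementations is being timed, and must return something greater than the bound when the real distance exceeds it):

```cpp
#include <string>
#include <vector>

// Bounded distance function under test; see the banded sketch earlier in the
// thread for one possible implementation.
unsigned boundedEditDistance(const std::string &A, const std::string &B,
                             unsigned Bound);

// Scan the corpus for the closest match to Altered, tightening the bound as
// better candidates are found so later calls can give up early.
unsigned bestDistance(const std::string &Altered,
                      const std::vector<std::string> &Corpus) {
  unsigned Best = Altered.size(); // step 1: initialize max to the word length
  for (const std::string &Word : Corpus) {
    unsigned D = boundedEditDistance(Altered, Word, Best);
    if (D < Best) {
      Best = D;
      if (Best == 0)
        break;                    // exact match, nothing can beat it
    }
  }
  return Best;
}
```

Timing the same loop once with the classic implementation and once with the diagonal one gives the comparison in the table below.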

Best distance   Brute     Diagonal   Speedup
0               0.00234   0.00057    x4.11
1               0.00550   0.00150    x3.67
2               0.00785   0.00337    x2.33
3               0.01003   0.00636    x1.57
4               0.01303   0.00993    x1.31
5               0.01701   0.01087    x1.56

I don’t have an explanation for why distance 5 shows a better speed-up than distance 4; it looks like an artifact of the tests (it may be due to the bound tightening sooner). The trend should, logically, be decreasing, since the amount of computation is directly driven by the “best distance so far” in the loop. Anyway, it looks like the diagonal version performs rather well.

Note: there are two patches, one for llvm and one for clang. The patches are correlated and cannot be applied separately, because the signature of edit_distance changed (we lost a parameter).

2011/2/3 Douglas Gregor <dgregor@apple.com>

edit_distance_clang.diff (1.13 KB)

edit_distance_llvm.diff (19.4 KB)