Soundex - Everything2.com

by Segnbora-t

Sun Nov 05 2000 at 1:12:35

Soundex is a system for grouping differently-spelled variants of a name together. It was patented by Robert C. Russell in 1918 and is used by the U.S. National Archives to index U.S. Census records. It has many other uses: at least in Florida, Soundex is used to generate the first part of a person's driver's license number (my last name is Saunders, and my license number starts with "S536"), and it is used in Everything's non-exact searches. But one of its major uses is in genealogy.

The point of Soundex in genealogy is to make it easier for the researcher to connect different spellings of names that may be related. My last name is Saunders, but four generations ago it was spelled Sanders. Soundex ignores vowels, and so groups those names together (both are S-536). It's not perfect: my mother's family, the Lonons (L-550), would not be grouped in an index with their ancestors who spelled it London (L-535), but would be listed together with some relative whose census taker spelled it Lunun (L-550).

There is a converter form at http://www.ourancestry.com/soundex.html, but if you'd rather know the rules to make Soundex codes yourself, here they are as given at http://www.nara.gov/genealogy/coding.html:

Every soundex code consists of a letter and three numbers, such as W-252. The letter is always the first letter of the surname. The numbers are assigned to the remaining letters of the surname according to the soundex guide shown below. Zeroes are added at the end if necessary to produce a four-character code. Additional letters are disregarded. Examples:

Washington is coded W-252 (W, 2 for the S, 5 for the N, 2 for the G, remaining letters disregarded).

Lee is coded L-000 (L, 000 added).

Soundex Coding Guide

Number: Represents the Letters

1: B, F, P, V
2: C, G, J, K, Q, S, X, Z
3: D, T
4: L
5: M, N
6: R

Disregard the letters A, E, I, O, U, H, W, and Y.
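
As a rough illustration of the basic rule and the coding guide above, here is a minimal Java sketch. The class and method names (BasicSoundex, digitFor, basicCode) are made up for this example, and it assumes a non-empty surname made of plain A-Z letters. It applies only the rules given so far, so it does not yet handle the additional rules that follow.

```java
final class BasicSoundex {

    /** Digit from the coding guide, or '0' for a disregarded letter (A, E, I, O, U, H, W, Y). */
    static char digitFor(char letter) {
        switch (Character.toUpperCase(letter)) {
            case 'B': case 'F': case 'P': case 'V': return '1';
            case 'C': case 'G': case 'J': case 'K':
            case 'Q': case 'S': case 'X': case 'Z': return '2';
            case 'D': case 'T': return '3';
            case 'L': return '4';
            case 'M': case 'N': return '5';
            case 'R': return '6';
            default: return '0';
        }
    }

    /** First letter, a dash, then the first three digits, padded with zeroes. */
    static String basicCode(String surname) {
        StringBuilder code = new StringBuilder();
        code.append(Character.toUpperCase(surname.charAt(0))).append('-');
        // Stop once we have letter + dash + three digits; later letters are disregarded.
        for (int i = 1; i < surname.length() && code.length() < 5; i++) {
            char digit = digitFor(surname.charAt(i));
            if (digit != '0') code.append(digit);
        }
        while (code.length() < 5) code.append('0');
        return code.toString();
    }
}
```

With this sketch, basicCode("Washington") gives W-252 and basicCode("Lee") gives L-000, matching the two examples above.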

Additional Soundex Coding Rules
Names With Double Letters

If the surname has any double letters, they should be treated as one letter. For example:

Gutierrez is coded G-362 (G, 3 for the T, 6 for the first R, second R ignored, 2 for the Z).

Names with Letters Side-by-Side that have the Same Soundex Code Number

If the surname has different letters side-by-side that have the same number in the soundex coding guide, they should be treated as one letter. Examples:

Pfister is coded as P-236 (P, F ignored, 2 for the S, 3 for the T, 6 for the R).

Jackson is coded as J-250 (J, 2 for the C, K ignored, S ignored, 5 for the N, 0 added).

Tymczak is coded as T-522 (T, 5 for the M, 2 for the C, Z ignored, 2 for the K). Since the vowel "A" separates the Z and K, the K is coded.

Names with Prefixes

If a surname has a prefix, such as Van, Con, De, Di, La, or Le, code both with and without the prefix because the surname might be listed under either code. Note, however, that Mc and Mac are not considered prefixes.

For example, VanDeusen might be coded two ways: V-532 (V, 5 for N, 3 for D, 2 for S)

or D-250 (D, 2 for the S, 5 for the N, 0 added).
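
The prefix rule can be sketched as a helper that lists the spellings you would want to code. This is only an illustration: PrefixVariants and variants are hypothetical names, and the test for "is this really a prefix" (the prefix must be followed by a space or a capital letter) is my own assumption, not part of the rules. It therefore expects mixed-case input such as "VanDeusen".

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class PrefixVariants {

    // Prefixes named in the rule above; Mc and Mac are deliberately absent.
    private static final List<String> PREFIXES =
            Arrays.asList("Van", "Con", "De", "Di", "La", "Le");

    /** Returns the surname itself plus, if a prefix is detected, the surname without it. */
    static List<String> variants(String surname) {
        List<String> out = new ArrayList<>();
        out.add(surname);
        for (String prefix : PREFIXES) {
            if (surname.length() > prefix.length()
                    && surname.regionMatches(true, 0, prefix, 0, prefix.length())) {
                char next = surname.charAt(prefix.length());
                // Assumed heuristic: a real prefix is followed by a space or a capital.
                if (next == ' ' || Character.isUpperCase(next)) {
                    String rest = surname.substring(prefix.length()).trim();
                    if (!rest.isEmpty()) out.add(rest);
                    break; // strip at most one prefix
                }
            }
        }
        return out;
    }
}
```

For "VanDeusen" this yields VanDeusen and Deusen, which are the two spellings coded as V-532 and D-250 above. "McDonald" and "Lee" come back unchanged.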

Consonant Separators

If a vowel (A, E, I, O, U) separates two consonants that have the same soundex code, the consonant to the right of the vowel is coded. Example:

Tymczak is coded as T-522 (T, 5 for the M, 2 for the C, Z ignored (see "Side-by-Side" rule above), 2 for the K). Since the vowel "A" separates the Z and K, the K is coded.

If "H" or "W" separate two consonants that have the same soundex code, the consonant to the right of the H or W is not coded. Example:

Ashcraft is coded A-261 (A, 2 for the S, C ignored, 6 for the R, 1 for the F). It is not coded A-226.
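
The two separator rules boil down to one decision, sketched here in Java. SeparatorRule, digit and rightIsCoded are names invented for this example. Treating Y like a vowel is an assumption, since the rule above only names A, E, I, O and U.

```java
final class SeparatorRule {

    // Coding guide groups; the index in this array plus one is the Soundex digit.
    private static final String[] GROUPS = {"BFPV", "CGJKQSXZ", "DT", "L", "MN", "R"};

    /** Soundex digit 1-6 for a consonant, or 0 for a letter that is not coded. */
    static int digit(char letter) {
        char upper = Character.toUpperCase(letter);
        for (int g = 0; g < GROUPS.length; g++) {
            if (GROUPS[g].indexOf(upper) >= 0) return g + 1;
        }
        return 0;
    }

    /**
     * Decides whether 'right' gets its own digit when 'left' is the consonant
     * before it and exactly one uncoded letter 'separator' sits between them.
     */
    static boolean rightIsCoded(char left, char separator, char right) {
        if (digit(right) == 0) return false;           // vowels, H, W, Y never get a digit
        if (digit(left) != digit(right)) return true;  // different codes: always coded
        char sep = Character.toUpperCase(separator);
        return sep != 'H' && sep != 'W';               // same code: only H and W suppress it
    }
}
```

rightIsCoded('Z', 'A', 'K') is true, which is why the K in Tymczak is coded. rightIsCoded('S', 'H', 'C') is false, which is why Ashcraft is A-261 and not A-226.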

(thing)

by Jetifi

Tue Mar 25 2003 at 14:02:26

Apart from its uses in genealogy and filing systems in general, soundex is also used for fuzzy searching in databases. One example of this with which you are probably familiar is the "Near Matches" checkbox next to the search button above.

The algorithm used for computer soundexes is rarely the one presented above, but a variant described by Donald Knuth in Volume 3 of The Art of Computer Programming. The modifications were probably made to speed up the algorithm and make it easier to program, since they chiefly consist of ignoring special cases:

Knuth's algorithm does not disregard adjacent characters if they are represented by the same number - only if they are the same character.
In Knuth's algorithm, if H or W separate two consonants, both consonants are encoded, instead of just the left one.
Trivially, the dash after the first letter is omitted.

The Java code below was adapted from an example of the Knuth algorithm¹ to follow all the rules in the above writeup.

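import java.util.Locale;

/**
 * Soundex, modified from the Knuth algorithm to comply with the NARA standard.
 */
public class Soundex {

    // Soundex digit for each letter A-Z. '0' marks a vowel (or Y), which is
    // not coded but separates equal codes; '-' marks H and W, which are
    // skipped without separating equal codes.
    private static final char[] MAP = {
        //A    B    C    D    E    F    G    H    I    J    K    L    M
        '0', '1', '2', '3', '0', '1', '2', '-', '0', '2', '2', '4', '5',
        //N    O    P    Q    R    S    T    U    V    W    X    Y    Z
        '5', '0', '1', '2', '6', '2', '3', '0', '1', '-', '2', '0', '2'
    };

    public static String soundex(String input) {

        // Locale.ROOT keeps the upper-casing independent of the default locale.
        input = input.toUpperCase(Locale.ROOT);

        StringBuilder result = new StringBuilder();

        char current, previous = '?';

        // The result is a letter, a dash and three digits: five characters.
        for (int i = 0; i < input.length() && result.length() < 5; i++) {

            current = input.charAt(i);

            // Skip anything outside A-Z instead of indexing past the table.
            if (current < 'A' || current > 'Z') continue;

            char mapped = MAP[current - 'A'];

            if (result.length() == 0) {
                // The first letter is kept as-is, but its code still
                // suppresses a following letter with the same code.
                result.append(current).append('-');
                previous = mapped;
                continue;
            }

            // H and W are transparent: 'previous' is left untouched, so a
            // consonant after them is compared with the one before them.
            if (mapped == '-') continue;

            if (mapped == previous) continue;

            previous = mapped;

            // Vowels reset 'previous' above but are never written out.
            if (mapped != '0') result.append(mapped);
        }

        // No letters at all in the input.
        if (result.length() == 0) return null;

        for (int i = result.length(); i < 5; i++) result.append('0');

        return result.toString();
    }

    public static void main(String[] args) {

        String[][] tests = {
            {"Washington", "W-252"},
            {"Lee", "L-000"},
            {"Gutierrez", "G-362"},
            {"Pfister", "P-236"},
            {"Jackson", "J-250"},
            {"Tymczak", "T-522"},
            {"Ashcraft", "A-261"},
        };

        for (int i = 0; i != tests.length; i++)
            System.out.println("Soundex of " + tests[i][0] +
                    " should be " + tests[i][1] +
                    " and is " + soundex(tests[i][0]));
    }
}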

Footnotes/sources:
1: See http://www.porcupyne.org/docs/browse_source/JavaCookBook/Soundex.java.html
http://www.archives.gov/research_room/genealogy/census/soundex.html

(idea)

by Txikwa

Tue Mar 25 2003 at 15:31:46

Soundex is woefully inadequate. It's meant to be phonetic, but it codes each letter in isolation instead of looking at how the letters are used in the spelling. It could be improved by a few simple rules.

The letter numbering is assigned by saying B = 1, C = 2, D = 3, and so on, but assigning a similar-sounding letter to an existing group, so F = 1 because F is labial like B. The next consonant that doesn't sound much like any of the previous ones is L, so L = 4, then M = 5, then the contentious claim that R counts as a consonant and is different from the rest. So you get this list, with their phonetic basis:

1: B, F, P, V -- all labial
2a: C, G, K, Q, X -- all velar
2b: C, S, X, Z -- all sibilant
2c: C, G, J -- all palatal
3: D, T -- both alveolar
4: L -- lateral
5: M, N -- both nasal
6: R -- rhotic in some positions or in some dialects

Because C can be like K and Q, as in cat, they all get lumped together; but because it can also be like S, as in city, so do S and its relatives, with the result that the phonetically nonsensical K = S gets built in. Then CH as in church is like G as in gent and J as in judge; but these are not like the K set or the S set. This fudge, caused by treating C just as a letter in isolation, means too many sounds get lumped together.

Fix: Look at the next letter. CE, CI and CY give C = S; CH gives C = J; anything else gives C = K. These rules pick up the great majority of contexts correctly.
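
A minimal Java sketch of that look-ahead, just to show how little it takes. ContextualC and soundOfC are hypothetical names, and the returned letter only says which of the 2a/2b/2c sets the C should join. Wiring it into a full encoder is left out.

```java
final class ContextualC {

    /**
     * Returns 'S', 'J' or 'K': the letter whose set the C at position
     * 'index' should be grouped with, judged by the letter after it.
     */
    static char soundOfC(String name, int index) {
        if (Character.toUpperCase(name.charAt(index)) != 'C') {
            throw new IllegalArgumentException("no C at index " + index);
        }
        // '\0' stands for "no next letter" when C is the last character.
        char next = index + 1 < name.length()
                ? Character.toUpperCase(name.charAt(index + 1))
                : '\0';
        switch (next) {
            case 'E': case 'I': case 'Y': return 'S';  // CE, CI, CY
            case 'H': return 'J';                      // CH
            default: return 'K';                       // anything else
        }
    }
}
```

So soundOfC("city", 0) is 'S', soundOfC("church", 0) is 'J' and soundOfC("cat", 0) is 'K'.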

On the other hand, R gets treated as a consonant even when it's just part of a vowel in many accents, so identical-sounding names like Houghton and Horton don't get picked up. As genealogy is likely to be conducted on names used in England, this is a significant consideration. Fix: Have a dialect switch. If set, look at next letter. If it's a vowel, treat R as a consonant, else ignore it.
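
The dialect switch could look something like this in Java. DialectR, rIsConsonant and the nonRhotic flag are names invented for the sketch. Only A, E, I, O and U count as vowels here, which is an assumption (a following Y is not handled).

```java
final class DialectR {

    /**
     * Decides whether the R at position 'index' should be coded as a consonant.
     * With the switch off, R always counts, as in ordinary Soundex.
     * With it on, R counts only when the next letter is a vowel.
     */
    static boolean rIsConsonant(String name, int index, boolean nonRhotic) {
        if (!nonRhotic) return true;
        if (index + 1 >= name.length()) return false;  // final R is dropped
        char next = Character.toUpperCase(name.charAt(index + 1));
        return "AEIOU".indexOf(next) >= 0;
    }
}
```

With the switch set, the R in Horton is ignored (the next letter is T) while the R in Robert is kept. Note that under the standard table the G in Houghton still produces a 2, so this switch on its own would not make Houghton and Horton collide.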

The omission of W shows that it's for English-language use. In most languages W = V would be more appropriate.
