Tuesday, April 24, 2012

Lean Geocoding

Geocoding is the art of turning street addresses into geographical coordinates (e.g. lat/long). In this article we will explore some of the essential ideas upon which a geocoder can be built, along with a simple example assembled from freely available components.

The Data Model

A geocoder comprises two main components: a road network database (with an associated data model) and a set of queries acting on it. The database is spatial in the sense that it contains a representation of the road network as a set of geometric features (typically lines). The representation works at the street segment level: a row corresponds to the portion of a street spanning a single block (usually more or less straight, as in the grid layout of most North American cities, but not always).

In addition to a geometric attribute (describing its shape), a given street segment carries four numeric (i.e. non-spatial) attributes: from_left, to_left, from_right and to_right. These correspond to the range values of the typical house numbering scheme, with the odd numbers on one side of the street and the even ones on the other. If we add the street name as an additional string attribute, a street in this model can thus be defined (in most cases) as the set of segments sharing that name:

    name    | from_left | to_left | from_right | to_right |                  astext(the_geom)                  
------------+-----------+---------+------------+----------+---------------------------------------------------
 [...]
 Jean-Talon |      1000 |    1024 |       1001 |     1035 | LINESTRING(-73.625538467 45.524777946,-73.625[...]
 Jean-Talon |      1001 |    1025 |       1000 |     1080 | LINESTRING(-73.6126363129999 45.5417330830001[...]
 Jean-Talon |      1101 |    1185 |       1110 |     1150 | LINESTRING(-73.612026869 45.5425537690001,-73[...]
 Jean-Talon |      1201 |    1247 |       1210 |     1244 | LINESTRING(-73.611316541 45.543310246,-73.610[...]
 Jean-Talon |      1213 |    1245 |        200 |      260 | LINESTRING(-71.2765479409999 46.8713327090001[...]
 Jean-Talon |      1247 |    1273 |        270 |      298 | LINESTRING(-71.2756217529999 46.8717741090001[...]
 Jean-Talon |      1251 |    1293 |       1250 |     1294 | LINESTRING(-73.610724326 45.543951109,-73.610[...]
 Jean-Talon |      1275 |    1299 |        300 |      360 | LINESTRING(-71.274682868 46.872224756,-71.273[...]
 Jean-Talon |      1383 |    1383 |       1360 |     1388 | LINESTRING(-73.609408209 45.545375275,-73.608[...]
 Jean-Talon |      1385 |    1385 |       1390 |     1408 | LINESTRING(-73.608972737 45.545850934,-73.608[...]
 [...]
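Schematically, and leaving aside the many other attributes that ship with the actual dataset used below, the model boils down to a table along these lines (a hypothetical sketch; the real table is created by the shapefile import in the next section):

create table road_segments (
    name       text,      -- street name
    from_left  int,       -- address range on the left-hand side of the segment
    to_left    int,
    from_right int,       -- address range on the right-hand side
    to_right   int,
    the_geom   geometry   -- the segment's shape, as a LINESTRING
);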

The Database

As part of the 2011 Census, Statistics Canada offers (free of charge!) the Road Network File, modeling the country's entire road network. Let's first import the shapefile version (grnf000r11a_e.shp) of the dataset into a PostGIS spatial database:

$ createdb -T template_postgis geocoding
$ shp2pgsql -W latin1 -s 4269 -S grnf000r11a_e.shp rnf | psql -d geocoding

Note that the -s 4269 option specifies the spatial reference system of the data (SRID 4269, i.e. NAD83 geographic coordinates, as specified in the manual), while -S enforces the generation of simple, rather than MULTI, geometries. Let's also cast some columns to a numeric type and rename them, to make our life simpler later:

$ psql -d geocoding -c 'alter table rnf alter afl_val type int using afl_val::int'
$ psql -d geocoding -c 'alter table rnf rename afl_val to from_left'
$ psql -d geocoding -c 'alter table rnf alter atl_val type int using atl_val::int'
$ psql -d geocoding -c 'alter table rnf rename atl_val to to_left'
$ psql -d geocoding -c 'alter table rnf alter afr_val type int using afr_val::int'
$ psql -d geocoding -c 'alter table rnf rename afr_val to from_right'
$ psql -d geocoding -c 'alter table rnf alter atr_val type int using atr_val::int'
$ psql -d geocoding -c 'alter table rnf rename atr_val to to_right'
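
As a quick sanity check, a query along these lines should now reproduce listings like the sample shown in the data model section (the exact rows depend on the dataset version):

select name, from_left, to_left, from_right, to_right, astext(the_geom)
from rnf
where name = 'Jean-Talon'
order by from_left;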

Address Interpolation

Since it wouldn't make sense to store an exhaustive list of every existing address location and simply look it up (there are far too many of them, and they change constantly), our database is actually a compressed model of reality. To geocode an actual address, we need an additional tool: interpolation, which lets us locate a new point on a line from points that are already defined. For instance, given a street segment ranging from 1000 to 2000, a very simple calculation tells us that a house at 1500 should sit about halfway along it. For a piecewise linear segment the calculation is a little more involved (though just as intuitive), but thanks to PostGIS the details are hidden from us by the ST_Line_Interpolate_Point function, which takes a linestring and a fraction of its total length and returns the corresponding point.

[Figure: the result of interpolating address 6884 on a segment ranging (on the left) from 6878 to 6876.]
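
To get a feel for the function, here is a minimal sketch on a made-up two-piece linestring: a fraction of 0.5 (where a 1500 address would fall on a 1000 to 2000 range) lands exactly at the elbow of the line, half of its total length away from the start.

select astext(st_line_interpolate_point(
    st_geomfromtext('LINESTRING(0 0, 1 0, 1 1)'), 0.5)); -- POINT(1 0)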

The Query

At this point, we have everything we need to build a basic geocoding query, which we wrap inside a SQL function (mostly for syntactic convenience):

create function geocode(street_nb int, street_name text) returns setof geometry as $$
    select st_line_interpolate_point(the_geom,
        -- fraction of the segment's length at which the address falls,
        -- computed on the side (left or right) whose parity and range match
        case when $1 % 2 = from_left % 2 and
                  $1 between least(from_left, to_left) and greatest(from_left, to_left)
                  then case when from_left = to_left then 0  -- single-address range
                       else ($1 - least(from_left, to_left)) /
                            (greatest(from_left, to_left) - least(from_left, to_left))::real
                       end
             when $1 % 2 = from_right % 2 and
                  $1 between least(from_right, to_right) and greatest(from_right, to_right)
                  then case when from_right = to_right then 0  -- single-address range
                       else ($1 - least(from_right, to_right)) /
                            (greatest(from_right, to_right) - least(from_right, to_right))::real
                       end
        end)
    from rnf
    -- keep the segment(s) whose name matches and whose address range
    -- (on the side with matching parity) contains the requested number
    where name = $2 and
    ((from_left % 2 = $1 % 2 and $1 between least(from_left, to_left) and greatest(from_left, to_left))
        or
    (from_right % 2 = $1 % 2 and $1 between least(from_right, to_right) and greatest(from_right, to_right)))
$$ language sql immutable;

select astext(geocode(1234, 'Jean-Talon')); -- POINT(-73.6108985068823 45.5437626198824)

The where clause finds the target segment, while the nested case expression as a whole computes the fraction to be fed (along with the segment geometry) to the ST_Line_Interpolate_Point function, i.e. the proportion of the target segment's length at which the address point should be located. Both rely on the between construct and on the least and greatest functions (without which the query would be even more verbose); the last two are needed in particular because the order of the range values cannot be assumed. The outer case expression picks the side of the street, based on the parity of the range values (easily checked with %, the modulo operator), while the inner one handles the special case of a range consisting of a single address (which would otherwise yield a division by zero). Finally, the function is declared immutable for performance reasons; strictly speaking, since it reads from a table (without ever modifying it), stable would be the more accurate volatility category.
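
As a quick check of the arithmetic, consider the 1201|1247|1210|1244 segment from the sample listing above: 1234 is even, which matches the right-hand side range (1210 to 1244), and should fall roughly 70% of the way along the segment:

select (1234 - least(1210, 1244)) /
       (greatest(1210, 1244) - least(1210, 1244))::real as fraction; -- 24 / 34, roughly 0.706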

Street Name Spelling Tolerance

The mechanism described above works well as long as we use exactly the same spelling for street names as the database (for instance, if the input UI is some kind of autocompletion field tapping directly into it), but it breaks down rapidly with real-world inputs (e.g. a free text field without any constraints). The query above wouldn't survive even a slight variation ("1234 Jean Talon", say, where only the hyphen is missing), because the string matching performed is not flexible at all: either the name is right or it isn't. Moreover, in a real-world system the very notion of a spelling error is vague, spanning a wide spectrum that ranges from expectable variations, such as abbreviations ("Saint-Jerome" vs. "St-Jerome"), letter case ("SAINT-JEROME"), accents ("Saint-Jérôme", at least in some languages) and extra specifiers ("av. Saint-Jerome E."), all the way to outright spelling errors ("Saint-Jerrome"). Although we cannot solve this problem entirely, certain string transformation and matching techniques can help by introducing some tolerance into the process. The simple mechanism described here has two components. First, the street names (both from the query and from the database) are normalized (or standardized), to ease their comparison by reducing the space of possible name representations. The normalized names are then compared using an edit distance metric, which quantifies the similarity of two names and lets us decide whether they match, within a certain tolerance parameter.

Normalization

The most obvious normalizing transformations are: (1) collapsing any run of non-alphanumeric characters into a single placeholder character, (2) converting to upper case and (3) removing accents (particularly relevant in Canada, given the French-speaking province of Quebec, but unfortunately requiring a small hack, at least with my particular version of PG). We can also use regular expressions to detect and normalize certain common naming patterns, such as the "Saint"/"St" family of prefixes. All the required transformations can be encapsulated in a PL/pgSQL function such as this one:

create function normalize(in text) returns text as $$
  declare
    s text;
  begin
     -- upper-case, trim and strip accents (via a latin1 round-trip,
     -- the small hack mentioned above)
     s = to_ascii(convert_to(upper(trim($1)), 'latin1'), 'latin1');
     -- collapse any run of non-alphanumeric characters into a single '%'
     s = regexp_replace(s, E'\\W+', '%', 'g');
     -- normalize the common Saint/Sainte/St/Ste prefix variants
     s = regexp_replace(s, E'^SAINTE%|^SAINT%|^STE%|^ST%', 'S*%', 'g');
     return s;
  end;
$$ language plpgsql;

select normalize('St-Jérôme'); -- S*%JEROME
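
The prefix rule also collapses the usual "Saint"/"Ste" spelling variants to a common form so that, for a hypothetical street name, both of the following yield the same result:

select normalize('Sainte-Foy'); -- S*%FOY
select normalize('STE FOY');    -- S*%FOY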

We can now use this function to precompute the normalized street names in the database:

$ psql -d geocoding -c 'alter table rnf add column name_norm text'
$ psql -d geocoding -c 'update rnf set name_norm = normalize(name)'
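
A quick look at one of the rows from the earlier sample confirms the expected result:

select name, name_norm from rnf where name = 'Jean-Talon' limit 1; -- Jean-Talon | JEAN%TALON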

Similarity Matching

The last step uses an edit distance metric to define what counts as an acceptable matching level (or similarity) between two names. A well-known instance of such a metric is the Levenshtein distance, implemented by the fuzzystrmatch PostgreSQL module, which can be interpreted roughly as the number of edits needed to transform a string A into a string B.
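
If the module is not already installed, with PostgreSQL 9.1 or later it can be added as an extension (earlier versions ship it as a contrib SQL script instead):

$ psql -d geocoding -c 'create extension fuzzystrmatch'

As a quick example of the metric, the distance between two normalized names differing by a single extra letter is 1:

select levenshtein('JEAN%TALLON', 'JEAN%TALON'); -- 1

With this in place, slight errors can now be forgiven. Here is an updated version of our geocoding function, with an additional parameter for the edit distance tolerance: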

create function geocode(street_nb int, street_name text, tol int) returns setof geometry as $$
    select st_line_interpolate_point(the_geom, 
        case when $1 % 2 = from_left % 2 and 
                  $1 between least(from_left, to_left) and greatest(from_left, to_left)
                  then case when from_left = to_left then 0
                       else ($1 - least(from_left, to_left)) / 
                            (greatest(from_left, to_left) - least(from_left, to_left))::real
                       end
             when $1 % 2 = from_right % 2 and 
                  $1 between least(from_right, to_right) and greatest(from_right, to_right)
                  then case when from_right = to_right then 0            
                       else ($1 - least(from_right, to_right)) / 
                            (greatest(from_right, to_right) - least(from_right, to_right))::real
                       end
        end)
    from rnf
    where levenshtein(normalize($2), name_norm) <= $3 and
    ((from_left % 2 = $1 % 2 and $1 between least(from_left, to_left) and greatest(from_left, to_left))
        or
    (from_right % 2 = $1 % 2 and $1 between least(from_right, to_right) and greatest(from_right, to_right)))
$$ language sql immutable;

select astext(geocode(1234, 'Jean Tallon', 3)); -- POINT(-73.6108985068823 45.5437626198824)
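
Conversely, with a stricter tolerance the same misspelling is no longer forgiven, and the query returns no row at all:

select astext(geocode(1234, 'Jean Tallon', 0)); -- no match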