Tuesday, April 24, 2012

Lean Geocoding

Geocoding is the art of turning street addresses into geographical coordinates (e.g. lat/long). In this article we will explore some of the essential ideas upon which a geocoder can be built, along with a simple example assembled from freely available components.

The Data Model

A geocoder is comprised of two main components: a road network database (with an associated data model) and a set of queries acting on it. The database is spatial in the sense that it contains a representation of the road network as a set of geometric features (i.e. typically lines). The representation mechanism works at the street segment level: a row corresponds to the subset of a street defining a block (usually mostly linear, but not always, as in most North American cities, for instance):

In addition to a geometric attribute (describing its shape), a given street segment is associated to four numeric (i.e. non-spatial) attributes: from_left, to_left, from_right and to_right. These numbers correspond to the ranging values of the typical house numbering scheme, having the odd numbers on one side of the street, and the even ones on the other. If we add its name as an additional string attribute, a street in this model can be thus defined (in most cases) as the set of segments sharing it:

    name    | from_left | to_left | from_right | to_right |                  astext(the_geom)                  
------------+-----------+---------+------------+----------+---------------------------------------------------
 [...]
 Jean-Talon |      1000 |    1024 |       1001 |     1035 | LINESTRING(-73.625538467 45.524777946,-73.625[...]
 Jean-Talon |      1001 |    1025 |       1000 |     1080 | LINESTRING(-73.6126363129999 45.5417330830001[...]
 Jean-Talon |      1101 |    1185 |       1110 |     1150 | LINESTRING(-73.612026869 45.5425537690001,-73[...]
 Jean-Talon |      1201 |    1247 |       1210 |     1244 | LINESTRING(-73.611316541 45.543310246,-73.610[...]
 Jean-Talon |      1213 |    1245 |        200 |      260 | LINESTRING(-71.2765479409999 46.8713327090001[...]
 Jean-Talon |      1247 |    1273 |        270 |      298 | LINESTRING(-71.2756217529999 46.8717741090001[...]
 Jean-Talon |      1251 |    1293 |       1250 |     1294 | LINESTRING(-73.610724326 45.543951109,-73.610[...]
 Jean-Talon |      1275 |    1299 |        300 |      360 | LINESTRING(-71.274682868 46.872224756,-71.273[...]
 Jean-Talon |      1383 |    1383 |       1360 |     1388 | LINESTRING(-73.609408209 45.545375275,-73.608[...]
 Jean-Talon |      1385 |    1385 |       1390 |     1408 | LINESTRING(-73.608972737 45.545850934,-73.608[...]
 [...]

The Database

As part of the 2011 Census, Statistics Canada offers (free of charge!) the Road Network File, modeling the full country. Let's first import the shapefile version (grnf000r11a_e.shp) of the dataset into a PostGIS spatial database:

$ createdb -T template_postgis geocoding
$ shp2pgsql -W latin1 -s 4269 -S grnf000r11a_e.shp rnf | psql -d geocoding

Note that the -s 4269 option refers to the mapping projection system used (and specified in the manual), while -S enforces the generation of simple, rather than MULTI geometries for spatial data. Let's also cast some columns to a numeric type and rename them, to make our life simpler later:

$ psql -d geocoding -c 'alter table rnf alter afl_val type int using afl_val::int
$ psql -d geocoding -c 'alter table rnf rename afl_val to from_left'
$ psql -d geocoding -c 'alter table rnf alter atl_val type int using atl_val::int'
$ psql -d geocoding -c 'alter table rnf rename atl_val to to_left'
$ psql -d geocoding -c 'alter table rnf alter afr_val type int using afr_val::int'
$ psql -d geocoding -c 'alter table rnf rename afr_val to from_right'
$ psql -d geocoding -c 'alter table rnf alter atr_val type int using atr_val::int'
$ psql -d geocoding -c 'alter table rnf rename atr_val to to_right'

Address Interpolation

Since it wouldn't make sense to store an exhaustive list of all the existing address locations in which we'd perform a simple lookup (there are too many, and their configuration is rapidly evolving), our database is actually a compressed model of reality. To perform the geocoding of an actual address, we need an additional tool, interpolation, with which we can find any new point on a line on the basis of already defined points. For instance, given a street segment ranging from 1000 to 2000, it takes a very simple calculation to determine that a house at 1500 should be about halfway. For a piecewise linear segment, the calculation is a little bit more involved (although as intuitively clear), but thanks to PostGIS, the details are hidden from us with the help of the ST_Line_Interpolate_Point function. Here is the result of interpolating address 6884 on a segment ranging (on the left) from 6878 to 6876:

The Query

At this point, we have everything we need to build a basic geocoding query, which we wrap inside a SQL function (mostly for syntactic purposes):

create function geocode(street_nb int, street_name text) returns setof geometry as $$
    select st_line_interpolate_point(the_geom, 
        case when $1 % 2 = from_left % 2 and 
                  $1 between least(from_left, to_left) and greatest(from_left, to_left)
                  then case when from_left = to_left then 0
                       else ($1 - least(from_left, to_left)) / 
                            (greatest(from_left, to_left) - least(from_left, to_left))::real
                       end
             when $1 % 2 = from_right % 2 and 
                  $1 between least(from_right, to_right) and greatest(from_right, to_right)
                  then case when from_right = to_right then 0            
                       else ($1 - least(from_right, to_right)) / 
                            (greatest(from_right, to_right) - least(from_right, to_right))::real
                       end
        end)
    from rnf
    where name = $2 and
    ((from_left % 2 = $1 % 2 and $1 between least(from_left, to_left) and greatest(from_left, to_left))
        or
    (from_right % 2 = $1 % 2 and $1 between least(from_right, to_right) and greatest(from_right, to_right)))
$$ language sql immutable;

select astext(geocode(1234, 'Jean-Talon')); -- POINT(-73.6108985068823 45.5437626198824)

The where clause finds the target segment, while the nested case expression as a whole calculates the fraction to be fed (along with the segment geometry) to the ST_Line_Interpolate_Point function (i.e. the proportion of the target segment length at which the address point should be located). Both work with the between, least and greatest functions (without which the query would be even more verbose), the last two needed in particular because the order of the range values cannot be assumed. The outer case expression finds the side of the street, depending on the range value parity (easily checked with %, the modulo operator), while the inner one checks for the special case where the range is comprised of only the target address (which would yield a division by zero, if not properly handled). Finally, the function can be declared immutable for optimal performance, since it should have no effect on the database.

Street Name Spelling Tolerance

The mechanism described above works well if we happen to use the exact same spelling for street names as the database (for instance, if the input UI is some kind of autocompletion field tapping directly into it), but it breaks rapidly if we introduce real-world inputs (e.g. via a free text field without any constaints). For instance, the query above wouldn't work with even a slight variation (e.g. "1234 Jean Talon", where just the hyphen is missing), because the string matching performed is not flexible at all: either the name is right or it isn't. Moreoever, in a real world system, the very notion of a spelling error is vague, because it spans a wide spectrum ranging from expectable variations, like the use of abbreviations ("Saint-Jerome" vs. "St-Jerome"), use of letter case ("SAINT-JEROME"), accents ("Saint-Jérôme", at least in some languages), extra specifiers ("av. Saint-Jerome E.") to downright spelling errors ("Saint-Jerrome"). Although we cannot solve this problem entirely, the use of certain string transformation and matching techniques can help, by introducing some tolerance in the process. The simple mechanism I will describe has two components. The street names (both from the query and the database) are first normalized (or standardized), to ease the process of comparing them (by effectively reducing the space of possible name representations). The comparison of the normalized names is then performed using an edit distance metric, with the goal of quantifying the similarity of two names, to determine whether they match or not (within a certain tolerance parameter).
Normalization
The most obvious normalizing transformations are: (1) replacing any non-alphanumeric character by a single one, (2) converting to upper case and (3) removing accents (which is particularly relevant in Canada, given the French speaking Quebec province, but also unfortunately requires a small hack, at least with my particular version of PG). We can also use regular expressions to detect and normalize certain common naming patterns. All the required transformations can be encapsulated in such a PL/pgSQL function, for instance:

create function normalize(in text) returns text as $$ 
  declare
    s text;
  begin
     s = to_ascii(convert_to(upper(trim($1)), 'latin1'), 'latin1');
     s = regexp_replace(s, E'\\W+', '%', 'g');
     s = regexp_replace(s, E'^SAINTE%|^SAINT%|^STE%|^ST%', 'S*%', 'g');
     return s;
  end;
$$ language plpgsql;

select normalize('St-Jérôme'); -- S*%JEROME

which we can now use to precompute the normalized database street names:

$ psql -d geocoding -c 'alter table rnf add column name_norm text'
$ psql -d geocoding -c 'update rnf set name_norm = normalize(name)'

Similarity Matching
The last step is performed by using an edit distance metric to define what is an acceptable matching level (or similarity) between two names. A particular instance of such a metric is the well-known Levenshtein distance, which is implemented by the fuzzystrmatch PG module, and can be interpreted roughly as the "number of edits needed to transform a string A into string B". Here is an updated version of our geocoding function (with an additional parameter for the edit distance deviation tolerance), with which slight errors can now be forgiven, thanks to our flexible name spelling mechanism:

create function geocode(street_nb int, street_name text, tol int) returns setof geometry as $$
    select st_line_interpolate_point(the_geom, 
        case when $1 % 2 = from_left % 2 and 
                  $1 between least(from_left, to_left) and greatest(from_left, to_left)
                  then case when from_left = to_left then 0
                       else ($1 - least(from_left, to_left)) / 
                            (greatest(from_left, to_left) - least(from_left, to_left))::real
                       end
             when $1 % 2 = from_right % 2 and 
                  $1 between least(from_right, to_right) and greatest(from_right, to_right)
                  then case when from_right = to_right then 0            
                       else ($1 - least(from_right, to_right)) / 
                            (greatest(from_right, to_right) - least(from_right, to_right))::real
                       end
        end)
    from rnf
    where levenshtein(normalize($2), name_norm) <= $3 and
    ((from_left % 2 = $1 % 2 and $1 between least(from_left, to_left) and greatest(from_left, to_left))
        or
    (from_right % 2 = $1 % 2 and $1 between least(from_right, to_right) and greatest(from_right, to_right)))
$$ language sql immutable;

select astext(geocode(1234, 'Jean Tallon', 3)); -- POINT(-73.6108985068823 45.5437626198824)

Friday, February 17, 2012

An Eulerian Hoax

Let's end the suspense.. there was a big catch with the previous puzzle: it was impossible to solve!

As you may have suspected, the reason is mathematical, and it boils down to the fact that the puzzle actually corresponds to the dual graph of another graph, this one lacking an essential property: an Eulerian path. Even though it may appear different at first, this is actually quite similar to the well-known problem of the Seven Bridges of Königsberg, which was solved by Euler in 1735.

To explore what it means in more details, let's first consider the first, easy problem and its dual graph. The dual graph (in red) models the topology of the regions implicitly formed by the edge network of the blue graph: it tells which region connects to which. Notice that the space surrounding the blue graph, being a single region, corresponds to the top red node, reachable by all the red nodes lying in a region with "outside facing (blue) walls":

Starting at the top red node (i.e. "outside", in terms of the dual graph), there exists a path (that can be followed with the arrows) with which we will eventually follow every red edges, without visiting any one twice: this is an Eulerian path (since this path is starting and ending at the same node, it's actually more precisely an Eulerian circuit). A path in this graph corresponds to a "pencil path" across the blue edges (or "walls") of the puzzle graph:

Now while trying to solve the previous, impossible puzzle (in blue), you were in reality trying to find an Eulerian path within its dual graph (in red):

which was proven impossible by Euler in this particular case, because it does not satisfy the (necessary and sufficient) condition of having zero or two (red) nodes of degree two. Specifically, as can be seen, it has four nodes of odd degree.

Just for fun, let's do the opposite, and try to cast the graph of the Königsberg problem into a form suitable for another interactive puzzle applet. From the simple and compact graph that can be derived from the bridge and island structure, it's not immediately clear how to do this (at least not in terms of the straight segments that the applet requires, as far as I can see):

But if we simply rearrange the edge between the top and bottom nodes, it becomes easy:

which yields this (impossible, you've been warned this time!) puzzle (click anywhere on the grid to place the cursor, and drag any of its ends through the segments; you can press "r" to reset, "b" or right-click to move back, or use the yellow and blue buttons at the top):

(The applet was created with Processing.js)

I'd say this one is more suspect as a puzzle, in the sense that you can rapidly almost feel its impossibility.. in a way that the previous, more complicated one was better able to conceal.

Maybe that explains why, even after my father had told me it was impossible, I kept trying to solve it stubbornly, fascinated by its paradoxical nature.