Updated on 12 July with more tinkering. Updated on 13 July with a solution and closing thoughts.

I am playing around with OpenStreetMap data and bumped into an issue trying to import it into PostgreSQL with the PostGIS extension enabled. I have pennsylvania-latest.osm.pbf from Geofabrik and am using osm2pgsql to import it into PostgreSQL. I am using osm2pgsql’s flex Lua script. Specifically, I am only importing administrative relations (i.e., those with an admin_level set) and am asking osm2pgsql to parse their geometries and save it all in the database. This seems to be working fine, I am not doing anything funky, but by accident I notice that Adams County is not being imported. Why?

Is the Input File Valid?

I first add some print debugging in the Lua importing script. Adams County is encountered in the .osm.pbf file.

function osm2pgsql.process_relation(object)
  if object.tags.name and object.tags.admin_level then
    print(object.tags.name) -- Added
    regions:add_row({
      -- omitted
    })
  end
end

Any Osm2pgsql Errors?

Osm2pgsql does not spit out any errors in the default verbosity. I crank up the verbosity to see if I spot anything. To do so, I add the flags --log-level=debug, --log-sql, and --log-sql-data to the osm2pgsql invocation. This dumps out a lot, so just redirect it to a file. In the file, I find Adams County due to the earlier print debugging still being in the Lua script. No other mention of it, however. The areas before and after it do show in the logging as being added to the table. Odd.

I search a bit for my table (regions) where I expect to find Adams County in. I notice a reference to a SQL trigger on the table and a SQL function that is called by it (both called regions_osm2pgsql_valid). The function definition is also there and I see a call to ST_IsValid. That is a function provided by PostGIS. I find the following in the osm2pgsql documentation.

Validity is more complex for Polygon and MultiPolygon geometries: There are multiple ways how such a geometry can be invalid. For instance, if the boundary of a polygon is drawn in a figure eight, the result will not be a valid polygon. The database will happily store such invalid polygons, but this can lead to problems later, when you try to draw them or do calculations based on the invalid geometry (such as calculating the area).

Osm2pgsql makes sure that there are no invalid geometries in the database, either by not importing them in the first place or by using an ST_IsValid() check after import. You’ll either get NULL in your geometry columns instead or the row is not inserted at all (depending on config).

I notice in PostGIS documentation that there are also an ST_IsValidReason and an ST_IsValidDetail, both give some more information about why a geometry is invalid. I see no obvious way to make osm2pgsql call that for me and tell me the reason.

Finding Why Adams County is Not Valid

Get a Hold of All the Data

Osm2pgsql has a flag that will save the OpenStreetMap data as-is in some extra tables, alongside the processed tables you desire. To do so, add the --slim flag. This creates the tables planet_osm_nodes, planet_osm_ways, and planet_osm_rels. I think you can figure out what type of OSM objects each table holds. Time for some SQL tinkering.

From Relation to Nodes

To figure out the schema of a table, just use \d tablename. I care more to just see every column’s values so I know what I can work with here, so a SELECT does the trick. 417441 is the id of the Adams County relation in OpenStreetMap.

SELECT
    *
FROM
    planet_osm_rels
WHERE
    id = 417441;

Sadly there is no geometry or similar column. Instead, the interesting column seems to be parts, which is just the ids of the members of the relation. Luckily, Adams County is (seems to be) a simple polygon, so I only have to contend with a bunch of outer members. It also has label and admin_centre members, but I can easily ignore those. parts is an array of 45 members, 2 I do not care about. To make it easy on myself, I just copy the list and move on.

SELECT
    *
FROM
    planet_osm_ways
WHERE
    id IN (...that list...);

This returns a row per way and has a column nodes which is, again, an array of ids. Now the ids refer to nodes. There are about 1700 so just a quick copy is not possible any more. Instead, you can modify the query to get one node id per row instead. None of the other information matters so that is thrown out.

SELECT
    unnest(nodes) AS nodes
FROM
    planet_osm_ways
WHERE
    id IN (...);

With a list of node ids, we can finally find some geometry information!

SELECT
    *
FROM
    planet_osm_nodes AS n
WHERE
    n.id IN (
        SELECT
            unnest(nodes) AS nodes
        FROM
            planet_osm_ways
        WHERE
            id IN (...));

This gives a lat and a lon column, but they are in integers to avoid the usual floating point pitfalls. So 39.8987036 is stored as 398987036.

Lat and Lon to PostGIS Geometries

I want to work with the geometries PostGIS has to offer, so rework the data a little. The 4326 indicates the SRID.

SELECT
    ST_Point (lon / 10000000.0, lat / 10000000.0, 4326)
FROM
    planet_osm_nodes AS n
WHERE
    n.id IN (
        SELECT
            unnest(nodes) AS nodes
        FROM
            planet_osm_ways
        WHERE
            id IN (...));

The ways and nodes are all stored in the right order (day three pondering: actually, I am not sure I can make that assumption any more. I think the order is correct when I unnest, but once I use that in a WHERE ... IN it actually sounds unlikely to still be correct. Idea would be to dump the unnest result into a table with an order column, then join with planet_osm_nodes and ORDER BY order. This also explains why I am getting the incorrect geometry error I get later. This also invalidates any pondering I do about that result later.), so I want to just stitch the 1712 points together. For that you need to throw the result into an ARRAY and pass it on to ST_Makeline first. Then you have a PostGIS line geometry. I already name that column theline for referencing it later.

SELECT
    ST_Makeline (ARRAY (
            SELECT
                ST_Point ... previous code block
    )) AS theline;

Then to make a polygon out of it. Our line has every point for a polygon, but does not loop back to its starting point yet. So just add the starting point to the back of the line as well. Finally, make a polygon geometry out of it all.

SELECT
    ST_MakePolygon (ST_AddPoint (theline, ST_StartPoint (theline)))
FROM (
    SELECT
        ST_Makeline (
            ...
        ) AS theline) AS makeline_table;

Is It Valid?

Remember that ST_IsValidDetail? Finally there is a polygon to feed to it! The function returns an object with valid (boolean whether the polygon is valid), reason (textual reason), and location (the geometry position of the reason). I will also immediately check the type of the location.

SELECT
    (validity).valid,
    (validity).reason,
    (validity).location,
    GeometryType((validity).location)
FROM (
    SELECT
        ST_IsValidDetail (ST_MakePolygon (...)) validity
    FROM (
        ...
    ) AS makeline_table)
    AS reasons;

We were debugging Adams County, remember? I get a self intersection as reason and the location is of type point. So, I change the SELECT to give me the coordinates.

SELECT
    (validity).valid,
    (validity).reason,
    ST_X ((validity).location),
    ST_Y ((validity).location)
FROM (
    ...
) AS reasons;

Not Solving Adams County

I get as coordinates -76.97132396038843 longitude and 39.93156765348423 latitude. I know I am dealing with a self-intersect. I go check it out on openstreetmap and find… nothing. Welp. JOSM validation on the spot also yields nothing. I double check the relation members in JOSM and they seem to indeed be in the correct order.

I run out of ideas. The osm2pgsql documentation also mentions Geofabrik’s OSM Inspector tool, but that shows nothing wrong for Adams County either.

I give up for now, hope you got something useful from reading this fruitless endeavour.


 


Next day continuation.

Different Projections?

There I am, out of ideas, scrolling through the osm2pgsql documentation for any hint of what to do. I see a hint that changing the map projection (keyword: SRID) could potentially project two different points onto the same spot. Logical next step in my head: that can mess up geometries.

By default, osm2pgsql stores things in Web Mercator projection (EPSG 3857), since a major user of osm2pgsql is people wanting to use the data to create tiles for maps. In OpenStreetMap, the coordinates are stored as WGS84 (SRID: 4326). Anyway, lots of technical details to tell you that there is a transformation of the projection happening. Point number go change.

So I adjust the Lua importing script and specifically the geometries. In the table definition in it, add the projection key to your geometry column. For example:

{ column = "geom", type = "multipolygon", projection = "WGS84" }

Anyway, I do that, run a new import. Still no worky. I assume that is not to blame then.

Further Inspecting the Input

As mentioned earlier, I am working with pennsylvania-latest.osm.pbf from Geofabrik. It is an export of, you guessed it, just the OpenStreetMap data in the state of Pennsylvania, USA. While scrolling the osm2pgsql documentation, I see them reference Osmium. I had heard of this tool before, but have not used it. Time to change that.

I am running all of this in a Debian container on Docker, so I hook up a shell to it and install Osmium.

apt install osmium-tool

Meanwhile, the documentation scrolling shifts focus to Osmium’s documentation. The focus seems to be on validating .osm.pbf files. So I give that a try again.

With Osmium, I can extract information I care about. Recalling that Adams County’s id is 417441, I create an .osm.pbf file with just the Adams County administrative boundary. The -r also gets everything referenced by the relation I am asking information about. So for a relation, get all its members. For any ways in it, get all the nodes. Note the r before the number to tell Osmium I am looking for a relation.

osmium getid -r pennsylvania-latest.osm.pbf r417441 -o adamscounty.osm.pbf

So now I have a small file (11 kB vs the 241 MB of Pennsylvania) with just Adams County information. Osmium can check whether everything referenced is also present. It seems obvious to me that it would be, but for completeness sake, I run it.

$ osmium check-refs -r adamscounty.osm.pbf
[======================================================================] 100%
There are 1714 nodes, 39 ways, and 1 relations in this file.
Nodes     in ways      missing: 0
Nodes     in relations missing: 0
Ways      in relations missing: 4
Relations in relations missing: 0

Wait what?

Are we finally getting somewhere in this problem? I am missing some ways, apparently, somehow. I run getid again, but now with debug information. This is basically a text dump that you can look through. I first run it on adamscounty.osm.pbf, but then think maybe I messed something up when creating the smaller file. So instead I run it on the Pennsylvania dump.

osmium getid -r pennsylvania-latest.osm.pbf r417441 -f debug

I redirect it to a file so I can just skim and search through it with Vim. All the way at the bottom is the Adams County relation. It shows the metadata and lists the members. There are only 45 members total in the relation, 43 of which are the outer boundaries of Adams County. Those are the one I am interested in. If Osmium is to be believed, four of them are not present. I quickly just search for each of them in the debug file and indeed, four of them cannot be found. (I later realise Osmium’s check-refs has a --show-ids flag that does this for me)

The culprits:

  • Member #1: 60204285
  • Member #2: 87287068
  • Member #3: 87287085
  • Member #43, i.e., the last one: 83031652

I check them out on OpenStreetMap and these four form the Southern border of Adams County. Adams County’s Southern border is also part of Pennsylvania’s Southern border (“Mason-Dixon Line”). I assume that is why it was cut off. I double check that nobody messed up the borders and placed Adams County’s border outside of Pennsylvania, but that looks correct: those ways are also in the PA relation as outer. I double check that it is not some Osmium quirk by looking in the database. The planet_osm_ways from before does not have those four ways. It does have a random sampling of the other ways in the Adams County relation.

This is not ideal, but I am also not sure that this is problematic. Those four ways are a straight line. The two ways (they go away from that line) adjacent to the four missing ones are present. In other words, the nodes on either end of the straight line are there. In other other words, when you connect those two nodes to form an area, you will still have a straight line anyway!

Missing Ways = Broken Geometries

I check the counties to either side of Adams County. York County to the east was imported correctly. Franklin County to the west was not. Checking a few more along that Southern edge of Pennsylvania, those do seem to have been correctly imported and processed. Around this time I notice Pennsylvania had not been imported either, but is present in the input. At first I had assumed that is was just not included. Perhaps the missing way really is messing it all up.

I do the same Osmium getid extraction for a few of those. I use check-refs on them. The ones that are not imported correctly have a way missing. The other ones seem fine, nothing missing there.

The osm2pgsql manual says it tries to figure it out when data is incomplete.

When osm2pgsql encounters incomplete OSM data it will still try to do its best to use it. Ways missing some nodes will be shortened to the available part. Multipolygons might be missing some parts. In most cases this will work well (if your extract has been created with a sufficiently large buffer), but sometimes this will lead to wrong results. In the worst case, if a complete outer ring of a multipolygon is missing, the multipolygon will appear “inverted”, with outer and inner rings switching their roles.

Though a few sections down it also says

The as_* functions will return a NULL geometry (check with is_null()) if the geometry can not be created for some reason, for instance a polygon can only be created from closed ways. This can also happen if your input data is incomplete, for instance when nodes referenced from a way are missing.

but I believe this is not related to what I am doing, but more if you do manual processing of geometries in the Lua import script.

So where lies the fault in not processing the incomplete data:

  • osm2pgsql not actually trying because of the missing ways, or
  • PostGIS being grumpy and not accepting the geometry osm2pgsql creates.

Note: I don’t know if that is even to blame. The earlier tinkering in PostGIS indicated a self-intersection way more North-East in Adams County, seemingly unrelated to this missing border. Still, it is weird and a bit too coincidental that these two miss part of their border and are not getting imported. So: worth looking into.

Blame Osm2pgsql or PostGIS?

Earlier I created an adamscounty.osm.pbf with Osmium. Since it is much smaller, debugging is made a little easier. I run an import with all the earlier logging enabled and look through it.

I notice that in the log you can see osm2pgsql feeding a bunch of processed data into PostGIS by means of stdin. What is fed in is also logged. Specifically, I see processed ways getting inserted. I do not see Adams County getting inserted in this manner. This leads me to believe that osm2pgsql’s claim of trying to fix up your polygon is bogus (or I am messing up a setting, definitely possible too). I am asking it to create an area for me, it notices that without all the ways there, it cannot tell me whether all of it closes up, it decides not to create an area.

Side Step: Why is the Border Missing?

As I mentioned, I get the data from Geofabrik. I check their technical details and notice the following (emphasis mine)

We use polygonal boundaries for the splitting boundaries that are sometiems derived and simplified from OSM data, sometimes just hand-drawn. The boundaries usually follow country borders, but occasionally we take liberties and include a litte more of a neighbouring country if this greatly simplifies the polygon. The Osmium extract function that we use keeps ways and multipolygon relations that cross an extract border complete, i.e. when a very large mutlipolygon crosses the border, an extract can occasionally contain a lot more that expected.

The hand-drawn part makes me check their page for Pennsylvania. On it is a map that colours the area you are downloading. Zooming in on Adams County’s Southern border shows that the coloured in part is slightly above the border there. The same is true for a part of Franklin County. Further to the east and west, the coloured part matches up with the actual border again.

I presume these two got out of sync, so I send them an e-mail about this, asking them to fix it.

Summary

An overview of what I conclude after these two days.

  • The pennsylvania-latest.osm.pbf file is generated by Geofabrik with a boundary slightly smaller than Pennsylvania. This cuts off certain parts of the Pennsylvania border and thus also of the border of the counties there. I have e-mailed them about this issue and hope they can fix it for me.
    • An alternative is for me to download the entirety of the USA, then use Osmium to create my own Pennsylvania .osm.pbf file. If Geofabrik does not reply, I might go that route.
  • osm2pgsql claims it can work around incomplete data. This seems to not be the case for the areas I want to end up with though. The lack of closed way trips it up. If there is a setting in the flex Lua script to handle this, I am not seeing it in the documentation.
  • The geometry never reaches PostGIS. However, when I tried to throw together a geometry from the points I had in the database, PostGIS did tell me the geometry was invalid due to a self-intersect. I am currently thinking I just did something wrong here. I will investigate the other avenues first.

 


Day three.

Extracting Pennsylvania Ourselves

Looking at the Osmium documentation, it seems pretty straightforward to create your own extract.

First, get an extract larger than Pennsylvania. Geofabrik offers a “US North East”, but Pennsylvania is the south border of that extract. It suffers the same border problem as the Pennsylvania extract. Instead, I download the entire United States of America. That one is 9.1 GB, so a bit too big to bother fetching more often, but at this point I am just doing it for us to have some fun.

wget http://download.geofabrik.de/north-america/us-latest.osm.pbf

Next, get the boundary for Pennsylvania (OSM relation id: 162109). The documentation says you can get it directly from OSM, but notes that this may time out for larger boundaries.

wget https://www.openstreetmap.org/api/0.6/relation/162109/full --output-document pennsylvania-boundary.osm

Then it goes on to say you can also extract it from a .osm.pbf file you have lying around. Surely that is the case if you are trying to create an extract? Extracting is also just a command away. You might recognise it from the earlier getid calls. The only new flag is -t, which removes tags from all objects except the one you specify. Makes for a cleaner and smaller file, no other multipolygons and what not.

osmium getid -r -t us-latest.osm.pbf r162109 -o pennsylvania-boundary.osm

Actually running that explains why it was not the first option. It takes a bit of time and makes my laptop lift off. Actually a bit surprised it is that inefficient to read tiny parts of the big file. My assumption was this would be one of the use cases it is optimised for. Anyway, that takes like 10–15 minutes.

Now I can finally run the actual extract. I use my self-created boundary because my computer worked for it. -p says which polygon to use.

osmium extract -p pennsylvania-boundary.osm us-latest.osm.pbf -o pennsylvania.osm.pbf

This… seems to go faster than the boundary extract? You get a progress bar to look at, so maybe it is just perception. It ends up taking 10 minutes, so similar time. The final size is just about the same size as the PA .osm.pbf file I had gotten from Geofabrik, so that is encouraging.

I import my new Pennsylvania and… it works! Adams County is imported. Franklin County is imported. Pennsylvania is imported.

Final Thoughts

  • Annoying that a slightly incorrect boundary in Geofabrik exists.
  • More annoying that osm2pgsql did not go for a best effort there.
  • More more annoying that osm2pgsql has no (obvious) way to make it log something like failing geometries (with reasons). Pinpointing the issue would have gone a lot quicker.
  • Neat that I can fix it by just generating the files myself rather easily, once I had stumbled upon the proper tool.
  • I hope an incomplete data error like that does not happen too often.