Trying to Import "Adams County, PA" from OpenStreetMap to PostGIS
Updated on 12 July with more tinkering. Updated on 13 July with a solution and closing thoughts.
I am playing around with OpenStreetMap data and bumped into an issue trying to
import it into PostgreSQL with the PostGIS extension enabled. I have
pennsylvania-latest.osm.pbf
from Geofabrik and am using
osm2pgsql to import it into PostgreSQL. I am using osm2pgsql’s
flex Lua script. Specifically, I am only importing administrative relations
(i.e., those with an admin_level
set) and am asking osm2pgsql to parse their
geometries and save it all in the database. This seems to be working fine, I am
not doing anything funky, but by accident I notice that Adams
County is not being imported. Why?
Is the Input File Valid?
I first add some print
debugging in the Lua importing script. Adams County
is encountered in the .osm.pbf
file.
function osm2pgsql.process_relation(object)
if object.tags.name and object.tags.admin_level then
print(object.tags.name) -- Added
regions:add_row({
-- omitted
})
end
end
Any Osm2pgsql Errors?
Osm2pgsql does not spit out any errors in the default verbosity. I crank up the
verbosity to see if I spot anything. To do so, I add the flags
--log-level=debug
, --log-sql
, and --log-sql-data
to the osm2pgsql
invocation. This dumps out a lot, so just redirect it to a file. In the file, I
find Adams County due to the earlier print
debugging still being in the Lua
script. No other mention of it, however. The areas before and after it do
show in the logging as being added to the table. Odd.
I search a bit for my table (regions
) where I expect to find Adams County in.
I notice a reference to a SQL trigger on the table and a SQL function that is
called by it (both called regions_osm2pgsql_valid
). The function definition
is also there and I see a call to ST_IsValid
. That is a function provided by
PostGIS. I find the following in the osm2pgsql documentation.
Validity is more complex for Polygon and MultiPolygon geometries: There are multiple ways how such a geometry can be invalid. For instance, if the boundary of a polygon is drawn in a figure eight, the result will not be a valid polygon. The database will happily store such invalid polygons, but this can lead to problems later, when you try to draw them or do calculations based on the invalid geometry (such as calculating the area).
Osm2pgsql makes sure that there are no invalid geometries in the database, either by not importing them in the first place or by using an
ST_IsValid()
check after import. You’ll either getNULL
in your geometry columns instead or the row is not inserted at all (depending on config).
I notice in PostGIS documentation that there are also an ST_IsValidReason
and
an ST_IsValidDetail
, both give some more information about why a geometry
is invalid. I see no obvious way to make osm2pgsql call that for me and tell me
the reason.
Finding Why Adams County is Not Valid
Get a Hold of All the Data
Osm2pgsql has a flag that will save the OpenStreetMap data as-is in some extra
tables, alongside the processed tables you desire. To do so, add the --slim
flag. This creates the tables planet_osm_nodes
, planet_osm_ways
, and
planet_osm_rels
. I think you can figure out what type of OSM objects each
table holds. Time for some SQL tinkering.
From Relation to Nodes
To figure out the schema of a table, just use \d tablename
. I care more to
just see every column’s values so I know what I can work with here, so a
SELECT
does the trick. 417441
is the id of the Adams County relation in
OpenStreetMap.
SELECT
*
FROM
planet_osm_rels
WHERE
id = 417441;
Sadly there is no geometry or similar column. Instead, the interesting column
seems to be parts
, which is just the ids of the members of the relation.
Luckily, Adams County is (seems to be) a simple polygon, so I only have to
contend with a bunch of outer
members. It also has label
and admin_centre
members, but I can easily ignore those. parts
is an array of 45 members, 2 I
do not care about. To make it easy on myself, I just copy the list and move on.
SELECT
*
FROM
planet_osm_ways
WHERE
id IN (...that list...);
This returns a row per way and has a column nodes
which is, again, an array
of ids. Now the ids refer to nodes. There are about 1700 so just a quick copy
is not possible any more. Instead, you can modify the query to get one node id
per row instead. None of the other information matters so that is thrown out.
SELECT
unnest(nodes) AS nodes
FROM
planet_osm_ways
WHERE
id IN (...);
With a list of node ids, we can finally find some geometry information!
SELECT
*
FROM
planet_osm_nodes AS n
WHERE
n.id IN (
SELECT
unnest(nodes) AS nodes
FROM
planet_osm_ways
WHERE
id IN (...));
This gives a lat
and a lon
column, but they are in integers to avoid the
usual floating point pitfalls. So 39.8987036 is stored as 398987036.
Lat and Lon to PostGIS Geometries
I want to work with the geometries PostGIS has to offer, so rework the data a
little. The 4326
indicates the SRID.
SELECT
ST_Point (lon / 10000000.0, lat / 10000000.0, 4326)
FROM
planet_osm_nodes AS n
WHERE
n.id IN (
SELECT
unnest(nodes) AS nodes
FROM
planet_osm_ways
WHERE
id IN (...));
The ways and nodes are all stored in the right order (day three pondering:
actually, I am not sure I can make that assumption any more. I think the order
is correct when I unnest
, but once I use that in a WHERE ... IN
it actually
sounds unlikely to still be correct. Idea would be to dump the unnest
result
into a table with an order
column, then join with planet_osm_nodes
and
ORDER BY order
. This also explains why I am getting the incorrect geometry
error I get later. This also invalidates any pondering I do about that result
later.), so I want to just stitch the 1712 points together. For that you need
to throw the result into an ARRAY
and pass it on to ST_Makeline
first. Then
you have a PostGIS line geometry. I already name that column theline
for
referencing it later.
SELECT
ST_Makeline (ARRAY (
SELECT
ST_Point ... previous code block
)) AS theline;
Then to make a polygon out of it. Our line has every point for a polygon, but does not loop back to its starting point yet. So just add the starting point to the back of the line as well. Finally, make a polygon geometry out of it all.
SELECT
ST_MakePolygon (ST_AddPoint (theline, ST_StartPoint (theline)))
FROM (
SELECT
ST_Makeline (
...
) AS theline) AS makeline_table;
Is It Valid?
Remember that ST_IsValidDetail
? Finally there is a polygon to feed to it! The
function returns an object with valid
(boolean whether the polygon is valid),
reason
(textual reason), and location
(the geometry position of the
reason). I will also immediately check the type of the location.
SELECT
(validity).valid,
(validity).reason,
(validity).location,
GeometryType((validity).location)
FROM (
SELECT
ST_IsValidDetail (ST_MakePolygon (...)) validity
FROM (
...
) AS makeline_table)
AS reasons;
We were debugging Adams County, remember? I get a self intersection as reason
and the location is of type point. So, I change the SELECT
to give me the
coordinates.
SELECT
(validity).valid,
(validity).reason,
ST_X ((validity).location),
ST_Y ((validity).location)
FROM (
...
) AS reasons;
Not Solving Adams County
I get as coordinates -76.97132396038843 longitude and 39.93156765348423 latitude. I know I am dealing with a self-intersect. I go check it out on openstreetmap and find… nothing. Welp. JOSM validation on the spot also yields nothing. I double check the relation members in JOSM and they seem to indeed be in the correct order.
I run out of ideas. The osm2pgsql documentation also mentions Geofabrik’s OSM Inspector tool, but that shows nothing wrong for Adams County either.
I give up for now, hope you got something useful from reading this fruitless endeavour.
Next day continuation.
Different Projections?
There I am, out of ideas, scrolling through the osm2pgsql documentation for any hint of what to do. I see a hint that changing the map projection (keyword: SRID) could potentially project two different points onto the same spot. Logical next step in my head: that can mess up geometries.
By default, osm2pgsql stores things in Web Mercator projection (EPSG 3857), since a major user of osm2pgsql is people wanting to use the data to create tiles for maps. In OpenStreetMap, the coordinates are stored as WGS84 (SRID: 4326). Anyway, lots of technical details to tell you that there is a transformation of the projection happening. Point number go change.
So I adjust the Lua importing script and specifically the geometries. In the
table definition in it, add the projection
key to your geometry column. For
example:
{ column = "geom", type = "multipolygon", projection = "WGS84" }
Anyway, I do that, run a new import. Still no worky. I assume that is not to blame then.
Further Inspecting the Input
As mentioned earlier, I am working with pennsylvania-latest.osm.pbf
from
Geofabrik. It is an export of, you guessed it, just the OpenStreetMap data in
the state of Pennsylvania, USA. While scrolling the osm2pgsql documentation, I
see them reference Osmium. I had
heard of this tool before, but have not used it. Time to change that.
I am running all of this in a Debian container on Docker, so I hook up a shell to it and install Osmium.
apt install osmium-tool
Meanwhile, the documentation scrolling shifts focus to Osmium’s documentation.
The focus seems to be on validating .osm.pbf
files. So I give that a try
again.
With Osmium, I can extract information I care about. Recalling that Adams
County’s id is 417441, I create an .osm.pbf
file with just the Adams County
administrative boundary. The -r
also gets everything referenced by the
relation I am asking information about. So for a relation, get all its members.
For any ways in it, get all the nodes. Note the r
before the number to tell
Osmium I am looking for a relation.
osmium getid -r pennsylvania-latest.osm.pbf r417441 -o adamscounty.osm.pbf
So now I have a small file (11 kB vs the 241 MB of Pennsylvania) with just Adams County information. Osmium can check whether everything referenced is also present. It seems obvious to me that it would be, but for completeness sake, I run it.
$ osmium check-refs -r adamscounty.osm.pbf
[======================================================================] 100%
There are 1714 nodes, 39 ways, and 1 relations in this file.
Nodes in ways missing: 0
Nodes in relations missing: 0
Ways in relations missing: 4
Relations in relations missing: 0
Wait what?
Are we finally getting somewhere in this problem? I am missing some ways,
apparently, somehow. I run getid
again, but now with debug information. This
is basically a text dump that you can look through. I first run it on
adamscounty.osm.pbf
, but then think maybe I messed something up when creating
the smaller file. So instead I run it on the Pennsylvania dump.
osmium getid -r pennsylvania-latest.osm.pbf r417441 -f debug
I redirect it to a file so I can just skim and search through it with Vim. All
the way at the bottom is the Adams County relation. It shows the metadata and
lists the members. There are only 45 members total in the relation, 43 of which
are the outer boundaries of Adams County. Those are the one I am interested in.
If Osmium is to be believed, four of them are not present. I quickly just
search for each of them in the debug file and indeed, four of them cannot be
found. (I later realise Osmium’s check-refs
has a --show-ids
flag that does
this for me)
The culprits:
- Member #1: 60204285
- Member #2: 87287068
- Member #3: 87287085
- Member #43, i.e., the last one: 83031652
I check them out on OpenStreetMap and these four form the Southern border of
Adams County. Adams County’s Southern border is also part of Pennsylvania’s
Southern border (“Mason-Dixon Line”). I assume that is why it was cut off. I
double check that nobody messed up the borders and placed Adams County’s border
outside of Pennsylvania, but that looks correct: those ways are also in the PA
relation as outer
. I double check that it is not some Osmium quirk by looking
in the database. The planet_osm_ways
from before does not have those four
ways. It does have a random sampling of the other ways in the Adams County
relation.
This is not ideal, but I am also not sure that this is problematic. Those four ways are a straight line. The two ways (they go away from that line) adjacent to the four missing ones are present. In other words, the nodes on either end of the straight line are there. In other other words, when you connect those two nodes to form an area, you will still have a straight line anyway!
Missing Ways = Broken Geometries
I check the counties to either side of Adams County. York County to the east was imported correctly. Franklin County to the west was not. Checking a few more along that Southern edge of Pennsylvania, those do seem to have been correctly imported and processed. Around this time I notice Pennsylvania had not been imported either, but is present in the input. At first I had assumed that is was just not included. Perhaps the missing way really is messing it all up.
I do the same Osmium getid
extraction for a few of those. I use check-refs
on them. The ones that are not imported correctly have a way missing. The other
ones seem fine, nothing missing there.
The osm2pgsql manual says it tries to figure it out when data is incomplete.
When osm2pgsql encounters incomplete OSM data it will still try to do its best to use it. Ways missing some nodes will be shortened to the available part. Multipolygons might be missing some parts. In most cases this will work well (if your extract has been created with a sufficiently large buffer), but sometimes this will lead to wrong results. In the worst case, if a complete outer ring of a multipolygon is missing, the multipolygon will appear “inverted”, with outer and inner rings switching their roles.
Though a few sections down it also says
The
as_*
functions will return aNULL
geometry (check withis_null()
) if the geometry can not be created for some reason, for instance a polygon can only be created from closed ways. This can also happen if your input data is incomplete, for instance when nodes referenced from a way are missing.
but I believe this is not related to what I am doing, but more if you do manual processing of geometries in the Lua import script.
So where lies the fault in not processing the incomplete data:
- osm2pgsql not actually trying because of the missing ways, or
- PostGIS being grumpy and not accepting the geometry osm2pgsql creates.
Note: I don’t know if that is even to blame. The earlier tinkering in PostGIS indicated a self-intersection way more North-East in Adams County, seemingly unrelated to this missing border. Still, it is weird and a bit too coincidental that these two miss part of their border and are not getting imported. So: worth looking into.
Blame Osm2pgsql or PostGIS?
Earlier I created an adamscounty.osm.pbf
with Osmium. Since it is much
smaller, debugging is made a little easier. I run an import with all the
earlier logging enabled and look through it.
I notice that in the log you can see osm2pgsql feeding a bunch of processed data into PostGIS by means of stdin. What is fed in is also logged. Specifically, I see processed ways getting inserted. I do not see Adams County getting inserted in this manner. This leads me to believe that osm2pgsql’s claim of trying to fix up your polygon is bogus (or I am messing up a setting, definitely possible too). I am asking it to create an area for me, it notices that without all the ways there, it cannot tell me whether all of it closes up, it decides not to create an area.
Side Step: Why is the Border Missing?
As I mentioned, I get the data from Geofabrik. I check their technical details and notice the following (emphasis mine)
We use polygonal boundaries for the splitting boundaries that are sometiems derived and simplified from OSM data, sometimes just hand-drawn. The boundaries usually follow country borders, but occasionally we take liberties and include a litte more of a neighbouring country if this greatly simplifies the polygon. The Osmium extract function that we use keeps ways and multipolygon relations that cross an extract border complete, i.e. when a very large mutlipolygon crosses the border, an extract can occasionally contain a lot more that expected.
The hand-drawn part makes me check their page for Pennsylvania. On it is a map that colours the area you are downloading. Zooming in on Adams County’s Southern border shows that the coloured in part is slightly above the border there. The same is true for a part of Franklin County. Further to the east and west, the coloured part matches up with the actual border again.
I presume these two got out of sync, so I send them an e-mail about this, asking them to fix it.
Summary
An overview of what I conclude after these two days.
- The
pennsylvania-latest.osm.pbf
file is generated by Geofabrik with a boundary slightly smaller than Pennsylvania. This cuts off certain parts of the Pennsylvania border and thus also of the border of the counties there. I have e-mailed them about this issue and hope they can fix it for me.- An alternative is for me to download the entirety of the USA, then use
Osmium to create my own Pennsylvania
.osm.pbf
file. If Geofabrik does not reply, I might go that route.
- An alternative is for me to download the entirety of the USA, then use
Osmium to create my own Pennsylvania
- osm2pgsql claims it can work around incomplete data. This seems to not be the case for the areas I want to end up with though. The lack of closed way trips it up. If there is a setting in the flex Lua script to handle this, I am not seeing it in the documentation.
- The geometry never reaches PostGIS. However, when I tried to throw together a geometry from the points I had in the database, PostGIS did tell me the geometry was invalid due to a self-intersect. I am currently thinking I just did something wrong here. I will investigate the other avenues first.
Day three.
Extracting Pennsylvania Ourselves
Looking at the Osmium documentation, it seems pretty straightforward to create your own extract.
First, get an extract larger than Pennsylvania. Geofabrik offers a “US North East”, but Pennsylvania is the south border of that extract. It suffers the same border problem as the Pennsylvania extract. Instead, I download the entire United States of America. That one is 9.1 GB, so a bit too big to bother fetching more often, but at this point I am just doing it for us to have some fun.
wget http://download.geofabrik.de/north-america/us-latest.osm.pbf
Next, get the boundary for Pennsylvania (OSM relation id: 162109). The documentation says you can get it directly from OSM, but notes that this may time out for larger boundaries.
wget https://www.openstreetmap.org/api/0.6/relation/162109/full --output-document pennsylvania-boundary.osm
Then it goes on to say you can also extract it from a .osm.pbf
file you have
lying around. Surely that is the case if you are trying to create an extract?
Extracting is also just a command away. You might recognise it from the earlier
getid
calls. The only new flag is -t
, which removes tags from all objects
except the one you specify. Makes for a cleaner and smaller file, no other
multipolygons and what not.
osmium getid -r -t us-latest.osm.pbf r162109 -o pennsylvania-boundary.osm
Actually running that explains why it was not the first option. It takes a bit of time and makes my laptop lift off. Actually a bit surprised it is that inefficient to read tiny parts of the big file. My assumption was this would be one of the use cases it is optimised for. Anyway, that takes like 10–15 minutes.
Now I can finally run the actual extract. I use my self-created boundary
because my computer worked for it. -p
says which polygon to use.
osmium extract -p pennsylvania-boundary.osm us-latest.osm.pbf -o pennsylvania.osm.pbf
This… seems to go faster than the boundary extract? You get a progress bar to
look at, so maybe it is just perception. It ends up taking 10 minutes, so
similar time. The final size is just about the same size as the PA .osm.pbf
file I had gotten from Geofabrik, so that is encouraging.
I import my new Pennsylvania and… it works! Adams County is imported. Franklin County is imported. Pennsylvania is imported.
Final Thoughts
- Annoying that a slightly incorrect boundary in Geofabrik exists.
- More annoying that osm2pgsql did not go for a best effort there.
- More more annoying that osm2pgsql has no (obvious) way to make it log something like failing geometries (with reasons). Pinpointing the issue would have gone a lot quicker.
- Neat that I can fix it by just generating the files myself rather easily, once I had stumbled upon the proper tool.
- I hope an incomplete data error like that does not happen too often.