Free Our Data: the blog

A Guardian Technology campaign for free public access to data about the UK and its citizens


PDFs are bad for open government, says Sunlight Foundation in US

This is always worth remembering:

Government releasing data in PDF tends to be catastrophic for Open Government advocates, journalists and our readers because of the amount of overhead it takes to get data out of it. When a government agency publishes its data and documents as PDFs, it makes us Open Government advocates and developers cringe, tear our hair out, and swear a little (just a little). Most earmark requests by members of congress are published as PDF files of scanned letters, leading the Sunlight Foundation and others to write custom parsers for each letter.

I know that much of the work going on in the data.gov.uk channels is about finding effective ways of parsing data. The hope has to be, though, that very little of it involves recovering data that has already been output to PDF, because turning a PDF back into useful data is, in the famous quote, “about as easy as turning hamburger into cow”.

Back to the Sunlight Foundation again:

Here at Sunlight we want the government to STOP publishing bills, and data in PDFs and Flash and start publish them in open, machine readable formats like XML and XSLT. What’s most frustrating is, Government seems to transform documents that are in XML into PDF to release them to the public, thinking that that’s a good thing for citizens. Government: We can turn XML into PDFs. We can’t turn PDFs into XML.
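To make that asymmetry concrete, here is a minimal Python sketch of what "machine readable" buys you. Everything in it is an assumption for illustration: the earmark records, the element and attribute names, and the `total_by_member` helper are invented and do not reflect any real government schema. The point is only that a few lines of standard-library code can total figures published as XML, whereas the same figures in a scanned PDF would need OCR and a custom parser first.

```python
import xml.etree.ElementTree as ET

# Hypothetical earmark records: the element and attribute names are invented
# for illustration and do not reflect any real government schema.
SAMPLE_XML = """
<earmarks>
  <earmark member="Rep. A" amount="250000">Bridge repair</earmark>
  <earmark member="Rep. B" amount="100000">Library computers</earmark>
  <earmark member="Rep. A" amount="50000">Park benches</earmark>
</earmarks>
"""


def total_by_member(xml_text):
    """Sum the earmark amounts for each member found in the XML string."""
    totals = {}
    root = ET.fromstring(xml_text.strip())
    for item in root.iter("earmark"):
        member = item.get("member")
        totals[member] = totals.get(member, 0) + int(item.get("amount"))
    return totals


print(total_by_member(SAMPLE_XML))
```

Nothing here depends on how the document looks on a page, which is exactly what is lost once it has been flattened into a PDF.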

And another word for Flash. Ah, Flash:

Flash isn’t off the hook either. Government has spent lots of time and money developing flash tools to allow citizens to view charts and graphs online, and while we’re happy the government is interested in allowing citizens to do this, Government’s primary method of disclosure should not be these visualizations, but rather publishing the APIs and datasets that allow citizens to make their own

The comments are worth reading too, such as this one from Adrian Holovaty: “If I had a dollar for each hour I’ve spent trying to finagle raw data out of PDFs, I could afford Adobe Photoshop.”

And the rather alarming one from Michael Friis: “Here in Denmark Parliament publishes many ancillary documents as PNGs.” That is quite scary, though in line with Ordnance Survey’s tendency to answer FOI requests with TIFFs.
