PDFs are bad for open government, says Sunlight Foundation in US
This is always worth remembering:
Government releasing data in PDF tends to be catastrophic for Open Government advocates, journalists and our readers because of the amount of overhead it takes to get data out of it. When a government agency publishes its data and documents as PDFs, it makes us Open Government advocates and developers cringe, tear our hair out, and swear a little (just a little). Most earmark requests by members of congress are published as PDF files of scanned letters, leading the Sunlight Foundation and others to write custom parsers for each letter.
I know that a lot of the effort going on in the data.gov.uk channels is about finding effective ways of parsing data. The hope has to be, though, that very little of it involves reverse-engineering data that has been output to PDF. The point, of course, is that turning a PDF back into useful data is, in the famous quote, “about as easy as turning hamburger into cow”.
Back to the Sunlight Foundation again:
Here at Sunlight we want the government to STOP publishing bills, and data in PDFs and Flash and start publish them in open, machine readable formats like XML and XSLT. What’s most frustrating is, Government seems to transform documents that are in XML into PDF to release them to the public, thinking that that’s a good thing for citizens. Government: We can turn XML into PDFs. We can’t turn PDFs into XML.
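To make the contrast concrete, here is a minimal sketch of how little work machine-readable data needs. The XML below is invented for illustration (the element and attribute names `earmarks`, `earmark`, `member` and `amount` are assumptions, not any real government schema); the point is only that a few lines of standard-library Python turn it into rows and then into a CSV that opens in a spreadsheet.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Hypothetical sample data: this structure is made up for illustration
# and does not reflect any real published schema.
SAMPLE_XML = """<earmarks>
  <earmark member="Rep. A" amount="250000">Bridge repair</earmark>
  <earmark member="Rep. B" amount="1200000">Harbour dredging</earmark>
</earmarks>"""


def earmarks_to_rows(xml_text):
    """Parse the XML text into a list of plain dictionaries, one per earmark."""
    root = ET.fromstring(xml_text)
    return [
        {
            "member": element.get("member"),
            "amount": int(element.get("amount")),
            "purpose": (element.text or "").strip(),
        }
        for element in root.iter("earmark")
    ]


def rows_to_csv(rows):
    """Serialise the rows as CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer,
        fieldnames=["member", "amount", "purpose"],
        lineterminator="\n",
    )
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


if __name__ == "__main__":
    rows = earmarks_to_rows(SAMPLE_XML)
    print(rows_to_csv(rows))
    print("Total requested:", sum(row["amount"] for row in rows))
```

Doing the same with a scanned PDF would mean optical character recognition plus a custom parser for every layout, which is exactly the overhead Sunlight is complaining about.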
And another word for Flash. Ah, Flash:
Flash isn’t off the hook either. Government has spent lots of time and money developing flash tools to allow citizens to view charts and graphs online, and while we’re happy the government is interested in allowing citizens to do this, Government’s primary method of disclosure should not be these visualizations, but rather publishing the APIs and datasets that allow citizens to make their own
The comments are worth it too, such as Adrian Holovaty: “If I had a dollar for each hour I’ve spent trying to finagle raw data out of PDFs, I could afford Adobe Photoshop.”
And the rather scary one from Michael Friis: “Here in Denmark Parliament publishes many ancillary documents as PNGs.” That is in line with Ordnance Survey’s tendency to answer FOI requests with TIFFs.

November 8th, 2009 at 11:20 am
Surely government needs to do both?
For the many who want to read information or do some simple mangling in a spreadsheet, data needs to be published as PDF or CSV (as appropriate). For those who want to mash the data with other information, and maybe access real-time info, it needs to be available as a feed.
November 11th, 2009 at 5:53 am
[...] hadn’t realised this was a general problem with governments until I saw PDFs are bad for open government, says Sunlight Foundation in US on Free our Data blog last week. Should I change my request to emphasise the original data [...]
November 13th, 2009 at 9:10 am
It’s worth noting that when you submit an FOI request, you can specify the format in which you wish to receive it, and also the FOI officer has a duty to assist you in obtaining the information as easily as possible.