How the news breaks

Jacob Kaplan-Moss

November 8, 2006

I swear, sometimes this programming thing is really just the digital equivalent of baling twine and duct tape.

If you happened to be watching 6News in Lawrence last night, you’d have seen the election results crawling across the bottom of the screen:

[Image: Database-backed TV]

Pretty much par for the course in terms of local TV coverage… but do you have any idea how that information gets there?

Let me break it down:

  1. Votes are collected by the precincts, who report the totals to the Secretary of State’s office. The Secretary of State publishes those results on a web page, but their IT department is paranoid, so only a single IP — that of the Journal-World’s corporate firewall — is given access to that page of results.

    (It almost goes without saying that the HTML of this result page is grossly invalid.)

    Our web servers, however, sit outside of the corporate firewall on a separate network, and so are unable to see that page.

  2. So, a small script on a Linux box under Matt’s desk downloads the page of results every time it changes, and then turns around and uploads it to a production server.

  3. Another small script (on our server this time) scrapes this shoddy HTML (using BeautifulSoup, of course) and inserts the data into our database. At this point the data shows up online, but the journey to the airwaves is far from complete.

  4. Next, a third script fetches the data back out of the database and writes an Excel spreadsheet (using pyExcelerator).

  5. This spreadsheet is moved to a publicly-accessible URL.

  6. Over at 6News, a Windows box sits and runs a batch file which, using a Windows binary of wget, downloads the Excel file.

  7. Finally, the on-air graphics system reads this Excel file, and the data appears in the crawl.
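For the curious, the "downloads the page every time it changes" part of step 2 might look something like the sketch below. The real script was shell; this is Python purely for illustration, and `fetch` and `upload` are hypothetical stand-ins for the actual download and the copy to the production server. The idea is simply to hash each fetched page and relay it only when the hash differs from the last one seen.

```python
import hashlib


def digest(page_bytes):
    """Return a SHA-256 hex digest of the raw page bytes."""
    return hashlib.sha256(page_bytes).hexdigest()


def relay_if_changed(fetch, upload, last_digest):
    """Fetch the results page once and upload it only if it changed.

    `fetch` and `upload` are hypothetical callables (assumptions, not
    the real script): `fetch()` returns the page as bytes, and
    `upload(page)` pushes it to the production server.

    Returns the digest to remember for the next poll.
    """
    page = fetch()
    current = digest(page)
    if current != last_digest:
        upload(page)
    return current
```

Call it in a loop with a short sleep between polls, passing the returned digest back in each time; start with `last_digest=None` so the first fetch is always uploaded.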

If you’ve been keeping track, this process involves eight different machines:

  1. the Secretary of State’s vote machine,
  2. the Secretary of State’s web server,
  3. the Linux box under Matt’s desk,
  4. two of our web servers,
  5. our database server,
  6. the Windows box over at 6News, and
  7. the on-air graphics machine

and four glue scripts, in three different languages:

  1. the script that copies the results from the Secretary of State to our public server (shell),
  2. the data scraper (Python),
  3. the Excel sheet writer (Python), and
  4. the Windows downloader (batch).
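To give a flavor of the scraper, here is a minimal sketch of pulling rows out of a table whose tags are never properly closed. The real scraper used BeautifulSoup; this one swaps in the standard library's `html.parser` so it runs without extra installs. The two-column candidate/votes layout and every name in it (`ResultTableParser`, `scrape_results`) are assumptions for illustration, not the actual page or code.

```python
from html.parser import HTMLParser


class ResultTableParser(HTMLParser):
    """Collect the text of each table row, tolerating unclosed tags."""

    def __init__(self):
        super().__init__()
        self.rows = []     # finished rows, each a list of cell strings
        self._row = None   # cells of the row being read, or None
        self._cell = None  # text fragments of the open cell, or None

    def _end_cell(self):
        if self._cell is not None and self._row is not None:
            self._row.append("".join(self._cell).strip())
        self._cell = None

    def _end_row(self):
        self._end_cell()
        if self._row:
            self.rows.append(self._row)
        self._row = None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._end_row()   # a new <tr> implicitly closes the previous row
            self._row = []
        elif tag in ("td", "th"):
            self._end_cell()  # a new cell implicitly closes the previous cell
            if self._row is None:
                self._row = []
            self._cell = []

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._end_cell()
        elif tag in ("tr", "table"):
            self._end_row()

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def close(self):
        super().close()
        self._end_row()       # flush a row left open at end of input


def scrape_results(html):
    """Return {name: votes} from an assumed two-column results table."""
    parser = ResultTableParser()
    parser.feed(html)
    parser.close()
    results = {}
    for row in parser.rows:
        # Skip header rows and anything that is not "name, number".
        if len(row) == 2 and row[1].replace(",", "").isdigit():
            results[row[0]] = int(row[1].replace(",", ""))
    return results
```

From there, inserting each name and vote count into the database is one query per row.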

Here’s the kicker, though:

Despite — or because of — all of this, all night we had fresher data — often by 30 minutes or more — than any of our competition.

Baling twine and duct tape, man…

Comments:

Michael Hessling:

Worked for MacGyver.

Matt Croydon:

Another bit of duct tape not mentioned above is the sleep command in the batch script that downloaded the Excel file. We didn't have administrator rights to install a Microsoft resource pack that contains an implementation of a "sleep" command, so we used a ping-based hack (http://malektips.com/dos001...) instead.

Great fun!

Malcolm Tredinnick:

Hopefully yesterday was a huge ton of fun for all of you and mostly went according to plan. Looking at everybody's photo streams (from the LJ folk), it looks like things went well and you had enough time to take screenshots.

I'd love to read more stories like this about some of the "behind the scenes" work that went on. Partly for my own entertainment, partly to point people to in future -- people need to realise this IT stuff isn't so much magic as creativity and hard work put together in the right proportions.

Jason Salas:

Hi Simon,

Good work! I ran a similar system of disparate devices calling a centralized tier of election data for my station's web site, KUAM.com, out here in Guam. Our big challenge is that we publish data multi-platform, exacerbated by the fact that in addition to being the main architect for all of our systems, I'm also co-anchor of our coverage and newscasts, so I can't be the guy to physically work on the systems.

Basically, I wrote a .NET sub-application that integrates with an Excel spreadsheet which holds vote tallies. These tallies are, sadly, faxed to us from the tabulation center, since our local election commission can't find a more automated way of distributing the data. I built a component that reads the information into a C# caching tier that then writes an XML structure to disk. This XML is incorporated into various clients via a series of XSLT processes - our web site (http://www.kuam.com/decisio...), our mobile framework, and our Pinnacle TV graphics system.

FXDeko software within Pinnacle reads the data from a URL and dynamically creates graphics based on templates. So while it's not as technically-impressive as 6News' system, it does work for us and lets us get results out in near-real time in a whole bunch of different platforms and formats.

Jason Salas:

Whoops...that should have said, "Good work, Jacob." :-)

Leonhard Markert:

I know this is the wrong place to tell you, but I don't know where to put it, so here we go:
Your RSS seems to be broken. For example, the RSS feed for this article points at http://www.jacobian.org/arc... whereas it really is at http://www.jacobian.org/wri....
