
Tutorial: consuming Twitter’s real-time stream API in Python


Twitter is preparing to launch several impressive new features, including a new streaming API that will give desktop client applications real-time access to the user's message timeline. The new streaming API was announced last week at Twitter's Chirp conference, where it was made available to conference attendees on-site for some preliminary experimentation. Twitter opened it up to the broader third-party developer community on Monday so that programmers can begin testing it to offer informed feedback.

This tutorial will show you how to consume and process data from Twitter's new streaming API. The code examples, which are written in the Python programming language, demonstrate how to establish a long-lived HTTP connection with PyCurl, buffer the incoming data, and process it to perform the basic message display functions of a Twitter client application. We will also take a close look at how the new streaming API differs from the existing polling-based REST API.

Understanding the polling model

Twitter client software generally uses a polling method to communicate with the Twitter service. Applications post and retrieve Twitter messages by sending HTTP requests with certain parameters to specific Twitter URLs. Twitter's servers send back XML or JSON responses, which are then parsed and processed by the client application. This mechanism is based loosely on the concept of Representational State Transfer (REST), though it does not strictly conform to the REST paradigm. Twitter provides REST endpoints that make virtually all of its functionality accessible to third-party software.

Most Twitter applications that display a message timeline are programmed to request message updates on a configurable interval, typically ranging from two to five minutes. This polling model fundamentally distinguishes Twitter from instant messaging because it introduces unavoidable latency between when messages are posted and when they are received. The only way to decrease the latency with Twitter's current standard API is to reduce the polling interval and request updates more frequently. That obviously puts a high load on Twitter's servers and creates scalability challenges. In order to prevent abuse, Twitter has a rate-limiting mechanism that sets a cap on the number of API requests that a single user can make per hour.
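The polling loop described above can be sketched in a few lines of Python. Note that `fetch_new_messages` is a hypothetical placeholder standing in for a real REST request; the loop simply tracks the newest message ID it has seen so that each request only asks for what's new:

```python
import time

def poll(fetch, handle, interval, iterations):
    """Request updates on a fixed interval, remembering the newest
    message ID so each request only retrieves newer messages."""
    since_id = None
    for _ in range(iterations):
        for message in fetch(since_id):
            since_id = message["id"]
            handle(message)
        time.sleep(interval)  # the source of the polling latency

# Simulated use with a stub fetch function and no real network traffic:
def fetch_new_messages(since_id):
    return [{"id": 1, "text": "hello"}] if since_id is None else []

seen = []
poll(fetch_new_messages, seen.append, interval=0, iterations=3)
print(seen)  # → [{'id': 1, 'text': 'hello'}]
```

The `time.sleep` call is where the latency lives: no matter how quickly messages are posted, nothing arrives until the next scheduled request.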

Twitter's heavy reliance on the polling model poses a bit of a paradox. One of Twitter's greatest strengths is timeliness—the site's front page touts it as a source of "instant information" and a window into what is happening "right now" in the world. Although Twitter largely delivers on that promise by enabling news and ideas to propagate faster than is generally possible in other mediums, its reliance on the polling model prevents it from facilitating truly real-time messaging.

Overcoming the limitations of polling

Twitter's new streaming API will make it possible for third-party software to overcome the limitations of polling, giving Twitter a big boost as its emphasis continues to shift toward real-time messaging. The streaming API allows client applications to establish a persistent connection with Twitter's servers and constantly receive messages right after they are posted, obviating the need for polling on an interval. This communication model could be described as "push" messaging.

Handling millions of simultaneous persistent connections comes with its own set of scalability challenges, so Twitter is testing the waters before it proceeds with a full-scale production roll-out. At this time, the streaming API is still experimental and is not ready to be adopted in applications that are released for general use. The streaming capacity is limited and the API is still subject to change. It is being made available with pre-beta status so that developers can begin testing the functionality.

Twitter's documentation warns that it could change the streaming endpoint arbitrarily and without notice if it thinks it's being abused. As the testing progresses, Twitter will begin hammering out a schedule for a beta launch and a full production release. Although it's not yet ready for widespread use in client applications, this is a great time to start playing around to see how it works.

A look at the streaming API

Unlike Twitter's conventional REST API, which supports both JSON and XML, the streaming API offers only JSON output. JSON, which stands for JavaScript Object Notation, is a simple and elegant format for describing structured data. The syntax, which is based on JavaScript, is human-readable and very easy to parse. It's so simple that the entire grammar can fit on the back of a business card.
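Python's standard library handles JSON parsing directly. The following snippet parses an illustrative status object (the field names mirror Twitter's message format, but the values here are made up for demonstration):

```python
import json

# An illustrative, not verbatim, status object in Twitter-style JSON:
blob = '{"id": 12345, "text": "Hello from the stream", "user": {"screen_name": "segphault"}}'

# json.loads turns the JSON text into ordinary Python dicts and lists
status = json.loads(blob)
print(status["user"]["screen_name"], status["text"])
# → segphault Hello from the stream
```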

To use the streaming API, an application makes a long-lived HTTP request to the streaming endpoint. Unlike a conventional REST API request, where the connection to the server is terminated after data is received, the streaming API leaves the connection open for as long as possible and will perpetually push new data as it is available. The data is sent as blobs of JSON that describe messages and events, such as retweets and message deletion. The structure of the message data that is emitted by the streaming API matches that of the REST API, which means that application developers who are already using the JSON output format can reuse their existing message parsing code.

There are several different techniques that are commonly used in Web programming to achieve push messaging with long-lived HTTP connections. These techniques are collectively referred to as Comet communication. It's important to understand that the Twitter streaming API is designed for true HTTP streaming and isn't intended to be used with other common Comet methods such as long-polling. In a long-polling scenario, a connection is established and held open until data is received, at which point the connection is terminated and then reestablished in anticipation of more data. When HTTP streaming is used, the connection just stays open and keeps getting data.

Consuming the streaming API with PyCurl

The easiest way to handle HTTP streaming in Python is to use PyCurl, the Python bindings for the well-known Curl network library. PyCurl allows you to provide a callback function that will be executed every time a new block of data is available. The following code is a simple demonstration of HTTP streaming with PyCurl:

import pycurl, json

STREAM_URL = "http://chirpstream.twitter.com/2b/user.json"

USER = "segphault"
PASS = "XXXXXXXXX"

def on_receive(data):
  print(data.decode("utf-8"))  # PyCurl delivers the data as raw bytes

conn = pycurl.Curl()
conn.setopt(pycurl.USERPWD, "%s:%s" % (USER, PASS))
conn.setopt(pycurl.URL, STREAM_URL)
conn.setopt(pycurl.WRITEFUNCTION, on_receive)
conn.perform()

The code example above shows how to instantiate a Curl object, set the URL, provide login credentials, and send the data to a callback function. The callback function in the example simply echoes the received data to the terminal. If you put in your own Twitter credentials and run that code in a Python script at the command line, you will see the stream of JSON data transmitted by the Twitter service.

When the connection is idle and there is no other data to send, the streaming API will emit an empty line every 30 seconds. The empty line is a keep-alive signal that is intended to prevent client applications from timing out and dropping the connection. PyCurl doesn't require any special configuration, but other network libraries might require the user to set a custom timeout duration for idle connections. You should make sure that it is set to something that is higher than the streaming API's 30-second keep-alive interval so that the connection isn't dropped.
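If you are working at a lower level, the same principle applies: the idle timeout must be longer than the 30-second keep-alive interval. A minimal illustration with Python's standard socket module (not part of the PyCurl example, just a sketch of the timeout setting itself):

```python
import socket

# The timeout must comfortably exceed Twitter's 30-second keep-alive
# interval, or an idle-but-healthy connection will be dropped before
# the next blank keep-alive line arrives.
sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
sock.settimeout(90)  # seconds
```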

Establishing the connection and receiving the data is clearly easy, but processing it is a bit more complex. The blocks that are sent to the callback function are each less than 1500 bytes. If a JSON object from Twitter exceeds the size of the block of data that is sent to the callback, then we have to do some buffering in order to ensure that we get the complete object. The JSON parser will choke if you try to pass it a partial structure.

At the end of every complete JSON object, the Twitter streaming API outputs a carriage return, which we can use to determine if we have a full object. Every time we receive a block, we append it to a buffer string and then check to see if it has the carriage return at the end. If it does, then we can assume that the buffer string contains a complete JSON object, which can be handed off to the JSON parser and then processed. After we parse the JSON in the buffer string, we clear the buffer so that it can start collecting new data.
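A minimal sketch of that buffering logic is shown below. It assumes each complete object is terminated by a carriage return/newline pair (`"\r\n"`), and the `StreamBuffer` class name and `feed` method are my own illustrative choices, not part of Twitter's API:

```python
import json

class StreamBuffer(object):
    """Accumulates raw stream chunks and invokes a callback with each
    complete, parsed JSON object. A trailing carriage return marks the
    end of an object."""

    def __init__(self, callback):
        self.buffer = ""
        self.callback = callback

    def feed(self, data):
        self.buffer += data
        # Only parse once the buffer holds a complete object
        if self.buffer.endswith("\r\n") and self.buffer.strip():
            message = json.loads(self.buffer)
            self.buffer = ""  # reset for the next object
            self.callback(message)

# Simulate one JSON object arriving split across two network blocks:
received = []
buf = StreamBuffer(received.append)
buf.feed('{"text": "Hello, ')
buf.feed('streaming world"}\r\n')
print(received)  # → [{'text': 'Hello, streaming world'}]
```

Feeding the object in two pieces shows why the buffering matters: parsing the first chunk alone would raise a `ValueError`, but waiting for the terminating carriage return yields a complete, parseable object.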
