Parsing Twitter with RegExp


I needed a very simple Twitter cache for a project I’m working on. And I was very happy to trade off some realtime accuracy for reliability. In addition to caching the tweets, I also needed to pre-process them into css-able html with clickable links, usernames, and hashtags. The web had a few nice examples of how to use regular expressions to parse the raw tweet text, but I decided to take what I liked and do the rest myself.


Here’s the PHP code for parsing links out of the raw tweet text:

$text = preg_replace(
    '@(https?://([-\w\.]+)+(/([\w/_\.]*(\?\S+)?(#\S+)?)?)?)@',
    '<a href="$1">$1</a>',
    $text);

I only wanted http and https links, with an optional query part (\?\S+)? and an optional anchor part (#\S+)?. The conversion of a text link into an html link is done using backreferences, which in PHP are written $1, $2, etc. In the expression above, I use $1 twice to put the matched link into both the href attribute and the link text.
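As a quick check, here is a minimal sketch that runs the link expression (the same one used in the getTweets() function further down) on a made-up tweet. The linkify() helper name and the sample text are my own placeholders, not part of the original code.

```php
<?php
// Hypothetical helper wrapping the link regexp from this post.
function linkify($text) {
    return preg_replace(
        '@(https?://([-\w\.]+)+(/([\w/_\.]*(\?\S+)?(#\S+)?)?)?)@',
        '<a href="$1">$1</a>',   // $1 is the whole matched URL, used twice
        $text);
}

// Made-up sample tweet with a path, a query part, and an anchor part.
echo linkify('Reading http://example.com/docs/page.html?x=1#top now'), "\n";
```

Note that the query part (\?\S+)? is greedy, so in this sample it swallows the #top anchor as well. The link still comes out whole, which is all that matters here.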


Here’s the PHP code for parsing Twitter usernames:

$text = preg_replace(
    '/@(\w+)/',
    '<a href="$1">@$1</a>',
    $text);

Nothing special, just take the @ and all following word characters (letters, digits, and underscores), and turn it into a user link.
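To see the username expression in action, here is a small sketch. The linkUsers() name and the https://example.com/ base URL are placeholders I made up; point $base at wherever your user links should go.

```php
<?php
// Hypothetical helper; $base is a placeholder for the profile URL prefix.
function linkUsers($text, $base = 'https://example.com/') {
    // \w+ stops at the first non-word character, so trailing punctuation
    // such as a comma stays outside the link.
    return preg_replace('/@(\w+)/', '<a href="' . $base . '$1">@$1</a>', $text);
}

echo linkUsers('thanks @saturnboy, nice post'), "\n";
```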


Here’s the PHP code for parsing Twitter hashtags:

$text = preg_replace(
    '/\s#(\w+)/',
    ' <a href="$1">#$1</a>',
    $text);

Getting the hashtags right was the trickiest of the three. I decided to only grab hashtags that were preceded by whitespace. The real magic is the %23 in the query string, which forces a search on the complete hashtag, including the # part. For example, compare a search for #flex to a search for flex.
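The %23 is simply the percent-encoded form of #. A raw # in a URL starts the fragment, so it never reaches the server as part of the query. The sketch below shows the encoding and a hashtag replacement built on it. The linkHashtags() name and the https://example.com/search?q= URL are assumptions of mine, standing in for the real search URL.

```php
<?php
// '#' must be percent-encoded to survive inside a query string.
echo rawurlencode('#flex'), "\n"; // %23flex
echo rawurlencode('flex'), "\n";  // flex

// Hypothetical helper; $searchBase is a placeholder search URL.
function linkHashtags($text, $searchBase = 'https://example.com/search?q=') {
    // The matched whitespace is consumed, so the replacement puts a space back.
    return preg_replace(
        '/\s#(\w+)/',
        ' <a href="' . $searchBase . '%23$1">#$1</a>',
        $text);
}

echo linkHashtags('learning #flex today'), "\n";
```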

The Cache

The cache is just a simple cron job that periodically queries Twitter and retrieves the latest tweets. Most importantly, the cache fails gracefully if Twitter is inaccessible, which it does by doing exactly nothing if Twitter is down. This guarantees that my app always has valid data (when my server is up, the cache is up too), but with the possibility that the data is a little old.

Here’s the notable function in the cache:

function getTweets($user, $num = 3) {
    //first, get the user's timeline
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, "$user.json?count=$num");
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    $json = curl_exec($ch);
    if ($json === false) { return false; } //abort on error
    //second, convert the resulting json into PHP
    $result = json_decode($json);
    //third, build up the html output
    $s = '';
    foreach ($result as $item) {
        //handle any special characters
        $text = htmlentities($item->text, ENT_QUOTES, 'utf-8');
        //build the metadata part
        $meta = date('g:ia M jS', strtotime($item->created_at)) . ' from ' . $item->source;
        //parse the tweet text into html
        $text = preg_replace('@(https?://([-\w\.]+)+(/([\w/_\.]*(\?\S+)?(#\S+)?)?)?)@', '<a href="$1">$1</a>', $text);
        $text = preg_replace('/@(\w+)/', '<a href="$1">@$1</a>', $text);
        $text = preg_replace('/\s#(\w+)/', ' <a href="$1">#$1</a>', $text);
        //assemble everything
        $s .= '<p class="tweet">' . $text . "<br />\n" . '<span class="tweet-meta">' . $meta . "</span></p>\n";
    }
    return $s;
}

First, we query the user’s JSON timeline using cURL. Second, we use PHP’s awesome json_decode function to convert the JSON into objects. And lastly, we iterate over the tweets and parse everything into our desired HTML output.
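One case worth guarding against: cURL can succeed while the body is not a list of tweets (an error page, or a JSON error object). json_decode returns null for invalid JSON, and an object rather than an array for a JSON object, so a small check keeps the foreach safe. This is a sketch only; decodeTweets() is a name I made up.

```php
<?php
// Hypothetical helper: decode the response, or return false if it is
// anything other than a JSON array.
function decodeTweets($json) {
    $result = json_decode($json);
    return is_array($result) ? $result : false;
}

var_dump(decodeTweets('<html>Over capacity</html>')); // bool(false)
var_dump(count(decodeTweets('[{"text":"hi"}]')));     // int(1)
```

Returning false matches what getTweets() already does on a cURL error, so the caller needs only one check.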

Here’s some sample output from my Twitter feed:

<p class="tweet">Been reading Programming Goggle App Engine. Actually feeling dumber now than before I started. Too much to learn.<br /> 
<span class="tweet-meta">2:58pm Feb 14th from <a href="" rel="nofollow">TweetDeck</a></span></p>
<p class="tweet">Blog Post :: Async Testing with FlexUnit 4 :: <a href=""></a><br /> 
<span class="tweet-meta">3:33pm Feb 11th from <a href="" rel="nofollow">TweetDeck</a></span></p>
<p class="tweet">Blog Post :: A Better HTML Template for Flex 4 :: <a href=""></a><br /> 
<span class="tweet-meta">12:55pm Jan 25th from <a href="" rel="nofollow">TweetDeck</a></span></p>

Once I have the output, I can do whatever I want with it: save to disk, stick it in the database, keep it in memory, cache it in memcache, etc. In my case, I wanted the simplest possible option, so I chose to write it out as a static html file.
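For the static file option, one possible approach is to write to a temporary file and then rename it into place, so a reader should never see a half-written file. This is a sketch under my own assumptions: saveCache() and the .tmp suffix are made up, and the rename is only a clean swap on most systems when both paths are on the same filesystem.

```php
<?php
// Hypothetical helper: write-then-rename, skipping the write entirely
// when the fetch failed so the old cache stays in place.
function saveCache($path, $html) {
    if ($html === false) { return false; }            // fetch failed: keep old file
    $tmp = $path . '.tmp';
    if (file_put_contents($tmp, $html) === false) { return false; }
    return rename($tmp, $path);                       // swap the new file in
}

// Example: write a demo file into the system temp directory.
saveCache(sys_get_temp_dir() . '/tweets_demo.html', '<p class="tweet">hello</p>');
```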

The end. The rest of the app’s not ready yet…




Looks great… any chance you would provide final complete code?




@Steve: Here’s a simple file cache implementation for you…

Here is tweets_cron.php (also include the getTweets() function from the blog post):

$tweets = getTweets('saturnboy');
if ($tweets !== false) {
    file_put_contents('tweets_saturnboy.txt', $tweets);
}

Here is a snippet from my blog sidebar (not really, but stay with me):

$tweets = file_get_contents('tweets_saturnboy.txt');
print $tweets;

Done. Just put tweets_cron.php in the crontab hourly or faster. Of course, if you need multiple users, you can use an array of usernames and loop.
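The multi-user loop could look roughly like this. It is a sketch, not the code I run: refreshCaches() is a made-up name, and $fetch is a stand-in for getTweets() so the example can be tried without network access.

```php
<?php
// Hypothetical helper: refresh one cache file per user, leaving the old
// file untouched whenever the fetch for that user fails.
function refreshCaches(array $users, callable $fetch, $dir) {
    $written = array();
    foreach ($users as $user) {
        $tweets = $fetch($user);
        if ($tweets === false) { continue; }   // Twitter down: do nothing
        file_put_contents("$dir/tweets_$user.txt", $tweets);
        $written[] = $user;
    }
    return $written;                           // users whose cache was updated
}

// In tweets_cron.php this might be called as:
// refreshCaches(array('saturnboy'), 'getTweets', __DIR__);
```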



Thank you so much!



Now what’s the solution if you’re running php 4.4.4 and JSON isn’t on the server?

I suppose you can do a JSON backwards compatibility, anyone have a ref to that?



@Austin: Sounds like your server is super lame. Maybe you should ask for an upgrade? Good luck.




Thanks Justin! Killer code!



for the hashtag parsing, if I put the hashtag in the very front of string e.g. ‘#mychannel bla bla bla’ it will not parse properly.



@Sony AK: If you are worried about that particular case, just add a fourth regexp like this:

$text = preg_replace('/^#(\w+)/', '<a href="$1" rel="nofollow">#$1</a>', $text);




Amazing code. I have a little problem when parsing email. For example, only the part is parsed as a link. Any help?



@argenisleon: A little weird to find emails in tweets, but you can adjust the user regexp like this:

$text = preg_replace('/^@(\w+)/', '<a href="$1" rel="nofollow">@$1</a>', $text);
$text = preg_replace('/ @(\w+)/', ' <a href="$1" rel="nofollow">@$1</a>', $text);

The first one gets all your retweets and such that begin with @, and the second one gets only those with a space then an @. This will skip any email addresses.
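If you would rather keep a single expression, the two cases can be folded together with an alternation that matches either the start of the string or whitespace. A sketch, where linkMentions() and the https://example.com/ base URL are placeholders of mine:

```php
<?php
// Hypothetical helper: link @user only at the start of the text or after
// whitespace, so the @ inside an email address is left alone.
function linkMentions($text, $base = 'https://example.com/') {
    // $1 is the leading whitespace (or empty at the start), $2 the username.
    return preg_replace(
        '/(^|\s)@(\w+)/',
        '$1<a href="' . $base . '$2">@$2</a>',
        $text);
}

echo linkMentions('mail joe@example.com or ping @joe'), "\n";
```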



Works perfect :D Thanks!




Your hashtag regexp doesn’t match characters that are not English.
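One possible fix is to switch the expression to Unicode mode with the /u modifier and match Unicode letters and digits instead of \w. The sketch below assumes the tweet text is still UTF-8 at this point, which means escaping it with htmlspecialchars rather than htmlentities (htmlentities turns accented letters into named entities such as &ntilde; before the regexp ever sees them). The function name and the https://example.com/search?q= URL are placeholders.

```php
<?php
// Hypothetical helper: Unicode-aware hashtag linking.
function linkUnicodeHashtags($text, $searchBase = 'https://example.com/search?q=') {
    return preg_replace_callback(
        '/(^|\s)#([\p{L}\p{N}_]+)/u',          // any letter or digit, any script
        function ($m) use ($searchBase) {
            // Percent-encode the tag (including '#') for the query string.
            $query = rawurlencode('#' . $m[2]);
            return $m[1] . '<a href="' . $searchBase . $query . '">#' . $m[2] . '</a>';
        },
        $text);
}

echo linkUnicodeHashtags("hola #espa\xC3\xB1ol"), "\n"; // \xC3\xB1 is UTF-8 for ñ
```

Because the pattern also accepts the start of the string, a hashtag at the very front of the tweet is linked too.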



Thanks for your tutorial, I am trying to do the same but using jQuery. Does anybody know how to parse Twitter text using jQuery with href element?



Begin with

$stringText = preg_replace('@(https?://([-\w\.]+)+(/([\w/_\.]*(\?\S+)?(#\S+)?)?)?)@', '<a href="$1" rel="nofollow">$1</a>', $stringText);
$stringText = preg_replace('/@(\w+)/', '<a href="$1" rel="nofollow">@$1</a>', $stringText);
$stringText = preg_replace('/\#(\w+)/', ' <a href="$1" rel="nofollow">#$1</a>', $stringText);

Because if you have an #hashtag at the beginning…



I was looking for a way to embed and properly mark-up tweets on our club’s website. This blows away the wacky loop I created.



Do you have a way in the regex for the URLs to ensure that it opens in a new tab? Thanks for the great code!
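The regexp itself does not need to change for that; the target attribute goes in the replacement string. A minimal sketch, where linkifyNewTab() is a made-up name and rel="noopener" is my own addition (it keeps the opened page from reaching back to yours through window.opener):

```php
<?php
// Hypothetical helper: same link regexp as the post, with target="_blank"
// added to the generated anchor.
function linkifyNewTab($text) {
    return preg_replace(
        '@(https?://([-\w\.]+)+(/([\w/_\.]*(\?\S+)?(#\S+)?)?)?)@',
        '<a href="$1" target="_blank" rel="noopener nofollow">$1</a>',
        $text);
}

echo linkifyNewTab('go https://example.com/a'), "\n";
```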



Thanks for that tutorial, it was very helpful!



Twitter has deprecated the $user.json?count=$num usage. I am trying to figure out how it works now, but it is different and it changed around October. I’ve been using what you posted here for some time, but it stopped working and I thought I would let you know.



I am curious to find out what blog platform you have been using?

I’m experiencing some minor security problems with my latest site and I’d like to find something more secure. Do you have any recommendations?

© 2021