<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Saturnboy &#187; php</title>
	<atom:link href="http://saturnboy.com/tag/php/feed/" rel="self" type="application/rss+xml" />
	<link>http://saturnboy.com</link>
	<description>Code, Work, and Life</description>
	<lastBuildDate>Thu, 01 Mar 2012 22:35:28 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.2.1</generator>
		<item>
		<title>Perfect Gradients for Perfect Buttons</title>
		<link>http://saturnboy.com/2010/05/perfect-gradients-perfect-buttons/</link>
		<comments>http://saturnboy.com/2010/05/perfect-gradients-perfect-buttons/#comments</comments>
		<pubDate>Tue, 11 May 2010 03:21:02 +0000</pubDate>
		<dc:creator>justin</dc:creator>
				<category><![CDATA[Code]]></category>
		<category><![CDATA[design]]></category>
		<category><![CDATA[iPhone]]></category>
		<category><![CDATA[php]]></category>

		<guid isPermaLink="false">http://saturnboy.com/?p=1334</guid>
		<description><![CDATA[Photoshop does this annoying thing where purely vertical gradients have some horizontal variation. Yes, it&#8217;s usually only plus or minus one bit of color, but it offends! I&#8217;ve battled Photoshop for a while on this, but I just can&#8217;t seem to get exactly what I want out of it. So to make a perfect gradient, [...]]]></description>
			<content:encoded><![CDATA[<p>Photoshop does this annoying thing where purely vertical gradients have some horizontal variation.  Yes, it&#8217;s usually only plus or minus one bit of color, but it offends!  I&#8217;ve battled Photoshop for a while on this, but I just can&#8217;t seem to get <b>exactly</b> what I want out of it.  So to make a perfect gradient, I decided to write some code.  The requirements are simple: given a starting color and a set of deltas, output a perfect gradient.</p>
<p class="bottom">Here are some quick examples:</p>
<div class="span-14 last">
<div class="span-1 comm-idx">1</div>
<div class="span-3">
<img src="http://saturnboy.com/proj/php/perfect_gradient/gradient1.png" alt="gradient1" title="gradient1" width="60" height="100" />
</div>
<div class="prepend-1 span-1 comm-idx">2</div>
<div class="span-3">
<img src="http://saturnboy.com/proj/php/perfect_gradient/gradient2.png" alt="gradient2" title="gradient2" width="60" height="100" />
</div>
<div class="prepend-1 span-1 comm-idx">3</div>
<div class="span-3 last">
<img src="http://saturnboy.com/proj/php/perfect_gradient/gradient3.png" alt="gradient3" title="gradient3" width="60" height="100" />
</div>
</div>
<div class="span-14 last">
<div class="prepend-1 span-3"><b>#000000</b><br />4, 1, 0.25</div>
<div class="prepend-2 span-3"><b>#eeeeff</b><br />-2.2, -1, -0.3</div>
<div class="prepend-2 span-3 last"><b>#ff0099</b><br />-1, 0, 1</div>
</div>
<div class="span-14 last">&nbsp;</div>
<p class="bottom">If we zoom in on example #1, which starts with black (#000000) and has deltas of 4, 1, 0.25, we see the following:</p>
<div class="prepend-1 span-13 last">
<img src="http://saturnboy.com/proj/php/perfect_gradient/gradient-diagram.png" alt="zoomed gradient" title="zoomed gradient" width="252" height="200" />
</div>
<div class="span-14 last">&nbsp;</div>
<p>The diagram shows the first ten rows of the gradient.  The delta values are accumulated with each row, and only the whole part of the resulting color value is used (that is, I take the <b>floor</b> of each color channel).  So in this example, the fractional delta of 0.25 adds exactly one to the blue channel every four rows.  Ahhh, perfect!</p>
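<p class="bottom">To make the accumulation concrete, here is a minimal sketch that lists the row colors for example #1.  The variable names are only illustrative and are not taken from <code>gradient.php</code>:</p>

<div class="wp_syntax"><div class="code"><pre class="php" style="font-family:monospace;">// Sketch only: list the first ten row colors for example #1.
$color = array(0, 0, 0);      // starting color #000000
$delta = array(4, 1, 0.25);   // per-row deltas for red, green, blue
$acc   = array(0, 0, 0);      // accumulated delta so far
$rows  = array();

for ($y = 0; $y &lt; 10; $y++) {
    $row = array();
    for ($i = 0; $i &lt; 3; $i++) {
        // keep only the whole part of the accumulated value
        $row[] = (int) floor($color[$i] + $acc[$i]);
        $acc[$i] += $delta[$i];
    }
    $rows[] = $row;
    echo $y . ': ' . implode(', ', $row) . "\n";
}</pre></div></div>

<p>Row 4 comes out as 16, 4, 1 and row 9 as 36, 9, 2, so blue steps up by one at rows 4 and 8.</p>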
<h3>The Code</h3>
<p class="bottom">No need to <a href="http://saturnboy.com/2010/04/the-schizophrenic-programmer/">use some fancy new language</a>: I wrote a simple PHP program that handles command-line input and outputs a perfect PNG gradient.  The interesting part is the function that generates and saves the gradient:</p>

<div class="wp_syntax"><div class="code"><pre class="php" style="font-family:monospace;"><span style="color: #000000; font-weight: bold;">function</span> build_image<span style="color: #009900;">&#40;</span><span style="color: #000088;">$filename</span><span style="color: #339933;">,</span> <span style="color: #000088;">$w</span><span style="color: #339933;">,</span> <span style="color: #000088;">$h</span><span style="color: #339933;">,</span> <span style="color: #000088;">$color</span><span style="color: #339933;">,</span> <span style="color: #000088;">$delta</span><span style="color: #009900;">&#41;</span> <span style="color: #009900;">&#123;</span>
  <span style="color: #000088;">$img</span> <span style="color: #339933;">=</span> <span style="color: #990000;">imagecreatetruecolor</span><span style="color: #009900;">&#40;</span><span style="color: #000088;">$w</span><span style="color: #339933;">,</span> <span style="color: #000088;">$h</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
&nbsp;
  <span style="color: #000088;">$c</span> <span style="color: #339933;">=</span> <span style="color: #990000;">imagecolorallocate</span><span style="color: #009900;">&#40;</span><span style="color: #000088;">$img</span><span style="color: #339933;">,</span> <span style="color: #000088;">$color</span><span style="color: #009900;">&#91;</span><span style="color: #cc66cc;">0</span><span style="color: #009900;">&#93;</span><span style="color: #339933;">,</span> <span style="color: #000088;">$color</span><span style="color: #009900;">&#91;</span><span style="color: #cc66cc;">1</span><span style="color: #009900;">&#93;</span><span style="color: #339933;">,</span> <span style="color: #000088;">$color</span><span style="color: #009900;">&#91;</span><span style="color: #cc66cc;">2</span><span style="color: #009900;">&#93;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
  <span style="color: #000088;">$d</span> <span style="color: #339933;">=</span> <span style="color: #000088;">$delta</span><span style="color: #339933;">;</span>
&nbsp;
  <span style="color: #b1b100;">for</span> <span style="color: #009900;">&#40;</span><span style="color: #000088;">$y</span> <span style="color: #339933;">=</span> <span style="color: #cc66cc;">0</span><span style="color: #339933;">;</span> <span style="color: #000088;">$y</span> <span style="color: #339933;">&lt;</span> <span style="color: #000088;">$h</span><span style="color: #339933;">;</span> <span style="color: #000088;">$y</span><span style="color: #339933;">++</span><span style="color: #009900;">&#41;</span> <span style="color: #009900;">&#123;</span>
    <span style="color: #990000;">imagefilledrectangle</span><span style="color: #009900;">&#40;</span><span style="color: #000088;">$img</span><span style="color: #339933;">,</span> <span style="color: #cc66cc;">0</span><span style="color: #339933;">,</span> <span style="color: #000088;">$y</span><span style="color: #339933;">,</span> <span style="color: #000088;">$w</span> <span style="color: #339933;">-</span> <span style="color: #cc66cc;">1</span><span style="color: #339933;">,</span> <span style="color: #000088;">$y</span><span style="color: #339933;">,</span> <span style="color: #000088;">$c</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
&nbsp;
    <span style="color: #000088;">$c</span> <span style="color: #339933;">=</span> <span style="color: #990000;">imagecolorallocate</span><span style="color: #009900;">&#40;</span><span style="color: #000088;">$img</span><span style="color: #339933;">,</span>
      clamp<span style="color: #009900;">&#40;</span><span style="color: #990000;">floor</span><span style="color: #009900;">&#40;</span><span style="color: #000088;">$color</span><span style="color: #009900;">&#91;</span><span style="color: #cc66cc;">0</span><span style="color: #009900;">&#93;</span> <span style="color: #339933;">+</span> <span style="color: #000088;">$d</span><span style="color: #009900;">&#91;</span><span style="color: #cc66cc;">0</span><span style="color: #009900;">&#93;</span><span style="color: #009900;">&#41;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">,</span>
      clamp<span style="color: #009900;">&#40;</span><span style="color: #990000;">floor</span><span style="color: #009900;">&#40;</span><span style="color: #000088;">$color</span><span style="color: #009900;">&#91;</span><span style="color: #cc66cc;">1</span><span style="color: #009900;">&#93;</span> <span style="color: #339933;">+</span> <span style="color: #000088;">$d</span><span style="color: #009900;">&#91;</span><span style="color: #cc66cc;">1</span><span style="color: #009900;">&#93;</span><span style="color: #009900;">&#41;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">,</span>
      clamp<span style="color: #009900;">&#40;</span><span style="color: #990000;">floor</span><span style="color: #009900;">&#40;</span><span style="color: #000088;">$color</span><span style="color: #009900;">&#91;</span><span style="color: #cc66cc;">2</span><span style="color: #009900;">&#93;</span> <span style="color: #339933;">+</span> <span style="color: #000088;">$d</span><span style="color: #009900;">&#91;</span><span style="color: #cc66cc;">2</span><span style="color: #009900;">&#93;</span><span style="color: #009900;">&#41;</span><span style="color: #009900;">&#41;</span>
    <span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
&nbsp;
    <span style="color: #000088;">$d</span> <span style="color: #339933;">=</span> <span style="color: #990000;">array</span><span style="color: #009900;">&#40;</span><span style="color: #000088;">$d</span><span style="color: #009900;">&#91;</span><span style="color: #cc66cc;">0</span><span style="color: #009900;">&#93;</span> <span style="color: #339933;">+</span> <span style="color: #000088;">$delta</span><span style="color: #009900;">&#91;</span><span style="color: #cc66cc;">0</span><span style="color: #009900;">&#93;</span><span style="color: #339933;">,</span> <span style="color: #000088;">$d</span><span style="color: #009900;">&#91;</span><span style="color: #cc66cc;">1</span><span style="color: #009900;">&#93;</span> <span style="color: #339933;">+</span> <span style="color: #000088;">$delta</span><span style="color: #009900;">&#91;</span><span style="color: #cc66cc;">1</span><span style="color: #009900;">&#93;</span><span style="color: #339933;">,</span> <span style="color: #000088;">$d</span><span style="color: #009900;">&#91;</span><span style="color: #cc66cc;">2</span><span style="color: #009900;">&#93;</span> <span style="color: #339933;">+</span> <span style="color: #000088;">$delta</span><span style="color: #009900;">&#91;</span><span style="color: #cc66cc;">2</span><span style="color: #009900;">&#93;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
  <span style="color: #009900;">&#125;</span>
&nbsp;
  <span style="color: #990000;">imagepng</span><span style="color: #009900;">&#40;</span><span style="color: #000088;">$img</span><span style="color: #339933;">,</span> <span style="color: #000088;">$filename</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
  <span style="color: #990000;">imagedestroy</span><span style="color: #009900;">&#40;</span><span style="color: #000088;">$img</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
<span style="color: #009900;">&#125;</span></pre></div></div>

<p>The code is straightforward.  First, create the image via <code>imagecreatetruecolor()</code>.  Then, beginning with the starting color, draw a one-pixel-tall rectangle for each row of the image.  The next row&#8217;s color is computed in each iteration by adding the accumulated delta to the starting color.  Finally, output the image as a PNG via <code>imagepng()</code> and free the memory.  The complete PHP source can be downloaded <a href="http://saturnboy.com/proj/php/perfect_gradient/gradient.php.gz">here</a>.</p>
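<p class="bottom">The listing calls a <code>clamp()</code> helper that is not shown above.  A minimal sketch, assuming it simply pins a channel value to the valid 0 to 255 range (the helper in the downloadable source may differ):</p>

<div class="wp_syntax"><div class="code"><pre class="php" style="font-family:monospace;">// Hypothetical stand-in for the clamp() used in build_image();
// assumed to pin a channel value to the valid 0..255 range.
function clamp($value, $min = 0, $max = 255) {
    return (int) max($min, min($max, $value));
}</pre></div></div>

<p>With this version, a negative accumulated value becomes 0 and anything above 255 becomes 255, so a long gradient simply flattens out instead of wrapping around.</p>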
<h3>Button Time</h3>
<p>Once we have our perfect gradient engine in place, it&#8217;s time to make some perfect buttons.  To achieve the standard <i>glass button</i> look-and-feel, I typically fuse two gradients together: light on the top, dark on the bottom.</p>
<p class="bottom">Here are the two halves of a pretty red button, along with their starting color and deltas:</p>
<div class="span-14 last">
<div class="prepend-1 span-3 quiet">TOP</div>
<div class="prepend-2 span-3 last quiet">BOTTOM</div>
</div>
<div class="span-14 last">
<div class="prepend-1 span-3">
<img src="http://saturnboy.com/proj/php/perfect_gradient/red-top.png" alt="top" title="top" width="100" height="16" />
</div>
<div class="prepend-2 span-3 last">
<img src="http://saturnboy.com/proj/php/perfect_gradient/red-bottom.png" alt="bottom" title="bottom" width="100" height="16" />
</div>
</div>
<div class="span-14 last">
<div class="prepend-1 span-3"><b>#ff8080</b><br />-3,-3,-3</div>
<div class="prepend-2 span-3 last"><b>#d23c3c</b><br />-3,-3,-3</div>
</div>
<div class="span-14 last">&nbsp;</div>
<p class="bottom">And the two command-line invocations of <code>gradient.php</code> to create the gradients:</p>

<div class="wp_syntax"><div class="code"><pre class="bash" style="font-family:monospace;">php gradient.php 100x16 ff8080 -<span style="color: #000000;">3</span>,-<span style="color: #000000;">3</span>,-<span style="color: #000000;">3</span> top.png
php gradient.php 100x16 d23c3c -<span style="color: #000000;">3</span>,-<span style="color: #000000;">3</span>,-<span style="color: #000000;">3</span> bottom.png</pre></div></div>
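
<p class="bottom">For readers wondering how arguments like these could be turned into the arrays that <code>build_image()</code> expects, here is a rough sketch.  The three helper names are hypothetical, and the argument handling in the real <code>gradient.php</code> may well differ:</p>

<div class="wp_syntax"><div class="code"><pre class="php" style="font-family:monospace;">// Hypothetical helpers, not taken from gradient.php.

// "100x16" becomes array(100, 16)
function parse_size($arg) {
    return array_map('intval', explode('x', strtolower($arg)));
}

// "ff8080" becomes array(255, 128, 128)
function parse_color($arg) {
    return array_map('hexdec', str_split(ltrim($arg, '#'), 2));
}

// "-3,-3,-3" becomes array(-3.0, -3.0, -3.0)
function parse_delta($arg) {
    return array_map('floatval', explode(',', $arg));
}

list($w, $h) = parse_size('100x16');
$color = parse_color('ff8080');
$delta = parse_delta('-3,-3,-3');</pre></div></div>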

<p>If I want my buttons to be sexy, rounded corners are a must.  My favorite Photoshop trick to create multiple rounded buttons is to use a rounded alpha-transparent button with each gradient as a clipping mask.  Using a clipping mask is a simple way to guarantee button geometry remains fixed while colors are changed.</p>
<p class="bottom">Here is the layers pane showing the two gradients fused together and used as a clipping mask for the rounded alpha-transparent button:</p>
<div class="prepend-1 span-13 last">
<img src="http://saturnboy.com/proj/php/perfect_gradient/layers-dialog.png" alt="clip mask" title="clip mask" width="220" height="177" />
</div>
<div class="span-14 last">&nbsp;</div>
<p class="bottom">The result is a horizontally stretchable gradient button that doesn&#8217;t look half bad.  See for yourself:</p>
<div class="prepend-1 span-13 last">
<img src="http://saturnboy.com/proj/php/perfect_gradient/btn-red.png" alt="button" title="button" width="26" height="32" />
</div>
<div class="span-14 last">&nbsp;</div>
<h3>Custom UIButton</h3>
<p class="bottom">The final button asset can be used as desired, but here is a simple Objective-C example since I&#8217;ve been in iPhone world lately:</p>

<div class="wp_syntax"><div class="code"><pre class="objc" style="font-family:monospace;">UIButton <span style="color: #002200;">*</span>btn <span style="color: #002200;">=</span> <span style="color: #002200;">&#91;</span>UIButton buttonWithType<span style="color: #002200;">:</span>UIButtonTypeCustom<span style="color: #002200;">&#93;</span>;
<span style="color: #002200;">&#91;</span>btn setFrame<span style="color: #002200;">:</span>CGRectMake<span style="color: #002200;">&#40;</span><span style="color: #2400d9;">20</span>, <span style="color: #2400d9;">20</span>, <span style="color: #2400d9;">140</span>, <span style="color: #2400d9;">32</span><span style="color: #002200;">&#41;</span><span style="color: #002200;">&#93;</span>;
<span style="color: #002200;">&#91;</span>btn setBackgroundImage<span style="color: #002200;">:</span><span style="color: #002200;">&#91;</span><span style="color: #002200;">&#91;</span>UIImage imageNamed<span style="color: #002200;">:</span><span style="color: #bf1d1a;">@</span><span style="color: #bf1d1a;">&quot;btn-red.png&quot;</span><span style="color: #002200;">&#93;</span>
      stretchableImageWithLeftCapWidth<span style="color: #002200;">:</span><span style="color: #2400d9;">10.0</span>
      topCapHeight<span style="color: #002200;">:</span><span style="color: #2400d9;">0.0</span><span style="color: #002200;">&#93;</span> forState<span style="color: #002200;">:</span>UIControlStateNormal<span style="color: #002200;">&#93;</span>;
<span style="color: #002200;">&#91;</span>btn setTitle<span style="color: #002200;">:</span><span style="color: #bf1d1a;">@</span><span style="color: #bf1d1a;">&quot;BUTTON&quot;</span> forState<span style="color: #002200;">:</span>UIControlStateNormal<span style="color: #002200;">&#93;</span>;
<span style="color: #002200;">&#91;</span>btn setTitleColor<span style="color: #002200;">:</span><span style="color: #002200;">&#91;</span>UIColor whiteColor<span style="color: #002200;">&#93;</span> forState<span style="color: #002200;">:</span>UIControlStateNormal<span style="color: #002200;">&#93;</span>;
<span style="color: #002200;">&#91;</span>btn.titleLabel setFont<span style="color: #002200;">:</span><span style="color: #002200;">&#91;</span>UIFont boldSystemFontOfSize<span style="color: #002200;">:</span><span style="color: #2400d9;">14</span><span style="color: #002200;">&#93;</span><span style="color: #002200;">&#93;</span>;</pre></div></div>

<p>Create a new <code>UIButton</code> of type <code>UIButtonTypeCustom</code> and then set the button skin as the <code>backgroundImage</code>.  The horizontal stretchability comes from <code>stretchableImageWithLeftCapWidth:topCapHeight:</code>: the 10-pixel left cap (and the matching right cap) is left unscaled, so the rounded corners stay intact while the middle stretches.</p>
<p class="bottom">Here is a screenshot from the iPhone simulator showing the button in action:</p>
<div class="prepend-1 span-13 last">
<img src="http://saturnboy.com/proj/php/perfect_gradient/screenshot.png" alt="screenshot" title="screenshot" width="326" height="486" />
</div>
<div class="span-14 last">&nbsp;</div>
<h5>Files</h5>
<ul>
<li><a href="http://saturnboy.com/proj/php/perfect_gradient/gradient.php.gz">gradient.php</a> &ndash; the perfect gradient engine</li>
<li><a href="http://saturnboy.com/proj/php/perfect_gradient/gradient-button.psd">gradient-button.psd</a> &ndash; the Photoshop source for the red button image, including the rounded alpha-transparent button and fused red gradients</li>
<li><a href="http://saturnboy.com/proj/php/perfect_gradient/btn-red.png">btn-red.png</a> &ndash; the red button image</li>
</ul>
]]></content:encoded>
			<wfw:commentRss>http://saturnboy.com/2010/05/perfect-gradients-perfect-buttons/feed/</wfw:commentRss>
		<slash:comments>6</slash:comments>
		</item>
		<item>
		<title>Scraping Google Groups</title>
		<link>http://saturnboy.com/2010/03/scraping-google-groups/</link>
		<comments>http://saturnboy.com/2010/03/scraping-google-groups/#comments</comments>
		<pubDate>Mon, 29 Mar 2010 03:31:31 +0000</pubDate>
		<dc:creator>justin</dc:creator>
				<category><![CDATA[Code]]></category>
		<category><![CDATA[Work]]></category>
		<category><![CDATA[flexmonkey]]></category>
		<category><![CDATA[php]]></category>

		<guid isPermaLink="false">http://saturnboy.com/?p=1105</guid>
		<description><![CDATA[When we launched the new and improved Gorilla Logic website, we decided to bring all our open source projects together under one roof. In order to migrate all things FlexMonkey back to our website, we need to get our forum data migrated out of Google Groups. Alas, Google doesn&#8217;t provide any way to export data [...]]]></description>
			<content:encoded><![CDATA[<p>When we launched the new and improved <a href="http://www.gorillalogic.com/">Gorilla Logic</a> website, we decided to bring all our open source projects together under one roof.  In order to migrate all things <a href="http://www.gorillalogic.com/flexmonkey">FlexMonkey</a> back to our website, we needed to get our forum data migrated out of Google Groups.  Alas, Google doesn&#8217;t provide any way to export data from Google Groups.  The only way to preserve the amazing contributions from the FlexMonkey community was to scrape Google Groups.  So that&#8217;s just what we did.</p>
<p>With a very minimal amount of PHP, I was able to walk the entire FlexMonkey Google Group and scrape all the topics (aka threads) and all the posts inside each thread.  The first step was to build a generic scraper class that grabs an HTML page (using <a href="http://curl.haxx.se/">cURL</a>) and parses out all unique outbound links.</p>
<p class="bottom">Here&#8217;s the code for the <code>Scraper</code> class:</p>

<div class="wp_syntax"><div class="code"><pre class="php" style="font-family:monospace;"><span style="color: #000000; font-weight: bold;">class</span> Scraper <span style="color: #009900;">&#123;</span>
    <span style="color: #000000; font-weight: bold;">private</span> <span style="color: #000088;">$url</span> <span style="color: #339933;">=</span> <span style="color: #0000ff;">''</span><span style="color: #339933;">;</span>
    <span style="color: #000000; font-weight: bold;">public</span> <span style="color: #000088;">$html</span> <span style="color: #339933;">=</span> <span style="color: #0000ff;">''</span><span style="color: #339933;">;</span>
    <span style="color: #000000; font-weight: bold;">public</span> <span style="color: #000088;">$links</span> <span style="color: #339933;">=</span> <span style="color: #990000;">array</span><span style="color: #009900;">&#40;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
&nbsp;
    <span style="color: #000000; font-weight: bold;">public</span> <span style="color: #000000; font-weight: bold;">function</span> __construct<span style="color: #009900;">&#40;</span><span style="color: #000088;">$url</span><span style="color: #009900;">&#41;</span> <span style="color: #009900;">&#123;</span>
        <span style="color: #000088;">$this</span><span style="color: #339933;">-&gt;</span><span style="color: #004000;">url</span> <span style="color: #339933;">=</span> <span style="color: #000088;">$url</span><span style="color: #339933;">;</span>
    <span style="color: #009900;">&#125;</span>
&nbsp;
    <span style="color: #000000; font-weight: bold;">public</span> <span style="color: #000000; font-weight: bold;">function</span> run<span style="color: #009900;">&#40;</span><span style="color: #009900;">&#41;</span> <span style="color: #009900;">&#123;</span>
        <span style="color: #000088;">$this</span><span style="color: #339933;">-&gt;</span><span style="color: #004000;">html</span> <span style="color: #339933;">=</span> <span style="color: #0000ff;">''</span><span style="color: #339933;">;</span>
        <span style="color: #000088;">$this</span><span style="color: #339933;">-&gt;</span><span style="color: #004000;">links</span> <span style="color: #339933;">=</span> <span style="color: #990000;">array</span><span style="color: #009900;">&#40;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
&nbsp;
        <span style="color: #666666; font-style: italic;">//scrape url &amp; store html</span>
        <span style="color: #000088;">$ch</span> <span style="color: #339933;">=</span> <span style="color: #990000;">curl_init</span><span style="color: #009900;">&#40;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
        <span style="color: #990000;">curl_setopt</span><span style="color: #009900;">&#40;</span><span style="color: #000088;">$ch</span><span style="color: #339933;">,</span> CURLOPT_URL<span style="color: #339933;">,</span> <span style="color: #000088;">$this</span><span style="color: #339933;">-&gt;</span><span style="color: #004000;">url</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
        <span style="color: #990000;">curl_setopt</span><span style="color: #009900;">&#40;</span><span style="color: #000088;">$ch</span><span style="color: #339933;">,</span> CURLOPT_HEADER<span style="color: #339933;">,</span> <span style="color: #cc66cc;">0</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
        <span style="color: #990000;">curl_setopt</span><span style="color: #009900;">&#40;</span><span style="color: #000088;">$ch</span><span style="color: #339933;">,</span> CURLOPT_RETURNTRANSFER<span style="color: #339933;">,</span> <span style="color: #009900; font-weight: bold;">true</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
        <span style="color: #000088;">$this</span><span style="color: #339933;">-&gt;</span><span style="color: #004000;">html</span> <span style="color: #339933;">=</span> <span style="color: #990000;">curl_exec</span><span style="color: #009900;">&#40;</span><span style="color: #000088;">$ch</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
        <span style="color: #990000;">curl_close</span><span style="color: #009900;">&#40;</span><span style="color: #000088;">$ch</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
&nbsp;
        <span style="color: #666666; font-style: italic;">//parse html for all links</span>
        <span style="color: #000088;">$matches</span> <span style="color: #339933;">=</span> <span style="color: #990000;">array</span><span style="color: #009900;">&#40;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
        <span style="color: #990000;">preg_match_all</span><span style="color: #009900;">&#40;</span><span style="color: #0000ff;">'#&lt;a.*?href\s*=\s*&quot;(.*?)&quot;.*?&gt;(.*?)&lt;/a&gt;#i'</span><span style="color: #339933;">,</span> <span style="color: #000088;">$this</span><span style="color: #339933;">-&gt;</span><span style="color: #004000;">html</span><span style="color: #339933;">,</span> <span style="color: #000088;">$matches</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
&nbsp;
        <span style="color: #b1b100;">if</span> <span style="color: #009900;">&#40;</span><span style="color: #000088;">$matches</span> <span style="color: #339933;">!==</span> <span style="color: #009900; font-weight: bold;">false</span> <span style="color: #339933;">&amp;&amp;</span> <span style="color: #990000;">count</span><span style="color: #009900;">&#40;</span><span style="color: #000088;">$matches</span><span style="color: #009900;">&#41;</span> <span style="color: #339933;">==</span> <span style="color: #cc66cc;">3</span><span style="color: #009900;">&#41;</span> <span style="color: #009900;">&#123;</span>
            <span style="color: #b1b100;">for</span> <span style="color: #009900;">&#40;</span><span style="color: #000088;">$i</span> <span style="color: #339933;">=</span> <span style="color: #cc66cc;">0</span><span style="color: #339933;">;</span> <span style="color: #000088;">$i</span> <span style="color: #339933;">&lt;</span> <span style="color: #990000;">count</span><span style="color: #009900;">&#40;</span><span style="color: #000088;">$matches</span><span style="color: #009900;">&#91;</span><span style="color: #cc66cc;">1</span><span style="color: #009900;">&#93;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span> <span style="color: #000088;">$i</span><span style="color: #339933;">++</span><span style="color: #009900;">&#41;</span> <span style="color: #009900;">&#123;</span>
                <span style="color: #000088;">$href</span> <span style="color: #339933;">=</span> <span style="color: #000088;">$matches</span><span style="color: #009900;">&#91;</span><span style="color: #cc66cc;">1</span><span style="color: #009900;">&#93;</span><span style="color: #009900;">&#91;</span><span style="color: #000088;">$i</span><span style="color: #009900;">&#93;</span><span style="color: #339933;">;</span>
                <span style="color: #000088;">$val</span> <span style="color: #339933;">=</span> <span style="color: #000088;">$matches</span><span style="color: #009900;">&#91;</span><span style="color: #cc66cc;">2</span><span style="color: #009900;">&#93;</span><span style="color: #009900;">&#91;</span><span style="color: #000088;">$i</span><span style="color: #009900;">&#93;</span><span style="color: #339933;">;</span>
&nbsp;
                <span style="color: #666666; font-style: italic;">//unique links</span>
                <span style="color: #b1b100;">if</span> <span style="color: #009900;">&#40;</span><span style="color: #339933;">!</span><span style="color: #990000;">array_key_exists</span><span style="color: #009900;">&#40;</span><span style="color: #000088;">$href</span><span style="color: #339933;">,</span> <span style="color: #000088;">$this</span><span style="color: #339933;">-&gt;</span><span style="color: #004000;">links</span><span style="color: #009900;">&#41;</span><span style="color: #009900;">&#41;</span> <span style="color: #009900;">&#123;</span>
                    <span style="color: #000088;">$this</span><span style="color: #339933;">-&gt;</span><span style="color: #004000;">links</span><span style="color: #009900;">&#91;</span><span style="color: #000088;">$href</span><span style="color: #009900;">&#93;</span> <span style="color: #339933;">=</span> <span style="color: #000088;">$val</span><span style="color: #339933;">;</span>
                <span style="color: #009900;">&#125;</span>
            <span style="color: #009900;">&#125;</span>
        <span style="color: #009900;">&#125;</span>
    <span style="color: #009900;">&#125;</span>
<span style="color: #009900;">&#125;</span></pre></div></div>

<p>In the <code>run()</code> method, cURL is used to grab the HTML.  Next, a regular expression is used to match all outbound links.  The links are stored in a hash keyed by url, so each unique url is recorded only once.</p>
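<p class="bottom">The link extraction can be tried without any network access.  The sketch below applies the same pattern to a hard-coded snippet (the sample markup is made up for illustration) and keeps the first link text seen for each url:</p>

<div class="wp_syntax"><div class="code"><pre class="php" style="font-family:monospace;">// Sketch: same pattern as Scraper::run(), applied to a made-up html string.
$html = '&lt;a href="/a"&gt;First&lt;/a&gt; &lt;a class="x" href="/b"&gt;Second&lt;/a&gt; &lt;a href="/a"&gt;Again&lt;/a&gt;';

$links = array();
if (preg_match_all('#&lt;a.*?href\s*=\s*"(.*?)".*?&gt;(.*?)&lt;/a&gt;#i', $html, $matches)) {
    foreach ($matches[1] as $i =&gt; $href) {
        // keep only the first link text seen for each url
        if (!array_key_exists($href, $links)) {
            $links[$href] = $matches[2][$i];
        }
    }
}
print_r($links);</pre></div></div>

<p>The second link to <code>/a</code> is skipped, leaving two entries: <code>/a</code> with the text First and <code>/b</code> with the text Second.</p>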
<p class="bottom">Built on top of the generic <code>Scraper</code> class is a specialized Google Groups scraper class, aptly named <code>GoogleGroupsScraper</code>.  For a given Google Group, the url of the main page (containing a list of most recent topics) is:</p>

<div class="wp_syntax"><div class="code"><pre class="xml" style="font-family:monospace;">http://groups.google.com/group/[GROUP]/topics</pre></div></div>

<p class="bottom">And the url of a single topic (aka thread) is:</p>

<div class="wp_syntax"><div class="code"><pre class="xml" style="font-family:monospace;">http://groups.google.com/group/[GROUP]/browse_thread/thread/[THREAD_ID]#</pre></div></div>

<p>Where <code>[GROUP]</code> is the name of the Google Group, and <code>[THREAD_ID]</code> is some alphanumeric id.  Most importantly, at the bottom of the main page is an <u>Older &raquo;</u> link that points to the next page of topics.  The <code>GoogleGroupsScraper</code> exploits this to spider the entire group, recording topic title and topic url as it walks each page.</p>
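<p class="bottom">The <code>GoogleGroupsScraper</code> source is not shown here, so the following is only a sketch of how the spidering could work.  The <code>find_older_link()</code> helper is hypothetical, and it assumes the link text starts with the word Older:</p>

<div class="wp_syntax"><div class="code"><pre class="php" style="font-family:monospace;">// Hypothetical helper: given Scraper-style links (url =&gt; link text),
// return the url of the "Older" link, or null on the last page.
function find_older_link(array $links) {
    foreach ($links as $href =&gt; $text) {
        if (stripos(trim(strip_tags($text)), 'Older') === 0) {
            return $href;
        }
    }
    return null;
}

// Spider loop sketch (needs network access and the Scraper class):
// $url = 'http://groups.google.com/group/[GROUP]/topics';
// while ($url !== null) {
//     $s = new Scraper($url);
//     $s-&gt;run();
//     // ...record topic titles and urls from $s-&gt;links...
//     $url = find_older_link($s-&gt;links);
// }</pre></div></div>

<p>A topic whose title happens to start with Older would fool this simple check, and a relative url would need to be resolved against the group url before the next request, so a real spider needs to be stricter.</p>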
<p>Next, each individual topic page is scraped by the <code>GoogleGroupsTopicScraper</code> class and parsed into a list of posts with author name, date, timestamp, etc.  The topic scraper uses various regular expressions to massage the HTML and extract the different parts of each post.  In particular, the post body needs a lot of work to strip out any Google Groups specific links and code.</p>
<p>Lastly, the topics and their posts are assembled into an XML document with a nice big CDATA block around the post body to preserve the html content.</p>
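<p class="bottom">One way to build such a document is PHP&#8217;s DOM extension, sketched below.  The <code>$topics</code> layout and the <code>body</code> element name are assumptions for illustration, and the original scraper may assemble its XML differently:</p>

<div class="wp_syntax"><div class="code"><pre class="php" style="font-family:monospace;">// Sketch using DOMDocument; the $topics layout and the "body"
// element name are assumptions, not taken from the scraper.
$topics = array(
    array(
        'title' =&gt; 'Example topic',
        'link'  =&gt; 'http://groups.google.com/group/[GROUP]/browse_thread/thread/[THREAD_ID]#',
        'posts' =&gt; array('&lt;p&gt;Hello &lt;b&gt;world&lt;/b&gt;&lt;/p&gt;'),
    ),
);

$doc = new DOMDocument('1.0', 'UTF-8');
$doc-&gt;formatOutput = true;
$root = $doc-&gt;appendChild($doc-&gt;createElement('scrape'));
$root-&gt;setAttribute('group', 'flexmonkey');

foreach ($topics as $t) {
    $topic = $root-&gt;appendChild($doc-&gt;createElement('topic'));
    $topic-&gt;appendChild($doc-&gt;createElement('title'))
          -&gt;appendChild($doc-&gt;createTextNode($t['title']));
    $topic-&gt;appendChild($doc-&gt;createElement('link'))
          -&gt;appendChild($doc-&gt;createTextNode($t['link']));
    $posts = $topic-&gt;appendChild($doc-&gt;createElement('posts'));

    foreach ($t['posts'] as $idx =&gt; $html) {
        $post = $posts-&gt;appendChild($doc-&gt;createElement('post'));
        $post-&gt;setAttribute('idx', (string) $idx);
        // a CDATA section keeps the post html as-is instead of entity-escaping it
        $post-&gt;appendChild($doc-&gt;createElement('body'))
             -&gt;appendChild($doc-&gt;createCDATASection($html));
    }
}

$xml = $doc-&gt;saveXML();
echo $xml;</pre></div></div>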
<p class="bottom">Here&#8217;s some sample output from the scraper:</p>

<div class="wp_syntax"><div class="code"><pre class="xml" style="font-family:monospace;"><span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;?xml</span> <span style="color: #000066;">version</span>=<span style="color: #ff0000;">&quot;1.0&quot;</span> <span style="color: #000066;">encoding</span>=<span style="color: #ff0000;">&quot;UTF-8&quot;</span><span style="color: #000000; font-weight: bold;">?&gt;</span></span>
<span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;scrape</span> <span style="color: #000066;">group</span>=<span style="color: #ff0000;">&quot;flexmonkey&quot;</span><span style="color: #000000; font-weight: bold;">&gt;</span></span>
  <span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;topic<span style="color: #000000; font-weight: bold;">&gt;</span></span></span>
    <span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;title<span style="color: #000000; font-weight: bold;">&gt;</span></span></span>FlexMonkey User Group is now located at www.gorillalogic.com/flexmonkey!<span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;/title<span style="color: #000000; font-weight: bold;">&gt;</span></span></span>
    <span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;link<span style="color: #000000; font-weight: bold;">&gt;</span></span></span>http://groups.google.com/group/flexmonkey/browse_thread/thread/fe9ed66bf56db88e#<span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;/link<span style="color: #000000; font-weight: bold;">&gt;</span></span></span>
    <span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;posts<span style="color: #000000; font-weight: bold;">&gt;</span></span></span>
      <span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;post</span> <span style="color: #000066;">idx</span>=<span style="color: #ff0000;">&quot;0&quot;</span><span style="color: #000000; font-weight: bold;">&gt;</span></span>
        <span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;author<span style="color: #000000; font-weight: bold;">&gt;</span></span></span>Stu<span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;/author<span style="color: #000000; font-weight: bold;">&gt;</span></span></span>
        <span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;email<span style="color: #000000; font-weight: bold;">&gt;</span></span></span>stu.st...@gorillalogic.com<span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;/email<span style="color: #000000; font-weight: bold;">&gt;</span></span></span>
        <span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;date<span style="color: #000000; font-weight: bold;">&gt;</span></span></span>February 10, 2010 21:17:52 UTC<span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;/date<span style="color: #000000; font-weight: bold;">&gt;</span></span></span>
        <span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;timestamp<span style="color: #000000; font-weight: bold;">&gt;</span></span></span>1265836672<span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;/timestamp<span style="color: #000000; font-weight: bold;">&gt;</span></span></span>
        <span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;body<span style="color: #000000; font-weight: bold;">&gt;</span></span></span>
<span style="color: #339933;">&lt;![CDATA[</span>
<span style="color: #339933;">&lt;p&gt;People of FlexMonkey, &lt;p&gt;We have migrated the FlexMonkey discussion forum to &lt;a href=&quot;http://www.gorillalogic.com/flexmonkey&quot;&gt;http://www.gorillalogic.com/flexmonkey&lt;/a&gt;. Please note that you will need to re-subscribe to the new forum to continue receiving FlexMonkey discussion messages. &lt;p&gt;-Stu &lt;br&gt;</span>
<span style="color: #339933;">]]&gt;</span>
        <span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;/body<span style="color: #000000; font-weight: bold;">&gt;</span></span></span>
      <span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;/post<span style="color: #000000; font-weight: bold;">&gt;</span></span></span>
    <span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;/posts<span style="color: #000000; font-weight: bold;">&gt;</span></span></span>
  <span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;/topic<span style="color: #000000; font-weight: bold;">&gt;</span></span></span>
  <span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;topic<span style="color: #000000; font-weight: bold;">&gt;</span></span></span>
    <span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;title<span style="color: #000000; font-weight: bold;">&gt;</span></span></span>Record button clicks based on Ids instead of names?<span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;/title<span style="color: #000000; font-weight: bold;">&gt;</span></span></span>
    <span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;link<span style="color: #000000; font-weight: bold;">&gt;</span></span></span>http://groups.google.com/group/flexmonkey/browse_thread/thread/4f079b1959374f53#<span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;/link<span style="color: #000000; font-weight: bold;">&gt;</span></span></span>
    <span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;posts<span style="color: #000000; font-weight: bold;">&gt;</span></span></span>
      <span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;post</span> <span style="color: #000066;">idx</span>=<span style="color: #ff0000;">&quot;0&quot;</span><span style="color: #000000; font-weight: bold;">&gt;</span></span>
        <span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;author<span style="color: #000000; font-weight: bold;">&gt;</span></span></span>Shilpa<span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;/author<span style="color: #000000; font-weight: bold;">&gt;</span></span></span>
        <span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;email<span style="color: #000000; font-weight: bold;">&gt;</span></span></span>shilpa.g...@gmail.com<span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;/email<span style="color: #000000; font-weight: bold;">&gt;</span></span></span>
        <span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;date<span style="color: #000000; font-weight: bold;">&gt;</span></span></span>February 9, 2010 23:44:44 UTC<span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;/date<span style="color: #000000; font-weight: bold;">&gt;</span></span></span>
        <span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;timestamp<span style="color: #000000; font-weight: bold;">&gt;</span></span></span>1265759084<span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;/timestamp<span style="color: #000000; font-weight: bold;">&gt;</span></span></span>
        <span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;body<span style="color: #000000; font-weight: bold;">&gt;</span></span></span>...<span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;/body<span style="color: #000000; font-weight: bold;">&gt;</span></span></span>
      <span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;/post<span style="color: #000000; font-weight: bold;">&gt;</span></span></span>
      <span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;post</span> <span style="color: #000066;">idx</span>=<span style="color: #ff0000;">&quot;1&quot;</span><span style="color: #000000; font-weight: bold;">&gt;</span></span>
        <span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;author<span style="color: #000000; font-weight: bold;">&gt;</span></span></span>Shilpa<span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;/author<span style="color: #000000; font-weight: bold;">&gt;</span></span></span>
        <span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;email<span style="color: #000000; font-weight: bold;">&gt;</span></span></span>shilpa.g...@gmail.com<span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;/email<span style="color: #000000; font-weight: bold;">&gt;</span></span></span>
        <span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;date<span style="color: #000000; font-weight: bold;">&gt;</span></span></span>February 10, 2010 00:05:44 UTC<span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;/date<span style="color: #000000; font-weight: bold;">&gt;</span></span></span>
        <span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;timestamp<span style="color: #000000; font-weight: bold;">&gt;</span></span></span>1265760344<span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;/timestamp<span style="color: #000000; font-weight: bold;">&gt;</span></span></span>
        <span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;body<span style="color: #000000; font-weight: bold;">&gt;</span></span></span>...<span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;/body<span style="color: #000000; font-weight: bold;">&gt;</span></span></span>
      <span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;/post<span style="color: #000000; font-weight: bold;">&gt;</span></span></span>
      <span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;post</span> <span style="color: #000066;">idx</span>=<span style="color: #ff0000;">&quot;2&quot;</span><span style="color: #000000; font-weight: bold;">&gt;</span></span>
        <span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;author<span style="color: #000000; font-weight: bold;">&gt;</span></span></span>Gokuldas K Pillai<span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;/author<span style="color: #000000; font-weight: bold;">&gt;</span></span></span>
        <span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;email<span style="color: #000000; font-weight: bold;">&gt;</span></span></span>gokul...@gmail.com<span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;/email<span style="color: #000000; font-weight: bold;">&gt;</span></span></span>
        <span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;date<span style="color: #000000; font-weight: bold;">&gt;</span></span></span>February 10, 2010 00:16:34 UTC<span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;/date<span style="color: #000000; font-weight: bold;">&gt;</span></span></span>
        <span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;timestamp<span style="color: #000000; font-weight: bold;">&gt;</span></span></span>1265760994<span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;/timestamp<span style="color: #000000; font-weight: bold;">&gt;</span></span></span>
        <span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;body<span style="color: #000000; font-weight: bold;">&gt;</span></span></span>...<span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;/body<span style="color: #000000; font-weight: bold;">&gt;</span></span></span>
      <span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;/post<span style="color: #000000; font-weight: bold;">&gt;</span></span></span>
      <span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;post</span> <span style="color: #000066;">idx</span>=<span style="color: #ff0000;">&quot;3&quot;</span><span style="color: #000000; font-weight: bold;">&gt;</span></span>
        <span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;author<span style="color: #000000; font-weight: bold;">&gt;</span></span></span>Shilpa<span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;/author<span style="color: #000000; font-weight: bold;">&gt;</span></span></span>
        <span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;email<span style="color: #000000; font-weight: bold;">&gt;</span></span></span>shilpa.g...@gmail.com<span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;/email<span style="color: #000000; font-weight: bold;">&gt;</span></span></span>
        <span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;date<span style="color: #000000; font-weight: bold;">&gt;</span></span></span>February 10, 2010 01:18:42 UTC<span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;/date<span style="color: #000000; font-weight: bold;">&gt;</span></span></span>
        <span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;timestamp<span style="color: #000000; font-weight: bold;">&gt;</span></span></span>1265764722<span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;/timestamp<span style="color: #000000; font-weight: bold;">&gt;</span></span></span>
        <span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;body<span style="color: #000000; font-weight: bold;">&gt;</span></span></span>...<span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;/body<span style="color: #000000; font-weight: bold;">&gt;</span></span></span>
      <span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;/post<span style="color: #000000; font-weight: bold;">&gt;</span></span></span>
...</pre></div></div>

<p class="bottom">Finally, there is a very simple PHP driver for the scraper that runs the scraping process:</p>

<div class="wp_syntax"><div class="code"><pre class="php" style="font-family:monospace;"><span style="color: #b1b100;">require_once</span><span style="color: #009900;">&#40;</span><span style="color: #0000ff;">'GoogleGroupsScraper.class.php'</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
&nbsp;
<span style="color: #000088;">$scraper</span> <span style="color: #339933;">=</span> <span style="color: #000000; font-weight: bold;">new</span> GoogleGroupsScraper<span style="color: #009900;">&#40;</span><span style="color: #0000ff;">'[GROUP]'</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
<span style="color: #000088;">$scraper</span><span style="color: #339933;">-&gt;</span><span style="color: #004000;">run</span><span style="color: #009900;">&#40;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
&nbsp;
<span style="color: #b1b100;">print</span> <span style="color: #000088;">$scraper</span><span style="color: #339933;">-&gt;</span><span style="color: #004000;">getXML</span><span style="color: #009900;">&#40;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span></pre></div></div>

<p class="bottom">And you run it as usual:</p>

<div class="wp_syntax"><div class="code"><pre class="xml" style="font-family:monospace;">php scrape.php &gt; output.xml</pre></div></div>

<p>Just enter the name of the Google Group you wish to scrape, and away you go.  Here are a couple of notes to help you along:</p>
<ol>
<li><code>[GROUP]</code> is the group name as it appears in the url, so no spaces, etc.</li>
<li>It&#8217;s not fast, so be patient, or modify the scraper code to generate some intermediate output.</li>
<li>Via a browser, Google Groups displays 30 topics per page, but via PHP &amp; cURL you only get 10.  Probably some Cookie or User Agent magic going on.</li>
<li>Not much error handling.   The error handling that exists isn&#8217;t very good. It will break.</li>
<li>Good luck!</li>
</ol>
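<p class="bottom">For notes 2 and 3, here&#8217;s a hedged sketch of two small helpers.  Both names are hypothetical and neither is in the download.  The first sends progress messages to <code>STDERR</code> so they don&#8217;t end up in the redirected XML.  The second sets a browser-like user agent and turns on cURL&#8217;s cookie handling; whether that actually changes the number of topics per page is untested:</p>

<div class="wp_syntax"><div class="code"><pre class="php" style="font-family:monospace;">&lt;?php
// Assumption: illustrative helpers only, not part of the scraper download.

// Progress goes to STDERR (available when run from the command line),
// so &quot;php scrape.php &gt; output.xml&quot; still produces clean XML.
function progress($msg) {
    fwrite(STDERR, $msg . &quot;\n&quot;);
}

// Fetch a page while looking a bit more like a browser.
// The user agent string below is just an example value.
function fetchLikeBrowser($url) {
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
    curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0 (X11; Linux x86_64) Firefox/3.6');
    // an empty string enables cURL's in-memory cookie handling
    curl_setopt($ch, CURLOPT_COOKIEFILE, '');
    $html = curl_exec($ch);
    curl_close($ch);

    return $html; // false on failure
}</pre></div></div>
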
<p>Please download the code and use it however you wish.  Hopefully, putting the code online and writing this post will save someone else some time when migrating data off Google Groups.</p>
<h5>Files</h5>
<ul>
<li><a href="http://saturnboy.com/proj/gorilla/scraper/GoogleGroupsScraper.tgz">GoogleGroupsScraper.tgz</a></li>
</ul>
]]></content:encoded>
			<wfw:commentRss>http://saturnboy.com/2010/03/scraping-google-groups/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>Parsing Twitter with RegExp</title>
		<link>http://saturnboy.com/2010/02/parsing-twitter-with-regexp/</link>
		<comments>http://saturnboy.com/2010/02/parsing-twitter-with-regexp/#comments</comments>
		<pubDate>Tue, 23 Feb 2010 11:51:02 +0000</pubDate>
		<dc:creator>justin</dc:creator>
				<category><![CDATA[Code]]></category>
		<category><![CDATA[php]]></category>
		<category><![CDATA[regexp]]></category>

		<guid isPermaLink="false">http://saturnboy.com/?p=1055</guid>
		<description><![CDATA[I needed a very simple Twitter cache for a project I&#8217;m working on. And I was very happy to trade off some realtime accuracy for reliability. In addition to caching the tweets, I also needed to pre-process them into css-able html with clickable links, usernames, and hashtags. The web had a few nice examples of [...]]]></description>
			<content:encoded><![CDATA[<p>I needed a very simple Twitter cache for a project I&#8217;m working on.  And I was very happy to trade off some realtime accuracy for reliability.  In addition to caching the tweets, I also needed to pre-process them into css-able html with clickable links, usernames, and hashtags.  The web had a <a href="http://www.simonwhatley.co.uk/parsing-twitter-usernames-hashtags-and-urls-with-coldfusion">few</a> <a href="http://www.snipe.net/2009/09/php-twitter-clickable-links/">nice</a> <a href="http://snipplr.com/view/28483/regex-to-make-twitter-links-clickable/">examples</a> of how to use regular expressions to parse the raw tweet text, but I decided to take what I liked and do the rest myself.</p>
<h5>Links</h5>
<p class="bottom">Here&#8217;s the PHP code for parsing links out of the raw tweet text:</p>

<div class="wp_syntax"><div class="code"><pre class="php" style="font-family:monospace;"><span style="color: #000088;">$text</span> <span style="color: #339933;">=</span> <span style="color: #990000;">preg_replace</span><span style="color: #009900;">&#40;</span>
    <span style="color: #0000ff;">'@(https?://([-\w\.]+)+(/([\w/_\.]*(\?\S+)?(#\S+)?)?)?)@'</span><span style="color: #339933;">,</span>
     <span style="color: #0000ff;">'&lt;a href=&quot;$1&quot;&gt;$1&lt;/a&gt;'</span><span style="color: #339933;">,</span>
    <span style="color: #000088;">$text</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span></pre></div></div>

<p>I only wanted <code>http</code> and <code>https</code> links, with an optional query part <code>(\?\S+)?</code> and an optional anchor part <code>(#\S+)?</code>.  The conversion of a text link into an html link is done using back references, which in PHP are written <code>$1</code>, <code>$2</code>, etc.  In the expression above, I use <code>$1</code> twice to put the matched link into both the <code>href</code> attribute and the link text.</p>
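<p class="bottom">As a quick check of the link pattern, here it is wrapped in a small function (the name <code>linkUrls()</code> is made up for this example).  The second call shows a limitation worth knowing about: the path part only allows word characters, slashes, underscores, and dots, so a hyphen in the path ends the match early:</p>

<div class="wp_syntax"><div class="code"><pre class="php" style="font-family:monospace;">&lt;?php
// Hypothetical wrapper around the link pattern from the post.
function linkUrls($text) {
    return preg_replace(
        '@(https?://([-\w\.]+)+(/([\w/_\.]*(\?\S+)?(#\S+)?)?)?)@',
        '&lt;a href=&quot;$1&quot;&gt;$1&lt;/a&gt;',
        $text);
}

print linkUrls('Blog Post :: http://bit.ly/cGLnaI') . &quot;\n&quot;;
// prints: Blog Post :: &lt;a href=&quot;http://bit.ly/cGLnaI&quot;&gt;http://bit.ly/cGLnaI&lt;/a&gt;

print linkUrls('http://saturnboy.com/2010/05/perfect-gradients/') . &quot;\n&quot;;
// prints: &lt;a href=&quot;http://saturnboy.com/2010/05/perfect&quot;&gt;http://saturnboy.com/2010/05/perfect&lt;/a&gt;-gradients/</pre></div></div>
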
<h5>Users</h5>
<p class="bottom">Here&#8217;s the PHP code for parsing Twitter usernames:</p>

<div class="wp_syntax"><div class="code"><pre class="php" style="font-family:monospace;"><span style="color: #000088;">$text</span> <span style="color: #339933;">=</span> <span style="color: #990000;">preg_replace</span><span style="color: #009900;">&#40;</span>
    <span style="color: #0000ff;">'/@(\w+)/'</span><span style="color: #339933;">,</span>
    <span style="color: #0000ff;">'&lt;a href=&quot;http://twitter.com/$1&quot;&gt;@$1&lt;/a&gt;'</span><span style="color: #339933;">,</span>
    <span style="color: #000088;">$text</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span></pre></div></div>

<p>Nothing special, just take the @ and all following word characters (letters, digits, and underscores), and turn it into a user link.</p>
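<p class="bottom">Here&#8217;s the username pattern in action, again wrapped in a made-up helper (<code>linkUsers()</code>).  Because the pattern doesn&#8217;t look at what comes before the @, it will also fire on email addresses, which may or may not matter for tweet text:</p>

<div class="wp_syntax"><div class="code"><pre class="php" style="font-family:monospace;">&lt;?php
// Hypothetical wrapper around the username pattern from the post.
function linkUsers($text) {
    return preg_replace(
        '/@(\w+)/',
        '&lt;a href=&quot;http://twitter.com/$1&quot;&gt;@$1&lt;/a&gt;',
        $text);
}

print linkUsers('Thanks @saturnboy!') . &quot;\n&quot;;
// prints: Thanks &lt;a href=&quot;http://twitter.com/saturnboy&quot;&gt;@saturnboy&lt;/a&gt;!

print linkUsers('mail me@example.com') . &quot;\n&quot;;
// prints: mail me&lt;a href=&quot;http://twitter.com/example&quot;&gt;@example&lt;/a&gt;.com</pre></div></div>
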
<h5>Hashtags</h5>
<p class="bottom">Here&#8217;s the PHP code for parsing Twitter hashtags:</p>

<div class="wp_syntax"><div class="code"><pre class="php" style="font-family:monospace;"><span style="color: #000088;">$text</span> <span style="color: #339933;">=</span> <span style="color: #990000;">preg_replace</span><span style="color: #009900;">&#40;</span>
    <span style="color: #0000ff;">'/\s+#(\w+)/'</span><span style="color: #339933;">,</span>
    <span style="color: #0000ff;">' &lt;a href=&quot;http://search.twitter.com/search?q=%23$1&quot;&gt;#$1&lt;/a&gt;'</span><span style="color: #339933;">,</span>
    <span style="color: #000088;">$text</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span></pre></div></div>

<p>Getting the hashtags right was the trickiest of the three.  I decided to only grab hashtags that were preceded by one or more whitespace characters.  The real magic is the <code>%23</code> in the query string, which forces a search on the complete hashtag, including the <code>#</code> part.  For example, compare a search for <a href="http://search.twitter.com/search?q=%23flex">#flex</a> to a search for <a href="http://search.twitter.com/search?q=flex">flex</a>.</p>
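<p class="bottom">A small demonstration of the hashtag pattern, using a hypothetical <code>linkHashtags()</code> wrapper.  Requiring leading whitespace means a url anchor like <code>page#top</code> is left alone, but it also means a hashtag at the very start of the text is skipped:</p>

<div class="wp_syntax"><div class="code"><pre class="php" style="font-family:monospace;">&lt;?php
// Hypothetical wrapper around the hashtag pattern from the post.
function linkHashtags($text) {
    return preg_replace(
        '/\s+#(\w+)/',
        ' &lt;a href=&quot;http://search.twitter.com/search?q=%23$1&quot;&gt;#$1&lt;/a&gt;',
        $text);
}

print linkHashtags('Learning #flex today') . &quot;\n&quot;;
// prints: Learning &lt;a href=&quot;http://search.twitter.com/search?q=%23flex&quot;&gt;#flex&lt;/a&gt; today

print linkHashtags('see http://example.com/page#top') . &quot;\n&quot;;
// prints: see http://example.com/page#top

print linkHashtags('#flex is first') . &quot;\n&quot;;
// prints: #flex is first</pre></div></div>
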
<h5>The Cache</h5>
<p>The cache is just a simple cron job that periodically queries Twitter and retrieves the latest tweets.  Most importantly, the cache fails gracefully if Twitter is inaccessible, which it does by doing exactly nothing if Twitter is down.  This guarantees that my app always has valid data (when my server is up, the cache is up too), but with the possibility that the data is a little old.</p>
<p class="bottom">Here&#8217;s the notable function in the cache:</p>

<div class="wp_syntax"><div class="code"><pre class="php" style="font-family:monospace;"><span style="color: #000000; font-weight: bold;">function</span> getTweets<span style="color: #009900;">&#40;</span><span style="color: #000088;">$user</span><span style="color: #339933;">,</span> <span style="color: #000088;">$num</span> <span style="color: #339933;">=</span> <span style="color: #cc66cc;">3</span><span style="color: #009900;">&#41;</span> <span style="color: #009900;">&#123;</span>
    <span style="color: #666666; font-style: italic;">//first, get the user's timeline</span>
    <span style="color: #000088;">$ch</span> <span style="color: #339933;">=</span> <span style="color: #990000;">curl_init</span><span style="color: #009900;">&#40;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
    <span style="color: #990000;">curl_setopt</span><span style="color: #009900;">&#40;</span><span style="color: #000088;">$ch</span><span style="color: #339933;">,</span> CURLOPT_URL<span style="color: #339933;">,</span> <span style="color: #0000ff;">&quot;http://twitter.com/statuses/user_timeline/<span style="color: #006699; font-weight: bold;">$user</span>.json?count=<span style="color: #006699; font-weight: bold;">$num</span>&quot;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
    <span style="color: #990000;">curl_setopt</span><span style="color: #009900;">&#40;</span><span style="color: #000088;">$ch</span><span style="color: #339933;">,</span> CURLOPT_RETURNTRANSFER<span style="color: #339933;">,</span> <span style="color: #009900; font-weight: bold;">true</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
    <span style="color: #000088;">$json</span> <span style="color: #339933;">=</span> <span style="color: #990000;">curl_exec</span><span style="color: #009900;">&#40;</span><span style="color: #000088;">$ch</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
    <span style="color: #990000;">curl_close</span><span style="color: #009900;">&#40;</span><span style="color: #000088;">$ch</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
&nbsp;
    <span style="color: #b1b100;">if</span> <span style="color: #009900;">&#40;</span><span style="color: #000088;">$json</span> <span style="color: #339933;">===</span> <span style="color: #009900; font-weight: bold;">false</span><span style="color: #009900;">&#41;</span> <span style="color: #009900;">&#123;</span> <span style="color: #b1b100;">return</span> <span style="color: #009900; font-weight: bold;">false</span><span style="color: #339933;">;</span> <span style="color: #009900;">&#125;</span> <span style="color: #666666; font-style: italic;">//abort on error</span>
&nbsp;
    <span style="color: #666666; font-style: italic;">//second, convert the resulting json into PHP</span>
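    <span style="color: #000088;">$result</span> <span style="color: #339933;">=</span> <span style="color: #990000;">json_decode</span><span style="color: #009900;">&#40;</span><span style="color: #000088;">$json</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
    <span style="color: #b1b100;">if</span> <span style="color: #009900;">&#40;</span><span style="color: #339933;">!</span><span style="color: #990000;">is_array</span><span style="color: #009900;">&#40;</span><span style="color: #000088;">$result</span><span style="color: #009900;">&#41;</span><span style="color: #009900;">&#41;</span> <span style="color: #009900;">&#123;</span> <span style="color: #b1b100;">return</span> <span style="color: #009900; font-weight: bold;">false</span><span style="color: #339933;">;</span> <span style="color: #009900;">&#125;</span> <span style="color: #666666; font-style: italic;">//abort if the response is not a JSON array of tweets</span>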
&nbsp;
    <span style="color: #666666; font-style: italic;">//third, build up the html output</span>
    <span style="color: #000088;">$s</span> <span style="color: #339933;">=</span> <span style="color: #0000ff;">''</span><span style="color: #339933;">;</span>
    <span style="color: #b1b100;">foreach</span> <span style="color: #009900;">&#40;</span><span style="color: #000088;">$result</span> <span style="color: #b1b100;">as</span> <span style="color: #000088;">$item</span><span style="color: #009900;">&#41;</span> <span style="color: #009900;">&#123;</span>
        <span style="color: #666666; font-style: italic;">//handle any special characters</span>
        <span style="color: #000088;">$text</span> <span style="color: #339933;">=</span> <span style="color: #990000;">htmlentities</span><span style="color: #009900;">&#40;</span><span style="color: #000088;">$item</span><span style="color: #339933;">-&gt;</span><span style="color: #004000;">text</span><span style="color: #339933;">,</span> <span style="color: #009900; font-weight: bold;">ENT_QUOTES</span><span style="color: #339933;">,</span> <span style="color: #0000ff;">'utf-8'</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
&nbsp;
        <span style="color: #666666; font-style: italic;">//build the metadata part</span>
        <span style="color: #000088;">$meta</span> <span style="color: #339933;">=</span> <span style="color: #990000;">date</span><span style="color: #009900;">&#40;</span><span style="color: #0000ff;">'g:ia M jS'</span><span style="color: #339933;">,</span> <span style="color: #990000;">strtotime</span><span style="color: #009900;">&#40;</span><span style="color: #000088;">$item</span><span style="color: #339933;">-&gt;</span><span style="color: #004000;">created_at</span><span style="color: #009900;">&#41;</span><span style="color: #009900;">&#41;</span> <span style="color: #339933;">.</span> <span style="color: #0000ff;">' from '</span> <span style="color: #339933;">.</span> <span style="color: #000088;">$item</span><span style="color: #339933;">-&gt;</span><span style="color: #004000;">source</span><span style="color: #339933;">;</span>
&nbsp;
        <span style="color: #666666; font-style: italic;">//parse the tweet text into html</span>
        <span style="color: #000088;">$text</span> <span style="color: #339933;">=</span> <span style="color: #990000;">preg_replace</span><span style="color: #009900;">&#40;</span><span style="color: #0000ff;">'@(https?://([-\w\.]+)+(/([\w/_\.]*(\?\S+)?(#\S+)?)?)?)@'</span><span style="color: #339933;">,</span> <span style="color: #0000ff;">'&lt;a href=&quot;$1&quot;&gt;$1&lt;/a&gt;'</span><span style="color: #339933;">,</span> <span style="color: #000088;">$text</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
        <span style="color: #000088;">$text</span> <span style="color: #339933;">=</span> <span style="color: #990000;">preg_replace</span><span style="color: #009900;">&#40;</span><span style="color: #0000ff;">'/@(\w+)/'</span><span style="color: #339933;">,</span> <span style="color: #0000ff;">'&lt;a href=&quot;http://twitter.com/$1&quot;&gt;@$1&lt;/a&gt;'</span><span style="color: #339933;">,</span> <span style="color: #000088;">$text</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
        <span style="color: #000088;">$text</span> <span style="color: #339933;">=</span> <span style="color: #990000;">preg_replace</span><span style="color: #009900;">&#40;</span><span style="color: #0000ff;">'/\s#(\w+)/'</span><span style="color: #339933;">,</span> <span style="color: #0000ff;">' &lt;a href=&quot;http://search.twitter.com/search?q=%23$1&quot;&gt;#$1&lt;/a&gt;'</span><span style="color: #339933;">,</span> <span style="color: #000088;">$text</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
&nbsp;
        <span style="color: #666666; font-style: italic;">//assemble everything</span>
        <span style="color: #000088;">$s</span> <span style="color: #339933;">.=</span> <span style="color: #0000ff;">'&lt;p class=&quot;tweet&quot;&gt;'</span> <span style="color: #339933;">.</span> <span style="color: #000088;">$text</span> <span style="color: #339933;">.</span> <span style="color: #0000ff;">&quot;&lt;br /&gt;<span style="color: #000099; font-weight: bold;">\n</span>&quot;</span> <span style="color: #339933;">.</span> <span style="color: #0000ff;">'&lt;span class=&quot;tweet-meta&quot;&gt;'</span> <span style="color: #339933;">.</span> <span style="color: #000088;">$meta</span> <span style="color: #339933;">.</span> <span style="color: #0000ff;">&quot;&lt;/span&gt;&lt;/p&gt;<span style="color: #000099; font-weight: bold;">\n</span>&quot;</span><span style="color: #339933;">;</span>
    <span style="color: #009900;">&#125;</span>
&nbsp;
    <span style="color: #b1b100;">return</span> <span style="color: #000088;">$s</span><span style="color: #339933;">;</span>
<span style="color: #009900;">&#125;</span></pre></div></div>

<p>First, we query the user&#8217;s JSON timeline using <a href="http://curl.haxx.se/">cURL</a>.  Second, we use PHP&#8217;s awesome json_decode function to convert the JSON into objects.  And lastly, we iterate over the tweets and parse everything into our desired HTML output.</p>
<p class="bottom">Here&#8217;s some sample output from <a href="http://twitter.com/saturnboy">my Twitter</a> feed:</p>

<div class="wp_syntax"><div class="code"><pre class="html" style="font-family:monospace;">&lt;p class=&quot;tweet&quot;&gt;Been reading Programming Goggle App Engine. Actually feeling dumber now than before I started. Too much to learn.&lt;br /&gt; 
&lt;span class=&quot;tweet-meta&quot;&gt;2:58pm Feb 14th from &lt;a href=&quot;http://www.tweetdeck.com/&quot; rel=&quot;nofollow&quot;&gt;TweetDeck&lt;/a&gt;&lt;/span&gt;&lt;/p&gt;
&nbsp;
&lt;p class=&quot;tweet&quot;&gt;Blog Post :: Async Testing with FlexUnit 4 :: &lt;a href=&quot;http://bit.ly/cGLnaI&quot;&gt;http://bit.ly/cGLnaI&lt;/a&gt;&lt;br /&gt; 
&lt;span class=&quot;tweet-meta&quot;&gt;3:33pm Feb 11th from &lt;a href=&quot;http://www.tweetdeck.com/&quot; rel=&quot;nofollow&quot;&gt;TweetDeck&lt;/a&gt;&lt;/span&gt;&lt;/p&gt;
&nbsp;
&lt;p class=&quot;tweet&quot;&gt;Blog Post :: A Better HTML Template for Flex 4 :: &lt;a href=&quot;http://bit.ly/70DLsj&quot;&gt;http://bit.ly/70DLsj&lt;/a&gt;&lt;br /&gt; 
&lt;span class=&quot;tweet-meta&quot;&gt;12:55pm Jan 25th from &lt;a href=&quot;http://www.tweetdeck.com/&quot; rel=&quot;nofollow&quot;&gt;TweetDeck&lt;/a&gt;&lt;/span&gt;&lt;/p&gt;</pre></div></div>

<p>Once I have the output, I can do whatever I want with it: save to disk, stick it in the database, keep it in memory, cache it in <a href="http://memcached.org/">memcache</a>, etc.  In my case, I wanted the simplest possible option, so I chose to write it out as a static html file.</p>
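<p class="bottom">For the static file route, a minimal sketch might look like the following.  The function name <code>writeTweetCache()</code> and the file path are assumptions made for illustration.  The idea is to leave the old file untouched when there is nothing new to write, and to swap the new file in with a rename so a reader never sees a half-written cache:</p>

<div class="wp_syntax"><div class="code"><pre class="php" style="font-family:monospace;">&lt;?php
// Hypothetical helper: $html is whatever getTweets() returned, and
// $path is an example location for the static html file.
function writeTweetCache($html, $path) {
    // nothing usable came back, so keep the previous file as-is
    if ($html === false || $html === '') {
        return false;
    }

    // write to a temp file in the same directory first
    $tmp = tempnam(dirname($path), 'tw');
    if ($tmp === false || file_put_contents($tmp, $html) === false) {
        return false;
    }

    // tempnam() creates an owner-only file, so open it up for the web server
    chmod($tmp, 0644);

    // then rename it over the old cache in one step
    return rename($tmp, $path);
}</pre></div></div>

<p>A cron script would then call something like <code>writeTweetCache(getTweets('saturnboy'), '/path/to/tweets.html')</code>, where the path is just a placeholder.</p>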
<p>The end.  The rest of the app&#8217;s not ready yet&#8230;</p>
]]></content:encoded>
			<wfw:commentRss>http://saturnboy.com/2010/02/parsing-twitter-with-regexp/feed/</wfw:commentRss>
		<slash:comments>15</slash:comments>
		</item>
	</channel>
</rss>

