When we launched the new and improved Gorilla Logic website, we decided to bring all our open source projects together under one roof. In order to migrate all things FlexMonkey back to our website, we needed to get our forum data out of Google Groups. Alas, Google doesn’t provide any way to export data from Google Groups. The only way to preserve the amazing contributions from the FlexMonkey community was to scrape Google Groups. So that’s just what we did.
With a minimal amount of PHP, I was able to walk the entire FlexMonkey Google Group and scrape all the topics (aka threads) and all the posts inside each thread. The first step was to build a generic scraper class that grabs an html page (using cURL) and parses out all unique outbound links.
Here’s the code for the Scraper class:
class Scraper {
    private $url = '';
    public $html = '';
    public $links = array();

    public function __construct($url) {
        $this->url = $url;
    }

    public function run() {
        $this->html = '';
        $this->links = array();

        //scrape url & store html
        $ch = curl_init();
        curl_setopt($ch, CURLOPT_URL, $this->url);
        curl_setopt($ch, CURLOPT_HEADER, 0);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
        $html = curl_exec($ch);
        curl_close($ch);

        //curl_exec returns false on failure; leave html and links empty
        if ($html === false) {
            return;
        }
        $this->html = $html;

        //parse html for all links: $matches[1] holds hrefs, $matches[2] holds link text
        $matches = array();
        $found = preg_match_all('#<a.*?href\s*=\s*"(.*?)".*?>(.*?)</a>#i', $this->html, $matches);
        if ($found && count($matches) == 3) {
            for ($i = 0; $i < count($matches[1]); $i++) {
                $href = $matches[1][$i];
                $val = $matches[2][$i];
                //unique links
                if (!array_key_exists($href, $this->links)) {
                    $this->links[$href] = $val;
                }
            }
        }
    }
}
In the run() method, cURL is used to grab the html. Next, a regular expression is used to match all outbound links. The links are stored in a hash keyed by url, so each unique url is recorded only once.
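To see what the link matching does without touching the network, here is a minimal sketch that applies the same regular expression to an in-memory string. The function name extractLinks and the sample markup are made up for illustration; they are not part of the downloadable code.

```php
<?php
// Hypothetical helper: the same regex as Scraper::run(), applied to a string.
function extractLinks(string $html): array
{
    $links = array();
    preg_match_all('#<a.*?href\s*=\s*"(.*?)".*?>(.*?)</a>#i', $html, $matches);
    foreach ($matches[1] as $i => $href) {
        // keep only the first link text seen for each url
        if (!array_key_exists($href, $links)) {
            $links[$href] = $matches[2][$i];
        }
    }
    return $links;
}

// Made-up sample markup, not real Google Groups html.
$sampleHtml = '<p><a href="/topics?start=10">Older</a> '
    . '<a class="x" href="/thread/abc">First topic</a> '
    . '<a href="/thread/abc">Duplicate link</a></p>';

$links = extractLinks($sampleHtml);
print_r($links);
```

The third anchor points at a url that was already seen, so only two entries come back, and the first link text wins.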
Built on top of the generic Scraper class is a specialized Google Groups scraper class, aptly named GoogleGroupsScraper. For a given Google Group, the url of the main page (containing a list of most recent topics) is:
http://groups.google.com/group/[GROUP]/topics
And the url of a single topic (aka thread) is:
http://groups.google.com/group/[GROUP]/browse_thread/thread/[THREAD_ID]#
Where [GROUP] is the name of the Google Group, and [THREAD_ID] is some alphanumeric id. Most importantly, at the bottom of the main page is an Older » link that points to the next page of topics. The GoogleGroupsScraper exploits this to spider the entire group, recording topic title and topic url as it walks each page.
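The paging idea can be sketched roughly as follows. This is not the actual GoogleGroupsScraper code: the function spiderTopics, the $fetchLinks callback, and the two matching rules (topic urls contain /browse_thread/thread/, the pager text starts with "Older") are assumptions for illustration. The callback returns links in the same href => text shape as Scraper::$links, and fake pages stand in for real scrapes.

```php
<?php
/**
 * Hypothetical sketch: walk topic list pages by following the "Older" link.
 * $fetchLinks returns an array of href => link text for a given url.
 */
function spiderTopics(string $startUrl, callable $fetchLinks, int $maxPages = 1000): array
{
    $topics = array();   // topic url => topic title
    $visited = array();  // pages already fetched, to avoid walking in circles
    $url = $startUrl;

    while ($url !== null && !isset($visited[$url]) && count($visited) < $maxPages) {
        $visited[$url] = true;
        $next = null;

        foreach ($fetchLinks($url) as $href => $text) {
            $label = trim(strip_tags($text));
            if (strpos($href, '/browse_thread/thread/') !== false) {
                // assumption: topic links contain this path segment
                if (!isset($topics[$href])) {
                    $topics[$href] = $label;
                }
            } elseif (strpos($label, 'Older') === 0) {
                // assumption: the pager link text starts with "Older"
                $next = $href;
            }
        }
        $url = $next;
    }
    return $topics;
}

// Two fake pages standing in for real scrapes.
$pages = array(
    'page1' => array('/group/demo/browse_thread/thread/aaa#' => 'Topic A', 'page2' => 'Older &raquo;'),
    'page2' => array('/group/demo/browse_thread/thread/bbb#' => '<b>Topic B</b>', 'page1' => 'Newer'),
);
$topics = spiderTopics('page1', function ($url) use ($pages) {
    return $pages[$url];
});
print_r($topics);
```

A real run would also have to resolve relative hrefs against the group url before fetching the next page, which this sketch leaves out.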
Next, each individual topic page is scraped by the GoogleGroupsTopicScraper class and parsed into a list of posts with author name, date, timestamp, etc. The topic scraper uses various regular expressions to pull the different parts of each post out of the html and clean them up. In particular, the post body needs a lot of work to strip out any Google Groups specific links and code.
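Two of these clean-up steps can be sketched in isolation. Neither function below is taken from GoogleGroupsTopicScraper; postTimestamp and stripGroupLinks are hypothetical names, and the rule "unwrap any link that points at groups.google.com" is only one plausible way to remove group-specific links. The date format is the one that appears in the sample output later in this post.

```php
<?php
// Hypothetical helper: turn a post date such as
// "February 10, 2010 21:17:52 UTC" into a Unix timestamp.
function postTimestamp(string $date): ?int
{
    $dt = DateTime::createFromFormat('F j, Y H:i:s T', $date);
    return $dt === false ? null : $dt->getTimestamp();
}

// Hypothetical helper: replace links that point at groups.google.com
// with their link text, leaving all other links untouched.
function stripGroupLinks(string $body): string
{
    return preg_replace(
        '#<a\b[^>]*href\s*=\s*"https?://groups\.google\.com/[^"]*"[^>]*>(.*?)</a>#is',
        '$1',
        $body
    );
}

$timestamp = postTimestamp('February 10, 2010 21:17:52 UTC');
echo $timestamp, "\n"; // 1265836672

$cleaned = stripGroupLinks(
    '<p>See <a href="http://groups.google.com/group/x/msg/1">this post</a> '
    . 'and <a href="http://example.com">that</a></p>'
);
echo $cleaned, "\n";
```

The group link is reduced to its text while the example.com link survives unchanged.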
Lastly, the topics and their posts are assembled into an XML document with a nice big CDATA block around the post body to preserve the html content.
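One way to assemble such a document is with PHP's DOM extension, sketched below. This is an illustration under assumptions, not the scraper's actual getXML() code: buildScrapeXml and the shape of the $topics array are invented here, and only a few of the post fields are included.

```php
<?php
// Hypothetical sketch: build the output document with DOMDocument.
function buildScrapeXml(string $group, array $topics): string
{
    $doc = new DOMDocument('1.0', 'UTF-8');
    $doc->formatOutput = true;

    // small helper: append <name>text</name>; text nodes are escaped on output
    $addText = function (DOMNode $parent, string $name, string $value) use ($doc) {
        $el = $parent->appendChild($doc->createElement($name));
        $el->appendChild($doc->createTextNode($value));
    };

    $root = $doc->createElement('scrape');
    $root->setAttribute('group', $group);
    $doc->appendChild($root);

    foreach ($topics as $topic) {
        $topicEl = $root->appendChild($doc->createElement('topic'));
        $addText($topicEl, 'title', $topic['title']);
        $addText($topicEl, 'link', $topic['link']);

        $postsEl = $topicEl->appendChild($doc->createElement('posts'));
        foreach ($topic['posts'] as $idx => $post) {
            $postEl = $doc->createElement('post');
            $postEl->setAttribute('idx', (string) $idx);
            $postsEl->appendChild($postEl);
            $addText($postEl, 'author', $post['author']);

            // the html body goes into a CDATA section so its markup is kept verbatim
            $bodyEl = $postEl->appendChild($doc->createElement('body'));
            $bodyEl->appendChild($doc->createCDATASection($post['body']));
        }
    }
    return $doc->saveXML();
}

// Made-up data for illustration.
$xml = buildScrapeXml('demo', array(
    array(
        'title' => 'Q & A',
        'link'  => 'http://groups.google.com/group/demo/browse_thread/thread/aaa#',
        'posts' => array(
            array('author' => 'Stu', 'body' => '<p>Hello <b>world</b></p>'),
        ),
    ),
));
print $xml;
```

Titles and authors are escaped as ordinary text, while the body markup passes through untouched inside the CDATA block. A body that itself contains ]]> would need extra handling, which this sketch does not attempt.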
Here’s some sample output from the scraper:
<?xml version="1.0" encoding="UTF-8"?>
<scrape group="flexmonkey">
  <topic>
    <title>FlexMonkey User Group is now located at www.gorillalogic.com/flexmonkey!</title>
    <link>http://groups.google.com/group/flexmonkey/browse_thread/thread/fe9ed66bf56db88e#</link>
    <posts>
      <post idx="0">
        <author>Stu</author>
        <email>stu.st...@gorillalogic.com</email>
        <date>February 10, 2010 21:17:52 UTC</date>
        <timestamp>1265836672</timestamp>
        <body>
          <![CDATA[
          <p>People of FlexMonkey, <p>We have migrated the FlexMonkey discussion forum to <a href="http://www.gorillalogic.com/flexmonkey">http://www.gorillalogic.com/flexmonkey</a>. Please note that you will need to re-subscribe to the new forum to continue receiving FlexMonkey discussion messages. <p>-Stu <br>
          ]]>
        </body>
      </post>
    </posts>
  </topic>
  <topic>
    <title>Record button clicks based on Ids instead of names?</title>
    <link>http://groups.google.com/group/flexmonkey/browse_thread/thread/4f079b1959374f53#</link>
    <posts>
      <post idx="0">
        <author>Shilpa</author>
        <email>shilpa.g...@gmail.com</email>
        <date>February 9, 2010 23:44:44 UTC</date>
        <timestamp>1265759084</timestamp>
        <body>...</body>
      </post>
      <post idx="1">
        <author>Shilpa</author>
        <email>shilpa.g...@gmail.com</email>
        <date>February 10, 2010 00:05:44 UTC</date>
        <timestamp>1265760344</timestamp>
        <body>...</body>
      </post>
      <post idx="2">
        <author>Gokuldas K Pillai</author>
        <email>gokul...@gmail.com</email>
        <date>February 10, 2010 00:16:34 UTC</date>
        <timestamp>1265760994</timestamp>
        <body>...</body>
      </post>
      <post idx="3">
        <author>Shilpa</author>
        <email>shilpa.g...@gmail.com</email>
        <date>February 10, 2010 01:18:42 UTC</date>
        <timestamp>1265764722</timestamp>
        <body>...</body>
      </post>
      ...
Finally, there is a very simple PHP driver for the scraper that runs the scraping process:
require_once('GoogleGroupsScraper.class.php');
$scraper = new GoogleGroupsScraper('[GROUP]');
$scraper->run();
print $scraper->getXML();
And you run it as usual:
php scrape.php > output.xml
Just enter the name of the Google Group you wish to scrape, and away you go. Here are a couple of notes to help you along:
- [GROUP] is the group name as it appears in the url, so no spaces, etc.
- It’s not fast, so be patient, or modify the scraper code to generate some intermediate output.
- Via a browser, Google Groups displays 30 topics per page, but via PHP & cURL you only get 10. Probably some Cookie or User Agent magic going on.
- Not much error handling. The error handling that exists isn’t very good. It will break.
- Good luck!
Please download the code and use it however you wish. Hopefully, putting the code online and writing this post will save someone else some time when migrating data off Google Groups.