Saturnboy
 3.28

Scraping Google Groups


When we launched the new and improved Gorilla Logic website, we decided to bring all our open source projects together under one roof. To bring all things FlexMonkey back to our website, we needed to migrate our forum data out of Google Groups. Alas, Google doesn’t provide any way to export data from Google Groups. The only way to preserve the amazing contributions from the FlexMonkey community was to scrape Google Groups. So that’s just what we did.

With a very minimal amount of PHP, I was able to walk the entire FlexMonkey Google Group and scrape all the topics (aka threads) and all the posts inside each thread. The first step was to build a generic scraper class that grabs an html page (using cURL) and parses out all unique outbound links.

Here’s the code for the Scraper class:

class Scraper {
    private $url = '';
    public $html = '';
    public $links = array();
 
    public function __construct($url) {
        $this->url = $url;
    }
 
    public function run() {
        $this->html = '';
        $this->links = array();
 
        //scrape url & store html
        $ch = curl_init();
        curl_setopt($ch, CURLOPT_URL, $this->url);
        curl_setopt($ch, CURLOPT_HEADER, 0);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
        $this->html = curl_exec($ch);
        curl_close($ch);
 
        //parse html for all links
        $matches = array();
        $found = preg_match_all('#<a.*?href\s*=\s*"(.*?)".*?>(.*?)</a>#i', $this->html, $matches);
 
        //preg_match_all() returns false on error, otherwise the number of matches
        if ($found !== false && count($matches) == 3) {
            for ($i = 0; $i < count($matches[1]); $i++) {
                $href = $matches[1][$i];
                $val = $matches[2][$i];
 
                //unique links
                if (!array_key_exists($href, $this->links)) {
                    $this->links[$href] = $val;
                }
            }
        }
    }
}

In the run() method, cURL is used to grab the html. Next, a regular expression is used to match all outbound links. The links are stored in a hash keyed by url (with the link text as the value), so each unique url is recorded only once.
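
To see what the regular expression captures, here is a minimal sketch that applies the same pattern to a literal html string instead of a live page. The extractLinks() helper and the sample html are made up for illustration and are not part of the downloadable code.

```php
<?php
// Hypothetical helper: same pattern and de-duplication as Scraper::run(),
// but applied to an html string passed in directly.
function extractLinks(string $html): array {
    $links = array();
    $matches = array();
    $found = preg_match_all('#<a.*?href\s*=\s*"(.*?)".*?>(.*?)</a>#i', $html, $matches);

    if ($found === false) {
        return $links; // regex error, nothing to report
    }

    // $matches[1] holds the href values, $matches[2] the link text
    foreach ($matches[1] as $i => $href) {
        if (!array_key_exists($href, $links)) {
            $links[$href] = $matches[2][$i];
        }
    }
    return $links;
}

// Made-up sample html: the second link repeats the first url.
$html = '<p><a href="/a">First</a> <a class="x" href="/a">Again</a> <a href="/b">Second</a></p>';
$links = extractLinks($html);
print_r($links);
```

Only the first link text seen for a url is kept, so the result here has two entries: /a (First) and /b (Second). Note that the pattern only matches href values in double quotes.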

Built on top of the generic Scraper class is a specialized Google Groups scraper class, aptly named GoogleGroupsScraper. For a given Google Group, the url of the main page (containing a list of most recent topics) is:

http://groups.google.com/group/[GROUP]/topics

And the url of a single topic (aka thread) is:

http://groups.google.com/group/[GROUP]/browse_thread/thread/[THREAD_ID]#

Where [GROUP] is the name of the Google Group, and [THREAD_ID] is some alphanumeric id. Most importantly, at the bottom of the main page is an Older » link that points to the next page of topics. The GoogleGroupsScraper exploits this to spider the entire group, recording topic title and topic url as it walks each page.
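
The spidering step can be sketched roughly as follows. This is a guess at the approach, not the actual GoogleGroupsScraper code. The splitLinks() helper is hypothetical, and it assumes a links hash shaped like the one the Scraper class produces (href => link text), that topic urls contain /browse_thread/thread/ as described above, and that the next-page link text starts with "Older". The urls in the example, including the ?start=10 query string, are invented.

```php
<?php
// Hypothetical helper: split a Scraper-style links hash (href => link text)
// into topic links and the "Older" next-page link.
function splitLinks(array $links): array {
    $topics = array();
    $older = null;

    foreach ($links as $href => $text) {
        $text = trim(strip_tags($text));

        if (strpos($href, '/browse_thread/thread/') !== false) {
            $topics[$href] = $text;          // topic url => topic title
        } elseif ($older === null && stripos($text, 'Older') === 0) {
            $older = $href;                  // link to the next page of topics
        }
    }
    return array('topics' => $topics, 'older' => $older);
}

// Made-up links, shaped like the output of Scraper::run().
$page = splitLinks(array(
    '/group/example/browse_thread/thread/abc123#' => '<b>First topic</b>',
    '/group/example/about'                        => 'About this group',
    '/group/example/topics?start=10'              => 'Older &raquo;',
));
print_r($page);
```

A spider built on this would scrape the 'older' url next and repeat until no such link is found.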

Next, each individual topic page is scraped by the GoogleGroupsTopicScraper class and parsed into a list of posts with author name, date, timestamp, etc. The topic scraper uses various regular expressions to massage the html and extract the different parts of each post. In particular, the post body needs a lot of work to strip out any Google Groups specific links and code.

Lastly, the topics and their posts are assembled into an XML document with a nice big CDATA block around the post body to preserve the html content.
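
One way to assemble a document like this is with PHP's DOM extension, sketched below. This is not necessarily how getXML() works in the downloadable code. The buildXml() function, the array layout of its input, and the sample values are assumptions for illustration; only the element names are taken from the scraper's output format.

```php
<?php
// Hypothetical sketch: build the scraper's XML layout with PHP's DOM extension.
function buildXml(string $group, array $topics): string {
    $doc = new DOMDocument('1.0', 'UTF-8');
    $doc->formatOutput = true;

    $root = $doc->createElement('scrape');
    $root->setAttribute('group', $group);
    $doc->appendChild($root);

    foreach ($topics as $topic) {
        $t = $root->appendChild($doc->createElement('topic'));
        // text nodes get &, < and > escaped when the document is serialized
        $t->appendChild($doc->createElement('title'))
          ->appendChild($doc->createTextNode($topic['title']));
        $t->appendChild($doc->createElement('link'))
          ->appendChild($doc->createTextNode($topic['link']));

        $posts = $t->appendChild($doc->createElement('posts'));
        foreach ($topic['posts'] as $idx => $post) {
            $p = $posts->appendChild($doc->createElement('post'));
            $p->setAttribute('idx', (string)$idx);
            $p->appendChild($doc->createElement('author'))
              ->appendChild($doc->createTextNode($post['author']));
            // the CDATA section keeps the post's html markup unescaped
            $p->appendChild($doc->createElement('body'))
              ->appendChild($doc->createCDATASection($post['body']));
        }
    }
    return $doc->saveXML();
}

// Made-up input data.
$xml = buildXml('example', array(array(
    'title' => 'Q & A',
    'link'  => 'http://groups.google.com/group/example/browse_thread/thread/abc123#',
    'posts' => array(array('author' => 'Ann', 'body' => '<p>Hello <br>')),
)));
print $xml;
```

The CDATA block matters because scraped post bodies are often not well-formed XML (unclosed <p> and <br> tags, for example), so they cannot be embedded as child elements.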

Here’s some sample output from the scraper:

<?xml version="1.0" encoding="UTF-8"?>
<scrape group="flexmonkey">
  <topic>
    <title>FlexMonkey User Group is now located at www.gorillalogic.com/flexmonkey!</title>
    <link>http://groups.google.com/group/flexmonkey/browse_thread/thread/fe9ed66bf56db88e#</link>
    <posts>
      <post idx="0">
        <author>Stu</author>
        <email>stu.st...@gorillalogic.com</email>
        <date>February 10, 2010 21:17:52 UTC</date>
        <timestamp>1265836672</timestamp>
        <body>
<![CDATA[
<p>People of FlexMonkey, <p>We have migrated the FlexMonkey discussion forum to <a href="http://www.gorillalogic.com/flexmonkey">http://www.gorillalogic.com/flexmonkey</a>. Please note that you will need to re-subscribe to the new forum to continue receiving FlexMonkey discussion messages. <p>-Stu <br>
]]>
        </body>
      </post>
    </posts>
  </topic>
  <topic>
    <title>Record button clicks based on Ids instead of names?</title>
    <link>http://groups.google.com/group/flexmonkey/browse_thread/thread/4f079b1959374f53#</link>
    <posts>
      <post idx="0">
        <author>Shilpa</author>
        <email>shilpa.g...@gmail.com</email>
        <date>February 9, 2010 23:44:44 UTC</date>
        <timestamp>1265759084</timestamp>
        <body>...</body>
      </post>
      <post idx="1">
        <author>Shilpa</author>
        <email>shilpa.g...@gmail.com</email>
        <date>February 10, 2010 00:05:44 UTC</date>
        <timestamp>1265760344</timestamp>
        <body>...</body>
      </post>
      <post idx="2">
        <author>Gokuldas K Pillai</author>
        <email>gokul...@gmail.com</email>
        <date>February 10, 2010 00:16:34 UTC</date>
        <timestamp>1265760994</timestamp>
        <body>...</body>
      </post>
      <post idx="3">
        <author>Shilpa</author>
        <email>shilpa.g...@gmail.com</email>
        <date>February 10, 2010 01:18:42 UTC</date>
        <timestamp>1265764722</timestamp>
        <body>...</body>
      </post>
...
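
Each post in the sample carries both a human-readable date and a Unix timestamp. The sketch below shows one way to derive the second from the first with strtotime(). The toTimestamp() helper is hypothetical, and setting the default timezone explicitly is a precaution so the script does not depend on the server's timezone configuration.

```php
<?php
// Set the default timezone explicitly rather than relying on server config.
date_default_timezone_set('UTC');

// Hypothetical helper: convert a date string in the sample's format
// to a Unix timestamp, or null if it cannot be parsed.
function toTimestamp(string $date): ?int {
    $ts = strtotime($date);
    return ($ts === false) ? null : $ts;
}

$ts = toTimestamp('February 10, 2010 21:17:52 UTC');
print $ts . "\n";                              // 1265836672, as in the first sample post
print gmdate('F j, Y H:i:s', $ts) . " UTC\n";  // back to the readable form
```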

Finally, there is a very simple PHP driver for the scraper that runs the scraping process:

require_once('GoogleGroupsScraper.class.php');
 
$scraper = new GoogleGroupsScraper('[GROUP]');
$scraper->run();
 
print $scraper->getXML();

And you run it as usual:

php scrape.php > output.xml

Just enter the name of the Google Group you wish to scrape, and away you go. Here are a couple of notes to help you along:

  1. [GROUP] is the group name as it appears in the url, so no spaces, etc.
  2. It’s not fast, so be patient, or modify the scraper code to generate some intermediate output.
  3. Via a browser, Google Groups displays 30 topics per page, but via PHP & cURL you only get 10. Probably some Cookie or User Agent magic going on.
  4. Not much error handling. The error handling that exists isn’t very good. It will break.
  5. Good luck!
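
If you want to experiment with notes 3 and 4, the sketch below shows how a fetch with a custom user agent, a cookie file, and basic error checks might look. It is not part of the downloadable code: buildCurlOptions() and fetchPage() are hypothetical helpers, the user agent string and cookie path are arbitrary example values, and whether any of this changes the number of topics Google returns per page is untested.

```php
<?php
// Hypothetical helper: cURL options for a more browser-like request.
// The user agent string is an arbitrary example value.
function buildCurlOptions(string $url, string $cookieFile): array {
    return array(
        CURLOPT_URL            => $url,
        CURLOPT_HEADER         => false,
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_FOLLOWLOCATION => true,          // follow redirects
        CURLOPT_USERAGENT      => 'Mozilla/5.0 (compatible; ExampleScraper/1.0)',
        CURLOPT_COOKIEFILE     => $cookieFile,   // send stored cookies
        CURLOPT_COOKIEJAR      => $cookieFile,   // save cookies when the handle closes
        CURLOPT_TIMEOUT        => 30,
    );
}

// Hypothetical helper: fetch a page, throwing on transport or HTTP errors.
function fetchPage(string $url, string $cookieFile): string {
    $ch = curl_init();
    curl_setopt_array($ch, buildCurlOptions($url, $cookieFile));
    $html = curl_exec($ch);
    $error = curl_error($ch);
    $status = curl_getinfo($ch, CURLINFO_HTTP_CODE);
    curl_close($ch);

    if ($html === false) {
        throw new RuntimeException("cURL failed for $url: $error");
    }
    if ($status >= 400) {
        throw new RuntimeException("HTTP $status for $url");
    }
    return $html;
}

// Inspect the options without making a network request.
$options = buildCurlOptions('http://groups.google.com/group/[GROUP]/topics', '/tmp/cookies.txt');
print $options[CURLOPT_USERAGENT] . "\n";
```

Throwing an exception on failure makes a broken fetch visible immediately instead of silently producing an empty list of links.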

Please download the code and use it however you wish. Hopefully, putting the code online and writing this post will save someone else some time when migrating data off Google Groups.

Files

Comments

10.20.2011

1

Thanks – I love it when people help liberate data!

My notion is that it would be good to scrape it into a standard data format, e.g. the ATOM syndication standard from the IETF:

http://tools.ietf.org/html/rfc4287

Then people who write forum and blog software could support imports from ATOM, and this could become a really easy thing to do.

For some archives, I guess ATOM might not be able to directly represent everything, so additions to the schema might be necessary, but I haven’t looked at that.

Would that make sense? Do you know how what you produced differs from ATOM?

10.21.2011

2

@Neal: Best would be if Google made a Google Groups API that allowed you to get forum data back out. Until they do that, everything is basically a big hack (like my scraping code posted above).


11.5.2012

3

two mistakes in code
1) “fales” instead of “false”
2) you need to add date_default_timezone_set('GMT'); on top otherwise will get a lot of timezone errors

Great piece of code man! Thanks

Raul

11.15.2012

4

Thanks for providing the code. I tried scraping a Google Group but I am getting an error and am not able to scrape the data. Could you please help in getting rid of this issue?

Anon

2.6.2013

5

Thanks a lot for releasing your work.

6.28.2013

6

we were using this for scraping forum discussions into our review collection for tosdr.org, but it seems the interface that this uses was recently retired. it now gets a redirect to the javascript-requiring interface — you can see this by adding echo $this->html; on line 27 of Scraper.class.php

will post here if i find a solution

© 2017 saturnboy.com