Here we go, first post on this blog.

I’m currently working on a project where I need to scrape Google search results, so I’ll write down some of my experiences here.

So how to start?

When scraping content from a website programmatically (here with PHP), you first need that page’s HTML. And how do you get a page’s HTML? Right, with a normal GET request. Basically this is no different from entering the URL of the page in the browser. There are multiple ways to do a GET request, but normally I just use the basic file_get_contents. There’s a lot of discussion about what to use, but I like this one because it’s a standard PHP function, so you don’t need any special extensions for it (it does need the allow_url_fopen setting to be enabled, and OpenSSL support for https URLs). Btw., you can also pass a context (see stream_context_create) to this function, which lets you set headers and so on, so it’s not that powerless. In the end it doesn’t really matter how you do the request, as long as you end up with the page’s content.

Update: Wrote a short article on how to set headers with file_get_contents
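
As a rough sketch of the request step: the two helpers below (buildContext and fetchPage are names made up for this example, as is the 10 second timeout) wrap file_get_contents and stream_context_create so that headers can be passed as a simple array. It is only one possible way to do it.

```php
<?php
// Hypothetical helper: builds a stream context carrying custom request headers.
function buildContext(array $headers, int $timeout = 10)
{
    $headerLines = array();
    foreach ($headers as $name => $value) {
        $headerLines[] = $name . ': ' . $value;
    }

    return stream_context_create(array(
        'http' => array(
            'method'  => 'GET',
            'header'  => implode("\r\n", $headerLines),
            'timeout' => $timeout, // seconds; an assumed default
        ),
    ));
}

// Hypothetical helper: returns the page HTML, or null when the request fails.
function fetchPage(string $url, array $headers = array())
{
    // @ suppresses the warning file_get_contents raises on failure;
    // the false return value is checked instead.
    $content = @file_get_contents($url, false, buildContext($headers));

    return $content === false ? null : $content;
}
```

Usage would be something like $content = fetchPage($url, array('Accept-Language' => 'en-US')); followed by a check for null before going on.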

But what do we need to request exactly?

The request should look something like this:

https://www.google.[tld]/search?[parameter-string]

tld stands for “top level domain” and specifies which Google version (country) to use. But more on localization a little later, because most of the time it’s not done with only setting the right tld.

The parameter-string can include multiple values and looks like this:

[key]=[value]&[key]=[value]&…

The most basic parameter is the “q”-parameter, which stands for query and specifies the search query. The words are separated by + signs. You can use the PHP function urlencode for this. So let’s see what a very basic request for “best restaurants” in the US version of Google (tld = .com) would look like:

https://www.google.com/search?q=best+restaurants

But like I said, that’s really very basic. You will get personalized results, and if you or your server aren’t located in the US then you most likely won’t get the US results. What to do about it?

Personalization:

To avoid personalized search results there is a parameter. It’s the “pws”-parameter, and to turn personalization off you have to set it to 0.

Localization:

For the localization there are 2 parameters. First there is the “hl”-parameter to set the language. It may not be important for keywords that are English anyway, but e.g. when you’re located in a German-speaking country, where restaurant is the same word as in English, it definitely is a factor. The second parameter is the “gl”-parameter, which specifies the country by its country code.

So our updated request URL looks like this:

https://www.google.com/search?q=best+restaurants&pws=0&hl=en&gl=us

With this you should get the US results for the keyword phrase “best restaurants”.
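
If you don’t want to glue the parameter string together by hand, PHP’s http_build_query does the encoding for you (spaces become + signs by default). The function name buildSearchUrl and its default values are just picked for this sketch:

```php
<?php
// Hypothetical helper: assembles a search URL from the parameters discussed above.
function buildSearchUrl(string $query, string $tld = 'com', array $extra = array())
{
    // Entries in $extra override the defaults and may add further parameters.
    $params = array_merge(array(
        'q'   => $query, // search query, encoded by http_build_query
        'pws' => 0,      // personalization off
        'hl'  => 'en',   // language
        'gl'  => 'us',   // country code
    ), $extra);

    // Pass "&" explicitly so the result does not depend on php.ini settings.
    return 'https://www.google.' . $tld . '/search?' . http_build_query($params, '', '&');
}

echo buildSearchUrl('best restaurants');
// https://www.google.com/search?q=best+restaurants&pws=0&hl=en&gl=us
```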

With the .com tld there can still be problems with this stupid country redirect, I think mainly if you’re located outside of the US. For this, refer to the “Country redirect” paragraph below.

Now we should have the content of the result page. But how to scrape the results?

To know what we need to get out of all that HTML code, we need to know how it is structured. You could just write the HTML to a file and then look through it, but personally I prefer using the browser’s developer tools for this.

The theory

After doing some research on this I found out that all results are in “li”-elements with the class “g”, so the first thing to do is extract all content between each <li class="g"> and its closing </li> tag. Doing that you’ll also get ads, news and other stuff which aren’t real search results, so we need to figure out a way to avoid including these. What I noticed is that all “real” search results have their descriptions in a div-container with the class “s”, so when looping through the content you can skip the results which don’t include <div class="s">. Then you just have the search results you want.

We will try to get the headline and the URL of the search results. The headline is the anchor text of the link, which also has the right URL as its href attribute. What I did was search the content of the <li> element for the opening "<a" and the closing "</a>", which gives you the whole link element with everything you want to extract. Now first search this string for href=" and extract the content up to the next " character. Now you have the link. Nearly done here. Next, search for the ">" which ends the opening "<a" tag and extract all content between this and the closing "</a>" tag, and you have the headline. Now save both these values to an array and go to the next result. Repeat until you don’t find any <li class="g"> elements anymore, and you have all the results from Google.

The practice

I will be using a while-loop to go through the content, the function strpos to get the start and end positions of the parts to extract, and substr to extract them. The condition of the while loop will also be a strpos call. strpos returns false when the string that’s being searched for doesn’t occur. It can also return 0 when the match is at the very beginning, and 0 counts as false in a plain condition, so the result is compared with !== false. The third parameter of strpos is the offset, so it starts searching from this position. Therefore I will set a variable $offs, which is 0 in the beginning and in every loop gets set to a position after the last occurrence of <li class="g">. Hope you get the idea. The result of the GET request will be saved in a variable called $content. We also set up an array $results where we can store the results afterwards. So the while loop will look like this:

$results = array();
$offs = 0;
while (($pos = strpos($content, '<li class="g">', $offs)) !== false) {
    ...
}

Now $pos contains the position of <li class="g">, but that is where the tag starts, so we have to add its length (which is 14) to get the starting position of its content. Then we search for the next </li> to get the end position of the content. Once we have both these values we can extract the content using substr.

$startPos = $pos + 14;
$endPos = strpos($content, '</li>', $startPos);
$resultContent = substr($content, $startPos, $endPos - $startPos);

The third argument of substr is the length of the content to extract. We get this by subtracting the start position from the end position. Now we have the content of the result saved in a variable, but we still don’t know if it’s a “real” result, so we need to check if it contains <div class="s"> and, if not, continue the loop. Don’t forget to set the $offs variable before that, otherwise it would loop through the same result again and again.

$offs = $endPos;
if (strpos($resultContent, '<div class="s">') === false) {
    continue;
}

Now that we are sure that the result we have is one that we want, we can start searching for the link and the headline. First let’s get the whole “a”-element and save it to a variable called $link:

$linkSPos = strpos($resultContent, '<a');
$linkEPos = strpos($resultContent, '</a>', $linkSPos) + 4;
$link = substr($resultContent, $linkSPos, $linkEPos - $linkSPos);

The URL of the result is in the href attribute of this link, so let’s get this:

$urlSPos = strpos($link, 'href="') + 6;
$urlEPos = strpos($link, '"', $urlSPos);
$url = substr($link, $urlSPos, $urlEPos - $urlSPos);

The headline is the anchor text of the link:

$hlSPos = strpos($link, '>', $urlEPos) + 1;
$hlEPos = strpos($link, '</a>', $hlSPos);
$headline = substr($link, $hlSPos, $hlEPos - $hlSPos);

You know that the keywords in the headline are bold (between <b>…</b>) so you might want to remove these tags to have the headline as a pure string. You could do this like this:

if(strpos($headline, '<b>') !== false) {
    $headline = str_replace('<b>', '', $headline);
    $headline = str_replace('</b>', '', $headline);
}

In the last line of the while-loop just add the values to the $results array:

$results[] = array("url" => $url, "headline" => $headline);

That’s it. Now you have an array with all the extracted results ready to use. The position of a result is always its index in the array + 1.
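
Here is a sketch of all the steps above wrapped into one function, so you can see how the pieces fit together. The name extractResults is made up for this example, and I added a few extra checks (comparing strpos results with false explicitly, skipping elements with a missing tag) that go beyond the snippets above. It still assumes the markup described in this article.

```php
<?php
// Sketch: returns an array of array('url' => ..., 'headline' => ...) entries.
function extractResults(string $content)
{
    $results = array();
    $open = '<li class="g">';
    $offs = 0;

    // strpos may return 0 for a match at the very start, so compare with false.
    while (($pos = strpos($content, $open, $offs)) !== false) {
        $startPos = $pos + strlen($open);
        $endPos = strpos($content, '</li>', $startPos);
        if ($endPos === false) {
            break; // unclosed element, nothing more to extract
        }
        $resultContent = substr($content, $startPos, $endPos - $startPos);
        $offs = $endPos; // continue after this result in the next round

        if (strpos($resultContent, '<div class="s">') === false) {
            continue; // not a "real" result (ad, news, ...)
        }

        // The whole link element, from "<a" up to and including "</a>".
        $linkSPos = strpos($resultContent, '<a');
        if ($linkSPos === false) {
            continue;
        }
        $linkEPos = strpos($resultContent, '</a>', $linkSPos);
        if ($linkEPos === false) {
            continue;
        }
        $link = substr($resultContent, $linkSPos, $linkEPos + 4 - $linkSPos);

        // The URL sits between href=" and the next double quote.
        $urlSPos = strpos($link, 'href="');
        if ($urlSPos === false) {
            continue;
        }
        $urlSPos += 6;
        $urlEPos = strpos($link, '"', $urlSPos);
        if ($urlEPos === false) {
            continue;
        }
        $url = substr($link, $urlSPos, $urlEPos - $urlSPos);

        // The headline is everything between the end of the opening tag and "</a>".
        $hlSPos = strpos($link, '>', $urlEPos) + 1;
        $hlEPos = strpos($link, '</a>', $hlSPos);
        if ($hlEPos === false) {
            continue;
        }
        $headline = substr($link, $hlSPos, $hlEPos - $hlSPos);
        $headline = str_replace(array('<b>', '</b>'), '', $headline);

        $results[] = array('url' => $url, 'headline' => $headline);
    }

    return $results;
}
```

Calling $results = extractResults($content); then gives you the same array as the loop above.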

Encoding and User-Agent

When getting the contents of the result page you may run into problems with the encoding of the data. I don’t want to say too much about this because I haven’t tested it very extensively and I don’t want to spread too much bullshit, but here is a little hint:

Google has 2 parameters for encoding:

  • ie = input encoding
  • oe = output encoding

E.g. if you want the output to be in UTF-8 add &oe=utf8 to your parameter string.

Also it could be that you have to decode special chars when extracting the results.
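
For decoding special chars, PHP’s html_entity_decode is one option. A small sketch (cleanText is a name I made up, and I assume the text is UTF-8) which also removes any leftover tags with strip_tags:

```php
<?php
// Hypothetical helper: turns an extracted HTML fragment into plain text.
function cleanText(string $html)
{
    // Remove tags such as <b>...</b> first, then decode entities like &amp;.
    $text = strip_tags($html);

    return html_entity_decode($text, ENT_QUOTES | ENT_HTML5, 'UTF-8');
}

echo cleanText('Tom &amp; Jerry&#39;s <b>Best</b> Restaurants');
// Tom & Jerry's Best Restaurants
```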

Another thing I haven’t tested much, but that could lead to problems, is the user agent. The HTML, and maybe even the results, may differ depending on the device the request seems to come from. Therefore it could be a good idea to set a User-Agent header in your GET request to make sure you always get back the same HTML markup. For example you could use this one:

Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/42.0.2311.90 Safari/537.36

Works for me.
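
With file_get_contents the user agent can be set through the “user_agent” option of the http context. A minimal sketch, where the function name userAgentContext is made up for this example:

```php
<?php
// Hypothetical helper: a stream context that sends the given User-Agent.
function userAgentContext(string $userAgent)
{
    return stream_context_create(array(
        'http' => array(
            'user_agent' => $userAgent, // sent as the User-Agent request header
        ),
    ));
}

$context = userAgentContext(
    'Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/42.0.2311.90 Safari/537.36'
);
// The request itself would then be: file_get_contents($url, false, $context);
```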

Get more results

This article is mainly about the first page of Google results, but I will drop some hints on how to get more results. I haven’t used them yet so I can’t go into too many details.

The first way to get more results is the easiest, as you don’t have to fetch more than one page. There is a parameter which sets how many results should be shown on one page. This parameter is “num”. So if you e.g. add &num=50 to your parameters then you’ll get more results. Try it out.

The other way would be to do multiple requests, where 1 request is 1 page. There are two ways I could think of:

  • scrape the link to the next page. There are always links to the next pages at the bottom. Try to extract those and make a request to them (not tested). Maybe think about setting the Referer header.
  • use the “start”-parameter. You can use a parameter “start” to get other pages. By default it is 10 for the second page, 20 for the third, 30 for the fourth and so on, but be careful if you also use the “num”-parameter, because then you have to set different numbers.
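
A small sketch for calculating the “start” value. It assumes that “start” is simply the zero-based index of the first result on the page, so with the “num”-parameter set it would be (page - 1) * num. I haven’t verified this beyond the default values above, and startForPage is a made-up name:

```php
<?php
// Hypothetical helper; assumes start = (page - 1) * results per page.
function startForPage(int $page, int $num = 10)
{
    if ($page < 1 || $num < 1) {
        throw new InvalidArgumentException('page and num must be 1 or greater');
    }

    return ($page - 1) * $num;
}

echo startForPage(3);
// 20
```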

Country redirect

I don’t know how much this is needed when located in the US, but I’m located in Austria and I have this problem pretty often. When using the .com tld it redirects you to your country’s version of Google. I think I figured out a little trick to avoid this:

When looking at the URL it redirected me to, I noticed this “gws_rd”-parameter set to “cr”. After doing a bit of research on this I figured out that gws_rd stands for something like “Google web server redirect” and cr for “country redirect” (I guess). So basically this tells Google that a country redirect has already been done. Now if you set this parameter yourself when using the .com tld, it doesn’t redirect you anymore. Quite useful in my opinion.

Therefore it may be a good idea to always add &gws_rd=cr to your parameters when doing a request.

Conclusion

There might be better ways to do all this, and I don’t wanna say I’m a pro in this topic, but the stuff I wrote down works for me, so I thought I’d share it. Much of this is more guidelines and hints than step-by-step instructions, but maybe somebody will find it useful. I might edit some of this from time to time or add something when I find out more.

Anyways, proud of my first ever blog post :)

Cheers,

Mo