King Mo's Blog

Coding 'n shit

Magento “Random Upsell Products” Module

What it does

This module automatically adds random upsell products to any product view. The upsell products are picked based on the product’s categories and price. It also adds a configuration option for the upsell product limit, and specific categories can be excluded from the search criteria.

How it works

When a product page is loaded and random upsell products are activated, the module first checks whether there are manually assigned upsell products and whether these already reach the upsell product limit. If not, it checks the product’s categories and whether some of them are excluded.

The remaining categories are the ones the upsell products will be picked from. Next it takes the shown product’s price and searches for products which are in those categories and have a higher price than the shown product.

If not enough products are found with these criteria, it searches again, but with only half the price as the minimum. This is repeated until enough products are found (but at most 5 times). As the last step it randomly picks products from the ones found until it reaches the upsell product limit. These products will then be shown in the product view as upsell products.
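
To illustrate the idea, here’s a minimal sketch of that price-halving retry loop in plain PHP. It works on a simple array of candidate products instead of Magento’s collections, so the function names and data structure are just made up for illustration, not the module’s actual code:

// hypothetical candidate list: each entry has a price and the category ids it belongs to
function findCandidates(array $products, array $categoryIds, $minPrice) {
    return array_filter($products, function ($p) use ($categoryIds, $minPrice) {
        return $p['price'] >= $minPrice
            && count(array_intersect($p['category_ids'], $categoryIds)) > 0;
    });
}

function pickRandomUpsells(array $products, array $categoryIds, $basePrice, $limit) {
    $minPrice   = $basePrice;
    $candidates = array();

    // halve the minimum price up to 5 times until enough candidates are found
    for ($i = 0; $i < 5; $i++) {
        $candidates = findCandidates($products, $categoryIds, $minPrice);
        if (count($candidates) >= $limit) {
            break;
        }
        $minPrice /= 2;
    }

    // randomly pick products until the upsell limit is reached
    shuffle($candidates);
    return array_slice($candidates, 0, $limit);
}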

How to get it and how much it costs

You can buy this module here:

http://magento-modules.king-mo.solutions/random-upsell-products

Price: 25€

Installation

If you bought the module you’ll have the random-upsell-products.zip downloaded.

The zip contains a second module called ConfigTab. This is just for keeping configurations organized and preventing conflicts. It also has to be installed for Random Upsell Products to work.

Now there are two ways to install the module:

app-folder:

Inside the zip file there is a folder ./app. The contents of this folder mirror exactly the paths where the module files should be copied to. Therefore you can simply copy this folder into the root folder of your Magento installation and all necessary files will end up in the correct places.

In this case no extra installation for ConfigTab is needed.

Magento Connect Manager:

Besides the ./app folder, the zip also contains a file: KingMo_RandomUpsell-1.0.0.tgz (the version number may change). You can upload this file to the Magento Connect Manager, which can be found in the admin backend under “System” -> “Magento Connect”. Inside the Manager there is a section called “Direct package file upload”. Upload the .tgz file there and the module will be installed.

In this case please do the same with the KingMo_ConfigTab-1.0.0.tgz file!

If you have any problems

If you have any problems with the download or installation of this module, please send me an email at magento-modules@king-mo.solutions.

If you need the module’s functionality extended, or even a custom module or anything else, contact me too, I’d be glad to hear from you!

Peace, Mo

 

Magento “Cookie Guest Wishlist” Module

What it does

This module enables customers to use the Magento wishlist feature without having to be logged in. They are able to add and remove products just like when being logged in. The only feature which is disabled for guests is sharing the wishlist, since I thought it wouldn’t be a good idea to let anybody send emails via your shop. If a customer logs in, the guest wishlist will be assigned to his account, and if the account already has a wishlist, they will be merged.

How it works

Every wishlist has an internal id assigned to it. Normally it also has a customer id assigned to it, so Magento knows which wishlist to load when a customer is logged in. This module makes it possible to save wishlists without an assigned customer. The id of the wishlist is then written into a cookie and saved in the guest’s browser. Therefore, when a guest returns, the id is retrieved from the cookie and the corresponding wishlist is loaded.
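
Just to illustrate the cookie round-trip, here’s a rough sketch in plain PHP rather than Magento’s cookie model; the cookie name and lifetime are assumptions, not the module’s actual code:

// after a guest creates a wishlist, remember its id in a cookie (assumed name and 30-day lifetime)
function rememberGuestWishlistId($wishlistId) {
    setcookie('guest_wishlist_id', (string) $wishlistId, time() + 30 * 24 * 3600, '/');
}

// on a later visit, read the id back so the corresponding wishlist can be loaded
function getGuestWishlistId() {
    return isset($_COOKIE['guest_wishlist_id']) ? (int) $_COOKIE['guest_wishlist_id'] : null;
}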

The module does not change anything in the look and feel of the wishlist; it just determines when and how a wishlist is loaded.

Now there could be the case that a customer has an account and already created a wishlist with this account, but comes back as a guest and creates another wishlist. In this case, as soon as he logs in again, the module will compare both wishlists, add all products added as guest which aren’t already in the customer’s wishlist to it, delete the guest wishlist and set the cookie value to the customer’s wishlist id. Therefore, from then on, the customer always has the same wishlist, whether he’s logged in or not.
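
A minimal sketch of that merge step, using plain arrays of product ids instead of Magento wishlist objects; the function name is made up and only meant to show the logic:

// $customerItems and $guestItems are arrays of product ids (an assumption for this sketch)
function mergeWishlists(array $customerItems, array $guestItems) {
    // add only those guest products that aren't already on the customer wishlist
    $missing = array_diff($guestItems, $customerItems);
    return array_merge($customerItems, $missing);
}

// example: the guest added products 3 and 5, product 3 is already on the account wishlist
$merged = mergeWishlists(array(1, 2, 3), array(3, 5)); // => 1, 2, 3, 5
// afterwards the guest wishlist would be deleted and the cookie set to the customer's wishlist id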

How to get it and how much it costs

You can buy this module here:

http://magento-modules.king-mo.solutions/cookie-guest-wishlist

Price: 25€

Installation

If you bought the module you’ll have the cookie-guest-wishlist.zip downloaded. Now there are two ways to install the module:

app-folder:

Inside the zip file there is a folder ./app. The contents of this folder mirror exactly the paths where the module files should be copied to. Therefore you can simply copy this folder into the root folder of your Magento installation and all necessary files will end up in the correct places.

Magento Connect Manager:

Besides the ./app folder, the zip also contains a file: KingMo_CookieWishlist-1.0.0.tgz (the version number may change). You can upload this file to the Magento Connect Manager, which can be found in the admin backend under “System” -> “Magento Connect”. Inside the Manager there is a section called “Direct package file upload”. Upload the .tgz file there and the module will be installed.

If you have any problems

If you have any problems with the download or installation of this module, please send me an email at magento-modules@king-mo.solutions.

If you need the module’s functionality extended, or even a custom module or anything else, contact me too, I’d be glad to hear from you!

Peace, Mo

Magento “Manually Complete Orders” Module

What it does

This plugin adds a “Complete” button to the order view and a “Complete” bulk action to the order grid in the admin backend. With it you can manually set the status of any order to complete. It also sets the amount of money paid to the total of the order, so it will be added to the statistics correctly.
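
For the curious, this is roughly the kind of calls a controller action behind such a button could make, assuming Magento 1.x and writing the state/status attributes directly; it’s only a sketch, not the module’s actual code, and it skips things like permission checks and error handling:

// sketch only, assumes a bootstrapped Magento 1.x environment; '100000123' is a made-up increment id
$order = Mage::getModel('sales/order')->loadByIncrementId('100000123');

if ($order->getId()) {
    // record the full grand total as paid so the order counts correctly in the statistics
    $order->setTotalPaid($order->getGrandTotal());
    $order->setBaseTotalPaid($order->getBaseGrandTotal());

    // write state and status directly to mark the order as complete
    $order->setData('state', Mage_Sales_Model_Order::STATE_COMPLETE);
    $order->setData('status', 'complete');

    $order->save();
}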

The problem it originally solved

One day the guy running a shop I was working on came to me and told me that they had a problem with an order being stuck in Magento. This happened because someone wanted to buy a product but canceled the order, for whatever reason, while being in the payment process. The order was completed “offline”, but in the Magento backend it had the status “Pending Payment” and there was simply no way to change that. That’s how (and why) this module was developed.

How to get it and how much it costs

You can buy this module here:

http://magento-modules.king-mo.solutions/manually-complete-orders

Price: 5€

Installation

If you bought the module you’ll have the manually-complete-orders.zip downloaded. Now there are two ways to install the module:

app-folder:

Inside the zip file there is a folder ./app. The contents of this folder mirror exactly the paths where the module files should be copied to. Therefore you can simply copy this folder into the root folder of your Magento installation and all necessary files will end up in the correct places.

Magento Connect Manager:

Besides the ./app folder, the zip also contains a file: KingMo_CompleteOrders-1.0.0.tgz (the version number may change). You can upload this file to the Magento Connect Manager, which can be found in the admin backend under “System” -> “Magento Connect”. Inside the Manager there is a section called “Direct package file upload”. Upload the .tgz file there and the module will be installed.

If you have any problems

If you have any problems with the download or installation of this module, please send me an email at magento-modules@king-mo.solutions.

If you need the module’s functionality extended, or even a custom module or anything else, contact me too, I’d be glad to hear from you!

Peace, Mo

My Magento Modules

I’m back

Well, well, it’s been a while since I wrote something on this blog. What a stressful year 2015 was... But it’s over soon and I have big plans for 2016. This blog and the whole “King Mo Solutions” thing is part of these plans, so there will be more coming soon. But telling you that isn’t really the purpose of this post, I just had the urge to write it somehow :) anyways, here’s the real post:

Me and Magento

In the past (and currently) I have been working with Magento from time to time, and therefore I dare to say that I’m quite familiar with this e-commerce system by now. And I mean the development part, not selling anything 😉

Why am I telling you that?

Well, the first thing I plan is to put some of the things I learned, especially about module development, into some blog posts. Maybe that’ll give some hints to someone who is just starting, and even if not, I’ll have my own resource to find and look up things that have a hard time staying in my memory :)

The second thing is, while working on Magento stores and adding functionality here and there, I ended up with some modules which I think could be useful to other people as well. And they were just lying around on my laptop, so I thought, why not sell them?

You might say I’m greedy ’cause I’m not giving them away for free but hey, who doesn’t want his work to be valued somehow. Also I tried to keep the prices low. So, I needed a page for selling them ’cause I wanted a separate page for this, and here it is:

http://magento-modules.king-mo.solutions

Not that beautiful I know but it’s enough for me for now :)

What modules are on this page?

For now there are three modules available to buy, which I will describe shortly here and later cover in a separate blog post each (I will link them):

Manually Complete Orders

On the last shop I worked on, one day we had a problem. Someone wanted to buy something but (don’t ask me why) canceled somewhere in the payment process. Now in the Magento backend this order had the status “Pending Payment”. That wouldn’t be much of a problem, but there was just no way to change the status. The order was completed manually (offline), but in the backend it was stuck as unpaid, and that was something which bothered the one running the shop. Understandable in my opinion, that’s somehow just dirty, for lack of a better word right now.

Well, that’s how this module came to life. It simply adds a “Complete” button to any order which lets you set the order to complete and paid, so it will also show up in the statistics. It also adds a “Complete” bulk action, so you can do this to multiple orders at once as well.

Price: 5€

You can buy it here: http://magento-modules.king-mo.solutions/manually-complete-orders

Cookie Guest Wishlist

Pretty self-explanatory, I guess. This module allows customers to use the built-in Magento wishlist without being logged in. It works by setting a cookie containing the wishlist id. If a customer who already created a wishlist while logged in also creates an “offline” wishlist as a guest, the two will be merged as soon as he logs in again.

Price: 25€

You can buy it here: http://magento-modules.king-mo.solutions/cookie-guest-wishlist

Random Upsell Products

The problem here: I wanted to use the “Upsell” feature, but there were so many products. Assigning upsell products to all of them manually would be quite some work.

The solution: Just let them be added randomly!

This module does exactly that. It automatically searches for products in the same category but with a higher price and shows them as the upsell products. If there are already upsell products assigned, they will still be shown, but random products will be added until the limit is reached. If not enough more expensive products are found, it takes half the price and searches again to make sure the upsell limit will be reached.

Another feature here is that you can set that upsell limit which by default is set by the theme and can’t be changed in the backend.

It also adds a configuration option to exclude specific categories from the search criteria. For example, if you have set up brands as categories, you might want to exclude those, because a brand may include products from multiple (real) categories. Hope I got my point across.

Price: 25€

You can buy it here: http://magento-modules.king-mo.solutions/random-upsell-products

That’s it for now

Maybe that’s useful for someone. If someone buys one of them I’d be glad to hear some feedback, maybe as a comment :)

If you have any problems with the download or the installation or anything else, please don’t hesitate to contact me, either with the contact form on this page or at magento-modules@king-mo.solutions (makes it easier to keep things separate).

Also I’d be glad to hear from you if you need some extended functionality or even a custom module developed!

Peace, Mo

My “Auto Image Adder” – WordPress Plugin

Edit:

I haven’t developed this any further yet and am currently not really planning on doing so. If you are still interested in this or have any bugs which you really, really need to have fixed, you can of course still contact me and we’ll see what I can do :) Just wanted to drop this little warning here, I don’t really know if this still works or not, maybe I’ll come back to it some day 😉

Description

This plugin can automatically add pictures to your posts when publishing. For this it does a Google image search for the post title and scrapes the first image. You can choose whether the image should be saved in your WordPress installation or just linked to the original source. Further, you can choose whether the image should be added to the post itself (and if so, pick the alignment), set as the post’s featured image, or both.

Below is a guide on how to set it up, the download link can be found at the bottom.

How to use it

Installation

In the ZIP-file there is a folder named “auto-image-adder”. Just add this folder to your Plugin directory and then activate it in the WordPress Dashboard. That’s it.

Configuration

All settings can be found under “Settings” -> “Writing”.

  • Save image or just link it – If set to “Save”, the picture will be saved in your own WordPress installation. If set to “Link”, the picture, when added to the post, will just be an <img>-tag with its src-attribute set to the original source.
  • Set image as featured image – If this option is selected, the image will be set as the post’s featured image. The first option will then be ignored, because the image has to be saved in WordPress to be set as the featured image anyway.
  • Add image to the post – If this is set, the image will be prepended to the post’s content when publishing.
  • Image class(es) – Here you can define which class(es) the image should be given when it is added to the post.
  • Image size – This is used for the scraping. These are the same values as in Google’s image search. This doesn’t “crop” the images; it only restricts the initial search to images of this size.
  • Image align – If the image is added to the post’s content, this decides whether it is aligned left, right or centered.

Additional feature – Shortcodes

Another feature built into this plugin is shortcode support. You can use the shortcode [google-search-img] to add scraped images to any post.

Parameters

  • q="[search query]"
  • align="left|center|right"
  • size="large|medium|pictogram|lt400x300|lt640x480|lt800x600|lt1024x768|lt1600x1200|lt2272x1704"
  • offset=[offset-number]
  • link="[URL]"
  • class="[class(es)]"
  • id="[id]"
  • a-class="[link class(es)]"
  • a-id="[link id]"

With q you can set a different search query to use for the image.

offset is the zero-based index of which image from Google to take. So 0 is the first, 1 the second, 2 the third and so on.

Example

Let’s say you want to add the first 3 results for a search for “ice cream cone” and you want them centered and “large” then you would add the following to your post content:

[google-search-img q="ice cream cone" align="center" size="large" offset=0]
[google-search-img q="ice cream cone" align="center" size="large" offset=1]
[google-search-img q="ice cream cone" align="center" size="large" offset=2]

Note

Be aware that with the shortcodes there is (yet) no way to save the scraped images. I might build in this option soon, but for now it scrapes the images every time the post is loaded. Therefore, too extensive use of this feature may slow down the loading time of the page. Also, it is not guaranteed that these images are always the same, because the order of results in the image search may change.

I hope you’ll like it

There is quite some effort put into this plugin, so hopefully somebody will find it useful. If you have any suggestions on how to improve this plugin, or you just appreciate it, I would be happy if you leave me a comment. Also, if there are any bugs, please let me know so I can fix them as soon as possible!

Here’s the link

http://blog.king-mo.solutions/wp-content/uploads/2015/04/auto-image-adder.zip

Blueprint for a secure and cloaked content locking solution

The problems and what I wanted to achieve

When looking at some content locking solutions I came to the conclusion that most of them are pretty insecure. I mean, they just add some overlay over the content? Ever heard of developer tools? You can just remove this stuff by deleting it from the HTML markup. Most people may not know how to do this and maybe it isn’t too much of a problem, but for example I know someone who has a page in the gaming niche where he gives away stuff which costs him 30–40 cents for every lead someone gets him. Unlimited. Now imagine someone could get around the locker there, that would be pretty bad.

The next thing is, when you’re doing stuff which is a little bit “blackhat”, you maybe don’t want your CPA network to see where the guys filling out the offers come from. So you need some way to “cloak” your locked pages.

Another problem I had was that I needed a way to let the user see the content only once, and really only once, so when he reloads the page it should be locked again immediately. Most content locking solutions I saw would relock the page only after some time span, like 24 hours at minimum.

My solution

This whole thing is split into two parts:

  • the page with the offers (let’s call it “offer-page”)
  • all pages where content needs to be locked (“content-page”)

But how to connect them?

The idea here is that the “offer-page”, when someone fills out an offer and generates a lead, shows some kind of “key” (a randomly generated hash). This page has an API which allows you to “talk” to it from another page and check if a key is valid.

The “content-page” at first shows only a form with an input box for such a key and some notice like “You need an unlock key to view this content”, along with a link that says “Get your unlock key here!” which points to the offer-page. When an unlock key is entered on the content-page, it makes a request to the offer-page to check if the key is valid, and if it is, it shows the locked content.

The offer-page

Let’s go into more detail about this.

What do you need

  • a domain and hosting naturally
  • an API from a CPA network
  • a database

How to set it up

You need a frontend for the user which displays the available offers. Pull the data for the offers from the API of your CPA network and display it on the page.

Now you’ll need a JavaScript which sends AJAX requests back to the server to check if a lead has been generated. The server should send back false if no lead has been generated, or otherwise a key. This JS should send the requests every few seconds and should start as soon as someone clicks on an offer. This is because you need to display the key on the page and you can’t really tell the user to refresh the page every now and then.

On the server this check should do the following:

It checks the API of the network with the IP of the user. If the lead count is more than 0, it should generate a key, store this key in a DB table and send it back. Now there’s the problem that the lead count will stay at 1 even if a key has already been generated, so what I did was add a second DB table which stores the IP of the user and a counter that’s always set to the lead count when a key is generated. That way, even if you get 1 lead back from the API but 1 key was already generated, it would not count as a new lead and wouldn’t create another key.

When doing it that way you should always get the all-time lead count from the API, or otherwise set an expiration time in the IP database.
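
Here’s a minimal sketch of that server-side check, assuming a hypothetical getLeadCountFromNetwork() wrapper around the CPA network’s API and two made-up tables (unlock_keys and ip_counters); it only shows the logic, not a production implementation:

// getLeadCountFromNetwork() is a hypothetical wrapper around the CPA network's API
function checkForLead(PDO $db, $ip) {
    $leadCount = getLeadCountFromNetwork($ip); // all-time lead count for this IP (assumption)

    // how many leads have already been "used up" for a key for this IP?
    $stmt = $db->prepare('SELECT counter FROM ip_counters WHERE ip = ?');
    $stmt->execute(array($ip));
    $used = (int) $stmt->fetchColumn();

    if ($leadCount <= $used) {
        return false; // no new lead since the last key was generated
    }

    // new lead: generate a key, store it and remember the new lead count for this IP
    $key = md5(uniqid(mt_rand(), true));
    $db->prepare('INSERT INTO unlock_keys (key_value) VALUES (?)')->execute(array($key));
    $db->prepare('REPLACE INTO ip_counters (ip, counter) VALUES (?, ?)')->execute(array($ip, $leadCount));

    return $key;
}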

Now when the JS gets a key sent back from the server, it just replaces the offers with something like “Your unlock key is: [key]”.

Now the user has a key and this key is stored in your database.

The last thing you’ll need for this offer-page is some API endpoint which you can send a request to from another page, including a key, and which sends back whether the key is valid or not. Since we only want the user to see the page a single time, it should also delete the key from the database after confirming it as valid.
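
A minimal sketch of such an endpoint, again assuming the made-up unlock_keys table from the sketch above; the plain “valid”/“invalid” response format is just an assumption:

// check-key.php, called by the content-page with ?key=...
$db  = new PDO('sqlite:/path/to/locker.db'); // assumed database location
$key = isset($_GET['key']) ? $_GET['key'] : '';

$stmt = $db->prepare('SELECT COUNT(*) FROM unlock_keys WHERE key_value = ?');
$stmt->execute(array($key));

if ($stmt->fetchColumn() > 0) {
    // one-time use: remove the key so the content can't be unlocked with it again
    $db->prepare('DELETE FROM unlock_keys WHERE key_value = ?')->execute(array($key));
    echo 'valid';
} else {
    echo 'invalid';
}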

The content page

The scenario is that the user can view the page only once with a key. Therefore, when he loads the page he has the possibility of entering a key, and if it’s valid the real content shows up.

The easiest way to achieve this is by checking the request method. When the user enters a URL in the browser or otherwise gets to the page, it’s always a GET request. But if he submits a form, a POST request is done. So in the background it checks for this and sends back the key form if it’s a GET request. But if it’s a POST request which includes the key, it sends the key to the offer-page via the API call, and if it’s a valid key it sends back the real content. Otherwise it again just sends back the key form (maybe with some “invalid key” message).
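
Put together, the content-page could look roughly like this; the offer-page URL and the “valid” response string are assumptions carried over from the sketches above:

// content-page.php: shows the key form on GET, checks the key on POST
function showKeyForm($message = '') {
    echo $message;
    echo '<p>You need an unlock key to view this content. ';
    echo '<a href="http://offer-page.example/">Get your unlock key here!</a></p>';
    echo '<form method="post"><input type="text" name="key"><button>Unlock</button></form>';
}

if ($_SERVER['REQUEST_METHOD'] === 'POST' && !empty($_POST['key'])) {
    // ask the offer-page's API whether the key is valid (this also "uses up" the key)
    $response = file_get_contents(
        'http://offer-page.example/check-key.php?key=' . urlencode($_POST['key'])
    );

    if ($response === 'valid') {
        echo 'Here is the locked content...';
    } else {
        showKeyForm('<p>Invalid key.</p>');
    }
} else {
    showKeyForm();
}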

That’s it.

Update: A member on BHW told me that his conversion rate was higher when using the overlay instead of completely hiding the locked content away. I haven’t tested this but it sounds pretty logical, because I guess it’s more tempting for a user to get the content when he already gets a hint of what is waiting for him. So when using this solution you might think of either adding a background image which makes it look like an overlay, or adding a screenshot of the content to this page. So basically let the user see what he’s getting this key for, but without making him able to access it without the key 😉

Advantages

  • People can’t just remove some elements from the HTML markup and get around the locker
  • You can use the offer-page for any content you want to lock, on any page. It could also easily be built as a WordPress plugin.
  • For your CPA network all leads and clicks come from one single page, and if you do blackhat stuff and they ask where your traffic is coming from, you can easily fool them since you can lock stuff on any page. Just take some legitimate blog, add the key-form lock to some articles and tell them that’s where your traffic comes from
  • You could also include multiple CPA networks’ APIs, so you would have even more offers to show, and that could make you more revenue per lead since you could place more profitable offers on the page
  • You can be sure people have to fill out an offer every time they want to see that content again

Conclusion

Maybe this gives someone some ideas. It’s not a step-by-step instruction but more of a guideline, but I think it’s pretty useful for some projects. Also, this could be adjusted for other scenarios quite easily.

If you need someone to implement this or any similar solution on your page(s), don’t hesitate to contact me 😉

Cheers,

Mo

Setting headers and doing POST requests with file_get_contents

I often see the discussion cURL vs. file_get_contents, and most of the time you find this statement somewhere in the discussion:

“Use cURL because with file_get_contents you can’t set any headers or do POST request…”

I think people should use what they are more comfortable with, but this statement is just not true. You can pass a stream context, created with stream_context_create, as the third parameter, and with this you can set things like headers and also the request method to use. Here’s how:

Setting headers

To create a stream context you have to set the options first. For this we’ll need an array like this:

$opts = array(
    "http" => array(
        "method" => "GET",
        "header" => "[header-name]: [header-value]\r\n" .
                    "[header-name]: [header-value]\r\n"
    )
);

With this you can set all the headers you like. Note that the context option is called “header” (singular), even though it can hold multiple headers, and don’t forget to always add \r\n at the end of each header.

Then you just have to create the context from this array and pass it to file_get_contents.

$context = stream_context_create($opts);
$result = file_get_contents("http://www.domain.tld", false, $context);

That’s it, you’ve done a GET request with file_get_contents where you set some headers.

Doing a POST request

You might have a clue already by looking at the first example with the headers. Just set the “method” part of the array to POST and you’re doing a POST request 😉 But with a POST request you normally send along some data. That’s also not a problem at all. Here’s an example of how to do it:

$data = http_build_query(
    array(
        "value1name" => "some_value",
        "value2name" => "another_value"
    )
);

$opts = array(
    "http" => array(
        "method"  => "POST",
        "header"  => "Content-type: application/x-www-form-urlencoded\r\n",
        "content" => $data
    )
);

$context = stream_context_create($opts);

$result = file_get_contents("http://www.domain.tld", false, $context);

In this example it’s URL-encoded data, like it would be when submitting a form, but it could also be JSON data or anything else, you just have to set the Content-Type header accordingly.
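
For example, sending JSON instead could look like this (the URL is just a placeholder):

$data = json_encode(
    array(
        "value1name" => "some_value",
        "value2name" => "another_value"
    )
);

$opts = array(
    "http" => array(
        "method"  => "POST",
        "header"  => "Content-Type: application/json\r\n",
        "content" => $data
    )
);

$context = stream_context_create($opts);
$result = file_get_contents("http://www.domain.tld", false, $context);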

Conclusion

I don’t want to say cURL is bad in any way but there are two main reasons I prefer file_get_contents over cURL:

  • file_get_contents is a built-in PHP function, so no extra extensions are needed
  • it’s way less complex than cURL

Sure, cURL is a useful tool, but I think for most cases where you need to do an HTTP request, file_get_contents is enough to achieve what you want, and it’s, in my opinion, much easier to get your head around.

Cheers,

Mo

How to scrape Google with PHP

Here we go, first post on this blog.

I’m currently working on a project where I need to scrape Google search results, so I’ll write down some of my experiences here.

So how to start?

When scraping content from a website programmatically (in PHP) you need to get that page’s HTML. And how do you get a page’s HTML? Right, a normal GET request. Basically this isn’t different from just entering the URL of the page in the browser. There are multiple ways to do a GET request, but normally I just use the basic file_get_contents. I see much discussion about what to use, but I like this one because it’s a standard PHP function, so you don’t need any extensions for it. By the way, you can also use a context (see stream_context_create) with this function, so you’re also able to set headers and so on, so it’s not that powerless. But it doesn’t really matter how you do the request, as long as you have the page’s content in the end.

Update: Wrote a short article on how to set headers with file_get_contents

But what do we need to request exactly?

The request should look something like this:

https://www.google.[tld]/search?[parameter-string]

tld stands for “top-level domain” and specifies which Google version (country) to use. More on localization a little bit later, because most of the time it’s not done with only setting the right tld.

The parameter-string can include multiple values and looks like this:

[key]=[value]&[key]=[value]&…

The most basic parameter is the “q”-parameter, which stands for query and specifies the search query. The words are separated by + signs; you can use the PHP function urlencode for this. So let’s see what a very basic request for “best restaurants” in the US version of Google (tld = .com) would look like:

https://www.google.com/search?q=best+restaurants

But like I said, that’s really very basic. You will get personalized results, and if you or your server isn’t located in the US, you most likely won’t get any results from the US. What to do about it?

Personalization:

To avoid personalized search results there is a parameter: the “pws”-parameter. To turn personalization off you have to set it to 0.

Localization:

For the localization there are 2 parameters. First there is the “hl”-parameter to set the language. This may not be important for keywords that are English anyway, but e.g. when you’re located in a German-speaking country and the keyword is spelled the same in German as in English (like “restaurant”), then it definitely is a factor. The second parameter is the “gl”-parameter, which specifies the country by its country code.

So our updated request URL looks like this:

https://www.google.com/search?q=best+restaurants&pws=0&hl=en&gl=us

With this you should get the US results for the keyword phrase “best restaurants”.
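
In PHP, building and fetching that URL could look like this; http_build_query takes care of the encoding (just a sketch, without any error handling):

$params = array(
    "q"   => "best restaurants",
    "pws" => "0",
    "hl"  => "en",
    "gl"  => "us"
);

// builds: https://www.google.com/search?q=best+restaurants&pws=0&hl=en&gl=us
$url = "https://www.google.com/search?" . http_build_query($params);

$content = file_get_contents($url);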

With the .com tld, mainly if you’re located outside of the US I think, there could still be problems with this stupid country redirect. For this, refer to the “Country redirect” paragraph below.

Now we should have the content of the result page. But how to scrape the results?

To know what we need to get out of all that HTML code, we need to know how it is structured. You could just write the HTML to a file and then look through it, but personally I prefer using the browser’s developer tools for this.

The theory

After doing some research on this I found out that all results are in “li”-elements with the class “g”, so the first thing to do is extract all content between each <li class="g"> and its closing </li> tag. Doing that you’ll also get ads and news and stuff which aren’t real search results, so we need to figure out a way to avoid including these. What I noticed is that all “real” search results have their descriptions in a div-container with the class “s”, so when looping through the content you can skip the results which don’t include <div class="s">; then you just have the search results you want.

We will try to get the headline and the URL of each search result. The headline is the anchor text of the link, which also has the right URL as its href-attribute. What I did was search the content of the <li> element for the opening <a and the closing </a>; then you have the whole link element containing everything you want to extract. Now first search this string for href=" and extract the content until the next double quote; now you have the URL. Nearly done here. Next search for the > which ends the opening <a-tag and extract all content between it and the closing </a> tag, and you have the headline. Save both these values to an array and go to the next result. Repeat until you don’t find any more <li class="g"> elements, and you have all the results from Google.

The practice

I will be using a while-loop to go through the content, the function strpos to get the start and end positions of the parts to extract, and substr to extract them. The condition of the while loop will also be a strpos call: strpos returns false when the string that’s being searched for doesn’t occur, so we loop as long as another occurrence is found. The third parameter of strpos is the offset, so it will start to search from this position. Therefore I will set a variable $offs which is 0 in the beginning and in every loop is set to a position after the last occurrence of <li class="g">. Hope you get the idea. The result of the GET request is saved in a variable called $content. Also we set up an array $results where we can store the results afterwards. So the while loop will look like this:

$results = array();
$offs = 0;
while (($pos = strpos($content, '<li class="g">', $offs)) !== false) {
    ...
}

Now $pos contains the position of <li class="g">, but that’s its starting position, so we have to add its length (which is 14) to get the starting position of its content. Then we search for the next </li> to get the end position of the content. When we have both these values we can extract the content using substr.

$startPos = $pos + 14;
$endPos = strpos($content, '</li>', $startPos);
$resultContent = substr($content, $startPos, $endPos - $startPos);

The third argument of substr is the length of the content to extract. We get this by subtracting the start position from the end position. Now we have the content of the result saved in a variable, but we still don’t know if it’s a “real” result, so we need to check if it contains <div class="s"> and if not, continue the loop. Don’t forget to set the $offs variable before that, otherwise it would loop through the same result again and again.

$offs = $endPos;
if (strpos($resultContent, '<div class="s">') === false) {
    continue;
}

Now that we are sure that the result we have is one we want, we can start searching for the link and the headline. First let’s get the whole “a”-element and save it to a variable called $link:

$linkSPos = strpos($resultContent, '<a');
$linkEPos = strpos($resultContent, '</a>', $linkSPos) + 4;
$link = substr($resultContent, $linkSPos, $linkEPos - $linkSPos);

The URL of the result is in the href attribute of this link, so let’s get this:

$urlSPos = strpos($link, 'href="') + 6;
$urlEPos = strpos($link, '"', $urlSPos);
$url = substr($link, $urlSPos, $urlEPos - $urlSPos);

The headline is the anchor text of the link:

$hlSPos = strpos($link, '>', $urlEPos) + 1;
$hlEPos = strpos($link, '</a>', $hlSPos);
$headline = substr($link, $hlSPos, $hlEPos - $hlSPos);

The keywords in the headline are bold (between <b>…</b>), so you might want to remove these tags to have the headline as a pure string. You could do it like this:

if(strpos($headline, '<b>') !== false) {
    $headline = str_replace('<b>', '', $headline);
    $headline = str_replace('</b>', '', $headline);
}

In the last line of the while-loop just add the values to the $results array:

$results[] = array("url" => $url, "headline" => $headline);

That’s it. Now you have an array with all the extracted results ready to use. The position of a result is always its index in the array + 1.

Encoding and User-Agent

When getting the contents of the result page it may be that you have some problems with the encoding of the data. I don’t want to say too much about this because I haven’t tested it very extensively and I don’t want to spread too much bullshit about it, but here is a little hint:

Google has 2 parameters for encoding:

  • ie = input encoding
  • oe = output encoding

E.g. if you want the output to be in UTF-8 add &oe=utf8 to your parameter string.

Also it could be that you have to decode special chars when extracting the results.

Another thing I haven’t tested heavily, but which could lead to problems, is the user agent. The HTML and maybe even the results may differ when looking at the results from different devices. Therefore it could be a good idea to set a User-Agent header in your GET request to make sure it always sends back the right HTML markup. For example you could use this one:

Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/42.0.2311.90 Safari/537.36

Works for me.
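
Combined with the stream context approach from the file_get_contents article, setting that User-Agent could look like this (again just a sketch):

$opts = array(
    "http" => array(
        "method" => "GET",
        "header" => "User-Agent: Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/42.0.2311.90 Safari/537.36\r\n"
    )
);

$context = stream_context_create($opts);
$content = file_get_contents("https://www.google.com/search?q=best+restaurants&pws=0&hl=en&gl=us", false, $context);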

Get more results

This article is mainly about the first page of Google results, but I will drop some hints on how to get more results. I haven’t used them yet, so I can’t go into too many details.

The first way to get more results is the easiest, as you don’t have to request more than one page. There is a parameter which sets how many results should be shown on one page: “num”. So if you e.g. add &num=50 to your parameters, you’ll get more results. Try it out.

Other ways would be to do multiple requests where 1 request is 1 page. There are two ways I could think of:

  • scrape the link to the next page. The links to the next pages are always at the bottom. Try to extract those and make a request to them (not tested). Maybe think of setting the Referer header.
  • use the “start”-parameter to get other pages. By default it is 10 for the second page, 20 for the third, 30 for the fourth and so on, but be careful if you use the “num”-parameter, because then the start values have to change accordingly (see the small example below).
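
As a quick illustration of that start/num relationship (based on my understanding of the parameters, so take it as an assumption):

// build the URLs for the first 3 pages with 50 results per page
$num = 50;
for ($page = 0; $page < 3; $page++) {
    $params = array(
        "q"     => "best restaurants",
        "num"   => $num,
        "start" => $page * $num // 0, 50, 100, ...
    );
    echo "https://www.google.com/search?" . http_build_query($params) . "\n";
}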

Country redirect

I don’t know how relevant this is when you’re located in the US, but I’m located in Austria and I have this problem pretty often: when using the .com tld it redirects you to your country’s version of Google. I think I figured out a little trick to avoid this:

When looking at the URL it redirected me to, I noticed this “gws_rd”-parameter set to “cr”. After doing a bit of research I figured out that gws_rd stands for something like “Google web server redirect” and cr for “country redirect” (I guess). So basically this tells Google that a country redirect has already been done. Now if you set this parameter yourself when using the .com tld, it doesn’t redirect you anymore. Quite useful in my opinion.

Therefore it may be a good idea to always add &gws_rd=cr to your parameters when doing a request.

Conclusion

There might be better ways to do all this, and I don’t want to say I’m such a pro on this topic, but the stuff I wrote down works for me, so I thought I’d share it. Much of this is more guidelines and hints than step-by-step instructions, but maybe somebody will find it useful. It could be that I’ll edit some of this from time to time or add something when I find out more.

Anyways, proud of my first ever blog post :)

Cheers,

Mo
