King Mo's Blog

Coding 'n shit


My “Auto Image Adder” – WordPress Plugin

Edit:

I haven’t developed this any further and am currently not really planning to. If you are still interested in it, or have any bugs which you really really need fixed, you can of course still contact me and we’ll see what I can do :) I just wanted to drop this little warning here: I don’t know if this still works. Maybe I’ll come back to it some day 😉

Description

This plugin can automatically add pictures to your posts when publishing. To do this, it performs a Google image search for the post title and scrapes the first image. You can choose whether the image should be saved in your WordPress installation or just linked to the original source. You can also choose whether the image should be added to the post itself (and, if so, with which alignment), set as the post’s featured image, or both.

Below is a guide on how to set it up, the download link can be found at the bottom.

How to use it

Installation

In the ZIP file there is a folder named “auto-image-adder”. Just add this folder to your plugin directory and then activate the plugin in the WordPress dashboard. That’s it.

Configuration

All settings can be found under “Settings” -> “Writing”.

  • Save image or just link it – If set to “Save”, the picture will be saved in your own WordPress installation. If set to “Link”, the image added to the post will just be an <img> tag with its src attribute set to the original source.
  • Set image as featured image – If this option is selected, the image will be set as the post’s featured image. The first option is then ignored, because the image has to be saved in WordPress to be set as featured image anyway.
  • Add image to the post – If this is set, the image will be prepended to the post’s content when publishing.
  • Image class(es) – Here you can define which class(es) the image should be given when it is added to the post.
  • Image size – This is used for the scraping and takes the same values as Google’s image search. It therefore doesn’t “crop” the images but restricts the search to images of this size in the first place.
  • Image align – If the image is added to the post’s content, this decides whether it is aligned left, right or centered.

Additional feature – Shortcodes

Another feature built into this plugin is the use of shortcodes. You can use the shortcode [google-search-img] to add scraped images to any post.

Parameters

  • q="[search query]"
  • align="left|center|right"
  • size="large|medium|pictogram|lt400x300|lt640x480|lt800x600|lt1024x768|lt1600x1200|lt2272x1704"
  • offset=[offset-number]
  • link="[URL]"
  • class="[class(es)]"
  • id="[id]"
  • a-class="[link class(es)]"
  • a-id="[link id]"

With q you can set a different query to use for the image search.

offset is the 0-based index of which image from Google to take: 0 is the first, 1 the second, 2 the third and so on.

Example

Let’s say you want to add the first 3 results of a search for “ice cream cone”, centered and “large”. Then you would add the following to your post content:

[google-search-img q="ice cream cone" align="center" size="large" offset=0]
[google-search-img q="ice cream cone" align="center" size="large" offset=1]
[google-search-img q="ice cream cone" align="center" size="large" offset=2]

Note

Be aware that with the shortcodes there is (yet) no way to save the scraped images. I might build this option in soon, but for now the images are scraped every time the post is loaded. Therefore, too extensive use of this feature may slow down the page’s loading time. Also, it is not guaranteed that the images are always the same, because the order of results in the image search may change.

I hope you’ll like it

Quite some effort went into this plugin, so hopefully somebody finds it useful. If you have any suggestions on how to improve it, or if you just appreciate it, I would be happy if you left me a comment. Also, if there are any bugs, please let me know so I can fix them as soon as possible!

Here’s the link

http://blog.king-mo.solutions/wp-content/uploads/2015/04/auto-image-adder.zip

Blueprint for a secure and cloaked content locking solution

The problems and what I wanted to achieve

When looking at some content locking solutions I came to the conclusion that most of them are pretty insecure. They just add some overlay over the content? Ever heard of developer tools? You can remove that stuff by simply deleting it from the HTML markup. Most people may not know how to do this, so maybe it isn’t too much of a problem in general, but for example I know someone with a page in the gaming niche who gives away stuff that costs him 30-40c for every lead someone gets him. Unlimited. Now imagine someone could get around the locker; that would be pretty bad.

The next thing is: when you’re doing stuff that is a little bit “blackhat”, you maybe don’t want your CPA network to see where the people filling out the offers come from. So you need some way to “cloak” your locked pages.

Another problem I had was that I needed a way to let the user see the content only once, and really only once: when he reloads the page, it should be locked again immediately. Most content locking solutions I saw would relock the page only after some time span of at least 24h or so.

My solution

This whole thing is split into two parts:

  • the page with the offers (let’s call it “offer-page”)
  • all pages where content needs to be locked (“content-page”)

But how to connect them?

The idea here is that when someone fills out an offer on the “offer-page” and generates a lead, it shows some kind of “key” (a randomly generated hash). This page has an API which allows you to “talk” to it from another page and check whether a key is valid.

At first, the “content-page” shows only a form with an input box for such a key, a notice like “You need an unlock key to view this content”, and a link saying “Get your unlock key here!” which points to the offer-page. When an unlock key is entered on the content-page, it makes a request to the offer-page to check whether the key is valid, and if it is, it shows the locked content.

The offer-page

Let’s go into more detail about this.

What do you need

  • a domain and hosting naturally
  • an API from a CPA network
  • a database

How to set it up

You need a frontend for the user which displays the available offers. Pull the data for the offers from the API of your CPA network and display it on the page.

Now you’ll need some JavaScript which sends AJAX requests back to the server to check whether a lead has been generated. The server should respond with false if no lead has been generated yet, or with a key otherwise. The JS should send these requests every few seconds, starting as soon as someone clicks on an offer. This is because you need to display the key on the page and you can’t really tell the user to refresh the page every now and then.

On the server this check should do the following:

It queries the network’s API with the IP of the user. If the lead count is more than 0, it should generate a key, store this key in a DB table and send it back. Now there’s the problem that the lead count will stay at 1 even if a key has already been generated, so what I did was add a second DB table which stores the IP of the user and a counter that is set to the lead count whenever a key is generated. That way, if the API reports 1 lead but 1 key has already been generated, it doesn’t count as a new lead and no additional key is created.

When doing it that way you should always get the all-time lead count from the API, or otherwise set an expiration time in the IP table.
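Here is a rough PHP sketch of that server-side check, just to illustrate the idea. The table names, the PDO connection and the get_lead_count() helper (which would wrap the CPA network’s API call) are all my assumptions, not part of any real API:

<?php
// Hypothetical helper: returns the all-time lead count the CPA
// network's API reports for this IP. Implementation depends on the network.
// function get_lead_count($ip) { ... }

// $pdo is an existing PDO connection; the table names are made up.
function check_for_lead(PDO $pdo, $ip) {
    $leads = get_lead_count($ip);

    // How many leads have we already handed out keys for?
    $stmt = $pdo->prepare("SELECT counter FROM ip_counters WHERE ip = ?");
    $stmt->execute(array($ip));
    $counted = (int) $stmt->fetchColumn(); // 0 if there is no row yet

    if ($leads <= $counted) {
        return false; // no new lead since the last key
    }

    // New lead: generate a key, store it, remember the new lead count.
    $key = bin2hex(openssl_random_pseudo_bytes(16));
    $pdo->prepare("INSERT INTO unlock_keys (unlock_key) VALUES (?)")
        ->execute(array($key));
    $pdo->prepare("REPLACE INTO ip_counters (ip, counter) VALUES (?, ?)")
        ->execute(array($ip, $leads));

    return $key;
}

The AJAX endpoint would then just echo the return value of this function back to the JavaScript.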

When the JS gets a key back from the server, it simply replaces the offers with something like “Your unlock key is: [key]”.

Now the user has a key and this key is stored in your database.

The last thing you’ll need for this offer-page is an API endpoint to which you can send a request from another page, including a key, and which responds with whether the key is valid or not. Since we only want the user to see the page a single time, it should also delete the key from the database after confirming it as valid.
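A minimal sketch of such an endpoint, reusing the made-up unlock_keys table from the sketch above. Deleting the key and checking the affected row count in one step makes sure a key can only ever be confirmed once:

<?php
// check-key.php -- called from the content-page, e.g. ?key=abc123
$key = isset($_GET['key']) ? $_GET['key'] : '';

$stmt = $pdo->prepare("DELETE FROM unlock_keys WHERE unlock_key = ?");
$stmt->execute(array($key));

// rowCount() tells us whether a row was actually deleted,
// i.e. whether the key existed. Valid keys are gone after this.
echo $stmt->rowCount() > 0 ? 'valid' : 'invalid';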

The content page

The scenario is that the user can view the page only once per key. Therefore, when he loads the page, he gets the possibility to enter a key, and if it’s valid, the real content shows up.

The easiest way to achieve this is to check the request method. When the user enters a URL in the browser or follows a link to the page, it’s always a GET request; but when he submits a form, a POST request is made. So in the background the page checks for this and sends back the key form if it’s a GET request. If it’s a POST request including a key, it sends the key to the offer-page via the API call, and if the key is valid, it sends back the real content. Otherwise it again just sends back the key form (maybe with some “invalid key” message).
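In PHP that could look roughly like this; the offer-page URL, the file names and the ‘valid’/‘invalid’ response format are the assumptions from the sketches above:

<?php
$unlocked = false;

if ($_SERVER['REQUEST_METHOD'] === 'POST' && isset($_POST['key'])) {
    // Ask the offer-page's API whether the key is valid (this also burns it).
    $response = file_get_contents(
        'http://offer-page.tld/check-key.php?key=' . urlencode($_POST['key'])
    );
    $unlocked = ($response === 'valid');
}

if ($unlocked) {
    include 'real-content.php'; // the locked content
} else {
    include 'key-form.php';     // the "enter your unlock key" form
}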

That’s it.

Update: A member on BHW told me that his conversion rate was higher when using an overlay instead of completely hiding the locked content away. I haven’t tested this, but it sounds pretty logical: I guess it’s more tempting for a user to get the content when he already gets a hint of what is waiting for him. So when using this solution you might think of either adding a background image which makes it look like an overlay, or adding a screenshot of the content to the page. Basically, let the user see what he’s getting the key for, but without letting him access it without the key 😉

Advantages

  • People can’t just remove some elements from the HTML markup to get around the locker
  • You can use the offer-page for any content you want to lock, on any page. It could also easily be built as a WordPress plugin.
  • For your CPA network, all leads and clicks come from one single page, and if you do blackhat stuff and they ask where your traffic is coming from, you can easily fool them since you can lock stuff on any page. Just take some legitimate blog, add the key-form lock to some articles and tell them that’s where your traffic comes from
  • You could also include multiple CPA networks’ APIs, so you would have even more offers to show, which could make you more revenue per lead since you could place more profitable offers on the page
  • You can be sure people have to fill out an offer every time they want to see the content again

Conclusion

Maybe this gives someone some ideas. It’s not a step-by-step instruction but more of a guideline, but I think it’s pretty useful for some projects. It could also be adjusted for other scenarios quite easily.

If you need someone to implement this or a similar solution on your page(s), don’t hesitate to contact me 😉

Cheers,

Mo

Setting headers and doing POST requests with file_get_contents

I often see the discussion cURL vs. file_get_contents, and most of the time you find this statement somewhere in it:

“Use cURL, because with file_get_contents you can’t set any headers or do POST requests…”

I think people should use what they are more comfortable with, but this statement is just not true. You can pass a stream context, created with stream_context_create, as the third parameter, and with it you can set things like headers and also the request method to use. Here’s how:

Setting headers

To create a stream context you have to set the options first. For this we’ll need an array like this:

$opts = array(
    "http" => array(
        "method" => "GET",
        "header" => "[header-name]: [header-value]\r\n" .
                    "[header-name]: [header-value]\r\n"
    )
);

With this you can set all the headers you like, just don’t forget to always add \r\n at the end of each header.

Then you just have to create the context from this array and pass it to file_get_contents.

$context = stream_context_create($opts);
$result = file_get_contents("http://www.domain.tld", false, $context);

That’s it, you’ve done a GET request with file_get_contents where you set some headers.

Doing a POST request

You might have a clue already from looking at the first example with the headers: just set the “method” part of the array to POST and you’re doing a POST request 😉 But with a POST request you normally send along some data. That’s not a problem at all either. Here’s an example of how to do it:

$data = http_build_query(
    array(
        "value1name" => "some_value",
        "value2name" => "another_value"
    )
);

$opts = array(
    "http" => array(
        "method"  => "POST",
        "header"  => "Content-Type: application/x-www-form-urlencoded\r\n",
        "content" => $data
    )
);

$context = stream_context_create($opts);

$result = file_get_contents("http://www.domain.tld", false, $context);

In this example it’s URL-encoded data, like it would be when submitting a form, but it could also be JSON data or anything else; you just have to set the Content-Type header accordingly.
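For example, a sketch of sending JSON instead would only change the body and the Content-Type header:

$data = json_encode(array("value1name" => "some_value"));

$opts = array(
    "http" => array(
        "method"  => "POST",
        "header"  => "Content-Type: application/json\r\n",
        "content" => $data
    )
);

$result = file_get_contents("http://www.domain.tld", false, stream_context_create($opts));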

Conclusion

I don’t want to say cURL is bad in any way but there are two main reasons I prefer file_get_contents over cURL:

  • file_get_contents is a built-in PHP function, so no extra extension is needed
  • it’s way less complex than cURL

Sure, cURL is a useful tool, but I think for most cases where you need to do an HTTP request, file_get_contents is enough to achieve what you want, and it’s, in my opinion, much easier to get your head around.

Cheers,

Mo

How to scrape Google with PHP

Here we go, first post on this blog.

I’m currently working on a project where I need to scrape Google search results, so I’ll write down some of my experiences here.

So how to start?

When scraping content from a website programmatically (with PHP), you need to get that page’s HTML. And how do you get a page’s HTML? Right, with a normal GET request. Basically this is no different from just entering the URL of the page in the browser. There are multiple ways to do a GET request, but normally I just use the basic file_get_contents. I see much discussion about what to use, but I like this one because it’s a standard PHP function, so you don’t need any extensions for it. By the way, you can also use a context (see stream_context_create) with this function, so you’re also able to set headers and so on; it’s not that powerless. But it doesn’t really matter how you do the request, as long as you have the page’s content in the end.

Update: Wrote a short article on how to set headers with file_get_contents

But what do we need to request exactly?

The request should look something like this:

https://www.google.[tld]/search?[parameter-string]

tld stands for “top level domain” and specifies which Google version (country) to use. But more on localization a little later, because most of the time it’s not done with only setting the right tld.

The parameter-string can include multiple values and looks like this:

[key]=[value]&[key]=[value]&…

The most basic parameter is the “q” parameter, which stands for query and specifies the search query. The words are separated by + signs; you can use the PHP function urlencode for this. So let’s see what a very basic request for “best restaurants” in the US version of Google (tld = .com) would look like:

https://www.google.com/search?q=best+restaurants

But like I said, that’s really very basic. You will get personalized results, and if neither you nor your server is located in the US, you most likely won’t get US results. What to do about it?

Personalization:

To avoid personalized search results there is a parameter: the “pws” parameter. To turn personalization off you have to set it to 0.

Localization:

For the localization there are 2 parameters. First there is the “hl” parameter, which sets the language. This may not matter for keywords that are English anyway, but e.g. when you’re located in a German-speaking country and the keyword is a word like “restaurant”, which is the same in German and English, it definitely is a factor. The second parameter is the “gl” parameter, which specifies the country by its country code.

So our updated request URL looks like this:

https://www.google.com/search?q=best+restaurants&pws=0&hl=en&gl=us

With this you should get the US results for the keyword phrase “best restaurants”.
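By the way, instead of gluing the parameter string together by hand, you can let PHP build it with http_build_query, which also takes care of the encoding (turning spaces into + signs):

$params = array(
    "q"   => "best restaurants",
    "pws" => "0",
    "hl"  => "en",
    "gl"  => "us"
);

$url = "https://www.google.com/search?" . http_build_query($params);
// -> https://www.google.com/search?q=best+restaurants&pws=0&hl=en&gl=us
$content = file_get_contents($url);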

With the .com tld there could still be problems with the country redirect, mainly (I think) if you’re located outside of the US. For this, refer to the “Country redirect” paragraph below.

Now we should have the content of the result page. But how do we scrape the results?

To know what we need to get out of all that HTML code, we need to know how it is structured. You could just write the HTML to a file and look through it, but personally I prefer using the browser’s developer tools for this.

The theory

After doing some research I found out that all results are in “li” elements with the class “g”, so the first thing to do is extract all content between each <li class="g"> and its closing </li> tag. Doing that you’ll also get ads, news boxes and other things that aren’t real search results, so we need a way to avoid including them. What I noticed is that all “real” search results have their descriptions in a div container with the class “s”, so when looping through the content you can skip the results which don’t include <div class="s">; then you only have the search results you want.

We will try to get the headline and the URL of each search result. The headline is the anchor text of the link, which also has the right URL as its href attribute. What I did was search the content of the <li> element for the opening "<a" and the closing "</a>"; then you have the whole link element, which contains everything you want to extract. First search this string for 'href="' and extract the content up to the next '"'. Now you have the link. Nearly done. Next, search for the ">" which ends the opening "<a" tag and extract all content between it and the closing "</a>" tag, and you have the headline. Save both values to an array and go to the next result. Repeat until you don’t find any more '<li class="g">' elements; then you have all the results from Google.

The practice

I will be using a while loop to go through the content, the function strpos to get the start and end positions of the parts to extract, and substr to extract them. The condition of the while loop will also be a strpos call: strpos returns false when the string being searched for doesn’t occur. The third parameter of strpos is the offset, so it starts searching from that position. Therefore I will use a variable $offs which is 0 in the beginning and, in every iteration, is set to a position after the last occurrence of <li class="g">. Hope you get the idea. The result of the GET request is saved in a variable called $content. We also set up an array $results where we can store the results afterwards. So the while loop will look like this:

$results = array();
$offs = 0;
while (($pos = strpos($content, '<li class="g">', $offs)) !== false) {
    ...
}

Now $pos contains the position of <li class="g">, but that is where the tag starts, so we have to add its length (which is 14) to get the starting position of its content. Then we search for the next </li> to get the end position of the content. With both these values we can extract the content using substr.

$startPos = $pos + 14;
$endPos = strpos($content, '</li>', $startPos);
$resultContent = substr($content, $startPos, $endPos - $startPos);

The third argument of substr is the length of the content to extract, which we get by subtracting the start position from the end position. Now we have the content of the result saved in a variable, but we still don’t know if it’s a “real” result, so we need to check whether it contains <div class="s"> and, if not, continue the loop. Don’t forget to set the $offs variable before that, otherwise the loop would go through the same result again and again.

$offs = $endPos;
if (strpos($resultContent, '<div class="s">') === false) {
    continue;
}

Now that we are sure the result is one we want, we can start searching for the link and the headline. First let’s get the whole “a” element and save it to a variable called $link:

$linkSPos = strpos($resultContent, '<a');
$linkEPos = strpos($resultContent, '</a>', $linkSPos) + 4;
$link = substr($resultContent, $linkSPos, $linkEPos - $linkSPos);

The URL of the result is in the href attribute of this link, so let’s get it:

$urlSPos = strpos($link, 'href="') + 6;
$urlEPos = strpos($link, '"', $urlSPos);
$url = substr($link, $urlSPos, $urlEPos - $urlSPos);

The headline is the anchor text of the link:

$hlSPos = strpos($link, '>', $urlEPos) + 1;
$hlEPos = strpos($link, '</a>', $hlSPos);
$headline = substr($link, $hlSPos, $hlEPos - $hlSPos);

The keywords in the headline are bold (wrapped in <b>…</b>), so you might want to remove these tags to get the headline as a plain string. You could do it like this:

$headline = str_replace(array('<b>', '</b>'), '', $headline);

In the last line of the while-loop just add the values to the $results array:

$results[] = array("url" => $url, "headline" => $headline);

That’s it. Now you have an array with all the extracted results, ready to use. The position of a result in Google is always its index in the array + 1.
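For reference, here is the whole loop with all the pieces from above put together. This is just my sketch of it; Google’s markup changes over time, so the class names may well not match anymore by the time you read this:

$results = array();
$offs = 0;

while (($pos = strpos($content, '<li class="g">', $offs)) !== false) {
    // Extract everything between <li class="g"> and the next </li>.
    $startPos = $pos + 14;
    $endPos = strpos($content, '</li>', $startPos);
    $resultContent = substr($content, $startPos, $endPos - $startPos);
    $offs = $endPos;

    // Skip ads, news boxes etc. -- real results contain <div class="s">.
    if (strpos($resultContent, '<div class="s">') === false) {
        continue;
    }

    // The whole <a>...</a> element contains both the URL and the headline.
    $linkSPos = strpos($resultContent, '<a');
    $linkEPos = strpos($resultContent, '</a>', $linkSPos) + 4;
    $link = substr($resultContent, $linkSPos, $linkEPos - $linkSPos);

    // The URL is the content of the href attribute.
    $urlSPos = strpos($link, 'href="') + 6;
    $urlEPos = strpos($link, '"', $urlSPos);
    $url = substr($link, $urlSPos, $urlEPos - $urlSPos);

    // The headline is the anchor text, with the <b> tags removed.
    $hlSPos = strpos($link, '>', $urlEPos) + 1;
    $hlEPos = strpos($link, '</a>', $hlSPos);
    $headline = substr($link, $hlSPos, $hlEPos - $hlSPos);
    $headline = str_replace(array('<b>', '</b>'), '', $headline);

    $results[] = array("url" => $url, "headline" => $headline);
}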

Encoding and User-Agent

When getting the contents of the result page, you may have some problems with the encoding of the data. I don’t want to say too much about this because I haven’t tested it very extensively and don’t want to spread too much bullshit, but here is a little hint:

Google has 2 parameters for encoding:

  • ie = input encoding
  • oe = output encoding

E.g. if you want the output to be in UTF-8 add &oe=utf8 to your parameter string.

Also it could be that you have to decode special chars when extracting the results.

Another thing I haven’t tested thoroughly, but which could lead to problems, is the user agent. The HTML, and maybe even the results, may differ when looking at the results from different devices. Therefore it could be a good idea to set a User-Agent header in your GET request to make sure Google always sends back the same HTML markup. For example you could use this one:

Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/42.0.2311.90 Safari/537.36

Works for me.
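Setting the header works the same way as described in the file_get_contents article above:

$opts = array(
    "http" => array(
        "method" => "GET",
        "header" => "User-Agent: Mozilla/5.0 (Windows NT 6.3; WOW64) " .
                    "AppleWebKit/537.36 (KHTML, like Gecko) " .
                    "Chrome/42.0.2311.90 Safari/537.36\r\n"
    )
);

$content = file_get_contents($url, false, stream_context_create($opts));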

Get more results

This article is mainly about the first page of Google results, but I’ll drop some hints on how to get more. I haven’t used these methods much yet, so I can’t go into too many details.

The first way to get more results is the easiest, as you don’t have to fetch more than one page. There is a parameter which sets how many results are shown on one page: “num”. So if you e.g. add &num=50 to your parameters, you’ll get more results. Try it out.

The other way is to do multiple requests, where 1 request is 1 page. There are two approaches I can think of (see the sketch after this list):

  • scrape the link to the next page. The links to the next pages are always at the bottom; try to extract them and make a request to them (not tested). Maybe think of setting the Referer header too.
  • use the “start” parameter to get other pages. By default it is 10 for the second page, 20 for the third, 30 for the fourth and so on, but be careful if you combine it with the “num” parameter, because then you have to set the numbers differently.
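A quick sketch of the second approach, assuming the default of 10 results per page:

$allPages = array();

for ($page = 0; $page < 3; $page++) {
    $params = array(
        "q"     => "best restaurants",
        "pws"   => "0",
        "start" => $page * 10 // 0, 10, 20, ...
    );
    $allPages[] = file_get_contents(
        "https://www.google.com/search?" . http_build_query($params)
    );
}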

Country redirect

I don’t know how relevant this is when you’re located in the US, but I’m located in Austria and I have this problem pretty often: when using the .com tld, Google redirects you to your country’s version. I think I figured out a little trick to avoid this:

When looking at the URL it redirected me to, I noticed a “gws_rd” parameter set to “cr”. After a bit of research I figured out that gws_rd stands for something like “Google web server redirect” and cr for “country redirect” (I guess). So basically this tells Google that a country redirect has already been done. Now, if you set this parameter yourself when using the .com tld, it doesn’t redirect you anymore. Quite useful in my opinion.

Therefore it may be a good idea to always add &gws_rd=cr to your parameters when doing a request.

Conclusion

There might be better ways to do all this, and I don’t want to say I’m a pro on the topic, but the stuff I wrote down works for me, so I thought I’d share it. Much of this is more guidelines and hints than step-by-step instructions, but maybe somebody finds it useful. I might edit some of this from time to time or add things as I find out more.

Anyways, proud of my first ever blog post :)

Cheers,

Mo
