PHP Simple Web Crawler

Last Updated on Mar 22, 2023

Introduction

We often need to crawl the web to gather specific information and data.

We can do this easily with PHP. We don’t even need any third-party packages, only the cURL and DOM extensions.

We can do it in 2 steps:

  1. Send a request and get the response
  2. Parse the response to get the data we need

Today we are going to send a request to IMDb and get the top movies at the box office.

So let’s get started.

Send Request and Get Response

The first step, sending the request, can be done with PHP cURL. Do you remember how to send requests with PHP cURL?

You can see IMDb’s box office chart at https://www.imdb.com/chart/boxoffice, so we are going to send a GET request to that address.

Here is what our code looks like for sending the request with PHP cURL to IMDb’s box office page:

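$curl = curl_init();
$requestType = 'GET'; // GET POST DELETE etc.
$url = 'https://www.imdb.com/chart/boxoffice';
curl_setopt_array($curl, array(
    CURLOPT_URL => $url,
    CURLOPT_RETURNTRANSFER => true, // return the response as a string instead of printing it
    CURLOPT_ENCODING => '', // accept every compression encoding cURL supports
    CURLOPT_MAXREDIRS => 10, // only applies when CURLOPT_FOLLOWLOCATION is true
    CURLOPT_TIMEOUT => 30, // give up after 30 seconds
    CURLOPT_FOLLOWLOCATION => false, // do not follow redirects
    CURLOPT_HTTP_VERSION => CURL_HTTP_VERSION_1_1,
    CURLOPT_CUSTOMREQUEST => $requestType,
    CURLOPT_POSTFIELDS => '', // empty request body
    CURLOPT_HTTPHEADER => [], // no extra request headers
));
$response = curl_exec($curl); // the page HTML as a string, or false on failure
curl_close($curl);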
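
The request above assumes that everything goes well. As a minimal sketch, the helper below wraps the same kind of call and returns null when the transfer fails or the server answers with a non-2xx status. The name fetchHtml, the choice to follow redirects, and the User-Agent value are all assumptions made for illustration; they are not part of the original code or of cURL itself.

```php
<?php
// Hypothetical helper: returns the response body, or null on any failure.
function fetchHtml(string $url, int $timeoutSeconds = 30): ?string
{
    $curl = curl_init();
    curl_setopt_array($curl, [
        CURLOPT_URL            => $url,
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_ENCODING       => '',
        CURLOPT_TIMEOUT        => $timeoutSeconds,
        CURLOPT_FOLLOWLOCATION => true,
        CURLOPT_MAXREDIRS      => 10,
        // Assumed value; pick a User-Agent that identifies your crawler.
        CURLOPT_USERAGENT      => 'SimplePhpCrawler/1.0',
    ]);

    $body   = curl_exec($curl);
    $errno  = curl_errno($curl);
    $error  = curl_error($curl);
    $status = curl_getinfo($curl, CURLINFO_HTTP_CODE);
    // The handle is released when $curl goes out of scope.

    if ($body === false || $errno !== 0) {
        error_log("cURL error {$errno}: {$error}");
        return null;
    }
    if ($status < 200 || $status >= 300) {
        error_log("Unexpected HTTP status {$status} for {$url}");
        return null;
    }
    return $body;
}
```

With a helper like this, the calling code can write $response = fetchHtml($url); and stop early when $response is null, instead of passing false to the parser.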

Next we need to parse the data we’ve received.

Parse Data

First, we need to create a DOMDocument:

$dom = new DOMDocument();

Then we need to load our HTML into this DOMDocument:

$dom->loadHTML($response);

Then, to get specific data, we need XPath, so let’s create a new DOMXPath and pass our DOMDocument to it as the first argument:

$xpath = new DOMXPath($dom);

We’re almost done. If I run the code now, I get a lot of warnings from PHP, because real-world HTML is rarely perfectly valid. Most of these warnings are not important for our purposes, and we can suppress them. To do that, add the following line before loading the document:

libxml_use_internal_errors(true);
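
Suppressing the warnings does not throw them away: libxml keeps them in an internal buffer. If you ever want to see what the parser complained about, a rough sketch looks like this. The markup is a made-up example with a stray closing tag, which the parser usually reports; the exact messages depend on the libxml version.

```php
<?php
libxml_use_internal_errors(true);   // collect parse warnings instead of printing them

$dom = new DOMDocument();
// Made-up, deliberately sloppy markup: </span> has no matching opening tag.
$dom->loadHTML('<html><body><div><p>Hello</p></div></span></body></html>');

foreach (libxml_get_errors() as $error) {
    // Each entry is a LibXMLError object with line, code and message properties.
    printf("line %d: %s\n", $error->line, trim($error->message));
}

libxml_clear_errors();              // empty the buffer before the next parse
```

Calling libxml_clear_errors() matters in a long-running crawler, because otherwise the buffer keeps growing with every page you load.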

So far, our code looks like this:

$curl = curl_init();
$requestType = 'GET'; // GET POST DELETE etc.
$url = 'https://www.imdb.com/chart/boxoffice';
curl_setopt_array($curl, array(
    CURLOPT_URL => $url,
    CURLOPT_RETURNTRANSFER => true,
    CURLOPT_ENCODING => '',
    CURLOPT_MAXREDIRS => 10,
    CURLOPT_TIMEOUT => 30,
    CURLOPT_FOLLOWLOCATION => false,
    CURLOPT_HTTP_VERSION => CURL_HTTP_VERSION_1_1,
    CURLOPT_CUSTOMREQUEST => $requestType,
    CURLOPT_POSTFIELDS => '',
    CURLOPT_HTTPHEADER => [],
));
$response = curl_exec($curl);
curl_close($curl);
libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($response);
$xpath = new DOMXPath($dom);

Great. XPath lets us search for elements in the document, so now we can easily get the data we need. If you don’t know how to work with XPath, devhints has a great cheat sheet.

You can also use Inspect Element in both Chrome and Firefox: right-click the element you want, then choose Copy > XPath.
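
If you want to practice XPath without hitting a real website, you can load a small HTML string instead. The sketch below uses made-up markup (it is not IMDb’s real HTML) and shows three common patterns: selecting by id, selecting by class, and selecting by position.

```php
<?php
// Made-up markup for illustration only; it is not IMDb's real HTML.
$html = <<<'HTML'
<html><body>
  <div id="chart">
    <h4>Weekend of Example</h4>
    <table>
      <tr><td class="title"><a href="/m/1">First Movie</a></td><td class="gross">$10M</td></tr>
      <tr><td class="title"><a href="/m/2">Second Movie</a></td><td class="gross">$5M</td></tr>
    </table>
  </div>
</body></html>
HTML;

libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

// By id, then a direct child element.
$heading = $xpath->query('//*[@id="chart"]/h4')->item(0);
echo $heading->nodeValue, "\n";               // Weekend of Example

// By class attribute, descending to the link inside each cell.
foreach ($xpath->query('//td[@class="title"]/a') as $link) {
    echo $link->nodeValue, ' -> ', $link->getAttribute('href'), "\n";
}

// By position: the second "gross" cell in the document.
$gross = $xpath->query('(//td[@class="gross"])[2]')->item(0);
echo $gross->nodeValue, "\n";                 // $5M
```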

The first thing I need is the heading of the page, where it says “Weekend of …”, because it shows the dates of the box office and I want to show that on my page as well.

With the help of Chrome’s Inspect Element I got its XPath. To search with my DOMXPath object I need to use the query() method, so let’s do it:

$weekendTitleNode = $xpath->query('//*[@id="boxoffice"]/h4');

This returns a list of nodes (a DOMNodeList). I can loop through it, or I can get a specific item with the item() method by passing the index I want. So to get the first element I can write:

$firstItem =  $weekendTitleNode->item(0);

Now I have the specific node and I want its value. I can get it by reading the nodeValue property:

$title = $firstItem->nodeValue;
echo $title;
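
One thing to keep in mind: if the page layout changes, the query can come back empty, item(0) is then null, and reading nodeValue from it fails. A defensive sketch could look like this. The sample markup and the fallback text are placeholders chosen for illustration.

```php
<?php
libxml_use_internal_errors(true);
$dom = new DOMDocument();
// Placeholder markup with no element whose id is "boxoffice".
$dom->loadHTML('<html><body><h1>Nothing here</h1></body></html>');
$xpath = new DOMXPath($dom);

$nodes = $xpath->query('//*[@id="boxoffice"]/h4');

// query() returns false for an invalid expression and an empty list for no match.
$firstItem = ($nodes !== false && $nodes->length > 0) ? $nodes->item(0) : null;
$title = $firstItem !== null ? trim($firstItem->nodeValue) : 'Title not found';

echo $title, "\n"; // Title not found
```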

Great, now let’s get to the movies. Each movie name is in a td element with the class titleColumn. So our XPath would be:

//td[@class='titleColumn']

Let’s query that and store it in a variable:

$movies = $xpath->query("//td[@class='titleColumn']");

It gives me a list. Now I can loop through it and get all the movies:

foreach ($movies as $movie) {
   echo $movie->nodeValue . "<br>";
}
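
Two details are worth knowing here. @class='titleColumn' only matches when the class attribute is exactly that string, and nodeValue includes any whitespace around the text. The sketch below, again on made-up markup, matches titleColumn as one class among several and cleans up the text before storing it in an array.

```php
<?php
// Made-up rows: the second cell carries an extra class and stray whitespace.
$html = <<<'HTML'
<html><body><table>
  <tr><td class="titleColumn"> <a href="/a">Movie A</a> </td></tr>
  <tr><td class="titleColumn highlighted">
      <a href="/b">Movie B</a>
  </td></tr>
</table></body></html>
HTML;

libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

// Matches "titleColumn" as a whole word inside the class attribute.
$expr = "//td[contains(concat(' ', normalize-space(@class), ' '), ' titleColumn ')]";

$titles = [];
foreach ($xpath->query($expr) as $cell) {
    // Collapse runs of whitespace into one space and strip both ends.
    $titles[] = trim(preg_replace('/\s+/', ' ', $cell->nodeValue));
}

print_r($titles);
```

Collecting the titles in an array instead of echoing them right away also makes it easier to sort, filter, or store them later.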

Perfect. I get the following list:

Weekend of April 15 - 17, 2022
Fantastic Beasts: The Secrets of Dumbledore
Sonic the Hedgehog 2
The Lost City
Everything Everywhere All at Once
Father Stu
Morbius
Ambulance
The Batman
K.G.F: Chapter 2
Uncharted

And the full code looks like this:

$curl = curl_init();
$requestType = 'GET'; // GET POST DELETE etc.
$url = 'https://www.imdb.com/chart/boxoffice';
curl_setopt_array($curl, array(
   CURLOPT_URL => $url,
   CURLOPT_RETURNTRANSFER => true,
   CURLOPT_ENCODING => '',
   CURLOPT_MAXREDIRS => 10,
   CURLOPT_TIMEOUT => 30,
   CURLOPT_FOLLOWLOCATION => false,
   CURLOPT_HTTP_VERSION => CURL_HTTP_VERSION_1_1,
   CURLOPT_CUSTOMREQUEST => $requestType,
   CURLOPT_POSTFIELDS => '',
   CURLOPT_HTTPHEADER => [],
));
$response = curl_exec($curl);
curl_close($curl);
libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($response);
$xpath = new DOMXPath($dom);
$weekendTitleNode = $xpath->query('//*[@id="boxoffice"]/h4');
$firstItem =  $weekendTitleNode->item(0);
$title = $firstItem->nodeValue;
echo $title;
echo "<br>";
$movies = $xpath->query("//td[@class='titleColumn']");
foreach ($movies as $movie) {
   echo $movie->nodeValue . "<br>";
}

In 30 lines of code we parsed the movie list on IMDb’s box office page. Isn’t it amazing?

With this knowledge, the possibilities are endless. One example that I have seen a lot is getting product data and pricing from different websites and comparing them on your own website.

https://youtu.be/lOnXY35YRsA

Conclusion

Now you know about web crawling in PHP.

I recommend that you open a PHP file, try sending a request to a website, extract some specific information from it, and show it to the user.

If you have any suggestions, questions, or opinions, please contact me. I’m looking forward to hearing from you!

Key takeaways

  • Sending requests with PHP cURL
  • Working with DOMDocument
  • Working with XPath
  • Finding specific data in a page
  • Web crawling and web scraping

Category: programming

Tags: #php #tips and tricks
