Complete Guide To Web Scraping Using Make.com (Formerly Integromat)
Let’s learn one of the coolest things you can do with Make.com: web scraping. We’re not going to write a single line of Python code. Just follow the steps in this post and the next thing you know, your coworkers will be turning to you for all their web scraping needs.
First things first: get your template!
Here’s the Google Sheet we’re going to use in this demo, and, of course, the Make.com scenario template.
You’ll be asked to sign up for my automation newsletter to get access. It’s where I regularly share some of my best resources:
- Exclusive templates: Ready-to-use templates that save you hours on setup.
- Automation secrets and hacks: Expert insights to optimize workflows and unlock hidden potential.
- Top emerging automation tools: Stay ahead with the latest innovations and insights.
Setting up a Google Sheets module
Let’s say you have a Google Sheet with a list of links you want to extract content from. I collected some links from Product Hunt:
Now, go to your Make account and create a scenario.
If you have my template imported, all you need to do is press ‘Start Guided Setup’ and simply follow the provided steps.
If you decided to follow along and build from scratch, well, let’s build from scratch.
For the first module (the word module refers to each of those little balls in your Make scenario), we pick Google Sheets > Search Rows. If you have the links you want to scrape in Airtable, for example, you choose Airtable > Search Records here.
Go ahead and connect your Google account in the module. Setting up a filter is a good practice. For this demo, I’m going to say “Find the rows that have the link column filled, and the Product name column not filled, and the Product desc column not filled.”
This ensures that already-filled rows are not returned. This is important because if they ARE returned, you’ll end up scraping them again, wasting your Make operations (which is basically what you are paying Make for).
If you scroll down, you should see the field named ‘Limit’. I recommend setting this value to 1 while the scenario is still in testing. Setting it to 1 means we only retrieve the first row that matches our criteria, and we’ll use that one row for the experiments that follow.
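You won’t write any code for this in Make, but if it helps to see the logic spelled out, here’s a minimal Python sketch of what the filter and the Limit field do together (the column names mirror the demo sheet; the rows and links are made up):

# A toy model of what Search Rows does with our filter and Limit.
rows = [
    {"Link": "https://example.com/a", "Product name": "CV Score", "Product desc": "Already scraped"},
    {"Link": "https://example.com/b", "Product name": "", "Product desc": ""},
]

LIMIT = 1  # the same idea as the module's Limit field while testing

matches = [
    row for row in rows
    if row["Link"] and not row["Product name"] and not row["Product desc"]
][:LIMIT]

print(matches)  # only rows that still need scraping come back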
After closing the module settings by pressing OK, right-click on the module and press ▶️ Run this module only.
Now, you should be able to see the first link and everything outputted by the Google Sheets module.
Fun part: Setting up the HTTP module
Let’s go ahead and add the HTTP > Make a request module right next to the Google Sheets module.
What does the HTTP > Make a request module do?
When you make a request to a website via HTTP, it’s like you’re opening the website but in an automated way. Instead of visiting the site manually, you’re asking Make to open it and retrieve the source code for you.
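You won’t write this yourself (that’s the whole point of Make), but for intuition, here’s roughly the equivalent in Python, using only the standard library. The URL is just a stand-in for a link from your sheet:

# Roughly what HTTP > Make a request does: fetch a URL, return its source code.
import urllib.request

url = "https://example.com"  # stand-in for a link from your Google Sheet
with urllib.request.urlopen(url) as response:
    html = response.read().decode("utf-8")

print(html[:200])  # the raw source code, i.e. what Make shows under the Data label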
Let’s start mapping!
Map the link from the previous Google Sheets > Search Rows module into the URL field of the HTTP > Make a request module, and press OK.
Leave the other fields untouched.
Let’s ▶️ Run once
We have the content we’re after under the Data label. Pretty amazing that we didn’t write a single line of code to get this done. But we’re not done yet.
Snag the Good Stuff Only
So, basic web scraping is done. Great. Let’s see what we have:
Well, obviously, we don’t want the entire source code. So, what do we do?
Meet XPath 👋
With XPath, we can precisely target specific elements in the source code. It’s like saying, “Give me the thing inside that other thing that lives inside this one.”
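In Make, you’ll only ever paste the XPath string itself. Still, if you want to watch an XPath do its job, here’s a small Python sketch using the third-party lxml library (pip install lxml) on a tiny hand-written page; the markup and class names are invented for the example:

# An XPath picking one element out of a page.
from lxml import html

page = html.fromstring("""
<html><body>
  <div class="product">
    <h1 class="title">CV Score</h1>
    <p class="tagline">Make your CV stands out with AI</p>
  </div>
</body></html>""")

# "The h1 inside the div whose class is 'product'":
print(page.xpath('//div[@class="product"]/h1/text()'))  # ['CV Score']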
Let’s go ahead and add an XML > Perform XPath Query module.
After selecting the module, you will see two fields.
- XML: Map the Data from the previous HTTP > Make a request module output.
- XPath Query: This is where you enter an XPath (I’ll talk about this right after the screenshot below).
The second field “XPath Query” is where things get a little tricky, but we’ll do this one step at a time.
Remember that the links in my Google Sheet are from Product Hunt?
Let’s open the link in the first row, which is the page for a product called CV Score.
And, as you can see in the Google Sheet, here’s what we are after:
- Product name
- Product description
Select the product name (CV Score), right-click on it, and select Inspect. This will open up Chrome DevTools.
This is what you should be seeing.
What’s happening is that DevTools is showing us where the product name (CV Score) lives in the entire source code, and that is exactly how you locate whatever element you are trying to extract.
Let’s continue. Right-click on the element (<h1 class="text-18 sm…), hover over Copy, then click Copy XPath.
With the XPath in your clipboard, let’s go back to Make and paste the XPath into the XPath Query field.
Let’s ▶️ Run once and see if the product name (CV Score) comes right out!
We’re very close! We just don’t want the h1 tag wrapping the product name (CV Score). To get only the text inside the tag, add /text() at the end of the XPath, like so:
your_xpath/text()
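Here’s the difference /text() makes, sketched once more with lxml purely for illustration (in Make, you only edit the XPath Query field):

# Without /text() you get the element itself; with it, just the inner text.
from lxml import html

page = html.fromstring("<html><body><h1 class='title'>CV Score</h1></body></html>")
print(page.xpath('//h1'))         # [<Element h1 at 0x...>] -- the whole tag
print(page.xpath('//h1/text()'))  # ['CV Score'] -- the text only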
With that added, let’s try again.
Once that’s done, let’s go ahead and do the same for the product description (Make your CV stands out with AI).
Add the XML > Perform XPath Query module again and paste the XPath for the product description (Make your CV stands out with AI).
▶️ Run once, and as you can see in the below image, we have successfully extracted the product description.
What do I do if XPath isn’t returning anything?
There are times when the XPath copied from Chrome DevTools doesn’t work. When that happens, you end up with this screen:
The XML > Perform XPath Query module isn’t returning any output because it didn’t find a match. Even when you copy the XPath straight from Chrome DevTools and paste it into the right field, you can still get no output. Weird, but I’ve seen it happen.
Luckily, we can just try another XPath, because there’s no single right one. You can write different XPaths that point to the same element in a bunch of ways: more specific or broader, using things like classes, IDs, and so on.
But I think we can skip learning every little detail of XPath because we can just ask ChatGPT to make one for us.
Here’s the prompt:
<div class=”flex flex-ro… (omitted for brevity)
Write some XPaths for ‘{Element you want to access}’. Keep it relatively broad as I’ll be using it for different pages with the same structure.
It works better if some parent elements are included in the HTML pasted into ChatGPT.
ChatGPT should construct multiple XPaths for you, and at least one of them should work.
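If you’d like to sanity-check the candidates before pasting them into Make, here’s a sketch that tries several XPaths in order with lxml. The XPaths and class names below are typical examples of what ChatGPT might produce, not the real Product Hunt markup:

# Try candidate XPaths from most specific to broadest; the first match wins.
from lxml import html

page = html.fromstring('<html><body><h1 class="text-18 sm">CV Score</h1></body></html>')

candidates = [
    '//*[@id="__next"]/main/div/h1/text()',      # very specific: breaks easily
    '//h1[contains(@class, "text-18")]/text()',  # broader: matches by class
    '//h1/text()',                               # broadest: any h1 on the page
]

for xpath in candidates:
    result = page.xpath(xpath)
    if result:
        print(xpath, "->", result)  # the first XPath that finds the element
        break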
Final Step: Syncing Data Back to Google Sheets
So far, we have successfully built a scenario that takes care of web scraping, which is super-exciting, but our Google Sheet is actually still empty.
So, let’s update these rows with:
- Product name
- Product description
Add the Google Sheets > Update a Row module at the end of your scenario.
Let’s not mistake the Update a Row module for the Add a Row module here.
- Update a Row: Takes a row number to update as an input
- Add a Row: Does not take a row number as an input
Row number? What is that?
The numbers you see on the far left of your sheet. Those are the row numbers.
We need these numbers to update the right rows. If we want to add a product name and product description next to the first link, we need to say “Update row 2 with these values” (row 1 is the header row).
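To make that concrete, here’s a toy model in Python of a sheet being updated by row number (the link is a placeholder):

# Row 1 holds the headers, so the first link lives in row 2.
sheet = [
    ["Link", "Product name", "Product desc"],  # row 1: header
    ["https://example.com/cv-score", "", ""],  # row 2: first link, not yet filled
]

def update_row(sheet, row_number, name, desc):
    # Google Sheets counts rows from 1; Python lists count from 0.
    sheet[row_number - 1][1] = name
    sheet[row_number - 1][2] = desc

update_row(sheet, 2, "CV Score", "Make your CV stands out with AI")
print(sheet[1])  # the first data row, now filled in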
With that in mind, let’s go ahead and keep setting up the Update a Row module.
For the Search Method field, select Enter manually. This is where you select your Google Sheet and the right Sheet (tab). You can also select Search by path, but I recommend Enter manually to make your scenario less prone to breaking.
After selecting Enter manually, map the row number from the output of the first Google Sheets > Search Rows module.
What we’re doing here is telling the Google Sheets > Update a Row module: “Access the spreadsheet and the Sheet (tab) we used at the beginning of the automation, and update the row we retrieved along the way. Here’s its row number, so you know exactly which row to update.”
And that is why we need to know about row numbers!
Almost done!
Map the outputs of the XML > Perform XPath Query modules into the corresponding Google Sheets columns.
Now, press OK and save the scenario by pressing the 💾 icon or ⌘ + s (Ctrl + s for Windows). Then, ▶️ Run once.
Did it work?
If it did work, we can finally delete the 1 from the Limit field of the very first Google Sheets > Search Rows module. You can leave this field empty, or set it to 10, 50, 100, and so on, depending on how many links you have and how many minutes you set the run interval to. For this demo, I’ll just leave it empty and let the scenario take care of all the empty rows.
And there you have it! All done!
We were able to build a web scraper in about 15 minutes, using nothing but Make.com and Google Sheets. All that’s left to do is watch the data flow in.
Let’s hit ▶️ Run once and see it in action.
From here, you can build things like a scheduled web scraper that regularly retrieves links from a webpage and extracts content from the links. But if your web scraping project only needs a single run, ▶️ Run once does the job.
Want to learn more?
Web scraping with Make.com is much more than what we covered in this post. To unlock its full potential, check out my text-based course that goes beyond the basics. From powerful functions to ready-to-use scenario templates, it has everything you need to master web scraping with Make.
Get access here:
.
.
.
.
.
.
.
.
.
.
Oh, hi, thanks for reaching the very end of this post.
I’ll leave a 20% discount code for you 😉
IREADYOURMEDIUM