Complete Guide To Web Scraping Using Make.com (Formerly Integromat)

Hideyuki Shibata
11 min read · Nov 2, 2023


Let’s learn one of the coolest things you can do with Make.com: web scraping. We’re not going to write a single line of Python code. Just follow the steps in this post and the next thing you know, your coworkers will be turning to you for all their web scraping needs.

This is what we are building today 🛠️

First things first: get your template!

Here’s the Google Sheet we’re going to use in this demo, and of course, the Make.com scenario template.

🔗 Get the template here!

You’ll be asked to sign up to my automation newsletter to get access, where I regularly share some of my best resources:

  • Exclusive templates: Ready-to-use templates that save you hours on setup.
  • Automation secrets and hacks: Expert insights to optimize workflows and unlock hidden potential.
  • Top emerging automation tools: Stay ahead with the latest innovations and insights.

🔗 Get the template here!

Setting up a Google Sheets module

Let’s say you have a Google Sheet with a list of links you want to extract content from. I collected some links from Product Hunt:

List of links from Product Hunt 🔗

Now, go to your Make account and create a scenario.

Pro tip: Try typing ‘add’ in your address bar and see if it remembers the ‘Create a new scenario’ URL (https://your_region.make.com/your_org_id/scenarios/add). If it does, press enter for quick launch!

If you have my template imported, all you need to do is press ‘Start Guided Setup’ and follow the provided steps.

Much easier with the template ⚙️

If you decided to follow along and build from scratch, well, let’s build from scratch.

For the first module (the word module refers to each of those little balls in your Make scenario), we pick Google Sheets > Search Rows. If the links you want to scrape live in Airtable, for example, you’d choose Airtable > Search Records instead.

Google Sheets > Search Rows

Go ahead and connect your Google account in the module. Setting up a filter is a good practice. For this demo, I’m going to say “Find the rows that have the link column filled, and the Product name column not filled, and the Product desc column not filled.”

🟢 Meets criteria 🔴 Does not meet criteria

This ensures that already-filled rows are not returned. This is important because if they ARE returned, you’ll end up scraping them again, wasting your Make operations (which is basically what you are paying Make for).
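In Make’s filter editor, that condition looks roughly like this (the operator names are from memory and may vary slightly between versions):

  • Link: Exists
  • AND Product name: Does not exist
  • AND Product desc: Does not exist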

Filter setup

If you scroll down, you should see the field named ‘Limit’. I recommend setting this value to 1 while the scenario is still in testing. Setting it to 1 means we only retrieve the first row that matches our criteria, and we’ll use that one row for the experiments that follow.

We’ll remove the 1 when publishing the scenario.

After closing the module settings by pressing OK, right-click on the module and press ▶️ Run this module only.

It’s always good practice to “Run this module only” for easier mapping of the module’s output in a later step. Mapping Google Sheets’ output is not hard, so this isn’t mandatory here, but I recommend making it a habit for when you start doing things like custom API calls.

Now, you should be able to see the first link and everything the Google Sheets module outputs.

Good start 🏃

Fun part: Setting up the HTTP module

Let’s go ahead and add the HTTP > Make a request module right next to the Google Sheets module.

HTTP > Make a request

What does the HTTP > Make a request module do?

When you make a request to a website via HTTP, it’s like you’re opening the website but in an automated way. Instead of visiting the site manually, you’re asking Make to open it and retrieve the source code for you.
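Under the hood, this is a plain HTTP GET exchange. Here’s a simplified sketch (the URL and headers are illustrative, not exactly what Make sends):

GET /posts/example-product HTTP/1.1
Host: www.producthunt.com

HTTP/1.1 200 OK
Content-Type: text/html; charset=utf-8

<!DOCTYPE html>
<html>…the page’s entire source code…</html>

That response body, the raw HTML, is what we’ll be digging into next.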

Let’s start mapping!

Map the link from the previous Google Sheets > Search Rows module into the URL field of the HTTP > Make a request module, and press OK.

I hope you’re still with me 😅

Leave the other fields untouched.

Let’s ▶️ Run once

Make sure to save the scenario by pressing 💾 or ⌘ + s (Ctrl + s for Windows).

This is what we’re after!

We have the content we’re after under the Data label. Pretty amazing that we didn’t write a single line of code to get this done. But we’re not done yet.

Snag the Good Stuff Only

So, basic web scraping is done. Great. Let’s see what we have:

😅

Well, obviously, we don’t want the entire source code. So, what do we do?

Meet XPath 👋

With XPath, we can precisely target specific elements in the source code. It’s like saying, “Access that thing inside that other thing that lives inside this bigger thing.”
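For example, an XPath like the one below (a generic illustration, not Product Hunt’s actual markup) means “every h1 that sits inside a section, which sits inside main”:

//main/section/h1

And //h1[@class="title"] would mean “every h1 whose class attribute is exactly title”.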

Let’s go ahead and add an XML > Perform XPath Query module.

XML > Perform XPath Query

After selecting the module, you will see two fields.

Two fields to fill ✏️
  1. XML: Map the Data from the previous HTTP > Make a request module output.
  2. XPath Query: This is where you enter an XPath (more on this right after the screenshot below).

Data → XML

The second field “XPath Query” is where things get a little tricky, but we’ll do this one step at a time.

Remember that the list of links in my Google Sheet is from Product Hunt?

List of links from Product Hunt 🔗

Let’s open the link in the first row, which is a page for a product called CV Score.

And, as you can see in the Google Sheet, here’s what we are after:

  1. Product name
  2. Product description

🔴 Product name 🟢 Product description

Select the product name (CV Score), right-click on it, and select Inspect. This will open up Chrome DevTools.

Select product name → Right-click → Inspect 🔍

This is what you should be seeing.

What’s happening is that DevTools is showing us where the product name (CV Score) lives in the entire source code, and that is exactly how you locate whatever element you’re trying to extract.

Let’s continue. Right-click on the element (<h1 class="text-18 sm…), then hover over Copy, then click Copy XPath.

If Copy XPath doesn’t give you the thing you are after, try Copy full XPath

With the XPath in your clipboard, let’s go back to Make and paste the XPath into the XPath Query field.

Don’t forget to press 🆗

Let’s ▶️ Run once and see if the product name (CV Score) comes right out!

We did it! (kind of)

We’re very close! We just don’t want the h1 tag wrapping the product name (CV Score). To get only the text inside the tag, we can add /text() at the end of the XPath, like so:

your_xpath/text()
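For instance, if Copy XPath gave you something like //*[@id="__next"]/main/div/h1 (a made-up path, just for illustration), the final query would be:

//*[@id="__next"]/main/div/h1/text()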

With that added, let’s try again.

Add /text() to extract the text

▶️ Run once

We did it 🥳

Once that’s done, let’s go ahead and do the same for the product description (Make your CV stands out with AI).

🔴 Product name 🟢 Product description

Add the XML > Perform XPath Query module again and paste the XPath for the product description (Make your CV stands out with AI).

Don’t forget to add /text()

▶️ Run once, and as you can see in the below image, we have successfully extracted the product description.

Nicely done!

What do I do if XPath isn’t returning anything?

There are times when the XPath copied from Chrome DevTools doesn’t work. When that happens, you end up with this screen:

No output

The XML > Perform XPath Query module isn’t returning an output because it didn’t find any match. Yes, even when you copy the XPath straight from Chrome DevTools and paste it into the right field, you can end up with no output. Weird, but I’ve seen it happen.

Luckily, we can try another XPath, because there’s no single right XPath. You can write different XPaths that point to the same element in a bunch of ways: more specific or broader, using things like classes, IDs, and so on.
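As an illustration, all of the following could point at the same product-name h1 (the class name and page structure here are made up, so treat this as a sketch):

//*[@id="__next"]/main/div/section/h1/text() (very specific, and brittle)
//main//h1[1]/text() (broader: the first h1 anywhere under main)
//h1[contains(@class, "text-18")]/text() (matches by class instead of position)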

But I think we can skip learning every little detail of XPath because we can just ask ChatGPT to make one for us.

Here’s the prompt:

<div class="flex flex-ro… (omitted for brevity)

Write some XPaths for ‘{Element you want to access}’. Keep it relatively broad as I’ll be using it for different pages with the same structure.

🔗 https://chatgpt.com/share/671bf113-824c-800e-8e74-cb9d08d13130

It works better if you include some parent elements in the HTML you paste into ChatGPT.

This is just an example: 🔴 What you are after 🟢 Try copying this element

ChatGPT should construct multiple XPaths for you, and at least one of them should work.

Final Step: Syncing Data Back to Google Sheets

This is what our scenario currently looks like

So far, we have successfully built a scenario that takes care of web scraping, which is super-exciting, but our Google Sheet is actually still empty.

😅

So, let’s update these rows with:

  1. Product name
  2. Product description

Add the Google Sheets > Update a Row module at the end of your scenario.

Almost there! This is our final module ⛰️

Let’s not mistake the Update a Row module for the Add a Row module here.

  • Update a Row: Takes a row number to update as an input
  • Add a Row: Does not take a row number as an input

Row number? What is that?

The numbers you see on the far left of your sheet. Those are the row numbers.

Row numbers

We need these numbers to update the right rows. If we want to add a product name and product description to the first link, we need to say “Update row 2 with these values”.

It might be easier to demonstrate. Let’s just go ahead and keep setting up the Update a Row module first.

For the Search Method field, select Enter manually. This is where you select your spreadsheet and the right Sheet (tab). You can also select Search by path, but I recommend Enter manually to keep your scenario more resilient to change.

Enter manually so we can map values into the fields

After selecting Enter manually, map the output of the first Google Sheets > Search Rows module.

Sorry for the mess 😅

What we’re doing here is telling the Google Sheets > Update a Row module: “Access the spreadsheet and the Sheet (tab) we accessed at the beginning of the automation, and update the row we retrieved along the way. Here’s the row number of that row, so you know exactly which row I want you to update.”
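Conceptually, the mapping looks like this (the exact field labels may differ slightly in your version of Make):

  • Spreadsheet ID ← Spreadsheet ID from Search Rows
  • Sheet Name ← Sheet Name from Search Rows
  • Row number ← Row number from Search Rows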

And that is why we need to know about row numbers!

Almost done!

Map the outputs of the XML > Perform XPath Query modules into the corresponding Google Sheets columns.

B: Product name | C: Product description

Now, press OK and save the scenario by pressing the 💾 icon or ⌘ + s (Ctrl + s for Windows). Then, ▶️ Run once.

Did it work?

Mine did! 🥳

If it did work, we can finally delete the 1 from the very first Google Sheets > Search Rows module. You can leave this field empty, or set it to 10, 50, 100, and so on, depending on how many links you have and how many minutes you set the run interval to. For this demo, I’ll just leave it empty and let the scenario take care of all the empty rows.

Let’s remove the 1 and leave it empty

And there you have it! All done!

We were able to build a web scraper in about 15 minutes, solely with Make.com and Google Sheets. All that’s left to do is watch the data flow in.

👏

Let’s hit ▶️ Run once and see it in action.

Success!

From here, you can build things like a scheduled web scraper that regularly retrieves links from a webpage and extracts content from the links. But if your web scraping project only needs a single run, ▶️ Run once does the job.

Want to learn more?

Web scraping with Make.com is much more than what we covered in this post. To unlock its full potential, check out my text-based course that goes beyond the basics. From powerful functions to ready-to-use scenario templates, this guide has everything you need to master web scraping with Make.

Sneak peek 🤫

Get access here:

.

.

.

.

.

.

.

.

.

.

Oh, hi, thanks for reaching the very end of this post.

I’ll leave a 20% discount code for you 😉

IREADYOURMEDIUM
