Complete Guide To Web Scraping Using Make (Formerly Integromat)
In this post, I’ll cover one of the handiest things you can do in Make: web scraping. You don’t need to write a single line of Python. Just follow along with this post and you’ll be on the right track!
I left a scenario template for Make that you can try out at the bottom of this post. It’s up to you whether you read through the post or skip straight to it!
If you need a much more detailed guide that covers a lot more about web scraping with Make, from small functions to scenario templates, I’ve just published a course that’s worth checking out.
Use the following discount code to get 20% off :)
IREADYOURMEDIUM
Setting up a Google Sheets module
Let’s say we have a Google Sheet where you have a list of links you want to extract content from. I collected some links from Product Hunt:
Now, go to Make and create a scenario. For the first module, we pick Google Sheets > Search Rows. If you have the links you want to scrape in Airtable, for example, you choose Airtable > Search Records here.
Go ahead and connect to the right Google Sheet. Setting up a filter is good practice. For the sake of this demo, I’ll simply say “Rows where links exist”. Let’s not worry about other fields for now.
Right click the module and press “Run this module only”.
Now, you should see the collections of links and other pieces of information like the image below.
If you don’t see the green check at this point, you might be doing something wrong. If you do see it, the next step is setting up the actual scraping!
Moving on to the HTTP module
Now, add an HTTP module right after the Google Sheets module. You access the URLs you want to scrape with an HTTP module. You don’t really have to know in detail what HTTP means for now. Just remember when you want to extract information, this is the module you will go for.
Choose Make a request.
The rest is very simple. Map the link from the previous Google Sheets module into the URL field of the HTTP module. Again, there’s no need to worry about other fields.
Let’s run once
Setting up the actual scraping part is pretty much done. That was fast! Let’s run the scenario once and see what it does. Before hitting the Run once ▶️ button, go back to the Google Sheets module and set the Maximum number of returned rows to 1. This ensures the test run applies to only one link. If you have 1,000 links, say, and don’t limit the number of returned rows, a simple test will actually scrape all 1,000 links (unless you forcefully stop the scenario), and you don’t really want that.
Once you are done with the setting, it’s time to Run once ▶️.
Yes! We got the content under the Data label. We didn’t write a single line of code to achieve this. And we’re not done!
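Under the hood, what the HTTP module just did is roughly a plain GET request. Here’s a minimal sketch of the equivalent in Python, using only the standard library; the URL is a placeholder, not one from the demo sheet.

```python
# A rough sketch of what Make's HTTP "Make a request" module does for
# each row: issue a GET request to the mapped URL and return the
# response body as Data. The URL below is a placeholder.
from urllib.request import Request, urlopen

url = "https://www.producthunt.com/posts/some-product"

# Build the request; in Make, the URL is mapped in from the Google
# Sheets module instead of being hard-coded.
request = Request(url, headers={"User-Agent": "Mozilla/5.0"})

# Uncommenting the next two lines would perform the actual fetch:
# with urlopen(request) as response:
#     html = response.read().decode("utf-8")

print(request.full_url)      # the URL the module will call
print(request.get_method())  # GET
```

In the scenario, the full response body is what shows up under the Data label.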
Extracting the information you want
While being able to extract the full HTML from a webpage feels great, often what you want isn’t the whole thing. You might be after the title, URL, image, or author. We’ll now look at how to extract the right information from the full HTML.
Let’s add an XML module.
After clicking on XML, choose Perform XPath Query. This lets you specify exactly which part of the full content you want to extract.
When you are done choosing the right module, you will be shown two fields.
The first field is XML. Here, map the Data item from the previous HTTP module’s output.
Now, the second field, XPath Query, is where things get a little tricky, but we’ll proceed one step at a time. Remember that the links in my Google Sheet point to Product Hunt? Let’s look at one of the pages and decide what to extract.
For this demo, I’ll go for the product name and the description as highlighted above. Now, open the Developer Tools and navigate to the element you want to extract (the title and description, in my case).
Then right click on the HTML element that holds the content. In my case, an h1 tag holds the product name so I right click on that, and then, Copy > Copy XPath.
Go back to Make, and paste whatever you copied into the XPath Query field. Press OK when you are done.
Let’s Run once ▶️. If it works, you are all set!
Unfortunately, as you can see below, I didn’t get any output, even though the scenario ran without errors. That means we didn’t specify the right XPath. This can happen even with a copy-pasted XPath: the HTML your browser renders (and that DevTools copies paths from) often differs from the raw HTML the HTTP module receives, so an absolute path can point at a node that simply isn’t there. We have to go with something else.
Now, there are other ways to write the XPath we’re after. You could write it by hand, but that can get a little complicated. So let’s do this instead:

The XPath we copied earlier depends on the structure of the full HTML, but that’s not the only option: you can also build an XPath from just the element in question, and it won’t come in the same format. Ask ChatGPT to generate several candidate XPaths for you, try them, and at least one should work!
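To see why the absolute DevTools path can fail while a shorter query works, here’s a small offline sketch using Python’s standard-library XPath subset. The page markup and product name below are made up for illustration, not the real Product Hunt HTML.

```python
import xml.etree.ElementTree as ET

# Made-up page markup (an assumption, not real Product Hunt HTML)
# where the product name sits inside an h1.
html = """
<html>
  <body>
    <div class="header"></div>
    <div class="main">
      <h1>Some Product</h1>
      <p class="description">A short product description.</p>
    </div>
  </body>
</html>
"""
root = ET.fromstring(html)

# An absolute, DevTools-style path assumes the exact page structure.
# If the served HTML differs by even one wrapper div, it finds nothing:
absolute = root.find("body/div[1]/h1")   # wrong branch -> None
print(absolute)

# A relative query that only names the element is far more robust:
relative = root.find(".//h1")
print(relative.text)                     # Some Product

# The same idea works with an attribute predicate:
description = root.find(".//p[@class='description']")
print(description.text)
```

In Make’s XPath Query field, the equivalent robust queries would be something like //h1 and //p[@class='description'], assuming the target elements are that easy to pin down on the real page.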
Do the same for other pieces of information you want to extract. I’ll go ahead and add another XML module to extract the product description.
Putting data back into Google Sheets
We’ve successfully built a scenario that extracts information from web pages. What we want to do now is to put the extracted information back into the Google Sheet we are working with.
I’ve added two columns in the Google Sheet. One for product names, the other for product descriptions.
Let’s go ahead and add a Google Sheets > Update a Row module. We don’t want the Add a Row module here, because we’re updating the existing rows where the links already live.
Moving on to the setup. After choosing the right sheet, you’ll need to fill out the Row number field and the columns you want to update. For Row number, look at the very first Google Sheets module: its search operation returns the row number of the link you’re working with, so map that into the field.
You can leave the link field empty, along with any other column that already has a value. Leaving a field empty on the Make side won’t overwrite the existing text in Google Sheets.
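The difference between Update a Row and Add a Row, and the behavior of empty fields, can be sketched in plain Python over an in-memory sheet. Column names, links, and values below are illustrative, not from the real sheet.

```python
# An in-memory stand-in for the sheet: a header row plus data rows.
# All names and values here are illustrative placeholders.
sheet = [
    ["Link",                       "Product Name", "Description"],
    ["https://example.com/post-1", "",             ""],
    ["https://example.com/post-2", "",             ""],
]

def update_row(sheet, row_number, updates):
    """Update a Row: write only the mapped columns of an EXISTING row.

    Columns left empty in Make (absent from `updates`) keep whatever
    the sheet already holds -- here, the Link column is untouched.
    Row numbers are 1-based and count the header row, matching the
    row number the Search Rows module returns.
    """
    header = sheet[0]
    row = sheet[row_number - 1]
    for column, value in updates.items():
        row[header.index(column)] = value

# Row 2 is the first data row (row 1 is the header).
update_row(sheet, 2, {"Product Name": "Some Product",
                      "Description": "A short product description."})

print(sheet[1])  # link preserved, name and description filled in
```

Add a Row, by contrast, would append a brand-new row at the bottom, leaving the original link rows empty, which is exactly what we want to avoid here.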
Finally, clear the number we entered into Maximum number of returned rows in the first Google Sheets module. You could set it to 10, 50, 100, or whatever suits your needs, but it doesn’t matter for this demo, so I’ll just leave it empty.
And there you have it! We’ve built a web scraping tool in about 15 minutes. All that’s left is to watch the data flow into your Google Sheet.
Let’s hit Run once ▶️ and see if it does what it’s supposed to do.
Perfect! If you have a lot of links, say 1,000 or 10,000 or more, or you expect another scenario or other people to keep adding links here, you might want to schedule the scenario to run periodically. If not, Run once ▶️ will do the job just fine.
Get your template!
I’ve left the Google Sheet I used in this demo here, along with the Make scenario. Feel free to give them a try!