Web Scraping in Make.com (formerly Integromat): a secret hack that accelerates the process by 100x
Ugh! Why so slow!
Let me show you what I mean by the title:
Column B says “Conventional way”, meaning the approach you would normally take in Make.com to do web scraping.
Column C says “Fast way”, meaning this is the approach you should take to speed up your scraping process. It’s obviously a little more complicated than the conventional way, but it is extremely useful.
And it’s not just web scraping where this strategy comes in handy; you can apply this technique in numerous other situations.
Now, I wouldn’t call this technique beginner or intermediate level; I’d categorize it as advanced. That being said, it’ll probably look pretty easy once you get the hang of it.
So, without further ado, let’s start talking about the approach.
The conventional way
So, assuming you have a list of URLs, you would normally scrape the content from them like this:
It works totally fine, but the problem with this approach is that you are creating a long queue of operations because you cannot handle them in parallel in Make. That’s why it takes a relatively long time to go through all the URLs.
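If it helps to picture the bottleneck, here is a rough script equivalent of that flow in plain Python (the URLs and the XPath are just placeholders, not part of Make): nothing starts until the previous request has finished.

```python
# A rough equivalent of the "conventional way": every URL is handled one
# after another, so the total time grows with the number of URLs.
# The URLs and the XPath below are placeholders.
import requests
from lxml import html

urls = ["https://example.com/page-1", "https://example.com/page-2"]
xpath = "//h1/text()"

results = []
for url in urls:  # each iteration waits for the previous one to finish
    page = requests.get(url, timeout=30)
    tree = html.fromstring(page.content)
    results.append(tree.xpath(xpath))  # extract, then move on to the next URL
```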
That means if we could go through all the URLs simultaneously, it would obviously speed up the whole process.
Question is, how?
Want no-code automation tips and secrets? I got you covered.
Subscribe to my newsletter. Don’t worry. I can’t code either.
Solution (fast way)
I said earlier that Make cannot run a scenario in parallel. However, there’s an exception, and that is when you use a webhook as the trigger. So, what we’ll basically do is set up a scenario that keeps posting to the webhook trigger of another scenario, which then handles all the requests simultaneously.
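If the webhook part sounds abstract, think of it as fanning the work out to parallel workers instead of running one big loop. Here is a rough analogy in plain Python (placeholder URLs, and not Make’s actual mechanics); in Make, the “workers” are the parallel runs of the second scenario.

```python
# Rough analogy for the fast way: each URL gets its own worker, so the slow
# part (waiting for pages) happens concurrently instead of one at a time.
import requests
from concurrent.futures import ThreadPoolExecutor

urls = ["https://example.com/page-1", "https://example.com/page-2"]

def fetch(url: str) -> bytes:
    return requests.get(url, timeout=30).content

with ThreadPoolExecutor(max_workers=10) as pool:
    pages = list(pool.map(fetch, urls))  # all URLs in flight at the same time
```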
So, let’s start building!
First scenario
Here’s how we’ll set up our first scenario:
We search rows in a Google Sheet (or any database for that matter) then pass the URLs to the HTTP module. Let’s open the HTTP module to see how it’s set up.
It’s no different from calling an API. You post to the webhook URL of the second scenario (we’ll cover it later) with some query strings attached.
Although what you pass as query strings (or in the body) depends on your needs, the basic request should look something like the one above.
If you are testing things out, you can just follow what you see here.
- Name: url | Value: actual URL
- Name: xpath | Value: XPath query of the element you want to scrape
- Name: row_number | Value: actual row number
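For reference, here is a minimal sketch of the request that HTTP module sends for one row. The webhook URL, XPath, and row number below are placeholders; use the values from your own scenario.

```python
# One call per row: url, xpath, and row_number travel as query strings to
# the second scenario's webhook. The webhook URL here is a placeholder.
import requests

WEBHOOK_URL = "https://hook.integromat.com/your-webhook-id"  # placeholder

requests.post(
    WEBHOOK_URL,
    params={
        "url": "https://example.com/page-1",  # the page to scrape
        "xpath": "//h1/text()",               # the element to extract
        "row_number": 2,                      # the sheet row to update later
    },
    timeout=30,
)
```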
Second scenario
The second scenario looks like this:
We have a webhook module as the trigger.
In the second module (HTTP), let’s map the URL we received from the query string of the webhook module. You can leave the other fields empty and unchanged.
The third module will be an XML: Perform XPath Query module. You pass the entire HTML scraped in the previous HTTP module here along with the XPath that you received in the query string of the webhook.
Finally, we have a Google Sheets: Update a Row module. Start by choosing the right Sheet, then map the row number received via the webhook query string and the content you just extracted with the XPath query into the corresponding fields.
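Putting the second scenario together, here is a rough script equivalent of what a single webhook call triggers. The function names and the sheet update are illustrative stand-ins for the Make modules, not real APIs.

```python
# What one run of the second scenario does: take url / xpath / row_number
# from the webhook, fetch the page, run the XPath query, and update the row.
import requests
from lxml import html

def handle_webhook(url: str, xpath: str, row_number: int) -> None:
    page = requests.get(url, timeout=30)           # HTTP module
    tree = html.fromstring(page.content)
    matches = tree.xpath(xpath)                    # XML: Perform XPath Query module
    extracted = str(matches[0]) if matches else ""
    update_row(row_number, extracted)              # Google Sheets: Update a Row module

def update_row(row_number: int, value: str) -> None:
    # Stand-in for the Google Sheets update; in Make, the module does this for you.
    print(f"Row {row_number}: {value}")
```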
One thing you DON’T want to do here is return the extracted data back to the first scenario as if the webhook were an API endpoint. That makes the HTTP module you are posting to the webhook from wait for a response, again creating a long queue. What you should do instead is let the second scenario take care of updating the database as well, so that everything stays parallel.
Enjoy!
That’s all! Now, all that’s left to do is watch the data flood in.
Need help?
If you want me to build a scenario for you instead of getting stuck every step of the way, you can always talk to me here! https://bento.me/hide