CASE· 2026-03-08· n8n · JavaScript · Residential proxies

Market Research Automation

An n8n workflow that mines public market data into a structured, deduplicated database of potential customers. Built for teams that did this by hand.

Stack

· n8n
· JavaScript
· Residential proxies
· Google Sheets

Scope

1 person, ~4 weeks, in production

Result

400+ keywords processed per run; clean provider list with partner attribution, no duplicates, delivered by email

The biggest lesson from this build: not every workflow needs AI. The first version used a vision model to extract data. The final version uses no AI at all. Plain JavaScript, rule-based. Cheaper, faster, more reliable.

The problem

The team had to figure out, on a recurring basis, which providers were active in a given market and which of them looked like potential customers. The information is public: search results, directories, vertical portals. But gathering it by hand takes forever.

Dozens of results per keyword. Each result has to be checked, the provider identified, information compared. At 400–500 keywords and several thousand results per run, doing this manually is not realistic.

Why it ended up without AI

My first approach was an AI flow: screenshot result pages, feed them to a vision model. It didn't work. The model hallucinated data, invented provider names, mismatched fields, dropped entries. Depending on viewport, the same site renders content as carousel, sidebar, or grid. The model couldn't keep up with the shifting layouts.

So I rebuilt it. Structured data extraction from HTML doesn't need AI. A JavaScript parser looking for fixed patterns in the markup returns consistent results, regardless of how the page renders. No tokens spent, no hallucination risk.

The approach

Two phases. Phase 1 collects the public market data and extracts the relevant fields. Phase 2 aggregates, deduplicates, and enriches. A staging store sits between them so collection and analysis can run independently.

Keywords get processed in batches to respect API limits. Each batch runs to completion before the next starts. Requests route through proxy services; a cookie-banner handler keeps content reachable.

The hard parts

Writing the parser was the easy part. Most of the dev time went into edge cases, especially when scaling from small test runs to the full 400+ keywords.

The index problem

Under 10 results, everything was clean. The moment a new batch with another 10 results came in, fake duplicates appeared. If 9 of 10 requests come back (timeout, blocked request, empty response), the index shifts. Keyword 5 gets matched to keyword 6's results. The entire mapping breaks.

I fixed it by tying every result back to its original input directly. Not by array position. By explicit reference. Each result knows which keyword and which row it belongs to, no matter how many other requests failed.

The duplicate problem

A provider often shows up several times: once through their own account, once through a partner channel, sometimes both at once for different products. Dropping one of the rows loses the partner attribution.

Dedup fell out of the index fix. I first tried batching on other parameters, but in the end a "cluster and enrich" approach was needed: all entries for a provider get grouped. From the cluster, a master record is picked, with the most direct entry winning. The remaining entries are scanned for additional information and merged into the master. Two entries for the same provider get merged, not deleted.

Fuzzy duplicates

Tiny differences block clean grouping. Example GmbH and Example GmbH (trailing space) get treated as two providers. Casing, special characters, same problem. I normalise aggressively when grouping: lowercase, strip whitespace, strip special characters. Different spellings end up in the same cluster.

API constraints

Plenty of sites block automated requests without residential proxies. Cookie banners have to be accepted programmatically before content shows up. And export APIs often reject strings over a certain length. Long tracking URLs or HTML artefacts crash the export. Every field gets checked for length and special characters before it's saved.

What surprised me most: how hard it is to make this kind of automation run reliably at that size. Writing the parser is an afternoon. Validating the results until 400+ keywords and thousands of rows come through clean took weeks.

The result

What lands in the spreadsheet:

One provider per row, no duplicates
Partner attribution: which tools or services does the provider already use?
Multiple mappings if a provider operates across channels
Direct links to provider profiles
Keyword mapping: which search term surfaced which provider
Email notification when the run finishes

Feedback was strong. What used to be hours of manual research now runs end to end and produces a ready contact list. I'd set up flows like this faster today. I work through the problems more logically now. But every flow is different.

Written by

Eric Hinzpeter

Eric Hinzpeter, Senior B2B Content Strategist. He builds production AI agents and marketing automation, and documents the results here.

About LinkedIn

More cases4 of many