Build a webscraper with node js and puppeteer

Siva Kishore G

Posted : 11 Apr 2022
Modified : 21 Jun 2022


Scrape Emails With Puppeteer

Email marketing is still the #1 lead getter. There are companies that will sell you business info along with emails, phone numbers etc. But if you are a developer like me, you would want to cut these costs and build your own list of contacts. In this tutorial, we will go over the steps of scraping emails from publicly available websites.

Scraping is legal as long as the targeted websites are available publicly

Prerequisites

MongoDB

MongoDB is an excellent database to work with, and it even has a free version.

Node JS

Node JS is a fast JavaScript runtime with non-blocking I/O, built on Chrome's V8 engine. It is a good fit for projects that spend most of their time waiting on the network or disk rather than on heavy computation.

Puppeteer

Puppeteer is a Node library that drives a Chromium browser, headless by default, which means the browser runs in the background and exposes an API for us (developers) to interact with. It can be used for testing, scraping and auditing websites. With the advent of SPAs (single page applications), where the content is loaded dynamically, traditional scraping of the raw HTML will not work on those sites. So we need a real browser to scrape the data. Puppeteer launches a Chromium browser in the background.

Task Manager showing chromium as background

Flow of the App

Let's discuss the steps involved in building this email scraper. The whole point of getting emails is to have some sort of campaign going on. We will choose MongoDB to hold the data. In my opinion MySQL is a better choice if you want to develop more features like email tracking, event hooks, subscribe, unsubscribe etc. Without further ado, let's get into it.

Choose a search term
Scrape the results using puppeteer
Extract the link and the title
Save the values into the database
Redo all the above for page 2 OR Redo all the above with a different keyword (Optional)
Fetch urls from the database one by one and run email scraper
Clean the emails from unnecessary noise
Save the emails back to the database
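
The last three steps amount to a loop that handles one url at a time. Here is a minimal sketch of that loop, assuming a hypothetical scrapeOne(url) function that resolves to an array of emails (neither name appears elsewhere in this tutorial). A failure on one site is recorded and the loop moves on.

```javascript
// Hypothetical helper: runs scrapeOne(url) for each url, one at a time.
// scrapeOne is assumed to return a promise that resolves to an array of emails.
async function processSequentially(urls, scrapeOne) {
  const results = [];
  for (const url of urls) {
    try {
      const emails = await scrapeOne(url);
      results.push({ url, emails });
    } catch (err) {
      // A failing site should not stop the rest of the queue
      results.push({ url, emails: [], error: err.message });
    }
  }
  return results;
}
```

Running the urls one by one keeps a single page open at a time, which is gentler on memory than opening them all in parallel.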

Setup

MongoDB

Since we will be running the scraper in loops, it's important to reuse the mongodb connection to prevent memory leaks.

npm i --save mongodb

// mongoPool.js
const mongo = require('mongodb').MongoClient;
let mUrl = '<REPLACE WITH YOUR MONGODB URL>'
var connection = null

function getConnection(cb) {
  if (connection != null) {
    // console.log("Connection reused")
    cb(connection)
  } else {
    mongo.connect(mUrl, {
      useNewUrlParser: true,
      useUnifiedTopology: true
    }, function(err, db) {
      if (err == null) {
        // console.log("Connection Created")
        connection = db
        cb(connection)
      } else {
        // console.log("Connection Failed " , err)
        connection = null
        cb(connection)
      }
    })
  }
}
module.exports.getConnection = getConnection
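
One thing to watch with the callback version: if getConnection is called twice before the first connect finishes, both calls open a connection. A possible promise-based variation is sketched below. createPool is a hypothetical helper name, and the sketch assumes the connect function you pass in returns a promise.

```javascript
// Hypothetical helper: wraps any promise-returning connect() so it runs at most once.
function createPool(connect) {
  let pending = null; // the in-flight or already resolved connection promise
  return function getConnection() {
    if (pending === null) {
      pending = connect().catch((err) => {
        pending = null; // forget the failure so the next call retries
        throw err;
      });
    }
    return pending;
  };
}

// Possible usage with the mongodb driver (MongoClient.connect returns a
// promise when no callback is passed):
// const { MongoClient } = require('mongodb');
// const getConnection = createPool(() => MongoClient.connect(mUrl));
// const client = await getConnection();
```

Every caller receives the same promise, so concurrent callers share one connection attempt.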

Puppeteer

Similarly, we want to keep one browser instance open and reuse it, opening and closing only the pages programmatically. Puppeteer has all that we need to scrape without requiring any other packages like "cheerio", "request" etc.

npm i --save puppeteer

// google.js
const puppeteer = require('puppeteer');
var browser = null;

async function initializeLaunch(){
  if(browser === null){
    browser = await puppeteer.launch();
  }
}

async function search(query){
  try{
    await initializeLaunch()
    const page = await browser.newPage();
    await page.goto('https://www.google.com/');
    await page.waitForSelector('input[aria-label="Search"]', {visible: true});
    // Inject code here to Scrape results
    await page.close()
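  }catch(err){
    // browser stays null if the launch itself failed, so guard before closing
    if(browser !== null) await browser.close()
    browser = null
  }
}

async function closeBrowser(){
  if(browser !== null){
    await browser.close()
    browser = null // lets initializeLaunch() start a fresh browser later
  }
}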

module.exports.search = search
module.exports.closeBrowser = closeBrowser

Process

Scrape list of urls

I chose the keyword "Digital Marketing in CT" and ran it through the scraper.

Wait for the search box using the command

await page.waitForSelector('input[aria-label="Search"]', {visible: true});

Wait for all the results

await page.waitForSelector(".LC20lb", {visible: true});
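
Between those two waits the query has to be typed and submitted. A minimal sketch of the full sequence follows. submitQuery is a hypothetical helper name, and the selectors are the ones used in this tutorial, which may stop matching when Google changes its markup.

```javascript
// Hypothetical helper: types the query into the search box and waits for results.
// "page" is a Puppeteer Page that has already navigated to the search home page.
async function submitQuery(page, query) {
  const searchBox = 'input[aria-label="Search"]'; // selector from this tutorial
  await page.waitForSelector(searchBox, { visible: true });
  await page.type(searchBox, query);   // types into the matched element
  await page.keyboard.press('Enter');  // submits the search form
  await page.waitForSelector('.LC20lb', { visible: true }); // result titles
}
```

Inside search(query) this would be called as await submitQuery(page, query) at the "Inject code here" comment.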

Scrape the links and their titles, and filter out results that point to social media accounts or online directories. We want only the companies' own website addresses, so removing a known list of directories and social media sites improves the chance of getting the company email. See the array of "sites" in the code below.

// Assumes: const mongoPool = require('./mongoPool').getConnection
// page.evaluate runs inside the browser page, where Node modules such as
// mongoPool are not available, so it only collects and returns the results.
var serp = await page.evaluate(() =>{
  var sites = ['linkedin','merchantcircle','angi','facebook','houzz','pinterest','instagram','twitter','bark','yelp','upcity','signalhire','clover','youtube','zoominfo','pixabay','dandb','dnb','manta','buzzfile','mapquest','smallbusinessdb','bbb','porch','whereisalocal','yellowpages','yellow.place','allpeople','verview','wikipedia','showmelocal','chamberofcommerce','finduslocal']
  var nodes = document.querySelectorAll(".LC20lb")
  var serp = []
  for(var n of nodes){
    var link = n.parentNode.href
    var title = n.innerText
    if(!link) continue // title is not wrapped in an anchor
    var allow = true
    for(var s of sites){
      if(link.includes(s)) {allow=false;break;}
    }
    if(allow) serp.push({title,link})
  }
  return serp
});

// Back in Node: save the filtered results
mongoPool((pool) => {
  if (pool !== null) {
    const dbo = pool.db('myDB').collection("profiles")
    for(var row of serp) dbo.insertOne(row, function(err, res) {});
  }
})
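
Matching the names against the whole link is simple, but it can over-filter: "bark" also matches a company at barkerdesign.com, and "facebook" matches any page with that word in its path. A stricter variation, sketched here with a hypothetical isCompanySite helper, compares only the hostname labels. Note that partial names such as "verview" would no longer match under this approach.

```javascript
// Hypothetical helper: true when the link's hostname matches none of the blocked names.
function isCompanySite(link, blocked) {
  let host;
  try {
    host = new URL(link).hostname.toLowerCase();
  } catch (err) {
    return false; // not a valid absolute URL
  }
  // "www.facebook.com" becomes ["www", "facebook", "com"]
  const labels = host.split('.');
  return !blocked.some((name) =>
    name.includes('.')
      ? host === name || host.endsWith('.' + name) // entries like "yellow.place"
      : labels.includes(name)                      // entries like "facebook"
  );
}
```

It can replace the inner loop over "sites", for example: if(isCompanySite(link, sites)) serp.push({title, link}). The function then has to be defined inside the page.evaluate callback, since that code runs in the browser.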

Close the page
```
await page.close()
```
Wrap the above code in a try/catch block. In the catch, close the browser and reset its value to null so the next run launches a fresh one.
```
await browser.close();
browser = null
```
By this step, we will have populated our database with results for digital marketing companies. Repeat this step with different keywords of your choice (optional).

Scrape emails from urls

Open the page in Puppeteer and wait until it loads completely. Then get the source code using await page.content(). Apply a JavaScript regex to the content to collect the emails into an array. The emails need to be unique and free of false positives, such as image file names like logo@2x.png that happen to match the pattern.

function extractEmails(text) {
  return text.match(/([a-zA-Z0-9._-]+@[a-zA-Z0-9._-]+\.[a-zA-Z0-9_-]+)/gi);
}

function onlyUnique(value, index, self) {
  return self.indexOf(value) === index;
}

await page.goto(url, {
  waitUntil: 'networkidle0'
});
// await page.waitForNavigation();

var pageContent = await page.content()
var email = extractEmails(pageContent)
if(email !== null) {
  email = email.filter(y=>(!y.includes('@2x.png') && !y.includes('@3x.png')))
  email = email.filter(onlyUnique)
}
await page.close()
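
The same cleanup can be folded into one function that is easy to try without a browser. This is a sketch, not the code used above: cleanEmails and ASSET_EXT are hypothetical names, and the list of file extensions is an assumption that goes beyond the two @2x.png and @3x.png checks.

```javascript
// Assumed list of asset extensions that the email pattern can match by accident
const ASSET_EXT = /\.(png|jpe?g|gif|svg|webp|css|js)$/i;

// Hypothetical helper: extracts, normalises and de-duplicates emails from HTML text.
function cleanEmails(text) {
  const found = text.match(/[a-zA-Z0-9._-]+@[a-zA-Z0-9._-]+\.[a-zA-Z0-9_-]+/g) || [];
  const kept = found
    .map((e) => e.toLowerCase())        // Info@Acme.com and info@acme.com are the same
    .filter((e) => !ASSET_EXT.test(e)); // drops matches like logo@2x.png
  return [...new Set(kept)];            // Set keeps first-seen order and drops repeats
}

console.log(cleanEmails('Mail Info@Acme.com or info@acme.com <img src="logo@2x.png"> sales@acme.co.uk'));
// prints [ 'info@acme.com', 'sales@acme.co.uk' ]
```

Returning an empty array instead of null when nothing matches also removes the need for the null check before filtering.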

Save filtered emails back to the database

var ObjectId = require('mongodb').ObjectId
// db is assumed to be the "profiles" collection; "emails" is an assumed field name
db.updateOne({_id : new ObjectId(id)}, {$set: {emails : email}}, function(err, result) {
  // Rerun for another url (Pagination)
});

Puppeteer consuming too much disk space?

When you run Puppeteer continuously, it tends to leave Chrome profiles behind in the default temp folder, /tmp/ (Ubuntu Linux). Many of these profiles pile up under the snap.chromium/tmp folder, which needs regular cleanup if you do not want to run out of disk space.

Use this code to delete only those profiles that are not needed.

try{
  var fs = require('fs')
  var act = '/tmp/snap.chromium/tmp'
  var keep = fs.readdirSync('/tmp/')
  var remo = fs.readdirSync(act)
  remo.forEach(file => {
    if(file.includes('puppeteer_dev_chrome_profile')){
      if(keep.indexOf(file) === -1) {
        // fs.rmSync (Node 14.14+) replaces the deprecated recursive fs.rmdirSync
        fs.rmSync(`${act}/${file}`,{recursive:true, force:true})
        console.log("Removed", file)
      }else{
        console.log("Kept", file)
      }
    }
  })
}catch(err){
  console.log("Some error", err)
}

Conclusion

This scraper can have extended functionality such as scraping the complete website to increase the chance of getting emails. For example use this with the npm package website-scraper. Or even build a standalone app using Electron JS.
