Puppeteer is an awesome Node.js library that gives us a lot of commands to control a headless (or full) Chromium instance and automate navigation with a few lines of code. In this post we are going to use Puppeteer's superpowers to build a simple web scraping tool.

Initial Setup

We are about to create a Node.js application, so the first step is, of course, to create a new npm project in a new directory. With the -y flag, package.json is created with default values. After that, add the puppeteer dependency to the project.

    npm init -y

    npm install --save puppeteer

    # or, if you prefer Yarn:
    yarn add puppeteer

Finally, in our package.json file, add the following script:

    "scripts": {
        "start": "node index.js"
    }

This script simplifies running our app: now we can start it with just the npm start command.

Coding Part

With our npm project configured, the next step is, yes, coding. Let's create our index.js file. Here is the skeleton of our Puppeteer app:

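    'use strict'

    const puppeteer = require('puppeteer')

    async function run() {
        // Launch a Chromium instance and open a new tab
        const browser = await puppeteer.launch()
        const page = await browser.newPage()

        // Close the browser (and its process) when we are done
        await browser.close()
    }

    run()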

Basically, we import the puppeteer dependency at line 3, then declare an async function that wraps all browser/Puppeteer interactions. Inside it, we get a Chromium browser instance and open a new tab (page). In the last lines, we close the browser (and its process) and finally call the async function.
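
One thing the skeleton does not cover is an error thrown between launching and closing the browser: in that case the call to browser.close() would be skipped. Here is a minimal sketch of one way to guard against that with try/finally. The helper name withBrowser and its launch parameter are hypothetical (they are not part of Puppeteer); the launcher is passed in as a function only so the helper can also be exercised with a fake browser.

```javascript
'use strict'

// Hypothetical helper (not part of Puppeteer): runs `task` with a fresh page
// and always closes the browser afterwards, even when `task` throws.
// `launch` is any function that resolves to a browser-like object,
// e.g. () => require('puppeteer').launch()
async function withBrowser(launch, task) {
    const browser = await launch()
    try {
        const page = await browser.newPage()
        return await task(page)
    } finally {
        // Runs on success and on failure alike
        await browser.close()
    }
}
```

With Puppeteer installed, this could be called as withBrowser(() => puppeteer.launch(), async page => { /* page interactions */ }), and whatever the task returns is passed back to the caller.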

Navigating to our target site

Going to a specific website is a simple task using our tab instance (page). We just need to use the goto method:

    await page.goto('https://techcrunch.com/')
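
goto also accepts an options object and resolves to the main response of the navigation (or to null in some cases, such as navigating to about:blank), so a failed request can be spotted early. The sketch below relies on the waitUntil and timeout options and on the ok() and status() response methods from the Puppeteer API. The helper name gotoOrThrow and the 30-second timeout are illustrative choices, not something taken from the script in this post.

```javascript
'use strict'

// Hypothetical helper: navigate to `url` and fail fast on HTTP errors.
async function gotoOrThrow(page, url) {
    const response = await page.goto(url, {
        waitUntil: 'domcontentloaded', // resolve once the DOM has been parsed
        timeout: 30000                 // give up after 30 s (illustrative value)
    })
    // goto can resolve to null, so guard before inspecting the response
    if (response !== null && !response.ok()) {
        throw new Error(`Navigation to ${url} failed with HTTP status ${response.status()}`)
    }
    return response
}
```

Inside run, the plain goto call could then be swapped for await gotoOrThrow(page, 'https://techcrunch.com/').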

Scraping

On this page we just need to iterate over the DOM nodes and extract the information. Fortunately, Puppeteer can help us with that too.

    // Wait until the first article has been rendered
    await page.waitForSelector('#tc-main-content > div:nth-child(2) > div > div > div > article:nth-child(1)')

    // The callback runs inside the page, so it can use the DOM API directly
    const news = await page.evaluate(() => {
        const results = Array.from(document.querySelectorAll('#tc-main-content > div:nth-child(2) > div > div > div > article'))
        // Build one plain object per article node
        return results.map(result => {
            return {
                link: result.querySelector('a.post-block__title__link').href,
                title: result.querySelector('.post-block__title a').textContent,
                author: result.querySelector('.river-byline__authors a').textContent,
                datetime: result.querySelector('.river-byline__time').getAttribute('datetime'),
                shortcontent: result.querySelector('.post-block__content p').textContent
            }
        })
    })

    console.log(news)

In the script above we use the evaluate method to run code inside the page. With a few query selectors we iterate over the list of articles and extract the information from each node, producing an output like this for each news item:

    {
        link: "https://techcrunch.com/2019/07/15/pokemon-go-battles-will-soon-be-less-tappy-more-fruit-ninja-y/",
        title: "Pokémon GO battles will soon be less tappy, more Fruit Ninja-y",
        author: "Greg Kumparak",
        datetime: "2019-07-15T23:58:02.000Z",
        shortcontent: "At the end of last year, Pokémon GO finally got a " +
          "player-versus-player battling system. While it was a very much " +
          "welcomed addition, it has always seemed a bit… monotonous. It " +
          "just requires so..."
    }
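
Text read through textContent generally keeps whatever whitespace surrounds it in the markup, so the raw strings can come back padded with spaces or newlines. Here is a minimal sketch of tidying the scraped list on the Node.js side. The function name cleanNews and the decision to drop items without a link or title are assumptions made for illustration, not part of the script above.

```javascript
'use strict'

// Hypothetical post-processing step for the array returned by page.evaluate
function cleanNews(news) {
    // Collapse runs of whitespace (including newlines and tabs) and trim the ends
    const tidy = value => String(value || '').replace(/\s+/g, ' ').trim()

    return news
        .map(item => ({
            link: tidy(item.link),
            title: tidy(item.title),
            author: tidy(item.author),
            datetime: tidy(item.datetime),
            shortcontent: tidy(item.shortcontent)
        }))
        // Keep only items that still have both a link and a title
        .filter(item => item.link !== '' && item.title !== '')
}
```

Calling console.log(cleanNews(news)) instead of console.log(news) would then print the tidied items.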


Reference: @lex0316