Help wanted: how do sites like 今日热榜 batch-crawl hot-list data? Puppeteer on dynamic pages is not only slow but keeps throwing timeout errors


Question

Looking for help, everyone. If you can help solve this, I'll thank you privately with an Alipay password red packet.

  • Solved by one person, sent privately: ¥8.88 to ¥16.88
  • Solved by several people, sent to the group: ¥20
  • Thank-you method: Alipay password red packet

Thanks, everyone.

1. How does 今日热榜 batch-crawl its data?

I'm curious how hot-list aggregator sites like 今日热榜 manage to batch-crawl so many sources.
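
For what it's worth, my guess (not confirmed by 今日热榜) is that for most sources a headless browser isn't needed at all: when the list page is server-rendered or exposes a JSON endpoint, a plain HTTP request plus cheerio is far cheaper than Puppeteer. A minimal sketch, assuming Node 18+ (global fetch) and a placeholder url/hrefSelector pair shaped like the config entries further down:

import * as cheerio from 'cheerio';

// Sketch: fetch a server-rendered list with plain HTTP and parse it with cheerio.
// 'url' and 'hrefSelector' are placeholders, same shape as the config entries below.
async function fetchStaticList(url, hrefSelector) {
  const res = await fetch(url, {
    headers: { 'User-Agent': 'Mozilla/5.0' } // many sites reject the default UA
  });
  const $ = cheerio.load(await res.text());
  return $(hrefSelector)
    .map((_, el) => ({ href: $(el).attr('href'), text: $(el).text().trim() }))
    .get();
}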

2. How do you crawl dynamic pages?

I use cheerio to scrape static pages and Puppeteer to batch-crawl dynamic ones, but Puppeteer's performance is very poor and it keeps throwing timeout errors; I can't find the cause.

page.goto times out no matter how long the timeout is set: 30 s, 60 s, 90 s.

It runs fine locally on Windows, but on the remote low-spec VPS it times out all the time.
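
One thing I suspect (not verified): page.goto defaults to waiting for the full 'load' event, i.e. every image, stylesheet and script, which on the VPS's slower network can alone blow past the timeout even though the DOM is already usable. A minimal sketch of navigating on 'domcontentloaded' instead, with the same 30 s budget:

// Sketch: wait only for DOMContentLoaded instead of the full 'load' event,
// then let waitForSelector cover the element we actually need.
await page.goto(url, { waitUntil: 'domcontentloaded', timeout: 1000 * 30 });
await page.waitForSelector(hrefSelector, { timeout: 1000 * 30 });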

Error

Error during data processing: TimeoutError: Navigation timeout of 30000 ms exceeded
    at new Deferred (/usr/local/node_modules/puppeteer-core/lib/cjs/puppeteer/util/Deferred.js:59:34)
    at Deferred.create (/usr/local/node_modules/puppeteer-core/lib/cjs/puppeteer/util/Deferred.js:21:16)
    at new LifecycleWatcher (/usr/local/node_modules/puppeteer-core/lib/cjs/puppeteer/cdp/LifecycleWatcher.js:66:60)
    at CdpFrame.goto (/usr/local/node_modules/puppeteer-core/lib/cjs/puppeteer/cdp/Frame.js:143:29)
    at CdpFrame. (/usr/local/node_modules/puppeteer-core/lib/cjs/puppeteer/util/decorators.js:98:27)
    at CdpPage.goto (/usr/local/node_modules/puppeteer-core/lib/cjs/puppeteer/api/Page.js:588:43)
    at fetchData (file:///usr/local/script/%E8%B4%A2%E7%BB%8F%E7%83%AD%E6%A6%9C.js:51:18)
    at async executeProcess (file:///usr/local/script/%E8%B4%A2%E7%BB%8F%E7%83%AD%E6%A6%9C.js:108:24)

Code logic

async function fetchData(page, name, url, hrefSelector) {
  const maxRetries = 3; // Maximum number of attempts
  let attempts = 0;

  while (attempts < maxRetries) {
    try {
      attempts++;
      await page.goto(url, { timeout: 1000 * 30 });
      await page.waitForSelector(hrefSelector, { timeout: 1000 * 30 });

      const results = await page.$$eval(hrefSelector, anchors =>
        anchors.map(anchor => ({ href: anchor.href, text: anchor.textContent.trim() }))
      );

      const trade_date = getCurrentDateTime();

      return { name, news: results, trade_date };
    } catch (error) {
      if (attempts < maxRetries) {
        console.warn(`Error fetching data from ${url}. Retry attempt ${attempts}...`);
        await delay(2000); // Wait for 2 seconds before retrying
      } else {
        console.error(`Error fetching data from ${url} after ${attempts} attempts:`, error);
        throw error;
      }
    }
  }
}
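
A variant I've been considering (my own sketch, not what the script above does): open a fresh page for every attempt, so a page left hanging mid-navigation by a failed try can't poison the retry. It reuses the getCurrentDateTime and delay helpers from the existing script:

// Sketch: one fresh page per attempt; a page stuck mid-navigation is discarded
// instead of being reused for the retry.
async function fetchDataFreshPage(browser, name, url, hrefSelector) {
  const maxRetries = 3;
  for (let attempt = 1; attempt <= maxRetries; attempt++) {
    const page = await browser.newPage();
    try {
      await page.goto(url, { waitUntil: 'domcontentloaded', timeout: 1000 * 30 });
      await page.waitForSelector(hrefSelector, { timeout: 1000 * 30 });
      const results = await page.$$eval(hrefSelector, anchors =>
        anchors.map(anchor => ({ href: anchor.href, text: anchor.textContent.trim() }))
      );
      return { name, news: results, trade_date: getCurrentDateTime() };
    } catch (error) {
      console.warn(`Error fetching data from ${url}, attempt ${attempt}:`, error.message);
      if (attempt === maxRetries) throw error;
      await delay(2000); // same 2-second back-off as above
    } finally {
      await page.close();
    }
  }
}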

Plugins used

  • xvfb
  • puppeteer-extra
  • puppeteer-extra-plugin-stealth
  • puppeteer-extra-plugin-anonymize-ua

import puppeteer from 'puppeteer-extra';
import StealthPlugin from 'puppeteer-extra-plugin-stealth';
import AnonymizeUaPlugin from 'puppeteer-extra-plugin-anonymize-ua';

puppeteer.use(StealthPlugin());
puppeteer.use(AnonymizeUaPlugin());

Puppeteer launch options

// Launch the browser and page
const browser = await puppeteer.launch({
  args: [
    "--disable-setuid-sandbox",
    "--no-sandbox",
    "--disable-gpu",
    "--no-first-run",
    "--disable-dev-shm-usage",
    "--single-process"
  ],
  headless: true
});
console.log('Browser launched √');
const page = await browser.newPage();

// Set up request interception to block unnecessary resource requests
await page.setRequestInterception(true);
page.on('request', (request) => {
  const resourceType = request.resourceType();
  if (['image', 'stylesheet', 'font'].includes(resourceType)) {
    request.abort();
  } else {
    request.continue();
  }
});

// Fetch the data
const allContents = [];
for (const data of config) {
  const contents = await fetchData(page, data.name, data.url, data.hrefSelector);
  allContents.push(contents);
}
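
Since the loop is sequential and shares one page, a single slow site currently stalls everything behind it. Another sketch (my own idea, assuming the config list stays small enough for unbounded concurrency on this VPS): give each site its own page via the fetchDataFreshPage sketch above and run them with Promise.allSettled, so one failure doesn't abort the whole run:

// Sketch: crawl all configured sites concurrently, each on its own page,
// so one hung navigation can't stall the rest of the run.
const settled = await Promise.allSettled(
  config.map(data => fetchDataFreshPage(browser, data.name, data.url, data.hrefSelector))
);
const allContents = settled
  .filter(result => result.status === 'fulfilled')
  .map(result => result.value);
for (const result of settled) {
  if (result.status === 'rejected') {
    console.warn('Site skipped after repeated failures:', result.reason.message);
  }
}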

Sites that frequently error when crawled

{
    name: '第一财经',
    url: 'https://www.yicai.com/news/',
    hrefSelector: '#newsRank div:nth-child(1) > ul > li a'
},
{
    name: '金融界',
    url: 'https://stock.jrj.com.cn/',
    hrefSelector: 'ul.opportunity-list > li a'
},
{
    name: '八阕',
    url: 'https://news.popyard.space/cgi-mod/threads.cgi?lan=cn&r=0&cid=11&t=all',
    hrefSelector: 'div#page_1 > table b > a'
}

Crawling 第一财经 errors out even locally on Windows..

System

  • Extra IPv4: None
  • RAM: 2.5 GB (Included)
  • CPU Cores: 2 (Included)
  • Operating System: Debian 12 64 Bit (Recommended Min. 2 GB RAM)
  • Location: San Jose, CA (Test IP: 192.210.207.88)