Help wanted: how do sites like 今日热榜 batch-crawl hot-list data? Puppeteer on dynamic pages is not only slow but keeps throwing timeout errors


Question

Looking for help, everyone. If you can help solve this, I'll thank you privately with an Alipay password red packet.

  • Solved by one person, sent privately: ¥8.88 to ¥16.88
  • Solved by several people, sent to the group: ¥20
  • Thank-you method: Alipay password red packet

Thanks, everyone.

1. How does 今日热榜 batch-crawl its data?

I'm curious how hot-list aggregator sites like 今日热榜 manage to batch-crawl so many sources.
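
For what it's worth, my guess (not confirmed by 今日热榜) is that for most sources a headless browser isn't needed at all: when the list page is server-rendered or exposes a JSON endpoint, a plain HTTP request plus cheerio is far cheaper than Puppeteer. A minimal sketch, assuming Node 18+ (global fetch) and a placeholder url/hrefSelector pair shaped like the config entries further down:

import * as cheerio from 'cheerio';

// Sketch: fetch a server-rendered list with plain HTTP and parse it with cheerio.
// 'url' and 'hrefSelector' are placeholders, same shape as the config entries below.
async function fetchStaticList(url, hrefSelector) {
  const res = await fetch(url, {
    headers: { 'User-Agent': 'Mozilla/5.0' } // many sites reject the default UA
  });
  const $ = cheerio.load(await res.text());
  return $(hrefSelector)
    .map((_, el) => ({ href: $(el).attr('href'), text: $(el).text().trim() }))
    .get();
}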

2. How do you crawl dynamic pages?

I use cheerio to scrape static pages and Puppeteer to batch-crawl dynamic ones, but Puppeteer's performance is very poor and it keeps throwing timeout errors; I can't find the cause.

page.goto times out no matter how long the timeout is set: 30 s, 60 s, 90 s.

It runs fine locally on Windows, but on the remote low-spec VPS it times out all the time.
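
One thing I suspect (not verified): page.goto defaults to waiting for the full 'load' event, i.e. every image, stylesheet and script, which on the VPS's slower network can alone blow past the timeout even though the DOM is already usable. A minimal sketch of navigating on 'domcontentloaded' instead, with the same 30 s budget:

// Sketch: wait only for DOMContentLoaded instead of the full 'load' event,
// then let waitForSelector cover the element we actually need.
await page.goto(url, { waitUntil: 'domcontentloaded', timeout: 1000 * 30 });
await page.waitForSelector(hrefSelector, { timeout: 1000 * 30 });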

Error

Error during data processing: TimeoutError: Navigation timeout of 30000 ms exceeded
    at new Deferred (/usr/local/node_modules/puppeteer-core/lib/cjs/puppeteer/util/Deferred.js:59:34)
    at Deferred.create (/usr/local/node_modules/puppeteer-core/lib/cjs/puppeteer/util/Deferred.js:21:16)
    at new LifecycleWatcher (/usr/local/node_modules/puppeteer-core/lib/cjs/puppeteer/cdp/LifecycleWatcher.js:66:60)
    at CdpFrame.goto (/usr/local/node_modules/puppeteer-core/lib/cjs/puppeteer/cdp/Frame.js:143:29)
    at CdpFrame. (/usr/local/node_modules/puppeteer-core/lib/cjs/puppeteer/util/decorators.js:98:27)
    at CdpPage.goto (/usr/local/node_modules/puppeteer-core/lib/cjs/puppeteer/api/Page.js:588:43)
    at fetchData (file:///usr/local/script/%E8%B4%A2%E7%BB%8F%E7%83%AD%E6%A6%9C.js:51:18)
    at async executeProcess (file:///usr/local/script/%E8%B4%A2%E7%BB%8F%E7%83%AD%E6%A6%9C.js:108:24)

Code logic

async function fetchData(page, name, url, hrefSelector) {
  const maxRetries = 3; // Maximum number of attempts
  let attempts = 0;

  while (attempts < maxRetries) {
    try {
      attempts++;
      await page.goto(url, { timeout: 1000 * 30 });
      await page.waitForSelector(hrefSelector, { timeout: 1000 * 30 });

      const results = await page.$$eval(hrefSelector, anchors =>
        anchors.map(anchor => ({ href: anchor.href, text: anchor.textContent.trim() }))
      );

      const trade_date = getCurrentDateTime();

      return { name, news: results, trade_date };
    } catch (error) {
      if (attempts < maxRetries) {
        console.warn(`Error fetching data from ${url}. Retry attempt ${attempts}...`);
        await delay(2000); // Wait for 2 seconds before retrying
      } else {
        console.error(`Error fetching data from ${url} after ${attempts} attempts:`, error);
        throw error;
      }
    }
  }
}
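
A variant I've been considering (my own sketch, not what the script above does): open a fresh page for every attempt, so a page left hanging mid-navigation by a failed try can't poison the retry. It reuses the getCurrentDateTime and delay helpers from the existing script:

// Sketch: one fresh page per attempt; a page stuck mid-navigation is discarded
// instead of being reused for the retry.
async function fetchDataFreshPage(browser, name, url, hrefSelector) {
  const maxRetries = 3;
  for (let attempt = 1; attempt <= maxRetries; attempt++) {
    const page = await browser.newPage();
    try {
      await page.goto(url, { waitUntil: 'domcontentloaded', timeout: 1000 * 30 });
      await page.waitForSelector(hrefSelector, { timeout: 1000 * 30 });
      const results = await page.$$eval(hrefSelector, anchors =>
        anchors.map(anchor => ({ href: anchor.href, text: anchor.textContent.trim() }))
      );
      return { name, news: results, trade_date: getCurrentDateTime() };
    } catch (error) {
      console.warn(`Error fetching data from ${url}, attempt ${attempt}:`, error.message);
      if (attempt === maxRetries) throw error;
      await delay(2000); // same 2-second back-off as above
    } finally {
      await page.close();
    }
  }
}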

Plugins used

  • xvfb
  • puppeteer-extra
  • puppeteer-extra-plugin-stealth
  • puppeteer-extra-plugin-anonymize-ua

import puppeteer from 'puppeteer-extra';
import StealthPlugin from 'puppeteer-extra-plugin-stealth';
import AnonymizeUaPlugin from 'puppeteer-extra-plugin-anonymize-ua';

puppeteer.use(StealthPlugin());
puppeteer.use(AnonymizeUaPlugin());

Puppeteer launch options

// Launch the browser and page
const browser = await puppeteer.launch({
  args: [
    "--disable-setuid-sandbox",
    "--no-sandbox",
    "--disable-gpu",
    "--no-first-run",
    "--disable-dev-shm-usage",
    "--single-process"
  ],
  headless: true
});
console.log('Browser launched √');
const page = await browser.newPage();

// Set up request interception to block unnecessary resource requests
await page.setRequestInterception(true);
page.on('request', (request) => {
  const resourceType = request.resourceType();
  if (['image', 'stylesheet', 'font'].includes(resourceType)) {
    request.abort();
  } else {
    request.continue();
  }
});

// Fetch the data
const allContents = [];
for (const data of config) {
  const contents = await fetchData(page, data.name, data.url, data.hrefSelector);
  allContents.push(contents);
}
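
Since the loop is sequential and shares one page, a single slow site currently stalls everything behind it. Another sketch (my own idea, assuming the config list stays small enough for unbounded concurrency on this VPS): give each site its own page via the fetchDataFreshPage sketch above and run them with Promise.allSettled, so one failure doesn't abort the whole run:

// Sketch: crawl all configured sites concurrently, each on its own page,
// so one hung navigation can't stall the rest of the run.
const settled = await Promise.allSettled(
  config.map(data => fetchDataFreshPage(browser, data.name, data.url, data.hrefSelector))
);
const allContents = settled
  .filter(result => result.status === 'fulfilled')
  .map(result => result.value);
for (const result of settled) {
  if (result.status === 'rejected') {
    console.warn('Site skipped after repeated failures:', result.reason.message);
  }
}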

Sites that frequently error when crawled

{
    name: '第一财经',
    url: 'https://www.yicai.com/news/',
    hrefSelector: '#newsRank div:nth-child(1) > ul > li a'
},
{
    name: '金融界',
    url: 'https://stock.jrj.com.cn/',
    hrefSelector: 'ul.opportunity-list > li a'
},
{
    name: '八阕',
    url: 'https://news.popyard.space/cgi-mod/threads.cgi?lan=cn&r=0&cid=11&t=all',
    hrefSelector: 'div#page_1 > table b > a'
}

Crawling 第一财经 errors out even locally on Windows..

System

  • Extra IPv4: None
  • RAM: 2.5 GB (Included)
  • CPU Cores: 2 (Included)
  • Operating System: Debian 12 64 Bit (Recommended Min. 2 GB RAM)
  • Location: San Jose, CA (Test IP: 192.210.207.88)