Advanced Scraping Guide

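Learn how to use Firecrawl's advanced options to fine-tune your scraping.

Basic Scraping (/scrape Endpoint)

Use the /scrape endpoint to scrape a single page and get clean Markdown content.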

Python

# Install the library first
# pip install firecrawl-py
from firecrawl import FirecrawlApp
app = FirecrawlApp(api_key="YOUR_API_KEY")
content = app.scrape_url("https://docs.firecrawl.dev")

PDF Scraping

Firecrawl supports PDF scraping by default. Pass a PDF link to the /scrape endpoint to get the text content of the file. To disable this behavior, set parsePDF to false.
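
As a minimal sketch, a request body that scrapes a PDF link with parsing turned off might be built as follows. It assumes parsePDF is a top-level field of the /scrape request body, and the PDF URL is a placeholder; neither detail is stated in the text above.

```python
import json


def build_pdf_scrape_payload(pdf_url: str, parse_pdf: bool = True) -> dict:
    """Build a /scrape request body for a PDF link.

    Assumption: parsePDF sits at the top level of the request body.
    """
    payload = {"url": pdf_url}
    if not parse_pdf:
        # Only send the flag when overriding the default (PDF parsing enabled).
        payload["parsePDF"] = False
    return payload


# Hypothetical PDF URL, used only for illustration.
payload = build_pdf_scrape_payload("https://example.com/report.pdf", parse_pdf=False)
print(json.dumps(payload, indent=2))
```

Leaving the flag out when parsing is wanted keeps the request identical to a plain scrape, so the default behavior applies.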

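Scrape Parameters

When using the /scrape endpoint, you can customize the scraping behavior with the following parameters:

Response Formats `formats`

• Type: array
• Options: ["markdown", "links", "html", "rawHtml", "screenshot", "json"]
• Description: Specifies which formats to include in the response:
  ◦ markdown: the page content as Markdown
  ◦ links: all links found on the page
  ◦ html: the processed HTML
  ◦ rawHtml: the raw, unmodified HTML
  ◦ screenshot: a screenshot of the page
  ◦ json: structured data extracted by an LLM
• Default: ["markdown"]

Full Page Content `onlyMainContent`

• Type: boolean
• Description: By default, only the main content is returned (headers, navigation bars, footers, and similar elements are excluded). Set to false to return the full page content.
• Default: true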

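Included Elements `includeTags`

• Type: array
• Description: HTML tags, classes, or IDs to include in the response
• Default: undefined

Excluded Elements `excludeTags`

• Type: array
• Description: HTML tags, classes, or IDs to exclude from the response
• Default: undefined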

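Page Load Wait `waitFor`

• Type: integer
• Description: Time to wait for the page to load, in milliseconds. Use it only when necessary.
• Default: 0

Timeout `timeout`

• Type: integer
• Description: Maximum time allowed for the scrape, in milliseconds
• Default: 30000 (30 seconds)

Full Parameter Example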

curl -X POST https://api.firecrawl.dev/v1/scrape \
    -H 'Content-Type: application/json' \
    -H 'Authorization: Bearer YOUR_API_KEY' \
    -d '{
      "url": "https://docs.firecrawl.dev",
      "formats": ["markdown", "links", "html", "rawHtml", "screenshot"],
      "includeTags": ["h1", "p", "a", ".main-content"],
      "excludeTags": ["#ad", "#footer"],
      "onlyMainContent": false,
      "waitFor": 1000,
      "timeout": 15000
    }'

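This configuration will:

• Return the full page content instead of only the main content (onlyMainContent is false)
• Include Markdown, links, processed HTML, raw HTML, and a screenshot in the response
• Include only the specified elements (h1, p, a, .main-content)
• Exclude the ad and footer elements (#ad, #footer)
• Wait 1 second (1000 ms) for the page to load
• Time out after 15 seconds (15000 ms)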
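
For readers who prefer Python over curl, the same request can be sketched with the standard library alone. This is an illustrative sketch, not the official SDK: build_scrape_request is a hypothetical helper, and the actual network call is left commented out so the block runs offline.

```python
import json
import urllib.request

API_URL = "https://api.firecrawl.dev/v1/scrape"


def build_scrape_request(api_key: str, url: str, **options) -> urllib.request.Request:
    """Build (but do not send) a POST request mirroring the curl example."""
    body = {"url": url, **options}
    return urllib.request.Request(
        API_URL,
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )


request = build_scrape_request(
    "YOUR_API_KEY",
    "https://docs.firecrawl.dev",
    formats=["markdown", "links", "html", "rawHtml", "screenshot"],
    includeTags=["h1", "p", "a", ".main-content"],
    excludeTags=["#ad", "#footer"],
    onlyMainContent=False,
    waitFor=1000,
    timeout=15000,
)

# Sending is commented out so the sketch runs without network access:
# with urllib.request.urlopen(request) as response:
#     result = json.load(response)
print(request.get_method(), request.full_url)
```

Passing the options as keyword arguments keeps the parameter names identical to the JSON field names used by the endpoint.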

Full API reference: Scrape Endpoint Documentation

Structured Data Extraction

Use the extract parameter to configure structured data extraction:

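LLM Extraction Options

Data Schema `schema`

• Type: object
• Required: yes, if prompt is not provided
• Description: Defines the structure of the data to extract

System Prompt `system prompt`

• Type: string
• Required: no
• Description: System-level instructions for the LLM

Extraction Prompt `prompt`

• Type: string
• Required: yes, if schema is not provided
• Example: "Extract the list of product features"

Usage Example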

curl -X POST https://api.firecrawl.dev/v1/scrape \
    -H 'Content-Type: application/json' \
    -H 'Authorization: Bearer YOUR_API_KEY' \
    -d '{
      "url": "https://firecrawl.dev",
      "formats": ["markdown", "json"],
      "json": {
        "prompt": "Extract the list of product features"
      }
    }'

Example response:

{
  "success": true,
  "data": {
    "extract": {
      "product": "Firecrawl",
      "features": {
        "general": {
          "description": "A tool that turns website data into LLM-ready data",
          "openSource": true,
          "useCases": [
            "AI application development",
            "Data science research",
            "Market analysis",
            "Content aggregation"
          ]
        },
        "crawlingAndScraping": {
          "crawlAllAccessiblePages": true,
          "dynamicContentHandling": true
        }
      }
    }
  }
}
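
The schema-or-prompt rule above can be sketched in Python as follows. Several details here are assumptions rather than confirmed API behavior: the schema is written as JSON Schema, it is placed in the same "json" object that carries the prompt in the usage example, and the result is read from data.extract as in the sample response. The field names in feature_schema are hypothetical.

```python
import json

# Hypothetical JSON Schema describing the fields to extract.
feature_schema = {
    "type": "object",
    "properties": {
        "product": {"type": "string"},
        "features": {"type": "array", "items": {"type": "string"}},
    },
    "required": ["product"],
}


def build_extract_payload(url, schema=None, prompt=None):
    """Build a /scrape body that requests LLM extraction."""
    if schema is None and prompt is None:
        # The text above requires at least one of schema or prompt.
        raise ValueError("Provide a schema, a prompt, or both")
    options = {}
    if schema is not None:
        options["schema"] = schema
    if prompt is not None:
        options["prompt"] = prompt
    return {"url": url, "formats": ["markdown", "json"], "json": options}


def read_extracted(response: dict) -> dict:
    """Return the extracted data, assuming it lives under data.extract."""
    if not response.get("success"):
        raise RuntimeError("Scrape request did not succeed")
    return response["data"]["extract"]


payload = build_extract_payload("https://firecrawl.dev", schema=feature_schema)
sample_response = {"success": True, "data": {"extract": {"product": "Firecrawl"}}}
print(json.dumps(payload["json"], indent=2))
print(read_extracted(sample_response)["product"])
```

If your API version returns the extracted data under a different key, only read_extracted needs to change.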

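Page Actions

Use the actions parameter to interact with the page before it is scraped:

Supported Action Types

Wait `wait`

• Type: object
• Parameters:
  ◦ milliseconds: how long to wait, in milliseconds
• Example: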

```json
{ "type": "wait", "milliseconds": 2000 }
```

Screenshot `screenshot`

• Type: object
• Parameters:
  ◦ fullPage: whether to capture the full page (default: false)
• Example:

```json
{ "type": "screenshot", "fullPage": true }
```

Click `click`

• Type: object
• Parameters:
  ◦ selector: selector of the element to click
• Example:

```json
{ "type": "click", "selector": "#load-more-button" }
```

Write `write`

• Type: object
• Parameters:
  ◦ text: the text to type
  ◦ selector: selector of the input field
• Example:

```json
{ "type": "write", "text": "search query", "selector": "#search-input" }
```

Press `press`

• Type: object
• Parameters:
  ◦ key: name of the key to press
• Example:

```json
{ "type": "press", "key": "Enter" }
```

Scroll `scroll`

• Type: object
• Parameters:
  ◦ direction: scroll direction (up or down)
  ◦ amount: scroll distance, in pixels
• Example:

```json
{ "type": "scroll", "direction": "down", "amount": 500 }
```

Full action parameter reference: API Reference
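
The individual actions above are typically combined into one sequence. The sketch below composes them with small helper functions; it assumes that actions is a top-level array in the /scrape request body and that the actions run in list order. The helper names, the target URL, and the search text are hypothetical.

```python
import json


def wait(milliseconds: int) -> dict:
    return {"type": "wait", "milliseconds": milliseconds}


def screenshot(full_page: bool = False) -> dict:
    return {"type": "screenshot", "fullPage": full_page}


def click(selector: str) -> dict:
    return {"type": "click", "selector": selector}


def write(selector: str, text: str) -> dict:
    return {"type": "write", "text": text, "selector": selector}


def press(key: str) -> dict:
    return {"type": "press", "key": key}


def scroll(direction: str, amount: int) -> dict:
    if direction not in ("up", "down"):
        raise ValueError("direction must be 'up' or 'down'")
    return {"type": "scroll", "direction": direction, "amount": amount}


# Search, load more results, then capture the full page.
actions = [
    write("#search-input", "firecrawl"),
    press("Enter"),
    wait(2000),
    scroll("down", 500),
    click("#load-more-button"),
    wait(2000),
    screenshot(full_page=True),
]

# Hypothetical target URL; "actions" placement in the body is an assumption.
payload = {"url": "https://example.com", "formats": ["markdown"], "actions": actions}
print(json.dumps(payload, indent=2))
```

Adding a wait after each step that triggers loading gives the page time to update before the next action runs.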

Multi-Page Crawling (/crawl Endpoint)

Use the /crawl endpoint to crawl all accessible pages of a website:

curl -X POST https://api.firecrawl.dev/v1/crawl \
    -H 'Content-Type: application/json' \
    -H 'Authorization: Bearer YOUR_API_KEY' \
    -d '{
      "url": "https://docs.firecrawl.dev"
    }'

The request returns a job ID:

{ "id": "1234-5678-9101" }

Checking Crawl Status

curl -X GET https://api.firecrawl.dev/v1/crawl/1234-5678-9101 \
  -H 'Content-Type: application/json' \
  -H 'Authorization: Bearer YOUR_API_KEY'

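Pagination

When the data exceeds 10MB or the crawl job has not finished, the response includes a next parameter for retrieving the remaining data.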
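
Following the next parameter until it disappears can be sketched as a small generator. The response shape is an assumption here: the sketch treats next as a URL to request and expects the scraped pages in a data list. The status values and the skip query string in the fake responses are hypothetical, and fetch is injected so the loop can be tried without network access.

```python
from typing import Callable, Iterator


def iter_crawl_results(
    status_url: str,
    fetch: Callable[[str], dict],
    max_pages: int = 1000,
) -> Iterator[dict]:
    """Yield every result, following `next` until it is absent.

    Assumptions: each response has a `data` list, and `next` is a URL.
    """
    url = status_url
    for _ in range(max_pages):
        page = fetch(url)
        yield from page.get("data", [])
        url = page.get("next")
        if not url:
            return
    # Guard against looping forever on a job that never finishes.
    raise RuntimeError("Stopped after max_pages responses")


# Fake responses standing in for the API.
first = "https://api.firecrawl.dev/v1/crawl/1234-5678-9101"
fake_pages = {
    first: {
        "status": "scraping",
        "data": [{"markdown": "# Page 1"}],
        "next": first + "?skip=1",
    },
    first + "?skip=1": {
        "status": "completed",
        "data": [{"markdown": "# Page 2"}],
    },
}

results = list(iter_crawl_results(first, fake_pages.__getitem__))
print(len(results))
```

In real use, fetch would be a function that sends an authenticated GET request to the given URL and returns the decoded JSON body.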

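Crawl Parameters

Included Paths `includePaths`

• Type: array
• Example: ["/blog/*", "/products/*"]

Excluded Paths `excludePaths`

• Type: array
• Example: ["/admin/*", "/login/*"]

Maximum Depth `maxDepth`

• Type: integer
• Description:
  ◦ 0: crawl only the starting page
  ◦ 1: crawl the starting page and the pages one level below it
  ◦ 2: crawl down to the second level of subpages

Page Limit `limit`

• Type: integer
• Default: 10000

Allow Backward Links `allowBackwardLinks`

• Type: boolean
• Description: Allows the crawler to follow links to parent directories
• Default: false

Allow External Links `allowExternalLinks`

• Type: boolean
• Description: Allows the crawler to follow links to external domains
• Default: false

Nested Scrape Options `scrapeOptions`

• Type: object
• Description: Accepts the same parameters as the /scrape endpoint
• Default: { "formats": ["markdown"] }

Full Configuration Example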

curl -X POST https://api.firecrawl.dev/v1/crawl \
    -H 'Content-Type: application/json' \
    -H 'Authorization: Bearer YOUR_API_KEY' \
    -d '{
      "url": "https://docs.firecrawl.dev",
      "includePaths": ["/blog/*", "/products/*"],
      "excludePaths": ["/admin/*", "/login/*"],
      "maxDepth": 2,
      "limit": 1000
    }'

This configuration will:

• Crawl only the /blog and /products paths
• Exclude the admin and login pages
• Limit the crawl depth to 2 levels
• Crawl at most 1000 pages
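
The curl example does not use scrapeOptions, so the sketch below shows how per-page scrape settings might be nested inside a crawl request. build_crawl_payload is a hypothetical helper, and the validation rules (non-negative depth, positive limit) are assumptions added for illustration rather than documented constraints.

```python
import json


def build_crawl_payload(
    url,
    *,
    include_paths=None,
    exclude_paths=None,
    max_depth=None,
    limit=None,
    scrape_options=None,
):
    """Build a /crawl request body, sending only the options that were set."""
    if max_depth is not None and max_depth < 0:
        raise ValueError("max_depth must be 0 or greater")
    if limit is not None and limit < 1:
        raise ValueError("limit must be at least 1")
    optional = {
        "includePaths": include_paths,
        "excludePaths": exclude_paths,
        "maxDepth": max_depth,
        "limit": limit,
        "scrapeOptions": scrape_options,
    }
    payload = {"url": url}
    # Omitted options fall back to the defaults listed above.
    payload.update({k: v for k, v in optional.items() if v is not None})
    return payload


crawl_payload = build_crawl_payload(
    "https://docs.firecrawl.dev",
    include_paths=["/blog/*", "/products/*"],
    exclude_paths=["/admin/*", "/login/*"],
    max_depth=2,
    limit=1000,
    # Same parameters as /scrape, applied to every crawled page.
    scrape_options={"formats": ["markdown", "links"], "onlyMainContent": True},
)
print(json.dumps(crawl_payload, indent=2))
```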

Site Link Map (/map Endpoint)

Use the /map endpoint to quickly retrieve the links of a website:

curl -X POST https://api.firecrawl.dev/v1/map \
    -H 'Content-Type: application/json' \
    -H 'Authorization: Bearer YOUR_API_KEY' \
    -d '{
      "url": "https://docs.firecrawl.dev"
    }'

Example response:

{
  "success": true,
  "links": [
    "https://docs.firecrawl.dev",
    "https://docs.firecrawl.dev/api-reference/crawl-endpoint",
    "https://docs.firecrawl.dev/getting-started",
    "..."
  ]
}

Advanced Parameters

Keyword Search `search`

• Type: string
• Example: "blog"

Link Limit `limit`

• Type: integer
• Default: 100

Ignore Sitemap `ignoreSitemap`

• Type: boolean
• Default: true

Include Subdomains `includeSubdomains`

• Type: boolean
• Default: false
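
A common pattern is to map a site first and then pick the links worth scraping. The sketch below builds a /map request body from the parameters above and filters a response by URL path; both helper functions are hypothetical, and the sample response reuses the links from the example response.

```python
from urllib.parse import urlparse


def build_map_payload(url, search=None, limit=None,
                      ignore_sitemap=None, include_subdomains=None):
    """Build a /map request body, sending only the options that were set."""
    optional = {
        "search": search,
        "limit": limit,
        "ignoreSitemap": ignore_sitemap,
        "includeSubdomains": include_subdomains,
    }
    payload = {"url": url}
    payload.update({k: v for k, v in optional.items() if v is not None})
    return payload


def links_under(response: dict, prefix: str) -> list:
    """Return the links whose URL path starts with the given prefix."""
    return [
        link for link in response.get("links", [])
        if urlparse(link).path.startswith(prefix)
    ]


map_payload = build_map_payload("https://docs.firecrawl.dev", search="blog", limit=50)

sample_response = {
    "success": True,
    "links": [
        "https://docs.firecrawl.dev",
        "https://docs.firecrawl.dev/api-reference/crawl-endpoint",
        "https://docs.firecrawl.dev/getting-started",
    ],
}
api_links = links_under(sample_response, "/api-reference")
print(api_links)
```

The filtered links can then be passed one by one to the /scrape endpoint, which avoids crawling pages you do not need.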

Full documentation: Map Endpoint Documentation
