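Playwright Browser Automation Tutorial
Playwright is a modern browser automation framework developed by Microsoft. It supports the three major browser engines (Chromium, Firefox, and WebKit) and can run in headless mode, which makes it well suited to automated testing, web page screenshots, and data collection on a BandwagonHost (搬瓦工) VPS. Compared with older tools, Playwright offers auto-waiting, network interception, and multi-tab management.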
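1. System Requirements
- Operating system: Ubuntu 20.04 or later (Ubuntu 22.04 recommended).
- Memory: at least 1 GB, 2 GB or more recommended (browser engines are memory-hungry).
- Python: 3.8 or later, or Node.js 16 or later. Newer Playwright releases have raised these minimums, so check the release notes for the version you install.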
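The requirements above can be checked with a short script before anything is installed. This is a minimal sketch, assuming a Linux host that exposes /proc/meminfo. The helper names (parse_mem_total_mb, check_requirements) are illustrative, and the memory threshold is set slightly below 1 GB because the kernel reports a little less than the advertised size.

```python
import sys

MIN_PYTHON = (3, 8)    # minimum Python version from the list above
MIN_MEMORY_MB = 900    # "at least 1 GB"; MemTotal is slightly below the advertised size


def parse_mem_total_mb(meminfo_text):
    """Return MemTotal in MB from /proc/meminfo-style text, or None if absent."""
    for line in meminfo_text.splitlines():
        if line.startswith("MemTotal:"):
            # Line format: "MemTotal:        2027784 kB"
            return int(line.split()[1]) // 1024
    return None


def check_requirements(python_version, mem_total_mb):
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    if tuple(python_version[:2]) < MIN_PYTHON:
        problems.append(f"Python {MIN_PYTHON[0]}.{MIN_PYTHON[1]}+ is required")
    if mem_total_mb is not None and mem_total_mb < MIN_MEMORY_MB:
        problems.append(f"only {mem_total_mb} MB of memory; at least 1 GB is needed")
    return problems


if __name__ == "__main__":
    try:
        with open("/proc/meminfo") as f:
            mem = parse_mem_total_mb(f.read())
    except OSError:
        mem = None  # not Linux, or /proc is unavailable; skip the memory check
    for message in check_requirements(sys.version_info, mem) or ["All checks passed"]:
        print(message)
```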
2. Installing Playwright (Python)
2.1 Install system dependencies
apt update && apt upgrade -y
apt install python3 python3-pip python3-venv -y
2.2 Create the project environment
mkdir -p /opt/playwright-project && cd /opt/playwright-project
python3 -m venv venv
source venv/bin/activate
2.3 Install Playwright
pip install playwright
playwright install --with-deps chromium
The --with-deps flag also installs the system libraries that the browser needs in order to run. To install all supported browsers instead:
playwright install --with-deps
3. Installing Playwright (Node.js)
3.1 Install Node.js
curl -fsSL https://deb.nodesource.com/setup_20.x | bash -
apt install nodejs -y
node --version && npm --version
3.2 Initialize the project
mkdir -p /opt/playwright-node && cd /opt/playwright-node
npm init -y
npm install playwright
3.3 Install the browser
npx playwright install --with-deps chromium
4. Basic Usage Examples (Python)
4.1 Taking a screenshot
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page(viewport={'width': 1920, 'height': 1080})
    page.goto('https://example.com')
    # full_page=True captures the whole scrollable page, not just the viewport
    page.screenshot(path='screenshot.png', full_page=True)
    browser.close()
print('Screenshot saved')
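For recurring captures, each screenshot usually needs its own file name. The sketch below wraps the same Playwright calls as the example above in a function and derives a timestamped name from the URL. The names screenshot_name and capture and the screenshots output directory are assumptions made for illustration. Playwright is imported inside capture so that the naming helper can be used on its own.

```python
import re
from datetime import datetime, timezone
from pathlib import Path
from urllib.parse import urlparse


def screenshot_name(url, when):
    """Build a file name from the URL's host and path plus a UTC timestamp."""
    parts = urlparse(url)
    raw = parts.netloc + parts.path
    # Replace anything that is not safe in a file name with an underscore
    slug = re.sub(r"[^A-Za-z0-9.-]+", "_", raw).strip("_") or "page"
    return f"{slug}_{when.strftime('%Y%m%dT%H%M%SZ')}.png"


def capture(url, out_dir="screenshots"):
    """Take a full-page screenshot of url and return the path it was saved to."""
    from playwright.sync_api import sync_playwright  # imported lazily

    Path(out_dir).mkdir(parents=True, exist_ok=True)
    target = Path(out_dir) / screenshot_name(url, datetime.now(timezone.utc))
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        try:
            page = browser.new_page(viewport={"width": 1920, "height": 1080})
            page.goto(url)
            page.screenshot(path=str(target), full_page=True)
        finally:
            browser.close()  # close the browser even if navigation fails
    return target
```

Calling capture('https://example.com') would then write one new file per run instead of overwriting screenshot.png.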
4.2 Filling in and submitting a form
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()
    page.goto('https://example.com/login')
    # Fill in the form fields
    page.fill('input[name="username"]', 'myuser')
    page.fill('input[name="password"]', 'mypassword')
    # Click the login button
    page.click('button[type="submit"]')
    # Wait for the post-login page to finish loading (no network activity)
    page.wait_for_load_state('networkidle')
    print(f'Current page: {page.url}')
    browser.close()
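The form example above hard-codes the credentials. A possible variation, sketched here under stated assumptions, moves the steps into a reusable function that takes an already-open page and reads the credentials from environment variables. The variable names APP_USERNAME and APP_PASSWORD are hypothetical, and treating "the URL changed after submit" as success is an assumption about the target site, not something Playwright guarantees.

```python
import os


def credentials_from_env(env=os.environ):
    """Read credentials from APP_USERNAME / APP_PASSWORD (hypothetical names)."""
    try:
        return env["APP_USERNAME"], env["APP_PASSWORD"]
    except KeyError as exc:
        raise RuntimeError(f"missing environment variable: {exc.args[0]}") from None


def login(page, login_url, username, password):
    """Fill in and submit the login form on an open Playwright page."""
    page.goto(login_url)
    page.locator('input[name="username"]').fill(username)
    page.locator('input[name="password"]').fill(password)
    page.locator('button[type="submit"]').click()
    page.wait_for_load_state("networkidle")
    # Assumption: a successful login navigates away from the login URL
    return page.url != login_url
```

With a page created as in the example above, the call would look like login(page, 'https://example.com/login', *credentials_from_env()).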
4.3 Extracting page content
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()
    page.goto('https://example.com')
    # Get the page title
    title = page.title()
    print(f'Page title: {title}')
    # Extract every link: the JavaScript function runs in the browser
    # and its result is returned to Python as a list of dicts
    links = page.eval_on_selector_all(
        'a[href]',
        'elements => elements.map(e => ({text: e.textContent.trim(), href: e.href}))')
    for link in links:
        print(f'{link["text"]} -> {link["href"]}')
    browser.close()
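The extracted list typically contains duplicates, in-page anchors, and links to other sites. The sketch below post-processes a list shaped like the links value above (dicts with text and href keys) using only the standard library. The function name internal_links is an assumption made for illustration.

```python
from urllib.parse import urldefrag, urlparse


def internal_links(links, base_url):
    """Keep unique same-host http(s) links, preserving first-seen order."""
    host = urlparse(base_url).netloc
    seen = set()
    result = []
    for link in links:
        href, _fragment = urldefrag(link["href"])  # drop '#section' anchors
        parts = urlparse(href)
        if parts.scheme not in ("http", "https") or parts.netloc != host:
            continue  # skip mailto:, javascript:, and external sites
        if href in seen:
            continue  # skip duplicates
        seen.add(href)
        result.append({"text": link["text"], "href": href})
    return result
```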
5. Asynchronous API
Playwright also provides an asynchronous API, which is well suited to high-concurrency workloads:
import asyncio
import hashlib
from playwright.async_api import async_playwright

async def main():
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True)
        # Open one page per URL and process them concurrently
        urls = [
            'https://example.com/page1',
            'https://example.com/page2',
            'https://example.com/page3',
        ]
        tasks = []
        for url in urls:
            page = await browser.new_page()
            tasks.append(process_page(page, url))
        await asyncio.gather(*tasks)
        await browser.close()

async def process_page(page, url):
    await page.goto(url)
    title = await page.title()
    # hash() is salted per process for strings, so use a stable digest for the file name
    name = hashlib.sha1(url.encode()).hexdigest()[:8]
    await page.screenshot(path=f'screenshot_{name}.png')
    print(f'{url} -> {title}')

asyncio.run(main())
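Opening every page at once can exhaust memory on a small VPS, since each open tab costs RAM. One common way to cap concurrency is an asyncio.Semaphore. The sketch below is one possible arrangement, not the only one: gather_bounded, fetch_title, and titles_for are names introduced here for illustration, and the limit of 2 is an arbitrary starting point.

```python
import asyncio


async def gather_bounded(items, worker, limit=3):
    """Run worker(item) for every item with at most `limit` running at once."""
    semaphore = asyncio.Semaphore(limit)

    async def guarded(item):
        async with semaphore:
            return await worker(item)

    # Results come back in the same order as items
    return await asyncio.gather(*(guarded(item) for item in items))


async def fetch_title(browser, url):
    """Open a page, read its title, and close the page to free memory."""
    page = await browser.new_page()
    try:
        await page.goto(url)
        return await page.title()
    finally:
        await page.close()


async def titles_for(urls, limit=2):
    """Fetch the titles of all urls with at most `limit` pages open at a time."""
    from playwright.async_api import async_playwright  # imported lazily

    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True)
        try:
            return await gather_bounded(
                urls, lambda url: fetch_title(browser, url), limit=limit)
        finally:
            await browser.close()
```

Running asyncio.run(titles_for(urls)) with the URL list from the example above would then process the pages two at a time.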
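6. Intercepting Network Requests
Playwright can intercept and modify network requests, which is useful for blocking ads, mocking API responses, and similar tasks:
from playwright.sync_api import sync_playwright

def handle_route(route):
    # Block image, font, and stylesheet requests to speed up loading
    if route.request.resource_type in ['image', 'font', 'stylesheet']:
        route.abort()
    else:
        route.continue_()

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()
    # Route every request through the handler
    page.route('**/*', handle_route)
    page.goto('https://example.com')
    content = page.content()
    # len() of a str counts characters, not bytes
    print(f'Page size: {len(content)} characters')
    browser.close()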
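The example above covers blocking. Mocking an API response follows the same pattern, using route.fulfill in place of route.abort. The sketch below is a hedged illustration: make_handler is a name introduced here, and the URL https://example.com/api/user in the closing comment is a made-up endpoint.

```python
import json

BLOCKED_TYPES = {"image", "font", "stylesheet"}


def make_handler(mock_responses):
    """Return a route handler: mock known URLs, block heavy resources, pass the rest.

    mock_responses maps an exact request URL to a JSON-serializable object.
    """
    def handle_route(route):
        request = route.request
        if request.url in mock_responses:
            # Answer from Python; the request never reaches the network
            route.fulfill(
                status=200,
                content_type="application/json",
                body=json.dumps(mock_responses[request.url]),
            )
        elif request.resource_type in BLOCKED_TYPES:
            route.abort()
        else:
            route.continue_()
    return handle_route


# Usage with a page created as in the example above (hypothetical endpoint):
# page.route('**/*', make_handler({'https://example.com/api/user': {'name': 'test'}}))
```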
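7. Recording Scripts
Playwright ships with a codegen tool that records browser actions and generates code automatically:
# Record actions (requires a graphical interface, so best run on a local development machine)
playwright codegen https://example.com
# Generate Python code and save it to script.py (codegen opens a browser window, so on a VPS this also needs a display, for example X11 forwarding)
playwright codegen --target python -o script.py https://example.com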
8. Configuring the Browser Context
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(
        headless=True,
        # --no-sandbox: Chromium will not start as root with the sandbox enabled
        # --disable-dev-shm-usage: use /tmp instead of a possibly small /dev/shm
        args=['--no-sandbox', '--disable-dev-shm-usage']
    )
    # A context is an isolated session with its own cookies and settings
    context = browser.new_context(
        viewport={'width': 1920, 'height': 1080},
        user_agent='Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
        locale='zh-CN',
        timezone_id='Asia/Shanghai',
        permissions=['geolocation'],
        geolocation={'latitude': 39.9042, 'longitude': 116.4074},
    )
    page = context.new_page()
    page.goto('https://example.com')
    page.screenshot(path='configured.png')
    context.close()
    browser.close()
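9. Managing the Script with systemd
To run a Playwright script as a long-running service, create a systemd unit. The unit below sets PLAYWRIGHT_BROWSERS_PATH, so the browsers must have been installed with the same variable set (for example, PLAYWRIGHT_BROWSERS_PATH=/opt/playwright-project/browsers playwright install chromium). Otherwise, remove that line so that Playwright uses its default cache location.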
cat > /etc/systemd/system/playwright-task.service <<EOF
[Unit]
Description=Playwright Automation Task
After=network.target
[Service]
Type=simple
User=root
WorkingDirectory=/opt/playwright-project
ExecStart=/opt/playwright-project/venv/bin/python script.py
Restart=on-failure
RestartSec=30
Environment=PLAYWRIGHT_BROWSERS_PATH=/opt/playwright-project/browsers
[Install]
WantedBy=multi-user.target
EOF
systemctl daemon-reload
systemctl enable playwright-task
systemctl start playwright-task
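The unit above expects script.py to keep running. A minimal sketch of such a script is shown below, assuming a periodic job and a clean exit on the SIGTERM that systemctl stop sends. The names run_forever, request_stop, and job, and the 300-second interval, are illustrative. The job body is a placeholder where the Playwright calls from the earlier sections would go.

```python
import signal
import threading

stop_event = threading.Event()


def request_stop(signum=None, frame=None):
    """Signal handler: ask the loop to exit once the current job has finished."""
    stop_event.set()


def run_forever(job, interval_seconds, stop=stop_event):
    """Call job() repeatedly until stop is set; return the number of successful runs."""
    runs = 0
    while not stop.is_set():
        try:
            job()
            runs += 1
        except Exception as exc:
            # Log and keep going, so one failed run does not end the service
            print(f"job failed: {exc!r}", flush=True)
        # Sleep between runs; wait() returns early when a stop is requested
        stop.wait(interval_seconds)
    return runs


def job():
    """Placeholder: put the Playwright task (screenshot, scrape, ...) here."""
    print("running Playwright task", flush=True)


def main():
    signal.signal(signal.SIGTERM, request_stop)  # sent by `systemctl stop`
    signal.signal(signal.SIGINT, request_stop)   # Ctrl+C when run by hand
    run_forever(job, interval_seconds=300)


# In script.py, finish with a call to main()
```

Because failures are caught inside the loop, Restart=on-failure in the unit only comes into play when the process itself dies, for example when it is killed for using too much memory.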
10. Common Problems
Browser fails to launch
This is usually caused by missing system dependencies. Install them with:
playwright install-deps
Crashes caused by insufficient memory
On a low-memory VPS, configure swap space and install only Chromium:
fallocate -l 2G /swapfile
chmod 600 /swapfile
mkswap /swapfile
swapon /swapfile
echo '/swapfile none swap sw 0 0' >> /etc/fstab
Summary
Playwright is a full-featured browser automation framework. On a BandwagonHost VPS it can handle web testing, data collection, screenshot monitoring, and other automation tasks. A plan with 2 GB of memory or more is recommended for smoother operation.