♻️ refactor: Add reject pattern for browserless to boost crawl performance #6996

Open
wants to merge 1 commit into
base: main
from


cy948
Contributor

@cy948 cy948 commented Mar 16, 2025

💻 Change Type

  • ✨ feat
  • 🐛 fix
  • ♻️ refactor
  • 💄 style
  • 👷 build
  • ⚡️ perf
  • 📝 docs
  • 🔨 chore

🔀 Description of Change

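  • packages/web-crawler/src/crawImpl/browserless.ts: Pass a reject rule to browserless through the BROWSERLESS_REJECT_REQUEST_PATTERN environment variable so that matching files, such as images and audio, are skipped during a crawl. This shortens the time it takes for a response to come back.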
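
As a rough illustration of the change described above, the request body could be assembled along these lines. This is a minimal sketch, not the code in the PR: the helper name `buildContentRequestBody`, the `Env` type, and the choice to omit the field when the variable is unset are all assumptions. The body fields (`gotoOptions`, `rejectRequestPattern`, `url`) follow the browserless log shown in this description.

```typescript
// Hypothetical helper for illustration only; names are not taken from the PR diff.
type Env = Record<string, string | undefined>;

interface ContentRequestBody {
  gotoOptions: { waitUntil: string };
  rejectRequestPattern?: string[];
  url: string;
}

function buildContentRequestBody(url: string, env: Env): ContentRequestBody {
  const body: ContentRequestBody = {
    gotoOptions: { waitUntil: 'networkidle2' },
    url,
  };

  const pattern = env.BROWSERLESS_REJECT_REQUEST_PATTERN;
  // Assumption: send the field only when a non-empty pattern is configured,
  // so the default behaviour stays unchanged when the variable is unset.
  if (pattern !== undefined && pattern.trim() !== '') {
    body.rejectRequestPattern = [pattern];
  }
  return body;
}
```

In a Node.js process this would be called as `buildContentRequestBody(targetUrl, process.env)`; passing the environment in as a parameter keeps the sketch easy to test.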

📝 Additional Information

BROWSERLESS_REJECT_REQUEST_PATTERN="\.(jpg|jpeg|png|gif|bmp|webp|svg|ico|tif|tiff|woff|woff2|raw|heic|avif|mp3|wav|ogg|flac|aac|m4a|wma|mp4|webm|mov|avi|wmv|flv|mkv|m4v|3gp|swf)(\?.*|\#.*)?$"
  • Settings received by browserless:
  browserless.io:ChromiumContentPostRoute:info 127.0.0.1 Content API invoked with body: {
  gotoOptions: { waitUntil: 'networkidle2' },
  rejectRequestPattern: [
    '\\.(jpg|jpeg|png|gif|bmp|webp|svg|ico|tif|tiff|woff|woff2|raw|heic|avif|mp3|wav|ogg|flac|aac|m4a|wma|mp4|webm|mov|avi|wmv|flv|mkv|m4v|3gp|swf)(\\?.*|\\#.*)?$'
  ],
  url: 'https://www.google.cn/'
}

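JS already handles the \ characters, but the environment variable value still needs to be wrapped in "" so that the \ is not treated as an escape character.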
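
To check which URLs the pattern covers, it can be compiled and tested locally. The sketch below assumes the pattern is applied to each request URL as a regular expression; the helper name `isRejected` is hypothetical. The doubled backslashes are TypeScript string-literal escaping only: the runtime value has single backslashes, the same as the environment variable.

```typescript
// Same pattern as the environment variable, written as a TypeScript string literal.
const rejectPattern =
  '\\.(jpg|jpeg|png|gif|bmp|webp|svg|ico|tif|tiff|woff|woff2|raw|heic|avif|mp3|wav|ogg|flac|aac|m4a|wma|mp4|webm|mov|avi|wmv|flv|mkv|m4v|3gp|swf)(\\?.*|\\#.*)?$';

// Hypothetical helper: true when the URL ends in one of the listed extensions,
// optionally followed by a query string or a fragment.
function isRejected(url: string, pattern: string): boolean {
  return new RegExp(pattern).test(url);
}

const samples = [
  'https://www.google.cn/',
  'https://www.google.cn/intl/zh-CN_cn/landing/cnexp/google-search.png',
  'https://example.com/photo.jpg?size=large',
  'https://example.com/app.js',
];

for (const url of samples) {
  console.log(isRejected(url, rejectPattern) ? 'reject' : 'allow', url);
}
```

Only the `.png` and `.jpg?size=large` samples match; the page URL and the script URL are left alone.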

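Taking a web page as an example, fetching images generally adds latency (the lines highlighted in red). With the filter rule above enabled, browserless proactively aborts the image requests (the "Aborting request" lines highlighted in green), which saves request time.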

  browserless.io:ChromiumContentPostRoute:trace 127.0.0.1 Setting up file:// protocol request rejection +0ms
  browserless.io:ChromiumContentPostRoute:trace 127.0.0.1 GET: https://www.google.cn/ +63ms
  browserless.io:ChromiumContentPostRoute:trace 127.0.0.1 200: https://www.google.cn/ +302ms
  browserless.io:ChromiumContentPostRoute:trace 127.0.0.1 Navigation to https://www.google.cn/ +3ms
-  browserless.io:ChromiumContentPostRoute:trace 127.0.0.1 GET: https://www.google.cn/intl/zh-CN_cn/landing/cnexp/google-search.png +6ms
-  browserless.io:ChromiumContentPostRoute:trace 127.0.0.1 200: https://www.google.cn/intl/zh-CN_cn/landing/cnexp/google-search.png +124ms
+   browserless.io:ChromiumContentPostRoute:debug 127.0.0.1 Aborting request GET: https://www.google.cn/intl/zh-CN_cn/landing/cnexp/google-search.png +0ms
+  browserless.io:ChromiumContentPostRoute:warn 127.0.0.1 "net::ERR_FAILED": https://www.google.cn/intl/zh-CN_cn/landing/cnexp/google-search.png +0ms
  browserless.io:ChromiumContentPostRoute:trace 127.0.0.1 GET: https://www.google.cn/favicon.ico +40ms
  browserless.io:ChromiumContentPostRoute:trace 127.0.0.1 404: https://www.google.cn/favicon.ico +222ms
  browserless.io:ChromiumContentPostRoute:trace 127.0.0.1 error: Failed to load resource: the server responded with a status of 404 () +1ms

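Note that the rule only stops browserless from downloading the media; the media URLs themselves are still present in the page. The raw text returned for LLM processing therefore still contains the image URLs, so the final experience is not affected.
Here is the raw text returned with the environment variable set as above: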

Google [ ![Google](https://www.google.cn/intl/zh-CN_cn/landing/cnexp/google-search.png) ](https://www.google.com.hk/webhp?hl=zh-CN&sourceid=cnhp) 
# The image URL extracted from the raw HTML is still present
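
One way to confirm this on other pages is to pull the image URLs out of the returned raw text. The following is only a sketch: the helper name `extractImageUrls` is hypothetical, and it assumes the raw text uses Markdown image syntax as in the sample above.

```typescript
// Hypothetical helper: collect the URLs of Markdown images, i.e. ![alt](url).
function extractImageUrls(markdown: string): string[] {
  const imagePattern = /!\[[^\]]*\]\(([^)\s]+)\)/g;
  const urls: string[] = [];
  let match: RegExpExecArray | null;
  while ((match = imagePattern.exec(markdown)) !== null) {
    urls.push(match[1]);
  }
  return urls;
}

const rawText =
  'Google [ ![Google](https://www.google.cn/intl/zh-CN_cn/landing/cnexp/google-search.png) ](https://www.google.com.hk/webhp?hl=zh-CN&sourceid=cnhp)';

// The image request was aborted during the crawl, yet its URL is still listed here.
console.log(extractImageUrls(rawText));
```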


vercel bot commented Mar 16, 2025

@cy948 is attempting to deploy a commit to the LobeChat Desktop Team on Vercel.

A member of the Team first needs to authorize it.

@dosubot dosubot bot added size:XS This PR changes 0-9 lines, ignoring generated files. ⚡️ Performance Performance issue | 性能问题 labels Mar 16, 2025
@lobehubbot
Member

👍 @cy948

Thank you for raising your pull request and contributing to our Community
Please make sure you have followed our contributing guidelines. We will review it as soon as possible.
If you encounter any problems, please feel free to connect with us.

Contributor

gru-agent bot commented Mar 16, 2025

TestGru Assignment

Summary

| Link | CommitId | Status | Reason |
| --- | --- | --- | --- |
| Detail | 9c194be | ✅ Finished | |

Files

| File | Pull Request |
| --- | --- |
| packages/web-crawler/src/crawImpl/browserless.ts | ❌ Failure (I failed to write the unit tests for the file.) |

Tip

You can @gru-agent and leave your feedback. TestGru will make adjustments based on your input


codecov bot commented Mar 16, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 91.31%. Comparing base (59cafa0) to head (9c194be).
Report is 12 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #6996      +/-   ##
==========================================
- Coverage   91.31%   91.31%   -0.01%     
==========================================
  Files         732      732              
  Lines       69143    69264     +121     
  Branches     4743     3211    -1532     
==========================================
+ Hits        63140    63249     +109     
- Misses       6003     6015      +12     
| Flag | Coverage Δ |
| --- | --- |
| app | 91.31% <100.00%> (-0.01%) ⬇️ |
| server | 97.49% <ø> (+<0.01%) ⬆️ |


@arvinxx
Contributor

arvinxx commented Mar 17, 2025

It feels like this doesn't need to be an environment variable; we could just block these directly.

@lobehubbot
Member

Bot detected the issue body's language is not English, translate it automatically. 👯👭🏻🧑‍🤝‍🧑👫🧑🏿‍🤝‍🧑🏻👩🏾‍🤝‍👨🏿👬🏿


It feels like you can do it without turning it into an environment variable, just block it directly

@cy948
Contributor Author

cy948 commented Mar 17, 2025

It feels like this doesn't need to be an environment variable; we could just block these directly.

Making it an environment variable is mainly so that the block list can be extended easily later.

@lobehubbot
Member

Bot detected the issue body's language is not English, translate it automatically. 👯👭🏻🧑‍🤝‍🧑👫🧑🏿‍🤝‍🧑🏻👩🏾‍🤝‍👨🏿👬🏿


It feels like this doesn't need to be an environment variable; just block them directly.

The main reason for making it an environment variable is to make it easier to extend the block list later.

Labels
⚡️ Performance Performance issue | 性能问题 size:XS This PR changes 0-9 lines, ignoring generated files.

3 participants