♻️ refactor: Add reject pattern for browserless to boost crawl performance #6996
base: main
@cy948 is attempting to deploy a commit to the LobeChat Desktop Team on Vercel. A member of the Team first needs to authorize it.
👍 @cy948 Thank you for raising your pull request and contributing to our Community
Codecov Report
All modified and coverable lines are covered by tests ✅
Additional details and impacted files
@@ Coverage Diff @@
## main #6996 +/- ##
==========================================
- Coverage 91.31% 91.31% -0.01%
==========================================
Files 732 732
Lines 69143 69264 +121
Branches 4743 3211 -1532
==========================================
+ Hits 63140 63249 +109
- Misses 6003 6015 +12
It seems this doesn't need to be an environment variable; the requests could simply be blocked directly.
Making it an environment variable is mainly so that a block list can be added later.
💻 Change Type
🔀 Description of Change
packages/web-crawler/src/crawImpl/browserless.ts: passes rules for requests to skip to browserless through the environment variable BROWSERLESS_REJECT_REQUEST_PATTERN. Skipping files such as images and audio speeds up the response.
📝 Additional Information
https://docs.browserless.io/baas/http-apis/content#rejecting-undesired-requests In testing, reject patterns proved more effective: they can match resource requests that carry a query string, such as those made by Next.js.
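The following is a minimal sketch of how such an environment variable might be turned into a request body for the browserless content API, not the code from this PR. It assumes the variable holds a comma-separated list of regex strings (the PR text does not state the format) and that the body field is named rejectRequestPattern, as in the browserless documentation linked above. The helper names parseRejectPatterns and buildContentRequestBody are hypothetical.

```typescript
// Hypothetical shape of the JSON body sent to the browserless content endpoint.
// Only `url` and `rejectRequestPattern` are modelled here.
interface ContentRequestBody {
  url: string;
  rejectRequestPattern?: string[];
}

// Assumption: the env value is a comma-separated list of regex strings.
// A pattern that itself contains a comma (e.g. a `{1,3}` quantifier) would
// need a different separator or encoding.
function parseRejectPatterns(raw: string | undefined): string[] {
  if (!raw) return [];
  return raw
    .split(',')
    .map((pattern) => pattern.trim())
    .filter((pattern) => pattern.length > 0);
}

// Attach the patterns only when at least one is configured, so that an
// unset variable leaves the request body unchanged.
function buildContentRequestBody(
  url: string,
  rawPatterns: string | undefined,
): ContentRequestBody {
  const patterns = parseRejectPatterns(rawPatterns);
  return patterns.length > 0 ? { url, rejectRequestPattern: patterns } : { url };
}
```

In a Node.js environment the second argument would typically be process.env.BROWSERLESS_REJECT_REQUEST_PATTERN. Passing the raw value in as a parameter keeps the helper easy to test without touching the real environment.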
Example ENV usage
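Taking a web page as an example, fetching images generally adds latency (highlighted in red). With the filter rules above enabled, browserless proactively aborts the image requests (the "Aborting request" entries highlighted in green), which saves request time. The rule only stops browserless from downloading the media; the media URLs themselves remain in the page, so image URLs still appear in the raw text returned for LLM processing and the final experience is unaffected.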
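The abort decision described above can be illustrated with a small sketch. It shows the general idea of testing each outgoing request URL against the configured regex patterns. It is not browserless's internal implementation, and the patterns and URLs are made-up examples. The function name shouldAbort is hypothetical.

```typescript
// Illustration only: decide whether a request would be rejected by checking
// its URL against each configured regex string.
function shouldAbort(requestUrl: string, patterns: string[]): boolean {
  return patterns.some((pattern) => new RegExp(pattern).test(requestUrl));
}

// Example patterns (assumptions, not taken from the PR):
// 1. static media files, optionally followed by a query string
// 2. Next.js image optimizer requests, which carry the real file in the query
const patterns = [
  '\\.(png|jpe?g|gif|webp|svg|mp3|mp4)(\\?.*)?$',
  '/_next/image\\?',
];

const requests = [
  'https://example.com/',
  'https://example.com/logo.png',
  'https://example.com/_next/image?url=%2Fhero.jpg&w=1200&q=75',
  'https://example.com/app.js',
];

// The document itself and the script pass through; the two image requests
// are the ones that would be aborted.
const aborted = requests.filter((url) => shouldAbort(url, patterns));
console.log(aborted);
```

Because only the network fetch is skipped, the img tags in the page markup are untouched, which is why the image URLs still show up in the extracted text.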
This is the raw text returned with the environment variable set as above: