chore(deps): bump unstructured from 0.10.27 to 0.17.2 #437

Open
wants to merge 1 commit into
base: main
from

Conversation

dependabot[bot]
Contributor

@dependabot dependabot bot commented on behalf of github Mar 21, 2025

Bumps unstructured from 0.10.27 to 0.17.2.

Release notes

Sourced from unstructured's releases.

0.17.2

Enhancements

  • Add image_url of images in html partitioner. <img> tags with non-data content now include a new image_url metadata field with the content of the src attribute.

  • Use lxml instead of bs4 to parse hOCR data. lxml is much faster than bs4 given that the hOCR data format is regular (guaranteed because it is programmatically generated).

  • Bump numpy to >2, and upgrade paddlepaddle, unstructured-paddleocr, and onnx so they are compatible with numpy>2.

Fixes

  • Fix an image inside a tag being classified as "UncategorizedText" with no .text.

What's Changed

Full Changelog: Unstructured-IO/unstructured@0.17.0...0.17.2

0.17.0

What's Changed

Full Changelog: Unstructured-IO/unstructured@0.16.25...0.17.0

0.16.25

Enhancements

Features

Fixes

  • Fixes filetype detection for JSON passed as byte streams. Detection now prioritizes magic mimetype prediction over the file extension.
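
The ordering described in this fix (content-based guess first, file extension second) can be sketched as follows. This is a minimal standard-library illustration, not unstructured's implementation: the release note refers to magic mimetype prediction, and here a simple JSON parse stands in for it. `sniff_content_type`, `detect_file_type`, and `EXTENSION_TYPES` are hypothetical names introduced only for this example.

```python
import json
from pathlib import PurePath
from typing import Optional

# Hypothetical extension table for this sketch; not taken from unstructured.
EXTENSION_TYPES = {
    ".json": "application/json",
    ".txt": "text/plain",
    ".html": "text/html",
}


def sniff_content_type(data: bytes) -> Optional[str]:
    """Guess a MIME type from the bytes alone; return None when unsure.

    A JSON parse stands in for magic-based prediction (assumption).
    """
    head = data.lstrip()[:1]
    if head in (b"{", b"["):
        try:
            json.loads(data.decode("utf-8"))
            return "application/json"
        except (UnicodeDecodeError, ValueError):
            return None
    return None


def detect_file_type(data: bytes, filename: Optional[str] = None) -> str:
    """Prefer the content-based guess and fall back to the file extension."""
    sniffed = sniff_content_type(data)
    if sniffed is not None:
        return sniffed
    if filename:
        suffix = PurePath(filename).suffix.lower()
        if suffix in EXTENSION_TYPES:
            return EXTENSION_TYPES[suffix]
    return "application/octet-stream"
```

With this ordering, a byte stream that parses as JSON is reported as JSON even when its filename carries a misleading extension such as `.txt`.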

0.16.24

Enhancements

  • Support dynamic partitioner file type registration. Use create_file_type to create a new file type that unstructured can handle, and register_partitioner to register your own partitioner for any file type.
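
The release note names `create_file_type` and `register_partitioner` but does not show their signatures, so the sketch below does not call them. It illustrates the general registration pattern with the standard library only. Every name in it (`FileTypeSpec`, `define_file_type`, `partitioner_for`, `partition_text`) is hypothetical and chosen to differ from the real API.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, List, Tuple


@dataclass(frozen=True)
class FileTypeSpec:
    """Hypothetical description of a file type (not unstructured's FileType)."""
    name: str
    mime_type: str
    extensions: Tuple[str, ...]


_FILE_TYPES: Dict[str, FileTypeSpec] = {}
_PARTITIONERS: Dict[str, Callable[[str], List[str]]] = {}


def define_file_type(name: str, mime_type: str, extensions: Iterable[str]) -> FileTypeSpec:
    """Create a file type at runtime and remember it in the registry."""
    spec = FileTypeSpec(name, mime_type, tuple(e.lower() for e in extensions))
    _FILE_TYPES[name] = spec
    return spec


def partitioner_for(spec: FileTypeSpec):
    """Decorator that registers a function as the partitioner for `spec`."""
    def decorator(func: Callable[[str], List[str]]):
        _PARTITIONERS[spec.name] = func
        return func
    return decorator


def partition_text(text: str, filename: str) -> List[str]:
    """Dispatch to the partitioner registered for the file's extension."""
    lowered = filename.lower()
    for spec in _FILE_TYPES.values():
        if lowered.endswith(spec.extensions) and spec.name in _PARTITIONERS:
            return _PARTITIONERS[spec.name](text)
    raise LookupError(f"no partitioner registered for {filename!r}")


# Usage: define a new type, then attach a partitioner to it.
FOO = define_file_type("FOO", "application/x-foo", [".foo"])


@partitioner_for(FOO)
def partition_foo(text: str) -> List[str]:
    # Treat each non-empty line as one element.
    return [line for line in text.splitlines() if line.strip()]
```

Check the unstructured documentation for the actual arguments of `create_file_type` and `register_partitioner` before relying on this feature after the upgrade.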

... (truncated)

Changelog

Sourced from unstructured's changelog.

0.17.2

  • Fix an image inside a tag being classified as "UncategorizedText" with no .text.

0.17.1

Enhancements

  • Add image_url of images in html partitioner. <img> tags with non-data content now include a new image_url metadata field with the content of the src attribute.

  • Use lxml instead of bs4 to parse hOCR data. lxml is much faster than bs4 given that the hOCR data format is regular (guaranteed because it is programmatically generated).

  • Bump numpy to >2, and upgrade paddlepaddle, unstructured-paddleocr, and onnx so they are compatible with numpy>2.

Features

Fixes

0.17.0

Enhancements

  • Add support for images in html partitioner. <img> tags will now be parsed as Image elements. When extract_image_block_types includes Image and extract_image_block_to_payload=True, the image_base64 will be included for images that specify base64 data (rather than a URL) as the source.

  • Use kwargs instead of env to specify ocr_agent and table_ocr_agent for hi_res strategy.

  • Stop using PageLayout.elements to save memory and CPU cost. PageLayout.elements_array is now used throughout partitioning, except when analysis=True, where the drawing logic still uses elements.
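
The HTML image entries in 0.17.0 and 0.17.1 distinguish two kinds of `<img>` source: base64 data (exposed as `image_base64`) and everything else (exposed as `image_url`). The sketch below shows that split using only Python's `html.parser`. It is an illustration of the distinction, not the partitioner's code: `ImageSourceCollector` and `collect_images` are hypothetical names, and the dictionary keys simply reuse the metadata field names from the changelog.

```python
from html.parser import HTMLParser
from typing import Dict, List


class ImageSourceCollector(HTMLParser):
    """Collect <img> sources, separating base64 data URIs from URLs."""

    def __init__(self) -> None:
        super().__init__()
        self.images: List[Dict[str, str]] = []

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        src = dict(attrs).get("src") or ""
        if not src:
            return
        if src.startswith("data:"):
            # A data URI looks like "data:image/png;base64,<payload>".
            header, _, payload = src.partition(",")
            if ";base64" in header:
                self.images.append({"image_base64": payload})
        else:
            # Non-data content: keep the src attribute as the image URL.
            self.images.append({"image_url": src})


def collect_images(html: str) -> List[Dict[str, str]]:
    """Return one dict per <img> tag, in document order."""
    collector = ImageSourceCollector()
    collector.feed(html)
    collector.close()
    return collector.images
```

Code that consumes HTML partitioning output may start seeing Image elements after this upgrade, so downstream filters that assume text-only elements are worth reviewing.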

Features

Fixes

0.16.25

Enhancements

Features

Fixes

  • Fixes filetype detection for JSON passed as byte streams. Detection now prioritizes magic mimetype prediction over the file extension.

0.16.24

Enhancements

  • Support dynamic partitioner file type registration. Use create_file_type to create a new file type that unstructured can handle, and register_partitioner to register your own partitioner for any file type.

... (truncated)

Commits
  • 0fa5174 Image within div or span with no text is annotated as Image (#3962)
  • 7de630e Feat/bump numpy to 2 (#3961)
  • 4e424ef feat: use lxml instead of bs4 to parse hOCR data (#3960)
  • 66bf4b0 feat: support extracting image url in html (#3955)
  • 2dceac3 Feat/remove reference of PageLayout.elements (#3943)
  • 8759b0a feat: allow passing down of ocr agent and table agent (#3954)
  • 0001a33 fix: pass extract image args to all partitioners (#3950)
  • c0457c1 feat: include images when partitioning html (#3945)
  • 74b0647 Fix json bytes content type detection (#3941)
  • 961c8d5 feat: use block matrix to reduce peak memory usage for matmul (#3947)
  • Additional commits viewable in compare view

Dependabot compatibility score

Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting @dependabot rebase.


Dependabot commands and options

You can trigger Dependabot actions by commenting on this PR:

  • @dependabot rebase will rebase this PR
  • @dependabot recreate will recreate this PR, overwriting any edits that have been made to it
  • @dependabot merge will merge this PR after your CI passes on it
  • @dependabot squash and merge will squash and merge this PR after your CI passes on it
  • @dependabot cancel merge will cancel a previously requested merge and block automerging
  • @dependabot reopen will reopen this PR if it is closed
  • @dependabot close will close this PR and stop Dependabot recreating it. You can achieve the same result by closing it manually
  • @dependabot show <dependency name> ignore conditions will show all of the ignore conditions of the specified dependency
  • @dependabot ignore this major version will close this PR and stop Dependabot creating any more for this major version (unless you reopen the PR or upgrade to it yourself)
  • @dependabot ignore this minor version will close this PR and stop Dependabot creating any more for this minor version (unless you reopen the PR or upgrade to it yourself)
  • @dependabot ignore this dependency will close this PR and stop Dependabot creating any more for this dependency (unless you reopen the PR or upgrade to it yourself)

Bumps [unstructured](https://github.com/Unstructured-IO/unstructured) from 0.10.27 to 0.17.2.
- [Release notes](https://github.com/Unstructured-IO/unstructured/releases)
- [Changelog](https://github.com/Unstructured-IO/unstructured/blob/main/CHANGELOG.md)
- [Commits](Unstructured-IO/unstructured@0.10.27...0.17.2)

---
updated-dependencies:
- dependency-name: unstructured
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
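
The `update-type: version-update:semver-minor` field in the commit message is consistent with comparing the two version strings component by component. The sketch below shows one way such a label could be derived. It is an assumption for illustration, not Dependabot's implementation, and `parse_version` and `classify_bump` are hypothetical helpers that only handle plain `major.minor.patch` versions.

```python
from typing import Tuple


def parse_version(version: str) -> Tuple[int, int, int]:
    """Split a plain 'major.minor.patch' string into integers."""
    major, minor, patch = (int(part) for part in version.split("."))
    return major, minor, patch


def classify_bump(old: str, new: str) -> str:
    """Name the most significant version component that changed."""
    old_parts = parse_version(old)
    new_parts = parse_version(new)
    if new_parts[0] != old_parts[0]:
        return "semver-major"
    if new_parts[1] != old_parts[1]:
        return "semver-minor"
    if new_parts[2] != old_parts[2]:
        return "semver-patch"
    return "none"


print(classify_bump("0.10.27", "0.17.2"))  # semver-minor
```

Note that under Semantic Versioning a 0.x series makes no stability promise, so a minor bump that spans 0.10 to 0.17 can still contain breaking changes and deserves a full test run before merging.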
@dependabot dependabot bot added the chore label Mar 21, 2025
@github-actions github-actions bot added the dependencies Pull requests that update a dependency file label Mar 21, 2025