Pdfplumber Versions

Plumb a PDF for detailed information about each char, rectangle, line, et cetera — and easily extract text and tables.

v0.7.5

1 year ago

Added

  • Add PageImage.show() as alias for PageImage.annotated.show(). (#715 + 5c7787b)

Fixed

  • Fixed issue where py.typed file was not included in PyPI distribution. (#698 + #703 + 6908487) [h/t @jhonatan-lopes]
  • Reinstated the ability to call utils.cluster_objects(...) with any hashable value (str, int, tuple, etc.) as the key_fn parameter, reverting breaking change in 58b1ab1. (#691 + 1e97656) [h/t @jfuruness]
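
To illustrate what "any hashable value as the key_fn parameter" means in practice, here is a minimal standalone sketch. It is not pdfplumber's implementation: the names to_getter and cluster_by_key are hypothetical, and the clustering rule (sort, then group neighbors within the tolerance) is a simplifying assumption. It only shows how a plain key such as "x0" and a callable can be treated interchangeably.

```python
from operator import itemgetter
from typing import Any, Callable, Dict, Hashable, List, Union

Obj = Dict[str, Any]
KeyFn = Union[Hashable, Callable[[Obj], float]]


def to_getter(key_fn: KeyFn) -> Callable[[Obj], float]:
    """Hypothetical helper: accept a callable or a plain hashable key."""
    return key_fn if callable(key_fn) else itemgetter(key_fn)


def cluster_by_key(objs: List[Obj], key_fn: KeyFn, tolerance: float) -> List[List[Obj]]:
    """Group objects whose key values lie within `tolerance` of the previous one."""
    getter = to_getter(key_fn)
    clusters: List[List[Obj]] = []
    for obj in sorted(objs, key=getter):
        if clusters and getter(obj) - getter(clusters[-1][-1]) <= tolerance:
            clusters[-1].append(obj)  # close enough: extend the current cluster
        else:
            clusters.append([obj])  # too far away: start a new cluster
    return clusters


chars = [
    {"text": "a", "x0": 0.0},
    {"text": "b", "x0": 0.5},
    {"text": "c", "x0": 10.0},
]

# A string key and an equivalent callable give the same grouping.
by_name = cluster_by_key(chars, "x0", tolerance=1)
by_func = cluster_by_key(chars, lambda c: c["x0"], tolerance=1)
```

With the real library, the equivalent call would presumably be utils.cluster_objects(chars, "x0", tolerance), per the entry above.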

Development Changes

  • Update Wand version in requirements.txt from >=0.6.7 to >=0.6.10. (#713 + 3457d79)

v0.7.4

1 year ago

Added

  • Add utils.outside_bbox(...) and Page.outside_bbox(...) methods, which are the inverse of utils.within_bbox(...) and Page.within_bbox(...). (#369 + 3ab1cc4)
  • Add strict=True/False parameter to Page.crop(...), Page.within_bbox(...), and Page.outside_bbox(...); default is True, while False bypasses the test_proposed_bbox(...) check. (#421 + 71ad60f)
  • Add more guidance to exception when .to_image(...) raises PIL.Image.DecompressionBombError. (#413 + b6ff9e8)
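
The relationship between the within/outside filters and the strict flag can be sketched in plain Python. This is an illustration under stated assumptions, not the library's code: it assumes "within" means an object lies fully inside the box, "outside" means it does not overlap the box at all, and the strict check rejects a box that extends past the page. All function names below are hypothetical.

```python
from typing import Dict, List, Tuple

BBox = Tuple[float, float, float, float]  # (x0, top, x1, bottom)
Obj = Dict[str, float]


def obj_bbox(obj: Obj) -> BBox:
    return (obj["x0"], obj["top"], obj["x1"], obj["bottom"])


def fully_within(inner: BBox, outer: BBox) -> bool:
    return (
        inner[0] >= outer[0]
        and inner[1] >= outer[1]
        and inner[2] <= outer[2]
        and inner[3] <= outer[3]
    )


def overlaps(a: BBox, b: BBox) -> bool:
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]


def within_bbox_sketch(objs: List[Obj], bbox: BBox) -> List[Obj]:
    """Keep objects that sit entirely inside bbox (assumed semantics)."""
    return [o for o in objs if fully_within(obj_bbox(o), bbox)]


def outside_bbox_sketch(objs: List[Obj], bbox: BBox) -> List[Obj]:
    """Keep objects that do not overlap bbox at all (assumed semantics)."""
    return [o for o in objs if not overlaps(obj_bbox(o), bbox)]


def check_proposed_bbox_sketch(bbox: BBox, page_bbox: BBox, strict: bool = True) -> None:
    """With strict=True, reject a bbox that is not fully inside the page."""
    if strict and not fully_within(bbox, page_bbox):
        raise ValueError(f"Bounding box {bbox} is not fully within {page_bbox}")


inside = {"x0": 10, "top": 10, "x1": 20, "bottom": 20}
straddling = {"x0": 45, "top": 10, "x1": 60, "bottom": 20}
far_away = {"x0": 70, "top": 10, "x1": 80, "bottom": 20}
objs = [inside, straddling, far_away]
region = (0, 0, 50, 50)

kept_within = within_bbox_sketch(objs, region)
kept_outside = outside_bbox_sketch(objs, region)
```

Under these assumptions an object that straddles the boundary lands in neither result, so the two filters are inverses only for objects that do not cross the edge of the box.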

Fixed

  • Fix PageImage conversions for PDFs with cmyk colorspaces; convert them to rgb earlier in the process. (28330da)

v0.7.3

1 year ago

Fixed

  • Quick fix for transparency issue in visual debugging mode. (b98dd7c)

v0.7.2

1 year ago

Added

Changed

  • Change .to_image(...)'s approach, preferring to composite with a white background instead of removing the alpha channel. (1cd1f9a)
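
The difference between the two approaches can be shown with per-pixel arithmetic. The sketch below is only an illustration of the general technique (standard "over" compositing onto an opaque white background). It makes no claim about how pdfplumber or its imaging backend implements the change, and both function names are hypothetical.

```python
from typing import Tuple

RGBA = Tuple[int, int, int, int]
RGB = Tuple[int, int, int]


def drop_alpha(pixel: RGBA) -> RGB:
    """Discard the alpha channel and keep the raw color values."""
    r, g, b, _ = pixel
    return (r, g, b)


def composite_over_white(pixel: RGBA) -> RGB:
    """Blend the pixel onto opaque white, weighting by its alpha (0-255)."""
    r, g, b, a = pixel

    def blend(channel: int) -> int:
        # channel * alpha + white * (1 - alpha), in rounded integer math
        return (channel * a + 255 * (255 - a) + 127) // 255

    return (blend(r), blend(g), blend(b))


# A fully transparent pixel often stores black as its (invisible) color.
transparent_black: RGBA = (0, 0, 0, 0)
opaque_color: RGBA = (10, 20, 30, 255)

dropped = drop_alpha(transparent_black)
composited = composite_over_white(transparent_black)
unchanged = composite_over_white(opaque_color)
```

Dropping the alpha channel exposes whatever color the transparent pixel happened to store, while compositing makes it white, which is what a viewer would normally show for an empty page area.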

Fixed

  • Fix bug in LayoutEngine.calculate(...) when processing char objects with len>1 representations, such as ligatures. (#683)

v0.7.0

2 years ago

Added

  • Add "matrix" property to char objects, representing the current transformation matrix. (ae6f99e)
  • Add pdfplumber.ctm submodule with class CTM, to calculate scale, skew, and translation of a current transformation matrix obtained from a char's "matrix" property. (ae6f99e)
  • Add page.search(...), an experimental feature that allows you to search a page's text via regular expressions and non-regex strings, returning the text, any regex matches, the bounding box coordinates, and the char objects themselves. (#201 + 58b1ab1)
  • Add --include-attrs/--exclude-attrs to CLI (and corresponding params to .to_json(...), .to_csv(...), and Serializer). (4deac25)
  • Add py.typed for PEP561 compatibility and detection of typing hints by mypy. (ca795d1) [h/t @jhonatan-lopes]
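
For readers unfamiliar with transformation matrices, the scale, skew, and translation mentioned for the ctm submodule above can be derived from the six matrix values (a, b, c, d, e, f) roughly as follows. This is a standalone sketch using one common decomposition. The class name MatrixSketch is hypothetical, and the formulas are an assumption about the convention rather than a copy of the library's code.

```python
import math
from typing import NamedTuple, Tuple


class MatrixSketch(NamedTuple):
    """Six-value transformation matrix (a, b, c, d, e, f)."""

    a: float
    b: float
    c: float
    d: float
    e: float
    f: float

    @property
    def scale_x(self) -> float:
        return math.hypot(self.a, self.b)

    @property
    def scale_y(self) -> float:
        return math.hypot(self.c, self.d)

    @property
    def skew_x(self) -> float:
        # Angle of the transformed y-axis, relative to vertical, in degrees.
        return math.degrees(math.atan2(self.d, self.c)) - 90

    @property
    def skew_y(self) -> float:
        # Angle of the transformed x-axis, relative to horizontal, in degrees.
        return math.degrees(math.atan2(self.b, self.a))

    @property
    def translation(self) -> Tuple[float, float]:
        return (self.e, self.f)


# Unrotated text scaled 2x horizontally and 3x vertically, moved to (10, 20).
upright = MatrixSketch(2.0, 0.0, 0.0, 3.0, 10.0, 20.0)

# The same kind of matrix rotated 90 degrees counterclockwise.
rotated = MatrixSketch(0.0, 1.0, -1.0, 0.0, 0.0, 0.0)
```

A char's "matrix" property would supply the six input values. How the library's CTM class is constructed from it is not stated in these notes, so check the project README before relying on a particular call.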

Changed

  • Bump pinned pdfminer.six version to 20220524. (486cea8)

Removed

  • Remove utils.collate_chars(...), the old name (and then alias) for utils.extract_text(...). (24f3532)

Fixed

  • Fix IndexError bug for .extract_text(layout=True) on pages without text. (#658 + ad3df11) [h/t @ethanscorey]

v0.6.2

2 years ago

The main news about this version is that it introduces type annotations, and enforces them via mypy --strict. It also fills in the few remaining gaps in the library's test coverage (although all parts of the library could still use stronger tests). See CHANGELOG.md for details.

v0.6.1

2 years ago

See CHANGELOG.md for details. Summary:

  • Bumps pinned pdfminer.six version to 20220319
  • Removes support for Python 3.6 (EOL'ed Dec. 2021), adds tested support for 3.9 and 3.10.
  • Fixes a couple of bugs

v0.6.0

2 years ago

See CHANGELOG.md for a full list of additions, changes, and fixes. In some (hopefully) rare cases, this version may introduce breaking changes, which is why we're bumping to v0.6.0. Highlights from the changelog include:

  • Upgrade pdfminer.six from 20200517 to 20211012; see that library's changelog for details, but a key difference is an improvement in how it assigns line, rect, and curve objects. (Diagonal two-point lines, for instance, are now line objects instead of curve objects.) (#515)
  • Add .extract_text(layout=True), an experimental feature which attempts to mimic the structural layout of the text on the page. (#10)
  • Remove Decimal-ization of parsed object attributes, which are now represented with as much precision as is returned by pdfminer.six. (#346 + #520)
  • .extract_text(...) returns "" instead of None when character list is empty. (#482 + cb9900b) [h/t @tungph]
  • Add --precision argument to CLI. (#520)
  • Add snap_x_tolerance and snap_y_tolerance to table extraction settings. (#51 + #475) [h/t @dustindall]
  • Add join_x_tolerance and join_y_tolerance to table extraction settings. (cbb34ce)
  • .extract_words(...) now includes doctop among the attributes it returns for each word. (66fef89)
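
As a rough usage sketch, the four tolerance settings named in the highlights above could be collected in a settings dictionary and handed to table extraction. The numeric values are purely illustrative, the helper name extract_first_table is hypothetical, and the code assumes that Page.extract_table accepts such a dictionary as its table_settings argument.

```python
from typing import Any, Dict, List, Optional

# Illustrative values only; tune them for the document at hand.
TABLE_SETTINGS: Dict[str, Any] = {
    "snap_x_tolerance": 3,
    "snap_y_tolerance": 3,
    "join_x_tolerance": 3,
    "join_y_tolerance": 3,
}


def extract_first_table(
    pdf_path: str, settings: Optional[Dict[str, Any]] = None
) -> Optional[List[List[Optional[str]]]]:
    """Hypothetical helper: extract a table from the first page of a PDF."""
    import pdfplumber  # imported lazily so the settings can be defined without it

    with pdfplumber.open(pdf_path) as pdf:
        page = pdf.pages[0]
        return page.extract_table(table_settings=settings or TABLE_SETTINGS)
```

Keeping the settings in one dictionary makes it easy to adjust the x and y tolerances independently when lines in one direction are drawn less precisely than in the other.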

And many thanks to @samkit-jain for his feedback and review of contributions to this release. 🎉

v0.5.28

3 years ago

From CHANGELOG.md:

Added

  • Add --laparams flag to CLI. (#407)

Changed

  • Change .convert_csv(...) to order objects first by page number, rather than object type. (#407)
  • Change .convert_csv(...), .convert_json(...), and CLI so that, by default, they return all available object types, rather than those in a predefined default list. (#407)

Fixed

  • Fix .extract_text(...) so that it can accept generator objects as its main parameter. (#385) [h/t @alexreg]
  • Fix page-parsing so that LTAnno objects (which have no bounding-box coordinates) are not extracted. (Was only an issue when setting laparams.) (#388)
  • Fix Page.extract_table(...) so that it honors text tolerance settings. (#415) [h/t @trifling]

v0.5.27

3 years ago

From CHANGELOG.md:

Fixed

  • Fix regression (introduced in 0.5.26/b1849f4) in closing files opened by PDF.open.
  • Reinstate access to higher-level layout objects (such as textboxhorizontal) when laparams is passed to pdfplumber.open(...). Had been removed in 0.5.24 via 1f87898. (#359 + #364)

Development Changes

  • Add a python setup.py build sdist test to main GitHub action. (#365)