Plumb a PDF for detailed information about each char, rectangle, line, et cetera — and easily extract text and tables.
- py.typed file was not included in PyPI distribution. (#698 + #703 + 6908487) [h/t @jhonatan-lopes]
- utils.cluster_objects(...) with any hashable value (str, int, tuple, etc.) as the key_fn parameter, reverting breaking change in 58b1ab1. (#691 + 1e97656) [h/t @jfuruness]
- utils.outside_bbox(...) and Page.outside_bbox(...) method, which are the inverse of utils.within_bbox(...) and Page.within_bbox(...). (#369 + 3ab1cc4)
- strict=True/False parameter to Page.crop(...), Page.within_bbox(...), and Page.outside_bbox(...); default is True, while False bypasses the test_proposed_bbox(...) check. (#421 + 71ad60f)
- .to_image(...) raises PIL.Image.DecompressionBombError. (#413 + b6ff9e8)
- PageImage conversions for PDFs with cmyk colorspaces; convert them to rgb earlier in the process. (28330da)
- split_at_punctuation parameter to .extract_words(...) and .extract_text(...). (#682) [h/t @lolipopshock]
- .to_image(...)'s approach, preferring to composite with a white background instead of removing the alpha channel. (1cd1f9a)
- LayoutEngine.calculate(...) when processing char objects with len>1 representations, such as ligatures. (#683)
- "matrix" property to char objects, representing the current transformation matrix. (ae6f99e)
- pdfplumber.ctm submodule with class CTM, to calculate scale, skew, and translation of a current transformation matrix obtained from a char's "matrix" property. (ae6f99e)
- page.search(...), an experimental feature that allows you to search a page's text via regular expressions and non-regex strings, returning the text, any regex matches, the bounding box coordinates, and the char objects themselves. (#201 + 58b1ab1)
- --include-attrs/--exclude-attrs to CLI (and corresponding params to .to_json(...), .to_csv(...), and Serializer). (4deac25)
- py.typed for PEP 561 compatibility and detection of typing hints by mypy. (ca795d1) [h/t @jhonatan-lopes]
- pdfminer.six version to 20220524. (486cea8)
- utils.collate_chars(...), the old name (and then alias) for utils.extract_text(...). (24f3532)

The main news about this version is that it introduces type annotations, and enforces them via mypy --strict. It also fills in the few remaining gaps in the library's test coverage (although all parts of the library could still use stronger tests). See CHANGELOG.md for details.
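The CTM entry in the list above mentions calculating scale, skew, and translation from a char's "matrix" property. As a rough illustration of the arithmetic involved, here is a minimal sketch. It assumes the matrix is a six-number tuple (a, b, c, d, e, f) in the usual PDF convention. The function name decompose_ctm and the returned keys are hypothetical, and this is not pdfplumber's CTM class.

```python
import math
from typing import Dict, Tuple

# Assumed layout: (a, b, c, d, e, f), the usual PDF transformation matrix order.
Matrix = Tuple[float, float, float, float, float, float]


def decompose_ctm(matrix: Matrix) -> Dict[str, float]:
    """Split a PDF-style matrix into scale, skew, and translation parts.

    Hypothetical helper for illustration only; not part of pdfplumber.
    """
    a, b, c, d, e, f = matrix
    return {
        # Lengths of the transformed x and y unit vectors.
        "scale_x": math.hypot(a, b),
        "scale_y": math.hypot(c, d),
        # Angle (degrees) of the transformed y axis away from the vertical.
        "skew_x": math.degrees(math.atan2(c, d)),
        # Angle (degrees) of the transformed x axis away from the horizontal.
        "skew_y": math.degrees(math.atan2(b, a)),
        # The last two entries are the offsets.
        "translation_x": e,
        "translation_y": f,
    }


# A matrix that doubles x, triples y, and shifts by (10, 20).
parts = decompose_ctm((2.0, 0.0, 0.0, 3.0, 10.0, 20.0))
print(parts["scale_x"], parts["scale_y"], parts["translation_x"], parts["translation_y"])
```

Note that in this sketch a plain rotation shows up as equal and opposite skew angles, since both are measured from the untransformed axes.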
See CHANGELOG.md for details. Summary:
- pdfminer.six version to 20220319
See CHANGELOG.md for a full list of additions, changes, and fixes. In some (hopefully) rare cases, this version may introduce breaking changes, which is why we're bumping to v0.6.0. Highlights from the changelog include:
- pdfminer.six from 20200517 to 20211012; see that library's changelog for details, but a key difference is an improvement in how it assigns line, rect, and curve objects. (Diagonal two-point lines, for instance, are now line objects instead of curve objects.) (#515)
- .extract_text(layout=True), an experimental feature which attempts to mimic the structural layout of the text on the page. (#10)
- pdfminer.six (#346 + #520)
- .extract_text(...) returns "" instead of None when character list is empty. (#482 + cb9900b) [h/t @tungph]
- --precision argument to CLI (#520)
- snap_x_tolerance and snap_y_tolerance to table extraction settings. (#51 + #475) [h/t @dustindall]
- join_x_tolerance and join_y_tolerance to table extraction settings. (cbb34ce)
- .extract_words(...) now includes doctop among the attributes it returns for each word. (66fef89)

And many thanks to @samkit-jain for his feedback and review of contributions to this release. 🎉
From CHANGELOG.md:
- --laparams flag to CLI. (#407)
- .convert_csv(...) to order objects first by page number, rather than object type. (#407)
- .convert_csv(...), .convert_json(...), and CLI so that, by default, they return all available object types, rather than those in a predefined default list. (#407)
- .extract_text(...) so that it can accept generator objects as its main parameter. (#385) [h/t @alexreg]
- LTAnno objects (which have no bounding-box coordinates) are not extracted. (Was only an issue when setting laparams.) (#388)
- Page.extract_table(...) so that it honors text tolerance settings (#415) [h/t @trifling]

From CHANGELOG.md:
- 0.5.26/b1849f4) in closing files opened by PDF.open
- textboxhorizontal) when laparams is passed to pdfplumber.open(...). Had been removed in 0.5.24 via 1f87898. (#359 + #364)
- python setup.py build sdist test to main GitHub action. (#365)