Python typography enhancer tool for lxml-based HTML and raw text
A tool that applies a set of common typography rules to text: non-breaking spaces, dash and ellipsis replacements, etc.
Tested under Python 2.7 and 3.4.
For HTML:
from chakert import Typograph
markup = '<p><b>Typography</b> is the art and technique of arranging type.</p>'
result_markup = Typograph.typograph_html(markup)
For a parsed lxml.html tree:
from chakert import Typograph
from lxml import html
markup = '<p><b>Typography</b> is the art and technique of arranging type.</p>'
tree = html.fromstring(markup)
result_tree = Typograph.typograph_tree(tree)
For plain text:
from chakert import Typograph
text = 'Typography is the art and technique of arranging type.'
result_text = Typograph.typograph_text(text)
The typograph parses the given text and splits it into tokens. Each token is an instance of a chakert.tokens.Token subclass. Each Token subclass represents a class of specific lexemes with defined replacement rules. Token subclasses include, for example, WordToken, ParticleToken, AbbrToken, SpaceToken, PunctuationToken, QuoteToken, DashToken, NbspToken and DigitsToken.
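To make the idea of class-based tokens concrete, here is a minimal sketch of a tokenizer that sorts pieces of text into a few token classes. It is not chakert's implementation: the class names only mirror those listed above, and the patterns, the `content` attribute and the `tokenize` function are assumptions made for illustration.

```python
import re


# Hypothetical token classes for illustration only; the real classes
# live in chakert.tokens and may have a different interface.
class Token(object):
    def __init__(self, content):
        self.content = content

    def __repr__(self):
        return '{}({!r})'.format(type(self).__name__, self.content)


class WordToken(Token): pass
class DigitsToken(Token): pass
class SpaceToken(Token): pass
class DashToken(Token): pass
class PunctuationToken(Token): pass


# Ordered (pattern, class) pairs: the first pattern that matches wins.
TOKEN_PATTERNS = [
    (re.compile(r'\d+', re.UNICODE), DigitsToken),
    (re.compile(r'[^\W\d_]+', re.UNICODE), WordToken),   # runs of letters
    (re.compile(r'\s+', re.UNICODE), SpaceToken),
    (re.compile(u'[-\u2013\u2014]+'), DashToken),
    (re.compile(r'.', re.DOTALL), PunctuationToken),     # fallback: any single character
]


def tokenize(text):
    """Split text into a flat list of tokens, losing no characters."""
    tokens = []
    pos = 0
    while pos < len(text):
        for pattern, cls in TOKEN_PATTERNS:
            match = pattern.match(text, pos)
            if match:
                tokens.append(cls(match.group()))
                pos = match.end()
                break
    return tokens


if __name__ == '__main__':
    for token in tokenize(u'Type - 42 pt.'):
        print(token)
```

Because the last pattern matches any single character, the loop always advances, and joining the `content` of all tokens gives back the original text.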
A set of rules for the Russian language.
A set of rules for the English language.
In contrast with regexp-based typography fixers, the key feature of chakert is readability of rules and expected simplicity of adding new rules.
The library uses a tokenizer that splits the given text into tokens of various classes. Each token class defines its own replacement rules in its morph method. In this method, you can iterate over sibling nodes in the forward and backward directions and perform simple text-changing operations through the provided API: remove a token, or replace one token with another.
The only thing you really need to care about is keeping the iterator state up to date while removing or adding a token. It may be useful to study the chakert implementation to understand how it works.
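The following sketch shows what a morph-style rule and the iterator bookkeeping might look like. It is a toy model, not chakert's API: the `prev`/`next` links, `replace_with`, the return convention of `morph` and the dash rule itself are all assumptions chosen to keep the example small.

```python
NBSP = u'\u00a0'
MDASH = u'\u2014'


# Hypothetical classes for illustration; chakert's real Token API may differ.
class Token(object):
    def __init__(self, content):
        self.content = content
        self.prev = None
        self.next = None

    def replace_with(self, new):
        """Put `new` in this token's place in the chain and return it."""
        new.prev, new.next = self.prev, self.next
        if self.prev is not None:
            self.prev.next = new
        if self.next is not None:
            self.next.prev = new
        return new

    def morph(self):
        """Apply this class's rule; return the token to continue from."""
        return self


class WordToken(Token): pass
class SpaceToken(Token): pass
class NbspToken(Token): pass


class DashToken(Token):
    def morph(self):
        # Illustrative rule: " - " becomes NBSP + em dash + space.
        if isinstance(self.prev, SpaceToken) and isinstance(self.next, SpaceToken):
            self.prev.replace_with(NbspToken(NBSP))
            return self.replace_with(DashToken(MDASH))
        return self


def typograph(tokens):
    # A sentinel at the front lets a rule replace the first real token safely.
    start = Token(u'')
    chain = [start] + list(tokens)
    for left, right in zip(chain, chain[1:]):
        left.next, right.prev = right, left
    node = start.next
    while node is not None:
        # Continue from whatever morph returns, not from the old node.
        node = node.morph().next
    parts = []
    node = start.next
    while node is not None:
        parts.append(node.content)
        node = node.next
    return u''.join(parts)
```

The loop advances from the value returned by `morph`, which is one simple way to keep the iteration state consistent when a rule swaps the current token for a new one.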
If you are ready to suggest new rules or new languages, but you're not sure you can implement them well, feel free to send pull requests with test cases!
The general policy for rules included by default is that they should be applicable to any general text and should not be complicated.
For a Jinja2 template tag and filter example, see jinja2_chakert.py.
You can enable them by adding the following parameters to the jinja2 Environment:
jinja2.Environment(
    ...
    extensions=[
        'jinja2_chakert.TypographExtension',
    ],
    filters={'typograph': jinja2_chakert.do_typograph}
)
Note: it is your choice whether to correct typography when you save a text or when you render it in a template. In most cases, the first option is preferable.