PyCantonese Versions

Cantonese Linguistics and NLP

v3.1.0.dev3

3 years ago

This is another development release towards v3.1.0. Compared to v3.1.0.dev2, it fixes more word segmentation issues, which in turn improves the part-of-speech tagger under development.

Installing this version from the GitHub source requires Git LFS; install it first if it is not already on your system.

Corresponding PyPI release: https://pypi.org/project/pycantonese/3.1.0.dev3/

v3.1.0.dev2

3 years ago

This is a development release to tag some unreleased features, particularly a part-of-speech tagger under development. (Installing this version from the GitHub source likely requires Git LFS on your system.)

Corresponding PyPI release: https://pypi.org/project/pycantonese/3.1.0.dev2/

v3.0.0

3 years ago

[3.0.0] - 2020-10-25

Added

Word segmentation:
- Segmentation is now customizable for the following:
  - Maximum word length
  - A user-supplied list of words to allow as words
  - A user-supplied list of words to disallow as words
- The default segmentation model has been improved with the rime-cantonese data (CC BY 4.0 license).
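
To illustrate how the three customization options interact, here is a minimal sketch of a forward longest-match segmenter. It is a toy stand-in, not PyCantonese's actual model or API: the function `toy_segment`, its parameters, and the three-word vocabulary are all hypothetical and chosen only for illustration.

```python
def toy_segment(text, vocab, max_word_length=5, allow=(), disallow=()):
    """Toy forward longest-match segmenter (illustration only)."""
    # Effective vocabulary: base words plus allowed words, minus disallowed ones.
    words = (set(vocab) | set(allow)) - set(disallow)
    result = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, shrinking down to a single character.
        for length in range(min(max_word_length, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            # A single character is always accepted as a fallback.
            if length == 1 or candidate in words:
                result.append(candidate)
                i += length
                break
    return result


VOCAB = {"香港", "廣東", "廣東話"}  # hypothetical mini vocabulary
text = "香港人講廣東話"

default = toy_segment(text, VOCAB)
with_allow = toy_segment(text, VOCAB, allow={"香港人"})
with_disallow = toy_segment(text, VOCAB, disallow={"廣東話"})
short_words = toy_segment(text, VOCAB, max_word_length=2)
```

Under these assumptions, allowing 香港人 merges it into a single word, disallowing 廣東話 splits it into 廣東 and 話, and capping the word length at 2 produces the same split.
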
Characters-to-Jyutping conversion:
- The conversion returns results in a word-segmented form.
- The conversion model has been improved with the rime-cantonese data (CC BY 4.0 license).
Added the following functions; they are equivalent to their (now deprecated) x2y counterparts:
- characters_to_jyutping
- jyutping_to_tipa
- jyutping_to_yale
Added support for Python 3.9.

Changed

API-breaking Changes

jyutping_to_yale: The default value of the keyword argument as_list has been changed from False to True, so that this function is now more in line with the other "jyutping_to_X" functions for returning a list.
characters_to_jyutping: The returned value is now a list of segmented words, where each is a 2-tuple of (Cantonese characters, Jyutping). Previously, it was a list of Jyutping strings for the individual Cantonese characters.
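
Code written against the old return value of characters_to_jyutping can be adapted with a small helper. The sketch below assumes the new shape described above and uses a hand-written sample value rather than real library output. The helper name `per_character_jyutping` is hypothetical, and splitting on the tone digits 1 to 6 is an assumption about how syllables are delimited.

```python
import re

# Hand-written sample in the v3.0.0 shape: a list of
# (Cantonese characters, Jyutping) 2-tuples, one per segmented word.
segmented = [
    ("香港人", "hoeng1gong2jan4"),
    ("講", "gong2"),
    ("廣東話", "gwong2dung1waa2"),
]


def per_character_jyutping(pairs):
    """Flatten word-level pairs back into one Jyutping string per syllable."""
    syllables = []
    for _chars, jyutping in pairs:
        # Guard against a missing romanization (an assumption about edge cases).
        if jyutping:
            # Each syllable is taken to be letters followed by one tone digit.
            syllables.extend(re.findall(r"[a-z]+[1-6]", jyutping))
    return syllables


flat = per_character_jyutping(segmented)
words = [chars for chars, _jyutping in segmented]
```

Here `flat` has one entry per syllable, which approximates the pre-3.0.0 list, while `words` recovers the segmentation that the new return value carries.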

Non-API-breaking Changes

Switched documentation to the readthedocs theme and numpydoc docstring style.
Improved CircleCI builds with orbs.

Deprecated

The following x2y functions have been deprecated in favor of their counterparts named as x_to_y.
- characters2jyutping
- jyutping2tipa
- jyutping2yale
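
One common way to keep an old name working while steering callers to the new one is a thin alias that emits a DeprecationWarning. The following is a generic sketch using only the standard library, not PyCantonese's actual code; `deprecated_alias`, `x_to_y`, and `x2y` are hypothetical names.

```python
import functools
import warnings


def deprecated_alias(new_func, old_name):
    """Return a wrapper that warns, then forwards to new_func."""

    @functools.wraps(new_func)
    def wrapper(*args, **kwargs):
        warnings.warn(
            f"{old_name}() is deprecated; use {new_func.__name__}() instead.",
            DeprecationWarning,
            stacklevel=2,  # point the warning at the caller, not the wrapper
        )
        return new_func(*args, **kwargs)

    wrapper.__name__ = old_name
    return wrapper


def x_to_y(value):
    """Hypothetical new-style function; it just upper-cases its input."""
    return value.upper()


# The old x2y name stays importable but warns on every call.
x2y = deprecated_alias(x_to_y, "x2y")

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    result = x2y("abc")
```

Because the alias forwards all arguments unchanged, both names return the same result, which matches the changelog's description of the functions as equivalent.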

Security

Turned on HTTPS for the pycantonese.org domain.

v2.4.1

3 years ago

[2.4.1] - 2020-10-10

Fixed

Switched the wordseg dependency to the PyPI source instead of a GitHub direct link.

v2.4.0

3 years ago

[2.4.0] - 2020-10-10

Added

Added the characters2jyutping() function for converting Cantonese characters to Jyutping romanization.
Added the segment() function for word segmentation.

v2.3.0

3 years ago

[2.3.0] - 2020-07-24

Added

Added support for Python 3.7 and 3.8.

Removed

Dropped support for Python 3.4 and 3.5; the supported versions are now 3.6, 3.7, and 3.8.

v2.2.0

5 years ago

[2.2.0] - 2018-06-30

Added

Added 104 stop words.

v2.1.0

6 years ago

[2.1.0] - 2018-06-11

Added

Exposed the exclude parameter in various reader methods for excluding specific participants. This parameter was implemented in pylangacq v0.10.0.

Fixed

Allowed "n" to be a syllabic nasal.
Fixed the corpus reader not picking up the characters.

v2.0.0

8 years ago

Major update: Shift to the CHAT transcription format for HKCanCor and custom corpus datasets.

v1.0

8 years ago

Overall code restructuring
Only Python 3.x is supported from this point onwards
Used generators instead of lists for corpus access methods
Added the part-of-speech search criterion
Added Jyutping-to-Yale conversion
Added Jyutping-to-TIPA conversion
Disabled the function for reading a custom corpus dataset (it will come back)