Pycantonese Versions

Cantonese Linguistics and NLP

v3.1.0.dev3

3 years ago

This is another development release towards v3.1.0. Compared to v3.1.0.dev2, it fixes more word segmentation issues in order to improve the part-of-speech tagger under development.

Installing this version from the GitHub source requires Git LFS; install Git LFS first if it is not already on your system.

Corresponding PyPI release: https://pypi.org/project/pycantonese/3.1.0.dev3/

v3.1.0.dev2

3 years ago

This is a development release to tag some unreleased features, particularly a part-of-speech tagger under development. (Installing this version from the GitHub source likely requires Git LFS on your system.)

Corresponding PyPI release: https://pypi.org/project/pycantonese/3.1.0.dev2/

v3.0.0

3 years ago

[3.0.0] - 2020-10-25

Added

  • Word segmentation:
    • Segmentation is now customizable for the following:
      • Maximum word length
      • A user-supplied list of words to allow as words
      • A user-supplied list of words to disallow as words
    • The default segmentation model has been improved with the rime-cantonese data (CC BY 4.0 license).
  • Characters-to-Jyutping conversion:
    • The conversion returns results in a word-segmented form.
    • The conversion model has been improved with the rime-cantonese data (CC BY 4.0 license).
  • Added the following functions; they are equivalent to their (now deprecated) x2y counterparts:
    • characters_to_jyutping
    • jyutping_to_tipa
    • jyutping_to_yale
  • Added support for Python 3.9.
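To illustrate how the three segmentation options listed above (maximum word length, allowed words, disallowed words) can interact, here is a minimal sketch of a greedy longest-match segmenter. It is a toy under stated assumptions, not PyCantonese's implementation: the function name `toy_segment`, its parameters, and the tiny vocabulary are all hypothetical and chosen only for this example.

```python
def toy_segment(text, vocabulary, max_word_length=5, allow=None, disallow=None):
    """Greedy longest-match segmentation (toy illustration, hypothetical API)."""
    # Allowed words extend the vocabulary; disallowed words are removed from it.
    words = (set(vocabulary) | set(allow or ())) - set(disallow or ())
    result = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, shrinking until a known word is found.
        for length in range(min(max_word_length, len(text) - i), 0, -1):
            candidate = text[i : i + length]
            # A single character is always accepted as a fallback.
            if length == 1 or candidate in words:
                result.append(candidate)
                i += length
                break
    return result


# Hypothetical vocabulary for demonstration only.
vocab = {"廣東", "廣東話", "容易"}

default_result = toy_segment("廣東話容易學", vocab)
disallow_result = toy_segment("廣東話容易學", vocab, disallow={"廣東話"})
short_result = toy_segment("廣東話容易學", vocab, max_word_length=2)
allow_result = toy_segment("廣東話容易學", vocab, allow={"容易學"})

print(default_result)   # ['廣東話', '容易', '學']
print(disallow_result)  # ['廣東', '話', '容易', '學']
print(short_result)     # ['廣東', '話', '容易', '學']
print(allow_result)     # ['廣東話', '容易學']
```

In this sketch, disallowing a word or lowering the maximum word length forces shorter matches, while allowing a word lets a longer match win.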

Changed

API-breaking Changes

  • jyutping_to_yale: The default value of the keyword argument as_list has been changed from False to True, so that this function is more in line with the other "jyutping_to_X" functions, which return a list.
  • characters_to_jyutping: The returned value is now a list of segmented words, where each is a 2-tuple of (Cantonese characters, Jyutping). Previously, it was a list of Jyutping strings for the individual Cantonese characters.
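Code that relied on the old per-character return shape of characters_to_jyutping may need adapting. The following is a minimal sketch of one way to flatten the new list of (characters, Jyutping) 2-tuples back into one Jyutping string per character. The helper name `per_character_jyutping` is hypothetical, the sample data is hand-written for illustration rather than captured library output, and treating a missing Jyutping value as `None` is an assumption.

```python
import re

# One Jyutping syllable: lowercase letters followed by a tone digit 1-6.
SYLLABLE = re.compile(r"[a-z]+[1-6]")


def per_character_jyutping(segmented):
    """Flatten [(characters, jyutping), ...] into one entry per character."""
    flat = []
    for characters, jyutping in segmented:
        syllables = SYLLABLE.findall(jyutping) if jyutping else []
        if len(syllables) == len(characters):
            flat.extend(syllables)
        else:
            # Syllables cannot be aligned with characters; mark each as unknown.
            flat.extend([None] * len(characters))
    return flat


# Hand-written sample in the new (characters, Jyutping) shape.
sample = [
    ("香港人", "hoeng1gong2jan4"),
    ("講", "gong2"),
    ("廣東話", "gwong2dung1waa2"),
]

flat = per_character_jyutping(sample)
print(flat)
# ['hoeng1', 'gong2', 'jan4', 'gong2', 'gwong2', 'dung1', 'waa2']
```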

Non-API-breaking Changes

  • Switched documentation to the readthedocs theme and numpydoc docstring style.
  • Improved CircleCI builds with orbs.

Deprecated

  • The following x2y functions have been deprecated in favor of their x_to_y counterparts:
    • characters2jyutping
    • jyutping2tipa
    • jyutping2yale
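A common way to keep an old name working while steering users to the new one is a thin alias that emits a DeprecationWarning. The sketch below shows that general pattern only; it is not taken from PyCantonese's source, and whether the library's deprecated functions warn in this way is an assumption. The names `deprecated_alias`, `x_to_y`, and `x2y` are hypothetical stand-ins.

```python
import functools
import warnings


def deprecated_alias(new_func, old_name):
    """Return a wrapper that warns, then delegates to new_func."""

    @functools.wraps(new_func)
    def wrapper(*args, **kwargs):
        warnings.warn(
            f"{old_name} is deprecated; use {new_func.__name__} instead.",
            DeprecationWarning,
            stacklevel=2,
        )
        return new_func(*args, **kwargs)

    # Report the old name when the alias is inspected.
    wrapper.__name__ = old_name
    return wrapper


def x_to_y(value):
    """Hypothetical stand-in for a renamed conversion function."""
    return value.upper()


# The old-style name stays callable but warns on each call.
x2y = deprecated_alias(x_to_y, "x2y")
```

With this pattern, callers of the old name get the same result as before plus a warning that names the replacement.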

Security

  • Turned on HTTPS for the pycantonese.org domain.

v2.4.1

3 years ago

[2.4.1] - 2020-10-10

Fixed

  • Switched the wordseg dependency to the PyPI source instead of a GitHub direct link.

v2.4.0

3 years ago

[2.4.0] - 2020-10-10

Added

  • Added the characters2jyutping() function for converting Cantonese characters to Jyutping romanization.
  • Added the segment() function for word segmentation.

v2.3.0

3 years ago

[2.3.0] - 2020-07-24

Added

  • Added support for Python 3.7 and 3.8.

Removed

  • Dropped support for Python 3.4 and 3.5 (supporting 3.6, 3.7, and 3.8 now).

v2.2.0

5 years ago

[2.2.0] - 2018-06-30

Added

  • Added 104 stop words.

v2.1.0

6 years ago

[2.1.0] - 2018-06-11

Added

  • Exposed the exclude parameter in various reader methods for excluding specific participants. This parameter was implemented in pylangacq v0.10.0.

Fixed

  • Allowed "n" to be a syllabic nasal.
  • Fixed corpus reader not picking up the characters.

v2.0.0

8 years ago

Major update: Shift to the CHAT transcription format for HKCanCor and custom corpus datasets.

v1.0

8 years ago

  • Overall code restructuring
  • Only Python 3.x is supported from this point onwards
  • Used generators instead of lists for corpus access methods
  • Added the part-of-speech search criterion
  • Added Jyutping-to-Yale conversion
  • Added Jyutping-to-TIPA conversion
  • Disabled the function for reading a custom corpus dataset (it will come back)