- Added: support for fixing arabic letter yeh barree on
fix_misc_non_persian_chars - Added: replace
U+01c3withǃ - Added: tab/space/zwnj/zwj/nbsp between two new-lines, props @zoghal
- Added:
remove_spaces_before_ellipsisto keeping space before ellipsis - Added: converts all angled dash (
¬) into zwnj - Added: helper method for reverse a string with recursion
- Added: angled dash (
¬) to preserved no-break space entities - Changed:
cleanup_line_breakscleans whitespace/zwnj between new-lines - Changed:
fix_diacriticscleans more than one of each diacritic characters, props @Amm1rr - Changed: now
fix_misc_non_persian_charsapplies on entire text
- Added: removes spaces between dots on
fix_three_dots - Added: new option
normalize_datesto re-order date parts with slash as delimiter - Added: new option
fix_misc_spacingto remove space before braces containing numbers - Added: new option
remove_diacriticsto remove all diacritic characters - Added: removes markdown link spaces inside normal parentheses
- Changed: early fix persian glyphs
- Changed: removing space between different/same marks
- Fixed: lazy seek before dashes on frontmatter preserving
- Fixed: combined preserving markdown links
- Fixed: avoid new-lines as whitespaces
- Fixed: combined pattern for removing kashidas
- Fixed: prevent removing spaces after punctuations on the end of lines
- Fixed: prevent removing double spaces on the end of lines
- Fixed: prevent removing new-lines on normalize braces
- Added: cleanup zwnj before/after ellipsis
- Added: converting kashida between numbers to ndash
- Added: new option for converting arabic hamzeh
- Added: replaces spaces after ellipsis
- Added: support for markdown images
- Added: support for more punctuations before/after zwnjs cleanup
- Changed: also removes nbsp from beginning of new-lines
- Changed: late checks for zwnjs
- Changed: revert to the old and big uri pattern, ref
- Fixed: account for punctuations after domain tlds
- Fixed: account for space before image opening brace within links
- Fixed: also single padding on the beginning of the text
- Fixed: check for non-space after space and prefixes
- Fixed: check for persian chars before suffix spacings
- Fixed: optional space after preservers
- Added: cleaning more than one of diacritic chars on
fix_diacritics, props @languagetool-org - Added: extra method for converting persian numbers back
- Added: fix another arabic kaf char on
fix_misc_non_persian_chars - Added: removes space before common domain tlds on
fix_spacing_for_punctuations - Added: replace comma between numbers to thousands separators on
fix_numeral_symbols - Added: support for man tan shan suffixes
- Changed: begin/end space cleanup after preservers
- Changed: extract diacritics fixes as new option:
fix_diacritics - Changed: yet another pattern for preserving URIs, (ref)
- Fixed: fix ha haye before other suffixes
- Fixed: support for more punctuation types after suffixes
- Added: (undocumented) fix heh + ye, alternative to
fix_hamzeh - Added: cleaning whitespace/zwnj between new-lines on
cleanup_spacing - Added: new option
fix_numeral_symbolsto replace percent signs and decimal separators, props @ebraminio/persiantools - Added: new option
fix_persian_glyphsto replace glyph chars, props @ebraminio/persiantools - Added: new option
fix_suffix_miscto fix hamza with double yeh, props @ebraminio/persiantools - Added: new option
markdown_normalize_braces - Added: new option
markdown_normalize_lists - Added: new option
preserve_frontmatterto preserve frontmatter data - Added: padding the end of string while cleanup
- Added: re-ordering extra marks:
?!into!? - Added: removing space between same marks
- Added: removing unnecessary zwnj on start/end of each line, props @ebraminio/persiantools
- Added: replacing more than one english question mark with just one
- Added: skip cleanup if text is empty or whitespace
- Added: storing markdown links separetly to help space cleanup working
- Changed: deprecated
aggresiveoption - Changed: moved cleaning whitespaces before newlines to
cleanup_begin_and_end - Changed: new option
fix_spacing_for_punctuationsextracted fromfix_spacing_for_braces_and_quotes - Changed: simpler pattern for preserving URIs
- Changed: some options disabled by default:
preserve_braces,preserve_brackets,skip_markdown_ordered_lists_numbers_conversion - Fixed: account for punctuations, braces and quots after suffixes
- Fixed: account for zwnj after yeh on fixing hamzeh
- Fixed: copy options object before parsing
- Fixed: putting back correct whitespace on cleaning zwnjs
- Fixed: unescaped char on space after dots in numbers
- Added: new option
normalize_ellipsisto replace more than one ellipsis with one - Added: convert all soft hyphens into zwnj, on
cleanup_zwnj - Added: remove direction marks from begin and end of text, on
cleanup_begin_and_end - Added: extra method for flipping punctuations
- Fixed: fix three dots as standalone method
- Added: initial check for type of the input
- Added: new option
cleanup_line_breaksto remove more than two contiguous line breaks - Added: new option
preserve_entitiesto preserve html non decoded entities - Added: new option
preserve_commentsto preserve html comments - Added: new option
preserve_nbspsto preserve no-break spaces - Fixed: also fix heh plus standalone hamza
- Fixed: clean spaces before diacritic characters
- Fixed: clean ZWNJs before diacritic characters
- Fixed: putting back chars after suffix white spaces
- Fixed: better pattern for preserving html tags
- Fixed: decode all decimal, hex and some selected entities
- Fixed: decode html entity as standalone method
- Added: support more misc non Persian chars, props @ebraminio/persiantools
- Added: new option:
fix_punctuations - Fixed: better prototype structure
- Fixed: cleanup zwnj as standalone method
- Added: should not replace sprintf directives
- Added: extra method for swaping incorrect quotes
- Fixed: word tokenizer accounts for wraping chars
- Fixed: no space after dots in numbers
- Fixed: no space in time parts in English
- Fixed: mismatch options for Arabic/English numbers
- Added: support prefix:
bi*, props @zoghal - Added: support suffix:
*am,*at,*ash,*ei,*eid,*eem,*and, props @zoghal - Added: support suffix:
*hayee,*hayam,*hayat,*hayash,*hayetan,*hayeman,*hayeshan, props @zoghal - Fixed: check for space before suffix:
*tar,*tari,*tarin, props @zoghal
- Added: convert back quot/apos entities
- Added: new option:
decode_htmlentities - Fixed: test suite
- Added: coding standard
- Added: passing options directly into cleanup method
- Added: new option:
preserve_brackets - Added: new option:
preserve_braces
- Added: new option:
kashidas_as_parenthetic - Fixed: remove all kashida between non-whitespace characters
- Added: new word tokenizer detection
- Added: skip english numbers conversion in html entities
- Added: bower integration
- Fixed: multiline flag on begin/end cleanup