Hi
Firstly, thanks for all the work you have done. To avoid fluff, I'll list the context point by point.
- United States
- I'm using Libpostal to dedupe addresses within the Hadoop ecosystem (Hive on Tez).
- I have a fairly large set of over 200 million addresses, a sizeable chunk of which are human-entered values. Given the nature of my data, I have encountered a few cases that cause the expand_address function to hang and stop my job.
a) The most baffling case.
>>> expand_address(u'5-19�� Nakamachi')
WARN invalid UTF-8
at transliterate (transliterate.c:791) errno: None
WARN invalid UTF-8
at transliterate (transliterate.c:791) errno: None
WARN invalid UTF-8
at transliterate (transliterate.c:791) errno: None
WARN invalid UTF-8
at transliterate (transliterate.c:791) errno: None
WARN invalid UTF-8
at transliterate (transliterate.c:791) errno: None
WARN invalid UTF-8
at transliterate (transliterate.c:791) errno: None
This is an entirely ASCII string, yet the call halts the program. parse_address emits the same warnings but continues gracefully.
>>> parse_address(u'5-19�� Nakamachi')
WARN invalid UTF-8
at transliterate (transliterate.c:791) errno: None
WARN invalid UTF-8
at transliterate (transliterate.c:791) errno: None
WARN invalid UTF-8
at transliterate (transliterate.c:791) errno: None
WARN invalid UTF-8
at transliterate (transliterate.c:791) errno: None
WARN invalid UTF-8
at transliterate (transliterate.c:791) errno: None
WARN invalid UTF-8
at transliterate (transliterate.c:791) errno: None
[(u'5-19� � Nakamachi', u'house')]
b) Address: "No. \uD835\uDFE3\uD835\uDFE3"
This renders as "No. 11". It works fine with pypostal, but it similarly halts the program when using jpostal. My guess is that this is related to the JNI function GetStringUTFChars not working well with 4-byte UTF-8 characters: Java converts its internal UTF-16 String to Modified UTF-8, which encodes each such character as a surrogate pair of two 3-byte sequences instead of a single 4-byte sequence.
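To illustrate the guess about Modified UTF-8, here is a minimal Python 3 sketch (not jpostal or libpostal code). It assumes that encoding each surrogate half separately with Python's `surrogatepass` error handler is a fair stand-in for what GetStringUTFChars produces for this address; `is_strict_utf8` is a helper name made up for the example.

```python
# Python 3 sketch; the address is the one from case (b).
java_style = "No. \ud835\udfe3\ud835\udfe3"  # surrogate halves, as held in a Java String
real_text = "No. \U0001d7e3\U0001d7e3"       # the same text as real code points

# Standard UTF-8: each supplementary character is one 4-byte sequence.
standard = real_text.encode("utf-8")

# Assumption: encoding every surrogate half on its own (3 bytes each)
# mimics the Modified UTF-8 bytes that JNI hands to C code.
modified_like = java_style.encode("utf-8", "surrogatepass")


def is_strict_utf8(data):
    """Return True if data decodes as standard UTF-8 (hypothetical helper)."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


print(len(standard), is_strict_utf8(standard))
print(len(modified_like), is_strict_utf8(modified_like))
```

If the assumption holds, the bytes reaching the C side are longer than the standard encoding and are not valid standard UTF-8, which would be consistent with the "invalid UTF-8" warnings.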
These cases are rare, but they can block an entire job, which makes them problematic. Is there some way to have this function exit gracefully when it hits UTF-8 parsing errors?
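One possible stopgap on the caller's side would be to force every input into well-formed UTF-8 before it reaches libpostal. The sketch below is Python 3 and uses only the standard library; `sanitize_address` is a hypothetical helper, not part of pypostal, and I have not verified that it avoids the hang in either case above. It only guarantees that the result can be encoded as strict UTF-8.

```python
def sanitize_address(raw):
    """Return a str that encodes cleanly as strict UTF-8 (hypothetical helper)."""
    if isinstance(raw, bytes):
        # Invalid byte sequences become U+FFFD instead of raising.
        text = raw.decode("utf-8", errors="replace")
    else:
        text = raw
    # Round-trip through UTF-16: valid surrogate pairs are merged into real
    # code points, and unpaired surrogates are replaced with U+FFFD.
    utf16 = text.encode("utf-16", "surrogatepass")
    return utf16.decode("utf-16", errors="replace")


if __name__ == "__main__":
    print(sanitize_address("No. \ud835\udfe3\ud835\udfe3"))
    print(sanitize_address(b"5-19\xff\xfe Nakamachi"))
```

The cleaned string would then be passed to expand_address in place of the raw value. This does not replace a fix in the library, since the caller cannot know in advance which inputs trigger the hang.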
Thanks,
Nitin