NotTooBad Software TextSmith Blog

Parsing Unicode property data files

TextPicker  Patterns  

When developing the word boundary recogniser for the Patterns framework, I needed to access data in Unicode property data files from Swift. These files look something like this:


# Total code points: 88

# ================================================

0780..07A5    ; Thaana # Lo  [38] THAANA LETTER HAA..THAANA LETTER WAAVU
07B1          ; Thaana # Lo       THAANA LETTER NAA

where the hexadecimal numbers or ranges at the beginning of each line and the property name (“Thaana”) are the interesting parts. So we need to find every hexadecimal number that is at the beginning of a line, optionally followed by “..” and another hexadecimal number, and then followed by spaces, a semicolon, a single space, the property name and another space.

(For all the Unicode property data files you could possibly want, see here and here.)
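To make the line format described above concrete, here is a minimal sketch in plain Swift with Foundation. It does not use the Patterns framework and is not the approach the framework itself takes. The function name `codePointRanges(for:in:)` is hypothetical, and the sketch assumes the whole file has already been read into a string.

```swift
import Foundation

/// Hypothetical helper for illustration; it is not part of the Patterns framework.
/// Returns the code point ranges listed for `property` in the text of a
/// Unicode property data file.
func codePointRanges(for property: String, in text: String) -> [ClosedRange<UInt32>] {
    var result: [ClosedRange<UInt32>] = []
    for line in text.components(separatedBy: .newlines) {
        // Split "0780..07A5    ; Thaana # Lo ..." at the first semicolon.
        // Comment and blank lines without a semicolon are skipped here.
        guard let semicolon = line.firstIndex(of: ";") else { continue }

        // After the semicolon: a single space, the property name, another space.
        let rest = line[line.index(after: semicolon)...]
        guard rest.hasPrefix(" \(property) ") else { continue }

        // Before the semicolon: the numbers at the start of the line,
        // followed by nothing but spaces.
        let field = line[..<semicolon]
        let numbers = field.prefix(while: { $0 != " " })
        guard !numbers.isEmpty,
              field[numbers.endIndex...].allSatisfy({ $0 == " " }) else { continue }

        // Either a single hexadecimal number or "first..last".
        let bounds = numbers.components(separatedBy: "..").map { UInt32($0, radix: 16) }
        switch bounds.count {
        case 1:
            if let only = bounds[0] { result.append(only...only) }
        case 2:
            if let first = bounds[0], let last = bounds[1], first <= last {
                result.append(first...last)
            }
        default:
            break // More than one ".." is not a valid range; ignore the line.
        }
    }
    return result
}
```

Calling `codePointRanges(for: "Thaana", in: contents)` on the excerpt above should yield one range for `0780..07A5` and a single-value range for `07B1`. Single code points are stored as one-element ranges so that callers only have to handle one type.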
