NotTooBad Software TextSmith Blog

Parsing Unicode property data files

TextPicker  Patterns  

When developing the word boundary recogniser for the Patterns framework, I needed to access data in Unicode property data files from Swift. These files look something like this:


# Total code points: 88

# ================================================

0780..07A5    ; Thaana # Lo  [38] THAANA LETTER HAA..THAANA LETTER WAAVU
07B1          ; Thaana # Lo       THAANA LETTER NAA

For all the Unicode property data files you could possibly want, see here and here.

In the sample above, the interesting parts are the hexadecimal numbers or ranges at the beginning of each line and the property name (“Thaana”). So we need to find every hexadecimal number at the beginning of a line, optionally followed by “..” and another hexadecimal number, and then followed by spaces, a semicolon, a single space, the property name and another space.
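To make that matching rule concrete, here is a minimal sketch that uses only the Swift standard library rather than the Patterns framework. The function name `codePointRanges(for:in:)` is hypothetical, and the sketch assumes the two-field layout shown in the sample (a code point or range, a semicolon, then the property name); files with more fields per line would need extra handling.

```swift
// Hypothetical helper, not part of the Patterns framework: collects the
// code point ranges listed for one property in a Unicode property data file.
// Assumes each data line has the form "XXXX[..YYYY] ; Property # comment".
func codePointRanges(for property: String, in text: String) -> [ClosedRange<UInt32>] {
    var result: [ClosedRange<UInt32>] = []
    for line in text.split(whereSeparator: { $0.isNewline }) {
        // Everything after "#" is a comment; comment-only lines become empty.
        let data = line.prefix(while: { $0 != "#" })
        let fields = data.split(separator: ";")
        guard fields.count == 2 else { continue }

        // Splitting on whitespace and taking the first piece trims the padding.
        guard let rangeText = fields[0].split(whereSeparator: { $0.isWhitespace }).first,
              let name = fields[1].split(whereSeparator: { $0.isWhitespace }).first,
              name == property
        else { continue }

        // "0780..07A5" yields two bounds; a single code point yields one,
        // in which case first and last are the same element.
        let bounds = rangeText.split(separator: ".")
        guard let first = bounds.first, let lower = UInt32(first, radix: 16),
              let last = bounds.last, let upper = UInt32(last, radix: 16),
              lower <= upper
        else { continue }

        result.append(lower...upper)
    }
    return result
}

// Usage with the sample lines shown earlier.
let sample = """
# Total code points: 88

# ================================================

0780..07A5    ; Thaana # Lo  [38] THAANA LETTER HAA..THAANA LETTER WAAVU
07B1          ; Thaana # Lo       THAANA LETTER NAA
"""
let ranges = codePointRanges(for: "Thaana", in: sample)
```

Splitting by hand like this works for the simple layout, but it is less declarative than describing the line format as a pattern, which is the approach the post itself takes.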

Design spec for TextPicker

TextPicker  

It’s hard to find examples of simple design specifications for simple applications online; most seem to be meant for larger organisations. Here is the design specification for TextPicker, which the designers at draftss.com used when they designed the UI.

A Mac application for extracting text.

Background

Programmers, data scientists, prosumers and others often need to extract specific information from raw text, and may also wish to automate this or include this functionality in software they are making. Current methods are find/replace, regular expressions, writing parsers manually, or command line tools like grep, sed, awk, etc. These methods can be very complex, even for experienced programmers, and may often take longer to get working than just copying the desired information manually.
