Parsing Unicode property data files

18 Aug 2019 ❘ TextPicker Patterns

When developing the word boundary recogniser for the Patterns framework, I needed to access data in Unicode property data files from Swift. These files look something like this:

# Total code points: 88

# ================================================

0780..07A5    ; Thaana # Lo  [38] THAANA LETTER HAA..THAANA LETTER WAAVU
07B1          ; Thaana # Lo       THAANA LETTER NAA

Here the hexadecimal numbers/ranges at the beginning of each line and the property name (“Thaana”) are the interesting parts. So we need to find all hexadecimal numbers that are at the beginning of a line – optionally followed by “..” and another hexadecimal number – followed by spaces, a semicolon, a single space, the property name and another space.

For all the Unicode property data files you could possibly want, see here and here.

Programmatically

So we have some text, and we want to get a series of ranges and their corresponding property names. If a line has only a single number and not a range, we turn it into a single-element range. In the example above it would be 07B1...07B1. We convert the hexadecimal numbers to UInt32 because Unicode code points can be up to 21 bits long.

typealias RangesAndProperties = [(range: ClosedRange<UInt32>, property: Substring)]

func unicodeProperty(fromDataFile text: String) -> RangesAndProperties {

We will be using the Patterns framework itself for the text processing. It serves the same purpose as regexes, except it might actually be readable for people who haven’t used it before. That’s the general idea, anyway.

We begin by defining a hexadecimal number. Thankfully a pattern for hexadecimal digits is already provided by Patterns:

let hexNumber = Capture(name: "hexNumber", hexDigit+)

Here we repeat the hexadecimal digit one or more times to get a number. Capture means this is a part of the text we want to extract, and we can retrieve it later using the name “hexNumber”.

let hexRange = hexNumber • ".." • hexNumber / hexNumber

The • operator joins together a sequence of patterns. / provides a choice between 2 other patterns; if the pattern to its left fails, it tries the one to its right.
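The fallback behaviour of / is the same idea as trying one parser and, only if it fails, trying another. Here is a minimal sketch of that idea in plain Swift, without Patterns. The helper names parseRangeForm, parseSingleForm and parseHexRange are made up for this illustration, and unlike the pattern they parse a whole string rather than a prefix:

```swift
// Plain-Swift illustration of ordered choice; none of these helpers exist in Patterns.

/// First alternative: "XXXX..YYYY", two hexadecimal numbers separated by "..".
func parseRangeForm(_ text: Substring) -> ClosedRange<UInt32>? {
	// Keeping empty pieces turns "0780..07A5" into ["0780", "", "07A5"].
	let parts = text.split(separator: ".", omittingEmptySubsequences: false)
	guard parts.count == 3, parts[1].isEmpty,
		let lower = UInt32(parts[0], radix: 16),
		let upper = UInt32(parts[2], radix: 16),
		lower <= upper else { return nil }
	return lower ... upper
}

/// Second alternative: a single hexadecimal number, turned into a one-element range.
func parseSingleForm(_ text: Substring) -> ClosedRange<UInt32>? {
	guard let value = UInt32(text, radix: 16) else { return nil }
	return value ... value
}

/// Ordered choice: try the range form first, fall back to the single number only if it fails.
func parseHexRange(_ text: Substring) -> ClosedRange<UInt32>? {
	return parseRangeForm(text) ?? parseSingleForm(text)
}
```

Because a pattern matches a prefix of the text, the order in hexRange should matter: a lone hexNumber would already succeed on the first half of “0780..07A5”, so the range form has to be tried first.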

let rangeAndProperty = Line.start • hexRange • Skip() • "; " • Capture(name: "property", Skip()) • " "

This puts it all together. We start at the beginning of a line, match the 1 or 2 numbers in hexRange, skip everything until “; ”, and then capture everything until the next space.

return try! Parser(search: rangeAndProperty).matches(in: text).map { match in
	let propertyName = text[match[one: "property"]!]
	let oneOrTwoNumbers = match[multiple: "hexNumber"].map { UInt32(text[$0], radix: 16)! }
	let range = oneOrTwoNumbers.first! ... oneOrTwoNumbers.last!
	return (range, propertyName)
}
} // closes unicodeProperty(fromDataFile:)

Parser(search: rangeAndProperty).matches(in: text) returns a lazy sequence of matches, so it doesn’t start the actual text processing until you start reading elements from it. Subscripting a match gives us the index ranges matched by Capture patterns: match[one:] for a single capture, and match[multiple:] for every capture with that name.

You can find all the code in the “unicode_property” commandline application in the Patterns framework.
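For comparison, here is a rough sketch of the same extraction using only the Swift standard library. It is not how Patterns works internally, and parseLine is a hypothetical helper written for the example data in this post. It assumes every data line has the form shown at the top, and it uses lazy so that, like the sequence of matches above, no line is parsed until the results are read:

```swift
// Standard-library-only sketch; `parseLine` is a hypothetical helper, not part of Patterns.
typealias RangeAndProperty = (range: ClosedRange<UInt32>, property: Substring)

/// Parses one line such as "0780..07A5    ; Thaana # Lo  [38] ...".
/// Returns nil for comments, blank lines and anything that doesn't fit the format.
func parseLine(_ line: Substring) -> RangeAndProperty? {
	// The code point field is everything before the first semicolon.
	guard let semicolon = line.firstIndex(of: ";") else { return nil }
	let fields = line[..<semicolon].split(separator: " ")
	guard fields.count == 1 else { return nil }

	// "0780..07A5" becomes ["0780", "", "07A5"]; "07B1" becomes ["07B1"].
	let parts = fields[0].split(separator: ".", omittingEmptySubsequences: false)
	guard parts.count == 1 || (parts.count == 3 && parts[1].isEmpty),
		let lower = UInt32(parts[0], radix: 16),
		let upper = UInt32(parts[parts.count - 1], radix: 16),
		lower <= upper else { return nil }

	// The property name is the first space-delimited word after the semicolon.
	guard let property = line[line.index(after: semicolon)...].split(separator: " ").first
	else { return nil }
	return (lower ... upper, property)
}

let sample = """
# Total code points: 88

0780..07A5    ; Thaana # Lo  [38] THAANA LETTER HAA..THAANA LETTER WAAVU
07B1          ; Thaana # Lo       THAANA LETTER NAA
"""

// `lazy` defers the work: parseLine is not called until the results are iterated.
let results = sample.split(whereSeparator: \.isNewline).lazy.compactMap(parseLine)

for (range, property) in results {
	print(String(range.lowerBound, radix: 16), String(range.upperBound, radix: 16), property)
}
```

This version is stricter about nothing and friendlier about nothing: it simply hard-codes the line format that the pattern describes declaratively, which is the trade-off the Patterns version is meant to avoid.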

Graphically

If you just want to quickly copy out the information you can use my TextPicker app. It lets you select parts of the text you are interested in, and tries to find other parts that are similar:

The application doesn’t support marking 2 different types of information simultaneously (yet) so I had to copy the numbers and names separately.
