Splitting text and collections lazily in Swift
❘ Swift
Swift version 2.1
There are already methods for splitting collections in the Swift Standard Library, but they do all the work immediately and return the results in an array. When dealing with large strings, or streams of text, I find it better to do the work lazily, when needed. The overall performance is not necessarily better, but it is smoother, as you get the first results immediately instead of having to wait a little while and then get everything at once. And memory usage is lower, no need to store everything in an array first.
The original
This is how the CollectionType.split method in the Standard Library splits strings over “,”:
func testCollectionTypeSplit_AllowingEmptySlices () {
let split = {(s: String) -> [String] in
s.characters.split(",", allowEmptySlices: true).map {String($0)}
}
XCTAssertEqual(split("ab,c,de,f"), ["ab","c","de","f"])
XCTAssertEqual(split(",a"), ["","a"])
XCTAssertEqual(split("a,"), ["a",""])
XCTAssertEqual(split("a,,b,,,c"), ["a","","b","","","c"])
XCTAssertEqual(split(""), [""])
XCTAssertEqual(split(","), ["",""])
XCTAssertEqual(split("ab"), ["ab"])
}
func testCollectionTypeSplit_NoEmptySlices () {
let split = {(s: String) -> [String] in
s.characters.split(",", allowEmptySlices: false).map {String($0)}
}
XCTAssertEqual(split("ab,c,de,f"), ["ab","c","de","f"])
XCTAssertEqual(split(",a"), ["a"])
XCTAssertEqual(split("a,"), ["a"])
XCTAssertEqual(split("a,,b,,,c"), ["a","b","c"])
XCTAssertEqual(split(""), [])
XCTAssertEqual(split(","), [])
XCTAssertEqual(split("ab"), ["ab"])
}
I want our split method to behave in exactly the same way.
The results when allowing empty slices may seem a bit strange, but think of it as text split on the newline character into separate lines, and consider there may be several empty lines in a row plus they may be at the beginning and/or the end.
The first step
So what is the simplest operation needed here, and will it be useful on its own? When splitting collections into smaller pieces the smallest possible operation is to split the collection once into 2 pieces. And yes I believe that operation will be useful on its own:
extension CollectionType where Generator.Element: Equatable {
/**
Return everything before the first occurrence of ‘separator’ as 'head', and everything after it as 'tail'.
If ‘separator’ is first, ‘head’ is empty. If it is last, ‘tail’ is empty.
If ‘separator’ is not found then ‘head’ contains everything and 'tail' is nil.
*/
public func splitOnce (separator: Generator.Element) -> (head: SubSequence, tail: SubSequence?) {
guard let nextindex = indexOf(separator) else { return (self[startIndex..<endIndex], nil) }
return (self[startIndex..<nextindex], self[nextindex.successor()..<endIndex])
}
}
Pretty straightforward, just find the first separator and return everything before it and everything after it.
The output
This functionality naturally belongs in LazyCollectionType. We will be following Apple’s guidelines for extending it (they are the same as for LazySequenceType), except we will use the same struct for the sequence and generator because I don’t want to write this monster of a generic where
clause more often than strictly necessary:
public struct LazySplitSequence <Base: CollectionType where Base.Generator.Element: Equatable,
Base.SubSequence: CollectionType,
Base.SubSequence.Generator.Element==Base.Generator.Element,
Base.SubSequence==Base.SubSequence.SubSequence>: GeneratorType, LazySequenceType {
This defines restrictions on the source collection’s subsequences which you would think must be valid for all subsequences. All sequences, and therefore collections, should satisfy them, so I don’t think we are excluding anything here. These restrictions are in any case necessary for the following code, which deals entirely in subsequences and their subsequences:
private var remaining: Base.SubSequence?
private let separator: Base.Generator.Element
private let allowEmptySlices: Bool
public init (_ base: Base, separator: Base.Generator.Element, allowEmptySlices: Bool = false) {
self.separator = separator
self.remaining = base[base.startIndex..<base.endIndex]
self.allowEmptySlices = allowEmptySlices
}
public mutating func next () -> Base.SubSequence? {
guard let remaining = self.remaining else { return nil }
let (head, tail) = remaining.splitOnce(separator)
self.remaining = tail
return (!allowEmptySlices && head.isEmpty) ? next() : head
}
}
In the initialiser we set remaining
to a subsequence of the entire collection. Then for every iteration we return everything before the next occurrence of separator
, set everything after it to remaining
and optionally skip empty results.
You may have noticed this sequence does not have a ‘generate’ method even though it is a required part of the protocol. That is because of this clever little extension in the standard library, which automatically generates one for us.
The actual method
Then all that remains is the method itself, where we meet the monstrous generic where
clause again:
extension LazyCollectionType where Elements.Generator.Element: Equatable,
Elements.SubSequence: CollectionType,
Elements.SubSequence.Generator.Element==Elements.Generator.Element,
Elements.SubSequence==Elements.SubSequence.SubSequence {
public func split (separator: Elements.Generator.Element, allowEmptySlices: Bool = false) -> LazySplitSequence<Elements> {
return LazySplitSequence(self.elements, separator: separator, allowEmptySlices: allowEmptySlices)
}
}
It can be called like this:
for line in largetext.characters.lazy.split("\n") {
print(line)
}
// or
let r = largetext.characters.lazy.split("\n").map { /* do something else with it */ }\
The complete code can be found in the SwiftShell project, with unit tests.
Comments