1 Jan 2018: Update
This post is intended to serve as a quick intro to the many options in the Go standard library for reading files. In this year’s Advent of Code 2017 there were many problems that required different styles of reading input, and I ended up using most of them. When I started learning Go, it was hard for me to find out the easiest way to do this. That was the main impulse that made me write the techniques down. The methods are not necessarily in decreasing order of simplicity.
- Reading byte-wise
- Ruby-ish style
- More helper functions
Some basic assumptions
- All code examples are wrapped inside a
- I’ll be using the words “array” and “slice” interchangeably to refer to slices most of the time, but they are NOT the same. These blog posts are two really good resources to understand the differences.
- I’m uploading all the working examples to kgrz/reading-files-in-go.
In Go (and, for that matter, in most low-level languages and some dynamic languages like Node), reading a file returns a byte stream. There are benefits to not auto-converting everything to a string, one of which is avoiding expensive string allocations, which increase GC pressure.
To have a simpler mental model for this post, I’m converting an array of bytes to a string using the conversion string(arrayOfBytes). This should not be taken as general advice when shipping production code, though.
Reading the entire file into memory
To start off, the standard library provides multiple functions and utilities to read file data. We’ll start with a basic case that’s provided in the os package. This means two prerequisites:
- The file has to fit in memory
- We need to know the size of the file upfront in order to instantiate a buffer large enough to hold it.
Having a handle to the os.File object, we can query the size and instantiate a list of bytes upfront.
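A minimal sketch of this approach, under a few assumptions: writeSample is a hypothetical helper that creates a temporary file so the example runs anywhere, readWhole is an illustrative name, and io.ReadFull is used in place of a single Read call because one Read is allowed to return fewer bytes than the buffer holds.

```go
package main

import (
	"fmt"
	"io"
	"log"
	"os"
)

// writeSample is a hypothetical helper: it writes content to a temporary
// file so the example runs anywhere, and returns the file's path.
func writeSample(content string) (string, error) {
	f, err := os.CreateTemp("", "sample-*.txt")
	if err != nil {
		return "", err
	}
	defer f.Close()
	_, err = f.WriteString(content)
	return f.Name(), err
}

// readWhole reads the file at path into a buffer sized from its metadata.
func readWhole(path string) ([]byte, error) {
	file, err := os.Open(path)
	if err != nil {
		return nil, err
	}
	defer file.Close()

	fileinfo, err := file.Stat()
	if err != nil {
		return nil, err
	}

	// Allocate the whole buffer upfront, using the size reported by Stat.
	buffer := make([]byte, fileinfo.Size())

	// io.ReadFull keeps calling Read until the buffer is full.
	bytesread, err := io.ReadFull(file, buffer)
	if err != nil {
		return nil, err
	}
	return buffer[:bytesread], nil
}

func main() {
	path, err := writeSample("hello, gopher\n")
	if err != nil {
		log.Fatal(err)
	}
	defer os.Remove(path)

	data, err := readWhole(path)
	if err != nil {
		fmt.Println("read failed:", err)
		return
	}
	fmt.Printf("read %d bytes: %q\n", len(data), string(data))
}
```

The buffer is allocated exactly once, which is why both prerequisites above matter: the size must be known, and it must fit in memory.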
Reading a file in chunks
While reading a file in one shot works in most cases, sometimes we’d want to use a more memory-conservative approach. Say, read a file in chunks of some size, process them, and repeat until the end. In the following example, a buffer size of 100 bytes is used.
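A sketch of chunked reading, assuming a hypothetical writeSample helper that creates a 250-byte temporary file, and an illustrative readChunks function that collects each chunk as a string instead of processing it in place:

```go
package main

import (
	"fmt"
	"io"
	"log"
	"os"
	"strings"
)

// writeSample is a hypothetical helper: it writes content to a temporary
// file so the example runs anywhere, and returns the file's path.
func writeSample(content string) (string, error) {
	f, err := os.CreateTemp("", "sample-*.txt")
	if err != nil {
		return "", err
	}
	defer f.Close()
	_, err = f.WriteString(content)
	return f.Name(), err
}

// readChunks reads the file at path using a buffer of bufferSize bytes
// and returns every chunk it saw, in order.
func readChunks(path string, bufferSize int) ([]string, error) {
	file, err := os.Open(path)
	if err != nil {
		return nil, err
	}
	defer file.Close()

	var chunks []string
	buffer := make([]byte, bufferSize)
	for {
		bytesread, err := file.Read(buffer)
		// Use whatever was read before inspecting the error, and reslice
		// so stale bytes from the previous iteration are not included.
		if bytesread > 0 {
			chunks = append(chunks, string(buffer[:bytesread]))
		}
		if err == io.EOF {
			break
		}
		if err != nil {
			return nil, err
		}
	}
	return chunks, nil
}

func main() {
	// 250 bytes, so a 100-byte buffer leaves a remainder at the end.
	path, err := writeSample(strings.Repeat("abcde", 50))
	if err != nil {
		log.Fatal(err)
	}
	defer os.Remove(path)

	chunks, err := readChunks(path, 100)
	if err != nil {
		fmt.Println("read failed:", err)
		return
	}
	for i, c := range chunks {
		fmt.Printf("chunk %d: %d bytes\n", i, len(c))
	}
}
```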
Compared to reading a file entirely, the main differences are:
- We read until we get an EOF marker, so we add a specific check for err == io.EOF. If you’re new to Go, and are confused about the conventions of errors, it might be useful to check out this post by Rob Pike: Errors are values
- We define the buffer size, so we have control over the “chunk” size we want. This can improve performance when used correctly because of the way operating systems work with caching a file that’s being read.
- If the file size is not a whole multiple of the buffer size, the last iteration will add only the remaining bytes to the buffer, hence the call to buffer[:bytesread]. In the normal case, bytesread will be the same as the buffer size.
This is quite similar to the following, in Ruby:
For each iteration of the loop, an internal file pointer gets updated. When the next read happens, the data starting from the file pointer offset, and up to the size of the buffer, gets returned. This pointer is not a construct of the language, but one of the OS. On Linux, this pointer is a property of the file descriptor that gets created. All the read/Read calls (in Ruby/Go respectively) get translated internally to system calls and sent to the kernel, and the kernel manages this pointer.
Reading file chunks concurrently
What if we wanted to speed up the processing of the chunks mentioned above? One way to do that is to use multiple goroutines! The one extra piece of work we need to do, compared to reading chunks serially, is to know the offset for each goroutine. Note that ReadAt behaves slightly differently from the way Read does when the size of the target buffer is larger than the number of bytes left over.
Also note that I’m not restricting the number of goroutines, and it’s only defined by the buffer size. In reality, there might be an upper bound on this number.
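A sketch of the concurrent version, with one goroutine per chunk and a wait group. The names readConcurrently and writeSample are illustrative assumptions, bufferSize is assumed to be positive, and chunks are collected into a slice indexed by chunk number so each goroutine writes to its own slot.

```go
package main

import (
	"fmt"
	"io"
	"log"
	"os"
	"strings"
	"sync"
)

// writeSample is a hypothetical helper: it writes content to a temporary
// file so the example runs anywhere, and returns the file's path.
func writeSample(content string) (string, error) {
	f, err := os.CreateTemp("", "sample-*.txt")
	if err != nil {
		return "", err
	}
	defer f.Close()
	_, err = f.WriteString(content)
	return f.Name(), err
}

// readConcurrently reads the file in bufferSize chunks, one goroutine each.
func readConcurrently(path string, bufferSize int) ([]string, error) {
	file, err := os.Open(path)
	if err != nil {
		return nil, err
	}
	defer file.Close()

	fileinfo, err := file.Stat()
	if err != nil {
		return nil, err
	}
	filesize := int(fileinfo.Size())

	// Round up so a trailing partial chunk gets its own goroutine.
	concurrency := (filesize + bufferSize - 1) / bufferSize
	results := make([]string, concurrency)
	errs := make([]error, concurrency)

	var wg sync.WaitGroup
	wg.Add(concurrency)
	for i := 0; i < concurrency; i++ {
		go func(i int) {
			defer wg.Done() // runs when this goroutine returns
			buffer := make([]byte, bufferSize)
			bytesread, err := file.ReadAt(buffer, int64(i*bufferSize))
			// Unlike Read, ReadAt reports io.EOF together with the
			// bytes of a short final chunk.
			if err != nil && err != io.EOF {
				errs[i] = err
				return
			}
			results[i] = string(buffer[:bytesread])
		}(i)
	}
	wg.Wait()

	for _, err := range errs {
		if err != nil {
			return nil, err
		}
	}
	return results, nil
}

func main() {
	path, err := writeSample(strings.Repeat("abcde", 50)) // 250 bytes
	if err != nil {
		log.Fatal(err)
	}
	defer os.Remove(path)

	chunks, err := readConcurrently(path, 100)
	if err != nil {
		fmt.Println("read failed:", err)
		return
	}
	for i, c := range chunks {
		fmt.Printf("chunk %d: %d bytes\n", i, len(c))
	}
}
```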
That’s way more compared to any of the previous methods:
- I’m trying to create a specific number of goroutines, depending on the file size and our buffer size (100, in our case).
- We need a way to ensure that we are “waiting” for all the goroutines to finish. In this example, I’m using a wait group.
- Instead of break-ing out of the infinite for loop, we signal from inside each goroutine when it’s done. Since we defer the call to wg.Done(), it gets called when the goroutine “return”s.
Note: Always check for the number of bytes returned, and reslice the output buffer.
You can go a long way with the Read() way of reading files, but sometimes you need more convenience. Something used very often in Ruby is the family of IO functions such as each_codepoint. We can achieve something similar by using the Scanner type, and associated functions from the bufio package. The bufio.Scanner type implements functions that take a “split” function, and advance a pointer based on this function. For instance, the built-in bufio.ScanLines split function, for every iteration, advances the pointer until the next newline character. At each step, the type also exposes methods to obtain the byte array/string between the start and the end position. For example:
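A small sketch of a single scan step, assuming illustrative names (firstLine, and a hypothetical writeSample helper that creates a temporary file):

```go
package main

import (
	"bufio"
	"fmt"
	"log"
	"os"
)

// writeSample is a hypothetical helper: it writes content to a temporary
// file so the example runs anywhere, and returns the file's path.
func writeSample(content string) (string, error) {
	f, err := os.CreateTemp("", "sample-*.txt")
	if err != nil {
		return "", err
	}
	defer f.Close()
	_, err = f.WriteString(content)
	return f.Name(), err
}

// firstLine returns the first line of the file, without its newline.
func firstLine(path string) (string, error) {
	file, err := os.Open(path)
	if err != nil {
		return "", err
	}
	defer file.Close()

	scanner := bufio.NewScanner(file)
	scanner.Split(bufio.ScanLines) // the default, set here for clarity

	// Scan advances to the next token and reports whether it found one.
	if !scanner.Scan() {
		// Either there was nothing to read or reading failed;
		// Err returns nil in the first case.
		return "", scanner.Err()
	}
	// Text returns the current token as a string; Bytes returns it as []byte.
	return scanner.Text(), nil
}

func main() {
	path, err := writeSample("first line\nsecond line\n")
	if err != nil {
		log.Fatal(err)
	}
	defer os.Remove(path)

	line, err := firstLine(path)
	if err != nil {
		fmt.Println("scan failed:", err)
		return
	}
	fmt.Printf("%q\n", line)
}
```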
Therefore, to read the entire file this way, line by line, something like this can be used:
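One possible version, again using a hypothetical writeSample helper and an illustrative readLines function:

```go
package main

import (
	"bufio"
	"fmt"
	"log"
	"os"
)

// writeSample is a hypothetical helper: it writes content to a temporary
// file so the example runs anywhere, and returns the file's path.
func writeSample(content string) (string, error) {
	f, err := os.CreateTemp("", "sample-*.txt")
	if err != nil {
		return "", err
	}
	defer f.Close()
	_, err = f.WriteString(content)
	return f.Name(), err
}

// readLines returns every line of the file, without trailing newlines.
func readLines(path string) ([]string, error) {
	file, err := os.Open(path)
	if err != nil {
		return nil, err
	}
	defer file.Close()

	var lines []string
	scanner := bufio.NewScanner(file) // splits on lines by default
	for scanner.Scan() {
		lines = append(lines, scanner.Text())
	}
	// Scan returns false both at EOF and on failure; Err tells them apart.
	return lines, scanner.Err()
}

func main() {
	path, err := writeSample("one\ntwo\nthree\n")
	if err != nil {
		log.Fatal(err)
	}
	defer os.Remove(path)

	lines, err := readLines(path)
	if err != nil {
		fmt.Println("scan failed:", err)
		return
	}
	for i, line := range lines {
		fmt.Printf("%d: %s\n", i+1, line)
	}
}
```

One thing to keep in mind: by default a Scanner refuses tokens longer than bufio.MaxScanTokenSize (64 KiB), so very long lines need a larger buffer set through the Scanner's Buffer method.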
Scanning word by word
The bufio package contains these basic predefined split functions:
- ScanLines (default)
- ScanWords
- ScanRunes (highly useful for iterating over UTF-8 codepoints, as opposed to bytes)
- ScanBytes
So to read a file, and create a list of words in the file, something like this can be used:
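A sketch using bufio.ScanWords; readWords and writeSample are illustrative names rather than anything from the standard library:

```go
package main

import (
	"bufio"
	"fmt"
	"log"
	"os"
)

// writeSample is a hypothetical helper: it writes content to a temporary
// file so the example runs anywhere, and returns the file's path.
func writeSample(content string) (string, error) {
	f, err := os.CreateTemp("", "sample-*.txt")
	if err != nil {
		return "", err
	}
	defer f.Close()
	_, err = f.WriteString(content)
	return f.Name(), err
}

// readWords returns the whitespace-separated words of the file, in order.
func readWords(path string) ([]string, error) {
	file, err := os.Open(path)
	if err != nil {
		return nil, err
	}
	defer file.Close()

	scanner := bufio.NewScanner(file)
	scanner.Split(bufio.ScanWords) // tokens are now words, not lines

	var words []string
	for scanner.Scan() {
		// append grows the slice as needed.
		words = append(words, scanner.Text())
	}
	return words, scanner.Err()
}

func main() {
	path, err := writeSample("the quick  brown\nfox jumps\n")
	if err != nil {
		log.Fatal(err)
	}
	defer os.Remove(path)

	words, err := readWords(path)
	if err != nil {
		fmt.Println("scan failed:", err)
		return
	}
	fmt.Printf("%d words: %q\n", len(words), words)
}
```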
The ScanBytes split function will give the same output as what we’ve seen in the earlier Read() examples. One major difference between the two is the issue of dynamic allocation every time we need to append to the byte/string array in the case of a scanner. This can be circumvented by techniques like pre-initializing the buffer to a certain length, and increasing the size only when you reach the previous limit. Using the same example as above:
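A sketch of the pre-initialized variant. The function name readWordsPrealloc, the writeSample helper, and the choice to return the number of filled slots alongside the slice are all assumptions made for illustration:

```go
package main

import (
	"bufio"
	"fmt"
	"log"
	"os"
)

// writeSample is a hypothetical helper: it writes content to a temporary
// file so the example runs anywhere, and returns the file's path.
func writeSample(content string) (string, error) {
	f, err := os.CreateTemp("", "sample-*.txt")
	if err != nil {
		return "", err
	}
	defer f.Close()
	_, err = f.WriteString(content)
	return f.Name(), err
}

// readWordsPrealloc fills a word list that grows bufferSize slots at a time.
// It returns the list (possibly with empty trailing slots) and the number
// of slots that actually hold a word.
func readWordsPrealloc(path string, bufferSize int) ([]string, int, error) {
	if bufferSize < 1 {
		bufferSize = 1 // guard: a zero-length list could never be indexed
	}
	file, err := os.Open(path)
	if err != nil {
		return nil, 0, err
	}
	defer file.Close()

	scanner := bufio.NewScanner(file)
	scanner.Split(bufio.ScanWords)

	// Start with bufferSize empty slots instead of an empty slice.
	words := make([]string, bufferSize)
	pos := 0
	for scanner.Scan() {
		words[pos] = scanner.Text()
		pos++
		if pos >= len(words) {
			// Out of room: add another bufferSize slots in one step.
			words = append(words, make([]string, bufferSize)...)
		}
	}
	return words, pos, scanner.Err()
}

func main() {
	path, err := writeSample("a b c d e")
	if err != nil {
		log.Fatal(err)
	}
	defer os.Remove(path)

	words, count, err := readWordsPrealloc(path, 4)
	if err != nil {
		fmt.Println("scan failed:", err)
		return
	}
	fmt.Printf("%d slots, %d used: %q\n", len(words), count, words[:count])
}
```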
So we end up making far fewer slice “grow” operations, but we might end up with some empty slots towards the end depending on bufferSize and the number of words in the file, and that’s a tradeoff.
Splitting a long string into words
bufio.NewScanner takes, as an argument, a type that satisfies the io.Reader interface, which means it’ll work with any type that has a Read method defined on it. One of the string utility functions in the standard library that returns a “reader” type is strings.NewReader. We can combine the two when reading words out of a string:
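A minimal sketch combining the two; splitWords is an illustrative name:

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// splitWords scans s word by word, treating the string as an io.Reader.
func splitWords(s string) ([]string, error) {
	// strings.NewReader wraps the string in a type with a Read method,
	// which is all bufio.NewScanner asks for.
	scanner := bufio.NewScanner(strings.NewReader(s))
	scanner.Split(bufio.ScanWords)

	var words []string
	for scanner.Scan() {
		words = append(words, scanner.Text())
	}
	return words, scanner.Err()
}

func main() {
	words, err := splitWords("a long   string\twith\nmixed whitespace")
	if err != nil {
		fmt.Println("scan failed:", err)
		return
	}
	fmt.Printf("%d words: %q\n", len(words), words)
}
```

For plain strings, strings.Fields does a similar whitespace split in one call; the scanner version is mostly useful because the same code works for any io.Reader.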
Scanning a comma-separated string
Parsing a CSV file/string manually with the basic file.Read() or the Scanner type is cumbersome, because a “word” as per the split function bufio.ScanWords is defined as a bunch of runes delimited by a unicode space. Reading in individual runes, and keeping track of the buffer size and the position (like what’s done in lexing/parsing) is too much work and manipulation.
This can be avoided though. We can define a new split function that reads characters until the reader encounters a comma, and then returns that chunk when Text() or Bytes() is called. The signature of a bufio.SplitFunc is func(data []byte, atEOF bool) (advance int, token []byte, err error), where:
- data is the input byte string
- atEOF is a flag passed to the function, signifying that the reader has no more data to give
- advance is the number of positions to treat as the current read length; this value is used to update the cursor position once the scan loop is complete
- token is the actual data of the scan operation
- err is there in case you want to signal a problem
For simplicity, I’m showing an example for reading a string, and not a file. A simple reader for a CSV string using the above signature can be:
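A sketch of such a split function. The names scanCSV and splitCSV are illustrative, and this is deliberately naive: it only splits on commas and trims whitespace, with no handling of quoted fields (the standard library's encoding/csv package is the usual choice for real CSV data).

```go
package main

import (
	"bufio"
	"bytes"
	"fmt"
	"strings"
)

// scanCSV is a bufio.SplitFunc that yields comma-delimited tokens with
// surrounding whitespace trimmed.
func scanCSV(data []byte, atEOF bool) (advance int, token []byte, err error) {
	if commaidx := bytes.IndexByte(data, ','); commaidx >= 0 {
		// Consume the field and the comma that ends it.
		return commaidx + 1, bytes.TrimSpace(data[:commaidx]), nil
	}
	if atEOF && len(data) > 0 {
		// No comma left: whatever remains is the final field.
		return len(data), bytes.TrimSpace(data), nil
	}
	// No complete token yet: ask the scanner for more data.
	return 0, nil, nil
}

// splitCSV runs scanCSV over a string and collects the tokens.
func splitCSV(s string) ([]string, error) {
	scanner := bufio.NewScanner(strings.NewReader(s))
	scanner.Split(scanCSV)

	var fields []string
	for scanner.Scan() {
		fields = append(fields, scanner.Text())
	}
	return fields, scanner.Err()
}

func main() {
	fields, err := splitCSV("name, age, occupation")
	if err != nil {
		fmt.Println("scan failed:", err)
		return
	}
	fmt.Printf("%q\n", fields)
}
```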
We’ve seen multiple ways to read a file, in increasing order of convenience and power. But what if you just want to read a file into a buffer?
ioutil is a package in the standard library that contains some functions to make it a one-liner.
Reading an entire file
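A sketch of the one-liner, wrapped in an illustrative readFileString function so it is easy to call; writeSample is a hypothetical helper that creates a temporary file. Since Go 1.16, ioutil.ReadFile simply calls os.ReadFile, so either spelling should behave the same.

```go
package main

import (
	"fmt"
	"io/ioutil"
	"log"
	"os"
)

// writeSample is a hypothetical helper: it writes content to a temporary
// file so the example runs anywhere, and returns the file's path.
func writeSample(content string) (string, error) {
	f, err := os.CreateTemp("", "sample-*.txt")
	if err != nil {
		return "", err
	}
	defer f.Close()
	_, err = f.WriteString(content)
	return f.Name(), err
}

// readFileString reads the whole file and returns it as a string.
func readFileString(path string) (string, error) {
	// ReadFile opens the file, reads until EOF, and closes it again.
	data, err := ioutil.ReadFile(path)
	if err != nil {
		return "", err
	}
	return string(data), nil
}

func main() {
	path, err := writeSample("everything in one go\n")
	if err != nil {
		log.Fatal(err)
	}
	defer os.Remove(path)

	content, err := readFileString(path)
	if err != nil {
		fmt.Println("read failed:", err)
		return
	}
	fmt.Printf("%q\n", content)
}
```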
This is way closer to what we see in higher-level scripting languages.
Reading an entire directory of files
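A sketch that loads every regular file in a directory into memory, using ioutil.ReadDir and ioutil.ReadFile. The helper writeSampleDir and the function readDirFiles are illustrative assumptions; the helper creates a throwaway directory with two small files.

```go
package main

import (
	"fmt"
	"io/ioutil"
	"log"
	"os"
	"path/filepath"
)

// writeSampleDir is a hypothetical helper: it creates a temporary
// directory holding two small files and returns the directory's path.
func writeSampleDir() (string, error) {
	dir, err := os.MkdirTemp("", "sample-dir-")
	if err != nil {
		return "", err
	}
	samples := map[string]string{"a.txt": "alpha", "b.txt": "beta"}
	for name, content := range samples {
		target := filepath.Join(dir, name)
		if err := os.WriteFile(target, []byte(content), 0o644); err != nil {
			return "", err
		}
	}
	return dir, nil
}

// readDirFiles reads every regular file directly inside dir and returns
// the contents keyed by file name.
func readDirFiles(dir string) (map[string]string, error) {
	entries, err := ioutil.ReadDir(dir)
	if err != nil {
		return nil, err
	}

	contents := make(map[string]string)
	for _, entry := range entries {
		if !entry.Mode().IsRegular() {
			continue // skip subdirectories and other non-regular entries
		}
		data, err := ioutil.ReadFile(filepath.Join(dir, entry.Name()))
		if err != nil {
			return nil, err
		}
		contents[entry.Name()] = string(data)
	}
	return contents, nil
}

func main() {
	dir, err := writeSampleDir()
	if err != nil {
		log.Fatal(err)
	}
	defer os.RemoveAll(dir)

	contents, err := readDirFiles(dir)
	if err != nil {
		fmt.Println("read failed:", err)
		return
	}
	for name, content := range contents {
		fmt.Printf("%s: %q\n", name, content)
	}
}
```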
Needless to say, DO NOT run this script if you have large files :D
More helper functions
There are more functions to read a file (or more precisely, a Reader) in the standard library. To avoid bloating this already long article, I’m listing out the functions I found:
- ioutil.ReadAll(): takes an io-like object and returns the entire data as a byte array
- io.MultiReader: a very useful primitive to combine multiple io-like objects. So you can have a list of files to be read, and treat them as a single contiguous block of data rather than managing the complexity of switching the file objects at the end of each of the previous objects.
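The two helpers compose nicely. A sketch, assuming an illustrative concatFiles function and a hypothetical writeSample helper, that reads several files as if they were one:

```go
package main

import (
	"fmt"
	"io"
	"io/ioutil"
	"log"
	"os"
)

// writeSample is a hypothetical helper: it writes content to a temporary
// file so the example runs anywhere, and returns the file's path.
func writeSample(content string) (string, error) {
	f, err := os.CreateTemp("", "sample-*.txt")
	if err != nil {
		return "", err
	}
	defer f.Close()
	_, err = f.WriteString(content)
	return f.Name(), err
}

// concatFiles returns the contents of all the given files, back to back.
func concatFiles(paths ...string) ([]byte, error) {
	readers := make([]io.Reader, 0, len(paths))
	for _, path := range paths {
		file, err := os.Open(path)
		if err != nil {
			return nil, err
		}
		// Deferred inside the loop on purpose: every file has to stay
		// open until ReadAll below has finished.
		defer file.Close()
		readers = append(readers, file)
	}

	// MultiReader drains each reader to EOF, then moves on to the next;
	// ReadAll keeps reading until the combined reader reports EOF.
	return ioutil.ReadAll(io.MultiReader(readers...))
}

func main() {
	first, err := writeSample("part one, ")
	if err != nil {
		log.Fatal(err)
	}
	defer os.Remove(first)

	second, err := writeSample("part two")
	if err != nil {
		fmt.Println("setup failed:", err)
		return
	}
	defer os.Remove(second)

	data, err := concatFiles(first, second)
	if err != nil {
		fmt.Println("read failed:", err)
		return
	}
	fmt.Printf("%q\n", string(data))
}
```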
In an attempt to highlight the “read” functions, I chose to use an error function that prints out and closes the file:
I missed a crucial detail: I wasn’t closing the file handle when there were no errors, and may leak file descriptors. This was pointed out on reddit by u/shovelpost.
The reason I wanted to avoid defer was that the error function ends up calling os.Exit internally, which doesn’t run deferred functions, so I chose to explicitly close the file, but then missed out the success case 🤦🏽♂️.
I’ve updated the examples to use
return-ing instead of relying on