1 Jan 2018: Update
When I started learning Go, it was hard for me to get a handle on the various APIs and techniques that can be used to read files. My attempt at writing a multi-core word counter (kgrz/kwc) shows evidence of this initial confusion: multiple ways used in the same file.
In this year’s Advent of Code there were some problems that required different styles of reading input. I ended up using each technique at least once, and now I have a mental map of these techniques. I’m writing those down in this post. I’m listing out the methods in the same way I encountered them, and not necessarily in decreasing order of simplicity.
- Reading byte-wise
- Ruby-ish style
- More helper functions
Some basic assumptions
- All code examples are wrapped inside a main() function block
- I’ll be using the words “array” and “slice” interchangeably to refer to slices most of the time, but they are NOT the same. These blog posts are two really good resources to understand the differences.
- I’m uploading all the working examples to kgrz/reading-files-in-go.
In Go (and, for that matter, most low-level languages and some dynamic languages like Node), file reads return a bytestream. There are benefits to not auto-converting everything to a string, one of which is avoiding expensive string allocations that increase GC pressure.
To have a simpler mental model for this post, I’m converting an array of bytes to a string using the cast string(arrayOfBytes). This should not be taken as general advice for shipping production code, though.
Reading the entire file into memory
To start off, the standard library provides multiple functions and utilities to read file data. We’ll start with a basic case that’s provided in the os package. This means two prerequisites:
- The file has to fit in memory
- We need to know the size of the file upfront in order to instantiate a buffer large enough to hold it.
Having a handle to the os.File object, we can query the size and instantiate a list of bytes upfront.
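A sketch of how that looks (filetoread.txt is a stand-in file name):

```go
package main

import (
	"fmt"
	"os"
)

func main() {
	file, err := os.Open("filetoread.txt")
	if err != nil {
		fmt.Println(err)
		return
	}
	defer file.Close()

	// Query the file size so we can allocate a buffer that fits
	// the entire file.
	fileinfo, err := file.Stat()
	if err != nil {
		fmt.Println(err)
		return
	}

	filesize := fileinfo.Size()
	buffer := make([]byte, filesize)

	// A single Read call fills the buffer with the whole file.
	bytesread, err := file.Read(buffer)
	if err != nil {
		fmt.Println(err)
		return
	}

	fmt.Println("bytes read:", bytesread)
	fmt.Println("bytestream to string:", string(buffer))
}
```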
Reading a file in chunks
While reading a file in one shot works in most cases, sometimes we’d want to use a more memory-conservative approach: read a file in chunks of some size, process them, and repeat until the end. In the following example, a buffer size of 100 bytes is used.
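A sketch of that loop (filetoread.txt is a stand-in file name):

```go
package main

import (
	"fmt"
	"io"
	"os"
)

const BufferSize = 100

func main() {
	file, err := os.Open("filetoread.txt")
	if err != nil {
		fmt.Println(err)
		return
	}
	defer file.Close()

	buffer := make([]byte, BufferSize)

	for {
		bytesread, err := file.Read(buffer)
		if err != nil {
			// io.EOF is the expected way out of this loop.
			if err != io.EOF {
				fmt.Println(err)
			}
			break
		}

		// Reslice to the number of bytes actually read; the last
		// chunk is usually shorter than BufferSize.
		fmt.Println("bytestream to string:", string(buffer[:bytesread]))
	}
}
```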
Compared to reading a file entirely, the main differences are:
- We read until we get an EOF marker, so we add a specific check for err == io.EOF. If you’re new to Go, and are confused about the conventions around errors, it might be useful to check out this post by Rob Pike: Errors are values
- We define the buffer size, so we have control on the “chunk” size we want. This can improve performance when used correctly because of the way operating systems work with caching a file that’s being read.
- If the file size is not a whole multiple of the buffer size, the last iteration will add only the remainder number of bytes to the buffer, hence the call to buffer[:bytesread]. In the normal case, bytesread will be the same as the buffer size.
This is quite similar to reading a file in chunks in Ruby, where file.read(bufsize) in a loop returns successive chunks, and nil once the file is exhausted.
For each iteration of the loop, an internal file pointer gets updated. When the next read happens, the data starting at the file pointer offset, up to the size of the buffer, gets returned. This pointer is not a construct of the language; it belongs to the OS. On Linux, this pointer is a property of the file descriptor that gets created. All the read/Read calls (in Ruby/Go respectively) get translated internally into system calls, sent to the kernel, and the kernel manages this pointer.
Reading file chunks concurrently
What if we wanted to speed up the processing of the chunks mentioned above? One way to do that is to use multiple goroutines! The one extra bit of work we need to do compared to reading chunks serially is that we need to know the offset for each routine. Note that ReadAt behaves slightly differently from the way Read does when the size of the target buffer is larger than the number of bytes left over.
Also note that I’m not restricting the number of goroutines; it’s defined only by the file size and the buffer size. In reality, there might be an upper bound on this number.
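A sketch of the concurrent version; the chunk struct below is scaffolding of my own to carry each goroutine’s buffer size and offset:

```go
package main

import (
	"fmt"
	"io"
	"os"
	"sync"
)

const BufferSize = 100

// chunk carries the buffer size and file offset for one goroutine.
type chunk struct {
	bufsize int
	offset  int64
}

func main() {
	file, err := os.Open("filetoread.txt")
	if err != nil {
		fmt.Println(err)
		return
	}
	defer file.Close()

	fileinfo, err := file.Stat()
	if err != nil {
		fmt.Println(err)
		return
	}

	filesize := int(fileinfo.Size())
	// One goroutine per BufferSize-d chunk, rounding up for the
	// last, possibly smaller, chunk.
	concurrency := filesize / BufferSize
	if filesize%BufferSize != 0 {
		concurrency++
	}

	chunksizes := make([]chunk, concurrency)
	for i := 0; i < concurrency; i++ {
		chunksizes[i].bufsize = BufferSize
		chunksizes[i].offset = int64(BufferSize * i)
	}
	if remainder := filesize % BufferSize; remainder != 0 {
		chunksizes[concurrency-1].bufsize = remainder
	}

	var wg sync.WaitGroup
	wg.Add(concurrency)

	for i := 0; i < concurrency; i++ {
		go func(i int) {
			// Signal this goroutine's completion on return.
			defer wg.Done()

			c := chunksizes[i]
			buffer := make([]byte, c.bufsize)

			// ReadAt reads at an absolute offset, so concurrent
			// calls don't step on a shared file pointer.
			bytesread, err := file.ReadAt(buffer, c.offset)
			if err != nil && err != io.EOF {
				fmt.Println(err)
				return
			}

			fmt.Println("bytes read, string:", bytesread, string(buffer[:bytesread]))
		}(i)
	}

	wg.Wait()
}
```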
That’s way more code compared to any of the previous methods:
- I’m trying to create a specific number of goroutines, depending on the file size and our buffer size (100, in our case).
- We need a way to ensure that we are “waiting” for all the go routines to finish. In this example, I’m using a wait group.
- Instead of break-ing out of the infinite for loop, we signal from inside each goroutine when it’s done. Since we defer the call to wg.Done(), it gets called when the goroutine “return”s.
Note: Always check for the number of bytes returned, and reslice the output buffer.
You can go a long way with the Read() way of reading files, but sometimes you need more convenience. Something that gets used very often in Ruby is the set of IO functions like each_codepoint etc. We can achieve something similar by using the Scanner type and associated functions from the bufio package.

The bufio.Scanner type implements functions that take a “split” function, and advance a pointer based on this function. For instance, the bufio.ScanLines split function, on every iteration, advances the pointer until the next newline character. At each step, the type also exposes methods to obtain the byte array/string between the start and the end position. For example, a sketch that reads just the first line (filetoread.txt is a stand-in):
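```go
package main

import (
	"bufio"
	"fmt"
	"os"
)

func main() {
	file, err := os.Open("filetoread.txt")
	if err != nil {
		fmt.Println(err)
		return
	}
	defer file.Close()

	scanner := bufio.NewScanner(file)
	scanner.Split(bufio.ScanLines)

	// One Scan call advances past the first newline; Bytes/Text
	// return what was read (without the newline).
	if scanner.Scan() {
		fmt.Println("read byte array:", scanner.Bytes())
		fmt.Println("read string:", scanner.Text())
	}
}
```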
Therefore, to read the entire file this way on a line-by-line basis, a sketch like the following can be used:
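```go
package main

import (
	"bufio"
	"fmt"
	"os"
)

func main() {
	file, err := os.Open("filetoread.txt")
	if err != nil {
		fmt.Println(err)
		return
	}
	defer file.Close()

	scanner := bufio.NewScanner(file)
	scanner.Split(bufio.ScanLines)

	// Scan returns false at EOF (or on error), ending the loop.
	for scanner.Scan() {
		fmt.Println("read line:", scanner.Text())
	}
	if err := scanner.Err(); err != nil {
		fmt.Println(err)
	}
}
```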
Scanning word by word
The bufio package contains basic predefined split functions:
- ScanLines (default)
- ScanWords
- ScanRunes (highly useful for iterating over UTF-8 codepoints, as opposed to bytes)
- ScanBytes
So to read a file, and create a list of the words in it, a sketch like this can be used:
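```go
package main

import (
	"bufio"
	"fmt"
	"os"
)

func main() {
	file, err := os.Open("filetoread.txt")
	if err != nil {
		fmt.Println(err)
		return
	}
	defer file.Close()

	scanner := bufio.NewScanner(file)
	scanner.Split(bufio.ScanWords)

	// Append each whitespace-delimited token to the word list.
	var words []string
	for scanner.Scan() {
		words = append(words, scanner.Text())
	}

	fmt.Println("word list:")
	for _, word := range words {
		fmt.Println(word)
	}
}
```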
The ScanBytes split function will give the same output as what we’ve seen in the earlier Read() examples. One major difference between the two is the issue of dynamic allocation every time we need to append to the byte/string array when using a scanner. This can be circumvented by techniques like pre-initializing the buffer to a certain length, and growing it only when the previous limit is reached. Using the same example as above (the initial size of 50 below is an arbitrary pick):
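```go
package main

import (
	"bufio"
	"fmt"
	"os"
)

func main() {
	file, err := os.Open("filetoread.txt")
	if err != nil {
		fmt.Println(err)
		return
	}
	defer file.Close()

	scanner := bufio.NewScanner(file)
	scanner.Split(bufio.ScanWords)

	// Pre-allocate the word list, and grow it in bufferSize
	// increments only when it fills up.
	bufferSize := 50
	words := make([]string, bufferSize)
	pos := 0

	for scanner.Scan() {
		if pos >= len(words) {
			// Grow the slice by another bufferSize slots.
			words = append(words, make([]string, bufferSize)...)
		}
		words[pos] = scanner.Text()
		pos++
	}

	fmt.Println("word list:")
	// Only the first pos entries are filled; the rest are the
	// empty slots mentioned below.
	for _, word := range words[:pos] {
		fmt.Println(word)
	}
}
```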
So we end up making far fewer slice “grow” operations, but we might end up with some empty slots towards the end, depending on the initial buffer size and the number of words in the file, and that’s a tradeoff.
Splitting a long string into words
bufio.NewScanner takes, as an argument, a type that satisfies the io.Reader interface, which means it’ll work with any type that has a Read method defined on it. One of the string utility functions in the standard library that returns a “reader” type is the strings.NewReader function. We can combine the two when reading words out of a string; a sketch:
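```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

func main() {
	longstring := "This is a very long string, deal with it"

	// strings.NewReader wraps the string in a type that
	// implements io.Reader.
	scanner := bufio.NewScanner(strings.NewReader(longstring))
	scanner.Split(bufio.ScanWords)

	var words []string
	for scanner.Scan() {
		words = append(words, scanner.Text())
	}

	fmt.Println("word list:")
	for _, word := range words {
		fmt.Println(word)
	}
}
```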
Scanning a comma-separated string
Parsing a CSV file or string manually with the basic file.Read() or the Scanner type is cumbersome, because a “word”, as per the bufio.ScanWords split function, is defined as a run of runes delimited by unicode spaces. Reading individual runes while keeping track of the buffer size and position (like what’s done in lexing/parsing) is too much work.
This can be avoided, though. We can define a new split function that reads characters until the reader encounters a comma, and then returns that chunk when Bytes() is called. The signature of a bufio.SplitFunc function looks like this:
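```go
type SplitFunc func(data []byte, atEOF bool) (advance int, token []byte, err error)
```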
- data is the input byte string
- atEOF is a flag that’s passed to the function, signifying that there is no more input to come
- advance, using which we can specify the number of positions to treat as the current read length; this value is used to update the cursor position once the scan loop is complete
- token is the actual data of the scan operation
- err, in case you want to signal a problem
For simplicity’s sake, I’m showing an example of reading a string, and not a file. A simple reader for a comma-separated string using the above signature can be sketched like this:
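```go
package main

import (
	"bufio"
	"bytes"
	"fmt"
	"strings"
)

// scanCSV is a split function that returns the chunk of text
// before each comma as a token.
func scanCSV(data []byte, atEOF bool) (advance int, token []byte, err error) {
	// Nothing left to scan.
	if atEOF && len(data) == 0 {
		return 0, nil, nil
	}
	if i := bytes.IndexByte(data, ','); i >= 0 {
		// Consume the comma too, but don't include it in the token.
		return i + 1, data[:i], nil
	}
	// At EOF, return whatever remains as the final token.
	if atEOF {
		return len(data), data, nil
	}
	// Ask the Scanner for more data.
	return 0, nil, nil
}

func main() {
	scanner := bufio.NewScanner(strings.NewReader("red,green,blue"))
	scanner.Split(scanCSV)

	for scanner.Scan() {
		fmt.Println("token:", scanner.Text())
	}
}
```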
Ruby-ish style

We’ve seen multiple ways to read a file, in increasing order of convenience and power. But what if you just want to read a file into a buffer in one shot? ioutil is a package in the standard library that contains functions to make that a one-liner.
Reading an entire file
This is way closer to what we see in higher-level scripting languages.
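A sketch using ioutil.ReadFile:

```go
package main

import (
	"fmt"
	"io/ioutil"
)

func main() {
	// ReadFile opens the file, reads it entirely, and closes it.
	bytes, err := ioutil.ReadFile("filetoread.txt")
	if err != nil {
		fmt.Println(err)
		return
	}

	fmt.Println("bytes read:", len(bytes))
	fmt.Println("bytestream to string:", string(bytes))
}
```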
Reading an entire directory of files
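A sketch that combines ioutil.ReadDir with ioutil.ReadFile to slurp every file in the current directory:

```go
package main

import (
	"fmt"
	"io/ioutil"
)

func main() {
	// ReadDir returns a sorted list of directory entries.
	fileinfos, err := ioutil.ReadDir(".")
	if err != nil {
		fmt.Println(err)
		return
	}

	for _, fi := range fileinfos {
		if fi.IsDir() {
			continue
		}
		// Read each file in the directory fully into memory.
		bytes, err := ioutil.ReadFile(fi.Name())
		if err != nil {
			fmt.Println(err)
			continue
		}
		fmt.Println("bytes read:", len(bytes))
		fmt.Println("bytestream to string:", string(bytes))
	}
}
```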
Needless to say, DO NOT run this script if you have large files :D
More helper functions
There are more functions to read a file (or more precisely, a Reader) in the standard library. To avoid bloating this already long article, I’m listing out the functions I found:
- ioutil.ReadAll() -> takes an io-like object and returns the entire data as a byte array
- io.MultiReader -> a very useful primitive to combine multiple io-like objects. You can keep a list of files to be read and treat them as a single contiguous block of data, rather than managing the complexity of switching from one file object to the next (see the sketch below).
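A sketch of io.MultiReader in action (both file names are stand-ins):

```go
package main

import (
	"bufio"
	"fmt"
	"io"
	"os"
)

func main() {
	file1, err := os.Open("filetoread.txt")
	if err != nil {
		fmt.Println(err)
		return
	}
	defer file1.Close()

	file2, err := os.Open("anotherfiletoread.txt")
	if err != nil {
		fmt.Println(err)
		return
	}
	defer file2.Close()

	// Treat the two files as one contiguous stream.
	reader := io.MultiReader(file1, file2)

	scanner := bufio.NewScanner(reader)
	for scanner.Scan() {
		fmt.Println("read line:", scanner.Text())
	}
}
```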
In an attempt to highlight the “read” functions, I chose the path of using an error function that prints the error and closes the file. A sketch of that pattern, with handleError as a hypothetical stand-in:
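```go
package main

import (
	"log"
	"os"
)

// handleError is a hypothetical stand-in for the pattern: close
// the file and bail out. log.Fatal calls os.Exit internally, so
// deferred calls never run; and on the success path, nothing
// ever closes the file.
func handleError(file *os.File, err error) {
	if err != nil {
		file.Close()
		log.Fatal(err)
	}
}

func main() {
	file, err := os.Open("filetoread.txt")
	handleError(file, err)

	buffer := make([]byte, 100)
	_, err = file.Read(buffer)
	handleError(file, err)

	// The program ends here without ever closing the file handle.
}
```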
Doing this, I missed a crucial detail: I wasn’t closing the file handle when there were no errors and the program ran to completion. This results in a leakage of file descriptors if the program is run multiple times without any errors being raised. This was pointed out on reddit by u/shovelpost.
My original intention in avoiding defer was that log.Fatal calls os.Exit internally, which doesn’t run deferred functions, so I chose to explicitly close the file, but then missed the other case of a successful run 🤦🏽♂️.
I’ve updated the examples to use defer file.Close() and return-ing instead of log.Fatal.