1 Jan 2018: Update
This post is intended to serve as a quick intro to the many options in the Go standard library for reading files. In this year's Advent of Code 2017 there were many problems that required different styles of reading input, and I ended up using most of them. When I started learning Go, it was hard for me to figure out the easiest way to do this, which was the main impulse for writing these techniques down. The methods are not necessarily listed in decreasing order of simplicity.
Some basic assumptions
- All code examples are wrapped inside a `main()` function block.
- I'll be using the words "array" and "slice" interchangeably to refer to slices most of the time, but they are NOT the same. These blog posts are two really good resources to understand the differences.
- I’m uploading all the working examples to kgrz/reading-files-in-go.
In Go, as in most low-level languages and some dynamic languages like Node, file reads return a bytestream. There is a benefit to not auto-converting everything to a string: it avoids expensive string allocations, which increase GC pressure.
To have a simpler mental model for this post, I'm converting the array of `byte`s to a string using `string(arrayOfBytes)`. This should not be taken as general advice for production code, though.
Reading byte-wise
Reading the entire file into memory
To start off, the standard library provides multiple functions and utilities for reading file data. We'll start with a basic case that's covered by the `os` package. This means two pre-requisites:
- The file has to fit in memory
- We need to know the size of the file upfront in order to instantiate a buffer large enough to hold it.
Having a handle to the `os.File` object, we can query the size and instantiate a list of bytes upfront.
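A minimal sketch of that approach, assuming a placeholder file named filetoread.txt:

```go
package main

import (
	"fmt"
	"os"
)

func main() {
	// "filetoread.txt" is a placeholder path; swap in any file you want to read.
	file, err := os.Open("filetoread.txt")
	if err != nil {
		fmt.Println(err)
		return
	}
	defer file.Close()

	// Query the file size so we can allocate a buffer that fits the whole file.
	fileinfo, err := file.Stat()
	if err != nil {
		fmt.Println(err)
		return
	}

	filesize := fileinfo.Size()
	buffer := make([]byte, filesize)

	// A single Read call fills the buffer with the file's contents.
	bytesread, err := file.Read(buffer)
	if err != nil {
		fmt.Println(err)
		return
	}

	fmt.Println("bytes read: ", bytesread)
	fmt.Println("bytestream to string: ", string(buffer))
}
```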
Reading a file in chunks
While reading a file in one shot works in most cases, sometimes we'd want to use a more memory-conservative approach: say, read a file in chunks of some size, process them, and repeat until the end. In the following example, a buffer size of 100 bytes is used.
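Here's a sketch, again using the hypothetical filetoread.txt, with a 100-byte buffer:

```go
package main

import (
	"fmt"
	"io"
	"os"
)

func main() {
	file, err := os.Open("filetoread.txt")
	if err != nil {
		fmt.Println(err)
		return
	}
	defer file.Close()

	// The buffer size controls how much is read on each iteration.
	buffer := make([]byte, 100)

	for {
		bytesread, err := file.Read(buffer)
		if err != nil {
			// io.EOF signals that the whole file has been consumed.
			if err != io.EOF {
				fmt.Println(err)
			}
			break
		}

		// Reslice to bytesread so a final partial chunk isn't padded with stale data.
		fmt.Println("bytes read: ", bytesread)
		fmt.Println("bytestream to string: ", string(buffer[:bytesread]))
	}
}
```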
Compared to reading a file entirely, the main differences are:
- We read until we get an EOF marker, so we add a specific check for `err == io.EOF`. If you're new to Go and are confused about the conventions around errors, it might be useful to check out this post by Rob Pike: Errors are values.
- We define the buffer size, so we have control over the "chunk" size we want. This can improve performance when used correctly, because of the way operating systems cache a file that's being read.
- If the file size is not a whole multiple of the buffer size, the last iteration will add only the remaining bytes to the buffer, hence the call to `buffer[:bytesread]`. In the normal case, `bytesread` will be the same as the buffer size.
This is quite similar to the following, in Ruby:
For each iteration of the loop, the internal file pointer gets updated. When the next read happens, the data starting from the file pointer offset, up to the size of the buffer, gets returned. This pointer is not a construct of the language, but of the operating system. On Linux, this pointer is a property of the file descriptor that gets created. All the read/Read calls (in Ruby/Go respectively) are translated internally into system calls, sent to the kernel, and the kernel manages this pointer.
Reading file chunks concurrently
What if we wanted to speed up the processing of the chunks mentioned above? One way to do that is to use multiple goroutines! The one extra piece of work we need to do, compared to reading chunks serially, is that we need to know the offset for each routine. Note that `ReadAt` behaves slightly differently from `Read` when the size of the target buffer is larger than the number of bytes left over.
Also note that I'm not restricting the number of goroutines; it's only defined by the buffer size. In reality, there is probably an upper bound on this number.
That's a lot more work compared to any of the previous methods:
- I'm trying to create a specific number of goroutines, depending on the file size and our buffer size (100, in our case).
- We need a way to ensure that we're "waiting" for all the goroutines to finish; in this example, I'm using a wait group.
- Instead of `break`-ing out of the infinite `for` loop, we signal from inside each goroutine when it's done. Since we `defer` the call to `wg.Done()`, it gets called when the goroutine "return"s.
Note: Always check for the number of bytes returned, and reslice the output buffer.
Scanning
You can go a long way with the `Read()` way of reading files, but sometimes you need more convenience. Something that gets used very often in Ruby is the set of IO functions like `each_line`, `each_char`, `each_codepoint`, etc. We can achieve something similar by using the `Scanner` type and associated functions from the `bufio` package.
The `bufio.Scanner` type implements functions that take a "split" function and advance a pointer based on it. For instance, the built-in `bufio.ScanLines` split function, on every iteration, advances the pointer until the next newline character. At each step, the type also exposes methods to obtain the byte array/string between the start and end positions. For example:
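A minimal sketch of a single scan step over the hypothetical filetoread.txt:

```go
package main

import (
	"bufio"
	"fmt"
	"os"
)

func main() {
	file, err := os.Open("filetoread.txt")
	if err != nil {
		fmt.Println(err)
		return
	}
	defer file.Close()

	scanner := bufio.NewScanner(file)
	scanner.Split(bufio.ScanLines)

	// Scan() returns false at EOF or on error; Bytes()/Text() expose the
	// current token, i.e. the first line of the file here.
	if scanner.Scan() {
		fmt.Println("read byte array: ", scanner.Bytes())
		fmt.Println("read string: ", scanner.Text())
	}
}
```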
Therefore, to read the entire file this way on a line-by-line basis, something like this can be used:
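A sketch that accumulates every line into a slice:

```go
package main

import (
	"bufio"
	"fmt"
	"os"
)

func main() {
	file, err := os.Open("filetoread.txt")
	if err != nil {
		fmt.Println(err)
		return
	}
	defer file.Close()

	scanner := bufio.NewScanner(file)
	scanner.Split(bufio.ScanLines)

	// Scan() keeps advancing one line at a time until EOF.
	var lines []string
	for scanner.Scan() {
		lines = append(lines, scanner.Text())
	}

	fmt.Println("read lines:")
	for _, line := range lines {
		fmt.Println(line)
	}
}
```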
Scanning word by word
The `bufio` package contains basic predefined split functions:
- ScanLines (default)
- ScanWords
- ScanRunes (highly useful for iterating over UTF-8 codepoints, as opposed to bytes)
- ScanBytes
So to read a file, and create a list of words in the file, something like this can be used:
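Here's a sketch that swaps the split function for `bufio.ScanWords` (filetoread.txt is still a placeholder):

```go
package main

import (
	"bufio"
	"fmt"
	"os"
)

func main() {
	file, err := os.Open("filetoread.txt")
	if err != nil {
		fmt.Println(err)
		return
	}
	defer file.Close()

	scanner := bufio.NewScanner(file)
	// Swap the default ScanLines split function for ScanWords.
	scanner.Split(bufio.ScanWords)

	var words []string
	for scanner.Scan() {
		words = append(words, scanner.Text())
	}

	fmt.Println("word list:")
	for _, word := range words {
		fmt.Println(word)
	}
}
```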
The `ScanBytes` split function will give the same output as what we've seen in the early `Read()` examples. One major difference between the two is the dynamic allocation that happens every time we need to append to the byte/string slice in the case of a scanner. This can be circumvented by techniques like pre-initializing the buffer to a certain length and increasing its size only when the previous limit is reached. Using the same example as above:
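A sketch of that idea, with an arbitrary starting `bufferSize` of 50 and a 1.5x growth factor (both numbers are assumptions, not anything the standard library prescribes):

```go
package main

import (
	"bufio"
	"fmt"
	"math"
	"os"
)

func main() {
	file, err := os.Open("filetoread.txt")
	if err != nil {
		fmt.Println(err)
		return
	}
	defer file.Close()

	scanner := bufio.NewScanner(file)
	scanner.Split(bufio.ScanWords)

	// Start with a fixed-size slice instead of appending one word at a time.
	bufferSize := 50
	words := make([]string, bufferSize)
	pos := 0

	for scanner.Scan() {
		words[pos] = scanner.Text()
		pos++

		if pos >= len(words) {
			// Grow by ~1.5x only when the current buffer is full.
			newbuf := make([]string, int(math.Ceil(float64(len(words))*1.5)))
			copy(newbuf, words)
			words = newbuf
		}
	}

	// Only the first pos entries are real words; trailing slots stay empty.
	fmt.Println("word list:")
	for _, word := range words[:pos] {
		fmt.Println(word)
	}
}
```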
So we end up making far fewer slice "grow" operations, but we might end up with some empty slots towards the end, depending on `bufferSize` and the number of words in the file, and that's a tradeoff.
Splitting a long string into words
`bufio.NewScanner` takes, as an argument, a type that satisfies the `io.Reader` interface, which means it'll work with any type that has a `Read` method defined on it. One of the string utilities in the standard library that returns a "reader" type is the `strings.NewReader` function. We can combine the two when reading words out of a string:
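A sketch combining `strings.NewReader` with the word scanner (the input string is made up):

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

func main() {
	longstring := "This is a very long string. Not."

	// strings.NewReader gives us an io.Reader over the string,
	// which bufio.NewScanner happily accepts.
	scanner := bufio.NewScanner(strings.NewReader(longstring))
	scanner.Split(bufio.ScanWords)

	var words []string
	for scanner.Scan() {
		words = append(words, scanner.Text())
	}

	fmt.Println("word list:")
	for _, word := range words {
		fmt.Println(word)
	}
}
```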
Scanning a comma-separated string
Parsing a CSV file or string manually with the basic `file.Read()` or the `Scanner` type is cumbersome, because a "word", as per the split function `bufio.ScanWords`, is defined as a run of runes delimited by Unicode space. Reading individual runes and keeping track of the buffer size and position (like what's done in lexing/parsing) is too much work and manipulation.
This can be avoided, though. We can define a new split function that reads characters until the reader encounters a comma, and then returns that chunk when `Text()` or `Bytes()` is called. The signature of a `bufio.SplitFunc` function looks like this:
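```go
// The SplitFunc type as declared in the bufio package:
type SplitFunc func(data []byte, atEOF bool) (advance int, token []byte, err error)
```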
where:

- `data` is the input byte string
- `atEOF` is a flag passed to the function signifying that there is no more input data
- `advance` lets us specify the number of bytes to treat as the current read length; this value is used to update the cursor position once the scan loop is complete
- `token` is the actual data of the scan operation
- `err` is for when you want to signal a problem
For simplicity, I'm showing an example that reads from a string rather than a file. A simple reader for a CSV string using the above signature could be:
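Here's one possible sketch; `ScanCSV` is a made-up name, and this simple version doesn't handle quoted fields or other CSV subtleties (the standard library's `encoding/csv` package does):

```go
package main

import (
	"bufio"
	"bytes"
	"fmt"
	"strings"
)

func main() {
	csvstring := "name, age, occupation"

	// A custom split function that returns everything up to (but not
	// including) the next comma, and advances past the comma itself.
	ScanCSV := func(data []byte, atEOF bool) (advance int, token []byte, err error) {
		commaidx := bytes.IndexByte(data, ',')
		if commaidx > 0 {
			// Found a comma: the token is everything before it.
			buffer := data[:commaidx]
			return commaidx + 1, bytes.TrimSpace(buffer), nil
		}

		// No comma left; at EOF, return whatever remains as the final token.
		if atEOF && len(data) > 0 {
			return len(data), bytes.TrimSpace(data), nil
		}

		// Otherwise, request more data.
		return 0, nil, nil
	}

	scanner := bufio.NewScanner(strings.NewReader(csvstring))
	scanner.Split(ScanCSV)

	for scanner.Scan() {
		fmt.Println(scanner.Text())
	}
}
```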
Ruby-ish style
We've seen multiple ways to read a file, in increasing order of convenience and power. But what if you just want to read a file into a buffer? `ioutil` is a package in the standard library that contains some functions to make it a one-liner.
Reading an entire file
This is way closer to what we see in higher-level scripting languages.
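A sketch using `ioutil.ReadFile` on the placeholder file:

```go
package main

import (
	"fmt"
	"io/ioutil"
)

func main() {
	// ReadFile opens the file, reads it fully, and closes it for us.
	bytes, err := ioutil.ReadFile("filetoread.txt")
	if err != nil {
		fmt.Println(err)
		return
	}

	fmt.Println("Bytes read: ", len(bytes))
	fmt.Println("String read: ", string(bytes))
}
```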
Reading an entire directory of files
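A sketch that reads every regular file in the current directory into memory, using `ioutil.ReadDir` and `ioutil.ReadFile`:

```go
package main

import (
	"fmt"
	"io/ioutil"
)

func main() {
	// ReadDir returns a sorted list of os.FileInfo for the directory entries.
	filelist, err := ioutil.ReadDir(".")
	if err != nil {
		fmt.Println(err)
		return
	}

	for _, fileinfo := range filelist {
		if fileinfo.Mode().IsRegular() {
			// Read each regular file fully into memory.
			bytes, err := ioutil.ReadFile(fileinfo.Name())
			if err != nil {
				fmt.Println(err)
				return
			}
			fmt.Println("Bytes read: ", len(bytes))
			fmt.Println("String read: ", string(bytes))
		}
	}
}
```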
Needless to say, DO NOT run this script if you have large files :D
More helper functions
There are more functions to read a file (or more precisely, a Reader) in the standard library. To avoid bloating this already long article, I’m listing out the functions I found:
- `ioutil.ReadAll()` -> takes an io-like object and returns the entire data as a byte array
- `io.ReadFull()`
- `io.ReadAtLeast()`
- `io.MultiReader` -> a very useful primitive to combine multiple io-like objects. You can have a list of files to be read and treat them as a single contiguous block of data, rather than managing the complexity of switching file objects at the end of each one (see the sketch below).
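A small sketch of `io.MultiReader`, using two in-memory readers standing in for file handles:

```go
package main

import (
	"fmt"
	"io"
	"io/ioutil"
	"strings"
)

func main() {
	// Two hypothetical readers; in practice these could be os.File handles.
	r1 := strings.NewReader("first chunk, ")
	r2 := strings.NewReader("second chunk")

	// MultiReader presents both as one contiguous stream.
	combined := io.MultiReader(r1, r2)

	bytes, err := ioutil.ReadAll(combined)
	if err != nil {
		fmt.Println(err)
		return
	}
	fmt.Println(string(bytes))
}
```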
Update
In an attempt to highlight the "read" functions, I chose to use an error-handling function that prints the error and closes the file:
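A sketch of that pattern; the helper name `printAndClose` is illustrative, not taken from the original examples:

```go
package main

import (
	"fmt"
	"os"
)

// printAndClose: on error, print the message, close the file, and exit.
// On success it does nothing, so the file handle is never closed on the
// happy path, which is exactly the leak described below.
func printAndClose(file *os.File, err error) {
	if err != nil {
		fmt.Println(err)
		file.Close()
		os.Exit(1)
	}
}

func main() {
	file, err := os.Open("filetoread.txt")
	printAndClose(file, err)

	// ... read from file ...
	_ = file
}
```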
I missed a crucial detail: I wasn’t closing the file handle when there were no errors, and may leak file descriptors. This was pointed out on reddit by u/shovelpost.
The reason I wanted to avoid `defer` was that `log.Fatal` calls `os.Exit` internally, which doesn't run deferred functions, so I chose to explicitly close the file, but then missed the success case 🤦🏽♂️.
I've updated the examples to use `defer`, and to `return` instead of relying on `os.Exit()`.
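A sketch of the updated pattern:

```go
package main

import (
	"fmt"
	"os"
)

func main() {
	file, err := os.Open("filetoread.txt")
	if err != nil {
		fmt.Println(err)
		// Returning (instead of os.Exit) lets deferred calls run.
		return
	}
	// defer now covers both the success path and any early returns below.
	defer file.Close()

	// ... read from file ...
}
```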