-.- --. .-. --..

Reading files in Go — an overview

30 Dec 2017

1 Jan 2018: Update


When I started learning Go, it was hard for me to get a handle on the various APIs and techniques that can be used to read files. My attempting at writing a multi-core wordcounter (kgrz/kwc) shows evidence of this initial confusion—multiple ways used in the same file.

In this year’s Advent of Code there were some problems that required different styles of reading input. I ended up using each technique at least once, and now I have a mental map of these techniques. I’m writing those down in this post. I’m listing out the methods in the same way I encountered them, and not necessarily in decreasing order of simplicity.

Some basic assumptions

  • All code examples are wrapped inside a main() function block
  • I’ll be using the words “array” and “slice” interchangeably to refer to slices most of the time, but they are NOT the same. These blog posts are are two really good resources to understand the differences.
  • I’m uploading all the working examples to kgrz/reading-files-in-go.

In Go—and for that matter, most low level languages and some dynamic languages like Node—return a bytestream. There is a benefit of not auto-converting everything to a string, one of which is to avoid expensive string allocations which increases GC pressure.

To have a simpler mental model for this post, I’m converting an array of bytes to a string using the case string(arrayOfBytes). This should not be taken as a general advice when shipping production code, though.

Reading byte-wise

Reading the entire file into memory

To start off, the standard library provides multiple functions and utilities to read file data. We’ll start with a basic case that’s provided in the os package. This means two pre-requisites:

  1. The file has to fit in memory
  2. We need to know the size of the file upfront in order to instantiate a buffer large enough to hold it.

Having a handle to the os.File object, we can query the size and instantiate a list of bytes upfront.

file, err := os.Open("filetoread.txt")
if err != nil {
  fmt.Println(err)
  return
}
defer file.Close()

fileinfo, err := file.Stat()
if err != nil {
  fmt.Println(err)
  return
}

filesize := fileinfo.Size()
buffer := make([]byte, filesize)

bytesread, err := file.Read(buffer)
if err != nil {
  fmt.Println(err)
  return
}

fmt.Println("bytes read: ", bytesread)
fmt.Println("bytestream to string: ", string(buffer))

basic.go on Github

Reading a file in chunks

While reading a file at one shot works in most cases, sometimes we’d want to use a more memory-conservative approach. Say, read a file in chunks of some size, process them, and repeat until the end. In the following example, a buffer size of 100 bytes is used.

const BufferSize = 100
file, err := os.Open("filetoread.txt")
if err != nil {
  fmt.Println(err)
  return
}
defer file.Close()

buffer := make([]byte, BufferSize)

for {
  bytesread, err := file.Read(buffer)

  if err != nil {
    if err != io.EOF {
      fmt.Println(err)
    }

    break
  }

  fmt.Println("bytes read: ", bytesread)
  fmt.Println("bytestream to string: ", string(buffer[:bytesread]))
}

Compared to reading a file entirely, the main differences are:

  1. We read until we get a EOF marker, so we add a specific check for err == io.EOF. If you’re new to Go, and are confused about the conventions of errors, it might be useful to check out this post by Rob Pike: Errors are values
  2. We define the buffer size, so we have control on the “chunk” size we want. This can improve performance when used correctly because of the way operating systems work with caching a file that’s being read.
  3. If the file size is not a whole multiple of the buffer size, the last iteration will add only the remainder number of bytes to the buffer, hence the call to buffer[:bytesread]. In the normal case, bytesread will be the same as the buffer size.

This is quite similar to the following, in Ruby:

bufsize = 100
f = File.new "_config.yml", "r"

while readstring = f.read(bufsize)
  break if readstring.nil?

  puts readstring
end

For each iteration of the loop, an internal file pointer gets updated. When the next read happens, the data starting from the file pointer offset, and upto the size of the buffer gets returned. This pointer is not a construct of the language, but is one of the OS. On linux, this pointer is a property of the file descriptor that gets created. All the read/Read calls (in Ruby/Go respectively) get translated internally to system calls and sent to the kernel, and the kernel manages this pointer.

Reading file chunks concurrently

What if we wanted to speed up the processing of the chunks mentioned above? One way to do that is to use multiple go routines! The one extra work we need to do compared to reading chunks serially is we need to know the offset for each routine. Note that ReadAt behaves slightly different from the way Read does when the size of the target buffer is larger than the number of bytes left over.

Also note that I’m not restricting the number of goroutines, and it’s only defined by the buffer size. In reality, there might be an upper bound on this number.

const BufferSize = 100
file, err := os.Open("filetoread.txt")
if err != nil {
  fmt.Println(err)
  return
}
defer file.Close()

fileinfo, err := file.Stat()
if err != nil {
  fmt.Println(err)
  return
}

filesize := int(fileinfo.Size())
// Number of go routines we need to spawn.
concurrency := filesize / BufferSize

// check for any left over bytes. Add one more go routine if required.
if remainder := filesize % BufferSize; remainder != 0 {
  concurrency++
}

var wg sync.WaitGroup
wg.Add(concurrency)

for i := 0; i < concurrency; i++ {
  go func(chunksizes []chunk, i int) {
    defer wg.Done()

    chunk := chunksizes[i]
    buffer := make([]byte, chunk.bufsize)
    bytesread, err := file.ReadAt(buffer, chunk.offset)

    // As noted above, ReadAt differs slighly compared to Read when the
    // output buffer provided is larger than the data that's available
    // for reading. So, let's return early only if the error is
    // something other than an EOF. Returning early will run the
    // deferred function above
    if err != nil && err != io.EOF {
      fmt.Println(err)
      return
    }

    fmt.Println("bytes read, string(bytestream): ", bytesread)
    fmt.Println("bytestream to string: ", string(buffer[:bytesread]))
  }(chunksizes, i)
}

wg.Wait()

That’s way more compared to any of the previous methods:

  1. I’m trying to create a specific number of Go-routines, depending on file size and our buffer size (100, in our case).
  2. We need a way to ensure that we are “waiting” for all the go routines to finish. In this example, I’m using a wait group.
  3. Instead of break-ing out of the infinite for loop, we signal from inside each Goroutine when it’s done. Since we defer the call to wg.Done(), it gets called when the go routine “return”s.

Note: Always check for the number of bytes returned, and reslice the output buffer.

Scanning

You can go a long way with the Read() way of reading files, but sometimes you need more convenience. Something that gets used very often in Ruby are the IO functions like each_line, each_char, each_codepoint etc. We can achieve something similar by using the Scanner type, and associated functions from the bufio package.

The bufio.Scanner type implements functions that take a “split” function, and advance a pointer based on this function. For instance, the built-in bufio.ScanLines split function, for every iteration, advances the pointer until the next newline character. At each step, the type also exposes methods to obtain the byte array/string between the start and the end position. For example:

file, err := os.Open("filetoread.txt")
if err != nil {
  fmt.Println(err)
  return
}
defer file.Close()

scanner := bufio.NewScanner(file)
scanner.Split(bufio.ScanLines)

// Returns a boolean based on whether there's a next instance of `\n`
// character in the IO stream. This step also advances the internal pointer
// to the next position (after '\n') if it did find that token.
read := scanner.Scan()

if read {
  fmt.Println("read byte array: ", scanner.Bytes())
  fmt.Println("read string: ", scanner.Text())
}

// goto Scan() line, and repeat

scanner-example.go on Github

Therefore, to read the entire file this way on a line-to-line basis, something like this can be used:

file, err := os.Open("filetoread.txt")
if err != nil {
  fmt.Println(err)
  return
}
defer file.Close()

scanner := bufio.NewScanner(file)
scanner.Split(bufio.ScanLines)

// This is our buffer now
var lines []string

for scanner.Scan() {
  lines = append(lines, scanner.Text())
}

fmt.Println("read lines:")
for _, line := range lines {
  fmt.Println(line)
}

scanner.go on Github

Scanning word by word

The bufio package contains basic predefined split functions:

  1. ScanLines (default)
  2. ScanWords
  3. ScanRunes (highly useful for iterating over UTF-8 codepoints, as opposed to bytes)
  4. ScanBytes

So to read a file, and create a list of words in the file, something like this can be used:

file, err := os.Open("filetoread.txt")
if err != nil {
  fmt.Println(err)
  return
}
defer file.Close()

scanner := bufio.NewScanner(file)
scanner.Split(bufio.ScanWords)

var words []string

for scanner.Scan() {
	words = append(words, scanner.Text())
}

fmt.Println("word list:")
for _, word := range words {
	fmt.Println(word)
}

The ScanBytes split function will give the same output as what we’ve seen in the early Read() examples. One major difference between the two is the issue of dynamic allocation every time we need to append to the byte/string array in the case of a scanner. This can be circumvented by techniques like pre-initializing the buffer to a certain length, and increasing the size only when you reach the previous limit. Using the same example as above:

file, err := os.Open("filetoread.txt")
if err != nil {
  fmt.Println(err)
  return
}
defer file.Close()

scanner := bufio.NewScanner(file)
scanner.Split(bufio.ScanWords)

// initial size of our wordlist
bufferSize := 50
words := make([]string, bufferSize)
pos := 0

for scanner.Scan() {
  if err := scanner.Err(); err != nil {
    // This error is a non-EOF error. End the iteration if we encounter
    // an error
    fmt.Println(err)
    break
  }

  words[pos] = scanner.Text()
  pos++

  if pos >= len(words) {
    // expand the buffer by 100 again
    newbuf := make([]string, bufferSize)
    words = append(words, newbuf...)
  }
}

fmt.Println("word list:")
// we are iterating only until the value of "pos" because our buffer size
// might be more than the number of words because we increase the length by
// a constant value. Or the scanner loop might've terminated due to an
// error prematurely. In this case the "pos" contains the index of the last
// successful update.
for _, word := range words[:pos] {
  fmt.Println(word)
}

So we end up making far fewer slice “grow” operations, but we might endup with some empty slots towards the end depending on bufferSize and the number of words in the file, and that’s a tradeoff.

Splitting a long string into words

bufio.NewScanner takes, as an argument, a type that satisfies an io.Reader interface, which means it’ll work with any type that has a Read method defined on it. One of the string utility methods in the standard library that return a “reader” type is strings.NewReader function. We can combine these both when reading words out of a string:

file, err := os.Open("_config.yml")
longstring := "This is a very long string. Not."
handle(err)

var words []string

scanner := bufio.NewScanner(strings.NewReader(longstring))
scanner.Split(bufio.ScanWords)

for scanner.Scan() {
	words = append(words, scanner.Text())
}

fmt.Println("word list:")
for _, word := range words {
	fmt.Println(word)
}

Scanning Comma-seperated string

Parsing a CSV file/string manually with the basic file.Read() or the Scanner type is cumbersome, because a “word” as per the split function bufio.ScanWords is defined as a bunch of runes delimited by a unicode space. Reading in individual runes, and keeping track of the buffer size and the position (like what’s done in lexing/parsing) is too much work and manipulation.

This can be avoided though. We can define a new split function that reads characters until the reader encounters a comma, and then return that chunk when Text() or Bytes() is called. The function signature of a bufio.SplitFunc function looks like:

(data []byte, atEOF bool) -> (advance int, token []byte, err error)
  1. data is the input byte string
  2. atEOF is a flag that’s passed to the function signifiying the end of tokens
  3. advance using which we can specify the number of positions to treat as the current read length. This value is used to update the cursor position once the scan loop is complete
  4. token is the actual data of the scan operation
  5. err incase you want to signal a problem.

For simplicity purposes, I’m showing an example for reading a string, and not a file. A simple reader for a CSV string using the above signature can be:

csvstring := "name, age, occupation"

// An anonymous function declaration to avoid repeating main()
ScanCSV := func(data []byte, atEOF bool) (advance int, token []byte, err error) {
  commaidx := bytes.IndexByte(data, ',')
  if commaidx > 0 {
    // we need to return the next position
    buffer := data[:commaidx]
    return commaidx + 1, bytes.TrimSpace(buffer), nil
  }

  // if we are at the end of the string, just return the entire buffer
  if atEOF {
    // but only do that when there is some data. If not, this might mean
    // that we've reached the end of our input CSV string
    if len(data) > 0 {
      return len(data), bytes.TrimSpace(data), nil
    }
  }

  // when 0, nil, nil is returned, this is a signal to the interface to read
  // more data in from the input reader. In this case, this input is our
  // string reader and this pretty much will never occur.
  return 0, nil, nil
}

scanner := bufio.NewScanner(strings.NewReader(csvstring))
scanner.Split(ScanCSV)

for scanner.Scan() {
  fmt.Println(scanner.Text())
}

Ruby-ish style

We’ve seen multiple ways to read a file, in increasing order of convenience and power. But what if you just want to read a file into a buffer? ioutil is a package in the standard library that contains some functions to make it a one-liner.

Reading an entire file

bytes, err := ioutil.ReadFile("_config.yml")
if err != nil {
  log.Fatal(err)
}

fmt.Println("Bytes read: ", len(bytes))
fmt.Println("String read: ", string(bytes))

This is way closer to what we see in higher-level scripting languages.

Reading an entire directory of files

Needless to say, DO NOT run this script if you have large files :D

filelist, err := ioutil.ReadDir(".")
if err != nil {
  log.Fatal(err)
}

for _, fileinfo := range filelist {
  if fileinfo.Mode().IsRegular() {
    bytes, err := ioutil.ReadFile(fileinfo.Name())
    if err != nil {
      log.Fatal(err)
    }
    fmt.Println("Bytes read: ", len(bytes))
    fmt.Println("String read: ", string(bytes))
  }
}

More helper functions

There are more functions to read a file (or more precisely, a Reader) in the standard library. To avoid bloating this already long article, I’m listing out the functions I found:

  1. ioutil.ReadAll() -> Takes an io-like object and returns the entire data as a byte array
  2. io.ReadFull()
  3. io.ReadAtLeast()
  4. io.MultiReader -> A very useful primitive to combine multiple io-like objects. So you can have a list of files to be read, and treat them as a single contiguous block of data rather than managing the complexity of switching the file objects at the end of each of the previous objects.

Update

In an attempt to highlight the “read” functions, I chose the path of using a error function that prints out and closes the file:

func handleFn(file *os.File) func(error) {
  return func(err error) {
    if err != nil {
      file.Close()
      log.Fatal(err)
    }
  }
}

// inside the main function:
file, err := os.Open("filetoread.txt")
handle := handleFn(file)
handle(err)

Doing this, I missed a crucial detail: I wasn’t closing the file handle when there were no errors and the program ran to completion. This results in a leakage of file descriptors if the program is run multiple times without any errors being raised. This was pointed out on reddit by u/shovelpost.

My original intention to avoid defer was because log.Fatal calls os.Exit internally which doesn’t run deferred functions, so I chose to explicitly close the file, but then missed out the other case of successful run 🤦🏽‍♂️.

I’ve updated the examples to use defer, and return-ing instead of relying on os.Exit().