Robust File validations in Ruby using magic numbers

Traditionally, file validations in Ruby/Rails are done by checking the file extension. Although this is convenient, it presents a security risk in some cases. For example, a file that has a .pdf extension might actually be an executable binary that installs malware on the host machine. Case in point: the USB shadow file malware that plagues Windows machines. To avoid such security holes, more robust validation of files can be achieved using “Magic Numbers”.

Magic numbers

Many binary file formats have a token at a pre-determined location of a file that specifies to the OS what kind of a file it is. These tokens are called “Magic Numbers”. For an exhaustive list, check this article by Gary Kessler.

For example, I’ve created an empty PDF file available here. If you were to open this file using a text editor like Vim or Emacs, these would be the contents of the first line:

%PDF-1.5 %âãÏÓ

This file is a blank file with dimensions of 0.013889 × 0.013889 inches. That length, 1/72 of an inch, is the unit of measurement known as a point.

Also notice the XML-ey structure in that file. Although the file is blank, it still has a white background color and other metadata, such as the creation and modification dates, embedded inside it. So even the simplest file is guaranteed to have a non-zero size. The exception may be an empty plain text file in a simple encoding: it has no header information at all, resulting in a length of zero bytes. (Encodings like UTF-16LE, by contrast, can start with a header that specifies the encoding, known as the Byte Order Mark, or BOM.)

For a PDF, the magic number, a.k.a. the file signature, is 25504446 in hexadecimal notation, which is the ASCII string %PDF. Some file signatures contain non-ASCII bytes; JPEG, for example, has the signature ffd8ffe0.
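The relationship between the hexadecimal notation and the raw bytes can be checked directly in Ruby. This is a small illustration using String#unpack1 and Array#pack; the variable names are made up for the example:

```ruby
# Hex representation of the four bytes "%PDF".
pdf_magic_hex = "%PDF".unpack1("H*")
puts pdf_magic_hex          # prints 25504446

# Going the other way: hex string back to raw bytes.
pdf_magic = ["25504446"].pack("H*")
puts pdf_magic              # prints %PDF

# The JPEG signature contains bytes outside the ASCII range.
jpeg_magic = ["ffd8ffe0"].pack("H*")
p jpeg_magic.bytes          # prints [255, 216, 255, 224]
p jpeg_magic.ascii_only?    # prints false
```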

Many formats also specify a trailing signature. However, some software and libraries that create the files might be a bit undisciplined and fail to add a proper trailing signature. Hence, care has to be taken when relying on trailing signatures for validations.

Usage in Ruby

To check these signatures in Ruby, using our knowledge of the magic numbers for various binary file formats, two not-so-frequently used IO instance methods come in handy: IO#seek and IO#readpartial.

IO#seek

The IO#seek method moves the cursor to the byte offset passed as its first argument. The second argument specifies where the offset is measured from, using one of the constants IO::SEEK_END, IO::SEEK_CUR or IO::SEEK_SET. If we have to read the last 7 bytes of a file without loading the entire file into memory, we can use the seek method like so:

file = File.new('small_empty_file.pdf', 'r')

file.seek(-7, IO::SEEK_END)
puts file.read
file.close

# Will output:
#   "%%EOF\r\n"

We pass in a negative number because we need to seek that many bytes back from end of the file.
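Seeking further back than the start of the file raises Errno::EINVAL, so a reusable version needs to clamp the offset. The following is a minimal sketch under that assumption; tail_bytes is a hypothetical helper name, and a temp file stands in for a real PDF:

```ruby
require "tempfile"

# Hypothetical helper: return the last `count` bytes of the file at `path`
# without reading the whole file into memory.
def tail_bytes(path, count)
  # "rb" opens the file in binary mode, so no newline or encoding
  # conversion is applied to the bytes.
  File.open(path, "rb") do |file|
    # Clamp the offset so we never seek before the start of the file.
    offset = [count, file.size].min
    file.seek(-offset, IO::SEEK_END)
    file.read
  end
end

# Stand-in for a real PDF: a temp file that ends with the PDF trailer.
sample = Tempfile.new(["sample", ".pdf"])
sample.binmode
sample.write("%PDF-1.5\nbody\n%%EOF\r\n")
sample.flush

p tail_bytes(sample.path, 7)               # prints "%%EOF\r\n"
p tail_bytes(sample.path, 1000).bytesize   # prints 21 (the whole file)
```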

IO#readpartial

The readpartial method reads at most the number of bytes passed as its argument and returns the data as a string. For the sample file mentioned above, we can use it to get the first 4 bytes of the file, which is what we need to check whether the file is a PDF:

file = File.new('small_empty_file.pdf', 'r')

puts file.readpartial(4)
file.close

# Will output:
#   "%PDF"
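One behavior worth knowing before building on readpartial: at the end of a file it raises EOFError, whereas IO#read with a length returns nil. The sketch below demonstrates the difference on an empty temp file, which stands in for an empty upload:

```ruby
require "tempfile"

# An empty temp file standing in for an empty upload.
empty = Tempfile.new("empty")

read_result = :not_set
error_class = nil

File.open(empty.path, "rb") do |file|
  # IO#read with a length returns nil at end of file.
  read_result = file.read(4)

  begin
    # IO#readpartial raises EOFError instead.
    file.readpartial(4)
  rescue EOFError => e
    error_class = e.class
  end
end

p read_result    # prints nil
p error_class    # prints EOFError
```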

Using these two methods, we could prepare a naive PDF file validator like so:

class MagicPdfValidator
  attr_reader :file
  attr_reader :starting_signature, :trailing_signature

  VALID_TRAILING_SIGNATURE = "2525454f460d0a"
  VALID_STARTING_SIGNATURE = "25504446"

  def initialize(file)
    raise "Expecting a file object as an argument" unless file.is_a?(File)

    # Ensure there are sufficient number of bytes to determine the
    # signatures. If this check is not present, an empty text file
    # masquerading as a PDF file will throw EOF Errors while reading the
    # bytes.
    if file.stat.size < minimum_bytes_for_determining_signature
      raise "File too small to calculate signature"
    end

    @file = file
    process_file!
  end

  def starting_signature_bytes
    4
  end

  def trailing_signature_bytes
    7
  end

  def valid?
    starting_signature_valid? && trailing_signature_valid?
  end

  private

  def minimum_bytes_for_determining_signature
    starting_signature_bytes + trailing_signature_bytes
  end

  def process_file!
    read_starting_signature!
    read_trailing_signature!

    # Ensure the file is closed after reading the starting and trailing
    # bytes
    @file.close
  end

  def read_starting_signature!
    @file.rewind
    starting_bytes = @file.readpartial(starting_signature_bytes)
    @starting_signature = starting_bytes.unpack("H*").first
  end

  def read_trailing_signature!
    @file.rewind
    @file.seek(trailing_signature_bytes * -1, IO::SEEK_END)
    trailing_bytes = @file.read
    @trailing_signature = trailing_bytes.unpack("H*").first
  end

  def starting_signature_valid?
    @starting_signature == VALID_STARTING_SIGNATURE
  end

  def trailing_signature_valid?
    @trailing_signature == VALID_TRAILING_SIGNATURE
  end
end

puts MagicPdfValidator.new(File.new('small_empty_file.pdf', 'r')).valid?

#=> true

The same technique can be used for any binary file format that has signatures. A format may have multiple possible signatures, which is typically seen where the format spec keeps being updated and newer versions are introduced under the same umbrella. In those cases, the format is usually implemented to be backwards compatible. Cases in point:

PDF actually has multiple possible trailing signatures. We’ve used only one possibility, thus ensuring the code lives up to its “naive” tag.
JPEG signatures differ depending on the software that created the file. JPEG files that have EXIF information embedded in them will have a slightly different starting signature from the ones that don’t. EXIF information is metadata generated by cameras and other instruments, such as shutter speed, aperture size and focal length. So a JPEG file created by a camera will have a different starting signature from one created by a software program like GIMP.
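One way to cope with several signatures per format is to keep a list of accepted values for each format and match against all of them. This is only a sketch: the table is deliberately incomplete, and STARTING_SIGNATURES and detect_format are names invented for the example.

```ruby
# Known starting signatures per format, as hex strings. Illustrative and
# incomplete: real formats may have more variants than listed here.
STARTING_SIGNATURES = {
  pdf:  ["25504446"],              # "%PDF"
  jpeg: ["ffd8ffe0", "ffd8ffe1"]   # JFIF and Exif variants
}.freeze

# Hypothetical helper: return the format whose signature matches the first
# four bytes of `data`, or nil if none matches.
def detect_format(data)
  hex = data.byteslice(0, 4).to_s.unpack1("H*")
  STARTING_SIGNATURES.each do |format, signatures|
    return format if signatures.include?(hex)
  end
  nil
end

p detect_format("%PDF-1.5\n")                 # prints :pdf
p detect_format("\xFF\xD8\xFF\xE1rest".b)     # prints :jpeg
p detect_format("plain text")                 # prints nil
```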

In an application where I used this technique, there were quite a few validation errors where a picture taken with a phone camera was flagged as an invalid JPG file. The phone added a bunch of EOF characters after the standard trailing signature, so my logic of fetching the last 4 bytes and checking them for the JPEG trailing signature ÿÙ failed. Depending on how much complexity you are willing to take on, these platform-specific nuances should be kept in mind.
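A more forgiving check could look for the trailing signature anywhere in a small window at the end of the file rather than at the exact last bytes. The sketch below assumes a 64-byte window, which is an arbitrary choice, and trailer_within? is a hypothetical helper. The trade-off is that the marker could also occur by chance inside the window, so this check is looser than an exact match.

```ruby
# Hypothetical helper: check whether `marker` occurs anywhere in the last
# `window` bytes of `data`, instead of requiring it at the very end.
def trailer_within?(data, marker, window: 64)
  start = [data.bytesize - window, 0].max
  tail = data.byteslice(start, window)
  tail.b.include?(marker.b)
end

jpeg_eoi = "\xFF\xD9".b   # JPEG end-of-image marker (ÿÙ in Latin-1)

strict = "image data".b + jpeg_eoi
padded = "image data".b + jpeg_eoi + ("\x00".b * 10)   # made-up trailing padding

p strict.end_with?(jpeg_eoi)           # prints true
p padded.end_with?(jpeg_eoi)           # prints false
p trailer_within?(padded, jpeg_eoi)    # prints true
```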

Validating plain text files

This technique fails when checking whether a particular file is a .txt file. Even if the file has been checked against all the available signatures and confirmed not to be one of the known types, it’s hard to confirm that it is indeed a simple plain text file. To counter this, there are other ways to determine the file type when a plain text file is passed in. I won’t be explaining them here, but I will introduce a utility available on *nix platforms whose documentation gives an insight into how it checks plain text files.

file(1)

The file utility in *nix systems also uses magic number validations for determining various characteristics of a particular file. The usage for that would be something like:

file -s small_empty_file.pdf
# small_empty_file.pdf: PDF document, version 1.5

file --mime small_empty_file.pdf
# small_empty_file.pdf: application/pdf; charset=binary
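From Ruby, one way to call the utility is through Open3 from the standard library. This is a sketch with minimal error handling: it assumes a file binary is on the PATH, mime_type_of is a hypothetical wrapper name, and a temp text file is used as the input.

```ruby
require "open3"
require "tempfile"

# Hypothetical wrapper: ask file(1) for the MIME type of `path`.
# Returns nil if the `file` binary is missing or exits with an error.
def mime_type_of(path)
  # --brief drops the filename prefix; --mime-type prints only the type.
  # Passing arguments separately avoids going through a shell.
  output, status = Open3.capture2("file", "--brief", "--mime-type", path)
  status.success? ? output.strip : nil
rescue Errno::ENOENT
  nil
end

note = Tempfile.new(["note", ".txt"])
note.write("hello, world\n")
note.flush

puts mime_type_of(note.path).inspect
```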

More information about this utility can be found by running man 1 file or at the Wikipedia article on this tool. An online version of the man page can be found here for those who don’t have access to a *nix machine but still want to know how it works.

This is what the man page of file has to say about figuring out text files:

If a file does not match any of the entries in the magic file, it is examined to see if it seems to be a text file. ASCII, ISO-8859-x, non-ISO 8-bit extended-ASCII character sets (such as those used on Macintosh and IBM PC systems), UTF-8-encoded Unicode, UTF-16-encoded Unicode, and EBCDIC character sets can be distinguished by the different ranges and sequences of bytes that constitute printable text in each set. If a file passes any of these tests, its character set is reported. ASCII, ISO-8859-x, UTF-8, and extended-ASCII files are identified as “text” because they will be mostly readable on nearly any terminal; UTF-16 and EBCDIC are only “character data” because, while they contain text, it is text that will require translation before it can be read.
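A much cruder version of this idea can be sketched in Ruby. It is not a reimplementation of what file does: it only treats data as text when it contains no NUL bytes and is valid UTF-8 (which includes plain ASCII), so UTF-16 and EBCDIC text would be rejected. probably_text? is a hypothetical name.

```ruby
# Rough heuristic, NOT what file(1) implements: data counts as text if it
# has no NUL bytes and is valid UTF-8 (plain ASCII included).
def probably_text?(data)
  bytes = data.b                             # binary copy; input is untouched
  return false if bytes.include?("\x00".b)   # NUL bytes suggest binary data
  bytes.force_encoding(Encoding::UTF_8).valid_encoding?
end

p probably_text?("hello, world\n")                      # prints true
p probably_text?("héllo")                               # prints true
p probably_text?("%PDF-1.5\n%\xE2\xE3\xCF\xD3\n".b)     # prints false
p probably_text?("a\x00b")                              # prints false
```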
