Traditionally, file validations in Ruby/Rails are done by checking the
file extension. Although this is convenient, it presents a security risk
in some cases. For example, a binary file that has a
Many binary file formats have a token at a predetermined location in the file that identifies what kind of file it is. These tokens are called “Magic Numbers”. For an exhaustive list, check this article by Gary Kessler.
For example, I’ve created an empty PDF file available here. If you were to open this file using a text editor like Vim or Emacs, these would be the contents of the first line:
This is a blank file whose page measures 0.013889 inches (1/72 of an inch) on each side. That length is a unit of measurement known as a point.
Also notice the XML-like structure in that file. Although the file is blank, it still has a white background color and other metadata, like the dates of creation and modification, embedded inside it. So even the simplest file is guaranteed to have a non-zero size. The exception may be an empty plain text file in a simple encoding, which carries no header information and therefore has a length of zero bytes. Some encodings, like UTF-16LE, can start with a header that specifies the encoding, also known as a Byte Order Mark (BOM).
For a PDF, the magic number, a.k.a. file signature, is 25 50 44 46 in
hexadecimal notation, which translates to the string representation %PDF.
Many formats also specify a trailing signature. However, some of the software and libraries that create these files are less disciplined and fail to add a proper trailing signature. Hence, care has to be taken when relying on trailing signatures for validation.
Usage in Ruby
To check these signatures in Ruby, using what we know about the magic
numbers of various binary file formats, two not-so-frequently used
methods come in handy. They are IO#seek and IO#readpartial.
The IO#seek method moves the cursor to the byte offset passed in as an
argument. One needs to specify where to seek from by using one of the
constants IO::SEEK_CUR, IO::SEEK_END, or IO::SEEK_SET. If we have to
read the last 9 bytes of a file without loading the entire file into
memory, we can use the seek method like so:
We pass in a negative number because we need to seek that many bytes back from end of the file.
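A minimal sketch of that call, wrapped in a hypothetical helper named read_tail (the helper name and its parameters are assumptions for illustration, not part of Ruby's API):

```ruby
# Hypothetical helper: returns the last `count` bytes of the file at `path`
# without reading the rest of the file into memory.
def read_tail(path, count)
  File.open(path, "rb") do |file|
    # A negative offset combined with IO::SEEK_END moves the cursor
    # `count` bytes back from the end of the file.
    file.seek(-count, IO::SEEK_END)
    file.read(count)
  end
end

# Example call (the path is a placeholder):
#   read_tail("sample.pdf", 9)
```

Note that seeking back further than the file is long raises Errno::EINVAL, so a real validator would likely check the file size first.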
The IO#readpartial method reads at most the number of bytes specified as
an argument and returns the data as a string. For the sample file
mentioned above, we can use it to get the first 4 bytes of the file,
which is all we need to check that the file is a PDF, like so:
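As a rough sketch, reading the leading bytes could look like this. The helper name read_head is hypothetical and only serves the illustration:

```ruby
# Hypothetical helper: returns up to `count` bytes from the start of the file.
def read_head(path, count = 4)
  # Binary mode ("rb") avoids any newline or encoding translation.
  File.open(path, "rb") { |file| file.readpartial(count) }
end

# Example call (the path is a placeholder):
#   read_head("sample.pdf") == "%PDF"
```

readpartial raises EOFError when called at the end of a file, so an empty file needs to be handled separately.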
Using these two methods, we could prepare a naive PDF file validator like so:
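A minimal sketch of such a validator might look like the following. The module name NaivePdfValidator, the 9-byte tail window, and the tolerance for trailing line endings after %%EOF are assumptions made for illustration:

```ruby
# Hypothetical module; a naive check, not a full PDF parser.
module NaivePdfValidator
  HEADER      = "%PDF".b   # 25 50 44 46 in hexadecimal
  TRAILER     = "%%EOF".b  # end-of-file marker, often followed by line endings
  TAIL_WINDOW = 9          # assumed window, enough for "\r\n%%EOF\r\n"

  def self.valid?(path)
    File.open(path, "rb") do |file|
      # Too short to hold both signatures.
      return false if file.size < HEADER.bytesize + TRAILER.bytesize

      # Leading signature: the first 4 bytes.
      return false unless file.readpartial(HEADER.bytesize) == HEADER

      # Trailing signature: look only at the last few bytes.
      window = [TAIL_WINDOW, file.size].min
      file.seek(-window, IO::SEEK_END)
      tail = file.read(window)

      # Ignore trailing CR/LF bytes before comparing.
      tail.sub(/[\r\n]+\z/, "").end_with?(TRAILER)
    end
  end
end
```

Both checks touch only a handful of bytes, so the cost stays the same regardless of file size.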
The same technique can be used for any binary format that has signatures. Some formats have multiple possible signatures. This is typically seen where the format spec keeps evolving and newer variants are introduced under the same umbrella, usually in a backwards compatible way. Cases in point:
- PDF actually has multiple possible trailing signatures. We’ve used only one possibility, thus ensuring the code lives up to its “naive” tag.
- JPEG signatures differ depending on the software or device that created the file. JPEG files with embedded EXIF information have a slightly different starting signature from those without it. EXIF information is metadata generated by cameras and other instruments, such as shutter speed, aperture size, and focal length. So a JPEG file created by a camera will have a different starting signature from one created by a software program like Gimp.
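One way to cope with multiple signatures is to keep a list of accepted leading signatures per format. The sketch below assumes a hypothetical SIGNATURES table and detect_format helper, and lists only the JFIF (FF D8 FF E0) and EXIF (FF D8 FF E1) variants of JPEG, so it is far from exhaustive:

```ruby
# Hypothetical lookup table: each format maps to its accepted leading signatures.
SIGNATURES = {
  pdf:  ["%PDF".b],
  jpeg: [
    "\xFF\xD8\xFF\xE0".b, # JFIF variant
    "\xFF\xD8\xFF\xE1".b, # EXIF variant
  ],
}.freeze

# Hypothetical helper: returns the first format whose signature matches,
# or nil when nothing in the table matches.
def detect_format(path)
  # File#read returns nil for an empty file, hence the fallback.
  head = File.open(path, "rb") { |file| file.read(4) } || "".b

  SIGNATURES.each do |name, headers|
    return name if headers.any? { |header| head.start_with?(header) }
  end
  nil
end
```

Adding support for another variant then means adding one more entry to the table rather than another branch in the code.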
In an application where I used this technique, there were quite a few
validation errors: pictures taken with a phone camera were being flagged
as invalid JPG files because the phone added a bunch of EOF characters
after the standard trailing signature, so my logic of fetching and
checking the last 4 bytes to validate a JPEG file failed. Depending on
how thorough you want the validation to be, platform-specific nuances
like these should be kept in mind.
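One possible way to tolerate such padding is to look for the trailing signature anywhere within a small window at the end of the file instead of at an exact offset. This is only a sketch: the helper name jpeg_trailer_near_end? and the 64-byte window are assumptions, and FF D9 is the standard JPEG end-of-image marker.

```ruby
JPEG_EOI = "\xFF\xD9".b # JPEG end-of-image marker

# Hypothetical helper: true when the end-of-image marker appears somewhere
# in the last `window` bytes, even if padding bytes follow it.
def jpeg_trailer_near_end?(path, window = 64)
  File.open(path, "rb") do |file|
    size = file.size
    return false if size < JPEG_EOI.bytesize

    # Never seek back further than the file is long.
    span = [window, size].min
    file.seek(-span, IO::SEEK_END)
    file.read(span).include?(JPEG_EOI)
  end
end
```

This check is looser than an exact match, so it trades some strictness for fewer false rejections.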
Validating plain text files
This technique fails when checking whether a particular file is a plain
text file. Even if the file has been checked against all the available
signatures and confirmed not to be one of the known types, it’s hard to
confirm that it is indeed a simple plain text file. There are other ways
to determine the file type when a plain text file is passed in. I won’t
be explaining them here, but I will introduce a utility available on
*nix platforms that can provide some insight into how plain text files
can be checked.
The file utility on *nix systems also uses magic number checks to
determine various characteristics of a particular file. The usage
for that would be something like:
More information about this utility can be found by running
man 1 file
or at the Wikipedia article on this tool. An online version of the
man page can be found here for those who don’t have access to a
*nix machine but still want to know how it works.
This is what the documentation of
file has to say about figuring out plain text files:
If a file does not match any of the entries in the magic file, it is examined to see if it seems to be a text file. ASCII, ISO-8859-x, non-ISO 8-bit extended-ASCII character sets (such as those used on Macintosh and IBM PC systems), UTF-8-encoded Unicode, UTF-16-encoded Unicode, and EBCDIC character sets can be distinguished by the different ranges and sequences of bytes that constitute printable text in each set. If a file passes any of these tests, its character set is reported. ASCII, ISO-8859-x, UTF-8, and extended-ASCII files are identified as “text” because they will be mostly readable on nearly any terminal; UTF-16 and EBCDIC are only “character data” because, while they contain text, it is text that will require translation before it can be read.
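To give a feel for the idea, here is a heavily simplified sketch in Ruby. It is not how the file utility is implemented: the helper name probably_plain_text?, the 4096-byte sample size, and the two checks (no NUL bytes, valid UTF-8) are assumptions chosen for brevity, and UTF-16 or EBCDIC text would be rejected by it.

```ruby
# Hypothetical heuristic: guesses whether a file holds ASCII or UTF-8 text
# by inspecting a sample from the start of the file.
def probably_plain_text?(path, sample_size = 4096)
  # File#read returns nil for an empty file, hence the fallback.
  sample = File.open(path, "rb") { |file| file.read(sample_size) } || "".b

  # An empty file has no bytes that contradict "plain text".
  return true if sample.empty?

  # NUL bytes are a strong hint of binary data (or of UTF-16 text,
  # which this sketch deliberately does not handle).
  return false if sample.include?("\x00".b)

  # ASCII is a subset of UTF-8, so one validity check covers both.
  # Caveat: a sample cut in the middle of a multibyte character
  # will be reported as invalid.
  sample.dup.force_encoding(Encoding::UTF_8).valid_encoding?
end
```

A check like this would run only after all known binary signatures have failed to match.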