<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="4.3.3">Jekyll</generator><link href="https://kgrz.io/feed.xml" rel="self" type="application/atom+xml" /><link href="https://kgrz.io/" rel="alternate" type="text/html" /><updated>2024-02-11T10:45:01+05:30</updated><id>https://kgrz.io/feed.xml</id><title type="html">Kashyap’s blog</title><subtitle>Kashyap&apos;s Blog</subtitle><author><name>Kashyap Kondamudi</name></author><entry><title type="html">Go can only read 1GiB per Read call</title><link href="https://kgrz.io/go-file-read-max-size-buffer.html" rel="alternate" type="text/html" title="Go can only read 1GiB per Read call" /><published>2024-02-07T00:00:00+05:30</published><updated>2024-02-07T00:00:00+05:30</updated><id>https://kgrz.io/go-file-read-max-size-buffer</id><content type="html" xml:base="https://kgrz.io/go-file-read-max-size-buffer.html"><![CDATA[<p>UPDATE: I don’t mean to say that this is a bad choice, or that it’s a bug, or even that it has performance implications. It’s just a choice that seemed a bit opaque without all the history spelunking I did here, and it’s interesting to see the reasoning behind it.</p>

<p>There’s a 1GiB limit for a single <code class="language-plaintext highlighter-rouge">Read</code> call on an <code class="language-plaintext highlighter-rouge">os.File</code> entity (object? struct?) in Go, even though the native <code class="language-plaintext highlighter-rouge">read</code> syscall can fill a buffer of nearly 2GiB (as tested on my ARM macOS and Intel Linux machines). I ran into this while looking at a pprof profile of a sample word-count program I was writing, which showed the program spending way too much time in the <code class="language-plaintext highlighter-rouge">syscall</code> module. In this context, that can only mean one thing: way too many <code class="language-plaintext highlighter-rouge">read</code> syscalls were being made. Something like this shows the behaviour:</p>

<div class="language-go highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">f</span><span class="p">,</span> <span class="n">err</span> <span class="o">:=</span> <span class="n">os</span><span class="o">.</span><span class="n">Open</span><span class="p">(</span><span class="s">"superlargefile.txt"</span><span class="p">)</span>
<span class="k">if</span> <span class="n">err</span> <span class="o">!=</span> <span class="no">nil</span> <span class="p">{</span>
    <span class="n">log</span><span class="o">.</span><span class="n">Fatal</span><span class="p">(</span><span class="s">"error opening input file: "</span><span class="p">,</span> <span class="n">err</span><span class="p">)</span>
<span class="p">}</span>
<span class="k">defer</span> <span class="n">f</span><span class="o">.</span><span class="n">Close</span><span class="p">()</span>

<span class="n">buf</span> <span class="o">:=</span> <span class="nb">make</span><span class="p">([]</span><span class="kt">byte</span><span class="p">,</span> <span class="m">1024</span><span class="o">*</span><span class="m">1024</span><span class="o">*</span><span class="m">1024</span><span class="o">*</span><span class="m">2</span><span class="p">)</span> <span class="c">// 2GiB buffer</span>
<span class="n">fmt</span><span class="o">.</span><span class="n">Println</span><span class="p">(</span><span class="s">"buffer size"</span><span class="p">,</span> <span class="nb">len</span><span class="p">(</span><span class="n">buf</span><span class="p">))</span>

<span class="k">for</span> <span class="n">iter</span> <span class="o">:=</span> <span class="m">1</span><span class="p">;</span> <span class="p">;</span> <span class="n">iter</span> <span class="o">+=</span> <span class="m">1</span> <span class="p">{</span>
    <span class="n">n</span><span class="p">,</span> <span class="n">err</span> <span class="o">:=</span> <span class="n">f</span><span class="o">.</span><span class="n">Read</span><span class="p">(</span><span class="n">buf</span><span class="p">)</span>

    <span class="k">if</span> <span class="n">err</span> <span class="o">!=</span> <span class="no">nil</span> <span class="p">{</span>
        <span class="k">if</span> <span class="n">err</span> <span class="o">==</span> <span class="n">io</span><span class="o">.</span><span class="n">EOF</span> <span class="p">{</span>
            <span class="n">fmt</span><span class="o">.</span><span class="n">Println</span><span class="p">(</span><span class="s">"done"</span><span class="p">)</span>
            <span class="k">break</span>
        <span class="p">}</span>

        <span class="n">log</span><span class="o">.</span><span class="n">Fatal</span><span class="p">(</span><span class="s">"error reading input file: "</span><span class="p">,</span> <span class="n">err</span><span class="p">)</span>
    <span class="p">}</span>

    <span class="n">fmt</span><span class="o">.</span><span class="n">Println</span><span class="p">(</span><span class="s">"bytes read: "</span><span class="p">,</span> <span class="n">n</span><span class="p">)</span>
    <span class="n">fmt</span><span class="o">.</span><span class="n">Println</span><span class="p">(</span><span class="s">"iter: "</span><span class="p">,</span> <span class="n">iter</span><span class="p">)</span>
<span class="p">}</span>
</code></pre></div></div>

<p>That, on a 2.5GiB file, would output something like:</p>

<pre><code class="language-txt">buffer size 2147483648
bytes read:  1073741824
iter:  1
bytes read:  1073741824
iter:  2
bytes read:  490442752
iter:  3
done
</code></pre>

<p>Even though the initialised buffer size is 2GiB, only 1GiB is read into the buffer per iteration. Digging into the source code, it looks like this is a deliberate choice. The main changelogs in the history point to the following:</p>

<ol>
  <li><a href="https://codereview.appspot.com/89900044">https://codereview.appspot.com/89900044</a> as a fix for <a href="https://github.com/golang/go/issues/7812">golang/go#7812</a>. This fixed failing reads for files of 2GiB or larger on macOS and FreeBSD by capping each <code class="language-plaintext highlighter-rouge">read</code> syscall at 2GiB-1 bytes. For the rest of the operating systems there was no cap at this point.</li>
  <li><a href="https://codereview.appspot.com/94070044">https://codereview.appspot.com/94070044</a> as a follow-up to 1, where the limit was decreased to 1GiB without any OS checks, with the explanation that it would at least allow aligned reads from disk, as opposed to an odd size that might miss page caches (my understanding).</li>
</ol>

<p>Note that a lot has changed since that changeset, and the current file reference for that <code class="language-plaintext highlighter-rouge">_unix.go</code> file in the changeset is <a href="https://github.com/golang/go/blob/release-branch.go1.22/src/internal/poll/fd_unix.go#L132-L137">src/internal/poll/fd_unix.go</a>.</p>
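
<p>Since the cap applies per <code class="language-plaintext highlighter-rouge">Read</code> call rather than per file, nothing actually breaks; a caller that wants the whole buffer filled in one go can loop, or lean on <code class="language-plaintext highlighter-rouge">io.ReadFull</code>, which does the looping internally. A minimal sketch (the file name is the same hypothetical one as above; this is my rephrasing of the earlier example, not code from the Go source):</p>

<div class="language-go highlighter-rouge"><div class="highlight"><pre class="highlight"><code>package main

import (
	"fmt"
	"io"
	"log"
	"os"
)

func main() {
	f, err := os.Open("superlargefile.txt")
	if err != nil {
		log.Fatal("error opening input file: ", err)
	}
	defer f.Close()

	buf := make([]byte, 1024*1024*1024*2) // 2GiB buffer
	// io.ReadFull keeps calling f.Read until the buffer is full (or the
	// file ends early), so the 1GiB-per-call cap never surfaces here.
	n, err := io.ReadFull(f, buf)
	if err != nil && err != io.ErrUnexpectedEOF {
		log.Fatal("error reading input file: ", err)
	}
	fmt.Println("bytes read:", n)
}
</code></pre></div></div>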

<h3 id="aside-system-limits">Aside: System limits</h3>

<p>As per the Linux <a href="https://www.man7.org/linux/man-pages/man2/read.2.html#NOTES"><code class="language-plaintext highlighter-rouge">read</code> syscall documentation</a>, the maximum number of bytes that can be transferred in a single call is 2,147,479,552 (0x7ffff000, just under 2GiB). I tested this out with rudimentary scripts in Rust and C. The Rust program is taken verbatim from the example for <a href="https://doc.rust-lang.org/std/io/trait.Read.html#method.read_to_end"><code class="language-plaintext highlighter-rouge">read_to_end()</code></a>. Running that under <code class="language-plaintext highlighter-rouge">strace</code> produces the following (truncated) output:</p>
<div class="language-text highlighter-rouge"><div class="highlight"><pre class="highlight"><code>read(3, ..., 6594816000) = 2147479552
read(3, ..., 4447336448) = 2147479552
read(3, ..., 2299856896) = 2147479552
read(3, ..., 152377344) = 152377344
read(3, "", 32)         = 0
</code></pre></div></div>

<p>A similar, simple C program that calls the <code class="language-plaintext highlighter-rouge">read</code> syscall in a loop until the file is consumed produces similar output:</p>

<div class="language-text highlighter-rouge"><div class="highlight"><pre class="highlight"><code>SSIZE_MAX: 9223372036854775807 # outputting the limits.h constant
bytes read: 2147479552
bytes read: 2147479552
bytes read: 2147479552
bytes read: 152377344

</code></pre></div></div>
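
<p>The same behaviour can be observed from Go by bypassing <code class="language-plaintext highlighter-rouge">os.File</code> and calling the raw syscall directly. This is my own sketch (hypothetical file name again), not from the post’s original program; on Linux each call tops out at the kernel’s 2,147,479,552-byte cap rather than Go’s 1GiB:</p>

<div class="language-go highlighter-rouge"><div class="highlight"><pre class="highlight"><code>package main

import (
	"fmt"
	"log"
	"os"
	"syscall"
)

func main() {
	f, err := os.Open("superlargefile.txt")
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	buf := make([]byte, 1024*1024*1024*3) // 3GiB buffer
	for {
		// syscall.Read goes straight to the kernel, so only the
		// OS-level transfer limit applies, not internal/poll's 1GiB cap.
		n, err := syscall.Read(int(f.Fd()), buf)
		if err != nil {
			log.Fatal(err)
		}
		if n == 0 { // EOF
			break
		}
		fmt.Println("bytes read:", n)
	}
}
</code></pre></div></div>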

<p>Although that’s neither here nor there, it’s still interesting that Go first picked 2GiB-1, which explains that odd buffer size, and then settled on 1GiB.</p>]]></content><author><name>Kashyap Kondamudi</name></author><category term="go," /><category term="til" /><summary type="html"><![CDATA[There's a 1GiB limit for a single `Read` call for an `os.File` entity (object? struct?) in Go, and this seems to be a deliberate choice.]]></summary></entry><entry><title type="html">classnames library composes well!</title><link href="https://kgrz.io/composing-classnames.html" rel="alternate" type="text/html" title="classnames library composes well!" /><published>2023-05-02T00:00:00+05:30</published><updated>2023-05-02T00:00:00+05:30</updated><id>https://kgrz.io/composing-classnames</id><content type="html" xml:base="https://kgrz.io/composing-classnames.html"><![CDATA[<p>This is an unpublished draft from 6 years ago. Unpublished until now, that is.</p>

<p>The <a href="https://www.npmjs.com/package/classnames">classnames</a> library is a <em>very</em> handy tool for applying CSS classes conditionally in JavaScript components. Since the output of the function
is just a string, it composes well with conditionals layered across various
parts of the code.</p>

<p>For example, consider the following:</p>

<div class="language-javascript highlighter-rouge"><div class="highlight"><pre class="highlight"><code>
<span class="k">import</span> <span class="nx">cx</span> <span class="k">from</span> <span class="dl">'</span><span class="s1">classnames</span><span class="dl">'</span><span class="p">;</span>

<span class="k">switch</span> <span class="p">(</span><span class="nx">type</span><span class="p">)</span> <span class="p">{</span>
  <span class="k">case</span> <span class="dl">'</span><span class="s1">textarea</span><span class="dl">'</span><span class="p">:</span>
    <span class="kd">const</span> <span class="nx">textareaClassNames</span> <span class="o">=</span> <span class="nx">cx</span><span class="p">(</span><span class="dl">'</span><span class="s1">text-area</span><span class="dl">'</span><span class="p">,</span> <span class="dl">'</span><span class="s1">text-input</span><span class="dl">'</span><span class="p">,</span> <span class="p">{</span> <span class="na">invalid</span><span class="p">:</span> <span class="o">!</span><span class="k">this</span><span class="p">.</span><span class="nx">state</span><span class="p">.</span><span class="nx">valid</span> <span class="p">});</span>
    <span class="k">return</span> <span class="o">&lt;</span><span class="nx">textarea</span> <span class="nx">className</span><span class="o">=</span><span class="p">{</span><span class="nx">textareaClassNames</span><span class="p">}</span> <span class="sr">/</span><span class="err">&gt;
</span>  <span class="k">default</span><span class="p">:</span>
    <span class="kd">const</span> <span class="nx">inputClassNames</span> <span class="o">=</span> <span class="nx">cx</span><span class="p">(</span><span class="dl">'</span><span class="s1">text-input</span><span class="dl">'</span><span class="p">,</span> <span class="p">{</span> <span class="na">invalid</span><span class="p">:</span> <span class="o">!</span><span class="k">this</span><span class="p">.</span><span class="nx">state</span><span class="p">.</span><span class="nx">valid</span> <span class="p">});</span>
    <span class="k">return</span> <span class="o">&lt;</span><span class="nx">input</span> <span class="nx">type</span><span class="o">=</span><span class="p">{</span><span class="nx">type</span><span class="p">}</span> <span class="nx">className</span><span class="o">=</span><span class="p">{</span><span class="nx">inputClassNames</span><span class="p">}</span> <span class="sr">/</span><span class="err">&gt;
</span><span class="p">}</span>

</code></pre></div></div>

<p>The class names are the same except for one extra item in the case of
the textarea-type input field. Until today, I would’ve written something
like the above, since I never bothered to look at the actual output
of the call. A quick glance at the library’s source code made it
evident that it composes with the output of another classnames call
(which is just a string). So the code can be
simplified to:</p>

<div class="language-javascript highlighter-rouge"><div class="highlight"><pre class="highlight"><code>
<span class="k">import</span> <span class="nx">cx</span> <span class="k">from</span> <span class="dl">'</span><span class="s1">classnames</span><span class="dl">'</span><span class="p">;</span>
<span class="kd">const</span> <span class="nx">className</span> <span class="o">=</span> <span class="nx">cx</span><span class="p">(</span><span class="dl">'</span><span class="s1">text-input</span><span class="dl">'</span><span class="p">,</span> <span class="p">{</span> <span class="na">invalid</span><span class="p">:</span> <span class="o">!</span><span class="k">this</span><span class="p">.</span><span class="nx">state</span><span class="p">.</span><span class="nx">valid</span> <span class="p">});</span>

<span class="k">switch</span> <span class="p">(</span><span class="nx">type</span><span class="p">)</span> <span class="p">{</span>
  <span class="k">case</span> <span class="dl">'</span><span class="s1">textarea</span><span class="dl">'</span><span class="p">:</span>
    <span class="k">return</span> <span class="o">&lt;</span><span class="nx">textarea</span> <span class="nx">className</span><span class="o">=</span><span class="p">{</span><span class="nx">cx</span><span class="p">(</span><span class="nx">className</span><span class="p">,</span> <span class="dl">'</span><span class="s1">text-area</span><span class="dl">'</span><span class="p">)}</span> <span class="sr">/</span><span class="err">&gt;
</span>  <span class="k">default</span><span class="p">:</span>
    <span class="k">return</span> <span class="o">&lt;</span><span class="nx">input</span> <span class="nx">type</span><span class="o">=</span><span class="p">{</span><span class="nx">type</span><span class="p">}</span> <span class="nx">className</span><span class="o">=</span><span class="p">{</span><span class="nx">className</span><span class="p">}</span> <span class="sr">/</span><span class="err">&gt;
</span><span class="p">}</span>

</code></pre></div></div>

<p>Much better.</p>]]></content><author><name>Kashyap Kondamudi</name></author><category term="React" /><summary type="html"><![CDATA[I love classnames library!]]></summary></entry><entry><title type="html">Node has native CLI argument parsing</title><link href="https://kgrz.io/node-has-native-arg-parsing.html" rel="alternate" type="text/html" title="Node has native CLI argument parsing" /><published>2023-02-09T00:00:00+05:30</published><updated>2023-02-09T00:00:00+05:30</updated><id>https://kgrz.io/node-has-native-arg-parsing</id><content type="html" xml:base="https://kgrz.io/node-has-native-arg-parsing.html"><![CDATA[<p>I knew this was in the works, but wasn’t aware this was shipped with v16!
(released in 2022). I was playing with TypeScript code transforms and wanted to
update the source file after the transformation based on a flag. The script was
basically standalone, so I didn’t want to depend on any external dependencies
like <code class="language-plaintext highlighter-rouge">argparse</code>. The API I was aiming at was basically:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>node enum-to-const-object.mjs source.ts [...]
node enum-to-const-object.mjs -w source.ts [...]
node enum-to-const-object.mjs --write source.ts [...]
</code></pre></div></div>

<p>The first invocation would print out the result to standard out, whereas the
latter two would update the source file in-place, exactly how
<a href="https://prettier.io/"><code class="language-plaintext highlighter-rouge">prettier</code></a> works. The standard library API is pretty neat for such
a simple interface:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>// file: enum-to-const-object.mjs
// Note the mjs extension, which is why I'm able to use import. Otherwise,
// you'll have to use require in place of the following line
import { parseArgs } from 'node:util';

const options = {
	write: {
		type: 'boolean',
		short: 'w',
		default: false
	}
};

const { values, positionals } = parseArgs({ options, allowPositionals: true });
// values is of the shape { write: &lt;flag value&gt; }
// positionals: [ source.ts, ... ]
</code></pre></div></div>

<p>The <code class="language-plaintext highlighter-rouge">options</code> object is the parser’s flag configuration. Its keys are the expected long-hand flags, and the <code class="language-plaintext highlighter-rouge">short</code> property on each long-hand flag adds a single-character alias.</p>

<p>In addition to the flag formats, there is one more option I had to configure: <code class="language-plaintext highlighter-rouge">allowPositionals</code>. This returns the rest of the arguments that are not flags, which in my case are the files I wanted to transform. When <code class="language-plaintext highlighter-rouge">parseArgs</code> is called with this configuration (and, by default, on <code class="language-plaintext highlighter-rouge">process.argv</code>), it returns the flag values as key-value pairs along with the remaining arguments: <code class="language-plaintext highlighter-rouge">values</code> contains the flag values, and <code class="language-plaintext highlighter-rouge">positionals</code> contains the file list.</p>
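
<p>One way to see the parsed shapes directly is to pass an explicit <code class="language-plaintext highlighter-rouge">args</code> array instead of relying on <code class="language-plaintext highlighter-rouge">process.argv</code> (the file names below are hypothetical):</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import { parseArgs } from 'node:util';

const options = {
  write: { type: 'boolean', short: 'w', default: false },
};

// Passing `args` explicitly instead of letting parseArgs read process.argv
const { values, positionals } = parseArgs({
  args: ['--write', 'source.ts', 'other.ts'],
  options,
  allowPositionals: true,
});

console.log(values.write);  // true
console.log(positionals);   // [ 'source.ts', 'other.ts' ]
</code></pre></div></div>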

<p><a href="https://nodejs.org/docs/latest-v16.x/api/util.html#utilparseargsconfig">Docs Link</a></p>]]></content><author><name>Kashyap Kondamudi</name></author><category term="nodejs" /><summary type="html"><![CDATA[I knew this was in the works, but wasn't aware this was shipped with v16! (released in 2022). This is so useful for small cli scripts.]]></summary></entry><entry><title type="html">Using CSP in report-only and enforcement mode</title><link href="https://kgrz.io/multiple-csp.html" rel="alternate" type="text/html" title="Using CSP in report-only and enforcement mode" /><published>2023-01-17T00:00:00+05:30</published><updated>2023-01-17T00:00:00+05:30</updated><id>https://kgrz.io/multiple-csp</id><content type="html" xml:base="https://kgrz.io/multiple-csp.html"><![CDATA[<p>I recently came across this strategy which uses the standardised <a href="https://developer.mozilla.org/en-US/docs/Web/HTTP/CSP">Content
Security Policy</a> for both enforcement and script monitoring on a web
page for security. We use CSP for enforcement already, but I was under the
assumption that report-only mode and enforcement mode are exclusive. That is,
if the <code class="language-plaintext highlighter-rouge">Content-Security-Policy</code> header was used with a few rules, I thought
the <code class="language-plaintext highlighter-rouge">Content-Security-Policy-Report-Only</code> header couldn’t be used; or perhaps it wouldn’t
work if we sent both. But, in hindsight, this was wrong. For example, let’s say
a page returns the following headers, and there’s an <code class="language-plaintext highlighter-rouge">&lt;img&gt;</code> tag on this page
trying to load images from <code class="language-plaintext highlighter-rouge">example.com</code>.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Content-Security-Policy: default-src 'self' images.kgrz.io; report-uri /report-block
Content-Security-Policy-Report-Only: default-src 'self'; report-uri /report-only
</code></pre></div></div>
<p>This CSP setting ensures that only resources from the page’s own origin and
<code class="language-plaintext highlighter-rouge">images.kgrz.io</code> are successfully
loaded onto the page. I had always had the implicit understanding that the
image load would be blocked, and a report sent out to the <code class="language-plaintext highlighter-rouge">/report-block</code> path. But
that’s not the whole story: the image load is blocked, and two reports are sent out: one to <code class="language-plaintext highlighter-rouge">/report-block</code>,
and one to <code class="language-plaintext highlighter-rouge">/report-only</code>.</p>
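
<p>To try this out locally, here’s a minimal, hypothetical sketch of a Node server (only the standard <code class="language-plaintext highlighter-rouge">node:http</code> module, no dependencies) that sends both headers from the example above on every response:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import http from 'node:http';

// The first header enforces (and reports to /report-block); the second
// only reports (to /report-only). Both are sent on every response.
const server = http.createServer((req, res) => {
  res.setHeader(
    'Content-Security-Policy',
    "default-src 'self' images.kgrz.io; report-uri /report-block"
  );
  res.setHeader(
    'Content-Security-Policy-Report-Only',
    "default-src 'self'; report-uri /report-only"
  );
  res.end('page with an external image goes here');
});

server.listen(8080);
</code></pre></div></div>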

<p>The sample application that demonstrates this example is hosted at
<a href="/apps/csp">/apps/csp</a>. You may need a modern-ish browser to use it,
since I’m using no build pipeline for the JS that’s used on the page. (Anything
that supports <code class="language-plaintext highlighter-rouge">&lt;script type="module"&gt;</code>.)</p>]]></content><author><name>Kashyap Kondamudi</name></author><category term="web" /><summary type="html"><![CDATA[I wanted to test out and understand how multiple CSPs on a single page work, and this post is about that. Not only can you use multiple CSPs, but it can be used to kind of pseudo-monitor everything that happens on the page.]]></summary></entry><entry><title type="html">TIL: Vim’s search is backed by a register!</title><link href="https://kgrz.io/vim-register-search.html" rel="alternate" type="text/html" title="TIL: Vim’s search is backed by a register!" /><published>2023-01-01T00:00:00+05:30</published><updated>2023-01-01T00:00:00+05:30</updated><id>https://kgrz.io/vim-register-search</id><content type="html" xml:base="https://kgrz.io/vim-register-search.html"><![CDATA[<p>When you search for a pattern in Vim, it’s stored in the <code class="language-plaintext highlighter-rouge">/</code> register. This can then be used to store the query in a variable for some Vim command, or to paste the pattern as text. For example, the following invocation in normal mode:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>"/p
</code></pre></div></div>

<p>pastes the search pattern into the current buffer. This same register value is used for repeat searches (<code class="language-plaintext highlighter-rouge">n</code> in normal mode). I’m not sure when I’m ever going to use this, but at least I’ll start to understand some shortcut on vim.fandom.com or SO that has an odd <code class="language-plaintext highlighter-rouge">/</code> in the command.</p>]]></content><author><name>Kashyap Kondamudi</name></author><category term="vim," /><category term="til" /><summary type="html"><![CDATA[When you search for a pattern in Vim, it’s stored to the / register. This then can be used to store the query in some variable for some Vim command, or perhaps paste the pattern as text. For example, the following invocation in normal mode:]]></summary></entry><entry><title type="html">The state in Ansible’s docker container module</title><link href="https://kgrz.io/ansible-docker-container-state.html" rel="alternate" type="text/html" title="The state in Ansible’s docker container module" /><published>2022-08-24T00:00:00+05:30</published><updated>2022-08-24T00:00:00+05:30</updated><id>https://kgrz.io/ansible-docker-container-state</id><content type="html" xml:base="https://kgrz.io/ansible-docker-container-state.html"><![CDATA[<p>I spent roughly an hour today on a stupid misunderstanding I had with the documentation for the <a href="https://docs.ansible.com/ansible/2.5/modules/docker_container_module.html">docker_container</a> module. The module has a <code class="language-plaintext highlighter-rouge">state</code> option that turns a few knobs. The two options I got confused between are <code class="language-plaintext highlighter-rouge">present</code> and <code class="language-plaintext highlighter-rouge">started</code>. In hindsight, why I used <code class="language-plaintext highlighter-rouge">present</code> when I meant “I want to start the container” is an obvious problem. 
But I did, and spent time trying to debug what the heck Ansible’s module was doing differently than <code class="language-plaintext highlighter-rouge">docker run</code>. The playbook I was writing had a bunch of <a href="https://docs.ansible.com/ansible/latest/user_guide/playbooks_delegation.html#delegating-tasks">local_action</a>s to build an nginx-based image, spawn a container off of it, and trigger a few tests against it. Obviously, for the tests to succeed, the container had to be up, but since I was using <code class="language-plaintext highlighter-rouge">state: present</code>, it looked as if the container booted up but got shut down immediately. The <code class="language-plaintext highlighter-rouge">docker inspect</code> output in this case won’t have any hint as to <em>why</em> the container is not actually running. Turns out that <code class="language-plaintext highlighter-rouge">state: present</code> doesn’t actually run the container, just creates it.</p>

<p>So, here’s a small reminder for the future-me that if the container has to be in <code class="language-plaintext highlighter-rouge">running</code> state, the <code class="language-plaintext highlighter-rouge">state</code> should be set to <code class="language-plaintext highlighter-rouge">started</code>. <code class="language-plaintext highlighter-rouge">present</code> only ensures the container exists in the container list, not that it’s running. That is, ignoring all the other interactions that this parameter has with other options, <code class="language-plaintext highlighter-rouge">state: present</code> is effectively <a href="https://docs.docker.com/engine/reference/commandline/create/"><code class="language-plaintext highlighter-rouge">docker create</code></a>, whereas <code class="language-plaintext highlighter-rouge">state: started</code> is analogous to <code class="language-plaintext highlighter-rouge">docker run</code>.</p>
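
<p>As a sketch, the difference looks like this in a playbook (the container and image names here are hypothetical, and all other options are left out):</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>- name: Ensure the container exists, but don't start it (docker create)
  docker_container:
    name: web-test          # hypothetical container name
    image: my-nginx:test    # hypothetical image
    state: present

- name: Ensure the container exists AND is running (docker run)
  docker_container:
    name: web-test
    image: my-nginx:test
    state: started
</code></pre></div></div>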

<h4 id="aside-why-use-ansible-to-run-docker">Aside: Why use ansible to run docker?</h4>

<p>This whole setup might seem convoluted, but there’s a reason why I chose this. Before I had an M1-based MacBook, I was using Vagrant for local testing of the Ansible pipeline. Every time I make code changes to the server configuration, I run the entire Ansible playbook end to end to set up the local VM provisioned by Vagrant, followed by basic <code class="language-plaintext highlighter-rouge">curl</code>-based tests that check for status codes and cache headers against the server. Vagrant uses VirtualBox for running the virtual machines, which is not supported on M1 chips. The process of converting this exact setup to use Docker instead has not been that straightforward in my experience so far. Running <code class="language-plaintext highlighter-rouge">docker build</code> and <code class="language-plaintext highlighter-rouge">docker run</code> directly won’t work in all cases, since I use an Ansible template that interpolates variables into the nginx configuration files, which have to be rendered before I actually build the Docker image.
So the short-term alternative I’ve been trying out is another playbook that runs every Ansible task locally, letting Ansible take care of building the image, running the server container, and running the tests.</p>

<blockquote>
  <p>The ubiquitous IEEE floating-point standard defines two numbers to represent zero, the positive and the negative zeros. You also have the positive and negative infinity. If you compute the inverse of the positive zero, you get the positive infinity. If you compute the inverse of the negative zero, you get the negative infinity.</p>
</blockquote>

<p>I wanted to check this out for the most frequent languages I tend to use—Ruby and JavaScript.</p>

<p>First up, Ruby:</p>

<div class="language-ruby highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">minus_zero</span> <span class="o">=</span> <span class="o">-</span><span class="mf">0.0</span>
<span class="n">plus_zero</span> <span class="o">=</span> <span class="mf">0.0</span>

<span class="n">converted</span> <span class="o">=</span> <span class="s2">"-0.0"</span><span class="p">.</span><span class="nf">to_f</span>

<span class="nb">puts</span> <span class="mf">1.0</span> <span class="o">/</span> <span class="n">minus_zero</span>
<span class="nb">puts</span> <span class="mf">1.0</span> <span class="o">/</span> <span class="n">plus_zero</span>
<span class="nb">puts</span> <span class="mf">1.0</span> <span class="o">/</span> <span class="n">converted</span>
</code></pre></div></div>

<p>Output: -Infinity, Infinity, -Infinity</p>

<p>Next, JavaScript:</p>

<div class="language-javascript highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kd">const</span> <span class="nx">minus_zero</span> <span class="o">=</span> <span class="o">-</span><span class="mf">0.0</span>
<span class="kd">const</span> <span class="nx">plus_zero</span> <span class="o">=</span> <span class="mf">0.0</span>

<span class="kd">const</span> <span class="nx">converted</span> <span class="o">=</span> <span class="nb">parseFloat</span><span class="p">(</span><span class="dl">"</span><span class="s2">-0.0</span><span class="dl">"</span><span class="p">,</span> <span class="mi">10</span><span class="p">)</span>

<span class="nx">console</span><span class="p">.</span><span class="nx">log</span><span class="p">(</span><span class="mf">1.0</span> <span class="o">/</span> <span class="nx">minus_zero</span><span class="p">)</span>
<span class="nx">console</span><span class="p">.</span><span class="nx">log</span><span class="p">(</span><span class="mf">1.0</span> <span class="o">/</span> <span class="nx">plus_zero</span><span class="p">)</span>
<span class="nx">console</span><span class="p">.</span><span class="nx">log</span><span class="p">(</span><span class="mf">1.0</span> <span class="o">/</span> <span class="nx">converted</span><span class="p">)</span>
</code></pre></div></div>

<p>Output: -Infinity, Infinity, -Infinity</p>

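<p>As an aside (this part is my addition, not from Lemire’s post): besides dividing, JavaScript can also tell the two zeros apart directly with <code class="language-plaintext highlighter-rouge">Object.is</code>, since strict equality treats them as equal:</p>

<div class="language-javascript highlighter-rouge"><div class="highlight"><pre class="highlight"><code>// Strict equality cannot distinguish the two zeros...
console.log(-0 === 0)                       // true
// ...but Object.is can:
console.log(Object.is(-0, 0))               // false
console.log(Object.is(1 / -0, -Infinity))   // true
</code></pre></div></div>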
<p>Both languages (Ruby v. 2.7.1, JavaScript (Node.js) v. 12.14.x &amp; Chrome 91.x) return the values as expected in the post.</p>]]></content><author><name>Kashyap Kondamudi</name></author><category term="general" /><summary type="html"><![CDATA[From Daniel Lemire’s recent-ish blog post on this topic:]]></summary></entry><entry><title type="html">Safari custom user agent CSS overrides using webfonts</title><link href="https://kgrz.io/safari-custom-user-agent-css-overrides-using-webfonts.html" rel="alternate" type="text/html" title="Safari custom user agent CSS overrides using webfonts" /><published>2020-11-18T00:00:00+05:30</published><updated>2020-11-18T00:00:00+05:30</updated><id>https://kgrz.io/safari-custom-user-agent-css-overrides-using-webfonts</id><content type="html" xml:base="https://kgrz.io/safari-custom-user-agent-css-overrides-using-webfonts.html"><![CDATA[<p>I like to use better monospace fonts as default fonts in browsers. In Chrome and Firefox, this is pretty straightforward—go to Preferences, and you’ll see a menu to change the default fonts. This is a bit harder on Safari; its Intelligent Tracking Protection disables loading all local fonts by default.</p>

<p>A very simple way to circumvent this restriction is by using a webfont. The simplest option is to use an <code class="language-plaintext highlighter-rouge">@import</code> statement and load a webfont from, say, Google Fonts or some other service (your own web server too!).</p>

<p>Example stylesheet:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>@import url('https://fonts.googleapis.com/css2?family=IBM+Plex+Mono:ital@0;1&amp;display=swap');

pre, code {
	font-family: 'IBM Plex Mono', monospace;
}
</code></pre></div></div>

<p>Then go to Preferences &gt; Advanced &gt; Style sheet, and load this file from disk. This trick, plus a little bit of default zoom, makes the IETF spec docs much better to read.</p>

<p><img class="image" loading="lazy" src="/public/images/preferences.png" alt="Preferences pane in Safari to upload the default stylesheet override" /></p>

<p>Sample screenshot of one of the IETF spec docs I’m reading:</p>

<p><img class="image" loading="lazy" src="/public/images/ietf.png" alt="Cookies: HTTP State Management Mechanism                     draft-ietf-httpbis-rfc6265bis-05 sample screenshot" /></p>

<p>This is not perfect, but it goes a long way for the basic use case without needing to install extra plugins like Stylebot or Cascadea.</p>]]></content><author><name>Kashyap Kondamudi</name></author><category term="misc" /><summary type="html"><![CDATA[Safari doesn't allow loading local user-installed fonts, so you can't add a custom stylesheet with that font. Easy way to circumvent this is to use a webfont.]]></summary></entry><entry><title type="html">Flattening and Filtering JSON for Cleaner Types in Go</title><link href="https://kgrz.io/go-json-flatten-filter-cleaner-types.html" rel="alternate" type="text/html" title="Flattening and Filtering JSON for Cleaner Types in Go" /><published>2020-07-30T00:00:00+05:30</published><updated>2020-07-30T00:00:00+05:30</updated><id>https://kgrz.io/go-json-flatten-filter-cleaner-types</id><content type="html" xml:base="https://kgrz.io/go-json-flatten-filter-cleaner-types.html"><![CDATA[<p>Before I grokked the <code class="language-plaintext highlighter-rouge">Unmarshaler</code> interface, it was hard to know how to parse a complex <span class="smallcaps">JSON</span> string into a type in one-shot, with or without preprocessing. There are many good <a href="https://blog.golang.org/json">blog</a> <a href="https://blog.gopheracademy.com/advent-2016/advanced-encoding-decoding/">posts</a> on techniques to parse <span class="smallcaps">JSON</span> in Go, but I had to learn this by experimentation to finally wrap my head around it.</p>

<p>I’ll use an example from GitHub’s <code class="language-plaintext highlighter-rouge">/commits</code> REST API, using PR: <a href="https://github.com/ruby/ruby/pull/3365">ruby/ruby#3365</a>. I’ve <a href="https://github.com/kgrz/json-parsing-post/blob/master/commits.json">saved the response</a> in the <a href="https://github.com/kgrz/json-parsing-post">repo</a> where I’ve added the full implementation of the example used in this post. The commits response from the GitHub REST API is <em>very</em> verbose depending on the PR size, and is nested more than one level deep. In the hypothetical application that I’m writing, I need a list of “objects” that have the following information:</p>

<div class="language-go highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">type</span> <span class="n">MetaData</span> <span class="k">struct</span> <span class="p">{</span>
	<span class="n">Author</span> <span class="kt">string</span>
	<span class="n">Committer</span> <span class="kt">string</span>
	<span class="n">SHA</span> <span class="kt">string</span>
	<span class="n">Message</span> <span class="kt">string</span>
<span class="p">}</span>
</code></pre></div></div>

<p>That is, I want to parse <a href="https://github.com/kgrz/json-parsing-post/blob/master/commits.json">this response</a> into a <code class="language-plaintext highlighter-rouge">[]MetaData</code> slice. I <strong>do not</strong> want my main “business logic” to traverse structs shaped like the raw response, as that makes it hard to follow the important bits. I don’t want to use <code class="language-plaintext highlighter-rouge">interface{}</code> as a placeholder. A better trade-off, in my opinion and use case, is to do as much as possible during the parse phase to massage the data into the structure you want<sup id="fnref:1" role="doc-noteref"><a href="#fn:1" class="footnote" rel="footnote">1</a></sup>. I’m positive that this is a common use case. I ended up learning one way to do this cleanly almost by accident. First, the components involved:</p>

<h4 id="use-anonymous-structs">Use anonymous structs</h4>

<p>Anonymous structs can be used to avoid defining a concrete type and skip giving it a name for one-off use-cases. It’s heavily used in parsing and marshalling code paths, and testing. In our case, this technique can be used to define a “dirty” struct inside the <code class="language-plaintext highlighter-rouge">UnmarshalJSON</code> function on the fly, and use that for parsing the <span class="smallcaps">JSON</span>.</p>

<h4 id="implementing-unmarshaler-interface">Implementing Unmarshaler interface</h4>

<p>Any type that has a <code class="language-plaintext highlighter-rouge">UnmarshalJSON</code> function on it implements the <code class="language-plaintext highlighter-rouge">Unmarshaler</code> interface. This type then can be used as the target for parsing a <span class="smallcaps">JSON</span> sub tree or the entire <span class="smallcaps">JSON</span> itself!</p>

<h3 id="implementation">Implementation</h3>

<p>First step is to mock out the main function:</p>

<div class="language-go highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">func</span> <span class="n">main</span><span class="p">()</span> <span class="p">{</span>
	<span class="c">// This variable contains the raw json bytes that resulted from the</span>
	<span class="c">// API call. I'm not adding the code for the actual network fetch</span>
	<span class="c">// for now, but in the example repository, I read the commits</span>
	<span class="c">// response from a file</span>
	<span class="k">var</span> <span class="n">jsonb</span> <span class="p">[]</span><span class="kt">byte</span>
	<span class="n">jsonb</span> <span class="o">=</span> <span class="n">JSONFromSomewhere</span><span class="p">()</span>

	<span class="k">var</span> <span class="n">metadatas</span> <span class="p">[]</span><span class="n">MetaData</span>
	<span class="k">if</span> <span class="n">err</span> <span class="o">:=</span> <span class="n">json</span><span class="o">.</span><span class="n">Unmarshal</span><span class="p">(</span><span class="n">jsonb</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">metadatas</span><span class="p">);</span> <span class="n">err</span> <span class="o">!=</span> <span class="no">nil</span> <span class="p">{</span>
		<span class="n">log</span><span class="o">.</span><span class="n">Fatalln</span><span class="p">(</span><span class="s">"error parsing JSON"</span><span class="p">,</span> <span class="n">err</span><span class="p">)</span>
	<span class="p">}</span>

	<span class="n">fmt</span><span class="o">.</span><span class="n">Println</span><span class="p">(</span><span class="n">metadatas</span><span class="p">)</span>
<span class="p">}</span>
</code></pre></div></div>

<p>The <span class="smallcaps">JSON</span> response of the <code class="language-plaintext highlighter-rouge">/commits</code> endpoint is a list of <code class="language-plaintext highlighter-rouge">commit</code> objects, and I’m using a slice of <code class="language-plaintext highlighter-rouge">MetaData</code> values to match that shape. For each commit item from the <span class="smallcaps">JSON</span> array, the raw bytes get passed as the argument to the <code class="language-plaintext highlighter-rouge">UnmarshalJSON</code> function on <code class="language-plaintext highlighter-rouge">MetaData</code>.</p>

<p>Next step is to implement the <code class="language-plaintext highlighter-rouge">UnmarshalJSON</code> function using an anonymous struct to parse out the raw commit object <span class="smallcaps">JSON</span> string into it:</p>

<div class="language-go highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">func</span> <span class="p">(</span><span class="n">m</span> <span class="o">*</span><span class="n">MetaData</span><span class="p">)</span> <span class="n">UnmarshalJSON</span><span class="p">(</span><span class="n">buf</span> <span class="p">[]</span><span class="kt">byte</span><span class="p">)</span> <span class="kt">error</span> <span class="p">{</span>
	<span class="k">var</span> <span class="n">commit</span> <span class="k">struct</span> <span class="p">{</span>
		<span class="n">SHA</span>    <span class="kt">string</span> <span class="s">`json:"sha"`</span>
		<span class="n">Commit</span> <span class="k">struct</span> <span class="p">{</span>
			<span class="n">Author</span> <span class="k">struct</span> <span class="p">{</span>
				<span class="n">Name</span> <span class="kt">string</span> <span class="s">`json:"name"`</span>
			<span class="p">}</span> <span class="s">`json:"author"`</span>
			<span class="n">Committer</span> <span class="k">struct</span> <span class="p">{</span>
				<span class="n">Name</span> <span class="kt">string</span> <span class="s">`json:"name"`</span>
			<span class="p">}</span> <span class="s">`json:"committer"`</span>
			<span class="n">Message</span> <span class="kt">string</span> <span class="s">`json:"message"`</span>
		<span class="p">}</span> <span class="s">`json:"commit"`</span>
	<span class="p">}</span>

	<span class="k">if</span> <span class="n">err</span> <span class="o">:=</span> <span class="n">json</span><span class="o">.</span><span class="n">Unmarshal</span><span class="p">(</span><span class="n">buf</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">commit</span><span class="p">);</span> <span class="n">err</span> <span class="o">!=</span> <span class="no">nil</span> <span class="p">{</span>
		<span class="k">return</span> <span class="n">errors</span><span class="o">.</span><span class="n">Wrap</span><span class="p">(</span><span class="n">err</span><span class="p">,</span> <span class="s">"parsing into MetaData failed"</span><span class="p">)</span>
	<span class="p">}</span>

	<span class="c">// continued</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Final step is to process the <code class="language-plaintext highlighter-rouge">commit</code> struct, and set the appropriate fields on <code class="language-plaintext highlighter-rouge">MetaData</code> struct:</p>

<div class="language-go highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">func</span> <span class="p">(</span><span class="n">m</span> <span class="o">*</span><span class="n">MetaData</span><span class="p">)</span> <span class="n">UnmarshalJSON</span><span class="p">(</span><span class="n">buf</span> <span class="p">[]</span><span class="kt">byte</span><span class="p">)</span> <span class="kt">error</span> <span class="p">{</span>
	<span class="c">// same as above</span>

	<span class="n">m</span><span class="o">.</span><span class="n">Author</span> <span class="o">=</span> <span class="n">commit</span><span class="o">.</span><span class="n">Commit</span><span class="o">.</span><span class="n">Author</span><span class="o">.</span><span class="n">Name</span>
	<span class="n">m</span><span class="o">.</span><span class="n">Committer</span> <span class="o">=</span> <span class="n">commit</span><span class="o">.</span><span class="n">Commit</span><span class="o">.</span><span class="n">Committer</span><span class="o">.</span><span class="n">Name</span>
	<span class="n">m</span><span class="o">.</span><span class="n">SHA</span> <span class="o">=</span> <span class="n">commit</span><span class="o">.</span><span class="n">SHA</span>
	<span class="n">m</span><span class="o">.</span><span class="n">Message</span> <span class="o">=</span> <span class="n">commit</span><span class="o">.</span><span class="n">Commit</span><span class="o">.</span><span class="n">Message</span>

	<span class="k">return</span> <span class="no">nil</span>
<span class="p">}</span>
</code></pre></div></div>

<p>That’s it! An additional advantage of narrow types like these is that they’re easier to test.</p>

<hr />

<h3 id="bonus-filtering-the-slice-further">Bonus: Filtering the slice further</h3>

<p>For bonus points, I want to skip certain <code class="language-plaintext highlighter-rouge">[]MetaData</code> elements based on a condition. A way to do this, keeping the same principles as above in mind, is to define a type that covers <code class="language-plaintext highlighter-rouge">[]MetaData</code>, which implements the <code class="language-plaintext highlighter-rouge">Unmarshaler</code> interface:</p>

<div class="language-go highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">type</span> <span class="n">MetaDatas</span> <span class="p">[]</span><span class="n">MetaData</span>

<span class="k">func</span> <span class="p">(</span><span class="n">ms</span> <span class="o">*</span><span class="n">MetaDatas</span><span class="p">)</span> <span class="n">UnmarshalJSON</span><span class="p">(</span><span class="n">buf</span> <span class="p">[]</span><span class="kt">byte</span><span class="p">)</span> <span class="kt">error</span> <span class="p">{</span>
	<span class="c">// []MetaData is not the same as MetaDatas, and this difference is</span>
	<span class="c">// important!</span>
	<span class="k">var</span> <span class="n">metadatas</span> <span class="p">[]</span><span class="n">MetaData</span>

	<span class="k">if</span> <span class="n">err</span> <span class="o">:=</span> <span class="n">json</span><span class="o">.</span><span class="n">Unmarshal</span><span class="p">(</span><span class="n">buf</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">metadatas</span><span class="p">);</span> <span class="n">err</span> <span class="o">!=</span> <span class="no">nil</span> <span class="p">{</span>
		<span class="n">log</span><span class="o">.</span><span class="n">Fatalln</span><span class="p">(</span><span class="s">"error parsing JSON"</span><span class="p">,</span> <span class="n">err</span><span class="p">)</span>
	<span class="p">}</span>

	<span class="c">// filtering without allocations</span>
	<span class="c">// https://github.com/golang/go/wiki/SliceTricks#filtering-without-allocating</span>
	<span class="n">cleanedms</span> <span class="o">:=</span> <span class="n">metadatas</span><span class="p">[</span><span class="o">:</span><span class="m">0</span><span class="p">]</span>
	<span class="k">for</span> <span class="n">_</span><span class="p">,</span> <span class="n">metadata</span> <span class="o">:=</span> <span class="k">range</span> <span class="n">metadatas</span> <span class="p">{</span>
		<span class="k">if</span> <span class="o">!</span><span class="n">strings</span><span class="o">.</span><span class="n">HasPrefix</span><span class="p">(</span><span class="n">metadata</span><span class="o">.</span><span class="n">Message</span><span class="p">,</span> <span class="s">"WIP"</span><span class="p">)</span> <span class="p">{</span>
			<span class="n">cleanedms</span> <span class="o">=</span> <span class="nb">append</span><span class="p">(</span><span class="n">cleanedms</span><span class="p">,</span> <span class="n">metadata</span><span class="p">)</span>
		<span class="p">}</span>
	<span class="p">}</span>
	<span class="o">*</span><span class="n">ms</span> <span class="o">=</span> <span class="n">cleanedms</span>

	<span class="k">return</span> <span class="no">nil</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Like before, I’m using a temporary type that matches the shape of our main type, and parsing into that. Then I clean out the slice based on a condition—I want to skip all the commits whose messages start with <code class="language-plaintext highlighter-rouge">WIP</code>. Note that the <code class="language-plaintext highlighter-rouge">metadatas</code> variable inside the <code class="language-plaintext highlighter-rouge">UnmarshalJSON</code> function is declared as <code class="language-plaintext highlighter-rouge">[]MetaData</code> and not as <code class="language-plaintext highlighter-rouge">MetaDatas</code>, since the latter would make <code class="language-plaintext highlighter-rouge">UnmarshalJSON</code> call itself recursively, resulting in a parse loop. By design, <code class="language-plaintext highlighter-rouge">var metadatas MetaDatas</code> and <code class="language-plaintext highlighter-rouge">var metadatas []MetaData</code> are not the same type.</p>

<p>Finally, the filtered slice gets assigned to the underlying object that the <span class="smallcaps">JSON</span> is getting parsed into.</p>

<hr />

<h4 id="a-note-about-performance">A note about performance</h4>

<p>In these examples, the parse flow will create the entire <code class="language-plaintext highlighter-rouge">[]MetaData</code> slice, even though we filter out many of the elements. To my knowledge, this is a necessary hit to take; I’m not aware of a way to pre-process the incoming bytes to avoid the allocations in the first place. My thought process here is that if we didn’t filter or clean up the <span class="smallcaps">JSON</span> data, all the objects would be allocated anyway, so filtering may not change the allocation count much—but that’s just my opinion at this point.</p>

<hr />

<div class="footnotes" role="doc-endnotes">
  <ol>
    <li id="fn:1" role="doc-endnote">

      <p>This may not apply everywhere. There are valid cases where parsing
should be as fast and light as possible. <a href="#fnref:1" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
  </ol>
</div>]]></content><author><name>Kashyap Kondamudi</name></author><category term="go" /><summary type="html"><![CDATA[Before I grokked the `Unmarshaler` interface, it was hard to know how to parse a complex JSON string into a type in one-shot, with or without preprocessing. I go through an example to demonstrate one technique.]]></summary></entry><entry><title type="html">PostgreSQL backup notes</title><link href="https://kgrz.io/postgresql-backup-notes.html" rel="alternate" type="text/html" title="PostgreSQL backup notes" /><published>2020-03-16T00:00:00+05:30</published><updated>2020-03-16T00:00:00+05:30</updated><id>https://kgrz.io/postgresql-backup-notes</id><content type="html" xml:base="https://kgrz.io/postgresql-backup-notes.html"><![CDATA[<p>Some of my very rough notes when going nearly cover-to-cover of the
PostgreSQL <a href="https://www.postgresql.org/docs/current/backup.html">backup and
restore</a>
documentation section. It’s one of the most detailed pieces of
documentation I’ve ever read, so this acts as a very high-level
summary. Most of the content is useful for general context surrounding
data backups of any kind.</p>

<p>I’m using the <code class="language-plaintext highlighter-rouge">/data</code> directory to signify the data storage directory;
the actual location depends on the chosen configuration.</p>

<ul id="markdown-toc">
  <li><a href="#backups" id="markdown-toc-backups">Backups</a>    <ul>
      <li><a href="#sql-dumpload" id="markdown-toc-sql-dumpload">SQL Dump/Load</a></li>
      <li><a href="#copy-data-directory" id="markdown-toc-copy-data-directory">Copy <code class="language-plaintext highlighter-rouge">/data</code> directory</a>        <ul>
          <li><a href="#frozen-snapshots" id="markdown-toc-frozen-snapshots">frozen snapshots</a></li>
          <li><a href="#rsync-tar-et-al" id="markdown-toc-rsync-tar-et-al"><code class="language-plaintext highlighter-rouge">rsync</code>, <code class="language-plaintext highlighter-rouge">tar</code>, et al.</a></li>
        </ul>
      </li>
      <li><a href="#continuous-archiving" id="markdown-toc-continuous-archiving">Continuous Archiving</a></li>
    </ul>
  </li>
  <li><a href="#restore" id="markdown-toc-restore">Restore</a></li>
  <li><a href="#replication" id="markdown-toc-replication">Replication</a></li>
</ul>

<h2 id="backups">Backups</h2>

<p>Three main types of backup strategies:</p>

<ul>
  <li>SQL dump/load (stop-the-world)</li>
  <li>Backup <code class="language-plaintext highlighter-rouge">/data</code> directory (stop-the-world)</li>
  <li>Continuous Archiving</li>
</ul>

<p>The first two typically need lots of extra space on the database server
to store the backup before you can upload it to some off-site storage.</p>

<p>“stop-the-world” in this context is not an official nomenclature. In
these strategies, <em>most likely</em>, the database needs to be shut down at
some point.</p>

<p>Continuous archive-based backups are used for a leader–follower setup, or
even for delta backups—where there’s a base backup, and subsequent data
arrives as deltas that can be used to restore the entire database.
Shipping the archives somewhere remote is supported by PostgreSQL itself
and is quite simple, so this is the best strategy if the database
servers are space-constrained.</p>

<h3 id="sql-dumpload">SQL Dump/Load</h3>

<p>This is the easiest strategy. <code class="language-plaintext highlighter-rouge">pg_dump</code> takes a backup, while
<code class="language-plaintext highlighter-rouge">psql</code> (for plain SQL dumps) or the <code class="language-plaintext highlighter-rouge">pg_restore</code> command replays it.
This strategy is the simplest to cron-ify, without external dependencies:
take a backup, upload the files to remote storage, test the backup on a
different machine, and do this every night.</p>

<ul>
  <li>
    <p><code class="language-plaintext highlighter-rouge">pg_dump</code> saves the database as a <code class="language-plaintext highlighter-rouge">.sql</code> script. This requires
enough space to hold both the database and the backup script.</p>
  </li>
  <li>
    <p>Maximum file sizes might be limited by the kernel/OS or file system, so
that’s something to look out for when deciding to use this.</p>
  </li>
  <li>
    <p>Restoring from the <code class="language-plaintext highlighter-rouge">pg_dump</code> output might also need configuration
tweaks around connection timeouts: too low, and the database might
close the connection before the entire script has run.</p>
  </li>
</ul>
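A sketch of that nightly flow (the database name, paths, and hosts are all hypothetical; note that plain-text dumps are replayed with psql rather than pg_restore):

```shell
# Nightly backup sketch — every name here is made up.
pg_dump mydb | gzip > /backups/mydb-$(date +%F).sql.gz

# Ship it off the database server (any remote copy tool works).
scp /backups/mydb-$(date +%F).sql.gz backup-host:/backups/

# On a different machine, verify the backup actually restores:
# plain-text dumps are replayed with psql.
gunzip -c /backups/mydb-2020-03-16.sql.gz | psql mydb_restore_test
```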

<h3 id="copy-data-directory">Copy <code class="language-plaintext highlighter-rouge">/data</code> directory</h3>

<p>PostgreSQL’s directory layout is straight-forward—once you get to know
it. Most of the data is put in one directory, and this includes the two
main components needed for any future restores: the data files, and the
temporary append-only log files. If the database is shut down, you’re
free to copy over the <code class="language-plaintext highlighter-rouge">data</code> directory to another machine, and start off
from it. Configuration files typically aren’t placed in the <code class="language-plaintext highlighter-rouge">data</code>
directory, so they might need to be copied as well.</p>

<p>Strategies here rely on the file-system layout of PostgreSQL, or on
features provided by the file system itself.</p>

<p>Two routes here: frozen snapshots of the file system, or using tools
like rsync, tar etc.</p>

<h4 id="frozen-snapshots">frozen snapshots</h4>

<ul>
  <li>
    <p>If the underlying file system supports atomic volume snapshots (btrfs,
zfs, Apple’s APFS for example), one can snapshot the entire <code class="language-plaintext highlighter-rouge">data</code>
directory. There are lots of caveats around how well the snapshot
mechanism is implemented.</p>
  </li>
  <li>
    <p>The backup can be taken without stopping the server. During restore,
this strategy requires replaying the logs, since some commits might
not yet have made it from the append-only log into the data files.</p>
  </li>
</ul>

<h4 id="rsync-tar-et-al"><code class="language-plaintext highlighter-rouge">rsync</code>, <code class="language-plaintext highlighter-rouge">tar</code>, et al.</h4>

<ul>
  <li>
    <p><code class="language-plaintext highlighter-rouge">rsync</code>, <code class="language-plaintext highlighter-rouge">gzip</code>, <code class="language-plaintext highlighter-rouge">tar</code> the <code class="language-plaintext highlighter-rouge">data</code> directory. These utilities don’t
take consistent snapshots of the disk, so it’s best to shut down the
server. Shutting down the server forces a full flush of the data to
disk.</p>
  </li>
  <li>
    <p>An example two-step process with <code class="language-plaintext highlighter-rouge">rsync</code>:</p>
  </li>
</ul>

<pre>
run rsync
shutdown the server
rsync --checksum
</pre>

<p>What’s interesting is that this two-step process is similar to the one
used in the online backups section. If we stretch it enough, this is
like a delta backup process: the first <code class="language-plaintext highlighter-rouge">rsync</code> is a base backup that
contains the data committed till that point, and the second <code class="language-plaintext highlighter-rouge">rsync</code>
copies over the delta.</p>

<h3 id="continuous-archiving">Continuous Archiving</h3>

<p>This system can be used to setup a replicated system, consisting of a
leader and potentially multiple followers. The data from the leader is
pushed, and each of the followers might pull the data. Where this data
is stored is customisable. There are many ways to setup replication in
PostgreSQL, and the documentation for it is exhaustive. The archival
machinery deals with the first half: taking the backup and pushing it
somewhere.</p>

<p>This strategy piggy-backs on the fact that a WAL log may be used to
replay and restore a database. There are many caveats and configuration
tweaks to how long the WAL log files are retained, the size of those log
files and the naming of the files. It’s best to ship the log files as
and when they are created to an external storage service. Rather than
doing this manually via <code class="language-plaintext highlighter-rouge">rsync</code> et al., PostgreSQL provides a way: the
<code class="language-plaintext highlighter-rouge">archive_command</code> setting in the configuration, which takes a shell command.</p>
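For example, the local-archiving variant from the PostgreSQL documentation (the archive directory is a placeholder):

```shell
# postgresql.conf: ship each completed WAL segment off the data directory.
# %p expands to the segment's path, %f to its file name. The `test ! -f`
# guard refuses to overwrite an existing archive; a non-zero exit makes
# PostgreSQL retry the same segment later.
archive_mode = on
archive_command = 'test ! -f /mnt/server/archivedir/%f && cp %p /mnt/server/archivedir/%f'
```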

<ul>
  <li>
    <p>WAL logs should be secured in transit and in remote storage, because
they contain the actual data. (That goes for the main database too,
fwiw.)</p>
  </li>
  <li>
    <p><code class="language-plaintext highlighter-rouge">archive_command</code> should exit with code <code class="language-plaintext highlighter-rouge">0</code> on success. Otherwise, the
command gets retried, unarchived segments accumulate, and the <code class="language-plaintext highlighter-rouge">pg_wal</code>
directory <em>may</em> fill up and cause the server to crash!</p>
  </li>
  <li>
    <p><code class="language-plaintext highlighter-rouge">archive_command</code> should be designed to ensure it doesn’t overwrite
existing files on the remote system.</p>
  </li>
  <li>
    <p>Missing WAL logs from the archive might hamper future restores, so
regular base backups will help keep the error surface area a little
smaller.</p>
  </li>
  <li>
    <p>An old base backup plus too many WAL logs to replay increases the
restore time. It’s important to do the maths here to figure out how
much downtime you might need, and to tweak the base backup frequency
and WAL file size accordingly.</p>
  </li>
</ul>

<p><em>General mechanism</em>:</p>

<ul>
  <li>One base backup as a starting point</li>
  <li>Continuous deltas in the form of the append-only log (WAL) files</li>
</ul>

<p>The base backup marks the point where the backup would start
(<a href="https://www.postgresql.org/docs/current/sql-checkpoint.html">checkpoint</a>).</p>

<p><em>base backup</em></p>

<p>Two ways to take a base backup:</p>

<ul>
  <li>
    <p>Use the <code class="language-plaintext highlighter-rouge">pg_basebackup</code> command from an external machine (or the same
machine, with a different data directory setting), providing the
connection info to connect to the leader.</p>

    <ul>
      <li>
        <p>Multiple commands can be run from multiple machines, but might
depend on replication slots configuration on the leader.</p>
      </li>
      <li>
        <p>Might use one or two connections depending on the variant of backup
used: copy the WAL logs at the end (one), or stream them in parallel (two).</p>
      </li>
      <li>
        <p>Does not run if <code class="language-plaintext highlighter-rouge">/data</code> directory is not empty.</p>
      </li>
    </ul>
  </li>
  <li>
    <p>Two-step process via <code class="language-plaintext highlighter-rouge">rsync</code>. PostgreSQL provides two SQL statements
for signalling the server that the user is taking a backup, and that
a checkpoint has to be created: <code class="language-plaintext highlighter-rouge">pg_start_backup</code>, <code class="language-plaintext highlighter-rouge">pg_stop_backup</code>.</p>
  </li>
</ul>

<pre>
SELECT pg_start_backup('some_label')
rsync /data
SELECT * from pg_stop_backup();
</pre>
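The pg_basebackup route, for comparison, is a single command run from the machine that will hold the backup (the host, user, and target directory here are made up):

```shell
# -D: empty target data directory; -P: show progress;
# --wal-method=stream: stream the WAL over a second connection while the
# base backup is being copied, so no segments are missed.
pg_basebackup -h leader.example.com -U replicator -D /data -P --wal-method=stream
```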

<h2 id="restore">Restore</h2>

<p><em>stop-the-world restores</em></p>

<p>Data dumps taken with <code class="language-plaintext highlighter-rouge">pg_dump</code> or the file system strategy mentioned
above can be restored by <code class="language-plaintext highlighter-rouge">pg_restore</code> or just starting the server.
Needless to say, this strategy causes either data loss or needs
downtime, depending on the operations chosen.</p>

<ul>
  <li>
    <p>If a system has a simple <code class="language-plaintext highlighter-rouge">pg_dump</code> cron job that ships the archive to
remote storage, then when the leader crashes or dies, the required
downtime is the sum of the time to detect the crash, copy the archive
to the follower, and complete the <code class="language-plaintext highlighter-rouge">pg_restore</code>.</p>
  </li>
  <li>
    <p>If the crash happens some time after the cron job last ran, the data
written to the leader in between is potentially lost.</p>
  </li>
  <li>
    <p>When the leader crashes/dies, but you still have access to the
physical data disks, recovery using file system snapshot is possible,
and that may potentially recover all the data up till the point of the
last commit. Because this recovery would also have the WAL files
handy, the replay will make sure as much data as possible is
recovered.</p>
  </li>
</ul>

<p><em>Continuous Archive restores</em></p>

<p>If the system is set up with continuous archiving, it may be possible to
recover all the data. Restore times depend on how fast the base backup
archive and WAL logs can be copied over to the new server, and on the
WAL replay itself.</p>

<h2 id="replication">Replication</h2>

<p>There are many ways to do this, too, depending on the underlying infra:
shared disk (two machines accessing the same disk), file system
replication (a write to one drive is mirrored to a different machine
atomically), side-car middlewares that execute a given statement on
multiple machines simultaneously, or even application-level middlewares
that do this. Streaming/Point-in-time replication is one preferred
approach that can piggy back on the continuous archive backup strategy.</p>

<p>The streaming/point-in-time replication strategy uses WAL logs shipped
from the leader to a remote location via <code class="language-plaintext highlighter-rouge">archive_command</code>, which a
follower then replays continuously.</p>

<ul>
  <li>
    <p>Note that streaming replication is possible without using
<code class="language-plaintext highlighter-rouge">archive_command</code>, provided the data ingestion throughput never
exceeds the rate at which the follower can stream the logs directly
from the leader and apply them locally (this also depends on the
network latency).</p>

    <p>If the follower is not able to keep up with the logs, the logs on
leader might get recycled, and the follower will keep waiting for the
now-non-existent WAL file. Force-starting the follower in case of
failure will result in data loss.</p>
  </li>
</ul>]]></content><author><name>Kashyap Kondamudi</name></author><category term="PostgreSQL" /><summary type="html"><![CDATA[Some of my very rough notes when going nearly cover-to-cover of the PostgreSQL backup and restore documentation section. It’s one of the most detailed pieces of documentation I’ve ever read, so this acts as a very high-level summary. Most of the content is useful for general context surrounding data backups of any kind.]]></summary></entry></feed>