Deploying BOSH Lite in a Subnet-Accessible Manner

BOSH is a tool that (among other things) deploys VMs. BOSH Lite is a user-friendly means of installing BOSH using Vagrant.

A shortcoming of BOSH Lite is that the resulting BOSH VM can only be accessed from the host which is running the VM (i.e. bosh target from another workstation does not succeed). [1]

This blog post describes three techniques to make BOSH Lite accessible from workstations other than the one running the VM. This is useful when a development team needs a single, shared BOSH and finds it convenient to access it from any development workstation.

We'll describe three scenarios. If you're not sure which one is right for you, choose the first scenario (port-forwarding):

  1. enable port-forwarding to the BOSH VM
  2. configure the BOSH VM to use DHCP
  3. assign the BOSH VM a static IP address

Make sure to secure BOSH after we have made it subnet accessible; the default accounts and passwords are very weak.

We limit our discussion to deploying to VirtualBox (i.e. we won't discuss deploying BOSH Lite to Amazon AWS or VMware Fusion).

Procedure

1. Enable Port-Forwarding to the BOSH VM

The procedure follows the BOSH Lite README, stopping at the section Using the VirtualBox Provider (i.e. we have already cloned the BOSH Lite repository to ~/workspace/bosh-lite).

cd ~/workspace/bosh-lite

Edit the Vagrantfile to enable port-forwarding:

vim Vagrantfile

Add the following line (this is what the changes look like from git's viewpoint):

  override.vm.network "forwarded_port", guest: 25555, host: 25555

We now continue with the remainder of the BOSH Lite README's instructions (e.g. vagrant up).

We are now able to bosh target from other machines. For example, let us say we installed BOSH Lite on the workstation tara.nono.com, and we want to access the BOSH from the laptop hope.nono.com. On hope, we type the command bosh target tara.nono.com, and we can log in with the user admin and password admin.
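If bosh target fails, a quick connectivity check from hope can tell you whether the forwarded port is reachable at all. This is a sketch using netcat; flag syntax varies slightly among netcat implementations:

# from hope.nono.com: verify that tara forwards the BOSH director port
nc -vz tara.nono.com 25555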

Caveats

Although you can access the BOSH from other workstations via the BOSH CLI, you will not be able to access other services (e.g. you will not be able to SSH into the BOSH VM from a workstation other than the host workstation).

IPv6: For dual-stack systems, don't target the IPv6 address; target the IPv4 address. [2]

2. Configure the BOSH VM to Use DHCP

The procedure follows the BOSH Lite README, stopping at the section Using the VirtualBox Provider (i.e. we have already cloned the BOSH Lite repository to ~/workspace/bosh-lite).

cd ~/workspace/bosh-lite

Edit the Vagrantfile to enable DHCP:

vim Vagrantfile

Add the following line (this is what the changes look like from git's viewpoint):

  config.vm.network :public_network, bridge: 'en0: Ethernet 1', mac: '0200424f5348', use_dhcp_assigned_default_route: true

Note:

  • Do not choose the same MAC address we did ('0200424f5348' [3]). To choose a MAC address, set the first byte to "02" and set the remaining five bytes to random values (e.g. dd if=/dev/random bs=1 count=5 | od -t x1), as shown in the sketch below.
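Here is a one-liner that combines those steps (a sketch; it prints a locally-administered MAC address in the same 12-hex-digit format the Vagrantfile expects):

# prints something like 02a1b2c3d4e5
(printf '02'; dd if=/dev/random bs=1 count=5 2>/dev/null | od -An -t x1 | tr -d ' \n'; echo)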

We need to create a DNS hostname and assign an IP address for the BOSH VM. In this example we've assigned the hostname bosh.nono.com and the IP address 10.9.9.130.

We need to create an entry in our DHCP configuration file to reflect the MAC address and IP address; this is a snippet from our ISC DHCP server's configuration file, dhcpd.conf:

  host bosh	{ hardware ethernet 02:00:42:4f:53:48; fixed-address 10.9.9.130; }

ISC's DHCP daemon requires a restart for the new configuration to take effect.
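A typical restart looks like this (a sketch; the service name assumes a Debian/Ubuntu-style isc-dhcp-server package and may differ on your DHCP server):

sudo service isc-dhcp-server restart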

We now continue with the remainder of the BOSH Lite README's instructions (e.g. vagrant up).

We are now able to bosh target from other machines. For example, let us say we installed BOSH Lite on the workstation tara.nono.com, and we want to access the BOSH from the laptop hope.nono.com. On hope, we type the command bosh target bosh.nono.com, and we can log in with the user admin and password admin.

Caveats

If the network is hard-wired and uses VMPS, you may need to register the MAC address with IT [4].

3. Assign the BOSH VM a Static IP Address

The procedure follows the BOSH Lite README, stopping at the section Using the VirtualBox Provider (i.e. we have already cloned the BOSH Lite repository to ~/workspace/bosh-lite).

In this example, we configure the BOSH VM with the following parameters:

  • IP: 10.9.9.130
  • Subnet: 255.255.255.0

cd ~/workspace/bosh-lite

Edit the Vagrantfile to assign the static IP address:

vim Vagrantfile

Add the following line (this is what the changes look like from git's viewpoint):

  override.vm.network :public_network, bridge: 'en0: Ethernet 1', ip: '10.9.9.130', subnet: '255.255.255.0'

We now continue with the remainder of the BOSH Lite README's instructions (e.g. vagrant up).

We are now able to bosh target from other machines. For example, let us say we installed BOSH Lite on the workstation tara.nono.com, and we want to access the BOSH from the laptop hope.nono.com. On hope, we type the command bosh target 10.9.9.130, and we can log in with the user admin and password admin.

Caveats

Only machines on the local subnet are able to target the BOSH VM (in this case, machines whose IP addresses start with "10.9.9"): the BOSH VM's default route goes through the host workstation, which NATs the traffic, breaking any TCP connection to 10.9.9.130 that originates outside the local subnet. This scenario (static IP) is therefore not a good solution for geographically distributed teams, e.g. if the host machine is in San Francisco but some team members are located in Toronto.

If the network is hard-wired and uses VMPS, you may need to register the MAC address with IT [4].

Securing BOSH

After making BOSH available on the local subnet, we need to secure it (the default BOSH Lite credentials of admin/admin are not terribly secure).

1. Secure the BOSH user

We create a new account director and delete the admin account:

# we target our BOSH
#   scenario 1: target the host workstation (e.g. tara.nono.com)
#   scenarios 2 & 3: target the BOSH VM directly (e.g. bosh.nono.com)
# account is 'admin', password is 'admin'
bosh target tara.nono.com
  Target set to `Bosh Lite Director'
  Your username: admin
  Enter password:
  Logged in as `admin'
# create new account 'director' with hard-to-guess password
bosh create user director
  Enter new password: **************
  Verify new password: **************
  User `director' has been created
# log in as newly-created account 'director'
bosh login director
  Enter password:
  Logged in as `director'
# delete the admin account
bosh -n delete user admin
  User `admin' has been deleted

2. Secure the UNIX user

This step only applies to scenarios 2 & 3 (i.e. to scenarios where users on the local subnet can ssh to the BOSH VM). We need to change the password to the vcap account (default password is c1oudc0w) and to the vagrant account (default password is vagrant):

 # we ssh to the BOSH VM (e.g. bosh.nono.com) as the 'vcap' user
ssh vcap@bosh.nono.com
  Warning: Permanently added 'bosh.nono.com,10.9.9.130' (RSA) to the list of known hosts.
vcap@bosh.nono.com's password: c1oudc0w
  Welcome to Ubuntu 14.04 LTS (GNU/Linux 3.13.0-44-generic x86_64)
  ...
vcap@bosh-lite:~$ passwd
  Changing password for vcap.
  (current) UNIX password: c1oudc0w
  Enter new UNIX password:
  Retype new UNIX password:
  passwd: password updated successfully
 # we change the vagrant user's password, too
sudo passwd vagrant
  Enter new UNIX password:
  Retype new UNIX password:
  passwd: password updated successfully

3. Secure Other Services (e.g. PostgreSQL)

This step only applies to scenarios 2 & 3.

We will not discuss securing postgres or other services, but be aware that those services are running and are only loosely secured. BOSH Lite should not be deployed in a hostile environment (i.e. on the Internet).

Addendum

If during your BOSH deploy you receive Permission denied @ dir_s_mkdir - /vagrant/tmp, then this post suggests adding the following to your Vagrantfile:

config.vm.provider :virtualbox do |v, override|
  # ensure any VM user can create files in subfolders - eg, /vagrant/tmp
  config.vm.synced_folder ".", "/vagrant", mount_options: ["dmode=777"]
end

Footnotes

1 We acknowledge that there are technical workarounds: the teams working on other workstations can ssh into the workstation that is running the BOSH and execute all BOSH-related commands there. Another example is ssh port-forwarding (port 25555). We find these workarounds clunky.

2 BOSH doesn't listen on IPv6 interfaces. For example, the host tara.nono.com has both IPv4 (10.9.9.30) and IPv6 addresses, so when targeting tara.nono.com, explicitly use the IPv4 address (e.g. bosh target 10.9.9.30). If you mistakenly target the IPv6 address, you will see the message [WARNING] cannot access director, trying 4 more times...

This type of problem (dual-stack) can also be diagnosed by connecting directly to the BOSH port (note that telnet falls back to IPv4 when IPv6 fails):

telnet tara.nono.com 25555
Trying 2601:9:8480:3:23e:e1ff:fec2:e1a...
telnet: connect to address 2601:9:8480:3:23e:e1ff:fec2:e1a: Connection refused
Trying 10.9.9.30...
Connected to tara.nono.com.
Escape character is '^]'.

3 We chose a MAC address ('0200424f5348') by first setting the locally-administered bit (the second least-significant bit of the most significant byte), i.e. the leading "02".

We chose "00" as the next byte.

We chose the last four bytes by using the hex representation of the ASCII code for "BOSH" (i.e. echo -n BOSH | od -t x1, i.e. '42:4f:53:48')
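For the curious, the command and its output look roughly like this (column spacing varies by platform):

echo -n BOSH | od -t x1
    0000000    42  4f  53  48
    0000004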

4 When VMPS has been deployed, all MAC addresses must be registered with the local IT organization to avoid knocking the host workstation off the network. For example, Pivotal in San Francisco uses VMPS, and when the ethernet switch detects an unknown MAC address on one of its ports, it will switch that port's VLAN to the guest network's VLAN; this also affects the workstation that is hosting the BOSH VM. In practical terms, the workstation's ethernet connection will ping-pong (switching approximately every 30 seconds from one VLAN to the other), effectively rendering both the BOSH VM and the host workstation unusable.

How to Create a BOSH Release of a DNS Server

BOSH is a tool that (among other things) deploys VMs. In this blog post we cover the procedure to create a BOSH release for a DNS server, customizing our release with a manifest, and then deploying the customized release to a VirtualBox VM.

BOSH is frequently used to deploy applications, but rarely to deploy infrastructure services (e.g. NTP, DHCP, LDAP). When our local IT staff asked us about using BOSH to deploy services, we felt it would be both instructive and helpful to map out the procedure using DNS as an example.

Note: if you're interested in using BOSH to deploy a BIND 9 server (i.e. you are not interested in learning how to create a BOSH release), you should not follow these steps. Instead, you should follow the instructions on our BOSH DNS Server Release repository's page.

We acknowledge that creating a BOSH release is a non-trivial task and there are tools available to make it simpler, tools such as bosh-gen. Although we haven't used bosh-gen, we have nothing but the highest respect for its author, and we encourage you to explore it.

0. Install BOSH Lite

BOSH runs in a special VM. We will install a BOSH VM in VirtualBox using BOSH Lite, an easy-to-use tool to install BOSH under VirtualBox.

We follow the BOSH Lite installation instructions. We follow the instructions up to and including execution of the bosh login command.

The instructions in the remainder of this blog post will not work unless you are logged in to BOSH.

1. Initialize a skeletal BOSH Release

A BOSH release is a package, analogous to Microsoft Windows's .msi, Apple OS X's .app, RedHat's .rpm.

We will create a BOSH release of ISC's BIND 9 [1] .

BIND versus named

BIND is a collection of software that includes, among other things, the named DNS daemon.

You may find it convenient to think of BIND and named as synonyms [2].

Initialize Release

We follow these instructions. We name our release bind-9 because BIND 9 [3] is the DNS server daemon for which we are creating a release.

cd ~/workspace
bosh init release --git bind-9
cd bind-9

The --git parameter above is helpful if you're using git to version your release, and including it won't break your release if you're not using git.

Since we are using git, we populate the following 3 files:

Our release is available on GitHub.

Create Job Skeletons

Our release consists of one job, named:

bosh generate job named

Let's create the control script:

vim jobs/named/templates/ctl.sh

It should look like this (patterned after the Ubuntu 13.04 bind9 control script):

#!/bin/bash

# named logs to syslog, daemon facility
# BOSH captures these in /var/log/daemon.log
RUN_DIR=/var/vcap/sys/run/named
# PIDFILE is created by named, not by this script
PIDFILE=${RUN_DIR}/named.pid

case $1 in

  start)
    mkdir -p $RUN_DIR
    chown -R vcap:vcap $RUN_DIR
    exec /var/vcap/packages/bind-9-9.10.2/sbin/named -u vcap -c /var/vcap/jobs/named/etc/named.conf
    ;;

  stop)

    PID=$(cat $PIDFILE)
    if [ -n "$PID" ]; then
      SIGNAL=TERM
      N=1
      while kill -$SIGNAL $PID 2>/dev/null; do
        if [ $N -eq 1 ]; then
          echo "waiting for pid $PID to die"
        fi
        if [ $N -eq 11 ]; then
          echo "giving up on pid $PID with kill -TERM; trying -KILL"
          SIGNAL=KILL
        fi
        if [ $N -gt 20 ]; then
          echo "giving up on pid $PID"
          break
        fi
        N=$(($N+1))
        sleep 1
      done
    fi

    rm -f $PIDFILE

    ;;

  *)
    echo "Usage: ctl {start|stop}" ;;

esac

Edit the monit configuration

vim jobs/named/monit

It should look like this:

check process named
  with pidfile /var/vcap/sys/run/named/named.pid
  start program "/var/vcap/jobs/named/bin/ctl start"
  stop program "/var/vcap/jobs/named/bin/ctl stop"
  group vcap

Create Package Skeletons

0. Create Package bind-9-9.10.2

Create the bind package; we include the version number (9.10.2).

bosh generate package bind-9-9.10.2

Create the spec file. We need to create an empty placeholder file to avoid this error:

`resolve_globs': undefined method `each' for nil:NilClass (NoMethodError)

BOSH expects us to place source files within the BOSH package; however, we are deviating from that model: we don't place the source files within our release; instead, we configure our package to download the source from the ISC. But we need at least one source file to placate BOSH, hence the placeholder file.

vim packages/bind-9-9.10.2/spec

It should look like this:

---
name: bind-9-9.10.2

dependencies:
- libjson

files:
- bind/placeholder

Then we create the placeholder:

mkdir src/bind
touch src/bind/placeholder

Create the bind-9 packaging script:

vim packages/bind-9-9.10.2/packaging

It should look like this:

# abort script on any command that exits with a non zero value
set -e

if [ -f /etc/redhat-release ]; then
  # install libjson0 for CentOS stemcells
  yum install -y json-c
elif [ -f /etc/lsb-release ]; then
  # install libjson0 for Ubuntu stemcells (not tested)
  apt-get install libjson0
fi

curl -OL ftp://ftp.isc.org/isc/bind9/9.10.2/bind-9.10.2.tar.gz
tar xvzf bind-9.10.2.tar.gz
cd bind-9.10.2
./configure \
  --prefix=${BOSH_INSTALL_TARGET} \
  --sysconfdir=/var/vcap/jobs/named/etc \
  --localstatedir=/var/vcap/sys
make
make install

1. Create Package libjson-7.6.1

Configure a blobstore

We skip this section because we're not using the blobstore—we're downloading the source and building from it.

Create Job Properties

We edit jobs/named/templates/named.conf.erb. This will be used to create named's configuration file, named.conf. Note that we don't populate this template; instead, we tell BOSH to populate this template from the config_file section of the properties section of the deployment manifest:

<%= p('config_file') %>

We edit the spec file jobs/named/spec. We point out that the properties → config_file from the deployment manifest is used to create the contents of named.conf:

---
name: named
templates:
  ctl.sh: bin/ctl
  named.conf.erb: etc/named.conf

packages:
- bind-9-9.10.2

properties:
  config_file:
    description: 'The contents of named.conf'

Create a Dev Release

bosh create release --force
    [WARNING] Missing blobstore configuration, please update config/final.yml before making a final release
    Syncing blobs...
    Please enter development release name: bind-9
    ...
    Release name: bind-9
    Release version: 0+dev.1
    Release manifest: /Users/cunnie/workspace/bind-9/dev_releases/bind-9/bind-9-0+dev.1.yml

Upload the Dev Release

bosh upload release dev_releases/bind-9/bind-9-0+dev.1.yml

Create the sample Deployment Manifest

We create an examples subdirectory:

mkdir examples

We create examples/bind-9-bosh-lite.yml. Much of this is boilerplate for a BOSH Lite deployment. Note that we hard-code our VM's IP address to 10.244.0.66:

---
name: bind-9-server
director_uuid: PLACEHOLDER-DIRECTOR-UUID
compilation:
  cloud_properties:
    ram: 2048
    disk: 4096
    cpu: 2
  network: default
  reuse_compilation_vms: true
  workers: 1
jobs:
- instances: 1
  name: named
  networks:
  - default:
    - dns
    - gateway
    name: default
    static_ips:
    - 10.244.0.66
  persistent_disk: 16
  resource_pool: bind-9_pool
  templates:
  - { release: bind-9, name: named }
  properties:
    config_file: |
      options {
        recursion yes;
        allow-recursion { any; };
        forwarders { 8.8.8.8; 8.8.4.4; };
      };
networks:
- name: default
  subnets:
  - cloud_properties:
      name: VirtualBox Network
    range: 10.244.0.64/30
    dns:
      - 8.8.8.8
    gateway: 10.244.0.65
    static:
    - 10.244.0.66
releases:
  - name: bind-9
    version: latest
resource_pools:
- cloud_properties:
    ram: 2048
    disk: 8192
    cpu: 1
  name: bind-9_pool
  network: default
  stemcell:
    name: bosh-warden-boshlite-centos-go_agent
    version: latest
update:
  canaries: 1
  canary_watch_time: 30000 - 600000
  max_in_flight: 8
  serial: false
  update_watch_time: 30000 - 600000

Create the Deployment Manifest

We create the deployment manifest by copying the sample manifest in the examples directory and substituting our BOSH's UUID:

cp examples/bind-9-bosh-lite.yml config/
perl -pi -e "s/PLACEHOLDER-DIRECTOR-UUID/$(bosh status --uuid)/" config/bind-9-bosh-lite.yml
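A quick way to confirm the substitution took effect (a sketch):

# should print the same UUID that `bosh status --uuid` reports
grep director_uuid config/bind-9-bosh-lite.yml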

Deploy the Release

bosh deployment config/bind-9-bosh-lite.yml
bosh -n deploy

Configure any Necessary Routes

If we are using BOSH Lite, we need to add routes to the VMs hosted in the Warden containers. The following OS X route command assumes the default IP scheme of the sample BIND 9 deployment manifest:

sudo route add -net 10.244.0.0/24 192.168.50.4
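To confirm the route is in place (a sketch; netstat's routing-table format varies between OS X releases):

# expect an entry for 10.244.0.0/24 via 192.168.50.4
netstat -rn | grep 10.244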

Test the Deployment

We use the nslookup command to ensure our newly-deployed DNS server can resolve pivotal.io:

nslookup pivotal.io 10.244.0.66
    Server:		10.244.0.66
    Address:	10.244.0.66#53

    Non-authoritative answer:
    Name:	pivotal.io
    Address: 54.88.108.63
    Name:	pivotal.io
    Address: 54.210.84.224
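An equivalent spot-check with dig (a sketch):

# expect one or more A records, matching the nslookup answer above
dig +short pivotal.io @10.244.0.66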

We have successfully created a BOSH release including one job. We have also successfully created a deployment manifest customizing the release, and deployed the release using our manifest. Finally we tested that our deployment succeeded.

Addendum: BOSH directory differs from BIND's

The BOSH directory structure differs from BIND's, and systems administrators may find the BOSH structure unfamiliar.

Here are some examples:

| File type | BOSH location | Ubuntu 13.04 location |
| --- | --- | --- |
| executable | /var/vcap/packages/bind-9-9.10.2/sbin/named | /usr/sbin/named |
| configuration | /var/vcap/jobs/bind/etc/named.conf | /etc/bind/named.conf |
| pid | /var/vcap/data/sys/run/named/named.pid | /var/run/named/named.pid |
| logs | /var/log/daemon.log | (same) |

That is not to say the BOSH layout does not have its advantages. For example, the BOSH layout allows multiple instances (jobs) of the same package, each with its own configuration.

That advantage, however, is lost on BIND: running multiple versions of BIND was not a primary consideration; only one program can bind [4] to DNS's assigned port 53 [5], making it difficult to run more than one BIND job on a given VM.


Footnotes

1 We chose BIND 9 and not BIND 10 (nor the open-source variant Bundy) because BIND 10 had been orphaned by the ISC (read about it here).

There are alternatives to the BIND 9 DNS server. One of my peers, Michael Sierchio, is a strong proponent of djbdns, which was written with a focus on security.

2 Although it is convenient to think of BIND and named as synonyms, they are different, though the differences are subtle.

For example, the software is named BIND, so when creating our BOSH release, we use the term BIND (e.g. bind-9 is the name of the BOSH release)

The daemon that runs is named named. We use the term named where we deem appropriate (e.g. named is the name of the BOSH job). Also, many of the job-related directories and files are named named (a systems administrator would expect the configuration file to be named.conf, not bind.conf, for that's what it's named in RedHat, FreeBSD, Ubuntu, et al.)

Even polished distributions struggle with the BIND vs. named dichotomy, and the result is evident in the placement of configuration files. For example, the default location for named.conf in Ubuntu is /etc/bind/named.conf but in FreeBSD is /etc/namedb/named.conf (it's even more complicated in that FreeBSD's directory /etc/namedb is actually a symbolic link to /var/named/etc/namedb, for FreeBSD prefers to run named in a chroot environment whose root is /var/named. This symbolic link has the advantage that named's configuration file has the same location both from within the chroot and without).

3 The number "9" in BIND 9 appears to be a version number, but it isn't: BIND 9 is a distinct codebase from BIND 4, BIND 8, and BIND 10. It's different software.

This is an important distinction because version numbers, by convention, are not used in BOSH release names. For example, the version number of BIND 9 that we are downloading is 9.10.2, but we don't name our release bind-9-9.10.2-release; instead we name it bind-9-release.

4 We refer to the UNIX system call bind (e.g. "binding to port 53") and not the DNS nameserver BIND.

5 One could argue that a multi-homed host could bind [4] different instances of BIND to distinct IP addresses. It's technically feasible though not common practice. And multi-homing is infrequently used in BOSH.

In an interesting side note, the aforementioned nameserver djbdns makes use of multi-homed hosts, for it runs several instances of its nameservers to accommodate different purposes, e.g. one server (dnscache) to handle general DNS queries, another server (tinydns) to handle authoritative queries, another server (axfrdns) to handle zone transfers.

One might be tempted to think that djbdns would be a better fit to BOSH's structure than BIND, but one would be mistaken: djbdns makes very specific decisions about the placement of its files and the manner in which the nameserver is started and stopped, decisions which don't quite dovetail with BOSH's decisions (e.g. BOSH uses monit to supervise processes; djbdns assumes the use of daemontools).

Deploying a DNS Server to Amazon AWS with bosh-init

This post describes how to deploy a BIND 9 DNS server to Amazon AWS using bosh-init, a command-line BOSH tool that enables the deployment of VMs without requiring a Director VM. [1]

This blog post is the second of a series; it picks up where the previous one, How to Create a BOSH Release of a DNS Server, left off. Previously we described how to create a BOSH release (i.e. a BOSH software package) of the BIND 9 DNS server and deploy the server to VirtualBox via BOSH Lite.

0. Download Executables, Releases, Stemcells, and CPIs

Let's begin by installing bosh-init on our OS X development workstation:

curl -L https://s3.amazonaws.com/bosh-init-artifacts/bosh-init-0.0.29-darwin-amd64 -o /usr/local/bin/bosh-init
chmod 755 /usr/local/bin/bosh-init
bosh-init version # test for successful install
    version 0.0.29-9ba06aa-2015-05-14T16:52:13Z

Let's clone our BIND release's git repository:

cd ~/workspace/
git clone https://github.com/cunnie/bosh-bind-9-release.git
cd bosh-bind-9-release

1. Create BOSH DNS Release

We install the BOSH CLI ruby gem:

# if using system ruby, you may need to prepend 'sudo' to the following line:
gem install bosh_cli

We create our release (we use --force to create a development release, --with-tarball to create the .tgz file that we need to deploy to our VM):

bosh create release --force --with-tarball --name bind-9

2. Create BOSH Deployment Manifest

We gather the following information, which we will need in order to create our BOSH deployment manifest:

  • Elastic IP Address (e.g. 52.0.56.137, must have scope vpc not standard)
  • Subnet ID (e.g. subnet-1c90ef6b) (This must be a subnet in your VPC)
  • AWS Key Pair name (e.g. aws_nono)
  • AWS Key Pair's path (e.g. /Users/cunnie/.ssh/aws_nono.pem) (must not be passphrase-protected)
  • AWS Access Key (e.g. AKIAIZWITxxxxxxxxxxx)
  • AWS Secret (e.g. 0+B1XW6VVEYw55J+ZjfqMW+AjELxxxxxxxxxxxxx)
  • Security Group (must have the VPC ID of your VPC, e.g. vpc-e4250881)
  • A password for your agent (generate a hard-to-guess password, e.g. Lif6@stambha)
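One way to generate such a password (a sketch; any generator you trust will do):

# prints a random 16-character base64 string
openssl rand -base64 12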

We copy the sample AWS deployment manifest to our config directory:

cp examples/bind-9-aws.yml config/

We edit the deployment manifest; we search for all occurrences of "CHANGEME" and replace them with the information we gathered earlier:

vim config/bind-9-aws.yml
    # replace all occurrences of 'CHANGEME'
    # with appropriate values
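A quick check that no placeholder was missed (a sketch):

# no output means every CHANGEME has been replaced
grep -n CHANGEME config/bind-9-aws.yml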

3. Deploy

Let's deploy (if you see Gem::Installer::ExtensionBuildError: ERROR: Failed to build gem native extension while installing, you may need to run xcode-select --install and accept the license in order to install the necessary header files):

bosh-init deploy config/bind-9-aws.yml
Deployment manifest: '/Users/cunnie/workspace/bind-9/config/bind-9-aws.yml'
Deployment state: '/Users/cunnie/workspace/bind-9/config/bind-9-aws-state.json'

Started validating
  Validating deployment manifest... Finished (00:00:00)
  Validating release 'bind-9'... Finished (00:00:00)
  Validating release 'bosh-aws-cpi'... Finished (00:00:00)
  Validating jobs... Finished (00:00:00)
  Downloading stemcell... Finished (00:00:00)
  Validating stemcell... Finished (00:00:00)
  Validating cpi release... Finished (00:00:00)
Finished validating (00:00:00)

Started installing CPI
  Compiling package 'ruby_aws_cpi/7903f3a543e8ab35fd980a0ca74f9151282d56b2'...

4. Test

We test our newly-deployed DNS server to ensure it refuses to resolve queries for which it is not authoritative (e.g. google.com), but will answer queries for which it is authoritative (nono.com). We assume that no change has been made to the manifest's jobs/properties/config_file section; i.e. we assume that the server that has been deployed is a slave server for the zone nono.com..

# Substitute your Elastic IP as appropriate, i.e. don't use 52.0.56.137
ELASTIC_IP=52.0.56.137
# status should be 'REFUSED', i.e. no recursive lookups:
dig google.com @$ELASTIC_IP | grep status
    ;; ->>HEADER<<- opcode: QUERY, status: REFUSED, id: 49040
# status should be 'NOERROR', i.e. authoritative lookups succeed
dig nono.com @$ELASTIC_IP | grep status
    ;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 28829

5. Customize and Re-deploy

We have a successful deployment, but not a useful one (a DNS server that acts as a secondary for nono.com does not add much value).

We need to customize our deployment to suit our purposes. Fortunately, our DNS server's configuration (i.e. named.conf) is embedded in our deployment manifest; we need merely make changes to the manifest's jobs→properties→config_file section and re-deploy, i.e.

vim config/bind-9-aws.yml
    # make changes; be sure the configuration section is indented appropriately!
bosh-init deploy config/bind-9-aws.yml

6. Summary

We have shown the ease with which one can deploy BOSH releases to single-instance VMs in Amazon AWS using bosh-init.

One might be tempted to compare BOSH to configuration management tools such as Ansible, Chef, and Puppet, and on closer inspection one would see that their approaches differ radically:

  • BOSH assumes deployment to a cloud infrastructure (AWS, vSphere, OpenStack)
  • BOSH depends on specially-built images/AMIs (stemcells); BOSH does not function with, say, a generic Ubuntu AMI.
  • BOSH does not install 'packages' (e.g. .deb, .rpm), instead, one must build a custom BOSH release or take advantage of community-built releases.

Appendix A. The Importance of Disallowing Recursion

We disable recursive queries on our DNS servers that have been deployed to the Internet because it prevents our servers from being used in a DNS Amplification Attack. DNS Amplification Attacks are doubly-damning in the sense that we pay for the attack's bandwidth charges (we pay in the literal sense: Amazon charges us for the outbound traffic).

The good news is that, since version 9.4, BIND has had a non-recursive default (our BOSH release's BIND version is 9.10). If you truly need to allow recursion, add the following stanza to the deployment manifest's jobs→properties→config_file section; it will configure the BIND server to be an Open Resolver (a DNS server that allows recursive queries is known as an Open Resolver). Don't do this unless your server is behind a firewall:

      options {
        recursion yes;
        // DO NOT put the following line on an Internet-accessible DNS server
        allow-recursion { any; };
      };

An easy way to test if your server is an Open Resolver is to run the following dig command (substitute 52.6.149.97 with the address of your deployed DNS server):

dig freeinfosys.com ANY @52.6.149.97
    ...
    ;; MSG SIZE  rcvd: 33

The "MSG SIZE rcvd: 33" means that recursion was denied (i.e. our server is properly configured). If instead you see "MSG SIZE rcvd: 3185", then you need to edit your deployment's manifest and re-deploy.

Probed within 3 Hours, Exploited within 3 Days

Our server was probed within 3 hours of deployment (logs from /var/log/daemon.log):

May  6 14:43:51 localhost named[23167]: client 80.82.78.2#7678 (freeinfosys.com): query (cache) 'freeinfosys.com/ANY/IN' denied

Hackers attempted to exploit our DNS server to attack voxility. The attempts did not succeed because we had properly configured our server:

May  9 01:42:41 localhost named[23167]: message repeated 15 times: [ client 109.163.224.34#19903 (067.cz): query (cache) '067.cz/ANY/IN' denied]

Acknowledgements

Dmitriy Kalinin's assistance was invaluable when creating the sample manifest, checking the blog post for errors, and simplifying the deployment procedure.


Footnotes

1 We use bosh-init rather than a Director VM primarily for financial reasons: with bosh-init, we need but spin up the DNS server VM (t2.micro instance, $114/year [2] ). Using a Director VM requires an m3.medium instance ($614/year), ballooning our costs 538%.

2 Amazon EC2 Prices are current as of the writing of this document. A t2.micro instance costs $0.013 per hour. Assuming 365.2425 24-hour days/year, this works out to $113.96/year. An m3.medium instances costs $0.070 per hour, $613.60/year. Our calculations do not take into account Spot Instances or Reserved Instances.

Admittedly there are mechanisms to reduce the cost of the Director VM—for example, we could suspend the Director VM instance after it has deployed the DNS server.

Troubleshooting BOSH Releases/Deployments

This blog post describes the steps we took to resolve failures when developing our BOSH DNS Release. Although the steps described are specific to our BOSH DNS Release, we feel that they can be generalized to troubleshooting most if not all BOSH Releases.

Debugging a BOSH Release and its subsequent deployment can be challenging, but there are a few tricks which can ease the burden (e.g. preventing the tear-down of the compilation VM in order to troubleshoot the failure in vivo).

Note that several of the problems described here stemmed from our decision to rename our job from bind to named.

0. Troubleshooting BOSH compilation VM problems

Our compilation fails the first time; this is the output of the bosh deploy command:

  Started compiling packages > bind-9-9.10.2/9e6f17bcebdc0acf860adf28de34e5a091c32173. Failed: Action Failed get_task: Task 3441c0a4-ce13-4d02-4e12-5c04f886145d result: Compiling package bind-9-9.10.2: Running packaging script: Command exited with 2; Truncated stdout: a - unix/time.o...

We modify the packaging script to help us debug:

  • insert sleep 10800 (10800 seconds (3 hours) should be long enough for us to troubleshoot this particular error)
vim packages/bind-9-9.10.2/packaging
    set -e
    sleep 10800
    ...
bosh create release --force
bosh upload release
bosh -n deploy
    ...
    Started compiling packages > bind-9-9.10.2/3220fc6c003bf67bd07bc63a124d96c315155625.

When the deploy enters the compilation phase, go to another terminal window to find the IP address of the compilation VM:

bosh vms
    ...
    +-----------------+---------+---------------+-------------+
    | Job/index       | State   | Resource Pool | IPs         |
    +-----------------+---------+---------------+-------------+
    | unknown/unknown | running |               | 10.244.0.70 |
    +-----------------+---------+---------------+-------------+

Our compilation VM is at 10.244.0.70. We log into it to see the status of the compilation:

ssh vcap@10.244.0.70
ssh: connect to host 10.244.0.70 port 22: Operation timed out

We cannot connect because we haven't set our route to our BOSH Lite's VMs. Let's set our route and try again:

sudo route add -net 10.244.0.0/24 192.168.50.4
ssh vcap@10.244.0.70 # password is 'c1oudc0w'
sudo su - # password is still 'c1oudc0w'

We use ps to find the sleep command's PID (239 in this case), which we use to determine (via the /proc filesystem) the location of the compilation directory:

ps auxwww | grep sleep
    root       239  0.0  0.0   4124   316 ?        S<   02:16   0:00 sleep 10800
ls -l /proc/239/cwd
    lrwxrwxrwx 1 root root 0 May 25 02:21 /proc/239/cwd -> /var/vcap/data/compile/bind-9-9.10.2/bind-9.10.2

We cd into the compilation directory.

cd /var/vcap/data/compile/bind-9-9.10.2/bind-9.10.2

We set the BOSH_INSTALL_TARGET environment variable.

export BOSH_INSTALL_TARGET=/var/vcap/packages/bind-9-9.10.2/

We change to the parent directory, edit the packaging script to remove the sleep command, and run the script with tracing enabled to see where it's failing:

cd ..
vim packaging
    # sleep 10800
bash -x packaging
    ...
    packaging: line 29: --prefix=/var/vcap/packages/bind-9-9.10.2/: No such file or directory

Our packaging script had an error. We fix the error locally on the compilation VM and test before backporting the change to our release.

When we're finished, we kill the sleep process:

  • sleep will exit with a non-zero return code
  • since we have configured our packaging script to exit on errors (via set -e), our packaging script will exit with a non-zero return code
  • BOSH, upon receiving the non-zero return code from the packaging script, will tear down the now-unneeded compilation VM. BOSH will assume compilation has failed and won't attempt a deploy.

killall sleep

1. Troubleshooting BOSH pre-compilation problems

We may fail before reaching the compilation phase:

bosh create release --force --with-tarball
[WARNING] Missing blobstore configuration, please update config/final.yml before making a final release
...
Building jobs
-------------
Job spec is missing

We examine our jobs/named/spec file; we notice that we mistakenly put a single quote within a single-quoted string in our spec file. We double-quote the string in order to fix the problem (we are not the only ones to have encountered this issue).

Broken spec:
    description: 'The contents of named.conf (named's configuration file)'
Fixed spec (double-quotes):
    description: "The contents of named.conf (named's configuration file)"

2. Troubleshooting the BOSH deployment

These are the steps we followed when our deployment successfully completed the compilation phase only to fail during deployment.

bosh -n deploy
  Started preparing deployment > Binding existing deployment. Failed: Timed out sending `get_state' to 45ae2115-e7b3-4668-9516-34a9129eb705 after 45 seconds (00:02:15)
  Error 450002: Timed out sending `get_state' to 45ae2115-e7b3-4668-9516-34a9129eb705 after 45 seconds
  Task 33 error
  For a more detailed error report, run: bosh task 33 --debug

We follow the suggestion:

bosh task 33 --debug
...
E, [2015-04-04 19:00:57 #2184] [task:33] ERROR -- DirectorJobRunner: Timed out sending `get_state' to 45ae2115-e7b3-4668-9516-34a9129eb705 after 45 seconds
/var/vcap/packages/director/gem_home/ruby/2.1.0/gems/bosh-director-1.2811.0/lib/bosh/director/agent_client.rb:178:in `block in handle_method'
/var/vcap/packages/ruby/lib/ruby/2.1.0/monitor.rb:211:in `mon_synchronize'
...
I, [2015-04-04 19:00:57 #2184] []  INFO -- DirectorJobRunner: Task took 2 minutes 15.042849130000008 seconds to process.
999

We determine the VM that's running named:

bosh vms
...
  +-----------------+--------------------+---------------+-----+
  | Job/index       | State              | Resource Pool | IPs |
  +-----------------+--------------------+---------------+-----+
  | unknown/unknown | unresponsive agent |               |     |
  +-----------------+--------------------+---------------+-----+
...

We tell BOSH to run an IAAS consistency check ("cck", i.e. "cloud check"):

bosh cck

BOSH cloud check discovers that a VM is missing; we tell BOSH to recreate it:

Problem 1 of 1: VM with cloud ID `7e72c2bc-593d-42fb-5c69-9299d7ed47e8' missing.
  1. Ignore problem
  2. Recreate VM using last known apply spec
  3. Delete VM reference (DANGEROUS!)
Please choose a resolution [1 - 3]: 2

We ask for a listing of BOSH's VMs:

bosh vms
...
  +-----------+---------+---------------+-------------+
  | Job/index | State   | Resource Pool | IPs         |
  +-----------+---------+---------------+-------------+
  | bind-9/0  | failing | bind-9_pool   | 10.244.0.66 |
  +-----------+---------+---------------+-------------+
...

We've gotten further: we have a VM that's running, but now the job is failing, and we need to fix that. We need to ssh to the VM to troubleshoot further. We use BOSH's ssh feature to ssh into the deployment's VM. Notice how we identify the VM to BOSH: we don't use the hostname or the IP address; instead, we use the job-name appended with the index (i.e. "bind-9 0") (note we remove the "/").

bosh ssh bind-9 0
...
  Enter password (use it to sudo on remote host): *

We set the password to 'p'; we don't need the password to be terribly secure—BOSH will create a throw-away userid (e.g. bosh_1viqpm7v5) that lasts only the duration of the ssh session. Also, since this is BOSH Lite (BOSH in a VM that's only reachable from the hosting workstation), the VM is in essence behind a firewall.

We become root (troubleshooting is easier as the root user):

sudo su -
  [sudo] password for bosh_1viqpm7v5:
  -bash-4.1#

We check the status of the job:

/var/vcap/bosh/bin/monit summary
  The Monit daemon 5.2.4 uptime: 3h 43m

  Process 'bind'                      Execution failed
  System 'system_64100128-c43d-4441-989d-b5726379f339' running

We tell monit to not try to restart the BIND daemon so we can start it manually to determine where it's failing:

/var/vcap/bosh/bin/monit stop bind
bash -x /var/vcap/jobs/bind/bin/ctl start
...
  + exec /var/vcap/packages/bind-9-9.10.2/sbin/named -u vcap -c /var/vcap/jobs/bind/etc/named.conf
  /var/vcap/packages/bind-9-9.10.2/sbin/named: error while loading shared libraries: libjson.so.0: cannot open shared object file: No such file or directory
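When a binary fails this way, it can help to list exactly which shared libraries it cannot resolve (a sketch, run on the failing VM):

# expect a line mentioning libjson.so.0 until the package is fixed
ldd /var/vcap/packages/bind-9-9.10.2/sbin/named | grep 'not found'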

We fix our release to install the libjson library, upload the release, and redeploy. Our bosh vms command shows the job as failing, so we again ssh into our VM and check monit's status:

/var/vcap/bosh/bin/monit summary
  Process 'bind'                      not monitored

Now we check the logs. We were lazy—we let named default to logging to syslog using the daemon facility, so we check the daemon log to see why it failed:

bosh ssh bind-9 0
...
tail /var/log/daemon.log
...
  Apr  5 04:53:44 kejnssqb34o named[2819]: /var/vcap/jobs/bind/etc/named.conf:6: expected IP address near 'zone'
  Apr  5 04:53:44 kejnssqb34o named[2819]: loading configuration: unexpected token
  Apr  5 04:53:44 kejnssqb34o named[2819]: exiting (due to fatal error)

In our haste we botched our deployment manifest (bind-9-bosh-lite.yml), and the section for the config_file was incomplete (the forwarders section was empty and lacked a closing brace ("};")):

...
  properties:
     config_file: |
       options {
         recursion yes;
         forwarders {
       zone "." in{
...

We fix our manifest and redeploy, but it's still failing:

bosh -n deploy
bosh vms
...
  +-----------+---------+---------------+-------------+
  | Job/index | State   | Resource Pool | IPs         |
  +-----------+---------+---------------+-------------+
  | bind-9/0  | failing | bind-9_pool   | 10.244.0.66 |
  +-----------+---------+---------------+-------------+
...

We ssh in and troubleshoot:

bosh ssh bind-9
...
sudo su -
/var/vcap/bosh/bin/monit summary
...
  Process 'bind'                      not monitored
...

We check the logs:

tail /var/log/daemon.log
...
  Apr  9 02:54:53 kh2fp1cb3r7 named[1813]: the working directory is not writable
  Apr  9 02:54:53 kh2fp1cb3r7 named[1813]: managed-keys-zone: loaded serial 0
  Apr  9 02:54:53 kh2fp1cb3r7 named[1813]: all zones loaded
  Apr  9 02:54:53 kh2fp1cb3r7 named[1813]: running

named seems to have started correctly. Let's double-check and make sure it's running:

ps auxwww | grep named
...
  vcap      1813  0.0  0.2 239896 13820 ?        S<sl 02:54   0:00 /var/vcap/packages/bind-9-9.10.2/sbin/named -u vcap -c /var/vcap/jobs/bind/etc/named.conf
  vcap      1826  0.0  0.2 173580 12896 ?        S<sl 02:55   0:00 /var/vcap/packages/bind-9-9.10.2/sbin/named -u vcap -c /var/vcap/jobs/bind/etc/named.conf
  vcap      1838  0.5  0.2 239116 12888 ?        S<sl 02:56   0:00 /var/vcap/packages/bind-9-9.10.2/sbin/named -u vcap -c /var/vcap/jobs/bind/etc/named.conf

monit attempts to start named, which succeeds, but monit seems to think that it has failed (that's why we see three copies of named: monit has attempted to start named at least three times).

We examine monit's log file:

tail /var/vcap/monit/monit.log
...
  [UTC Apr  9 02:50:53] info     : 'bind' start: /var/vcap/jobs/bind/bin/ctl
  [UTC Apr  9 02:51:23] error    : 'bind' failed to start
  [UTC Apr  9 02:51:33] error    : 'bind' process is not running
  [UTC Apr  9 02:51:33] info     : 'bind' trying to restart
  [UTC Apr  9 02:51:33] info     : 'bind' start: /var/vcap/jobs/bind/bin/ctl

We suspect the PID file is the problem; we notice that our ctl.sh template places the PID file in /var/vcap/sys/run/named/pid, but that in jobs/bind/monit we specify a completely different location, i.e. /var/vcap/sys/run/bind/pid.

We recreate our release, upload our release, and deploy it:

bosh -n deploy
...
  Failed: Permission denied @ dir_s_mkdir - /vagrant/tmp (00:00:00)
  Error 100: Permission denied @ dir_s_mkdir - /vagrant/tmp

We need to fix our Vagrant permissions (described in this thread). The thread describes a permanent fix; our fix below is merely temporary (i.e. when the BOSH Lite VM is recreated you will need to run the commands below again):

pushd ~/workspace/bosh-lite/
vagrant ssh -c "sudo chmod 777 /vagrant"
popd

We attempt our deploy again:

bosh -n deploy
...
  Failed: Creating VM with agent ID 'fbb2ba55-d27c-40e5-9bdd-889a4ec6d356': Creating container: network already acquired: 10.244.0.64/30 (00:00:00)

We have run out of IPs for our compilation VM (we believe it's because we have but one usable IP address, and our BIND VM is using it). We delete our deployment and re-deploy:

bosh -n delete deployment bind-9-server
...
bosh -n deploy
...
  Failed: `bind-9/0' is not running after update (00:10:11)
  Error 400007: `bind-9/0' is not running after update

We check if the VM is running and then ssh to it:

bosh vms
...
  +-----------+---------+---------------+-------------+
  | Job/index | State   | Resource Pool | IPs         |
  +-----------+---------+---------------+-------------+
  | bind-9/0  | failing | bind-9_pool   | 10.244.0.66 |
  +-----------+---------+---------------+-------------+
...
bosh ssh bind-9
...
sudo su -
...
/var/vcap/bosh/bin/monit summary
  /var/vcap/monit/job/0000_bind.monitrc:3: Warning: the executable does not exist '/var/vcap/jobs/named/bin/ctl'
  /var/vcap/monit/job/0000_bind.monitrc:4: Warning: the executable does not exist '/var/vcap/jobs/named/bin/ctl'
  The Monit daemon 5.2.4 uptime: 18m

We had made a mistake in our jobs/named/spec file; we had configured the name to be bind when we should have configured it to be named. We re-create our release and deploy:

 # fix `name`
vim jobs/named/spec
bosh create release --force
bosh upload release
bosh -n delete deployment bind-9-server
bosh -n deploy
  Failed: Can't find template `bind' (00:00:00)
  Error 190012: Can't find template `bind'

We had not changed our job name from bind-9 to named in our manifest; we edit our manifest and deploy again:

 # fix `name`
vim config/bind-9-bosh-lite.yml
bosh create release --force
bosh upload release
bosh -n delete deployment bind-9-server
bosh -n deploy
  Failed: `named/0' is not running after update (00:10:10)
  Error 400007: `named/0' is not running after update

We look at the VMs and then check monit's log:

bosh vms
    +-----------+---------+---------------+-------------+
    | Job/index | State   | Resource Pool | IPs         |
    +-----------+---------+---------------+-------------+
    | named/0   | failing | bind-9_pool   | 10.244.0.66 |
    +-----------+---------+---------------+-------------+
bosh ssh named
sudo su -
/var/vcap/bosh/bin/monit summary
    Process 'named'                     not monitored
ps auxwww | grep named
    # MANY processes
killall named
/var/vcap/bosh/bin/monit stop named
tail /var/vcap/monit/monit.log
    [UTC Apr  9 14:41:46] info     : 'named' start: /var/vcap/jobs/named/bin/ctl
    [UTC Apr  9 14:42:16] error    : 'named' failed to start
    [UTC Apr  9 14:42:26] error    : 'named' process is not running
    [UTC Apr  9 14:42:26] info     : 'named' trying to restart

We check our pid file:

ls /var/vcap/sys/run/named/
    named.pid  pid  session.key

We notice that there are two pid files: the one we specified, pid, and one that's most likely created by named (named.pid). We compare the PIDs of the running named processes to the first four entries in our pid file:

ps auxwww | grep named
    vcap       193  0.0  0.2 239896 13912 ?        S<sl 19:37   0:00 /var/vcap/packages/bind-9-9.10.2/sbin/named -u vcap -c /var/vcap/jobs/named/etc/named.conf
    vcap       207  0.0  0.2 239116 12936 ?        S<sl 19:37   0:00 /var/vcap/packages/bind-9-9.10.2/sbin/named -u vcap -c /var/vcap/jobs/named/etc/named.conf
    vcap       221  0.0  0.2 173580 12972 ?        S<sl 19:38   0:00 /var/vcap/packages/bind-9-9.10.2/sbin/named -u vcap -c /var/vcap/jobs/named/etc/named.conf
    vcap       233  0.0  0.2 239116 12976 ?        S<sl 19:38   0:00 /var/vcap/packages/bind-9-9.10.2/sbin/named -u vcap -c /var/vcap/jobs/named/etc/named.conf
    ...
head -4 /var/vcap/sys/run/named/pid
    120
    203
    217
    229

The PIDs don't match. Our startup script records the wrong PID, and monit, seeing that the process is no longer running, attempts to start named again.

We dig further and realize that /var/vcap/sys/run/named/named.pid is the correct PID, so we modify our release to use that as the PID file.

An alternative solution would be to specify the correct pid in our deployment manifest, in the properties section that includes the named.conf, but we are hesitant to force our users to include options { pid-file "/var/vcap/sys/run/named/pid"; };. We feel that our release should manage the pid file, not the user.

We make our changes and redeploy. Success. We test our DNS service by querying it using nslookup:

nslookup google.com. 10.244.0.66
    Server:		10.244.0.66
    Address:	10.244.0.66#53

    ** server can't find google.com: REFUSED

We ssh into our VM and make sure that named is listening on all interfaces:

bosh ssh named
...
ss -a | grep domain
    LISTEN     0      10            10.244.0.66:domain                   *:*
    LISTEN     0      10              127.0.0.1:domain                   *:*
    LISTEN     0      10                     :::domain                  :::*

named has correctly bound to its external IP address (10.244.0.66), so let's try an nslookup from the VM itself:

nslookup google.com. 10.244.0.66
    Server:		10.244.0.66
    Address:	10.244.0.66#53

    Non-authoritative answer:
    Name:	google.com
    Address: 74.125.239.136
...

The lookup from the VM succeeded, but the lookup from outside the VM did not. We check named's log file:

less /var/log/daemon.log
    Apr 11 20:38:35 ace946c0-170d-4a59-a9d7-0a08050c24f9 named[469]: client 192.168.50.1#57821 (google.com): query (cache) 'google.com/A/IN' denied

We see that named received the request and explicitly denied it. It seems we have mis-configured the named configuration that is specified in config/bind-9-bosh-lite.yml.

We decide to edit the configuration in place on the VM; when we're sure our changes work, we'll backport them to the deployment manifest. We edit the configuration to include a directive to allow all queries, including ones that do not originate from the loopback address (i.e. we add allow-recursion { any; };):

bosh ssh named
sudo su -
 # edit the named.conf to allow recursion
vim /var/vcap/jobs/named/etc/named.conf
    options {
      allow-recursion { any; };
      ...
/var/vcap/bosh/bin/monit restart named
/var/vcap/bosh/bin/monit summary
    ...
    Process 'named'                     Execution failed

We check to make sure named is running, and that its PID matches the contents of the named.pid file:

ps auxwww | grep named
    vcap       386  0.0  0.2 370968 13936 ?        S<sl 15:27   0:00 /var/vcap/packages/bind-9-9.10.2/sbin/named -u vcap -c /var/vcap/jobs/named/etc/named.conf
    root       455  0.0  0.0  11356  1424 ?        S<   18:33   0:00 /bin/bash /var/vcap/jobs/named/bin/ctl stop
cat /var/vcap/sys/run/named/named.pid
    386

We determine 2 things:

  1. named is running, and its PID matches the entry in the PID file
  2. monit's attempt to restart named has apparently hung on the portion that stops the named server.

We kill monit's process to stop named, and run it manually with tracing on so we can determine the cause of the hang:

kill 455
 # make sure it's really dead:
ps auxwww | grep 455
bash -x /var/vcap/jobs/named/bin/ctl stop
    + RUN_DIR=/var/vcap/sys/run/named
    + PIDFILE=/var/vcap/sys/run/named/named.pid
    + case $1 in
    ++ cat /var/vcap/sys/run/named/named.pid
    + PID=386
    + '[' -n 386 ']'
    + SIGNAL=0
    + N=1
    + kill -0 386
    + '[' 1 -eq 1 ']'
    + echo 'waiting for pid 386 to die'
    waiting for pid 386 to die
    + '[' 1 -eq 11 ']'
    + '[' 1 -gt 20 ']'
    + n=2
    + sleep 1
    + kill -0 386
    + '[' 1 -eq 1 ']'
    + echo 'waiting for pid 386 to die'
    waiting for pid 386 to die
    + '[' 1 -eq 11 ']'
    + '[' 1 -gt 20 ']'
    + n=2

We realize we have made a mistake in our ctl.sh template; we used uppercase 'N' for a variable except when we incremented, where we mistakenly used a lowercase 'n'.

We also realize that our kill script attempts to [initially] kill with a signal '0', which we don't understand because kill -0, according to the man page, doesn't do anything ("If sig is 0, then no signal is sent"). We decide that a mistake of this magnitude deserves a re-write of the ctl.sh template and a redeploy:

vim jobs/named/templates/ctl.sh
 # make changes and then
 # check for syntactic correctness
bash -n jobs/named/templates/ctl.sh
bosh create release --force
 # we're already up to release 13
bosh upload release dev_releases/bind-9/bind-9-0+dev.13.yml
bosh -n deploy --recreate
nslookup google.com. 10.244.0.66
    ** server can't find google.com: REFUSED
bosh ssh named
sudo su -

We pick up our debugging where we left off—edit named's configuration in place on the VM; when we're sure our changes work, we'll backport them to the deployment manifest. We edit the configuration to include a directive to allow all queries, including ones that do not originate from the loopback address (i.e. we add allow-recursion { any; };):

 # edit the named.conf to allow recursion
vim /var/vcap/jobs/named/etc/named.conf
    options {
      allow-recursion { any; };
      ...
/var/vcap/bosh/bin/monit restart named
/var/vcap/bosh/bin/monit summary
    ...
    Process 'named'                     running

It appears that our change to monit's ctl.sh template was successful. Now let's test from our workstation to see if our re-configuration of named allows recursion:

nslookup google.com. 10.244.0.66
    Server:		10.244.0.66
    Address:	10.244.0.66#53

    Non-authoritative answer:
    Name:	google.com
    Address: 216.58.192.46

We backport our change to our deployment's manifest:

vim config/bind-9-bosh-lite.yml
      properties:
        config_file: |
          options {
            allow-recursion { any; };

We deploy again:

bosh -n deploy --recreate
nslookup google.com 10.244.0.66

At this point we conclude our debugging, for our release is working as expected.

bosh-init troubleshooting

bosh-init deploy config/bind-9-ntp-aws.yml
Command 'deploy' failed:
  Deploying:
    Building state for instance 'named_and_ntp/0':
      Rendering job templates for instance 'named_and_ntp/0':
        Rendering templates for job 'named/6220dda89cf6b85d6821667e63661ee059b8637a':
          Rendering template src: ctl.sh, dst: bin/ctl:
            Rendering template src: /var/folders/5q/p12p8rq57hx83fm555qrd2k80000gn/T/bosh-init-release258719544/extracted_jobs/named/templates/ctl.sh, dst: /var/folders/5q/p12p8rq57hx83fm555qrd2k80000gn/T/rendered-jobs231930644/bin/ctl:
              Running ruby to render templates:
                Running command: 'ruby /var/folders/5q/p12p8rq57hx83fm555qrd2k80000gn/T/erb-renderer068207971/erb-render.rb /var/folders/5q/p12p8rq57hx83fm555qrd2k80000gn/T/erb-renderer068207971/erb-context.json /var/folders/5q/p12p8rq57hx83fm555qrd2k80000gn/T/bosh-init-release258719544/extracted_jobs/named/templates/ctl.sh /var/folders/5q/p12p8rq57hx83fm555qrd2k80000gn/T/rendered-jobs231930644/bin/ctl', stdout: '', stderr: '/var/folders/5q/p12p8rq57hx83fm555qrd2k80000gn/T/erb-renderer068207971/erb-render.rb:11:in `merge!': no implicit conversion of nil into Hash (TypeError)
	from /var/folders/5q/p12p8rq57hx83fm555qrd2k80000gn/T/erb-renderer068207971/

We mistakenly used the release names (e.g. bind-9 and ntp) instead of our job names (e.g. named and ntpd) in our manifest's properties section.

References

The official BOSH documentation contains a list of IaaS-specific errors that one may encounter during deployments.

Acknowledgements

Dmitriy Kalinin's input was invaluable when correcting errors and simplifying the debugging process.