Posted on Wed 26 July 2023

one of the one true ways of ops

I’m going to tell you the secret (it’s not a secret) to building reliable, operable, debuggable infrastructure. This is going to be terse, but hopefully understandable to someone with just a little experience.

You’re going to need some infrastructure. Infrastructure is not the stuff that you are building, and it’s not the tools that you are building the stuff with. Infrastructure is the set of reliable services you depend on to help you build your stuff.

At a minimum:

  • an Internet connection
  • a computer acting as a firewall/router to protect you from the Internet
  • a network switch, preferably one which is configurable with VLANs
  • more computers than you would think, some of which will be specialized by speed or amount of storage, RAM, processors, special hardware…

The first rule is that nothing can be built without a firm foundation. A firm foundation does not change unless someone makes an active decision to change it, or something breaks. A broken foundation must be detected and fixed.

To detect things changing, we need a monitoring system. The monitoring system should make read-only inquiries via SNMP and check on the functionality of services on remote computers by running tests ranging from pings and port connections up through HTTPS queries and SQL queries. When it has checked on everything, it needs to go through and do it again. The monitoring system needs a reliable way of sending an alert, and it must reliably continue sending the alert periodically until it is stopped by a person or the detected problem is no longer detected.
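
A minimal sketch of that loop, in Python, gives the shape. The hosts, ports, and the send_alert stand-in are all hypothetical; a real deployment would use an existing monitoring package, but the skeleton is worth internalizing:

    import socket, subprocess, time

    def ping(host):
        # One ICMP echo, two-second timeout (Linux ping flags assumed).
        return subprocess.run(["ping", "-c", "1", "-W", "2", host],
                              capture_output=True).returncode == 0

    def port_open(host, port):
        # TCP connect test; catches refused and timed-out connections.
        try:
            with socket.create_connection((host, port), timeout=3):
                return True
        except OSError:
            return False

    def send_alert(name):
        # Stand-in: a real system pages a human and repeats until acked.
        print(f"ALERT: {name} is failing")

    # Hypothetical inventory. Every new machine and service gets an entry.
    CHECKS = [
        ("gateway ping", lambda: ping("192.0.2.1")),
        ("web https",    lambda: port_open("192.0.2.10", 443)),
        ("db port",      lambda: port_open("192.0.2.20", 5432)),
    ]

    failing = set()
    while True:  # when it has checked on everything, do it again
        for name, check in CHECKS:
            if check():
                failing.discard(name)
            else:
                failing.add(name)
        for name in sorted(failing):
            send_alert(name)  # keeps firing until the problem clears
        time.sleep(60)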

The monitoring system needs to know what time it is. Use NTP. Designate at least one machine as an NTP server, have it talk to a pool of NTP servers out on the Internet, and have all of your internal machines sync to it.
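
If you want to verify that a machine agrees with your designated server, a small sketch using the third-party ntplib package; the hostname ntp.internal.example is a placeholder:

    import ntplib  # third-party: pip install ntplib

    # Ask our designated internal server (placeholder name) for the time
    # and compare it against this machine's own clock.
    response = ntplib.NTPClient().request("ntp.internal.example", version=3)
    print(f"clock offset: {response.offset:+.3f} seconds")
    if abs(response.offset) > 0.5:
        print("drifting badly; every monitoring timestamp is now suspect")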

The monitoring system needs to be able to send alerts. If the Internet is up, send email, preferably to a paging service. How will you get alerts if the Internet is down? You can try cellphone gateways, but I recommend a different method: set up a small copy of part of your monitoring system somewhere else. Have this one just monitor the availability of your services from an outside perspective. Are you pingable? Are the ports for your applications open? Can a login page be retrieved? If not, shout via email.
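
A sketch of that outside watchdog, checking ports and the login page and shouting via email on failure. Every hostname and the mail relay here are placeholders:

    import smtplib, socket, urllib.request
    from email.message import EmailMessage

    SITE = "www.example.com"  # placeholder: your public hostname

    problems = []
    for port in (443, 22):  # whatever ports your applications expose
        try:
            socket.create_connection((SITE, port), timeout=5).close()
        except OSError:
            problems.append(f"port {port} unreachable")
    try:
        with urllib.request.urlopen(f"https://{SITE}/login", timeout=10) as r:
            if r.status != 200:
                problems.append(f"login page returned {r.status}")
    except OSError as exc:
        problems.append(f"login page failed: {exc}")

    if problems:
        msg = EmailMessage()
        msg["Subject"] = f"{SITE} trouble: " + "; ".join(problems)
        msg["From"], msg["To"] = "watchdog@example.net", "oncall@example.net"
        msg.set_content("\n".join(problems))
        with smtplib.SMTP("mail.example.net") as s:  # relay outside your site
            s.send_message(msg)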

From now on, your main monitoring system gets a new monitor for every machine you put into service, and new alerts for every new service you run, internally or externally.

Now you can detect changes. You need to track changes. On a reliable server machine with lots of disk space, install your version tracking system. On that or a similar machine, install a web server that can host a copy of your preferred operating system’s installation system, and also multiple copies of the complete repository of external software. Why so much space? Someday you will upgrade the operating system, and for some period of time you will need a copy of the old and a copy of the new. And new is usually larger than old.

Install a system that can install operating systems on new machines. That’s usually a combination of DNS, DHCP, PXE, and a PXE-boot menu. Figure out how you want to name machines now. Figure out how you will handle expansion in the future. Come up with a flexible network routing and address allocation policy that is also reasonably efficient. Remember that humans like unique names for things that they depend on, but are okay with meaningful+serial names for machines that are interchangeable.
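
As an illustration of the meaningful+serial half of that policy, a small sketch; the role names and the zero-padding width are assumptions you would adjust:

    # Hypothetical policy: unique names for machines humans depend on,
    # role-plus-serial for the interchangeable ones.
    ROLES = {"web", "db", "build"}  # interchangeable fleets

    def next_name(role, existing):
        """Allocate web-01, web-02, ... leaving room for expansion."""
        assert role in ROLES
        serials = [int(n.split("-")[1])
                   for n in existing if n.startswith(role + "-")]
        return f"{role}-{max(serials, default=0) + 1:02d}"

    print(next_name("web", {"web-01", "web-02", "db-01"}))  # -> web-03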

You now need a way to take a freshly installed (via PXE) machine and install and configure specific software on it. Study the available configuration automation systems (ansible, puppet, chef, bcfg2, cfengine, whatever) and pick one that you can live with for a long time. Consider carefully whether things should be fundamentally pushed from a server to a client or pulled from a server by a client. Always prefer pull for repeated tasks.
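
The pull model is simple enough to sketch: each client periodically fetches its desired state and converges toward it. Real tools add transports, ordering, templating, and reporting, but the skeleton looks like this (the URL is a placeholder, and the package check assumes a Debian-style system):

    import json, subprocess, time, urllib.request

    CONFIG_URL = "https://config.internal.example/desired.json"  # placeholder

    def converge(desired):
        # Install anything the server says we should have but don't.
        # (Assumes dpkg/apt; substitute your platform's package tools.)
        for pkg in desired.get("packages", []):
            have = subprocess.run(["dpkg", "-s", pkg],
                                  capture_output=True).returncode == 0
            if not have:
                subprocess.run(["apt-get", "install", "-y", pkg], check=True)

    while True:
        with urllib.request.urlopen(CONFIG_URL, timeout=30) as r:
            converge(json.load(r))
        time.sleep(1800)  # the client pulls on its own schedule

The point of pull: a freshly re-imaged or long-powered-off client catches itself up on its own, without anyone remembering to push to it.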

When someone tells you that technology Z doesn’t provide security, just convenience, believe them.

You will probably find yourself in need of a database pretty soon. If you do not have a burning need for a specific database, there are only three you should consider (as of 2023): sqlite, mariadb (a community fork of mysql), and postgresql. Strongly consider using languages with a built-in database layer that can use all three of these systems. Consider picking postgresql and just sticking with it, unless your needs are very, very simple – in which case, sqlite might be exactly what you want.
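
Python is one such language: the same DB-API shape covers sqlite3 (in the standard library) and the MariaDB and PostgreSQL drivers. A sketch with sqlite3, where moving to a bigger server changes little more than the connect line:

    import sqlite3

    # Only this line changes per backend, e.g. with the psycopg2 driver:
    #   conn = psycopg2.connect("dbname=app host=db-01")
    # (placeholder styles differ: sqlite3 uses ?, psycopg2 uses %s)
    conn = sqlite3.connect("app.db")

    conn.execute("CREATE TABLE IF NOT EXISTS hosts (name TEXT, role TEXT)")
    conn.execute("INSERT INTO hosts VALUES (?, ?)", ("web-01", "web"))
    conn.commit()
    for (name,) in conn.execute("SELECT name FROM hosts WHERE role = ?",
                                ("web",)):
        print(name)
    conn.close()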

Learn a major web server: either nginx or apache. They both work well. I think nginx has a slightly better configuration language, but in the end you’re going to be deploying configs via that config automation system.

For every language you develop in, you must find out what library management system it has and make a local repo of the libraries that you use. You only build from the local repo. Only. Ever. Local. When you want a new version of something you bring it down into your local repo. Don’t remove the old one; it might turn out to be better. After three versions have gone by, you might not care any more. This defends against someone poisoning the upstream source – a supply chain attack. It is not a perfect defense.
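
For Python, a sketch of what that looks like using pip’s download and --no-index/--find-links options; the mirror path and the package pin are placeholders:

    import subprocess

    MIRROR = "/srv/mirror/pypi"  # placeholder: your local repo directory

    # Bring a new version down into the local repo; old files stay put.
    subprocess.run(["pip", "download", "--dest", MIRROR, "requests==2.31.0"],
                   check=True)

    # Builds install from the local repo. Only. Ever. Local.
    subprocess.run(["pip", "install", "--no-index",
                    "--find-links", MIRROR, "requests"], check=True)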

Which systems are ‘development’ and which are ‘production’? They should look the same, be deployed the same, but you need a gateway between them. At any moment you should be prepared to repel boarders, including developers snooping where they should not and clients tugging on exposed ports. A formal process with a gatekeeper is good, but remember that codifying and practicing for emergencies makes everyone feel better on the tragic but inevitable day when disaster strikes.

You need to know who you are trusting. OS developers? Package maintainers, library authors, coworkers, contractors, clients? Figure out the data flows and the trust relationships. Document this. You need a wiki. Pick one that stores wiki pages in the filesystem, not in a database: the wiki is going to be a precious documentation source, and on the day you can’t run the wiki software but you can grep and read the files, you will thank me.

Access control. You will need to get into your system remotely, which means WireGuard or SSH or both, one over the other. You need to manage special privileges, which means logins on each machine and sudo or doas privileges. In whatever application you are building, consider your security model first and every time you make a change. Keep it separate from your infrastructure access control.

Now size the backups and make them, automatically and repeatedly. The rule of backups is this: nobody cares about backups, they only care about restores. You have three distinct backup targets:

  • oops, I deleted/changed a thing. Can I get it back fast?
    • use a snapshotted filesystem, with automatic snapshots (I like ZFS; see the rotation sketch after this list)
    • use a version control system (yes, for its own sake)
    • use a self-service per-user backup/restore system (don’t do this)
  • this computer died taking a lot of data with it. Can we restore it fast?
    • have an onsite backup to disk
    • make those backups nightly
    • have multiple copies of freshly acquired data
    • have an offsite backup of the onsite backup for that day when everything burns (or the power goes out)
    • could you have a live backup server? It costs more. That might be worthwhile.
  • the lawyer/accountant says we need to retain this for years. Can we do that efficiently?
    • encrypt that data and store the passphrase in three different secure places.
    • offsite is probably good
    • keep an onsite catalog of where you put it
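
A sketch of the first target under the assumption of ZFS, run from cron: take a timestamped snapshot, keep the newest couple of dozen, prune the rest. The dataset name is a placeholder:

    import subprocess, time

    FILESYSTEM = "tank/home"  # placeholder: your dataset
    KEEP = 24                 # snapshots retained

    # Take a new snapshot named by timestamp.
    name = f"{FILESYSTEM}@auto-{time.strftime('%Y%m%d-%H%M')}"
    subprocess.run(["zfs", "snapshot", name], check=True)

    # List our automatic snapshots oldest-first and prune the surplus.
    out = subprocess.run(["zfs", "list", "-H", "-t", "snapshot",
                          "-o", "name", "-s", "creation", FILESYSTEM],
                         capture_output=True, text=True, check=True).stdout
    snaps = [s for s in out.splitlines() if "@auto-" in s]
    for old in snaps[:-KEEP]:
        subprocess.run(["zfs", "destroy", old], check=True)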

I haven’t mentioned your load balancing, streaming database replication, second site, internal firewalls, office systems, or printing. If you can avoid ever buying a printer, do that. If you can minimize printing, do that. Buy a larger monitor rather than more reams of paper and toner. Use wired networking for every machine with a fixed location, and treat your wireless networks as outside visitors. Survey the MAC addresses of the wired machines and refuse changes without authorization. If you handle payments of any kind, read the PCI documentation and do better than it demands. You can do it: they demand the minimum that they can cope with.
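
A sketch of that MAC survey on a Linux machine, diffing the kernel’s neighbor table against a known-good inventory file. The file path is a placeholder, and in practice you would also pull tables from the switch rather than relying on one host’s view:

    import pathlib, re, subprocess

    KNOWN = pathlib.Path("/etc/known-macs.txt")  # one approved MAC per line

    # Current MAC addresses from the kernel's neighbor table (Linux).
    out = subprocess.run(["ip", "neigh"], capture_output=True, text=True,
                         check=True).stdout
    seen = set(re.findall(r"(?:[0-9a-f]{2}:){5}[0-9a-f]{2}", out))

    approved = set(KNOWN.read_text().split())
    for mac in sorted(seen - approved):
        print(f"unauthorized MAC on the wire: {mac}")  # feed to monitoring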

Buy more capacity up front. Compare fully depreciated capital assets versus the cash flow of rented/leased/flexible services, and bet that you will be in it for the long haul. If you aren’t sure, scale back. Don’t depend on the whims of giants: buy commodities that you can get from anywhere.
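
That comparison is plain arithmetic. A sketch with made-up numbers, which you would replace with real quotes:

    # Hypothetical numbers; replace with real quotes before deciding.
    purchase = 12_000       # server bought outright, USD
    lifetime_years = 5      # depreciation horizon
    rental_monthly = 450    # equivalent leased capacity, USD/month

    owned_monthly = purchase / (lifetime_years * 12)    # 200/month
    print(f"owned:  ${owned_monthly:,.0f}/month for {lifetime_years} years")
    print(f"rented: ${rental_monthly:,.0f}/month forever")
    print(f"break-even: {purchase / rental_monthly:.1f} months")  # ~26.7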

There’s always more. This is enough to get you a firm enough foundation that your organization can survive to find out what you need to do differently.

