
Writing about C++, Programming, FOST.3™, Mahlee™, the web, Thailand and anything else that catches my attention—with some photos thrown in
The latest version of Fost was tagged in our repositories a few days ago, but no changes have been made this quarter.
The next version is due out in March 2012.
| Linux & Mac |
|---|
| svn co http://svn.felspar.com/public/fost-hello/tags/4.11.12.43952 fost-hello |
| cd fost-hello |
| Boost/build |
| hello/compile |
| dist/bin/hello-world-d |
| On the Mac you will need to set DYLD_LIBRARY_PATH before running hello-world-d: |
| export DYLD_LIBRARY_PATH=dist/lib |
| dist/bin/hello-world-d |

| Windows |
|---|
| svn co http://svn.felspar.com/public/fost-hello/tags/4.11.12.43952 fost-hello |
| cd fost-hello |
| Boost\build |
| hello\compile |
| dist\bin\hello-world-gd |
Everything is available through our Subversion repository. Below are the locations for the tagged releases for Fost 4.11.12.43952 components.
These are some notes for running Postgres on Ubuntu Oneiric on a development machine.
Make sure all old Postgres installations are purged:
sudo apt-get purge postgresql postgresql-8.4
Install postgres:
sudo apt-get install postgresql
Turning off the fsync option will make Postgres run faster as it won't insist that all database changes are written to disk. Clearly do not do this if you care about the data you are writing to Postgres. This shouldn't be a problem on any development machine as they're all test databases.
Edit the server configuration:
sudo nano /etc/postgresql/9.1/main/postgresql.conf
Change:
fsync = off
Then:
sudo service postgresql restart
Making proper use of ident authentication means that you can access Postgres without needing to use a password. Many of our Postgres setup scripts for projects assume that the account you log in with is configured, via ident, as a Postgres superuser.
Run psql as the postgres user to add in your configuration:
sudo -u postgres psql
And now we need to configure ident authentication — my user name is kirit, change to whatever you log in to Ubuntu as:
create role kirit login superuser;
create database kirit with owner=kirit;
Exit psql and try again with your own account:
psql
You should now be able to see the postgres prompt again.
I have a Django user that I use across projects:
create role "Django" login superuser password 'django';
From here you can set up the databases or add any other roles that you need, as you normally would.
The latest version of Fost was tagged in our repositories a few days ago.
The big changes are support for newer versions of Boost, including 1.46 and 1.47. We have now deprecated versions prior to 1.41.
We have fixed up and tidied up a fair bit in the networking libraries. Most importantly, we have fixed a problem that was causing large data sends to get truncated under some circumstances and have fixed up some time out handling across the board.
The next version is due out in December 2011.
| Linux & Mac |
|---|
| svn co http://svn.felspar.com/public/fost-hello/tags/4.11.09.43690 fost-hello |
| cd fost-hello |
| Boost/build |
| hello/compile |
| dist/bin/hello-world-d |
| On the Mac you will need to set DYLD_LIBRARY_PATH before running hello-world-d: |
| export DYLD_LIBRARY_PATH=dist/lib |
| dist/bin/hello-world-d |

| Windows |
|---|
| svn co http://svn.felspar.com/public/fost-hello/tags/4.11.09.43690 fost-hello |
| cd fost-hello |
| Boost\build |
| hello\compile |
| dist\bin\hello-world-gd |
Everything is available through our Subversion repository. Below are the locations for the tagged releases for Fost 4.11.09.43690 components.
Downtime is always embarrassing, especially so when you get some significant downtime on a site, and kirit.com has been down for over a week.
The short reason is that the site has always been on a server lent to me by a partner company. The partner cancelled the server without letting me know and once the machine was turned off that was that. It's then taken me more than a week to get a new server set up — not because it's a difficult thing to do, but because I've just not had any time to do it. Finally over the weekend I had a few hours to get things up and running again.
I'm still waiting to receive the file data backup (for the photos), but the rest of the site is at least back up now.
I've been thrown into an infrastructure role recently, and this of course has forced me to try to codify some of what I've been doing in a manner that I can pass on to others. The fact that I've been getting about 500 emails about “warnings”, “errors” and “business as usual” per day hasn't improved my mood about taking this new role on — although I expect it'll get to become quite enjoyable once the team and I start to get a proper handle on things.
There are so many ways of trying to predict what might happen with systems, but in the end there's only one thing that we can really deal with: a real server failure for a reason that we hope to be able to identify. So in order to handle this we put alerts on all sorts of things: memory usage, CPU usage, network usage, and any other kind of usage we can hook an alerting system up to.
Our intuition says that a warning about using more than 90% of RAM means something, but the warning isn't worth anything unless we can correlate it with a concrete service failure (what you classify as a service failure is between you and your SLAs).
What we miss here is that, sure, high RAM usage may well be a factor in a failure, but it's not predictive unless high memory usage means, in let's say 90% of cases, that we're actually going to suffer a service outage. If I get an email every day because a certain process running at off-peak time uses 90% of RAM and everything remains OK, then I'm just getting spammed once a day. The warning isn't useful if there are never any problems associated with that RAM consumption. That doesn't mean that a warning of more than 80% usage during peak hours might not presage a complete outage. Our warnings need to be contextual, but above all they need to be predictive.
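As a rough illustration of what "contextual" could mean, the sketch below picks a different RAM threshold depending on the time of day. The peak window (09:00 to 18:00), the off-peak threshold and the function names are all assumptions made up for the example; only the 80% peak figure echoes the text above, and none of it is a recommendation.

```python
from datetime import datetime, time

# Hypothetical values, for illustration only.
PEAK_START = time(9, 0)
PEAK_END = time(18, 0)
PEAK_RAM_THRESHOLD = 0.80      # stricter while traffic is high
OFF_PEAK_RAM_THRESHOLD = 0.95  # a nightly job routinely reaching 90% stays quiet


def is_peak(moment: datetime) -> bool:
    """True when the moment falls inside the assumed peak window."""
    return PEAK_START <= moment.time() < PEAK_END


def should_warn(ram_fraction: float, moment: datetime) -> bool:
    """Warn only when RAM usage is unusual for this time of day."""
    threshold = PEAK_RAM_THRESHOLD if is_peak(moment) else OFF_PEAK_RAM_THRESHOLD
    return ram_fraction > threshold
```

With these made-up numbers, 85% usage at midday raises a warning, while the off-peak process sitting at 90% at 3am does not.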
That is, they need to be predictive in the scientific, Popperian sense: “this alert has a very high correlation with a problem I care enough about that I'm going to have to fix it now”. In that case the alert in itself isn't actionable, but at least it primes you to be ready for the outage that will surely follow: a “heads up”, if you will. If that correlation can't be at least 90% accurate (a number I pulled from thin air, but it needs to be very high, and I think 90% is the lowest number that might be useful) then people will learn to ignore the warnings. And once warnings can be ignored you might as well not bother issuing them.
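One way to put a number on this is to go back over the alert history and count how often an alert was actually followed by an outage. The sketch below does that under some assumptions: it uses the fraction of alerts followed by an outage (precision) as a stand-in for "correlation", the one hour window is arbitrary, and `alert_precision` and `worth_sending` are hypothetical names rather than part of any monitoring tool.

```python
from datetime import datetime, timedelta


def alert_precision(alerts, outages, window):
    """Fraction of alerts followed by an outage within `window`.

    `alerts` and `outages` are lists of datetimes. Returns None when
    there are no alerts, because the fraction is undefined.
    """
    if not alerts:
        return None
    followed = sum(
        1
        for alert in alerts
        if any(alert <= outage <= alert + window for outage in outages)
    )
    return followed / len(alerts)


def worth_sending(alerts, outages, window=timedelta(hours=1), minimum=0.9):
    """True when the alert's history clears the (assumed) 90% bar."""
    precision = alert_precision(alerts, outages, window)
    return precision is not None and precision >= minimum
```

An alert that fired ten times and preceded nine outages would just pass; the daily off-peak RAM email that never precedes anything would score zero and be dropped.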
The basic premise here is that every alert must be actionable in some way. If they're not actionable then they're purely diagnostic, and there's nothing wrong with that: we need a huge amount of data from before and after failures in order to work out which factors caused them. Hopefully once we know that we can take steps to codify this into a predictive warning that tells us what we need to do (or even better, an automated notice that tells us what was reconfigured to ensure the problem doesn't arise). Of course, if the correlation is high enough and the downside low enough, then we should just automate the mitigation so that solving the problem becomes part of normal system operation.
Diagnostic alerting that isn't actionable, however, is just a distraction, and we end up ignoring all of it. What I want to see is post mortems of real-world failures showing a good enough correlation that we can issue a warning when certain things happen and use that information to try to pre-empt a failure. If we can always pre-empt the failure then the fix should become part of our normal operating procedures (and be fully automated) and no longer require a warning.
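The decision described in the last two paragraphs can be written down as a small routing function. This is only a sketch of the idea: the `Handling` names, the `choose_handling` function and the 90% default are assumptions for illustration, and `precision` here means the measured fraction of past occurrences of a signal that were followed by a real failure.

```python
from enum import Enum


class Handling(Enum):
    AUTOMATE = "automate the mitigation"
    WARN = "send a warning to a person"
    RECORD = "record for post mortems only"


def choose_handling(precision, mitigation_is_low_risk, minimum_precision=0.9):
    """Decide what to do with a signal.

    `precision` is None when we have no history for the signal yet.
    """
    # Not predictive enough (or unknown): keep the data, notify nobody.
    if precision is None or precision < minimum_precision:
        return Handling.RECORD
    # Predictive and cheap to fix: make the fix part of normal operation.
    if mitigation_is_low_risk:
        return Handling.AUTOMATE
    # Predictive, but the response needs human judgement.
    return Handling.WARN
```

Nothing that lands in `RECORD` ever reaches an inbox; it only exists so there is data to examine after the next real failure.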
Now, of course, none of this should detract from “error” reporting — that being reporting that something is broken. The big problem with the technologies that we use today (especially at scale) is that “errors” aren't hard and fast. Transient problems come and go all the time, and we have to deal with that as a reality.
With a simple system it's possible to say either it works or it doesn't, but once you have more than a couple of services working together to serve results, and you have a bit of fault tolerance thrown in, you can't talk about a system being up or down, only parts of it being up or down. What to do about this is something I think we still need to work out on a case by case basis.
I hate taxonomising, but still… The essential tl;dr is here:
An error needs to be reserved for an actual system outage. There is something concrete wrong that needs immediate action by somebody smart enough to look at the full context and decide what is the acceptable response to take — and who has authority to take it (this last bit seems to be missing far too often in case studies you see on the internet).
Warnings I'm still less sure about, but some ideas are:
The idea here is that even warnings are actionable. Non-actionable messages hitting my email, pager, phone or anything else are essentially spam, and I really don't want to be setting our systems up to spam myself or anybody else.
I'm hoping that after working this through for a month or so on all of our systems I can reduce my emails from more than 500 per day saying something might be worth looking into to about 3 per week that are something that certainly needs doing.
If we can be clever enough in how we deal with all of this my plan is to bring it down to around 3 per year. I'll live in hope :)