kirit.com

Created 26th April, 2005 13:12 (UTC), last edited 5th July, 2007 06:59 (UTC)

Writing about C++, Programming, FOST.3™, Mahlee™, the web, Thailand and anything else that catches my attention—with some photos thrown in

Fost 4 release 4.11.12.43952 now out

Posted 28th December, 2011 08:59 (UTC), last edited 29th December, 2011 03:26 (UTC)

The latest version of Fost was tagged in our repositories a few days ago.

No changes have been made this quarter.

The next version is due out in March 2012.

Linux & Mac
svn co http://svn.felspar.com/public/fost-hello/tags/4.11.12.43952 fost-hello
cd fost-hello
Boost/build
hello/compile
dist/bin/hello-world-d

On the Mac you will need to set DYLD_LIBRARY_PATH before running hello-world-d

export DYLD_LIBRARY_PATH=dist/lib
dist/bin/hello-world-d

Windows
svn co http://svn.felspar.com/public/fost-hello/tags/4.11.12.43952 fost-hello
cd fost-hello
Boost\build
hello\compile
dist\bin\hello-world-gd

Download locations

Everything is available through our Subversion repository. Below are the locations for the tagged releases for Fost 4.11.12.43952 components.



Running Postgres for development on Oneiric

Posted 18th October, 2011 03:27 (UTC), last edited 18th October, 2011 03:31 (UTC)

These are some notes on running Postgres on a development machine under Oneiric.

Upgrading from Natty

Make sure all old Postgres installations are purged:

sudo apt-get purge postgresql postgresql-8.4

Getting Postgres 9.1 up and running

Install postgres:

sudo apt-get install postgresql

Only on a development machine

Turning off the fsync option will make Postgres run faster as it won't insist that all database changes are written to disk. Clearly do not do this if you care about the data you are writing to Postgres. This shouldn't be a problem on any development machine as they're all test databases.

Edit the server configuration:

sudo nano /etc/postgresql/9.1/main/postgresql.conf

Change the fsync setting (uncommenting the line if necessary) to:

fsync = off

Then:

sudo service postgresql restart

Configure your ident authentication

Making proper use of ident authentication means that you can access Postgres without needing to use a password. Many of our project set up scripts for Postgres assume that the account you authenticate as through ident is a Postgres superuser.

Run psql as the postgres user to add in your configuration:

sudo -u postgres psql

Now create a role and a database that match your login name so that ident authentication can map you to them. My user name is kirit; change it to whatever you log in to Ubuntu as:

create role kirit login superuser;
create database kirit with owner=kirit;

Exit psql and try again with your own account:

psql

You should now be able to see the postgres prompt again.

I have a Django user that I use across projects:

create role "Django" login superuser password 'django';

From here you can set up the databases and add any other roles you need as you normally would.



Fost 4 release 4.11.09.43690 now out

Posted 27th September, 2011 07:06 (UTC), last edited 27th September, 2011 07:15 (UTC)

The latest version of Fost was tagged in our repositories a few days ago.

The big changes are support for newer versions of Boost, including 1.46 and 1.47. We have now deprecated versions prior to 1.41.

We have fixed up and tidied up a fair bit in the networking libraries. Most importantly, we have fixed a problem that was causing large data sends to get truncated under some circumstances and have fixed up some time out handling across the board.

The next version is due out in December 2011.

Linux & Mac
svn co http://svn.felspar.com/public/fost-hello/tags/4.11.09.43690 fost-hello
cd fost-hello
Boost/build
hello/compile
dist/bin/hello-world-d

On the Mac you will need to set DYLD_LIBRARY_PATH before running hello-world-d

export DYLD_LIBRARY_PATH=dist/lib
dist/bin/hello-world-d

Windows
svn co http://svn.felspar.com/public/fost-hello/tags/4.11.09.43690 fost-hello
cd fost-hello
Boost\build
hello\compile
dist\bin\hello-world-gd

Download locations

Everything is available through our Subversion repository. Below are the locations for the tagged releases for Fost 4.11.09.43690 components.

Detailed change log

fost-base

  • Changed some test logging messages so they don't look like real errors any more.
  • Added support for Boost 1.46.
  • Added a time logging function that logs any individual test that takes more than ten seconds to run.
  • Added a timer to the datetime library which allows us to time how long parts of a program execution take.
  • The insert functions for JSON values now use a coerce rather than a JSON constructor so the API will work with more data types.

fost-internet

  • Connection errors now report the host and port they're trying to connect to.
  • The pop3 tests now allow the server to be configured, and it is more aggressive in keeping the mailbox empty.
  • The pop client includes some extra logging describing what it is doing.
  • Made a change to the TCP time out handling to support the version of gcc that Macs use.
  • Made some changes to support Boost 1.46.0.
  • Fixed a bug that was causing occasional data packets to be lost when sending large blocks of data.
  • Implemented a proper exception type for a particular networking error.
  • Improved the time out handling for large downloads where the download size is known in advance.
  • Added in a connect timeout which defaults to ten seconds.

fost-py

  • Boost 1.46.0 is now properly supported.


Embarrassing

Posted 11th July, 2011 07:06 (UTC), last edited 11th July, 2011 07:18 (UTC)

Downtime is always embarrassing, especially when there is a lot of it, and kirit.com has been down for over a week.

The short reason is that the site has always been on a server lent to me by a partner company. The partner cancelled the server without letting me know and once the machine was turned off that was that. It's then taken me more than a week to get a new server set up — not because it's a difficult thing to do, but because I've just not had any time to do it. Finally over the weekend I had a few hours to get things up and running again.

I'm still waiting to receive the file data backup (for the photos), but the rest of the site is at least back up now.



Errors and warnings

Posted 23rd June, 2011 18:43 (UTC), last edited 24th June, 2011 14:41 (UTC)

I've been thrown into an infrastructure role recently, and this of course has forced me to try to codify some of what I've been doing in a manner that I can pass on to others. The fact that I've been getting about 500 emails about “warnings”, “errors” and “business as usual” per day hasn't improved my mood about taking this new role on — although I expect it'll get to become quite enjoyable once the team and I start to get a proper handle on things.

There are so many ways of trying to predict what might happen with systems, but in the end there's only one thing that we can really deal with: a real server failure for a reason that we hope to be able to identify. So in order to handle this we put alerts on all sorts of things: memory usage, CPU usage, network usage, and any other kind of usage we can hook an alerting system up to.

Our intuition that a memory warning saying we're using more than 90% of RAM means something isn't actually worth anything unless we can correlate it with at least one concrete service failure (what you classify as a service failure, well, that's between you and your SLAs).

What we miss here is that, sure, high RAM usage may well be a factor in a failure, but it's not predictive unless nearly every instance of high memory usage (let's say 90% of cases) means we're actually going to suffer a service outage. If I get an email every day because a certain process running at off peak time uses 90% of RAM and everything remains OK then that just means I'm getting spammed every day. The warning isn't useful if there are never any problems associated with RAM consumption. That doesn't mean that a warning of more than 80% usage during peak time might not presage a complete outage. Our warnings need to be contextual, but above all they need to be predictive.
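One way to picture a contextual warning is a threshold that depends on when the reading was taken. The sketch below is only an illustration: the peak window and the off peak threshold are numbers made up for the example, and `should_warn` is a hypothetical helper rather than part of any real monitoring tool.

```python
from datetime import time

# Assumed values for illustration only. The 80% peak figure echoes the
# example above; the peak window and off peak threshold are invented.
PEAK_START = time(9, 0)
PEAK_END = time(18, 0)
PEAK_THRESHOLD = 0.80
OFF_PEAK_THRESHOLD = 0.95


def is_peak(at: time) -> bool:
    """True when the reading was taken inside the assumed peak window."""
    return PEAK_START <= at < PEAK_END


def should_warn(ram_fraction: float, at: time) -> bool:
    """Warn only when RAM usage is unusual for that time of day."""
    threshold = PEAK_THRESHOLD if is_peak(at) else OFF_PEAK_THRESHOLD
    return ram_fraction > threshold


# The same 90% reading is judged differently at 03:00 and at 12:00.
print(should_warn(0.90, time(3, 0)))
print(should_warn(0.90, time(12, 0)))
```

The thresholds themselves would have to come from looking at past failures, not from intuition, which is the point of the paragraphs that follow.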

That is, they need to be predictive in the scientific, Popperian sense of meaning “this alert has a very high correlation with a problem I care enough about that I'm going to have to fix it now”. In that case the alert in itself isn't actionable, but at least it primes you to be ready for the outage that will surely follow: a “heads up” if you will. If that correlation can't be at least 90% accurate (a number I pulled from thin air, but it needs to be very high, and I think 90% is the lowest number that might be useful) then people will learn to ignore the warnings. And once warnings can be ignored you might as well not bother issuing them.
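To put a number on that correlation you can replay history: take the times an alert fired and the times real outages started, and count how many alerts were followed by an outage soon after. The following is a minimal sketch under assumptions made up for this example. The 30 minute window, the function names and the sample data are all invented; only the 90% cut off comes from the text above.

```python
from datetime import datetime, timedelta


def alert_precision(alerts, outages, window=timedelta(minutes=30)):
    """Fraction of alerts followed by an outage starting within `window`.

    Returns None when there were no alerts, since there is nothing to measure.
    """
    if not alerts:
        return None
    hits = sum(
        1
        for fired in alerts
        if any(fired <= began <= fired + window for began in outages)
    )
    return hits / len(alerts)


def worth_sending(precision, minimum=0.9):
    """Keep a warning only if it predicts outages often enough."""
    return precision is not None and precision >= minimum


# Invented history: three alerts, only the first preceded an outage.
day = datetime(2011, 6, 20, 2, 0)
alerts = [day, day + timedelta(days=1), day + timedelta(days=2)]
outages = [day + timedelta(minutes=10)]

precision = alert_precision(alerts, outages)
print(f"{precision:.0%} of alerts preceded an outage")
print("keep" if worth_sending(precision) else "demote to a diagnostic log")
```

An alert that scores this badly is the daily RAM email from earlier: it fires, nothing happens, and people learn to ignore it.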

The basic premise here is that every alert must be actionable in some way. If they're not actionable then they're purely diagnostic — and there's nothing wrong with that. We need a huge amount of data before and after failures in order to try to work out what the factors that caused it were. Hopefully once we know that we can take steps to codify this into a predictive warning that tells us what we need to do (or even better an automated warning that tells us what was reconfigured to ensure the problem doesn't arise). Of course, if the correlation is high enough, and the downside low enough then we should just automate the mitigating factors in such a way that the solving of the problem becomes part of the normal system operation.

Diagnostic alerting which isn't actionable, however, is just a distraction. We end up ignoring all of it. What I want to see is post mortems of real world failures showing a good enough correlation that we can issue a warning when certain things happen and use that information to try to pre-empt a failure. If we can always pre-empt the failure then the fix should become part of our normal operating procedures (and be fully automated) and no longer require a warning.

Errors

Now, of course, none of this should detract from “error” reporting — that being reporting that something is broken. The big problem with the technologies that we use today (especially at scale) is that “errors” aren't hard and fast. Transient problems come and go all the time, and we have to deal with that as a reality.

With a simple system it's possible to say either it works or it doesn't, but once you have more than a couple of services working together to serve results, and you have a bit of fault tolerance thrown in, you can't talk about a system being up or down, only parts of it being up or down. What to do about this is something I think we still need to work out on a case by case basis.

Taxonomising — :(

I hate taxonomising, but still… The essential tl;dr is here:

An error needs to be reserved for an actual system outage. There is something concrete wrong that needs immediate action by somebody smart enough to look at the full context and decide what is the acceptable response to take — and who has authority to take it (this last bit seems to be missing far too often in case studies you see on the internet).

Warnings I'm still less sure about, but some ideas are:

  • A warning means that an error is very likely going to follow to the extent that if I don't get ready to deal with the error now I'm going to have a service outage, or the outage is going to be longer than it needs to be. The idea here being that if you're away from your desk now might be a good time to return to it because something important looks likely to happen in the next few minutes.
  • A warning needs to be that something concrete has changed. It's an automated system that says "I've changed something because this happened" and the warning allows an operator to override that before something else goes horribly wrong. After all, I hope the operators know a thing or two that the software doesn't!
  • A warning tells us that there is a very high correlation of failure if operator action isn't taken, but the action is itself so dangerous that other factors (which the script isn't taking into account) need to be thought about before something is actually done. For example, the system needs a reboot to handle all of the requests coming in, but the decision whether to do the reboot and kill the requests that are in progress needs to be taken by a human who really understands the priorities of what is being done.

The idea here is that even warnings are actionable. Non-actionable messages hitting my email, pager, phone or anything else are essentially spam and I really don't want to be setting our systems up to be spamming myself or anybody else.
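The taxonomy above can be sketched as a small routing rule. This is an illustration under assumptions, not a description of any real alerting system: the `Event` fields, the `classify` and `route` helpers and the channel names are all hypothetical.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Level(Enum):
    ERROR = "error"            # a concrete outage that needs action now
    WARNING = "warning"        # actionable: get ready, override, or decide
    DIAGNOSTIC = "diagnostic"  # data for post mortems, nobody is notified


@dataclass(frozen=True)
class Event:
    summary: str
    outage: bool = False           # is something concretely broken right now?
    action: Optional[str] = None   # what the operator is expected to do


def classify(event: Event) -> Level:
    """Apply the taxonomy: outage first, then actionable, else diagnostic."""
    if event.outage:
        return Level.ERROR
    if event.action:
        return Level.WARNING
    return Level.DIAGNOSTIC


def route(event: Event) -> str:
    """Pick a (hypothetical) channel; only actionable events reach a person."""
    return {
        Level.ERROR: "pager",
        Level.WARNING: "email",
        Level.DIAGNOSTIC: "log",
    }[classify(event)]


# Invented sample events, one for each level.
events = [
    Event("checkout service is returning 500s", outage=True),
    Event("request queue filling up", action="decide whether to reboot"),
    Event("nightly batch used 90% of RAM"),
]
for event in events:
    print(route(event), "-", event.summary)
```

The design choice worth noting is that a message with no action attached never reaches a person. It is still recorded, so the data is there when a post mortem needs it.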

I'm hoping that after working this through for a month or so on all of our systems I can reduce my emails from more than 500 per day saying something might be worth looking into to about 3 per week about something that certainly needs doing.

If we can be clever enough in how we deal with all of this my plan is to bring it down to around 3 per year. I'll live in hope :)


Categories:

Felspar