How to build a better (SSL proxy) mouse trap (with lighttpd)?

We run our web farm behind a pair of Citrix Netscalers in both San Jose and Amsterdam. What really hits these boxes hard is the SSL offloaded traffic, which in certain instances has caused the Netscalers to fall over on themselves.

Right now our setup looks like:

Current Netscalers setup

(Pardon the shapes, it’s what I have to work with.)

So I’ve been toying with ideas for scaling our web infrastructure horizontally. What I really want is some sort of “simple” high-performance caching reverse proxy that can terminate a large number of SSL client connections (and compress too!).

I’ve been looking at lighttpd (I’d look at Varnish but it’s not yet where I need it to be). My thinking is to use something to load balance between a pool of lighttpd servers, which would terminate SSL sessions and proxy back to the Netscaler. There I can take advantage of its caching, its caching policy engine and its global load balancing. I’m tied to the latter because we’re using dynamic proximity probes to load balance - the alternative would be to hack up BIND to use some sort of geo-ip database (which wouldn’t account for brownouts).

That setup might look like this:

Load Balancing with lighttpd

Of course, in places like Amsterdam, the backend servers would be nearly on the other side of the planet, in San Jose.

The question is whether lighttpd could handle the SSL load that we currently see which, during a non-release cycle, hits around 3000 SSL transactions/second (and nearly double that during a release) or how many lighttpd front-ends I’d need to run to match that.
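As a rough sizing sketch, the number of front-ends is the peak rate divided by what one box can sustain, rounded up, plus a spare for failover. The 3000/second figure is the one quoted above and the release rate is rounded up to double it; the per-server capacity of 500 handshakes/second is purely a placeholder assumption until someone benchmarks a real lighttpd box.

```python
import math


def frontends_needed(peak_tps: float, per_server_tps: float, spares: int = 1) -> int:
    """Servers required to carry peak_tps, plus `spares` extra for failover."""
    if per_server_tps <= 0:
        raise ValueError("per_server_tps must be positive")
    return math.ceil(peak_tps / per_server_tps) + spares


# Rates from the post; "nearly double" during a release is rounded up to 2x.
NORMAL_TPS = 3000
RELEASE_TPS = 2 * NORMAL_TPS

# ASSUMPTION: placeholder capacity of one lighttpd front-end, not a measurement.
PER_SERVER_TPS = 500

if __name__ == "__main__":
    for label, tps in (("normal", NORMAL_TPS), ("release", RELEASE_TPS)):
        print(label, frontends_needed(tps, PER_SERVER_TPS))
```

Swapping in a measured per-server number is the whole point of the test; the formula itself is the easy part.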

One option would be to use hardware SSL accelerators. The Citrix Netscaler 12000 appears to use two Cavium Networks NITROX cards. Presumably if I had even one of those I could hit anywhere between 14k and 28k SSL trans/second. The guys over at Zeus seem to think that a dual dual-core Opteron running some 64-bit OS could match or outperform hardware accelerators.

I suppose the only real way to find out is to test it!

I’m still trying to deploy this, and since I’m still waiting for the new Netscalers to arrive in Amsterdam (and don’t have any Cavium cards), I went ahead and set up two lighttpd servers (both HP DL360s running 32-bit RHEL4) behind the current Netscaler 7000s in Amsterdam. The Netscaler passes all traffic to lighttpd, which terminates SSL and proxies back to the Netscaler (which has better cache mechanisms), which in turn proxies back to the real addons.mozilla.org site. And yes, that was a confusing setup to configure.

I’d be really interested in getting feedback from folks (and performance numbers on lighttpd + SSL, if anyone has them) on how this setup works. If you want to play:

The IP there is 63.245.213.31; you should be able to test by pointing addons.mozilla.org at it in your hosts file.
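If you would rather not touch your hosts file, a small script can connect to the test IP directly while still presenting the real site name for SNI and the Host header. This is only a sketch: it assumes the site name is addons.mozilla.org, that the box answers on port 443, and that its certificate verifies for that name. The function names are made up for illustration.

```python
import socket
import ssl
import time

TEST_IP = "63.245.213.31"        # test address from the post
HOSTNAME = "addons.mozilla.org"  # ASSUMPTION: the name the test box serves


def build_request(hostname: str, path: str = "/") -> bytes:
    """Minimal HTTP/1.1 GET carrying the real site name in the Host header."""
    return (
        f"GET {path} HTTP/1.1\r\n"
        f"Host: {hostname}\r\n"
        "Connection: close\r\n\r\n"
    ).encode("ascii")


def timed_fetch(ip: str, hostname: str, path: str = "/", timeout: float = 10.0):
    """Connect to `ip`, do the TLS handshake as `hostname`, return (status line, handshake seconds)."""
    ctx = ssl.create_default_context()
    start = time.perf_counter()
    with socket.create_connection((ip, 443), timeout=timeout) as raw:
        # wrap_socket performs the handshake on an already connected socket.
        with ctx.wrap_socket(raw, server_hostname=hostname) as tls:
            handshake_seconds = time.perf_counter() - start
            tls.sendall(build_request(hostname, path))
            status_line = tls.recv(4096).split(b"\r\n", 1)[0].decode("latin-1")
    return status_line, handshake_seconds


if __name__ == "__main__":
    status, seconds = timed_fetch(TEST_IP, HOSTNAME)
    print(f"{status} (connect + TLS handshake: {seconds * 1000:.0f} ms)")
```

The timing covers the TCP connect plus the SSL handshake, which is the part this setup is meant to speed up for people near Amsterdam.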

Notes about the install:

  • public pages are cached entries delivered from the local NS
  • logged-in pages are actually from the sjc cluster
  • admin/dev/editor pages are all still from the sjc cluster

Enjoy!

How do you manage a lot of log files from a lot of hosts?

You use Splunk, of course! And for those that hadn’t noticed, the 3.0 beta went out yesterday. You should probably get it.

Update: Under their beta notes I see this: “Only the Firefox browser is supported in this beta. Support for IE will be added later.”

I won’t shed any tears over that :)

I’m a Mac? and why the open Web rocks.

With prodding from co-workers I’ve shed the final vestiges of my Windows life. A couple months ago I switched my portable computing platform to a black MacBook that no one wanted (it seemed the perfect size to take with me on my trip to Amsterdam) and now I’ve replaced my desktop with a Mac Pro.

I don’t know if I’d ever call myself a Windows user - by and large it was what was available and what had games and applications.

I first started out on a VIC20 (admittedly, this was a Christmas gift for Dad but it wound its way into my room where I spent countless hours typing in BASIC programs from Compute’s Gazette), moved to a Commodore 64 and finally graduated to an Amiga (which was like Windows but with a usable command shell).

Commodore wasn’t long for the world and I eventually sold the Amiga to live in a DOS modem dialer so I could dial up to the ISP I worked at (where my “desktop” was a Sparc 2). This was a time when the most exciting graphical app on the Internet was gopher. Yes, it was a couple years ago.

Windows never interested me until I got involved in the Windows 95 beta, and that’s when I bought my first PC (a 90MHz Pentium; yes, that too was a while ago). I’ve been pretty Windows-centric since but always missed a usable shell. I’ve never been into running Linux (or some variant with X), mostly because of the lack of applications for what I needed to do. The fact of the matter is that the world runs on Microsoft Office, OpenOffice is not the same as Office, and nothing like Visio exists (does it?) on anything other than Windows.

But my reality has recently changed.

  1. OS9 sucked, OSX doesn’t
  2. I have two kids, playing games seems like a quaint pastime
  3. The Network really is the Computer
  4. The Web is the platform
  5. Parallels is awesome

Sun had it right a couple years ago when they proclaimed “The Network is the Computer” (and even here). I spend the majority of my computing time either in a web browser (Firefox, of course), my email (Thunderbird) or logged in through ssh to a remote computer. My data, or at least the part that needs to be available on all machines, lives in a private svn repository. I use Google Apps. None of these are platform specific anymore.

I’ve moved into a world where platform independence and, by extension, an open and free web are increasingly important. Those sites that require me to use IE or some ActiveX control are increasingly going to fall out of my browser history.

ps. I’m replacing my home computer too.

Another 100 miles, Delta Century Ride Report

(This would have been posted earlier this week but I’ve had no end of problems trying to get iframes in Wordpress-MU working.)

I wrapped up my second century this year over in Lodi, CA this past Sunday doing the Delta Century. Total ride time came in at 6:29.

Ride had a number of firsts for me:

  1. First century with hardly any elevation gain
  2. First century where it didn’t rain on me
  3. First century where I biked across bridges and took a ferry to cross the river
  4. First century with massive headwinds!

The winds were the worst part of the ride and made me wish for some hill to climb instead (at least then I’d know when the pain was over). I was either biking into the wind or trying not to get blown over from it. By mile 60 I had had enough and by 80 I just wanted to be done with it.

Weather forecast for Sunday was supposed to be in the high 80s but I never felt any of that heat and, in fact, kept my arm warmers on for the length of the ride.

mrz @ Delta Century 2007

Anyways, a good workout nonetheless and I got to see a part of California I wouldn’t have otherwise seen. Next ride isn’t for a month, down in Riverside County for Ride Around the Bear (with lots of climbing!). Who’s in with me?

What do you do with a bunch of Celerons?

In an effort to redefine the meaning of “performance”, we turn them into a performance cluster (where slower is better).

As Boris Zbarsky said,

If the hardware is so fast that, say, the Date.now() resolution (or whatever perl module we use) is insufficient to accurately time the tests, then we have a problem.
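To make that concern concrete, here is a small sketch that measures how coarse a clock actually is and checks whether a test runs long enough to be timed with it. It uses Python's time.time() as a stand-in for Date.now(); the helper names and the 100-tick threshold are arbitrary assumptions for illustration, not anything the test harness is known to use.

```python
import time


def observed_resolution(clock=time.time, samples: int = 50) -> float:
    """Smallest non-zero step seen between successive clock reads, in seconds."""
    smallest = float("inf")
    for _ in range(samples):
        t0 = clock()
        t1 = clock()
        while t1 == t0:  # spin until the clock visibly ticks
            t1 = clock()
        smallest = min(smallest, t1 - t0)
    return smallest


def is_measurable(test_duration_s: float, resolution_s: float, min_ticks: int = 100) -> bool:
    """True if the test spans at least `min_ticks` clock ticks (ASSUMED threshold)."""
    return test_duration_s >= min_ticks * resolution_s


if __name__ == "__main__":
    res = observed_resolution()
    print(f"observed clock step: {res * 1e6:.2f} microseconds")
    print("5 ms test measurable:", is_measurable(0.005, res))
```

On a fast machine a test can finish inside a handful of ticks, which is exactly why slower hardware gives steadier numbers here.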

This follows on from the Mac Mini cluster we built out a couple weeks back. QA will now have a pool of “desktop class” machines running Windows XP and Ubuntu.

On a technical note, these ten machines are ColoBlade C10 PC Blades from www.colomachine.com, a unique blade-like chassis solution that fits ten 1U machines into 8U of space. However, unlike a traditional blade system, power and ethernet are separate for each “blade”, as you can see below.

QA Performance Cluster

Where in the world is AMO? (Part III: It’s Dead.)

Shortly before 12:30am PDT I had to roll back the DNS changes to AMO and serve it only out of San Jose. Around this time, Europe started coming online and pushed traffic loads up, exhausting the capabilities of the Netscalers in Amsterdam.

The Bad
Unfortunately, when SSL transactions hit nearly 900 a second, the CPU was pegged at 100% and the box started failing external health checks and performing “oddly”.

SSL Transactions / second

I mentioned elsewhere that the pair in Amsterdam is a pair of Netscaler 7000s without hardware SSL offloading. The glossy material from Citrix says I should be able to get 4400 SSL trans/second. Admittedly the box is doing more than just SSL (caching, compressing, RTT probes), but not even getting to 1/4 of that number sucks.

(We had exactly the same problem with the 9000s (4400 SSL tps) and 10ks (8800 SSL tps) - during release periods we’d easily top out at more than 3k SSL trans/sec, below their 4400/8800 marks, and the boxes would fall over on themselves. We’re now running on the 12ks, which have two SSL hardware cards and two CPUs and perform much better, but I’m not sure where Citrix gets its numbers.)
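For what it’s worth, here is the arithmetic behind those complaints as a quick sketch, putting the observed rates next to the datasheet figures quoted above. The 3000 figure for the 9000s and 10ks is a lower bound (“more than 3k”), so their real fractions were somewhat higher than shown.

```python
# (model, SSL tps observed when the box struggled, rated SSL tps) from the post.
# The 3000 values are a lower bound: the post says "more than 3k".
OBSERVATIONS = [
    ("Netscaler 7000", 900, 4400),
    ("Netscaler 9000", 3000, 4400),
    ("Netscaler 10k", 3000, 8800),
]


def fraction_of_rated(observed_tps: float, rated_tps: float) -> float:
    """Share of the datasheet SSL rate that was actually reached."""
    return observed_tps / rated_tps


if __name__ == "__main__":
    for model, observed, rated in OBSERVATIONS:
        share = fraction_of_rated(observed, rated)
        print(f"{model}: {observed}/{rated} = {share:.0%} of rated")
```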

The Good
On the success side, AMO quite quickly started pushing a significant amount of bandwidth out of Amsterdam -

AMO Traffic

I rolled back before peak traffic but during this time frame, a good 11% of AMO traffic was sourced out of Amsterdam and I got a lot of feedback from other channels that performance was quicker.

What’s next?
So what’s the next step? I’ll be shipping out a replacement pair of Netscaler 9000s this week that do have an SSL offload card and we’ll re-try this in a couple weeks when they’re online.

While the Netscaler clearly failed to keep up with the load, I should point out that I’m a huge fan of the product. If I had to build out some non-commercial solution using lighttpd or squid or something else to handle AMO (and the SSL traffic and load balancing and GSLB and HA), I’d have spent more than I spent on the Netscalers.

ps. Anyone more local to Amsterdam who wants to help with racking?

Where in the world is AMO? (Part II: It’s live!)

Pushed out the DNS changes to addons.mozilla.org about 36 minutes ago. Those of you on the other side of the planet should see much improved page load times.

Where in the world is AMO?

Dam Square, Amsterdam at Night

After a week or so of testing, we’re ready to flip the switch on getting addons.mozilla.org to be served out of Amsterdam as well as San Jose. I talked about how we’re doing this last week if you’re interested. We’ll make this change during our normal Tuesday maintenance window.

Amsterdam Colo

Hopefully the Netscalers in Amsterdam will be able to handle the load - they are three hardware versions below what we have in San Jose and I’d really hate to have to back out. If it does work, folks living in closer proximity to Amsterdam should see much better page load times!

What else can you do with a Mac mini?

I guess we’re not the only ones to do strange things with Mac Minis.

These guys are using them as routers at LINX!

Mac Mini Router

Where’s Mozilla traffic going?

One of our objectives for this year was to get better trending of network traffic and flows for capacity planning, colo planning and attack mitigation.

The NetFlow tool we’re using also lets me run a report of top destination ASNs (the networks we send the most traffic to), as well as a bunch of other reports, three of which I’ll show below.

(Incidentally, this is the best tool I’ve come across for the price, and yes, I know there are free open source solutions but I’ve never found cflowd or others easy to set up.)

Because of our traffic dispersion, the top twenty destination sites typically get between 1 and 4Mbps each. All of the other traffic is lumped into “Other”, with end networks getting far less than 1Mbps. Nonetheless, this gives an interesting idea of where our users are, by bandwidth.
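The “top twenty plus Other” report is easy to picture as a small aggregation. The sketch below is not how the NetFlow tool works internally; it just assumes flow records reduced to (destination ASN, bytes) pairs, and the ASNs in the usage example are from the range reserved for documentation.

```python
from collections import Counter


def top_destinations(flows, n: int = 20):
    """Sum bytes per destination ASN; return (top n as [(asn, bytes)], bytes lumped into Other)."""
    totals = Counter()
    for asn, nbytes in flows:
        totals[asn] += nbytes
    top = totals.most_common(n)
    other = sum(totals.values()) - sum(nbytes for _, nbytes in top)
    return top, other


def mbps(nbytes: int, seconds: float) -> float:
    """Average rate in megabits per second over the reporting window."""
    return nbytes * 8 / seconds / 1_000_000


if __name__ == "__main__":
    # Made-up flow records for illustration only.
    sample = [(64496, 300_000_000), (64497, 150_000_000),
              (64496, 75_000_000), (64498, 10_000_000)]
    top, other = top_destinations(sample, n=2)
    for asn, nbytes in top:
        print(f"AS{asn}: {mbps(nbytes, 300):.1f} Mbps")
    print(f"Other: {mbps(other, 300):.1f} Mbps")
```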
