Tag Archives: configuration

An introduction to Puppet

Perfectly positioned to automate the infrastructure behind both private and public clouds (and a darling of the burgeoning DevOps scene), Puppet has seen a groundswell of adoption in recent years. It’s undoubtedly very capable but may not be what some enterprises expect.

For those not familiar with Puppet, it’s a tool which helps to automate system administration tasks. PuppetLabs, the company behind it, has managed to build a large mindshare and strong brand recognition although it’s still a relatively small company of around 190 staff globally, headquartered in Portland, Oregon in the US. The London-based team is actively growing (interested in a job with PuppetLabs?) and the first user group meeting in London recently attracted 45 people at pretty short notice. Their financial results speak for themselves, with year-on-year sales more than tripling and over 9 million downloads. Pretty impressive for a company which in 2010 only had 11 staff! They’re not the only show in town (Chef, Salt Stack, and Ansible are notable competition) but they seem to be getting the most traction.

Puppet’s success stems from the VM sprawl ushered in by virtualisation, combined with the availability of cloud infrastructures which can scale rapidly and on demand. If you need to quickly spin up hundreds, maybe thousands, of servers and guarantee that their configuration is identical and correct, how would you do it? How do you manage the rapid releases required by your software development lifecycle, especially if you’re aiming for continuous delivery? How do you deal with configuration drift in your test and development environments? This is where Puppet comes to the rescue.

I’ve been keeping an eye on Puppet as a configuration management tool since 2009 when it first popped up on my radar (maybe it was the ThoughtWorks Radar). At the time I was looking for tools to help deploy RedHat Linux 4.6 but sadly I didn’t opt for Puppet – in hindsight I consider that a missed opportunity! Earlier this year it was covered at the London VMUG and I’ve recently had conversations with PuppetLabs staff both at VMworld Europe (Jose Palafox) and in the UK (Steve Thwaites). Have a read of the official PuppetLabs intro then continue reading to get my initial thoughts.

Puppet comes in two flavours…

BetterWPSecurity – a great WordPress plugin but proceed with caution

I’ve recently installed the BetterWPSecurity WordPress plugin, and found that while it’s very useful and does increase the security of WordPress it can also break your site.

Ah, Monday morning and the start of my three months’ paternity leave looking after my six-month-old son Zach. During his morning nap I logged into my blog to work on an article and noticed that my blog wasn’t loading articles correctly even though the home page worked just fine. Investigating further and looking at my site stats (I use both the Jetpack plugin and Google Analytics) clearly showed that something broke at the start of the weekend – I had nearly no traffic all weekend. Having just referred a colleague to my site for some information, and on my first day of paternity leave (ie less time on my hands, not more as some may think), this was definitely not ideal timing!

My first step was to check my logs for information, in this case the BetterWPSecurity log for changed files. This revealed that the .htaccess file in the root directory was changed late on Friday night at 11:35pm – and I knew that wasn’t me as I was tucked up in bed. My first thought was a hack, as the .htaccess file controls access to the site, but there was no redirect or site graffiti and the homepage still worked so that didn’t seem likely. I logged in via SSH to have a look at the .htaccess file but didn’t see anything obvious, although I’m no WordPress expert.


My priority was to get the blog working again so I tried restoring a copy of the changed file from the previous week’s backup (made via the BackWPUp plugin) only to find the backup wasn’t usable. Bad plugin! Luckily I’m a believer in ‘belt and braces’ and I knew my hosting company, EvoHosting, also took backups. I logged a call with them and within the hour they’d replied with the contents of the file from a week earlier. Sure enough the file had been changed but looking at the syntax it appeared to be an error rather than a malicious hack.

My .htaccess file when the site was working;

# BEGIN WordPress
<IfModule mod_rewrite.c>
RewriteEngine On
RewriteBase /
RewriteRule ^index\.php$ - [L]
RewriteCond %{REQUEST_FILENAME} !-f
RewriteCond %{REQUEST_FILENAME} !-d
RewriteRule . /index.php [L]
</IfModule>
# END WordPress

My .htaccess file after the suspicious change;

# BEGIN Better WP Security
Order allow,deny
Allow from all
Deny from 88.227.227.32
# END Better WP Security
RewriteBase /
RewriteRule ^index\.php$ - [L]
RewriteCond %{REQUEST_FILENAME} !-f
RewriteCond %{REQUEST_FILENAME} !-d
RewriteRule . /index.php [L]
</IfModule>
# END WordPress
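A quick way to spot this sort of damage without being a WordPress expert is to run a rough structural check over the file. The sketch below is my own illustration, not part of BetterWPSecurity or Apache: the function name `check_htaccess` is made up, and it only looks for two things visible in the mangled file above (rewrite directives with no preceding `RewriteEngine On`, and a closing `</IfModule>` with no opening tag). It is nowhere near a full Apache config parser.

```python
import re


def check_htaccess(text):
    """Return a list of suspicious findings in .htaccess text (rough check only)."""
    problems = []
    open_blocks = 0          # number of <IfModule> blocks currently open
    engine_on = False        # have we seen 'RewriteEngine On' yet?
    engine_warned = False    # only report the missing directive once

    for number, raw in enumerate(text.splitlines(), start=1):
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        if re.match(r"<IfModule\b", line, re.IGNORECASE):
            open_blocks += 1
        elif re.match(r"</IfModule>", line, re.IGNORECASE):
            if open_blocks == 0:
                problems.append(f"line {number}: </IfModule> has no matching <IfModule>")
            else:
                open_blocks -= 1
        elif re.match(r"RewriteEngine\s+On\b", line, re.IGNORECASE):
            engine_on = True
        elif re.match(r"Rewrite(Rule|Cond|Base)\b", line, re.IGNORECASE):
            if not engine_on and not engine_warned:
                directive = line.split()[0]
                problems.append(
                    f"line {number}: {directive} appears before any 'RewriteEngine On'"
                )
                engine_warned = True

    if open_blocks > 0:
        problems.append(f"{open_blocks} <IfModule> block(s) never closed")
    return problems


# Example: feed it the contents of a file and print whatever it finds.
sample = "RewriteBase /\nRewriteRule . /index.php [L]\n</IfModule>\n"
for problem in check_htaccess(sample):
    print(problem)
```

Run against the working file it should report nothing, while the mangled version trips both checks. Something like this could be run after every plugin update as a cheap early warning.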

I backed up the suspicious copy of the file (for future reference, ie writing this blogpost), restored the original et voila – the blog was working again. Step one complete, now to find the root cause…

Part of any diagnostic process is the question ‘what’s changed?’ and I had a suspicion that BetterWPSecurity could be the culprit as I’d only installed it a few weeks earlier. There was also the obvious issue of the new code in the .htaccess file which looked to belong to BetterWPSecurity. I checked the site access logs which confirmed my hypothesis – someone had attempted to break into my site and while attempting to block the attacker BetterWPSecurity had mangled my .htaccess file. The logs below have been truncated to remove many of the brute force login attempts (there were plenty more) but note that on the final line (after BetterWPSecurity has blocked the attacker) the HTTP status code was 418 (“I’m a teapot”) rather than 200, and that the suspect IP 88.227.227.32 is the same as the one denied in the mangled .htaccess file. Yes, you read that right, “I’m a teapot”! Here’s a full explanation for that April Fool’s error code. 🙂

88.227.227.32 - - [15/Feb/2013:23:35:19 +0000] "POST /wp-login.php HTTP/1.1" 200 3017 "http://www.vexperienced.co.uk//wp-login.php" "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:2.0.1) Gecko/20100101 Firefox/4.0.1"
88.227.227.32 - - [15/Feb/2013:23:35:19 +0000] "POST /wp-login.php HTTP/1.1" 200 3017 "http://www.vexperienced.co.uk//wp-login.php" "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:2.0.1) Gecko/20100101 Firefox/4.0.1"
88.227.227.32 - - [15/Feb/2013:23:35:19 +0000] "POST /wp-login.php HTTP/1.1" 200 3017 "http://www.vexperienced.co.uk//wp-login.php" "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:2.0.1) Gecko/20100101 Firefox/4.0.1"
88.227.227.32 - - [15/Feb/2013:23:35:19 +0000] "POST /wp-login.php HTTP/1.1" 200 3017 "http://www.vexperienced.co.uk//wp-login.php" "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:2.0.1) Gecko/20100101 Firefox/4.0.1"
88.227.227.32 - - [15/Feb/2013:23:35:19 +0000] "POST /wp-login.php HTTP/1.1" 418 5 "http://www.vexperienced.co.uk//wp-login.php" "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:2.0.1) Gecko/20100101 Firefox/4.0.1"
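Picking a pattern like this out of a large access log by eye gets tedious, so here is a small sketch of how it could be automated. It assumes the log is in the Apache combined format shown above; the function name `summarise_logins` and the regular expression are my own, written for this illustration. WordPress normally answers a successful login with a redirect, so a run of 200 responses to POSTs on the login page is at least consistent with repeated failed attempts.

```python
import re
from collections import Counter

# Matches the start of an Apache combined-format line:
# client IP, two ignored fields, [timestamp], "METHOD path protocol", status, size
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" '
    r'(?P<status>\d{3}) (?P<size>\S+)'
)


def summarise_logins(lines, login_path="/wp-login.php"):
    """Count POSTs to the login page per client IP, grouped by HTTP status code."""
    summary = {}
    for line in lines:
        match = LOG_PATTERN.match(line)
        if match is None:
            continue  # not a line in the expected format
        if match["method"] == "POST" and match["path"] == login_path:
            per_ip = summary.setdefault(match["ip"], Counter())
            per_ip[int(match["status"])] += 1
    return summary


# Example with two of the lines from the log above (referrer and user agent shortened).
sample_lines = [
    '88.227.227.32 - - [15/Feb/2013:23:35:19 +0000] "POST /wp-login.php HTTP/1.1" 200 3017 "-" "-"',
    '88.227.227.32 - - [15/Feb/2013:23:35:19 +0000] "POST /wp-login.php HTTP/1.1" 418 5 "-" "-"',
]
for ip, statuses in summarise_logins(sample_lines).items():
    print(ip, dict(statuses))
```

An IP with dozens of 200s against the login page followed by a 418 is the signature described above: a brute force attempt, then the block kicking in.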

So BetterWPSecurity led me to the fault but also caused it. To be fair the plugin does warn you which settings are potentially going to cause issues but I’d assumed that it wouldn’t be me – dangerous things, assumptions. I’ve rectified the issue by restricting BetterWPSecurity from altering core system files as shown in the screenshot below;

My blog is fixed and I’m feeling quite chuffed that it was all resolved during a long lunchbreak – not a bad day’s work if I do say so myself! Lesson for today? Take warnings seriously and have multiple backups!

Federated login failures – the LSA cache

While working recently on an ADFS federation solution I came across a Microsoft ‘feature’ which doesn’t seem to be well known and which caused me to deliver my project a week late. It often manifests itself via failed logins and affects many products which integrate with AD such as SharePoint, Office 365, OWA, and of course ADFS. This is very much one of those ‘document it here for future reference’ posts but hopefully it’ll help spread the word and maybe save someone else the pain I felt!

To describe how the ‘feature’ affects ADFS you need to understand the communication flow when a federation request is processed. The diagram below (from an MSDN article on using ADFS in Identity solutions) shows a user (the web browser) connecting to a service (the ASP.NET application although it could be almost any app) which uses ADFS federation to determine access;

Communication flow using federated WebSSO

Summarising the steps;

  • The user browses to the web application (step 1)
  • The web app redirects the user to ADFS (steps 2, 3)
  • ADFS attempts to authenticate the user, usually against Active Directory (step 4)
  • ADFS generates a token (representing the user’s authentication) which is passed back to the user who then presents it to the app and is given access (steps 5, 6, 7)

My problem was that while some users were being logged into the web application OK, some were failing and I couldn’t work out why. Diagnosing issues in federation can be tricky as by its nature it often involves multiple parties/companies. The web application company were saying their application worked fine, both redirecting users and processing the returned tokens. The users were entering their credentials and being authenticated against our internal Active Directory. ADFS logs showed that tokens were being generated and sent to the web app. Hmm.

Digging deeper I found that the AD username (the UPN to be precise) being passed into the token generation process within ADFS was occasionally incorrect. The user would type their username into the web form (and be authenticated) but when ADFS tried to generate claims for this user via an LDAP lookup it used an incorrect UPN and hence failed. It seemed as if the Windows authentication process was returning incorrect values to ADFS. This stumped me for a while – how can something as simple and mature as AD authentication go wrong?

Of course it’s not going wrong, it’s working as designed. It transpires there’s an LSA cache on domain member servers, and by default entries are cached for 7 days. If the AD values have changed recently, the AD authentication process can return the original, rather than the updated, values to the calling application. A simple change such as someone getting married and having their AD account updated with their married name could therefore break any dependent applications. Details of this cache can be found in MS KB article 946358, along with the priceless statement “This behaviour may prevent the application from working correctly”. No kidding! This impacted my project more than most because the AD accounts are created programmatically via a web portal and updated later by some scripts. The high rate of change means they’re more susceptible to having old values cached.
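If the caching behaviour seems abstract, a toy model makes it easier to see why a rename breaks things. The sketch below is purely illustrative and is not how the LSA cache is actually implemented: the `LookupCache` class, the `sid-1001` key and the example UPNs are all made up, and the only detail taken from the real world is the 7 day default mentioned above.

```python
class LookupCache:
    """Toy time-based cache: a looked-up value is reused until it expires."""

    def __init__(self, lookup, ttl_seconds):
        self._lookup = lookup      # function that queries the real directory
        self._ttl = ttl_seconds    # how long a cached answer is trusted
        self._entries = {}         # key -> (value, time it was cached)

    def get(self, key, now):
        cached = self._entries.get(key)
        if cached is not None and now - cached[1] < self._ttl:
            return cached[0]       # may be stale if the directory has changed
        value = self._lookup(key)  # cache miss or expired entry: ask the directory
        self._entries[key] = (value, now)
        return value


ONE_DAY = 24 * 60 * 60
SEVEN_DAYS = 7 * ONE_DAY

# Hypothetical directory: account identifier -> current UPN.
directory = {"sid-1001": "jane.smith@example.com"}
cache = LookupCache(directory.get, ttl_seconds=SEVEN_DAYS)

before_rename = cache.get("sid-1001", now=0)

# The account is updated in the directory (e.g. a married name)...
directory["sid-1001"] = "jane.jones@example.com"

# ...but a day later the member server still hands back the old value.
one_day_later = cache.get("sid-1001", now=ONE_DAY)

# Only once the entry has expired does the new value come through.
after_expiry = cache.get("sid-1001", now=SEVEN_DAYS)

print(before_rename, one_day_later, after_expiry)
```

The application in the middle (ADFS in my case) has no idea the value it was given is out of date, which is why the failures look so random from the outside.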

This might seem like a niche problem but it also impacts implementations of SharePoint, OWA, Project Server, and Office 365 – any product that relies on AD for authentication. These products can be integrated with AD to facilitate single sign on but if you make frequent changes to AD the issues above can occur.

How can I diagnose this issue?

The symptoms will vary between products but thankfully Microsoft have some great documentation on ADFS. The troubleshooting guide details how to enable the advanced ADFS logs via Event Viewer; once you’ve got those, check for Event ID 139. The event details show the actual contents of the authentication token so you can check the UPN and ensure it’s what you expect. If it isn’t, follow the instructions in the KB article to disable the cache or fine-tune its retention period on the domain member server (ie the ADFS server, not the AD server).

Further Reading

Understanding the LSA lookup cache

Using vCenter Operations v5 – Capacity features and conclusions (3/3)

In the first part of this series I introduced vCOps and its requirements before covering the new features in part two. This final blogpost covers the capacity features (available in the Advanced and higher editions) along with pricing information and my conclusions.

The previous trial I used didn’t include the capacity planning elements so I was keen to try this out. I’d used CapacityIQ previously (although only briefly) and found it useful but combined with the powerful analytics in vCOps it promises to be an even more compelling solution. VMware have created four videos with Ben Scheerer from the vCOps product team – they’re focused on capacity but if you’ve watched Kit Colbert’s overview much of it will be familiar;

UPDATE APRIL 2012 – VMware have just launched 2.5 hrs of free training for vCOps!

If you don’t have time to watch the videos and read the documentation (section 4 in the Advanced Getting Started guide) here are the key takeaways;

  • Capacity information is integrated throughout the product although modelling is primarily found under the ‘Planning’ view. Almost every view has some capacity information included either via the dynamic thresholds (which indicate the standard capacity used) or popup graphs of usage and trending.
  • Storage is now included in the capacity calculations (an improvement over CapacityIQ) resulting in a more complete analysis. Datastores are now shown in the Operations view although if you’re like me and use NFS direct to the guest OS it’s not going to be as comprehensive as using block protocols.
  • The capacity tools require more tailoring to your environment than the performance aspects but provide valuable information.
  • With vCOps you can both view existing and predicted capacity and model changes like adding hosts or VMs.


Using vCenter Operations v5 – What’s new (2/3)

In part one of Using vCenter Operations I covered what the product does along with the different versions available and deployment considerations. In this post I’ll delve into what’s new and improved and in the final part I’ll cover capacity features, product pricing, and my overall conclusions. I had intended to cover the configuration management and application dependency features too but it’s such a big product I’ll have to write another blogpost or I’ll never finish!

Introductory learning materials

UPDATE APRIL 2012 – VMware have just launched 2.5 hrs of free training for vCOps.

Deep dive learning materials;

What’s new and improved in vCOps

Monitoring is a core feature and for some people the only one they’re concerned about. As the size of your infrastructure grows and becomes more complex the need for a tool to combine compute, network, and storage in real time also grows. Here are my key takeaways;

  • There’s a new dashboard screen which shows health (immediate issues), risks (upcoming issues) and efficiency (opportunities for improvement) in a single view. The dashboard can provide a high level view of your infrastructure and works nicely on a plasma screen as your ‘traffic light’ view of the virtual world (and physical if you go with Enterprise+). The dashboard can also be targeted at the datacenter, cluster, host or VM level, which I found very useful, although you can only customise the dashboard in Enterprise versions. There is still the Operations view (the main view in vCOps v1) which now also includes datastores. This view scales extremely well – even if you have thousands of VMs and datastores across multiple vCenters they can all be displayed on a single screen.
    NOTE: If you find some or all of your datastores show up as grey with no data (as mine did) there is a hotfix available via VMware support.