And that’s no joke.
How to Destroy a Sysplex
To say we had an interesting Business Recovery Exercise this week would be an understatement!
Since bringing our BR (business recovery) / DR (disaster recovery) solution in house, rather than performing it offsite, we’ve had a total of five BR Exercises this year alone. This is pretty impressive for our shop since we used to go YEARS between BR Exercises. Now our clients can declare a BR Exercise without prior notice to ensure our infrastructure is sound and solid.
Our infrastructure IS sound and solid…provided no one messes with it!
Two months earlier I was doing what I thought was helpful clean up on RACF. I was adding a new PROFILE for a monitoring application. Our RACF expert had just recently retired and our new RACF person was not quite trained and up to speed.
On occasion I would go in and “fix up” some things in RACF, trying to be helpful. Although I had ADMIN rights to reset PASSWORDS when I’m on-call, I’m not really supposed to mess around in RACF.
But what’s the worst that can happen?
I honestly thought I was doing something good by deleting a VERY suspicious * (G)ENERIC profile.
To me this generic profile seemed a security risk, so I decided to take matters into my own hands (since the new guy surely was not going to) and DELETED this profile!
What I didn’t realize was that instead of making the system more secure, I had deleted a VERY important PROFILE that’s used at IPL.
[The] class SURROGAT profile consisting simply of "**" or "*.*" (sometimes called a catchall profile). It applies to all user IDs that aren't matched by a more specific profile and probably covers your user ID unless steps have been taken to avoid this. ... Without a catchall generic profile of some kind in the class STARTED, a previously undefined started task will fall back to the contents of ICHRIN03. ... If fallback to ICHRIN03 can happen, you need to know what privileges it's granting.
That’s exactly what happened.
We started the Business Recovery Exercise, and on the very first IPL the system came to a screeching halt. Apparently JES2 (Job Entry Subsystem) did not have the authority it needed, and our ICHRIN03 was poorly coded.
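For anyone who hasn't lived through it, here is a rough sketch of that fallback chain. It is a toy Python model, not how RACF actually works internally: the function name, the profile names, the user IDs, and the "most literal characters wins" matching rule are all assumptions made up for illustration.

```python
from fnmatch import fnmatchcase


def resolve_started_user(task, started_profiles, ichrin03_table):
    """Toy lookup: return (user_id, source) for a started task.

    task             -- hypothetical "member.jobname" string, e.g. "JES2.JES2"
    started_profiles -- dict of pattern -> user ID (stand-in for class STARTED)
    ichrin03_table   -- dict of member -> user ID (stand-in for ICHRIN03)
    """
    # Assumption: a pattern with more literal characters is "more specific".
    def specificity(pattern):
        return sum(1 for ch in pattern if ch != "*")

    # Try the STARTED profiles first, most specific pattern first.
    for pattern in sorted(started_profiles, key=specificity, reverse=True):
        if fnmatchcase(task, pattern):
            return started_profiles[pattern], "STARTED profile " + pattern

    # No profile matched: fall back to the started procedures table.
    member = task.split(".")[0]
    if member in ichrin03_table:
        return ichrin03_table[member], "ICHRIN03"

    # Nothing covers this task at all.
    return None, "none"


# With the catchall in place, JES2 is covered even with no entry of its own.
profiles = {"**": "STCUSER", "MONITOR.*": "MONUSER"}
print(resolve_started_user("JES2.JES2", profiles, {}))

# Delete the catchall and a poorly coded table leaves JES2 with nothing.
del profiles["**"]
print(resolve_started_user("JES2.JES2", profiles, {}))
```

The point of the sketch is only that the catchall quietly covers every task that has no profile of its own, so deleting it changes nothing visible until the next IPL.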
But…NOTHING has changed!!!!
Imagine the frustration my colleagues (and I, before the discovery) were experiencing. Here we were doing our FIFTH BR exercise this year. It always worked. It never failed. We had a perfect mirror of our working production. Nothing had changed!
To make a long story (and a painful one for me) short, we opened a Severity 1 Service Request with IBM. This is equivalent to calling 911 or pressing the nuclear panic button when you need IBM support and need it fast!
We were directed to a teleconference with their JES and RACF experts, who with their AWE-INSPIRING expertise guided us to the discovery that yes, we were missing that * GENERIC PROFILE in RACF. Since JES2 at our shop started in a certain sequence, we were unable to re-create this PROFILE on our BR system.
Since this was a mirror of our production we discovered that we were in fact vulnerable on our PRODUCTION SYSTEM!!!
If we had IPL’d (that is, recycled) any of our production LPARs, there was NO WAY they were coming back up. JES2 would have run into the same authority error and the entire system would have been, in a manner of speaking…toast!
Luckily we caught this and were able to RECREATE the profile on our PRODUCTION system so we could mirror it over to the BR SYSTEM and finish the exercise.
Takeaway lessons:
- NEVER… EVER… MESS WITH RACF! (At least without knowing what you’re doing. My RACF roles have been relinquished to the appropriate people.)
- Business / Disaster Recovery Exercises are there for a REASON! If you’re not doing them at your shop, how do you know you’re not vulnerable?
</CONFESSION AND LESSON>