How got Linux-on-a-mainframe wrong

David Boyes responds to's 'Weighing the pros & cons of IBM's mainframe Linux' series.

I do apologize for the slow response to your single e-mail attempt to contact me on April 6 -- I have been working on a project in mainland China with poor connectivity and have been reserving bandwidth for critical messages and paying customers -- at $2.80/minute and about 1200-baud transfer rates, it's not an environment conducive to answering random queries demanding extensive explanations if it isn't critical business data.

I do have some concerns with your representation and with your understanding of the technical environment and capabilities surrounding the Test Plan Charlie implementation. Perhaps an earlier explanation would have avoided some of your misunderstandings, but please allow me to address your initial questions and then make a few suggestions to help you understand more about the environment and the capabilities of the VM hypervisor.

First, some general comments are in order. Test Plan Charlie is a specialized environment designed for a specific customer. There are tradeoffs in that environment that are specific to that customer which may not be appropriate or reasonable for general purpose use, something that I do try to make clear when I present the solution and the analysis leading to the decision to use virtual servers and Linux on the S/390 for their later production solution. I assume that, as you probably assimilated your viewpoint from the presentation slides alone, you did not hear that caveat and were unaware of the discussion surrounding it. I tend to liken it to the design of an embedded system rather than a general-purpose environment, and many ideas borrowed from a brief flirtation in my career with being a process control engineer surface here in terms of separation of writable and non-writable components, use of RAMdisks and PROM-equivalents, and optimizing for specific function from a general purpose code base.

Second, keep in mind that I approached the problem from the perspective of someone with substantial knowledge of the VM system facilities and services as a container for guest operating systems, looking for the simplest method to allow Linux to take advantage of these capabilities with a minimum of specialized code, such as the xpram driver for extended storage (essentially a page mapping technique that lets the 31-bit S/390 architecture employ system memory in excess of the 2G limit on addressability in a 32-bit space -- the 32nd bit is used in the 390 world to indicate 24-bit or 31-bit compatibility for older S/360 and S/370 systems).
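The addressing arithmetic behind that 2G limit can be sketched in a few lines. This is only an illustration of the numbers in the paragraph above; the function names are hypothetical and the code does not model the actual S/390 PSW or the xpram driver.

```python
from typing import Tuple

# Illustrative constants only; names are hypothetical, not S/390 definitions.
ADDRESS_BITS = 31
ADDRESSABLE_BYTES = 2 ** ADDRESS_BITS   # 2 GiB reachable with 31 address bits
MODE_BIT = 1 << 31                      # high-order bit of a 32-bit word


def split_word(word: int) -> Tuple[bool, int]:
    """Split a 32-bit word into (mode flag, 31-bit address)."""
    return bool(word & MODE_BIT), word & (MODE_BIT - 1)


def needs_page_mapping(memory_bytes: int) -> bool:
    """True when memory exceeds what 31 bits can address directly."""
    return memory_bytes > ADDRESSABLE_BYTES


if __name__ == "__main__":
    print(ADDRESSABLE_BYTES)                   # bytes addressable with 31 bits
    print(split_word(0x80001000))              # flag set, address 0x1000
    print(needs_page_mapping(3 * 2 ** 30))     # 3 GiB does not fit
```

With one bit of the word reserved as a mode flag, only 31 bits remain for the address, which is why memory beyond 2G has to be reached by mapping pages in and out rather than by addressing it directly.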

The Linux development community has since added specific kernel-space and user-space modifications which do a more sophisticated job, but the goal at that point in time was to make the VM facilities do the heavy lifting when we could force them to do so, and thus make Linux's job simpler -- what the guest didn't know about the VM function we used, didn't hurt it.

Specifically, we took a great deal of advantage of VM functions such as minidisk block caching (which provides in-core caching of disk blocks similar to the method used by Linux, but VM provides several tiers of cache: main storage, expanded storage, and then finally out to magnetic media), virtual disks in storage (which are eligible to be pageable and thus can be heavily overallocated, unlike the Linux ramdisk setup or the later xpram driver, which requires dedicated storage), and the sophisticated VM fair-share scheduler to offset the relatively weak Linux resource management capability in the early days.

Answering questions

Your questions from the April 6 note were:

How did you get 41,400 Apache servers network connected to the outside from within one G5 LPAR?

The concept here is identical to the idea of using routers and switches to provide sufficient ports to plug in the network cables for discrete servers. Following standard design practice, the logical way to manage a very large number of connections is using multiple tiers of routers and switches, e.g., a configuration analogous to a common router connecting to a WAN link and a number of internal networks, one interface per network. Those networks can in turn be connected to additional routers, building tiers of connections and networks. Ultimately, the end state is some connection between a network segment and a server network interface.

In the Test Plan Charlie (and the previous Test Plan Able and Baker, which were executed on the same hardware) scenario, this typical network architecture was implemented using Linux guests (the correct term; your use of "ghost" is a misnomer) acting as software-based routers. Using virtual channel to channel adapters (vCTCAs), we created a network of point to point links very similar to a WAN configuration, but terminating directly into servers, providing IP connectivity to a number of connections that are either another branch in the tree of routers or terminating directly at a server -- a very large branching tree with lots of leaves. Connecting servers directly to routers is not common in typical server farm implementations due to the cost of interfaces, but as the interfaces are simulated, this caveat does not apply in our situation -- we can create routers and interfaces at need at the expense of additional routing complexity.
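To get a feel for how quickly a branching tree of routers reaches tens of thousands of leaves, here is a minimal sketch. The fan-out of 32 links per router is purely an assumption for illustration; the actual fan-out used in the test plans is not stated here.

```python
from typing import List


def router_tiers(servers: int, fanout: int) -> List[int]:
    """Return the number of routers needed at each tier, leaf tier first.

    Each router is assumed to terminate at most `fanout` downstream links
    (hypothetical value; not taken from the Test Plan Charlie setup).
    """
    if servers < 1 or fanout < 2:
        raise ValueError("need at least one server and a fan-out of two")
    tiers = []
    nodes = servers
    while nodes > 1:
        nodes = -(-nodes // fanout)   # integer ceiling division
        tiers.append(nodes)
    return tiers


if __name__ == "__main__":
    tiers = router_tiers(41_400, 32)
    print(tiers, sum(tiers))
```

Under that assumed fan-out, four tiers of routers are enough to reach 41,400 leaf servers, and the routers themselves add only a few percent to the total guest count.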

Routing and IP transport are handled exactly as they would be if the routers and connections were real. Using OSPF to summarize routes and reachability information, we can present a consolidated routing entry to the outside world just as we do any other complex of network segments within a WAN. Use of the RFC 1918 private address blocks provides plenty of address space to work with, and we can do useful things with techniques like NAT and PAT with ipchains.
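The address arithmetic can be illustrated with Python's standard `ipaddress` module. This is a sketch of address-space sizing and prefix aggregation only, not an OSPF implementation; the choice of /30 subnets for the point-to-point links and the 10.1.0.0/24 parent block are assumptions made for the example.

```python
import ipaddress

# The three private address blocks (10/8, 172.16/12, 192.168/16).
PRIVATE_BLOCKS = [
    ipaddress.ip_network("10.0.0.0/8"),
    ipaddress.ip_network("172.16.0.0/12"),
    ipaddress.ip_network("192.168.0.0/16"),
]


def total_private_addresses() -> int:
    """Total number of addresses across the private blocks."""
    return sum(net.num_addresses for net in PRIVATE_BLOCKS)


def summarize(networks):
    """Collapse contiguous subnets into the fewest covering prefixes."""
    return list(ipaddress.collapse_addresses(networks))


# Hypothetical point-to-point links: /30 subnets carved from one /24.
links = list(ipaddress.ip_network("10.1.0.0/24").subnets(new_prefix=30))

if __name__ == "__main__":
    print(total_private_addresses())
    print(len(links), summarize(links))
```

The 64 link subnets collapse into a single /24 entry, which is the same idea as presenting one consolidated route to the outside world for a whole branch of the tree.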

What was the size of the working set for each instance? If each instance was an independent Linux system, how big was that EMC? (and how did you get it to maintain that I/O rate?)

Fairly small, as the individual images were deliberately kept small and focused on specific tasks. Again, the analogy to an embedded system stands in good stead here -- if a service or application wasn't necessary to the design function for that class of instance, it was removed and/or disabled (the option of disabling the service but retaining the files actually made more sense in terms of allowing more substantial commonality between the base OS images, which leverages the VM minidisk caching facility and robust block-paging algorithm -- which attempts to minimize head movement during paging operations to optimize throughput while workload is suspended for virtual memory operations).

With regard to the EMC box, it was a fairly large one with substantial numbers of channel paths (physical connections between system and disk unit) and control unit cards, but it was also completely dedicated to the environment in question during the processing of the various test plans. The box had been ordered for use with the OS/390 parts of the system, had arrived, but the OS/390 guys hadn't gotten around to getting it configured for their use, so we grabbed it for a few days... 8-).

In terms of the I/O rates, keep in mind that we had the maximum number of channel paths available, and that the VM system was doing its best to perform as much I/O optimization as was available to it. We also took great pains to limit the amount of copying required to create a unique instance (see below) -- if we could share or reuse something, and target the most common files to remain in main storage or device cache, that was a goal in how the template systems were configured. See the next section for more discussion on this.

How much difference was there between instances? (Were they all the same one? Did you share everything except /etc/hosts?)

No, there were six "template" systems corresponding to various functions within the configuration -- each with a common set of disks for a majority of the binaries. The distribution structure and initialization process were customized to move all writable or instance-specific components outside the traditional Linux filesystem organization; if you've ever encountered a diskless Sun workstation, you would find a number of the modifications to the filesystem architecture we had to make to be very familiar.

Each template system started with a common base OS and startup files, and had various things tweaked to optimize it for a specific role: router, file server, WWW server, load generator, etc. The additional server instances for a specific function then used that template as a base set of disks with only the absolutely necessary customized files residing in a small /var partition. Swap was done to VDISK (to allow VM to do swap I/O as CP paging I/O, the fastest code path in any VM I/O sequence).

Your point in one of the sidebars about I/O required to generate the number of instances missed this critical item -- the idea is to copy only the components that are absolutely necessary to make the instance distinct from the template; instead of the gigabytes you are arguing, it's more like a few hundred kilobytes per instance. Some of the templates are more complicated (the file servers for example), but they can be pre-allocated. (Later versions of the setup do this -- a queue of pre-allocated but unconfigured template systems that can be customized and brought up on demand, and when a system template is used, a background job is created to replace the queue entry with a new unconfigured template. This approach dramatically reduces the time to allocate and configure a server -- a few seconds to update a directory and customize the server image).
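The pre-allocation queue described above can be sketched as follows. This is a minimal model under stated assumptions, not the production tooling: the class and method names are hypothetical, the slow cloning step is a stand-in, and the "background job" is run synchronously so the example stays self-contained.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Deque, Dict


@dataclass
class Instance:
    template: str                       # role, e.g. "router" or "www"
    name: str = ""                      # filled in at allocation time
    config: Dict[str, str] = field(default_factory=dict)


class TemplatePool:
    """Queue of pre-allocated, unconfigured copies of one template."""

    def __init__(self, template: str, depth: int) -> None:
        self.template = template
        self.refills_pending = 0
        self._queue: Deque[Instance] = deque(
            self._clone() for _ in range(depth)
        )

    def _clone(self) -> Instance:
        # Stand-in for the slow step (allocating disks, copying /var).
        return Instance(template=self.template)

    def allocate(self, name: str, config: Dict[str, str]) -> Instance:
        """Fast path: take a ready copy and apply instance-specific data."""
        inst = self._queue.popleft()    # raises IndexError if pool is empty
        inst.name = name
        inst.config = dict(config)
        self.refills_pending += 1       # schedule a replacement
        return inst

    def run_background_refill(self) -> None:
        """Replace every consumed queue entry with a fresh copy."""
        while self.refills_pending:
            self._queue.append(self._clone())
            self.refills_pending -= 1

    def available(self) -> int:
        return len(self._queue)


if __name__ == "__main__":
    pool = TemplatePool("www", depth=3)
    server = pool.allocate("www0001", {"ip": "10.1.0.2"})
    pool.run_background_refill()
    print(server.name, pool.available())
```

The point of the design is that the expensive work happens ahead of demand, so the request path only has to rename and customize an existing copy.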

You report that it took 90 seconds to create an instance. How did you get 41,400 of those into the 111,600 from Friday at 5 to Sat at midnight?

Here's where you missed something waiting for the movie version of my presentation... 8-) Test plans Able, Baker and Charlie are cumulative. Charlie was the go-for-broke piece to establish where the whole system fell apart -- the Friday to Saturday midnight is the part on top of what was already done in Able and Baker. So, lots of time to get to the end of Baker; Charlie is mostly just torture... 8-)
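The arithmetic behind the question is easy to check, and it shows why the cumulative nature of the test plans matters. The sketch assumes strictly serial creation at 90 seconds per instance (an assumption made only to reproduce the question's reasoning) and uses an arbitrary Friday as the start date.

```python
from datetime import datetime

SECONDS_PER_INSTANCE = 90     # creation time quoted in the question
INSTANCES = 41_400            # instance count quoted in the question

# Any Friday works; 7 January 2000 is used only as a convenient example.
start = datetime(2000, 1, 7, 17, 0)   # Friday, 5 pm
end = datetime(2000, 1, 9, 0, 0)      # midnight at the end of Saturday

window_seconds = int((end - start).total_seconds())
serial_seconds = SECONDS_PER_INSTANCE * INSTANCES
max_serial_instances = window_seconds // SECONDS_PER_INSTANCE

if __name__ == "__main__":
    print(window_seconds)                  # length of the Charlie window
    print(serial_seconds / 86_400)         # days needed if done serially
    print(max_serial_instances)            # instances that fit in the window
```

Created one after another, all 41,400 instances would take about 43 days, and only 1,240 would fit in the 111,600-second window, so the bulk of the instances had to exist already when Charlie started.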

What did you adjust the interrupt timer to? (and if this was less than 13.8 seconds, please help me understand how you managed the paging demand created)

The setting used in the router template was 12 interrupts/second (much lower and TCP started falling apart; higher generated too much overhead), the other servers were in the 10-12 interrupts/sec range. This entire undertaking was not a performance test, but an integrity test (which the environment passed), even when essentially paralyzed with thrashing pages in and out of memory. (One thing's for sure, the last few guests took a good long while to create! The shock was more that the whole thing didn't blow up, just got slower and slower.) With a dedicated disk farm, controlling paging wasn't the point; survivability was.
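A rough calculation shows why the timer rate matters at this scale. The guest count and the 10-12 interrupts/second range come from the text; the comparison rate of 100 interrupts/second is an assumption (a commonly used Linux default of that era) and is not taken from the test plans.

```python
GUESTS = 41_400               # instance count quoted in the question
REDUCED_RATES = (10, 12)      # interrupts/second range given in the text
ASSUMED_DEFAULT_RATE = 100    # ASSUMPTION: typical default, for comparison


def aggregate_rate(guests: int, per_guest_rate: int) -> int:
    """Timer interrupts per second the hypervisor sees across all guests."""
    return guests * per_guest_rate


reduced = [aggregate_rate(GUESTS, rate) for rate in REDUCED_RATES]
default = aggregate_rate(GUESTS, ASSUMED_DEFAULT_RATE)

if __name__ == "__main__":
    print(reduced)   # aggregate load at the reduced settings
    print(default)   # aggregate load at the assumed default
```

Every timer interrupt makes an otherwise idle guest runnable, so cutting the per-guest rate reduces the aggregate wakeups by roughly an order of magnitude under this assumption.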

On the later production environment (with far fewer instances -- at its height, around 8,000, with attrition due to the general downturn in the ISP industry and corresponding slump in telecommunications purchasing pulling it down to a more nominal 5,800 or so as of the last email I got from the monitoring system last Friday), there did have to be some consideration of the sequencing of initializing the images and location/speed of paging areas.

A lot of age-old paging optimization wisdom was hauled out of storage and tried to determine what made sense in this environment, and the servers on the production machine are started in batches to allow the dust to settle between groups. Paging areas are located across multiple volumes to ensure that CP can do parallel I/O requests (note that this approach predates the PAV (parallel access volume) support in CP that would have allowed multiple outstanding I/O requests to a single volume), and there is a systematic review of VM performance data on a regular (in some cases, daily) basis to determine if there are areas of concern.

The sidebars

I do agree with you that some portions of IBM and its constellation of business partners do not represent the test accurately, and as you note in your sidebar comments but don't make very clear, I do not claim that Linux on S/390 hardware is necessarily the best solution for every problem; it is a useful tool, and opens up some interesting possibilities for infrastructure design that challenge a number of the assumptions about how large server farms can and should be built. In fairness to the people in IBM marketing, they also make clear that they do not claim that Linux on the mainframe is right for every solution either -- after all, that's why they keep making PowerPC and Intel-based machines as well as the zSeries systems. There are problems that are not appropriate for a particular platform, which is why we have a diversity of vendors.

Turning now to another of your sidebar comments (the one in the reductio ad absurdum component), your discussion of the hardware requirements for the alternative solution missed another important aspect of the comparison: the 750 system number was supplied by their previous consulting firm (in this case, one of the Big Six consulting firms). I find that solution strange as well and asked similar questions about their choices, which is one of the reasons for establishing a number of alternative scenarios that were part of the discussion with the client.

Since you didn't have details on the expected workload, I find your comment on whether a UE2 would be appropriate somewhat puzzling -- there certainly are sufficiently large organizations using the UE2 for infrastructure roles without a problem, and they are quite common in ISP and ASP environments as the equivalent of what would be today a 1RU utility Intel appliance or similar system.

Your comment on database applications is also puzzling, as there is no reference to databases in the TPC environment other than the ones maintained by 'bind' and Usenet news, which are certainly I/O intensive, but not what I would consider database workload, so I would contend that you are lacking the detailed information required to understand the requirements of the specific problem at hand, and thus have little basis to argue design decisions -- whether the solution is right or not for this customer is not something you can argue because you don't know the parameters of the problem.

In another of the sidebars (on discussing the value of assembler), you make another interesting comment on the ability to take advantage of architectural capabilities in the S/390 environment. Rather than a blanket statement that Linux cannot take advantage of these capabilities (our earlier discussion in this response makes clear that there are a number of capabilities of the hardware and VM hypervisor that Linux on S/390 does utilize to its benefit), I would draw a specific distinction between architectural functions that are exploited by inheritance, and architectural functions that are exploited consciously with an awareness of the interaction.

I do agree that at that time there were relatively few VM and hardware system services that Linux could consciously exploit (this situation has changed rather dramatically in the last few months), but I would take exception to the blanket assertion that Linux takes no advantage of the inherited functions (also, do keep in mind that a Linux guest running in the VM environment is able to communicate with the VM hypervisor to request services that it may not directly know how to manipulate, but can ask the hypervisor to perform tasks on its behalf).

The hardware functions such as CPU and memory sparing, I/O path replication, and hot component replacement are always available and are operating system neutral -- Linux benefits from them as well as any other operating system on the physical machine. The facilities provided by the CP component of VM (isolation, virtual device support, vCTCAs, virtual disks, console automation, scripting, performance management, etc) are also there regardless of whether Linux is the operating system or not. Note also that by employing the VM hypervisor and allowing CP to intercept I/O requests, you inherit some very sophisticated I/O optimization for minidisk I/O. The examples can continue, but I would think that this hardly constitutes being "unable to take advantage" of the architecture.

I do have to admit to a good laugh over the use of the PL/1 compiler as an example of strong optimization -- a project I worked on at a previous employer did a lot of analysis on parallel Fortran programs using a really complicated PL/1 based parser. The PL/1 optimizer is, umm... less than straightforward in operation, but as you say, produces amazingly good code. Debugging same is, well... interesting at best.

Understanding problems the mainframe solves

Now that I've talked about some of the specific problems, I'd like to provide some discussion of your general understanding of the problem at hand.

For a comparison of a single application running on one box, your analysis is probably ok. What makes the discussion of this type of solution interesting is taking the larger view: most organizations don't have only one application in the production mix, nor do they have one box per application.

Let's think about box count first for a moment. Consider that for most organizations, when you deploy Application X's server in production you need some extra hardware above and beyond the actual production server to make the solution supportable. You need:

  1. the production box itself, or a small group of servers to provide horizontal scalability
  2. a backup server or a hot spare in clustered environment (we are talking mission-critical apps here, so it's unlikely that you would deploy only one box)
  3. a development box or two
  4. a test/QA box
  5. possibly a regression box in case more than one version of the application is in production at any given time.

So, the comparison of one z900/z800 is actually against four, possibly five Sun boxes per application deployed. You can't double up on the test systems because you need them to mirror production to be a valid test; and you certainly don't want developers testing on the same box you use for QA against production. Even granting your point that sharing the backup server on the same physical system is not the best method of guaranteeing overall application reliability, we are still in the ballpark of four out of five classes of support systems that could profitably coexist on a single physical system.

Assuming worst case of five boxes per application, we've erased most of that 18% number down to about 2-3% overall. What happens when the next application comes along, call it Application Y? You now need new boxes for that application -- you can't use the others because they're dedicated to Application X. So, for Application Y, you now need 1+4 *more* servers. We're now up to ten servers for two applications. The trend is clear.
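The trend can be made concrete with a small sketch. It takes the list of roles above as the per-application box count (four when the regression box is not needed, five in the worst case) and assumes, as the argument does, that no discrete box is shared between applications.

```python
# Roles from the list above; the last one is optional ("possibly").
ROLES = [
    "production",
    "backup or hot spare",
    "development",
    "test/QA",
    "regression",
]


def discrete_servers(applications: int, boxes_per_app: int) -> int:
    """Physical boxes needed when nothing is shared between applications."""
    return applications * boxes_per_app


if __name__ == "__main__":
    for apps in (1, 2, 5, 10):
        low = discrete_servers(apps, len(ROLES) - 1)   # four boxes per app
        high = discrete_servers(apps, len(ROLES))      # worst case: five
        print(apps, low, high)
```

With four boxes per application, two applications need eight servers; with the worst case of five, they need ten. Either way the count grows linearly with the number of applications, while the number of partitioned physical systems does not have to.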

The cost of floor space

Note that we have not addressed the floor space or power costs yet -- which increase each time we add a server. We also can assume no sharing of resources; they're separate boxes, and you can't move MIPS or I/O without disrupting service. We also are not computing additional overhead for maintenance (you get a lot of that for free with VM; the Linux stuff takes some thought to do, but is also doable with a much higher level of automation). The real kick for most people is that while the initial cost of the zSeries is high, it amortizes quickly across multiple applications -- if you take the model above where you need four or five servers to deploy an application, and a solution where the same physical server handles that load and gets partitioned to supply the same configuration logically, your cost per application decreases substantially -- 1/n instead of n*4 or more. The part you miss is that for applications n+1 and n+2, there is not necessarily a hardware acquisition component or an additional facilities cost; those costs are substantial, but fixed for the duration of the capacity available in the z800/z900, which can be overlapped for normal applications (the case of applications using 100% of the box is fairly rare).

The steps for environmentals and staff are larger (and you make an interesting point in that as the mainframe-literate staff age, their cost per unit does increase; however, that cost represents the acquisition of senior experienced personnel who are much less likely to make beginner mistakes in terms of understanding operational reality and the requirements of 24x7 business operations than their "cheaper" college-age compatriots you used in your example), but acquisitions of such staff are required on a much less frequent basis. It is not uncommon for a competent VM systems programmer to maintain several dozen large VM systems due to the tooling and system management function available in the environment that can be used for that purpose. These tools also extend to the Linux environment (the programmable operator, REXX, CMS Pipelines, and 30 years' worth of openly shared tools and expertise available to the community). BTW, this is an interesting example of open-source methodology long before Linux or Unix existed.

In terms of facilities, floor space, power and connectivity are not free, and you do not include those in your analysis in any significant form. If you include those factors, the observed TCO is more closely related to the published data from The Registry on the operating cost for a network installation rather than just the server data. Their estimates (based on 2000 data) indicate that the cost of an operation breaks down on average as:

  • 20-23%: hardware and software acquisition and maintenance
  • 37%: staffing
  • remainder: facilities and management costs

These figures will be normalized by location, but the general proportion holds across different locations as well. Your analysis focuses only on the hardware and software costs -- the smallest part of the problem in my opinion. Following your article, another participant in the mailing list posted the comment:

So - then the argument would be that for 16.4% more, you can get all of the RAS of z/900 hardware, vs. the mainframe box. Is that a fair statement?

I would answer, not really. The major argument is that you control the cost of deployment and operations rather than focus on cost of hardware for a number of applications, not just one, and that the investment you make in support infrastructure and staffing is overall smaller for the same number of logical images. That cost is coming out of the 70+% of the TCO that is not related to hardware, and is much more likely to be recurring cost, which is what makes any solution expensive.
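The "70+%" figure follows directly from the cost breakdown quoted earlier. The sketch below only does that subtraction; the percentages are the ones cited in the text and nothing else is assumed.

```python
# Shares of total operating cost as quoted in the breakdown (percent).
HARDWARE_SOFTWARE = (20, 23)   # acquisition and maintenance, low and high
STAFFING = 37


def facilities_and_management():
    """Remainder after hardware/software and staffing, as (low, high)."""
    low = 100 - HARDWARE_SOFTWARE[1] - STAFFING
    high = 100 - HARDWARE_SOFTWARE[0] - STAFFING
    return low, high


def non_hardware_share():
    """Everything that is not hardware/software, as (low, high)."""
    return 100 - HARDWARE_SOFTWARE[1], 100 - HARDWARE_SOFTWARE[0]


if __name__ == "__main__":
    print(facilities_and_management())
    print(non_hardware_share())
```

Facilities and management come out at 40-43% of the total, and everything other than hardware and software at 77-80%, which is the portion that deployment and operations decisions actually influence.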

I would contend that I am not the only person that holds this view. I would point you to Consulting Times' analysis, the work performed by other industry groups such as Gartner and Meta Group, and the amount of focus being placed on the discussion by both Sun and Microsoft. With such illustrious contenders furiously spending advertising dollars to refute the point -- including investing in consulting talent such as yourself to respond to the idea -- I would argue that there is a reason to have the discussion on what assumptions organizations use to justify their solutions decisions. That makes this valuable in and of itself -- without challenges, we make bad decisions.

As a side comment, you should be commended for locating Melinda Varian's excellent paper, but actually reading the paper would have helped you to avoid some factual problems. The timeline for VM releases flowed from CP-40 to CP-67, then to VM/370, VM/SP, a short code fork for a higher-performance option (VM HPO), VM/XA to deal with a major architectural update to enable 31-bit systems, VM/ESA to include the next generation of architecture upgrades, and now, z/VM to return the focus to virtualization. VM does interact directly with the hardware; PR/SM provides an assist, but if it is not present, VM will operate just fine -- if you were referring to the SIE instruction, which increases the efficiency of VM operation in a logical partition (LPAR), then there is some dependency on PR/SM for efficiency, but it is not absolutely required. VM operates fine without any LPAR processing -- it's called "basic mode" when a single instance of VM controls all the resources of the system -- and is common in many organizations.

I've already discussed the hardware sparing and other HA features of the hardware, so I won't reiterate them here. Detailed specifications for the zSeries hardware (and just about every other piece of IBM equipment or software) are available at -- the zSeries 900 is hardware product id 2064, the zSeries 800 is product 2066. I would commend this site to your attention; it contains much useful and current information on IBM systems and programs.

These (and other factual errors based on outdated knowledge) lead me to discount your other arguments. I am happy to discuss issues, but your later comments and your actions declining the opportunity of discussing the issues (re: your email of May 20), your later comments in parts 2 and 3 of your article, and coupled with's revision of your article to remove some or most of the more inflammatory comments tend to indicate that the offer is unlikely to be accepted. I would welcome the opportunity to discuss any of your concerns in detail, with the understanding that I'm often in the field deploying this solution and that we can locate a mutually convenient time.

More Stories By David Boyes

David Boyes is the chief technical officer at Sine Nomine Associates, a consulting firm based in Ashburn, VA that specializes in internetworking, telecommunications, strategic and training services.
