How to Easily Deploy an IMDG in the Cloud

Cloud-based applications enjoy the unique elasticity that cloud infrastructures provide. As more computing resources are needed to handle a growing workload, virtual servers (also called cloud “instances”) can be added to take up the slack. For example, consider a Web server farm handling requests for Web users or mobile apps. Being able to add computing resources on demand keeps work queues small and ensures that Web users always see fast response times. And after a period of peak demand subsides, resources can be dialed back to minimize cost without compromising quality of service. Flexible pricing options on some public clouds ranging from hourly to annual charges per instance give organizations the ability to cost-effectively outsource hosting for their production applications.

IMDGs Help Scale Applications

In-memory data grids (IMDGs) add tremendous value to this scenario by providing a sharable, in-memory repository for an application’s fast-changing state information, such as shopping carts, financial transactions, pending orders, geolocation information, machine state, etc. This information tends to be rapidly updated and often needs to be shared across all application servers. For example, when external requests from a Web user are directed to different Web servers, the user’s state has to be tracked independent of which server is handling the request.

With their tightly integrated client-side caching, IMDGs typically provide much faster access to this shared data than backing stores, such as blob stores, database servers, and NoSQL stores. They also offer a powerful computing platform for analyzing live data as it changes and generating immediate feedback or “operational intelligence;” for example, see this blog post describing the use of real-time analytics in a retail application.

The Need to Keep It Simple

A key challenge in using an IMDG as part of a cloud-hosted application is to easily deploy, access, and manage the IMDG. To meet the needs of an elastic application, an IMDG must be designed to transparently scale its throughput by adding virtual servers and then automatically rebalance its in-memory storage to keep the workload evenly distributed. Likewise, it must be easy to remove IMDG servers when the workload decreases and creates excess capacity.

Like the applications they serve, IMDGs are deployed as a cluster of cloud-hosted virtual servers that scales as the workload demands. This scaling may differ from the application in the number of virtual servers required to handle the workload. To keep it simple, a cloud-hosted application should just view the IMDG as an abstract entity and not be concerned with individual IMDG servers or the data they hold. The application does not want to be concerned with connecting N application instances to M IMDG servers, especially when N and M (as well as cloud IP addresses) vary over time.

Deploying an IMDG in the Cloud

Even though an IMDG comprises several servers, the simplest way to deploy and manage an IMDG in the cloud is to identify it as a single, coherent service. ScaleOut StateServer® (and ScaleOut Analytics Server®, which includes features for operational intelligence) take this approach by naming a cloud-hosted IMDG with a single “store” name combined with access credentials. This name becomes the basis both for managing the deployed servers and for connecting applications to the IMDG.

For example, ScaleOut StateServer’s management console lets users deploy and manage an IMDG in both Amazon EC2 and Windows Azure by specifying a store name and the initial number of servers, as well as other optional parameters. The console does the rest, interacting with the cloud provider to accomplish several tasks, including starting up the IMDG, configuring its servers so that they can see each other, and recording metadata in the cloud needed to manage the deployment. For example, here’s the console wizard for deploying an IMDG in Amazon EC2:


When the IMDG’s servers start up, they make use of metadata to find and connect to each other and to form a single, scalable, peer-to-peer service. ScaleOut StateServer uses different techniques on EC2 and Azure to make use of available metadata support. Also, the ScaleOut management console lets users specify various security parameters appropriate to the various cloud providers (e.g., security groups and VPC in EC2 and firewall settings in Azure), and the start-up process configures these parameters for all IMDG servers.

The management console also lets users add (or remove) instances as necessary to handle changes in the workload. The IMDG automatically redistributes the workload across the servers as the membership changes.

Easily Hooking Up an Application to the IMDG

The power of managing an IMDG using a single store name becomes apparent when connecting instances of a cloud-based application to the IMDG. On-premise applications typically connect each client instance to an IMDG using a list of IP addresses corresponding to available IMDG servers. This process works well on premise because IP addresses typically are well known and static. However, it is impractical in the cloud since IP addresses change with each deployment or reboot of an IMDG server.

The solution to this problem is to let the application access the IMDG solely by its store name and cloud access credentials and have the IMDG find the servers. The store name and credentials are stored in a configuration file on each application instance with the access credentials fully encrypted. At startup time, the IMDG’s client library reads the configuration file and then uses previously stored metadata in the cloud to find the IMDG’s servers and connect to them. Note that this technique works well with both unencrypted and encrypted connections.

The following diagram illustrates how application instances automatically connect to the IMDG’s servers using the client library’s “grid mapper” software, which retrieves cloud-based metadata to make connections to ScaleOut Analytics Server:


The application need not be running in the cloud. The same mechanism also allows an on-premise application to access a cloud-based IMDG. It also allows an on-premise IMDG to replicate its data to a cloud-based IMDG or connect to a cloud-based IMDG to form a virtual IMDG spanning both sites. (These features are provided in the ScaleOut GeoServer® product.) The following diagram illustrates connecting an on-premise application to a cloud-based IMDG:


Summing Up

As more and more server-side applications migrate to the cloud to take advantage of its elasticity, the power of IMDGs to unlock scalable performance and operational intelligence becomes increasingly compelling. Keeping IMDG deployment as simple as possible is critical to unlock the potential of this combined solution. Leveraging cloud-based metadata to automate the configuration process lets the application ignore the details of the IMDG’s infrastructure and easily access its scalable storage and computing power.


AppFabric Caching: Retry Later

We have spent a great deal of time at ScaleOut Software re-architecting our in-memory data grid (IMDG)’s code base to make best use of many cores and large memory. For example, the IMDG must be able to efficiently create millions of objects in each server to make use of its huge storage capacity. Likewise, object access paths must be heavily multi-threaded and avoid lock contention to minimize access latency and maximize throughput. Also, load-balancing after membership changes must be both multi-threaded and pipelined to drive the network at maximum bandwidth.

Given all this, we thought it would be a good opportunity to see how we are doing relative to the competition, and in particular, relative to Microsoft’s AppFabric caching for Windows on-premise servers. In addition to looking at performance differences, we also want to compare ScaleOut StateServer (SOSS) to AppFabric on qualitative measures, such as features, ease of installation, and management.

Testing Scale-Up Performance

Well, performance comparisons aren’t so easy since the AppFabric license agreement states: “You may not disclose the results of any benchmark tests of the software to any third party without Microsoft’s prior written approval.” So our comments will be confined to what testing we felt was valuable and how well SOSS performed.

In a recent customer engagement, the application needed to load 10M objects into each server of the IMDG’s cluster to make full use of high-end servers with 60GB memory capacity. Measuring object creation and access rates on a single server is a good test of the IMDG’s memory management and multithreading, and this is an area in which we have made several performance optimizations. Using a workload of random object sizes varying from 200B to 2KB, SOSS was able to load 2 million objects in 59.2 seconds and then sustain 18K read/update pairs per second to random objects. That’s actually quite fast. (We invite you to test AppFabric’s performance; contact us if you need a test application.)

We also looked at load-balancing and recovery times for this workload of 2M objects when adding a server to a 2-server cluster, removing the third server, and also just killing the third server. This measures how well the IMDG’s servers use multithreading to maximize network bandwidth during load-balancing, and it also evaluates failure detection and recovery algorithms. These are areas in which we have invested heavily to take advantage of 10 Gbps (and faster) networks and to handle intermittent network delays inherent in virtual server infrastructures. While handling an access load of 6K read/update pairs per second, SOSS was measured to complete load-balancing in less than 35 seconds for joins and leaves and also to complete recovery and restore full throughput in this amount of time after a server failure.

Unwanted Client Exceptions from AppFabric

We were surprised to discover that AppFabric throws exceptions to the client application during load-balancing and recovery and due to security issues, as described in other blog posts. During load-balancing, the client gets the following exception when accessing the cache:

ErrorCode<ERRCA0017>:SubStatus<ES0006>:There is a temporary failure. Please retry later. (One or more specified cache servers are unavailable, which could be caused by busy network or servers. …)

You’ll also see a “Please retry later” exception (forever!) if the client runs with insufficient security authorization; we had to run the client as administrator to avoid problems without a much deeper investigation. The client also throws these exceptions during “graceful” host shutdowns and during recovery.

To keep application development as straightforward as possible, SOSS handles all exceptions that occur during membership changes and load-balancing, automatically retrying requests within the server as necessary. This avoids exposing the application to the details of the IMDG’s internal operations unless the IMDG is completely unreachable.

A Few Words on Design Philosophy: Keep It Simple

Our experience with customers consistently reinforces the need to keep middleware software as easy to install and use as possible, especially given that it is deployed as distributed software running on a cluster of servers. This philosophy manifests itself in all areas, including installation, application development, and cluster management. We believe that installing our software should be as straightforward as we can make it, requiring minimal knowledge of the host operating system and the fewest possible explicit configuration settings.

AppFabric caching does not share this design approach and requires the configuration of numerous components, including security policies, SQL Server or a network file share with the appropriate permissions, “lead” hosts, etc. Also, a long list of PowerShell commands for managing the cache must be learned and correctly used. Running AppFabric caching in production typically requires the use of a domain and is deeply tied into the Windows Server infrastructure. As expected, AppFabric requires the comprehensive knowledge of a Windows system administrator to install and manage.

In our testing, the net result was about a half-day investment in time on our part (and some frustration) getting AppFabric up and running. After spending an hour trying to join multiple cluster hosts in a Windows workgroup, we switched to using a domain controller to make this work; It just wasn’t worth the time to sort out the incorrect configuration settings. Head scratching was required in several other areas before our AppFabric cluster showed signs of life.

GUI Based Management Is Crucial for Distributed Software

A major source of frustration with AppFabric is its use of PowerShell commands to manage the cache cluster. It’s easy to forget that distributed software is running on multiple servers which need to be orchestrated as a group, and that’s hard to do with a sequence of shell commands because you can’t track the state of the cluster at a glance. It’s much easier to manage a cluster with a graphical user interface (GUI) which shows the status of all hosts and alerts you to dynamic situations, such as high usage, load-balancing, or network outages.

To take full advantage of the GUI approach, SOSS uses a Windows-based management console with intuitive controls that make management of the IMDG simple and easy to learn. The console also adds advanced visualization features, such as integrated performance charts and tabular usage charts, a “heat” map showing the availability and dynamic state of all regions within the distributed store, and an object browser that lets you see stored objects and examine both their metadata and contents.

The following screenshots illustrate SOSS’s performance charting and heat map. Note that the status of the IMDG and all cluster hosts is instantly visible in the tree list at the left:


Because the GUI management console receives periodic updates from all cluster hosts, it stays tightly integrated with the cluster and dynamically updates the latest status. In contrast, the use of shell commands just gives you a snapshot of the state of the cluster at one instant in time. We also have observed that these results quickly can become out of sync with what client applications are actually experiencing. For example, a shell command can report that the cluster is in an unknown state when in fact the client is successfully completing access requests. (In AppFabric, be prepared to wait for several seconds for management commands to time out when a cluster host goes into an unknown state.)

Use Fully Peer-to-Peer Design for Simplicity and High Availability

AppFabric uses a single store, either a file share or SQL Server, to hold the cluster’s configuration, which adds complexity to installation and creates a single point of failure. Although SQL Server can be clustered, this adds even more cost than just using a single server. To avoid these problems, SOSS automatically replicates its configuration files on every cluster host to maximize availability with a fully peer-to-peer implementation. This approach also keeps the user from having to deal with managing a configuration store.

The peer-to-peer issue arises again in AppFabric’s requiring a majority of lead hosts (presumably) to reconfigure the cluster after a membership change. Because some hosts are lead hosts and others are not, an AppFabric caching cluster will go down even when hosts are healthy. Moreover, the administrator has to understand and ensure that a majority hosts quorum of lead hosts exists, and the number of lead hosts varies with cluster size. For example, a small cluster of up to 20 hosts might use 3 lead hosts, requiring two hosts to form a quorum. If 2 of the 3 lead hosts go down, the cluster will go down even if there are 18 healthy hosts.

With ScaleOut, you don’t need to know what the word “quorum” means. SOSS sidesteps these issues by using a fully peer-to-peer membership so that all hosts can participate in constructing the cluster membership. (SOSS makes use of a ScaleOut Software patent which eliminates the need for a majority hosts quorum by building a logical quorum on a uniform set of servers.) This means that SOSS can avoid the use of lead hosts and maintains service as long as any host survives. System administrators view all cluster hosts as peers and do not have to be aware of SOSS’s internal mechanisms for implementing the cluster membership.

Distributed Cache or In-Memory Data Grid?

It’s actually not clear to us whether AppFabric’s design philosophy regarding high availability is more closely aligned with “best effort” distributed caches like memcached or with mission-critical in-memory data grids like SOSS and others. Data replication is turned off by default and is explicitly set on a namespace-by-namespace basis using management commands. (High availability apparently is not available on Windows Server 2008R2 Standard Edition and requires Enterprise Edition, which can cost about $2,800 more per server, or you must upgrade to Windows Server 2012.) Also, since data is hosted in managed code, access delays due to garbage collection (as well as the exceptions noted above) are to be expected. (Microsoft recommends not storing more than 16GB in a cache host to avoid GC pauses.) Lastly, AppFabric’s client cache is not kept coherent with the distributed cache, and so the client cache cannot be used for transactional data.

In contrast, SOSS automatically replicates all stored objects on multiple hosts to maintain high availability at all times. Likewise, it uses an unmanaged heap for stored data to keep access times as predictable as possible and avoid GC pauses. It also keeps all client caches coherent with the IMDG so that multi-threaded code running on the cluster can coordinate access to transactional data using the well understood sequential consistency model.

In Summary: Design for Ease of Use and High Performance

It was tempting to write this blog post as a feature comparison between AppFabric and SOSS. It’s clear that, as a full in-memory data grid, SOSS — unlike AppFabric — incorporates many features that go well beyond distributed caching. For example, SOSS lets applications query stored data by property using Microsoft’s own LINQ, and ScaleOut Analytics Server (SOAS) can perform data-parallel analysis of queried objects using application code that SOAS automatically deploys on the cluster. ScaleOut hServer takes analytics a step further by hosting full Hadoop MapReduce on the IMDG.

That said, since AppFabric is targeted at distributed caching, we felt that AppFabric users likely are more focused on issues regarding deployment, performance, and availability. Beyond just evaluating how well products like AppFabric and SOSS extract performance from scale-up, it’s also important to examine how they stack up in their overall role of providing scalable, highly available in-memory storage.

When looking at other design approaches, we feel that ScaleOut’s philosophy of easy-to-use, fully peer-to-peer design with GUI-based management provides important dividends by simplifying deployment and maximizing visibility, while driving high performance and availability. Not surprisingly, all of this lowers the total cost of ownership, which — as we have seen — even for “free” software is never zero.



Reports of Scale-Out’s Demise Are Greatly Exaggerated

A recent blog post highlighted a Microsoft technical report which asserts that most Hadoop workloads are 100 GB or smaller, and for almost all workloads except the very largest “a single ‘scale-up’ server can process each of these jobs and do as well or better than a cluster in terms of performance, cost, power, and server density.”  It’s certainly true that Hadoop MapReduce seems to have focused more on clustering issues than on single-server optimizations. But — to paraphrase Mark Twain — reports of scale-out’s demise for all but the largest workloads are greatly exaggerated. Continue reading


Using In-Memory Data Grids for ETL on Streaming Data

The Hadoop stack offers a compelling set of technologies and tools that can be deployed to serve as the core of next-generation data warehouses. The combination of scalable MapReduce to analyze petabyte data sets, parallel SQL query using Hive or Impala, and data visualization tools gives the analyst powerful resources for mining strategically important data. The Hadoop Distributed File System (HDFS) serves as a highly scalable data repository for hosting this data and efficiently feeding it into Hadoop’s parallel analysis engine. With industrial strength support from companies like Cloudera and others, the time is now right for deploying a Hadoop-based data warehouse: Continue reading


How Do In-Memory Data Grids Differ from Storm?

­­In last week’s blog post, we talked about the fact that our in-memory computing technology often is confused with popular other “big data” technologies, in particular Spark/Spark Streaming, Storm, and complex event processing (CEP).  As we mentioned, these innovative technologies are great at what they’re built for, but in-memory data grids (IMDGs) were created for a distinct use case. In this blog post, we will take a look at how IMDGs differ from Storm. Continue reading


How Do In-Memory Data Grids Differ from Spark?

As an in-memory computing vendor, we’ve found that our products often get confused with some popular open-source, in-memory technologies. Perhaps the three technologies we are most often confused with are Spark/Spark Streaming, Storm, and complex event processing (CEP). These innovative technologies are great at what they’re built for, but in-memory data grids (IMDGs) were created for a distinct use case. In this blog post, we will take a look at how IMDGs differ from Spark and Spark Streaming. Continue reading


Transforming Retail with Real-Time Analytics

Real-time analytics has the potential to transform operational systems by providing instant feedback that dramatically enhances how these systems respond to fast-changing events. For example, in a previous blog we saw how a hedge fund tracking its equity portfolios can respond to market fluctuations in milliseconds instead of minutes. However, these benefits are not restricted to financial services. In discussions with both e-commerce and brick-and-mortar retail companies, we also have identified opportunities to enhance their operational systems with real-time analytics. Let’s take a look at a few examples after a quick review of in-memory data grids (IMDGs). Continue reading


How Object-Oriented Programming Simplifies Data-Parallel Analytics

In-memory computing enables real-time analytics to be integrated into operational systems so that fast-changing, “live” data can be instantly evaluated to provide feedback in milliseconds or seconds. As we have discussed in previous blogs, the key to scalable performance and fast response time lies in the use of data-parallel programming techniques. How can we structure these computations to ease their integration into operational systems? Continue reading


Creating Data-Parallel Computations for Real-Time Analytics

Real-time analytics offers enterprises the ability to examine “live,” fast-changing data within operational systems and obtain feedback in milliseconds to seconds. For example, a hedge fund in a financial services organization can track the effect of market fluctuations on its portfolios (“strategies”) of long and short equity positions in various market areas (high tech, real estate, etc.) and immediately identify strategies requiring rebalancing. As we have seen in previous blogs, the key to real-time performance, especially for growing workloads, is to use in-memory, data-parallel computing, which delivers scalable throughput and minimizes performance losses due to data motion. But how can we easily structure computations to take advantage of this “scale out” technology? Continue reading


Scaling Real-Time Analytics with an IMDG

In the last blog we discussed how in-memory data grids (IMDGs) share the same architecture as parallel supercomputers. Parallel supercomputers typically add computing power by scaling “out” across a cluster of servers. Likewise, IMDGs scale out their in-memory data storage and analytics engine across service processes running on a cluster of servers. Let’s take a little deeper look at the benefits of scaling out, especially for computations in real-time analytics. Continue reading