How to Easily Deploy an IMDG in the Cloud

Cloud-based applications enjoy the unique elasticity that cloud infrastructures provide. As more computing resources are needed to handle a growing workload, virtual servers (also called cloud “instances”) can be added to take up the slack. For example, consider a Web server farm handling requests for Web users or mobile apps. Being able to add computing resources on demand keeps work queues small and ensures that Web users always see fast response times. And after a period of peak demand subsides, resources can be dialed back to minimize cost without compromising quality of service. Flexible pricing options on some public clouds, ranging from hourly to annual charges per instance, give organizations the ability to cost-effectively outsource hosting for their production applications.

IMDGs Help Scale Applications

In-memory data grids (IMDGs) add tremendous value to this scenario by providing a sharable, in-memory repository for an application’s fast-changing state information, such as shopping carts, financial transactions, pending orders, geolocation information, machine state, etc. This information tends to be rapidly updated and often needs to be shared across all application servers. For example, when external requests from a Web user are directed to different Web servers, the user’s state has to be tracked independent of which server is handling the request.
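To make the idea concrete, here is a minimal sketch of the pattern described above: session state (a shopping cart) keyed by user ID in a shared store, so any server can handle any request. The `SessionStore` class and the plain dict standing in for the grid are invented for illustration; they are not a real IMDG API.

```python
# Illustrative sketch: shared session state keyed by user ID, so any
# web server can handle any request. SessionStore and the backing dict
# are stand-ins for a real IMDG client; all names here are hypothetical.
import json

class SessionStore:
    """Stores serialized session objects under a per-user key."""
    def __init__(self, grid):
        self._grid = grid  # stand-in for a connection to the shared grid

    def save_cart(self, user_id, cart):
        self._grid[f"cart:{user_id}"] = json.dumps(cart)

    def load_cart(self, user_id):
        raw = self._grid.get(f"cart:{user_id}")
        return json.loads(raw) if raw else []

# Two "web servers" sharing one grid: server A writes, server B reads.
grid = {}                      # stand-in for the shared in-memory grid
server_a = SessionStore(grid)
server_b = SessionStore(grid)
server_a.save_cart("user42", [{"sku": "A100", "qty": 2}])
print(server_b.load_cart("user42"))  # the cart is visible on server B
```

The point is that the user's state lives in the grid rather than in any one server's process, so a load balancer can route each request to whichever server is least busy.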

With their tightly integrated client-side caching, IMDGs typically provide much faster access to this shared data than backing stores, such as blob stores, database servers, and NoSQL stores. They also offer a powerful computing platform for analyzing live data as it changes and generating immediate feedback or “operational intelligence”; for example, see this blog post describing the use of real-time analytics in a retail application.
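The following sketch illustrates why client-side caching matters: repeated reads of the same object are served from local memory instead of making a remote round trip. The classes are illustrative only, not ScaleOut's actual client API.

```python
# Minimal read-through client cache sketch. A "near cache" in the
# client process avoids repeated remote fetches for hot objects.
# The grid_read function simulates a remote fetch from the IMDG.
class CachingClient:
    def __init__(self, grid_read):
        self._grid_read = grid_read   # simulated remote grid fetch
        self._cache = {}              # near cache in the client process

    def get(self, key):
        if key not in self._cache:    # miss: fetch from the grid once
            self._cache[key] = self._grid_read(key)
        return self._cache[key]       # hit: served from local memory

calls = []
def grid_read(key):
    calls.append(key)                 # count simulated remote fetches
    return f"value-of-{key}"

client = CachingClient(grid_read)
client.get("cart:42")                 # first access goes to the grid
client.get("cart:42")                 # repeat access is purely local
print(len(calls))  # prints 1: one remote fetch served two reads
```

A real client cache also has to handle invalidation when another client updates the object, which this sketch omits.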

The Need to Keep It Simple

A key challenge in using an IMDG as part of a cloud-hosted application is to easily deploy, access, and manage the IMDG. To meet the needs of an elastic application, an IMDG must be designed to transparently scale its throughput by adding virtual servers and then automatically rebalance its in-memory storage to keep the workload evenly distributed. Likewise, it must be easy to remove IMDG servers when the workload decreases and creates excess capacity.

Like the applications they serve, IMDGs are deployed as a cluster of cloud-hosted virtual servers that scales as the workload demands. This scaling may differ from the application in the number of virtual servers required to handle the workload. To keep it simple, a cloud-hosted application should view the IMDG as a single abstract entity and not concern itself with individual IMDG servers or the data they hold. In particular, the application should not have to manage connections between N application instances and M IMDG servers, especially when N and M (as well as cloud IP addresses) vary over time.

Deploying an IMDG in the Cloud

Even though an IMDG comprises several servers, the simplest way to deploy and manage an IMDG in the cloud is to identify it as a single, coherent service. ScaleOut StateServer® (and ScaleOut Analytics Server®, which includes features for operational intelligence) takes this approach by naming a cloud-hosted IMDG with a single “store” name combined with access credentials. This name becomes the basis both for managing the deployed servers and for connecting applications to the IMDG.

For example, ScaleOut StateServer’s management console lets users deploy and manage an IMDG in both Amazon EC2 and Windows Azure by specifying a store name and the initial number of servers, as well as other optional parameters. The console does the rest, interacting with the cloud provider to accomplish several tasks, including starting up the IMDG, configuring its servers so that they can see each other, and recording metadata in the cloud needed to manage the deployment. Here’s the console wizard for deploying an IMDG in Amazon EC2:

[Image: ScaleOut management console wizard summary for deploying an IMDG in Amazon EC2]

When the IMDG’s servers start up, they make use of metadata to find and connect to each other and to form a single, scalable, peer-to-peer service. ScaleOut StateServer uses different techniques on EC2 and Azure to make use of each platform’s available metadata support. Also, the ScaleOut management console lets users specify security parameters appropriate to each cloud provider (e.g., security groups and VPC in EC2 and firewall settings in Azure), and the start-up process configures these parameters for all IMDG servers.
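The peer-discovery idea can be sketched as follows. The actual ScaleOut mechanism is provider-specific and not detailed in this post, so the `MetadataStore` class and `start_server` function below are purely hypothetical: each server records its address in cloud metadata under the store name, and newly started servers read that metadata to find existing peers.

```python
# Hedged sketch of metadata-based peer discovery. A dict stands in for
# cloud-hosted deployment metadata (e.g., instance tags); all names
# here are invented for illustration.
class MetadataStore:
    """Stand-in for cloud metadata recorded for a named store."""
    def __init__(self):
        self._entries = {}

    def register(self, store_name, address):
        self._entries.setdefault(store_name, set()).add(address)

    def peers(self, store_name):
        return sorted(self._entries.get(store_name, set()))

def start_server(meta, store_name, address):
    """Find existing peers for this store, then register ourselves."""
    peers = [p for p in meta.peers(store_name) if p != address]
    meta.register(store_name, address)
    return peers  # addresses this server would connect to

meta = MetadataStore()
print(start_server(meta, "mystore", "10.0.0.1"))  # [] - first server
print(start_server(meta, "mystore", "10.0.0.2"))  # ['10.0.0.1']
```

Because discovery goes through metadata keyed by the store name, no server needs a pre-configured list of peer IP addresses.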

The management console also lets users add (or remove) instances as necessary to handle changes in the workload. The IMDG automatically redistributes the workload across the servers as the membership changes.
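One common way to redistribute stored objects as servers join or leave is a consistent hash ring, which moves only a fraction of the keys on each membership change. Whether ScaleOut uses this exact scheme is not stated here, so treat the following only as an illustration of automatic rebalancing.

```python
# Consistent-hashing sketch: adding a server moves only a fraction of
# the keys, rather than rehashing everything. Illustrative only.
import bisect
import hashlib

class HashRing:
    def __init__(self, servers=(), vnodes=64):
        self._vnodes = vnodes
        self._ring = []            # sorted (hash, server) points
        for s in servers:
            self.add(s)

    def _hash(self, key):
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def add(self, server):
        for i in range(self._vnodes):
            bisect.insort(self._ring, (self._hash(f"{server}#{i}"), server))

    def remove(self, server):
        self._ring = [p for p in self._ring if p[1] != server]

    def owner(self, key):
        h = self._hash(key)
        idx = bisect.bisect(self._ring, (h, "")) % len(self._ring)
        return self._ring[idx][1]

ring = HashRing(["s1", "s2", "s3"])
keys = [f"obj{i}" for i in range(1000)]
before = {k: ring.owner(k) for k in keys}
ring.add("s4")                         # membership change
moved = sum(1 for k in keys if ring.owner(k) != before[k])
print(f"{moved} of {len(keys)} keys moved")
```

Roughly a quarter of the keys migrate to the new fourth server; the rest stay put, which keeps rebalancing traffic proportional to the change in capacity.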

Easily Hooking Up an Application to the IMDG

The power of managing an IMDG using a single store name becomes apparent when connecting instances of a cloud-based application to the IMDG. On-premise applications typically connect each client instance to an IMDG using a list of IP addresses corresponding to available IMDG servers. This process works well on premise because IP addresses typically are well known and static. However, it is impractical in the cloud since IP addresses change with each deployment or reboot of an IMDG server.

The solution to this problem is to let the application access the IMDG solely by its store name and cloud access credentials and have the IMDG find the servers. The store name and credentials are stored in a configuration file on each application instance with the access credentials fully encrypted. At startup time, the IMDG’s client library reads the configuration file and then uses previously stored metadata in the cloud to find the IMDG’s servers and connect to them. Note that this technique works well with both unencrypted and encrypted connections.
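The client-side pattern just described can be sketched as follows. The configuration file format, the `grid_mapper` function, and the metadata layout are all hypothetical, invented for this example; they are not ScaleOut's actual configuration syntax or API.

```python
# Sketch of the pattern above: the app knows only a store name and
# (encrypted) credentials from a local config file; a "grid mapper"
# resolves the store's current server addresses from cloud metadata.
import io
import json

CONFIG = """{"store_name": "mystore", "credentials": "<encrypted>"}"""

def read_config(fp):
    cfg = json.load(fp)
    return cfg["store_name"], cfg["credentials"]

def grid_mapper(store_name, credentials, metadata):
    """Resolve the store's current server addresses from metadata."""
    # A real implementation would authenticate with the cloud provider
    # using the decrypted credentials before reading the metadata.
    return metadata.get(store_name, [])

# Simulated cloud metadata recorded when the IMDG was deployed.
cloud_metadata = {"mystore": ["10.0.1.5:720", "10.0.1.6:720"]}

store, creds = read_config(io.StringIO(CONFIG))
servers = grid_mapper(store, creds, cloud_metadata)
print(servers)  # the client connects here without any static IP list
```

Because the server list comes from metadata at startup, redeploying or rebooting IMDG servers with new IP addresses requires no change to the application's configuration.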

The following diagram illustrates how application instances automatically connect to the IMDG’s servers using the client library’s “grid mapper” software, which retrieves cloud-based metadata to make connections to ScaleOut Analytics Server:

[Diagram: application instances connecting to the IMDG’s servers via the client library’s grid mapper]

The application need not be running in the cloud. The same mechanism allows an on-premise application to access a cloud-based IMDG, and it also allows an on-premise IMDG to replicate its data to a cloud-based IMDG or connect to one to form a virtual IMDG spanning both sites. (These features are provided in the ScaleOut GeoServer® product.) The following diagram illustrates connecting an on-premise application to a cloud-based IMDG:

[Diagram: an on-premise application connecting to a cloud-based IMDG]

Summing Up

As more and more server-side applications migrate to the cloud to take advantage of its elasticity, the power of IMDGs to unlock scalable performance and operational intelligence becomes increasingly compelling. Keeping IMDG deployment as simple as possible is critical to unlocking the potential of this combined solution. Leveraging cloud-based metadata to automate the configuration process lets the application ignore the details of the IMDG’s infrastructure and easily access its scalable storage and computing power.


AppFabric Caching: Retry Later

We have spent a great deal of time at ScaleOut Software re-architecting our in-memory data grid (IMDG)’s code base to make best use of many cores and large memory. For example, the IMDG must be able to efficiently create millions of objects in each server to make use of its huge storage capacity. Likewise, object access paths must be heavily multi-threaded and avoid lock contention to minimize access latency and maximize throughput. Also, load-balancing after membership changes must be both multi-threaded and pipelined to drive the network at maximum bandwidth.

Given all this, we thought it would be a good opportunity to see how we are doing relative to the competition, and in particular, relative to Microsoft’s AppFabric caching for Windows on-premise servers. In addition to looking at performance differences, we also want to compare ScaleOut StateServer (SOSS) to AppFabric on qualitative measures, such as features, ease of installation, and management. Continue reading


Reports of Scale-Out’s Demise Are Greatly Exaggerated

A recent blog post highlighted a Microsoft technical report which asserts that most Hadoop workloads are 100 GB or smaller, and for almost all workloads except the very largest “a single ‘scale-up’ server can process each of these jobs and do as well or better than a cluster in terms of performance, cost, power, and server density.”  It’s certainly true that Hadoop MapReduce seems to have focused more on clustering issues than on single-server optimizations. But — to paraphrase Mark Twain — reports of scale-out’s demise for all but the largest workloads are greatly exaggerated. Continue reading


Using In-Memory Data Grids for ETL on Streaming Data

The Hadoop stack offers a compelling set of technologies and tools that can be deployed to serve as the core of next-generation data warehouses. The combination of scalable MapReduce to analyze petabyte data sets, parallel SQL query using Hive or Impala, and data visualization tools gives the analyst powerful resources for mining strategically important data. The Hadoop Distributed File System (HDFS) serves as a highly scalable data repository for hosting this data and efficiently feeding it into Hadoop’s parallel analysis engine. With industrial-strength support from companies like Cloudera and others, the time is now right for deploying a Hadoop-based data warehouse. Continue reading


How Do In-Memory Data Grids Differ from Storm?

In last week’s blog post, we talked about the fact that our in-memory computing technology is often confused with other popular “big data” technologies, in particular Spark/Spark Streaming, Storm, and complex event processing (CEP). As we mentioned, these innovative technologies are great at what they’re built for, but in-memory data grids (IMDGs) were created for a distinct use case. In this blog post, we will take a look at how IMDGs differ from Storm. Continue reading


How Do In-Memory Data Grids Differ from Spark?

As an in-memory computing vendor, we’ve found that our products often get confused with some popular open-source, in-memory technologies. Perhaps the three technologies we are most often confused with are Spark/Spark Streaming, Storm, and complex event processing (CEP). These innovative technologies are great at what they’re built for, but in-memory data grids (IMDGs) were created for a distinct use case. In this blog post, we will take a look at how IMDGs differ from Spark and Spark Streaming. Continue reading


Transforming Retail with Real-Time Analytics

Real-time analytics has the potential to transform operational systems by providing instant feedback that dramatically enhances how these systems respond to fast-changing events. For example, in a previous blog we saw how a hedge fund tracking its equity portfolios can respond to market fluctuations in milliseconds instead of minutes. However, these benefits are not restricted to financial services. In discussions with both e-commerce and brick-and-mortar retail companies, we also have identified opportunities to enhance their operational systems with real-time analytics. Let’s take a look at a few examples after a quick review of in-memory data grids (IMDGs). Continue reading


How Object-Oriented Programming Simplifies Data-Parallel Analytics

In-memory computing enables real-time analytics to be integrated into operational systems so that fast-changing, “live” data can be instantly evaluated to provide feedback in milliseconds or seconds. As we have discussed in previous blogs, the key to scalable performance and fast response time lies in the use of data-parallel programming techniques. How can we structure these computations to ease their integration into operational systems? Continue reading


Creating Data-Parallel Computations for Real-Time Analytics

Real-time analytics offers enterprises the ability to examine “live,” fast-changing data within operational systems and obtain feedback in milliseconds to seconds. For example, a hedge fund in a financial services organization can track the effect of market fluctuations on its portfolios (“strategies”) of long and short equity positions in various market areas (high tech, real estate, etc.) and immediately identify strategies requiring rebalancing. As we have seen in previous blogs, the key to real-time performance, especially for growing workloads, is to use in-memory, data-parallel computing, which delivers scalable throughput and minimizes performance losses due to data motion. But how can we easily structure computations to take advantage of this “scale out” technology? Continue reading


Scaling Real-Time Analytics with an IMDG

In the last blog we discussed how in-memory data grids (IMDGs) share the same architecture as parallel supercomputers. Parallel supercomputers typically add computing power by scaling “out” across a cluster of servers. Likewise, IMDGs scale out their in-memory data storage and analytics engine across service processes running on a cluster of servers. Let’s take a little deeper look at the benefits of scaling out, especially for computations in real-time analytics. Continue reading
