Distributed database technologies to share data between database systems

John Charles Olamendy
16y
20.3k
0
0

Article

Introduction

This article is intended to illustrate the main concepts of distributed database technologies supporting the replication principles in order to share data among different database systems. Companies have multiple copies of data distributed across various autonomous sites, and we need to access to this distributed data in a particular moment as we were connected to a single database. A distributed database is a loosely-coupled connected databases spread across multiple servers. There are two methods to distribute data: distributed transactions and replication.

Replication is a set of technologies that allows moving data and database objects from one database system to another one across different geographic and platforms. This allows a user to work with a local copy of data, and then transfer the changes to one or more remote sites. This consistency is maintained by the synchronization process.

There are other reasons to use replication.

Availability. Replication provides another alternative to access data when one server is down.
Performance. Replication provides fast, local access to shared data. Users access to the server closest geographically to them.
Disconnected computing. We have a snapshot or partial copy (replica) of the data. It allows user to work with a subset of the data while they are disconnected from the central server. Later, when the connection is established, the underlying data is synchronized.
You want to copy distributed data to different sites on scheduled basis.
Business scenario where multiple users are executing transactions and you need to merge the data modifications and resolve potential conflicts.

SQL Server implements three type of asynchronous replication: snapshot, transactional, and merge replication.

Snapshot replication makes a copy of the data to propagate the changes for the whole data set rather than individual transactions.

Transaction replication allows incremental changes of data to be transferred either continuously or at specific intervals. This type of transaction is used where there is a high volume of inserts, updates and deletes intended to server-to-server (peer-to-peer) environment.

Merge replication allows subscribing servers to make changes and then it propagates those changes to the publishing servers, which in turn propagates their changes to the subscribers.

In order to implement the replication in SQL Server, the developers of SQL Server follows the Publisher-Subscriber model based on a metaphor from the publishing industry.

In this architecture, the publisher creates the publications which comprises of several articles (in this case, database objects such as tables, views, stored procedures) and then the publications are sent to the distributor. The distributor distributes the publications through several agents whose job is to deliver the publications and the underlying articles to the subscribers. The data is synchronized initially, and then the changes are propagated from the publishers to the subscribers.

Now, I'm going to explain the main components.

The distributor is the main component which links the publisher and the subscriber managing the data flow. In a snapshot and transactional replication, it stores the replica as well as the metadata and job history. In a case of push subscription, the distributor runs the replication agents. In a merge replication, the distributor stores the metadata and the history of the synchronization. It runs the Snapshot agent and the Merge agent.

The publisher provides data to the distributor to be replicated (the replicas). It can also manage the data changes.

The subscriber stores the replicas and receives updates from the publisher. It may act as publish and republish the replicas.

Let's talk about subscriptions. There are two methods for receiving the publications. Anonymous subscription and named subscription. Anonymous subscription doesn't enable store information about the subscriber on the publisher, thus subscriber needs to keep track of changes.

In named subscription, the subscriber is explicitly enabled by the publisher. There are two type of named subscription: push and pull.

In push subscription, the publisher propagates the changes to the subscribers without any request of them. Changes are pushed on demand, continuously or on a scheduled basis.

In pull subscription, the subscriber requests for data changes made at the publisher. Changes are pulled on demand or on a scheduled basis.

In this replication architecture, we have four types of agents: Snapshot, Log Reader, Queue Reader, Distribution and Merge.

The Snapshot agent is used in all the replications, specifically at the beginning of synchronizations making copies of data and schema to be published, storing the replica in the snapshot file and recording all the data about synchronization in the distribution database. It usually resides on the distributor.

The Log Reader agent monitors the transaction logs of the database involved in a transaction replication (thus used by transactional replication). Then the agent copies the data changes from the publisher to the distribution database on the distributor. Then the distributor sends the data to the subscribers.

The Distribution agent is used by snapshot and transactional replication is responsible to move snapshots and transactions held in the distribution database on the distributor to the subscribers. In push subscriptions, it resides on the publisher while on pull subscription this agent resides on the subscriber.

The Merge agent is used for merge replication. It applies initial snapshots and then monitors for changes to be merge on the subscribers resolving update conflicts. Each database, part of a merge replication, has one Merge agent.

In transactional replication you can either immediately update the messages or store them in a queue. In this case, you send the changes when the connection is available. The Queue Reader agent reads the changes from the queue and applies to the distribution database on the distributor.

Configuring a Snapshot Replication using GUI

In order to configure the Snapshot Replication, we need to know how it really works. First, the Snapshot agent establishes a connection between the Distributor and the Publisher (locking the tables to be published). Then, the agent generates the schema of tables and writes them to a file with the extension .sch in the snapshot folder on the Distributor. It also generates the script files for other database objects such as stored procedure, indexes, views, functions to be replicated. The Snapshot agent sends copies of data from the publication database on the Publisher to the snapshot folder on the Distributor (the lock on the publication database are released). The Distribution agent will read later the location of these files from the MSrepl_commands system table in the distribution database and send the replicate data to the subscribers.

Now let's begin to configure the Snapshot Replication using Microsoft SQL Server Management Studio. Right-click on the Local Publications node under the Replication folder and select New Publication. The New Publication Wizard appears (see Figure 1).

Figure 1.

Click Next, and choose the database that contains the objects to be published, and Click Next. The next page lists several types of publications, then select the Snapshot publication (Figure 2).

Figure 2.

Click Next, the next page displays a list of objects to be published (see Figure 3).

Figure 3.

Click Next, the next page asks whether you want to filter the rows horizontally. Click Next, and you'll see the Snapshot Agent page. You can specified when to run this agent. In my case, I select to create a snapshot immediately, and also it should run at scheduled times (see Figure 4).

Figure 4.

Click Next, and you'll see the Agent Security page and you must specify the account under which the Snapshot Agent will run. It's recommended to configure the connection to the Distributor using a domain account, such as the SQL Server agent service account (see Figure 5).

Figure 5.

Finally set a name for the publication and create it.

Now that we have a publication, it's time to create a subscription to this publication. Go to the SQL Server instance that wants to receive the replica data. Right-click on the Local Subscription node under the Replication tree, and select New Subscriptions option. Then the New Subscription Wizard appears (see Figure 6).

Figure 6.

Click Next, and in the publication page, browse to the publisher and select the publication that you want to. Click Next, the wizard asks you for the location of the distribution agent. You can run this agent either at the Distributor (push subscription) or the Subscriber (pull subscription). I have selected the option to run the Distributor agent at the Subscriber (pull subscription) (Figure 7).

Figure 7.

Click Next, and the page allows you to select the Subscriber and the underlying database. Click Next, and the Distribution Agent Security page (see Figure 8).

Figure 8.

Click Next, and in the Synchronization Schedule page, you can specify the schedule of your subscription (see Figure 9).

Figure 9.

Click Next, in the Initialize Subscription page, you can specify when you want to initialize the subscription. You have two options: immediately or at the first synchronization. I have selected "at the first synchronization" option.

Click Next, and the publication is created. You have a replication configuration which can be customized to your own situation.

Conclusion

In this article, I covered the key principles and techniques to distribute data between relational database systems. After you read this article, you have a deep insight of the Microsoft technologies supporting the replication concepts. After that, you can apply these techniques to your own business scenario and achieve a strong replication configuration.