
How to Bulk Insert Complex Objects into SQL Server Database
by Zoran Horvat @zoranh75

There are many practical situations in which an application needs to insert large quantities of data into the database: synchronizing with a remote system, periodically importing business data, or simply receiving too much data during peak hours. On the other hand, Object-Relational Mapping (ORM) frameworks, such as Entity Framework, are rapidly increasing their market share. Striking the right balance between the flexibility of an ORM and the performance expected by a common user is often a nontrivial task. In this article we are going to demonstrate several techniques that can be combined to significantly increase the rate at which data are stored into the database.

We will demonstrate the performance issues on a very small application with a data abstraction layer based on SQL Server and Entity Framework - a typical setting nowadays - and then make changes in order to improve performance when the application attempts to save a large number of objects into the database. The results presented below show that it is easy to achieve times that are one or even two orders of magnitude lower than with Entity Framework in a quite realistic setting. A custom-made bulk insert solution should be able to operate much faster than any ORM-based solution.

Database and Entity Model

To begin with, we will provide a T-SQL script which can be used to create the database for this demonstration. The database will consist of only two tables - the first one carrying information about registered users, and the second one storing known email addresses for each of the users. Here is the create script.

USE master

IF NOT EXISTS (SELECT * FROM dbo.syslogins WHERE name='bulkmaster')
    CREATE LOGIN bulkmaster WITH PASSWORD='bulk', CHECK_POLICY=OFF  -- weak demo password; policy check disabled
GO

CREATE DATABASE BulkDemo
GO

USE BulkDemo

CREATE USER bulkmaster FOR LOGIN bulkmaster
EXEC sp_addrolemember N'db_owner', N'bulkmaster'
GO

CREATE SCHEMA [users]
GO

CREATE TABLE [users].[SiteUser]
(
    [SiteUserID] INT PRIMARY KEY IDENTITY,
    [Username] VARCHAR(30) NOT NULL,
    [FirstName] NVARCHAR(50) NULL,
    [LastName] NVARCHAR(50) NULL,
    [YearOfBirth] INT NULL,
    CONSTRAINT uc_users_siteuser_username UNIQUE NONCLUSTERED (Username)
)

CREATE TABLE [users].[EmailAddress]
(
    [EmailAddressID] INT PRIMARY KEY IDENTITY,
    [SiteUserID] INT NOT NULL,
    [Address] VARCHAR(200) NOT NULL,
    [MailsSentCount] INT NOT NULL,
    [MailsReceivedCount] INT NOT NULL,
    CONSTRAINT fk_users_emailaddress_siteuserid
    FOREIGN KEY ([SiteUserID])
    REFERENCES [users].[SiteUser]([SiteUserID]),
    CONSTRAINT uc_users_emailaddress_address UNIQUE NONCLUSTERED (Address)
)

This script creates a database named BulkDemo, a login named bulkmaster, the users schema and two tables in it: SiteUser and EmailAddress. There are a couple of things that we should be aware of right away. User records are uniquely identified by the autoincrement SiteUserID field, which is the primary key of the SiteUser table. However, persons described by records in the SiteUser table are also identified by their usernames. This field is a natural key for the SiteUser table - the field which uniquely identifies a person in the real world. Observe the unique constraint added to the SiteUser table, indicating that a username cannot be repeated within the table. Similarly, email addresses are unique - hence the unique constraint in the EmailAddress table - and consequently the Address field is treated as the natural key for these objects. We will return to this design detail in due time; for now, just remember that the application has two distinct ways to identify each of the objects in the system: the primary key and the natural key.

Now that there is a database available, we are ready to generate the Entity Framework model from it. Two model classes will be generated, both equipped with navigation properties. The SiteUser class exposes a collection property EmailAddresses, while the EmailAddress class exposes a single-valued property SiteUser which points to the address's owner. The figure below shows the entity model generated from the database.

What Entity Framework Can Do

Now that we have a database and a corresponding entity model, we are ready to write some code that stores data into the database. Take a look at the following function:

bool AddSiteUser(SiteUser user)
{

    bool saved = false;

    try
    {

        using (BulkDemoEntities ent = new BulkDemoEntities())
        {
            ent.SiteUsers.AddObject(user);
            ent.SaveChanges();
        }

        Console.WriteLine("{0}: SiteUserID={1}", user.Username, user.SiteUserID);
        foreach (EmailAddress ea in user.EmailAddresses)
            Console.WriteLine("    {0}: EmailAddressID={1}", ea.Address, ea.EmailAddressID);

        saved = true;

    }
    catch (System.Exception ex)
    {
        Console.WriteLine(ex.Message);
    }

    return saved;

}

This function receives a SiteUser model instance and simply stores it into the database. Here is the code which calls the function:

SiteUser user = new SiteUser();
user.Username = "someuser";
user.FirstName = "First";
user.LastName = "Last";
user.YearOfBirth = 1982;

EmailAddress email = new EmailAddress();
email.Address = "someone@somewhere.org";
email.MailsSentCount = 3096;
email.MailsReceivedCount = 1;
user.EmailAddresses.Add(email);

email = new EmailAddress();
email.Address = "alternate@somewhere.org";
email.MailsSentCount = 19367;
email.MailsReceivedCount = 2;
user.EmailAddresses.Add(email);

AddSiteUser(user);

And here is the output produced when the code is executed:

someuser: SiteUserID=14936
    someone@somewhere.org: EmailAddressID=92637
    alternate@somewhere.org: EmailAddressID=92638

Why is this output so important? Because it contains the actual values of the autoincrement identity columns: one value for SiteUserID and two values for EmailAddressID. We have not provided those values; they were fed back into the model objects by Entity Framework. This is part of the magic provided by the ORM of choice. To better understand how this happens, we can trace the statements executed in the database when the code runs:

BEGIN TRANSACTION

INSERT INTO [users].[SiteUser] ([Username], [FirstName], [LastName], [YearOfBirth])
VALUES ('someuser' /* @0 */, 'First' /* @1 */, 'Last' /* @2 */, 1982 /* @3 */)

SELECT [SiteUserID]
FROM [users].[SiteUser]
WHERE @@ROWCOUNT > 0 AND [SiteUserID] = SCOPE_IDENTITY()

INSERT INTO [users].[EmailAddress] ([SiteUserID], [Address], [MailsSentCount], [MailsReceivedCount])
VALUES (14936 /* @0 */, 'someone@somewhere.org' /* @1 */, 3096 /* @2 */, 1 /* @3 */)

SELECT [EmailAddressID]
FROM [users].[EmailAddress]
WHERE @@ROWCOUNT > 0 AND [EmailAddressID] = SCOPE_IDENTITY()

INSERT INTO [users].[EmailAddress] ([SiteUserID], [Address], [MailsSentCount], [MailsReceivedCount])
VALUES (14936 /* @0 */, 'alternate@somewhere.org' /* @1 */, 19367 /* @2 */, 2 /* @3 */)

SELECT [EmailAddressID]
FROM [users].[EmailAddress]
WHERE @@ROWCOUNT > 0 AND [EmailAddressID] = SCOPE_IDENTITY()

COMMIT TRANSACTION

Observe closely the way in which Entity Framework queries the database after each INSERT statement, just to fetch the identity value that was assigned by the database engine to the newly inserted row. These values are then assigned to the identity properties of the model objects. Note how stringent each identity query is: failure to fetch the exact identity value of the record inserted by the previous statement would result in an exception thrown by Entity Framework.

What Entity Framework Can't Do

The SQL trace presented above clearly shows the weakest point of Entity Framework (and of other similar ORMs, to be clear). Each inserted object required two SQL statements - one to insert the record, and another to obtain the identity of the new record. We had one master object (SiteUser) and two detail objects (EmailAddresses), and that required six SQL statements to be sent to the database server, with six results awaited back - a total of six round-trips to the server, not counting the time needed to transfer the statements and their corresponding results.

Now if we extend this case to one thousand objects, we get six thousand SQL statements sent to the server. For one hundred thousand objects we quickly exceed half a million statements. That is certainly going to take some time to execute; with this approach, we quickly run out of time. Let alone the fact that we are needlessly spending precious database resources just to communicate with the engine, rather than letting it do what it was built for: storing data.
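The arithmetic above can be checked with a few lines (a language-neutral sketch in Python; the per-record cost of two statements comes straight from the trace):

```python
# Each record costs two statements: the INSERT itself plus the
# SELECT ... SCOPE_IDENTITY() that fetches its new identity value.
STATEMENTS_PER_RECORD = 2

# One complex object = one SiteUser row plus two EmailAddress rows.
RECORDS_PER_OBJECT = 1 + 2

def statements_for(object_count):
    return object_count * RECORDS_PER_OBJECT * STATEMENTS_PER_RECORD

print(statements_for(1))        # 6 - matches the trace above
print(statements_for(1_000))    # 6000
print(statements_for(100_000))  # 600000 - over half a million
```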

This form of communication is often referred to as "chatty" or, more formally, fine-grained. The client application contacts the server too many times to accomplish a complex operation. A better approach is a coarse-grained interface, in which the client gathers all the data required for the operation and then communicates the whole request to the server in one go, saving many round-trips. But on second thought, Entity Framework is not really in a position to act any differently. Although it could somehow send all the data to the database server in one bundle, fetching identity values and assigning them back to the original objects is a notoriously complex task in the general case. An ORM has no notion of natural keys in the first place. To make things worse, the current operation is not the only one executed on the database - at any given time other sessions are free to insert their own objects at will. Picking up records that were inserted "lately" is not so simple when some of the latest records were actually inserted by other parties. Even worse, complex models consist of records stored in multiple tables - some of them pre-existing, others newly created. The possibilities are so diverse that there is no way to design a comprehensive, yet practical, system that covers them all.
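The difference between the two styles can be sketched with a toy server that only counts round-trips (a Python sketch; FakeServer is a made-up stand-in, not a real database API):

```python
class FakeServer:
    """Hypothetical stand-in for a database server; it only counts calls."""
    def __init__(self):
        self.round_trips = 0

    def execute(self, statements):
        self.round_trips += 1           # every call is one round-trip
        return ["ok"] * len(statements)

rows = ["row-%d" % i for i in range(1000)]

chatty = FakeServer()
for row in rows:                        # fine-grained: one call per row
    chatty.execute(["INSERT " + row])

coarse = FakeServer()
coarse.execute(["INSERT " + r for r in rows])   # coarse-grained: one call

print(chatty.round_trips, coarse.round_trips)   # 1000 vs 1
```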

And where general solutions fall short, tailored solutions prosper. Here comes one of them, neatly designed for our users-with-emails application alone.

Custom Solution for Bulk Insert

After discussing the way in which Entity Framework inserts multiple objects, we can now turn our attention to alternatives. First of all, the .NET Framework provides a solution for fast insertion of flat tables: a class called SqlBulkCopy (in the System.Data.SqlClient namespace). This class allows setting up a mapping between a DataTable's columns and the columns of a specific target database table. Once everything is in place, we simply pass the DataTable object to the SqlBulkCopy object and it does whatever is needed to insert all the rows into the database table.

The good news about the SqlBulkCopy class is that it contacts the database only once, unless specified otherwise through its properties. The number of round-trips to the database server is significantly reduced, communication with the server is optimized, and resources are consumed responsibly. Below is the function which inserts SiteUser records (without the associated EmailAddress records this time).

void BulkInsertSiteUsers(SiteUser[] users)
{

    System.Data.DataTable table = new System.Data.DataTable();
    table.Columns.Add("Username", typeof(string));
    table.Columns.Add("FirstName", typeof(string));
    table.Columns.Add("LastName", typeof(string));
    table.Columns.Add("YearOfBirth", typeof(int));

    for (int i = 0; i < users.Length; i++)
        table.Rows.Add(new object[] {
                            users[i].Username,
                            users[i].FirstName,
                            users[i].LastName,
                            users[i].YearOfBirth });

    string connString = System.Configuration.ConfigurationManager.ConnectionStrings["BulkDemo"].ConnectionString;

    using (System.Data.SqlClient.SqlConnection conn = new System.Data.SqlClient.SqlConnection(connString))
    {

        conn.Open();

        using (System.Data.SqlClient.SqlTransaction tran = conn.BeginTransaction())
        {

            // The bulk copy must be enlisted in the open transaction; otherwise
            // WriteToServer fails when the connection has a pending local transaction.
            using (System.Data.SqlClient.SqlBulkCopy bulkCopy =
                       new System.Data.SqlClient.SqlBulkCopy(conn, System.Data.SqlClient.SqlBulkCopyOptions.Default, tran))
            {

                bulkCopy.DestinationTableName = "[users].[SiteUser]";
                bulkCopy.ColumnMappings.Add("Username", "Username");
                bulkCopy.ColumnMappings.Add("FirstName", "FirstName");
                bulkCopy.ColumnMappings.Add("LastName", "LastName");
                bulkCopy.ColumnMappings.Add("YearOfBirth", "YearOfBirth");

                bulkCopy.WriteToServer(table);

            }

            tran.Commit();

        }

    }

}

This is obviously not the most beautiful data insertion code of all time. But it performs an order of magnitude faster than Entity Framework does with its own devices. Nevertheless, we should not forget that Entity Framework did one thing more - it asked the database server to communicate back the identity values assigned to each of the new rows. The custom function above did not do that.

With the problem of speed solved, we need to address the problem of having the identity values returned from the server. One solution is to somehow tag the records that are inserted by the SqlBulkCopy object, and then use that tag to fetch the identity values back, sorting the rows by the natural key. By sorting the original data in the same way, we ensure that the identity values will be assigned to the corresponding objects with no mistakes.
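The matching step itself is easy to get wrong, so here is a minimal sketch of it in Python, with plain dicts standing in for SiteUser objects and a list of identities playing the role of the result set ordered by the natural key (all names here are made up for illustration):

```python
def assign_identities(objects, id_rows, natural_key):
    # id_rows holds the identity values exactly as the database returns
    # them: ordered by the natural key. Sorting the in-memory objects by
    # the same key lines both sequences up one-to-one.
    for obj, identity in zip(sorted(objects, key=lambda o: o[natural_key]),
                             id_rows):
        obj["id"] = identity

# Inserted in this order, so the database assigns ids 1, 2, 3 ...
users = [{"username": "zoe"}, {"username": "ann"}, {"username": "mia"}]
# ... but SELECT ... ORDER BY Username returns ann(2), mia(3), zoe(1).
assign_identities(users, [2, 3, 1], "username")
print([u["id"] for u in users])   # [1, 2, 3] - each object got its own id
```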

But before we can implement this solution, we need to add support for "tags". One simple solution is to add another table (e.g. named BulkInsertSession), whose sole purpose is to generate an autoincrement identity and thus provide a unique tag that can be used to distinguish rows created by the current bulk insert operation from rows created by other sessions. Of course, the SiteUser table has to be modified to accommodate a foreign key that connects its rows to the corresponding bulk insert session.

We can think of this change as a means of designing the database around the notion of bulk inserts, rather than performing bulk inserts over an existing database schema, no matter how unsuitable that schema is for such an operation. Putting the bulk insertion requirement up front, we can modify the database design to give a helping hand when the time comes to insert data. One last modification is to index the BulkInsertSessionID foreign key in the SiteUser table for faster retrieval of SiteUser records that belong to a given bulk insert session. This index will have some impact on insert speed, but it will pay off quickly when the time comes to select by bulk insert session.

Here is the modified part of the database create script:

CREATE SCHEMA [bulk]
GO

CREATE TABLE [bulk].[BulkInsertSession]
(
    BulkInsertSessionID INT PRIMARY KEY IDENTITY,
    TsCreated DATETIME NOT NULL
)
GO

CREATE SCHEMA [users]
GO

CREATE TABLE [users].[SiteUser]
(
    [SiteUserID] INT PRIMARY KEY IDENTITY,
    [Username] VARCHAR(30) NOT NULL,
    [FirstName] NVARCHAR(50) NULL,
    [LastName] NVARCHAR(50) NULL,
    [YearOfBirth] INT NULL,
    [BulkInsertSessionID] INT NULL,
    CONSTRAINT uc_users_siteuser_username UNIQUE NONCLUSTERED (Username),
    CONSTRAINT fk_users_siteuser_bulkinsertsessionid
    FOREIGN KEY ([BulkInsertSessionID])
    REFERENCES [bulk].[BulkInsertSession]([BulkInsertSessionID])
)
GO

CREATE INDEX ui_users_siteuser_bulkinsertsessionid ON [users].[SiteUser] (BulkInsertSessionID)
GO

CREATE TABLE [users].[EmailAddress]
(
    [EmailAddressID] INT PRIMARY KEY IDENTITY,
    [SiteUserID] INT NOT NULL,
    [Address] VARCHAR(200) NOT NULL,
    [MailsSentCount] INT NOT NULL,
    [MailsReceivedCount] INT NOT NULL,
    CONSTRAINT fk_users_emailaddress_siteuserid
    FOREIGN KEY ([SiteUserID])
    REFERENCES [users].[SiteUser]([SiteUserID]),
    CONSTRAINT uc_users_emailaddress_address UNIQUE NONCLUSTERED ([Address])
)
GO

Observe that the EmailAddress table does not need to be extended with a BulkInsertSessionID foreign key. This is because each EmailAddress record has a corresponding SiteUser parent record which carries the BulkInsertSessionID value for both of them. Here is the updated entity model:

In this way, we have prepared the ground for the enhanced function, the one which fetches the identity values back after the insert. Below is the modified function which stores a large number of SiteUser objects (still without their EmailAddress details).

void InsertBulkUsers(SiteUser[] users)
{

    string connString = System.Configuration.ConfigurationManager.ConnectionStrings["BulkDemo"].ConnectionString;

    using (System.Data.SqlClient.SqlConnection conn = new System.Data.SqlClient.SqlConnection(connString))
    {

        conn.Open();

        using (System.Data.SqlClient.SqlTransaction tran = conn.BeginTransaction())
        {

            System.Data.SqlClient.SqlCommand cmd =
                new System.Data.SqlClient.SqlCommand(
                    "INSERT INTO [bulk].[BulkInsertSession]([TsCreated]) VALUES (CURRENT_TIMESTAMP)", conn, tran);

            cmd.ExecuteNonQuery();

            cmd = new System.Data.SqlClient.SqlCommand(
                "SELECT [BulkInsertSessionID] FROM [bulk].[BulkInsertSession] " +
                "WHERE @@ROWCOUNT > 0 and [BulkInsertSessionID] = SCOPE_IDENTITY()", conn, tran);

            int bulkInsertSessionId = (int)cmd.ExecuteScalar();

            System.Data.DataTable table = new System.Data.DataTable();
            table.Columns.Add("Username", typeof(string));
            table.Columns.Add("FirstName", typeof(string));
            table.Columns.Add("LastName", typeof(string));
            table.Columns.Add("YearOfBirth", typeof(int));
            table.Columns.Add("BulkInsertSessionID", typeof(int));

            for (int i = 0; i < users.Length; i++)
                table.Rows.Add(new object[] {
                            users[i].Username,
                            users[i].FirstName,
                            users[i].LastName,
                            users[i].YearOfBirth,
                            bulkInsertSessionId });

            using (System.Data.SqlClient.SqlBulkCopy bulkCopy =
                       new System.Data.SqlClient.SqlBulkCopy(conn, System.Data.SqlClient.SqlBulkCopyOptions.Default, tran))
            {

                bulkCopy.DestinationTableName = "[users].[SiteUser]";
                bulkCopy.ColumnMappings.Add("Username", "Username");
                bulkCopy.ColumnMappings.Add("FirstName", "FirstName");
                bulkCopy.ColumnMappings.Add("LastName", "LastName");
                bulkCopy.ColumnMappings.Add("YearOfBirth", "YearOfBirth");
                bulkCopy.ColumnMappings.Add("BulkInsertSessionID", "BulkInsertSessionID");

                bulkCopy.WriteToServer(table);

            }

            cmd = new System.Data.SqlClient.SqlCommand(
                "SELECT [SiteUserID] " +
                "FROM [users].[SiteUser] " +
                "WHERE [BulkInsertSessionID]=@bulkInsertSessionId " +
                "ORDER BY [Username] ASC", conn, tran);
            cmd.Parameters.Add(new System.Data.SqlClient.SqlParameter("@bulkInsertSessionId", bulkInsertSessionId));

            SiteUser[] sortedUsers =
                (from u in users
                 orderby u.Username
                 select u).ToArray();

            using (System.Data.SqlClient.SqlDataReader reader = cmd.ExecuteReader())
            {
                int index = 0;
                while (reader.Read())
                    sortedUsers[index++].SiteUserID = (int)reader[0];
            }

            cmd = new System.Data.SqlClient.SqlCommand(
                "UPDATE [users].[SiteUser] " +
                "SET [BulkInsertSessionID] = NULL " +
                "WHERE [BulkInsertSessionID] = @bulkInsertSessionId", conn, tran);
            cmd.Parameters.Add(new System.Data.SqlClient.SqlParameter("@bulkInsertSessionId", bulkInsertSessionId));
            cmd.ExecuteNonQuery();

            cmd = new System.Data.SqlClient.SqlCommand(
                "DELETE FROM [bulk].[BulkInsertSession] " +
                "WHERE [BulkInsertSessionID] = @bulkInsertSessionId", conn, tran);
            cmd.Parameters.Add(new System.Data.SqlClient.SqlParameter("@bulkInsertSessionId", bulkInsertSessionId));
            cmd.ExecuteNonQuery();

            tran.Commit();

        }

    }

}

At this point we have a fully operational solution for the parent table, but we still lack data in the corresponding detail table. In order to save the records about email addresses, we only need to extend the solution one step further. The parent identities are now known, and it is not a problem to write them into each email address record. Once the data are stored in the database (using SqlBulkCopy), we can fetch the EmailAddress identities in the same way as we did with the SiteUser identity values.

Optimizing Detail Insert

After the parent objects are inserted, their corresponding child objects are already equipped with the parent record identities. This is sufficient for us to repeat the whole procedure of bulk insert and identity retrieval on the child records. Translated to our example, this means that, having the SiteUser records inserted and their identity values fetched back, we can easily construct a DataTable with EmailAddress records that reference their parent SiteUser records. Since EmailAddress has the Address field as its natural key, fetching the autoincrement identity values is as simple as it was with the SiteUser records. That closes the full circle of operations needed to bulk insert complex objects - SiteUsers with their associated EmailAddresses - and to update the model objects with the identity values assigned by the database.
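Flattening the child objects into one batch, once the parent identities are known, can be sketched like this (dicts again standing in for the model objects; the field names are illustrative):

```python
users = [
    {"id": 10, "emails": [{"address": "a@x.org"}, {"address": "b@x.org"}]},
    {"id": 11, "emails": [{"address": "c@y.org"}]},
]

# One flat row per email address, each carrying its parent's identity -
# the same shape that would be loaded into the EmailAddress DataTable.
rows = [(u["id"], e["address"]) for u in users for e in u["emails"]]
print(rows)   # [(10, 'a@x.org'), (10, 'b@x.org'), (11, 'c@y.org')]
```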

But now we may ask a legitimate question: how often do we actually use the identity values assigned back to the model objects? With natural keys readily available - especially when communicating with human users, who normally understand natural keys better than autoincrement identities - the answer would probably be: not so often. So why do we fetch the identity values at all? It is clear why ORMs do that - they are meant to be general-purpose, which means covering the cases when the caller really needs to know the identity values (e.g. to accomplish the next operation, which is bound to the keys assigned by the database). When designing a custom bulk insert operation, we are in charge of fetching the identity values back. This means that we are in a perfect position to decide whether to perform that step or to skip it altogether.

The optimization that can be applied is straightforward. Along with the complex objects that the caller would like to have inserted into the database, we also expect a Boolean parameter indicating whether to fetch the identity values or not. If not, then only the necessary ones are fetched along the way - the identities of the SiteUser records, which are needed to tie up the child records. Otherwise, if the caller is keen to know the identities, the full identity fetch step is executed both on the parent and on the child table. Fetching identity values takes nearly the same time as bulk inserting the data. When the majority of the records stored in the database are child records not referenced by any other record, avoiding the identity fetch step on the child tables can significantly enhance the performance of the whole operation.

Complete Bulk Insert Solution

In this section we will provide all the functions needed to bulk insert SiteUser model objects together with their contained EmailAddress objects. The root function, the only one called from the outside, is BulkInsertSiteUsers; it receives an array of site users, followed by a Boolean flag which indicates whether to fetch the identity values or not. Here is the source code.

/// <summary>
/// Stores site users from the specified array into the database,
/// including their child objects. Optionally, fetches identity values
/// assigned to all objects by the database.
/// </summary>
/// <param name="users">Array containing site user objects that should be
/// stored to database.</param>
/// <param name="fetchIdentities">Indicates whether identity values
/// assigned to objects by the database should be fetched and
/// set to corresponding model properties.</param>
void BulkInsertSiteUsers(SiteUser[] users, bool fetchIdentities)
{

    string connString = System.Configuration.ConfigurationManager.ConnectionStrings["BulkDemo"].ConnectionString;

    using (System.Data.SqlClient.SqlConnection conn = new System.Data.SqlClient.SqlConnection(connString))
    {

        conn.Open();

        using (System.Data.SqlClient.SqlTransaction tran = conn.BeginTransaction())
        {

            int bulkInsertSessionId = InitializeBulkInsertSession(conn, tran);

            BulkInsertFlatUsers(users, bulkInsertSessionId, true, conn, tran);  // Always fetch identities
                                                                                // for parent objects

            BulkInsertFlatEmailAddresses(users, bulkInsertSessionId, fetchIdentities, conn, tran);

            CompleteBulkInsertSession(bulkInsertSessionId, conn, tran);

            tran.Commit();

        }

    }

}

/// <summary>
/// Inserts a row into bulk.BulkInsertSession table and returns its autoincrement identity value.
/// By doing so, this method provides a unique value which can be used to tag all rows
/// inserted by the bulk insert procedure.
/// </summary>
/// <param name="conn">Active database connection.</param>
/// <param name="tran">Transaction under which operation is performed.</param>
/// <returns>Value of the autoincrement identity assigned to the newly inserted row.</returns>
int InitializeBulkInsertSession(System.Data.SqlClient.SqlConnection conn,
                                        System.Data.SqlClient.SqlTransaction tran)
{

    System.Data.SqlClient.SqlCommand cmd = new System.Data.SqlClient.SqlCommand(
        "INSERT INTO [bulk].[BulkInsertSession]([TsCreated]) VALUES (CURRENT_TIMESTAMP)", conn, tran);

    cmd.ExecuteNonQuery();

    cmd = new System.Data.SqlClient.SqlCommand(
        "SELECT [BulkInsertSessionID] FROM [bulk].[BulkInsertSession] " +
        "WHERE @@ROWCOUNT > 0 and [BulkInsertSessionID] = SCOPE_IDENTITY()", conn, tran);

    int bulkInsertSessionId = (int)cmd.ExecuteScalar();

    return bulkInsertSessionId;

}

/// <summary>
/// Bulk inserts SiteUser data into the database, ignoring detail objects. Optionally, fetches
/// identity values assigned to corresponding database rows.
/// </summary>
/// <param name="users">Array containing site user model objects that should be inserted to database.</param>
/// <param name="bulkInsertSessionId">Unique bulk insert session identity, used to tag all
/// newly inserted rows.</param>
/// <param name="fetchIdentities">Indicates whether identity values should be fetched
/// and assigned back to model objects in the <paramref name="users"/> array after
/// bulk insert operation has completed.</param>
/// <param name="conn">Active database connection.</param>
/// <param name="tran">Transaction under which operation is performed.</param>
void BulkInsertFlatUsers(SiteUser[] users, int bulkInsertSessionId, bool fetchIdentities,
                                System.Data.SqlClient.SqlConnection conn,
                                System.Data.SqlClient.SqlTransaction tran)
{

    System.Data.DataTable table = new System.Data.DataTable();

    table.Columns.Add("Username", typeof(string));
    table.Columns.Add("FirstName", typeof(string));
    table.Columns.Add("LastName", typeof(string));
    table.Columns.Add("YearOfBirth", typeof(int));
    table.Columns.Add("BulkInsertSessionID", typeof(int));

    for (int i = 0; i < users.Length; i++)
        table.Rows.Add(new object[] {
                            users[i].Username,
                            users[i].FirstName,
                            users[i].LastName,
                            users[i].YearOfBirth,
                            bulkInsertSessionId });

    using (System.Data.SqlClient.SqlBulkCopy bulkCopy =
            new System.Data.SqlClient.SqlBulkCopy(conn, System.Data.SqlClient.SqlBulkCopyOptions.Default, tran))
    {

        bulkCopy.DestinationTableName = "[users].[SiteUser]";
        bulkCopy.ColumnMappings.Add("Username", "Username");
        bulkCopy.ColumnMappings.Add("FirstName", "FirstName");
        bulkCopy.ColumnMappings.Add("LastName", "LastName");
        bulkCopy.ColumnMappings.Add("YearOfBirth", "YearOfBirth");
        bulkCopy.ColumnMappings.Add("BulkInsertSessionID", "BulkInsertSessionID");
        bulkCopy.BulkCopyTimeout = 300000;

        bulkCopy.WriteToServer(table);

    }

    if (fetchIdentities)
        LoadSiteUserIdentities(users, bulkInsertSessionId, conn, tran);

}

/// <summary>
/// Fetches identity values assigned by the database engine to records
/// that correspond with specified site user model objects.
/// </summary>
/// <param name="users">Site user objects that were inserted to database.
/// On output these objects will have their SiteUserID properties set
/// to corresponding values loaded from the database.</param>
/// <param name="bulkInsertSessionId">Bulk insert session identity,
/// which can be used to distinguish database records that were inserted
/// during current bulk insert session.</param>
/// <param name="conn">Active database connection.</param>
/// <param name="tran">Transaction under which operation is performed.</param>
void LoadSiteUserIdentities(SiteUser[] users, int bulkInsertSessionId,
                                    System.Data.SqlClient.SqlConnection conn,
                                    System.Data.SqlClient.SqlTransaction tran)
{

    System.Data.SqlClient.SqlCommand cmd = new System.Data.SqlClient.SqlCommand(
        "SELECT [SiteUserID] " +
        "FROM [users].[SiteUser] " +
        "WHERE [BulkInsertSessionID]=@bulkInsertSessionId " +
        "ORDER BY [Username] ASC", conn, tran);
    cmd.Parameters.Add(new System.Data.SqlClient.SqlParameter("@bulkInsertSessionId", bulkInsertSessionId));

    SiteUser[] sortedUsers =
        (from u in users
         orderby u.Username
         select u).ToArray();

    using (System.Data.SqlClient.SqlDataReader reader = cmd.ExecuteReader())
    {
        int index = 0;
        while (reader.Read())
        {
            int id = (int)reader[0];
            foreach (EmailAddress ea in sortedUsers[index].EmailAddresses)
                ea.SiteUserID = id;
            sortedUsers[index++].SiteUserID = id;
        }
    }

}

/// <summary>
/// Bulk inserts only EmailAddress records to the database, assuming that parent
/// SiteUser records are already inserted and their identity values fetched back
/// into model objects.
/// </summary>
/// <param name="users">Array of users from which EmailAddress objects should
/// be extracted and inserted to database.</param>
/// <param name="bulkInsertSessionId">Bulk insert session identity,
/// which can be used to distinguish database records that were inserted
/// during current bulk insert session.</param>
/// <param name="fetchIdentities">Indicates whether identity values should be fetched back
/// from the database and stored in EmailAddress objects.</param>
/// <param name="conn">Active database connection.</param>
/// <param name="tran">Transaction under which operation is performed.</param>
void BulkInsertFlatEmailAddresses(SiteUser[] users, int bulkInsertSessionId, bool fetchIdentities,
                                            System.Data.SqlClient.SqlConnection conn,
                                            System.Data.SqlClient.SqlTransaction tran)
{

    EmailAddress[] addresses =
        (from u in users
         from e in u.EmailAddresses
         select e).ToArray();

    System.Data.DataTable table = new System.Data.DataTable();

    table.Columns.Add("SiteUserID", typeof(int));
    table.Columns.Add("Address", typeof(string));
    table.Columns.Add("MailsSentCount", typeof(int));
    table.Columns.Add("MailsReceivedCount", typeof(int));

    for (int i = 0; i < addresses.Length; i++)
        table.Rows.Add(new object[] {
                            addresses[i].SiteUserID,
                            addresses[i].Address,
                            addresses[i].MailsSentCount,
                            addresses[i].MailsReceivedCount});

    using (System.Data.SqlClient.SqlBulkCopy bulkCopy =
            new System.Data.SqlClient.SqlBulkCopy(conn, System.Data.SqlClient.SqlBulkCopyOptions.Default, tran))
    {

        bulkCopy.DestinationTableName = "[users].[EmailAddress]";
        bulkCopy.ColumnMappings.Add("SiteUserID", "SiteUserID");
        bulkCopy.ColumnMappings.Add("Address", "Address");
        bulkCopy.ColumnMappings.Add("MailsSentCount", "MailsSentCount");
        bulkCopy.ColumnMappings.Add("MailsReceivedCount", "MailsReceivedCount");

        bulkCopy.BulkCopyTimeout = 300000;  // note: BulkCopyTimeout is expressed in seconds

        bulkCopy.WriteToServer(table);

    }

    if (fetchIdentities)
        LoadEmailAddressIdentities(users, bulkInsertSessionId, conn, tran);

}

/// <summary>
/// Fetches identity values assigned by the database engine to records
/// that correspond with EmailAddress objects contained in
/// the specified SiteUser model objects.
/// </summary>
/// <param name="users">Site user objects that contain EmailAddress
/// objects that were inserted to database. On output these EmailAddress
/// objects will have their EmailAddressID properties set
/// to corresponding values loaded from the database.</param>
/// <param name="bulkInsertSessionId">Bulk insert session identity,
/// which can be used to distinguish database records that were inserted
/// during current bulk insert session.</param>
/// <param name="conn">Active database connection.</param>
/// <param name="tran">Transaction under which operation is performed.</param>
void LoadEmailAddressIdentities(SiteUser[] users, int bulkInsertSessionId,
                            System.Data.SqlClient.SqlConnection conn,
                            System.Data.SqlClient.SqlTransaction tran)
{

    System.Data.SqlClient.SqlCommand cmd = new System.Data.SqlClient.SqlCommand(
        "SELECT [EmailAddressID] " +
        "FROM [users].[SiteUser] INNER JOIN " +
        "   [users].[EmailAddress] ON [SiteUser].[SiteUserID] = [EmailAddress].[SiteUserID] " +
        "WHERE [SiteUser].[BulkInsertSessionID]=@bulkInsertSessionId " +
        "ORDER BY [Address] ASC", conn, tran);
    cmd.Parameters.Add(new System.Data.SqlClient.SqlParameter("@bulkInsertSessionId", bulkInsertSessionId));

    // Note: this in-memory ordering must produce the same sequence as the
    // ORDER BY [Address] in the query above (beware collation differences),
    // or identity values will be paired with the wrong objects.
    EmailAddress[] sortedAddresses =
        (from u in users
         from e in u.EmailAddresses
         orderby e.Address
         select e).ToArray();

    using (System.Data.SqlClient.SqlDataReader reader = cmd.ExecuteReader())
    {
        int index = 0;
        while (reader.Read())
            sortedAddresses[index++].EmailAddressID = (int)reader[0];
    }

}

/// <summary>
/// Completes bulk insert session by removing the specified bulkInsertSessionId
/// value from the database.
/// </summary>
/// <param name="bulkInsertSessionId">Bulk insert session identity value which
/// was used to track records inserted in the database as part of current
/// bulk insert session.</param>
/// <param name="conn">Active database connection.</param>
/// <param name="tran">Transaction under which operation is executed.</param>
void CompleteBulkInsertSession(int bulkInsertSessionId,
                                        System.Data.SqlClient.SqlConnection conn,
                                        System.Data.SqlClient.SqlTransaction tran)
{

    System.Data.SqlClient.SqlCommand cmd = new System.Data.SqlClient.SqlCommand(
        "UPDATE [users].[SiteUser] " +
        "SET [BulkInsertSessionID] = NULL " +
        "WHERE [BulkInsertSessionID] = @bulkInsertSessionId", conn, tran);
    cmd.Parameters.Add(new System.Data.SqlClient.SqlParameter("@bulkInsertSessionId", bulkInsertSessionId));
    cmd.ExecuteNonQuery();

    cmd = new System.Data.SqlClient.SqlCommand(
        "DELETE FROM [bulk].[BulkInsertSession] " +
        "WHERE [BulkInsertSessionID] = @bulkInsertSessionId", conn, tran);
    cmd.Parameters.Add(new System.Data.SqlClient.SqlParameter("@bulkInsertSessionId", bulkInsertSessionId));
    cmd.ExecuteNonQuery();

}

Comparing Performance

We have used the functions presented above to measure the performance of adding new objects to the database. Before showing the results, we will briefly describe the conditions under which the measurements were taken. Prior to each measurement, the database was restored to a state with exactly one million SiteUser records and exactly two million child EmailAddress records. In every test, SiteUser model objects contained one, two or three EmailAddress objects (evenly distributed), producing a total of twice as many EmailAddress records as SiteUser records. Effort was made to create equal conditions for each test iteration.

However, conditions were not exactly the same for very large numbers of objects. Entity Framework requires quite a lot of memory to track model objects, and at a couple of hundred thousand objects its performance drops so low that we may freely say that anything beyond that is off limits for Entity Framework. At one million objects (three million records) Entity Framework would require something like 3 GB of memory just to track object states. At the same time, the custom solution based on DataTables takes only 200-300 MB, making it quite effective for practical use. To cope with this problem, the Entity Framework solution cuts each bulk insert into chunks of no more than 50,000 objects. Not quite the "same testing conditions", but insisting on pushing one million objects through Entity Framework in a single transaction is simply a no-go.
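One way the chunking described above might be implemented is sketched below. The BulkDemoEntities context and the AddObject call follow the article's own Entity Framework code; the chunk size of 50,000 matches the text, and the helper method name is our own. Creating a fresh context per chunk keeps the change tracker, and therefore the memory footprint, small.

```csharp
// Sketch only: insert a large SiteUser array through Entity Framework
// in chunks, with one context and one SaveChanges per chunk.
const int ChunkSize = 50000;

void InsertInChunks(SiteUser[] users)
{
    for (int offset = 0; offset < users.Length; offset += ChunkSize)
    {
        using (BulkDemoEntities ent = new BulkDemoEntities())
        {
            int end = Math.Min(offset + ChunkSize, users.Length);
            for (int i = offset; i < end; i++)
                ent.SiteUsers.AddObject(users[i]);
            ent.SaveChanges();  // one transaction per chunk, not one overall
        }
    }
}
```

Note that this trades atomicity for memory: a failure in a later chunk leaves earlier chunks committed.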

The table below shows the time required to insert the specified number of objects. The first column shows the number of objects inserted, with an average of two EmailAddress subobjects per SiteUser object. The second column is the time required by the Entity Framework solution. The third column is the time required by the custom solution to insert the objects and fetch back all object identities. The last column is the time required by the custom solution just to insert the objects, without fetching identities back from the database.

Number of objects   Entity Framework (sec.)   Custom with fetching IDs (sec.)   Custom without fetching IDs (sec.)
                1                      0.22                              0.75                               0.06
                2                      0.22                              0.78                               0.12
                4                      0.22                              0.80                               0.13
                8                      0.30                              0.83                               0.14
               16                      0.23                              0.82                               0.07
               32                      0.30                              0.79                               0.07
               64                      0.34                              0.77                               0.12
              128                      0.34                              0.77                               0.08
              256                      0.46                              0.91                               0.09
              512                      0.76                              0.87                               0.12
             1024                      1.55                              0.88                               0.16
             2048                      2.22                              0.98                               0.26
             4096                      4.69                              1.33                               0.43
             8192                      8.63                              1.79                               0.76
            16384                     16.93                              2.75                               1.51
            32768                     34.06                              5.73                               2.97
            65536                     68.52                             14.30                               6.40
           131072                    134.37                             33.55                              11.96
           262144                    276.20                             33.27                              27.18
           524288                    553.13                             78.16                              54.06
          1048576                   1106.63                            169.70                             132.32

If we look at the progression of times, we can see that Entity Framework consistently takes twice as much time to store twice as much data. That is to be expected, given the way Entity Framework stores data. The custom solution, on the other hand, performs each stage as exactly one SQL statement. It therefore has a fixed kick-off time that dominates whenever a small number of objects is inserted. Only after a reasonable model size is reached (say, about a thousand objects, i.e. three thousand records) do the custom solution's times also start to grow linearly with model size.

The overall result of our experiment is that the full custom solution, the one which fetches back identity values, operates about five to eight times faster than the solution based on Entity Framework. The solution that does not fetch identity values goes one step further, operating at speeds consistently ten to eleven times faster than Entity Framework. Actual results are highly dependent on model structure (and less on its volume, obviously).

The last row in the table shows a decline in the custom solution's speed, but that is explained by the high memory footprint generated by the model itself, not by the data storing solution. The Entity Framework solution, on the other hand, could not even be applied to that case in one piece. Its memory requirements were roughly ten times higher than those of the custom solution based on DataTables, because Entity Framework maintains quite a complex change tracking model under the hood, which obviously comes with a high memory consumption tag. We were therefore cheating by letting Entity Framework insert one million objects in 20 smaller steps, while not granting the same allowance to the custom solution.

Below is the chart showing the speed-up factor achieved by the two operating modes of the custom solution.

As discussed previously, most of the performance issues Entity Framework suffers from come from too-frequent communication with the database server. Each inserted record requires two statements, i.e. two roundtrips to the server. All measurements presented in this section were taken against SQL Server installed on the local machine, with nearly zero roundtrip time. Now consider a database server on the LAN - a typical situation - with a roundtrip time of just one millisecond. Inserting the one million objects from our experiment means three million records and six million roundtrips. With one millisecond irrevocably lost per roundtrip, we are already 100 minutes behind schedule. The bulk implementation, at the same time, will not spend more than ten milliseconds on roundtrips, because it sends fewer than a dozen statements to the server. Virtually everything it transfers is actual data that ends up in database records; on such a large amount of useful data, communication overhead nears zero. What was a speed-up factor of ten on the local machine suddenly becomes a factor of 40 in favor of the custom solution, and with a two millisecond roundtrip time it grows to a factor of 75!
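The back-of-the-envelope arithmetic above can be written out explicitly. This is only an illustration of the calculation in the text, not code from the article:

```csharp
// Roundtrip cost estimate for the per-record (ORM-style) insert.
int records = 3000000;            // 1M SiteUser + 2M EmailAddress records
int roundtrips = records * 2;     // insert statement + identity fetch per record
double roundtripMs = 1.0;         // assumed LAN roundtrip time

double totalSeconds = roundtrips * roundtripMs / 1000.0;   // 6000 seconds
double totalMinutes = totalSeconds / 60.0;                 // 100 minutes
```

The bulk implementation sends fewer than a dozen statements, so its roundtrip cost at the same latency stays in the tens of milliseconds.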

Just for fun, we tried this experiment over a cross-Atlantic database connection (not quite realistic for a loaded system, is it?). With a roundtrip time of 140 msec, the net result was that Entity Framework took 40 seconds to insert only 100 objects into the database. In the same amount of time, the custom bulk insert solution was able to insert over 50,000 objects - a speed-up factor of 500.

As a result of this deliberation, we can conclude that a speed-up of one order of magnitude is a very conservative lower bound; actual speed-ups on practical systems will be much higher than a factor of ten.

Highlights and Lowlights of Bulk

Building systems that smoothly accept huge amounts of data is always a difficult task. We strongly recommend designing any such system to be bulk-insert-aware up front, rather than tweaking it at later stages to accept bulk data. But then what we end up with is a bifocal system. Applying bulk-related parts of the design to operations focused on a small number of objects often adds needless complexity and, why not say it, bugs. Therefore, the system design will include two implementations of critical operations: one that works with a small number of objects and another that deals with bulk data.
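One way to keep the two implementations apart is to hide them behind a common abstraction and pick the implementation by expected volume. This is an illustration only; the interface, class names and the threshold are hypothetical, not part of the article's code:

```csharp
// Hypothetical abstraction with one implementation per regime.
interface ISiteUserWriter
{
    void Save(SiteUser[] users);
}

// Small volumes: ORM-based path - simple, safe, fully tracked.
class OrmSiteUserWriter : ISiteUserWriter
{
    public void Save(SiteUser[] users) { /* context + SaveChanges path */ }
}

// Large volumes: SqlBulkCopy-based path like the one in this article.
class BulkSiteUserWriter : ISiteUserWriter
{
    public void Save(SiteUser[] users) { /* bulk insert session path */ }
}

static class SiteUserWriterFactory
{
    // Arbitrary example threshold; tune it from measurements like those above.
    public static ISiteUserWriter For(int objectCount)
    {
        if (objectCount > 1000)
            return new BulkSiteUserWriter();
        return new OrmSiteUserWriter();
    }
}
```

Callers then depend only on ISiteUserWriter, so the bifocal nature of the system stays out of the rest of the code.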

That approach preserves the high performance of bulk operations and the low complexity and safety of operations that are not bulk. So much for the good sides of this design principle. The drawback is maintenance, where two implementations of the same operation have to be kept in sync. Should we add or change a feature, we have to implement it twice. Further down the development process, we have twice as much testing, documenting, training... All of this pays off only if the benefit from speeding up bulk operations is sufficient. Do not forget that we needed almost 300 lines of code, comments included, just to enable bulk insert into two tables. The corresponding Entity Framework code, on the other hand, looks like this:

using (BulkDemoEntities ent = new BulkDemoEntities())
{
    for (int i = 0; i < users.Length; i++)
        ent.SiteUsers.AddObject(users[i]);
    ent.SaveChanges();
}

Only six lines of code, curly braces included. Make sure you know the reasons why you are willing to write, test, document and maintain code that is fifty times longer.

Don't be tricked into thinking that any long operation is bad. Ten minutes or even a couple of hours invested in one bulk operation may be quite acceptable for the business, provided that it is performed when the system is not in use, typically overnight. On the other hand, don't be lulled into thinking that a fast bulk operation solves all your problems. One minute required to bulk insert millions of objects may sound like a huge success compared to the hour required by an ORM, but one minute of any intensive operation on the database means that the rest of the database system is probably going to be frozen for that minute.
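If freezing the database for the whole minute is a concern, one option worth knowing about is SqlBulkCopy's BatchSize property, which commits rows in smaller batches and releases locks between them - at the price of losing all-or-nothing semantics when no outer transaction is held. A minimal sketch (the batch size of 10,000 and the connectionString variable are arbitrary examples):

```csharp
// Sketch: bulk copy in independent batches instead of one long operation.
// UseInternalTransaction makes each batch its own transaction, so a failure
// rolls back only the current batch.
using (System.Data.SqlClient.SqlBulkCopy bulkCopy =
        new System.Data.SqlClient.SqlBulkCopy(connectionString,
            System.Data.SqlClient.SqlBulkCopyOptions.UseInternalTransaction))
{
    bulkCopy.DestinationTableName = "[users].[EmailAddress]";
    bulkCopy.BatchSize = 10000;  // commit every 10,000 rows, not all at once
    bulkCopy.WriteToServer(table);
}
```

Whether shorter locks are worth partial-failure handling is exactly the kind of trade-off that should be decided up front, per the suggestion below.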

And that brings us back to the first suggestion. Get to know your data. When there is such a need, make sure to build your database system around the notion of bulk inserts. Do not let performance troubles drive your decisions later, because by then it may well be too late.


Published: Apr 7, 2013; Modified: Apr 8, 2013

ZORAN HORVAT

Zoran is a software architect dedicated to clean design and CTO of a growing software company. Since 2014 Zoran has been an author at Pluralsight, where he is preparing a series of courses on design patterns, writing unit and integration tests, and applying methods to improve code design and long-term maintainability.

Follow him on Twitter @zoranh75 to receive updates and links to new articles.

Watch Zoran's video courses at pluralsight.com (requires registration):

Tactical Design Patterns in .NET: Managing Responsibilities

Applying a design pattern to a real-world problem is not as straightforward as literature implicitly tells us. It is a more engaged process. This course gives an insight into tactical decisions we need to make when applying design patterns that have to do with separating and implementing class responsibilities. More...

Tactical Design Patterns in .NET: Control Flow

Improve your skills in writing simpler and safer code by applying coding practices and design patterns that are affecting control flow. More...

Improving Testability Through Design

This course tackles the issues of designing a complex application so that it can be covered with high quality tests. More...
