Performance issues and selecting optimal rules in DTM Data Generator
In many cases there are several alternative ways to fill a table in the database with
identical values, but these ways can be quite different concerning the performance.
In the process of filling extra large databases (with dozen millions of records),
this issue can turn out to be a critical one as long as time is concerned. In this
article we will consider the questions of product performance, give the results
of our measurements for some standard tasks and several recommendations for non-standard
cases.
Note: all tests in the article were done on the following system: Intel P4 (3.4 GHz),
1024 Mb RAM, Windows XP Pro, local MS SQL Server 2000 (Developers Edition).
Let us start from adding unique values for some field to the database. There are two ways to
ensure value uniqueness: to set the corresponding checkbox in the generation rule or to use
the unique index on the server. In the first case, the data generation software tries to create unique
values on its own and checks each new value with the list of those already generated.
Such checks require time and consume additional RAM, but you can avoid additional server
queries in this case. The second way takes the task of selecting records to the server.
The main disadvantage of the second method is that DTM Data Genaretor does not guarantee to add the
number of records the user specifies in the rule. It means that if 1000 records
are specified to be inserted and 100 records are rejected, only 900 records will
be inserted into the database. The first method almost always guarantees the necessary
number of records if it is possible (if there are unique values left).
Let us see what actually happens. Let us create two tables: the first one with a varchar
field and the other one with an integer field and try to insert 10,000 unique values
into each of them using each of the above methods:
| |
First method (by Data Generator) |
Second method (unique index) |
| Integer value |
800 (records/sec) |
710 (10% rejected) |
| Varchar(750) value |
764 (records/sec) |
780 |
In case the range of generation is rather large, there is no substantial difference
between these two methods.
Now let us change the task so that the number of variants that can be generated is
considerably reduced: specify the range from 0 to 50000 for the Integer field, and
5 characters for the Varchar field:
| |
First method (by Data Generator) |
Second method (unique index) |
| Integer value in range |
450 (records/sec) |
770 (10% rejected) |
| Varchar value by mask |
373 (records/sec) |
800 |
As we can see in this case, the performance of the first method can be considerably
lower because the program has to generate several values, check them and reject the
improper ones.
It should be mentioned that if the connection to the server is slow or there are
big timeouts, the ratio for getting the same number of inserted records is most
likely to change for the opposite because the process of checking values will
take a good deal of time.
The next important factor that will seriously influence the performance is the
transaction "size" - the number of table rows that will be inserted within a
single transaction. Let us change the task like this: insert 100,000 varchar
values into the table changing the transaction size from 50 to 5000 rows per
transaction and evaluating the resulting performance:
| Rows per transaction |
Performance (rows/second) |
| 50 |
687 |
| 100 |
734 |
| 200 |
801 |
| 500 |
782 |
| 1000 |
740 |
| 2500 |
721 |
| 5000 |
673 |
We stopped and restarted the database server after following each rule in
order to prevent the results from being influenced by the record cache.
Note: the optimal transaction size depends on the software and hardware
configuration of the database server. You should probably make some experiments
to find the optimal values for your server.
Let us now compare the performance of the program when three various methods
are used to specify the lists of values: directly, by selecting the list from
a file and in case of using Values Library. We will use the table with one
varchar field that was created in one of the above examples and fill it
with 10,000 values.
| List (5 values) |
List in file (50 values) |
Values Library (400 values) |
| 490 |
473 |
448 |
As we can see, all three methods are practically identical in comparable
conditions from the point of view of performance and this parameter almost
does not depend on the size of the list.
There is a way to considerably speed up loading test data for MS SQL users. In case
the server is installed locally (i.e. DTM Data Generator is launched on one computer with the server),
the program offers built-in means for working with the BCP utility (see the corresponding
option in the Settings dialog box). If the server is remote, we recommend that
you create an output text file instead of generating data directly and then use
this file together with the command line BCP utility.
In conclusion, let us mention some factors that are critical for the performance
and can be changed or customized by the user. To increase the performance, you should:
- Temporarily disable all triggers corresponding to the tables being filled, except
those cases when they ensure data integrity.
- If possible, remove indexes, except the cases of creating unique values mentioned above.
- Temporarily disable the transaction mechanism for the specific client for
the period of generating text data if your database server permits it.
See Also
|