Every now and then we encounter environments where data loading must be performed over a wide area network and are asked how we handle this scenario. Data loads over wide area networks present an interesting challenge. Our experience has been that the problem can be tackled from features within the tool itself and also how the warehouse ETL process is designed.
From a tool perspective, WhereScape RED provides some good options to help reduce network traffic and load times;
• RED provides a facility for permanent staging/mapping tables. These tables are kept intact over time and can be used to store valuable mapping/lookup information required during transformations, etc. This eliminates the need to potentially extract lots of lookup information during each and every extract. In this case you would only need to bring over the changed information and keep the mapping table(s) updated.
• Our standard approach is to perform change-data-capture (CDC) on all data in the warehouse – dimensions and facts, where possible. Rather than reloading entire dimension/fact tables each time they are processed, we would only bring across the changed data and insert/update the appropriate rows in the target tables. This will have the effect of reducing the overall extraction times and also reducing the network traffic. RED is designed with this approach in mind.
• We attempt to get only the raw data from the source system during the extract process. Once the data has been placed into our Load tables we would then go about the process of performing the complex transformations, etc. By performing all the transformations on the warehouse database we would hopefully reduce the amount of data that travels over the wide area network.
• "Remote Views" on load tables provide a facility where the SQL for the data extraction is created as a view on the source system. This forces the extraction query to be performed on the source system rather than allowing the optimizer to choose how it is performed, which could potentially lead to the query being performed across the database link. This option is available for Oracle environments and can provide significant performance gains and reduction in network traffic.
Other non-tool related considerations;
• Try to perform extracts when there is minimal load on the network. Running extracts outside normal business hours when user activity is at a minimum. Another consideration is to schedule extracts around other IT maintenance activities, such as system backups, which could consume large chunks of network bandwidth.
• Since data is going over a wide area network try to design all data extraction with a CDC approach in mind.
By taking advantage of the built-in functionality of WhereScape RED and designing the ETL process with a Wide Area Network in mind you can make significant improvements in the overall performance.