You can configure your DataSync Agent to share data from ServiceNow to a Snowflake database by changing some additional configurations in your agent.xml file.
The DataSync Agent uses the Snowflake COPY command to bulk load data from the local file system where the Agent is running in order to optimize performance. Because Snowflake is a cloud-based database, standard SQL operations (i.e. INSERT, UPDATE, etc.) performed record by record would be slow due to network latency, so the COPY command is used instead.
- Snowflake support is available in Krypton 8.0.8 and newer
- In Krypton 8.0.11, support is added for:
  - Altering tables to add new columns or increase column lengths, as is supported with other databases
  - Deleting records
  - Table Compare (see below for information on databases.xml)
  - IODatetime plugin
- The following features are not currently supported and may be added in future versions:
  - Other Agent plugins
  - Temporal
  - The legacy DataSync Agent approach of combining sys_ids for database views. The Agent will use the sys_id as provided by ServiceNow for a database view. See below for more information.
Prerequisites
First, you will need to Install the DataSync Agent.
You will also need to create a ServiceNow dynamic share/bulk share.
Make sure to stop running your DataSync Agent before making any Agent configuration changes.
Procedure
To set up your DataSync Agent to share application data to Snowflake, follow these steps:
Navigate to the conf directory in the installation folder where you installed your DataSync Agent and look for the agent.xml configuration file.
Open your agent.xml file in a text editing application. Then, locate the <task> directive(s) within your <subscribe> directive, and update the following database-related directives:
Directive | Value to enter |
---|---|
<handler> | com.perspectium.replicator.sql.SnowflakeSQLSubscriber |
<database_type> | snowflake |
<database_server> | URL/IP address for your Snowflake database. |
<database_user> | Username used to access your Snowflake database |
<database_password> | Password used to access your Snowflake database |
<database_parms> | To specify the schema, warehouse and role to access your Snowflake database, writing each & separator as &amp; (since & must be escaped in the XML file): schema=<SCHEMA_NAME>&amp;warehouse=<WAREHOUSE_NAME>&amp;role=<ROLE> For example, schema=PUBLIC&amp;warehouse=COMPUTE_WH&amp;role=ACCOUNTADMIN would be used if your schema name is PUBLIC, your warehouse name is COMPUTE_WH and your role is ACCOUNTADMIN (see the fragment after this table). |
<database> | Name of your Snowflake database |
<database_port> | (Optional) Port number to access your Snowflake database |
<enable_snowflake> | To enable the DataSync Agent for Snowflake. By default, the Agent is not enabled for Snowflake because of the aforementioned performance limitations with standard SQL operations. |
<files_directory> | To specify a local directory that the Agent has access to, where it can save its temporary CSV files. These files are pushed to Snowflake staging tables in order to run the COPY command for bulk loading data from the local file system. |
<file_max_size> | To specify how large the temporary files can grow before they are pushed to Snowflake. To optimize performance, records should accumulate in the temporary files before being pushed, so that files are not continuously pushed to Snowflake, which can degrade performance. |
<uppercase_columns> | (Optional) To specify if the columns should be saved in uppercase lettering in the Snowflake database. Since ServiceNow's columns are lowercase, this will convert them to uppercase for querying and saving content into Snowflake (i.e. sys_id becomes SYS_ID). It is recommended to enable this configuration since Snowflake by default uses uppercase lettering. |
<flush_interval> | (Optional) To specify how often the DataSync Agent will flush its temporary files and push them to Snowflake if they have not accumulated to the size specified in <file_max_size>. Values are entered as whole numbers representing minutes, such as 10 for 10 minutes. As of Krypton 8.0.11, the default value is 3 (3 minutes) if this configuration is not specified. This ensures changes are not held too long when there are no more records to be shared. For example, if you share 100 records, your <file_max_size> is 400KB, and 80 records fit within 400KB, those 80 records will be pushed while the remaining 20 records stay in the next temporary file until it reaches 400KB. The flush interval ensures those remaining records still get pushed even if no further sharing occurs and the file does not reach 400KB for some time. |
<delete_files_interval> | (Optional) To specify how old the temporary files can become before they are deleted. To keep temporary files from taking up too much space on the file system while still retaining them for troubleshooting, the Agent only deletes files once their last modified time is older than the specified interval relative to the current date and time. The default value is 24H, which deletes files older than 24 hours. Each time the Agent flushes its temporary files (per the <flush_interval> configuration above), it checks for any temporary files older than this value relative to the current time. For example, if the value is 12H (12 hours), when the Agent runs its flush it will check for any temporary files older than 12 hours and delete them. |
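For instance, with a PUBLIC schema, a COMPUTE_WH warehouse and an ACCOUNTADMIN role (the same values used in the full example below), the Snowflake-specific directives inside your <task> would include lines like the following; note that each & separator is written as &amp; inside the XML, and that <enable_snowflake> and <uppercase_columns> are self-closing flags:

<handler>com.perspectium.replicator.sql.SnowflakeSQLSubscriber</handler>
<database_type>snowflake</database_type>
<database_parms>schema=PUBLIC&amp;warehouse=COMPUTE_WH&amp;role=ACCOUNTADMIN</database_parms>
<enable_snowflake/>
<uppercase_columns/>
<files_directory>/usr/local/datasyncagent/tempfiles</files_directory>
<file_max_size>500KB</file_max_size>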
Your agent.xml file should look similar to the example shown below:
<?xml version="1.0" encoding="ISO-8859-1" standalone="no"?> <config> <agent> <share> </share> <subscribe> <task> <task_name>test_snowflake_subscribe</task_name> <message_connection password="encrypted:vlOtU71yu8N/EFIJH85SSBtaIt7qEEfvqiqft9VZyYE=" user="exampleuser">amqps://example.perspectium.net</message_connection> <instance_connection password="encrypted:vlOtU71yu8N/EFIJH85SSBtaIt7qEEfvqiqft9VZyYE=" user="example.user">https://myinstance.service-now.com</instance_connection> <decryption_key>The cow jumped over the moon</decryption_key> <handler>com.perspectium.replicator.sql.SnowflakeSQLSubscriber</handler> <database_type>snowflake</database_type> <database_server>servername.snowflakecomputing.com</database_server> <database_user>exampleuser</database_user> <database_password>examplepassword</database_password> <database_params>schema=PUBLIC&warehouse=COMPUTE_WH&role=ACCOUNTADMIN</database_params> <database>SAMPLE</database> <enable_snowflake/> <uppercase_columns/> <files_directory>/usr/local/datasyncagent/tempfiles</files_directory> <file_max_size>500KB</file_max_size> <flush_interval>3</flush_interval> <delete_files_interval>12H</delete_files_interval> </task> </subscribe> <max_reads_per_connect>4000</max_reads_per_connect> <polling_interval>30</polling_interval> </agent> </config>
Save the changes you made to your agent.xml file and close the file.
With Krypton 8.0.11 and newer, the additional lines in wrapper.conf will automatically be added for you when installing the Agent using the Installer.
Navigate back to the conf directory in the DataSync Agent installation folder and look for the wrapper.conf file.
Open up the wrapper.conf file in a text editing application. Look for the # Java Additional Parameters section and add the following two lines:
wrapper.java.additional.4=--add-opens
wrapper.java.additional.5=java.base/java.nio=ALL-UNNAMED
The changes should look as follows:
# Java Additional Parameters
wrapper.java.additional.1=-Djava.util.logging.config.file=../conf/logging.properties
wrapper.java.additional.2=-Dlog4j.configurationFile=../conf/log4j2.xml
wrapper.java.additional.3=-Dfile.encoding=UTF-8
wrapper.java.additional.4=--add-opens
wrapper.java.additional.5=java.base/java.nio=ALL-UNNAMED
This configuration is required for the Agent to work properly: it opens the Java modules the Agent needs to buffer data in memory before pushing it to Snowflake for the COPY command.
Save the changes you made to your wrapper.conf file and close the file.
After configuring your agent.xml and wrapper.conf files to support replication to your Snowflake database, start running your DataSync Agent again.
Notable Features
Agent Log Reporting
The Agent reports both the records it processes (i.e. records it reads from the Perspectium Integration Mesh queue and then prepares for saving into Snowflake) along with the records it pushes into Snowflake (records that are sent into Snowflake and then copied into the database tables with the COPY command). Because of how the Agent batches records together to a max file size before using the COPY command, these details will be reported separately and these two actions will not happen at the exact same time. For example, if looking at the Agent's log file (perspectium.log), you will see log entries as follows:
2024-03-28 23:33:51.049 INFO - PerformanceReportGenerator - PerformanceReport - change_request.bulk=[100, 425560]: 100 messages processed and 425560 bytes processed
2024-03-28 23:37:21.049 INFO - Timer-4 - SnowflakeSQLSubscriber - Scheduled flushing to push records
2024-03-28 23:37:28.144 INFO - Timer-4 - SnowflakeSQLSubscriber - Processed 100 record(s) to table: change_request
The PerformanceReportGenerator logging represents the Agent processing records from the Integration Mesh queue (i.e. reading the messages, decrypting the content, formatting it and saving it into the local temporary files), while the SnowflakeSQLSubscriber logging represents the actual pushing of records into the Snowflake database (taking the temporary files, pushing them into Snowflake and using the COPY command to save them into the database tables themselves).
Column Sizes
By default, the DataSync Agent will create columns to match the sizes as specified in ServiceNow up to 4000 characters. Anything that is defined over 4000 characters will be created as the maximum size for that column type.
For example, if in ServiceNow the column is defined as String with 200 characters, in Snowflake this column will be created as VARCHAR(200) (VARCHAR is the equivalent for a String in the Snowflake database).
If the column was defined as String with 10,000 characters in ServiceNow, in Snowflake this column will be created as VARCHAR(16777216) which is the maximum size for VARCHAR column types since Snowflake does not have a TEXT or CLOB equivalent.
You can update the default value of 4000 that controls when the column is created as the maximum size by updating the <column_max_size> configuration in databases.xml. This value cannot exceed the maximum size that Snowflake supports (i.e. you can't set it to 32777216 since this is greater than the VARCHAR maximum of 16777216).
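As a rough sketch of how that might look (assuming <column_max_size> sits under the same Snowflake <database> section shown in the Table Compare example below; verify the placement in your own databases.xml), raising the value to 8000 would let columns defined up to 8,000 characters keep their ServiceNow-defined sizes:

<database>
    <database_type>snowflake</database_type>
    . . .
    <!-- Assumed placement: columns defined at 8000 characters or less keep their ServiceNow size; larger columns are created as VARCHAR(16777216) -->
    <column_max_size>8000</column_max_size>
</database>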
Database Views
A ServiceNow database view replicated to the Agent for saving into Snowflake will use the encoded sys_id value so that the record can be properly queried both for updating and deleting. This sys_id value is created by ServiceNow when the Perspectium application queries for records from a database view for sharing. The sys_id will have a value such as __ENC__MzVmM2RjNjljMGE4MDhhZTAwMDg2NjU3MzFjMTdkMDM=-MjQ2ZWY2ZDk4MzE5NDIxMGYyZGQxNjMwY2VhYWQzZWE=-MDg2ZTM2ZDk4MzE5NDIxMGYyZGQxNjMwY2VhYWQzYmQ=.
NOTE: The legacy DataSync Agent approach of combining sys_ids based on the *_sys_id fields listed in the record for database views is not supported.
Table Compare
In order for Table Compare to work properly, the databases.xml configuration file needs two configurations (<most_recent_record_query> and <all_ids_query_syntax>) to properly query for records in tables being shared to Snowflake. These configurations should be located under the <database> section that contains <database_type>snowflake</database_type> (other configurations have been omitted from the example below for easier viewing):
<database>
    <database_type>snowflake</database_type>
    . . .
    <most_recent_record_query>SELECT sys_id FROM %s WHERE sys_id = '%s' AND %s = %s</most_recent_record_query>
    <all_ids_query_syntax>SELECT SYS_ID as "sys_id" FROM %s</all_ids_query_syntax>
</database>
Because of how the Agent queries for records when synchronizing deletes, the <all_ids_query_syntax> should contain the "sys_id" alias for the sys_id field. The default configuration uses SYS_ID as this field and aliases it to "sys_id" (i.e. SYS_ID as "sys_id") since SYS_ID in all upper case lettering is Snowflake's default syntax for column names. You should update this if you have different casing for your column names.
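For example, if your Snowflake tables were created with quoted lowercase column names instead of the default uppercase (a hypothetical variation, not the shipped configuration), the selected column would change while the "sys_id" alias stays the same:

<!-- Hypothetical: table columns created as quoted lowercase, e.g. "sys_id" -->
<all_ids_query_syntax>SELECT "sys_id" as "sys_id" FROM %s</all_ids_query_syntax>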
Contact Perspectium Support to obtain the latest version of databases.xml. After verifying the two configurations are present, place the file in the Agent's conf folder, i.e. <Perspectium_Replicator_Agent_Installed_Directory>/conf.
agent.ack and agent.error Messages
Since the Agent leverages the COPY API, processing the messages it receives and pushing the data to Snowflake are two separate processes (whereas with other databases, the Agent processes a message and pushes it to the database in one step). As a result, the Agent sends back agent.ack messages when it processes messages (not including pushing data to Snowflake), and sends back agent.error messages if it encounters any errors pushing the data to Snowflake. This way you can subscribe to notifications for these error messages.
IODatetime Deletes
For deletes with the IODatetime plugin, the column specified as the update_column will also be updated with the datetime when the delete action happens. This is due to how the COPY API updates the records to mark them as a soft delete (i.e. not actually deleted from the database table).