Data integration sounds daunting – but what is it and how can it help you and your organization?
In this article, we will talk about integrating data as the process of connecting and synchronising data from one system to another.
With Splashback, integration unlocks data analysis and management capabilities you never would have thought possible. We do this by providing a federated platform with a full suite of analysis tools, apps and admin portal, with all the features you’d expect of a next-gen data platform. Read more about what we do on our website.
Here are just some applications of integrating Sample data with Splashback:
- Upgrade your organization’s data analysis and management capabilities without breaking legacy Access/SQL applications
- Use Splashback’s suite of analysis tools on a public FTP/CSV dataset
- Analyze external datasets alongside your existing Splashback data
TL;DR: You get all the great features of a next-gen data platform while not needing to fully transition away from old processes/software!
Today we will focus on how to integrate Sample data with the Splashback APIs in Python. Once we have shown you the hard way, we will introduce the Splashback Integration Script (SIS), our script framework designed to make it easy to integrate any data source.
Make sure to read to the end to get a complete understanding of the import process and find out which option is best suited for your integration needs!
Fundamentals of building data integrations
You can think of an integration as a factory – they take resources of one type in, and produce resources of another type. Of course, in this case an integration takes in your existing dataset and produces data in Splashback.
The first step is to identify how you can access the dataset, and what format it is in. Basically, you want to write code to read your data and turn it into a Splashback import.
Python has a multitude of packages available for fetching and loading data of many formats, making it a great language for building integrations. Here are just a few you might find useful:
- csv – Built-in package for reading CSV and TSV content.
- json – Built-in package for reading JSON content.
- requests – Package for downloading content over HTTP, such as web APIs.
Once you have created the import, you need to make sure it meets the validation checks provided by Splashback.
Splashback Samples is a rigorous data storage platform with strict metadata requirements. We break down the import process into two stages – the import check, and the import run. If you are familiar with Splashback Governor, this is the same process you go through when manually importing data.
The import check stage
In this stage, we submit the rows we intend to import to Splashback for validation, returning any status messages such as errors to the client. The checks it performs are as follows:
- Type checks – Validate that the type of each field matches the column type.
- Lookup checks – Ensure that all metadata lookups exist.
- Local duplicate checks – Check for duplicate rows within the submitted data after metadata lookups have been resolved.
- Remote duplicate checks – Check for duplicate rows against the Splashback database.
Once you have resolved any errors produced by the check stage, you are ready to proceed to the run stage.
The import run stage
This is where the magic happens – Splashback takes your validated import rows and inserts them into the database. For safety, Splashback also rechecks the data beforehand to ensure we still have valid data.
And that’s it! Congratulations, you have now imported your first dataset! Now that you know how the import process works, let’s get our hands dirty with some Python and upload some example data to a Splashback Pool!
Importing Sample data with the Splashback API
At Splashback, we believe in open access to data. That is why we publish the same internal APIs we use to power our applications and analysis tools for our customers.
In this section we’ll take you through your first Sample data import with Python!
To get started, follow our Python quick start guide in the Splashback Docs. We’ll take you through the steps of installing the package, securely managing your API key and performing your first query!
We’ll assume you’re starting with the script from “Make your first query”. Just delete the code after setting your Pool ID and we’re set:
import os import time import splashback_data from dotenv import load_dotenv from pprint import pprint from splashback_data.api import sites_api # Load .env load_dotenv() # Setup the API configuration configuration = splashback_data.Configuration( host = "https://api.splashback.io/data" ) # Configure API key authorization: api-key configuration.api_key['api-key'] = os.environ.get('SPLASHBACK_API_KEY') configuration.api_key_prefix['api-key'] = 'API-Key' # Set your Pool ID pool_id = <your_pool_id_here>
Let’s start by adding some imports to our Python script. These should go at the top of the file with the other imports, before any code:
import sys from splashback_data.api import imports_api from splashback_data.models import ModelImport
Now let’s create our rows for import:
imports = [ ModelImport( site_name='Splash River Estuary', site_code='', date='2021-06-01T12:00:00', time='', taken_by='Bernard Black', comment='Imported with Python', program='Mine Env', variant_type='Depth', variant_date_time='', variant_value='', variant_comment='', parameter='Water Temperature', qualifier='', value='10.49', quality='Good', laboratory='Field measurement', sampling_method='WTW' ) ]
We’re using a Pool with our demo Splash Valley dataset, so the metadata will already exist. If you are using another Pool, make sure that the metadata and lookups exist, or your check will fail! Don’t worry though, the check will tell you what’s wrong to help you solve any issues with your imports.
Performing the import checks and running the import
Now, we’re ready to check our imports are ready for import:
print('Checking imports ...') with splashback_data.ApiClient(configuration) as api_client: api_instance = imports_api.ImportsApi(api_client) results = api_instance.api_imports_check_pool_id_post(model_import=imports, pool_id=pool_id) if results.has_error_message: print('Errors checking imports', results.messages) sys.exit(1) print('Successfully checked imports!')
The moment of truth – running the import! This will insert the Sample into the Pool’s database:
print('Running import ...') with splashback_data.ApiClient(configuration) as api_client: api_instance = imports_api.ImportsApi(api_client) results = api_instance.api_imports_run_pool_id_post(model_import=imports, pool_id=pool_id) if results.has_error_message: print('Errors running import', results.messages) sys.exit(1) print('Successfully run imports!')
Now you’re ready to run the script, so save and run it in your environment:
(venv) $ python main.py Checking imports ... Sucessfully checked imports! Running import ... Successfully run imports! (venv) $
If you see this output, it worked! Use Splashback Governor to browse the new Sample:
Tip: Running the script more than once will fail with a remote duplicate error – this is Splashback’s rigorous validation system kicking in! You should change the imports between runs or delete the duplicate imports before trying again.
Importing Sample data with the Splashback Integration Script
We have started work on SIS, the Splashback Integration Script. It is designed to abstract as much of the nitty-gritty as possible from the integration process.
SIS consists of two main components – finders and parsers.
Finders are used to locate data to import, producing a list of file paths to be processed. They can do a simple file system search, download files from an FTP server, or save responses from a web API.
Parsers take the list of file paths and convert them into a Splashback import. You will want to create a parser for each file format you want to handle, such as CSV (.csv), TSV (.tsv), netCDF (.nc), XML (.xml) or XLSX (.xlsx).
SIS includes some finders and parsers, and we welcome contributions to our open source efforts!
Which data integration method is best for me?
Now that you know how to import Sample data to Splashback, you’re ready to integrate your dataset.
In most cases, we recommend the Splashback Integration Script (SIS), as it abstracts much of the setup and processing code away, reducing your workload.
If you are developing an integration with Splashback in your existing application, the API is the best way to go as you have the greatest flexibility.