Using Azure Data Factory To Copy 300M JSON Files From An SFTP Folder
Context
For the past few months, our team has been in talks with several potential data providers for one of our initiatives, and one of the vendors has agreed to share an SFTP folder in AWS containing JSON files. Our task is to copy all these JSON files to our side before we can start analyzing the quality of their data.
Data Source Details
Based on the URL they’ve sent, their SFTP folder is hosted in an AWS data center in the US East region. They didn’t state how many JSON files there are, but based on our estimates, there may be around 300M JSON files in this SFTP folder, spread across an arbitrary number of nested subdirectories. The files’ sizes range from a few KB to a few hundred KB.
They’ve also sent us an OpenSSH private key, a username, and the URL of their SFTP folder.
Stepping back a bit, you may ask: why didn’t they just compress all their files into one gigantic archive and share it with us? Or even multiple compressed files? And why SFTP? Aren’t there other protocols that are also secure but offer significantly faster throughput? Unfortunately, I don’t have much visibility into their capacity. For all I know, their tech team may be stretched too thin, and waiting for them to work on this admittedly simple request could take days, even weeks, while we, on the other hand, have tight schedules as well.
So we have to pull all the data out as quickly as possible.
Initial Implementation With Python
Knowing that the vendor’s SFTP folder is hosted in an AWS data center in US East, I spun up an Azure VM in US East. This would minimize latency, as both the source and destination data centers are located in the same region.
I implemented an approach based on a Python script I saw on Stack Overflow.
My Python program was simple.
- Get the complete list of folders in the SFTP root folder and store them in a Python list.
- Divide the list into eight smaller lists.
- Have each list processed by a dedicated Python process that copies its files from the vendor’s SFTP folder to our Azure Storage account (a rough sketch of this script follows the list).
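For reference, here is a minimal sketch of that kind of script, assuming paramiko for the SFTP access and azure-storage-blob for the uploads. The host, username, key path, and container name are placeholders, not the vendor’s actual details, and my original script differed in its particulars.

```python
import multiprocessing
import os
import posixpath
import stat

import paramiko
from azure.storage.blob import ContainerClient

# Placeholder connection details, not the vendor's actual values.
SFTP_HOST = "sftp.example.com"
SFTP_USER = "vendor-user"
PRIVATE_KEY_PATH = "/home/me/.ssh/vendor_key"
AZURE_CONN_STR = os.environ["AZURE_STORAGE_CONNECTION_STRING"]
CONTAINER_NAME = "vendor-json"


def connect_sftp():
    """Open a fresh SFTP session (one per worker process)."""
    key = paramiko.RSAKey.from_private_key_file(PRIVATE_KEY_PATH)
    transport = paramiko.Transport((SFTP_HOST, 22))
    transport.connect(username=SFTP_USER, pkey=key)
    return paramiko.SFTPClient.from_transport(transport)


def copy_folder(folder):
    """Recursively copy every file under one top-level folder to blob storage."""
    sftp = connect_sftp()
    container = ContainerClient.from_connection_string(
        AZURE_CONN_STR, container_name=CONTAINER_NAME
    )
    stack = [folder]
    while stack:
        current = stack.pop()
        for entry in sftp.listdir_attr(current):
            remote_path = posixpath.join(current, entry.filename)
            if stat.S_ISDIR(entry.st_mode):
                stack.append(remote_path)  # descend into subfolders
            else:
                with sftp.open(remote_path, "rb") as remote_file:
                    # Keep the folder structure in the blob name.
                    container.upload_blob(
                        name=remote_path.lstrip("/"),
                        data=remote_file.read(),
                        overwrite=True,
                    )
    sftp.close()


if __name__ == "__main__":
    # Step 1: complete list of folders in the SFTP root.
    sftp = connect_sftp()
    top_level_folders = [
        e.filename for e in sftp.listdir_attr(".") if stat.S_ISDIR(e.st_mode)
    ]
    sftp.close()

    # Steps 2 and 3: split the work across eight dedicated processes.
    with multiprocessing.Pool(processes=8) as pool:
        pool.map(copy_folder, top_level_folders)
```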
Sounds simple enough, right? So I left it running overnight and then went to bed.
When I logged back into my work laptop, I looked at the throughput, and it was underwhelming: an average of only 30–40K JSON files copied per hour. With an estimated 300M JSON files, at this rate it would take roughly 416 days (300M ÷ 30K per hour ≈ 10,000 hours) to copy everything, and that assumes no disconnections or errors!
With this approach and the current set of tools, completing the task is simply unrealistic.
There MUST be a better way in terms of throughput, cost, and error handling!
Enter Azure Data Factory
What is Azure Data Factory? From Microsoft’s official documentation:
In the world of big data, raw, unorganized data is often stored in relational, non-relational, and other storage systems. However, on its own, raw data doesn’t have the proper context or meaning to provide meaningful insights to analysts, data scientists, or business decision makers.
Big data requires a service that can orchestrate and operationalize processes to refine these enormous stores of raw data into actionable business insights. Azure Data Factory is a managed cloud service that’s built for these complex hybrid extract-transform-load (ETL), extract-load-transform (ELT), and data integration projects.
I’ve been using Azure Data Factory for a while now, but this is the first time I will use the service to copy such a large amount of data.
Setting It Up
1. From within an Azure Data Factory service, navigate to the Author tab. Create a new pipeline, then add a Copy Data activity onto the designer canvas.
2. Click the Copy Data activity on the canvas, and you should see the Source and Sink tabs as shown below.
3. Within the Source tab, click the “+ New” button. The right panel should show the screens below. Search for the SFTP dataset type. In my case, since I know I’m copying JSON files, I chose JSON as the format.
4. We were given an OpenSSH private key to access the vendor’s SFTP folder, but Azure Data Factory only accepts an RSA private key. Attempting to use the OpenSSH key results in the error seen below.
5. We can use PuTTYgen to convert an OpenSSH key to the RSA format. For a complete step-by-step set of instructions on how to do this, go here.
6. After converting the key, using the new key should yield a successful connection to the SFTP site, as seen below.
7. As we don’t know the directory structure of the SFTP folder, we need to use wildcard paths with recursion so that Azure Data Factory can dynamically traverse the folder structure and fetch an arbitrary number of JSON files. The settings below are what I configured in the Source tab.
8. After configuring the source, we should now configure the sink or destination.
9. Clicking the “+ New” button should open a panel to the right. I’ve taken screenshots of the step-by-step configurations that I did. Configuring the sink is easier because the destination is within our own infrastructure, so we have more control over it.
We will also configure the destination folder for the JSON files we are copying. Remember that we are copying 300M JSON files; we don’t want all of them in one gigantic folder, so I set the copy behavior to “preserve hierarchy”. This way, the copying process recreates the source folder structure in the sink/destination container.
10. One of the challenges with our initial Python implementation was the throughput. At a rate of 30K JSON files an hour, it would take 416 days to copy 300M JSON files, and that assumes no disconnections or errors.
In Azure Data Factory, we can configure the maximum number of parallel copies under the Settings tab of the Copy Data activity to maximize throughput.
11. By default, the timeout value of an ADF activity is set to 12 hours. As we estimate there may be around 300M files in the SFTP folder, and we don’t yet know what the copying throughput will be, we will set this to the maximum allowable value of 7 days. (A scripted equivalent of steps 7 through 11 is sketched below.)
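The portal steps above are what I actually used, but for anyone who prefers to script the setup, roughly the same copy activity can be expressed with the azure-mgmt-datafactory Python SDK. The sketch below is only illustrative: it assumes SFTP and Data Lake JSON datasets named SftpJsonDataset and AdlsJsonDataset already exist, the subscription, resource group, factory name, and parallelism value are placeholders, and make_copy_pipeline is just a helper introduced here, not part of the SDK.

```python
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import (
    ActivityPolicy,
    AzureBlobFSWriteSettings,
    CopyActivity,
    DatasetReference,
    JsonSink,
    JsonSource,
    PipelineResource,
    SftpReadSettings,
)

# Placeholder names; substitute your own subscription, resource group, and factory.
SUBSCRIPTION_ID = "<subscription-id>"
RESOURCE_GROUP = "rg-data-ingest"
FACTORY_NAME = "adf-vendor-ingest"

adf_client = DataFactoryManagementClient(DefaultAzureCredential(), SUBSCRIPTION_ID)


def make_copy_pipeline(wildcard_folder_path: str = "*") -> PipelineResource:
    """Build a pipeline with a single Copy Data activity mirroring steps 7 to 11."""
    copy_activity = CopyActivity(
        name="CopyVendorJson",
        inputs=[DatasetReference(type="DatasetReference", reference_name="SftpJsonDataset")],
        outputs=[DatasetReference(type="DatasetReference", reference_name="AdlsJsonDataset")],
        # Step 7: recurse through the folder tree and pick up every JSON file.
        source=JsonSource(
            store_settings=SftpReadSettings(
                recursive=True,
                wildcard_folder_path=wildcard_folder_path,
                wildcard_file_name="*.json",
            )
        ),
        # Step 9: preserve the source folder hierarchy in the destination container.
        sink=JsonSink(
            store_settings=AzureBlobFSWriteSettings(copy_behavior="PreserveHierarchy")
        ),
        # Step 10: degree of copy parallelism (32 is only an example value).
        parallel_copies=32,
        # Step 11: raise the activity timeout to the 7-day maximum.
        policy=ActivityPolicy(timeout="7.00:00:00"),
    )
    return PipelineResource(activities=[copy_activity])


adf_client.pipelines.create_or_update(
    RESOURCE_GROUP, FACTORY_NAME, "CopyVendorJsonPipeline", make_copy_pipeline()
)
```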
Validating and Publishing The Pipeline
After configuring and tweaking the pipeline’s settings, we need to validate the pipeline and publish it to the live data factory for execution.
After this step, all that is left is to trigger the copy.
Triggering The Copy Activity
At the upper right corner of the canvas is the “Add Trigger” button. Clicking this button will show the “Trigger now” button.
After triggering the copy activity, assuming there are no errors, all that remains is to monitor the process.
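The “Trigger now” button is all that’s needed in the portal, but the same run can also be started and watched from code. This small sketch reuses the placeholder client and names from the SDK sketch after step 11.

```python
import time

# Reuses adf_client, RESOURCE_GROUP, and FACTORY_NAME from the sketch after step 11.
# Equivalent of clicking "Add Trigger" -> "Trigger now" in the portal.
run = adf_client.pipelines.create_run(
    RESOURCE_GROUP, FACTORY_NAME, "CopyVendorJsonPipeline", parameters={}
)

# Poll the run status until the copy finishes (or fails).
while True:
    pipeline_run = adf_client.pipeline_runs.get(RESOURCE_GROUP, FACTORY_NAME, run.run_id)
    print(f"Pipeline run status: {pipeline_run.status}")
    if pipeline_run.status not in ("Queued", "InProgress"):
        break
    time.sleep(60)
```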
Monitoring The Process
The copy activity takes a few minutes to ramp up after it is triggered. This ramp-up dynamically allocates the necessary resources based on the configuration and settings you’ve entered.
Here is my copy activity that I’ve left running for about 40 hours.
I computed the throughput: around 500K JSON files per hour, sustained over those 40 hours. That is more than a twelvefold improvement over the 30–40K files per hour of our initial Python implementation! On top of that, error handling, automatic retries, logging, monitoring, and dynamic resource allocation can all be configured in this pipeline.
Running Multiple Copy Activities
At a rate of 500K JSON files per hour, copying 300M JSON files would now take only 25 days (300M ÷ 500K = 600 hours), but remember that the maximum allowable activity run time is ONLY seven days.
To make the copying even faster, we can create multiple copies of the pipeline and configure each copy’s wildcard path differently. As seen above, and now knowing that the subfolders under the root SFTP folder always start with a digit, I created pipelines that copy the JSON contents of subfolders whose names start with 6, 8, 9, and 0.
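As a rough illustration of that fan-out, and again reusing the placeholder SDK setup and the hypothetical make_copy_pipeline helper from the sketch after step 11, one pipeline per leading digit can be generated in a loop, with only the wildcard folder path changing between them.

```python
# Reuses adf_client, RESOURCE_GROUP, FACTORY_NAME, and the hypothetical
# make_copy_pipeline() helper from the sketch after step 11; only the wildcard
# folder path differs between the pipeline copies.
for prefix in ["0", "6", "8", "9"]:
    adf_client.pipelines.create_or_update(
        RESOURCE_GROUP,
        FACTORY_NAME,
        f"CopyVendorJsonPipeline_{prefix}",
        # e.g. "6*" restricts this pipeline to subfolders whose names start with 6.
        make_copy_pipeline(wildcard_folder_path=f"{prefix}*"),
    )
```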
Word Of Caution
Copying a massive number of files and performing that many write operations in Azure can incur a significant cost.
In my case, I am writing the files into Azure Data Lake Storage Gen2. The intention is to query these JSON files with serverless Azure Synapse SQL and OPENROWSET to get insights from the data.
It may be possible to reduce cost and complexity by narrowing the scope. We need to ask some questions:
Do we even have the capacity and the capability to analyze ALL <X amount> of data? Can we get by with just a small subset of the JSON data?
These questions can only be answered by you and your team.
By asking these questions during one of our standup sessions, I was able to negotiate the initial requirement down from 300M JSON files to 30M.
Takeaways
At first glance, the task wasn’t complex: copy JSON files exposed by a vendor over SFTP to our side for further analysis. But as we got more details, the sheer volume of data was simply too big for the traditional approaches. We needed a tool better suited to copying a few hundred million JSON files in a timely manner.
For those who have a Microsoft Azure subscription, Azure Data Factory can be one of those tools. From my perspective, it is powerful and customizable enough to cover most scenarios, and it is well documented. Even with minimal experience with the tool, I was able to configure something from the ground up in a matter of hours!
Author’s Notes
I hope you’ve learned something new. If you are also using Microsoft Azure and a similar task lands on you or your team, you now know that Azure Data Factory can be used in this situation.
Have you had a similar task of copying hundreds of millions of files from one location to another? What tool(s) did you use? How did you approach it? I’d like to learn from you as well. Share it in the comments!
Thank you!