Ingesting data into Couchbase using Streamsets

Nick Cadenhead, 9th BIT Consulting, offers his expertise.
I have been working with Couchbase for some time now and it’s been an interesting journey. Historically, I’m not a database guy, so I’ve not worked much with databases in terms of designing, building and maintaining them as a full-time job.
However, I do know the basics. Coming from that position has allowed me to get into the mindset of NoSQL concepts such as no fixed structures, no transactions and denormalized data, without running into many conflicts with the paradigms of the structured world of SQL and relational databases.
During my sales engineering activities supporting Couchbase proof-of-concept (POC) engagements, there is always a requirement to ingest data into a Couchbase bucket (think of a bucket as a relational database) to demonstrate and highlight the features and capabilities of Couchbase.
Usually, data ingestion requires some code to be written. Fortunately, Couchbase provides quite a few SDKs (Java, .NET, Node.js and more), which allow developers to integrate their applications with Couchbase.
This got me thinking: why can’t there be a standard way, or a tool for that matter, to ingest data into Couchbase instead of writing code every time? Not that there is anything wrong with writing code, of course.
Then I came across Streamsets.

Open source, web-based ingestion
Streamsets is an open source platform for the ingestion of streaming and batch data into big data stores. It features a graphical web-based console for configuring data pipes that handle data flows from origins to destinations, monitoring runtime dataflow metrics, and automating the handling of data drift.
Data pipes are constructed in the web-based console via a drag-and-drop process. Pipes connect to origins (sources) and ingest data into destinations (targets). Between origins and destinations are processor steps, which are essentially data transformation steps: masking fields, evaluating fields, routing data, evaluating or manipulating data using JavaScript or Groovy, looking up data in a database or in external cloud services such as Salesforce.com, and many more.
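To make the origin, processor and destination roles concrete, here is a minimal Python sketch of the idea. It is not Streamsets code and does not use the Streamsets API; the function names (origin, mask_field, destination, run_pipeline) and the sample records are made up purely for illustration.

```python
from typing import Callable, Dict, Iterable, Iterator, List

Record = Dict[str, object]
Processor = Callable[[Record], Record]


def origin() -> Iterator[Record]:
    """Hypothetical origin: yields records from an in-memory source."""
    yield {"id": 1, "name": "Alice", "card": "4111111111111111"}
    yield {"id": 2, "name": "Bob", "card": "5500000000000004"}


def mask_field(field: str, keep_last: int = 4) -> Processor:
    """Build a processor that masks all but the last `keep_last` characters of a field."""
    def process(record: Record) -> Record:
        value = str(record[field])
        hidden = max(len(value) - keep_last, 0)
        masked = "*" * hidden + value[hidden:]
        # Return a copy so the original record is left untouched.
        return {**record, field: masked}
    return process


def destination(records: Iterable[Record]) -> List[Record]:
    """Hypothetical destination: collects records in a list instead of a data store."""
    return list(records)


def run_pipeline(
    source: Callable[[], Iterable[Record]],
    processors: List[Processor],
    sink: Callable[[Iterable[Record]], List[Record]],
) -> List[Record]:
    """Pass every record from the origin through each processor, then to the destination."""
    def flow() -> Iterator[Record]:
        for record in source():
            for process in processors:
                record = process(record)
            yield record
    return sink(flow())


result = run_pipeline(origin, [mask_field("card")], destination)
```

In this sketch the processors are just an ordered list of functions, which mirrors the way steps are chained between an origin and a destination in the console.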
Streamsets is a great option for my data ingestion needs. It’s open source and available to download immediately. Many technologies are supported for data ingestion, ranging from databases, flat files, logs and HTTP services to big data platforms like Hadoop and MongoDB and cloud platforms like Salesforce.com. But there was one problem: Couchbase is not on the list of data connectors available for Streamsets.
So, I decided to write my own.

Integrating Couchbase
Using the Java-based data connector API that is available to the open source community for extending the integration capabilities of Streamsets, together with the online documentation and guides, I was able to implement a data connector for Couchbase very quickly.
The initial build of the connector is very simple: it just ingests JSON data into a Couchbase bucket. Over time the connector will be expanded to include the ability to query a Couchbase bucket, offer better ingestion capabilities and more, but for now it serves my needs.
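Conceptually, a destination like this takes each JSON record, picks a document key and writes the document to the bucket under that key. The sketch below shows that idea in Python with an in-memory dictionary standing in for a bucket. It is not the connector's actual code and it does not call the Couchbase SDK; the InMemoryBucket class, the ingest function and the choice of an "id" field as the document key are all assumptions made for illustration.

```python
import json
from typing import Dict, Iterable, List, Tuple


class InMemoryBucket:
    """Hypothetical stand-in for a bucket: a map from key to JSON document."""

    def __init__(self) -> None:
        self.documents: Dict[str, dict] = {}

    def upsert(self, key: str, document: dict) -> None:
        """Insert the document, or replace any document already stored under the key."""
        self.documents[key] = document


def ingest(
    lines: Iterable[str], bucket: InMemoryBucket, key_field: str = "id"
) -> List[Tuple[str, str]]:
    """Write each JSON line to the bucket; return (line, reason) pairs for rejected lines."""
    errors: List[Tuple[str, str]] = []
    for line in lines:
        try:
            document = json.loads(line)
        except json.JSONDecodeError as exc:
            errors.append((line, f"invalid JSON: {exc.msg}"))
            continue
        # Only JSON objects that carry the key field can be stored.
        if not isinstance(document, dict) or key_field not in document:
            errors.append((line, f"missing key field '{key_field}'"))
            continue
        bucket.upsert(str(document[key_field]), document)
    return errors


bucket = InMemoryBucket()
rejected = ingest(
    ['{"id": "u1", "name": "Alice"}', '{"name": "no key"}', "not json"],
    bucket,
)
```

Rejected lines are returned rather than raised so that one bad record does not stop the rest of the batch, which is a design choice of this sketch and not a statement about the real connector.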
One of the added benefits of Streamsets is data pipeline analytics. The analytics features in the console offer users insight into how data flows from origin to destination. The standard visualizations in the console provide detailed analysis of the performance of the data pipeline. This analysis quickly showed me how my data was being ingested into the Couchbase buckets, and highlighted any errors that occurred throughout the stages of the data pipeline.
TL;DR: By using data pipelines in Streamsets I can quickly ingest data into Couchbase while writing little or no code at all.
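The kind of per-stage numbers such a console displays can be illustrated with a few counters. The following Python sketch is only an assumption about the general idea, not a description of how Streamsets computes its metrics; StageMetrics and run_stage are hypothetical names.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Iterable, List, Tuple


@dataclass
class StageMetrics:
    """Counters for one pipeline stage."""
    name: str
    records_in: int = 0
    records_out: int = 0
    errors: List[str] = field(default_factory=list)


def run_stage(
    name: str, records: Iterable[Any], process: Callable[[Any], Any]
) -> Tuple[List[Any], StageMetrics]:
    """Apply `process` to each record, counting inputs, outputs and failures."""
    metrics = StageMetrics(name)
    output: List[Any] = []
    for record in records:
        metrics.records_in += 1
        try:
            output.append(process(record))
            metrics.records_out += 1
        except Exception as exc:  # a failing record is reported, not fatal
            metrics.errors.append(f"{record!r}: {exc}")
    return output, metrics


# Two chained stages: the second only sees what the first let through.
parsed, parse_metrics = run_stage("parse_int", ["1", "2", "x", "4"], int)
doubled, double_metrics = run_stage("double", parsed, lambda n: n * 2)
```

Comparing records_in with records_out for each stage shows where records are being dropped, and the errors list says why.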
The data connector is open source and can be found at: https://github.com/nicholasc69/CouchbaseConnector.git
