Almost all organizations have lots of data. However, many would admit that they are not realizing its full potential. In fact, “Cross-industry studies show that on average, less than half of an organization’s structured data is actively used in making decisions—and less than 1% of its unstructured data is analyzed or used at all,” as stated by Harvard Business Review.
Important aspects of any successful data strategy or architecture include ensuring an effective mechanism exists to understand the different data sources available, clearly defining the problem or questions you are trying to solve, and having the appropriate methods and expertise on your team. This post aims to demystify the common approaches to connecting structured and unstructured data to make effective business decisions, or, put another way, to help you realize your data's full potential.
Data Sources Availability
Before we can make predictions or derive conclusions, it is critical to assess the different data sources available that can help us complete our story. Creating a data dictionary for all the different fields/categories of data is a critical step in avoiding delays, misunderstandings, and misuse of information. Ensuring there are join conditions/common fields available to connect and thread the data effectively is equally important.
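As a minimal sketch of these two ideas, a data dictionary can be as simple as a documented mapping of fields to their source and meaning, and a shared key (here a hypothetical `customer_id`) lets you thread records across systems. The field names and sample records below are illustrative, not from any real system:

```python
# Hypothetical data dictionary: each field documented with its source,
# type, and meaning, so consumers know what they are joining on.
data_dictionary = {
    "customer_id": {"source": "CRM", "type": "string",
                    "description": "Unique customer key; shared join field"},
    "order_total": {"source": "Orders DB", "type": "decimal",
                    "description": "Order value in USD"},
    "support_note": {"source": "Helpdesk", "type": "text",
                     "description": "Unstructured agent notes"},
}

# Two sources sharing the customer_id field can be threaded together.
crm_contacts = [{"customer_id": "C001", "name": "Ada"},
                {"customer_id": "C002", "name": "Grace"}]
orders = [{"customer_id": "C001", "order_total": 120.50}]

def join_on_customer_id(contacts, orders):
    """Join the two sources on their shared customer_id field."""
    orders_by_id = {o["customer_id"]: o for o in orders}
    return [{**c, **orders_by_id.get(c["customer_id"], {})} for c in contacts]

joined = join_on_customer_id(crm_contacts, orders)
```

Without an agreed join field, this threading step is where integrations typically stall, which is why the dictionary should call out join keys explicitly.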
Understanding Business Requirements
Business requirements define the reason and purpose behind the project. The process of discovering, gathering, documenting, and understanding a project's objective is a crucial input to the success of any initiative. It is important for everyone involved with the project to have a mutual understanding of the business requirements, which guide the technical design and architecture. Working in an agile environment makes the roles of the project manager, product owner, and scrum master extremely critical throughout the lifecycle of the project. Some of the important aspects of good business requirements include:
- Project Objectives (Overview, Vision, Scope)
- Success Criteria
- Stakeholder Map
- Business Requirements Document (or centralized format available to all)
- Any reference material or process flows
Writing effective business requirements is an iterative activity that requires continuous practice and revision, so while it may seem daunting, the effort pays off many times over.
Common Integration Methods
There are two common methods to acquire data from different source systems: pull-based and push-based.
Pull-based systems are used for ingesting data from batch-oriented source systems. A simple example would be pulling/retrieving new contacts from your CRM system on a daily, weekly, or monthly basis.
Push-based systems are applicable to streaming or event-driven source systems. A good example is feeding the results of a marketing campaign to the analytics team to display on a dashboard such as Tableau.
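The distinction between the two can be sketched in a few lines: in the pull model, a scheduled job asks the source for new records; in the push model, the source invokes a handler as events occur. The function names and stubbed responses below are hypothetical stand-ins, not a real CRM or campaign API:

```python
# --- Pull-based: the integration layer polls the source on a schedule. ---
def fetch_new_contacts(since):
    """Hypothetical stand-in for a CRM client call; returns stubbed data."""
    return [{"id": 1, "created": since + 60}]

def daily_pull(last_run):
    """Scheduled job: ask the source for anything new since the last run."""
    return fetch_new_contacts(since=last_run)

# --- Push-based: the source calls us as events happen. ---
dashboard_feed = []

def on_campaign_event(event):
    """Handler the source system invokes; forwards events to the dashboard feed."""
    dashboard_feed.append(event)

pulled = daily_pull(last_run=0)
on_campaign_event({"campaign": "spring_sale", "clicks": 42})
```

The practical difference is who owns the schedule: the consumer (pull) or the producer (push), which in turn shapes latency and error-handling design.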
Depending on the source system type, the connection methods may differ. The two most common methods are Batch and Streaming.
A batch is a collection of data points gathered within a specific time interval. Batch processing is ideally suited to applications where some extra latency will not severely impact your operations, such as:
- Payroll processing
- Customer order processing
- Customer billing cycle
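The "specific time interval" idea can be illustrated by bucketing records into batches by the window their timestamp falls into. The records and the one-hour window below are illustrative assumptions:

```python
from itertools import groupby

# Hypothetical event records with epoch timestamps (in seconds).
records = [
    {"ts": 0, "amount": 10},
    {"ts": 1800, "amount": 20},
    {"ts": 3700, "amount": 5},
]

WINDOW = 3600  # batch interval: one hour

def batch_by_window(records, window):
    """Group records into batches keyed by the time window they fall into."""
    keyed = sorted(records, key=lambda r: r["ts"] // window)
    return {k: list(g) for k, g in groupby(keyed, key=lambda r: r["ts"] // window)}

batches = batch_by_window(records, WINDOW)
```

A nightly payroll or billing job is effectively this pattern with a 24-hour window: everything that arrived in the interval is processed together.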
Under the hood, data integration systems and applications connect with batch-based source systems through one of the following connection methods:
1) Database connection (ODBC/JDBC -> Oracle, SQL Server, MySQL, Redshift, etc.)
2) File-based connection (SFTP/FTP -> CSV/text/Parquet)
3) API-based connectors (REST/SOAP calls -> Salesforce, Workday, etc.)
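A rough sketch of the first two methods, using only standard-library stand-ins (an in-memory SQLite database in place of an ODBC/JDBC driver, and an in-memory CSV payload in place of a file arriving over SFTP); the REST example is shown in shape only, with a hypothetical endpoint URL:

```python
import csv
import io
import sqlite3

# 1) Database connection: sqlite3 stands in for an ODBC/JDBC-style driver.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE contacts (id INTEGER, name TEXT)")
conn.execute("INSERT INTO contacts VALUES (1, 'Ada')")
db_rows = conn.execute("SELECT id, name FROM contacts").fetchall()

# 2) File-based connection: a CSV payload as it might arrive over SFTP.
csv_payload = io.StringIO("id,name\n2,Grace\n")
file_rows = list(csv.DictReader(csv_payload))

# 3) API-based connection: the shape of a REST call (not executed here;
#    the endpoint URL is hypothetical).
# import json, urllib.request
# with urllib.request.urlopen("https://api.example.com/v1/contacts") as resp:
#     api_rows = json.load(resp)
```

In practice the driver, file transport, and API client differ per vendor, but every batch connector reduces to one of these three shapes.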
Custom applications, data integration tools (such as Informatica or Talend), and dashboard/analytics tools (such as ThoughtSpot or Power BI) use their own native connectors to integrate with source systems. These native connectors leverage one of the core connection methods above.
Streaming connections make it possible to ingest data continuously from source to destination as the data is created, making it useful along the way. Streamed data is typically validated on the fly against given conditions or anomalies, ensuring data pollution is caught and mitigated quickly. The data is then transformed and/or persisted into a data lake or data warehouse for further processing. Stream processing feeds data into an analytical tool in micro-batches or in real time.
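The validate-then-micro-batch flow described above can be sketched with a generator standing in for a continuous event stream; the event shapes and the negative-click anomaly rule are illustrative assumptions:

```python
def event_stream():
    """Stand-in for a continuous event stream."""
    yield {"user": "u1", "clicks": 3}
    yield {"user": "u2", "clicks": -1}   # anomaly: negative click count
    yield {"user": "u3", "clicks": 7}

def is_valid(event):
    """On-the-fly validation: catch polluted records before they land."""
    return event["clicks"] >= 0

def micro_batches(stream, size=2):
    """Collect validated events into micro-batches for the analytics layer."""
    batch = []
    for event in stream:
        if not is_valid(event):
            continue  # a real pipeline would quarantine and log this record
        batch.append(event)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch  # flush any partial final batch

batches = list(micro_batches(event_stream()))
```

Pushing `size` toward 1 approaches real-time delivery; larger values trade latency for throughput, which is the core tuning knob in micro-batch streaming.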
Message queue services like Apache Kafka, Google Pub/Sub, or AWS (Amazon Web Services) Kinesis Data Streams are widely adopted by enterprises to capture streamed data, making it possible to build scalable real-time data analytics applications. Common streaming sources include:
1) IoT (Internet of Things) sensor data
2) Click-stream data
3) Social media feed
4) Gaming analytics
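The producer/consumer pattern those services implement can be illustrated with an in-process queue; this is a toy stand-in for a Kafka topic or Kinesis stream, not their actual client APIs:

```python
from queue import Queue

# In-process stand-in for a Kafka/Pub-Sub/Kinesis topic.
topic = Queue()

def produce(event):
    """Producer side: e.g. a click-stream collector publishing to the topic."""
    topic.put(event)

def consume_all():
    """Consumer side: the analytics application drains available messages."""
    events = []
    while not topic.empty():
        events.append(topic.get())
    return events

produce({"type": "click", "page": "/pricing"})
produce({"type": "sensor", "temp_c": 21.5})
consumed = consume_all()
```

What the managed services add on top of this pattern is durability, partitioning, and replay, which is why they scale to the use cases listed above where a simple in-memory queue would not.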
Data security is of paramount importance in today's world. Consumers are more aware of the pitfalls of unsecured data and expect their data to be protected from the collector, through the pipeline and databases, and finally to the analysis platform such as a dashboard. Enterprises protect data assets by designing multiple layers of protection and enforcing strong authentication and authorization mechanisms. Important design considerations when connecting data sources include:
- allowing only white-listed servers/consumers to access the data
- provisioning fine-grained access control, authorizing access to only the data required for each persona or role
- limiting the number of parallel database connections
- limiting the number of API calls, for example via page limits
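The first two considerations can be sketched together: reject connections from hosts outside a white-list, then trim each request down to the columns the caller's role is entitled to see. The host addresses, roles, and column names below are hypothetical:

```python
WHITELISTED_HOSTS = {"10.0.0.5", "10.0.0.6"}  # hypothetical server white-list

# Fine-grained, role-based access: each persona sees only the columns it needs.
ROLE_COLUMNS = {
    "analyst": {"customer_id", "order_total"},
    "marketer": {"customer_id", "campaign"},
}

def authorize(host, role, requested_columns):
    """Reject non-white-listed hosts; trim the request to the role's columns."""
    if host not in WHITELISTED_HOSTS:
        raise PermissionError(f"host {host} is not white-listed")
    return set(requested_columns) & ROLE_COLUMNS.get(role, set())

# An analyst asking for a sensitive column gets only what the role permits.
granted = authorize("10.0.0.5", "analyst", ["customer_id", "order_total", "ssn"])
```

In production these checks live in the network layer (firewalls, security groups) and the database's grant system rather than application code, but the layering principle is the same.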
What are your most common challenges when connecting data from various sources? What is holding your data strategy back from building the insights you need to grow your business?
Subscribe here to receive Analytics & Automation blogs directly in your inbox!