Tags: ETL, high availability
Working with customers over the years, the questions of Data Architecture and Data Strategy have come up a lot: ‘What should my data strategy be?’ or ‘How do I go about creating a Data Strategy or a Data Architecture, and what are some of the critical elements of a Data Architecture?’
Let me address some of those questions and the importance of having a well-defined Data Strategy and Data Architecture. First and foremost, why do you need a Data Strategy and a Data Architecture? The answer is simple: to bring standardization, consistency, and an architecture, a plan, to your world of data. Data is the primary asset of your organization, and organizations that can make effective use of data will always have a competitive advantage. You do not want your data locked in data islands and a tangled web where you cannot get to it in a timely manner and use it when you need it. The infrastructure hosting the data also has to be optimized, nimble, and efficient; server sprawl and server shortage are both undesirable.

The key elements of a Data Strategy that help you arrive at a consistent Data Architecture are as follows:
Data Platform Standardization
Too often, I see a multitude of data platforms in a customer data center: Oracle, SQL Server, DB2, Sybase, and more, along with multiple BI and Reporting platforms. I remember a customer who once ‘proudly’ declared, “you name a database or a data-related technology and we have it”. There can be valid reasons for this, such as acquisitions, mergers, legacy systems, ISV systems, etc., but too much diversification is not an efficient way to manage your Data Platform. There is a reason why Southwest Airlines has only 737s in their fleet (AFAIK). Too much diversification means multiple DBA teams, multiple licensing agreements, multiple support and maintenance contracts, and multiple skill sets to hire and keep trained. Consolidating to one platform (or one major and one minor) is generally a good idea. It brings standardization and keeps cost under control.
If you are highly diversified, start creating a plan to standardize by considering the following:
• Sunsetting of legacy systems
• ERP for running your enterprise (which will help sunset systems)
• Modernization: consolidation, upgrade, and modernization of legacy systems
• Migrations: migrating to your standardized platform
Data Standardization
The first step in Data Standardization is to have standards. Although ISV packages, legacy systems, etc. may create hurdles to a standardized representation of data elements, it is always worth striving for a standardized data dictionary. This also helps tremendously with ETL and data integration. Though standardization is important, I am not a big believer in the dogmatic enforcement of standards; there will always be reasonable exceptions where deviating from them is OK. I recommend thinking of standardization along the following lines:
Data Modeling Standards Create and enforce modeling standards and patterns. This ensures consistency of data design and helps establish design patterns. Normalization, de-normalization, physical design, logical design, indexing, PK/FK enforcement, and overall referential integrity (RI) enforcement can all be part of the standards.
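To make the PK/FK and RI part of such a standard concrete, here is a minimal sketch using Python's built-in sqlite3 module; the table and column names are hypothetical examples, not a prescribed schema, and production platforms (SQL Server, Oracle, etc.) enforce the same constraints with their own DDL.

```python
import sqlite3

# Hypothetical two-table model illustrating PK/FK and referential-integrity
# enforcement as a modeling standard. Names are examples only.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite requires opting in to FK checks

conn.execute("""
    CREATE TABLE Customer (
        CustomerID   INTEGER PRIMARY KEY,
        CustomerName TEXT NOT NULL
    )""")
conn.execute("""
    CREATE TABLE CustomerOrder (
        OrderID    INTEGER PRIMARY KEY,
        CustomerID INTEGER NOT NULL REFERENCES Customer(CustomerID)
    )""")

conn.execute("INSERT INTO Customer VALUES (1, 'Contoso')")
conn.execute("INSERT INTO CustomerOrder VALUES (100, 1)")  # valid FK

# An orphan row violates referential integrity and is rejected:
try:
    conn.execute("INSERT INTO CustomerOrder VALUES (101, 999)")
    orphan_rejected = False
except sqlite3.IntegrityError:
    orphan_rejected = True
print(orphan_rejected)  # True
```

The point of the standard is that rejection of the orphan row is designed in, not left to application code to remember.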
Naming Standards Establish names and patterns for tables, variables, columns, etc. Stored procedures (SPs) and user-defined functions (UDFs) are other candidates for name standardization. This helps tremendously with code design, development, and maintenance; consistency is a good thing in this case.
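A naming standard is most useful when it can be checked automatically. The sketch below assumes a hypothetical convention (PascalCase tables and columns, a `usp` prefix for stored procedures); your own standard will differ, but the pattern of encoding it as rules is the same.

```python
import re

# Hypothetical naming conventions; the patterns are illustrative, not a standard.
NAMING_RULES = {
    "table":  re.compile(r"^[A-Z][A-Za-z0-9]*$"),     # e.g. CustomerOrder
    "proc":   re.compile(r"^usp[A-Z][A-Za-z0-9]*$"),  # e.g. uspGetOrders
    "column": re.compile(r"^[A-Z][A-Za-z0-9]*$"),     # e.g. OrderDate
}

def check_name(kind: str, name: str) -> bool:
    """Return True if `name` conforms to the convention for `kind`."""
    return bool(NAMING_RULES[kind].match(name))

print(check_name("proc", "uspGetOrders"))   # True
print(check_name("proc", "sp_get_orders"))  # False: legacy sp_ prefix, snake_case
```

A check like this can run in code review or a build step, so deviations become deliberate exceptions rather than accidents.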
Coding Standards In my definition, Coding Standards are different from the Naming Standards mentioned above. Coding Standards entail establishing a repeatable, consistent approach and patterns for writing code and algorithms. This facilitates code reuse, helps code maintenance, and makes management much easier.
Infrastructure and Configuration Standardization Many companies have a standard infrastructure and configuration. This could mean a standard VM or server size (CPU, memory, disk layout, etc.) for all rollouts, with exceptions made as needed. Having a ‘golden image’ is a good practice. Similarly, standardizing server- and instance-level configuration settings is important as well. Standards should be maintained and enforced; however, being dogmatic about them is never productive, and exceptions should be made as and when necessary.
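One simple way to keep a ‘golden image’ honest is to diff each server's actual settings against the baseline. The setting names and values below are hypothetical stand-ins; real checks would pull configuration from the platform itself.

```python
# Minimal sketch of validating a server's configuration against a "golden"
# baseline. Setting names and values are hypothetical examples.
GOLDEN_CONFIG = {
    "max_memory_gb": 64,
    "cpu_cores": 8,
    "tempdb_files": 8,
    "max_degree_of_parallelism": 4,
}

def config_drift(actual: dict) -> dict:
    """Return {setting: (expected, actual)} for every deviation from the baseline."""
    return {
        key: (expected, actual.get(key))
        for key, expected in GOLDEN_CONFIG.items()
        if actual.get(key) != expected
    }

server = {"max_memory_gb": 64, "cpu_cores": 8, "tempdb_files": 4,
          "max_degree_of_parallelism": 4}
print(config_drift(server))  # {'tempdb_files': (8, 4)}
```

Any non-empty drift report is then either fixed or recorded as one of the deliberate exceptions mentioned above.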
ETL, ESB, EAI, SOA, EDI Strategy
Think of all of this as your data plumbing. If your plumbing architecture is well established, your data will flow efficiently, enabling Reporting, BI, Analytics, and DW. Let’s look at each of these aspects:
ETL Defined as Extract, Transform, and Load, or, depending on the target platform, ELT. The strategy here is arriving at a consistent, enterprise-ready toolset. The most common products I see in the enterprise are SSIS, Informatica, and DataStage, each with its respective pros and cons. Generally speaking, SSIS is the ‘best value for money’; since SSIS 2012 (which introduced the SSIS catalog), it has been possible to have a centralized ETL server where packages can be stored, secured, managed, and even run centrally. Central dashboards and instrumentation are important in a good ETL product. I have also recently seen Hadoop emerge as a potential ETL platform. See this for more details: ETL basics.
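Whatever the toolset, every ETL pass has the same shape. Here is a toy end-to-end sketch in Python, with a tiny in-memory CSV standing in for the source and SQLite for the target; the schema and names are illustrative only.

```python
import csv
import io
import sqlite3

# Toy ETL pass: extract from a CSV source, transform (cleanse and cast types),
# load into a target table. Source data, schema, and names are hypothetical.
source = io.StringIO("order_id,amount\n1, 10.50 \n2,  7.25\n")

# Extract
rows = list(csv.DictReader(source))

# Transform: trim whitespace, cast to proper types
clean = [(int(r["order_id"]), float(r["amount"].strip())) for r in rows]

# Load
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE orders (order_id INTEGER PRIMARY KEY, amount REAL)")
db.executemany("INSERT INTO orders VALUES (?, ?)", clean)

total = db.execute("SELECT SUM(amount) FROM orders").fetchone()[0]
print(total)  # 17.75
```

Products like SSIS wrap exactly these three stages in designers, logging, and a central catalog; the value of a standard toolset is that the wrapping is consistent everywhere.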
SOA, ESB, and EAI A service-oriented architecture (SOA) has a direct impact on your overall Data Strategy and Architecture, and the Enterprise Service Bus (ESB) and Enterprise Application Integration (EAI) are important components of it. SOA is an important element of your Data Strategy and Architecture.
High Availability and Disaster Recovery (HA/DR) Strategy
Too often I see customers with no HA/DR strategy, or with just a DR plan. Having a consistent HA strategy will help standardize and establish a pattern for HA. Let me explain what I mean by that: there are many ways to achieve HA at the database tier. The data platform itself provides HA options such as AlwaysOn, clustering, and Data Guard, and HA can also be achieved through the virtualization platform. HA can be implemented at the hypervisor tier, the instance tier, or the database tier, or a combination of all of the above. Establishing a pattern or a strategy on which one to use, and when, is important; HA should not be ‘happenstance’, but well considered.
DR is similar. DR can be achieved through SAN replication, database features (such as AlwaysOn or Data Guard), replication at the virtualization layer, etc. Having a centralized and consistent strategy simplifies things a lot. It is important to keep in mind that the choice of an HA/DR strategy can have licensing implications as well. Always remember, keep your HA/DR as simple as possible: the simpler the design, the better the HA and DR you achieve. Too often, I see a complex and unnecessary HA design lead to anything but HA.
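The “well considered, not happenstance” point boils down to explicit, testable failover rules. The sketch below is a hypothetical illustration of one such rule (fail over only after N consecutive missed health probes); real clustering and AlwaysOn tooling implement far more sophisticated versions of this logic.

```python
# Toy failover decision rule; the threshold and probe model are hypothetical
# stand-ins for what real HA tooling (clustering, AlwaysOn, etc.) provides.
FAILURE_THRESHOLD = 3  # consecutive failed probes before declaring failover

def should_fail_over(probe_results) -> bool:
    """Return True once the primary misses FAILURE_THRESHOLD probes in a row."""
    consecutive = 0
    for healthy in probe_results:
        consecutive = 0 if healthy else consecutive + 1
        if consecutive >= FAILURE_THRESHOLD:
            return True
    return False

print(should_fail_over([True, False, False, True]))   # False: primary recovered
print(should_fail_over([True, False, False, False]))  # True: threshold reached
```

The value of writing the rule down, even informally like this, is that failover behavior becomes something you can review and rehearse rather than discover during an outage.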
BI and Analytics Strategy
The most important things to avoid in this area are superfluous reporting tools and data marts with redundant data elements, no consistent ETL process, and, inevitably, bad data quality. Nothing kills a BI initiative faster than bad data and slow, complex reporting and analysis tools. Come up with a simple BI and Analytics Strategy that consists of the following elements:
• Standardized BI tools
• An ETL framework
• A centralized DW
• A few specialized data marts
Unstructured Data Integration
Every organization has tons of unstructured data; some of it is very useful and relevant, and some is not. Letting that data go unanalyzed is a mistake. There are gold nuggets sitting in it that could be used for a variety of purposes. For instance, if you are in retail, a simple analysis of weblogs to find out what was clicked on but not purchased could lead to valuable insights. Click analysis of public-facing websites can tell you what is relevant and valued by users and help drive insights based on that.
Moreover, if the unstructured data elements can be extracted and added to your existing DW as a new dimension or as new facts, they could significantly enhance the capability of your DW.
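The “clicked but not purchased” analysis mentioned above can be sketched in a few lines. The events here are a hypothetical, already-parsed simplification; real weblogs would need sessionization and parsing first.

```python
# Hypothetical simplified clickstream; real weblogs require parsing first.
events = [
    {"session": "s1", "product": "shoes",  "action": "click"},
    {"session": "s1", "product": "shoes",  "action": "purchase"},
    {"session": "s2", "product": "jacket", "action": "click"},
    {"session": "s2", "product": "hat",    "action": "click"},
    {"session": "s2", "product": "hat",    "action": "purchase"},
]

clicked   = {(e["session"], e["product"]) for e in events if e["action"] == "click"}
purchased = {(e["session"], e["product"]) for e in events if e["action"] == "purchase"}

abandoned = clicked - purchased  # clicked but never purchased in that session
print(abandoned)  # {('s2', 'jacket')}
```

Aggregating `abandoned` by product is exactly the kind of derived fact that could feed a new dimension or fact table in the DW.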
Private, Public, and Hybrid Cloud Strategy
The cloud is a great enabler in IT. It is quite revolutionary in the ways it can bring efficiency, save costs, and provide tremendous business agility. The ‘on-demand’ nature and elasticity of the cloud are very compelling. It is important to understand the different types of cloud options available and when to leverage them.
A simple Private Cloud strategy could help you achieve a lot of consolidation and virtualization benefits. You can get sophisticated by adding self-provisioning, maintenance, load balancing etc. to your private cloud. Private clouds definitely help reduce cost and save on Data Center resources.
A public cloud such as Microsoft Azure or Amazon AWS can be very effective as well. The most common uses of the public cloud that I see amongst my customers are IaaS, SaaS, and storage. PaaS, not so much, but it is slowly catching up, especially for net-new applications.
A hybrid cloud should definitely be a consideration in the overall strategy. A few examples of a hybrid cloud are as follows:
• AlwaysOn on-prem, with a node for HA/DR and reporting in the cloud
• Web and application tiers in the cloud and the data tier on-prem. This gives you the flexibility and elasticity to grow and shrink the web and app servers on demand while keeping your data tier on-prem (this type of design needs a high-performance, domain-joined setup between your data center and the cloud, such as Azure ExpressRoute).
• An RDBMS-based Data Warehouse on-prem and a Hadoop cluster in the cloud, with a synchronization and integration process to bring the data on-prem
• OLTP systems on-prem, Data Warehousing in the cloud
Public, Private, and Hybrid Cloud should absolutely be part of your Data Strategy and your Data Architecture.
Security, Encryption, and Compliance
No Data Strategy and Data Architecture could be complete without giving due consideration to Security, Encryption, and Compliance.
It is important to have a security strategy covering areas such as authorization and authentication, SSO, MFA, etc. A simple example is a standard for Windows vs. Mixed Mode authentication.
There are various ways to encrypt data. It can be encrypted at rest (TDE), at the column level (column-level encryption), in the application, or on the file system. Each of these patterns has its own pros and cons. It is important to create a recommendation / best practice on which one to use, when, and how.
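To make the application-level pattern concrete, here is a minimal sketch of protecting a sensitive column before it ever reaches the database, using keyed pseudonymization (HMAC) as a stand-in for true column-level encryption. The key and column names are hypothetical; a real deployment would use TDE or a vetted encryption library with proper key management.

```python
import hashlib
import hmac

# Hypothetical key; never hard-code keys in production.
SECRET_KEY = b"demo-key-do-not-use-in-production"

def protect(value: str) -> str:
    """Deterministically pseudonymize a sensitive column value (HMAC-SHA256)."""
    return hmac.new(SECRET_KEY, value.encode(), hashlib.sha256).hexdigest()

# The raw identifier is transformed in the application tier; the database
# only ever stores the protected token alongside non-sensitive columns.
row = {"patient_id": protect("123-45-6789"), "diagnosis_code": "E11.9"}
print(len(row["patient_id"]))  # 64 hex characters; the raw value is never stored
```

Determinism is what makes this pattern joinable across tables, and it is also its main trade-off versus randomized encryption, which is exactly the kind of pro/con the recommendation document should spell out.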
Compliance plays a very important role in most organizations. It could be regulatory compliance, such as HIPAA, or internal compliance, such as adhering to standards. Security, Encryption, and Standardization are a few elements of a good compliance story.
Licensing Strategy
Too often I see organizations ignoring Licensing, or not making it a part of their Data Strategy and Data Architecture. For example, if virtualization is a key strategy, you should seek out the licensing agreements that are the most cost-efficient and flexible for supporting a virtualized infrastructure.
If the public cloud is a part of your architecture, then a licensing agreement that allows you to seamlessly move servers / instances from on-prem to the cloud would be useful.
Another scenario where Licensing plays a big role is the rollout of the appropriate software edition. For instance, SQL Server Standard and Enterprise editions are licensed differently and are significantly different in cost and capabilities. Does your licensing agreement allow you to use either edition without a huge cost implication? Do you have an evaluation process to determine which edition to deploy and when?
Buy Vs Build Strategy
As an organization, have you arrived at a Buy vs Build strategy? More and more, I see companies lean towards ‘Buy’ unless a ‘Build’ is necessary. Buy vs Build affects many of the things I have discussed above, so it is important to have a strategy and a philosophy on it.
Appliance vs Custom Build vs VMs What role do appliances play in your overall architecture? Appliances for DW, OLTP, storage, etc. are becoming more and more prevalent. Appliances definitely have specific value and can be used effectively. However, a judicious evaluation of the role of an appliance is important: the how, why, and when have to be studied and made a part of your Data Strategy and Data Architecture.
Innovation and Experimentation
Do you provide a platform for innovation and experimentation, and, more importantly, do you foster a culture for it? You may wonder, ‘How is this a part of a Data Strategy and Data Architecture?’
Let me explain. A few years ago, when Hadoop was a new technology, how easy would it have been to experiment with it in your organization? Do you have the process, the infrastructure, and the methodology to bring a new and upcoming technology into your IT landscape to evaluate it, assess it, and determine its proper fit and use? Creating and fostering an environment for innovation and experimentation should be a part of your Data Strategy and Data Architecture.
A well-thought-out Data Strategy will make you more efficient, agile, and consistent in your world of data. It becomes much easier to realize the full value of your data, enable critical reporting and BI, and, most importantly, integrate data within and outside your organization. Remember, a company that can efficiently use data will always have a competitive advantage.