The objective
To use data across several different systems, a unique identifier is necessary. Following the Data Vault 2.0 approach (see Dan Linstedt), the best way is to use hashes. MD5 hashes are available on most systems, so it makes sense to use MD5 as the hashing algorithm. If the same key (hash) is available on all systems, we can run key-based queries across the Oracle DB and Hive, e.g. using Big Data SQL. Whether to generate the keys in the Oracle DB or on the Hadoop side can be decided based on the available resources. In the next part I describe the creation of the keys in Hive and Oracle DB.
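The principle can be tried out independently of any database. The following is a minimal Python sketch, not the Hive or Oracle implementation: the function name `hash_key`, the normalization (trim and upper-case), the `;` delimiter, and the sample values are all assumptions made for illustration. It shows that the same business key, once normalized the same way, yields the same MD5 key wherever it is computed.

```python
import hashlib


def hash_key(*business_key_parts, delimiter=";"):
    """Build an MD5 hash key from one or more business key parts.

    Hypothetical helper: normalization rules and delimiter are assumptions.
    """
    # Assumed normalization: trim whitespace and upper-case every part,
    # so that formatting differences between systems do not change the key.
    normalized = [str(part).strip().upper() for part in business_key_parts]
    # Join the parts with a delimiter so that ("AB", "C") and ("A", "BC")
    # do not collapse into the same input string.
    joined = delimiter.join(normalized)
    # Encode explicitly as UTF-8 and return the digest as upper-case hex.
    return hashlib.md5(joined.encode("utf-8")).hexdigest().upper()


# Two differently formatted versions of the same (made-up) business key.
key_a = hash_key(" cust-1001 ", "de")
key_b = hash_key("CUST-1001", "DE")

print(key_a == key_b)  # True
print(len(key_a))      # 32
```

For the keys to match across systems, every platform has to apply the same normalization, the same character encoding, and the same hex casing, so these are worth checking on each side before joining on the hash.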