Most of the time, the same object (a person, a thing, or an abstract concept) is described by different attributes in different data sources. So, when the data sources are combined, the information available about the object may increase, decrease, or remain unchanged, depending on how the attributes combine.
From a Big Data viewpoint, data awareness involves identifying the relevant entities (objects), which may reside in a single data source or in multiple data sources. Identifying entities also includes identifying the coherence (relationships) among them.
Hadoop, a New Technology:
Hadoop is designed to deal with very large datasets by distributing the data over many servers. To understand this technology it is useful to consider it in relation to the four major steps identified earlier – acquisition, marshaling, analysis, and action. Hadoop and its related technologies really only address the marshaling and analysis steps.
In some cases, different sources hold the same data but represent it under different attribute or class names. Refer to the tables above. Here, the ambiguity between objects in different tables lies in the attribute labels, provided the values match.
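The label ambiguity described here can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the records, the attribute names (`cust_name`, `customer`, `city`, `town`), and the `CANONICAL` mapping are hypothetical and do not come from the tables referenced above. The idea is simply to rename source-specific labels to one shared label before comparing values.

```python
# Hypothetical records from two sources that describe the same person
# under different attribute labels (illustrative data only).
source_a = [{"cust_name": "Asha Rao", "city": "Pune"}]
source_b = [{"customer": "Asha Rao", "town": "Pune"}]

# Assumed mapping from each source-specific label to one canonical label.
CANONICAL = {
    "cust_name": "name",
    "customer": "name",
    "city": "city",
    "town": "city",
}


def normalize(record):
    """Rename each attribute to its canonical label; unknown labels are kept."""
    return {CANONICAL.get(key, key): value for key, value in record.items()}


normalized_a = [normalize(record) for record in source_a]
normalized_b = [normalize(record) for record in source_b]

# Once the labels agree, matching values reveal that both records
# describe the same object.
same_object = normalized_a[0] == normalized_b[0]
print(same_object)
```

In practice the mapping itself is the hard part; here it is written by hand, which only works when the sources and their labels are already known.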
This is the most common and important aspect of the “variety” component of Big Data. Every data source holds partial information about a particular object, and complete information can be obtained by merging the sources accordingly. An important driver of Big Data initiatives is to get the maximum information about objects by collating the data sources.
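As a rough sketch of how partial records might be collated into a fuller picture of one object, the example below merges two sources on a shared key. The source names (`hr`, `crm`), the `id` key, the sample values, and the `merge_sources` helper are all hypothetical, and the rule that the first source wins when two sources disagree is an assumption, not something prescribed here.

```python
def merge_sources(key, *sources):
    """Combine partial records that share the same key into one record per object.

    When two sources supply the same attribute, the value from the
    earlier source is kept (an assumed conflict rule).
    """
    merged = {}
    for source in sources:
        for record in source:
            entity = merged.setdefault(record[key], {})
            for attribute, value in record.items():
                entity.setdefault(attribute, value)
    return merged


# Two hypothetical sources, each holding part of the information
# about the same object (illustrative data only).
hr = [{"id": 1, "name": "Asha Rao", "department": "Finance"}]
crm = [{"id": 1, "email": "asha@example.com", "city": "Pune"}]

profile = merge_sources("id", hr, crm)[1]
print(profile)
```

Neither source alone holds all five attributes; the merged record does, which is the gain that collating sources is meant to deliver.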