Once a year, on its birthday, I say a few words about my G2 project. Three years ago today, G2 was made public on International Privacy Day in 2011. With two years of secret “skunk works” development before that, G2 is now five years in the making. It is no small endeavor.
More background on G2 | Sensemaking can be found here.
I get asked from time-to-time to describe what makes G2 different. So this year, I’m going to highlight seven rather unique features of G2 – aka seven not so secret sauces.
1. Principle-based Context Accumulation
Systems that make decisions about “when things are the same or related” generally require lots of data for training and/or experts to dictate very domain-specific rules. Lots of data for training is easier said than done, unless you are Google. And domain experts can be hard to come by when you need them, and after they come and go, one finds oneself encumbered by an ever-growing and complex set of rules – rules few, if any, will ever fully understand.
Principle-based context accumulation is a new technique to determine “when things are the same or related” that requires neither training data nor experts. As new data sources (e.g., the company’s asset management system), entity types (e.g., vehicles) and features (e.g., VIN numbers) are introduced to G2, there is no need to first see training data or extend an already unwieldy set of rather brittle rules.
Preparing G2 to make sense of new data sources, new entity types and new features will often take less than ten minutes. Immediately thereafter, one can start pumping vehicle related data into G2 and G2 will be able to differentiate when two vehicles are the same or related, without any other configurations/complexity.
How? Adding new data sources and entity types is nearly effortless – so simple that it should take no more than two minutes to introduce G2 to these concepts. Describing the new kinds of features to be expected (e.g., for a vehicle these might include VIN, license plate, make, model, year, color) is the time-consuming part … might even take eight minutes. This configuration step requires that the user define three behavior options for each added feature:
Frequency: How many entities generally share the same value? One, Few, Many?
Exclusivity: Does an entity generally have only one such value? Yes, No?
Stability: Is this value generally stable over the lifetime of the entity? Yes, No?
For example, a VIN number is simply introduced to G2 as an “F1ES” (Frequency One, Exclusive and Stable) because a VIN number typically refers to exactly one vehicle, a vehicle generally has only one VIN number (i.e., Exclusive), and a VIN number remains the same over the vehicle’s lifetime (i.e., Stable). The “Make” feature is “FMES” (Frequency Many, Exclusive and Stable). Color would be “FM” (Frequency Many, and not Exclusive or Stable) because a car can have multiple colors and can be repainted anytime. Keep in mind these are guidelines, not hard and fast behaviors. For example, G2 would detect that a VIN of “00000000” is not behaving as expected and take this into account, auto-magically.
Because all features are described as having expected behaviors, a small, manageable number of abstracted rules (aka principles) rely on only these feature behaviors to assert, persist and manage context (determining same and related). As a result, in most cases, G2’s default principles will be all you will ever need. So an organization can start with people and organizations and then pivot to vehicles, vessels and routers without training data and without the need for experts to build upon ever-expanding, elaborate rules.
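To make the idea concrete, here is a minimal sketch of how feature behaviors might be declared and how one principle could rely on behaviors alone. It is not G2's actual configuration format or rule set: the names Behavior, VEHICLE_FEATURES and compare are hypothetical, and the single principle shown is an illustration only.

```python
from dataclasses import dataclass
from enum import Enum


class Frequency(Enum):
    ONE = "1"
    FEW = "F"   # assumed shorthand; the text only shows "F1" and "FM"
    MANY = "M"


@dataclass(frozen=True)
class Behavior:
    frequency: Frequency
    exclusive: bool
    stable: bool

    def code(self) -> str:
        # Build shorthand such as "F1ES" or "FM" from the three answers.
        return ("F" + self.frequency.value
                + ("E" if self.exclusive else "")
                + ("S" if self.stable else ""))


# Hypothetical configuration for the vehicle example in the text.
VEHICLE_FEATURES = {
    "vin": Behavior(Frequency.ONE, exclusive=True, stable=True),
    "make": Behavior(Frequency.MANY, exclusive=True, stable=True),
    "color": Behavior(Frequency.MANY, exclusive=False, stable=False),
}


def compare(a: dict, b: dict, features: dict = VEHICLE_FEATURES) -> str:
    """One illustrative principle: it inspects behaviors, never feature names."""
    shared = [f for f in features if f in a and f in b]
    one_to_one = [f for f in shared
                  if features[f].frequency is Frequency.ONE
                  and features[f].exclusive]
    if any(a[f] != b[f] for f in one_to_one):
        return "different"   # conflicting one-to-one identifiers
    if one_to_one:
        return "same"        # every shared one-to-one identifier agrees
    return "unresolved"      # only many-to-one features to go on


car_a = {"vin": "1HGCM82633A004352", "make": "Honda", "color": "blue"}
car_b = {"vin": "1HGCM82633A004352", "make": "Honda", "color": "red"}
car_c = {"make": "Honda", "color": "blue"}
print(compare(car_a, car_b), compare(car_a, car_c))
```

Because compare never mentions "vin" by name, adding a new entity type is a matter of supplying another behavior table. The sketch treats behaviors as hard rules, which is a simplification: it would not notice a misbehaving value such as a VIN of “00000000”.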
2. Sequence Neutrality at Scale
I use the term “Sequence Neutrality” to mean: Did the records arrive in the order of [A, B, C] or some other order like [C, A, B]? Regardless of the arrival order of the data, the end-state should be the same. It took my team and me roughly 20 years of building and deploying entity resolution systems before we stumbled into this most subtle, essential behavior.
Imagine if you learned of a [Mark, 111-22-3333] first and a [Mark, Mark@email.com] second. With just these two records there is no basis to claim they are the same (based on just the same first name) so they become Entity1 and Entity2. Some time later you learn of a [Mark, 111-22-3333, Mark@email.com]; with this information, one could reasonably assert that Entity1 and Entity2 are the same. As such, this third record should cause Entity1 and Entity2 to collapse into a single Entity – an entity now containing all three observations (records).
Many systems cannot do the above; something we call “re-resolve.”
Few systems can do the reverse, using a new observation to fix previous false positives; something we call “un-resolve.”
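The re-resolve case from the Mark example can be sketched in a few lines. This is a toy, not G2's algorithm: it assumes two records are the same whenever they share an identifier value, it uses a union-find structure I chose for brevity, and it does not attempt un-resolve. The names Resolver, ingest and IDENTIFYING are hypothetical.

```python
from itertools import permutations

IDENTIFYING = ("ssn", "email")  # assumed one-to-one features; name is ignored


class Resolver:
    def __init__(self):
        self.parent = {}      # record id -> parent record id (union-find)
        self.first_seen = {}  # (feature, value) -> first record id carrying it

    def _find(self, rid):
        while self.parent[rid] != rid:
            self.parent[rid] = self.parent[self.parent[rid]]  # path halving
            rid = self.parent[rid]
        return rid

    def ingest(self, record):
        rid = record["id"]
        self.parent[rid] = rid
        for feature in IDENTIFYING:
            value = record.get(feature)
            if value is None:
                continue
            key = (feature, value)
            if key in self.first_seen:
                # Re-resolve: merge this record's entity with the earlier one.
                self.parent[self._find(rid)] = self._find(self.first_seen[key])
            else:
                self.first_seen[key] = rid

    def entities(self):
        groups = {}
        for rid in list(self.parent):
            groups.setdefault(self._find(rid), set()).add(rid)
        return frozenset(frozenset(g) for g in groups.values())


A = {"id": "A", "name": "Mark", "ssn": "111-22-3333"}
B = {"id": "B", "name": "Mark", "email": "Mark@email.com"}
C = {"id": "C", "name": "Mark", "ssn": "111-22-3333", "email": "Mark@email.com"}


def end_state(records):
    resolver = Resolver()
    for record in records:
        resolver.ingest(record)
    return resolver.entities()


# Every arrival order should converge on the same end-state.
end_states = {end_state(order) for order in permutations([A, B, C])}
print(len(end_states) == 1)
```

After only A and B the resolver holds two entities; once C arrives, in any position, all three records collapse into one. Merging on shared identifiers is order independent by construction, which is why this toy is easy. The hard part described in the text, reversing an earlier assertion in real time at scale, is exactly what it leaves out.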
And no other technology, to my knowledge, can perform such re-resolve and un-resolve events while managing the relationship graph, in real-time, and at scale. The fastest we can do this today in our existing industrial strength engine is roughly 2k/second over a database containing >10B records and >500M entities. G2 is specifically designed to smoke these numbers.
Sequence Neutrality is non-trivial. Essentially, this concept means that new observations can reverse earlier assertions. Imagine that. When ingesting the 10 billionth observation, not only does G2 figure out how it relates to what is known, it also asks the question “Had I known this in the beginning, over the 10B previous observations and assertions I have already made, should I have made any of these assertions differently?” And if so, these Sequence Neutral algorithms fix the past while integrating the new observation. G2 has been specifically designed to do this at extraordinary scale, far beyond anything the world has seen yet.
How? We have fundamentally re-engineered the underlying schemas to work on a distributed number of computers (call it “optimized for the elastic cloud” if you like). We also have introduced a slight increase in compute during ingestion that pays off big time later when dealing with entities containing very large numbers of conjoined records. That is about all I am willing to share at this time.
I think this will be the single most difficult aspect of G2 for others to replicate. And its importance is paramount. Without Sequence Neutrality, new data can cause databases to drift from the truth – data drift. One common remedy for data drift involves periodically tearing down and reloading the entire database; obviously, the larger the database the less fun this is. As well, the exciting “Big Data. New Physics.” phenomenon I have described, whereby a system becomes more accurate and faster as more data is loaded, cannot be attained without this Sequence Neutrality behavior.
How hard is this? Now with over 10 years of engineering experience in Sequence Neutral algorithms, my G2 team and I continue to learn, tune, weep, fix and dream. Nonetheless, I think we are years ahead – if not a decade ahead.
3. Privacy by Design (PbD)
We, just three of us at the time, spent the first year designing G2 on paper, drafting detailed design specifications. From day one, we felt it was important to bake-in as many privacy and civil liberties protecting features as we could fathom.
All seven of our Privacy by Design features can be read about here: Privacy by Design in the Era of Big Data. We are proud to say that G2 may have more baked-in privacy features than any other, even remotely similar, technology.
Features like Selective Anonymization (a capability that now ships with SPSS Modeler Premium V16) will allow an organization to perform rich analytics without using human readable forms of Personally Identifiable Information (PII). This PbD feature reduces the risk of unintended disclosure – and in this day and age given all of the data breaches – this one feature may very well become a business imperative.
4. High Tolerance for Uncertainty
“The test of a first rate intelligence is the ability to hold two opposed ideas in the mind at the same time, and still retain the ability to function.”
There is a lot of uncertainty, ambiguity and conflicting information in the real world. Traditional systems and processes spend considerable effort trying to remedy uncertainties and errors … as they yearn to establish a single version of truth. Not G2. There are stark differences between the intentions of Master Data Management (MDM) vs. Sensemaking.
The G2 technology has a high tolerance for uncertainty (e.g., they might be the same or related), ambiguity (e.g., this Pat record could be this Patrick or that Patricia) and information in conflict (e.g., a person who reportedly has eleven dates of birth). In fact, we find that all this natural variability in data – albeit disconcerting at times – is valuable and makes our system smarter.
Turns out: Sometimes bad data is good. Before you pooh-pooh this crazy talk, might I remind you that you have already seen and benefitted from similar nonsense.
When you search Google and it says “Did you mean this?”
Google is not using a dictionary.
Rather, it has remembered all the errors.
If Google had not remembered all the errors (natural variability), it would not be so smart.
G2 is well suited to manage millions and even billions of uncertainties, ambiguities, and contradictions. And there is no compelling need for humans to review all these maybes. Instead G2 is quite comfortable letting all these uncertainties just fester – waiting for new information to bring clarity, or not.
G2 is smarter for this reason. For example, because G2 has such a high tolerance for information in conflict (e.g., noting someone has eleven different dates of birth) G2 comes to realize on its own that it is confused. Why is this important? Well, if you are looking to analytics to make important decisions, wouldn’t you want to know during the decision-making process if there was any related confusion before action is taken?
5. Selective Curiosity
Now imagine all of those maybes, uncertainties, ambiguities and dissent, floating around as described above. Most of those maybes don’t matter and never will.
Imagine a system that routinely comes upon maybe conditions. And with each such occurrence the system asks itself “If this was true, would it matter?” Of course, most of the time the answer would be no. In this case G2 moves on, as it lets this uncertainty fester. But every now and then, as you might imagine, one stumbles upon a maybe whereby if it turned out to be true … Holy Crap! Summon the police – Billy the Kid is in the house!
The G2 Selective Curiosity feature is just that: selectively curious. It finds a maybe that would matter. It figures out what it wishes it knew (e.g., If I just knew the work address …). It figures out where it should go to ask for such a data point (e.g., Google, LexisNexis or … wait for it … a Jeopardy! champion). Then G2 asks. And if lucky, the next inbound observation from this inquiry confirms or denies. It won’t always be so lucky, of course, but such is life.
6. Geospatial Awareness
I firmly believe geospatial data – data about where things are when – will prove to be the highest order bits when it comes to Sensemaking systems.
The power of geospatial data and the privacy ramifications are mind-numbing. If you want to get excited and creeped-out at the same time about all the potential of geospatial data, check out this session Jennifer Lynch of EFF and I delivered at the 2013 SxSW Conference entitled “I Know Where You're Going: Location as Biometric.”
With G2 I have introduced the notion of Space-Time-Boxes (STB’s). An STB is used to group nearby coordinates with nearby time. The conversion of geospatial data to STB’s enables exceptionally fast correlation of space and time. The technique also helps account for rather large errors in precision, although these details are beyond the scope of a happy birthday G2 blog post. [We have a few techie papers. If you want to know more and are willing to agree to no further redistribution, drop me a note and make your case.] In short, STB’s will allow G2 to contribute to new and exciting domains ranging from better identity theft protection and maritime domain awareness to the hunt for asteroids.
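For a rough feel of why boxing makes correlation fast, here is a minimal sketch. It is not G2's STB encoding, which is not described here. It assumes a fixed grid in degrees and a fixed time window, and the names stb_key, co_located, cell_deg and window_s are hypothetical.

```python
import math
from collections import defaultdict


def stb_key(lat, lon, epoch_s, cell_deg=0.01, window_s=900):
    """Quantize a point in space and time to a coarse box (assumed scheme)."""
    return (math.floor(lat / cell_deg),
            math.floor(lon / cell_deg),
            int(epoch_s // window_s))


def co_located(observations, **grid):
    """Return the boxes that contain more than one distinct entity."""
    boxes = defaultdict(set)
    for entity, lat, lon, epoch_s in observations:
        boxes[stb_key(lat, lon, epoch_s, **grid)].add(entity)
    return {key: ents for key, ents in boxes.items() if len(ents) > 1}


observations = [
    ("phone-1", 37.7749, -122.4194, 1000),
    ("phone-2", 37.7751, -122.4191, 1500),  # nearby in space and time
    ("phone-3", 37.7749, -122.4194, 5000),  # same place, much later
]
hits = co_located(observations)
print(hits)
```

Once every observation carries a box key, "who was near whom, when" becomes an equality lookup rather than a distance calculation. One known weakness of any fixed grid is that two close points can straddle a cell edge and land in different boxes; this sketch does nothing about that.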
Note: While asteroids don’t have privacy, when dealing with geospatial data about consumers I first strongly recommend you let them opt-in. And then if you are going to perform compute over their geospatial data I would encourage you to use STB’s in conjunction with G2’s Selective Anonymization feature (as discussed above in the Privacy by Design section). This means rich geospatial analysis can be performed while at the same time reducing the risk of unintended disclosure.
7. Diverse Perspective
Imagine a rather large pile of messy puzzle pieces (some pieces missing, some duplicates, some professionally fabricated lies, etc). If we gave exactly this same set of puzzle pieces to two people, and gave them each some finite amount of time to make the most of it, do you think both assemblies would be exactly the same? Of course not. Now, while much of the puzzle assembly is likely to be in agreement between these two people, one may have missed a few obvious connecting pieces (false negatives). The other may have inadvertently connected two pieces in error (false positives).
Why is this? Despite having the same observation space presented to them … they each used their own strategy and biases to assert which pieces belonged where.
For example, one analyst considers name and date of birth sufficient evidence to claim someone is the same (rarely a good idea by the way). Another analyst believes this should be treated as just a potential match – at least until more evidence emerges such as a similar address. Maintaining both perspectives allows G2 to notice there may be different opinions about the contextual interpretation of things, despite the fact the observation space is the same.
Under the covers, G2 utilizes the notion of a “Lens.” Raw observations are maintained in a single place and in a single state. Interpretations about how these observations relate to each other are managed and maintained through one or more “Lenses.” This Lens construct allows varying perspectives to co-exist over a single observation space.
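The two analysts from the example above can be sketched as two Lenses over one set of observations. This is only an illustration of the idea, not G2's Lens implementation. The sample records, the lenient and strict rules, and the view function are all hypothetical.

```python
# Raw observations live in one place and are never modified by a lens.
OBSERVATIONS = (
    {"id": "r1", "name": "Pat Smith", "dob": "1970-01-01", "address": "1 Main St"},
    {"id": "r2", "name": "Pat Smith", "dob": "1970-01-01", "address": "9 Oak Ave"},
    {"id": "r3", "name": "Pat Smith", "dob": "1970-01-01", "address": "1 Main St"},
)


def lenient(a, b):
    # First analyst: name and date of birth are enough to call it the same.
    return a["name"] == b["name"] and a["dob"] == b["dob"]


def strict(a, b):
    # Second analyst: also wants a matching address before asserting "same".
    return lenient(a, b) and a["address"] == b["address"]


def view(observations, same):
    """Group the observations into entities under one lens's matching rule."""
    entities = []
    for obs in observations:
        merged, rest = [obs], []
        for entity in entities:
            if any(same(obs, other) for other in entity):
                merged.extend(entity)   # obs links this entity to the new group
            else:
                rest.append(entity)
        entities = rest + [merged]
    return {frozenset(o["id"] for o in entity) for entity in entities}


lenses = {"lenient": lenient, "strict": strict}
views = {name: view(OBSERVATIONS, rule) for name, rule in lenses.items()}
print(views["lenient"] != views["strict"])
```

Under the lenient rule all three records become one entity; under the strict rule r2 stays separate. Any pair of records on which the two views disagree is a place where a minority opinion exists, and both views are derived from the same untouched observations.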
The ability to recognize Diverse Perspectives is essential to Sensemaking as this enables, among other things, one to notice there is a Minority Report when considering critical decisions.
Of the seven capabilities described above, items 1, 2, 3, 4 and 6 are in relatively decent working order. We have not yet started working on items 5 (‘Selective Curiosity’) and 7 (‘Diverse Perspectives’) – these are on our roadmap. I have lots of things on the G2 roadmap – at least a few more years’ worth of exciting features.
G2 is not trying to deliver Artificial Intelligence or Cognitive Computing. That said, I think G2’s capacity to deliver “incremental context accumulation” may be a fundamental stepping stone for such things as Artificial Intelligence and Cognitive Computing in the future.
To be clear about what G2 is and is not: G2 does not extract structure from unstructured data. G2 does not discover new patterns (more about the relationship between what G2 does versus “Deep Reflection” here in this video). G2 does not express itself in any human consumable, visual form (think XML in, XML out). While G2 does not do these things, one day G2 may help these technologies do these things better.
So what does G2 do? Put in the simplest terms: G2 helps find the obvious.
“All truths are easy to understand once they are discovered; the point is to discover them.”
Galileo Galilei (1564-1642)
So what is so great about G2? Well … yes, it finds the obvious … at the moment it becomes knowable, over billions and billions of historical observations, in real-time, responding fast enough to do something about it while it is still happening.
Availability, you ask … How do you get some? Well, this is not open source, my friend. And if you are waiting for anything close to appear in open source, I suggest you do not hold your breath. Fear not, there is a relatively inexpensive way to get your hands on G2. In fact, there is some chance if you work for a big company you may already own it. A light-weight version of G2 is now commercially available via SPSS Modeler Premium V16. This very easy to use version of G2 manifests itself as the “Entity Analytics” node, which is included at no charge. If you own this, you can get real business done today. It comes with Selective Anonymization and Space-Time-Boxes (STB’s) too. And, if you have more than 10M records and want to benefit from degrees of separation (how entities are related) there is an “Entity Analytics Unleashed” version that will cost you a few more bucks.
As you can probably tell, this project is my “MAIN THING” and is very exciting. I would like to thank my team that works tirelessly as they perform their engineering feats of strength. And, last but not least, I would like to thank IBM, my employer, for their faith in this work and the significant investment they have made to date and continue to make.
MORE ABOUT SPSS MODELER WITH ENTITY ANALYTICS (G2):