Our goal is to enable machines to better understand human communication. An important question is: what does the word “understand” mean here? Consider the following example. When we humans see “25 Oct 1881”, we recognize it as a date, although most of us do not know what it refers to. However, given a little more context, say the date is embedded in the short text “Pablo Picasso, 25 Oct 1881, Spain”, most of us would guess (correctly) that the date represents Pablo Picasso’s birthday. We are able to do this because we possess certain knowledge; in this case, “one of the most important dates associated with a person is his birthday.”
As another example, consider a problem in natural language processing. Humans do not find sentences such as “animals other than dogs such as cats” ambiguous, but machine parsing can lead to two possible understandings: “cats are animals” or “cats are dogs.” Common sense tells us that cats cannot be dogs, which renders the second parsing improbable.
It turns out that what we need in order to act like a human in the above two examples is nothing more than knowledge about concepts (e.g., persons and animals) and the ability to conceptualize (e.g., cats are animals). This is not a coincidence. Psychologist Gregory Murphy began his highly acclaimed book with the statement “Concepts are the glue that holds our mental world together”. A book review in Nature magazine pointed out, “Without concepts, there would be no mental world in the first place”. Needless to say, having concepts and the ability to conceptualize is one of the defining characteristics of humanity. The question, then, is: how do we pass human concepts to machines, and how do we enable machines to conceptualize?
At Microsoft Research, we built a research project called Probase, a big graph of concepts. Knowledge in Probase is harvested from billions of web pages and years' worth of search logs; these are nothing more than the digitized footprints of human communication. In other words, Probase uses the world as its model. This Microsoft Concept Graph release is built upon Probase.
Please go to the DOWNLOAD page to get the Microsoft Concept Graph.
Our mental world contains many concepts about worldly facts, and the Microsoft Concept Graph tries to duplicate them. The core taxonomy of the Microsoft Concept Graph alone contains more than 5.4 million concepts. The above figure shows their distribution: the Y axis is the number of instances each concept contains (logarithmic scale), and on the X axis are the 5.4 million concepts ordered by their size. In contrast, existing knowledge bases have far fewer concepts (Freebase contains no more than 2,000 concepts, and Cyc has about 120,000 concepts), which fall short of modeling our mental world. As the figure shows, besides popular concepts such as “cities” and “musicians”, which are included in almost every general-purpose taxonomy, the Microsoft Concept Graph has millions of long-tail concepts such as “anti-parkinson treatments”, “celebrity wedding dress designers” and “basic watercolor techniques”, which cannot be found in Freebase or Cyc. Besides concepts, the Microsoft Concept Graph also has a large data space (each concept contains a set of instances or sub-concepts), a large attribute space (each concept is described by a set of attributes), and a large relationship space (e.g., “locatedIn”, “friendOf”, “mayorOf”, as well as relationships that are not easily named, such as the relationship between apple and Newton).
In this first release, the Microsoft Concept Graph mainly contains the IsA relation.
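As a rough illustration, IsA pairs can be loaded into a simple in-memory map. The tab-separated (concept, instance, frequency) layout and the counts below are assumptions for illustration, not a specification of the release format:

```python
from collections import defaultdict

def load_isa(lines):
    """Parse lines of 'concept<TAB>instance<TAB>count' into a nested
    dict: concept -> {instance: co-occurrence count}."""
    counts = defaultdict(dict)
    for line in lines:
        concept, instance, n = line.rstrip("\n").split("\t")
        counts[concept][instance] = int(n)
    return counts

# Hypothetical sample lines; the counts are invented.
sample = [
    "software company\tmicrosoft\t8219",
    "software company\tadobe\t4037",
    "fruit\tapple\t6912",
]
isa = load_isa(sample)
```

The raw counts matter: they are what the probabilistic measures described below are estimated from.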
The Microsoft Concept Tagging model (a.k.a. the Conceptualization model) maps entities mentioned in text to semantic concept categories with probabilities, which may depend on the textual context of the entities. For example, “Microsoft” can be automatically mapped to “software company”, “Fortune 500 company”, etc., each with some probability. This gives computers a common-sense computing capability and makes machines “aware” of the mental world of human beings, so that they can better understand human communication in text. Specifically, conceptualization maps instances or short texts into a large, automatically learned concept space (a vector space) with human-level concept reasoning. The result can be treated as a text embedding that is both human-understandable and machine-understandable. It thus supports text concept tagging, short-text semantic similarity computation, and other text-understanding tasks, and can benefit various text processing applications, including search engines, automatic question answering, online advertising, recommendation systems, and artificial intelligence systems.
1. Single instance conceptualization (this release)
Single instance conceptualization returns a ranked list of automatically learned concept/category names for any input entity mention/instance. Each concept is assigned a probability denoting how likely the input entity belongs to it. As a result, the input entity is represented as a numerical vector: its distribution over the concept space.
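A minimal sketch of how such a distribution can be derived, assuming the simplest estimate P(c|e) = n(e,c) / Σ n(e,c') over co-occurrence counts; the counts here are invented for illustration:

```python
def concept_distribution(instance, counts):
    """Normalize an instance's co-occurrence counts into P(c | e).
    counts: dict mapping (instance, concept) -> co-occurrence count."""
    scores = {c: n for (e, c), n in counts.items() if e == instance}
    total = sum(scores.values())
    return {c: n / total for c, n in scores.items()}

# Invented counts for the instance "microsoft".
counts = {
    ("microsoft", "company"): 7000,
    ("microsoft", "software company"): 2500,
    ("microsoft", "largest os vendor"): 500,
}
dist = concept_distribution("microsoft", counts)
# dist is a normalized vector over concepts, e.g. P(company|microsoft) = 0.7
```

The resulting dict is exactly the "numerical vector over the concept space" described above, with one probability per concept.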
For human beings, given a single instance, this concept distribution often forms automatically and subconsciously. More importantly, the categories at the appropriate level of abstraction rank higher. Psychologists and linguists call this Basic-level Categorization (BLC).
As an example, consider the term Microsoft, which can be categorized into a large number of concepts, ranging from the extremely general to the extremely specific, such as company, software company, and largest OS vendor. If we go through company, we find objects such as McDonald’s and BMW, which bear little similarity to Microsoft. If we go through largest OS vendor, we may not find any reasonable object other than Microsoft itself. On the other hand, if we go through software company, we find Oracle, Adobe, and IBM, which are much more similar to Microsoft. Thus, software company is a more appropriate basic-level concept for Microsoft; in other words, properties associated with software company apply more readily to Microsoft, which is also why, through software company, we can find many objects that are similar to Microsoft.
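One way this trade-off can be scored, sketched below under the assumption that representativeness combines typicality in both directions, Rep(e, c) = P(c|e) · P(e|c); the scoring form and all counts are illustrative, not the released model:

```python
def rep_score(n_ec, n_e, n_c):
    """Rep(e, c) = P(c|e) * P(e|c), estimated from raw co-occurrence
    counts: n_ec for the (e, c) pair, n_e for the instance e,
    n_c for the concept c."""
    return (n_ec / n_e) * (n_ec / n_c)

# Invented counts for the Microsoft example: 'company' is huge, so
# P(e|c) is tiny; 'largest os vendor' is rare, so P(c|e) is tiny.
scores = {
    "company":           rep_score(7000, 10000, 5_000_000),
    "software company":  rep_score(2500, 10000, 5_000),
    "largest os vendor": rep_score(500, 10000, 600),
}
best = max(scores, key=scores.get)
```

Overly general concepts lose on P(e|c) and overly specific ones lose on P(c|e), so a mid-level concept such as software company comes out on top, matching the intuition above.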
In this release, we provide the concept distribution of the input text with basic-level conceptualization. In addition, several common measures for conceptualization, including MI, PMI, PMI^k, and Typicality, are provided.
A snapshot of the demo: given a single instance “python”, the demo returns concept distributions under different measures (including the BLC measure):
You can simply integrate this single instance conceptualization service into your own applications.
2. Single instance conceptualization with context (v2, future release)
Given “apple” and “pie”, our API maps “apple” to fruit-related senses.
Given “apple” and “ipad”, our API maps “apple” to company-related senses.
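A toy sketch of how context can shift the distribution, assuming a naive Bayes-style combination score(c) ∝ P(c|e) · P(w|c) for a context word w; this is an illustration, not the v2 API, and all probabilities are invented:

```python
def disambiguate(entity_prior, word_given_concept):
    """Rescore concepts for an entity given one context word.
    entity_prior: P(c | e); word_given_concept: P(w | c)."""
    scores = {c: p * word_given_concept.get(c, 1e-6)
              for c, p in entity_prior.items()}
    total = sum(scores.values())
    return {c: s / total for c, s in scores.items()}

# Invented probabilities for "apple" with context "pie" vs "ipad".
apple_prior = {"fruit": 0.5, "company": 0.5}
pie_dist = disambiguate(apple_prior, {"fruit": 0.3, "company": 0.001})
ipad_dist = disambiguate(apple_prior, {"fruit": 0.0001, "company": 0.2})
```

With “pie” as context the mass shifts to fruit-related concepts, and with “ipad” it shifts to company-related ones, mirroring the two examples above.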
3. Short text conceptualization (v3, future release)
Given a short text such as “the engineer is eating the apple”, the service will perform segmentation, concept mapping, and sense disambiguation.
Please cite the following papers if you use our data:
Please cite the following papers if you use our conceptualization service:
Data Mining and Enterprise Intelligence Group, MSRA
We would like to acknowledge Haixun Wang, Zhongyuan Wang, Dawei Zhang, Jun Yan, Yangqiu Song, Hongsong Li, and many interns for their contributions to the Microsoft Concept Graph and the Microsoft Concept Tagging model. In particular, Haixun Wang initiated and led this project while he was at Microsoft Research; we highly appreciate his tremendous contributions and the insightful vision that made this project a success.