Reference Resolution for NLU
What is a meaning?
Words have multiple meanings. No, they don't. Words may refer to different meanings. Words are like Python variables: they may refer to values of different types. Substitute "type" with "property" and you get a good approximation of what a meaning is. We view properties as measuring tapes with ranges and points on them. An individual range or point of a specific property is a "real thing". A word referring to that range or point has that particular meaning. The referring relationship between a word and a meaning is imposed by the user of that word, the producer of a sentence containing it. Crucially, the producer expects the consumer of the sentence to be able to resolve the reference. Without an agreement between them about that, effective communication is impossible. This agreement may be broad or narrow: all speakers of a language, some group, or just a pair (spies?). The mechanism works for any language, even an artificial one.
Words are ambiguous because of their referential nature. Properties, on the other hand, are unambiguous. For clarity, I suggest supplying each property with two ids: one based on a common word referring to that property plus a unique index, and the other a unique numeric index. Internally, the machine will use the numeric indexes. In logs, to improve readability, combinations of words and indexes will be used.
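To make this concrete, here is a minimal sketch of the tape model and the dual-id scheme. Every name, numeric index, unit, and bound below is an illustrative assumption, not a fixed design.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Property:
    """A measuring tape with a readable id and an internal numeric id."""
    word_id: str   # common word referring to the property, plus an index
    num_id: int    # unique numeric index used internally by the machine
    low: float     # start of the tape
    high: float    # end of the tape

    def log_name(self) -> str:
        # In logs, combine both ids for readability.
        return f"{self.word_id}#{self.num_id}"

@dataclass(frozen=True)
class Meaning:
    """A range (or a point, when low == high) on a property's tape."""
    prop: Property
    low: float
    high: float

# Illustrative tapes; the units and bounds are assumptions, not canon.
HEIGHT_1 = Property("height-1", 1001, 0.0, 250.0)     # centimeters
PITCH_1  = Property("pitch-1",  1002, 20.0, 20000.0)  # hertz

# The word "high" does not have one meaning; it refers to one of several.
HIGH = [Meaning(HEIGHT_1, 180.0, 250.0),     # upper range of height-1
        Meaning(PITCH_1, 2000.0, 20000.0)]   # upper range of pitch-1

print(HEIGHT_1.log_name())  # height-1#1001
```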
Consider the word "high". It may refer to the "height-1" property or the "pitch-1" property, among others. In both cases, it refers to the upper range on the respective measuring tape. Can we determine the meaning of "high" on its own, or pick a synonym for it? No, we cannot. We can do it when "high" is used in a phrase like "high key". "key" can refer to different meanings as well, for example "tool-1" or "pitch-1". The intersection of the two sets of meanings leaves us with exactly one reading: "high key" = "upper pitch-1".
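The resolution step itself is plain set algebra. A minimal sketch, assuming a hand-made lexicon that maps each word to the set of properties its meanings can refer to:

```python
# Hand-made lexicon: word -> set of properties its meanings refer to.
LEXICON = {
    "high": {"height-1", "pitch-1", "vertical-position-1"},
    "key":  {"tool-1", "pitch-1"},
}

def resolve(*words: str) -> set:
    """Intersect the candidate property sets of all words in a phrase."""
    result = LEXICON[words[0]]
    for word in words[1:]:
        result = result & LEXICON[word]
    return result

print(resolve("high", "key"))  # {'pitch-1'}: "high key" = upper pitch-1
```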
Of course, it is not that simple. One more meaning of "high" refers to the property "vertical-position-1". In the sentence "your key is high on the shelf" that meaning takes over, and "key" is resolved in favor of the "tool-1" property, because "on the shelf" refers to a position, not a pitch.
It is interesting how natural languages assign words/labels to the values of different properties. It seems the assignment process takes into account the future need to intersect the sets of properties to resolve meanings. Boy, languages are cool!
How do you refer to a person in a crowd? You mention the values of different properties - "on the left", "tall", and "wearing a t-shirt". How do you choose properties to use? You want each additional reference to narrow down the number of matching candidates.
Consider the following text: "There are 10 students in a class, boys and girls. 3 of them are tall and they are boys. There are two teachers - John and Peter. John is tall". Resolve the references "tall teacher", "tall student", and "tall girl". Consider the properties "occupation-1" and "height-1". We take the available candidates (Michael Jordan is tall, but he is not in the class) and filter them by comparing properties. The first reference is easily resolved to the teacher John. The second reference is under-resolved, and clarifying questions will be required. The third reference yields an empty set. We may try to make the search fuzzy - "tall" compared to the other girls. Or we may say that there are no tall girls in the class (if "tall" was first defined as above 6 feet and no girl is that tall).
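A sketch of that filtering, with the episode hand-encoded; the field names and candidate records are assumptions for illustration:

```python
# Hand-encoded episode: each candidate carries property-value pairs.
episode = [
    {"name": "John",  "occupation-1": "teacher", "height-1": "tall"},
    {"name": "Peter", "occupation-1": "teacher"},
    # the three tall boys; the remaining students are not tall
    {"name": "boy-1", "occupation-1": "student", "sex-1": "male", "height-1": "tall"},
    {"name": "boy-2", "occupation-1": "student", "sex-1": "male", "height-1": "tall"},
    {"name": "boy-3", "occupation-1": "student", "sex-1": "male", "height-1": "tall"},
]

def resolve(reference: dict) -> list:
    """Keep candidates whose properties match every constraint."""
    return [c for c in episode
            if all(c.get(k) == v for k, v in reference.items())]

print(resolve({"occupation-1": "teacher", "height-1": "tall"}))
# one match: John -> resolved

print(len(resolve({"occupation-1": "student", "height-1": "tall"})))
# 3 matches -> under-resolved, ask a clarifying question

print(resolve({"occupation-1": "student", "sex-1": "female", "height-1": "tall"}))
# [] -> empty set: no tall girls in the class
```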
Consider ranges on a measuring tape once again. Imagine someone in a crowd is referred to as "6 feet tall". Suppose everyone there is well below 6 feet and only one person is slightly taller than that. If a police officer is looking for a person who is exactly 6 feet tall, then we have no match. If a coach is looking for a substitute for a basketball player, then we have a match. Mathematical logic or FOPL has little to do with a fuzzy search for the most fitting candidate. We will discuss logic another time.
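The officer and the coach apply different matching predicates to the same tape. A toy sketch, with invented heights and an invented tolerance:

```python
heights = {"Ann": 5.6, "Bob": 5.8, "Carl": 6.1}  # feet, illustrative

def exact_match(target: float, tol: float = 0.05) -> list:
    """Police-officer style: the value must sit at the target point."""
    return [n for n, h in heights.items() if abs(h - target) <= tol]

def best_fit(target: float) -> str:
    """Coach style: fuzzy search for the most fitting candidate."""
    return min(heights, key=lambda n: abs(heights[n] - target))

print(exact_match(6.0))  # [] -> no one is exactly 6 feet
print(best_fit(6.0))     # 'Carl' -> close enough to substitute
```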
Now, given ranges on a measuring tape, you can easily define synonyms and antonyms. If several words refer to roughly the same range, they are synonyms. If two words refer to ranges symmetric with respect to the middle of the tape, then those two words are antonyms. But this only works when we know in which meaning those words are used.
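Once meanings are ranges, synonymy and antonymy reduce to geometry on the tape. A sketch with invented ranges and an arbitrary overlap threshold:

```python
def overlap(a: tuple, b: tuple) -> float:
    """Fraction of the smaller range covered by the intersection."""
    lo, hi = max(a[0], b[0]), min(a[1], b[1])
    if hi <= lo:
        return 0.0
    return (hi - lo) / min(a[1] - a[0], b[1] - b[0])

def mirror(r: tuple, tape: tuple) -> tuple:
    """Reflect a range around the middle of the tape."""
    mid = (tape[0] + tape[1]) / 2
    return (2 * mid - r[1], 2 * mid - r[0])

TAPE  = (0.0, 100.0)    # a normalized property tape
tall  = (70.0, 100.0)
lofty = (65.0, 100.0)
short = (0.0, 30.0)

print(overlap(tall, lofty) > 0.8)          # True -> synonyms
print(overlap(mirror(tall, TAPE), short))  # 1.0  -> antonyms
```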
Anaphora and Deixis Resolution
We may have general knowledge about objects and living creatures and also specific knowledge about those in a given episode. Sentences deliver information piece by piece. New sentences may refer to information mentioned in previous ones, contained in general memory, or inferred from what is known. Such references are deictic. Anaphoric references are a subset of deictic ones: anaphora refers to what was mentioned in previous sentences.
Consider the following pronoun resolution example: "I saw a car. It was yesterday. It was red". What does "It" stand for in each case? The first "It" is linked to a temporal property, the second "It" to a color property. In the first sentence, we have two candidates for "It" - an event of seeing and a car, which is an item. Events have temporal properties. Items have color properties. So, the first "It" is the seeing, and the second "It" is the car.
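A sketch of that compatibility check; the property assignments come from the example, the rest is illustrative:

```python
# Which properties each kind of candidate can carry (illustrative).
CAN_CARRY = {
    "event": {"time-1"},
    "item":  {"color-1", "size-1"},
}

# Candidates introduced by "I saw a car": an event and an item.
candidates = [
    {"ref": "seeing", "kind": "event"},
    {"ref": "car",    "kind": "item"},
]

def resolve_it(linked_property: str) -> str:
    """Pick the candidate whose kind can carry the linked property."""
    for c in candidates:
        if linked_property in CAN_CARRY[c["kind"]]:
            return c["ref"]
    return "unresolved"

print(resolve_it("time-1"))   # 'seeing' -> "It was yesterday"
print(resolve_it("color-1"))  # 'car'    -> "It was red"
```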
Cataphora is a reference to something introduced later, as in the sentence "Because he was in a hurry John took a cab". "he" introduces some male character; later we supply additional properties, like "name-1", to him.
Deictic reference resolution is done over the available candidates. "she" will look for female characters mentioned before (or, in the case of cataphora, after). "do" will look for actions. "there" will look for locations. "tomorrow" will look for time references to perform basic "+1 day" arithmetic. Notice that we rely on meta-information here. "The thrown item" will look not just for items but for specific items that recently participated in a "throw" action in the Object role. Remember this when resolving Winograd schema-type cases, like the "thanking for assistance" sentences.
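A sketch of such candidate lookup over a hand-encoded discourse memory; every field and record here is an assumption for illustration:

```python
from datetime import date, timedelta

# Discourse memory with meta-information about each mention.
memory = [
    {"ref": "Mary", "type": "person", "gender": "female"},
    {"ref": "ball", "type": "item", "last_action": "throw", "role": "Object"},
    {"ref": "park", "type": "location"},
    {"ref": date(2024, 5, 1), "type": "time"},
]

def candidates(**constraints):
    """Filter memory by the constraints a deictic word imposes."""
    return [m for m in memory
            if all(m.get(k) == v for k, v in constraints.items())]

print(candidates(type="person", gender="female"))  # "she"
print(candidates(type="location"))                 # "there"
# "the thrown item": not just any item, but one that took the
# Object role in a recent "throw" action.
print(candidates(type="item", last_action="throw", role="Object"))
# "tomorrow": find a time reference and do the "+1 day" math.
base = candidates(type="time")[0]["ref"]
print(base + timedelta(days=1))                    # 2024-05-02
```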
We have mentioned meta-information. Consider the famous phrase "Just do it". Do what? The system will figure out the producer and the addressee of the sentence from meta-information tags. Then the machine will check their goals, and the obstacles on the way toward those goals, against formulas in memory. The producer is Nike. The addressees are people around the world, Nike's (potential) customers. Nike is a footwear manufacturer with an obvious goal of selling shoes. People recursively enforce on each other the need to be in good shape. Exercise poses a tradeoff between hard work and good results. All of these are pieces of knowledge stored in memory. "just" has a flavor of "despite it being hard". There are many hard actions that people know are good for them and still don't engage in. Only some of those actions involve shoes. So, "do it" may be resolved as "practice", "exercise", or "run" - preferably in Nike's shoes. Reference resolution is tricky, so they supply that slogan with a visual hint of running people.
Sentence Resolution
The syntax module may only provide us with a result of the form "NP verb NP", where "NP" stands for a noun phrase. Both sentences "John is a teacher" and "John hit a teacher" will be recognized as "NP verb NP". How do we convert them to the formulas they actually are, SVC and SVO?
Resolving constituents starts with the verb. A linking verb goes with the SVC formula; a transitive verb goes with the SVO formula. Therefore, we need to take those verb properties into account.
Then we proceed to the resolution of the object references. We need to know their possible meanings - the property-value pairs. We continue to the verb's possible meanings - which properties that verb may affect. In each meaning, a verb expects certain subjects; transitive verbs also expect certain objects, and for linking verbs the complement should fit the subject. Here we have a simple case of set intersection, which we have already seen.
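A sketch of that pipeline, under the assumption of a tiny hand-made verb and noun lexicon; the property names are invented:

```python
# Hypothetical verb lexicon: the verb's type picks the formula, and each
# meaning lists the properties expected of its arguments.
VERBS = {
    "is":  {"formula": "SVC"},                       # linking verb
    "hit": {"formula": "SVO",                        # transitive verb
            "subject": "animate-1", "object": "physical-object-1"},
}

# Possible meanings of the noun phrases, as property sets (illustrative).
NOUNS = {
    "John":    {"animate-1", "physical-object-1", "name-1"},
    "teacher": {"animate-1", "physical-object-1", "occupation-1"},
}

def formula(np1: str, verb: str, np2: str) -> str:
    entry = VERBS[verb]
    if entry["formula"] == "SVO":
        # Transitive verb: subject and object must fit its expectations.
        if entry["subject"] in NOUNS[np1] and entry["object"] in NOUNS[np2]:
            return f"SVO({np1}, {verb}, {np2})"
    else:
        # Linking verb: the complement must fit the subject, approximated
        # here as a non-empty intersection of their property sets.
        if NOUNS[np1] & NOUNS[np2]:
            return f"SVC({np1}, {verb}, {np2})"
    return "no consistent reading"

print(formula("John", "is", "teacher"))   # SVC(John, is, teacher)
print(formula("John", "hit", "teacher"))  # SVO(John, hit, teacher)
```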
There is more to verb resolution. Consider the phrase "These days eat apples". Should we read "eat" as the present tense or as the imperative mood?
In this approach, we interpret "eat" as changing the properties "hunger-1", "energy-1", "quantity-of-food-1", etc. It expects an "eater-1" as the subject and some "food-1" as the object. "apples" in the above sentence can be generalized to "food-1". The problem is with "these days", which cannot be generalized to "eater-1".
On the other hand, if we consider the imperative interpretation of the sentence, which does not require an explicit subject, then we only need to generalize "these days" to an adverbial of time, which is realistic.
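A sketch of that generalization test; the hierarchy and the reading definitions are hand-made for this one example:

```python
# Hand-made generalization links: phrase -> properties it can stand for.
GENERALIZES_TO = {
    "apples":     {"food-1"},
    "these days": {"time-adverbial-1"},  # cannot become "eater-1"
}

# What each reading demands of each phrase. The indicative needs an
# explicit "eater-1" subject; the imperative drops that requirement
# and treats "these days" as an adverbial of time.
READINGS = {
    "indicative": {"these days": "eater-1",          "apples": "food-1"},
    "imperative": {"these days": "time-adverbial-1", "apples": "food-1"},
}

def fits(phrase: str, expected: str) -> bool:
    return expected in GENERALIZES_TO.get(phrase, set())

for mood, slots in READINGS.items():
    ok = all(fits(p, e) for p, e in slots.items())
    print(mood, "fits" if ok else "fails")
# indicative fails: "these days" cannot be generalized to "eater-1".
# imperative fits: "these days" is just an adverbial of time.
```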
This approach enables the differentiation of many other constituents. Consider vocative vs subject resolution: "My dear diary John is my love. My dear diary John is my journal". In each sentence, which noun phrase is the vocative and which is the subject?
"love" may refer to the property "people-by-my-attitude-1" and "journal" may refer to the property "tools-for-keeping-notes-1". Each of the first two noun phrases does not refer to both of those properties. Also, notice that meta information about the addressee is helpful here.
A sentence may show only one side or aspect of an observed phenomenon, like "I heard an explosion". Our memory holds the knowledge that explosions also cause visual effects, smoke, damage, etc. Therefore that sentence alone will also explain smoke. We will explore this in a post about pragmatics.
Ambiguity Resolution
Now we will consider how to handle structural ambiguity using the example "Visiting relatives can be annoying". Is it me visiting my relatives, or relatives visiting me?
We expect a good text to provide enough information for disambiguation. Below are three helping sentences that might be provided, plus an unhelpful one. Consider also two options: a helping sentence may come before or after the ambiguous one. And let's make sure an unhelpful sentence, like the last one here, does not deceive us - just as the use of words different from those stored in memory should not.
My house got too crowded.
I do not like the climate there.
My shopping bill increased.
I do not like noisy people.
In the first stage, the syntax module produces two possible interpretations of the sentence: in one, "visiting" is a gerund; in the other, it is an adjective. In the first case, I visit relatives; in the second, relatives visit me.
As an action, a visit changes many properties. One of them is the number of people in a house. The first interpretation implies a decrease in that number for my house; the second implies an increase. "too crowded" resolves to the same kind of change as the second option. Therefore, in the presence of that hint, we may conclude that relatives visited me.
A visit also exposes the visitor to a new environment, with all the consequences. Therefore the second hint supports the first interpretation - that I visited relatives.
Notice the use of "there" as a reference to a distant location. The conclusion is additionally strengthened by the agreement of the emotional assessments "do not like" and "annoying". If these assessments were opposite, the hint would not be helpful.
The more people in my house, the more I need to buy every day to provide for them. Therefore the third hint supports the second option, that relatives visited me. Again, the emotional assessment of increased expenses is usually negative, which agrees with "annoying".
The unhelpful hint from the last sentence supports both options, and therefore it cannot help the process of disambiguation.
If a hint follows an ambiguous sentence, then during the initial assessment we proceed with the option most supported by records in memory - I do not exclude the use of statistics. Later, on encountering the hint, we return to the ambiguous phrase and reevaluate which interpretation fits better.
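One way to encode the hint logic above: each interpretation predicts property changes, and a hint votes for the interpretations consistent with its implications. Everything below is a toy encoding of this specific example:

```python
# Predicted effects of each interpretation (hand-encoded).
EFFECTS = {
    "I visit relatives":  {"people-in-my-house-1": "decrease",
                           "my-environment-1": "changes"},
    "relatives visit me": {"people-in-my-house-1": "increase",
                           "my-expenses-1": "increase"},
}

# What each helping sentence implies (hand-encoded from the examples).
HINTS = {
    "My house got too crowded.":        {"people-in-my-house-1": "increase"},
    "I do not like the climate there.": {"my-environment-1": "changes"},
    "My shopping bill increased.":      {"my-expenses-1": "increase"},
    "I do not like noisy people.":      {},  # implies nothing specific
}

def supported(hint: str) -> list:
    """Interpretations consistent with every implication of the hint."""
    implied = HINTS[hint]
    return [i for i, effects in EFFECTS.items()
            if all(effects.get(k) == v for k, v in implied.items())]

for hint in HINTS:
    print(f"{hint!r} -> {supported(hint)}")
# The first and third hints single out "relatives visit me", the second
# singles out "I visit relatives", and the last supports both - unhelpful.
```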
NLU experiment for AGI
Let's reorganize a dictionary. Consider only nouns, adjectives, and verbs for now. Assign a unique id like "key-1" to each meaning of those words and relate it to a property like "tool-1". The same verb, like "explode-1", may relate to several properties. For verbs, also specify the expected subjects and objects in terms of properties.
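A sketch of one possible record layout for such a dictionary; every field name and entry here is invented for illustration:

```python
# One possible record layout; all field names are illustrative.
DICTIONARY = {
    # nouns and adjectives: meaning id -> the property it refers to
    "key-1":  {"pos": "noun", "property": "tool-1"},
    "key-2":  {"pos": "noun", "property": "pitch-1"},
    "high-1": {"pos": "adjective", "property": "height-1", "range": "upper"},
    "high-2": {"pos": "adjective", "property": "pitch-1",  "range": "upper"},
    # verbs: the properties they affect plus expected subjects/objects
    "explode-1": {
        "pos": "verb",
        "affects": ["integrity-1", "sound-level-1", "temperature-1"],
        "subject": "explosive-thing-1",
        "object": None,  # intransitive in this meaning
    },
}

def meanings_of(word: str) -> list:
    """All meaning ids for a base word - a simple set-algebra entry point."""
    return [mid for mid in DICTIONARY if mid.rsplit("-", 1)[0] == word]

print(meanings_of("key"))   # ['key-1', 'key-2']
print(meanings_of("high"))  # ['high-1', 'high-2']
```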
Good old graph traversal and set algebra algorithms will handle these structures reliably and fast. I am skeptical about LLMs, but using such data may improve their performance as well. The true potential, though, will be realized with sentence formulas.
That was a long post. Hopefully, it all makes sense :)


If you want to build some useful application, then speculation about what is possible in principle is not enough. If you just want to push some ideas forward, then you must start from known ideas about meaning and reference, to avoid repeating many previous failures. You seem to be trying to apply unproven ideas to create algorithms based on your intuition and a handful of examples. You may succeed only by accident, and your chances are extremely low - like those of Leonardo da Vinci designing a flying machine (he tried and failed to design a helicopter and a submarine because he had no theory to build on).
I can only wish you good luck.
First, you need to disambiguate "cold", which could refer to sex appeal. In your example, she could be frigid, not freezing.
Second, a class hierarchy has to exist before you can navigate it. And it is not necessarily a hierarchy (a taxonomy), but more likely a conceptual lattice. How do you construct and maintain it?
Third, where do you get the properties from? If you have a fixed list of properties with definite descriptions like "person I care about", then you subscribe to a descriptive theory of reference, which suffers from paradoxes like Frege's famous puzzle - the one that led to the development of various theories of meaning. See https://plato.stanford.edu/entries/meaning/#TheoRefe