Giovanni Ciatto, Andrea Agiollo, Matteo Magnini, and Andrea Omicini
Dept. of Computer Science and Engineering (DISI), Alma Mater Studiorum - Università di Bologna, Italy
ArXiv preprint: arXiv:2404.04108
(Currently under revision for the Knowledge-Based Systems journal.)
Many knowledge-based applications require structured data for their operation
These systems commonly adapt their behaviour to the available data…
Notable example: recommendation systems
Chicken-and-egg problem, namely the cold-start:
As part of the CHIST-ERA IV project “Expectation”, we needed to:
design a virtual coach for nutrition
the system would need data about:
model the data schema and find some data matching it
Obvious solution: generating (i.e., synthesizing) data
Yet, the generated data should:
Easy yet powerful means to represent knowledge in a structured way
In a nutshell:
Thing (resp. Nothing) denotes the most general (resp. most specific) concept
In $\mathcal{ALC}$ Description Logic:
Definitions are axioms defining concepts, roles, and their relationships
Assertions are axioms assigning individuals to concepts or to roles
Two possibly interleaved phases:
“schema design” phase: defining concepts, roles, and their relationships
“population” phase: assigning individuals to concepts or to roles
The ontology population problem is about populating an ontology with instances, i.e. individuals
- this is often done manually
- called “ontology learning” when done (semi-)automatically from data
Manual population:
Automatic population:
Insight: replace domain experts with large language models (LLMs),
treating them as oracles for automating ontology population
Let’s discuss how!
Stemming from:
a partially- (or, possibly, non-)instantiated ontology $\mathcal{O} = \mathcal{C} \cup \mathcal{P} \cup \mathcal{X}$ consisting of:
a subsumption (a.k.a. sub-class) relation $\sqsubseteq$ between concepts in $\mathcal{C}$
a trained LLM oracle $\mathcal{L}$
a set of query templates $\mathcal{T}$ for generating prompts for $\mathcal{L}$
… produce $\mathcal{X}' \sqsupset \mathcal{X}$ such that:
Our algorithm for ontology population through LLM, stepping through 4 phases:
population phase: each concept in $\mathcal{C}$ is populated with a set of individuals
relation phase: each property in $\mathcal{P}$ is populated with a set of role assertions
redistribution phase: some individuals are reassigned to more adequate concepts
merge phase: similar individuals are merged into a single one
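A minimal sketch of how these four phases could be orchestrated, assuming a simple in-memory ontology representation and a hypothetical `ask_llm` oracle (illustrative names, not the paper's actual implementation):

```python
# Hypothetical 4-phase pipeline skeleton; Ontology, ask_llm, and the phase
# functions are illustrative placeholders, not the paper's API.
from dataclasses import dataclass, field

@dataclass
class Ontology:
    concepts: dict[str, list[str]] = field(default_factory=dict)          # concept -> direct sub-concepts
    properties: dict[str, tuple[str, str]] = field(default_factory=dict)  # property -> (domain, range)
    individuals: dict[str, set[str]] = field(default_factory=dict)        # concept -> asserted individuals
    relations: set[tuple[str, str, str]] = field(default_factory=set)     # (subject, property, object)

def populate(onto: Ontology, ask_llm) -> None: ...      # phase 1: individuals for each concept
def relate(onto: Ontology, ask_llm) -> None: ...        # phase 2: role assertions for each property
def redistribute(onto: Ontology, ask_llm) -> None: ...  # phase 3: move individuals to better-fitting concepts
def merge(onto: Ontology, ask_llm) -> None: ...         # phase 4: fuse semantically duplicate individuals

def instantiate(onto: Ontology, ask_llm) -> Ontology:
    """Run the four phases in order, each one querying the LLM oracle via prompts."""
    populate(onto, ask_llm)
    relate(onto, ask_llm)
    redistribute(onto, ask_llm)
    merge(onto, ask_llm)
    return onto
```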
Templates $\approx$ strings with named placeholders, to be filled with actual values via interpolation
(similar to printf format strings)
"Hello <WHO>" / {WHO -> "world!"} = "Hello world!"
Each phase leverages templates of different sorts:
"Give me examples of <CONCEPT>"
"Give me examples of <PROPERTY> for <INDIVIDUAL>"
"What is the best concept for <INDIVIDUAL> among <CONCEPTS>?"
"Are <INDIVIDUAL1> and <INDIVIDUAL2> the same <CONCEPT>?"
Focus on some class $R \in \mathcal{C}$ (most commonly Thing)
For each sub-class $C$ of $R$ (post-order-DFS traversal):
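A possible rendering of this population phase, assuming a toy sub-class table and a hypothetical `ask_llm` oracle that already returns a parsed list of individual names:

```python
# Population phase sketch: visit the sub-classes of R in post-order and ask
# the LLM oracle for individuals of each one (all names here are illustrative).
SUBCLASSES = {"Thing": ["Animal"], "Animal": ["Cat", "Dog"], "Cat": [], "Dog": []}

def ask_llm(prompt: str) -> list[str]:
    """Placeholder for the LLM oracle; canned answers stand in for real responses."""
    return ["Persian", "Siamese"] if "Cat" in prompt else []

def populate(root: str, individuals: dict[str, set[str]]) -> None:
    for child in SUBCLASSES.get(root, []):
        populate(child, individuals)                    # post-order DFS: children first
    names = ask_llm(f"Give me examples of {root}")      # population template
    individuals.setdefault(root, set()).update(names)

inds: dict[str, set[str]] = {}
populate("Thing", inds)       # e.g. inds["Cat"] == {"Persian", "Siamese"}
```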
Focus on some property $\mathsf{p} \in \mathcal{P}$
Let $D$ (resp. $R$) be the domain (resp. range) of $\mathsf{p}$
For each individual $\mathtt{i}$ in $D$:
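Analogously, the relation phase could be sketched as follows (the `ask_llm` oracle, property name, and individuals are illustrative assumptions):

```python
# Relation phase sketch: for each individual in the domain D of property p,
# ask the LLM which individuals it is related to, and record role assertions.
def ask_llm(prompt: str) -> list[str]:
    """Placeholder for the LLM oracle; canned answers stand in for real responses."""
    return ["milk", "fish"] if "eats" in prompt else []

def relate(p: str, domain_individuals: set[str],
           relations: set[tuple[str, str, str]]) -> None:
    for i in domain_individuals:
        objects = ask_llm(f"Give me examples of {p} for {i}")  # relation template
        for o in objects:
            relations.add((i, p, o))                           # role assertion p(i, o)

rels: set[tuple[str, str, str]] = set()
relate("eats", {"Persian", "Siamese"}, rels)   # e.g. ("Persian", "eats", "milk") in rels
```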
Focus on some class $R \in \mathcal{C}$ (most commonly Thing)
Let $\mathcal{S}$ be the set of all direct sub-classes of $R$
For each individual $\mathtt{i}$ in $R$:
Repeat for all classes in $\mathcal{S}$
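A sketch of this redistribution phase under the same assumptions (hypothetical `ask_llm` oracle returning the chosen concept name, toy sub-class table):

```python
# Redistribution phase sketch: for each individual currently under R, ask the LLM
# which direct sub-class fits best, reassign it, then repeat on the sub-classes.
SUBCLASSES = {"Food": ["Fruit", "Vegetable"], "Fruit": [], "Vegetable": []}

def ask_llm(prompt: str) -> str:
    """Placeholder for the LLM oracle; a canned answer stands in for the real response."""
    return "Fruit"

def redistribute(r: str, individuals: dict[str, set[str]]) -> None:
    subs = SUBCLASSES.get(r, [])
    if not subs:
        return
    for i in list(individuals.get(r, ())):
        best = ask_llm(f"What is the best concept for {i} among {', '.join(subs)}?")
        if best in subs:                                 # reassign i to the chosen sub-class
            individuals[r].discard(i)
            individuals.setdefault(best, set()).add(i)
    for c in subs:                                       # repeat for all classes in S
        redistribute(c, individuals)

inds = {"Food": {"apple", "carrot"}}
redistribute("Food", inds)    # with the canned oracle, both end up under "Fruit"
```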
Focus on some class $R \in \mathcal{C}$ (most commonly Thing)
For each sub-class $C$ of $R$:
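And a sketch of the merge phase, assuming the `ask_llm` oracle answers the yes/no merge template with a boolean:

```python
# Merge phase sketch: for each sub-class C of R, ask the LLM whether pairs of its
# individuals denote the same entity and, if so, keep only one representative.
from itertools import combinations

def ask_llm(prompt: str) -> bool:
    """Placeholder for the LLM oracle; a canned answer stands in for the real response."""
    return "aubergine" in prompt and "eggplant" in prompt

def merge(c: str, individuals: dict[str, set[str]]) -> None:
    for a, b in combinations(sorted(individuals.get(c, ())), 2):
        if b not in individuals[c]:
            continue                                     # b was already merged away
        if ask_llm(f"Are {a} and {b} the same {c}?"):    # merge template
            individuals[c].discard(b)                    # keep a, drop the duplicate b

inds = {"Vegetable": {"aubergine", "eggplant", "carrot"}}
merge("Vegetable", inds)      # with the canned oracle, "eggplant" is merged into "aubergine"
```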
Problem: queries may produce verbose and arbitrarily-structured responses, which are costly and hard to parse
"can you generate examples of cats?"
Certainly! While I can't physically generate images, I can certainly describe or provide information about different
types of cats.
Here are a few examples:
1. Domestic Shorthair:
Description: A common and popular cat breed known for its short, sleek coat.
Characteristics: Versatile in colors and patterns, friendly, and adaptable.
2. Siamese Cat:
Description: Elegant and slender cat with distinctive color points.
Characteristics: Vocal, social, and known for their striking blue almond-shaped eyes.
...
10. Russian Blue:
Description: Short-haired cat with a bluish-gray coat and striking green eyes.
Characteristics: Reserved but affectionate, known for its plush double coat.
Solution: queries should explicitly ask for conciseness and structure, so as to spend less and to simplify parsing of the response
"list of cat races, names only"
Certainly, here's a list of cat breeds with names only:
1. Persian
2. Siamese
...
10. Domestic Shorthair
Problem: the query may be clear for the user, but ambiguous for the LLM,
which may lead to out-of-context responses
"give me a list of examples of cats"
Certainly! Here are some examples of cats:
1. Cat Stevens
2. Catwoman
...
10. Cat in the Hat
Solution: provide the context explicitly in the query
Beware! Two contexts are always in place:
- the domain of the ontology
- the ontological jargon itself
"you're a zoologist creating and ontology about animals species."
"give me a list of examples of cats"
Certainly! Here are some examples of cat breeds:
1. Persian
2. Siamese
...
10. Domestic Shorthair
Problem: responses contain way more information than needed, in unstructured form
1. Persian, 2. Siamese, 3. Maine Coon, 4. Bengal, 5. Caracal, 6. Sphinx, …, 10. Domestic Shorthair
Solution: parse the response to extract the relevant information
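For instance, a numbered-list response could be parsed with a simple regular expression (our own illustration, not necessarily the parser actually used in the experiments):

```python
import re

# Extract just the item names from a numbered-list response such as
# "1. Persian\n2. Siamese\n...\n10. Domestic Shorthair".
ITEM = re.compile(r"^\s*\d+\.\s*(.+?)\s*$", re.MULTILINE)

def parse_items(response: str) -> list[str]:
    return ITEM.findall(response)

print(parse_items("1. Persian\n2. Siamese\n3. Maine Coon"))
# -> ['Persian', 'Siamese', 'Maine Coon']
```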
Problem: cost model is most commonly proportional to consumed and produced tokens (words)
Solution: ask for conciseness + limit responses' lengths (via a max_tokens-like parameter) + exploit caching
Problem: most LLM services apply rate limitations on a per-token or per-request basis
Solution: apply an exponential back-off retry strategy, with a limit on the number of retries
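A minimal sketch of such a strategy, assuming a placeholder `call_llm` client and a `RateLimitError` raised by the service when throttling (both names are illustrative):

```python
import random
import time

# Exponential back-off around a rate-limited LLM call; call_llm and
# RateLimitError are placeholders for whatever client and exception are in use.
class RateLimitError(Exception):
    pass

def call_llm(prompt: str) -> str:
    raise NotImplementedError  # replace with the actual client call

def call_with_backoff(prompt: str, max_retries: int = 5, base_delay: float = 1.0) -> str:
    for attempt in range(max_retries):
        try:
            return call_llm(prompt)
        except RateLimitError:
            if attempt == max_retries - 1:
                raise                                    # give up after the retry limit
            delay = base_delay * 2 ** attempt            # 1s, 2s, 4s, 8s, ...
            time.sleep(delay + random.uniform(0, 1))     # plus jitter
```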
Experiments tailored to the nutritional domain
Reference ontology (built for the purpose):
Misplacement error ($E_{mis}$): the individual “belongs” to the ontology, but it is assigned to the wrong class
Incorrect individual error ($E_{ii}$): the individual makes no sense in the ontology, yet it has a meaningful name
Meaningless individual error ($E_{mi}$): the individual makes no sense at all
Class-like individual ($E_{ci}$): the individual has a name which is very similar to that of a concept in the ontology
Duplicate individuals ($E_{di}$): the individual is a semantic duplicate of another one in the ontology
Wrong relation ($E_{wr}$): the relation connecting two individuals is semantically wrong
~21.6% … $RIE$ ~8.6% … $RIE$ 25% (except for Mixtral)
Experiment data: https://github.com/Chistera4-Expectation/knowledge-graphs
experiments/food-SERVICE-MODEL-DATE-HOUR-ID
Each commit of each experiment branch represents a consistent update to the ontology: