The Roles of Lexicons and Lexicography in Natural Language Processing

Workshop on June 3, 2023, affiliated with The 24th Biennial Dictionary Society of North America Conference, Boulder, May 31-June 3, 2023



Description

Many Natural Language Processing systems rely on producing predicate argument structures as canonical representations of sentences that abstract away from syntactic variation. Computational lexicons in the form of valency lexicons that can support the formation of such predicate argument structures therefore play an enduring role in Natural Language Processing (NLP) systems, as key enablers of such abstractions. The PropBank frame files (or their FrameNet and VerbNet cousins) provide the foundation for Semantic Role Labeling systems as well as English Abstract Meaning Representation (AMR) parsers.
Pre-existing language-specific Frame Files, or other types of valency lexicons, simplify the task of spinning up AMR annotation projects for new languages such as Chinese and Arabic. However, there are many unresolved fundamental questions having to do with the creation of such lexicons, and how to draw the boundaries between distinct senses and distinct word forms.
This workshop therefore solicits submissions that review existing computational valency lexicons that are widely used in NLP, the contributions they make and the lexicographic techniques used in their development. We are especially interested in papers that address the complications that arise with polysynthetic languages and their incorporation of certain semantic concepts such as aspect or enablement as prefixes rather than as distinct lexical items such as verbs, as is preferred by analytic languages.
We also welcome submissions that address questions about the boundaries of compounds and multi-word expressions, and criteria for determining when they should have distinct entries.
Any papers that discuss issues that arise when mapping between valency lexicons in multiple languages, such as differences in argument structure, are also especially relevant.



Schedule
Time | Event | Speaker
9:00-9:15 | Beyond Valence in the Lexicon: The Role of Deeper Semantic Representations across all Lexical Categories | James Pustejovsky (virtual)
9:15-9:40 | Enriching valency lexicons with constructional analyses: between verbal semantics and argument structure constructions | Pavlina Kalm and William Croft
9:40-10:05 | The Problem of Polysynthesis in UMR Annotations: Complexities in Handling Preverbal Modification and Noun Incorporation in Arapaho | Andrew Cowell and Julia Bonn
10:05-10:30 | UMR annotation of Chinese verb compounds and related constructions | Nianwen Xue
10:30-11:00 | Break
11:00-11:25 | Work with what we have: Bootstrapping from lexical resources for low-resource languages to AMR/UMR annotation | Alexis Palmer and Matt Buchholz
11:25-11:50 | Linking an Event-type Ontology to Morphosyntax of the Predicate-Argument Structure | Jan Hajič, Zdeňka Urešová, Eva Fučíková and Cristina Fernández Alcaina
11:50-12:15 | Comparing English and Russian PropBank annotation | Skatje Myers, Martha Palmer and Ishan Jindal
12:15-12:30 | Discussion


Abstracts
Enriching valency lexicons with constructional analyses: between verbal semantics and argument structure constructions
Pavlina Kalm and William Croft

One of the main objectives of valency lexicons is to identify argument structures evoked by predicates. The analysis used in these resources thus heavily rests on verbal semantics. However, it has been argued that lexically based representations of event structure lack an important layer of semantic information: a semantic analysis of argument structure constructions (or “constructions”) in which verbs occur (Goldberg 1995, Kalm et al. 2019). Importantly, constructions carry meanings that are independent of particular verbs. In a standard dictionary, this type of semantic analysis is not immediately relevant to a description of lexical items as it is more pertinent to grammar; however, it is essential to the description of event structure, which is addressed in valency lexicons.
In many instances, the verbal semantics and the semantics of a construction in which it is used ‘align’, such as when a caused motion verb, e.g., put, occurs in a caused motion construction [SBJ V OBJ Loc-PP], as in He put the book on the shelf. The verb evokes an event in which an agent causes a theme to change location. Similarly, the schematic description of the construction describes an externally initiated event of motion in which a theme moves to a different location. However, the caused motion construction is not restricted to motion verbs. Verbs from other verb classes that do not evoke motion events may occur in this construction. For example, a verb of contact by impact such as hit can describe a motion event if it occurs with a locative phrase: He hit the baseball into the crowd. The verb hit in this example evokes the same predicate argument structure as hit used in a forceful contact event, such as He hit the dog; however, the construals of the verb are different.
Existing valency lexicons infer the meaning of an utterance by pairing thematic roles evoked by a verb with syntactic arguments. Such an analysis deals with the semantics of constructions only indirectly; constructions are not dealt with independently of verb meaning. As such, occurrences of verbs in syntactic contexts that describe different types of events than the verb itself pose a challenge. To deal with this issue, we propose that each verb entry in lexicons be supplemented by a semantic analysis of constructions in which the verb occurs. A representation in which verbal and constructional semantics are analyzed separately will enrich the description of event structure.
Our presentation touches on some of the practical issues having to do with generating mappings between constructional and verbal semantics, such as ways to identify which participant roles map to which grammatical roles in a construction and notation strategies for linking roles in these two types of representations. We also bring up challenges related to constructional analyses, such as defining which participants are essential to the description of constructional semantics.
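
As a rough illustration of the proposal in the abstract above, the following minimal Python sketch keeps a verb's own roles separate from the roles of a construction and links the two explicitly. It is not the authors' notation: the class names, the role labels (agent, patient, initiator, theme, location) and the helper unlinked_roles are all hypothetical, chosen only to mirror the hit example.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Construction:
    """An argument structure construction, described independently of any verb."""
    name: str
    form: List[str]        # syntactic slots, e.g. SBJ V OBJ Loc-PP
    roles: Dict[str, str]  # syntactic slot -> constructional role (hypothetical labels)


@dataclass
class VerbEntry:
    """A lexicon entry whose verb roles are linked to constructional roles."""
    lemma: str
    verb_roles: List[str]  # roles evoked by the verb itself (hypothetical labels)
    # construction name -> {verb role -> constructional role}
    linkings: Dict[str, Dict[str, str]] = field(default_factory=dict)


# Hypothetical encoding of the caused motion construction [SBJ V OBJ Loc-PP].
CAUSED_MOTION = Construction(
    name="caused-motion",
    form=["SBJ", "V", "OBJ", "Loc-PP"],
    roles={"SBJ": "initiator", "OBJ": "theme", "Loc-PP": "location"},
)

# Hypothetical entry for "hit", a contact verb used in the construction.
HIT = VerbEntry(
    lemma="hit",
    verb_roles=["agent", "patient"],
    linkings={"caused-motion": {"agent": "initiator", "patient": "theme"}},
)


def unlinked_roles(entry: VerbEntry, cxn: Construction) -> List[str]:
    """Return constructional roles that no verb role is linked to.

    These are the participants contributed by the construction alone.
    """
    linked = set(entry.linkings.get(cxn.name, {}).values())
    return [role for role in cxn.roles.values() if role not in linked]


print(unlinked_roles(HIT, CAUSED_MOTION))
```

Under these assumptions, the location role of "He hit the baseball into the crowd" is reported as coming from the construction rather than from the verb, which is the kind of information a verb-only entry would miss.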

The Problem of Polysynthesis in UMR Annotations: Complexities in Handling Preverbal Modification and Noun Incorporation in Arapaho
Andrew Cowell and Julia Bonn

Arapaho, like many agglutinating polysynthetic languages, poses problems for lexicography, in that many "words" are partially made up of lexicalized elements, and are partially syntactic in nature. Arapaho in particular has many prefixes which cross-linguistically would be treated as auxiliary verbs, with meanings such as 'want to...', 'like to...', 'able to...', and 'fail to...'. It also features verb-stem elements which can add vector elements to otherwise non-path verbs, such as nouut-ohwoo- 'dance to the outside of a location', and, like classic polysynthetic languages, it includes extensive noun incorporation in verb stems. While these issues have been widely recognized by linguists and lexicographers, they pose a special new challenge in the context of attempts to design Uniform Meaning Representations, which seek to aid in cross-linguistic semantic and syntactic comparison and automated translation. In particular, the mismatch between verb stems and overall verb words can lead to loss of semantic content when UMR annotations are done, or conversely, attempts to maintain full content can lead to the creation of "lexicons" which contain thousands of additional words, beyond the stem level, whose features and meanings are highly predictable from the perspective of morpho-syntax, and which do not really need full semantic representations as single "words."

UMR annotation of Chinese verb compounds and related constructions
Nianwen Xue

In this talk, we will examine the challenges of annotating the predicate-argument structure of Chinese verb compounds in Uniform Meaning Representation (UMR), a recent meaning representation framework that extends Abstract Meaning Representation (AMR) to cross-linguistic settings. The key issue is to decide whether to annotate the argument structure of a verb compound as a whole, or to annotate the argument structure of its component verbs as well as the relations between them. We will examine different types of Chinese verb compounds, and propose how to annotate them based on the principle of compositionality, level of grammaticalization, and productivity of component verbs.

Comparing English and Russian PropBank annotation
Skatje Myers, Martha Palmer and Ishan Jindal

Although semantic role labeling (SRL) is useful in many downstream NLP tasks, these semantic representations are not available in many languages. We present our work on leveraging parallel sentences in which the source language has automatic or manual SRL annotation in order to transfer annotations into the target language. The Universal PropBanks (UPB) 2.0 system uses unsupervised word alignments, filtering heuristics, and bootstrapping to automatically project English SRL annotations to 23 languages. Since this approach uses English-based semantic representations to create annotations in other languages, we provide a case study comparing English PropBank representations with the recently developed Russian PropBank. We evaluate UPB 2.0 by projecting English annotations onto sentences that have been annotated with Russian PropBank, and we develop language-agnostic and language-specific filters to improve the results.
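
The general idea of projecting annotations through word alignments can be sketched in a few lines of Python. This is a toy illustration only, not the UPB 2.0 system: the function names, the token-level label scheme, the example sentence pair, and the single filtering heuristic are all assumptions made for the sake of the example.

```python
from typing import List, Optional, Tuple


def project_roles(
    src_labels: List[Optional[str]],
    alignments: List[Tuple[int, int]],
    tgt_len: int,
) -> List[Optional[str]]:
    """Copy each source token's label to the target tokens aligned with it.

    src_labels[i] is the label of source token i (None if unlabeled);
    alignments holds (source index, target index) pairs.
    """
    tgt_labels: List[Optional[str]] = [None] * tgt_len
    for src_i, tgt_j in alignments:
        label = src_labels[src_i]
        # Keep the first label that reaches a target token.
        if label is not None and tgt_labels[tgt_j] is None:
            tgt_labels[tgt_j] = label
    return tgt_labels


def keep_projection(tgt_labels: List[Optional[str]]) -> bool:
    """Toy filter (an assumption): keep a projection only if the
    predicate landed on exactly one target token."""
    return tgt_labels.count("PRED") == 1


# Made-up example pair: "She read the book" / "Она прочитала книгу".
SRC_LABELS = ["ARG0", "PRED", None, "ARG1"]
ALIGNMENTS = [(0, 0), (1, 1), (3, 2)]  # hypothetical word alignments

projected = project_roles(SRC_LABELS, ALIGNMENTS, tgt_len=3)
print(projected, keep_projection(projected))
```

A real system would operate on argument spans rather than single tokens and would need to handle unaligned or many-to-many alignments; the filter here only hints at where language-agnostic or language-specific checks could be applied.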

Linking an Event-type Ontology to Morphosyntax of the Predicate-Argument Structure
Jan Hajič, Zdeňka Urešová, Eva Fučíková and Cristina Fernández Alcaina

In the SynSemClass project, we are building a multilingual ontology for text annotation at the semantic (meaning) level. The SynSemClass ontology consists of classes that represent one event type each, defined across multiple languages. Each class contains a set of words (synonymous verbs) that express that event type, as described by its definition. Each verb is linked both to its syntactic properties and to similar semantic lexicons, such as FrameNet, VerbNet, EngVallex, PropBank, OntoNotes, and WordNet for English, and similar lexicons for German, Czech and Spanish (the languages currently represented in SynSemClass). Depending on the resource being referred to, the chain of links maps the particular class member to its sense as defined in that other resource. In addition, the semantic roles associated with the class are linked to the arguments (rolesets, valency slots) as defined by the external resource, which in some cases also contains morphosyntactic information relevant to each of the arguments. Such a mapping makes it possible to extract the correct case, adposition, word order precedence and/or negation, as well as other properties important for the corresponding surface form. In the talk, we will present the system of linked data as present in SynSemClass, together with examples taken from the languages covered.
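
The chain of links described in the abstract above can be pictured with a small Python sketch. It does not reflect the actual SynSemClass data format: every class name, field name, identifier and value below is a placeholder invented for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class ArgumentSlot:
    """An argument (valency slot) as defined in some external lexicon."""
    slot_id: str
    morphosyntax: Dict[str, str] = field(default_factory=dict)


@dataclass
class ClassMember:
    """A verb belonging to an event-type class."""
    lemma: str
    language: str
    external_senses: Dict[str, str]        # resource name -> sense id
    role_to_slot: Dict[str, ArgumentSlot]  # class role -> external argument slot


@dataclass
class EventClass:
    """One event type, with its roles and its member verbs."""
    class_id: str
    roles: List[str]
    members: List[ClassMember]


def surface_properties(member: ClassMember, role: str) -> Optional[Dict[str, str]]:
    """Follow role -> slot -> morphosyntax; None if the role is not linked."""
    slot = member.role_to_slot.get(role)
    return slot.morphosyntax if slot is not None else None


# Placeholder data only; none of these identifiers come from a real resource.
MEMBER = ClassMember(
    lemma="example-verb",
    language="xx",
    external_senses={"lexicon-A": "sense-1"},
    role_to_slot={
        "Role_A": ArgumentSlot("slot-1", {"case": "nominative"}),
        "Role_B": ArgumentSlot("slot-2", {"case": "accusative"}),
    },
)
EXAMPLE_CLASS = EventClass("example-class-1", ["Role_A", "Role_B"], [MEMBER])

print(surface_properties(MEMBER, "Role_B"))
```

The point of the sketch is only that a semantic role of the class can be traced, through the linked argument slot, to properties such as case that matter for the surface form.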

Work with what we have: Bootstrapping from lexical resources for low-resource languages to AMR/UMR annotation
Alexis Palmer and Matt Buchholz

Uniform Meaning Representation (UMR) is a framework for annotating the semantic content of a text. While UMR is designed to be cross-linguistically applicable, its annotation depends on valency lexicons and other materials which may not exist for low-resource languages. Using Arapaho as a case study, we demonstrate how support for UMR annotation can be developed automatically using existing data for low-resource languages, including lexicons and interlinear glosses.



Bios

Julia Bonn is a PhD student in Linguistics and Cognitive Science following more than a decade as a Senior Research Assistant at CLEAR. She is a long-term contributor to PropBank and the PropBank Lexicon, VerbNet, AMR, and UMR, as well as the developer of SpatialAMR, an extension to AMR annotation for fine-grained, multimodal annotation of spatially-rich corpora. Her research interests center on bringing multimodality and pragmatics into semantic annotation (gesture, embodiment, frame of reference, implicit argumentation), lexical resource development, and expanding semantic annotation to typologically diverse languages (Arapaho and Quechua UMRs).

Matt Buchholz is a master’s student in the Computational Linguistics, Analytics, Search and Informatics (CLASIC) program at the University of Colorado at Boulder. His research interests include information extraction, distributional lexical semantics, and the computational study of hip-hop lyrics.

Andrew Cowell is Professor of Linguistics at the University of Colorado and Director of the Center for Native American and Indigenous Studies. His work focuses on linguistic anthropology, language documentation, and language maintenance and revitalization. He has created textual databases of several languages (Arapaho, Southern Sierra Miwok, Central Sierra Miwok) as well as an extensive lexical database of Arapaho, and another of Aaniiih (Gros Ventre). His recent work has focused on integrating computational and corpus methods with these databases of endangered indigenous languages, to make them more accessible to a variety of users.

William Croft is Professor Emeritus at the University of New Mexico. Croft received his PhD at Stanford University under Joseph Greenberg. He has taught at the Universities of Michigan, Manchester and New Mexico, visited the Max Planck Institutes of Psycholinguistics (Nijmegen) and Evolutionary Anthropology (Leipzig), and the Center for Advanced Study in the Behavioral Sciences (Stanford), and given lectures throughout the world. He is the author of dozens of articles and nine books, including Typology and Universals, Radical Construction Grammar, Explaining Language Change, Cognitive Linguistics (with Alan Cruse), Verbs, Ten Lectures on Construction Grammar and Typology and Morphosyntax.

Jan Hajič is a professor of Computational Linguistics and deputy director of the Institute of Formal and Applied Linguistics at the School of Computer Science, Charles University, Prague, Czech Republic. His field is NLP and building language resources with rich linguistic annotation, such as the Prague Dependency Treebank. He is currently also the director of a large, multi-institutional research infrastructure on language resources in the Czech Republic, LINDAT/CLARIAH-CZ. His work experience includes both industrial research (IBM) and academia (also in the U.S. and Norway). He has published more than 200 conference and journal papers, a book, book chapters, and encyclopedia and handbook entries. He is the chair of the Executive Board of META-NET, a European research network in Language Technology, chair of the TACL Steering Committee, and a member of other international boards and committees. He has been the PI or Co-PI of several international NLP projects in the U.S. and EU.

Ishan Jindal is a staff Research Scientist at IBM and an expert on applying deep learning techniques to multilingual shallow semantic parsing for enterprise use cases. He has published more than 15 peer-reviewed publications that include papers at top NLP conferences such as EMNLP, NAACL, EACL, ICASSP, and LREC, as well as a book chapter (Google Scholar citations: 470, h-index 8).

Pavlina Kalm completed her PhD in Linguistics at the University of New Mexico in August 2022. Her dissertation, titled Social verbs: a force-dynamic analysis, proposes a semantic analysis of verbs that describe various types of social interactions between people. Kalm was the primary developer of the force-dynamic representations of event structure associated with VerbNet. Kalm has also published research that deals with event structure representation and the intersection of verbal semantics and the semantics of argument structure constructions. Her other research interests include typology and syntax. Kalm was a recipient of the Bilinski Fellowship awarded to outstanding dissertation fellows in 2022.

James H. Martin is a Professor of Computer Science and Cognitive Science at the University of Colorado Boulder where he is the co-director of the Center for Computational Language and Education Research (CLEAR). His research interests are in computational semantics and its applications to grand challenge problems in education and medicine. He is the co-author of Speech and Language Processing, the premier textbook in the field.

Skatje Myers is a PhD candidate in Computer Science and Cognitive Science at the University of Colorado at Boulder. Her interests include expanding semantic NLP to new domains and languages and exploring ways of using techniques such as active learning and annotation projection to make more efficient use of human annotation.

Alexis Palmer is an Assistant Professor of Linguistics at the University of Colorado, with a specialization in computational linguistics. Her research interests include computational semantics and discourse, as well as the development and application of computational methods for endangered and other low-resource languages. She is a co-organizer of the ComputEL workshop series on computational methods for the study of endangered languages and a co-founder of the new SIGEL special interest group. She has also served as a co-organizer for the SEMEVAL workshop and is currently program co-chair for the *SEM 2023 conference.

Martha Palmer is the past Helen & Hubert Croft Professor of Engineering in the Computer Science Department, and Arts & Sciences Professor of Distinction for Linguistics, at the University of Colorado, with over 300 peer-reviewed publications. She is a co-Director of CLEAR, an Association for Computational Linguistics Fellow, an Association for the Advancement of Artificial Intelligence Fellow, and a co-editor of LiLT: Linguistic Issues in Language Technology. She has previously served as co-editor of the Journal of Natural Language Engineering, a member of the editorial board of Computational Linguistics and of TACL, President of ACL, Chair of SIGLEX, and Founding Chair of SIGHAN.

James Pustejovsky is the TJX Feldberg Chair in Computer Science at Brandeis University. He works on computational and lexical semantics, is creator of Generative Lexicon Theory, and has worked to develop lexical semantic resources for the NLP community. He is chief architect for TimeML and ISO-TimeML, a recently adopted ISO standard for temporal information in language; the recently adopted standard ISO-Space, a specification for spatial information in language; and the co-creator (with N. Krishnaswamy) of the VoxML multimodal modeling framework, encoding communication between humans and computers or robots utilizing speech, gesture, gaze, and action.

Zdeňka Urešová is a senior researcher at the Institute of Formal and Applied Linguistics at the School of Computer Science, Charles University, Prague, Czech Republic. Her interests are syntactic and semantic annotation of texts (Czech, English) and computational lexicons for text annotation, at both syntactic and semantic levels. She received her Ph.D. in Computational Linguistics in 2012. Since then she has been the PI of, or a participant in, numerous national and international projects in the EU. She is the author of two books on valency and has over 70 other publications.

Nianwen Xue is a Professor in the Computer Science Department and the Language & Linguistics Program at Brandeis University. His core research interests include developing linguistic corpora annotated with syntactic, semantic, and discourse structures, as well as machine learning approaches to syntactic, semantic and discourse parsing. He is an action editor for Computational Linguistics and currently serves on the editorial board of Language Resources and Evaluation (LRE). He also served as the editor-in-chief of the ACM Transactions on Asian and Low-Resource Language Information Processing (TALLIP) from 2016 to 2019. He is currently the Vice Chair/Chair-Elect of SIGHAN, an ACL special interest group on Chinese language processing.