Computational linguistics (CL), also called linguistic data processing, examines how natural language, in the form of text or speech data, can be processed algorithmically by computer. It lies at the interface between linguistics and computer science. In English-language literature and in computer science, the term natural language processing (NLP) is common.
As a term, computational linguistics dates back to the 1960s. With the beginnings of artificial intelligence, the task already suggested itself. In the 1970s, publications with computational linguistics in the title appeared with increasing frequency. There had already been costly experiments with exegetic applications (concordances, word and form statistics), as well as larger projects on machine language analysis and translation. In contrast to Internet linguistics, which examines in particular human language behaviour and the language forms it induces on and via the Internet, computational linguistics has developed a more practically computer-oriented profile. However, the field did not completely abandon the classical philosophical-linguistic questions and is today divided into theoretical and practical computational linguistics.
Computers see language either as sound information (if the language is acoustic) or as character strings (if the language is in written form). To analyze language, one works step by step from this input representation towards meaning, passing through different levels of linguistic representation. Because practical systems typically perform these steps sequentially, this is referred to as the pipeline model:
- Speech recognition: If the text is available as sound information, it must first be converted into text form.
- Tokenization: The character string is segmented into words, sentences, etc.
- Morphological analysis: Inflected forms such as person endings or case markers are analyzed to extract grammatical information and to reduce the words in the text to base forms as they appear, e.g., in the lexicon.
- Syntactic analysis: The words of each sentence are analyzed for their structural function in the sentence (e.g. subject, object, modifier, article, etc.).
- Semantic analysis: Meaning is assigned to the sentences or their parts. This step potentially involves a variety of different individual steps, as meaning is elusive.
- Dialogue and discourse analysis: The relationships between consecutive sentences are recognized. In dialogue, this could be a question and answer, in the discourse a statement and its justification or its limitation.
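The first pipeline stages after speech recognition can be sketched in a few lines. The following is a minimal, illustrative sketch: the tiny `LEXICON` and its entries are hypothetical stand-ins for a real morphological analyzer, not an actual NLP library.

```python
import re

# Hypothetical toy lexicon mapping inflected forms to (base form, features).
# Real systems use full morphological analyzers instead of a hand-built table.
LEXICON = {
    "cats": ("cat", {"number": "plural"}),
    "ran": ("run", {"tense": "past"}),
    "the": ("the", {}),
}

def tokenize(text):
    """Tokenization: segment the character string into word and punctuation tokens."""
    return re.findall(r"\w+|[^\w\s]", text.lower())

def analyze_morphology(tokens):
    """Morphological analysis: map each token to its base form and grammatical features."""
    return [LEXICON.get(t, (t, {})) for t in tokens]

tokens = tokenize("The cats ran.")
print(tokens)                      # ['the', 'cats', 'ran', '.']
print(analyze_morphology(tokens))
```

Syntactic and semantic analysis would then operate on these (base form, features) pairs; each stage consumes the output representation of the previous one.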
However, not all techniques in computational linguistics pass through this entire chain. The increasing use of machine learning has led to the realization that statistical regularities exist at each level of analysis and can be used to model linguistic phenomena. For example, many current models of machine translation use syntax only to a limited extent and semantics virtually not at all; instead, they restrict themselves to exploiting correspondence patterns at the word level.
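Such word-level correspondence patterns can be illustrated with a toy example: counting how often source and target words co-occur in aligned sentence pairs. The three-sentence German-English corpus below is invented for illustration; real statistical translation models estimate alignment probabilities from millions of sentence pairs.

```python
from collections import Counter

# Hypothetical toy parallel corpus of aligned sentence pairs.
corpus = [
    ("das haus", "the house"),
    ("das buch", "the book"),
    ("ein buch", "a book"),
]

# Count co-occurrences of source and target words across sentence pairs.
cooc = Counter()
for src, tgt in corpus:
    for s in src.split():
        for t in tgt.split():
            cooc[(s, t)] += 1

def best_translation(word):
    """Crude word-level translation: pick the most frequent co-occurring target word."""
    candidates = {t: n for (s, t), n in cooc.items() if s == word}
    return max(candidates, key=candidates.get)

print(best_translation("buch"))  # 'book'
```

No syntactic or semantic analysis is involved: the correspondence "buch" → "book" emerges purely from the co-occurrence counts.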
At the other end of the scale are approaches that work according to the principle semantics first, syntax second. Thus, cognitively oriented language processing is based on a semantically oriented computational lexicon built around an essentially language-independent semantic core with language-specific morphosyntactic additions. During parsing, this lexicon is used by a word-class-controlled analysis to generate semantic structures directly.
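The structure of such a lexicon entry can be sketched as follows. The entry format, field names, and the `semantic_frame` helper are all hypothetical illustrations of the idea of separating a language-independent semantic core from language-specific morphosyntax; they do not reproduce any particular system.

```python
# Hypothetical lexicon entry: the semantic core is language-independent,
# while the morphosyntactic part is specific to one language (here English).
lexicon = {
    "give": {
        "semantic_core": {
            "type": "transfer",
            "roles": ["agent", "theme", "recipient"],
        },
        "morphosyntax": {
            "word_class": "verb",
            "forms": {"past": "gave", "participle": "given"},
        },
    },
}

def semantic_frame(word):
    """Word-class-controlled analysis: a verb entry directly yields an
    (initially unfilled) semantic structure, bypassing a separate syntax stage."""
    entry = lexicon.get(word)
    if entry and entry["morphosyntax"]["word_class"] == "verb":
        return {role: None for role in entry["semantic_core"]["roles"]}
    return None

print(semantic_frame("give"))  # {'agent': None, 'theme': None, 'recipient': None}
```

Parsing then amounts to filling the open roles of this frame with the semantic representations of the surrounding words, rather than first building a full syntactic tree.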