Overview of spaCy Conventions
Explore spaCy's conventions to understand its text processing pipeline and core components like tokens, Doc, and Vocab. Learn how spaCy handles tokenization, tagging, parsing, and entities through an efficient pipeline to simplify NLP development.
We'll cover the following...
Overview of spaCy
Every NLP application processes text in several steps. As we saw previously, we always created instances called nlp and doc. But what exactly did we do?
When we call nlp on our text, spaCy applies some processing steps. The first step is tokenization to produce a Doc object. The Doc object is then processed further with a tagger, a parser, and an entity recognizer. This way of processing the text is called a language processing pipeline. Each pipeline component returns the processed Doc and then passes it to the next component:
A spaCy pipeline object is created when we load a language model. We load an English model and initialize a pipeline in the following code segment:
What happened in the preceding code is as follows:

- We started by importing spaCy.
- In the second line, spacy.load() returned a Language class instance, nlp. The Language class is the text processing pipeline.
- After that, we applied nlp on the sample sentence I went there and got a Doc class instance, doc.
The Language class applies all of the preceding pipeline steps to our input sentence behind the scenes. After applying nlp to the sentence, the Doc object contains tokens that are tagged, lemmatized, and marked as entities if the token is an entity (we will go into detail about what those are and how it's done later). Each pipeline component has a well-defined task, as seen in the table below:
| Name | Component | Creates | Description |
| --- | --- | --- | --- |
| tokenizer | Tokenizer | Doc | Segment text into tokens. |
| tagger | Tagger | Token.tag | Assign part-of-speech tags. |
| parser | DependencyParser | Token.head, Token.dep | Assign dependency labels. |
| ner | EntityRecognizer | Doc.ents, Token.ent_iob, Token.ent_type | Detect and label named entities. |
The spaCy language processing pipeline always depends on the statistical model and its capabilities. This is why we always load a language model with spacy.load() as the first step in our code.
Each component corresponds to a spaCy class. The spaCy classes have self-explanatory names such as Language, Doc, and Vocab. We already used Language and Doc classes—let's see all of the processing pipeline classes and their duties:
Processing Pipeline
| Type | Description |
| --- | --- |
| Language | A text processing pipeline. Usually, we load this once per process as nlp and pass the instance around our application. |
| Tokenizer | Segment text and create Doc objects. |
| Lemmatizer | Determine the base forms of words. |
| Morphology | Assign linguistic features like lemmas, noun case, verb tense, etc., based on the word and its part-of-speech tag. |
| Tagger | Annotate part-of-speech tags on Doc objects. |
| DependencyParser | Annotate syntactic dependencies on Doc objects. |
| EntityRecognizer | Annotate named entities, e.g., persons or products, on Doc objects. |
| Matcher | Match sequences of tokens based on pattern rules, similar to regular expressions. |
| PhraseMatcher | Match sequences of tokens based on phrases. |
| EntityRuler | Add entity spans to the Doc using rule-based matching. |
| Sentencizer | Implement custom sentence boundary detection logic that doesn't require the dependency parse. |
We shouldn't be intimidated by the number of classes; each class has unique features to help us process our text better.
There are more data structures to represent text data and language data. Container classes such as Doc hold information about sentences, words, and text. There are also container classes other than Doc:
Container objects
| Name | Description |
| --- | --- |
| Doc | A container for accessing linguistic annotations. |
| Span | A slice from a Doc object. |
| Token | An individual token, i.e., a word, punctuation symbol, whitespace, etc. |
| Lexeme | An entry in the vocabulary. It's a word type with no context, as opposed to a word token. It therefore has no part-of-speech tag, dependency parse, etc. |
Finally, spaCy provides helper classes for vectors, language vocabulary, and annotations. We'll see the Vocab class often in this course. Vocab represents a language's vocabulary. Vocab contains all the words of the language model we loaded:
Other classes
| Name | Description |
| --- | --- |
| Vocab | A lookup table for the vocabulary that allows us to access Lexeme objects. |
| StringStore | Map strings to and from hash values. |
| Vectors | Container class for vector data keyed by string. |
| GoldParse | Collection for training annotations. |
| GoldCorpus | An annotated corpus using the JSON file format. Manages annotations for tagging, dependency parsing, and NER. |
The spaCy library's backbone data structures are Doc and Vocab. The Doc object abstracts the text by owning the sequence of tokens and all their properties. The Vocab object provides a centralized set of strings and lexical attributes to all the other classes. This way, spaCy avoids storing multiple copies of linguistic data:
We can divide the objects composing the preceding spaCy architecture into two: containers and processing pipeline components. We'll first learn about two basic components, Tokenizer and Lemmatizer, and then we'll explore Container objects further.
spaCy does all these operations for us behind the scenes, allowing us to concentrate on our own application's development. With this level of abstraction, it's no coincidence that spaCy is a popular choice for NLP application development. Let's start with the Tokenizer class and see what it offers us; then, we will explore all the container classes one by one.