spaCy Fundamentals: A Beginner’s Guide to Raw NLP Basics

Pranav Manoj
6 min read · Oct 22, 2020

What’s spaCy?

Glad you asked. spaCy is one of the most advanced natural language processing libraries you can use with Python right now, competing with the likes of NLTK. It gives you fine-grained control over its text processing pipeline when you want it, and sensible defaults when you don’t. Simple.

What we’re going to be covering and how

In this particular piece I plan on covering the absolute raw basics. We’ll get to the meat of it in another article. But for now, let’s get our basics right. We’ll be looking into lemmatization, stop words, matching words or phrases and my personal favorite, visualizing named entities along with some extras.

I believe in a practical approach to learning and so I’ll explain as we code. 😁

Pre-Requisites

Before we dive in, ensure you have spaCy and the ‘en_core_web_sm’ language model installed.

Run the following code before you start. I’d suggest JupyterLab as an IDE, although you can pick whatever you’re comfortable with:

import spacy

nlp = spacy.load('en_core_web_sm')

Also, note that each ‘document’ we want to process must be passed through the ‘nlp’ pipeline we just loaded above.

Note

Since spaCy does so much analysis, memory efficiency is of paramount importance. The developers solved this by storing a hash of each tag’s string value, reducing the amount of data kept in memory.

So anytime you want to see the real value instead of the hash, you will need to add an underscore to the end of the property:

.label → yields the hash

.label_ → yields the actual tag

And whenever we want a more detailed explanation of what a particular tag means:

spacy.explain('the-tag-goes-here')
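As a quick illustration of the hash/string distinction and spacy.explain, here is a minimal sketch. It uses a blank English pipeline and a manually built entity span, so no trained model download is needed:

```python
import spacy
from spacy.tokens import Span

nlp = spacy.blank('en')  # blank pipeline; no trained model required
doc = nlp('Apple makes phones')

# Manually create an entity span covering 'Apple' so we can inspect its label
ent = Span(doc, 0, 1, label='ORG')

print(ent.label)             # the hash: a large integer
print(ent.label_)            # the actual tag: 'ORG'
print(spacy.explain('ORG'))  # a human-readable gloss of the ORG label
```

The same underscore convention applies across spaCy: `pos` vs `pos_`, `tag` vs `tag_`, `lemma` vs `lemma_`, and so on.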

1. Named Entities

Named entities are words and phrases categorized by taking context into account, which usually tells you more about what the word refers to. Let’s take a look at what this can mean in the real world.

document = nlp('The new Apple iPhone 12 lineup is a pure work of art, completely justifying the 699 dollar price.')

for word in document.ents:
    print(f'{word.text:{10}} | {spacy.explain(word.label_):{10}}')

Running the above returns the following:

As we can see, it figured out that Apple is a company, that the iPhone is a product, and that 699 is associated with dollar, so it registers the pair as money.

2. Visualizing Named Entities

Let’s run the following and see what happens:

from spacy import displacy

displacy.render(document, style='ent', jupyter=True)

But if you’re not using Jupyter, don’t set jupyter=True. Instead, use displacy.serve, which spins up a small web server and renders the visualization in your browser.

But what if we want to customize the visualization? Surprise, surprise! There’s room for that too!

Maybe you only want to visualize money. Come on, who doesn’t? 😅

options = {'ents': ['MONEY']}  # you can add more labels, like 'PRODUCT', to the list

And then change the code to,

from spacy import displacy

displacy.render(document, style='ent', jupyter=True, options=options)

And boom, the visualization’s far more streamlined.

Maybe you want to color code the visualization, no problem,

colors = {'ORG': 'lime'}
options = {'ents': ['ORG', 'PRODUCT'], 'colors': colors}

displacy.render(document, style='ent', jupyter=True, options=options)

You can also use ‘dep’ instead of ‘ent’ as a style. That’ll show you the syntactic dependency of the words. Put simply, it’s the relationship between different words in the sentence.

However, if you need to visualize huge corpora of text, pass each sentence into the visualizer separately; otherwise you will regret it.

for sentence in document.sents:
    displacy.render(nlp(sentence.text), style='ent', jupyter=True)

Check out the official page for more documentation.

3. Parts Of Speech tagging

You will automatically brush up on your Parts Of Speech as you write more and more code.

However, for now, here’s a super quick recap. There are traditionally eight parts of speech: nouns, pronouns, verbs, adjectives, adverbs, prepositions, conjunctions, and interjections.

The most common tags are below:

  1. pos_ → a coarse-grained part of speech tag
  2. tag_ → a fine-grained tag

You can find all the different tags in the official documentation.

Here’s an implementation:

document2 = nlp('I read up a little on the Frenkel Defect yesterday.')

for token in document2:
    print(f'{token.text:{10}} | {token.pos_:{8}} | {token.tag_:{10}} | {spacy.explain(token.tag_)}')

4. Lemmatization

Whenever you need to look up the origin of the word without the tense and the other ‘noise’, the lemma is what you want. It’s essentially the root of the word.

Let’s see how we can look up the lemma of a word in spaCy.

document3 = nlp('Playing chess regularly is the key to improving your understanding of positions')

for token in document3:
    print(f'{token.text:{14}} | {token.lemma_:{8}}')

It isn’t very accurate but it’s way more consistent than stemming in my opinion.

5. Stop Words

Let’s imagine you want to process a string and extract its core meaning. Then you definitely want to get rid of common words like ‘the’ and ‘and’. spaCy comes with that feature built in, along with a super easy way to customize the default stop words yourself.

Let’s take a look.
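Since the original screenshot isn’t reproduced here, a minimal sketch of that kind of check (a blank English pipeline is enough, since stop words ship with the language defaults):

```python
import spacy

nlp = spacy.blank('en')  # stop words come from the English language defaults
doc = nlp('the bread and the butter')

for token in doc:
    # int() turns the True/False of is_stop into 1/0
    print(f'{token.text:8} | {int(token.is_stop)}')
```

Here ‘the’ and ‘and’ are default stop words, while ‘bread’ and ‘butter’ are not.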

With 0 being False and 1 being True. Duh.

But what if we wanted to add our own stop words? Maybe I feel ‘bread’ should also be a stop word. 🤷‍♂️

nlp.Defaults.stop_words.add('bread')

nlp.vocab['bread'].is_stop = True

Now bread is 1 → True, which means it’s now registered as a stop word.

And if you decide you no longer want it to be a stop word, then you can use remove instead of add and set the is_stop value to False.
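Concretely, the removal mirrors the addition. A self-contained sketch that adds ‘bread’ and then takes it back out:

```python
import spacy

nlp = spacy.blank('en')

# Register 'bread' as a stop word...
nlp.Defaults.stop_words.add('bread')
nlp.vocab['bread'].is_stop = True

# ...and then un-register it again
nlp.Defaults.stop_words.remove('bread')
nlp.vocab['bread'].is_stop = False

print(nlp.vocab['bread'].is_stop)  # back to False
```

Note that both steps matter: the stop_words set and the lexeme’s is_stop flag are updated separately.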

6. Phrase Matching

Imagine you need to find the index of a particular word, maybe to add it as a custom entity. What then? What if it exists in several forms, like ‘back-pack’ or ‘backpack’?

We’re going to have to search for it and add each instance as an entity. Trawling through an entire text corpus to find these instances sounds like a horrible way to kill time.

Enter phrase matching. It’s sort of like hitting Ctrl + F on a page to find what you need, except it gives you the exact range of each match as well.

from spacy.matcher import PhraseMatcher

phraseMatcher = PhraseMatcher(nlp.vocab)

phrase_list = ['backpack', 'back-pack']
phrase_patterns = [nlp(phrase) for phrase in phrase_list]
phraseMatcher.add('Backpack', None, *phrase_patterns)  # in spaCy 3 this becomes phraseMatcher.add('Backpack', phrase_patterns)

document5 = nlp('I bought a new backpack for this year, but then COVID meant no school and so no back-pack.')

for entity in document5.ents:
    print(entity.text, '→', entity.label_)

matches = phraseMatcher(document5)  # a list of (match_id, start, end) tuples
matches

from spacy.tokens import Span

spacy_product = document5.vocab.strings['PRODUCT']
new_entities = [Span(document5, match[1], match[2], label=spacy_product) for match in matches]
document5.ents = list(document5.ents) + new_entities

for entity in document5.ents:
    print(entity.text, '→', entity.label_)

Hell yeah! Both instances are now ‘PRODUCT’ entities. And we didn’t have to search for each instance; spaCy did that for us.

Wrap Up

If this article helped you, don’t forget to hit the little clap icon! It costs you nothing and motivates me to do better 😃

Alright then, if you have any queries, feel free to post them in the comments and I’ll try to help out! Peace.
