Creating Supercuts with Pattern Matching
In previous versions of Videogrep I included the pattern library, which, at the time (2015), was my preferred tool for natural language processing tasks. Integrating pattern allowed Videogrep to create supercuts based on grammatical patterns.

In an attempt to simplify things, I no longer include pattern with Videogrep. And generally speaking, I’ve moved on from pattern altogether, to the more robust spaCy, an “industrial-strength” natural language processing library which is also a pleasure to use.

In this post I’ll explain how to use spaCy with Videogrep. In doing so, I’ll also go over some NLP basics, as well as how to use Videogrep as a Python module rather than a command-line application.

If you’re unfamiliar with Videogrep, see this tutorial first.
Introduction to spaCy
Note: this is an overly brief section! For more, see spaCy 101, and/or Allison Parrish’s tutorial NLP Concepts with spaCy.
At its core, natural language processing (or NLP) allows a computer to “make sense” of human language. NLP tasks might include: determining that a particular set of characters is a word or a sentence; determining that a particular word is a noun, adjective, or other part of speech; determining that a word is a “named entity” like a business, place, or person; estimating the similarity between two words or phrases. And so on!
spaCy is a wonderful open source library that allows you to perform the above tasks, and more.
Installation
To install it, open up a terminal and type:
pip3 install spacy
You’ll also need to install a language model. I’ll be using English here, but other models are also available:
python3 -m spacy download en_core_web_sm
Note: I’m using the “small” model which is fast, but less accurate.
Basic Usage
To use spaCy, just load up the library and the model, and then pass it some text to analyze:
import spacy
nlp = spacy.load("en_core_web_sm")
doc = nlp("A spectre is haunting Europe. The spectre of communism.")
for token in doc:
    print(token.text, token.pos_, token.tag_)
The output will be:
A DET DT
spectre NOUN NN
is AUX VBZ
haunting VERB VBG
Europe PROPN NNP
. PUNCT .
The DET DT
spectre NOUN NN
of ADP IN
communism NOUN NN
. PUNCT .
doc is an iterator containing spaCy Token objects. Each token is a word or piece of punctuation, annotated with attributes that give you additional info about that token’s grammatical structure. Here, we’re printing out the text of each token, along with its coarse-grained (.pos_) and fine-grained (.tag_) part of speech tag.
The possible parts of speech (coarse-grained) are:
- ADJ: adjective
- ADP: adposition
- ADV: adverb
- AUX: auxiliary
- CCONJ: coordinating conjunction
- DET: determiner
- INTJ: interjection
- NOUN: noun
- NUM: numeral
- PART: particle
- PRON: pronoun
- PROPN: proper noun
- PUNCT: punctuation
- SCONJ: subordinating conjunction
- SYM: symbol
- VERB: verb
- X: other
The fine-grained .tag_ annotation provides more specific details, such as verb form. A full list of those tags can be found here.
So, if we want to extract all the nouns from a text, we can simply select the words whose .pos_ tag is equal to "NOUN":
import spacy
nlp = spacy.load("en_core_web_sm")
doc = nlp("A spectre is haunting Europe. The spectre of communism.")
nouns = [token.text for token in doc if token.pos_ == "NOUN"]
print(nouns)
The output would be: ['spectre', 'spectre', 'communism'].
Pattern Matching
spaCy also has a great system for finding phrases that match specific grammatical patterns in texts. It’s sort of like regex, but for grammar. spaCy has extensive documentation on their matcher, as well as an interactive tool to explore the matcher rules, so this is, again, just a very brief intro!
Let’s say that we want to extract all the phrases from the Communist Manifesto that match the pattern determiner adjective noun. These are phrases like “a holy alliance” and “the modern State”.
import spacy
from spacy.matcher import Matcher
# load spacy and create a matcher object
nlp = spacy.load("en_core_web_sm")
matcher = Matcher(nlp.vocab)
# a list of patterns to look for
patterns = [
    [{"POS": "DET"}, {"POS": "ADJ"}, {"POS": "NOUN"}],
]
matcher.add("MyPattern", patterns)
# read in the communist manifesto
doc = nlp(open("manifesto.txt").read())
# find matches
matches = matcher(doc)
# print the text of the matches
for match_id, start, end in matches:
    span = doc[start:end]
    print(span.text)
Note: the matcher returns a start and end index for each match. In order to actually get the text of the matches, we feed those indices to our doc object.
The output (sorted) is:
a bare existence
a bourgeois revolution
a bribed tool
a certain stage
a collective product
a common plan
a considerable part
a constant battle
a cosmopolitan character
a critical position
a distinctive feature
a few hands
a few years
a foreign language
a general reconstruction
a great extent
a great part
a heavy progressive
a historical movement
a holy alliance
a literary battle
a long course
a mere figure
a mere instrument
a mere money
a mere training
a miserable fit
a national struggle
a new basis
a new supply
a political party
a proletarian revolution
a reactionary character
a reactionary interest
a separate party
a similar movement
a single instant
a small section
a social power
a socialist tinge
a sweet finish
a universal war
all chinese walls
all coercive measures
all earlier epochs
all earlier ones
all former exoduses
all individual character
all political action
all practical value
all previous securities
all theoretical justification
an agrarian revolution
an earlier period
an exclusive monopoly
an ideological standpoint
an incoherent mass
an oppressed class
any historical initiative
any sectarian principles
each single workshop
every other article
every revolutionary movement
every villainous meanness
no other conclusion
no other nexus
the absolute governments
the absolute monarchy
the administrative work
the ancient religions
the ancient world
the average price
the bitter pills
the bombastic representative
the bourgeois clap
the bourgeois class
the bourgeois conditions
the bourgeois family
the bourgeois mode
the bourgeois objections
the bourgeois relations
the bourgeois sense
the bourgeois supremacy
the branding reproach
the civilised ones
the classical works
the commercial crises
the common ruin
the communist revolution
the communistic mode
the communistic modes
the continued existence
the decisive hour
the earlier epochs
the economic functions
the economic situation
the economical conditions
the eighteenth century
the entire proletariat
the essential condition
the exact contrary
the extensive use
the extreme length
the fettered traders
the feudal aristocracy
the feudal nobility
the feudal organisation
the feudal relations
the feudal system
the first conditions
the first elements
the first step
the forcible overthrow
the free development
the french criticism
the french ideas
the french original
the french revolution
the french sense
the german _
the german bourgeoisie
the german nation
the german workers
the golden apples
the great chagrin
the great factory
the great mass
the greatest pleasure
the heavy artillery
the historical movement
the holy water
the hostile antagonism
the icy water
the immediate aim
the immediate aims
the immediate result
the immense majority
the impending bourgeois
the individual bourgeois
the individual members
the industrial army
the industrial capitalist
the industrial war
the inevitable ruin
the intellectual creations
the last resort
the last word
the leading question
the little workshop
the lower strata
the lowest layers
the lowest stratum
the mediaeval commune
the middle class
the miserable character
the misty realm
the modern bourgeois
the modern bourgeoisie
the modern laborer
the modern state
the modern working
the momentary interests
the national ground
the national struggles
the necessary condition
the necessary consequence
the necessary offspring
the new markets
the new methods
the old bourgeois
the old conditions
the old cries
the old family
the old ideas
the old means
the old modes
the old nationalities
the old ones
the old order
the old property
the old society
the old wants
the only class
the original views
the other portions
the petty bourgeois
the petty bourgeoisie
the political constitution
the political movement
the practical absence
the practical measures
the present family
the present system
the prime condition
the productive forces
the proletarian movement
the prussian bourgeoisie
the public power
the rapid improvement
the reactionary character
the reactionary classes
the real point
the remotest zones
the revolutionary class
the revolutionary element
the revolutionary literature
the same character
the same proportion
the same time
the same way
the selfish misconception
the small manufacturer
the small peasant
the small tradespeople
the social character
the social consciousness
the social forms
the socialistic bourgeois
the theoretical conclusions
the threatening bourgeoisie
the traditional anathemas
the typical man
the unceasing improvement
the undeveloped state
the upper hand
the urban population
the vanished status
the various interests
the various stages
the very foundation
the violent overthrow
the virtuous indignation
the whole bourgeoisie
the whole country
the whole nation
the whole proletariat
the whole range
the whole relations
the whole superincumbent
the whole surface
these fantastic attacks
these first movements
these intermediate classes
these philosophical phrases
these same governments
this comfortable conception
this distinctive feature
this french literature
this transcendental robe
what earlier century
whose essential character
whose silly echo
Note: if you want to use fine-grained tags, just write "TAG" instead of "POS".
Here are a few other examples of patterns.
“determiner adjective noun adposition noun”, i.e. “the present system of production”:
[{"POS": "DET"}, {"POS": "ADJ"}, {"POS": "NOUN"}, {"POS": "ADP"}, {"POS": "NOUN"}]
“determiner noun be adjective”, i.e. “the workers are victorious”:
[{"POS": "DET"}, {"POS": "NOUN"}, {"LEMMA": "be"}, {"POS": "ADJ"}]
Videogrep in Python
In order to combine spaCy and Videogrep, you have to first load Videogrep into a Python script. Here, for example, is how to print out lines from a video’s transcript file.
import videogrep
videofile = "shell.mp4"
transcript = videogrep.parse_transcript(videofile)
for sentence in transcript:
    print(sentence["content"])
The parse_transcript function finds a transcript for the given video file and returns a list of dictionary objects that contain sentences (and words, if available) with start and end timestamps. The function will look for .json files first, then .vtt files, and finally .srt files.
Its output will be formatted something like this, where each sentence is a dictionary object:
[
    {
        "content": "a spectre is haunting",
        "start": 0,
        "end": 3.0,
        "words": [
            {"word": "a", "start": 0, "end": 0.5},
            {"word": "spectre", "start": 0.6, "end": 1.2},
        ],
    }
]
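Since that structure is just a list of dictionaries, you can poke at it with plain Python. Here’s a small sketch (using a hand-written transcript in the same shape) that collects every word along with its timestamps:

```python
# a hand-written transcript in the same shape parse_transcript returns
transcript = [
    {
        "content": "a spectre is haunting",
        "start": 0,
        "end": 3.0,
        "words": [
            {"word": "a", "start": 0, "end": 0.5},
            {"word": "spectre", "start": 0.6, "end": 1.2},
        ],
    }
]

# collect every word along with its start and end timestamps
timed_words = []
for sentence in transcript:
    for word in sentence.get("words", []):
        timed_words.append((word["word"], word["start"], word["end"]))

print(timed_words)
# [('a', 0, 0.5), ('spectre', 0.6, 1.2)]
```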
We can also call the main Videogrep function to make a supercut. It’s very similar to how we’d use Videogrep from the command line. In this example, I’ll make a supercut with a random word from the transcript of this video of Shell’s third quarter 2021 results, that I’ve downloaded with yt-dlp:
import videogrep
import random
videofile = "shell.mp4"
transcript = videogrep.parse_transcript(videofile)
# create a list of all the words in the transcript
all_words = []
for sentence in transcript:
    all_words += sentence["words"]
# grab a random word
query = random.choice(all_words)["word"]
# create the supercut
videogrep.videogrep(
    videofile,
    query,
    search_type="fragment",
    output="random_supercut.mp4"
)
Videogrep and spaCy
Only Nouns
We can now integrate Videogrep and spaCy. Let’s start by making a supercut containing all the nouns from a video, again, using Shell’s quarterly results as an example.
To integrate spaCy and Videogrep, we need to process all the text in our video by making an nlp object for each sentence and iterating over the tokens. We can then extract a list of nouns from the text to use within Videogrep.
import videogrep
import spacy
videofile = "shell.mp4"
nlp = spacy.load("en_core_web_sm")
search_words = []
# iterate through the transcript,
# saving nouns to search for
transcript = videogrep.parse_transcript(videofile)
for sentence in transcript:
    doc = nlp(sentence["content"])
    for token in doc:
        # change this if you don't want nouns!
        if token.pos_ == "NOUN":
            # ensure that exact matches are made
            search_words.append(f"^{token.text}$")
videogrep.videogrep(videofile, search_words, search_type="fragment", output="only_nouns.mp4")
See if you can bear to watch the whole thing!
A few things to note here. First, Videogrep uses Python’s regular expression engine for queries, so I’m surrounding each search term with a ^ and a $ in order to grab the exact words and avoid partial matches. Second, I’m passing a list of search terms to Videogrep. You could also pass it a single string, with search terms separated by the | character.
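For example, a quick sketch of building that single-string form from a word list:

```python
import re

words = ["spectre", "communism", "haunting"]

# anchor each term with ^ and $ so only whole words match,
# then join with | to form one alternation query
query = "|".join(f"^{re.escape(w)}$" for w in words)
print(query)
# ^spectre$|^communism$|^haunting$
```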
Pattern Matching
Finally, we can now use spaCy’s pattern matching with Videogrep. Here’s an example where I’ve extracted all the adjective and noun combinations from the video:
import videogrep
import spacy
from spacy.matcher import Matcher
video = "shell.mp4"
nlp = spacy.load("en_core_web_sm")
patterns = [[{"POS": "ADJ"}, {"POS": "NOUN"}]]
matcher = Matcher(nlp.vocab)
matcher.add("Patterns", patterns)
searches = []
transcript = videogrep.parse_transcript(video)
for sentence in transcript:
    doc = nlp(sentence["content"])
    matches = matcher(doc)
    for match_id, start, end in matches:
        span = doc[start:end]
        searches.append(span.text)
videogrep.videogrep(
    video, searches, search_type="fragment", output="pattern_matcher.mp4"
)
Similarity
You can also use spaCy to determine the similarity between words and phrases. Each doc and token object contains a similarity function that takes as an argument any other doc or token, and will return an estimate of the semantic similarity between the words.
To use this, you must first download spaCy’s larger language model:
python3 -m spacy download en_core_web_lg
We can then iterate through all the words in a transcript, recording similarity to a search term, and then make a supercut based on what we find.
Here, for example, is code that will make a supercut of words similar to the word “money”.
import videogrep
import spacy
# load the larger language model
nlp = spacy.load("en_core_web_lg")
video = "shell.mp4"
# search for words similar to "money"
search_sim = nlp("money")
similarities = []
transcript = videogrep.parse_transcript(video)
for sentence in transcript:
    doc = nlp(sentence["content"])
    for token in doc:
        # calculate the similarity between each token
        # and our search term
        sim = search_sim.similarity(token)
        # store the similarity value
        similarities.append((sim, token.text))
# sort the words by the similarity value
similarities = sorted(similarities, key=lambda k: k[0], reverse=True)
# limit to 20 results
similarities = similarities[0:20]
# make a unique list of words
searches = list(set([s[1] for s in similarities]))
# create the video
videogrep.videogrep(
    video, searches, search_type="fragment", output="money.mp4"
)