Creating Supercuts with Pattern Matching
In previous versions of Videogrep I included the pattern library, which, at the time (2015), was my preferred tool for natural language processing tasks. Integrating pattern allowed Videogrep to create supercuts based on grammatical patterns.

In an attempt to simplify things, I no longer include pattern with Videogrep. And generally speaking, I’ve moved on from pattern altogether, to the more robust spaCy, an “industrial-strength” natural language processing library which is also a pleasure to use.

In this post I’ll explain how to use spaCy with Videogrep. In doing so, I’ll also go over some NLP basics, as well as how to use Videogrep as a Python module rather than a command-line application.

If you’re unfamiliar with Videogrep, see this tutorial first.
Introduction to spaCy
Note: this is an overly brief section! For more, see spaCy 101, and/or Allison Parrish’s tutorial NLP Concepts with spaCy.
At its core, natural language processing (or NLP) allows a computer to “make sense” of human language. NLP tasks might include: determining that a particular set of characters is a word or a sentence; determining that a particular word is a noun, adjective, or other part of speech; determining that a word is a “named entity” like a business, place, or person; estimating the similarity between two words or phrases. And so on!
spaCy is a wonderful open source library that allows you to perform the above tasks, and more.
Installation
To install it, open up a terminal and type:
pip3 install spacy
You’ll also need to install a language model. I’ll be using English here, but other models are also available:
python3 -m spacy download en_core_web_sm
Note: I’m using the “small” model which is fast, but less accurate.
Basic Usage
To use spaCy, just load up the library and the model, and then pass it some text to analyze:
import spacy
nlp = spacy.load("en_core_web_sm")
doc = nlp("A spectre is haunting Europe. The spectre of communism.")
for token in doc:
    print(token.text, token.pos_, token.tag_)
The output will be:
A DET DT
spectre NOUN NN
is AUX VBZ
haunting VERB VBG
Europe PROPN NNP
. PUNCT .
The DET DT
spectre NOUN NN
of ADP IN
communism NOUN NN
. PUNCT .
doc is an iterator containing spaCy Token objects. Each token is a word or piece of punctuation, annotated with attributes that give you additional info about that token’s grammatical structure. Here, we’re printing out the text of each token, along with its coarse-grained (.pos_) and fine-grained (.tag_) part of speech tag.
The possible parts of speech (coarse-grained) are:
- ADJ: adjective
- ADP: adposition
- ADV: adverb
- AUX: auxiliary
- CCONJ: coordinating conjunction
- DET: determiner
- INTJ: interjection
- NOUN: noun
- NUM: numeral
- PART: particle
- PRON: pronoun
- PROPN: proper noun
- PUNCT: punctuation
- SCONJ: subordinating conjunction
- SYM: symbol
- VERB: verb
- X: other
The fine-grained .tag_ annotation provides more specific details, such as verb form. A full list of those tags can be found here.
So, if we want to extract all the nouns from a text, we can simply select the words whose .pos_ tag is equal to "NOUN":
import spacy
nlp = spacy.load("en_core_web_sm")
doc = nlp("A spectre is haunting Europe. The spectre of communism.")
nouns = [token.text for token in doc if token.pos_ == "NOUN"]
print(nouns)
The output would be: ['spectre', 'spectre', 'communism'].
Pattern Matching
spaCy also has a great system for finding phrases that match specific grammatical patterns in texts. It’s sort of like regex, but for grammar. spaCy has extensive documentation on their matcher, as well as an interactive tool to explore the matcher rules, so this is, again, just a very brief intro!
Let’s say that we want to extract all the phrases from the Communist Manifesto that match the pattern determiner adjective noun. These are phrases like “a holy alliance” and “the modern State”.
import spacy
from spacy.matcher import Matcher
# load spacy and create a matcher object
nlp = spacy.load("en_core_web_sm")
matcher = Matcher(nlp.vocab)
# a list of patterns to look for
patterns = [
    [{"POS": "DET"}, {"POS": "ADJ"}, {"POS": "NOUN"}],
]
matcher.add("MyPattern", patterns)
# read in the communist manifesto
doc = nlp(open("manifesto.txt").read())
# find matches
matches = matcher(doc)
# print the text of the matches
for match_id, start, end in matches:
    span = doc[start:end]
    print(span.text)
Note: the matcher returns a start and end index for each match. In order to actually get the text of the matches, we feed those indices to our doc object.
The output (sorted) is:
a bare existence
a bourgeois revolution
a bribed tool
a certain stage
a collective product
a common plan
a considerable part
a constant battle
a cosmopolitan character
a critical position
a distinctive feature
a few hands
a few years
a foreign language
a general reconstruction
a great extent
a great part
a heavy progressive
a historical movement
a holy alliance
a literary battle
a long course
a mere figure
a mere instrument
a mere money
a mere training
a miserable fit
a national struggle
a new basis
a new supply
a political party
a proletarian revolution
a reactionary character
a reactionary interest
a separate party
a similar movement
a single instant
a small section
a social power
a socialist tinge
a sweet finish
a universal war
all chinese walls
all coercive measures
all earlier epochs
all earlier ones
all former exoduses
all individual character
all political action
all practical value
all previous securities
all theoretical justification
an agrarian revolution
an earlier period
an exclusive monopoly
an ideological standpoint
an incoherent mass
an oppressed class
any historical initiative
any sectarian principles
each single workshop
every other article
every revolutionary movement
every villainous meanness
no other conclusion
no other nexus
the absolute governments
the absolute monarchy
the administrative work
the ancient religions
the ancient world
the average price
the bitter pills
the bombastic representative
the bourgeois clap
the bourgeois class
the bourgeois conditions
the bourgeois family
the bourgeois mode
the bourgeois objections
the bourgeois relations
the bourgeois sense
the bourgeois supremacy
the branding reproach
the civilised ones
the classical works
the commercial crises
the common ruin
the communist revolution
the communistic mode
the communistic modes
the continued existence
the decisive hour
the earlier epochs
the economic functions
the economic situation
the economical conditions
the eighteenth century
the entire proletariat
the essential condition
the exact contrary
the extensive use
the extreme length
the fettered traders
the feudal aristocracy
the feudal nobility
the feudal organisation
the feudal relations
the feudal system
the first conditions
the first elements
the first step
the forcible overthrow
the free development
the french criticism
the french ideas
the french original
the french revolution
the french sense
the german _
the german bourgeoisie
the german nation
the german workers
the golden apples
the great chagrin
the great factory
the great mass
the greatest pleasure
the heavy artillery
the historical movement
the holy water
the hostile antagonism
the icy water
the immediate aim
the immediate aims
the immediate result
the immense majority
the impending bourgeois
the individual bourgeois
the individual members
the industrial army
the industrial capitalist
the industrial war
the inevitable ruin
the intellectual creations
the last resort
the last word
the leading question
the little workshop
the lower strata
the lowest layers
the lowest stratum
the mediaeval commune
the middle class
the miserable character
the misty realm
the modern bourgeois
the modern bourgeoisie
the modern laborer
the modern state
the modern working
the momentary interests
the national ground
the national struggles
the necessary condition
the necessary consequence
the necessary offspring
the new markets
the new methods
the old bourgeois
the old conditions
the old cries
the old family
the old ideas
the old means
the old modes
the old nationalities
the old ones
the old order
the old property
the old society
the old wants
the only class
the original views
the other portions
the petty bourgeois
the petty bourgeoisie
the political constitution
the political movement
the practical absence
the practical measures
the present family
the present system
the prime condition
the productive forces
the proletarian movement
the prussian bourgeoisie
the public power
the rapid improvement
the reactionary character
the reactionary classes
the real point
the remotest zones
the revolutionary class
the revolutionary element
the revolutionary literature
the same character
the same proportion
the same time
the same way
the selfish misconception
the small manufacturer
the small peasant
the small tradespeople
the social character
the social consciousness
the social forms
the socialistic bourgeois
the theoretical conclusions
the threatening bourgeoisie
the traditional anathemas
the typical man
the unceasing improvement
the undeveloped state
the upper hand
the urban population
the vanished status
the various interests
the various stages
the very foundation
the violent overthrow
the virtuous indignation
the whole bourgeoisie
the whole country
the whole nation
the whole proletariat
the whole range
the whole relations
the whole superincumbent
the whole surface
these fantastic attacks
these first movements
these intermediate classes
these philosophical phrases
these same governments
this comfortable conception
this distinctive feature
this french literature
this transcendental robe
what earlier century
whose essential character
whose silly echo
Note: if you want to use fine-grained tags, just write "TAG" instead of "POS".
Here are a few other examples of patterns.
“determiner adjective noun adposition noun”, i.e. “the present system of production”:
[{"POS": "DET"}, {"POS": "ADJ"}, {"POS": "NOUN"}, {"POS": "ADP"}, {"POS": "NOUN"}]
“determiner noun be adjective”, i.e. “the workers are victorious”:
[{"POS": "DET"}, {"POS": "NOUN"}, {"LEMMA": "be"}, {"POS": "ADJ"}]
Videogrep in Python
In order to combine spaCy and Videogrep, you have to first load Videogrep into a Python script. Here, for example, is how to print out lines from a video’s transcript file.
import videogrep
videofile = "shell.mp4"
transcript = videogrep.parse_transcript(videofile)
for sentence in transcript:
    print(sentence["content"])
The parse_transcript function finds a transcript for the given video file and returns a list of dictionary objects that contain sentences (and words, if available) with start and end timestamps. The function will look for .json files first, then .vtt files, and finally .srt files.
Its output will be formatted something like this, where each sentence is a dictionary object:
[
    {
        "content": "a spectre is haunting",
        "start": 0,
        "end": 3.0,
        "words": [
            {"word": "a", "start": 0, "end": 0.5},
            {"word": "spectre", "start": 0.6, "end": 1.2},
        ],
    }
]
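Since that structure is just a list of dictionaries, you can poke at it with plain Python. Here’s a small sketch (using a hand-written transcript in the same shape) that collects every word along with its timestamps:

```python
# a hand-written transcript in the same shape parse_transcript returns
transcript = [
    {
        "content": "a spectre is haunting",
        "start": 0,
        "end": 3.0,
        "words": [
            {"word": "a", "start": 0, "end": 0.5},
            {"word": "spectre", "start": 0.6, "end": 1.2},
        ],
    }
]

# collect every word along with its start and end timestamps
timed_words = []
for sentence in transcript:
    for word in sentence.get("words", []):
        timed_words.append((word["word"], word["start"], word["end"]))

print(timed_words)
# [('a', 0, 0.5), ('spectre', 0.6, 1.2)]
```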
We can also call the main Videogrep function to make a supercut. It’s very similar to how we’d use Videogrep from the command line. In this example, I’ll make a supercut with a random word from the transcript of this video of Shell’s third quarter 2021 results, that I’ve downloaded with yt-dlp:
import videogrep
import random
videofile = "shell.mp4"
transcript = videogrep.parse_transcript(videofile)
# create a list of all the words in the transcript
all_words = []
for sentence in transcript:
    all_words += sentence["words"]
# grab a random word
query = random.choice(all_words)["word"]
# create the supercut
videogrep.videogrep(
    videofile,
    query,
    search_type="fragment",
    output="random_supercut.mp4"
)
Videogrep and spaCy
Only Nouns
We can now integrate Videogrep and spaCy. Let’s start by making a supercut containing all the nouns from a video, again, using Shell’s quarterly results as an example.
To integrate spaCy and Videogrep, we need to process all the text in our video by making an nlp object for each sentence and iterating over the tokens. We can then extract a list of nouns from the text to use within Videogrep.
import videogrep
import spacy
videofile = "shell.mp4"
nlp = spacy.load("en_core_web_sm")
search_words = []
# iterate through the transcript,
# saving nouns to search for
transcript = videogrep.parse_transcript(videofile)
for sentence in transcript:
    doc = nlp(sentence["content"])
    for token in doc:
        # change this if you don't want nouns!
        if token.pos_ == "NOUN":
            # ensure that exact matches are made
            search_words.append(f"^{token.text}$")
videogrep.videogrep(videofile, search_words, search_type="fragment", output="only_nouns.mp4")
See if you can bear to watch the whole thing!
A few things to note here. First, Videogrep uses Python’s regular expression engine for queries, so I’m surrounding each search term with a ^ and a $ in order to grab the exact words and avoid partial matches. Second, I’m passing a list of search terms to Videogrep. You could also pass it a single string, with search terms separated by the | character.
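For example, a quick sketch of building that single-string form from a word list:

```python
import re

words = ["spectre", "communism", "haunting"]

# anchor each term with ^ and $ so only whole words match,
# then join with | to form one alternation query
query = "|".join(f"^{re.escape(w)}$" for w in words)
print(query)
# ^spectre$|^communism$|^haunting$
```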
Pattern Matching
Finally, we can now use spaCy’s pattern matching with Videogrep. Here’s an example where I’ve extracted all the adjective and noun combinations from the video:
import videogrep
import spacy
from spacy.matcher import Matcher
video = "shell.mp4"
nlp = spacy.load("en_core_web_sm")
patterns = [[{"POS": "ADJ"}, {"POS": "NOUN"}]]
matcher = Matcher(nlp.vocab)
matcher.add("Patterns", patterns)
searches = []
transcript = videogrep.parse_transcript(video)
for sentence in transcript:
    doc = nlp(sentence["content"])
    matches = matcher(doc)
    for match_id, start, end in matches:
        span = doc[start:end]
        searches.append(span.text)
videogrep.videogrep(
    video, searches, search_type="fragment", output="pattern_matcher.mp4"
)
Similarity
You can also use spaCy to determine the similarity between words and phrases. Each doc and token object contains a similarity function that takes as an argument any other doc or token, and will return an estimate of the semantic similarity between the words.
To use this, you must first download spaCy’s larger language model:
python3 -m spacy download en_core_web_lg
We can then iterate through all the words in a transcript, recording similarity to a search term, and then make a supercut based on what we find.
Here, for example, is code that will make a supercut of words similar to the word “money”.
import videogrep
import spacy
# load the larger language model
nlp = spacy.load("en_core_web_lg")
video = "shell.mp4"
# search for words similar to "money"
search_sim = nlp("money")
similarities = []
transcript = videogrep.parse_transcript(video)
for sentence in transcript:
    doc = nlp(sentence["content"])
    for token in doc:
        # calculate the similarity between each token
        # and our search term
        sim = search_sim.similarity(token)
        # store the similarity value
        similarities.append((sim, token.text))
# sort the words by the similarity value
similarities = sorted(similarities, key=lambda k: k[0], reverse=True)
# limit to 20 results
similarities = similarities[0:20]
# make a unique list of words
searches = list(set([s[1] for s in similarities]))
# create the video
videogrep.videogrep(
    video, searches, search_type="fragment", output="money.mp4"
)