Regex Grep Memo: Extracting YouTube URLs from Continuous Text and Using yt-dlp


Good evening, it's your incompetent author.

When you dump Mastodon posts, they look like this.

/\u003e\u003ca href=\"https://youtube.com/watch?v=jL4NZ913v8E\u0026amp;feature=share\" target=\"_blank\" rel=\"nofollow noopener noreferrer\"\u003e\u003cspan class=\"invisible\"\u003ehttps://\u003c/span\u003e\u003cspan class=\"ellipsis\"

From this continuous string, I want to extract only YouTube URLs.

grep -oP 'youtube.com\/watch\?v=...........' outbox.json > mstv.txt

grep -oP 'youtube.com\/shorts\/...........' outbox.json > mstv2.txt

Anyway, since it's troublesome to distinguish between URLs with and without "www", I'll just match from "youtube.com" onwards and replace it later.

youtube.com/watch?v=RIqxyO3S8pc
youtube.com/watch?v=RIqxyO3S8p\
youtube.com/watch?v=RIqxyO3S8pc
youtube.com/watch?v=RIqxyO3S8p\
youtube.com/watch?v=W7G-QtbTWgs
youtube.com/watch?v=W7G-QtbTWg\
youtube.com/watch?v=W7G-QtbTWgs
youtube.com/watch?v=W7G-QtbTWg\

I haven't looked into it in detail yet, but for some reason, some entries have one character cut off and a backslash `\` inserted, so I'll remove them with sed -i.

sed -i 's/\\//g' mstv.txt

youtube.com/watch?v=RIqxyO3S8pc
youtube.com/watch?v=RIqxyO3S8p
youtube.com/watch?v=RIqxyO3S8pc
youtube.com/watch?v=RIqxyO3S8p
youtube.com/watch?v=W7G-QtbTWgs
youtube.com/watch?v=W7G-QtbTWg
youtube.com/watch?v=W7G-QtbTWgs
youtube.com/watch?v=W7G-QtbTWg

Since it looks like this, I'll try counting the characters of the correct URL string.

echo "youtube.com/watch?v=g5HQFrSk4OA" | wc -c
32

It seems to be 32 characters long.

So, I'll extract only those with 32 characters.

grep -oP '^.{32}$' mstv.txt > mstvtmp.txt

Well, it's written as if it succeeded, but when I re-verified, it didn't work. It worked when I changed 32 to 31. "Huh, why?" Can someone please tell me?

grep -oP '^.{31}$' mstv.txt > mstvtmp.txt

youtube.com/watch?v=g5HQFrSk4OA
youtube.com/watch?v=g5HQFrSk4OA
youtube.com/watch?v=RIqxyO3S8pc
youtube.com/watch?v=RIqxyO3S8pc
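On the 32-versus-31 puzzle: a likely explanation is that `echo` appends a newline and `wc -c` counts bytes, so the newline is part of the 32, while the `.` in the regex never matches the line terminator. A minimal sketch to check this with the same URL string (the variable names are just for illustration):

```shell
url='youtube.com/watch?v=g5HQFrSk4OA'

# echo appends a trailing newline, and wc -c counts it as one more byte
with_newline=$(echo "$url" | wc -c)

# printf '%s' prints the string alone, so only the visible characters are counted
without_newline=$(printf '%s' "$url" | wc -c)

# ${#url} is the shell's own character count for the variable
echo "echo | wc -c: $with_newline, printf | wc -c: $without_newline, \${#url}: ${#url}"
```

If that is the cause, `awk 'length($0) == 31' mstv.txt` would be another way to keep only the full-length lines.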

The result still contains duplicate lines, so I'll remove them with uniq.

uniq mstvtmp.txt > newmstv.txt

Just to be sure, I'll look for non-duplicate items and append them.

uniq -u mstvtmp.txt >> newmstv.txt

youtube.com/watch?v=jL4NZ913v8E
youtube.com/watch?v=g5HQFrSk4OA
youtube.com/watch?v=RIqxyO3S8pc
youtube.com/watch?v=W7G-QtbTWgs
youtube.com/watch?v=DRVp_cmW3Nw

Alright.
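A note on this step: `uniq` only collapses duplicate lines that sit next to each other, and `uniq -u` prints the lines that occur exactly once, which the plain `uniq` output already contains. Appending it can therefore re-add lines when the input has singletons. Here is a hedged sketch of two alternatives that deduplicate regardless of position; the sample data and the file names under a temporary directory are made up for the demo:

```shell
# Work in a throwaway directory so nothing real is overwritten
workdir=$(mktemp -d)

# Sample list with a non-adjacent duplicate and a line that occurs only once
printf '%s\n' \
  'youtube.com/watch?v=g5HQFrSk4OA' \
  'youtube.com/watch?v=RIqxyO3S8pc' \
  'youtube.com/watch?v=g5HQFrSk4OA' \
  'youtube.com/watch?v=W7G-QtbTWgs' > "$workdir/mstvtmp.txt"

# sort -u sorts the lines and keeps one copy of each
sort -u "$workdir/mstvtmp.txt" > "$workdir/sorted.txt"

# awk keeps the first occurrence of each line and preserves the original order
awk '!seen[$0]++' "$workdir/mstvtmp.txt" > "$workdir/ordered.txt"

cat "$workdir/ordered.txt"
```

The `awk` form is handy when the posting order of the videos should be kept.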

So, I'll try to download them with yt-dlp.
This time, I'll save them in mp4 format.

sed -i 's|youtube|yt-dlp -o "/media/ncp/yt/n/%(title)s" -f "bv[ext=mp4]+ba[ext=m4a]" --merge-output-format mp4 https://www.youtube|g' newmstv.txt

At this point I'm also switching to the URL with "www". Some might say, "Why not just anchor on `^`, the start of the line?!" but personally I prefer this method, because replacing a string in the middle of the line feels more reproducible to me.

Also, since `/` is included, I'm using `|` to delimit in `sed`. It doesn't have to be `|`, though.
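To see what that substitution produces, here is a small sketch that runs the same expression on a single sample line through a pipe instead of editing a file in place. It uses `#` as the delimiter, to illustrate that any character absent from the pattern and the replacement will do:

```shell
line='youtube.com/watch?v=jL4NZ913v8E'

# Same substitution as above, with # as the delimiter instead of |
cmd=$(printf '%s\n' "$line" |
  sed 's#youtube#yt-dlp -o "/media/ncp/yt/n/%(title)s" -f "bv[ext=mp4]+ba[ext=m4a]" --merge-output-format mp4 https://www.youtube#')

# Print the generated command line; nothing is downloaded here
printf '%s\n' "$cmd"
```

Running it through a pipe first is a cheap way to check the generated line before committing to `sed -i`.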

And then, let's change the name to a .sh format.

mv newmstv.txt ytdl.sh

vi ytdl.sh

and add #!/bin/bash as the first line.
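If you'd rather not open an editor, the shebang can also be put on the first line from the command line. A minimal sketch, assuming a list file like the one above; it runs in a temporary directory and uses a stand-in `echo` line rather than real yt-dlp commands:

```shell
workdir=$(mktemp -d)

# Stand-in for the generated command list (a harmless echo instead of yt-dlp)
printf '%s\n' 'echo "download step would run here"' > "$workdir/newmstv.txt"

# Write the shebang first, then the existing lines, into the new script
{ printf '#!/bin/bash\n'; cat "$workdir/newmstv.txt"; } > "$workdir/ytdl.sh"

chmod +x "$workdir/ytdl.sh"

# Show the first line to confirm the shebang is in place
head -n 1 "$workdir/ytdl.sh"
```

This writes a new file instead of renaming the old one, so the original list stays around in case something goes wrong.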

Let's give it execution permission.

chmod +x ./ytdl.sh

(Here, I suddenly thought "I must do it with `./`!" and did it now. (A man who does things belatedly.) It doesn't change anything without it, though. But if you don't include it, you're disqualified as a man.)

After that, just run the script and you're done.
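As an alternative to generating a script, yt-dlp can read URLs from a file with its `-a` / `--batch-file` option. A sketch under that assumption: the list is turned into full URLs in a hypothetical `urls.txt`, and the actual download command is left as a comment so the example does not touch the network:

```shell
workdir=$(mktemp -d)

# Sample of the deduplicated list from the earlier steps
printf '%s\n' \
  'youtube.com/watch?v=jL4NZ913v8E' \
  'youtube.com/watch?v=g5HQFrSk4OA' > "$workdir/newmstv.txt"

# Prefix every line with the scheme and "www." to get complete URLs
sed 's|^|https://www.|' "$workdir/newmstv.txt" > "$workdir/urls.txt"

cat "$workdir/urls.txt"

# The download itself would then be a single call (not run here):
# yt-dlp -a "$workdir/urls.txt" -o "/media/ncp/yt/n/%(title)s" \
#   -f "bv[ext=mp4]+ba[ext=m4a]" --merge-output-format mp4
```

With this variant there is no script to edit or make executable, at the cost of losing the fun of typing more commands.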

By the way, there must absolutely be a cleaner way to extract the URL strings at the initial `grep` stage. I'm still very amateur, so please forgive me. I haven't even investigated why yet, so I think I'm doing things in a roundabout way, but this is a session about "it's fun to type various commands."
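On the cleaner-extraction question, one possible approach is to match only characters that can appear in a video ID (letters, digits, `_` and `-`) instead of eleven wildcard dots. The dump at the top shows a span with class `ellipsis` right after the link. If that span holds a shortened copy of the URL followed by the escaped `\u003c`, a wildcard would pick up ten ID characters plus the backslash, which might be where the cut-off entries come from. This is a guess, not something verified against the real outbox.json. The sketch below uses made-up sample lines and a hypothetical file name `sample.json`:

```shell
workdir=$(mktemp -d)

# Made-up sample: one full URL in the href, one shortened copy in the visible text
cat > "$workdir/sample.json" <<'EOF'
\u003ca href=\"https://www.youtube.com/watch?v=RIqxyO3S8pc\" \u003e\u003cspan class=\"ellipsis\"\u003eyoutube.com/watch?v=RIqxyO3S8p\u003c/span\u003e
\u003ca href=\"https://youtube.com/shorts/W7G-QtbTWgs\"\u003e
EOF

# Accept watch?v= or shorts/, then exactly 11 ID characters,
# so the shortened copy ending in a backslash never matches.
# sort -u drops duplicates in the same pass.
grep -oE 'youtube\.com/(watch\?v=|shorts/)[A-Za-z0-9_-]{11}' "$workdir/sample.json" |
  sort -u > "$workdir/ids.txt"

cat "$workdir/ids.txt"
```

Because the pattern starts at "youtube.com", it behaves the same with and without "www", and the backslash cleanup, the length filter, and the uniq steps would no longer be needed.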

End.