Regex Grep Memo: Extracting YouTube URLs from Continuous Text and Using yt-dlp
Good evening, it's your incompetent author.
When you dump Mastodon posts, they look like this.
/\u003e\u003ca href=\"https://youtube.com/watch?v=jL4NZ913v8E\u0026amp;feature=share\" target=\"_blank\" rel=\"nofollow noopener noreferrer\"\u003e\u003cspan class=\"invisible\"\u003ehttps://\u003c/span\u003e\u003cspan class=\"ellipsis\"
From this continuous string, I want to extract only YouTube URLs.
grep -oP 'youtube\.com/watch\?v=.{11}' outbox.json > mstv.txt
grep -oP 'youtube\.com/shorts/.{11}' outbox.json > mstv2.txt
Anyway, since it's troublesome to distinguish between URLs with and without "www", I'll just match from "youtube.com" onwards and replace it later.
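As a self-contained sketch of the idea (the `.{11}` assumes YouTube video IDs are 11 characters, and `-P` needs a GNU grep built with PCRE support):

```shell
# a JSON-escaped fragment like the outbox.json dump above
s='href=\"https://youtube.com/watch?v=jL4NZ913v8E\u0026amp;feature=share\"'
# match "youtube.com/watch?v=" plus the 11-character video ID
printf '%s\n' "$s" | grep -oP 'youtube\.com/watch\?v=.{11}'
```

which prints `youtube.com/watch?v=jL4NZ913v8E`. The actual mstv.txt came out like this: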
youtube.com/watch?v=RIqxyO3S8pc
youtube.com/watch?v=RIqxyO3S8p\
youtube.com/watch?v=RIqxyO3S8pc
youtube.com/watch?v=RIqxyO3S8p\
youtube.com/watch?v=W7G-QtbTWgs
youtube.com/watch?v=W7G-QtbTWg\
youtube.com/watch?v=W7G-QtbTWgs
youtube.com/watch?v=W7G-QtbTWg\
I haven't dug into it in detail yet, but some entries come out one character short with a backslash `\` tacked on. (My guess: Mastodon truncates the displayed URL inside the `ellipsis` span, so the match sometimes ends before the last ID character and the wildcard swallows the backslash of the following escaped quote.) Either way, I'll strip the backslashes with `sed -i`.
sed -i 's/\\//g' mstv.txt
youtube.com/watch?v=RIqxyO3S8pc
youtube.com/watch?v=RIqxyO3S8p
youtube.com/watch?v=RIqxyO3S8pc
youtube.com/watch?v=RIqxyO3S8p
youtube.com/watch?v=W7G-QtbTWgs
youtube.com/watch?v=W7G-QtbTWg
youtube.com/watch?v=W7G-QtbTWgs
youtube.com/watch?v=W7G-QtbTWg
Since it looks like this, I'll try counting the characters of the correct URL string.
echo "youtube.com/watch?v=g5HQFrSk4OA" | wc -c
32
It seems to be 32 characters long.
So, I'll extract only those with 32 characters.
grep -oP '^.{32}$' mstv.txt > mstvtmp.txt
Well, I wrote that as if it succeeded, but when I re-verified, it didn't work. It worked when I changed 32 to 31. "Huh, why?" Can someone please tell me?
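A likely answer: `echo` appends a trailing newline, and `wc -c` counts bytes including it, so the 31-character URL reports as 32. A quick check (`printf '%s'` emits no newline):

```shell
# echo adds a trailing newline, which wc -c counts
echo "youtube.com/watch?v=g5HQFrSk4OA" | wc -c        # 32
# printf '%s' emits the string with no trailing newline
printf '%s' "youtube.com/watch?v=g5HQFrSk4OA" | wc -c # 31
```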
grep -oP '^.{31}$' mstv.txt > mstvtmp.txt
youtube.com/watch?v=g5HQFrSk4OA
youtube.com/watch?v=g5HQFrSk4OA
youtube.com/watch?v=RIqxyO3S8pc
youtube.com/watch?v=RIqxyO3S8pc
Since it ends up with duplicates like this...
I'll remove duplicate lines with `uniq`. (Caveat: `uniq` only collapses *adjacent* duplicates, so strictly the file should be sorted first; it happens to work here because the duplicates come out back-to-back.)
uniq mstvtmp.txt > newmstv.txt
Just to be sure, I also appended the non-duplicated items. (In hindsight this step is redundant: the first `uniq` pass already emits every line once, so `uniq -u` only re-adds lines that are already there.)
uniq -u mstvtmp.txt >> newmstv.txt
youtube.com/watch?v=jL4NZ913v8E
youtube.com/watch?v=g5HQFrSk4OA
youtube.com/watch?v=RIqxyO3S8pc
youtube.com/watch?v=W7G-QtbTWgs
youtube.com/watch?v=DRVp_cmW3Nw
Alright.
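For the record, the whole dedup detour can be collapsed into one command (a sketch with stand-in data for mstvtmp.txt):

```shell
# stand-in for mstvtmp.txt; with sort -u the duplicates need not be adjacent
printf '%s\n' 'youtube.com/watch?v=g5HQFrSk4OA' \
              'youtube.com/watch?v=RIqxyO3S8pc' \
              'youtube.com/watch?v=g5HQFrSk4OA' > mstvtmp.txt
# sort groups identical lines; -u keeps exactly one copy of each
sort -u mstvtmp.txt > newmstv.txt
cat newmstv.txt
```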
So, I'll try to download them with yt-dlp.
This time, I'll save them in mp4 format.
sed -i 's|youtube|yt-dlp -o "/media/ncp/yt/n/%(title)s" -f "bv[ext=mp4]+ba[ext=m4a]" --merge-output-format mp4 https://www.youtube|g' newmstv.txt
In the process, the replacement also rewrites each URL to the "www" form. Some might ask, "Why not just anchor the match with `^` at the start of the line?!" but personally I prefer replacing from a distinctive string in the middle; it feels more robust when the line doesn't start exactly where you expect.
Also, since the replacement text contains `/`, I'm using `|` as the `sed` delimiter. It doesn't have to be `|`, though.
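To see what the substitution produces, here's the same `sed` run on a single sample line (a sketch; the output path is just the one from above):

```shell
# one line of newmstv.txt before substitution
echo 'youtube.com/watch?v=jL4NZ913v8E' |
  sed 's|youtube|yt-dlp -o "/media/ncp/yt/n/%(title)s" -f "bv[ext=mp4]+ba[ext=m4a]" --merge-output-format mp4 https://www.youtube|'
```

Each line turns into a full yt-dlp invocation ending in the `https://www.youtube.com/watch?v=...` URL.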
And then, let's rename it to a .sh file.
mv newmstv.txt ytdl.sh
vi ytdl.sh
and add `#!/bin/bash` as the first line (a shebang only works at the very top, so prepend it rather than append it).
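If you'd rather skip `vi`, GNU `sed` can prepend the shebang non-interactively (a sketch on a demo file; `1i` inserts text before line 1):

```shell
# demo file standing in for ytdl.sh
printf '%s\n' 'echo hello' > ytdl_demo.sh
# GNU sed: insert a shebang before the first line, editing in place
sed -i '1i #!/bin/bash' ytdl_demo.sh
head -n 1 ytdl_demo.sh
```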
Let's give it execution permission.
chmod +x ./ytdl.sh
(Here I suddenly went "I must do it with `./`!" and added it just now. A man who does everything belatedly. It makes no difference to `chmod` either way, but leaving it out would disqualify me as a man.)
After that, just run it and you're done.
By the way, there must be a much cleaner way to extract the URL strings at the initial `grep` stage. I'm still very much an amateur, so please forgive me; I haven't even investigated the whys yet, and I'm surely doing things in a roundabout way. But this was a session about how fun it is to type various commands.
End.