Re: Regex grep memo to extract YouTube URLs from continuous text, then yt-dlp ~Can't Escape the Question Mark!? Chapter~



Hello, I'm incompetent.

True to my incompetence, I don't even know whether that article title is a manga or an anime reference.

Here is the article from last time:

https://soulminingrig.com/blog/連続したテキストからyoutube-url抜くだけでの正規表現grep/

At the time, I was half-dead from sleepiness, writing other web-scraping scripts at around 4 AM, so both the commands and the text were a mess.

Let me summarize it again.

grep -oP 'youtube\.com\/watch\?v\=…' outbox.json > pyaa.txt

The above uses Perl-compatible regular expressions (the `-P` option).
The following also works:

grep -oP 'youtube\.com\/watch\?v\=.{11}' outbox.json > pyaa.txt
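As a sanity check, the `.{11}` form can be tried on a throwaway file (the file name and sample line here are made up for illustration; `-P` requires GNU grep):

```shell
# Create a throwaway sample (contents invented for this demo)
printf 'blah blah https://www.youtube.com/watch?v=g5HQFrSk4OA blah\n' > sample.txt

# -o prints only the matched part; -P enables Perl-compatible regex
grep -oP 'youtube\.com/watch\?v=.{11}' sample.txt
# → youtube.com/watch?v=g5HQFrSk4OA
```

`.{11}` grabs exactly the 11-character video ID that follows `v=`.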

Also, Perl-compatible regex matching can be quite heavy, so if you're processing a lot of text or calling grep inside a script, it may be better to use a plain-grep command like the following where possible:

grep -o 'youtube\.com/watch?v=…' outbox.json > pyaa.txt

By the way, can escaping a question mark only be done with Perl regular expressions?

grep -o 'youtube\.com/watch\?v=…' outbox.json > owo.txt

The above does not work.

In basic regular expressions the backslash has the opposite effect: with GNU grep, `\?` becomes a quantifier meaning "the preceding character appears 0 or 1 times", while an unescaped `?` is just a literal question mark.

So, after looking it up: if you enclose the `?` in brackets as `[?]`, it is treated as a literal character.

https://stackoverflow.com/questions/10602433/how-to-escape-a-question-mark-in-r

grep -o 'youtube\.com/watch[?]v=…' outbox.json > pyaa.txt

This works perfectly!
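To see the difference concretely, here's a small sketch (file name and contents invented): with GNU grep's basic regex, `\?` quantifies the preceding `h`, while `[?]` always matches a literal question mark.

```shell
# Two test lines: a real URL, and a bogus one with no "h" or "?" at all
printf 'youtube.com/watch?v=abcdefghijk\nyoutube.com/watcv=xxxxxxxxxxx\n' > demo.txt

# \? is a BRE quantifier: "watc", optional "h", then "v=" -- so only the bogus line matches!
grep 'watch\?v=' demo.txt
# → youtube.com/watcv=xxxxxxxxxxx

# [?] is a bracket expression: a literal "?", so only the real URL matches
grep 'watch[?]v=' demo.txt
# → youtube.com/watch?v=abcdefghijk
```

(The `\?` quantifier in basic regex is a GNU extension; strictly POSIX greps may behave differently.)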

Next, let's delete the lines ending in unwanted "/", "\", and "<" characters!

At first, I suspected my grep command, but I was wrong.

For some reason, outbox.json (the file Mastodon exports posts to) contains output like the following:

class=\"ellipsis\"\u003eyoutube.com/watch?v=DRVp_cmW3N\u003c/span\u003e\u003cspan class=\"invisible\"\u003es\u0026amp;feature=share7\u003c/span\u003e\u003c/a\u003e\u003c/p\u003e\","contentMap":{"ja":\"\u003cp\u003e\u003ca

If you look closely, the last (11th) character of the video ID "DRVp_cmW3Nw" is missing and has been replaced by a backslash! Of course, in this state it can't be saved as a valid URL:

youtube.com/watch?v=DRVp_cmW3Nw
youtube.com/watch?v=DRVp_cmW3N\

Fortunately, the correctly formed URLs were also grepped redundantly, so I just need to delete the broken ones.
Also, some lines end in `<`. What is that...?

sed '/\\$\|\/$\|<$/d' pyaa.txt > nyan.txt
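A quick sketch of what that sed does, using two lines modeled on the output above (the demo file is created here just for illustration):

```shell
# One good URL and one truncated by the stray backslash
printf 'youtube.com/watch?v=DRVp_cmW3Nw\nyoutube.com/watch?v=DRVp_cmW3N\\\n' > pyaa_demo.txt

# Delete any line ending ($) in a backslash, slash, or "<"
# (the \| alternation in basic regex is a GNU sed extension)
sed '/\\$\|\/$\|<$/d' pyaa_demo.txt
# → youtube.com/watch?v=DRVp_cmW3Nw
```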

Next, use uniq to output the deduplicated lines and the non-repeated lines separately:

uniq nyan.txt > nyanya.txt
uniq -u nyan.txt >> nyanya.txt
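One caveat worth noting: uniq only compares adjacent lines, so if duplicates might be scattered you'd want to `sort` first. A minimal sketch (file name and contents invented):

```shell
# "url_a" appears twice in a row, "url_b" once
printf 'url_a\nurl_a\nurl_b\n' > urls_demo.txt

uniq urls_demo.txt        # collapse runs of identical adjacent lines
# → url_a
# → url_b

uniq -u urls_demo.txt     # print only the lines that are never repeated
# → url_b

# If duplicates may not be adjacent, sort first:
# sort urls_demo.txt | uniq
```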

Done!!!

Finally, convert each line into a yt-dlp command with sed:

sed 's|youtube|yt-dlp -o "/media/ncp/yt/n/%(title)s" -f "bv[ext=mp4]+ba[ext=m4a]" --merge-output-format mp4 https://www.youtube|g' nyanya.txt > nya.sh
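With this substitution, each URL line turns into one full yt-dlp invocation. A sketch with a single made-up input line (the demo file name is invented):

```shell
printf 'youtube.com/watch?v=g5HQFrSk4OA\n' > nyanya_demo.txt

# Prepend the yt-dlp options and the https://www. prefix to every URL
sed 's|youtube|yt-dlp -o "/media/ncp/yt/n/%(title)s" -f "bv[ext=mp4]+ba[ext=m4a]" --merge-output-format mp4 https://www.youtube|g' nyanya_demo.txt
# → yt-dlp -o "/media/ncp/yt/n/%(title)s" -f "bv[ext=mp4]+ba[ext=m4a]" --merge-output-format mp4 https://www.youtube.com/watch?v=g5HQFrSk4OA
```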

That's all! Good work.

For the rest, just add #!/bin/bash at the top, grant execute permission, and run it. I don't think anyone will copy and paste this as-is, but don't forget to change the save directory and other settings.

Finally, about the wc command...

Previous question

echo "youtube.com/watch?v=g5HQFrSk4OA" | wc -c
32

It seems to be 32 characters long.

So, extract only those with 32 characters.

grep -oP '^.{32}$' mstv.txt > mstvtmp.txt

It was written as if it succeeded, but it didn't work when I re-verified it. It worked when I changed 32 to 31. Why is that? Could someone please tell me?

Apparently, wc -c counts the trailing newline too, so even an empty echo (`echo '' | wc -c`) reports 1. grep's `^...$` anchors, on the other hand, match the line content without its newline, which is why `.{31}` is the right length. When measuring line lengths with wc, subtract one newline per line (i.e., subtract the line count from the byte count).
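The newline behavior is easy to confirm (this just reproduces the count from above):

```shell
# echo appends a trailing newline, and wc -c counts it
echo 'youtube.com/watch?v=g5HQFrSk4OA' | wc -c
# → 32

# printf without \n gives the true string length
printf '%s' 'youtube.com/watch?v=g5HQFrSk4OA' | wc -c
# → 31
```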

See you again.

Best regards.
