Re: Regex grep memo to extract YouTube URLs from continuous text and then yt-dlp ~The "Can't Escape the Question Mark!?" Chapter~
Hello, I'm incompetent.
True to my incompetence, I can't even tell whether that title is parodying a manga or an anime.
This article from last time.
https://soulminingrig.com/blog/連続したテキストからyoutube-url抜くだけでの正規表現grep/
At the time I was writing other web scraping scripts and was half-dead from sleep deprivation, typing away around 4 AM, so both the commands and the text were a mess.
Let me summarize it again.
grep -oP 'youtube\.com\/watch\?v\=...........' outbox.json > pyaa.txt
With -P, grep uses Perl-compatible regular expressions, so the escapes above work.
The following is also possible
grep -oP 'youtube\.com\/watch\?v\=.{11}' outbox.json > pyaa.txt
That said, Perl regular expressions can be quite heavy, so if you're processing a lot of text or calling this inside a script, it may be better to stick to plain grep where possible, like this:
grep -o 'youtube\.com/watch?v=...........' outbox.json > pyaa.txt
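To check that the Perl-regex and basic-regex variants really extract the same thing, here is a self-contained sketch (the sample line and video ID are made up, not from the real outbox.json):

```shell
# A made-up line of "continuous text" containing one YouTube URL
line='foo youtube.com/watch?v=g5HQFrSk4OA bar'

# Perl regex (-P): \? escapes the question mark, .{11} grabs the 11-char video ID
echo "$line" | grep -oP 'youtube\.com/watch\?v=.{11}'

# Basic regex (-o): a bare ? is already literal in BRE, eleven dots stand in for .{11}
echo "$line" | grep -o 'youtube\.com/watch?v=...........'

# Both print: youtube.com/watch?v=g5HQFrSk4OA
```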
By the way, can escaping a question mark only be done with Perl regular expressions?
grep -o 'youtube\.com/watch\?v=...........' outbox.json > owo.txt
The above is not good.
The backslash is the problem: in basic regex, `\?` is treated as a quantifier meaning "match the preceding character 0 or 1 times" (a GNU extension), so it no longer matches a literal question mark.
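This is easy to get backwards, so here's a quick sanity check (assuming GNU grep; the sample strings are made up):

```shell
# In GNU basic regex, \? means "preceding character 0 or 1 times",
# so 'watch\?v' matches "watchv" or "watcv" -- never a literal "watch?v"
echo 'watch?v=abc' | grep -o 'watch\?v' || echo 'no match'   # no match
echo 'watchv=abc'  | grep -o 'watch\?v'                      # prints: watchv
```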
So, after looking it up, it seems that enclosing `?` in brackets as `[?]` makes it match the literal character.
https://stackoverflow.com/questions/10602433/how-to-escape-a-question-mark-in-r
grep -o 'youtube\.com/watch[?]v=...........' outbox.json > pyaa.txt
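To convince yourself the bracket trick works without `-P` (the sample URL below is made up):

```shell
# A bracket expression containing one character matches exactly that
# literal character in every regex flavor, including basic grep
echo 'youtube.com/watch?v=g5HQFrSk4OA' | grep -o 'watch[?]v='
# prints: watch?v=
```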
This works perfectly!
Let's delete the lines ending in unnecessary "/", "\", and "<"!
At first, I suspected my grep command, but I was wrong.
For some reason, outbox.json, which exports Mastodon posts, produces the following output:
class=\"ellipsis\"\u003eyoutube.com/watch?v=DRVp_cmW3N\u003c/span\u003e\u003cspan class=\"invisible\"\u003es\u0026amp;feature=share7\u003c/span\u003e\u003c/a\u003e\u003c/p\u003e\","contentMap":{"ja":\"\u003cp\u003e\u003ca
If you look closely, the last (11th) character of the video ID is missing and has been replaced by a backslash! Of course, in this state it can't be used as a working URL.
youtube.com/watch?v=DRVp_cmW3Nw
youtube.com/watch?v=DRVp_cmW3N\
Fortunately, the correct URLs were also grepped out (redundantly), so I just need to delete the broken lines.
Also, some lines end with a "<". What is that...?
sed '/\\$\|\/$\|<$/d' pyaa.txt > nyan.txt
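A minimal check of that sed filter (the IDs below are made up; the `\|` alternation assumes GNU sed):

```shell
# Mimic the grep output: one good URL plus junk lines ending in \, /, and <
printf '%s\n' \
  'youtube.com/watch?v=DRVp_cmW3Nw' \
  'youtube.com/watch?v=DRVp_cmW3N\' \
  'youtube.com/watch?v=DRVp_cmW3N/' \
  'youtube.com/watch?v=DRVp_cmW3N<' |
  sed '/\\$\|\/$\|<$/d'
# prints only: youtube.com/watch?v=DRVp_cmW3Nw
```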
Next, output the deduplicated lines and the never-duplicated lines separately with uniq:
uniq nyan.txt > nyanya.txt
uniq -u nyan.txt >> nyanya.txt
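One caveat worth hedging: uniq only collapses *adjacent* duplicates, so this behaves predictably only if the input is sorted (or if grep happened to emit duplicates next to each other). A toy sketch with made-up lines:

```shell
# uniq only sees adjacent repeats, so sort first to be safe
printf '%s\n' 'url-a' 'url-b' 'url-a' 'url-a' | sort | uniq     # url-a, url-b
printf '%s\n' 'url-a' 'url-b' 'url-a' 'url-a' | sort | uniq -u  # only url-b
```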
Done!!!
Finally, convert to yt-dlp format with sed
sed 's|youtube|yt-dlp -o "/media/ncp/yt/n/%(title)s" -f "bv[ext=mp4]+ba[ext=m4a]" --merge-output-format mp4 https://www.youtube|g' nyanya.txt > nya.sh
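To see the transformation on a single line (same sed expression; the output path and format flags are from above, the sample ID is made up):

```shell
# Each "youtube..." line becomes a runnable yt-dlp command;
# sed does not rescan the replacement, so the second "youtube" is safe
echo 'youtube.com/watch?v=g5HQFrSk4OA' |
  sed 's|youtube|yt-dlp -o "/media/ncp/yt/n/%(title)s" -f "bv[ext=mp4]+ba[ext=m4a]" --merge-output-format mp4 https://www.youtube|g'
# yt-dlp -o "/media/ncp/yt/n/%(title)s" -f "bv[ext=mp4]+ba[ext=m4a]" --merge-output-format mp4 https://www.youtube.com/watch?v=g5HQFrSk4OA
```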
That's all! Good work.
For the rest, just add #!/bin/bash at the top, grant execute permission, and run it. I doubt anyone will copy and paste this as-is, but don't forget to change the save directory and other settings.
Finally, about the wc command...
Previous question
echo "youtube.com/watch?v=g5HQFrSk4OA" | wc -c
32
It seems to be 32 characters long.
So, extract only those with 32 characters.
grep -oP '^.{32}$' mstv.txt > mstvtmp.txt
Last time I wrote this as if it had succeeded, but it didn't work when I re-verified it. It worked when I changed 32 to 31. Why is that? Could someone please tell me?
Apparently, wc -c counts the trailing newline too: echo appends a newline, so the pipe reports 32 even though the URL itself is 31 characters (it's `echo ''` that gives 1; a truly empty file created with `touch` reports 0). grep, on the other hand, matches the line content without its newline, which is why `^.{31}$` is the right pattern. So when counting characters in a file with wc -c, subtract the number of lines.
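The difference is easy to see by suppressing the newline with printf:

```shell
# echo appends a trailing newline, and wc -c counts it as a byte
echo 'youtube.com/watch?v=g5HQFrSk4OA' | wc -c         # 32
# printf '%s' emits the string with no newline: the true length
printf '%s' 'youtube.com/watch?v=g5HQFrSk4OA' | wc -c  # 31
```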
See you again.
Best regards.