Commit 2b19be7

add unicode string support
This should at least resolve the issue seen in #77. I created a 'bad' PDF containing 'wmv' (see `strings test/fixtures/name/application/pdf/wmv.pdf | rg wmv2`). With `unicodeBE` and `unicodeLE` support, `wmv` and `wma` files are no longer identified as `video/x-ms-asf` (which is not wrong, since they use the ASF container, but it is not the most specific type). It also resolves the `audio2.mp3` case of issue #125.
1 parent 3d3c5dc commit 2b19be7
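
The effect described in the commit message can be reproduced with a small sketch. Everything here is illustrative: `magic_in_range?` is a hypothetical helper rather than Marcel's matcher, and `PDF_LIKE` and `ASF_LIKE` are made-up byte strings standing in for the real fixtures. The two patterns are the old and new `wmv2` entries from `lib/marcel/tables.rb`.

```ruby
# Hypothetical matcher, not Marcel's implementation: returns true when
# `pattern` starts at some byte offset inside `range` of `data`.
def magic_in_range?(data, range, pattern)
  window = data.byteslice(range.begin, range.size + pattern.bytesize - 1)
  !window.nil? && window.include?(pattern)
end

OLD_WMV2 = 'wmv2'.b                  # ASCII pattern removed by this commit
NEW_WMV2 = "w\000m\000v\0002\000".b  # UTF-16LE pattern added by this commit

# Made-up stand-ins for the fixtures (assumptions, not real file contents).
PDF_LIKE = "%PDF-1.4\n(wmv2)\n".b
ASF_LIKE = "\x30\x26\xB2\x75".b + ("\x00".b * 64) + NEW_WMV2

puts magic_in_range?(PDF_LIKE, 0..8192, OLD_WMV2) # true: the old false positive
puts magic_in_range?(PDF_LIKE, 0..8192, NEW_WMV2) # false: PDF text is not UTF-16LE
puts magic_in_range?(ASF_LIKE, 0..8192, OLD_WMV2) # false: NUL bytes break up the ASCII run
puts magic_in_range?(ASF_LIKE, 0..8192, NEW_WMV2) # true
```

The interleaved NUL bytes are what make the new pattern specific: plain ASCII text that happens to contain `wmv2` no longer matches it.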

7 files changed

+7
-2
lines changed

lib/marcel/tables.rb

Lines changed: 2 additions & 2 deletions
@@ -2521,7 +2521,7 @@ module Marcel
 ['text/html', [[0, b['(?i)<(html|head|body|title|div)[ >]']], [0, b['(?i)<h[123][ >]']]]],
 ['image/svg+xml', [[0..4096, b['<svg']]]],
 ['video/x-msvideo', [[0, b['RIFF'], [[8, b['AVI ']]]], [8, b['AVI ']]]],
-['video/x-ms-wmv', [[0..8192, b['Windows Media Video']], [0..8192, b['VC-1 Advanced Profile']], [0..8192, b['wmv2']]]],
+['video/x-ms-wmv', [[0..8192, b["W\000i\000n\000d\000o\000w\000s\000 \000M\000e\000d\000i\000a\000 \000V\000i\000d\000e\000o\000"]], [0..8192, b["V\000C\000-\0001\000 \000A\000d\000v\000a\000n\000c\000e\000d\000 \000P\000r\000o\000f\000i\000l\000e\000"]], [0..8192, b["w\000m\000v\0002\000"]]]],
 ['video/mp4', [[4, b['ftypmp41']], [4, b['ftypmp42']]]],
 ['audio/mp4', [[4, b['ftypM4A ']], [4, b['ftypM4B ']], [4, b['ftypF4A ']], [4, b['ftypF4B ']]]],
 ['video/quicktime', [[4, b["moov\000"]], [4, b["mdat\000"]], [4, b["free\000"]], [4, b["skip\000"]], [4, b["pnot\000"]], [4, b['ftyp']], [0, b["\000\000\000\bwide"]]]],
@@ -2831,7 +2831,7 @@ module Marcel
 ['audio/x-flac', [[0, b['fLaC']]]],
 ['audio/x-mod', [[0, b['Extended Module:']], [21, b['BMOD2STM']], [1080, b['M.K.']], [1080, b['M!K!']], [1080, b['FLT4']], [1080, b['FLT8']], [1080, b['4CHN']], [1080, b['6CHN']], [1080, b['8CHN']], [1080, b['CD81']], [1080, b['OKTA']], [1080, b['16CN']], [1080, b['32CN']], [0, b['IMPM']]]],
 ['audio/x-mpegurl', [[0, b["#EXTM3U\r\n"]]]],
-['audio/x-ms-wma', [[0..8192, b['Windows Media Audio']]]],
+['audio/x-ms-wma', [[0..8192, b["W\000i\000n\000d\000o\000w\000s\000 \000M\000e\000d\000i\000a\000 \000A\000u\000d\000i\000o\000"]]]],
 ['audio/x-pn-realaudio', [[0, b[".ra\375"]]]],
 ['audio/x-psf', [[0, b['PSF'], [[3, b["\001"]], [3, b["\002"]], [3, b["\021"]], [3, b["\022"]], [3, b["\023"]], [3, b['!']], [3, b["\""]], [3, b['#']], [3, b['A']]]]]],
 ['audio/x-sap', [[0, b["SAP\r\n"]]]],

script/generate_tables.rb

Lines changed: 4 additions & 0 deletions
@@ -65,6 +65,10 @@ def get_matches(mime, parent)
 
 offset = offset.size == 2 ? offset[0]..offset[1] : offset[0]
 case type
+when 'unicodeLE', 'unicodeBE' # Unicode string types (UTF-16 Little/Big Endian)
+  value.gsub!(/\A0x([0-9a-f]+)\z/i) { [$1].pack('H*') }
+  encoding = type == 'unicodeLE' ? Encoding::UTF_16LE : Encoding::UTF_16BE
+  value = value.encode(encoding).force_encoding(Encoding::BINARY)
 when 'string', 'stringignorecase'
   value.gsub!(/\A0x([0-9a-f]+)\z/i) { [$1].pack('H*') }
   value.gsub!(/\\(x[\dA-Fa-f]{1,2}|0\d{1,3}|\d{1,3}|.)/) { eval("\"\\#{$1}\"") }
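
As a rough illustration of what the new `when 'unicodeLE', 'unicodeBE'` branch produces, the same three statements can be wrapped in a standalone method. `unicode_magic` is a hypothetical name used only for this sketch (the generator does the conversion inline), and only plain ASCII input is exercised here.

```ruby
# Sketch of the conversion from the hunk above as a standalone method.
# `unicode_magic` is a hypothetical name, not part of the generator script.
def unicode_magic(value, type)
  value = value.dup # avoid mutating the caller's string
  value.gsub!(/\A0x([0-9a-f]+)\z/i) { [$1].pack('H*') }
  encoding = type == 'unicodeLE' ? Encoding::UTF_16LE : Encoding::UTF_16BE
  # Transcode to UTF-16, then relabel the bytes as binary for byte-wise matching.
  value.encode(encoding).force_encoding(Encoding::BINARY)
end

le = unicode_magic('wmv2', 'unicodeLE')
be = unicode_magic('wmv2', 'unicodeBE')

p le.bytes # => [119, 0, 109, 0, 118, 0, 50, 0]
p be.bytes # => [0, 119, 0, 109, 0, 118, 0, 50]
p le == "w\000m\000v\0002\000".b # => true, the literal now in tables.rb
```

Neither encoding emits a byte order mark, so the result is exactly two bytes per ASCII character, which matches the regenerated entries in `lib/marcel/tables.rb`.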
50.5 KB
Binary file not shown.
3.65 KB
Binary file not shown.
Lines changed: 1 addition & 0 deletions
@@ -0,0 +1 @@
+.ra�
3.83 KB
Binary file not shown.
2.38 KB
Binary file not shown.