Attachments migrated from SharePoint are marked with a type of "Unknown"

Question

We are doing a migration from SharePoint with the intention of using dbFront but the attachments are not recognized by dbFront.

We’re coping data from sharepoint via Access to SQL Server. One of my tables has files that I’m copying to a varbinary in SQL. I also have a filename field defined that I’m copying that data to. When I open that in dbFront, I get “unknown” – I see the article on your site that says dbFront gets the data from the filename field, but it can show unknown if something doesn’t match.

TXT file type changes to "Unknown" after upload.

For example, I have a PDF that shows unknown, but it really is a PDF (if I download it & change the extension, it opens just fine).

asked in How To by JustAskin (7.6k points)
edited by AnthonyV

1 Answer

AnthonyV · Answer 1

Thanks for the samples. Based on those samples it appears that SharePoint (or Access) is adding an extra file header before the actual file content that is "corrupting" the standard file header.

PDF
0x14 00 00 00 01 00 00 00 04 00 00 00 70 00 64 00 66 00 00 00 >> PDF header follows

DOCX
0x16 00 00 00 01 00 00 00 05 00 00 00 64 00 6F 00 63 00 78 00 00 00 >> DOCX header follows

The Sharepoint Header

Based on the limited examples, and totally unsuccessful searches, I suspect the attachment header does the following.

The first byte specifies the total SharePoint header length. e.g. 0x14 = 20, 0x16 = 22
Followed by "00 00 00 01 00 00 00"
The next byte seems to specify the attachment extension length + 1 as double bytes.
After that a utf16 encoded attachment type follows with an opening NULL:
00 00 00 70 00 64 00 66 = PDF
00 00 00 64 00 6F 00 63 00 78 = DOCX
Followed by "00 00 00"
That concludes the sharepoint header and the actual attachment with its normal header follows.

Why did the PDF work without issue

The PDF files and PDF file readers are designed to deal with a much higher level of garbage because they know the PDF files are hosted and shipped around using all kinds of different technologies and devices that are not always as compatible as they should be.

Other document files are not handled so robustly.

The FIX

I have updated the next release of dbFront (1.2.6.x) so that it will simply accept the extra sharepoint header for PDF files without complaint. The reasoning is that it covers a large portion of the files uploaded to SharePoint and there are no client errors if the file is downloaded.

For all other file types, you can use SQL to remove the file headers.

The following TSQL will strip the SharePoint file header from PDF Files.

UPDATE [AdventureWorksLT].[SalesLT].[Product] SET
   [ThumbNailPhoto] = SUBSTRING([ThumbNailPhoto], 21, DATALENGTH([ThumbNailPhoto]))
WHERE 
   SUBSTRING([ThumbNailPhoto], 0, 25) = 
   0x140000000100000004000000700064006600000025504446;

NOTE: The header length and content are different for each file type, so unique update queries will be require for each type.

The following query should find all records that require fixing.

SELECT * FROM [AdventureWorksLT].[SalesLT].[Product]
WHERE SUBSTRING([ThumbNailPhoto], 2, 7) = 0x00000001000000;

Attachments migrated from SharePoint are marked with a type of "Unknown"

Please log in or register to add a comment.

Please log in or register to answer this question.

1 Answer

The Sharepoint Header

Why did the PDF work without issue

The FIX

Please log in or register to add a comment.

Categories

Most popular tags