Database Corruption Challenge Part 1

Last weekend I saw a tweet mentioning Steve Stedman’s Database Corruption Challenge (DBCC), where he publishes a database that’s been corrupted and challenges the community to recover it (hopefully) without any data loss. What a great idea! I had to enter, and after some digging around I figured out a way to bring the database online with all data intact.

Of the 21 successful entries there were many different solutions. Steve wrote that one even involved modifying the database files on disk using a hex editor. A hex editor?! Which crazy person would do that? Well… umm… actually… that was me. This post covers how I found the problem and what I did to fix the corruption.

(You’ll note that my final solution doesn’t open the database files in a hex editor; however, most of my working out and the modifications themselves used one. For simplicity I’ve switched to using DBCC commands, as they’re easier to repeat.)

Steve’s original blog covers how to get the corrupt database restored and ready for troubleshooting, so I won’t repeat that here; let’s start with the DBCC CHECKDB error message instead.
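For reference, a CHECKDB call along these lines produces the errors we’re about to dissect (I’ve used CorruptionChallenge1 as the database name throughout; substitute whatever name you restored the challenge database under):

    DBCC CHECKDB (N'CorruptionChallenge1') WITH NO_INFOMSGS, ALL_ERRORMSGS;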

The first line of the DBCC error message is very helpful, as it quickly identifies the corrupt page (file 1, page 280) and the record within that page (record 3).

The object with object ID 245575913 is the table Revenue, and we can use DBCC IND to find out that page 1:280 is part of the clustered index. Unfortunately this is a pain: because it’s a clustered index, a simple rebuild won’t fix the problem. Instead let’s look into the header of record 3 and try to figure out what’s wrong.
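As a rough sketch of those lookups (same assumed database name), translating the object ID and listing the table’s pages looks something like this:

    USE CorruptionChallenge1;

    -- Which table does object ID 245575913 belong to?
    SELECT OBJECT_NAME(245575913) AS table_name;

    -- List every page allocated to the Revenue table;
    -- rows with IndexID = 1 belong to the clustered index.
    DBCC IND (N'CorruptionChallenge1', N'Revenue', -1);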

First we can view the page structure using DBCC PAGE:
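(A sketch of the command, assuming dump style 3 for per-record detail; trace flag 3604 sends the output to the client rather than the error log.)

    DBCC TRACEON (3604);

    -- database, file ID 1, page ID 280, dump style 3
    DBCC PAGE (N'CorruptionChallenge1', 1, 280, 3);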

At the end of the page is the record offset array, which is used to locate individual records. From the offset array we can see that record 3 starts at offset 960 (0x3c0). (Note: row numbers actually start from 0. This is important, as otherwise we’d waste our time figuring out what’s corrupt about a non-corrupt row. Not that that happened, of course!)

Before looking at the actual corrupt record, a quick recap of the SQL Server record structure. A record consists of the record header, followed by the fixed-length data, the NULL bitmap and finally the data relating to the variable-length columns (see Paul Randal’s blog for more information).

[Image: SQL Server record structure: record header, fixed-length data, NULL bitmap, variable-length column data]

As we know which table the corruption was found in, we can look at its schema to see how an individual record should look.
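If you don’t fancy scripting out the DDL, a quick way to see the column layout is to query the catalog views (the only input here is the object ID quoted above):

    SELECT c.column_id,
           c.name AS column_name,
           t.name AS type_name,
           c.max_length
    FROM sys.columns AS c
    JOIN sys.types AS t ON t.user_type_id = c.user_type_id
    WHERE c.object_id = 245575913
    ORDER BY c.column_id;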

The four fixed-length columns are all 4-byte integers, so together with the 4-byte record header the fixed-length portion of the record takes up the first 20 bytes.

[Image: the fixed-length portion of the record]

Note: I’ve deleted the last two columns for simplicity

Now let’s look at the corrupt record in detail. Below is the relevant part of the output of DBCC PAGE for record 3 in page 280.

[Image: DBCC PAGE output for record 3 of page 1:280]

We can transpose the red portion of the record onto our table. Note that you need to read multi-byte values from right to left, so a 2-byte value shown as 14 00 in the DBCC PAGE output is actually 0x0014. In my diagram and description below I’ve reversed the bytes so it’s easier to understand, but we’ll need to re-reverse them later when writing the fix back to the page.

[Image: the corrupt record’s bytes transposed onto the record structure]

The fixed length portion of the record isn’t interesting so I’ll skip it.

Bytes 20 and 21 contain the total number of columns in the record. The value is 0x6161, which is 24929 in decimal. There’s definitely something fishy going on, as we know from looking at the table DDL that there are five columns, right?

Well… not quite. Because the clustered index is non-unique there’s an additional uniquifier column, which takes the total number of columns per record to six (to confirm this you could look at bytes 20 and 21 of row 2 and see the value 0x0006, i.e. 6 in decimal).
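You can confirm the clustered index really is non-unique (and therefore carries a uniquifier) from sys.indexes; a quick sketch:

    SELECT name, type_desc, is_unique
    FROM sys.indexes
    WHERE object_id = 245575913
      AND index_id = 1;  -- index_id 1 is always the clustered index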

So let’s change the value from 0x6161 to match the actual number of columns (0x0006).

[Image: record header with the column count corrected to 0x0006]

To make the change we can use DBCC WRITEPAGE. This has the benefit of recalculating the page checksum for us, which matters because Steve was clever enough to make sure the page checksum was still valid after his modifications. WRITEPAGE takes the file and page ID and also the byte offset within that page. In our case the offset is 980 bytes: 960 bytes to the start of the third record plus 20 bytes into that record.

IMPORTANT NOTE! Please be careful running DBCC WRITEPAGE. It blindly trusts the input you give and therefore can very easily corrupt a database. Do not run it in production unless you fancy a career change!
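With that warning out of the way, here’s a sketch of the call I mean (database name assumed as before, and I believe the database needs to be in single-user mode first). The 2-byte column count 0x0006 is stored little-endian, so the bytes written to the page are 06 00:

    ALTER DATABASE CorruptionChallenge1 SET SINGLE_USER WITH ROLLBACK IMMEDIATE;

    -- file 1, page 280, offset 980: write 2 bytes containing the corrected column count
    DBCC WRITEPAGE (N'CorruptionChallenge1', 1, 280, 980, 2, 0x0600);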

Calling WRITEPAGE returns an error, but maybe it fixed the problem anyway?

Hmm. Something’s still not right. The error message still mentions record 3 of page 280 so let’s carry on looking at the broken record.

[Image: DBCC PAGE output with the NULL bitmap highlighted in blue]

Byte 22 contains the NULL bitmap itself (in blue). The bitmap uses one bit per column, so its total size is the number of columns divided by 8, rounded up; with six columns the bitmap takes up just a single byte. In our corrupt row the value is 0x61, which in binary is 01100001. This took some investigating, but in SQL 2014 unused parts of the NULL bitmap are set to 0. As there are only six columns, the two high-order bits should both be 0, but here bit 6 is set.

Reconstructing the NULL bitmap involved some guesswork. I decided to copy the NULL bitmap from row 2 (value 0x00), but it was possible that row 3 had some NULL columns that weren’t NULL in row 2. In hindsight I could have used one of the techniques in Steve’s answer to view the data using a different index, but I didn’t think of that at the time.

Adding the extra byte to DBCC WRITEPAGE results in the following command:
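(A sketch of that command: the same write at offset 980, now 3 bytes long, with the replacement NULL bitmap 0x00 appended after the column count; database name assumed as before.)

    -- 2-byte column count (06 00) followed by the 1-byte NULL bitmap (00)
    DBCC WRITEPAGE (N'CorruptionChallenge1', 1, 280, 980, 3, 0x060000);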

This time no error. Let’s run CHECKDB:

Looking better! Can we retrieve data from the table?
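Something as simple as the following does the job (schema assumed to be dbo):

    SELECT * FROM dbo.Revenue;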

[Image: SELECT results with the repaired record highlighted in red]

Yes – we can even see the record we fixed (in red).

The corruption has been fixed and the database is back online – it’s as simple as that!

Conclusion

Of course this is not something you want to be doing with production data. It’s one thing to play around on a Sunday afternoon, but it’s a completely different activity when you’re under pressure from users to get their system back online. In my environment I’d definitely be reaching for a recent backup instead!


One Response to Database Corruption Challenge Part 1

  1. Eduardo says:

    Many years ago this happened to a fellow DBA. The tapes were at offsite storage and were compromised. So I do it with my current backups, but I randomly pull tapes and make sure that the backups written to tape will work as well.

    In my current position I have a server that I set up to do this for me. There are scripts that copy the latest backup from all my SQL Server instances, restore several each night (recording how long it took), run DBCC CHECKDB and write the output into a table. Then I put it into another database where I record other info such as the file locations and when the last backups/transaction dumps were run, etc. I get a daily report from SSRS telling me if there were DBCC errors or if full/log backups have not been run in the past 3 days.

    Took a while to build it, but it is great for helping me find errors, new databases that have been added, or even new servers that have been added!
