Difference between revisions of "DAT Format"
Imathrowback (Talk | contribs) m |
Imathrowback (Talk | contribs) (→DAT files) |
Latest revision as of 06:20, 14 April 2017
DAT files
A basic python reference implementation can be found here: https://github.com/imathrowback/rift_datamine_tools/blob/master/src/RIFTDataParser.py
A better reference implementation in C# can be found here: https://github.com/imathrowback/telarafly/blob/master/Assets/DatParser/Parser.cs
Most of the RIFT data is stored in a binary serialized format. It is unknown which library is used to generate it, though it appears to be very similar to protocols such as:
- BSON - http://bsonspec.org/spec.html
- Google protobuf - https://developers.google.com/protocol-buffers/docs/encoding
The format is exclusively little-endian and makes liberal use of LEB128-encoded data for basic ints - https://en.wikipedia.org/wiki/LEB128
The format consists of a "class" code, followed by the binary serialized data, and is terminated by a 0x07 byte as the class terminator. Classes can be embedded within other classes.
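As a minimal sketch of the LEB128 decoding mentioned above, the following reads unsigned and signed LEB128 values from a byte stream. The function names `read_uleb128` and `read_sleb128` are illustrative and are not taken from the reference implementations. The signed variant shown is the standard sign-extended LEB128; whether the DAT format uses exactly this encoding for its signed values is an assumption.

```python
import io


def read_uleb128(stream):
    """Read one unsigned LEB128 integer from a binary stream."""
    result = 0
    shift = 0
    while True:
        raw = stream.read(1)
        if not raw:
            raise EOFError("stream ended inside a LEB128 value")
        byte = raw[0]
        # The low 7 bits carry data, least significant group first.
        result |= (byte & 0x7F) << shift
        shift += 7
        # A clear high bit marks the last byte of the value.
        if byte & 0x80 == 0:
            return result


def read_sleb128(stream):
    """Read one standard (sign-extended) signed LEB128 integer.

    ASSUMPTION: the DAT format may or may not use this exact signed encoding.
    """
    result = 0
    shift = 0
    while True:
        raw = stream.read(1)
        if not raw:
            raise EOFError("stream ended inside a LEB128 value")
        byte = raw[0]
        result |= (byte & 0x7F) << shift
        shift += 7
        if byte & 0x80 == 0:
            # Bit 6 of the final byte is the sign bit.
            if byte & 0x40:
                result -= 1 << shift
            return result


# The example values from the LEB128 Wikipedia article.
print(read_uleb128(io.BytesIO(bytes([0xE5, 0x8E, 0x26]))))  # 624485
print(read_sleb128(io.BytesIO(bytes([0xC0, 0xBB, 0x78]))))  # -123456
```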
A DAT file can contain one or many top level classes.
Each "class" is made of zero or more "members". Each member is preceded by a bit-packed code. This code is read as a LEB128 value and decoded into a "typecode" and "extradata" as per the "splitCode" function in the reference implementation. The typecode determines how the following data should be deserialized, and the extradata is additional data used to determine positions, such as an array index or array size.
There are 12 known codes, listed below. Note that the actual use of the values depends on the class, so a code "1" may be a boolean in one class but an int in another.
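The general idea of splitting one bit-packed code into two fields can be sketched as follows. This is not a copy of "splitCode": the actual bit layout is defined by the reference implementation, and the layout here (typecode in the low bits, extradata in the remaining high bits, with the width passed in as the hypothetical parameter `type_bits`) is an assumption made purely for illustration.

```python
def split_code(code, type_bits):
    """Split a decoded LEB128 code into (typecode, extradata).

    ASSUMPTION: the typecode occupies the low `type_bits` bits and the
    extradata the remaining high bits. Check "splitCode" in the reference
    implementation for the real layout before relying on this.
    """
    if code < 0 or type_bits <= 0:
        raise ValueError("code must be non-negative and type_bits positive")
    typecode = code & ((1 << type_bits) - 1)
    extradata = code >> type_bits
    return typecode, extradata


# With a hypothetical 3-bit typecode, 0b101011 splits into typecode 3
# (low bits 011) and extradata 5 (high bits 101).
print(split_code(0b101011, 3))  # (3, 5)
```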
- 0 - Boolean false or null. No additional bytes are needed.
- 1 - Boolean true. This can also represent a value of "1" depending on the class type - No additional bytes are needed.
- 2 - Variable length encoded unsigned long - number of bytes depends on the value
- 3 - Variable length encoded signed long - number of bytes depends on the value
- 4 - 4 bytes. Generally used as an int but can also represent other types such as float
- 5 - 8 bytes. Generally used as a double, but can also represent other types such as Windows FILETIME or a long
- 6 - byte data. Read a LEB128 value for data length, followed by the bytes. Generally used for strings.
- 8 - End of class object
- 9 - same as code 10 except without the classcode
- 10 - class object. The next LEB128 is the classcode, followed by its members (can be recursive)
- 11 - primitive array. The "extradata" determines the size of the array. The next LEB128 is then decoded, and the "extradata" of that is the type of the children, which is followed by the children themselves
- 12 - array of pairs - likely indexed array or hashmap
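The fixed-width and length-prefixed codes (4, 5 and 6) can be sketched as below. Because the meaning of a value depends on the class, the sketch returns every plausible interpretation instead of picking one. All function names are illustrative and not taken from the reference implementations, and treating the code 6 length prefix as an unsigned LEB128 is an assumption.

```python
import io
import struct
from datetime import datetime, timedelta, timezone


def read_exact(stream, n):
    """Read exactly n bytes or raise EOFError."""
    raw = stream.read(n)
    if len(raw) != n:
        raise EOFError("expected %d bytes, got %d" % (n, len(raw)))
    return raw


def read_uleb128(stream):
    """Read one unsigned LEB128 integer."""
    result, shift = 0, 0
    while True:
        byte = read_exact(stream, 1)[0]
        result |= (byte & 0x7F) << shift
        shift += 7
        if byte & 0x80 == 0:
            return result


def read_code4(stream):
    """Code 4: 4 little-endian bytes, viewed as an int and as a float."""
    raw = read_exact(stream, 4)
    return {"int32": struct.unpack("<i", raw)[0],
            "float32": struct.unpack("<f", raw)[0]}


def read_code5(stream):
    """Code 5: 8 little-endian bytes, viewed as a long and as a double."""
    raw = read_exact(stream, 8)
    return {"int64": struct.unpack("<q", raw)[0],
            "float64": struct.unpack("<d", raw)[0]}


def filetime_to_datetime(ticks):
    """Convert a Windows FILETIME (100 ns ticks since 1601-01-01 UTC)."""
    epoch = datetime(1601, 1, 1, tzinfo=timezone.utc)
    return epoch + timedelta(microseconds=ticks // 10)


def read_code6(stream):
    """Code 6: a LEB128 length followed by that many raw bytes."""
    length = read_uleb128(stream)
    return read_exact(stream, length)


print(read_code4(io.BytesIO(struct.pack("<f", 1.0))))
print(read_code6(io.BytesIO(b"\x05hello")))
print(filetime_to_datetime(116444736000000000))
```

Which interpretation applies to a given member has to come from knowledge of the class being read; the byte stream itself does not say.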
The serializer appears to omit default values, so empty/null array entries are not written and will not be deserialized. In this case, the extradata code usually indicates the index of the item read. (Note: there is currently a bug in the Python reference implementation where this does not work.)
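One way to handle the omitted entries is to rebuild a dense array after reading, filling the gaps with a default. The sketch below is only an illustration: the helper name `densify` is hypothetical, and it assumes that the extradata is a zero-based index and that the total array size is already known.

```python
def densify(entries, size, default=None):
    """Expand (index, value) pairs into a list of `size` items.

    ASSUMPTION: each index comes from the member's extradata, is
    zero-based, and positions that were never written hold `default`.
    """
    dense = [default] * size
    for index, value in entries:
        if not 0 <= index < size:
            raise IndexError("index %d outside array of size %d" % (index, size))
        dense[index] = value
    return dense


# Only entries 0 and 3 were present in the data; the rest are defaults.
print(densify([(0, "a"), (3, "d")], 5))  # ['a', None, None, 'd', None]
```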