
Data Model

All data files in the blind solver pipeline are FITS files.

File formats that interact with the outside world are in nice, portable FITS formats. The input files are:

  • Astrometry.net catalogs
  • field files (formerly XYLS files)
  • RDLS files

The output files are:

  • match files

There are also a number of file formats that are internal to the pipeline. These are also FITS files, but they contain big chunks of binary data that are stored in native-endian format and size. We chose to do this because doing so yields huge performance gains, and also because these files are internal to the pipeline and don't change very often. We're basically abusing the FITS format and using FITS libraries to get nice header handling.

These formats are:

  • objs: star catalogs
  • id: star identities
  • quad: stars contained in quads
  • code: shape descriptors of quads
  • ckdt: code kd-tree
  • skdt: star kd-tree
  • qidx: quad-index: quads a star belongs to

The endianness and data sizes are documented in the FITS headers: ENDIAN, UINT_SZ, and DUBL_SZ.

Each FITS file has a header (`AN_FILE') that says what kind of file it is.

AN_FILE   File suffix     Source file
OBJS      objs.fits       catalog.c
ID        id.fits         idfile.c
QUAD      quad.fits       quadfile.c
CODE      code.fits       codefile.c
CKDT      ckdt.fits       codetree.c / kdtree_fits_io.c
SKDT      skdt.fits       startree.c / kdtree_fits_io.c
QIDX      qidx.fits       qidxfile.c
--        fits            an_catalog.c
MATCH     match, agree    matchfile.c
XYLS      (varies) fits   xylist.c
RDLS      rdls.fits       rdlist.c
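
To check which kind of file you have, it is enough to read the AN_FILE card from the primary header. The following is a minimal sketch, not the pipeline's own code: it relies only on the standard FITS header layout (80-character cards, headers padded to 2880-byte blocks) and does not handle escaped quotes inside string values. The helper names parse_cards and make_card, and the sample header, are made up for illustration.

```python
CARD_LEN = 80      # a FITS header card is 80 ASCII characters
BLOCK_LEN = 2880   # FITS headers are padded to a multiple of 2880 bytes


def parse_cards(header_bytes):
    """Return {keyword: value string} for simple 'KEY     = value / comment' cards."""
    cards = {}
    for i in range(0, len(header_bytes), CARD_LEN):
        card = header_bytes[i:i + CARD_LEN].decode("ascii")
        keyword = card[:8].strip()
        if keyword == "END":
            break
        if card[8:10] != "= ":
            continue  # COMMENT, HISTORY, or blank card: no value to record
        value = card[10:]
        if value.lstrip().startswith("'"):
            # Quoted string value: take the text between the first pair of quotes.
            start = value.index("'")
            end = value.index("'", start + 1)
            cards[keyword] = value[start + 1:end].rstrip()
        else:
            # Numeric or logical value: drop any trailing '/ comment'.
            cards[keyword] = value.split("/", 1)[0].strip()
    return cards


def make_card(keyword, value):
    """Build one 80-byte card (hypothetical helper, used only for the demo)."""
    return f"{keyword:<8}= {value}".ljust(CARD_LEN).encode("ascii")


# A made-up header standing in for the start of a qidx.fits file.
header = (make_card("SIMPLE", "T".rjust(20))
          + make_card("AN_FILE", "'QIDX    '")
          + "END".ljust(CARD_LEN).encode("ascii"))
header = header.ljust(BLOCK_LEN, b" ")

cards = parse_cards(header)
file_type = cards["AN_FILE"]
```

In practice a FITS library does this for you; the sketch only shows how little is needed to tell the file kinds apart.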

OBJS

Contains a single table with a single column named xyz, which lists the star positions on the sphere.

Each row is three native-endian doubles (x, y, and z).
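
As an illustration of this row layout, here is a small sketch that decodes a raw xyz column with Python's struct module. It assumes DUBL_SZ is 8 and that the file was written on a machine with the same byte order as the reader; the function name decode_xyz and the sample bytes are made up.

```python
import struct

# '=' means native byte order with standard sizes, so 'd' is an 8-byte double.
ROW = struct.Struct("=3d")


def decode_xyz(data):
    """Split a raw xyz column into a list of (x, y, z) tuples, one per star."""
    return list(ROW.iter_unpack(data))


# Made-up bytes standing in for the table's data chunk: two unit vectors.
raw = ROW.pack(1.0, 0.0, 0.0) + ROW.pack(0.0, 0.6, 0.8)
stars = decode_xyz(raw)
```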

ID

Contains a single table with a single column named ids. It lists, for each star in the catalog (objs file), the Astrometry.net ID of the star.

Each row is one native-endian uint64.

QUAD

Contains a single table with a single column named quads. It lists the indices in the catalog (objs file) of the stars that comprise each quad.

Each row is four native-endian uints.

CODE

Contains a single table with a single column named codes. It lists the geometric code for each quad.

Each row is four native-endian doubles.
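
The QUAD and CODE rows can be decoded the same way. The sketch below is hedged on several points: the mapping from the UINT_SZ header value to a struct format code is an assumption, the default of 4-byte uints is an assumption, and pairing row i of the quad file with row i of the code file is inferred from the descriptions above rather than stated. All names and sample values are hypothetical.

```python
import struct

# Assumed mapping from the UINT_SZ header value (in bytes) to a struct code.
UINT_CODES = {2: "H", 4: "I", 8: "Q"}


def decode_quads(data, uint_size=4):
    """Return one (star_a, star_b, star_c, star_d) tuple of catalog indices per quad."""
    row = struct.Struct("=4" + UINT_CODES[uint_size])
    return list(row.iter_unpack(data))


def decode_codes(data):
    """Return one 4-tuple of doubles per quad (assumes DUBL_SZ == 8)."""
    return list(struct.iter_unpack("=4d", data))


# Made-up bytes for two quads and their codes.
quad_raw = struct.pack("=4I", 0, 5, 9, 12) + struct.pack("=4I", 3, 5, 7, 11)
code_raw = (struct.pack("=4d", 0.25, 0.5, 0.75, 0.125)
            + struct.pack("=4d", 0.5, 0.5, 0.25, 0.875))

quads = decode_quads(quad_raw)
codes = decode_codes(code_raw)
paired = list(zip(quads, codes))  # assumes quad i and code i describe the same quad
```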

CKDT, SKDT

These are both written as generic kdtrees (with some extra headers).

The important tree size parameters (located in the FITS header) are NDATA (number of data points), NDIM (dimensionality of the data points) and NNODES (number of kdtree nodes).

The other important parameter is REAL_SZ, which should be sizeof(double) for us.

They contain three tables, each with a single column.

All data is in native endian and size.

Column kdtree_nodes: each row is a struct kdtree_node_t followed by 2 * NDIM reals. A struct kdtree_node_t (at the moment this was written) is just two uints, the leftmost and rightmost limits of the part of the array that is owned by the node. Following this are two NDIM-dimensional vectors, which are the minimum and maximum corners of the bounding hyper-rectangle.

Column kdtree_data: each row is NDIM reals.

Column kdtree_perm: the permutation index; each row is a uint.
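
To make the kdtree_nodes row layout concrete, here is a sketch that unpacks one node. It assumes 4-byte uints, 8-byte reals (REAL_SZ equal to sizeof(double)), and no padding between the two uints and the reals; the names node_struct and decode_node and the sample tree are made up.

```python
import struct


def node_struct(ndim, uint_code="I", real_code="d"):
    """Row layout: two uints (left, right), then 2 * NDIM reals (min corner, max corner)."""
    return struct.Struct(f"=2{uint_code}{2 * ndim}{real_code}")


def decode_node(data, index, ndim):
    """Unpack node number `index` from a raw kdtree_nodes column."""
    layout = node_struct(ndim)
    fields = layout.unpack_from(data, index * layout.size)
    left, right = fields[0], fields[1]
    lo = fields[2:2 + ndim]   # minimum corner of the bounding hyper-rectangle
    hi = fields[2 + ndim:]    # maximum corner of the bounding hyper-rectangle
    return left, right, lo, hi


NDIM = 3
layout = node_struct(NDIM)

# Made-up bytes: a root node owning elements 0..99 and a child owning 0..49.
raw = (layout.pack(0, 99, -1.0, -1.0, -1.0, 1.0, 1.0, 1.0)
       + layout.pack(0, 49, -1.0, -1.0, -1.0, 0.0, 1.0, 1.0))

left, right, lo, hi = decode_node(raw, 1, NDIM)
```

With these assumptions a row for NDIM = 3 occupies 2 * 4 + 6 * 8 = 56 bytes.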

QIDX

This is the most complicated format, since it contains variable-length arrays (though we "roll our own" instead of using FITS variable-length arrays).

This lists, for each star, the quads of which it is a member.

There is one table with one column, qidx.

The data chunk has two regions: an "index" area followed by a "heap".

In the index area, each star has two uints: the offset and length of the heap area owned by that star. (Offset and length are in units of uints, and start from zero at the beginning of the heap area.)

The heap begins at offset 2 * sizeof(uint) * nstars. It contains lists of quads, stored as uints.
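
The index-plus-heap lookup can be sketched as follows. This is an illustration under stated assumptions, not the code in qidxfile.c: it assumes 4-byte uints, takes the number of stars as a caller-supplied argument (the text above does not say where it is stored), and uses made-up sample data. The function name quads_for_star is hypothetical.

```python
import struct

UINT_SIZE = 4  # assumes UINT_SZ == 4


def quads_for_star(data, star, nstars):
    """Return the list of quads that `star` belongs to, given the raw qidx data chunk."""
    # Index area: two uints (offset, length) per star.
    offset, length = struct.unpack_from("=2I", data, 2 * UINT_SIZE * star)
    if length == 0:
        return []
    # The heap follows the index area; offsets count uints from the heap start.
    heap_start = 2 * UINT_SIZE * nstars
    begin = heap_start + offset * UINT_SIZE
    return list(struct.unpack_from(f"={length}I", data, begin))


# Made-up data for three stars:
#   star 0 is in quads [0, 2], star 1 is in no quads, star 2 is in quads [1, 2, 5].
index_area = struct.pack("=6I", 0, 2, 2, 0, 2, 3)
heap = struct.pack("=5I", 0, 2, 1, 2, 5)
raw = index_area + heap

star2_quads = quads_for_star(raw, 2, 3)
```

Here star 2 has offset 2 and length 3, so its list starts two uints into the heap.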

MATCH

These are nice, friendly, portable FITS files. They're also in flux so I'm not going to write about them now.

XYLS, RDLS

These are actually the same format, just with different default column names and AN_FILE values.

There is one table per field.

Files written by our code will have two columns (typically X and Y or RA and DEC) in float or double format (E or D FITS types).

Our code can read any file that contains appropriately-named columns in E or D formats.

See Also

See the DATA_MODEL file for a good description of hits files.