Customizing Data Processing
One of the many goals of the TOM Toolkit is to simplify the flow of your data from observations. To that end, there’s some built-in functionality that can be overridden to make your TOM work for your use case.
To begin, here’s a brief look at part of the structure of the tom_dataproducts app in the TOM Toolkit:
tom_dataproducts
├──hooks.py
├──models.py
└──processors
   ├──data_serializers.py
   ├──photometry_processor.py
   └──spectroscopy_processor.py
Let’s start with a quick overview of models.py. The file contains the Django models for the dataproducts app, in our case DataProduct and ReducedDatum. The DataProduct contains information about uploaded or saved DataProducts, such as the file name, file path, and what kind of file it is. The ReducedDatum contains individual science data points that are taken from the DataProduct files. Examples of ReducedDatum points would be individual photometry points or individual spectra.

Each DataProduct also has a data_product_type. The data_product_type is simply a description of what the file is, more or less, and is customizable. The list of supported data_product_types is maintained in settings.py:
# Define the valid data product types for your TOM. Be careful when removing items, as previously valid types will no
# longer be valid, and may cause issues unless the offending records are modified.
DATA_PRODUCT_TYPES = {
    'photometry': ('photometry', 'Photometry'),
    'fits_file': ('fits_file', 'FITS File'),
    'spectroscopy': ('spectroscopy', 'Spectroscopy'),
    'image_file': ('image_file', 'Image File')
}
In order to add new data product types, simply add a new key/value pair, with the value being a 2-tuple. The first tuple item is the database value, and the second is the display value.
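For example, to support a hypothetical new type called 'image_cutout' (the key and display name here are purely illustrative, not something the toolkit ships with), you would extend the dictionary like this:

DATA_PRODUCT_TYPES = {
    'photometry': ('photometry', 'Photometry'),
    'fits_file': ('fits_file', 'FITS File'),
    'spectroscopy': ('spectroscopy', 'Spectroscopy'),
    'image_file': ('image_file', 'Image File'),
    # hypothetical new type: database value first, display value second
    'image_cutout': ('image_cutout', 'Image Cutout')
}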
All data products are automatically “processed” on upload, as well. Of course, that can mean different things to different TOMs! The TOM has two built-in data processors, both of which simply ingest the data into the database, and those are also specified in settings.py:
DATA_PROCESSORS = {
    'photometry': 'tom_dataproducts.processors.photometry_processor.PhotometryProcessor',
    'spectroscopy': 'tom_dataproducts.processors.spectroscopy_processor.SpectroscopyProcessor',
}
When a user uploads or saves a DataProduct to their TOM, the TOM runs process_data() from the corresponding DataProcessor subclass specified in DATA_PROCESSORS, as seen above. To illustrate, this is the base DataProcessor class:
import mimetypes
...


class DataProcessor():

    FITS_MIMETYPES = ['image/fits', 'application/fits']
    PLAINTEXT_MIMETYPES = ['text/plain', 'text/csv']

    mimetypes.add_type('image/fits', '.fits')
    mimetypes.add_type('image/fits', '.fz')
    mimetypes.add_type('application/fits', '.fits')
    mimetypes.add_type('application/fits', '.fz')

    def process_data(self, data_product):
        pass
Now let’s look at the built-in data processors. First, let’s check out the PhotometryProcessor, which inherits from DataProcessor:
class PhotometryProcessor(DataProcessor):

    def process_data(self, data_product):
        mimetype = mimetypes.guess_type(data_product.data.path)[0]
        if mimetype in self.PLAINTEXT_MIMETYPES:
            photometry = self._process_photometry_from_plaintext(data_product)
            return [(datum.pop('timestamp'), json.dumps(datum)) for datum in photometry]
        else:
            raise InvalidFileFormatException('Unsupported file type')
This class has an implementation of process_data() from the superclass DataProcessor. The implementation calls an internal method, _process_photometry_from_plaintext(), which returns a list of dicts. Each dict contains the values for the timestamp, magnitude, filter, and error for that photometry point. The list is then transformed into a list of 2-tuples, with the first value being the photometry timestamp, and the second being the JSON-ified remaining values.
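To make that transformation concrete, here is a rough sketch with made-up values; the dict keys follow the description above, though the exact keys used internally may differ:

import json

# Hypothetical output of _process_photometry_from_plaintext()
photometry = [
    {'timestamp': '2019-08-05T12:00:00', 'magnitude': 15.582, 'filter': 'r', 'error': 0.005}
]

# Pop the timestamp out of each dict and JSON-encode what remains, producing
# the list of 2-tuples that process_data() returns.
processed = [(datum.pop('timestamp'), json.dumps(datum)) for datum in photometry]
# processed == [('2019-08-05T12:00:00', '{"magnitude": 15.582, "filter": "r", "error": 0.005}')]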
Next, let’s look at the SpectroscopyProcessor:
class SpectroscopyProcessor(DataProcessor):

    DEFAULT_WAVELENGTH_UNITS = units.angstrom
    DEFAULT_FLUX_CONSTANT = units.erg / units.cm ** 2 / units.second / units.angstrom

    def process_data(self, data_product):
        mimetype = mimetypes.guess_type(data_product.data.path)[0]
        if mimetype in self.FITS_MIMETYPES:
            spectrum, obs_date = self._process_spectrum_from_fits(data_product)
        elif mimetype in self.PLAINTEXT_MIMETYPES:
            spectrum, obs_date = self._process_spectrum_from_plaintext(data_product)
        else:
            raise InvalidFileFormatException('Unsupported file type')

        serialized_spectrum = SpectrumSerializer().serialize(spectrum)

        return [(obs_date, serialized_spectrum)]
Just like the PhotometryProcessor, this class inherits from DataProcessor and implements process_data(). This is a requirement for a custom DataProcessor! Unlike the previous example, this process_data() method handles two file types, each of which is passed to an internal method that returns a Spectrum1D object. Again, like the PhotometryProcessor, a list of 2-tuples is created, with the first value being the timestamp, and the second being the JSON spectrum.
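For reference, a Spectrum1D is built from the wavelength and flux values read out of the file. Here is a minimal sketch of constructing one, assuming specutils and a simple array of values (the example data and approach are illustrative, not the toolkit’s exact implementation):

import numpy as np
from astropy import units
from specutils import Spectrum1D

# Hypothetical wavelength (Angstrom) and flux (erg / s / cm^2 / Angstrom) values
wavelength = np.array([6500.0, 6510.0, 6520.0])
flux = np.array([1.2e-16, 1.4e-16, 1.3e-16])

spectrum = Spectrum1D(
    spectral_axis=wavelength * units.angstrom,
    flux=flux * (units.erg / units.cm ** 2 / units.second / units.angstrom)
)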
You may be wondering why these two methods return lists of 2-tuples, especially when the SpectroscopyProcessor only returns a list of length one. The rationale is that you, the TOM user, shouldn’t have to worry about database insertion; the internal logic handles that aspect, and it can do so whether you return one data point or many.
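Conceptually, that internal step loops over the returned tuples and saves each one as a ReducedDatum. The sketch below is a rough, hypothetical illustration of that loop; the actual entry point and field handling in tom_dataproducts may differ:

from tom_dataproducts.models import ReducedDatum

# Hypothetical ingestion loop; field names shown here are illustrative.
for timestamp, value in MyDataProcessor().process_data(data_product):
    ReducedDatum.objects.create(
        target=data_product.target,
        data_product=data_product,
        data_type='photometry',  # the data_product_type being processed
        timestamp=timestamp,
        value=value
    )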
For a custom DataProcessor, there are just a few required steps. The first is to create a class that implements DataProcessor, like so:
from tom_dataproducts.data_processor import DataProcessor


class MyDataProcessor(DataProcessor):

    def process_data(self, data_product):
        # custom data processing here
        return [(timestamp1, json1), (timestamp2, json2), ..., (timestampN, jsonN)]
Let’s say that this file lives at mytom/my_data_processor.py. Now the processor needs to be added to DATA_PROCESSORS, and it can either process a new data product type or replace an existing one. Let’s replace spectroscopy:
DATA_PROCESSORS = {
    'photometry': 'tom_dataproducts.processors.photometry_processor.PhotometryProcessor',
    'spectroscopy': 'mytom.my_data_processor.MyDataProcessor',
}
And that’s it! Now your TOM will run the data processing specific to your case instead of the default one.
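As a slightly fuller illustration, here is one way MyDataProcessor might be fleshed out to ingest a hypothetical two-column CSV of timestamp and magnitude values. The file format, column names, and parsing logic are all assumptions for the sake of the example, not a prescribed implementation:

import csv
import json

from tom_dataproducts.data_processor import DataProcessor


class MyDataProcessor(DataProcessor):
    """Hypothetical processor for a 'timestamp,magnitude' CSV file."""

    def process_data(self, data_product):
        photometry = []
        # data_product.data is the file field pointing at the uploaded file
        with open(data_product.data.path) as f:
            for row in csv.DictReader(f, fieldnames=['timestamp', 'magnitude']):
                # JSON-encode everything except the timestamp, mirroring the
                # return format of the built-in processors
                photometry.append(
                    (row['timestamp'], json.dumps({'magnitude': float(row['magnitude'])}))
                )
        return photometry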