Data Types and the Frame Objects

The input to our sensor fusion application is a series of LIDAR points, radar points, and camera images that will be rendered and labeled. Because of the size of the objects involved, we require that the data be JSON-encoded (or protobuf-encoded) and accessible via a URL passed in through the task request. To annotate a point cloud frame, format the data in one of our accepted formats, upload the data as a file, and then send a request to the Scale API, much as you would when processing image files.

Below are the definitions of our various object types for the JSON format, and of an entire point cloud frame. The protobuf format is largely identical and can be downloaded here; the difference is that camera intrinsic parameters are encoded as a oneof within the CameraImage message type, so no camera_model field is needed.

Definition: Vector2

{
  "x": 1,
  "y": 2
}

Vector2 objects are used to represent positions, and are JSON objects with 2 properties.

| Property | Type | Description |
| --- | --- | --- |
| x | float | x value |
| y | float | y value |

Definition: Vector3

{
  "x": 1,
  "y": 2,
  "z": 3
}

Vector3 objects are used to represent positions, and are JSON objects with 3 properties.

| Property | Type | Description |
| --- | --- | --- |
| x | float | x value |
| y | float | y value |
| z | float | z value |

Definition: LidarPoint

{
  "x": 1,
  "y": 2,
  "z": 3,
  "i": 0.5,
  "d": 2,
  "t": 1541196976735462000,
  "is_ground": true
}

LidarPoint objects represent individual LIDAR points, and are JSON objects with the following properties.

| Property | Type | Description |
| --- | --- | --- |
| x | float | x value |
| y | float | y value |
| z | float | z value |
| i | float (optional) | Optional intensity value. Number between 0 and 1 |
| d | integer (optional) | Optional non-negative device id to identify points from multiple sensors (i.e. which device captured the point) |
| t | float (optional) | Optional timestamp of the sensor's detection of the LiDAR point, in nanoseconds |
| is_ground | boolean (optional) | Optional flag indicating whether the point is part of the ground. Defaults to false if not specified |

📘

Good things to note:

z is up, so the x-y plane should be flat with the ground

Scale processes point coordinates as 32-bit floats. If your coordinates are greater than 10^5, your point cloud may suffer from rounding effects.

As a best practice, apply the negative of the first frame's position as an offset to all points and calibrations, so that the first frame sits at (0, 0, 0) and all subsequent frames are expressed as offsets from it (see the sketch after this note).

Why 10^5 as a suggested limit?
32-bit floats have 23 bits of mantissa precision, and 2^23 ≈ 10^7, so coordinates on the order of 10^5 leave only about 2 decimal places of precision.

Assuming a meter-based unit of measure, this means point precision is limited to roughly the nearest centimeter. Going above 10^5 reduces the precision to an often unacceptable degree, hence the warning here.
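
A minimal sketch of that re-centering step, assuming the frames have already been loaded as Python dicts in the Frame format defined later in this section (the function name and in-place mutation are just one way to do it):

```python
# Sketch: re-center a sequence of frames on the first frame's device_position
# so coordinates stay small enough for 32-bit float precision.
# Assumes `frames` is a list of dicts in the Frame format described below.

def recenter_frames(frames):
    origin = frames[0]["device_position"]  # position of the first frame

    for frame in frames:
        # Shift the device position.
        for axis in ("x", "y", "z"):
            frame["device_position"][axis] -= origin[axis]

        # Shift every LIDAR point by the same offset.
        for point in frame["points"]:
            for axis in ("x", "y", "z"):
                point[axis] -= origin[axis]

        # Shift camera positions and radar points as well, since they share
        # the same world coordinate system as the points.
        for image in frame.get("images", []):
            for axis in ("x", "y", "z"):
                image["position"][axis] -= origin[axis]
        for radar_point in frame.get("radar_points", []):
            for axis in ("x", "y", "z"):
                radar_point["position"][axis] -= origin[axis]

    return frames
```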

Definition: Quaternion

{
  "x": 1,
  "y": 1,
  "z": 1,
  "w": 1
}

Quaternion objects are used to represent rotation. We use the Hamilton quaternion convention, where i^2 = j^2 = k^2 = ijk = -1, i.e. the right-handed convention.
The quaternion represented by the tuple (x, y, z, w) is equal to w + x*i + y*j + z*k.

| Property | Type | Description |
| --- | --- | --- |
| x | float | x value |
| y | float | y value |
| z | float | z value |
| w | float | w value |

Definition: BoundingBox

{
  "top": -85,
  "left": -170,
  "bottom": 85,
  "right": 170
}

A BoundingBox specifies the latitude/longitude bounds of an area, in degrees. It is used for transforming from world coordinates to pixel coordinates.

| Property | Type | Description |
| --- | --- | --- |
| top | float | Top boundary of box. Value must be between -90 and 90 |
| left | float | Left boundary of box. Value must be between -180 and 180 |
| bottom | float | Bottom boundary of box. Value must be between -90 and 90 |
| right | float | Right boundary of box. Value must be between -180 and 180 |

Definition: GPSPose

{
  "lat": 1,
  "lon": 1,
  "bearing": 1
}

GPSPose objects are used to represent the pose (location and direction) of the robot in the world.

| Property | Type | Description |
| --- | --- | --- |
| lat | float | Latitude of the robot's location. Number between -90 and 90 |
| lon | float | Longitude of the robot's location. Number between -180 and 180 |
| bearing | float | The direction the robot is facing, expressed as an absolute bearing in decimal degrees. Number between 0 and 360: 0 degrees represents facing North, 90 degrees East, 180 degrees South, and 270 degrees West |

Definition: CameraImage

CameraImage objects represent an image and the camera position/heading used to record the image.

Camera models supported: brown_conrady (the default pinhole model with Brown-Conrady distortion), fisheye (the OpenCV fisheye model), and omnidirectional (see the distortion parameters below).

| Property | Type | Description |
| --- | --- | --- |
| timestamp | float (optional) | The timestamp, in nanoseconds, at which the photo was taken |
| image_url | string | URL of the image file |
| scale_factor | float (optional) | Factor by which the image has been downscaled (if the original image is 1920x1208 and image_url refers to a 960x604 image, scale_factor=2) |
| position | Vector3 | World-normalized position of the camera |
| heading | Quaternion | Vector <x, y, z, w> indicating the quaternion of the camera direction; note that the z-axis of the camera frame represents the camera's optical axis. See Heading Examples for examples |
| priority | integer (optional) | A higher value indicates that the camera takes precedence over other cameras when a single object appears in multiple camera views. If you are using a mix of long range and short range cameras with overlapping coverage, set the short range cameras to priority 1 (default priority is 0) |
| camera_index | integer (optional) | Required if you are using Scale Mapping. A number identifying which camera on the car this is (used in any 2D/3D linking task sent after completion of the sensor fusion task). If not specified, it will be inferred from the CameraImage's position in the array |
| camera_model | string (optional) | Either fisheye for the OpenCV fisheye model, or brown_conrady for the pinhole model with Brown-Conrady distortion. Defaults to brown_conrady |
| fx | float | Focal length in x direction (in pixels) |
| fy | float | Focal length in y direction (in pixels) |
| cx | float | Principal point x value |
| cy | float | Principal point y value |
| skew | float (optional) | Camera skew coefficient |
| k1 | float (optional) | 1st radial distortion coefficient (Brown-Conrady, fisheye, omnidirectional) |
| k2 | float (optional) | 2nd radial distortion coefficient (Brown-Conrady, fisheye, omnidirectional) |
| k3 | float (optional) | 3rd radial distortion coefficient (Brown-Conrady, fisheye, omnidirectional) |
| k4 | float (optional) | 4th radial distortion coefficient (fisheye only) |
| p1 | float (optional) | 1st tangential distortion coefficient (Brown-Conrady, omnidirectional) |
| p2 | float (optional) | 2nd tangential distortion coefficient (Brown-Conrady, omnidirectional) |
| xi | float (optional) | Reference frame offset for the omnidirectional model |
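
The CameraImage definition has no JSON example of its own, so here is an illustrative one built as a Python dict. Every value (the URL, pose, and intrinsics) is made up for the example; only the field names come from the table above.

```python
import json

# Illustrative CameraImage for a pinhole (brown_conrady) camera; all values are placeholders.
camera_image = {
    "timestamp": 1541196976735462000,
    "image_url": "https://example.com/frames/0/front_camera.jpg",
    "position": {"x": 1.2, "y": 0.0, "z": 1.6},            # camera position in world coordinates
    "heading": {"x": -0.5, "y": 0.5, "z": -0.5, "w": 0.5},  # camera pointing along world +x (see the heading example below)
    "camera_model": "brown_conrady",
    "fx": 1000.0, "fy": 1000.0,                             # focal lengths in pixels
    "cx": 960.0, "cy": 604.0,                               # principal point
    "k1": -0.1, "k2": 0.01, "k3": 0.0, "p1": 0.0, "p2": 0.0,
}

print(json.dumps(camera_image, indent=2))
```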

For example, to represent a camera pointing along the positive x-axis (i.e. the camera's z-axis points along the world's x-axis) and oriented normally (the camera's y-axis points along the world's negative z-axis), the corresponding heading is 0.5 - 0.5i + 0.5j - 0.5k.
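
This heading can be sanity-checked numerically. Below is a minimal sketch using numpy and scipy (assumed dependencies, not something the format requires); note that scipy's Rotation.from_quat expects scalar-last [x, y, z, w] order, matching the JSON field names.

```python
import numpy as np
from scipy.spatial.transform import Rotation

# Heading 0.5 - 0.5i + 0.5j - 0.5k, in scipy's scalar-last [x, y, z, w] order.
heading = Rotation.from_quat([-0.5, 0.5, -0.5, 0.5])

# Columns of the rotation matrix are the camera axes expressed in world coordinates.
R = heading.as_matrix()
print(np.round(R, 3))
# Expected output: camera x-axis -> world -y, camera y-axis -> world -z,
# camera z-axis (optical axis) -> world +x
# [[ 0.  0.  1.]
#  [-1.  0.  0.]
#  [ 0. -1.  0.]]
```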

Definition: RadarPoint

{
  "position": {
    "x": 100.1,
    "y": 150.12,
    "z": 200.2
  },
  "direction": {
    "x": 1,
    "y": 0,
    "z": 0
  },
  "size": 0.5
}

RadarPoint objects are used to define an individual radar point and Doppler in a particular frame.

| Property | Type | Description |
| --- | --- | --- |
| position | Vector3 | A vector defining the position of the RADAR point in the same frame of reference as the frame's points array |
| direction | Vector3 (optional) | A vector defining the velocity (direction and magnitude) of a potential Doppler associated with the RADAR point. This vector is relative to the individual RADAR point and is in the global reference frame. The magnitude of the vector should correspond to the speed in m/s, and the length of the Doppler shown to the labeler will vary based on the magnitude. So, if the Doppler is 1 m/s in the positive x direction, the direction could be {"x": 1, "y": 0, "z": 0} |
| size | float (optional, default 1) | A float from 0 to 1 describing the strength of the radar return, where larger numbers are stronger radar returns. This value is used to determine the brightness of the point displayed to the labeler |

Definition: AnnotationRule

This object enforces certain annotation relationships.

| Property | Type | Description |
| --- | --- | --- |
| must_derive_from | Array<DeriveFrom> | List of DeriveFrom objects that define the relationships between annotations |

Definition: DeriveFrom

This object enforces that when line annotations are used to form, or "derive", a polygon annotation, the labels of the annotations involved must come from a specified set.

| Property | Type | Description |
| --- | --- | --- |
| from | Array<string> | A list of line labels or group names |
| to | Array<string> | A list of polygon labels or group names whose edges must be derived from the listed lines |
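
For example, a rule requiring that polygons labeled "lane" be derived only from lines labeled "lane_boundary" might look like the following (the label names are hypothetical):

```python
# Hypothetical AnnotationRule: polygons labeled "lane" must derive their edges
# from line annotations labeled "lane_boundary".
annotation_rule = {
    "must_derive_from": [
        {
            "from": ["lane_boundary"],  # line labels (or group names)
            "to": ["lane"],             # polygon labels (or group names)
        }
    ]
}
```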

Definition: RegionOfInterest2d

This object allows Scale to perform the correct transformation from lon/lat world coordinates to pixels. It allows Scale to identify the pixel coordinates of the camera location on the provided aerial imagery. If it is not provided, the camera context images will not render on the task.

| Property | Type | Description |
| --- | --- | --- |
| bounding_box | BoundingBox | The bounds of the area the image covers, in lat/long coordinates. Validation is in place to ensure that latitude goes from -90 to 90 degrees and longitude from -180 to 180 degrees |
| crs | string (optional) | Coordinate reference system used. Currently, only EPSG:4326 is supported |

Definition: RegionOfInterest3d

This object crops the attachments' points to a rectangle on the XY plane, centered around position and rotated counterclockwise about the z-axis. It must be submitted for any LiDAR TopDown annotation task, and defines the bounds to which the point cloud is restricted for annotation. The x and y dimensions are in meters, and the greater of the two is used to create a square on the XY plane. The RegionOfInterest3d differs from geofencing in that it crops in 3D world space as opposed to 2D orthographic image space. If the RegionOfInterest3d specified is larger than the point cloud, the orthographic image will contain empty/black space where there are no points. If the RegionOfInterest3d contains no points, an error will be thrown. Annotations are translated to the RegionOfInterest3d coordinate frame, but are translated back before the response is sent to the customer.

| Property | Type | Description |
| --- | --- | --- |
| position | Vector2 | Position of the center of the region to be cropped, in meters, relative to the center of the scene (0, 0) |
| dimensions | Vector2 | Dimensions of the region to be cropped on the XY plane, in meters, centered on the position defined above |
| rotation | float | Rotation of the region counterclockwise about the z-axis, in degrees |
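
For illustration, here is a hypothetical region_of_interest_3d that crops the cloud to a 200 m by 100 m rectangle centered 50 m ahead of the scene origin and rotated 15 degrees counterclockwise about the z-axis (all values are made up):

```python
# Hypothetical region_of_interest_3d: a 200 m x 100 m rectangle on the XY plane,
# centered at (50, 0) relative to the scene center, rotated 15 degrees
# counterclockwise about the z-axis.
region_of_interest_3d = {
    "position": {"x": 50.0, "y": 0.0},
    "dimensions": {"x": 200.0, "y": 100.0},
    "rotation": 15.0,
}
```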

Definition: CameraContext

CameraContext objects are non-primary images that can be referenced during labeling. They are used to provide additional information to annotators, but are not annotated themselves.

| Property | Type | Description |
| --- | --- | --- |
| type | string | Must be either "lidar_camera", "world_camera", or "pixel_camera"; see the notes below |
| lat | float (required for type world_camera) | Latitude of the position of the camera |
| long | float (required for type world_camera) | Longitude of the position of the camera |
| alt | float | Altitude of the position of the camera |
| attachment | string | Link to the camera image |
| camera_position | Vector3 (required for type pixel_camera) | Position of the camera in the same coordinate system as the provided image |
| frame | int | Frame of the image |

- "lidar_camera": the region_of_interest_3d must be sent, since an ortho projection is needed to correctly handle LiDAR cameras.
- "world_camera": the images come from a camera with lat/long coordinates. A CameraContext object of this type must include the lat/lon of the camera position and the link to the image attachment. The region_of_interest_2d parameter must also be sent so that the correct transformation from lat/lon to the main image's pixel coordinate system can be performed.
- "pixel_camera": the reference images are in the same coordinate system as the main image. Only the type, the link to the image, and the camera position in x, y, z are required.

Definition: Link

Links are created to represent a relationship between two Annotation objects.

| Property | Type | Description |
| --- | --- | --- |
| to | string | Annotation UUID |
| from | string | Annotation UUID |
| label | string | Selected from the link labels defined in the submitted taxonomy |
| attributes | object | Dict of all attribute names and selected values for the Link |

Definition: OrthoResponse

{
 "response_type": "ortho_response",
 "annotations": [
   {
     "label": "Lane Line",
     "uuid": "3927f821-bfca-4cc9-8f6c-4fcf3c9e524e",
     "vertices": [
       {
         "x": 1.400146484375,
         "y": 5.218505859375
       },
       {
         "x": 1003.986572265625,
         "y": 1053.152587890625
       },
       {
         "x": 2984.986572265625,
         "y": 2701.152587890625
       }
     ],
     "type": "line"
   }
 ]
}

The OrthoResponse contains annotations in local coordinates of the task. Note that the Annotations returned in the OrthoResponse are in 2D.

| Property | Type | Description |
| --- | --- | --- |
| response_type | string | Constant "ortho_response" |
| annotations | Array<Annotation> | In the 2D pixel coordinate space of the projected TopDown task image |
| links | Array<Link> | |

Definition: WorldResponse

{
 "response_type": "world_response",
 "annotations": [
   {
     "label": "Lane Line",
     "uuid": "3927f821-bfca-4cc9-8f6c-4fcf3c9e524e",
     "vertices_3d": [
       {
         "x": 10014.00146484375,
         "y": 10052.18505859375,
         "z": 5.153
       },
       ...
     ],
     "type": "line"
   }
 ]
}

The WorldResponse contains annotations that are re-projected from the region_of_interest_3d into the World scene coordinates of the original Lidar data. The GroundMesh is used to assign altitude (Z-coordinates) to all Annotation vertices.
Note that aerial imagery tasks have their WorldResponse and OrthoResponse in the same coordinate space, since no region_of_interest_3d is specified.

| Property | Type | Description |
| --- | --- | --- |
| response_type | string | Constant "world_response" |
| annotations | Array<Annotation> | In 3D world coordinate space |
| links | Array<Link> | |

Definition: CameraResponse

[
 {
   "response_type": "camera_response",
   "annotations": [
     {
       "label": "Lane Line",
       "uuid": "3927f821-bfca-4cc9-8f6c-4fcf3c9e524e",
       "vertices_3d": [
         ...
       ],
       "type": "line"
     }
   ],
   "frame_number": 0,
   "camera_index": 0,
   "metadata": {}
 },
 ...
]

For each Lidar Camera Context Attachment, a CameraResponse is generated, which holds an array of all World annotations projected into the 3D coordinate space from the PoV of the Lidar camera. In the LidarTopdown final task response, the “camera” field holds an array of CameraResponse, one for each Lidar Camera Context Attachment.

| Property | Type | Description |
| --- | --- | --- |
| response_type | string | Constant "camera_response" |
| annotations | Array<Annotation> | In 3D world coordinate space from the perspective of the indicated camera context attachment |
| frame_number | float | Matches the frame_number on the submitted camera context attachment |
| camera_index | float | Matches the camera_index on the submitted camera context attachment |

Definition: ImageResponse

[
 {
   "response_type": "image_response",
   "annotations": [
     {
       "label": "Lane Line",
       "uuid": "3927f821-bfca-4cc9-8f6c-4fcf3c9e524e",
       "vertices": [
         ...
       ],
       "type": "line"
     }
   ],
   "frame_number": 0,
   "camera_index": 0,
   "metadata": {}
 },
 ...
]

For each Lidar Camera Context Attachment, the ImageResponse holds an array of all World annotations projected onto the camera context image itself. These annotations are in the image space of the camera context image, and as such are 2D annotations. As with the CameraResponse, the LidarTopdown final task response holds an array of ImageResponses, one for each Lidar Camera Context Attachment.

| Property | Type | Description |
| --- | --- | --- |
| response_type | string | Constant "image_response" |
| annotations | Array<Annotation> | In the 2D pixel coordinate space of the camera context image |
| frame_number | float | Matches the frame_number on the submitted camera context attachment |
| camera_index | float | Matches the camera_index on the submitted camera context attachment |
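
Because each CameraResponse and ImageResponse arrives as a flat array with one entry per Lidar Camera Context Attachment, it can help to index the entries by (frame_number, camera_index) when post-processing. A minimal sketch, assuming the response has already been parsed into a list of Python dicts (the function and variable names are placeholders):

```python
# Sketch: index CameraResponse / ImageResponse entries by (frame_number, camera_index)
# so that annotations for a specific camera context attachment are easy to look up.
# `image_responses` is assumed to be the parsed array shown above.

def index_by_camera(image_responses):
    indexed = {}
    for entry in image_responses:
        key = (entry["frame_number"], entry["camera_index"])
        indexed[key] = entry["annotations"]
    return indexed

# Example usage: all annotations projected onto camera 0 in frame 0.
# annotations = index_by_camera(image_responses)[(0, 0)]
```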

Definition: Frame

Frame objects represent all the point cloud, image, and other data that is sent to the annotator.

| Property | Type | Description |
| --- | --- | --- |
| device_position | Vector3 | Position of the LIDAR sensor or car with respect to a static frame of reference, i.e. a pole at (0,0,0) remains at (0,0,0) throughout all frames. This should use the same coordinate system as points and radar_points |
| device_heading | Quaternion | Heading of the car or robot that the LIDAR is on top of with respect to a static frame of reference, expressed as a Quaternion. See Heading Examples for examples |
| device_gps_pose | GPSPose (optional) | GPS pose (location and bearing) of the robot in the world. The GPS pose provided should correspond to the best estimate of the pose of the same point as defined in device_position and device_heading |
| points | list of LidarPoint | Series of points representing the LIDAR point cloud, normalized with respect to a static frame of reference, i.e. a pole at (0,0,0) remains at (0,0,0) throughout all frames. This should use the same coordinate system as device_position and radar_points. See the LidarTopDown note below |
| radar_points | list of RadarPoint (optional) | A list of RadarPoints corresponding to the given frame, defining objects which should be labeled using a combination of radar and camera. This should use the same coordinate system as device_position and points |
| images | list of CameraImage (optional) | A list of CameraImage objects that can be superimposed over the LIDAR data. See the LidarTopDown note below |
| timestamp | float (optional) | The starting timestamp of the sensor rotation, in nanoseconds |

For LidarTopDown (points):
Using the LiDAR points, Scale generates two images used for labeling:
1. A flattened 2D image with point cloud density, elevation, and ego trajectory
2. A ground mesh to use when projecting annotations into camera images
Multi-pass LiDAR data is ideal for LidarTopDown annotation, since sparse areas cannot be annotated and lead to global inconsistency.

For LidarTopDown (images):
Annotations are projected into camera images as an additional reference point.
- You may optionally provide a metadata field. This field will be included in each item of the camera response and image response
- Camera frame rates that differ from the LIDAR frame rate are supported
- You may specify "images": [] when there are no camera images for a given frame
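
A minimal single-frame skeleton might look like the following (illustrative values only; real frames contain thousands of points and usually several camera images, as in the linked example file below):

```python
# Minimal Frame skeleton with illustrative values; real frames will contain
# thousands of LidarPoints and typically several CameraImage objects.
frame = {
    "device_position": {"x": 0.0, "y": 0.0, "z": 0.0},
    "device_heading": {"x": 0.0, "y": 0.0, "z": 0.0, "w": 1.0},
    "points": [
        {"x": 1.0, "y": 2.0, "z": 0.1, "i": 0.5},
    ],
    "radar_points": [],
    "images": [],       # list of CameraImage objects, e.g. the camera_image example above
    "timestamp": 1541196976735462000,
}
```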

Example JSON file for a Frame

https://static.scale.com/scaleapi-lidar-pointclouds/example.json

Steps to Process a series of LIDAR data

  1. Create a Frame JSON object and save to a file; alternatively, create a LidarFrame protobuf message and save to a file. Repeat for each lidar frame.
  2. Upload the files to a Scale accessible location (e.g. an S3 bucket) and record the URLs.
  3. Send a POST request to https://api.scale.com/v1/task/lidarannotation
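
Putting the steps together, the sketch below shows the general shape of the workflow using the requests library. The payload fields, project name, callback URL, bucket URL, and API key are placeholders rather than a definitive task payload; consult the task creation documentation for the exact parameters your project requires.

```python
import json
import requests

# Step 1: build Frame dicts (see the Frame definition above) and write each to its own file.
frames = [
    # Placeholder: a single, nearly empty Frame. Real frames are built as described above.
    {
        "device_position": {"x": 0.0, "y": 0.0, "z": 0.0},
        "device_heading": {"x": 0.0, "y": 0.0, "z": 0.0, "w": 1.0},
        "points": [],
    },
]

frame_urls = []
for i, frame in enumerate(frames):
    path = f"frame_{i:04d}.json"
    with open(path, "w") as fp:
        json.dump(frame, fp)
    # Step 2: upload the file somewhere Scale can reach (e.g. an S3 bucket) and record its URL.
    # The bucket URL below is a placeholder.
    frame_urls.append(f"https://your-bucket.s3.amazonaws.com/{path}")

# Step 3: create the task. The payload fields shown here are illustrative only.
payload = {
    "project": "sensor_fusion_example",
    "callback_url": "https://example.com/callback",
    "attachment_type": "json",
    "attachments": frame_urls,
    "labels": ["car", "pedestrian"],
}

response = requests.post(
    "https://api.scale.com/v1/task/lidarannotation",
    json=payload,
    auth=("YOUR_SCALE_API_KEY", ""),  # HTTP basic auth with the API key as the username
)
print(response.status_code, response.json())
```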