TextGridTools
TextGridTools is a free Python package for processing, querying and manipulating Praat TextGrid files. It addresses many deficiencies of Praat’s embedded scripting language by providing a clean data model for TextGrid objects and their attributes, and by offering functionality for common annotation-related tasks, such as the calculation of inter-annotator agreement measures. Owing to seamless integration with other Python tools, such as data analysis libraries and interactive interpreters, users gain access to a versatile and powerful computing environment without the need for repeated format conversions.
Overview
Praat has become a de facto standard tool for phonetic analysis, transcription of speech, and classification of speech events. To this end, it provides an intuitive point-and-click user interface for selecting intervals or points in the audio data and for labelling these with arbitrary text, which is then displayed time-aligned alongside the waveform. Praat supports annotation on multiple independent tiers, allowing for the application of multidimensional annotation schemata, annotations with different degrees of granularity, or independent transcriptions of individual participants in a dialogue recording.
Much of Praat’s usefulness comes from its programmability via a simple embedded scripting language, ‘Praat script,’ which allows full access to Praat’s functions and data structures. Coupled with basic control flow mechanisms, Praat script allows the automation of tedious and time-consuming tasks. However, in spite of its many virtues and its ease of use, Praat script lacks basic features of modern programming languages, such as return statements in functions, iterators, or even basic data structures such as lists or hash tables. Additionally, not being a general-purpose language, it lacks functionality that is not directly tied to Praat itself, such as plotting or statistical analysis.
These limitations are particularly evident when Praat script is used to manipulate Praat’s native annotation objects, stored in plain text ‘TextGrid’ files. Such tasks often do not require any of Praat’s advanced audio analysis features but would greatly benefit from access to specialised text processing tools and simple integration with external data analysis libraries. Faced with the lack of such functionality, Praat users wishing to carry out more complex analyses on their annotations are forced to first export their data into some intermediate format, such as comma-separated files, before importing them into a data analysis framework of their choice.
To overcome these shortcomings, we have developed ‘TextGridTools,’ a Python package offering functions to parse, manipulate and query Praat annotations. TextGridTools implements all TextGrid-related objects, such as interval and point tiers, as native Python classes and offers a clean API for accessing their attributes. Additional functions are available to perform more complex operations, such as the calculation of inter-annotator agreement measures between several annotation tiers.
Coupled with Python’s expressive syntax, TextGridTools allows for more compact and human-readable program code than that of Praat script. Additionally, with access to a fully-fledged programming language, users are able to carry out their analyses in one step, without Praat script serving as a mere exporting tool. Using TextGridTools, annotations can be accessed directly from a Python program and processed using one of Python’s many data analysis libraries (e.g., NumPy, SciPy, Matplotlib, RPy or pandas; cf. [1]).
TextGridTools is released under the GNU General Public Licence v3.0 and hosted on GitHub, allowing users to contribute their changes and extend the existing functionality. TextGridTools is compatible with both Python 2 and Python 3.
For a longer overview as well as for citing TextGridTools, see [2] (PDF).
Installation
TextGridTools is available from the Python Package Index. To get the latest stable version, run: pip install tgt
To get the latest development version, clone the GitHub repository: git clone https://github.com/hbuschme/TextGridTools.git
[1] McKinney, W. (2012). Python for Data Analysis: Agile Tools for Real World Data. Sebastopol, CA: O’Reilly.
[2] Buschmeier, H. and Włodarczak, M. (2013). TextGridTools: A TextGrid Processing and Analysis Toolkit for Python. In P. Wagner (ed.), Tagungsband der 24. Konferenz zur Elektronischen Sprachsignalverarbeitung (ESSV 2013). Dresden: TUDpress, pp. 152-157.
API
class tgt.core.TextGrid(filename='')
A TextGrid.
- add_tier(tier): Add a tier.
- add_tiers(tiers): Add a sequence of tiers.
- delete_tier(tier_name): Delete a tier.
- delete_tiers(tier_names, complement=False): Delete a list of tiers. If complement is False, delete the tiers with the specified names. If complement is True, delete the tiers not specified.
- end_time: TextGrid end time.
- get_tier_by_name(name): Get the first tier with the specified name.
- get_tier_names(): Get names of all tiers.
- get_tiers_by_name(name): Get a list of all tiers with the specified name.
- has_tier(name): Check whether TextGrid has a tier of the specified name.
- insert_tier(tier, position): Insert a tier at the specified position.
- start_time: TextGrid start time.
- tiers: Tiers in this TextGrid object.
class tgt.core.Tier(start_time=0, end_time=0, name='', objects=None)
An abstract tier.
- add_annotation(obj): Add an annotation object to this tier. The annotation object is inserted at the correct position within the tier. If the space is already (partially) occupied by a different annotation object, a ValueError is raised.
- add_annotations(objects): Add a sequence of annotation objects.
- annotations: The list of annotations of this tier.
- delete_annotation_by_end_time(time): Delete the annotation object that ends at time.
- delete_annotation_by_start_time(time): Delete the annotation object that starts at time.
- delete_annotations_between_timepoints(start, end, left_overlap=False, right_overlap=False): Delete annotation objects between start and end. If left_overlap is False, annotation objects overlapping with start are excluded; if right_overlap is False, annotation objects overlapping with end are excluded.
- delete_annotations_by_time(time): Delete annotation objects at the specified time.
- delete_annotations_with_text(pattern='', n=0, regex=False): Delete annotation objects with text matching the pattern. If n > 0, the first n matches are deleted; if n < 0, the last n matches are deleted; if n = 0, all matches are deleted. The pattern is treated as a regular expression if regex is True.
- delete_empty_annotations(): Delete annotation objects with empty or whitespace-only text.
- end_time: End time.
- get_annotation_by_end_time(time): Get the annotation object that ends at time.
- get_annotation_by_start_time(time): Get the annotation object that starts at time.
- get_annotations_between_timepoints(start, end, left_overlap=False, right_overlap=False): Get annotation objects between start and end. If left_overlap is False, annotation objects overlapping with start are excluded; if right_overlap is False, annotation objects overlapping with end are excluded.
- get_annotations_by_time(time): Get annotation objects at the specified time.
- get_annotations_with_matching_text(pattern='', n=0, regex=False): Get annotation objects with text matching the pattern. If n > 0, the first n matches are returned; if n < 0, the last n matches are returned; if n = 0, all matches are returned. The pattern is treated as a regular expression if regex is True. Note: get_annotations_with_matching_text is deprecated. Use get_annotations_with_text instead.
- get_annotations_with_text(pattern='', n=0, regex=False): Get annotation objects with text matching the pattern. If n > 0, the first n matches are returned; if n < 0, the last n matches are returned; if n = 0, all matches are returned. The pattern is treated as a regular expression if regex is True.
- get_nearest_annotation(time, pattern='.*', boundary='both', direction='both', exclude_overlapped=False): Get a list of the annotation object(s) nearest to time. boundary specifies whether the distance to an annotation object is calculated based on its start time (‘start’), its end time (‘end’) or both (‘both’). direction specifies whether to search to the left of time (‘left’), to the right of time (‘right’) or on both sides (‘both’). Annotations overlapping with time can be excluded (exclude_overlapped). Note: When time lies exactly on a boundary, this boundary is both to the left and to the right of time.
- start_time: Start time.
- tier_type(): Return the type of the tier as a string.
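The selection rule shared by get_annotations_with_text and delete_annotations_with_text (first n, last n, or all matches) can be illustrated without the library. The following is a minimal standalone sketch on plain label strings, not tgt's implementation; the helper name select_matches is hypothetical, and it assumes that with regex=False a label matches only if it equals the pattern exactly, which the documentation above does not state.

```python
import re


def select_matches(labels, pattern="", n=0, regex=False):
    """Return the labels matching pattern, limited by n.

    Assumption (not confirmed by the tgt documentation): with regex=False
    a label matches only if it equals the pattern exactly; with regex=True
    the pattern is applied with re.search.
    """
    if regex:
        matches = [label for label in labels if re.search(pattern, label)]
    else:
        matches = [label for label in labels if label == pattern]
    if n > 0:
        return matches[:n]  # the first n matches
    if n < 0:
        return matches[n:]  # the last |n| matches
    return matches  # n == 0: all matches


labels = ["uh", "yes", "uh", "no", "uhm"]
first_one = select_matches(labels, "uh", n=1)
last_two = select_matches(labels, r"^uh", n=-2, regex=True)
everything = select_matches(labels, r"^uh", regex=True)
```

With the sample labels, first_one holds a single "uh", last_two holds the final two labels starting with "uh", and everything holds all three.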
class tgt.core.IntervalTier(start_time=0, end_time=0, name='', objects=None)
An IntervalTier.
- add_interval(interval): Add an interval to this tier.
- add_intervals(intervals): Add a sequence of intervals to this tier.
- get_copy_with_gaps_filled(start_time=None, end_time=None, empty_string=''): Return a copy in which gaps are filled with empty intervals.
- get_copy_with_same_intervals_merged(): Return a copy of the tier in which consecutive intervals with identical labels are merged.
- intervals: The list of intervals of this tier.
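The idea behind get_copy_with_gaps_filled can be sketched on plain (start, end, text) tuples. This is only an illustration of the gap-filling technique under the assumption that the input intervals are sorted and non-overlapping; the function name fill_gaps and the tuple representation are hypothetical and are not part of tgt.

```python
def fill_gaps(intervals, start_time, end_time, empty_string=""):
    """Return a new list in which every gap is covered by an empty interval.

    intervals is a list of (start, end, text) tuples that is assumed to be
    sorted by start time and free of overlaps.
    """
    filled = []
    cursor = start_time  # end of the region covered so far
    for start, end, text in intervals:
        if start > cursor:
            # Uncovered stretch before this interval: insert a filler.
            filled.append((cursor, start, empty_string))
        filled.append((start, end, text))
        cursor = end
    if cursor < end_time:
        # Uncovered stretch after the last interval.
        filled.append((cursor, end_time, empty_string))
    return filled


words = [(0.5, 1.0, "hello"), (1.0, 1.4, "world"), (2.0, 2.5, "again")]
filled = fill_gaps(words, start_time=0.0, end_time=3.0)
```

The input list is left untouched, mirroring the copy semantics described above.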
class tgt.core.PointTier(start_time=0, end_time=0, name='', objects=None)
A PointTier (also “TextTier”).
- add_point(point): Add a point to this tier.
- add_points(points): Add a list of points to this tier.
- points: The list of points of this tier.
- tier_type(): Return the type of the tier as a string. As Praat’s point tiers are for some reason called “TextTier” in the TextGrid file format, we simply return a string literal here.
class tgt.core.Interval(start_time, end_time, text='')
An interval of two points of time with a text label.

class tgt.core.Point(time, text='')
A point of time with a text label. Internally an Annotation where start time equals end time.
- end_time: The point of time.
- start_time: The point of time.
- time: The point of time.

class tgt.core.Time
A representation of a point in time with a predefined precision.

exception tgt.core.TextGridToolsException
tgt.io.correct_start_end_times_and_fill_gaps(textgrid)
Correct the start/end times of all tiers and fill gaps. Returns a copy of a textgrid in which gaps between intervals are filled with empty intervals and in which start and end times are unified with the start and end times of the whole textgrid.

tgt.io.deescape_text(text)
De-escape text when importing from TextGrid.

tgt.io.escape_text(text)
Escape text for exporting to TextGrid.

tgt.io.export_to_elan(textgrid, encoding='utf-8', include_empty_intervals=False, include_point_tiers=True, point_tier_annotation_duration=0.04)
Convert a TextGrid object into a string of ELAN eaf format.

tgt.io.export_to_long_textgrid(textgrid)
Convert a TextGrid object into a string of Praat long TextGrid format.

tgt.io.export_to_short_textgrid(textgrid)
Convert a TextGrid object into a string of Praat short TextGrid format.

tgt.io.export_to_table(textgrid, separator=', ')
Convert a TextGrid object into a table with fields delimited with the specified separator (comma by default).

tgt.io.include_empty_intervals_in_tier(tier_name, include_empty_intervals)
Check whether to include empty intervals for a particular tier.

tgt.io.read_long_textgrid(filename, stg, include_empty_intervals=False)
Read a Praat long TextGrid file and return a TextGrid object.

tgt.io.read_short_textgrid(filename, stg, include_empty_intervals=False)
Read a Praat short TextGrid file and return a TextGrid object.

tgt.io.read_textgrid(filename, encoding='utf-8', include_empty_intervals=False)
Read a Praat TextGrid file and return a TextGrid object. If include_empty_intervals is False (the default), empty intervals are excluded. If True, they are included. Empty intervals from specific tiers can also be included by specifying tier names as a string (for one tier) or as a list.

tgt.io.write_to_file(textgrid, filename, format='short', encoding='utf-8', **kwargs)
Write a TextGrid object to a file in the specified format.
tgt.util.chronogram(tiers, speech_label=None, silence_label=None)
Construct a chronogram between intervals in input tiers. Interval labels are classified as silences or vocalisations by matching them against the speech_label and the silence_label regular expressions. By default, all intervals with empty or whitespace-only labels are treated as silences.
The code is a generalisation of Jaffe and Feldstein’s (1970) 6-state Markov model to an arbitrary number of speakers. Instances of silences and overlaps are classified as within-speaker-overlap (wso), between-speaker-overlap (bso), within-speaker-silence (wss) or between-speaker-silence (bss). Individual vocalisations are labelled with the source tier name.

tgt.util.concatenate_textgrids(textgrids, ignore_nonmatching_tiers=False, use_absolute_time=False)
Concatenate TextGrid objects and return a new one.
If ignore_nonmatching_tiers is False, an exception is raised if the number or names of tiers differ among the TextGrids.
If use_absolute_time is False, start and end times of intervals are offset by the end_time of the preceding tier. If use_absolute_time is True, start and end times of intervals are used as is. This is useful if the start_time of the textgrids does not equal 0.
Keyword arguments:
- textgrids: a list of TextGrid objects
- ignore_nonmatching_tiers: a boolean (default False)
- use_absolute_time: a boolean (default False)

tgt.util.concatenate_tiers(tier1, tier2, offset)
Concatenate two tiers and return a new tier.
Offset is the time added to each interval’s boundaries in order to put them after the intervals of the preceding tier. If intervals have absolute timing on each tier (i.e., start times > 0 for later tiers), an offset of 0 should be used.
Keyword arguments:
- tier1: Tier object
- tier2: Tier object
- offset: float (>= 0)

tgt.util.get_overlapping_intervals(tier1, tier2, regex='[^\\s]+', overlap_label=None)
Return a list of overlaps between intervals of tier1 and tier2. If no overlap_label is specified, concatenated labels of overlapping intervals are used as the resulting label.
All nonempty intervals are included in the search by default.
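The overlap computation described for get_overlapping_intervals can be sketched on plain (start, end, text) tuples. This is an illustrative sketch rather than tgt's code: the function name overlapping_intervals is hypothetical, and joining the two labels with "+" is an assumption, since the documentation only says the labels are concatenated.

```python
import re


def overlapping_intervals(tier1, tier2, regex=r"[^\s]+", overlap_label=None):
    """Return (start, end, label) tuples for every overlap between tiers.

    Each tier is a list of (start, end, text) tuples. Only intervals whose
    text matches regex take part; the default skips empty and
    whitespace-only labels.
    """
    overlaps = []
    for start1, end1, text1 in tier1:
        if not re.search(regex, text1):
            continue
        for start2, end2, text2 in tier2:
            if not re.search(regex, text2):
                continue
            start = max(start1, start2)
            end = min(end1, end2)
            if start < end:  # a shared stretch of positive duration
                if overlap_label is None:
                    # Assumed separator; the source does not specify one.
                    label = text1 + "+" + text2
                else:
                    label = overlap_label
                overlaps.append((start, end, label))
    return overlaps


speaker_a = [(0.0, 1.0, "hi"), (1.0, 2.0, ""), (2.0, 3.0, "fine")]
speaker_b = [(0.5, 1.5, "hello"), (1.5, 2.5, "how are you")]
overlaps = overlapping_intervals(speaker_a, speaker_b)
```

Intervals that merely touch at a boundary are not reported here because their shared stretch has zero duration; whether tgt treats that case the same way is not stated in the documentation.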
tgt.util.merge_textgrids(textgrids, ignore_duplicates=True)
Return a TextGrid object with tiers in all textgrids.
If ignore_duplicates is False, tiers with equal names are renamed by adding the path of the textgrid or a unique number incremented with each occurrence.

tgt.util.shift_boundaries(tier, left, right)
Return a copy of the tier with boundaries shifted by the specified amount of time (in seconds). Positive values expand the tier and negative values shrink it, i.e.:
- a positive value of left shifts the left boundary to the left
- a negative value of left shifts the left boundary to the right
- a positive value of right shifts the right boundary to the right
- a negative value of right shifts the right boundary to the left

tgt.util.turns(chrono)
Return turns (defined as intervals bounded by solo vocalisations of two different speakers) given a chronogram.
tgt.agreement.align_labels(tiers_list, precision=None, regex='[^\\s]+')
Create a list of lists for all time-aligned Interval or Point objects in tiers_list whose text matches regex. For example:
  [[label_1-tier_1, label_1-tier_2, label_1-tier_3],
   [label_2-tier_1, label_2-tier_2, label_2-tier_3],
   …
   [label_n-tier_1, label_n-tier_2, label_n-tier_3]]
The allowed mismatch between objects’ timestamps can be controlled via the precision parameter.

tgt.agreement.cohen_kappa(a)
Calculate Cohen’s kappa for the input array.
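For reference, Cohen’s kappa is (p_o - p_e) / (1 - p_e), where p_o is the observed agreement and p_e the agreement expected by chance from the two raters’ marginal distributions. The sketch below computes it in plain Python; it is not tgt's implementation, the function name is hypothetical, and it assumes the input is a square contingency table with one rater in the rows and the other in the columns, a layout the documentation above does not spell out.

```python
def cohen_kappa_from_table(table):
    """Compute Cohen's kappa from a square contingency table.

    table[i][j] is the number of items that rater A assigned to category i
    and rater B assigned to category j (assumed layout).
    """
    size = len(table)
    total = float(sum(sum(row) for row in table))
    # Observed agreement: share of items on the diagonal.
    observed = sum(table[i][i] for i in range(size)) / total
    # Marginal category proportions for each rater.
    row_shares = [sum(table[i]) / total for i in range(size)]
    col_shares = [sum(table[i][j] for i in range(size)) / total
                  for j in range(size)]
    # Chance agreement: product of the raters' marginals, per category.
    chance = sum(r * c for r, c in zip(row_shares, col_shares))
    return (observed - chance) / (1 - chance)


table = [[20, 5],
         [10, 15]]
kappa = cohen_kappa_from_table(table)
```

For this table the observed agreement is 0.7 and the chance agreement is 0.5, giving a kappa of 0.4.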
tgt.agreement.cont_table(tiers_list, precision, regex)
Produce a contingency table from annotations in tiers_list whose text matches regex, and whose time stamps are not misaligned by more than precision.

tgt.agreement.fleiss_chance_agreement(a)
Return the chance agreement for the input array.

tgt.agreement.fleiss_kappa(a)
Calculate Fleiss’s kappa for the input array (with categories in columns and items in rows).
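The layout described above (items in rows, categories in columns) corresponds to the standard formulation of Fleiss’s kappa, in which each cell counts the raters who assigned the item to the category. The following plain-Python sketch follows that textbook formulation under the assumption that every item was rated by the same number of raters; it is not taken from tgt, and the function name is hypothetical.

```python
def fleiss_kappa_from_counts(counts):
    """Compute Fleiss's kappa from an items-by-categories count matrix.

    counts[i][j] is the number of raters who assigned item i to category j.
    Every row is assumed to sum to the same number of raters (at least 2).
    """
    n_items = len(counts)
    n_categories = len(counts[0])
    n_raters = sum(counts[0])
    # Per-item agreement: share of agreeing rater pairs for each item.
    per_item = [
        (sum(c * c for c in row) - n_raters) / float(n_raters * (n_raters - 1))
        for row in counts
    ]
    observed = sum(per_item) / n_items
    # Overall share of all assignments that fell into each category.
    shares = [
        sum(row[j] for row in counts) / float(n_items * n_raters)
        for j in range(n_categories)
    ]
    chance = sum(p * p for p in shares)
    return (observed - chance) / (1 - chance)


# Four items, two categories, three raters per item.
counts = [[3, 0],
          [0, 3],
          [2, 1],
          [1, 2]]
kappa = fleiss_kappa_from_counts(counts)
```

In this sketch the observed and chance terms play the roles that fleiss_observed_agreement and fleiss_chance_agreement have in the API listing; for the sample matrix they are 2/3 and 1/2, which yields a kappa of 1/3.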
tgt.agreement.fleiss_observed_agreement(a)
Return the observed agreement for the input array.

tgt.agreement.scott_pi(a)
Calculate Scott’s pi for the input array.
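Scott’s pi has the same form as Cohen’s kappa, (p_o - p_e) / (1 - p_e), but estimates the chance agreement p_e from the category proportions pooled over both raters. The plain-Python sketch below illustrates this; it is not tgt's implementation, the function name is hypothetical, and it assumes a square contingency table with one rater in the rows and the other in the columns.

```python
def scott_pi_from_table(table):
    """Compute Scott's pi from a square contingency table.

    table[i][j] is the number of items that rater A assigned to category i
    and rater B assigned to category j (assumed layout).
    """
    size = len(table)
    total = float(sum(sum(row) for row in table))
    # Observed agreement: share of items on the diagonal.
    observed = sum(table[i][i] for i in range(size)) / total
    # Pooled proportion of each category over both raters' assignments.
    pooled = [
        (sum(table[k]) + sum(table[i][k] for i in range(size))) / (2 * total)
        for k in range(size)
    ]
    chance = sum(p * p for p in pooled)
    return (observed - chance) / (1 - chance)


table = [[20, 5],
         [10, 15]]
pi = scott_pi_from_table(table)
```

For this table the pooled proportions are 0.55 and 0.45, so the chance agreement is 0.505 and pi is 0.195 / 0.495, slightly below the Cohen’s kappa of 0.4 obtained from the same table.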