// nnet3/convolution.h

// Copyright      2017  Johns Hopkins University (author: Daniel Povey)

// See ../../COPYING for clarification regarding multiple authors
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
//  http://www.apache.org/licenses/LICENSE-2.0
//
// THIS CODE IS PROVIDED *AS IS* BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
// KIND, EITHER EXPRESS OR IMPLIED, INCLUDING WITHOUT LIMITATION ANY IMPLIED
// WARRANTIES OR CONDITIONS OF TITLE, FITNESS FOR A PARTICULAR PURPOSE,
// MERCHANTABILITY OR NON-INFRINGEMENT.
// See the Apache 2 License for the specific language governing permissions and
// limitations under the License.

#ifndef KALDI_NNET3_NNET_CONVOLUTION_H_
#define KALDI_NNET3_NNET_CONVOLUTION_H_

#include "base/kaldi-common.h"
#include "util/common-utils.h"
#include "itf/options-itf.h"
#include "matrix/matrix-lib.h"
#include "cudamatrix/cu-matrix-lib.h"
#include "nnet3/nnet-common.h"

#include <iostream>

namespace kaldi {
namespace nnet3 {

/// @file  convolution.h
///
/// This file contains some fairly low-level utilities for implementing
/// convolutional neural networks and related methods such as TDNNs, which are
/// mostly used in nnet-convolutional-component.h.  This would not necessarily
/// be suitable as a general-purpose and self-contained setup for convolution,
/// as it is quite closely tied to the overall framework of the nnet3 library
/// (the underlying ideas might be usable, though).
///
/// We have chosen to implement this here, rather than using CuDNN, because we
/// realized that it was quite easy to efficiently implement CNNs in the nnet3
/// framework in a way that would support both GPUs and CPUs, at least for the
/// typical setups that have small patch dimensions (like 1x1 or 3x3).  In a
/// typical 3x3 convolution, the entire convolution can be done using 3 matrix
/// multiplies (and 3 corresponding CopyColsFromMat calls).


namespace time_height_convolution {

/**
   This comment explains the basic framework used for everything related to
   time-height convolution.  We are doing convolution in 2 dimensions; these
   would normally be width and height, but in the nnet3 framework we identify
   the width with the 'time' dimension (the 't' element of an Index).  This
   enables us to use this framework in the normal way for speech tasks, and it
   turns out to have other advantages too, giving us a very efficient and
   easy implementation of CNNs (basically, the nnet3 framework takes care of
   certain reorderings for us).  As mentioned, the 't' index will correspond to
   the width, and the vectors we operate on will be of dimension height *
   num-filters, where the filter-index has the stride of 1.

   We will use the GeneralComponent interface, and its function
   ReorderIndexes(), to ensure that the input and output Indexes of the
   component have a specified regular structure; we'll pad with 'blank' Indexes
   (t=kNoTime) on the input and output of the component, as needed to ensure
   that it's an evenly spaced grid over n and t, with x always zero and the t
   values evenly spaced.  (However, a note on even spacing: for computations
   with downsampling this ordering of the 't' values is a bit more complicated;
   search for 'blocks' in the rest of this header for more information).

   First consider the simplest case, call it "same-t-stride" (where there is no
   downsampling on the time index, i.e.  the input and output 't' values have
   the same stride, like 1, 2 or 4).  The input and output matrices have
   dimension num-t-values * num-images, with the num-t-values having the higher
   stride.  The computation involves copying a row-range of the input matrix to
   a temporary matrix with a column mapping (the temporary matrix will typically
   have more columns than the input matrix); and then doing a matrix-multiply
   between the reshaped temporary matrix and a block of the parameters; the
   block corresponds to a particular time-offset.  Then we may need to repeat
   the whole process with a different, shifted row-range of the input matrix and
   a different column map.  You may have to read the rest of this header, to
   understand this in more detail.
 */
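
/*
   To make the "same-t-stride" case above concrete, here is a rough sketch (in
   the style of the library, but not code taken from it; names like 'temp',
   'temp_reshaped', 'output_reshaped' and 'params_block' are illustrative) of
   what one step of the computation looks like.  'columns' is the column map
   for one particular time-offset and 'params_block' is the corresponding block
   of the parameter matrix:

     // rows of 'input' are ordered as (t, image), with 't' having the larger
     // stride; discard 'input_time_shift' initial rows for this step.
     CuSubMatrix<BaseFloat> input_part =
         input.RowRange(input_time_shift, num_t_out * num_images);
     temp.CopyCols(input_part, columns);  // remap columns into the temp matrix
                                          // (columns[i] == -1 means zero).
     // View 'temp' reshaped so that each row covers a single
     // (t, image, height-out) triple, and 'output' reshaped to match; then this
     // step's contribution is a single matrix multiply, added to the output:
     output_reshaped.AddMatMat(1.0, temp_reshaped, kNoTrans,
                               params_block, kTrans, 1.0);

   The whole convolution is just a small number of such steps (e.g. 3 for a 3x3
   kernel with no time-subsampling), each with a different row-shift and column
   map, summed into the output.
*/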


/**
   This struct represents a convolutional model from a structural point
   of view (it doesn't contain the actual parameters).  Note: the parameters
   are to be stored in a matrix of dimension (num_filters_out) by
   (offsets.size() * num_filters_in) [where the offset-index has the larger
   stride than the filter-index].

   Partly out of a desire for generality, but also for convenience in
   implementation and integration with nnet3, at this level we don't represent
   the patch size in the normal way like '1x1' or '3x3', but as a list of pairs
   (time-offset, height-offset).  E.g. a 1x1 patch would normally be the single
   pair (0,0), and a 3x3 patch might be represented as

   offsets={ (0,0),(0,1),(0,2), (1,0),(1,1),(1,2), (2,0),(2,1),(2,2) }

   However-- and you have to be a bit careful here-- the time indexes are on an
   *absolute* numbering scheme so that if you had downsampled the time axis on a
   previous layer, the time-offsets might all be multiples of (e.g.) 2 or 4, but
   the height-offsets would normally always be separated by 1.  [note: we always
   normalize the list of (time-offset, height-offset) pairs with the
   lexicographical ordering that you see above.]  This asymmetry between time
   and height may not be very aesthetic, but the absolute numbering of time is
   at the core of how the framework works.  Note: the offsets don't have to
   start from zero, they can be less than zero, just like the offsets in TDNNs
   which are often lists like (-3,0,3).  Don't be surprised to see things like:

   offsets={ (-3,-1),(-3,0),(-3,1), (0,-1),(0,0),(0,1), (3,-1),(3,0),(3,1) }

   If there are negative offsets in the height dimension (as above) it means
   that there is zero-padding in the height dimension (because the first
   height-index at both the input and the output is 0, so a height-offset of -1
   means that to compute the output at height-index 0 we need the input at
   height-index -1, which doesn't exist; this implies zero padding on the
   bottom of the image).
 */
struct ConvolutionModel {
  int32 num_filters_in;   // number of input filters, e.g. 128.
  int32 num_filters_out;  // number of output filters, e.g. 256.
  int32 height_in;   // image height in, e.g. 40.
  int32 height_out;  // image height out, e.g. 40 (no subsampling or zero
                     // padding), 38 (with zero padding) (or for an example with
                     // 2x subsampling and no zero-padding: maybe 20).
  int32 height_subsample_out;  // subsampling factor for height.  In the 3
                               // examples given for height_out above, would be
                               // 1, 1 and 2 respectively.
  struct Offset {
    int32 time_offset;
    int32 height_offset;
    // give it a lexicographic ordering.
    inline bool operator < (const Offset &other) const {
      if (time_offset < other.time_offset) return true;
      else if (time_offset > other.time_offset) return false;
      else return height_offset < other.height_offset;
    }
    inline bool operator <= (const Offset &other) const {
      if (time_offset < other.time_offset) return true;
      else if (time_offset > other.time_offset) return false;
      else return height_offset <= other.height_offset;
    }
    inline bool operator == (const Offset &other) const {
      return time_offset == other.time_offset &&
          height_offset == other.height_offset;
    }
  };
  // For a 3x3 patch, the 'offsets' vector would be a list of 9 elements.  It's
  // always unique and sorted in lexicographic order.  See the extended comment
  // for struct ConvolutionModel for an explanation.
  std::vector<Offset> offsets;

  // This set, 'required_time_offsets', relates to zero-padding on the time
  // axis.  It should consist of a nonempty subset of the time-offset values
  // that have been seen in offsets[*].time_offset.  If there is no zero-padding
  // on the time (width) axis it would be that entire set.  If there is
  // zero-padding it would in most circumstances contain just the middle one,
  // e.g. of {0,1,2} we'd keep just {1}, or of {-3,0,3} we'd keep just {0}.  The
  // way to understand it is that all the time-offsets define dependencies in
  // the computation, but the list of 'required' offsets determines when a
  // computation can proceed when some of the dependencies are not present (any
  // non-required dependencies that were not present default to zero).
  std::set<int32> required_time_offsets;

  // This variable, which is derived from 'offsets', stores all the time offsets
  // that are present there, i.e. all the values of 'offsets[*].time_offset'
  std::set<int32> all_time_offsets;

  // This variable, which is derived from 'offsets', is the greatest common
  // divisor of the differences between the members of 'all_time_offsets';
  // e.g. if 'all_time_offsets' is {1,3,5} it would be 2.  It is used to figure
  // out what grid structure the input to the computation should have.  It is
  // set to zero if all_time_offsets.size() == 1.
  int32 time_offsets_modulus;


  // Computes the derived parameters 'all_time_offsets' and
  // 'time_offsets_modulus'.
  void ComputeDerived();

  // You'll notice that there is nothing here that explicitly specifies the
  // padding.  At this level, any padding on the height axis is implicit.  For
  // example, suppose there is a height-offset of -1, that implies we must be
  // padding at the bottom by at least 1, because the output height-index starts
  // from 0, and it would require the input at height -1, whereas the input
  // height-index starts from 0.  All padding is implicitly zero-padding.
  // Padding in the height dimension depends on (height_in, height_out,
  // height_subsample_out) and the 'height_offset' members of 'offsets'; padding
  // in the time dimension depends on 'required_time_offsets'
  // vs. 'all_time_offsets'.

  // The InputDim() and OutputDim() really relate to the model's behavior in a
  // neural-net component: they are the input-dim and output-dim of the features
  // that the component has as input/output; physically, this is the column
  // dimension at the input and output of the component.  The time dimension
  // corresponds to the row-index of those features.
  int32 InputDim() const { return num_filters_in * height_in; }
  int32 OutputDim() const { return num_filters_out * height_out; }
  // number of rows in the parameter matrix
  int32 ParamRows() const { return num_filters_out; }
  // number of cols in the parameter matrix
  int32 ParamCols() const { return num_filters_in * static_cast<int32>(offsets.size()); }

  ConvolutionModel() { }

  bool operator == (const ConvolutionModel &other) const;

  /*
    Checks that this model makes sense, and returns true if so; if not, returns
    false (and, for certain less-obvious kinds of failure, first prints a
    warning explaining why).

   @param [in] check_heights_used  If true, part of the check is that all
         height-values at the input are used at some point (if they
         are not, this model is probably not what you intended).
   @param [in] allow_height_padding  If true, the checking code assumes that
         zero-padding on the height axis is permitted.
   @return  Returns true if the check passed, false otherwise.
  */
  bool Check(bool check_heights_used = true,
             bool allow_height_padding = true) const;

  // Returns an info-string that describes the model; it looks like
  // "num-filters-in=32, num-filters-out=64, height-in=40, height-out=40, ... ".
  // It's suitable for use in the 'info' output of the convolutional component.
  std::string Info() const;

  void Write(std::ostream &os, bool binary) const;
  void Read(std::istream &is, bool binary);
};
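
/*
   As an illustration of the conventions above (a sketch made up for this
   comment, not code from elsewhere in the library): a 3x3 kernel on
   40-dimensional features, with 128 input and 256 output filters, symmetric
   time-offsets and zero-padding allowed on both axes, could be set up as:

     ConvolutionModel model;
     model.num_filters_in = 128;
     model.num_filters_out = 256;
     model.height_in = 40;
     model.height_out = 40;            // same height: implies height padding,
                                       // since height-offsets -1 and 1 are used.
     model.height_subsample_out = 1;   // no height subsampling.
     for (int32 time_offset = -1; time_offset <= 1; time_offset++) {
       for (int32 height_offset = -1; height_offset <= 1; height_offset++) {
         ConvolutionModel::Offset o;
         o.time_offset = time_offset;
         o.height_offset = height_offset;
         model.offsets.push_back(o);   // pushed in lexicographic order.
       }
     }
     model.required_time_offsets.insert(0);  // only the center time-offset is
                                             // required -> zero-padding in time.
     model.ComputeDerived();
     KALDI_ASSERT(model.Check());
*/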


/**
   This struct represents the structure of a convolution computation.
   This is used inside the PrecomputedIndexes object for
   the TimeHeightConvolutionComponent (it depends on the inputs and
   outputs as well as the layer).

   *CAUTION*: this is after certain transformations of the problem, so the
   height_in may not always be the "real" height of the input image (it may be a
   multiple thereof), and the num_t_in may not always be the "real" number
   of distinct time-steps on the input of the computation (it may be a divisor
   thereof).  ConvolutionComputation contains the info needed to actually
   perform the computation.
*/
struct ConvolutionComputation {
  // num_filters_in and num_filters_out will be the same as in the model.
  int32 num_filters_in, num_filters_out;
  // height_out will be the same as in the model, but height_in may be
  // affected by reshaping (may be larger than the model's height_in).
  int32 height_in, height_out;
  // num_t_in and num_t_out are the number of distinct time steps on the input
  // and output of the computation, but num_t_in may be affected by reshaping
  // (it may be smaller than the number of time steps in the original I/O
  // request).
  // num_t_in will be >= num_t_out, and if it's greater it will be greater by a
  // small additive term, not by a multiplicative factor.
  int32 num_t_in, num_t_out;
  // num_images is the number of (n,x) pairs present in the input/output
  // indexes (although in most setups the x values will all be zero and
  // they will only vary in n).
  int32 num_images;

  // temp_rows and temp_cols define the size of a temporary matrix that the
  // computation uses.  temp_rows is the number of rows in that temporary
  // matrix; it will normally be equal to [multiplying from greatest to least
  // stride], (num_t_out * num_images), but it may be less in order to save
  // memory.  The execution code is in charge of looping over the data using
  // this matrix, in order to ensure that we cover all output rows.  If you are
  // just trying to understand the framework, assume that it's always equal to
  // num_t_out * num_images.

  // Note: if all of the steps[*].columns_are_contiguous values are true AND all
  // of the steps[*].columns.Dim() equal the input-num-cols (=num_filters_in *
  // height_in), then the temporary matrix is never needed and in that case,
  // temp_rows and temp_cols will both be zero.
  int32 temp_rows, temp_cols;

  // There may be a few steps in the computation (e.g. in a 3x3 convolution
  // without subsampling, there would be 3 steps), and the output is a summation
  // over contributions from each step.  Each step has a different value of
  // 'input_time_shift' (which is the number of input rows to discard at the
  // start of the input matrix; it won't be the same as the increment in 't'
  // if t_step_in in the ConvolutionComputationIo != 1).
  struct ConvolutionStep {
    // input_time_shift >= 0 is the number of initial time-indexes of the input
    // (i.e. the number of initial rows of the matrix) that we discard for this
    // step. We may discard some final time-indexes too, if needed so that the
    // total number of input time-indexes equals the total number of output
    // time-indexes.
    int32 input_time_shift;

    // params_start_col >= 0 says the start-column-index of the parameter matrix
    // where we start a sub-matrix to be used in this step (the num-cols of that
    // sub-matrix is given by columns.Dim() / height_out).
    int32 params_start_col;

    // height_map is the 'upstream' parameter from which 'columns' and
    // 'backward_columns' are derived; it compactly defines a column mapping
    // that is used when copying the input to a temporary matrix.
    // height_map.size() * num_filters_in gives the num-cols in this temporary
    // matrix.  Each element of 'height_map' corresponds to a column range of
    // 'num_filters_in' columns of the temporary matrix, and it says which
    // (same-sized) column-range of the input matrix is to be used as the source
    // for this data.  Its elements are in the range -1 <= height_map[i] <
    // height_in, where -1's are used for blocks that are to be filled with zeros.
    // height_map would be the same as 'columns' if num_filters_in == 1.
    std::vector<int32> height_map;

    // 'columns' is derived from 'height_map'.
    // columns.Dim() <= temp_cols is the num-columns of
    // a sub-matrix of the temporary matrix, that we
    // populate on this step.
    //
    // Each element satisfies -1 <= columns[i] < height_in * num_filters_in,
    // and gives the column of the (reshaped) input matrix to copy from.
    // If columns[i] == -1, it means write a zero.
    CuArray<int32> columns;

    // 'backward_columns' is derived from 'columns', it is used in
    // the backprop.  Each element of 'backward_columns' has the
    // same dim as the num-cols of the input matrix.  It's basically
    // the reverse map of 'columns', but split into multiple parts (and
    // padded with -1's as necessary) so that we can process elements
    // of the input which are copied multiple times to the temporary
    // matrix.
    std::vector<CuArray<int32> > backward_columns;

    // 'columns_are_contiguous' is derived from 'columns'; it's true if
    // 'columns' is a contiguous range of nonnegative integers, like '20, 21,
    // 22, ... '.
    bool columns_are_contiguous;
    // 'first_column' is derived from 'columns'; it equals columns[0].  It is
    // only of interest if 'columns_are_contiguous' is true (it enables an
    // optimization).
    int32 first_column;
  };
  std::vector<ConvolutionStep> steps;


  void Write(std::ostream &os, bool binary) const;
  void Read(std::istream &is, bool binary);

  // Computes derived variables in 'steps', i.e. 'columns', 'backward_columns',
  // columns_are_contiguous, and 'first_column'.
  void ComputeDerived();

  // check that this computation makes sense; crash if not.
  void Check() const;
};
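
/*
   To clarify the relationship between 'height_map' and 'columns' described
   above: ComputeDerived() expands each element of 'height_map' into a run of
   'num_filters_in' consecutive column-indexes of the (reshaped) input, or into
   -1's for blocks that are to be zeroed.  A sketch of that expansion (not the
   actual implementation) would be:

     std::vector<int32> columns_vec(height_map.size() * num_filters_in);
     for (size_t h = 0; h < height_map.size(); h++)
       for (int32 f = 0; f < num_filters_in; f++)
         columns_vec[h * num_filters_in + f] =
             (height_map[h] == -1 ? -1 :
              height_map[h] * num_filters_in + f);
     // ... followed by columns.CopyFromVec(columns_vec), which lives on the
     // GPU when one is in use.

   Note how this reduces to columns == height_map when num_filters_in == 1, as
   stated in the comment for 'height_map'.
*/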



/**
   This struct contains options for compiling the convolutional computation.
 */
struct ConvolutionComputationOptions {
  // max_memory_mb determines how many megabytes of memory we are willing to use
  // for the temporary matrix.  If it would exceed this amount, we do the
  // computation in batches.
  BaseFloat max_memory_mb;
  ConvolutionComputationOptions(): max_memory_mb(200.0) { }
};



// This struct represents the structure of the input and output of a
// convolutional computation (the input and output images; not the model itself,
// which is represented by ConvolutionModel).  We require that both the input
// and output indexes have a regular repeated structure, and if this is not the
// case then the input and output indexes will be padded with 'blank' indexes
// (indexes having a 't' value of kNoTime) as needed to fit them into regular
// grids.  In addition 'blank' indexes may be added to reflect zero-padding on
// the input.
struct ConvolutionComputationIo {
  int32 num_images;  // 'num_images' is the number of distinct (n,x) values in
                     // the indexes.  Normally the x values would all be zero
                     // and the n values would go from 0 to num_images - 1, but
                     // this is not required.  We do enforce (via padding) that
                     // each (n,x) pair, i.e. each image, is associated with the
                     // same number of 't' values.

  // The following represent the sets of 't' values on the input and output.
  // Their meaning is obvious, but we should note that if there is just one
  // output or input index, we will set the step to zero when initially
  // creating this struct, and it may get set to other values later on, mostly
  // to avoid creating extra code paths.
  int32 start_t_in, t_step_in, num_t_in;
  int32 start_t_out, t_step_out, num_t_out;

  // reorder_t_in will be 1 in normal cases (no downsampling), but it may have values
  // greater than 1 (e.g. 2 if we're downsampling by a factor of 2).
  // This doesn't affect the set of indexes on the input, but it affects how they
  // are ordered.
  //
  //   If reorder_t_in == 1 then the indexes are ordered as: one block for all
  // indexes with t=t0=start_t_in; then one block for all with
  // t=t1=(start_t_in+t_step_in); then one block for t=t2, t=t3, and so on.
  //
  //   If reorder_t_in is >1 (for example, 2), then the values for t0 and t1 would
  // be interspersed in a single block; then the values for t2 and t3 would
  // be interspersed in the next block; and so on.  Within these blocks,
  // it's the 't' values that have the smaller stride.  This ordering allows
  // a reshaping such that we can imagine that the input and output have the
  // same 't' increment; it's useful in subsampling convolutions.
  int32 reorder_t_in;

  void Write(std::ostream &os, bool binary) const;
  void Read(std::istream &is, bool binary);
};
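
/*
   As an illustration of 'reorder_t_in' (an example made up for this comment):
   suppose start_t_in = 0, t_step_in = 1, num_t_in = 4 and there are two images
   n=0,1.  With reorder_t_in == 1 the input rows would be ordered

     (t=0,n=0) (t=0,n=1) (t=1,n=0) (t=1,n=1) (t=2,n=0) (t=2,n=1) (t=3,n=0) (t=3,n=1)

   whereas with reorder_t_in == 2 pairs of 't' values are interleaved within
   each block, with 't' having the smaller stride:

     (t=0,n=0) (t=1,n=0) (t=0,n=1) (t=1,n=1) (t=2,n=0) (t=3,n=0) (t=2,n=1) (t=3,n=1)

   This lets each pair of consecutive input frames be viewed as a single frame
   of twice the height, which is what AppendInputFrames() (declared below) does
   for subsampling convolutions.
*/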

/**
   Check that this model and this I/O request are compatible in
   terms of required context, etc., and crash if not.
   If allow_extra_input == false, this will crash if the
   input 'io' object has time values that would never be
   used because they are before/after the first/last
   desired time values. */
void CheckModelAndIo(const ConvolutionModel &model,
                     const ConvolutionComputationIo &io,
                     bool allow_extra_input = false);


/**
   This function does the compilation for a convolution computation; it's
   a wrapper for the functions below, which should not have to be called
   by the end user.

   @param [in] model  The convolution model that this computation is for.
   @param [in] input_indexes   The list of Indexes available at the input of
                      the computation.
   @param [in] output_indexes  The list of Indexes requested to be computed
                      at the output of the computation.  It is an error if
                      all dependencies are not satisfied (specifically: for
                      each Index (n,t,x) in 'output_indexes', the Index
                      (n,t+time_offset,x) must be present in 'input_indexes'
                      for each time_offset in model.required_time_offsets).
   @param [in] opts   Options for the compilation (e.g. the limit on the memory
                      used by the temporary matrix).
   @param [out] computation  If non-NULL, the compiled computation will be
                      written to this location.
   @param [out] input_indexes_modified  The input indexes, as reordered and
                      padded for the computation (see GetIndexesForComputation()).
   @param [out] output_indexes_modified  The output indexes, as reordered and
                      padded for the computation.

 */
void CompileConvolutionComputation(
    const ConvolutionModel &model,
    const std::vector<Index> &input_indexes,
    const std::vector<Index> &output_indexes,
    const ConvolutionComputationOptions &opts,
    ConvolutionComputation *computation,
    std::vector<Index> *input_indexes_modified,
    std::vector<Index> *output_indexes_modified);


/**
   \brief This does the forward computation of convolution.  (note: this is
         convolution without a bias term; you have to handle that separately).

   @param [in] conv_comp  A struct that describes the computation
                          to be performed.
   @param [in] input     The input to the convolution.  This
             should be of dimension (or should be reshapable to
             the dimension) conv_comp.num_t_in * conv_comp.num_images
             by conv_comp.height_in * num_filters_in.  [highest-stride
             indexes come first in these multiplications].  It must
             satisfy input.NumCols() == input.Stride().
   @param [in] params   The parameters of the convolution.  This should be of
             dimension ParamRows() by ParamCols() of the ConvolutionModel that
             the computation was compiled for, i.e. num_filters_out by
             (offsets.size() * num_filters_in).
   @param [out] output   The output of the convolution (this function
             *adds to* the output).  Should be of dimension
             conv_comp.num_t_out * conv_comp.num_images
             by conv_comp.height_out * num_filters_out.  It must
             satisfy output.NumCols() == output.Stride().
 */
void ConvolveForward(
    const ConvolutionComputation &conv_comp,
    const CuMatrixBase<BaseFloat> &input,
    const CuMatrixBase<BaseFloat> &params,
    CuMatrixBase<BaseFloat> *output);
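
/*
   A rough sketch of how CompileConvolutionComputation() and ConvolveForward()
   might be used together (illustrative only; 'model', 'input_indexes' and
   'output_indexes' are assumed to have been set up already, and the matrix
   sizes follow the documentation above):

     ConvolutionComputationOptions opts;
     ConvolutionComputation computation;
     std::vector<Index> input_indexes_modified, output_indexes_modified;
     CompileConvolutionComputation(model, input_indexes, output_indexes, opts,
                                   &computation, &input_indexes_modified,
                                   &output_indexes_modified);

     // NumCols() == Stride() is required, hence kStrideEqualNumCols.
     CuMatrix<BaseFloat> input(input_indexes_modified.size(), model.InputDim(),
                               kSetZero, kStrideEqualNumCols),
         output(output_indexes_modified.size(), model.OutputDim(),
                kSetZero, kStrideEqualNumCols),
         params(model.ParamRows(), model.ParamCols());
     // ... fill 'input' (rows correspond to input_indexes_modified) and
     // 'params', then:
     ConvolveForward(computation, input, params, &output);  // adds to 'output'.
*/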


/**
   \brief This does the part of the backward derivative computation
          of convolution, that propagates derivatives back to
          the input data.  See also ConvolveBackwardParams(), which
          is for the parameter derivative.

   @param [in] conv_comp  A struct that describes the convolution
                          computation (should be the same as in
                          the corresponding forward pass).
   @param [in] params The parameters used in the forward convolution.  This
             should be of dimension num_filters_out by (X * num_filters_in),
             where X is the total number of pixels in the patches, which equals
             model.offsets.size() in the model for which the computation was
             compiled.  E.g. for a regular 3x3 kernel, X would be 9.
  @param [in] output_deriv The derivative of the objective function w.r.t. the
             output of the convolution.  Should be of dimension
             conv_comp.num_t_out * conv_comp.num_images by conv_comp.height_out
             * num_filters_out.  It must satisfy output_deriv.NumCols() ==
             output_deriv.Stride().
   @param [out] input_deriv  If non-NULL, the backpropagated derivative of
             the objective function w.r.t. the input will be *added to* this
             matrix.  Should be the same dimension as the input to the original
             ConvolveForward() call.
*/
void ConvolveBackwardData(
    const ConvolutionComputation &conv_comp,
    const CuMatrixBase<BaseFloat> &params,
    const CuMatrixBase<BaseFloat> &output_deriv,
    CuMatrixBase<BaseFloat> *input_deriv);

/**
   \brief This does the part of the backward derivative computation
          of convolution, that computes derivatives w.r.t. the
          parameters.  See also ConvolveBackwardData(), which computes
          derivatives w.r.t. the input data.

   @param [in] conv_comp  A struct that describes the computation
                          that was performed in the forward pass.
   @param [in] input     The input to the original forward convolution.  This
             should be of dimension (or should be reshapable to
             the dimension) conv_comp.num_t_in * conv_comp.num_images
             by conv_comp.height_in * num_filters_in.  [highest-stride
             indexes come first in these multiplications].  It must
             satisfy input.NumCols() == input.Stride().
   @param [in] output_deriv The derivative of the objective function w.r.t. the
             output of the convolution.  Should be of dimension
             conv_comp.num_t_out * conv_comp.num_images by conv_comp.height_out
             * num_filters_out.  It must satisfy output_deriv.NumCols() ==
             output_deriv.Stride().
   @param [in] alpha   This scalar is multiplied into the derivative when
             we add to params_deriv, i.e. *params_deriv += alpha * derivative.
   @param [out] params_deriv  The derivative of the objective function
              w.r.t. the parameters (the 'params' given to the ConvolveForward
              function) is *added* to this location.  This matrix should have
              the same dimension as that 'params' matrix.
*/
void ConvolveBackwardParams(
    const ConvolutionComputation &conv_comp,
    const CuMatrixBase<BaseFloat> &input,
    const CuMatrixBase<BaseFloat> &output_deriv,
    BaseFloat alpha,
    CuMatrixBase<BaseFloat> *params_deriv);
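
/*
   In the backward pass, the two functions above would typically be used
   together, along the lines of the following sketch (assuming 'computation',
   'input', 'params' and 'output_deriv' have the dimensions described above,
   and 'input_deriv' / 'params_deriv', if supplied, have the same dimensions as
   'input' / 'params' respectively, with NumCols() == Stride()):

     if (input_deriv != NULL)
       ConvolveBackwardData(computation, params, output_deriv, input_deriv);
     if (params_deriv != NULL)
       ConvolveBackwardParams(computation, input, output_deriv,
                              1.0, params_deriv);  // *params_deriv += 1.0 * deriv
*/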


/**
   This function takes lists of input and output indexes to a computation
   (e.g. as supplied to ReorderIndexes()), and figures out a regular structure
   for them (i.e. the smallest grid that will completely cover all the t,n
   pairs).
   This function ignores any 't' values that are kNoTime.
*/
void GetComputationIo(
    const std::vector<Index> &input_indexes,
    const std::vector<Index> &output_indexes,
    ConvolutionComputationIo *io);
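
/*
   For example (an illustration made up for this comment): if the input indexes
   contain, for each of n=0 and n=1, the t values {0, 3, 6, 9}, then
   GetComputationIo() would set num_images = 2, start_t_in = 0, t_step_in = 3
   and num_t_in = 4 (and similarly for the output side from the output
   indexes).
*/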


/**
   This function computes the reordered and possibly padded indexes
   corresponding to the computation in 'io'.  Note: the computation may have
   undergone various manipulations (padding, etc.) after being obtained by the
   function GetComputationIo().  The original input and output indexes are
   needed because they dictate the set of (n, x) pairs; and because they
   determine when to use 'real' indexes and when to use 'blank' padding values
   (i.e. when to replace the t values in the indexes by kNoTime).
*/
void GetIndexesForComputation(
    const ConvolutionComputationIo &io,
    const std::vector<Index> &orig_input_indexes,
    const std::vector<Index> &orig_output_indexes,
    std::vector<Index> *input_indexes,
    std::vector<Index> *output_indexes);


/**
   This function extends the set of input indexes that the computation
   has, to account for any required zero-padding in the time dimension.
   It reads model.all_time_offsets and model.time_offsets_modulus;
   and it may modify the members start_t_in, t_step_in and num_t_in of *io.

   This is stage 1 of compilation.
 */
void PadComputationInputTime(const ConvolutionModel &model,
                             ConvolutionComputationIo *io);


/**
  This function takes a model that might require zero padding
  in the height dimension and outputs a model accepting a
  possibly-larger input dimension which does not require zero
  padding. *model_padded may differ from 'model' in its height_in and its
  'offsets' variable (the height-offsets need to be shifted if we pad at the
  bottom).  We then work out the computation in terms of the model that doesn't
  need padding (which is easier), and later convert it back to work in the space
  where there is no padding.

   This is stage 2 of compilation.
 */
void PadModelHeight(const ConvolutionModel &model,
                    ConvolutionModel *model_padded);


/** This function modifies, if necessary, a computation that has been built for
    the model 'model_padded', so that it can work for the original model
    'model'.  This may involve modifying the members 'height_in', 'temp_cols',
    and the column-related members of the elements of the 'steps' array.
    View it as the reverse step for 'PadModelHeight'.

    This function has to be aware that the computation will have been compiled
    after 'AppendInputFrames()' was called [this makes a difference in setups
    with subsampling], so the computation may have been built for input
    frames that were appended over several of the frames that 'model_padded'
    would require.

    This is the reverse step for stage 2 of compilation (it's a transformation
     of the computation).
 */
void UnPadModelHeight(const ConvolutionComputationOptions &opts,
                      const ConvolutionModel &model,
                      const ConvolutionModel &model_padded,
                      ConvolutionComputation *computation);

/**
   This function takes an input model and I/O specification, and it modifies
   both of them if necessary to ensure that the output 'io_appended' object has
   the same input and output time strides (i.e. t_stride_in == t_stride_out).
   This is done by appending the input frames across several time values and
   viewing them as single frames of larger dimension.

   The reason why 'io' is non-const is that it may be necessary to pad the
   number of input frames to ensure that the number of input frames is divisible
   by a multiple of t_step_out / t_step_in (if we pad the input frames, we
   pad to the right).

   The model in 'model_appended' may have larger height_in, and
   different values of 'offsets' and derived variables thereof, versus the model
   in 'model'.

   This is stage 3 of compilation.
*/
void AppendInputFrames(const ConvolutionModel &model,
                       ConvolutionComputationIo *io,
                       ConvolutionModel *model_appended,
                       ConvolutionComputationIo *io_appended);


/*
  This function takes a model and a specification of the computation's
  IO, and generates the computation.  This is stage 4 of the compilation.
  It assumes that stages 1, 2 and 3 have already been done so that:

    - Any required padding of the time axis (stage 1) and the height axis
      (stage 2) have been done (so any desired input values are available).
    - The t_step_in and t_step_out of the io object have the same value
      (stage 3).

  At this point the compilation process is actually quite simple: for each
  time shift (where the number of time shifts equals num_t_in + 1 - num_t_out
  of 'io'), we do a computation that copies and maybe duplicates the input
  columns to a temporary matrix, and then does a matrix multiplication
  between that temporary matrix and the corresponding block of the parameter
  matrix, adding the result to the output.
 */
void MakeComputation(const ConvolutionModel &model,
                     ConvolutionComputationIo &io,
                     const ConvolutionComputationOptions &opts,
                     ConvolutionComputation *computation);
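
/*
   Putting the stages together: based on the stage numbering in the comments
   above, a compilation driven by CompileConvolutionComputation() would proceed
   roughly as follows (a sketch of the flow, not a copy of the implementation):

     ConvolutionComputationIo io;
     GetComputationIo(input_indexes, output_indexes, &io);
     CheckModelAndIo(model, io);
     PadComputationInputTime(model, &io);                       // stage 1
     ConvolutionModel model_padded;
     PadModelHeight(model, &model_padded);                      // stage 2
     ConvolutionModel model_appended;
     ConvolutionComputationIo io_appended;
     AppendInputFrames(model_padded, &io,
                       &model_appended, &io_appended);          // stage 3
     MakeComputation(model_appended, io_appended, opts,
                     computation);                              // stage 4
     UnPadModelHeight(opts, model, model_padded, computation);  // undo stage 2
     GetIndexesForComputation(io, input_indexes, output_indexes,
                              input_indexes_modified, output_indexes_modified);
*/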


} // namespace time_height_convolution

} // namespace nnet3

} // namespace kaldi


#endif  // KALDI_NNET3_NNET_CONVOLUTION_H_