Completed the tflite_micro library directory structure; completed part of the 496TF porting documentation

QingChuanWS
2020-12-23 10:29:25 +08:00
parent 19280a8a79
commit c8eef23602
811 changed files with 219052 additions and 65 deletions


@@ -0,0 +1,202 @@
Apache License
Version 2.0, January 2004
http://www.apache.org/licenses/
TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION
1. Definitions.
"License" shall mean the terms and conditions for use, reproduction,
and distribution as defined by Sections 1 through 9 of this document.
"Licensor" shall mean the copyright owner or entity authorized by
the copyright owner that is granting the License.
"Legal Entity" shall mean the union of the acting entity and all
other entities that control, are controlled by, or are under common
control with that entity. For the purposes of this definition,
"control" means (i) the power, direct or indirect, to cause the
direction or management of such entity, whether by contract or
otherwise, or (ii) ownership of fifty percent (50%) or more of the
outstanding shares, or (iii) beneficial ownership of such entity.
"You" (or "Your") shall mean an individual or Legal Entity
exercising permissions granted by this License.
"Source" form shall mean the preferred form for making modifications,
including but not limited to software source code, documentation
source, and configuration files.
"Object" form shall mean any form resulting from mechanical
transformation or translation of a Source form, including but
not limited to compiled object code, generated documentation,
and conversions to other media types.
"Work" shall mean the work of authorship, whether in Source or
Object form, made available under the License, as indicated by a
copyright notice that is included in or attached to the work
(an example is provided in the Appendix below).
"Derivative Works" shall mean any work, whether in Source or Object
form, that is based on (or derived from) the Work and for which the
editorial revisions, annotations, elaborations, or other modifications
represent, as a whole, an original work of authorship. For the purposes
of this License, Derivative Works shall not include works that remain
separable from, or merely link (or bind by name) to the interfaces of,
the Work and Derivative Works thereof.
"Contribution" shall mean any work of authorship, including
the original version of the Work and any modifications or additions
to that Work or Derivative Works thereof, that is intentionally
submitted to Licensor for inclusion in the Work by the copyright owner
or by an individual or Legal Entity authorized to submit on behalf of
the copyright owner. For the purposes of this definition, "submitted"
means any form of electronic, verbal, or written communication sent
to the Licensor or its representatives, including but not limited to
communication on electronic mailing lists, source code control systems,
and issue tracking systems that are managed by, or on behalf of, the
Licensor for the purpose of discussing and improving the Work, but
excluding communication that is conspicuously marked or otherwise
designated in writing by the copyright owner as "Not a Contribution."
"Contributor" shall mean Licensor and any individual or Legal Entity
on behalf of whom a Contribution has been received by Licensor and
subsequently incorporated within the Work.
2. Grant of Copyright License. Subject to the terms and conditions of
this License, each Contributor hereby grants to You a perpetual,
worldwide, non-exclusive, no-charge, royalty-free, irrevocable
copyright license to reproduce, prepare Derivative Works of,
publicly display, publicly perform, sublicense, and distribute the
Work and such Derivative Works in Source or Object form.
3. Grant of Patent License. Subject to the terms and conditions of
this License, each Contributor hereby grants to You a perpetual,
worldwide, non-exclusive, no-charge, royalty-free, irrevocable
(except as stated in this section) patent license to make, have made,
use, offer to sell, sell, import, and otherwise transfer the Work,
where such license applies only to those patent claims licensable
by such Contributor that are necessarily infringed by their
Contribution(s) alone or by combination of their Contribution(s)
with the Work to which such Contribution(s) was submitted. If You
institute patent litigation against any entity (including a
cross-claim or counterclaim in a lawsuit) alleging that the Work
or a Contribution incorporated within the Work constitutes direct
or contributory patent infringement, then any patent licenses
granted to You under this License for that Work shall terminate
as of the date such litigation is filed.
4. Redistribution. You may reproduce and distribute copies of the
Work or Derivative Works thereof in any medium, with or without
modifications, and in Source or Object form, provided that You
meet the following conditions:
(a) You must give any other recipients of the Work or
Derivative Works a copy of this License; and
(b) You must cause any modified files to carry prominent notices
stating that You changed the files; and
(c) You must retain, in the Source form of any Derivative Works
that You distribute, all copyright, patent, trademark, and
attribution notices from the Source form of the Work,
excluding those notices that do not pertain to any part of
the Derivative Works; and
(d) If the Work includes a "NOTICE" text file as part of its
distribution, then any Derivative Works that You distribute must
include a readable copy of the attribution notices contained
within such NOTICE file, excluding those notices that do not
pertain to any part of the Derivative Works, in at least one
of the following places: within a NOTICE text file distributed
as part of the Derivative Works; within the Source form or
documentation, if provided along with the Derivative Works; or,
within a display generated by the Derivative Works, if and
wherever such third-party notices normally appear. The contents
of the NOTICE file are for informational purposes only and
do not modify the License. You may add Your own attribution
notices within Derivative Works that You distribute, alongside
or as an addendum to the NOTICE text from the Work, provided
that such additional attribution notices cannot be construed
as modifying the License.
You may add Your own copyright statement to Your modifications and
may provide additional or different license terms and conditions
for use, reproduction, or distribution of Your modifications, or
for any such Derivative Works as a whole, provided Your use,
reproduction, and distribution of the Work otherwise complies with
the conditions stated in this License.
5. Submission of Contributions. Unless You explicitly state otherwise,
any Contribution intentionally submitted for inclusion in the Work
by You to the Licensor shall be under the terms and conditions of
this License, without any additional terms or conditions.
Notwithstanding the above, nothing herein shall supersede or modify
the terms of any separate license agreement you may have executed
with Licensor regarding such Contributions.
6. Trademarks. This License does not grant permission to use the trade
names, trademarks, service marks, or product names of the Licensor,
except as required for reasonable and customary use in describing the
origin of the Work and reproducing the content of the NOTICE file.
7. Disclaimer of Warranty. Unless required by applicable law or
agreed to in writing, Licensor provides the Work (and each
Contributor provides its Contributions) on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
implied, including, without limitation, any warranties or conditions
of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A
PARTICULAR PURPOSE. You are solely responsible for determining the
appropriateness of using or redistributing the Work and assume any
risks associated with Your exercise of permissions under this License.
8. Limitation of Liability. In no event and under no legal theory,
whether in tort (including negligence), contract, or otherwise,
unless required by applicable law (such as deliberate and grossly
negligent acts) or agreed to in writing, shall any Contributor be
liable to You for damages, including any direct, indirect, special,
incidental, or consequential damages of any character arising as a
result of this License or out of the use or inability to use the
Work (including but not limited to damages for loss of goodwill,
work stoppage, computer failure or malfunction, or any and all
other commercial damages or losses), even if such Contributor
has been advised of the possibility of such damages.
9. Accepting Warranty or Additional Liability. While redistributing
the Work or Derivative Works thereof, You may choose to offer,
and charge a fee for, acceptance of support, warranty, indemnity,
or other liability obligations and/or rights consistent with this
License. However, in accepting such obligations, You may act only
on Your own behalf and on Your sole responsibility, not on behalf
of any other Contributor, and only if You agree to indemnify,
defend, and hold each Contributor harmless for any liability
incurred by, or claims asserted against, such Contributor by reason
of your accepting any such warranty or additional liability.
END OF TERMS AND CONDITIONS
APPENDIX: How to apply the Apache License to your work.
To apply the Apache License to your work, attach the following
boilerplate notice, with the fields enclosed by brackets "[]"
replaced with your own identifying information. (Don't include
the brackets!) The text should be enclosed in the appropriate
comment syntax for the file format. We also recommend that a
file or class name and description of purpose be included on the
same "printed page" as the copyright notice for easier
identification within third-party archives.
Copyright [yyyy] [name of copyright owner]
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.


@@ -0,0 +1,900 @@
// Copyright 2015 The Gemmlowp Authors. All Rights Reserved.
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.
// fixedpoint.h: fixed-point arithmetic, with basic operations and
// a few math functions such as tanh.
#ifndef GEMMLOWP_INTERNAL_FIXEDPOINT_H_
#define GEMMLOWP_INTERNAL_FIXEDPOINT_H_
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstdint>
#include <limits>
#include "../internal/detect_platform.h"
namespace gemmlowp {
// Part 1: Low-level integer-arithmetic primitives.
// The implementations here are generic implementations valid for
// scalar types (e.g. std::int32_t). Architecture-specific SIMD types
// (e.g. NEON int32x4_t) may be supported by providing
// specializations for them in separate files.
//
// The purpose of these primitives is two-fold:
// - They will be used to implement higher-level fixed-point
// abstractions, namely the FixedPoint class and its arithmetic
// operators.
// - They will be directly used to implement some more involved
// fixed-point computations, e.g. the fixed-point implementation
// of math functions such as tanh.
// Some compile-time traits around raw types to handle SIMD aspects:
// number of lanes, underlying scalar type.
template <typename tIntegerType>
struct FixedPointRawTypeTraits {};
template <>
struct FixedPointRawTypeTraits<std::int32_t> {
typedef std::int32_t ScalarRawType;
static constexpr int kLanes = 1;
};
template <>
struct FixedPointRawTypeTraits<std::int16_t> {
typedef std::int16_t ScalarRawType;
static constexpr int kLanes = 1;
};
// Returns a SIMD value duplicating a scalar value across all lanes.
template <typename tRawType>
tRawType Dup(typename FixedPointRawTypeTraits<tRawType>::ScalarRawType x) {
return x;
}
// Plain bit-wise AND
template <typename tIntegerType>
tIntegerType BitAnd(tIntegerType a, tIntegerType b) {
return a & b;
}
// Plain bit-wise OR
template <typename tIntegerType>
tIntegerType BitOr(tIntegerType a, tIntegerType b) {
return a | b;
}
// Plain bit-wise XOR
template <typename tIntegerType>
tIntegerType BitXor(tIntegerType a, tIntegerType b) {
return a ^ b;
}
// Plain bit-wise NOT
template <typename tIntegerType>
tIntegerType BitNot(tIntegerType a) {
return ~a;
}
// Integer addition. Not saturating. Overflow is undefined behavior.
template <typename tIntegerType>
tIntegerType Add(tIntegerType a, tIntegerType b) {
return a + b;
}
// Integer multiplication. Not saturating. Overflow is undefined behavior.
template <typename tIntegerType>
tIntegerType Mul(tIntegerType a, tIntegerType b) {
return a * b;
}
// Integer subtraction. Not saturating. Overflow is undefined behavior.
template <typename tIntegerType>
tIntegerType Sub(tIntegerType a, tIntegerType b) {
return a - b;
}
// Integer unary negative. Not saturating. Overflow is undefined behavior.
template <typename tIntegerType>
tIntegerType Neg(tIntegerType a) {
return -a;
}
// Integer arithmetic left-shift, equivalent to multiplying with a power of two.
// Negative values are OK. Overflow does not produce undefined behavior; the
// results are implementation-defined (in practice they are currently
// saturated, but we make no commitment to that). The idea is that the caller
// will want to implement the overflowing cases with compare-and-mask
// saturation, so we don't care about the results in the overflow case, we
// just want to avoid undefined behavior.
//
// tIntegerType may be int32 or any narrower signed type.
template <typename tIntegerType>
tIntegerType ShiftLeft(tIntegerType a, int offset) {
const std::int64_t wide_a = static_cast<std::int64_t>(a);
const std::int64_t wide_shifted = wide_a * (1 << offset);
const auto min = std::numeric_limits<tIntegerType>::min();
const auto max = std::numeric_limits<tIntegerType>::max();
return wide_shifted < min
? min
: wide_shifted > max ? max
: static_cast<tIntegerType>(wide_shifted);
}
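// Example: for tIntegerType = std::int16_t, ShiftLeft(std::int16_t(20000), 2)
// is mathematically 80000, which exceeds the int16 range, so the result
// saturates to 32767 instead of invoking undefined behavior.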
// Integer arithmetic right-shift. Not rounding.
// Relying on implementation-defined, but in-practice-consistent,
// C++ compiler behavior.
template <typename tIntegerType>
tIntegerType ShiftRight(tIntegerType a, int offset) {
return a >> offset;
}
// Each bit of the result is set to the corresponding bit of either then_val or
// else_val depending on whether the corresponding bit of if_mask is set.
// Equivalent to the VBSL instruction in ARM NEON.
template <typename tIntegerType>
tIntegerType SelectUsingMask(tIntegerType if_mask, tIntegerType then_val,
tIntegerType else_val) {
return BitXor(BitAnd(if_mask, then_val), BitAnd(BitNot(if_mask), else_val));
}
// For each input scalar, the corresponding bits of the result are set if the
// input scalar is non-zero.
template <typename tIntegerType>
tIntegerType MaskIfNonZero(tIntegerType a) {
static constexpr tIntegerType zero = 0;
return a ? BitNot(zero) : zero;
}
// For each input scalar, the corresponding bits of the result are set if the
// input scalar is zero.
template <typename tIntegerType>
tIntegerType MaskIfZero(tIntegerType a) {
return MaskIfNonZero<tIntegerType>(!a);
}
// For each pair of input scalars, the corresponding bits of the result are
// set if the input scalars are equal.
template <typename tIntegerType>
tIntegerType MaskIfEqual(tIntegerType a, tIntegerType b) {
return MaskIfNonZero<tIntegerType>(a == b);
}
// For each pair of input scalars, the corresponding bits of the result are
// set if the input scalars are not equal.
template <typename tIntegerType>
tIntegerType MaskIfNotEqual(tIntegerType a, tIntegerType b) {
return MaskIfNonZero<tIntegerType>(a != b);
}
// For each pair of input scalars, the corresponding bits of the result are
// set if the input scalars a, b satisfy a > b.
template <typename tIntegerType>
tIntegerType MaskIfGreaterThan(tIntegerType a, tIntegerType b) {
return MaskIfNonZero<tIntegerType>(a > b);
}
// For each pair of input scalars, the corresponding bits of the result are
// set if the input scalars a, b satisfy a >= b.
template <typename tIntegerType>
tIntegerType MaskIfGreaterThanOrEqual(tIntegerType a, tIntegerType b) {
return MaskIfNonZero<tIntegerType>(a >= b);
}
// For each pair of input scalars, the corresponding bits of the result are
// set if the input scalars a, b satisfy a < b.
template <typename tIntegerType>
tIntegerType MaskIfLessThan(tIntegerType a, tIntegerType b) {
return MaskIfNonZero<tIntegerType>(a < b);
}
// For each pair of input scalars, the corresponding bits of the result are
// set if the input scalars a, b satisfy a <= b.
template <typename tIntegerType>
tIntegerType MaskIfLessThanOrEqual(tIntegerType a, tIntegerType b) {
return MaskIfNonZero<tIntegerType>(a <= b);
}
// Returns true if all of the input scalars are nonzero.
// This function may currently assume that each of the input scalars has either
// all or none of its bits set. Otherwise, its behavior is currently undefined.
template <typename tIntegerType>
bool All(tIntegerType a) {
return a;
}
// Returns true if any of the input scalars are nonzero.
// This function may currently assume that each of the input scalars has either
// all or none of its bits set. Otherwise, its behavior is currently undefined.
template <typename tIntegerType>
bool Any(tIntegerType a) {
return a;
}
// Returns (a+b)/2, rounded to the nearest integer.
// Equivalent to VRHADD in the ARM NEON instruction set.
template <typename IntegerType>
IntegerType RoundingHalfSum(IntegerType a, IntegerType b) {
static_assert(std::is_same<IntegerType, void>::value, "unimplemented");
(void)b;
return a;
}
template <>
inline std::int32_t RoundingHalfSum(std::int32_t a, std::int32_t b) {
std::int64_t a64 = a;
std::int64_t b64 = b;
std::int64_t sum = a64 + b64;
std::int64_t sign = sum >= 0 ? 1 : -1;
return static_cast<std::int32_t>((sum + sign) / 2);
}
template <>
inline std::int16_t RoundingHalfSum(std::int16_t a, std::int16_t b) {
std::int32_t a32 = a;
std::int32_t b32 = b;
std::int32_t sum = a32 + b32;
std::int32_t sign = sum >= 0 ? 1 : -1;
return static_cast<std::int16_t>((sum + sign) / 2);
}
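// Example: RoundingHalfSum(3, 4) == 4, since (3 + 4 + 1) / 2 == 4, and
// RoundingHalfSum(-3, -4) == -4, since (-7 - 1) / 2 == -4: halves round
// away from zero in these scalar specializations.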
template <typename IntegerType>
IntegerType SaturatingAdd(IntegerType a, IntegerType b) {
static_assert(std::is_same<IntegerType, void>::value, "unimplemented");
(void)b;
return a;
}
// So far this is only needed for int16.
template <>
inline std::int16_t SaturatingAdd(std::int16_t a, std::int16_t b) {
std::int32_t a32 = a;
std::int32_t b32 = b;
std::int32_t sum = a32 + b32;
return static_cast<std::int16_t>(
std::min(static_cast<std::int32_t>(32767),
std::max(static_cast<std::int32_t>(-32768), sum)));
}
// Returns a+b, saturating if the integers are 16bit or narrower,
// otherwise just a plain addition.
template <typename IntegerType, bool Is16Bit>
struct AddSaturatingIf16BitImpl {
static IntegerType Run(IntegerType a, IntegerType b) { return Add(a, b); }
};
template <typename IntegerType>
struct AddSaturatingIf16BitImpl<IntegerType, true> {
static IntegerType Run(IntegerType a, IntegerType b) {
return SaturatingAdd(a, b);
}
};
template <typename IntegerType>
IntegerType AddSaturatingIf16Bit(IntegerType a, IntegerType b) {
using ScalarType =
typename FixedPointRawTypeTraits<IntegerType>::ScalarRawType;
return AddSaturatingIf16BitImpl<IntegerType, sizeof(ScalarType) == 2>::Run(a,
b);
}
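// Example: AddSaturatingIf16Bit(std::int16_t(30000), std::int16_t(10000))
// saturates to 32767, while the same call on std::int32_t values is a plain
// (possibly overflowing) addition.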
// Returns the integer that represents the product of two fixed-point
// numbers, interpreting all integers as fixed-point values in the
// interval [-1, 1), rounding to the nearest value, and saturating
// -1 * -1 to the maximum value (since 1 is not in the half-open
// interval [-1, 1)).
//
// [The explanation below specializes to std::int32_t for example purposes.]
//
// The mapping between IntegerType and the interval [-1, 1) is unique and
// implied by IntegerType, which is assumed to be signed. For example,
// for IntegerType==std::int32_t, the mapping is
// real_value = integer_value / 2^31.
// So in this case, and leaving aside rounding and saturating, this
// function computes ((a / 2^31) * (b / 2^31)) * 2^31, which simplifies to
// (a * b) / 2^31.
//
// The 'doubling' part in the name of this function comes from the fact that
// this operation is very close to a "multiply-high" operation, keeping only
// the top half bits, except that that would be effectively computing
// (a * b) / 2^32,
// so here we are computing 2x that, since
// 1/2^31 = 2 * 1/2^32.
// The idea is to use all of the available 32 bits in the destination int32
// value.
//
// [End of the explanation specializing to int32.]
//
// This is equivalent to the VQRDMULH instruction in ARM NEON.
template <typename IntegerType>
IntegerType SaturatingRoundingDoublingHighMul(IntegerType a, IntegerType b) {
static_assert(std::is_same<IntegerType, void>::value, "unimplemented");
(void)b;
return a;
}
// This function implements the same computation as the ARMv7 NEON VQRDMULH
// instruction.
template <>
inline std::int32_t SaturatingRoundingDoublingHighMul(std::int32_t a,
std::int32_t b) {
bool overflow = a == b && a == std::numeric_limits<std::int32_t>::min();
std::int64_t a_64(a);
std::int64_t b_64(b);
std::int64_t ab_64 = a_64 * b_64;
std::int32_t nudge = ab_64 >= 0 ? (1 << 30) : (1 - (1 << 30));
std::int32_t ab_x2_high32 =
static_cast<std::int32_t>((ab_64 + nudge) / (1ll << 31));
return overflow ? std::numeric_limits<std::int32_t>::max() : ab_x2_high32;
}
template <>
inline std::int16_t SaturatingRoundingDoublingHighMul(std::int16_t a,
std::int16_t b) {
bool overflow = a == b && a == std::numeric_limits<std::int16_t>::min();
std::int32_t a_32(a);
std::int32_t b_32(b);
std::int32_t ab_32 = a_32 * b_32;
std::int16_t nudge = ab_32 >= 0 ? (1 << 14) : (1 - (1 << 14));
std::int16_t ab_x2_high16 =
static_cast<std::int16_t>((ab_32 + nudge) / (1 << 15));
return overflow ? std::numeric_limits<std::int16_t>::max() : ab_x2_high16;
}
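// Example, in Q0.31 where raw value v represents v / 2^31:
// SaturatingRoundingDoublingHighMul(1 << 30, 1 << 30) == 1 << 29,
// i.e. 0.5 * 0.5 == 0.25. The lone saturating case is
// INT32_MIN * INT32_MIN (i.e. -1 * -1), which returns INT32_MAX.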
// Correctly-rounded-to-nearest division by a power-of-two.
// Also known as a rounding arithmetic right shift.
template <typename IntegerType>
inline IntegerType RoundingDivideByPOT(IntegerType x, int exponent) {
assert(exponent >= 0);
assert(exponent <= 31);
const IntegerType mask = Dup<IntegerType>((1ll << exponent) - 1);
const IntegerType zero = Dup<IntegerType>(0);
const IntegerType one = Dup<IntegerType>(1);
const IntegerType remainder = BitAnd(x, mask);
const IntegerType threshold =
Add(ShiftRight(mask, 1), BitAnd(MaskIfLessThan(x, zero), one));
return Add(ShiftRight(x, exponent),
BitAnd(MaskIfGreaterThan(remainder, threshold), one));
}
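// Example: RoundingDivideByPOT(5, 1) == 3 (2.5 rounds to 3) and
// RoundingDivideByPOT(-5, 1) == -3 (-2.5 rounds to -3): ties round away
// from zero.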
// Returns the product of a run-time integer value by a compile-time power
// of two, with either a positive exponent (equivalent to an arithmetic
// left shift, saturating) or a negative exponent (equivalent to an arithmetic
// right shift, rounding to nearest).
template <int Exponent, typename IntegerType,
int ExponentSign = (Exponent > 0 ? 1 : Exponent < 0 ? -1 : 0)>
struct ImplSaturatingRoundingMultiplyByPOT {};
template <int Exponent, typename IntegerType>
struct ImplSaturatingRoundingMultiplyByPOT<Exponent, IntegerType, 0> {
static IntegerType eval(IntegerType x) { return x; }
};
template <int Exponent, typename IntegerType>
struct ImplSaturatingRoundingMultiplyByPOT<Exponent, IntegerType, 1> {
static IntegerType eval(IntegerType x) {
using ScalarIntegerType =
typename FixedPointRawTypeTraits<IntegerType>::ScalarRawType;
const IntegerType min =
Dup<IntegerType>(std::numeric_limits<ScalarIntegerType>::min());
const IntegerType max =
Dup<IntegerType>(std::numeric_limits<ScalarIntegerType>::max());
const int ScalarIntegerTypeBits = 8 * sizeof(ScalarIntegerType);
const std::int32_t threshold =
((1 << (ScalarIntegerTypeBits - 1 - Exponent)) - 1);
const IntegerType positive_mask =
MaskIfGreaterThan(x, Dup<IntegerType>(threshold));
const IntegerType negative_mask =
MaskIfLessThan(x, Dup<IntegerType>(-threshold));
IntegerType result = ShiftLeft(x, Exponent);
result = SelectUsingMask(positive_mask, max, result);
result = SelectUsingMask(negative_mask, min, result);
return result;
}
};
template <int Exponent, typename IntegerType>
struct ImplSaturatingRoundingMultiplyByPOT<Exponent, IntegerType, -1> {
static IntegerType eval(IntegerType x) {
return RoundingDivideByPOT<IntegerType>(x, -Exponent);
}
};
template <int Exponent, typename IntegerType>
IntegerType SaturatingRoundingMultiplyByPOT(IntegerType x) {
return ImplSaturatingRoundingMultiplyByPOT<Exponent, IntegerType>::eval(x);
}
// Part 2: the FixedPoint class.
// A FixedPoint object represents a fixed-point value stored in the underlying
// integer type tRawType, if tRawType is a plain scalar integer type.
// Alternatively, tRawType may be a SIMD type (e.g. NEON int32x4_t) in which
// case a FixedPoint object represents a corresponding SIMD vector of fixed
// point values.
//
// tIntegerBits describes the range of the fixed-point format: if
// tIntegerBits == m then the range of representable values is the half-open
// interval [-2^m; 2^m) where the open boundary on the right side means that
// 2^m is not representable (how close the maximum representable value is to
// it, depends on bit-depth of tRawType).
//
// In "Q format notation",
// https://en.wikipedia.org/wiki/Q_(number_format)
// we are describing the format
// Qm.n
// where
// m = tIntegerBits
// and
// n = NumberOfBits(tRawType) - (m + 1)
// Note that the (m + 1) in the above line is because we adopt the convention
// that we count the integer bits exclusively of the sign bit; so (m + 1) is
// the total number of integer bits inclusive of the sign bit.
//
// Accordingly, the number of integral representable values in our range
// [-2^m ; 2^m)
// is equal to 2^(m+1).
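// Example: FixedPoint<std::int32_t, 0> is the Q0.31 format (raw value
// 1 << 30 represents 0.5), while FixedPoint<std::int32_t, 3> is Q3.28,
// covering [-8, 8) with raw value 1 << 28 representing 1.0.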
template <typename tRawType, int tIntegerBits>
class FixedPoint {
public:
typedef tRawType RawType;
typedef FixedPointRawTypeTraits<RawType> RawTypeTraits;
typedef typename RawTypeTraits::ScalarRawType ScalarRawType;
static constexpr int kTotalBits = 8 * sizeof(ScalarRawType);
static constexpr int kIntegerBits = tIntegerBits;
static constexpr int kFractionalBits = kTotalBits - 1 - kIntegerBits;
static_assert(kIntegerBits >= 0 && kIntegerBits < kTotalBits,
"bad IntegerBits");
typedef FixedPoint<ScalarRawType, kIntegerBits> ScalarFixedPointType;
static const ScalarRawType ScalarRawMin() {
return std::numeric_limits<ScalarRawType>::min();
}
static const ScalarRawType ScalarRawMax() {
return std::numeric_limits<ScalarRawType>::max();
}
static const ScalarRawType RawMin() {
return VectorFromScalar(ScalarRawMin());
}
static const ScalarRawType RawMax() {
return VectorFromScalar(ScalarRawMax());
}
static FixedPoint FromRaw(RawType x) {
FixedPoint retval;
retval.raw() = x;
return retval;
}
static FixedPoint FromScalarRaw(ScalarRawType x) {
FixedPoint retval;
retval.raw() = Dup<RawType>(x);
return retval;
}
static FixedPoint FromScalarFixedPoint(ScalarFixedPointType x) {
return FromScalarRaw(x.raw());
}
template <int Exponent>
static FixedPoint ConstantPOT() {
static constexpr int kOffset = kFractionalBits + Exponent;
static_assert(
kOffset < 31,
"Constant not exactly representable in this fixed-point format");
return FromScalarRaw(ScalarRawType(1) << kOffset);
}
static FixedPoint Zero() { return FromScalarRaw(0); }
static FixedPoint One() {
return FromScalarRaw(
kIntegerBits == 0
? ScalarRawMax()
: (ScalarRawType(1) << (kIntegerBits == 0 ? 0 : kFractionalBits)));
}
static FixedPoint FromDouble(double x) {
const double min_bound = static_cast<double>(ScalarRawMin());
const double max_bound = static_cast<double>(ScalarRawMax());
return FromScalarRaw(static_cast<ScalarRawType>(std::min(
std::max(round(x * static_cast<double>(1ll << kFractionalBits)),
min_bound),
max_bound)));
}
RawType raw() const { return i_; }
RawType& raw() { return i_; }
private:
RawType i_;
};
// Part 3: implementation of arithmetic operators for the
// FixedPoint class, and a few related functions.
// A FixedPoint multiplication is just a
// SaturatingRoundingDoublingHighMul operation on the underlying
// raw integer values. The IntegerBits simply add up, as is obvious
// from the fact that the range is [-2^IntegerBits, 2^IntegerBits).
template <typename tRawType, int tIntegerBits_a, int tIntegerBits_b>
FixedPoint<tRawType, tIntegerBits_a + tIntegerBits_b> operator*(
FixedPoint<tRawType, tIntegerBits_a> a,
FixedPoint<tRawType, tIntegerBits_b> b) {
FixedPoint<tRawType, tIntegerBits_a + tIntegerBits_b> c;
c.raw() = SaturatingRoundingDoublingHighMul(a.raw(), b.raw());
return c;
}
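// Example: multiplying a Q1.30 value (tIntegerBits == 1) by a Q2.29 value
// (tIntegerBits == 2) yields a Q3.28 value (tIntegerBits == 3).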
// Tweaking IntegerBits gives exact multiplication by a power of two.
template <int tExponent, typename tRawType, int tIntegerBits>
FixedPoint<tRawType, tExponent + tIntegerBits> ExactMulByPot(
FixedPoint<tRawType, tIntegerBits> a) {
FixedPoint<tRawType, tExponent + tIntegerBits> c;
c.raw() = a.raw();
return c;
}
// If we want to leave IntegerBits fixed, then multiplication
// by a power of two has to be saturating/rounding, not exact anymore.
template <int tExponent, typename tRawType, int tIntegerBits>
FixedPoint<tRawType, tIntegerBits> SaturatingRoundingMultiplyByPOT(
FixedPoint<tRawType, tIntegerBits> a) {
return FixedPoint<tRawType, tIntegerBits>::FromRaw(
SaturatingRoundingMultiplyByPOT<tExponent>(a.raw()));
}
// Generic arithmetic operators.
#define MAKE_FIXEDPOINT_UNARY_FUNC(FuncName, ImplFuncName) \
template <typename tRawType, int tIntegerBits> \
FixedPoint<tRawType, tIntegerBits> FuncName( \
FixedPoint<tRawType, tIntegerBits> a) { \
return FixedPoint<tRawType, tIntegerBits>::FromRaw(ImplFuncName(a.raw())); \
}
#define MAKE_FIXEDPOINT_BINARY_FUNC(FuncName, ImplFuncName) \
template <typename tRawType, int tIntegerBits> \
FixedPoint<tRawType, tIntegerBits> FuncName( \
FixedPoint<tRawType, tIntegerBits> a, \
FixedPoint<tRawType, tIntegerBits> b) { \
return FixedPoint<tRawType, tIntegerBits>::FromRaw( \
ImplFuncName(a.raw(), b.raw())); \
}
MAKE_FIXEDPOINT_UNARY_FUNC(operator-, Neg)
MAKE_FIXEDPOINT_UNARY_FUNC(operator~, BitNot)
MAKE_FIXEDPOINT_BINARY_FUNC(operator+, Add)
MAKE_FIXEDPOINT_BINARY_FUNC(operator-, Sub)
MAKE_FIXEDPOINT_BINARY_FUNC(operator&, BitAnd)
MAKE_FIXEDPOINT_BINARY_FUNC(operator^, BitXor)
MAKE_FIXEDPOINT_BINARY_FUNC(operator|, BitOr)
MAKE_FIXEDPOINT_BINARY_FUNC(RoundingHalfSum, RoundingHalfSum)
#undef MAKE_FIXEDPOINT_UNARY_FUNC
#undef MAKE_FIXEDPOINT_BINARY_FUNC
#define MAKE_FIXEDPOINT_UNARY_FUNC_RETURNING_RAW(FuncName) \
template <typename tRawType, int tIntegerBits> \
tRawType FuncName(FixedPoint<tRawType, tIntegerBits> a) { \
return FuncName(a.raw()); \
}
#define MAKE_FIXEDPOINT_BINARY_FUNC_RETURNING_RAW(FuncName) \
template <typename tRawType, int tIntegerBits> \
tRawType FuncName(FixedPoint<tRawType, tIntegerBits> a, \
FixedPoint<tRawType, tIntegerBits> b) { \
return FuncName(a.raw(), b.raw()); \
}
MAKE_FIXEDPOINT_UNARY_FUNC_RETURNING_RAW(MaskIfZero)
MAKE_FIXEDPOINT_UNARY_FUNC_RETURNING_RAW(MaskIfNonZero)
MAKE_FIXEDPOINT_BINARY_FUNC_RETURNING_RAW(MaskIfEqual)
MAKE_FIXEDPOINT_BINARY_FUNC_RETURNING_RAW(MaskIfNotEqual)
MAKE_FIXEDPOINT_BINARY_FUNC_RETURNING_RAW(MaskIfGreaterThan)
MAKE_FIXEDPOINT_BINARY_FUNC_RETURNING_RAW(MaskIfGreaterThanOrEqual)
MAKE_FIXEDPOINT_BINARY_FUNC_RETURNING_RAW(MaskIfLessThan)
MAKE_FIXEDPOINT_BINARY_FUNC_RETURNING_RAW(MaskIfLessThanOrEqual)
#undef MAKE_FIXEDPOINT_UNARY_FUNC_RETURNING_RAW
#undef MAKE_FIXEDPOINT_BINARY_FUNC_RETURNING_RAW
template <typename tRawType, int tIntegerBits>
FixedPoint<tRawType, tIntegerBits> SelectUsingMask(
tRawType if_mask, FixedPoint<tRawType, tIntegerBits> then_val,
FixedPoint<tRawType, tIntegerBits> else_val) {
return FixedPoint<tRawType, tIntegerBits>::FromRaw(
SelectUsingMask(if_mask, then_val.raw(), else_val.raw()));
}
template <typename tRawType, int tIntegerBits>
bool operator==(FixedPoint<tRawType, tIntegerBits> a,
FixedPoint<tRawType, tIntegerBits> b) {
return All(MaskIfEqual(a.raw(), b.raw()));
}
template <typename tRawType, int tIntegerBits>
bool operator!=(FixedPoint<tRawType, tIntegerBits> a,
FixedPoint<tRawType, tIntegerBits> b) {
return !(a == b);
}
template <typename tRawType, int tIntegerBits>
FixedPoint<tRawType, tIntegerBits> SaturatingAdd(
FixedPoint<tRawType, tIntegerBits> a,
FixedPoint<tRawType, tIntegerBits> b) {
return FixedPoint<tRawType, tIntegerBits>::FromRaw(
SaturatingAdd(a.raw(), b.raw()));
}
template <typename tRawType, int tIntegerBits>
FixedPoint<tRawType, tIntegerBits> AddSaturatingIf16Bit(
FixedPoint<tRawType, tIntegerBits> a,
FixedPoint<tRawType, tIntegerBits> b) {
return FixedPoint<tRawType, tIntegerBits>::FromRaw(
AddSaturatingIf16Bit(a.raw(), b.raw()));
}
// Conversion to floating-point.
template <typename tRawType, int tIntegerBits>
double ToDouble(FixedPoint<tRawType, tIntegerBits> x) {
static_assert(FixedPointRawTypeTraits<tRawType>::kLanes == 1,
"not applicable to SIMD types");
typedef FixedPoint<tRawType, tIntegerBits> F;
return x.raw() / static_cast<double>(1ll << F::kFractionalBits);
}
// Rescale changes the number of IntegerBits and updates the underlying
// raw integer value accordingly.
template <int tIntegerBitsDst, typename tRawType, int tIntegerBitsSrc>
FixedPoint<tRawType, tIntegerBitsDst> Rescale(
FixedPoint<tRawType, tIntegerBitsSrc> x) {
static constexpr int kExponent = tIntegerBitsSrc - tIntegerBitsDst;
FixedPoint<tRawType, tIntegerBitsDst> result;
result.raw() = SaturatingRoundingMultiplyByPOT<kExponent>(x.raw());
return result;
}
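// Example: Rescale<0>(x) for x of type FixedPoint<std::int32_t, 2> has
// kExponent == 2, so the raw value is saturating-left-shifted by 2 bits,
// mapping raw Q2.29 onto raw Q0.31 while preserving the represented value.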
// CheckedFixedPointConstant allows specifying fixed-point constants
// initialized as real numbers, in a way that does not compile floating-point
// arithmetic into production code, yet still checks agreement with the
// floating-point expressions when asserts are enabled.
//
// The raw integer value provided is always an int32, encoding a 32-bit
// fixed-point value, regardless of the actual Scalar type. This allows
// writing generic code that applies just as well to the 32-bit and 16-bit
// cases. In the 16-bit case, the raw integer value is internally
// rounding-shifted by 16 bits to the right.
template <typename FixedPointType>
inline typename FixedPointType::ScalarRawType RescaleConstantInitializer(
std::int32_t int32_value) {
typedef typename FixedPointType::ScalarRawType ScalarRawType;
static constexpr int ScalarTypeBits = 8 * sizeof(ScalarRawType);
return static_cast<ScalarRawType>(
RoundingDivideByPOT<std::int32_t>(int32_value, 32 - ScalarTypeBits));
}
#ifdef GEMMLOWP_ENABLE_FIXEDPOINT_CONSTANTS_CHECKS
template <typename FixedPointType>
FixedPointType CheckedFixedPointConstant(std::int32_t raw_value,
double double_value) {
const FixedPointType result = FixedPointType::FromScalarRaw(raw_value);
assert(result == FixedPointType::FromDouble(double_value));
return result;
}
#define GEMMLOWP_CHECKED_FIXEDPOINT_CONSTANT(FixedPointType, \
ScalarRawInt32Value, DoubleValue) \
(gemmlowp::CheckedFixedPointConstant<FixedPointType>( \
gemmlowp::RescaleConstantInitializer<FixedPointType>( \
ScalarRawInt32Value), \
DoubleValue))
#else
#define GEMMLOWP_CHECKED_FIXEDPOINT_CONSTANT(FixedPointType, \
ScalarRawInt32Value, DoubleValue) \
(FixedPointType::FromScalarRaw( \
gemmlowp::RescaleConstantInitializer<FixedPointType>( \
ScalarRawInt32Value)))
#endif
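// Example: for F2 = FixedPoint<std::int32_t, 2> (Q2.29), the 48/17 constant
// used below is round((48.0 / 17.0) * 2^29) == 1515870810, which is exactly
// the raw int32 value passed to GEMMLOWP_CHECKED_FIXEDPOINT_CONSTANT there.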
// Implementation of exponential function.
// Returns exp(x) for x in [-1/4, 0).
template <typename tRawType>
FixedPoint<tRawType, 0> exp_on_interval_between_negative_one_quarter_and_0_excl(
FixedPoint<tRawType, 0> a) {
typedef FixedPoint<tRawType, 0> F;
const F constant_term =
GEMMLOWP_CHECKED_FIXEDPOINT_CONSTANT(F, 1895147668, std::exp(-1.0 / 8.0));
const F constant_1_over_3 =
GEMMLOWP_CHECKED_FIXEDPOINT_CONSTANT(F, 715827883, 1.0 / 3.0);
// We're evaluating a Taylor expansion around -1/8, so we do the change of
// variable: x = a + 1/8.
// In fixed-point with 0 integer bits, 1/8 is represented by 1 << 28.
F x = a + F::template ConstantPOT<-3>();
F x2 = x * x;
F x3 = x2 * x;
F x4 = x2 * x2;
F x4_over_4 = SaturatingRoundingMultiplyByPOT<-2>(x4);
F x4_over_24_plus_x3_over_6_plus_x2_over_2 =
SaturatingRoundingMultiplyByPOT<-1>(
((x4_over_4 + x3) * constant_1_over_3) + x2);
return AddSaturatingIf16Bit(
constant_term,
constant_term * (x + x4_over_24_plus_x3_over_6_plus_x2_over_2));
}
// Returns exp(x) for x < 0.
template <typename tRawType, int tIntegerBits>
FixedPoint<tRawType, 0> exp_on_negative_values(
FixedPoint<tRawType, tIntegerBits> a) {
typedef FixedPoint<tRawType, tIntegerBits> InputF;
typedef FixedPoint<tRawType, 0> ResultF;
static constexpr int kFractionalBits = InputF::kFractionalBits;
static constexpr int kIntegerBits = InputF::kIntegerBits;
const InputF kOneQuarter = InputF::template ConstantPOT<-2>();
InputF mask = kOneQuarter - InputF::FromScalarRaw(1);
InputF a_mod_quarter_minus_one_quarter = (a & mask) - kOneQuarter;
ResultF result = exp_on_interval_between_negative_one_quarter_and_0_excl(
Rescale<0>(a_mod_quarter_minus_one_quarter));
tRawType remainder = (a_mod_quarter_minus_one_quarter - a).raw();
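// Range reduction: with r = a_mod_quarter_minus_one_quarter in [-1/4, 0),
// exp(a) == exp(r) * exp(-(r - a)). The barrel shifter below reconstructs
// exp(-(r - a)) by multiplying in one precomputed exp(-2^Exponent) factor
// for each set bit of 'remainder', which encodes r - a.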
#define GEMMLOWP_EXP_BARREL_SHIFTER(Exponent, FixedPointMultiplier) \
if (kIntegerBits > Exponent) { \
const ResultF kMultiplier = GEMMLOWP_CHECKED_FIXEDPOINT_CONSTANT( \
ResultF, FixedPointMultiplier, std::exp(-std::pow(2.0, Exponent))); \
static constexpr int kShiftAmount = \
kIntegerBits > Exponent ? kFractionalBits + Exponent : 0; \
result = SelectUsingMask( \
MaskIfNonZero(BitAnd(remainder, Dup<tRawType>(1 << kShiftAmount))), \
result * kMultiplier, result); \
}
GEMMLOWP_EXP_BARREL_SHIFTER(-2, 1672461947);
GEMMLOWP_EXP_BARREL_SHIFTER(-1, 1302514674);
GEMMLOWP_EXP_BARREL_SHIFTER(+0, 790015084);
GEMMLOWP_EXP_BARREL_SHIFTER(+1, 290630308);
GEMMLOWP_EXP_BARREL_SHIFTER(+2, 39332535);
GEMMLOWP_EXP_BARREL_SHIFTER(+3, 720401);
GEMMLOWP_EXP_BARREL_SHIFTER(+4, 242);
#undef GEMMLOWP_EXP_BARREL_SHIFTER
static constexpr int clampB = kIntegerBits > 5 ? 36 - kIntegerBits : 0;
if (kIntegerBits > 5) {
const InputF clamp =
GEMMLOWP_CHECKED_FIXEDPOINT_CONSTANT(InputF, -(1 << clampB), -32.0);
result = SelectUsingMask(MaskIfLessThan(a, clamp), ResultF::Zero(), result);
}
result = SelectUsingMask(MaskIfZero(a), ResultF::One(), result);
return result;
}
// Implementation of tanh: (1 - exp(-2x)) / (1 + exp(-2x)).
// Returns (1 - x) / (1 + x) for x in (0, 1).
template <typename tRawType>
FixedPoint<tRawType, 0> one_minus_x_over_one_plus_x_for_x_in_0_1(
FixedPoint<tRawType, 0> a) {
typedef FixedPoint<tRawType, 0> F0;
typedef FixedPoint<tRawType, 2> F2;
F0 half_denominator = RoundingHalfSum(a, F0::One());
// Newton-Raphson division
// https://en.wikipedia.org/wiki/Division_algorithm#Newton.E2.80.93Raphson_division
// Refer to that page for the logic behind the 48/17 and 32/17 constants.
const F2 constant_48_over_17 =
GEMMLOWP_CHECKED_FIXEDPOINT_CONSTANT(F2, 1515870810, 48.0 / 17.0);
const F2 constant_neg_32_over_17 =
GEMMLOWP_CHECKED_FIXEDPOINT_CONSTANT(F2, -1010580540, -32.0 / 17.0);
F2 x = constant_48_over_17 + half_denominator * constant_neg_32_over_17;
for (int i = 0; i < 3; i++) {
F2 half_denominator_times_x = half_denominator * x;
F2 one_minus_half_denominator_times_x =
F2::One() - half_denominator_times_x;
x = x + Rescale<2>(x * one_minus_half_denominator_times_x);
}
return Rescale<0>(x - F2::One());
}
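// Note: x converges to 1 / half_denominator == 2 / (1 + a), so
// x - F2::One() == (1 - a) / (1 + a); three Newton-Raphson steps from the
// 48/17 - 32/17 * D seed give roughly full 32-bit precision.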
// Returns -tanh(x) for x < 0.
template <typename tRawType, int tIntegerBits>
FixedPoint<tRawType, 0> neg_tanh_on_negative_values(
FixedPoint<tRawType, tIntegerBits> a) {
return one_minus_x_over_one_plus_x_for_x_in_0_1(
exp_on_negative_values(ExactMulByPot<1>(a)));
}
// Returns tanh(x) for any x.
template <typename tRawType, int tIntegerBits>
FixedPoint<tRawType, 0> tanh(FixedPoint<tRawType, tIntegerBits> a) {
typedef FixedPoint<tRawType, tIntegerBits> InputF;
typedef FixedPoint<tRawType, 0> ResultF;
tRawType mask_if_negative = MaskIfLessThan(a, InputF::Zero());
tRawType mask_if_zero = MaskIfZero(a);
InputF n = SelectUsingMask(mask_if_negative, a, -a);
ResultF t = neg_tanh_on_negative_values(n);
return SelectUsingMask(mask_if_zero, ResultF::Zero(),
SelectUsingMask(mask_if_negative, -t, t));
}
// Implementation of logistic function.
// Returns 1 / (1 + x) for x in (0, 1).
template <typename tRawType>
FixedPoint<tRawType, 0> one_over_one_plus_x_for_x_in_0_1(
FixedPoint<tRawType, 0> a) {
typedef FixedPoint<tRawType, 0> F0;
typedef FixedPoint<tRawType, 2> F2;
F0 half_denominator = RoundingHalfSum(a, F0::One());
// Newton-Raphson division
// https://en.wikipedia.org/wiki/Division_algorithm#Newton.E2.80.93Raphson_division
// Refer to that page for the logic behind the 48/17 and 32/17 constants.
const F2 constant_48_over_17 =
GEMMLOWP_CHECKED_FIXEDPOINT_CONSTANT(F2, 1515870810, 48.0 / 17.0);
const F2 constant_neg_32_over_17 =
GEMMLOWP_CHECKED_FIXEDPOINT_CONSTANT(F2, -1010580540, -32.0 / 17.0);
F2 x = constant_48_over_17 + half_denominator * constant_neg_32_over_17;
for (int i = 0; i < 3; i++) {
F2 half_denominator_times_x = half_denominator * x;
F2 one_minus_half_denominator_times_x =
F2::One() - half_denominator_times_x;
x = x + Rescale<2>(x * one_minus_half_denominator_times_x);
}
return Rescale<0>(ExactMulByPot<-1>(x));
}
// Returns logistic(x) = 1 / (1 + exp(-x)) for x > 0.
template <typename tRawType, int tIntegerBits>
FixedPoint<tRawType, 0> logistic_on_positive_values(
FixedPoint<tRawType, tIntegerBits> a) {
return one_over_one_plus_x_for_x_in_0_1(exp_on_negative_values(-a));
}
// Returns logistic(x) = 1 / (1 + exp(-x)) for any x.
template <typename tRawType, int tIntegerBits>
FixedPoint<tRawType, 0> logistic(FixedPoint<tRawType, tIntegerBits> a) {
typedef FixedPoint<tRawType, tIntegerBits> InputF;
typedef FixedPoint<tRawType, 0> ResultF;
tRawType mask_if_positive = MaskIfGreaterThan(a, InputF::Zero());
tRawType mask_if_zero = MaskIfZero(a);
InputF abs_input = SelectUsingMask(mask_if_positive, a, -a);
ResultF result_if_positive = logistic_on_positive_values(abs_input);
ResultF result_if_negative = ResultF::One() - result_if_positive;
const ResultF one_half =
GEMMLOWP_CHECKED_FIXEDPOINT_CONSTANT(ResultF, 1 << 30, 0.5);
return SelectUsingMask(mask_if_zero, one_half,
SelectUsingMask(mask_if_positive, result_if_positive,
result_if_negative));
}
} // end namespace gemmlowp
#ifdef GEMMLOWP_NEON
#include "./fixedpoint_neon.h"
#elif defined(GEMMLOWP_AVX2)
#include "./fixedpoint_avx.h"
#elif defined(GEMMLOWP_SSE4)
#include "./fixedpoint_sse.h"
#elif defined(GEMMLOWP_MSA)
#include "./fixedpoint_msa.h"
#endif
#endif // GEMMLOWP_INTERNAL_FIXEDPOINT_H_


@@ -0,0 +1,384 @@
// Copyright 2015 Google Inc. All Rights Reserved.
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.
// fixedpoint_sse.h: optimized SSE specializations of the templates
// in fixedpoint.h.
#ifndef GEMMLOWP_INTERNAL_FIXEDPOINT_SSE_H_
#define GEMMLOWP_INTERNAL_FIXEDPOINT_SSE_H_
#include <smmintrin.h>
#include "fixedpoint.h"
namespace gemmlowp {
// SSE intrinsics are not finely typed: there is a single __m128i vector
// type that does not distinguish between "int32x4" and "int16x8" use
// cases, unlike the NEON equivalents. Because we had initially focused
// on int32x4, we specialized these fixed-point templates directly for
// __m128i, hard-coding the int32x4 semantics and leaving no room for
// int16x8 semantics. We amend that by adding a separate data type,
// int16x8_m128i, that wraps __m128i while being a distinct type.
struct int16x8_m128i {
int16x8_m128i() {}
explicit int16x8_m128i(__m128i w) : v(w) {}
~int16x8_m128i() {}
__m128i v;
};
template <>
struct FixedPointRawTypeTraits<__m128i> {
typedef std::int32_t ScalarRawType;
static constexpr int kLanes = 4;
};
template <>
struct FixedPointRawTypeTraits<int16x8_m128i> {
typedef std::int16_t ScalarRawType;
static constexpr int kLanes = 8;
};
template <>
inline __m128i BitAnd(__m128i a, __m128i b) {
return _mm_and_si128(a, b);
}
template <>
inline int16x8_m128i BitAnd(int16x8_m128i a, int16x8_m128i b) {
return int16x8_m128i(_mm_and_si128(a.v, b.v));
}
template <>
inline __m128i BitOr(__m128i a, __m128i b) {
return _mm_or_si128(a, b);
}
template <>
inline int16x8_m128i BitOr(int16x8_m128i a, int16x8_m128i b) {
return int16x8_m128i(_mm_or_si128(a.v, b.v));
}
template <>
inline __m128i BitXor(__m128i a, __m128i b) {
return _mm_xor_si128(a, b);
}
template <>
inline int16x8_m128i BitXor(int16x8_m128i a, int16x8_m128i b) {
return int16x8_m128i(_mm_xor_si128(a.v, b.v));
}
template <>
inline __m128i BitNot(__m128i a) {
return _mm_andnot_si128(a, _mm_set1_epi32(-1));
}
template <>
inline int16x8_m128i BitNot(int16x8_m128i a) {
return int16x8_m128i(_mm_andnot_si128(a.v, _mm_set1_epi16(-1)));
}
template <>
inline __m128i Add(__m128i a, __m128i b) {
return _mm_add_epi32(a, b);
}
template <>
inline int16x8_m128i Add(int16x8_m128i a, int16x8_m128i b) {
return int16x8_m128i(_mm_add_epi16(a.v, b.v));
}
template <>
inline __m128i Mul(__m128i a, __m128i b) {
return _mm_mullo_epi32(a, b);
}
template <>
inline int16x8_m128i Mul(int16x8_m128i a, int16x8_m128i b) {
return int16x8_m128i(_mm_mullo_epi16(a.v, b.v));
}
template <>
inline __m128i Sub(__m128i a, __m128i b) {
return _mm_sub_epi32(a, b);
}
template <>
inline int16x8_m128i Sub(int16x8_m128i a, int16x8_m128i b) {
return int16x8_m128i(_mm_sub_epi16(a.v, b.v));
}
template <>
inline __m128i Neg(__m128i a) {
return _mm_sign_epi32(a, _mm_set1_epi32(-1));
}
template <>
inline int16x8_m128i Neg(int16x8_m128i a) {
return int16x8_m128i(_mm_sign_epi16(a.v, _mm_set1_epi16(-1)));
}
template <>
inline __m128i ShiftLeft(__m128i a, int offset) {
return _mm_slli_epi32(a, offset);
}
template <>
inline int16x8_m128i ShiftLeft(int16x8_m128i a, int offset) {
return int16x8_m128i(_mm_slli_epi16(a.v, offset));
}
template <>
inline __m128i ShiftRight(__m128i a, int offset) {
return _mm_srai_epi32(a, offset);
}
template <>
inline int16x8_m128i ShiftRight(int16x8_m128i a, int offset) {
return int16x8_m128i(_mm_srai_epi16(a.v, offset));
}
template <>
inline __m128i SelectUsingMask(__m128i if_mask, __m128i then_val,
__m128i else_val) {
// borrowed from Intel's arm_neon_sse.h header.
return _mm_or_si128(_mm_and_si128(if_mask, then_val),
_mm_andnot_si128(if_mask, else_val));
}
template <>
inline int16x8_m128i SelectUsingMask(int16x8_m128i if_mask,
int16x8_m128i then_val,
int16x8_m128i else_val) {
// borrowed from Intel's arm_neon_sse.h header.
return int16x8_m128i(SelectUsingMask(if_mask.v, then_val.v, else_val.v));
}
template <>
inline __m128i MaskIfEqual(__m128i a, __m128i b) {
return _mm_cmpeq_epi32(a, b);
}
template <>
inline int16x8_m128i MaskIfEqual(int16x8_m128i a, int16x8_m128i b) {
return int16x8_m128i(_mm_cmpeq_epi16(a.v, b.v));
}
template <>
inline __m128i MaskIfNotEqual(__m128i a, __m128i b) {
return BitNot(MaskIfEqual(a, b));
}
template <>
inline int16x8_m128i MaskIfNotEqual(int16x8_m128i a, int16x8_m128i b) {
return BitNot(MaskIfEqual(a, b));
}
template <>
inline __m128i MaskIfZero(__m128i a) {
return MaskIfEqual(a, _mm_set1_epi32(0));
}
template <>
inline int16x8_m128i MaskIfZero(int16x8_m128i a) {
return MaskIfEqual(a, int16x8_m128i(_mm_set1_epi16(0)));
}
template <>
inline __m128i MaskIfNonZero(__m128i a) {
return MaskIfNotEqual(a, _mm_set1_epi32(0));
}
template <>
inline int16x8_m128i MaskIfNonZero(int16x8_m128i a) {
return MaskIfNotEqual(a, int16x8_m128i(_mm_set1_epi16(0)));
}
template <>
inline __m128i MaskIfGreaterThan(__m128i a, __m128i b) {
return _mm_cmpgt_epi32(a, b);
}
template <>
inline int16x8_m128i MaskIfGreaterThan(int16x8_m128i a, int16x8_m128i b) {
return int16x8_m128i(_mm_cmpgt_epi16(a.v, b.v));
}
template <>
inline __m128i MaskIfLessThan(__m128i a, __m128i b) {
return _mm_cmplt_epi32(a, b);
}
template <>
inline int16x8_m128i MaskIfLessThan(int16x8_m128i a, int16x8_m128i b) {
return int16x8_m128i(_mm_cmplt_epi16(a.v, b.v));
}
template <>
inline __m128i MaskIfGreaterThanOrEqual(__m128i a, __m128i b) {
return BitNot(MaskIfLessThan(a, b));
}
template <>
inline int16x8_m128i MaskIfGreaterThanOrEqual(int16x8_m128i a,
int16x8_m128i b) {
return BitNot(MaskIfLessThan(a, b));
}
template <>
inline __m128i MaskIfLessThanOrEqual(__m128i a, __m128i b) {
return BitNot(MaskIfGreaterThan(a, b));
}
template <>
inline int16x8_m128i MaskIfLessThanOrEqual(int16x8_m128i a, int16x8_m128i b) {
return BitNot(MaskIfGreaterThan(a, b));
}
/* Assumptions:
- All and Any are used on masks.
- masks are all_ones for true lanes, all_zeroes otherwise.
Hence, All means all 128 bits are set, and Any means any bit is set.
*/
template <>
inline bool All(__m128i a) {
return _mm_testc_si128(a, a);
}
template <>
inline bool All(int16x8_m128i a) {
return _mm_testc_si128(a.v, a.v);
}
template <>
inline bool Any(__m128i a) {
return !_mm_testz_si128(a, a);
}
template <>
inline bool Any(int16x8_m128i a) {
return !_mm_testz_si128(a.v, a.v);
}
template <>
inline __m128i RoundingHalfSum(__m128i a, __m128i b) {
/* An alternative: divide the inputs before the add, to avoid overflow and the
   costly test of checking whether an overflow occurred on the signed add:
   __m128i round_bit_mask, a_over_2, b_over_2, round_bit, sum;
   round_bit_mask = _mm_set1_epi32(1);
   a_over_2 = _mm_srai_epi32(a, 1);
   b_over_2 = _mm_srai_epi32(b, 1);
   sum = Add(a_over_2, b_over_2);
   round_bit = _mm_sign_epi32(BitAnd(BitOr(a, b), round_bit_mask), sum);
   return Add(sum, round_bit);
   The other possibility, used below: detect overflow and xor the sign if an
   overflow happened. */
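/* The live path below adds first and repairs overflow afterwards: overflow
   occurred iff a and b agree in sign while rounded_half_sum has the opposite
   sign; xoring that sign bit back yields the correct half sum, since the true
   result always fits in 32 bits. */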
__m128i one, sign_bit_mask, sum, rounded_half_sum, overflow, result;
one = _mm_set1_epi32(1);
sign_bit_mask = _mm_set1_epi32(0x80000000);
sum = Add(a, b);
rounded_half_sum = _mm_srai_epi32(Add(sum, one), 1);
overflow =
BitAnd(BitAnd(BitXor(a, rounded_half_sum), BitXor(b, rounded_half_sum)),
sign_bit_mask);
result = BitXor(rounded_half_sum, overflow);
return result;
}
template <>
inline int16x8_m128i RoundingHalfSum(int16x8_m128i a, int16x8_m128i b) {
// Idea: go to unsigned to use _mm_avg_epu16,
// borrowed from Intel's arm_neon_sse.h header.
__m128i constant_neg_32768 = _mm_set1_epi16(-32768);
__m128i a_unsigned = _mm_sub_epi16(a.v, constant_neg_32768);
__m128i b_unsigned = _mm_sub_epi16(b.v, constant_neg_32768);
__m128i avg_unsigned = _mm_avg_epu16(a_unsigned, b_unsigned);
__m128i avg = _mm_add_epi16(avg_unsigned, constant_neg_32768);
return int16x8_m128i(avg);
}
template <>
inline __m128i SaturatingRoundingDoublingHighMul(__m128i a, __m128i b) {
__m128i min, saturation_mask, a0_a2, a1_a3, b0_b2, b1_b3;
__m128i a0b0_a2b2, a1b1_a3b3, a0b0_a2b2_rounded, a1b1_a3b3_rounded;
__m128i a0b0_a2b2_rounded_2x, a1b1_a3b3_rounded_2x, result;
__m128i nudge;
// Saturation only happens if a == b == INT_MIN.
min = _mm_set1_epi32(std::numeric_limits<std::int32_t>::min());
saturation_mask = BitAnd(MaskIfEqual(a, b), MaskIfEqual(a, min));
// a = a0 | a1 | a2 | a3
// b = b0 | b1 | b2 | b3
a0_a2 = a;
a1_a3 = _mm_srli_si128(a, 4);
b0_b2 = b;
b1_b3 = _mm_srli_si128(b, 4);
a0b0_a2b2 = _mm_mul_epi32(a0_a2, b0_b2);
a1b1_a3b3 = _mm_mul_epi32(a1_a3, b1_b3);
// do the rounding and take into account that it will be doubled
nudge = _mm_set1_epi64x(1 << 30);
a0b0_a2b2_rounded = _mm_add_epi64(a0b0_a2b2, nudge);
a1b1_a3b3_rounded = _mm_add_epi64(a1b1_a3b3, nudge);
// do the doubling
a0b0_a2b2_rounded_2x = _mm_slli_epi64(a0b0_a2b2_rounded, 1);
a1b1_a3b3_rounded_2x = _mm_slli_epi64(a1b1_a3b3_rounded, 1);
// get the high part of the products
result = _mm_blend_epi16(_mm_srli_si128(a0b0_a2b2_rounded_2x, 4),
a1b1_a3b3_rounded_2x, 0xcc);
// saturate those which overflowed
return SelectUsingMask(saturation_mask, min, result);
}
template <>
inline int16x8_m128i SaturatingRoundingDoublingHighMul(int16x8_m128i a,
int16x8_m128i b) {
// Idea: use _mm_mulhrs_epi16 then saturate with a bit-operation,
// borrowed from Intel's arm_neon_sse.h header.
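// _mm_mulhrs_epi16 already performs the rounding doubling high multiply per
// lane. The only lane value that can overflow is (-32768) * (-32768), which
// produces 0x8000, so xoring with the equality mask flips exactly those
// lanes to 0x7fff (saturation).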
__m128i result_unsaturated = _mm_mulhrs_epi16(a.v, b.v);
__m128i saturation_mask =
_mm_cmpeq_epi16(result_unsaturated, _mm_set1_epi16(0x8000));
__m128i result = _mm_xor_si128(result_unsaturated, saturation_mask);
return int16x8_m128i(result);
}
template <>
inline __m128i Dup<__m128i>(std::int32_t x) {
return _mm_set1_epi32(x);
}
template <>
inline int16x8_m128i Dup<int16x8_m128i>(std::int16_t x) {
return int16x8_m128i(_mm_set1_epi16(x));
}
// So far this is only needed for int16.
template <>
inline int16x8_m128i SaturatingAdd(int16x8_m128i a, int16x8_m128i b) {
return int16x8_m128i(_mm_adds_epi16(a.v, b.v));
}
} // end namespace gemmlowp
#endif // GEMMLOWP_INTERNAL_FIXEDPOINT_SSE_H_


@@ -0,0 +1,166 @@
// Copyright 2018 The Gemmlowp Authors. All Rights Reserved.
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.
// detect_platform.h: Sets up macros that control architecture-specific
// features of gemmlowp's implementation.
#ifndef GEMMLOWP_INTERNAL_DETECT_PLATFORM_H_
#define GEMMLOWP_INTERNAL_DETECT_PLATFORM_H_
// Our inline assembly paths assume GCC/Clang syntax.
// Native Client doesn't seem to support inline assembly(?).
#if defined(__GNUC__) && !defined(__native_client__)
#define GEMMLOWP_ALLOW_INLINE_ASM
#endif
// Define macro statement that avoids inlining for GCC.
// For non-GCC, define as empty macro.
#if defined(__GNUC__)
#define GEMMLOWP_NOINLINE __attribute__((noinline))
#else
#define GEMMLOWP_NOINLINE
#endif
// Detect ARM, 32-bit or 64-bit
#ifdef __arm__
#define GEMMLOWP_ARM_32
#endif
#ifdef __aarch64__
#define GEMMLOWP_ARM_64
#endif
#if defined(GEMMLOWP_ARM_32) || defined(GEMMLOWP_ARM_64)
#define GEMMLOWP_ARM
#endif
// Detect MIPS, 32-bit or 64-bit
#if defined(__mips) && !defined(__LP64__)
#define GEMMLOWP_MIPS_32
#endif
#if defined(__mips) && defined(__LP64__)
#define GEMMLOWP_MIPS_64
#endif
#if defined(GEMMLOWP_MIPS_32) || defined(GEMMLOWP_MIPS_64)
#define GEMMLOWP_MIPS
#endif
// Detect x86, 32-bit or 64-bit
#if defined(__i386__) || defined(_M_IX86) || defined(_X86_) || defined(__i386)
#define GEMMLOWP_X86_32
#endif
#if defined(__x86_64__) || defined(_M_X64) || defined(__amd64)
#define GEMMLOWP_X86_64
#endif
#if defined(GEMMLOWP_X86_32) || defined(GEMMLOWP_X86_64)
#define GEMMLOWP_X86
#endif
// Some of our optimized paths use inline assembly, and for now we don't
// bother enabling other optimized paths that use intrinsics where we
// can't use inline assembly.
#ifdef GEMMLOWP_ALLOW_INLINE_ASM
// Detect NEON. It's important to check for both tokens.
#if (defined __ARM_NEON) || (defined __ARM_NEON__)
#define GEMMLOWP_NEON
#endif
// Convenience NEON tokens for 32-bit or 64-bit
#if defined(GEMMLOWP_NEON) && defined(GEMMLOWP_ARM_32)
#define GEMMLOWP_NEON_32
#endif
#if defined(GEMMLOWP_NEON) && defined(GEMMLOWP_ARM_64)
#define GEMMLOWP_NEON_64
#endif
// Detect MIPS MSA.
// Limit MSA optimizations to little-endian CPUs for now.
// TODO: Perhaps eventually support MSA optimizations on big-endian CPUs?
#if defined(GEMMLOWP_MIPS) && (__mips_isa_rev >= 5) && defined(__mips_msa) && \
defined(__BYTE_ORDER__) && (__BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__)
#define GEMMLOWP_MSA
#endif
// Convenience MIPS MSA tokens for 32-bit or 64-bit.
#if defined(GEMMLOWP_MSA) && defined(GEMMLOWP_MIPS_32)
#define GEMMLOWP_MSA_32
#endif
#if defined(GEMMLOWP_MSA) && defined(GEMMLOWP_MIPS_64)
#define GEMMLOWP_MSA_64
#endif
// AVX2 must be enabled explicitly with the compiler define -D GEMMLOWP_ENABLE_AVX2.
// Detect AVX2
#if defined(__AVX2__) && defined(GEMMLOWP_ENABLE_AVX2)
#define GEMMLOWP_AVX2
// Detect SSE4.
// MSVC does not have __SSE4_1__ macro, but will enable SSE4
// when AVX is turned on.
#elif defined(__SSE4_1__) || (defined(_MSC_VER) && defined(__AVX__))
#define GEMMLOWP_SSE4
// Detect SSE3.
#elif defined(__SSE3__)
#define GEMMLOWP_SSE3
#endif
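// Example: with GCC or Clang, compiling with -mavx2 -DGEMMLOWP_ENABLE_AVX2
// selects GEMMLOWP_AVX2, while -msse4.1 alone (which defines __SSE4_1__)
// selects GEMMLOWP_SSE4.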
// Convenience SSE4 tokens for 32-bit or 64-bit
#if defined(GEMMLOWP_SSE4) && defined(GEMMLOWP_X86_32) && \
!defined(GEMMLOWP_DISABLE_SSE4)
#define GEMMLOWP_SSE4_32
#endif
#if defined(GEMMLOWP_SSE3) && defined(GEMMLOWP_X86_32)
#define GEMMLOWP_SSE3_32
#endif
#if defined(GEMMLOWP_SSE4) && defined(GEMMLOWP_X86_64) && \
!defined(GEMMLOWP_DISABLE_SSE4)
#define GEMMLOWP_SSE4_64
#endif
#if defined(GEMMLOWP_SSE3) && defined(GEMMLOWP_X86_64)
#define GEMMLOWP_SSE3_64
#endif
#if defined(GEMMLOWP_AVX2) && defined(GEMMLOWP_X86_64)
#define GEMMLOWP_AVX2_64
#endif
#if defined(__has_feature)
#if __has_feature(memory_sanitizer)
#include <sanitizer/msan_interface.h>
#define GEMMLOWP_MARK_MEMORY_AS_INITIALIZED __msan_unpoison
#elif __has_feature(address_sanitizer)
#include <sanitizer/asan_interface.h>
#define GEMMLOWP_MARK_MEMORY_AS_INITIALIZED __asan_unpoison_memory_region
#endif
#endif
#endif // GEMMLOWP_ALLOW_INLINE_ASM
// Detect Android. Don't conflate with ARM - we care about tuning
// for non-ARM Android devices too. This can be used in conjunction
// with x86 to tune differently for mobile x86 CPUs (Atom) vs. desktop x86 CPUs.
#if defined(__ANDROID__) || defined(ANDROID)
#define GEMMLOWP_ANDROID
#endif
#endif // GEMMLOWP_INTERNAL_DETECT_PLATFORM_H_