diff --git a/Documentation/asm-annotations.rst b/Documentation/asm-annotations.rst new file mode 100644 index 000000000000..29ccd6e61fe5 --- /dev/null +++ b/Documentation/asm-annotations.rst @@ -0,0 +1,216 @@ +Assembler Annotations +===================== + +Copyright (c) 2017-2019 Jiri Slaby + +This document describes the new macros for annotation of data and code in +assembly. In particular, it contains information about ``SYM_FUNC_START``, +``SYM_FUNC_END``, ``SYM_CODE_START``, and similar. + +Rationale +--------- +Some code like entries, trampolines, or boot code needs to be written in +assembly. The same as in C, such code is grouped into functions and +accompanied with data. Standard assemblers do not force users into precisely +marking these pieces as code, data, or even specifying their length. +Nevertheless, assemblers provide developers with such annotations to aid +debuggers throughout assembly. On top of that, developers also want to mark +some functions as *global* in order to be visible outside of their translation +units. + +Over time, the Linux kernel has adopted macros from various projects (like +``binutils``) to facilitate such annotations. So for historic reasons, +developers have been using ``ENTRY``, ``END``, ``ENDPROC``, and other +annotations in assembly. Due to the lack of their documentation, the macros +are used in rather wrong contexts at some locations. Clearly, ``ENTRY`` was +intended to denote the beginning of global symbols (be it data or code). +``END`` used to mark the end of data or end of special functions with +*non-standard* calling convention. In contrast, ``ENDPROC`` should annotate +only ends of *standard* functions. + +When these macros are used correctly, they help assemblers generate a nice +object with both sizes and types set correctly. For example, the result of +``arch/x86/lib/putuser.S``:: + + Num: Value Size Type Bind Vis Ndx Name + 25: 0000000000000000 33 FUNC GLOBAL DEFAULT 1 __put_user_1 + 29: 0000000000000030 37 FUNC GLOBAL DEFAULT 1 __put_user_2 + 32: 0000000000000060 36 FUNC GLOBAL DEFAULT 1 __put_user_4 + 35: 0000000000000090 37 FUNC GLOBAL DEFAULT 1 __put_user_8 + +This is not only important for debugging purposes. When there are properly +annotated objects like this, tools can be run on them to generate more useful +information. In particular, on properly annotated objects, ``objtool`` can be +run to check and fix the object if needed. Currently, ``objtool`` can report +missing frame pointer setup/destruction in functions. It can also +automatically generate annotations for :doc:`ORC unwinder ` +for most code. Both of these are especially important to support reliable +stack traces which are in turn necessary for :doc:`Kernel live patching +`. + +Caveat and Discussion +--------------------- +As one might realize, there were only three macros previously. That is indeed +insufficient to cover all the combinations of cases: + +* standard/non-standard function +* code/data +* global/local symbol + +There was a discussion_ and instead of extending the current ``ENTRY/END*`` +macros, it was decided that brand new macros should be introduced instead:: + + So how about using macro names that actually show the purpose, instead + of importing all the crappy, historic, essentially randomly chosen + debug symbol macro names from the binutils and older kernels? + +.. 
_discussion: https://lkml.kernel.org/r/20170217104757.28588-1-jslaby@suse.cz + +Macros Description +------------------ + +The new macros are prefixed with the ``SYM_`` prefix and can be divided into +three main groups: + +1. ``SYM_FUNC_*`` -- to annotate C-like functions. This means functions with + standard C calling conventions, i.e. the stack contains a return address at + the predefined place and a return from the function can happen in a + standard way. When frame pointers are enabled, save/restore of frame + pointer shall happen at the start/end of a function, respectively, too. + + Checking tools like ``objtool`` should ensure such marked functions conform + to these rules. The tools can also easily annotate these functions with + debugging information (like *ORC data*) automatically. + +2. ``SYM_CODE_*`` -- special functions called with special stack. Be it + interrupt handlers with special stack content, trampolines, or startup + functions. + + Checking tools mostly ignore checking of these functions. But some debug + information still can be generated automatically. For correct debug data, + this code needs hints like ``UNWIND_HINT_REGS`` provided by developers. + +3. ``SYM_DATA*`` -- obviously data belonging to ``.data`` sections and not to + ``.text``. Data do not contain instructions, so they have to be treated + specially by the tools: they should not treat the bytes as instructions, + nor assign any debug information to them. + +Instruction Macros +~~~~~~~~~~~~~~~~~~ +This section covers ``SYM_FUNC_*`` and ``SYM_CODE_*`` enumerated above. + +* ``SYM_FUNC_START`` and ``SYM_FUNC_START_LOCAL`` are supposed to be **the + most frequent markings**. They are used for functions with standard calling + conventions -- global and local. Like in C, they both align the functions to + architecture specific ``__ALIGN`` bytes. There are also ``_NOALIGN`` variants + for special cases where developers do not want this implicit alignment. + + ``SYM_FUNC_START_WEAK`` and ``SYM_FUNC_START_WEAK_NOALIGN`` markings are + also offered as an assembler counterpart to the *weak* attribute known from + C. + + All of these **shall** be coupled with ``SYM_FUNC_END``. First, it marks + the sequence of instructions as a function and computes its size to the + generated object file. Second, it also eases checking and processing such + object files as the tools can trivially find exact function boundaries. + + So in most cases, developers should write something like in the following + example, having some asm instructions in between the macros, of course:: + + SYM_FUNC_START(function_hook) + ... asm insns ... + SYM_FUNC_END(function_hook) + + In fact, this kind of annotation corresponds to the now deprecated ``ENTRY`` + and ``ENDPROC`` macros. + +* ``SYM_FUNC_START_ALIAS`` and ``SYM_FUNC_START_LOCAL_ALIAS`` serve for those + who decided to have two or more names for one function. The typical use is:: + + SYM_FUNC_START_ALIAS(__memset) + SYM_FUNC_START(memset) + ... asm insns ... + SYM_FUNC_END(memset) + SYM_FUNC_END_ALIAS(__memset) + + In this example, one can call ``__memset`` or ``memset`` with the same + result, except the debug information for the instructions is generated to + the object file only once -- for the non-``ALIAS`` case. + +* ``SYM_CODE_START`` and ``SYM_CODE_START_LOCAL`` should be used only in + special cases -- if you know what you are doing. This is used exclusively + for interrupt handlers and similar where the calling convention is not the C + one. ``_NOALIGN`` variants exist too. 
The use is the same as for the ``FUNC`` + category above:: + + SYM_CODE_START_LOCAL(bad_put_user) + ... asm insns ... + SYM_CODE_END(bad_put_user) + + Again, every ``SYM_CODE_START*`` **shall** be coupled by ``SYM_CODE_END``. + + To some extent, this category corresponds to deprecated ``ENTRY`` and + ``END``. Except ``END`` had several other meanings too. + +* ``SYM_INNER_LABEL*`` is used to denote a label inside some + ``SYM_{CODE,FUNC}_START`` and ``SYM_{CODE,FUNC}_END``. They are very similar + to C labels, except they can be made global. An example of use:: + + SYM_CODE_START(ftrace_caller) + /* save_mcount_regs fills in first two parameters */ + ... + + SYM_INNER_LABEL(ftrace_caller_op_ptr, SYM_L_GLOBAL) + /* Load the ftrace_ops into the 3rd parameter */ + ... + + SYM_INNER_LABEL(ftrace_call, SYM_L_GLOBAL) + call ftrace_stub + ... + retq + SYM_CODE_END(ftrace_caller) + +Data Macros +~~~~~~~~~~~ +Similar to instructions, there is a couple of macros to describe data in the +assembly. + +* ``SYM_DATA_START`` and ``SYM_DATA_START_LOCAL`` mark the start of some data + and shall be used in conjunction with either ``SYM_DATA_END``, or + ``SYM_DATA_END_LABEL``. The latter adds also a label to the end, so that + people can use ``lstack`` and (local) ``lstack_end`` in the following + example:: + + SYM_DATA_START_LOCAL(lstack) + .skip 4096 + SYM_DATA_END_LABEL(lstack, SYM_L_LOCAL, lstack_end) + +* ``SYM_DATA`` and ``SYM_DATA_LOCAL`` are variants for simple, mostly one-line + data:: + + SYM_DATA(HEAP, .long rm_heap) + SYM_DATA(heap_end, .long rm_stack) + + In the end, they expand to ``SYM_DATA_START`` with ``SYM_DATA_END`` + internally. + +Support Macros +~~~~~~~~~~~~~~ +All the above reduce themselves to some invocation of ``SYM_START``, +``SYM_END``, or ``SYM_ENTRY`` at last. Normally, developers should avoid using +these. + +Further, in the above examples, one could see ``SYM_L_LOCAL``. There are also +``SYM_L_GLOBAL`` and ``SYM_L_WEAK``. All are intended to denote linkage of a +symbol marked by them. They are used either in ``_LABEL`` variants of the +earlier macros, or in ``SYM_START``. + + +Overriding Macros +~~~~~~~~~~~~~~~~~ +Architecture can also override any of the macros in their own +``asm/linkage.h``, including macros specifying the type of a symbol +(``SYM_T_FUNC``, ``SYM_T_OBJECT``, and ``SYM_T_NONE``). As every macro +described in this file is surrounded by ``#ifdef`` + ``#endif``, it is enough +to define the macros differently in the aforementioned architecture-dependent +header. diff --git a/Documentation/index.rst b/Documentation/index.rst index 1cdc139adb40..c1a24a503a75 100644 --- a/Documentation/index.rst +++ b/Documentation/index.rst @@ -94,6 +94,14 @@ needed). vm/index bpf/index +Architecture-agnostic documentation +----------------------------------- + +.. 
toctree:: + :maxdepth: 2 + + asm-annotations + Architecture-specific documentation ----------------------------------- diff --git a/arch/arm64/boot/dts/vendor/qcom/xiaomi-sm8250-common.dtsi b/arch/arm64/boot/dts/vendor/qcom/xiaomi-sm8250-common.dtsi index c819b8d8b80b..96d002c5237f 100755 --- a/arch/arm64/boot/dts/vendor/qcom/xiaomi-sm8250-common.dtsi +++ b/arch/arm64/boot/dts/vendor/qcom/xiaomi-sm8250-common.dtsi @@ -17,43 +17,21 @@ <1804800>; qcom,cpufreq-table-1 = - < 825600>, - < 940800>, - <1056000>, - <1171200>, - <1286400>, - <1382400>, <1478400>, - <1574400>, - <1670400>, <1766400>, <1862400>, - <1958400>, <2054400>, - <2150400>, <2246400>, <2342400>, <2419200>; qcom,cpufreq-table-2 = - < 960000>, - <1075200>, - <1190400>, - <1305600>, - <1401600>, - <1516800>, - <1632000>, - <1747200>, <1862400>, - <1977600>, <2073600>, - <2169600>, <2265600>, <2361600>, <2457600>, <2553600>, - <2649600>, - <2745600>, <2841600>, <3187200>; }; diff --git a/arch/arm64/include/asm/assembler.h b/arch/arm64/include/asm/assembler.h index 886669ba00aa..c6e7ecc2e510 100644 --- a/arch/arm64/include/asm/assembler.h +++ b/arch/arm64/include/asm/assembler.h @@ -481,6 +481,7 @@ USER(\label, ic ivau, \tmp2) // invalidate I line PoU .endm /* + * Deprecated! Use SYM_FUNC_{START,START_WEAK,END}_PI instead. * Annotate a function as position independent, i.e., safe to be called before * the kernel virtual mapping is activated. */ diff --git a/arch/arm64/include/asm/linkage.h b/arch/arm64/include/asm/linkage.h index 1b266292f0be..2415aeb674fd 100644 --- a/arch/arm64/include/asm/linkage.h +++ b/arch/arm64/include/asm/linkage.h @@ -4,4 +4,28 @@ #define __ALIGN .align 2 #define __ALIGN_STR ".align 2" +/* + * Annotate a function as position independent, i.e., safe to be called before + * the kernel virtual mapping is activated. + */ +#define SYM_FUNC_START_PI(x) \ + SYM_FUNC_START_ALIAS(__pi_##x); \ + SYM_FUNC_START(x) + +#define SYM_FUNC_START_WEAK_PI(x) \ + SYM_FUNC_START_ALIAS(__pi_##x); \ + SYM_FUNC_START_WEAK(x) + +#define SYM_FUNC_START_WEAK_ALIAS_PI(x) \ + SYM_FUNC_START_ALIAS(__pi_##x); \ + SYM_START(x, SYM_L_WEAK, SYM_A_ALIGN) + +#define SYM_FUNC_END_PI(x) \ + SYM_FUNC_END(x); \ + SYM_FUNC_END_ALIAS(__pi_##x) + +#define SYM_FUNC_END_ALIAS_PI(x) \ + SYM_FUNC_END_ALIAS(x); \ + SYM_FUNC_END_ALIAS(__pi_##x) + #endif diff --git a/arch/arm64/lib/Makefile b/arch/arm64/lib/Makefile index a960d2179177..afbc66f1348d 100644 --- a/arch/arm64/lib/Makefile +++ b/arch/arm64/lib/Makefile @@ -1,7 +1,7 @@ # SPDX-License-Identifier: GPL-2.0 lib-y := clear_user.o delay.o copy_from_user.o \ copy_to_user.o copy_in_user.o copy_page.o \ - clear_page.o csum.o memchr.o memcpy.o memmove.o \ + clear_page.o csum.o memchr.o memcpy.o \ memset.o memcmp.o strcmp.o strncmp.o strlen.o \ strnlen.o strchr.o strrchr.o tishift.o diff --git a/arch/arm64/lib/clear_user.S b/arch/arm64/lib/clear_user.S index 4374020c824a..9e7d893d58ee 100644 --- a/arch/arm64/lib/clear_user.S +++ b/arch/arm64/lib/clear_user.S @@ -1,23 +1,10 @@ +/* SPDX-License-Identifier: GPL-2.0-only */ /* - * Based on arch/arm/lib/clear_user.S - * - * Copyright (C) 2012 ARM Ltd. - * - * This program is free software; you can redistribute it and/or modify - * it under the terms of the GNU General Public License version 2 as - * published by the Free Software Foundation. - * - * This program is distributed in the hope that it will be useful, - * but WITHOUT ANY WARRANTY; without even the implied warranty of - * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. 
See the - * GNU General Public License for more details. - * - * You should have received a copy of the GNU General Public License - * along with this program. If not, see . + * Copyright (C) 2021 Arm Ltd. */ -#include -#include +#include +#include .text @@ -29,34 +16,41 @@ * * Alignment fixed up by hardware. */ -ENTRY(__arch_clear_user) - uaccess_enable_not_uao x2, x3, x4 - mov x2, x1 // save the size for fixup return + + .p2align 4 + // Alignment is for the loop, but since the prologue (including BTI) + // is also 16 bytes we can keep any padding outside the function +SYM_FUNC_START(__arch_clear_user) + add x2, x0, x1 subs x1, x1, #8 b.mi 2f 1: -uao_user_alternative 9f, str, sttr, xzr, x0, 8 +USER(9f, sttr xzr, [x0]) + add x0, x0, #8 subs x1, x1, #8 - b.pl 1b -2: adds x1, x1, #4 - b.mi 3f -uao_user_alternative 9f, str, sttr, wzr, x0, 4 - sub x1, x1, #4 -3: adds x1, x1, #2 - b.mi 4f -uao_user_alternative 9f, strh, sttrh, wzr, x0, 2 - sub x1, x1, #2 -4: adds x1, x1, #1 - b.mi 5f -uao_user_alternative 9f, strb, sttrb, wzr, x0, 0 + b.hi 1b +USER(9f, sttr xzr, [x2, #-8]) + mov x0, #0 + ret + +2: tbz x1, #2, 3f +USER(9f, sttr wzr, [x0]) +USER(8f, sttr wzr, [x2, #-4]) + mov x0, #0 + ret + +3: tbz x1, #1, 4f +USER(9f, sttrh wzr, [x0]) +4: tbz x1, #0, 5f +USER(7f, sttrb wzr, [x2, #-1]) 5: mov x0, #0 - uaccess_disable_not_uao x2, x3 ret -ENDPROC(__arch_clear_user) +SYM_FUNC_END(__arch_clear_user) .section .fixup,"ax" .align 2 -9: mov x0, x2 // return the original size - uaccess_disable_not_uao x2, x3 +7: sub x0, x2, #5 // Adjust for faulting on the final byte... +8: add x0, x0, #4 // ...or the second word of the 4-7 byte case +9: sub x0, x2, x0 ret .previous diff --git a/arch/arm64/lib/memchr.S b/arch/arm64/lib/memchr.S index 0f164a4baf52..152241bfe1f1 100644 --- a/arch/arm64/lib/memchr.S +++ b/arch/arm64/lib/memchr.S @@ -1,20 +1,6 @@ +/* SPDX-License-Identifier: GPL-2.0-only */ /* - * Based on arch/arm/lib/memchr.S - * - * Copyright (C) 1995-2000 Russell King - * Copyright (C) 2013 ARM Ltd. - * - * This program is free software; you can redistribute it and/or modify - * it under the terms of the GNU General Public License version 2 as - * published by the Free Software Foundation. - * - * This program is distributed in the hope that it will be useful, - * but WITHOUT ANY WARRANTY; without even the implied warranty of - * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the - * GNU General Public License for more details. - * - * You should have received a copy of the GNU General Public License - * along with this program. If not, see . + * Copyright (C) 2021 Arm Ltd. 
*/ #include @@ -30,15 +16,59 @@ * Returns: * x0 - address of first occurrence of 'c' or 0 */ -WEAK(memchr) - and w1, w1, #0xff -1: subs x2, x2, #1 - b.mi 2f - ldrb w3, [x0], #1 - cmp w3, w1 - b.ne 1b - sub x0, x0, #1 + +#define L(label) .L ## label + +#define REP8_01 0x0101010101010101 +#define REP8_7f 0x7f7f7f7f7f7f7f7f + +#define srcin x0 +#define chrin w1 +#define cntin x2 + +#define result x0 + +#define wordcnt x3 +#define rep01 x4 +#define repchr x5 +#define cur_word x6 +#define cur_byte w6 +#define tmp x7 +#define tmp2 x8 + + .p2align 4 + nop +SYM_FUNC_START_WEAK_PI(memchr) + and chrin, chrin, #0xff + lsr wordcnt, cntin, #3 + cbz wordcnt, L(byte_loop) + mov rep01, #REP8_01 + mul repchr, x1, rep01 + and cntin, cntin, #7 +L(word_loop): + ldr cur_word, [srcin], #8 + sub wordcnt, wordcnt, #1 + eor cur_word, cur_word, repchr + sub tmp, cur_word, rep01 + orr tmp2, cur_word, #REP8_7f + bics tmp, tmp, tmp2 + b.ne L(found_word) + cbnz wordcnt, L(word_loop) +L(byte_loop): + cbz cntin, L(not_found) + ldrb cur_byte, [srcin], #1 + sub cntin, cntin, #1 + cmp cur_byte, chrin + b.ne L(byte_loop) + sub srcin, srcin, #1 + ret +L(found_word): +CPU_LE( rev tmp, tmp) + clz tmp, tmp + sub tmp, tmp, #64 + add result, srcin, tmp, asr #3 ret -2: mov x0, #0 +L(not_found): + mov result, #0 ret -ENDPIPROC(memchr) +SYM_FUNC_END_PI(memchr) diff --git a/arch/arm64/lib/memcmp.S b/arch/arm64/lib/memcmp.S index f365a5055c30..3a7c7cfedc43 100644 --- a/arch/arm64/lib/memcmp.S +++ b/arch/arm64/lib/memcmp.S @@ -1,39 +1,20 @@ +/* SPDX-License-Identifier: GPL-2.0-only */ /* - * Copyright (c) 2017 ARM Ltd - * All rights reserved. + * Copyright (c) 2013-2021, Arm Limited. * - * Redistribution and use in source and binary forms, with or without - * modification, are permitted provided that the following conditions - * are met: - * 1. Redistributions of source code must retain the above copyright - * notice, this list of conditions and the following disclaimer. - * 2. Redistributions in binary form must reproduce the above copyright - * notice, this list of conditions and the following disclaimer in the - * documentation and/or other materials provided with the distribution. - * 3. The name of the company may not be used to endorse or promote - * products derived from this software without specific prior written - * permission. - * - * THIS SOFTWARE IS PROVIDED BY ARM LTD ``AS IS'' AND ANY EXPRESS OR IMPLIED - * WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF - * MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. - * IN NO EVENT SHALL ARM LTD BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, - * SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED - * TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR - * PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF - * LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING - * NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS - * SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + * Adapted from the original at: + * https://github.com/ARM-software/optimized-routines/blob/e823e3abf5f89ecb/string/aarch64/memcmp.S */ +#include +#include + /* Assumptions: * * ARMv8-a, AArch64, unaligned accesses. */ -/* includes here */ -#include -#include +#define L(label) .L ## label /* Parameters and result. */ #define src1 x0 @@ -44,88 +25,114 @@ /* Internal variables. 
*/ #define data1 x3 #define data1w w3 -#define data2 x4 -#define data2w w4 -#define tmp1 x5 - -/* Small inputs of less than 8 bytes are handled separately. This allows the - main code to be sped up using unaligned loads since there are now at least - 8 bytes to be compared. If the first 8 bytes are equal, align src1. - This ensures each iteration does at most one unaligned access even if both - src1 and src2 are unaligned, and mutually aligned inputs behave as if - aligned. After the main loop, process the last 8 bytes using unaligned - accesses. */ - -.p2align 6 -WEAK(memcmp) +#define data1h x4 +#define data2 x5 +#define data2w w5 +#define data2h x6 +#define tmp1 x7 +#define tmp2 x8 + +SYM_FUNC_START_WEAK_PI(memcmp) subs limit, limit, 8 - b.lo .Lless8 + b.lo L(less8) - /* Limit >= 8, so check first 8 bytes using unaligned loads. */ ldr data1, [src1], 8 ldr data2, [src2], 8 - and tmp1, src1, 7 - add limit, limit, tmp1 cmp data1, data2 - bne .Lreturn + b.ne L(return) + + subs limit, limit, 8 + b.gt L(more16) + + ldr data1, [src1, limit] + ldr data2, [src2, limit] + b L(return) + +L(more16): + ldr data1, [src1], 8 + ldr data2, [src2], 8 + cmp data1, data2 + bne L(return) + + /* Jump directly to comparing the last 16 bytes for 32 byte (or less) + strings. */ + subs limit, limit, 16 + b.ls L(last_bytes) + + /* We overlap loads between 0-32 bytes at either side of SRC1 when we + try to align, so limit it only to strings larger than 128 bytes. */ + cmp limit, 96 + b.ls L(loop16) /* Align src1 and adjust src2 with bytes not yet done. */ + and tmp1, src1, 15 + add limit, limit, tmp1 sub src1, src1, tmp1 sub src2, src2, tmp1 - subs limit, limit, 8 - b.ls .Llast_bytes - - /* Loop performing 8 bytes per iteration using aligned src1. - Limit is pre-decremented by 8 and must be larger than zero. - Exit if <= 8 bytes left to do or if the data is not equal. */ + /* Loop performing 16 bytes per iteration using aligned src1. + Limit is pre-decremented by 16 and must be larger than zero. + Exit if <= 16 bytes left to do or if the data is not equal. */ .p2align 4 -.Lloop8: - ldr data1, [src1], 8 - ldr data2, [src2], 8 - subs limit, limit, 8 - ccmp data1, data2, 0, hi /* NZCV = 0b0000. */ - b.eq .Lloop8 +L(loop16): + ldp data1, data1h, [src1], 16 + ldp data2, data2h, [src2], 16 + subs limit, limit, 16 + ccmp data1, data2, 0, hi + ccmp data1h, data2h, 0, eq + b.eq L(loop16) cmp data1, data2 - bne .Lreturn + bne L(return) + mov data1, data1h + mov data2, data2h + cmp data1, data2 + bne L(return) - /* Compare last 1-8 bytes using unaligned access. */ -.Llast_bytes: - ldr data1, [src1, limit] - ldr data2, [src2, limit] + /* Compare last 1-16 bytes using unaligned access. */ +L(last_bytes): + add src1, src1, limit + add src2, src2, limit + ldp data1, data1h, [src1] + ldp data2, data2h, [src2] + cmp data1, data2 + bne L(return) + mov data1, data1h + mov data2, data2h + cmp data1, data2 /* Compare data bytes and set return value to 0, -1 or 1. */ -.Lreturn: +L(return): #ifndef __AARCH64EB__ rev data1, data1 rev data2, data2 #endif - cmp data1, data2 -.Lret_eq: + cmp data1, data2 +L(ret_eq): cset result, ne cneg result, result, lo - ret + ret .p2align 4 /* Compare up to 8 bytes. Limit is [-8..-1]. 
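For reference, the little-endian ``L(return)`` path above byte-reverses both words before the final unsigned compare, so the lowest-addressed differing byte decides the sign of the result. A rough C sketch of that idea (illustrative only, not part of the patch; ``chunk_cmp`` is a made-up name and ``__builtin_bswap64`` stands in for ``rev``)::

    #include <stdint.h>

    // Given two 8-byte chunks loaded from the same offset of each buffer,
    // produce a memcmp-style result.  On little-endian, byte-swapping puts
    // the lowest-addressed byte into the most significant position, so a
    // single unsigned 64-bit compare orders by the first differing byte.
    static int chunk_cmp(uint64_t data1, uint64_t data2)
    {
        uint64_t a = __builtin_bswap64(data1);  // rev data1
        uint64_t b = __builtin_bswap64(data2);  // rev data2

        if (a == b)
            return 0;
        return a < b ? -1 : 1;  // cset result, ne; cneg result, result, lo
    }
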
*/ -.Lless8: +L(less8): adds limit, limit, 4 - b.lo .Lless4 + b.lo L(less4) ldr data1w, [src1], 4 ldr data2w, [src2], 4 cmp data1w, data2w - b.ne .Lreturn + b.ne L(return) sub limit, limit, 4 -.Lless4: +L(less4): adds limit, limit, 4 - beq .Lret_eq -.Lbyte_loop: + beq L(ret_eq) +L(byte_loop): ldrb data1w, [src1], 1 ldrb data2w, [src2], 1 subs limit, limit, 1 ccmp data1w, data2w, 0, ne /* NZCV = 0b0000. */ - b.eq .Lbyte_loop + b.eq L(byte_loop) sub result, data1w, data2w ret -ENDPIPROC(memcmp) + +SYM_FUNC_END_PI(memcmp) diff --git a/arch/arm64/lib/memcpy.S b/arch/arm64/lib/memcpy.S index dfedd4ab1a76..c1a2b5d959b7 100644 --- a/arch/arm64/lib/memcpy.S +++ b/arch/arm64/lib/memcpy.S @@ -1,76 +1,248 @@ +/* SPDX-License-Identifier: GPL-2.0-only */ /* - * Copyright (C) 2013 ARM Ltd. - * Copyright (C) 2013 Linaro. + * Copyright (c) 2012-2021, Arm Limited. * - * This code is based on glibc cortex strings work originally authored by Linaro - * and re-licensed under GPLv2 for the Linux kernel. The original code can - * be found @ - * - * http://bazaar.launchpad.net/~linaro-toolchain-dev/cortex-strings/trunk/ - * files/head:/src/aarch64/ - * - * This program is free software; you can redistribute it and/or modify - * it under the terms of the GNU General Public License version 2 as - * published by the Free Software Foundation. - * - * This program is distributed in the hope that it will be useful, - * but WITHOUT ANY WARRANTY; without even the implied warranty of - * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the - * GNU General Public License for more details. - * - * You should have received a copy of the GNU General Public License - * along with this program. If not, see . + * Adapted from the original at: + * https://github.com/ARM-software/optimized-routines/blob/afd6244a1f8d9229/string/aarch64/memcpy.S */ #include #include -#include -/* - * Copy a buffer from src to dest (alignment handled by the hardware) +/* Assumptions: + * + * ARMv8-a, AArch64, unaligned accesses. * - * Parameters: - * x0 - dest - * x1 - src - * x2 - n - * Returns: - * x0 - dest */ - .macro ldrb1 ptr, regB, val - ldrb \ptr, [\regB], \val - .endm - .macro strb1 ptr, regB, val - strb \ptr, [\regB], \val - .endm +#define L(label) .L ## label + +#define dstin x0 +#define src x1 +#define count x2 +#define dst x3 +#define srcend x4 +#define dstend x5 +#define A_l x6 +#define A_lw w6 +#define A_h x7 +#define B_l x8 +#define B_lw w8 +#define B_h x9 +#define C_l x10 +#define C_lw w10 +#define C_h x11 +#define D_l x12 +#define D_h x13 +#define E_l x14 +#define E_h x15 +#define F_l x16 +#define F_h x17 +#define G_l count +#define G_h dst +#define H_l src +#define H_h srcend +#define tmp1 x14 - .macro ldrh1 ptr, regB, val - ldrh \ptr, [\regB], \val - .endm +/* This implementation handles overlaps and supports both memcpy and memmove + from a single entry point. It uses unaligned accesses and branchless + sequences to keep the code small, simple and improve performance. - .macro strh1 ptr, regB, val - strh \ptr, [\regB], \val - .endm + Copies are split into 3 main cases: small copies of up to 32 bytes, medium + copies of up to 128 bytes, and large copies. The overhead of the overlap + check is negligible since it is only required for large copies. - .macro ldr1 ptr, regB, val - ldr \ptr, [\regB], \val - .endm + Large copies use a software pipelined loop processing 64 bytes per iteration. + The destination pointer is 16-byte aligned to minimize unaligned accesses. 
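As a side note on the "small copies" case described above (and implemented right below): sizes in the 16..32 byte range are handled by loading 16 bytes from each end of the source and then storing both halves, letting the two stores overlap in the middle instead of branching on the exact size. A minimal C sketch of the same idea (illustrative only; ``copy16_32`` is an invented name, not kernel code)::

    #include <string.h>

    // Copy n bytes, 16 <= n <= 32: read both 16-byte halves before writing,
    // then store them at the start and at the end of the destination.  The
    // two stores may overlap in the middle, which is harmless.
    static void copy16_32(void *dst, const void *src, size_t n)
    {
        unsigned char head[16], tail[16];

        memcpy(head, src, 16);                                   // ldp A_l, A_h, [src]
        memcpy(tail, (const unsigned char *)src + n - 16, 16);   // ldp D_l, D_h, [srcend, -16]
        memcpy(dst, head, 16);                                   // stp A_l, A_h, [dstin]
        memcpy((unsigned char *)dst + n - 16, tail, 16);         // stp D_l, D_h, [dstend, -16]
    }
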
+ The loop tail is handled by always copying 64 bytes from the end. +*/ - .macro str1 ptr, regB, val - str \ptr, [\regB], \val - .endm +SYM_FUNC_START_ALIAS(__memmove) +SYM_FUNC_START_WEAK_ALIAS_PI(memmove) +SYM_FUNC_START_ALIAS(__memcpy) +SYM_FUNC_START_WEAK_PI(memcpy) + add srcend, src, count + add dstend, dstin, count + cmp count, 128 + b.hi L(copy_long) + cmp count, 32 + b.hi L(copy32_128) - .macro ldp1 ptr, regB, regC, val - ldp \ptr, \regB, [\regC], \val - .endm + /* Small copies: 0..32 bytes. */ + cmp count, 16 + b.lo L(copy16) + ldp A_l, A_h, [src] + ldp D_l, D_h, [srcend, -16] + stp A_l, A_h, [dstin] + stp D_l, D_h, [dstend, -16] + ret + + /* Copy 8-15 bytes. */ +L(copy16): + tbz count, 3, L(copy8) + ldr A_l, [src] + ldr A_h, [srcend, -8] + str A_l, [dstin] + str A_h, [dstend, -8] + ret - .macro stp1 ptr, regB, regC, val - stp \ptr, \regB, [\regC], \val - .endm + .p2align 3 + /* Copy 4-7 bytes. */ +L(copy8): + tbz count, 2, L(copy4) + ldr A_lw, [src] + ldr B_lw, [srcend, -4] + str A_lw, [dstin] + str B_lw, [dstend, -4] + ret -ENTRY(__memcpy) -WEAK(memcpy) -#include "copy_template.S" + /* Copy 0..3 bytes using a branchless sequence. */ +L(copy4): + cbz count, L(copy0) + lsr tmp1, count, 1 + ldrb A_lw, [src] + ldrb C_lw, [srcend, -1] + ldrb B_lw, [src, tmp1] + strb A_lw, [dstin] + strb B_lw, [dstin, tmp1] + strb C_lw, [dstend, -1] +L(copy0): ret -ENDPIPROC(memcpy) -ENDPROC(__memcpy) + + .p2align 4 + /* Medium copies: 33..128 bytes. */ +L(copy32_128): + ldp A_l, A_h, [src] + ldp B_l, B_h, [src, 16] + ldp C_l, C_h, [srcend, -32] + ldp D_l, D_h, [srcend, -16] + cmp count, 64 + b.hi L(copy128) + stp A_l, A_h, [dstin] + stp B_l, B_h, [dstin, 16] + stp C_l, C_h, [dstend, -32] + stp D_l, D_h, [dstend, -16] + ret + + .p2align 4 + /* Copy 65..128 bytes. */ +L(copy128): + ldp E_l, E_h, [src, 32] + ldp F_l, F_h, [src, 48] + cmp count, 96 + b.ls L(copy96) + ldp G_l, G_h, [srcend, -64] + ldp H_l, H_h, [srcend, -48] + stp G_l, G_h, [dstend, -64] + stp H_l, H_h, [dstend, -48] +L(copy96): + stp A_l, A_h, [dstin] + stp B_l, B_h, [dstin, 16] + stp E_l, E_h, [dstin, 32] + stp F_l, F_h, [dstin, 48] + stp C_l, C_h, [dstend, -32] + stp D_l, D_h, [dstend, -16] + ret + + .p2align 4 + /* Copy more than 128 bytes. */ +L(copy_long): + /* Use backwards copy if there is an overlap. */ + sub tmp1, dstin, src + cbz tmp1, L(copy0) + cmp tmp1, count + b.lo L(copy_long_backwards) + + /* Copy 16 bytes and then align dst to 16-byte alignment. */ + + ldp D_l, D_h, [src] + and tmp1, dstin, 15 + bic dst, dstin, 15 + sub src, src, tmp1 + add count, count, tmp1 /* Count is now 16 too large. */ + ldp A_l, A_h, [src, 16] + stp D_l, D_h, [dstin] + ldp B_l, B_h, [src, 32] + ldp C_l, C_h, [src, 48] + ldp D_l, D_h, [src, 64]! + subs count, count, 128 + 16 /* Test and readjust count. */ + b.ls L(copy64_from_end) + +L(loop64): + stp A_l, A_h, [dst, 16] + ldp A_l, A_h, [src, 16] + stp B_l, B_h, [dst, 32] + ldp B_l, B_h, [src, 32] + stp C_l, C_h, [dst, 48] + ldp C_l, C_h, [src, 48] + stp D_l, D_h, [dst, 64]! + ldp D_l, D_h, [src, 64]! + subs count, count, 64 + b.hi L(loop64) + + /* Write the last iteration and copy 64 bytes from the end. 
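Another detail worth spelling out from ``L(copy_long)`` above: whether a forward copy is safe is decided with a single unsigned comparison of ``dst - src`` against the length, since only a destination that starts inside ``[src, src + count)`` forces the backwards variant. In C terms (a sketch under that assumption; ``needs_backwards_copy`` is an illustrative name)::

    #include <stddef.h>
    #include <stdint.h>

    // Forward copying clobbers the source only if dst lies inside
    // [src, src + count).  With unsigned wrap-around this is one compare,
    // mirroring "sub tmp1, dstin, src; cmp tmp1, count; b.lo ...".
    // The dst == src case is returned early in the asm (L(copy0)).
    static int needs_backwards_copy(const void *dst, const void *src, size_t count)
    {
        uintptr_t diff = (uintptr_t)dst - (uintptr_t)src;

        return diff != 0 && diff < (uintptr_t)count;
    }
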
*/ +L(copy64_from_end): + ldp E_l, E_h, [srcend, -64] + stp A_l, A_h, [dst, 16] + ldp A_l, A_h, [srcend, -48] + stp B_l, B_h, [dst, 32] + ldp B_l, B_h, [srcend, -32] + stp C_l, C_h, [dst, 48] + ldp C_l, C_h, [srcend, -16] + stp D_l, D_h, [dst, 64] + stp E_l, E_h, [dstend, -64] + stp A_l, A_h, [dstend, -48] + stp B_l, B_h, [dstend, -32] + stp C_l, C_h, [dstend, -16] + ret + + .p2align 4 + + /* Large backwards copy for overlapping copies. + Copy 16 bytes and then align dst to 16-byte alignment. */ +L(copy_long_backwards): + ldp D_l, D_h, [srcend, -16] + and tmp1, dstend, 15 + sub srcend, srcend, tmp1 + sub count, count, tmp1 + ldp A_l, A_h, [srcend, -16] + stp D_l, D_h, [dstend, -16] + ldp B_l, B_h, [srcend, -32] + ldp C_l, C_h, [srcend, -48] + ldp D_l, D_h, [srcend, -64]! + sub dstend, dstend, tmp1 + subs count, count, 128 + b.ls L(copy64_from_start) + +L(loop64_backwards): + stp A_l, A_h, [dstend, -16] + ldp A_l, A_h, [srcend, -16] + stp B_l, B_h, [dstend, -32] + ldp B_l, B_h, [srcend, -32] + stp C_l, C_h, [dstend, -48] + ldp C_l, C_h, [srcend, -48] + stp D_l, D_h, [dstend, -64]! + ldp D_l, D_h, [srcend, -64]! + subs count, count, 64 + b.hi L(loop64_backwards) + + /* Write the last iteration and copy 64 bytes from the start. */ +L(copy64_from_start): + ldp G_l, G_h, [src, 48] + stp A_l, A_h, [dstend, -16] + ldp A_l, A_h, [src, 32] + stp B_l, B_h, [dstend, -32] + ldp B_l, B_h, [src, 16] + stp C_l, C_h, [dstend, -48] + ldp C_l, C_h, [src] + stp D_l, D_h, [dstend, -64] + stp G_l, G_h, [dstin, 48] + stp A_l, A_h, [dstin, 32] + stp B_l, B_h, [dstin, 16] + stp C_l, C_h, [dstin] + ret + +SYM_FUNC_END_PI(memcpy) +SYM_FUNC_END_ALIAS(__memcpy) +SYM_FUNC_END_ALIAS_PI(memmove) +SYM_FUNC_END_ALIAS(__memmove) diff --git a/arch/arm64/lib/memmove.S b/arch/arm64/lib/memmove.S deleted file mode 100644 index d2dadccb62c5..000000000000 --- a/arch/arm64/lib/memmove.S +++ /dev/null @@ -1,201 +0,0 @@ -/* - * Copyright (C) 2013 ARM Ltd. - * Copyright (C) 2013 Linaro. - * - * This code is based on glibc cortex strings work originally authored by Linaro - * and re-licensed under GPLv2 for the Linux kernel. The original code can - * be found @ - * - * http://bazaar.launchpad.net/~linaro-toolchain-dev/cortex-strings/trunk/ - * files/head:/src/aarch64/ - * - * This program is free software; you can redistribute it and/or modify - * it under the terms of the GNU General Public License version 2 as - * published by the Free Software Foundation. - * - * This program is distributed in the hope that it will be useful, - * but WITHOUT ANY WARRANTY; without even the implied warranty of - * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the - * GNU General Public License for more details. - * - * You should have received a copy of the GNU General Public License - * along with this program. If not, see . - */ - -#include -#include -#include - -/* - * Move a buffer from src to test (alignment handled by the hardware). - * If dest <= src, call memcpy, otherwise copy in reverse order. - * - * Parameters: - * x0 - dest - * x1 - src - * x2 - n - * Returns: - * x0 - dest - */ -dstin .req x0 -src .req x1 -count .req x2 -tmp1 .req x3 -tmp1w .req w3 -tmp2 .req x4 -tmp2w .req w4 -tmp3 .req x5 -tmp3w .req w5 -dst .req x6 - -A_l .req x7 -A_h .req x8 -B_l .req x9 -B_h .req x10 -C_l .req x11 -C_h .req x12 -D_l .req x13 -D_h .req x14 - -ENTRY(__memmove) -WEAK(memmove) - prfm pldl1strm, [src, #L1_CACHE_BYTES] - cmp dstin, src - b.lo __memcpy - add tmp1, src, count - cmp dstin, tmp1 - b.hs __memcpy /* No overlap. 
*/ - - add dst, dstin, count - add src, src, count - cmp count, #16 - b.lo .Ltail15 /*probably non-alignment accesses.*/ - - ands tmp2, src, #15 /* Bytes to reach alignment. */ - b.eq .LSrcAligned - sub count, count, tmp2 - /* - * process the aligned offset length to make the src aligned firstly. - * those extra instructions' cost is acceptable. It also make the - * coming accesses are based on aligned address. - */ - tbz tmp2, #0, 1f - ldrb tmp1w, [src, #-1]! - strb tmp1w, [dst, #-1]! -1: - tbz tmp2, #1, 2f - ldrh tmp1w, [src, #-2]! - strh tmp1w, [dst, #-2]! -2: - tbz tmp2, #2, 3f - ldr tmp1w, [src, #-4]! - str tmp1w, [dst, #-4]! -3: - tbz tmp2, #3, .LSrcAligned - ldr tmp1, [src, #-8]! - str tmp1, [dst, #-8]! - -.LSrcAligned: - cmp count, #64 - b.ge .Lcpy_over64 - - /* - * Deal with small copies quickly by dropping straight into the - * exit block. - */ -.Ltail63: - /* - * Copy up to 48 bytes of data. At this point we only need the - * bottom 6 bits of count to be accurate. - */ - ands tmp1, count, #0x30 - b.eq .Ltail15 - cmp tmp1w, #0x20 - b.eq 1f - b.lt 2f - ldp A_l, A_h, [src, #-16]! - stp A_l, A_h, [dst, #-16]! -1: - ldp A_l, A_h, [src, #-16]! - stp A_l, A_h, [dst, #-16]! -2: - ldp A_l, A_h, [src, #-16]! - stp A_l, A_h, [dst, #-16]! - -.Ltail15: - tbz count, #3, 1f - ldr tmp1, [src, #-8]! - str tmp1, [dst, #-8]! -1: - tbz count, #2, 2f - ldr tmp1w, [src, #-4]! - str tmp1w, [dst, #-4]! -2: - tbz count, #1, 3f - ldrh tmp1w, [src, #-2]! - strh tmp1w, [dst, #-2]! -3: - tbz count, #0, .Lexitfunc - ldrb tmp1w, [src, #-1] - strb tmp1w, [dst, #-1] - -.Lexitfunc: - ret - -.Lcpy_over64: - subs count, count, #128 - b.ge .Lcpy_body_large - /* - * Less than 128 bytes to copy, so handle 64 bytes here and then jump - * to the tail. - */ - ldp A_l, A_h, [src, #-16] - stp A_l, A_h, [dst, #-16] - ldp B_l, B_h, [src, #-32] - ldp C_l, C_h, [src, #-48] - stp B_l, B_h, [dst, #-32] - stp C_l, C_h, [dst, #-48] - ldp D_l, D_h, [src, #-64]! - stp D_l, D_h, [dst, #-64]! - - tst count, #0x3f - b.ne .Ltail63 - ret - - /* - * Critical loop. Start at a new cache line boundary. Assuming - * 64 bytes per line this ensures the entire loop is in one line. - */ - .p2align L1_CACHE_SHIFT -.Lcpy_body_large: - /* pre-load 64 bytes data. */ - ldp A_l, A_h, [src, #-16] - ldp B_l, B_h, [src, #-32] - ldp C_l, C_h, [src, #-48] - ldp D_l, D_h, [src, #-64]! -1: - /* - * interlace the load of next 64 bytes data block with store of the last - * loaded 64 bytes data. - */ - stp A_l, A_h, [dst, #-16] - ldp A_l, A_h, [src, #-16] - stp B_l, B_h, [dst, #-32] - ldp B_l, B_h, [src, #-32] - stp C_l, C_h, [dst, #-48] - ldp C_l, C_h, [src, #-48] - stp D_l, D_h, [dst, #-64]! - ldp D_l, D_h, [src, #-64]! - prfm pldl1strm, [src, #(4*L1_CACHE_BYTES)] - subs count, count, #64 - b.ge 1b - stp A_l, A_h, [dst, #-16] - stp B_l, B_h, [dst, #-32] - stp C_l, C_h, [dst, #-48] - stp D_l, D_h, [dst, #-64]! - - tst count, #0x3f - b.ne .Ltail63 - ret -ENDPIPROC(memmove) -ENDPROC(__memmove) diff --git a/arch/arm64/lib/memset.S b/arch/arm64/lib/memset.S index 316263c47c00..282985a7c850 100644 --- a/arch/arm64/lib/memset.S +++ b/arch/arm64/lib/memset.S @@ -1,25 +1,13 @@ +/* SPDX-License-Identifier: GPL-2.0-only */ /* * Copyright (C) 2013 ARM Ltd. * Copyright (C) 2013 Linaro. * * This code is based on glibc cortex strings work originally authored by Linaro - * and re-licensed under GPLv2 for the Linux kernel. 
The original code can * be found @ * * http://bazaar.launchpad.net/~linaro-toolchain-dev/cortex-strings/trunk/ * files/head:/src/aarch64/ - * - * This program is free software; you can redistribute it and/or modify - * it under the terms of the GNU General Public License version 2 as - * published by the Free Software Foundation. - * - * This program is distributed in the hope that it will be useful, - * but WITHOUT ANY WARRANTY; without even the implied warranty of - * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the - * GNU General Public License for more details. - * - * You should have received a copy of the GNU General Public License - * along with this program. If not, see . */ #include @@ -54,8 +42,8 @@ dst .req x8 tmp3w .req w9 tmp3 .req x9 -ENTRY(__memset) -WEAK(memset) +SYM_FUNC_START_ALIAS(__memset) +SYM_FUNC_START_WEAK_PI(memset) mov dst, dstin /* Preserve return value. */ and A_lw, val, #255 orr A_lw, A_lw, A_lw, lsl #8 @@ -214,5 +202,5 @@ WEAK(memset) ands count, count, zva_bits_x b.ne .Ltail_maybe_long ret -ENDPIPROC(memset) -ENDPROC(__memset) +SYM_FUNC_END_PI(memset) +SYM_FUNC_END_ALIAS(__memset) diff --git a/arch/arm64/lib/strcmp.S b/arch/arm64/lib/strcmp.S index 7d5d15398bfb..13c32ad8a94a 100644 --- a/arch/arm64/lib/strcmp.S +++ b/arch/arm64/lib/strcmp.S @@ -1,96 +1,123 @@ +/* SPDX-License-Identifier: GPL-2.0-only */ /* - * Copyright (C) 2013 ARM Ltd. - * Copyright (C) 2013 Linaro. + * Copyright (c) 2012-2021, Arm Limited. * - * This code is based on glibc cortex strings work originally authored by Linaro - * and re-licensed under GPLv2 for the Linux kernel. The original code can - * be found @ - * - * http://bazaar.launchpad.net/~linaro-toolchain-dev/cortex-strings/trunk/ - * files/head:/src/aarch64/ - * - * This program is free software; you can redistribute it and/or modify - * it under the terms of the GNU General Public License version 2 as - * published by the Free Software Foundation. - * - * This program is distributed in the hope that it will be useful, - * but WITHOUT ANY WARRANTY; without even the implied warranty of - * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the - * GNU General Public License for more details. - * - * You should have received a copy of the GNU General Public License - * along with this program. If not, see . + * Adapted from the original at: + * https://github.com/ARM-software/optimized-routines/blob/afd6244a1f8d9229/string/aarch64/strcmp.S */ #include #include -/* - * compare two strings +/* Assumptions: * - * Parameters: - * x0 - const string 1 pointer - * x1 - const string 2 pointer - * Returns: - * x0 - an integer less than, equal to, or greater than zero - * if s1 is found, respectively, to be less than, to match, - * or be greater than s2. + * ARMv8-a, AArch64 */ +#define L(label) .L ## label + #define REP8_01 0x0101010101010101 #define REP8_7f 0x7f7f7f7f7f7f7f7f #define REP8_80 0x8080808080808080 /* Parameters and result. */ -src1 .req x0 -src2 .req x1 -result .req x0 +#define src1 x0 +#define src2 x1 +#define result x0 /* Internal variables. 
*/ -data1 .req x2 -data1w .req w2 -data2 .req x3 -data2w .req w3 -has_nul .req x4 -diff .req x5 -syndrome .req x6 -tmp1 .req x7 -tmp2 .req x8 -tmp3 .req x9 -zeroones .req x10 -pos .req x11 - -WEAK(strcmp) +#define data1 x2 +#define data1w w2 +#define data2 x3 +#define data2w w3 +#define has_nul x4 +#define diff x5 +#define syndrome x6 +#define tmp1 x7 +#define tmp2 x8 +#define tmp3 x9 +#define zeroones x10 +#define pos x11 + + /* Start of performance-critical section -- one 64B cache line. */ + .align 6 +SYM_FUNC_START_WEAK_PI(strcmp) eor tmp1, src1, src2 mov zeroones, #REP8_01 tst tmp1, #7 - b.ne .Lmisaligned8 + b.ne L(misaligned8) ands tmp1, src1, #7 - b.ne .Lmutual_align - - /* - * NUL detection works on the principle that (X - 1) & (~X) & 0x80 - * (=> (X - 1) & ~(X | 0x7f)) is non-zero iff a byte is zero, and - * can be done in parallel across the entire word. - */ -.Lloop_aligned: + b.ne L(mutual_align) + /* NUL detection works on the principle that (X - 1) & (~X) & 0x80 + (=> (X - 1) & ~(X | 0x7f)) is non-zero iff a byte is zero, and + can be done in parallel across the entire word. */ +L(loop_aligned): ldr data1, [src1], #8 ldr data2, [src2], #8 -.Lstart_realigned: +L(start_realigned): sub tmp1, data1, zeroones orr tmp2, data1, #REP8_7f eor diff, data1, data2 /* Non-zero if differences found. */ bic has_nul, tmp1, tmp2 /* Non-zero if NUL terminator. */ orr syndrome, diff, has_nul - cbz syndrome, .Lloop_aligned - b .Lcal_cmpresult + cbz syndrome, L(loop_aligned) + /* End of performance-critical section -- one 64B cache line. */ + +L(end): +#ifndef __AARCH64EB__ + rev syndrome, syndrome + rev data1, data1 + /* The MS-non-zero bit of the syndrome marks either the first bit + that is different, or the top bit of the first zero byte. + Shifting left now will bring the critical information into the + top bits. */ + clz pos, syndrome + rev data2, data2 + lsl data1, data1, pos + lsl data2, data2, pos + /* But we need to zero-extend (char is unsigned) the value and then + perform a signed 32-bit subtraction. */ + lsr data1, data1, #56 + sub result, data1, data2, lsr #56 + ret +#else + /* For big-endian we cannot use the trick with the syndrome value + as carry-propagation can corrupt the upper bits if the trailing + bytes in the string contain 0x01. */ + /* However, if there is no NUL byte in the dword, we can generate + the result directly. We can't just subtract the bytes as the + MSB might be significant. */ + cbnz has_nul, 1f + cmp data1, data2 + cset result, ne + cneg result, result, lo + ret +1: + /* Re-compute the NUL-byte detection, using a byte-reversed value. */ + rev tmp3, data1 + sub tmp1, tmp3, zeroones + orr tmp2, tmp3, #REP8_7f + bic has_nul, tmp1, tmp2 + rev has_nul, has_nul + orr syndrome, diff, has_nul + clz pos, syndrome + /* The MS-non-zero bit of the syndrome marks either the first bit + that is different, or the top bit of the first zero byte. + Shifting left now will bring the critical information into the + top bits. */ + lsl data1, data1, pos + lsl data2, data2, pos + /* But we need to zero-extend (char is unsigned) the value and then + perform a signed 32-bit subtraction. */ + lsr data1, data1, #56 + sub result, data1, data2, lsr #56 + ret +#endif -.Lmutual_align: - /* - * Sources are mutually aligned, but are not currently at an - * alignment boundary. Round down the addresses and then mask off - * the bytes that preceed the start point. - */ +L(mutual_align): + /* Sources are mutually aligned, but are not currently at an + alignment boundary. 
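The NUL-detection identity quoted in the comment above is easiest to see written out in C. A small illustrative helper (not part of the patch) that mirrors the ``sub``/``orr``/``bic`` sequence in ``L(start_realigned)``::

    #include <stdint.h>

    #define REP8_01 0x0101010101010101ULL
    #define REP8_7f 0x7f7f7f7f7f7f7f7fULL

    // Non-zero iff at least one byte of x is 0x00.  Only the 0x80 bit of a
    // byte lane can ever be set, and the lowest zero byte is always marked
    // correctly; false positives can only appear in higher lanes.
    static inline uint64_t has_zero_byte(uint64_t x)
    {
        return (x - REP8_01) & ~(x | REP8_7f);
    }

The aligned loop exits as soon as ``diff | has_zero_byte(data1)`` becomes non-zero, i.e. on the first difference or at the end of the first string, which is exactly what ``orr syndrome, diff, has_nul`` computes.
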
Round down the addresses and then mask off + the bytes that preceed the start point. */ bic src1, src1, #7 bic src2, src2, #7 lsl tmp1, tmp1, #3 /* Bytes beyond alignment -> bits. */ @@ -98,137 +125,51 @@ WEAK(strcmp) neg tmp1, tmp1 /* Bits to alignment -64. */ ldr data2, [src2], #8 mov tmp2, #~0 +#ifdef __AARCH64EB__ /* Big-endian. Early bytes are at MSB. */ -CPU_BE( lsl tmp2, tmp2, tmp1 ) /* Shift (tmp1 & 63). */ + lsl tmp2, tmp2, tmp1 /* Shift (tmp1 & 63). */ +#else /* Little-endian. Early bytes are at LSB. */ -CPU_LE( lsr tmp2, tmp2, tmp1 ) /* Shift (tmp1 & 63). */ - + lsr tmp2, tmp2, tmp1 /* Shift (tmp1 & 63). */ +#endif orr data1, data1, tmp2 orr data2, data2, tmp2 - b .Lstart_realigned - -.Lmisaligned8: - /* - * Get the align offset length to compare per byte first. - * After this process, one string's address will be aligned. - */ - and tmp1, src1, #7 - neg tmp1, tmp1 - add tmp1, tmp1, #8 - and tmp2, src2, #7 - neg tmp2, tmp2 - add tmp2, tmp2, #8 - subs tmp3, tmp1, tmp2 - csel pos, tmp1, tmp2, hi /*Choose the maximum. */ -.Ltinycmp: + b L(start_realigned) + +L(misaligned8): + /* Align SRC1 to 8 bytes and then compare 8 bytes at a time, always + checking to make sure that we don't access beyond page boundary in + SRC2. */ + tst src1, #7 + b.eq L(loop_misaligned) +L(do_misaligned): ldrb data1w, [src1], #1 ldrb data2w, [src2], #1 - subs pos, pos, #1 - ccmp data1w, #1, #0, ne /* NZCV = 0b0000. */ - ccmp data1w, data2w, #0, cs /* NZCV = 0b0000. */ - b.eq .Ltinycmp - cbnz pos, 1f /*find the null or unequal...*/ cmp data1w, #1 - ccmp data1w, data2w, #0, cs - b.eq .Lstart_align /*the last bytes are equal....*/ -1: - sub result, data1, data2 - ret - -.Lstart_align: - ands xzr, src1, #7 - b.eq .Lrecal_offset - /*process more leading bytes to make str1 aligned...*/ - add src1, src1, tmp3 - add src2, src2, tmp3 - /*load 8 bytes from aligned str1 and non-aligned str2..*/ + ccmp data1w, data2w, #0, cs /* NZCV = 0b0000. */ + b.ne L(done) + tst src1, #7 + b.ne L(do_misaligned) + +L(loop_misaligned): + /* Test if we are within the last dword of the end of a 4K page. If + yes then jump back to the misaligned loop to copy a byte at a time. */ + and tmp1, src2, #0xff8 + eor tmp1, tmp1, #0xff8 + cbz tmp1, L(do_misaligned) ldr data1, [src1], #8 ldr data2, [src2], #8 sub tmp1, data1, zeroones orr tmp2, data1, #REP8_7f - bic has_nul, tmp1, tmp2 - eor diff, data1, data2 /* Non-zero if differences found. */ - orr syndrome, diff, has_nul - cbnz syndrome, .Lcal_cmpresult - /*How far is the current str2 from the alignment boundary...*/ - and tmp3, tmp3, #7 -.Lrecal_offset: - neg pos, tmp3 -.Lloopcmp_proc: - /* - * Divide the eight bytes into two parts. First,backwards the src2 - * to an alignment boundary,load eight bytes from the SRC2 alignment - * boundary,then compare with the relative bytes from SRC1. - * If all 8 bytes are equal,then start the second part's comparison. - * Otherwise finish the comparison. - * This special handle can garantee all the accesses are in the - * thread/task space in avoid to overrange access. - */ - ldr data1, [src1,pos] - ldr data2, [src2,pos] - sub tmp1, data1, zeroones - orr tmp2, data1, #REP8_7f - bic has_nul, tmp1, tmp2 - eor diff, data1, data2 /* Non-zero if differences found. */ - orr syndrome, diff, has_nul - cbnz syndrome, .Lcal_cmpresult - - /*The second part process*/ - ldr data1, [src1], #8 - ldr data2, [src2], #8 - sub tmp1, data1, zeroones - orr tmp2, data1, #REP8_7f - bic has_nul, tmp1, tmp2 - eor diff, data1, data2 /* Non-zero if differences found. 
*/ + eor diff, data1, data2 /* Non-zero if differences found. */ + bic has_nul, tmp1, tmp2 /* Non-zero if NUL terminator. */ orr syndrome, diff, has_nul - cbz syndrome, .Lloopcmp_proc + cbz syndrome, L(loop_misaligned) + b L(end) -.Lcal_cmpresult: - /* - * reversed the byte-order as big-endian,then CLZ can find the most - * significant zero bits. - */ -CPU_LE( rev syndrome, syndrome ) -CPU_LE( rev data1, data1 ) -CPU_LE( rev data2, data2 ) - - /* - * For big-endian we cannot use the trick with the syndrome value - * as carry-propagation can corrupt the upper bits if the trailing - * bytes in the string contain 0x01. - * However, if there is no NUL byte in the dword, we can generate - * the result directly. We ca not just subtract the bytes as the - * MSB might be significant. - */ -CPU_BE( cbnz has_nul, 1f ) -CPU_BE( cmp data1, data2 ) -CPU_BE( cset result, ne ) -CPU_BE( cneg result, result, lo ) -CPU_BE( ret ) -CPU_BE( 1: ) - /*Re-compute the NUL-byte detection, using a byte-reversed value. */ -CPU_BE( rev tmp3, data1 ) -CPU_BE( sub tmp1, tmp3, zeroones ) -CPU_BE( orr tmp2, tmp3, #REP8_7f ) -CPU_BE( bic has_nul, tmp1, tmp2 ) -CPU_BE( rev has_nul, has_nul ) -CPU_BE( orr syndrome, diff, has_nul ) - - clz pos, syndrome - /* - * The MS-non-zero bit of the syndrome marks either the first bit - * that is different, or the top bit of the first zero byte. - * Shifting left now will bring the critical information into the - * top bits. - */ - lsl data1, data1, pos - lsl data2, data2, pos - /* - * But we need to zero-extend (char is unsigned) the value and then - * perform a signed 32-bit subtraction. - */ - lsr data1, data1, #56 - sub result, data1, data2, lsr #56 +L(done): + sub result, data1, data2 ret -ENDPIPROC(strcmp) + +SYM_FUNC_END_PI(strcmp) diff --git a/arch/arm64/lib/strlen.S b/arch/arm64/lib/strlen.S index 8e0b14205dcb..7a95b10d0820 100644 --- a/arch/arm64/lib/strlen.S +++ b/arch/arm64/lib/strlen.S @@ -1,126 +1,202 @@ +/* SPDX-License-Identifier: GPL-2.0-only */ /* - * Copyright (C) 2013 ARM Ltd. - * Copyright (C) 2013 Linaro. + * Copyright (c) 2013-2021, Arm Limited. * - * This code is based on glibc cortex strings work originally authored by Linaro - * and re-licensed under GPLv2 for the Linux kernel. The original code can - * be found @ - * - * http://bazaar.launchpad.net/~linaro-toolchain-dev/cortex-strings/trunk/ - * files/head:/src/aarch64/ - * - * This program is free software; you can redistribute it and/or modify - * it under the terms of the GNU General Public License version 2 as - * published by the Free Software Foundation. - * - * This program is distributed in the hope that it will be useful, - * but WITHOUT ANY WARRANTY; without even the implied warranty of - * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the - * GNU General Public License for more details. - * - * You should have received a copy of the GNU General Public License - * along with this program. If not, see . + * Adapted from the original at: + * https://github.com/ARM-software/optimized-routines/blob/98e4d6a5c13c8e54/string/aarch64/strlen.S */ #include #include -/* - * calculate the length of a string +/* Assumptions: * - * Parameters: - * x0 - const string pointer - * Returns: - * x0 - the return length of specific string + * ARMv8-a, AArch64, unaligned accesses, min page size 4k. */ +#define L(label) .L ## label + /* Arguments and results. */ -srcin .req x0 -len .req x0 +#define srcin x0 +#define len x0 /* Locals and temporaries. 
*/ -src .req x1 -data1 .req x2 -data2 .req x3 -data2a .req x4 -has_nul1 .req x5 -has_nul2 .req x6 -tmp1 .req x7 -tmp2 .req x8 -tmp3 .req x9 -tmp4 .req x10 -zeroones .req x11 -pos .req x12 +#define src x1 +#define data1 x2 +#define data2 x3 +#define has_nul1 x4 +#define has_nul2 x5 +#define tmp1 x4 +#define tmp2 x5 +#define tmp3 x6 +#define tmp4 x7 +#define zeroones x8 + + /* NUL detection works on the principle that (X - 1) & (~X) & 0x80 + (=> (X - 1) & ~(X | 0x7f)) is non-zero iff a byte is zero, and + can be done in parallel across the entire word. A faster check + (X - 1) & 0x80 is zero for non-NUL ASCII characters, but gives + false hits for characters 129..255. */ #define REP8_01 0x0101010101010101 #define REP8_7f 0x7f7f7f7f7f7f7f7f #define REP8_80 0x8080808080808080 -WEAK(strlen) - mov zeroones, #REP8_01 - bic src, srcin, #15 - ands tmp1, srcin, #15 - b.ne .Lmisaligned - /* - * NUL detection works on the principle that (X - 1) & (~X) & 0x80 - * (=> (X - 1) & ~(X | 0x7f)) is non-zero iff a byte is zero, and - * can be done in parallel across the entire word. - */ - /* - * The inner loop deals with two Dwords at a time. This has a - * slightly higher start-up cost, but we should win quite quickly, - * especially on cores with a high number of issue slots per - * cycle, as we get much better parallelism out of the operations. - */ -.Lloop: - ldp data1, data2, [src], #16 -.Lrealigned: +#define MIN_PAGE_SIZE 4096 + + /* Since strings are short on average, we check the first 16 bytes + of the string for a NUL character. In order to do an unaligned ldp + safely we have to do a page cross check first. If there is a NUL + byte we calculate the length from the 2 8-byte words using + conditional select to reduce branch mispredictions (it is unlikely + strlen will be repeatedly called on strings with the same length). + + If the string is longer than 16 bytes, we align src so don't need + further page cross checks, and process 32 bytes per iteration + using the fast NUL check. If we encounter non-ASCII characters, + fallback to a second loop using the full NUL check. + + If the page cross check fails, we read 16 bytes from an aligned + address, remove any characters before the string, and continue + in the main loop using aligned loads. Since strings crossing a + page in the first 16 bytes are rare (probability of + 16/MIN_PAGE_SIZE ~= 0.4%), this case does not need to be optimized. + + AArch64 systems have a minimum page size of 4k. We don't bother + checking for larger page sizes - the cost of setting up the correct + page size is just not worth the extra gain from a small reduction in + the cases taking the slow path. Note that we only care about + whether the first fetch, which may be misaligned, crosses a page + boundary. */ + +SYM_FUNC_START_WEAK_PI(strlen) + and tmp1, srcin, MIN_PAGE_SIZE - 1 + mov zeroones, REP8_01 + cmp tmp1, MIN_PAGE_SIZE - 16 + b.gt L(page_cross) + ldp data1, data2, [srcin] +#ifdef __AARCH64EB__ + /* For big-endian, carry propagation (if the final byte in the + string is 0x01) means we cannot use has_nul1/2 directly. + Since we expect strings to be small and early-exit, + byte-swap the data now so has_null1/2 will be correct. 
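To make the tail computation concrete: once a non-zero NUL syndrome has been produced (each zero byte leaves ``0x80`` in its lane), the little-endian code locates the first NUL with ``rev`` plus ``clz`` and converts bits to bytes with ``lsr #3``. A hedged C sketch of that step (``first_nul_index`` is an illustrative name; ``__builtin_bswap64`` and ``__builtin_clzll`` stand in for ``rev`` and ``clz``)::

    #include <stdint.h>

    // Offset of the first zero byte within a little-endian 8-byte word,
    // given the non-zero syndrome (x - REP8_01) & ~(x | REP8_7f).
    // Byte-swapping moves the first marked lane to the top, so the leading
    // zero count divided by 8 is the byte index.  Only valid when the
    // syndrome is non-zero (the builtin is undefined for a zero input).
    static inline unsigned int first_nul_index(uint64_t syndrome)
    {
        return __builtin_clzll(__builtin_bswap64(syndrome)) >> 3;
    }
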
*/ + rev data1, data1 + rev data2, data2 +#endif sub tmp1, data1, zeroones - orr tmp2, data1, #REP8_7f + orr tmp2, data1, REP8_7f sub tmp3, data2, zeroones - orr tmp4, data2, #REP8_7f - bic has_nul1, tmp1, tmp2 - bics has_nul2, tmp3, tmp4 - ccmp has_nul1, #0, #0, eq /* NZCV = 0000 */ - b.eq .Lloop + orr tmp4, data2, REP8_7f + bics has_nul1, tmp1, tmp2 + bic has_nul2, tmp3, tmp4 + ccmp has_nul2, 0, 0, eq + beq L(main_loop_entry) + + /* Enter with C = has_nul1 == 0. */ + csel has_nul1, has_nul1, has_nul2, cc + mov len, 8 + rev has_nul1, has_nul1 + clz tmp1, has_nul1 + csel len, xzr, len, cc + add len, len, tmp1, lsr 3 + ret + /* The inner loop processes 32 bytes per iteration and uses the fast + NUL check. If we encounter non-ASCII characters, use a second + loop with the accurate NUL check. */ + .p2align 4 +L(main_loop_entry): + bic src, srcin, 15 + sub src, src, 16 +L(main_loop): + ldp data1, data2, [src, 32]! +L(page_cross_entry): + sub tmp1, data1, zeroones + sub tmp3, data2, zeroones + orr tmp2, tmp1, tmp3 + tst tmp2, zeroones, lsl 7 + bne 1f + ldp data1, data2, [src, 16] + sub tmp1, data1, zeroones + sub tmp3, data2, zeroones + orr tmp2, tmp1, tmp3 + tst tmp2, zeroones, lsl 7 + beq L(main_loop) + add src, src, 16 +1: + /* The fast check failed, so do the slower, accurate NUL check. */ + orr tmp2, data1, REP8_7f + orr tmp4, data2, REP8_7f + bics has_nul1, tmp1, tmp2 + bic has_nul2, tmp3, tmp4 + ccmp has_nul2, 0, 0, eq + beq L(nonascii_loop) + + /* Enter with C = has_nul1 == 0. */ +L(tail): +#ifdef __AARCH64EB__ + /* For big-endian, carry propagation (if the final byte in the + string is 0x01) means we cannot use has_nul1/2 directly. The + easiest way to get the correct byte is to byte-swap the data + and calculate the syndrome a second time. */ + csel data1, data1, data2, cc + rev data1, data1 + sub tmp1, data1, zeroones + orr tmp2, data1, REP8_7f + bic has_nul1, tmp1, tmp2 +#else + csel has_nul1, has_nul1, has_nul2, cc +#endif sub len, src, srcin - cbz has_nul1, .Lnul_in_data2 -CPU_BE( mov data2, data1 ) /*prepare data to re-calculate the syndrome*/ - sub len, len, #8 - mov has_nul2, has_nul1 -.Lnul_in_data2: - /* - * For big-endian, carry propagation (if the final byte in the - * string is 0x01) means we cannot use has_nul directly. The - * easiest way to get the correct byte is to byte-swap the data - * and calculate the syndrome a second time. - */ -CPU_BE( rev data2, data2 ) -CPU_BE( sub tmp1, data2, zeroones ) -CPU_BE( orr tmp2, data2, #REP8_7f ) -CPU_BE( bic has_nul2, tmp1, tmp2 ) - - sub len, len, #8 - rev has_nul2, has_nul2 - clz pos, has_nul2 - add len, len, pos, lsr #3 /* Bits to bytes. */ + rev has_nul1, has_nul1 + add tmp2, len, 8 + clz tmp1, has_nul1 + csel len, len, tmp2, cc + add len, len, tmp1, lsr 3 ret -.Lmisaligned: - cmp tmp1, #8 - neg tmp1, tmp1 - ldp data1, data2, [src], #16 - lsl tmp1, tmp1, #3 /* Bytes beyond alignment -> bits. */ - mov tmp2, #~0 - /* Big-endian. Early bytes are at MSB. */ -CPU_BE( lsl tmp2, tmp2, tmp1 ) /* Shift (tmp1 & 63). */ +L(nonascii_loop): + ldp data1, data2, [src, 16]! + sub tmp1, data1, zeroones + orr tmp2, data1, REP8_7f + sub tmp3, data2, zeroones + orr tmp4, data2, REP8_7f + bics has_nul1, tmp1, tmp2 + bic has_nul2, tmp3, tmp4 + ccmp has_nul2, 0, 0, eq + bne L(tail) + ldp data1, data2, [src, 16]! 
+ sub tmp1, data1, zeroones + orr tmp2, data1, REP8_7f + sub tmp3, data2, zeroones + orr tmp4, data2, REP8_7f + bics has_nul1, tmp1, tmp2 + bic has_nul2, tmp3, tmp4 + ccmp has_nul2, 0, 0, eq + beq L(nonascii_loop) + b L(tail) + + /* Load 16 bytes from [srcin & ~15] and force the bytes that precede + srcin to 0x7f, so we ignore any NUL bytes before the string. + Then continue in the aligned loop. */ +L(page_cross): + bic src, srcin, 15 + ldp data1, data2, [src] + lsl tmp1, srcin, 3 + mov tmp4, -1 +#ifdef __AARCH64EB__ + /* Big-endian. Early bytes are at MSB. */ + lsr tmp1, tmp4, tmp1 /* Shift (tmp1 & 63). */ +#else /* Little-endian. Early bytes are at LSB. */ -CPU_LE( lsr tmp2, tmp2, tmp1 ) /* Shift (tmp1 & 63). */ - - orr data1, data1, tmp2 - orr data2a, data2, tmp2 - csinv data1, data1, xzr, le - csel data2, data2, data2a, le - b .Lrealigned -ENDPIPROC(strlen) + lsl tmp1, tmp4, tmp1 /* Shift (tmp1 & 63). */ +#endif + orr tmp1, tmp1, REP8_80 + orn data1, data1, tmp1 + orn tmp2, data2, tmp1 + tst srcin, 8 + csel data1, data1, tmp4, eq + csel data2, data2, tmp2, eq + b L(page_cross_entry) + +SYM_FUNC_END_PI(strlen) diff --git a/arch/arm64/lib/strncmp.S b/arch/arm64/lib/strncmp.S index 66bd145935d9..de324476c482 100644 --- a/arch/arm64/lib/strncmp.S +++ b/arch/arm64/lib/strncmp.S @@ -1,310 +1,260 @@ +/* SPDX-License-Identifier: GPL-2.0-only */ /* - * Copyright (C) 2013 ARM Ltd. - * Copyright (C) 2013 Linaro. + * Copyright (c) 2013-2021, Arm Limited. * - * This code is based on glibc cortex strings work originally authored by Linaro - * and re-licensed under GPLv2 for the Linux kernel. The original code can - * be found @ - * - * http://bazaar.launchpad.net/~linaro-toolchain-dev/cortex-strings/trunk/ - * files/head:/src/aarch64/ - * - * This program is free software; you can redistribute it and/or modify - * it under the terms of the GNU General Public License version 2 as - * published by the Free Software Foundation. - * - * This program is distributed in the hope that it will be useful, - * but WITHOUT ANY WARRANTY; without even the implied warranty of - * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the - * GNU General Public License for more details. - * - * You should have received a copy of the GNU General Public License - * along with this program. If not, see . + * Adapted from the original at: + * https://github.com/ARM-software/optimized-routines/blob/e823e3abf5f89ecb/string/aarch64/strncmp.S */ #include #include -/* - * compare two strings +/* Assumptions: * - * Parameters: - * x0 - const string 1 pointer - * x1 - const string 2 pointer - * x2 - the maximal length to be compared - * Returns: - * x0 - an integer less than, equal to, or greater than zero if s1 is found, - * respectively, to be less than, to match, or be greater than s2. + * ARMv8-a, AArch64 */ +#define L(label) .L ## label + #define REP8_01 0x0101010101010101 #define REP8_7f 0x7f7f7f7f7f7f7f7f #define REP8_80 0x8080808080808080 /* Parameters and result. */ -src1 .req x0 -src2 .req x1 -limit .req x2 -result .req x0 +#define src1 x0 +#define src2 x1 +#define limit x2 +#define result x0 /* Internal variables. 
*/ -data1 .req x3 -data1w .req w3 -data2 .req x4 -data2w .req w4 -has_nul .req x5 -diff .req x6 -syndrome .req x7 -tmp1 .req x8 -tmp2 .req x9 -tmp3 .req x10 -zeroones .req x11 -pos .req x12 -limit_wd .req x13 -mask .req x14 -endloop .req x15 +#define data1 x3 +#define data1w w3 +#define data2 x4 +#define data2w w4 +#define has_nul x5 +#define diff x6 +#define syndrome x7 +#define tmp1 x8 +#define tmp2 x9 +#define tmp3 x10 +#define zeroones x11 +#define pos x12 +#define limit_wd x13 +#define mask x14 +#define endloop x15 +#define count mask -WEAK(strncmp) - cbz limit, .Lret0 +SYM_FUNC_START_WEAK_PI(strncmp) + cbz limit, L(ret0) eor tmp1, src1, src2 mov zeroones, #REP8_01 tst tmp1, #7 - b.ne .Lmisaligned8 - ands tmp1, src1, #7 - b.ne .Lmutual_align + and count, src1, #7 + b.ne L(misaligned8) + cbnz count, L(mutual_align) /* Calculate the number of full and partial words -1. */ - /* - * when limit is mulitply of 8, if not sub 1, - * the judgement of last dword will wrong. - */ - sub limit_wd, limit, #1 /* limit != 0, so no underflow. */ - lsr limit_wd, limit_wd, #3 /* Convert to Dwords. */ + sub limit_wd, limit, #1 /* limit != 0, so no underflow. */ + lsr limit_wd, limit_wd, #3 /* Convert to Dwords. */ - /* - * NUL detection works on the principle that (X - 1) & (~X) & 0x80 - * (=> (X - 1) & ~(X | 0x7f)) is non-zero iff a byte is zero, and - * can be done in parallel across the entire word. - */ -.Lloop_aligned: + /* NUL detection works on the principle that (X - 1) & (~X) & 0x80 + (=> (X - 1) & ~(X | 0x7f)) is non-zero iff a byte is zero, and + can be done in parallel across the entire word. */ + .p2align 4 +L(loop_aligned): ldr data1, [src1], #8 ldr data2, [src2], #8 -.Lstart_realigned: +L(start_realigned): subs limit_wd, limit_wd, #1 sub tmp1, data1, zeroones orr tmp2, data1, #REP8_7f - eor diff, data1, data2 /* Non-zero if differences found. */ - csinv endloop, diff, xzr, pl /* Last Dword or differences.*/ - bics has_nul, tmp1, tmp2 /* Non-zero if NUL terminator. */ + eor diff, data1, data2 /* Non-zero if differences found. */ + csinv endloop, diff, xzr, pl /* Last Dword or differences. */ + bics has_nul, tmp1, tmp2 /* Non-zero if NUL terminator. */ ccmp endloop, #0, #0, eq - b.eq .Lloop_aligned + b.eq L(loop_aligned) + /* End of main loop */ - /*Not reached the limit, must have found the end or a diff. */ - tbz limit_wd, #63, .Lnot_limit + /* Not reached the limit, must have found the end or a diff. */ + tbz limit_wd, #63, L(not_limit) /* Limit % 8 == 0 => all bytes significant. */ ands limit, limit, #7 - b.eq .Lnot_limit + b.eq L(not_limit) - lsl limit, limit, #3 /* Bits -> bytes. */ + lsl limit, limit, #3 /* Bits -> bytes. */ mov mask, #~0 -CPU_BE( lsr mask, mask, limit ) -CPU_LE( lsl mask, mask, limit ) +#ifdef __AARCH64EB__ + lsr mask, mask, limit +#else + lsl mask, mask, limit +#endif bic data1, data1, mask bic data2, data2, mask /* Make sure that the NUL byte is marked in the syndrome. */ orr has_nul, has_nul, mask -.Lnot_limit: +L(not_limit): orr syndrome, diff, has_nul - b .Lcal_cmpresult -.Lmutual_align: - /* - * Sources are mutually aligned, but are not currently at an - * alignment boundary. Round down the addresses and then mask off - * the bytes that precede the start point. - * We also need to adjust the limit calculations, but without - * overflowing if the limit is near ULONG_MAX. 
- */ +#ifndef __AARCH64EB__ + rev syndrome, syndrome + rev data1, data1 + /* The MS-non-zero bit of the syndrome marks either the first bit + that is different, or the top bit of the first zero byte. + Shifting left now will bring the critical information into the + top bits. */ + clz pos, syndrome + rev data2, data2 + lsl data1, data1, pos + lsl data2, data2, pos + /* But we need to zero-extend (char is unsigned) the value and then + perform a signed 32-bit subtraction. */ + lsr data1, data1, #56 + sub result, data1, data2, lsr #56 + ret +#else + /* For big-endian we cannot use the trick with the syndrome value + as carry-propagation can corrupt the upper bits if the trailing + bytes in the string contain 0x01. */ + /* However, if there is no NUL byte in the dword, we can generate + the result directly. We can't just subtract the bytes as the + MSB might be significant. */ + cbnz has_nul, 1f + cmp data1, data2 + cset result, ne + cneg result, result, lo + ret +1: + /* Re-compute the NUL-byte detection, using a byte-reversed value. */ + rev tmp3, data1 + sub tmp1, tmp3, zeroones + orr tmp2, tmp3, #REP8_7f + bic has_nul, tmp1, tmp2 + rev has_nul, has_nul + orr syndrome, diff, has_nul + clz pos, syndrome + /* The MS-non-zero bit of the syndrome marks either the first bit + that is different, or the top bit of the first zero byte. + Shifting left now will bring the critical information into the + top bits. */ + lsl data1, data1, pos + lsl data2, data2, pos + /* But we need to zero-extend (char is unsigned) the value and then + perform a signed 32-bit subtraction. */ + lsr data1, data1, #56 + sub result, data1, data2, lsr #56 + ret +#endif + +L(mutual_align): + /* Sources are mutually aligned, but are not currently at an + alignment boundary. Round down the addresses and then mask off + the bytes that precede the start point. + We also need to adjust the limit calculations, but without + overflowing if the limit is near ULONG_MAX. */ bic src1, src1, #7 bic src2, src2, #7 ldr data1, [src1], #8 - neg tmp3, tmp1, lsl #3 /* 64 - bits(bytes beyond align). */ + neg tmp3, count, lsl #3 /* 64 - bits(bytes beyond align). */ ldr data2, [src2], #8 mov tmp2, #~0 - sub limit_wd, limit, #1 /* limit != 0, so no underflow. */ + sub limit_wd, limit, #1 /* limit != 0, so no underflow. */ +#ifdef __AARCH64EB__ /* Big-endian. Early bytes are at MSB. */ -CPU_BE( lsl tmp2, tmp2, tmp3 ) /* Shift (tmp1 & 63). */ + lsl tmp2, tmp2, tmp3 /* Shift (count & 63). */ +#else /* Little-endian. Early bytes are at LSB. */ -CPU_LE( lsr tmp2, tmp2, tmp3 ) /* Shift (tmp1 & 63). */ - + lsr tmp2, tmp2, tmp3 /* Shift (count & 63). */ +#endif and tmp3, limit_wd, #7 lsr limit_wd, limit_wd, #3 - /* Adjust the limit. Only low 3 bits used, so overflow irrelevant.*/ - add limit, limit, tmp1 - add tmp3, tmp3, tmp1 + /* Adjust the limit. Only low 3 bits used, so overflow irrelevant. */ + add limit, limit, count + add tmp3, tmp3, count orr data1, data1, tmp2 orr data2, data2, tmp2 add limit_wd, limit_wd, tmp3, lsr #3 - b .Lstart_realigned + b L(start_realigned) + + .p2align 4 + /* Don't bother with dwords for up to 16 bytes. */ +L(misaligned8): + cmp limit, #16 + b.hs L(try_misaligned_words) -/*when src1 offset is not equal to src2 offset...*/ -.Lmisaligned8: - cmp limit, #8 - b.lo .Ltiny8proc /*limit < 8... */ - /* - * Get the align offset length to compare per byte first. 
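The little-endian tail above turns the syndrome into a return value with a byte reverse, a count-leading-zeros and two shifts. A rough user-space model of that path, with __builtin_bswap64/__builtin_clzll standing in for the rev and clz instructions; function and sample names are illustrative, not from the patch::

    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    #define REP8_01 0x0101010101010101ULL
    #define REP8_7f 0x7f7f7f7f7f7f7f7fULL

    /* Compare two 8-byte little-endian chunks the way the tail above does. */
    static int cmp_chunk_le(uint64_t data1, uint64_t data2)
    {
            uint64_t has_nul = (data1 - REP8_01) & ~(data1 | REP8_7f);
            uint64_t diff = data1 ^ data2;       /* non-zero if differences */
            uint64_t syndrome = diff | has_nul;

            if (!syndrome)
                    return 0;                    /* equal so far, no NUL */

            /* Byte 0 is the LSB on little-endian, so reverse before clz. */
            syndrome = __builtin_bswap64(syndrome);
            data1 = __builtin_bswap64(data1);
            data2 = __builtin_bswap64(data2);

            /* Bit position of the first difference or NUL terminator. */
            int pos = __builtin_clzll(syndrome);

            /* Shift it to the top, then compare as unsigned chars. */
            data1 <<= pos;
            data2 <<= pos;
            return (int)(data1 >> 56) - (int)(data2 >> 56);
    }

    int main(void)
    {
            uint64_t a, b;

            memcpy(&a, "abcdefgh", 8);
            memcpy(&b, "abcdeXgh", 8);
            printf("%d\n", cmp_chunk_le(a, b));  /* positive: 'f' > 'X' */
            return 0;
    }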
- * After this process, one string's address will be aligned.*/ - and tmp1, src1, #7 - neg tmp1, tmp1 - add tmp1, tmp1, #8 - and tmp2, src2, #7 - neg tmp2, tmp2 - add tmp2, tmp2, #8 - subs tmp3, tmp1, tmp2 - csel pos, tmp1, tmp2, hi /*Choose the maximum. */ - /* - * Here, limit is not less than 8, so directly run .Ltinycmp - * without checking the limit.*/ - sub limit, limit, pos -.Ltinycmp: +L(byte_loop): + /* Perhaps we can do better than this. */ ldrb data1w, [src1], #1 ldrb data2w, [src2], #1 - subs pos, pos, #1 - ccmp data1w, #1, #0, ne /* NZCV = 0b0000. */ - ccmp data1w, data2w, #0, cs /* NZCV = 0b0000. */ - b.eq .Ltinycmp - cbnz pos, 1f /*find the null or unequal...*/ - cmp data1w, #1 - ccmp data1w, data2w, #0, cs - b.eq .Lstart_align /*the last bytes are equal....*/ -1: + subs limit, limit, #1 + ccmp data1w, #1, #0, hi /* NZCV = 0b0000. */ + ccmp data1w, data2w, #0, cs /* NZCV = 0b0000. */ + b.eq L(byte_loop) +L(done): sub result, data1, data2 ret - -.Lstart_align: + /* Align the SRC1 to a dword by doing a bytewise compare and then do + the dword loop. */ +L(try_misaligned_words): lsr limit_wd, limit, #3 - cbz limit_wd, .Lremain8 - /*process more leading bytes to make str1 aligned...*/ - ands xzr, src1, #7 - b.eq .Lrecal_offset - add src1, src1, tmp3 /*tmp3 is positive in this branch.*/ - add src2, src2, tmp3 - ldr data1, [src1], #8 - ldr data2, [src2], #8 + cbz count, L(do_misaligned) - sub limit, limit, tmp3 + neg count, count + and count, count, #7 + sub limit, limit, count lsr limit_wd, limit, #3 - subs limit_wd, limit_wd, #1 - sub tmp1, data1, zeroones - orr tmp2, data1, #REP8_7f - eor diff, data1, data2 /* Non-zero if differences found. */ - csinv endloop, diff, xzr, ne/*if limit_wd is 0,will finish the cmp*/ - bics has_nul, tmp1, tmp2 - ccmp endloop, #0, #0, eq /*has_null is ZERO: no null byte*/ - b.ne .Lunequal_proc - /*How far is the current str2 from the alignment boundary...*/ - and tmp3, tmp3, #7 -.Lrecal_offset: - neg pos, tmp3 -.Lloopcmp_proc: - /* - * Divide the eight bytes into two parts. First,backwards the src2 - * to an alignment boundary,load eight bytes from the SRC2 alignment - * boundary,then compare with the relative bytes from SRC1. - * If all 8 bytes are equal,then start the second part's comparison. - * Otherwise finish the comparison. - * This special handle can garantee all the accesses are in the - * thread/task space in avoid to overrange access. - */ - ldr data1, [src1,pos] - ldr data2, [src2,pos] - sub tmp1, data1, zeroones - orr tmp2, data1, #REP8_7f - bics has_nul, tmp1, tmp2 /* Non-zero if NUL terminator. */ - eor diff, data1, data2 /* Non-zero if differences found. */ - csinv endloop, diff, xzr, eq - cbnz endloop, .Lunequal_proc +L(page_end_loop): + ldrb data1w, [src1], #1 + ldrb data2w, [src2], #1 + cmp data1w, #1 + ccmp data1w, data2w, #0, cs /* NZCV = 0b0000. */ + b.ne L(done) + subs count, count, #1 + b.hi L(page_end_loop) + +L(do_misaligned): + /* Prepare ourselves for the next page crossing. Unlike the aligned + loop, we fetch 1 less dword because we risk crossing bounds on + SRC2. */ + mov count, #8 + subs limit_wd, limit_wd, #1 + b.lo L(done_loop) +L(loop_misaligned): + and tmp2, src2, #0xff8 + eor tmp2, tmp2, #0xff8 + cbz tmp2, L(page_end_loop) - /*The second part process*/ ldr data1, [src1], #8 ldr data2, [src2], #8 - subs limit_wd, limit_wd, #1 sub tmp1, data1, zeroones orr tmp2, data1, #REP8_7f - eor diff, data1, data2 /* Non-zero if differences found. 
*/ - csinv endloop, diff, xzr, ne/*if limit_wd is 0,will finish the cmp*/ - bics has_nul, tmp1, tmp2 - ccmp endloop, #0, #0, eq /*has_null is ZERO: no null byte*/ - b.eq .Lloopcmp_proc - -.Lunequal_proc: - orr syndrome, diff, has_nul - cbz syndrome, .Lremain8 -.Lcal_cmpresult: - /* - * reversed the byte-order as big-endian,then CLZ can find the most - * significant zero bits. - */ -CPU_LE( rev syndrome, syndrome ) -CPU_LE( rev data1, data1 ) -CPU_LE( rev data2, data2 ) - /* - * For big-endian we cannot use the trick with the syndrome value - * as carry-propagation can corrupt the upper bits if the trailing - * bytes in the string contain 0x01. - * However, if there is no NUL byte in the dword, we can generate - * the result directly. We can't just subtract the bytes as the - * MSB might be significant. - */ -CPU_BE( cbnz has_nul, 1f ) -CPU_BE( cmp data1, data2 ) -CPU_BE( cset result, ne ) -CPU_BE( cneg result, result, lo ) -CPU_BE( ret ) -CPU_BE( 1: ) - /* Re-compute the NUL-byte detection, using a byte-reversed value.*/ -CPU_BE( rev tmp3, data1 ) -CPU_BE( sub tmp1, tmp3, zeroones ) -CPU_BE( orr tmp2, tmp3, #REP8_7f ) -CPU_BE( bic has_nul, tmp1, tmp2 ) -CPU_BE( rev has_nul, has_nul ) -CPU_BE( orr syndrome, diff, has_nul ) - /* - * The MS-non-zero bit of the syndrome marks either the first bit - * that is different, or the top bit of the first zero byte. - * Shifting left now will bring the critical information into the - * top bits. - */ - clz pos, syndrome - lsl data1, data1, pos - lsl data2, data2, pos - /* - * But we need to zero-extend (char is unsigned) the value and then - * perform a signed 32-bit subtraction. - */ - lsr data1, data1, #56 - sub result, data1, data2, lsr #56 - ret - -.Lremain8: - /* Limit % 8 == 0 => all bytes significant. */ - ands limit, limit, #7 - b.eq .Lret0 -.Ltiny8proc: - ldrb data1w, [src1], #1 - ldrb data2w, [src2], #1 - subs limit, limit, #1 + eor diff, data1, data2 /* Non-zero if differences found. */ + bics has_nul, tmp1, tmp2 /* Non-zero if NUL terminator. */ + ccmp diff, #0, #0, eq + b.ne L(not_limit) + subs limit_wd, limit_wd, #1 + b.pl L(loop_misaligned) - ccmp data1w, #1, #0, ne /* NZCV = 0b0000. */ - ccmp data1w, data2w, #0, cs /* NZCV = 0b0000. */ - b.eq .Ltiny8proc - sub result, data1, data2 - ret +L(done_loop): + /* We found a difference or a NULL before the limit was reached. */ + and limit, limit, #7 + cbz limit, L(not_limit) + /* Read the last word. */ + sub src1, src1, 8 + sub src2, src2, 8 + ldr data1, [src1, limit] + ldr data2, [src2, limit] + sub tmp1, data1, zeroones + orr tmp2, data1, #REP8_7f + eor diff, data1, data2 /* Non-zero if differences found. */ + bics has_nul, tmp1, tmp2 /* Non-zero if NUL terminator. 
*/ + ccmp diff, #0, #0, eq + b.ne L(not_limit) -.Lret0: +L(ret0): mov result, #0 ret -ENDPIPROC(strncmp) + +SYM_FUNC_END_PI(strncmp) diff --git a/arch/x86/include/asm/linkage.h b/arch/x86/include/asm/linkage.h index 14caa9d9fb7f..e07188e8d763 100644 --- a/arch/x86/include/asm/linkage.h +++ b/arch/x86/include/asm/linkage.h @@ -13,9 +13,13 @@ #ifdef __ASSEMBLY__ -#define GLOBAL(name) \ - .globl name; \ - name: +/* + * GLOBAL is DEPRECATED + * + * use SYM_DATA_START, SYM_FUNC_START, SYM_INNER_LABEL, SYM_CODE_START, or + * similar + */ +#define GLOBAL(name) SYM_ENTRY(name, SYM_L_GLOBAL, SYM_A_NONE) #if defined(CONFIG_X86_64) || defined(CONFIG_X86_ALIGNMENT_16) #define __ALIGN .p2align 4, 0x90 diff --git a/drivers/char/Kconfig b/drivers/char/Kconfig index a6c48a4882ea..6194dcf33c83 100644 --- a/drivers/char/Kconfig +++ b/drivers/char/Kconfig @@ -51,6 +51,21 @@ config SGI_MBCS source "drivers/tty/serial/Kconfig" source "drivers/tty/serdev/Kconfig" +config SRANDOM + tristate "Seed PRNG to replace urandom" + default n + ---help--- + If you say Y here, The kernel support for + Seed PRNG will be enabled. + + This driver will improve built-in random number generators + useful for faster RNG to wipe SSDs. + + To compile this driver as a module, choose M here: the + module will be called srandom. + + If unsure, say N. + config TTY_PRINTK tristate "TTY driver to output user messages via printk" depends on EXPERT && TTY diff --git a/drivers/char/Makefile b/drivers/char/Makefile index 5d633d50b363..7b91ae5d5219 100644 --- a/drivers/char/Makefile +++ b/drivers/char/Makefile @@ -4,6 +4,7 @@ # obj-y += mem.o random.o +obj-$(CONFIG_SRANDOM) += srandom.o obj-$(CONFIG_TTY_PRINTK) += ttyprintk.o obj-y += misc.o obj-$(CONFIG_ATARI_DSP56K) += dsp56k.o diff --git a/drivers/char/mem.c b/drivers/char/mem.c index d861992060e9..4c14a8832aa3 100644 --- a/drivers/char/mem.c +++ b/drivers/char/mem.c @@ -38,6 +38,10 @@ #define DEVPORT_MINOR 4 +#ifdef CONFIG_SRANDOM +#include +#endif + static inline unsigned long size_inside_page(unsigned long start, unsigned long size) { @@ -893,8 +897,20 @@ static const struct memdev { #endif [5] = { "zero", 0666, &zero_fops, 0 }, [7] = { "full", 0666, &full_fops, 0 }, - [8] = { "random", 0666, &urandom_fops, 0 }, + #ifdef CONFIG_SRANDOM + [8] = { "random", 0666, &sfops, 0 }, + [9] = { "urandom", 0666, &sfops, 0 }, + #else + [8] = { "random", 0666, &random_fops, 0 }, [9] = { "urandom", 0666, &urandom_fops, 0 }, + #endif + #ifndef CONFIG_HW_RANDOM + #ifndef CONFIG_SRANDOM + [10] = { "hw_random", 0666, &urandom_fops, 0 }, + #else + [10] = { "hw_random", 0666, &sfops, 0 }, + #endif + #endif #ifdef CONFIG_PRINTK [11] = { "kmsg", 0644, &kmsg_fops, 0 }, #endif diff --git a/drivers/char/srandom.c b/drivers/char/srandom.c new file mode 100644 index 000000000000..3f32bbb8804d --- /dev/null +++ b/drivers/char/srandom.c @@ -0,0 +1,640 @@ +/* + * Copyright (C) 2015-2019 Jonathan Senkerik + * + * This program is free software: you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation, either version 3 of the License, or + * (at your option) any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + * + * You should have received a copy of the GNU General Public License + * along with this program. 
If not, see . + */ + +#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt + +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include + +/* + * Size of Array. + * Must be >= 64. + * (actual size used will be 64 + * anything greater is thrown away). + * Recommended prime. + */ +#define arr_RND_SIZE 67 +/* + * Number of 512b Array + * (Must be power of 2) + */ +#define num_arr_RND 16 +/* + * Dev name as it appears in /proc/devices + */ +#define sDEVICE_NAME "srandom" +#define AppVERSION "1.38.0" +/* + * Amount of time worker thread should sleep between each operation. + * Recommended prime + */ +#define THREAD_SLEEP_VALUE 7 +#define PAID 0 +#define COPY_TO_USER raw_copy_to_user +#define COPY_FROM_USER raw_copy_from_user +#define KTIME_GET_NS ktime_get_real_ts64 +#define TIMESPEC timespec64 + +/* + * Prototypes + */ +static int device_open(struct inode *, struct file *); +static int device_release(struct inode *, struct file *); +static uint64_t xorshft64(void); +static uint64_t xorshft128(void); +static int nextbuffer(void); +static void update_sarray(int); +static void seed_PRND_s0(void); +static void seed_PRND_s1(void); +static void seed_PRND_x(void); +static int proc_read(struct seq_file *m, void *v); +static int proc_open(struct inode *inode, struct file *file); +static int work_thread(void *data); + +/* + * Global variables are declared as static, so are global within the file. + */ +const struct file_operations sfops = { + .owner = THIS_MODULE, + .open = device_open, + .read = sdevice_read, + .write = sdevice_write, + .release = device_release +}; + +static struct miscdevice srandom_dev = { + MISC_DYNAMIC_MINOR, + "srandom", + &sfops +}; + + +static const struct file_operations proc_fops = { + .owner = THIS_MODULE, + .read = seq_read, + .open = proc_open, + .llseek = seq_lseek, + .release = single_release, +}; + +static struct mutex UpArr_mutex; +static struct mutex Open_mutex; +static struct mutex ArrBusy_mutex; +static struct mutex UpPos_mutex; + +static struct task_struct *kthread; + +/* + * Global variables + */ +/* Used for xorshft64 */ +uint64_t x; +/* Used for xorshft128 */ +uint64_t s[2]; +/* Array of Array of SECURE RND numbers */ +uint64_t (*sarr_RND)[num_arr_RND + 1]; +/* Binary Flags for Busy Arrays */ +uint16_t CC_Busy_Flags; +/* Array reserved to determine which buffer to use */ +int CC_buffer_position; + +uint64_t tm_seed; +struct TIMESPEC ts; + +/* + * Global counters + */ +int16_t sdev_open; /* srandom device current open count */ +int32_t sdev_openCount; /* srandom device total open count */ +uint64_t PRNGCount; /* Total generated (512byte) */ + +/* + * This function is called when the module is loaded + */ +int mod_init(void) +{ + int16_t C, CC; + int ret; + + sdev_open = 0; + sdev_openCount = 0; + PRNGCount = 0; + + mutex_init(&UpArr_mutex); + mutex_init(&Open_mutex); + mutex_init(&ArrBusy_mutex); + mutex_init(&UpPos_mutex); + + /* + * Entropy Initialize #1 + */ + KTIME_GET_NS(&ts); + x = (uint64_t)ts.tv_nsec; + s[0] = xorshft64(); + s[1] = xorshft64(); + + /* + * Register char device + */ + ret = misc_register(&srandom_dev); + if (ret) + pr_debug("/dev/srandom registration failed..\n"); + else + pr_debug("/dev/srandom registered..\n"); + + /* + * Create /proc/srandom + */ + if (!proc_create("srandom", 0, NULL, &proc_fops)) + pr_debug("/proc/srandom registration failed..\n"); + else + pr_debug("/proc/srandom registration registered..\n"); + + pr_debug("Module version: "AppVERSION"\n"); + + sarr_RND = 
kzalloc((num_arr_RND + 1) * arr_RND_SIZE * sizeof(uint64_t), + GFP_KERNEL); + while (!sarr_RND) { + pr_debug("kzalloc failed to allocate initial memory. retrying...\n"); + sarr_RND = kzalloc((num_arr_RND + 1) * + arr_RND_SIZE * sizeof(uint64_t), GFP_KERNEL); + } + + /* + * Entropy Initialize #2 + */ + seed_PRND_s0(); + seed_PRND_s1(); + seed_PRND_x(); + + /* + * Init the sarray + */ + for (CC = 0; num_arr_RND >= CC; CC++) { + for (C = 0; arr_RND_SIZE >= C; C++) + sarr_RND[CC][C] = xorshft128(); + update_sarray(CC); + } + + kthread = kthread_create(work_thread, NULL, "mykthread"); + wake_up_process(kthread); + + return 0; +} + +/* + * This function is called when the module is unloaded + */ +void mod_exit(void) +{ + kthread_stop(kthread); + misc_deregister(&srandom_dev); + remove_proc_entry("srandom", NULL); + pr_debug("srandom deregistered..\n"); +} + + +/* + * This function is alled when a process tries to open the device file. + * "dd if=/dev/srandom" + */ +static int device_open(struct inode *inode, struct file *file) +{ + while (mutex_lock_interruptible(&Open_mutex)) + ; + + sdev_open++; + sdev_openCount++; + mutex_unlock(&Open_mutex); + + pr_debug("(current open) :%d\n", sdev_open); + pr_debug("(total open) :%d\n", sdev_openCount); + + return 0; +} + + +/* + * Called when a process closes the device file. + */ +static int device_release(struct inode *inode, struct file *file) +{ + while (mutex_lock_interruptible(&Open_mutex)) + ; + + sdev_open--; + mutex_unlock(&Open_mutex); + + pr_debug("(current open) :%d\n", sdev_open); + + return 0; +} + +/* + * Called when a process reads from the device. + */ +ssize_t sdevice_read(struct file *file, char *buf, +size_t count, loff_t *ppos) +{ + /* Buffer to hold numbers to send */ + char *new_buf; + int ret, counter; + int CC; + size_t src_counter; + + pr_debug("count:%zu\n", count); + + /* + * if requested count is small (<512), then select an array and send it + * otherwise, create a new larger buffer to hold it all. 
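For context, the read path below hands out at most one pre-filled 512-byte array per pass and loops for larger requests. A hypothetical user-space consumer would treat the node like any other character device; nothing in this sketch is part of the patch itself::

    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
            unsigned char buf[512]; /* one array's worth -- the driver's fast path */
            int fd = open("/dev/srandom", O_RDONLY);

            if (fd < 0) {
                    perror("open /dev/srandom");
                    return 1;
            }
            if (read(fd, buf, sizeof(buf)) != (ssize_t)sizeof(buf)) {
                    fprintf(stderr, "short or failed read\n");
                    close(fd);
                    return 1;
            }
            printf("first byte: 0x%02x\n", buf[0]);
            close(fd);
            return 0;
    }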
+ */ + if (count <= 512) { + while (mutex_lock_interruptible(&ArrBusy_mutex)) + ; + + CC = nextbuffer(); + while ((CC_Busy_Flags & 1 << CC) == (1 << CC)) { + CC += 1; + if (num_arr_RND <= CC) + CC = 0; + } + + /* + * Mark the Arry as busy by setting the flag + */ + CC_Busy_Flags += (1 << CC); + mutex_unlock(&ArrBusy_mutex); + + /* + * Send array to device + */ + ret = COPY_TO_USER(buf, sarr_RND[CC], count); + + /* + * Get more RND numbers + */ + update_sarray(CC); + + pr_debug("small CC_Busy_Flags:%d CC:%d\n", CC_Busy_Flags, CC); + + /* + * Clear CC_Busy_Flag + */ + if (mutex_lock_interruptible(&ArrBusy_mutex)) + return -ERESTARTSYS; + + CC_Busy_Flags -= (1 << CC); + mutex_unlock(&ArrBusy_mutex); + } else { + /* + * Allocate memory for new_buf + */ + long count_remaining = count; + + pr_debug("count_remaining:%ld count:%ld\n", + count_remaining, count); + + while (count_remaining > 0) { + pr_debug("count_remaining:%ld count:%ld\n", + count_remaining, count); + + new_buf = kzalloc((count_remaining + 512) * + sizeof(uint8_t), GFP_KERNEL); + while (!new_buf) { + pr_debug("buffered kzalloc failed to allocate buffer.", + "retrying...\n"); + new_buf = kzalloc((count_remaining + 512) * + sizeof(uint8_t), GFP_KERNEL); + } + + counter = 0; + src_counter = 512; + ret = 0; + + /* + * Select a RND array + */ + while (mutex_lock_interruptible(&ArrBusy_mutex)) + ; + + CC = nextbuffer(); + while ((CC_Busy_Flags & 1 << CC) == (1 << CC)) { + CC = xorshft128() & (num_arr_RND - 1); + pr_debug("buffered CC_Busy_Flags:%d CC:%d\n", + CC_Busy_Flags, CC); + } + + /* + * Mark the Arry as busy by setting the flag + */ + CC_Busy_Flags += (1 << CC); + mutex_unlock(&ArrBusy_mutex); + + /* + * Loop until we reach count_remaining size. + */ + while (counter < (int)count_remaining) { + /* + * Copy RND numbers to new_buf + */ + memcpy(new_buf + counter, sarr_RND[CC], + src_counter); + update_sarray(CC); + + pr_debug("buffered COPT_TO_USER counter:%d count_remaining:%zu\n", + counter, count_remaining); + + counter += 512; + } + + /* + * Clear CC_Busy_Flag + */ + while (mutex_lock_interruptible(&ArrBusy_mutex)) + ; + + CC_Busy_Flags -= (1 << CC); + mutex_unlock(&ArrBusy_mutex); + + /* + * Send new_buf to device + */ + ret = COPY_TO_USER(buf, new_buf, count_remaining); + + /* + * Free allocated memory + */ + kfree(new_buf); + + count_remaining = count_remaining - 1048576; + } + } + /* + * return how many chars we sent + */ + return count; +} +EXPORT_SYMBOL(sdevice_read); + +/* + * Called when someone tries to write to /dev/srandom device + */ +ssize_t sdevice_write(struct file *file, +const char __user *buf, size_t count, loff_t *ppos) +{ + char *newdata; + int ret; + + pr_debug("count:%zu\n", count); + + /* + * Allocate memory to read from device + */ + newdata = kzalloc(count, GFP_KERNEL); + while (!newdata) + newdata = kzalloc(count, GFP_KERNEL); + + ret = COPY_FROM_USER(newdata, buf, count); + + /* + * Free memory + */ + kfree(newdata); + + pr_debug("COPT_FROM_USER count:%zu\n", count); + + return count; +} + + + +/* + * Update the sarray with new random numbers + */ +void update_sarray(int CC) +{ + int16_t C; + int64_t X, Y, Z1, Z2, Z3; + + /* + * This function must run exclusivly + */ + while (mutex_lock_interruptible(&UpArr_mutex)) + ; + + PRNGCount++; + + Z1 = xorshft64(); + Z2 = xorshft64(); + Z3 = xorshft64(); + if ((Z1 & 1) == 0) { + pr_debug("0\n"); + for (C = 0; C < (arr_RND_SIZE - 4) ; C = C + 4) { + X = xorshft128(); + Y = xorshft128(); + sarr_RND[CC][C] = sarr_RND[CC][C + 1] ^ X ^ Y; + sarr_RND[CC][C + 1] = 
sarr_RND[CC][C + 2] ^ Y ^ Z1; + sarr_RND[CC][C + 2] = sarr_RND[CC][C + 3] ^ X ^ Z2; + sarr_RND[CC][C + 3] = X ^ Y ^ Z3; + } + } else { + pr_debug("1\n"); + for (C = 0; C < (arr_RND_SIZE - 4) ; C = C + 4) { + X = xorshft128(); + Y = xorshft128(); + sarr_RND[CC][C] = sarr_RND[CC][C + 1] ^ X ^ Z2; + sarr_RND[CC][C + 1] = sarr_RND[CC][C + 2] ^ X ^ Y; + sarr_RND[CC][C + 2] = sarr_RND[CC][C + 3] ^ Y ^ Z3; + sarr_RND[CC][C + 3] = X ^ Y ^ Z1; + } + } + + mutex_unlock(&UpArr_mutex); + + pr_debug("CC:%d, X:%llu, Y:%llu, Z1:%llu, Z2:%llu, Z3:%llu,\n", + CC, X, Y, Z1, Z2, Z3); +} +EXPORT_SYMBOL(sdevice_write); + +/* + * Seeding the xorshft's + */ +void seed_PRND_s0(void) +{ + KTIME_GET_NS(&ts); + s[0] = (s[0] << 31) ^ (uint64_t)ts.tv_nsec; + pr_debug("x:%llu, s[0]:%llu, s[1]:%llu\n", + x, s[0], s[1]); +} + +void seed_PRND_s1(void) +{ + KTIME_GET_NS(&ts); + s[1] = (s[1] << 24) ^ (uint64_t)ts.tv_nsec; + pr_debug("x:%llu, s[0]:%llu, s[1]:%llu\n", + x, s[0], s[1]); +} + +void seed_PRND_x(void) +{ + KTIME_GET_NS(&ts); + x = (x << 32) ^ (uint64_t)ts.tv_nsec; + pr_debug("x:%llu, s[0]:%llu, s[1]:%llu\n", + x, s[0], s[1]); +} + +/* + * PRNG functions + */ +uint64_t xorshft64(void) +{ + uint64_t z = (x += 0x9E3779B97F4A7C15ULL); + + z = (z ^ (z >> 30)) * 0xBF58476D1CE4E5B9ULL; + z = (z ^ (z >> 27)) * 0x94D049BB133111EBULL; + return z ^ (z >> 31); +} + +uint64_t xorshft128(void) +{ + uint64_t s1 = s[0]; + const uint64_t s0 = s[1]; + + s[0] = s0; + s1 ^= s1 << 23; + return (s[1] = (s1 ^ s0 ^ (s1 >> 17) ^ (s0 >> 26))) + s0; +} + +/* + * This function returns the next sarray to use/read. + */ +int nextbuffer(void) +{ + uint8_t position = (int)((CC_buffer_position * 4) / 64); + uint8_t roll = CC_buffer_position % 16; + uint8_t nextbuffer = (sarr_RND[num_arr_RND][position] >> (roll * 4)) + & (num_arr_RND - 1); + + pr_debug("raw:%lld", + "position:%d", + "roll:%d", + "%s:%d", + "CC_buffer_position:%d\n", + sarr_RND[num_arr_RND][position], + position, + roll, + __func__, + nextbuffer, + CC_buffer_position); + + while (mutex_lock_interruptible(&UpPos_mutex)) + ; + CC_buffer_position++; + mutex_unlock(&UpPos_mutex); + + if (CC_buffer_position >= 1021) { + while (mutex_lock_interruptible(&UpPos_mutex)) + ; + CC_buffer_position = 0; + mutex_unlock(&UpPos_mutex); + update_sarray(num_arr_RND); + } + + return nextbuffer; +} + +/* + * The Kernel thread doing background tasks. + */ +int work_thread(void *data) +{ + int interation = 0; + + while (!kthread_should_stop()) { + if (interation <= num_arr_RND) + update_sarray(interation); + else if (interation == num_arr_RND + 1) + seed_PRND_s0(); + else if (interation == num_arr_RND + 2) + seed_PRND_s1(); + else if (interation == num_arr_RND + 3) + seed_PRND_x(); + else + interation = -1; + + interation++; + ssleep(THREAD_SLEEP_VALUE); + } + + do_exit(0); + return 0; +} + +/* + * This function is called when reading /proc filesystem + */ +int proc_read(struct seq_file *m, void *v) +{ + seq_puts(m, "-----------------------:----------------------\n"); + seq_puts(m, "Device : /dev/"sDEVICE_NAME"\n"); + seq_puts(m, "Module version : "AppVERSION"\n"); + seq_printf(m, "Current open count : %d\n", sdev_open); + seq_printf(m, "Total open count : %d\n", sdev_openCount); + seq_printf(m, "Total K bytes : %llu\n", PRNGCount / 2); + if (PAID == 0) { + seq_puts(m, "-----------------------:----------------------\n"); + seq_puts(m, "Please support my work and efforts contributing\n"); + seq_puts(m, "to the Linux community. 
A $25 payment per\n"); + seq_puts(m, "server would be highly appreciated.\n"); + } + seq_puts(m, "-----------------------:----------------------\n"); + seq_puts(m, "Author : Jonathan Senkerik\n"); + seq_puts(m, "Website : http://www.jintegrate.co\n"); + seq_puts(m, "github : http://github.com/josenk/srandom\n"); + if (PAID == 0) { + seq_puts(m, "Paypal : josenk@jintegrate.co\n"); + seq_puts(m, "Bitcoin : 1GEtkAm97DphwJbJTPyywv6NbqJKLMtDzA\n"); + seq_puts(m, "Commercial Invoice : Avail on request.\n"); + } + return 0; +} + +int proc_open(struct inode *inode, struct file *file) +{ + return single_open(file, proc_read, NULL); +} + +module_init(mod_init); +module_exit(mod_exit); + +/* + * Module license information + */ +MODULE_LICENSE("GPL"); +MODULE_AUTHOR("Jonathan Senkerik "); +MODULE_DESCRIPTION("Improved random number generator."); +MODULE_SUPPORTED_DEVICE("/dev/srandom"); diff --git a/drivers/devfreq/Kconfig b/drivers/devfreq/Kconfig index a0a01fc93c59..2a29883b7d15 100644 --- a/drivers/devfreq/Kconfig +++ b/drivers/devfreq/Kconfig @@ -278,12 +278,6 @@ config DEVFREQ_MSM_LLCCBW_DDR_BOOST_FREQ help Boost frequency for the MSM DDR bus. -config DEVFREQ_MSM_CPU_LLCCBW_BOOST_FREQ - int "Boost freq for cpu-llcc device" - default "0" - help - Boost frequency for the MSM DDR bus. - endif source "drivers/devfreq/event/Kconfig" diff --git a/drivers/devfreq/devfreq_boost.c b/drivers/devfreq/devfreq_boost.c index c7cc64cda721..d909895e103e 100644 --- a/drivers/devfreq/devfreq_boost.c +++ b/drivers/devfreq/devfreq_boost.c @@ -50,9 +50,7 @@ static void devfreq_max_unboost(struct work_struct *work); static struct df_boost_drv df_boost_drv_g __read_mostly = { BOOST_DEV_INIT(df_boost_drv_g, DEVFREQ_MSM_LLCCBW_DDR, - CONFIG_DEVFREQ_MSM_LLCCBW_DDR_BOOST_FREQ), - BOOST_DEV_INIT(df_boost_drv_g, DEVFREQ_MSM_CPU_LLCCBW, - CONFIG_DEVFREQ_MSM_CPU_LLCCBW_BOOST_FREQ) + CONFIG_DEVFREQ_MSM_LLCCBW_DDR_BOOST_FREQ) }; static void __devfreq_boost_kick(struct boost_dev *b) diff --git a/drivers/devfreq/devfreq_devbw.c b/drivers/devfreq/devfreq_devbw.c index 159ff6f3beb7..49ab1fafe09e 100644 --- a/drivers/devfreq/devfreq_devbw.c +++ b/drivers/devfreq/devfreq_devbw.c @@ -182,9 +182,6 @@ int devfreq_add_devbw(struct device *dev) if (!strcmp(dev_name(dev), "soc:qcom,cpu-llcc-ddr-bw")) devfreq_register_boost_device(DEVFREQ_MSM_LLCCBW_DDR, d->df); - - if (!strcmp(dev_name(dev), "soc:qcom,cpu-cpu-llcc-bw")) - devfreq_register_boost_device(DEVFREQ_MSM_CPU_LLCCBW, d->df); return 0; } diff --git a/drivers/gpu/drm/drm_atomic.c b/drivers/gpu/drm/drm_atomic.c index 665b2a76a242..5b5916bee559 100644 --- a/drivers/gpu/drm/drm_atomic.c +++ b/drivers/gpu/drm/drm_atomic.c @@ -2580,7 +2580,6 @@ static int __drm_mode_atomic_ioctl(struct drm_device *dev, void *data, return -EINVAL; if (!(arg->flags & DRM_MODE_ATOMIC_TEST_ONLY)) { - devfreq_boost_kick(DEVFREQ_MSM_CPU_LLCCBW); devfreq_boost_kick(DEVFREQ_MSM_LLCCBW_DDR); } diff --git a/drivers/soc/qcom/subsys-pil-tz.c b/drivers/soc/qcom/subsys-pil-tz.c index 68ea839c6355..896c555f7b54 100644 --- a/drivers/soc/qcom/subsys-pil-tz.c +++ b/drivers/soc/qcom/subsys-pil-tz.c @@ -636,7 +636,7 @@ static int pil_init_image_trusted(struct pil_desc *pil, return -ENOMEM; } - memcpy(mdata_buf, metadata, size); + memcpy_toio((void __iomem *)mdata_buf, metadata, size); desc.args[0] = d->pas_id; desc.args[1] = mdata_phys; diff --git a/drivers/usb/core/hub.c b/drivers/usb/core/hub.c index 914b23179421..28fd247d7ffb 100644 --- a/drivers/usb/core/hub.c +++ b/drivers/usb/core/hub.c @@ -47,6 +47,8 @@ #define 
USB_TP_TRANSMISSION_DELAY 40 /* ns */ #define USB_TP_TRANSMISSION_DELAY_MAX 65535 /* ns */ +extern int deny_new_usb; + /* Protect struct usb_device->state and ->children members * Note: Both are also protected by ->dev.sem, except that ->state can * change to USB_STATE_NOTATTACHED even when the semaphore isn't held. */ @@ -4998,6 +5000,11 @@ static void hub_port_connect(struct usb_hub *hub, int port1, u16 portstatus, goto done; return; } + if (deny_new_usb) { + dev_err(&port_dev->dev, "denied insert of USB device on port %d\n", port1); + goto done; + } + if (hub_is_superspeed(hub->hdev)) unit_load = 150; else diff --git a/include/linux/devfreq_boost.h b/include/linux/devfreq_boost.h index 3d17f41ba4a4..7897cb040833 100644 --- a/include/linux/devfreq_boost.h +++ b/include/linux/devfreq_boost.h @@ -9,7 +9,6 @@ enum df_device { DEVFREQ_MSM_LLCCBW_DDR, - DEVFREQ_MSM_CPU_LLCCBW, DEVFREQ_MAX }; diff --git a/include/linux/linkage.h b/include/linux/linkage.h index d7618c41f74c..f3ae8f3dea2c 100644 --- a/include/linux/linkage.h +++ b/include/linux/linkage.h @@ -75,25 +75,58 @@ #ifdef __ASSEMBLY__ +/* SYM_T_FUNC -- type used by assembler to mark functions */ +#ifndef SYM_T_FUNC +#define SYM_T_FUNC STT_FUNC +#endif + +/* SYM_T_OBJECT -- type used by assembler to mark data */ +#ifndef SYM_T_OBJECT +#define SYM_T_OBJECT STT_OBJECT +#endif + +/* SYM_T_NONE -- type used by assembler to mark entries of unknown type */ +#ifndef SYM_T_NONE +#define SYM_T_NONE STT_NOTYPE +#endif + +/* SYM_A_* -- align the symbol? */ +#define SYM_A_ALIGN ALIGN +#define SYM_A_NONE /* nothing */ + +/* SYM_L_* -- linkage of symbols */ +#define SYM_L_GLOBAL(name) .globl name +#define SYM_L_WEAK(name) .weak name +#define SYM_L_LOCAL(name) /* nothing */ + #ifndef LINKER_SCRIPT #define ALIGN __ALIGN #define ALIGN_STR __ALIGN_STR -#ifndef ENTRY -#define ENTRY(name) \ +/* === DEPRECATED annotations === */ + +#ifndef GLOBAL +/* deprecated, use SYM_DATA*, SYM_ENTRY, or similar */ +#define GLOBAL(name) \ .globl name ASM_NL \ - ALIGN ASM_NL \ name: #endif + +#ifndef ENTRY +/* deprecated, use SYM_FUNC_START */ +#define ENTRY(name) \ + SYM_FUNC_START(name) +#endif #endif /* LINKER_SCRIPT */ #ifndef WEAK +/* deprecated, use SYM_FUNC_START_WEAK* */ #define WEAK(name) \ - .weak name ASM_NL \ - name: + SYM_FUNC_START_WEAK(name) #endif #ifndef END +/* deprecated, use SYM_FUNC_END, SYM_DATA_END, or SYM_END */ #define END(name) \ .size name, .-name #endif @@ -103,11 +136,214 @@ * static analysis tools such as stack depth analyzer. */ #ifndef ENDPROC +/* deprecated, use SYM_FUNC_END */ #define ENDPROC(name) \ - .type name, @function ASM_NL \ - END(name) + SYM_FUNC_END(name) +#endif + +/* === generic annotations === */ + +/* SYM_ENTRY -- use only if you have to for non-paired symbols */ +#ifndef SYM_ENTRY +#define SYM_ENTRY(name, linkage, align...) \ + linkage(name) ASM_NL \ + align ASM_NL \ + name: +#endif + +/* SYM_START -- use only if you have to */ +#ifndef SYM_START +#define SYM_START(name, linkage, align...) \ + SYM_ENTRY(name, linkage, align) +#endif + +/* SYM_END -- use only if you have to */ +#ifndef SYM_END +#define SYM_END(name, sym_type) \ + .type name sym_type ASM_NL \ + .size name, .-name +#endif + +/* === code annotations === */ + +/* + * FUNC -- C-like functions (proper stack frame etc.) + * CODE -- non-C code (e.g. irq handlers with different, special stack etc.) + * + * Objtool validates stack for FUNC, but not for CODE. 
+ * Objtool generates debug info for both FUNC & CODE, but needs special + * annotations for each CODE's start (to describe the actual stack frame). + * + * ALIAS -- does not generate debug info -- the aliased function will + */ + +/* SYM_INNER_LABEL_ALIGN -- only for labels in the middle of code */ +#ifndef SYM_INNER_LABEL_ALIGN +#define SYM_INNER_LABEL_ALIGN(name, linkage) \ + .type name SYM_T_NONE ASM_NL \ + SYM_ENTRY(name, linkage, SYM_A_ALIGN) +#endif + +/* SYM_INNER_LABEL -- only for labels in the middle of code */ +#ifndef SYM_INNER_LABEL +#define SYM_INNER_LABEL(name, linkage) \ + .type name SYM_T_NONE ASM_NL \ + SYM_ENTRY(name, linkage, SYM_A_NONE) +#endif + +/* + * SYM_FUNC_START_LOCAL_ALIAS -- use where there are two local names for one + * function + */ +#ifndef SYM_FUNC_START_LOCAL_ALIAS +#define SYM_FUNC_START_LOCAL_ALIAS(name) \ + SYM_START(name, SYM_L_LOCAL, SYM_A_ALIGN) +#endif + +/* + * SYM_FUNC_START_ALIAS -- use where there are two global names for one + * function + */ +#ifndef SYM_FUNC_START_ALIAS +#define SYM_FUNC_START_ALIAS(name) \ + SYM_START(name, SYM_L_GLOBAL, SYM_A_ALIGN) #endif +/* SYM_FUNC_START -- use for global functions */ +#ifndef SYM_FUNC_START +/* + * The same as SYM_FUNC_START_ALIAS, but we will need to distinguish these two + * later. + */ +#define SYM_FUNC_START(name) \ + SYM_START(name, SYM_L_GLOBAL, SYM_A_ALIGN) +#endif + +/* SYM_FUNC_START_NOALIGN -- use for global functions, w/o alignment */ +#ifndef SYM_FUNC_START_NOALIGN +#define SYM_FUNC_START_NOALIGN(name) \ + SYM_START(name, SYM_L_GLOBAL, SYM_A_NONE) +#endif + +/* SYM_FUNC_START_LOCAL -- use for local functions */ +#ifndef SYM_FUNC_START_LOCAL +/* the same as SYM_FUNC_START_LOCAL_ALIAS, see comment near SYM_FUNC_START */ +#define SYM_FUNC_START_LOCAL(name) \ + SYM_START(name, SYM_L_LOCAL, SYM_A_ALIGN) +#endif + +/* SYM_FUNC_START_LOCAL_NOALIGN -- use for local functions, w/o alignment */ +#ifndef SYM_FUNC_START_LOCAL_NOALIGN +#define SYM_FUNC_START_LOCAL_NOALIGN(name) \ + SYM_START(name, SYM_L_LOCAL, SYM_A_NONE) +#endif + +/* SYM_FUNC_START_WEAK -- use for weak functions */ +#ifndef SYM_FUNC_START_WEAK +#define SYM_FUNC_START_WEAK(name) \ + SYM_START(name, SYM_L_WEAK, SYM_A_ALIGN) +#endif + +/* SYM_FUNC_START_WEAK_NOALIGN -- use for weak functions, w/o alignment */ +#ifndef SYM_FUNC_START_WEAK_NOALIGN +#define SYM_FUNC_START_WEAK_NOALIGN(name) \ + SYM_START(name, SYM_L_WEAK, SYM_A_NONE) +#endif + +/* SYM_FUNC_END_ALIAS -- the end of LOCAL_ALIASed or ALIASed function */ +#ifndef SYM_FUNC_END_ALIAS +#define SYM_FUNC_END_ALIAS(name) \ + SYM_END(name, SYM_T_FUNC) +#endif + +/* + * SYM_FUNC_END -- the end of SYM_FUNC_START_LOCAL, SYM_FUNC_START, + * SYM_FUNC_START_WEAK, ... 
+ */ +#ifndef SYM_FUNC_END +/* the same as SYM_FUNC_END_ALIAS, see comment near SYM_FUNC_START */ +#define SYM_FUNC_END(name) \ + SYM_END(name, SYM_T_FUNC) +#endif + +/* SYM_CODE_START -- use for non-C (special) functions */ +#ifndef SYM_CODE_START +#define SYM_CODE_START(name) \ + SYM_START(name, SYM_L_GLOBAL, SYM_A_ALIGN) +#endif + +/* SYM_CODE_START_NOALIGN -- use for non-C (special) functions, w/o alignment */ +#ifndef SYM_CODE_START_NOALIGN +#define SYM_CODE_START_NOALIGN(name) \ + SYM_START(name, SYM_L_GLOBAL, SYM_A_NONE) #endif +/* SYM_CODE_START_LOCAL -- use for local non-C (special) functions */ +#ifndef SYM_CODE_START_LOCAL +#define SYM_CODE_START_LOCAL(name) \ + SYM_START(name, SYM_L_LOCAL, SYM_A_ALIGN) #endif + +/* + * SYM_CODE_START_LOCAL_NOALIGN -- use for local non-C (special) functions, + * w/o alignment + */ +#ifndef SYM_CODE_START_LOCAL_NOALIGN +#define SYM_CODE_START_LOCAL_NOALIGN(name) \ + SYM_START(name, SYM_L_LOCAL, SYM_A_NONE) +#endif + +/* SYM_CODE_END -- the end of SYM_CODE_START_LOCAL, SYM_CODE_START, ... */ +#ifndef SYM_CODE_END +#define SYM_CODE_END(name) \ + SYM_END(name, SYM_T_NONE) +#endif + +/* === data annotations === */ + +/* SYM_DATA_START -- global data symbol */ +#ifndef SYM_DATA_START +#define SYM_DATA_START(name) \ + SYM_START(name, SYM_L_GLOBAL, SYM_A_NONE) +#endif + +/* SYM_DATA_START -- local data symbol */ +#ifndef SYM_DATA_START_LOCAL +#define SYM_DATA_START_LOCAL(name) \ + SYM_START(name, SYM_L_LOCAL, SYM_A_NONE) +#endif + +/* SYM_DATA_END -- the end of SYM_DATA_START symbol */ +#ifndef SYM_DATA_END +#define SYM_DATA_END(name) \ + SYM_END(name, SYM_T_OBJECT) +#endif + +/* SYM_DATA_END_LABEL -- the labeled end of SYM_DATA_START symbol */ +#ifndef SYM_DATA_END_LABEL +#define SYM_DATA_END_LABEL(name, linkage, label) \ + linkage(label) ASM_NL \ + .type label SYM_T_OBJECT ASM_NL \ + label: \ + SYM_END(name, SYM_T_OBJECT) +#endif + +/* SYM_DATA -- start+end wrapper around simple global data */ +#ifndef SYM_DATA +#define SYM_DATA(name, data...) \ + SYM_DATA_START(name) ASM_NL \ + data ASM_NL \ + SYM_DATA_END(name) +#endif + +/* SYM_DATA_LOCAL -- start+end wrapper around simple local data */ +#ifndef SYM_DATA_LOCAL +#define SYM_DATA_LOCAL(name, data...) 
\ + SYM_DATA_START_LOCAL(name) ASM_NL \ + data ASM_NL \ + SYM_DATA_END(name) +#endif + +#endif /* __ASSEMBLY__ */ + +#endif /* _LINUX_LINKAGE_H */ diff --git a/include/linux/srandom.h b/include/linux/srandom.h new file mode 100644 index 000000000000..8e4f5039c249 --- /dev/null +++ b/include/linux/srandom.h @@ -0,0 +1,4 @@ +#include +extern const struct file_operations sfops; +extern ssize_t sdevice_read(struct file *, char *, size_t, loff_t *); +extern ssize_t sdevice_write(struct file *, const char *, size_t, loff_t *); diff --git a/include/linux/vmpressure.h b/include/linux/vmpressure.h index 31bffa69e864..b57d2f986994 100644 --- a/include/linux/vmpressure.h +++ b/include/linux/vmpressure.h @@ -26,6 +26,9 @@ struct vmpressure { struct mutex events_lock; struct work_struct work; + + atomic_long_t users; + rwlock_t users_lock; }; struct mem_cgroup; @@ -36,6 +39,8 @@ extern void vmpressure(gfp_t gfp, struct mem_cgroup *memcg, bool tree, unsigned long scanned, unsigned long reclaimed, int order); extern void vmpressure_prio(gfp_t gfp, struct mem_cgroup *memcg, int prio); +extern bool vmpressure_inc_users(int order); +extern void vmpressure_dec_users(void); #ifdef CONFIG_MEMCG extern void vmpressure_init(struct vmpressure *vmpr); diff --git a/include/linux/workqueue.h b/include/linux/workqueue.h index f3d74b92f35d..58429638d8eb 100644 --- a/include/linux/workqueue.h +++ b/include/linux/workqueue.h @@ -617,7 +617,7 @@ static inline void flush_scheduled_work(void) static inline bool schedule_delayed_work_on(int cpu, struct delayed_work *dwork, unsigned long delay) { - return queue_delayed_work_on(cpu, system_wq, dwork, delay); + return queue_delayed_work_on(cpu, system_power_efficient_wq, dwork, delay); } /** @@ -631,7 +631,7 @@ static inline bool schedule_delayed_work_on(int cpu, struct delayed_work *dwork, static inline bool schedule_delayed_work(struct delayed_work *dwork, unsigned long delay) { - return queue_delayed_work(system_wq, dwork, delay); + return queue_delayed_work(system_power_efficient_wq, dwork, delay); } /** diff --git a/kernel/cgroup/cgroup-v1.c b/kernel/cgroup/cgroup-v1.c index 202a30c164d9..ef846ce2741f 100644 --- a/kernel/cgroup/cgroup-v1.c +++ b/kernel/cgroup/cgroup-v1.c @@ -545,7 +545,6 @@ static ssize_t __cgroup1_procs_write(struct kernfs_open_file *of, !memcmp(of->kn->parent->name, "top-app", sizeof("top-app")) && is_zygote_pid(task->parent->pid)) { devfreq_boost_kick_max(DEVFREQ_MSM_LLCCBW_DDR, 500); - devfreq_boost_kick_max(DEVFREQ_MSM_CPU_LLCCBW, 500); } out_finish: diff --git a/kernel/fork.c b/kernel/fork.c index 8033d3eee62f..9c74872c50f2 100644 --- a/kernel/fork.c +++ b/kernel/fork.c @@ -2384,7 +2384,6 @@ long _do_fork(unsigned long clone_flags, /* Boost CPU to the max for 150 ms when userspace launches an app */ if (is_zygote_pid(current->pid)) { devfreq_boost_kick_max(DEVFREQ_MSM_LLCCBW_DDR, 150); - devfreq_boost_kick_max(DEVFREQ_MSM_CPU_LLCCBW, 150); } /* diff --git a/kernel/sysctl.c b/kernel/sysctl.c index fb46651b4913..133915f160f4 100644 --- a/kernel/sysctl.c +++ b/kernel/sysctl.c @@ -102,6 +102,10 @@ #if defined(CONFIG_SYSCTL) /* External variables not in a header file. 
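The deny_new_usb flag checked in the hub_port_connect() change earlier is defined and exported as a sysctl in this kernel/sysctl.c hunk. Assuming the conventional /proc/sys/kernel mount point for kern_table entries, a hypothetical user-space helper to flip it could look like the following; the program and hard-coded path are illustrations, not part of the patch::

    #include <stdio.h>

    int main(void)
    {
            FILE *f = fopen("/proc/sys/kernel/deny_new_usb", "w");

            if (!f) {
                    perror("fopen");
                    return 1;
            }
            /* 1 = refuse newly plugged USB devices, 0 = default behaviour */
            if (fputs("1\n", f) == EOF) {
                    perror("fputs");
                    fclose(f);
                    return 1;
            }
            fclose(f);
            return 0;
    }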
*/ +#ifdef CONFIG_USB +int deny_new_usb __read_mostly = 0; +EXPORT_SYMBOL(deny_new_usb); +#endif extern int suid_dumpable; #ifdef CONFIG_COREDUMP extern int core_uses_pid; @@ -1202,6 +1206,17 @@ static struct ctl_table kern_table[] = { .extra1 = &zero, .extra2 = &two, }, +#endif +#ifdef CONFIG_USB + { + .procname = "deny_new_usb", + .data = &deny_new_usb, + .maxlen = sizeof(int), + .mode = 0644, + .proc_handler = proc_dointvec_minmax_sysadmin, + .extra1 = &zero, + .extra2 = &one, + }, #endif { .procname = "ngroups_max", diff --git a/lib/string.c b/lib/string.c index f7f7770444bf..211f30bba3c6 100644 --- a/lib/string.c +++ b/lib/string.c @@ -32,6 +32,23 @@ #include #include +#define BYTES_LONG sizeof(long) +#define WORD_MASK (BYTES_LONG - 1) +#define MIN_THRESHOLD (BYTES_LONG * 2) + +/* convenience union to avoid cast between different pointer types */ +union types { + u8 *as_u8; + unsigned long *as_ulong; + uintptr_t as_uptr; +}; + +union const_types { + const u8 *as_u8; + const unsigned long *as_ulong; + uintptr_t as_uptr; +}; + #ifndef __HAVE_ARCH_STRNCASECMP /** * strncasecmp - Case insensitive, length-limited string comparison @@ -751,10 +768,38 @@ EXPORT_SYMBOL(__sysfs_match_string); */ void *memset(void *s, int c, size_t count) { - char *xs = s; + union types dest = { .as_u8 = s }; + + if (count >= MIN_THRESHOLD) { + unsigned long cu = (unsigned long)c; + /* Compose an ulong with 'c' repeated 4/8 times */ +#ifdef CONFIG_ARCH_HAS_FAST_MULTIPLIER + cu *= 0x0101010101010101UL; +#else + cu |= cu << 8; + cu |= cu << 16; + /* Suppress warning on 32 bit machines */ + cu |= (cu << 16) << 16; +#endif + if (!IS_ENABLED(CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS)) { + /* + * Fill the buffer one byte at time until + * the destination is word aligned. + */ + for (; count && dest.as_uptr & WORD_MASK; count--) + *dest.as_u8++ = c; + } + + /* Copy using the largest size allowed */ + for (; count >= BYTES_LONG; count -= BYTES_LONG) + *dest.as_ulong++ = cu; + } + + /* copy the remainder */ while (count--) - *xs++ = c; + *dest.as_u8++ = c; + return s; } EXPORT_SYMBOL(memset); @@ -848,6 +893,13 @@ EXPORT_SYMBOL(memset64); #endif #ifndef __HAVE_ARCH_MEMCPY + +#ifdef __BIG_ENDIAN +#define MERGE_UL(h, l, d) ((h) << ((d) * 8) | (l) >> ((BYTES_LONG - (d)) * 8)) +#else +#define MERGE_UL(h, l, d) ((h) >> ((d) * 8) | (l) << ((BYTES_LONG - (d)) * 8)) +#endif + /** * memcpy - Copy one area of memory to another * @dest: Where to copy to @@ -859,14 +911,64 @@ EXPORT_SYMBOL(memset64); */ void *memcpy(void *dest, const void *src, size_t count) { - char *tmp = dest; - const char *s = src; + union const_types s = { .as_u8 = src }; + union types d = { .as_u8 = dest }; + int distance = 0; + + if (!IS_ENABLED(CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS)) { + if (count < MIN_THRESHOLD) + goto copy_remainder; + /* Copy a byte at time until destination is aligned. */ + for (; d.as_uptr & WORD_MASK; count--) + *d.as_u8++ = *s.as_u8++; + + distance = s.as_uptr & WORD_MASK; + } + + if (distance) { + unsigned long last, next; + + /* + * s is distance bytes ahead of d, and d just reached + * the alignment boundary. Move s backward to word align it + * and shift data to compensate for distance, in order to do + * word-by-word copy. + */ + s.as_u8 -= distance; + + next = s.as_ulong[0]; + for (; count >= BYTES_LONG; count -= BYTES_LONG) { + last = next; + next = s.as_ulong[1]; + + d.as_ulong[0] = MERGE_UL(last, next, distance); + + d.as_ulong++; + s.as_ulong++; + } + + /* Restore s with the original offset. 
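Two tricks carry this lib/string.c rewrite: memset() replicates the fill byte across a native word (one multiply when CONFIG_ARCH_HAS_FAST_MULTIPLIER is set, shifts otherwise), and memcpy() stitches each misaligned destination word out of two neighbouring aligned source words with MERGE_UL. A little-endian, 64-bit user-space sketch of both; the sample data is made up::

    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    #define BYTES_LONG sizeof(unsigned long)
    /* Little-endian variant of the MERGE_UL macro above. */
    #define MERGE_UL(h, l, d) ((h) >> ((d) * 8) | (l) << ((BYTES_LONG - (d)) * 8))

    int main(void)
    {
            /* memset: replicate the fill byte across a word with one multiply. */
            unsigned long cu = 0xabUL * 0x0101010101010101UL;

            printf("fill word: %016lx\n", cu);            /* abababababababab */

            /* memcpy: the destination is word aligned but the source sits 3
             * bytes past a word boundary, so each output word is stitched
             * from two adjacent aligned source words. */
            const unsigned char src[17] = "ABCDEFGHIJKLMNOP";
            unsigned long last, next, out;
            unsigned int distance = 3;

            memcpy(&last, src, BYTES_LONG);               /* "ABCDEFGH" */
            memcpy(&next, src + BYTES_LONG, BYTES_LONG);  /* "IJKLMNOP" */
            out = MERGE_UL(last, next, distance);
            printf("merged: %.8s\n", (const char *)&out); /* "DEFGHIJK" */
            return 0;
    }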
*/ + s.as_u8 += distance; + } else { + /* + * If the source and dest lower bits are the same, do a simple + * 32/64 bit wide copy. + */ + for (; count >= BYTES_LONG; count -= BYTES_LONG) + *d.as_ulong++ = *s.as_ulong++; + } + +copy_remainder: while (count--) - *tmp++ = *s++; + *d.as_u8++ = *s.as_u8++; + return dest; } EXPORT_SYMBOL(memcpy); + +#undef MERGE_UL + #endif #ifndef __HAVE_ARCH_MEMMOVE @@ -880,19 +982,13 @@ EXPORT_SYMBOL(memcpy); */ void *memmove(void *dest, const void *src, size_t count) { - char *tmp; - const char *s; + if (dest < src || src + count <= dest) + return memcpy(dest, src, count); + + if (dest > src) { + const char *s = src + count; + char *tmp = dest + count; - if (dest <= src) { - tmp = dest; - s = src; - while (count--) - *tmp++ = *s++; - } else { - tmp = dest; - tmp += count; - s = src; - s += count; while (count--) *--tmp = *--s; } diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 957d9dd68964..6edc20a888fb 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -4542,6 +4542,7 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order, unsigned int cpuset_mems_cookie; int reserve_flags; bool woke_kswapd = false; + bool used_vmpressure = false; /* * We also sanity check to catch abuse of atomic reserves being used by @@ -4580,6 +4581,8 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order, atomic_long_inc(&kswapd_waiters); woke_kswapd = true; } + if (!used_vmpressure) + used_vmpressure = vmpressure_inc_users(order); wake_all_kswapds(order, gfp_mask, ac); } @@ -4674,6 +4677,8 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order, goto nopage; /* Try direct reclaim and then allocating */ + if (!used_vmpressure) + used_vmpressure = vmpressure_inc_users(order); page = __alloc_pages_direct_reclaim(gfp_mask, order, alloc_flags, ac, &did_some_progress); if (page) @@ -4788,6 +4793,8 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order, got_pg: if (woke_kswapd) atomic_long_dec(&kswapd_waiters); + if (used_vmpressure) + vmpressure_dec_users(); if (!page) warn_alloc(gfp_mask, ac->nodemask, "page allocation failure: order:%u", order); diff --git a/mm/vmpressure.c b/mm/vmpressure.c index 7cb746f9fe8d..54aeab2c7f51 100644 --- a/mm/vmpressure.c +++ b/mm/vmpressure.c @@ -217,11 +217,12 @@ static void vmpressure_work_fn(struct work_struct *work) unsigned long scanned; unsigned long reclaimed; unsigned long pressure; + unsigned long flags; enum vmpressure_levels level; bool ancestor = false; bool signalled = false; - spin_lock(&vmpr->sr_lock); + spin_lock_irqsave(&vmpr->sr_lock, flags); /* * Several contexts might be calling vmpressure(), so it is * possible that the work was rescheduled again before the old @@ -232,14 +233,14 @@ static void vmpressure_work_fn(struct work_struct *work) */ scanned = vmpr->tree_scanned; if (!scanned) { - spin_unlock(&vmpr->sr_lock); + spin_unlock_irqrestore(&vmpr->sr_lock, flags); return; } reclaimed = vmpr->tree_reclaimed; vmpr->tree_scanned = 0; vmpr->tree_reclaimed = 0; - spin_unlock(&vmpr->sr_lock); + spin_unlock_irqrestore(&vmpr->sr_lock, flags); pressure = vmpressure_calc_pressure(scanned, reclaimed); level = vmpressure_level(pressure); @@ -279,6 +280,7 @@ static void vmpressure_memcg(gfp_t gfp, struct mem_cgroup *memcg, bool critical, unsigned long reclaimed) { struct vmpressure *vmpr = memcg_to_vmpressure(memcg); + unsigned long flags; /* * If we got here with no pages scanned, then that is an indicator @@ -294,10 +296,10 @@ static void vmpressure_memcg(gfp_t gfp, struct mem_cgroup *memcg, bool critical, return; if 
(tree) { - spin_lock(&vmpr->sr_lock); + spin_lock_irqsave(&vmpr->sr_lock, flags); scanned = vmpr->tree_scanned += scanned; vmpr->tree_reclaimed += reclaimed; - spin_unlock(&vmpr->sr_lock); + spin_unlock_irqrestore(&vmpr->sr_lock, flags); if (!critical && scanned < calculate_vmpressure_win()) return; @@ -310,15 +312,15 @@ static void vmpressure_memcg(gfp_t gfp, struct mem_cgroup *memcg, bool critical, if (!memcg || memcg == root_mem_cgroup) return; - spin_lock(&vmpr->sr_lock); + spin_lock_irqsave(&vmpr->sr_lock, flags); scanned = vmpr->scanned += scanned; reclaimed = vmpr->reclaimed += reclaimed; if (!critical && scanned < calculate_vmpressure_win()) { - spin_unlock(&vmpr->sr_lock); + spin_unlock_irqrestore(&vmpr->sr_lock, flags); return; } vmpr->scanned = vmpr->reclaimed = 0; - spin_unlock(&vmpr->sr_lock); + spin_unlock_irqrestore(&vmpr->sr_lock, flags); pressure = vmpressure_calc_pressure(scanned, reclaimed); level = vmpressure_level(pressure); @@ -342,18 +344,50 @@ static void vmpressure_memcg(gfp_t gfp, struct mem_cgroup *memcg, bool critical, unsigned long reclaimed) { } #endif +bool vmpressure_inc_users(int order) +{ + struct vmpressure *vmpr = &global_vmpressure; + unsigned long flags; + + if (order > PAGE_ALLOC_COSTLY_ORDER) + return false; + + write_lock_irqsave(&vmpr->users_lock, flags); + if (atomic_long_inc_return_relaxed(&vmpr->users) == 1) { + /* Clear out stale vmpressure data when reclaim begins */ + spin_lock(&vmpr->sr_lock); + vmpr->scanned = 0; + vmpr->reclaimed = 0; + vmpr->stall = 0; + spin_unlock(&vmpr->sr_lock); + } + write_unlock_irqrestore(&vmpr->users_lock, flags); + + return true; +} + +void vmpressure_dec_users(void) +{ + struct vmpressure *vmpr = &global_vmpressure; + + /* Decrement the vmpressure user count with release semantics */ + smp_mb__before_atomic(); + atomic_long_dec(&vmpr->users); +} + static void vmpressure_global(gfp_t gfp, unsigned long scanned, bool critical, unsigned long reclaimed) { struct vmpressure *vmpr = &global_vmpressure; unsigned long pressure; unsigned long stall; + unsigned long flags; if (critical) scanned = calculate_vmpressure_win(); + spin_lock_irqsave(&vmpr->sr_lock, flags); if (scanned) { - spin_lock(&vmpr->sr_lock); vmpr->scanned += scanned; vmpr->reclaimed += reclaimed; @@ -363,17 +397,16 @@ static void vmpressure_global(gfp_t gfp, unsigned long scanned, bool critical, stall = vmpr->stall; scanned = vmpr->scanned; reclaimed = vmpr->reclaimed; - spin_unlock(&vmpr->sr_lock); - if (!critical && scanned < calculate_vmpressure_win()) + if (!critical && scanned < calculate_vmpressure_win()) { + spin_unlock_irqrestore(&vmpr->sr_lock, flags); return; + } } - - spin_lock(&vmpr->sr_lock); vmpr->scanned = 0; vmpr->reclaimed = 0; vmpr->stall = 0; - spin_unlock(&vmpr->sr_lock); + spin_unlock_irqrestore(&vmpr->sr_lock, flags); if (scanned) { pressure = vmpressure_calc_pressure(scanned, reclaimed); @@ -419,9 +452,25 @@ static void __vmpressure(gfp_t gfp, struct mem_cgroup *memcg, bool critical, void vmpressure(gfp_t gfp, struct mem_cgroup *memcg, bool tree, unsigned long scanned, unsigned long reclaimed, int order) { + struct vmpressure *vmpr = &global_vmpressure; + unsigned long flags; + if (order > PAGE_ALLOC_COSTLY_ORDER) return; + /* + * It's possible for kswapd to keep doing reclaim even though memory + * pressure isn't high anymore. We should only track vmpressure when + * there are failed memory allocations actively stuck in the page + * allocator's slow path. No failed allocations means pressure is fine. 
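The users counter introduced here gates all of vmpressure behind the question "is any allocation actually stuck in the slow path right now?". A toy user-space model of that gating; the names are made up, and the kernel code above uses atomic_long_t plus users_lock rather than C11 atomics::

    #include <stdatomic.h>
    #include <stdio.h>

    static atomic_long users;       /* allocators currently in the slow path */

    static void slowpath_enter(void) { atomic_fetch_add(&users, 1); }
    static void slowpath_exit(void)  { atomic_fetch_sub(&users, 1); }

    /* Reclaim-side sampling: only counts while someone is blocked. */
    static void reclaim_progress(long scanned, long reclaimed)
    {
            if (!atomic_load(&users))
                    return;         /* background reclaim -- ignore */
            printf("pressure sample: scanned=%ld reclaimed=%ld\n",
                   scanned, reclaimed);
    }

    int main(void)
    {
            reclaim_progress(100, 90);      /* dropped: nobody is blocked */
            slowpath_enter();
            reclaim_progress(100, 10);      /* counted */
            slowpath_exit();
            return 0;
    }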
+ */ + read_lock_irqsave(&vmpr->users_lock, flags); + if (!atomic_long_read(&vmpr->users)) { + read_unlock_irqrestore(&vmpr->users_lock, flags); + return; + } + read_unlock_irqrestore(&vmpr->users_lock, flags); + __vmpressure(gfp, memcg, false, tree, scanned, reclaimed); } @@ -568,6 +617,8 @@ void vmpressure_init(struct vmpressure *vmpr) mutex_init(&vmpr->events_lock); INIT_LIST_HEAD(&vmpr->events); INIT_WORK(&vmpr->work, vmpressure_work_fn); + atomic_long_set(&vmpr->users, 0); + rwlock_init(&vmpr->users_lock); } /**