Recursively search directories for a regex pattern

p-ranav/hypergrep

Highlights

  • Search recursively for a regex pattern using Intel Hyperscan.
  • When a git repository is detected, the repository index is searched using libgit2.
  • Similar to grep, ripgrep, ugrep, The Silver Searcher etc.
  • C++17, Multi-threading, SIMD.
  • USAGE GUIDE
  • Implementation notes here.
  • Not cross-platform. Tested on Linux.
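
The git-aware behaviour listed in the highlights can be approximated from the shell. The sketch below is only an analogue, not hypergrep's implementation: it uses the git command-line client rather than libgit2, and the helper name list_search_candidates is hypothetical. Inside a git work tree it prints the paths recorded in the index; anywhere else it falls back to a recursive find.

```shell
# list_search_candidates DIR (hypothetical helper, not part of hypergrep):
# print the files a git-aware search might consider under DIR.
list_search_candidates() {
  dir=${1:-.}
  if git -C "$dir" rev-parse --is-inside-work-tree >/dev/null 2>&1; then
    # Inside a repository: list only paths recorded in the git index
    # (paths are printed relative to DIR).
    git -C "$dir" ls-files
  else
    # Not a repository: fall back to every regular file under DIR.
    find "$dir" -type f
  fi
}
```

Untracked and ignored files are skipped in the first branch, which is one plausible reason an index-based search touches fewer files than a plain directory walk.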

Performance

The following tests compare the performance of hypergrep against ag (The Silver Searcher), ugrep, and ripgrep.

System Details

Processor: 11th Gen Intel(R) Core(TM) i9-11900KF @ 3.50GHz
Instruction Set Extensions: Intel® SSE4.1, Intel® SSE4.2, Intel® AVX2, Intel® AVX-512
Installed RAM: 32.0 GB (31.9 GB usable)
SSD: ADATA SX8200PNP
OS: Ubuntu 20.04 LTS
C++ Compiler: g++ (Ubuntu 11.1.0-1ubuntu1-20.04) 11.1.0

Vcpkg Installed Libraries

vcpkg commit: 662dbb5

Library: Version
argparse: 2.9
concurrentqueue: 1.0.3
fmt: 10.0.0
hyperscan: 5.4.2
libgit2: 1.6.4

Single Large File Search: OpenSubtitles.raw.en.txt

The following searches are performed on a single large file cached in memory (~13GB, OpenSubtitles.raw.en.gz).

Each entry lists a description, the command, and the results. Times are in seconds unless noted.

Count number of times Holmes did something
hgrep -c 'Holmes did \w'
Line Count: 27 | ag: n/a | ugrep: 1.820 | ripgrep: 1.022 | hypergrep: 0.696

Literal with Regex Suffix
hgrep -nw 'Sherlock [A-Z]\w+' en.txt
Line Count: 7882 | ag: n/a | ugrep: 1.812 | ripgrep: 1.509 | hypergrep: 0.803

Simple Literal
hgrep -nw 'Sherlock Holmes' en.txt
Line Count: 7653 | ag: 15.764 | ugrep: 1.888 | ripgrep: 1.524 | hypergrep: 0.658

Simple Literal (case insensitive)
hgrep -inw 'Sherlock Holmes' en.txt
Line Count: 7871 | ag: 15.599 | ugrep: 6.945 | ripgrep: 2.162 | hypergrep: 0.650

Alternation of Literals
hgrep -n 'Sherlock Holmes|John Watson|Irene Adler|Inspector Lestrade|Professor Moriarty' en.txt
Line Count: 10078 | ag: n/a | ugrep: 6.886 | ripgrep: 1.836 | hypergrep: 0.689

Alternation of Literals (case insensitive)
hgrep -in 'Sherlock Holmes|John Watson|Irene Adler|Inspector Lestrade|Professor Moriarty' en.txt
Line Count: 10333 | ag: n/a | ugrep: 7.029 | ripgrep: 3.940 | hypergrep: 0.770

Words surrounding a literal string
hgrep -n '\w+[\x20]+Holmes[\x20]+\w+' en.txt
Line Count: 5020 | ag: n/a | ugrep: 6m 11s | ripgrep: 1.523 | hypergrep: 0.638
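
The README does not say how the timings above were collected. As one possible way to gather comparable numbers, the sketch below runs a command several times and reports the fastest wall-clock time. The helper name best_of is hypothetical, and the use of GNU date with nanosecond output (%N) is an assumption that holds on typical Linux systems.

```shell
# best_of RUNS CMD [ARGS...] (hypothetical helper):
# run CMD RUNS times, discard its output, print the fastest time in seconds.
best_of() (
  runs=$1
  shift
  best=""
  i=0
  while [ "$i" -lt "$runs" ]; do
    start=$(date +%s.%N)
    # A search tool may exit non-zero when nothing matches; ignore that.
    "$@" >/dev/null 2>&1 || true
    end=$(date +%s.%N)
    # Keep the smaller of the previous best and this run's elapsed time.
    best=$(awk -v s="$start" -v e="$end" -v b="$best" \
      'BEGIN { t = e - s; if (b == "" || t < b) b = t; printf "%.3f", b }')
    i=$((i + 1))
  done
  echo "$best"
)
```

For example, `best_of 5 hgrep -nw 'Sherlock Holmes' en.txt` would print the best of five runs. Taking the minimum reduces noise from other processes, but it also hides cold-cache behaviour, so it suits the "cached in memory" setup described for this file.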

Git Repository Search: torvalds/linux

The following searches are performed on the entire Linux kernel source tree (after running make defconfig && make -j8). The commit used is f1fcb.

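Each entry lists a description, the command, and the results. Times are in seconds.

Simple Literal
hgrep -nw 'PM_RESUME'
Line Count: 9 | ag: 2.807 | ugrep: 0.316 | ripgrep: 0.147 | hypergrep: 0.140

Simple Literal (case insensitive)
hgrep -niw 'PM_RESUME'
Line Count: 39 | ag: 2.904 | ugrep: 0.435 | ripgrep: 0.149 | hypergrep: 0.141

Regex with Literal Suffix
hgrep -nw '[A-Z]+_SUSPEND'
Line Count: 536 | ag: 3.080 | ugrep: 1.452 | ripgrep: 0.148 | hypergrep: 0.143

Alternation of four literals
hgrep -nw '(ERR_SYS|PME_TURN_OFF|LINK_REQ_RST|CFG_BME_EVT)'
Line Count: 16 | ag: 3.085 | ugrep: 0.410 | ripgrep: 0.153 | hypergrep: 0.146

Unicode Greek
hgrep -n '\p{Greek}'
Line Count: 111 | ag: 3.762 | ugrep: 0.484 | ripgrep: 0.345 | hypergrep: 0.146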

Git Repository Search: apple/swift

The following searches are performed on the entire Apple Swift source tree. The commit used is 3865b.

Each entry lists a description, the command, and the results. Times are in seconds.

Function/Struct/Enum declaration followed by a valid identifier and opening parenthesis
hgrep -n '(func|struct|enum)\s+[A-Za-z_][A-Za-z0-9_]*\s*\('
Line Count: 59026 | ag: 1.148 | ugrep: 0.954 | ripgrep: 0.154 | hypergrep: 0.090

Words starting with alphabetic characters followed by at least 2 digits
hgrep -nw '[A-Za-z]+\d{2,}'
Line Count: 127858 | ag: 1.169 | ugrep: 1.238 | ripgrep: 0.156 | hypergrep: 0.095

Words starting with an uppercase letter, followed by alphanumeric characters and/or underscores
hgrep -nw '[A-Z][a-zA-Z0-9_]*'
Line Count: 2012372 | ag: 3.131 | ugrep: 2.598 | ripgrep: 0.550 | hypergrep: 0.482

Guard let statement followed by valid identifier
hgrep -n 'guard\s+let\s+[a-zA-Z_][a-zA-Z0-9_]*\s*=\s*\w+'
Line Count: 839 | ag: 0.828 | ugrep: 0.174 | ripgrep: 0.054 | hypergrep: 0.047

Directory Search: /usr

The following searches are performed on the /usr directory.

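Each entry lists a description, the command, and the results. Times are in seconds.

Any HTTPS or FTP URL
hgrep "(https?|ftp)://[^\s/$.?#].[^\s]*"
Line Count: 13682 | ag: 4.597 | ugrep: 2.894 | ripgrep: 0.305 | hypergrep: 0.171

Any IPv4 IP address
hgrep -w "(?:\d{1,3}\.){3}\d{1,3}"
Line Count: 12643 | ag: 4.727 | ugrep: 2.340 | ripgrep: 0.324 | hypergrep: 0.166

Any E-mail address
hgrep -w "[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"
Line Count: 47509 | ag: 5.477 | ugrep: 37.209 | ripgrep: 0.494 | hypergrep: 0.220

Any valid date MM/DD/YYYY
hgrep "(0[1-9]|1[0-2])/(0[1-9]|[12]\d|3[01])/(19|20)\d{2}"
Line Count: 116 | ag: 4.239 | ugrep: 1.827 | ripgrep: 0.251 | hypergrep: 0.163

Count the number of HEX values
hgrep -cw "(?:0x)?[0-9A-Fa-f]+"
Line Count: 68042 | ag: 5.765 | ugrep: 28.691 | ripgrep: 1.439 | hypergrep: 0.611

Search any C/C++ file for a literal
hgrep --filter "\.(c|cpp|h|hpp)$" test
Line Count: 7355 | ag: n/a | ugrep: 0.505 | ripgrep: 0.118 | hypergrep: 0.079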

Build

Install Dependencies with vcpkg

git clone https://github.com/microsoft/vcpkg
cd vcpkg
./bootstrap-vcpkg.sh
./vcpkg install concurrentqueue fmt argparse libgit2 hyperscan

Build hypergrep using cmake and vcpkg

Clone the repository

git clone https://github.com/p-ranav/hypergrep
cd hypergrep

If cmake is older than 3.19

mkdir build
cd build
cmake -DCMAKE_TOOLCHAIN_FILE=<path_to_vcpkg>/scripts/buildsystems/vcpkg.cmake ..
make

If cmake is 3.19 or newer

Use the release preset:

export VCPKG_ROOT=<path_to_vcpkg>
cmake -B build -S . --preset release
cmake --build build
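
The two build paths above can be chosen automatically from the installed cmake version. The following is a minimal sketch under a few assumptions: the helper names version_ge and build_hypergrep are hypothetical, `sort -V` (GNU coreutils) is available, and version 3.19 itself is treated as supporting presets. It is meant to be run from the hypergrep checkout with the vcpkg path as its argument.

```shell
# version_ge A B (hypothetical helper): succeed when dotted version A >= B.
# Relies on GNU `sort -V` for version-aware ordering.
version_ge() {
  [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# build_hypergrep VCPKG_ROOT (hypothetical helper): run from the checkout.
build_hypergrep() (
  vcpkg_root=$1
  # First line of `cmake --version` looks like "cmake version 3.22.1".
  cmake_version=$(cmake --version | awk 'NR == 1 { print $3 }')
  if version_ge "$cmake_version" 3.19; then
    # Preset-based build, as in the README.
    export VCPKG_ROOT="$vcpkg_root"
    cmake -B build -S . --preset release && cmake --build build
  else
    # Older cmake: pass the vcpkg toolchain file explicitly.
    mkdir -p build && cd build &&
      cmake -DCMAKE_TOOLCHAIN_FILE="$vcpkg_root/scripts/buildsystems/vcpkg.cmake" .. &&
      make
  fi
)
```

A call such as `build_hypergrep "$HOME/vcpkg"` would then pick the matching path; the vcpkg location shown is only an example.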

Binary Portability

To build a binary that is portable across x86_64 systems, invoke cmake with the -DBUILD_PORTABLE=on option. This compiles with -march=x86-64 -mtune=generic and links with -static-libgcc -static-libstdc++, so the C++ standard library and GCC runtime are linked statically into the binary, reducing dependencies on the target system.
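
One way to check that a portable build really dropped the dynamic C++ runtime is to inspect the binary with ldd. This is a sketch rather than a documented part of the project: the helper name links_cxx_runtime_dynamically is hypothetical, and the binary path in the usage note is an assumption about where the build places the executable.

```shell
# links_cxx_runtime_dynamically BINARY (hypothetical helper):
# succeed if `ldd` lists libstdc++ or libgcc_s among the shared dependencies.
links_cxx_runtime_dynamically() {
  ldd "$1" 2>/dev/null | grep -Eq 'libstdc\+\+|libgcc_s'
}
```

For instance, `links_cxx_runtime_dynamically build/hgrep && echo "C++ runtime is still dynamic"` should print nothing for a binary built with -DBUILD_PORTABLE=on, assuming the executable ends up at build/hgrep.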