Skip to content
This repository has been archived by the owner on Dec 9, 2024. It is now read-only.

Weird behavior of TcpServerConnection #311

Open
tessob opened this issue Apr 8, 2021 · 13 comments
Open

Weird behavior of TcpServerConnection #311

tessob opened this issue Apr 8, 2021 · 13 comments

Comments

@tessob
Copy link

tessob commented Apr 8, 2021

Hello guys!

I rather need a clarification, but not a help. Currently I'm trying to run application in K8s and everything is good so far, but I have a little bit weird logs:

2021-04-08 13:51:59.890  INFO 1 --- [.IO.thread-in-0] c.h.i.server.tcp.TcpServerConnection     : [10.154.24.57]:5701 [WORKFLOW] [4.2] Connection[id=10, /10.154.24.57:5701->/10.154.1.219:15106, qualifier=null, endpoint=null, alive=false, connectionType=NONE, planeIndex=-1] closed. Reason: Connection closed by the other side
2021-04-08 13:51:59.890  INFO 1 --- [.IO.thread-in-1] c.h.i.server.tcp.TcpServerConnection     : [10.154.24.57]:5701 [WORKFLOW] [4.2] Connection[id=11, /10.154.24.57:5701->/10.154.47.197:32686, qualifier=null, endpoint=null, alive=false, connectionType=NONE, planeIndex=-1] closed. Reason: Connection closed by the other side
2021-04-08 13:51:59.890  INFO 1 --- [.IO.thread-in-2] c.h.i.server.tcp.TcpServerConnection     : [10.154.24.57]:5701 [WORKFLOW] [4.2] Connection[id=12, /10.154.24.57:5701->/10.154.5.157:10454, qualifier=null, endpoint=null, alive=false, connectionType=NONE, planeIndex=-1] closed. Reason: Connection closed by the other side
2021-04-08 13:52:00.911  INFO 1 --- [ration.thread-0] c.h.internal.cluster.ClusterService      : [10.154.24.57]:5701 [WORKFLOW] [4.2] 

Members {size:2, ver:2} [
	Member [10.154.24.57]:5701 - 94f6d56b-a72d-4d0a-82ee-a33e47053f78 this
	Member [10.154.92.185]:5701 - 1f66b061-f995-46c0-b886-dd0ee1a49dc3
]

2021-04-08 13:52:04.335  INFO 1 --- [.IO.thread-in-0] c.h.i.server.tcp.TcpServerConnection     : [10.154.24.57]:5701 [WORKFLOW] [4.2] Connection[id=13, /10.154.24.57:5701->/10.154.6.102:16924, qualifier=null, endpoint=null, alive=false, connectionType=NONE, planeIndex=-1] closed. Reason: Connection closed by the other side
2021-04-08 13:52:04.335  INFO 1 --- [.IO.thread-in-1] c.h.i.server.tcp.TcpServerConnection     : [10.154.24.57]:5701 [WORKFLOW] [4.2] Connection[id=14, /10.154.24.57:5701->/10.154.19.176:5548, qualifier=null, endpoint=null, alive=false, connectionType=NONE, planeIndex=-1] closed. Reason: Connection closed by the other side
2021-04-08 13:52:04.336  INFO 1 --- [.IO.thread-in-2] c.h.i.server.tcp.TcpServerConnection     : [10.154.24.57]:5701 [WORKFLOW] [4.2] Connection[id=15, /10.154.24.57:5701->/10.154.5.124:47016, qualifier=null, endpoint=null, alive=false, connectionType=NONE, planeIndex=-1] closed. Reason: Connection closed by the other side
2021-04-08 13:52:04.336  INFO 1 --- [.IO.thread-in-0] c.h.i.server.tcp.TcpServerConnection     : [10.154.24.57]:5701 [WORKFLOW] [4.2] Connection[id=16, /10.154.24.57:5701->/10.154.1.219:52814, qualifier=null, endpoint=null, alive=false, connectionType=NONE, planeIndex=-1] closed. Reason: Connection closed by the other side
2021-04-08 13:52:09.889  INFO 1 --- [.IO.thread-in-2] c.h.i.server.tcp.TcpServerConnection     : [10.154.24.57]:5701 [WORKFLOW] [4.2] Connection[id=18, /10.154.24.57:5701->/10.154.19.176:32638, qualifier=null, endpoint=null, alive=false, connectionType=NONE, planeIndex=-1] closed. Reason: Connection closed by the other side
2021-04-08 13:52:09.889  INFO 1 --- [.IO.thread-in-1] c.h.i.server.tcp.TcpServerConnection     : [10.154.24.57]:5701 [WORKFLOW] [4.2] Connection[id=17, /10.154.24.57:5701->/10.154.29.156:1518, qualifier=null, endpoint=null, alive=false, connectionType=NONE, planeIndex=-1] closed. Reason: Connection closed by the other side
2021-04-08 13:52:09.889  INFO 1 --- [.IO.thread-in-0] c.h.i.server.tcp.TcpServerConnection     : [10.154.24.57]:5701 [WORKFLOW] [4.2] Connection[id=19, /10.154.24.57:5701->/10.154.47.197:32712, qualifier=null, endpoint=null, alive=false, connectionType=NONE, planeIndex=-1] closed. Reason: Connection closed by the other side

It seems like current node constantly trying to establish connection with other pods "using all variety of irrelevant ports". Is it normal behavior or not? Is there a way to restrict it somehow?

My configs:

  @Bean
  public Config hazelcastConfig() {
    Config cfg = new Config();
    cfg.setClusterName("WORKFLOW");
    cfg.getNetworkConfig().setPort(5701).setPortAutoIncrement(false);
    cfg.setSerializationConfig(serializationConfig());
    cfg.setMapConfigs(getMapConfigs());
    cfg.setProperty("hazelcast.logging.type", "slf4j");

    if (isApplicationInsideKube()) {
      JoinConfig joinConfig = cfg.getNetworkConfig().getJoin();
      joinConfig.getAutoDetectionConfig().setEnabled(false);
      joinConfig.getTcpIpConfig().setEnabled(false);
      joinConfig.getMulticastConfig().setEnabled(false);
      joinConfig.getKubernetesConfig().setEnabled(true)
          .setProperty("service-name", "workflow-service")
          .setProperty("service-port", "5701");
      log.info(joinConfig.toString());
    }

    log.info(cfg.toString());
    return cfg;
  }

Best regards & Thank you in advance,

@alparslanavci
Copy link

@tessob thanks for submitting the issue. Could you please tell us more about your k8s environment? Are there any proxy/service mesh setup?

@tessob
Copy link
Author

tessob commented Apr 8, 2021

@alparslanavci no, we have no proxy/service mesh. Our Service manifest is similar to this one. There are just two pods in separate namespace now. However cluster itself is bigger.

Our JoinConfig on application start looks like this:

JoinConfig{multicastConfig=MulticastConfig [enabled=false, multicastGroup=224.2.2.3, multicastPort=54327, multicastTimeToLive=32, multicastTimeoutSeconds=2, trustedInterfaces=[], loopbackModeEnabled=true], tcpIpConfig=TcpIpConfig [enabled=false, connectionTimeoutSeconds=5, members=[], requiredMember=null], awsConfig=AliasedDiscoveryConfig{tag='aws', enabled=false, usePublicIp=false, properties={}}, gcpConfig=AliasedDiscoveryConfig{tag='gcp', enabled=false, usePublicIp=false, properties={}}, azureConfig=AliasedDiscoveryConfig{tag='azure', enabled=false, usePublicIp=false, properties={}}, kubernetesConfig=AliasedDiscoveryConfig{tag='kubernetes', enabled=true, usePublicIp=false, properties={service-name=workflow-service, service-port=5701}}, eurekaConfig=AliasedDiscoveryConfig{tag='eureka', enabled=false, usePublicIp=false, properties={}}, discoveryConfig=DiscoveryConfig{discoveryStrategyConfigs=[], discoveryServiceProvider=null, nodeFilter=null, nodeFilterClass='null'}, autoDetectionConfig=AutoDetectionConfig{enabled=false}}

and Discovery properties like this:

Kubernetes Discovery properties: { service-dns: null, service-dns-timeout: 5, service-name: workflow-service, service-port: 5701, service-label: null, service-label-value: true, namespace: XXXXX, pod-label: null, pod-label-value: null, resolve-not-ready-addresses: true, use-node-name-as-external-address: false, kubernetes-api-retries: 3, kubernetes-master: https://kubernetes.default.svc}

and there are few additional messages that I don't realy know hot to interprete:

2021-04-08 13:51:46.207  INFO 1 --- [           main] c.h.s.d.integration.DiscoveryService     : [10.154.24.57]:5701 [WORKFLOW] [4.2] Kubernetes Discovery activated with mode: KUBERNETES_API
2021-04-08 13:51:46.499  INFO 1 --- [           main] com.hazelcast.instance.impl.Node         : [10.154.24.57]:5701 [WORKFLOW] [4.2] Using Discovery SPI
2021-04-08 13:51:47.783  INFO 1 --- [           main] com.hazelcast.core.LifecycleService      : [10.154.24.57]:5701 [WORKFLOW] [4.2] [10.154.24.57]:5701 is STARTING
2021-04-08 13:51:48.603  INFO 1 --- [           main] c.h.s.d.integration.DiscoveryService     : [10.154.24.57]:5701 [WORKFLOW] [4.2] Kubernetes plugin discovered availability zone: eu-XXXXXX
2021-04-08 13:51:48.771  INFO 1 --- [           main] c.h.s.d.integration.DiscoveryService     : [10.154.24.57]:5701 [WORKFLOW] [4.2] Kubernetes plugin discovered node name: ip-XXXXXXX.compute.internal
2021-04-08 13:51:48.990  WARN 1 --- [           main] c.hazelcast.kubernetes.KubernetesClient  : Cannot fetch public IPs of Hazelcast Member PODs, you won't be able to use Hazelcast Smart Client from outside of the Kubernetes network

I hope it would be helpful.
BR,

@tessob
Copy link
Author

tessob commented Apr 9, 2021

Short update,

I have checked out morning logs with 20K attempts to find other nodes and then taken a look at source codes to comprehend where the client has got the list of irrelevant endpoints.

I reproduced this API call manually with the following curl query:

kubectl proxy --port=0 &
curl http://localhost:36891/api/v1/namespaces/XXXXXXXXX/endpoints/workflow-service

and got this output:

{
  "kind": "Endpoints",
  "apiVersion": "v1",
  "metadata": {
    "name": "workflow-service",
    "namespace": "XXXXXXXXX",
    "selfLink": "/api/v1/namespaces/XXXXXXXXX/endpoints/workflow-service",
    "uid": "64ffb585-e68a-41bc-97bf-fb54e8f3fdc1",
    "resourceVersion": "113176905",
    "creationTimestamp": "2021-04-08T11:01:06Z",
    "labels": {
      "fluxcd.io/sync-gc-mark": "sha256.VJun3ud720pxL2mM3fjAbo3OoiEfk8bQDeZA8wZQx7c"
    }
  },
  "subsets": [
    {
      "addresses": [
        {
          "ip": "10.154.24.57",
          "nodeName": "XXXXXXXXX.compute.internal",
          "targetRef": {
            "kind": "Pod",
            "namespace": "XXXXXXXXX",
            "name": "workflow-tracker-service-5769dc77cf-9rnhp",
            "uid": "7c84a0c5-7d9b-4cb9-b1d8-d6bb0d08979e",
            "resourceVersion": "113176843"
          }
        },
        {
          "ip": "10.154.92.185",
          "nodeName": "XXXXXXXXX.compute.internal",
          "targetRef": {
            "kind": "Pod",
            "namespace": "XXXXXXXXX",
            "name": "workflow-tracker-service-5769dc77cf-zf9pd",
            "uid": "29737169-ff5e-49ba-902f-1756dc4e33db",
            "resourceVersion": "113176895"
          }
        }
      ],
      "ports": [
        {
          "name": "workflow-s",
          "port": 8080,
          "protocol": "TCP"
        },
        {
          "name": "hazelcast",
          "port": 5701,
          "protocol": "TCP"
        }
      ]
    }
  ]
}

So, as I told, I have just 2 pods relevant to the Hazelcast cluster. Hence I don't really understand what is the source IPs that TcpServerConnection is trying to connect. Looks like a bug.

Best regards,

@leszko
Copy link

leszko commented Apr 9, 2021

It's hard to tell where Hazelcast tries to connect to. Some more questions:

  1. If you follow exactly the code in this guide, do you experience the same?
  2. If you deploy your application on some widely-used Kubernetes cluster (like Docker Desktop, or GKE, or EKS), do you experience the same?

@tessob
Copy link
Author

tessob commented Apr 9, 2021

Hi @leszko,

  1. We have tested as in guide, in the same k8s cluster and namespace, and everything is OK. Logs:
########################################
# JAVA_OPTS=-Djava.net.preferIPv4Stack=true -Dhazelcast.logging.type=log4j2 -Dlog4j.configurationFile=/opt/hazelcast/log4j2.properties -XX:MaxRAMPercentage=80.0 -XX:MaxGCPauseMillis=5 --add-modules java.se --add-exports java.base/jdk.internal.ref=ALL-UNNAMED --add-opens java.base/java.lang=ALL-UNNAMED --add-opens java.base/java.nio=ALL-UNNAMED --add-opens java.base/sun.nio.ch=ALL-UNNAMED --add-opens java.management/sun.management=ALL-UNNAMED --add-opens jdk.management/com.sun.management.internal=ALL-UNNAMED -Dhazelcast.config=/data/hazelcast/hazelcast.yaml -DserviceName=hazelcast -Dnamespace=XXXXXXXXX -Dhazelcast.shutdownhook.policy=GRACEFUL -Dhazelcast.shutdownhook.enabled=true -Dhazelcast.graceful.shutdown.max.wait=600 
# CLASSPATH=/opt/hazelcast/*:/opt/hazelcast/lib/*
# starting now....
########################################
+ exec java -server -Djava.net.preferIPv4Stack=true -Dhazelcast.logging.type=log4j2 -Dlog4j.configurationFile=/opt/hazelcast/log4j2.properties -XX:MaxRAMPercentage=80.0 -XX:MaxGCPauseMillis=5 --add-modules java.se --add-exports java.base/jdk.internal.ref=ALL-UNNAMED --add-opens java.base/java.lang=ALL-UNNAMED --add-opens java.base/java.nio=ALL-UNNAMED --add-opens java.base/sun.nio.ch=ALL-UNNAMED --add-opens java.management/sun.management=ALL-UNNAMED --add-opens jdk.management/com.sun.management.internal=ALL-UNNAMED -Dhazelcast.config=/data/hazelcast/hazelcast.yaml -DserviceName=hazelcast -Dnamespace=XXXXXXXXX -Dhazelcast.shutdownhook.policy=GRACEFUL -Dhazelcast.shutdownhook.enabled=true -Dhazelcast.graceful.shutdown.max.wait=600 com.hazelcast.core.server.HazelcastMemberStarter
2021-04-09 10:39:31,214 [[32m INFO[m] [[36mmain[m] [[34mc.h.i.c.AbstractConfigLocator[m]: Loading configuration '/data/hazelcast/hazelcast.yaml' from System property 'hazelcast.config'
2021-04-09 10:39:31,219 [[32m INFO[m] [[36mmain[m] [[34mc.h.i.c.AbstractConfigLocator[m]: Using configuration file at /data/hazelcast/hazelcast.yaml
2021-04-09 10:39:31,360 [[32m INFO[m] [[36mmain[m] [[34mc.h.system[m]: [10.154.60.11]:5701 [dev] [4.2] Hazelcast 4.2 (20210324 - 405cfd1) starting at [10.154.60.11]:5701
2021-04-09 10:39:31,749 [[32m INFO[m] [[36mmain[m] [[34mc.h.s.d.i.DiscoveryService[m]: [10.154.60.11]:5701 [dev] [4.2] Kubernetes Discovery properties: { service-dns: null, service-dns-timeout: 5, service-name: hazelcast, service-port: 0, service-label: null, service-label-value: true, namespace: XXXXXXXXX, pod-label: null, pod-label-value: null, resolve-not-ready-addresses: true, use-node-name-as-external-address: false, kubernetes-api-retries: 3, kubernetes-master: https://kubernetes.default.svc}
2021-04-09 10:39:31,752 [[32m INFO[m] [[36mmain[m] [[34mc.h.s.d.i.DiscoveryService[m]: [10.154.60.11]:5701 [dev] [4.2] Kubernetes Discovery activated with mode: KUBERNETES_API
2021-04-09 10:39:31,807 [[32m INFO[m] [[36mmain[m] [[34mc.h.i.i.Node[m]: [10.154.60.11]:5701 [dev] [4.2] Using Discovery SPI
2021-04-09 10:39:31,810 [[33m WARN[m] [[36mmain[m] [[34mc.h.c.CPSubsystem[m]: [10.154.60.11]:5701 [dev] [4.2] CP Subsystem is not enabled. CP data structures will operate in UNSAFE mode! Please note that UNSAFE mode will not provide strong consistency guarantees.
2021-04-09 10:39:32,024 [[32m INFO[m] [[36mmain[m] [[34mc.h.i.d.Diagnostics[m]: [10.154.60.11]:5701 [dev] [4.2] Diagnostics disabled. To enable add -Dhazelcast.diagnostics.enabled=true to the JVM arguments.
2021-04-09 10:39:32,035 [[32m INFO[m] [[36mmain[m] [[34mc.h.c.LifecycleService[m]: [10.154.60.11]:5701 [dev] [4.2] [10.154.60.11]:5701 is STARTING
2021-04-09 10:39:32,448 [[32m INFO[m] [[36mmain[m] [[34mc.h.s.d.i.DiscoveryService[m]: [10.154.60.11]:5701 [dev] [4.2] Kubernetes plugin discovered availability zone: XXXXXXXXX
2021-04-09 10:39:32,500 [[32m INFO[m] [[36mmain[m] [[34mc.h.s.d.i.DiscoveryService[m]: [10.154.60.11]:5701 [dev] [4.2] Kubernetes plugin discovered node name: XXXXXXXXX.compute.internal
2021-04-09 10:39:32,616 [[33m WARN[m] [[36mmain[m] [[34mc.h.k.KubernetesClient[m]: Cannot fetch public IPs of Hazelcast Member PODs, you won't be able to use Hazelcast Smart Client from outside of the Kubernetes network
2021-04-09 10:39:37,600 [[32m INFO[m] [[36mmain[m] [[34mc.h.i.c.ClusterService[m]: [10.154.60.11]:5701 [dev] [4.2] 

Members {size:1, ver:1} [
	Member [10.154.60.11]:5701 - 6e43ac32-e6bf-4cfb-81e0-a865d6f986b1 this
]

2021-04-09 10:39:37,611 [[32m INFO[m] [[36mmain[m] [[34mc.h.c.LifecycleService[m]: [10.154.60.11]:5701 [dev] [4.2] [10.154.60.11]:5701 is STARTED
2021-04-09 10:40:22,454 [[32m INFO[m] [[36mhz.boring_ganguly.IO.thread-in-1[m] [[34mc.h.i.s.t.TcpServerConnection[m]: [10.154.60.11]:5701 [dev] [4.2] Initialized new cluster connection between /10.154.60.11:5701 and /10.154.107.219:57097
2021-04-09 10:40:28,455 [[32m INFO[m] [[36mhz.boring_ganguly.priority-generic-operation.thread-0[m] [[34mc.h.i.c.ClusterService[m]: [10.154.60.11]:5701 [dev] [4.2] 

Members {size:2, ver:2} [
	Member [10.154.60.11]:5701 - 6e43ac32-e6bf-4cfb-81e0-a865d6f986b1 this
	Member [10.154.107.219]:5701 - 3140ed16-8832-4e69-a875-d8da79b091eb
]

2021-04-09 10:40:56,850 [[32m INFO[m] [[36mhz.boring_ganguly.IO.thread-in-0[m] [[34mc.h.i.s.t.TcpServerConnection[m]: [10.154.60.11]:5701 [dev] [4.2] Initialized new cluster connection between /10.154.60.11:5701 and /10.154.101.83:44681
2021-04-09 10:41:02,834 [[32m INFO[m] [[36mhz.boring_ganguly.priority-generic-operation.thread-0[m] [[34mc.h.i.c.ClusterService[m]: [10.154.60.11]:5701 [dev] [4.2] 

Members {size:3, ver:3} [
	Member [10.154.60.11]:5701 - 6e43ac32-e6bf-4cfb-81e0-a865d6f986b1 this
	Member [10.154.107.219]:5701 - 3140ed16-8832-4e69-a875-d8da79b091eb
	Member [10.154.101.83]:5701 - 21669cbd-71fc-4a70-8727-b6538ddd3d23
]
  1. We are in AWS EKS

In my dependencies I have only these two that can be related to Hazelcast:

    <dependency>
      <artifactId>spring-boot-starter-cache</artifactId>
      <groupId>org.springframework.boot</groupId>
    </dependency>
    <!-- https://mvnrepository.com/artifact/com.hazelcast/hazelcast-all -->
    <dependency>
      <groupId>com.hazelcast</groupId>
      <artifactId>hazelcast-all</artifactId>
      <version>4.2</version>
    </dependency>

My config is in very first message.

Thanks,

@tessob
Copy link
Author

tessob commented Apr 10, 2021

Short update,

I have tested my application in the minikube locally and everything is OK. So the root-cause of such behavior highly likely in K8s cluster and/or deployment manifests. However, for me still unclear why Hazelcast is trying to establish connection with nodes that are irrelevant to service-name.

@leszko
Copy link

leszko commented Apr 14, 2021

It may be the phone home call. Could you try disabling phone home and check if it helps?

@tessob
Copy link
Author

tessob commented Apr 15, 2021

After I disabled it I still have the same behavior in logs, so there should be another cause.

@fallthough
Copy link

Hi @tessob , were you able to find out the cause? I'm currently experiencing the same issue. I'm also in AWS EKS

@fatih-iver
Copy link

I'm also experiencing the same problem and it is also in AWS EKS.

@tessob
Copy link
Author

tessob commented Dec 25, 2022

Hi @tessob , were you able to find out the cause? I'm currently experiencing the same issue. I'm also in AWS EKS

Sorry for A late reply ))
I have migrated to Infinispan.

@AshishWat
Copy link

AshishWat commented Dec 13, 2023

Hey folks,

I am facing same issue .
We have java spring app, with embedded hazelcast, and run it in Kubernetes.
For member lookup we using DNS Lookup Discovery mode , member creation looks good but Hazelcast trying to connect some random IP and giving below warning.

java.lang.IllegalStateException: Unknown protocol: OPT
java.lang.IllegalStateException: TLS handshake header detected, but plain protocol header was expected.

@tessob Did you find the root cause of this? My issue is similar to what you have posted .

@tessob
Copy link
Author

tessob commented Jan 23, 2024

@AshishWat No, just forget about Hazelcast after code observation and moved to Infinispan. So far so good.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

No branches or pull requests

6 participants